Dataset Preparation:
Load a healthcare-related dataset (e.g., predicting the likelihood of a patient developing heart disease based on health indicators such as age, blood pressure, cholesterol, etc.).
Split the data into training (80%) and test (20%) sets.

In [2]:
import pandas as pd

# Define the file path
file_path = r'C:\Users\harik\OneDrive\Documents\NWU DOCS\ML\kritik\week 5\diabetes\Healthcare-Diabetes.csv'

# Load the dataset
diabetes_data = pd.read_csv(file_path)

# Display the first 5 rows of the dataset to verify loading
print(diabetes_data.head())


   Id  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0   1            6      148             72             35        0  33.6   
1   2            1       85             66             29        0  26.6   
2   3            8      183             64              0        0  23.3   
3   4            1       89             66             23       94  28.1   
4   5            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  
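
The preview above already shows zeros in SkinThickness and Insulin, which a plain isnull() check will not flag. The following is a minimal sketch, assuming that a zero in these columns means "not recorded" (an assumption; the source does not say so). The five rows are retyped from the printed head() output, and the names sample and zero_as_missing are hypothetical.

```python
import numpy as np
import pandas as pd

# The five rows printed by diabetes_data.head() above, retyped by hand.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in any of these columns is a placeholder, not a measurement.
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count zeros per column.
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)

# Turn the placeholders into NaN so that isnull() reports them.
cleaned = sample[zero_as_missing].replace(0, np.nan)
print(cleaned.isnull().sum())
```

Whether to impute or drop such values afterwards is a separate modelling decision; tree-based models may tolerate the zeros better than an SVM.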


In [8]:
# Step 3: Check for missing values
print("Missing values in each column:")
print(data.isnull().sum())

# Step 4: Define features (X) and target (y)
X = data[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']]
y = data['Outcome']

# Step 5: Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 6: Print the shapes of the training and testing sets
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Missing values in each column:
Id                          0
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
Shape of X_train: (2214, 8)
Shape of X_test: (554, 8)
Shape of y_train: (2214,)
Shape of y_test: (554,)
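
Before trusting test scores, it may be worth checking whether the file contains repeated rows, because a duplicate that lands in both the training and the test set can inflate accuracy. The sketch below uses a small hand-made table (toy is hypothetical data, not the real file) to show the check, followed by a stratified split that keeps the class ratio equal in both sets. Whether the real dataset contains duplicates is not established here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy table; rows 5 and 6 repeat rows 0 and 1 on purpose.
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 148, 85, 120, 99, 150],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 33.6, 26.6, 30.0, 25.0, 35.0],
    "Outcome": [1, 0, 1, 0, 1, 1, 0, 0, 0, 1],
})

# Rows that are identical to an earlier row.
n_dup = int(toy.duplicated().sum())
print(f"Duplicate rows: {n_dup}")

# Drop the repeats before splitting so no row appears on both sides.
deduped = toy.drop_duplicates()
X_toy = deduped[["Glucose", "BMI"]]
y_toy = deduped["Outcome"]

# stratify=y_toy keeps the share of each class the same in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)
print(X_tr.shape, X_te.shape)
```

On the real data the same two calls would be applied to data before the X / y definitions.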


In [5]:
# Step 1: Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Step 2: Load the dataset
data = pd.read_csv("C:/Users/harik/OneDrive/Documents/NWU DOCS/ML/kritik/week 5/diabetes/Healthcare-Diabetes.csv")

# Step 3: Define features (X) and target (y)
X = data[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 
          'DiabetesPedigreeFunction', 'Age']]  # Features
y = data['Outcome']  # Target variable

# Step 4: Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Implement and train Support Vector Machine (SVM)
svm_model = SVC()
svm_model.fit(X_train, y_train)

# Step 6: Implement and train Gradient Boosting Machine (GBM)
gbm_model = GradientBoostingClassifier()
gbm_model.fit(X_train, y_train)

# Step 7: Implement and train Random Forest Classifier
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)

# Step 8: Predict on the test set using all three models
svm_predictions = svm_model.predict(X_test)
gbm_predictions = gbm_model.predict(X_test)
rf_predictions = rf_model.predict(X_test)

# Step 9: Evaluate the models using accuracy score and classification report
print("\n--- Support Vector Machine (SVM) Results ---")
print(f"Accuracy: {accuracy_score(y_test, svm_predictions)}")
print(classification_report(y_test, svm_predictions))

print("\n--- Gradient Boosting Machine (GBM) Results ---")
print(f"Accuracy: {accuracy_score(y_test, gbm_predictions)}")
print(classification_report(y_test, gbm_predictions))

print("\n--- Random Forest Classifier Results ---")
print(f"Accuracy: {accuracy_score(y_test, rf_predictions)}")
print(classification_report(y_test, rf_predictions))



--- Support Vector Machine (SVM) Results ---
Accuracy: 0.7689530685920578
              precision    recall  f1-score   support

           0       0.79      0.89      0.84       367
           1       0.71      0.53      0.61       187

    accuracy                           0.77       554
   macro avg       0.75      0.71      0.72       554
weighted avg       0.76      0.77      0.76       554


--- Gradient Boosting Machine (GBM) Results ---
Accuracy: 0.8808664259927798
              precision    recall  f1-score   support

           0       0.89      0.94      0.91       367
           1       0.87      0.76      0.81       187

    accuracy                           0.88       554
   macro avg       0.88      0.85      0.86       554
weighted avg       0.88      0.88      0.88       554


--- Random Forest Classifier Results ---
Accuracy: 0.98014440433213
              precision    recall  f1-score   support

           0       0.98      0.99      0.99       367
           1   
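
The baseline SVM is trained on unscaled features, although the preview shows columns on very different ranges (DiabetesPedigreeFunction below 3, Insulin up to 168). An RBF-kernel SVC is distance-based, so scaling may change its result noticeably. The sketch below, on synthetic data from make_classification (not the diabetes file), compares a plain SVC with a StandardScaler + SVC pipeline; all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data: 8 features, binary target.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=0
)
# Inflate one feature so the columns sit on very different scales.
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Plain SVC on raw features.
unscaled_acc = SVC().fit(X_tr, y_tr).score(X_te, y_te)

# Same model, but the scaler is fitted on the training data only.
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_acc = scaled_svm.fit(X_tr, y_tr).score(X_te, y_te)

print(f"Unscaled SVC accuracy: {unscaled_acc:.3f}")
print(f"Scaled SVC accuracy:   {scaled_acc:.3f}")
```

Wrapping the scaler in a pipeline also means a later cross-validated search refits it inside each fold, which avoids leaking test-fold statistics.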

Hyperparameter Tuning:
Use GridSearchCV or RandomizedSearchCV to tune hyperparameters for each of the models (e.g., SVM's kernel, Random Forest's n_estimators, etc.).

To keep the run time short, I reduced the search settings as follows:

Smaller Hyperparameter Grids: Reduced the number of options in each hyperparameter grid.
Reduced n_iter: Set n_iter=5 in RandomizedSearchCV to limit the number of random samples, which speeds up the tuning process.
Reduced Cross-Validation Folds: Set cv=3 for fewer cross-validation folds to decrease computational load.
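
One consequence of combining small grids with n_iter=5 is that RandomizedSearchCV cannot sample more candidates than the grid contains when every parameter is given as a list. The sketch below counts the combinations in the three grids used in this notebook with ParameterGrid; the dictionary grids and the loop are illustrative additions.

```python
from sklearn.model_selection import ParameterGrid

# The three grids as defined in the tuning cell of this notebook.
grids = {
    "SVM": {"C": [0.1, 1], "kernel": ["linear", "rbf"], "gamma": ["scale"]},
    "GBM": {"n_estimators": [100, 200], "learning_rate": [0.1], "max_depth": [3, 5]},
    "RF": {"n_estimators": [100, 200], "max_depth": [None, 10], "min_samples_split": [2, 5]},
}

n_iter = 5
cv = 3
summary = {}
for name, grid in grids.items():
    n_combos = len(ParameterGrid(grid))   # size of the full grid
    n_candidates = min(n_iter, n_combos)  # what the search can actually try
    summary[name] = (n_combos, n_candidates, n_candidates * cv)
    print(f"{name}: {n_combos} combinations, {n_candidates} candidates, "
          f"{n_candidates * cv} fits")
```

For the SVM and GBM grids the random search is therefore equivalent to an exhaustive grid search; only the Random Forest search actually subsamples.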

In [3]:
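# RandomizedSearchCV is not imported in the earlier cells, so import it here
from sklearn.model_selection import RandomizedSearchCV

# Step 10: Define smaller hyperparameter grids for tuning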
# SVM Hyperparameters
svm_param_grid = {
    'C': [0.1, 1],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale']
}

# Gradient Boosting Hyperparameters
gbm_param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.1],
    'max_depth': [3, 5]
}

# Random Forest Hyperparameters
rf_param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10],
    'min_samples_split': [2, 5]
}

# Step 11: Randomized Search for Support Vector Machine (SVM)
svm_random = RandomizedSearchCV(SVC(), svm_param_grid, n_iter=5, refit=True, verbose=1, cv=3, n_jobs=-1)
svm_random.fit(X_train, y_train)

# Best parameters and evaluation for SVM
print("\n--- Best Parameters for SVM ---")
print(svm_random.best_params_)
svm_best_predictions = svm_random.predict(X_test)
print(f"SVM Accuracy after tuning: {accuracy_score(y_test, svm_best_predictions)}")
print(classification_report(y_test, svm_best_predictions))

# Step 12: Randomized Search for Gradient Boosting Machine (GBM)
gbm_random = RandomizedSearchCV(GradientBoostingClassifier(), gbm_param_grid, n_iter=5, refit=True, verbose=1, cv=3, n_jobs=-1)
gbm_random.fit(X_train, y_train)

# Best parameters and evaluation for GBM
print("\n--- Best Parameters for GBM ---")
print(gbm_random.best_params_)
gbm_best_predictions = gbm_random.predict(X_test)
print(f"GBM Accuracy after tuning: {accuracy_score(y_test, gbm_best_predictions)}")
print(classification_report(y_test, gbm_best_predictions))

# Step 13: Randomized Search for Random Forest Classifier
rf_random = RandomizedSearchCV(RandomForestClassifier(), rf_param_grid, n_iter=5, refit=True, verbose=1, cv=3, n_jobs=-1)
rf_random.fit(X_train, y_train)

# Best parameters and evaluation for Random Forest
print("\n--- Best Parameters for Random Forest ---")
print(rf_random.best_params_)
rf_best_predictions = rf_random.predict(X_test)
print(f"Random Forest Accuracy after tuning: {accuracy_score(y_test, rf_best_predictions)}")
print(classification_report(y_test, rf_best_predictions))




Fitting 3 folds for each of 4 candidates, totalling 12 fits

--- Best Parameters for SVM ---
{'kernel': 'linear', 'gamma': 'scale', 'C': 1}
SVM Accuracy after tuning: 0.7635379061371841
              precision    recall  f1-score   support

           0       0.79      0.89      0.83       367
           1       0.70      0.52      0.60       187

    accuracy                           0.76       554
   macro avg       0.74      0.70      0.72       554
weighted avg       0.76      0.76      0.75       554

Fitting 3 folds for each of 4 candidates, totalling 12 fits

--- Best Parameters for GBM ---
{'n_estimators': 200, 'max_depth': 5, 'learning_rate': 0.1}
GBM Accuracy after tuning: 0.9819494584837545
              precision    recall  f1-score   support

           0       0.98      0.99      0.99       367
           1       0.98      0.96      0.97       187

    accuracy                           0.98       554
   macro avg       0.98      0.98      0.98       554
weighted avg       0.98      0.98      0.98       554

Fitting 3 folds for each of 5 candidates, totalling 15 fits

--- Best Parameters for Random Forest ---
{'n_estimators': 200, 'min_samples_split': 2, 'max_depth': None}
Random Forest Accuracy after tuning: 0.9819494584837545
              precision    recall  f1-score   support

           0       0.98      0.99      0.99       367
           1       0.98      0.96      0.97       187

    accuracy                           0.98       554
   macro avg       0.98      0.98      0.98       554
weighted avg       0.98      0.98     
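
The searches above do not set random_state, so the sampled candidates and the fitted forests can differ between runs. A minimal sketch, on synthetic data and with a deliberately small hypothetical grid (demo_grid), of fixing the seeds and of reading the mean cross-validated score (best_score_) next to the test score:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the real data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Hypothetical grid, kept small so the example runs quickly.
demo_grid = {
    "n_estimators": [20, 40],
    "max_depth": [None, 10],
    "min_samples_split": [2, 5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),  # seed for the model itself
    demo_grid,
    n_iter=5,
    cv=3,
    random_state=42,                          # seed for the candidate sampling
)
search.fit(X_tr, y_tr)

cv_score = search.best_score_          # mean accuracy over the 3 validation folds
test_score = search.score(X_te, y_te)  # accuracy of the refitted best model
print(search.best_params_)
print(f"CV accuracy: {cv_score:.3f}, test accuracy: {test_score:.3f}")
```

A large gap between the two numbers would be a hint to look again at the split or at the data.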

Model Evaluation:
Evaluate each model using the following metrics:
Accuracy
Precision, Recall, F1-score
AUC-ROC
Compare the performance of the models on the test data.
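
As a quick check on how these metrics relate, F1 is the harmonic mean of precision and recall. The short sketch below recomputes the F1 values of the baseline SVM report from its rounded precision and recall; f1_from is a hypothetical helper name, and working from rounded inputs can differ from the library value in the last digit in general.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded class-wise values taken from the baseline SVM classification report.
f1_class0 = f1_from(0.79, 0.89)
f1_class1 = f1_from(0.71, 0.53)

print(round(f1_class0, 2), round(f1_class1, 2))  # 0.84 0.61
```

Both values agree with the f1-score column printed for the baseline SVM.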

In [8]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Step 1: Evaluate SVM
svm_best_predictions = svm_random.predict(X_test)
svm_auc = roc_auc_score(y_test, svm_random.predict_proba(X_test)[:, 1])
print("\n--- SVM Evaluation ---")
print(f"Accuracy: {accuracy_score(y_test, svm_best_predictions)}")
print(f"Precision: {precision_score(y_test, svm_best_predictions)}")
print(f"Recall: {recall_score(y_test, svm_best_predictions)}")
print(f"F1 Score: {f1_score(y_test, svm_best_predictions)}")
print(f"AUC-ROC: {svm_auc}")

# Step 2: Evaluate GBM
gbm_best_predictions = gbm_random.predict(X_test)
gbm_auc = roc_auc_score(y_test, gbm_random.predict_proba(X_test)[:, 1])
print("\n--- GBM Evaluation ---")
print(f"Accuracy: {accuracy_score(y_test, gbm_best_predictions)}")
print(f"Precision: {precision_score(y_test, gbm_best_predictions)}")
print(f"Recall: {recall_score(y_test, gbm_best_predictions)}")
print(f"F1 Score: {f1_score(y_test, gbm_best_predictions)}")
print(f"AUC-ROC: {gbm_auc}")

# Step 3: Evaluate Random Forest
rf_best_predictions = rf_random.predict(X_test)
rf_auc = roc_auc_score(y_test, rf_random.predict_proba(X_test)[:, 1])
print("\n--- Random Forest Evaluation ---")
print(f"Accuracy: {accuracy_score(y_test, rf_best_predictions)}")
print(f"Precision: {precision_score(y_test, rf_best_predictions)}")
print(f"Recall: {recall_score(y_test, rf_best_predictions)}")
print(f"F1 Score: {f1_score(y_test, rf_best_predictions)}")
print(f"AUC-ROC: {rf_auc}")

# Step 4: Plot ROC Curves
plt.figure(figsize=(10, 6))

# SVM ROC Curve
fpr_svm, tpr_svm, _ = roc_curve(y_test, svm_random.predict_proba(X_test)[:, 1])
plt.plot(fpr_svm, tpr_svm, label=f'SVM (AUC = {svm_auc:.2f})')

# GBM ROC Curve
fpr_gbm, tpr_gbm, _ = roc_curve(y_test, gbm_random.predict_proba(X_test)[:, 1])
plt.plot(fpr_gbm, tpr_gbm, label=f'GBM (AUC = {gbm_auc:.2f})')

# Random Forest ROC Curve
fpr_rf, tpr_rf, _ = roc_curve(y_test, rf_random.predict_proba(X_test)[:, 1])
plt.plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC = {rf_auc:.2f})')

plt.plot([0, 1], [0, 1], linestyle='--', color='gray')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves')
plt.legend()
plt.show()


NotFittedError: This RandomizedSearchCV instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
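
The error above means the search object was used before fit was called on it. Separately, an SVC created without probability=True offers no predict_proba at all. One alternative, sketched below on synthetic data, is to feed the signed distances from decision_function to roc_auc_score and roc_curve, which accept any continuous score. This is an illustrative option, not the approach taken in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the real data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Default SVC (probability=False); it must be fitted before scoring.
svm_demo = SVC().fit(X_tr, y_tr)

# Signed distance to the separating surface; larger means "more class 1".
scores = svm_demo.decision_function(X_te)

auc = roc_auc_score(y_te, scores)
fpr, tpr, _ = roc_curve(y_te, scores)
print(f"AUC-ROC: {auc:.3f}")
```

A fitted RandomizedSearchCV forwards decision_function to its best estimator, so the same call works on a search object once it has been fitted.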

In [9]:
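# Re-create the SVM search with probability=True so that predict_proba is available.
# This only defines the search; it must be fitted again before predicting (done in the next cell).
svm_random = RandomizedSearchCV(SVC(probability=True), svm_param_grid, n_iter=5, refit=True, verbose=1, cv=3, n_jobs=-1)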


In [11]:
# Step 11: Randomized Search for Support Vector Machine (SVM)
svm_random = RandomizedSearchCV(SVC(probability=True), svm_param_grid, n_iter=5, refit=True, verbose=1, cv=3, n_jobs=-1)
svm_random.fit(X_train, y_train)

# Best parameters and evaluation for SVM
print("\n--- Best Parameters for SVM ---")
print(svm_random.best_params_)
svm_best_predictions = svm_random.predict(X_test)

# Evaluate SVM
evaluate_model(y_test, svm_best_predictions, "SVM")




Fitting 3 folds for each of 4 candidates, totalling 12 fits
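
The cell above calls evaluate_model, whose definition does not appear in this section. The following is a hypothetical version with the signature inferred from that call (true labels, predictions, model name); the original helper may have looked different. It would need to be defined before the cell runs.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def evaluate_model(y_true, y_pred, model_name):
    """Print and return the basic classification metrics (hypothetical helper)."""
    metrics = {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1 Score": f1_score(y_true, y_pred),
    }
    print(f"\n--- {model_name} Evaluation ---")
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
    return metrics


# Tiny hand-made example: one positive is missed, nothing is falsely flagged.
demo_metrics = evaluate_model([0, 1, 1, 0, 1], [0, 1, 0, 0, 1], "Demo")
```

AUC-ROC is left out on purpose because it needs scores or probabilities rather than hard predictions.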
