DATA COLLECTION AND PREPROCESSING

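In this section, we load the dataset, standardize the features, balance the classes with SMOTE, and split the result into stratified training and test sets (80/20). Note that the scaler and SMOTE are both applied to the full dataset before the split, so the test set includes synthetic samples and is scaled with statistics that include test rows; the test metrics reported later are therefore likely optimistic.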

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# Load the dataset
diabetes_dataset = pd.read_csv('diabetes.csv')

# Data description
print("Dataset Description:")
print(diabetes_dataset.describe())

# Splitting data into features (X) and target (Y)
X = diabetes_dataset.drop(columns='Outcome')
Y = diabetes_dataset['Outcome']

# Standardize the data
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Handle data imbalance using SMOTE
smote = SMOTE(random_state=42)
X, Y = smote.fit_resample(X, Y)
print("After SMOTE - Outcome Counts:", pd.Series(Y).value_counts())

# Split data into train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=42)
print("Data shapes:", X.shape, X_train.shape, X_test.shape)

Dataset Description:
       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin  \
count   768.000000  768.000000     768.000000     768.000000  768.000000   
mean      3.845052  120.894531      69.105469      20.536458   79.799479   
std       3.369578   31.972618      19.355807      15.952218  115.244002   
min       0.000000    0.000000       0.000000       0.000000    0.000000   
25%       1.000000   99.000000      62.000000       0.000000    0.000000   
50%       3.000000  117.000000      72.000000      23.000000   30.500000   
75%       6.000000  140.250000      80.000000      32.000000  127.250000   
max      17.000000  199.000000     122.000000      99.000000  846.000000   

              BMI  DiabetesPedigreeFunction         Age     Outcome  
count  768.000000                768.000000  768.000000  768.000000  
mean    31.992578                  0.471876   33.240885    0.348958  
std      7.884160                  0.331329   11.760232    0.476951  
min      0.000
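
The cell above fits the scaler and SMOTE on all rows before calling train_test_split. A minimal sketch of the alternative order (split first, then fit preprocessing on the training rows only) is shown below. It uses a synthetic dataset from make_classification as a stand-in for diabetes.csv, and the names X_demo, y_demo, and demo_scaler are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for diabetes.csv: 768 rows, 8 features, imbalanced classes.
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5,
    weights=[0.65, 0.35], random_state=42,
)

# 1. Split first so the test rows never influence preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# 2. Fit the scaler on the training rows only, then apply it to both parts.
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

# 3. Any oversampling (for example SMOTE(...).fit_resample) would then be
#    applied to (X_tr_scaled, y_tr) only; the test set keeps its original
#    class balance and contains no synthetic samples.
print("Train shape:", X_tr_scaled.shape, "Test shape:", X_te_scaled.shape)
print("Test class counts:", np.bincount(y_te))
```

With this order the test metrics describe performance on real, unseen rows, at the cost of an imbalanced test set for which accuracy alone is less informative.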

EXPLORATORY DATA ANALYSIS (EDA)

This section uses the describe() summary printed above for basic statistics. Plots of the distributions and feature relationships are produced in the last cell of this notebook.

In [None]:
# EDA section can include visualizations or further analysis.
# In this script, the dataset description from above provides basic exploratory insights.
print("EDA completed with dataset description above.")

EDA completed with dataset description above.
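
The describe() output in the previous section reports a minimum of 0 for Glucose, BloodPressure, SkinThickness, and Insulin, which is not a plausible measurement and probably stands for a missing value. The sketch below shows one way to count such zeros and mark them as NaN. The small DataFrame is a hypothetical sample that only borrows the column names; the choice of columns in zero_as_missing is an assumption.

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as diabetes.csv.
sample = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 66],
    "SkinThickness": [35, 29, 0, 23],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

# Columns where a zero is assumed to mean "not measured".
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count zeros per column.
zero_counts = (sample[zero_as_missing] == 0).sum()

# Replace zeros with NaN so that summary statistics ignore them.
cleaned = sample.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].mask(cleaned[zero_as_missing] == 0)

print(zero_counts)
print(cleaned.isna().sum())
```

Running the same count on the full dataset would show how much of each column is affected before deciding whether to impute or drop those rows.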


MACHINE LEARNING MODEL DEVELOPMENT

In this section, we define the XGBoost model, perform hyperparameter tuning using GridSearchCV, and train the model with the best parameters.

In [None]:
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from xgboost import XGBClassifier

# Define the model
xgb_model = XGBClassifier(objective='binary:logistic', random_state=42, eval_metric='logloss')

# Hyperparameter tuning using GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'colsample_bytree': [0.8, 1.0],
    'subsample': [0.8, 1.0]
}

grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, scoring='accuracy', cv=StratifiedKFold(n_splits=5), verbose=2, n_jobs=-1)
grid_search.fit(X_train, Y_train)

# Best parameters and model
best_xgb = grid_search.best_estimator_
print("Best Parameters:", grid_search.best_params_)

Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best Parameters: {'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 100, 'subsample': 0.8}
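
The "108 candidates, totalling 540 fits" line follows directly from the grid: the number of candidates is the product of the list lengths in param_grid, and each candidate is fitted once per fold. A short check of that arithmetic, using only the grid and fold count from the cell above:

```python
from math import prod

# Same grid as in the GridSearchCV cell above.
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'colsample_bytree': [0.8, 1.0],
    'subsample': [0.8, 1.0],
}

n_folds = 5  # StratifiedKFold(n_splits=5)

# One candidate per combination of values: 3 * 3 * 3 * 2 * 2.
n_candidates = prod(len(values) for values in param_grid.values())
total_fits = n_candidates * n_folds

print(n_candidates, total_fits)  # 108 540
```

Adding one more value to any list multiplies the cost, which is worth keeping in mind before widening the grid.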


In this section, we train a support vector classifier (SVC) with an RBF kernel as a baseline and evaluate it on the test set.

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Initialize the SVM model
svm_model = SVC(kernel='rbf', C=1, gamma='scale', random_state=42)

# Train the SVM model
svm_model.fit(X_train, Y_train)

# Make predictions
Y_pred = svm_model.predict(X_test)

# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(Y_test, Y_pred))
print("\nClassification Report:")
print(classification_report(Y_test, Y_pred))
print("\nAccuracy Score:", accuracy_score(Y_test, Y_pred))

Confusion Matrix:
[[74 26]
 [19 81]]

Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.74      0.77       100
           1       0.76      0.81      0.78       100

    accuracy                           0.78       200
   macro avg       0.78      0.78      0.77       200
weighted avg       0.78      0.78      0.77       200


Accuracy Score: 0.775
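
The figures in the classification report can be recomputed by hand from the confusion matrix, which helps when checking that the rows and columns are read the right way round. In scikit-learn's layout, rows are true labels and columns are predicted labels. The variable names below are illustrative.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = [[74, 26],
      [19, 81]]

tn, fp = cm[0]  # true class 0: predicted 0, predicted 1
fn, tp = cm[1]  # true class 1: predicted 0, predicted 1

precision_1 = tp / (tp + fp)  # share of predicted positives that are correct
recall_1 = tp / (tp + fn)     # share of actual positives that are found
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
print(f"accuracy={accuracy:.3f}")
```

The rounded values match the report: 0.80 and 0.74 for class 0, 0.76 and 0.81 for class 1, and an accuracy of 0.775.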


DISEASE PREDICTION AND RISK ASSESSMENT

Here, we evaluate the tuned XGBoost model using accuracy, a classification report, and a confusion matrix. The model and scaler are then saved with pickle, and the feature importances are plotted and written to a PNG file.

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

# Evaluate the model
train_predictions = best_xgb.predict(X_train)
train_accuracy = accuracy_score(Y_train, train_predictions)
print("Training Data Accuracy:", train_accuracy)

test_predictions = best_xgb.predict(X_test)
test_accuracy = accuracy_score(Y_test, test_predictions)
print("Test Data Accuracy:", test_accuracy)

# Classification report and confusion matrix
print("Classification Report:\n", classification_report(Y_test, test_predictions))
print("Confusion Matrix:\n", confusion_matrix(Y_test, test_predictions))

# Save the model and scaler
with open('diabetes_model_xgb.pkl', 'wb') as model_file:
    pickle.dump(best_xgb, model_file)
with open('scaler.pkl', 'wb') as scaler_file:
    pickle.dump(scaler, scaler_file)
print("XGBoost model and scaler saved successfully!")

# Feature importance visualization
feature_importances = pd.DataFrame({
    'Feature': diabetes_dataset.columns[:-1],
    'Importance': best_xgb.feature_importances_
}).sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10, 6))
# Assign y to hue and hide the legend; passing palette without hue is deprecated in seaborn
sns.barplot(data=feature_importances, x='Importance', y='Feature', hue='Feature', palette='viridis', legend=False)
plt.title('Feature Importances in XGBoost')
plt.xlabel('Importance Score')
plt.ylabel('Feature')
plt.tight_layout()
plt.savefig('xgb_feature_importances.png')
plt.close()  # close the figure instead of leaving an empty one open
print("Feature importances chart saved.")

Training Data Accuracy: 0.99875
Test Data Accuracy: 0.795
Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.75      0.79       100
           1       0.77      0.84      0.80       100

    accuracy                           0.80       200
   macro avg       0.80      0.79      0.79       200
weighted avg       0.80      0.80      0.79       200

Confusion Matrix:
 [[75 25]
 [16 84]]
XGBoost model and scaler saved successfully!
Feature importances chart saved.
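
XGBoost reaches 0.795 test accuracy against 0.775 for the SVC, a difference of four test samples out of 200. A rough way to judge whether that gap is meaningful is a normal-approximation (Wald) confidence interval around each accuracy. The sketch below is an approximation under the assumption that the 200 test samples are independent; the helper name wald_interval is illustrative.

```python
from math import sqrt

def wald_interval(correct, total, z=1.96):
    """Approximate 95% interval for an accuracy estimated from `total` samples."""
    p = correct / total
    half_width = z * sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width

# Correct predictions are the diagonal of each confusion matrix above.
xgb_low, xgb_high = wald_interval(75 + 84, 200)
svc_low, svc_high = wald_interval(74 + 81, 200)

print(f"XGBoost: [{xgb_low:.3f}, {xgb_high:.3f}]")
print(f"SVC:     [{svc_low:.3f}, {svc_high:.3f}]")
print("Intervals overlap:", xgb_low < svc_high and svc_low < xgb_high)
```

Each interval is roughly plus or minus 0.06 wide and the two overlap, so on a test set of this size the two models cannot be clearly separated by accuracy alone. The gap between the 0.99875 training accuracy and the 0.795 test accuracy is much larger and points to overfitting of the tuned XGBoost model.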

Next, we load the saved model and scaler and use them to predict whether a user is diabetic.

In [None]:
import pickle
import numpy as np

# Load the trained model
with open('diabetes_model_xgb.pkl', 'rb') as model_file:
    loaded_model = pickle.load(model_file)

# Load the scaler object
with open('scaler.pkl', 'rb') as scaler_file:
    scaler = pickle.load(scaler_file)

# Function for predicting diabetes
def predict_diabetes(input_data):
    # Convert input data to a numpy array and reshape it
    input_data_array = np.array(input_data).reshape(1, -1)

    # Scale the input data using the loaded scaler
    input_data_scaled = scaler.transform(input_data_array)

    # Make predictions using the loaded model
    prediction = loaded_model.predict(input_data_scaled)
    proba = loaded_model.predict_proba(input_data_scaled)

    return prediction[0], proba[0]

# Input data
input_data = [0, 137, 40, 35, 168, 43.1, 2.288, 33]

# Predict diabetes
prediction, probability = predict_diabetes(input_data)

# Output the results
print("\nPrediction for input data:")
if prediction == 0:
    print("The person is not diabetic")
else:
    print("The person is diabetic")
print(f"Probability of being diabetic: {probability[1]*100:.2f}%")



Prediction for input data:
The person is diabetic
Probability of being diabetic: 94.49%
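
predict_diabetes takes a bare list, so the values must arrive in exactly the column order the scaler and model were fitted on. A small helper that builds the list from named fields makes that order explicit and fails early on a missing field. This is a sketch: FEATURE_ORDER is copied from the column order in the describe() output, while to_feature_vector and the patient dictionary are hypothetical names introduced for illustration.

```python
# Column order of the training features (all columns except Outcome).
FEATURE_ORDER = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]

def to_feature_vector(record):
    """Return the values of `record` as a list in training column order."""
    missing = [name for name in FEATURE_ORDER if name not in record]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    return [float(record[name]) for name in FEATURE_ORDER]

# Same values as input_data above, but keyed by name in arbitrary order.
patient = {
    "Age": 33, "BMI": 43.1, "Glucose": 137, "Insulin": 168,
    "Pregnancies": 0, "BloodPressure": 40, "SkinThickness": 35,
    "DiabetesPedigreeFunction": 2.288,
}

vector = to_feature_vector(patient)
print(vector)
```

The resulting list can be passed to predict_diabetes in place of the hand-written input_data.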




In [None]:
# Boxplot for each column
for column in diabetes_dataset.columns:
    plt.figure(figsize=(8, 6))
    # Use a single color; passing palette without hue is deprecated in seaborn
    sns.boxplot(y=diabetes_dataset[column], color=sns.color_palette('Set2')[0])
    plt.title(f'Boxplot of {column}')
    plt.ylabel(column)
    plt.tight_layout()
    plt.savefig(f'{column}_boxplot.png')
    plt.close()  # close each figure so open figures do not accumulate
    print(f'Boxplot for {column} saved.')

# Heatmap of correlation matrix
plt.figure(figsize=(10, 8))
correlation_matrix = diabetes_dataset.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Heatmap of Feature Correlations')
plt.tight_layout()
plt.savefig('heatmap_correlation.png')
plt.close()
print("Heatmap of feature correlations saved.")

# Bar chart for Outcome counts
outcome_counts = diabetes_dataset['Outcome'].value_counts()
plt.figure(figsize=(8, 6))
# Assign x to hue and hide the legend to keep one color per bar
sns.barplot(x=outcome_counts.index, y=outcome_counts.values, hue=outcome_counts.index, palette='husl', legend=False)
plt.title('Outcome Counts')
plt.xlabel('Outcome')
plt.ylabel('Count')
plt.tight_layout()
plt.savefig('outcome_counts_barchart.png')
plt.close()
print("Bar chart for Outcome counts saved.")

# Line chart of feature averages by Outcome
feature_means = diabetes_dataset.groupby('Outcome').mean().T
plt.figure(figsize=(12, 6))
for outcome in feature_means.columns:
    plt.plot(feature_means.index, feature_means[outcome], marker='o', label=f'Outcome {outcome}')
plt.title('Feature Averages by Outcome')
plt.xlabel('Features')
plt.ylabel('Average Value')
plt.xticks(rotation=45)
plt.legend()
plt.tight_layout()
plt.savefig('feature_averages_linechart.png')
plt.close()
print("Line chart of feature averages by Outcome saved.")

# Histogram for each column
for column in diabetes_dataset.columns:
    plt.figure(figsize=(8, 6))
    sns.histplot(diabetes_dataset[column], kde=True, color='skyblue', bins=30)
    plt.title(f'Histogram of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.tight_layout()
    plt.savefig(f'{column}_histogram.png')
    plt.close()
    print(f'Histogram for {column} saved.')

# Pairplot to visualize relationships (pairplot creates its own figure)
sns.pairplot(diabetes_dataset, hue='Outcome', palette='viridis')
plt.savefig('pairplot_features.png')
plt.close()
print("Pairplot for feature relationships saved.")

Boxplot for Pregnancies saved.
Boxplot for Glucose saved.
Boxplot for BloodPressure saved.
Boxplot for SkinThickness saved.
Boxplot for Insulin saved.
Boxplot for BMI saved.
Boxplot for DiabetesPedigreeFunction saved.
Boxplot for Age saved.
Boxplot for Outcome saved.
Heatmap of feature correlations saved.
Bar chart for Outcome counts saved.
Line chart of feature averages by Outcome saved.
Histogram for Pregnancies saved.
Histogram for Glucose saved.
Histogram for BloodPressure saved.
Histogram for SkinThickness saved.
Histogram for Insulin saved.
Histogram for BMI saved.
Histogram for DiabetesPedigreeFunction saved.
Histogram for Age saved.
Histogram for Outcome saved.
Pairplot for feature relationships saved.