<a href="https://colab.research.google.com/github/thuseethan/machine_learning/blob/main/Diabetes_Recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Diabetes Diagnosis using Pima Indians Diabetes Database**

Within this notebook lies a comprehensive Python implementation designed to detect Diabetes within individuals, leveraging the rich dataset offered by the Pima Indians Diabetes Database available on Kaggle (https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database).

*The copyright for this code belongs to Thuseethan Selvarajah (thuseethan@gmail.com) © 22 Dec 2023*

**Importing all the necessary packages and libraries.**

In [46]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

**Here, (1) Load the dataset and define the data [X] and target [y] are performed.**

Before loading the diabetes.csv data, do the followings:
1.   Access the Pima Indians Diabetes Database on Kaggle through this link: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
2.   Download the dataset available at the provided link.
1.   Rename the downloaded dataset file to "diabetes.csv."
2.   Upload the "diabetes.csv" file to the session storage in Colab.
1.   Execute the below script designed to load the uploaded "diabetes.csv" file into the Colab environment.







In [2]:
diabetes_data = pd.read_csv('diabetes.csv')
features = ['Pregnancies', 'Insulin', 'BMI', 'Age','Glucose','BloodPressure','DiabetesPedigreeFunction']
X = diabetes_data[features]
y = diabetes_data.Outcome

In [3]:
# View first five rows in the data.
X.head(5)

Unnamed: 0,Pregnancies,Insulin,BMI,Age,Glucose,BloodPressure,DiabetesPedigreeFunction
0,6,0,33.6,50,148,72,0.627
1,1,0,26.6,31,85,66,0.351
2,8,0,23.3,32,183,64,0.672
3,1,94,28.1,21,89,66,0.167
4,0,168,43.1,33,137,40,2.288


In [4]:
# View first five rows in the target column.
y.head(5)

0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64

**Split the dataset into train and test sets.**
*   The Python code segments a dataset into distinct portions for training and testing.
*   'X' represents the input data, while 'y' signifies the corresponding labels.
*   It creates 'X_train' for training data and 'X_test' for testing data.
*   'y_train' stores the labels for the training set, and 'y_test' holds the labels for the testing set.
*   With 'test_size=0.3', 30% of the data is allocated for testing.
*   Setting 'random_state=0' ensures reproducibility of the split when the code runs multiple times.

In [11]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 0)

**Preprocessing [Feature Scaling]**
1.   *Understanding Euclidean Distance:* Many machine learning algorithms utilize the Euclidean distance between data points in their computations.
2.   *Magnitude Influence:* High-magnitude features exert a stronger impact on distance calculations than lower-magnitude ones.
1.   *Addressing Disproportionate Impact:* To counter this, employ feature standardization or Z-score normalization techniques.
2.   *Using "StandardScaler":* Implement normalization using the "StandardScaler" class available in the "sklearn.preprocessing" module.
1.   *Equal Contribution:* Normalization ensures that all features contribute equally to distance calculations, enhancing algorithm performance across diverse feature magnitudes.











In [5]:
std_scl = StandardScaler()

# Apply the standard scaler normalization to both train and test set separately.
X_train = std_scl.fit_transform(X_train)
X_test = std_scl.fit_transform(X_test)

**Train and test multiple machine learning models**

**Dummy classifier**
*   The Python code initializes a baseline model using the 'DummyClassifier' from scikit-learn.
*   It employs a 'stratified' strategy, meaning it generates predictions based on the training set's class distribution, and 'random_state=42' ensures reproducibility by fixing the random seed to 42.



In [31]:
# A baseline dummy classifier
baseline = DummyClassifier(strategy = 'stratified', random_state = 42)
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)


target_names = ['Diabetic', 'Nondiabetic']
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

    Diabetic       0.65      0.67      0.66       149
 Nondiabetic       0.36      0.33      0.34        82

    accuracy                           0.55       231
   macro avg       0.50      0.50      0.50       231
weighted avg       0.54      0.55      0.55       231



**Logistic regression model**
*   The Python code creates a logistic regression model using scikit-learn's 'LogisticRegression.'
*   It uses multiple solvers to optimize the model's parameters and sets the maximum number of iterations for convergence to 1000.
*   The solver with the best accuracy is selected.
*   The classification report for the best logistic regression model (solver) is presented.







In [38]:
# Logistic Regression with different solvers
solvers = ['liblinear', 'lbfgs', 'newton-cg', 'sag', 'saga']
best_accuracy = 0
best_solver = None
best_model = None

reports = {}

for solver in solvers:
    log_reg = LogisticRegression(solver=solver, max_iter=1000)
    log_reg.fit(X_train, y_train)
    y_pred = log_reg.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    reports[solver] = classification_report(y_test, y_pred, output_dict=True)

    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_solver = solver
        best_model = log_reg

print(f"Best Solver: {best_solver}, Accuracy: {best_accuracy}")
print("Classification Report for Best Model-Solver Combination:")
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))

Best Solver: liblinear, Accuracy: 0.8051948051948052
Classification Report for Best Model-Solver Combination:
              precision    recall  f1-score   support

           0       0.81      0.91      0.86       149
           1       0.79      0.61      0.69        82

    accuracy                           0.81       231
   macro avg       0.80      0.76      0.77       231
weighted avg       0.80      0.81      0.80       231



**K-Nearest neighbour model**
*   Initializes a k-nearest neighbors classifier using scikit-learn's 'KNeighborsClassifier'.
*   Sets the parameter 'n_neighbors' to the value 'k', which determines the number of neighboring data points used to make predictions in the k-nearest neighbors algorithm.

In [47]:
# K-Nearest neighbour model with different number of neighbours
neighbours = [3, 4, 5, 6, 7, 8]
best_accuracy = 0
best_neighbours = None
best_model = None

reports = {}

for k in neighbours:
    k_nei = KNeighborsClassifier(n_neighbors = k)
    k_nei.fit(X_train, y_train)
    y_pred = k_nei.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    reports[k] = classification_report(y_test, y_pred, output_dict=True)

    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_neighbours = k
        best_model = k_nei

print(f"Best Neighbours: {best_neighbours}, Accuracy: {best_accuracy}")
print("Classification Report for Best Model-Neighbour Combination:")
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))

Best Neighbours: 7, Accuracy: 0.7835497835497836
Classification Report for Best Model-Neighbour Combination:
              precision    recall  f1-score   support

           0       0.81      0.87      0.84       149
           1       0.72      0.63      0.68        82

    accuracy                           0.78       231
   macro avg       0.77      0.75      0.76       231
weighted avg       0.78      0.78      0.78       231



**Support vector machine model**
*   Creates a SVM model using scikit-learn's SVC function.
*   It sets the kernel type for the SVM as specified by the variable "krl".
*   Diferent kernels are used as given in "kernels" list





In [49]:
# Support vector machine model with different kernels
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
best_accuracy = 0
best_kernel = None
best_model = None

reports = {}

for krl in kernels:
    svm_mdl = SVC(kernel=krl)
    svm_mdl.fit(X_train, y_train)
    y_pred = svm_mdl.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    reports[krl] = classification_report(y_test, y_pred, output_dict=True)

    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_kernel = krl
        best_model = svm_mdl

print(f"Best Kernel: {best_kernel}, Accuracy: {best_accuracy}")
print("Classification Report for Best Model-Kernel Combination:")
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))

Best Kernel: linear, Accuracy: 0.8095238095238095
Classification Report for Best Model-Kernel Combination:
              precision    recall  f1-score   support

           0       0.81      0.92      0.86       149
           1       0.81      0.61      0.69        82

    accuracy                           0.81       231
   macro avg       0.81      0.76      0.78       231
weighted avg       0.81      0.81      0.80       231

