# Prediction of diabetes at early stage

This notebook is a workflow that applies various Python-based machine learning models to predict diabetes at an early stage.

We will take the following approach:

1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Model Evaluation

# 1. Problem Definition

Given a set of parameters, can we predict whether a person has early-stage diabetes?

# 2. Data

Data Set Information:

The data was collected using direct questionnaires from patients of Sylhet Diabetes
Hospital in Sylhet, Bangladesh, and approved by a doctor.

Relevant Papers:

Likelihood Prediction of Diabetes at Early Stage Using Data Mining Techniques:

Authors and affiliations:
* M. M. Faniqul Islam
* Rahatara Ferdousi
* Sadikur Rahman
* Humayra Yasmin Bushra

Citation Request:

Islam, MM Faniqul, et al. 'Likelihood prediction of diabetes at early stage using data mining techniques.' Computer Vision and Machine Intelligence in Medical Image Analysis. Springer, Singapore, 2020. 113-125.


https://www.kaggle.com/ishandutta/early-stage-diabetes-risk-prediction-dataset?select=diabetes_data_upload.csv

# 3. Evaluation

The model is evaluated on precision, recall, F1 score and accuracy; we hope to achieve 95%.
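As a reminder of how these four metrics relate to each other, here is a minimal sketch that derives them from confusion-matrix counts. The function name `classification_metrics` and the counts are hypothetical and chosen only for illustration; they are not results from this notebook.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only
metrics = classification_metrics(tp=60, fp=2, fn=4, tn=38)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision penalises false positives and recall penalises false negatives, so the two can differ noticeably even when accuracy is high.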

# 4. Features

## Inputs
* Age: 20-65
* Sex (the `Gender` column): 1. Male, 2. Female
* Polyuria: 1. Yes, 2. No
* Polydipsia: 1. Yes, 2. No
* sudden weight loss: 1. Yes, 2. No
* weakness: 1. Yes, 2. No
* Polyphagia: 1. Yes, 2. No
* Genital thrush: 1. Yes, 2. No
* visual blurring: 1. Yes, 2. No
* Itching: 1. Yes, 2. No
* Irritability: 1. Yes, 2. No
* delayed healing: 1. Yes, 2. No
* partial paresis: 1. Yes, 2. No
* muscle stiffness: 1. Yes, 2. No
* Alopecia: 1. Yes, 2. No
* Obesity: 1. Yes, 2. No

## Labels
* Class: 1. Positive, 2. Negative

## Standard imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Loading the data

In [None]:
# df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ML Self-Projects/prediction of diabetes at early stage/diabetes_data_upload.csv')
df = pd.read_csv('../input/early-stage-diabetes-risk-prediction-dataset/diabetes_data_upload.csv')
df

## Data Exploration (Exploratory Data Analysis, EDA)

In [None]:
df

In [None]:
df.info()

In [None]:
plt.figure(figsize=(20,10))
plt.title('Number of Positive vs Negative Cases')
sns.countplot(data=df, x='class');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Histogram of Age')
sns.histplot(data=df, x='Age', bins=30, kde=True);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Histogram of Age with Positive vs Negative Cases')
sns.histplot(data=df, x='Age', hue='class', bins=30, kde=True);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Number of Positive vs Negative Cases with gender')
sns.countplot(data=df, x='class', hue='Gender');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Number of Positive vs Negative Cases with Obesity')
sns.countplot(data=df, x='class', hue='Obesity');

### Converting the DataFrame to dummy variables

In [None]:
df = pd.get_dummies(df, drop_first=True)

In [None]:
df
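To make the effect of `pd.get_dummies(df, drop_first=True)` concrete, here is a small sketch on a toy frame. The rows are hypothetical and only mimic the Yes/No and Positive/Negative style of the dataset. With `drop_first=True`, the first category of each text column (in sorted order) is dropped, which is where names such as `Gender_Male` and `class_Positive` come from.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "Age": [40, 58, 41],
    "Gender": ["Male", "Female", "Male"],
    "Polyuria": ["No", "Yes", "Yes"],
    "class": ["Positive", "Negative", "Positive"],
})

# Each text column becomes one indicator column; the first category is dropped
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
print(encoded)
```

Numeric columns such as `Age` pass through unchanged, and each two-valued column collapses to a single indicator.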

### Heatmap of Correlation

In [None]:
plt.figure(figsize=(20,20))
plt.title('Heatmap of Correlation')
sns.heatmap(data=df.corr(),annot= True);

In [None]:
(df.corr()['class_Positive'].sort_values()[:-1]).round(2)

Sorting the correlations with `class_Positive`, the features most strongly correlated (positively or negatively) with a positive result are:

* Gender (-0.45): females have a higher chance of being positive.
* Alopecia (-0.27): associated with a lower chance of being positive.
* weakness (0.24)
* visual blurring (0.25)
* Irritability (0.30)
* Polyphagia (0.34)
* Partial paresis (0.43)
* Sudden weight loss (0.44)
* Polydipsia (0.65)
* Polyuria (0.67)
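The sign of the Gender correlation is easiest to read once we remember that the encoded column is `Gender_Male` (1 = male). A minimal sketch with hypothetical toy data shows how a negative correlation with `class_Positive` corresponds to a higher positive rate among females:

```python
import pandas as pd

# Hypothetical toy data: Gender_Male 1 = male, class_Positive 1 = positive
toy = pd.DataFrame({
    "Gender_Male":    [1, 1, 1, 1, 0, 0, 0, 0],
    "class_Positive": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Pearson correlation between the two indicator columns
corr = toy["Gender_Male"].corr(toy["class_Positive"])

# Share of positive cases within each gender group
positive_rate = toy.groupby("Gender_Male")["class_Positive"].mean()

print(round(corr, 2))
print(positive_rate)
```

In this toy frame the correlation is negative because the positive rate is lower in the `Gender_Male == 1` group than in the `Gender_Male == 0` group.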

# 5. Modelling

In [None]:
df.head()

In [None]:
X = df.drop('class_Positive', axis=1)
y = df['class_Positive']

## Train Test Split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
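The split above does not stratify on the label. As an optional variation (an assumption on our part, not something the original split does), passing `stratify` keeps the class proportions the same in the train and test sets, which can matter for a slightly imbalanced dataset. The toy arrays below are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60% positive, 40% negative
y_toy = np.array([1] * 60 + [0] * 40)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 60/40 ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("train positive share:", y_tr.mean())
print("test positive share: ", y_te.mean())
```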

## Model Imports

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

## Baseline Modelling

In [None]:
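def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fit each model and return its test-set score, sorted ascending.

    models: dict mapping a display name to an unfitted scikit-learn estimator.
    For classifiers, `score` returns mean accuracy.
    """
    np.random.seed(42)

    model_scores = {}

    for name, model in models.items():
        model.fit(X_train, y_train)
        model_scores[name] = model.score(X_test, y_test)

    # One row per model, with a single 'Score' column
    model_scores = pd.DataFrame(model_scores, index=['Score']).transpose()
    model_scores = model_scores.sort_values('Score')

    return model_scores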

In [None]:
models = {'LogisticRegression': LogisticRegression(max_iter=10000),
          'KNeighborsClassifier': KNeighborsClassifier(),
          'SVC': SVC(),
          'DecisionTreeClassifier': DecisionTreeClassifier(),
          'RandomForestClassifier': RandomForestClassifier(),
          'AdaBoostClassifier': AdaBoostClassifier(),
          'GradientBoostingClassifier': GradientBoostingClassifier()}

In [None]:
baseline_model_scores = fit_and_score(models, X_train, X_test, y_train, y_test)

In [None]:
baseline_model_scores.sort_values('Score')

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(data=baseline_model_scores.sort_values('Score').T)
plt.title('Baseline Model Accuracy Score')
plt.xticks(rotation=90);

From the baseline modelling, we choose RandomForestClassifier (score 0.990385) for a more in-depth look.
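Note that the baseline scores come from `model.score`, which for scikit-learn classifiers is mean accuracy rather than precision. The sketch below checks this on synthetic data from `make_classification`; the data and the variable names are hypothetical stand-ins, not the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, for illustration only
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
preds = clf.predict(X_te)

# score() matches accuracy_score; precision is computed separately
print("score():  ", clf.score(X_te, y_te))
print("accuracy: ", accuracy_score(y_te, preds))
print("precision:", precision_score(y_te, preds))
```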

In [None]:
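from sklearn.metrics import classification_report, ConfusionMatrixDisplay
# plot_confusion_matrix was removed in scikit-learn 1.2; alias its replacement so the cells below still run
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator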

## Random Forest Classifier

### Baseline Model

In [None]:
rf_base_model = RandomForestClassifier()
rf_base_model.fit(X_train, y_train)
y_preds = rf_base_model.predict(X_test)

In [None]:
print(classification_report(y_test, y_preds))

In [None]:
plot_confusion_matrix(rf_base_model, X_test, y_test)

### Grid Search CV model 1

In [None]:
from sklearn.model_selection import GridSearchCV
from warnings import filterwarnings

In [None]:
params = {'n_estimators' : [50,100,150],
          'criterion': ['gini','entropy'],
          'bootstrap': [True, False],
          'oob_score' : [True, False]}

In [None]:
rf_model_1 = RandomForestClassifier(random_state=42)

In [None]:
filterwarnings('ignore')
gs_rf_model_1 = GridSearchCV(rf_model_1,params,scoring='precision',cv=5,verbose=1)
gs_rf_model_1.fit(X_train,y_train)

In [None]:
y_preds = gs_rf_model_1.predict(X_test)

In [None]:
print(classification_report(y_test,y_preds))

In [None]:
plot_confusion_matrix(gs_rf_model_1, X_test, y_test)

In [None]:
gs_rf_model_1.best_params_
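The grid for model 1 includes combinations with `bootstrap=False` and `oob_score=True`. As far as we know, a random forest cannot compute an out-of-bag score without bootstrapping, so those candidates fail to fit, which probably explains the warnings silenced above. A minimal sketch on synthetic data (a hypothetical stand-in for the real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, for illustration only
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=42)

try:
    # Out-of-bag scoring needs bootstrap samples, so this pairing is rejected
    RandomForestClassifier(bootstrap=False, oob_score=True).fit(X_toy, y_toy)
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print("fit failed:", err)
```

This is likely why the later grids drop `bootstrap` and keep the default `bootstrap=True`.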

### Grid Search CV model 2

In [None]:
params = {'n_estimators' : [60,80,100,120],
          'criterion': ['gini','entropy'],
          'oob_score' : [True, False]}

In [None]:
rf_model_2 = RandomForestClassifier(random_state=42)

In [None]:
gs_rf_model_2 = GridSearchCV(rf_model_2,params,scoring='precision',cv=5,verbose=1)
gs_rf_model_2.fit(X_train,y_train)

In [None]:
y_preds = gs_rf_model_2.predict(X_test)

In [None]:
print(classification_report(y_test,y_preds))

In [None]:
plot_confusion_matrix(gs_rf_model_2, X_test, y_test)

In [None]:
gs_rf_model_2.best_params_

### Grid Search CV model 3

In [None]:
params = {'n_estimators' : [70,75,80,85,90],
          'criterion': ['entropy'],
          'oob_score' : [True, False]}

In [None]:
rf_model_3 = RandomForestClassifier(random_state=42)

In [None]:
gs_rf_model_3 = GridSearchCV(rf_model_3,params,scoring='precision',cv=5,verbose=1)
gs_rf_model_3.fit(X_train,y_train)

In [None]:
y_preds = gs_rf_model_3.predict(X_test)

In [None]:
print(classification_report(y_test,y_preds))

In [None]:
plot_confusion_matrix(gs_rf_model_3, X_test, y_test)

In [None]:
gs_rf_model_3.best_params_
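Besides `best_params_`, the fitted search object also exposes `cv_results_`, which lets us see how close the candidates are to each other before settling on one. The sketch below uses synthetic data and a deliberately small grid (both hypothetical) to show the idea:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a small illustrative grid
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)
toy_params = {"n_estimators": [10, 20], "criterion": ["gini", "entropy"]}

search = GridSearchCV(RandomForestClassifier(random_state=42), toy_params,
                      scoring="precision", cv=3)
search.fit(X_toy, y_toy)

# One row per parameter combination, ranked by mean cross-validated precision
results = pd.DataFrame(search.cv_results_)
cols = ["param_n_estimators", "param_criterion",
        "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))
```

If the mean scores differ by less than their standard deviations, the choice between those candidates is not strongly supported by the search.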

# 6. Model Evaluation

From the Grid Search CV testing, we have decided to use the Random Forest Classifier with the following hyperparameters:

In [None]:
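from sklearn.metrics import RocCurveDisplay
# plot_roc_curve was removed in scikit-learn 1.2; alias its replacement so the cell below still runs
plot_roc_curve = RocCurveDisplay.from_estimator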

In [None]:
model = RandomForestClassifier(n_estimators=80,criterion='gini', oob_score=True, random_state=42)
model.fit(X_train,y_train)
y_preds = model.predict(X_test)

### Feature Importances

In [None]:
feat_importances = pd.DataFrame(model.feature_importances_, index=X.columns)
plt.figure(figsize=(20,10))
plt.xticks(rotation=90)
plt.title('Feature Importances')
sns.barplot(data= feat_importances.sort_values(0).T)

### ROC Curve

In [None]:
plot_roc_curve(model, X_test,y_test)
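The area under the ROC curve can also be computed as a single number with `roc_auc_score`, using the predicted probability of the positive class. The sketch below runs on synthetic data (a hypothetical stand-in), so its value says nothing about the diabetes model itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, for illustration only
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=80, random_state=42).fit(X_tr, y_tr)

# Column 1 of predict_proba is the probability of the positive class
pos_proba = clf.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, pos_proba)
print(f"ROC AUC: {auc:.3f}")
```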

### Classification Report

In [None]:
print(classification_report(y_test,y_preds))

### Confusion Matrix

In [None]:
plot_confusion_matrix(model, X_test, y_test)

Because the dataset is slightly imbalanced, we base our evaluation on the precision score, which is 97%.