# Diabetes Prediction

### Context

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict, based on diagnostic measurements, whether a patient has diabetes.

### Content

Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

* **Pregnancies**: Number of times pregnant
* **Glucose**: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
* **BloodPressure**: Diastolic blood pressure (mm Hg)
* **SkinThickness**: Triceps skin fold thickness (mm)
* **Insulin**: 2-Hour serum insulin (mu U/ml)
* **BMI**: Body mass index (weight in kg/(height in m)^2)
* **DiabetesPedigreeFunction**: Diabetes pedigree function
* **Age**: Age (years)
* **Outcome**: Class variable (0 = person does not have diabetes, 1 = person has diabetes)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#sns.set_palette('Set2')
#sns.set_style('white')
import plotly.express as px
import plotly.graph_objects as go

#Data Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

#Models ML
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

#Boosting
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier

#Metrics
from sklearn.metrics import confusion_matrix,accuracy_score
from sklearn.metrics import mean_squared_error,r2_score
from sklearn.metrics import roc_curve, auc
from sklearn import metrics

### Import dataset

In [None]:
data = pd.read_csv('/kaggle/input/diabetes-data-set/diabetes.csv')
data.head()

In [None]:
# There are no NaN or null values in any column
data.info()

In [None]:
data.describe()

We see that variables like Glucose, BloodPressure, SkinThickness, Insulin and BMI have values equal to 0, which most likely mark missing measurements.
We'll replace these values with the median of each column of the dataset.
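As a possible variant, here is a minimal sketch on a small made-up frame (the values and the helper name `zeros_to_median` are hypothetical, not part of the dataset or the notebook). It first turns the zeros into NaN so that the median is computed from the observed values only. The cells below instead compute the median with the zeros still included, so the two approaches can give different replacement values.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; 0 stands for a missing measurement
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137],
    "Insulin": [0, 0, 94, 168, 0],
})

def zeros_to_median(df, cols):
    """Return a copy where zeros in `cols` are replaced by the median of the non-zero values."""
    out = df.copy()
    for col in cols:
        observed = out[col].replace(0, np.nan)      # treat zeros as missing
        out[col] = observed.fillna(observed.median())  # median() skips NaN by default
    return out

cleaned = zeros_to_median(toy, ["Glucose", "Insulin"])
print(cleaned)
```

Working on a copy keeps the original frame available for comparison.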

In [None]:
select_col = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

In [None]:
# Number of values equal to zero in each column
for col in select_col:
    print('{}:'.format(col), (data[col] == 0).sum())

In [None]:
# Median of each column
data.median()

In [None]:
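# Replace the values equal to 0 with the column median.
# Assigning the result back avoids calling replace(inplace=True) on a temporary
# selection, which does not reliably update `data` in recent pandas versions.
for col in select_col:
    data[col] = data[col].replace(0, data[col].median())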

In [None]:
# Check that no zeros remain in the selected columns
for col in select_col:
    print('{}:'.format(col), (data[col] == 0).sum())

In [None]:
data.head()

# Data visualization
## Outcomes

In [None]:
sns.countplot(x='Outcome',data=data, palette='Set2')
plt.title('Count of people with and without diabetes')
plt.grid()

In [None]:
data.Outcome.value_counts()

## Ages

In [None]:
# Range of Ages
sns.histplot(data['Age'], bins=4)
plt.title('Distribution of ages')
plt.grid()

## Pregnancies

In [None]:
# Pregnancy count per woman
sns.countplot(x='Pregnancies', data=data)
plt.title('Pregnancies')
plt.grid()

## Glucose (mg/dL)

**Oral glucose tolerance test**: For this test, you fast overnight, and the fasting blood sugar level is measured. Then you drink a sugary liquid, and blood sugar levels are tested periodically for the next two hours.

A blood sugar level less than 140 mg/dL (7.8 mmol/L) is ***normal***. A reading of more than 200 mg/dL (11.1 mmol/L) after two hours indicates ***diabetes***. A reading between 140 and 199 mg/dL (7.8 mmol/L and 11.0 mmol/L) indicates ***prediabetes***.

source: <a>https://www.mayoclinic.org/diseases-conditions/diabetes/diagnosis-treatment/drc-20371451</a>
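The thresholds quoted above can be turned into a categorical column. This is only a sketch: the function name `glucose_category` and the sample readings are made up for illustration, and since the quoted ranges leave a reading of exactly 200 mg/dL unassigned, the sketch assumes it counts as diabetes.

```python
import numpy as np
import pandas as pd

def glucose_category(values):
    """Label 2-hour glucose readings (mg/dL) using the ranges quoted above."""
    bins = [-np.inf, 140, 200, np.inf]
    labels = ["normal", "prediabetes", "diabetes"]
    # right=False closes each interval on the left, so [140, 200) is prediabetes
    return pd.cut(pd.Series(values), bins=bins, labels=labels, right=False)

# Made-up readings chosen to sit on both sides of each threshold
print(glucose_category([85, 139, 140, 199, 200]).tolist())
```

Applied to the notebook's data, the same call would be `glucose_category(data['Glucose'])`.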

In [None]:
sns.histplot(data['Glucose'], kde=True)
plt.grid()

## Diastolic blood pressure (mm Hg)

The diastolic reading is the pressure in the arteries when the heart rests between beats. This is the time when the heart fills with blood and gets oxygen.

This is what the diastolic blood pressure number means:
* ***Normal***: Lower than 80 mmHg
* ***Stage 1 hypertension***: 80-89 mmHg
* ***Stage 2 hypertension***: 90 mmHg or more 

source: <a>https://www.webmd.com/hypertension-high-blood-pressure/guide/diastolic-and-systolic-blood-pressure-know-your-numbers</a>
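To see how these ranges relate to the outcome, one option is to label each reading and cross-tabulate the labels. The sketch below uses a tiny made-up frame and a hypothetical column name `BP_category`; the counts it prints say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Made-up readings for illustration only
toy = pd.DataFrame({
    "BloodPressure": [66, 72, 80, 88, 90, 110],
    "Outcome":       [0, 0, 1, 0, 1, 1],
})

# Conditions are checked in order; the first one that holds decides the label
conditions = [toy["BloodPressure"] < 80, toy["BloodPressure"] < 90]
labels = ["normal", "stage 1"]
toy["BP_category"] = np.select(conditions, labels, default="stage 2")

# Rows: blood pressure category, columns: Outcome (0 or 1)
table = pd.crosstab(toy["BP_category"], toy["Outcome"])
print(table)
```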

In [None]:
sns.displot(data['BloodPressure'],kde=True, color='green')
plt.grid()

## Body Mass Index (BMI)

The body mass index (BMI) is a measure that uses the height and weight to work out if the weight is healthy. The BMI calculation divides an adult's weight in kilograms by their height in metres squared.

If your BMI is:

* Below 18.5 – you're in the ***underweight*** range
* Between 18.5 and 24.9 – you're in the ***healthy weight*** range
* Between 25 and 29.9 – you're in the ***overweight*** range
* Between 30 and 39.9 – you're in the ***obese*** range

source: <a>https://www.nhs.uk/common-health-questions/lifestyle/what-is-the-body-mass-index-bmi/</a>
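As a worked example of the calculation and the ranges above, here is a small sketch. The function names and the sample weight and height are made up for illustration, and the label for values of 40 or more is a placeholder, since the list above stops at 39.9.

```python
from bisect import bisect_right

def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

def bmi_category(value):
    """Map a BMI value to the ranges listed above."""
    cut_points = [18.5, 25, 30, 40]
    labels = ["underweight", "healthy weight", "overweight", "obese",
              "40 or more (not covered by the list)"]  # placeholder label, an assumption
    # bisect_right counts how many cut points are <= value, which is the label index
    return labels[bisect_right(cut_points, value)]

value = bmi(70, 1.75)  # hypothetical person: 70 kg, 1.75 m
print(round(value, 1), bmi_category(value))
```

Values that fall between the printed ranges, such as 24.95, are assigned to the lower range in this sketch.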

In [None]:
sns.displot(data['BMI'], kde=True, color='red')
plt.grid()

## SkinThickness
From a laboratory in Argentina, I found that an optimal value of skin thickness for people over 12 years of age is 3 to 17 Ul/ml.

source: <a>https://www.labmoreira.com/nuevos-examenes.asp?strClave=2</a>

In [None]:
sns.displot(data['SkinThickness'], kde=True, color='orange')
plt.grid()

In [None]:
# Blood pressure lower than 80 is normal

fig = px.scatter(data, x="Age", y='BloodPressure',
             size="Glucose", color="Outcome",
                 hover_data=["BMI"], log_x=True, size_max=12, 
                 color_continuous_scale=[[0, 'rgb(102, 194, 165)'], [1.0, 'rgb(225, 128, 114)']],
                 title="General view"
                 )
fig.add_shape(type="line",
    x0=20, y0=80, x1=85, y1=80,
    line=dict(color="blue",width=2,dash="dash")
 )

fig.show()

In [None]:
# We differentiate the data with Outcome 0 and 1
out_0 = data[data['Outcome']==0]
out_1 = data[data['Outcome']==1]

In [None]:
# We'll build a function to plot the distribution of each variable with respect to "Outcome"
def visualization(variable):
    fig=go.Figure()
    fig.add_trace(go.Box(y=out_0[variable],name=0,marker_color='rgb(102, 194, 165)',boxpoints="all",whiskerwidth=0.3))
    fig.add_trace(go.Box(y=out_1[variable],name=1,marker_color='rgb(225, 128, 114)',boxpoints="all",whiskerwidth=0.3))
    fig.update_layout(title="{} distribution with respect to Outcome".format(variable),height=600)
    fig.show()

In [None]:
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age']

for column in data[columns]:
    visualization(column)

# Correlations
We'll examine the correlations between all the variables.

In [None]:
plt.figure(figsize=(16,9))
corr = data.corr()
sns.heatmap(abs(corr), linewidths=1, annot=True, cmap='Set2')
plt.show()

We can see that the variable most correlated with "Outcome" is "Glucose", with a value of 0.49, while the least correlated is "BloodPressure", with a value of 0.16.
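Reading the ranking off a heatmap can be error-prone, so one option is to sort the correlations with the target directly. The sketch below runs on synthetic data generated only for illustration (the numbers are not the dataset's): "Glucose" is built to depend on "Outcome" and "BloodPressure" is built to be independent of it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
outcome = rng.integers(0, 2, size=n)

# Synthetic columns, not real measurements
toy = pd.DataFrame({
    "Glucose": 100 + 30 * outcome + rng.normal(0, 20, size=n),  # depends on Outcome
    "BloodPressure": rng.normal(72, 12, size=n),                # independent of Outcome
    "Outcome": outcome,
})

# Absolute correlation of every feature with the target, strongest first
ranking = toy.corr()["Outcome"].drop("Outcome").abs().sort_values(ascending=False)
print(ranking)
```

On the notebook's data the equivalent line would be `data.corr()['Outcome'].drop('Outcome').abs().sort_values(ascending=False)`.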

## Data processing and scaling
We'll split the data into training and testing sets, then scale the features with StandardScaler.

Both steps use *scikit-learn*: *train_test_split* from *sklearn.model_selection* for the split and *StandardScaler* from *sklearn.preprocessing* for the scaling.
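A minimal sketch of the split-then-scale pattern, on random made-up data: the scaler learns its mean and standard deviation from the training split only, and the same values are reused on the test split so that no information from the test set leaks into preprocessing. The use of `make_pipeline` at the end is an optional convenience and an assumption of this sketch, not something the notebook relies on.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random data for illustration only
rng = np.random.default_rng(7)
X = rng.normal(loc=50, scale=10, size=(100, 3))
y = (X[:, 0] > 50).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

# Fit on the training split, then apply the same transformation to both splits
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# A pipeline bundles the same two steps: fit() scales with training statistics,
# and score()/predict() reuse them on new data
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```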

In [None]:
# Train/test split and standardization of the features
X = data.iloc[:,:8]
Y = data.iloc[:,8]

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.20, random_state=7)

SScaler = StandardScaler()
X_train = SScaler.fit_transform(X_train)
# Reuse the mean and standard deviation learned from the training set
X_test = SScaler.transform(X_test)

# Evaluation models

First we will build two functions: the first shows the confusion matrix and some metrics, and the second plots the ROC curve.
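For reference, the four cells of a binary confusion matrix are enough to derive the usual metrics. The labels below are made up for illustration; the sketch only shows how accuracy, precision, recall and specificity follow from the counts.

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)      # share of predicted positives that are correct
recall = tp / (tp + fn)         # share of actual positives that are found
specificity = tn / (tn + fp)    # share of actual negatives that are found

print(accuracy, precision, recall, specificity)
```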

In [None]:
def impressions(model,accuracy):
    # Note: reads the global variables Y_test and Y_pred, so call it right after model.predict
    print('Accuracy: {} %'.format(accuracy))
    print('Mean squared error: ', round(mean_squared_error(Y_test,Y_pred),3))

    cm=confusion_matrix(Y_test,Y_pred)
    class_label = [0, 1]
    df_cm = pd.DataFrame(cm, index=class_label,columns=class_label)
    sns.heatmap(df_cm,annot=True,cmap='Set2',linewidths=2,fmt='d')
    plt.title("Confusion Matrix",fontsize=15)
    plt.xlabel("Predicted")
    plt.ylabel("True")
    plt.show()

In [None]:
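def roc_curve(Y_test, Y_score):
    # This function shares its name with sklearn's roc_curve, so the sklearn
    # version is imported locally and used inside the function body
    from sklearn.metrics import roc_curve, auc
    fpr, tpr, thresholds = roc_curve(Y_test, Y_score)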

    fig = px.area(
        #fpr = False Positive Rate; tpr= True Positive Rate
        x=fpr, y=tpr,
        title=f'ROC Curve (AUC={auc(fpr, tpr):.4f})',
        labels=dict(x='False Positive Rate', y='True Positive Rate'),
        width=700, height=500
    )

    fig.add_shape(
        type='line', line=dict(dash='dash'),
        x0=0, x1=1, y0=0, y1=1
    )

    fig.update_yaxes(scaleanchor="x", scaleratio=1)
    fig.update_xaxes(constrain='domain')
    fig.show()    

### Logistic Regression

In [None]:
LogR= LogisticRegression()
LogR.fit(X_train,Y_train)
Y_pred= LogR.predict(X_test)

LogR_accuracy= round(accuracy_score(Y_test,Y_pred),5)*100

impressions(LogR,LogR_accuracy)

Y_score = LogR.predict_proba(X_test)[:,1]
roc_curve(Y_test,Y_score)

### K-Nearest Neighbors

In [None]:
KNN= KNeighborsClassifier(n_neighbors=10)
KNN.fit(X_train,Y_train)
Y_pred= KNN.predict(X_test)

KNN_accuracy= round(accuracy_score(Y_test,Y_pred), 5)*100 # Accuracy

impressions(KNN,KNN_accuracy)

Y_score = KNN.predict_proba(X_test)[:,1]
roc_curve(Y_test,Y_score)

### Support Vector Machine

In [None]:
from sklearn.svm import SVC

svc= SVC(kernel='rbf')
svc.fit(X_train,Y_train)
Y_pred= svc.predict(X_test)

svc_accuracy= round(accuracy_score(Y_test,Y_pred), 5)*100 # Accuracy

impressions(svc,svc_accuracy)

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc= RandomForestClassifier(n_estimators=200, random_state=5, criterion='gini', max_depth=100)
rfc.fit(X_train,Y_train)
Y_pred= rfc.predict(X_test)

rfc_accuracy= round(accuracy_score(Y_test,Y_pred), 5)*100 # Accuracy

impressions(rfc,rfc_accuracy)

### Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=10, max_depth=100)
tree.fit(X_train,Y_train)
Y_pred= tree.predict(X_test)

tree_accuracy= round(accuracy_score(Y_test,Y_pred), 5)*100 # Accuracy

impressions(tree,tree_accuracy)

Y_score = tree.predict_proba(X_test)[:,1]
roc_curve(Y_test,Y_score)

### AdaBoost Classifier

In [None]:
ADA=AdaBoostClassifier(learning_rate= 0.15,n_estimators= 40)
ADA.fit(X_train,Y_train)
Y_pred= ADA.predict(X_test)

ADA_accuracy=round(accuracy_score(Y_test,Y_pred), 4)*100 # Accuracy

impressions(ADA,ADA_accuracy)

### Gradient Boosting Classifier


In [None]:
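# The default loss is the log-loss, which older scikit-learn versions called "deviance";
# leaving it at the default keeps the code working across versions
GB= GradientBoostingClassifier(n_estimators=30,learning_rate=0.22)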
GB.fit(X_train,Y_train)
Y_pred= GB.predict(X_test)

GB_accuracy=round(accuracy_score(Y_test,Y_pred), 4)*100 # Accuracy

impressions(GB,GB_accuracy)

In [None]:
models_accuracy = {
    'Logistic Regression': LogR_accuracy,
    'K-Nearest Neighbors' : KNN_accuracy,
    'Support Vector Machine' : svc_accuracy,
    'Random Forest': rfc_accuracy,
    'Decision Tree Classifier': tree_accuracy,
    'AdaBoost Classifier': ADA_accuracy,
    'Gradient Boosting Classifier': GB_accuracy
    
}

In [None]:
results = pd.DataFrame([[key, models_accuracy[key]] for key in models_accuracy.keys()],
                       columns=['Models', 'Accuracies']).sort_values('Accuracies', ascending=False)
results

---

**This is the first notebook I've uploaded to Kaggle. I'm new to the world of data science, so any feedback is very welcome!**