# PIMA INDIAN DIABETES EDA
In this kernel I predict the chances of diabetes using the PIMA Indian dataset. This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

# About Dataset

The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. The variable names are as follows:

**Features:**

- **Pregnancies -** Number of times pregnant.
- **Glucose -** Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
- **BloodPressure -** Diastolic blood pressure (mm Hg).
- **SkinThickness -** Triceps skinfold thickness (mm).
- **Insulin -** 2-Hour serum insulin (mu U/ml).
- **BMI -** Body mass index (weight in kg/(height in m)^2).
- **DiabetesPedigreeFunction -** Diabetes pedigree function.
- **Age -** Age in years.

**Target Variable :**

- **Outcome -** Class variable 1 if patient has diagnosed diabetes and 0 if not.

## Steps to be Followed :
I have taken the following steps to apply machine learning models:

1. Importing Essential Libraries.
2. Data Preparation & Data Cleaning.
3. Data Visualization 
4. Feature Engineering to discover essential features in the process of applying machine learning.
5. Encoding Categorical Variables.
6. Train Test Split
7. Apply Machine Learning Algorithm
8. Cross Validation
9. Model Evaluation

## Model Evaluation :
- [Cross Validation Score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)
- [Confusion Matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)
- [Plotting ROC-AUC Curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
- [Sensitivity and Specificity](https://en.wikipedia.org/wiki/Sensitivity_and_specificity)
- [Classification Error](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/)

# Importing Essential Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import itertools
plt.style.use('fivethirtyeight')
from sklearn import model_selection
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import  accuracy_score, f1_score, precision_score,confusion_matrix, recall_score, roc_auc_score
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.svm import SVC

In [None]:
df=pd.read_csv('../input/diabetes.csv')

In [None]:
# Let's look at some of the sample data
df.head()

In [None]:
df.describe()

This dataset is known to have missing values. Specifically, missing observations in some columns are recorded as a zero value.

## Data Cleaning & Data Preparation
In this step we will look for missing entries and, if there are any, fill them with median or mean values. We will also check the data types of all the features to find any inconsistency.

In [None]:
df.isna().any() # check whether any column contains missing (NaN) values

In [None]:
print(df.dtypes)

In [None]:
df.head(50)

It seems from the above table that there are zero entries in BMI, Blood Pressure, Glucose, Skin Thickness and Insulin. These are not meaningful measurements, so we will replace them with the median of each column before fitting the machine learning models.

**Replacing zero entries BMI, Blood Pressure,Glucose, Skin Thickness and Insulin with their median values**

In [None]:
# Calculate the median value for BMI
median_bmi = df['BMI'].median()
# Substitute it in the BMI column of the
# dataset where values are 0
df['BMI'] = df['BMI'].replace(
    to_replace=0, value=median_bmi)

median_bloodp = df['BloodPressure'].median()
# Substitute it in the BloodP column of the
# dataset where values are 0
df['BloodPressure'] = df['BloodPressure'].replace(
    to_replace=0, value=median_bloodp)

# Calculate the median value for PlGlcConc
median_plglcconc = df['Glucose'].median()
# Substitute it in the PlGlcConc column of the
# dataset where values are 0
df['Glucose'] = df['Glucose'].replace(
    to_replace=0, value=median_plglcconc)

# Calculate the median value for SkinThick
median_skinthick = df['SkinThickness'].median()
# Substitute it in the SkinThick column of the
# dataset where values are 0
df['SkinThickness'] = df['SkinThickness'].replace(
    to_replace=0, value=median_skinthick)

# Calculate the median value for Insulin
median_insulin = df['Insulin'].median()
# Substitute it in the Insulin column of the
# dataset where values are 0
df['Insulin'] = df['Insulin'].replace(
    to_replace=0, value=median_insulin)

In [None]:
df.head(50)

All the zero entries are now filled with the median values.
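One thing to keep in mind is that the medians above are computed while the zeros are still in the column, which pulls the median down when many entries are zero. The sketch below uses plain Python and made-up insulin values (not taken from the dataset) to compare that median with one computed from the non-zero entries only. It is an alternative to consider, not the approach used in this kernel.

```python
from statistics import median

# Made-up insulin readings; 0 stands for a missing measurement
insulin = [0, 0, 0, 94, 168, 88, 0, 543]

# Median with the zeros included (what a plain column median gives)
median_with_zeros = median(insulin)

# Median over the observed (non-zero) readings only
observed = [v for v in insulin if v != 0]
median_observed = median(observed)

# Fill the zeros with the median of the observed readings
imputed = [median_observed if v == 0 else v for v in insulin]

print(median_with_zeros, median_observed)
print(imputed)
```

With pandas, the same idea can be expressed by first turning the zeros into NaN and then filling them with the column median, since the median skips NaN by default.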

## Data Visualization

In [None]:
sns.countplot(data=df, x = 'Outcome', label='Count')

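outcome_counts = df['Outcome'].value_counts()
# Look up by label: value_counts() orders by frequency, so the first entry is not necessarily Outcome 1
DB, NDB = outcome_counts[1], outcome_counts[0]
print('Number of patients diagnosed with diabetes: ',DB)
print('Number of patients not diagnosed with diabetes: ',NDB)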

# Brief Analysis of the Data

In [None]:
columns=df.columns[:8]
plt.subplots(figsize=(18,15))
length=len(columns)
for i,j in itertools.zip_longest(columns,range(length)):
    plt.subplot(length//2,3,j+1)  # integer division: subplot needs whole-number grid sizes
    plt.subplots_adjust(wspace=0.2,hspace=0.5)
    df[i].hist(bins=20,edgecolor='black')
    plt.title(i)
plt.show()

It seems to have even distribution of data in all the features of the dataset.

# Analysis of Diabetic Cases

In [None]:
df1=df[df['Outcome']==1]
columns=df.columns[:8]
plt.subplots(figsize=(18,15))
length=len(columns)
for i,j in itertools.zip_longest(columns,range(length)):
    plt.subplot(length//2,3,j+1)  # integer division: subplot needs whole-number grid sizes
    plt.subplots_adjust(wspace=0.2,hspace=0.5)
    df1[i].hist(bins=20,edgecolor='black')
    plt.title(i)
plt.show()

In [None]:
sns.pairplot(df, hue = 'Outcome', vars = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin','BMI','DiabetesPedigreeFunction','Age'] )

In [None]:
sns.jointplot(x="Pregnancies", y="Insulin", data=df, kind="reg")

# Feature Engineering
Now it's time to add important features to the dataset and see their effect by visualizing them.

**Feature 1 : BMI Indicator**<br>
I am adding a BMI Indicator feature based on the commonly used BMI ranges.
If you have a BMI of:
- Under 18.5 – you are considered underweight and possibly malnourished.
- 18.5 to 24.9 – you are within a healthy weight range for young and middle-aged adults.
- 25.0 to 29.9 – you are considered overweight.
- 30 or over – you are considered obese.
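
The ranges in the list above can also be written as a sorted list of boundaries plus a lookup. This is a minimal sketch in plain Python; the names `BMI_BOUNDS`, `BMI_LABELS` and `bmi_label` are made up for illustration, and the labels match the ones used for the BM_DESC feature. Each boundary is treated as the start of the next range, so a value such as 24.95 still gets a label.

```python
from bisect import bisect_right

# Lower edges of the "Healthy", "Over" and "Obese" ranges
BMI_BOUNDS = [18.5, 25.0, 30.0]
BMI_LABELS = ["Under", "Healthy", "Over", "Obese"]

def bmi_label(bmi):
    """Return the BMI category for a single BMI value."""
    # bisect_right counts how many boundaries are <= bmi,
    # which is exactly the index of the matching label
    return BMI_LABELS[bisect_right(BMI_BOUNDS, bmi)]

for value in [17.0, 18.5, 24.95, 25.0, 29.9, 30.0, 41.3]:
    print(value, bmi_label(value))
```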

In [None]:
def set_bmi(row):
    # Each branch only needs an upper bound, so values such as 24.95 still get a label
    if row["BMI"] < 18.5:
        return "Under"
    elif row["BMI"] < 25:
        return "Healthy"
    elif row["BMI"] < 30:
        return "Over"
    else:
        return "Obese"

In [None]:
df = df.assign(BM_DESC=df.apply(set_bmi, axis=1))

df.head()

**Feature 2: Insulin Indicative Range** <br>
If the insulin level (2-hour serum insulin, mu U/ml) is >= 16 and <= 166, it is considered to be in the normal range;
otherwise it is considered abnormal.

In [None]:
def set_insulin(row):
    if row["Insulin"] >= 16 and row["Insulin"] <= 166:
        return "Normal"
    else:
        return "Abnormal"

In [None]:
df = df.assign(INSULIN_DESC=df.apply(set_insulin, axis=1))

df.head()

In [None]:
sns.countplot(data=df, x = 'INSULIN_DESC', label='Count')

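insulin_counts = df['INSULIN_DESC'].value_counts()
# Look up by label: value_counts() orders by frequency, not by category name
AB, NB = insulin_counts.get('Abnormal', 0), insulin_counts.get('Normal', 0)
print('Number of patients having abnormal insulin levels: ',AB)
print('Number of patients having normal insulin levels: ',NB)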

It seems from the above plot that more than 500 patients have abnormal insulin levels, whereas around 250 patients have normal insulin levels.

In [None]:
sns.countplot(data=df, x = 'BM_DESC', label='Count')

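bmi_counts = df['BM_DESC'].value_counts()
# Look up each count by label: value_counts() orders by frequency, not by category
UD,H,OV,OB = (bmi_counts.get(k, 0) for k in ['Under', 'Healthy', 'Over', 'Obese'])
print('Number of patients having underweight BMI: ',UD)
print('Number of patients having healthy BMI: ',H)
print('Number of patients having overweight BMI: ',OV)
print('Number of patients having obese BMI: ',OB)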

In [None]:
g = sns.FacetGrid(df, col="INSULIN_DESC", row="Outcome", margin_titles=True)
g.map(plt.scatter,"Glucose", "BloodPressure",  edgecolor="w")
plt.subplots_adjust(top=1.1)

In [None]:
g = sns.FacetGrid(df, col="Outcome", row="INSULIN_DESC", margin_titles=True)
g.map(plt.hist, "Age", color="red")
plt.subplots_adjust(top=0.9)
g.fig.suptitle('Disease by INSULIN and Age');

In [None]:
sns.boxplot(x="Age", y="INSULIN_DESC", hue="Outcome", data=df);

It seems from the above plot that patients having normal insulin levels are more often diabetic within the age range of 25 to 42,
whereas patients having abnormal insulin levels are more often diabetic from their late 20s to their mid 40s.

In [None]:
sns.boxplot(x="Age", y="BM_DESC", hue="Outcome", data=df);

From the above plot it appears that patients who are obese according to their BMI tend to be diabetic from an early age of around 25, whereas patients who are overweight are prone to diabetes in their early 30s.

Keep in mind that the data covers only women at least 21 years old of Pima Indian heritage, so the findings may differ for other populations.

## Label Encoding

In this step we will encode the categorical variables BM_DESC and INSULIN_DESC into numerical values before fitting the machine learning models.
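
For a variable with more than two categories, such as BM_DESC, one common encoding is a set of 0/1 indicator columns with one category left out as the baseline. The sketch below shows that idea in plain Python on a made-up list of labels. The helper name `dummy_encode` is hypothetical, and it is only meant to illustrate the kind of output that a dummy encoding with the first category dropped produces, not the library's internals.

```python
def dummy_encode(values, prefix):
    """Encode labels as 0/1 indicator columns, dropping the first category."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category becomes the all-zeros baseline
    return [{f"{prefix}_{c}": int(v == c) for c in kept} for v in values]

# Made-up sample of BMI categories
bmi_desc = ["Obese", "Healthy", "Over", "Under", "Obese"]
encoded = dummy_encode(bmi_desc, "BM_DESC")

for label, row in zip(bmi_desc, encoded):
    print(label, row)
```

Here "Healthy" sorts first and is dropped, so a healthy patient is represented by zeros in all three remaining columns.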

In [None]:
df["INSULIN_DESC"] = df.INSULIN_DESC.apply(lambda  x:1 if x=="Normal" else 0)

Segregating Features and Target Variable.

I have taken X as Feature variable and y as target variable.

In [None]:
X=pd.get_dummies(df,drop_first=True)
X=X.drop(['Outcome'],axis=1)
y = df['Outcome']

In [None]:
X.head()

In [None]:
y.head()

## Splitting data into Training & Testing Set
The training and test datasets must share the same predictors; they differ only in the observations they contain. Fitting a model on the training dataset minimizes its error on those observations, so good predictions on the training data are expected. The real check is the test dataset: if the model also predicts well on data that is similar to the training data but was never seen during fitting, we can be more confident that what it learned generalizes.

So, by splitting the dataset into training and testing subsets, we can measure the trained model fairly, since it never sees the test data during training. This makes it possible to detect overfitting.

I split the dataset so that 20% is held out as test data and the remaining 80% is used for training the model.

I have used the stratify parameter. It makes the split so that the proportion of each class in the resulting samples matches the proportion in the array passed to stratify.

For example, if variable y is a binary categorical variable with values 0 and 1 and there are 25% of zeros and 75% of ones, stratify=y will make sure that your random split has 25% of 0's and 75% of 1's.
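
The 25% / 75% example above can be checked with a small hand-written stratified split. This is a sketch in plain Python to show the idea and is not how the library implements it; the function name `stratified_split` is made up for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=1234):
    """Return (train_idx, test_idx) with each class split in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # same share taken from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# 25% zeros and 75% ones, as in the example above
labels = [0] * 25 + [1] * 75
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

Both parts keep one zero for every three ones, which is the property that `stratify=y` asks for.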

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,stratify=y, random_state = 1234)

## Feature Scaling
Most of the time, a dataset contains features that vary widely in magnitude, unit and range. Since many machine learning algorithms use the Euclidean distance between two data points in their computations, this is a problem.

If left alone, these algorithms take in only the magnitude of the features and ignore the units. The results would vary greatly between different units, for example 5 kg and 5000 g. Features with large magnitudes weigh in far more in the distance calculations than features with small magnitudes.
To suppress this effect, we need to bring all features to the same level of magnitude. This can be achieved by scaling.
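
To make the kilograms versus grams point concrete, here is a small sketch of standardization in plain Python on made-up weights. It subtracts the mean and divides by the population standard deviation, which is the same formula that StandardScaler applies to each column. The helper names `fit_scaler` and `transform` are made up for illustration.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn the mean and population standard deviation of one feature."""
    return mean(values), pstdev(values)

def transform(values, mu, sigma):
    """Standardize values with previously learned statistics."""
    return [(v - mu) / sigma for v in values]

# The same made-up weights expressed in two different units
weights_kg = [50.0, 60.0, 70.0, 80.0]
weights_g = [w * 1000 for w in weights_kg]

mu_kg, sigma_kg = fit_scaler(weights_kg)
mu_g, sigma_g = fit_scaler(weights_g)

z_kg = transform(weights_kg, mu_kg, sigma_kg)
z_g = transform(weights_g, mu_g, sigma_g)
print(z_kg)
print(z_g)
```

After scaling, both versions of the feature give the same values, so the choice of unit no longer affects distance calculations. As in the cell that follows, the statistics should be learned from the training data only and then reused to transform the test data.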

In [None]:
sc_X = StandardScaler()
X_train_scaled = pd.DataFrame(sc_X.fit_transform(X_train))
X_test_scaled = pd.DataFrame(sc_X.transform(X_test))

## Applying Machine Learning Models

In [None]:
logi = LogisticRegression(random_state = 0, penalty = 'l1', solver = 'liblinear')  # the l1 penalty needs a solver that supports it
logi.fit(X_train_scaled, y_train)

In [None]:
xgb_classifier = XGBClassifier()
xgb_classifier.fit(X_train_scaled, y_train, verbose=True)

In [None]:
random_forest = RandomForestClassifier(n_estimators = 100,criterion='gini', random_state = 47)
random_forest.fit(X_train_scaled, y_train)

In [None]:
svc_model_l = SVC(kernel='linear',probability=True)
svc_model_l.fit(X_train_scaled, y_train)

In [None]:
svc_model_r = SVC(kernel='rbf',probability=True)
svc_model_r.fit(X_train_scaled, y_train)

## Cross validation

In [None]:
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)  # random_state only takes effect when shuffle=True
scoring = 'accuracy'

acc_logi = cross_val_score(estimator = logi, X = X_train_scaled, y = y_train, cv = kfold,scoring=scoring)
acc_logi.mean()

acc_xgb = cross_val_score(estimator = xgb_classifier, X = X_train_scaled, y = y_train, cv = kfold,scoring=scoring)
acc_xgb.mean()

acc_rand = cross_val_score(estimator = random_forest, X = X_train_scaled, y = y_train, cv = kfold, scoring=scoring)
acc_rand.mean()

acc_svc_l = cross_val_score(estimator = svc_model_l, X = X_train_scaled, y = y_train, cv = kfold,scoring=scoring)
acc_svc_l.mean()

acc_svc_r = cross_val_score(estimator = svc_model_r, X = X_train_scaled, y = y_train, cv = kfold,scoring=scoring)
acc_svc_r.mean()

## Model Evaluation
In this step we will compare different performance metrics such as cross-validation accuracy, precision, recall, F1 score and ROC AUC.
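
As a reminder of what these metrics measure, the sketch below computes accuracy, precision, recall and F1 score by hand in plain Python. The labels and predictions are made up for illustration and are not results from this kernel; the function name `binary_metrics` is hypothetical, and it assumes that at least one positive is predicted and at least one positive exists, so that no denominator is zero.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Made-up labels and predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```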

In [None]:
y_predict_logi = logi.predict(X_test_scaled)
acc= accuracy_score(y_test, y_predict_logi)
roc=roc_auc_score(y_test, logi.predict_proba(X_test_scaled)[:, 1])  # AUC from predicted probabilities rather than hard labels
prec = precision_score(y_test, y_predict_logi)
rec = recall_score(y_test, y_predict_logi)
f1 = f1_score(y_test, y_predict_logi)

results = pd.DataFrame([['Logistic Regression',acc, acc_logi.mean(),prec,rec, f1,roc]],
               columns = ['Model', 'Accuracy','Cross Val Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results

In [None]:
y_predict_x = xgb_classifier.predict(X_test_scaled)
roc=roc_auc_score(y_test, xgb_classifier.predict_proba(X_test_scaled)[:, 1])  # AUC from predicted probabilities rather than hard labels
acc = accuracy_score(y_test, y_predict_x)
prec = precision_score(y_test, y_predict_x)
rec = recall_score(y_test, y_predict_x)
f1 = f1_score(y_test, y_predict_x)

model_results = pd.DataFrame([['XG Boost',acc, acc_xgb.mean(),prec,rec, f1,roc]],
               columns = ['Model','Accuracy', 'Cross Val Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results = pd.concat([results, model_results], ignore_index = True)  # DataFrame.append was removed in pandas 2.0
results

In [None]:
y_predict_r = random_forest.predict(X_test_scaled)
roc=roc_auc_score(y_test, random_forest.predict_proba(X_test_scaled)[:, 1])  # AUC from predicted probabilities rather than hard labels
acc = accuracy_score(y_test, y_predict_r)
prec = precision_score(y_test, y_predict_r)
rec = recall_score(y_test, y_predict_r)
f1 = f1_score(y_test, y_predict_r)

model_results = pd.DataFrame([['Random Forest',acc, acc_rand.mean(),prec,rec, f1,roc]],
               columns = ['Model', 'Accuracy','Cross Val Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results = pd.concat([results, model_results], ignore_index = True)  # DataFrame.append was removed in pandas 2.0
results

In [None]:
y_predict_s = svc_model_l.predict(X_test_scaled)
roc=roc_auc_score(y_test, svc_model_l.predict_proba(X_test_scaled)[:, 1])  # AUC from predicted probabilities rather than hard labels
acc = accuracy_score(y_test, y_predict_s)
prec = precision_score(y_test, y_predict_s)
rec = recall_score(y_test, y_predict_s)
f1 = f1_score(y_test, y_predict_s)

model_results = pd.DataFrame([['SVC Linear',acc, acc_svc_l.mean(),prec,rec, f1,roc]],
               columns = ['Model', 'Accuracy','Cross Val Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results = pd.concat([results, model_results], ignore_index = True)  # DataFrame.append was removed in pandas 2.0
results

In [None]:
y_predict_s1 = svc_model_r.predict(X_test_scaled)
roc=roc_auc_score(y_test, svc_model_r.predict_proba(X_test_scaled)[:, 1])  # AUC from predicted probabilities rather than hard labels
acc = accuracy_score(y_test, y_predict_s1)
prec = precision_score(y_test, y_predict_s1)
rec = recall_score(y_test, y_predict_s1)
f1 = f1_score(y_test, y_predict_s1)

model_results = pd.DataFrame([['SVC RBF',acc, acc_svc_r.mean(),prec,rec, f1,roc]],
               columns = ['Model', 'Accuracy','Cross Val Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results = pd.concat([results, model_results], ignore_index = True)  # DataFrame.append was removed in pandas 2.0
results

# Plotting ROC Curve
The AUC (Area Under the Curve) of the ROC (Receiver Operating Characteristic) curve is a performance measurement for classification problems across various threshold settings. The ROC is a probability curve and the AUC represents the degree of separability: it tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. By analogy, the higher the AUC, the better the model is at distinguishing between patients with the disease and patients without it.

The ROC curve is plotted with TPR(True Positive Rate) against the FPR (False Positive Rate) where TPR is on y-axis and FPR is on the x-axis.
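
To show where the points on the curve come from, the sketch below builds a ROC curve by hand in plain Python: it sweeps a threshold over the predicted scores, records the FPR and TPR at each step, and integrates with the trapezoid rule. The four labels and scores are a made-up toy example, and the function names `roc_points` and `auc_trapezoid` are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists from sweeping the threshold over the distinct scores."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    fpr, tpr = [0.0], [0.0]
    # Lower the threshold step by step; everything scoring >= threshold is predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr

def auc_trapezoid(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

# Toy labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr = roc_points(y_true, scores)
auc = auc_trapezoid(fpr, tpr)
print(list(zip(fpr, tpr)))
print(auc)
```

Because the curve is traced by varying the threshold, it needs scores or probabilities as input rather than the final 0/1 predictions.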

In [None]:
from sklearn import metrics
import matplotlib.pyplot as plt

plt.figure()

# Add the models to the list that you want to view on the ROC plot
models = [
{
    'label': 'Logistic Regression',
    'model': LogisticRegression(random_state = 0, penalty = 'l1', solver = 'liblinear'),
},
{
    'label': 'XG Boost',
    'model': XGBClassifier(),
},
    {
    'label': 'Random Forest Gini',
    'model': RandomForestClassifier(n_estimators = 100,criterion='gini', random_state = 47),
},
    {
    'label': 'Support Vector Machine-L',
    'model': SVC(kernel='linear',probability=True)} ,
        {
    'label': 'Support Vector Machine-RBF',
    'model': SVC(kernel='rbf',probability=True) ,
}
]

# Below for loop iterates through your models list
for m in models:
    model = m['model'] # select the model
    model.fit(X_train_scaled, y_train) # train the model
    # Predicted probability of the positive class for the test data
    y_score = model.predict_proba(X_test_scaled)[:,1]
    # Compute false positive rate and true positive rate
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_score)
    # Calculate the area under the plotted curve (from the same probabilities) to display on the plot
    auc = metrics.roc_auc_score(y_test, y_score)
    # Now, plot the computed values
    plt.plot(fpr, tpr, label='%s ROC (area = %0.2f)' % (m['label'], auc))
# Custom settings for the plot 
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('1-Specificity(False Positive Rate)')
plt.ylabel('Sensitivity(True Positive Rate)')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

# Confusion Matrix

In [None]:
cm_logi = confusion_matrix(y_test, y_predict_logi)
plt.title('Confusion matrix of the Logistic classifier')
sns.heatmap(cm_logi,annot=True,fmt="d")
plt.show()

In [None]:
cm_x = confusion_matrix(y_test, y_predict_x)
plt.title('Confusion matrix of the XGB classifier')
sns.heatmap(cm_x,annot=True,fmt="d")
plt.show()

In [None]:
cm_r = confusion_matrix(y_test, y_predict_r)
plt.title('Confusion matrix of the Random Forest classifier')
sns.heatmap(cm_r,annot=True,fmt="d")
plt.show()

In [None]:
cm = confusion_matrix(y_test, y_predict_s)
plt.title('Confusion matrix of the SVC Linear classifier')
sns.heatmap(cm,annot=True,fmt="d")
plt.show()

As we have seen from the above model evaluation, Logistic Regression and SVC Linear are the best models for this dataset, so we will perform a further evaluation of Logistic Regression.

# Model Evaluation Part 2
In this part we will find the classification error, sensitivity and specificity of our logistic regression model.

In [None]:
TP = cm_logi[1, 1]
TN = cm_logi[0, 0]
FP = cm_logi[0, 1]
FN = cm_logi[1, 0]

In [None]:
classification_error = (FP + FN) / float(TP + TN + FP + FN)

print(classification_error)

The model has a classification error of 18.83%.

# Sensitivity & Specificity
Before finding sensitivity and specificity we must know what these terms mean:

Sensitivity and specificity are statistical measures of the performance of a binary classification test, also known in statistics as a classification function:

- **Sensitivity** (also called the true positive rate, the recall, or probability of detection in some fields) measures the proportion of actual positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition).
- **Specificity** (also called the true negative rate) measures the proportion of actual negatives that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).
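
The two definitions above, together with the classification error from the previous part, reduce to simple ratios of the four confusion-matrix counts. The sketch below works through them in plain Python with made-up counts; these numbers are purely illustrative and are not the results of the logistic regression model in this kernel.

```python
# Made-up confusion-matrix counts, for illustration only
TP, FN = 30, 20   # actual positives: correctly and incorrectly classified
TN, FP = 90, 10   # actual negatives: correctly and incorrectly classified

total = TP + TN + FP + FN

sensitivity = TP / (TP + FN)               # true positive rate
specificity = TN / (TN + FP)               # true negative rate
false_positive_rate = FP / (FP + TN)       # equals 1 - specificity
classification_error = (FP + FN) / total   # equals 1 - accuracy

print(sensitivity, specificity, false_positive_rate, classification_error)
```

In this toy case the classifier misses 40% of the actual positives while correctly clearing 90% of the actual negatives, which is the kind of pattern described as highly specific but less sensitive.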

In [None]:
sensitivity = TP / float(FN + TP)

print(sensitivity)

In [None]:
specificity = TN / (TN + FP)

print(specificity)

The model is highly specific but less sensitive.