# Pima Indian Diabetes Dataset
## Attribute Information:

    1) Number of times pregnant
    2) Plasma glucose concentration at 2 hours in an oral glucose tolerance test
    3) Diastolic blood pressure (mm Hg)
    4) Triceps skin fold thickness (mm)
    5) 2-Hour serum insulin (mu U/ml)
    6) Body mass index (weight in kg/(height in m)^2)
    7) Diabetes pedigree function
    8) Age (years)
    9) Outcome (0 - No or 1 - Yes)

In [None]:
#Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import os
%matplotlib inline
sns.set_style("darkgrid")
warnings.filterwarnings("ignore")
df = pd.read_csv('diabetes.csv')

In [None]:
#Checking the head of the dataset.
df.head()

In [None]:
#Checking the dataset info
df.info()

In `df.head()`, SkinThickness is 0 at index 2, which is not a plausible measurement. Missing values in this dataset are therefore encoded as 0 rather than NaN.

In [None]:
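#Replacing 0 with NaN in columns where 0 is not a valid measurement (np.NaN was removed in NumPy 2.0, so np.nan is used)
df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']] = df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']].replace(0,np.nan)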
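
The effect of the replacement can be checked on a small frame. This is a minimal sketch with made-up values; the frame `toy` and the list `zero_as_missing` are illustrative names, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) reusing three of the notebook's column names.
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 0],
    "Glucose": [148, 0, 183],
    "SkinThickness": [35, 29, 0],
})

# Only columns where 0 cannot be a real measurement are converted.
# Pregnancies is left alone because 0 is a valid count there.
zero_as_missing = ["Glucose", "SkinThickness"]
toy[zero_as_missing] = toy[zero_as_missing].replace(0, np.nan)

missing_counts = toy.isnull().sum()
print(missing_counts)
```

Leaving `Pregnancies` out of the list matters: a zero there is information, not a gap.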

In [None]:
df_2 = df.set_index('Outcome')
sns.heatmap(df_2.isnull(), cbar=False);

In [None]:
df.isnull().sum().sort_values()

We won't drop any columns, since every column carries important information and dimensionality reduction isn't warranted here. Instead, we split the dataframe by outcome, fill the missing values with the per-class column means, and then recombine the two parts.

In [None]:
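#Copying each subset so that filling it does not operate on a view of df
df_0=df[df['Outcome']==0].copy()
df_1=df[df['Outcome']==1].copy()
#Filling missing values with the column means of the same outcome class
df_0=df_0.fillna(df_0.mean())
df_1=df_1.fillna(df_1.mean())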
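
The same per-class mean imputation can also be written without splitting the frame. The sketch below is an alternative formulation, not the notebook's own method, and uses a made-up frame `toy` with a single feature column.

```python
import numpy as np
import pandas as pd

# Made-up values: one missing Glucose reading in each outcome class.
toy = pd.DataFrame({
    "Glucose": [100.0, np.nan, 120.0, 160.0, np.nan, 180.0],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# Mean of Glucose within each Outcome class, broadcast back to one value per row.
# NaN entries are skipped when the mean is computed.
class_means = toy.groupby("Outcome")["Glucose"].transform("mean")

# Each missing value is filled with the mean of its own class.
toy["Glucose"] = toy["Glucose"].fillna(class_means)
filled = toy["Glucose"].tolist()
print(filled)
```

This keeps the original row order, whereas splitting and concatenating places all rows of one class before the other.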

In [None]:
data=pd.concat([df_0,df_1])
data.describe()

* **UNIVARIATE ANALYSIS**

Violin plots showing the distribution of each variable for the diabetic and non-diabetic groups

In [None]:
fig,ax=plt.subplots(4,2,figsize=(15,15))
fig.delaxes(ax[3,1])

for i,cols in enumerate(['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction']):
    fig = sns.violinplot(x = 'Outcome',y = cols,data = df_1,ax=ax[i//2,i%2],color='turquoise')
    fig = sns.violinplot(x = 'Outcome',y = cols,data = df_0,ax=ax[i//2,i%2],color='coral')
    fig.set(xticklabels=[" "])
plt.tight_layout();

In [None]:
#Distribution for each feature to understand how our data is organized
data.hist(figsize = (12,10));

In [None]:
fig,ax=plt.subplots(4,2,figsize=(15,15))
fig.delaxes(ax[3,1])

for i,cols in enumerate(['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction']):
    fig = sns.swarmplot(x = 'Outcome', y = cols,data = data,ax=ax[i//2,i%2],palette = 'Set3')  
    fig.set(xticklabels=["Don't have Diabetes","Have Diabetes"])
plt.tight_layout();

In [None]:
#Sorting age column data into age groups and making a new column.
bins = [20,30,40,50,60,70]
data['age_bins']=pd.cut(data['Age'], bins=bins)
data.head()
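
A short sketch of how `pd.cut` assigns values with these bin edges, using made-up ages. By default the intervals are open on the left and closed on the right, so an age exactly on the lowest edge, or above the highest one, gets no bin.

```python
import pandas as pd

# Made-up ages chosen to sit on and around the bin edges.
ages = pd.Series([20, 21, 30, 31, 70, 81])
bins = [20, 30, 40, 50, 60, 70]

# Default behaviour: right=True, include_lowest=False, i.e. bins are (20, 30], (30, 40], ...
age_bins = pd.cut(ages, bins=bins)

# True where an age fell outside every bin.
unbinned = age_bins.isna().tolist()
print(age_bins)
```

If ages outside the 20 to 70 range should be kept in the count plots, the bin list would need an extra edge on that side.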

In [None]:
fig,ax = plt.subplots(1,2,figsize=(15,5))

ax[0].set_title('Presence of Diabetes')
fig = sns.countplot(x='Outcome',data=data,palette='rocket',ax = ax[0])
fig.set(xticklabels=["Don't have Diabetes","Have Diabetes"])
ax[0].set_ylabel("Number of People")

ax[1].set_title('Presence of Diabetes by Age Group')
sns.countplot(x='age_bins',data=data ,hue='Outcome',palette='rocket',ax = ax[1])
ax[1].legend(["Don't have Diabetes","Have Diabetes"],loc = 1)
ax[1].set_xlabel('Age Group')
ax[1].set_ylabel('Number of People');

* **BIVARIATE ANALYSIS**

In [None]:
#Pairplot to conduct a bivariate analysis and see if there is high correlation between any 2 variables.
sns.pairplot(data=data,hue='Outcome');

Additionally, we check for highly correlated variables using a heatmap.

In [None]:
plt.figure(figsize = (8,8))
#age_bins is categorical, so it is left out of the correlation matrix
sns.heatmap(data.drop(columns='age_bins').corr(),annot = True,square = True);

Highest correlations: Glucose and Insulin (0.58), BMI and SkinThickness (0.65), Pregnancies and Age (0.54).
Since none of these is extremely high, we won't drop any columns.
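
Reading the strongest pairs off a heatmap by eye is error-prone, so they can also be ranked programmatically. This is a minimal sketch on synthetic data; the frame `toy` and its columns `a`, `b`, `c` are made up for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic data: b is a noisy copy of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()

# One entry per unordered column pair, so the diagonal and mirrored duplicates are skipped.
pairs = pd.Series({
    (x, y): corr.loc[x, y] for x, y in combinations(corr.columns, 2)
}).sort_values(ascending=False)

top_pair = pairs.index[0]
print(pairs)
```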

In [None]:
data.drop('age_bins',axis = 1,inplace = True)

In [None]:
fig,ax=plt.subplots(4,2,figsize=(15,15))
fig.delaxes(ax[3,1])

for i,cols in enumerate(['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction']):
    #histplot replaces the deprecated distplot; kde=False keeps plain histograms
    sns.histplot(data[data.Outcome == 0][cols], color='turquoise', kde=False, label='No Diabetes', ax=ax[i//2,i%2])
    sns.histplot(data[data.Outcome == 1][cols], color='coral', kde=False, label='Diabetes',ax=ax[i//2,i%2])
plt.legend()
plt.show()

In [None]:
#Selecting rows without extreme outliers (every |z-score| < 3); the result is not assigned back to data
from scipy import stats
data[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
data.shape

In [None]:
data.shape
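
For comparison, here is a minimal sketch of a z-score filter whose result is kept, on a made-up frame `toy` with one planted extreme value. The z-scores are computed with pandas using `ddof=0`, which matches the default of `scipy.stats.zscore`; the names `toy`, `keep` and `filtered` are illustrative.

```python
import numpy as np
import pandas as pd

# Deterministic made-up data with a single planted extreme value in column x.
toy = pd.DataFrame({
    "x": np.linspace(0.0, 1.0, 100),
    "y": np.linspace(-1.0, 1.0, 100),
})
toy.loc[0, "x"] = 50.0

# Column-wise z-scores using the population standard deviation (ddof=0).
z = (toy - toy.mean()) / toy.std(ddof=0)

# Keep a row only if every column is within 3 standard deviations of its mean.
keep = (z.abs() < 3).all(axis=1)

# The filtered frame is assigned to a new name, so the change is visible in its shape.
filtered = toy[keep]
print(toy.shape, filtered.shape)
```

Note that the z-scores must be computed on a frame without NaN values; a column containing NaN would otherwise produce NaN scores and fail the `< 3` test.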

The box plots below show the data after the z-score step.

In [None]:
fig,ax=plt.subplots(4,2,figsize=(15,15))
fig.delaxes(ax[3,1])

for i,cols in enumerate(['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction']):
    fig = sns.boxplot(x = 'Outcome',y = cols,data = df_1,ax=ax[i//2,i%2],color='turquoise')
    fig = sns.boxplot(x = 'Outcome',y = cols,data = df_0,ax=ax[i//2,i%2],color='coral')
    fig.set(xticklabels=[" "])
plt.tight_layout();

The shape of `data` is unchanged, because the rows selected in the z-score step were never assigned back to it. The box plots clearly show outliers, so we use another method.

Checking for outliers using the IQR score. The interquartile range (IQR) is a measure of statistical dispersion equal to the difference between the 75th and 25th percentiles. Rows with any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are filtered out.

In [None]:
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
data_new = data[~((data < (Q1 - 1.5 * IQR)) |(data > (Q3 + 1.5 * IQR))).any(axis=1)]
data_new.shape
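
The IQR rule is easy to verify by hand on a few numbers. The sketch below applies it to a made-up Series `values`; the quartiles use pandas' default linear interpolation.

```python
import pandas as pd

# Made-up values with one obvious outlier.
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])

q1 = values.quantile(0.25)   # 12.75
q3 = values.quantile(0.75)   # 16.5
iqr = q3 - q1                # 3.75

# Fences 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr       # 7.125
upper = q3 + 1.5 * iqr       # 22.125

# Keep only the values inside the fences.
kept = values[(values >= lower) & (values <= upper)].tolist()
print(kept)
```

On a DataFrame the same comparison is done column by column, and `.any(axis=1)` then drops a row if any one of its columns falls outside its fences.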

The shape has changed, which tells us that rows containing outliers were present and have been removed.

In [None]:
df0=data_new[data_new['Outcome']==0]
df1=data_new[data_new['Outcome']==1]

To show the effect of removing the outliers, we plot both violin plots and box plots again.

In [None]:
fig,ax=plt.subplots(4,2,figsize=(15,15))
fig.delaxes(ax[3,1])

for i,cols in enumerate(['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction']):
    fig = sns.violinplot(x = 'Outcome',y = cols,data = df1,ax=ax[i//2,i%2],color='turquoise')
    fig = sns.violinplot(x = 'Outcome',y = cols,data = df0,ax=ax[i//2,i%2],color='coral')
    fig.set(xticklabels=[" "])
plt.tight_layout();

In [None]:
fig,ax=plt.subplots(4,2,figsize=(15,15))
fig.delaxes(ax[3,1])

for i,cols in enumerate(['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction']):
    fig = sns.boxplot(x = 'Outcome',y = cols,data = df1,ax=ax[i//2,i%2],color='turquoise')
    fig = sns.boxplot(x = 'Outcome',y = cols,data = df0,ax=ax[i//2,i%2],color='coral')
    fig.set(xticklabels=[" "])
plt.tight_layout();

In [None]:
x = data_new.iloc[:,:-1].values
y = data_new.iloc[:,-1].values

Note for the reviewer: I set random_state to 101 so that the split is reproducible. The value itself is arbitrary; 0, 1, 40, 42 and 101 are just seeds I have seen used often, and none of them gives a better split than another.

In [None]:
from sklearn.model_selection import GridSearchCV, train_test_split 
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25,random_state=101)
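
What `random_state` guarantees is repeatability, which a short sketch can confirm. The arrays `X` and `y` below are made up; only `train_test_split` and its `test_size` and `random_state` parameters come from the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, 2 features.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 14 + [1] * 6)

# Two calls with the same seed.
split_a = train_test_split(X, y, test_size=0.25, random_state=101)
split_b = train_test_split(X, y, test_size=0.25, random_state=101)

# Every returned array (X_train, X_test, y_train, y_test) matches between the calls.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
n_test = len(split_a[1])
print(same, n_test)
```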

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)
rf.fit(x_train, y_train)

In [None]:
preds = rf.predict(x_test)

In [None]:
from sklearn import metrics
print(metrics.confusion_matrix(y_test,preds))

In [None]:
print(metrics.accuracy_score(y_test,preds))
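
Accuracy is only one number that can be read from the confusion matrix. As a sketch on made-up labels (`y_true` and `y_pred` below are illustrative, not model output), the four cells can be unpacked and combined into precision and recall as well.

```python
from sklearn import metrics

# Made-up labels: 4 negatives and 4 positives.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].
cm = metrics.confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found
print(cm)
print(accuracy, precision, recall)
```

For a screening task such as diabetes detection, recall on the positive class is often worth reporting next to accuracy, since missed cases and false alarms do not cost the same.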

Please check whether I am making a mistake in how I use GridSearchCV.

In [None]:
param_grid = { 
    'n_estimators': [50,100,200,500],
    'max_features': ['sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3; for classifiers it meant 'sqrt'
    'max_depth' : [2,3,4,5,6,7,8],
}

In [None]:
grid = GridSearchCV(estimator=rf, param_grid=param_grid, cv= 5)
grid.fit(x_train, y_train)

In [None]:
grid.best_params_

In [None]:
grid.score(x_test,y_test)
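
When checking a grid search, it helps to look at the cross-validated score next to the held-out score. The sketch below does this on synthetic data from `make_classification` with a deliberately small grid so it runs quickly; the data, the reduced grid and the fixed `random_state` on the forest are assumptions for illustration, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the diabetes features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=101
)

# Small illustrative grid; every combination is evaluated with 5-fold CV.
param_grid = {
    "n_estimators": [10, 20],
    "max_features": ["sqrt", "log2"],
    "max_depth": [2, 4],
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid=param_grid, cv=5
)
grid.fit(X_train, y_train)

# best_score_ is the mean CV accuracy of the best combination on the training folds.
cv_score = grid.best_score_
# score() uses the best estimator, refit on all of the training data, on unseen test data.
test_score = grid.score(X_test, y_test)
print(grid.best_params_, cv_score, test_score)
```

Fixing `random_state` on the estimator makes the search repeatable; without it, the best parameters can change from run to run because each forest is built from different random draws.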