# Overview 
[source](https://www.kaggle.com/uciml/pima-indians-diabetes-database)

## Context
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

## Content
The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
<hr />

<h2>Table of Contents:</h2>

### 1. Import Libraries and Data
    1.1 Top 5 rows of data
    1.2 Last 5 rows of data
    1.3 Some Random Values from our data
    1.4 Feature overview

### 2. Imputations
    2.1 Check For Null/Missing values
    2.2 Check For Outliers

### 3. Exploratory data analysis

### 4. Modeling

<hr />

# Let's Start !
# 1. Import Libraries and Data

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from matplotlib import pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
data = pd.read_csv('/kaggle/input/pima-indians-diabetes-database/diabetes.csv')

### 1.1 Top 5 rows of data 

In [None]:
data.head()

### 1.2 Last 5 rows of data

In [None]:
data.tail()

### 1.3 Some Random Values from our data

In [None]:
data.sample(5)

### 1.4 Feature overview

In [None]:
data.shape,data.size

    So our data has 768 rows and 9 columns, for a total of 6912 cells.

In [None]:
data.describe()

In [None]:
data.info()

    From above we can see that none of our columns is of object or text type (only int and float).
    
    We can compute on and visualize this data easily.

# 2. Imputations

### 2.1 Check For Null/Missing values

In [None]:
data.isnull().sum()

    So we do not have any null/missing values. That's great, now let's check for some outliers.
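
Note that `isnull()` only counts `NaN` entries. As a side check, the sketch below counts exact zeros in measurement columns where a zero reading would be physiologically implausible. Treating such zeros as unrecorded values is an assumption made here for illustration, and the rest of this notebook does not rely on it. The small frame `toy` and the helper `count_zeros` are hypothetical stand-ins for the real `data`.

```python
import pandas as pd

def count_zeros(frame, columns):
    """Return the number of exact zeros in each of the given columns."""
    return (frame[columns] == 0).sum()

# Hypothetical toy rows that mimic three of the dataset's columns
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

zero_counts = count_zeros(toy, ["Glucose", "BloodPressure", "BMI"])
print(zero_counts)
```

On the real data the same helper could be called as `count_zeros(data, ['Glucose', 'BloodPressure', 'BMI'])`.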

### 2.2 Check For Outliers

In [None]:
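fig, ax = plt.subplots(1, 1, figsize=(16, 7))
ax.boxplot(data.values)
# Label each box with its column name instead of the default 1..9 positions
ax.set_xticklabels(data.columns, rotation=45)
plt.show()

     So we have some outliers in the BMI column (column index 5, counting from 0).
     Let's fix this.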

In [None]:
data['BMI'].max(),data['BMI'].min(),data['BMI'].mean(),data['BMI'].mode()

     I will be using a percentile-based variant of the interquartile range method to point out the outliers:
     we keep all the data between the 15th and 85th percentiles (instead of the usual 25th and 75th) and consider dropping the rest.
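
For comparison, the textbook interquartile range rule uses the 25th and 75th percentiles and flags points that fall more than 1.5 × IQR outside them (Tukey's fences). The sketch below is a minimal illustration of that rule, not the method used in this notebook; the helper `iqr_fences` and the sample values in `bmi` are hypothetical.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return (lower, upper) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical BMI-like values with one obvious extreme
bmi = np.array([22.0, 25.0, 27.0, 30.0, 32.0, 35.0, 38.0, 67.0])
lower, upper = iqr_fences(bmi)

# Anything outside the fences is flagged as an outlier
outliers = bmi[(bmi < lower) | (bmi > upper)]
print(lower, upper, outliers)
```

With this rule only the points beyond the fences are dropped, so far fewer rows are removed than by cutting at the percentiles themselves.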

In [None]:
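# Lower cut-off: 15th percentile (stored in Q1; the true first quartile is the 25th)
# NumPy >= 1.22 names this argument `method`; older releases call it `interpolation`
Q1 = np.percentile(data['BMI'], 15, method='midpoint')

# Upper cut-off: 85th percentile (stored in Q3; the true third quartile is the 75th)
Q3 = np.percentile(data['BMI'], 85, method='midpoint')

# Range between the two cut-offs (the analogue of the interquartile range, IQR)
IQR = Q3 - Q1
print("Q1 = ", Q1)
print("Q3 = ", Q3)
print("IQR = ", IQR)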

In [None]:
(data['BMI']>Q3).sum(),(data['BMI']<Q1).sum()

    Here we can see that a lot of data lies outside these cut-off values (204 rows, and we cannot ignore that much data).
    
    So I manually checked for the outliers and found that most of them lie in roughly the top 1% of the data,
    so let's drop only those.

In [None]:
# NumPy >= 1.22 names this argument `method`; older releases call it `interpolation`
np.percentile(data['BMI'], 98.5, method='midpoint')

In [None]:
val = data['BMI'].sort_values(ascending=False)

     These are the top values which I am counting as outliers. 

In [None]:
print(val[:8])

In [None]:
# Keep only the rows whose BMI is at most 50
data = data[data['BMI'] <= 50]

In [None]:
data.shape

    So after removing outliers we are left with 760 rows.
    
    Let's check the removal of outliers by plotting a boxplot.

In [None]:
plt.boxplot(data['BMI'])
plt.show()

    So the outliers are removed and we have no missing data either.
    
    It's time for some visualizations.
<hr />

# 3. Exploratory data analysis

#### First, let's check whether our data is balanced.
**Balanced data**: if there are two classes, balanced data means roughly 50% of the points belong to each class.
#### Then we will plot a distribution plot and other visualizations.

In [None]:
xs = data['Outcome'].value_counts().index
ys = data['Outcome'].value_counts().values

# Recent seaborn versions require x and y to be passed as keyword arguments
ax = sns.barplot(x=xs, y=ys)
ax.set_xlabel("Outcome")
plt.show()


Our data is clearly not balanced. We will balance it using the **SMOTE technique** after some visualizations.

**SMOTE** (Synthetic Minority Over-sampling Technique) is an oversampling technique that generates synthetic samples for the minority class. It helps to overcome the **overfitting problem posed by random oversampling**, which merely duplicates existing rows. It works in feature space, creating new instances by interpolating between a minority-class sample and its nearest minority-class neighbours.
<hr />
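
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the `imblearn` implementation: the helper `smote_like_sample`, its default `k=2`, and the toy `minority` points are all hypothetical.

```python
import numpy as np

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a random minority sample and
    one of its k nearest minority neighbours."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(len(minority))
    # Euclidean distances from the chosen sample to every minority sample
    dists = np.linalg.norm(minority - minority[i], axis=1)
    # Position 0 of the sort is the sample itself (assuming no duplicate rows)
    neighbours = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return minority[i] + gap * (minority[j] - minority[i])

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]])
rng = np.random.default_rng(0)
synthetic = np.array([smote_like_sample(minority, rng=rng) for _ in range(5)])
print(synthetic)
```

Each synthetic point lies on the line segment between two existing minority points, so it is a new row rather than a copy of an existing one.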

In [None]:
data.plot(kind= 'kde' , subplots=True, layout=(3,3), sharex=False, sharey=False, figsize=(15,10))
plt.show()

Most of our columns look approximately normally distributed.

In [None]:
# Age vs BloodPressure with hue = Outcome
plt.figure(figsize=(12,8))
ax = sns.scatterplot(x="BloodPressure", y="Age", alpha=0.4, data=data[data['Outcome'] == 0])
sns.scatterplot(x="BloodPressure", y="Age", alpha=1, data=data[data['Outcome'] == 1], ax=ax)
plt.show()

In [None]:
# BloodPressure vs BMI with hue = Outcome
plt.figure(figsize=(12,8))
ax = sns.scatterplot(y="BMI", x="BloodPressure", alpha=0.4, data=data[data['Outcome'] == 0])
sns.scatterplot(y="BMI", x="BloodPressure", alpha=1, data=data[data['Outcome'] == 1], ax=ax)
plt.show()

In [None]:
# BloodPressure vs Glucose with hue = Outcome
plt.figure(figsize=(12,8))
ax = sns.scatterplot(y="Glucose", x="BloodPressure", alpha=0.4, color="blue", label="0", data=data[data['Outcome'] == 0])
sns.scatterplot(x="BloodPressure", y="Glucose", alpha=1, color="red", label="1", data=data[data['Outcome'] == 1], ax=ax)
plt.show()

## Insights
<hr>
<h4> - A younger person with high blood pressure has a higher chance of testing positive for diabetes than an older person with high blood pressure.
<br><br>
- A person with a BMI above 35 has a higher chance of testing positive for diabetes, even with normal blood pressure.
<br><br>
- A person with a high glucose level and high blood pressure has a higher chance of testing positive for diabetes.
<br><br>
- A person with a high glucose level and a high BMI can also test positive for diabetes.
</h4>

<hr />
Now let's apply SMOTE for balancing our data.

Balancing data is important for better modeling results.
<hr />

In [None]:
# Splitting into features and value to be predicted
X = data.drop(columns=['Outcome'])
y = data['Outcome']
fig, ax = plt.subplots(1,2 ,figsize = (10,5))

sns.barplot(x=['0', '1'], y =[sum(y == 0), sum(y == 1)], ax = ax[0])
ax[0].set_title("Before Oversampling")
ax[0].set_xlabel('Outcome')

#Using SMOTE to balance the Data
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state = 2) 
X, y = sm.fit_resample(X, y) 

sns.barplot(x=['0', '1'], y =[sum(y == 0), sum(y == 1)], ax = ax[1])
ax[1].set_title("After Oversampling")
ax[1].set_xlabel('Outcome')

plt.tight_layout()
plt.show()

# 4. Modeling 

Let's first split the data into training and test sets using scikit-learn's `train_test_split`.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=40)


X_train.shape, X_test.shape, y_train.shape, y_test.shape

We will use the cross validation technique for testing different models on our data.


<h4>What is Cross-validation?</h4>
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.
The procedure has a single parameter, k, the number of groups that the data sample is split into. Each group is held out once for evaluation while the model is trained on the remaining k - 1 groups. As such, the procedure is often called k-fold cross-validation.
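
The splitting step of k-fold cross-validation can be sketched as follows. This is a minimal NumPy illustration with consecutive, unshuffled folds; the helper `k_fold_indices` is hypothetical and is not the scikit-learn `KFold` class used below.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k consecutive folds."""
    folds = np.array_split(np.arange(n_samples), k)
    for i, test_idx in enumerate(folds):
        # Train on every fold except the one currently held out
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Every sample ends up in the held-out fold exactly once, so the k scores together cover the whole data sample.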

In [None]:
from sklearn import model_selection
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('RF', RandomForestClassifier()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('DT', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
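for name, model in models:
    # Shuffle before splitting so the folds do not follow the row order of the resampled data
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=40)
    # Standardise first, then fit the model currently being evaluated
    pipeline = make_pipeline(StandardScaler(), model)
    # cross_val_score fits a fresh copy of the pipeline on every fold
    cv_results = model_selection.cross_val_score(pipeline, X, y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)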
    
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

Here I am using the RandomForest classifier as the base model for the modeling described below.

I am using scikit-learn's **pipeline** feature so that the data is standardised first and only then passed to the algorithm.
<hr />
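
To make the standardising step concrete, the sketch below shows by hand roughly what the scaler in such a pipeline does: the mean and standard deviation are learned from the training rows only and then reused on the test rows. The helpers `fit_standardizer` and `standardize` and the toy arrays are hypothetical; this is an illustration, not scikit-learn's code.

```python
import numpy as np

def fit_standardizer(train):
    """Learn per-column mean and standard deviation from the training data."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return mean, std

def standardize(values, mean, std):
    """Shift and scale each column with previously learned statistics."""
    return (values - mean) / std

# Hypothetical training and test rows with two features
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Wrapping the scaler and the model in one pipeline keeps these two steps together, so the test rows never influence the learned statistics.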


In [None]:
# plot_confusion_matrix is not imported: it is unused here and was removed in scikit-learn 1.2
from sklearn.metrics import accuracy_score, classification_report

pipeline = make_pipeline(StandardScaler(), RandomForestClassifier())
pipeline.fit(X_train, y_train)
prediction = pipeline.predict(X_test)

print(f"Accuracy Score : {round(accuracy_score(y_test, prediction) * 100, 2)}%")

We got an accuracy score above 80%. That's great.

Now let's check for the model report.

In [None]:
print(classification_report(y_test, prediction))

# Thanks for your time :)
