# Diabetes diagnosis

## Context

***This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.***

## Content

***The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.***

*We can learn from the data found on UCI Machine Learning Repository which contains data on female patients at least 21 years old of Pima Indian heritage.*

*We have 768 instances and the following 8 attributes:*

- Number of times pregnant (preg)
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test (plas)
- Diastolic blood pressure in mm Hg (pres)
- Triceps skin fold thickness in mm (skin)
- 2-Hour serum insulin in mu U/ml (insu)
- Body mass index measured as weight in kg/(height in m)^2 (mass)
- Diabetes pedigree function (pedi)
- Age in years (age)
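
The short names in parentheses differ from the column names that the code cells in this notebook use. As a reference, the correspondence can be written down as a small mapping. This is only a sketch: the dictionary name `SHORT_TO_COLUMN` is hypothetical, while the column names are the ones that appear in the code cells of this notebook.

```python
# Hypothetical helper: maps the abbreviated attribute names to the CSV column names.
SHORT_TO_COLUMN = {
    "preg": "Pregnancies",
    "plas": "Glucose",
    "pres": "BloodPressure",
    "skin": "SkinThickness",
    "insu": "Insulin",
    "mass": "BMI",
    "pedi": "DiabetesPedigreeFunction",
    "age": "Age",
}

# The target column, Outcome, is not one of the eight predictors.
predictors = list(SHORT_TO_COLUMN.values())
print(predictors)
```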

*A particularly interesting attribute used in the study was the Diabetes Pedigree Function, pedi. It provided some data on diabetes mellitus history in relatives and the genetic relationship of those relatives to the patient. This measure of genetic influence gave us an idea of the hereditary risk one might have with the onset of diabetes mellitus. Based on observations in the proceeding section, it is unclear how well this function predicts the onset of diabetes.*

## Importing libraries and dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


from IPython.display import Image,display_svg,SVG
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,RandomForestClassifier 
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import recall_score, accuracy_score
from sklearn import set_config
set_config(display='diagram')
import warnings
warnings.filterwarnings("ignore")

In [None]:
data = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv') 
data.sample(10)

In [None]:
data.shape

In [None]:
data.info()

## Distribution of data

In [None]:
for col in data.columns:
    plt.hist(data[col],edgecolor='black')

    plt.title(col+ " Distribution")
    plt.xlabel(col)
    plt.ylabel('frequency')

    plt.show()

In [None]:
data.describe()

***As we can see, there are many readings where glucose, skin thickness, insulin and BMI are 0. That is not physiologically possible, so these zeros most likely stand for missing measurements and act as outliers in the data. We handle them before training, because such values can distort the splits a decision tree learns.***
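
To see how widespread the zero readings are, one option is to count them per column. The sketch below uses a small made-up frame (`toy`) in place of `data`, so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for `data`; the values are made up for illustration only.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 168],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Compare every cell with 0 and sum the True values column by column.
zero_counts = (toy == 0).sum()
print(zero_counts)
```

Applied to the real frame, the same expression would be `(data == 0).sum()`.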

***The Pregnancies column also contains outliers: the maximum value is 17. Such high counts are possible but rare, so we treat values far above the typical range as outliers.***

***Almost all the columns are skewed. We should try to bring their distributions closer to normal.***
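
One common way to reduce right skew is a log transform. This notebook does not apply it; the sketch below only illustrates the idea on made-up values, comparing the sample skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values standing in for a column such as Insulin.
values = pd.Series([15, 20, 25, 30, 40, 60, 90, 150, 300, 600], dtype=float)

skew_before = values.skew()
# log1p(x) = log(1 + x); it is defined at 0 and compresses the long right tail.
skew_after = np.log1p(values).skew()

print(round(skew_before, 2), round(skew_after, 2))
```

A skewness closer to 0 indicates a more symmetric distribution. Whether a transform helps a tree-based model is a separate question, since trees split on thresholds and are unaffected by monotonic rescaling of a feature.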

In [None]:
# Finding correlation between the features
data.corr()

***The primary factors that play a major role in diabetes diagnosis are the number of pregnancies, blood glucose level, insulin, BMI, DiabetesPedigreeFunction and the age of the patient.***

***Blood pressure and skin thickness don't help much in prediction of diabetes. Hence, I would remove those features***
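
A quick way to back up this kind of selection is to rank the features by the absolute value of their correlation with `Outcome`. The sketch below does this on a made-up frame (`toy`); the values are illustrative and are not taken from the dataset. Note that Pearson correlation only captures linear association.

```python
import pandas as pd

# Toy stand-in for `data`; values are invented for illustration.
toy = pd.DataFrame({
    "Glucose": [85, 90, 100, 150, 160, 170],
    "BloodPressure": [70, 80, 72, 78, 70, 76],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# Correlation of every feature with the target, strongest first.
ranked = (
    toy.corr()["Outcome"]
    .drop("Outcome")
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```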

In [None]:
data.drop(['BloodPressure','SkinThickness'],axis=1,inplace=True)
data.sample(5)

In [None]:
# Outliers in Pregnancies column

sns.boxplot(x=data['Outcome'], y=data['Pregnancies']);

In [None]:
# Outliers in glucose levels
sns.boxplot(x=data['Outcome'], y=data['Glucose']);

In [None]:
# Outliers in Insulin levels

sns.boxplot(x=data['Outcome'], y=data['Insulin']);

In [None]:
# Outliers in BMI

sns.boxplot(x=data['Outcome'], y=data['BMI']);

In [None]:
# Outliers in DiabetesPedigreeFunction

sns.boxplot(x=data['Outcome'], y=data['DiabetesPedigreeFunction']);

In [None]:
# Outliers in Age

sns.boxplot(x=data['Outcome'], y=data['Age']);

***We need to deal with outliers in columns like Pregnancies, Glucose, Insulin, BMI, DiabetesPedigreeFunction and Age***

## Outlier Detection 

### Dealing with outliers in Pregnancies column

In [None]:
p25_preg = data['Pregnancies'].quantile(0.25)
p75_preg = data['Pregnancies'].quantile(0.75)
iqr_preg = p75_preg - p25_preg
# We deliberately use a tighter upper fence (0.5 * IQR instead of the usual 1.5 * IQR)
# so that unusually high pregnancy counts are treated as outliers.
upper_preg = p75_preg + 0.5 * iqr_preg
lower_preg = p25_preg - 1.5 * iqr_preg

In [None]:
data[data['Pregnancies']>upper_preg]

In [None]:
data[data['Pregnancies']<lower_preg]

***Since the data in the other columns of these outlier rows seem right, we won't trim the rows. Instead, we will cap the values.***
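
The fence calculation and the capping step follow the same pattern for every column, so they can be sketched as one small helper. The function name `iqr_fences` and the sample values are hypothetical; `k` is the IQR multiplier (1.5 is the usual choice, while the Pregnancies cell above uses 0.5 for the upper fence).

```python
import pandas as pd

def iqr_fences(series, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up pregnancy counts standing in for data['Pregnancies'].
preg = pd.Series([0, 1, 1, 2, 3, 3, 4, 5, 6, 17])

lower, upper = iqr_fences(preg)
# Capping keeps every row and only pulls extreme values back to the fence.
capped = preg.clip(upper=upper)
print(lower, upper, capped.max())
```

Capping with `Series.clip` keeps the number of rows unchanged, which is the reason for preferring it over trimming when the rest of the row looks valid.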

#### Capping the data in Pregnancies column

In [None]:
data['Pregnancies'] = np.where(data['Pregnancies']>upper_preg, upper_preg, data['Pregnancies'])

# Check if there are any outliers
data[data['Pregnancies']>upper_preg]

### Dealing with outliers in Glucose column

In [None]:
p25_glu = data['Glucose'].quantile(0.25)
p75_glu = data['Glucose'].quantile(0.75)
iqr_glu = p75_glu - p25_glu
upper_glu = p75_glu + 1.5 * iqr_glu
lower_glu = p25_glu - 1.5 * iqr_glu

In [None]:
data[data['Glucose']>upper_glu]

In [None]:
data[data['Glucose']<lower_glu]

***These rows are noisy/invalid data. We should trim them.***

In [None]:
data = data[data['Glucose']>lower_glu]
data.shape

### Dealing with outliers in Insulin column

In [None]:
p25_in = data['Insulin'].quantile(0.25)
p75_in = data['Insulin'].quantile(0.75)
iqr_in = p75_in - p25_in
upper_in = p75_in + 1.5 * iqr_in
lower_in = p25_in - 1.5 * iqr_in

In [None]:
print(data[data['Insulin']>upper_in].shape)
data[data['Insulin']>upper_in]

In [None]:
print(data[data['Insulin']<lower_in].shape)
data[data['Insulin']<lower_in]

***Since there are 33 outliers, removing them would cost too much data, so we cap them instead.***

In [None]:
data['Insulin'] = np.where(data['Insulin']>upper_in, upper_in, data['Insulin'])
print(data.shape)
# Check if there are any outliers
data[data['Insulin']>upper_in]

### Dealing with outliers in BMI column

In [None]:
upper_bmi = 55
lower_bmi = 15

In [None]:
print(data[data['BMI']>upper_bmi].shape)
data[data['BMI']>upper_bmi]

In [None]:
print(data[data['BMI']<lower_bmi].shape)
data[data['BMI']<lower_bmi]

***BMI has about 14 outliers, and we will remove them.***

In [None]:
data = data[data['BMI']>lower_bmi]
data.shape

In [None]:
data = data[data['BMI']<upper_bmi]
data.shape

### Dealing with outliers in DiabetesPedigreeFunction column

In [None]:
p25_dpf = data['DiabetesPedigreeFunction'].quantile(0.25)
p75_dpf = data['DiabetesPedigreeFunction'].quantile(0.75)
iqr_dpf = p75_dpf - p25_dpf
upper_dpf = p75_dpf + 1.5 * iqr_dpf
lower_dpf = p25_dpf - 1.5 * iqr_dpf

In [None]:
print(data[data['DiabetesPedigreeFunction']>upper_dpf].shape)
data[data['DiabetesPedigreeFunction']>upper_dpf]

In [None]:
print(data[data['DiabetesPedigreeFunction']<lower_dpf].shape)
data[data['DiabetesPedigreeFunction']<lower_dpf]

***Since there are 29 outliers, removing them would cost too much data, so we cap them instead.***

In [None]:
data['DiabetesPedigreeFunction'] = np.where(data['DiabetesPedigreeFunction']>upper_dpf, upper_dpf, data['DiabetesPedigreeFunction'])
print(data.shape)
# Check if there are any outliers
data[data['DiabetesPedigreeFunction']>upper_dpf]

### Dealing with outliers in Age column

In [None]:
p25_age = data['Age'].quantile(0.25)
p75_age = data['Age'].quantile(0.75)
iqr_age = p75_age - p25_age
upper_age = p75_age + 1.5 * iqr_age
lower_age = p25_age - 1.5 * iqr_age

In [None]:
print(data[data['Age']>upper_age].shape)
data[data['Age']>upper_age]

In [None]:
print(data[data['Age']<lower_age].shape)
data[data['Age']<lower_age]

In [None]:
data['Age'] = np.where(data['Age']>upper_age, upper_age, data['Age'])
print(data.shape)
# Check if there are any outliers
data[data['Age']>upper_age]

## Distribution of data post outlier detection 

In [None]:
for col in data.columns:
    plt.hist(data[col],edgecolor='black')

    plt.title(col+ " Distribution")
    plt.xlabel(col)
    plt.ylabel('frequency')

    plt.show()

***Change in stats of the data:***

In [None]:
data.describe()

## Exploratory Data Analysis

In [None]:
sns.scatterplot(x=data['Glucose'], y=data['Insulin'], hue=data['Outcome']);

***Many Insulin readings look wrong, probably because of measurement or recording errors: some patients have an insulin value of 0, while others have values of 300 or more.***


In [None]:
def murammat(df):
    # "Repair" implausible Insulin readings (0 or >= 300): re-estimate them with a
    # linear regression fitted on the rows whose readings look valid.
    noisy_mask = (df['Insulin'] == 0) | (df['Insulin'] >= 300)
    df_clean = df[~noisy_mask]
    df_noisy = df[noisy_mask].copy()  # copy so the assignment below is safe

    # Features are all remaining columns (note: this includes Outcome).
    feature_cols = [col for col in df.columns if col != 'Insulin']
    sc_X = StandardScaler()
    X = sc_X.fit_transform(df_clean[feature_cols].values)
    y = df_clean['Insulin'].values

    regressor = LinearRegression()
    regressor.fit(X, y)

    # Replace the noisy readings with the regression estimates.
    df_noisy['Insulin'] = regressor.predict(sc_X.transform(df_noisy[feature_cols].values))
    df_repaired = df_noisy[['Pregnancies', 'Glucose', 'Insulin', 'BMI',
                            'DiabetesPedigreeFunction', 'Age', 'Outcome']]

    return df_repaired


In [None]:
data_repaired=murammat(data)
data = data[data['Insulin'] > 0]
data = data[data['Insulin'] < 300]
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent.
data = pd.concat([data, data_repaired])
sns.scatterplot(x=data['Glucose'], y=data['Insulin'], hue=data['Outcome']);

## Distribution of data post EDA

In [None]:
for col in data.columns:
    plt.hist(data[col],edgecolor='black')

    plt.title(col+ " Distribution")
    plt.xlabel(col)
    plt.ylabel('frequency')

    plt.show()

In [None]:
data.shape

## Choice of algorithm

- Random Forest

***A Random Forest classifier is sensitive to class imbalance: when one class dominates, predictions are biased towards it. We therefore feed an equal number of diabetic and non-diabetic patients into the algorithm.***
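
The balancing described here is random undersampling of the majority class. A minimal sketch on a made-up frame (`toy`) is shown below; the `random_state` values are an assumption added for reproducibility and are not part of the original notebook.

```python
import pandas as pd

# Toy imbalanced frame: 6 non-diabetic (0) rows and 3 diabetic (1) rows.
toy = pd.DataFrame({
    "Glucose": [85, 90, 95, 100, 105, 110, 150, 160, 170],
    "Outcome": [0, 0, 0, 0, 0, 0, 1, 1, 1],
})

# The size of the smaller class decides how many rows to keep per class.
n_minority = toy["Outcome"].value_counts().min()

# Sample n_minority rows from each class, then shuffle the combined frame.
balanced = pd.concat(
    [group.sample(n=n_minority, random_state=0) for _, group in toy.groupby("Outcome")]
).sample(frac=1, random_state=0).reset_index(drop=True)

print(balanced["Outcome"].value_counts())
```

Undersampling discards majority-class rows, so it trades some information for a balanced training set.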

In [None]:
data

In [None]:
df=data[data.Outcome == 1]

In [None]:
data[data.Outcome == 0].shape

In [None]:
# Sample as many non-diabetic rows as there are diabetic rows (261 in the original run).
dataset=data[data.Outcome == 0].sample(len(df))
dataset.shape

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent.
dataset=pd.concat([dataset, df])
dataset.shape

In [None]:
dataset=dataset.sample(frac=1).reset_index(drop=True)
dataset

In [None]:
for col in dataset.columns:
    plt.hist(dataset[col],edgecolor='black')

    plt.title(col+ " Distribution")
    plt.xlabel(col)
    plt.ylabel('frequency')

    plt.show()

In [None]:
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1].values

## Building deployment ready, secure pipeline

In [None]:
rf1 = RandomForestClassifier(n_estimators=50, criterion='gini',max_depth=4)
diabetes_diagnosis = Pipeline([
    ('rf1',rf1)
])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [None]:
diabetes_diagnosis.fit(X_train,y_train)

In [None]:
y_pred=diabetes_diagnosis.predict(X_test)

In [None]:
y_pred

In [None]:
accuracy_score(y_test,y_pred)

In [None]:
recall_score(y_test,y_pred)

We chose **max_depth = 4** and **n_estimators = 50** because we want reasonably high accuracy, **but** since we are dealing with **diabetes prediction**, recall matters even more.
*We have to face the precision-recall tradeoff*: we cannot afford a false negative, that is, predicting a **diabetic** patient as non-diabetic.
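
One way to see the tradeoff is to vary the decision threshold applied to predicted probabilities. The sketch below uses made-up labels and probabilities; in this notebook the probabilities could plausibly come from `diabetes_diagnosis.predict_proba(X_test)[:, 1]`, but that step is an assumption and is not part of the original analysis. The helper name `recall_precision` is hypothetical.

```python
import numpy as np

# Made-up labels and predicted probabilities of the positive (diabetic) class.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
proba = np.array([0.9, 0.6, 0.45, 0.3, 0.35, 0.4, 0.32, 0.1])

def recall_precision(y_true, proba, threshold):
    """Recall and precision when predicting 1 for proba >= threshold."""
    y_pred = (proba >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

for t in (0.5, 0.3):
    r, p = recall_precision(y_true, proba, t)
    print(f"threshold={t}: recall={r:.2f}, precision={p:.2f}")
```

Lowering the threshold catches more diabetic patients (fewer false negatives) at the cost of more false positives, which is the direction this notebook argues for.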