# Diabetes Prediction
# Pre-Exploration Questions
This notebook is inspired by [Serhat Yazıcıoğlu's notebook](https://www.kaggle.com/serhatyzc/diabetes-prediction-with-cart).
- **What is Diabetes?**
    
    Diabetes is a metabolic disease that causes high blood sugar. The hormone insulin moves sugar from the blood into your cells to be stored or used for energy. With diabetes, your body either doesn't make enough insulin or can't effectively use the insulin it does make.
    
## Where is this data from?
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

## What are the features/columns in the dataset and what do they mean?
|Column | Description| Categorised |
| --- | --- | --- |
|**Pregnancies**|Number of times the patient has been pregnant|$\begin{aligned}
\text{normal: 0-4}\\
\text{overpregnancies: >4}
\end{aligned}$|
|**Glucose**|Plasma glucose concentration at 2 hours in an oral glucose tolerance test|$\begin{aligned}
\text{low: <70}\\
\text{normal: 70-99}\\
\text{high: 99-126}\\
\text{very_high: >126}\\
\end{aligned}$|
|**BloodPressure**|Diastolic blood pressure (mm Hg)|$\begin{aligned}
\text{normal: <80}\\
\text{risky: >80}
\end{aligned}$|
|**SkinThickness**|Triceps skinfold thickness (mm)|$\begin{aligned}
\text{normal: <30}\\
\text{highfat: >= 70}\\
\end{aligned}$|
|**Insulin**|2-hour serum insulin (μU/ml)| - |
|**BMI**|Body mass index (weight in Kg/$\text{(height in m)}^2$)|$\begin{aligned}
\text{underweight: <18.5}\\
\text{normal: 18.5-25}\\
\text{overweight: 25-30}\\
\text{obese: >30}
\end{aligned}$|
|**DiabetesPedigreeFunction**|A function that scores the likelihood of diabetes based on family history| |
|**Age**| Age in years | |
|**Outcome**| Class variable (0 or 1); 268 of 768 are 1, the others are 0| |
    

# What needs to be done?
Build a machine learning model that accurately predicts whether or not the patients in the dataset have diabetes.



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
import seaborn as sbs

sbs.set_theme()

# %matplotlib qt
%matplotlib inline

In [None]:
df = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")
df.head()

In [None]:
df.info()

In [None]:
df.describe()

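We can see that several columns (Glucose, BloodPressure, SkinThickness, Insulin, BMI) have a minimum of 0, which is not a plausible measurement. These zeros most likely stand for missing values, so we replace them with NaN. A 0 in Pregnancies, by contrast, is a valid value and is left untouched.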
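
Before replacing anything, it can help to count how many zeros each affected column holds. The following is a minimal sketch on a made-up toy frame (the `toy` data and the `suspect_cols` name are illustrative assumptions, not part of the dataset); the same expression should work on `df` with the full list of five columns.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 5, 0],
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Columns where 0 cannot be a real measurement (hypothetical subset).
suspect_cols = ["Glucose", "BMI"]

# Compare to 0 element-wise, then sum the True values per column.
zero_counts = (toy[suspect_cols] == 0).sum()
print(zero_counts)
```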

In [None]:
# Treat physiologically impossible zeros as missing values (np.NaN was removed in NumPy 2.0; use np.nan)
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)

In [None]:
_ = df.hist(bins=50, figsize=(20, 15))

Let's see how many diabetic and healthy patients we have.

In [None]:
Healthy =  df[ df['Outcome'] == 0 ]
Diabetic = df[ df['Outcome'] == 1 ]

h_diab = pd.Series({'Healthy': Healthy.shape[0],
                    'Diabetic': Diabetic.shape[0]})
h_diab.plot.bar(alpha=0.7)

In [None]:
h_diab.plot.pie(startangle=90, 
                explode=[0, 0.1],
                autopct='%1.1f%%',
                colors=['C3', 'C4'])
plt.title('Share of healthy vs. diabetic patients')
plt.ylabel('')
_ = plt.axis('equal')

### Taking a look at null values

In [None]:
df.isnull().sum()

In [None]:
df.pivot_table(index=['Outcome'] )

We can replace these null values with the median of the same column, computed separately for each outcome class.

<mark>If the distribution is not symmetrical, it makes sense to use the median instead of the mean, because the median represents the series better: it is less affected by outliers.</mark>
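
As a small illustration of that point, the sketch below compares the mean and the median on a made-up series with one extreme value (the numbers are assumptions chosen for the example, not taken from the dataset).

```python
import numpy as np

# Four ordinary values and one extreme outlier (made-up numbers).
values = np.array([20.0, 22.0, 23.0, 25.0, 300.0])

mean_value = values.mean()        # pulled strongly towards the outlier
median_value = np.median(values)  # middle value, unaffected by the outlier

print(mean_value, median_value)   # prints 78.0 23.0
```

The single outlier moves the mean far away from the bulk of the data, while the median stays at a typical value.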

In [None]:
def replace_null_values(df):
    for col in df.columns:
        df.loc[(df['Outcome']==0) & (df[col].isnull()), col] = df[df['Outcome'] == 0][col].median()
        df.loc[(df['Outcome']==1) & (df[col].isnull()), col] = df[ df['Outcome'] == 1][col].median()
    print(df.isnull().sum())
    
replace_null_values(df)
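
The same per-outcome median imputation can also be written with `groupby(...).transform(...)`. This is only a sketch on a made-up toy frame (the `toy` data is an assumption for illustration), not the function used above.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing Insulin value per outcome class (made-up values).
toy = pd.DataFrame({
    "Outcome": [0, 0, 0, 1, 1, 1],
    "Insulin": [90.0, np.nan, 110.0, 200.0, 180.0, np.nan],
})

# Median of each Outcome group, broadcast back to the original row order.
# NaN values are skipped when the median is computed.
group_median = toy.groupby("Outcome")["Insulin"].transform("median")

# Fill only the missing entries with the median of their own group.
toy["Insulin"] = toy["Insulin"].fillna(group_median)
print(toy)
```

Here the missing value in the `Outcome == 0` group becomes 100.0 and the one in the `Outcome == 1` group becomes 190.0.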

### Feature Engineering

In [None]:
def create_new_bmi(df):
    new_cat = "NEW_BMI_CAT"
    # Bins follow the table above; lower bounds are inclusive so no value falls between bins
    df.loc[df['BMI'] < 18.5, new_cat] = "underweight"
    df.loc[(df['BMI'] >= 18.5) & (df['BMI'] < 25), new_cat] = "normal"
    df.loc[(df['BMI'] >= 25) & (df['BMI'] < 30), new_cat] = "overweight"
    df.loc[df['BMI'] >= 30, new_cat] = "obese"
    df.drop('BMI', axis=1, inplace=True)
    df[new_cat] = df[new_cat].astype('category')

def create_new_glucose(df): 
    new_cat = "NEW_GLUCOSE_CAT"
    # Lower bounds are inclusive so boundary values (70, 99, 126) are not left unlabelled
    df.loc[df['Glucose'] < 70, new_cat] = "low"
    df.loc[(df['Glucose'] >= 70) & (df['Glucose'] < 99), new_cat] = "normal"
    df.loc[(df['Glucose'] >= 99) & (df['Glucose'] < 126), new_cat] = "high"
    df.loc[df['Glucose'] >= 126, new_cat] = "very_high"
    df[new_cat] = df[new_cat].astype('category')


def create_new_skinthickness(df):
    new_cat = "NEW_SKIN_THICKNESS"
    df.loc[df['SkinThickness'] < 30, new_cat] = "normal"
    # Note: values in [30, 70) match neither rule and stay NaN in this column
    df.loc[df['SkinThickness'] >= 70, new_cat] = "highfat"
    df[new_cat] = df[new_cat].astype('category')

def create_new_pregnancies(df):
    new_cat = "NEW_PREGNANCIES"
    df.loc[df['Pregnancies'] == 0, new_cat] = "no_pregnancies"
    # Parentheses are required: & binds tighter than <=, so without them every row matched
    df.loc[(df['Pregnancies'] > 0) & (df['Pregnancies'] <= 4), new_cat] = "std_pregnancies"
    df.loc[(df['Pregnancies'] > 4), new_cat] = "over_pregnancies"
    df[new_cat] = df[new_cat].astype('category')

def create_circulation_level(df): 
    new_cat = "NEW_CIRCULATION_LEVEL"
    # >= 30 (instead of > 30) so that SkinThickness == 30 is not left unlabelled
    df.loc[(df['SkinThickness'] < 30) & (df['BloodPressure'] < 80), new_cat] = "normal"
    df.loc[(df['SkinThickness'] >= 30) & (df['BloodPressure'] >= 80), new_cat] = "high_risk"
    df.loc[((df['SkinThickness'] < 30) & (df['BloodPressure'] >= 80)) | ((df['SkinThickness'] >= 30) & (df['BloodPressure'] < 80)), new_cat] = "medium_risk"
    df[new_cat] = df[new_cat].astype('category')
    df.drop('SkinThickness', axis=1, inplace=True)
    
def create_other_features(df):
    df['PRE_AGE_CAT'] = df['Age'] * df['Pregnancies']
    df['INSULIN_GLUCOSE_CAT'] = df['Insulin'] * df['Glucose']
    df.drop('Pregnancies', axis=1, inplace=True)
    df.drop('Glucose', axis=1, inplace=True)
    
create_new_bmi(df)
create_new_glucose(df)
create_new_pregnancies(df)
create_new_skinthickness(df)
create_circulation_level(df)
create_other_features(df)
df
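
Threshold-based categories like the ones above can also be expressed with `pd.cut`, which assigns every value to exactly one bin. The following is a hedged sketch on made-up BMI values; treating each lower bound as inclusive (so that 30 counts as "obese") is an assumption of this example.

```python
import numpy as np
import pandas as pd

# Made-up BMI values, including the boundary cases 18.5, 25 and 30.
bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 30.0, 45.2])

# right=False makes each bin closed on the left: [0, 18.5), [18.5, 25), [25, 30), [30, inf)
bmi_cat = pd.cut(
    bmi,
    bins=[0, 18.5, 25, 30, np.inf],
    labels=["underweight", "normal", "overweight", "obese"],
    right=False,
)
print(bmi_cat.tolist())
```

Because the bins are contiguous, no value can fall into a gap or match two categories, and the result already has a categorical dtype.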

### Label Encoding
Label encoding is applied only to the `Outcome` column.
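
For reference, here is a minimal sketch of what `LabelEncoder` does, using made-up labels: it maps the sorted distinct labels to the integers 0, 1, 2 and so on. A column that already holds only 0 and 1, like `Outcome`, should come back unchanged.

```python
from sklearn.preprocessing import LabelEncoder

# String labels are mapped to integers in sorted order of the distinct labels.
encoder = LabelEncoder()
encoded = encoder.fit_transform(["healthy", "diabetic", "healthy"])
print(list(encoder.classes_), list(encoded))

# A column that is already 0/1 is returned with the same values.
already_binary = LabelEncoder().fit_transform([0, 1, 1, 0])
print(list(already_binary))
```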

In [None]:
df['NEW_CIRCULATION_LEVEL'].dtype.name == 'category'

In [None]:
len(df['NEW_CIRCULATION_LEVEL'].unique())

In [None]:
label_encoder = preprocessing.LabelEncoder()
df['Outcome'] = label_encoder.fit_transform(df['Outcome'])

### One-Hot Encoding
We'll one-hot encode the categorical columns.
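
As a small illustration of one-hot encoding with `drop_first=True`, the sketch below encodes a made-up three-level column (the `toy` frame and the `level` column are assumptions for this example). One indicator column is dropped because it is implied by the others.

```python
import pandas as pd

# Made-up categorical column with three levels.
toy = pd.DataFrame({
    "level": pd.Categorical(["normal", "high_risk", "medium_risk", "normal"])
})

# drop_first=True drops the first category ("high_risk" in sorted order),
# leaving one indicator column for each remaining category.
encoded = pd.get_dummies(toy, columns=["level"], drop_first=True)
print(encoded.columns.tolist())
```

A row with all indicators off then stands for the dropped category.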

In [None]:
categ_cols = [col for col in df.columns if df[col].dtype.name == 'category']
print(categ_cols)

In [None]:
def one_hot_encoder(df, columns):
    # get_dummies returns a new DataFrame, so no explicit copy is needed
    return pd.get_dummies(df, columns=columns, drop_first=True)

result = one_hot_encoder(df, categ_cols)
result

# Model Training
First try with Logistic Regression.

In [None]:
from sklearn.model_selection import train_test_split

X = result.drop('Outcome', axis=1)
y = df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error

rmse  = lambda labels, predictions: np.sqrt(mean_squared_error(labels, predictions))

lg_model = LogisticRegression(max_iter=1000,C=0.01).fit(X_train, y_train)
lg_predictions = lg_model.predict(X_test)
rmse(y_test, lg_predictions)

Since the cost of a false negative (a diabetic patient predicted as healthy) is high, the evaluation should be **recall-centric**.
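
Recall is the share of actual positives that the model finds: $\text{recall} = \frac{TP}{TP + FN}$. The sketch below works this out on made-up labels and checks it against `recall_score` (the `y_true` and `y_pred` values are assumptions for illustration only).

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up ground truth and predictions: 4 diabetic (1) and 4 healthy (0) patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

manual_recall = tp / (tp + fn)  # 3 found out of 4 actual positives
print(manual_recall, recall_score(y_true, y_pred))  # prints 0.75 0.75
```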

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import recall_score
# plot_roc_curve was removed in scikit-learn 1.2 and is not used here, so it is not imported

print(accuracy_score(y_test, lg_predictions))
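# ROC AUC is computed from predicted probabilities rather than hard 0/1 predictions
print(roc_auc_score(y_test, lg_model.predict_proba(X_test)[:, 1]))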

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, lg_predictions))

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [400, 500, 600], 
              'max_leaf_nodes': [14, 15, 16]}
random_forest = RandomForestClassifier(n_jobs=-1)

grid_search_rf = GridSearchCV(random_forest, 
                              param_grid=param_grid, 
                              cv=3,
                              scoring='recall',
                              return_train_score=True)
grid_search_rf.fit(X_train, y_train)
grid_search_rf.best_params_
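
Besides `best_params_`, the fitted search also exposes `best_score_`, the mean cross-validated value of the chosen `scoring` metric. The sketch below shows this on synthetic data from `make_classification`; the data, the small grid and the fixed `random_state` are assumptions made to keep the example quick and repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (not the diabetes dataset).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Small grid, scored by recall with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_leaf_nodes": [4, 8]},
    cv=3,
    scoring="recall",
)
search.fit(X_toy, y_toy)

print(search.best_params_)           # the winning parameter combination
print(round(search.best_score_, 3))  # mean cross-validated recall for that combination
```

Note that `best_score_` is measured on the validation folds of the training data, so it is not the same as the test-set recall.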

In [None]:
print(classification_report(y_test, grid_search_rf.predict(X_test)))
print(recall_score(y_test, grid_search_rf.predict(X_test)))
print(roc_auc_score(y_test, grid_search_rf.predict_proba(X_test)[:, 1]))

In the original run this gave a reasonably acceptable model, with a ROC AUC of 92% and a recall of 81%. Exact figures vary between runs because the random forest is not seeded.