# Diabetics prediction using logistic regression

#### Kaggle Problem : https://www.kaggle.com/kandij/diabetes-dataset

###  Overview : 
The data was collected and made available by “National Institute of Diabetes and Digestive and Kidney Diseases” as part of the Pima Indians Diabetes Database. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here belong to the Pima Indian heritage (subgroup of Native Americans), and are females of ages 21 and above.

### Task :
Our Task is to predict if the patient has diabetes or not. 

The aim is to make sure we dont miss out having a wrong prediction to a diabetic patient as non-diabetic although we are fine if we wrongly classify a non-diabetic patient as diabetic. 

Hence we should give more importance to Recall score and try finding the algorithm which gives a better Recall score along with good accuracy.

We will be trying various Classification Algorithms give a better prediction.

### Importing modules and dataset

In [None]:
# Importing necessary modules

import pandas as pd
import matplotlib as mp
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import os

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import f_classif
from sklearn.metrics import classification_report

from sklearn.linear_model import LogisticRegressionCV
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

In [None]:
def diabetes_data_import():
    """
    Function useful for importing a file and converting it to a dataframe
    """
    fileDir = os.path.dirname(os.path.realpath('__file__'))
    print(fileDir)
    relativeDir = '/kaggle/input/diabetes-dataset/diabetes2.csv'
    filename = os.path.join(fileDir,relativeDir)
    datafile = pd.read_csv(relativeDir)
    return datafile

In [None]:
# Importing the dataset file
diab_df = diabetes_data_import()
diab_df

### Understanding the Data 

In [None]:
diab_df.describe()

In [None]:
diab_df.isna().any()

In [None]:
# Defining a function for Horizonal bar plot 

def plot_counts_bar(data,column,fig_size=(9,4),col='blue',col_annot='grey',water_m=False,water_text='KedNat'):
    """
    Function plot_counts_bar plots a horizontal bar graph for Value counts for a given Dataframe Attribute.
    This is much useful in analysis phase in Datascience Projects where data counts for a particular attributes needs to be visualized.
    Mandatory inputs to this function. 
        1. 'data' where dataframe is given as input 
        2. 'column' where column name is given as input for which we need the value counts.
    Optional inputs to this function:
        1. 'fig_size' which represent the figure size for this plot. Default input is (16,9)
        2. 'col' which represents the color of the bar plot. Default input is 'blue'
        3. 'col_annot' which represents the color of annotations. Default input is 'grey'
        4. 'water_m' which represents if we need a watermark text. Default input is boolean as False
        5. 'water_text' which inputs a string variable used for watermark. Default is KedNat
    """
    
    # Figure Size 
    fig, ax = plt.subplots(figsize =fig_size) 

    # Defining the dataframe for value counts
    df = data[column].value_counts().to_frame()
    df.reset_index(inplace=True)
    df.set_axis([column ,'Counts'], axis=1, inplace=True)
    X_data = df[column]
    y_data = df['Counts']

    # Horizontal Bar Plot 
    ax.barh(X_data, y_data , color=col) 

    # Remove axes splines 
    for s in ['top', 'bottom', 'left', 'right']: 
        ax.spines[s].set_visible(False)

    # Remove x, y Ticks 
    ax.xaxis.set_ticks_position('none') 
    ax.yaxis.set_ticks_position('none') 

    # Add padding between axes and labels 
    ax.xaxis.set_tick_params(pad = 5) 
    ax.yaxis.set_tick_params(pad = 10) 

    # Show top values 
    ax.invert_yaxis()
    
    # Add annotation to bars 
    for i in ax.patches: 
        plt.text(i.get_width()+0.2, i.get_y()+0.5,str(round((i.get_width()), 2)),fontsize = 10, fontweight ='bold',color =col_annot) 

    # Add Plot Title 
    title = 'Counts of each '+column
    ax.set_title(title, loc ='left', fontweight="bold" , fontsize=16) 
    
    # Add Text watermark 
    if water_m == True:
        fig.text(0.9, 0.15, water_text, fontsize = 12, color ='grey', ha ='right', va ='bottom', alpha = 0.7) 

    ax.get_xaxis().set_visible(False)

    # Show Plot 
    plt.show() 


In [None]:
# Plotting the labels to check the distribution
plot_counts_bar(diab_df,'Outcome',(8,4),col='green',col_annot='blue')

### Splitting the Data into Train and Test datasets

As we seen above the ratio of splits with 1 and 0 Outcome is 1:2.

If we go by normal split we may have a chance to end up having most of the 0 Outcome in Test set. 

Hence we will be going with StratifiedShuffleSplit approach on Column Outcome

In [None]:
# Defining a function for Stratified split on a given column
def strat_shuffle_split(data,column,testsize=0.2):
    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    for train_index, test_index in split.split(data, data[column]):    
        strat_train_set = data.loc[train_index]    
        strat_test_set = data.loc[test_index]
        return(strat_train_set,strat_test_set)

In [None]:
# Splitting into train and test dataset on basis of Stratified split for label
train_set,test_set = strat_shuffle_split(diab_df,'Outcome')

In [None]:
# A check on Outcome in Test Dataset after split
plot_counts_bar(test_set,'Outcome',(8,4),col='Purple',col_annot='Blue')

### Data Wrangling

In [None]:
# Setting up train set 
diab_df = train_set.copy()

In [None]:
diab_num = diab_df[['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age']]
y = diab_df['Outcome']

In [None]:
# Understanding the data distribution for each independent feature w.r.t label Outcome
sns.pairplot(diab_df,hue='Outcome')

In [None]:
# Defining a function for Heatmap on a given data
def heat_map(data,fig_size=(8,8)):

    fig, ax = plt.subplots(figsize=fig_size)
    heatmap = sns.heatmap(data,
                          square = True,
                          linewidths = .2,
                          cmap = 'YlGnBu',
                          cbar_kws = {'shrink': 0.8,'ticks' : [-1, -.5, 0, 0.5, 1]},
                          vmin = -1,
                          vmax = 1,
                          annot = True,
                          annot_kws = {'size': 12})

    #add the column names as labels
    ax.set_yticklabels(data.columns, rotation = 0)
    ax.set_xticklabels(data.columns)

    sns.set_style({'xtick.bottom': True}, {'ytick.left': True})

In [None]:
# Correlation Matrix Heatmap for Featutes
heat_map(diab_num.corr())

<b>Observations made from Pairplot and Correlation Heatmap:</b>

<b>Which Model can be better ?</b>

For all plots there is no proper way to differentiate Outcome 1 and 0 . i.e overlapping is clear.

Data for Outcome cannot be classified using a single straight line.

Hence Logistic regression cannot be used here for classification.

KNN or tree based classifiers (Random Forest , XGBoost etc) can fit the data better.


<b>Data Insights if any ?</b>

We could see  lots of values as 0. Eg : For BloodPressure , SkinThickness , Insulin values at 0 form a straight line.

There are datafixes needed before starting with model selection.

Heatmap shows there is no good correlation within features available. Hence there is no Multicollinerity.

In [None]:
# Defining a Function that gives the stats for no of zeros and nulls in a given dataset

def get_stats(data,columns,check_zero = True):
    '''
    Function get_stats gives the insights of bad data like Nulls of zeros in a given dataframe
    Mandatory Inputs to this function:
    data    : Dataframe name
    columns : Columns in dataframe that needs to be checked 
    Optional inputs to this function:
    check_zero : True if no of zeros needs to be checked
    '''
    print('Count of records in dataframe '+str(data.shape[0])+'\n')
    for i in columns:
        is_na_c = 0
        zero_c = 0
        is_na = data[i].isna().any()
        if is_na == True:
            is_na_c = data[i].isna().count()
        if check_zero == True:
            zero_c = data[i][data[i]<=0].count()
        print('Column :'+str(i))
        print('   No of Nulls :'+str(is_na_c)+'   No of Zeros or less :'+str(zero_c))

In [None]:
# Getting stats on Test dataset
get_stats(diab_num,list(diab_num.columns))

<b>Insulin</b> has 290 nulls out of 614. Almost 50% of data is Null. This Attribute can be removed.

<b>Pregnancies</b> with 0 is a valid condition. Hence Pregnancies with 0 doesnt need a fix.

In [None]:
# Defining an Imputer to fix zero values. Same will be used for Train and Test dataset. 
# This will exclude Pregnancies and Insulin

from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=0, strategy='median')
imp.fit(diab_num[['Glucose','BloodPressure','SkinThickness','BMI','DiabetesPedigreeFunction','Age']])
imp.statistics_

In [None]:
# Function impute_transform can be used to fit a imputer on the given dataset

def impute_transform(data,imp):
    '''
    impute_transform used to fix the dataframe 'data' with given Imputer instance 'imp'.
    It returns a transformed data in form of dataframe 
    '''
    print('\nStats before Imputing :\n')
    get_stats(data,list(data.columns))
    imp_df = imp.transform(data)
    imp_df = pd.DataFrame(imp_df,columns=list(data.columns))
    print('\nStats after Imputing :\n')
    get_stats(imp_df,list(imp_df.columns))
    return imp_df

In [None]:
# Function data_transform is used to Merge Imputed dataframe with Pregnancies and return the cleaned dataframe

def data_transform(data):
    '''
    data_transform merges the transformed dataframe using impute_transform along with Feature 'Pregnancies'
    '''
    imputed = impute_transform(data[['Glucose','BloodPressure','SkinThickness','BMI','DiabetesPedigreeFunction','Age']],imp)
    df = pd.merge(data[['Pregnancies']],imputed,on=data.index)
    df.drop(columns=['key_0'],inplace=True)
    return df

In [None]:
# Creating Independent feature list in X where data is tranformed using imputer

X = data_transform(diab_num)

<b>Feature Selection</b>

From earlier analysis for Pair plot we had guessed that Logistic regression will not be a better model.

But lets try using Logistic regression so as to measure its performance w.r.t other models.

In [None]:
# Annova test results for features in X

anova_num = f_classif(X, y)
x=0
for i in X:
    print('F value for '+i+' is '+str(anova_num[0][x])+' and p-value is '+str(anova_num[1][x]))
    x+=1

In [None]:
# Selecting Best 3 features based on Annova test 

def k_best_select(X,y,classifier,k):
    '''
    'X' features for predict 'y' using classifier as 'classifier' with no of features to be selected as k
    This function returns the dataframe
    '''
    selector = SelectKBest(classifier, k = 3)
    selector.fit_transform(X, y)
    cols = selector.get_support(indices=True)
    X_logreg = X.iloc[:,cols]
    return X_logreg

In [None]:
X_logreg = k_best_select(X,y,f_classif,3)

In [None]:
# Defining a function to process Logistic regression Algorithm with splits and Standardization

def logistic_reg(X,y,cv=5,standardize=True):
    '''
    Features representing 'X' for labels 'y' for a Cross validation splits as 'cv'
    standardize = True uses StandardScaler before fitting the data
    '''
    if standardize== True:
        X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
    clf = LogisticRegressionCV(cv=cv, random_state=0).fit(X, y)
    yhat = clf.predict(X)
    print('Accuracy score using Logistic regression :'+str(clf.score(X, y)))
    print ('\nClassification Report given below :\n'+str(classification_report(y, yhat)))


In [None]:
logistic_reg(X_logreg,y,5,True)

<b>Above results we can see that Logistic Regression gives a recall score of 0.56 which is less </b>

Lets try other models to check their performances.

### Model Selection

We will be using below Algorithms to test which one suits best on diabetes dataset.

1. KNN Neighbors
2. Random Forest Classifier
3. XG Boost Classifier

Since for all the algorithms we need to find the best hyperparameter that fits the data we will used RandomizedSearchCV.

Below will be the process of finding an optimal model:
1. Give a wide set of parameters and find the best fit model for given parameters using RandomizedSearchCV.
2. Using Best fit model get the model's Recall and Accuracy on entire Training set.
3. Try the best fit model on Test Set directly (since we have not classified into a Validation set due to less data)
4. Get the Recall and Accuracy on Test set.
5. Compare the Train and Test dataset Recall and Accuracy scores.
6. Point 5 will give us idea if the Train model was overfit or underfit.
7. Modify the parameter set again (either reducing or expanding).
8. Repeat steps 1 to 6 until we find a model that has low variance on test set and less bias on train set.

In [None]:
# Defining a function which helps in finding the best fit parameters for a given model using RandomizedSearchCV

def best_fit_search(X,y,estimator,param,n_iter,cv=5,scoring='accuracy',return_model = False):
    '''
   This function uses RandomizedSearchCV to search the optimal fit for given set of parameters 'param'.
   'X' and 'Y' are Features and Labels for a given algorithm 'estimator' for a RandomizedSearchCV that runs for iterations 'n_iter'.
    No of splits is defined using 'cv' and 'scoring' defines scoring pattern.
    'return_model' if True then the functions returns the bestfit model.
    '''
    search = RandomizedSearchCV(estimator=estimator, param_distributions=param, n_iter=n_iter, n_jobs=-1, cv=cv, random_state=42,scoring=scoring)
    result = search.fit(X,y)
    #print('Best parameters for fit : '+str(result.best_params_)+'\n')

    if return_model == False:
        print('Best parameters for fit : '+str(result.best_params_)+'\n')
        print('Best score for fit :'+str(result.best_score_)+'\n')
        print('Best Estimator :'+str(result.best_estimator_)+'\n')
    model = result.best_estimator_
    model.fit(X,y)
    yhat = model.predict(X)
    print('Classification Report \n'+str(classification_report(y, yhat)))
    if return_model == True:
        return model

### K Nearest Neighbors :

In [None]:
# Finding Best fit for K-Nearest Neighbors
params = {'n_neighbors' : list(range(2,20))}
best_fit_search(X,y,KNeighborsClassifier(),params,18,cv=5,scoring='recall')

<b>KNN of 3 is the best fit as per cross validations on training set.</b>

Since we have very less data rows creating an extra validation set to test in validation set is not possible. 

Hence we will test the same in test set directly to get Test set recall and accuracy.

Although in this notebook it comes as a code later I had tested it with different parameter sets on test set so that we can find an optimal parameter fit.

Below are the results

In [None]:
data=[[3,0.71 , 0.84,0.63 , 0.73]]
pd.DataFrame(data,columns=['K-Values','Train_Recall','Train_Accuracy','Test_Recall','Test_Accuracy'])

<b>KNN model also performs well on Test set with recall of 0.63 and accuracy of 0.73</b>

### Random Forest Classifier :


In [None]:
params = {'max_features' : [2,3,4,6] , 'max_depth' : [2,3,4,5,6,7,8] ,'n_estimators': [100]}
best_fit_search(X,y,RandomForestClassifier(),params,28,cv=5,scoring='recall')

Random Forest model with given set of parameters on training data tends to give a recall of 0.92 and accuracy of 0.96.

But looking at the parameters of Best fit it seems to be overfit given that it uses depth of 8.

To test if it really overfits we need to test it on Validation dataset. 

But since we have very less data rows creating an extra validation set is not possible. 

Hence we will test the same in test set directly to get Test set recall and accuracy.

Although in this notebook it comes as a code later I had tested it with different parameter sets on test set so that we can find an optimal parameter fit.

Below are the results

In [None]:
data = [[100,3,3,0.53 , 0.79,0.48 , 0.73],
        [100,3,8,0.92 , 0.96,0.59 , 0.75],
        [100,4,6,0.78 , 0.89,0.56 , 0.74],
        [100,4,5,0.74 , 0.86,0.57 , 0.74],
        [100,3,4,0.64 , 0.83,0.50 , 0.73],
        [100,4,4,0.64 , 0.82,0.56 , 0.75]]
pd.DataFrame(data,columns=['n_estimators','min_child_weight','max_depth','Train_Recall','Train_Accuracy','Test_Recall','Test_Accuracy']).sort_values(by =['Test_Recall','Test_Accuracy'],ascending=False)

As seen above highest Recall is 0.59 on test set which is less than KNN model 0.63.

Random Forest seems to overfit the training set a lot when min_child_weight and max_depth depth increases.

<b>Best fit model here is on Index 5 where min_child_weight and max_depth is 4 (which is less than index no 1 ,3) and also gives a test recall of 0.56 and accuracy of 0.75

But Its seen that KNN performs better than best fit Random Forest model
</b>

### XGBoost Classifier :

In [None]:
params = {
'max_depth' : [2,3,4],
'min_child_weight' : [1,2,3,4,5],
'n_estimators' : [100,200,300]
}
best_fit_search(X,y,XGBClassifier(),params,40,cv=5,scoring='recall')

Given set of parameters looks an Overfit with 99% Recall.

Hence after Testing the model in Test dataset and with different parameter combincations below is the result

In [None]:
data = [[100,3,3,0.83 , 0.96,0.63 , 0.75],
        [100,3,2,0.81 , 0.90,0.65 , 0.77],
        [100,1,2,0.85 , 0.91,0.63 , 0.77],
        [200,1,3,1,1,0.61 , 0.75],
        [200,5,3,0.95 , 0.98 ,0.65 , 0.75],
        [100,5,3,0.90 , 0.94,0.65,0.77],
        [200,5,4,0.99,1,0.63 ,0.72],
        [300,5,3,0.99 , 1.00,0.67,0.76]]
pd.DataFrame(data,columns=['n_estimators','min_child_weight','max_depth','Train_Recall','Train_Accuracy','Test_Recall','Test_Accuracy']).sort_values(by =['Test_Recall','Test_Accuracy'],ascending=False)

<b>Best Recall on Test set for XGBoost is 0.67 with accuracy of 0.76 which has max depth of 3 and min_child_weight of 5

But Index 1 is also good fit as the min_child_weight reduces to 3 and max_depth reduces to 2 with only 100 estimators keeping the the model less complex but giving good Test recall of 0.65 and accuracy of 0.77 on Test Data
    
Results from XGBoost are slightly better than KNN and this model can be used over Random Forest or KNN. </b> 

In [None]:
# Lets build the model using XGBoost best test Recall 

params = {'max_depth' : [2],'min_child_weight' : [3],'n_estimators' : [100]}
model = best_fit_search(X,y,XGBClassifier(),params,1,cv=5,scoring='recall',return_model=True)

### Testing the data on Test dataset

In [None]:
# Creating a copy of Test dataset
diab_df = test_set.copy()

In [None]:
diab_num = diab_df[['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age']]
y_test = diab_df['Outcome']

In [None]:
# Tranforming the Testdatset  features
X_test = data_transform(diab_num)

In [None]:
# Predicting the Outcomes
yhat_test = model.predict(X_test)

### Classification report for XGBoost on Test dataset

In [None]:
print('\n'+str(classification_report(y_test, yhat_test)))

### Conclusion:

#### Both KNN and XGBoostClassifier can be used for predicting if the patient is diabetic of not. Difference is minimal.

#### Logistic Regression is out of scope for this dataset.

#### Random Forest also doesnt provide good recall score and takes longer time for training the model.

#### In this we successfully implemented XGBoost as final model due to its edge over KNN on recall and accuracy.

#### It gives a Recall score of 0.65 and Accuracy of 0.77.