# Pima Indians Diabetes dataset from Kaggle


### Data & Objective

In this Kernel I present a short analysis of the "Pima Indians Diabetes Database", provided by UCI Machine Learning on Kaggle and originally from the National Institute of Diabetes and Digestive and Kidney Diseases. You can download it from [here](https://www.kaggle.com/uciml/pima-indians-diabetes-database/data). All patients in this dataset are females of Pima Indian heritage, at least 21 years old.

The objective is to predict the onset of diabetes based on diagnostic measures present in the dataset provided.

This Kernel is divided into 3 parts:

### 1. First Insights
  Get to know the data.
  - How many samples and features do we have?
  - What type of features do we have? 
  - How are they distributed?
  - Do we have null values?

### 2. Model creation & Validation
   - Before creating a model, we first need to separate the data into features and labels, then check whether the features need any pre-processing. For this dataset, for example, scaling could be useful.
   - Divide the dataset into Train/Test. The Test set will not be used for model creation, so keep it aside! The Train set will be used for training and validation using k-fold cross-validation.
   - I will use two classifiers: 
     - **Random Forest**
     - **Logistic Regression**  

### 3. Test final model with unseen data
Once the final model has been chosen in step **2.**, we will use that same model to predict the labels of the Test set we kept aside. This lets us evaluate whether the model is overfit or can also get good results on unseen data.





In [62]:
import pandas as pd
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation was removed from scikit-learn; model_selection replaces it
from sklearn.model_selection import cross_val_score, cross_val_predict, KFold
from sklearn.metrics import classification_report, confusion_matrix

%matplotlib inline

### 1. First Insights

In [63]:
df = pd.read_csv("../input/diabetes.csv")
df.head()

In [64]:
df.shape

In [65]:
df.describe()

In [66]:
df.info()
df.columns
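To answer the "do we have null values?" question explicitly, a count per column is easy to read. The sketch below uses a tiny made-up frame rather than the real file, and the column names `Glucose` and `BMI` are assumptions about the CSV. It also counts zeros, on the assumption that a zero in a physiological measurement may be a placeholder for a missing value, which is worth verifying against `df.describe()`.

```python
import pandas as pd

# Tiny hypothetical frame standing in for diabetes.csv (values are made up).
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "Outcome": [1, 0, 1, 0],
})

# Explicit NaN count per column.
null_counts = toy.isnull().sum()

# Zeros in measurement columns (assumed here to possibly encode "missing").
zero_counts = (toy[["Glucose", "BMI"]] == 0).sum()

print(null_counts)
print(zero_counts)
```

On the real data the same two lines would be applied to `df` instead of `toy`.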

In [67]:
Counter(df.Outcome)

In [68]:
sns.countplot(x='Outcome',data=df)

We have quite an **IMBALANCED DATASET**! Thus, it is really important to check not only **accuracy** but also **sensitivity** (the fraction of positive samples classified correctly) and **specificity** (the fraction of negative samples classified correctly). In this case, as the negative class is almost double the size of the positive class, we are prone to get high specificity but low sensitivity. Keep this concern in mind when evaluating the model.
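As a small illustration of why accuracy alone can mislead here, the following sketch computes the three metrics from a confusion matrix. The labels and predictions are made up for the example and do not come from the dataset.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 5 negatives, 3 positives.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
# Made-up predictions that miss two of the three positives.
y_pred = [0, 0, 1, 0, 1, 0, 0, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # recall of the positive class
specificity = tn / (tn + fp)   # recall of the negative class
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(accuracy, sensitivity, specificity)
```

In this toy case the specificity is 0.8 while the sensitivity is only one third, even though most samples are classified correctly.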

In [69]:
df_1 = df[df.Outcome == 1]
df_0 = df[df.Outcome == 0]
columns = df.columns[:-1]

plt.subplots(figsize=(16,10))
# one subplot per feature: blue = Outcome 0, red = Outcome 1
for j, col in enumerate(columns):
    plt.subplot(3, 3, j + 1)
    plt.subplots_adjust(wspace=0.5, hspace=0.5)
    df_0[col].hist(bins=20, color='b', edgecolor='black')
    df_1[col].hist(bins=20, color='r', edgecolor='black')
    plt.title(col)

### 2. Model creation

In [70]:
# get features and labels
X = df.iloc[:,:-1]
labels= df.iloc[:,-1]

In [71]:
# Standardize features
X = StandardScaler().fit_transform(X)
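Note that fitting the scaler on the full dataset, as in the cell above, lets test-set statistics influence the scaled training features. One common alternative, sketched here on synthetic data, is to wrap the scaler and the classifier in a pipeline so the scaler is fitted on the training split only. The data, the variable names and the choice of `make_pipeline` are assumptions for illustration, not the approach used in this Kernel.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the diabetes dataset).
rng = np.random.RandomState(0)
X_demo = rng.normal(loc=50, scale=10, size=(200, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=5, size=200) > 50).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# The pipeline fits the scaler on X_tr only, then reuses those
# statistics when transforming X_te inside predict()/score().
pipe = make_pipeline(StandardScaler(), LogisticRegression(class_weight="balanced"))
pipe.fit(X_tr, y_tr)

scaler = pipe.named_steps["standardscaler"]
print(scaler.mean_)             # means learned from the training split
print(pipe.score(X_te, y_te))   # accuracy on the held-out split
```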

In [72]:
# Divide Data into train and test set  (test set will only be used in section 3.)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=0, stratify=labels)

In [73]:
# reset y_train index
y_train= y_train.reset_index(drop=True)

To help our models handle the class imbalance, we set the input parameter *class_weight="balanced"*. The **"balanced"** mode automatically weights the classes inversely proportional to how frequently they appear in the data.
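To make the weighting concrete, the sketch below reproduces the "balanced" weights on a made-up label vector using `n_samples / (n_classes * np.bincount(y))`, and compares the result with scikit-learn's `compute_class_weight` helper. The label vector is illustrative only.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: the negative class is twice as frequent as the positive one.
y_demo = np.array([0] * 6 + [1] * 3)
classes = np.array([0, 1])

# Weights as computed by scikit-learn for class_weight="balanced".
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# The same weights computed by hand.
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

print(weights)
print(manual)
```

The rarer class receives the larger weight, so its misclassifications cost more during training.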

In [74]:
# Random Forest

RF_model = RandomForestClassifier(n_estimators=600, random_state=123456, class_weight="balanced")

acc=[]
sen=[]
spe=[]
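# random_state only has an effect with shuffle=True (recent scikit-learn versions
# raise an error otherwise), so it is omitted; the folds are the same consecutive blocks
kf = KFold(n_splits=5)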

for train_index, test_index in kf.split(X_train):
    Features_train, Features_test = X_train[train_index], X_train[test_index]
    Labels_train, Labels_test = y_train[train_index], y_train[test_index]

    RF_model.fit(Features_train, Labels_train)
    # for binary labels, ravel() returns tn, fp, fn, tp in that order
    tn, fp, fn, tp = confusion_matrix(Labels_test, RF_model.predict(Features_test)).ravel()
    sensitivity = tp/(tp+fn)
    specificity  = tn/(tn+fp)
    accuracy = (tp+tn)/(tp+fp+tn+fn)
    acc.append(accuracy)
    sen.append(sensitivity)
    spe.append(specificity)
    print(accuracy, sensitivity, specificity)

global_acc = np.mean(acc)
acc_std = np.std(acc)
global_sen = np.mean(sen)
sen_std = np.std(sen)
global_spe = np.mean(spe)
spe_std = np.std(spe)

print("_________________________________")
print('Accuracy:', global_acc, "+/-", acc_std)
print('Sensitivity:', global_sen, "+/-", sen_std)
print('Specificity:', global_spe, "+/-", spe_std)

Random Forest still can't get a very good sensitivity... Let's try using Logistic Regression now.

In [75]:
# Logistic Regression

C_param_range = [0.001,0.01,0.1,1,10,100]

for i in C_param_range:
    LR_model = LogisticRegression(random_state=0, C=i, class_weight='balanced')
    print("\n C= ", i)
    
    acc=[]
    sen=[]
    spe=[]
    # random_state is omitted because it only has an effect with shuffle=True
    kf = KFold(n_splits=5)

    for train_index, test_index in kf.split(X_train):
        Features_train, Features_test = X_train[train_index], X_train[test_index]
        Labels_train, Labels_test = y_train[train_index], y_train[test_index]

        LR_model.fit(Features_train, Labels_train)
        # for binary labels, ravel() returns tn, fp, fn, tp in that order
        tn, fp, fn, tp = confusion_matrix(Labels_test, LR_model.predict(Features_test)).ravel()
        sensitivity = tp/(tp+fn)
        specificity  = tn/(tn+fp)
        accuracy = (tp+tn)/(tp+fp+tn+fn)
        acc.append(accuracy)
        sen.append(sensitivity)
        spe.append(specificity)
        
        print(accuracy, sensitivity, specificity)
      

    global_acc = np.mean(acc)
    acc_std = np.std(acc)
    global_sen = np.mean(sen)
    sen_std = np.std(sen)
    global_spe = np.mean(spe)
    spe_std = np.std(spe)

    print("_________________________________")
    print('Accuracy:', global_acc, "+/-", acc_std)
    print('Sensitivity:', global_sen, "+/-", sen_std)
    print('Specificity:', global_spe, "+/-", spe_std, "\n")

Much better results for sensitivity this time! Let's use Logistic Regression as the final model and see how it performs with the unseen data.
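The manual k-fold loops above could also be written more compactly with `cross_validate` and custom scorers. The sketch below is one possible version on synthetic data. It swaps in `StratifiedKFold` with shuffling (instead of the plain `KFold` used above) so that each fold keeps the class ratio, and the value `C=1` is an arbitrary choice for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data standing in for X_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.65, 0.35], random_state=0
)

# Sensitivity is the recall of class 1, specificity the recall of class 0.
scoring = {
    "accuracy": "accuracy",
    "sensitivity": make_scorer(recall_score, pos_label=1),
    "specificity": make_scorer(recall_score, pos_label=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
model = LogisticRegression(C=1, class_weight="balanced")
results = cross_validate(model, X_demo, y_demo, cv=cv, scoring=scoring)

# Mean and standard deviation across the 5 folds, per metric.
for name in scoring:
    scores = results["test_" + name]
    print(name, scores.mean(), "+/-", scores.std())
```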

### 3. Test final model with unseen data

In [79]:
# check how the RF model would perform on the test set
print(classification_report(y_test, RF_model.predict(X_test)))

In [82]:
# Check results of the chosen model (Logistic Regression) on unseen data, i.e. data that was not used for model creation.
# Note: LR_model is the last model fitted in the loop above (C=100, trained on the final fold's training split).
tn, fp, fn, tp = confusion_matrix(y_test, LR_model.predict(X_test)).ravel()
sensitivity = tp/(tp+fn)
specificity  = tn/(tn+fp)
accuracy = (tp+tn)/(tp+fp+tn+fn)

print(" Sensitivity:", sensitivity, "\n Specificity", specificity, "\n Accuracy:", accuracy)

In [78]:
print(classification_report(y_test, LR_model.predict(X_test)))

We can see that Logistic Regression outperformed Random Forest, especially because it handled the class imbalance much better.