**Objective:**

*Predict which people are likely to develop diabetes.*

**About the dataset:**
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

**Data Dictionary:**
The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)
- Outcome: Class variable (0 or 1)

# Exploratory Data Analysis (EDA)

In [None]:
#Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import norm

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve, confusion_matrix, classification_report, precision_recall_curve
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
file_path = "/kaggle/input/pima-indians-diabetes-database/diabetes.csv"
diabetes = pd.read_csv(file_path)
diabetes.shape

In [None]:
diabetes.head()

In [None]:
diabetes.info()

We can see that there are 8 predictors and 1 target column (Outcome). All the columns are quantitative in nature.
Though there are no missing values, we need to check the features for inconsistent values, in particular zeros that stand in for missing measurements.
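
Counting zeros per column is a quick way to run that check. The sketch below uses a small made-up frame (`toy`, with invented values) in place of `diabetes`, and the list `suspect_cols` is an assumption about which columns cannot plausibly be zero.

```python
import pandas as pd

# Toy stand-in for the diabetes DataFrame (values are made up for illustration).
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 5, 1],
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 0],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Columns where a zero is physiologically implausible
# (Pregnancies is left out on purpose: zero is a valid count).
suspect_cols = ["Glucose", "Insulin", "BMI"]

# Count zeros per column and express them as a percentage of rows.
zero_counts = (toy[suspect_cols] == 0).sum()
zero_pct = (zero_counts / len(toy) * 100).round(2)

print(pd.DataFrame({"zeros": zero_counts, "percent": zero_pct}))
```

The same two lines should work on the real frame by swapping `toy` for `diabetes` and extending `suspect_cols`.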

Let us plot the frequencies of positive vs. negative cases.

In [None]:
print("No. of people without diabetes: ",diabetes["Outcome"].value_counts()[0])
print("No. of people with diabetes : ",diabetes["Outcome"].value_counts()[1])
print("Percent of people with diabetes : ",round(diabetes["Outcome"].value_counts()[1]/len(diabetes.index)*100,2), "%")

In [None]:
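# Pass the column as a keyword argument; recent seaborn versions reject it as a positional argument
sns.countplot(x="Outcome", data=diabetes);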

We can see that the ratio of Outcomes is 0:1 :: 65% : 35%, which is a moderate and acceptable imbalance. We will still use stratified sampling in the train-test split so that both sets keep this class ratio.

In [None]:
#Descriptive Statistics - count, mean, std and the five-number summary
diabetes.describe().T

Let us plot the Histograms for all the numeric features to understand their distribution.

In [None]:
#Distribution of each feature

sns.set_style("darkgrid")

fig, ax2 = plt.subplots(4, 2, figsize=(16, 16))

sns.distplot(diabetes['Pregnancies'],ax=ax2[0][0], fit=norm)
sns.distplot(diabetes['Glucose'],ax=ax2[0][1], fit=norm)
sns.distplot(diabetes['BloodPressure'],ax=ax2[1][0], fit=norm)
sns.distplot(diabetes['SkinThickness'],ax=ax2[1][1], fit=norm)
sns.distplot(diabetes['Insulin'],ax=ax2[2][0], fit=norm)
sns.distplot(diabetes['BMI'],ax=ax2[2][1], fit=norm)
sns.distplot(diabetes['DiabetesPedigreeFunction'],ax=ax2[3][0], fit=norm)
sns.distplot(diabetes['Age'],ax=ax2[3][1], fit=norm)

Except for "BMI", every feature shows strong positive skewness and visibly non-normal kurtosis, so the distributions differ heavily from the Normal Distribution (Bell Curve).
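
Skewness and kurtosis can also be read off as numbers instead of being judged from the plots. A minimal sketch, assuming pandas' `skew()` and `kurt()` (the latter reports excess kurtosis, so a normal distribution scores about 0); the two columns are synthetic and only illustrate the contrast.

```python
import numpy as np
import pandas as pd

# Synthetic columns: one roughly normal, one clearly right-skewed.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=50, scale=10, size=2000),
    "right_skewed": rng.exponential(scale=10, size=2000),
})

skewness = toy.skew()          # > 0 means a longer right tail
excess_kurtosis = toy.kurt()   # > 0 means heavier tails than a normal distribution

summary = pd.DataFrame({"skew": skewness, "excess_kurtosis": excess_kurtosis})
print(summary.round(2))
```

On the real data, `diabetes.skew()` and `diabetes.kurt()` would give the per-feature values.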

Let us observe the Outliers and Quantile distribution of data using BoxPlots. 

In [None]:
#Outliers analysis of each feature

sns.set_style("darkgrid")

fig, ax2 = plt.subplots(4, 2, figsize=(16, 16))

# Pass each column as x= so the call works across seaborn versions and draws horizontal boxes
sns.boxplot(x=diabetes['Pregnancies'],ax=ax2[0][0])
sns.boxplot(x=diabetes['Glucose'],ax=ax2[0][1])
sns.boxplot(x=diabetes['BloodPressure'],ax=ax2[1][0])
sns.boxplot(x=diabetes['SkinThickness'],ax=ax2[1][1])
sns.boxplot(x=diabetes['Insulin'],ax=ax2[2][0])
sns.boxplot(x=diabetes['BMI'],ax=ax2[2][1])
sns.boxplot(x=diabetes['DiabetesPedigreeFunction'],ax=ax2[3][0])
sns.boxplot(x=diabetes['Age'],ax=ax2[3][1])
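
The points a boxplot draws beyond its whiskers follow the usual 1.5 × IQR rule, which can be reproduced to count outliers per feature. The helper `iqr_outlier_mask` and the toy series below are hypothetical and not part of the original notebook.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up values with one very low and one very high reading.
toy_insulin = pd.Series([15, 80, 90, 100, 110, 120, 130, 846], name="Insulin")

mask = iqr_outlier_mask(toy_insulin)
print("Outliers:", toy_insulin[mask].tolist())
print("Share of outliers:", mask.mean())
```

Applying the helper column by column, for example with `diabetes.apply(iqr_outlier_mask).sum()`, would give an outlier count for each feature.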

In [None]:
sns.countplot(x="Pregnancies", data=diabetes);

In [None]:
pd.crosstab(diabetes["Pregnancies"], diabetes["Outcome"]).plot()

Let us check the bivariate scatterplots across all the feature combinations

In [None]:
# pairplot creates its own figure, so a separate plt.figure() call would only leave an empty figure behind
sns.pairplot(diabetes, diag_kind='kde', hue="Outcome")
plt.show()

Let us check the bivariate regression plots for all the feature combinations

In [None]:
# pairplot creates its own figure, so a separate plt.figure() call would only leave an empty figure behind
sns.pairplot(diabetes, kind='reg', hue='Outcome')
plt.show()

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(diabetes.corr(), cmap='magma', vmin = -1, vmax = 1, annot=True, fmt="0.2f", square=True, linewidths=0.2)
plt.show()

In [None]:
sns.lmplot(x="Age", y="Glucose", data=diabetes, hue='Outcome');

In [None]:
sns.lmplot(x="BloodPressure", y="Glucose", data=diabetes, hue='Outcome');

In [None]:
sns.countplot(x="Pregnancies", hue="Outcome", data=diabetes)

In [None]:
sns.lmplot(y="Insulin",x="Glucose", hue="Outcome", data=diabetes);

# Data Preprocessing

1. Impute for Missing Values
2. Split the data into train and test
3. Data Transformation (Scaling)

In [None]:
diabetes.columns

Let us check for 0s in some of the features and impute them. We will exclude the "Pregnancies" column from this imputation since 0 pregnancies is valid information.

In [None]:
#All features except Pregnancies - replacing 0s with NaNs
zero_as_missing_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
diabetes[zero_as_missing_cols] = diabetes[zero_as_missing_cols].replace(to_replace=0, value=np.nan)

In [None]:
diabetes.head()

Let us inspect the NaNs now in the dataset

In [None]:
diabetes.info()

In [None]:
print("Number of missing values in dataframe : \n", diabetes.isnull().sum())
print("-------------------------")
print("Percentage of columnwise missing values in dataframe : \n", round(diabetes.isnull().mean() * 100, 2))

Imputing the NaNs with Median values

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median')

diabetes_cols = diabetes.columns

diabetes = imputer.fit_transform(diabetes)

diabetes = pd.DataFrame(diabetes, columns = diabetes_cols)

diabetes.head()
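
The cell above fits the imputer on the full dataset, so the medians also see rows that later land in the test set. A possible alternative, sketched here on a made-up toy frame rather than the real data, is to split first and fit the imputer on the training rows only. The names `toy_X`, `toy_y`, `X_tr` and `X_te` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy predictors with a few missing values (invented numbers).
toy_X = pd.DataFrame({
    "Glucose": [148, np.nan, 183, 89, 137, 116, 78, np.nan],
    "BMI": [33.6, 26.6, np.nan, 28.1, 43.1, 25.6, 31.0, 35.3],
})
toy_y = pd.Series([1, 0, 1, 0, 1, 0, 1, 0])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=0
)

# Learn the medians from the training rows only ...
imputer = SimpleImputer(strategy="median")
X_tr_imp = pd.DataFrame(imputer.fit_transform(X_tr), index=X_tr.index, columns=X_tr.columns)
# ... and reuse those same medians for the test rows.
X_te_imp = pd.DataFrame(imputer.transform(X_te), index=X_te.index, columns=X_te.columns)

print("Medians learned from train:", dict(zip(X_tr.columns, imputer.statistics_)))
```

With only a handful of missing values the difference is usually small, but this ordering keeps the test set untouched by anything learned during preprocessing.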

Let us check the shapes of predictors and target

In [None]:
X = diabetes[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']]
y = diabetes['Outcome']
print("Shape of X :", X.shape)
print("Shape of y :",y.shape)

Splitting the data into train and test sets. We will use Stratified sampling to handle class imbalance.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, stratify=diabetes["Outcome"], random_state=24)
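
To see what `stratify` buys us, the sketch below splits a synthetic target with the same 65% : 35% ratio and compares the positive share in each part. The arrays `X_toy` and `y_toy` are made up; only the `train_test_split` arguments mirror the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with 65% negatives and 35% positives.
y_toy = np.array([0] * 130 + [1] * 70)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, stratify=y_toy, random_state=24
)

# With stratification, each part keeps (almost exactly) the original class ratio.
print("positive share, full :", y_toy.mean())
print("positive share, train:", y_tr.mean())
print("positive share, test :", y_te.mean())
```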

In [None]:
X_train.shape

In [None]:
y_train.shape

In [None]:
X_test.shape

In [None]:
y_test.shape

In [None]:
X_train.sample(3)

In [None]:
X_test.sample(3)

Before we fit the model, let us scale the features so that they all share a common range. Note that min-max scaling only rescales the data; it is itself sensitive to outliers and does not remove them.

In [None]:
from sklearn.preprocessing import MinMaxScaler
mmScaler = MinMaxScaler()
# Fit the scaler on the training data only, then reuse the same min/max for the test data
X_train_scaled = mmScaler.fit_transform(X_train.values)
X_test_scaled = mmScaler.transform(X_test.values)

X_train = pd.DataFrame(X_train_scaled, index=X_train.index, columns=X_train.columns)
X_test = pd.DataFrame(X_test_scaled, index=X_test.index, columns=X_test.columns)
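
Since min-max scaling maps the smallest value to 0 and the largest to 1, a single extreme reading squeezes all other values into a narrow band. The sketch below shows this on made-up insulin-like values and compares it with `RobustScaler`, which centres on the median and divides by the IQR. `RobustScaler` is shown only as an alternative; it is not used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Five ordinary readings and one extreme value (invented numbers).
toy_insulin = np.array([[80.0], [90.0], [100.0], [110.0], [120.0], [846.0]])

minmax_scaled = MinMaxScaler().fit_transform(toy_insulin).ravel()
robust_scaled = RobustScaler().fit_transform(toy_insulin).ravel()

# Under min-max scaling the five ordinary readings end up crowded near 0.
print("min-max:", np.round(minmax_scaled, 3))
# Under robust scaling they keep a usable spread; the outlier simply stays large.
print("robust :", np.round(robust_scaled, 3))
```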

In [None]:
X_train.sample(3)

In [None]:
X_test.sample(3)

# ML Model Building

In [None]:
#Logistic Regression with liblinear solver
log_reg_ml = LogisticRegression(solver = "liblinear")
log_reg_ml.fit(X_train, y_train)
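
The imputation, scaling and model-fitting steps can also be chained in a scikit-learn `Pipeline`, so that under cross-validation every step is fitted on the training folds only. This is a minimal sketch on synthetic data from `make_classification`, not on the diabetes data; the step names and the 5% missing-value rate are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic binary problem with 8 features and a roughly 65/35 class split.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    weights=[0.65, 0.35], random_state=24,
)

# Knock out about 5% of the entries to mimic missing measurements.
rng = np.random.default_rng(24)
X_toy[rng.random(X_toy.shape) < 0.05] = np.nan

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
    ("model", LogisticRegression(solver="liblinear")),
])

# Each fold refits the imputer, the scaler and the model on its own training part.
scores = cross_val_score(pipe, X_toy, y_toy, cv=5, scoring="roc_auc")
print("Mean cross-validated AUC:", scores.mean().round(3))
```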

# Model Evaluation & Performance Metrics

In [None]:
#Accuracy score for train set
log_reg_ml.score(X_train, y_train)

In [None]:
#Accuracy score for test set
log_reg_ml.score(X_test, y_test)

In [None]:
#Let us predict the Test set
y_test_pred = log_reg_ml.predict(X_test)

In [None]:
#Let us predict the corresponding Probabilities for Test set
y_test_pred_prob = log_reg_ml.predict_proba(X_test)

In [None]:
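#Confusion Matrix (rows = actual, columns = predicted, both in the order [1, 0])
cm = confusion_matrix(y_test, y_test_pred, labels=[1, 0])
plt.figure(figsize = (7,5))
sns.heatmap(cm, annot=True, square=True, fmt="d", xticklabels=[1, 0], yticklabels=[1, 0])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()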

In [None]:
# cm was built with labels=[1, 0], so row 0 / column 0 is the positive class
TP = cm[0,0] # True Positive
FN = cm[0,1] # False Negative
FP = cm[1,0] # False Positive
TN = cm[1,1] # True Negative

In [None]:
precision = TP/(TP+FP)
precision

In [None]:
recall = TP/(TP+FN)
recall

In [None]:
sensitivity = TP/(TP+FN)
sensitivity

In [None]:
specificity = TN / (TN + FP)
specificity
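
It is easy to mix up the four cells of the confusion matrix, so a quick cross-check against scikit-learn's own metric functions can help. The sketch below uses the default label order `[0, 1]`, for which `ravel()` returns `tn, fp, fn, tp`; the toy labels are made up.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 4 actual positives, 6 actual negatives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# With the default label order [0, 1], ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)          # also called sensitivity
specificity = tn / (tn + fp)

# The hand-computed values should agree with the library functions.
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(specificity, recall_score(y_true, y_pred, pos_label=0))
```

Specificity is the recall of the negative class, which is why `recall_score` with `pos_label=0` reproduces it.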

In [None]:
#Area under the Receiver Operating Characteristic curve, computed from the predicted probabilities of class 1
roc_auc_score(y_test, y_test_pred_prob[:, 1])
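
`roc_auc_score` ranks samples by score, so it is meant to receive probabilities (or decision scores) rather than 0/1 predictions. Hard labels collapse the ROC curve to a single operating point and generally give a different, usually lower, value. A small sketch with made-up numbers:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])
proba = np.array([0.10, 0.30, 0.45, 0.40, 0.60, 0.90])  # invented P(class 1)
hard = (proba >= 0.5).astype(int)                        # labels at a 0.5 cut-off

auc_from_proba = roc_auc_score(y_true, proba)
auc_from_labels = roc_auc_score(y_true, hard)

print("AUC from probabilities:", round(auc_from_proba, 3))
print("AUC from hard labels  :", round(auc_from_labels, 3))
```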

In [None]:
# Defining the function to plot the ROC curve
def draw_roc(y_true, y_score):
    # y_score should hold scores or probabilities for the positive class, not hard labels
    fpr, tpr, thresholds = roc_curve(y_true, y_score, drop_intermediate=False)
    auc_score = roc_auc_score(y_true, y_score)
    plt.figure(figsize=(5, 5))
    plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic curve')
    plt.legend(loc="lower right")
    plt.show()

# Calling the function with the predicted probabilities of class 1
draw_roc(y_test, y_test_pred_prob[:, 1])

In [None]:
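#Precision-Recall curve, computed from the predicted probabilities of class 1
p, r, thresholds = precision_recall_curve(y_test, y_test_pred_prob[:, 1])
plt.plot(thresholds, p[:-1], "b-", label="Precision")
plt.plot(thresholds, r[:-1], "g-", label="Recall")
plt.xlabel("Threshold")
plt.legend()
plt.show()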
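
The precision and recall arrays can also be used to pick a decision threshold other than the default 0.5. One common heuristic, shown here only as an illustration on made-up scores, is to take the threshold that maximises the F1 score.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
proba = np.array([0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.95])  # invented P(class 1)

precision, recall, thresholds = precision_recall_curve(y_true, proba)

# The last precision/recall pair has no matching threshold, so drop it.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1])

best = np.argmax(f1)
best_threshold = thresholds[best]
print("Best threshold by F1:", best_threshold, "with F1 =", round(f1[best], 3))

# Predictions at the chosen threshold instead of the default 0.5.
y_pred_tuned = (proba >= best_threshold).astype(int)
```

In a medical screening setting one might instead fix a minimum recall and choose the highest threshold that still meets it; that choice depends on the cost of a missed case.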

In [None]:
print(classification_report(y_test, y_test_pred, labels=[1, 0]))