# Pima Indians Diabetes Database
Predict the onset of diabetes based on diagnostic measures

**Problem Definition**


```
Predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
```
**Features**
* Pregnancies: Number of times pregnant.

* Glucose: Plasma glucose concentration over 2 hours in an oral glucose tolerance test.

* BloodPressure: Diastolic blood pressure (mm Hg).

* SkinThickness: Triceps skin fold thickness (mm).

* Insulin: 2-Hour serum insulin (mu U/ml).

* BMI: Body mass index (weight in kg / (height in m)^2).

* DiabetesPedigreeFunction: Diabetes pedigree function (a function which scores likelihood of diabetes based on family history).

* Age: Age in years.



In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Ignoring the warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
# Loading the Dataset
data = pd.read_csv("/kaggle/input/pima-indians-diabetes-database/diabetes.csv")
data.head()  #display first 5 rows of dataset

In [None]:
data.tail() #display last 5 rows of dataset

In [None]:
# Column names and their datatypes
data.info()
data.shape

The dataset has 768 rows and 9 columns.

In [None]:
# Description of Dataset
data.describe()

In [None]:
data.value_counts()

In [None]:
# Checking Missing Values
data.isna().sum()

There are no NaN values in any column.

In [None]:
# Count the zeros in each column (a 0 may stand in for a missing measurement)

zero_counts = (data == 0).sum()
print(zero_counts)
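
A value of 0 is not physiologically plausible for measurements such as glucose or BMI, so zeros in those columns probably encode missing readings. The sketch below uses a small made-up frame rather than the real dataset and shows one way to count such zeros and convert them to NaN. The columns listed in `zero_as_missing` are an assumption for illustration, not something the dataset documents.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 5, 1],
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# Assumption: a zero in these columns means "not measured"
zero_as_missing = ["Glucose", "BMI"]

# Count zeros per selected column before changing anything
toy_zero_counts = (toy[zero_as_missing] == 0).sum()
print(toy_zero_counts)

# Replace zeros with NaN only in the selected columns;
# Pregnancies and Outcome keep their legitimate zeros
cleaned = toy.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
print(cleaned.isna().sum())
```

Once the zeros are NaN, `isna().sum()` reports them, and they can be imputed (for example with the column median) or dropped, depending on the model.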

In [None]:
# See the distribution of the outcome variable
data.groupby('Outcome').size()

There are 500 non-diabetic and 268 diabetic women, so the classes are imbalanced.
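
To put a number on the imbalance, the sketch below rebuilds a label vector with the same class counts (500 and 268) and computes scikit-learn's "balanced" class weights. The array `y_demo` is a stand-in created for this example; with the real data, `data['Outcome']` would take its place.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels with the same class counts as reported above (500 zeros, 268 ones)
y_demo = np.array([0] * 500 + [1] * 268)

# Share of each class in the data
class_share = np.bincount(y_demo) / len(y_demo)
print(class_share)

# "balanced" weights are n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)
print(dict(zip([0, 1], weights)))
```

Several scikit-learn classifiers, including `LogisticRegression`, `RandomForestClassifier`, `DecisionTreeClassifier`, and `SVC`, accept `class_weight="balanced"` and apply these weights internally.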

In [None]:
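# Check Percentage of Healthy and Diabetic women
import matplotlib.style as style

# The seaborn styles were renamed to 'seaborn-v0_8-*' in Matplotlib 3.6, so pick whichever name is available
pastel = 'seaborn-v0_8-pastel' if 'seaborn-v0_8-pastel' in style.available else 'seaborn-pastel'
style.use(pastel)
labels = ["Healthy", "Diabetic"]
# sort_index() keeps the counts in 0, 1 order so they line up with the labels
data['Outcome'].value_counts().sort_index().plot(kind='pie',labels=labels, subplots=True,autopct='%1.0f%%', figsize=(5,5));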

In [None]:
## See the variables with respect to outcome variable

data.groupby('Outcome').hist(figsize=(20,5),layout=(2,8),histtype='barstacked')
plt.show()

# Exploratory Data Analysis

In [None]:
# Plotting Correlation Matrix using Heatmap

plt.figure(figsize = (10,8))
sns.heatmap(data.corr(), annot =True);

In [None]:
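# Comparing the Age distribution of healthy and diabetic women
# (sns.distplot is deprecated in recent seaborn releases in favour of histplot/displot)

sns.set(rc={'figure.figsize':(8,4)})  # set the figure size before plotting so it applies to this figure
sns.distplot(data.loc[data['Outcome']==0, 'Age'],label='Healthy')
sns.distplot(data.loc[data['Outcome']==1, 'Age'], hist_kws=dict(alpha=0.4), label='Diabetic')
plt.legend(prop={'size': 12})
plt.title('Age Distribution by Outcome')
plt.xlabel('Age')
plt.ylabel('Density')  # distplot draws a normalized histogram with a KDE, not raw counts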

In [None]:
data.hist(figsize=(15,10))
plt.show()

In [None]:
sns.distplot(data['Pregnancies'],bins=10);

In [None]:
sns.distplot(data['Age'],bins=10)

In [None]:
sns.distplot(data['DiabetesPedigreeFunction'],bins=10)

**The histograms show that most of the features are right-skewed.**
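
Skewness can also be measured rather than read off a plot. The sketch below is an illustration on synthetic data, not on the dataset itself: it draws a right-skewed sample, computes its skewness with pandas, and shows how a `log1p` transform pulls the value towards zero. The variable `feature` is a hypothetical stand-in for a column such as Insulin.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (log-normal), standing in for a real feature
rng = np.random.default_rng(0)
feature = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=1000))

# Positive skewness means a long right tail
skew_before = feature.skew()

# log1p computes log(1 + x), so zero values are handled safely
skew_after = np.log1p(feature).skew()

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

On the real data, `data.skew()` gives the same measure for every column at once.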

In [None]:
# Box plots of six features to spot outliers
plt.figure(figsize=(14,10))
sns.set_style(style='darkgrid')
box_cols = ['Glucose', 'BloodPressure', 'Insulin', 'BMI', 'Age', 'SkinThickness']
for i, col in enumerate(box_cols, start=1):
    plt.subplot(2,3,i)
    # Pass the column as x explicitly so the orientation is the same across seaborn versions
    sns.boxplot(x=data[col])

In [None]:
# Glucose level by number of pregnancies, split by outcome
sns.boxplot(x=data['Pregnancies'],y=data['Glucose'],hue=data['Outcome'])

In [None]:
# Histogram of the number of women in each (Outcome, BloodPressure) group
data.groupby(['Outcome','BloodPressure']).Pregnancies.count().hist()
plt.show()

In [None]:
sns.pairplot(data,hue='Outcome',palette="husl")

# Training and Testing Data

In [None]:
# Separate the features (x) from the target (y)

x = data.drop("Outcome", axis=1)
y = data.Outcome

In [None]:
# Split the data into training and testing data

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=1)

x_train.shape, x_test.shape, y_train.shape, y_test.shape
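
Because the two classes are not equally frequent, a purely random split can leave the test set with a slightly different class ratio than the training set. The sketch below shows the `stratify` argument of `train_test_split` on a synthetic stand-in with the same class counts (500 and 268). The names `X_demo` and `y_demo` are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the class counts reported earlier (500 vs. 268)
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.normal(size=(768, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series([0] * 500 + [1] * 268, name="Outcome")

# stratify=y_demo keeps the class proportions nearly identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Fraction of positive labels overall, in train, and in test
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```

Applied to this notebook, that would mean adding `stratify=y` to the existing call.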

**Now we've got our data split into training and test sets, it's time to build a machine learning model.**

# Models:
We're going to try 7 different machine learning models:
```
1. Logistic Regression
2. RandomForestClassifier
3. DecisionTreeClassifier
4. KNeighborsClassifier
5. Support Vector Classification (SVC)
6. Naive Bayes (GaussianNB)
7. AdaBoostClassifier
```







In [None]:
# Import all 7 models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier

In [None]:
# Put models in a dictionary
models = {"Logistic Regression": LogisticRegression(solver='liblinear'),
          "Random Forest": RandomForestClassifier(),
          "Decision Tree": DecisionTreeClassifier(max_depth=6, random_state=123,criterion='entropy'),
          "KNN": KNeighborsClassifier(n_neighbors=7),
          "SVC": SVC(),
          "GaussianNB" : GaussianNB(),
          "AdaBoost" : AdaBoostClassifier()}  # default base learner; the base_estimator argument was removed in recent scikit-learn

In [None]:
# Create a function to fit and score models
def fit_and_score(models, x_train, x_test, y_train, y_test):
    """
    Fits and evaluates given machine learning models.
    models : a dict of different Scikit-Learn machine learning models
    x_train : training data (no labels)
    x_test : testing data (no labels)
    y_train : training labels
    y_test : test labels
    """
    # Set random seed
    np.random.seed(42)
    # Make a dictionary to keep model scores
    model_scores = {}
    # Loop through models
    for name, model in models.items():
        # Fit the model to the data
        model.fit(x_train, y_train)
        # Evaluate the model on the test set and store its accuracy in model_scores
        model_scores[name] = model.score(x_test, y_test)
    return model_scores

In [None]:
# Call the function
model_scores = fit_and_score(models=models,
                             x_train=x_train,
                             x_test=x_test,
                             y_train=y_train,
                             y_test=y_test)

model_scores
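
A single train/test split yields one accuracy figure per model, and that figure can shift with a different `random_state`. The sketch below shows 5-fold cross-validation as a more stable estimate, using synthetic data from `make_classification` in place of `x` and `y`. It also wraps KNN in a pipeline with `StandardScaler`, on the assumption that a distance-based model benefits from features on a common scale. The notebook itself does not scale the features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data with 8 features, standing in for x and y
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaling lives inside the pipeline so it is re-fit on each training fold only
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))

# 5-fold cross-validated accuracy
cv_scores = cross_val_score(knn_pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"mean accuracy: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```

The same call works for any estimator in the `models` dictionary, which makes the comparison less dependent on one particular split.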

#### Let's print the metrics

In [None]:
# First import metrics
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import r2_score
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import mean_squared_error


In [None]:
for name, model in models.items():
    # Fit the model to the data
    model.fit(x_train, y_train)
    y_preds = model.predict(x_test)
    print(f"Classification Report of {name} is:\n {classification_report(y_test,y_preds)}")

    print(f"Confusion Matrix of {name} is:\n {confusion_matrix(y_test,y_preds)}\n")

    # On 0/1 labels the mean squared error equals the misclassification rate
    print(f"Mean Squared Error of {name} is: {mean_squared_error(y_test,y_preds)}\n")

    # R2 is a regression metric; it is printed for reference and is hard to interpret for class labels
    print(f"R2 score of {name} is: {r2_score(y_test,y_preds)}\n")

    # ROC AUC computed from hard 0/1 predictions, i.e. from a single operating point
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_preds)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    print(f"ROC AUC of {name} is:\n {roc_auc}")
    print("==================================================================================================\n\n")
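
The loop above feeds hard 0/1 predictions into `roc_curve`, which gives the curve a single operating point. ROC AUC is usually computed from continuous scores instead. The sketch below compares the two on synthetic data with one logistic regression model. The data and the names `X_demo`, `y_demo`, and `clf` are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the diabetes features and labels
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

clf = LogisticRegression(solver="liblinear").fit(X_tr, y_tr)

# AUC from hard 0/1 predictions: only one threshold is represented
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))

# AUC from the predicted probability of class 1: uses the full ranking of samples
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"AUC from labels: {auc_from_labels:.3f}, from probabilities: {auc_from_scores:.3f}")
```

Most models in the `models` dictionary expose `predict_proba`. `SVC()` with default settings does not, so `decision_function` (or `SVC(probability=True)`) would be needed there.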

### Model Comparison

In [None]:
model_scores

In [None]:
model_compare = pd.DataFrame(model_scores, index=["accuracy"])
model_compare.T.plot.bar();