# Pima Indians Diabetes Database

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.


The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.


We build a machine learning model to predict whether or not the patients in the dataset have diabetes.

Pregnancies: Number of times pregnant

Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test

BloodPressure: Diastolic blood pressure (mm Hg)

SkinThickness: Triceps skin fold thickness (mm)

Insulin: 2-Hour serum insulin (mu U/ml)

BMI: Body mass index (weight in kg/(height in m)^2)

DiabetesPedigreeFunction: Diabetes pedigree function

Age: Age (years)

Outcome: Class variable (0 or 1); 268 of the 768 records are 1, the others are 0

**General Information on Variables**

**1. Glucose Tolerance Test**

It is a blood test that involves taking multiple blood samples over time, usually 2 hours. It is used to diagnose diabetes. The results can be classified as normal, impaired, or abnormal.

Normal result -> Two-hour glucose level below 140 mg/dL

Impaired result (prediabetes) -> Two-hour glucose level from 140 to 199 mg/dL

Abnormal result (diagnostic of diabetes) -> Two-hour glucose level of 200 mg/dL or higher
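The thresholds above can be sketched as a small helper. The function name `classify_ogtt` is hypothetical, and treating a reading of exactly 200 mg/dL as abnormal is an assumption about where the two upper ranges meet.

```python
def classify_ogtt(two_hour_glucose_mg_dl: float) -> str:
    """Map a 2-hour glucose tolerance reading (mg/dL) to a result category."""
    if two_hour_glucose_mg_dl < 140:
        return "normal"
    if two_hour_glucose_mg_dl < 200:
        return "impaired"
    # Assumption: a reading of exactly 200 mg/dL counts as abnormal.
    return "abnormal"


# A few illustrative readings (made-up values, not taken from the dataset).
readings = [95, 140, 199, 200, 250]
labels = [classify_ogtt(r) for r in readings]
print(list(zip(readings, labels)))
```

Applied to the Glucose column, a helper like this gives a quick view of how many patients fall into each range.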

**2. BloodPressure**

The diastolic reading, or the bottom number, is the pressure in the arteries when the heart rests between beats. This is the time when the heart fills with blood and gets oxygen. A normal diastolic blood pressure is lower than 80. A reading of 90 or higher means you have high blood pressure.


Normal: Systolic below 120 and diastolic below 80

Elevated: Systolic 120–129 and diastolic below 80

Hypertension stage 1: Systolic 130–139 or diastolic 80–89

Hypertension stage 2: Systolic 140 or higher or diastolic 90 or higher

Hypertensive crisis: Systolic higher than 180 and/or diastolic higher than 120

**3. BMI**

The standard weight status categories associated with BMI ranges for adults are shown in the following table.

Below 18.5 -> Underweight

18.5 – 24.9 -> Normal or Healthy Weight

25.0 – 29.9 -> Overweight

30.0 and Above -> Obese
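As a rough illustration, the BMI ranges above can be applied to a column with `pandas.cut`. The values below are made up for the example, and the bin edges assume each range includes its lower bound (so 25.0 is "Overweight").

```python
import pandas as pd

# Illustrative BMI values (not taken from the dataset).
bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0, 45.3], name="BMI")

# right=False makes each bin closed on the left: [0, 18.5), [18.5, 25), ...
bins = [0, 18.5, 25.0, 30.0, float("inf")]
labels = ["Underweight", "Normal", "Overweight", "Obese"]
bmi_category = pd.cut(bmi, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"BMI": bmi, "Category": bmi_category}))
```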

Load the basic required libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')

In [None]:
db=pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')

In [None]:
db.head()

In [None]:
db.tail()

In [None]:
db.isnull().sum()

No NA values are present in the dataset.

In [None]:
db.info()

In [None]:
db.describe()

Can the minimum value of the columns listed below be zero (0)?

In these columns a value of zero does not make sense and therefore indicates a missing value.

The following columns have invalid zero values:

1-Glucose

2-BloodPressure

3-SkinThickness

4-Insulin

5-BMI

Pregnancies can legitimately be zero, so it is left unchanged.

It is better to replace these zeros with NaN: they then become easy to count, and they can later be replaced with suitable values.

In [None]:
diabetes_data_copy = db.copy(deep = True)
# Columns where 0 is not a valid measurement and stands for a missing value
zero_as_missing = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
diabetes_data_copy[zero_as_missing] = diabetes_data_copy[zero_as_missing].replace(0, np.nan)

# Show the count of NaNs per column
print(diabetes_data_copy.isnull().sum())

To fill these NaN values, we first need to understand the data distribution.

In [None]:
p = diabetes_data_copy.hist(figsize = (20,20))

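We impute the NaN values in each column in accordance with its distribution: the mean for Glucose and BloodPressure, and the median, which is less sensitive to skew and extreme values, for SkinThickness, Insulin and BMI.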
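One way to make the choice between mean and median explicit is to look at each column's skewness. The sketch below is only an illustration: the helper `impute_by_skew`, the 0.5 skewness threshold, and the toy values are all assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def impute_by_skew(frame, columns, skew_threshold=0.5):
    """Fill NaNs with the median for skewed columns and the mean otherwise."""
    filled = frame.copy()
    chosen = {}
    for col in columns:
        # Series.skew() ignores NaN values by default.
        use_median = abs(filled[col].skew()) > skew_threshold
        value = filled[col].median() if use_median else filled[col].mean()
        filled[col] = filled[col].fillna(value)
        chosen[col] = "median" if use_median else "mean"
    return filled, chosen


# Toy data: Glucose is symmetric, Insulin has one very large value.
toy = pd.DataFrame({
    "Glucose": [90.0, 100.0, 110.0, 120.0, np.nan],
    "Insulin": [10.0, 12.0, 15.0, 500.0, np.nan],
})
filled, chosen = impute_by_skew(toy, ["Glucose", "Insulin"])
print(chosen)
print(filled)
```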

In [None]:
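# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which may not update the DataFrame in recent pandas versions.
diabetes_data_copy['Glucose'] = diabetes_data_copy['Glucose'].fillna(diabetes_data_copy['Glucose'].mean())
diabetes_data_copy['BloodPressure'] = diabetes_data_copy['BloodPressure'].fillna(diabetes_data_copy['BloodPressure'].mean())
diabetes_data_copy['SkinThickness'] = diabetes_data_copy['SkinThickness'].fillna(diabetes_data_copy['SkinThickness'].median())
diabetes_data_copy['Insulin'] = diabetes_data_copy['Insulin'].fillna(diabetes_data_copy['Insulin'].median())
diabetes_data_copy['BMI'] = diabetes_data_copy['BMI'].fillna(diabetes_data_copy['BMI'].median())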

Distributions after replacing the NaN values

In [None]:
p = diabetes_data_copy.hist(figsize = (20,20))

In [None]:
print(diabetes_data_copy.isnull().sum())

We can see that the null values have been completely replaced.

In [None]:
diabetes_data_copy['Outcome'].value_counts()

In [None]:
diabetes_data_copy.shape

In [None]:
diabetes_data_copy.Outcome.unique()

In [None]:
plt.figure(figsize=(7,5)) 
sns.countplot(x="Outcome", data=diabetes_data_copy, palette=('Orange','DarkBlue'))
plt.xlabel("Diabetes Disease (0 = No, 1= Yes)")
plt.show()

The graph above shows that the data is imbalanced towards outcome 0, which means that diabetes was not present. The number of non-diabetic patients (500) is almost twice the number of diabetic patients (268).
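The imbalance matters when reading accuracy later on. As a small sketch using only the class counts stated in the dataset description (268 positives out of 768), the share of each class and the accuracy of always predicting the majority class can be computed as follows; the `outcome` series is rebuilt from those counts rather than read from the CSV.

```python
import pandas as pd

# Rebuild the Outcome column from the documented counts: 500 zeros, 268 ones.
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

class_share = outcome.value_counts(normalize=True)
# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = class_share.max()
ratio = (outcome == 0).sum() / (outcome == 1).sum()

print(class_share.round(3).to_dict())
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
print(f"Non-diabetic to diabetic ratio: {ratio:.2f}")
```

Any model worth keeping should score clearly above this baseline.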

In [None]:
plt.figure(figsize=(10,9)) 
sns.countplot(x="Age", hue="Outcome",data=diabetes_data_copy, palette=('Orange','DarkBlue'))
plt.xlabel("Age (years)")
plt.show()

We can see that most of the patients with diabetes are between 25 and 60 years old.

In [None]:
plt.figure(figsize=(10,9)) 
sns.countplot(x="Glucose", data=diabetes_data_copy, hue="Outcome", palette=('Orange','DarkBlue'))
plt.xlabel('Plasma glucose concentration (2-hour test)')
plt.ylabel('Number of patients')
plt.show()

In [None]:
plt.figure(figsize=(10,9))
plt.title('Blood pressure by outcome')
sns.countplot(data = diabetes_data_copy, x='BloodPressure',hue='Outcome')
plt.show()

In [None]:
# Cap extreme values: clip each column to its own 15th and 85th percentiles
column_name= ['Pregnancies','SkinThickness','DiabetesPedigreeFunction','BMI','Age','BloodPressure']
diabetes_data_copy[column_name]= diabetes_data_copy[column_name].clip(lower= diabetes_data_copy[column_name].quantile(0.15), upper= diabetes_data_copy[column_name].quantile(0.85), axis=1)

In [None]:
diabetes_data_copy.plot(kind='box', figsize=(10,8))

We found that there are outliers in the 'Insulin', 'Pregnancies', 'SkinThickness', 'DiabetesPedigreeFunction', 'BMI' and 'Age' columns, and we want to remove them from the dataset. The Insulin column has too many outliers, so we drop it from the dataset.
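A box plot flags points that lie more than 1.5 times the interquartile range beyond the quartiles. As a sketch, the same rule can be used to count outliers per column; the helper `count_iqr_outliers` and the toy values are hypothetical and only illustrate the rule.

```python
import pandas as pd


def count_iqr_outliers(frame: pd.DataFrame) -> pd.Series:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in each column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series on the column names.
    outside = (frame < q1 - 1.5 * iqr) | (frame > q3 + 1.5 * iqr)
    return outside.sum()


# Toy data: one extreme Insulin value, no extreme Age values.
toy = pd.DataFrame({
    "Insulin": [80, 90, 100, 110, 120, 130, 846],
    "Age": [21, 25, 29, 33, 37, 41, 45],
})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```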

In [None]:
diabetes_data_copy.drop(columns=['Insulin'], inplace=True)

In [None]:
diabetes_data_copy.plot(kind='box', figsize=(10,8))

In [None]:
plt.figure(figsize=(10,9)) 
sns.countplot(x="Pregnancies", hue="Outcome",data=diabetes_data_copy, palette=('Orange','DarkBlue'))
plt.show()

In [None]:
p=sns.pairplot(diabetes_data_copy, hue = 'Outcome',corner=True)

In [None]:
plt.figure(figsize=(12,10))
p=sns.heatmap(diabetes_data_copy.corr(), annot=True,cmap ='RdYlGn') 

Now, let's start building the model

In [None]:
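# Model on the cleaned data (imputed and clipped) rather than the raw frame
db1 = diabetes_data_copy.copy()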

In [None]:
x = db1.drop(['Outcome'], axis = 1)
y = db1['Outcome']

In [None]:
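from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.3, random_state = 0)

In [None]:
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training split only, so test-set statistics do not leak into training
sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.transform(x_test)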

In [None]:
x_train.shape, x_test.shape

In [None]:
y_train.shape, y_test.shape

Build Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
# scikit-learn expects (y_true, y_pred): rows are actual classes, columns are predictions
confmat = confusion_matrix(y_test, y_pred)
confmat

In [None]:
accuracy_score(y_test, y_pred)
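Because the classes are imbalanced, accuracy alone can hide how many diabetic patients are missed. The sketch below shows how precision and recall follow from the confusion matrix; the label lists are made-up toy values, not predictions from the model above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels for illustration only (1 = diabetic, 0 = non-diabetic).
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = precision_score(y_true_demo, y_pred_demo)  # tp / (tp + fp)
recall = recall_score(y_true_demo, y_pred_demo)        # tp / (tp + fn)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Here the accuracy looks reasonable while half of the positive cases are missed, which is why recall is worth reporting alongside it.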


Build Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(criterion = "gini", 
                                       min_samples_leaf = 1, 
                                       min_samples_split = 10,   
                                       n_estimators=100, 
                                       max_features='sqrt',  # 'auto' was removed in scikit-learn 1.3; 'sqrt' is its equivalent for classifiers
                                       oob_score=True, 
                                       random_state=1, 
                                       n_jobs=-1)

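random_forest.fit(x_train, y_train)
y_pred = random_forest.predict(x_test)
# Report the training accuracy and the out-of-bag estimate
print("Train accuracy: ", round(random_forest.score(x_train, y_train) * 100, 2), "%")
print("OOB score: ", round(random_forest.oob_score_ * 100, 2), "%")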
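A single train/test split gives only one estimate of performance. As a sketch of a more stable alternative, a scaler and a classifier can be wrapped in a pipeline and evaluated with stratified k-fold cross-validation. The data here is synthetic (generated with `make_classification` to roughly mimic the dataset's size and class balance), so the scores say nothing about the real dataset; the choice of 5 folds is also an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 768 rows, 8 features, about 35% positives.
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4,
    weights=[0.65, 0.35], random_state=0,
)

# The pipeline refits the scaler inside each fold, so no fold sees
# statistics computed from its own validation data.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

The same pattern works with `RandomForestClassifier` in place of the logistic regression step.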