
# **PIMA Indians Diabetes**
# Background
Diabetes is a group of metabolic disorders characterized by high blood sugar levels over a prolonged period. Symptoms of high blood sugar include frequent urination, increased thirst, and increased hunger. If left untreated, diabetes can cause many complications. Acute complications can include diabetic ketoacidosis, hyperosmolar hyperglycemic state, or death. Serious long-term complications include cardiovascular disease, stroke, chronic kidney disease, foot ulcers, and damage to the eyes.

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

# Objective
We will try to build a machine learning model that accurately predicts whether or not the patients in the dataset have diabetes.

# Data
The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

* Pregnancies: Number of times pregnant

* Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
 
* BloodPressure: Diastolic blood pressure (mm Hg)

* SkinThickness: Triceps skin fold thickness (mm)
 
* Insulin: 2-Hour serum insulin (mu U/ml)
 
* BMI: Body mass index (weight in kg/(height in m)^2)
 
* DiabetesPedigreeFunction: Diabetes pedigree function
 
* Age: Age (years)

* Outcome: Class variable (0 or 1)
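
As a quick illustration of the BMI definition in the list above, the sketch below computes it for one made-up patient. The weight and height are assumptions for illustration and are not taken from the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

# Hypothetical patient: 70 kg and 1.60 m tall.
example_bmi = bmi(70.0, 1.60)
print(round(example_bmi, 1))  # 70 / 2.56
```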

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
df=pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')
df.head()

In [None]:
df.info()

In [None]:
df.describe()

# Finding null values in the dataset

**There are no null values in this dataset**

In [None]:
df.isnull().sum()
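
Note that `isnull()` only detects values stored as NaN. In this dataset, columns such as Glucose, BloodPressure, SkinThickness, Insulin and BMI contain zeros, which are not physiologically plausible and are commonly treated as missing readings. A minimal sketch of how one might count them, using a small made-up frame in place of `df` (the values and the `zero_as_missing` column list are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the numbers are invented for illustration.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "Insulin": [0, 0, 0, 94],
    "Outcome": [1, 0, 1, 0],
})

# Columns where a zero cannot be a real measurement (assumed list).
zero_as_missing = ["Glucose", "BloodPressure", "Insulin"]

# Count zeros per column; isnull() would report 0 for all of these.
zero_counts = (toy[zero_as_missing] == 0).sum()
print(zero_counts)

# Optionally mark the zeros as NaN so isnull() and imputers can see them.
cleaned = toy.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
print(cleaned.isnull().sum())
```

Outcome is left out of the list on purpose, since 0 is a valid class label there.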

**Number of unique values in each column**

In [None]:
# Count distinct values per column (avoids shadowing the built-in dict)
unique_counts=df.nunique()
unique_counts.to_frame('unique')

> **Checking the data types of all the columns in the dataset**
**All the columns are numeric, so there is no need to encode them**

In [None]:
df.dtypes

> Most Pregnancies values fall between 0 and 5
> 
> Most Glucose values fall between 75 and 140
> 
> Most BloodPressure values fall between 60 and 75
> 
> SkinThickness is spread widely from 0 to 40
> 
> Most patients have Insulin values between 0 and 200
> 
> BMI mostly lies between 20 and 60
> 
> Most DiabetesPedigreeFunction values lie between 0.0 and 0.6
> 
> Most patients are between 20 and 40 years old
> 
> Outcome is either 0 or 1


In [None]:
df.hist(figsize=(20,10))
plt.show()

> Box plots to find the outliers in each column

In [None]:
# Draw one box plot per predictor on a 2x4 grid
features=['Pregnancies','Glucose','BloodPressure','SkinThickness',
          'Insulin','BMI','DiabetesPedigreeFunction','Age']
fig,axes=plt.subplots(2,4,figsize=(20,12))
for ax,col in zip(axes.ravel(),features):
    # Pass the column as x=...; newer seaborn releases warn or fail
    # when it is given positionally together with data=
    sns.boxplot(x=df[col],ax=ax)
plt.show()
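
Box plots draw their whiskers at 1.5 times the interquartile range (IQR) by default, and points beyond them are shown as outliers. The same rule can be applied numerically. A minimal sketch on a made-up series (the helper name `iqr_outliers` and the sample values are assumptions):

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values of s lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]

# Invented insulin-like readings with one extreme value.
sample = pd.Series([80, 85, 90, 95, 100, 105, 110, 600])
outliers = iqr_outliers(sample)
print(outliers.tolist())
```

With the notebook's frame, a call such as `iqr_outliers(df['Insulin'])` would list the points a box plot marks individually.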

In [None]:
!pip install autoviz

In [None]:
!pip install xlrd

In [None]:
from autoviz.AutoViz_Class import AutoViz_Class
av=AutoViz_Class()

In [None]:
dft = av.AutoViz('/kaggle/input/pima-indians-diabetes-database/diabetes.csv', 
                 dfte=df,
                 header=0, 
                 verbose=2, 
                 lowess=False,
                 chart_format="svg", 
                 max_rows_analyzed=1000, 
                 max_cols_analyzed=10)

> **Preprocessing**

In [None]:
X=df.drop('Outcome',axis='columns')
y=df['Outcome']

In [None]:
print(X.shape)
print(y.shape)

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
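
Passing `stratify=y` keeps the class proportions the same in the training and test parts, and a fixed `random_state` makes the split repeatable between runs. A minimal sketch on synthetic labels (the `random_state` value and the toy arrays are assumptions, not the notebook's settings):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 35% positive class.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 35 + [0] * 65)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Both parts keep the same share of positives as the full data.
print(y_tr.mean(), y_te.mean())
```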

> **Scaling the features to the [0, 1] range**

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)
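
`MinMaxScaler` maps each feature using the minimum and maximum seen during `fit`, that is x' = (x - min) / (max - min). Because the scaler is fitted on the training set only, test values outside the training range can land outside [0, 1]. A small sketch with invented numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[20.0], [30.0], [60.0]])   # invented ages
test = np.array([[40.0], [70.0]])

scaler_demo = MinMaxScaler().fit(train)      # learns min=20, max=60
scaled_test = scaler_demo.transform(test)

# Same result by hand: (x - min) / (max - min)
manual = (test - train.min()) / (train.max() - train.min())
print(scaled_test.ravel(), manual.ravel())
```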

# Models

> **Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,confusion_matrix,mean_absolute_error,mean_squared_error,r2_score
model=LogisticRegression()
model.fit(X_train,y_train)
y_pred=model.predict(X_test)
print("Classification Report is:\n",classification_report(y_test,y_pred))
print("Confusion Matrix:\n",confusion_matrix(y_test,y_pred))
print("Testing Score:\n",model.score(X_test,y_test)*100)
print("Mean Squared Error:\n",mean_squared_error(y_test,y_pred))
print("R2 score is:\n",r2_score(y_test,y_pred))
print(model.get_params())
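
The cell above fits logistic regression with its default settings. If tuning is wanted, one option is a grid search over the regularisation strength `C`. The sketch below uses synthetic data from `make_classification` in place of `X_train` and `y_train`; the grid values are assumptions, not the author's choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in with 8 features, like the diabetes predictors.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Smaller C means stronger L2 regularisation.
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```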

> **KNN & Hyperparameter Tuning**

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model=KNeighborsClassifier()
model.fit(X_train,y_train)
from sklearn.metrics import mean_absolute_error,mean_squared_error,confusion_matrix,r2_score,accuracy_score
y_pred=model.predict(X_test)
print("Classification Report is:\n",classification_report(y_test,y_pred))
print("Confusion Matrix:\n",confusion_matrix(y_test,y_pred))
print("Testing Score:\n",model.score(X_test,y_test)*100)
print("Mean Squared Error:\n",mean_squared_error(y_test,y_pred))
print("R2 score is:\n",r2_score(y_test,y_pred))
print(model.get_params())
print(accuracy_score(y_test,y_pred)*100)
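
For 0/1 labels, the squared error of each prediction is 1 when it is wrong and 0 when it is right, so the mean squared error printed above is simply the misclassification rate, 1 - accuracy. R² is a regression metric and is hard to interpret for class labels. A small check with invented labels:

```python
from sklearn.metrics import accuracy_score, mean_squared_error

y_true_demo = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred_demo = [0, 1, 0, 0, 1, 1, 0, 1]   # two mistakes out of eight

mse = mean_squared_error(y_true_demo, y_pred_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)
print(mse, 1 - acc)
```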

In [None]:
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV
params={
    'n_neighbors':[1,3,5,7,9],
    'weights':['uniform', 'distance'],
    'algorithm':['auto', 'ball_tree', 'kd_tree', 'brute']
}
model2=GridSearchCV(model,param_grid=params,cv=10,n_jobs=None,verbose=True)
model2.fit(X_train,y_train)
y_pred=model2.predict(X_test)
print('best estimator: ',model2.best_estimator_)
print('best params: ',model2.best_params_)
print('classification report: ',classification_report(y_test,y_pred))
print('mae: ',mean_absolute_error(y_test,y_pred))
print('mse: ',mean_squared_error(y_test,y_pred))
print('r2 score: ',r2_score(y_test,y_pred))
print('model2-score: ',model2.score(X_test,y_test)*100)
print('accuracy score: ',accuracy_score(y_test,y_pred)*100)
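
In the grid search above, the scaler was fitted on the whole training set before cross-validation, so every validation fold has already influenced the scaling. One common way to avoid this is to wrap the scaler and the classifier in a `Pipeline`, so the scaler is refitted inside each fold. A sketch on synthetic data (the step names `scale` and `knn` and the reduced grid are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = Pipeline([("scale", MinMaxScaler()), ("knn", KNeighborsClassifier())])

# Pipeline parameters are addressed as <step name>__<parameter>.
grid = {
    "knn__n_neighbors": [1, 3, 5, 7, 9],
    "knn__weights": ["uniform", "distance"],
}

knn_search = GridSearchCV(pipe, grid, cv=5)
knn_search.fit(X_demo, y_demo)
print(knn_search.best_params_)
```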

> **Decision Tree**

In [None]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier(max_depth=6, random_state=123,criterion='entropy')

dtree.fit(X_train,y_train)
y_pred=dtree.predict(X_test)
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
print("Classification Report is:\n",classification_report(y_test,y_pred))
print("Confusion Matrix:\n",confusion_matrix(y_test,y_pred))
print("Testing Score:\n",dtree.score(X_test,y_test)*100)
print("Mean Squared Error:\n",mean_squared_error(y_test,y_pred))
print("R2 score is:\n",r2_score(y_test,y_pred))

> **Random Forest Classifier**

In [None]:
from sklearn.ensemble import RandomForestClassifier
model=RandomForestClassifier(n_estimators=100,criterion='entropy',random_state=0)
model.fit(X_train,y_train)
y_pred=model.predict(X_test)  # predict with the random forest, not the earlier decision tree
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
print("Classification Report is:\n",classification_report(y_test,y_pred))
print("Confusion Matrix:\n",confusion_matrix(y_test,y_pred))
print("Testing Score:\n",model.score(X_test,y_test)*100)
print("Mean Squared Error:\n",mean_squared_error(y_test,y_pred))
print("R2 score is:\n",r2_score(y_test,y_pred))
print(model.get_params())
print('accuracy score',accuracy_score(y_test,y_pred)*100)

> **Gradient Boosting Algorithm**

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
model=GradientBoostingClassifier()
model.fit(X_train,y_train)
y_pred=model.predict(X_test)
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
print("Classification Report is:\n",classification_report(y_test,y_pred))
print("Confusion Matrix:\n",confusion_matrix(y_test,y_pred))
print("Testing Score:\n",model.score(X_test,y_test)*100)
print("Mean Squared Error:\n",mean_squared_error(y_test,y_pred))
print("R2 score is:\n",r2_score(y_test,y_pred))
print(model.get_params())
print('accuracy score',accuracy_score(y_test,y_pred)*100)

> **XGBoost Classifier**

In [None]:
from xgboost import XGBClassifier
# Use a classification objective; 'reg:linear' is a regression objective
model1=XGBClassifier(objective='binary:logistic', colsample_bytree=0.3, learning_rate=0.1, max_depth=5, reg_alpha=10, n_estimators=10)
model1.fit(X_train,y_train)
y_pred=model1.predict(X_test)
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
print("Classification Report is:\n",classification_report(y_test,y_pred))
print("Confusion Matrix:\n",confusion_matrix(y_test,y_pred))
print("Testing Score:\n",model1.score(X_test,y_test)*100)
print("Mean Squared Error:\n",mean_squared_error(y_test,y_pred))
print("R2 score is:\n",r2_score(y_test,y_pred))
print(model1.get_params())
print('accuracy score',accuracy_score(y_test,y_pred)*100)

# Logistic Regression performed best on the Diabetes dataset in this run
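
A single train/test split can favour one model by chance. One way to make such a comparison more robust is k-fold cross-validation, scoring every model on the same folds. A sketch on synthetic data (the model list, fold count and `random_state` values are assumptions, and the scores say nothing about the diabetes data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean accuracy per model across the same five folds.
cv_scores = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X_demo, y_demo, cv=folds, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```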