# <center> <div class="alert alert-block alert-info">  <span style="color:crimson;"> Diabetes Prediction  </center>

![diabetes](https://blogs.biomedcentral.com/on-medicine/wp-content/uploads/sites/6/2019/05/AdobeStock_168931649.t5cb8050a.m800.xfO8gsyEaog6WSrZjeMVZW26gzxV9HbkmhltC8DUSpRI-620x342.jpeg)

*Diabetes is a disease that occurs when your blood glucose, also called blood sugar, is too high. Blood glucose is your main source of energy and comes from the food you eat. Insulin, a hormone made by the pancreas, helps glucose from food get into your cells to be used for energy.*

# <center> <div class="alert alert-block alert-info">  <span style="color:crimson;"> Data Description </center>
    
* **Pregnancies**: Number of times pregnant
* **Glucose**: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
* **BloodPressure**: Diastolic blood pressure (mm Hg)
* **SkinThickness**: Triceps skin fold thickness (mm)
* **Insulin**: 2-Hour serum insulin (mu U/ml)
* **BMI**: Body mass index (weight in kg/(height in m)^2)
* **DiabetesPedigreeFunction**: Diabetes pedigree function
* **Age**: Age (years)
* **Outcome**: Class variable (0 or 1)
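
The BMI formula in the list above can be checked by hand. This is a minimal sketch; the function name `bmi` and the sample weight and height are made up for illustration and are not taken from the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg, 1.75 m
example = bmi(70.0, 1.75)
print(round(example, 2))
```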

In [None]:
#importing Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import cufflinks as cf
from scipy import stats
import plotly.express as px
import matplotlib.pyplot as plt

from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot
init_notebook_mode(connected=True)
cf.go_offline()

# <center> <div class="alert alert-block alert-info">  <span style="color:crimson;"> Exploratory Data Analysis </center>

In [None]:
df = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")
df.head(n=5)

In [None]:
df.shape

**Statistical Insights**

In [None]:
df.describe()

1. The following features have a minimum value of zero, which is not physiologically plausible, so zeros in these columns are treated as missing values:
* **BloodPressure**
* **Glucose**
* **SkinThickness**
* **Insulin**
* **BMI**
2. The maximum Insulin value is 846, which may be an outlier.
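
One way to see why these zeros matter: a mean computed over a column that still contains placeholder zeros is pulled downward. The sketch below uses a small made-up list of insulin readings (not rows from the dataset) and a hypothetical helper `impute_zeros`, written in plain Python so it runs on its own.

```python
from statistics import mean


def impute_zeros(values):
    """Replace zeros with the mean of the non-zero entries."""
    observed = [v for v in values if v != 0]
    if not observed:
        return list(values)  # nothing to estimate the mean from
    fill = mean(observed)
    return [fill if v == 0 else v for v in values]


insulin = [0, 94, 168, 0, 88]  # made-up readings, 0 = not measured

naive_mean = mean(insulin)                           # zeros counted as real values
observed_mean = mean(v for v in insulin if v != 0)   # zeros ignored
imputed = impute_zeros(insulin)

print("mean with zeros:   ", naive_mean)
print("mean without zeros:", round(observed_mean, 2))
print("imputed column:    ", [round(v, 2) for v in imputed])
```

The gap between the two means grows with the share of zeros in a column.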


In [None]:
df.info()

* BMI and DiabetesPedigreeFunction are float columns; the remaining columns are integers.

# Data cleaning

**Checking for null values and imputing zeros**

In [None]:
df.isnull().sum()

In [None]:
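# Columns where a zero is a placeholder for a missing measurement
col = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']

for i in col:
    # Mark zeros as missing, then fill with the mean of the observed values
    df[i] = df[i].replace(0, np.nan)
    df[i] = df[i].fillna(df[i].mean())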

**Checking Duplicates**

In [None]:
duplicate = df[df.duplicated()]
duplicate

* No duplicate values

# Data Visualization

In [None]:
px.histogram(df,x='Outcome',color='Outcome',barmode='group')

* 500 patients without diabetes (Outcome = 0)
* 268 patients with diabetes (Outcome = 1)
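
The class counts above imply a baseline worth keeping in mind when reading the accuracy scores later on: a model that always predicts the majority class is already right for every patient in that class. A short worked calculation using only the two counts listed above (full dataset, before any rows are removed or split off):

```python
n_negative = 500  # Outcome = 0
n_positive = 268  # Outcome = 1
total = n_negative + n_positive

# Accuracy of a classifier that always predicts the larger class
majority_baseline = max(n_negative, n_positive) / total
positive_rate = n_positive / total

print(f"total rows: {total}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"share of positive cases: {positive_rate:.3f}")
```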

In [None]:
px.histogram(df,x='Pregnancies',color='Outcome',barmode='group')

In [None]:
px.histogram(df,x='Glucose',color='Outcome')

* A blood sugar level below 140 mg/dL (7.8 mmol/L) is normal. A reading of 200 mg/dL (11.1 mmol/L) or higher after two hours indicates diabetes. A reading between 140 and 199 mg/dL (7.8 mmol/L and 11.0 mmol/L) indicates prediabetes.
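
The cutoffs above translate directly into a small lookup. This is only an illustrative sketch: the function name `classify_ogtt` is hypothetical, and a reading of exactly 200 mg/dL is treated as diabetes, following the usual "200 or above" reading of the cutoff.

```python
def classify_ogtt(glucose_mg_dl: float) -> str:
    """Label a 2-hour oral glucose tolerance reading given in mg/dL."""
    if glucose_mg_dl < 140:
        return "normal"
    if glucose_mg_dl < 200:
        return "prediabetes"
    return "diabetes"


# A few boundary values
for reading in (120, 140, 199, 200):
    print(reading, classify_ogtt(reading))
```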

In [None]:
px.histogram(df,x='BloodPressure',color='Outcome')

* If you have diabetes, you are twice as likely to have high blood pressure. Untreated, high blood pressure can raise your risk for heart disease.

In [None]:
px.histogram(df,x='SkinThickness',color='Outcome')

In [None]:
px.histogram(df,x='Insulin',color='Outcome')

* A normal measurement of free insulin is less than 17 mcU/mL

In [None]:
px.histogram(df,x='BMI',color='Outcome')

In [None]:
px.histogram(df,x='DiabetesPedigreeFunction',color='Outcome')

In [None]:
px.histogram(df,x='Age',color='Outcome')

* Patients older than 35 show a higher proportion of diabetes cases

In [None]:
px.scatter(df,x='Age',y='BMI',color='Outcome',size='Pregnancies',hover_data=['BloodPressure','Insulin'])

* Above age 35 there are more yellow bubbles (Outcome = 1)

In [None]:
px.box(df,points='all')

* Insulin has the most outliers

In [None]:
px.violin(df,x='Insulin',box=True,points='all',color='Outcome')

**Outliers removal**

In [None]:
zscore = np.abs(stats.zscore(df))
print(zscore)

In [None]:
threshold = 4
# Row and column indices of every value whose |z-score| exceeds the threshold
print(np.where(zscore > threshold))

In [None]:
# Keep only rows where every column is within the z-score threshold
df1 = df[(zscore < threshold).all(axis=1)]

In [None]:
df.shape,df1.shape
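
The z-score filter used above can be reproduced on a toy sample to see what it keeps and drops. The sketch below is plain Python with made-up rows (glucose, insulin) and hypothetical helpers `zscores` and `keep_rows`. It uses the population standard deviation, which is the default in `scipy.stats.zscore`. The threshold is lowered to 2 here because, with only six rows, no value can reach a z-score of 4.

```python
from statistics import mean, pstdev


def zscores(values):
    """Absolute z-scores using the population standard deviation (ddof=0)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [abs(v - mu) / sigma for v in values]


def keep_rows(rows, threshold):
    """Keep rows in which every column has |z| below the threshold."""
    columns = list(zip(*rows))
    z_by_col = [zscores(col) for col in columns]
    return [
        row
        for i, row in enumerate(rows)
        if all(z[i] < threshold for z in z_by_col)
    ]


# Made-up (glucose, insulin) rows; one insulin value is extreme
rows = [(100, 80), (110, 90), (105, 85), (95, 75), (102, 846), (108, 88)]

kept = keep_rows(rows, threshold=2)
print(len(rows), "->", len(kept))
print(kept)
```

A row is dropped as soon as a single column exceeds the threshold, which is why one extreme insulin reading removes the whole record.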

In [None]:
px.imshow(df1.corr())

# <center> <div class="alert alert-block alert-info">  <span style="color:crimson;"> Data Preprocessing </center>

In [None]:
data = df1.drop(columns=['Outcome'])
data_target = df1['Outcome']

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(data,data_target,test_size=0.3,random_state=42)

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
x_train=scaler.fit_transform(x_train)
x_test=scaler.transform(x_test)
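
The scaling step above fits on the training split and only transforms the test split. A minimal plain-Python sketch of the same idea, using the min-max formula x' = (x - min) / (max - min); the helper names and the age values are made up for illustration:

```python
def fit_min_max(train_column):
    """Learn the minimum and maximum from the training values only."""
    return min(train_column), max(train_column)


def transform_min_max(column, lo, hi):
    """Scale values with x' = (x - lo) / (hi - lo)."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(x - lo) / span for x in column]


train_age = [21, 30, 45, 60]  # made-up training values
test_age = [25, 70]           # made-up test values

lo, hi = fit_min_max(train_age)
scaled_train = transform_min_max(train_age, lo, hi)
scaled_test = transform_min_max(test_age, lo, hi)

print([round(v, 3) for v in scaled_train])
print([round(v, 3) for v in scaled_test])
```

Because the minimum and maximum come from the training values alone, a test value outside that range lands outside [0, 1], as the second test value does here.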

# <center> <div class="alert alert-block alert-info">  <span style="color:crimson;"> Models </center>

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score,confusion_matrix

**LogisticRegression**

In [None]:
log_reg = LogisticRegression()
log_reg.fit(x_train,y_train)

log_acc=accuracy_score(y_test,log_reg.predict(x_test))

print("Train Set Accuracy:"+str(accuracy_score(y_train,log_reg.predict(x_train))*100))
print("Test Set Accuracy:"+str(accuracy_score(y_test,log_reg.predict(x_test))*100))

**DecisionTree**

In [None]:
d_tree = DecisionTreeClassifier()
d_tree.fit(x_train,y_train)

d_acc=accuracy_score(y_test,d_tree.predict(x_test))

print("Train Set Accuracy:"+str(accuracy_score(y_train,d_tree.predict(x_train))*100))
print("Test Set Accuracy:"+str(accuracy_score(y_test,d_tree.predict(x_test))*100))

**RandomForest**

In [None]:
r_for = RandomForestClassifier()
r_for.fit(x_train,y_train)

r_acc=accuracy_score(y_test,r_for.predict(x_test))

print("Train Set Accuracy:"+str(accuracy_score(y_train,r_for.predict(x_train))*100))
print("Test Set Accuracy:"+str(accuracy_score(y_test,r_for.predict(x_test))*100))

**KNearestNeighbors**

In [None]:
k_nei = KNeighborsClassifier()
k_nei.fit(x_train,y_train)

k_acc = accuracy_score(y_test,k_nei.predict(x_test))

print("Train set Accuracy:"+str(accuracy_score(y_train,k_nei.predict(x_train))*100))
print("Test Set Accuracy:"+str(accuracy_score(y_test,k_nei.predict(x_test))*100))

**SupportVector**

In [None]:
s_vec = SVC()
s_vec.fit(x_train,y_train)

s_acc = accuracy_score(y_test,s_vec.predict(x_test))

print("Train set Accuracy:"+str(accuracy_score(y_train,s_vec.predict(x_train))*100))
print("Test Set Accuracy:"+str(accuracy_score(y_test,s_vec.predict(x_test))*100))

**GaussianNB**

In [None]:
g_clf = GaussianNB()
g_clf.fit(x_train,y_train)

g_acc = accuracy_score(y_test,g_clf.predict(x_test))

print("Train set Accuracy:"+str(accuracy_score(y_train,g_clf.predict(x_train))*100))
print("Test Set Accuracy:"+str(accuracy_score(y_test,g_clf.predict(x_test))*100))

**VotingClassifier**

In [None]:
from sklearn.ensemble import VotingClassifier
model1 = LogisticRegression(random_state=1)
model2 = RandomForestClassifier(random_state=1)
model3 = SVC(random_state=1)
model4 = KNeighborsClassifier()
model = VotingClassifier(estimators=[('lr', model1), ('rf', model2),('sc',model3)], voting='hard')
model.fit(x_train,y_train)
model.score(x_test,y_test)
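
With `voting='hard'`, each model casts one vote per sample and the most common label wins. The sketch below shows that idea in plain Python; it is not scikit-learn's implementation. The helper `hard_vote` and the prediction lists are made up, and ties are resolved in favour of the smallest label as an assumption of this sketch.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Majority vote across models for each sample.

    predictions_per_model: list of equal-length label lists, one per model.
    Ties go to the smallest label (an assumption made for this sketch).
    """
    voted = []
    for sample_votes in zip(*predictions_per_model):
        counts = Counter(sample_votes)
        top = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


# Made-up predictions from three models on four samples
lr_pred = [0, 1, 1, 0]
rf_pred = [0, 1, 0, 0]
svc_pred = [1, 1, 0, 0]

print(hard_vote([lr_pred, rf_pred, svc_pred]))
```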

In [None]:
score = model.score(x_test,y_test)
train_scored = model.score(x_train,y_train)
print("VotingClassifier Train Score:",train_scored*100)
print("VotingClassifier Test Score:",score*100)
m_acc = accuracy_score(y_test,model.predict(x_test))

In [None]:
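# Base estimator passed positionally: the keyword was renamed from base_estimator to estimator in newer scikit-learn
B_clf = BaggingClassifier(SVC(),n_estimators=10, random_state=6)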
B_clf.fit(x_train,y_train)

B_acc = accuracy_score(y_test,B_clf.predict(x_test))

print("Train set Accuracy:"+str(accuracy_score(y_train,B_clf.predict(x_train))*100))
print("Test Set Accuracy:"+str(accuracy_score(y_test,B_clf.predict(x_test))*100))

In [None]:
GB_clf = GradientBoostingClassifier(n_estimators=100,random_state=6)
GB_clf.fit(x_train,y_train)

GB_acc = accuracy_score(y_test,GB_clf.predict(x_test))

print("Train set Accuracy:"+str(accuracy_score(y_train,GB_clf.predict(x_train))*100))
print("Test Set Accuracy:"+str(accuracy_score(y_test,GB_clf.predict(x_test))*100))

In [None]:
x_clf = XGBClassifier()
x_clf.fit(x_train,y_train)

x_acc = accuracy_score(y_test,x_clf.predict(x_test))

print("Train set Accuracy:"+str(accuracy_score(y_train,x_clf.predict(x_train))*100))
print("Test Set Accuracy:"+str(accuracy_score(y_test,x_clf.predict(x_test))*100))

In [None]:
models = pd.DataFrame({
    'Model': ['Logistic','KNN', 'SVC',  'Decision Tree Classifier',
             'Random Forest Classifier',  'Gaussian','Voting Classifier','Bagging','Boosting','xgboost'],
    'Score': [ log_acc,k_acc, s_acc, d_acc, r_acc, g_acc,m_acc,B_acc,GB_acc,x_acc]
})

models.sort_values(by = 'Score', ascending = False)

In [None]:
px.bar(models,x='Model',y='Score',color='Model')

In [None]:
y_predict=s_vec.predict(x_test)
# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
conf_mat = confusion_matrix(y_test,y_predict)

In [None]:
from mlxtend.plotting import plot_confusion_matrix
 
fig, ax = plot_confusion_matrix(conf_mat=conf_mat, figsize=(6, 6), cmap=plt.cm.Greens)
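
Accuracy alone does not show which kind of mistake a model makes, which is what the confusion matrix adds. As a rough sketch, the four cells and the metrics derived from them can be computed by hand; the labels below are made up and the helper `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0/1."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp


# Made-up true labels and predictions
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # of the predicted positives, how many are real
recall = tp / (tp + fn)     # of the real positives, how many were found

print("tn, fp, fn, tp =", tn, fp, fn, tp)
print("accuracy:", accuracy)
print("precision:", round(precision, 3))
print("recall:", round(recall, 3))
```

For a screening task like this one, recall on the positive class is often the number to watch, since a missed diabetic patient is usually the costlier error.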

**<center> Suggestions are welcome </center>**

# <center> <div class="alert alert-block alert-info">  <span style="color:crimson;"> Done </center>
