# About Dataset
# Context
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict, based on diagnostic measurements, whether a patient has diabetes.

# Content
Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

1) Pregnancies: Number of times pregnant
2) Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3) BloodPressure: Diastolic blood pressure (mm Hg)
4) SkinThickness: Triceps skin fold thickness (mm)
5) Insulin: 2-Hour serum insulin (mu U/ml)
6) BMI: Body mass index (weight in kg/(height in m)^2)
7) DiabetesPedigreeFunction: Diabetes pedigree function
8) Age: Age (years)
9) Outcome: Class variable (0 or 1)
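
The BMI definition in item 6 can be checked by hand. A minimal sketch, assuming a hypothetical helper `bmi` and made-up weight and height values (neither comes from the dataset):

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kg divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Illustrative values only; they are not taken from the dataset.
example = bmi(weight_kg=86.0, height_m=1.60)
print(round(example, 1))  # prints 33.6
```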

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score

In [2]:
df=pd.read_csv(r"..\Dataset\diabetes.csv")
df.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
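
Several of the rows above contain 0 for `SkinThickness` and `Insulin`, values that are not physiologically plausible. The sketch below counts such zeros in the five displayed rows. It assumes that a zero in these columns is a placeholder for a missing measurement, which the dataset description does not state; the names `SUSPECT_COLUMNS` and `count_zeros` are hypothetical.

```python
import csv
import io

# The five rows shown by df.head(), copied from the output above (index column omitted).
HEAD_ROWS = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
"""

# Assumption: a zero in these columns is a placeholder, not a real measurement.
SUSPECT_COLUMNS = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def count_zeros(csv_text, columns):
    """Return {column: number of rows whose value equals zero}."""
    counts = {name: 0 for name in columns}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name in columns:
            if float(row[name]) == 0:
                counts[name] += 1
    return counts


zero_counts = count_zeros(HEAD_ROWS, SUSPECT_COLUMNS)
print(zero_counts)
```

On the full frame, the equivalent pandas check would be `(df[SUSPECT_COLUMNS] == 0).sum()`.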


In [3]:
df.shape

(768, 9)

In [4]:
X_train,X_test,Y_Train,Y_test=train_test_split(df.drop("Outcome",axis=1),df["Outcome"],test_size=0.3,random_state=42)

In [5]:
X_train.shape,Y_Train.shape,X_test.shape,Y_test.shape

((537, 8), (537,), (231, 8), (231,))
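
The shapes above follow from `test_size=0.3` applied to 768 rows. A minimal sketch of the arithmetic, assuming the test count is rounded up and the training split takes the remainder (the helper `split_sizes` is hypothetical, not a scikit-learn function):

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size.

    Assumes the test count is rounded up and the training split
    receives whatever rows are left.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


print(split_sizes(768, 0.3))  # prints (537, 231)
```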

In [6]:
classifier_1=LogisticRegression()
classifier_2=DecisionTreeClassifier()
classifier_3=SVC(kernel='linear')  # kernel names are lowercase; 'Linear' is not a valid value
classifier_4=GaussianNB()

In [7]:
grid_param_1={
    'penalty': ['l1', 'l2'],
    'C': [0.001, 0.01, 0.1, 1, 10],
    'solver': ['liblinear', 'saga'],
    'max_iter': [100, 200, 300]
}
grid_param_2={
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_param_3={
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}
scoring = {'accuracy': 'accuracy', 'recall': 'recall'}
gridCV_1=GridSearchCV(classifier_1,param_grid=grid_param_1,scoring=scoring, refit='recall', cv=5)
gridCV_2=GridSearchCV(classifier_2,param_grid=grid_param_2,scoring=scoring, refit='recall', cv=5)
gridCV_3=GridSearchCV(classifier_3,param_grid=grid_param_3,scoring=scoring, refit='recall', cv=5)
gridCV_1.fit(X_train,Y_Train)
gridCV_2.fit(X_train,Y_Train)
gridCV_3.fit(X_train,Y_Train)
classifier_4.fit(X_train,Y_Train)
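
A quick way to see how much work the three searches above request is to count the parameter combinations. The sketch below repeats the three grids and multiplies the candidates by the `cv=5` folds. The helper name `n_candidates` is hypothetical, and the fit counts exclude the single final refit that each search performs.

```python
from math import prod

# Same grids as in the cell above.
grid_param_1 = {
    'penalty': ['l1', 'l2'],
    'C': [0.001, 0.01, 0.1, 1, 10],
    'solver': ['liblinear', 'saga'],
    'max_iter': [100, 200, 300],
}
grid_param_2 = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
grid_param_3 = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto'],
}
CV_FOLDS = 5


def n_candidates(grid):
    """Number of parameter combinations in a single-dict grid."""
    return prod(len(values) for values in grid.values())


for name, grid in [("logistic regression", grid_param_1),
                   ("decision tree", grid_param_2),
                   ("SVC", grid_param_3)]:
    candidates = n_candidates(grid)
    print(f"{name}: {candidates} candidates, {candidates * CV_FOLDS} cross-validation fits")
```

Because the searches use `refit='recall'`, `gridCV_1.best_params_` and `gridCV_1.best_score_` refer to the combination with the highest mean cross-validated recall, not accuracy.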




In [13]:
Y_predict_1=gridCV_1.predict(X_test)
Y_predict_2=gridCV_2.predict(X_test)
Y_predict_3=gridCV_3.predict(X_test)
Y_predict_4=classifier_4.predict(X_test)

In [14]:
print("For logistic regression :")
print("accuracy :",accuracy_score(Y_test,Y_predict_1))
print("precision :",precision_score(Y_test, Y_predict_1))
print("recall :",recall_score(Y_test,Y_predict_1))
print("F1_score :",f1_score(Y_test,Y_predict_1))

For logistic regression :
accuracy : 0.7359307359307359
precision : 0.6172839506172839
recall : 0.625
F1_score : 0.6211180124223602
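
All four scores can be derived from the confusion-matrix counts. The counts below are not printed anywhere in the notebook; they are inferred from the logistic regression scores above together with the 231-row test set, so treat them as an assumption. The helper name `classification_scores` is hypothetical.

```python
def classification_scores(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 for the positive class.

    Assumes at least one predicted positive and one actual positive,
    so none of the denominators is zero.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Inferred counts (assumption): 50 true positives, 31 false positives,
# 30 false negatives, 120 true negatives; 231 test rows in total.
scores = classification_scores(tp=50, fp=31, fn=30, tn=120)
for metric, value in scores.items():
    print(metric, ":", value)
```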


In [15]:
print("For Decision tree :")
print("accuracy :",accuracy_score(Y_test,Y_predict_2))
print("precision :",precision_score(Y_test, Y_predict_2))
print("recall :",recall_score(Y_test,Y_predict_2))
print("F1_score :",f1_score(Y_test,Y_predict_2))

For Decision tree :
accuracy : 0.70995670995671
precision : 0.5730337078651685
recall : 0.6375
F1_score : 0.603550295857988


In [16]:
print("For SVC :")
print("accuracy :",accuracy_score(Y_test,Y_predict_3))
print("precision :",precision_score(Y_test, Y_predict_3))
print("recall :",recall_score(Y_test,Y_predict_3))
print("F1_score :",f1_score(Y_test,Y_predict_3))

For SVC :
accuracy : 0.7445887445887446
precision : 0.6329113924050633
recall : 0.625
F1_score : 0.6289308176100629


In [17]:
print("For naive Bayes :")
print("accuracy :",accuracy_score(Y_test,Y_predict_4))
print("precision :",precision_score(Y_test, Y_predict_4))
print("recall :",recall_score(Y_test,Y_predict_4))
print("F1_score :",f1_score(Y_test,Y_predict_4))

For naive Bayes :
accuracy : 0.7445887445887446
precision : 0.6235294117647059
recall : 0.6625
F1_score : 0.6424242424242423
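
Since the grid searches were refit on recall, it can help to line the four models up on that metric. A minimal sketch that ranks them using the test-set scores printed above, rounded to four decimals; the dictionary layout and the name `ranked` are illustrative choices, not part of the original notebook.

```python
# Test-set scores copied from the four outputs above, rounded to four decimals.
reported = {
    "Logistic regression": {"accuracy": 0.7359, "precision": 0.6173, "recall": 0.6250, "f1": 0.6211},
    "Decision tree": {"accuracy": 0.7100, "precision": 0.5730, "recall": 0.6375, "f1": 0.6036},
    "SVC": {"accuracy": 0.7446, "precision": 0.6329, "recall": 0.6250, "f1": 0.6289},
    "Gaussian naive Bayes": {"accuracy": 0.7446, "precision": 0.6235, "recall": 0.6625, "f1": 0.6424},
}

# Sort model names by recall, highest first; ties keep their original order.
ranked = sorted(reported, key=lambda name: reported[name]["recall"], reverse=True)

for name in ranked:
    row = reported[name]
    print(f"{name:22s} recall={row['recall']:.4f}  f1={row['f1']:.4f}  accuracy={row['accuracy']:.4f}")
```

On this single 231-row split, Gaussian naive Bayes has the highest recall and F1, while the differences between models are small.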


In [19]:
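from sklearn.ensemble import BaggingClassifier
bag=BaggingClassifier(estimator=GaussianNB(),n_estimators=500,random_state=42)
# Fit on the training split only; fitting on X_test would leak the evaluation rows into training.
# The scores recorded below come from a run fitted on (X_test,Y_test), so they overstate held-out performance.
bag.fit(X_train,Y_Train)
Y_predict_5=bag.predict(X_test)
print("For bagging :")
print("accuracy :",accuracy_score(Y_test,Y_predict_5))
print("precision :",precision_score(Y_test, Y_predict_5))
print("recall :",recall_score(Y_test,Y_predict_5))
print("F1_score :",f1_score(Y_test,Y_predict_5))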


For bagging :
accuracy : 0.7922077922077922
precision : 0.7051282051282052
recall : 0.6875
F1_score : 0.6962025316455697
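
Bagging trains each estimator on a bootstrap resample of the training rows and then combines their predictions. The pure-Python sketch below illustrates the idea on made-up one-dimensional data. The base learner (`fit_stump`), the helper names, and the data are all hypothetical, and majority voting is a simplification: `BaggingClassifier` averages predicted probabilities when the base estimator provides them.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Base learner: threshold halfway between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    if not pos or not neg:
        # The bootstrap sample contained one class only: always predict it.
        majority = 1 if pos else 0
        return lambda x: majority
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    higher_is_positive = mean_pos > mean_neg
    return lambda x: int((x > threshold) == higher_is_positive)


def fit_bagging(xs, ys, n_estimators, seed):
    """Train one stump per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict_bagging(models, x):
    """Majority vote over the individual stumps."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Made-up training data; the held-out points are never passed to fit_bagging.
train_x = [1, 2, 3, 4, 6, 7, 8, 9]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
held_out_x = [0.5, 9.5]

models = fit_bagging(train_x, train_y, n_estimators=25, seed=42)
predictions = [predict_bagging(models, x) for x in held_out_x]
print(predictions)
```

Keeping the held-out points out of `fit_bagging` is what makes the final check a measure of generalisation and not of memorisation.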


In [20]:
import pickle
path=r'..\Model\model.pkl'  # raw string, so the backslashes are not read as escape sequences
with open(path,'wb') as f:
    pickle.dump(bag,f)
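
To confirm that a saved file can be read back, a round trip with `pickle.load` is a reasonable check. In the sketch below a plain dict stands in for the fitted `bag` object so that it runs without scikit-learn, and a temporary directory replaces the notebook's `..\Model` folder; both are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model (assumption): any picklable object works here.
model_stand_in = {"name": "bagging", "n_estimators": 500}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Write the object in binary mode, as in the cell above.
    with open(path, "wb") as f:
        pickle.dump(model_stand_in, f)

    # Read it back, also in binary mode.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_stand_in)  # prints True
```

With the real model, the restored object would be used as `restored.predict(X_test)`. Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.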