
Import Dependencies

In [None]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.utils import shuffle

Analysis of data
### PIMA Diabetes Dataset

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/ritzx21/Diabetes-Prediction/main/diabetes.csv")

In [None]:
df = shuffle(df , random_state = 43)
df.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
557,8,110,76,0,0,27.8,0.237,58,0
383,1,90,62,18,59,25.1,1.268,25,0
680,2,56,56,28,45,24.2,0.332,22,0
205,5,111,72,28,0,23.9,0.407,27,0
188,8,109,76,39,114,27.9,0.64,31,1


In [None]:
df.tail()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
16,0,118,84,47,230,45.8,0.551,31,1
58,0,146,82,0,0,40.5,1.781,44,0
277,0,104,64,23,116,27.8,0.454,23,0
255,1,113,64,35,0,33.6,0.543,21,1
320,4,129,60,12,231,27.5,0.527,31,0


In [None]:
df.shape  # 768 rows and 9 columns

(768, 9)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 768 entries, 557 to 320
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 60.0 KB


In [None]:
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [None]:
df.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0
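The `min` row above reports 0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI. If those zeros stand for unrecorded measurements (an assumption; the notebook does not say so), they could be counted and marked as missing before modelling. A minimal sketch on a small made-up frame, where `demo`, `suspect_cols` and `cleaned` are hypothetical names and the values are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame reusing three column names from df (values are made up)
demo = pd.DataFrame({
    "Glucose": [110, 0, 56, 111],
    "BloodPressure": [76, 62, 0, 72],
    "BMI": [27.8, 25.1, 0.0, 23.9],
})

# Columns where a value of 0 is assumed to mean "not measured"
suspect_cols = ["Glucose", "BloodPressure", "BMI"]

# Count the zeros per column
zero_counts = (demo[suspect_cols] == 0).sum()

# Replace the zeros with NaN so that isnull() reports them
cleaned = demo.copy()
cleaned[suspect_cols] = cleaned[suspect_cols].replace(0, np.nan)

print(zero_counts)
print(cleaned.isnull().sum())
```

Whether to impute or drop such rows afterwards is a separate modelling decision that this sketch does not make.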


In [None]:
df['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

0 --> Non-diabetic (500 patients)

1 --> Diabetic (268 patients)
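Because the classes are unbalanced, it helps to know what accuracy a model would reach by always predicting the larger class. A short worked calculation using the counts above (the baseline is computed on the full dataset, so the figure for a particular test split may differ slightly):

```python
# Class counts copied from the value_counts() output above
counts = {0: 500, 1: 268}

total = sum(counts.values())

# Accuracy of a trivial model that always predicts the most frequent class
baseline_accuracy = max(counts.values()) / total

print(f"Majority-class baseline: {baseline_accuracy * 100:.2f}%")  # prints 65.10%
```

Any classifier trained on this data should be judged against this figure rather than against 50%.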


In [None]:
df.groupby('Outcome').mean()

Outcome,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.298,109.98,68.184,19.664,68.792,30.3042,0.429734,31.19
1,4.865672,141.257463,70.824627,22.164179,100.335821,35.142537,0.5505,37.067164


Separating the data and labels

In [None]:
X = df.drop('Outcome', axis=1)  # axis=1 drops a column; axis=0 would drop a row
Y = df['Outcome']

In [None]:
X.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
557,8,110,76,0,0,27.8,0.237,58
383,1,90,62,18,59,25.1,1.268,25
680,2,56,56,28,45,24.2,0.332,22
205,5,111,72,28,0,23.9,0.407,27
188,8,109,76,39,114,27.9,0.64,31


In [None]:
Y.head()

557    0
383    0
680    0
205    0
188    1
Name: Outcome, dtype: int64

Data Standardization

In [None]:
scaler = StandardScaler()

In [None]:
scaler.fit(X)

In [None]:
standardized_data = scaler.transform(X)

Instead of the two steps above, where we first fit the scaler and then transform the data, we can do both in one call with `scaler.fit_transform(X)`. Both approaches give the same output.
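The equivalence of the two approaches can be checked directly. A minimal sketch on a small made-up array (`X_demo` is a hypothetical name and is not the PIMA data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: 4 samples, 2 features
X_demo = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 220.0], [4.0, 160.0]])

# Two-step version: learn the mean and standard deviation, then apply them
two_step = StandardScaler().fit(X_demo).transform(X_demo)

# One-step version
one_step = StandardScaler().fit_transform(X_demo)

# Both results should match element by element
same = np.allclose(two_step, one_step)
print(same)
```

The separate `fit` and `transform` calls remain useful when the same fitted scaler has to be applied later to data it was not fitted on, as in the predictive system further down.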

In [None]:
print(standardized_data)

[[ 1.23388019 -0.34096773  0.35643175 ... -0.53211885 -0.70935431
   2.10669743]
 [-0.84488505 -0.96691063 -0.36733675 ... -0.87480081  2.40438805
  -0.70119842]
 [-0.54791859 -2.03101358 -0.67752325 ... -0.98902814 -0.42244303
  -0.95646168]
 ...
 [-1.14185152 -0.5287506  -0.26394125 ... -0.53211885 -0.05398855
  -0.87137393]
 [-0.84488505 -0.24707629 -0.26394125 ...  0.20401277  0.21480201
  -1.04154944]
 [ 0.04601433  0.25367803 -0.47073225 ... -0.57019463  0.16648011
  -0.19067191]]


In [None]:
X = standardized_data
Y = df['Outcome']

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [None]:
X_train.shape

(614, 8)

In [None]:
Y_train.shape

(614,)
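In the cells above the scaler is fitted on all 768 rows before the split, so the test rows contribute to the mean and standard deviation used for scaling. One common alternative, sketched here on synthetic data rather than the PIMA file, is to split first and let a pipeline fit the scaler on the training rows only. The names `X_demo`, `y_demo` and `pipe` are hypothetical, and `stratify` is an addition that the notebook does not use:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the PIMA feature matrix (768 x 8)
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

# Split first; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only,
# then reuses those statistics when scoring the test rows
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear"))
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)

print(X_tr.shape, X_te.shape)
```

With this layout the reported test accuracy comes from rows that neither the scaler nor the classifier saw during fitting.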

SVM (Support Vector Machine) Model

In [None]:
model = svm.SVC(kernel = 'linear')

In [None]:
model.fit(X_train , Y_train)

In [None]:
Y_predict = model.predict(X_test)

In [None]:
accuracy = accuracy_score(Y_test ,Y_predict)
print("Accuracy using SVM model : ",accuracy*100)

Accuracy using SVM model :  77.92207792207793
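Accuracy alone does not show how the errors are divided between the two classes, which matters when missing a diabetic patient is more costly than a false alarm. A small sketch with made-up labels (`y_true` and `y_pred` are hypothetical and are not the notebook's predictions) shows how a confusion matrix and recall can be read alongside accuracy:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels for 8 patients (1 = diabetic, 0 = non-diabetic)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true, y_pred)

acc = accuracy_score(y_true, y_pred)
# Recall for class 1: share of diabetic patients that were detected
rec = recall_score(y_true, y_pred)

print(cm)
print(acc, rec)
```

In the notebook, the same calls could be applied to `Y_test` and `Y_predict`.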


Random Forest


In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# No random_state is set, so the accuracy reported below can vary between runs
RFCmodel = RandomForestClassifier(max_depth = 13)

In [None]:
RFCmodel.fit(X_train, Y_train)

In [None]:
Y_predictRFC = RFCmodel.predict(X_test)

In [None]:
accuracy_RFC = accuracy_score(Y_test , Y_predictRFC)
print("Accuracy using Random Forest Classifier: ",accuracy_RFC*100)

Accuracy using Random Forest Classifier:  75.97402597402598


Multinomial Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
# Outcome has only two classes; note that `multi_class` is deprecated since scikit-learn 1.5
LRmodel = LogisticRegression(multi_class = 'multinomial')

In [None]:
LRmodel.fit(X_train , Y_train)

In [None]:
Y_predictLR = LRmodel.predict(X_test)

In [None]:
accuracy_LR = accuracy_score(Y_test , Y_predictLR)
print('Accuracy using Logistic Regression is ',accuracy_LR*100)

Accuracy using Logistic Regression is  78.57142857142857
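With 154 test rows, the three scores above correspond to 120, 117 and 121 correct predictions, so the SVM and logistic regression results differ by a single test sample. Cross-validation gives a steadier comparison than one split. The sketch below runs on synthetic data, not the PIMA file; `candidates` and `cv_scores` are hypothetical names, and `n_estimators=50` and `random_state=0` are illustrative choices that the notebook does not make:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the feature matrix and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaling sits inside each pipeline so it is refitted on every training fold
candidates = {
    "SVM (linear)": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "Random Forest": RandomForestClassifier(
        max_depth=13, n_estimators=50, random_state=0
    ),
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
}

cv_scores = {}
for name, estimator in candidates.items():
    # 5-fold cross-validated accuracy for each candidate
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the mean scores lie within one standard deviation of each other, the data gives little reason to prefer one model over another.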


Predictive System

In [None]:
import numpy as np

In [None]:
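# Feature order: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age
input_data = (5, 166, 72, 19, 175, 25.8, 0.587, 51)

# Convert to a NumPy array and reshape to one row, since we predict for a single sample
input_data_reshaped = np.asarray(input_data).reshape(1, -1)

# Apply the scaler that was fitted earlier so the input matches the model's training scale
std_data = scaler.transform(input_data_reshaped)
print(std_data)

prediction = LRmodel.predict(std_data)
print(prediction)

if prediction[0] == 0:
    print("The person is not diabetic")
else:
    print("The person is diabetic")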

[[ 0.3429808   1.41167241  0.14964075 -0.09637905  0.82661621 -0.78595734
   0.34768723  1.51108316]]
[1]
The person is diabetic
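The scaling and prediction steps above can be wrapped in one function that also returns the predicted probability, which says more than the bare label. This is a sketch under assumptions: `predict_one`, `demo_scaler` and `demo_model` are hypothetical names, and the model is trained on synthetic data so the block runs on its own:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the fitted scaler and model from the notebook
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_scaler = StandardScaler().fit(X_demo)
demo_model = LogisticRegression().fit(demo_scaler.transform(X_demo), y_demo)


def predict_one(features, scaler, model):
    """Scale one sample and return (predicted label, probability of class 1)."""
    # Reshape to a single row because the scaler and model expect 2D input
    row = np.asarray(features, dtype=float).reshape(1, -1)
    scaled = scaler.transform(row)
    label = int(model.predict(scaled)[0])
    prob_positive = float(model.predict_proba(scaled)[0, 1])
    return label, prob_positive


label, prob = predict_one(X_demo[0], demo_scaler, demo_model)
print(label, round(prob, 3))
```

In the notebook the call would be `predict_one(input_data, scaler, LRmodel)`.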




Save the model

In [None]:
import pickle

In [None]:
filename = 'diabetes_model.sav'
# The context manager closes the file even if pickling fails
with open(filename, 'wb') as f:
    pickle.dump(LRmodel, f)

In [None]:
# Load the saved model; new inputs must still be scaled with the fitted `scaler` before predicting
with open('diabetes_model.sav', 'rb') as f:
    loaded_model = pickle.load(f)
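The saved file holds only the classifier, while predictions also depend on the fitted scaler. One option is to pickle both objects together. The sketch below uses synthetic data and a temporary directory; `bundle` and the file name `diabetes_bundle_demo.sav` are hypothetical:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the fitted scaler and model from the notebook
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_scaler = StandardScaler().fit(X_demo)
demo_model = LogisticRegression().fit(demo_scaler.transform(X_demo), y_demo)

# Bundle both fitted objects so they are saved and restored together
bundle = {"scaler": demo_scaler, "model": demo_model}
path = os.path.join(tempfile.mkdtemp(), "diabetes_bundle_demo.sav")

with open(path, "wb") as f:
    pickle.dump(bundle, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# Predictions before and after the round trip should be identical
before = demo_model.predict(demo_scaler.transform(X_demo[:5]))
after = restored["model"].predict(restored["scaler"].transform(X_demo[:5]))
round_trip_ok = bool(np.array_equal(before, after))
print(round_trip_ok)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.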