## **Data Preparation & Model**

### About Dataset

Link to dataset: [Dataset](http://kaggle.com/datasets/uciml/pima-indians-diabetes-database)

### `Context`
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

### `Content`
The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

### `Acknowledgements`
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261--265). IEEE Computer Society Press.

In [1]:
# import libraries
import pandas as pd
import numpy as np

In [2]:
# set seed for reproducibility
SEED = 20
np.random.seed(SEED)

In [3]:
# load the dataset
df = pd.read_csv("../data/diabetes.csv")
df.head()

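|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |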


In [4]:
# Replace 0 with NaN in the columns where 0 stands for a missing measurement
def replace_zero(df):
    df_nan = df.copy(deep=True)
    cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
    df_nan[cols] = df_nan[cols].replace(0, np.nan)
    return df_nan

df_nan = replace_zero(df)

In [5]:
df_nan.head()

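|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | NaN | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | NaN | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | NaN | NaN | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |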


In [6]:
def find_median(frame, var):
    # Median of `var` per Outcome class, computed on the non-null rows only
    temp = frame[frame[var].notnull()]
    temp = temp[[var, 'Outcome']].groupby('Outcome')[[var]].median().reset_index()
    return temp

In [7]:
def replace_null(frame,var):
    median_df=find_median(frame,var)
    var_0=median_df[var].iloc[0]
    var_1=median_df[var].iloc[1]
    frame.loc[(frame['Outcome'] == 0) & (frame[var].isnull()), var] = var_0
    frame.loc[(frame['Outcome'] == 1) & (frame[var].isnull()), var] = var_1
    return frame[var].isnull().sum()
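
The two helpers above fill each missing value with the median of the patient's Outcome class. As a hedged alternative, the same idea can be sketched in one step with `groupby(...).transform("median")`. The toy frame and the function name `impute_by_outcome_median` are assumptions made for illustration and are not part of the notebook. Note that this style of imputation relies on the label, which is not known for a new patient at prediction time.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_nan; the values are made up for illustration.
toy = pd.DataFrame({
    "Glucose": [100.0, np.nan, 120.0, 150.0, np.nan, 170.0],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

def impute_by_outcome_median(frame, cols, target="Outcome"):
    """Return a copy where NaNs in `cols` are filled with the per-class median."""
    out = frame.copy()
    for col in cols:
        # transform() returns one median per row, aligned with the original index
        group_median = out.groupby(target)[col].transform("median")
        out[col] = out[col].fillna(group_median)
    return out

filled = impute_by_outcome_median(toy, ["Glucose"])
print(filled)
```

Unlike `replace_null`, this sketch returns a new frame instead of modifying its argument in place.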

In [8]:
df_nan.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

- Handling Null Values

In [9]:
print(str(replace_null(df_nan,'Glucose'))+ ' Nulls for Glucose')
print(str(replace_null(df_nan,'SkinThickness'))+ ' Nulls for SkinThickness')
print(str(replace_null(df_nan,'Insulin'))+ ' Nulls for Insulin')
print(str(replace_null(df_nan,'BMI'))+ ' Nulls for BMI')
print(str(replace_null(df_nan,'BloodPressure'))+ ' Nulls for BloodPressure')

0 Nulls for Glucose
0 Nulls for SkinThickness
0 Nulls for Insulin
0 Nulls for BMI
0 Nulls for BloodPressure


In [10]:
df_nan.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [11]:
# Standardize the features (zero mean, unit variance) so they share a common scale
from sklearn.preprocessing import StandardScaler

def std_scaler(df):
    scaler = StandardScaler()
    features = df.drop('Outcome', axis=1)
    # Reuse the feature names from the frame instead of hard-coding them
    x = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)
    y = df['Outcome']
    return x, y

In [12]:
X, y = std_scaler(df_nan)

In [13]:
# describe X
X.describe()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | -6.476301e-17 | 1.480297e-16 | -3.978299e-16 | 8.095376e-18 | -3.469447e-18 | 1.31839e-16 | 2.451743e-16 | 1.931325e-16 |
| std | 1.000652 | 1.000652 | 1.000652 | 1.000652 | 1.000652 | 1.000652 | 1.000652 | 1.000652 |
| min | -1.141852 | -2.551447 | -3.999727 | -2.486187 | -1.434747 | -2.070186 | -1.189553 | -1.041549 |
| 25% | -0.8448851 | -0.7202356 | -0.6934382 | -0.4603073 | -0.440843 | -0.717659 | -0.6889685 | -0.7862862 |
| 50% | -0.2509521 | -0.1536274 | -0.03218035 | -0.1226607 | -0.440843 | -0.0559387 | -0.3001282 | -0.3608474 |
| 75% | 0.6399473 | 0.6100618 | 0.6290775 | 0.3275348 | 0.3116039 | 0.6057816 | 0.4662269 | 0.6602056 |
| max | 3.906578 | 2.539814 | 4.100681 | 7.868309 | 7.909072 | 5.041489 | 5.883565 | 4.063716 |


In [14]:
y.head()

0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64

In [15]:
# Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED, stratify=y)
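
In the cells above the scaler is fit on all 768 rows before the split, so statistics from the test rows feed into the training features. One possible way to avoid this, sketched here on synthetic data, is to split first and wrap the scaler and the classifier in a scikit-learn `Pipeline`, so that the mean and variance come from the training split only. The arrays `X_demo` and `y_demo` are made up for illustration, and the choice of `SVC` is only an example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the unscaled features and the Outcome column
rng = np.random.default_rng(20)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=5.0, size=200) > 50.0).astype(int)

# Split first, then let the pipeline fit the scaler on the training rows only
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=20, stratify=y_demo
)

pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_tr, y_tr)

# The fitted scaler holds training-split statistics
scaler = pipe.named_steps["standardscaler"]
test_accuracy = pipe.score(X_te, y_te)
print(scaler.mean_, test_accuracy)
```

Calling `pipe.score` or `pipe.predict` applies the stored training statistics to the test rows before classification.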

- Implementing KNN

In [16]:
# baseline model: KNN
from sklearn.neighbors import KNeighborsClassifier

test_scores = []
train_scores = []

for i in range(5, 15):
    neigh = KNeighborsClassifier(n_neighbors=i)
    neigh.fit(X_train, y_train)
    train_scores.append(neigh.score(X_train, y_train))
    test_scores.append(neigh.score(X_test, y_test))

In [17]:
print('Max train_score is ' + str(max(train_scores)*100) + ' at K =  ' + str(train_scores.index(max(train_scores))+5))

Max train_score is 85.66775244299674 at K =  5


In [19]:
print('Max test_score is ' + str(max(test_scores)*100) + ' at K =  ' + str(test_scores.index(max(test_scores))+5))

Max test_score is 87.01298701298701 at K =  13
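
Picking K by its test-set score, as done above, tends to give an optimistic estimate because the test set takes part in model selection. A common alternative is to choose K by cross-validation on the training data and keep the test set for a single final check. The sketch below uses `GridSearchCV` over the same range of K on synthetic data from `make_classification`; the data and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for (X_train, y_train)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=20)

# 5-fold cross-validation over the same K range as the loop above (5..14)
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(5, 15))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]
print(best_k, search.best_score_)
```

After fitting, `search.best_estimator_` is a KNN model refit on all the data passed to `fit`, using the selected K.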


- Logistic Regression

In [22]:
# let's try logistic regression
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression(random_state=SEED, penalty='l2').fit(X_train, y_train)
log_model_score = log_model.score(X_test, y_test)
log_model_score

0.8311688311688312

- SVM

In [23]:
# Support Vector Machines
from sklearn.svm import SVC
svm_model = SVC().fit(X_train, y_train)
svm_predict = svm_model.predict(X_test)
svm_model.score(X_test, y_test)

0.8896103896103896

In [24]:
# Function to evaluate model performance
def model_perf(pred,Y_test):
    cmp_list=[]
    for i,j in zip(pred,Y_test):
        if i==j:
            cmp_list.append(1)
        else:
            cmp_list.append(0)
    return cmp_list

In [25]:
cmp_list = model_perf(svm_predict, y_test)

In [27]:
print('Model Accuracy Confirmation :'+ str(cmp_list.count(1)/len(y_test)))

Model Accuracy Confirmation :0.8896103896103896
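
The manual check above counts matching predictions, which is the same quantity that `sklearn.metrics.accuracy_score` computes. The short sketch below, on made-up labels, shows the equivalence and adds a confusion matrix, which also separates false positives from false negatives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Fraction of matching positions, as in model_perf
manual_accuracy = float(np.mean(y_true == y_pred))
library_accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(manual_accuracy, library_accuracy)
print(cm)
```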


- Random Forest

In [28]:
# Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(random_state=SEED, max_depth=2).fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
rf_model.score(X_test, y_test)

0.8571428571428571

- Applying Neural Network

In [30]:
import tensorflow as tf

def build_model():
    model = tf.keras.models.Sequential([
        tf.keras.layers.Dense(8, activation='relu', input_shape=[len(X_train.keys())]),
        tf.keras.layers.Dense(4, activation='relu'),
        tf.keras.layers.Dense(2, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])

    optimizer = tf.keras.optimizers.Adam(learning_rate=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-07)

    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

neural_model = build_model()



In [31]:
neural_model.summary()

In [32]:
# keeping the epochs high as dataset is small
EPOCHS = 1000
neural_pred = neural_model.fit(X_train, y_train, epochs=EPOCHS, validation_split=0.1, verbose=2)

Epoch 1/1000
18/18 - 2s - 105ms/step - accuracy: 0.5833 - loss: 0.6873 - val_accuracy: 0.7419 - val_loss: 0.6656
Epoch 2/1000
18/18 - 0s - 8ms/step - accuracy: 0.6449 - loss: 0.6693 - val_accuracy: 0.7581 - val_loss: 0.6297
Epoch 3/1000
18/18 - 0s - 9ms/step - accuracy: 0.7029 - loss: 0.6372 - val_accuracy: 0.8226 - val_loss: 0.5737
Epoch 4/1000
18/18 - 0s - 8ms/step - accuracy: 0.7554 - loss: 0.5963 - val_accuracy: 0.8387 - val_loss: 0.5216
Epoch 5/1000
18/18 - 0s - 11ms/step - accuracy: 0.7826 - loss: 0.5611 - val_accuracy: 0.8387 - val_loss: 0.4864
Epoch 6/1000
18/18 - 0s - 8ms/step - accuracy: 0.7826 - loss: 0.5310 - val_accuracy: 0.8387 - val_loss: 0.4598
Epoch 7/1000
18/18 - 0s - 8ms/step - accuracy: 0.8043 - loss: 0.5023 - val_accuracy: 0.8548 - val_loss: 0.4334
Epoch 8/1000
18/18 - 0s - 7ms/step - accuracy: 0.8080 - loss: 0.4788 - val_accuracy: 0.8548 - val_loss: 0.4012
Epoch 9/1000
18/18 - 0s - 8ms/step - accuracy: 0.8333 - loss: 0.4531 - val_accuracy: 0.8871 - val_loss: 0.387

In [33]:
# let's measure final performance
hist = pd.DataFrame(neural_pred.history)
hist['epoch'] = neural_pred.epoch
hist.tail()

|   | accuracy | loss | val_accuracy | val_loss | epoch |
|---|---|---|---|---|---|
| 995 | 0.945652 | 0.174988 | 0.935484 | 0.434741 | 995 |
| 996 | 0.945652 | 0.175265 | 0.935484 | 0.43166 | 996 |
| 997 | 0.945652 | 0.175294 | 0.935484 | 0.427576 | 997 |
| 998 | 0.945652 | 0.17515 | 0.935484 | 0.457927 | 998 |
| 999 | 0.945652 | 0.17542 | 0.935484 | 0.427878 | 999 |


In [34]:
neural_test=neural_model.predict(X_test)

5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step


In [35]:
neural_test_converted=[]
for i in neural_test:
    if i>0.5:
        neural_test_converted.append(1)
    else:
        neural_test_converted.append(0)
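
The loop above turns each sigmoid output into a class label with a 0.5 cutoff. The same conversion can be written as a single vectorized NumPy expression. The `probs` array below is a made-up stand-in for the `(n_samples, 1)` array that `predict` returns.

```python
import numpy as np

# Made-up probabilities shaped like the output of model.predict: (n_samples, 1)
probs = np.array([[0.91], [0.12], [0.50], [0.73], [0.49]])

# Strictly greater than 0.5 maps to class 1, everything else to class 0
labels = (probs > 0.5).astype(int).ravel()
print(labels)  # [1 0 0 1 0]
```

`ravel()` flattens the column of predictions into a 1-D array so it lines up with `y_test`.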

In [36]:
cmp_list = model_perf(neural_test_converted, y_test)

In [37]:
print('Test Accuracy :' + str(cmp_list.count(1)/len(y_test)*100)+' %')

Test Accuracy :82.46753246753246 %


In [None]:
import pickle
# Let's dump our SVM model
with open('svm_model.pkl', 'wb') as f:
    pickle.dump(svm_model, f)
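
The pickled `svm_model` expects features that were already standardized, but the `StandardScaler` created inside `std_scaler` is never saved. One possible approach, sketched below under the assumption that the scaler and the SVM are combined in a single `Pipeline`, is to pickle the whole pipeline so that raw measurements can be passed after loading. The data and the file name `svm_pipeline.pkl` are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the unscaled training data
rng = np.random.default_rng(20)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Scaler and classifier are saved together as one object
model = make_pipeline(StandardScaler(), SVC()).fit(X_demo, y_demo)

# Hypothetical file name, written to a temporary directory for this sketch
path = os.path.join(tempfile.mkdtemp(), "svm_pipeline.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored pipeline should reproduce the original predictions
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.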
