<h1> BONUS EXERCISE 2 - Classification task</h1>


_Machine Learning (2021), Vahid Piroozbakht_


<h3> Dataset Name : “Pima Indians Diabetes Database”.</h3>

Source : Kaggle (https://www.kaggle.com/uciml/pima-indians-diabetes-database)

<h2> 1. Introduction:</h2>

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. <a href="https://www.kaggle.com/uciml/pima-indians-diabetes-database">[1]</a>

In [1]:
# import libraries
import numpy as np
import matplotlib
import pandas as pd
import sklearn
import matplotlib.pyplot as plt 
from sklearn import model_selection
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

<h2>2. Problem Formulation: </h2>
The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

In [2]:
# Load Dataset
df=pd.read_csv('Diabetess.csv')

In [3]:
df.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72,35,,33.6,0.627,50.0,1
1,1,85.0,66,29,,26.6,0.351,31.0,0
2,8,183.0,64,0,,23.3,0.672,32.0,1
3,1,89.0,66,23,94.0,28.1,0.167,21.0,0
4,0,137.0,40,35,168.0,43.1,2.288,33.0,1


<h3>2.1. Features:</h3>

In [4]:
df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

The following features are present in the dataset:
* **Pregnancies:**  Number of times pregnant
* **Glucose:** Plasma glucose concentration over 2 hours in an oral glucose tolerance test
* **BloodPressure:** Diastolic blood pressure (mm Hg)
* **SkinThickness:** Triceps skin fold thickness (mm)
* **Insulin:** 2-Hour serum insulin (mu U/ml)
* **BMI:** Body mass index (weight in kg/(height in m)^2)
* **DiabetesPedigreeFunction:** Diabetes pedigree function (a function which scores likelihood of diabetes based on family history)
* **Age:** Age (years)
* **Outcome:** Class variable (0 if non-diabetic, 1 if diabetic)


<h3>2.2. Preprocessing features:</h3>
<h4>2.2.1 NaN values:</h4>
First, we find the columns that contain NaN values.

In [5]:
# Find the columns that contain NaN values
nanCols = [col for col in df.columns if df[col].isna().any()]
print("Columns containing NaN values: ", nanCols)

Columns containing NaN values:  ['Glucose', 'Insulin', 'BMI']


In [6]:
# Set NaN values to zero
for nCol in nanCols:
    df[nCol]=df[nCol].fillna(0)
    print("Column '",nCol,"' NaN values are set to zero.")

Column ' Glucose ' NaN values are set to zero.
Column ' Insulin ' NaN values are set to zero.
Column ' BMI ' NaN values are set to zero.
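
A per-column count of missing values before and after filling is a quick way to confirm that this step did what was intended. The sketch below uses a small toy frame built from the first five rows shown by `df.head()` rather than the full dataset; the names `toy`, `filled`, `missing_before` and `missing_after` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame built from the first five rows shown by df.head()
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, 183.0, 89.0, 137.0],
    "Insulin": [np.nan, np.nan, np.nan, 94.0, 168.0],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

missing_before = toy.isna().sum()    # number of NaN entries per column
filled = toy.fillna(0)               # same strategy as above: replace NaN with zero
missing_after = filled.isna().sum()  # should be zero for every column

print(missing_before)
print(missing_after)
```

Note that zero is not a physiologically plausible value for Glucose, Insulin or BMI, so the filled entries act as placeholders rather than real measurements.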


<h3>2.3. Training and validation sets:</h3>
With the data prepared, we split the dataset into two different subsets: a training set, used to fit the models, and a validation set, used to test and validate them.

In [7]:
# set X and y
y=df.Outcome
tmpDS=df.drop(columns=['Outcome'])
X=tmpDS
print('Shape of X is', X.shape)
print('Shape of y is', y.shape)

Shape of X is (768, 8)
Shape of y is (768,)


<h1> <center> - - - (A) - - - </center></h1>

In [8]:
# Split-out validation dataset
X_train,X_test,y_train,y_test=model_selection.train_test_split(
    X,y,test_size=0.30,random_state=7)

<h2>3. Method:</h2>

<h3>3.1. LogisticRegression:</h3>

In [9]:
log = LogisticRegression(max_iter=1000)
log.fit(X_train, y_train)
pred= log.predict(X_train)
tr_log_error = mean_squared_error(y_train, pred)    # compute training error 
print("Training error: ",tr_log_error)

pred = log.predict(X_test)    # compute predictions of validation set
val_log_error = mean_squared_error(y_test, pred)    # compute validation error 
print("Validation error: ",val_log_error)
print("Score: ",log.score(X_test,y_test))

Training error:  0.21787709497206703
Validation error:  0.22077922077922077
Score:  0.7792207792207793
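
The "error" values in this notebook are computed with `mean_squared_error` on 0/1 class labels. For such labels every squared difference is either 0 or 1, so the mean squared error equals the fraction of misclassified samples, i.e. 1 minus the accuracy. This is why the validation error and the score above add up to one. A minimal sketch with made-up labels (the arrays `y_true` and `y_pred` are hypothetical):

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical labels and predictions, chosen only for illustration
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1, 0, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 0])

mse = mean_squared_error(y_true, y_pred)  # mean of (y_true - y_pred)**2
acc = accuracy_score(y_true, y_pred)      # fraction of correct predictions

print("MSE:", mse)
print("1 - accuracy:", 1 - acc)
```

Both printed values agree up to floating-point rounding, so reporting the score alone carries the same information.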


<h3>3.2. Artificial neural network:</h3>

In [10]:
clf = MLPClassifier(solver='lbfgs', alpha=1e-5,hidden_layer_sizes=(5, 2), random_state=1,max_iter=1000)
clf.fit(X_train, y_train)
pred= clf.predict(X_train)
tr_clf_error = mean_squared_error(y_train, pred)    # compute training error 
print("Training error: ",tr_clf_error)

pred = clf.predict(X_test)    # compute predictions of validation set
val_clf_error = mean_squared_error(y_test, pred)    # compute validation error 
print("Validation error: ",val_clf_error)
print("Score: ",clf.score(X_test,y_test))

Training error:  0.34450651769087526
Validation error:  0.3722943722943723
Score:  0.6277056277056277


<h2>4. Conclusion:</h2>
Comparing the results of sections 3.1 and 3.2 on the Diabetess dataset, <b>LogisticRegression</b> reached a validation accuracy of about 78% and clearly outperformed the <b>Artificial neural network</b>, which reached about 63%.
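
The gap between the two models may partly reflect preprocessing rather than the models themselves: the scikit-learn documentation notes that multi-layer perceptrons are sensitive to feature scaling, and the features here are used unscaled. The sketch below shows one way to add standardization in front of the same MLP settings. It runs on synthetic data, not on the diabetes dataset, so it illustrates the mechanics only and is not a result of this notebook; `X_demo`, `y_demo` and `scaled_mlp` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the diabetes dataset): two features on very different scales
rng = np.random.default_rng(1)
X_demo = np.column_stack([
    rng.normal(120, 30, 400),   # glucose-like scale
    rng.normal(0.5, 0.3, 400),  # pedigree-like scale
])
# Hypothetical labelling rule, used only so the model has something to learn
y_demo = (X_demo[:, 0] / 30 + X_demo[:, 1] / 0.3 + rng.normal(0, 1, 400) > 5.7).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=7)

# Same MLP settings as in section 3.2, preceded by standardization
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2),
                  random_state=1, max_iter=1000),
)
scaled_mlp.fit(X_tr, y_tr)
scaled_score = scaled_mlp.score(X_te, y_te)
print("Validation accuracy with scaling:", scaled_score)
```

Wrapping the scaler and the classifier in one pipeline means the scaling statistics are learned from the training split only and then reused on the validation split.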

<h1> <center> - - - (B) - - - </center></h1>
Note: the task page gives the test size as <i>0.001</i>, which strongly affects the measured accuracy. I assume this is a typo and set the test size to 0.10 instead.
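
To see why the test size matters, the sketch below counts how many rows land in each subset for a dataset of 768 rows, the size reported in section 2.3. The feature array `X_demo` is a placeholder whose values are irrelevant; only the row counts are of interest.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 768  # number of rows in this dataset
X_demo = np.arange(n_samples).reshape(-1, 1)  # placeholder feature column

split_sizes = {}
for test_size in (0.001, 0.10, 0.30):
    X_tr, X_te = train_test_split(X_demo, test_size=test_size, random_state=7)
    split_sizes[test_size] = (len(X_tr), len(X_te))
    print(f"test_size={test_size}: {len(X_tr)} training rows, {len(X_te)} validation rows")
```

With a test size of 0.001 the validation set holds a single row, so the validation accuracy could only be 0 or 1. A test size of 0.10 leaves 77 validation rows.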

In [11]:
# Set Glucose, BMI, and Age, as the input variables and Outcome as the target variable
XB=df[["Glucose","BMI","Age"]]
yB=df.Outcome

In [12]:
# Split-out validation dataset
X_train,X_test,y_train,y_test=model_selection.train_test_split(
    XB,yB,test_size=0.10,random_state=7)

In [13]:
# LogisticRegression
logB = LogisticRegression(max_iter=1000)
logB.fit(X_train, y_train)
pred= logB.predict(X_train)
tr_logB_error = mean_squared_error(y_train, pred)    # compute training error 
print("Training error: ",tr_logB_error)

pred = logB.predict(X_test)    # compute predictions of validation set
val_logB_error = mean_squared_error(y_test, pred)    # compute validation error 
print("Validation error: ",val_logB_error)
print("Score: ",logB.score(X_test,y_test))

Training error:  0.23154848046309695
Validation error:  0.19480519480519481
Score:  0.8051948051948052


In [14]:
# Artificial neural network
clfB = MLPClassifier(solver='lbfgs', alpha=1e-5,hidden_layer_sizes=(5, 2), random_state=1,max_iter=1000)
clfB.fit(X_train, y_train)
pred= clfB.predict(X_train)
tr_clfB_error = mean_squared_error(y_train, pred)    # compute training error 
print("Training error: ",tr_clfB_error)

pred = clfB.predict(X_test)    # compute predictions of validation set
val_clfB_error = mean_squared_error(y_test, pred)    # compute validation error 
print("Validation error: ",val_clfB_error)
print("Score: ",clfB.score(X_test,y_test))

Training error:  0.3429811866859624
Validation error:  0.4025974025974026
Score:  0.5974025974025974


<h1> <center> - - - (C) - - - </center></h1>

In [15]:
# The patient is 35 years old, with a glucose level of 110 and a BMI of 35.
# Feature order: ["Glucose","BMI","Age"]
patientInfo=[[110,35,35]]
predict=logB.predict(patientInfo)
print("The predicted value is",predict," => (0 if non-diabetic, 1 if diabetic)")

The predicted value is [0]  => (0 if non-diabetic, 1 if diabetic)
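
A hard 0/1 label hides how confident the model is. The sketch below shows how `predict_proba` could be used to report class probabilities as well, and how passing the patient as a DataFrame with the training column names avoids depending on column order. Since the CSV file is not reproduced here, the model is fitted on synthetic stand-in data with a made-up labelling rule, so the printed numbers say nothing about the real patient; `demo`, `demo_y`, `model` and `patient` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (NOT the real dataset), using the same three column names
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Glucose": rng.normal(120, 30, n),
    "BMI": rng.normal(32, 7, n),
    "Age": rng.integers(21, 80, n),
})
# Hypothetical labelling rule, used only so the model has something to learn
demo_y = (demo["Glucose"] + rng.normal(0, 20, n) > 130).astype(int)

model = LogisticRegression(max_iter=1000).fit(demo, demo_y)

# One-row DataFrame with the same column names and order as the training data
patient = pd.DataFrame([{"Glucose": 110, "BMI": 35, "Age": 35}])
label = model.predict(patient)[0]
proba = model.predict_proba(patient)[0]  # [P(class 0), P(class 1)]

print("Predicted class:", label)
print("Class probabilities:", proba)
```

In the notebook itself the same two calls would be made on `logB` instead of the stand-in `model`.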
