# Tutorial: Machine Learning - Classification Example

*Note:* This session is graded (binary grading: "Complete/Incomplete"). Complete all the exercises and submit the .ipynb file to Canvas under the assignment **Tutorial: Machine Learning for Numerical Classification** by the end of today (03/08).


## 0. The Diabetes Dataset
Take a look at the dataset here: https://www.kaggle.com/datasets/mathchi/diabetes-data-set

Description from the website:

"This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict based on diagnostic measurements whether a patient has diabetes."

Read the description ("Content" section) on the website. According to the website, the columns of the dataset are:

- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)
- Outcome: Class variable (0 or 1)

### Today's Task:
**To build classification models that can predict whether a person has diabetes or not (i.e., Outcome) based on the inputs.**

I have split the data into two files:

1. **train.csv**: This is used to train our machine learning classifiers.
2. **test.csv**: This will be used to evaluate our classifiers' performance.

Download these two files from **Canvas->Files->Week8->train.csv** and **Canvas->Files->Week8->test.csv**

**IMPORTANT: PUT THESE TWO FILES IN THE SAME DIRECTORY AS THIS NOTEBOOK TO AVOID A `FileNotFoundError`.**

## 1. Load and Preprocess Data
Let's load the data into a Pandas DataFrame. Check and remove duplicates. Check and clean rows with NULL entries.

In [5]:
import pandas as pd

train_df = pd.read_csv("train.csv")
train_df.head()

Unnamed: 0.1,Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,466,0,74,52,10,36,27.8,0.269,22,0
1,719,5,97,76,27,0,35.6,0.378,52,1
2,319,6,194,78,0,0,23.5,0.129,59,1
3,402,5,136,84,41,88,35.0,0.286,35,1
4,752,3,108,62,24,0,26.0,0.223,25,0
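
The cell above only loads the file. A minimal sketch of the duplicate and NULL checks mentioned at the start of this section could look like the following. It uses a small made-up DataFrame (`toy_df` and its values are hypothetical, not taken from train.csv) so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv (values are made up).
toy_df = pd.DataFrame({
    "Glucose": [74, 97, 97, 136, np.nan],
    "BMI": [27.8, 35.6, 35.6, 35.0, 26.0],
    "Outcome": [0, 1, 1, 1, 0],
})

# Count fully duplicated rows, then drop them (the first occurrence is kept).
n_duplicates = int(toy_df.duplicated().sum())
deduped_df = toy_df.drop_duplicates()

# Count missing (NULL/NaN) entries per column, then drop rows containing any.
nulls_per_column = deduped_df.isnull().sum()
clean_df = deduped_df.dropna().reset_index(drop=True)

print(f"Duplicate rows removed: {n_duplicates}")
print(f"Rows with NULL entries removed: {len(deduped_df) - len(clean_df)}")
print(f"Rows remaining: {len(clean_df)}")
```

On the real data, the same calls would be applied to `train_df`. Note that some `Insulin` and `SkinThickness` entries in the preview above are 0; whether to treat those as missing values is a separate decision that this sketch does not make.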


### 1.1. Feature and Label
We will select the columns that will act as input ($x_1,x_2,\dots,x_N$) and output ($y_{actual}$).

In [6]:
train_features = train_df[["Pregnancies","Glucose","BloodPressure",
                          "SkinThickness","Insulin","BMI",
                           "DiabetesPedigreeFunction","Age"]]

train_labels = train_df["Outcome"]

train_features.head()
train_labels.head()

0    0
1    1
2    1
3    1
4    0
Name: Outcome, dtype: int64

## 2. Defining and training Classifiers

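Let's define a Logistic Regression classifier and a Neural Network classifier. Both use the `lbfgs` solver with a generous iteration limit; the Neural Network has two small hidden layers (8 and 2 units).

The kind of neural network used here is also called a Multilayer Perceptron (MLP).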

In [None]:
# Now let's define our models
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier


lr_classifier = LogisticRegression(solver='lbfgs',max_iter=10000)
mlp_classifier = MLPClassifier(solver='lbfgs', alpha=1e-5,
                               hidden_layer_sizes=(8, 2), random_state=11,max_iter=10000)


# train our models
lr_classifier.fit(train_features.to_numpy(),train_labels.to_numpy())
mlp_classifier.fit(train_features.to_numpy(),train_labels.to_numpy())

## 3. Evaluating our models' performance
We define the accuracy metric as follows:

$$\text{acc} = \frac{\text{Number of times predicted class} = \text{actual class}}{\text{Total number of examples}}$$
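
As a quick check of this definition, the sketch below computes accuracy by hand on five made-up labels (`y_true` and `y_pred` are hypothetical values, not from the diabetes data) and compares the result with scikit-learn's `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels for five examples (made up for illustration).
y_true = np.array([0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 1])

# Numerator: number of positions where predicted class equals actual class.
n_correct = int((y_pred == y_true).sum())

# Denominator: total number of examples.
manual_acc = n_correct / len(y_true)

# accuracy_score(y_true, y_pred) computes the same fraction.
sklearn_acc = accuracy_score(y_true, y_pred)

print(f"Correct predictions: {n_correct} out of {len(y_true)}")
print(f"Manual accuracy = {manual_acc}, accuracy_score = {sklearn_acc}")
```

Three of the five predictions match, so both values come out to 0.6.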

In [None]:
from sklearn.metrics import accuracy_score

#load test data
test_df = pd.read_csv("test.csv")

# Extract the input features
test_inputs = test_df[["Pregnancies","Glucose","BloodPressure",
                          "SkinThickness","Insulin","BMI",
                           "DiabetesPedigreeFunction","Age"]]

y_actual = test_df["Outcome"]

# predict using logistic regression model
y_predicted_lr = lr_classifier.predict(test_inputs.to_numpy())
lr_accuracy_score = accuracy_score(y_actual, y_predicted_lr)  # (y_true, y_pred)

# predict using MLP (neural network) model
y_predicted_mlp = mlp_classifier.predict(test_inputs.to_numpy())
mlp_accuracy_score = accuracy_score(y_actual, y_predicted_mlp)  # (y_true, y_pred)

print (f"Accuracy of the Logistic Classifier = {lr_accuracy_score}")
print (f"Accuracy of the MLP Classifier = {mlp_accuracy_score}")

Accuracy of the Logistic Classifier = 0.7467532467532467
Accuracy of the MLP Classifier = 0.6233766233766234


### 3.1. Insights:
For our dataset and configurations, the Logistic Regression model is about 74.7% accurate on our test data, compared with about 62.3% for the Neural Network (MLP) model. On this test set, Logistic Regression is therefore the better model.

## 4. Saving our best model for future use

We can store our best model on our hard drive and load it as and when we need it.

In [None]:
# Storing
import pickle

# the with-block closes the file automatically
with open("diabetes_best_model.saved", "wb") as file_to_write:
    pickle.dump(lr_classifier, file_to_write)

## 5. Loading our best model and testing it

In [None]:
import pickle
import numpy

# the with-block closes the file automatically
with open("diabetes_best_model.saved", "rb") as model_file:
    model = pickle.load(model_file)

# Let's prepare a sample input
pregnancies = 1
glucose = 200
bp = 66
skin_thickness = 20
insulin = 95
bmi = 32.9
diabetes_pedigree = 0.6
age = 28

input_data =numpy.array([[pregnancies,glucose,bp,
                        skin_thickness,insulin,bmi,
                        diabetes_pedigree,age]])

# use the model loaded from disk (not the in-memory lr_classifier)
y_predicted_lr = model.predict(input_data)

# the classifier outputs either 1 (diabetes) or 0 (no diabetes)
if y_predicted_lr[0] == 1:
    print("The person is likely to have diabetes in the near future")
else:
    print("The person is not predicted to have diabetes")

The person is likely to have diabetes in the near future
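
To check that a saved model behaves like the original, here is a self-contained sketch of the save/load round trip from Sections 4 and 5. It assumes a tiny made-up training set (`X`, `y`, `toy_model`, and the file name `toy_model.saved` are hypothetical) and writes to a temporary directory instead of the notebook folder:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data: one feature, two well-separated classes.
X = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y = np.array([0, 0, 0, 1, 1, 1])
toy_model = LogisticRegression(solver="lbfgs", max_iter=10000).fit(X, y)

# Save and reload inside a temporary directory so no files are left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.saved")
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(model_path, "rb") as f:
        restored_model = pickle.load(f)

# The restored object is a separate copy that should predict identically.
same_predictions = bool((restored_model.predict(X) == toy_model.predict(X)).all())
print(f"Restored model matches original: {same_predictions}")
```

The point of the comparison is that predictions are made with the object returned by `pickle.load`, which is what a later session would have available.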


## E1. Exercise: Let's work on a different set of features and repeat

**Create a new Jupyter Notebook and do the following**

Following the above hands-on, initialize and train Logistic Regression and MLP classifiers for predicting diabetes using the following columns:

**Feature Columns (Input):**
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- Age: Age (years)

**Label (Output):**
- Outcome: Class variable (0 or 1)

What accuracy figures are you getting for the two classifiers? Are they very different from the accuracy figures we got in Section 3 above? Write your answer down in a markdown block.

In [14]:
import pandas as pd

train_df = pd.read_csv("train.csv")
train_df.head()

Unnamed: 0.1,Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,466,0,74,52,10,36,27.8,0.269,22,0
1,719,5,97,76,27,0,35.6,0.378,52,1
2,319,6,194,78,0,0,23.5,0.129,59,1
3,402,5,136,84,41,88,35.0,0.286,35,1
4,752,3,108,62,24,0,26.0,0.223,25,0


In [15]:
train_features = train_df[["Glucose","BloodPressure","Insulin","BMI","Age"]]

train_labels = train_df["Outcome"]

train_features.head()
train_labels.head()

0    0
1    1
2    1
3    1
4    0
Name: Outcome, dtype: int64

In [16]:
# Now let's define our models
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier


lr_classifier = LogisticRegression(solver='lbfgs',max_iter=10000)
mlp_classifier = MLPClassifier(solver='lbfgs', alpha=1e-5,
                               hidden_layer_sizes=(8, 2), random_state=11,max_iter=10000)


# train our models
lr_classifier.fit(train_features.to_numpy(),train_labels.to_numpy())
mlp_classifier.fit(train_features.to_numpy(),train_labels.to_numpy())

In [18]:
from sklearn.metrics import accuracy_score

#load test data
test_df = pd.read_csv("test.csv")

# Extract the input features
test_inputs = test_df[["Glucose","BloodPressure","Insulin","BMI","Age"]]

y_actual = test_df["Outcome"]

# predict using logistic regression model
y_predicted_lr = lr_classifier.predict(test_inputs.to_numpy())
lr_accuracy_score = accuracy_score(y_actual, y_predicted_lr)  # (y_true, y_pred)

# predict using MLP (neural network) model
y_predicted_mlp = mlp_classifier.predict(test_inputs.to_numpy())
mlp_accuracy_score = accuracy_score(y_actual, y_predicted_mlp)  # (y_true, y_pred)

print (f"Accuracy of the Logistic Classifier = {lr_accuracy_score}")
print (f"Accuracy of the MLP Classifier = {mlp_accuracy_score}")

Accuracy of the Logistic Classifier = 0.7402597402597403
Accuracy of the MLP Classifier = 0.6298701298701299


In [19]:
# Storing
import pickle

# the with-block closes the file automatically
with open("diabetes_best_model.saved", "wb") as file_to_write:
    pickle.dump(lr_classifier, file_to_write)

In [22]:
import pickle
import numpy

# the with-block closes the file automatically
with open("diabetes_best_model.saved", "rb") as model_file:
    model = pickle.load(model_file)

# Let's prepare a sample input

glucose = 200
bp = 66

insulin = 95
bmi = 32.9

age = 28

input_data =numpy.array([[glucose,bp,insulin,bmi,age]])

# use the model loaded from disk (not the in-memory lr_classifier)
y_predicted_lr = model.predict(input_data)

if y_predicted_lr[0]==1:
    print ("The person is likely to have diabetes in the near future")
if y_predicted_lr[0]==0:
    print ("The person will not have diabetes")

The person is likely to have diabetes in the near future


The accuracy figures I'm getting for the two classifiers with 5 features are as follows:
- Accuracy of the Logistic Classifier = 0.7402597402597403
- Accuracy of the MLP Classifier = 0.6298701298701299

These are only slightly different from the accuracy figures we got in Section 3 with all 8 features:
- Accuracy of the Logistic Classifier = 0.7467532467532467
- Accuracy of the MLP Classifier = 0.6233766233766234

With fewer features, the Logistic Classifier's accuracy drops by about 0.65 percentage points and the MLP Classifier's accuracy rises by about 0.65 percentage points.
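
Repeating the whole pipeline for each feature set can be wrapped in a small helper. The sketch below is one possible way to do that, not part of the original tutorial: `evaluate_feature_subset` is a hypothetical function name, only Logistic Regression is evaluated for brevity, and the data is synthetic (generated with NumPy, not the diabetes files) so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def evaluate_feature_subset(train_df, test_df, feature_columns, label_column="Outcome"):
    """Train logistic regression on the chosen columns and return test accuracy."""
    model = LogisticRegression(solver="lbfgs", max_iter=10000)
    model.fit(train_df[feature_columns].to_numpy(), train_df[label_column].to_numpy())
    y_predicted = model.predict(test_df[feature_columns].to_numpy())
    return accuracy_score(test_df[label_column], y_predicted)


# Hypothetical synthetic data: Outcome depends on Glucose only; Age is noise.
rng = np.random.default_rng(0)
n_rows = 200
synthetic_df = pd.DataFrame({
    "Glucose": rng.normal(120, 30, n_rows),
    "Age": rng.integers(21, 80, n_rows),
})
synthetic_df["Outcome"] = (synthetic_df["Glucose"] > 120).astype(int)

# First 150 rows for training, remaining 50 rows for testing.
toy_train, toy_test = synthetic_df.iloc[:150], synthetic_df.iloc[150:]

# Evaluate several feature subsets with the same helper.
results = {}
for columns in (["Glucose", "Age"], ["Glucose"], ["Age"]):
    results[", ".join(columns)] = evaluate_feature_subset(toy_train, toy_test, columns)

for name, acc in results.items():
    print(f"Features [{name}]: accuracy = {acc:.3f}")
```

With the real files, the same helper could be called with `train_df`, `test_df`, and either the 8-column or the 5-column feature list to reproduce the comparison above.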