# MSBD5001-Spring 2022

Predicting pneumonia in kidney transplant recipients

https://www.kaggle.com/c/msbd5001-spring-2022/overview

---


## Description

Kidney transplantation is the optimal treatment for patients with end-stage renal disease (ESRD). However, infectious complications, especially pneumonia, are the main cause of mortality in the early stage. In this in-class competition, we use machine learning models to study the association between pneumonia in kidney transplant patients and the immune status features collected during immune monitoring.

The immune status features consist of the percentages and absolute cell counts of CD3+CD4+ T cells, CD3+CD8+ T cells, CD19+ B cells and natural killer (NK) cells, and median fluorescence intensity (MFI) of human leukocyte antigen (HLA)-DR on monocytes and CD64 on neutrophils. Also, basic information including age and sex is provided. The task is to predict whether the patient will get pneumonia after the kidney transplantation.

---
## Dataset information

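`train.csv` contains 87 rows: an `id` column, 11 feature columns and a `label` column (0 or 1). `test.csv` contains 59 rows with the same `id` and feature columns but no `label`. One training row (id 39) has missing values in three feature columns.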

---
## ML problem definition

Binary classification (the `label` column takes the values 0 and 1)

---

## Evaluation Metric

The evaluation metric used is prediction accuracy.
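
As a minimal sketch of what this metric measures, accuracy is the fraction of predictions that match the true labels. The helper name `accuracy` and the five example labels below are hypothetical and are not part of the competition data.

```python
import numpy as np


def accuracy(y_true, y_pred) -> float:
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


# Hypothetical labels for five patients (1 = pneumonia, 0 = no pneumonia).
example_true = [0, 1, 0, 0, 1]
example_pred = [0, 1, 1, 0, 1]
example_accuracy = accuracy(example_true, example_pred)
print(example_accuracy)
```

Since the mean of `label` in the training data is 0.333 (see the `describe()` output in the data preparation section), a model that always predicts 0 would already score about 0.667 on the training rows, which is a useful baseline when reading the accuracies reported later.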

# Classifiers: (1) Decision Tree, (2) KNN and (3) Random Forest


# Download data

In [1]:
# # find the share link of the file/folder on Google Drive
# file_share_links = [
#                     "https://drive.google.com/file/d/1pP81aU-10NWzVNbNnQUDvCeSrBLOkcWw/view?usp=sharing",   #sample_submission.csv
#                     "https://drive.google.com/file/d/1UEZzWcKk_QaHWzDMl6XdWGJLgkhiguj-/view?usp=sharing",   #test.csv
#                     "https://drive.google.com/file/d/1L7AYTx15-AQ2nXAuXm7hLC9HDvpKJCGw/view?usp=sharing",   #train.csv
# ]

# for file_share_link in file_share_links:
#     # extract the ID of the file
#     file_id = file_share_link[file_share_link.find("d/") + 2: file_share_link.find('/view')]
#     print(file_id)

#     # append the id to this REST command
#     file_download_link = "https://docs.google.com/uc?export=download&id=" + file_id
#     print(file_download_link)


In [2]:
# !wget -O sample_submission.csv "https://docs.google.com/uc?export=download&id=1pP81aU-10NWzVNbNnQUDvCeSrBLOkcWw"
# !wget -O test.csv "https://docs.google.com/uc?export=download&id=1UEZzWcKk_QaHWzDMl6XdWGJLgkhiguj-"
# !wget -O train.csv "https://docs.google.com/uc?export=download&id=1L7AYTx15-AQ2nXAuXm7hLC9HDvpKJCGw"

# Setup

In [3]:
import os
import time
from typing import Iterable

import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt

# save scikit-learn model
from joblib import dump, load
# import pickle


from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

#Split data
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import MinMaxScaler
#Grid Search with Cross Validation
from sklearn.model_selection import GridSearchCV

# decision tree
from sklearn.tree import DecisionTreeClassifier

# knn
from sklearn.neighbors import KNeighborsClassifier

# random forest
from sklearn.ensemble import RandomForestClassifier

# Data preparation

In [4]:
%%sh
ls

head -n 5 train.csv
head -n 5 test.csv
head -n 5 sample_submission.csv


1.1(done)-msbd5001-kaggle-decisionTree-knn-RF.ipynb
1.2(done)-msbd5001-kaggle-pycaret.ipynb
2(done)-msbd5001-kaggle-test-models.ipynb
README.md
models.zip
sample_submission.csv
stacker_auc1.pkl
test.csv
train.csv
id,MO HLADR+ MFI (cells/ul),Neu CD64+MFI (cells/ul),CD3+T (cells/ul),CD8+T (cells/ul),CD4+T (cells/ul),NK (cells/ul),CD19+ (cells/ul),CD45+ (cells/ul),Age,Sex 0M1F,Mono CD64+MFI (cells/ul),label
0,3556.0,2489.0,265.19,77.53,176.55,0.0,4.2,307.91,52,0,7515.0,1
1,1906.0,134.0,1442.61,551.9,876.07,112.1,168.15,1735.48,20,1,1756.0,0
2,1586.0,71.0,1332.74,684.2,655.26,244.95,216.52,1820.04,28,1,1311.0,0
3,683.0,94.0,419.23,255.8,162.17,72.05,44.68,538.22,55,1,1443.0,0
id,MO HLADR+ MFI (cells/ul),Neu CD64+MFI (cells/ul),CD3+T (cells/ul),CD8+T (cells/ul),CD4+T (cells/ul),NK (cells/ul),CD19+ (cells/ul),CD45+ (cells/ul),Age,Sex 0M1F,Mono CD64+MFI (cells/ul)
0,2843.0,156.0,1358.52,730.78,637.85,127.06,94.82,1588.62,45,1,3256.0
1,437.0,137.0,509.43,268.05,243.07,390.86,98.24,1002.76,51,1

In [5]:
folder_models = 'models'
folder_data = ''
path_train = folder_data + 'train.csv'
path_test = folder_data + 'test.csv'


#read train data
train = np.genfromtxt(path_train, delimiter=',', names=True, dtype=float)

row_index_name = 'id'
label_name = 'label'
feature_names = [x for x in train.dtype.names if x not in (label_name, row_index_name)]

# train_x = train[feature_names].tolist()
# train_y = train[label_name].tolist()


#read test data
test = np.genfromtxt(path_test, delimiter=',', names=True, dtype=float)

# test_x = test[feature_names].tolist()
# test_y = test[label_name].tolist()


#class labels
labels = sorted(set(train[label_name]))  # sorted so the label order is deterministic
print(f'Classes/Labels of dataset (column: {label_name}):', labels)


# View
print(f'row_index_name: {row_index_name}')
print(f'label_name: {label_name}')
print(f'feature columns: {feature_names}')

# print(test_x[:5])
# print('labels of test data:', test_y[:5])

print('train.shape =', train.shape)
print('test.shape =', test.shape)


Classes/Labels of dataset (column: label): [0.0, 1.0]
row_index_name: id
label_name: label
feature columns: ['MO_HLADR_MFI_cellsul', 'Neu_CD64MFI_cellsul', 'CD3T_cellsul', 'CD8T_cellsul', 'CD4T_cellsul', 'NK_cellsul', 'CD19_cellsul', 'CD45_cellsul', 'Age', 'Sex_0M1F', 'Mono_CD64MFI_cellsul']
train.shape = (87,)
test.shape = (59,)
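
`np.genfromtxt` with `names=True` rewrites the column names (for example `CD3+T (cells/ul)` becomes `CD3T_cellsul`) and `dtype=float` turns `id`, `Age` and `label` into floats. As a hedged alternative, not the approach used in this notebook, `pandas.read_csv` keeps the original headers and infers integer columns. The sketch below parses the first two data rows shown in the `head` output above from an in-memory string; `sample_csv`, `df_sample` and `sample_features` are illustrative names.

```python
import io

import pandas as pd

# Header and first two data rows of train.csv, copied from the `head` output above.
sample_csv = io.StringIO(
    "id,MO HLADR+ MFI (cells/ul),Neu CD64+MFI (cells/ul),CD3+T (cells/ul),"
    "CD8+T (cells/ul),CD4+T (cells/ul),NK (cells/ul),CD19+ (cells/ul),"
    "CD45+ (cells/ul),Age,Sex 0M1F,Mono CD64+MFI (cells/ul),label\n"
    "0,3556.0,2489.0,265.19,77.53,176.55,0.0,4.2,307.91,52,0,7515.0,1\n"
    "1,1906.0,134.0,1442.61,551.9,876.07,112.1,168.15,1735.48,20,1,1756.0,0\n"
)

# index_col="id" makes the id column the row index directly.
df_sample = pd.read_csv(sample_csv, index_col="id")
sample_features = [c for c in df_sample.columns if c != "label"]

print(df_sample.dtypes)
print(sample_features)
```

With this variant the feature names would differ from the `feature_names` list printed above, so the rest of the notebook would need to use the original headers consistently.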


In [6]:
# Convert np.array to df
df_train = pd.DataFrame(train)
df_test = pd.DataFrame(test)

if row_index_name:
    df_train.set_index(row_index_name, inplace=True)
    df_test.set_index(row_index_name, inplace=True)

display(df_train.describe())
display(df_train.info())

display(df_test.describe())
display(df_test.info())


Unnamed: 0,MO_HLADR_MFI_cellsul,Neu_CD64MFI_cellsul,CD3T_cellsul,CD8T_cellsul,CD4T_cellsul,NK_cellsul,CD19_cellsul,CD45_cellsul,Age,Sex_0M1F,Mono_CD64MFI_cellsul,label
count,86.0,86.0,87.0,87.0,87.0,87.0,87.0,87.0,87.0,87.0,86.0,87.0
mean,1264.244186,290.383721,982.570115,479.34092,494.904023,212.732874,118.78092,1325.096437,40.218391,0.482759,2066.534884,0.333333
std,765.452376,490.283499,617.332545,344.326452,311.836604,173.553264,96.218344,791.602538,10.461919,0.502599,1198.401364,0.474137
min,112.0,30.0,74.4,36.61,39.59,0.0,4.2,209.25,19.0,0.0,72.0,0.0
25%,685.5,77.5,549.39,237.92,272.745,78.815,52.425,780.615,33.0,0.0,1461.25,0.0
50%,1108.5,124.5,871.71,423.27,459.72,188.78,89.79,1179.27,41.0,0.0,1757.5,0.0
75%,1602.25,244.5,1268.085,624.45,624.36,262.845,155.45,1617.725,49.5,1.0,2238.25,1.0
max,4145.0,3124.0,3791.23,2548.1,1517.81,878.04,485.86,4757.28,60.0,1.0,7515.0,1.0


<class 'pandas.core.frame.DataFrame'>
Float64Index: 87 entries, 0.0 to 86.0
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   MO_HLADR_MFI_cellsul  86 non-null     float64
 1   Neu_CD64MFI_cellsul   86 non-null     float64
 2   CD3T_cellsul          87 non-null     float64
 3   CD8T_cellsul          87 non-null     float64
 4   CD4T_cellsul          87 non-null     float64
 5   NK_cellsul            87 non-null     float64
 6   CD19_cellsul          87 non-null     float64
 7   CD45_cellsul          87 non-null     float64
 8   Age                   87 non-null     float64
 9   Sex_0M1F              87 non-null     float64
 10  Mono_CD64MFI_cellsul  86 non-null     float64
 11  label                 87 non-null     float64
dtypes: float64(12)
memory usage: 8.8 KB


None

Unnamed: 0,MO_HLADR_MFI_cellsul,Neu_CD64MFI_cellsul,CD3T_cellsul,CD8T_cellsul,CD4T_cellsul,NK_cellsul,CD19_cellsul,CD45_cellsul,Age,Sex_0M1F,Mono_CD64MFI_cellsul
count,59.0,59.0,59.0,59.0,59.0,59.0,59.0,59.0,59.0,59.0,59.0
mean,1212.423729,206.491525,1085.340508,546.220339,523.237966,226.820339,115.048983,1439.65339,41.186441,0.355932,1971.220339
std,772.139285,248.195027,564.337155,342.37002,271.730902,189.056327,87.200827,689.02181,9.438503,0.482905,1137.384129
min,82.0,24.0,258.01,114.98,80.39,17.72,2.96,314.25,15.0,0.0,371.0
25%,696.5,65.0,629.89,268.3,336.955,88.33,59.5,914.84,34.5,0.0,1283.5
50%,1010.0,114.0,1025.32,433.61,511.0,174.86,98.24,1378.32,42.0,0.0,1701.0
75%,1623.0,232.0,1495.395,751.38,676.53,318.14,143.56,1855.05,49.0,1.0,2375.0
max,4195.0,1141.0,2771.2,1738.55,1225.68,956.78,501.91,3355.86,62.0,1.0,6788.0


<class 'pandas.core.frame.DataFrame'>
Float64Index: 59 entries, 0.0 to 58.0
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   MO_HLADR_MFI_cellsul  59 non-null     float64
 1   Neu_CD64MFI_cellsul   59 non-null     float64
 2   CD3T_cellsul          59 non-null     float64
 3   CD8T_cellsul          59 non-null     float64
 4   CD4T_cellsul          59 non-null     float64
 5   NK_cellsul            59 non-null     float64
 6   CD19_cellsul          59 non-null     float64
 7   CD45_cellsul          59 non-null     float64
 8   Age                   59 non-null     float64
 9   Sex_0M1F              59 non-null     float64
 10  Mono_CD64MFI_cellsul  59 non-null     float64
dtypes: float64(11)
memory usage: 5.5 KB


None

## Clean data - Remove `null` rows

In [7]:
# Check for rows with `null` values
# (`isnull` is an alias of `isna`, so each pair of outputs below is identical)
display(df_train[df_train.isna().any(axis=1)])
display(df_train[df_train.isnull().any(axis=1)])

display(df_test[df_test.isna().any(axis=1)])
display(df_test[df_test.isnull().any(axis=1)])


Unnamed: 0_level_0,MO_HLADR_MFI_cellsul,Neu_CD64MFI_cellsul,CD3T_cellsul,CD8T_cellsul,CD4T_cellsul,NK_cellsul,CD19_cellsul,CD45_cellsul,Age,Sex_0M1F,Mono_CD64MFI_cellsul,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
39.0,,,1336.54,739.71,550.3,68.46,192.07,1615.68,21.0,0.0,,0.0


Unnamed: 0_level_0,MO_HLADR_MFI_cellsul,Neu_CD64MFI_cellsul,CD3T_cellsul,CD8T_cellsul,CD4T_cellsul,NK_cellsul,CD19_cellsul,CD45_cellsul,Age,Sex_0M1F,Mono_CD64MFI_cellsul,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
39.0,,,1336.54,739.71,550.3,68.46,192.07,1615.68,21.0,0.0,,0.0


Unnamed: 0_level_0,MO_HLADR_MFI_cellsul,Neu_CD64MFI_cellsul,CD3T_cellsul,CD8T_cellsul,CD4T_cellsul,NK_cellsul,CD19_cellsul,CD45_cellsul,Age,Sex_0M1F,Mono_CD64MFI_cellsul
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1


Unnamed: 0_level_0,MO_HLADR_MFI_cellsul,Neu_CD64MFI_cellsul,CD3T_cellsul,CD8T_cellsul,CD4T_cellsul,NK_cellsul,CD19_cellsul,CD45_cellsul,Age,Sex_0M1F,Mono_CD64MFI_cellsul
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1


In [8]:
# Remove `null` rows

## Train data
df_train = df_train[~df_train.isna().any(axis=1)]
display(df_train[df_train.isna().any(axis=1)])



## Test data
df_test = df_test[~df_test.isna().any(axis=1)]
display(df_test[df_test.isna().any(axis=1)])



Unnamed: 0_level_0,MO_HLADR_MFI_cellsul,Neu_CD64MFI_cellsul,CD3T_cellsul,CD8T_cellsul,CD4T_cellsul,NK_cellsul,CD19_cellsul,CD45_cellsul,Age,Sex_0M1F,Mono_CD64MFI_cellsul,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1


Unnamed: 0_level_0,MO_HLADR_MFI_cellsul,Neu_CD64MFI_cellsul,CD3T_cellsul,CD8T_cellsul,CD4T_cellsul,NK_cellsul,CD19_cellsul,CD45_cellsul,Age,Sex_0M1F,Mono_CD64MFI_cellsul
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
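
The boolean-mask filter used above is equivalent to `DataFrame.dropna()`. The sketch below checks this on a hypothetical three-row frame (the column name `MFI` and the values are illustrative) and also shows median imputation as one possible alternative when losing a row is a concern; imputation is an assumption of this sketch, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one incomplete row, mimicking the row with id 39.
toy = pd.DataFrame(
    {
        "MFI": [1412.0, np.nan, 634.0],
        "Age": [36.0, 21.0, 34.0],
        "label": [0.0, 0.0, 1.0],
    },
    index=pd.Index([9, 39, 84], name="id"),
)

masked = toy[~toy.isna().any(axis=1)]  # the filter used in the cell above
dropped = toy.dropna()                 # same result, shorter

# Alternative: keep every row and fill each gap with that column's median.
imputed = toy.fillna(toy.median())

print(dropped.index.tolist())
print(imputed.loc[39, "MFI"])
```

If imputation were used, the medians should be computed on the training rows only and then reused for validation and test rows, so that no information leaks from held-out data.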


## Split `validation` sets from `train` sets

Reference on randomly selecting rows from a pandas DataFrame:
* https://www.geeksforgeeks.org/how-to-randomly-select-rows-from-pandas-dataframe/

In [9]:
# To get 3 random rows
# each time it gives 3 different rows
# df_validation = df_train.sample(n = 3)

df_train, df_validation = train_test_split(df_train, test_size=0.08, random_state=200)

display(df_train.index)
display(df_validation.index)

Float64Index([ 9.0, 84.0, 19.0, 38.0,  2.0, 30.0, 10.0, 42.0, 61.0, 17.0, 51.0,
              44.0,  5.0, 50.0, 29.0, 65.0, 64.0, 66.0, 68.0, 67.0, 34.0, 47.0,
              25.0, 59.0, 36.0, 70.0, 54.0, 18.0, 41.0,  4.0, 12.0, 74.0, 79.0,
              78.0, 86.0, 49.0,  8.0, 33.0, 48.0, 63.0,  3.0,  0.0, 21.0, 20.0,
              82.0, 46.0, 32.0, 31.0, 53.0, 35.0, 76.0, 75.0, 71.0, 85.0, 13.0,
              55.0, 22.0, 72.0, 24.0, 23.0, 73.0, 15.0, 27.0, 52.0,  7.0,  1.0,
              57.0, 83.0,  6.0, 11.0, 58.0, 14.0, 80.0, 77.0, 56.0, 43.0, 69.0,
              16.0, 26.0],
             dtype='float64', name='id')

Float64Index([81.0, 28.0, 60.0, 37.0, 62.0, 40.0, 45.0], dtype='float64', name='id')
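
With only 7 validation rows, a purely random split can leave the minority class with very few examples. One option, sketched here on synthetic data rather than the real frame, is to pass `stratify=` so both splits keep roughly the same class ratio. The frame `demo` and its `feature` column are made up for illustration; only the row count (86) and the test fraction (0.08) mirror the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned training frame: 86 rows, about one third positive.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "feature": rng.normal(size=86),
        "label": np.array([1.0] * 29 + [0.0] * 57),
    }
)

# stratify= keeps the label proportions similar in both splits.
demo_train, demo_val = train_test_split(
    demo, test_size=0.08, random_state=200, stratify=demo["label"]
)

print(len(demo_train), len(demo_val))
print(demo_val["label"].value_counts().to_dict())
```

Stratification does not make a 7-row validation set statistically reliable; it only guarantees that both classes are represented in it.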

### Build (1) `train_x`, `train_y` and (2) `validation_x`, `validation_y`

In [10]:
train_x = df_train[feature_names]
train_y = df_train[label_name]

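# NOTE: `test_x` / `test_y` are aliases of the validation split held out from train.csv.
# They are not the Kaggle test.csv rows, which have no labels.
test_x = validation_x = df_validation[feature_names]
test_y = validation_y = df_validation[label_name]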

train_x

Unnamed: 0_level_0,MO_HLADR_MFI_cellsul,Neu_CD64MFI_cellsul,CD3T_cellsul,CD8T_cellsul,CD4T_cellsul,NK_cellsul,CD19_cellsul,CD45_cellsul,Age,Sex_0M1F,Mono_CD64MFI_cellsul
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
9.0,1412.0,243.0,1177.19,684.42,490.50,185.30,67.22,1441.06,36.0,1.0,1213.0
84.0,634.0,1002.0,1300.00,558.00,724.00,67.00,105.00,1484.26,34.0,0.0,2926.0
19.0,403.0,555.0,313.48,131.53,182.69,46.68,7.90,370.30,40.0,0.0,2209.0
38.0,1010.0,1384.0,570.13,312.90,233.84,80.17,31.18,702.08,56.0,1.0,5501.0
2.0,1586.0,71.0,1332.74,684.20,655.26,244.95,216.52,1820.04,28.0,1.0,1311.0
...,...,...,...,...,...,...,...,...,...,...,...
56.0,1055.0,87.0,913.42,410.52,507.04,77.46,43.08,1040.18,48.0,0.0,1728.0
43.0,1679.0,79.0,483.21,162.00,309.00,227.05,101.09,817.24,39.0,0.0,4480.0
69.0,1495.0,125.0,2910.03,1431.78,1517.81,446.94,401.45,3817.75,20.0,1.0,1793.0
16.0,1045.0,124.0,1179.07,522.49,661.78,210.96,93.63,1498.11,49.0,1.0,2075.0


# Comparison of Classifiers

You are required to implement the following classifiers and compare the performance achieved by different classifiers.

* Decision Tree

    You should build decision trees on the dataset using the entropy and gini criteria. For each criterion, set the maximum depth to each of {5, 10, 15, 20} in turn. You need to compare the performance *(accuracy, precision, recall, f1 score and training time)* and give a brief discussion.


* KNN, Random Forest

    Apply two further classifiers, KNN and Random Forest, to the dataset. For each classifier, evaluate the performance *(accuracy, precision, recall, f1 score and training time)*. You are required to compare the performance of the different classifiers and give a brief discussion.

## Decision Tree

### Training 


In [11]:
def trainModels(models, x_train: np.array, y_train: np.array) -> Iterable:
    '''Train the models, return trained models.
    '''
    trained = []
    training_times = []

#     trained = [model.fit(x_train, y_train) for model in models]
    for model in models:
        start_time = time.time()
        model = model.fit(x_train, y_train)
        elapsed_time = time.time() - start_time
        training_times.append(elapsed_time)
        trained.append(model)
        print(f'{elapsed_time:.4f}s elapsed during training')
    
    return trained, training_times
    

#### (1) criterion = 'entropy'

In [12]:

criterion = 'entropy'
max_depths = [5,10,15,20]

# Create the Decision Tree Classifier objects
dectrees_entropy = [DecisionTreeClassifier(criterion=criterion,max_depth=depth) for depth in max_depths]
display(dectrees_entropy)

# Train the Decision Tree Classifier model
trained_dectrees_entropy, dectree_entropy_training_times = trainModels(dectrees_entropy, train_x, train_y)

# trained_dectrees_entropy


[DecisionTreeClassifier(criterion='entropy', max_depth=5),
 DecisionTreeClassifier(criterion='entropy', max_depth=10),
 DecisionTreeClassifier(criterion='entropy', max_depth=15),
 DecisionTreeClassifier(criterion='entropy', max_depth=20)]

0.0049s elapsed during training
0.0035s elapsed during training
0.0037s elapsed during training
0.0027s elapsed during training


In [13]:
m = trained_dectrees_entropy[0]
m.__class__.__name__ , m.criterion, m.max_depth

('DecisionTreeClassifier', 'entropy', 5)

#### (2) criterion = 'gini'

In [14]:
criterion = 'gini'
max_depths = [5,10,15,20]

# Create the Decision Tree Classifier objects
dectrees_gini = [DecisionTreeClassifier(criterion=criterion,max_depth=depth) for depth in max_depths]
display(dectrees_gini)

# Train the Decision Tree Classifier model
trained_dectrees_gini, dectree_gini_training_times = trainModels(dectrees_gini, train_x, train_y)

# trained_dectrees_gini


[DecisionTreeClassifier(max_depth=5),
 DecisionTreeClassifier(max_depth=10),
 DecisionTreeClassifier(max_depth=15),
 DecisionTreeClassifier(max_depth=20)]

0.0045s elapsed during training
0.0038s elapsed during training
0.0029s elapsed during training
0.0024s elapsed during training


### Testing 

Compare the metrics (accuracy, precision, recall, f1 score and training time)

#### save trained model function
https://scikit-learn.org/stable/modules/model_persistence.html

In [15]:
def saveModel(model,
              accuracy,
              precision,
              recall,
              f1,
              folder = 'models'):
    '''Save a fitted model to `folder`; the filename encodes its key hyperparameter and accuracy.
    `precision`, `recall` and `f1` are accepted but not used in the filename.
    '''
    os.makedirs(folder, exist_ok=True)

    accuracy = f'{accuracy:.3f}'
    model_filename_map = {
        'DecisionTreeClassifier' : lambda : f'dectree_{model.criterion}_{model.max_depth}depth_{accuracy}acc.joblib',
        'KNeighborsClassifier'   : lambda : f'knn_{model.n_neighbors}n_{accuracy}acc.joblib',
        'RandomForestClassifier' : lambda : f'rforest_{model.n_estimators}estimators_{accuracy}acc.joblib',
        }

    dump(model, os.path.join(folder, model_filename_map[model.__class__.__name__]()))
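
A saved `.joblib` file can be read back with `joblib.load`. The round trip below is a minimal sketch on a tiny hypothetical dataset and a temporary directory; the filename `dectree_entropy_5depth_demo.joblib` is illustrative and is not one of the files written by `saveModel`.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.tree import DecisionTreeClassifier

# Tiny hypothetical dataset, only to have a fitted model to persist.
X_demo = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.9, 0.1]]
y_demo = [0, 1, 0, 1]
fitted = DecisionTreeClassifier(
    criterion="entropy", max_depth=5, random_state=0
).fit(X_demo, y_demo)

# Write the model to a temporary folder and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "dectree_entropy_5depth_demo.joblib")
    dump(fitted, model_path)
    restored = load(model_path)

restored_pred = restored.predict(X_demo)
print(restored_pred)
```

As the scikit-learn persistence page linked above notes, such files should only be loaded from trusted sources and ideally with the same scikit-learn version that wrote them.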

#### testing functions

In [16]:
def evaluate(y_true: np.array, y_pred: np.array) -> Iterable[float]:
    '''Return (accuracy, macro precision, macro recall, macro f1).
    Relies on the global `labels` list built during data preparation.
    '''
    print('Label(s) that never appear in prediction: ', set(y_true) - set(y_pred)) 
    
    # Compute the metrics; zero_division=0 scores a class with no predicted
    # samples as 0 without emitting a warning
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, labels=labels, average='macro', zero_division=0)
    recall = recall_score(y_true, y_pred, labels=labels, average='macro', zero_division=0)
    f1 = f1_score(y_true, y_pred, labels=labels, average='macro', zero_division=0)

#     print('classification_report() =\n', classification_report(y_true, list(y_pred), digits=5))    
#     plotConfusionMatrix(y_true, y_pred)   

    return accuracy, precision, recall, f1


def plotConfusionMatrix(y_true: np.array, y_pred: np.array):
    # Plot the two-way Confusion Matrix
    fig, ax = plt.subplots(ncols=1)
    sb.heatmap(confusion_matrix(y_true, y_pred),
               annot = True, 
               fmt=".0f", 
#                annot_kws={"size": 18}
              )
    ax.set_xlabel('Predictions')
    ax.set_ylabel('Actuals')

    plt.show()


def average(ls: Iterable[float]) -> float:
    ls = list(ls)
    return sum(ls)/len(ls)
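
To make the `average='macro'` setting used in `evaluate` concrete, the sketch below derives macro precision and recall by hand from a confusion matrix and compares them with scikit-learn. The seven labels are hypothetical; they are one arrangement that yields 0.7500 precision and 0.9167 recall, and they are not taken from the notebook's validation data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for seven rows: six negatives, one positive, one false positive.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1])

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])

# This manual version assumes every class is predicted at least once,
# otherwise a column sum would be zero.
per_class_precision = np.diag(cm) / cm.sum(axis=0)  # correct / predicted
per_class_recall = np.diag(cm) / cm.sum(axis=1)     # correct / actual

# Macro averaging: unweighted mean over classes.
macro_precision = per_class_precision.mean()
macro_recall = per_class_recall.mean()

print(macro_precision, precision_score(y_true_demo, y_pred_demo, average="macro"))
print(macro_recall, recall_score(y_true_demo, y_pred_demo, average="macro"))
```

Macro averaging weights both classes equally, so a single false positive on a class with very few rows pulls macro precision down sharply even when accuracy stays high.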
    

In [17]:
def testModels(models, 
               x_train: np.array, y_train: np.array, 
               x_test: np.array, y_test: np.array, 
               time_training_ls: list
              ) -> Iterable:
    
    pred_y_list = []
    arruracy_ls = []
    precision_ls = []
    recall_ls = []
    f1_ls = []

    for model, time_training in zip(models, time_training_ls):
        print(model)
        print(model.__class__.__name__)

        # For training data
        pred_y_train = model.predict(x_train)
        print('>>> For training data')
        print('arruracy, precision, recall, f1=', evaluate(y_train, pred_y_train))
        print()
        
        # For testing data
        print('>>> For testing data')
        pred_y = model.predict(x_test)
        pred_y_list.append(pred_y)
        print('pred_y[:5] =', pred_y[:5])
        
        # Evaluation metrics
        arruracy, precision, recall, f1 = evaluate(y_test, pred_y)
        arruracy_ls.append(arruracy)
        precision_ls.append(precision)
        recall_ls.append(recall)
        f1_ls.append(f1)
        print(f'arruracy, precision, recall, f1 = {arruracy:.4f}, {precision:.4f}, {recall:.4f}, {f1:.4f}')
        print()

#         print('classification_report() =\n', classification_report(y_true, list(y_pred), digits=5))    
#         plotConfusionMatrix(y_true, y_pred)  
        print(f'Training time:\t{time_training:.4f}s')

        # Save model to storage
        saveModel(model, arruracy, precision, recall, f1)
        print('>>> Model saved')
        print('=========================\n')

    print('average training time (s) =', average(time_training_ls))
    print('average arruracy \t=', average(arruracy_ls))
    print('average precision \t=', average(precision_ls))
    print('average recall \t\t=', average(recall_ls))
    print('average f1 score \t=', average(f1_ls))
    
    # return pred_y_list, arruracy_ls, precision_ls, recall_ls, f1_ls


#### entropy

In [18]:
testModels(trained_dectrees_entropy, train_x, train_y, test_x, test_y, dectree_entropy_training_times)



DecisionTreeClassifier(criterion='entropy', max_depth=5)
DecisionTreeClassifier
>>> For training data
Label(s) that never appear in prediction:  set()
arruracy, precision, recall, f1= (0.9746835443037974, 0.9811320754716981, 0.9642857142857143, 0.9718660968660968)

>>> For testing data
pred_y[:5] = [0. 0. 0. 1. 0.]
Label(s) that never appear in prediction:  set()
arruracy, precision, recall, f1 = 1.0000, 1.0000, 1.0000, 1.0000

Training time:	0.0049s
>>> Model saved

DecisionTreeClassifier(criterion='entropy', max_depth=10)
DecisionTreeClassifier
>>> For training data
Label(s) that never appear in prediction:  set()
arruracy, precision, recall, f1= (1.0, 1.0, 1.0, 1.0)

>>> For testing data
pred_y[:5] = [1. 0. 0. 1. 1.]
Label(s) that never appear in prediction:  set()
arruracy, precision, recall, f1 = 0.7143, 0.6667, 0.8333, 0.6500

Training time:	0.0035s
>>> Model saved

DecisionTreeClassifier(criterion='entropy', max_depth=15)
DecisionTreeClassifier
>>> For training data
Label(s) tha

#### gini

In [19]:
testModels(trained_dectrees_gini, train_x, train_y, test_x, test_y, dectree_gini_training_times)


DecisionTreeClassifier(max_depth=5)
DecisionTreeClassifier
>>> For training data
Label(s) that never appear in prediction:  set()
arruracy, precision, recall, f1= (1.0, 1.0, 1.0, 1.0)

>>> For testing data
pred_y[:5] = [0. 0. 0. 1. 0.]
Label(s) that never appear in prediction:  set()
arruracy, precision, recall, f1 = 1.0000, 1.0000, 1.0000, 1.0000

Training time:	0.0045s
>>> Model saved

DecisionTreeClassifier(max_depth=10)
DecisionTreeClassifier
>>> For training data
Label(s) that never appear in prediction:  set()
arruracy, precision, recall, f1= (1.0, 1.0, 1.0, 1.0)

>>> For testing data
pred_y[:5] = [0. 1. 0. 1. 0.]
Label(s) that never appear in prediction:  set()
arruracy, precision, recall, f1 = 0.8571, 0.7500, 0.9167, 0.7879

Training time:	0.0038s
>>> Model saved

DecisionTreeClassifier(max_depth=15)
DecisionTreeClassifier
>>> For training data
Label(s) that never appear in prediction:  set()
arruracy, precision, recall, f1= (1.0, 1.0, 1.0, 1.0)

>>> For testing data
pred_y[:5]

### brief discussion - Decision Tree with entropy vs gini

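Training times are short for both criteria (under 0.005 s per tree).

On the validation split, the depth-5 trees reach an accuracy of 1.0000 with both criteria, while the depth-10 trees score lower (0.7143 for entropy and 0.8571 for gini) despite perfect accuracy on the training data, which suggests overfitting at larger depths. The validation split holds only 7 rows, so each misclassified row changes accuracy by about 14 percentage points; these differences are therefore not reliable evidence that one criterion is better than the other.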

<!-- 
|                   | entropy| gini   |
|-------------------|--------|--------|
| **training time** | slower | faster |
| **arruracy**      | similar| similar|
| **precision**     | higher | lower  |
| **recall**        | higher | lower  |
| **f1 score**      | higher | lower  | -->
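
Because the validation split used above has only 7 rows, a steadier comparison of criteria and depths could come from k-fold cross-validation. The following is a minimal sketch on synthetic data produced by `make_classification`, not on the competition data; the names `X_cv`, `y_cv` and `cv_results`, the 5 folds and the fixed `random_state` values are all assumptions of this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 86 rows, 11 features, roughly one third positive.
X_cv, y_cv = make_classification(
    n_samples=86, n_features=11, n_informative=4,
    weights=[0.67, 0.33], random_state=0,
)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_results = {}
for criterion in ("entropy", "gini"):
    for depth in (5, 10, 15, 20):
        tree = DecisionTreeClassifier(
            criterion=criterion, max_depth=depth, random_state=0
        )
        scores = cross_val_score(tree, X_cv, y_cv, cv=cv, scoring="accuracy")
        cv_results[(criterion, depth)] = (scores.mean(), scores.std())

for key, (mean_acc, std_acc) in cv_results.items():
    print(key, f"{mean_acc:.3f} +/- {std_acc:.3f}")
```

Every row is used for validation exactly once, so the mean and spread across folds give a less noisy picture than a single 7-row hold-out.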



## KNN

### Normalizing the data

KNN is a distance-based algorithm: it uses the distances between data points to determine their similarity, so it is sensitive to the scale of each feature.

The features in this dataset span very different ranges (for example, `Sex_0M1F` is 0 or 1, while `Mono_CD64MFI_cellsul` reaches 7515), so we normalize them first.

https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

In [20]:
# fit scaler on training data
# scaler = MinMaxScaler().fit(train_x)
scaler = MinMaxScaler().fit(df_train[feature_names])

# transform train data
# train_norm_x = scaler.transform(train_x)
train_norm_x = scaler.transform(df_train[feature_names])
# transform test data
# test_norm_x = scaler.transform(test_x)
test_norm_x = scaler.transform(df_validation[feature_names])

train_norm_x[:2]


array([[0.32234069, 0.06884292, 0.29670176, 0.25793851, 0.30503579,
        0.21103822, 0.13083918, 0.27084474, 0.41463415, 1.        ,
        0.1532984 ],
       [0.12943218, 0.31415643, 0.32974336, 0.20760186, 0.46299604,
        0.07630632, 0.20927625, 0.28034336, 0.36585366, 0.        ,
        0.38344753]])
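
The transformation applied by `MinMaxScaler` can be written out by hand as `(x - min) / (max - min)` per column, with `min` and `max` taken from the data the scaler was fitted on. The sketch below checks this on a small hypothetical array (`X_fit`, `X_new` and the values are illustrative) and shows that unseen rows can land outside the range [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two hypothetical features on very different scales (an MFI-like value and an age).
X_fit = np.array([[403.0, 40.0], [1412.0, 36.0], [1586.0, 28.0], [634.0, 34.0]])
X_new = np.array([[1010.0, 56.0]])  # unseen row, age above the fitted maximum

# Column-wise minimum and maximum of the fitting data only.
col_min = X_fit.min(axis=0)
col_max = X_fit.max(axis=0)

manual_fit = (X_fit - col_min) / (col_max - col_min)
manual_new = (X_new - col_min) / (col_max - col_min)

demo_scaler = MinMaxScaler().fit(X_fit)
print(np.allclose(manual_fit, demo_scaler.transform(X_fit)))
print(manual_new)
```

This is why the scaler above is fitted on the training rows only and then reused for the validation rows: the validation data must not influence the minimum and maximum.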

### Train KNN
Documentation of K-Neighbors Classifier (KNN): 

1. https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
2. https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/

In [21]:
n_neighbors = [1,3,9,11]
knn_models = [KNeighborsClassifier(n_neighbors=k) for k in n_neighbors]

# fit the model with the training data
knn_models, knn_training_times = trainModels(knn_models, train_norm_x, train_y)

0.0008s elapsed during training
0.0014s elapsed during training
0.0015s elapsed during training
0.0010s elapsed during training


In [22]:
m = knn_models[0]
m.__class__.__name__ , m.n_neighbors

('KNeighborsClassifier', 1)
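
As an illustration of what `n_neighbors` controls, here is a simplified hand-written KNN prediction: compute the Euclidean distance from the query to every training row, take the `k` closest rows and return the majority label. The function `knn_predict` and the toy points are hypothetical, and this is not how scikit-learn implements the classifier internally.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Predict one label by majority vote among the k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    distances = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distance per row
    nearest = np.argsort(distances)[:k]                    # indices of the k closest rows
    votes = Counter(np.asarray(y_train)[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical, already-normalized points: three of class 0, two of class 1.
X_knn = np.array([[0.1, 0.1], [0.2, 0.1], [0.15, 0.2], [0.9, 0.8], [0.8, 0.9]])
y_knn = np.array([0, 0, 0, 1, 1])

knn_near_zero = knn_predict(X_knn, y_knn, [0.12, 0.15], k=3)
knn_near_one = knn_predict(X_knn, y_knn, [0.85, 0.85], k=3)
print(knn_near_zero, knn_near_one)
```

With `k=1` every training row is its own nearest neighbour, which is why a 1-neighbour model scores perfectly on the data it was fitted on.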

### Test KNN

In [23]:
# Test the models on the training and validation data
# (testModels() prints its results and returns None, so there is nothing to assign)

testModels(knn_models, train_norm_x, train_y, test_norm_x, test_y, knn_training_times)


KNeighborsClassifier(n_neighbors=1)
KNeighborsClassifier
>>> For training data
Label(s) that never appear in prediction:  set()
arruracy, precision, recall, f1= (1.0, 1.0, 1.0, 1.0)

>>> For testing data
pred_y[:5] = [0. 0. 1. 1. 0.]
Label(s) that never appear in prediction:  set()
arruracy, precision, recall, f1 = 0.8571, 0.7500, 0.9167, 0.7879

Training time:	0.0008s
>>> Model saved

KNeighborsClassifier(n_neighbors=3)
KNeighborsClassifier
>>> For training data
Label(s) that never appear in prediction:  set()
arruracy, precision, recall, f1= (0.8481012658227848, 0.843939393939394, 0.8179271708683473, 0.8280116110304789)

>>> For testing data
pred_y[:5] = [0. 0. 1. 1. 0.]
Label(s) that never appear in prediction:  set()
arruracy, precision, recall, f1 = 0.8571, 0.7500, 0.9167, 0.7879

Training time:	0.0014s
>>> Model saved

KNeighborsClassifier(n_neighbors=9)
KNeighborsClassifier
>>> For training data
Label(s) that never appear in prediction:  set()
arruracy, precision, recall, f1= (0

**Hidden code to compare performance of knn in raw and normalized data**

<!-- 
```trainX_ls = [train_x, train_norm_x]
testX_ls = [test_x, test_norm_x]
data_types = ['raw', 'normalized']

for trainX, testX, data_type in zip(trainX_ls, testX_ls, data_types):
    print('\n==================================')
    print(data_type)
    
    n_neighbors = [1,3,5,7,11,21,31,41]
    knn_models = [KNeighborsClassifier(n_neighbors=k) for k in n_neighbors]

    # Train the model with the training data
    knn_models, training_times = trainModels(knn_models, trainX, train_y)

    # Test the model with the training & testing data
    testModels(knn_models, trainX, train_y, testX, test_y, training_times)
```
 -->

## Random Forest

In [24]:
# # Grid Search with Cross Validation
# # Create the parameter grid based on the results of random search 
# param_grid = {
# #     'bootstrap': [True],
#     'max_depth': [None, 8, 10],
# #     'max_features': [2, 3],
# #     'min_samples_leaf': [1, 3, 4, 5],
# #     'min_samples_split': [2, 8, 10, 12],
#     'n_estimators': [50, 100, 150, 200, 250, 300, 1000]
# }

In [25]:
# # Instantiate the grid search model
# grid_search = GridSearchCV(estimator = RandomForestClassifier(), param_grid = param_grid, cv = 3, n_jobs = -1, verbose = 2)

# # Fit the grid search to the data
# grid_search.fit(train_x, train_y)
# print('grid_search.best_params_ =', grid_search.best_params_)


In [26]:
# rf_best_grid = grid_search.best_estimator_
# rf_best_grid.fit(train_x, train_y)

# # Predict Response corresponding to Predictors
# # y_train_pred = rf_best_grid.predict(X_train)
# # y_test_pred = rf_best_grid.predict(X_test)

# testModels([rf_best_grid], train_x, train_y, test_x, test_y, [0])
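
The commented-out grid search above can be sketched in runnable form as follows. This version uses synthetic data from `make_classification` and a deliberately small grid so that it finishes quickly; `X_gs`, `y_gs`, `small_grid` and the fixed `random_state` are assumptions of this example, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data: 86 rows, 11 features.
X_gs, y_gs = make_classification(
    n_samples=86, n_features=11, n_informative=4, random_state=0
)

# Reduced grid for illustration only.
small_grid = {"max_depth": [None, 8], "n_estimators": [50, 100]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_gs, y_gs)

print("best params:", search.best_params_)
print("best CV accuracy:", search.best_score_)

# With the default refit=True, best_estimator_ is already refitted on all of X_gs, y_gs.
best_forest = search.best_estimator_
```

Because `refit=True` is the default, calling `.fit` again on `best_estimator_`, as the commented-out cell does, should not be necessary.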

In [27]:
n_estimators = [150, 200, 250, 300]

# Random Forest using Train Data
rforests = [RandomForestClassifier(n_estimators=n) 
            for n in n_estimators]

# Train
rforests, rforests_training_times = trainModels(rforests, train_x, train_y)
print()

# Test
testModels(rforests, train_x, train_y, test_x, test_y, rforests_training_times)


0.1984s elapsed during training
0.6079s elapsed during training
1.0374s elapsed during training
1.4815s elapsed during training

RandomForestClassifier(n_estimators=150)
RandomForestClassifier
>>> For training data
Label(s) that never appear in prediction:  set()
arruracy, precision, recall, f1= (1.0, 1.0, 1.0, 1.0)

>>> For testing data
pred_y[:5] = [0. 0. 0. 1. 0.]
Label(s) that never appear in prediction:  set()
arruracy, precision, recall, f1 = 1.0000, 1.0000, 1.0000, 1.0000

Training time:	0.1984s
>>> Model saved

RandomForestClassifier(n_estimators=200)
RandomForestClassifier
>>> For training data
Label(s) that never appear in prediction:  set()
arruracy, precision, recall, f1= (1.0, 1.0, 1.0, 1.0)

>>> For testing data
pred_y[:5] = [0. 0. 0. 1. 0.]
Label(s) that never appear in prediction:  set()
arruracy, precision, recall, f1 = 1.0000, 1.0000, 1.0000, 1.0000

Training time:	0.6079s
>>> Model saved

RandomForestClassifier(n_estimators=250)
RandomForestClassifier
>>> For trainin

In [28]:
m = rforests[0]
m.__class__.__name__ , m.n_estimators

('RandomForestClassifier', 150)

# Compress `models/` folder


In [29]:
!zip -r {folder_models}.zip {folder_models}

updating: models/ (stored 0%)
updating: models/dectree_entropy_5depth_1.000acc.joblib (deflated 56%)
updating: models/rforest_250estimators_1.000acc.joblib (deflated 87%)
updating: models/knn_11n_1.000acc.joblib (deflated 51%)
updating: models/knn_1n_0.857acc.joblib (deflated 51%)
updating: models/dectree_entropy_20depth_0.714acc.joblib (deflated 61%)
updating: models/rforest_300estimators_1.000acc.joblib (deflated 87%)
updating: models/knn_3n_0.857acc.joblib (deflated 51%)
updating: models/dectree_entropy_15depth_0.714acc.joblib (deflated 60%)
updating: models/dectree_entropy_10depth_0.714acc.joblib (deflated 60%)
updating: models/rforest_200estimators_1.000acc.joblib (deflated 87%)
updating: models/rforest_150estimators_1.000acc.joblib (deflated 87%)
updating: models/knn_9n_1.000acc.joblib (deflated 51%)
  adding: models/dectree_gini_15depth_0.857acc.joblib (deflated 60%)
  adding: models/dectree_gini_10depth_0.857acc.joblib (deflated 60%)
  adding: models/dectree_gini_5depth_1.000ac

In [30]:
!echo "$(TZ=':Asia/Hong_Kong' date +"%Y%m%d.%Hh%Mm")"

20220405.19h47m
