# MSBD5001-Spring 2022

Predicting pneumonia in kidney transplant recipients

https://www.kaggle.com/c/msbd5001-spring-2022/overview

---


## Description

Kidney transplantation is the optimal treatment for patients with end-stage renal disease (ESRD). However, infectious complications, especially pneumonia, are the main cause of mortality in the early stage. This in-class competition studies the association between immune status features collected during immune monitoring and pneumonia in kidney transplant patients, using machine learning models.

The immune status features consist of the percentages and absolute cell counts of CD3+CD4+ T cells, CD3+CD8+ T cells, CD19+ B cells and natural killer (NK) cells, and median fluorescence intensity (MFI) of human leukocyte antigen (HLA)-DR on monocytes and CD64 on neutrophils. Also, basic information including age and sex is provided. The task is to predict whether the patient will get pneumonia after the kidney transplantation.

---
## Dataset information

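The data ships as three CSV files: `train.csv` (87 labeled rows), `test.csv` (59 unlabeled rows) and `sample_submission.csv` (header `id,label`). Each row has an `id` and 11 feature columns; `train.csv` additionally has a binary `label`. One training row has missing values.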

---
## ML problem definition

Binary classification (`label` is `0` or `1`)

---

## Evaluation Metric

The evaluation metric used is prediction accuracy.
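
As a minimal sketch of this metric (the helper name `accuracy` and the toy labels are illustrative and not part of the competition), accuracy is the fraction of predictions that match the true labels:

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return the fraction of positions where the prediction equals the truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy of an empty sequence")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels (made up for illustration): 3 of 4 predictions are correct.
toy_true = [0, 0, 1, 1]
toy_pred = [0, 1, 1, 1]
print(accuracy(toy_true, toy_pred))  # 0.75
```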

# Classifiers: (1) Decision Tree, (2) KNN and (3) Random Forest


# Download data

In [None]:
# # find the share link of the file/folder on Google Drive
# file_share_links = [
#                     "https://drive.google.com/file/d/1pP81aU-10NWzVNbNnQUDvCeSrBLOkcWw/view?usp=sharing",   #sample_submission.csv
#                     "https://drive.google.com/file/d/1UEZzWcKk_QaHWzDMl6XdWGJLgkhiguj-/view?usp=sharing",   #test.csv
#                     "https://drive.google.com/file/d/1L7AYTx15-AQ2nXAuXm7hLC9HDvpKJCGw/view?usp=sharing",   #train.csv
# ]

# for file_share_link in file_share_links:
#     # extract the ID of the file
#     file_id = file_share_link[file_share_link.find("d/") + 2: file_share_link.find('/view')]
#     print(file_id)

#     # append the id to this REST command
#     file_download_link = "https://docs.google.com/uc?export=download&id=" + file_id
#     print(file_download_link)


1pP81aU-10NWzVNbNnQUDvCeSrBLOkcWw
https://docs.google.com/uc?export=download&id=1pP81aU-10NWzVNbNnQUDvCeSrBLOkcWw
1UEZzWcKk_QaHWzDMl6XdWGJLgkhiguj-
https://docs.google.com/uc?export=download&id=1UEZzWcKk_QaHWzDMl6XdWGJLgkhiguj-
1L7AYTx15-AQ2nXAuXm7hLC9HDvpKJCGw
https://docs.google.com/uc?export=download&id=1L7AYTx15-AQ2nXAuXm7hLC9HDvpKJCGw
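
The link conversion above can also be sketched with a regular expression, which avoids relying on the position of the first `"d/"` in the URL. The helper name `to_download_link` is hypothetical, and the sketch assumes the file ID is the path segment right after `/file/d/`:

```python
import re

# Share links copied from the cell above.
SHARE_LINKS = {
    "sample_submission.csv": "https://drive.google.com/file/d/1pP81aU-10NWzVNbNnQUDvCeSrBLOkcWw/view?usp=sharing",
    "test.csv": "https://drive.google.com/file/d/1UEZzWcKk_QaHWzDMl6XdWGJLgkhiguj-/view?usp=sharing",
    "train.csv": "https://drive.google.com/file/d/1L7AYTx15-AQ2nXAuXm7hLC9HDvpKJCGw/view?usp=sharing",
}

# Assumption: the file ID is the path segment between "/file/d/" and the next "/".
_FILE_ID = re.compile(r"/file/d/([^/]+)")


def to_download_link(share_link: str) -> str:
    """Turn a Drive share link into the direct-download form used by the notebook."""
    match = _FILE_ID.search(share_link)
    if match is None:
        raise ValueError(f"no Drive file ID found in: {share_link}")
    return "https://docs.google.com/uc?export=download&id=" + match.group(1)


for name, link in SHARE_LINKS.items():
    print(name, to_download_link(link))
```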


In [None]:
# !wget -O sample_submission.csv "https://docs.google.com/uc?export=download&id=1pP81aU-10NWzVNbNnQUDvCeSrBLOkcWw"
# !wget -O test.csv "https://docs.google.com/uc?export=download&id=1UEZzWcKk_QaHWzDMl6XdWGJLgkhiguj-"
# !wget -O train.csv "https://docs.google.com/uc?export=download&id=1L7AYTx15-AQ2nXAuXm7hLC9HDvpKJCGw"

--2022-04-05 10:25:59--  https://docs.google.com/uc?export=download&id=1pP81aU-10NWzVNbNnQUDvCeSrBLOkcWw
Resolving docs.google.com (docs.google.com)... 108.177.13.100, 108.177.13.113, 108.177.13.102, ...
Connecting to docs.google.com (docs.google.com)|108.177.13.100|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0c-74-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/ns1s1s4kmid7aa2765bap8ph857gvbmc/1649154300000/10004626043936594729/*/1pP81aU-10NWzVNbNnQUDvCeSrBLOkcWw?e=download [following]
--2022-04-05 10:26:00--  https://doc-0c-74-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/ns1s1s4kmid7aa2765bap8ph857gvbmc/1649154300000/10004626043936594729/*/1pP81aU-10NWzVNbNnQUDvCeSrBLOkcWw?e=download
Resolving doc-0c-74-docs.googleusercontent.com (doc-0c-74-docs.googleusercontent.com)... 173.194.215.132, 2607:f8b0:400c:c0c::84
Connecting to doc-0c-74-docs.googleusercontent.com (doc-0c-74-docs.goo

# Setup

## Install auto-ml `pycaret` lib

```shell
!pip install pycaret[full]
```

In [None]:
# !pip install pycaret

In [None]:
import os
import time
from typing import Iterable

import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt

# save scikit-learn model
from joblib import dump, load
# import pickle


from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

#Split data
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import MinMaxScaler
#Grid Search with Cross Validation
from sklearn.model_selection import GridSearchCV

# decision tree
from sklearn.tree import DecisionTreeClassifier

# knn
from sklearn.neighbors import KNeighborsClassifier

# random forest
from sklearn.ensemble import RandomForestClassifier

# Data preparation

In [None]:
%%sh
ls

head -n 5 train.csv
head -n 5 test.csv
head -n 5 sample_submission.csv


logs.log
sample_data
sample_submission.csv
test.csv
train.csv
id,MO HLADR+ MFI (cells/ul),Neu CD64+MFI (cells/ul),CD3+T (cells/ul),CD8+T (cells/ul),CD4+T (cells/ul),NK (cells/ul),CD19+ (cells/ul),CD45+ (cells/ul),Age,Sex 0M1F,Mono CD64+MFI (cells/ul),label
0,3556.0,2489.0,265.19,77.53,176.55,0.0,4.2,307.91,52,0,7515.0,1
1,1906.0,134.0,1442.61,551.9,876.07,112.1,168.15,1735.48,20,1,1756.0,0
2,1586.0,71.0,1332.74,684.2,655.26,244.95,216.52,1820.04,28,1,1311.0,0
3,683.0,94.0,419.23,255.8,162.17,72.05,44.68,538.22,55,1,1443.0,0
id,MO HLADR+ MFI (cells/ul),Neu CD64+MFI (cells/ul),CD3+T (cells/ul),CD8+T (cells/ul),CD4+T (cells/ul),NK (cells/ul),CD19+ (cells/ul),CD45+ (cells/ul),Age,Sex 0M1F,Mono CD64+MFI (cells/ul)
0,2843.0,156.0,1358.52,730.78,637.85,127.06,94.82,1588.62,45,1,3256.0
1,437.0,137.0,509.43,268.05,243.07,390.86,98.24,1002.76,51,1,491.0
2,826.0,82.0,1232.22,493.42,744.08,516.28,320.15,2200.58,32,0,1381.0
3,861.0,50.0,1512.86,925.51,590.07,380.25,25.8,1929.1,50,0,1377.0
id,label


In [None]:
folder_models = 'models'
folder_data = ''
path_train = folder_data + 'train.csv'
path_test = folder_data + 'test.csv'


#read train data
train = np.genfromtxt(path_train, delimiter=',', names=True, dtype=float)

row_index_name = 'id'
label_name = 'label'
# every column except the row id and the label is a feature
feature_names = [x for x in train.dtype.names if x != label_name and x != row_index_name]

# train_x = train[feature_names].tolist()
# train_y = train[label_name].tolist()


#read test data
test = np.genfromtxt(path_test, delimiter=',', names=True, dtype=float)

# test_x = test[feature_names].tolist()
# test_y = test[label_name].tolist()


#class labels
labels = list(set(train[label_name]))
print(f'Classes/Labels of dataset (column: {label_name}):', labels)


# View
print(f'row_index_name: {row_index_name}')
print(f'label_name: {label_name}')
print(f'feature columns: {feature_names}')

# print(test_x[:5])
# print('labels of test data:', test_y[:5])

print('train.shape =', train.shape)
print('test.shape =', test.shape)


Classes/Labels of dataset (column: label): [0.0, 1.0]
row_index_name: id
label_name: label
feature columns: ['MO_HLADR_MFI_cellsul', 'Neu_CD64MFI_cellsul', 'CD3T_cellsul', 'CD8T_cellsul', 'CD4T_cellsul', 'NK_cellsul', 'CD19_cellsul', 'CD45_cellsul', 'Age', 'Sex_0M1F', 'Mono_CD64MFI_cellsul']
train.shape = (87,)
test.shape = (59,)
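
The shapes `(87,)` and `(59,)` are one-dimensional because `np.genfromtxt(..., names=True)` returns a structured array with one record per row, and the field names are sanitized versions of the CSV headers. A small sketch of that behaviour, using two rows from the `head` output above with most columns dropped for brevity (the variable names are illustrative):

```python
import io

import numpy as np

# Two rows copied from the `head` output of train.csv, reduced to four columns.
sample_csv = io.StringIO(
    "id,MO HLADR+ MFI (cells/ul),Sex 0M1F,label\n"
    "0,3556.0,0,1\n"
    "1,1906.0,1,0\n"
)

sample = np.genfromtxt(sample_csv, delimiter=",", names=True, dtype=float)

print(sample.dtype.names)  # header names with spaces replaced and symbols removed
print(sample.shape)        # one entry per row, not (rows, columns)
print(sample["label"])     # columns are accessed by field name
```

This is why `pd.DataFrame(train)` in the next cell yields columns such as `MO_HLADR_MFI_cellsul` rather than the original header text.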


In [None]:
# Convert np.array to df
df_train = pd.DataFrame(train)
df_test = pd.DataFrame(test)

if row_index_name:
    df_train.set_index(row_index_name, inplace=True)
    
    df_test.set_index(row_index_name, inplace=True)
    df_test.index = df_test.index.astype(int)


display(df_train.describe())
display(df_train.info())

display(df_test.describe())
display(df_test.info())


Unnamed: 0,MO_HLADR_MFI_cellsul,Neu_CD64MFI_cellsul,CD3T_cellsul,CD8T_cellsul,CD4T_cellsul,NK_cellsul,CD19_cellsul,CD45_cellsul,Age,Sex_0M1F,Mono_CD64MFI_cellsul,label
count,86.0,86.0,87.0,87.0,87.0,87.0,87.0,87.0,87.0,87.0,86.0,87.0
mean,1264.244186,290.383721,982.570115,479.34092,494.904023,212.732874,118.78092,1325.096437,40.218391,0.482759,2066.534884,0.333333
std,765.452376,490.283499,617.332545,344.326452,311.836604,173.553264,96.218344,791.602538,10.461919,0.502599,1198.401364,0.474137
min,112.0,30.0,74.4,36.61,39.59,0.0,4.2,209.25,19.0,0.0,72.0,0.0
25%,685.5,77.5,549.39,237.92,272.745,78.815,52.425,780.615,33.0,0.0,1461.25,0.0
50%,1108.5,124.5,871.71,423.27,459.72,188.78,89.79,1179.27,41.0,0.0,1757.5,0.0
75%,1602.25,244.5,1268.085,624.45,624.36,262.845,155.45,1617.725,49.5,1.0,2238.25,1.0
max,4145.0,3124.0,3791.23,2548.1,1517.81,878.04,485.86,4757.28,60.0,1.0,7515.0,1.0


<class 'pandas.core.frame.DataFrame'>
Float64Index: 87 entries, 0.0 to 86.0
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   MO_HLADR_MFI_cellsul  86 non-null     float64
 1   Neu_CD64MFI_cellsul   86 non-null     float64
 2   CD3T_cellsul          87 non-null     float64
 3   CD8T_cellsul          87 non-null     float64
 4   CD4T_cellsul          87 non-null     float64
 5   NK_cellsul            87 non-null     float64
 6   CD19_cellsul          87 non-null     float64
 7   CD45_cellsul          87 non-null     float64
 8   Age                   87 non-null     float64
 9   Sex_0M1F              87 non-null     float64
 10  Mono_CD64MFI_cellsul  86 non-null     float64
 11  label                 87 non-null     float64
dtypes: float64(12)
memory usage: 8.8 KB


None

Unnamed: 0,MO_HLADR_MFI_cellsul,Neu_CD64MFI_cellsul,CD3T_cellsul,CD8T_cellsul,CD4T_cellsul,NK_cellsul,CD19_cellsul,CD45_cellsul,Age,Sex_0M1F,Mono_CD64MFI_cellsul
count,59.0,59.0,59.0,59.0,59.0,59.0,59.0,59.0,59.0,59.0,59.0
mean,1212.423729,206.491525,1085.340508,546.220339,523.237966,226.820339,115.048983,1439.65339,41.186441,0.355932,1971.220339
std,772.139285,248.195027,564.337155,342.37002,271.730902,189.056327,87.200827,689.02181,9.438503,0.482905,1137.384129
min,82.0,24.0,258.01,114.98,80.39,17.72,2.96,314.25,15.0,0.0,371.0
25%,696.5,65.0,629.89,268.3,336.955,88.33,59.5,914.84,34.5,0.0,1283.5
50%,1010.0,114.0,1025.32,433.61,511.0,174.86,98.24,1378.32,42.0,0.0,1701.0
75%,1623.0,232.0,1495.395,751.38,676.53,318.14,143.56,1855.05,49.0,1.0,2375.0
max,4195.0,1141.0,2771.2,1738.55,1225.68,956.78,501.91,3355.86,62.0,1.0,6788.0


<class 'pandas.core.frame.DataFrame'>
Int64Index: 59 entries, 0 to 58
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   MO_HLADR_MFI_cellsul  59 non-null     float64
 1   Neu_CD64MFI_cellsul   59 non-null     float64
 2   CD3T_cellsul          59 non-null     float64
 3   CD8T_cellsul          59 non-null     float64
 4   CD4T_cellsul          59 non-null     float64
 5   NK_cellsul            59 non-null     float64
 6   CD19_cellsul          59 non-null     float64
 7   CD45_cellsul          59 non-null     float64
 8   Age                   59 non-null     float64
 9   Sex_0M1F              59 non-null     float64
 10  Mono_CD64MFI_cellsul  59 non-null     float64
dtypes: float64(11)
memory usage: 5.5 KB


None

## Clean data - Remove `null` rows

In [None]:
# Check `null` rows (`isnull()` is an alias of `isna()`, so each pair shows the same rows)
display(df_train[df_train.isna().any(axis=1)])
display(df_train[df_train.isnull().any(axis=1)])

display(df_test[df_test.isna().any(axis=1)])
display(df_test[df_test.isnull().any(axis=1)])


Unnamed: 0_level_0,MO_HLADR_MFI_cellsul,Neu_CD64MFI_cellsul,CD3T_cellsul,CD8T_cellsul,CD4T_cellsul,NK_cellsul,CD19_cellsul,CD45_cellsul,Age,Sex_0M1F,Mono_CD64MFI_cellsul,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
39.0,,,1336.54,739.71,550.3,68.46,192.07,1615.68,21.0,0.0,,0.0


Unnamed: 0_level_0,MO_HLADR_MFI_cellsul,Neu_CD64MFI_cellsul,CD3T_cellsul,CD8T_cellsul,CD4T_cellsul,NK_cellsul,CD19_cellsul,CD45_cellsul,Age,Sex_0M1F,Mono_CD64MFI_cellsul,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
39.0,,,1336.54,739.71,550.3,68.46,192.07,1615.68,21.0,0.0,,0.0


Unnamed: 0_level_0,MO_HLADR_MFI_cellsul,Neu_CD64MFI_cellsul,CD3T_cellsul,CD8T_cellsul,CD4T_cellsul,NK_cellsul,CD19_cellsul,CD45_cellsul,Age,Sex_0M1F,Mono_CD64MFI_cellsul
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1


Unnamed: 0_level_0,MO_HLADR_MFI_cellsul,Neu_CD64MFI_cellsul,CD3T_cellsul,CD8T_cellsul,CD4T_cellsul,NK_cellsul,CD19_cellsul,CD45_cellsul,Age,Sex_0M1F,Mono_CD64MFI_cellsul
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1


In [None]:
# Remove `null` rows

## Train data
df_train = df_train[~df_train.isna().any(axis=1)]
display(df_train[df_train.isna().any(axis=1)])



## Test data
df_test = df_test[~df_test.isna().any(axis=1)]
display(df_test[df_test.isna().any(axis=1)])



Unnamed: 0_level_0,MO_HLADR_MFI_cellsul,Neu_CD64MFI_cellsul,CD3T_cellsul,CD8T_cellsul,CD4T_cellsul,NK_cellsul,CD19_cellsul,CD45_cellsul,Age,Sex_0M1F,Mono_CD64MFI_cellsul,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1


Unnamed: 0_level_0,MO_HLADR_MFI_cellsul,Neu_CD64MFI_cellsul,CD3T_cellsul,CD8T_cellsul,CD4T_cellsul,NK_cellsul,CD19_cellsul,CD45_cellsul,Age,Sex_0M1F,Mono_CD64MFI_cellsul
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
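
The boolean-mask filter used above should be equivalent to pandas' built-in `dropna()`. A minimal sketch on a toy frame (the values are illustrative; the middle row mimics the training row with a missing MFI value):

```python
import numpy as np
import pandas as pd

# Toy frame: row 1 has a missing value, like the training row with id 39.
toy = pd.DataFrame(
    {
        "MO_HLADR_MFI_cellsul": [1412.0, np.nan, 403.0],
        "CD3T_cellsul": [1177.19, 1336.54, 313.48],
        "label": [0.0, 0.0, 1.0],
    }
)

mask_filtered = toy[~toy.isna().any(axis=1)]  # approach used in the cell above
dropped = toy.dropna()                        # built-in equivalent

print(mask_filtered.equals(dropped))
print(len(toy) - len(dropped), "row(s) removed")
```

Dropping rows is only safe on the training data; a test row removed this way would be missing from the submission, so it matters that the test set has no missing values here.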


## Split `validation` sets from `train` sets

Reference on randomly selecting rows from a pandas DataFrame:
* https://www.geeksforgeeks.org/how-to-randomly-select-rows-from-pandas-dataframe/

In [None]:
# To get 3 random rows
# each time it gives 3 different rows
# df_validation = df_train.sample(n = 3)

df_train, df_validation = train_test_split(df_train, test_size=0.08, random_state=200)

display(df_train.index)
display(df_validation.index)

Float64Index([ 9.0, 84.0, 19.0, 38.0,  2.0, 30.0, 10.0, 42.0, 61.0, 17.0, 51.0,
              44.0,  5.0, 50.0, 29.0, 65.0, 64.0, 66.0, 68.0, 67.0, 34.0, 47.0,
              25.0, 59.0, 36.0, 70.0, 54.0, 18.0, 41.0,  4.0, 12.0, 74.0, 79.0,
              78.0, 86.0, 49.0,  8.0, 33.0, 48.0, 63.0,  3.0,  0.0, 21.0, 20.0,
              82.0, 46.0, 32.0, 31.0, 53.0, 35.0, 76.0, 75.0, 71.0, 85.0, 13.0,
              55.0, 22.0, 72.0, 24.0, 23.0, 73.0, 15.0, 27.0, 52.0,  7.0,  1.0,
              57.0, 83.0,  6.0, 11.0, 58.0, 14.0, 80.0, 77.0, 56.0, 43.0, 69.0,
              16.0, 26.0],
             dtype='float64', name='id')

Float64Index([81.0, 28.0, 60.0, 37.0, 62.0, 40.0, 45.0], dtype='float64', name='id')
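
The 79/7 split follows from the row count and `test_size=0.08`. The sketch below reproduces the arithmetic; the helper `split_sizes` is hypothetical, and it assumes the validation count is rounded up with the remainder going to training, which is how `train_test_split` is understood to behave for a fractional `test_size`:

```python
import math
from typing import Tuple


def split_sizes(n_samples: int, test_size: float) -> Tuple[int, int]:
    """Return (n_train, n_validation) for a fractional test_size.

    Assumption: the validation count is rounded up and the rest is used for training.
    """
    n_validation = math.ceil(test_size * n_samples)
    return n_samples - n_validation, n_validation


# 87 training rows minus the 1 row dropped for missing values leaves 86.
print(split_sizes(86, 0.08))  # (79, 7)
```

With only 7 validation rows, each misclassified row changes validation accuracy by about 14 percentage points.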

### Build (1) `train_x`, `train_y` and (2) `validation_x`, `validation_y`

In [None]:
train_x = df_train[feature_names]
train_y = df_train[label_name]

test_x = validation_x = df_validation[feature_names]
test_y = validation_y = df_validation[label_name]

train_x

Unnamed: 0_level_0,MO_HLADR_MFI_cellsul,Neu_CD64MFI_cellsul,CD3T_cellsul,CD8T_cellsul,CD4T_cellsul,NK_cellsul,CD19_cellsul,CD45_cellsul,Age,Sex_0M1F,Mono_CD64MFI_cellsul
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
9.0,1412.0,243.0,1177.19,684.42,490.50,185.30,67.22,1441.06,36.0,1.0,1213.0
84.0,634.0,1002.0,1300.00,558.00,724.00,67.00,105.00,1484.26,34.0,0.0,2926.0
19.0,403.0,555.0,313.48,131.53,182.69,46.68,7.90,370.30,40.0,0.0,2209.0
38.0,1010.0,1384.0,570.13,312.90,233.84,80.17,31.18,702.08,56.0,1.0,5501.0
2.0,1586.0,71.0,1332.74,684.20,655.26,244.95,216.52,1820.04,28.0,1.0,1311.0
...,...,...,...,...,...,...,...,...,...,...,...
56.0,1055.0,87.0,913.42,410.52,507.04,77.46,43.08,1040.18,48.0,0.0,1728.0
43.0,1679.0,79.0,483.21,162.00,309.00,227.05,101.09,817.24,39.0,0.0,4480.0
69.0,1495.0,125.0,2910.03,1431.78,1517.81,446.94,401.45,3817.75,20.0,1.0,1793.0
16.0,1045.0,124.0,1179.07,522.49,661.78,210.96,93.63,1498.11,49.0,1.0,2075.0


# Comparison of Classifiers


In [None]:
# from pycaret.classification import *

# clf1 = setup(data = df_train, 
#              target = label_name, 
#             #  categorical_features = feature_names,
#              silent = True)


In [None]:
# compare_models(fold = 25) # 3 rows of training data are selected as validation set per each folding

In [None]:
# create_model('rf')

## Re-install auto-ml `pycaret` lib

In [None]:
# !pip install numpy
# !pip install tf-estimator-nightly==2.8.0.dev2021122109
# !pip install pycaret[full]

!pip install --use-deprecated=legacy-resolver pycaret[full]



In [None]:
from pycaret.classification import *

clf1 = setup(data = df_train, 
             target = label_name, 
            #  categorical_features = feature_names,
             silent = True)


Unnamed: 0,Description,Value
0,session_id,1303
1,Target,label
2,Target Type,Binary
3,Label Encoded,"0.0: 0, 1.0: 1"
4,Original Data,"(79, 12)"
5,Missing Values,False
6,Numeric Features,10
7,Categorical Features,1
8,Ordinal Features,False
9,High Cardinality Features,False


In [None]:
# To save time, split every model into 5 folds (note: the calls below pass fold = 25)
lr = create_model('lr', fold = 25, max_iter=500)
# knn = create_model('knn', fold = 25)
# nb = create_model('nb', fold = 25)
# dt = create_model('dt', fold = 25)
# svm = create_model('svm', fold = 25)
# rbfsvm = create_model('rbfsvm', fold = 25)
# gpc = create_model('gpc', fold = 25)
# mlp = create_model('mlp', fold = 25)
# ridge = create_model('ridge', fold = 25)
rf = create_model('rf', fold = 25)
# qda = create_model('qda', fold = 25)
ada = create_model('ada', fold = 25)
# lda = create_model('lda', fold = 25)
# gbc = create_model('gbc', fold = 25)
et = create_model('et', fold = 25)
xgboost = create_model('xgboost', fold = 25)
lightgbm = create_model('lightgbm', fold = 25)
# catboost = create_model('catboost', fold = 25)

estimator_list = [lr,
                #   knn,
                #   nb,
                #   dt,
                #   svm,
                #   rbfsvm,
                #   gpc,
                #   mlp,
                #   ridge,
                  rf,
                #   qda,
                  ada,
                #   lda,
                #   gbc,
                  et,
                  xgboost,
                  lightgbm,
                #   catboost
                  ]

# Use xgboost as the second-layer (meta) model
stacker_all = stack_models(estimator_list = estimator_list, meta_model = xgboost)

pred = predict_model(stacker_all,data = df_validation)

accuracy_score(pred[label_name], pred[['Label']].astype(float))


Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.8333,0.625,0.5,1.0,0.6667,0.5714,0.6325
1,0.8333,0.875,0.5,1.0,0.6667,0.5714,0.6325
2,0.8333,1.0,0.5,1.0,0.6667,0.5714,0.6325
3,0.8333,1.0,1.0,0.6667,0.8,0.6667,0.7071
4,0.8333,0.6875,0.5,1.0,0.6667,0.5714,0.6325
5,1.0,1.0,1.0,1.0,1.0,1.0,1.0
6,1.0,1.0,1.0,1.0,1.0,1.0,1.0
7,0.6,0.6667,0.5,0.5,0.5,0.1667,0.1667
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Stacking Classifier,1.0,1.0,1.0,1.0,1.0,1.0,1.0


1.0
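
To make the two-layer idea concrete outside of pycaret, here is a hedged sketch with scikit-learn's `StackingClassifier`. It is not the notebook's pipeline: the data is synthetic, only two base estimators are used, and `LogisticRegression` is swapped in for xgboost as the meta-model to keep the example dependency-free.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 80 rows, 11 features, binary target (not the competition data).
X, y = make_classification(n_samples=80, n_features=11, random_state=0)
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Layer 1: base estimators. Layer 2: a meta-model trained on their cross-validated predictions.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=500),  # swapped in for xgboost
    cv=5,
)
stack.fit(X_fit, y_fit)

holdout_pred = stack.predict(X_holdout)
print("holdout accuracy:", stack.score(X_holdout, y_holdout))
```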

In [None]:
y_true = pred['label']
y_pred = pred['Label'].astype(float)

print('classification_report() =\n', classification_report(y_true, list(y_pred), digits=5))    


classification_report() =
               precision    recall  f1-score   support

         0.0    1.00000   1.00000   1.00000         6
         1.0    1.00000   1.00000   1.00000         1

    accuracy                        1.00000         7
   macro avg    1.00000   1.00000   1.00000         7
weighted avg    1.00000   1.00000   1.00000         7
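
The support column shows 6 negatives and 1 positive in the validation set, so a model that always predicts class 0 would already reach 6/7 accuracy. A pure-Python sketch of that baseline (the helper `per_class_metrics` and the variable names are illustrative):

```python
from typing import Dict, List


def per_class_metrics(y_true: List[int], y_pred: List[int], cls: int) -> Dict[str, float]:
    """Precision and recall for one class; 0.0 when a denominator is empty."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}


# Same class balance as the validation set above: six 0s and one 1.
val_true = [0, 0, 0, 0, 0, 0, 1]
always_zero = [0] * len(val_true)  # hypothetical majority-class baseline

baseline_accuracy = sum(t == p for t, p in zip(val_true, always_zero)) / len(val_true)
print(round(baseline_accuracy, 4))                   # 0.8571
print(per_class_metrics(val_true, always_zero, 1))   # class 1 is never found
```

A perfect score on 7 rows is therefore encouraging but weak evidence; the per-class recall for label 1 rests on a single patient.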



## Save `stacker` model

In [None]:
save_model(stacker_all, 'stacker_auc1')
load_model('stacker_auc1')


Transformation Pipeline and Model Successfully Saved
Transformation Pipeline and Model Successfully Loaded


Pipeline(memory=None,
         steps=[('dtypes',
                 DataTypes_Auto_infer(categorical_features=[],
                                      display_types=False, features_todrop=[],
                                      id_columns=[],
                                      ml_usecase='classification',
                                      numerical_features=[], target='label',
                                      time_features=[])),
                ('imputer',
                 Simple_Imputer(categorical_strategy='not_available',
                                fill_value_categorical=None,
                                fill_value_numerical=None,
                                numeric_strat...
                                                                  missing=nan,
                                                                  monotone_constraints='()',
                                                                  n_estimators=100,
                               

In [None]:

dump(stacker_all, 'stacker_auc1.joblib')
load('stacker_auc1.joblib')
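
A quick way to check a saved model is to keep the object returned by `load` and compare its predictions with the original. The sketch below does this with a tiny stand-in model on toy data; the file name `toy_model.joblib` is hypothetical:

```python
import os
import tempfile

from joblib import dump, load
from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in model (toy data), used only to illustrate the save/load round trip.
toy_X = [[0.0], [1.0], [2.0], [3.0]]
toy_y = [0, 0, 1, 1]
toy_model = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.joblib")  # hypothetical file name
    dump(toy_model, model_path)
    restored = load(model_path)

# The restored model should predict exactly like the original.
print((restored.predict(toy_X) == toy_model.predict(toy_X)).all())
```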

# Compress `models/` folder


In [None]:
# %%sh
# zip -r {folder_models}.zip {folder_models}

In [None]:
!echo "$(TZ=':Asia/Hong_Kong' date +"%Y%m%d.%Hh%Mm")"

# Test

In [None]:
def saveResult(df_result: pd.DataFrame,
               csv_name: str,
               label_name: str = 'label',
               folder: str = 'results'
               ) -> None:
    """Write the predictions to `{folder}/{csv_name}.csv`.

    result.csv format (the DataFrame index supplies the `id` column)
    | id | label |
    |----|-------|
    | xx | 0 or 1|
    """

    # create the output folder if it does not exist yet
    os.makedirs(folder, exist_ok=True)

    # np.savetxt(csv_name, result[['id', 'label']], delimiter=",")
    df_result[[label_name]].to_csv(f'{folder}/{csv_name}.csv', index=True)

In [None]:


# # load models
path = 'stacker_auc1'
# model = load_model('stacker_auc1')

# pred_y = testModel(model, df_test[feature_names])
df_result = predict_model(stacker_all, data = df_test)
df_result = df_result.rename(columns={"Label": label_name})
df_result[label_name] = df_result[label_name].astype(float).astype(int)

display(df_result)

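# Name the CSV after the model: strip any folder prefix and file extension.
# (path.find('.joblib') returns -1 when the extension is absent, which would cut off the last character.)
csv_name = os.path.splitext(os.path.basename(path))[0]
saveResult(df_result, csv_name, label_name, 'results')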

Unnamed: 0_level_0,MO_HLADR_MFI_cellsul,Neu_CD64MFI_cellsul,CD3T_cellsul,CD8T_cellsul,CD4T_cellsul,NK_cellsul,CD19_cellsul,CD45_cellsul,Age,Sex_0M1F,Mono_CD64MFI_cellsul,label,Score
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
0,2843.0,156.0,1358.52,730.78,637.85,127.06,94.82,1588.62,45.0,1.0,3256.0,0,0.9675
1,437.0,137.0,509.43,268.05,243.07,390.86,98.24,1002.76,51.0,1.0,491.0,0,0.9731
2,826.0,82.0,1232.22,493.42,744.08,516.28,320.15,2200.58,32.0,0.0,1381.0,0,0.9741
3,861.0,50.0,1512.86,925.51,590.07,380.25,25.8,1929.1,50.0,0.0,1377.0,0,0.9811
4,1160.0,157.0,890.42,403.91,489.53,266.92,87.63,1251.52,43.0,0.0,1844.0,0,0.9853
5,867.0,85.0,1662.11,865.5,804.14,220.68,92.58,2063.11,44.0,1.0,986.0,0,0.9855
6,1330.0,114.0,1307.95,710.86,607.96,271.01,214.49,1855.05,31.0,0.0,2077.0,0,0.9488
7,494.0,48.0,1522.39,618.19,911.49,338.85,104.45,2013.05,36.0,1.0,1409.0,0,0.9911
8,2119.0,73.0,1219.66,732.14,468.48,71.54,83.08,1378.32,45.0,1.0,2403.0,0,0.8442
9,2052.0,39.0,1223.65,642.55,565.39,323.67,153.64,1711.88,36.0,0.0,1701.0,0,0.9879
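
The submission file only needs the `id` index and the integer `label` column. A minimal sketch of that output format on a toy frame (the values are illustrative, and the CSV is written to an in-memory buffer instead of the `results` folder):

```python
import io

import pandas as pd

# Toy predictions (illustrative values); the index plays the role of the `id` column.
toy_result = pd.DataFrame(
    {"Score": [0.97, 0.62, 0.88], "label": [0.0, 1.0, 0.0]},
    index=pd.Index([0, 1, 2], name="id"),
)
toy_result["label"] = toy_result["label"].astype(int)

buffer = io.StringIO()
toy_result[["label"]].to_csv(buffer, index=True)  # keep only id + label
print(buffer.getvalue())
```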


In [None]:
df_test

Unnamed: 0_level_0,MO_HLADR_MFI_cellsul,Neu_CD64MFI_cellsul,CD3T_cellsul,CD8T_cellsul,CD4T_cellsul,NK_cellsul,CD19_cellsul,CD45_cellsul,Age,Sex_0M1F,Mono_CD64MFI_cellsul
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0.0,2843.0,156.0,1358.52,730.78,637.85,127.06,94.82,1588.62,45.0,1.0,3256.0
1.0,437.0,137.0,509.43,268.05,243.07,390.86,98.24,1002.76,51.0,1.0,491.0
2.0,826.0,82.0,1232.22,493.42,744.08,516.28,320.15,2200.58,32.0,0.0,1381.0
3.0,861.0,50.0,1512.86,925.51,590.07,380.25,25.8,1929.1,50.0,0.0,1377.0
4.0,1160.0,157.0,890.42,403.91,489.53,266.92,87.63,1251.52,43.0,0.0,1844.0
5.0,867.0,85.0,1662.11,865.5,804.14,220.68,92.58,2063.11,44.0,1.0,986.0
6.0,1330.0,114.0,1307.95,710.86,607.96,271.01,214.49,1855.05,31.0,0.0,2077.0
7.0,494.0,48.0,1522.39,618.19,911.49,338.85,104.45,2013.05,36.0,1.0,1409.0
8.0,2119.0,73.0,1219.66,732.14,468.48,71.54,83.08,1378.32,45.0,1.0,2403.0
9.0,2052.0,39.0,1223.65,642.55,565.39,323.67,153.64,1711.88,36.0,0.0,1701.0
