## Introduction

After completing the [Pre-processing and Training Data Development](https://github.com/tvo10/DSCT/blob/main/First%20Capstone/afib_detection_feature_engineering.ipynb) step, we have three files:
1. 11 features and 1 label.
2. 13 features and 1 label.
3. 25 features and 1 label.

We do not yet know which dataset will yield the highest accuracy score. In this notebook, we therefore read in each file and apply different algorithms to compare their accuracy scores. Besides accuracy, we also focus on recall, since we want to detect as many Atrial Fibrillation cases as possible.
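The difference between the two metrics can be sketched on a toy example. The labels below are made up for illustration, and the class encoding (0 = normal, 1 = Atrial Fibrillation, 2 = other arrhythmia) is an assumption inferred from the order in which the classes are discussed in this notebook. The point is that a model can score well on accuracy while missing every Atrial Fibrillation case.

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy labels (hypothetical). Assumed encoding:
# 0 = normal, 1 = Atrial Fibrillation, 2 = other arrhythmia
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 0, 0, 2, 2, 2, 2, 2]

accuracy = accuracy_score(y_true, y_pred)

# average=None returns one recall value per class, in the order given by `labels`
per_class_recall = recall_score(y_true, y_pred, labels=[0, 1, 2], average=None)

print(f"accuracy:    {accuracy:.2f}")
print(f"AFib recall: {per_class_recall[1]:.2f}")
```

Here 8 of 10 predictions are correct, yet neither of the two Atrial Fibrillation cases is detected, which is why the recall column of each classification report is read alongside the accuracy.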

In [1]:
# import essential libraries
import pandas as pd
import numpy as np
import pickle
import scipy
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression, LogisticRegressionCV, SGDClassifier, RidgeClassifier
from sklearn.metrics import accuracy_score, r2_score, mean_squared_error, mean_absolute_error, f1_score
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, RandomizedSearchCV
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn import svm, linear_model
from sklearn import tree, metrics
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder
import lightgbm
from bayes_opt import BayesianOptimization
from catboost import CatBoostClassifier, cv, Pool
import gzip

## (1) 11 Features and 1 Label

**10 Features:**
* age
* sex
* height
* weight
* heart_axis
* validated_by
* second_opinion
* validated_by_human
* pacemaker
* strat_fold

**1 Label:**
* ritmi

This CSV file consists of 1803 observations and 11 variables (the 10 features listed above plus the label). Rows with missing values in the height and weight columns have already been dropped.

In [2]:
# read in csv 
df = pd.read_csv('training_11_features.csv')
df = df.dropna()
# df = df[df['ritmi'] != 0]
df = df.reset_index(drop=True)
print(df.shape)
df.head()

(1803, 11)


Unnamed: 0,ritmi,age,sex,height,weight,heart_axis,validated_by,second_opinion,validated_by_human,pacemaker,strat_fold
0,2,29.0,1,164.0,56.0,0,0.0,0,1,0,1
1,0,59.0,0,156.0,75.0,0,0.0,0,1,0,9
2,2,84.0,1,152.0,51.0,0,0.0,0,1,0,7
3,0,79.0,0,172.0,66.0,0,0.0,0,1,0,5
4,1,67.0,0,178.0,73.0,4,0.0,0,1,0,5


In [3]:
# convert the float columns to int64 so that every column has the same dtype
float_cols = ['age', 'height', 'weight', 'validated_by']
df[float_cols] = df[float_cols].astype('int64')

# get info for columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1803 entries, 0 to 1802
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ritmi               1803 non-null   int64
 1   age                 1803 non-null   int64
 2   sex                 1803 non-null   int64
 3   height              1803 non-null   int64
 4   weight              1803 non-null   int64
 5   heart_axis          1803 non-null   int64
 6   validated_by        1803 non-null   int64
 7   second_opinion      1803 non-null   int64
 8   validated_by_human  1803 non-null   int64
 9   pacemaker           1803 non-null   int64
 10  strat_fold          1803 non-null   int64
dtypes: int64(11)
memory usage: 155.1 KB


In [4]:
# train-test split
X = df.drop(columns='ritmi')
y = df['ritmi']
X_train, X_test, y_train, y_test=train_test_split(X, y, test_size = 0.25, random_state = 246)

### Random Forest

We tuned a RandomForestClassifier with GridSearchCV and obtained a best cross-validated accuracy of 0.46. We then used the fitted model to predict X_test, which gave a test accuracy of 0.45. The recall column shows that the model correctly detects 44% of normal cases, 43% of Atrial Fibrillation cases, and 47% of other arrhythmia cases.

In [5]:
# Plug in appropriate max_depth and random_state parameters
rf = RandomForestClassifier()
rf_param_grid = {'n_estimators': [600], 'criterion': ['entropy'], 'max_depth': [60]} #0.4615443314230772
rf_cv= GridSearchCV(rf,rf_param_grid,cv=7,n_jobs=-1)
rf_cv.fit(X_train,y_train)

print("Best Score:" + str(rf_cv.best_score_))
print("Best Parameters: " + str(rf_cv.best_params_))

Best Score:0.46004486939800227
Best Parameters: {'criterion': 'entropy', 'max_depth': 60, 'n_estimators': 600}


In [6]:
y_pred = rf_cv.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.45      0.44      0.45       162
           1       0.45      0.43      0.44       117
           2       0.44      0.47      0.45       172

    accuracy                           0.45       451
   macro avg       0.45      0.45      0.45       451
weighted avg       0.45      0.45      0.45       451
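The score in the code comment above (0.4615) differs slightly from the printed best score (0.4600), which is what one would expect when the forest is built without a fixed seed. A minimal sketch of how a fixed `random_state` makes the search repeatable is shown below. The data is synthetic, and the helper name `best_cv_score` and the smaller grid values are illustrative assumptions, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for X_train / y_train (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

def best_cv_score(seed):
    """Run the same one-candidate grid search with a seeded forest (hypothetical helper)."""
    rf = RandomForestClassifier(random_state=seed)
    grid = {"n_estimators": [25], "criterion": ["entropy"], "max_depth": [10]}
    search = GridSearchCV(rf, grid, cv=3)
    search.fit(X_demo, y_demo)
    return search.best_score_

first_run = best_cv_score(seed=246)
second_run = best_cv_score(seed=246)

# With the same seed and the same (unshuffled) folds, both runs agree
print(first_run == second_run)
```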



### LightGBM

We also tuned a LightGBM model with BayesianOptimization and obtained 0.64 as the highest score using the AUC metric. This number is higher than the Random Forest score, but AUC and accuracy are different metrics, so the two scores cannot be compared directly. We therefore apply other algorithms later to see whether any of them improves the accuracy score.
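One way to put models on the same footing is to report both metrics for the same fitted model. The sketch below does this on synthetic three-class data with a LogisticRegression stand-in (both are assumptions for illustration, not the LightGBM setup used here), using scikit-learn's one-vs-rest multiclass AUC.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the real features (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(Xd_train, yd_train)
proba = clf.predict_proba(Xd_test)  # one probability column per class

acc = accuracy_score(yd_test, clf.predict(Xd_test))
# multi_class="ovr" averages the AUC of each class against the rest
auc_ovr = roc_auc_score(yd_test, proba, multi_class="ovr")

print(f"accuracy: {acc:.3f}  one-vs-rest AUC: {auc_ovr:.3f}")
```

The two numbers generally differ for the same model, which illustrates why an AUC of 0.64 and an accuracy of 0.45 should not be ranked against each other.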

In [7]:
def lgb_eval(num_leaves,max_depth,lambda_l2,lambda_l1,min_child_samples, min_data_in_leaf):
    params = {
        "objective" : "binary",
        "metric" : "auc", 
        'is_unbalance': True,
        "num_leaves" : int(num_leaves),
        "max_depth" : int(max_depth),
        "lambda_l2" : lambda_l2,
        "lambda_l1" : lambda_l1,
        "num_threads" : 20,
        "min_child_samples" : int(min_child_samples),
        'min_data_in_leaf': int(min_data_in_leaf),
        "learning_rate" : 0.03,
        "subsample_freq" : 5,
        "bagging_seed" : 42,
        "verbosity" : -1
    }
    lgtrain = lightgbm.Dataset(X_train, y_train)
    cv_result = lightgbm.cv(params,
                       lgtrain,
                       1000,
                       early_stopping_rounds=100,
                       stratified=True,
                       nfold=3)
    return cv_result['auc-mean'][-1]

In [8]:
lgbBO = BayesianOptimization(lgb_eval, {'num_leaves': (25, 4000),
                                                'max_depth': (5, 63),
                                                'lambda_l2': (0.0, 0.05),
                                                'lambda_l1': (0.0, 0.05),
                                                'min_child_samples': (50, 10000),
                                                'min_data_in_leaf': (100, 2000)
                                                })

lgbBO.maximize(n_iter=10, init_points=2)

|   iter    |  target   | lambda_l1 | lambda_l2 | max_depth | min_ch... | min_da... | num_le... |
-------------------------------------------------------------------------------------------------
| 1         | 0.5       | 0.001754  | 0.01623   | 13.16     | 5.433e+0  | 1.488e+0  | 1.711e+0  |
| 2         | 0.5       | 0.042     | 0.02319   | 49.17     | 1.11e+03  | 1.822e+0  | 1.603e+0  |
| 3         | 0.6337    | 0.04281   | 0.03493   | 39.03     | 226.1     | 218.8     | 3.964e+0  |
| 4         | 0.5       | 0.03825   | 0.01928   | 12.82     | 8.253e+0  | 1.422e+0  | 1.689e+0  |
| 5         | 0.5       | 0.002021  | 0.04044   | 57.3      | 7.747e+0  | 890.2     | 3

### Other Algorithms

Among the other algorithms we tried, KNeighborsClassifier returned the highest cross-validated accuracy (0.49). On X_test it reaches an accuracy of 0.45, the same as Random Forest. The recall column shows that the model correctly detects 41% of normal cases, 47% of Atrial Fibrillation cases, and 49% of other arrhythmia cases. Because it detects a larger share of Atrial Fibrillation cases than Random Forest (47% versus 43%) at the same test accuracy, we consider KNeighborsClassifier the most suitable algorithm for this dataset.

In [9]:
# clfl2=LogisticRegression(max_iter=1000000)
# parameters = {'C': [10000], 'solver': ['saga'],  'multi_class': ['auto']} # 0.4681891485581523

# clfl2 = svm.SVC()
# parameters = {'kernel':['linear'], 'C':[8]} #0.45558562252289186

# clfl2 = LogisticRegressionCV(max_iter=100000)
# parameters = {"Cs": [10], 'solver': ['saga'], 'fit_intercept':[True], 'penalty': ['l1']} # 0.4711466447997813

# clfl2 = RidgeClassifier(max_iter=1000)
# parameters = {'alpha': [0.9], 'solver': ['auto']} #0.4592865928659286

clfl2 = KNeighborsClassifier()
parameters = {'n_neighbors': [150], 'weights': ['distance'], 'metric': ['euclidean']} #0.4889326226595599

fitmodel = GridSearchCV(clfl2, param_grid=parameters, cv=5, refit=True, scoring="accuracy", n_jobs=-1, verbose=3)
fitmodel.fit(X_train, y_train)
print(fitmodel.best_estimator_, fitmodel.best_params_, fitmodel.best_score_)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    0.0s remaining:    0.0s


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
                     metric_params=None, n_jobs=None, n_neighbors=150, p=2,
                     weights='distance') {'metric': 'euclidean', 'n_neighbors': 150, 'weights': 'distance'} 0.4889326226595599


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.7s finished


In [10]:
y_pred = fitmodel.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.47      0.41      0.44       162
           1       0.49      0.47      0.48       117
           2       0.42      0.49      0.45       172

    accuracy                           0.45       451
   macro avg       0.46      0.46      0.46       451
weighted avg       0.46      0.45      0.45       451
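The cell above passes a single candidate per parameter, with the scores of earlier attempts kept in comments. As a hedged sketch, the same search can be written with several candidates and two scoring metrics, so that accuracy and recall are recorded for every candidate in one run. The data is synthetic and the candidate values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for X_train / y_train (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=3, random_state=0
)

param_grid = {
    "n_neighbors": [5, 15, 30],          # illustrative candidates
    "weights": ["uniform", "distance"],
    "metric": ["euclidean"],
}
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid=param_grid,
    cv=5,
    scoring={"accuracy": "accuracy", "recall_macro": "recall_macro"},
    refit="accuracy",  # best_params_ and the refitted model follow accuracy
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.cv_results_["mean_test_recall_macro"])  # one value per candidate
```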



## (2) 13 Features and 1 Label

**13 Features:**
* age
* sex
* height
* weight
* nurse
* site
* device
* heart_axis
* validated_by
* second_opinion
* validated_by_human
* pacemaker
* strat_fold

**1 Label:** 
* ritmi

This CSV file consists of 6366 observations and 14 variables. Instead of dropping rows with missing values, we filled the missing values in the age, height, and weight columns with the column means. We also filled the missing values in the nurse, site, validated_by, heart_axis, and pacemaker columns with 0.
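The filling strategy described above can be sketched in pandas as follows. The tiny frame and its values are hypothetical, and only a subset of the columns is shown; the actual imputation was done in the pre-processing notebook.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame with a subset of the real columns
raw = pd.DataFrame({
    "age": [54.0, np.nan, 29.0],
    "height": [np.nan, 164.0, 170.0],
    "weight": [70.0, 56.0, np.nan],
    "nurse": [np.nan, 7.0, 1.0],
    "pacemaker": [0.0, np.nan, np.nan],
})

mean_cols = ["age", "height", "weight"]   # filled with the column mean
zero_cols = ["nurse", "pacemaker"]        # filled with 0

filled = raw.copy()
# Passing a Series of means fills each column with its own mean
filled[mean_cols] = filled[mean_cols].fillna(filled[mean_cols].mean())
filled[zero_cols] = filled[zero_cols].fillna(0)

print(filled)
```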

In [11]:
df = pd.read_csv('training_13_features.csv')
new_df = df.dropna()
# new_df = new_df[new_df['ritmi'] != 2]
new_df = new_df.reset_index(drop=True)
df.head()

Unnamed: 0,ritmi,age,sex,height,weight,nurse,site,device,heart_axis,validated_by,second_opinion,validated_by_human,pacemaker,strat_fold
0,2,54.0,0,166.796356,69.841845,0.0,0.0,0,3.0,0.0,0,0,0.0,6
1,1,54.0,0,166.796356,69.841845,0.0,0.0,0,3.0,0.0,0,0,0.0,6
2,0,55.0,0,166.796356,69.841845,1.0,2.0,1,1.0,1.0,0,1,0.0,10
3,2,29.0,1,164.0,56.0,7.0,1.0,10,0.0,0.0,0,1,0.0,1
4,2,57.0,0,166.796356,69.841845,0.0,0.0,0,3.0,0.0,0,0,0.0,1


In [12]:
# convert all columns' types to float64
new_df = new_df.astype('float64')
    
# get info for columns
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6366 entries, 0 to 6365
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ritmi               6366 non-null   float64
 1   age                 6366 non-null   float64
 2   sex                 6366 non-null   float64
 3   height              6366 non-null   float64
 4   weight              6366 non-null   float64
 5   nurse               6366 non-null   float64
 6   site                6366 non-null   float64
 7   device              6366 non-null   float64
 8   heart_axis          6366 non-null   float64
 9   validated_by        6366 non-null   float64
 10  second_opinion      6366 non-null   float64
 11  validated_by_human  6366 non-null   float64
 12  pacemaker           6366 non-null   float64
 13  strat_fold          6366 non-null   float64
dtypes: float64(14)
memory usage: 696.4 KB


In [13]:
# train-test split
X = new_df.drop(columns='ritmi')
y = new_df['ritmi']
X_train, X_test, y_train, y_test=train_test_split(X, y, test_size = 0.2, random_state = 246)

### Random Forest

We tuned a RandomForestClassifier with GridSearchCV and obtained a best cross-validated accuracy of 0.50. We then used the fitted model to predict X_test, which also gave an accuracy of 0.50. The recall column shows that the model correctly detects 38% of normal cases, 48% of Atrial Fibrillation cases, and 60% of other arrhythmia cases.

In [14]:
# Plug in appropriate max_depth and random_state parameters
rf = RandomForestClassifier()
rf_param_grid = {'n_estimators': [600], 'criterion': ['entropy'], 'max_depth': [60]} #0.502161524857536
rf_cv= GridSearchCV(rf,rf_param_grid,cv=7,n_jobs=-1)
rf_cv.fit(X_train,y_train)

print("Best Score:" + str(rf_cv.best_score_))
print("Best Parameters: " + str(rf_cv.best_params_))

Best Score:0.5037340827771167
Best Parameters: {'criterion': 'entropy', 'max_depth': 60, 'n_estimators': 600}


In [15]:
y_pred = rf_cv.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.43      0.38      0.40       407
         1.0       0.51      0.48      0.49       318
         2.0       0.54      0.60      0.57       549

    accuracy                           0.50      1274
   macro avg       0.49      0.49      0.49      1274
weighted avg       0.50      0.50      0.50      1274



### LightGBM

We also tuned a LightGBM model with BayesianOptimization and obtained 0.64 as the highest score using the AUC metric.

In [16]:
lgbBO = BayesianOptimization(lgb_eval, {'num_leaves': (25, 4000),
                                                'max_depth': (5, 63),
                                                'lambda_l2': (0.0, 0.05),
                                                'lambda_l1': (0.0, 0.05),
                                                'min_child_samples': (50, 10000),
                                                'min_data_in_leaf': (100, 2000)
                                                })

lgbBO.maximize(n_iter=10, init_points=2)

|   iter    |  target   | lambda_l1 | lambda_l2 | max_depth | min_ch... | min_da... | num_le... |
-------------------------------------------------------------------------------------------------
| 1         | 0.6391    | 0.04647   | 0.02277   | 18.49     | 8.727e+0  | 134.5     | 1.371e+0  |
| 2         | 0.6044    | 0.02907   | 0.006896  | 52.64     | 8.533e+0  | 1.595e+0  | 3.268e+0  |
| 3         | 0.6383    | 0.003875  | 0.03582   | 21.89     | 70.68     | 153.7     | 117.0     |
| 4         | 0.6081    | 0.01025   | 0.01776   | 45.82     | 4.809e+0  | 1.457e+0  | 1.265e+0  |
| 5         | 0.6408    | 0.04423   | 0.02645   | 41.34     | 1.185e+0  | 153.9     | 3

### K-Neighbors

Finally, we applied the KNeighborsClassifier algorithm, which reached an accuracy of 0.48 on X_test. The recall column shows that the model correctly detects 23% of normal cases, 40% of Atrial Fibrillation cases, and 70% of other arrhythmia cases. We therefore conclude that Random Forest works best for this dataset: it returns an accuracy of 0.50 and correctly detects 48% of Atrial Fibrillation cases.

In [17]:
clfl2 = KNeighborsClassifier()
parameters = {'n_neighbors': [140], 'weights': ['distance'], 'metric': ['euclidean']} #0.4830258302583026

fitmodel = GridSearchCV(clfl2, param_grid=parameters, cv=5, refit=True, scoring="accuracy", n_jobs=-1, verbose=2)
fitmodel.fit(X_train, y_train)
print(fitmodel.best_estimator_, fitmodel.best_params_, fitmodel.best_score_)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    0.1s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.1s finished


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
                     metric_params=None, n_jobs=None, n_neighbors=140, p=2,
                     weights='distance') {'metric': 'euclidean', 'n_neighbors': 140, 'weights': 'distance'} 0.48291479569900764


In [18]:
y_pred = fitmodel.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.38      0.23      0.29       407
         1.0       0.54      0.40      0.46       318
         2.0       0.49      0.70      0.58       549

    accuracy                           0.48      1274
   macro avg       0.47      0.44      0.44      1274
weighted avg       0.47      0.48      0.45      1274
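Comparing two models on both accuracy and Atrial Fibrillation recall, as done in this section, can be tabulated directly from `classification_report(..., output_dict=True)`. The sketch below uses made-up predictions for two hypothetical models, and assumes class 1 encodes Atrial Fibrillation. With integer labels the report keys are the strings `"0"`, `"1"`, `"2"`; with float labels, as in this dataset, they would be `"0.0"`, `"1.0"`, `"2.0"`.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Made-up labels and predictions for two hypothetical models
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
preds = {
    "model_a": [0, 1, 1, 1, 2, 2, 2, 2, 0, 2],
    "model_b": [0, 0, 1, 2, 2, 2, 2, 2, 2, 2],
}

rows = {}
for name, y_pred in preds.items():
    report = classification_report(y_true, y_pred, output_dict=True)
    rows[name] = {
        "accuracy": report["accuracy"],
        "afib_recall": report["1"]["recall"],  # class 1 assumed to be AFib
    }

# One row per model, one column per metric
summary = pd.DataFrame(rows).T
print(summary)
```

In this toy case `model_b` has the higher accuracy but the lower Atrial Fibrillation recall, the same kind of trade-off that has to be weighed when choosing between the models above.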



## (3) 25 Features and 1 Label

**25 Features:**
* I
* II
* III
* aVF
* aVR
* aVL
* V1
* V2
* V3
* V4
* V5
* V6
* age
* sex
* height
* weight
* nurse
* site
* device
* heart_axis
* validated_by
* second_opinion
* validated_by_human
* pacemaker
* strat_fold

**1 Label:**
* ritmi

This dataset consists of 4319176 observations and 26 variables. It differs from the other two datasets because we combined the 12 variables from the compressed NumPy data file with the 14 variables of the second dataset. It therefore carries information about the 12 ECG leads, which may help in detecting Atrial Fibrillation cases.
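A minimal sketch of how such a combination can be built is shown below. It assumes the signal array has the shape (recordings, time samples, 12 leads) and that the metadata has one row per recording; the array contents, shapes, and variable names are hypothetical, and the actual steps are in the feature engineering notebook.

```python
import numpy as np
import pandas as pd

leads = ["I", "II", "III", "aVF", "aVR", "aVL", "V1", "V2", "V3", "V4", "V5", "V6"]

# Hypothetical shapes: 2 recordings, 4 time samples each, 12 leads
signals = np.arange(2 * 4 * 12, dtype=float).reshape(2, 4, 12)
# Hypothetical metadata, one row per recording (subset of the real columns)
meta = pd.DataFrame({"ritmi": [2, 0], "age": [54.0, 55.0], "sex": [0, 0]})

n_records, n_samples, n_leads = signals.shape

# One row per time sample, one column per lead
signal_rows = pd.DataFrame(signals.reshape(-1, n_leads), columns=leads)

# Repeat each recording's metadata once per time sample so the rows line up
meta_rows = meta.loc[meta.index.repeat(n_samples)].reset_index(drop=True)

combined = pd.concat([signal_rows, meta_rows], axis=1)
print(combined.shape)
```

Under these assumptions every time sample becomes its own observation and the recording-level columns are constant within a recording, which matches the repeated metadata values visible in the preview of the 25-feature file.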

In [19]:
# read in csv
df = pd.read_csv('training_25_features.csv')
df

Unnamed: 0,I,II,III,aVF,aVR,aVL,V1,V2,V3,V4,...,weight,nurse,site,device,heart_axis,validated_by,second_opinion,validated_by_human,pacemaker,strat_fold
0,-0.005,0.135,0.140,-0.065,-0.073,0.137,-0.125,-0.090,-0.110,-0.210,...,69.841845,0.0,0.0,0,3.0,0.0,0,0,0.0,6
1,-0.005,0.135,0.140,-0.065,-0.073,0.137,-0.125,-0.090,-0.110,-0.211,...,69.841845,0.0,0.0,0,3.0,0.0,0,0,0.0,6
2,-0.005,0.131,0.136,-0.063,-0.070,0.133,-0.125,-0.082,-0.102,-0.190,...,69.841845,0.0,0.0,0,3.0,0.0,0,0,0.0,6
3,-0.005,0.130,0.135,-0.063,-0.070,0.132,-0.122,-0.077,-0.094,-0.172,...,69.841845,0.0,0.0,0,3.0,0.0,0,0,0.0,6
4,-0.005,0.128,0.133,-0.062,-0.069,0.130,-0.119,-0.071,-0.084,-0.157,...,69.841845,0.0,0.0,0,3.0,0.0,0,0,0.0,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4319171,0.010,0.170,0.160,-0.090,-0.075,0.165,0.155,0.365,0.230,0.030,...,69.841845,1.0,2.0,1,3.0,1.0,0,1,0.0,8
4319172,0.014,0.174,0.160,-0.094,-0.073,0.167,0.155,0.368,0.245,0.029,...,69.841845,1.0,2.0,1,3.0,1.0,0,1,0.0,8
4319173,0.016,0.176,0.160,-0.096,-0.073,0.167,0.155,0.383,0.261,0.040,...,69.841845,1.0,2.0,1,3.0,1.0,0,1,0.0,8
4319174,0.014,0.174,0.160,-0.094,-0.073,0.167,0.155,0.406,0.282,0.059,...,69.841845,1.0,2.0,1,3.0,1.0,0,1,0.0,8


In [20]:
# convert all the columns to float64
df = df.astype('float64')
    
# get info for columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4319176 entries, 0 to 4319175
Data columns (total 26 columns):
 #   Column              Dtype  
---  ------              -----  
 0   I                   float64
 1   II                  float64
 2   III                 float64
 3   aVF                 float64
 4   aVR                 float64
 5   aVL                 float64
 6   V1                  float64
 7   V2                  float64
 8   V3                  float64
 9   V4                  float64
 10  V5                  float64
 11  V6                  float64
 12  ritmi               float64
 13  age                 float64
 14  sex                 float64
 15  height              float64
 16  weight              float64
 17  nurse               float64
 18  site                float64
 19  device              float64
 20  heart_axis          float64
 21  validated_by        float64
 22  second_opinion      float64
 23  validated_by_human  float64
 24  pacemaker           floa

In [21]:
# train-test split
X = df.drop(columns='ritmi')
y = df['ritmi']
X_train, X_test, y_train, y_test=train_test_split(X, y, test_size = 0.25, random_state = 1234)

### Random Forest

While training and tuning models on the second dataset, we found that the Random Forest algorithm was the best at detecting Atrial Fibrillation cases. Since this dataset is built on the second dataset (please see [here](https://github.com/tvo10/DSCT/blob/main/First%20Capstone/afib_detection_feature_engineering.ipynb) for further details), we applied only the Random Forest algorithm. Surprisingly, we got **0.99** for the accuracy score, so we conclude that the third dataset is the best of the three. The recall column shows that the model correctly detects 99% of normal cases, 98% of Atrial Fibrillation cases, and 99% of other arrhythmia cases.

In [22]:
# Plug in appropriate max_depth and random_state parameters
rf = RandomForestClassifier()
# rf_param_grid = {'n_estimators': [20], 'criterion': ['entropy'], 'max_depth': [20]} #0.9267653536506913
rf_param_grid = {'n_estimators': [45], 'criterion': ['entropy'], 'max_depth': [45]} #0.9868391563552115
rf_cv= GridSearchCV(rf,rf_param_grid,cv=7,n_jobs=-1)
rf_cv.fit(X_train,y_train)

print("Best Score:" + str(rf_cv.best_score_))
print("Best Parameters: " + str(rf_cv.best_params_))

Best Score:0.986763833361317
Best Parameters: {'criterion': 'entropy', 'max_depth': 45, 'n_estimators': 45}


In [23]:
y_pred = rf_cv.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       1.00      0.99      0.99    335792
         1.0       0.98      0.98      0.98    267846
         2.0       0.99      0.99      0.99    476156

    accuracy                           0.99   1079794
   macro avg       0.99      0.99      0.99   1079794
weighted avg       0.99      0.99      0.99   1079794



## Conclusion

The third dataset, which consists of 25 features and 1 label, is the most suitable dataset for training the model. The table below summarizes the prediction scores of the different algorithms on the three datasets.

<table border="1">
<colgroup>
<col width="15%" />
<col width="16%" />
<col width="20%" />
<col width="27%" />
</colgroup>
<thead valign="bottom">
<tr><th>Datasets</th>
<th>(1) 11 features and 1 label</th>
<th>(2) 13 features and 1 label</th>
<th>(3) 25 features and 1 label</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>Random Forest</td>
<td>0.45 (Accuracy)</td>
<td>0.50 (Accuracy)</td>
<td>0.99 (Accuracy)</td>
</tr>
<tr><td>K-Neighbors</td>
<td>0.45 (Accuracy)</td>
<td>0.48 (Accuracy)</td>
<td>N/A</td>
</tr>
<tr><td>LightGBM</td>
<td>0.64 (AUC)</td>
<td>0.64 (AUC)</td>
<td>N/A</td>
</tr>
</tbody>
</table>