# Notebook description

This notebook has been created for the Covid-19 prediction challenge.

The goal of this notebook is to present a methodology for using exam results and patient data to build a predictive model that evaluates whether a patient has COVID-19 and, if so, which type of treatment unit they will require.

The modeling approach was:
1. Missing values: We first removed data points that had no information other than the age quantile. Including those data points would most likely keep the model from giving significant weight to the test results, hurting their predictive power. Alternatively, we could restrict the model to the few columns with a higher fill percentage; this would most likely generalize better, but be less accurate for patients with richer information.
2. Correlated features: We dropped features with an absolute correlation above 90%, keeping just one feature from each correlated group.
3. Feature selection: We did not apply any feature selection methodology, since we assume that all medical information could be relevant. Going forward, the permutation importance results should be used to decide which columns to keep.
4. Model architecture: For task 1 we used ensemble methods tuned with a randomized search under 10-fold stratified cross-validation. The search only selects the best parameters; the real model performance is evaluated over 500 random train-test splits. For task 2 we used a similar method, but treated the problem as multiclass classification, using both the direct model and a One-vs-All method.
5. Model deployment: To avoid overfitting, we suggest using a bagging classifier with 400 estimators and a max samples of 0.9 (this way, the deployed model's performance should be close to the reported one).
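The deployment suggestion in step 5 can be sketched as follows. This is a minimal illustration, not the notebook's final model: the synthetic data and the `tuned_model` stand-in (a shallow decision tree) are assumptions; only the 400 estimators and the 0.9 max samples come from the description above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the patient data (assumption: ~10% positives)
X, y = make_classification(n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=42)

# Hypothetical tuned estimator; in practice this would be the best model found by the search
tuned_model = DecisionTreeClassifier(max_depth=3, random_state=42)

# 400 estimators, each fitted on a random 90% sample of the rows
bagged = BaggingClassifier(tuned_model, n_estimators=400, max_samples=0.9, random_state=42)
bagged.fit(X, y)

probabilities = bagged.predict_proba(X)
print(len(bagged.estimators_), probabilities.shape)
```

Because every estimator sees a slightly different 90% sample, the averaged prediction is less sensitive to any single train-test split.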

It's important to note that the whole notebook has a number of parameters that should be altered according to the data input.

---

# Imports

In [None]:
# Data engineering libraries
import numpy as np
import pandas as pd

# Python default libraries
import os
import warnings
import itertools
import joblib

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

# Statistical functions
from scipy import stats

# imbalanced learn
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Model selection
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split

# Machine Learning libraries
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.multiclass import OneVsRestClassifier

# Metrics libraries
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import roc_auc_score

# Interpretability libraries
from sklearn.inspection import permutation_importance

# Notebook settings
%config Completer.use_jedi = False
%matplotlib inline
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
warnings.filterwarnings('ignore')
plt.style.use('fivethirtyeight')

<hr>

# Variable definitions

In [None]:
# Set random state to allow reproducible results
rs = 42

<hr>

# Auxiliary functions

In [None]:
# Create function to divide the dataset into different train-test splits and assess the average performance of the model
def mean_performance(features_dataset, target_dataset, number_of_splits, model, percentage_test=0.2, labels=None):
    # Define empty lists to serve as output
    labels = [0, 1] if labels is None else sorted(labels)
    metrics = ['precision', 'recall', 'f1', 'auc']
    output_dict = dict()
    for label in labels:
        for m in metrics:
            output_dict[m + '_class_' + str(label)] = list()
        
    # initialize model
    model = model.estimator.set_params(**model.best_params_)
    
    for n in range(1, number_of_splits + 1):
        # Random train test split
        X_train, X_test, y_train, y_test = train_test_split(features_dataset, target_dataset, test_size=percentage_test, stratify=target_dataset)
        
        # Take the best model and fit it to the training set
        model.fit(X_train, y_train)

        # Make predictions
        predictions = model.predict(X_test)
        probas = model.predict_proba(X_test)

        # Calculate performance metrics
        precision, recall, f1, support = precision_recall_fscore_support(y_test, predictions, labels=labels)
        
        # Break the metrics by class and append results
        dummies = pd.get_dummies(y_test)
        classes = list(model.classes_)
        # precision/recall/f1 are ordered like `labels`, so they are indexed by position;
        # probability columns follow model.classes_, so they are looked up by class value
        for i, label in enumerate(labels):
            if label not in dummies.columns:
                dummies[label] = 0
            auc = roc_auc_score(dummies[label], probas[:, classes.index(label)])
            
            output_dict['precision_class_' + str(label)].append(precision[i])
            output_dict['recall_class_' + str(label)].append(recall[i])
            output_dict['f1_class_' + str(label)].append(f1[i])
            output_dict['auc_class_' + str(label)].append(auc)
       
    # calculate the average results
    avg = dict()
    for label in labels:
        for m in metrics:
            avg['avg_' + m + '_class_' + str(label)] = np.mean(output_dict[m + '_class_' + str(label)])
    
    return avg

In [None]:
# Create function to calculate the permutated feature importance across different train-test splits
def permutated_feature_importance(features_dataset, target_dataset, input_columns, number_of_splits, 
                                  model, score, percentage_test=0.2):
    # Define output dataframe
    output_feature = pd.DataFrame()
    std_feature = pd.DataFrame()
    
    # Create column with the name of the feature
    output_feature['FEATURE'] = input_columns
    std_feature['FEATURE'] = input_columns
    
    # initialize model
    model = model.estimator.set_params(**model.best_params_)
    
    for n in range(1, number_of_splits + 1):
        # Stratified train test split
        X_train, X_test, y_train, y_test = train_test_split(features_dataset, target_dataset, test_size=percentage_test,
                                                    stratify=target_dataset)
        # Take the best model and fit it to the training set
        model.fit(X_train, y_train)
        
        # Dictionary of permutation importances, measured on the held-out split only
        permutated_dict = permutation_importance(model, X_test, y_test, scoring=score, random_state=rs)
        output_feature[n] = permutated_dict['importances_mean']
        std_feature[n] = permutated_dict['importances_std']
    
    # Calculate average importance and its std across all splits (the text column FEATURE is excluded from the mean)
    output_feature['AVG_IMPORTANCE_' + str(number_of_splits)] = output_feature.drop(columns=['FEATURE']).mean(axis=1)
    std_feature['AVG_STD_' + str(number_of_splits)] = std_feature.drop(columns=['FEATURE']).mean(axis=1)
    
    # Merge std avg with final output frame
    output_feature = output_feature.merge(std_feature[['FEATURE', 'AVG_STD_' + str(number_of_splits)]], on=['FEATURE'])
    
    # Sort values by average importance
    output_feature = output_feature.sort_values(by=['AVG_IMPORTANCE_' + str(number_of_splits)], ascending=False)
        
    return output_feature

In [None]:
def get_est_grid(grid, m):
    """
    Get the equivalent grid dictionary for a One vs All estimator
    :param grid: (dict) original grid dictionary
    :param m   : (string) model name
    return: (dict) grid for One vs All
    """
    grid2 = dict()
    for k in grid:
        if m in k:
            grid2[k.replace(m + '__', m + '__estimator__')] = grid[k]
        else:
            grid2[k] = grid[k]
    return grid2
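As a usage illustration: `OneVsRestClassifier` exposes the wrapped model under its `estimator` parameter, so a pipeline key such as `rf__n_estimators` has to become `rf__estimator__n_estimators`. The sketch below re-implements the same renaming rule under a hypothetical name (`to_one_vs_all_grid`) so that it runs on its own; the grid values are made up.

```python
def to_one_vs_all_grid(grid, step_name):
    # Same renaming rule as get_est_grid: only keys of the wrapped step get the extra prefix
    prefix = step_name + '__'
    return {
        (key.replace(prefix, prefix + 'estimator__') if step_name in key else key): values
        for key, values in grid.items()
    }

# Hypothetical grid: one model parameter and one SMOTE parameter
grid = {'rf__n_estimators': [100, 200], 'smt__k_neighbors': [3, 5]}
ovr_grid = to_one_vs_all_grid(grid, 'rf')
print(ovr_grid)
# {'rf__estimator__n_estimators': [100, 200], 'smt__k_neighbors': [3, 5]}
```

The SMOTE key is left untouched because the resampling step is not wrapped by the One-vs-All classifier.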

In [None]:
# function to calculate correlation between categorical variables
def cramers_v(x, y):
    confusion_matrix = pd.crosstab(x, y)
    chi2 = stats.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2)/(n - 1)
    kcorr = k - ((k - 1) ** 2)/(n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
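For reference, the function above appears to compute a bias-corrected variant of Cramér's V. Written out with the same names as the code ($\chi^2$ is the chi-squared statistic, $n$ the number of observations, $r$ and $k$ the number of rows and columns of the contingency table), the steps are:

```latex
\begin{aligned}
\varphi^2 &= \frac{\chi^2}{n}, &
\tilde{\varphi}^2 &= \max\!\left(0,\; \varphi^2 - \frac{(k-1)(r-1)}{n-1}\right), \\
\tilde{r} &= r - \frac{(r-1)^2}{n-1}, &
\tilde{k} &= k - \frac{(k-1)^2}{n-1}, \\
\tilde{V} &= \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k}-1,\; \tilde{r}-1)}}
\end{aligned}
```

The result lies between 0 (no association) and 1 (perfect association), which makes it comparable across pairs of categorical columns.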

---

# Loading data

In [None]:
raw_covid = pd.read_excel('/kaggle/input/covid19/dataset.xlsx')

---

# EDA

## Dataset basic information

We begin by checking basic information about the data, such as the header and the shape, to get an initial idea of the data challenges we may be facing.

In [None]:
raw_covid.sample(10)

*Note that a couple of columns relate to other coronavirus tests (e.g. Coronavirus HKU1, Coronavirus 229E, ...). However, even if a patient tests positive in one of these, it does not necessarily mean that they have SARS-CoV-2. The opposite also happens: patients can test negative in these coronavirus tests but positive for SARS-CoV-2.*

In [None]:
raw_covid.shape

## Nulls evaluation

We check the top columns in terms of null values

In [None]:
missing = (raw_covid.isnull().sum()/raw_covid.shape[0]*100).sort_values(ascending=False).reset_index()
missing.columns = ['VARIABLE', 'PERC_MISSING']
missing.head(10)

In [None]:
missing['QUANTILE_MISSING'] = pd.cut(missing['PERC_MISSING'], 10)

In [None]:
sns.countplot(y='QUANTILE_MISSING', data=missing)

In [None]:
msno.matrix(raw_covid, sparkline=False)
plt.xticks(rotation='vertical')

**Conclusions:**

There are a lot of null entries in most of the columns. It seems that most null entries exist because the patient did not take that specific test (e.g. platelets, red blood cells, etc.).

Therefore, we will not impute the nulls with the mean, median or mode, to avoid introducing bias and false data. Instead we will:
- Drop columns that have 100% null entries (Mycoplasma pneumonia, Urine - Sugar, Partial thromboplastin time (PTT), Prothrombin time (PT), Activity and D-Dimer columns). Since they are entirely null, they have no predictive power
- For the remaining columns, impute values that indicate the test was not done, guaranteeing that they will not mix with entries that hold real data. This does not necessarily mean that we will use these columns in our final model; that decision belongs to the model building / feature selection phase
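The two rules above can be sketched on a toy frame. The column names and values are made up for illustration; the sentinels `-9` and `'not_done'` are the ones used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical columns): standardized lab value, categorical test, empty column
toy = pd.DataFrame({
    'hematocrit': [0.24, np.nan, -1.57],
    'influenza_a': ['detected', np.nan, 'not_detected'],
    'd_dimer': [np.nan, np.nan, np.nan],
})

# Columns that are 100% null carry no information, so drop them
toy = toy.dropna(how='all', axis='columns')

# Flag untested entries with sentinels instead of the mean/median/mode
numeric_cols = toy.select_dtypes(include='number').columns
categorical_cols = toy.select_dtypes(exclude='number').columns
toy[numeric_cols] = toy[numeric_cols].fillna(-9)
toy[categorical_cols] = toy[categorical_cols].fillna('not_done')
print(toy)
```

Since the numeric features are standardized, a value of -9 sits far outside the range of real measurements, so tree-based models can separate "not tested" from genuine results.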

## Data types

In [None]:
raw_covid.dtypes

**Conclusions:**

Comparing each column's dtype with the data sample, we found that a couple of columns have a different dtype than they should:
- Urine - pH: because some entries are stated as "Não Realizado" ("not performed"), this column is read as an object type when it should be float
- Urine - Leukocytes: 9 entries have a value of "<1000", so this column is read as an object type when it should be float

We will correct these columns' dtypes and entries later in this notebook.
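One way to locate such entries, sketched on a made-up column: parse the values with `pd.to_numeric(..., errors='coerce')` and look at what fails to parse. The sample values below are assumptions; only the "Não Realizado" marker comes from the dataset.

```python
import pandas as pd

# Toy column (assumption): numeric readings mixed with a text marker, as in 'Urine - pH'
urine_ph = pd.Series(['5.0', '6.5', 'Não Realizado', None, '7.0'], name='Urine - pH')

# Values that fail numeric parsing become NaN; entries that were not null
# before parsing but are NaN afterwards are the ones forcing the object dtype
parsed = pd.to_numeric(urine_ph, errors='coerce')
offending = urine_ph[parsed.isna() & urine_ph.notna()]
print(offending.value_counts())
```

Running the same check on every object column gives a quick list of columns that are numeric in nature but polluted by text markers.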

##  Data Statistics

### Numerical columns 

In [None]:
numerical_columns = raw_covid.select_dtypes(exclude='object').columns
print('There are', len(numerical_columns), 'numerical columns in the dataset')

raw_covid[numerical_columns].describe()

**Conclusion**
    
Since the features are all standardized, it's hard to tell whether there are outliers in this table. In theory, we should flag as outliers the data points whose test result falls outside a feasible range (for instance, a hematocrit above 100%), but since we don't have the original values we will assume that this data treatment has been performed previously.

### Categorical Data 

In [None]:
categorical_columns = raw_covid.select_dtypes(include='object').columns
print('There are', len(categorical_columns), 'categorical columns in the dataset')

raw_covid[categorical_columns].describe()

In [None]:
plt.figure(figsize = (12, 5))
sns.countplot(x = 'SARS-Cov-2 exam result', data = raw_covid)

In [None]:
raw_covid['SARS-Cov-2 exam result'].value_counts(normalize = True)*100

**Conclusion:**
    
We have a really imbalanced dataset: 90.1% of the cases have negative results while only 9.9% are positive.
This means that we will probably have to use balancing methods to obtain better model performance.

### Number of unique entries for each column 

In [None]:
for col in [c for c in raw_covid.columns if 'Patient ID' not in c]:
    print(col, ':', raw_covid[col].nunique())

## Correlation analysis 

In [None]:
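# Restrict to numeric columns: recent pandas versions raise an error on object columns
corr = raw_covid.select_dtypes(include='number').corr()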
plt.figure(figsize=(20, 8))

# Heatmap of correlations
sns.heatmap(corr, cmap=plt.cm.RdYlBu_r, vmin=-0.25, annot=False, vmax=0.8)

In [None]:
# melt the matrix columns to get pair-wise correlation table
corr = pd.melt(corr.reset_index(), id_vars='index', value_name='corr')
    
# remove rows with the same variable
corr = corr[corr['index'] != corr['variable']]
    
# drop duplicates on pairwise columns
corr = corr.loc[
    pd.DataFrame(
        np.sort(corr[['index', 'variable']], 1), index=corr.index
    ).drop_duplicates(keep='first').index
]

# sort values by absolute correlation values
corr['corr'] = corr['corr'].abs()
corr.sort_values(by='corr', ascending=False, inplace=True)

# show top n rows
corr.head(60)

In [None]:
# list of columns to exclude from the correlation analysis
exclude_columns = [
    'Patient ID', 
    'SARS-Cov-2 exam result',
    'Patient addmited to regular ward (1=yes, 0=no)',
    'Patient addmited to semi-intensive unit (1=yes, 0=no)',
    'Patient addmited to intensive care unit (1=yes, 0=no)',
    
    # raises an error because of the number of null values
    'Urine - Esterase', 'Urine - Aspect', 'Urine - pH','Urine - Hemoglobin',
    'Urine - Bile pigments', 'Urine - Ketone Bodies', 'Urine - Nitrite', 
    'Urine - Density', 'Urine - Urobilinogen', 'Urine - Protein', 'Urine - Sugar',
    'Urine - Leukocytes', 'Urine - Crystals', 'Urine - Red blood cells', 
    'Urine - Hyaline cylinders', 'Urine - Granular cylinders', 'Urine - Yeasts',
    'Urine - Color'
]

cols = [c for c in categorical_columns if c in raw_covid.columns and c not in exclude_columns]

# calculate the correlation for each pair of variables
corr = list()
for c1, c2 in list(itertools.combinations(cols, 2)):
    corr.append([c1, c2, cramers_v(raw_covid[c1].values, raw_covid[c2].values)])

# show the table of correlations    
corr = pd.DataFrame(data=corr, columns=['col1', 'col2', 'corr'])
corr.sort_values(by='corr', ascending=False)

**Conclusion:**
    
We can see that there are several features that are highly correlated with each other (perfect positive or negative correlations), meaning that we will probably keep just one variable from each correlated pair. We will not apply the same process to the categorical features, given that they have lower correlation scores. Further in this notebook we will add criteria to do so.
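A simplified sketch of "keep one variable per correlated pair" is shown below. It is not the rule applied later in this notebook, which also weighs how many pairs a feature appears in and how well filled it is; here the second member of each pair above the threshold is simply dropped. The toy columns are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
base = rng.normal(size=200)
toy = pd.DataFrame({
    'hematocrit': base,
    'hemoglobin': base + rng.normal(scale=0.05, size=200),  # nearly duplicates hematocrit
    'platelets': rng.normal(size=200),
})

# Absolute correlations, keeping only the upper triangle so each pair appears once
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the second member of every pair above the threshold
threshold = 0.9
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop)
```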

## Outliers 

We initially analyzed outliers through univariate analysis (boxplots and barplots), but we could not really tell whether the values flagged as outliers should be considered so, for 2 reasons:

- A lot of features have null entries. Only a small number of people took each test, so we cannot tell what would happen if more people were tested (i.e. would the outliers remain outliers?)
- We thought about looking up standard reference values for each test in the dataset to check whether the entries could be considered outliers. However, the data was standardized to zero mean and unit standard deviation, making it impossible to recover the original values

For these reasons, we will leave the dataset as is and not treat potential outliers.

---

# Data Preprocessing

In [None]:
# columns to exclude from the process
exclude_columns = [
    'Patient ID',
    'SARS-Cov-2 exam result',
    'Patient addmited to regular ward (1=yes, 0=no)',
    'Patient addmited to semi-intensive unit (1=yes, 0=no)',
    'Patient addmited to intensive care unit (1=yes, 0=no)'
]

## Create copy of the data

This step keeps the raw data untouched. If we need to go back to the original data, we do not need to import it again.

In [None]:
cleaned_covid = raw_covid.copy()

## Treat dtypes

In this section we will fix the data-entry errors discovered during the EDA, as well as the dtypes of the 2 affected columns:
- Urine - pH
- Urine - Leukocytes

In [None]:
# Urine - pH
cleaned_covid.loc[cleaned_covid['Urine - pH'] == 'Não Realizado', 'Urine - pH'] = '-9'
cleaned_covid.loc[cleaned_covid['Urine - pH'] == 'not_done', 'Urine - pH'] = '-9'
cleaned_covid['Urine - pH'] = cleaned_covid['Urine - pH'].astype(float)  # np.float was removed from NumPy

# Urine - Leukocytes
# We will impute the <1000 entries with the midpoint 500. Note that there is no entry with this specific value
cleaned_covid.loc[cleaned_covid['Urine - Leukocytes'] == '<1000', 'Urine - Leukocytes'] = '500'
cleaned_covid.loc[cleaned_covid['Urine - Leukocytes']=='not_done', 'Urine - Leukocytes'] = '-9'
cleaned_covid['Urine - Leukocytes'] = cleaned_covid['Urine - Leukocytes'].astype(float)

# Urine - Crystals
rep = {'á': 'a', '-': 'Minus', '+': 'Plus'}
for key, value in rep.items():
    # regex=False: '+' must be treated as a literal character, not as a regex quantifier
    cleaned_covid['Urine - Crystals'] = cleaned_covid['Urine - Crystals'].str.replace(key, value, regex=False)

### Selection of Patients with/without tests

In this section we will split the dataframe into 2 groups:

- Dataframe containing patients without any test result
- Dataframe containing patients with at least one test result

In [None]:
# List of columns related to tests
tests_columns = [c for c in cleaned_covid.columns if c not in exclude_columns and c not in ['Patient age quantile']]

# Dataframe of patients without any test result
cleaned_covid_no_test = cleaned_covid[cleaned_covid[tests_columns].isnull().all(axis=1)].copy()
cleaned_covid_no_test.dropna(how='all', axis='columns', inplace=True)

# Dataframe of patients with test results
cleaned_covid_test = cleaned_covid[~(cleaned_covid['Patient ID'].isin(cleaned_covid_no_test['Patient ID']))].copy()

In [None]:
cleaned_covid_test['SARS-Cov-2 exam result'].value_counts()

## Drop all null columns 

In [None]:
cleaned_covid.dropna(how='all', axis=1, inplace=True)

### Drop columns with lots of nulls

Selecting the "allowed percentage of nulls" is not an easy task, and there is no magic number for it. Normally we would drop columns with 25%-30% of missing values, but that depends on how much predictive power the variable has; depending on that, even columns with 50%-70% of nulls could be kept. Above those values, the standard practice would be to ask the business area to collect the information. In our case that is not possible, and it is hard to tell whether each variable has great predictive power, because that could depend on a combination of factors and tests. We could either perform a rigorous series of analyses to understand how a set of test combinations influences a test result and verify the data-driven insights with a specialist, or take a simpler route. For simplicity, we will drop only the sparsest variables (those filled in less than a small fraction of rows for both tasks), since they could result in overfitting.

In [None]:
# plot the amount of columns dropped by the fill percentage required
task1_ratio = cleaned_covid_test.count() / cleaned_covid_test.shape[0]
task2_ratio = cleaned_covid_test[cleaned_covid_test['SARS-Cov-2 exam result'] == 'positive']
task2_ratio = task2_ratio.count() / task2_ratio.shape[0]
ratio = pd.concat([task1_ratio, task2_ratio], axis='columns')
res = list()
for thrs in np.linspace(0.01, 0.3, 50):
    res.append([thrs, ratio[(ratio[0] < thrs) & (ratio[1] < thrs)].shape[0]])
fig = plt.figure(figsize=(16,4))
ax = fig.gca()
ax.set_xticks(np.linspace(0.01, 0.3, 20))
plt.scatter(np.array(res)[:, 0], np.array(res)[:, 1])
plt.grid()

In [None]:
# apply column removal
thrs = 0.05
task1_ratio = cleaned_covid_test.count() / cleaned_covid_test.shape[0]
task2_ratio = cleaned_covid_test[cleaned_covid_test['SARS-Cov-2 exam result'] == 'positive']
task2_ratio = task2_ratio.count() / task2_ratio.shape[0]
ratio = pd.concat([task1_ratio, task2_ratio], axis='columns')
cleaned_covid_test.drop(columns=ratio[(ratio[0] < thrs) & (ratio[1] < thrs)].index, inplace=True)

## Correlation analysis

In [None]:
corr = cleaned_covid.drop(columns=exclude_columns).select_dtypes(include='number').corr()
corr = pd.melt(corr.reset_index(), id_vars='index', value_name='corr')
corr = corr[corr['index'] != corr['variable']]
corr = corr.loc[pd.DataFrame(np.sort(corr[['index', 'variable']], 1), index=corr.index).drop_duplicates(keep='first').index]
corr['corr'] = corr['corr'].abs()
res = list()
for thrs in np.linspace(0.7, 1, 32):
    x = corr[corr['corr'] > thrs]
    drop_corr = list()
    while x.shape[0] > 0:
        for c in pd.concat([x['index'], x['variable']]).unique():  # Series.append was removed in pandas 2.0
            v1 = x[(x['variable'] == c) | (x['index'] == c)]
            
            if v1.shape[0] > 0:
                c2 = v1['index'].values[0] if v1['index'].values[0] != c else v1['variable'].values[0]
                v2 = x[(x['variable'] == c2) | (x['index'] == c2)]
                
                drop = c if (v1.shape[0] > v2.shape[0]) and (cleaned_covid[c].count() < cleaned_covid[c2].count()) else c2
                drop_corr.append(drop)
                x = x[(x['index'] != drop)& (x['variable'] != drop)]
    res.append((thrs, cleaned_covid.shape[1] - len(drop_corr)))
plt.plot(np.array(res)[:, 0], np.array(res)[:, 1])

In [None]:
# establish correlation threshold to drop variables
thrs = 0.9

# calculate correlation matrix
corr = cleaned_covid_test.drop(columns=exclude_columns).select_dtypes(include='number').corr()

# transform columns into rows
corr = pd.melt(corr.reset_index(), id_vars='index', value_name='corr')

# remove correlation of same variable
corr = corr[corr['index'] != corr['variable']]
corr['corr'] = corr['corr'].abs()

# drop pair-wise duplicates
corr = corr.loc[
    pd.DataFrame(
        np.sort(corr[['index', 'variable']], 1), index=corr.index
    ).drop_duplicates(keep='first').index
]

# select pairs above threshold
x = corr[corr['corr'] > thrs].sort_values('corr', ascending=False)
drop_corr = list()

# for each pair
while x.shape[0] > 0:
    for c in pd.concat([x['index'], x['variable']]).unique():
        # verify number of occurrences of a feature in the table
        v1 = x[(x['variable'] == c) | (x['index'] == c)]
        
        if v1.shape[0] > 0:
            c2 = v1['index'].values[0] if v1['index'].values[0] != c else v1['variable'].values[0]
            v2 = x[(x['variable'] == c2) | (x['index'] == c2)]

            # if the first feature occurs in more pairs than the second, it can be explained by
            # multiple variables; when it also has fewer filled values, we drop it and keep the
            # second one, which adds more variance to the dataset. Otherwise we drop the second
            drop = c if (v1.shape[0] > v2.shape[0]) and (cleaned_covid[c].count() < cleaned_covid[c2].count()) else c2
            drop_corr.append(drop)
            x = x[(x['index'] != drop)& (x['variable'] != drop)]

# drop the selected columns
cleaned_covid_test.drop(drop_corr, axis=1, inplace=True)

In [None]:
drop_corr

## Impute missing values

In [None]:
# List of columns with null entries
null_columns = [c for c in cleaned_covid_test.columns if cleaned_covid_test[c].isnull().sum() != 0]

# List of columns to be filled with -9
columns_num_fill = [c for c in null_columns if c in cleaned_covid_test.select_dtypes(exclude='object').columns]

# List of columns to be filled with "not_done"
columns_cat_fill = [c for c in null_columns if c in cleaned_covid_test.select_dtypes(include='object').columns]

# Impute missing values (assign back instead of relying on a chained inplace fillna)
for col in columns_num_fill:
    cleaned_covid_test[col] = cleaned_covid_test[col].fillna(-9)
    
for col in columns_cat_fill:
    cleaned_covid_test[col] = cleaned_covid_test[col].fillna('not_done')

## Treat categorical columns

In [None]:
# SARS-Cov-2 exam result
cleaned_covid_no_test.loc[cleaned_covid_no_test['SARS-Cov-2 exam result']=='negative', 'SARS-Cov-2 exam result'] = 0
cleaned_covid_no_test.loc[cleaned_covid_no_test['SARS-Cov-2 exam result']=='positive', 'SARS-Cov-2 exam result'] = 1
cleaned_covid_no_test['SARS-Cov-2 exam result'] = cleaned_covid_no_test['SARS-Cov-2 exam result'].astype(int)

cleaned_covid_test.loc[cleaned_covid_test['SARS-Cov-2 exam result']=='negative', 'SARS-Cov-2 exam result'] = 0
cleaned_covid_test.loc[cleaned_covid_test['SARS-Cov-2 exam result']=='positive', 'SARS-Cov-2 exam result'] = 1
cleaned_covid_test['SARS-Cov-2 exam result'] = cleaned_covid_test['SARS-Cov-2 exam result'].astype(int)

# One hot encode categorical columns
categorical = [c for c in cleaned_covid_test.select_dtypes(include='object').columns if 'Patient ID' not in c]
encoded_covid_test = pd.get_dummies(cleaned_covid_test[categorical], drop_first=True)
cleaned_covid_test.drop(categorical, axis=1, inplace=True)
cleaned_covid_test = pd.concat([cleaned_covid_test, encoded_covid_test], axis=1)

## Feature and target split

### Task 1: Predict confirmed COVID-19 cases among suspected cases

In [None]:
# Create features and target for dataframe with no tests
features_no_test_task1 = cleaned_covid_no_test.drop(exclude_columns, axis = 1)
target_no_test_task1 = cleaned_covid_no_test['SARS-Cov-2 exam result']

# Create features and target for dataframe with tests
features_covid_task1 = cleaned_covid_test.drop(exclude_columns, axis = 1)
target_covid_task1 = cleaned_covid_test['SARS-Cov-2 exam result']

### Task 2: Predict admission to general ward, semi-intensive unit or intensive care unit

In [None]:
units = cleaned_covid_test[cleaned_covid_test['SARS-Cov-2 exam result'] == 1]
features_covid_task2 = units.drop(exclude_columns, axis = 1)

target_cols = [
    'Patient addmited to regular ward (1=yes, 0=no)', 'Patient addmited to semi-intensive unit (1=yes, 0=no)', 
    'Patient addmited to intensive care unit (1=yes, 0=no)'
]
target_covid_task2 = units[target_cols[0]] + units[target_cols[1]] * 2 + units[target_cols[2]] * 3

## Feature Importance 

In [None]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from graphviz import Source
from IPython.display import SVG
dt = Pipeline([('smt', SMOTE(sampling_strategy=0.5, random_state=rs)), ('dt', DecisionTreeClassifier(random_state=rs, max_depth=3))])
dt.fit(features_covid_task1, target_covid_task1)

In [None]:
dt_feat = pd.DataFrame(dt.named_steps['dt'].feature_importances_, index=features_covid_task1.columns, columns=['feat_importance'])
dt_feat.sort_values('feat_importance').tail(8).plot.barh()
plt.show()

In [None]:
graph = Source(export_graphviz(dt.named_steps['dt'], out_file=None, feature_names=features_covid_task1.columns, filled=True))
SVG(graph.pipe(format='svg'))

In [None]:
graph.render()

- A lot of features do not add any predictive power, indicating that tree algorithms will be highly dependent on very few features
- We did not apply any feature selection methodology, since we assume that all medical information could be relevant. Going forward, the permutation importance results should be used to decide which columns to keep


<hr>

# Task 1: Predict confirmed COVID-19 cases among suspected cases

## Model Building 

While working on this problem, one question we faced was how to perform the train-test split properly (i.e. how to obtain a test set that reflects the people who will come to the hospital in the future).

Since we have no way to answer this question (i.e. to know whether the people who will go to the hospital reflect the overall population or are a biased group, e.g. more worried about their health than usual), we considered several approaches to determine the real performance of our final model:

1) Run a single randomized train-test split, build a model and validate its performance on that one test set: as mentioned above, this could produce a test set that does not reflect the hospital's future patients, so the performance estimate would be biased

2) Run several train-test splits and build a different model for each split: this would be too time-consuming, since the number of models equals the number of splits (e.g. 50 train-test splits means 50 model building pipelines)

3) Run a stratified k-fold cross-validation on the entire dataset and choose the model (i.e. algorithm and parameters) with the best average performance. Then create several train-test splits and assess the average performance of that model across the different splits

We decided to go forward with approach number 3, since it is both fast and reliable.
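Approach 3 can be sketched in two steps on synthetic data. Everything below is illustrative: the data, the tiny parameter grid and the use of `StratifiedShuffleSplit` for the repeated splits are assumptions, and the numbers of iterations are kept small so the example runs quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedShuffleSplit

X, y = make_classification(n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=42)

# Step 1: a stratified k-fold search over the whole dataset picks the parameters
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': [20, 50], 'max_depth': [3, 5, 8]},
    n_iter=4, cv=5, scoring='recall', random_state=42, refit=False,
)
search.fit(X, y)

# Step 2: refit the chosen configuration on many stratified splits and average the score
splitter = StratifiedShuffleSplit(n_splits=20, test_size=0.2, random_state=42)
recalls = []
for train_idx, test_idx in splitter.split(X, y):
    model = RandomForestClassifier(random_state=42, **search.best_params_)
    model.fit(X[train_idx], y[train_idx])
    recalls.append(recall_score(y[test_idx], model.predict(X[test_idx])))

print(search.best_params_, np.mean(recalls))
```

Only one search is run, which keeps the cost low, while the averaged score over many splits is less dependent on any single lucky or unlucky test set.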

### Model instantiation

In [None]:
# Define param grid
smt_grid = {
    'smt__sampling_strategy': [0.2, 0.3, 0.4, 0.5],
    'smt__k_neighbors' : [2, 3, 5],
    'smt__random_state': [rs]
}

rf_grid = {
    'rf__n_estimators' : [int(x) for x in np.linspace(100, 2000, 20)],
    'rf__max_features' : ['sqrt', 'log2'], # 'auto' was an alias of 'sqrt' for classifiers and is rejected by recent scikit-learn versions
    'rf__min_samples_leaf': [2, 5, 10], # we set the minimum of samples leaf to minimize overfit 
    'rf__min_samples_split': [5, 10, 15], # we set the minimum of samples split to minimize overfit
    'rf__max_depth': [5, 8, 15], # we limited the max depth to 15 given the risk of overfit
    'rf__random_state' : [rs]
}
rf_grid.update(smt_grid)

# XGBClassifier does not use 'loss', 'max_features', 'min_samples_leaf' or 'min_samples_split'
# (those belong to scikit-learn's GradientBoostingClassifier), so the grid only tunes XGBoost parameters
xgb_grid = {
    'xgb__n_estimators': [int(x) for x in np.linspace(100, 2000, 20)],
    'xgb__learning_rate': np.linspace(0.01, 0.1, 10),
    'xgb__max_depth': range(3, 10), # we limited the max depth given the risk of overfit
    'xgb__subsample': [0.8, 0.85, 0.9, 0.95, 1],
    'xgb__random_state': [rs]
}
xgb_grid.update(smt_grid)

lgbm_grid ={
    'lgbm__num_leaves': stats.randint(6, 50), 
    'lgbm__min_child_samples': stats.randint(100, 500), 
    'lgbm__min_child_weight': [1e-5, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4],
    'lgbm__subsample': stats.uniform(loc=0.2, scale=0.8), 
    'lgbm__colsample_bytree': stats.uniform(loc=0.4, scale=0.6),
    'lgbm__reg_alpha': [0, 1e-1, 1, 2, 5, 7, 10, 50, 100],
    'lgbm__reg_lambda': [0, 1e-1, 1, 5, 10, 20, 50, 100],
    'lgbm__random_state': [rs]
}
lgbm_grid.update(smt_grid)

# list models to run
# NOTE: the LGBM model is much faster to run; to iterate quickly, comment out the RF and XGB entries,
# but keep all of them when preparing a model for deployment
models_task1 = [
    ('RF', 'rf', RandomForestClassifier, rf_grid),
    ('XGB', 'xgb', XGBClassifier, xgb_grid),
    ('LGBM', 'lgbm', LGBMClassifier, lgbm_grid),
]

# dictionary of results
results_task1 = dict()

# set the metric to select the best model
MODEL_SELECTION_METRIC_TASK1 = 'recall'

### Model Tuning

Note: when the estimator inside RandomizedSearchCV is a classifier, passing an integer as `cv` defaults to stratified k-fold cross-validation. See the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) for more detail.
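This default can be checked directly with `check_cv`, the scikit-learn helper that turns an integer `cv` into a splitter. The target below is a made-up 90/10 vector that only mimics the class imbalance of the exam result.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, check_cv

# Imbalanced binary target (assumption: 90 negatives, 10 positives)
y = np.array([0] * 90 + [1] * 10)

# With classifier=True and a binary target, an integer cv resolves to StratifiedKFold
splitter = check_cv(cv=10, y=y, classifier=True)
print(type(splitter).__name__)

# Every test fold keeps the 9:1 class ratio
X = np.zeros((len(y), 1))
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]
print(positives_per_fold)
```

With a plain, unstratified k-fold on sorted data like this, some folds would contain no positive case at all, and recall on those folds would be undefined.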

In [None]:
# go through the models
for m, n, c, g in models_task1:
    print(m)

    # instantiate pipeline
    p = Pipeline([('smt', SMOTE(random_state=rs)), (n, c())])

    # run a grid search
    results_task1[m] = RandomizedSearchCV(
        p, g, cv=10, scoring=MODEL_SELECTION_METRIC_TASK1, verbose=1, n_jobs=-1, n_iter=300, random_state=rs, refit=False
    )
    results_task1[m].fit(features_covid_task1.values, target_covid_task1.values)

    # print out the model performance
    print('Best %s %s score:' % (m, MODEL_SELECTION_METRIC_TASK1), results_task1[m].best_score_)
    print('\n')

### Model real performance evaluation

In [None]:
print('-' * 100)

# select the model with the highest score
models = [results_task1[m] for m in results_task1]
scores = [results_task1[m].best_score_ for m in results_task1]
i = np.argmax(scores)
best_model_task1 = models[i]

avg = mean_performance(
    features_covid_task1.values, target_covid_task1.values, 500, 
    model=best_model_task1, percentage_test=0.1,
)
for k in avg:
    print(k + ':', '%.2f' % avg[k])
print('-' * 100)
print('\n')
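
The `mean_performance` helper used above is defined elsewhere in the notebook. As a rough illustration of the idea (averaging a metric over many random train/test splits), the sketch below uses a hypothetical function `mean_recall_over_splits`, synthetic data and a logistic regression. Using stratified splits is an assumption here, not necessarily what the notebook's helper does.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedShuffleSplit


def mean_recall_over_splits(model, X, y, n_splits=20, test_size=0.1, seed=0):
    """Refit the model on each random split and average the test recall."""
    splitter = StratifiedShuffleSplit(
        n_splits=n_splits, test_size=test_size, random_state=seed
    )
    scores = []
    for train_idx, test_idx in splitter.split(X, y):
        # clone so that every split starts from an unfitted copy
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        scores.append(recall_score(y[test_idx], fitted.predict(X[test_idx])))
    return float(np.mean(scores)), float(np.std(scores))


# hypothetical data standing in for the task 1 features and target
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
mean_recall, std_recall = mean_recall_over_splits(
    LogisticRegression(max_iter=1000), X_demo, y_demo
)
print('recall: %.2f +/- %.2f' % (mean_recall, std_recall))
```

Reporting the spread alongside the mean shows how much the estimate depends on the particular split, which is the reason for repeating the split many times on a small dataset.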

### Model interpretability

To interpret the model, we rely only on global interpretability (for local interpretability, one can use libraries such as LIME). More specifically, we use permutation feature importance: each feature is shuffled in turn, and the resulting drop in the model score is taken as that feature's importance.
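
The notebook's own `permutated_feature_importance` function is defined elsewhere. The snippet below is only a minimal sketch of the general technique, using a hypothetical helper `permutation_importance_sketch`, toy data and a decision tree; it scores on the training data for brevity, whereas held-out data is preferable in practice.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.tree import DecisionTreeClassifier


def permutation_importance_sketch(model, X, y, n_repeats=50, seed=0):
    """Mean drop in recall when each column is shuffled (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    baseline = recall_score(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # shuffling column j breaks its link with the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - recall_score(y, model.predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances


# toy data: column 0 fully determines the label, column 1 is pure noise
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(400, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
importances = permutation_importance_sketch(tree, X_toy, y_toy)
print(importances)
```

The informative column shows a large drop in recall, while the noise column, which the tree never splits on, shows none.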

In [None]:
interp_task1 = permutated_feature_importance(
    features_covid_task1.values, target_covid_task1, features_covid_task1.columns, 50, best_model_task1, score = MODEL_SELECTION_METRIC_TASK1
) 

In [None]:
sns.barplot(x='AVG_IMPORTANCE_50', y='FEATURE', data=interp_task1.head(10))

**Modeling conclusions:**

- We initially created one single model for the entire dataset (for patients both with and without test result information). However, that model performed poorly across different algorithms, feature selection and class balancing techniques. We did not show the results for the global model in this final version of the notebook
- When splitting the dataset between patients with and without test information, our model for the dataframe with test information performed better (confirming our initial hypothesis). However, the patients without test information have only one single column (Patient age quantile) and, therefore, also resulted in poorly performing models (not shown in this notebook)

**Feature importance conclusions:**

- Looking at the feature importance table generated above, we can conclude that, in this dataset, the most important features for our global model were: (1) Patient age quantile, (2) Leukocytes and (3) Platelets
- We are not familiar with the health industry and its terminology, so we cannot conclude whether the test-related features make sense in helping to predict SARS-CoV-2. However, having the patient age quantile as the most important feature makes sense **when thinking only in modeling terms**, since it is the column that has no null entries and no imputed values.
- It is also possible to encounter features with negative feature importance. In these cases, the predictions on the shuffled data happened to be more accurate than those on the real data. This happens when the feature does not matter, or when random chance makes the predictions on the shuffled data more accurate. Again, **we cannot be sure if this makes sense due to our lack of knowledge in the health field**

**General conclusions:**

- Dealing with 2 models (1 for the dataframe with test information and 1 for the one without) leads to better results, but it also means that the hospital needs to test patients before getting any prediction output, which may not be ideal since it means higher costs and more demand for tests overall
- Applying one single model to the entire dataset means that we do not necessarily need to run any prior tests on patients. However, the model performance is worse: we have lower recall (i.e. more false negatives: telling people they do not have covid when in fact they do) and a lower f1 score (the harmonic mean of precision and recall)
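
To make the metrics mentioned above concrete, here is a small worked example. The confusion-matrix counts are hypothetical and are not results from this notebook.

```python
def recall_precision_f1(tp, fp, fn):
    """Return (recall, precision, f1) from confusion-matrix counts."""
    recall = tp / (tp + fn)        # share of infected patients that were caught
    precision = tp / (tp + fp)     # share of positive predictions that were right
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return recall, precision, f1


# hypothetical counts: 40 infected patients caught, 10 missed, 20 false alarms
recall, precision, f1 = recall_precision_f1(tp=40, fp=20, fn=10)
print('recall=%.2f precision=%.2f f1=%.2f' % (recall, precision, f1))
```

Each missed infected patient (a false negative) lowers recall directly, which is why recall was chosen as the model selection metric for task 1.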

---

# Task 2: Predict admission to general ward, semi-intensive unit or intensive care unit

## Model Building 

For task 2 we will follow a methodology similar to task 1, but we are going to treat the problem as a multiclass classification and use both a OneVsRest wrapper and the model itself
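
The difference between the two approaches can be sketched on synthetic data (the three classes below are a hypothetical stand-in for the three admission units). The sketch also shows why the parameter grid needs different key names for the wrapped model; that `get_est_grid` performs this renaming is an assumption, since its definition is not shown in this section.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

# hypothetical 3-class data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# direct multiclass: a single forest predicts all three classes
direct = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# one-vs-rest: one binary forest is fitted per class
ova = OneVsRestClassifier(
    RandomForestClassifier(n_estimators=50, random_state=0)
).fit(X_demo, y_demo)

print(len(ova.estimators_))
# hyperparameters of the wrapped model gain an 'estimator__' prefix
print('estimator__n_estimators' in ova.get_params())
```

Inside a pipeline step named `rf`, the same hyperparameter would therefore be addressed as `rf__estimator__n_estimators` rather than `rf__n_estimators`.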

### Model instantiation 

In [None]:
# list models to run
# NOTE: the LGBM model is much faster to run than RF and XGB; to shorten the runtime, the RF and XGB entries
# can be commented out, but all models should be run when thinking of a model deployment
models_task2 = [
    ('OVA_RF', 'rf', lambda: OneVsRestClassifier(RandomForestClassifier()), get_est_grid(rf_grid, 'rf')),
    ('RF', 'rf', RandomForestClassifier, rf_grid),
    ('OVA_XGB', 'xgb', lambda: OneVsRestClassifier(XGBClassifier()), get_est_grid(xgb_grid, 'xgb')),
    ('XGB', 'xgb', XGBClassifier, xgb_grid),
    ('OVA_LGBM', 'lgbm', lambda: OneVsRestClassifier(LGBMClassifier()), get_est_grid(lgbm_grid, 'lgbm')),
    ('LGBM', 'lgbm', LGBMClassifier, lgbm_grid)
]

# dictionary of results
results_task2 = dict()

# set the metric to select the best model
MODEL_SELECTION_METRIC_TASK2 = 'recall_weighted'

### Model Tuning

Note: when the estimator inside RandomizedSearchCV is a classifier and `cv` is an integer, the default cross-validation method is stratified k-fold. See the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) for more detail

In [None]:
# go through the models
for m, n, c, g in models_task2:
    print(m)

    # instantiate pipeline
    p = Pipeline([('smt', SMOTE(random_state=rs)), (n, c())])

    # we change the sampling strategy based on the fact that we have a multi-class classification problem
    g['smt__sampling_strategy'] = ['not majority']
    
    # we also limit the number of knn given that we may not have enough sample depending on the random split
    g['smt__k_neighbors'] = [2]
    
    # run a grid search
    # NOTE: we reduced the number of iterations just to deploy the notebook faster. But looking forward this number should be 200~300
    results_task2[m] = RandomizedSearchCV(
        p, g, cv=10, scoring=MODEL_SELECTION_METRIC_TASK2, verbose=1, n_jobs=-1, n_iter=300, random_state=rs, refit=False
    )
    results_task2[m].fit(features_covid_task2.values, target_covid_task2.values)

    # print out the model performance
    print('Best %s %s score:' % (m, MODEL_SELECTION_METRIC_TASK2), results_task2[m].best_score_)
    print('\n')

### Model real performance evaluation

In [None]:
print('-' * 100)

# select the model with the highest score
models = [results_task2[m] for m in results_task2]
scores = [results_task2[m].best_score_ for m in results_task2]
i = np.argmax(scores)
best_model_task2 = models[i]

avg = mean_performance(
    features_covid_task2.values, target_covid_task2.values, 300, 
    model=best_model_task2, percentage_test=0.1, labels=list(target_covid_task2.unique())
)
for k in avg:
    print(k + ':', '%.2f' % avg[k])
print('-' * 100)
print('\n')

### Model interpretability

In [None]:
interp_task2 = permutated_feature_importance(
    features_covid_task2.values, target_covid_task2, features_covid_task2.columns, 50, best_model_task2, score = MODEL_SELECTION_METRIC_TASK2
) 

In [None]:
sns.barplot(x='AVG_IMPORTANCE_50', y='FEATURE', data=interp_task2.head(10))

**Modeling conclusions:**

- We decided to approach this problem using 2 methods: one-vs-all and multiclass modeling.
- The one-vs-all model seems to have a better performance than the standard multiclass

**Feature importance conclusions:**

- Looking at the permutation feature importance table generated above, we can see that the top 3 features were: (1) Proteina C reativa mg/dL, (2) Neutrophils and (3) Mean corpuscular hemoglobin concentration (MCHC). However, **we are not familiar with the health industry and its terminology, so we cannot conclude whether the test-related features make sense in helping to predict admissions to the general ward, semi-intensive unit and intensive care unit**
- We can also notice that the columns do not have a high permutation feature importance on average, meaning that no column stands out as the "most important" one. Basically, due to the small gap in average feature importance across several features, each feature makes a small contribution to the final prediction
- It is also possible to encounter features with negative feature importance, for the same reasons discussed in the task 1 conclusions


---

# Model Deployment 

For the model deployment we will select the best model for each task and train it on the whole dataset. To avoid overfitting, we will wrap the best model in a bagging classifier that draws a given % of samples from the dataset for each estimator

In [None]:
n_estimators = 400
max_samples = 0.9

# instantiate the bagging classifier
model1 = best_model_task1.estimator.set_params(**best_model_task1.best_params_)
bag1 = BaggingClassifier(model1, n_estimators=n_estimators, max_samples=max_samples)

model2 = best_model_task2.estimator.set_params(**best_model_task2.best_params_)
bag2 = BaggingClassifier(model2, n_estimators=n_estimators, max_samples=max_samples)

# train and export model
bag1.fit(features_covid_task1.values, target_covid_task1.values)
joblib.dump(bag1, 'model_task1.pkl', compress=9)

# train and export model
# NOTE: We had to comment out the second model because some classes have very few positive examples;
# because of that we cannot run the bag with SMOTE. We either have to change our sampler or increase the number of data points
# bag2.fit(features_covid_task2.values, target_covid_task2.values)
# joblib.dump(bag2, 'model_task2.pkl', compress=9)
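
Once a model has been exported, it can be reloaded with `joblib.load` and used for prediction. The sketch below demonstrates the round trip on a small hypothetical bagging model and a temporary file (`model_demo.pkl` is an illustrative name), not on the notebook's own `model_task1.pkl`.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# hypothetical data and a small bag, standing in for bag1
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
bag_demo = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=10, max_samples=0.9, random_state=0
).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'model_demo.pkl')
    joblib.dump(bag_demo, path, compress=9)  # same export call as above
    restored = joblib.load(path)             # reload the fitted model

# the restored model reproduces the original predictions
same_predictions = bool((restored.predict(X_demo) == bag_demo.predict(X_demo)).all())
print(same_predictions)
```

The loading environment should use the same library versions as the one that produced the file, since pickled scikit-learn models are not guaranteed to load across versions.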

---