# Week 6 - An introduction to machine learning (Part II) - Exercise and Solution

We'll apply some of the material from the previous lectures to recreating the analysis from a [Nature Machine Intelligence](https://www.nature.com/natmachintell/) paper, ["An interpretable mortality prediction model for COVID-19 patients"](https://www.nature.com/articles/s42256-020-0180-7).

## 0. Setup

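You will need to install the [xlrd](https://xlrd.readthedocs.io/en/latest/) package to complete the exercise. Note that xlrd 2.0 and later only read legacy `.xls` files, so with a recent pandas you may need the `openpyxl` package instead to read the `.xlsx` file used here.

To install this package, launch the "Anaconda Prompt (Anaconda3)" program and run:

`conda install -c anaconda xlrd`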

<img src="../img/az_conda_prompt.png">


### Training data

The original training datasets for the paper are linked as [Supplementary data](https://static-content.springer.com/esm/art%3A10.1038%2Fs42256-020-0180-7/MediaObjects/42256_2020_180_MOESM3_ESM.zip). You don't have to download this since we have included the single Excel file we need for this example as `data/time_series_375_preprocess_en.xlsx` in this project. Below we provide code to read the Excel data into a Pandas DataFrame.

In [5]:
import datetime
import pandas as pd

In [6]:
TRAIN_PATH = '../data/time_series_375_preprocess_en.xlsx'
RANDOM_SEED = 42

In [7]:
def load_training_data(path):
    """Load the Excel sheet of patient measurements (time series) into a
    pandas.DataFrame with MultiIndex ['PATIENT_ID', 'RE_DATE'] (the unique
    patient identifier and the patient sample date, corresponding to
    columns [0, 1] respectively of the loaded worksheet).

    For each patient, keep the most recent non-missing value of every
    measurement (GroupBy.last() skips NaNs column by column), then drop the
    'Admission time', 'Discharge time', 'gender' and 'age' features.
    Missing values are left as NaN here; they are filled outside this function.
    """

    # Specify explicitly what columns we want to load and what their data types are expected to be.
    DTYPES = {
        'PATIENT_ID': int,
        'RE_DATE': str,
        'age': int,
        'gender': int,
        'Admission time': str,
        'Discharge time': str,
        'outcome': float,
        'Hypersensitive cardiac troponinI': float,
        'hemoglobin': float,
        'Serum chloride': float,
        'Prothrombin time': float,
        'procalcitonin': float,
        'eosinophils(%)': float,
        'Interleukin 2 receptor': float,
        'Alkaline phosphatase': float,
        'albumin': float,
        'basophil(%)': float,
        'Interleukin 10': float,
        'Total bilirubin': float,
        'Platelet count': float,
        'monocytes(%)': float,
        'antithrombin': float,
        'Interleukin 8': float,
        'indirect bilirubin': float,
        'Red blood cell distribution width': float,
        'neutrophils(%)': float,
        'total protein': float,
        'Quantification of Treponema pallidum antibodies': float,
        'Prothrombin activity': float,
        'HBsAg': float,
        'mean corpuscular volume': float,
        'hematocrit': float,
        'White blood cell count': float,
        'Tumor necrosis factorα': float,
        'mean corpuscular hemoglobin concentration': float,
        'fibrinogen': float,
        'Interleukin 1β': float,
        'Urea': float,
        'lymphocyte count': float,
        'PH value': float,
        'Red blood cell count': float,
        'Eosinophil count': float,
        'Corrected calcium': float,
        'Serum potassium': float,
        'glucose': float,
        'neutrophils count': float,
        'Direct bilirubin': float,
        'Mean platelet volume': float,
        'ferritin': float,
        'RBC distribution width SD': float,
        'Thrombin time': float,
        '(%)lymphocyte': float,
        'HCV antibody quantification': float,
        'D-D dimer': float,
        'Total cholesterol': float,
        'aspartate aminotransferase': float,
        'Uric acid': float,
        'HCO3-': float,
        'calcium': float,
        'Amino-terminal brain natriuretic peptide precursor(NT-proBNP)': float,
        'Lactate dehydrogenase': float,
        'platelet large cell ratio ': float,
        'Interleukin 6': float,
        'Fibrin degradation products': float,
        'monocytes count': float,
        'PLT distribution width': float,
        'globulin': float,
        'γ-glutamyl transpeptidase': float,
        'International standard ratio': float,
        'basophil count(#)': float,
        '2019-nCoV nucleic acid detection': float,
        'mean corpuscular hemoglobin': float,
        'Activation of partial thromboplastin time': float,
        'High sensitivity C-reactive protein': float,
        'HIV antibody quantification': float,
        'serum sodium': float,
        'thrombocytocrit': float,
        'ESR': float,
        'glutamic-pyruvic transaminase': float,
        'eGFR': float,
        'creatinine': float
    }

    # Specify which string columns should be interpreted as datetimes.
    DATETIME_COLUMNS = ['RE_DATE', 'Admission time', 'Discharge time']
    
    return (
        pd.read_excel(path, index_col=[0,1], dtype=DTYPES, parse_dates=DATETIME_COLUMNS)
            .sort_index()
            .groupby('PATIENT_ID').last()
            .drop(['Admission time', 'Discharge time'], axis=1)
            .drop(['age', 'gender'], axis=1) # removed in later preprocessing step in original paper       
    )
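
The `groupby('PATIENT_ID').last()` step is easy to misread, so here is a minimal sketch on a made-up two-patient frame (the values are invented for illustration and are not taken from the dataset). It shows that `last()` returns, per column, the most recent non-missing value rather than the literal last row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two patients, two samples each.
toy = pd.DataFrame(
    {
        "PATIENT_ID": [1, 1, 2, 2],
        "RE_DATE": pd.to_datetime(
            ["2020-02-01", "2020-02-03", "2020-02-02", "2020-02-04"]
        ),
        "Lactate dehydrogenase": [300.0, np.nan, 250.0, 260.0],
        "albumin": [35.0, 36.0, np.nan, np.nan],
    }
).set_index(["PATIENT_ID", "RE_DATE"])

# Sort by patient and date, then take the last non-missing value per column.
latest = toy.sort_index().groupby("PATIENT_ID").last()
print(latest)
```

For patient 1 the latest sample has no `Lactate dehydrogenase` reading, so the earlier value is carried forward; for patient 2 `albumin` was never measured, so it stays `NaN`.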

In [8]:
def remove_columns_with_missing_data(df, threshold=0.2):
    """ Remove all columns from DataFrame df where the proportion of missing records is greater than threshold.
    """
    return df.dropna(axis=1, thresh=(1.0-threshold)*len(df))

In [9]:
data = load_training_data(path=TRAIN_PATH)
print(data.shape)
data.head()

(375, 75)


| PATIENT_ID | outcome | Hypersensitive cardiac troponinI | hemoglobin | Serum chloride | Prothrombin time | procalcitonin | eosinophils(%) | Interleukin 2 receptor | Alkaline phosphatase | albumin | ... | mean corpuscular hemoglobin | Activation of partial thromboplastin time | High sensitivity C-reactive protein | HIV antibody quantification | serum sodium | thrombocytocrit | ESR | glutamic-pyruvic transaminase | eGFR | creatinine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 19.9 | 131.0 | 100.0 | 12.4 | 0.09 | 1.7 | NaN | 71.0 | 37.6 | ... | 32.3 | 38.9 | 2.6 | 0.09 | 142.7 | 0.16 | 41.0 | 30.0 | 74.7 | 88.0 |
| 2 | 0.0 | 1.9 | 149.0 | 98.1 | 12.3 | 0.09 | 0.1 | 441.0 | 45.0 | 37.2 | ... | 32.2 | 36.0 | 27.4 | NaN | 137.4 | 0.27 | 40.0 | 22.0 | 94.6 | 74.0 |
| 3 | 0.0 | NaN | 126.0 | 102.2 | 13.6 | 0.06 | 0.1 | 591.0 | 69.0 | 38.4 | ... | 33.3 | 34.8 | 3.6 | 0.1 | 143.2 | 0.23 | 29.0 | 67.0 | 84.6 | 64.0 |
| 4 | 0.0 | 4.8 | 103.0 | 103.1 | 16.3 | 0.38 | 2.5 | NaN | 79.0 | 34.1 | ... | 39.2 | NaN | 14.5 | 0.11 | 144.2 | 0.27 | 72.0 | 26.0 | 74.2 | 88.0 |
| 5 | 0.0 | 5.6 | 130.0 | 102.2 | 14.6 | 0.02 | 3.0 | 258.0 | 84.0 | 39.5 | ... | 30.0 | NaN | 0.8 | 0.08 | 143.6 | 0.36 | 11.0 | 18.0 | 122.8 | 54.0 |


As in the paper, we'll remove all columns with more than 20% missing data. We then replace the remaining missing values with -1 and separate out our predictors ('X') and response ('y') variables.

In [10]:
data = remove_columns_with_missing_data(data).fillna(-1)
X = data.drop('outcome', axis=1)
y = data.outcome.astype(int)
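
To see what the 20% rule and the `-1` fill do, the preparation step can be tried on a small made-up frame. The column names `a`, `b`, `c` and their values are purely illustrative; the helper is repeated so the block runs on its own.

```python
import numpy as np
import pandas as pd


def remove_columns_with_missing_data(df, threshold=0.2):
    """Drop columns whose proportion of missing records exceeds threshold."""
    return df.dropna(axis=1, thresh=(1.0 - threshold) * len(df))


# Hypothetical toy frame with 5 rows:
#   a: 0% missing, b: 20% missing (kept), c: 40% missing (dropped).
toy = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "b": [1.0, np.nan, 3.0, 4.0, 5.0],
        "c": [np.nan, np.nan, 3.0, 4.0, 5.0],
    }
)

cleaned = remove_columns_with_missing_data(toy).fillna(-1)
print(cleaned)
```

A column sitting exactly at the threshold (20% missing) is kept, because only columns with *more* than 20% missing values are removed. The `-1` acts as a sentinel value that tree-based models can split on.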

## Exercises

### 1. Split data into training and test sets.
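
One possible approach is scikit-learn's `train_test_split`. The sketch below uses a synthetic stand-in dataset (`X_demo`, `y_demo`, generated with `make_classification`) so that it runs on its own; in the notebook you would pass the `X` and `y` built above. The 30% test size and the stratification are assumptions, not choices taken from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42

# Synthetic stand-in for the real data (assumption: 375 rows, 20 features).
X_demo, y_demo = make_classification(
    n_samples=375, n_features=20, random_state=RANDOM_SEED
)

# Hold out 30% of the rows for testing; stratify so that both splits keep
# roughly the same proportion of each outcome class.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=RANDOM_SEED
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible between runs.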

### 2. Fit a RandomForestClassifier on the training set.
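
A minimal sketch of fitting the classifier, again on synthetic stand-in data so the block is self-contained (in the notebook, use the training split of `X` and `y`). The hyperparameters shown are scikit-learn defaults written out explicitly, not values from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42

# Synthetic stand-in for the real data.
X_demo, y_demo = make_classification(
    n_samples=375, n_features=20, random_state=RANDOM_SEED
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=RANDOM_SEED
)

# Fit a forest of 100 trees on the training set only.
clf = RandomForestClassifier(n_estimators=100, random_state=RANDOM_SEED)
clf.fit(X_train, y_train)

# Accuracy on the training data; this is usually optimistic.
print(clf.score(X_train, y_train))
```

The training accuracy of a random forest is typically close to 1, which is why the held-out test set is needed for an honest estimate.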

### 3. Evaluate the classifier performance by calculating the confusion matrix and the [F1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) on the test set.
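
The evaluation could look roughly like the following, using `confusion_matrix` and `f1_score` from `sklearn.metrics`. The data are a synthetic stand-in (an assumption made so the block runs without the Excel file); the F1 score is also recomputed by hand from the confusion matrix as a sanity check.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42

# Synthetic stand-in for the real data.
X_demo, y_demo = make_classification(
    n_samples=375, n_features=20, random_state=RANDOM_SEED
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=RANDOM_SEED
)
clf = RandomForestClassifier(random_state=RANDOM_SEED).fit(X_train, y_train)

# Predict on the held-out test set and compare with the true labels.
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted
f1 = f1_score(y_test, y_pred)

# For a binary problem the matrix unpacks as tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()
f1_by_hand = 2 * tp / (2 * tp + fp + fn)

print(cm)
print(f1, f1_by_hand)
```

The F1 score is the harmonic mean of precision and recall, so it only rewards a classifier that does well on both.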

### 4. Plot the feature importances of the fitted classifier (ranking the features in this way underlies the main finding of the Nature Machine Intelligence paper).
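
One way to do this is to wrap `feature_importances_` in a pandas Series and draw a horizontal bar chart. The sketch below uses synthetic data with placeholder column names (`feature_0`, `feature_1`, ...), which are assumptions for illustration; on the real `X` the index would show the clinical measurement names.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42

# Synthetic stand-in for the real data, with placeholder feature names.
X_arr, y_demo = make_classification(
    n_samples=375, n_features=20, n_informative=5, random_state=RANDOM_SEED
)
X_demo = pd.DataFrame(
    X_arr, columns=[f"feature_{i}" for i in range(X_arr.shape[1])]
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=RANDOM_SEED
)
clf = RandomForestClassifier(random_state=RANDOM_SEED).fit(X_train, y_train)

# Impurity-based importances, one per feature, sorted in ascending order.
importances = pd.Series(
    clf.feature_importances_, index=X_train.columns
).sort_values()

# Plot the ten most important features, largest at the top.
ax = importances.tail(10).plot.barh(figsize=(6, 4))
ax.set_xlabel("Mean decrease in impurity")
ax.set_title("Top 10 feature importances")
plt.tight_layout()
plt.show()
```

Impurity-based importances sum to 1 across all features, so each bar can be read as a share of the total.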

### 5. Try running a different type of classifier and/or see how well you can do on the test set by tuning hyperparameters using cross-validation, grid search or otherwise.
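
As one illustration among many possible answers, the sketch below swaps in a logistic regression (with feature scaling) and tunes its regularisation strength `C` by 5-fold cross-validated grid search. The data are a synthetic stand-in, and the grid values are arbitrary assumptions chosen to keep the search fast.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

RANDOM_SEED = 42

# Synthetic stand-in for the real data.
X_demo, y_demo = make_classification(
    n_samples=375, n_features=20, random_state=RANDOM_SEED
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=RANDOM_SEED
)

# Scale the features, then fit a logistic regression.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation on the training set, optimising the F1 score.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

# The best model is refitted on the full training set; score it once on test.
test_f1 = f1_score(y_test, search.predict(X_test))
print(search.best_params_, search.best_score_, test_f1)
```

Because the hyperparameters are chosen using only the training folds, the test set stays untouched until the final evaluation.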