# Goal
The recent worldwide spread of COVID-19 has placed enormous strain on healthcare systems around the world. Intelligent allocation of limited medical resources during a crisis is critical for preventing a collapse of the healthcare system and for treating the greatest number of people possible. This project aims to help hospitals allocate limited beds, personal protective equipment, and staff during the early stages of a developing outbreak.

Given a dataset of COVID-19 patients that includes demographic data, medical history, and a time series of vitals and blood test results, we build several predictors that estimate whether a patient will need ICU care in the near future. The results of these predictors can be used to triage individual COVID-19 cases, even before severe symptoms become apparent. Resources can then be focused on the patients who need them most, rather than spent on those who will not need them.

# Overview of Provided Dataset
* The dataset covers 385 patients (identifiers 0 through 384) admitted to the hospital after a positive COVID-19 test.
* For each patient, basic demographic information and medical history were recorded (e.g. age, gender, and 'Disease Grouping').
* Additionally, blood test results and vital signs were recorded upon admission to the hospital and in successive time windows afterwards (0-2, 2-4, 4-6, 6-12, and 12+ hours).
<ul>
* These measurements are captured in the WINDOW column. Window 0-2 means that the measurements in that row were taken between 0 and 2 hours after entry to the hospital, 2-4 means that the measurements in that row were taken 2 to 4 hours after entry, etc.
* Every patient had a 0-2 measurement, but not all patients had later measurements taken.
</ul>
* We aim to predict the 'ICU' variable. If this value is 0, the patient is not in ICU care; if it is 1, the patient is in ICU care.
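To make the layout described above concrete, here is a minimal sketch on a made-up table. The two patients and their heart-rate values are hypothetical and only the column names come from the real dataset. It shows one row per patient per window, with some later measurements missing.

```python
import pandas as pd

# Hypothetical stand-in for the real file: two patients, one row per time window.
toy = pd.DataFrame({
    "PATIENT_VISIT_IDENTIFIER": [0, 0, 0, 1, 1, 1],
    "WINDOW": ["0-2", "2-4", "4-6", "0-2", "2-4", "4-6"],
    "HEART_RATE_MEAN": [88.0, None, 95.0, 70.0, 72.0, None],
})

# Number of rows (time windows) stored for each patient.
rows_per_patient = toy.groupby("PATIENT_VISIT_IDENTIFIER").size()

# Number of recorded (non-missing) measurements in each window.
recorded_per_window = toy.groupby("WINDOW")["HEART_RATE_MEAN"].count()

print(rows_per_patient)
print(recorded_per_window)
```

Running the same two `groupby` calls on the real `data` frame is a quick way to check how many windows each patient has and how sparse the later windows are.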

# Imports
Standard imports for data processing, data visualization, and machine learning.




In [None]:

import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt 
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sns
#Use the whitegrid plot style for all seaborn figures.
sns.set(style="whitegrid", color_codes=True)
dataPath = "/content/drive/MyDrive/ML/KaggleRawDataEdited.csv"

data = pd.read_csv(dataPath)



In [None]:
#Print the name of each column provided in the raw dataset.
for each in data.columns:
  print("'"+str(each)+"'")

In [None]:
icu_patient = data[['PATIENT_VISIT_IDENTIFIER','ICU']].groupby(['PATIENT_VISIT_IDENTIFIER']).sum()

#These patients never needed to go to the ICU
no_icu = icu_patient[icu_patient['ICU'] == 0]

#These patients needed to go to the ICU at some point in time. 
icu = icu_patient[icu_patient['ICU'] > 0]

#We see that in the provided dataset, 190 patients did not need ICU care, while
#195 did. The dataset is roughly balanced between the two classes.
print(len(no_icu))
print(len(icu))

190
195


# Cleaning up the data

In [None]:
#This function will create a column called 'ever_icu'. This column will have a
#1 if the corresponding patient went to the ICU at any point in time. It will
#have a zero otherwise. This is the target variable we are predicting.

def if_ever_icu(row):
  patientEntry = row['PATIENT_VISIT_IDENTIFIER']
  allPatientEntries = data[data['PATIENT_VISIT_IDENTIFIER'] == patientEntry]
  sumIcu = allPatientEntries[['PATIENT_VISIT_IDENTIFIER', 'ICU']].sum()
  if int (sumIcu['ICU']) > 0:
    return 1
  return 0

data['ever_icu'] = data.apply(if_ever_icu, axis=1)
data = data.drop('ICU', axis=1)
print(data)


      PATIENT_VISIT_IDENTIFIER  AGE_ABOVE65  ... WINDOW  ever_icu
0                            0            1  ...    0-2         1
1                            0            1  ...    2-4         1
2                            0            1  ...    4-6         1
3                            0            1  ...   6-12         1
4                            0            1  ...    12+         1
...                        ...          ...  ...    ...       ...
1920                       384            0  ...    0-2         0
1921                       384            0  ...    2-4         0
1922                       384            0  ...    4-6         0
1923                       384            0  ...   6-12         0
1924                       384            0  ...    12+         0

[1925 rows x 86 columns]
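The row-by-row `apply` above rescans the whole table for every row. As a hedged alternative, the sketch below computes the same kind of per-patient flag with `groupby(...).transform("max")`. The toy frame is hypothetical and this is not the code that produced the output above.

```python
import pandas as pd

# Hypothetical toy data: patient 0 enters the ICU in the last window, patient 1 never does.
toy = pd.DataFrame({
    "PATIENT_VISIT_IDENTIFIER": [0, 0, 0, 1, 1, 1],
    "ICU": [0, 0, 1, 0, 0, 0],
})

# max() over a patient's rows is 1 if any row has ICU == 1, else 0;
# transform() broadcasts that per-patient value back onto every row.
toy["ever_icu"] = toy.groupby("PATIENT_VISIT_IDENTIFIER")["ICU"].transform("max")

# One label per patient, useful for checking the class balance.
per_patient = toy.groupby("PATIENT_VISIT_IDENTIFIER")["ever_icu"].first()
print(per_patient.value_counts())
```

Because `ICU` only takes the values 0 and 1, taking the maximum gives the same label as checking whether the sum is greater than zero.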


Not every feature was measured for every patient in each of the five time windows. Below, we assume that if a value is missing, we can fill it in with the average of the other measurements of that feature that the same patient had taken.

In [None]:
columns = 'ALBUMIN_MEAN','BE_ARTERIAL_MEAN','BE_VENOUS_MEAN','BIC_ARTERIAL_MEAN',\
'BIC_VENOUS_MEAN','BILLIRUBIN_MEAN','CALCIUM_MEAN','CREATININ_MEAN','FFA_MEAN',\
'GGT_MEAN','GLUCOSE_MEAN','HEMATOCRITE_MEAN','HEMOGLOBIN_MEAN','INR_MEAN','LACTATE_MEAN','LEUKOCYTES_MEAN','LINFOCITOS_MEAN',\
'NEUTROPHILES_MEAN','P02_ARTERIAL_MEAN','P02_VENOUS_MEAN','PC02_ARTERIAL_MEAN',\
'PC02_VENOUS_MEAN','PCR_MEAN','PH_ARTERIAL_MEAN','PH_VENOUS_MEAN','PLATELETS_MEAN',\
'POTASSIUM_MEAN','SAT02_ARTERIAL_MEAN','SAT02_VENOUS_MEAN','SODIUM_MEAN','TGO_MEAN',\
'TGP_MEAN','TTPA_MEAN','UREA_MEAN','DIMER_MEAN','BLOODPRESSURE_DIASTOLIC_MEAN',\
'BLOODPRESSURE_SISTOLIC_MEAN','HEART_RATE_MEAN','RESPIRATORY_RATE_MEAN','TEMPERATURE_MEAN',\
'OXYGEN_SATURATION_MEAN','BLOODPRESSURE_DIASTOLIC_MEDIAN','BLOODPRESSURE_SISTOLIC_MEDIAN',\
'HEART_RATE_MEDIAN','RESPIRATORY_RATE_MEDIAN','TEMPERATURE_MEDIAN','OXYGEN_SATURATION_MEDIAN',\
'BLOODPRESSURE_DIASTOLIC_MIN','BLOODPRESSURE_SISTOLIC_MIN','HEART_RATE_MIN','RESPIRATORY_RATE_MIN',\
'TEMPERATURE_MIN','OXYGEN_SATURATION_MIN','BLOODPRESSURE_DIASTOLIC_MAX','BLOODPRESSURE_SISTOLIC_MAX',\
'HEART_RATE_MAX','RESPIRATORY_RATE_MAX','TEMPERATURE_MAX','OXYGEN_SATURATION_MAX','BLOODPRESSURE_DIASTOLIC_DIFF',\
'BLOODPRESSURE_SISTOLIC_DIFF','HEART_RATE_DIFF','RESPIRATORY_RATE_DIFF','TEMPERATURE_DIFF',\
'OXYGEN_SATURATION_DIFF','BLOODPRESSURE_DIASTOLIC_DIFF_REL','BLOODPRESSURE_SISTOLIC_DIFF_REL',\
'HEART_RATE_DIFF_REL','RESPIRATORY_RATE_DIFF_REL','TEMPERATURE_DIFF_REL','OXYGEN_SATURATION_DIFF_REL'

#This function will generate a temporary column that stores the average of each
#feature, by patient. For instance, if a patient's five measurements of ALBUMIN_MEAN are [10,n/a,n/a,30,20],
#a new column named 'group_ALBUMIN_MEAN' will store the values 20,20,20,20,20 in the corresponding rows
#for that patient. We will use this later to fill in missing values.
def getAverageOfFeatureInGroup(feature):
  #transform('mean') computes each patient's mean (ignoring missing values) and
  #broadcasts it to every row of that patient. It covers every patient in the
  #data and does not depend on the row order or on the number of rows per patient.
  patients = data.groupby('PATIENT_VISIT_IDENTIFIER')
  data['group_'+feature] = patients[feature].transform('mean')

#make temporary columns of each feature in the above list.
for each in columns:
  getAverageOfFeatureInGroup(each)


#If a value is missing in a given row, this loop fills in that missing
#value with the average of all other values of the same patient. For instance,
#if a patient's five measurements of ALBUMIN_MEAN are [10,n/a,n/a,30,20], the two
#n/a entries will be filled in with 20.
for each in columns:
  data[each] = data[each].fillna(data['group_'+each])

#drop the temporary columns that we used
data = data.drop(['group_'+each for each in columns], axis=1)
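The example from the comments above can be reproduced on a small, made-up frame. This is only a sketch of the per-patient mean imputation idea. The three toy patients are hypothetical, and the third one is included to show what happens when a patient has no recorded value at all.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: patient 0 matches the [10, n/a, n/a, 30, 20] example,
# patient 1 has one gap, and patient 2 has no ALBUMIN_MEAN measurement at all.
toy = pd.DataFrame({
    "PATIENT_VISIT_IDENTIFIER": [0] * 5 + [1] * 5 + [2] * 2,
    "ALBUMIN_MEAN": [10, np.nan, np.nan, 30, 20,
                     5, 5, np.nan, 5, 5,
                     np.nan, np.nan],
})

# Per-patient mean (missing values are ignored), repeated on every row of that patient.
patient_mean = toy.groupby("PATIENT_VISIT_IDENTIFIER")["ALBUMIN_MEAN"].transform("mean")

# Fill each gap with the mean of the same patient's recorded values.
toy["ALBUMIN_MEAN"] = toy["ALBUMIN_MEAN"].fillna(patient_mean)
print(toy)
```

Patient 0's two gaps become 20, as in the comment. Patient 2 has nothing to average, so its mean is itself missing and both rows stay empty; such rows would need a separate strategy if they occur in the real data.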



Below we turn the age percentile labels into integers from 1 to 10 so that age can be used as an ordered numeric feature during the classification step.

In [None]:
## Age percentiles

ages = {'10th': 1,
        '20th': 2,
        '30th': 3,
        '40th': 4,
        '50th': 5,
        '60th': 6,
        '70th': 7,
        '80th': 8,
        '90th': 9,
        'Above 90th': 10}

def map_age(row):
    per = row['AGE_PERCENTIL']
    mapped = ages[per]
    return mapped

data['age_mapped'] = data.apply(map_age,axis=1)
data = data.drop('AGE_PERCENTIL', axis=1)
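The same lookup can also be written without a row-wise `apply`, using `Series.map` with the dictionary. The sketch below runs on a hypothetical three-row frame and is an alternative, not the cell that was executed above.

```python
import pandas as pd

# Same label-to-integer mapping as above.
ages = {"10th": 1, "20th": 2, "30th": 3, "40th": 4, "50th": 5,
        "60th": 6, "70th": 7, "80th": 8, "90th": 9, "Above 90th": 10}

# Hypothetical toy rows.
toy = pd.DataFrame({"AGE_PERCENTIL": ["60th", "Above 90th", "10th"]})

# map() looks up every label in the dictionary in one vectorized call.
toy["age_mapped"] = toy["AGE_PERCENTIL"].map(ages)

# Unlike ages[per], map() yields NaN for a label that is not in the dictionary,
# so we check explicitly that nothing was left unmapped.
assert not toy["age_mapped"].isna().any()
print(toy)
```

One difference to keep in mind: the dictionary lookup in `map_age` raises a `KeyError` on an unexpected label, while `map` silently produces a missing value, which is why the sketch checks for it.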


Below we one-hot-encode gender so it can be used as a feature in the classification step.

In [None]:
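#One-hot-encode GENDER; dtype=int keeps the new columns as 0/1 integers.
gender_dummies = pd.get_dummies(data.GENDER, prefix='GENDER', dtype=int)
print(gender_dummies.head())

data['GENDER_0'] = gender_dummies['GENDER_0']
data['GENDER_1'] = gender_dummies['GENDER_1']

data = data.drop('GENDER', axis=1)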

   GENDER_0  GENDER_1
0         1         0
1         1         0
2         1         0
3         1         0
4         1         0
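As a compact variant of the encoding step, the dummy columns can be joined back in one call instead of being copied one by one. The sketch below uses a hypothetical five-row frame. It passes `dtype=int` because recent pandas releases return boolean dummy columns by default, which would print `True`/`False` rather than the `1`/`0` shown above.

```python
import pandas as pd

# Hypothetical toy rows with the same 0/1 coding as the GENDER column.
toy = pd.DataFrame({"GENDER": [0, 0, 1, 0, 1]})

# One column per category; dtype=int forces 0/1 integers on any pandas version.
dummies = pd.get_dummies(toy["GENDER"], prefix="GENDER", dtype=int)

# Replace the original column with its one-hot-encoded columns.
toy = toy.drop(columns="GENDER").join(dummies)
print(toy)
```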


In [None]:
data.to_csv('./tom_data.csv')