# **COVID-19 Prediction of Death**
**Context**

A health crisis of massive proportions such as the current COVID-19 pandemic gives us an opportunity to reflect on what we can improve in the way we deal with healthcare, so that we are better prepared and equipped to combat such an event in the future.
During the entire course of the pandemic, one of the main problems that healthcare providers have faced is the shortage of medical resources and the lack of a proper plan to distribute them efficiently.
They have been in the dark, unable to estimate how many resources they might need even in the very next week, as the COVID-19 curve has swung very unpredictably. In these tough times, being able to predict what kind of resources an individual might require at the time of testing positive, or even before that, would be of great help to the authorities, as they could procure and arrange the resources necessary to save the life of that patient.

**Content**

While the above are lofty thoughts, obtaining COVID-19 patient data that contains patient-specific information on history and habits is a different ball game altogether. This is mainly due to regulations such as HIPAA and GDPR, which make it almost impossible for anyone to get their hands on PHI data. I spent literally days and nights searching for a suitable data-set and called up people I knew for any pointers to a data-set that might be of use to me. Finally, I found this data-set, https://www.gob.mx/salud/documentos/datos-abiertos-152127, which was released by the Mexican government. It contains a huge amount of anonymised patient-related information.

**Columns**
* id: ID of patient
* sex: Female - 1, Male - 2
* patient_type: Outpatient - 1, Inpatient - 2
* entry_date: Date of Entry to hospital
* date_symptoms: Date of first symptom
* date_died: Date of death
* intubed: Yes - 1, No - 2, Data missing or NA - 97,98,99
* pneumonia: Yes - 1, No - 2, Data missing or NA - 97,98,99
* age: Age
* pregnancy: Yes - 1, No - 2, Data missing or NA - 97,98,99
* diabetes: Yes - 1, No - 2, Data missing or NA - 97,98,99
* copd: Yes - 1, No - 2, Data missing or NA - 97,98,99
* asthma: Yes - 1, No - 2, Data missing or NA - 97,98,99
* inmsupr: Yes - 1, No - 2, Data missing or NA - 97,98,99
* hypertension: Yes - 1, No - 2, Data missing or NA - 97,98,99
* other_disease: Yes - 1, No - 2, Data missing or NA - 97,98,99
* cardiovascular: Yes - 1, No - 2, Data missing or NA - 97,98,99
* obesity: Yes - 1, No - 2, Data missing or NA - 97,98,99
* renal_chronic: Yes - 1, No - 2, Data missing or NA - 97,98,99
* tobacco: Yes - 1, No - 2, Data missing or NA - 97,98,99
* contact_other_covid: Yes - 1, No - 2, Data missing or NA - 97,98,99
* covid_res: Positive - 1, Negative - 2, Awaiting Results - 3
* icu: Yes - 1, No - 2, Data missing or NA - 97,98,99
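
The coding scheme above can be decoded with a small mapping. The following is a minimal sketch on made-up rows, not part of the original pipeline: the toy values and the names `toy`, `CODE_TO_FLAG` and `decoded` are assumptions for illustration. Note that it maps Yes to 1 and No to 0, which is the opposite of the 0/1 recoding used later in this notebook.

```python
import pandas as pd

# Toy rows that follow the coding scheme listed above (values are made up).
toy = pd.DataFrame({
    "diabetes": [1, 2, 98, 2],
    "icu": [97, 2, 1, 99],
})

# Yes (1) -> 1.0, No (2) -> 0.0; any other code (97, 98, 99) is not in the
# mapping, so Series.map turns it into NaN.
CODE_TO_FLAG = {1: 1.0, 2: 0.0}
decoded = toy.apply(lambda col: col.map(CODE_TO_FLAG))

# Number of missing entries per column.
print(decoded.isna().sum())
```

Counting the missing entries per column this way gives a quick idea of how many rows a later `dropna()` would remove.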

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
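from sklearn.metrics import classification_report
try:
    from sklearn.metrics import plot_confusion_matrix
except ImportError:
    # plot_confusion_matrix was removed in scikit-learn 1.2;
    # ConfusionMatrixDisplay.from_estimator takes the same (estimator, X, y) arguments
    from sklearn.metrics import ConfusionMatrixDisplay
    plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator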
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils import class_weight
from keras import models
from keras import layers
from keras import regularizers
from keras import optimizers
import tensorflow as tf
import random as rn

os.environ['PYTHONHASHSEED'] = '0'
np.random.seed(1)
rn.seed(2)
tf.random.set_seed(3)

<h2>Data preparation</h2> 

In [None]:
df = pd.read_csv('../input/covid19-patient-precondition-dataset/covid.csv')

# display the first 5 rows
df.head()

In [None]:
# display columns info
df.info()

In [None]:
# the target column will be the death of the patient
# patient survived if date_died == '9999-99-99'
df['death'] = df['date_died'].apply(lambda x: 0 if x == '9999-99-99' else 1)

In [None]:
# drop columns that are not used as features
df.drop(columns=["id", "patient_type",
                 "entry_date", "date_symptoms",
                 "date_died", "pregnancy"], inplace=True)

In [None]:
# replace all missing-value codes (97, 98 and 99) with NaN
temp = df['age'] # save age: 97, 98 and 99 are also valid ages - you do not want to drop old people!
df = df.replace([97, 98, 99], np.nan)
df['age'] = temp

# drop all nan rows
df = df.dropna()

In [None]:
# drop rows with covid_res == Awaiting Results 
df=df[df['covid_res'] != 3]

In [None]:
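# recode all 1,2 values as 0,1
# note: Yes (1) becomes 0 and No (2) becomes 1; for sex, Female becomes 0 and Male becomes 1
temp = df[['age','death']] # save age and death, which must keep their original values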
df = df.replace([1,2], [0,1])
df[['age','death']] = temp

In [None]:
# update index
df = df.reset_index(drop=True)

In [None]:
# display the first 5 rows of prepared data
df.head()

In [None]:
# display columns info of prepared data
df.info()

In [None]:
# there seems to be an imbalance of target classes
df['death'].value_counts().to_frame()

In [None]:
# split features and target
X = df.loc[:, df.columns != 'death'].values
y = np.array(df['death'])

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state=1)

print("Train X: ", X_train.shape)
print("Train y: ", y_train.shape)
print("Test X: ", X_test.shape)
print("Test y: ", y_test.shape)

In [None]:
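# normalize age using the mean and std of the training set only
# look up the index among the feature columns, since X does not contain 'death'
ageColumnIndex = df.columns.drop('death').get_loc('age')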

mean = X_train[:,ageColumnIndex].mean(axis=0)
X_train[:,ageColumnIndex] -= mean
std = X_train[:,ageColumnIndex].std(axis=0)
X_train[:,ageColumnIndex] /= std

X_test[:,ageColumnIndex] -= mean
X_test[:,ageColumnIndex] /= std
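
The point of the cell above is that the mean and standard deviation come from the training set only and are then reused for the test set. A minimal sketch of the same idea on made-up numbers follows; the helper `standardize_column` and the toy arrays are hypothetical and not part of the original notebook. Unlike the cell above, it works on copies instead of modifying the arrays in place.

```python
import numpy as np

def standardize_column(X_train, X_test, col):
    """Scale one column with statistics computed on the training set only."""
    X_train = np.array(X_train, dtype=float)  # np.array copies its input
    X_test = np.array(X_test, dtype=float)
    mean = X_train[:, col].mean()
    std = X_train[:, col].std()
    X_train[:, col] = (X_train[:, col] - mean) / std
    X_test[:, col] = (X_test[:, col] - mean) / std  # reuse the train statistics
    return X_train, X_test

# Made-up rows: column 0 is a binary flag, column 1 is age.
toy_train = [[0, 20], [1, 40], [0, 60]]
toy_test = [[1, 40], [0, 80]]
train_scaled, test_scaled = standardize_column(toy_train, toy_test, col=1)
print(train_scaled[:, 1], test_scaled[:, 1])
```

After scaling, the training column has zero mean and unit standard deviation, while the test column generally does not, which is expected.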

In [None]:
# use SMOTE oversampling method to solve the imbalance problem
# an explicit random_state keeps the result independent of the global NumPy seed
X_train_oversampled, y_train_oversampled = SMOTE(random_state=1).fit_resample(X_train, y_train)

print("Oversampled train X: ", X_train_oversampled.shape)
print("Oversampled train y: ", y_train_oversampled.shape)

pd.DataFrame(y_train_oversampled).value_counts()
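
SMOTE creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration of that interpolation idea, not imblearn's implementation; the function `smote_like_samples` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between all minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbours per point

    base = rng.integers(0, len(X_min), size=n_new)             # random minority points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour for each
    gap = rng.random((n_new, 1))                               # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Made-up minority samples on the corners of the unit square.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_minority, n_new=6)
print(synthetic)
```

Because the new points lie on segments between existing ones, columns that only hold 0 and 1 in the original data can receive fractional values in the synthetic rows. This is worth keeping in mind for a data-set like this one, where most features are binary flags.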

<h2>Classification with Decision Tree</h2> 

In [None]:
decTreClassifier = DecisionTreeClassifier()
decTreClassifier.fit(X_train_oversampled, y_train_oversampled)

y_pred = decTreClassifier.predict(X_test)

plot_confusion_matrix(decTreClassifier, X_test, y_test)
print(classification_report(y_test, y_pred))

<h2>Classification with Random Forest</h2>

In [None]:
randForestClassifier = RandomForestClassifier(n_estimators=100)
randForestClassifier.fit(X_train_oversampled, y_train_oversampled)

y_pred = randForestClassifier.predict(X_test)

plot_confusion_matrix(randForestClassifier, X_test, y_test)
print(classification_report(y_test, y_pred))

In [None]:
# get features importance
rf_feature_importance = pd.Series(randForestClassifier.feature_importances_,index=df.loc[:, df.columns != 'death'].columns).sort_values(ascending=False)

sns.barplot(x = rf_feature_importance, y = rf_feature_importance.index)

plt.title("RF feature importance")
plt.show()

<h2>Classification with Gradient Boosting</h2>

In [None]:
gradientBoostingClassifier = GradientBoostingClassifier(learning_rate=1)
gradientBoostingClassifier.fit(X_train_oversampled, y_train_oversampled)

y_pred = gradientBoostingClassifier.predict(X_test)

plot_confusion_matrix(gradientBoostingClassifier, X_test, y_test)
print(classification_report(y_test, y_pred))

<h2>Classification with simple ANN</h2>

In [None]:
# prepare validation data from train data
X_train_new, X_val, y_train_new, y_val = train_test_split(X_train,
                                                          y_train,
                                                          test_size=0.15,
                                                          random_state=1)

print("Train X: ", X_train_new.shape)
print("Train y: ", y_train_new.shape)
print("Val X: ", X_val.shape)
print("Val y: ", y_val.shape)
print("Test X: ", X_test.shape)
print("Test y: ", y_test.shape)

In [None]:
model = models.Sequential()

model.add(layers.Dense(256, activation = "relu",
                       kernel_regularizer=regularizers.l2(0.001),
                       input_shape = (X_train_new.shape[1],)))
model.add(layers.Dropout(0.3))
model.add(layers.Dense(128, activation = "relu",
                       kernel_regularizer=regularizers.l2(0.001)))
model.add(layers.Dropout(0.3))
model.add(layers.Dense(64, activation = "relu",
                       kernel_regularizer=regularizers.l2(0.001)))
model.add(layers.Dropout(0.3))
model.add(layers.Dense(1, activation='sigmoid'))

model.summary()

In [None]:
model.compile(optimizer=optimizers.Adam(learning_rate=0.0001),
              loss='binary_crossentropy', 
              metrics=['accuracy'])

# use class weights to solve the imbalance problem
# (recent scikit-learn versions require keyword arguments here)
computed_weights = class_weight.compute_class_weight(class_weight='balanced',
                                                     classes=np.unique(y_train_new),
                                                     y=y_train_new)
class_weights = {0: computed_weights[0], 1: computed_weights[1]}

history = model.fit(
    X_train_new,y_train_new,
    epochs=100,
    batch_size=512,
    validation_data = (X_val, y_val),
    verbose=0,
    class_weight=class_weights
)
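
For reference, the `'balanced'` option weights each class by `n_samples / (n_classes * count_of_class)`, so the rarer class gets the larger weight. A minimal sketch of that formula on made-up labels follows; the helper `balanced_class_weights` and the toy labels are hypothetical.

```python
import numpy as np

def balanced_class_weights(y):
    """Return {class: n_samples / (n_classes * class_count)} for the labels in y."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Made-up labels: 8 survivors (0) and 2 deaths (1).
toy_labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(toy_labels)
print(weights)  # {0: 0.625, 1: 2.5}
```

With these weights, an error on a sample of the minority class contributes four times as much to the loss as an error on a sample of the majority class.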

In [None]:
history_dict = history.history
epochs = range(1, len(history_dict['loss']) + 1)

plt.plot(epochs, history_dict['loss'], 'bo', label='loss_values')
plt.plot(epochs, history_dict['val_loss'], 'b', label='val_loss_values')

plt.xlabel('epochs')
plt.ylabel('loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

In [None]:
plt.clf()

plt.plot(epochs, history_dict['accuracy'], 'ro', label='acc_values')
plt.plot(epochs, history_dict['val_accuracy'], 'r', label='val_acc_values')

plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.title('Training and validation accuracy')
plt.legend()

plt.show()

In [None]:
test_loss, test_acc = model.evaluate(X_test, y_test)
print("Test accuracy:", test_acc)
print("Test loss:", test_loss)
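
Since the classes are imbalanced, accuracy alone can look good even when many deaths are missed. A minimal sketch of computing precision and recall for the positive class from predicted probabilities follows. The helper `precision_recall`, the 0.5 threshold and the toy values are assumptions for illustration, not results from this notebook.

```python
import numpy as np

def precision_recall(y_true, y_prob, threshold=0.5):
    """Precision and recall of class 1 after thresholding the probabilities."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # deaths correctly predicted
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false alarms
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # missed deaths
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels and predicted probabilities.
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob_toy = [0.1, 0.2, 0.7, 0.3, 0.1, 0.4, 0.2, 0.1, 0.9, 0.4]

precision, recall = precision_recall(y_true_toy, y_prob_toy)
toy_pred = (np.asarray(y_prob_toy) >= 0.5).astype(int)
accuracy = float(np.mean(toy_pred == np.asarray(y_true_toy)))
print(accuracy, precision, recall)  # 0.8 0.5 0.5
```

In this toy case the accuracy is 0.8 although only half of the positive cases are found. For the network above, `model.predict(X_test).ravel()` could be passed as `y_prob` together with `y_test`.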