<div style="text-align:center">
    <img src="DOH.jpg" alt="sample image for the notebook" width="600">
</div>

<center><h1>Predicting Filipino COVID-19 Survival</h1></center>
<center><em>A Deep Learning Model with an FNN Architecture for Predicting Survival</em></center>
<hr>


This project tackles the challenge of predicting Filipino patients' COVID-19 survival time and recovery status. We leverage a Feedforward Neural Network (FNN) built with TensorFlow to analyze patient data and estimate:

**Survival Time:** Days from symptom onset to recovery/death (removal date).

**Recovery Status:** Recovered or Died, based on the recorded removal type (`RemovalType`).

The model takes "Sex" and an age group derived from "Age" as input features. We plan to experiment with different hyperparameters to optimize model performance. This project provides a framework for predicting COVID-19 outcomes from readily available data points. The model can be extended with additional factors and refined for improved accuracy.


***

<a name="top"></a>
#### Table of Contents:

[ref0]: #dat_prep
- [Data Preparation][ref0]

[ref1]: #data_preprocessing
- [Data Preprocessing][ref1]

[ref3]: #feature_extraction
- [Feature Extraction][ref3]


***

In [1]:
# Importing of Modules
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

from lifelines.statistics import logrank_test
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.utils import to_categorical

<a name="dat_prep"></a>
# Data Preparation
***
Before training, we prepare the data. We load and combine the DOH case-information batches, keep only the columns needed for the analysis, and drop rows with a missing age, sex, or removal type, as well as rows with a negative age. We then derive two new features, illness duration and age group, and impute missing or extreme durations with group averages so that the model trains on complete records. The dataset has no columns for pre-existing conditions, so none are used here.

In [2]:
batch_data_0 = pd.read_csv('/Users/reginaldgonzales/Desktop/GITHUB/DOH_Prediction/dataset/DOH COVID Data Drop_ 20240103 (2020-2023) - 04 Case Information_batch_0.csv', low_memory=False)
batch_data_1 = pd.read_csv('/Users/reginaldgonzales/Desktop/GITHUB/DOH_Prediction/dataset/DOH COVID Data Drop_ 20240103 (2020-2023) - 04 Case Information_batch_1.csv', low_memory=False)
batch_data_2 = pd.read_csv('/Users/reginaldgonzales/Desktop/GITHUB/DOH_Prediction/dataset/DOH COVID Data Drop_ 20240103 (2020-2023) - 04 Case Information_batch_2.csv', low_memory=False)
batch_data_3 = pd.read_csv('/Users/reginaldgonzales/Desktop/GITHUB/DOH_Prediction/dataset/DOH COVID Data Drop_ 20240103 (2020-2023) - 04 Case Information_batch_3.csv', low_memory=False)
batch_data_4 = pd.read_csv('/Users/reginaldgonzales/Desktop/GITHUB/DOH_Prediction/dataset/DOH COVID Data Drop_ 20240103 (2020-2023) - 04 Case Information_batch_4.csv', low_memory=False)

In [3]:
df = pd.concat([batch_data_0, batch_data_1, batch_data_2, batch_data_3, batch_data_4], ignore_index=True)
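
The five `read_csv` calls above differ only in the batch index, so the loading step could also be written as a loop. The following is a minimal sketch under assumptions: the helper name `load_batches`, the `*_batch_*.csv` pattern, and the throwaway demo files are all hypothetical and stand in for the real dataset directory.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_batches(data_dir, pattern="*_batch_*.csv"):
    """Read every CSV matching `pattern` in `data_dir` and stack them into one frame."""
    paths = sorted(Path(data_dir).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {data_dir}")
    frames = [pd.read_csv(path, low_memory=False) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration with two throwaway files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Age": [38, 44]}).to_csv(Path(tmp) / "demo_batch_0.csv", index=False)
    pd.DataFrame({"Age": [60]}).to_csv(Path(tmp) / "demo_batch_1.csv", index=False)
    demo_df = load_batches(tmp)

print(demo_df.shape)
```

Note that `sorted` orders the paths as strings, which is fine for batches 0 to 4 but would place a `batch_10` before `batch_2`.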

In [4]:
df.head()

Unnamed: 0,CaseCode,Age,AgeGroup,Sex,DateSpecimen,DateResultRelease,DateRepConf,DateDied,DateRecover,RemovalType,...,ProvRes,CityMunRes,CityMuniPSGC,BarangayRes,BarangayPSGC,HealthStatus,Quarantined,DateOnset,Pregnanttab,ValidationStatus
0,C404174,38.0,35 to 39,FEMALE,,2020-01-30,2020-01-30,,,RECOVERED,...,NEGROS ORIENTAL,DUMAGUETE CITY (CAPITAL),PH074610000,,,RECOVERED,NO,2020-01-21,NO,"Removal Type is ""Recovered"", but no Recovered ..."
1,C462688,44.0,40 to 44,MALE,,2020-01-30,2020-02-03,2020-02-01,,DIED,...,NEGROS ORIENTAL,DUMAGUETE CITY (CAPITAL),PH074610000,,,DIED,NO,2020-01-18,,
2,C387710,60.0,60 to 64,FEMALE,2020-01-23,2020-01-30,2020-02-05,,2020-01-31,RECOVERED,...,BOHOL,PANGLAO,PH071233000,,,RECOVERED,NO,2020-01-21,NO,Case has Admitting Facility but is not Admitte...
3,C377460,49.0,45 to 49,MALE,,,2020-03-06,,,RECOVERED,...,BATANGAS,SANTO TOMAS,PH041028000,,,RECOVERED,NO,,,Case has Admitting Facility but is not Admitte...
4,C498051,63.0,60 to 64,MALE,2020-03-05,,2020-03-06,2020-03-11,,DIED,...,RIZAL,CAINTA,PH045805000,,,DIED,NO,,,Age or Birthdate is Invalid\nCase has Lab Resu...


In [5]:
df.shape

(4136488, 23)

In [6]:
df.columns.tolist()

['CaseCode',
 'Age',
 'AgeGroup',
 'Sex',
 'DateSpecimen',
 'DateResultRelease',
 'DateRepConf',
 'DateDied',
 'DateRecover',
 'RemovalType',
 'DateRepRem',
 'Admitted',
 'RegionRes',
 'ProvRes',
 'CityMunRes',
 'CityMuniPSGC',
 'BarangayRes',
 'BarangayPSGC',
 'HealthStatus',
 'Quarantined',
 'DateOnset',
 'Pregnanttab',
 'ValidationStatus']

In [7]:
df.isnull().sum()

CaseCode                   0
Age                    11898
AgeGroup               11898
Sex                        1
DateSpecimen          975556
DateResultRelease     976231
DateRepConf                0
DateDied             4070302
DateRecover          3404454
RemovalType             1773
DateRepRem              1773
Admitted              173623
RegionRes               3882
ProvRes                57369
CityMunRes            105525
CityMuniPSGC          111496
BarangayRes           369066
BarangayPSGC          374907
HealthStatus               0
Quarantined           115223
DateOnset            2644554
Pregnanttab          2068927
ValidationStatus       46925
dtype: int64

## Feature Selection

In [8]:
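# Keep only the columns used downstream; .copy() avoids SettingWithCopyWarning on later assignments
doh_df = df[['Sex', 'Age', 'DateOnset', 'DateRecover', 'DateDied', 'RemovalType', 'DateRepRem', 'DateResultRelease', 'DateRepConf']].copy()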

In [9]:
doh_df.head()

Unnamed: 0,Sex,Age,DateOnset,DateRecover,DateDied,RemovalType,DateRepRem,DateResultRelease,DateRepConf
0,FEMALE,38.0,2020-01-21,,,RECOVERED,2020-02-07,2020-01-30,2020-01-30
1,MALE,44.0,2020-01-18,,2020-02-01,DIED,2020-02-02,2020-01-30,2020-02-03
2,FEMALE,60.0,2020-01-21,2020-01-31,,RECOVERED,2020-02-05,2020-01-30,2020-02-05
3,MALE,49.0,,,,RECOVERED,2020-03-27,,2020-03-06
4,MALE,63.0,,,2020-03-11,DIED,2020-03-12,,2020-03-06


In [10]:
# Drop rows where 'Age', 'Sex', or 'RemovalType' is missing, then remove negative ages
doh_df = doh_df.dropna(subset=['Age', 'Sex', 'RemovalType'])
doh_df = doh_df[doh_df['Age'] >= 0].copy()

In [11]:
doh_df.isnull().sum()

Sex                        0
Age                        0
DateOnset            2637370
DateRecover          3392573
DateDied             4056744
RemovalType                0
DateRepRem                 0
DateResultRelease     975334
DateRepConf                0
dtype: int64

In [12]:
doh_df.shape

(4122840, 9)

In [13]:
doh_df.head()

Unnamed: 0,Sex,Age,DateOnset,DateRecover,DateDied,RemovalType,DateRepRem,DateResultRelease,DateRepConf
0,FEMALE,38.0,2020-01-21,,,RECOVERED,2020-02-07,2020-01-30,2020-01-30
1,MALE,44.0,2020-01-18,,2020-02-01,DIED,2020-02-02,2020-01-30,2020-02-03
2,FEMALE,60.0,2020-01-21,2020-01-31,,RECOVERED,2020-02-05,2020-01-30,2020-02-05
3,MALE,49.0,,,,RECOVERED,2020-03-27,,2020-03-06
4,MALE,63.0,,,2020-03-11,DIED,2020-03-12,,2020-03-06


## Feature Engineering

In [14]:
# Parse the date columns as datetimes; values that cannot be parsed become NaT
date_columns = ['DateOnset', 'DateRecover', 'DateDied', 'DateRepRem', 'DateResultRelease', 'DateRepConf']
for col in date_columns:
    doh_df[col] = pd.to_datetime(doh_df[col], errors='coerce', format='%Y-%m-%d')

In [15]:
doh_df.head()

Unnamed: 0,Sex,Age,DateOnset,DateRecover,DateDied,RemovalType,DateRepRem,DateResultRelease,DateRepConf
0,FEMALE,38.0,2020-01-21,NaT,NaT,RECOVERED,2020-02-07,2020-01-30,2020-01-30
1,MALE,44.0,2020-01-18,NaT,2020-02-01,DIED,2020-02-02,2020-01-30,2020-02-03
2,FEMALE,60.0,2020-01-21,2020-01-31,NaT,RECOVERED,2020-02-05,2020-01-30,2020-02-05
3,MALE,49.0,NaT,NaT,NaT,RECOVERED,2020-03-27,NaT,2020-03-06
4,MALE,63.0,NaT,NaT,2020-03-11,DIED,2020-03-12,NaT,2020-03-06


In [16]:
def calculate_duration(row):
    start_date = row['DateOnset']
    if pd.isna(start_date):
        start_date = row['DateResultRelease']
    if pd.isna(start_date):
        start_date = row['DateRepConf']
    
    end_date = row['DateRecover']
    if pd.isna(end_date):
        end_date = row['DateRepRem']

    if pd.notna(start_date) and pd.notna(end_date):
        duration = (end_date - start_date).days
        if duration < 0:  # Add a check to handle negative durations
            return None  
        return duration
    else:
        return None

doh_df['CovidDuration'] = doh_df.apply(calculate_duration, axis=1)
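
Row-wise `apply` can be slow on a frame with millions of rows. As a sketch of an alternative, the same fallback logic can be expressed with column operations: chained `fillna` calls pick the first available start and end date, and `where` blanks out negative durations. The `demo` frame below is a hand-typed copy of the first five records shown earlier, not the real data.

```python
import pandas as pd

# Toy rows mirroring the first five records shown above.
demo = pd.DataFrame({
    "DateOnset": ["2020-01-21", "2020-01-18", "2020-01-21", None, None],
    "DateResultRelease": ["2020-01-30", "2020-01-30", "2020-01-30", None, None],
    "DateRepConf": ["2020-01-30", "2020-02-03", "2020-02-05", "2020-03-06", "2020-03-06"],
    "DateRecover": [None, None, "2020-01-31", None, None],
    "DateRepRem": ["2020-02-07", "2020-02-02", "2020-02-05", "2020-03-27", "2020-03-12"],
}).apply(pd.to_datetime, errors="coerce")

# Start date: onset, else result release, else report confirmation.
start = demo["DateOnset"].fillna(demo["DateResultRelease"]).fillna(demo["DateRepConf"])
# End date: recovery date, else reported removal date.
end = demo["DateRecover"].fillna(demo["DateRepRem"])

duration = (end - start).dt.days
duration = duration.where(duration >= 0)  # negative durations become missing

print(duration.tolist())
```

On these five rows the result matches the `CovidDuration` values that the row-wise function produces.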

In [17]:
# Bin ages into four age groups
bins = [0, 18, 35, 60, 118]
labels = ['0-18', '19-35', '36-60', '61+']
doh_df['Age'] = pd.to_numeric(doh_df['Age'], errors='coerce')
doh_df['AgeGroup'] = pd.cut(doh_df['Age'], bins=bins, labels=labels, include_lowest=True, right=True)
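
To see how the bin edges behave, the short example below applies the same `bins` and `labels` to a handful of boundary ages chosen for illustration. With `right=True` each interval includes its upper edge, `include_lowest=True` keeps age 0 in the first group, and any age above 118 falls outside the bins and becomes missing.

```python
import pandas as pd

bins = [0, 18, 35, 60, 118]
labels = ["0-18", "19-35", "36-60", "61+"]

# Boundary ages picked to show which side of each edge they land on.
ages = pd.Series([0, 18, 19, 35, 36, 60, 61, 118, 119])
groups = pd.cut(ages, bins=bins, labels=labels, include_lowest=True, right=True)

print(pd.DataFrame({"Age": ages, "AgeGroup": groups}))
```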

In [18]:
# Mean duration per (Sex, AgeGroup) and per AgeGroup, used later for imputation
average_durations_by_sex_and_age = doh_df.groupby(['Sex', 'AgeGroup'], observed=True)['CovidDuration'].mean().to_dict()
average_durations_by_age = doh_df.groupby('AgeGroup', observed=True)['CovidDuration'].mean().to_dict()

In [19]:
doh_df.head()

Unnamed: 0,Sex,Age,DateOnset,DateRecover,DateDied,RemovalType,DateRepRem,DateResultRelease,DateRepConf,CovidDuration,AgeGroup
0,FEMALE,38.0,2020-01-21,NaT,NaT,RECOVERED,2020-02-07,2020-01-30,2020-01-30,17.0,36-60
1,MALE,44.0,2020-01-18,NaT,2020-02-01,DIED,2020-02-02,2020-01-30,2020-02-03,15.0,36-60
2,FEMALE,60.0,2020-01-21,2020-01-31,NaT,RECOVERED,2020-02-05,2020-01-30,2020-02-05,10.0,36-60
3,MALE,49.0,NaT,NaT,NaT,RECOVERED,2020-03-27,NaT,2020-03-06,21.0,36-60
4,MALE,63.0,NaT,NaT,2020-03-11,DIED,2020-03-12,NaT,2020-03-06,6.0,61+


## Imputing

In [20]:
# Fill missing CovidDuration values and replace values above 30 days with group averages
def impute_duration(row):
    if pd.isna(row['CovidDuration']):
        # Missing: use the mean for the same sex and age group (NaN if that group has no mean)
        return average_durations_by_sex_and_age.get((row['Sex'], row['AgeGroup']), np.nan)
    elif row['CovidDuration'] > 30:
        # Above 30 days: use the age-group mean (keep the original value if no mean exists)
        return average_durations_by_age.get(row['AgeGroup'], row['CovidDuration'])
    else:
        return row['CovidDuration']


doh_df['CovidDuration'] = doh_df.apply(impute_duration, axis=1)
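
The imputation rule can also be sketched without a row-wise function. In this illustration, `groupby(...).transform("mean")` aligns each group mean with its rows, `fillna` handles missing durations, and `mask` replaces durations above 30 days. The `demo` values are made up for the example.

```python
import numpy as np
import pandas as pd

# Made-up rows: one missing duration and one duration above 30 days.
demo = pd.DataFrame({
    "Sex": ["FEMALE", "FEMALE", "MALE", "MALE", "MALE"],
    "AgeGroup": ["36-60", "36-60", "36-60", "36-60", "36-60"],
    "CovidDuration": [10.0, np.nan, 20.0, 45.0, 5.0],
})

# Group means are computed once, before any value is replaced.
mean_by_sex_age = demo.groupby(["Sex", "AgeGroup"])["CovidDuration"].transform("mean")
mean_by_age = demo.groupby("AgeGroup")["CovidDuration"].transform("mean")

original = demo["CovidDuration"]
imputed = original.fillna(mean_by_sex_age)          # missing -> sex/age-group mean
imputed = imputed.mask(original > 30, mean_by_age)  # above 30 days -> age-group mean

print(imputed.tolist())
```

Here the missing FEMALE value takes the FEMALE/36-60 mean of 10, and the 45-day value is replaced by the 36-60 mean of 20.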

In [21]:
print(f"{doh_df[['Sex', 'Age', 'AgeGroup', 'CovidDuration']].head()}")

      Sex   Age AgeGroup  CovidDuration
0  FEMALE  38.0    36-60           17.0
1    MALE  44.0    36-60           15.0
2  FEMALE  60.0    36-60           10.0
3    MALE  49.0    36-60           21.0
4    MALE  63.0      61+            6.0


In [22]:
count_over_30 = (doh_df['CovidDuration'] > 30).sum()
print(f'Number of entries with CovidDuration greater than 30 days: {count_over_30}')

Number of entries with CovidDuration greater than 30 days: 0


In [23]:
doh_df.isnull().sum()

Sex                        0
Age                        0
DateOnset            2690211
DateRecover          3393511
DateDied             4057210
RemovalType                0
DateRepRem            134046
DateResultRelease    1107829
DateRepConf           134046
CovidDuration              0
AgeGroup                   0
dtype: int64

In [24]:
# Remove the Decimals
doh_df['CovidDuration'] = doh_df['CovidDuration'].round().astype(int)

In [25]:
doh_df.head()

Unnamed: 0,Sex,Age,DateOnset,DateRecover,DateDied,RemovalType,DateRepRem,DateResultRelease,DateRepConf,CovidDuration,AgeGroup
0,FEMALE,38.0,2020-01-21,NaT,NaT,RECOVERED,2020-02-07,2020-01-30,2020-01-30,17,36-60
1,MALE,44.0,2020-01-18,NaT,2020-02-01,DIED,2020-02-02,2020-01-30,2020-02-03,15,36-60
2,FEMALE,60.0,2020-01-21,2020-01-31,NaT,RECOVERED,2020-02-05,2020-01-30,2020-02-05,10,36-60
3,MALE,49.0,NaT,NaT,NaT,RECOVERED,2020-03-27,NaT,2020-03-06,21,36-60
4,MALE,63.0,NaT,NaT,2020-03-11,DIED,2020-03-12,NaT,2020-03-06,6,61+


[top0]: #top
[Back to Table of Contents][top0]

<a name="data_preprocessing"></a>
# Data Preprocessing
***
Here the data is prepared for training. The categorical features "Sex" and "AgeGroup" are label-encoded as integers. These two features form the input, and the targets are the illness duration in days and the recovery status (1 for Recovered, 0 otherwise). The data is split 80/20 into training and test sets. Finally, the input features are standardized using statistics from the training set, and the status labels are one-hot encoded for the two-unit softmax output.

In [26]:
le_sex = LabelEncoder()
doh_df['Sex'] = le_sex.fit_transform(doh_df['Sex'])

le_age_group = LabelEncoder()
doh_df['AgeGroup'] = le_age_group.fit_transform(doh_df['AgeGroup'])

In [27]:
X = doh_df[['Sex', 'AgeGroup']]
y_duration = doh_df['CovidDuration'].values
y_status = doh_df['RemovalType'].apply(lambda x: 1 if x == 'RECOVERED' else 0).values
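
Because the status labels may be heavily imbalanced, it helps to know the accuracy that a trivial classifier would reach by always predicting the most frequent class. The sketch below computes that baseline; the helper `majority_baseline_accuracy` and the 98/2 label split are hypothetical and only illustrate the calculation.

```python
import numpy as np


def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class in `y`."""
    counts = np.bincount(np.asarray(y))
    return counts.max() / counts.sum()


# Hypothetical labels: 1 = RECOVERED, 0 = any other removal type.
y_demo = np.array([1] * 98 + [0] * 2)
baseline = majority_baseline_accuracy(y_demo)

print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

Running the same function on `y_status` gives a reference point for judging the status accuracy reported later.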

In [28]:
X_train, X_test, y_train_duration, y_test_duration, y_train_status, y_test_status = train_test_split(
    X, y_duration, y_status, test_size=0.2, random_state=42)

In [29]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

y_train_status_categorical = to_categorical(y_train_status)
y_test_status_categorical = to_categorical(y_test_status)
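
As an illustration of what these two steps compute, the NumPy sketch below standardizes a small made-up feature matrix using the training mean and standard deviation, then builds one-hot labels. It is a hand-rolled equivalent for explanation only, and the arrays are hypothetical.

```python
import numpy as np

# Made-up encoded features: columns stand in for Sex and AgeGroup.
X_train_demo = np.array([[0.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
X_test_demo = np.array([[0.0, 2.0]])

mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

X_train_z = (X_train_demo - mean) / std
X_test_z = (X_test_demo - mean) / std  # test data reuses the training statistics

# One-hot encoding of binary labels: row i has a 1 in column y[i].
y_demo = np.array([1, 0, 1])
y_one_hot = np.eye(2)[y_demo]

print(X_train_z.mean(axis=0), X_train_z.std(axis=0))
print(y_one_hot)
```

Fitting the scaler on the training split only, as the cell above does, keeps test-set statistics out of the training pipeline.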

[top1]: #top
[Back to Table of Contents][top1]

<a name="feature_extraction"></a>
# Feature Extraction
***
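A Feedforward Neural Network (FNN) is built with the Keras functional API. It takes the scaled "Sex" and "AgeGroup" features as input. Two hidden layers (64 and 32 units) with ReLU activation and L2 regularization form a shared trunk, with Dropout after the first layer. Two output heads branch from the trunk: a linear unit that predicts survival time in days, and a two-unit softmax that predicts recovery status (Recovered/Died). The model is trained with the Adam optimizer, using mean squared error for the duration head and categorical cross-entropy for the status head, with the status loss weighted twice as heavily. Training uses early stopping on the validation loss. The trained model then predicts durations and class probabilities for the test data, the class with the highest probability becomes the predicted label, and we evaluate the results.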

In [31]:
input_layer = Input(shape=(X_train_scaled.shape[1],))

x = Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.001))(input_layer)  
x = Dropout(0.5)(x)
x = Dense(32, activation='relu', kernel_regularizer=regularizers.l2(0.001))(x)  
duration_output = Dense(1, name='duration')(x)

status_output = Dense(2, activation='softmax', kernel_regularizer=regularizers.l2(0.001), name='status')(x)

# Construct the model
model = Model(inputs=input_layer, outputs=[duration_output, status_output])

In [32]:
# Compile the model
model.compile(optimizer='adam',
              loss={'duration': 'mse', 'status': 'categorical_crossentropy'},
              loss_weights={'duration': 1.0, 'status': 2.0},  # Adjust weights as needed
              metrics={'duration': ['mse'], 'status': ['accuracy']})

model.summary()
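
To make the `loss_weights` setting concrete, the sketch below computes a weighted two-head loss by hand in NumPy: mean squared error for duration plus twice the categorical cross-entropy for status. The predictions and targets are made up, and the L2 penalty terms that Keras also adds to the training loss are left out of this illustration.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error over all samples."""
    return float(np.mean((y_true - y_pred) ** 2))


def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=1)))


# Made-up targets and predictions for two samples.
duration_true = np.array([10.0, 20.0])
duration_pred = np.array([12.0, 17.0])
status_true = np.array([[0.0, 1.0], [1.0, 0.0]])
status_prob = np.array([[0.1, 0.9], [0.2, 0.8]])

duration_loss = mse(duration_true, duration_pred)
status_loss = categorical_crossentropy(status_true, status_prob)

# Same weights as in the compile call: 1.0 for duration, 2.0 for status.
total_loss = 1.0 * duration_loss + 2.0 * status_loss

print(f"duration: {duration_loss:.4f}, status: {status_loss:.4f}, total: {total_loss:.4f}")
```

Since MSE in squared days is usually much larger than a cross-entropy value, the duration term can dominate the total even with the status loss doubled.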

In [33]:
# Early stopping with patience 5
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
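
The patience rule can be illustrated in plain Python. This is a simplified sketch of the idea, not the Keras implementation: it ignores options such as `min_delta`, and the function name and loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 1-based, for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Made-up validation losses: the best value is at epoch 3, then no improvement.
val_losses = [26.70, 26.64, 26.63, 26.65, 26.66, 26.64, 26.67, 26.65, 26.70]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=5)

print(f"stopped after epoch {stop_epoch}, best epoch was {best_epoch}")
```

With `restore_best_weights=True`, the model keeps the weights from the best epoch, not those from the epoch at which training stopped.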

In [34]:
# Training the model
history = model.fit(
    x=X_train_scaled,
    y={'duration': y_train_duration, 'status': y_train_status_categorical},
    validation_split=0.2,
    epochs=100,  
    batch_size=64,
    callbacks=[early_stopping]
)

Epoch 1/100
41229/41229 ━━━━━━━━━━━━━━━━━━━━ 32s 763us/step - duration_mse: 30.2538 - loss: 30.4510 - status_accuracy: 0.9836 - val_duration_mse: 26.5427 - val_loss: 26.6964 - val_status_accuracy: 0.9840
Epoch 2/100
41229/41229 ━━━━━━━━━━━━━━━━━━━━ 30s 719us/step - duration_mse: 26.4826 - loss: 26.6377 - status_accuracy: 0.9837 - val_duration_mse: 26.4887 - val_loss: 26.6409 - val_status_accuracy: 0.9840
Epoch 3/100
41229/41229 ━━━━━━━━━━━━━━━━━━━━ 29s 706us/step - duration_mse: 26.5276 - loss: 26.6822 - status_accuracy: 0.9837 - val_duration_mse: 26.4858 - val_loss: 26.6371 - val_status_accuracy: 0.9840
Epoch 4/100
41229/41229 ━━━━━━━━━━━━━━━━━━━━ 28s 680us/step - duration_mse: 26.5331 - loss: 26.6868 - status_accuracy: 0.9838 - val_duration_mse: 26.4941 - val_loss: 26.6458 - val_status_accuracy: 0.9840
Epoch 5/100
41229/41229 ━━━━━━━━━━━

In [35]:
test_results = model.evaluate(
    X_test_scaled,
    {'duration': y_test_duration, 'status': y_test_status_categorical}
)
print(f'Test Loss: {test_results[0]}, Duration MSE: {test_results[1]}, Status Accuracy: {test_results[2]}')

25768/25768 ━━━━━━━━━━━━━━━━━━━━ 10s 372us/step - duration_mse: 26.4285 - loss: 26.5820 - status_accuracy: 0.9836
Test Loss: 26.59313201904297, Duration MSE: 26.440404891967773, Status Accuracy: 0.9836375117301941
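
MSE is expressed in squared days, so its square root is easier to interpret. The short calculation below converts the test-set duration MSE printed above into a root-mean-square error in days.

```python
import math

duration_mse = 26.440404891967773  # test-set value printed above
duration_rmse = math.sqrt(duration_mse)

print(f"Duration RMSE: {duration_rmse:.2f} days")
```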


In [36]:
predictions = model.predict(X_test_scaled)
predicted_duration, predicted_status = predictions
predicted_status_labels = np.argmax(predicted_status, axis=1)

25768/25768 ━━━━━━━━━━━━━━━━━━━━ 10s 382us/step


In [37]:
predicted_labels = np.argmax(predicted_status, axis=1)
true_labels = np.argmax(y_test_status_categorical, axis=1)
accuracy = accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels)
recall = recall_score(true_labels, predicted_labels)
f1 = f1_score(true_labels, predicted_labels)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

Accuracy: 0.9836374926021868
Precision: 0.9836374926021868
Recall: 1.0
F1 Score: 0.9917512612769038
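
Precision equal to accuracy, together with a recall of 1.0, is the pattern that appears when a classifier predicts the positive class for every case. The sketch below reproduces that pattern on made-up labels and adds specificity, the share of class-0 cases identified correctly, which exposes the problem. The helper `confusion_counts` and the 98/2 split are hypothetical.

```python
import numpy as np


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, tn, fp, fn


# Made-up labels: 98 recovered (1), 2 not recovered (0); the classifier always predicts 1.
y_true = np.array([1] * 98 + [0] * 2)
y_pred = np.ones_like(y_true)

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} specificity={specificity:.2f}")
```

Applying `confusion_counts(true_labels, predicted_labels)` to the test predictions would show whether the model ever predicts class 0.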


In [38]:
model.save("/Users/reginaldgonzales/Desktop/GITHUB/DOH_Prediction/model.h5")



[ref2]: #top
[Back to Table of Contents][ref2]

# Conclusion
***
The Feedforward Neural Network (FNN) was trained to predict COVID-19 survival time and recovery status for Filipino patients from "Sex" and "AgeGroup". On the held-out test set it produced the following classification metrics, with Recovered as the positive class:

**Accuracy: 98.4%** - The share of test cases whose status (Recovered vs. Died) was classified correctly.

**Precision: 98.4%** - Of the cases predicted as Recovered, 98.4% were actually Recovered.

**Recall: 100%** - Every actual Recovered case in the test data was predicted as Recovered.

**F1-Score: 99.2%** - The harmonic mean of precision and recall for the Recovered class.

These numbers need careful reading. Precision equal to accuracy, combined with 100% recall, is the pattern produced when every test case is predicted as Recovered, so the accuracy reflects the share of Recovered cases in the data and does not show that deaths can be identified. The duration output reached a test MSE of about 26.4, a root-mean-square error of roughly 5.1 days, on a target that was capped at 30 days and partly imputed from group means. The model is therefore best treated as a baseline framework, not yet as a prognostic tool for healthcare professionals. Further limitations:

- The model is trained on a specific dataset and might generalize differently to other populations.
- The model only considers "Sex" and "AgeGroup," and incorporating additional factors like pre-existing conditions could improve performance.

