# Task 1 - Identify if a patient is SARS-CoV-2 positive

Albert Einstein Hospital in São Paulo, Brazil, has provided a dataset of its patients containing the results of a number of exams and tests, together with the SARS-CoV-2 exam result.
The idea of this notebook is to provide a model that identifies whether a patient is SARS-CoV-2 positive based on the other exam results.
The greatest challenges here are:

1) Handling missing values: the set of exams performed differs from patient to patient, so many values are missing;

2) Obtaining good results with unbalanced classes (there are many more negative results than positive ones).

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn

from imblearn.under_sampling import RandomUnderSampler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.neural_network import MLPClassifier

In [1]:
df = pd.read_excel('/kaggle/input/covid19/dataset.xlsx')

## 1. Feature Selection

### 1.1 Selection of variables with more than 10% non-NaN values

In [1]:
df_freq = df.count()
variables = df_freq.loc[df_freq>0.1*df.shape[0]].index
df = df[variables]
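
To make the filter above concrete, here is a minimal sketch on made-up data. The column names `exam_full`, `exam_sparse` and `exam_half` are hypothetical and do not exist in the hospital dataset; only the counting logic mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 10 patients and three made-up exam columns
toy = pd.DataFrame({
    "exam_full": np.arange(10, dtype=float),                  # 10 of 10 values present
    "exam_sparse": [1.0] + [np.nan] * 9,                      # 1 of 10 values present
    "exam_half": [1.0, 2.0, 3.0, 4.0, 5.0] + [np.nan] * 5,    # 5 of 10 values present
})

non_null = toy.count()              # number of non-NaN values per column
threshold = 0.1 * toy.shape[0]      # 10% of the number of rows
kept = non_null.loc[non_null > threshold].index.tolist()

print(non_null / toy.shape[0])      # fraction of filled values per column
print(kept)
```

Because the comparison is strict (`>`), a column that is filled in exactly 10% of the rows, such as `exam_sparse` here, is dropped.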

### 1.2. Selection of continuous variables
The correlation of each continuous variable with the target (the Covid exam result) is now checked.
Only the variables with |c| >= 0.1 are selected. This is not a high correlation, but it is enough to discard variables that show practically no linear relationship with the target.

In [1]:
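df2 = df.copy()
df2.drop(['Patient addmited to regular ward (1=yes, 0=no)','Patient addmited to semi-intensive unit (1=yes, 0=no)','Patient addmited to intensive care unit (1=yes, 0=no)'],axis=1,inplace=True)
df2['SARS-Cov-2 exam result'] = df2['SARS-Cov-2 exam result'].map(lambda x: 1 if x == 'positive' else 0)
# Correlation of every numeric column with the target (non-numeric columns are skipped)
corrs = df2.select_dtypes(include='number').corr()['SARS-Cov-2 exam result']
# Discarded: |c| < 0.1, plus the target itself so that it is never picked up as a feature
cont_out = list(corrs.loc[corrs.abs() < 0.1].index) + ['SARS-Cov-2 exam result']
# Kept: |c| >= 0.1 (the target, with c = 1, is removed from this list in a later cell)
cont_vars = list(corrs.loc[corrs.abs() >= 0.1].index)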
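
As an illustration of the |c| >= 0.1 rule, the following sketch applies it to a small made-up table. The columns `exam_pos`, `exam_neg` and `exam_flat` are hypothetical; the point is only that positively and negatively correlated variables are both kept, while an uncorrelated one is discarded.

```python
import pandas as pd

# Hypothetical toy data: a binary target and three made-up exam columns
toy = pd.DataFrame({
    "target": [0, 0, 0, 0, 1, 1, 1, 1],
    "exam_pos": [1.0, 2.0, 1.5, 2.5, 3.0, 4.0, 3.5, 4.5],    # rises with the target
    "exam_neg": [4.0, 3.5, 4.5, 3.0, 2.0, 1.0, 2.5, 1.5],    # falls with the target
    "exam_flat": [1.0, 2.0, 2.0, 1.0, 1.0, 2.0, 2.0, 1.0],   # same values in both classes
})

# Pearson correlation of each column with the target, target itself excluded
corrs_toy = toy.corr()["target"].drop("target")

selected = corrs_toy.loc[corrs_toy.abs() >= 0.1].index.tolist()
discarded = corrs_toy.loc[corrs_toy.abs() < 0.1].index.tolist()

print(corrs_toy)
print(selected, discarded)
```

Taking the absolute value keeps the two lists complementary: every column with a defined correlation lands in exactly one of them.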

Correlations (after feature selection):

In [1]:
corrs[cont_vars]

In [1]:
target = 'SARS-Cov-2 exam result'
admission_cols = ['Patient addmited to regular ward (1=yes, 0=no)',
                  'Patient addmited to semi-intensive unit (1=yes, 0=no)',
                  'Patient addmited to intensive care unit (1=yes, 0=no)']

# Continuous features: the correlated numeric columns, without the target itself
cont_vars = list(set(cont_vars) - set([target] + admission_cols))
# Every remaining column that was not discarded and is not continuous is treated as categorical
variables = [var for var in variables if var not in cont_out]
cat_vars = list(set(variables) - set(cont_vars) - set(['Patient ID', target] + admission_cols))
variables = cont_vars + cat_vars
yCol = [target]
df = df[variables + yCol]

Set of continuous variables:

In [1]:
cont_vars

Categorical variables (no feature selection is performed on them):

In [1]:
cat_vars

## 2. Data preparation (handling missing values)

I do not infer anything about missing data, so data preparation follows these steps:

1) For the continuous variables, any row with missing data is deleted

2) For the categorical variables, missing data is filled with "not_tested"

3) One-hot encoding of the categorical data

4) The not_tested dummy columns are deleted

So, a missing value in a categorical variable is represented as a row of zeros in its one-hot encoding.
Example:

| Influenza_A_detected | Influenza_A_not_detected |
| --- | --- |
| 0 | 0 |
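
The four steps can be sketched on a single made-up column. The toy frame and the column name `Influenza_A` are assumptions for illustration; the real notebook applies the same idea to every categorical exam at once.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one untested patient (NaN)
toy = pd.DataFrame({"Influenza_A": ["detected", "not_detected", np.nan]})

# Step 2: give untested exams their own level
filled = toy.fillna("not_tested")
# Step 3: one-hot encode (dtype=int gives 0/1 instead of booleans)
dummies = pd.get_dummies(filled, dtype=int)
# Step 4: drop the "not_tested" dummy, so an untested exam becomes a row of zeros
dummies = dummies[[c for c in dummies.columns if "not_tested" not in c]]

print(dummies)
```

The third row of the result contains only zeros, which is how an untested patient is represented.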

In [1]:
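# Drop rows where every continuous variable is missing (copy so that later assignments do not touch a view)
df = df.loc[~df[cont_vars].isna().all(axis=1)].copy()
# Untested categorical exams get their own level, which is removed after encoding
df[cat_vars] = df[cat_vars].replace(np.nan,'not_tested')
df_dummies = pd.get_dummies(df[cat_vars])

# Keep every dummy column except the "not_tested" ones
cols = [col for col in df_dummies.columns if "not_tested" not in col]
df_dummies = df_dummies[cols]

df = pd.concat([df,df_dummies],axis=1)
df.drop(columns=cat_vars,inplace=True)
variables = list(set(df.columns)-set(yCol))

# Drop any remaining row that still has a missing continuous value
df.dropna(inplace=True)
df.shape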

The dataframe has 598 entries, which is a good number to work with!

## 3. Train and test

First, the data is normalized, followed by random undersampling to balance the classes (I do not want to oversample and then evaluate the model on inferred data). After that, a train-test split (20% for testing) is performed.

An ensemble of 5 MLP classifiers is used, each with 2 hidden layers and a different number of neurons:
- MLP1: 2 hidden layers with 28 neurons each
- MLP2: 2 hidden layers with 32 neurons each
- MLP3: 2 hidden layers with 36 neurons each
- MLP4: 2 hidden layers with 40 neurons each
- MLP5: 2 hidden layers with 44 neurons each

The predicted class is the majority vote over the labels predicted by the five networks.
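
As a minimal sketch of majority voting, assume five classifiers have each produced 0/1 labels for four patients. The `votes` array below is made up; with binary labels and an odd number of voters there is never a tie.

```python
import numpy as np

# Hypothetical predictions: one row per classifier, one column per patient
votes = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
])

# A patient is labelled 1 when more than half of the classifiers voted 1
majority = (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)

print(majority)
```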

In [1]:
scaler = MinMaxScaler(copy=False, feature_range=(0, 1))
df[variables] = scaler.fit_transform(df[variables])

# Encode the target as integers: negative -> 0, positive -> 1
df[yCol[0]] = df[yCol[0]].map({'negative': 0, 'positive': 1})

x = df[variables]
y = df[yCol]
rus = RandomUnderSampler(random_state=42)
x_rus, y_rus = rus.fit_resample(x, y)

x_train, x_test, y_train, y_test = train_test_split(x_rus, y_rus, test_size=0.20, random_state=1)
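
The idea behind random undersampling can be sketched with pandas alone. This is not the `imblearn` implementation, only an illustration of the principle on made-up data: every class is reduced to the size of the smallest one by sampling without replacement.

```python
import pandas as pd

# Hypothetical toy data: 8 negatives and 2 positives
toy = pd.DataFrame({
    "feature": range(10),
    "label": [0] * 8 + [1] * 2,
})

# Size of the smallest class
n_minority = toy["label"].value_counts().min()

# Sample that many rows from each class and stack the samples
balanced = pd.concat(
    [group.sample(n=n_minority, random_state=42) for _, group in toy.groupby("label")]
)

print(balanced["label"].value_counts())
```

All minority rows survive, while most of the majority rows are discarded, which is the price paid for not evaluating on synthetic samples.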

In [1]:
# hidden_layer_sizes lists the width of each hidden layer, so (n, n) means
# 2 hidden layers with n neurons each
preds = []
for n_neurons in (28, 32, 36, 40, 44):
    clf = MLPClassifier(solver='lbfgs', activation='logistic', alpha=1e-3,
                        hidden_layer_sizes=(n_neurons, n_neurons), random_state=1)
    # np.ravel passes the labels as the 1-D array that scikit-learn expects
    clf.fit(x_train, np.ravel(y_train))
    preds.append(clf.predict(x_test))

y_pred_nn1, y_pred_nn2, y_pred_nn3, y_pred_nn4, y_pred_nn5 = preds

In [1]:
df_ensemble = pd.DataFrame(columns=['class_nn1','class_nn2','class_nn3','class_nn4','class_nn5','y_true'])
df_ensemble['class_nn1'] = y_pred_nn1
df_ensemble['class_nn2'] = y_pred_nn2
df_ensemble['class_nn3'] = y_pred_nn3
df_ensemble['class_nn4'] = y_pred_nn4
df_ensemble['class_nn5'] = y_pred_nn5
y_test.reset_index(inplace=True,drop=True)
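df_ensemble['y_true'] = np.ravel(y_test)
# Majority vote: with 5 binary voters every row has exactly one mode, stored in column 0
df_ensemble['y_pred'] = df_ensemble[['class_nn1','class_nn2','class_nn3','class_nn4','class_nn5']].mode(axis=1)[0].astype(int)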

### Confusion Matrix

In [1]:
cm = confusion_matrix(df_ensemble['y_true'],df_ensemble['y_pred'])

df_cm = pd.DataFrame(cm, index = ['negative','positive'],
                  columns = ['negative','positive'])
sn.set(font_scale=1.4) # for label size
sn.heatmap(df_cm, annot=True, annot_kws={"size": 16}) # font size

### Accuracy

In [1]:
accuracy_score(df_ensemble['y_true'],df_ensemble['y_pred'])

### Precision

In [1]:
precision_score(df_ensemble['y_true'],df_ensemble['y_pred'])

### Recall

In [1]:
recall_score(df_ensemble['y_true'],df_ensemble['y_pred'])

### F1 Score

In [1]:
f1_score(df_ensemble['y_true'],df_ensemble['y_pred'], average='macro')
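
For reference, the four metrics can be computed by hand from the confusion-matrix counts. The counts below are made up and are not results from this notebook; the sketch only shows how the numbers relate to each other.

```python
# Hypothetical confusion-matrix counts (NOT results from this notebook)
tn, fp, fn, tp = 50, 10, 5, 35

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # of the predicted positives, how many are right
recall = tp / (tp + fn)             # of the actual positives, how many are found
f1_positive = 2 * precision * recall / (precision + recall)

# Macro F1 averages the F1 of both classes, so the negative class is needed too
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
f1_negative = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)
f1_macro = (f1_positive + f1_negative) / 2

print(accuracy, precision, recall, f1_macro)
```

Note that the cells above compute precision and recall for the positive class only (the scikit-learn default for binary labels), while the F1 cell uses `average='macro'`, so the F1 value is averaged over both classes.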

## 4. Conclusions

The model presents very interesting results regarding accuracy, precision, recall and F1, which suggests that the model, together with the undersampling, handled the class imbalance well.

I hope this could be helpful somehow!