# Identification of relevant markers for predicting the patient trajectories

This notebook investigates the differences in blood markers between patients who were hospitalized, admitted to the ICU, or sent home after testing positive for Covid-19. To do this, we stratify the patients into groups and check for statistically significant differences between the groups in various markers via ANOVA.

# Data:

The Notebook uses the Einstein Dataset from the UNCOVER Covid-19 Challenge.

# Results: 

Six markers are identified which vary significantly between the groups and therefore are plausible markers for patient trajectories.

In [None]:
import pandas as pd
import numpy as np
import scipy as sp

data = pd.read_csv("../input/uncover/UNCOVER/einstein/diagnosis-of-covid-19-and-its-clinical-spectrum.csv",
                  header = 0)



Since we are interested in patient trajectories, we first restrict the dataset to the actual patients, meaning those who tested positive for Covid-19:

In [None]:
positives = data[data.sars_cov_2_exam_result.eq("positive")]
positives.shape

Great, we still have 558 patients left, enough to do analysis on.
Now the dataset has to be cleaned. Since we expect a lot of missing values, let's look at the percentage of missing values in the dataset:

In [None]:
missing_pos = pd.DataFrame(positives.isnull().mean() * 100)
pd.set_option('display.max_rows',len(missing_pos))
missing_pos

As we can see, most of the columns consist largely of missing values. Some, such as "d_dimer" and "albumin", contain no values at all. Let's discard all columns with more than 95% missing values.

In [None]:
for rowname, missingness in zip(missing_pos.index, missing_pos.values):
    if missingness[0] > 95:
        # reassign instead of dropping in place, since positives is a slice of data
        positives = positives.drop(columns=rowname)
        print("Dropped " + rowname)

As we can see, we just lost a lot of columns. But hey, they were (almost) entirely empty, anyway! Let's see how much is left:

In [None]:
positives.shape

558 patients with 53 columns each. This does not mean that every patient has a value in each of the columns, but we can look at this subset as the actual heart of the dataset.
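
The column filter can also be written without an explicit loop. The following is a minimal sketch on a toy frame with made-up values; the names `toy`, `missing_pct`, `threshold` and `kept` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): column "b" is entirely missing,
# column "c" is half missing, column "a" is complete.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, np.nan],
    "c": [1.0, np.nan, 3.0, np.nan],
})

# Percentage of missing values per column.
missing_pct = toy.isnull().mean() * 100

# Keep only columns at or below the threshold; selecting with a boolean
# mask returns a new frame instead of modifying a slice in place.
threshold = 95
kept = toy.loc[:, missing_pct <= threshold]

print(list(kept.columns))
```

Selecting the columns to keep, rather than dropping them one by one, yields the same subset in a single step.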

Now, there are three columns that indicate if a patient got admitted

1. to the regular floor,
2. to a semi-intensive care unit, or
3. to the intensive care unit. 

We assume that the patients who do not match any of these criteria have been

4. sent home. 

So, let's split up our dataset into these 4 groups:

In [None]:
regular = positives[positives.patient_addmited_to_regular_ward_1_yes_0_no.eq("t")]
semi = positives[positives.patient_addmited_to_semi_intensive_unit_1_yes_0_no.eq("t")]
intensive = positives[positives.patient_addmited_to_intensive_care_unit_1_yes_0_no.eq("t")]

home = positives.drop(list(regular.index) + list(semi.index) + list(intensive.index), axis=0, inplace=False)

print(len(regular), len(semi), len(intensive), len(home))

So, of our Covid-19-positive patients, 36 were admitted to the regular ward, 8 to semi-intensive care, another 8 to intensive care, and 506 were sent home.
This is definitely unbalanced, but it is the best we can do. Additionally, this doesn't mean that the homegoers contribute far more values, since this group seems to have much sparser information:

In [None]:
pd.concat([
           pd.DataFrame(home.isnull().mean() * 100).rename(columns={0:"Home"}),
           pd.DataFrame(regular.isnull().mean() * 100).rename(columns={0:"Regular"}),
           pd.DataFrame(semi.isnull().mean() * 100).rename(columns={0:"Semi"}),
           pd.DataFrame(intensive.isnull().mean() * 100).rename(columns={0:"Intensive"}),
          ],
           axis=1)

As you can see, the missingness is by far the highest among the homegoers. That is not a problem, since this group makes up for it in numbers.
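
Both the four-way split and the per-group missingness can also be derived from a single label column. The sketch below uses a toy frame; the short column names and the `stratum` column are hypothetical stand-ins, and it assumes the admission flags are stored as "t"/"f" as in the cells above. Note one difference: a patient flagged in more than one admission column would receive only the first matching label here.

```python
import numpy as np
import pandas as pd

# Toy frame; the short column names are hypothetical stand-ins for the
# three admission columns of the dataset.
toy = pd.DataFrame({
    "regular":   ["t", "f", "f", "f", "f"],
    "semi":      ["f", "t", "f", "f", "f"],
    "intensive": ["f", "f", "t", "f", "f"],
    "marker":    [1.2, np.nan, 0.4, np.nan, np.nan],
})

# np.select picks the first matching condition per row; rows matching
# none of them fall back to the default label "Home".
conditions = [toy["regular"].eq("t"), toy["semi"].eq("t"), toy["intensive"].eq("t")]
labels = ["Regular", "Semi", "Intensive"]
toy["stratum"] = np.select(conditions, labels, default="Home")

# Group sizes and per-group missingness (in percent).
sizes = toy["stratum"].value_counts()
missing_by_group = toy.groupby("stratum")["marker"].apply(
    lambda s: s.isnull().mean() * 100
)

print(sizes)
print(missing_by_group)
```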

Now, let's see if we can find differences between the 4 groups, or "strata", in some of these parameters. First, let's set up some helpers:

In [None]:
def select_columns(column):
    # returns a list of arrays, one per stratum, with the non-missing values of the specified column
    return [np.array(stratum[column].dropna()) for stratum in [home,regular,semi,intensive]]

from scipy.stats import f_oneway

def analyze(df,blacklist):
    # run a one-way ANOVA between the four groups for each column and return the results
    res = {}
    for column in df.columns:
        if column not in blacklist:
            print(column + ":")
            try:
                f, p = f_oneway(*select_columns(column))
                print("p-Value: " + str(p))
                res.update({column : (f,p)})
            except ValueError as e:
                print(e)
            
    return res

# we are not interested in the following columns:

blacklist= ['patient_id', 'patient_age_quantile', 'sars_cov_2_exam_result',
       'patient_addmited_to_regular_ward_1_yes_0_no',
       'patient_addmited_to_semi_intensive_unit_1_yes_0_no',
       'patient_addmited_to_intensive_care_unit_1_yes_0_no']
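
To illustrate what `f_oneway` returns for a single column, here is a small sketch on made-up numbers. The group names and values are hypothetical; only the call itself mirrors the helper above.

```python
from scipy.stats import f_oneway

# Made-up marker values for three hypothetical groups.
group_a = [4.1, 3.9, 4.0, 4.2]
group_b = [4.0, 4.2, 3.8, 4.3]
group_c = [6.0, 6.2, 5.9, 6.1]  # clearly shifted mean

# Each group is passed as a separate positional argument, which is why
# analyze() unpacks the list returned by select_columns() with *.
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)

# Two groups with nearly identical means, for comparison.
f_same, p_same = f_oneway(group_a, group_b)
print(f_same, p_same)
```

A small p-value only says that at least one group mean differs from the others; it does not say which one.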

In [None]:
analysis = analyze(positives,blacklist)

As we can see, there are many low p-values. There are also some errors for the categorical columns, but we will handle them later on. Let's continue with our continuous variables. Since we have run many tests, some of them might be significant just by chance. To account for that, we adjust the p-values for multiple testing by controlling the False Discovery Rate:

In [None]:
def fdr(p_vals):
    # scale each p-value by (number of tests / rank) and cap the result at 1
    from scipy.stats import rankdata
    ranked_p_values = rankdata(p_vals)
    adjusted = p_vals * len(p_vals) / ranked_p_values
    adjusted[adjusted > 1] = 1

    return adjusted
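
The `fdr` helper above scales each p-value by the number of tests divided by its rank. The Benjamini-Hochberg procedure is usually stated with one additional step: adjusted values must not decrease with rank, so each one is replaced by the minimum over itself and all higher ranks. The following is a minimal sketch of that variant; the name `bh_adjust` is hypothetical, the example p-values are made up, and tied p-values are handled by sort order rather than by averaged ranks.

```python
import numpy as np

def bh_adjust(p_vals):
    """Benjamini-Hochberg adjusted p-values including the monotonicity step."""
    p_vals = np.asarray(p_vals, dtype=float)
    m = len(p_vals)
    order = np.argsort(p_vals)  # indices that sort the p-values ascending
    scaled = p_vals[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity: each adjusted value becomes the minimum of
    # itself and all adjusted values at higher ranks.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0, 1)  # restore the input order
    return adjusted

example = bh_adjust([0.01, 0.04, 0.03, 0.5])
print(example)
```

In this example the rank-scaled value for 0.03 would be 0.06, while the monotone version lowers it to about 0.053, the adjusted value of the next-ranked test. The two variants can therefore disagree on how many tests pass a given threshold.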

In [None]:
p_values = [x[1] for x in list(analysis.values())]
p_values= np.array(p_values)

fdr(p_values)

Even after correction, there are still some significant p-values! Let's see how many variables differ significantly between at least two of our groups:

In [None]:
adj_p_values = fdr(p_values)

print(np.sum(adj_p_values < 0.05))

significant_columns = [column for column, p in zip(analysis.keys(), adj_p_values) if p < 0.05]

print(significant_columns)

Six columns differ significantly between the groups. It doesn't surprise us that most of these columns are related to the immune system, which suggests that we are on the right track.

Next, let's take a deeper look at each of these six columns and how exactly they differ between the groups:


In [None]:
def column_generator(columns_of_interest):
    
    for column in columns_of_interest:
        
        # keep the groups in a plain list: they differ in length, so np.array would be ragged
        values = [stratum[column].dropna() for stratum in [home,regular,semi,intensive]]
        treatments = np.repeat(["Home","Regular","Semi","Intensive"], repeats= [len(x) for x in values])
         
        values = np.hstack(values)
        # Stack the data (and rename columns):

        value_df = pd.DataFrame(values.T,columns=["Values"])
        treatments_df = pd.DataFrame(treatments.T,columns=["Treatments"])

        stacked_data = pd.concat([treatments_df,value_df],axis=1)
        stacked_data.name = column
        
        yield stacked_data
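
The same long format (one row per value, labelled with its group) can also be built with pandas alone. This is a sketch on made-up values; the `groups` dictionary is a hypothetical stand-in for the per-stratum columns used in the generator above.

```python
import pandas as pd

# Hypothetical per-group values for one marker; in the notebook these
# would come from the home, regular, semi and intensive frames.
groups = {
    "Home": pd.Series([0.1, 0.3, 0.2]),
    "Regular": pd.Series([0.8, 0.9]),
    "Semi": pd.Series([1.1]),
}

# pd.concat with a dict uses the keys as an outer index level, so groups
# of different lengths stack without building an array first.
stacked = (
    pd.concat(groups, names=["Treatments", "row"])
      .rename("Values")
      .reset_index(level="Treatments")
      .reset_index(drop=True)
)
print(stacked)
```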


In [None]:
from statsmodels.stats.multicomp import (pairwise_tukeyhsd,
                                         MultiComparison)

In [None]:
# Set up the data for comparison (creates a specialised object)
for stacked_data in column_generator(significant_columns):
    MultiComp = MultiComparison(stacked_data['Values'],
                                stacked_data['Treatments'])

    # Show all pair-wise comparisons:
    
    # Print the comparisons
    print("Variable: " + stacked_data.name)
    print(MultiComp.tukeyhsd(alpha=0.05/len(significant_columns)).summary())

In [None]:
# interim result
pd.DataFrame(significant_columns).to_csv("submission.csv")