# Motivation

In this kernel I use the gradient boosted trees (GBT) classifier from scikit-learn for one purpose only: to rank the features, symptoms, and comorbidities in the data by their importance for predicting whether a patient becomes critical. Such a ranking may highlight which comorbidities are associated with critical and severe outcomes of Covid19. <br> 
This could help with patient triage, with distributing vaccines (as and when they become available) in resource-constrained countries, or with prevention for those more susceptible to the virus. 

# Imports

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Data Cleaning 

In [None]:
data_source_path = "../input/covid19-patient-precondition-dataset/covid.csv"
covid = pd.read_csv( data_source_path, 
                    encoding = "ISO-8859-1", 
                    low_memory = False)

covid.info()

# Keep only rows where every coded column is 1 or 2; values such as 97 and 99 are essentially missing data
coded_cols = ['intubed', 'pneumonia', 'diabetes', 'copd', 'asthma', 'inmsupr',
              'hypertension', 'other_disease', 'cardiovascular', 'obesity',
              'renal_chronic', 'tobacco', 'covid_res', 'icu']
covid = covid.loc[covid[coded_cols].isin([1, 2]).all(axis=1)]

# Encode the 1/2 codes as booleans: True where the value is 1, False otherwise
flag_cols = ['sex', 'intubed', 'pneumonia', 'diabetes', 'copd', 'asthma', 'inmsupr',
             'hypertension', 'other_disease', 'cardiovascular', 'obesity',
             'renal_chronic', 'tobacco', 'covid_res', 'icu']
for col in flag_cols:
    covid[col] = covid[col] == 1

# pregnancy is True only where the value is 1; every other value becomes False ( EVEN FOR UNKNOWN DATA )
covid.pregnancy = covid.pregnancy.apply(lambda x: True if x == 1 else False)
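
The cleaning steps above can be sketched on a tiny made-up frame to see what they do to individual rows. The values below are invented for illustration, and I am assuming from the encoding step that 1 means "yes" and 2 means "no", with 97/98/99 standing for unknown or not applicable.

```python
import pandas as pd

# Toy frame with the same style of coding as the dataset (values are made up):
# 1 = yes, 2 = no (assumed), 97/98/99 = unknown or not applicable.
toy = pd.DataFrame({
    "diabetes": [1, 2, 98, 2],
    "asthma":   [2, 2, 1, 99],
    "icu":      [1, 2, 2, 97],
})
coded_cols = ["diabetes", "asthma", "icu"]

# A row is kept only if every coded column holds a real answer (1 or 2).
valid = toy[coded_cols].isin([1, 2]).all(axis=1)

# Comparing with 1 turns the surviving codes into booleans.
clean = toy.loc[valid, coded_cols] == 1

print(f"kept {len(clean)} of {len(toy)} rows")
print(clean)
```

Printing the number of surviving rows after each filter is a cheap way to see which column is responsible for most of the data loss.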

In [None]:
# keep only the records where the covid test came back positive;
# .copy() makes the later column assignments act on an independent frame
covid = covid.loc[covid.covid_res == True].copy()

# 'critical1' is True when the patient died OR needed the ICU.
# A date_died of '9999-99-99' means no death was recorded.
covid['critical1'] = (covid.date_died != '9999-99-99') | (covid.icu == True)

# 'critical2' is True when 'critical1' is True
# OR when the patient required intubation (intubed is True).
covid['critical2'] = (covid.critical1 == True) | (covid.intubed == True) 

# We will fit one classifier per label to check whether the
# symptom importance changes between the two definitions.

print(covid.shape)

# removing columns not needed  
covid.drop(columns = ['patient_type', 
                      'contact_other_covid', 
                      'covid_res', 
                      'entry_date', 
                      'date_symptoms', 
                      'date_died'], inplace=True)
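
To make the two label definitions concrete, here is a minimal sketch on four invented patients. The non-sentinel date and its format are made up; only the '9999-99-99' sentinel comes from the data.

```python
import pandas as pd

# Four hypothetical patients: survivor, death, ICU only, intubation only.
toy = pd.DataFrame({
    "date_died": ["9999-99-99", "2020-05-01", "9999-99-99", "9999-99-99"],
    "icu":       [False, False, True, False],
    "intubed":   [False, False, False, True],
})

# '9999-99-99' is the sentinel for "no recorded death".
died = toy["date_died"] != "9999-99-99"

# critical1: death OR ICU. critical2: critical1 OR intubation.
toy["critical1"] = died | toy["icu"]
toy["critical2"] = toy["critical1"] | toy["intubed"]

print(toy[["critical1", "critical2"]])
```

By construction every `critical1` patient is also a `critical2` patient, so the second label can only add positives.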

In [None]:
# selecting data for GBT classifier
symptoms = covid[['sex',
                  'pneumonia',
                  'age',
                  'pregnancy',
                  'diabetes',
                  'copd',
                  'asthma',
                  'inmsupr',
                  'hypertension',
                  'other_disease',
                  'cardiovascular', 
                  'obesity',
                  'renal_chronic',
                  'tobacco']]
label1 = covid['critical1']
label2 = covid['critical2']

# Modeling

In [None]:
gbt1 = GradientBoostingClassifier(random_state=0)
gbt2 = GradientBoostingClassifier(random_state=0)
gbt1.fit(symptoms, label1)
gbt2.fit(symptoms, label2)
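
Both classifiers above are fitted on every available row, so nothing in this kernel measures how well they generalise. Below is a minimal sketch of a hold-out check. It runs on synthetic data from `make_classification` rather than on the Covid data, and the 25% test split and the ROC AUC metric are my own assumptions, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (symptoms, label): 14 features, imbalanced classes.
X, y = make_classification(n_samples=500, n_features=14, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)

# Hold out a stratified quarter of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

# Score the held-out rows using the predicted probability of the positive class.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"hold-out ROC AUC: {auc:.3f}")
```

If the hold-out score were close to 0.5, the importances of that model would carry little meaning, which is why a check of this kind may be worth running before reading the ranking.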

In [None]:
# Extracting feature importance, features here being the symptoms
importance1 = gbt1.feature_importances_
importance2 = gbt2.feature_importances_

def get_importance(list_symptoms, importance):
    """Pair each symptom name with its importance score."""
    return list(zip(list_symptoms, importance))

symptom_imp_map1 = get_importance(symptoms.columns, importance1)
symptom_imp_map2 = get_importance(symptoms.columns, importance2)

print('\nFor death or ICU criticality, we have: \n')
sorted(symptom_imp_map1, reverse=True, key=lambda x: x[1])

In [None]:
print('\nFor death, ICU OR intubed patients, we have: \n')
sorted(symptom_imp_map2, reverse=True, key=lambda x: x[1])
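
The same ranking can also be produced as a pandas Series, which is a little easier to read and to plot. This is a sketch on synthetic data; the feature names `f0` to `f5` are hypothetical placeholders for the symptom columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the symptom table, with placeholder column names.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=0)
features = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

model = GradientBoostingClassifier(random_state=0).fit(features, y)

# Index the importances by column name and sort from most to least important.
ranking = (pd.Series(model.feature_importances_, index=features.columns)
             .sort_values(ascending=False))
print(ranking)
```

scikit-learn normalises these impurity-based importances so that they sum to 1, which means each score is a share of the total rather than an effect size.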

<h2> Conclusions </h2> 

For both labels, the ranking of features by importance is similar; the scores differ only by small amounts.

In decreasing order of importance: 

<ul>
    <li> Age - Older people are at a higher risk. Please <a href="https://www.kaggle.com/jeffreybraun/identifying-susceptible-pop-of-covid-19-fatality">refer to this kernel</a> for the exact correlation. </li>
    <li> Pneumonia - Patients diagnosed with pneumonia rank second in criticality. </li>
    <li> Renal chronic - Identifies whether the patient has a diagnosis of chronic kidney failure. </li> 
    <li> Sex or Gender - Males appear more susceptible than females (refer to the same kernel linked above). </li> 
    <li> Diabetes - Identifies whether the patient has a diagnosis of diabetes. </li>
    <li> Hypertension - Identifies whether the patient has a diagnosis of hypertension. </li> 
    <li> Obesity </li>
    <li> Other Disease </li>
    <li> 'inmsupr' - Identifies whether the patient has immunosuppression. </li>
    <li> Cardiovascular disease </li>
    <li> COPD </li> 
    <li> Asthma - This goes against common intuition; perhaps more data is required for further clarification. </li>
    <li> Tobacco - This too may seem counterintuitive, but there is <a href="https://www.who.int/news-room/commentaries/detail/smoking-and-covid-19"> no established study </a> yet that clearly identifies the correlation between smoking/tobacco consumption and Covid19 severity. </li>
    <li> Pregnancy - I would be skeptical about this result too, but <a href="https://www.acog.org/clinical/clinical-guidance/practice-advisory/articles/2020/03/novel-coronavirus-2019"> here's </a> more about pregnancy and Covid19. </li>
   </ul>

<br>

<h2> Important Notes </h2> 
    
These results come from only 67300 records filtered from <a href="https://www.kaggle.com/tanmoyx/covid19-patient-precondition-dataset"> this dataset </a>, which was sampled from Mexico. <br> 
Some "symptoms" like Pregnancy and Asthma could be undersampled in the data. Pregnancy has the added complexity that it depends strongly on Gender; its entropy/gini_index should be calculated within the Gender==Female group rather than over all patients, as has been done here.  
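
One way to approximate the conditioning described above is to fit a second model on the female subset only and compare the importance that pregnancy receives. The sketch below uses fully synthetic data: the column names `is_female` and `critical`, the risk formula, and the choice of 50 trees are all assumptions made for illustration, not properties of the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 2000

# Synthetic patients: pregnancy can only occur among women under 45.
toy = pd.DataFrame({
    "is_female": rng.random(n) < 0.5,
    "age": rng.integers(18, 90, size=n),
})
toy["pregnancy"] = toy["is_female"] & (toy["age"] < 45) & (rng.random(n) < 0.2)

# Invented risk model: risk rises with age, plus a small bump for pregnancy.
risk = 0.05 + 0.004 * (toy["age"] - 18) + 0.10 * toy["pregnancy"]
toy["critical"] = rng.random(n) < risk

def pregnancy_importance(frame, features):
    """Fit a GBT on `features` and return the importance given to pregnancy."""
    model = GradientBoostingClassifier(n_estimators=50, random_state=0)
    model.fit(frame[features], frame["critical"])
    return dict(zip(features, model.feature_importances_))["pregnancy"]

# Importance over all patients versus within the female group only.
all_rows = pregnancy_importance(toy, ["is_female", "age", "pregnancy"])
women_only = pregnancy_importance(toy[toy["is_female"]], ["age", "pregnancy"])
print(f"all patients: {all_rows:.3f}, women only: {women_only:.3f}")
```

In the women-only fit the sex column is dropped because it is constant there, so pregnancy only competes with age for importance.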

I am not a doctor and have no affiliation with any medical body in any country. As with most data science results and models, more records generally lead to more reliable predictions. The data here is low-cardinality (mostly True/False) and low-dimensional (14 symptoms). As such, the GBT model should be doing what it does best; yet the results should not be taken as concrete or robust until data from more sources corroborates them. <br>
Please feel free to let me know if I am missing anything. <br>

Lastly, a big shout-out to <a href="https://www.kaggle.com/tanmoyx"> Tanmay Mukherjee </a> for finding the data and posting it to Kaggle. Most of the initial data-cleaning cells are also borrowed from Tanmay's <a href="https://www.kaggle.com/tanmoyx/covid-19-icu-requirement-prediction"> kernel</a>.
    