# COVID-19 severe (hospitalized) cases in Brazil

# Univariate logistic regression

Publicly available data on hospitalized cases in Brazil are used to perform a retrospective cross-sectional observational study.

This means that we infer the parameters driving the outcome from data gathered from hospitals.
The data were downloaded, selected and preprocessed before the steps described in this notebook.

The aim is to compare different variants and infer which parameters drive severity and outcome.<br>
I proceed as follows:
<ol>
    <li> Compute some new features from the data (time between some dates, etc.) and explore the data</li>
    <li> Perform univariate logistic regression
            <ol><li><strong>Perform regression (presented in this notebook)</strong></li><li>Show results</li></ol>
    </li>
    <li> Based on the results of the univariate regression, move to a multivariate logistic regression with a stepwise approach.</li>
</ol>
    
| Features in the data | Features added | Primary Outcome (Label) | Secondary Outcome |
|-|-|-|-|
| age | age group | death/cured | ICU admission |
| sex | number of comorbidities | | ventilation (invasive/noninvasive) |
| ethnicity | number of vaccine doses received | | |
| Federative unit (i.e., state) | delay between last dose of vaccine (if >=2) and 1st symptoms | | |
| symptoms | length of hospital stay | | |
| comorbidities | delay between 1st symptoms and hospitalization | | |
| pregnancy status | | | |

This analysis is applied to four periods during which four different variants were dominant (>=80% of the samples analyzed corresponded to the variant of interest; source: GISAID database):
- Delta
- Omicron BA.1
- Omicron BA.2
- Omicron BA.4/BA.5

> **Data source**: all the data has been taken from the Brazilian Ministry of Health https://opendatasus.saude.gov.br/organization/ministerio-da-saude
>
>The data require translation from Portuguese to English.

## Python modules used in this notebook

In [1]:
import pandas as pd
import scipy.stats as scst
import sys
import numpy as np
import matplotlib.pyplot as plt
import os

## Parameter definitions

- variant time periods and names
- age classes
- comorbidity list in English
- ethnicity list in English

In [2]:
variants_period = [['2021-09-12','2021-12-19'],['2022-01-03','2022-03-20'],['2022-04-11','2022-05-29'],
                   ['2022-07-18','2022-10-02']]
variants_name = ['Delta','BA.1.X','BA.2.X','BA.4/5.X']
file_name = ['Delta','BA1X','BA2X','BA45X']
#age_class = [[0,4],[5,9],[10,14],[15,19],[20,29],[30,39],[40,49],[50,59],[60,64],[65,69],[70,74],[75,79],[80]]
age_range = [[0,4],[5,14],[15,24],[25,44],[45,54],[55,64],[65]]
age_code = [0,1,2,3,4,5,6]
gender = ['Female','Male']
comorb_list = ['cardiovascular_disease','hematologic_disease','down_syndrom','liver_disease','asthma','diabetes',
          'neurological_disease','chronic_lung_disease','weaken_immune_system','renal_disease','obesity','puerperal',
               'other_comorbidities']
race_list = ['Indigenous','Brown','Asian','Black','White','Unknown']
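
As an illustration of how these parameters can be applied, the sketch below maps an age to its `age_code` and a date to the variant period it falls in. The helper names `assign_age_group` and `assign_variant`, the toy columns `age` and `onset_date`, and the choice of symptom onset as the reference date are assumptions made for this example; the actual assignment was done during preprocessing, outside this notebook.

```python
import pandas as pd

# Copied from the parameter cell above.
variants_period = [['2021-09-12', '2021-12-19'], ['2022-01-03', '2022-03-20'],
                   ['2022-04-11', '2022-05-29'], ['2022-07-18', '2022-10-02']]
variants_name = ['Delta', 'BA.1.X', 'BA.2.X', 'BA.4/5.X']
age_range = [[0, 4], [5, 14], [15, 24], [25, 44], [45, 54], [55, 64], [65]]
age_code = [0, 1, 2, 3, 4, 5, 6]


def assign_age_group(age):
    """Map ages in years to the integer codes in age_code (hypothetical helper)."""
    # Lower bound of every class, plus +inf to close the open-ended last class.
    edges = [bounds[0] for bounds in age_range] + [float('inf')]
    return pd.cut(age, bins=edges, labels=age_code, right=False)


def assign_variant(dates):
    """Map a Series of dates to the dominant variant (hypothetical helper).

    Dates outside every period are left as None.
    """
    dates = pd.to_datetime(dates)
    variant = pd.Series([None] * len(dates), index=dates.index, dtype=object)
    for (start, end), name in zip(variants_period, variants_name):
        in_period = (dates >= pd.Timestamp(start)) & (dates <= pd.Timestamp(end))
        variant[in_period] = name
    return variant


# Toy data; the column names 'age' and 'onset_date' are assumptions.
toy = pd.DataFrame({'age': [3, 30, 70],
                    'onset_date': ['2021-10-01', '2022-02-01', '2022-06-15']})
toy['age_group'] = assign_age_group(toy['age'])
toy['variant'] = assign_variant(toy['onset_date'])
print(toy)
```

The third toy row falls between the BA.2 and BA.4/BA.5 periods, so it gets no variant and would be excluded from a per-variant analysis.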

## Univariate logistic regression

- Load data

In [3]:
df_encoded = pd.read_parquet('encoded_severe_cases_data.pq')
df_analysis = pd.read_parquet('analysis_severe_cases_data.pq')
df_analysis.outcome = df_analysis.outcome.replace('cured',0).replace('death',1)

### Univariate logistic regression per variant: primary outcome

- import modules

In [4]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
from numpy.linalg import LinAlgError

- Regression results

    - Split the parameters into those with a reference level (categorical) and those without (continuous)

In [5]:
univariate_parameters_withref = ['state','age_group','sex','ethnicity','nb_comorbidities','nb_vaccine_dose','icuadm',
                                 'ventilation','ventilation_outoficu','invasive_ventilation','delay_vaccine']
univariate_parameters = ['length_stay','length_delay']

- Define reference for each categorical parameter

| Features | Reference |
|-|-|
| State | Distrito Federal (DF) |
| Age group | Age group 3 (25-44 y) |
| Ethnicity | White |
| Number of comorbidities | 0 |
| Ventilation | No ventilation |
| Ventilation out of ICU | No ventilation |
| Invasive ventilation | In ICU |
| ICU admission | No |
| Delay between last vaccine dose and first symptom | <90 days |
| Comorbidities | No comorbidity |
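
For reference, the quantities reported by the regression cells in this notebook can be written as follows. This is the standard formulation of a logistic regression with one categorical predictor $x$ and a reference level; the notation ($\hat\beta_k$, $\mathrm{SE}$) is introduced here and is not taken from the original analysis. The factor 1.96 corresponds to the 95% level that `conf_int()` uses by default.

```latex
\operatorname{logit} P(Y = 1 \mid x) = \beta_0 + \sum_{k \neq \mathrm{ref}} \beta_k \, \mathbb{1}[x = k],
\qquad
\widehat{\mathrm{OR}}_k = e^{\hat\beta_k},
\qquad
\mathrm{CI}_{95\%}\big(\mathrm{OR}_k\big) =
\left[\, e^{\hat\beta_k - 1.96\,\mathrm{SE}(\hat\beta_k)},\;
        e^{\hat\beta_k + 1.96\,\mathrm{SE}(\hat\beta_k)} \,\right]
```

Each odds ratio therefore compares the odds of the outcome at level $k$ with the odds at the reference level listed in the table; an odds ratio above 1 means higher odds than the reference.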

In [6]:
df_analysis['state_releveled'] = pd.Series(pd.Categorical(df_analysis.state,categories=['DF','SP','CE','RS','PR','SC','MG',
                                                                                        'GO','BA','RJ','MS','PA','PE','PB',
                                                                                        'ES','PI','MT','TO','SE','AC','AL',
                                                                                        'MA','RO','AP','RR','AM','RN']))
df_analysis['age_group_releveled'] = pd.Series(pd.Categorical(df_analysis.age_group,categories=[3,0,1,2,4,5,6]))
df_analysis['sex_releveled'] = pd.Series(pd.Categorical(df_analysis.sex,categories=['Female','Male']))
df_analysis['ethnicity_releveled'] = pd.Series(pd.Categorical(df_analysis.ethnicity,categories=['white','asian','brown',
                                                                                                'black','indigenous']))
df_analysis['nb_comorbidities_releveled'] = pd.Series(pd.Categorical(df_analysis.nb_comorbidities,categories=[0,1,2,3,4]))
df_analysis['nb_vaccine_dose_releveled']  = pd.Series(pd.Categorical(df_analysis.nb_vaccine_dose,categories=[0,1,2,3]))
df_analysis['ventilation'] = np.where((df_analysis.ventilation_invasive==0) & (df_analysis.ventilation_noninvasive==0),'No',None)
df_analysis.ventilation = np.where((df_analysis.ventilation_invasive==1) & (df_analysis.ventilation_noninvasive==0),'Invasive',df_analysis.ventilation)
df_analysis.ventilation = np.where((df_analysis.ventilation_invasive==0) & (df_analysis.ventilation_noninvasive==1),'Noninvasive',df_analysis.ventilation)
df_analysis['ventilation_releveled']  = pd.Series(pd.Categorical(df_analysis.ventilation,categories=['No','Invasive','Noninvasive']))
df_analysis['ventilation_outoficu'] = np.where((df_analysis.ventilation_invasive==0) & (df_analysis.ventilation_noninvasive==0) & (df_analysis.icu_adm==0),'No',None)
df_analysis.ventilation_outoficu = np.where((df_analysis.ventilation_invasive==1) & (df_analysis.ventilation_noninvasive==0) & (df_analysis.icu_adm==0),'Invasive',df_analysis.ventilation_outoficu)
df_analysis.ventilation_outoficu = np.where((df_analysis.ventilation_invasive==0) & (df_analysis.ventilation_noninvasive==1) & (df_analysis.icu_adm==0),'Noninvasive',df_analysis.ventilation_outoficu)
df_analysis['ventilation_outoficu_releveled']  = pd.Series(pd.Categorical(df_analysis.ventilation_outoficu,categories=['No','Invasive','Noninvasive']))
df_analysis['invasive_ventilation'] = np.where((df_analysis.ventilation_invasive==1) & (df_analysis.icu_adm==1),'In ICU',None)
df_analysis.invasive_ventilation = np.where((df_analysis.ventilation_invasive==1) & (df_analysis.icu_adm==0),'Out of ICU',df_analysis.invasive_ventilation)
df_analysis['invasive_ventilation_releveled']  = pd.Series(pd.Categorical(df_analysis.invasive_ventilation,categories=['In ICU','Out of ICU']))
df_analysis['icuadm'] = np.where(df_analysis.icu_adm==1,'Yes',None)
df_analysis.icuadm = np.where(df_analysis.icu_adm==0,'No',df_analysis.icuadm)
df_analysis['icuadm_releveled']  = pd.Series(pd.Categorical(df_analysis.icuadm,categories=['No','Yes']))
df_analysis['delay_vaccine'] = np.where(df_analysis.delay_lastdose_onset<90,'<90',None)
df_analysis.delay_vaccine = np.where(df_analysis.delay_lastdose_onset>=90,'>=90',df_analysis.delay_vaccine)
df_analysis['delay_vaccine_releveled'] = pd.Series(pd.Categorical(df_analysis.delay_vaccine,categories=['<90','>=90']))
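
The chained `np.where` calls above build each multi-level variable one label at a time. Below is a minimal sketch of the same logic for `ventilation`, on toy data, to make explicit which rows end up missing: rows that match none of the three conditions (both flags set, or a flag missing) receive no label. The helper name `three_level_ventilation` is hypothetical. It builds the result on the data frame's own index, because assigning a `pd.Series` to a column aligns on index labels rather than on position.

```python
import numpy as np
import pandas as pd


def three_level_ventilation(df):
    """Return a categorical with 'No' as first (reference) level (hypothetical helper)."""
    labels = ['No', 'Invasive', 'Noninvasive']
    conditions = [
        (df.ventilation_invasive == 0) & (df.ventilation_noninvasive == 0),
        (df.ventilation_invasive == 1) & (df.ventilation_noninvasive == 0),
        (df.ventilation_invasive == 0) & (df.ventilation_noninvasive == 1),
    ]
    # Start from an all-missing column and fill in one label per condition.
    out = pd.Series(np.nan, index=df.index, dtype=object)
    for condition, label in zip(conditions, labels):
        out[condition] = label
    # The order of `categories` fixes the reference level: the first one.
    return pd.Series(pd.Categorical(out, categories=labels), index=df.index)


# Toy data: the last two rows match none of the three conditions.
toy = pd.DataFrame({'ventilation_invasive':    [0, 1, 0, 1, np.nan],
                    'ventilation_noninvasive': [0, 0, 1, 1, 0]})
toy['ventilation_releveled'] = three_level_ventilation(toy)
print(toy)
```

Rows with a missing level are dropped by the formula interface when the model is fitted, so they do not contribute to the odds ratios.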

- Loop over categorical parameters and variants

In [7]:
df_regression = pd.DataFrame()
for param in univariate_parameters_withref:
    output = pd.DataFrame()
    for variant in df_analysis.variant.unique():
        print(variant,':',param, end="  --  ")
        variant_data = df_analysis[df_analysis.variant==variant].copy()
        variant_data = variant_data.reset_index(drop=True)
        reg = smf.logit('outcome ~ '+param+'_releveled',data=variant_data).fit(disp=0,method='ncg')
        if variant == df_analysis.variant.unique()[0]:
            output['param'] = reg.params.index[1:].str.replace(r'_releveled\[T\.',' ',regex=True).str.replace(']','',regex=False)
        output['odds_ratio_'+variant] = np.exp(reg.params.values)[1:]
        conf = np.exp(reg.conf_int())[1:]
        conf.columns = ['CI_low','CI_high']
        output['CI_low_'+variant] = conf.CI_low.values
        output['CI_high_'+variant] = conf.CI_high.values
        output['p_values_'+variant] = reg.pvalues.values[1:]
    df_regression = pd.concat([df_regression,output],ignore_index=True)
df_regression.to_csv('categorical_by_variant_reg.csv',index=False)

Delta : state  --  BA.1.X : state  --  BA.2.X : state  --  BA.4/5.X : state  --  Delta : age_group  --  BA.1.X : age_group  --  BA.2.X : age_group  --  BA.4/5.X : age_group  --  Delta : sex  --  BA.1.X : sex  --  BA.2.X : sex  --  BA.4/5.X : sex  --  Delta : ethnicity  --  BA.1.X : ethnicity  --  BA.2.X : ethnicity  --  BA.4/5.X : ethnicity  --  Delta : nb_comorbidities  --  BA.1.X : nb_comorbidities  --  BA.2.X : nb_comorbidities  --  BA.4/5.X : nb_comorbidities  --  Delta : nb_vaccine_dose  --  BA.1.X : nb_vaccine_dose  --  BA.2.X : nb_vaccine_dose  --  BA.4/5.X : nb_vaccine_dose  --  Delta : icuadm  --  BA.1.X : icuadm  --  BA.2.X : icuadm  --  BA.4/5.X : icuadm  --  Delta : ventilation  --  BA.1.X : ventilation  --  BA.2.X : ventilation  --  BA.4/5.X : ventilation  --  Delta : ventilation_outoficu  --  BA.1.X : ventilation_outoficu  --  BA.2.X : ventilation_outoficu  --  BA.4/5.X : ventilation_outoficu  --  Delta : invasive_ventilation  --  BA.1.X : invasive_ventilation  --  BA.2.X
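
To make the content of each row in the output table concrete, here is a minimal sketch of a single univariate fit on synthetic data. The helper name `odds_ratio_table`, the toy data and the simulated death probabilities are assumptions made for this example only. The column name `sex_releveled` and the `Female` reference level mirror the releveling cell above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def odds_ratio_table(data, outcome, predictor):
    """Fit `outcome ~ predictor` and return one row per non-reference level
    with the odds ratio, its 95% confidence interval and the p-value
    (hypothetical helper)."""
    reg = smf.logit(f'{outcome} ~ {predictor}', data=data).fit(disp=0)
    table = np.exp(reg.conf_int())            # CI bounds on the odds-ratio scale
    table.columns = ['CI_low', 'CI_high']
    table.insert(0, 'odds_ratio', np.exp(reg.params))
    table['p_value'] = reg.pvalues
    return table.drop(index='Intercept')      # the intercept is not an odds ratio


# Synthetic data: made-up death probabilities of 0.20 (Female) and 0.30 (Male).
rng = np.random.default_rng(0)
n = 2000
sex = rng.choice(['Female', 'Male'], size=n)
p_death = np.where(sex == 'Male', 0.30, 0.20)
toy = pd.DataFrame({
    'outcome': rng.binomial(1, p_death),
    'sex_releveled': pd.Categorical(sex, categories=['Female', 'Male']),
})

result = odds_ratio_table(toy, 'outcome', 'sex_releveled')
print(result)
```

The single row is labelled `sex_releveled[T.Male]`, which is the kind of label that the two `str.replace` calls in the loop shorten into a readable parameter name.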

- Loop over continuous parameters

In [8]:
continuous_results = []
for param in univariate_parameters:
    df_param_reg = pd.DataFrame()
    variant_columns = df_encoded.columns[df_encoded.columns.str.contains('variant')]
    for variant in variant_columns:
        variant_data = df_encoded[df_encoded[variant]==1].copy()
        # Keep only the rows where every column related to this parameter is filled in
        param_columns = variant_data.columns[variant_data.columns.str.contains(param)].tolist()
        param_data = variant_data.dropna(subset=param_columns).copy()
        X = param_data[param_columns]
        Y = param_data[['outcome']]
        fit_method = 'newton'
        reg = sm.Logit(Y.astype(int),X.astype(int)).fit(disp=0,method=fit_method)
        if variant == variant_columns[0]:
            df_param_reg['param_value'] = reg.params.index
        df_param_reg['odds_ratio_'+variant] = np.exp(reg.params.values)
        conf = np.exp(reg.conf_int())
        conf.columns = ['CI_low','CI_high']
        df_param_reg['CI_low_'+variant] = conf.CI_low.values
        df_param_reg['CI_high_'+variant] = conf.CI_high.values
    # Collect the table of each parameter so that the CSV holds all of them, not only the last one
    continuous_results.append(df_param_reg)
pd.concat(continuous_results,ignore_index=True).to_csv('continuous_by_variant_reg.csv',index=False)

#### comorbidities

In [9]:
univariate_comorbidities = ['cardiovascular_disease','hematologic_disease','down_syndrom','liver_disease','asthma',
                            'diabetes','neurological_disease','chronic_lung_disease','weaken_immune_system',
                            'renal_disease','obesity','puerperal','other_comorbidities']

In [10]:
df_regression = pd.DataFrame()

for param in univariate_comorbidities:
    output = pd.DataFrame()
    for variant in df_analysis.variant.unique():        
        print(param+' - '+variant, end="  --  ")
        variant_data = df_analysis[df_analysis.variant==variant].copy()
        variant_data = variant_data.reset_index(drop=True)
        variant_data['loc_comorb'] = np.where(variant_data[param]==1,param,None)
        variant_data.loc_comorb = np.where(variant_data.nb_comorbidities==0,'No comorbidity',variant_data.loc_comorb)
        variant_data['loc_comorb_releveled'] = pd.Series(pd.Categorical(variant_data.loc_comorb,categories=['No comorbidity',param]))
        reg = smf.logit('outcome ~ loc_comorb_releveled',data=variant_data).fit(disp=0,method='ncg')
        conf = np.exp(reg.conf_int().values)[1:]
        if variant == df_analysis.variant.unique()[0]:
            output['param'] = reg.params.index[1:].str.replace(r'loc_comorb_releveled\[T\.',' ',regex=True).str.replace(']','',regex=False)
        output['odds_ratio_'+variant] = np.exp(reg.params.values)[1:]
        conf = np.exp(reg.conf_int())[1:]
        conf.columns = ['CI_low','CI_high']
        output['CI_low_'+variant] = conf.CI_low.values
        output['CI_high_'+variant] = conf.CI_high.values
        output['p_values_'+variant] = reg.pvalues.values[1:]
        del variant_data
    df_regression = pd.concat([df_regression,output],ignore_index=True)
df_regression.to_csv('comorbidities_by_variant_reg.csv',index=False)

cardiovascular_disease - Delta  --  cardiovascular_disease - BA.1.X  --  cardiovascular_disease - BA.2.X  --  cardiovascular_disease - BA.4/5.X  --  hematologic_disease - Delta  --  hematologic_disease - BA.1.X  --  hematologic_disease - BA.2.X  --  hematologic_disease - BA.4/5.X  --  down_syndrom - Delta  --  down_syndrom - BA.1.X  --  down_syndrom - BA.2.X  --  down_syndrom - BA.4/5.X  --  liver_disease - Delta  --  liver_disease - BA.1.X  --  liver_disease - BA.2.X  --  liver_disease - BA.4/5.X  --  asthma - Delta  --  asthma - BA.1.X  --  asthma - BA.2.X  --  asthma - BA.4/5.X  --  diabetes - Delta  --  diabetes - BA.1.X  --  diabetes - BA.2.X  --  diabetes - BA.4/5.X  --  neurological_disease - Delta  --  neurological_disease - BA.1.X  --  neurological_disease - BA.2.X  --  neurological_disease - BA.4/5.X  --  chronic_lung_disease - Delta  --  chronic_lung_disease - BA.1.X  --  chronic_lung_disease - BA.2.X  --  chronic_lung_disease - BA.4/5.X  --  weaken_immune_system - Delta  --

In the cell above, the impact of each comorbidity is estimated independently of any other comorbidity.
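
A small sketch of the grouping used in that cell, on toy data, shows which patients enter each comparison: patients with the comorbidity of interest (alone or combined with others) are compared with patients without any comorbidity, and patients who only have other comorbidities get no label. The toy rows are made up for this example.

```python
import numpy as np
import pandas as pd

# Toy data: four patients and two of the comorbidity flags.
toy = pd.DataFrame({
    'asthma':           [1, 0, 0, 1],
    'diabetes':         [0, 1, 0, 1],
    'nb_comorbidities': [1, 1, 0, 2],
})

param = 'asthma'
group = pd.Series(np.nan, index=toy.index, dtype=object)
group[toy[param] == 1] = param                        # has the comorbidity of interest
group[toy.nb_comorbidities == 0] = 'No comorbidity'   # reference group
toy['loc_comorb_releveled'] = pd.Categorical(group, categories=['No comorbidity', param])

# Rows with a missing level are dropped when the formula-based model is fitted.
kept = toy.dropna(subset=['loc_comorb_releveled'])
print(kept)
```

Here the second patient (diabetes only) is left out of the asthma comparison, while the fourth patient (asthma and diabetes) counts as an asthma case.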

For a fuller picture, one categorical variable is created for each comorbidity and a new logistic regression is performed with all comorbidities in the model (essentially a multivariate analysis including all comorbidities).

In [11]:
list_comorb_releveled = []
for comorb in univariate_comorbidities:
    print(comorb)
    df_analysis[comorb+'_prereleveled'] = np.where(df_analysis[comorb]==0,'No',None)
    df_analysis[comorb+'_prereleveled'] = np.where(df_analysis[comorb]==1,'Yes',df_analysis[comorb+'_prereleveled'])
    df_analysis[comorb+'_releveled'] = pd.Series(pd.Categorical(df_analysis[comorb+'_prereleveled'],categories=['No','Yes']))
    list_comorb_releveled.append(comorb+'_releveled')

cardiovascular_disease
hematologic_disease
down_syndrom
liver_disease
asthma
diabetes
neurological_disease
chronic_lung_disease
weaken_immune_system
renal_disease
obesity
puerperal
other_comorbidities


In [12]:
df_regression = pd.DataFrame()
for comorb in list_comorb_releveled:
    if comorb == list_comorb_releveled[0]:
        model = comorb
    else:
        model = model+' + '+comorb
for variant in df_analysis.variant.unique():        
    print(variant, end="  --  ")
    variant_data = df_analysis[df_analysis.variant==variant].copy()
    variant_data = variant_data.reset_index(drop=True)
    reg = smf.logit('outcome ~ '+model,data=variant_data).fit(disp=0,method='ncg')
    conf = np.exp(reg.conf_int().values)[1:]
    if variant == df_analysis.variant.unique()[0]:
        df_regression['param'] = reg.params.index[1:].str.replace(r'_releveled\[T\.Yes',' ',regex=True).str.replace(']','',regex=False)
    df_regression['odds_ratio_'+variant] = np.exp(reg.params.values)[1:]
    conf = np.exp(reg.conf_int())[1:]
    conf.columns = ['CI_low','CI_high']
    df_regression['CI_low_'+variant] = conf.CI_low.values
    df_regression['CI_high_'+variant] = conf.CI_high.values
    df_regression['p_values_'+variant] = reg.pvalues.values[1:]
    del variant_data
df_regression.to_csv('comorbidities_dependent_by_variant_reg.csv',index=False)

Delta  --  BA.1.X  --  BA.2.X  --  BA.4/5.X  --  
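
The right-hand side of the formula assembled in the loop above is simply the releveled comorbidity columns joined by ` + `. An equivalent and more compact way to build it, shown here as a sketch that only reuses names from the cells above:

```python
univariate_comorbidities = ['cardiovascular_disease', 'hematologic_disease', 'down_syndrom',
                            'liver_disease', 'asthma', 'diabetes', 'neurological_disease',
                            'chronic_lung_disease', 'weaken_immune_system', 'renal_disease',
                            'obesity', 'puerperal', 'other_comorbidities']

# One releveled column per comorbidity, as created in the previous cell.
list_comorb_releveled = [comorb + '_releveled' for comorb in univariate_comorbidities]

# Join the column names into the right-hand side of the formula.
model = ' + '.join(list_comorb_releveled)
formula = 'outcome ~ ' + model
print(formula)
```

Because every comorbidity enters the same model, each odds ratio is adjusted for the presence of the others, which is what distinguishes this analysis from the one-at-a-time fits.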

### Univariate logistic regression per variant: secondary outcomes

- ICU admission
- Invasive ventilation

In [13]:
outcomes = ['ventilation_invasive','icu_adm']
outcome_delete_from_parameters = [['invasive_ventilation','ventilation','ventilation_outoficu'],['icuadm','ventilation_outoficu',
                                                                                                 'invasive_ventilation']]
univariate_parameters_withref = ['state','age_group','sex','ethnicity','nb_comorbidities','nb_vaccine_dose','icuadm',
                                 'ventilation','ventilation_outoficu','invasive_ventilation','delay_vaccine']
univariate_parameters = ['length_stay','length_delay']

In [14]:
for outcome in outcomes:
    df_regression = pd.DataFrame()
    local_parameters = univariate_parameters_withref.copy()
    if type(outcome_delete_from_parameters[outcomes.index(outcome)]) == list:
        for element in outcome_delete_from_parameters[outcomes.index(outcome)]:
            local_parameters.remove(element)
    else:
        local_parameters.remove(outcome_delete_from_parameters[outcomes.index(outcome)])
    for param in local_parameters:
        output = pd.DataFrame()
        for variant in df_analysis.variant.unique():
            print(variant,'-',outcome,':',param, end='  --  ')
            variant_data = df_analysis[df_analysis.variant==variant].copy()
            reg = smf.logit(outcome+' ~ '+param+'_releveled',data=variant_data).fit(disp=0,method='ncg')
            if variant == df_analysis.variant.unique()[0]:
                output['param'] = reg.params.index[1:].str.replace(r'_releveled\[T\.',' ',regex=True).str.replace(']','',regex=False)
            output['odds_ratio_'+variant] = np.exp(reg.params.values)[1:]
            conf = np.exp(reg.conf_int())[1:]
            conf.columns = ['CI_low','CI_high']
            output['CI_low_'+variant] = conf.CI_low.values
            output['CI_high_'+variant] = conf.CI_high.values
            output['p_values_'+variant] = reg.pvalues.values[1:]
        df_regression = pd.concat([df_regression,output],ignore_index=True)
    df_regression.to_csv(outcome+'_categorical_by_variant_reg.csv',index=False)

Delta - ventilation_invasive : state  --  BA.1.X - ventilation_invasive : state  --  BA.2.X - ventilation_invasive : state  --  BA.4/5.X - ventilation_invasive : state  --  Delta - ventilation_invasive : age_group  --  BA.1.X - ventilation_invasive : age_group  --  BA.2.X - ventilation_invasive : age_group  --  BA.4/5.X - ventilation_invasive : age_group  --  Delta - ventilation_invasive : sex  --  BA.1.X - ventilation_invasive : sex  --  BA.2.X - ventilation_invasive : sex  --  BA.4/5.X - ventilation_invasive : sex  --  Delta - ventilation_invasive : ethnicity  --  BA.1.X - ventilation_invasive : ethnicity  --  BA.2.X - ventilation_invasive : ethnicity  --  BA.4/5.X - ventilation_invasive : ethnicity  --  Delta - ventilation_invasive : nb_comorbidities  --  BA.1.X - ventilation_invasive : nb_comorbidities  --  BA.2.X - ventilation_invasive : nb_comorbidities  --  BA.4/5.X - ventilation_invasive : nb_comorbidities  --  Delta - ventilation_invasive : nb_vaccine_dose  --  BA.1.X - ventil

- Loop over continuous parameters

In [15]:
for outcome in outcomes:
    outcome_results = []
    for param in univariate_parameters:
        print(outcome+' - '+param,end='  --  ')
        df_param_reg = pd.DataFrame()
        variant_columns = df_encoded.columns[df_encoded.columns.str.contains('variant')]
        for variant in variant_columns:
            variant_data = df_encoded[df_encoded[variant]==1].copy()
            # Keep only the rows where the outcome and every column related to this parameter are filled in
            param_columns = variant_data.columns[variant_data.columns.str.contains(param)].tolist()
            param_data = variant_data.dropna(subset=param_columns+[outcome]).copy()
            X = param_data[param_columns]
            # Regress on the current secondary outcome instead of the primary one
            # (assumes df_encoded holds a 0/1 column with this name)
            Y = param_data[[outcome]]
            fit_method = 'newton'
            reg = sm.Logit(Y.astype(int),X.astype(int)).fit(disp=0,method=fit_method)
            if variant == variant_columns[0]:
                df_param_reg['param_value'] = reg.params.index
            df_param_reg['odds_ratio_'+variant] = np.exp(reg.params.values)
            conf = np.exp(reg.conf_int())
            conf.columns = ['CI_low','CI_high']
            df_param_reg['CI_low_'+variant] = conf.CI_low.values
            df_param_reg['CI_high_'+variant] = conf.CI_high.values
        # Collect the table of each parameter so that the CSV holds all of them, not only the last one
        outcome_results.append(df_param_reg)
    pd.concat(outcome_results,ignore_index=True).to_csv(outcome+'_continuous_by_variant_reg.csv',index=False)

ventilation_invasive - length_stay  --  ventilation_invasive - length_delay  --  icu_adm - length_stay  --  icu_adm - length_delay  --  

#### comorbidities

The same method is applied here as for the primary outcome (death): first a univariate analysis, then a multivariate analysis.

In [16]:
univariate_comorbidities = ['cardiovascular_disease','hematologic_disease','down_syndrom','liver_disease','asthma',
                            'diabetes','neurological_disease','chronic_lung_disease','weaken_immune_system',
                            'renal_disease','obesity','puerperal','other_comorbidities']

In [17]:
for outcome in outcomes:
    df_regression = pd.DataFrame()
    for param in univariate_comorbidities:
        output = pd.DataFrame()
        for variant in df_analysis.variant.unique():        
            print(param+' - '+variant,end='  --  ')
            variant_data = df_analysis[df_analysis.variant==variant].copy()
            variant_data = variant_data.reset_index(drop=True)
            variant_data['loc_comorb'] = np.where(variant_data[param]==1,param,None)
            variant_data.loc_comorb = np.where(variant_data.nb_comorbidities==0,'No comorbidity',variant_data.loc_comorb)
            variant_data['loc_comorb_releveled'] = pd.Series(pd.Categorical(variant_data.loc_comorb,categories=['No comorbidity',param]))
            reg = smf.logit(outcome+' ~ loc_comorb_releveled',data=variant_data).fit(disp=0,method='ncg')
            conf = np.exp(reg.conf_int().values)[1:]
            if variant == df_analysis.variant.unique()[0]:
                output['param'] = reg.params.index[1:].str.replace(r'loc_comorb_releveled\[T\.',' ',regex=True).str.replace(']','',regex=False)
            output['odds_ratio_'+variant] = np.exp(reg.params.values)[1:]
            conf = np.exp(reg.conf_int())[1:]
            conf.columns = ['CI_low','CI_high']
            output['CI_low_'+variant] = conf.CI_low.values
            output['CI_high_'+variant] = conf.CI_high.values
            output['p_values_'+variant] = reg.pvalues.values[1:]
            del variant_data
        df_regression = pd.concat([df_regression,output],ignore_index=True)
    df_regression.to_csv(outcome+'_comorbidities_by_variant_reg.csv',index=False)

cardiovascular_disease - Delta  --  cardiovascular_disease - BA.1.X  --  cardiovascular_disease - BA.2.X  --  cardiovascular_disease - BA.4/5.X  --  hematologic_disease - Delta  --  hematologic_disease - BA.1.X  --  hematologic_disease - BA.2.X  --  hematologic_disease - BA.4/5.X  --  down_syndrom - Delta  --  down_syndrom - BA.1.X  --  down_syndrom - BA.2.X  --  down_syndrom - BA.4/5.X  --  liver_disease - Delta  --  liver_disease - BA.1.X  --  liver_disease - BA.2.X  --  liver_disease - BA.4/5.X  --  asthma - Delta  --  asthma - BA.1.X  --  asthma - BA.2.X  --  asthma - BA.4/5.X  --  diabetes - Delta  --  diabetes - BA.1.X  --  diabetes - BA.2.X  --  diabetes - BA.4/5.X  --  neurological_disease - Delta  --  neurological_disease - BA.1.X  --  neurological_disease - BA.2.X  --  neurological_disease - BA.4/5.X  --  chronic_lung_disease - Delta  --  chronic_lung_disease - BA.1.X  --  chronic_lung_disease - BA.2.X  --  chronic_lung_disease - BA.4/5.X  --  weaken_immune_system - Delta  --

In [18]:
for outcome in outcomes:
    df_regression = pd.DataFrame()
    for variant in df_analysis.variant.unique():        
        print(variant,end='  --  ')
        variant_data = df_analysis[df_analysis.variant==variant].copy()
        variant_data = variant_data.reset_index(drop=True)
        reg = smf.logit(outcome+' ~ '+model,data=variant_data).fit(disp=0,method='ncg')
        conf = np.exp(reg.conf_int().values)[1:]
        if variant == df_analysis.variant.unique()[0]:
            df_regression['param'] = reg.params.index[1:].str.replace(r'_releveled\[T\.',' ',regex=True).str.replace(']','',regex=False)
        df_regression['odds_ratio_'+variant] = np.exp(reg.params.values)[1:]
        conf = np.exp(reg.conf_int())[1:]
        conf.columns = ['CI_low','CI_high']
        df_regression['CI_low_'+variant] = conf.CI_low.values
        df_regression['CI_high_'+variant] = conf.CI_high.values
        df_regression['p_values_'+variant] = reg.pvalues.values[1:]
        del variant_data
    df_regression.to_csv(outcome+'_comorbidities_dependent_by_variant_reg.csv',index=False)

Delta  --  BA.1.X  --  BA.2.X  --  BA.4/5.X  --  Delta  --  BA.1.X  --  BA.2.X  --  BA.4/5.X  --  

### Univariate logistic regression with variant combination: primary outcome

In [19]:
variant_combination = [['Delta','BA.1.X'],['BA.1.X','BA.2.X'],['BA.2.X','BA.4/5.X']]

In [20]:
df_regression = pd.DataFrame()
for two_variants in variant_combination:
    output = pd.DataFrame()
    variant_data = df_analysis[(df_analysis.variant==two_variants[0]) | (df_analysis.variant==two_variants[1])].copy()
    variant_data = variant_data.reset_index(drop=True)
    variant_data['variant_releveled']  = pd.Series(pd.Categorical(variant_data.variant,categories=two_variants))
    reg = smf.logit('outcome ~ variant_releveled',data=variant_data).fit(disp=0,method='ncg')
    output['param'] = [two_variants[1]+' - Ref: '+two_variants[0]]
    output['odds_ratio'] = np.exp(reg.params.values)[1:]
    conf = np.exp(reg.conf_int())[1:]
    conf.columns = ['CI_low','CI_high']
    output['CI_low'] = conf.CI_low.values
    output['CI_high'] = conf.CI_high.values
    output['p_values'] = reg.pvalues.values[1:]
    del variant_data
    df_regression = pd.concat([df_regression,output],ignore_index=True)
df_regression.to_csv('variant_combination_reg.csv',index=False)
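
With a single two-level predictor, the odds ratio and Wald confidence interval returned by the logistic regression coincide with the classical values computed from a 2x2 table of counts. The sketch below shows that calculation; the function name `odds_ratio_2x2` and the counts are made up for illustration and are not results from the data.

```python
import math


def odds_ratio_2x2(events_ref, n_ref, events_new, n_new, z=1.96):
    """Odds ratio of the outcome for the newer variant versus the reference
    variant, with a Wald confidence interval (hypothetical helper)."""
    a, b = events_new, n_new - events_new   # newer variant: events / non-events
    c, d = events_ref, n_ref - events_ref   # reference variant: events / non-events
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(odds_ratio) - z * se_log_or)
    high = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, low, high


# Made-up counts: 300 deaths out of 1000 cases (reference), 200 out of 1000 (newer variant).
odds_ratio, ci_low, ci_high = odds_ratio_2x2(events_ref=300, n_ref=1000,
                                             events_new=200, n_new=1000)
print(odds_ratio, ci_low, ci_high)
```

With these counts the odds ratio is (200 x 700) / (800 x 300), about 0.58, and the whole interval lies below 1, which would be read as lower odds of death for the newer variant than for the reference variant.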

### Univariate logistic regression with variant combination: secondary outcome

In [21]:
outcomes = ['ventilation_invasive','icu_adm']
variant_combination = [['Delta','BA.1.X'],['BA.1.X','BA.2.X'],['BA.2.X','BA.4/5.X']]

In [22]:
for outcome in outcomes:
    df_regression = pd.DataFrame()
    for two_variants in variant_combination:
        output = pd.DataFrame()
        variant_data = df_analysis[(df_analysis.variant==two_variants[0]) | (df_analysis.variant==two_variants[1])].copy()
        variant_data = variant_data.reset_index(drop=True)
        variant_data['variant_releveled']  = pd.Series(pd.Categorical(variant_data.variant,categories=two_variants))
        reg = smf.logit(outcome+' ~ variant_releveled',data=variant_data).fit(disp=0,method='ncg')
        output['param'] = [two_variants[1]+' - Ref: '+two_variants[0]]
        output['odds_ratio'] = np.exp(reg.params.values)[1:]
        conf = np.exp(reg.conf_int())[1:]
        conf.columns = ['CI_low','CI_high']
        output['CI_low'] = conf.CI_low.values
        output['CI_high'] = conf.CI_high.values
        output['p_values'] = reg.pvalues.values[1:]
        del variant_data
        df_regression = pd.concat([df_regression,output],ignore_index=True)
    df_regression.to_csv(outcome+'_variant_combination_reg.csv',index=False)