Aggregate statistics on COVID-19 cases dominate the news, the conversation, and Kaggle. One of the most watched aggregate statistics is the number of Confirmed Cases in each country.  The challenge we face when thinking about the number of Confirmed Cases is whether these figures are accurate.  Some countries have taken heroic steps to expand testing, while others remain slow to roll out tests.  This raises the question: what can we rely on?  While the number of Confirmed Cases may vary greatly between countries based on testing policy, I expect the number of Fatalities to be far more faithful.  The problem with comparing these figures is that (1) most people don't die of the disease, and (2) countries can observe Fatalities at a lag to their number of Confirmed Cases.  The question then remains: can we use Fatalities to verify the consistency of Confirmed Cases, and if the factors driving them do differ, why?

# Contents
1. [Data](#Data)  
2. [%Δ ConfirmedCases Model](#%Δ-ConfirmedCases-Model)  
3. [%Δ Fatalities Model](#%Δ-Fatalities-Model)  
4. [Model Comparison](#Model-Comparison)  
5. [Conclusion](#Conclusion)

In [None]:
! apt install libgeos-dev
! pip uninstall -y shapely; pip install --no-binary :all: shapely==1.6.4
! pip uninstall -y cartopy; pip install --no-binary :all: cartopy==0.17.0
! pip install geoviews==1.6.6 hvplot==0.5.2 panel==0.8.0 bokeh==1.4.0

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import os
from operator import add, mul
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import hvplot.pandas
import holoviews as hv
import cartopy.crs as ccrs
import geopandas as gpd
from toolz.curried import map, partial, pipe, reduce
from statsmodels.regression.linear_model import OLS
from statsmodels.tools.tools import add_constant
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory


for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
# Any results you write to the current directory are saved as output.
hv.extension('bokeh')

# Data

I chose to draw on a number of data sources covering not only COVID cases but also country indicators such as GDP and infant mortality, as well as population estimates and land size.  To better control for the variance in Fatalities and Confirmed Cases that results from country size, I opted to look at Fatalities and Confirmed Cases per capita.  As the pandemic is, at this stage, still dominated by exponential growth in new cases, I opted to analyze the relationship between the percent change in Fatalities or Confirmed Cases per capita and the percent changes in our factors.  Two interesting exceptions are the variables representing the weeks since the first case and the weeks since the first death in the country, where I included both the log of the weeks since the event, to represent percent change, and the original value.  Including both is intended to model any effects relating to the logarithmic flattening of the curve late in the infection in a given country.
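The percent-change reading comes from taking logs of both the response and the factors: in a log-log regression, the slope approximates the percent change in the response for a one percent change in the factor. The following is a minimal sketch on synthetic data, not the notebook's data; the names `x`, `y`, `true_elasticity` and `weeks_demo` are hypothetical. It also shows why exponentiating the week counters before the log transform leaves them at their original values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic factor and response with a known elasticity (assumed for illustration).
x = rng.uniform(1.0, 100.0, size=500)
true_elasticity = 0.7
y = 3.0 * x ** true_elasticity * np.exp(rng.normal(0.0, 0.05, size=500))

# Regress log(y) on log(x); the slope estimates the elasticity.
slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
print(f"estimated elasticity: {slope:.3f}")

# Taking exp first means a later log transform returns the original value.
weeks_demo = np.arange(0, 10, dtype=float)
round_trip = np.log(np.exp(weeks_demo))
print(np.allclose(round_trip, weeks_demo))
```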

In [None]:
countries = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres')).replace('United States of America', 'US')
covid = pd.read_csv('/kaggle/input/covid19-global-forecasting-week-2/train.csv', parse_dates=['Date'], index_col='Id')
indicators = pd.read_csv('/kaggle/input/countries-of-the-world/countries of the world.csv', decimal=',').replace('United States', 'US')

country_indicators = (countries.assign(name = lambda df: df.name.astype(str).str.strip())
                     .merge(indicators.assign(Country = lambda df: df.Country.astype(str).str.strip()), 
                            left_on = 'name', right_on='Country', how='inner'))
weeks = (covid
         .assign(dayofweek = lambda df: df.Date.dt.dayofweek)
         .set_index('Date')
         .drop(columns=['Province_State'])
         .groupby(['Country_Region', pd.Grouper(freq='W')]).agg({'ConfirmedCases':'sum', 'Fatalities':'sum', 'dayofweek':'max'})
         .reset_index()
         .where(lambda df: df.ConfirmedCases > 0)
         .dropna(axis=0)
         .groupby('Country_Region')
         .apply(lambda df: (df
                            .sort_values('Date')
                            .assign(week_of_infection = lambda df: np.arange(df.shape[0]))))
         .where(lambda df: df.dayofweek >= 6)
         .drop(columns=['dayofweek'])
         .dropna(axis=0)
         .reset_index(drop=True)
         .merge(country_indicators, left_on='Country_Region', right_on='name', how='inner')
         .pipe(lambda df: gpd.GeoDataFrame(df, geometry='geometry'))
         .assign(ConfirmedCases_per_capita = lambda df: (df.ConfirmedCases / df.pop_est),
                 Fatalities_per_capita= lambda df: (df.Fatalities / df.pop_est),
                 land_area = lambda df: df.area.astype('float'),
                 week_of_infection_exp = lambda df: df.week_of_infection.apply(np.exp))
         .groupby('Country_Region')
         .apply(lambda df: (df
                            .assign(week_since_first_death = lambda x: (x.week_of_infection - x.where(lambda y: y.Fatalities > 0)
                                                                        .week_of_infection.min())
                                                                        .clip(lower=0)
                                                                        .fillna(0))))
         .assign(week_since_first_death_exp = lambda df: df.week_since_first_death.apply(np.exp))
         .drop(columns = 'gdp_md_est'))
weeks

To construct the design matrix for our experiment, we opted to include all our numeric columns and, to account for country-specific effects, dummy variables for countries.  This design matrix is reused in both models, to estimate the coefficients on our covariates for %Δ Cases/capita and %Δ Fatalities/capita.  
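As a small illustration of the country dummies, the sketch below builds a toy design matrix. The frame `toy` and its values are made up for the example; `drop_first=True` drops one country so that the dummies are not collinear with the constant, and `dtype=float` is an addition here to keep the columns numeric.

```python
import pandas as pd

# Toy data standing in for the weekly country panel (values are invented).
toy = pd.DataFrame({
    "name": ["Italy", "Italy", "Spain", "US"],
    "pop_density": [200.0, 200.0, 94.0, 36.0],
})

# One indicator column per country, dropping the first (alphabetical) category.
dummies = (pd.get_dummies(toy["name"], drop_first=True, dtype=float)
           .rename(columns=lambda s: "is_" + s))

# Numeric covariates, country indicators, and an explicit intercept column.
X_toy = pd.concat([toy[["pop_density"]], dummies], axis=1).assign(const=1)
print(X_toy)
```

Italy becomes the baseline country here, so the `is_Spain` and `is_US` coefficients would be read as offsets relative to Italy.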

In [None]:
X, y_cases = (weeks
     .select_dtypes(include=['number'])
     .drop(columns=['ConfirmedCases', 'Fatalities', 'ConfirmedCases_per_capita', 'Fatalities_per_capita'])
     .replace(0, 1e-8)# add jitter
     .transform(np.log)
     .pipe(lambda df: df.fillna(df.mean()))
     .rename(columns = lambda name: '%Δ ' + name)
     .rename(columns = {'%Δ week_of_infection_exp': 'week_of_infection'})
     .rename(columns = {'%Δ week_since_first_death_exp': 'week_since_first_death'})
     .pipe(lambda df: pd.concat([df, pd.get_dummies(weeks.name, drop_first=True).rename(columns =lambda s: 'is_'+s)], axis=1))
     .assign(const = 1),
        
    weeks
    .loc[:, ['ConfirmedCases_per_capita']]
    .rename(columns={'ConfirmedCases_per_capita': 'Cases/capita'})
    .replace(0, 1e-8)# add jitter
    .transform(np.log)  
    .rename(columns = lambda name: '%Δ ' + name)
    )

X.head()

# %Δ ConfirmedCases Model

Our final response variable for %Δ Cases/capita looks approximately symmetric, which should make our assumption of conditional normality in our models better motivated. 

In [None]:
y_cases.hvplot.kde(title='Kernel Density Estimation of %Δ Confirmed Cases Response')


To perform feature selection, we opted to use forward-backward stepwise feature selection, with an entry threshold of 0.015 and a removal threshold at the 5% level of significance.  To ensure this procedure was not biased by the order of the columns, the columns were randomly shuffled and the selection procedure was rerun multiple times. 

In [None]:
def stepwise_selection(X, y, 
                       initial_list=[], 
                       threshold_in=0.015, 
                       threshold_out = 0.05, 
                       verbose=True):
    """ Perform a forward-backward feature selection 
    based on p-value from statsmodels.api.OLS
    Arguments:
        X - pandas.DataFrame with candidate features
        y - list-like with the target
        initial_list - list of features to start with (column names of X)
        threshold_in - include a feature if its p-value < threshold_in
        threshold_out - exclude a feature if its p-value > threshold_out
        verbose - whether to print the sequence of inclusions and exclusions
    Returns: list of selected features 
    Always set threshold_in < threshold_out to avoid infinite looping.
    See https://en.wikipedia.org/wiki/Stepwise_regression for the details
    """
    included = list(initial_list)
    while True:
        changed=False
        # forward step
        excluded = list(set(X.columns)-set(included))
        new_pval = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included+[new_column]]))).fit()
            new_pval[new_column] = model.pvalues[new_column]
        best_pval = new_pval.min()
        if best_pval < threshold_in:
            best_feature = new_pval.idxmin()
            included.append(best_feature)
            changed=True
            if verbose:
                print('Add  {:30} with p-value {:.6}'.format(best_feature, best_pval))

        # backward step
        model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
        # use all coefs except intercept
        pvalues = model.pvalues.iloc[1:]
        worst_pval = pvalues.max() # null if pvalues is empty
        if worst_pval > threshold_out:
            changed=True
            worst_feature = pvalues.idxmax()  # label of the worst feature, not its position
            included.remove(worst_feature)
            if verbose:
                print('Drop {:30} with p-value {:.6}'.format(worst_feature, worst_pval))
        if not changed:
            break
    return included

params_cases = stepwise_selection(X.loc[:, np.random.permutation(X.columns)], y_cases, threshold_in=0.015)

model_cases = OLS(y_cases, X.loc[:, params_cases])
results_cases = model_cases.fit()

**Regression Estimates**  
Our final model includes many of our country-specific effects, which may be interesting to analyze. What is fascinating to investigate in our model is the inclusion of significant %Δ Industry and %Δ Agriculture features.  This may suggest that countries with largely service-based economies have lower growth-rates in infection controlling for our other variables.  

In [None]:
results_cases.summary()

**Influence + Leverage against Squared Residuals + QQ-plot**  
A major concern for our analysis is the clear structure in our leverage and residuals, suggesting there may be an omitted variable that is either missing from our design matrix or not picked up by our selection procedure.  Despite this structure, the distribution of our errors appears to closely follow our assumption of normality, which is promising for the later tests on our model.  

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(17, 5))
fig = sm.graphics.influence_plot(results_cases, ax=axes[0], criterion="cooks")
fig = sm.graphics.plot_leverage_resid2(results_cases, ax=axes[1])
res = results_cases.resid # residuals
fig = sm.qqplot(res, ax=axes[2])
fig.tight_layout()


**Partial Regression Plots**  
What should raise concern is the correlation many variables have with the errors, and the presence of heteroskedasticity in our data, which may be a result of the number of transformations applied to our data or of omitted variables. 

In [None]:
fig = plt.figure(figsize=(30,60))
sm.graphics.plot_partregress_grid(results_cases, fig=fig)
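Heteroskedasticity can also be checked numerically rather than only by eye. The sketch below is not part of the original analysis: it computes Koenker's studentized version of the Breusch-Pagan LM statistic by hand on synthetic data, and the helper names `ols_residuals` and `breusch_pagan_lm` are hypothetical. Under homoskedasticity the statistic is approximately chi-squared with one degree of freedom per non-constant regressor, so with one regressor the 5% critical value is about 3.841.

```python
import numpy as np

def ols_residuals(y, exog):
    """Residuals from a least-squares fit of y on exog."""
    beta, *_ = np.linalg.lstsq(exog, y, rcond=None)
    return y - exog @ beta

def breusch_pagan_lm(resid, exog):
    """n * R^2 from regressing squared residuals on exog (exog must include a constant)."""
    u2 = resid ** 2
    fitted = u2 - ols_residuals(u2, exog)
    r_squared = 1.0 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    return len(u2) * r_squared

rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(1.0, 5.0, size=n)
exog = np.column_stack([np.ones(n), x])

# Same mean structure; only the second series has noise that grows with x.
y_homo = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n)
y_hetero = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n) * x

lm_homo = breusch_pagan_lm(ols_residuals(y_homo, exog), exog)
lm_hetero = breusch_pagan_lm(ols_residuals(y_hetero, exog), exog)
print(f"LM (constant variance): {lm_homo:.2f}")
print(f"LM (variance grows with x): {lm_hetero:.2f}")
```

A statistic far above the critical value, as expected for the second series, would support the visual impression of non-constant error variance.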

# %Δ Fatalities Model

For our second model, we will investigate the covariates of the %Δ Fatalities response.  Looking at the Kernel Density Estimation of our response variable, we can clearly see a mixture of two, possibly three, symmetric distributions, which may have interesting covariates in our data. 

In [None]:
y_fatalities = (weeks
    .loc[:, ['Fatalities_per_capita']]
    .rename(columns={'Fatalities_per_capita': 'Fatalities/capita'})
    .replace(0, 1e-8)# add jitter
    .transform(np.log)  
    .rename(columns = lambda name: '%Δ ' + name))

y_fatalities.hvplot.kde(title='Kernel Density Estimation of %Δ Fatalities Response')

Similar to our first model, we chose a forward-backwards stepwise method of feature selection, but with a threshold of 0.025 for variables entering our model. This higher value was chosen after analyzing our models under a number of different hyperparameters and comparing the variables entering the model against our %Δ Cases/capita model. 

In [None]:
params_fatalities = stepwise_selection(X.loc[:, np.random.permutation(X.columns)], y_fatalities,  threshold_in=0.025)

model_fatalities = OLS(y_fatalities, X.loc[:, params_fatalities])
results_fatalities = model_fatalities.fit()

**Regression Estimates**  
Our estimated model appears to have far fewer features included and a far lower Adjusted R-squared. While it is difficult to say why our %Δ Fatalities/capita model is less well explained by its covariates than the %Δ Cases/capita model, this may be because many countries are too early in their outbreak to have recorded deaths, making estimation more challenging. 

In [None]:
results_fatalities.summary()

**Influence + Leverage against Squared Residuals + QQ-plot**  
This model appears far more dominated by points of high leverage, and its residuals seem to exhibit much fatter tails than those of the first model. 

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(17, 5))
fig = sm.graphics.influence_plot(results_fatalities, ax=axes[0], criterion="cooks")
fig = sm.graphics.plot_leverage_resid2(results_fatalities, ax=axes[1])
res = results_fatalities.resid # residuals
fig = sm.qqplot(res, ax=axes[2])
fig.tight_layout()
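One way to put a number on fat tails, alongside the QQ-plot, is the excess kurtosis of the residuals, which is close to zero for normally distributed errors and positive for heavier tails. This is a minimal sketch on synthetic draws rather than on the model residuals; `excess_kurtosis` is a hypothetical helper written for the example.

```python
import numpy as np

def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (zero for a normal distribution)."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    variance = np.mean(centered ** 2)
    return np.mean(centered ** 4) / variance ** 2 - 3.0

rng = np.random.default_rng(0)
normal_resid = rng.normal(size=5000)                # thin-tailed reference
fat_tailed_resid = rng.standard_t(df=4, size=5000)  # Student-t, heavy tails

kurt_normal = excess_kurtosis(normal_resid)
kurt_fat = excess_kurtosis(fat_tailed_resid)
print(f"normal draws:  {kurt_normal:.2f}")
print(f"Student-t(4):  {kurt_fat:.2f}")
```

The same helper could be applied to `results_fatalities.resid` and `results_cases.resid` to compare the two models.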


**Partial Regression Plots**  
Yet again, our model suffers from strong correlation between covariates and our residuals, which may be a result of either omitted variables or poor transformations of our feature space.  

In [None]:
fig = plt.figure(figsize=(30,60))
sm.graphics.plot_partregress_grid(results_fatalities, fig=fig)

# Model Comparison


The main aim of this analysis has been to compare the coefficients across our two models to identify where and why they differ. The conjecture I present in this notebook is that if these coefficients differ, this may be an indication either that Fatalities are driven by other factors which do not influence the number of Confirmed Cases, or that the number of Confirmed Cases is a function of factors which lead to better testing and thus higher rates of Confirmed Cases.  What is interesting here is to observe where a variable is omitted from one of the models or where the sign of a coefficient changes between the two models. 

In [None]:
(pd.concat([results_cases.params.to_frame(name='Coefficient').assign(Response = '%Δ Cases/capita'),
            results_fatalities.params.to_frame(name='Coefficient').assign(Response = '%Δ Fatalities/capita')], axis=0)
 .drop(index=['const'])
 .reset_index().rename(columns={'index': 'Covariate'})
 .where(lambda s: ~s.Covariate.str.startswith('is_')).dropna().set_index('Covariate')
 .hvplot.bar(title='COVID-19: Coefficients on (%Δ) Covariate against (%Δ) Response', by='Response', rot=90)
 .opts(width=1200, height=400))

In [None]:
coefficients = (results_cases.params.to_frame('Cases')
                .join(results_fatalities.params.to_frame('Fatalities'), how='outer')
                .fillna(0))

To extend our visual analysis of these coefficients, we can test whether the coefficients of one of our models are statistically different from the estimates of the other model, under the t-distribution. This is different from identifying whether these coefficients are statistically non-zero, as many of the coefficients we are comparing against can take on positive and negative values.  

Firstly, we will check whether the estimates for our %Δ Fatalities/capita model are the same as those for our %Δ ConfirmedCases/capita model.   

In [None]:
formula = (coefficients
 .Cases
 .loc[results_fatalities.params.index]
 .reset_index()
 .rename(columns={'index':'Name'})
 .assign(formula = lambda df: df.Name.astype(str) + ' = ' + df.Cases.astype(str) + ' ,')
 .formula
 .sum())[:-1]

T_test = results_fatalities.t_test(formula)
T_test.summary_frame().assign(names = model_fatalities.exog_names).set_index('names').round(3)

Secondly, we will check whether the estimates for our %Δ ConfirmedCases/capita model are the same as those for our %Δ Fatalities/capita model.  

In [None]:
formula = (coefficients
 .Fatalities
 .loc[results_cases.params.index]
 .reset_index()
 .rename(columns={'index':'Name'})
 .assign(formula = lambda df: df.Name.astype(str) + ' = ' + df.Fatalities.astype(str) + ' ,')
 .formula
 .sum())[:-1]

T_test = results_cases.t_test(formula)
T_test.summary_frame().assign(names = model_cases.exog_names).set_index('names').round(3)

# Conclusion
What appears interesting from our analysis is that, apart from some country-specific estimates, the estimates of our coefficients do seem to differ between our two models. The structure of countries' economies appears to be a credible factor to investigate in explaining why these estimates vary so much, which may have far-reaching implications as the virus spreads. 

I would love to hear your feedback on this notebook and any suggestions on how I may improve the analysis in any way by including new data sources or new methodologies.  If you liked this kernel, please give it a vote and check out some of my other interesting kernels on COVID-19 Survival Analysis.  