Spatial regression is an approach, popular across many disciplines, for estimating regression models in the presence of spatial correlation.  For economists investigating the covariates that drive housing prices, accounting for spatial correlation allows us to better control for neighbourhood effects. These effects may arise from a buyer's willingness to pay more for a house near other nice houses, or from unobserved variables such as the distance to schools, shops and restaurants.  
  
In analyzing the spread and impact of COVID-19, I have been fascinated by whether we observe uncontrolled spatial correlation in infections or fatalities between neighbouring countries.  This may be the result of trade or migration between neighbouring countries, or of properties of the weather or geography that are difficult to account for in the model. 

Additionally, as new data on government interventions has been released to the public, I have been interested in building on some of my previous notebooks to investigate the success of these programmes. 

In this notebook, we are interested in identifying whether controlling for this autocorrelation provides more stable estimates for our model, as well as whether the physical distances between countries reflect honest notions of similarity between nations, which may otherwise be better captured by the total number of flights, migration or trade between countries. 

# Contents
1. [Data](#Data)  
2. [Spatial Regression Model](#Spatial-Regression-Model)  
3. [Conclusion](#Conclusion)

In [None]:
! apt install libgeos-dev
! pip uninstall -y shapely; pip install --no-binary :all: shapely==1.6.4
! pip uninstall -y cartopy; pip install --no-binary :all: cartopy==0.17.0
! conda install -y geoviews=1.6.6 hvplot=0.5.2 bokeh==1.4.0 pysal

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import os
from operator import add, mul
from time import time
from functools import wraps
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import hvplot.pandas
import holoviews as hv
import cartopy.crs as ccrs
import geopandas as gpd
from toolz.curried import map, partial, pipe, reduce
from statsmodels.regression.linear_model import OLS
from statsmodels.tools.tools import add_constant
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pysal as ps
from sklearn.manifold import MDS, smacof
from sklearn.metrics.pairwise import euclidean_distances

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory


for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
# Any results you write to the current directory are saved as output.
hv.extension('bokeh')

# Data

I chose to draw on a number of data sources covering not only COVID cases but also country indicators on GDP, infant mortality, etc., as well as data on population estimates and land size.  In order to better control for the variance in Fatalities and Confirmed Cases that results from country size, I opted to look at Fatalities and Confirmed Cases per Capita.  Because, at this stage, the pandemic is still dominated by exponential growth in new cases, I opted to analyze the relationship between the percent change in Fatalities or Confirmed Cases per Capita and the percent changes in our factors.  Two interesting exceptions to this were the variables representing the weeks since the first case in the country and the weeks since the first death, where I included both the log of the weeks since the event, to represent percent change, and the original value. This is used to model any effects relating to the logarithmic flattening of the curve late in the infection in a given country.  
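
As a rough illustration of why regressing a logged outcome on logged factors can be read as a relationship between percent changes, the sketch below simulates a power-law relationship and recovers its exponent as a log-log slope. The data and variable names are made up for illustration and are not taken from this notebook's datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical factor and outcome following y = 2 * x^0.5 with multiplicative noise
x = rng.uniform(1.0, 100.0, size=500)
y = 2.0 * x ** 0.5 * np.exp(rng.normal(0.0, 0.05, size=500))

# Regress log(y) on log(x): the slope is the elasticity, i.e. the approximate
# percent change in y associated with a one percent change in x
A = np.c_[np.ones_like(x), np.log(x)]
intercept, elasticity = np.linalg.lstsq(A, np.log(y), rcond=None)[0]
print(f"estimated elasticity: {elasticity:.3f}")
```

The fitted slope should land close to the exponent used in the simulation, which is the sense in which the log-transformed coefficients in this notebook are labelled %Δ.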

We have included in this analysis data from [Oxford University's Government Response Tracker](https://www.bsg.ox.ac.uk/research/research-projects/oxford-covid-19-government-response-tracker), as well as the [WHO 2000 World Health Report](https://www.who.int/whr/2000/en/). The WHO report data is 20 years old, but it appears to be a reliable source when contrasted with the other data sources found. 

In [None]:
def timing(f):
    @wraps(f)
    def wrap(*args, **kw):
        ts = time()
        result = f(*args, **kw)
        te = time()
        print('func:%r took: %2.4f sec' % \
          (f.__name__, te-ts))
        return result
    return wrap

def shape(f, outputs=True, inputs=True):
    @wraps(f)
    def wrap(*args, **kw):
        result = f(*args, **kw)
        
        if inputs:
            print('func:%r input shape:%r' % \
              (f.__name__, args[0].shape))
            
        if outputs:
            print('func:%r output shape:%r' % \
              (f.__name__, result.shape))
        return result
    return wrap

In [None]:
oxford = (pd.read_csv('/kaggle/input/oxford-covid19-government-response-tracker/OxCGRT_Download_120420_170601_Full.csv', parse_dates=['Date'])
         .fillna(0)
         .drop(columns=['CountryCode', 'StringencyIndexForDisplay', 'Unnamed: 39', 'ConfirmedCases', 'ConfirmedDeaths']))
is_gen = ~oxford.columns.str.endswith('_IsGeneral') & ~oxford.columns.str.endswith('_Notes')
oxford = (oxford
          .loc[:, is_gen]
          .set_index(['CountryName','Date'])
          .rename(columns=lambda s: 'Ox'+s)
          .groupby('CountryName')
          .cumsum()
          .reset_index())

In [None]:
oxford

In [None]:
countries = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres')).replace('United States of America', 'US')
covid = pd.read_csv('/kaggle/input/covid19-global-forecasting-week-2/train.csv', parse_dates=['Date'], index_col='Id')
indicators = pd.read_csv('/kaggle/input/countries-of-the-world/countries of the world.csv', decimal=',').replace('United States', 'US')
who_scores = pd.read_csv('/kaggle/input/who-2000-health-index/WHO 2000 Health Index.csv')

country_indicators = (countries.assign(name = lambda df: df.name.astype(str).str.strip())
                     .merge(indicators.assign(Country = lambda df: df.Country.astype(str).str.strip()), 
                            left_on = 'name', right_on='Country', how='inner'))
weeks = (covid
         .assign(dayofweek = lambda df: df.Date.dt.dayofweek)
         .merge(oxford, left_on=['Country_Region', 'Date'], right_on=['CountryName','Date'])
         .set_index('Date')
         .drop(columns=['Province_State'])
         .groupby(['Country_Region', pd.Grouper(freq='W')]).agg({'ConfirmedCases':'sum',
                                                                 'Fatalities':'sum',
                                                                 'dayofweek':'max',
                                                                 'OxS1_School closing':'mean',
                                                                'OxS2_Workplace closing':'mean',
                                                                'OxS3_Cancel public events':'mean',
                                                                'OxS4_Close public transport':'mean',
                                                                'OxS5_Public information campaigns':'mean',
                                                                'OxS6_Restrictions on internal movement':'mean',
                                                                'OxS7_International travel controls':'mean',
                                                                'OxS8_Fiscal measures':'mean',
                                                                'OxS9_Monetary measures':'mean',
                                                                'OxS10_Emergency investment in health care':'mean',
                                                                'OxS11_Investment in Vaccines':'mean',
                                                                'OxS12_Testing framework':'mean',
                                                                'OxS13_Contact tracing':'mean',
                                                                'OxStringencyIndex':'mean'})
         .reset_index()
         .where(lambda df: df.ConfirmedCases > 0)
         .dropna(axis=0)
         .groupby('Country_Region')
         .apply(lambda df: (df
                            .sort_values('Date')
                            .assign(week_of_infection = lambda df: np.arange(df.shape[0]))))
         .where(lambda df: df.dayofweek >= 6)
         .drop(columns=['dayofweek'])
         .dropna(axis=0)
         .reset_index(drop=True)
         .merge(country_indicators, left_on='Country_Region', right_on='name', how='inner')
         .pipe(lambda df: gpd.GeoDataFrame(df, geometry='geometry'))
         .assign(ConfirmedCases_per_capita = lambda df: (df.ConfirmedCases / df.pop_est),
                 Fatalities_per_capita= lambda df: (df.Fatalities / df.pop_est),
                 land_area = lambda df: df.area.astype('float'),
                 week_of_infection_exp = lambda df: df.week_of_infection.apply(np.exp))
         .groupby('Country_Region')
         .apply(lambda df: (df
                            .assign(week_since_first_death = lambda x: (x.week_of_infection - x.where(lambda y: y.Fatalities > 0)
                                                                        .week_of_infection.min())
                                                                        .clip(lower=0)
                                                                        .fillna(0))))
         .assign(week_since_first_death_exp = lambda df: df.week_since_first_death.apply(np.exp))
         .drop(columns = 'gdp_md_est')
         .merge(who_scores.assign(Country = lambda df: df.Country.str.normalize("NFKD").str.strip()), 
                left_on='name', right_on='Country', how='left'))
weeks

To model the spatial correlation between countries, we need a way to compute the distances between them.  For simplicity, we have opted to fall back on geopandas' default projection and use the Euclidean distances between the midpoints of the countries to estimate their distances.  For sparse countries, it may be more faithful to use the shortest distance between borders or between capital cities, but this appeared computationally intensive and may be grounds for future work.  
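
For comparison with the planar approach described above, here is a minimal sketch of great-circle (haversine) distances between midpoints. It assumes the coordinates are longitude/latitude in degrees; `haversine_matrix` and the three sample points are hypothetical and are not part of this notebook's pipeline, which uses Euclidean distances.

```python
import numpy as np

def haversine_matrix(lon, lat, radius_km=6371.0):
    """Pairwise great-circle distances (km) between points given in degrees."""
    lon = np.radians(np.asarray(lon, dtype=float))
    lat = np.radians(np.asarray(lat, dtype=float))
    dlon = lon[:, None] - lon[None, :]
    dlat = lat[:, None] - lat[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    # Clip guards against tiny floating-point excursions outside [0, 1]
    return 2 * radius_km * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# Hypothetical midpoints on the equator, for illustration only
lon = [0.0, 90.0, 180.0]
lat = [0.0, 0.0, 0.0]
great_circle = haversine_matrix(lon, lat)

# Planar Euclidean distance in degrees, as obtained from unprojected coordinates
pts = np.c_[lon, lat]
planar = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
```

On the equator the two measures differ only by a scale factor, but they diverge at higher latitudes, where a degree of longitude covers less ground.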

In [None]:
points = (weeks.geometry
          .representative_point()
          .apply(lambda df: pd.Series([df.x, df.y]))
          .rename(columns={0:'x',1:'y'}))
hv.Labels(points.assign(names = weeks.name).drop_duplicates(), kdims=['x','y'], vdims='names').opts(height=500, width=800, title='Midpoints of Countries')

In [None]:
distances = euclidean_distances(points)
distances = np.sin(np.pi * distances / distances.max().max())
D = pd.DataFrame(distances, index = weeks.name, columns = weeks.name).drop_duplicates().T.drop_duplicates()

Z, score = smacof(D.to_numpy(), n_components=2)

(hv.Labels(pd.DataFrame(Z, columns=['x','y'])
           .assign(names = D.index)
           .drop_duplicates(), kdims=['x','y'], vdims='names').opts(height=500, width=800))

To construct the design matrix for our experiment, we opted to include all our numeric columns and, to account for country-specific effects, dummy variables for countries.  This design matrix is reused in both models to estimate covariates for %Δ Fatalities/capita.  The reason we opted to model %Δ Fatalities/capita rather than %Δ Confirmed Cases/capita is that confirmed case counts depend on testing, which appears highly correlated with many indicators.  
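
The sketch below shows the dummy-variable construction on a toy frame. The frame and its `gdp` column are hypothetical. It also illustrates one thing that may be worth checking in a design like this: an indicator that is constant within each country is a linear combination of the country dummies and the constant, so the matrix loses rank.

```python
import numpy as np
import pandas as pd

# Hypothetical country-week rows, for illustration only
toy = pd.DataFrame({
    "name": ["A", "A", "B", "B", "C", "C"],
    "gdp": [1.0, 1.0, 2.0, 2.0, 3.0, 3.0],  # constant within each country
})

# drop_first=True omits one country so the dummies are not collinear with the constant
dummies = pd.get_dummies(toy["name"], drop_first=True).rename(columns=lambda s: "is_" + s)
design = pd.concat([toy[["gdp"]], dummies], axis=1).assign(const=1)

# Four columns, but gdp = const + is_B + 2 * is_C, so the rank is lower
rank = np.linalg.matrix_rank(design.to_numpy(dtype=float))
print(design.shape[1], rank)
```

In this notebook the later stepwise selection decides which of these columns enter the fitted model, so the full matrix is not necessarily estimated as is.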

In [None]:
X, y_fatalities = (weeks
     .select_dtypes(include=['number'])
     .drop(columns=['ConfirmedCases', 'Fatalities', 'ConfirmedCases_per_capita', 'Fatalities_per_capita'])
     .replace(0, 1e-8)# add jitter
     .transform(np.log)
     .pipe(lambda df: df.fillna(df.mean()))
     .rename(columns = lambda name: '%Δ ' + name)
     .rename(columns = {'%Δ week_of_infection_exp': 'week_of_infection'})
     .rename(columns = {'%Δ week_since_first_death_exp': 'week_since_first_death'})
     .pipe(lambda df: pd.concat([df, pd.get_dummies(weeks.name, drop_first=True).rename(columns =lambda s: 'is_'+s)], axis=1))
     .assign(const = 1),
        
    weeks
    .loc[:, ['Fatalities_per_capita']]
    .rename(columns={'Fatalities_per_capita': 'Fatalities/capita'})
    .replace(0, 1e-8)# add jitter
    .transform(np.log)  
    .rename(columns = lambda name: '%Δ ' + name)
    )

X.head()

A major challenge in working with government response data is that the response of a government is endogenously determined by the load on the public healthcare system. This means that our estimates will suggest that increases in government intervention are positively correlated with deaths. To deal with this phenomenon, I opted to orthogonalize the government response data with respect to Confirmed Cases per Capita and %Δ Confirmed Cases per Capita. After performing this operation, we still need to deal with the fact that the government response variables are highly correlated with one another, as governments tend to take actions simultaneously. To manage this challenge, I opted to use a single-factor factor analysis (scikit-learn's `FactorAnalysis`) to compute a latent Government Response Factor for use in my model. 
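
The orthogonalization step can be sketched on simulated data as follows. The arrays here are random stand-ins that only mirror the roles of `v` (the case-load controls) and `I` (the policy indicators); they are not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: v holds the case-load controls, I the policy indicators
n = 200
v = rng.normal(size=(n, 2))
I = v @ rng.normal(size=(2, 3)) + rng.normal(size=(n, 3))

# Project I onto the column space of v and keep only the residual
projection = v @ np.linalg.pinv(v.T @ v) @ v.T
I_resid = I - projection @ I

# The residual is orthogonal to every column of v (up to floating-point error)
print(np.abs(v.T @ I_resid).max())
```

The residual keeps only the variation in the policy indicators that the case-load controls cannot linearly explain, which is the part passed on to the factor analysis.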

In [None]:
y_cases =(weeks
    .loc[:, ['ConfirmedCases_per_capita']]
    .rename(columns={'ConfirmedCases_per_capita': 'Cases/capita'})
         )

log_y_cases = (y_cases
    .replace(0, 1e-8)# add jitter
    .transform(np.log)  
    .rename(columns = lambda name: '%Δ ' + name))

v = np.c_[y_cases.to_numpy(), log_y_cases.to_numpy()]

In [None]:
I = X.loc[:, X.columns.str.startswith('%Δ OxS')]
iv_names = [s + ' IV' for s in I.columns]
I = I.to_numpy()
IV = pd.DataFrame(I - v @ (np.linalg.pinv(v.T @ v) @ v.T @ I), columns=iv_names)

In [None]:
from sklearn.decomposition import FactorAnalysis
government_response = pd.DataFrame(FactorAnalysis(1).fit_transform(IV), columns=['Government Response Factor'])

In [None]:
X = pd.concat([X.loc[:, ~X.columns.str.startswith('%Δ OxS')], government_response], axis=1)

In [None]:
X.head()

# Spatial Regression Model

## Feature Selection

In order to perform feature selection, we opted to perform repeated stepwise forward-backward selection on a non-spatial regression model. This was mainly due to frustration with the pysal API and prior work on modelling COVID-19 fatalities. This also allows us to more directly compare the effects of controlling for spatial correlation between neighbouring countries in our model. 

In [None]:
def stepwise_selection(X, y, 
                       initial_list=[], 
                       n_iter = 1,
                       threshold_in=0.015, 
                       threshold_out = 0.05, 
                       verbose=True):
    """ Perform a forward-backward feature selection 
    based on p-value from statsmodels.api.OLS
    Arguments:
        X - pandas.DataFrame with candidate features
        y - list-like with the target
        initial_list - list of features to start with (column names of X)
        n_iter - number of times to repeat the forward-backward loop
        threshold_in - include a feature if its p-value < threshold_in
        threshold_out - exclude a feature if its p-value > threshold_out
        verbose - whether to print the sequence of inclusions and exclusions
    Returns: list of selected features 
    Always set threshold_in < threshold_out to avoid infinite looping.
    See https://en.wikipedia.org/wiki/Stepwise_regression for the details
    """
    included = list(initial_list)
    for _ in range(n_iter):
        while True:
            changed=False
            # forward step
            excluded = list(set(X.columns)-set(included))
            new_pval = pd.Series(index=excluded).sample(frac=1)
            for new_column in excluded:
                model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included+[new_column]]))).fit()
                new_pval[new_column] = model.pvalues[new_column]
            best_pval = new_pval.min()
            if best_pval < threshold_in:
                best_feature = new_pval.idxmin()
                included.append(best_feature)
                changed=True
                if verbose:
                    print('Add  {:30} with p-value {:.6}'.format(best_feature, best_pval))

            # backward step
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
            # use all coefs except intercept
            pvalues = model.pvalues.iloc[1:]
            worst_pval = pvalues.max() # null if pvalues is empty
            if worst_pval > threshold_out:
                changed=True
                worst_feature = pvalues.idxmax()
                included.remove(worst_feature)
                if verbose:
                    print('Drop {:30} with p-value {:.6}'.format(worst_feature, worst_pval))
            if not changed:
                break
    return included

params_cases = stepwise_selection(X.loc[:, np.random.permutation(X.columns)], y_fatalities,
                                  n_iter=3,
                                  threshold_in=0.015, threshold_out=0.025)

model_cases = OLS(y_fatalities, X.loc[:, params_cases])
results_cases = model_cases.fit()

In [None]:
results_cases.summary()

## Model

From our model below, we interestingly see strong evidence for spatial correlation in the data when looking at the Lagrange Multiplier (error), Robust LM (error) and Lagrange Multiplier (SARMA) tests.  If we again look at the parameter estimates for our model, we see variables significant at the 5% level across the board.  What appears fascinating in our estimates is the positive coefficient on %Δ Arable (%), which may provide policy makers insight into how best to support the nations most exposed to the virus. 
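
To build some intuition for what spatial correlation means numerically, here is a minimal sketch of Moran's I, a simpler statistic than the Lagrange Multiplier tests reported by the model. The `morans_i` helper and the six-location example are hypothetical and are not part of the pysal output.

```python
import numpy as np

def morans_i(values, weights):
    """Moran's I for a 1-D array and a square spatial weights matrix."""
    z = np.asarray(values, dtype=float) - np.mean(values)
    w = np.asarray(weights, dtype=float)
    return (len(z) / w.sum()) * (z @ w @ z) / (z @ z)

# Hypothetical example: six locations on a line, adjacent locations are neighbours
n = 6
w = np.zeros((n, n))
for i in range(n - 1):
    w[i, i + 1] = w[i + 1, i] = 1.0

smooth = np.arange(n, dtype=float)                           # neighbours are similar
alternating = np.array([1, -1, 1, -1, 1, -1], dtype=float)   # neighbours differ

print(morans_i(smooth, w), morans_i(alternating, w))
```

The statistic is positive when neighbouring values move together and negative when they alternate. Applied to regression residuals, a clearly positive value would point in the same direction as the tests above.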

In [None]:
X_prime = X.loc[:,params_cases]
X_prime = X_prime.loc[:, X_prime.min(0) != X_prime.max(0)]
w = ps.lib.weights.full2W(distances)

model = ps.model.spreg.OLS(y_fatalities.values, X_prime.values, w=w, spat_diag=True, name_x=X_prime.columns.tolist(), name_y=y_fatalities.columns.tolist()[0])
print(model.summary)

In [None]:
pd.Series(model.betas[1:].flatten(), index=X_prime.columns.tolist(), name='Coefficients: Relationship with %Δ COVID-19 Fatalities').hvplot.bar().opts(xrotation=75, height=600)

## Spatial PCR

When analyzing the Variance Inflation Factor, explored in a previous notebook, I became interested in how our parameter estimates might change if we applied a Principal Component Regression approach.  Using this approach, we chose to apply a power transform to our continuous features before standard scaling and computing our principal components. The number of principal components was chosen using the elbow rule, based on the variance explained.  I opted to reintroduce the country indicator variables into the model after computing the principal components, in order to avoid conflating the distributional assumptions across our different types of data in scaling and to ease interpretation. 
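
Choosing the number of components from the explained variance can be sketched as below. Note that this uses a cumulative-variance threshold (0.95, an arbitrary assumption) as a stand-in for reading an elbow off the plot, and the data are simulated rather than taken from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 10 observed features driven by 3 latent factors plus small noise
latent = rng.normal(size=(300, 3))
features = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(300, 10))

pca = PCA().fit(StandardScaler().fit_transform(features))
cumulative = pca.explained_variance_ratio_.cumsum()

# Smallest number of components whose cumulative explained variance reaches 95%
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_components, np.round(cumulative, 3))
```

A threshold rule is easier to automate than an elbow, but the two can disagree, so it is still worth looking at the bar chart of explained variance.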

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, PowerTransformer

X_prime_all = pd.concat([X
                         .loc[:, ~X.columns.str.startswith('is_')]
                         .drop(columns=['const'])
                         .drop(columns=['Government Response Factor']),
                        IV], axis=1)
X_prime_all = X_prime_all.loc[:, X_prime_all.min(0) != X_prime_all.max(0)]

pipeline = make_pipeline(PowerTransformer(), StandardScaler(), PCA(7))
Z = pd.concat([pd.DataFrame(pipeline.fit_transform(X_prime_all)), X_prime.loc[:, X_prime.columns.str.startswith('is_')]], axis=1)

(hv.Bars(pipeline.named_steps['pca'].explained_variance_ratio_)
.opts(title='Variance Explained Ratio of Principal Components', ylabel='Variance Explained', xlabel='Component', width=600))

Looking at the outputs of the model, spatial correlation in our errors appears even more apparent, based on our tests. 

In [None]:
pcr_model = ps.model.spreg.OLS(y_fatalities.values, Z.values, w=w, spat_diag=True, name_x=Z.columns.tolist(), name_y=y_fatalities.columns.tolist()[0])
print(pcr_model.summary)

I opted to reproject the coefficients from our Spatial PCR model into the original space of our data in order to analyze them further. 
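
The reprojection can be checked on simulated data. The sketch below uses scikit-learn's `LinearRegression` in place of the spatial model and made-up data, so it only illustrates the linear algebra: coefficients on the components map back to the original features through the PCA loadings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data, for illustration only
X_toy = rng.normal(size=(200, 5))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

pca = PCA(n_components=3).fit(X_toy)
scores = pca.transform(X_toy)
pcr = LinearRegression().fit(scores, y_toy)

# scores = (X - mean) @ components_.T, so beta_original = components_.T @ beta_pc
beta_original = pca.components_.T @ pcr.coef_

# Both parameterisations give the same fitted values
fitted_pc = pcr.predict(scores)
fitted_original = (X_toy - pca.mean_) @ beta_original + pcr.intercept_
```

`PCA.inverse_transform` applies the same loadings but also adds the PCA mean back. After standard scaling that mean is approximately zero, so the two routes should give nearly identical coefficients in the scaled feature space.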

In [None]:
betas = np.array(pcr_model.betas[1:-Z.columns.astype(str).str.startswith('is').sum()])
coef_ = pipeline.named_steps['pca'].inverse_transform(betas.reshape(1,-1))
(pd.DataFrame(coef_.flatten(), index = X_prime_all.columns, columns=['Reprojected Coefficients'])
 .hvplot.bar().opts(xrotation=90, height=400))

# Conclusion

I would love to hear your feedback on this notebook and any suggestions on how I may improve the analysis in any way by including new data sources or new methodologies.  If you liked this kernel, please give it a vote and check out some of my other interesting kernels on COVID-19 Survival Analysis.  