# Typology of National Responses to the Covid-19 Pandemic

## 1. Overview
As of May 19, 2020, the Covid-19 pandemic had reached 188 countries and territories, causing more than 4.89 million cases and more than 322,000 deaths, and governments worldwide have responded with a wide range of measures. The measures adopted by each country vary in intensity and timing, and they have contributed to differing degrees of national effectiveness in containing the virus.

In this submission we evaluate the impact of government strategies on Covid-19 infection by developing a typology of governments' early policy responses and comparing infection rates across the categories. Government responses often involve various combinations of policies, ranging from public health to economics. To compare policies across countries, we adopt the Oxford Covid-19 government response stringency index, a common index that aggregates policy scores and is comparable across countries. Furthermore, we focus on **early government actions**, that is, measures imposed during the span of two weeks before the date of the first confirmed case in a country and one month afterwards.


### Takeaways

We derived five categories of early national responses from cluster analysis (more details in section 3.1):

* Proactive
* Swiftly Responsive
* Responsive
* Moderately Responsive
* Slow Start

We then compared the infection curves, and the speed at which infection reached certain thresholds, across the country categories. We controlled for differences in countries' healthcare and economic conditions using matching and regression analysis.

We find that countries with Proactive or Responsive Covid-19 response profiles tend to experience slower infection at a smaller scale. These two groups are statistically indistinguishable from each other. On the other hand, countries in the Slow Start group had the fastest growth of confirmed cases, while the Moderately Responsive group lies in between. Due to the small sample size (74 countries), only the difference between the Proactive group and the Slow Start group is statistically significant.

Similar policies seem to be less effective in countries deprived of resources. Covid-19 may bring countries with pre-existing social and economic vulnerabilities an even greater degree of uncertainty. These countries require focused study, with analytical models different from those used for other countries, to better predict the influence of policies.

### Caveats

As expected, more proactive and quicker government responses are beneficial in the fight against the Covid-19 virus. However, our analysis also yielded some counterintuitive findings, such as placing Germany in the Slow Start group along with the United States, the United Kingdom, and Spain. This misalignment suggests that the policy metric we adopted, the Oxford stringency index, may not accurately or holistically capture government responses to the pandemic, and that government policy alone may not determine a country's outcome in the pandemic. Last but not least, as we evaluate government responses, we should keep in mind the economic and social impacts of these policies, in particular the lockdown measures, in addition to their health impacts.

## 2. Data
One of the datasets we use in this submission comes from the UNCOVER datasets: the Johns Hopkins CSSE global confirmed cases. We accessed the data on the project's GitHub page on May 9th (https://github.com/CSSEGISandData/COVID-19).

The core of our analysis is built upon the time series of the Oxford Covid-19 government response stringency index. We accessed the data on May 9th at https://covidtracker.bsg.ox.ac.uk/

We also include a number of country-level socioeconomic variables:

* UN population-related data (https://population.un.org/wpp/Download/Standard/CSV/)
* World Bank GDP data (https://data.worldbank.org/indicator/NY.GDP.MKTP.CD?view=map)
* Kaggle country to continent (https://www.kaggle.com/statchaitya/country-to-continent)
* World Bank life expectancy at birth by country (https://data.worldbank.org/indicator/sp.dyn.le00.in)

We have uploaded both the raw and the processed datasets used in this submission. In the visualization and analysis section below, we import the cleaned data directly. In the Appendix at the end of this notebook, we demonstrate step by step how we cleaned the raw data.

Our final sample consists of 74 countries. Our sample choice is highly constrained by data availability. We adopt a six-week period to summarize and classify government responses. The six-week period starts from 15 days before the diagnosis of the first case to 30 days after the first confirmed case. Countries with significant missing data during this period were dropped. We also exclude from our sample countries with fewer than 1,000 cumulative confirmed cases.
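The sample rule described above can be sketched on toy data. This is a minimal illustration only: the two countries, their scores, and the `max_cases` series are hypothetical, and the cut-off of three missing days is taken from the cleaning code in the Appendix.

```python
import numpy as np
import pandas as pd

# Hypothetical stringency scores indexed by days since the first case (-15..30).
days = np.arange(-15, 31)
policy = pd.DataFrame({'A': 50.0, 'B': 50.0}, index=days)
policy.loc[-15:-10, 'B'] = np.nan  # six missing days for country B

# Hypothetical peak cumulative case counts per country.
max_cases = pd.Series({'A': 5200, 'B': 8000})

# Count missing days in the observation window, then apply both sample rules:
# at most 3 missing days and at least 1,000 cumulative confirmed cases.
missing = policy.loc[-15:30].isnull().sum()
keep = [c for c in policy.columns if missing[c] <= 3 and max_cases[c] >= 1000]
print(keep)
```

Country B is dropped here because of its missing data, even though it passes the case threshold.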

## 3. Analysis and Findings
### 3.1 Typology of National Responses

In [None]:
#setup
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.patches as mpatches
import warnings
from dateutil import parser
from datetime import datetime, timedelta
from matplotlib import rcParams
from sklearn.cluster import KMeans
from matplotlib.lines import Line2D

warnings.filterwarnings('ignore')

rcParams['figure.figsize'] = (10, 6)
rcParams['figure.dpi'] = 150
rcParams['lines.linewidth'] = 2
rcParams['axes.grid'] = False
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'none'

def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
    """
    Minimize chartjunk by stripping out unnecessary plot borders and axis ticks
    
    The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn
    """
    ax = axes or plt.gca()
    ax.spines['top'].set_visible(top)
    ax.spines['right'].set_visible(right)
    ax.spines['left'].set_visible(left)
    ax.spines['bottom'].set_visible(bottom)
    
    #turn off all ticks
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')
    
    #now re-enable visibles
    if top:
        ax.xaxis.tick_top()
    if bottom:
        ax.xaxis.tick_bottom()
    if left:
        ax.yaxis.tick_left()
    if right:
        ax.yaxis.tick_right()

In [None]:
# time series of daily cumulative confirmed cases
case_df = pd.read_csv('/kaggle/input/covid19-data/time_series_covid19_confirmed_global.csv')

# cleaned time series. each column is a country. each row is cum. cases since the day of first confirmed case
case_df_day0 = pd.read_csv('/kaggle/input/covid19-data/cum_cases_from_day0.csv')

# stringency index
policy_df = pd.read_csv('/kaggle/input/covid19-data/OxCGRT_latest.csv')

# cleaned stringency index. each column is a country
policy_df_day0 = pd.read_csv('/kaggle/input/covid19-data/stringency_index_from_day0.csv')
policy_df_day0 = policy_df_day0.set_index('days_since')

# cleaned country-level data: 
# dates reaching certain thresholds of confirmed cases and country-level covariates
country_df = pd.read_csv('/kaggle/input/code-result/country_background_result.csv')

In [None]:
# aggregate stringency index by week, then bin the weekly means into 10 even intervals
# week 0 is the first week since the date of the first confirmed case
# negative week numbers represent weeks before the date of first case

# convert days to weeks
policy_df_day0['weeks_since'] = policy_df_day0.index.map(lambda n: n // 7)

# calculate weekly means
weekly = policy_df_day0.groupby('weeks_since').mean()

# turn numeric values into 10 intervals
weekly = weekly.apply(lambda col: pd.cut(col, np.arange(0.0, 110.0, 10.0), include_lowest=True), axis=0)

# plot heatmap from week -5 to week 8 (other weeks omitted due to large scale of missing data )
counts = weekly.loc[-5:8].apply(lambda row: row.value_counts(), axis=1).transpose()

# order row names
row_names = counts.index.tolist()
row_names.sort()
row_names.reverse()

counts = counts.reindex(index = row_names)
sns.heatmap(counts, cmap=sns.light_palette("navy"))
plt.xlabel('Week')

# remove column weeks_since
policy_df_day0 = policy_df_day0.drop(columns='weeks_since')

We explored the variations in policy intensity and timing in this heatmap. We first calculated the weekly mean of the stringency index in each country and assigned the mean score to one of ten even intervals between 0 and 100, which is the range of the stringency index. Each cell in the heatmap represents the number of countries with mean scores falling into the given interval in the given week. With the week in which the first Covid-19 case was confirmed labelled as week 0, we show in this figure the distribution of government responses from five weeks before week 0 to eight weeks after week 0.

This heatmap shows that countries are overwhelmingly concentrated in the bottom left corner and the top right corner. The concentration at the bottom left indicates that most countries did not take any action and had a stringency score between 0 and 10 in the weeks before week 0. The concentration at the top right has lighter colors than the bottom left corner does. This illustrates that many countries reached the highest level of policy stringency (scores between 90 and 100) during weeks 5-8, but that some countries remained in the lower intervals.

Based on this heatmap, we choose the period from two weeks before week 0 to four weeks after week 0 as the observation period of early government responses. This observation period captures large variations in governments' early policy stringency and facilitates the classification of countries through cluster analysis.
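The weekly binning behind the heatmap can be illustrated on a single toy series. In this sketch the scores are hypothetical; the point is that floor division assigns days -7 to -1 to week -1 and days 0 to 6 to week 0, and that `pd.cut` then places each weekly mean in one of the ten intervals.

```python
import numpy as np
import pandas as pd

# Hypothetical stringency scores for one country, indexed by days since the first case.
days = np.arange(-7, 14)
scores = pd.Series(np.where(days < 0, 0.0, 45.0), index=days)

# Floor division maps days -7..-1 to week -1, days 0..6 to week 0, and so on.
weeks = scores.index.map(lambda n: n // 7)
weekly_mean = scores.groupby(weeks).mean()

# Ten even intervals covering the 0-100 range of the index.
bins = np.arange(0.0, 110.0, 10.0)
weekly_bin = pd.cut(weekly_mean, bins, include_lowest=True)
print(weekly_bin)
```

Setting `include_lowest=True` matters for this data, because a weekly mean of exactly 0 would otherwise fall outside the first interval.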

In [None]:
country_df.cluster5.value_counts()

In [None]:
# create dataset for plotting below
def create_df_after_N(N):    
    """
    Return a subset of cumulative cases per country, starting the day after the count first exceeded N.
    """
    ls = []

    for i, row in country_df.iterrows():
        if pd.notnull(row['days_to_%i'%N]):
            stop = int(row['days_to_%i'%N]) + 1
            ls.append(case_df_day0[row.country][stop:].reset_index(drop=True))

    return pd.concat(ls, axis=1)

case_df_after_500 = create_df_after_N(500)

In [None]:
order_to_plot = ['Proactive', 'Swiftly Responsive', 'Responsive', 'Moderately Responsive', 'Slow Start']
n_clusters = 5

fig, axes = plt.subplots(2, n_clusters, sharey='row', figsize=(20, 8))

for j in range(n_clusters):
    countries = country_df[country_df.cluster5==order_to_plot[j]].country
    countries = [country for country in countries if country in case_df_after_500.columns]

    # in first row, plot country stringency index
    axes[0, j].plot(policy_df_day0.loc[-15:30, countries], color='tab:blue', alpha=0.4)
    axes[0, j].axvline(x=0, c='black', linestyle=':')
    axes[0, j].set_title('%s (%i)'%(order_to_plot[j], sum(country_df['cluster5']==order_to_plot[j])))
    if j == 0:
        axes[0, j].set_ylabel('Stringency index')
    remove_border(axes[0, j])

    # in second row, plot infection curves since 500th case
    axes[1, j].plot(case_df_after_500[countries], color='tab:red', alpha=0.4)
    axes[1, j].set_xlabel('Day')
    axes[1, j].set_ylim(0, 100000)
    if j == 0:
        axes[1, j].set_ylabel('Cum. cases since 500th case')
    remove_border(axes[1, j])
    
plt.suptitle('Typology of Early National Responses to Covid-19')

Cluster analysis shows that the five-cluster solution best characterizes the heterogeneity in early national responses to Covid-19.

After observing the patterns of policies (the first row in the figure above), we name the clusters as follows:

* Proactive (9 countries): Countries in this group started acting before any Covid-19 case was diagnosed in the nation and then slowly and steadily reached the highest level of the stringency index within one month after the date of the first confirmed case.
* Swiftly Responsive (13): Countries in this group reacted immediately after the occurrence of the first Covid-19 case and typically reached the highest policy intensity within two weeks after the first confirmed case.
* Responsive (25): Countries in this group were initially hesitant to take immediate action following the diagnosis of the first case. However, they strengthened their policy responses between the 10th and 20th day after the first confirmed case and reached a high level by the 30th day.
* Moderately Responsive (9): Countries in this group rose to only a medium level of policy intensity (between 40 and 80) at a slower pace, often taking more than three weeks.
* Slow Start (18): With little proactive action before the first confirmed case, countries in this group had the slowest response to the virus. By the end of one month, most of them had not reached half of the policy intensity of their counterparts in the other categories.


The five figures in the second row each draw the infection curves of the countries in a cluster, starting from the day of the 500th confirmed case and displayed only up to 100,000 cases. As expected, the Proactive and Swiftly Responsive groups accumulated fewer confirmed cases at a slower pace, despite some outliers. The infection curves in the Responsive group rise moderately, indicating a larger number of cases. The two rightmost figures show that the Moderately Responsive group and the Slow Start group experienced the sharpest increases and the largest numbers of confirmed cases.

In [None]:
# display top three largest countries in each cluster (largest population)
top_three = country_df.sort_values(['cluster5', 'PopTotal'], ascending=False).groupby('cluster5').apply(lambda df: df.iloc[:3])

fig, axes = plt.subplots(1, n_clusters, sharex=True, sharey='row', figsize=(20, 4))
for j in range(n_clusters):
    countries = top_three[top_three.cluster5==order_to_plot[j]].country
    countries = [country for country in countries if country in case_df_after_500.columns]

    policy_df_day0.loc[-15:30, countries].plot(ax=axes[j])

    axes[j].set_title(order_to_plot[j])
    axes[j].set_xlabel('Day')
    if j == 0:
        axes[j].set_ylabel('Stringency Index')    
    axes[j].legend()
    remove_border(axes[j])

Here we present the three countries with the largest total populations from each cluster.

### 3.2 Analyse impact of response types based on country attributes
There are significant disparities across countries in terms of their economic size and health levels. We would like to address this context in our analysis. We are interested in whether, among similar countries that implemented different containment and mitigation strategies, the response type has a more visible impact.

There are many attributes we can use. In this study, we selected three metrics: GDP per capita, population density, and life expectancy. See 5.6 for details on how we generated the country categories.

In [None]:
# create country group order based on their wealth and life expectancy level
country_category_wealth_order = ['LowestGDPLowestLifeExp', 'LowerGDPLowerLifeExp', 'MedGDPHigherDensity', 'MedGDPHigherDensity', 'HigherGDPLongerLifeExp', 'HighestGDPLongestLifeExp']

metrics = ['GDP2018PerCapita', 'PopDensity', 'LifeExpectancy2018']
sns.set(style='dark')
fig, axes = plt.subplots(1, 3, sharex=True, figsize=(20, 4))

for index, metric in enumerate(metrics):
    sns.boxenplot(x="country_category_name", y=metric, data=country_df, ax=axes[index%3])
    axes[index%3].set_xticklabels(['LowestGDP\nLowestLifeExp', 'LowerGDP\nLowerLifeExp', 'MedGDP\nHigherDensity', 'MedGDP\nHigherDensity', 'HigherGDP\nLongerLifeExp', 'HighestGDP\nLongestLifeExp'], rotation = 45, ha="right")
    remove_border(axes[index%3])

Based on the boxenplot, we are categorizing the countries as follows:

* LowestGDPLowestLifeExp
* LowerGDPLowerLifeExp
* MedGDPHigherDensity
* MedGDPHigherDensity
* HigherGDPLongerLifeExp
* HighestGDPLongestLifeExp

In [None]:
sns.set(style='dark')
order_to_plot = ['Proactive', 'Swiftly Responsive', 'Responsive', 'Moderately Responsive', 'Slow Start']

fig, axes = plt.subplots(3, 2, sharex=True, sharey=True)
palette = plt.get_cmap('RdPu_r')

for i in range(3):
    for j in range(2):
        cat = country_category_wealth_order[i*2+j] # category name
        
        # plot countries in each policy cluser in a separate color 
        for cluster, df in country_df[country_df.country_category_name==cat].groupby('cluster5'):
            countries = df.country # countries in the category
            color = palette(order_to_plot.index(cluster)*50)
            axes[i, j].plot(case_df_after_500[countries], color=color, alpha=0.6)
        axes[i, j].set_title(cat)
        

# customize legend
lines = [Line2D([0], [0], color=palette(i*50), lw=2) for i in range(5)]
fig.legend(lines, order_to_plot, loc='lower center', ncol=5)

The graph above shows the growth of confirmed cases in each country group. The groups are numbered from 0 to 5: the lower the group number, the fewer resources and the less wealth the country group has on average. We also color-coded the case trajectories by policy response type. Countries with darker colors tended to react more rapidly and intensively to mitigate the spread of Covid-19.

Wealthier countries seem to perform better regardless of their policy response type. We observe a strong pattern among the medium-level countries, where countries that implemented more stringent policies (darker lines) tend to show slower infection growth. Countries in group 1 present a more complicated picture with no apparent pattern. It seems to be harder to predict the effect of policy on containing the pandemic for countries with weak economic conditions and welfare systems.

We will continue this discussion with regression analysis.

### 3.3 Regression Analysis

In addition to descriptive analysis using visualization, we run regression models to evaluate the impact of government measures on infection rates. In this part of the analysis, we operationalize infection rates in three ways: the number of days it took for confirmed cases in a country to climb (1) from 500 to 1,000, (2) from 500 to 2,000, and (3) from 1,000 to 2,000. We regress infection speed on policy categories while holding country-level attributes constant, including total population, population density, GDP, and continent.
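The three dependent variables can be illustrated on a toy series. This is a minimal sketch: the counts are hypothetical, and `first_day_above` is an illustrative helper that mirrors the strict `> N` comparison used in the Appendix rather than the notebook's own function.

```python
def first_day_above(cum_cases, threshold):
    """Return the first index at which cumulative cases exceed threshold, else None."""
    for day, count in enumerate(cum_cases):
        if count > threshold:
            return day
    return None

# Hypothetical cumulative counts, one value per day.
cum_cases = [120, 300, 510, 700, 980, 1040, 1500, 2100]

day_500 = first_day_above(cum_cases, 500)
day_1000 = first_day_above(cum_cases, 1000)
day_2000 = first_day_above(cum_cases, 2000)

# The three operationalizations of infection speed: larger values mean slower growth.
days_500_1000 = day_1000 - day_500
days_500_2000 = day_2000 - day_500
days_1000_2000 = day_2000 - day_1000
print(days_500_1000, days_500_2000, days_1000_2000)
```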

In [None]:
# use Swiftly Responsive as the baseline group: the '0' prefix sorts it first among the category levels
country_df['cluster5_o'] = country_df.cluster5.map(lambda name: '0'+name if name=='Swiftly Responsive' else name)

res = []
res.append(smf.ols(formula='days_500_1000 ~ cluster5_o + np.log(PopTotal) + PopDensity + np.log(GDP2018) + continent', data=country_df).fit())
res.append(smf.ols(formula='days_500_2000 ~ cluster5_o + np.log(PopTotal) + PopDensity + np.log(GDP2018) + continent', data=country_df).fit())
res.append(smf.ols(formula='days_1000_2000 ~ cluster5_o + np.log(PopTotal) + PopDensity + np.log(GDP2018) + continent', data=country_df).fit())

labels = ['DV: days 500-1000', 'DV: days 500-2000', 'DV: days 1000-2000']

for i in range(3):
    x = res[i].params[1:5]
    err = (res[i].conf_int()[1] - res[i].params)[1:5]
    plt.errorbar(x = x, y = np.arange(4)+(i-1)*0.1, xerr = err, fmt=' ', label=labels[i])
    plt.scatter(x = x, y = np.arange(4)+(i-1)*0.1)

    
plt.yticks(np.arange(4), ['Moderately Responsive', 'Proactive', 'Responsive', 'Slow Start'])    
plt.xlabel('Effect Size')
plt.axvline(x=0, linestyle=':', color='black')
plt.legend()    
plt.title('Regression Models of National Response Types on Infection Rates \n (with Controls)')
remove_border()

We present the outcome of our regression models in coefficient plots. The Swiftly Responsive group is the baseline group and thus omitted from the plot. None of the effects are statistically significant, which is expected given the small sample size. Most effects are close to zero, making the clusters indistinguishable in terms of their impact on infection. However, the Proactive group has two positive point estimates of size 1-2, suggesting that countries in the Proactive group may have a slight advantage over the Swiftly Responsive group, taking about one day longer to go from 500 to 2,000 confirmed cases and from 1,000 to 2,000. The Moderately Responsive group has two point estimates reaching -2, meaning that countries in this group experienced faster infection growth than those in the Swiftly Responsive group.
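To make the role of the baseline group concrete, here is a small sketch of dummy coding with plain NumPy least squares instead of statsmodels. The outcome values are hypothetical and no controls are included, so each coefficient is simply the difference between a group's mean and the baseline mean.

```python
import numpy as np

# Hypothetical outcome (days from 500 to 1,000 cases) for three response types.
groups = np.array(['Swiftly Responsive'] * 3 + ['Proactive'] * 3 + ['Slow Start'] * 3)
y = np.array([6.0, 7.0, 8.0, 8.0, 9.0, 10.0, 3.0, 4.0, 5.0])

baseline = 'Swiftly Responsive'
others = [g for g in sorted(set(groups.tolist())) if g != baseline]

# Design matrix: intercept plus one 0/1 indicator per non-baseline group.
X = np.column_stack([np.ones(len(y))] + [(groups == g).astype(float) for g in others])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# The intercept is the baseline mean; each other coefficient is a difference from it.
effects = dict(zip(['Intercept'] + others, coef))
print(effects)
```

A positive coefficient therefore reads as "more days than the Swiftly Responsive group", that is, slower growth, which is how the coefficient plot above should be interpreted.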

In [None]:
# Check for an association between policy response type and the number of cases by day 55,
# controlling for country category

regression_by_country_category = country_df[['country', 'cluster5', 'country_category_name']]

# Find each country's cumulative case count on day 55.
# Because the Covid-19 pandemic is still developing,
# day 55 is one of the latest days for which we have case data for all countries
country_55 = case_df_day0.transpose().reset_index().rename(columns={'index': 'country'})[['country', 55]]

# Add infection number by day 55 to regression_by_country_category
regression_by_country_category = pd.merge(regression_by_country_category, country_55, on='country', how='left')
regression_by_country_category = regression_by_country_category.rename(columns={'cluster5': 'policy_response_type', 55: 'day_55_case_num'})

**Model fitting for countries with less resources**

In [None]:
# Linear Regression for countries with less resources
from statsmodels.formula.api import ols
country0 = regression_by_country_category[regression_by_country_category['country_category_name']=='LowestGDPLowestLifeExp']
fit = ols('day_55_case_num ~ C(policy_response_type)', data=country0).fit() 

print(fit.summary())

**Model fitting for average and wealthier countries**

In [None]:
# Linear Regression for average and wealthier countries
country1to5 = regression_by_country_category[regression_by_country_category['country_category_name']!='LowestGDPLowestLifeExp']
fit = ols('day_55_case_num ~ C(policy_response_type)', data=country1to5).fit() 

print(fit.summary())

We are interested in whether the policy impact on infection is more pronounced when we separate countries with the lowest GDP and life expectancy from average and wealthy countries. The two regressions above evaluate the association between countries' response types and the number of cases 55 days after the first infection in each country. Unfortunately, we found no statistically significant results. This could be due to the limited timeframe or our choice of dependent variable. However, compared with countries in group 0, whose model has a p-value of 0.44, the model for the other countries seems to fit the policy response type slightly better. We also see a larger reduction in case numbers when moving from the `Moderately Responsive` group to the `Proactive` or `Swiftly Responsive` group among average and wealthy countries.

## 4. Summary

Many factors affect the effectiveness of policies. Our analysis illustrates some degree of correlation between a country's policy stringency level and its outcome. For an average country, timely and stringent actions seem to contribute to the mitigation of Covid-19. Yet for many countries, Covid-19 is still at a relatively early stage of development. A stronger correlation may emerge in the future.

Countries with poor pre-existing healthcare conditions and small economies are more vulnerable and may face more uncertainty as the Covid-19 pandemic develops, in that policy impact is harder to predict. For these countries deprived of resources, the current response strategies may still not be enough. Mitigating the situation will require creativity from these countries and possibly international support.

## 5. Appendix: Data Preparation and Additional Analysis

In [None]:
# read raw data

# time series of daily cumulative confirmed cases
case_df = pd.read_csv('/kaggle/input/covid19-data/time_series_covid19_confirmed_global.csv')

# stringency index
policy_df = pd.read_csv('/kaggle/input/covid19-data/OxCGRT_latest.csv')

# country-level covariates
pop_df = pd.read_csv('/kaggle/input/covid19-data/WPP2019_TotalPopulationBySex.csv')
gdp_df = pd.read_csv('/kaggle/input/covid19-data/API_NY.GDP.MKTP.CD_DS2_en_csv_v2_988718.csv')
region_df = pd.read_csv('/kaggle/input/covid19-data/countryContinent.csv')
life_exp_df = pd.read_csv('/kaggle/input/covid19-data/LifeExpectancy.csv')

### 5.1 Find dates when each country reached first, 500th, 1000th, and 2000th confirmed cases

In [None]:
# rename columns
case_df = case_df.rename(columns={'Province/State': 'state', 'Country/Region': 'country'})

# some countries have separate records for their states or provinces. aggregate records by country
cols = case_df.columns.tolist()

# remove not needed columns
cols.remove('state')
cols.remove('Lat')
cols.remove('Long')

# aggregate
case_df = case_df[cols].groupby('country').sum(min_count=1).reset_index()

# rename countries in case_df to match the country names used in policy_df
to_rename = {
    'Korea, South': 'South Korea',
    'US': 'United States', 
    'Taiwan*': 'Taiwan', 
    'Czechia': 'Czech Republic',
    'Slovakia': 'Slovak Republic'
}
case_df.country = case_df.country.map(lambda s: to_rename.get(s, s))

In [None]:
# restrict sample to countries with at least 1000 confirmed cases
print('before:', case_df.shape[0])
case_df = case_df[(case_df.iloc[:, 1:] >= 1000).any(axis=1)]
print('after:', case_df.shape[0])

In [None]:
# retrieve calendar dates. case_df columns are calendar dates
dates = case_df.columns[1:]

def find_first_N_date(row, N):
    """
    Given a row of country daily counts of covid-19 cases and the threshold N, 
    return the date when the number of confirmed cases first surpassed N.
    """
    
    vals = list(row[1:] > N) # first column is country name. skip. 
    
    if True in vals:
        return parser.parse(dates[vals.index(True)])
    return None

# store dates in a new dataframe
country_df = pd.DataFrame({'country': case_df.country})
country_df['date_1'] = case_df.apply(lambda row: find_first_N_date(row, 0), axis=1)

# correct first dates for countries where pandemic started before 1/22
# (country names here are the renamed ones from to_rename above)
country_df.loc[country_df.country=='China', 'date_1'] = datetime(2019, 12, 1)
country_df.loc[country_df.country=='Japan', 'date_1'] = datetime(2020, 1, 16)
country_df.loc[country_df.country=='South Korea', 'date_1'] = datetime(2020, 1, 20)
country_df.loc[country_df.country=='Taiwan', 'date_1'] = datetime(2020, 1, 21)
country_df.loc[country_df.country=='Thailand', 'date_1'] = datetime(2020, 1, 13)
country_df.loc[country_df.country=='United States', 'date_1'] = datetime(2020, 1, 20)

In [None]:
# find dates when the number of confirmed cases in each country reached 500, 1000, and 2000
country_df['date_500'] = case_df.apply(lambda row: find_first_N_date(row, 500), axis=1)
country_df['date_1000'] = case_df.apply(lambda row: find_first_N_date(row, 1000), axis=1)
country_df['date_2000'] = case_df.apply(lambda row: find_first_N_date(row, 2000), axis=1)

# find number of days it took for the confirmed cases to grow from 1 to 500, 1000, and 2000
country_df['days_to_500'] = country_df.apply(lambda row: (row['date_500'] - row['date_1']).days, axis=1)
country_df['days_to_1000'] = country_df.apply(lambda row: (row['date_1000'] - row['date_1']).days, axis=1)
country_df['days_to_2000'] = country_df.apply(lambda row: (row['date_2000'] - row['date_1']).days, axis=1)

# find number of days between 500th case and 1000th case and so on
country_df['days_500_1000'] = country_df.days_to_1000 - country_df.days_to_500
country_df['days_500_2000'] = country_df.days_to_2000 - country_df.days_to_500
country_df['days_1000_2000'] = country_df.days_to_2000 - country_df.days_to_1000

country_df.head()

### 5.2 Restructure time series data of global confirmed cases
Each column represents a country. Each row represents a day. The first row shows case counts on the date of the first confirmed case. The calendar date of the first case varies by country, but we restructured the data to align all countries by day 1 of the pandemic in each country.
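The alignment idea can be sketched on toy data. In this simplified illustration the counts are hypothetical and `align_to_first_case` is not the notebook's `shift_to_day0`: it assumes the first case falls inside the observed period, so it does not handle countries whose first case predates the data.

```python
import pandas as pd

# Hypothetical cumulative counts by calendar day for two countries.
cases = pd.DataFrame(
    {'A': [0, 0, 1, 3, 8], 'B': [0, 2, 5, 9, 20]},
    index=pd.date_range('2020-02-01', periods=5),
)

def align_to_first_case(col):
    """Drop leading days without cases and re-index from 0."""
    started = col[col > 0]
    return started.reset_index(drop=True)

# Columns are countries; row k is the k-th day since that country's first case.
aligned = pd.concat({name: align_to_first_case(cases[name]) for name in cases}, axis=1)
print(aligned)
```

Countries whose outbreak started later have shorter series, so the trailing rows of their columns are filled with NaN.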

In [None]:
def shift_to_day0(row):
    """
    Shift row of daily confirmed cases from starting on 1/22 to starting on the day of first case confirmed. 
    """
    
    # date of first case
    day0 = country_df[country_df.country==row['country']].iloc[0]['date_1']
    
    # first case appeared after 1/22 when data collection started
    if day0 > datetime(2020, 1, 22):    
        dates = case_df.columns.tolist()[1:] 
        start_from = dates.index(f'{day0.month}/{day0.day}/{day0:%y}')  # portable form of '%-m/%-d/%y'
        col = row[1:][start_from:]
        col = col.reset_index(drop=True)
        col.name = row.country
        return col
        
    # first case appeared before data collection 1/22
    # insert None for the missing days
    else: 
        day_first_available = (parser.parse(row[1:].index[0]) - day0).days
        col = pd.Series([None] * (day_first_available) + row[1:].tolist(), name=row.country)
        col = col.reset_index(drop=True)
        return col
        
    return pd.Series([], name=row.country)


case_df_day0 = case_df.apply(shift_to_day0, axis=1).transpose()
case_df_day0.columns = case_df.country.values
case_df_day0.head()

### 5.3 Restructure Country Policy Stringency Index Data
Like the restructured daily case counts, here we create a new dataframe with countries as columns and days as rows. Each cell is the government stringency index of the corresponding country (column name) on the given day (row index). The row index indicates the number of days from the date of the first confirmed case (day 1). For example, the index 10 means 10 days after day 1, and the index -10 means 10 days before day 1, before the pandemic started.
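The row index can be illustrated with a short sketch. It assumes that the policy dataset stores dates as `YYYYMMDD` integers; the first-case date and the helper name `days_since_first_case` are hypothetical.

```python
from datetime import datetime

def days_since_first_case(date_int, first_case_date):
    """Convert a YYYYMMDD integer date into days relative to the first confirmed case."""
    date = datetime.strptime(str(date_int), '%Y%m%d')
    return (date - first_case_date).days

first_case = datetime(2020, 3, 1)  # hypothetical date of the first confirmed case
offsets = [days_since_first_case(d, first_case) for d in (20200220, 20200301, 20200311)]
print(offsets)
```

Negative offsets correspond to days before the first confirmed case, which is the part of the observation window that captures proactive measures.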

In [None]:
# keep track of countries in our sample in select_countries
# only countries in both case count and policy index datasets are included in our sample
select_countries = [country for country in case_df_day0.columns 
                    if country in policy_df.CountryName.values]

# update selected countries in dataframes
case_df_day0 = case_df_day0[select_countries]
country_df = country_df[country_df.country.isin(select_countries)]

In [None]:
cols = []

for country in select_countries:
        
    # get entries for the country
    data = policy_df[policy_df.CountryName==country].copy()  # copy so the new column is not set on a view

    # get date of first confirmed case
    day0 = country_df[country_df.country==country].iloc[0]['date_1']

    # calculate number of days between a given date and the date of first case
    data['days_since'] = data.Date.map(lambda i: (parser.parse(str(i))-day0).days)
    data = data.set_index('days_since')
    col = data['StringencyIndex']
    col.name = country
    cols.append(col)
    
policy_df_day0 = pd.concat(cols, axis=1)
policy_df_day0.loc[0:].head() # show data starting from day 1 of pandemic

In [None]:
# missing from day -15 to day 30
missing = policy_df_day0.loc[-15:30].isnull().sum(axis=0)
missing[missing > 0]

In [None]:
# remove countries with more than 3 missing data points

for country in missing[missing > 3].index:
    select_countries.remove(country)
        
print('new sample size:', len(select_countries))

# update sample of countries in these datasets
policy_df_day0 = policy_df_day0[select_countries]
case_df_day0 = case_df_day0[select_countries]
country_df = country_df[country_df.country.isin(select_countries)]

### 5.4 Create Typology of National Policy Responses Using Cluster Analysis

In [None]:
# construct feature matrix X

# each row is a country. each column is a feature
X_df = policy_df_day0.loc[-15:30].transpose() # our selected observation period
X_df = X_df.reset_index(drop=True)
X_df = X_df.fillna(0)

# to prioritize features prior to the date of first case, multiply those features by 2
X_df2 = X_df.copy()
X_df2.iloc[:, 0:15] = X_df.iloc[:, 0:15]*2
X_2 = X_df2.to_numpy()  # plain ndarray; np.matrix is deprecated and rejected by recent scikit-learn

In [None]:
# elbow test to determine number of clusters
wcss = [] # within-cluster sum of squares
    
for i in range(2, 25):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=500, n_init=10, random_state=0)
    kmeans.fit(X_2)
    wcss.append(kmeans.inertia_)
    
plt.plot(range(2, 25), wcss, marker='.', label='WCSS') # label is needed for plt.legend() below
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.legend()
remove_border()

Considering the percentage reduction in WCSS, the size of each cluster, and the substantive meaning of the cluster solutions, we decided that five clusters are the best solution. Moreover, in an analysis not shown in this notebook, we ran the cluster analysis using the Ward algorithm and found similar cluster solutions.
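The Ward comparison itself is not shown in this notebook. As a hedged sketch of how such a check might be run, the example below fits k-means and Ward agglomerative clustering on synthetic data and measures their agreement with the adjusted Rand index. The synthetic blobs, the variable names, and the choice of agreement metric are assumptions for illustration, not the analysis behind the statement above.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the feature matrix: 75 points around 5 well-separated centers
centers = [[0, 0], [10, 0], [0, 10], [10, 10], [5, 20]]
X_toy, _ = make_blobs(n_samples=75, centers=centers, cluster_std=0.5, random_state=0)

# Same number of clusters for both algorithms
kmeans_labels = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit_predict(X_toy)
ward_labels = AgglomerativeClustering(n_clusters=5, linkage="ward").fit_predict(X_toy)

# The adjusted Rand index ignores label numbering; 1.0 means identical partitions
ari = adjusted_rand_score(kmeans_labels, ward_labels)
print(f"Adjusted Rand index between k-means and Ward: {ari:.2f}")
```

On real policy data the agreement would typically be lower than on clean synthetic blobs, so the index is better read as a rough similarity measure than as a pass/fail test.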

In [None]:
# kmeans clustering
n_clusters = 5
kmeans = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=500, n_init=10, random_state=0)
y5 = kmeans.fit_predict(X_2)

# add result to country_df
country_df['cluster5_num'] = y5

# rename clusters (k-means label numbers are arbitrary; this mapping matches the fit with random_state=0)
map_to = {0: 'Proactive',
          1: 'Slow Start',
          2: 'Responsive',
          3: 'Swiftly Responsive',
          4: 'Moderately Responsive'}
country_df['cluster5'] = country_df.cluster5_num.map(map_to)
country_df.cluster5.value_counts()
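Assigning the label array directly to `country_df` relies on both tables listing the countries in the same order. A minimal sketch of a more defensive alternative, assuming the feature matrix keeps country names as its index, is to attach the labels by name. All names and values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical feature matrix indexed by country name
X = pd.DataFrame([[0, 0], [10, 10], [0, 1]], index=["A", "B", "C"])
labels = np.array([0, 1, 0])  # stand-in for kmeans.fit_predict(X)

# Country table whose row order differs from the feature matrix
country_df = pd.DataFrame({"country": ["C", "A", "B"]})

# Key the labels by country name, then look them up per row
label_by_country = pd.Series(labels, index=X.index)
country_df["cluster_num"] = country_df["country"].map(label_by_country)
print(country_df)
```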

### 5.5 Merge Country-Level Covariates

In [None]:
country_df

### Add Country-Related Background Data: GDP in 2018, population, life expectancy at birth, and sub_region

In [None]:
population_fpath='/kaggle/input/covid19-data/WPP2019_TotalPopulationBySex.csv'
GDP_fpath='/kaggle/input/covid19-data/API_NY.GDP.MKTP.CD_DS2_en_csv_v2_988718.csv'
life_expectancy_fpath='/kaggle/input/covid19-data/LifeExpectancy.csv'

### Add Country GDP

In [None]:
# clean up original data sheet
# note: the first two rows of the original file were deleted manually because they would otherwise offset the table header
GDP_df = pd.read_csv(GDP_fpath)
GDP_df = GDP_df.loc[:,['Country Name', "Country Code","Indicator Name", "2018"]]

# double check that all values are in dollars before dropping the column
print(GDP_df['Indicator Name'].value_counts())

GDP_df = GDP_df.drop(["Indicator Name"], axis=1)

In [None]:
# find countries with no GDP data
for c in country_df.country:
    if c not in GDP_df['Country Name'].values:
        print(c)

In [None]:
# rename countries that have inconsistent names in the two datasheets
to_rename = {'Egypt, Arab Rep.': 'Egypt', 
             'Iran, Islamic Rep.': 'Iran', 
             'Korea, Rep.': 'South Korea', 
             'Russian Federation': 'Russia', 
            'Brunei Darussalam': 'Brunei'}
GDP_df['Country Name'] = GDP_df['Country Name'].map(lambda name: to_rename.get(name, name))
GDP_df = GDP_df.rename(columns={'Country Name': 'country', '2018': 'GDP2018'})

In [None]:
country_background_df = country_df.merge(GDP_df[['country', 'Country Code', 'GDP2018']], on='country', how='left')

In [None]:
# find countries with GDP data as null

country_background_df[country_background_df.GDP2018.isnull()].country

In [None]:
# Manually fill in missing data
# Iran GDP in 2018: 454,012.77 million
# https://data.worldbank.org/indicator/NY.GDP.MKTP.CD?view=map

# select the row by label instead of mixing an index label with positional .iloc
iran_mask = country_background_df['country'] == 'Iran'
country_background_df.loc[iran_mask, 'GDP2018'] = 454012.77*10**6

### Add Life Expectancy

In [None]:
life_expectancy_df = pd.read_csv(life_expectancy_fpath)
life_expectancy_df = life_expectancy_df.loc[:, ['Country Code', '2018']]

In [None]:
# find countries with no life expectancy data
for c in country_background_df['Country Code']:
    if c not in life_expectancy_df['Country Code'].values:
        print(c)
# none

In [None]:
country_background_df = country_background_df.merge(life_expectancy_df[['Country Code', '2018']], on='Country Code', how='left')
country_background_df = country_background_df.rename(columns={'2018': 'LifeExpectancy2018'})

In [None]:
# check whether any country lacks life expectancy data
country_background_df[country_background_df.LifeExpectancy2018.isnull()]
#none

### Add Population Data

In [None]:
# clean up population data
population_df = pd.read_csv(population_fpath)
population_df = population_df[population_df['Time']==2019].loc[:,['Location', 'PopTotal', 'PopDensity']]

In [None]:
# find countries with no population data
for c in country_background_df.country:
    if c not in population_df['Location'].values:
        print(c)

In [None]:
# There are two entries for the U.S.; we use 'United States of America' and ignore the entry that includes dependencies
population_df[population_df.Location.isin(['United States of America', 'United States of America (and dependencies)'])]

In [None]:
to_rename = {
     'Iran (Islamic Republic of)': 'Iran', 
     'Republic of Korea': 'South Korea', 
     'Russian Federation': 'Russia',
     'United States of America': 'United States',
     'Brunei Darussalam': 'Brunei',
    'Czechia': 'Czech Republic',
    'Slovakia': 'Slovak Republic',
    'China, Taiwan Province of China': 'Taiwan',
    'Viet Nam': 'Vietnam',
    'Republic of Moldova': 'Moldova'
}

population_df['Location'] = population_df['Location'].map(lambda name: to_rename.get(name, name))
population_df = population_df.rename(columns={'Location': 'country'})

In [None]:
country_background_df = country_background_df.merge(population_df, on='country', how='left')
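Because the merges above depend on hand-maintained name mappings, it can help to list any countries that still fail to match. The sketch below uses the `indicator` option of `DataFrame.merge` on made-up tables. The country rows and population figures are placeholders, not values from the datasets used here.

```python
import pandas as pd

# Hypothetical left table (our sample) and right table (an external source)
left = pd.DataFrame({"country": ["Iran", "Vietnam", "Taiwan"]})
right = pd.DataFrame({"country": ["Iran", "Viet Nam"], "PopTotal": [1000.0, 2000.0]})

# Harmonize names in the external source before merging
to_rename = {"Viet Nam": "Vietnam"}
right["country"] = right["country"].replace(to_rename)

# indicator=True adds a _merge column that records where each row came from
merged = left.merge(right, on="country", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "country"].tolist()
print(unmatched)
```

Rows marked `left_only` are countries in the sample with no counterpart in the external table, so they would carry missing values into the later clustering step.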

### 5.6 Cluster analysis to group countries based on their backgrounds

In [None]:
from sklearn import preprocessing

# select the features used for k-means clustering
kmeans_countries = country_background_df.loc[:, ['GDP2018', 'LifeExpectancy2018', 'PopDensity', 'PopTotal']]

kmeans_countries

In [None]:
# Process data

# Convert population from thousands to individual persons
kmeans_countries['PopTotal'] = kmeans_countries['PopTotal']*1000.00
# Get GDP per capita
kmeans_countries['GDP2018'] = kmeans_countries['GDP2018']/kmeans_countries['PopTotal']
# Squaring LifeExpectancy to further differentiate data distribution
kmeans_countries['LifeExpectancy2018'] = kmeans_countries['LifeExpectancy2018']**2
kmeans_countries = kmeans_countries.drop(['PopTotal'], axis=1)
kmeans_countries

In [None]:
# elbow analysis
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12))
visualizer.fit(kmeans_countries.dropna())
visualizer.show()

In [None]:
# Use cluster analysis to categorize countries
from sklearn.cluster import KMeans
country_clusters = 6
kmeans_countries = kmeans_countries.dropna()
kmeans = KMeans(n_clusters=country_clusters, init='k-means++', max_iter=500, n_init=10, random_state=0)
y = kmeans.fit_predict(kmeans_countries)

# align labels on the row index so that any rows removed by dropna() stay unlabeled
country_background_df['country_category'] = pd.Series(y, index=kmeans_countries.index)
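The clustering above runs on raw feature values, so the feature with the largest numeric range has the most influence on the distances. One common alternative, shown here only as a sketch and not as the method used in this notebook, is to standardize each feature first. The six rows of data and the column names are invented.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical country features on very different scales
features = pd.DataFrame({
    "gdp_per_capita": [500.0, 800.0, 40000.0, 45000.0, 42000.0, 600.0],
    "life_expectancy": [60.0, 62.0, 81.0, 82.0, 80.0, 61.0],
    "pop_density": [50.0, 400.0, 60.0, 300.0, 100.0, 80.0],
})

# Rescale each column to zero mean and unit variance
scaled = StandardScaler().fit_transform(features)

# Cluster on the standardized values
labels = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit_predict(scaled)
features["cluster"] = labels
print(features)
```

Whether to standardize is a modeling choice: it changes which countries group together, so the resulting categories would need to be re-inspected and re-labeled.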

In [None]:
# We use 6 clusters for simplicity and for clearer separation between groups
# Convert GDP to GDP per capita (PopTotal is in thousands)
country_background_df['GDP2018PerCapita'] = country_background_df['GDP2018']/(country_background_df['PopTotal']*1000.0)

In [None]:
# Create boxen plots of each metric by country category

metrics = ['GDP2018PerCapita', 'PopDensity', 'LifeExpectancy2018']
sns.set(style='whitegrid')
fig, axes = plt.subplots(1, 3, sharex=True, figsize=(20, 4))

for index, metric in enumerate(metrics):
    sns.boxenplot(x="country_category", y=metric, data=country_background_df, ax=axes[index%3])
    remove_border(axes[index%3])

group_label = {
    0: 'MedGDPLongerLifeExp',
    1: 'LowerGDPLowerLifeExp',
    2: 'HigherGDPLongerLifeExp',
    3: 'MedGDPHigherDensity',
    4: 'LowestGDPLowestLifeExp',
    5: 'HighestGDPLongestLifeExp',
}

# create country group order based on their wealth and life expectancy level
country_category_wealth_order = ['LowestGDPLowestLifeExp', 'LowerGDPLowerLifeExp', 'MedGDPHigherDensity', 'MedGDPLongerLifeExp', 'HigherGDPLongerLifeExp', 'HighestGDPLongestLifeExp']

country_background_df['country_category_name'] = country_background_df['country_category'].map(group_label)

### 5.7 Observe infection rate growth by country categories

In [None]:
# Group countries by their categories and merge with case numbers for each country
result = country_background_df.loc[:,['country', 'country_category_name', 'cluster5']]
result = result.groupby('country_category_name')

cases_by_country = {}
case_df_after_500_transpose = case_df_after_500.transpose().reset_index().rename(columns={'index': 'country'})
for val in country_category_wealth_order:
    cases_by_country[val] = pd.merge(
        result.get_group(val).drop(columns=['country_category_name']),
        case_df_after_500_transpose,
        on='country', 
        how='left'
    )

In [None]:
# Plot country group's color-coded infection case development
new_group_label = {
    0: 'LowestGDPLowestLifeExp',
    1: 'LowerGDPLowerLifeExp',
    2: 'MedGDPHigherDensity',
    3: 'MedGDPLongerLifeExp',
    4: 'HigherGDPLongerLifeExp',
    5: 'HighestGDPLongestLifeExp',
}

def get_cluster_number5(cluster):
    if (cluster=='Proactive'):
        return 0
    elif (cluster=='Swiftly Responsive'):
        return 1
    elif (cluster=='Responsive'):
        return 2
    elif (cluster=='Moderately Responsive'):
        return 3
    elif (cluster=='Slow Start'):
        return 4
    else:
        return

sns.set(style='dark')
palette = plt.get_cmap('RdPu_r')
fig, axes = plt.subplots(3, 2, sharex=True, sharey=True, figsize=(14, 8))

for index, val in enumerate(country_category_wealth_order):
    curr_df = cases_by_country[val].transpose().iloc[3:]
    for column in cases_by_country[val].transpose().iloc[3:]:
        color_num = get_cluster_number5(cases_by_country[val].transpose()[column]['cluster5'])
        axes[index//2, index%2].plot(curr_df.index, curr_df[column], color=palette(color_num*50), alpha=0.6)
    axes[index//2, index%2].set_title('Group'+str(index)+': '+val)
    axes[index//2, index%2].set_ylim([0,200000])
    remove_border(axes[index//2, index%2])

# legend colors must match palette(color_num*50) from get_cluster_number5
patch0 = mpatches.Patch(color=palette(0), label='Proactive')
patch1 = mpatches.Patch(color=palette(50), label='Swiftly Responsive')
patch2 = mpatches.Patch(color=palette(100), label='Responsive')
patch3 = mpatches.Patch(color=palette(150), label='Moderately Responsive')
patch4 = mpatches.Patch(color=palette(200), label='Slow Start')

fig.legend(loc='lower center', ncol=5, handles=[patch0, patch1, patch2, patch3, patch4])

In [None]:
# Create dataframe for regression analysis
response_type_map = {
    'Proactive': 0,
    'Swiftly Responsive': 1,
    'Responsive': 2,
    'Moderately Responsive': 3,
    'Slow Start': 4,
}

# copy so that adding a column does not write to a view of country_background_df
regression_by_country_category = country_background_df[['country', 'cluster5', 'country_category_name']].copy()
regression_by_country_category['policy_response_order'] = regression_by_country_category['cluster5'].map(response_type_map)

In [None]:
# double check that all countries have data by day 55
case_df_day0.transpose()[55].isna().sum()

In [None]:
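# day-55 case counts per country (rows of case_df_day0 are days since the first case, columns are countries)
country_55 = case_df_day0.loc[55].rename_axis('country').reset_index()
regression_by_country_category = pd.merge(regression_by_country_category, country_55, on='country', how='left')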
regression_by_country_category = regression_by_country_category.rename(columns={'cluster5': 'policy_response_type', 55: 'day_55_case_num'})

In [None]:
# Linear regression for countries with the fewest resources (lowest GDP and lowest life expectancy)
from statsmodels.formula.api import ols
country0 = regression_by_country_category[regression_by_country_category['country_category_name']=='LowestGDPLowestLifeExp']
fit = ols('day_55_case_num ~ C(policy_response_type)', data=country0).fit() 

fit.summary()

In [None]:
# Linear Regression for average and wealthier countries
country1to5 = regression_by_country_category[regression_by_country_category['country_category_name']!='LowestGDPLowestLifeExp']
fit = ols('day_55_case_num ~ C(policy_response_type)', data=country1to5).fit() 

fit.summary()
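As a rough illustration of what the `C(policy_response_type)` term does in the regressions above, the sketch below dummy-codes a categorical predictor by hand and solves the least-squares problem with NumPy. The case counts are made up, and the example assumes the default treatment coding, in which the alphabetically first category serves as the baseline.

```python
import numpy as np
import pandas as pd

# Hypothetical day-55 case counts for three response types
toy = pd.DataFrame({
    "policy_response_type": ["Proactive", "Proactive", "Slow Start",
                             "Slow Start", "Responsive", "Responsive"],
    "day_55_case_num": [100.0, 300.0, 5000.0, 7000.0, 900.0, 1100.0],
})

# One indicator column per category except the baseline ("Proactive")
dummies = pd.get_dummies(toy["policy_response_type"], drop_first=True, dtype=float)

# Design matrix: intercept column plus the indicator columns
design = np.column_stack([np.ones(len(toy)), dummies.to_numpy()])
coef, *_ = np.linalg.lstsq(design, toy["day_55_case_num"].to_numpy(), rcond=None)

result = dict(zip(["Intercept"] + list(dummies.columns), coef))
print(result)
```

With a single categorical predictor, the intercept is the mean of the baseline group and each remaining coefficient is that group's mean minus the baseline mean, which is how the coefficients in `fit.summary()` can be read.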