# Goal

Our predictions are good overall, but the predictions for some states are noticeably less accurate.

We tried to improve this by clustering "connected" states according to the number of people travelling between them, but this did not work very well.

We've looked around for more data, and have found a database of government policy responses. Can we use this to cluster states? 



In [1]:
import numpy as np
import pandas as pd

In [2]:
# Set low_memory=False so pandas infers dtypes from the whole file at once
# and stops complaining about mixed dtypes.
# This is probably a bad decision; passing explicit dtypes would be safer.
df = pd.read_csv('../data/policy-combined.csv', low_memory=False)
df.head()
df.describe()

Unnamed: 0,Date,C1_School closing,C1_Flag,C2_Workplace closing,C2_Flag,C3_Cancel public events,C3_Flag,C4_Restrictions on gatherings,C4_Flag,C5_Close public transport,...,StringencyIndex,StringencyIndexForDisplay,StringencyLegacyIndex,StringencyLegacyIndexForDisplay,GovernmentResponseIndex,GovernmentResponseIndexForDisplay,ContainmentHealthIndex,ContainmentHealthIndexForDisplay,EconomicSupportIndex,EconomicSupportIndexForDisplay
count,88510.0,84623.0,62019.0,84185.0,57447.0,84207.0,61224.0,84189.0,57507.0,84194.0,...,83843.0,84639.0,83843.0,84639.0,76083.0,76885.0,83794.0,84595.0,76077.0,76880.0
mean,20200610.0,1.826525,0.801142,1.336354,0.767316,1.286247,0.872044,2.261744,0.811206,0.562368,...,50.692412,50.71804,56.776758,56.811443,48.413032,48.493649,50.747201,50.815922,40.027045,40.079832
std,315.1221,1.266415,0.399144,1.063181,0.422546,0.866134,0.334044,1.674902,0.391349,0.719219,...,29.479242,29.393,30.681218,30.582344,26.167773,26.091436,27.346344,27.265819,33.726705,33.706564
min,20200100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,20200320.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,...,25.93,25.93,34.52,35.71,30.36,30.95,31.6,31.94,0.0,0.0
50%,20200620.0,2.0,1.0,2.0,1.0,2.0,1.0,3.0,1.0,0.0,...,57.41,57.41,66.67,66.67,57.44,57.44,59.72,59.72,37.5,37.5
75%,20200910.0,3.0,1.0,2.0,1.0,2.0,1.0,4.0,1.0,1.0,...,75.0,75.0,80.95,80.95,68.15,68.15,72.22,72.22,62.5,62.5
max,20201130.0,3.0,1.0,3.0,1.0,2.0,1.0,4.0,1.0,2.0,...,100.0,100.0,100.0,100.0,95.54,95.54,98.96,98.96,100.0,100.0
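
The `low_memory=False` argument in the `read_csv` call above silences the mixed-dtype warning but does not show which columns cause it. A minimal sketch of one way to find them is to read every column as a string and check which ones contain both numeric and non-numeric entries. The CSV text and the helper name `mixed_columns` below are hypothetical stand-ins, not taken from `policy-combined.csv`.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the real file (not real data).
csv_text = (
    "RegionName,C1_Flag,Date\n"
    "Alabama,1,20200301\n"
    "Alaska,,20200302\n"
    "Texas,yes,20200303\n"
)

# Read everything as strings so pandas does no dtype guessing at all.
raw = pd.read_csv(io.StringIO(csv_text), dtype=str)


def mixed_columns(frame):
    """Map each column holding both numeric and non-numeric entries
    to the sorted list of its non-numeric values."""
    mixed = {}
    for name in frame.columns:
        present = frame[name].dropna()
        parsed = pd.to_numeric(present, errors="coerce")
        bad = present[parsed.isna()]
        # Mixed means: some, but not all, present values fail to parse.
        if 0 < len(bad) < len(present):
            mixed[name] = sorted(set(bad))
    return mixed


mixed = mixed_columns(raw)
print(mixed)
```

Once the offending columns are known, they could be passed to `read_csv` through its `dtype` parameter instead of relying on `low_memory=False`.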


In [3]:
# Keep only US rows reported at the state level.
is_us_state = (df['CountryCode'] == 'USA') & (df['Jurisdiction'] == 'STATE_TOTAL')
df = df.loc[is_us_state].copy()

In [4]:
states = np.unique(df['RegionName'])

# Median GovernmentResponseIndex per state over the whole date range.
scores = (
    df.groupby('RegionName')['GovernmentResponseIndex']
    .median()
    .to_dict()
)
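
One possible way to turn these per-state medians into groups is simple quantile binning. The sketch below uses `pd.qcut` on a handful of the computed medians. The choice of three tiers and the names `sample_scores`, `tiers` and `clusters` are assumptions made for illustration. This is a coarse grouping on a single number per state, not a full clustering method.

```python
import pandas as pd

# A few of the state medians from `scores`; the full dict would be used in practice.
sample_scores = {
    'South Dakota': 44.35, 'Oklahoma': 46.43, 'Missouri': 48.21,
    'Texas': 54.17, 'Florida': 59.52, 'Ohio': 61.9,
    'California': 71.73, 'Connecticut': 72.02, 'New York': 77.08,
}

series = pd.Series(sample_scores, name='GovernmentResponseIndex')

# Split the states into three equal-sized tiers by quantile.
# The number of tiers (3) is an arbitrary choice for this sketch.
tiers = pd.qcut(series, q=3, labels=['low', 'mid', 'high'])

# Collect the states that fall into each tier.
clusters = {
    label: sorted(tiers[tiers == label].index.tolist())
    for label in tiers.cat.categories
}
print(clusters)
```

Quantile bins always hold roughly the same number of states, so they can split states with nearly identical scores. A distance-based method on the full set of policy indicators might give more meaningful groups.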

In [5]:
print(scores)

{'Alabama': 54.17, 'Alaska': 61.31, 'Arizona': 62.2, 'Arkansas': 57.14, 'California': 71.73, 'Colorado': 63.69, 'Connecticut': 72.02, 'Delaware': 67.26, 'Florida': 59.52, 'Georgia': 60.71, 'Hawaii': 69.05, 'Idaho': 51.79, 'Illinois': 56.55, 'Indiana': 59.52, 'Iowa': 58.33, 'Kansas': 55.65, 'Kentucky': 69.05, 'Louisiana': 61.31, 'Maine': 70.83, 'Maryland': 65.18, 'Massachusetts': 64.88, 'Michigan': 56.85, 'Minnesota': 56.55, 'Mississippi': 58.93, 'Missouri': 48.21, 'Montana': 59.23, 'Nebraska': 58.33, 'Nevada': 60.71, 'New Hampshire': 51.19, 'New Jersey': 64.29, 'New Mexico': 72.02, 'New York': 77.08, 'North Carolina': 64.88, 'North Dakota': 48.51, 'Ohio': 61.9, 'Oklahoma': 46.43, 'Oregon': 60.71, 'Pennsylvania': 63.99, 'Rhode Island': 71.43, 'South Carolina': 56.85, 'South Dakota': 44.35, 'Tennessee': 56.25, 'Texas': 54.17, 'Utah': 55.95, 'Vermont': 70.24, 'Virgin Islands': 57.14, 'Virginia': 56.85, 'Washington': 62.8, 'Washington DC': 65.48, 'West Virginia': 58.93, 'Wisconsin': 59.52,