# Background

Data was gathered from the [NYT GitHub page](https://github.com/nytimes/covid-19-data), [USDA](https://www.nrcs.usda.gov/wps/portal/nrcs/detail/national/home/?cid=nrcs143_013697), and the US Census [population](https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/) and [land area](https://www.census.gov/geographies/reference-files/2010/geo/state-area.html) pages.

This notebook is the first in a series that analyzes covid data from head to toe. The goals for this notebook include:

* Importing data as gathered by the NYT
* Merging covid data with geographical and census datasets

The notebook is divided into four sections:

1. Setup: importing of libraries, creating helper functions
2. County Data: reading and merging all data related to counties including mask usage survey results, covid cases by county, county populations
3. State Data: reading and merging all data related to states including covid cases by state, state populations, state geographical data
4. Save to DB: saving the newly created `county_time`, `state_time`, `df_county`, `df_state`, and `df_mand` dataframes to a SQLite database.

## NYT Methodology

The NYT used mask-survey responses to estimate mask-use at the county level.

> To transform raw survey responses into county-level estimates, the survey data was weighted by age and gender, and survey respondents’ locations were approximated from their ZIP codes. Then estimates of mask-wearing were made for each census tract by taking a weighted average of the 200 nearest responses, with closer responses getting more weight in the average. These tract-level estimates were then rolled up to the county level according to each tract’s total population.
> 
> By rolling the estimates up to counties, it reduces a lot of the random noise that is seen at the tract level. In addition, the shapes in the map are constructed from census tracts that have been merged together — this helps in displaying a detailed map, but is less useful than county-level in analyzing the data.
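
The weighting described in the quote can be illustrated with a minimal sketch. The NYT does not publish the exact weights, so the inverse-distance weighting, the helper names `tract_estimate` and `county_rollup`, and the toy numbers below are all assumptions made for illustration only.

```python
import numpy as np

def tract_estimate(distances, responses, k=200):
    """Weighted average of the k nearest responses; closer responses weigh more.
    Inverse-distance weights are an assumption, not the NYT's published method."""
    distances = np.asarray(distances, dtype=float)
    responses = np.asarray(responses, dtype=float)
    nearest = np.argsort(distances)[:k]          # indices of the k closest responses
    weights = 1.0 / (distances[nearest] + 1e-9)  # small constant avoids division by zero
    return np.average(responses[nearest], weights=weights)

def county_rollup(tract_estimates, tract_populations):
    """Roll tract estimates up to a county, weighting each tract by its population."""
    return np.average(tract_estimates, weights=tract_populations)

# toy data: 1 = respondent always wears a mask, 0 = does not
tract_a = tract_estimate(distances=[1.0, 2.0, 4.0], responses=[1, 1, 0])
tract_b = tract_estimate(distances=[1.0, 1.0], responses=[1, 0])
county = county_rollup([tract_a, tract_b], tract_populations=[3000, 1000])
print(round(tract_a, 3), round(tract_b, 3), round(county, 3))
```

Because the larger tract dominates the rollup, the county figure sits closer to `tract_a` than to `tract_b`.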

# Setup

## Libraries

In [None]:
import pandas as pd # dataframe analysis and manipulation
import numpy as np # mostly for np.nan

from bs4 import BeautifulSoup # for scraping
import requests # for downloading html files

## Helper Functions

In [None]:
def github_link_formatter(url):
    '''
    Formats a given direct github-file url so it can be used with pd.read_excel() or pd.read_csv()
    '''
    url = url.replace('github.com','raw.githubusercontent.com')
    url = url.replace('/blob','')
    
    return url
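
As a quick check of what the helper does, the snippet below repeats the function (so that it runs on its own) and applies it to the mask-mandate URL used later in this notebook. Note that every occurrence of `/blob` is removed, so a repository or file name containing that text would also be altered.

```python
def github_link_formatter(url):
    # swap the web host for the raw-content host and drop the /blob path segment
    url = url.replace('github.com', 'raw.githubusercontent.com')
    url = url.replace('/blob', '')
    return url

page_url = 'https://github.com/pomkos/nyt-covid-data/blob/master/data/added_data/mask_mandates.xlsx'
raw_url = github_link_formatter(page_url)
print(raw_url)
```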

In [None]:
def list_diff(list1,list2):
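    '''
    Finds the items of the longer list that are missing from the shorter list.
    Items that appear only in the shorter list are not reported.
    
    return
    ------
    list_difference: dict
        Maps each missing item to a note naming the list it is missing from.
    '''
    list_difference = {}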
    
    if len(list1) > len(list2):
        bigger = list1
        smaller = list2
        small_list = 'list2'
    else:
        bigger = list2
        smaller = list1
        small_list = 'list1'
        
    for item in bigger:
        if item not in smaller:
            list_difference[item] = f'missing from {small_list}'

    return list_difference
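
`list_diff` only looks in one direction, from the longer list to the shorter one. If a two-way comparison is ever needed, a set-based variant could look like the sketch below. The name `two_way_diff` is hypothetical and is not used elsewhere in this notebook.

```python
def two_way_diff(list1, list2):
    """Hypothetical helper: report items that appear in only one of the two lists."""
    set1, set2 = set(list1), set(list2)
    diff = {item: 'missing from list2' for item in set1 - set2}
    diff.update({item: 'missing from list1' for item in set2 - set1})
    return diff

result = two_way_diff(['Adams', 'Baker', 'Clark'], ['Baker', 'Dade'])
print(result)
```

Sets also make each membership test constant time, which helps when comparing thousands of county names.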

# County Data

__Goals:__

* Merge mask dataset with latest county cases and populations
* Create a separate covid cases dataset for time domain 

## Mask Data

In [None]:
df_mask = pd.read_csv("../input/nytimes-covid19-data/mask-use/mask-use-by-county.csv")
df_mask.columns = df_mask.columns.str.lower()
df_mask.head()

## Mask Mandate Data

To study the effect of masks on the spread of covid, it would be useful to know when masks were mandated. To do this, we gathered data from [AARP](https://www.aarp.org/health/healthy-living/info-2020/states-mask-mandates-coronavirus.html) and [CNN](https://www.cnn.com/2020/06/19/us/states-face-mask-coronavirus-trnd/index.html). Where the two sources gave different dates, the earlier date was used. The `type` column holds tags describing the mandate currently in effect.

In [None]:
url = github_link_formatter('https://github.com/pomkos/nyt-covid-data/blob/master/data/added_data/mask_mandates.xlsx')

df_mand = pd.read_excel(url, skiprows=2)
df_mand.columns = df_mand.columns.str.lower()

In [None]:
df_mand['type'] = df_mand['type'].str.lower()
df_mand['type_split'] = df_mand['type'].str.split(',')

In [None]:
del df_mand['type_split']

In [None]:
def who_exempt(cell):
    if pd.isna(cell):
        return 'no mandate'
    elif 'children' in cell:
        return 'child exempt'
    elif 'toddler' in cell:
        return 'toddler exempt'
    else:
        return 'no exemptions'

In [None]:
df_mand['children_toddlers_none'] = df_mand['type'].apply(who_exempt)

In [None]:
df_mand['month_mandate'] = df_mand['date'].dt.month
df_mand['month_mandate'] = df_mand['month_mandate'].fillna('no mandate')

In [None]:
df_mand.head()

States implemented mask orders at different times, so it may be more useful to look at blocks of time.

In [None]:
import datetime as dt

In [None]:
def mandate_when(x):
    if pd.isna(x):
        return 'No Mandate'
    elif x<dt.datetime.strptime('20200515','%Y%m%d'):
        return 'Before May 15'
    elif x>dt.datetime.strptime('20200715','%Y%m%d'):
        return 'After Jul 15'
    else:
        return 'In Between'

In [None]:
df_mand['mandate_when'] = df_mand['date'].apply(mandate_when)
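
The same three-way split can also be written without `apply`. The sketch below uses `np.select` on a toy series of dates; the dates are made up for illustration and are not taken from `df_mand`.

```python
import numpy as np
import pandas as pd

# toy mandate dates; None stands for a state without a mandate
dates = pd.Series(pd.to_datetime(['2020-04-17', '2020-06-26', '2020-08-01', None]))

# conditions are checked in order, the first match wins
conditions = [
    dates.isna().to_numpy(),
    (dates < pd.Timestamp('2020-05-15')).to_numpy(),
    (dates > pd.Timestamp('2020-07-15')).to_numpy(),
]
labels = ['No Mandate', 'Before May 15', 'After Jul 15']

# anything that matches no condition falls between the two cutoffs
blocks = pd.Series(np.select(conditions, labels, default='In Between'))
print(blocks.tolist())
```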

In [None]:
df_mand.head()

## Covid Data

In [None]:
county_time = pd.read_csv("../input/nytimes-covid19-data/us-counties.csv",parse_dates=['date'])

In [None]:
county_time = county_time.astype({
    'county':str,
    'fips':float,
    'cases':int,
    'deaths':int
})

In [None]:
# Rename columns
county_time.columns = ['date','county','state','fips','covid_cases','covid_deaths']
# Rearrange columns
county_time = county_time[['date','state','county','fips','covid_cases','covid_deaths']]
county_time.head()

In an earlier (now removed) merge of the NYT `df_mask` and NYT `county_time` datasets, we found that the `df_mask` dataset was missing some counties. To double check which counties these are, the [USDA link](https://www.nrcs.usda.gov/wps/portal/nrcs/detail/national/home/?cid=nrcs143_013697) was scraped for all fips, name, and state data.

## What was Missing?

In [None]:
response = requests.get("https://www.nrcs.usda.gov/wps/portal/nrcs/detail/national/home/?cid=nrcs143_013697")
soup = BeautifulSoup(response.content, 'html.parser') # name the parser explicitly to avoid a bs4 warning

In [None]:
table_tag = soup.find(class_='data')

# collect one dict per table row, then build the dataframe once
# (DataFrame.append was removed in pandas 2.0)
rows = []
for tr in table_tag.find_all('tr')[1:]:
    tds = tr.find_all('td')
    rows.append({'fips':tds[0].text, 'county':tds[1].text, 'state':tds[2].text})

county_scrape = pd.DataFrame(rows, columns=['fips','county','state'])

In [None]:
county_scrape = county_scrape.reset_index(drop=True)
county_scrape = county_scrape.astype({
    'fips':int,
    'county':str,
    'state':'category'
})

In [None]:
county_scrape.head()

In [None]:
missing = list_diff(county_scrape['county'].unique(),county_time['county'].unique())
missing_df = pd.DataFrame.from_dict(missing,orient='index')
scraped = county_scrape.set_index('county')
missing_df = missing_df.sort_index().reset_index()
missing_from_scrape = scraped.merge(missing_df, left_on='county',right_on='index')

In [None]:
missing_from_scrape.groupby('state').count().sort_values('fips',ascending=False)

The vast majority of missing counties are from Virginia and Alaska.

In [None]:
missing_from_scrape[(missing_from_scrape['state']=='AS') ] # repeated for other states

The reasons for missing counties include:

* Virginia's missing counties are all cities
* Alaska: unclear why
* Louisiana: unincorporated communities or parishes
* MO: De Kalb has a population of 220, is part of St. Joseph statistical area.
* MO: St. Francois has a population of 65k, but the county seat is in Farmington. The rest are similar, with the county seat being in another county.
* NY: boroughs of NYC, St. Lawrence has county seat in Canton
* AS: Indian reservation

The remaining states each have four or fewer missing counties; the reasons are assumed to be similar to those above.

## Merging Mask and County Covid Data

The NYT mask data is merged with the NYT-included `county_time` dataset. This allows us to see the number of covid cases per county, along with the reported mask use.

Because `county_time` is a timeseries, we first filter to include only the latest total cases (from August 15, 2020) and then take the mean per county. In this way we should have one row per county.

In [None]:
df_mask.head()

In [None]:
county_time.sort_values('date', ascending=False).head(2)

In [None]:
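# average only the numeric columns; recent pandas raises an error when asked for the mean of strings
county_cases_aug = (county_time[county_time['date'] == '20200815']
                    .groupby('fips')[['covid_cases','covid_deaths']]
                    .mean()
                    .reset_index())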

In [None]:
county_cases_aug

In [None]:
county_mask = county_cases_aug.merge(df_mask,left_on='fips',right_on='countyfp')

In [None]:
county_mask = county_mask[['fips', 'covid_cases', 'covid_deaths', 'never', 'rarely', 'sometimes','frequently', 'always']]
county_mask.columns = ['fips', 'covid_cases', 'covid_deaths', 'mask_never', 'mask_rarely', 'mask_sometimes','mask_frequently', 'mask_always']

In [None]:
county_mask.head()

So far we:

* Imported and cleaned the `df_mask` dataframe. It includes survey results of how often people wear masks in each county.
* Imported and cleaned the `county_time` dataframe, then filtered for the most recent data on Aug 15, 2020. This includes total (not new) covid19 cases and deaths.
* Merged the two dataframes into one `county_mask` dataframe. This includes all NYT data to date (Aug 16, 2020).

__Problem:__
After using the `groupby()` function all `str` type columns were removed (as you cannot take the mean of strings). We lost the county and state names, but the `fips` values remained.

__Solution:__ We can merge our new `county_mask` dataset with another dataset to get these names back. Since we want to find the population per county anyways, we will use the [US Census 2019 population estimate](https://www.census.gov/data/datasets/time-series/demo/popest/2010s-counties-total.html) dataset to get county name, state name, county population, and some other variables.
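
One alternative, not used in this notebook, is to keep the names during the aggregation itself. The sketch below uses named aggregation on a toy frame with made-up numbers, taking the first `state` and `county` seen for each `fips`.

```python
import pandas as pd

# toy stand-in for county_time; all values are made up
toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-08-14', '2020-08-15', '2020-08-15']),
    'state': ['Ohio', 'Ohio', 'Ohio'],
    'county': ['Adams', 'Adams', 'Allen'],
    'fips': [39001.0, 39001.0, 39003.0],
    'covid_cases': [100, 110, 250],
    'covid_deaths': [2, 3, 5],
})

latest = toy[toy['date'] == '2020-08-15']

# text columns take the first value per group, numeric columns take the mean
per_county = latest.groupby('fips', as_index=False).agg(
    state=('state', 'first'),
    county=('county', 'first'),
    covid_cases=('covid_cases', 'mean'),
    covid_deaths=('covid_deaths', 'mean'),
)
print(per_county)
```

The Census merge is still needed for population figures, so this only saves the step of recovering the names.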

## County Population

__NOTE:__ last 6 rows of the raw excel file contain disclaimers and citing instructions

__NOTE:__ When shown without a date variable, county covid data from here on will reflect the mean of only "recent" (August 15, 2020) figures.

In [None]:
url2 = github_link_formatter('https://github.com/pomkos/nyt-covid-data/blob/master/data/added_data/countypop.xlsx')
countypop = pd.read_excel(url2)
countypop.columns = countypop.columns.str.lower()

The following keys were provided in a separate pdf file by the US census, letting us map them for interpretation.

In [None]:
region_key = {
    1:'northeast',
    2:'midwest',
    3:'south',
    4:'west'
}
division_key = {
    1:'new_england',
    2:'middle_atlantic',
    3:'east_north_central',
    4:'west_north_central',
    5:'south_atlantic',
    6:'east_south_central',
    7:'west_south_central',
    8:'mountain',
    9:'pacific'
}
sumlev_key = {
    40:'state_or_equiv',
    50:'county_or_equiv'
}

In [None]:
# select only the columns that are relevant, which is the latest (2019) estimates
countypop = countypop[['sumlev','region','division','state','county','stname','ctyname','popestimate2019',
                       'births2019','internationalmig2019','domesticmig2019','rbirth2019','rdeath2019']]
countypop.columns = ['sumlev','region_fips','division_fips','state_fips','county_fips','state','county',
                     'population','births','intnl_migration','domestic_migration','birth_rate','death_rate']
countypop.head()

In [None]:
# find the counties that are present in our dataset
cty_fip = county_time[['state','county','fips']].groupby(['state','county']).mean().reset_index()
cty_fip.head()

It looks like the two datasets have very similar naming styles, with the exception that the NYT `county_time` dataset does not include the words `County` or `Parish` after each territory. These are removed from the Census dataset, along with any spaces, and then merged on relevant county and state names.

In [None]:
# format for merging
countypop['county'] = countypop['county'].str.replace('County','')
countypop['county'] = countypop['county'].str.replace('Parish','')
countypop['county'] = countypop['county'].str.replace(' ','')
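
Note that removing every space also changes multi-word names: `Los Angeles County` becomes `LosAngeles`, which will not equal a name spelled `Los Angeles`. A possible alternative, shown here as a sketch on a toy series rather than on `countypop`, is to strip only the trailing suffix.

```python
import pandas as pd

names = pd.Series(['Los Angeles County', 'Roger Mills County', 'Acadia Parish', 'Adams County'])

# remove only a trailing " County" or " Parish"; spaces inside the name are kept
cleaned = names.str.replace(r'\s+(County|Parish)$', '', regex=True).str.strip()
print(cleaned.tolist())
```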

In [None]:
# merge fips from covid dataset with 2019pop
cty_pop = cty_fip.merge(countypop,left_on=['state','county'],right_on=['state','county'])

In [None]:
cty_pop = cty_pop[['state', 'county', 'fips', 'sumlev', 'region_fips', 'division_fips',
       'population','births', 'intnl_migration', 'domestic_migration', 'birth_rate',
       'death_rate']]

In [None]:
cty_pop.head()

## Merging Mask with County Pop

We now have two datasets:

* `county_mask` that contains all NYT data regarding covid19 cases and mask use
* `cty_pop` that contains US Census data about the population

If we can merge these dataframes, we can find out the number of cases per population and a bunch of other interesting statistics. So that's what we will do next.

In [None]:
list_diff(county_time['county'].unique(),cty_pop['county'].unique())

The list shows names that appear in the NYT `county_time` dataset but have no match in our `cty_pop` Census dataset:

* Multi-word names (ex: Los Angeles, Walla Walla, Roger Mills). The formatting step above removed every space from the Census names, so `Los Angeles County` became `LosAngeles`, which most likely explains why these fail to match
* Commonwealth areas (ex: Saipan)

We will ignore these for now, but still keep them in our dataframe by doing a left merge. This will keep all rows in the `county_mask` dataframe even though they have no corresponding data in the `cty_pop` dataframe.

In [None]:
df_county = county_mask.merge(cty_pop, on=['fips'], how='left')

In [None]:
100 * (sum(df_county['population'].isna()) / df_county.shape[0])

Approximately 8% of the NYT dataframe has no corresponding data in the Census dataframe. This is acceptable to us for now, so we will go ahead with the analysis.
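
To see which rows failed to match, rather than only how many, `merge` accepts `indicator=True`. The sketch below uses toy frames with made-up values in place of `county_mask` and `cty_pop`.

```python
import pandas as pd

# toy stand-ins for county_mask (left) and cty_pop (right)
left = pd.DataFrame({'fips': [1001, 1003, 36061], 'covid_cases': [10, 20, 30]})
right = pd.DataFrame({'fips': [1001, 1003], 'population': [55000, 220000]})

# indicator=True adds a _merge column: 'both' or 'left_only' for a left merge
merged = left.merge(right, on='fips', how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
pct_unmatched = 100 * len(unmatched) / len(merged)
print(unmatched['fips'].tolist(), round(pct_unmatched, 1))
```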

In [None]:
df_county.head()

We map the given keys to their appropriate values for ease of categorization in the future

In [None]:
df_county['region'] = df_county['region_fips'].map(region_key)
df_county['division'] = df_county['division_fips'].map(division_key)
df_county['area_type'] = df_county['sumlev'].map(sumlev_key)

And add some preliminary per capita calculations

In [None]:
df_county['cases_per_million'] = (df_county['covid_cases']/df_county['population']) * 1000000
df_county['cases_per_hthousand'] = (df_county['covid_cases']/df_county['population']) * 100000
df_county['cases_per_thousand'] = (df_county['covid_cases']/df_county['population']) * 1000
df_county['cases_per_hundred'] = (df_county['covid_cases']/df_county['population']) * 100
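
Since the same four columns are computed again for the state data, the calculation could be wrapped in a small function. The helper name `add_per_capita` is hypothetical, and the sketch runs on toy values rather than on `df_county`.

```python
import pandas as pd

def add_per_capita(df, cases_col='covid_cases', pop_col='population'):
    """Hypothetical helper: add the four per-capita columns used in this notebook."""
    scales = {
        'cases_per_million': 1_000_000,
        'cases_per_hthousand': 100_000,
        'cases_per_thousand': 1_000,
        'cases_per_hundred': 100,
    }
    out = df.copy()                      # leave the caller's dataframe untouched
    rate = out[cases_col] / out[pop_col]
    for name, scale in scales.items():
        out[name] = rate * scale
    return out

# second row has no population, as happens for unmatched counties
toy = pd.DataFrame({'covid_cases': [500, 20], 'population': [100_000, float('nan')]})
result = add_per_capita(toy)
print(result)
```

Rows without a population keep `NaN` in every per-capita column, matching the behavior of the left merge above.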

In [None]:
# rearrange the columns
df_county = df_county[['state','region', 'county', 'division', 'area_type',
                       'population', 'covid_cases', 'covid_deaths', 'cases_per_million', 'cases_per_hthousand', 
                       'cases_per_thousand', 'cases_per_hundred',
                       'mask_never', 'mask_rarely','mask_sometimes', 'mask_frequently', 'mask_always', 
                       'births','intnl_migration', 'domestic_migration', 
                       'birth_rate', 'death_rate',
                       'fips', 'sumlev', 'region_fips', 'division_fips'
                    ]]

In [None]:
df_county.head()

# State Data

__Goals:__ Merge state population, land area, and latest covid data into one dataset.

## Covid Data

In [None]:
state_covid = pd.read_csv("../input/nytimes-covid19-data/us-states.csv",parse_dates=['date'])

In [None]:
state_covid = state_covid.astype({
    'state':str,
    'fips':float,
    'cases':int,
    'deaths':int
})

In [None]:
# Rename columns
state_covid.columns = ['date','state','state_fips','covid_cases','covid_deaths']
state_covid.head()

In [None]:
state_time = state_covid.copy()

We will do the same sort of filtering as for the county data by only including total covid cases on August 15, 2020

In [None]:
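state_covid = state_covid[state_covid['date'] == '20200815'].reset_index(drop=True)
# average only the numeric columns; the state name comes back from statepop in a later merge
state_covid = state_covid.groupby('state_fips')[['covid_cases','covid_deaths']].mean().reset_index()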

## Land Area

It would be nice to know the county and state land areas, so we can estimate the population density. I was unable to find land area for every county, but state land areas are available from this [US Census source](https://www.census.gov/geographies/reference-files/2010/geo/state-area.html).

We will scrape and format this data to get it ready for a future merge.

In [None]:
response = requests.get("https://www.census.gov/geographies/reference-files/2010/geo/state-area.html")

In [None]:
soup = BeautifulSoup(response.content, 'html.parser') # name the parser explicitly to avoid a bs4 warning

In [None]:
table_tag = soup.find('tbody')

# collect the first 17 cells of each row, then build the dataframe once
# (DataFrame.append was removed in pandas 2.0)
rows = []
for tr in table_tag.find_all('tr')[3:]:
    tds = tr.find_all('td')
    rows.append([tds[i].text for i in range(0,17)])

state_land_scrape = pd.DataFrame(rows)

In [None]:
# each area measure appears twice in the table: square miles first, then square kilometers
area_kinds = ['total_area','land_area','total_water_area','inland_water_area','coastal_water_area',
              'great_lakes_water_area','territorial_water_area']

cols = ['state']
for kind in area_kinds:
    cols.append(f'{kind}_sqmi')
    cols.append(f'{kind}_sqkm')
cols += ['latitude','longitude']

In [None]:
state_land_scrape.columns=cols
state_land_scrape = state_land_scrape.reset_index(drop=True)
state_land_scrape = state_land_scrape.iloc[3:,:].reset_index(drop=True)

In [None]:
state_land_scrape.head()

## State Pop

We could just figure out the state population from the county data, but the [US census](https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/state/) has this precompiled for us.

In [None]:
url3 = github_link_formatter("https://github.com/pomkos/nyt-covid-data/blob/master/data/added_data/statepop.csv")
statepop_raw = pd.read_csv(url3)
statepop_raw.columns = statepop_raw.columns.str.lower()

In [None]:
statepop_raw.head()

In [None]:
statepop = statepop_raw[['name','popestimate2019','sumlev','region','division','state']].reset_index(drop=True)
statepop.columns = ['state', 'population','sumlev','region_fips','division_fips','state_fips']

In [None]:
statepop.head()

In [None]:
# 0 represents territories and regions, which we are not interested in. 
# This data is also included in our county datasets.
statepop = statepop[statepop['state_fips']!=0].reset_index(drop=True)

In [None]:
statepop.head()

## Merge Land Area with State Population

By merging the two we can get population density data.

In [None]:
# Merge population with land area
statepop = statepop.merge(state_land_scrape, on='state')

In [None]:
statepop = statepop.replace(to_replace = '—', value = np.nan)
statepop = statepop.replace(to_replace = ',', value = '', regex = True) # regex=True removes commas inside strings, not only cells equal to ','

In [None]:
statepop.head()

## Merge with State Covid Data

In [None]:
df_state = state_covid.merge(statepop, on='state_fips')

We replicate what we did to `df_county` with `df_state`

In [None]:
cols = ['region_fips', 'division_fips',
       'total_area_sqmi', 'total_area_sqkm', 'land_area_sqmi',
       'land_area_sqkm', 'total_water_area_sqmi', 'total_water_area_sqkm',
       'inland_water_area_sqmi', 'inland_water_area_sqkm',
       'coastal_water_area_sqmi', 'coastal_water_area_sqkm',
       'great_lakes_water_area_sqmi', 'great_lakes_water_area_sqkm',
       'territorial_water_area_sqmi', 'territorial_water_area_sqkm',
       'latitude', 'longitude']

In [None]:
for col in cols:
    df_state[col] = df_state[col].str.replace('X','NaN')
    df_state[col] = df_state[col].str.replace(',','')
    df_state[col] = df_state[col].astype(float)
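
Once the area columns are numeric, a population density can be derived by simple division. The sketch below shows the idea on a toy frame; the values are made up and the column name `pop_per_sqmi` is hypothetical, as this notebook does not create it.

```python
import pandas as pd

# toy stand-in for df_state; scraped areas arrive as strings with thousands separators
toy = pd.DataFrame({
    'state': ['A', 'B'],
    'population': [1000, 500],
    'land_area_sqmi': ['1,000', '250'],
})

# strip the commas, convert to float, then divide
toy['land_area_sqmi'] = toy['land_area_sqmi'].str.replace(',', '', regex=False).astype(float)
toy['pop_per_sqmi'] = toy['population'] / toy['land_area_sqmi']
print(toy)
```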

In [None]:
df_state['region'] = df_state['region_fips'].map(region_key)
df_state['division'] = df_state['division_fips'].map(division_key)
df_state['area_type'] = df_state['sumlev'].map(sumlev_key)

And add some preliminary per capita calculations

In [None]:
df_state['cases_per_million'] = (df_state['covid_cases']/df_state['population']) * 1000000
df_state['cases_per_hthousand'] = (df_state['covid_cases']/df_state['population']) * 100000
df_state['cases_per_thousand'] = (df_state['covid_cases']/df_state['population']) * 1000
df_state['cases_per_hundred'] = (df_state['covid_cases']/df_state['population']) * 100

And rearrange the columns

In [None]:
df_state = df_state[['state', 'region', 'division', 'area_type','covid_cases', 'covid_deaths', 'population',
                    'cases_per_million', 'cases_per_hthousand', 'cases_per_thousand',
                    'cases_per_hundred', 'state_fips','sumlev', 'region_fips', 'division_fips', 
                    'total_area_sqmi','total_area_sqkm', 'land_area_sqmi', 'land_area_sqkm',
                    'total_water_area_sqmi', 'total_water_area_sqkm',
                    'inland_water_area_sqmi', 'inland_water_area_sqkm',
                    'coastal_water_area_sqmi', 'coastal_water_area_sqkm',
                    'great_lakes_water_area_sqmi', 'great_lakes_water_area_sqkm',
                    'territorial_water_area_sqmi', 'territorial_water_area_sqkm',
                    'latitude', 'longitude']]

In [None]:
df_state.head()

# Save to DB

Finally, `county_time`, `state_time`, `df_county`, `df_state`, and `df_mand` are saved into a SQLite database for ease of access:

```python
import sqlalchemy as sq
location = 'sqlite:///data/nyt_covid.db'
cnx = sq.create_engine(location)

# save county time series
county_time.to_sql('county_time_dates', con=cnx, if_exists='fail', index=False)
# save state time series
state_time.to_sql('state_time_dates',con=cnx,if_exists='fail',index=False)
# save county
df_county.to_sql('county_dataset', con=cnx, if_exists='fail', index=False)
# save state
df_state.to_sql('state_dataset', con=cnx, if_exists='fail', index=False)
# save mask mandate
df_mand.to_sql('mandate_date',con=cnx,if_exists='fail',index=False)
```
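
With `if_exists='fail'`, running the save cell a second time raises an error because the tables already exist. The sketch below demonstrates this on an in-memory database using the standard library `sqlite3` module instead of `sqlalchemy`, with a toy dataframe; `if_exists='replace'` is one way to overwrite on purpose.

```python
import sqlite3
import pandas as pd

cnx = sqlite3.connect(':memory:')   # in-memory database, nothing is written to disk
toy = pd.DataFrame({'state': ['A', 'B'], 'covid_cases': [10, 20]})

# first save succeeds, second save with 'fail' raises ValueError
toy.to_sql('state_dataset', con=cnx, if_exists='fail', index=False)
try:
    toy.to_sql('state_dataset', con=cnx, if_exists='fail', index=False)
    second_run = 'ok'
except ValueError:
    second_run = 'table already exists'

# 'replace' drops the old table and writes the new one
toy.to_sql('state_dataset', con=cnx, if_exists='replace', index=False)
restored = pd.read_sql_query('SELECT * FROM state_dataset', cnx)
print(second_run, len(restored))
```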

To access in the future we just need to run:

```python
import pandas as pd
import sqlalchemy as sq

location = 'sqlite:///data/nyt_covid.db'
cnx = sq.create_engine(location)

county_time = pd.read_sql_table('county_time_dates',cnx)
state_time = pd.read_sql_table('state_time_dates',cnx)
df_county = pd.read_sql_table('county_dataset',cnx)
df_state = pd.read_sql_table('state_dataset',cnx)
df_mand = pd.read_sql_table('mandate_date',cnx)
```

# Conclusion

In this notebook we:

## County Level:

1. Imported NYT mask-use survey (`df_mask`)
2. Manually gathered data from AARP and CNN for dates that mask orders went into effect (`df_mand`)
3. Imported NYT county-level total covid cases over time (`county_time`)
4. Scraped the USDA county FIPS page for county-state-fips information (`county_scrape`) to double check the accuracy of the NYT dataset
5. Filtered for only the latest, Aug 15, total covid cases (`county_cases_aug`)
6. Imported the US Census county 2019 population estimate (`cty_pop`)
7. Merged `df_mask` + `county_cases_aug` + `cty_pop` into one dataframe `df_county`

NOTE: Some 8% of counties have no matching Census population data, and larger cities such as NYC were excluded.

## State Level:

1. Imported NYT state-level total covid cases over time (`state_time`)
2. Filtered for only the latest, Aug 15, total covid cases (`state_covid`)
3. Imported the US Census state 2019 population estimate (`statepop`)
4. Scraped and cleaned the US Census state area geographic reference (`state_land_scrape`)
5. Merged `statepop` + `state_land_scrape` + `state_covid` into one dataframe `df_state`

# Next Steps

In the next notebook we will look at the following questions:

1. Do states with earlier mask mandates have fewer covid cases?
2. How did cases change in each state over time?
3. Which states (large, med, small population) fared better?
4. Which states (Northeast, West, Midwest, South) fared better?
5. Which states (Democrat led, Republican led) fared better?
6. Is there any pattern in the first appearance of covid from county-to-county?
7. Has the number of cases relative to deaths or hospitalizations changed over time? That is, are cases becoming more severe? (Age groups may be relevant here)