<a href="https://colab.research.google.com/github/vgaurav-umich/siads592/blob/master/get_covid19_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Covid-19 Datasets
This notebook downloads the required Covid-19 datasets.

We will get Covid-19 datasets from a variety of sources, including:
1. [New York Times](https://github.com/nytimes/covid-19-data)
2. [Covid Tracking Project from The Atlantic](https://covidtracking.com/race)



In [0]:
import pandas as pd
import numpy as np

## Purpose

The purpose of this notebook is to download factual data about COVID-19 infections and deaths.

The NYT data file gives us county-level death and infection data. The file is updated daily and contains cumulative counts of cases and deaths.
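
Because the counts are cumulative, daily new cases have to be derived by differencing within each county. The following is a minimal sketch on made-up rows; `toy_df`, its values, and the `new_cases` column are illustrative and not part of the NYT file.

```python
import pandas as pd

# Toy rows shaped like the NYT county file; the values are made up.
toy_df = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03',
                            '2020-03-01', '2020-03-02']),
    'fips': [53061, 53061, 53061, 17031, 17031],
    'cases': [1, 4, 6, 2, 2],
})

# Sort so that differences are taken in date order within each county.
toy_df = toy_df.sort_values(['fips', 'date']).reset_index(drop=True)

# diff() gives the day-over-day change; the first day of each county has no
# predecessor, so fall back to that day's cumulative count.
toy_df['new_cases'] = (
    toy_df.groupby('fips')['cases'].diff().fillna(toy_df['cases']).astype(int)
)
print(toy_df)
```

If earlier entries are revised downward, a cumulative count can fall, so negative daily values are possible and may need separate handling.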

**Note about Live Data**

NYT provides two sets of data with cumulative counts of coronavirus cases and deaths: one with the most current numbers for each geography and another with historical data showing the tally for each day for each geography.

Setting the live indicator (`live_data_ind`) to True forces this notebook to fetch live data. The default is False. A key difference between the historical and live files is that the numbers in the historical files are the final counts at the end of each day, while the live files have figures that may be a partial count released during the day and cannot necessarily be considered the final, end-of-day tally.


## Download NYT Data 

**US County level deaths and infection cases cumulative count**


**Note from NYT Team** 

> Each row of data reports the cumulative number of coronavirus cases and deaths based on our best reporting up to the moment we publish an update. Our counts include both laboratory confirmed and probable cases using criteria that were developed by states and the federal government. Not all geographies are reporting probable cases and yet others are providing confirmed and probable as a single total. Please read here for a full discussion of this issue.

> We do our best to revise earlier entries in the data when we receive new information. If a county is not listed for a date, then there were zero reported confirmed cases and deaths.

> State and county files contain FIPS codes, a standard geographic identifier, to make it easier for an analyst to combine this data with other data sets like a map file or population data.

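Since pandas reads the `fips` column as floats when some values are missing, joining against data that stores FIPS codes as text may need a conversion first. The following is a sketch on made-up values, assuming the five-character county form (two state digits plus three county digits); the variable names are illustrative.

```python
import pandas as pd

# Made-up sample: FIPS codes read as floats, with one missing value.
fips_float = pd.Series([1001.0, 36061.0, None, 6037.0])

# Nullable Int64 keeps the missing value while dropping the ".0".
fips_int = fips_float.astype('Int64')

# Zero-pad to the five-character county form (2 state digits + 3 county digits).
fips_str = fips_int.astype('string').str.zfill(5)
print(fips_str.tolist())
```
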
### Download

In [0]:
# Indicator for whether to pull live (intraday) data instead of historical data
live_data_ind = False #@param {type:"boolean"}

if not live_data_ind:
  nyt_covid19_county_data_file_path = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv" #@param {type:"raw"}
else:
  nyt_covid19_county_data_file_path = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/live/us-counties.csv" #@param {type:"raw"}

In [0]:
def read_file(data_file_path):
  return pd.read_csv(data_file_path)
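
As an alternative to fixing types after loading, the column types can be declared at read time. This is only a sketch: `read_file_typed` is a hypothetical variant, the sample rows are made up, and the later cells of this notebook compare `fips` to integers, so they would need adjusting if FIPS were kept as text.

```python
import io
import pandas as pd

def read_file_typed(data_file_path):
    # Hypothetical variant of read_file: parse dates and keep FIPS as text
    # so leading zeros and missing values survive the load.
    return pd.read_csv(data_file_path,
                       parse_dates=['date'],
                       dtype={'fips': 'string'})

# Made-up rows in the NYT county layout, used instead of a network call.
sample_csv = io.StringIO(
    "date,county,state,fips,cases,deaths\n"
    "2020-03-01,Autauga,Alabama,01001,1,0\n"
    "2020-03-01,Unknown,Rhode Island,,5,0\n"
)
sample_df = read_file_typed(sample_csv)
print(sample_df.dtypes)
```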

In [0]:
# Step 1: Load Data
# The daily number of cases and deaths nationwide, by states and county, including U.S. territories and the District of Columbia
covid19_nyt_county_df = read_file(nyt_covid19_county_data_file_path)

In [0]:
# Debug
# Check data types to convert
# covid19_nyt_county_df.info()
# Several columns need their types changed:
# date is stored as object and fips as float

In [0]:
# Debug
# Check to see latest data
# covid19_nyt_county_df.sort_values('date', ascending = False).head() 
# Get a feel for the data to figure out the changes we need to make
# covid19_nyt_county_df.head()
# We may not need to print every county, but in case we need to debug
# [county for county in sorted(covid19_nyt_county_df['county'].unique())]
# print(f"total number of counties represented in this dataset: {len(covid19_nyt_county_df['county'].unique())}.")
# print(f"total number of null FIPS codes in this dataset: {covid19_nyt_county_df['fips'].isnull().sum()}.")
# covid19_nyt_county_df.query('fips.isnull()', engine = 'python')
# We now know that many county values are coded as "Unknown"
# The NYT dataset does not assign a county to every row. How to handle "Unknown" depends on the analysis we will be doing

### Data Clean-up

In [0]:
# Step 2: Clean-up
# Convert the date column to a datetime type
covid19_nyt_county_df['date'] = pd.to_datetime(covid19_nyt_county_df['date'])

covid19_nyt_county_df['county'] = covid19_nyt_county_df['county'].str.replace(' city', '')
covid19_nyt_county_df['county'] = covid19_nyt_county_df['county'].str.replace('Larue', 'LaRue')
covid19_nyt_county_df['county'] = covid19_nyt_county_df['county'].str.replace('Juneau City and Borough', 'Juneau')
covid19_nyt_county_df['county'] = covid19_nyt_county_df['county'].str.replace('New York City', 'New York')
covid19_nyt_county_df['county'] = np.where((covid19_nyt_county_df['fips'] == 24510),'Baltimore City', covid19_nyt_county_df['county'])
covid19_nyt_county_df['county'] = covid19_nyt_county_df['county'].str.replace('Sitka City and Borough', 'Sitka')
# The NYT dataset has no FIPS code for New York City; assign 36061 (New York County) to those rows
covid19_nyt_county_df['fips'] = np.where((covid19_nyt_county_df['county'] == 'New York'), 36061, covid19_nyt_county_df['fips'])
print("\n==================================== Covid-19 NYT County Dataset =================================")
covid19_nyt_county_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224666 entries, 0 to 224665
Data columns (total 6 columns):
 #   Column  Non-Null Count   Dtype         
---  ------  --------------   -----         
 0   date    224666 non-null  datetime64[ns]
 1   county  224666 non-null  object        
 2   state   224666 non-null  object        
 3   fips    222381 non-null  float64       
 4   cases   224666 non-null  int64         
 5   deaths  224666 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(2), object(2)
memory usage: 10.3+ MB
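
One point worth checking in the clean-up above: `str.replace(' city', '')` removes the substring wherever it occurs, not only at the end of a name. The sketch below contrasts that with an end-anchored pattern; the sample names are made up for illustration.

```python
import pandas as pd

# Made-up county names to show the difference between the two approaches.
names = pd.Series(['Baltimore city', 'Carson City', 'Twin city Heights'])

# Plain substring removal: drops ' city' wherever it occurs in the name.
anywhere = names.str.replace(' city', '', regex=False)

# Anchored variant: only strips ' city' at the end of the name.
suffix_only = names.str.replace(r' city$', '', regex=True)

print(anywhere.tolist())
print(suffix_only.tolist())
```

Both variants are case-sensitive, so names ending in a capitalized "City" are left alone.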


### Enrich Data

In [0]:
# Enrich NYT data with our Demographic Datasets
%run 'get_demographic_data.ipynb'




<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3273 entries, 1 to 3273
Data columns (total 12 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   fips                                       3272 non-null   float64
 1   geo_name                                   3273 non-null   object 
 2   hispanic_latino_any_race_PE                3273 non-null   float64
 3   non_hispanic_latino_any_race_PE            3273 non-null   float64
 4   white_alone_PE                             3273 non-null   float64
 5   black_african_american_alone_PE            3273 non-null   float64
 6   american_indian_alaska_native_alone_PE     3273 non-null   float64
 7   asian_alone_PE                             3273 non-null   float64
 8   native_hawaiian_pacific_islander_alone_PE  3273 non-null   float64
 9   some_other_race_alone_PE                   3273 non-null   float64
 10  two_or_more_races_PE   

In [0]:
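def calc_death_rate(df):
  # Adds deaths and cases per 100,000 residents (rounded to whole numbers)
  # Note: modifies df in place and also returns it
  df['deaths_per_100k'] = round(df['deaths'] * 100000 / df['population'])
  df['cases_per_100k'] = round(df['cases'] * 100000 / df['population'])
  return df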
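
As a worked check of the per-100k arithmetic, the sketch below applies the same formula to the Alabama figures for 2020-06-10 that appear in this notebook's output. `add_rates_per_100k` is a hypothetical non-mutating variant, not the function used by the notebook.

```python
import pandas as pd

def add_rates_per_100k(df):
    # Hypothetical non-mutating variant: work on a copy and return it.
    out = df.copy()
    out['deaths_per_100k'] = (out['deaths'] * 100000 / out['population']).round()
    out['cases_per_100k'] = (out['cases'] * 100000 / out['population']).round()
    return out

# Alabama, 2020-06-10: 21989 cases, 744 deaths, population 4903185.
rates_in = pd.DataFrame({'cases': [21989], 'deaths': [744],
                         'population': [4903185]})
rates_out = add_rates_per_100k(rates_in)
print(rates_out)
```

Here 744 * 100000 / 4903185 is about 15.17 and 21989 * 100000 / 4903185 is about 448.46, which round to 15 and 448.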

In [0]:
# Sum only the count columns; summing fips codes or county names is not meaningful
covid19_demographic_state_df = covid19_nyt_county_df.groupby(
    ['state', 'date']
  )[['cases', 'deaths']].sum().reset_index().merge(
      state_demographic_df, 
      on = 'state')
  
covid19_demographic_state_df = calc_death_rate(covid19_demographic_state_df)
print("\n==================================== covid19_demographic_state_df: Covid-19 State Dataset w/ Demographics =================================")
covid19_demographic_state_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 5262 entries, 0 to 5261
Data columns (total 19 columns):
 #   Column                                     Non-Null Count  Dtype         
---  ------                                     --------------  -----         
 0   state                                      5262 non-null   object        
 1   date                                       5262 non-null   datetime64[ns]
 2   cases                                      5262 non-null   int64         
 3   deaths                                     5262 non-null   int64         
 4   population                                 5262 non-null   int64         
 5   state_code                                 5262 non-null   object        
 6   fips                                       5262 non-null   int64         
 7   geo_name                                   5262 non-null   object        
 8   hispanic_latino_any_race_PE                5262 non-null   float64       
 9   non_hispanic_latin
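
The state-level merge relies on the demographic table having exactly one row per state. A sketch of how that assumption could be checked with the `validate` argument of `DataFrame.merge`, using made-up stand-in frames (`daily`, `demo`, and `demo_dup` are illustrative names):

```python
import pandas as pd

# Made-up stand-ins for the aggregated NYT data and the demographic table.
daily = pd.DataFrame({'state': ['Alaska', 'Alaska', 'Alabama'],
                      'cases': [10, 12, 30]})
demo = pd.DataFrame({'state': ['Alaska', 'Alabama'],
                     'population': [740747, 4903185]})

# many_to_one: many daily rows per state, exactly one demographic row per state.
merged = daily.merge(demo, on='state', validate='many_to_one')

# A duplicated state on the right side is rejected instead of silently
# multiplying rows.
demo_dup = pd.concat([demo, demo.iloc[[0]]], ignore_index=True)
try:
    daily.merge(demo_dup, on='state', validate='many_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(len(merged), duplicate_detected)
```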

In [0]:
# Get MSA-level data from the NYT dataset: first merge with the county CBSA dataset, then group on CBSA
covid19_demographic_county_cbsa_df = covid19_nyt_county_df.merge(
    county_cbsa_demographic_df, 
    on = 'fips'
  ).drop(
      ['county_y', 'state_y'], 
      axis = 1
  ).rename(
      {'county_x': 'county', 'state_x': 'state'}, 
      axis = 1)
  
covid19_demographic_county_cbsa_df = calc_death_rate(covid19_demographic_county_cbsa_df)
print("\n==================================== covid19_demographic_county_cbsa_df: Covid-19 CBSA County Dataset w/ Demographics =================================")
covid19_demographic_county_cbsa_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 222381 entries, 0 to 222380
Data columns (total 21 columns):
 #   Column                                     Non-Null Count   Dtype         
---  ------                                     --------------   -----         
 0   date                                       222381 non-null  datetime64[ns]
 1   county                                     222381 non-null  object        
 2   state                                      222381 non-null  object        
 3   fips                                       222381 non-null  float64       
 4   cases                                      222381 non-null  int64         
 5   deaths                                     222381 non-null  int64         
 6   geo_name                                   222381 non-null  object        
 7   hispanic_latino_any_race_PE                222381 non-null  float64       
 8   non_hispanic_latino_any_race_PE            222381 non-null  float64       
 9   whi
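
The merged frame has 222381 rows against 224666 in the NYT frame, which equals the non-null `fips` count, so rows without a FIPS code appear to be dropped by the inner merge. One way to inspect such rows is a left merge with `indicator=True`; the frames and CBSA labels below are made-up stand-ins.

```python
import pandas as pd

# Made-up stand-ins: one NYT-style row has no FIPS code.
cases_df = pd.DataFrame({'county': ['Autauga', 'Unknown', 'Cook'],
                         'fips': [1001.0, None, 17031.0],
                         'cases': [5, 7, 9]})
lookup_df = pd.DataFrame({'fips': [1001.0, 17031.0],
                          'cbsa': ['cbsa_a', 'cbsa_b']})

# A left merge with indicator=True keeps every NYT row and labels its origin.
checked = cases_df.merge(lookup_df, on='fips', how='left', indicator=True)

# Rows present only on the left side had no match in the lookup table.
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched[['county', 'cases']])
```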

In [0]:
covid19_demographic_cbsa_df = covid19_demographic_county_cbsa_df.groupby(
    ['cbsa', 'date', 'cbsa_type']
  )[['cases', 'deaths']].sum().reset_index().merge(
      cbsa_demographic_df, 
      on = ['cbsa', 'cbsa_type'])
  
covid19_demographic_cbsa_df = calc_death_rate(covid19_demographic_cbsa_df)
print("\n==================================== covid19_demographic_cbsa_df: Covid-19 CBSA Only Dataset w/ Demographics =================================")
covid19_demographic_cbsa_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 71799 entries, 0 to 71798
Data columns (total 17 columns):
 #   Column                                     Non-Null Count  Dtype         
---  ------                                     --------------  -----         
 0   cbsa                                       71799 non-null  object        
 1   date                                       71799 non-null  datetime64[ns]
 2   cbsa_type                                  71799 non-null  object        
 3   cases                                      71799 non-null  int64         
 4   deaths                                     71799 non-null  int64         
 5   population                                 71799 non-null  int64         
 6   hispanic_latino_any_race_PE                71799 non-null  float64       
 7   white_alone_PE                             71799 non-null  float64       
 8   black_african_american_alone_PE            71799 non-null  float64       
 9   american_indian_

## Download The Atlantic (COVID Tracking Project) Dataset for Deaths by Race/Ethnicity


In [0]:
atlantic_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vR_xmYt4ACPDZCDJcY12kCiMiH0ODyx3E1ZvgOHB8ae1tRcjXbs_yWBOA4j4uoCEADVfC1PS2jYO68B/pub?gid=43720681&single=true&output=csv" #@param {type:"raw"}

In [0]:
# Read data from The Atlantic
covid19_atlantic_race_death_state_df = read_file(atlantic_url)

In [0]:
# print("\n============ covid19_atlantic_race_death_state_df: Covid-19 Atlantic Dataset with Race/Ethnicity Cases and Deaths ===========")
# Let's examine what this dataset includes
# covid19_atlantic_race_death_state_df.info()
# It has 56 entries, likely one for each state plus some US territories
# print(covid19_atlantic_race_death_state_df.head())
# We also see that the data is updated daily

In [0]:
# Fix Datetime Column Datatype
covid19_atlantic_race_death_state_df['Date'] = pd.to_datetime(
    covid19_atlantic_race_death_state_df['Date'], 
    format = "%Y%m%d", 
    errors = 'coerce')
# covid19_atlantic_race_death_state_df.head()
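
With `errors = 'coerce'`, values that do not match the format become `NaT` instead of raising an error, so it can be useful to count them. A small sketch on made-up strings in the same `%Y%m%d` layout:

```python
import pandas as pd

# Made-up raw values in the YYYYMMDD layout, including one malformed entry.
raw_dates = pd.Series(['20200610', '20200607', 'not-a-date'])

parsed = pd.to_datetime(raw_dates, format='%Y%m%d', errors='coerce')

# errors='coerce' turns unparseable values into NaT instead of raising,
# so count them explicitly to avoid losing rows unnoticed.
n_failed = int(parsed.isna().sum())
print(parsed.tolist(), n_failed)
```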

In [0]:
# Enrich this dataset with FIPS codes and state names along with other demographics.
# That information is already in the curated covid19_demographic_state_df dataset

# covid19_nyt_county_df.query('state == "Guam"')
# state_full_df # NMI - 69 AS - 60 GU - 66
# covid19_nyt_county_df.query('state == "Northern Mariana Islands" and date == "2020-06-07"')
# covid19_nyt_county_df.query('county == "Unknown" and date == "2020-06-07"')

# Race/ethnicity data from the US Census does not have any records for NMI and Guam
# race_ethnicity_county_df.query('geo_name.str.contains("Northern Mariana Islands")', engine =  'python')

In [0]:
racial_analysis_df = covid19_atlantic_race_death_state_df.merge(
    covid19_demographic_state_df, 
    left_on = ['Date', 'State'], 
    right_on = ['date', 'state_code']
  ).drop(
      ['State', 'Date', 'geo_name'], 
      axis = 1)
  
print("\n============ racial_analysis_df: Covid-19 Atlantic Dataset with Enriched Race/Ethnicity Cases and Deaths ===========")
racial_analysis_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 52 entries, 0 to 51
Data columns (total 44 columns):
 #   Column                                     Non-Null Count  Dtype         
---  ------                                     --------------  -----         
 0   Cases_Total                                52 non-null     float64       
 1   Cases_White                                47 non-null     float64       
 2   Cases_Black                                47 non-null     float64       
 3   Cases_LatinX                               20 non-null     float64       
 4   Cases_Asian                                42 non-null     float64       
 5   Cases_AIAN                                 28 non-null     float64       
 6   Cases_NHPI                                 18 non-null     float64       
 7   Cases_Multiracial                          13 non-null     float64       
 8   Cases_Other                                45 non-null     float64       
 9   Cases_Unknown         

In [0]:
# racial_analysis_df.head()

Unnamed: 0,Cases_Total,Cases_White,Cases_Black,Cases_LatinX,Cases_Asian,Cases_AIAN,Cases_NHPI,Cases_Multiracial,Cases_Other,Cases_Unknown,Cases_Ethnicity_Hispanic,Cases_Ethnicity_NonHispanic,Cases_Ethnicity_Unknown,Deaths_Total,Deaths_White,Deaths_Black,Deaths_LatinX,Deaths_Asian,Deaths_AIAN,Deaths_NHPI,Deaths_Multiracial,Deaths_Other,Deaths_Unknown,Deaths_Ethnicity_Hispanic,Deaths_Ethnicity_NonHispanic,Deaths_Ethnicity_Unknown,state,date,cases,deaths,population,state_code,fips,hispanic_latino_any_race_PE,non_hispanic_latino_any_race_PE,white_alone_PE,black_african_american_alone_PE,american_indian_alaska_native_alone_PE,asian_alone_PE,native_hawaiian_pacific_islander_alone_PE,some_other_race_alone_PE,two_or_more_races_PE,deaths_per_100k,cases_per_100k
0,593.0,349.0,16.0,,57.0,69.0,16.0,22.0,12.0,52.0,45.0,476.0,72.0,11.0,6.0,0.0,,2.0,2.0,1.0,0.0,0.0,0.0,0.0,11.0,0.0,Alaska,2020-06-10,642,9,740747,AK,2,6.9,93.1,61.0,3.1,14.0,6.2,1.2,0.2,7.4,1.0,87.0
1,21989.0,7967.0,9221.0,,103.0,,,,1089.0,3609.0,2086.0,14969.0,4932.0,744.0,360.0,333.0,,5.0,,,,12.0,34.0,18.0,640.0,86.0,Alabama,2020-06-10,21989,744,4903185,AL,1,4.2,95.8,65.7,26.4,0.5,1.3,0.0,0.2,1.7,15.0,448.0
2,10368.0,4973.0,2920.0,,93.0,33.0,561.0,,867.0,921.0,2203.0,8165.0,921.0,165.0,96.0,51.0,,1.0,,6.0,,8.0,,8.0,154.0,3.0,Arkansas,2020-06-10,10368,165,3017804,AR,5,7.3,92.7,72.7,15.3,0.6,1.5,0.3,0.2,2.2,5.0,344.0
3,29852.0,5978.0,913.0,7392.0,290.0,3694.0,,,845.0,10740.0,7392.0,11720.0,10740.0,1095.0,515.0,33.0,195.0,15.0,194.0,,,20.0,123.0,195.0,777.0,123.0,Arizona,2020-06-10,29981,1100,7278717,AZ,4,31.1,68.9,55.1,4.1,3.9,3.2,0.2,0.1,2.2,15.0,412.0
4,136191.0,18064.0,4713.0,54151.0,8019.0,191.0,708.0,706.0,10784.0,38855.0,54151.0,43185.0,38855.0,4663.0,1524.0,451.0,1837.0,684.0,13.0,21.0,31.0,39.0,63.0,1837.0,2763.0,63.0,California,2020-06-10,140123,4869,39512223,CA,6,38.9,61.1,37.5,5.5,0.4,14.1,0.4,0.2,3.0,12.0,355.0


In [0]:

# df = analysis_df.query('Deaths_Total >= 20') \
#   .melt(id_vars = 'state', 
#         value_vars = ['black_african_american_alone_PE', 'black_death_pct'], var_name = 'percentage_type', value_name = 'percentage')

# alt.Chart(analysis_df).mark_bar().encode(x = '')
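
The commented-out reshaping above can be tried on a small sample. The sketch below uses the Alabama and California rows from the output shown earlier; the definition of `black_death_pct` (Black deaths as a percentage of all reported deaths) is an assumption, since the notebook does not show how that column is computed.

```python
import pandas as pd

# Two rows copied from the notebook output (Alabama and California, 2020-06-10).
sample = pd.DataFrame({
    'state': ['Alabama', 'California'],
    'Deaths_Total': [744.0, 4663.0],
    'Deaths_Black': [333.0, 451.0],
    'black_african_american_alone_PE': [26.4, 5.5],
})

# Assumed definition: share of all reported deaths recorded as Black, in percent.
sample['black_death_pct'] = (
    sample['Deaths_Black'] * 100 / sample['Deaths_Total']
).round(1)

# Long format: one row per state and percentage type.
long_df = sample.query('Deaths_Total >= 20').melt(
    id_vars='state',
    value_vars=['black_african_american_alone_PE', 'black_death_pct'],
    var_name='percentage_type',
    value_name='percentage')
print(long_df)
```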

In [0]:
# analysis_df[['Cases_Total', 'cases']].astype(int) # for the most part they seem to match, with slight variation
# analysis_df[['Deaths_Total', 'deaths']].astype(int) # for the most part they seem to match, with slight variation
# For the purpose of this analysis we are going to use The Atlantic's numbers

In [0]:
# Let's focus on the impact on Hispanic and non-Hispanic ethnicity
# case_ethnicity_cols = ['date','state','fips','Cases_Total', 'cases','Cases_Ethnicity_Hispanic', 'Cases_Ethnicity_NonHispanic', 'Cases_Ethnicity_Unknown', 'population', 'hispanic_latino_any_race_PE', 'non_hispanic_latino_any_race_PE'] 
# death_ethnicity_cols = ['date','state','fips','Deaths_Total', 'deaths', 'Deaths_Ethnicity_Hispanic', 'Deaths_Ethnicity_NonHispanic', 'Deaths_Ethnicity_Unknown', 'population', 'hispanic_latino_any_race_PE', 'non_hispanic_latino_any_race_PE'] 
# analysis_case_ethnicity_df = analysis_df[case_ethnicity_cols]
# analysis_death_ethnicity_df = analysis_df[death_ethnicity_cols]