<a href="https://colab.research.google.com/github/w-oke/covid_reproduction/blob/main/001-etl-data-covid-19.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The basis for this notebook was copied from:<br>
https://colab.research.google.com/github/ChadFulton/sm-notebooks-2021/blob/main/001-etl-data-covid-19.ipynb

## Google COVID-19 public datasets / BigQuery

Google curates and publishes a set of "[COVID-19 public datasets](https://cloud.google.com/blog/products/data-analytics/publicly-available-covid-19-data-for-analytics)" that include global data about the COVID-19 pandemic. The data, their ETL code, and information about sources are available in a [GitHub repository](https://github.com/GoogleCloudPlatform/covid-19-open-data/). Google has also made the COVID-19 data available as part of its [BigQuery Public Datasets Program](https://console.cloud.google.com/marketplace/product/bigquery-public-datasets/covid19-public-data-program). This means that we can use BigQuery to interact with the dataset using SQL-like queries, and these queries will be [free until September 15, 2021](https://cloud.google.com/blog/products/data-analytics/publicly-available-covid-19-data-for-analytics).

**ETL strategy**: in this notebook, we'll use Google's [BigQuery Python library](https://cloud.google.com/bigquery/docs/reference/libraries#client-libraries-install-python) to query the data directly. The managed BigQuery service is convenient enough that, for the visualization we're creating here, we don't need to do any custom ETL work (other than writing the query) or store the output data locally, although we might do so in later notebooks.

## Google COVID-19 public datasets / BigQuery

Finally, we can use the `bigquery` Python library to directly query Google's COVID-19 public datasets. This is convenient because (1) we can use an SQL-type query to work with the datasets so that we only download the data we actually need, and (2) we can download the data directly into a Pandas DataFrame.

To get started with this dataset via BigQuery, a few steps must be completed first:

- [Create an account and project with Google Cloud Platform](https://cloud.google.com/bigquery/public-data#before_you_begin)
- [Install the Google Cloud library for Python (usually using either `pip` or Anaconda)](https://cloud.google.com/bigquery/docs/reference/libraries#installing_the_client_library)
- [Set up authentication for accessing your project](https://cloud.google.com/bigquery/docs/reference/libraries#setting_up_authentication)

To work with county-level data use `aggregation_level = 2`.

Here's the query from the original notebook:

```
query = """
SELECT # only return these 4 columns:
  subregion2_code as fips,      # rename column
  subregion2_name as county,    # rename column
  date,
  new_deceased
FROM
  `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
  country_code = 'US'                # only look in the United States
  AND aggregation_level = 2          # only return county-level data
  AND new_deceased IS NOT NULL       # filter by records that have a death
  AND date >= DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY); # only access the last week worth of data
"""
```

After performing the query, the author calculated the total number of deaths over the week for each county:
```
df.groupby(['fips', 'county'], as_index=False).sum()
```
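
As a small illustration of that aggregation, the sketch below applies the same `groupby` to a hand-made DataFrame. The rows and values are invented for illustration and are not query results. Selecting `new_deceased` before summing is an assumption added here so that the non-numeric `date` column is left out of the sum.

```python
import pandas as pd

# Hypothetical rows shaped like the query result (values are made up)
df_example = pd.DataFrame({
    'fips': ['01001', '01001', '01003'],
    'county': ['Autauga County', 'Autauga County', 'Baldwin County'],
    'date': ['2021-01-01', '2021-01-02', '2021-01-01'],
    'new_deceased': [1, 2, 4],
})

# Total deaths per county over the period covered by the rows
weekly_deaths = (df_example
                 .groupby(['fips', 'county'], as_index=False)['new_deceased']
                 .sum())
print(weekly_deaths)
```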

In [4]:
# Basic imports that we will use throughout the notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pickle

from google.cloud import bigquery

In [5]:
# Handle authentication in Colab
try:
    from google.colab import auth
    auth.authenticate_user()
    print('Authenticated')
except ImportError:
    pass

Authenticated


In [6]:
# dependent (target) variables -> normalize to population
var = {'y': ['new_tested', 'new_confirmed']}

# regional information (string) remove what isn't needed
var['region'] = '''location_key
    place_id
    wikidata_id
    country_code
    subregion1_code
    subregion1_name'''.split()

# datetime
var['date'] = ['date']

# population (int -> normalize to population)
var['population'] = '''population
    population_age_00_09
    population_age_10_19
    population_age_20_29
    population_age_30_39
    population_age_40_49
    population_age_50_59
    population_age_60_69
    population_age_70_79
    population_age_80_and_older
    area_sq_km
    cumulative_persons_vaccinated
    cumulative_persons_fully_vaccinated
    cumulative_vaccine_doses_administered'''.split()

# number (int -> normalize and convert to float)
# var['integer'] = ['elevation_m']

# string (should be int then normalized to float)
var['string'] = '''mobility_retail_and_recreation
    mobility_grocery_and_pharmacy
    mobility_parks
    mobility_transit_stations
    mobility_workplaces
    mobility_residential'''.split()

# float
var['float'] = '''stringency_index
    average_temperature_celsius
    rainfall_mm
    snowfall_mm'''.split()

# rating 1-5 (int)
var['rating'] = '''school_closing
    workplace_closing
    cancel_public_events
    restrictions_on_gatherings
    public_transport_closing
    stay_at_home_requirements
    restrictions_on_internal_movement
    international_travel_controls
    income_support
    debt_relief
    fiscal_measures
    international_support
    public_information_campaigns
    testing_policy
    contact_tracing
    emergency_investment_in_healthcare
    investment_in_vaccines
    facial_coverings
    vaccination_policy'''.split()

with open('var_dictionary.pkl', 'wb') as f:
    pickle.dump(var, f)

# create a single list of all the features
var_all = [item for sublist in list(var.values()) for item in sublist]
print('The first 4 items in "var_all" are: ', var_all[0:4])

# create a single string of all the features
var_all2 = ', '.join(var_all)

The first 4 items in "var_all" are:  ['new_tested', 'new_confirmed', 'location_key', 'place_id']
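
Because `var_all2` is pasted straight into a `SELECT` clause, an accidental duplicate in the lists above is easy to miss. A minimal sketch of a pre-flight check follows, using a shortened stand-in dictionary; `var_example` and `build_select_list` are hypothetical names introduced here, not part of the notebook.

```python
from collections import Counter
from itertools import chain

# Shortened stand-in for the `var` dictionary defined above
var_example = {
    'y': ['new_tested', 'new_confirmed'],
    'date': ['date'],
    'float': ['stringency_index', 'rainfall_mm'],
}

def build_select_list(groups):
    """Flatten the per-type lists and return a comma-separated column list."""
    columns = list(chain.from_iterable(groups.values()))
    # Refuse to build the list if any column name appears more than once
    duplicates = [name for name, count in Counter(columns).items() if count > 1]
    if duplicates:
        raise ValueError(f'duplicate column names: {duplicates}')
    return ', '.join(columns)

select_list = build_select_list(var_example)
print(select_list)
```

Calling `build_select_list(var)` on the full dictionary would produce the same string as `var_all2`, provided no name is repeated.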


In [None]:
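# Construct a BigQuery client object.
# Note: the authenticated user must have access to the specified project
# (the listed project was created by Wesley Oke).
client = bigquery.Client(project='citric-trees-332113')

# Build the query: select every feature, keeping only rows where the
# vaccination and park-mobility fields are populated and new_tested is set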
query = "SELECT " + var_all2 + """
FROM
  `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
  cumulative_persons_fully_vaccinated IS NOT NULL
  AND mobility_parks IS NOT NULL
  AND new_tested != 'None'
LIMIT 200000
"""

# Run the query and save the result as a dataframe
df = (client.query(query)
             .result()
             .to_dataframe())
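
The notebook runs this query twice, changing only the column that must be non-null (`mobility_parks` here, `school_closing` further down). One way to avoid repeating the cell is a small helper that builds the query text. `build_query` and its parameters are hypothetical names introduced for this sketch; the helper only assembles a string, so it can be inspected without a BigQuery connection.

```python
TABLE = 'bigquery-public-data.covid19_open_data.covid19_open_data'

def build_query(select_list, required_column, limit=200000):
    """Return the query text used in this notebook for one required column."""
    # Guard against pasting anything other than a plain column name
    if not required_column.isidentifier():
        raise ValueError(f'not a valid column name: {required_column!r}')
    return f"""SELECT {select_list}
FROM
  `{TABLE}`
WHERE
  cumulative_persons_fully_vaccinated IS NOT NULL
  AND {required_column} IS NOT NULL
  AND new_tested != 'None'
LIMIT {int(limit)}
"""

# Shortened column list, for illustration only
query_example = build_query('date, new_tested, mobility_parks', 'mobility_parks')
print(query_example)
```

The resulting string can be passed to `client.query(...)` in the same way as the hand-written query.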

In [None]:
# Convert the 'date' column to a Datetime format
df['date'] = pd.to_datetime(df['date'])

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 192050 entries, 0 to 192049
Data columns (total 52 columns):
 #   Column                                 Non-Null Count   Dtype         
---  ------                                 --------------   -----         
 0   new_tested                             192050 non-null  object        
 1   new_confirmed                          191555 non-null  float64       
 2   location_key                           192050 non-null  object        
 3   place_id                               192050 non-null  object        
 4   wikidata_id                            192050 non-null  object        
 5   country_code                           192050 non-null  object        
 6   subregion1_code                        176602 non-null  object        
 7   subregion1_name                        176602 non-null  object        
 8   date                                   192050 non-null  datetime64[ns]
 9   population                             191998 no
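
The `df.info()` output lists `new_tested` as `object`, meaning the values arrived as strings rather than numbers, and the `var['string']` group is described above as strings that should be numeric. A minimal sketch of one way to convert such columns, shown on invented values: `df_example` is hypothetical, and `errors='coerce'` turns unparseable entries such as `'None'` into `NaN`.

```python
import pandas as pd

# Hypothetical string-typed columns (values are made up)
df_example = pd.DataFrame({
    'new_tested': ['84410', '43968', 'None'],
    'mobility_parks': ['-50', '-23', '-55'],
})

# Convert each string column to a numeric dtype in place
string_columns = ['new_tested', 'mobility_parks']
for column in string_columns:
    df_example[column] = pd.to_numeric(df_example[column], errors='coerce')

print(df_example.dtypes)
```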

In [None]:
df.head()

Unnamed: 0,new_tested,new_confirmed,location_key,place_id,wikidata_id,country_code,subregion1_code,subregion1_name,date,population,population_age_00_09,population_age_10_19,population_age_20_29,population_age_30_39,population_age_40_49,population_age_50_59,population_age_60_69,population_age_70_79,population_age_80_and_older,area_sq_km,cumulative_persons_vaccinated,cumulative_persons_fully_vaccinated,cumulative_vaccine_doses_administered,mobility_retail_and_recreation,mobility_grocery_and_pharmacy,mobility_parks,mobility_transit_stations,mobility_workplaces,mobility_residential,stringency_index,average_temperature_celsius,rainfall_mm,snowfall_mm,school_closing,workplace_closing,cancel_public_events,restrictions_on_gatherings,public_transport_closing,stay_at_home_requirements,restrictions_on_internal_movement,international_travel_controls,income_support,debt_relief,fiscal_measures,international_support,public_information_campaigns,testing_policy,contact_tracing,emergency_investment_in_healthcare,investment_in_vaccines,facial_coverings,vaccination_policy
0,84410,36101.0,AR,ChIJZ8b99fXKvJURqA_wKpl3Lz0,Q414,AR,,,2021-05-19,44938712.0,6718871.0,7045513.0,6430658.0,5777148.0,4507125.0,3911943.0,2914251.0,1817557.0,994030.0,2780400.0,8735614.0,2322117,11057731.0,-36,5,-50,-28,-4,9,84.26,17.185185,0.225778,,3.0,2.0,2.0,4.0,2.0,2.0,2.0,4.0,1.0,1.0,0.0,0.0,2.0,2.0,2.0,0.0,0.0,3.0,3.0
1,43968,2039.0,AR,ChIJZ8b99fXKvJURqA_wKpl3Lz0,Q414,AR,,,2021-09-16,44938712.0,6718871.0,7045513.0,6430658.0,5777148.0,4507125.0,3911943.0,2914251.0,1817557.0,994030.0,2780400.0,29075298.0,19739283,48814581.0,-10,21,-23,-1,20,2,75.93,15.882716,0.0,,1.0,2.0,2.0,4.0,2.0,2.0,2.0,4.0,1.0,1.0,,,2.0,2.0,2.0,,0.0,3.0,4.0
2,65622,26199.0,AR,ChIJZ8b99fXKvJURqA_wKpl3Lz0,Q414,AR,,,2021-04-26,44938712.0,6718871.0,7045513.0,6430658.0,5777148.0,4507125.0,3911943.0,2914251.0,1817557.0,994030.0,2780400.0,6793307.0,909996,7703303.0,-38,-2,-55,-31,-4,8,77.31,15.838889,0.0,,3.0,2.0,2.0,4.0,2.0,2.0,2.0,3.0,1.0,1.0,0.0,0.0,2.0,2.0,2.0,0.0,0.0,3.0,3.0
3,38720,1299.0,AR,ChIJZ8b99fXKvJURqA_wKpl3Lz0,Q414,AR,,,2021-09-29,44938712.0,6718871.0,7045513.0,6430658.0,5777148.0,4507125.0,3911943.0,2914251.0,1817557.0,994030.0,2780400.0,29763790.0,22312002,52075792.0,-12,20,-21,-1,26,1,75.93,18.37037,0.0,,1.0,2.0,2.0,4.0,2.0,2.0,2.0,4.0,1.0,1.0,,,2.0,2.0,2.0,,0.0,3.0,4.0
4,24060,5849.0,AR,ChIJZ8b99fXKvJURqA_wKpl3Lz0,Q414,AR,,,2021-03-28,44938712.0,6718871.0,7045513.0,6430658.0,5777148.0,4507125.0,3911943.0,2914251.0,1817557.0,994030.0,2780400.0,3105690.0,686682,3792372.0,-39,-11,-47,-27,-8,8,71.76,16.123457,0.141111,,1.0,2.0,2.0,4.0,2.0,2.0,2.0,3.0,1.0,1.0,0.0,0.0,2.0,2.0,2.0,0.0,0.0,3.0,2.0


In [None]:
df.to_parquet('covid_google_df_mobility_parks.parquet')

In [None]:
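# Construct a BigQuery client object.
# Note: the authenticated user must have access to the specified project
# (the listed project was created by Wesley Oke).
client = bigquery.Client(project='citric-trees-332113')

# Build the query: same as before, but require school_closing
# (instead of mobility_parks) to be populated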
query = "SELECT " + var_all2 + """
FROM
  `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
  cumulative_persons_fully_vaccinated IS NOT NULL
  AND school_closing IS NOT NULL
  AND new_tested != 'None'
LIMIT 200000
"""

# Run the query and save the result as a dataframe
df = (client.query(query)
             .result()
             .to_dataframe())

In [None]:
# Convert the 'date' column to a Datetime format
df['date'] = pd.to_datetime(df['date'])

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34652 entries, 0 to 34651
Data columns (total 52 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   new_tested                             34652 non-null  object        
 1   new_confirmed                          34648 non-null  float64       
 2   location_key                           34652 non-null  object        
 3   place_id                               33683 non-null  object        
 4   wikidata_id                            34652 non-null  object        
 5   country_code                           34652 non-null  object        
 6   subregion1_code                        17522 non-null  object        
 7   subregion1_name                        17522 non-null  object        
 8   date                                   34652 non-null  datetime64[ns]
 9   population                             34558 non-null  float6

In [None]:
df.head()

Unnamed: 0,new_tested,new_confirmed,location_key,place_id,wikidata_id,country_code,subregion1_code,subregion1_name,date,population,population_age_00_09,population_age_10_19,population_age_20_29,population_age_30_39,population_age_40_49,population_age_50_59,population_age_60_69,population_age_70_79,population_age_80_and_older,area_sq_km,cumulative_persons_vaccinated,cumulative_persons_fully_vaccinated,cumulative_vaccine_doses_administered,mobility_retail_and_recreation,mobility_grocery_and_pharmacy,mobility_parks,mobility_transit_stations,mobility_workplaces,mobility_residential,stringency_index,average_temperature_celsius,rainfall_mm,snowfall_mm,school_closing,workplace_closing,cancel_public_events,restrictions_on_gatherings,public_transport_closing,stay_at_home_requirements,restrictions_on_internal_movement,international_travel_controls,income_support,debt_relief,fiscal_measures,international_support,public_information_campaigns,testing_policy,contact_tracing,emergency_investment_in_healthcare,investment_in_vaccines,facial_coverings,vaccination_policy
0,25800,90.0,NZ,ChIJh5Z3Fw4gLG0RM0dqdeIY1rE,Q664,NZ,,,2021-10-13,4822233.0,615284.0,624953.0,671235.0,619066.0,591874.0,628691.0,522312.0,361832.0,186986.0,267710.0,3499039.0,2550698,6049737.0,-14,13,-22,-56,-21,14,81.02,9.475309,2.159,,3,3.0,2.0,4.0,2.0,2.0,2.0,4.0,1.0,2.0,,,2.0,2.0,2.0,,0.0,3.0,5.0
1,15242,18.0,NZ,ChIJh5Z3Fw4gLG0RM0dqdeIY1rE,Q664,NZ,,,2021-09-11,4822233.0,615284.0,624953.0,671235.0,619066.0,591874.0,628691.0,522312.0,361832.0,186986.0,267710.0,2848957.0,1451956,4300913.0,-26,6,-28,-58,-21,14,81.02,10.771605,1.608667,,3,3.0,2.0,4.0,2.0,2.0,2.0,4.0,1.0,2.0,,,2.0,2.0,2.0,,0.0,3.0,5.0
2,7257,2.0,NZ,ChIJh5Z3Fw4gLG0RM0dqdeIY1rE,Q664,NZ,,,2021-07-07,4822233.0,615284.0,624953.0,671235.0,619066.0,591874.0,628691.0,522312.0,361832.0,186986.0,267710.0,781991.0,515421,1297412.0,9,12,-24,-26,6,5,22.22,9.54321,4.73075,,0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,1.0,2.0,0.0,0.0,2.0,2.0,2.0,0.0,0.0,2.0,3.0
3,26328,102.0,NZ,ChIJh5Z3Fw4gLG0RM0dqdeIY1rE,Q664,NZ,,,2021-10-20,4822233.0,615284.0,624953.0,671235.0,619066.0,591874.0,628691.0,522312.0,361832.0,186986.0,267710.0,3605642.0,2872682,6478324.0,-10,17,-6,-51,-12,11,81.02,11.0,0.028222,,3,3.0,2.0,4.0,2.0,2.0,2.0,4.0,1.0,,,,2.0,2.0,2.0,,0.0,3.0,5.0
4,59005,6594.0,PH,ChIJY96HXyFTQDIRV9opeu-QR3g,Q928,PH,,,2021-05-05,100979303.0,21661851.0,20685127.0,17827941.0,14084581.0,11133653.0,8037381.0,4677308.0,2079178.0,792283.0,300000.0,1786480.0,342705,2129185.0,-33,5,-20,-48,-34,21,68.06,29.455556,1.8542,,1,2.0,2.0,4.0,1.0,2.0,2.0,3.0,0.0,0.0,0.0,0.0,2.0,3.0,2.0,0.0,0.0,4.0,3.0


In [None]:
df.to_parquet('covid_google_df_school_closing.parquet')