<a href="https://colab.research.google.com/github/w-oke/covid_reproduction/blob/main/covid_google_1_preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is based on:<br>
https://colab.research.google.com/github/ChadFulton/sm-notebooks-2021/blob/main/001-etl-data-covid-19.ipynb

## Google COVID-19 public datasets / BigQuery

Google is curating and making available a set of "[COVID-19 public datasets](https://cloud.google.com/blog/products/data-analytics/publicly-available-covid-19-data-for-analytics)" that include global data about the COVID-19 pandemic. The data, their ETL code, and information about sources are available in a [GitHub repository](https://github.com/GoogleCloudPlatform/covid-19-open-data/). Google has also made the COVID-19 data available as part of its [BigQuery Public Datasets Program](https://console.cloud.google.com/marketplace/product/bigquery-public-datasets/covid19-public-data-program). This means that we can use BigQuery to interact with the dataset using SQL-like queries, and these queries will be [free until September 15, 2021](https://cloud.google.com/blog/products/data-analytics/publicly-available-covid-19-data-for-analytics).

**ETL strategy**: in this notebook, we'll just use Google's [BigQuery Python library](https://cloud.google.com/bigquery/docs/reference/libraries#client-libraries-install-python) to query the data directly. The managed BigQuery service is convenient enough that, for the visualization we're creating here, we don't need to do any custom ETL work (other than writing the query) or store the output data locally, although we might do so in later notebooks.

We can use the `bigquery` Python library to directly query Google's COVID-19 public datasets. This is convenient because (1) we can use an SQL-type query to work with the datasets so that we only download the data we actually need, and (2) we can download the data directly into a Pandas DataFrame.

To get started with this dataset via BigQuery, a few steps must be completed first:

- [Create an account and project with Google Cloud Platform](https://cloud.google.com/bigquery/public-data#before_you_begin)
- [Install the Google Cloud client library for Python (usually using either `pip` or Anaconda)](https://cloud.google.com/bigquery/docs/reference/libraries#installing_the_client_library)
- [Set up authentication for accessing your project](https://cloud.google.com/bigquery/docs/reference/libraries#setting_up_authentication)

Here's the query from the original notebook:

```
query = """
SELECT # only return these 4 columns:
  subregion2_code as fips,      # rename column
  subregion2_name as county,    # rename column
  date,
  new_deceased
FROM
  `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
  country_code = 'US'                # only look in the United States
  AND aggregation_level = 2          # only return county-level data
  AND new_deceased IS NOT NULL       # keep only rows with a reported value
  AND date >= DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY); # only access the last 8 days of data
"""
```
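
The `DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY)` filter can be mirrored in plain Python to see which dates it admits. The following is a minimal sketch, assuming a fixed, hypothetical "current date" so that the result is reproducible; `window_dates` is an illustrative helper and not part of the original notebook.

```python
from datetime import date, timedelta
from typing import List


def window_dates(today: date, days_back: int = 8) -> List[date]:
    """Return every calendar date d with today - days_back <= d <= today."""
    cutoff = today - timedelta(days=days_back)
    return [cutoff + timedelta(days=i) for i in range(days_back + 1)]


# Hypothetical "current date", chosen only for illustration.
dates = window_dates(date(2021, 6, 15))
print(dates[0], dates[-1], len(dates))
```

Because the comparison is `>=`, the cutoff date itself is included, so an 8-day interval can match up to nine distinct calendar dates.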

After performing the query, the author calculated the total number of deaths over the week for each county:
```
df.groupby(['fips', 'county'], as_index=False).sum()
```
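
To illustrate the aggregation, here is a small sketch on made-up rows (the FIPS codes, county names, and counts are illustrative only). Unlike the one-liner above, it selects `new_deceased` explicitly before summing, which is an assumption on our part to keep the `date` column out of the sum.

```python
import pandas as pd

# Made-up rows standing in for the query result; values are illustrative only.
df_example = pd.DataFrame({
    "fips": ["01001", "01001", "01003"],
    "county": ["Autauga County", "Autauga County", "Baldwin County"],
    "date": ["2021-06-01", "2021-06-02", "2021-06-01"],
    "new_deceased": [1, 2, 4],
})

# Sum only the numeric column so the date column is not aggregated.
weekly = df_example.groupby(["fips", "county"], as_index=False)["new_deceased"].sum()
print(weekly)
```

With `as_index=False`, the grouping keys stay as ordinary columns, so the result has one row per county with `fips`, `county`, and the summed `new_deceased`.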

In [None]:
# Basic imports that we will use throughout the notebook
# import numpy as np
import pandas as pd
import pickle

from google.cloud import bigquery

In [None]:
# Handle authentication in Colab
try:
    from google.colab import auth
    auth.authenticate_user()
    print('Authenticated')
except ImportError:
    pass

Authenticated


The following cell identifies the columns that will be downloaded from the Google dataset.

Some notes about the available columns are captured in the spreadsheet at: https://docs.google.com/spreadsheets/d/1iIi6c7_9ryeNxdhuz5nGdwvJ9RWG5tuHh-zHSr1Oefs/edit?usp=sharing

In [None]:
# dependent (target) variables -> normalize to population
var = {'y': ['new_tested', 'new_confirmed']}

# regional information (string) remove what isn't needed
var['region'] = '''location_key
    place_id
    wikidata_id
    country_code
    subregion1_code
    subregion1_name'''.split()

# datetime
var['date'] = ['date']

# population (int -> normalize to population)
var['population'] = '''population
    population_age_00_09
    population_age_10_19
    population_age_20_29
    population_age_30_39
    population_age_40_49
    population_age_50_59
    population_age_60_69
    population_age_70_79
    population_age_80_and_older
    area_sq_km
    cumulative_persons_vaccinated
    cumulative_persons_fully_vaccinated
    cumulative_vaccine_doses_administered'''.split()

# string (should be int then normalized to float)
var['string'] = '''mobility_retail_and_recreation
    mobility_grocery_and_pharmacy
    mobility_parks
    mobility_transit_stations
    mobility_workplaces
    mobility_residential'''.split()

# float
var['float'] = '''stringency_index
    average_temperature_celsius
    rainfall_mm
    snowfall_mm'''.split()

# rating 1-5 (int)
var['rating'] = '''school_closing
    workplace_closing
    cancel_public_events
    restrictions_on_gatherings
    public_transport_closing
    stay_at_home_requirements
    restrictions_on_internal_movement
    international_travel_controls
    income_support
    debt_relief
    fiscal_measures
    international_support
    public_information_campaigns
    testing_policy
    contact_tracing
    emergency_investment_in_healthcare
    investment_in_vaccines
    facial_coverings
    vaccination_policy'''.split()

# save the variables to file
with open('covid_google_var_dictionary.pkl', 'wb') as f:
    pickle.dump(var, f)
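
A later notebook can read the dictionary back with `pickle.load`. The sketch below shows the round trip on a small stand-in dictionary (`var_example` is hypothetical) and writes to a temporary directory so that it does not touch the real file.

```python
import os
import pickle
import tempfile

# Small stand-in for the `var` dictionary; the real one has more groups.
var_example = {"y": ["new_tested", "new_confirmed"], "date": ["date"]}

# Write to a temporary directory so the sketch does not touch real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "covid_google_var_dictionary.pkl")
    with open(path, "wb") as f:
        pickle.dump(var_example, f)
    with open(path, "rb") as f:
        var_loaded = pickle.load(f)

print(var_loaded == var_example)
```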

In [None]:
# create a single list of all the features
var_all = [item for sublist in list(var.values()) for item in sublist]
print('The first 4 items in "var_all" are: ', var_all[0:4])

# create a single string of all the features
var_all2 = ', '.join(var_all)

The first 4 items in "var_all" are:  ['new_tested', 'new_confirmed', 'location_key', 'place_id']
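
The same flattening can be written with `itertools.chain.from_iterable`, and it is easy to add a check that no column name appears in more than one group. This is only a sketch on a hypothetical stand-in dictionary; the `extra` group exists purely to trigger the duplicate check.

```python
from collections import Counter
from itertools import chain

# Small stand-in for `var`; "date" is repeated on purpose to show the check.
var_example = {
    "y": ["new_tested", "new_confirmed"],
    "date": ["date"],
    "extra": ["date"],
}

# Flatten the lists of column names, preserving dictionary order.
flat = list(chain.from_iterable(var_example.values()))

# Column names that occur more than once.
duplicates = [name for name, count in Counter(flat).items() if count > 1]
print(flat, duplicates)
```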


In [None]:
# Construct a BigQuery client object.
# Note: the authenticated user must have access to the specified project;
# the listed project was created by Wesley Oke.
client = bigquery.Client(project='citric-trees-332113')

# Build the query from the list of selected columns.
query = "SELECT " + var_all2 + """
FROM
  `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
  school_closing IS NOT NULL
  /* cumulative_vaccine_doses_administered IS NOT NULL */
  /* AND new_confirmed IS NOT NULL */
  AND mobility_parks IS NOT NULL
  AND new_tested != 'None'
LIMIT 200000
"""

# Run the query and save the result as a dataframe
df = (client.query(query)
             .result()
             .to_dataframe())
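
Since the query text is assembled by string concatenation, it can help to build it in a small function that validates the column names first. The following is a hedged sketch: `build_query` and `IDENTIFIER` are hypothetical names that do not appear in the notebook, and the column list is shortened for illustration. It only produces the SQL string and does not contact BigQuery.

```python
import re

# Hypothetical helper, not part of the notebook: accepts only plain
# column identifiers (letters, digits, underscores).
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_query(columns, table, where, limit):
    """Assemble a SELECT statement from validated column names."""
    bad = [c for c in columns if not IDENTIFIER.fullmatch(c)]
    if bad:
        raise ValueError(f"invalid column names: {bad}")
    return (
        f"SELECT {', '.join(columns)}\n"
        f"FROM `{table}`\n"
        f"WHERE {' AND '.join(where)}\n"
        f"LIMIT {int(limit)}"
    )


sql = build_query(
    columns=["date", "new_tested", "mobility_parks"],
    table="bigquery-public-data.covid19_open_data.covid19_open_data",
    where=["school_closing IS NOT NULL", "mobility_parks IS NOT NULL"],
    limit=200000,
)
print(sql)
```

The resulting string could then be passed to `client.query(...)` in the same way as the hand-written query.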

In [None]:
# Convert the 'date' column to datetime format
df['date'] = pd.to_datetime(df['date'])
# Drop rows that are missing population, new_tested, or new_confirmed
df.dropna(subset=['population', 'new_tested', 'new_confirmed'], inplace=True)
# Convert new_tested (from string) and new_confirmed to int
df['new_tested'] = df['new_tested'].astype('int')
df['new_confirmed'] = df['new_confirmed'].astype('int')
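
If a string column still contains entries that cannot be parsed as numbers, `astype('int')` raises an error. One possible alternative, sketched here on made-up rows rather than the real query result, is to coerce with `pd.to_numeric(..., errors='coerce')` and then drop the resulting missing values. The frame `raw` and its values are illustrative assumptions.

```python
import pandas as pd

# Made-up rows; the string 'None' and the missing values mimic unusable entries.
raw = pd.DataFrame({
    "new_tested": ["992", "None", "2451", None],
    "new_confirmed": [49, 501, 279, 10],
    "population": [3280815.0, 3280815.0, 3280815.0, None],
})

# Coerce unparseable strings to NaN instead of raising, then drop those rows.
raw["new_tested"] = pd.to_numeric(raw["new_tested"], errors="coerce")
clean = raw.dropna(subset=["population", "new_tested", "new_confirmed"]).copy()
clean["new_tested"] = clean["new_tested"].astype("int64")

print(clean.dtypes["new_tested"], len(clean))
```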

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 84040 entries, 0 to 84732
Data columns (total 52 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   new_tested                             84040 non-null  int64         
 1   new_confirmed                          84040 non-null  int64         
 2   location_key                           84040 non-null  object        
 3   place_id                               84040 non-null  object        
 4   wikidata_id                            84040 non-null  object        
 5   country_code                           84040 non-null  object        
 6   subregion1_code                        36431 non-null  object        
 7   subregion1_name                        36431 non-null  object        
 8   date                                   84040 non-null  datetime64[ns]
 9   population                             84040 non-null  float6

In [None]:
df.head()

Unnamed: 0,new_tested,new_confirmed,location_key,place_id,wikidata_id,country_code,subregion1_code,subregion1_name,date,population,population_age_00_09,population_age_10_19,population_age_20_29,population_age_30_39,population_age_40_49,population_age_50_59,population_age_60_69,population_age_70_79,population_age_80_and_older,area_sq_km,cumulative_persons_vaccinated,cumulative_persons_fully_vaccinated,cumulative_vaccine_doses_administered,mobility_retail_and_recreation,mobility_grocery_and_pharmacy,mobility_parks,mobility_transit_stations,mobility_workplaces,mobility_residential,stringency_index,average_temperature_celsius,rainfall_mm,snowfall_mm,school_closing,workplace_closing,cancel_public_events,restrictions_on_gatherings,public_transport_closing,stay_at_home_requirements,restrictions_on_internal_movement,international_travel_controls,income_support,debt_relief,fiscal_measures,international_support,public_information_campaigns,testing_policy,contact_tracing,emergency_investment_in_healthcare,investment_in_vaccines,facial_coverings,vaccination_policy
0,992,49,BA,ChIJ16k3xxWiSxMRDOm3QwPi920,Q225,BA,,,2020-04-26,3280815.0,295212.0,346275.0,403272.0,458385.0,447738.0,500182.0,463795.0,242498.0,123458.0,51210.0,,,,-66,-34,-10,-49,-40,9,90.74,13.027778,0.072571,,3,3.0,2.0,4.0,2.0,2.0,2.0,3.0,1.0,1.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0,3.0,0.0
1,2781,501,BA,ChIJ16k3xxWiSxMRDOm3QwPi920,Q225,BA,,,2020-12-31,3280815.0,295212.0,346275.0,403272.0,458385.0,447738.0,500182.0,463795.0,242498.0,123458.0,51210.0,,,,-3,48,29,-5,-28,0,42.59,2.888889,24.60625,25.4,1,2.0,1.0,3.0,0.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,3.0,0.0
2,2451,279,BA,ChIJ16k3xxWiSxMRDOm3QwPi920,Q225,BA,,,2021-02-11,3280815.0,295212.0,346275.0,403272.0,458385.0,447738.0,500182.0,463795.0,242498.0,123458.0,51210.0,0.0,,0.0,-17,5,-25,-23,-11,-5,42.59,,,,1,2.0,1.0,3.0,0.0,2.0,0.0,1.0,1.0,1.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,3.0,1.0
3,3596,1077,BA,ChIJ16k3xxWiSxMRDOm3QwPi920,Q225,BA,,,2020-12-07,3280815.0,295212.0,346275.0,403272.0,458385.0,447738.0,500182.0,463795.0,242498.0,123458.0,51210.0,,,,-20,-5,-22,-22,-7,1,50.0,8.0,7.239,,2,1.0,2.0,3.0,0.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,3.0,0.0
4,582,290,BA,ChIJ16k3xxWiSxMRDOm3QwPi920,Q225,BA,,,2020-09-10,3280815.0,295212.0,346275.0,403272.0,458385.0,447738.0,500182.0,463795.0,242498.0,123458.0,51210.0,,,,-3,9,38,1,-17,-4,40.74,20.677778,0.0,,1,1.0,2.0,3.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,3.0,0.0


In [None]:
df.to_parquet('covid_google_df.parquet') # output based on query WHERE mobility_parks IS NOT NULL