# AC209B Final Project
## Module E: Predicting COVID-19 Cases
- Austin Rochon
- Emily Xie
- Mark Lock

<hr style="height:2pt">

## Table of Contents

0. [Introduction](#introduction)
1. [Data Summary](#data_summary)  
2. [Modeling](#modeling)
10. [Appendix I: Data Collection and Cleaning](#data_collection)  
    a. [COVID-19 Cases by State](#us_state_covid)  
    b. [Google Community Mobility Reports](#google_community_reports)  
    c. [Google Search Data](#google_search)  

<a id='introduction'></a>
## Introduction
***

In [1]:
import pandas as pd
import pickle
import datetime

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler


<a id='data_summary'></a>
## Data Summary
***

For our response variable, we used a state-by-state COVID-19 case dataset curated by The New York Times:
- **[COVID-19 Cases by State (Response)](https://github.com/nytimes/covid-19-data):** The NYT publishes national, state-level, and county-level COVID-19 case data to its GitHub repository daily.

To make predictions, we leveraged the following predictor data:
- **[Google Trends Search Data](https://trends.google.com/trends/?geo=US):** Through the Google Trends API, we collected search query data related to the coronavirus, focusing in particular on symptom-based searches.
- **[Google Community Mobility Reports](https://www.google.com/covid19/mobility/data_documentation.html?hl=en):** In March of 2020, Google began publishing "Community Mobility Reports", which track changes in activity relative to a pre-coronavirus baseline for countries across the world. Google builds these reports from location data tied to cell phones. In particular, the reports track changes in mobility for the following six categories:
    - Retail & recreation
    - Grocery & pharmacy
    - Parks
    - Transit stations
    - Workplaces
    - Residential
- **[State Social Distancing Measures](http://www.healthdata.org/covid/faqs#social%20distancing):** The University of Washington's Institute for Health Metrics and Evaluation (IHME) has produced one of the best-known COVID-19 models. As part of its modeling, it incorporates state-wide social distancing measures. We've leveraged their data to determine whether and when a state put in place the following measures:
    - Educational facilities closed
    - Non-essential businesses ordered to close
    - People ordered to stay at home
    - Severe travel restrictions
    - Any gathering restrictions
    - Any business closures
- **[Hospitalization and ICU Data](https://covidtracking.com/data):** The COVID Tracking Project monitors hospitalization and ICU numbers for the states that report them. A major caveat is that many states do not report these numbers.
- **[Weather Data]():** NOAA - temperature, humidity?



<a id='data_collection'></a>
## Appendix I: Data Collection and Cleaning
***

<a id='us_state_covid'></a>
### COVID-19 Cases by State
We'll start by fetching and cleaning our response data: COVID-19 cases by state.

In [11]:
# fetch US state data from the NY Times github
covid_us = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")

# fetch US totals data and match formatting
covid_us_totals = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv")
covid_us_totals["state"] = "United States"
covid_us_totals["fips"] = 0
covid_us_totals = covid_us_totals[["date", "state", "fips", "cases", "deaths"]]

# merge two dfs
covid_us = pd.concat([covid_us, covid_us_totals])

# transform date to datetime object
covid_us["date"] = pd.to_datetime(covid_us["date"])

covid_us

| | date | state | fips | cases | deaths |
|---|---|---|---|---|---|
| 0 | 2020-01-21 | Washington | 53 | 1 | 0 |
| 1 | 2020-01-22 | Washington | 53 | 1 | 0 |
| 2 | 2020-01-23 | Washington | 53 | 1 | 0 |
| 3 | 2020-01-24 | Illinois | 17 | 1 | 0 |
| 4 | 2020-01-24 | Washington | 53 | 1 | 0 |
| ... | ... | ... | ... | ... | ... |
| 104 | 2020-05-04 | United States | 0 | 1186979 | 68843 |
| 105 | 2020-05-05 | United States | 0 | 1210686 | 71077 |
| 106 | 2020-05-06 | United States | 0 | 1235190 | 73785 |
| 107 | 2020-05-07 | United States | 0 | 1264001 | 75744 |


In [12]:
def impute_missing_dates(df):
    '''
    Zero-fills cases and deaths for the dates before each state's
    first reported row, so that every state starts on the same
    MIN_DATE. Dates from a state's first row onward are left as
    reported.
    '''
    # set min values
    MIN_DATE = df["date"].min()
    MIN_CASES = 0
    MIN_DEATHS = 0
    
    # iterate through all states
    imputed_data = []    
    for state in df["state"].unique():
        # look up this state's rows and fips code once
        state_rows = df.loc[df["state"] == state]
        fips = state_rows.iloc[0]["fips"]

        # build list of missing dates
        # https://stackoverflow.com/questions/7274267/print-all-day-dates-between-two-dates
        sdate = MIN_DATE
        edate = state_rows["date"].min()
        delta = edate - sdate 

        # iterate through all missing dates and impute case and 
        # death data
        for i in range(delta.days):
            day = sdate + datetime.timedelta(days=i)
            imputed_data.append({"date": day,
                                 "state": state,
                                 "fips": fips,
                                 "cases": MIN_CASES,
                                 "deaths": MIN_DEATHS})

    # final cleanup
    new_df = pd.concat([pd.DataFrame(imputed_data), df])
    new_df = new_df.sort_values(by=["state", "date"]).reset_index(drop=True)
    
    return new_df
    

def days_since_20200101(df):
    '''
    Creates a new column which measures the number of days that 
    a given observation is from 20200101 (which will be our 
    baseline date)
    '''
    # create days_since_20200101 col
    START_DATE = datetime.datetime.strptime("2020-01-01", "%Y-%m-%d")
    df["days_since_20200101"] = (df["date"] - START_DATE).dt.days
    
    return df
    
    
# map state abbreviation
states_dict = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY',
    'United States': 'USA'
}

In [13]:
# clean data by imputing missing dates and mapping abbreviations
covid_us_cleaned = impute_missing_dates(covid_us)
covid_us_cleaned["abbrev"] = covid_us_cleaned["state"].map(states_dict)

# create days_since_20200101 col
covid_us_cleaned = days_since_20200101(covid_us_cleaned)
covid_us_cleaned

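| | date | state | fips | cases | deaths | abbrev | days_since_20200101 |
|---|---|---|---|---|---|---|---|
| 0 | 2020-01-21 | Alabama | 1 | 0 | 0 | AL | 20 |
| 1 | 2020-01-22 | Alabama | 1 | 0 | 0 | AL | 21 |
| 2 | 2020-01-23 | Alabama | 1 | 0 | 0 | AL | 22 |
| 3 | 2020-01-24 | Alabama | 1 | 0 | 0 | AL | 23 |
| 4 | 2020-01-25 | Alabama | 1 | 0 | 0 | AL | 24 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 6099 | 2020-05-04 | Wyoming | 56 | 596 | 7 | WY | 124 |
| 6100 | 2020-05-05 | Wyoming | 56 | 604 | 7 | WY | 125 |
| 6101 | 2020-05-06 | Wyoming | 56 | 631 | 7 | WY | 126 |
| 6102 | 2020-05-07 | Wyoming | 56 | 635 | 7 | WY | 127 |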
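
To make the effect of the cleaning step easier to check, here is a minimal sketch on toy data. It is not the notebook's own function: `toy` and `pad_leading_dates` are hypothetical names, and the helper is a compact stand-in that mirrors the same idea (zero-fill the dates before a state's first report, then count days from 2020-01-01).

```python
import pandas as pd

# Toy stand-in for the NYT data (hypothetical values):
# state B's first report comes two days after state A's.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-21", "2020-01-22", "2020-01-23", "2020-01-23"]),
    "state": ["A", "A", "A", "B"],
    "fips": [1, 1, 1, 2],
    "cases": [1, 1, 2, 1],
    "deaths": [0, 0, 0, 0],
})


def pad_leading_dates(df):
    """Zero-fill each state's rows from the global first date up to its own first report."""
    global_start = df["date"].min()
    pieces = [df]
    for state, grp in df.groupby("state"):
        first = grp["date"].min()
        # Dates from the global start through the day before this state's first report.
        missing = pd.date_range(global_start, first - pd.Timedelta(days=1))
        if len(missing) > 0:
            pieces.append(pd.DataFrame({
                "date": missing,
                "state": state,
                "fips": grp["fips"].iloc[0],
                "cases": 0,
                "deaths": 0,
            }))
    return pd.concat(pieces).sort_values(["state", "date"]).reset_index(drop=True)


padded = pad_leading_dates(toy)

# Day offset from the 2020-01-01 baseline, as in days_since_20200101.
padded["days_since_20200101"] = (padded["date"] - pd.Timestamp("2020-01-01")).dt.days
print(padded)
```

After padding, both toy states start on 2020-01-21, which sits 20 days after the baseline, matching the first rows of the cleaned table.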


<a id='google_search'></a>
### Google COVID-19 Searches

Next, we collected Google search data related to COVID-19. We started by curating a list of Google search queries that we thought could be predictive of coronavirus cases. To extract the Google query data, we used [pytrends](https://github.com/GeneralMills/pytrends), a library that allows you to query the Google Trends API for search data.

In [14]:
import pytrends
from pytrends.request import TrendReq

# connect 
pyt = TrendReq(hl='en-US', tz=300)

We started by fetching a list of queries related to "coronavirus symptoms":

In [15]:
kw_list = ["coronavirus symptoms"]
now = datetime.datetime.now().strftime("%Y-%m-%d")
time_frame = f'''2020-01-01 {now}'''
pyt.build_payload(kw_list, cat=0, timeframe=time_frame, geo='US', gprop='')
pyt.related_queries()[kw_list[0]]["rising"]

| | query | value |
|---|---|---|
| 0 | the coronavirus | 989350 |
| 1 | the symptoms of the coronavirus | 675550 |
| 2 | the symptoms of coronavirus | 663400 |
| 3 | corona symptoms | 490450 |
| 4 | corona | 488550 |
| 5 | what are coronavirus symptoms | 413900 |
| 6 | corona virus | 354150 |
| 7 | corona virus symptoms | 345900 |
| 8 | what are symptoms of coronavirus | 343550 |
| 9 | what are the coronavirus symptoms | 325700 |


The final query list we landed on was:
- fever
- shortness of breath
- loss of smell
- coronavirus testing near me
- do i have coronavirus

Note: the Google Trends API only allows you to compare 5 queries at a time. Thus, we used our highest volume query, `flu`, and included it in several calls as a baseline.
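
Because Google Trends scales the values in each request relative to the peak within that request, numbers from two separate calls are not directly comparable. One way a shared benchmark could be used to put them on a common scale is sketched below. This is an illustration under stated assumptions, not the procedure the notebook applies: the values are made up, and `batch_a`, `batch_b`, and `scale` are hypothetical names.

```python
import pandas as pd

# Toy interest values from two separate requests that both include "flu".
batch_a = pd.DataFrame({"flu": [50, 100], "cough": [25, 40]})
batch_b = pd.DataFrame({"flu": [25, 50], "fever": [50, 100]})

# The benchmark describes the same underlying searches in both batches,
# so the ratio of its average levels gives a conversion factor
# from batch B units to batch A units.
scale = batch_a["flu"].mean() / batch_b["flu"].mean()

# Express batch B's non-benchmark term in batch A's units and combine.
combined = batch_a.assign(fever=batch_b["fever"] * scale)
print(combined)
```

In this toy case the benchmark is twice as high in the first batch, so `fever` is doubled before being placed next to `cough`. A rescaled value can exceed 100, which simply means the term outranks the peak of the first batch.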

In [20]:
def fetch_query_data(kw_list):
    '''
    Fetches Google search data for a list of keywords
    for all states, one at a time. We do it one at a time
    so that the search data is scaled relative to each 
    state's own search activity
    '''
    
    # instantiate state list and a list to collect per-state results
    states = list(states_dict.values())
    state_frames = []

    print(f"Fetching data for the following queries: {kw_list}")
    # fetch all queries for each state in date range
    for state in states:
        if state == "USA":
            geo = "US"
        else:
            geo = f"US-{state}"
        try:
            pyt.build_payload(kw_list, cat=0, timeframe=f"2020-01-01 {now}", geo=geo, gprop='')
        except Exception:
            # any failed request is reported as "NOT FOUND"; the cause may be
            # an unsupported geo code or another request error
            print(f"NOT FOUND: {state}")
            continue
        interest = pyt.interest_over_time().reset_index()
        interest["abbrev"] = state
        state_frames.append(interest)

    # combine per-state results once; sort=False keeps the column order and
    # avoids the pandas FutureWarning about sorting non-aligned columns
    state_queries = pd.concat(state_frames, sort=False)

    # clean data types
    state_queries["date"] = pd.to_datetime(state_queries["date"])
    state_queries[kw_list] = state_queries[kw_list].apply(pd.to_numeric)
    state_queries = state_queries[["date", "abbrev"] + kw_list]
    
    # sort
    state_queries = state_queries.sort_values(by=["abbrev", "date"])
    
    return state_queries

In [21]:
# build queries 5 at a time, benchmark is flu
# start queries
benchmark = "flu"
kw_list = [benchmark, "shortness of breath", "loss of smell", "loss of taste", "cough"]
queries_master = fetch_query_data(kw_list)

Fetching data for the following queries: ['flu', 'shortness of breath', 'loss of smell', 'loss of taste', 'cough']


NOT FOUND: AS
NOT FOUND: GU
NOT FOUND: MP
NOT FOUND: PR
NOT FOUND: VI


In [22]:
# fetch new query and merge with preceding queries
kw_list = [benchmark, "fever", "coronavirus testing near me", "do i have coronavirus", 
           "covid testing center"]
queries = fetch_query_data(kw_list)

# merge with preceding
queries_master = pd.merge(queries_master, queries.drop(benchmark, axis=1), 
         how="left", on=["date", "abbrev"])

Fetching data for the following queries: ['flu', 'fever', 'coronavirus testing near me', 'do i have coronavirus', 'covid testing center']


NOT FOUND: AS
NOT FOUND: GU
NOT FOUND: MP
NOT FOUND: PR
NOT FOUND: VT
NOT FOUND: VI
NOT FOUND: VA
NOT FOUND: WA
NOT FOUND: WV
NOT FOUND: WI
NOT FOUND: WY
NOT FOUND: USA
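
Because the merge is a left join keyed on `date` and `abbrev`, any state reported as "NOT FOUND" in the second request keeps its rows in `queries_master` but gets missing values in the new query columns. The sketch below reproduces that behavior on toy data so the gaps can be listed explicitly. The frames and values are hypothetical, and `master`, `second`, and `missing_states` are illustrative names.

```python
import pandas as pd

# First batch: both states were fetched successfully.
master = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-01", "2020-03-01"]),
    "abbrev": ["AL", "WY"],
    "flu": [40, 35],
    "cough": [20, 18],
})

# Second batch: the request for WY failed, so only AL is present.
second = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-01"]),
    "abbrev": ["AL"],
    "flu": [42],
    "fever": [30],
})

# Same merge pattern as in the notebook: drop the repeated benchmark column,
# then left-join the new query columns onto the master frame.
merged = pd.merge(master, second.drop("flu", axis=1),
                  how="left", on=["date", "abbrev"])

# States with no data for the newly added column.
missing_states = merged.loc[merged["fever"].isna(), "abbrev"].unique().tolist()
print(merged)
print(missing_states)
```

Listing the affected states after each merge makes it easier to decide whether to retry the failed requests or to drop those states before modeling.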


In [None]:
# fetch new query and merge with preceding queries
kw_list = [benchmark, "chills", "sore throat", "fatigue", "chest pain"]
queries = fetch_query_data(kw_list)

# merge with preceding
queries_master = pd.merge(queries_master, queries.drop(benchmark, axis=1), 
         how="left", on=["date", "abbrev"])

