# Dataset and Motivation

For this notebook we will work with the Our World in Data COVID-19 dataset. The data covers vaccinations, tests and positivity rates, hospitalization and ICU numbers, confirmed cases, confirmed deaths, and more for the COVID-19 pandemic around the world. It is compiled from a number of sources, including the Center for Systems Science and Engineering at Johns Hopkins University, the European Centre for Disease Prevention and Control, various governmental sources, and official reports. A full description of the data sources included in this dataset can be found in the official [GitHub repository](https://github.com/owid/covid-19-data/tree/master/public/data/) for the dataset. While some variables are updated daily, others are updated weekly or periodically, depending on the availability of data from the official sources.

## Packages

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
covid_data = pd.read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv")

In [4]:
size = covid_data.shape
print("The dataset includes " + str(size[0]) + " observations of " + str(size[1]) + " features.")

The dataset includes 132644 observations of 65 features.


In [5]:
print("The features included in the dataset are: " + str(list(covid_data.columns)))

The features included in the dataset are: ['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases', 'new_cases_smoothed', 'total_deaths', 'new_deaths', 'new_deaths_smoothed', 'total_cases_per_million', 'new_cases_per_million', 'new_cases_smoothed_per_million', 'total_deaths_per_million', 'new_deaths_per_million', 'new_deaths_smoothed_per_million', 'reproduction_rate', 'icu_patients', 'icu_patients_per_million', 'hosp_patients', 'hosp_patients_per_million', 'weekly_icu_admissions', 'weekly_icu_admissions_per_million', 'weekly_hosp_admissions', 'weekly_hosp_admissions_per_million', 'new_tests', 'total_tests', 'total_tests_per_thousand', 'new_tests_per_thousand', 'new_tests_smoothed', 'new_tests_smoothed_per_thousand', 'positive_rate', 'tests_per_case', 'tests_units', 'total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated', 'total_boosters', 'new_vaccinations', 'new_vaccinations_smoothed', 'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred', 'peo

# Research Question

How has COVID-19 spread since the first case was discovered? How has the trend changed since the introduction of key intervention measures such as social distancing, mask requirements, vaccinations, and more?

# Data Cleaning

In [6]:
covid_data.sample(5)

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
59633,ITA,Europe,Italy,2021-08-29,4530246.0,5954.0,6519.0,129093.0,37.0,48.857,...,19.8,27.8,,3.18,83.51,0.892,135601.3,12.52,5.21,2246.264383
58408,IMN,Europe,Isle of Man,2021-10-22,,,,,,,...,,,,,81.4,,,,,
114790,SWE,Europe,Sweden,2021-08-11,1108057.0,1236.0,746.857,14658.0,0.0,0.143,...,18.8,18.9,,2.22,82.8,0.945,,,,
78407,MDA,Europe,Moldova,2020-04-03,591.0,86.0,56.0,8.0,2.0,0.857,...,5.9,44.6,86.979,5.8,71.9,0.75,,,,
58403,IMN,Europe,Isle of Man,2021-10-17,,,,,,,...,,,,,81.4,,,,,


We will narrow the dataset to US data only.

In [7]:
us_data = covid_data[covid_data['iso_code'] == 'USA'].reset_index()

In [8]:
us_data.shape

(660, 66)

We can drop some columns that are not relevant to our analysis.

In [9]:
us_data.drop(columns=['new_cases_smoothed', 'new_cases_smoothed_per_million', 'new_deaths_smoothed', 'new_deaths_smoothed_per_million', 'excess_mortality', 'excess_mortality_cumulative', 'excess_mortality_cumulative_absolute', 'excess_mortality_cumulative_per_million', 'new_tests_smoothed', 'new_tests_smoothed_per_thousand', 'new_vaccinations_smoothed', 'new_vaccinations_smoothed_per_million', 'iso_code', 'continent', 'location', 'gdp_per_capita', 'extreme_poverty'], inplace=True)

In [10]:
us_data.shape

(660, 49)

We are left with 660 observations and 49 columns: 48 features plus the `index` column that `reset_index()` added when we filtered to US data.

In [11]:
print("The remaining variables available to us to examine are: " + str(list(us_data.columns)))

The remaining variables available to us to examine are: ['index', 'date', 'total_cases', 'new_cases', 'total_deaths', 'new_deaths', 'total_cases_per_million', 'new_cases_per_million', 'total_deaths_per_million', 'new_deaths_per_million', 'reproduction_rate', 'icu_patients', 'icu_patients_per_million', 'hosp_patients', 'hosp_patients_per_million', 'weekly_icu_admissions', 'weekly_icu_admissions_per_million', 'weekly_hosp_admissions', 'weekly_hosp_admissions_per_million', 'new_tests', 'total_tests', 'total_tests_per_thousand', 'new_tests_per_thousand', 'positive_rate', 'tests_per_case', 'tests_units', 'total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated', 'total_boosters', 'new_vaccinations', 'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred', 'people_fully_vaccinated_per_hundred', 'total_boosters_per_hundred', 'stringency_index', 'population', 'population_density', 'median_age', 'aged_65_older', 'aged_70_older', 'cardiovasc_death_rate', 'diabetes_prevalence

In [12]:
us_data.dtypes

index                                    int64
date                                    object
total_cases                            float64
new_cases                              float64
total_deaths                           float64
new_deaths                             float64
total_cases_per_million                float64
new_cases_per_million                  float64
total_deaths_per_million               float64
new_deaths_per_million                 float64
reproduction_rate                      float64
icu_patients                           float64
icu_patients_per_million               float64
hosp_patients                          float64
hosp_patients_per_million              float64
weekly_icu_admissions                  float64
weekly_icu_admissions_per_million      float64
weekly_hosp_admissions                 float64
weekly_hosp_admissions_per_million     float64
new_tests                              float64
total_tests                            float64
total_tests_p

The `date` column is stored as `object` (plain strings) rather than as a datetime type, so we convert it.

In [13]:
us_data['date'] = pd.to_datetime(us_data['date'], utc=False)

In [14]:
us_data.dtypes

index                                           int64
date                                   datetime64[ns]
total_cases                                   float64
new_cases                                     float64
total_deaths                                  float64
new_deaths                                    float64
total_cases_per_million                       float64
new_cases_per_million                         float64
total_deaths_per_million                      float64
new_deaths_per_million                        float64
reproduction_rate                             float64
icu_patients                                  float64
icu_patients_per_million                      float64
hosp_patients                                 float64
hosp_patients_per_million                     float64
weekly_icu_admissions                         float64
weekly_icu_admissions_per_million             float64
weekly_hosp_admissions                        float64
weekly_hosp_admissions_per_m

The date column is now properly typed.
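
To illustrate why the conversion matters, here is a minimal sketch on a toy frame (the `toy` DataFrame and its values are made up for illustration and are not taken from the dataset). Once a column holds datetimes, comparisons and date arithmetic are chronological rather than string-based.

```python
import pandas as pd

# Toy stand-in for the real `date` column (illustrative values only).
toy = pd.DataFrame({"date": ["2020-03-03", "2020-03-04", "2020-03-05"]})

# Parse the strings into proper datetimes.
toy["date"] = pd.to_datetime(toy["date"])
is_datetime = pd.api.types.is_datetime64_any_dtype(toy["date"])

# Chronological comparison against a cutoff date.
after_cutoff = toy["date"] > pd.Timestamp("2020-03-04")

# Date arithmetic: number of days covered by the column.
span_days = (toy["date"].max() - toy["date"].min()).days

print(is_datetime, after_cutoff.tolist(), span_days)
```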

In [15]:
us_data.isna().sum()

index                                    0
date                                     0
total_cases                              0
new_cases                                1
total_deaths                            38
new_deaths                              38
total_cases_per_million                  0
new_cases_per_million                    1
total_deaths_per_million                38
new_deaths_per_million                  38
reproduction_rate                       44
icu_patients                           175
icu_patients_per_million               175
hosp_patients                          175
hosp_patients_per_million              175
weekly_icu_admissions                  660
weekly_icu_admissions_per_million      660
weekly_hosp_admissions                 592
weekly_hosp_admissions_per_million     592
new_tests                               42
total_tests                             42
total_tests_per_thousand                42
new_tests_per_thousand                  42
positive_ra

The `handwashing_facilities` column appears to contain only nulls, so we will drop it.

In [16]:
us_data.drop(columns=['handwashing_facilities'], inplace=True)
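
Rather than spotting empty columns by eye, they can be found programmatically. The sketch below uses a toy frame (made-up values, with column names borrowed from the dataset) to list every column whose values are all null. In the null counts above, `weekly_icu_admissions` and `weekly_icu_admissions_per_million` also show 660 nulls out of 660 rows, so a check like this would flag them as well.

```python
import numpy as np
import pandas as pd

# Toy frame: one fully empty column, one partially empty, one complete.
toy = pd.DataFrame({
    "total_cases": [1.0, 2.0, 5.0],
    "handwashing_facilities": [np.nan, np.nan, np.nan],
    "reproduction_rate": [np.nan, 1.2, 1.1],
})

# isna().all() is True only for columns where every value is null.
all_null_cols = toy.columns[toy.isna().all()].tolist()

# Drop those columns without modifying the original frame.
trimmed = toy.drop(columns=all_null_cols)

print(all_null_cols)
print(list(trimmed.columns))
```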

In [17]:
us_data[pd.isna(us_data['reproduction_rate'])]

Unnamed: 0,index,date,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,total_deaths_per_million,new_deaths_per_million,...,median_age,aged_65_older,aged_70_older,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,hospital_beds_per_thousand,life_expectancy,human_development_index
0,125180,2020-01-22,1.0,,,,0.003,,,,...,38.3,15.413,9.732,151.089,10.79,19.1,24.6,2.77,78.86,0.926
1,125181,2020-01-23,1.0,0.0,,,0.003,0.0,,,...,38.3,15.413,9.732,151.089,10.79,19.1,24.6,2.77,78.86,0.926
2,125182,2020-01-24,2.0,1.0,,,0.006,0.003,,,...,38.3,15.413,9.732,151.089,10.79,19.1,24.6,2.77,78.86,0.926
3,125183,2020-01-25,2.0,0.0,,,0.006,0.0,,,...,38.3,15.413,9.732,151.089,10.79,19.1,24.6,2.77,78.86,0.926
4,125184,2020-01-26,5.0,3.0,,,0.015,0.009,,,...,38.3,15.413,9.732,151.089,10.79,19.1,24.6,2.77,78.86,0.926
5,125185,2020-01-27,5.0,0.0,,,0.015,0.0,,,...,38.3,15.413,9.732,151.089,10.79,19.1,24.6,2.77,78.86,0.926
6,125186,2020-01-28,5.0,0.0,,,0.015,0.0,,,...,38.3,15.413,9.732,151.089,10.79,19.1,24.6,2.77,78.86,0.926
7,125187,2020-01-29,6.0,1.0,,,0.018,0.003,,,...,38.3,15.413,9.732,151.089,10.79,19.1,24.6,2.77,78.86,0.926
8,125188,2020-01-30,6.0,0.0,,,0.018,0.0,,,...,38.3,15.413,9.732,151.089,10.79,19.1,24.6,2.77,78.86,0.926
9,125189,2020-01-31,8.0,2.0,,,0.024,0.006,,,...,38.3,15.413,9.732,151.089,10.79,19.1,24.6,2.77,78.86,0.926


We can replace most of the null values with 0, with the exception of the `reproduction_rate` column. In the other columns, nulls appear where no data was reported for that metric, which we treat as a zero here. For `reproduction_rate` the situation is different: a null indicates that there was not enough data to estimate the rate, so we will leave these nulls in our data.

In [18]:
cols = list(us_data.columns)
cols.remove('reproduction_rate')
us_data[cols] = us_data[cols].fillna(0)
us_data.isna().sum()

index                                   0
date                                    0
total_cases                             0
new_cases                               0
total_deaths                            0
new_deaths                              0
total_cases_per_million                 0
new_cases_per_million                   0
total_deaths_per_million                0
new_deaths_per_million                  0
reproduction_rate                      44
icu_patients                            0
icu_patients_per_million                0
hosp_patients                           0
hosp_patients_per_million               0
weekly_icu_admissions                   0
weekly_icu_admissions_per_million       0
weekly_hosp_admissions                  0
weekly_hosp_admissions_per_million      0
new_tests                               0
total_tests                             0
total_tests_per_thousand                0
new_tests_per_thousand                  0
positive_rate                     

`reproduction_rate` is now the only column that contains nulls.
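
The selective fill can be checked on a small example. This is a minimal sketch on a toy frame with made-up values; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with nulls in both a "fill with 0" column and the excluded column.
toy = pd.DataFrame({
    "new_deaths": [np.nan, 0.0, 3.0],
    "reproduction_rate": [np.nan, np.nan, 1.4],
})

# Fill every column except reproduction_rate.
fill_cols = [c for c in toy.columns if c != "reproduction_rate"]
toy[fill_cols] = toy[fill_cols].fillna(0)

# Only reproduction_rate should still report nulls.
remaining_nulls = toy.isna().sum()
print(remaining_nulls)
```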

## Feature Engineering

Since we are interested in how trends changed as various intervention measures were implemented, we will add indicator columns that record which measures were in place for each observation. Specifically, we will look at when social distancing measures were implemented, stay-at-home orders were issued, mask mandates were announced, and vaccinations became available.

- According to the [Kaiser Family Foundation](https://www.kff.org/policy-watch/stay-at-home-orders-to-fight-covid19/), the first stay-at-home order was announced in King County in Washington state on March 4, 2020.
- According to [NPR](https://www.npr.org/2020/03/16/816658125/white-house-announces-new-social-distancing-guidelines-around-coronavirus), the White House announced social distancing guidelines on March 16, 2020.
- According to [Wikipedia](https://en.wikipedia.org/wiki/Face_masks_during_the_COVID-19_pandemic_in_the_United_States#Timeline), the CDC issued the first federal guidance recommending non-medical face coverings to be worn on April 3, 2020.
- According to the [FDA](<https://www.fda.gov/emergency-preparedness-and-response/coronavirus-disease-2019-covid-19/covid-19-frequently-asked-questions#:~:text=On%20December%2011%2C%202020,)%20of%20a%20vaccine.>), the first Emergency Use Authorization for a COVID vaccine was granted on December 11, 2020 for the Pfizer-BioNTech Vaccine.

In [19]:
us_data['stay_at_home'] = us_data['date'] > '2020-03-04'
us_data['social_distancing'] = us_data['date'] > '2020-03-16'
us_data['face_covering'] = us_data['date'] > '2020-04-03'  # CDC guidance date cited above
us_data['vaccines'] = us_data['date'] > '2020-12-11'
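
As a sketch of the same idea, the dates can be kept in one mapping and the indicator columns built in a loop, which keeps each date next to the name of its measure. The helper `add_measure_flags` and the `MEASURE_DATES` dictionary are hypothetical names introduced here for illustration; the dates are the ones cited in the list above, and the toy frame is made up.

```python
import pandas as pd

# Announcement dates as cited in the list above.
MEASURE_DATES = {
    "stay_at_home": "2020-03-04",
    "social_distancing": "2020-03-16",
    "face_covering": "2020-04-03",
    "vaccines": "2020-12-11",
}

def add_measure_flags(df, measure_dates):
    """Return a copy of df with one boolean column per measure."""
    out = df.copy()
    for name, start in measure_dates.items():
        # True for every day strictly after the announcement date.
        out[name] = out["date"] > pd.Timestamp(start)
    return out

# Toy frame with three dates (illustrative only).
toy = pd.DataFrame({"date": pd.to_datetime(["2020-03-02", "2020-03-07", "2021-04-05"])})
flagged = add_measure_flags(toy, MEASURE_DATES)
print(flagged)
```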

In [20]:
us_data.sample(5)

Unnamed: 0,index,date,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,total_deaths_per_million,new_deaths_per_million,...,diabetes_prevalence,female_smokers,male_smokers,hospital_beds_per_thousand,life_expectancy,human_development_index,stay_at_home,social_distancing,face_covering,vaccines
439,125619,2021-04-05,30853266.0,74686.0,555011.0,471.0,92676.086,224.339,1667.125,1.415,...,10.79,19.1,24.6,2.77,78.86,0.926,True,True,True,True
446,125626,2021-04-12,31336673.0,68299.0,561883.0,452.0,94128.129,205.154,1687.767,1.358,...,10.79,19.1,24.6,2.77,78.86,0.926,True,True,True,True
45,125225,2020-03-07,403.0,166.0,17.0,3.0,1.211,0.499,0.051,0.009,...,10.79,19.1,24.6,2.77,78.86,0.926,True,False,False,False
602,125782,2021-09-15,41660101.0,171309.0,667497.0,2765.0,125137.323,514.573,2005.007,8.305,...,10.79,19.1,24.6,2.77,78.86,0.926,True,True,True,True
40,125220,2020-03-02,55.0,23.0,6.0,5.0,0.165,0.069,0.018,0.015,...,10.79,19.1,24.6,2.77,78.86,0.926,False,False,False,False


In [21]:
def determine_measures(row):
    """Join the names of all measures in effect for this row with ' | '."""
    measure_cols = ['stay_at_home', 'social_distancing', 'face_covering', 'vaccines']
    return " | ".join(col for col in measure_cols if row[col])

In [22]:
us_data['prevention_measures'] = us_data.apply(determine_measures, axis=1)

In [23]:
us_data[['stay_at_home', 'social_distancing', 'face_covering', 'vaccines', 'prevention_measures']].sample(10)

Unnamed: 0,stay_at_home,social_distancing,face_covering,vaccines,prevention_measures
378,True,True,True,True,stay_at_home | social_distancing | face_coveri...
241,True,True,True,False,stay_at_home | social_distancing | face_covering
212,True,True,True,False,stay_at_home | social_distancing | face_covering
293,True,True,True,False,stay_at_home | social_distancing | face_covering
352,True,True,True,True,stay_at_home | social_distancing | face_coveri...
358,True,True,True,True,stay_at_home | social_distancing | face_coveri...
577,True,True,True,True,stay_at_home | social_distancing | face_coveri...
136,True,True,True,False,stay_at_home | social_distancing | face_covering
530,True,True,True,True,stay_at_home | social_distancing | face_coveri...
637,True,True,True,True,stay_at_home | social_distancing | face_coveri...
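
The combined label can also be built without a row-wise `apply`, by iterating over the boolean flags as a NumPy array. This is a minimal sketch on a toy frame with made-up rows; `MEASURE_COLS` is a name introduced here for illustration.

```python
import pandas as pd

MEASURE_COLS = ["stay_at_home", "social_distancing", "face_covering", "vaccines"]

# Toy flags: no measures, one measure, three measures.
toy = pd.DataFrame(
    [
        [False, False, False, False],
        [True, False, False, False],
        [True, True, True, False],
    ],
    columns=MEASURE_COLS,
)

# Convert the flags to a plain boolean array and join the active names per row.
flags = toy[MEASURE_COLS].to_numpy(dtype=bool)
toy["prevention_measures"] = [
    " | ".join(name for name, active in zip(MEASURE_COLS, row) if active)
    for row in flags
]

labels = toy["prevention_measures"].tolist()
print(labels)
```

Rows with no active measure get an empty string, which matches the behavior of `determine_measures`.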


# Visualization