# Dataset and Motivation

For this notebook we will be working with the Our World in Data COVID-19 dataset. The data covers vaccinations, tests and positivity rates, hospitalization and ICU numbers, confirmed cases, confirmed deaths, and more, for countries around the world during the COVID-19 pandemic. It is compiled from a number of sources, including the Center for Systems Science and Engineering at Johns Hopkins University, the European Centre for Disease Prevention and Control, various governmental sources, and official reports. A full description of the data sources can be found in the official [GitHub repository](https://github.com/owid/covid-19-data/tree/master/public/data/) for the dataset. Some variables are updated daily; others are updated weekly or periodically, depending on the availability of data from the official sources.

## Packages

In [1]:
import pandas as pd

In [7]:
covid_data = pd.read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv")

In [12]:
size = covid_data.shape
print(f"The dataset includes {size[0]} observations of {size[1]} features.")

The dataset includes 132644 observations of 65 features.


In [14]:
print("The features included in the dataset are: " + str(list(covid_data.columns)))

The features included in the dataset are: ['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases', 'new_cases_smoothed', 'total_deaths', 'new_deaths', 'new_deaths_smoothed', 'total_cases_per_million', 'new_cases_per_million', 'new_cases_smoothed_per_million', 'total_deaths_per_million', 'new_deaths_per_million', 'new_deaths_smoothed_per_million', 'reproduction_rate', 'icu_patients', 'icu_patients_per_million', 'hosp_patients', 'hosp_patients_per_million', 'weekly_icu_admissions', 'weekly_icu_admissions_per_million', 'weekly_hosp_admissions', 'weekly_hosp_admissions_per_million', 'new_tests', 'total_tests', 'total_tests_per_thousand', 'new_tests_per_thousand', 'new_tests_smoothed', 'new_tests_smoothed_per_thousand', 'positive_rate', 'tests_per_case', 'tests_units', 'total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated', 'total_boosters', 'new_vaccinations', 'new_vaccinations_smoothed', 'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred', 'peo

# Research Question

How has COVID-19 spread since the first case was discovered? How has the trend changed since the introduction of key intervention measures such as social distancing, mask requirements, vaccinations, and more?

# Data Cleaning

In [15]:
covid_data.sample(5)

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
20823,CMR,Africa,Cameroon,2021-04-24,65998.0,0.0,169.857,991.0,0.0,7.429,...,,,2.735,1.3,59.29,0.563,,,,
50244,GUY,South America,Guyana,2020-12-30,6319.0,18.0,8.714,164.0,0.0,0.286,...,,,77.159,1.6,69.91,0.682,,,,
130267,OWID_WRL,,World,2020-04-01,959098.0,83000.0,68490.714,50684.0,6014.0,4095.0,...,6.434,34.635,60.13,2.705,72.58,0.737,,,,
23837,CHL,South America,Chile,2020-03-23,801.0,114.0,88.429,2.0,1.0,0.286,...,34.2,41.5,,2.11,80.18,0.851,,,,
2494,DZA,Africa,Algeria,2021-10-20,205529.0,76.0,89.857,5878.0,3.0,2.286,...,0.7,30.4,83.741,1.9,76.88,0.748,,,,


We will narrow the dataset to US data only.

In [32]:
us_data = covid_data[covid_data['iso_code'] == 'USA'].reset_index()

In [33]:
us_data.shape

(660, 66)

We can drop some columns that are not relevant to our analysis.

In [34]:
us_data.drop(columns=['new_cases_smoothed', 'new_cases_smoothed_per_million', 'new_deaths_smoothed', 'new_deaths_smoothed_per_million', 'excess_mortality', 'excess_mortality_cumulative', 'excess_mortality_cumulative_absolute', 'excess_mortality_cumulative_per_million', 'new_tests_smoothed', 'new_tests_smoothed_per_thousand', 'new_vaccinations_smoothed', 'new_vaccinations_smoothed_per_million', 'iso_code', 'continent', 'location', 'gdp_per_capita', 'extreme_poverty'], inplace=True)

In [35]:
us_data.shape

(660, 49)

We are left with 660 observations and 49 columns: 48 features plus the `index` column added by `reset_index()`.
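The shape reported above counts 49 columns because `reset_index()` keeps the old row labels as a new `index` column by default. The following is a minimal sketch on a toy frame (the values are illustrative and not taken from the dataset) comparing the default behavior with `reset_index(drop=True)`:

```python
import pandas as pd

# Toy stand-in for the full dataset; values are illustrative only.
toy = pd.DataFrame(
    {
        "iso_code": ["CAN", "USA", "USA", "MEX"],
        "total_cases": [10.0, 1.0, 2.0, 5.0],
    }
)

usa_rows = toy[toy["iso_code"] == "USA"]

# Default: the old row labels are kept as a new 'index' column.
with_index = usa_rows.reset_index()

# drop=True: the old row labels are discarded, so no column is added.
without_index = usa_rows.reset_index(drop=True)

print(list(with_index.columns))
print(list(without_index.columns))
```

Keeping the `index` column, as this notebook does, preserves each row's position in the original file. Dropping it would be a reasonable alternative if that position is never needed.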

In [36]:
print("The remaining variables available to us to examine are: " + str(list(us_data.columns)))

The remaining variables available to us to examine are: ['index', 'date', 'total_cases', 'new_cases', 'total_deaths', 'new_deaths', 'total_cases_per_million', 'new_cases_per_million', 'total_deaths_per_million', 'new_deaths_per_million', 'reproduction_rate', 'icu_patients', 'icu_patients_per_million', 'hosp_patients', 'hosp_patients_per_million', 'weekly_icu_admissions', 'weekly_icu_admissions_per_million', 'weekly_hosp_admissions', 'weekly_hosp_admissions_per_million', 'new_tests', 'total_tests', 'total_tests_per_thousand', 'new_tests_per_thousand', 'positive_rate', 'tests_per_case', 'tests_units', 'total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated', 'total_boosters', 'new_vaccinations', 'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred', 'people_fully_vaccinated_per_hundred', 'total_boosters_per_hundred', 'stringency_index', 'population', 'population_density', 'median_age', 'aged_65_older', 'aged_70_older', 'cardiovasc_death_rate', 'diabetes_prevalence

In [37]:
us_data.isna().sum()

index                                    0
date                                     0
total_cases                              0
new_cases                                1
total_deaths                            38
new_deaths                              38
total_cases_per_million                  0
new_cases_per_million                    1
total_deaths_per_million                38
new_deaths_per_million                  38
reproduction_rate                       47
icu_patients                           176
icu_patients_per_million               176
hosp_patients                          176
hosp_patients_per_million              176
weekly_icu_admissions                  660
weekly_icu_admissions_per_million      660
weekly_hosp_admissions                 592
weekly_hosp_admissions_per_million     592
new_tests                               45
total_tests                             45
total_tests_per_thousand                45
new_tests_per_thousand                  45
positive_ra

The `handwashing_facilities` column appears to contain only nulls, so we will drop it.

In [38]:
us_data.drop(columns=['handwashing_facilities'], inplace=True)
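Rather than reading the null counts by eye, columns that are entirely null can be listed programmatically. The sketch below uses a small toy frame (illustrative values, not the real data) and assumes the same column names as the notebook:

```python
import numpy as np
import pandas as pd

# Toy frame; values are illustrative only.
toy = pd.DataFrame(
    {
        "new_cases": [1.0, np.nan, 3.0],
        "handwashing_facilities": [np.nan, np.nan, np.nan],
        "weekly_icu_admissions": [np.nan, np.nan, np.nan],
    }
)

# isna().all() is True for a column only when every row is null.
all_null_cols = toy.columns[toy.isna().all()].tolist()
print(all_null_cols)
```

Applied to `us_data`, the same check would presumably also flag `weekly_icu_admissions` and `weekly_icu_admissions_per_million`, which show 660 nulls out of 660 rows in the output above.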

In [39]:
us_data[pd.isna(us_data['reproduction_rate'])]

Unnamed: 0,index,date,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,total_deaths_per_million,new_deaths_per_million,...,median_age,aged_65_older,aged_70_older,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,hospital_beds_per_thousand,life_expectancy,human_development_index
0,125180,2020-01-22,1.0,,,,0.003,,,,...,38.3,15.413,9.732,151.089,10.79,19.1,24.6,2.77,78.86,0.926
1,125181,2020-01-23,1.0,0.0,,,0.003,0.0,,,...,38.3,15.413,9.732,151.089,10.79,19.1,24.6,2.77,78.86,0.926
2,125182,2020-01-24,2.0,1.0,,,0.006,0.003,,,...,38.3,15.413,9.732,151.089,10.79,19.1,24.6,2.77,78.86,0.926
3,125183,2020-01-25,2.0,0.0,,,0.006,0.0,,,...,38.3,15.413,9.732,151.089,10.79,19.1,24.6,2.77,78.86,0.926
4,125184,2020-01-26,5.0,3.0,,,0.015,0.009,,,...,38.3,15.413,9.732,151.089,10.79,19.1,24.6,2.77,78.86,0.926
5,125185,2020-01-27,5.0,0.0,,,0.015,0.0,,,...,38.3,15.413,9.732,151.089,10.79,19.1,24.6,2.77,78.86,0.926
6,125186,2020-01-28,5.0,0.0,,,0.015,0.0,,,...,38.3,15.413,9.732,151.089,10.79,19.1,24.6,2.77,78.86,0.926
7,125187,2020-01-29,6.0,1.0,,,0.018,0.003,,,...,38.3,15.413,9.732,151.089,10.79,19.1,24.6,2.77,78.86,0.926
8,125188,2020-01-30,6.0,0.0,,,0.018,0.0,,,...,38.3,15.413,9.732,151.089,10.79,19.1,24.6,2.77,78.86,0.926
9,125189,2020-01-31,8.0,2.0,,,0.024,0.006,,,...,38.3,15.413,9.732,151.089,10.79,19.1,24.6,2.77,78.86,0.926


We can replace most of the null values with 0, with the exception of the `reproduction_rate` column. In the other columns, nulls appear where no data was available for that metric, which in this case implies a zero. For `reproduction_rate` the situation is more complicated: a null indicates that there was not enough data to calculate the rate, so we will leave these nulls in our data.

In [45]:
cols = list(us_data.columns)
cols.remove('reproduction_rate')
us_data[cols] = us_data[cols].fillna(0)
us_data.isna().sum()

index                                   0
date                                    0
total_cases                             0
new_cases                               0
total_deaths                            0
new_deaths                              0
total_cases_per_million                 0
new_cases_per_million                   0
total_deaths_per_million                0
new_deaths_per_million                  0
reproduction_rate                      47
icu_patients                            0
icu_patients_per_million                0
hosp_patients                           0
hosp_patients_per_million               0
weekly_icu_admissions                   0
weekly_icu_admissions_per_million       0
weekly_hosp_admissions                  0
weekly_hosp_admissions_per_million      0
new_tests                               0
total_tests                             0
total_tests_per_thousand                0
new_tests_per_thousand                  0
positive_rate                     

`reproduction_rate` is now the only column that contains nulls.
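Because the printed null counts are long, it can help to confirm this directly. The sketch below repeats the fill on a toy frame (illustrative values only) and then lists the columns that still contain any null:

```python
import numpy as np
import pandas as pd

# Toy frame; values are illustrative only.
toy = pd.DataFrame(
    {
        "new_cases": [np.nan, 0.0, 1.0],
        "new_deaths": [np.nan, np.nan, 0.0],
        "reproduction_rate": [np.nan, np.nan, 1.2],
    }
)

# Same approach as the cell above: fill every column except reproduction_rate.
cols = list(toy.columns)
cols.remove("reproduction_rate")
toy[cols] = toy[cols].fillna(0)

# Columns that still contain at least one null after the fill.
cols_with_nulls = toy.columns[toy.isna().any()].tolist()
print(cols_with_nulls)
```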

## Feature Engineering

Since we are interested in how trends changed as various intervention measures were implemented, we will add indicator columns that record which measures were in place for each observation. Specifically, we will look at when social distancing measures were implemented, stay-at-home orders were issued, mask mandates were announced, and vaccinations became available.

- According to the [Kaiser Family Foundation](https://www.kff.org/policy-watch/stay-at-home-orders-to-fight-covid19/), the first stay-at-home order was announced in King County in Washington state on March 4, 2020.
- According to [NPR](https://www.npr.org/2020/03/16/816658125/white-house-announces-new-social-distancing-guidelines-around-coronavirus), the White House announced social distancing guidelines on March 16, 2020.
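One way to turn the two dates above into indicator columns is sketched below. It uses a toy frame in place of `us_data`, the column names `stay_at_home` and `social_distancing` are hypothetical, and it assumes a measure counts as "in place" nationwide from its first announcement onward, which is a simplification:

```python
import pandas as pd

# Toy frame with 'date' stored as strings, as in the CSV.
toy = pd.DataFrame(
    {"date": ["2020-03-01", "2020-03-04", "2020-03-10", "2020-03-16", "2020-03-20"]}
)
toy["date"] = pd.to_datetime(toy["date"])

# Dates from the two sources cited above; column names are hypothetical.
intervention_dates = {
    "stay_at_home": pd.Timestamp("2020-03-04"),
    "social_distancing": pd.Timestamp("2020-03-16"),
}

# 1 on and after the announcement date, 0 before it.
for name, start in intervention_dates.items():
    toy[name] = (toy["date"] >= start).astype(int)

print(toy)
```

Converting `date` with `pd.to_datetime` first makes the comparison a true date comparison rather than a string comparison. Indicators for mask mandates and vaccine availability could be added to the dictionary in the same way once their dates are established.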

# Visualization