# Effects of Vaccination on Hospital Impacts and Death Rates of COVID-19 by State

Coronavirus disease 2019 (COVID-19) is an infectious respiratory disease caused by the SARS-CoV-2 virus. Because the virus is highly infectious, it sparked a global pandemic and led to mandatory quarantines around the world. By the end of 2020, vaccines had been approved for administration, and over time they have been shown to reduce the chance of infection and the severity of symptoms. However, COVID-19 remains an issue, and sporadic spikes in cases have continued since. As of July 2022, more than two years into the pandemic, over 6 million confirmed deaths and over 569 million cases had been recorded worldwide. The United States, the focus of this project, accounts for over 1 million of those deaths.

This project explores COVID-19 data collected from the CDC and healthdata.gov to see whether a model trained on past data can capture the seasonality of infections and accurately forecast changes in hospitalizations and deaths.


## 1. Imports

This notebook imports the raw data and cleans it in preparation for data analysis and machine learning.

The three datasets are as follows:
1. COVID-19 Reported Patient Impact and Hospital Capacity by State Timeseries, collected from healthdata.gov
<br>https://healthdata.gov/Hospital/COVID-19-Reported-Patient-Impact-and-Hospital-Capa/g62h-syeh
2. Vaccine Distribution and Administration by State, collected from the CDC
<br>https://data.cdc.gov/Vaccinations/COVID-19-Vaccinations-in-the-United-States-Jurisdi/unsk-b7fc
3. COVID-19 Cases and Deaths Over Time, collected from the CDC
<br>https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36

We import the three datasets directly from their source sites using pandas and its `read_csv()` function. **Every time this notebook is run, it will redownload the raw data, including whatever data is new since the last download.**

In [1]:
import datetime
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
covid_impact_raw = pd.read_csv('https://healthdata.gov/api/views/g62h-syeh/rows.csv?accessType=DOWNLOAD')
covid_vaccine_raw = pd.read_csv('https://data.cdc.gov/api/views/unsk-b7fc/rows.csv?accessType=DOWNLOAD')
covid_case_deaths_raw = pd.read_csv('https://data.cdc.gov/api/views/9mfq-cb36/rows.csv?accessType=DOWNLOAD')
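Because every run redownloads the raw data, the row counts shown later in this notebook will drift over time. A minimal sketch of one way to pin a snapshot is shown here, assuming a local cache file is acceptable. The helper name `load_with_cache` and the `../raw_data/` folder are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_with_cache(url, cache_path, refresh=False):
    """Read a CSV from cache_path if it exists; otherwise read url and cache it.

    Hypothetical helper: pass refresh=True to force a fresh download.
    """
    cache_path = Path(cache_path)
    if cache_path.exists() and not refresh:
        return pd.read_csv(cache_path)
    df = pd.read_csv(url)
    # Create the cache folder on first use, then store the snapshot.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(cache_path, index=False)
    return df


# Possible usage (the cache location is an assumption):
# covid_impact_raw = load_with_cache(
#     'https://healthdata.gov/api/views/g62h-syeh/rows.csv?accessType=DOWNLOAD',
#     '../raw_data/covid_impact_raw.csv')
```

With a snapshot on disk, the printed shapes and null counts stay reproducible until `refresh=True` is passed.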

## 2. Data Exploration

Before any analysis can be conducted, we explore each imported dataset and handle its unknown or null values. We also reduce each dataset to its relevant columns and merge the three into one final table.

### 2.1 Hospital and Patient Impact Dataset

In [4]:
covid_impact_raw.head()

Unnamed: 0,state,date,critical_staffing_shortage_today_yes,critical_staffing_shortage_today_no,critical_staffing_shortage_today_not_reported,critical_staffing_shortage_anticipated_within_week_yes,critical_staffing_shortage_anticipated_within_week_no,critical_staffing_shortage_anticipated_within_week_not_reported,hospital_onset_covid,hospital_onset_covid_coverage,...,previous_day_admission_pediatric_covid_confirmed_5_11,previous_day_admission_pediatric_covid_confirmed_5_11_coverage,previous_day_admission_pediatric_covid_confirmed_unknown,previous_day_admission_pediatric_covid_confirmed_unknown_coverage,staffed_icu_pediatric_patients_confirmed_covid,staffed_icu_pediatric_patients_confirmed_covid_coverage,staffed_pediatric_icu_bed_occupancy,staffed_pediatric_icu_bed_occupancy_coverage,total_staffed_pediatric_icu_beds,total_staffed_pediatric_icu_beds_coverage
0,VI,2020/10/16,1,1,0,2,0,0,0.0,2,...,,0,,0,,0,,0,,0
1,VT,2020/10/15,1,15,1,1,15,1,0.0,16,...,,0,,0,0.0,1,18.0,1,33.0,1
2,RI,2020/10/12,2,12,1,2,12,1,3.0,14,...,,0,,0,0.0,3,0.0,3,0.0,3
3,VT,2020/10/11,1,15,1,1,15,1,0.0,16,...,,0,,0,0.0,1,16.0,1,33.0,1
4,VT,2020/10/08,1,15,1,1,15,1,0.0,16,...,,0,,0,0.0,1,17.0,1,33.0,1


In [5]:
covid_impact_raw.shape

(47693, 135)

In [6]:
impact_cols = list(covid_impact_raw.columns)
impact_cols

['state',
 'date',
 'critical_staffing_shortage_today_yes',
 'critical_staffing_shortage_today_no',
 'critical_staffing_shortage_today_not_reported',
 'critical_staffing_shortage_anticipated_within_week_yes',
 'critical_staffing_shortage_anticipated_within_week_no',
 'critical_staffing_shortage_anticipated_within_week_not_reported',
 'hospital_onset_covid',
 'hospital_onset_covid_coverage',
 'inpatient_beds',
 'inpatient_beds_coverage',
 'inpatient_beds_used',
 'inpatient_beds_used_coverage',
 'inpatient_beds_used_covid',
 'inpatient_beds_used_covid_coverage',
 'previous_day_admission_adult_covid_confirmed',
 'previous_day_admission_adult_covid_confirmed_coverage',
 'previous_day_admission_adult_covid_suspected',
 'previous_day_admission_adult_covid_suspected_coverage',
 'previous_day_admission_pediatric_covid_confirmed',
 'previous_day_admission_pediatric_covid_confirmed_coverage',
 'previous_day_admission_pediatric_covid_suspected',
 'previous_day_admission_pediatric_covid_suspected_

In [7]:
covid_impact_raw.isnull().sum().sort_values(ascending=False).head(10)

geocoded_state                                                      47693
previous_day_admission_pediatric_covid_confirmed_12_17              36779
previous_day_admission_pediatric_covid_confirmed_5_11               36770
previous_day_admission_pediatric_covid_confirmed_0_4                36408
previous_day_admission_pediatric_covid_confirmed_unknown            36296
staffed_icu_pediatric_patients_confirmed_covid                      30149
on_hand_supply_therapeutic_c_bamlanivimab_etesevimab_courses        20498
previous_week_therapeutic_c_bamlanivimab_etesevimab_courses_used    20481
on_hand_supply_therapeutic_b_bamlanivimab_courses                   16151
previous_week_therapeutic_b_bamlanivimab_courses_used               16116
dtype: int64

In [8]:
covid_impact_raw['state'].value_counts()

TX    949
NC    949
HI    949
MT    949
IN    949
MN    949
AL    949
NV    931
KS    918
IL    901
MS    900
WV    899
MO    896
OR    895
CA    894
LA    894
PR    894
WA    891
KY    889
ME    889
OH    889
MI    889
WI    889
PA    889
NJ    889
ND    889
WY    889
IA    889
NE    889
GA    889
MD    889
OK    889
AZ    888
SC    888
VA    888
RI    887
AR    884
FL    878
ID    877
NY    876
TN    876
NM    876
CO    876
VT    873
CT    872
UT    870
AK    867
SD    865
MA    863
NH    863
DE    863
DC    862
VI    851
AS    348
Name: state, dtype: int64

The COVID Hospital Impact dataset contains over 47,000 individual records and 135 features.

The data has been collected daily since 01-01-2020, with additional columns added over time, as discussed on the healthdata.gov page for the dataset. Some states have more records than others, likely because data was not collected and reported as regularly everywhere.
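The difference in record counts between states can be quantified rather than eyeballed. The sketch below, with the hypothetical helper name `reporting_gaps`, counts the calendar days that have no record between each state's first and last report. It assumes one record per state per day is the expected cadence.

```python
import pandas as pd


def reporting_gaps(df, state_col="state", date_col="date"):
    """Count calendar days with no record between each state's first and last report."""
    dates = pd.to_datetime(df[date_col])
    grouped = dates.groupby(df[state_col])
    # Days the state's reporting window spans, counting both endpoints.
    span_days = (grouped.max() - grouped.min()).dt.days + 1
    # Distinct days on which the state actually has a record.
    reported_days = grouped.nunique()
    return (span_days - reported_days).sort_values(ascending=False)


# Possible usage:
# reporting_gaps(covid_impact_raw).head(10)
```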

The potential columns of relevance are:
'state',
'date',
'critical_staffing_shortage_today_yes',
'critical_staffing_shortage_today_no',
'critical_staffing_shortage_today_not_reported',
'critical_staffing_shortage_anticipated_within_week_yes',
'critical_staffing_shortage_anticipated_within_week_no',
'inpatient_beds_used_covid',
'inpatient_beds_used_covid_coverage',
'previous_day_admission_adult_covid_confirmed',
'previous_day_admission_adult_covid_confirmed_coverage',
'previous_day_admission_adult_covid_suspected',
'previous_day_admission_adult_covid_suspected_coverage',
'inpatient_bed_covid_utilization',
'adult_icu_bed_covid_utilization'
 
The column descriptions can be found at the following site:
https://healthdata.gov/Hospital/COVID-19-Reported-Patient-Impact-and-Hospital-Capa/g62h-syeh

In [9]:
cols = ['state', 'date', 'critical_staffing_shortage_today_yes', 'critical_staffing_shortage_today_no', 
        'critical_staffing_shortage_today_not_reported', 'critical_staffing_shortage_anticipated_within_week_yes', 
        'critical_staffing_shortage_anticipated_within_week_no', 'inpatient_beds_used_covid', 
        'inpatient_beds_used_covid_coverage', 'previous_day_admission_adult_covid_confirmed', 
        'previous_day_admission_adult_covid_confirmed_coverage', 'previous_day_admission_adult_covid_suspected', 
        'previous_day_admission_adult_covid_suspected_coverage', 'inpatient_bed_covid_utilization',
        'adult_icu_bed_covid_utilization']
# .copy() avoids chained-assignment warnings when columns are modified later
filt_covid_impact = covid_impact_raw[cols].copy()

In [10]:
filt_covid_impact.head()

Unnamed: 0,state,date,critical_staffing_shortage_today_yes,critical_staffing_shortage_today_no,critical_staffing_shortage_today_not_reported,critical_staffing_shortage_anticipated_within_week_yes,critical_staffing_shortage_anticipated_within_week_no,inpatient_beds_used_covid,inpatient_beds_used_covid_coverage,previous_day_admission_adult_covid_confirmed,previous_day_admission_adult_covid_confirmed_coverage,previous_day_admission_adult_covid_suspected,previous_day_admission_adult_covid_suspected_coverage,inpatient_bed_covid_utilization,adult_icu_bed_covid_utilization
0,VI,2020/10/16,1,1,0,2,0,4.0,2,0.0,2,0.0,2,0.021277,0.05
1,VT,2020/10/15,1,15,1,1,15,6.0,16,0.0,17,10.0,16,0.004781,0.019608
2,RI,2020/10/12,2,12,1,2,12,123.0,14,3.0,15,2.0,14,0.057774,0.082278
3,VT,2020/10/11,1,15,1,1,15,0.0,16,0.0,17,0.0,16,0.0,0.0
4,VT,2020/10/08,1,15,1,1,15,1.0,16,0.0,17,4.0,16,0.000797,0.0


In [11]:
filt_covid_impact.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47693 entries, 0 to 47692
Data columns (total 15 columns):
 #   Column                                                  Non-Null Count  Dtype  
---  ------                                                  --------------  -----  
 0   state                                                   47693 non-null  object 
 1   date                                                    47693 non-null  object 
 2   critical_staffing_shortage_today_yes                    47693 non-null  int64  
 3   critical_staffing_shortage_today_no                     47693 non-null  int64  
 4   critical_staffing_shortage_today_not_reported           47693 non-null  int64  
 5   critical_staffing_shortage_anticipated_within_week_yes  47693 non-null  int64  
 6   critical_staffing_shortage_anticipated_within_week_no   47693 non-null  int64  
 7   inpatient_beds_used_covid                               47609 non-null  float64
 8   inpatient_beds_used_covid_coverage  

In [12]:
# converting the date column to a datetime dtype

filt_covid_impact['date'] = pd.to_datetime(filt_covid_impact['date'])

In [13]:
filt_covid_impact.isnull().sum()

state                                                        0
date                                                         0
critical_staffing_shortage_today_yes                         0
critical_staffing_shortage_today_no                          0
critical_staffing_shortage_today_not_reported                0
critical_staffing_shortage_anticipated_within_week_yes       0
critical_staffing_shortage_anticipated_within_week_no        0
inpatient_beds_used_covid                                   84
inpatient_beds_used_covid_coverage                           0
previous_day_admission_adult_covid_confirmed              6781
previous_day_admission_adult_covid_confirmed_coverage        0
previous_day_admission_adult_covid_suspected              6932
previous_day_admission_adult_covid_suspected_coverage        0
inpatient_bed_covid_utilization                            270
adult_icu_bed_covid_utilization                           7491
dtype: int64

In [14]:
filt_covid_impact[filt_covid_impact['inpatient_beds_used_covid'].isnull()]['date'].value_counts().sort_index().head()

2020-01-01    2
2020-01-02    2
2020-01-03    2
2020-01-04    3
2020-01-05    3
Name: date, dtype: int64

In [15]:
filt_covid_impact = filt_covid_impact[filt_covid_impact['date']>datetime.datetime(2020,1,20)]

In [16]:
filt_covid_impact.shape

(47533, 15)

In [17]:
filt_covid_impact[filt_covid_impact['adult_icu_bed_covid_utilization'].isnull()]

Unnamed: 0,state,date,critical_staffing_shortage_today_yes,critical_staffing_shortage_today_no,critical_staffing_shortage_today_not_reported,critical_staffing_shortage_anticipated_within_week_yes,critical_staffing_shortage_anticipated_within_week_no,inpatient_beds_used_covid,inpatient_beds_used_covid_coverage,previous_day_admission_adult_covid_confirmed,previous_day_admission_adult_covid_confirmed_coverage,previous_day_admission_adult_covid_suspected,previous_day_admission_adult_covid_suspected_coverage,inpatient_bed_covid_utilization,adult_icu_bed_covid_utilization
37,NE,2020-08-06,4,3,90,4,3,112.0,71,0.0,3,0.0,2,0.027384,
44,PR,2020-07-25,1,1,55,1,1,571.0,56,0.0,2,0.0,3,0.071364,
45,PR,2020-07-24,1,0,56,1,0,453.0,56,,0,0.0,1,0.054976,
47,ND,2020-07-22,0,0,52,0,0,51.0,4,,0,,0,0.062044,
48,PR,2020-07-22,1,1,55,1,1,419.0,55,0.0,2,0.0,2,0.051856,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21576,ND,2020-05-01,0,0,56,0,0,60.0,26,,0,,0,0.029670,
21579,OH,2020-03-28,0,0,12,0,0,28.0,12,,0,,0,0.023490,
21580,HI,2020-03-02,0,0,4,0,0,0.0,4,,0,,0,0.000000,
21582,NJ,2020-05-18,0,0,74,0,0,3153.0,74,,0,,0,0.197693,


We see a small number of null values in the 'inpatient_beds_used_covid' column, which fall on dates in early 2020.

The United States did not see its first confirmed COVID-19 case until 01-20-2020, so I remove all rows dated on or before 01-20-2020, which removes 160 rows.

The 'previous_day_admission_adult_covid_confirmed', 'previous_day_admission_adult_covid_suspected', and 'adult_icu_bed_covid_utilization' columns have a large number of null values, which stem from early to mid-2020. I choose not to remove these rows because they still carry 'inpatient_beds_used_covid' values that could be essential for training. The nulls can be imputed during modeling, or left in place for models that handle missing values natively, such as XGBClassifier.
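As one illustration of the imputation mentioned above, the sketch below fills nulls with each state's own median and falls back to the overall median when a state has no observed values. The helper name `impute_by_state_median` is hypothetical, and the median is an assumed choice. In a real modeling pipeline the statistics should be fit on the training split only to avoid leakage.

```python
import numpy as np
import pandas as pd


def impute_by_state_median(df, columns, state_col="state"):
    """Return a copy of df with nulls in `columns` filled by per-state medians."""
    out = df.copy()
    for col in columns:
        # Median of the observed (non-null) values within each state.
        state_median = out.groupby(state_col)[col].transform("median")
        # Fall back to the overall median for states with no observed values.
        out[col] = out[col].fillna(state_median).fillna(out[col].median())
    return out


# Possible usage:
# null_cols = ['previous_day_admission_adult_covid_confirmed',
#              'previous_day_admission_adult_covid_suspected',
#              'adult_icu_bed_covid_utilization']
# imputed = impute_by_state_median(filt_covid_impact, null_cols)
```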

### 2.2 Vaccine Distribution and Administration by State Dataset

In [18]:
covid_vaccine_raw.head()

Unnamed: 0,Date,MMWR_week,Location,Distributed,Distributed_Janssen,Distributed_Moderna,Distributed_Pfizer,Distributed_Unk_Manuf,Dist_Per_100K,Distributed_Per_100k_5Plus,...,Additional_Doses_Unk_Manuf,Second_Booster,Second_Booster_50Plus,Second_Booster_50Plus_Vax_Pct,Second_Booster_65Plus,Second_Booster_65Plus_Vax_Pct,Second_Booster_Janssen,Second_Booster_Moderna,Second_Booster_Pfizer,Second_Booster_Unk_Manuf
0,08/03/2022,31,AK,1721465,85800,667420,967445,0,235319,252984.0,...,66.0,,38932.0,32.6,23911.0,39.4,54.0,22049.0,19472.0,14.0
1,08/03/2022,31,LA,8927950,327400,3668180,4927870,0,192049,205367.0,...,110.0,,172409.0,23.6,122667.0,28.4,154.0,82271.0,99685.0,9.0
2,08/03/2022,31,PW,46890,3800,30000,13090,0,217769,230158.0,...,0.0,,1099.0,22.9,374.0,23.8,0.0,1121.0,20.0,0.0
3,08/03/2022,31,FL,52029985,2425100,19355320,30235465,0,242251,255827.0,...,9607.0,,1260797.0,29.0,920552.0,34.6,1932.0,661693.0,657146.0,3661.0
4,08/03/2022,31,GU,340160,24100,94480,221580,0,201889,222675.0,...,6.0,,10503.0,31.0,5468.0,40.9,5.0,4716.0,6392.0,0.0


In [19]:
covid_vaccine_raw.shape

(35928, 93)

In [20]:
covid_vaccine_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35928 entries, 0 to 35927
Data columns (total 93 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Date                                    35928 non-null  object 
 1   MMWR_week                               35928 non-null  int64  
 2   Location                                35928 non-null  object 
 3   Distributed                             35928 non-null  int64  
 4   Distributed_Janssen                     35928 non-null  int64  
 5   Distributed_Moderna                     35928 non-null  int64  
 6   Distributed_Pfizer                      35928 non-null  int64  
 7   Distributed_Unk_Manuf                   35928 non-null  int64  
 8   Dist_Per_100K                           35928 non-null  int64  
 9   Distributed_Per_100k_5Plus              35480 non-null  float64
 10  Distributed_Per_100k_12Plus             35928 non-null  in

In [21]:
covid_vaccine_raw.columns

Index(['Date', 'MMWR_week', 'Location', 'Distributed', 'Distributed_Janssen',
       'Distributed_Moderna', 'Distributed_Pfizer', 'Distributed_Unk_Manuf',
       'Dist_Per_100K', 'Distributed_Per_100k_5Plus',
       'Distributed_Per_100k_12Plus', 'Distributed_Per_100k_18Plus',
       'Distributed_Per_100k_65Plus', 'Administered', 'Administered_5Plus',
       'Administered_12Plus', 'Administered_18Plus', 'Administered_65Plus',
       'Administered_Janssen', 'Administered_Moderna', 'Administered_Pfizer',
       'Administered_Unk_Manuf', 'Admin_Per_100K', 'Admin_Per_100k_5Plus',
       'Admin_Per_100k_12Plus', 'Admin_Per_100k_18Plus',
       'Admin_Per_100k_65Plus', 'Recip_Administered',
       'Administered_Dose1_Recip', 'Administered_Dose1_Pop_Pct',
       'Administered_Dose1_Recip_5Plus',
       'Administered_Dose1_Recip_5PlusPop_Pct',
       'Administered_Dose1_Recip_12Plus',
       'Administered_Dose1_Recip_12PlusPop_Pct',
       'Administered_Dose1_Recip_18Plus',
       'Administere

The COVID Vaccination in the U.S. by State dataset contains over 35,000 individual records and 93 features.

The data was collected daily starting 12-13-2020, with reporting switching to weekly after 06-15-2022.

The potential columns of relevance are:
'Date', 'Location', 'Distributed', 'Dist_Per_100K', 'Administered', 'Administered_Janssen',
'Administered_Moderna', 'Administered_Pfizer', 'Admin_Per_100K', 'Recip_Administered',
'Series_Complete_Yes', 'Series_Complete_Pop_Pct', 'Additional_Doses', 'Second_Booster'
 
The column descriptions can be found at the following site:
https://data.cdc.gov/Vaccinations/COVID-19-Vaccinations-in-the-United-States-Jurisdi/unsk-b7fc

In [22]:
cols = ['Date', 'Location', 'Distributed', 'Dist_Per_100K', 'Administered', 'Administered_Janssen', 
        'Administered_Moderna', 'Administered_Pfizer', 'Admin_Per_100K', 'Recip_Administered', 
        'Series_Complete_Yes', 'Series_Complete_Pop_Pct', 'Additional_Doses', 'Second_Booster']
filt_covid_vaccine = covid_vaccine_raw[cols]

In [23]:
filt_covid_vaccine.isnull().sum()

Date                           0
Location                       0
Distributed                    0
Dist_Per_100K                  0
Administered                   0
Administered_Janssen           0
Administered_Moderna           0
Administered_Pfizer            0
Admin_Per_100K                 0
Recip_Administered             0
Series_Complete_Yes            0
Series_Complete_Pop_Pct        0
Additional_Doses           16348
Second_Booster             35865
dtype: int64

In [24]:
filt_covid_vaccine = filt_covid_vaccine.fillna(0)

In [25]:
filt_covid_vaccine.isnull().sum()

Date                       0
Location                   0
Distributed                0
Dist_Per_100K              0
Administered               0
Administered_Janssen       0
Administered_Moderna       0
Administered_Pfizer        0
Admin_Per_100K             0
Recip_Administered         0
Series_Complete_Yes        0
Series_Complete_Pop_Pct    0
Additional_Doses           0
Second_Booster             0
dtype: int64

The large number of null values in the 'Additional_Doses' and 'Second_Booster' columns can be attributed to additional doses and second boosters not yet being administered early in the vaccination rollout. These null values are filled with 0.
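Calling `fillna(0)` on the whole frame works here because only the two booster columns contain nulls. A slightly more defensive sketch, with the hypothetical helper name `fill_booster_nulls`, restricts the fill to those columns so that a null appearing in any other column after a future download is not silently turned into 0.

```python
import numpy as np
import pandas as pd

# Columns where a null is taken to mean "none administered yet".
booster_cols = ["Additional_Doses", "Second_Booster"]


def fill_booster_nulls(df, columns=booster_cols):
    """Fill nulls with 0 only in the listed columns; leave all others untouched."""
    return df.fillna({col: 0 for col in columns})


# Possible usage:
# filt_covid_vaccine = fill_booster_nulls(filt_covid_vaccine)
```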

In [26]:
# The column names are readjusted for consistent naming

col_dict = {'Date':'date', 'Location':'state', 'Distributed':'distributed', 'Dist_Per_100K':'dist_per_100k', 
            'Administered':'admin', 'Administered_Janssen':'administered_j', 'Administered_Moderna':'administered_m',
            'Administered_Pfizer':'administered_p', 'Admin_Per_100K':'admin_per_100k', 'Recip_Administered': 'total_admin',
            'Series_Complete_Yes':'fully_vacc', 'Series_Complete_Pop_Pct':'fully_vacc_pop_perc', 
            'Additional_Doses':'first_booster', 'Second_Booster':'second_booster'}
filt_covid_vaccine.rename(columns = col_dict, inplace = True)

In [27]:
filt_covid_vaccine.head()

Unnamed: 0,date,state,distributed,dist_per_100k,admin,administered_j,administered_m,administered_p,admin_per_100k,total_admin,fully_vacc,fully_vacc_pop_perc,first_booster,second_booster
0,08/03/2022,AK,1721465,235319,1193878,45975,460982,685618,163200,1208382,461227,63.0,215249.0,0.0
1,08/03/2022,LA,8927950,192049,6490846,201008,2648043,3638334,139624,6468369,2517249,54.1,1042687.0,0.0
2,08/03/2022,PW,46890,217769,49270,2356,37718,9031,228822,49645,18308,85.0,12000.0,0.0
3,08/03/2022,FL,52029985,242251,39035508,1491938,14478465,22909920,181749,38776091,14592869,67.9,6159483.0,0.0
4,08/03/2022,GU,340160,201889,370542,13599,117001,239548,219921,370890,140917,83.6,71141.0,0.0


In [28]:
# converting the date column to a datetime dtype

filt_covid_vaccine['date'] = pd.to_datetime(filt_covid_vaccine['date'])
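The date strings in this dataset look like 08/03/2022, which is ambiguous between August 3 and March 8 unless the format is known. The sketch below states the month-first format explicitly, assuming the MM/DD/YYYY layout seen in the output above. The helper name `parse_us_dates` is hypothetical.

```python
import pandas as pd


def parse_us_dates(series):
    """Parse MM/DD/YYYY strings; values that do not match the format raise an error."""
    return pd.to_datetime(series, format="%m/%d/%Y")


# Possible usage:
# filt_covid_vaccine['date'] = parse_us_dates(filt_covid_vaccine['date'])
```

An explicit format also means a change in the source's date layout fails loudly instead of being parsed incorrectly.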

In [29]:
filt_covid_vaccine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35928 entries, 0 to 35927
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   date                 35928 non-null  datetime64[ns]
 1   state                35928 non-null  object        
 2   distributed          35928 non-null  int64         
 3   dist_per_100k        35928 non-null  int64         
 4   admin                35928 non-null  int64         
 5   administered_j       35928 non-null  int64         
 6   administered_m       35928 non-null  int64         
 7   administered_p       35928 non-null  int64         
 8   admin_per_100k       35928 non-null  int64         
 9   total_admin          35928 non-null  int64         
 10  fully_vacc           35928 non-null  int64         
 11  fully_vacc_pop_perc  35928 non-null  float64       
 12  first_booster        35928 non-null  float64       
 13  second_booster       35928 non-

### 2.3 Covid Cases and Deaths Over Time by State

In [30]:
covid_case_deaths_raw.head()

Unnamed: 0,submission_date,state,tot_cases,conf_cases,prob_cases,new_case,pnew_case,tot_death,conf_death,prob_death,new_death,pnew_death,created_at,consent_cases,consent_deaths
0,12/22/2021,DE,165076,151750.0,13326.0,662,38.0,2345,2133.0,212.0,2,0.0,12/24/2021 12:00:00 AM,Agree,Agree
1,11/06/2020,LA,191715,,,870,0.0,6016,5787.0,229.0,21,0.0,11/07/2020 02:45:17 PM,Not agree,Agree
2,02/01/2021,DC,37008,,,136,0.0,916,,,3,0.0,02/02/2021 02:51:51 PM,,
3,11/07/2021,DE,143685,132310.0,11375.0,296,30.0,2186,1992.0,194.0,3,0.0,11/09/2021 12:00:00 AM,Agree,Agree
4,04/21/2021,FSM,0,0.0,0.0,0,0.0,0,0.0,0.0,0,0.0,04/21/2021 12:00:00 AM,Agree,Agree


In [31]:
covid_case_deaths_raw.shape

(55620, 15)

In [32]:
covid_case_deaths_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55620 entries, 0 to 55619
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   submission_date  55620 non-null  object 
 1   state            55620 non-null  object 
 2   tot_cases        55620 non-null  int64  
 3   conf_cases       31209 non-null  float64
 4   prob_cases       31137 non-null  float64
 5   new_case         55620 non-null  int64  
 6   pnew_case        52007 non-null  float64
 7   tot_death        55620 non-null  int64  
 8   conf_death       30596 non-null  float64
 9   prob_death       30596 non-null  float64
 10  new_death        55620 non-null  int64  
 11  pnew_death       52039 non-null  float64
 12  created_at       55620 non-null  object 
 13  consent_cases    46345 non-null  object 
 14  consent_deaths   47277 non-null  object 
dtypes: float64(6), int64(4), object(5)
memory usage: 6.4+ MB


The COVID Case and Death in the U.S. by State dataset contains over 55,000 individual records and 15 features.

The data has been collected daily since 01-23-2020.

The potential columns of relevance are:
'submission_date','state','tot_cases','new_case','pnew_case','tot_death','new_death','pnew_death'

We are choosing not to use the conf_cases, prob_cases, conf_death, and prob_death columns because some states did not consent to reporting this data (consent is recorded in the consent_cases and consent_deaths columns).
 
The column descriptions can be found at the following site:
https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36

In [33]:
cols = ['submission_date','state','tot_cases','new_case','pnew_case','tot_death','new_death','pnew_death']
filt_covid_case_deaths = covid_case_deaths_raw[cols]

In [34]:
# The column names are readjusted for consistent naming

col_dict = {'submission_date':'date', 'pnew_case':'prob_new_case', 'pnew_death':'prob_new_death'}
filt_covid_case_deaths = filt_covid_case_deaths.rename(columns = col_dict)

In [35]:
filt_covid_case_deaths.head()

Unnamed: 0,date,state,tot_cases,new_case,prob_new_case,tot_death,new_death,prob_new_death
0,12/22/2021,DE,165076,662,38.0,2345,2,0.0
1,11/06/2020,LA,191715,870,0.0,6016,21,0.0
2,02/01/2021,DC,37008,136,0.0,916,3,0.0
3,11/07/2021,DE,143685,296,30.0,2186,3,0.0
4,04/21/2021,FSM,0,0,0.0,0,0,0.0


In [36]:
# converting the date column to a datetime dtype

filt_covid_case_deaths['date'] = pd.to_datetime(filt_covid_case_deaths['date'])

In [37]:
filt_covid_case_deaths.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55620 entries, 0 to 55619
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   date            55620 non-null  datetime64[ns]
 1   state           55620 non-null  object        
 2   tot_cases       55620 non-null  int64         
 3   new_case        55620 non-null  int64         
 4   prob_new_case   52007 non-null  float64       
 5   tot_death       55620 non-null  int64         
 6   new_death       55620 non-null  int64         
 7   prob_new_death  52039 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int64(4), object(1)
memory usage: 3.4+ MB


In [38]:
filt_covid_case_deaths.isnull().sum()

date                 0
state                0
tot_cases            0
new_case             0
prob_new_case     3613
tot_death            0
new_death            0
prob_new_death    3581
dtype: int64

I leave the null values in place for now. When the machine learning pipeline is built, they can be imputed with the mean.

## 3. Data Merge

Now that we have three cleaned datasets, we merge them on the date and state columns.

In [39]:
merge_covid = pd.merge(filt_covid_impact, filt_covid_vaccine, how="left", on=['date', 'state'], suffixes=("_x", "_y"))

In [40]:
merge_covid = pd.merge(merge_covid, filt_covid_case_deaths, how='left', on=['date', 'state'], suffixes=("_x", "_y"))
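A left merge keeps every row of the left table even when nothing matches on the right, so it can help to count the unmatched rows at merge time. The sketch below uses the `indicator` and `validate` options of `pd.merge` for that. The helper name `left_merge_report` is hypothetical, and `validate="many_to_one"` assumes each (date, state) pair appears at most once in the right table; if that assumption does not hold, pandas raises a `MergeError`, which is itself informative.

```python
import pandas as pd


def left_merge_report(left, right, keys=("date", "state")):
    """Left-merge on keys and return (merged frame, count of unmatched left rows)."""
    merged = pd.merge(
        left, right, how="left", on=list(keys),
        validate="many_to_one",  # raises if the right keys are duplicated
        indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
    )
    unmatched = int((merged["_merge"] == "left_only").sum())
    return merged.drop(columns="_merge"), unmatched


# Possible usage:
# merge_covid, n_unmatched = left_merge_report(filt_covid_impact, filt_covid_vaccine)
```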

In [41]:
merge_covid.isnull().sum()

state                                                         0
date                                                          0
critical_staffing_shortage_today_yes                          0
critical_staffing_shortage_today_no                           0
critical_staffing_shortage_today_not_reported                 0
critical_staffing_shortage_anticipated_within_week_yes        0
critical_staffing_shortage_anticipated_within_week_no         0
inpatient_beds_used_covid                                    54
inpatient_beds_used_covid_coverage                            0
previous_day_admission_adult_covid_confirmed               6631
previous_day_admission_adult_covid_confirmed_coverage         0
previous_day_admission_adult_covid_suspected               6782
previous_day_admission_adult_covid_suspected_coverage         0
inpatient_bed_covid_utilization                             218
adult_icu_bed_covid_utilization                            7341
distributed                             

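The new null values introduced by the merge are partly due to the vaccination dataset changing from daily to weekly reporting. Hospital impact rows dated before the vaccination data begins (12-13-2020) also have no match in the left merge.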
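If the vaccination columns are treated as cumulative totals, one option for the weekly-reporting gaps is to carry the last reported value forward within each state. This is only a sketch under that assumption, not the approach taken in this notebook, and the helper name `ffill_within_state` is hypothetical. Rows dated before a state's first vaccination report stay null, since there is nothing earlier to carry forward.

```python
import numpy as np
import pandas as pd


def ffill_within_state(df, columns, state_col="state", date_col="date"):
    """Sort by state and date, then forward-fill `columns` within each state."""
    out = df.sort_values([state_col, date_col]).copy()
    # Forward-fill per state so values never leak across state boundaries.
    out[columns] = out.groupby(state_col)[columns].ffill()
    return out


# Possible usage (column names as renamed in section 2.2):
# merge_covid = ffill_within_state(merge_covid, ['fully_vacc', 'fully_vacc_pop_perc'])
```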

## 4. Cleaned Merged DataFrame Save

In [42]:
merge_covid = merge_covid.reset_index(drop=True)
merge_covid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47533 entries, 0 to 47532
Data columns (total 33 columns):
 #   Column                                                  Non-Null Count  Dtype         
---  ------                                                  --------------  -----         
 0   state                                                   47533 non-null  object        
 1   date                                                    47533 non-null  datetime64[ns]
 2   critical_staffing_shortage_today_yes                    47533 non-null  int64         
 3   critical_staffing_shortage_today_no                     47533 non-null  int64         
 4   critical_staffing_shortage_today_not_reported           47533 non-null  int64         
 5   critical_staffing_shortage_anticipated_within_week_yes  47533 non-null  int64         
 6   critical_staffing_shortage_anticipated_within_week_no   47533 non-null  int64         
 7   inpatient_beds_used_covid                               47

In [43]:
merge_covid.to_csv('../clean_data/merge_covid.csv')

## 5. Continued

Now that the datasets are merged, cleaned, and exported to CSV, the analysis continues in the next notebook.
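When the CSV is read back in the next notebook, the date column arrives as plain strings, and a file written with the default `to_csv` settings also carries the old index as an extra unnamed column. A minimal round-trip sketch is shown below, with the hypothetical helper names `save_clean` and `load_clean`. It is an alternative to the export cell above, not a description of what the next notebook does.

```python
import pandas as pd


def save_clean(df, path):
    """Write df to CSV without the index column."""
    df.to_csv(path, index=False)


def load_clean(path, date_col="date"):
    """Read the CSV back and restore the date column to a datetime dtype."""
    return pd.read_csv(path, parse_dates=[date_col])


# Possible usage:
# save_clean(merge_covid, '../clean_data/merge_covid.csv')
# merge_covid = load_clean('../clean_data/merge_covid.csv')
```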