# Effects of Vaccination on Hospital Impacts and Death Rates of COVID-19 by State

Coronavirus disease (COVID-19) is an infectious respiratory disease caused by the SARS-CoV-2 virus. Being highly infectious, the virus sparked a global pandemic and led to mandatory quarantines around the globe. By the end of 2020, vaccines were approved for administration, and over time they have proven to decrease the chance of infection and the severity of symptoms. However, COVID-19 remains an issue to this day, and we have seen sporadic spikes in infections since. As of July 2022, after more than two years of the pandemic, there were over 6 million confirmed deaths and over 569 million recorded cases worldwide. The United States, the focus of this project, accounts for over 1 million of those deaths.

This project explores COVID-19 data collected from the CDC and healthdata.gov to see whether a model trained on past data can predict the seasonality of infections and accurately forecast changes in hospitalizations and deaths.


## 1. Imports

This notebook imports the raw data and cleans it in preparation for data analysis and machine learning.

The three datasets are as follows:
1. COVID-19 Reported Patient Impact and Hospital Capacity by State Timeseries, collected from healthdata.gov:<br>
   https://healthdata.gov/Hospital/COVID-19-Reported-Patient-Impact-and-Hospital-Capa/g62h-syeh
2. Vaccine Distribution and Administration by State, collected from the CDC:<br>
   https://data.cdc.gov/Vaccinations/COVID-19-Vaccinations-in-the-United-States-Jurisdi/unsk-b7fc
3. COVID-19 Cases and Deaths Over Time, collected from the CDC:<br>
   https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36

We import three datasets into our notebook using the Pandas library and its .read_csv() function.

In [1]:
import datetime
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Build the project root path by stripping the 'Notebook' folder name from the current working directory
cwd = os.getcwd()
cwd = cwd.replace('Notebook', '')

In [121]:
covid_impact_raw = pd.read_csv(cwd+'Raw_Data\\COVID-19_Reported_Patient_Impact_and_Hospital_Capacity_by_State_Timeseries.csv')
covid_vaccine_raw = pd.read_csv(cwd+'Raw_Data\\COVID-19_Vaccinations_in_the_United_States_Jurisdiction.csv')
covid_case_deaths_raw = pd.read_csv(cwd+'Raw_Data\\United_States_COVID-19_Cases_and_Deaths_by_State_over_Time.csv')
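The load cell above builds its paths with Windows-style backslashes, so it is likely to resolve only on Windows. A minimal sketch of a portable alternative follows, assuming the notebook lives in a `Notebook` folder that sits next to `Raw_Data`. The helper name `raw_data_path` and the example location are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def raw_data_path(notebook_dir, filename):
    """Return <project root>/Raw_Data/<filename>.

    Assumes notebook_dir is the <project root>/Notebook folder.
    """
    return Path(notebook_dir).parent / "Raw_Data" / filename


# Example with a made-up project location
example = raw_data_path("/projects/covid/Notebook", "cases.csv")
print(example.as_posix())
```

In the notebook, the result could be passed straight to `pd.read_csv`, with `os.getcwd()` as the first argument.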

## 2. Data Exploration

Before any analysis can be conducted, we explore each imported dataset and clean it of unnecessary unknown or null values. We also reduce each dataset to its relevant columns and merge the three sets into one final table.

### 2.1 Hospital and Patient Impact Dataset

In [4]:
covid_impact_raw.head()

Unnamed: 0,state,date,critical_staffing_shortage_today_yes,critical_staffing_shortage_today_no,critical_staffing_shortage_today_not_reported,critical_staffing_shortage_anticipated_within_week_yes,critical_staffing_shortage_anticipated_within_week_no,critical_staffing_shortage_anticipated_within_week_not_reported,hospital_onset_covid,hospital_onset_covid_coverage,...,previous_day_admission_pediatric_covid_confirmed_5_11,previous_day_admission_pediatric_covid_confirmed_5_11_coverage,previous_day_admission_pediatric_covid_confirmed_unknown,previous_day_admission_pediatric_covid_confirmed_unknown_coverage,staffed_icu_pediatric_patients_confirmed_covid,staffed_icu_pediatric_patients_confirmed_covid_coverage,staffed_pediatric_icu_bed_occupancy,staffed_pediatric_icu_bed_occupancy_coverage,total_staffed_pediatric_icu_beds,total_staffed_pediatric_icu_beds_coverage
0,VT,2020/10/16,1,15,1,1,15,1,0.0,16,...,,0,,0,0.0,1,19.0,1,33.0,1
1,VI,2020/10/15,1,1,0,2,0,0,0.0,2,...,,0,,0,,0,,0,,0
2,VI,2020/10/14,1,1,0,2,0,0,0.0,2,...,,0,,0,,0,,0,,0
3,VI,2020/10/13,1,1,0,2,0,0,0.0,2,...,,0,,0,,0,,0,,0
4,AK,2020/10/10,2,21,0,4,19,0,1.0,23,...,,0,,0,0.0,21,45.0,21,95.0,20


In [5]:
covid_impact_raw.shape

(46937, 135)

In [6]:
impact_cols = list(covid_impact_raw.columns)
impact_cols

['state',
 'date',
 'critical_staffing_shortage_today_yes',
 'critical_staffing_shortage_today_no',
 'critical_staffing_shortage_today_not_reported',
 'critical_staffing_shortage_anticipated_within_week_yes',
 'critical_staffing_shortage_anticipated_within_week_no',
 'critical_staffing_shortage_anticipated_within_week_not_reported',
 'hospital_onset_covid',
 'hospital_onset_covid_coverage',
 'inpatient_beds',
 'inpatient_beds_coverage',
 'inpatient_beds_used',
 'inpatient_beds_used_coverage',
 'inpatient_beds_used_covid',
 'inpatient_beds_used_covid_coverage',
 'previous_day_admission_adult_covid_confirmed',
 'previous_day_admission_adult_covid_confirmed_coverage',
 'previous_day_admission_adult_covid_suspected',
 'previous_day_admission_adult_covid_suspected_coverage',
 'previous_day_admission_pediatric_covid_confirmed',
 'previous_day_admission_pediatric_covid_confirmed_coverage',
 'previous_day_admission_pediatric_covid_suspected',
 'previous_day_admission_pediatric_covid_suspected_

In [7]:
covid_impact_raw.isnull().sum().sort_values(ascending=False).head(10)

geocoded_state                                                      46937
previous_day_admission_pediatric_covid_confirmed_12_17              36751
previous_day_admission_pediatric_covid_confirmed_5_11               36746
previous_day_admission_pediatric_covid_confirmed_0_4                36404
previous_day_admission_pediatric_covid_confirmed_unknown            36296
staffed_icu_pediatric_patients_confirmed_covid                      30149
on_hand_supply_therapeutic_c_bamlanivimab_etesevimab_courses        20486
previous_week_therapeutic_c_bamlanivimab_etesevimab_courses_used    20469
on_hand_supply_therapeutic_b_bamlanivimab_courses                   16119
previous_week_therapeutic_b_bamlanivimab_courses_used               16084
dtype: int64

In [17]:
covid_impact_raw['state'].value_counts()

TX    935
AL    935
MT    935
IN    935
NC    935
MN    935
HI    935
NV    917
KS    904
IL    887
MS    886
WV    885
MO    882
OR    881
CA    880
PR    880
LA    880
WA    877
ND    875
NJ    875
MI    875
IA    875
OH    875
GA    875
NE    875
PA    875
OK    875
MD    875
WI    875
KY    875
WY    875
ME    875
SC    874
AZ    874
VA    874
RI    873
AR    870
FL    864
ID    863
TN    862
CO    862
NY    862
NM    862
VT    859
CT    858
UT    856
AK    853
SD    851
NH    849
DE    849
MA    849
DC    848
VI    837
AS    334
Name: state, dtype: int64

The COVID Hospital Impact dataset contains 46,937 individual records and 135 features.

The data has been collected daily since 01-01-2020, with additional columns added over time, as discussed on the healthdata.gov page containing the dataset. Certain states have more records than others due to a lack of regular collection and reporting of data.
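The uneven record counts per state can be quantified rather than read off `value_counts()`. The sketch below uses a small made-up frame in place of `covid_impact_raw` and assumes at most one record per state per day. The column names `span_days` and `missing_days` are introduced here for illustration only.

```python
import pandas as pd

# Toy stand-in for covid_impact_raw with only the two columns needed here
toy = pd.DataFrame({
    "state": ["TX", "TX", "TX", "AS", "AS"],
    "date": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-02", "2020-01-05"]
    ),
})

# Per state: number of records, first and last report date
coverage = toy.groupby("state")["date"].agg(["count", "min", "max"])

# Calendar days spanned by each state's reports (inclusive)
coverage["span_days"] = (coverage["max"] - coverage["min"]).dt.days + 1

# Days inside the span that have no record
coverage["missing_days"] = coverage["span_days"] - coverage["count"]
print(coverage)
```

A state with a late first report shows up through `min`, while gaps inside the reporting window show up through `missing_days`.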

The potential columns of relevance are:
'state',
'date',
'critical_staffing_shortage_today_yes',
'critical_staffing_shortage_today_no',
'critical_staffing_shortage_today_not_reported',
'critical_staffing_shortage_anticipated_within_week_yes',
'critical_staffing_shortage_anticipated_within_week_no',
'inpatient_beds_used_covid',
'inpatient_beds_used_covid_coverage',
'previous_day_admission_adult_covid_confirmed',
'previous_day_admission_adult_covid_confirmed_coverage',
'previous_day_admission_adult_covid_suspected',
'previous_day_admission_adult_covid_suspected_coverage',
'inpatient_bed_covid_utilization',
'adult_icu_bed_covid_utilization'
 
The column descriptions can be found at the following site:
https://healthdata.gov/Hospital/COVID-19-Reported-Patient-Impact-and-Hospital-Capa/g62h-syeh

In [39]:
cols = ['state', 'date', 'critical_staffing_shortage_today_yes', 'critical_staffing_shortage_today_no', 
        'critical_staffing_shortage_today_not_reported', 'critical_staffing_shortage_anticipated_within_week_yes', 
        'critical_staffing_shortage_anticipated_within_week_no', 'inpatient_beds_used_covid', 
        'inpatient_beds_used_covid_coverage', 'previous_day_admission_adult_covid_confirmed', 
        'previous_day_admission_adult_covid_confirmed_coverage', 'previous_day_admission_adult_covid_suspected', 
        'previous_day_admission_adult_covid_suspected_coverage', 'inpatient_bed_covid_utilization',
        'adult_icu_bed_covid_utilization']
# Take an explicit copy so later column assignments do not trigger SettingWithCopyWarning
filt_covid_impact = covid_impact_raw[cols].copy()

In [40]:
filt_covid_impact.head()

Unnamed: 0,state,date,critical_staffing_shortage_today_yes,critical_staffing_shortage_today_no,critical_staffing_shortage_today_not_reported,critical_staffing_shortage_anticipated_within_week_yes,critical_staffing_shortage_anticipated_within_week_no,inpatient_beds_used_covid,inpatient_beds_used_covid_coverage,previous_day_admission_adult_covid_confirmed,previous_day_admission_adult_covid_confirmed_coverage,previous_day_admission_adult_covid_suspected,previous_day_admission_adult_covid_suspected_coverage,inpatient_bed_covid_utilization,adult_icu_bed_covid_utilization
0,VT,2020/10/16,1,15,1,1,15,2.0,16,0.0,17,2.0,16,0.001594,0.0
1,VI,2020/10/15,1,1,0,2,0,4.0,2,0.0,2,0.0,2,0.021277,0.05
2,VI,2020/10/14,1,1,0,2,0,4.0,2,0.0,2,0.0,2,0.021277,0.05
3,VI,2020/10/13,1,1,0,2,0,4.0,2,0.0,2,0.0,2,0.021277,0.05
4,AK,2020/10/10,2,21,0,4,19,54.0,23,10.0,23,7.0,23,0.033666,0.087591


In [41]:
filt_covid_impact.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46937 entries, 0 to 46936
Data columns (total 15 columns):
 #   Column                                                  Non-Null Count  Dtype  
---  ------                                                  --------------  -----  
 0   state                                                   46937 non-null  object 
 1   date                                                    46937 non-null  object 
 2   critical_staffing_shortage_today_yes                    46937 non-null  int64  
 3   critical_staffing_shortage_today_no                     46937 non-null  int64  
 4   critical_staffing_shortage_today_not_reported           46937 non-null  int64  
 5   critical_staffing_shortage_anticipated_within_week_yes  46937 non-null  int64  
 6   critical_staffing_shortage_anticipated_within_week_no   46937 non-null  int64  
 7   inpatient_beds_used_covid                               46853 non-null  float64
 8   inpatient_beds_used_covid_coverage  

In [48]:
# converting the date column to a datetime dtype

filt_covid_impact['date'] = pd.to_datetime(filt_covid_impact['date'])

In [49]:
filt_covid_impact.isnull().sum()

state                                                        0
date                                                         0
critical_staffing_shortage_today_yes                         0
critical_staffing_shortage_today_no                          0
critical_staffing_shortage_today_not_reported                0
critical_staffing_shortage_anticipated_within_week_yes       0
critical_staffing_shortage_anticipated_within_week_no        0
inpatient_beds_used_covid                                   54
inpatient_beds_used_covid_coverage                           0
previous_day_admission_adult_covid_confirmed              6631
previous_day_admission_adult_covid_confirmed_coverage        0
previous_day_admission_adult_covid_suspected              6782
previous_day_admission_adult_covid_suspected_coverage        0
inpatient_bed_covid_utilization                            218
adult_icu_bed_covid_utilization                           7341
dtype: int64

In [151]:
filt_covid_impact[filt_covid_impact['inpatient_beds_used_covid'].isnull()]['date'].value_counts().sort_index().head()

2020-01-21    1
2020-01-22    1
2020-01-23    1
2020-01-24    1
2020-01-25    1
Name: date, dtype: int64

In [51]:
filt_covid_impact = filt_covid_impact[filt_covid_impact['date']>datetime.datetime(2020,1,20)]

In [53]:
filt_covid_impact.shape

(46777, 15)

In [72]:
filt_covid_impact[filt_covid_impact['adult_icu_bed_covid_utilization'].isnull()]

Unnamed: 0,state,date,critical_staffing_shortage_today_yes,critical_staffing_shortage_today_no,critical_staffing_shortage_today_not_reported,critical_staffing_shortage_anticipated_within_week_yes,critical_staffing_shortage_anticipated_within_week_no,inpatient_beds_used_covid,inpatient_beds_used_covid_coverage,previous_day_admission_adult_covid_confirmed,previous_day_admission_adult_covid_confirmed_coverage,previous_day_admission_adult_covid_suspected,previous_day_admission_adult_covid_suspected_coverage,inpatient_bed_covid_utilization,adult_icu_bed_covid_utilization
44,ND,2020-07-22,0,0,52,0,0,51.0,4,,0,,0,0.062044,
45,VI,2020-07-19,0,0,2,0,0,2.0,2,2.0,2,,0,,
46,ND,2020-07-15,0,0,52,0,0,16.0,4,,0,,0,0.025478,
48,ID,2020-07-10,0,0,45,0,0,131.0,43,,0,,0,0.032948,
49,ND,2020-07-05,0,0,52,0,0,33.0,19,,0,,0,0.024000,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21526,AR,2020-04-27,0,0,82,0,0,161.0,81,,0,,0,0.015167,
21529,NC,2020-03-24,0,0,5,0,0,22.0,5,,0,,0,0.031609,
21530,NV,2020-02-20,0,0,1,0,0,0.0,1,,0,,0,0.000000,
21532,CO,2020-05-14,0,0,97,0,0,764.0,96,,0,,0,0.117774,


We see a small number of null values in the 'inpatient_beds_used_covid' column, which the output above shows fall in early 2020.

The United States did not see its first confirmed COVID-19 case until 01-20-2020, so I remove all rows dated on or before that date, which removes 160 rows (46,937 down to 46,777).

The 'previous_day_admission_adult_covid_confirmed', 'previous_day_admission_adult_covid_suspected', and 'adult_icu_bed_covid_utilization' columns have a large number of null values, all stemming from early to mid 2020. I choose not to remove these rows because the information in 'inpatient_beds_used_covid' could be essential for training. The null values can be imputed during modeling, or left in place for models that handle missing values natively, such as XGBClassifier.
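If these nulls are imputed later, the fact that a value was never reported is lost. One possible way to keep that information is to add an indicator column before imputing. This is a sketch on a made-up frame standing in for `filt_covid_impact`, and the `_missing` suffix is a naming convention assumed here, not something the notebook uses.

```python
import numpy as np
import pandas as pd

# Toy stand-in for filt_covid_impact (values are made up)
toy = pd.DataFrame({
    "inpatient_beds_used_covid": [51.0, 2.0, 16.0],
    "adult_icu_bed_covid_utilization": [np.nan, 0.05, np.nan],
})

# Flag rows where the value was not reported, before any imputation
col = "adult_icu_bed_covid_utilization"
toy[col + "_missing"] = toy[col].isna().astype(int)
print(toy)
```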

### 2.2 Vaccine Distribution and Administration by State Dataset

In [41]:
covid_vaccine_raw.head()

Unnamed: 0,Date,MMWR_week,Location,Distributed,Distributed_Janssen,Distributed_Moderna,Distributed_Pfizer,Distributed_Unk_Manuf,Dist_Per_100K,Distributed_Per_100k_5Plus,...,Additional_Doses_Unk_Manuf,Second_Booster,Second_Booster_50Plus,Second_Booster_50Plus_Vax_Pct,Second_Booster_65Plus,Second_Booster_65Plus_Vax_Pct,Second_Booster_Janssen,Second_Booster_Moderna,Second_Booster_Pfizer,Second_Booster_Unk_Manuf
0,07/20/2022,29,MP,132230,3600,27120,101510,0,255019,274895.0,...,0.0,,931.0,9.6,389.0,16.1,1.0,183.0,909.0,0.0
1,07/20/2022,29,AR,6728980,257400,2858520,3613060,0,222976,237829.0,...,547.0,,115707.0,24.8,85837.0,30.3,75.0,64106.0,56202.0,85.0
2,07/20/2022,29,AS,123210,600,24700,97910,0,259981,285804.0,...,0.0,,1372.0,16.8,603.0,25.8,0.0,342.0,1109.0,0.0
3,07/20/2022,29,TX,66253035,2649100,23902980,39700955,0,228491,245336.0,...,416.0,,982243.0,23.1,638897.0,28.7,351.0,529515.0,528390.0,45.0
4,07/20/2022,29,WI,12628245,453300,4748500,7426445,0,216889,229942.0,...,192.0,,494607.0,35.0,354333.0,43.9,210.0,227391.0,283122.0,40.0


In [43]:
covid_vaccine_raw.shape

(35800, 93)

In [51]:
covid_vaccine_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35800 entries, 0 to 35799
Data columns (total 93 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Date                                    35800 non-null  object 
 1   MMWR_week                               35800 non-null  int64  
 2   Location                                35800 non-null  object 
 3   Distributed                             35800 non-null  int64  
 4   Distributed_Janssen                     35800 non-null  int64  
 5   Distributed_Moderna                     35800 non-null  int64  
 6   Distributed_Pfizer                      35800 non-null  int64  
 7   Distributed_Unk_Manuf                   35800 non-null  int64  
 8   Dist_Per_100K                           35800 non-null  int64  
 9   Distributed_Per_100k_5Plus              35352 non-null  float64
 10  Distributed_Per_100k_12Plus             35800 non-null  in

In [73]:
covid_vaccine_raw.columns

Index(['Date', 'MMWR_week', 'Location', 'Distributed', 'Distributed_Janssen',
       'Distributed_Moderna', 'Distributed_Pfizer', 'Distributed_Unk_Manuf',
       'Dist_Per_100K', 'Distributed_Per_100k_5Plus',
       'Distributed_Per_100k_12Plus', 'Distributed_Per_100k_18Plus',
       'Distributed_Per_100k_65Plus', 'Administered', 'Administered_5Plus',
       'Administered_12Plus', 'Administered_18Plus', 'Administered_65Plus',
       'Administered_Janssen', 'Administered_Moderna', 'Administered_Pfizer',
       'Administered_Unk_Manuf', 'Admin_Per_100K', 'Admin_Per_100k_5Plus',
       'Admin_Per_100k_12Plus', 'Admin_Per_100k_18Plus',
       'Admin_Per_100k_65Plus', 'Recip_Administered',
       'Administered_Dose1_Recip', 'Administered_Dose1_Pop_Pct',
       'Administered_Dose1_Recip_5Plus',
       'Administered_Dose1_Recip_5PlusPop_Pct',
       'Administered_Dose1_Recip_12Plus',
       'Administered_Dose1_Recip_12PlusPop_Pct',
       'Administered_Dose1_Recip_18Plus',
       'Administere

The COVID Vaccination in the U.S. by State dataset contains 35,800 individual records and 93 features.

The data was collected daily starting 12-13-2020, then switched to weekly reporting after 06-15-2022.
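Because the counts in this dataset are cumulative, one possible way to bridge the weekly reports onto a daily calendar is to carry the last reported value forward. The notebook itself does not do this. The sketch below only illustrates the idea on made-up numbers, and it assumes that a count does not change between reports, which understates the true daily totals.

```python
import pandas as pd

# Toy weekly reports of a cumulative count for one state (values are made up)
weekly = pd.DataFrame({
    "date": pd.to_datetime(["2022-06-15", "2022-06-22"]),
    "state": ["TX", "TX"],
    "admin": [100, 170],
})

# For each state, put the reports on a daily calendar and forward-fill the gaps
parts = []
for state, grp in weekly.groupby("state"):
    filled = grp.set_index("date")["admin"].resample("D").ffill()
    parts.append(filled.reset_index().assign(state=state))

daily = pd.concat(parts, ignore_index=True)
print(daily)
```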

The potential columns of relevance are:
'Date', 'Location', 'Distributed', 'Dist_Per_100K', 'Administered', 'Administered_Janssen',
'Administered_Moderna', 'Administered_Pfizer', 'Admin_Per_100K', 'Recip_Administered',
'Series_Complete_Yes', 'Series_Complete_Pop_Pct', 'Additional_Doses', 'Second_Booster'
 
The column descriptions can be found at the following site:
https://data.cdc.gov/Vaccinations/COVID-19-Vaccinations-in-the-United-States-Jurisdi/unsk-b7fc

In [74]:
cols = ['Date', 'Location', 'Distributed', 'Dist_Per_100K', 'Administered', 'Administered_Janssen', 
        'Administered_Moderna', 'Administered_Pfizer', 'Admin_Per_100K', 'Recip_Administered', 
        'Series_Complete_Yes', 'Series_Complete_Pop_Pct', 'Additional_Doses', 'Second_Booster']
filt_covid_vaccine = covid_vaccine_raw[cols]

In [76]:
filt_covid_vaccine.isnull().sum()

Date                           0
Location                       0
Distributed                    0
Dist_Per_100K                  0
Administered                   0
Administered_Janssen           0
Administered_Moderna           0
Administered_Pfizer            0
Admin_Per_100K                 0
Recip_Administered             0
Series_Complete_Yes            0
Series_Complete_Pop_Pct        0
Additional_Doses           16348
Second_Booster             35739
dtype: int64

In [100]:
filt_covid_vaccine = filt_covid_vaccine.fillna(0)

In [101]:
filt_covid_vaccine.isnull().sum()

Date                       0
Location                   0
Distributed                0
Dist_Per_100K              0
Administered               0
Administered_Janssen       0
Administered_Moderna       0
Administered_Pfizer        0
Admin_Per_100K             0
Recip_Administered         0
Series_Complete_Yes        0
Series_Complete_Pop_Pct    0
Additional_Doses           0
Second_Booster             0
dtype: int64

The large number of null values in 'Additional_Doses' and 'Second_Booster' can be attributed to the lack of additional doses and second boosters administered at the beginning of the pandemic. These null values can therefore be filled with 0.
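The fill above is applied to the whole frame, which is harmless here because only the two booster columns contain nulls. A slightly more defensive variant, sketched on a made-up frame, limits the fill to those columns so that unexpected nulls elsewhere would stay visible. The list name `booster_cols` is introduced for this example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for filt_covid_vaccine before the fill (values are made up)
toy = pd.DataFrame({
    "Administered": [10, 20, 30],
    "Additional_Doses": [np.nan, np.nan, 5.0],
    "Second_Booster": [np.nan, np.nan, np.nan],
})

# Fill only the booster columns; other columns are left untouched
booster_cols = ["Additional_Doses", "Second_Booster"]
toy[booster_cols] = toy[booster_cols].fillna(0)
print(toy)
```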

In [104]:
# The column names are readjusted for consistent naming

col_dict = {'Date':'date', 'Location':'state', 'Distributed':'distributed', 'Dist_Per_100K':'dist_per_100k', 
            'Administered':'admin', 'Administered_Janssen':'administered_j', 'Administered_Moderna':'administered_m',
            'Administered_Pfizer':'administered_p', 'Admin_Per_100K':'admin_per_100k', 'Recip_Administered': 'total_admin',
            'Series_Complete_Yes':'fully_vacc', 'Series_Complete_Pop_Pct':'fully_vacc_pop_perc', 
            'Additional_Doses':'first_booster', 'Second_Booster':'second_booster'}
filt_covid_vaccine.rename(columns = col_dict, inplace = True)

In [105]:
filt_covid_vaccine.head()

Unnamed: 0,date,state,distributed,dist_per_100k,admin,administered_j,administered_m,administered_p,admin_per_100k,total_admin,fully_vacc,fully_vacc_pop_perc,first_booster,second_booster
0,07/20/2022,MP,132230,255019,111089,1385,15331,94364,214247,111260,43461,83.8,21930.0,0.0
1,07/20/2022,AR,6728980,222976,4373349,124778,1870963,2373340,144918,4430099,1664434,55.2,684484.0,0.0
2,07/20/2022,AS,123210,259981,112281,577,25012,85484,236920,112744,42027,88.7,23866.0,0.0
3,07/20/2022,TX,66253035,228491,48669565,1582567,17741246,29339405,167850,47385980,18131575,62.5,7318075.0,0.0
4,07/20/2022,WI,12628245,216889,10744653,336736,4106404,6298409,184539,10743409,3858313,66.3,2230266.0,0.0


In [107]:
# converting the date column to a datetime dtype

filt_covid_vaccine['date'] = pd.to_datetime(filt_covid_vaccine['date'])

In [108]:
filt_covid_vaccine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35800 entries, 0 to 35799
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   date                 35800 non-null  datetime64[ns]
 1   state                35800 non-null  object        
 2   distributed          35800 non-null  int64         
 3   dist_per_100k        35800 non-null  int64         
 4   admin                35800 non-null  int64         
 5   administered_j       35800 non-null  int64         
 6   administered_m       35800 non-null  int64         
 7   administered_p       35800 non-null  int64         
 8   admin_per_100k       35800 non-null  int64         
 9   total_admin          35800 non-null  int64         
 10  fully_vacc           35800 non-null  int64         
 11  fully_vacc_pop_perc  35800 non-null  float64       
 12  first_booster        35800 non-null  float64       
 13  second_booster       35800 non-

### 2.3 Covid Cases and Deaths Over Time by State

In [122]:
covid_case_deaths_raw.head()

Unnamed: 0,submission_date,state,tot_cases,conf_cases,prob_cases,new_case,pnew_case,tot_death,conf_death,prob_death,new_death,pnew_death,created_at,consent_cases,consent_deaths
0,12/01/2021,ND,163565,135705.0,27860.0,589,220.0,1907,,,9,0.0,12/02/2021 02:35:20 PM,Agree,Not agree
1,11/07/2021,DE,143685,132310.0,11375.0,296,30.0,2186,1992.0,194.0,3,0.0,11/09/2021 12:00:00 AM,Agree,Agree
2,05/12/2022,CT,777064,696528.0,80536.0,1963,173.0,10883,8906.0,1977.0,0,0.0,05/13/2022 01:28:57 PM,Agree,Agree
3,10/04/2020,MD,127290,,,471,0.0,4092,3933.0,159.0,3,0.0,10/06/2020 12:00:00 AM,,Agree
4,02/06/2020,NE,0,,,0,,0,,,0,,03/26/2020 04:22:39 PM,Agree,Agree


In [123]:
covid_case_deaths_raw.shape

(54780, 15)

In [124]:
covid_case_deaths_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54780 entries, 0 to 54779
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   submission_date  54780 non-null  object 
 1   state            54780 non-null  object 
 2   tot_cases        54780 non-null  int64  
 3   conf_cases       30691 non-null  float64
 4   prob_cases       30619 non-null  float64
 5   new_case         54780 non-null  int64  
 6   pnew_case        51167 non-null  float64
 7   tot_death        54780 non-null  int64  
 8   conf_death       30106 non-null  float64
 9   prob_death       30106 non-null  float64
 10  new_death        54780 non-null  int64  
 11  pnew_death       51199 non-null  float64
 12  created_at       54780 non-null  object 
 13  consent_cases    45645 non-null  object 
 14  consent_deaths   46563 non-null  object 
dtypes: float64(6), int64(4), object(5)
memory usage: 6.3+ MB


The COVID Case and Death in the U.S. by State dataset contains 54,780 individual records and 15 features.

The data has been collected daily since 01-23-2020.

The potential columns of relevance are 
'submission_date','state','tot_cases','new_case','pnew_case','tot_death','new_death','pnew_death'

We choose not to use the conf_cases, prob_cases, conf_death, and prob_death columns because some states did not consent to reporting this data (consent is recorded in the consent_cases and consent_deaths columns).
 
The column descriptions can be found at the following site:
https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36

In [130]:
cols = ['submission_date','state','tot_cases','new_case','pnew_case','tot_death','new_death','pnew_death']
filt_covid_case_deaths = covid_case_deaths_raw[cols]

In [131]:
# The column names are readjusted for consistent naming

col_dict = {'submission_date':'date', 'pnew_case':'prob_new_case', 'pnew_death':'prob_new_death'}
filt_covid_case_deaths = filt_covid_case_deaths.rename(columns = col_dict)

In [132]:
filt_covid_case_deaths.head()

Unnamed: 0,date,state,tot_cases,new_case,prob_new_case,tot_death,new_death,prob_new_death
0,12/01/2021,ND,163565,589,220.0,1907,9,0.0
1,11/07/2021,DE,143685,296,30.0,2186,3,0.0
2,05/12/2022,CT,777064,1963,173.0,10883,0,0.0
3,10/04/2020,MD,127290,471,0.0,4092,3,0.0
4,02/06/2020,NE,0,0,,0,0,


In [133]:
# converting the date column to a datetime dtype

filt_covid_case_deaths['date'] = pd.to_datetime(filt_covid_case_deaths['date'])

In [134]:
filt_covid_case_deaths.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54780 entries, 0 to 54779
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   date            54780 non-null  datetime64[ns]
 1   state           54780 non-null  object        
 2   tot_cases       54780 non-null  int64         
 3   new_case        54780 non-null  int64         
 4   prob_new_case   51167 non-null  float64       
 5   tot_death       54780 non-null  int64         
 6   new_death       54780 non-null  int64         
 7   prob_new_death  51199 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int64(4), object(1)
memory usage: 3.3+ MB


In [135]:
filt_covid_case_deaths.isnull().sum()

date                 0
state                0
tot_cases            0
new_case             0
prob_new_case     3613
tot_death            0
new_death            0
prob_new_death    3581
dtype: int64

I choose to leave the null values in place for now. When the machine learning Pipeline is created, they can be imputed with the column mean.
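As a rough illustration of what mean imputation does to one of these columns, the sketch below fills the nulls with the mean of the non-null values. It uses plain pandas on made-up data. In an actual modeling pipeline the mean would normally be computed on the training split only, which this example does not show.

```python
import numpy as np
import pandas as pd

# Toy stand-in for filt_covid_case_deaths (values are made up)
toy = pd.DataFrame({
    "new_case": [589, 296, 1963, 471],
    "prob_new_case": [220.0, 30.0, np.nan, 50.0],
})

# Column mean computed from the non-null values only
col_mean = toy["prob_new_case"].mean()

# Replace nulls with that mean
imputed = toy["prob_new_case"].fillna(col_mean)
print(imputed.tolist())
```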

## 3. Data Merge

Now that we have three cleaned datasets, we merge them on the date and state columns.

In [163]:
merge_covid = pd.merge(filt_covid_impact, filt_covid_vaccine, how="left", on=['date', 'state'], suffixes=("_x", "_y"))

In [173]:
merge_covid = pd.merge(merge_covid, filt_covid_case_deaths, how='left', on=['date', 'state'], suffixes=("_x", "_y"))

In [174]:
merge_covid.isnull().sum()

state                                                         0
date                                                          0
critical_staffing_shortage_today_yes                          0
critical_staffing_shortage_today_no                           0
critical_staffing_shortage_today_not_reported                 0
critical_staffing_shortage_anticipated_within_week_yes        0
critical_staffing_shortage_anticipated_within_week_no         0
inpatient_beds_used_covid                                    54
inpatient_beds_used_covid_coverage                            0
previous_day_admission_adult_covid_confirmed               6631
previous_day_admission_adult_covid_confirmed_coverage         0
previous_day_admission_adult_covid_suspected               6782
previous_day_admission_adult_covid_suspected_coverage         0
inpatient_bed_covid_utilization                             218
adult_icu_bed_covid_utilization                            7341
distributed                             

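Both merges are left joins onto the hospital dataset, so any hospital row without a matching date and state in the other tables receives nulls. For the vaccine columns, this covers dates before vaccine reporting began (12-13-2020) and the days between reports after the schedule changed from daily to weekly.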
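To check where merge-induced nulls come from, `pd.merge` can report the match status of each row through its `indicator` parameter, and `validate` can confirm that the key columns are unique on both sides. The frames below are made-up stand-ins for the hospital and vaccine tables, and the `one_to_one` check assumes each table has at most one row per date and state.

```python
import pandas as pd

# Toy stand-ins for the hospital (left) and vaccine (right) tables
left = pd.DataFrame({
    "date": pd.to_datetime(["2022-06-15", "2022-06-16", "2022-06-22"]),
    "state": ["TX", "TX", "TX"],
    "inpatient_beds_used_covid": [900.0, 910.0, 950.0],
})
right = pd.DataFrame({
    "date": pd.to_datetime(["2022-06-15", "2022-06-22"]),
    "state": ["TX", "TX"],
    "admin": [100, 170],
})

# indicator=True adds a `_merge` column; validate raises if keys are duplicated
checked = pd.merge(
    left, right, how="left", on=["date", "state"],
    indicator=True, validate="one_to_one",
)

# Rows marked left_only had no vaccine record for that date and state
print(checked["_merge"].value_counts())
```

Grouping the `left_only` rows by date would show whether the gaps line up with the weekly reporting period.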

## 4. Cleaned Merged DataFrame Save

In [178]:
merge_covid = merge_covid.reset_index(drop=True)
merge_covid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46777 entries, 0 to 46776
Data columns (total 33 columns):
 #   Column                                                  Non-Null Count  Dtype         
---  ------                                                  --------------  -----         
 0   state                                                   46777 non-null  object        
 1   date                                                    46777 non-null  datetime64[ns]
 2   critical_staffing_shortage_today_yes                    46777 non-null  int64         
 3   critical_staffing_shortage_today_no                     46777 non-null  int64         
 4   critical_staffing_shortage_today_not_reported           46777 non-null  int64         
 5   critical_staffing_shortage_anticipated_within_week_yes  46777 non-null  int64         
 6   critical_staffing_shortage_anticipated_within_week_no   46777 non-null  int64         
 7   inpatient_beds_used_covid                               46

In [180]:
merge_covid.to_csv('../clean_data/merge_covid.csv')

## 5. Continued

Now that the datasets are merged and cleaned, we export the CSV and continue the analysis in the next notebook.
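CSV files store dates as text, so the `date` column will come back as strings unless it is parsed again on load. The save cell in section 4 also writes the index as an unnamed first column by default. The sketch below shows one way the next notebook might round-trip the data. It uses a made-up frame and an in-memory buffer instead of the real file, and the choice of `index=False` is an assumption, not what the original cell does.

```python
import io

import pandas as pd

# Toy stand-in for merge_covid (values are made up)
toy = pd.DataFrame({
    "state": ["TX", "AK"],
    "date": pd.to_datetime(["2020-10-16", "2020-10-10"]),
    "inpatient_beds_used_covid": [2.0, 54.0],
})

# Write to an in-memory buffer instead of disk for this example
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Ask read_csv to parse the date column back into datetimes
restored = pd.read_csv(buffer, parse_dates=["date"])
print(restored.dtypes)
```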