# <a id='toc1_'></a>[COVID-19 Prediction](#toc0_)

Sam Celarek  
Data Science   
scelarek@gmail.com  

June 4th, 2023


**Table of contents**<a id='toc0_'></a>    
- [COVID-19 Prediction](#toc1_)    
- [1. Introduction](#toc2_)    
  - [1.1. Key Questions](#toc2_1_)    
- [2. Setup and Data Collection](#toc3_)    


# <a id='toc2_'></a>[1. Introduction](#toc0_)

In this project, we build a daily COVID-19 dataset for the United States and use it to model and predict the course of the epidemic. Case counts and SIR-F model parameters come from CovSIRPhy, and they are combined with hospitalization, mobility, government response, and weather data.


## <a id='toc2_1_'></a>[1.1. Key Questions](#toc0_)

## <a id='toc2_2_'></a>[1.2. Data Sources and Methods](#toc0_)


# <a id='toc3_'></a>[2. Setup and Data Collection](#toc0_)

We will use daily COVID-19 data for the United States from 2020-01-03 to 2022-09-15, downloaded through CovSIRPhy and supplemented with hospitalization, mobility, weather, and government response CSV files. This section loads the necessary libraries and modules and carries out the data preparation steps.


In [None]:
# Helper module; later cells rely on it for np, pd, plt, sns, os, functools and display
from my_code import *

In [None]:
import covsirphy as cs

print(f"Covsirphy version: {cs.__version__}")


In [None]:
# initialize styling params
np.random.seed(123)

plt.rcParams["xtick.direction"] = "in"
plt.rcParams["ytick.direction"] = "in"
plt.rcParams["font.size"] = 11.0
plt.rcParams["figure.figsize"] = (9, 6)
plt.style.use('fivethirtyeight')

# sns.set_style("whitegrid")
sns.set_palette("viridis")
sns.set_context("notebook")

pd.set_option("display.max_columns", 50)
pd.set_option('display.max_colwidth', 1000)
pd.plotting.register_matplotlib_converters()
os.environ["PYTHONHASHSEED"] = "123"

# import warnings
# warnings.filterwarnings('ignore')
# from fbprophet import Prophet
# from fbprophet.plot import plot_plotly, add_changepoints_to_plot

cs.config.logger(level=2)

Country of Interest: United States of America

In [None]:
country_ISO3 = "USA"
location_key = "US"

## CovSIRPhy Dataset Loading

CovSIRPhy collects COVID-19 data from around the world, including the number of confirmed cases, recovered cases, and deaths. The cell below downloads the data from the selected databases and prints the citation for each source.


In [None]:
eng = cs.DataEngineer()
eng.download(country=None, databases=["covid19dh", "owid", "wpp", 'japan'])

# Print the citations of the downloaded data sources

print("\n".join(eng.citations()))

In [None]:
# Convert Date Column to Datetime, Resampling for only the dates in question, and Filling of Missing Values with Forward Fill and 0
eng.clean(kinds=['resample', 'fillna', 'convert_date'], date_range=('2020-01-03', '2022-09-15'))
eng.transform()

# Day to Day Differences
eng.diff(column="Confirmed", suffix="_Daily_Diff", freq="D")
eng.diff(column="Fatal", suffix="_Daily_Diff", freq="D")
eng.diff(column="Recovered", suffix="_Daily_Diff", freq="D")
eng.diff(column="Susceptible", suffix="_Daily_Diff", freq="D")
eng.diff(column="Tests", suffix="_Daily_Diff", freq="D")

# Addition
eng.add(columns=["Fatal", "Recovered"], new="Total_Removed")

# Division and Ratios
eng.div(numerator="Confirmed", denominator="Tests", new="Confirmed_per_Test")
eng.div(numerator="Fatal", denominator="Confirmed", new="Fatal_per_Confirmed")
eng.div(numerator="Recovered", denominator="Confirmed", new="Recovered_per_Confirmed")
eng.div(numerator="Fatal", denominator="Total_Removed", new="Fatal_to_Total_Removed")

eng.all().info()
eng.all().tail()
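
The feature-engineering calls above build daily differences, a sum, and several ratios. As a rough pandas sketch of the same kind of arithmetic (this is not CovSIRPhy's implementation, and the toy numbers are made up), note how a ratio with a zero denominator turns into `NaN` or `inf`, which is what the Cleaning section later has to deal with:

```python
import numpy as np
import pandas as pd

# Toy cumulative counts (hypothetical values, not real data)
toy = pd.DataFrame(
    {
        "Confirmed": [0, 1, 3, 6],
        "Fatal": [0, 0, 1, 1],
        "Recovered": [0, 0, 0, 2],
        "Tests": [0, 0, 10, 30],
    },
    index=pd.date_range("2020-01-20", periods=4, freq="D", name="Date"),
)

# Day-to-day difference of a cumulative column; the first day has no predecessor
toy["Confirmed_Daily_Diff"] = toy["Confirmed"].diff().fillna(0)

# Sum of two columns
toy["Total_Removed"] = toy["Fatal"] + toy["Recovered"]

# Ratio of two columns: 0/0 gives NaN and 1/0 gives inf
toy["Confirmed_per_Test"] = toy["Confirmed"] / toy["Tests"]

print(toy[["Confirmed", "Tests", "Confirmed_per_Test"]])
```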

In [None]:
# Create subset of data for the country of interest and the dates of interest
# Complement does two things here: forces always increasing cumulative values, estimates recovered cases using value of estimated recovery period

actual_df, status, _ = eng.subset(geo=country_ISO3, start_date='2020-01-03', end_date='2022-09-15', complement=True)
print(status)

actual_df.info()
display(actual_df.tail())
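
The comment in the cell above says that `complement=True` forces cumulative values to keep increasing. One simple way to obtain that property, shown here only as an illustration on made-up numbers and not as CovSIRPhy's actual algorithm, is a running maximum:

```python
import pandas as pd

# Hypothetical cumulative series with a downward reporting correction on day 3
reported = pd.Series([0, 5, 9, 7, 12], name="Confirmed")

# Running maximum: each value is at least as large as every earlier one
monotonic = reported.cummax()

print(monotonic.tolist())  # [0, 5, 9, 9, 12]
```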


In [None]:
# Create a SIRF Model from actual df
dyn_act = cs.Dynamics.from_data(model=cs.SIRFModel, data=actual_df, name=country_ISO3)

dyn_act.register().tail()


In [None]:
# Breakdown of the SIRF Model Parameters and points of change
dyn_act.segment()

# Show summary
dyn_act.summary().tail(), dyn_act.summary().head()


In [None]:
# Calculate tau value and Disease Parameters from Actual SIRF Data
dyn_act.estimate()
print(f"Tau value [min]: {dyn_act.tau or 'un-set'}")

# Show summary
dyn_act.summary().head()


In [None]:
# Interpolate Disease Parameters
track_df = dyn_act.track()
track_df.tail()
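
The tracked parameters include `rt`, `theta`, `kappa`, `rho`, and `sigma`. As a quick consistency check, assuming the SIR-F reproduction number is defined as Rt = rho * (1 - theta) / (sigma + kappa), the parameter values shown in the first rows of `disease_df` in the Cleaning section reproduce the reported `rt` of 12.91. The formula and the parameter meanings in the comments are assumptions, not taken from this notebook:

```python
# Non-dimensional SIR-F parameters copied from the first rows of disease_df
# (assumed meanings: theta = direct fatality probability, kappa = mortality rate,
#  rho = effective contact rate, sigma = recovery rate)
theta = 0.476218
kappa = 0.001231
rho = 0.098248
sigma = 0.002755

# Assumed SIR-F reproduction number
rt = rho * (1 - theta) / (sigma + kappa)

print(round(rt, 2))  # 12.91
```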


In [None]:
# Assess Disease Parameter Data
track_df.info()

display(track_df.tail())

track_df.isna().sum().sum()


In [None]:
# Assess USA Data
actual_df.info()

display(actual_df.tail())

actual_df.isna().sum().sum()


In [None]:
# merge two datasets together on date
disease_df = pd.merge(actual_df, track_df, how='left', on='Date')


disease_df.info()

display(disease_df.head())

disease_df.isna().sum().sum()

In [None]:

disease_df.to_parquet('../Data/disease_df.parquet', compression='snappy')

## Cleaning

In [246]:
disease_df = pd.read_parquet('../Data/disease_df.parquet')
disease_df.columns = [i.lower().replace(' ', '_') for i in disease_df.columns]
disease_df = disease_df.rename_axis('date')
disease_df.info()
disease_df.head()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 987 entries, 2020-01-03 to 2022-09-15
Data columns (total 45 columns):
 #   Column                                                      Non-Null Count  Dtype  
---  ------                                                      --------------  -----  
 0   cancel_events                                               987 non-null    Float64
 1   confirmed_daily_diff                                        987 non-null    Float64
 2   confirmed_per_test                                          987 non-null    Float64
 3   contact_tracing                                             987 non-null    Float64
 4   fatal_daily_diff                                            987 non-null    Float64
 5   fatal_per_confirmed                                         987 non-null    Float64
 6   fatal_to_total_removed                                      987 non-null    Float64
 7   gatherings_restrictions                                     987 non-nu

date,cancel_events,confirmed_daily_diff,confirmed_per_test,contact_tracing,fatal_daily_diff,fatal_per_confirmed,fatal_to_total_removed,gatherings_restrictions,infected,information_campaigns,internal_movement_restrictions,international_movement_restrictions,population,recovered_daily_diff,recovered_per_confirmed,school_closing,stay_home_restrictions,stringency_index,susceptible,susceptible_daily_diff,testing_policy,tests,tests_daily_diff,total_removed,transport_closing,vaccinated_full,vaccinated_once,vaccinations,vaccinations_boosters,workplace_closing,confirmed,fatal,recovered,country_united_states,product_0,"product_johnson&johnson,_moderna,_novavax,_pfizer/biontech",rt,theta,kappa,rho,sigma,alpha1_[-],1/alpha2_[day],1/beta_[day],1/gamma_[day]
2020-01-03,0.0,0.0,,0.0,0.0,,,0.0,0,0.0,0.0,0.0,326687501.0,0.0,,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,1,1,0,12.91,0.476218,0.001231,0.098248,0.002755,0.476,406,5,182
2020-01-04,0.0,0.0,,0.0,0.0,,,0.0,0,0.0,0.0,0.0,326687501.0,0.0,,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,1,1,0,12.91,0.476218,0.001231,0.098248,0.002755,0.476,406,5,182
2020-01-05,0.0,0.0,,0.0,0.0,,,0.0,0,0.0,0.0,0.0,326687501.0,0.0,,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,1,1,0,12.91,0.476218,0.001231,0.098248,0.002755,0.476,406,5,182
2020-01-06,0.0,0.0,,0.0,0.0,,,0.0,0,0.0,0.0,0.0,326687501.0,0.0,,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,1,1,0,12.91,0.476218,0.001231,0.098248,0.002755,0.476,406,5,182
2020-01-07,0.0,0.0,,0.0,0.0,,,0.0,0,0.0,0.0,0.0,326687501.0,0.0,,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,1,1,0,12.91,0.476218,0.001231,0.098248,0.002755,0.476,406,5,182


In [247]:
# Count NaN values per column (NaN is the only value for which x != x is True)
disease_df.apply(lambda x: (x != x).sum())

cancel_events                                                  0
confirmed_daily_diff                                           0
confirmed_per_test                                            18
contact_tracing                                                0
fatal_daily_diff                                               0
fatal_per_confirmed                                           18
fatal_to_total_removed                                        57
gatherings_restrictions                                        0
infected                                                       0
information_campaigns                                          0
internal_movement_restrictions                                 0
international_movement_restrictions                            0
population                                                     0
recovered_daily_diff                                           0
recovered_per_confirmed                                       18
school_closing           
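
The cell above counts missing values with the `x != x` idiom. A minimal sketch on made-up data shows why it works: `NaN` is the only float value that is not equal to itself. For ordinary `float64` columns the result matches `isna().sum()`.

```python
import numpy as np
import pandas as pd

# Toy frame with two NaN values in one column (hypothetical data)
toy = pd.DataFrame({"ratio": [np.nan, 0.5, np.nan], "count": [1.0, 2.0, 3.0]})

# Element-wise self-comparison is True only at NaN positions
nan_counts = toy.apply(lambda col: (col != col).sum())

print(nan_counts)
print(toy.isna().sum())
```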

In [248]:
# For each ratio column, sum its inputs over the rows where the ratio is NaN.
# If both inputs sum to 0, every NaN comes from a 0/0 division.
display(disease_df[disease_df.applymap(lambda x: (x != x)).fatal_per_confirmed][['confirmed', 'fatal', 'fatal_per_confirmed']].sum())
display(disease_df[disease_df.applymap(lambda x: (x != x)).fatal_to_total_removed][['fatal', 'total_removed', 'fatal_to_total_removed']].sum())
display(disease_df[disease_df.applymap(lambda x: (x != x)).confirmed_per_test][['confirmed', 'tests', 'confirmed_per_test']].sum())
display(disease_df[disease_df.applymap(lambda x: (x != x)).recovered_per_confirmed][['recovered', 'confirmed', 'recovered_per_confirmed']].sum())

# Treat a 0/0 ratio as 0: replace every NaN in the dataframe with 0
disease_df = disease_df.applymap(lambda x: 0 if (x != x) else x)


confirmed              0.0
fatal                  0.0
fatal_per_confirmed    NaN
dtype: float64

fatal                     0.0
total_removed             0.0
fatal_to_total_removed    NaN
dtype: float64

confirmed             0.0
tests                 0.0
confirmed_per_test    NaN
dtype: float64

recovered                  0.0
confirmed                  0.0
recovered_per_confirmed    NaN
dtype: float64

In [249]:
disease_df.apply(lambda x: (x != x).sum())

cancel_events                                                 0
confirmed_daily_diff                                          0
confirmed_per_test                                            0
contact_tracing                                               0
fatal_daily_diff                                              0
fatal_per_confirmed                                           0
fatal_to_total_removed                                        0
gatherings_restrictions                                       0
infected                                                      0
information_campaigns                                         0
internal_movement_restrictions                                0
international_movement_restrictions                           0
population                                                    0
recovered_daily_diff                                          0
recovered_per_confirmed                                       0
school_closing                          

In [250]:

disease_df.apply(lambda x: (x == np.inf).sum())

cancel_events                                                  0
confirmed_daily_diff                                           0
confirmed_per_test                                            40
contact_tracing                                                0
fatal_daily_diff                                               0
fatal_per_confirmed                                            0
fatal_to_total_removed                                         0
gatherings_restrictions                                        0
infected                                                       0
information_campaigns                                          0
internal_movement_restrictions                                 0
international_movement_restrictions                            0
population                                                     0
recovered_daily_diff                                           0
recovered_per_confirmed                                        0
school_closing           

In [251]:
display(disease_df[disease_df.applymap(lambda x: (x == np.inf)).confirmed_per_test][['confirmed', 'tests', 'confirmed_per_test']].head())


date,confirmed,tests,confirmed_per_test
2020-01-21,1,0.0,inf
2020-01-22,1,0.0,inf
2020-01-23,1,0.0,inf
2020-01-24,2,0.0,inf
2020-01-25,3,0.0,inf


In [252]:
# inf appears where cases were confirmed but no tests were recorded; replace it with 1
disease_df = disease_df.applymap(lambda x: 1 if (x == np.inf) else x)


display(disease_df[disease_df.applymap(lambda x: (x == np.inf)).confirmed_per_test][['confirmed', 'tests', 'confirmed_per_test']].head())


date,confirmed,tests,confirmed_per_test
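
The two clean-up steps above replace `NaN` with 0 and `inf` with 1 across the whole dataframe. A more targeted sketch, applied to a single ratio on made-up data, uses the same rules only where the denominator is zero. The helper name `safe_ratio` is hypothetical and not part of the notebook:

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two columns; where the denominator is 0, return 0 for 0/0 and 1 otherwise."""
    num = numerator.astype("float64")
    den = denominator.astype("float64")
    ratio = num / den
    # Fallback values that mirror the notebook's rules: 0/0 -> 0, x/0 -> 1
    fallback = pd.Series(np.where(num == 0, 0.0, 1.0), index=ratio.index)
    # Only rows with a zero denominator are overwritten
    return ratio.mask(den == 0, fallback)


confirmed = pd.Series([0, 1, 3, 6])
tests = pd.Series([0, 0, 10, 30])
confirmed_per_test = safe_ratio(confirmed, tests)

print(confirmed_per_test.tolist())  # [0.0, 1.0, 0.3, 0.2]
```

Restricting the replacement to one column avoids silently overwriting `NaN` or `inf` values in unrelated columns.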


In [253]:
# disease_df[disease_df.confirmed >= disease_df.tests].sum()

## Google Datasets Loading and Cleaning

In [254]:
folder_holder = "C://Users/Samsickle/Documents/Universal_Code_Bank/BrainStation_Capstone/Data/"


def clean_df(df, location_key):
    # Filter the dataframe based on location key and date.
    # .copy() makes the result independent of the input and avoids SettingWithCopyWarning.
    df = df.query('location_key == @location_key and date >= "2020-01-03"').copy()
    
    # Rename the columns to lowercase and replace spaces with underscores.
    df.columns = [i.lower().replace(' ', '_') for i in df.columns]

    # Convert the 'date' column to datetime format and set it as index.
    df['date'] = pd.to_datetime(df.date)
    df = df.set_index('date')
    df = df.drop(columns=['location_key'])
    
    # Print the info and head of the DataFrame.
    df.info()
    display(df.head())

    return df


In [255]:
# Load the hospitalizations data from the CSV file
hospitalizations_df = pd.read_csv(f'{folder_holder}hospitalizations.csv')

# Clean the hospitalizations data using the 'clean_df' function
hospitalizations_df = clean_df(hospitalizations_df, location_key)

# Load the mobility data from the CSV file
mobility_df = pd.read_csv(f'{folder_holder}mobility.csv')

# Clean the mobility data using the 'clean_df' function
mobility_df = clean_df(mobility_df, location_key)

# Load the weather data from the CSV file
weather_df = pd.read_csv(f'{folder_holder}weather.csv')

# Clean the weather data using the 'clean_df' function and fill any NA/NaN values with 0
weather_df = clean_df(weather_df, location_key).fillna(0)

# Load the government response data from the CSV file
gov_response_df = pd.read_csv(f'{folder_holder}oxford-government-response.csv')

# Clean the government response data using the 'clean_df' function
gov_response_df = clean_df(gov_response_df, location_key)


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 977 entries, 2020-01-13 to 2022-09-15
Data columns (total 9 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   new_hospitalized_patients           977 non-null    float64
 1   cumulative_hospitalized_patients    977 non-null    float64
 2   current_hospitalized_patients       977 non-null    float64
 3   new_intensive_care_patients         419 non-null    float64
 4   cumulative_intensive_care_patients  420 non-null    float64
 5   current_intensive_care_patients     977 non-null    float64
 6   new_ventilator_patients             419 non-null    float64
 7   cumulative_ventilator_patients      420 non-null    float64
 8   current_ventilator_patients         420 non-null    float64
dtypes: float64(9)
memory usage: 76.3 KB


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['date'] = pd.to_datetime(df.date)


date,new_hospitalized_patients,cumulative_hospitalized_patients,current_hospitalized_patients,new_intensive_care_patients,cumulative_intensive_care_patients,current_intensive_care_patients,new_ventilator_patients,cumulative_ventilator_patients,current_ventilator_patients
2020-01-13,0.0,0.0,0.0,,0.0,0.0,,0.0,0.0
2020-01-14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 941 entries, 2020-02-15 to 2022-09-12
Data columns (total 6 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   mobility_retail_and_recreation  941 non-null    float64
 1   mobility_grocery_and_pharmacy   941 non-null    float64
 2   mobility_parks                  941 non-null    float64
 3   mobility_transit_stations       941 non-null    float64
 4   mobility_workplaces             941 non-null    float64
 5   mobility_residential            941 non-null    float64
dtypes: float64(6)
memory usage: 51.5 KB



date,mobility_retail_and_recreation,mobility_grocery_and_pharmacy,mobility_parks,mobility_transit_stations,mobility_workplaces,mobility_residential
2020-02-15,6.0,2.0,15.0,3.0,2.0,-1.0
2020-02-16,7.0,1.0,16.0,2.0,0.0,-1.0
2020-02-17,6.0,0.0,28.0,-9.0,-24.0,5.0
2020-02-18,0.0,-1.0,6.0,1.0,0.0,1.0
2020-02-19,2.0,0.0,8.0,1.0,1.0,0.0


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 986 entries, 2020-01-03 to 2022-09-14
Data columns (total 7 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   average_temperature_celsius  986 non-null    float64
 1   minimum_temperature_celsius  986 non-null    float64
 2   maximum_temperature_celsius  986 non-null    float64
 3   rainfall_mm                  986 non-null    float64
 4   snowfall_mm                  91 non-null     float64
 5   dew_point                    986 non-null    float64
 6   relative_humidity            986 non-null    float64
dtypes: float64(7)
memory usage: 61.6 KB



date,average_temperature_celsius,minimum_temperature_celsius,maximum_temperature_celsius,rainfall_mm,snowfall_mm,dew_point,relative_humidity
2020-01-03,1.055556,-2.648148,5.703704,0.0,30.48,-2.475309,77.390895
2020-01-04,0.006173,-6.617284,9.197531,0.0,30.48,-5.407407,67.23791
2020-01-05,5.203704,0.54321,8.580247,0.0,,-2.790123,56.438457
2020-01-06,0.654321,-4.919753,8.148148,0.0,,-4.993827,66.21492
2020-01-07,1.567901,-4.709877,11.012346,0.0,,-5.487654,59.734417


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 917 entries, 2020-01-03 to 2022-07-07
Data columns (total 20 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   school_closing                      917 non-null    float64
 1   workplace_closing                   917 non-null    float64
 2   cancel_public_events                917 non-null    float64
 3   restrictions_on_gatherings          917 non-null    float64
 4   public_transport_closing            917 non-null    float64
 5   stay_at_home_requirements           917 non-null    float64
 6   restrictions_on_internal_movement   917 non-null    float64
 7   international_travel_controls       917 non-null    float64
 8   income_support                      917 non-null    float64
 9   debt_relief                         917 non-null    float64
 10  fiscal_measures                     534 non-null    float64
 11  international_support     


date,school_closing,workplace_closing,cancel_public_events,restrictions_on_gatherings,public_transport_closing,stay_at_home_requirements,restrictions_on_internal_movement,international_travel_controls,income_support,debt_relief,fiscal_measures,international_support,public_information_campaigns,testing_policy,contact_tracing,emergency_investment_in_healthcare,investment_in_vaccines,facial_coverings,vaccination_policy,stringency_index
2020-01-03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [256]:
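# This expression equals twice the number of dates shared by the two indexes:
# len(A) + len(B) - len(symmetric difference) = 2 * len(intersection), so 1834 means 917 shared dates.
len(disease_df.index) + len(gov_response_df.index) - len(disease_df.index.symmetric_difference(gov_response_df.index))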


1834
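
The count above can be reproduced directly with `Index.intersection`. This sketch assumes both indexes are gap-free daily ranges with the start and end dates reported by `.info()` earlier (987 and 917 entries); the variable names are illustrative:

```python
import pandas as pd

# Date ranges matching the two indexes reported by .info() above
disease_idx = pd.date_range("2020-01-03", "2022-09-15", freq="D")
gov_idx = pd.date_range("2020-01-03", "2022-07-07", freq="D")

# Dates present in both indexes
common = disease_idx.intersection(gov_idx)

# The notebook's expression, which counts every shared date twice
doubled = len(disease_idx) + len(gov_idx) - len(disease_idx.symmetric_difference(gov_idx))

print(len(common), doubled)  # 917 1834
```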

In [257]:
# Identify columns common to 'disease_df' and 'gov_response_df'.
list_of_same_columns = disease_df.columns[disease_df.columns.isin(gov_response_df.columns)]

# Count duplicated rows after stacking the common columns of both dataframes.
# Note that duplicated() compares row values only and ignores the date index.
pd.concat([disease_df[list_of_same_columns], gov_response_df[list_of_same_columns]]).duplicated().sum()


1832
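
Because `duplicated()` ignores the index, a high duplicate count does not by itself show that the two sources agree on each date. A date-aligned comparison is a stricter check. The sketch below uses made-up values for a single shared column; the frame names `left` and `right` are illustrative:

```python
import pandas as pd

dates = pd.date_range("2020-01-03", periods=5, freq="D", name="date")
left = pd.DataFrame({"school_closing": [0.0, 0.0, 1.0, 2.0, 2.0]}, index=dates)
# The right frame covers fewer dates and disagrees on one of them (made-up values)
right = pd.DataFrame({"school_closing": [0.0, 0.0, 1.0, 3.0]}, index=dates[:4])

shared_dates = left.index.intersection(right.index)
shared_cols = left.columns.intersection(right.columns)

# Compare values date by date on the overlap only
mismatch = left.loc[shared_dates, shared_cols].ne(right.loc[shared_dates, shared_cols])

# Number of disagreeing dates per shared column
print(mismatch.sum())
```

Note that `ne` also flags positions where both sides are `NaN`, so missing values should be handled before running such a check.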

In [258]:
# Remove the common columns from 'gov_response_df'.
gov_response_df = gov_response_df.drop(columns = list_of_same_columns)


### Final Merging of Datasets

In [259]:
# Create a list of dataframes to be merged
time_series_dfs = [disease_df, hospitalizations_df, mobility_df, gov_response_df, weather_df]

# Use functools.reduce to merge all dataframes in the list on 'date' column, with 'left' join method
master_df = functools.reduce(lambda a, b: pd.merge(a, b, how='left', right_on='date', left_on='date'), time_series_dfs)

# Display summary and first few rows of the master dataframe
master_df.info()
master_df.head()


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 987 entries, 2020-01-03 to 2022-09-15
Data columns (total 82 columns):
 #   Column                                                      Non-Null Count  Dtype  
---  ------                                                      --------------  -----  
 0   cancel_events                                               987 non-null    float64
 1   confirmed_daily_diff                                        987 non-null    float64
 2   confirmed_per_test                                          987 non-null    float64
 3   contact_tracing                                             987 non-null    float64
 4   fatal_daily_diff                                            987 non-null    float64
 5   fatal_per_confirmed                                         987 non-null    float64
 6   fatal_to_total_removed                                      987 non-null    float64
 7   gatherings_restrictions                                     987 non-nu

date,cancel_events,confirmed_daily_diff,confirmed_per_test,contact_tracing,fatal_daily_diff,fatal_per_confirmed,fatal_to_total_removed,gatherings_restrictions,infected,information_campaigns,internal_movement_restrictions,international_movement_restrictions,population,recovered_daily_diff,recovered_per_confirmed,school_closing,stay_home_restrictions,stringency_index,susceptible,susceptible_daily_diff,testing_policy,tests,tests_daily_diff,total_removed,transport_closing,...,mobility_transit_stations,mobility_workplaces,mobility_residential,cancel_public_events,restrictions_on_gatherings,public_transport_closing,stay_at_home_requirements,restrictions_on_internal_movement,international_travel_controls,income_support,debt_relief,fiscal_measures,international_support,public_information_campaigns,emergency_investment_in_healthcare,investment_in_vaccines,facial_coverings,vaccination_policy,average_temperature_celsius,minimum_temperature_celsius,maximum_temperature_celsius,rainfall_mm,snowfall_mm,dew_point,relative_humidity
2020-01-03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0,0.0,...,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.055556,-2.648148,5.703704,0.0,30.48,-2.475309,77.390895
2020-01-04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0,0.0,...,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006173,-6.617284,9.197531,0.0,30.48,-5.407407,67.23791
2020-01-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0,0.0,...,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.203704,0.54321,8.580247,0.0,0.0,-2.790123,56.438457
2020-01-06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0,0.0,...,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.654321,-4.919753,8.148148,0.0,0.0,-4.993827,66.21492
2020-01-07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0,0.0,...,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.567901,-4.709877,11.012346,0.0,0.0,-5.487654,59.734417
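
The left joins keep every date from `disease_df`, so any date missing from a right-hand frame appears as `NaN`, as in the mobility columns above. A small sketch on made-up frames shows the effect and how to count the resulting gaps. It joins on the index with `left_index`/`right_index`, a variation on the notebook's `left_on`/`right_on` call:

```python
import functools

import pandas as pd

dates = pd.date_range("2020-01-03", periods=4, freq="D", name="date")
base = pd.DataFrame({"confirmed": [0, 0, 1, 2]}, index=dates)
# This frame starts two days later, like the mobility data above
late = pd.DataFrame({"mobility_parks": [15.0, 16.0]}, index=dates[2:])
weather = pd.DataFrame({"rainfall_mm": [0.0, 1.2, 0.0, 3.4]}, index=dates)

# Chain left joins on the shared date index
merged = functools.reduce(
    lambda a, b: a.merge(b, how="left", left_index=True, right_index=True),
    [base, late, weather],
)

# Dates missing from a right-hand frame show up as NaN after a left join
print(merged.isna().sum())
```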


In [260]:
"""
Cell generated by Data Wrangler.
"""
def clean_data(master_df):
    master_df = master_df.drop(columns=['country_united_states', 'product_0', 'product_johnson&johnson,_moderna,_novavax,_pfizer/biontech', 'fiscal_measures', 'international_support', 'investment_in_vaccines', 'emergency_investment_in_healthcare'])
    return master_df

master_df = clean_data(master_df.copy())
master_df.head()

date,cancel_events,confirmed_daily_diff,confirmed_per_test,contact_tracing,fatal_daily_diff,fatal_per_confirmed,fatal_to_total_removed,gatherings_restrictions,infected,information_campaigns,internal_movement_restrictions,international_movement_restrictions,population,recovered_daily_diff,recovered_per_confirmed,school_closing,stay_home_restrictions,stringency_index,susceptible,susceptible_daily_diff,testing_policy,tests,tests_daily_diff,total_removed,transport_closing,...,current_ventilator_patients,mobility_retail_and_recreation,mobility_grocery_and_pharmacy,mobility_parks,mobility_transit_stations,mobility_workplaces,mobility_residential,cancel_public_events,restrictions_on_gatherings,public_transport_closing,stay_at_home_requirements,restrictions_on_internal_movement,international_travel_controls,income_support,debt_relief,public_information_campaigns,facial_coverings,vaccination_policy,average_temperature_celsius,minimum_temperature_celsius,maximum_temperature_celsius,rainfall_mm,snowfall_mm,dew_point,relative_humidity
2020-01-03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0,0.0,...,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.055556,-2.648148,5.703704,0.0,30.48,-2.475309,77.390895
2020-01-04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0,0.0,...,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006173,-6.617284,9.197531,0.0,30.48,-5.407407,67.23791
2020-01-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0,0.0,...,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.203704,0.54321,8.580247,0.0,0.0,-2.790123,56.438457
2020-01-06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0,0.0,...,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.654321,-4.919753,8.148148,0.0,0.0,-4.993827,66.21492
2020-01-07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0,0.0,...,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.567901,-4.709877,11.012346,0.0,0.0,-5.487654,59.734417


In [261]:
master_df.to_parquet('../Data/master_df.parquet', compression='snappy')
master_df = pd.read_parquet('../Data/master_df.parquet')

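# ***DATA LEAKAGE ALERT***

The imputer in the next cell is fit on the entire `master_df`, so the value imputed for any given date can draw on observations from later dates. Any model evaluated on a held-out period of `df_imputed` will therefore have indirectly seen that period.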

In [262]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

# Initialize the imputer
imputer = IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=10, random_state=0))

# Fit the imputer on the dataframe and transform
df_imputed = imputer.fit_transform(master_df)

# Convert back to pandas dataframe and assign column names
df_imputed = pd.DataFrame(df_imputed, columns=master_df.columns)

df_imputed


Unnamed: 0,cancel_events,confirmed_daily_diff,confirmed_per_test,contact_tracing,fatal_daily_diff,fatal_per_confirmed,fatal_to_total_removed,gatherings_restrictions,infected,information_campaigns,internal_movement_restrictions,international_movement_restrictions,population,recovered_daily_diff,recovered_per_confirmed,school_closing,stay_home_restrictions,stringency_index,susceptible,susceptible_daily_diff,testing_policy,tests,tests_daily_diff,total_removed,transport_closing,...,current_ventilator_patients,mobility_retail_and_recreation,mobility_grocery_and_pharmacy,mobility_parks,mobility_transit_stations,mobility_workplaces,mobility_residential,cancel_public_events,restrictions_on_gatherings,public_transport_closing,stay_at_home_requirements,restrictions_on_internal_movement,international_travel_controls,income_support,debt_relief,public_information_campaigns,facial_coverings,vaccination_policy,average_temperature_celsius,minimum_temperature_celsius,maximum_temperature_celsius,rainfall_mm,snowfall_mm,dew_point,relative_humidity
0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0.00,326687501.0,0.0,0.0,0.0,0.0,0.0,0.0,...,135.1,0.9,0.4,9.8,-8.2,-18.0,3.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.055556,-2.648148,5.703704,0.000000,30.48,-2.475309,77.390895
1,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0.00,326687501.0,0.0,0.0,0.0,0.0,0.0,0.0,...,135.1,-0.1,0.1,7.0,-7.1,-18.0,3.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006173,-6.617284,9.197531,0.000000,30.48,-5.407407,67.237910
2,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0.00,326687501.0,0.0,0.0,0.0,0.0,0.0,0.0,...,126.0,5.0,0.4,15.9,-8.1,-18.3,3.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.203704,0.543210,8.580247,0.000000,0.00,-2.790123,56.438457
3,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0.00,326687501.0,0.0,0.0,0.0,0.0,0.0,0.0,...,135.1,0.9,0.9,9.1,-7.1,-18.0,3.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.654321,-4.919753,8.148148,0.000000,0.00,-4.993827,66.214920
4,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,326687501.0,0.0,0.0,0.0,0.0,0.00,326687501.0,0.0,0.0,0.0,0.0,0.0,0.0,...,126.0,3.6,1.4,12.0,-7.1,-18.0,3.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.567901,-4.709877,11.012346,0.000000,0.00,-5.487654,59.734417
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
982,0.0,11464.0,0.104090,1.0,19.0,0.011011,1.0,0.0,1171596.0,2.0,-1.0,4.0,326687501.0,0.0,0.0,-2.0,-1.0,34.26,231677010.0,-11464.0,3.0,912769124.0,0.0,1046164.0,0.0,...,2915.1,-10.0,-3.0,16.0,-14.0,-8.0,1.0,0.0,1.2,1.0,0.7,0.9,4.0,0.1,0.1,2.0,2.8,5.0,15.205556,7.327778,25.138889,1.574800,0.00,7.622222,60.607572
983,0.0,87513.0,0.104186,1.0,569.0,0.011007,1.0,0.0,1156745.0,2.0,-1.0,4.0,326687501.0,0.0,0.0,-2.0,-1.0,34.26,231589497.0,-87513.0,3.0,912769124.0,0.0,1046733.0,0.0,...,2888.8,-11.0,-2.0,19.0,-20.0,-23.0,4.0,0.0,1.2,1.0,0.7,0.9,4.0,0.1,0.1,2.0,2.8,5.0,19.116667,8.644444,30.777778,0.000000,0.00,8.183333,49.600466
984,0.0,61642.0,0.104254,1.0,488.0,0.011005,1.0,0.0,1209106.0,2.0,-1.0,4.0,326687501.0,0.0,0.0,-2.0,-1.0,34.26,231527855.0,-61642.0,3.0,912769124.0,0.0,1047221.0,0.0,...,2509.9,-9.6,-2.6,31.9,-20.4,-23.4,3.2,0.0,1.2,1.0,0.7,0.9,4.0,0.1,0.1,2.0,2.7,5.0,21.983333,12.761111,33.616667,0.000000,0.00,10.966667,49.999442
985,0.0,115429.0,0.104380,1.0,920.0,0.011001,1.0,0.0,1315079.0,2.0,-1.0,4.0,326687501.0,0.0,0.0,-2.0,-1.0,34.26,231412426.0,-115429.0,3.0,912769124.0,0.0,1048141.0,0.0,...,1961.3,-8.4,-0.5,24.6,-19.5,-24.1,3.7,0.0,1.2,1.0,0.7,0.9,4.0,0.1,0.1,2.0,2.6,5.0,23.518519,21.407407,27.049383,0.000000,0.00,12.790123,51.394811
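
The leakage flagged above comes from calling `fit_transform` on the full dataframe. A minimal sketch of one way to limit it, assuming a chronological train/test split: fit the imputer on the training period only, then apply it to the later period with `transform`. The toy dataframe, its column names, and the 80% cutoff are illustrative assumptions, not values from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

# Toy stand-in for master_df (hypothetical columns and values)
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-03", periods=50, freq="D")
toy = pd.DataFrame({"a": rng.normal(size=50), "b": rng.normal(size=50)}, index=dates)
toy["c"] = toy["a"] + toy["b"]
toy.iloc[::7, 1] = np.nan  # knock out every 7th value of column "b"

# Chronological split: the first 80% of days form the training period (assumed cutoff)
split = int(len(toy) * 0.8)
train, test = toy.iloc[:split], toy.iloc[split:]

imputer = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
    random_state=0,
)

# Fit on the training period only, then reuse the fitted imputer on the test period
train_imputed = pd.DataFrame(
    imputer.fit_transform(train), columns=toy.columns, index=train.index
)
test_imputed = pd.DataFrame(
    imputer.transform(test), columns=toy.columns, index=test.index
)
```

Passing `index=` when rebuilding the dataframe keeps the date index, which the `pd.DataFrame(df_imputed, columns=...)` call above replaces with a plain `RangeIndex`.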


In [274]:
df_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 987 entries, 0 to 986
Data columns (total 75 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   cancel_events                        987 non-null    float64
 1   confirmed_daily_diff                 987 non-null    float64
 2   confirmed_per_test                   987 non-null    float64
 3   contact_tracing                      987 non-null    float64
 4   fatal_daily_diff                     987 non-null    float64
 5   fatal_per_confirmed                  987 non-null    float64
 6   fatal_to_total_removed               987 non-null    float64
 7   gatherings_restrictions              987 non-null    float64
 8   infected                             987 non-null    float64
 9   information_campaigns                987 non-null    float64
 10  internal_movement_restrictions       987 non-null    float64
 11  international_movement_restricti

### Extra Code

In [264]:

# # Growth factor: (delta Number_n) / (delta Number_{n-1})
# df = df.diff() / df.diff().shift(freq="D")

# # Rolling mean (window: 7 days)
# df = df.rolling(7).mean().dropna().loc[:covid_df["Date"].max(), :]
# numeric_columns_assessment(eng.all()).T
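
As a rough illustration of the growth factor and 7-day rolling mean in the commented-out cell above, here is a small sketch on made-up cumulative counts. The series values are hypothetical, and the sketch uses `shift(1)` rather than `shift(freq="D")`; on a gap-free daily index both should line each day up with the previous day's value.

```python
import pandas as pd

# Hypothetical cumulative confirmed counts, indexed by date
dates = pd.date_range("2020-03-01", periods=10, freq="D")
cumulative = pd.Series(
    [1, 2, 4, 8, 16, 24, 36, 54, 81, 100], index=dates, dtype="float64"
)

# Daily new cases: delta Number_n
daily_new = cumulative.diff()

# Growth factor: delta Number_n / delta Number_{n-1}
growth_factor = daily_new / daily_new.shift(1)

# 7-day rolling mean; windows containing NaN are dropped
smoothed = growth_factor.rolling(7).mean().dropna()
```

A growth factor above 1 means daily new cases are rising, and below 1 means they are falling; the rolling mean damps day-of-week reporting effects.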

In [265]:
# """
# Cell generated by Data Wrangler.
# """
# def clean_data(df):
#     # Filter rows based on column: 'location_key'
#     # df = df[df['location_key'].str.contains(r"^(US|CA|DE|GB|FR|JP|AU|BR|ZA|IN)$", na=False)]
#     df = df[df['location_key'].str.contains(r"^(US)$", na=False)] # only the us to start

#     # Change column type to datetime64[ns] for column: 'date'
#     try:
#         df = df.astype({'date': 'datetime64[ns]'})
#     except:
#         pass
#     # Change column type to category for column: 'location_key'
#     df = df.astype({'location_key': 'category'})
#     return df


# time_series_dfs = list(map(clean_data, timeland_df))

# # # Assume dfs is your list of dataframes
# time_series_dfs = reduce(lambda left,right: pd.merge(left,right,on=['location_key', 'date'], how='left'), time_series_dfs).copy()
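
The `reduce` plus `pd.merge` pattern in the cell above folds a list of dataframes into one by left-joining each onto the running result. A minimal sketch with three tiny, made-up frames (the values are hypothetical; the column names follow the ones used in this notebook):

```python
from functools import reduce

import pandas as pd

keys = ["location_key", "date"]

epi = pd.DataFrame({
    "location_key": ["US", "US"],
    "date": pd.to_datetime(["2020-01-03", "2020-01-04"]),
    "new_confirmed": [0, 1],
})
mobility = pd.DataFrame({
    "location_key": ["US"],
    "date": pd.to_datetime(["2020-01-04"]),
    "mobility_parks": [7.0],
})
weather = pd.DataFrame({
    "location_key": ["US", "US"],
    "date": pd.to_datetime(["2020-01-03", "2020-01-04"]),
    "average_temperature_celsius": [1.06, 0.01],
})

# Left-join every frame onto the first one; rows missing on the right become NaN
frames = [epi, mobility, weather]
merged = reduce(
    lambda left, right: pd.merge(left, right, on=keys, how="left"), frames
)
```

Because every join is a left join, the first frame in the list decides which rows survive, so the frame with the most complete date coverage should come first.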


In [266]:
# time_series_dfs.to_parquet(f'{folder_holder}time_series_dfs.parquet.gzip', compression='gzip')

# time_series_dfs = pd.read_parquet(f'{folder_holder}time_series_dfs.parquet.gzip')

In [267]:
# time_series_dfs_line_plot = time_series_dfs.dropna(subset='new_confirmed').copy()
# # sns.lineplot(data = time_series_dfs_line_plot, x = 'date', y = 'new_confirmed')

# time_series_dfs_line_plot.plot(x = 'date', y = 'new_confirmed', figsize = (15, 10))

# # plot new confirmed cases over time
# fig = px.line(time_series_dfs_line_plot, x='date', y='new_confirmed')

# # axis labels and title
# fig.update_layout(
#     yaxis_title="New confirmed cases",
#     legend_title="",
#     title="Daily new confirmed COVID-19 cases"
# )

# # activate slider
# fig.update_xaxes(rangeslider_visible=True)

# fig.show()

In [268]:
# folder_holder = "C://Users/Samsickle/Documents/BrainStation_Capstone/Data/"
# # C:\Users\Samsickle\Documents\BrainStation_Capstone\Data

# # # # time series data
# # hospitalizations_df = pd.read_csv(f'{folder_holder}hospitalizations.csv') # 2
# # mobility_df = pd.read_csv(f'{folder_holder}mobility.csv') # 4
# # gov_response_df = pd.read_csv(f'{folder_holder}oxford-government-response.csv') # 5
# # weather_df = pd.read_csv(f'{folder_holder}weather.csv') # 6

# epid_df = pd.read_csv(f'{folder_holder}epidemiology.csv') # 1
# # vac_df = pd.read_csv(f'{folder_holder}vaccinations.csv') # 3


In [269]:
# epid_df.sample(5)

In [270]:
# Forward-fills missing values in the data set, or fills with 0 where there is no previous value
# Retrieves only the dates between '2020-01-03' and '2022-09-15'

# eng.clean(kinds=['resample', 'fillna'], date_range=('2020-01-03', '2022-09-15'))
# eng.all().query('ISO3 == @country_ISO3').isna().sum().sum()

# creates an engine with only USA data and with the desired date range
# complement=True does several things, including forcing cumulative values to be always increasing and estimating recovered cases from the estimated recovery period

# eng.subset(geo=country_ISO3, start_date='2020-01-03', end_date='2022-09-15', complement=True)
# eng.all().ISO3.unique(), eng.all().Date.min(), eng.all().Date.max()
# Uses the SIR model to estimate the number of infected and susceptible people

# main_variables = ['Infected', 'Susceptible']
# eng.transform()
# eng.all().query('ISO3 == @country_ISO3').info()
# eng.all().query('ISO3 == @country_ISO3')[main_variables].describe().T
# estimates the length of recovery and the length of the incubation period


# eng.clean()
# eng.transform()


# actual_df, status, _ = eng.subset(geo=country_ISO3, start_date='2020-01-03', end_date='2022-09-15', complement=True)
# print(status)
# actual_df.tail()



# with_df, status, status_dict = eng.subset(geo=country_ISO3, start_date='2020-01-01', end_date='2022-09-15', complement=True)
# print(f"{status}\n")
# print(status_dict)
# with_df.info()
# with_df.head()


# """
# Cell generated by Data Wrangler.
# """
# def clean_data(with_df):
#     # Replace NaN (x != x is True only for NaN) and inf with 0 in column: 'Positive_rate'
#     with_df.loc[with_df['Positive_rate'] != with_df['Positive_rate'], 'Positive_rate'] = 0
#     with_df.loc[with_df['Positive_rate'] == np.inf, 'Positive_rate'] = 0
#     return with_df

# with_df_clean = clean_data(with_df.copy())
# with_df_clean.head()
# cs.line_plot(with_df[["Confirmed", "Fatal", "Recovered"]], title="USA: records WITH complement")
# with_df.info()

# snr_act = cs.ODEScenario.auto_build(geo=country_ISO3, model=cs.SIRFModel, complement=True)

# snr_act.simulate(name="Baseline");
# dyn_act = snr_act.to_dynamics(name="Baseline")
# # Show summary
# display(dyn_act.summary())
# # Simulation
# dyn_act_df = dyn_act.simulate(model_specific=False)
# cs.line_plot(
#     dyn_act_df.drop("Susceptible", axis=1), "USA: Simulated data (Baseline scenario)")
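
The cleaning steps described in the comments above (daily resampling, forward-fill with a zero fallback, and zeroing out NaN or infinite values in `Positive_rate`) can be sketched with plain pandas. This is not the covsirphy implementation; the toy records and the way `Positive_rate` is computed here are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical records with a missing day (2020-01-05) and a missing first value
dates = pd.to_datetime(["2020-01-03", "2020-01-04", "2020-01-06", "2020-01-07"])
records = pd.DataFrame(
    {"Confirmed": [np.nan, 10.0, 25.0, 25.0], "Tests": [0.0, 0.0, 100.0, 100.0]},
    index=dates,
)

# Resample to one row per day, carry the last known value forward,
# and fall back to 0 where no earlier value exists
daily = records.resample("D").last().ffill().fillna(0)

# A ratio column produces NaN for 0/0 and inf for x/0
rate = daily["Confirmed"] / daily["Tests"]

# Equivalent to the two .loc assignments above: NaN and +/-inf become 0
daily["Positive_rate"] = rate.replace([np.inf, -np.inf], np.nan).fillna(0)
```

The `x != x` comparison in the Data Wrangler cell works because NaN is the only float value that is not equal to itself; `replace` followed by `fillna` expresses the same intent more directly.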

Data Wireframe:

1. Date and Location:
    - 'date' - the day of the observations
    - 'location_key' - the country of the observations. I chose ten countries of interest in different regions:

United States (US)  
Canada (CA)  
Germany (DE)  
United Kingdom (GB)  
France (FR)  
Japan (JP)  
Australia (AU)  
Brazil (BR)  
South Africa (ZA)  
India (IN)  
These countries are often used as indicators for their respective regions due to their significant economic influence, political stability, and comprehensive data collection practices.

2. COVID-19 Statistics:
    - 'new_confirmed' (New Positive Cases) - the number of new confirmed cases of COVID-19. Some source values are negative to account for corrections to previous days' data; because the numbers are aggregated across a whole country, these corrections are usually small and the daily totals generally remain positive.
    - 'new_deceased' (New Deaths) - the number of new deaths due to COVID-19; it shares the same negative-number issue.
    - 'new_hospitalized_patients' (New Hospitalizations)

3. Mobility Data:
    - 'mobility_retail_and_recreation'
    - 'mobility_grocery_and_pharmacy'
    - 'mobility_parks'
    - 'mobility_transit_stations'
    - 'mobility_workplaces'
    - 'mobility_residential'



4. Vaccination Data:
    - 'new_persons_vaccinated'
    - 'cumulative_persons_vaccinated'
    - 'new_persons_fully_vaccinated'
    - 'cumulative_persons_fully_vaccinated'
    - 'new_vaccine_doses_administered'
    - 'cumulative_vaccine_doses_administered'

5. Policy Measures:
    - 'school_closing'
    - 'workplace_closing'
    - 'cancel_public_events'
    - 'restrictions_on_gatherings'
    - 'public_transport_closing'
    - 'stay_at_home_requirements'
    - 'restrictions_on_internal_movement'
    - 'international_travel_controls'
    - 'income_support'
    - 'debt_relief'
    - 'fiscal_measures'
    - 'international_support'
    - 'public_information_campaigns'
    - 'testing_policy'
    - 'contact_tracing'
    - 'emergency_investment_in_healthcare'
    - 'investment_in_vaccines'
    - 'facial_coverings'
    - 'vaccination_policy'
    - 'stringency_index'

6. Weather Data:
    - 'average_temperature_celsius'
    - 'minimum_temperature_celsius'
    - 'maximum_temperature_celsius'
    - 'rainfall_mm'
    - 'snowfall_mm'
    - 'dew_point'
    - 'relative_humidity'


In [271]:
# # location dfs
# geography_df = pd.read_csv(f'{folder_holder}geography.csv') #1
# health_df = pd.read_csv(f'{folder_holder}health.csv') #2
# demographics_df = pd.read_csv(f'{folder_holder}demographics.csv') #3
# economics_df = pd.read_csv(f'{folder_holder}economy.csv') #4


# locationland_df = [geography_df, health_df, demographics_df, economics_df]

In [272]:
# location_df = list(map(clean_data, locationland_df))

# # Assume dfs is your list of dataframes
# location_df = reduce(lambda left,right: pd.merge(left,right,on='location_key', how='left'), location_df).copy()

# location_df.head()


In [273]:

# location_df.to_pickle('../Data/location_df.pkl')
# location_df = pd.read_pickle('../Data/location_df.pkl')

# location_df.sample(3)

This dataframe provides a comprehensive snapshot of COVID-19 data, mobility metrics, government restrictions, and weather conditions for specific locations on specific dates. Here's a brief overview of the columns:

1. `Entry ID`: A unique identifier for each row in the dataframe.
2. `Date`: The date for the day on which the data was recorded.
3. `Location Key`: A code representing the location (10 different countries in total) for which the data is reported.

4. `New Confirmed`: The number of new confirmed COVID-19 cases on the given date.
5. `New Deceased`: The number of new COVID-19 related deaths on the given date.
6. `New Recovered`: The number of new recoveries from COVID-19 on the given date.
7. `New Tested`: The number of new COVID-19 tests conducted on the given date.

8. `New Hospitalizations`: The number of new hospitalizations due to COVID-19 on the given date.
9. `Current Hospitalizations`: The total number of current hospitalizations due to COVID-19 on the given date.

10. `New Fully Vaccinated (29+ other Vaccination Columns)`: The number of new fully vaccinated individuals on the given date. There are 29 other columns related to vaccination data here too.

11. `Retail and Recreation Mobility (5+ other Mobility Metrics)`: A measure of mobility in retail and recreation spaces, along with 5 other columns related to different aspects of mobility.

12. `School Closing (19+ other Government Restrictions)`: A measure indicating whether schools were closed on the given date, along with 19 other columns related to different government restrictions.

13. `Average Temp (6+ Other Weather Columns)`: The average temperature on the given date, along with 6 other columns related to different weather conditions.

In total there are 9880 rows and 82 columns, for about 6.3 MB of data. The main way to increase or decrease the size of the dataset would be to include more or fewer countries, regions, or counties in the analysis. For now, this is my starter dataframe.
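
Size figures like these can be read directly off a dataframe. A small sketch, assuming the two counts above are rows and columns and using a placeholder all-float frame of that shape (the real dataframe mixes dtypes, so its footprint will differ):

```python
import numpy as np
import pandas as pd

# Placeholder frame with the assumed shape; values are zeros, not real data
n_rows, n_cols = 9880, 82
df = pd.DataFrame(np.zeros((n_rows, n_cols)))

rows, cols = df.shape

# deep=True also counts the contents of object (string) columns
size_mb = df.memory_usage(deep=True).sum() / 1e6

print(f"{rows} rows x {cols} columns, about {size_mb:.1f} MB in memory")
```

Adding countries grows the row count roughly linearly, since each country contributes one row per day.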
