# Links
 [Vaccination Data Dictionary](https://data.cdc.gov/Vaccinations/COVID-19-Vaccinations-in-the-United-States-County/8xkx-amqh)

 [Deaths Data Dictionary](https://github.com/CSSEGISandData/COVID-19/tree/f57525e860010f6c5c0c103fd97e2e7282b480c8)

 [Information on time lag](https://www.hsph.harvard.edu/news/hsph-in-the-news/data-animation-shows-time-lag-between-covid-19-cases-and-deaths/#:~:text=The%20animation%20shows%20that%20deaths,remained%20low%20in%20many%20states.). For each age category, test which lag amount works best, then repeat for the other variables.
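
One possible way to compare lag amounts is to shift a predictor by each candidate lag and check how strongly it correlates with daily deaths. The sketch below runs on made-up data for a single state; the helper `best_lag` and the column names `cases` and `daily_deaths` are hypothetical, and correlation is only one of several reasonable selection criteria.

```python
import numpy as np
import pandas as pd

def best_lag(frame, feature, target, candidate_lags):
    """Return (lag, correlation) for the lag with the largest absolute correlation."""
    scores = {}
    for lag in candidate_lags:
        # shift(lag) aligns the feature value from `lag` days earlier with today's target
        scores[lag] = frame[feature].shift(lag).corr(frame[target])
    lag = max(scores, key=lambda k: abs(scores[k]))
    return lag, scores[lag]

# Toy data (made up): deaths follow the signal with a 14-day delay plus noise
rng = np.random.default_rng(0)
signal = rng.poisson(100, size=200).astype(float)
toy = pd.DataFrame({
    'cases': signal,
    'daily_deaths': np.roll(signal, 14) * 0.02 + rng.normal(0, 0.05, size=200),
})
toy = toy.iloc[14:].reset_index(drop=True)  # drop rows affected by the wrap-around of np.roll

lag, corr = best_lag(toy, 'cases', 'daily_deaths', range(0, 29, 7))
print(lag, round(corr, 3))
```

In practice the same scan would be run per age group and per variable, ideally on the training period only.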

# Model Input Description

The model will be a national prediction model, where each prediction of daily deaths is the sum of predictions from individual state-level models. This requires aggregating the county data to the state level. Over counties $j$ and states $i$ for age group $a$:

$$y^\text{national}_t = \sum_j \text{deaths}_{j,t}$$

$$\hat{y}^\text{national}_t = \sum_i f_{i}(X_{i,t})$$

where $f$ is a state-level model.
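
As a minimal sketch of the second formula, the national prediction is the sum of per-state predictions. The lambdas below are hypothetical stand-ins for fitted state models, and `deaths_lag_14` is an assumed feature name used only for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for fitted state-level models f_i:
# each maps a feature row X_{i,t} to predicted daily deaths
state_models = {
    'AK': lambda x: 0.5 * x['deaths_lag_14'],
    'AL': lambda x: 0.8 * x['deaths_lag_14'],
}

# Toy state-level features X_{i,t} for a single date t (values are made up)
X_t = pd.DataFrame({'deaths_lag_14': [2.0, 10.0]}, index=['AK', 'AL'])

def predict_national(models, features):
    """Sum the state-level predictions f_i(X_{i,t}) over all states i."""
    return sum(models[state](features.loc[state]) for state in features.index)

y_hat_national = predict_national(state_models, X_t)
print(y_hat_national)
```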

$X_{i,t}$ will comprise:

- Deaths
  - `deaths` (target)
  - `deaths` (lagged)
- Dose Administration Percentages
  - `dose_admin_pct_<Age group>`: Dose Administration Percentage (DAP)
  
  $$\text{doses administered}_{i,t,a} = \sum_{j \in i} \frac{\text{doses administered}_{j,t,a}}{\text{completeness}_{j,t}}$$

  $$\text{doses available}_{i,t,a} = \sum_{j \in i} \text{doses available}_{t,a} \times \text{population}_{j,t,a}$$

  $$\text{DAP}_{i,t,a} = \frac{\text{doses administered}_{i,t,a}}{\text{doses available}_{i,t,a}}$$



- Social Vulnerability Index (SVI) Population Exposure
  - `pop_in_<SVI category>`: Total number of people living in counties with the given SVI category.
- Metro Area Population
  - `pop_in_<Metro category>`: Total number of people in metro versus non-metro areas.
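
The DAP formulas above can be traced on a small worked example. All values below are made up, the column names are hypothetical, and `completeness` is expressed as a fraction, which is an assumption of this sketch.

```python
import pandas as pd

# Toy county rows for one state, date, and age group (all values are made up)
counties = pd.DataFrame({
    'state':              ['XX', 'XX'],
    'doses_administered': [900.0, 400.0],
    'completeness':       [0.90, 0.80],   # assumed to be a fraction in this sketch
    'num_doses_avail':    [2, 2],         # e.g. dose 1 and dose 2 are in use
    'population':         [1000, 500],
})

# Numerator: administered doses scaled up by reporting completeness
counties['numerator'] = counties['doses_administered'] / counties['completeness']
# Denominator: number of dose types available times the age-group population
counties['denominator'] = counties['num_doses_avail'] * counties['population']

# Sum both parts over the counties in the state, then take the ratio
state = counties.groupby('state')[['numerator', 'denominator']].sum()
state['dap'] = state['numerator'] / state['denominator']
print(state)
```

Summing the numerator and denominator before dividing weights each county by its population, rather than averaging county-level ratios.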


# Imports

In [None]:
!pip install xgboost mlflow -q

In [None]:
import  numpy as np
import  pandas as pd
import  matplotlib.pyplot as plt

import  os
import  pickle
pd.set_option('display.max_columns', 200)

from    google.colab import drive
drive.mount('/content/gdrive')

PROJECT_PATH  = '/content/gdrive/MyDrive/OperAI/final-project'
VIZ_PATH      = os.path.join(PROJECT_PATH, 'viz')
DATA_PATH     = os.path.join(PROJECT_PATH, 'data')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
%%time
# df = pd.read_csv(os.path.join(DATA_PATH, 'df.csv'))
df = pd.read_pickle(os.path.join(DATA_PATH, 'df.pkl'))

CPU times: user 649 ms, sys: 4.54 s, total: 5.19 s
Wall time: 22.1 s


# Cleaning

## Pre-Aggregation Cleaning

### Date and FIPS

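Standard cleaning: parse `date` as a datetime, zero-pad `fips` to a five-character string, and sort by county and date.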

Phase 1 and 2 train and test logic:
- `Phase 1 train`: Prior to April 1, 2022
- `Phase 1 test`:  April 1, 2022 through June 30, 2022
- `Phase 2 train`: Prior to July 1, 2022
- `Phase 2 test`:  July 1, 2022 through September 30, 2022

In [None]:
df['date'] = pd.to_datetime(df['date'])
df['fips'] = df['fips'].astype(int).apply(lambda code: f'{code:05d}')
df = df.sort_values(by=['fips', 'date'], ascending=True)

### County Completeness

The `completeness_pct` variable tells us when vaccination data is available for the county. We will use it to adjust the total dose administration percentages below (see the DAP definition above).

In [None]:
len(df.loc[df['completeness_pct'] > 0, 'completeness_pct'])

1688259

In [None]:
len(df.loc[df['completeness_pct'].notna(), 'completeness_pct'])

1892106

In [None]:
len(df.loc[df['completeness_pct'].notna() & (df['completeness_pct'] > 0), 'completeness_pct'])

1688259

In [None]:
completeness_usable_idx  = df['completeness_pct'] > 0

### Age Level Population

For the age group populations, we replace each county's values with that county's median across dates, which also fills in the missing values. These census values should be relatively stable over time.

In [None]:
census_age_pop_vars = ['census2019_5pluspop', 'census2019_5to17pop',
                       'census2019_12pluspop', 'census2019_18pluspop',
                       'census2019_65pluspop']

df[census_age_pop_vars].isna().sum()

census2019_5pluspop     1159222
census2019_5to17pop     1500520
census2019_12pluspop        382
census2019_18pluspop          0
census2019_65pluspop    1246140
dtype: int64

In [None]:
for age_var in census_age_pop_vars:
    df[age_var] = df.groupby('fips')[age_var].transform('median')

In [None]:
df[census_age_pop_vars].isna().sum()

census2019_5pluspop     0
census2019_5to17pop     0
census2019_12pluspop    0
census2019_18pluspop    0
census2019_65pluspop    0
dtype: int64

### Deaths

Currently, deaths are cumulative. We need to calculate daily deaths as well, and then we can calculate percentages relative to county population values (for both cumulative and daily deaths).

In [None]:
df['cum_deaths'] = df['deaths']
df['daily_deaths'] = df.groupby('fips')['deaths'].diff().fillna(0)

In [None]:
df['daily_death_pct'] = df['daily_deaths'] / df['population']
df['cum_death_pct'] = df['deaths'] / df['population']

### Dose Availability and Administration Values

We need a single unified metric for each age group that remains available no matter how many dose types exist. In the context of COVID, where new variants necessitated the addition of boosters, tracking doses this way makes sense: when a new treatment becomes available, it suggests that new, potentially dangerous variants are emerging, which may influence death rates. This is not a perfect approach, but it should allow for flexible modeling.



In [None]:
age_groups = {
    '5plus': [
        ['administered_dose1_recip_5plus', 'series_complete_5plus', 'booster_doses_5plus', 'bivalent_booster_5plus'],
        ['dose_1_avail_5plus', 'dose_2_avail_5plus', 'booster_1_avail_5plus', 'booster_bivalent_avail_5plus']
    ],
    '12plus': [
        ['administered_dose1_recip_12plus', 'series_complete_12plus', 'booster_doses_12plus', 'bivalent_booster_12plus'],
        ['dose_1_avail_12plus', 'dose_2_avail_12plus', 'booster_1_avail_12plus', 'booster_bivalent_avail_12plus']
    ],
    '18plus': [
        ['administered_dose1_recip_18plus', 'series_complete_18plus', 'booster_doses_18plus', 'bivalent_booster_18plus'],
        ['dose_1_avail_18plus', 'dose_2_avail_18plus', 'booster_1_avail_18plus', 'booster_bivalent_avail_18plus']
    ],
    '65plus': [
        ['administered_dose1_recip_65plus', 'series_complete_65plus', 'booster_doses_65plus', 'second_booster_65plus', 'bivalent_booster_65plus'],
        ['dose_1_avail_65plus', 'dose_2_avail_65plus', 'booster_1_avail_65plus', 'booster_2_avail_65plus', 'booster_bivalent_avail_65plus']
    ]
}

In [None]:
for age_group, (existing_age_vars, new_age_vars) in age_groups.items():
    print(age_group)
    print(f"   Existing vars: {existing_age_vars}")
    print(f"   New vars: {new_age_vars}")

    for i, new_var in enumerate(new_age_vars):
        # Create boolean of whether any boosters were in use
        df[new_var] = (df[existing_age_vars[i]] > 0)

    # Sum over bools of doses available
    df[f'num_doses_avail_{age_group}'] = df[new_age_vars].sum(axis=1).fillna(0)

    # Sum over all dose types
    df[f'dose_numerator_{age_group}'] = df[existing_age_vars].sum(axis=1).fillna(0)

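    # Upscale based on completeness percentages.
    # Note: completeness_pct is on a 0-100 scale (e.g. 97.6), so this divides by
    # the percentage value itself rather than by a fraction.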
    df.loc[completeness_usable_idx, f'dose_numerator_{age_group}'] = (
        df.loc[completeness_usable_idx, f'dose_numerator_{age_group}'] / df.loc[completeness_usable_idx, 'completeness_pct']
    )

    # Create denominator from available doses * population
    df[f'dose_denominator_{age_group}'] = df[f'num_doses_avail_{age_group}'] * df[f'census2019_{age_group}pop']

5plus
   Existing vars: ['administered_dose1_recip_5plus', 'series_complete_5plus', 'booster_doses_5plus', 'bivalent_booster_5plus']
   New vars: ['dose_1_avail_5plus', 'dose_2_avail_5plus', 'booster_1_avail_5plus', 'booster_bivalent_avail_5plus']
12plus
   Existing vars: ['administered_dose1_recip_12plus', 'series_complete_12plus', 'booster_doses_12plus', 'bivalent_booster_12plus']
   New vars: ['dose_1_avail_12plus', 'dose_2_avail_12plus', 'booster_1_avail_12plus', 'booster_bivalent_avail_12plus']
18plus
   Existing vars: ['administered_dose1_recip_18plus', 'series_complete_18plus', 'booster_doses_18plus', 'bivalent_booster_18plus']
   New vars: ['dose_1_avail_18plus', 'dose_2_avail_18plus', 'booster_1_avail_18plus', 'booster_bivalent_avail_18plus']
65plus
   Existing vars: ['administered_dose1_recip_65plus', 'series_complete_65plus', 'booster_doses_65plus', 'second_booster_65plus', 'bivalent_booster_65plus']
   New vars: ['dose_1_avail_65plus', 'dose_2_avail_65plus', 'booster_1_avai

In [None]:
df.head(5)

Unnamed: 0,date,fips,mmwr_week,recip_county,recip_state,completeness_pct,administered_dose1_recip,administered_dose1_pop_pct,administered_dose1_recip_5plus,administered_dose1_recip_5pluspop_pct,administered_dose1_recip_12plus,administered_dose1_recip_12pluspop_pct,administered_dose1_recip_18plus,administered_dose1_recip_18pluspop_pct,administered_dose1_recip_65plus,administered_dose1_recip_65pluspop_pct,series_complete_yes,series_complete_pop_pct,series_complete_5plus,series_complete_5pluspop_pct,series_complete_5to17,series_complete_5to17pop_pct,series_complete_12plus,series_complete_12pluspop_pct,series_complete_18plus,series_complete_18pluspop_pct,series_complete_65plus,series_complete_65pluspop_pct,booster_doses,booster_doses_vax_pct,booster_doses_5plus,booster_doses_5plus_vax_pct,booster_doses_12plus,booster_doses_12plus_vax_pct,booster_doses_18plus,booster_doses_18plus_vax_pct,booster_doses_50plus,booster_doses_50plus_vax_pct,booster_doses_65plus,booster_doses_65plus_vax_pct,second_booster_50plus,second_booster_50plus_vax_pct,second_booster_65plus,second_booster_65plus_vax_pct,svi_ctgy,series_complete_pop_pct_svi,series_complete_5pluspop_pct_svi,series_complete_5to17pop_pct_svi,series_complete_12pluspop_pct_svi,series_complete_18pluspop_pct_svi,series_complete_65pluspop_pct_svi,metro_status,series_complete_pop_pct_ur_equity,series_complete_5pluspop_pct_ur_equity,series_complete_5to17pop_pct_ur_equity,series_complete_12pluspop_pct_ur_equity,series_complete_18pluspop_pct_ur_equity,series_complete_65pluspop_pct_ur_equity,booster_doses_vax_pct_svi,booster_doses_12plusvax_pct_svi,booster_doses_18plusvax_pct_svi,booster_doses_65plusvax_pct_svi,booster_doses_vax_pct_ur_equity,booster_doses_12plusvax_pct_ur_equity,booster_doses_18plusvax_pct_ur_equity,booster_doses_65plusvax_pct_ur_equity,census2019,census2019_5pluspop,census2019_5to17pop,census2019_12pluspop,census2019_18pluspop,census2019_65pluspop,bivalent_booster_5plus,bivalent_booster_5plus_pop_pct,bivalent_booster_12plus,bivalent_booster_12plus_pop_pct,bivalent_booster_18plus,bivalent_booster_18plus_pop_pct,bivalent_booster_65plus,bivalent_booster_65plus_pop_pct,combined_key,population,deaths,cum_deaths,daily_deaths,daily_death_pct,cum_death_pct,dose_1_avail_5plus,dose_2_avail_5plus,booster_1_avail_5plus,booster_bivalent_avail_5plus,num_doses_avail_5plus,dose_numerator_5plus,dose_denominator_5plus,dose_1_avail_12plus,dose_2_avail_12plus,booster_1_avail_12plus,booster_bivalent_avail_12plus,num_doses_avail_12plus,dose_numerator_12plus,dose_denominator_12plus,dose_1_avail_18plus,dose_2_avail_18plus,booster_1_avail_18plus,booster_bivalent_avail_18plus,num_doses_avail_18plus,dose_numerator_18plus,dose_denominator_18plus,dose_1_avail_65plus,dose_2_avail_65plus,booster_1_avail_65plus,booster_2_avail_65plus,booster_bivalent_avail_65plus,num_doses_avail_65plus,dose_numerator_65plus,dose_denominator_65plus
1961617,2020-12-13,1001,51,Autauga County,AL,0.0,0.0,0.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,B,,,,,,,Metro,,,,,,,,,,,,,,,55869.0,52592.0,9688.0,47574.0,42904.0,8924.0,,,,,,,,,"Autauga, Alabama, US",55869.0,41.0,41.0,0.0,0.0,0.000734,False,False,False,False,0,0.0,0.0,False,False,False,False,0,0.0,0.0,False,False,False,False,0,0.0,0.0,False,False,False,False,False,0,0.0,0.0
1957919,2020-12-14,1001,51,Autauga County,AL,0.0,0.0,0.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,B,,,,,,,Metro,,,,,,,,,,,,,,,55869.0,52592.0,9688.0,47574.0,42904.0,8924.0,,,,,,,,,"Autauga, Alabama, US",55869.0,41.0,41.0,0.0,0.0,0.000734,False,False,False,False,0,0.0,0.0,False,False,False,False,0,0.0,0.0,False,False,False,False,0,0.0,0.0,False,False,False,False,False,0,0.0,0.0
1955846,2020-12-15,1001,51,Autauga County,AL,0.0,0.0,0.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,B,,,,,,,Metro,,,,,,,,,,,,,,,55869.0,52592.0,9688.0,47574.0,42904.0,8924.0,,,,,,,,,"Autauga, Alabama, US",55869.0,43.0,43.0,2.0,3.6e-05,0.00077,False,False,False,False,0,0.0,0.0,False,False,False,False,0,0.0,0.0,False,False,False,False,0,0.0,0.0,False,False,False,False,False,0,0.0,0.0
1950822,2020-12-16,1001,51,Autauga County,AL,0.0,0.0,0.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,B,,,,,,,Metro,,,,,,,,,,,,,,,55869.0,52592.0,9688.0,47574.0,42904.0,8924.0,,,,,,,,,"Autauga, Alabama, US",55869.0,43.0,43.0,0.0,0.0,0.00077,False,False,False,False,0,0.0,0.0,False,False,False,False,0,0.0,0.0,False,False,False,False,0,0.0,0.0,False,False,False,False,False,0,0.0,0.0
1947776,2020-12-17,1001,51,Autauga County,AL,0.0,0.0,0.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,B,,,,,,,Metro,,,,,,,,,,,,,,,55869.0,52592.0,9688.0,47574.0,42904.0,8924.0,,,,,,,,,"Autauga, Alabama, US",55869.0,43.0,43.0,0.0,0.0,0.00077,False,False,False,False,0,0.0,0.0,False,False,False,False,0,0.0,0.0,False,False,False,False,0,0.0,0.0,False,False,False,False,False,0,0.0,0.0


Dose administration percentages above 100% will need to be either dropped or corrected.

In [None]:
# print(df.shape)

# df = df.loc[df['dose_admin_pct_5plus']  <= 1.0, :]
# df = df.loc[df['dose_admin_pct_12plus'] <= 1.0, :]
# df = df.loc[df['dose_admin_pct_18plus'] <= 1.0, :]
# df = df.loc[df['dose_admin_pct_65plus'] <= 1.0, :]

# print(df.shape)
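
As one alternative to dropping rows, the ratio can be capped at 100%. The sketch below assumes the percentage is formed from a numerator and a denominator like the ones built earlier and uses a toy frame with made-up values; whether capping is preferable to dropping is a modeling choice, not something established here.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with the numerator/denominator naming used above
toy = pd.DataFrame({
    'dose_numerator_18plus':   [0.0, 500.0, 1300.0],
    'dose_denominator_18plus': [0.0, 1000.0, 1000.0],
})

# Avoid division by zero: rows with no doses available get a ratio of 0
denominator = toy['dose_denominator_18plus'].replace(0, np.nan)
ratio = (toy['dose_numerator_18plus'] / denominator).fillna(0.0)

# Cap ratios above 100% instead of dropping the rows
toy['dose_admin_pct_18plus'] = ratio.clip(upper=1.0)
print(toy)
```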

### Turn Categorical Variables into Numerical Exposure Variables

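Convert `svi_ctgy` and `metro_status` to categorical dtype, drop rows where either is missing, and then build one population-exposure column per category.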

In [None]:
cat_vars = [
    'svi_ctgy',
    'metro_status'
]

for var in cat_vars:
    print(df[var].value_counts(dropna=False))
    df[var] = df[var].astype('category')

df = df.dropna(subset=cat_vars)

df[cat_vars].info()

A      474722
D      474157
B      473568
C      473556
NaN       577
Name: svi_ctgy, dtype: int64
Non-metro    1168576
Metro         727415
NaN              589
Name: metro_status, dtype: int64
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1895414 entries, 1961617 to 32589
Data columns (total 2 columns):
 #   Column        Dtype   
---  ------        -----   
 0   svi_ctgy      category
 1   metro_status  category
dtypes: category(2)
memory usage: 18.1 MB


In [None]:
for category in df['svi_ctgy'].cat.categories:
    df[f'pop_in_svi_ctgy_{category}'] = 0
    df.loc[df['svi_ctgy'] == category, f'pop_in_svi_ctgy_{category}'] = df.loc[df['svi_ctgy'] == category, 'population']

for category in df['metro_status'].cat.categories:
    df[f'pop_in_{category}'] = 0
    df.loc[df['metro_status'] == category, f'pop_in_{category}'] = df.loc[df['metro_status'] == category, 'population']

df.sample(5)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[f'pop_in_svi_ctgy_{category}'] = 0

Unnamed: 0,date,fips,mmwr_week,recip_county,recip_state,completeness_pct,administered_dose1_recip,administered_dose1_pop_pct,administered_dose1_recip_5plus,administered_dose1_recip_5pluspop_pct,administered_dose1_recip_12plus,administered_dose1_recip_12pluspop_pct,administered_dose1_recip_18plus,administered_dose1_recip_18pluspop_pct,administered_dose1_recip_65plus,administered_dose1_recip_65pluspop_pct,series_complete_yes,series_complete_pop_pct,series_complete_5plus,series_complete_5pluspop_pct,series_complete_5to17,series_complete_5to17pop_pct,series_complete_12plus,series_complete_12pluspop_pct,series_complete_18plus,series_complete_18pluspop_pct,series_complete_65plus,series_complete_65pluspop_pct,booster_doses,booster_doses_vax_pct,booster_doses_5plus,booster_doses_5plus_vax_pct,booster_doses_12plus,booster_doses_12plus_vax_pct,booster_doses_18plus,booster_doses_18plus_vax_pct,booster_doses_50plus,booster_doses_50plus_vax_pct,booster_doses_65plus,booster_doses_65plus_vax_pct,second_booster_50plus,second_booster_50plus_vax_pct,second_booster_65plus,second_booster_65plus_vax_pct,svi_ctgy,series_complete_pop_pct_svi,series_complete_5pluspop_pct_svi,series_complete_5to17pop_pct_svi,series_complete_12pluspop_pct_svi,series_complete_18pluspop_pct_svi,series_complete_65pluspop_pct_svi,metro_status,series_complete_pop_pct_ur_equity,series_complete_5pluspop_pct_ur_equity,series_complete_5to17pop_pct_ur_equity,series_complete_12pluspop_pct_ur_equity,series_complete_18pluspop_pct_ur_equity,series_complete_65pluspop_pct_ur_equity,booster_doses_vax_pct_svi,booster_doses_12plusvax_pct_svi,booster_doses_18plusvax_pct_svi,booster_doses_65plusvax_pct_svi,booster_doses_vax_pct_ur_equity,booster_doses_12plusvax_pct_ur_equity,booster_doses_18plusvax_pct_ur_equity,booster_doses_65plusvax_pct_ur_equity,census2019,census2019_5pluspop,census2019_5to17pop,census2019_12pluspop,census2019_18pluspop,census2019_65pluspop,bivalent_booster_5plus,bivalent_booster_5plus_pop_pct,bivalent_booster_12plus,bivalent_booster_12plus_pop_pct,bivalent_booster_18plus,bivalent_booster_18plus_pop_pct,bivalent_booster_65plus,bivalent_booster_65plus_pop_pct,combined_key,population,deaths,cum_deaths,daily_deaths,daily_death_pct,cum_death_pct,dose_1_avail_5plus,dose_2_avail_5plus,booster_1_avail_5plus,booster_bivalent_avail_5plus,num_doses_avail_5plus,dose_numerator_5plus,dose_denominator_5plus,dose_1_avail_12plus,dose_2_avail_12plus,booster_1_avail_12plus,booster_bivalent_avail_12plus,num_doses_avail_12plus,dose_numerator_12plus,dose_denominator_12plus,dose_1_avail_18plus,dose_2_avail_18plus,booster_1_avail_18plus,booster_bivalent_avail_18plus,num_doses_avail_18plus,dose_numerator_18plus,dose_denominator_18plus,dose_1_avail_65plus,dose_2_avail_65plus,booster_1_avail_65plus,booster_2_avail_65plus,booster_bivalent_avail_65plus,num_doses_avail_65plus,dose_numerator_65plus,dose_denominator_65plus,pop_in_svi_ctgy_A,pop_in_svi_ctgy_B,pop_in_svi_ctgy_C,pop_in_svi_ctgy_D,pop_in_Metro,pop_in_Non-metro
516094,2022-02-26,36111,8,Ulster County,NY,97.6,143519.0,80.8,143508.0,84.5,138753.0,87.8,129807.0,88.5,36104.0,95.0,127510.0,71.8,127503.0,75.0,,,123393.0,78.1,115319.0,78.6,31587.0,87.3,69291.0,54.3,,,69281.0,56.1,67091.0,58.2,43618.0,67.6,23464.0,74.3,,,,,B,8.0,8.0,,8.0,8.0,8.0,Metro,4.0,4.0,,4.0,4.0,4.0,,,,,,,,,177573.0,169913.0,23284.0,157994.0,146629.0,36183.0,,,,,,,,,"Ulster, New York, US",177573.0,359.0,359.0,0.0,0.0,0.002022,True,True,False,False,2,2776.752049,339826.0,True,True,True,False,3,3395.768443,473982.0,True,True,True,False,3,3198.944672,439887.0,True,True,True,False,False,3,933.965164,108549.0,0,177573,0,0,177573,0
1300126,2021-07-02,41047,26,Marion County,OR,97.7,178025.0,51.2,,,177509.0,60.8,167177.0,63.5,47662.0,84.7,161248.0,46.4,,,,,160892.0,55.2,152700.0,58.0,45358.0,80.6,,,,,,,,,,,,,,,,,D,15.0,,,16.0,16.0,16.0,Metro,3.0,,,4.0,4.0,4.0,,,,,,,,,347818.0,325089.0,61900.0,291728.0,263189.0,56279.0,,,,,,,,,"Marion, Oregon, US",347818.0,326.0,326.0,1.0,3e-06,0.000937,False,False,False,False,0,0.0,0.0,True,True,False,False,2,3463.674514,583456.0,True,True,False,False,2,3274.073695,526378.0,True,True,False,False,False,2,952.09826,112558.0,0,0,0,347818,347818,0
1390290,2021-06-05,21033,22,Caldwell County,KY,94.2,4813.0,37.8,,,4813.0,44.4,4770.0,48.2,2086.0,76.7,4368.0,34.3,,,,,4368.0,40.3,4356.0,44.0,1956.0,71.9,,,,,,,,,,,,,,,,,B,6.0,,,7.0,7.0,8.0,Non-metro,6.0,,,7.0,7.0,8.0,,,,,,,,,12747.0,12004.0,2114.0,10851.0,9890.0,2719.0,,,,,,,,,"Caldwell, Kentucky, US",12747.0,32.0,32.0,0.0,0.0,0.00251,False,False,False,False,0,0.0,0.0,True,True,False,False,2,97.462845,21702.0,True,True,False,False,2,96.878981,19780.0,True,True,False,False,False,2,42.908705,5438.0,0,12747,0,0,0,12747
1373650,2021-06-10,26079,23,Kalkaska County,MI,95.1,7016.0,38.9,,,7016.0,44.8,6894.0,48.1,2865.0,76.9,6636.0,36.8,,,,,6636.0,42.4,6513.0,45.4,2814.0,75.5,,,,,,,,,,,,,,,,,C,10.0,,,11.0,11.0,12.0,Non-metro,6.0,,,7.0,7.0,8.0,,,,,,,,,18038.0,17118.0,2775.0,15649.0,14343.0,3728.0,,,,,,,,,"Kalkaska, Michigan, US",18038.0,24.0,24.0,1.0,5.5e-05,0.001331,False,False,False,False,0,0.0,0.0,True,True,False,False,2,143.554154,31298.0,True,True,False,False,2,140.977918,28686.0,True,True,False,False,False,2,59.716088,7456.0,0,0,18038,0,0,18038
57816,2023-01-11,38039,2,Griggs County,ND,92.7,1188.0,53.2,1186.0,56.1,1160.0,60.4,1118.0,63.5,542.0,80.2,1068.0,47.9,1067.0,50.5,59.0,16.7,1047.0,54.6,1008.0,57.2,486.0,71.9,529.0,49.5,529.0,49.6,522.0,49.9,510.0,50.6,431.0,58.2,306.0,63.0,181.0,42.0,148.0,48.4,A,1.0,2.0,1.0,2.0,2.0,3.0,Non-metro,5.0,6.0,5.0,6.0,6.0,7.0,3.0,3.0,4.0,2.0,7.0,7.0,8.0,6.0,2231.0,2114.0,353.0,1919.0,1761.0,676.0,273.0,12.9,268.0,14.0,259.0,14.7,191.0,28.3,"Griggs, North Dakota, US",2231.0,2.0,2.0,0.0,0.0,0.000896,True,True,True,True,4,32.955771,8456.0,True,True,True,True,4,32.330097,7676.0,True,True,True,True,4,31.229773,7044.0,True,True,True,True,True,5,18.047465,3380.0,2231,0,0,0,0,2231


## State-Level Aggregation

In [None]:
df.rename(columns = {
    'pop_in_Metro':     'pop_in_metro',
    'pop_in_Non-metro': 'pop_in_nonmetro'
}, inplace=True)

In [None]:
df.head()

Unnamed: 0,date,fips,mmwr_week,recip_county,recip_state,completeness_pct,administered_dose1_recip,administered_dose1_pop_pct,administered_dose1_recip_5plus,administered_dose1_recip_5pluspop_pct,administered_dose1_recip_12plus,administered_dose1_recip_12pluspop_pct,administered_dose1_recip_18plus,administered_dose1_recip_18pluspop_pct,administered_dose1_recip_65plus,administered_dose1_recip_65pluspop_pct,series_complete_yes,series_complete_pop_pct,series_complete_5plus,series_complete_5pluspop_pct,series_complete_5to17,series_complete_5to17pop_pct,series_complete_12plus,series_complete_12pluspop_pct,series_complete_18plus,series_complete_18pluspop_pct,series_complete_65plus,series_complete_65pluspop_pct,booster_doses,booster_doses_vax_pct,booster_doses_5plus,booster_doses_5plus_vax_pct,booster_doses_12plus,booster_doses_12plus_vax_pct,booster_doses_18plus,booster_doses_18plus_vax_pct,booster_doses_50plus,booster_doses_50plus_vax_pct,booster_doses_65plus,booster_doses_65plus_vax_pct,second_booster_50plus,second_booster_50plus_vax_pct,second_booster_65plus,second_booster_65plus_vax_pct,svi_ctgy,series_complete_pop_pct_svi,series_complete_5pluspop_pct_svi,series_complete_5to17pop_pct_svi,series_complete_12pluspop_pct_svi,series_complete_18pluspop_pct_svi,series_complete_65pluspop_pct_svi,metro_status,series_complete_pop_pct_ur_equity,series_complete_5pluspop_pct_ur_equity,series_complete_5to17pop_pct_ur_equity,series_complete_12pluspop_pct_ur_equity,series_complete_18pluspop_pct_ur_equity,series_complete_65pluspop_pct_ur_equity,booster_doses_vax_pct_svi,booster_doses_12plusvax_pct_svi,booster_doses_18plusvax_pct_svi,booster_doses_65plusvax_pct_svi,booster_doses_vax_pct_ur_equity,booster_doses_12plusvax_pct_ur_equity,booster_doses_18plusvax_pct_ur_equity,booster_doses_65plusvax_pct_ur_equity,census2019,census2019_5pluspop,census2019_5to17pop,census2019_12pluspop,census2019_18pluspop,census2019_65pluspop,bivalent_booster_5plus,bivalent_booster_5plus_pop_pct,bivalent_booster_12plus,bivalent_booster_12plus_pop_pct,bivalent_booster_18plus,bivalent_booster_18plus_pop_pct,bivalent_booster_65plus,bivalent_booster_65plus_pop_pct,combined_key,population,deaths,cum_deaths,daily_deaths,daily_death_pct,cum_death_pct,dose_1_avail_5plus,dose_2_avail_5plus,booster_1_avail_5plus,booster_bivalent_avail_5plus,num_doses_avail_5plus,dose_numerator_5plus,dose_denominator_5plus,dose_1_avail_12plus,dose_2_avail_12plus,booster_1_avail_12plus,booster_bivalent_avail_12plus,num_doses_avail_12plus,dose_numerator_12plus,dose_denominator_12plus,dose_1_avail_18plus,dose_2_avail_18plus,booster_1_avail_18plus,booster_bivalent_avail_18plus,num_doses_avail_18plus,dose_numerator_18plus,dose_denominator_18plus,dose_1_avail_65plus,dose_2_avail_65plus,booster_1_avail_65plus,booster_2_avail_65plus,booster_bivalent_avail_65plus,num_doses_avail_65plus,dose_numerator_65plus,dose_denominator_65plus,pop_in_svi_ctgy_A,pop_in_svi_ctgy_B,pop_in_svi_ctgy_C,pop_in_svi_ctgy_D,pop_in_metro,pop_in_nonmetro
1961617,2020-12-13,1001,51,Autauga County,AL,0.0,0.0,0.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,B,,,,,,,Metro,,,,,,,,,,,,,,,55869.0,52592.0,9688.0,47574.0,42904.0,8924.0,,,,,,,,,"Autauga, Alabama, US",55869.0,41.0,41.0,0.0,0.0,0.000734,False,False,False,False,0,0.0,0.0,False,False,False,False,0,0.0,0.0,False,False,False,False,0,0.0,0.0,False,False,False,False,False,0,0.0,0.0,0,55869,0,0,55869,0
1957919,2020-12-14,1001,51,Autauga County,AL,0.0,0.0,0.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,B,,,,,,,Metro,,,,,,,,,,,,,,,55869.0,52592.0,9688.0,47574.0,42904.0,8924.0,,,,,,,,,"Autauga, Alabama, US",55869.0,41.0,41.0,0.0,0.0,0.000734,False,False,False,False,0,0.0,0.0,False,False,False,False,0,0.0,0.0,False,False,False,False,0,0.0,0.0,False,False,False,False,False,0,0.0,0.0,0,55869,0,0,55869,0
1955846,2020-12-15,1001,51,Autauga County,AL,0.0,0.0,0.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,B,,,,,,,Metro,,,,,,,,,,,,,,,55869.0,52592.0,9688.0,47574.0,42904.0,8924.0,,,,,,,,,"Autauga, Alabama, US",55869.0,43.0,43.0,2.0,3.6e-05,0.00077,False,False,False,False,0,0.0,0.0,False,False,False,False,0,0.0,0.0,False,False,False,False,0,0.0,0.0,False,False,False,False,False,0,0.0,0.0,0,55869,0,0,55869,0
1950822,2020-12-16,1001,51,Autauga County,AL,0.0,0.0,0.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,B,,,,,,,Metro,,,,,,,,,,,,,,,55869.0,52592.0,9688.0,47574.0,42904.0,8924.0,,,,,,,,,"Autauga, Alabama, US",55869.0,43.0,43.0,0.0,0.0,0.00077,False,False,False,False,0,0.0,0.0,False,False,False,False,0,0.0,0.0,False,False,False,False,0,0.0,0.0,False,False,False,False,False,0,0.0,0.0,0,55869,0,0,55869,0
1947776,2020-12-17,1001,51,Autauga County,AL,0.0,0.0,0.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,B,,,,,,,Metro,,,,,,,,,,,,,,,55869.0,52592.0,9688.0,47574.0,42904.0,8924.0,,,,,,,,,"Autauga, Alabama, US",55869.0,43.0,43.0,0.0,0.0,0.00077,False,False,False,False,0,0.0,0.0,False,False,False,False,0,0.0,0.0,False,False,False,False,0,0.0,0.0,False,False,False,False,False,0,0.0,0.0,0,55869,0,0,55869,0


### Aggregation

In [None]:
grouping_vars = ['recip_state', 'date']

agg_dict = {
    # Weeks----------------------------------
    # 'mmwr_week'                   : 'first',
    # SVI------------------------------------
    'pop_in_svi_ctgy_A'           : 'sum',
    'pop_in_svi_ctgy_B'           : 'sum',
    'pop_in_svi_ctgy_C'           : 'sum',
    'pop_in_svi_ctgy_D'           : 'sum',
    # Metro----------------------------------
    'pop_in_metro'                : 'sum',
    'pop_in_nonmetro'             : 'sum',
    'population'                  : 'sum',
    # Dose Numerator-------------------------
    'dose_numerator_5plus'        : 'sum',
    'dose_numerator_12plus'       : 'sum',
    'dose_numerator_18plus'       : 'sum',
    'dose_numerator_65plus'       : 'sum',
    # Dose Denominator-----------------------
    'dose_denominator_5plus'      : 'sum',
    'dose_denominator_12plus'     : 'sum',
    'dose_denominator_18plus'     : 'sum',
    'dose_denominator_65plus'     : 'sum',
    # Deaths---------------------------------
    'cum_deaths'                  : 'sum',
    'daily_deaths'                : 'sum'
}

df_state = df.groupby(grouping_vars).agg(agg_dict)

df_state = df_state.reset_index()

df_state.sample(5, random_state=1)

Unnamed: 0,recip_state,date,pop_in_svi_ctgy_A,pop_in_svi_ctgy_B,pop_in_svi_ctgy_C,pop_in_svi_ctgy_D,pop_in_metro,pop_in_nonmetro,population,dose_numerator_5plus,dose_numerator_12plus,dose_numerator_18plus,dose_numerator_65plus,dose_denominator_5plus,dose_denominator_12plus,dose_denominator_18plus,dose_denominator_65plus,cum_deaths,daily_deaths
9617,KS,2021-06-24,1181612,454994,987936,288772,2001341,911973,2913314.0,0.0,26018.69239,25068.670954,8080.150054,0.0,4903394.0,4426128.0,950974.0,5139.0,0.0
567,AK,2022-10-12,5810,600959,53308,62266,493166,229177,722343.0,11989.73306,11497.402464,10672.197125,2587.689938,2015637.0,1803372.0,1633524.0,361040.0,1371.0,0.0
17577,NE,2022-04-23,598678,1077368,207517,50845,1273991,660417,1934408.0,19895.562914,25556.611479,23738.498896,6555.640177,2703114.0,4041897.0,3646139.0,761391.0,3433.0,0.0
23011,PR,2021-01-22,1627544,750728,799764,576903,3596674,158265,3754939.0,0.0,1409.051546,1408.412371,229.072165,0.0,4012512.0,3692258.0,969118.0,0.0,0.0
16949,ND,2022-03-15,579768,157056,0,25238,384553,377509,762062.0,9027.12743,10426.058315,9910.74514,2937.829374,1415922.0,1911738.0,1745673.0,359535.0,2228.0,2.0


In [None]:
df_state.shape

(30628, 19)

## Post-Aggregation Cleaning

### Make sure all dates are consecutive

In [None]:
start_date = df_state['date'].min()
end_date = df_state['date'].max()

print(start_date, end_date)

# Step 1: Full daily date range between the first and last observed dates
all_dates = pd.date_range(start=start_date, end=end_date, freq='D')

# Step 2: List of all states
all_states = df_state['recip_state'].unique()

# Step 3: Create a new DataFrame with all combinations of dates and states
df_ts = pd.MultiIndex.from_product([all_dates, all_states], names=['date', 'recip_state']).to_frame(index=False)

print(df_ts.shape)

complete_df = df_ts.merge(df_state, on=['date', 'recip_state'], how='left')
print(complete_df.shape)
print(complete_df.isna().sum())

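# Forward-fill within each state: rows are ordered by date and then state, so a
# plain ffill would copy values from a different state on the same or previous date
value_cols = [col for col in complete_df.columns if col not in ('date', 'recip_state')]
complete_df[value_cols] = complete_df.groupby('recip_state')[value_cols].ffill()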

2020-12-13 00:00:00 2023-03-08 00:00:00
(42432, 2)
(42432, 19)
date                           0
recip_state                    0
pop_in_svi_ctgy_A          11804
pop_in_svi_ctgy_B          11804
pop_in_svi_ctgy_C          11804
pop_in_svi_ctgy_D          11804
pop_in_metro               11804
pop_in_nonmetro            11804
population                 11804
dose_numerator_5plus       11804
dose_numerator_12plus      11804
dose_numerator_18plus      11804
dose_numerator_65plus      11804
dose_denominator_5plus     11804
dose_denominator_12plus    11804
dose_denominator_18plus    11804
dose_denominator_65plus    11804
cum_deaths                 11804
daily_deaths               11804
dtype: int64
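
Because the date-by-state grid is ordered by date and then state, it is worth checking which rows a forward fill draws from. The toy frame below (made-up values) contrasts a plain forward fill with one applied within each state.

```python
import numpy as np
import pandas as pd

# Toy date-major frame: rows ordered by date, then state, like a date x state grid
toy = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02']),
    'recip_state': ['AK', 'AL', 'AK', 'AL'],
    'cum_deaths': [10.0, 200.0, np.nan, np.nan],
})

# Plain forward fill copies the previous row, which can belong to another state
naive = toy['cum_deaths'].ffill()

# Group-wise forward fill only carries values forward within the same state
per_state = toy.groupby('recip_state')['cum_deaths'].ffill()

print(pd.DataFrame({'naive': naive, 'per_state': per_state}))
```

With the plain fill, AK on the second date inherits AL's value; the group-wise fill keeps each state's own last observation.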


### Date-Based Train/Test Definition

In [None]:
complete_df['phase_1_train'] = (complete_df['date'] <  '2022-04-01')
complete_df['phase_1_test']  = (complete_df['date'] >= '2022-04-01') & (complete_df['date'] < '2022-07-01')
complete_df['phase_2_train'] = (complete_df['date'] <  '2022-07-01')
complete_df['phase_2_test']  = (complete_df['date'] >= '2022-07-01') & (complete_df['date'] < '2022-10-01')

print('-'*80)
print(f"Phase 1 train len : {complete_df['phase_1_train'].sum():>10,.0f}")
print(f"Phase 1 test len  : {complete_df['phase_1_test'].sum():>10,.0f}")
print(f"Phase 2 train len : {complete_df['phase_2_train'].sum():>10,.0f}")
print(f"Phase 2 test len  : {complete_df['phase_2_test'].sum():>10,.0f}")
print('-'*80)

complete_df.head()

--------------------------------------------------------------------------------
Phase 1 train len :     24,648
Phase 1 test len  :      4,732
Phase 2 train len :     29,380
Phase 2 test len  :      4,784
--------------------------------------------------------------------------------


Unnamed: 0,date,recip_state,pop_in_svi_ctgy_A,pop_in_svi_ctgy_B,pop_in_svi_ctgy_C,pop_in_svi_ctgy_D,pop_in_metro,pop_in_nonmetro,population,dose_numerator_5plus,dose_numerator_12plus,dose_numerator_18plus,dose_numerator_65plus,dose_denominator_5plus,dose_denominator_12plus,dose_denominator_18plus,dose_denominator_65plus,cum_deaths,daily_deaths,phase_1_train,phase_1_test,phase_2_train,phase_2_test
0,2020-12-13,AK,5810.0,600959.0,53308.0,62266.0,493166.0,229177.0,722343.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,173.0,0.0,True,False,True,False
1,2020-12-13,AL,440936.0,783206.0,2157113.0,1521930.0,3767757.0,1135428.0,4903185.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4102.0,0.0,True,False,True,False
2,2020-12-13,AR,24919.0,714157.0,1099356.0,1179372.0,1892893.0,1124911.0,3017804.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2945.0,0.0,True,False,True,False
3,2020-12-13,AZ,0.0,0.0,4873487.0,2405230.0,6925947.0,352770.0,7278717.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7357.0,0.0,True,False,True,False
4,2020-12-13,CA,754281.0,11532085.0,6842942.0,20382915.0,38674939.0,837284.0,39512223.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,21101.0,0.0,True,False,True,False


In [None]:
print(complete_df.loc[complete_df['phase_2_train'], 'date'].max())
print(complete_df.loc[complete_df['phase_2_test'], 'date'].min())

2022-06-30 00:00:00
2022-07-01 00:00:00
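
The boundary check above can also be folded into the mask construction itself. This is a sketch only: `add_phase_masks` and the `phases` dictionary are hypothetical names, not part of the notebook, and the toy frame has one row per day rather than one per state and day.

```python
import pandas as pd

def add_phase_masks(df, phases):
    """Add <name>_train / <name>_test boolean columns from date cut-offs.

    `phases` maps a phase name to (test_start, test_end). Training rows are
    those strictly before test_start; the test window is [test_start, test_end).
    """
    out = df.copy()
    for name, (test_start, test_end) in phases.items():
        out[f'{name}_train'] = out['date'] < test_start
        out[f'{name}_test'] = (out['date'] >= test_start) & (out['date'] < test_end)
        # A row must never belong to both splits of the same phase
        assert not (out[f'{name}_train'] & out[f'{name}_test']).any()
    return out

# Toy frame: one row per day around the cut-off dates used in the notebook
toy = pd.DataFrame({'date': pd.date_range('2022-03-30', '2022-10-02', freq='D')})
phases = {
    'phase_1': ('2022-04-01', '2022-07-01'),
    'phase_2': ('2022-07-01', '2022-10-01'),
}
toy = add_phase_masks(toy, phases)

print(toy[['phase_1_train', 'phase_1_test', 'phase_2_train', 'phase_2_test']].sum())
```

The half-open windows mean the last training day and the first test day are always adjacent, which matches the dates printed above.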


### Get month and day-of-week indicators from date

In [None]:
complete_df['month'] = complete_df['date'].dt.month.astype('category')
complete_df['dayofweek'] = complete_df['date'].dt.dayofweek.astype('category')

### Calculate State-Level Death Percentages

In [None]:
# Deaths as a fraction of state population (a proportion, not multiplied by 100)
complete_df['cum_death_pct']   = complete_df['cum_deaths'] / complete_df['population']
complete_df['daily_death_pct'] = complete_df['daily_deaths'] / complete_df['population']

### Calculate Dose Administration Percentage at State Level

In [None]:
age_group_list = ['5plus', '12plus', '18plus', '65plus']

for age_group in age_group_list:
    complete_df[f'pct_doses_admin_{age_group}'] = (
        complete_df[f'dose_numerator_{age_group}'] / complete_df[f'dose_denominator_{age_group}']
    ).fillna(0)
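
Note that `.fillna(0)` covers the 0/0 case (NaN), but a positive numerator over a zero denominator yields `inf`, which `fillna` leaves untouched. A possible guard is sketched below; `safe_ratio` is a hypothetical helper, the toy values are made up, and mapping `inf` to 0 is an assumption about the desired behaviour rather than something the notebook specifies.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Element-wise numerator / denominator, with 0 wherever the denominator is 0."""
    ratio = numerator / denominator
    # 0/0 gives NaN and x/0 gives +/-inf in pandas; map both to 0
    return ratio.replace([np.inf, -np.inf], np.nan).fillna(0)

# Toy rows: 0/0, 5/0, and an ordinary ratio (hypothetical values)
toy = pd.DataFrame({
    'dose_numerator_5plus':   [0.0, 5.0, 30.0],
    'dose_denominator_5plus': [0.0, 0.0, 120.0],
})
toy['pct_doses_admin_5plus'] = safe_ratio(
    toy['dose_numerator_5plus'], toy['dose_denominator_5plus']
)

print(toy['pct_doses_admin_5plus'].tolist())
```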

### Calculate SVI and Metro Populations as % of State Population

In [None]:
pop_share_cols = [
    'pop_in_svi_ctgy_A',
    'pop_in_svi_ctgy_B',
    'pop_in_svi_ctgy_C',
    'pop_in_svi_ctgy_D',
    'pop_in_metro',
    'pop_in_nonmetro',
]

# Each share is the sub-population divided by the state population
for col in pop_share_cols:
    complete_df[f'pct_{col}'] = complete_df[col] / complete_df['population']
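
If every county carries an SVI category and a metro status, the four SVI shares should sum to 1 for each state, and so should the metro and non-metro shares. A quick check is sketched here using the Alaska populations from the `head()` output shown earlier; whether the identity holds for every state in the full data is an assumption worth verifying, since counties with missing categories would break it.

```python
import numpy as np
import pandas as pd

# Single-row frame with the Alaska populations printed earlier in the notebook
toy = pd.DataFrame({
    'pop_in_svi_ctgy_A': [5810.0],
    'pop_in_svi_ctgy_B': [600959.0],
    'pop_in_svi_ctgy_C': [53308.0],
    'pop_in_svi_ctgy_D': [62266.0],
    'pop_in_metro':      [493166.0],
    'pop_in_nonmetro':   [229177.0],
    'population':        [722343.0],
})

svi_cols = [f'pop_in_svi_ctgy_{c}' for c in 'ABCD']
metro_cols = ['pop_in_metro', 'pop_in_nonmetro']

# Row-wise sum of shares within each grouping
svi_share_sum = toy[svi_cols].div(toy['population'], axis=0).sum(axis=1)
metro_share_sum = toy[metro_cols].div(toy['population'], axis=0).sum(axis=1)

svi_ok = np.isclose(svi_share_sum, 1.0)
metro_ok = np.isclose(metro_share_sum, 1.0)

print(svi_ok.all(), metro_ok.all())
```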

### Re-order columns

In [None]:
id_vars = ['recip_state', 'date']

phase_vars = [
    'phase_1_train',
    'phase_1_test',
    'phase_2_train',
    'phase_2_test'
]

X_vars = [
    # Calendar--------------------
    'month',
    'dayofweek',
    # Population------------------
    'population',
    # SVI-------------------------
    'pct_pop_in_svi_ctgy_A',
    'pct_pop_in_svi_ctgy_B',
    'pct_pop_in_svi_ctgy_C',
    'pct_pop_in_svi_ctgy_D',
    # Metro-----------------------
    'pct_pop_in_metro',
    'pct_pop_in_nonmetro',
    # Dose Administration---------
    'pct_doses_admin_5plus',
    'pct_doses_admin_12plus',
    'pct_doses_admin_18plus',
    'pct_doses_admin_65plus',
]

y_vars = [
    'cum_deaths',
    'daily_deaths',
    'cum_death_pct',
    'daily_death_pct'
]

col_order = id_vars + phase_vars + X_vars + y_vars

complete_df = complete_df[col_order]

complete_df.sample(5)

Unnamed: 0,recip_state,date,phase_1_train,phase_1_test,phase_2_train,phase_2_test,month,dayofweek,population,pct_pop_in_svi_ctgy_A,pct_pop_in_svi_ctgy_B,pct_pop_in_svi_ctgy_C,pct_pop_in_svi_ctgy_D,pct_pop_in_metro,pct_pop_in_nonmetro,pct_doses_admin_5plus,pct_doses_admin_12plus,pct_doses_admin_18plus,pct_doses_admin_65plus,cum_deaths,daily_deaths,cum_death_pct,daily_death_pct
17195,OH,2021-11-08,True,False,True,False,11,0,11689100.0,0.232137,0.238934,0.506648,0.02228,0.799183,0.200817,0.0,0.006292,0.006505,0.008655,26548.0,52.0,0.002271,4e-06
18990,GA,2021-12-13,True,False,True,False,12,0,10617423.0,0.131282,0.233374,0.337522,0.297823,0.830058,0.169942,0.005869,0.006409,0.006499,0.008555,29729.0,36.0,0.0028,3e-06
24348,IA,2022-03-26,True,False,True,False,3,5,3155070.0,0.444228,0.471676,0.078765,0.005331,0.600552,0.399448,0.006864,0.006149,0.006384,0.008783,9402.0,0.0,0.00298,0.0
8484,DE,2021-05-25,True,False,True,False,5,1,973764.0,0.0,0.814343,0.185657,0.0,1.0,0.0,0.0,0.005481,0.005834,0.008523,1659.0,0.0,0.001704,0.0
2575,NC,2021-01-31,True,False,True,False,1,6,10488084.0,0.030116,0.40699,0.29266,0.270233,0.788319,0.211681,0.0,0.000592,0.00065,0.00169,9335.0,178.0,0.00089,1.7e-05


In [None]:
complete_df.shape

(42432, 23)

In [None]:
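# Keep only rows that fall in at least one train/test window; this drops
# dates on or after 2022-10-01, which no phase uses.
train_or_test_idx = (
    complete_df['phase_1_train'] |
    complete_df['phase_1_test'] |
    complete_df['phase_2_train'] |
    complete_df['phase_2_test']
)

# .copy() so later column assignments act on an independent frame
complete_df = complete_df.loc[train_or_test_idx, :].copy()
complete_df.shape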

(34164, 23)

In [None]:
complete_df = complete_df.sort_values(by=['recip_state', 'date'])
complete_df.head()

Unnamed: 0,recip_state,date,phase_1_train,phase_1_test,phase_2_train,phase_2_test,month,dayofweek,population,pct_pop_in_svi_ctgy_A,pct_pop_in_svi_ctgy_B,pct_pop_in_svi_ctgy_C,pct_pop_in_svi_ctgy_D,pct_pop_in_metro,pct_pop_in_nonmetro,pct_doses_admin_5plus,pct_doses_admin_12plus,pct_doses_admin_18plus,pct_doses_admin_65plus,cum_deaths,daily_deaths,cum_death_pct,daily_death_pct
0,AK,2020-12-13,True,False,True,False,12,6,722343.0,0.008043,0.831958,0.073799,0.0862,0.682731,0.317269,0.0,0.0,0.0,0.0,173.0,0.0,0.000239,0.0
52,AK,2020-12-14,True,False,True,False,12,0,722343.0,0.008043,0.831958,0.073799,0.0862,0.682731,0.317269,0.0,0.0,0.0,0.0,173.0,0.0,0.000239,0.0
104,AK,2020-12-15,True,False,True,False,12,1,722343.0,0.008043,0.831958,0.073799,0.0862,0.682731,0.317269,0.0,0.0,0.0,0.0,176.0,3.0,0.000244,4e-06
156,AK,2020-12-16,True,False,True,False,12,2,722343.0,0.008043,0.831958,0.073799,0.0862,0.682731,0.317269,0.0,0.0,0.0,0.0,178.0,2.0,0.000246,3e-06
208,AK,2020-12-17,True,False,True,False,12,3,722343.0,0.008043,0.831958,0.073799,0.0862,0.682731,0.317269,0.0,0.0,0.0,0.0,180.0,2.0,0.000249,3e-06


## Export

In [None]:
complete_df.to_pickle(os.path.join(DATA_PATH, 'df_state_timeseries_v2.pkl'))
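
Pickle keeps the `datetime64` and `category` dtypes intact, so the exported frame can be reloaded without re-parsing dates or re-casting `month` and `dayofweek`. A small round-trip sketch follows; the toy frame and the temporary file name are hypothetical stand-ins for `complete_df` and `DATA_PATH`.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same kinds of dtypes as the exported one
toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-12-13', '2020-12-14']),
    'recip_state': ['AK', 'AK'],
})
toy['month'] = toy['date'].dt.month.astype('category')

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_timeseries.pkl')  # hypothetical file name
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.dtypes)
```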

# Viz

In [None]:
# def plot_county_timeseries(fips: str, daily: bool = True, plt_vars = combined_vars):

#     daily_str = 'daily_deaths' if daily else 'cumulative'

#     plt_data = df.loc[df['fips'] == fips, plt_vars]
#     plt.figure(figsize=(10, 4))

#     ax1 = plt.gca()
#     ax2 = ax1.twinx()

#     lines = []

#     # Vaccination rates
#     for v, var in enumerate(dose_admin_vars):
#         line, = ax1.plot(plt_data['date'], plt_data[var], label=var)
#         lines.append(line)

#     # Death rates
#     death_var = 'daily_death_pct' if daily else 'cum_death_pct'
#     line, = ax2.plot(plt_data['date'], plt_data[[death_var]], color='black', label='Death Pct')
#     lines.append(line)

#     labels = [l.get_label() for l in lines]

#     county_name   = plt_data.loc[:, 'combined_key'].iloc[0]
#     county_pop    = int(plt_data.loc[:, 'population'].iloc[0])
#     county_svi    = plt_data.loc[:, 'svi_ctgy'].iloc[0]
#     county_metro  = plt_data.loc[:, 'metro_status'].iloc[0]

#     plt.title(f'{county_name} (pop={county_pop/1000:.1f}k, svi={county_svi}, {county_metro})')
#     plt.legend(lines, labels, loc='upper left', bbox_to_anchor=(1, 1))
#     plt.tight_layout()
#     plt.savefig(os.path.join(VIZ_PATH, 'timeseries', f'timeseries-{fips}-{daily_str}'))
#     # plt.show()
#     plt.close()
#     del plt_data

In [None]:
# len(df['fips'].unique())

In [None]:
# for county_fips in df['fips'].sample(10, random_state=1):
#     plot_county_timeseries(fips=county_fips)