# Modeling & Cleaning Issues:

### Data Cleaning for Cases and Deaths:
Deaths and cases are cumulative to each date. To get only the 2021 counts, the 2020 totals need to be subtracted from the 2021 totals. Likewise, the 2021 totals need to be subtracted from the 2022 totals to get only the 2022 counts.

This would explain why the cases and death numbers seem too high.

This could also be part of the reason the models are not performing well at the county level.
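
The subtraction described above can be sketched with a column-wise difference. This is a minimal illustration on a made-up toy frame, assuming the yearly columns hold cumulative totals; the column names follow the notebook, while `toy`, `cumulative`, and `yearly` are hypothetical names introduced here.

```python
import pandas as pd

# Toy frame with cumulative year-end counts (made-up numbers, not real data)
toy = pd.DataFrame(
    {
        "cases_2020": [100, 250],
        "cases_2021": [400, 900],
        "cases_2022": [650, 1200],
    },
    index=pd.Index([1001, 1003], name="FIPS"),
)

cumulative = toy[["cases_2020", "cases_2021", "cases_2022"]]

# Differencing across columns turns cumulative totals into per-year counts.
# The first column has nothing before it, so it keeps its original value.
yearly = cumulative.diff(axis=1)
yearly["cases_2020"] = cumulative["cases_2020"]
yearly = yearly.add_suffix("_new")

print(yearly)
```

The same pattern would apply to the deaths columns. If only the pandemic total is needed, the 2022 cumulative column can be used directly without any differencing.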


Do we want to examine total cases as our y variable (2020, 2021, and 2022 combined), or do we also want to examine early versus late pandemic outcomes separately?

John- if for now you want to look at total pandemic outcomes and keep it simple, you could treat case rate or death rate as cases_2022/population or deaths_2022/population and just drop 2020 and 2021. These numbers are cumulative, so the 2022 values already include the totals for the prior years.


## Next Steps:

1. Retry merging the county datasets on countyFIPS, not on county. We lost almost half the dataset in the merge, and I think we can retain much more of it by merging on the FIPS code instead of dropping it.

2. I have simplified county vax data saved: the vax rates by county from Sept 2021 are saved as county_vax_2021, and those for 2022 as county_vax_2022. county_vax_2021 only includes the percent of the population that received the first dose and the percent of those 65 and older who received the first dose, while the 2022 dataset also includes boosters etc. However, there was more missing data in 2022, so we may just want to use 2021.
3. John- I saw your comment on PCA. I agree that could help simplify certain steps for the model. Let me know if you want me to take a grouping of columns (especially the pre-existing conditions, employment rates etc) and try to reduce the dimensionality so we can focus on other columns. 
4. What do you think about taking state out of the X variable?
5. Do we want to manufacture a binarized mask column where if mask > 0 
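
For the merge in step 1, here is a rough sketch of why joining on the FIPS code may retain more rows than joining on county name. The two frames (`health`, `covid`) are hypothetical stand-ins with illustrative values, and the assumption that one file stores FIPS as an integer and the other as zero-padded text is only a guess at a common cause of lost rows.

```python
import pandas as pd

# Hypothetical stand-ins for the two county tables (illustrative values)
health = pd.DataFrame(
    {
        "countyFIPS": [1001, 1003, 1005],
        "County": ["Autauga", "Baldwin", "Barbour"],
        "percent_smokers": [19, 17, 22],
    }
)
covid = pd.DataFrame(
    {
        "countyFIPS": ["01001", "01003", "01007"],  # FIPS stored as zero-padded text
        "County": ["Autauga County", "Baldwin County", "Bibb County"],
        "cases_2022": [18961, 67496, 7692],
    }
)

# Normalize both keys to 5-character zero-padded strings so they compare equal
for frame in (health, covid):
    frame["countyFIPS"] = frame["countyFIPS"].astype(str).str.zfill(5)

# County names are spelled differently here, so a name-based merge matches nothing
by_name = health.merge(covid, on="County", how="inner")

# A FIPS-based merge matches every county present in both tables;
# indicator=True adds a _merge column showing where unmatched rows came from
by_fips = health.merge(
    covid.drop(columns="County"), on="countyFIPS", how="outer", indicator=True
)

print(len(by_name))
print(by_fips["_merge"].value_counts())
```

Inspecting the `left_only` and `right_only` rows in `_merge` is a quick way to see which counties are being lost and from which file.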

In [2]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics
import statsmodels.api as sm

from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.ensemble import VotingRegressor

from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor


In [3]:
df = pd.read_csv('Data/County_Data_needsVAXmerge.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,FIPS,State,County,Years of Potential Life Lost Rate (premature death),YPLL Rate (Black),YPLL Rate (Hispanic),YPLL Rate (White),% Fair/Poor Health,percent_smokers,...,percent Not Proficient in English,percent Female,number Rural,cases_2020,cases_2021,cases_2022,deaths_2020,deaths_2021,deaths_2022,Masks
0,0,1001,Alabama,Autauga,8824.0,10471.0,0.0,8707.0,18,19,...,1,51.3,22921.0,4190.0,11018.0,18961.0,5631.0,47405.0,124934.0,267.0
1,1,1003,Alabama,Baldwin,7225.0,10042.0,3087.0,7278.0,18,17,...,0,51.5,77060.0,13601.0,39911.0,67496.0,12412.0,148723.0,397246.0,267.0
2,2,1005,Alabama,Barbour,9586.0,11333.0,0.0,7310.0,26,22,...,1,47.2,18613.0,1514.0,3860.0,7027.0,2035.0,24364.0,60044.0,267.0
3,4,1007,Alabama,Bibb,11784.0,14813.0,0.0,11328.0,20,20,...,0,46.5,15663.0,1834.0,4533.0,7692.0,2678.0,28085.0,65964.0,267.0
4,5,1009,Alabama,Blount,10908.0,0.0,5620.0,11336.0,21,20,...,2,50.7,51562.0,4641.0,11256.0,17731.0,3855.0,56300.0,144559.0,267.0


In [4]:
# Drop columns that we will not be using
df.drop(columns = ['Unnamed: 0', 'County', 'YPLL Rate (Black)', 'YPLL Rate (Hispanic)', 'YPLL Rate (White)', 'Number Uninsured', 'Number Primary Care Physicians', 
                        'Preventable Hosp. Rate (Black)', 'Preventable Hosp. Rate (Hispanic)', 'Preventable Hosp. Rate (White)',  'Percent Vaccinated Flu (Black)', 
                        'Percent  Vaccinated (Hispanic) Flu', 'Percent Vaccinated (White) Flu', 'Number Some College', 'Number Unemployed', 'Labor Force', 'PCP Ratio',
                        '80th Percentile Income', '20th Percentile Income', '95% CI - Low', '95% CI - High', 'Life Expectancy (Black)', 'Life Expectancy (Hispanic)', 
                        'Life Expectancy (White)', 'Number HIV Cases', 'Household income (Black)', 'Household income (Hispanic)', 'Household income (White)'], inplace = True)

In [5]:
df.shape

(1850, 53)

In [6]:
df.isna().sum()

FIPS                                                   0
State                                                  0
Years of Potential Life Lost Rate (premature death)    0
% Fair/Poor Health                                     0
percent_smokers                                        0
percent_obese                                          0
Food Environment Index                                 0
% Physically Inactive                                  0
percent Excessive Drinking                             0
Percent Uninsured                                      0
PCP Rate                                               0
Preventable Hosp stays Rate                            0
Percent Vaccinated Flu                                 0
High School Graduation Rate                            0
Percent Some College                                   0
Percent Unemployed                                     0
Income Ratio                                           0
Average Daily PM2.5            

In [None]:
# Make FIPS index 
df.set_index('FIPS', inplace=True)

In [None]:
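# Create new columns for per-population stats: YPL, number of premature deaths, number rural
# Note: the YPLL column is labeled as a rate, so dividing it by Population may normalize it twice (worth confirming)
df['YPL'] = df['Years of Potential Life Lost Rate (premature death)']/df['Population']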
df['pre mature deaths'] = df['Number pre-mature Deaths']/df['Population']
df['rural'] = df['number Rural']/df['Population']

df.drop(columns = ['Years of Potential Life Lost Rate (premature death)', 'Number pre-mature Deaths', 'number Rural'], inplace = True)

In [None]:
# Dummify State and Presence of water violation
df = pd.get_dummies(columns = ['State'], data = df, drop_first=True)
df['water'] = df['Presence of water violation'].map({'No': 0, 'Yes': 1})
df.drop(columns = ['Presence of water violation'], inplace = True)

In [None]:
# Calculate total cases and deaths as a fraction of population.
# The yearly columns are cumulative, so the 2022 values already include 2020 and 2021;
# summing the three years would count the earlier years more than once.
df['case_rate'] = df['cases_2022']/df['Population']
df['death_rate'] = df['deaths_2022']/df['Population']

# Deaths still seem to be off in the raw data (e.g. deaths_2020 is larger than cases_2020 in the rows above)
df.drop(columns = ['cases_2020', 'cases_2021', 'cases_2022', 'deaths_2020', 'deaths_2021', 'deaths_2022'], inplace = True)

In [None]:
# Drop Na values (1850 rows -> 1807)
df.dropna(inplace=True)
df.shape

In [None]:
# y variable will be case rate or death rate
y = df['case_rate']
# y = df['death_rate']

# X variables
X = df.drop(columns = ['case_rate', 'death_rate'])

# TTS
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)
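
One way to put a number on how the fitted model is performing is to compare train and test R² alongside a cross-validated score. The sketch below runs on synthetic data, since the real frame is not reproduced here; the variable names mirror the notebook, but the features, coefficients, and noise level are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the county features and case rate (not real data)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([0.5, -0.2, 0.0, 0.1, 0.3]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

lr = LinearRegression()
lr.fit(X_train, y_train)

# R^2 on train vs. test: a large gap between the two suggests overfitting
train_r2 = lr.score(X_train, y_train)
test_r2 = lr.score(X_test, y_test)

# 5-fold cross-validated R^2, computed on the training split only
cv_r2 = cross_val_score(lr, X_train, y_train, cv=5).mean()

print(f"train R2: {train_r2:.3f}, test R2: {test_r2:.3f}, CV R2: {cv_r2:.3f}")
```

On the real data, a low score on all three would point toward the features or the target (for example the cumulative counts issue) rather than toward overfitting.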