# Modeling & Cleaning Issues:

### Data Cleaning for Cases and Deaths:
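The case and death counts are cumulative up to each date. To get only the cases (or deaths) that occurred in 2021, subtract the 2020 year-end total from the 2021 year-end total. Likewise, subtract the 2021 total from the 2022 total to get 2022 alone.

This would explain why the case and death numbers seem too high.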

This could also be part of the reason the models are not performing well at the county level.
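
A minimal sketch of the subtraction described above, using a toy frame with the Autauga and Baldwin year-end values shown in the `county_cases` output of this notebook. The `*_only` column names are hypothetical and not part of the existing pipeline.

```python
import pandas as pd

# Toy year-end cumulative counts (values copied from the county_cases output in this notebook).
cumulative = pd.DataFrame({
    'FIPS': [1001, 1003],
    'cases_2020': [4190.0, 13601.0],
    'cases_2021': [11018.0, 39911.0],
    'cases_2022': [18961.0, 67496.0],
})

# Cases within a calendar year = year-end cumulative minus the prior year-end cumulative.
yearly = cumulative[['FIPS']].copy()
yearly['cases_2020_only'] = cumulative['cases_2020']
yearly['cases_2021_only'] = cumulative['cases_2021'] - cumulative['cases_2020']
yearly['cases_2022_only'] = cumulative['cases_2022'] - cumulative['cases_2021']
print(yearly)
```

The same pattern would apply to the deaths columns.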


Do we want to use total cases as our y variable (the total across 2020, 2021, and 2022), or do we also want to examine early versus late pandemic outcomes separately?

John: if for now you want to look at total pandemic outcomes and keep it simple, you could define case rate or death rate as cases_2022/population or deaths_2022/population and just drop 2020 and 2021. These numbers are cumulative, so the 2022 values already include the totals for the prior years.
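
As a rough sketch of that simplification: the `population` values below are made up for illustration only, and `case_rate` / `cases_per_100k` are hypothetical column names.

```python
import pandas as pd

toy = pd.DataFrame({
    'FIPS': [1001, 1003],
    'cases_2022': [18961.0, 67496.0],  # cumulative through 2022-12-31
    'population': [50000, 200000],     # made-up values, not real county populations
})

# cases_2022 is cumulative, so it already contains the 2020 and 2021 totals.
toy['case_rate'] = toy['cases_2022'] / toy['population']
toy['cases_per_100k'] = toy['case_rate'] * 100_000
print(toy)
```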


## Next Steps:

1. Retry merging the county datasets on countyFIPS rather than on county name. We lost almost half the dataset in the merge, and I think we can retain a lot more of it by merging on the FIPS code instead of dropping it.

2. I have simplified county vax data saved: the vax rates by county from Sept 2021 are saved as county_vax_2021, and those for 2022 as county_vax_2022. county_vax_2021 only includes the percent of the population that received the first dose and the percent of those 65 and older who received the first dose, while the 2022 dataset also includes boosters etc. However, there was more missing data in 2022, so we may just want to use 2021.
3. John: I saw your comment on PCA. I agree that could help simplify certain steps for the model. Let me know if you want me to take a grouping of columns (especially the pre-existing conditions, employment rates, etc.) and try to reduce the dimensionality so we can focus on other columns.
4. What do you think about taking state out of the X variable?
5. Do we want to create a binarized mask column, set to 1 where mask > 0 and 0 otherwise?
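
For step 1, one possible approach is to put the FIPS codes of both frames into the same string form before merging. This is only a sketch: `normalize_fips` is a hypothetical helper, the toy frames stand in for the real datasets, and it assumes the codes are stored as integers or digit strings (not floats).

```python
import pandas as pd

def normalize_fips(series: pd.Series) -> pd.Series:
    """Return FIPS codes as zero-padded 5-character strings (hypothetical helper)."""
    return series.astype(str).str.strip().str.zfill(5)

# Toy stand-ins: one frame stores FIPS as ints, the other as strings.
left = pd.DataFrame({'FIPS': [1001, 1003, 10001],
                     'County': ['Autauga', 'Baldwin', 'Kent']})
right = pd.DataFrame({'FIPS': ['1001', '01003', '10001'],
                      'Administered_Dose1_Pop_Pct': [42.2, 53.2, 51.3]})

left['FIPS'] = normalize_fips(left['FIPS'])
right['FIPS'] = normalize_fips(right['FIPS'])

# validate raises if a key is duplicated; indicator shows which rows found a match.
merged = left.merge(right, on='FIPS', how='left',
                    validate='one_to_one', indicator=True)
print(merged)
```

Checking the `_merge` column afterwards shows how many counties failed to match, which is easier to audit than comparing shapes.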

In [1]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn import metrics
import statsmodels.api as sm

from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.ensemble import BaggingRegressor, VotingRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor


In [2]:
new_pop = pd.read_csv('./Data/population_w_percent.csv')
new_pop['pop %'] = new_pop['pop %']/100
pop_percent = new_pop.drop(columns=['County Name', 'State', 'population', 'Total Population'])


In [3]:
cases = pd.read_csv('./Ignore/covid_confirmed_usafacts.csv')
cases['County'] = cases['County Name'].str.replace(r'\bCounty\b', '', regex=True).str.strip()

cases = pd.merge(cases, pop_percent, how = 'left', on='countyFIPS').copy()

In [4]:
state_name = {
    'AL':'Alabama',
    'AK':'Alaska',
    'AZ':'Arizona',
    'AR':'Arkansas',
    'CA':'California',
    'CO':'Colorado',
    'CT':'Connecticut',
    'DE':'Delaware',
    'DC':'District of Columbia',
    'FL':'Florida',
    'GA':'Georgia',
    'HI':'Hawaii',
    'ID':'Idaho',
    'IL':'Illinois',
    'IN':'Indiana',
    'IA':'Iowa',
    'KS':'Kansas',
    'KY':'Kentucky',
    'LA':'Louisiana',
    'ME':'Maine',
    'MD':'Maryland',
    'MA':'Massachusetts',
    'MI':'Michigan',
    'MN':'Minnesota',
    'MS':'Mississippi',
    'MO':'Missouri',
    'MT':'Montana',
    'NE':'Nebraska',
    'NV':'Nevada',
    'NH':'New Hampshire',
    'NJ':'New Jersey',
    'NM':'New Mexico',
    'NY':'New York',
    'NC':'North Carolina',
    'ND':'North Dakota',
    'OH':'Ohio',
    'OK':'Oklahoma',
    'OR':'Oregon',
    'PA':'Pennsylvania',
    'RI':'Rhode Island',
    'SC':'South Carolina',
    'SD':'South Dakota',
    'TN':'Tennessee',
    'TX':'Texas',
    'UT':'Utah',
    'VT':'Vermont',
    'VA':'Virginia',
    'WA':'Washington',
    'WV':'West Virginia',
    'WI':'Wisconsin',
    'WY':'Wyoming'
}

state_abbreviations = list(state_name.keys())

In [5]:
cases_2020 = cases[['countyFIPS', 'County', 'State', 'StateFIPS', '2020-12-31', 'pop %']].copy()
cases_2020.rename(columns = {'2020-12-31': 'cases_2020', 'pop %': 'pop_per'}, inplace =True)

In [6]:
for state_abbr in state_abbreviations:
    # Calculate total_unallocated for the current state
    total_unallocated = cases_2020.loc[
        (cases_2020['State'] == state_abbr) & (cases_2020['County'] == 'Statewide Unallocated'),
        'cases_2020'
    ].values[0]
    
    county_list = cases_2020.loc[cases_2020['State'] == state_abbr, 'County'].unique()
    
    for county in county_list:
        cases_2020.loc[
            (cases_2020['State'] == state_abbr) & (cases_2020['County'] == county),
            'cases_2020'
        ] += total_unallocated * cases_2020.loc[
            (cases_2020['State'] == state_abbr) & (cases_2020['County'] == county),
            'pop_per'
        ]
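
The allocation loop above is repeated for each year and for deaths. A vectorized version could look roughly like the sketch below. `allocate_unallocated` is a hypothetical helper, the toy state `XX` is made up, and unlike the loop it treats a missing `pop_per` or a missing 'Statewide Unallocated' row as zero rather than failing.

```python
import pandas as pd

def allocate_unallocated(df, value_col, share_col='pop_per'):
    """Spread each state's 'Statewide Unallocated' count across its rows by population share."""
    out = df.copy()
    is_unalloc = out['County'] == 'Statewide Unallocated'
    # One unallocated total per state; states without such a row get 0.
    unalloc_by_state = out.loc[is_unalloc].groupby('State')[value_col].sum()
    state_total = out['State'].map(unalloc_by_state).fillna(0)
    out[value_col] = out[value_col] + state_total * out[share_col].fillna(0)
    return out

# Made-up example: 100 unallocated cases split 25% / 75% between two counties.
toy = pd.DataFrame({
    'State': ['XX', 'XX', 'XX'],
    'County': ['Statewide Unallocated', 'A', 'B'],
    'cases_2020': [100.0, 10.0, 20.0],
    'pop_per': [0.0, 0.25, 0.75],
})
result = allocate_unallocated(toy, 'cases_2020')
print(result)
```

Calling the helper once per year column would replace the six near-identical loops in this notebook.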

In [7]:
cases_2020 = cases_2020.drop_duplicates()

In [8]:
cases_2021 = cases[['countyFIPS', 'County', 'State', 'StateFIPS', '2021-12-31', 'pop %']].copy()
cases_2021.rename(columns = {'2021-12-31': 'cases_2021', 'pop %': 'pop_per'}, inplace =True)

In [9]:
for state_abbr in state_abbreviations:
    # Calculate total_unallocated for the current state
    total_unallocated = cases_2021.loc[
        (cases_2021['State'] == state_abbr) & (cases_2021['County'] == 'Statewide Unallocated'),
        'cases_2021'
    ].values[0]
    
    county_list = cases_2021.loc[cases_2021['State'] == state_abbr, 'County'].unique()
    
    for county in county_list:
        cases_2021.loc[
            (cases_2021['State'] == state_abbr) & (cases_2021['County'] == county),
            'cases_2021'
        ] += total_unallocated * cases_2021.loc[
            (cases_2021['State'] == state_abbr) & (cases_2021['County'] == county),
            'pop_per'
        ]

In [10]:
cases_2021 = cases_2021.drop_duplicates()

In [11]:
cases_2022 = cases[['countyFIPS', 'County', 'State', 'StateFIPS', '2022-12-31', 'pop %']].copy()
cases_2022.rename(columns = {'2022-12-31': 'cases_2022', 'pop %': 'pop_per'}, inplace =True)

In [12]:
for state_abbr in state_abbreviations:
    # Calculate total_unallocated for the current state
    total_unallocated = cases_2022.loc[
        (cases_2022['State'] == state_abbr) & (cases_2022['County'] == 'Statewide Unallocated'),
        'cases_2022'
    ].values[0]
    
    county_list = cases_2022.loc[cases_2022['State'] == state_abbr, 'County'].unique()
    
    for county in county_list:
        cases_2022.loc[
            (cases_2022['State'] == state_abbr) & (cases_2022['County'] == county),
            'cases_2022'
        ] += total_unallocated * cases_2022.loc[
            (cases_2022['State'] == state_abbr) & (cases_2022['County'] == county),
            'pop_per'
        ]

In [13]:
cases_2022 = cases_2022.drop_duplicates()

merged_cases = pd.merge(cases_2020, cases_2021, how='left', on='countyFIPS')
county_cases = pd.merge(merged_cases, cases_2022, how='left', on='countyFIPS')

county_cases.drop(columns = ['County','State', 'StateFIPS', 'pop_per_x', 'County_y', 'State_y', 'StateFIPS_y',
                  'pop_per_y'], inplace=True)
county_cases.head()


Unnamed: 0,countyFIPS,County_x,State_x,StateFIPS_x,cases_2020,cases_2021,cases_2022,pop_per
0,0,Statewide Unallocated,AL,1,0.0,0.0,0.0,0.0
1,0,Statewide Unallocated,AL,1,0.0,0.0,0.0,0.0
2,0,Statewide Unallocated,AL,1,0.0,0.0,1000.0,0.0
3,0,Statewide Unallocated,AL,1,0.0,0.0,3033.0,0.0
4,0,Statewide Unallocated,AL,1,0.0,0.0,4479.0,0.0
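
The repeated 'Statewide Unallocated' rows above appear to come from the merge key: those rows all show `countyFIPS` 0, so merging on `countyFIPS` alone pairs each of them with every other one. A small sketch with made-up counts shows the effect and one way to avoid it, by adding `State` to the key.

```python
import pandas as pd

# Made-up counts; two states each have an unallocated row with countyFIPS 0.
a = pd.DataFrame({'countyFIPS': [0, 0, 1001], 'State': ['AL', 'AK', 'AL'],
                  'cases_2020': [0.0, 5.0, 4190.0]})
b = pd.DataFrame({'countyFIPS': [0, 0, 1001], 'State': ['AL', 'AK', 'AL'],
                  'cases_2021': [0.0, 7.0, 11018.0]})

# Key is not unique: every FIPS-0 row on the left matches every FIPS-0 row on the right.
loose = pd.merge(a, b, how='left', on='countyFIPS')

# (countyFIPS, State) is unique, and validate raises if that ever stops being true.
strict = pd.merge(a, b, how='left', on=['countyFIPS', 'State'], validate='one_to_one')
print(len(loose), len(strict))
```

Dropping the FIPS-0 rows after allocation, before merging, would be another option.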


In [14]:
columns_to_check_duplicates = ['countyFIPS', 'County_x', 'State_x', 'StateFIPS_x']
county_cases.drop_duplicates(subset=columns_to_check_duplicates, keep='first', inplace=True)


In [15]:
county_cases.shape

(3193, 8)

In [16]:
county_cases.head()

Unnamed: 0,countyFIPS,County_x,State_x,StateFIPS_x,cases_2020,cases_2021,cases_2022,pop_per
0,0,Statewide Unallocated,AL,1,0.0,0.0,0.0,0.0
2601,1001,Autauga,AL,1,4190.0,11018.0,18961.0,0.011394
2602,1003,Baldwin,AL,1,13601.0,39911.0,67496.0,0.045528
2603,1005,Barbour,AL,1,1514.0,3860.0,7027.0,0.005035
2604,1007,Bibb,AL,1,1834.0,4533.0,7692.0,0.004567


In [17]:
county_cases.rename(columns = {'County_x' : 'County', 'State_x': 'State', 'StateFIPS_x': 'StateFIPS'}, inplace =True)


In [18]:
county_cases = county_cases.drop_duplicates()
county_cases.shape

(3193, 8)

In [19]:
rank = pd.read_csv('./Data/2019 County Health Rankings Data - cleaned.csv')


In [20]:
deaths = pd.read_csv('./Ignore/covid_deaths_usafacts.csv')

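# NOTE: the notes at the top of this notebook say these counts are already
# cumulative to each date. This loop takes a running sum across the date columns
# on top of that, which would inflate the totals (deaths_2022 for Autauga ends up
# larger than cases_2022 in the outputs below).
for i in range(5, len(deaths.columns)):
    deaths[deaths.columns[i]] = deaths[deaths.columns[i]] + deaths[deaths.columns[i - 1]]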
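
If the date columns are already cumulative, as the notes at the top of this notebook say, a running sum over them counts earlier days repeatedly. The sketch below uses made-up numbers and placeholder column names (`d1`..`d4`) to show the inflation, and how `diff` recovers daily counts instead.

```python
import pandas as pd

# Made-up cumulative deaths for one county over four days.
wide = pd.DataFrame({'countyFIPS': [1001], 'd1': [1], 'd2': [3], 'd3': [3], 'd4': [6]})
date_cols = ['d1', 'd2', 'd3', 'd4']

# Running sum over already-cumulative columns (what the loop above computes).
inflated = wide[date_cols].cumsum(axis=1)

# Daily new deaths: difference between consecutive cumulative columns.
daily = wide[date_cols].diff(axis=1)
daily['d1'] = wide['d1']  # the first day has no predecessor

print(inflated)
print(daily)
```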

In [21]:
deaths['County'] = deaths['County Name'].str.replace(r'\bCounty\b', '', regex=True).str.strip()
deaths = pd.merge(deaths, pop_percent, how = 'left', on='countyFIPS').copy()




In [22]:
deaths_2020 = deaths[['countyFIPS', 'County', 'State', 'StateFIPS', '2020-12-31', 'pop %']].copy()
deaths_2020.rename(columns = {'2020-12-31': 'deaths_2020', 'pop %': 'pop_per'}, inplace =True)
for state_abbr in state_abbreviations:
    # Calculate total_unallocated for the current state
    total_unallocated = deaths_2020.loc[
        (deaths_2020['State'] == state_abbr) & (deaths_2020['County'] == 'Statewide Unallocated'),
        'deaths_2020'
    ].values[0]
    
    county_list = deaths_2020.loc[deaths_2020['State'] == state_abbr, 'County'].unique()
    
    for county in county_list:
        deaths_2020.loc[
            (deaths_2020['State'] == state_abbr) & (deaths_2020['County'] == county),
            'deaths_2020'
        ] += total_unallocated * deaths_2020.loc[
            (deaths_2020['State'] == state_abbr) & (deaths_2020['County'] == county),
            'pop_per'
        ]

In [23]:
deaths_2020 = deaths_2020.drop_duplicates()


In [24]:
deaths_2021 = deaths[['countyFIPS', 'County', 'State', 'StateFIPS', '2021-12-31', 'pop %']].copy()
deaths_2021.rename(columns = {'2021-12-31': 'deaths_2021', 'pop %': 'pop_per'}, inplace =True)
for state_abbr in state_abbreviations:
    # Calculate total_unallocated for the current state
    total_unallocated = deaths_2021.loc[
        (deaths_2021['State'] == state_abbr) & (deaths_2021['County'] == 'Statewide Unallocated'),
        'deaths_2021'
    ].values[0]
    
    county_list = deaths_2021.loc[deaths_2021['State'] == state_abbr, 'County'].unique()
    
    for county in county_list:
        deaths_2021.loc[
            (deaths_2021['State'] == state_abbr) & (deaths_2021['County'] == county),
            'deaths_2021'
        ] += total_unallocated * deaths_2021.loc[
            (deaths_2021['State'] == state_abbr) & (deaths_2021['County'] == county),
            'pop_per'
        ]

deaths_2021 = deaths_2021.drop_duplicates()


In [25]:
deaths_2022 = deaths[['countyFIPS', 'County', 'State', 'StateFIPS', '2022-12-31', 'pop %']].copy()
deaths_2022.rename(columns = {'2022-12-31': 'deaths_2022', 'pop %': 'pop_per'}, inplace =True)
for state_abbr in state_abbreviations:
    # Calculate total_unallocated for the current state
    total_unallocated = deaths_2022.loc[
        (deaths_2022['State'] == state_abbr) & (deaths_2022['County'] == 'Statewide Unallocated'),
        'deaths_2022'
    ].values[0]
    
    county_list = deaths_2022.loc[deaths_2022['State'] == state_abbr, 'County'].unique()
    
    for county in county_list:
        deaths_2022.loc[
            (deaths_2022['State'] == state_abbr) & (deaths_2022['County'] == county),
            'deaths_2022'
        ] += total_unallocated * deaths_2022.loc[
            (deaths_2022['State'] == state_abbr) & (deaths_2022['County'] == county),
            'pop_per'
        ]
        
deaths_2022 = deaths_2022.drop_duplicates()


In [26]:
merged_deaths = pd.merge(deaths_2020, deaths_2021, how='left', on='countyFIPS')
county_deaths = pd.merge(merged_deaths, deaths_2022, how='left', on='countyFIPS')
county_deaths.drop(columns = ['County','State', 'StateFIPS', 'pop_per_x', 'County_y', 'State_y', 'StateFIPS_y',
                  'pop_per_y'], inplace=True)
county_deaths.head()

Unnamed: 0,countyFIPS,County_x,State_x,StateFIPS_x,deaths_2020,deaths_2021,deaths_2022,pop_per
0,0,Statewide Unallocated,AL,1,0.0,0.0,0.0,0.0
1,0,Statewide Unallocated,AL,1,0.0,0.0,618.0,0.0
2,0,Statewide Unallocated,AL,1,0.0,0.0,380.0,0.0
3,0,Statewide Unallocated,AL,1,0.0,0.0,11.0,0.0
4,0,Statewide Unallocated,AL,1,0.0,0.0,2348.0,0.0


In [27]:
columns_to_check_duplicates = ['countyFIPS', 'County_x', 'State_x', 'StateFIPS_x']

county_deaths.drop_duplicates(subset=columns_to_check_duplicates, keep='first', inplace=True)

county_deaths.rename(columns = {'County_x' : 'County', 'State_x': 'State', 'StateFIPS_x': 'StateFIPS'}, inplace =True)

county_deaths = county_deaths.drop_duplicates()
county_deaths.head()

Unnamed: 0,countyFIPS,County,State,StateFIPS,deaths_2020,deaths_2021,deaths_2022,pop_per
0,0,Statewide Unallocated,AL,1,0.0,0.0,0.0,0.0
2601,1001,Autauga,AL,1,5631.0,47405.0,124934.0,0.011394
2602,1003,Baldwin,AL,1,12412.0,148723.0,397246.0,0.045528
2603,1005,Barbour,AL,1,2035.0,24364.0,60044.0,0.005035
2604,1007,Bibb,AL,1,2678.0,28085.0,65964.0,0.004567


In [28]:
print(county_deaths.shape)
print(county_cases.shape)

(3193, 8)
(3193, 8)


In [29]:
county_cases.drop(columns = ['StateFIPS','pop_per'], inplace=True)
county_deaths.drop(columns = ['County', 'countyFIPS', 'State','StateFIPS','pop_per'], inplace=True)

In [30]:
cases_deaths = pd.concat([county_cases,county_deaths], axis=1, join='outer')


In [31]:
cases_deaths.shape

(3193, 9)
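
`pd.concat(..., axis=1)` aligns rows by index, so the step above works only because both frames share the same index (the shapes here suggest they do). A hedged alternative is to keep `countyFIPS` in both frames and merge on it. The toy frames below are illustrative, with values copied from the outputs in this notebook.

```python
import pandas as pd

cases_toy = pd.DataFrame({'countyFIPS': [1001, 1003],
                          'cases_2022': [18961.0, 67496.0]}, index=[2601, 2602])
deaths_toy = pd.DataFrame({'countyFIPS': [1003, 1001],
                           'deaths_2020': [12412.0, 5631.0]}, index=[0, 1])

# concat(axis=1) aligns on the index, so mismatched indices give NaN-padded rows.
by_index = pd.concat([cases_toy, deaths_toy.drop(columns='countyFIPS')], axis=1)

# merge aligns on the key column regardless of row order or index.
by_key = cases_toy.merge(deaths_toy, on='countyFIPS', how='outer', validate='one_to_one')
print(by_index.shape, by_key.shape)
```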

In [32]:
rank.head()

Unnamed: 0,FIPS,State,County,Years of Potential Life Lost Rate (premature death),YPLL Rate (Black),YPLL Rate (Hispanic),YPLL Rate (White),% Fair/Poor Health,percent_smokers,percent_obese,...,percent 65 and over,percent African American,percent American Indian/Alaskan Native,percent Asian,percent Native Hawaiian/Other Pacific Islander,percent Hispanic,percent Non-Hispanic White,percent Not Proficient in English,percent Female,number Rural
0,1001,Alabama,Autauga,8824.0,10471.0,,8707.0,18,19,38,...,15.1,19.3,0.5,1.3,0.1,2.9,74.5,1,51.3,22921.0
1,1003,Alabama,Baldwin,7225.0,10042.0,3087.0,7278.0,18,17,31,...,19.9,9.0,0.8,1.2,0.1,4.6,83.0,0,51.5,77060.0
2,1005,Alabama,Barbour,9586.0,11333.0,,7310.0,26,22,44,...,18.8,47.9,0.7,0.5,0.2,4.2,46.0,1,47.2,18613.0
3,1007,Alabama,Bibb,11784.0,14813.0,,11328.0,20,20,38,...,16.0,21.5,0.4,0.2,0.1,2.6,74.3,0,46.5,15663.0
4,1009,Alabama,Blount,10908.0,,5620.0,11336.0,21,20,34,...,17.8,1.5,0.6,0.3,0.1,9.6,86.9,2,50.7,51562.0


In [33]:
cases_deaths.rename(columns={'countyFIPS': 'FIPS'},inplace=True)

In [34]:
rcd = pd.merge(rank,cases_deaths, how='inner', on='FIPS')
rcd.shape

(3142, 81)

In [35]:
rcd.head()

Unnamed: 0,FIPS,State_x,County_x,Years of Potential Life Lost Rate (premature death),YPLL Rate (Black),YPLL Rate (Hispanic),YPLL Rate (White),% Fair/Poor Health,percent_smokers,percent_obese,...,percent Female,number Rural,County_y,State_y,cases_2020,cases_2021,cases_2022,deaths_2020,deaths_2021,deaths_2022
0,1001,Alabama,Autauga,8824.0,10471.0,,8707.0,18,19,38,...,51.3,22921.0,Autauga,AL,4190.0,11018.0,18961.0,5631.0,47405.0,124934.0
1,1003,Alabama,Baldwin,7225.0,10042.0,3087.0,7278.0,18,17,31,...,51.5,77060.0,Baldwin,AL,13601.0,39911.0,67496.0,12412.0,148723.0,397246.0
2,1005,Alabama,Barbour,9586.0,11333.0,,7310.0,26,22,44,...,47.2,18613.0,Barbour,AL,1514.0,3860.0,7027.0,2035.0,24364.0,60044.0
3,1007,Alabama,Bibb,11784.0,14813.0,,11328.0,20,20,38,...,46.5,15663.0,Bibb,AL,1834.0,4533.0,7692.0,2678.0,28085.0,65964.0
4,1009,Alabama,Blount,10908.0,,5620.0,11336.0,21,20,34,...,50.7,51562.0,Blount,AL,4641.0,11256.0,17731.0,3855.0,56300.0,144559.0


In [36]:
rcd.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3142 entries, 0 to 3141
Data columns (total 81 columns):
 #   Column                                               Non-Null Count  Dtype  
---  ------                                               --------------  -----  
 0   FIPS                                                 3142 non-null   int64  
 1   State_x                                              3142 non-null   object 
 2   County_x                                             3142 non-null   object 
 3   Years of Potential Life Lost Rate (premature death)  2908 non-null   float64
 4   YPLL Rate (Black)                                    1351 non-null   float64
 5   YPLL Rate (Hispanic)                                 837 non-null    float64
 6   YPLL Rate (White)                                    1578 non-null   float64
 7   % Fair/Poor Health                                   3142 non-null   int64  
 8   percent_smokers                                      3142 non-null  

In [37]:
mask = pd.read_csv('./Data/county_mask_mandata.csv')
# We can't drop State: some states have counties with the same name.
mask = mask.rename(columns = {'Count' : 'Masks'})
mask['County'] = mask['County'].str.replace(r'\bCounty\b', '', regex=True).str.strip()


In [38]:
mask.head()

Unnamed: 0,State,County,Masks
0,AL,Autauga,267
1,AL,Baldwin,267
2,AL,Barbour,267
3,AL,Bibb,267
4,AL,Blount,267


In [39]:
rcd.isna().sum()

FIPS                                                      0
State_x                                                   0
County_x                                                  0
Years of Potential Life Lost Rate (premature death)     234
YPLL Rate (Black)                                      1791
                                                       ... 
cases_2021                                                0
cases_2022                                                0
deaths_2020                                               0
deaths_2021                                               0
deaths_2022                                               0
Length: 81, dtype: int64

In [40]:
# Look at just the columns with missing data
missing_columns = rcd.columns[rcd.isnull().any()]
print(missing_columns)

# We will have to address each of these columns if John has not already

Index(['Years of Potential Life Lost Rate (premature death)',
       'YPLL Rate (Black)', 'YPLL Rate (Hispanic)', 'YPLL Rate (White)',
       'Food Environment Index', 'Number Uninsured', 'Percent Uninsured',
       'Number Primary Care Physicians', 'PCP Rate', 'PCP Ratio',
       'Preventable Hosp stays Rate', 'Preventable Hosp. Rate (Black)',
       'Preventable Hosp. Rate (Hispanic)', 'Preventable Hosp. Rate (White)',
       'Percent Vaccinated Flu', 'Percent Vaccinated Flu (Black)',
       'Percent  Vaccinated (Hispanic) Flu', 'Percent Vaccinated (White) Flu',
       'High School Graduation Rate', 'Number Unemployed', 'Labor Force',
       'Percent Unemployed', 'Average Daily PM2.5',
       'Presence of water violation', 'Life Expectancy', '95% CI - Low',
       '95% CI - High', 'Life Expectancy (Black)',
       'Life Expectancy (Hispanic)', 'Life Expectancy (White)',
       'Number pre-mature Deaths', 'Number HIV Cases', 'HIV Prevalence Rate',
       'Percent Uninsured Adults', 'P
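
To decide which of these columns to impute and which to drop, it may help to rank them by the share of missing rows. A small sketch on a toy frame (column names `a`, `b`, `c` are placeholders):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan],
                    'b': [1, 2, 3, 4],
                    'c': [np.nan, 1.0, 1.0, 1.0]})

# Fraction of missing rows per column, keeping only columns with any missing data.
missing_share = toy.isna().mean().loc[lambda s: s > 0].sort_values(ascending=False)
print(missing_share)
```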

In [41]:
rcd.head()

Unnamed: 0,FIPS,State_x,County_x,Years of Potential Life Lost Rate (premature death),YPLL Rate (Black),YPLL Rate (Hispanic),YPLL Rate (White),% Fair/Poor Health,percent_smokers,percent_obese,...,percent Female,number Rural,County_y,State_y,cases_2020,cases_2021,cases_2022,deaths_2020,deaths_2021,deaths_2022
0,1001,Alabama,Autauga,8824.0,10471.0,,8707.0,18,19,38,...,51.3,22921.0,Autauga,AL,4190.0,11018.0,18961.0,5631.0,47405.0,124934.0
1,1003,Alabama,Baldwin,7225.0,10042.0,3087.0,7278.0,18,17,31,...,51.5,77060.0,Baldwin,AL,13601.0,39911.0,67496.0,12412.0,148723.0,397246.0
2,1005,Alabama,Barbour,9586.0,11333.0,,7310.0,26,22,44,...,47.2,18613.0,Barbour,AL,1514.0,3860.0,7027.0,2035.0,24364.0,60044.0
3,1007,Alabama,Bibb,11784.0,14813.0,,11328.0,20,20,38,...,46.5,15663.0,Bibb,AL,1834.0,4533.0,7692.0,2678.0,28085.0,65964.0
4,1009,Alabama,Blount,10908.0,,5620.0,11336.0,21,20,34,...,50.7,51562.0,Blount,AL,4641.0,11256.0,17731.0,3855.0,56300.0,144559.0


In [42]:
rcd.rename(columns={'County_x': 'County'}, inplace=True)

In [43]:
rcd.shape

(3142, 81)

In [44]:
mask.shape

(2423, 3)

In [45]:
mask.head()

Unnamed: 0,State,County,Masks
0,AL,Autauga,267
1,AL,Baldwin,267
2,AL,Barbour,267
3,AL,Bibb,267
4,AL,Blount,267


In [46]:
rcd.head()

Unnamed: 0,FIPS,State_x,County,Years of Potential Life Lost Rate (premature death),YPLL Rate (Black),YPLL Rate (Hispanic),YPLL Rate (White),% Fair/Poor Health,percent_smokers,percent_obese,...,percent Female,number Rural,County_y,State_y,cases_2020,cases_2021,cases_2022,deaths_2020,deaths_2021,deaths_2022
0,1001,Alabama,Autauga,8824.0,10471.0,,8707.0,18,19,38,...,51.3,22921.0,Autauga,AL,4190.0,11018.0,18961.0,5631.0,47405.0,124934.0
1,1003,Alabama,Baldwin,7225.0,10042.0,3087.0,7278.0,18,17,31,...,51.5,77060.0,Baldwin,AL,13601.0,39911.0,67496.0,12412.0,148723.0,397246.0
2,1005,Alabama,Barbour,9586.0,11333.0,,7310.0,26,22,44,...,47.2,18613.0,Barbour,AL,1514.0,3860.0,7027.0,2035.0,24364.0,60044.0
3,1007,Alabama,Bibb,11784.0,14813.0,,11328.0,20,20,38,...,46.5,15663.0,Bibb,AL,1834.0,4533.0,7692.0,2678.0,28085.0,65964.0
4,1009,Alabama,Blount,10908.0,,5620.0,11336.0,21,20,34,...,50.7,51562.0,Blount,AL,4641.0,11256.0,17731.0,3855.0,56300.0,144559.0


In [47]:
rcd.rename(columns={'State_y': 'State'}, inplace=True)

In [48]:
rcd.head()

Unnamed: 0,FIPS,State_x,County,Years of Potential Life Lost Rate (premature death),YPLL Rate (Black),YPLL Rate (Hispanic),YPLL Rate (White),% Fair/Poor Health,percent_smokers,percent_obese,...,percent Female,number Rural,County_y,State,cases_2020,cases_2021,cases_2022,deaths_2020,deaths_2021,deaths_2022
0,1001,Alabama,Autauga,8824.0,10471.0,,8707.0,18,19,38,...,51.3,22921.0,Autauga,AL,4190.0,11018.0,18961.0,5631.0,47405.0,124934.0
1,1003,Alabama,Baldwin,7225.0,10042.0,3087.0,7278.0,18,17,31,...,51.5,77060.0,Baldwin,AL,13601.0,39911.0,67496.0,12412.0,148723.0,397246.0
2,1005,Alabama,Barbour,9586.0,11333.0,,7310.0,26,22,44,...,47.2,18613.0,Barbour,AL,1514.0,3860.0,7027.0,2035.0,24364.0,60044.0
3,1007,Alabama,Bibb,11784.0,14813.0,,11328.0,20,20,38,...,46.5,15663.0,Bibb,AL,1834.0,4533.0,7692.0,2678.0,28085.0,65964.0
4,1009,Alabama,Blount,10908.0,,5620.0,11336.0,21,20,34,...,50.7,51562.0,Blount,AL,4641.0,11256.0,17731.0,3855.0,56300.0,144559.0


In [49]:
county_data = pd.merge(rcd,mask, on=['State', 'County'], how='left')


In [50]:
county_data.shape

(3142, 82)

In [51]:
county_data.head()

Unnamed: 0,FIPS,State_x,County,Years of Potential Life Lost Rate (premature death),YPLL Rate (Black),YPLL Rate (Hispanic),YPLL Rate (White),% Fair/Poor Health,percent_smokers,percent_obese,...,number Rural,County_y,State,cases_2020,cases_2021,cases_2022,deaths_2020,deaths_2021,deaths_2022,Masks
0,1001,Alabama,Autauga,8824.0,10471.0,,8707.0,18,19,38,...,22921.0,Autauga,AL,4190.0,11018.0,18961.0,5631.0,47405.0,124934.0,267.0
1,1003,Alabama,Baldwin,7225.0,10042.0,3087.0,7278.0,18,17,31,...,77060.0,Baldwin,AL,13601.0,39911.0,67496.0,12412.0,148723.0,397246.0,267.0
2,1005,Alabama,Barbour,9586.0,11333.0,,7310.0,26,22,44,...,18613.0,Barbour,AL,1514.0,3860.0,7027.0,2035.0,24364.0,60044.0,267.0
3,1007,Alabama,Bibb,11784.0,14813.0,,11328.0,20,20,38,...,15663.0,Bibb,AL,1834.0,4533.0,7692.0,2678.0,28085.0,65964.0,267.0
4,1009,Alabama,Blount,10908.0,,5620.0,11336.0,21,20,34,...,51562.0,Blount,AL,4641.0,11256.0,17731.0,3855.0,56300.0,144559.0,267.0


In [52]:
county_data['Masks'].isna().sum()

910

In [53]:
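# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to the parent frame.
county_data['Masks'] = county_data['Masks'].fillna(0)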

In [54]:
county_data['Masks'].isna().sum()

0
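
If we go with the binarized mask column from the next steps, it could be derived from `Masks` as sketched below. The column name `mask_mandate` and the second toy row are hypothetical. Note that after the fill above, a 0 can mean either "no mandate" or "no match in the merge".

```python
import pandas as pd

toy = pd.DataFrame({'County': ['Autauga', 'Hypothetical'], 'Masks': [267.0, 0.0]})

# 1 if the county has any recorded mask count, else 0.
toy['mask_mandate'] = (toy['Masks'] > 0).astype(int)
print(toy)
```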

In [55]:
county_vax = pd.read_csv('Data/Cleaned/county_vax_2021.csv')

In [56]:
county_vax.shape

(3274, 5)

In [57]:
county_vax.isna().sum()

FIPS                                      0
Recip_County                              0
Recip_State                               0
Administered_Dose1_Pop_Pct                0
Administered_Dose1_Recip_65PlusPop_Pct    0
dtype: int64

In [69]:
county_vax.rename(columns={'Recip_County': 'County', 'Recip_State': 'State'}, inplace=True)

In [70]:
county_vax.head()

Unnamed: 0,FIPS,County,State,Administered_Dose1_Pop_Pct,Administered_Dose1_Recip_65PlusPop_Pct
0,2013,Aleutians East Borough,AK,74.1,47.3
1,2016,Aleutians West Census Area,AK,64.9,65.2
2,2020,Anchorage Municipality,AK,60.7,92.3
3,2050,Bethel Census Area,AK,60.4,87.2
4,2060,Bristol Bay Borough,AK,99.9,90.4


In [62]:
county_vax.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3274 entries, 0 to 3273
Data columns (total 5 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   FIPS                                    3274 non-null   object 
 1   Recip_County                            3274 non-null   object 
 2   Recip_State                             3274 non-null   object 
 3   Administered_Dose1_Pop_Pct              3274 non-null   float64
 4   Administered_Dose1_Recip_65PlusPop_Pct  3274 non-null   float64
dtypes: float64(2), object(3)
memory usage: 128.0+ KB


In [58]:
county_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3142 entries, 0 to 3141
Data columns (total 82 columns):
 #   Column                                               Non-Null Count  Dtype  
---  ------                                               --------------  -----  
 0   FIPS                                                 3142 non-null   int64  
 1   State_x                                              3142 non-null   object 
 2   County                                               3142 non-null   object 
 3   Years of Potential Life Lost Rate (premature death)  2908 non-null   float64
 4   YPLL Rate (Black)                                    1351 non-null   float64
 5   YPLL Rate (Hispanic)                                 837 non-null    float64
 6   YPLL Rate (White)                                    1578 non-null   float64
 7   % Fair/Poor Health                                   3142 non-null   int64  
 8   percent_smokers                                      3142 non-null  

In [63]:
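# Note: astype('object') keeps the values as Python ints. county_vax['FIPS'] sorts
# like strings (see the sorted output below), so converting both sides with
# astype(str) would be needed before the two can be merged on FIPS.
county_data['FIPS'] = county_data['FIPS'].astype('object')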

In [60]:
county_data.drop(columns=['State_x', 'County_y'], inplace=True)

In [74]:
county_vax.head()

Unnamed: 0,FIPS,County,State,Administered_Dose1_Pop_Pct,Administered_Dose1_Recip_65PlusPop_Pct
0,2013,Aleutians East Borough,AK,74.1,47.3
1,2016,Aleutians West Census Area,AK,64.9,65.2
2,2020,Anchorage Municipality,AK,60.7,92.3
3,2050,Bethel Census Area,AK,60.4,87.2
4,2060,Bristol Bay Borough,AK,99.9,90.4


In [75]:
county_vax = county_vax.sort_values(by='FIPS', ascending=True)

In [76]:
county_vax.head()

Unnamed: 0,FIPS,County,State,Administered_Dose1_Pop_Pct,Administered_Dose1_Recip_65PlusPop_Pct
326,10001,Kent County,DE,51.3,83.7
327,10003,New Castle County,DE,65.9,94.6
328,10005,Sussex County,DE,65.3,97.2
30,1001,Autauga County,AL,42.2,73.8
31,1003,Baldwin County,AL,53.2,89.9


In [78]:
county_vax['County'] = county_vax['County'].str.replace(r'\bCounty\b', '', regex=True).str.strip()

In [79]:
county_vax.head()

Unnamed: 0,FIPS,County,State,Administered_Dose1_Pop_Pct,Administered_Dose1_Recip_65PlusPop_Pct
326,10001,Kent,DE,51.3,83.7
327,10003,New Castle,DE,65.9,94.6
328,10005,Sussex,DE,65.3,97.2
30,1001,Autauga,AL,42.2,73.8
31,1003,Baldwin,AL,53.2,89.9


In [71]:
county_data.head()

Unnamed: 0,FIPS,County,Years of Potential Life Lost Rate (premature death),YPLL Rate (Black),YPLL Rate (Hispanic),YPLL Rate (White),% Fair/Poor Health,percent_smokers,percent_obese,Food Environment Index,...,percent Female,number Rural,State,cases_2020,cases_2021,cases_2022,deaths_2020,deaths_2021,deaths_2022,Masks
0,1001,Autauga,8824.0,10471.0,,8707.0,18,19,38,7.2,...,51.3,22921.0,AL,4190.0,11018.0,18961.0,5631.0,47405.0,124934.0,267.0
1,1003,Baldwin,7225.0,10042.0,3087.0,7278.0,18,17,31,8.0,...,51.5,77060.0,AL,13601.0,39911.0,67496.0,12412.0,148723.0,397246.0,267.0
2,1005,Barbour,9586.0,11333.0,,7310.0,26,22,44,5.6,...,47.2,18613.0,AL,1514.0,3860.0,7027.0,2035.0,24364.0,60044.0,267.0
3,1007,Bibb,11784.0,14813.0,,11328.0,20,20,38,7.6,...,46.5,15663.0,AL,1834.0,4533.0,7692.0,2678.0,28085.0,65964.0,267.0
4,1009,Blount,10908.0,,5620.0,11336.0,21,20,34,8.5,...,50.7,51562.0,AL,4641.0,11256.0,17731.0,3855.0,56300.0,144559.0,267.0


In [83]:
df = pd.merge(county_data, county_vax, how='inner', on=['State', 'County'])
print(df.shape)
df.head()

(3001, 83)


Unnamed: 0,FIPS_x,County,Years of Potential Life Lost Rate (premature death),YPLL Rate (Black),YPLL Rate (Hispanic),YPLL Rate (White),% Fair/Poor Health,percent_smokers,percent_obese,Food Environment Index,...,cases_2020,cases_2021,cases_2022,deaths_2020,deaths_2021,deaths_2022,Masks,FIPS_y,Administered_Dose1_Pop_Pct,Administered_Dose1_Recip_65PlusPop_Pct
0,1001,Autauga,8824.0,10471.0,,8707.0,18,19,38,7.2,...,4190.0,11018.0,18961.0,5631.0,47405.0,124934.0,267.0,1001,42.2,73.8
1,1003,Baldwin,7225.0,10042.0,3087.0,7278.0,18,17,31,8.0,...,13601.0,39911.0,67496.0,12412.0,148723.0,397246.0,267.0,1003,53.2,89.9
2,1005,Barbour,9586.0,11333.0,,7310.0,26,22,44,5.6,...,1514.0,3860.0,7027.0,2035.0,24364.0,60044.0,267.0,1005,44.5,75.3
3,1007,Bibb,11784.0,14813.0,,11328.0,20,20,38,7.6,...,1834.0,4533.0,7692.0,2678.0,28085.0,65964.0,267.0,1007,36.6,64.2
4,1009,Blount,10908.0,,5620.0,11336.0,21,20,34,8.5,...,4641.0,11256.0,17731.0,3855.0,56300.0,144559.0,267.0,1009,31.9,56.6
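
The inner merge on `State` and `County` drops counties whose names differ between the two sources (3142 rows before, 3001 after). One way to see which ones are lost is a left merge with `indicator=True`. This is a sketch on toy frames; the 'Anchorage' spelling on the left is a hypothetical mismatch.

```python
import pandas as pd

county_toy = pd.DataFrame({'State': ['AL', 'AK'],
                           'County': ['Autauga', 'Anchorage']})  # hypothetical spelling
vax_toy = pd.DataFrame({'State': ['AL', 'AK'],
                        'County': ['Autauga', 'Anchorage Municipality'],
                        'Administered_Dose1_Pop_Pct': [42.2, 60.7]})

# Keep every county and flag the ones with no vaccination match.
check = county_toy.merge(vax_toy, on=['State', 'County'], how='left', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', ['State', 'County']]
print(unmatched)
```

Listing the unmatched names usually makes it clear whether suffixes such as 'Municipality' or 'Borough' need stripping, or whether merging on FIPS is the simpler fix.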


In [73]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3142 entries, 0 to 3141
Data columns (total 83 columns):
 #   Column                                               Non-Null Count  Dtype  
---  ------                                               --------------  -----  
 0   FIPS_x                                               3142 non-null   object 
 1   County                                               3142 non-null   object 
 2   Years of Potential Life Lost Rate (premature death)  2908 non-null   float64
 3   YPLL Rate (Black)                                    1351 non-null   float64
 4   YPLL Rate (Hispanic)                                 837 non-null    float64
 5   YPLL Rate (White)                                    1578 non-null   float64
 6   % Fair/Poor Health                                   3142 non-null   int64  
 7   percent_smokers                                      3142 non-null   int64  
 8   percent_obese                                        3142 non-null  

In [85]:
df.shape

(3001, 83)

In [84]:
df.to_csv('Data/Cleaned/county_df.csv', index=False)

In [82]:
df.isna().sum()

FIPS_x                                                    0
County                                                    0
Years of Potential Life Lost Rate (premature death)     234
YPLL Rate (Black)                                      1791
YPLL Rate (Hispanic)                                   2305
                                                       ... 
deaths_2022                                               0
Masks                                                     0
FIPS_y                                                  141
Administered_Dose1_Pop_Pct                              141
Administered_Dose1_Recip_65PlusPop_Pct                  141
Length: 83, dtype: int64

In [68]:
#df = pd.read_csv('Data/County_Data_needsVAXmerge.csv')
df.head()

Unnamed: 0,FIPS,County,Years of Potential Life Lost Rate (premature death),YPLL Rate (Black),YPLL Rate (Hispanic),YPLL Rate (White),% Fair/Poor Health,percent_smokers,percent_obese,Food Environment Index,...,cases_2021,cases_2022,deaths_2020,deaths_2021,deaths_2022,Masks,Recip_County,Recip_State,Administered_Dose1_Pop_Pct,Administered_Dose1_Recip_65PlusPop_Pct
0,1001,Autauga,8824.0,10471.0,,8707.0,18,19,38,7.2,...,11018.0,18961.0,5631.0,47405.0,124934.0,267.0,,,,
1,1003,Baldwin,7225.0,10042.0,3087.0,7278.0,18,17,31,8.0,...,39911.0,67496.0,12412.0,148723.0,397246.0,267.0,,,,
2,1005,Barbour,9586.0,11333.0,,7310.0,26,22,44,5.6,...,3860.0,7027.0,2035.0,24364.0,60044.0,267.0,,,,
3,1007,Bibb,11784.0,14813.0,,11328.0,20,20,38,7.6,...,4533.0,7692.0,2678.0,28085.0,65964.0,267.0,,,,
4,1009,Blount,10908.0,,5620.0,11336.0,21,20,34,8.5,...,11256.0,17731.0,3855.0,56300.0,144559.0,267.0,,,,


In [4]:
# Drop columns that we will not be using
df.drop(columns = ['Unnamed: 0', 'County', 'YPLL Rate (Black)', 'YPLL Rate (Hispanic)', 'YPLL Rate (White)', 'Number Uninsured', 'Number Primary Care Physicians', 
                        'Preventable Hosp. Rate (Black)', 'Preventable Hosp. Rate (Hispanic)', 'Preventable Hosp. Rate (White)',  'Percent Vaccinated Flu (Black)', 
                        'Percent  Vaccinated (Hispanic) Flu', 'Percent Vaccinated (White) Flu', 'Number Some College', 'Number Unemployed', 'Labor Force', 'PCP Ratio',
                        '80th Percentile Income', '20th Percentile Income', '95% CI - Low', '95% CI - High', 'Life Expectancy (Black)', 'Life Expectancy (Hispanic)', 
                        'Life Expectancy (White)', 'Number HIV Cases', 'Household income (Black)', 'Household income (Hispanic)', 'Household income (White)'], inplace = True)

In [5]:
df.shape

(1850, 53)

In [6]:
df.isna().sum()

FIPS                                                   0
State                                                  0
Years of Potential Life Lost Rate (premature death)    0
% Fair/Poor Health                                     0
percent_smokers                                        0
percent_obese                                          0
Food Environment Index                                 0
% Physically Inactive                                  0
percent Excessive Drinking                             0
Percent Uninsured                                      0
PCP Rate                                               0
Preventable Hosp stays Rate                            0
Percent Vaccinated Flu                                 0
High School Graduation Rate                            0
Percent Some College                                   0
Percent Unemployed                                     0
Income Ratio                                           0
Average Daily PM2.5            

In [None]:
# Make FIPS index 
df.set_index('FIPS', inplace=True)

In [None]:
# Create new per-population columns: YPL, pre-mature deaths, rural
df['YPL'] = df['Years of Potential Life Lost Rate (premature death)']/df['Population']
df['pre mature deaths'] = df['Number pre-mature Deaths']/df['Population']
df['rural'] = df['number Rural']/df['Population']

df.drop(columns = ['Years of Potential Life Lost Rate (premature death)', 'Number pre-mature Deaths', 'number Rural'], inplace = True)

In [None]:
# Dummify State and Presence of water violation
df = pd.get_dummies(columns = ['State'], data = df, drop_first=True)
df['water'] = df['Presence of water violation'].map({'No': 0, 'Yes': 1})
df.drop(columns = ['Presence of water violation'], inplace = True)

In [None]:
# Calculate total cases and deaths, and convert to share of population
# The yearly columns are cumulative, so the 2022 value already includes 2020 and 2021;
# summing the three years would double count.
df['cases'] = df['cases_2022']
df['deaths'] = df['deaths_2022']

df['case_rate'] = df['cases']/df['Population']
df['death_rate'] = df['deaths']/df['Population']

# Deaths seem to be off? More deaths than population 
df.drop(columns = ['cases_2020', 'cases_2021', 'cases_2022', 'deaths_2020', 'deaths_2021', 'deaths_2022', 'cases', 'deaths'], inplace = True)
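
If we later want early versus late pandemic outcomes, the per-year counts can be recovered by subtracting the previous year's cumulative value. This is a rough sketch on made-up numbers. It assumes each yearly column holds the cumulative count at the end of that year, as described in the notes at the top. The `*_new` column names are hypothetical.

```python
import pandas as pd

# Made-up cumulative counts for two counties (not real data).
cumulative = pd.DataFrame(
    {
        "cases_2020": [100, 50],
        "cases_2021": [250, 80],
        "cases_2022": [400, 95],
        "Population": [1000, 500],
    },
    index=pd.Index(["01001", "01003"], name="FIPS"),
)

# New cases within each year = cumulative total minus the prior year's total.
yearly = pd.DataFrame(
    {
        "cases_2020_new": cumulative["cases_2020"],
        "cases_2021_new": cumulative["cases_2021"] - cumulative["cases_2020"],
        "cases_2022_new": cumulative["cases_2022"] - cumulative["cases_2021"],
    }
)

# The whole-pandemic rate only needs the last cumulative column.
total_case_rate = cumulative["cases_2022"] / cumulative["Population"]

# Sanity check: the per-year counts should add back up to the 2022 total.
assert (yearly.sum(axis=1) == cumulative["cases_2022"]).all()

print(yearly)
print(total_case_rate)
```

A negative value in any `*_new` column would point to a data problem in the source counts, which may be worth checking for the deaths columns.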

In [None]:
# Drop rows with NaN values (1850 rows -> 1807)
df.dropna(inplace=True)
df.shape

In [None]:
# y variable will be case rate or death rate
y = df['case_rate']
# y = df['death_rate']

# X variables
X = df.drop(columns = ['case_rate', 'death_rate'])

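# Train/test split (sklearn default: 75% train, 25% test)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)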

In [None]:
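# Fit an ordinary least squares baseline on the training split
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)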
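
To judge whether the fit is any good, one option is to compare train and test R² and to check the test RMSE against a baseline that always predicts the training mean. The sketch below runs on synthetic data rather than the county data, so the numbers it prints say nothing about our model. The coefficients and noise level are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic features and target (assumed values, for illustration only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.05, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
lr = LinearRegression().fit(X_train, y_train)

# R^2 on both splits: a large gap between them hints at overfitting.
train_r2 = lr.score(X_train, y_train)
test_r2 = lr.score(X_test, y_test)

# RMSE of the model versus a baseline that always predicts the training mean.
test_rmse = np.sqrt(mean_squared_error(y_test, lr.predict(X_test)))
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))

print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
print(f"test RMSE: {test_rmse:.4f}, baseline RMSE: {baseline_rmse:.4f}")
```

If the test RMSE on the county data is not clearly below the baseline, the features are adding little beyond the mean rate.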