# Income and Education Exploration

- Median total income of households in 2015 - median_total_income - $x_1$
- Prevalence of low income - low_income_prevalence - $x_2$
- Percentage of people (15+) with less than high school education - percent_less_than_high_school - $x_3$
- Percentage of people (15+) with a secondary (high) school diploma - percent_high_school - $x_4$
- Percentage of people (15+) with a postsecondary education - percent_postsecondary - $x_5$
- Employment rate among people (15+) - employment_rate - $x_6$

## Useful definitions

- CHSA | LHA18_Code : Community Health Service Area | Local Health Area ID number
- CHSA | LHA18_Name : Community Health Service Area | Local Health Area name
- C_ADR_7day : Average daily rate per 100,000 population of COVID-19 cases reported during past 7 days (August 3 to 9, 2021)
- C_ADR_8_14day : Average daily rate per 100,000 population of COVID-19 cases reported 8 to 14 days ago (July 27 to August 2, 2021)
- C_ADR_7day_change : Absolute change in average daily rate compared to prior 7 day period
- 7d_positivity_all : Testing positivity rate (%) of all COVID-19 tests during past 7 days (August 3 to 9, 2021)
- 7d_positivity_public : Testing positivity rate (%) of publicly funded COVID-19 tests during past 7 days (August 3 to 9, 2021)
- D1_12_coverage : COVID-19 vaccine coverage among persons aged 12+ years receiving 1st dose (up to end of August 9, 2021)
- D1_18_coverage : COVID-19 vaccine coverage among persons aged 18+ years receiving 1st dose (up to end of August 9, 2021)
- D1_18_49_coverage : COVID-19 vaccine coverage among persons aged 18 to 49 years receiving 1st dose (up to end of August 9, 2021)
- D1_50_coverage : COVID-19 vaccine coverage among persons aged 50+ years receiving 1st dose (up to end of August 9, 2021)
- D2_12_coverage : COVID-19 vaccine coverage among persons aged 12+ years receiving 2nd dose (up to end of August 9, 2021)
- D2_18_coverage : COVID-19 vaccine coverage among persons aged 18+ years receiving 2nd dose (up to end of August 9, 2021) - $y_1$
- D2_18_49_coverage : COVID-19 vaccine coverage among persons aged 18 to 49 years receiving 2nd dose (up to end of August 9, 2021)
- D2_50_coverage : COVID-19 vaccine coverage among persons aged 50+ years receiving 2nd dose (up to end of August 9, 2021)

In [76]:
import pandas as pd
import seaborn as sns
from numpy import abs,mean
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,MinMaxScaler
import statsmodels.api as sm
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
sns.set(rc={'figure.figsize':(20,20)})

In [77]:
socioecomic_factors_df = pd.read_csv('../data/socioeco_compiled.csv')
vaccination_df = pd.read_csv('../data/BCCDC_COVID19_CHSA_Data.csv')

In [78]:
income_and_edc_features = socioecomic_factors_df[['median_total_income',
                                                  'low_income_prevalence',
                                                  'percent_less_than_high_school',
                                                  'percent_high_school',
                                                  'percent_postsecondary',
                                                  'employment_rate',
                                                  'total_pop',
                                                  'hsa',
                                                  'subregion',
                                                  'code']]
income_and_edc_features = income_and_edc_features.rename(columns={"code": "CHSA18_Code"}).copy()
# Strip the leading '$' from the income column and convert it to a number.
income_and_edc_features['median_total_income'] = pd.to_numeric(
    income_and_edc_features['median_total_income'].map(lambda x : x.split('$')[1]))

# Strip the trailing '%' from each percentage column and convert it to a number.
for column in ['low_income_prevalence',
               'percent_less_than_high_school',
               'percent_high_school',
               'percent_postsecondary',
               'employment_rate']:
    income_and_edc_features[column] = pd.to_numeric(
        income_and_edc_features[column].map(lambda x : x.split('%')[0]))
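The cleaning step above relies on every value containing exactly one `$` or `%` and no thousands separators. A minimal sketch of a more tolerant parser follows; the helper name `parse_currency_or_percent` and the sample values are hypothetical, and the real CSV may not contain any of these edge cases.

```python
import pandas as pd

def parse_currency_or_percent(series: pd.Series) -> pd.Series:
    """Strip '$', '%', commas and whitespace, then convert to float.

    Values that still cannot be parsed become NaN instead of raising.
    """
    cleaned = series.astype(str).str.replace(r"[$%,\s]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

# Hypothetical sample values; the formatting of the real file may differ.
sample = pd.DataFrame({
    "median_total_income": ["$65,432", "$48000", "n/a"],
    "employment_rate": ["61.2%", " 55% ", "47.9%"],
})
parsed = sample.apply(parse_currency_or_percent)
print(parsed)
```

Coercing unparseable values to NaN makes missing data visible in `parsed.isna().sum()` rather than stopping the notebook with an exception.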



In [79]:
combined_df = pd.merge(income_and_edc_features,vaccination_df,on='CHSA18_Code')
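`pd.merge` defaults to an inner join, so any area whose code appears in only one table is dropped silently. The sketch below, on two hypothetical miniature tables, shows one way to audit that with `how="outer"` and `indicator=True`; the codes and values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
features = pd.DataFrame({"CHSA18_Code": [1110, 1120, 1130],
                         "employment_rate": [58.1, 61.4, 55.0]})
vaccination = pd.DataFrame({"CHSA18_Code": [1110, 1130, 1140],
                            "D1_18_coverage": [82, 77, 90]})

# An outer join keeps every code; the _merge column records where it came from.
audit = pd.merge(features, vaccination, on="CHSA18_Code",
                 how="outer", indicator=True)
counts = audit["_merge"].value_counts()
print(counts)

# Codes present in only one of the two tables.
unmatched = sorted(audit.loc[audit["_merge"] != "both", "CHSA18_Code"])
print(unmatched)
```

If `unmatched` is empty for the real data, the inner join used in the notebook loses nothing.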

In [80]:
X = combined_df[[
#     'median_total_income',
#                  'low_income_prevalence',
                 'percent_less_than_high_school',
#                  'percent_high_school',
#                  'percent_postsecondary',
#                  'employment_rate'
                ]]
y = combined_df[['D1_18_coverage']]

$$
y_1 = \sum_{i=1}^{6} a_i x_i + b
$$
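To make the form of this equation concrete, here is a small sketch that estimates the six coefficients $a_i$ and the intercept $b$ by ordinary least squares. The data are synthetic and the "true" coefficients are arbitrary assumptions chosen for illustration; nothing here comes from the census or vaccination tables.

```python
import numpy as np

rng = np.random.default_rng(0)
n_areas, n_features = 200, 6

# Synthetic features x_1..x_6 and assumed coefficients a_1..a_6, b.
X = rng.normal(size=(n_areas, n_features))
true_a = np.array([0.5, -1.0, 2.0, 0.0, 1.5, -0.5])
true_b = 70.0
y = X @ true_a + true_b + rng.normal(scale=0.1, size=n_areas)

# Append a column of ones so the intercept b is estimated alongside the a_i.
design = np.column_stack([X, np.ones(n_areas)])
solution, *_ = np.linalg.lstsq(design, y, rcond=None)
a_hat, b_hat = solution[:-1], solution[-1]

print(np.round(a_hat, 2))
print(round(float(b_hat), 2))
```

The estimates should land close to `true_a` and `true_b` because the noise is small relative to the signal.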

In [81]:
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42,train_size=0.7)

In [82]:

D_1_coverage_predictor = LinearRegression()
D_1_coverage_predictor.fit(X_train,y_train)

LinearRegression()
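A fitted model is usually judged on the held-out split as well as the training split. The sketch below shows one way to compare the two using `mean_squared_error` and `r2_score` from scikit-learn. It runs on synthetic data with an assumed slope and noise level, so the printed numbers say nothing about the real coverage data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: one percentage feature and a noisy linear response.
x = rng.uniform(5, 40, size=(120, 1))
y = 90 - 0.4 * x[:, 0] + rng.normal(scale=3.0, size=120)

X_train, X_test, y_train, y_test = train_test_split(
    x, y, random_state=42, train_size=0.7)

model = LinearRegression().fit(X_train, y_train)

# Compare error on data the model has seen with data it has not.
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
test_r2 = r2_score(y_test, model.predict(X_test))

print(f"train MSE: {train_mse:.2f}")
print(f"test MSE:  {test_mse:.2f}")
print(f"test R^2:  {test_r2:.3f}")
```

A test MSE much larger than the training MSE would suggest overfitting, which is unlikely with a single feature but worth checking once more features are enabled.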

In [83]:
# print(mean((D_1_coverage_predictor.predict(X_train) - y_train)**2))  # training MSE
fit = sm.OLS(y_train,sm.add_constant(X_train[[
#                 'median_total_income',                                        
#                 'low_income_prevalence',
                  'percent_less_than_high_school',
#                  'percent_high_school',
#                  'percent_postsecondary',
#                  'employment_rate'
                                             ]])).fit()

print(fit.summary())

                            OLS Regression Results                            
Dep. Variable:         D1_18_coverage   R-squared:                       0.225
Model:                            OLS   Adj. R-squared:                  0.215
Method:                 Least Squares   F-statistic:                     22.93
Date:                Sat, 25 Sep 2021   Prob (F-statistic):           7.72e-06
Time:                        19:08:08   Log-Likelihood:                -251.14
No. Observations:                  81   AIC:                             506.3
Df Residuals:                      79   BIC:                             511.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
const         
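The header figures in the summary are linked by standard identities, which gives a quick consistency check. The sketch below recomputes the adjusted R-squared and the F-statistic from the reported R-squared, number of observations, and number of predictors. Because the reported R-squared is rounded to three decimals, the recomputed F-statistic can differ from the printed 22.93 in the last digit.

```python
# Values copied from the regression summary above.
r_squared = 0.225
n_obs = 81
n_predictors = 1

# Residual degrees of freedom: observations minus predictors minus intercept.
df_resid = n_obs - n_predictors - 1

# Adjusted R^2 penalises R^2 for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# F-statistic for the overall regression, written in terms of R^2.
f_statistic = (r_squared / n_predictors) / ((1 - r_squared) / df_resid)

print(df_resid)
print(round(adj_r_squared, 3))
print(f_statistic)
```

The residual degrees of freedom and the adjusted R-squared reproduce the reported 79 and 0.215.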

