In [2]:
import sklearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os


In [4]:
!ls ./Data_Test_Train_Formats/


submission_format.csv
test_set_features.csv
training_set_features.csv
training_set_labels.csv


In [9]:
train_data = pd.read_csv('./Data_Test_Train_Formats/training_set_features.csv')
train_labels = pd.read_csv('./Data_Test_Train_Formats/training_set_labels.csv')

In [12]:
train_labels.head()

   respondent_id  h1n1_vaccine  seasonal_vaccine
0              0             0                 0
1              1             0                 1
2              2             0                 0
3              3             0                 1
4              4             0                 0


In [29]:
df = train_labels.merge(train_data, on='respondent_id')
df.columns


Index(['respondent_id', 'h1n1_vaccine', 'seasonal_vaccine', 'h1n1_concern',
       'h1n1_knowledge', 'behavioral_antiviral_meds', 'behavioral_avoidance',
       'behavioral_face_mask', 'behavioral_wash_hands',
       'behavioral_large_gatherings', 'behavioral_outside_home',
       'behavioral_touch_face', 'doctor_recc_h1n1', 'doctor_recc_seasonal',
       'chronic_med_condition', 'child_under_6_months', 'health_worker',
       'health_insurance', 'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk',
       'opinion_h1n1_sick_from_vacc', 'opinion_seas_vacc_effective',
       'opinion_seas_risk', 'opinion_seas_sick_from_vacc', 'age_group',
       'education', 'race', 'sex', 'income_poverty', 'marital_status',
       'rent_or_own', 'employment_status', 'hhs_geo_region', 'census_msa',
       'household_adults', 'household_children', 'employment_industry',
       'employment_occupation'],
      dtype='object')

In [25]:
df.isna().sum().sort_values(ascending=False)[:10]

employment_occupation    13470
employment_industry      13330
health_insurance         12274
income_poverty            4423
doctor_recc_h1n1          2160
doctor_recc_seasonal      2160
rent_or_own               2042
employment_status         1463
marital_status            1408
education                 1407
dtype: int64

Several fields are missing more than 10% of their values, so some sort of imputation strategy may come in handy, especially since domain knowledge suggests (or at least I suspect) that health insurance affects vaccination probability.
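
To put numbers on how much is missing, here is a minimal sketch that reports the fraction of missing values per column instead of raw counts. The helper name `missing_fraction` and the toy frame are made up for illustration; in this notebook it would be called on `df`.

```python
import numpy as np
import pandas as pd


def missing_fraction(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of NaN values per column, largest first."""
    # isna() gives booleans; their mean is the share of missing rows.
    return frame.isna().mean().sort_values(ascending=False)


# Toy stand-in for `df` (values are invented, for illustration only).
toy = pd.DataFrame({
    "health_insurance": [1.0, np.nan, 0.0, np.nan],
    "h1n1_concern": [1.0, 2.0, np.nan, 3.0],
    "respondent_id": [0, 1, 2, 3],
})
print(missing_fraction(toy))
```

On the real data, `missing_fraction(df).head(10)` would list the same ten columns as the count-based cell above, expressed as shares of all rows.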



In [27]:
df['health_insurance'].value_counts() 

1.0    12697
0.0     1736
Name: health_insurance, dtype: int64

On the other hand, it might seem safe to assume that NaNs for insurance are 0s instead.
(That would mean assuming it's plausible that about 50% of people didn't have insurance.)
- NOT A REASONABLE ASSUMPTION! The source below indicates a national uninsured rate below 10%.
- May need to fill with a random variable that has a 90% chance of being 1.

https://www.cbpp.org/research/poverty-and-inequality/uninsured-rate-rose-in-2019-income-and-poverty-data-overtaken-by
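
One way the random fill suggested above could look, as a sketch rather than a settled choice: draw each missing `health_insurance` value from a Bernoulli distribution. The helper `impute_bernoulli`, the fixed seed, and the toy series are assumptions for illustration. `p_one=0.9` is the rough figure from the note above; the share of 1s among respondents who answered (12697 of 14433, about 0.88) would be a natural alternative.

```python
import numpy as np
import pandas as pd


def impute_bernoulli(values: pd.Series, p_one: float, seed: int = 0) -> pd.Series:
    """Replace NaNs with independent draws that equal 1 with probability p_one."""
    rng = np.random.default_rng(seed)  # fixed seed so the fill is reproducible
    filled = values.copy()
    mask = filled.isna()
    # One draw per missing entry; observed values are left untouched.
    filled.loc[mask] = rng.binomial(1, p_one, size=int(mask.sum())).astype(float)
    return filled


# Toy stand-in for df['health_insurance'] (invented values).
toy = pd.Series([1.0, np.nan, 0.0, np.nan, 1.0, np.nan])
imputed = impute_bernoulli(toy, p_one=0.9)
print(imputed)
```

A random fill keeps the overall insured rate plausible but adds noise to individual rows, so it may be worth comparing against a model that simply treats "missing" as its own category.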

Let's look at how different features are associated with vaccination. This is a bit unusual, since the targets are binary labels rather than continuous values.

In [43]:
targets = ['h1n1_vaccine', 'seasonal_vaccine']
# features measuring level of opinion concerning the topic
level_features = ['h1n1_concern',
       'h1n1_knowledge','opinion_h1n1_vacc_effective', 'opinion_h1n1_risk',
       'opinion_h1n1_sick_from_vacc', 'opinion_seas_vacc_effective',
       'opinion_seas_risk', 'opinion_seas_sick_from_vacc']
# compute the correlation matrix once, then show each target's column
corr = df[targets + level_features].corr()
print(corr['h1n1_vaccine'])
print(corr['seasonal_vaccine'])

h1n1_vaccine                   1.000000
seasonal_vaccine               0.377143
h1n1_concern                   0.121929
h1n1_knowledge                 0.117951
opinion_h1n1_vacc_effective    0.269347
opinion_h1n1_risk              0.323265
opinion_h1n1_sick_from_vacc    0.075091
opinion_seas_vacc_effective    0.179272
opinion_seas_risk              0.258571
opinion_seas_sick_from_vacc    0.008360
Name: h1n1_vaccine, dtype: float64
h1n1_vaccine                   0.377143
seasonal_vaccine               1.000000
h1n1_concern                   0.154828
h1n1_knowledge                 0.120152
opinion_h1n1_vacc_effective    0.205072
opinion_h1n1_risk              0.216625
opinion_h1n1_sick_from_vacc    0.027404
opinion_seas_vacc_effective    0.361875
opinion_seas_risk              0.390106
opinion_seas_sick_from_vacc   -0.061510
Name: seasonal_vaccine, dtype: float64


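Note that since the targets are binary, the magnitude of each correlation isn't particularly meaningful as an effect size (the sign still shows the direction of the association, though it says little when the coefficient is near zero). Instead, use the coefficients to rank the features most likely to be important.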

Opinions of vaccine effectiveness and disease risk seem most strongly correlated with the vaccine for the matching disease.

The strongest correlation is actually between the two vaccines, but that won't help with prediction, since neither vaccine label is a feature of the test set.
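
The ranking idea above can be sketched as follows: correlate only the candidate features with one target, leave the other target out, and sort by absolute value so that negative associations are not pushed to the bottom. The helper name `rank_by_correlation` and the toy frame are hypothetical; in this notebook the call would use `df`, `level_features`, and one of `targets`.

```python
import pandas as pd


def rank_by_correlation(frame: pd.DataFrame, target: str, features: list) -> pd.Series:
    """Rank features by absolute Pearson correlation with a single target."""
    # Only the listed features plus the chosen target enter the matrix,
    # so the other vaccine label cannot leak into the ranking.
    corr = frame[[target] + features].corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy stand-in for `df` (invented values).
toy = pd.DataFrame({
    "h1n1_vaccine": [0, 0, 1, 1, 0, 1],
    "seasonal_vaccine": [0, 1, 1, 1, 0, 1],
    "opinion_h1n1_risk": [1, 2, 4, 5, 1, 4],
    "opinion_h1n1_sick_from_vacc": [3, 3, 1, 2, 3, 1],
})
ranked = rank_by_correlation(
    toy, "h1n1_vaccine", ["opinion_h1n1_risk", "opinion_h1n1_sick_from_vacc"]
)
print(ranked)
```

The signed values are kept in the result, so the direction of each association stays visible next to its rank.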

Resources for understanding how to determine the importance of the different binary features available:

https://machinelearningmastery.com/chi-squared-test-for-machine-learning/

https://www.khanacademy.org/math/statistics-probability/inference-categorical-data-chi-square-tests/chi-square-goodness-of-fit-tests/v/chi-square-distribution-introduction?modal=1
    
https://medium.com/data-science-reporter/how-to-measure-feature-importance-in-a-binary-classification-model-d284b8c9a301
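
The chi-squared test of independence covered in the resources above can be sketched with `scipy.stats.chi2_contingency` applied to a contingency table of one binary feature against one target. This assumes SciPy is available (it is a scikit-learn dependency); the helper name `chi2_feature_vs_target` and the toy counts are invented for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_feature_vs_target(frame: pd.DataFrame, feature: str, target: str):
    """Chi-squared test of independence between two categorical columns."""
    # Drop rows where either value is missing before counting.
    pair = frame[[feature, target]].dropna()
    table = pd.crosstab(pair[feature], pair[target])
    stat, p_value, dof, _expected = chi2_contingency(table)
    return stat, p_value, dof


# Toy stand-in for `df`: 100 invented respondents.
toy = pd.DataFrame({
    "doctor_recc_h1n1": [1] * 50 + [0] * 50,
    "h1n1_vaccine": [1] * 30 + [0] * 20 + [1] * 10 + [0] * 40,
})
stat, p_value, dof = chi2_feature_vs_target(toy, "doctor_recc_h1n1", "h1n1_vaccine")
print(f"chi2 = {stat:.2f}, p = {p_value:.4g}, dof = {dof}")
```

A larger statistic (smaller p-value) suggests the feature and the target are not independent, which gives another way to rank the binary features.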

For binary variables (most of the ones in this dataset), the theoretical histogram we'd expect under a uniform null hypothesis is a 50-50 split between 0 and 1.
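
As a small sketch of that 50-50 comparison, `scipy.stats.chisquare` runs a goodness-of-fit test and, when no expected frequencies are passed, assumes all categories are equally likely. The counts below are the `health_insurance` value counts printed earlier in this notebook; using SciPy here is an assumption, since the notebook has not imported it.

```python
from scipy.stats import chisquare

# Observed counts of 1.0 and 0.0 from df['health_insurance'].value_counts().
observed = [12697, 1736]

# With no f_exp argument, the expected counts are equal across categories,
# which is the 50-50 split described above.
stat, p_value = chisquare(observed)
expected_each = sum(observed) / len(observed)
print(f"expected per bin: {expected_each:.1f}, chi2 = {stat:.1f}, p = {p_value:.3g}")
```

This only measures how far a single column is from an even split; it says nothing yet about the column's relationship to the vaccine targets.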