# H1N1 and Flu Vaccine Prediction: Data Exploration

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Reading in Data

In [2]:
X = pd.read_csv('data/training_set_features.csv')
y = pd.read_csv('data/training_set_labels.csv')

In [3]:
# append the label columns to the features; concat with axis=1 aligns rows by index
data = pd.concat([X, y.drop('respondent_id', axis=1)], axis=1)
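
The concatenation above pairs rows by index position, so it relies on both files listing respondents in the same order. A minimal sketch of a key-based join follows, using small made-up frames (`X_demo` and `y_demo` are hypothetical stand-ins, not the real data). `validate="one_to_one"` is a standard `DataFrame.merge` option that raises if `respondent_id` is duplicated on either side.

```python
import pandas as pd

# Toy stand-ins for the features and labels tables (values are made up).
X_demo = pd.DataFrame({"respondent_id": [0, 1, 2],
                       "h1n1_concern": [1.0, 3.0, 1.0]})
# The labels deliberately arrive in a different row order.
y_demo = pd.DataFrame({"respondent_id": [2, 0, 1],
                       "h1n1_vaccine": [0, 0, 1]})

# Join on the key instead of on row position.
merged_demo = X_demo.merge(y_demo, on="respondent_id",
                           how="inner", validate="one_to_one")
print(merged_demo)
```

If the two files are known to share the same row order, the positional concat gives the same result; the merge only adds a safety check.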

In [4]:
data.head()

| | respondent_id | h1n1_concern | h1n1_knowledge | behavioral_antiviral_meds | behavioral_avoidance | behavioral_face_mask | behavioral_wash_hands | behavioral_large_gatherings | behavioral_outside_home | behavioral_touch_face | ... | rent_or_own | employment_status | hhs_geo_region | census_msa | household_adults | household_children | employment_industry | employment_occupation | h1n1_vaccine | seasonal_vaccine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | Own | Not in Labor Force | oxchjgsf | Non-MSA | 0.0 | 0.0 | NaN | NaN | 0 | 0 |
| 1 | 1 | 3.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | ... | Rent | Employed | bhuqouqj | MSA, Not Principle City | 0.0 | 0.0 | pxcmvdjn | xgwztkwe | 0 | 1 |
| 2 | 2 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Own | Employed | qufhixun | MSA, Not Principle City | 2.0 | 0.0 | rucpziij | xtkaffoo | 0 | 0 |
| 3 | 3 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | Rent | Not in Labor Force | lrircsnp | MSA, Principle City | 0.0 | 0.0 | NaN | NaN | 0 | 1 |
| 4 | 4 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | Own | Employed | qufhixun | MSA, Not Principle City | 1.0 | 0.0 | wxleyezf | emcorrxb | 0 | 0 |


## EDA

The dataset description lists the following features:
<ul>
<li><code>h1n1_concern</code> - Level of concern about the H1N1 flu.<ul>
<li><code>0</code> = Not at all concerned; <code>1</code> = Not very concerned; <code>2</code> = Somewhat concerned; <code>3</code> = Very concerned.</li>
</ul>
</li>
<li><code>h1n1_knowledge</code> - Level of knowledge about H1N1 flu.<ul>
<li><code>0</code> = No knowledge; <code>1</code> = A little knowledge; <code>2</code> = A lot of knowledge.</li>
</ul>
</li>
<li><code>behavioral_antiviral_meds</code> - Has taken antiviral medications. (binary)</li>
<li><code>behavioral_avoidance</code> - Has avoided close contact with others with flu-like symptoms. (binary)</li>
<li><code>behavioral_face_mask</code> - Has bought a face mask. (binary)</li>
<li><code>behavioral_wash_hands</code> - Has frequently washed hands or used hand sanitizer. (binary)</li>
<li><code>behavioral_large_gatherings</code> - Has reduced time at large gatherings. (binary)</li>
<li><code>behavioral_outside_home</code> - Has reduced contact with people outside of own household. (binary)</li>
<li><code>behavioral_touch_face</code> - Has avoided touching eyes, nose, or mouth. (binary)</li>
<li><code>doctor_recc_h1n1</code> - H1N1 flu vaccine was recommended by doctor. (binary)</li>
<li><code>doctor_recc_seasonal</code> - Seasonal flu vaccine was recommended by doctor. (binary)</li>
<li><code>chronic_med_condition</code> - Has any of the following chronic medical conditions: asthma or another lung condition, diabetes, a heart condition, a kidney condition, sickle cell anemia or other anemia, a neurological or neuromuscular condition, a liver condition, or a weakened immune system caused by a chronic illness or by medicines taken for a chronic illness. (binary)</li>
<li><code>child_under_6_months</code> - Has regular close contact with a child under the age of six months. (binary)</li>
<li><code>health_worker</code> - Is a healthcare worker. (binary)</li>
<li><code>health_insurance</code> - Has health insurance. (binary)</li>
<li><code>opinion_h1n1_vacc_effective</code> - Respondent's opinion about H1N1 vaccine effectiveness.<ul>
<li><code>1</code> = Not at all effective; <code>2</code> = Not very effective; <code>3</code> = Don't know; <code>4</code> = Somewhat effective; <code>5</code> = Very effective.</li>
</ul>
</li>
<li><code>opinion_h1n1_risk</code> - Respondent's opinion about risk of getting sick with H1N1 flu without vaccine.<ul>
<li><code>1</code> = Very low; <code>2</code> = Somewhat low; <code>3</code> = Don't know; <code>4</code> = Somewhat high; <code>5</code> = Very high.</li>
</ul>
</li>
<li><code>opinion_h1n1_sick_from_vacc</code> - Respondent's worry of getting sick from taking H1N1 vaccine.<ul>
<li><code>1</code> = Not at all worried; <code>2</code> = Not very worried; <code>3</code> = Don't know; <code>4</code> = Somewhat worried; <code>5</code> = Very worried.</li>
</ul>
</li>
<li><code>opinion_seas_vacc_effective</code> - Respondent's opinion about seasonal flu vaccine effectiveness.<ul>
<li><code>1</code> = Not at all effective; <code>2</code> = Not very effective; <code>3</code> = Don't know; <code>4</code> = Somewhat effective; <code>5</code> = Very effective.</li>
</ul>
</li>
<li><code>opinion_seas_risk</code> - Respondent's opinion about risk of getting sick with seasonal flu without vaccine.<ul>
<li><code>1</code> = Very low; <code>2</code> = Somewhat low; <code>3</code> = Don't know; <code>4</code> = Somewhat high; <code>5</code> = Very high.</li>
</ul>
</li>
<li><code>opinion_seas_sick_from_vacc</code> - Respondent's worry of getting sick from taking seasonal flu vaccine.<ul>
<li><code>1</code> = Not at all worried; <code>2</code> = Not very worried; <code>3</code> = Don't know; <code>4</code> = Somewhat worried; <code>5</code> = Very worried.</li>
</ul>
</li>
<li><code>age_group</code> - Age group of respondent.</li>
<li><code>education</code> - Self-reported education level.</li>
<li><code>race</code> - Race of respondent.</li>
<li><code>sex</code> - Sex of respondent.</li>
<li><code>income_poverty</code> - Household annual income of respondent with respect to 2008 Census poverty thresholds.</li>
<li><code>marital_status</code> - Marital status of respondent.</li>
<li><code>rent_or_own</code> - Housing situation of respondent.</li>
<li><code>employment_status</code> - Employment status of respondent.</li>
<li><code>hhs_geo_region</code> - Respondent's residence using a 10-region geographic classification defined by the U.S. Dept. of Health and Human Services. Values are represented as short random character strings.</li>
<li><code>census_msa</code> - Respondent's residence within metropolitan statistical areas (MSA) as defined by the U.S. Census.</li>
<li><code>household_adults</code> - Number of <em>other</em> adults in household, top-coded to 3.</li>
<li><code>household_children</code> - Number of children in household, top-coded to 3.</li>
<li><code>employment_industry</code> - Type of industry respondent is employed in. Values are represented as short random character strings.</li>
<li><code>employment_occupation</code> - Type of occupation of respondent. Values are represented as short random character strings.</li>
</ul>
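
The feature list above falls into a few natural groups: ordinal scales, binary flags, top-coded counts, and unordered categories. The sketch below transcribes those groups into Python lists. The list names (`ordinal_cols`, `binary_cols`, `count_cols`, `categorical_cols`) are our own labels for illustration, not part of the dataset, and the grouping is one reasonable reading of the description rather than an official schema.

```python
# Ordinal survey scales (coded 0-3, 0-2, or 1-5 in the description).
ordinal_cols = [
    "h1n1_concern", "h1n1_knowledge",
    "opinion_h1n1_vacc_effective", "opinion_h1n1_risk",
    "opinion_h1n1_sick_from_vacc", "opinion_seas_vacc_effective",
    "opinion_seas_risk", "opinion_seas_sick_from_vacc",
]

# Yes/no flags, marked "(binary)" in the description.
binary_cols = [
    "behavioral_antiviral_meds", "behavioral_avoidance",
    "behavioral_face_mask", "behavioral_wash_hands",
    "behavioral_large_gatherings", "behavioral_outside_home",
    "behavioral_touch_face", "doctor_recc_h1n1", "doctor_recc_seasonal",
    "chronic_med_condition", "child_under_6_months",
    "health_worker", "health_insurance",
]

# Household counts, top-coded to 3.
count_cols = ["household_adults", "household_children"]

# Demographic and geographic categories stored as strings.
categorical_cols = [
    "age_group", "education", "race", "sex", "income_poverty",
    "marital_status", "rent_or_own", "employment_status",
    "hhs_geo_region", "census_msa",
    "employment_industry", "employment_occupation",
]

all_cols = ordinal_cols + binary_cols + count_cols + categorical_cols
print(len(all_cols), "features in", 4, "groups")
```

Keeping the groups explicit makes it easier to apply different plots or encodings to each kind of feature later on.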

In [5]:
data.dtypes

respondent_id                    int64
h1n1_concern                   float64
h1n1_knowledge                 float64
behavioral_antiviral_meds      float64
behavioral_avoidance           float64
behavioral_face_mask           float64
behavioral_wash_hands          float64
behavioral_large_gatherings    float64
behavioral_outside_home        float64
behavioral_touch_face          float64
doctor_recc_h1n1               float64
doctor_recc_seasonal           float64
chronic_med_condition          float64
child_under_6_months           float64
health_worker                  float64
health_insurance               float64
opinion_h1n1_vacc_effective    float64
opinion_h1n1_risk              float64
opinion_h1n1_sick_from_vacc    float64
opinion_seas_vacc_effective    float64
opinion_seas_risk              float64
opinion_seas_sick_from_vacc    float64
age_group                       object
education                       object
race                            object
sex                      

In [6]:
data.isna().sum()/len(data)

respondent_id                  0.000000
h1n1_concern                   0.003445
h1n1_knowledge                 0.004343
behavioral_antiviral_meds      0.002658
behavioral_avoidance           0.007788
behavioral_face_mask           0.000711
behavioral_wash_hands          0.001573
behavioral_large_gatherings    0.003258
behavioral_outside_home        0.003070
behavioral_touch_face          0.004793
doctor_recc_h1n1               0.080878
doctor_recc_seasonal           0.080878
chronic_med_condition          0.036358
child_under_6_months           0.030704
health_worker                  0.030104
health_insurance               0.459580
opinion_h1n1_vacc_effective    0.014640
opinion_h1n1_risk              0.014528
opinion_h1n1_sick_from_vacc    0.014790
opinion_seas_vacc_effective    0.017299
opinion_seas_risk              0.019246
opinion_seas_sick_from_vacc    0.020107
age_group                      0.000000
education                      0.052683
race                           0.000000
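
The fractions above are easier to scan when sorted, since a few columns (such as `health_insurance` at about 46%) are missing far more often than the rest. Below is a minimal sketch on a made-up frame. `demo` and the 0.4 cutoff are illustrative assumptions, not values used elsewhere in this notebook. `isna().mean()` gives the same fractions as `isna().sum()/len(data)`.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the missingness pattern matters here.
demo = pd.DataFrame({
    "health_insurance": [1.0, np.nan, np.nan, 0.0],
    "doctor_recc_h1n1": [0.0, 1.0, np.nan, 0.0],
    "age_group": ["group_a", "group_b", "group_b", "group_c"],
})

# Fraction of missing values per column, largest first.
missing_frac = demo.isna().mean().sort_values(ascending=False)
print(missing_frac)

# Columns above a hypothetical 40% threshold that may need special handling.
high_missing = missing_frac[missing_frac > 0.4].index.tolist()
print(high_missing)
```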


In [7]:
data.nunique()

respondent_id                  26707
h1n1_concern                       4
h1n1_knowledge                     3
behavioral_antiviral_meds          2
behavioral_avoidance               2
behavioral_face_mask               2
behavioral_wash_hands              2
behavioral_large_gatherings        2
behavioral_outside_home            2
behavioral_touch_face              2
doctor_recc_h1n1                   2
doctor_recc_seasonal               2
chronic_med_condition              2
child_under_6_months               2
health_worker                      2
health_insurance                   2
opinion_h1n1_vacc_effective        5
opinion_h1n1_risk                  5
opinion_h1n1_sick_from_vacc        5
opinion_seas_vacc_effective        5
opinion_seas_risk                  5
opinion_seas_sick_from_vacc        5
age_group                          5
education                          4
race                               4
sex                                2
income_poverty                     3
m

In [8]:
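binary_dict = {0: 'No', 1: 'Yes'}
cols = data.columns
object_cols = []  # columns to treat as categorical: two-valued or string-typed
for col in cols:
    # nunique() ignores NaN, so binary flags with missing values still qualify;
    # the two-valued target columns h1n1_vaccine and seasonal_vaccine match as well
    if data[col].nunique() == 2 or data[col].dtype == 'object':
        object_cols.append(col)
data[object_cols] = data[object_cols].replace(binary_dict)  # map 0 and 1 to 'No' and 'Yes' respectively
data[object_cols] = data[object_cols].fillna('missing')  # label missing values as 'missing'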
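
To make the effect of the cell above concrete, here is a small sketch that applies the same selection rule and mapping to a made-up frame. `toy` and `label_cols` are hypothetical names and the values are invented. Only columns with exactly two distinct non-missing values, or with string dtype, are relabeled. Ordinal columns such as `h1n1_concern` are left numeric.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "health_worker": [0.0, 1.0, np.nan],   # binary flag with a missing value
    "h1n1_concern": [1.0, 3.0, 2.0],       # ordinal, three distinct levels
    "sex": ["Female", "Male", "Female"],   # already string-valued
})

binary_dict = {0: "No", 1: "Yes"}

# Same rule as the notebook cell: two distinct values, or object dtype.
label_cols = [c for c in toy.columns
              if toy[c].nunique() == 2 or toy[c].dtype == "object"]

# Relabel 0/1 and make missing values an explicit category.
toy[label_cols] = toy[label_cols].replace(binary_dict).fillna("missing")
print(toy)
```

Treating missingness as its own category keeps those rows visible in count plots instead of silently dropping them.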