Goal: predict whether a respondent received the H1N1 or the seasonal flu vaccine. Audience: teams guiding public health efforts. Only one of the two targets will be modelled.

In [None]:
# Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# For modelling
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    make_scorer, recall_score, precision_score,
    classification_report, confusion_matrix, ConfusionMatrixDisplay,
)

# To ignore warnings
import warnings
warnings.filterwarnings('ignore')


In [396]:
features_df = pd.read_csv('DATA/training_set_features.csv')
target_df = pd.read_csv('DATA/training_set_labels.csv')

In [397]:
features_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 36 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26615 non-null  float64
 2   h1n1_knowledge               26591 non-null  float64
 3   behavioral_antiviral_meds    26636 non-null  float64
 4   behavioral_avoidance         26499 non-null  float64
 5   behavioral_face_mask         26688 non-null  float64
 6   behavioral_wash_hands        26665 non-null  float64
 7   behavioral_large_gatherings  26620 non-null  float64
 8   behavioral_outside_home      26625 non-null  float64
 9   behavioral_touch_face        26579 non-null  float64
 10  doctor_recc_h1n1             24547 non-null  float64
 11  doctor_recc_seasonal         24547 non-null  float64
 12  chronic_med_condition        25736 non-null  float64
 13  child_under_6_mo

In [398]:
target_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   respondent_id     26707 non-null  int64
 1   h1n1_vaccine      26707 non-null  int64
 2   seasonal_vaccine  26707 non-null  int64
dtypes: int64(3)
memory usage: 626.1 KB


The two datasets have an equal number of rows (26,707) and share a common column, 'respondent_id', which can be used to merge them. The target variables are 'h1n1_vaccine' and 'seasonal_vaccine'. We will choose one target variable for prediction.
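One way to inform the choice of target is to compare how balanced each label is. The following is a minimal sketch on a small made-up labels frame; `toy_labels` and its values are hypothetical and are not the survey data.

```python
import pandas as pd

# Hypothetical stand-in for target_df; the values are made up for illustration.
toy_labels = pd.DataFrame({
    "respondent_id": [0, 1, 2, 3, 4],
    "h1n1_vaccine": [0, 0, 1, 0, 0],
    "seasonal_vaccine": [0, 1, 1, 0, 1],
})

# For 0/1 labels, the column mean equals the share of positive cases.
positive_rate = toy_labels[["h1n1_vaccine", "seasonal_vaccine"]].mean()
print(positive_rate)
```

A strongly imbalanced target usually calls for metrics such as recall or precision rather than plain accuracy.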

In [399]:
print(features_df.shape)
print(target_df.shape)

(26707, 36)
(26707, 3)


In [400]:
#merging dataframes on respondent_id
df = features_df.merge(target_df, on='respondent_id')

In [401]:
#check new shape 
print(f"Combined Shape: {df.shape}")

Combined Shape: (26707, 38)
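
The unchanged row count suggests every respondent matched exactly once. As a hedged sketch, the same expectation can be enforced at merge time with the `validate` argument of `DataFrame.merge`. The frames `toy_features` and `toy_targets` below are hypothetical stand-ins, not the real files.

```python
import pandas as pd

# Hypothetical toy frames standing in for features_df and target_df.
toy_features = pd.DataFrame({"respondent_id": [0, 1, 2], "h1n1_concern": [1.0, 3.0, 1.0]})
toy_targets = pd.DataFrame({"respondent_id": [0, 1, 2], "h1n1_vaccine": [0, 0, 1]})

# validate="one_to_one" raises pandas.errors.MergeError if an ID repeats on either side.
toy_merged = toy_features.merge(
    toy_targets, on="respondent_id", how="inner", validate="one_to_one"
)

# An inner merge keeps only matching IDs, so the row count should equal both inputs.
assert len(toy_merged) == len(toy_features) == len(toy_targets)
print(toy_merged.shape)
```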


In [402]:
#check first 5 rows
df.head()

Unnamed: 0,respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,doctor_recc_h1n1,doctor_recc_seasonal,chronic_med_condition,child_under_6_months,health_worker,health_insurance,opinion_h1n1_vacc_effective,opinion_h1n1_risk,opinion_h1n1_sick_from_vacc,opinion_seas_vacc_effective,opinion_seas_risk,opinion_seas_sick_from_vacc,age_group,education,race,sex,income_poverty,marital_status,rent_or_own,employment_status,hhs_geo_region,census_msa,household_adults,household_children,employment_industry,employment_occupation,h1n1_vaccine,seasonal_vaccine
0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,3.0,1.0,2.0,2.0,1.0,2.0,55 - 64 Years,< 12 Years,White,Female,Below Poverty,Not Married,Own,Not in Labor Force,oxchjgsf,Non-MSA,0.0,0.0,,,0,0
1,1,3.0,2.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,5.0,4.0,4.0,4.0,2.0,4.0,35 - 44 Years,12 Years,White,Male,Below Poverty,Not Married,Rent,Employed,bhuqouqj,"MSA, Not Principle City",0.0,0.0,pxcmvdjn,xgwztkwe,0,1
2,2,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,0.0,0.0,,3.0,1.0,1.0,4.0,1.0,2.0,18 - 34 Years,College Graduate,White,Male,"<= $75,000, Above Poverty",Not Married,Own,Employed,qufhixun,"MSA, Not Principle City",2.0,0.0,rucpziij,xtkaffoo,0,0
3,3,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,,3.0,3.0,5.0,5.0,4.0,1.0,65+ Years,12 Years,White,Female,Below Poverty,Not Married,Rent,Not in Labor Force,lrircsnp,"MSA, Principle City",0.0,0.0,,,0,1
4,4,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,,3.0,3.0,2.0,3.0,1.0,4.0,45 - 54 Years,Some College,White,Female,"<= $75,000, Above Poverty",Married,Own,Employed,qufhixun,"MSA, Not Principle City",1.0,0.0,wxleyezf,emcorrxb,0,0


In [403]:

#checking info of combined dataframe
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 38 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26615 non-null  float64
 2   h1n1_knowledge               26591 non-null  float64
 3   behavioral_antiviral_meds    26636 non-null  float64
 4   behavioral_avoidance         26499 non-null  float64
 5   behavioral_face_mask         26688 non-null  float64
 6   behavioral_wash_hands        26665 non-null  float64
 7   behavioral_large_gatherings  26620 non-null  float64
 8   behavioral_outside_home      26625 non-null  float64
 9   behavioral_touch_face        26579 non-null  float64
 10  doctor_recc_h1n1             24547 non-null  float64
 11  doctor_recc_seasonal         24547 non-null  float64
 12  chronic_med_condition        25736 non-null  float64
 13  child_under_6_mo

From the `df.info()` output above (38 columns in total):
- 23 columns are of type float64
- 3 columns are of type int64 (`respondent_id` and the two target columns)
- 12 columns are of type object
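
A tally like this can be computed rather than counted by hand. The sketch below uses a hypothetical four-column frame (`toy_df`) to show the idea.

```python
import pandas as pd

# Hypothetical miniature of the merged frame.
toy_df = pd.DataFrame({
    "respondent_id": [0, 1],
    "h1n1_concern": [1.0, 3.0],
    "sex": ["Female", "Male"],
    "h1n1_vaccine": [0, 1],
})

# Count how many columns share each dtype; casting to str gives readable labels.
dtype_counts = toy_df.dtypes.astype(str).value_counts()
print(dtype_counts)
```

Checking that the counts sum to the number of columns is a quick guard against miscounting.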

In [404]:
#check for missing values and sort in descending order
df.isna().sum().to_frame().sort_values(0,ascending = False).assign(Dtype=df.dtypes)

Unnamed: 0,0,Dtype
employment_occupation,13470,object
employment_industry,13330,object
health_insurance,12274,float64
income_poverty,4423,object
doctor_recc_h1n1,2160,float64
doctor_recc_seasonal,2160,float64
rent_or_own,2042,object
employment_status,1463,object
marital_status,1408,object
education,1407,object


In [405]:
#sort percentage of missing values and display data types
(df.isna().mean() * 100).sort_values(ascending=False).to_frame(name='Percentage_missing').assign(Dtype=df.dtypes).style.bar(subset=['Percentage_missing'], color="#67eef0")



Unnamed: 0,Percentage_missing,Dtype
employment_occupation,50.436215,object
employment_industry,49.912008,object
health_insurance,45.957989,float64
income_poverty,16.561201,object
doctor_recc_h1n1,8.087767,float64
doctor_recc_seasonal,8.087767,float64
rent_or_own,7.645936,object
employment_status,5.477965,object
marital_status,5.272026,object
education,5.268282,object


#### Categorical columns with missing values

From the missing-value analysis above, `employment_industry`, `employment_occupation` and `health_insurance` are each missing in roughly 46-50% of rows. We will drop these columns for simplicity, treating them as non-essential for this analysis.

In [406]:
#drop columns with significant missing values
df=df.drop(columns=['employment_industry', 'employment_occupation', 'health_insurance'])
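
The same drop can be driven by a missingness cutoff instead of hard-coded names. This is a sketch under an assumed threshold of 40% (`MISSING_THRESHOLD` is a hypothetical choice, not one stated in this notebook); judging by the percentages listed above, that cutoff would select the same three columns. The toy frame is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with different levels of missingness.
toy_df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "some_missing": [1.0, np.nan, 0.0, 1.0],
    "complete": [0, 1, 0, 1],
})

MISSING_THRESHOLD = 0.40  # assumed cutoff, chosen only for illustration

# Share of missing values per column, then the names exceeding the cutoff.
missing_share = toy_df.isna().mean()
cols_to_drop = missing_share[missing_share > MISSING_THRESHOLD].index.tolist()

toy_reduced = toy_df.drop(columns=cols_to_drop)
print(cols_to_drop)
```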

Columns like `income_poverty`, `rent_or_own`, `employment_status`, `marital_status` and `education` have a moderate amount of missing data. Since these are "object" (text) types, a mean or median cannot be used. For these we will fill in the missing values with 'Unknown' so that we do not make any assumptions about the person's life.

In [407]:
# Fill missing values with 'Unknown' in the categorical columns
cat_cols = ['income_poverty', 'rent_or_own', 'marital_status', 'employment_status', 'education']
for col in cat_cols:
    df[col] = df[col].fillna('Unknown')



For columns like `h1n1_knowledge`, `opinion_h1n1_risk` or `opinion_seas_risk`, the missingness is very low. We will impute these with the mode of the respective columns. The two `doctor_recc_*` columns (about 8% missing) are handled the same way.

In [408]:
print(df['h1n1_knowledge'].value_counts())
print()
print(df['opinion_seas_risk'].value_counts())
print()
print(df['opinion_h1n1_risk'].value_counts())

h1n1_knowledge
1.0    14598
2.0     9487
0.0     2506
Name: count, dtype: int64

opinion_seas_risk
2.0    8954
4.0    7630
1.0    5974
5.0    2958
3.0     677
Name: count, dtype: int64

opinion_h1n1_risk
2.0    9919
1.0    8139
4.0    5394
5.0    1750
3.0    1117
Name: count, dtype: int64


In [409]:
# Impute with the mode for columns with low missingness
low_missing_cols = ['h1n1_knowledge', 'opinion_seas_risk', 'opinion_h1n1_risk', 'doctor_recc_h1n1', 'doctor_recc_seasonal']
for col in low_missing_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

We have now handled missing values in the categorical columns, along with a handful of numeric survey columns imputed by mode. Next, we will handle missing values in the remaining numerical columns.

#### Numerical columns with missing values

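For numerical columns with missing values, common options are the mean, the median or the mode of the respective column. The remaining columns hold integer-coded survey responses (binary flags, ratings and household counts), so a mean could produce values that are not valid codes. We will therefore use the mode (the most frequent value) for this analysis.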

In [410]:
# Numeric columns that still contain missing values
numeric_df = df.select_dtypes(include=['number'])
num_cols = numeric_df.columns[numeric_df.isna().any()]

# Fill each column with its most frequent value (mode)
for col in num_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

print(f'filled missing values for: {list(num_cols)}')

filled missing values for: ['h1n1_concern', 'behavioral_antiviral_meds', 'behavioral_avoidance', 'behavioral_face_mask', 'behavioral_wash_hands', 'behavioral_large_gatherings', 'behavioral_outside_home', 'behavioral_touch_face', 'chronic_med_condition', 'child_under_6_months', 'health_worker', 'opinion_h1n1_vacc_effective', 'opinion_h1n1_sick_from_vacc', 'opinion_seas_vacc_effective', 'opinion_seas_sick_from_vacc', 'household_adults', 'household_children']
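
The mode fill above is computed on the full dataset. If leakage from a later test split is a concern, one alternative (an assumption, not what this notebook does) is to fit scikit-learn's `SimpleImputer` with `strategy="most_frequent"` on the training split only. The toy data and variable names below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy features and labels; values are made up.
toy_X = pd.DataFrame({
    "h1n1_concern": [1.0, 2.0, 2.0, np.nan, 3.0, 2.0, np.nan, 1.0],
    "health_worker": [0.0, 0.0, np.nan, 0.0, 1.0, 0.0, 0.0, 1.0],
})
toy_y = pd.Series([0, 1, 0, 0, 1, 0, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42
)

# Learn each column's mode from the training rows only, then reuse it on the test rows.
imputer = SimpleImputer(strategy="most_frequent")
X_train_filled = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_filled = pd.DataFrame(
    imputer.transform(X_test), columns=X_test.columns, index=X_test.index
)
```

The imputer can also be placed inside a `ColumnTransformer` or pipeline so that cross-validation refits it on each fold.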


In [411]:
#recheck missing values
(df.isna().mean() * 100).sort_values(ascending=False).to_frame(name='Percentage_missing').assign(Dtype=df.dtypes).style.bar(subset=['Percentage_missing'], color="#67eef0")

Unnamed: 0,Percentage_missing,Dtype
respondent_id,0.0,int64
marital_status,0.0,object
opinion_seas_sick_from_vacc,0.0,float64
age_group,0.0,object
education,0.0,object
race,0.0,object
sex,0.0,object
income_poverty,0.0,object
rent_or_own,0.0,object
opinion_seas_vacc_effective,0.0,float64



We have handled missing values in both categorical and numerical columns. Next, we will encode categorical variables using one-hot encoding and label encoding as appropriate.


## PREPROCESSING


### Encoding Categorical Variables