# Flu Shot Learning: Predict H1N1 and Seasonal Flu Vaccines

## Business Understanding
### Overview
In 2009, during the H1N1 influenza outbreak, the U.S. National H1N1 Flu Survey collected extensive data on individuals' vaccination statuses, backgrounds, opinions, and health behaviors. This dataset provides an opportunity to analyze the factors that influenced people's decisions to receive the H1N1 and seasonal flu vaccines.

By predicting vaccination uptake based on these factors, insights can be gained to improve the design and communication strategies for future vaccination campaigns. Public health officials can use these insights to tailor outreach efforts, allocate resources efficiently, and address vaccine hesitancy more effectively.

### Business Problem
The primary goal of this project is to predict whether an individual received the H1N1 flu vaccine based on their demographic information, personal beliefs, and health behaviors. This binary classification task can help public health agencies:

1. Identify patterns among populations who are more or less likely to get vaccinated.
2. Understand barriers to vaccine adoption, such as misconceptions, trust issues, or socio-economic challenges.
3. Develop targeted interventions to increase vaccination rates, especially in communities where uptake is low.
4. Optimize communication strategies by identifying which beliefs and behaviors are most strongly associated with vaccination decisions.

By solving this problem, public health organizations can improve vaccination outreach and preparedness for future epidemics or pandemics, ultimately protecting more people from preventable diseases.

## Data Understanding
The datasets used for this project were obtained from [DrivenData](https://www.drivendata.org/competitions/66/flu-shot-learning/). Here, I review the datasets to assess their structure and characteristics.

In [24]:
# Import necessary packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve

In [2]:
# Loading the datasets

X_train = pd.read_csv('Data/training_set_features.csv')
X_test = pd.read_csv('Data/test_set_features.csv')
y_train = pd.read_csv('Data/training_set_labels.csv')

### a) .head()
Displays the first five rows of the data.

In [3]:
X_train.head()

|  | respondent_id | h1n1_concern | h1n1_knowledge | behavioral_antiviral_meds | behavioral_avoidance | behavioral_face_mask | behavioral_wash_hands | behavioral_large_gatherings | behavioral_outside_home | behavioral_touch_face | ... | income_poverty | marital_status | rent_or_own | employment_status | hhs_geo_region | census_msa | household_adults | household_children | employment_industry | employment_occupation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | Below Poverty | Not Married | Own | Not in Labor Force | oxchjgsf | Non-MSA | 0.0 | 0.0 | NaN | NaN |
| 1 | 1 | 3.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | ... | Below Poverty | Not Married | Rent | Employed | bhuqouqj | MSA, Not Principle City | 0.0 | 0.0 | pxcmvdjn | xgwztkwe |
| 2 | 2 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | <= $75,000, Above Poverty | Not Married | Own | Employed | qufhixun | MSA, Not Principle City | 2.0 | 0.0 | rucpziij | xtkaffoo |
| 3 | 3 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | Below Poverty | Not Married | Rent | Not in Labor Force | lrircsnp | MSA, Principle City | 0.0 | 0.0 | NaN | NaN |
| 4 | 4 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | <= $75,000, Above Poverty | Married | Own | Employed | qufhixun | MSA, Not Principle City | 1.0 | 0.0 | wxleyezf | emcorrxb |


In [4]:
X_test.head()

|  | respondent_id | h1n1_concern | h1n1_knowledge | behavioral_antiviral_meds | behavioral_avoidance | behavioral_face_mask | behavioral_wash_hands | behavioral_large_gatherings | behavioral_outside_home | behavioral_touch_face | ... | income_poverty | marital_status | rent_or_own | employment_status | hhs_geo_region | census_msa | household_adults | household_children | employment_industry | employment_occupation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 26707 | 2.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | > $75,000 | Not Married | Rent | Employed | mlyzmhmf | MSA, Not Principle City | 1.0 | 0.0 | atmlpfrs | hfxkjkmi |
| 1 | 26708 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Below Poverty | Not Married | Rent | Employed | bhuqouqj | Non-MSA | 3.0 | 0.0 | atmlpfrs | xqwwgdyp |
| 2 | 26709 | 2.0 | 2.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | > $75,000 | Married | Own | Employed | lrircsnp | Non-MSA | 1.0 | 0.0 | nduyfdeo | pvmttkik |
| 3 | 26710 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | <= $75,000, Above Poverty | Married | Own | Not in Labor Force | lrircsnp | MSA, Not Principle City | 1.0 | 0.0 | NaN | NaN |
| 4 | 26711 | 3.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | <= $75,000, Above Poverty | Not Married | Own | Employed | lzgpxyit | Non-MSA | 0.0 | 1.0 | fcxhlnwr | mxkfnird |


In [5]:
y_train.head()

|  | respondent_id | h1n1_vaccine | seasonal_vaccine |
| --- | --- | --- | --- |
| 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 |
| 2 | 2 | 0 | 0 |
| 3 | 3 | 0 | 1 |
| 4 | 4 | 0 | 0 |


### b) .info()
Gives general information on the data and each column.

In [6]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 36 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26615 non-null  float64
 2   h1n1_knowledge               26591 non-null  float64
 3   behavioral_antiviral_meds    26636 non-null  float64
 4   behavioral_avoidance         26499 non-null  float64
 5   behavioral_face_mask         26688 non-null  float64
 6   behavioral_wash_hands        26665 non-null  float64
 7   behavioral_large_gatherings  26620 non-null  float64
 8   behavioral_outside_home      26625 non-null  float64
 9   behavioral_touch_face        26579 non-null  float64
 10  doctor_recc_h1n1             24547 non-null  float64
 11  doctor_recc_seasonal         24547 non-null  float64
 12  chronic_med_condition        25736 non-null  float64
 13  child_under_6_mo

In [7]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26708 entries, 0 to 26707
Data columns (total 36 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26708 non-null  int64  
 1   h1n1_concern                 26623 non-null  float64
 2   h1n1_knowledge               26586 non-null  float64
 3   behavioral_antiviral_meds    26629 non-null  float64
 4   behavioral_avoidance         26495 non-null  float64
 5   behavioral_face_mask         26689 non-null  float64
 6   behavioral_wash_hands        26668 non-null  float64
 7   behavioral_large_gatherings  26636 non-null  float64
 8   behavioral_outside_home      26626 non-null  float64
 9   behavioral_touch_face        26580 non-null  float64
 10  doctor_recc_h1n1             24548 non-null  float64
 11  doctor_recc_seasonal         24548 non-null  float64
 12  chronic_med_condition        25776 non-null  float64
 13  child_under_6_mo

In [8]:
y_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   respondent_id     26707 non-null  int64
 1   h1n1_vaccine      26707 non-null  int64
 2   seasonal_vaccine  26707 non-null  int64
dtypes: int64(3)
memory usage: 626.1 KB


### c) .describe()
Gives summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numerical columns.

In [9]:
X_train.describe()

|  | respondent_id | h1n1_concern | h1n1_knowledge | behavioral_antiviral_meds | behavioral_avoidance | behavioral_face_mask | behavioral_wash_hands | behavioral_large_gatherings | behavioral_outside_home | behavioral_touch_face | ... | health_worker | health_insurance | opinion_h1n1_vacc_effective | opinion_h1n1_risk | opinion_h1n1_sick_from_vacc | opinion_seas_vacc_effective | opinion_seas_risk | opinion_seas_sick_from_vacc | household_adults | household_children |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 26707.0 | 26615.0 | 26591.0 | 26636.0 | 26499.0 | 26688.0 | 26665.0 | 26620.0 | 26625.0 | 26579.0 | ... | 25903.0 | 14433.0 | 26316.0 | 26319.0 | 26312.0 | 26245.0 | 26193.0 | 26170.0 | 26458.0 | 26458.0 |
| mean | 13353.0 | 1.618486 | 1.262532 | 0.048844 | 0.725612 | 0.068982 | 0.825614 | 0.35864 | 0.337315 | 0.677264 | ... | 0.111918 | 0.87972 | 3.850623 | 2.342566 | 2.35767 | 4.025986 | 2.719162 | 2.118112 | 0.886499 | 0.534583 |
| std | 7709.791156 | 0.910311 | 0.618149 | 0.215545 | 0.446214 | 0.253429 | 0.379448 | 0.47961 | 0.472802 | 0.467531 | ... | 0.315271 | 0.3253 | 1.007436 | 1.285539 | 1.362766 | 1.086565 | 1.385055 | 1.33295 | 0.753422 | 0.928173 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 6676.5 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 3.0 | 1.0 | 1.0 | 4.0 | 2.0 | 1.0 | 0.0 | 0.0 |
| 50% | 13353.0 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 4.0 | 2.0 | 2.0 | 4.0 | 2.0 | 2.0 | 1.0 | 0.0 |
| 75% | 20029.5 | 2.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 1.0 | 5.0 | 4.0 | 4.0 | 5.0 | 4.0 | 4.0 | 1.0 | 1.0 |
| max | 26706.0 | 3.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 3.0 | 3.0 |


In [10]:
X_test.describe()

|  | respondent_id | h1n1_concern | h1n1_knowledge | behavioral_antiviral_meds | behavioral_avoidance | behavioral_face_mask | behavioral_wash_hands | behavioral_large_gatherings | behavioral_outside_home | behavioral_touch_face | ... | health_worker | health_insurance | opinion_h1n1_vacc_effective | opinion_h1n1_risk | opinion_h1n1_sick_from_vacc | opinion_seas_vacc_effective | opinion_seas_risk | opinion_seas_sick_from_vacc | household_adults | household_children |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 26708.0 | 26623.0 | 26586.0 | 26629.0 | 26495.0 | 26689.0 | 26668.0 | 26636.0 | 26626.0 | 26580.0 | ... | 25919.0 | 14480.0 | 26310.0 | 26328.0 | 26333.0 | 26256.0 | 26209.0 | 26187.0 | 26483.0 | 26483.0 |
| mean | 40060.5 | 1.623145 | 1.266042 | 0.049645 | 0.729798 | 0.069279 | 0.826084 | 0.351517 | 0.337227 | 0.683747 | ... | 0.111501 | 0.887914 | 3.844622 | 2.326838 | 2.360612 | 4.024832 | 2.708688 | 2.143392 | 0.89431 | 0.543745 |
| std | 7710.079831 | 0.902755 | 0.615617 | 0.217215 | 0.444072 | 0.253934 | 0.379045 | 0.477453 | 0.472772 | 0.465022 | ... | 0.314758 | 0.315483 | 1.00757 | 1.275636 | 1.359413 | 1.083204 | 1.376045 | 1.339102 | 0.754244 | 0.935057 |
| min | 26707.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 33383.75 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 3.0 | 1.0 | 1.0 | 4.0 | 2.0 | 1.0 | 0.0 | 0.0 |
| 50% | 40060.5 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 4.0 | 2.0 | 2.0 | 4.0 | 2.0 | 2.0 | 1.0 | 0.0 |
| 75% | 46737.25 | 2.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 1.0 | 5.0 | 4.0 | 4.0 | 5.0 | 4.0 | 4.0 | 1.0 | 1.0 |
| max | 53414.0 | 3.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 3.0 | 3.0 |


In [11]:
y_train.describe()

|  | respondent_id | h1n1_vaccine | seasonal_vaccine |
| --- | --- | --- | --- |
| count | 26707.0 | 26707.0 | 26707.0 |
| mean | 13353.0 | 0.212454 | 0.465608 |
| std | 7709.791156 | 0.409052 | 0.498825 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 6676.5 | 0.0 | 0.0 |
| 50% | 13353.0 | 0.0 | 0.0 |
| 75% | 20029.5 | 0.0 | 1.0 |
| max | 26706.0 | 1.0 | 1.0 |
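
Because both targets are coded 0/1, the `mean` row is the share of vaccinated respondents: about 21% for `h1n1_vaccine` and about 47% for `seasonal_vaccine`, so the H1N1 target is noticeably imbalanced. A minimal sketch of that equivalence, using a made-up labels frame (the values are illustrative, not survey data):

```python
import pandas as pd

# Toy labels frame; values are illustrative, not taken from the survey.
labels = pd.DataFrame({
    "h1n1_vaccine": [0, 0, 0, 1, 0],
    "seasonal_vaccine": [0, 1, 0, 1, 0],
})

# For a 0/1 column, the mean equals the proportion of positives.
positive_share = labels.mean()

# value_counts(normalize=True) reports the same share, split by class.
h1n1_balance = labels["h1n1_vaccine"].value_counts(normalize=True)
```

On the real data, the same calls on `y_train` should reproduce the `mean` values reported by `.describe()`.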


## Data Preparation

### Data Merging

In [12]:
# Merge X_train and y_train on respondent_id

train_data = pd.merge(X_train, y_train, how='left', on='respondent_id')
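
A left merge silently keeps rows with no matching label and duplicates rows if an id repeats. One way to guard against both is sketched below on toy frames (the frame names and values are assumptions for illustration); `validate` and `indicator` are standard `pd.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for X_train and y_train; values are illustrative only.
features = pd.DataFrame({"respondent_id": [0, 1, 2],
                         "h1n1_concern": [1.0, 3.0, 1.0]})
labels = pd.DataFrame({"respondent_id": [0, 1, 2],
                       "h1n1_vaccine": [0, 0, 1]})

merged = pd.merge(
    features, labels,
    how="left", on="respondent_id",
    validate="one_to_one",  # raises MergeError if an id repeats on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Count feature rows that found no label, then drop the helper column.
unmatched = int((merged["_merge"] != "both").sum())
merged = merged.drop(columns="_merge")
```

If `unmatched` is zero and no error is raised, every respondent has exactly one label row.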

### Data Cleaning
This step checks each dataset for duplicate rows and missing values and handles any that are found.

#### 1. train_data

In [13]:
# Check for duplicates

train_data.duplicated().sum()

0

There are no duplicated rows in the data.

In [14]:
# Check for missing values

train_data.isna().sum()[train_data.isna().sum() > 0]

h1n1_concern                      92
h1n1_knowledge                   116
behavioral_antiviral_meds         71
behavioral_avoidance             208
behavioral_face_mask              19
behavioral_wash_hands             42
behavioral_large_gatherings       87
behavioral_outside_home           82
behavioral_touch_face            128
doctor_recc_h1n1                2160
doctor_recc_seasonal            2160
chronic_med_condition            971
child_under_6_months             820
health_worker                    804
health_insurance               12274
opinion_h1n1_vacc_effective      391
opinion_h1n1_risk                388
opinion_h1n1_sick_from_vacc      395
opinion_seas_vacc_effective      462
opinion_seas_risk                514
opinion_seas_sick_from_vacc      537
education                       1407
income_poverty                  4423
marital_status                  1408
rent_or_own                     2042
employment_status               1463
household_adults                 249
h

Several columns have missing values. For example, `health_insurance` is missing in 12,274 of 26,707 rows (about 46%).
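
Raw counts are hard to compare with a proportion-based threshold such as the 30% cutoff used in the next step. A minimal sketch, on a toy frame with made-up values, of viewing missingness as a share of rows:

```python
import numpy as np
import pandas as pd

# Toy frame; column names mirror the survey, values are illustrative.
toy = pd.DataFrame({
    "h1n1_concern": [1.0, np.nan, 2.0, 3.0],
    "health_insurance": [np.nan, np.nan, 1.0, np.nan],
    "rent_or_own": ["Own", "Rent", "Own", "Rent"],
})

# isna().mean() gives the fraction of missing rows per column.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Columns whose missing share exceeds a 30% threshold.
above_threshold = missing_share[missing_share > 0.3].index.tolist()
```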

**Dealing with missing values**

In [15]:
# Function to drop columns with more than 30% missing values

def drop_high_missing_cols(df, threshold=0.3):
    """
    Drops columns with more than a given threshold of missing values.

    Parameters:
    - df (pd.DataFrame): The input DataFrame.
    - threshold (float): The proportion of missing values above which columns are dropped (default is 0.3).

    Returns:
    - pd.DataFrame: The DataFrame with high-missing-value columns dropped.
    """
    # Calculate the proportion of missing values for each column
    missing_proportion = df.isna().mean()
    
    # Identify columns where the missing proportion exceeds the threshold
    cols_to_drop = missing_proportion[missing_proportion > threshold].index
    
    # Drop the identified columns
    df_cleaned = df.drop(columns=cols_to_drop)
    
    return df_cleaned

In [16]:
# Function to fill missing values 

def fill_missing_values(df):
    """
    Fills missing values in the DataFrame:
    - For float columns, fill with the median.
    - For object (string) columns, fill with 'unknown'.

    Parameters:
    - df (pd.DataFrame): The input DataFrame.

    Returns:
    - pd.DataFrame: The DataFrame with missing values filled.
    """
    df_filled = df.copy()

    # Fill float columns with the median
    float_cols = df_filled.select_dtypes(include=['float']).columns
    for col in float_cols:
        median_value = df_filled[col].median()
        # Assign back instead of calling fillna(inplace=True) on a column
        # selection, which recent pandas versions warn about and which
        # may not modify df_filled
        df_filled[col] = df_filled[col].fillna(median_value)

    # Fill object (string) columns with 'unknown'
    object_cols = df_filled.select_dtypes(include=['object']).columns
    for col in object_cols:
        df_filled[col] = df_filled[col].fillna('unknown')

    return df_filled


In [17]:
# Drop columns with more than 30% of null values
train_data_cleaned = drop_high_missing_cols(train_data, threshold=0.3)

# Fill missing values
train_data_cleaned = fill_missing_values(train_data_cleaned)

In [18]:
# Check whether any missing values remain

train_data_cleaned.isna().sum()[train_data_cleaned.isna().sum() > 0]

Series([], dtype: int64)

There are no longer any missing values.

#### 2. X_test

In [19]:
# Check for duplicates

X_test.duplicated().sum()

0

There are no duplicated rows.

In [20]:
# Check for missing values

X_test.isna().sum()[X_test.isna().sum() > 0]

h1n1_concern                      85
h1n1_knowledge                   122
behavioral_antiviral_meds         79
behavioral_avoidance             213
behavioral_face_mask              19
behavioral_wash_hands             40
behavioral_large_gatherings       72
behavioral_outside_home           82
behavioral_touch_face            128
doctor_recc_h1n1                2160
doctor_recc_seasonal            2160
chronic_med_condition            932
child_under_6_months             813
health_worker                    789
health_insurance               12228
opinion_h1n1_vacc_effective      398
opinion_h1n1_risk                380
opinion_h1n1_sick_from_vacc      375
opinion_seas_vacc_effective      452
opinion_seas_risk                499
opinion_seas_sick_from_vacc      521
education                       1407
income_poverty                  4497
marital_status                  1442
rent_or_own                     2036
employment_status               1471
household_adults                 225
h

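Several columns have missing values here as well. For example, `health_insurance` is missing in 12,228 of 26,708 rows (about 46%).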

**Dealing with missing values**

In [21]:
# Drop columns with more than 30% of null values
X_test_cleaned = drop_high_missing_cols(X_test, threshold=0.3)

# Fill missing values
X_test_cleaned = fill_missing_values(X_test_cleaned)
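
Cleaning the test set independently means its dropped columns and fill medians are computed from the test data itself, which can differ from the training data. A common alternative, sketched here on toy frames, is to decide both from the training set and reuse them on the test set. This is a swapped-in approach rather than what the cells above do, and all names and values in it are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train and test features; values are illustrative.
train = pd.DataFrame({
    "h1n1_concern": [1.0, 2.0, np.nan, 3.0],
    "health_insurance": [np.nan, np.nan, 1.0, np.nan],
    "rent_or_own": ["Own", None, "Rent", "Own"],
})
test = pd.DataFrame({
    "h1n1_concern": [np.nan, 0.0],
    "health_insurance": [1.0, 1.0],
    "rent_or_own": ["Rent", None],
})

# Decide which columns to drop from the training data only.
cols_to_drop = train.columns[train.isna().mean() > 0.3]
train_kept = train.drop(columns=cols_to_drop)
test_kept = test.drop(columns=cols_to_drop)

# Learn fill values from the training data only: medians for numeric
# columns, the placeholder 'unknown' for everything else.
train_medians = train_kept.select_dtypes(include="number").median()
fill_values = train_medians.to_dict()
for col in train_kept.columns:
    if col not in fill_values:
        fill_values[col] = "unknown"

# Apply the same fill values to both sets.
train_filled = train_kept.fillna(fill_values)
test_filled = test_kept.fillna(fill_values)
```

Here the test row with a missing `h1n1_concern` receives the training median, and both frames end up with the same columns.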

In [22]:
# Check whether any missing values remain

X_test_cleaned.isna().sum()[X_test_cleaned.isna().sum() > 0]

Series([], dtype: int64)

There are no longer any missing values.

### Encoding
Converting categorical variables into numerical values. 

In [25]:
# Function to encode categorical columns

def encode_categorical_columns(df):
    """
    Encodes categorical columns in a DataFrame.
    - One-hot encodes nominal categorical columns (with >2 unique values).
    - Label-encodes binary categorical columns (with 2 unique values).

    Parameters:
        df (pd.DataFrame): The DataFrame containing categorical columns.

    Returns:
        pd.DataFrame: The DataFrame with encoded categorical columns.
    """
    # Create a copy to avoid modifying the original DataFrame
    df_encoded = df.copy()
    
    # Identify categorical columns
    categorical_cols = df_encoded.select_dtypes(include=['object']).columns

    # Initialize LabelEncoder for binary encoding
    label_encoder = LabelEncoder()

    for col in categorical_cols:
        unique_vals = df_encoded[col].dropna().unique()
        
        # If the column has exactly 2 unique values, apply Label Encoding (binary)
        if len(unique_vals) == 2:
            df_encoded[col] = label_encoder.fit_transform(df_encoded[col])
        # If the column has more than 2 unique values, apply One-Hot Encoding (nominal)
        elif len(unique_vals) > 2:
            df_encoded = pd.get_dummies(df_encoded, columns=[col], drop_first=True)

    return df_encoded


In [26]:
# Encoding train_data_cleaned
train_data_encoded = encode_categorical_columns(train_data_cleaned)

# Encoding X_test_cleaned
X_test_encoded = encode_categorical_columns(X_test_cleaned)
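
Encoding the two sets separately can produce different dummy columns whenever a category appears in only one of them, and a model fitted on the training columns would then reject the test frame. A minimal sketch of one way to align them, using `DataFrame.reindex` on toy data (frame names and values are illustrative assumptions):

```python
import pandas as pd

# Toy frames: the test set lacks the "MSA, Principle City" category.
train = pd.DataFrame({"census_msa": ["Non-MSA", "MSA, Principle City",
                                     "MSA, Not Principle City"]})
test = pd.DataFrame({"census_msa": ["Non-MSA", "Non-MSA",
                                    "MSA, Not Principle City"]})

# Encode each set separately, as in the cell above (dtype=int for 0/1 output).
train_enc = pd.get_dummies(train, columns=["census_msa"], drop_first=True, dtype=int)
test_enc = pd.get_dummies(test, columns=["census_msa"], drop_first=True, dtype=int)

# Give the test frame exactly the training columns, in the same order;
# dummies absent from the test set are added and filled with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```

On the real frames, the target columns `h1n1_vaccine` and `seasonal_vaccine` would need to be excluded from the training column list before reindexing, since the test features do not contain them. Alignment also does not help if `drop_first` removes a different baseline category in each set, so it is worth comparing the two column lists before relying on it.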