# Flu Shot Learning: Predict H1N1 and Seasonal Flu Vaccines

## Business Understanding
### Overview
In 2009, during the H1N1 influenza outbreak, the U.S. National H1N1 Flu Survey collected extensive data on individuals' vaccination statuses, backgrounds, opinions, and health behaviors. This dataset provides an opportunity to analyze the factors that influenced people's decisions to receive the H1N1 and seasonal flu vaccines.

By predicting vaccination uptake based on these factors, insights can be gained to improve the design and communication strategies for future vaccination campaigns. Public health officials can use these insights to tailor outreach efforts, allocate resources efficiently, and address vaccine hesitancy more effectively.

### Business Problem
The primary goal of this project is to predict whether an individual received the H1N1 flu vaccine, and whether they received the seasonal flu vaccine, based on their demographic information, personal beliefs, and health behaviors. These two binary classification tasks can help public health agencies:

1. Identify patterns among populations who are more or less likely to get vaccinated.
2. Understand barriers to vaccine adoption, such as misconceptions, trust issues, or socio-economic challenges.
3. Develop targeted interventions to increase vaccination rates, especially in communities where uptake is low.
4. Optimize communication strategies by identifying which beliefs and behaviors are most strongly associated with vaccination decisions.

By solving this problem, public health organizations can improve vaccination outreach and preparedness for future epidemics or pandemics, ultimately protecting more people from preventable diseases.

## Data Understanding
The datasets used for this project were obtained from [DrivenData](https://www.drivendata.org/competitions/66/flu-shot-learning/). Here, I review them to assess the structure and characteristics of the data.

In [1]:
# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
import xgboost as xgb
from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score

In [2]:
# Loading the datasets
train_features = pd.read_csv('Data/training_set_features.csv', index_col='respondent_id')
test_features = pd.read_csv('Data/test_set_features.csv', index_col='respondent_id')
train_labels = pd.read_csv('Data/training_set_labels.csv', index_col='respondent_id')

### a) .head()
Displays the first five rows of the data.

In [3]:
train_features.head()

respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,doctor_recc_h1n1,...,income_poverty,marital_status,rent_or_own,employment_status,hhs_geo_region,census_msa,household_adults,household_children,employment_industry,employment_occupation
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,Below Poverty,Not Married,Own,Not in Labor Force,oxchjgsf,Non-MSA,0.0,0.0,,
1,3.0,2.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,...,Below Poverty,Not Married,Rent,Employed,bhuqouqj,"MSA, Not Principle City",0.0,0.0,pxcmvdjn,xgwztkwe
2,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,,...,"<= $75,000, Above Poverty",Not Married,Own,Employed,qufhixun,"MSA, Not Principle City",2.0,0.0,rucpziij,xtkaffoo
3,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,...,Below Poverty,Not Married,Rent,Not in Labor Force,lrircsnp,"MSA, Principle City",0.0,0.0,,
4,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,...,"<= $75,000, Above Poverty",Married,Own,Employed,qufhixun,"MSA, Not Principle City",1.0,0.0,wxleyezf,emcorrxb


In [4]:
test_features.head()

respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,doctor_recc_h1n1,...,income_poverty,marital_status,rent_or_own,employment_status,hhs_geo_region,census_msa,household_adults,household_children,employment_industry,employment_occupation
26707,2.0,2.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,...,"> $75,000",Not Married,Rent,Employed,mlyzmhmf,"MSA, Not Principle City",1.0,0.0,atmlpfrs,hfxkjkmi
26708,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,Below Poverty,Not Married,Rent,Employed,bhuqouqj,Non-MSA,3.0,0.0,atmlpfrs,xqwwgdyp
26709,2.0,2.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,...,"> $75,000",Married,Own,Employed,lrircsnp,Non-MSA,1.0,0.0,nduyfdeo,pvmttkik
26710,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,"<= $75,000, Above Poverty",Married,Own,Not in Labor Force,lrircsnp,"MSA, Not Principle City",1.0,0.0,,
26711,3.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,...,"<= $75,000, Above Poverty",Not Married,Own,Employed,lzgpxyit,Non-MSA,0.0,1.0,fcxhlnwr,mxkfnird


In [5]:
train_labels.head()

respondent_id,h1n1_vaccine,seasonal_vaccine
0,0,0
1,0,1
2,0,0
3,0,1
4,0,0


### b) .info()
Gives general information on the data and each column.

In [6]:
train_features.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26707 entries, 0 to 26706
Data columns (total 35 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   h1n1_concern                 26615 non-null  float64
 1   h1n1_knowledge               26591 non-null  float64
 2   behavioral_antiviral_meds    26636 non-null  float64
 3   behavioral_avoidance         26499 non-null  float64
 4   behavioral_face_mask         26688 non-null  float64
 5   behavioral_wash_hands        26665 non-null  float64
 6   behavioral_large_gatherings  26620 non-null  float64
 7   behavioral_outside_home      26625 non-null  float64
 8   behavioral_touch_face        26579 non-null  float64
 9   doctor_recc_h1n1             24547 non-null  float64
 10  doctor_recc_seasonal         24547 non-null  float64
 11  chronic_med_condition        25736 non-null  float64
 12  child_under_6_months         25887 non-null  float64
 13  health_worker        

In [7]:
test_features.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26708 entries, 26707 to 53414
Data columns (total 35 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   h1n1_concern                 26623 non-null  float64
 1   h1n1_knowledge               26586 non-null  float64
 2   behavioral_antiviral_meds    26629 non-null  float64
 3   behavioral_avoidance         26495 non-null  float64
 4   behavioral_face_mask         26689 non-null  float64
 5   behavioral_wash_hands        26668 non-null  float64
 6   behavioral_large_gatherings  26636 non-null  float64
 7   behavioral_outside_home      26626 non-null  float64
 8   behavioral_touch_face        26580 non-null  float64
 9   doctor_recc_h1n1             24548 non-null  float64
 10  doctor_recc_seasonal         24548 non-null  float64
 11  chronic_med_condition        25776 non-null  float64
 12  child_under_6_months         25895 non-null  float64
 13  health_worker    

In [8]:
train_labels.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26707 entries, 0 to 26706
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   h1n1_vaccine      26707 non-null  int64
 1   seasonal_vaccine  26707 non-null  int64
dtypes: int64(2)
memory usage: 625.9 KB


### c) .describe()
Gives summary statistics such as count, mean, and standard deviation for columns with numerical data.

In [9]:
train_features.describe()

statistic,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,doctor_recc_h1n1,...,health_worker,health_insurance,opinion_h1n1_vacc_effective,opinion_h1n1_risk,opinion_h1n1_sick_from_vacc,opinion_seas_vacc_effective,opinion_seas_risk,opinion_seas_sick_from_vacc,household_adults,household_children
count,26615.0,26591.0,26636.0,26499.0,26688.0,26665.0,26620.0,26625.0,26579.0,24547.0,...,25903.0,14433.0,26316.0,26319.0,26312.0,26245.0,26193.0,26170.0,26458.0,26458.0
mean,1.618486,1.262532,0.048844,0.725612,0.068982,0.825614,0.35864,0.337315,0.677264,0.220312,...,0.111918,0.87972,3.850623,2.342566,2.35767,4.025986,2.719162,2.118112,0.886499,0.534583
std,0.910311,0.618149,0.215545,0.446214,0.253429,0.379448,0.47961,0.472802,0.467531,0.414466,...,0.315271,0.3253,1.007436,1.285539,1.362766,1.086565,1.385055,1.33295,0.753422,0.928173
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0
25%,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,3.0,1.0,1.0,4.0,2.0,1.0,0.0,0.0
50%,2.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,4.0,2.0,2.0,4.0,2.0,2.0,1.0,0.0
75%,2.0,2.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,...,0.0,1.0,5.0,4.0,4.0,5.0,4.0,4.0,1.0,1.0
max,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,5.0,5.0,5.0,5.0,5.0,5.0,3.0,3.0


In [10]:
test_features.describe()

statistic,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,doctor_recc_h1n1,...,health_worker,health_insurance,opinion_h1n1_vacc_effective,opinion_h1n1_risk,opinion_h1n1_sick_from_vacc,opinion_seas_vacc_effective,opinion_seas_risk,opinion_seas_sick_from_vacc,household_adults,household_children
count,26623.0,26586.0,26629.0,26495.0,26689.0,26668.0,26636.0,26626.0,26580.0,24548.0,...,25919.0,14480.0,26310.0,26328.0,26333.0,26256.0,26209.0,26187.0,26483.0,26483.0
mean,1.623145,1.266042,0.049645,0.729798,0.069279,0.826084,0.351517,0.337227,0.683747,0.222666,...,0.111501,0.887914,3.844622,2.326838,2.360612,4.024832,2.708688,2.143392,0.89431,0.543745
std,0.902755,0.615617,0.217215,0.444072,0.253934,0.379045,0.477453,0.472772,0.465022,0.416044,...,0.314758,0.315483,1.00757,1.275636,1.359413,1.083204,1.376045,1.339102,0.754244,0.935057
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0
25%,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,3.0,1.0,1.0,4.0,2.0,1.0,0.0,0.0
50%,2.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,4.0,2.0,2.0,4.0,2.0,2.0,1.0,0.0
75%,2.0,2.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,...,0.0,1.0,5.0,4.0,4.0,5.0,4.0,4.0,1.0,1.0
max,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,5.0,5.0,5.0,5.0,5.0,5.0,3.0,3.0


In [11]:
train_labels.describe()

statistic,h1n1_vaccine,seasonal_vaccine
count,26707.0,26707.0
mean,0.212454,0.465608
std,0.409052,0.498825
min,0.0,0.0
25%,0.0,0.0
50%,0.0,0.0
75%,0.0,1.0
max,1.0,1.0


## Data Preparation

### Data Merging

In [12]:
# Merge training features with their labels on respondent_id
train_data = pd.merge(train_features, train_labels, how='left', on='respondent_id')
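
Because both frames are indexed by `respondent_id`, the merge should pair each feature row with exactly one label row. The following is a minimal sketch, using small invented frames rather than the survey data, of how the `validate` argument of `pd.merge` can enforce that assumption. The notebook itself does not use `validate`.

```python
import pandas as pd

# Toy stand-ins for train_features and train_labels (values are invented)
features = pd.DataFrame(
    {"h1n1_concern": [1.0, 3.0, 2.0]},
    index=pd.Index([0, 1, 2], name="respondent_id"),
)
labels = pd.DataFrame(
    {"h1n1_vaccine": [0, 0, 1], "seasonal_vaccine": [0, 1, 1]},
    index=pd.Index([0, 1, 2], name="respondent_id"),
)

# validate="one_to_one" raises pandas.errors.MergeError if either side
# contains duplicate respondent_id values
merged = pd.merge(
    features, labels, how="left", on="respondent_id", validate="one_to_one"
)

# A left merge on a unique key should preserve the number of rows
print(len(features), len(merged))
```

If the keys were not unique on either side, the call would fail loudly instead of silently duplicating rows.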

### Data Cleaning
This step checks for duplicates and missing values and handles any that are found.

#### 1. train_data

In [13]:
# Check for duplicates
train_data.duplicated().sum()

0

There are no duplicated rows in the data.

In [14]:
# Check for missing values
train_data.isna().sum()[train_data.isna().sum() > 0]

h1n1_concern                      92
h1n1_knowledge                   116
behavioral_antiviral_meds         71
behavioral_avoidance             208
behavioral_face_mask              19
behavioral_wash_hands             42
behavioral_large_gatherings       87
behavioral_outside_home           82
behavioral_touch_face            128
doctor_recc_h1n1                2160
doctor_recc_seasonal            2160
chronic_med_condition            971
child_under_6_months             820
health_worker                    804
health_insurance               12274
opinion_h1n1_vacc_effective      391
opinion_h1n1_risk                388
opinion_h1n1_sick_from_vacc      395
opinion_seas_vacc_effective      462
opinion_seas_risk                514
opinion_seas_sick_from_vacc      537
education                       1407
income_poverty                  4423
marital_status                  1408
rent_or_own                     2042
employment_status               1463
household_adults                 249
h

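There are missing values. The largest gap shown is `health_insurance`, which is missing for 12,274 of 26,707 respondents (about 46%).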

**Dealing with missing values**

In [15]:
# Function to drop columns with more than 30% missing values
def drop_high_missing_cols(df, threshold=0.3):
    """
    Drops columns with more than a given threshold of missing values.

    Parameters:
    - df (pd.DataFrame): The input DataFrame.
    - threshold (float): The proportion of missing values above which columns are dropped (default is 0.3).

    Returns:
    - pd.DataFrame: The DataFrame with high-missing-value columns dropped.
    """
    # Calculate the proportion of missing values for each column
    missing_proportion = df.isna().mean()
    
    # Identify columns where the missing proportion exceeds the threshold
    cols_to_drop = missing_proportion[missing_proportion > threshold].index
    
    # Drop the identified columns
    df_cleaned = df.drop(columns=cols_to_drop)
    
    return df_cleaned

In [16]:
# Function to fill missing values 
def fill_missing_values(df):
    """
    Fills missing values in the DataFrame:
    - For float columns, fill with the median.
    - For object (string) columns, fill with 'unknown'.

    Parameters:
    - df (pd.DataFrame): The input DataFrame.

    Returns:
    - pd.DataFrame: The DataFrame with missing values filled.
    """
    df_filled = df.copy()

    # Fill float columns with the median
    float_cols = df_filled.select_dtypes(include=['float']).columns
    for col in float_cols:
        median_value = df_filled[col].median()
        # Assign back instead of using chained inplace fillna, which newer pandas versions warn about
        df_filled[col] = df_filled[col].fillna(median_value)

    # Fill object (string) columns with 'unknown'
    object_cols = df_filled.select_dtypes(include=['object']).columns
    for col in object_cols:
        df_filled[col] = df_filled[col].fillna('unknown')

    return df_filled


In [17]:
# Drop columns with more than 30% of null values
train_data_cleaned = drop_high_missing_cols(train_data, threshold=0.3)

# Fill missing values
train_data_cleaned = fill_missing_values(train_data_cleaned)

In [18]:
# Check if there is still any missing values
train_data_cleaned.isna().sum().sum()

0

There are no longer any missing values.

#### 2. test_features

In [19]:
# Check for duplicates
test_features.duplicated().sum()

0

There are no duplicated rows.

In [20]:
# Check for missing values
test_features.isna().sum()[test_features.isna().sum() > 0]

h1n1_concern                      85
h1n1_knowledge                   122
behavioral_antiviral_meds         79
behavioral_avoidance             213
behavioral_face_mask              19
behavioral_wash_hands             40
behavioral_large_gatherings       72
behavioral_outside_home           82
behavioral_touch_face            128
doctor_recc_h1n1                2160
doctor_recc_seasonal            2160
chronic_med_condition            932
child_under_6_months             813
health_worker                    789
health_insurance               12228
opinion_h1n1_vacc_effective      398
opinion_h1n1_risk                380
opinion_h1n1_sick_from_vacc      375
opinion_seas_vacc_effective      452
opinion_seas_risk                499
opinion_seas_sick_from_vacc      521
education                       1407
income_poverty                  4497
marital_status                  1442
rent_or_own                     2036
employment_status               1471
household_adults                 225
h

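There are missing values. As in the training data, the largest gap shown is `health_insurance`, missing for 12,228 of 26,708 respondents (about 46%).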

**Dealing with missing values**

In [21]:
# Drop columns with more than 30% of null values
X_test_cleaned = drop_high_missing_cols(test_features, threshold=0.3)

# Fill missing values
X_test_cleaned = fill_missing_values(X_test_cleaned)

In [22]:
# Check if there is still any missing values
X_test_cleaned.isna().sum().sum()

0

There are no longer any missing values.
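
The cleaning above is applied to the training and test sets independently, so the dropped columns and the medians could in principle differ between them. One possible alternative is to learn both from the training data and reuse them on the test data. The sketch below illustrates that idea on toy frames. The helper names `fit_cleaning` and `apply_cleaning` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def fit_cleaning(train_df, threshold=0.3):
    """Learn the columns to drop and the medians to use from training data only."""
    missing = train_df.isna().mean()
    cols_to_drop = list(missing[missing > threshold].index)
    kept = train_df.drop(columns=cols_to_drop)
    medians = kept.select_dtypes(include=["float"]).median()
    return cols_to_drop, medians

def apply_cleaning(df, cols_to_drop, medians):
    """Apply the training-derived drop list and medians to any DataFrame."""
    out = df.drop(columns=cols_to_drop, errors="ignore")
    # Fill float columns with the medians learned from the training data
    out = out.fillna(medians)
    # Fill the remaining (non-numeric) columns with 'unknown'
    text_cols = out.select_dtypes(exclude=["number"]).columns
    out[text_cols] = out[text_cols].fillna("unknown")
    return out

# Toy data (invented values): column "b" is 50% missing in train
train = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 5.0],
    "b": [np.nan, np.nan, 1.0, 0.0],
    "c": ["x", None, "y", "x"],
})
test = pd.DataFrame({
    "a": [np.nan, 2.0],
    "b": [1.0, 0.0],
    "c": [None, "y"],
})

cols_to_drop, medians = fit_cleaning(train)
train_clean = apply_cleaning(train, cols_to_drop, medians)
test_clean = apply_cleaning(test, cols_to_drop, medians)
```

Here the test set loses column `b` because the training set says so, and its missing `a` value is filled with the training median, which keeps the two sets on the same schema.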

### Data Preprocessing

#### Encoding
Converting categorical variables into numerical values. 

In [23]:
# Function to encode categorical columns
def encode_categorical_columns(df):
    """
    Encodes categorical columns in a DataFrame.
    - One-hot encodes nominal categorical columns (with >2 unique values).
    - Label-encodes binary categorical columns (with 2 unique values).

    Parameters:
        df (pd.DataFrame): The DataFrame containing categorical columns.

    Returns:
        pd.DataFrame: The DataFrame with encoded categorical columns.
    """
    # Create a copy to avoid modifying the original DataFrame
    df_encoded = df.copy()
    
    # Identify categorical columns
    categorical_cols = df_encoded.select_dtypes(include=['object']).columns

    # Initialize LabelEncoder for binary encoding
    label_encoder = LabelEncoder()

    for col in categorical_cols:
        unique_vals = df_encoded[col].dropna().unique()
        
        # If the column has exactly 2 unique values, apply Label Encoding (binary)
        if len(unique_vals) == 2:
            df_encoded[col] = label_encoder.fit_transform(df_encoded[col])
        # If the column has more than 2 unique values, apply One-Hot Encoding (nominal)
        elif len(unique_vals) > 2:
            df_encoded = pd.get_dummies(df_encoded, columns=[col], drop_first=True)

    return df_encoded


In [24]:
# Encoding train_data_cleaned
train_data_encoded = encode_categorical_columns(train_data_cleaned)

# Encoding X_test_cleaned
X_test_encoded = encode_categorical_columns(X_test_cleaned)
X_test_encoded.columns = X_test_encoded.columns.str.replace(r'[^A-Za-z0-9_]', '_', regex=True)
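
Because the training and test sets are encoded in separate calls, `pd.get_dummies` can produce different columns if a category appears in only one of them. The following is a minimal sketch, on toy data and without `drop_first` for simplicity, of aligning the test columns to the training columns with `DataFrame.reindex`. This alignment step is an assumption about what a prediction pipeline would need, not something the notebook performs here.

```python
import pandas as pd

# Toy data: the test set happens to contain only one of the three categories
train = pd.DataFrame(
    {"census_msa": ["Non-MSA", "MSA, Principle City", "MSA, Not Principle City"]}
)
test = pd.DataFrame({"census_msa": ["Non-MSA", "Non-MSA"]})

train_enc = pd.get_dummies(train, columns=["census_msa"])
test_enc = pd.get_dummies(test, columns=["census_msa"])

# Dummy columns missing from the test set are added and filled with 0,
# columns absent from the training set are dropped, and the order matches train
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
print(list(test_aligned.columns))
```

A model fitted on `train_enc` can then score `test_aligned` without a feature-name or feature-count mismatch.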

#### Train Test Split

In [25]:
# Define features and target variables
X = train_data_encoded.drop(columns=['h1n1_vaccine', 'seasonal_vaccine'])
y_h1n1 = train_data_encoded['h1n1_vaccine']
y_seasonal = train_data_encoded['seasonal_vaccine']

# Split the data into training and validation sets
# Passing both targets in one call guarantees they share the same row split
X_train, X_val, y_h1n1_train, y_h1n1_val, y_seasonal_train, y_seasonal_val = train_test_split(
    X, y_h1n1, y_seasonal, test_size=0.2, random_state=42
)
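
Only about 21% of respondents received the H1N1 vaccine (the mean of `h1n1_vaccine` in the `train_labels.describe()` output is 0.212454), so a purely random split can leave the validation set with a slightly different positive rate. The sketch below, on synthetic labels, shows the `stratify` option of `train_test_split`, which the notebook does not use. Stratifying on the H1N1 target alone is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))        # synthetic features
y_toy = np.array([1] * 210 + [0] * 790)   # roughly the H1N1 positive rate

# stratify=y_toy keeps the class proportions equal in both subsets
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_va.mean())
```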

## Modelling and Evaluation

I will use four models:
1. **Random Forest Classifier**
2. **XGBoost Classifier**
3. **CatBoost Classifier**
4. **Voting Classifier**

### 1. Random Forest Classifier

**Modelling**

In [26]:
# Initialize Random Forest models
h1n1_rf = RandomForestClassifier(random_state=42, class_weight='balanced')
seasonal_rf = RandomForestClassifier(random_state=42)

# Train the models
h1n1_rf.fit(X_train, y_h1n1_train)
seasonal_rf.fit(X_train, y_seasonal_train)

# Predict probabilities on the validation set
h1n1_val_preds = h1n1_rf.predict_proba(X_val)[:, 1]
seasonal_val_preds = seasonal_rf.predict_proba(X_val)[:, 1]

**Evaluation**

Model performance is evaluated with the ROC AUC score for each target, along with the mean of the two scores.
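
ROC AUC can be read as the probability that a randomly chosen vaccinated respondent receives a higher predicted score than a randomly chosen unvaccinated one. The small sketch below, with invented labels and scores, checks that reading against `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count as half
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

sk_auc = roc_auc_score(y_true, y_score)
print(pairwise_auc, sk_auc)
```

Both numbers agree, which is why ROC AUC is a measure of how well the model ranks respondents rather than of how accurate its probabilities are.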

In [27]:
# Calculate ROC AUC scores
h1n1_roc_auc = roc_auc_score(y_h1n1_val, h1n1_val_preds)
seasonal_roc_auc = roc_auc_score(y_seasonal_val, seasonal_val_preds)

# Calculate the mean ROC AUC score
overall_roc_auc = (h1n1_roc_auc + seasonal_roc_auc) / 2

print(f"H1N1 Vaccine ROC AUC (Random Forest): {h1n1_roc_auc:.5f}")
print(f"Seasonal Vaccine ROC AUC (Random Forest): {seasonal_roc_auc:.5f}")
print(f"Overall ROC AUC Score (Random Forest): {overall_roc_auc:.5f}")

H1N1 Vaccine ROC AUC (Random Forest): 0.82392
Seasonal Vaccine ROC AUC (Random Forest): 0.84755
Overall ROC AUC Score (Random Forest): 0.83573


- **H1N1 Vaccine ROC AUC: 0.8239**: This indicates good predictive performance for the H1N1 vaccine; values above 0.8 are generally considered strong discrimination.

- **Seasonal Vaccine ROC AUC: 0.8476**: This shows even better performance for predicting the seasonal flu vaccine.

✅ **Overall Score: 0.8357**: This is a strong performance, suggesting the models rank respondents well for both target variables. Note that ROC AUC measures ranking quality, not probability calibration.

#### Hyperparameter Tuning
To optimize the Random Forest model, we'll use `GridSearchCV` to tune hyperparameters like `n_estimators`, `max_depth`, and `min_samples_split`.
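
An exhaustive grid search fits one model per parameter combination per fold, so its cost grows multiplicatively with each added value. As a quick sketch, scikit-learn's `ParameterGrid` can count the fits for the Random Forest grid used in this section (the variable names here are illustrative):

```python
from sklearn.model_selection import ParameterGrid

# Same values as the Random Forest grid in this section
rf_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5, 10],
}

n_candidates = len(ParameterGrid(rf_grid))  # 3 * 3 * 3 combinations
n_fits = n_candidates * 5                   # cv=5 folds per combination

print(n_candidates, n_fits)  # 27 135
```

The same count explains the CatBoost log line later in the notebook: 32 candidates with 3 folds gives 96 fits.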

**Modelling**

In [28]:
# Define the hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10]
}

# Initialize GridSearchCV for H1N1 model
grid_search_h1n1 = GridSearchCV(h1n1_rf, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
grid_search_h1n1.fit(X_train, y_h1n1_train)

# Best parameters for H1N1 model
best_h1n1_params = grid_search_h1n1.best_params_

# Initialize GridSearchCV for Seasonal Flu model
grid_search_seasonal = GridSearchCV(seasonal_rf, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
grid_search_seasonal.fit(X_train, y_seasonal_train)

# Best parameters for Seasonal Flu model
best_seasonal_params = grid_search_seasonal.best_params_

# Train final models with best parameters
h1n1_rf_best = RandomForestClassifier(**best_h1n1_params, random_state=42, class_weight='balanced')
seasonal_rf_best = RandomForestClassifier(**best_seasonal_params, random_state=42)

h1n1_rf_best.fit(X_train, y_h1n1_train)
seasonal_rf_best.fit(X_train, y_seasonal_train)

# Predict probabilities on the validation set
h1n1_val_preds_best = h1n1_rf_best.predict_proba(X_val)[:, 1]
seasonal_val_preds_best = seasonal_rf_best.predict_proba(X_val)[:, 1]

**Evaluation**

In [29]:
# Calculate ROC AUC scores
h1n1_roc_auc_best = roc_auc_score(y_h1n1_val, h1n1_val_preds_best)
seasonal_roc_auc_best = roc_auc_score(y_seasonal_val, seasonal_val_preds_best)

# Calculate the mean ROC AUC score
overall_roc_auc_best = (h1n1_roc_auc_best + seasonal_roc_auc_best) / 2

# Print evaluation metrics
print(f"H1N1 Vaccine ROC AUC (Optimized Random Forest): {h1n1_roc_auc_best:.5f}")
print(f"Seasonal Vaccine ROC AUC (Optimized Random Forest): {seasonal_roc_auc_best:.5f}")
print(f"Overall ROC AUC Score (Optimized Random Forest): {overall_roc_auc_best:.5f}")

H1N1 Vaccine ROC AUC (Optimized Random Forest): 0.82998
Seasonal Vaccine ROC AUC (Optimized Random Forest): 0.85148
Overall ROC AUC Score (Optimized Random Forest): 0.84073


- **H1N1 Vaccine ROC AUC: 0.8300**: A slight improvement over the non-optimized version (0.8239), indicating that tuning helped the model rank H1N1 vaccine recipients a little better.

- **Seasonal Vaccine ROC AUC: 0.8515**: Similarly, the seasonal flu vaccine prediction improved slightly over the non-optimized version (0.8476).

✅ **Overall Score: 0.8407**: The overall performance improves from 0.8357 to 0.8407.

The optimized Random Forest model provides a slight boost in ROC AUC for both target variables (H1N1 and seasonal vaccines), improving by 0.0061 and 0.0039, respectively.
The overall ROC AUC score increases by 0.0050. Gains this small, measured on a single validation split, should be read as modest.

**Conclusion**: 

While the improvements are modest, the optimized Random Forest model outperforms the non-optimized version, making it the more reliable model in terms of predictive accuracy for both target variables.
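
Differences of about 0.005 on a single validation split can fall within sampling noise. One way to gauge this, sketched here on synthetic scores rather than the notebook's predictions, is a paired bootstrap over validation rows. The function name `bootstrap_auc_diff` is hypothetical, and the 95% percentile interval is one of several reasonable choices.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_diff(y_true, scores_a, scores_b, n_boot=200, seed=42):
    """Return a 95% percentile interval for AUC(b) - AUC(a) via paired bootstrap."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue  # only one class in this resample, AUC is undefined
        diffs.append(
            roc_auc_score(y_true[idx], scores_b[idx])
            - roc_auc_score(y_true[idx], scores_a[idx])
        )
    return np.percentile(diffs, [2.5, 97.5])

# Synthetic example: two noisy versions of the same underlying signal
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
base = y + rng.normal(scale=1.0, size=2000)
scores_a = base + rng.normal(scale=0.3, size=2000)
scores_b = base + rng.normal(scale=0.3, size=2000)

low, high = bootstrap_auc_diff(y, scores_a, scores_b)
print(low, high)
```

In the notebook, the same function could be called with `y_h1n1_val.to_numpy()`, `h1n1_val_preds`, and `h1n1_val_preds_best`. An interval that includes zero would suggest the tuning gain is not clearly distinguishable from noise.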


### 2. XGBoost Classifier

**Modelling**

In [30]:
# Initialize XGBoost models with default parameters
h1n1_xgb = xgb.XGBClassifier(random_state=42, eval_metric='logloss')
seasonal_xgb = xgb.XGBClassifier(random_state=42, eval_metric='logloss')

# Rename columns to remove special characters
X_train.columns = X_train.columns.str.replace(r'[^A-Za-z0-9_]', '_', regex=True)
X_val.columns = X_val.columns.str.replace(r'[^A-Za-z0-9_]', '_', regex=True)

# Train the models
h1n1_xgb.fit(X_train, y_h1n1_train)
seasonal_xgb.fit(X_train, y_seasonal_train)

# Predict probabilities on the validation set
h1n1_val_preds = h1n1_xgb.predict_proba(X_val)[:, 1]
seasonal_val_preds = seasonal_xgb.predict_proba(X_val)[:, 1]

**Evaluation**

In [31]:
# Calculate ROC AUC scores
h1n1_roc_auc = roc_auc_score(y_h1n1_val, h1n1_val_preds)
seasonal_roc_auc = roc_auc_score(y_seasonal_val, seasonal_val_preds)

# Calculate the mean ROC AUC score
overall_roc_auc = (h1n1_roc_auc + seasonal_roc_auc) / 2

# Print ROC AUC scores
print(f"H1N1 Vaccine ROC AUC (XGBoost): {h1n1_roc_auc:.5f}")
print(f"Seasonal Vaccine ROC AUC (XGBoost): {seasonal_roc_auc:.5f}")
print(f"Overall ROC AUC Score (XGBoost): {overall_roc_auc:.5f}")

H1N1 Vaccine ROC AUC (XGBoost): 0.81237
Seasonal Vaccine ROC AUC (XGBoost): 0.84504
Overall ROC AUC Score (XGBoost): 0.82870


- **H1N1 Vaccine ROC AUC: 0.8124**: This indicates decent predictive performance for the H1N1 vaccine, but it is lower than both the default Random Forest (0.82392) and the optimized Random Forest (0.82998).

- **Seasonal Vaccine ROC AUC: 0.8450**: This is comparable to the Random Forest results, though still slightly below the optimized Random Forest (0.85148).

✅ **Overall Score: 0.8287**: The XGBoost model performs well, but its overall score is lower than the optimized Random Forest model's 0.84073, so with default settings XGBoost is the weaker of the two on both target variables.

**Conclusion**:

The **optimized Random Forest model** outperforms the **non-optimized XGBoost model** in both target variables (H1N1 and Seasonal vaccine) and the overall ROC AUC score. This suggests that, in this case, the Random Forest model is better at making accurate predictions than the non-optimized XGBoost model. 

However, further optimization of the XGBoost model could potentially improve its performance and allow it to compete more closely with the Random Forest model.

#### Hyperparameter Tuning

**Modelling**

In [32]:
# Define parameter grid for XGBoost
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Perform Grid Search for H1N1
grid_search_h1n1 = GridSearchCV(h1n1_xgb, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
grid_search_h1n1.fit(X_train, y_h1n1_train)

# Perform Grid Search for Seasonal Flu
grid_search_seasonal = GridSearchCV(seasonal_xgb, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
grid_search_seasonal.fit(X_train, y_seasonal_train)

# Get best parameters
best_h1n1_params = grid_search_h1n1.best_params_
best_seasonal_params = grid_search_seasonal.best_params_

# Train optimized models
h1n1_xgb_best = xgb.XGBClassifier(**best_h1n1_params, random_state=42, eval_metric='logloss')
seasonal_xgb_best = xgb.XGBClassifier(**best_seasonal_params, random_state=42, eval_metric='logloss')

h1n1_xgb_best.fit(X_train, y_h1n1_train)
seasonal_xgb_best.fit(X_train, y_seasonal_train)

# Predict probabilities on the validation set
h1n1_val_preds_best = h1n1_xgb_best.predict_proba(X_val)[:, 1]
seasonal_val_preds_best = seasonal_xgb_best.predict_proba(X_val)[:, 1]

**Evaluation**

In [33]:
# Calculate ROC AUC scores
h1n1_roc_auc_best = roc_auc_score(y_h1n1_val, h1n1_val_preds_best)
seasonal_roc_auc_best = roc_auc_score(y_seasonal_val, seasonal_val_preds_best)

# Calculate the mean ROC AUC score
overall_roc_auc_best = (h1n1_roc_auc_best + seasonal_roc_auc_best) / 2

# Print ROC AUC scores
print(f"H1N1 Vaccine ROC AUC (Optimized XGBoost): {h1n1_roc_auc_best:.5f}")
print(f"Seasonal Vaccine ROC AUC (Optimized XGBoost): {seasonal_roc_auc_best:.5f}")
print(f"Overall ROC AUC Score (Optimized XGBoost): {overall_roc_auc_best:.5f}")

H1N1 Vaccine ROC AUC (Optimized XGBoost): 0.83440
Seasonal Vaccine ROC AUC (Optimized XGBoost): 0.85823
Overall ROC AUC Score (Optimized XGBoost): 0.84632


- **H1N1 Vaccine ROC AUC: 0.8344**: This is a solid performance for predicting the H1N1 vaccine, up about 0.022 from the non-optimized XGBoost score of 0.81237.

- **Seasonal Vaccine ROC AUC: 0.8582**: The performance for predicting the seasonal flu vaccine is also better, surpassing both the non-optimized XGBoost score (0.84504) and the optimized Random Forest (0.85148).

✅ **Overall ROC AUC Score: 0.8463**: This score is an improvement over both the non-optimized XGBoost (0.8287) and the optimized Random Forest (0.84073). It indicates that Optimized XGBoost is performing better overall.

**Conclusion**:

**Optimized XGBoost** appears to outperform **Optimized Random Forest** overall. It has stronger predictive power, especially for the _Seasonal vaccine_, but the _H1N1 vaccine_ performance is quite similar across both models.

### 3. CatBoost Classifier

**Modelling**

In [34]:
# Initialize CatBoost models (class_weights=[1, 1] weights both classes equally, i.e. no reweighting)
h1n1_catboost = CatBoostClassifier(random_state=42, class_weights=[1, 1], verbose=0)
seasonal_catboost = CatBoostClassifier(random_state=42, verbose=0)

# Train the models
h1n1_catboost.fit(X_train, y_h1n1_train)
seasonal_catboost.fit(X_train, y_seasonal_train)

# Predict probabilities on the validation set
h1n1_val_preds = h1n1_catboost.predict_proba(X_val)[:, 1]
seasonal_val_preds = seasonal_catboost.predict_proba(X_val)[:, 1]

**Evaluation**

In [35]:
# Calculate ROC AUC scores
h1n1_roc_auc = roc_auc_score(y_h1n1_val, h1n1_val_preds)
seasonal_roc_auc = roc_auc_score(y_seasonal_val, seasonal_val_preds)

# Calculate the mean ROC AUC score
overall_roc_auc = (h1n1_roc_auc + seasonal_roc_auc) / 2

# Print ROC AUC scores
print(f"H1N1 Vaccine ROC AUC (CatBoost): {h1n1_roc_auc:.5f}")
print(f"Seasonal Vaccine ROC AUC (CatBoost): {seasonal_roc_auc:.5f}")
print(f"Overall ROC AUC Score (CatBoost): {overall_roc_auc:.5f}")

H1N1 Vaccine ROC AUC (CatBoost): 0.82793
Seasonal Vaccine ROC AUC (CatBoost): 0.85794
Overall ROC AUC Score (CatBoost): 0.84294


- **H1N1 Vaccine ROC AUC Score: 0.8279**: The CatBoost model shows solid performance for predicting the H1N1 vaccine, though it is slightly lower than the optimized XGBoost score of 0.8344. This indicates that the CatBoost model could benefit from further tuning to match XGBoost's performance in this area.
- **Seasonal Vaccine ROC AUC Score: 0.8579**: The CatBoost model performs very well in predicting the seasonal vaccine, with a score of 0.8579, which is very close to the optimized XGBoost score of 0.8582. This indicates that the CatBoost model is highly effective at this task, showing minimal performance difference compared to XGBoost.

✅ **Overall ROC AUC Score: 0.8429**: Performance is slightly lower than optimized XGBoost's 0.8463. Despite the small gap, CatBoost provides strong predictive power with default settings. Note that its native categorical-feature handling is not exercised here, because the categorical columns were already encoded before training.

**Conclusion**:

CatBoost shows **strong performance overall**, with minimal difference from Optimized XGBoost. Fine-tuning the hyperparameters could further improve its performance, especially for tasks like predicting the H1N1 vaccine.

#### Hyperparameter Tuning

**Modelling**

In [36]:
# Define the parameter grid for tuning
param_grid = {
    'iterations': [500],             # Number of boosting iterations
    'learning_rate': [0.01, 0.05],   # Learning rate
    'depth': [6, 8],                 # Depth of trees
    'l2_leaf_reg': [1, 3],           # L2 regularization term
    'bagging_temperature': [0.5, 1], # Strength of the Bayesian bootstrap (higher = more random sample weights)
    'border_count': [32, 64]         # Number of splits (bins) for numerical features
}

# Perform GridSearchCV for hyperparameter tuning
grid_search_h1n1 = GridSearchCV(estimator=h1n1_catboost, param_grid=param_grid, scoring='roc_auc', cv=3, n_jobs=-1, verbose=1)
grid_search_seasonal = GridSearchCV(estimator=seasonal_catboost, param_grid=param_grid, scoring='roc_auc', cv=3, n_jobs=-1, verbose=1)

# Train the GridSearchCV with the training data
grid_search_h1n1.fit(X_train, y_h1n1_train)
grid_search_seasonal.fit(X_train, y_seasonal_train)

# Get the best models after tuning
h1n1_catboost_best = grid_search_h1n1.best_estimator_
seasonal_catboost_best = grid_search_seasonal.best_estimator_

# GridSearchCV refits the best estimator on the full training data by default
# (refit=True), so best_estimator_ is already trained and needs no extra fit.

# Predict probabilities on the validation set
h1n1_val_preds = h1n1_catboost_best.predict_proba(X_val)[:, 1]
seasonal_val_preds = seasonal_catboost_best.predict_proba(X_val)[:, 1]

Fitting 3 folds for each of 32 candidates, totalling 96 fits
Fitting 3 folds for each of 32 candidates, totalling 96 fits
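The candidate count reported by `GridSearchCV` follows directly from the grid: it is the product of the number of values per parameter, multiplied by the number of folds. A short sketch using scikit-learn's `ParameterGrid` on the same grid reproduces the figures in the log:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell above
param_grid = {
    'iterations': [500],
    'learning_rate': [0.01, 0.05],
    'depth': [6, 8],
    'l2_leaf_reg': [1, 3],
    'bagging_temperature': [0.5, 1],
    'border_count': [32, 64],
}

cv_folds = 3
n_candidates = len(ParameterGrid(param_grid))  # 1 * 2 * 2 * 2 * 2 * 2
total_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {total_fits} fits")
```

Checking this before launching a search is a cheap way to estimate run time, since every extra value in one list multiplies the total number of fits.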


**Evaluation**

In [37]:
# Calculate ROC AUC scores
h1n1_roc_auc = roc_auc_score(y_h1n1_val, h1n1_val_preds)
seasonal_roc_auc = roc_auc_score(y_seasonal_val, seasonal_val_preds)

# Calculate the overall ROC AUC score
overall_roc_auc = (h1n1_roc_auc + seasonal_roc_auc) / 2

# Print ROC AUC scores
print(f"H1N1 Vaccine ROC AUC (Optimized CatBoost): {h1n1_roc_auc:.5f}")
print(f"Seasonal Vaccine ROC AUC (Optimized CatBoost): {seasonal_roc_auc:.5f}")
print(f"Overall ROC AUC Score (Optimized CatBoost): {overall_roc_auc:.5f}")

H1N1 Vaccine ROC AUC (Optimized CatBoost): 0.83508
Seasonal Vaccine ROC AUC (Optimized CatBoost): 0.85704
Overall ROC AUC Score (Optimized CatBoost): 0.84606


- **H1N1 Vaccine ROC AUC: 0.8351**: Slightly outperforms Optimized XGBoost (0.8344), showing a marginal improvement in predicting H1N1 vaccination.
- **Seasonal Vaccine ROC AUC: 0.8570**: Performs slightly lower than Optimized XGBoost (0.8582), but still highly competitive.

✅ **Overall ROC AUC Score: 0.8461**: Almost identical to Optimized XGBoost (0.8463), indicating both models are performing at a similar level.

**Conclusion**:

Optimized CatBoost and Optimized XGBoost are closely matched: CatBoost is slightly better on H1N1 (0.8351 vs. 0.8344), while XGBoost is slightly better on the seasonal vaccine (0.8582 vs. 0.8570). Both are strong choices, and the difference in overall score is negligible.

**NOTE**:

To combine the strengths of each model, I will apply an ensemble method using a Voting Classifier. The ensemble combines three models (Optimized Random Forest, Optimized XGBoost, and Optimized CatBoost) using soft voting: each model's predicted class probabilities are averaged, and the averaged probability is the final prediction. Averaging across different algorithms can reduce variance and yield a more robust model than any single one.
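With `voting='soft'`, as used in the cell that follows, scikit-learn's `VotingClassifier` averages the class probabilities of its fitted base models rather than counting votes. The sketch below illustrates this on synthetic data; the three base models are stand-ins chosen so the example runs with scikit-learn alone, not the tuned models from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Stand-in base models (assumption: any classifiers with predict_proba work)
estimators = [
    ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
]
soft_vote = VotingClassifier(estimators=estimators, voting="soft").fit(X, y)

# Manually average the probabilities of the fitted base models
manual_avg = np.mean([m.predict_proba(X) for m in soft_vote.estimators_], axis=0)
ensemble_proba = soft_vote.predict_proba(X)

print(np.allclose(manual_avg, ensemble_proba))
```

Because the output is an averaged probability, it can be passed to `roc_auc_score` directly, which is not possible with hard (majority) voting since that mode does not expose `predict_proba`.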

### 4. Voting Classifier

**Modelling**

In [38]:
# Voting Classifier for H1N1
voting_clf_h1n1 = VotingClassifier(
    estimators=[
        ('random_forest', h1n1_rf_best),
        ('xgboost', h1n1_xgb_best),
        ('catboost', h1n1_catboost_best)
    ],
    voting='soft'  # Probability-based predictions
)

# Voting Classifier for Seasonal Vaccine
voting_clf_seasonal = VotingClassifier(
    estimators=[
        ('random_forest', seasonal_rf_best),
        ('xgboost', seasonal_xgb_best),
        ('catboost', seasonal_catboost_best)
    ],
    voting='soft'
)

# Train the models on the training data
voting_clf_h1n1.fit(X_train, y_h1n1_train)
voting_clf_seasonal.fit(X_train, y_seasonal_train)

# Predict probabilities
h1n1_val_preds = voting_clf_h1n1.predict_proba(X_val)[:, 1]
seasonal_val_preds = voting_clf_seasonal.predict_proba(X_val)[:, 1]

**Evaluation**

In [39]:
# Compute ROC AUC scores
h1n1_roc_auc = roc_auc_score(y_h1n1_val, h1n1_val_preds)
seasonal_roc_auc = roc_auc_score(y_seasonal_val, seasonal_val_preds)

# Calculate the overall ROC AUC score
overall_roc_auc = (h1n1_roc_auc + seasonal_roc_auc) / 2

# Print ROC AUC scores
print(f"Voting Classifier ROC AUC (H1N1): {h1n1_roc_auc:.5f}")
print(f"Voting Classifier ROC AUC (Seasonal): {seasonal_roc_auc:.5f}")
print(f"Overall Voting Classifier ROC AUC: {overall_roc_auc:.5f}")

Voting Classifier ROC AUC (H1N1): 0.83585
Voting Classifier ROC AUC (Seasonal): 0.85785
Overall Voting Classifier ROC AUC: 0.84685


The Voting Classifier matches or slightly improves on the individual models:
- **H1N1 Vaccine ROC AUC: 0.83585**: A small improvement over Optimized XGBoost (0.83440) and Optimized CatBoost (0.83508), suggesting the ensemble helps slightly for H1N1 prediction.
- **Seasonal Vaccine ROC AUC: 0.85785**: Slightly below Optimized XGBoost (0.85823) but above Optimized CatBoost (0.85704), so the ensemble lands between those two models on this target.

✅ **Overall ROC AUC Score: 0.84685**: A marginal improvement over both Optimized XGBoost (0.84632) and Optimized CatBoost (0.84606). The gains are in the fourth decimal place, so they are modest and measured on a single validation split.

**Conclusion**: 

The **Voting Classifier** combines **Optimized Random Forest**, **Optimized XGBoost**, and **Optimized CatBoost** and achieves a higher overall ROC AUC than Optimized XGBoost or Optimized CatBoost alone. The gain comes from the H1N1 target; on the seasonal vaccine it is marginally below Optimized XGBoost. Because the ensemble is competitive on both targets, it is a solid choice for the final model.
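To make the comparison easier to scan, the validation scores reported in this section can be collected into one table. This is only a convenience sketch: the numbers are copied from the outputs and text above, and the overall column is recomputed as the mean of the two targets.

```python
import pandas as pd

# Validation ROC AUC scores as reported in this section
scores = pd.DataFrame(
    {
        "h1n1": [0.83440, 0.82793, 0.83508, 0.83585],
        "seasonal": [0.85823, 0.85794, 0.85704, 0.85785],
    },
    index=["Optimized XGBoost", "CatBoost", "Optimized CatBoost", "Voting Classifier"],
)

# Overall score = mean of the two per-target scores
scores["overall"] = scores[["h1n1", "seasonal"]].mean(axis=1)

# Rank models by overall score, best first
ranked = scores.sort_values("overall", ascending=False)
print(ranked.round(5))
```

The table shows that the ranking differs by target, which is why the overall mean is used to pick the final model.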

**Submission**

In [41]:
# Predict probabilities for the test set
y_pred_proba_h1n1 = voting_clf_h1n1.predict_proba(X_test_encoded)[:, 1]
y_pred_proba_seasonal = voting_clf_seasonal.predict_proba(X_test_encoded)[:, 1]

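# Format prediction probabilities for submission.
# Probabilities are kept at full precision: ROC AUC depends on ranking,
# and rounding to one decimal would create ties between predictions.
submission = pd.DataFrame({
    'h1n1_vaccine': y_pred_proba_h1n1,
    'seasonal_vaccine': y_pred_proba_seasonal
})

# Set and name the index
# NOTE: the default 0..n-1 index is only correct if it matches the test set's
# respondent_id values; otherwise build the frame with the test set's own index.
submission.index.name = 'respondent_id'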

# Save to CSV
submission.to_csv('Data/submission.csv')
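Because ROC AUC depends only on how predictions are ranked, rounding probabilities to one decimal place can merge distinct scores into ties and shift the metric. A small sketch with made-up values (not model output) illustrates the effect:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and probabilities; every positive outranks every negative
y_true = np.array([0, 1, 0, 1])
raw = np.array([0.42, 0.44, 0.11, 0.93])

auc_raw = roc_auc_score(y_true, raw)

# Rounding maps 0.42 and 0.44 to the same value, creating a tie
# between one negative and one positive example
auc_rounded = roc_auc_score(y_true, raw.round(1))

print(f"raw: {auc_raw:.3f}, rounded: {auc_rounded:.3f}")
```

In this toy case the rounded scores lose the ordering between two examples, so the AUC drops even though the underlying predictions were perfectly ranked.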