# Predicting Vaccine Uptake: A Data-Driven Analysis of H1N1 and Seasonal Flu Vaccinations

### Project Objective

This project predicts whether a respondent received the **H1N1 flu** vaccine and the **seasonal flu** vaccine, using features from the 2009 National H1N1 Flu Survey. By analyzing demographic, behavioral, and opinion-based features, it aims to:

1. Build models that estimate vaccine uptake.
2. Identify the factors that influence vaccination decisions.
3. Provide insights to guide public health strategies and improve vaccine outreach efforts.

The broader goal is to help reduce vaccine hesitancy and support data-driven decisions that improve health outcomes.

### **Problem Statement**

This is a multilabel classification problem where each row in the data corresponds to a respondent who took part in the **2009 National H1N1 Flu Survey**. The task is to predict two binary target variables for each respondent:

- **h1n1_vaccine**: Whether the respondent received the H1N1 flu vaccine (1 = Yes, 0 = No).
- **seasonal_vaccine**: Whether the respondent received the seasonal flu vaccine (1 = Yes, 0 = No).

The two targets are recorded separately and are not mutually exclusive, so a respondent might have received:

1. Neither vaccine
2. Only the H1N1 vaccine
3. Only the seasonal flu vaccine
4. Both vaccines
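
The four outcomes listed above can be counted with a cross-tabulation of the two labels. The sketch below uses a small made-up label frame (the values are hypothetical, not survey data) to show the idea; the correlation at the end is one rough way to see how strongly the two labels co-occur.

```python
import pandas as pd

# Hypothetical toy labels; the real values come from the survey's label file.
labels = pd.DataFrame({
    "h1n1_vaccine":     [0, 0, 1, 1, 0, 0, 1, 0],
    "seasonal_vaccine": [0, 1, 0, 1, 1, 0, 1, 0],
})

# Cross-tabulate the two binary targets to count each of the four outcomes.
joint_counts = pd.crosstab(labels["h1n1_vaccine"], labels["seasonal_vaccine"])
print(joint_counts)

# For two binary columns, the Pearson correlation is the phi coefficient.
# A value near 0 suggests the labels rarely move together; a larger value
# suggests respondents who take one vaccine tend to take the other.
phi = labels["h1n1_vaccine"].corr(labels["seasonal_vaccine"])
print(f"Correlation between targets: {phi:.3f}")
```

Running the same two calls on the real label columns would show how common each of the four groups is in the survey.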


### **Data Exploration and Preparation**


In [5]:
# Import essential libraries for data manipulation, visualization, and machine learning tasks

# Pandas for data manipulation and analysis
import pandas as pd

# NumPy for numerical operations and handling arrays
import numpy as np

# Matplotlib for creating static, interactive, and animated visualizations
import matplotlib.pyplot as plt

# Seaborn for advanced statistical data visualization
import seaborn as sns

# ydata-profiling for generating detailed data profiling reports
from ydata_profiling import ProfileReport

# Enable experimental features in scikit-learn, specifically for IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa

# IterativeImputer and SimpleImputer for handling missing data
from sklearn.impute import IterativeImputer, SimpleImputer


# MinMaxScaler for scaling features to a specific range (e.g., 0-1)
# OneHotEncoder for converting categorical variables into numerical representations
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# SMOTE (Synthetic Minority Oversampling Technique) for handling imbalanced datasets
from imblearn.over_sampling import SMOTE

# train_test_split for dividing the dataset into training and testing subsets
from sklearn.model_selection import train_test_split

# Logistic Regression for a linear classification model
from sklearn.linear_model import LogisticRegression

# RandomForestClassifier for ensemble-based classification using decision trees
from sklearn.ensemble import RandomForestClassifier

# GradientBoostingClassifier for ensemble-based classification using boosting techniques
from sklearn.ensemble import GradientBoostingClassifier

# XGBClassifier from XGBoost, an optimized gradient boosting framework
from xgboost import XGBClassifier

# Metrics to evaluate model performance:
# - accuracy_score: For overall accuracy
# - f1_score: For the balance between precision and recall
# - confusion_matrix: For a detailed breakdown of prediction results
# - classification_report: For precision, recall, F1-score, and support metrics
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

# Suppress warnings for cleaner output during execution
import warnings
warnings.filterwarnings('ignore')

In [6]:
# Importing the dataset

# Step 1: Load the dataset containing feature variables
# The "features.csv" file is expected to hold the independent variables (features) of the dataset.
df_features = pd.read_csv("./Data/features.csv")

# Step 2: Load the dataset containing target labels or additional information
# The "labels.csv" file is expected to hold the dependent variable (target labels).
df_labels = pd.read_csv("./Data/labels.csv")

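# Step 3: Merge the two datasets
# Combine the features and labels DataFrames using their row indices as the key.
# The "left_index=True" and "right_index=True" arguments ensure the merge is based on the row indices,
# which assumes both files list respondents in the same order.
# Both files contain a "respondent_id" column, so the result holds "respondent_id_x" and "respondent_id_y".
df = pd.merge(df_features, df_labels, right_index=True, left_index=True)

# Step 4: Display the merged DataFrame to verify the merge (pandas shows only the first and last rows of a long frame)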
df

Unnamed: 0,respondent_id_x,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,...,employment_status,hhs_geo_region,census_msa,household_adults,household_children,employment_industry,employment_occupation,respondent_id_y,h1n1_vaccine,seasonal_vaccine
0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,Not in Labor Force,oxchjgsf,Non-MSA,0.0,0.0,,,0,0,0
1,1,3.0,2.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,Employed,bhuqouqj,"MSA, Not Principle City",0.0,0.0,pxcmvdjn,xgwztkwe,1,0,1
2,2,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,Employed,qufhixun,"MSA, Not Principle City",2.0,0.0,rucpziij,xtkaffoo,2,0,0
3,3,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,Not in Labor Force,lrircsnp,"MSA, Principle City",0.0,0.0,,,3,0,1
4,4,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,...,Employed,qufhixun,"MSA, Not Principle City",1.0,0.0,wxleyezf,emcorrxb,4,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26702,26702,2.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,Not in Labor Force,qufhixun,Non-MSA,0.0,0.0,,,26702,0,0
26703,26703,1.0,2.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,...,Employed,lzgpxyit,"MSA, Principle City",1.0,0.0,fcxhlnwr,cmhcxjea,26703,0,0
26704,26704,2.0,2.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,...,,lzgpxyit,"MSA, Not Principle City",0.0,0.0,,,26704,0,1
26705,26705,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,...,Employed,lrircsnp,Non-MSA,1.0,0.0,fcxhlnwr,haliazsg,26705,0,0
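
The index-based merge above yields two id columns (`respondent_id_x` and `respondent_id_y`) and relies on both files being in the same row order. As an alternative sketch, assuming both files share a `respondent_id` column (which the suffixed column names suggest), the join can be made on that key directly. The two small frames below are hypothetical stand-ins for the CSV files.

```python
import pandas as pd

# Hypothetical miniature versions of the feature and label files.
features = pd.DataFrame({"respondent_id": [0, 1, 2], "h1n1_concern": [1.0, 3.0, 1.0]})
labels = pd.DataFrame({"respondent_id": [2, 0, 1], "h1n1_vaccine": [0, 0, 1]})

# Joining on the shared key keeps a single respondent_id column and does not
# depend on both files listing respondents in the same order.
# validate="one_to_one" raises an error if an id is duplicated on either side.
merged = pd.merge(features, labels, on="respondent_id", how="inner", validate="one_to_one")
print(merged)
```

With a key-based merge there is no duplicate id column to drop or rename afterwards.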


In [7]:
# Display the first 5 rows of the DataFrame to inspect its structure and initial data
df.head(5)

Unnamed: 0,respondent_id_x,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,...,employment_status,hhs_geo_region,census_msa,household_adults,household_children,employment_industry,employment_occupation,respondent_id_y,h1n1_vaccine,seasonal_vaccine
0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,Not in Labor Force,oxchjgsf,Non-MSA,0.0,0.0,,,0,0,0
1,1,3.0,2.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,Employed,bhuqouqj,"MSA, Not Principle City",0.0,0.0,pxcmvdjn,xgwztkwe,1,0,1
2,2,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,Employed,qufhixun,"MSA, Not Principle City",2.0,0.0,rucpziij,xtkaffoo,2,0,0
3,3,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,Not in Labor Force,lrircsnp,"MSA, Principle City",0.0,0.0,,,3,0,1
4,4,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,...,Employed,qufhixun,"MSA, Not Principle City",1.0,0.0,wxleyezf,emcorrxb,4,0,0


In [8]:
# Show the last 5 rows of the DataFrame to check for trailing data or missing values
df.tail(5)

Unnamed: 0,respondent_id_x,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,...,employment_status,hhs_geo_region,census_msa,household_adults,household_children,employment_industry,employment_occupation,respondent_id_y,h1n1_vaccine,seasonal_vaccine
26702,26702,2.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,Not in Labor Force,qufhixun,Non-MSA,0.0,0.0,,,26702,0,0
26703,26703,1.0,2.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,...,Employed,lzgpxyit,"MSA, Principle City",1.0,0.0,fcxhlnwr,cmhcxjea,26703,0,0
26704,26704,2.0,2.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,...,,lzgpxyit,"MSA, Not Principle City",0.0,0.0,,,26704,0,1
26705,26705,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,...,Employed,lrircsnp,Non-MSA,1.0,0.0,fcxhlnwr,haliazsg,26705,0,0
26706,26706,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,Not in Labor Force,mlyzmhmf,"MSA, Principle City",1.0,0.0,,,26706,0,0


In [9]:
# Generate summary statistics for numerical columns (e.g., mean, std, min, max) to understand data distribution
df.describe()

Unnamed: 0,respondent_id_x,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,...,opinion_h1n1_risk,opinion_h1n1_sick_from_vacc,opinion_seas_vacc_effective,opinion_seas_risk,opinion_seas_sick_from_vacc,household_adults,household_children,respondent_id_y,h1n1_vaccine,seasonal_vaccine
count,26707.0,26615.0,26591.0,26636.0,26499.0,26688.0,26665.0,26620.0,26625.0,26579.0,...,26319.0,26312.0,26245.0,26193.0,26170.0,26458.0,26458.0,26707.0,26707.0,26707.0
mean,13353.0,1.618486,1.262532,0.048844,0.725612,0.068982,0.825614,0.35864,0.337315,0.677264,...,2.342566,2.35767,4.025986,2.719162,2.118112,0.886499,0.534583,13353.0,0.212454,0.465608
std,7709.791156,0.910311,0.618149,0.215545,0.446214,0.253429,0.379448,0.47961,0.472802,0.467531,...,1.285539,1.362766,1.086565,1.385055,1.33295,0.753422,0.928173,7709.791156,0.409052,0.498825
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,6676.5,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,1.0,4.0,2.0,1.0,0.0,0.0,6676.5,0.0,0.0
50%,13353.0,2.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,...,2.0,2.0,4.0,2.0,2.0,1.0,0.0,13353.0,0.0,0.0
75%,20029.5,2.0,2.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,...,4.0,4.0,5.0,4.0,4.0,1.0,1.0,20029.5,0.0,1.0
max,26706.0,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,5.0,5.0,5.0,5.0,5.0,3.0,3.0,26706.0,1.0,1.0


In [10]:
# Display a concise summary of the DataFrame, including column data types and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 39 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id_x              26707 non-null  int64  
 1   h1n1_concern                 26615 non-null  float64
 2   h1n1_knowledge               26591 non-null  float64
 3   behavioral_antiviral_meds    26636 non-null  float64
 4   behavioral_avoidance         26499 non-null  float64
 5   behavioral_face_mask         26688 non-null  float64
 6   behavioral_wash_hands        26665 non-null  float64
 7   behavioral_large_gatherings  26620 non-null  float64
 8   behavioral_outside_home      26625 non-null  float64
 9   behavioral_touch_face        26579 non-null  float64
 10  doctor_recc_h1n1             24547 non-null  float64
 11  doctor_recc_seasonal         24547 non-null  float64
 12  chronic_med_condition        25736 non-null  float64
 13  child_under_6_mo

In [11]:
# Return the number of rows and columns in the DataFrame (shape: (rows, columns))
df.shape

(26707, 39)

In [12]:
# Display the column names of the DataFrame to understand its structure
df.columns

Index(['respondent_id_x', 'h1n1_concern', 'h1n1_knowledge',
       'behavioral_antiviral_meds', 'behavioral_avoidance',
       'behavioral_face_mask', 'behavioral_wash_hands',
       'behavioral_large_gatherings', 'behavioral_outside_home',
       'behavioral_touch_face', 'doctor_recc_h1n1', 'doctor_recc_seasonal',
       'chronic_med_condition', 'child_under_6_months', 'health_worker',
       'health_insurance', 'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk',
       'opinion_h1n1_sick_from_vacc', 'opinion_seas_vacc_effective',
       'opinion_seas_risk', 'opinion_seas_sick_from_vacc', 'age_group',
       'education', 'race', 'sex', 'income_poverty', 'marital_status',
       'rent_or_own', 'employment_status', 'hhs_geo_region', 'census_msa',
       'household_adults', 'household_children', 'employment_industry',
       'employment_occupation', 'respondent_id_y', 'h1n1_vaccine',
       'seasonal_vaccine'],
      dtype='object')

In [13]:
# Generate summary statistics for categorical columns (e.g., count, unique, top, freq)
df.describe(include="O")

Unnamed: 0,age_group,education,race,sex,income_poverty,marital_status,rent_or_own,employment_status,hhs_geo_region,census_msa,employment_industry,employment_occupation
count,26707,25300,26707,26707,22284,25299,24665,25244,26707,26707,13377,13237
unique,5,4,4,2,3,2,2,3,10,3,21,23
top,65+ Years,College Graduate,White,Female,"<= $75,000, Above Poverty",Married,Own,Employed,lzgpxyit,"MSA, Not Principle City",fcxhlnwr,xtkaffoo
freq,6843,10097,21222,15858,12777,13555,18736,13560,4297,11645,2468,1778


In [14]:
# Drop the "respondent_id_y" column from the DataFrame as it is no longer needed
# 'axis=1' indicates column-wise operation, and 'inplace=True' modifies the DataFrame directly
df.drop("respondent_id_y",axis=1,inplace=True)

In [15]:
# Display the updated column names after modifications (e.g., column drops or additions)
df.columns

Index(['respondent_id_x', 'h1n1_concern', 'h1n1_knowledge',
       'behavioral_antiviral_meds', 'behavioral_avoidance',
       'behavioral_face_mask', 'behavioral_wash_hands',
       'behavioral_large_gatherings', 'behavioral_outside_home',
       'behavioral_touch_face', 'doctor_recc_h1n1', 'doctor_recc_seasonal',
       'chronic_med_condition', 'child_under_6_months', 'health_worker',
       'health_insurance', 'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk',
       'opinion_h1n1_sick_from_vacc', 'opinion_seas_vacc_effective',
       'opinion_seas_risk', 'opinion_seas_sick_from_vacc', 'age_group',
       'education', 'race', 'sex', 'income_poverty', 'marital_status',
       'rent_or_own', 'employment_status', 'hhs_geo_region', 'census_msa',
       'household_adults', 'household_children', 'employment_industry',
       'employment_occupation', 'h1n1_vaccine', 'seasonal_vaccine'],
      dtype='object')

In [16]:
# Rename the column "respondent_id_x" to "id" for clarity
# 'axis=1' indicates the operation is on columns, and 'inplace=True' modifies the DataFrame directly
df.rename({"respondent_id_x":"id"},axis=1,inplace=True)

In [17]:
# Display the updated column names after renaming columns
df.columns

Index(['id', 'h1n1_concern', 'h1n1_knowledge', 'behavioral_antiviral_meds',
       'behavioral_avoidance', 'behavioral_face_mask', 'behavioral_wash_hands',
       'behavioral_large_gatherings', 'behavioral_outside_home',
       'behavioral_touch_face', 'doctor_recc_h1n1', 'doctor_recc_seasonal',
       'chronic_med_condition', 'child_under_6_months', 'health_worker',
       'health_insurance', 'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk',
       'opinion_h1n1_sick_from_vacc', 'opinion_seas_vacc_effective',
       'opinion_seas_risk', 'opinion_seas_sick_from_vacc', 'age_group',
       'education', 'race', 'sex', 'income_poverty', 'marital_status',
       'rent_or_own', 'employment_status', 'hhs_geo_region', 'census_msa',
       'household_adults', 'household_children', 'employment_industry',
       'employment_occupation', 'h1n1_vaccine', 'seasonal_vaccine'],
      dtype='object')

In [18]:
# Check for missing values in each column by summing the null entries
# This helps identify columns with missing data
df.isnull().sum()

id                                 0
h1n1_concern                      92
h1n1_knowledge                   116
behavioral_antiviral_meds         71
behavioral_avoidance             208
behavioral_face_mask              19
behavioral_wash_hands             42
behavioral_large_gatherings       87
behavioral_outside_home           82
behavioral_touch_face            128
doctor_recc_h1n1                2160
doctor_recc_seasonal            2160
chronic_med_condition            971
child_under_6_months             820
health_worker                    804
health_insurance               12274
opinion_h1n1_vacc_effective      391
opinion_h1n1_risk                388
opinion_h1n1_sick_from_vacc      395
opinion_seas_vacc_effective      462
opinion_seas_risk                514
opinion_seas_sick_from_vacc      537
age_group                          0
education                       1407
race                               0
sex                                0
income_poverty                  4423
m
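
Raw counts are easier to compare as shares of the 26,707 rows; for example, the 12,274 missing `health_insurance` values are roughly 46% of respondents. A minimal sketch of that calculation, on a hypothetical toy frame rather than the survey data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the survey data.
toy = pd.DataFrame({
    "health_insurance": [1.0, np.nan, np.nan, 0.0],
    "h1n1_concern": [1.0, 2.0, np.nan, 3.0],
    "age_group": ["65+ Years", "Other", "65+ Years", "Other"],
})

# isnull().mean() gives the fraction of missing entries per column;
# multiplying by 100 and sorting puts the most incomplete columns first.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))
```

Applied to `df`, the same expression ranks the columns by how much imputation they will need.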

In [19]:
# Identify duplicate rows in the DataFrame
# Returns a boolean series indicating True for duplicate rows
df.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
26702    False
26703    False
26704    False
26705    False
26706    False
Length: 26707, dtype: bool
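
The truncated boolean Series above does not reveal whether any row in the middle is a duplicate. Summing it gives a single count instead. The sketch below, on a hypothetical toy frame, also shows that a unique `id` column hides repeated answers unless it is excluded from the comparison.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 have identical answers but different ids.
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "h1n1_concern": [1.0, 2.0, 1.0],
    "sex": ["Female", "Male", "Female"],
})

# Summing the boolean Series gives the number of duplicated rows directly.
n_exact = int(toy.duplicated().sum())

# With a unique id column no two rows can match exactly, so it can be more
# informative to compare rows on the answer columns only.
n_ignoring_id = int(toy.drop(columns="id").duplicated().sum())

print(n_exact, n_ignoring_id)
```

Identical answers from different respondents are plausible in a survey, so a non-zero second count is not necessarily a data error.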

In [20]:
# Return the number of rows and columns in the DataFrame (shape: (rows, columns))
df.shape

(26707, 38)

### Exploratory Data Analysis

In [22]:
# Generate an EDA report using ydata-profiling (previously known as pandas-profiling)
# This creates an interactive, detailed report that includes statistics, visualizations, and insights.
# The 'explorative=True' flag enables more in-depth analysis, including additional features like correlations.
# from ydata_profiling import ProfileReport
profile = ProfileReport(df, title="EDA", explorative=True)

# Display the generated profile report
profile




In [23]:
# Define the feature matrix (X) by dropping the columns 'id', 'h1n1_vaccine', and 'seasonal_vaccine'
# 'id' is a respondent identifier and carries no predictive information
# 'h1n1_vaccine' and 'seasonal_vaccine' are target variables, so we exclude them from the features (X)
X = df.drop(columns=['id', 'h1n1_vaccine', 'seasonal_vaccine'])

# Define the target variable (y) as 'h1n1_vaccine', the first of the two targets
y = df['h1n1_vaccine']

In [24]:
# Display the feature matrix (X) to check the columns being used as inputs for the model
X

Unnamed: 0,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,doctor_recc_h1n1,...,income_poverty,marital_status,rent_or_own,employment_status,hhs_geo_region,census_msa,household_adults,household_children,employment_industry,employment_occupation
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,Below Poverty,Not Married,Own,Not in Labor Force,oxchjgsf,Non-MSA,0.0,0.0,,
1,3.0,2.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,...,Below Poverty,Not Married,Rent,Employed,bhuqouqj,"MSA, Not Principle City",0.0,0.0,pxcmvdjn,xgwztkwe
2,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,,...,"<= $75,000, Above Poverty",Not Married,Own,Employed,qufhixun,"MSA, Not Principle City",2.0,0.0,rucpziij,xtkaffoo
3,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,...,Below Poverty,Not Married,Rent,Not in Labor Force,lrircsnp,"MSA, Principle City",0.0,0.0,,
4,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,...,"<= $75,000, Above Poverty",Married,Own,Employed,qufhixun,"MSA, Not Principle City",1.0,0.0,wxleyezf,emcorrxb
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26702,2.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,"<= $75,000, Above Poverty",Not Married,Own,Not in Labor Force,qufhixun,Non-MSA,0.0,0.0,,
26703,1.0,2.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,...,"<= $75,000, Above Poverty",Not Married,Rent,Employed,lzgpxyit,"MSA, Principle City",1.0,0.0,fcxhlnwr,cmhcxjea
26704,2.0,2.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,...,,Not Married,Own,,lzgpxyit,"MSA, Not Principle City",0.0,0.0,,
26705,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,...,"<= $75,000, Above Poverty",Married,Rent,Employed,lrircsnp,Non-MSA,1.0,0.0,fcxhlnwr,haliazsg


In [25]:
# Display the target variable (y) to check the column being used for prediction
y

0        0
1        0
2        0
3        0
4        0
        ..
26702    0
26703    0
26704    0
26705    0
26706    0
Name: h1n1_vaccine, Length: 26707, dtype: int64

In [26]:
# Split the data into training and testing sets

# train_test_split is already imported above; it is repeated here so the cell can run on its own
from sklearn.model_selection import train_test_split

# Split X and y into a training set (80% of the rows) and a testing set (20% of the rows)
# 'random_state=42' makes the split reproducible across runs
# 'stratify=y' keeps the class distribution of y the same in both sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
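
To see what `stratify` does, the positive rate can be compared across the two splits. This is a small sketch on hypothetical toy data with a 20% positive rate, chosen to be roughly similar to the `h1n1_vaccine` mean of about 0.21 reported by `df.describe()`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with a 20% positive rate.
toy_X = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series([1] * 20 + [0] * 80, name="target")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratify=toy_y, both splits keep the same share of positives.
print(f"train positive rate: {y_tr.mean():.2f}")
print(f"test positive rate:  {y_te.mean():.2f}")
```

Without stratification, a small or imbalanced test set can end up with a noticeably different class mix, which makes metrics such as F1 harder to compare.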

In [27]:
# Set up lists for each column's data types

# Initialize empty lists to store column names based on their data types
num_cols = []  # Numerical columns (float64 or int64)
ohe_cols = []  # Categorical columns with fewer than 10 unique values (suitable for One-Hot Encoding)
freq_cols = [] # Categorical columns with 10 or more unique values (high cardinality)

# Iterate through each column in the feature matrix (X)
for c in X.columns:
    # Check if the column's data type is numeric (either float64 or int64)
    if X[c].dtype in ['float64', 'int64']:
        num_cols.append(c)  # Add numeric columns to num_cols list
    # Check if the column has fewer than 10 unique values (nunique() ignores missing values)
    elif X[c].nunique() < 10:
        ohe_cols.append(c)  # Add low-cardinality columns to ohe_cols list
    else:
        freq_cols.append(c)  # Add columns with 10 or more unique values to freq_cols list

In [28]:
# Display the categorized columns to check the column groupings based on their data types

# Print the list of numerical columns
print('Numerical Columns:', num_cols)

# Add a new line for better readability
print('\n')

# Print the list of categorical columns with fewer than 10 unique values (suitable for One-Hot Encoding)
print('Object Columns (with fewer than 10 unique values):', ohe_cols)

# Add a new line for better readability
print('\n')

# Print the list of categorical columns with 10 or more unique values (high cardinality)
print('Object Columns (with 10 or more unique values):', freq_cols)

Numerical Columns: ['h1n1_concern', 'h1n1_knowledge', 'behavioral_antiviral_meds', 'behavioral_avoidance', 'behavioral_face_mask', 'behavioral_wash_hands', 'behavioral_large_gatherings', 'behavioral_outside_home', 'behavioral_touch_face', 'doctor_recc_h1n1', 'doctor_recc_seasonal', 'chronic_med_condition', 'child_under_6_months', 'health_worker', 'health_insurance', 'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk', 'opinion_h1n1_sick_from_vacc', 'opinion_seas_vacc_effective', 'opinion_seas_risk', 'opinion_seas_sick_from_vacc', 'household_adults', 'household_children']


Object Columns (with fewer than 10 unique values): ['age_group', 'education', 'race', 'sex', 'income_poverty', 'marital_status', 'rent_or_own', 'employment_status', 'census_msa']


Object Columns (with 10 or more unique values): ['hhs_geo_region', 'employment_industry', 'employment_occupation']
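
The name `freq_cols` hints at frequency encoding for the three high-cardinality columns, although the cells shown here only process `num_cols` and `ohe_cols`. As an illustration of what such an encoding could look like (an assumption, not a step confirmed by the notebook), each category can be replaced by its relative frequency in the training split. The toy frames are hypothetical; the region codes are copied from the outputs above.

```python
import pandas as pd

# Hypothetical toy splits using region codes that appear in the outputs above.
train = pd.DataFrame({"hhs_geo_region": ["lzgpxyit", "lzgpxyit", "qufhixun", "lrircsnp"]})
test = pd.DataFrame({"hhs_geo_region": ["qufhixun", "oxchjgsf"]})

# Learn each category's relative frequency from the training split only.
freq_map = train["hhs_geo_region"].value_counts(normalize=True)

# Map both splits with the training frequencies; categories unseen in training
# (and missing values) fall back to 0.0.
train_encoded = train["hhs_geo_region"].map(freq_map).fillna(0.0)
test_encoded = test["hhs_geo_region"].map(freq_map).fillna(0.0)

print(train_encoded.tolist())
print(test_encoded.tolist())
```

Learning the frequencies from the training split alone follows the same leakage-avoidance pattern as the imputers and scaler used later.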


In [29]:
# Display the feature matrix (X) to check the current state of the input features
X

Unnamed: 0,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,doctor_recc_h1n1,...,income_poverty,marital_status,rent_or_own,employment_status,hhs_geo_region,census_msa,household_adults,household_children,employment_industry,employment_occupation
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,Below Poverty,Not Married,Own,Not in Labor Force,oxchjgsf,Non-MSA,0.0,0.0,,
1,3.0,2.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,...,Below Poverty,Not Married,Rent,Employed,bhuqouqj,"MSA, Not Principle City",0.0,0.0,pxcmvdjn,xgwztkwe
2,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,,...,"<= $75,000, Above Poverty",Not Married,Own,Employed,qufhixun,"MSA, Not Principle City",2.0,0.0,rucpziij,xtkaffoo
3,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,...,Below Poverty,Not Married,Rent,Not in Labor Force,lrircsnp,"MSA, Principle City",0.0,0.0,,
4,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,...,"<= $75,000, Above Poverty",Married,Own,Employed,qufhixun,"MSA, Not Principle City",1.0,0.0,wxleyezf,emcorrxb
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26702,2.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,"<= $75,000, Above Poverty",Not Married,Own,Not in Labor Force,qufhixun,Non-MSA,0.0,0.0,,
26703,1.0,2.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,...,"<= $75,000, Above Poverty",Not Married,Rent,Employed,lzgpxyit,"MSA, Principle City",1.0,0.0,fcxhlnwr,cmhcxjea
26704,2.0,2.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,...,,Not Married,Own,,lzgpxyit,"MSA, Not Principle City",0.0,0.0,,
26705,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,...,"<= $75,000, Above Poverty",Married,Rent,Employed,lrircsnp,Non-MSA,1.0,0.0,fcxhlnwr,haliazsg


### Feature Engineering

In [31]:
# Handle numeric columns using iterative imputation and scaling

# Create an instance of IterativeImputer to handle missing numeric data by predicting missing values iteratively
numeric_imputer = IterativeImputer(max_iter=100, random_state=42)

# Apply the imputer to the numerical columns in the training set (x_train)
# 'fit_transform' fits the imputer and applies the transformation to fill missing values
x_train_numeric = numeric_imputer.fit_transform(x_train[num_cols])

# Apply the imputer to the numerical columns in the test set (x_test)
# 'transform' only applies the transformation learned from the training set (to avoid data leakage)
x_test_numeric = numeric_imputer.transform(x_test[num_cols])

# Create an instance of MinMaxScaler to scale the numerical features to a range between 0 and 1
scaler = MinMaxScaler()

# Apply MinMaxScaler to the training set's numerical columns
# 'fit_transform' fits the scaler and applies the transformation to scale the values
x_train_numeric = scaler.fit_transform(x_train_numeric)

# Apply MinMaxScaler to the test set's numerical columns
# 'transform' applies the transformation learned from the training set without re-fitting
x_test_numeric = scaler.transform(x_test_numeric)


In [32]:
# Handle categorical columns using imputation and One-Hot Encoding

# Create an instance of SimpleImputer to handle missing categorical data
# 'strategy="constant"' fills missing values with a constant value ('Unknown' in this case)
categorical_imputer = SimpleImputer(strategy='constant', fill_value='Unknown')

# Apply the imputer to the categorical columns in the training set (x_train)
# 'fit_transform' fits the imputer and applies the transformation to fill missing values
x_train_categorical = categorical_imputer.fit_transform(x_train[ohe_cols])

# Apply the imputer to the categorical columns in the test set (x_test)
# 'transform' applies the learned transformation from the training set to avoid data leakage
x_test_categorical = categorical_imputer.transform(x_test[ohe_cols])

# Create an instance of OneHotEncoder to perform One-Hot Encoding on categorical data
# 'handle_unknown="ignore"' encodes categories not seen during fitting as all zeros instead of raising an error
# 'sparse_output=False' returns a dense array rather than a sparse matrix
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Apply One-Hot Encoding to the categorical columns in the training set
# 'fit_transform' fits the encoder and applies the transformation to encode categorical variables
x_train_categorical = encoder.fit_transform(x_train_categorical)

# Apply One-Hot Encoding to the categorical columns in the test set
# 'transform' applies the transformation learned from the training set to the test set
x_test_categorical = encoder.transform(x_test_categorical)

In [33]:
# Convert the processed results (numeric and categorical) back to DataFrames

# Convert the transformed numeric training data back to a DataFrame
# Use 'num_cols' as the column names since they correspond to the original numeric features
x_train_numeric = pd.DataFrame(x_train_numeric, columns=num_cols)

# Convert the transformed numeric test data back to a DataFrame
x_test_numeric = pd.DataFrame(x_test_numeric, columns=num_cols)

# Convert the transformed categorical training data back to a DataFrame
# Use 'encoder.get_feature_names_out(ohe_cols)' to get the names of the new binary columns created by One-Hot Encoding
x_train_categorical = pd.DataFrame(x_train_categorical, columns=encoder.get_feature_names_out(ohe_cols))

# Convert the transformed categorical test data back to a DataFrame
x_test_categorical = pd.DataFrame(x_test_categorical, columns=encoder.get_feature_names_out(ohe_cols))

In [34]:
# Concatenate the processed numeric and categorical features back into a single DataFrame

# Concatenate the processed numeric and categorical features for the training set
# 'axis=1' ensures the concatenation is done horizontally (i.e., adding columns)
x_train_processed = pd.concat([x_train_numeric, x_train_categorical], axis=1)

# Concatenate the processed numeric and categorical features for the test set
x_test_processed = pd.concat([x_test_numeric, x_test_categorical], axis=1)
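
The impute, scale, encode, and concatenate steps above can also be expressed as a single scikit-learn `ColumnTransformer`. This is an alternative sketch, not the notebook's approach; the toy frames and the lower `max_iter` are assumptions made to keep the example small and fast.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data with missing values in a numeric and a categorical column.
toy_train = pd.DataFrame({
    "h1n1_concern": [1.0, 3.0, np.nan, 2.0],
    "household_adults": [0.0, 1.0, 2.0, 1.0],
    "sex": ["Female", "Male", np.nan, "Female"],
})
toy_test = pd.DataFrame({
    "h1n1_concern": [np.nan],
    "household_adults": [1.0],
    "sex": ["Male"],
})

numeric_steps = Pipeline([
    ("impute", IterativeImputer(max_iter=10, random_state=42)),
    ("scale", MinMaxScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense array regardless of how sparse the result is.
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, ["h1n1_concern", "household_adults"]),
        ("cat", categorical_steps, ["sex"]),
    ],
    sparse_threshold=0,
)

# fit_transform learns every parameter from the training split only;
# transform then reuses them on the test split.
train_out = preprocess.fit_transform(toy_train)
test_out = preprocess.transform(toy_test)
print(train_out.shape, test_out.shape)
```

A single transformer object keeps the column order and fitted parameters together, which may simplify saving and reusing the preprocessing later.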

In [35]:
# Display the data types of each column in the processed training dataset
x_train_processed.dtypes

h1n1_concern                                float64
h1n1_knowledge                              float64
behavioral_antiviral_meds                   float64
behavioral_avoidance                        float64
behavioral_face_mask                        float64
behavioral_wash_hands                       float64
behavioral_large_gatherings                 float64
behavioral_outside_home                     float64
behavioral_touch_face                       float64
doctor_recc_h1n1                            float64
doctor_recc_seasonal                        float64
chronic_med_condition                       float64
child_under_6_months                        float64
health_worker                               float64
health_insurance                            float64
opinion_h1n1_vacc_effective                 float64
opinion_h1n1_risk                           float64
opinion_h1n1_sick_from_vacc                 float64
opinion_seas_vacc_effective                 float64
opinion_seas

In [36]:
# Display the data types of each column in the processed test dataset
x_test_processed.dtypes

h1n1_concern                                float64
h1n1_knowledge                              float64
behavioral_antiviral_meds                   float64
behavioral_avoidance                        float64
behavioral_face_mask                        float64
behavioral_wash_hands                       float64
behavioral_large_gatherings                 float64
behavioral_outside_home                     float64
behavioral_touch_face                       float64
doctor_recc_h1n1                            float64
doctor_recc_seasonal                        float64
chronic_med_condition                       float64
child_under_6_months                        float64
health_worker                               float64
health_insurance                            float64
opinion_h1n1_vacc_effective                 float64
opinion_h1n1_risk                           float64
opinion_h1n1_sick_from_vacc                 float64
opinion_seas_vacc_effective                 float64
opinion_seas

In [37]:
# Display the shape (number of rows and columns) of the processed training and test datasets

# Print the shape of the processed training data (x_train_processed)
print(x_train_processed.shape)

# Print the shape of the processed test data (x_test_processed)
print(x_test_processed.shape)

(21365, 56)
(5342, 56)


In [38]:
import pickle  # Serializes Python objects to a binary file

# Collect the fitted preprocessors so the same transformations can be reused on new data
preprocessors = {
    'numeric_imputer': numeric_imputer,          # fitted IterativeImputer
    'scaler': scaler,                            # fitted MinMaxScaler
    'categorical_imputer': categorical_imputer,  # fitted SimpleImputer
    'encoder': encoder                           # fitted OneHotEncoder
}

# Save the preprocessors to a pickle file (opened in write-binary mode)
with open('preprocessors.pkl', 'wb') as f:
    pickle.dump(preprocessors, f)

# Confirm that the preprocessors were saved
print("Preprocessors saved to preprocessors.pkl")

Preprocessors saved to preprocessors.pkl
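
Reloading the saved dictionary in a later session might look like the sketch below. It uses small stand-in objects and a temporary directory (both assumptions made for illustration) rather than the fitted imputers, scaler, and encoder from this notebook, so it can run on its own.

```python
import os
import pickle
import tempfile

# Stand-in objects; in the notebook these would be the fitted imputers, scaler and encoder.
preprocessors = {"scaler": {"min": 0.0, "max": 5.0}, "encoder": ["a", "b"]}

# Write to a temporary location so the example does not overwrite a real file.
path = os.path.join(tempfile.mkdtemp(), "preprocessors.pkl")
with open(path, "wb") as f:
    pickle.dump(preprocessors, f)

# Reload the dictionary and pull out the pieces by key.
with open(path, "rb") as f:
    restored = pickle.load(f)

scaler = restored["scaler"]
print(sorted(restored))
```

Pickle files should only be loaded from trusted sources, and scikit-learn objects are best reloaded with the same library version that saved them.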


### Model Development

In [40]:
# Import the LogisticRegression model from scikit-learn
from sklearn.linear_model import LogisticRegression  

# Import evaluation metrics to assess model performance
from sklearn.metrics import accuracy_score, f1_score, classification_report  

# Create an instance of LogisticRegression model with a fixed random state for reproducibility
model = LogisticRegression(random_state=42) 

# Train the LogisticRegression model using the processed training data (x_train_processed) and labels (y_train)
model.fit(x_train_processed, y_train)  


In [41]:
# Make predictions

# Use the trained LogisticRegression model to predict the target labels for the test data (x_test_processed)
y_pred = model.predict(x_test_processed)  


In [42]:
# Display the predicted target labels generated by the model
y_pred

array([0, 1, 0, ..., 1, 0, 0], dtype=int64)

In [43]:
# Calculate and display the F1 score as a percentage
print('f1_scoring', f1_score(y_test, y_pred)*100, '%')

f1_scoring 51.45945945945945 %


In [44]:
# Evaluate the model

# Calculate the accuracy of the model by comparing the true labels (y_test) with the predicted labels (y_pred)
# accuracy_score() function from sklearn computes the proportion of correct predictions
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy of the model, formatted to two decimal places for clarity
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.83


In [45]:
# Print a detailed classification report including precision, recall, F1 score, and support for each class
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.86      0.94      0.90      4207
           1       0.67      0.42      0.51      1135

    accuracy                           0.83      5342
   macro avg       0.76      0.68      0.71      5342
weighted avg       0.82      0.83      0.82      5342



In [46]:
# Display the confusion matrix

# Create a crosstab (confusion matrix) to compare the true labels (y_test) with the predicted labels (y_pred)
# This helps to understand how well the model performs by showing the count of true positives, true negatives,
# false positives, and false negatives
pd.crosstab(y_test, y_pred)

col_0            0    1
h1n1_vaccine
0             3968  239
1              659  476


### Explanation:

- **Rows (h1n1_vaccine)**: Represents the **actual values** from `y_test` (i.e., whether a person actually received the H1N1 vaccine).
  - `0`: The person did **not** receive the vaccine.
  - `1`: The person **did** receive the vaccine.
  
- **Columns (col_0)**: Represents the **predicted values** from `y_pred` (i.e., whether the model predicted that a person received the H1N1 vaccine).
  - `0`: The model predicted the person did **not** receive the vaccine.
  - `1`: The model predicted the person **did** receive the vaccine.

### Breakdown of the Values:

- **True Negatives (TN)**: 3968
  - The model correctly predicted "not vaccinated" for 3968 people who **did not** receive the vaccine (Actual = 0, Predicted = 0).

- **False Positives (FP)**: 239
  - The model predicted "vaccinated" for 239 people who **did not** receive the vaccine (Actual = 0, Predicted = 1).

- **False Negatives (FN)**: 659
  - The model predicted "not vaccinated" for 659 people who **did** receive the vaccine (Actual = 1, Predicted = 0).

- **True Positives (TP)**: 476
  - The model correctly predicted "vaccinated" for 476 people who **did** receive the vaccine (Actual = 1, Predicted = 1).

### Key Metrics (Based on the Confusion Matrix):

- **Accuracy** = (TP + TN) / Total = (476 + 3968) / (476 + 3968 + 239 + 659) ≈ 0.832 (83.2%)
  - The overall accuracy of the model is 83.2%, meaning the model correctly predicted the vaccine status in 83.2% of the cases. This agrees with the 0.83 reported by `accuracy_score`.
  
- **Precision** = TP / (TP + FP) = 476 / (476 + 239) ≈ 0.666 (66.6%)
  - Precision indicates that 66.6% of the people predicted to have received the vaccine actually received it.

- **Recall** = TP / (TP + FN) = 476 / (476 + 659) ≈ 0.419 (41.9%)
  - Recall indicates that 41.9% of the people who actually received the vaccine were correctly identified by the model.

- **F1-Score** = 2 * (Precision * Recall) / (Precision + Recall) ≈ 0.515 (51.5%)
  - The F1-Score is the harmonic mean of precision and recall. At 51.5% it is pulled down by the low recall.
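
The formulas above can be checked directly from the four cells of the crosstab output. The helper below is a minimal sketch; the function name `metrics_from_counts` is ours and is not part of scikit-learn.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts read from the crosstab: rows are actual labels, columns are predictions.
baseline = metrics_from_counts(tn=3968, fp=239, fn=659, tp=476)
for name, value in baseline.items():
    print(f"{name}: {value:.3f}")
```

The F1 value computed this way should agree with the `f1_score` output printed earlier (about 51.46%).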

### Conclusion:

The confusion matrix shows that although the model reaches an accuracy of 83.2%, its recall is only 41.9%: it misses a considerable number of people who did get vaccinated. Increasing recall would enable the model to detect more actual positive cases.

In [48]:
# Check the distribution of target labels (y_train)
# This gives the count of instances for each class in the training data.
y_train.value_counts()

h1n1_vaccine
0    16826
1     4539
Name: count, dtype: int64

### Imbalanced Dataset and Model Evaluation

The data used to train the logistic regression model is imbalanced:

- The training dataset has far more instances of the '0' class (not vaccinated, 16,826 rows) than of the '1' class (vaccinated, 4,539 rows), roughly 79% versus 21%.
- This class imbalance means the model is likely to predict the majority class ('0') more often, which inflates overall accuracy.
  
For example:
- The model might predict '0' for most of the instances, and even if it gets the minority class ('1') wrong, it could still achieve a high accuracy score because the majority class ('0') dominates the data.

Even though the model shows an accuracy of **83%**, this doesn't necessarily indicate that it is predicting correctly across both classes.
- A high accuracy might be misleading, as the model could be favoring the majority class and neglecting the minority class.
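
To put the 83% in context, the sketch below scores a trivial "model" that always predicts '0'. The class counts are taken from the `support` column of the classification report; everything else is plain Python written for illustration.

```python
# Class counts in the test split, taken from the classification report's "support" column.
n_negative, n_positive = 4207, 1135

y_true = [0] * n_negative + [1] * n_positive
y_always_zero = [0] * len(y_true)  # a "model" that never predicts vaccination

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_always_zero))
baseline_accuracy = correct / len(y_true)

# Recall for class 1: share of vaccinated respondents the model finds.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_always_zero))
baseline_recall = true_positives / n_positive

print(f"Majority-class accuracy: {baseline_accuracy:.3f}")
print(f"Recall for class 1: {baseline_recall:.3f}")
```

Any useful model has to be judged against this baseline, which is why recall and F1 for the vaccinated class matter more here than accuracy alone.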

In [50]:
from imblearn.over_sampling import SMOTE

# Apply SMOTE to balance the data

# SMOTE generates synthetic samples for the minority class to balance the class distribution.
# This helps improve model performance on imbalanced datasets by avoiding the bias towards the majority class.

smote = SMOTE(random_state=42)  # Initialize SMOTE with a fixed random seed for reproducibility
x_train_smote, y_train_smote = smote.fit_resample(x_train_processed, y_train)  # Apply SMOTE to the training data
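
The core idea behind SMOTE is to interpolate between a minority-class point and one of its nearest minority-class neighbours. The following is a simplified pure-Python sketch of that idea, not the `imblearn` implementation; the function name, the toy points, and the parameter defaults are all assumptions made for illustration.

```python
import math
import random


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding the point itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + lam * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_samples(minority, n_new=5)
print(new_points)
```

Because each synthetic point lies on a line segment between two real minority points, the new samples stay inside the region the minority class already occupies.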


In [51]:
# Train the Logistic Regression model

# Initialize the model with specified parameters:
# - max_iter=500: Set the maximum number of iterations for convergence
# - C=1: Regularization strength (higher values = less regularization)
# - solver='lbfgs': Optimization algorithm used for model fitting
# - random_state=42: Ensures reproducibility of results
model = LogisticRegression(max_iter=500, C=1, solver='lbfgs', random_state=42)

# Fit the model on the SMOTE-resampled training data
model.fit(x_train_smote, y_train_smote)


In [52]:
# Make predictions using the trained model

# Use the fitted Logistic Regression model to predict the labels for the test data (x_test_processed)
y_pred = model.predict(x_test_processed)

In [53]:
# Display the predicted labels for the test data
y_pred  # This shows the predicted values (0 or 1) for the test set, based on the trained model

array([0, 1, 0, ..., 1, 0, 0], dtype=int64)

In [54]:
# Evaluate the model's performance

# Calculate the accuracy score by comparing the true labels (y_test) with the predicted labels (y_pred)
accuracy_lr = accuracy_score(y_test, y_pred)

# Print the accuracy as a proportion rounded to two decimal places
print(f'Accuracy of Logistic Regression: {accuracy_lr:.2f}')

Accuracy of Logistic Regression: 0.77


In [55]:
pd.crosstab(y_test, y_pred)

col_0            0    1
h1n1_vaccine
0             3293  914
1              316  819



**Interpretation:**
- **True Negatives (3293):** The model correctly predicted '0' (not vaccinated) for 3293 instances.
- **False Positives (914):** The model incorrectly predicted '1' (vaccinated) for 914 instances that were actually '0' (not vaccinated).
- **False Negatives (316):** The model incorrectly predicted '0' (not vaccinated) for 316 instances that were actually '1' (vaccinated).
- **True Positives (819):** The model correctly predicted '1' (vaccinated) for 819 instances.

This confusion matrix helps evaluate the model's classification performance and understand its errors.


In [57]:
# # Hyperparameter tuning using GridSearchCV

# # Import GridSearchCV from scikit-learn to search for the best hyperparameters
# from sklearn.model_selection import GridSearchCV

# # Create a Logistic Regression model instance for hyperparameter tuning
# model = LogisticRegression()

# # Define the hyperparameters to search over
# params = {
#     'fit_intercept': [True, False],  # Whether to include an intercept term in the model
#     'penalty': ['l1', 'l2', 'elasticnet'],  # Regularization techniques to apply
#     'random_state': [i for i in range(1,43)],  # Different values for random state to ensure reproducibility
#     'solver': ['lbfgs', 'liblinear', 'newton-cg','newton-cholesky', 'sag', 'saga']  # Solvers to optimize the model
# }

# # Initialize GridSearchCV with the model, hyperparameters, cross-validation, and F1 score as the scoring metric
# grid_search = GridSearchCV(model, params, cv=3, verbose=3, scoring='f1')

# # Fit the GridSearchCV to the resampled training data (x_train_smote, y_train_smote)
# grid_search.fit(x_train_smote, y_train_smote)
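
Before running a grid like the one above, it can help to count how many model fits it implies. The sketch below does that arithmetic for the same parameter lists; the variable names are ours. It also shows how much the grid shrinks if `random_state` is held fixed, since a seed controls reproducibility rather than model capacity.

```python
from itertools import product
from math import prod

# Same value lists as the commented-out grid above.
params = {
    "fit_intercept": [True, False],
    "penalty": ["l1", "l2", "elasticnet"],
    "random_state": list(range(1, 43)),
    "solver": ["lbfgs", "liblinear", "newton-cg", "newton-cholesky", "sag", "saga"],
}
cv_folds = 3

# Every combination is fitted once per cross-validation fold.
n_combinations = prod(len(values) for values in params.values())
n_fits = n_combinations * cv_folds

# Holding the seed fixed removes one whole axis from the grid.
without_seed = {k: v for k, v in params.items() if k != "random_state"}
n_combinations_fixed_seed = prod(len(v) for v in without_seed.values())

# Enumerate the grid explicitly to confirm the count.
assert len(list(product(*params.values()))) == n_combinations

print(n_combinations, n_fits, n_combinations_fixed_seed)
```

Some of these combinations are also invalid in scikit-learn (for example, the `lbfgs` solver does not support an `l1` penalty), so the number of fits that actually succeed is lower than the raw count.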


In [58]:
# # Get the best hyperparameters from the grid search
# grid_search.best_params_

In [59]:
# Initialize the Logistic Regression model with tuned hyperparameters
model = LogisticRegression(fit_intercept=False,  # Do not include the intercept term in the model
                           penalty='l2',  # Use L2 regularization (Ridge)
                           random_state=32,  # Set random state for reproducibility
                           solver='sag',  # Use the 'sag' solver for optimization
                           max_iter=200)  # Set the maximum number of iterations for solver convergence

# Train the model using the SMOTE-balanced training data (x_train_smote, y_train_smote)
model.fit(x_train_smote, y_train_smote)

# Make predictions on the test set (x_test_processed)
y_pred = model.predict(x_test_processed)

In [60]:
# Import evaluation metrics (F1 score and classification report)
from sklearn.metrics import f1_score, classification_report

# Calculate and display the F1 score as a percentage for the model's predictions on the test set
print('f1_scoring', f1_score(y_test, y_pred) * 100, '%')

f1_scoring 57.12295367467781 %


In [61]:
# Print the classification report, showing precision, recall, F1 score, and support for each class
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.91      0.78      0.84      4207
           1       0.47      0.72      0.57      1135

    accuracy                           0.77      5342
   macro avg       0.69      0.75      0.71      5342
weighted avg       0.82      0.77      0.78      5342



In [62]:
# Create a confusion matrix to compare the true labels (y_test) with the predicted labels (y_pred)
pd.crosstab(y_test, y_pred)

col_0            0    1
h1n1_vaccine
0             3291  916
1              315  820


**Interpretation:**
- **True Negatives (TN)**: 3291 (Correctly predicted as 0)
- **False Positives (FP)**: 916 (Incorrectly predicted as 1, when it should have been 0)
- **False Negatives (FN)**: 315 (Incorrectly predicted as 0, when it should have been 1)
- **True Positives (TP)**: 820 (Correctly predicted as 1)



In [64]:
# Evaluate the model
accuracy_lr_ht = accuracy_score(y_test, y_pred)  # Calculate the accuracy by comparing the true labels with the predicted labels

# Print the accuracy as a proportion rounded to two decimal places
print(f'Accuracy of Logistic Regression after Hyperparameter Tuning: {accuracy_lr_ht:.2f}')

Accuracy of Logistic Regression after Hyperparameter Tuning: 0.77


### Conclusion:
- Training on the **SMOTE-balanced** data changed which cases the model gets right rather than how many: overall accuracy fell from 0.83 to 0.77, while recall for the vaccinated class rose from 0.42 to 0.72 and its F1 score rose from about 51.5% to 57.1%. Precision for that class dropped from 0.67 to 0.47.
- Handling **imbalanced data** with techniques such as SMOTE (Synthetic Minority Over-sampling Technique) keeps the model from favoring the majority class, which improves its ability to identify the minority class at the cost of more false positives.
- This observation suggests that balancing the data before training can be a key factor in enhancing the effectiveness of classification models.
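
The trade-off can be read off the two logistic regression confusion matrices shown earlier: the one trained on the original data and the one trained on SMOTE-resampled data with the tuned settings. The sketch below compares them side by side; the helper name is ours.

```python
def summary(tn, fp, fn, tp):
    """Accuracy plus precision and recall for the positive (vaccinated) class."""
    return {
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


before = summary(tn=3968, fp=239, fn=659, tp=476)  # trained on the original data
after = summary(tn=3291, fp=916, fn=315, tp=820)   # trained on SMOTE data, tuned settings

for metric in before:
    change = after[metric] - before[metric]
    print(f"{metric:<10} {before[metric]:.3f} -> {after[metric]:.3f} ({change:+.3f})")
```

Whether this trade is worthwhile depends on the use case: for outreach planning, missing vaccinated respondents and wrongly flagging unvaccinated ones carry different costs.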


In [66]:
# Import RandomForestClassifier from sklearn.ensemble for model training
from sklearn.ensemble import RandomForestClassifier
# Import accuracy_score from sklearn.metrics to evaluate model performance
from sklearn.metrics import accuracy_score

# Train the model
model = RandomForestClassifier(random_state=42)  # Initialize the RandomForestClassifier with a fixed random_state for reproducibility
model.fit(x_train_smote, y_train_smote)  # Fit the model to the resampled training data (x_train_smote and y_train_smote)

In [67]:
# Make predictions using the trained model
y_pred = model.predict(x_test_processed)  # Use the trained RandomForest model to predict labels for the test dataset (x_test_processed)
y_pred  # Display the predicted labels for the test set

array([0, 1, 0, ..., 0, 0, 0], dtype=int64)

In [68]:
# Evaluate the model
accuracy_rfc = accuracy_score(y_test, y_pred)
print(f'Accuracy of Random Forest Classifier: {accuracy_rfc:.2f}')

Accuracy of Random Forest Classifier: 0.84


In [69]:
# # Define the parameter grid
# param_grid = {  
#     'max_depth': [3, 5, 7, 10],  # Depth of the tree; higher values allow the model to capture more complex patterns
#     'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node
#     'min_samples_leaf': [1, 2, 4],  # Minimum number of samples required to be at a leaf node
#     'n_estimators': [100, 200, 300],  # Number of trees in the forest
#     'class_weight': ['balanced', 'balanced_subsample']  # Handles imbalanced classes by adjusting weights
# }

# # Initialize the RandomForestClassifier
# rfc = RandomForestClassifier(random_state=42)  # Create a RandomForestClassifier with a fixed random state for reproducibility

# # Initialize GridSearchCV
# grid_search = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5, n_jobs=-1, verbose=3, scoring='accuracy') 
# # GridSearchCV will perform cross-validation (cv=5) and search for the best combination of hyperparameters
# # n_jobs=-1 uses all available CPU cores for faster computation
# # verbose=3 provides detailed output during the search
# # scoring='accuracy' means the search will use accuracy as the performance metric

# # Fit GridSearchCV
# grid_search.fit(x_train_processed, y_train)  # Fit the grid search to the training data, which will search for the best hyperparameters based on accuracy


In [70]:
# # Get the best parameters from the GridSearchCV result
# best_params = grid_search.best_params_  # Retrieve the hyperparameter combination that yielded the best performance

# # Get the best score from the GridSearchCV result
# best_score = grid_search.best_score_  # Retrieve the best cross-validation score achieved by the best parameters


In [71]:
# # Print the best hyperparameters found by GridSearchCV
# print(f'Best Parameters: {best_params}')  # Display the best combination of hyperparameters

# # Print the best score achieved by the best hyperparameters
# print(f'Best Score: {best_score:.2f}')  # Display the best cross-validation score, rounded to 2 decimal places


In [72]:
# Train the model using the best hyperparameters found from GridSearchCV

# Initialize the RandomForestClassifier with the best parameters:
# - max_depth=10: Maximum depth of the tree, controls overfitting
# - min_samples_leaf=1: Minimum number of samples required to be at a leaf node
# - min_samples_split=2: Minimum number of samples required to split an internal node
# - n_estimators=300: The number of trees in the forest
# - random_state=42: Ensures reproducibility of results
best_rfc = RandomForestClassifier(max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=300, random_state=42)

# Fit the model on the resampled training data (x_train_smote and y_train_smote)
# This will train the RandomForest model to learn from the training data
best_rfc.fit(x_train_smote, y_train_smote)


In [73]:
# Make predictions using the trained RandomForest model
y_pred = best_rfc.predict(x_test_processed)  # Predict labels for the test set using the best RandomForest model
y_pred  # Output the predictions to view the results

array([0, 1, 0, ..., 1, 0, 0], dtype=int64)

In [74]:
# Evaluate the model
accuracy_rfc_ht = accuracy_score(y_test, y_pred)
print(f'Accuracy of Random Forest Classifier after Hyperparameter Tuning: {accuracy_rfc_ht:.2f}')

Accuracy of Random Forest Classifier after Hyperparameter Tuning: 0.83


In [75]:
# Evaluate the model using F1 score
from sklearn.metrics import f1_score, classification_report  # Import metrics for model evaluation

# Calculate the F1 score and display it as a percentage
print('f1_scoring', f1_score(y_test, y_pred) * 100, '%')


f1_scoring 58.375401560348784 %


In [76]:
# Print the classification report, showing precision, recall, F1 score, and support for each class
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.88      0.90      0.89      4207
           1       0.61      0.56      0.58      1135

    accuracy                           0.83      5342
   macro avg       0.75      0.73      0.74      5342
weighted avg       0.83      0.83      0.83      5342



In [77]:
# Create a confusion matrix of actual vs. predicted labels
pd.crosstab(y_test, y_pred)


col_0            0    1
h1n1_vaccine
0             3799  408
1              499  636


**Interpretation:**
- **True Negatives (TN):** 3799 - Correctly predicted as not vaccinated.
- **False Positives (FP):** 408 - Incorrectly predicted as vaccinated.
- **False Negatives (FN):** 499 - Missed vaccinated individuals.
- **True Positives (TP):** 636 - Correctly predicted as vaccinated.


In [79]:
# Display the value counts for each class in the SMOTE-balanced training data
y_train_smote.value_counts()


h1n1_vaccine
0    16826
1    16826
Name: count, dtype: int64

In [80]:
# Clean column names in x_train_smote by removing '[', ']', '<' and '>' and replacing spaces with underscores,
# for compatibility with the gradient boosting models that follow (XGBoost rejects feature names containing '[', ']' or '<')
x_train_smote.columns = [str(col).replace('[', '').replace(']', '').replace('<', '').replace('>', '').replace(' ', '_') 
                         for col in x_train_smote.columns]

# Apply the same cleaning process to column names in x_test_processed to ensure consistency
x_test_processed.columns = [str(col).replace('[', '').replace(']', '').replace('<', '').replace('>', '').replace(' ', '_') 
                            for col in x_test_processed.columns]
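
The same cleaning rule could be factored into one helper so that the training and test frames are guaranteed to be treated identically. This is a sketch: the function name `clean_feature_name` and the example column names are hypothetical and are not taken from the dataset.

```python
import re


def clean_feature_name(name):
    """Drop '[', ']', '<', '>' and turn spaces into underscores."""
    return re.sub(r"[\[\]<>]", "", str(name)).replace(" ", "_")


# Hypothetical one-hot column names, for illustration only.
examples = ["income_<= 75k", "age_[18, 34]", "h1n1_concern"]
cleaned = [clean_feature_name(name) for name in examples]
print(cleaned)

# Stripping characters can make two different names collide, so check for duplicates.
has_collisions = len(set(cleaned)) != len(cleaned)
print(has_collisions)
```

With DataFrames, the helper would be applied as `df.columns = [clean_feature_name(c) for c in df.columns]` on both frames.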


In [81]:
# Import GradientBoostingClassifier from scikit-learn ensemble module
from sklearn.ensemble import GradientBoostingClassifier

# Initialize the GradientBoostingClassifier model with a fixed random state for reproducibility
model = GradientBoostingClassifier(random_state=42)

# Fit the model on the resampled training data (x_train_smote, y_train_smote)
model.fit(x_train_smote, y_train_smote)


In [82]:
# Make predictions using the trained GradientBoostingClassifier model on the processed test data
y_pred = model.predict(x_test_processed)

# Display the predicted labels for the test set
y_pred


array([0, 1, 0, ..., 1, 0, 0], dtype=int64)

In [83]:
# Evaluate the model
accuracy_gbc = accuracy_score(y_test, y_pred)
print(f'Accuracy of GradientBoostingClassifier: {accuracy_gbc: .2f}')

Accuracy of GradientBoostingClassifier:  0.83


In [84]:
# # Import GridSearchCV to perform hyperparameter tuning
# from sklearn.model_selection import GridSearchCV  

# # Define the parameter grid for hyperparameter tuning in GradientBoostingClassifier
# param_grid = {
#     'learning_rate': [0.1, 0.3, 0.6],  # Set of learning rates to try
#     'max_depth': [5, 6, 7, 8, 9],      # Max depth of trees to test
#     'n_estimators': [50, 65, 80],      # Number of estimators (trees) to test
#     'random_state': [i for i in range(1, 43)]  # Range of random states for reproducibility
# }

# # Initialize the GradientBoostingClassifier model
# GBH_HT = GradientBoostingClassifier()  

# # Initialize GridSearchCV to perform hyperparameter tuning using 5-fold cross-validation
# GBHT = GridSearchCV(estimator=GBH_HT,             # Estimator model to tune
#                     scoring='f1',                 # Use F1 score for evaluation
#                     refit=True,                   # Refit the model with the best parameters
#                     param_grid=param_grid,       # Parameter grid to search through
#                     cv=5,                         # Number of folds for cross-validation
#                     verbose=3,                    # Show detailed progress during fitting
#                     n_jobs=-1)                    # Use all available CPU cores for parallel processing

# # Perform hyperparameter tuning on the training data (x_train_smote, y_train_smote)
# GBHT.fit(x_train_smote, y_train_smote)


In [85]:
# # Retrieve the best hyperparameters found by GridSearchCV during the search process
# cv_best_params = GBHT.best_params_  


In [86]:
# # Print the best hyperparameters found during GridSearchCV
# print(f"Best parameters :  {cv_best_params})")  


In [87]:
# Train the model using the best hyperparameters found from GridSearchCV

GBHT = GradientBoostingClassifier(learning_rate=0.1,  # Initialize GradientBoostingClassifier with a learning rate of 0.1
                                 n_estimators=50,    # Set the number of boosting iterations (trees) to 50
                                 max_depth=9,        # Set the maximum depth of individual trees to 9
                                 random_state=40)    # Set the random state for reproducibility

GBHT.fit(x_train_smote, y_train_smote)  # Fit the GradientBoostingClassifier model to the resampled training data (x_train_smote, y_train_smote)


In [88]:
# Use the trained GradientBoostingClassifier (GBHT) to make predictions on the test data (x_test_processed)
gbmh_pred = GBHT.predict(x_test_processed)  


In [89]:
# Calculate and print the accuracy score by comparing the true labels (y_test) with the predicted labels (gbmh_pred)
accuracy_gbc_ht = accuracy_score(y_test, gbmh_pred)
print(f'Accuracy of GradientBoostingClassifier after Hyperparameter Tuning: {accuracy_gbc_ht: .2f}')


Accuracy of GradientBoostingClassifier after Hyperparameter Tuning:  0.84


In [90]:
# Calculate the F1 score by comparing the true labels (y_test) with the predicted labels (gbmh_pred)
f1gbh = f1_score(y_test, gbmh_pred)  


In [91]:
# Generate and print a detailed classification report with precision, recall, F1 score, and support for each class
print(classification_report(y_test, gbmh_pred))  


              precision    recall  f1-score   support

           0       0.88      0.93      0.90      4207
           1       0.67      0.53      0.59      1135

    accuracy                           0.84      5342
   macro avg       0.78      0.73      0.75      5342
weighted avg       0.84      0.84      0.84      5342



In [92]:
# Create and display a confusion matrix to compare the true labels (y_test) and predicted labels (gbmh_pred)
pd.crosstab(y_test, gbmh_pred)  


col_0            0    1
h1n1_vaccine
0             3909  298
1              531  604


**Interpretation:**
- **True Negatives (TN)**: 3909 – Correctly predicted `0` (no vaccine) when the actual value is `0`.
- **False Positives (FP)**: 298 – Incorrectly predicted `1` (vaccine) when the actual value is `0`.
- **False Negatives (FN)**: 531 – Incorrectly predicted `0` (no vaccine) when the actual value is `1`.
- **True Positives (TP)**: 604 – Correctly predicted `1` (vaccine) when the actual value is `1`.


In [94]:
 # !pip install xgboost

In [95]:
from xgboost import XGBClassifier

In [96]:
 # Initialize the XGBoost classifier model with a fixed random state for reproducibility
model = XGBClassifier(random_state=42) 

# Train the XGBoost model using the resampled training data (x_train_smote) and their corresponding labels (y_train_smote)
model.fit(x_train_smote, y_train_smote)  

In [97]:
# Use the trained XGBoost model to make predictions on the processed test data (x_test_processed)
y_pred = model.predict(x_test_processed)  

# Return the predictions for review or further analysis
y_pred  


array([0, 1, 0, ..., 0, 0, 0])

In [98]:
# Evaluate the model
accuracy_xgbc = accuracy_score(y_test, y_pred)
print(f'Accuracy of XGBClassifier: {accuracy_xgbc: .2f}')

Accuracy of XGBClassifier:  0.84


In [99]:
# from sklearn.model_selection import GridSearchCV  # Import GridSearchCV to perform hyperparameter tuning

# # Define a dictionary with hyperparameters and their values to search over
# param_grid = {
#     'gamma': [0, 0.1, 0.2, 0.4],  # Values for gamma parameter which controls regularization
#     'learning_rate': [0.01, 0.03, 0.06, 0.1],  # Values for learning rate to control the step size at each iteration
#     'max_depth': [5, 6, 7, 8, 9],  # Values for maximum tree depth to prevent overfitting
#     'n_estimators': [50, 65, 80],  # Number of boosting rounds (trees)
#     'reg_alpha': [0, 0.1, 0.2, 0.4],  # Values for L1 regularization (Lasso)
#     'reg_lambda': [0, 0.1, 0.2]  # Values for L2 regularization (Ridge)
# }

# # Initialize the XGBClassifier model with fixed random state, verbosity level and silent parameter for controlling logging
# model = XGBClassifier(random_state=42, verbosity=3, silent=0)

# # Initialize GridSearchCV with the model and parameter grid, using 3-fold cross-validation and f1 score for evaluation
# xgb_ht = GridSearchCV(estimator=model, scoring='f1', refit=True, param_grid=param_grid, cv=3, verbose=3, n_jobs=-1)

# # Fit the GridSearchCV to the training data (x_train_smote, y_train_smote) to find the best parameters
# xgb_ht.fit(x_train_smote, y_train_smote)


In [100]:
# # Extract the best hyperparameters found by GridSearchCV
# best_params = xgb_ht.best_params_  

# # Extract the best score achieved by GridSearchCV with the best hyperparameters
# best_score = xgb_ht.best_score_ 

# # Print the best hyperparameters
# print(f'Best Parameters: {best_params}')  

# # Print the best score, formatted to two decimal places
# print(f'Best Score: {best_score:.2f}')    


In [101]:
# Initialize the model with the best hyperparameters found from GridSearchCV (it is fitted in the next cell)

XGB_HT = XGBClassifier(reg_lambda = 0,                     # Regularization term for L2 regularization, set to 0 for no L2 regularization
                      reg_alpha= 0,                       # Regularization term for L1 regularization, set to 0 for no L1 regularization
                      n_estimators=80,                    # Number of boosting rounds (trees) to train, set to 80
                      max_depth=5,                        # Maximum depth of each decision tree, set to 5 for limited complexity
                      learning_rate=0.03,                 # Step size to update weights, set to 0.03 for more gradual learning
                      gamma=0.2,                          # Minimum loss reduction required to make a further partition, set to 0.2 to control tree growth
                      random_state = 16)                  # Random seed for reproducibility, set to 16 for fixed results


In [102]:
# Train the XGBoost model using the SMOTE-sampled training data
XGB_HT.fit(x_train_smote, y_train_smote)  


In [103]:
# Use the trained model to predict labels on the test data
y_pred = XGB_HT.predict(x_test_processed)  


In [104]:
# Calculate the accuracy score by comparing the actual vs predicted labels
accuracy_xgbc_ht = accuracy_score(y_test, y_pred)  # Store accuracy score in the variable 'accuracy_xgbc_ht'

# Print the accuracy of the XGBClassifier after hyperparameter tuning
print(f'Accuracy of XGBClassifier after Hyperparameter Tuning: {accuracy_xgbc_ht:.2f}')  # Display the accuracy to two decimal places


Accuracy of XGBClassifier after Hyperparameter Tuning: 0.83


### Model Evaluation

In [106]:
# Print the accuracy of the Logistic Regression model (accuracy_lr) formatted to two decimal places
print(f'Accuracy of Logistic Regression: {accuracy_lr:.2f}')

# Print the accuracy of the Random Forest Classifier model (accuracy_rfc) formatted to two decimal places
print(f'Accuracy of Random Forest Classifier: {accuracy_rfc:.2f}')

# Print the accuracy of the Gradient Boosting Classifier model (accuracy_gbc) formatted to two decimal places
print(f'Accuracy of Gradient Boosting Classifier: {accuracy_gbc:.2f}')

# Print the accuracy of the XGBoost Classifier model (accuracy_xgbc) formatted to two decimal places
print(f'Accuracy of XGBoost Classifier: {accuracy_xgbc:.2f}')


Accuracy of Logistic Regression: 0.77
Accuracy of Random Forest Classifier: 0.84
Accuracy of Gradient Boosting Classifier: 0.83
Accuracy of XGBoost Classifier: 0.84


In [107]:
# Print the accuracy of the Logistic Regression model after Hyperparameter Tuning (accuracy_lr_ht) formatted to two decimal places
print(f'Accuracy of Logistic Regression (Hyperparameter Tuning): {accuracy_lr_ht:.2f}')

# Print the accuracy of the Random Forest Classifier model after Hyperparameter Tuning (accuracy_rfc_ht) formatted to two decimal places
print(f'Accuracy of Random Forest Classifier (Hyperparameter Tuning): {accuracy_rfc_ht:.2f}')

# Print the accuracy of the Gradient Boosting Classifier model after Hyperparameter Tuning (accuracy_gbc_ht) formatted to two decimal places
print(f'Accuracy of Gradient Boosting Classifier (Hyperparameter Tuning): {accuracy_gbc_ht:.2f}')

# Print the accuracy of the XGBoost Classifier model after Hyperparameter Tuning (accuracy_xgbc_ht) formatted to two decimal places
print(f'Accuracy of XGBoost Classifier (Hyperparameter Tuning): {accuracy_xgbc_ht:.2f}')


Accuracy of Logistic Regression (Hyperparameter Tuning): 0.77
Accuracy of Random Forest Classifier (Hyperparameter Tuning): 0.83
Accuracy of Gradient Boosting Classifier (Hyperparameter Tuning): 0.84
Accuracy of XGBoost Classifier (Hyperparameter Tuning): 0.83
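
The two sets of printed accuracies are easier to compare side by side. The sketch below hard-codes the rounded values from the outputs above, purely for illustration; in the notebook the `accuracy_*` variables would be used instead.

```python
# Accuracies as printed above (rounded to two decimals).
results = {
    "Logistic Regression": {"untuned": 0.77, "tuned": 0.77},
    "Random Forest": {"untuned": 0.84, "tuned": 0.83},
    "Gradient Boosting": {"untuned": 0.83, "tuned": 0.84},
    "XGBoost": {"untuned": 0.84, "tuned": 0.83},
}

# Print one row per model with the change after tuning.
print(f"{'Model':<22}{'Untuned':>8}{'Tuned':>8}{'Change':>8}")
for name, acc in results.items():
    change = round(acc["tuned"] - acc["untuned"], 2)
    print(f"{name:<22}{acc['untuned']:>8.2f}{acc['tuned']:>8.2f}{change:>+8.2f}")

# Largest absolute change in accuracy across the four models.
largest_change = max(abs(round(a["tuned"] - a["untuned"], 2)) for a in results.values())
print(f"Largest change after tuning: {largest_change:.2f}")
```

At two-decimal precision no model moves by more than 0.01, so differences between the models are better judged with F1 and recall than with accuracy.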


### Overcoming Challenges: A Journey Through Hyperparameter Tuning

This project was not without its challenges. From addressing data imbalances to optimizing machine learning models, each difficulty pushed us to explore innovative solutions and refine our approach. Here's how we navigated these obstacles:

---

### 1. **Imbalanced Dataset**
- **Challenge:** The original dataset was imbalanced, with unequal representation of classes. This could have caused the models to favor the majority class, leading to biased predictions.
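- **Solution:** We used **SMOTE (Synthetic Minority Oversampling Technique)** to balance the training set (16,826 examples per class after resampling) and left the test set untouched, so evaluation still reflects the original class distribution. For logistic regression, this raised recall for the vaccinated class from 0.42 to 0.72, at the cost of lower precision and overall accuracy.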

---

### 2. **Selecting the Right Model**
- **Challenge:** Choosing the best algorithm for predicting vaccination status was not straightforward. Each model had its unique strengths and weaknesses.
- **Solution:** We evaluated multiple algorithms:
  - **Logistic Regression**: A simple, interpretable baseline; it had the lowest accuracy (0.77 after tuning), suggesting it struggled with non-linear patterns.
  - **Random Forest**: Higher accuracy (0.83 after tuning), but needed tuning to limit overfitting.
  - **Gradient Boosting**: The strongest performer (0.84 after tuning), building trees sequentially so that each one corrects errors made by the previous ones.
  - **XGBoost**: Matched Random Forest in accuracy (0.83 after tuning) while training efficiently.
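
A comparison like this can be organized as a loop over candidate models that reports the same metrics for each. The sketch below assumes synthetic data from `make_classification` in place of the survey features and leaves out XGBoost to keep the dependencies to scikit-learn; the settings are illustrative, not the ones used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey features (roughly an 80/20 class split)
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Fit each candidate on the same split and record accuracy and F1
results = {}
for name, clf in candidates.items():
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    results[name] = (accuracy_score(y_te, pred), f1_score(y_te, pred))
    print(f"{name}: accuracy={results[name][0]:.2f}, F1={results[name][1]:.2f}")
```

Keeping the split and the metrics identical across models is what makes the numbers comparable.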

---

### 3. **Hyperparameter Tuning**
- **Challenge:** Default parameters are rarely optimal, so each model needed tuning to find out how much additional performance was available.
- **Solution:** We conducted **Grid Search** to systematically test various hyperparameters. This process involved:
  - Adjusting tree depth, learning rate, and the number of estimators.
  - Optimizing regularization terms to prevent overfitting.
  - Balancing computational efficiency with predictive accuracy.
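
As a rough illustration of the grid search described above, the following sketch runs scikit-learn's `GridSearchCV` over tree depth, learning rate, and number of estimators. The grid values, the synthetic data, and the choice of F1 as the scoring metric are assumptions made for the example; the grid actually searched in this notebook may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset so the search finishes quickly
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.8, 0.2], random_state=0)

# Hypothetical grid: 2 x 2 x 2 = 8 combinations
param_grid = {
    "learning_rate": [0.05, 0.1],
    "n_estimators": [25, 50],
    "max_depth": [3, 9],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=40),
    param_grid,
    scoring="f1",   # assumption: optimize F1 rather than accuracy
    cv=3,           # 3-fold cross-validation for each combination
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```

The cost grows multiplicatively with each added parameter value (here 8 combinations times 3 folds), which is the trade-off between computational efficiency and predictive accuracy noted in the list.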

---

### 4. **Interpreting Model Performance**
- **Challenge:** Understanding why certain models performed better and validating the results across different metrics.
- **Solution:** We used detailed evaluation techniques:
  - **Accuracy**: A straightforward measure of correct predictions.
  - **F1 Score**: Balanced precision and recall to address class imbalance.
  - **Confusion Matrix**: Gave insights into false positives and false negatives.
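
The toy example below shows why accuracy alone can mislead on imbalanced data. The labels are invented for illustration: a "model" that always predicts the majority class and one that finds half of the positives both score 0.8 accuracy, but their F1 scores are 0.0 and 0.5.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy labels (assumption): 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_majority = [0] * 10                       # always predicts the majority class
y_model = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]    # one true positive, one false positive

for name, y_pred in [("majority", y_majority), ("model", y_model)]:
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, zero_division=0)  # F1 is 0 when nothing is predicted positive
    print(f"{name}: accuracy={acc:.2f}, F1={f1:.2f}")
    print(confusion_matrix(y_true, y_pred))         # rows: actual, columns: predicted
```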

---

### **Key Learning and Success**
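- **Gradient Boosting Classifier** emerged as the best-performing model, with a test accuracy of **84%** after hyperparameter tuning.
- The gains from tuning were modest (Gradient Boosting moved from 0.83 to 0.84, while XGBoost moved from 0.84 to 0.83), so accuracy alone does not separate the top models. The F1 score and confusion matrix are needed to judge how well the vaccinated class is identified.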

---

### Conclusion
We addressed class imbalance with SMOTE, compared four algorithms, tuned them with grid search, and evaluated the results with accuracy, F1 score, and a confusion matrix. Tuning brought only small changes in accuracy, which underlines the value of checking several metrics rather than relying on a single headline number. The structured, iterative workflow is the main takeaway and carries over directly to future modeling work.


In [109]:
# Instantiate the Gradient Boosting Classifier with specific hyperparameters
model = GradientBoostingClassifier(
    learning_rate=0.1,  # Controls the contribution of each tree to the final model
    n_estimators=50,    # Specifies the number of boosting iterations (trees) to use
    max_depth=9,        # Maximum depth of each tree; deeper trees capture more complex patterns but overfit more easily
    random_state=40     # Ensures reproducibility of the results by fixing the random seed
)


In [110]:
# Train the Gradient Boosting Classifier model using the balanced training dataset
model.fit(x_train_smote, y_train_smote)


In [111]:
# Generate predictions on the processed test dataset using the trained Gradient Boosting Classifier
gbmh_pred = model.predict(x_test_processed)


In [112]:
# Calculate the accuracy score for the Gradient Boosting Classifier after hyperparameter tuning
accuracy_gbc_ht = accuracy_score(y_test, gbmh_pred)

# Print the accuracy score formatted to two decimal places
print(f'Accuracy of Gradient Boosting Classifier after Hyperparameter Tuning: {accuracy_gbc_ht:.2f}')


Accuracy of Gradient Boosting Classifier after Hyperparameter Tuning: 0.84


In [113]:
# Calculate the F1 score for the Gradient Boosting Classifier model after hyperparameter tuning
f1gbh = f1_score(y_test, gbmh_pred)


In [114]:
# Print a detailed classification report for the Gradient Boosting Classifier
print(classification_report(y_test, gbmh_pred))


              precision    recall  f1-score   support

           0       0.88      0.93      0.90      4207
           1       0.67      0.53      0.59      1135

    accuracy                           0.84      5342
   macro avg       0.78      0.73      0.75      5342
weighted avg       0.84      0.84      0.84      5342



In [115]:
# Create a cross-tabulation of actual and predicted values to evaluate model performance
pd.crosstab(y_test, gbmh_pred)


col_0            0    1
h1n1_vaccine
0             3909  298
1              531  604
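
The figures in the classification report can be recovered directly from the four counts in this cross-tabulation. The short calculation below does that for the vaccinated class (1) and adds the accuracy of a trivial baseline that always predicts "not vaccinated"; the variable names are chosen for this example.

```python
# Counts copied from the crosstab (rows: actual, columns: predicted)
tn, fp = 3909, 298   # actual 0: predicted 0, predicted 1
fn, tp = 531, 604    # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)            # of predicted vaccinated, how many were
recall = tp / (tp + fn)               # of actually vaccinated, how many were found
f1 = 2 * precision * recall / (precision + recall)
majority_baseline = (tn + fp) / total  # accuracy of always predicting class 0

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
print(f"majority-class baseline accuracy={majority_baseline:.2f}")
```

This prints accuracy 0.84, precision 0.67, recall 0.53, and F1 0.59, matching the report, against a majority-class baseline of 0.79. The model therefore adds about five points of accuracy over the trivial baseline, and its value lies mainly in the 604 vaccinated respondents it identifies correctly.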


### Deployment

In [117]:
import pickle
from sklearn.ensemble import GradientBoostingClassifier

# Initialize Gradient Boosting Classifier with specified hyperparameters
final_model = GradientBoostingClassifier(
    learning_rate=0.1,  # Learning rate for boosting
    n_estimators=50,    # Number of boosting stages (trees)
    max_depth=9,        # Maximum depth of each tree
    random_state=40     # Same seed as the evaluated model, so the saved model matches the reported metrics
)

# Train the Gradient Boosting model on the balanced dataset
final_model.fit(x_train_smote, y_train_smote)

# Save the trained model as a pickle file for future use
with open('h1n1_gradient_boosting_model.pkl', 'wb') as f:
    pickle.dump(final_model, f)  # Serialize and save the model object

print("Model saved to h1n1_gradient_boosting_model.pkl")  # Confirmation message


Model saved to h1n1_gradient_boosting_model.pkl
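
For completeness, here is a sketch of the other half of deployment: reloading a pickled model and using it for predictions without retraining. It trains a small stand-in model on synthetic data and writes it to a temporary file, so the file name and data here are hypothetical and not the artifacts produced by this notebook.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in for the trained model and the processed feature matrix
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
trained = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)

# Save to a temporary location (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
with open(path, "wb") as f:
    pickle.dump(trained, f)

# Later, for example in a separate process: reload and predict without retraining
with open(path, "rb") as f:
    loaded = pickle.load(f)

same = bool((loaded.predict(X) == trained.predict(X)).all())
print(same)
```

New data must pass through the same preprocessing as the training features before calling `predict`. Pickle files should only be loaded from trusted sources, ideally with the same scikit-learn version that created them.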


### Final Model - **Gradient Boosting Classifier**

This project applied a **Gradient Boosting Classifier** (GBC) to predict H1N1 vaccine uptake, following model training, hyperparameter tuning, and evaluation. The final steps, from training through saving the model, can be summarized as follows:

1. **Model Initialization and Training**:
   - The model was initialized with key hyperparameters such as a learning rate of 0.1, 50 boosting iterations (trees), a maximum tree depth of 9, and a fixed random seed for reproducibility.
   - The model was trained on a **balanced dataset** using the **SMOTE technique**, ensuring that the class imbalance issue was addressed, which could otherwise impact model performance.

2. **Performance Evaluation**:
   - After training, predictions were made on the processed test dataset, and performance metrics such as **accuracy**, **F1 score**, and a **detailed classification report** were used to assess the model.
   - The tuned model reached an **accuracy** of 0.84 on the test set. For the vaccinated class (1), precision was 0.67 and recall was 0.53 (F1 = 0.59), so the model still misses nearly half of the respondents who were actually vaccinated.
   - A **cross-tabulation** (confusion matrix) was generated to evaluate the true positives, true negatives, false positives, and false negatives, providing further insights into the model's classification ability.

3. **Model Saving**:
   - The final model was saved using **pickle**, ensuring that the trained model could be easily deployed for future predictions without the need to retrain.
   - The model was serialized into a file named `h1n1_gradient_boosting_model.pkl`, providing a reusable artifact for further analysis or integration into production systems.

4. **Key Insights**:
   - Hyperparameter tuning produced only a small change in accuracy for Gradient Boosting (0.83 before, 0.84 after), which suggests the default settings were already close to the tuned ones on this metric.
   - The **Gradient Boosting Classifier** was the strongest of the models tested by **accuracy**; its **F1 score** of 0.59 on the vaccinated class shows that minority-class recall remains the main area for improvement.

In conclusion, by combining SMOTE, **Gradient Boosting**, hyperparameter tuning, and multi-metric evaluation, the project built and saved a model that reaches 84% test accuracy on H1N1 vaccination and can be reloaded for future predictions.
