## 1. Business Understanding
The goal of this project is to predict whether an individual received the H1N1 vaccine using survey data from the National 2009 H1N1 Flu Survey. This analysis will help public health agencies identify patterns in demographics, health behaviors, and opinions to better target vaccination campaigns and allocate resources effectively.

To solve this problem, I will use data science tools such as data cleaning and preprocessing, exploratory data analysis, and machine learning models (logistic regression and decision trees) to build and evaluate predictive models for vaccination behavior.
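- Why classification is appropriate: the target, h1n1_vaccine, takes only two values (1 = vaccinated, 0 = not vaccinated), so the task is to assign each respondent to one of two classes.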
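
As a minimal sketch of the modeling step, the cell below fits the two model families named above on synthetic data from `make_classification`, not on the survey data. The class weights, `max_depth=4`, and the use of ROC AUC as the metric are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary problem standing in for the survey data.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Hypothetical settings; neither model is tuned here.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Probability of the positive class, scored on held-out rows only.
    proba = model.predict_proba(X_test)[:, 1]
    scores[name] = roc_auc_score(y_test, proba)

print(scores)
```

Scoring on held-out rows keeps the comparison between the two models honest, which matters once the same pattern is applied to the real survey features.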

## 2. Data Understanding
The dataset comes from the National 2009 H1N1 Flu Survey and contains information on respondents’ demographics, health status, opinions about vaccines, and health-related behaviors. Key features include age, sex, race, education, chronic health conditions, perceived risk of H1N1, and previous vaccination history.

The target variable is h1n1_vaccine (1 = vaccinated, 0 = not vaccinated). Initial steps include examining feature distributions, handling missing values, encoding categorical variables, and exploring relationships between features and the target to guide model building.
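
The first checks described above can be sketched as follows. The cell uses a small made-up frame rather than the survey data, and pairing `doctor_recc_h1n1` with the target is only an example of a feature-target comparison.

```python
import pandas as pd

# Toy rows with made-up values; column names follow the survey data.
toy = pd.DataFrame({
    "h1n1_vaccine": [0, 0, 0, 1, 0, 1, 0, 0],
    "doctor_recc_h1n1": [0.0, 0.0, 1.0, 1.0, 0.0, 1.0, None, 0.0],
})

# Share of each class in the target.
class_balance = toy["h1n1_vaccine"].value_counts(normalize=True)

# Vaccination rate within each level of one feature
# (rows with a missing feature value are left out by groupby).
rate_by_recc = toy.groupby("doctor_recc_h1n1")["h1n1_vaccine"].mean()

print(class_balance)
print(rate_by_recc)
```

An imbalanced target is worth knowing about early, since it affects both how the data should be split and which evaluation metrics are informative.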

## Exploratory Data Analysis

In [1]:
# importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler

df_features = pd.read_csv('./data/training_set_features.csv')
df_labels = pd.read_csv('./data/training_set_labels.csv')

# merging features and labels dataframes on 'respondent_id'
df = pd.merge(df_features, df_labels, on='respondent_id')
df.head()

| | respondent_id | h1n1_concern | h1n1_knowledge | behavioral_antiviral_meds | behavioral_avoidance | behavioral_face_mask | behavioral_wash_hands | behavioral_large_gatherings | behavioral_outside_home | behavioral_touch_face | ... | rent_or_own | employment_status | hhs_geo_region | census_msa | household_adults | household_children | employment_industry | employment_occupation | h1n1_vaccine | seasonal_vaccine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | Own | Not in Labor Force | oxchjgsf | Non-MSA | 0.0 | 0.0 | NaN | NaN | 0 | 0 |
| 1 | 1 | 3.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | ... | Rent | Employed | bhuqouqj | MSA, Not Principle City | 0.0 | 0.0 | pxcmvdjn | xgwztkwe | 0 | 1 |
| 2 | 2 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Own | Employed | qufhixun | MSA, Not Principle City | 2.0 | 0.0 | rucpziij | xtkaffoo | 0 | 0 |
| 3 | 3 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | Rent | Not in Labor Force | lrircsnp | MSA, Principle City | 0.0 | 0.0 | NaN | NaN | 0 | 1 |
| 4 | 4 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | Own | Employed | qufhixun | MSA, Not Principle City | 1.0 | 0.0 | wxleyezf | emcorrxb | 0 | 0 |
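
The merge on `respondent_id` assumes each respondent appears exactly once in both files. A hedged way to check that assumption is the `validate` argument of `pd.merge`, shown here on two tiny made-up frames instead of the real CSVs.

```python
import pandas as pd

# Made-up stand-ins for the features and labels files.
features = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "h1n1_concern": [1.0, 3.0, 1.0],
})
labels = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "h1n1_vaccine": [0, 0, 1],
})

# validate="one_to_one" raises pandas.errors.MergeError if a key repeats
# on either side, instead of silently duplicating rows.
merged = pd.merge(
    features, labels, on="respondent_id", how="inner", validate="one_to_one"
)

# An inner join that keeps every row means no respondent lacked a label.
print(len(merged) == len(features))
```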


In [3]:
df.isna().sum().sort_values(ascending=False)    

employment_occupation          13470
employment_industry            13330
health_insurance               12274
income_poverty                  4423
doctor_recc_h1n1                2160
doctor_recc_seasonal            2160
rent_or_own                     2042
employment_status               1463
marital_status                  1408
education                       1407
chronic_med_condition            971
child_under_6_months             820
health_worker                    804
opinion_seas_sick_from_vacc      537
opinion_seas_risk                514
opinion_seas_vacc_effective      462
opinion_h1n1_sick_from_vacc      395
opinion_h1n1_vacc_effective      391
opinion_h1n1_risk                388
household_adults                 249
household_children               249
behavioral_avoidance             208
behavioral_touch_face            128
h1n1_knowledge                   116
h1n1_concern                      92
behavioral_large_gatherings       87
behavioral_outside_home           82
b

### Dropping Columns
 - employment_occupation, employment_industry, and health_insurance are each missing for roughly 46-50% of the 26,707 respondents, so we will drop them.
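
The share of missing values per column can be computed directly rather than read off the raw counts. The sketch below uses a small made-up frame, and the 0.45 cutoff is a hypothetical threshold chosen for illustration, not a rule taken from this analysis.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names follow the survey data.
toy = pd.DataFrame({
    "employment_occupation": [np.nan, "a", np.nan, np.nan],
    "income_poverty": ["x", np.nan, "y", "z"],
    "h1n1_concern": [1.0, 2.0, 3.0, 1.0],
})

# isna().mean() gives the fraction of missing values in each column.
missing_share = toy.isna().mean().sort_values(ascending=False)

threshold = 0.45  # hypothetical cutoff
to_drop = missing_share[missing_share > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)

print(missing_share)
print(to_drop)
```

Working with fractions makes the drop decision independent of the number of rows, so the same cutoff can be reused on other files.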

In [4]:
# dropping columns with a high share of missing values
df = df.drop(columns=["employment_occupation", "employment_industry", "health_insurance"])
df.isna().sum().sort_values(ascending=False)    

income_poverty                 4423
doctor_recc_h1n1               2160
doctor_recc_seasonal           2160
rent_or_own                    2042
employment_status              1463
marital_status                 1408
education                      1407
chronic_med_condition           971
child_under_6_months            820
health_worker                   804
opinion_seas_sick_from_vacc     537
opinion_seas_risk               514
opinion_seas_vacc_effective     462
opinion_h1n1_sick_from_vacc     395
opinion_h1n1_vacc_effective     391
opinion_h1n1_risk               388
household_adults                249
household_children              249
behavioral_avoidance            208
behavioral_touch_face           128
h1n1_knowledge                  116
h1n1_concern                     92
behavioral_large_gatherings      87
behavioral_outside_home          82
behavioral_antiviral_meds        71
behavioral_wash_hands            42
behavioral_face_mask             19
seasonal_vaccine            

## Train-Test Split
Since our target is h1n1_vaccine, we will define the X and y variables and perform a train-test split before doing any more data cleaning, to prevent data leakage.
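
A minimal sketch of that split, on a made-up frame rather than the survey data. The 25% test size, the `random_state`, stratifying on the target, and leaving `respondent_id` out of X are all assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows; column names follow the survey data.
toy = pd.DataFrame({
    "respondent_id": range(20),
    "h1n1_concern": [1.0, 2.0, 3.0, 0.0] * 5,
    "h1n1_vaccine": [0, 0, 0, 1] * 5,
    "seasonal_vaccine": [0, 1] * 10,
})

# X holds the predictors only: the identifier and both label columns are left out.
X = toy.drop(columns=["respondent_id", "h1n1_vaccine", "seasonal_vaccine"])
y = toy["h1n1_vaccine"]

# stratify=y keeps the vaccinated share similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```

Any statistic used for cleaning (modes, medians, scaling parameters) would then be computed from `X_train` only and applied unchanged to `X_test`.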

In [None]:
# dropping the seasonal_vaccine column since we're focused on H1N1
df = df.drop(columns=["seasonal_vaccine"])

In [2]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 36 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26615 non-null  float64
 2   h1n1_knowledge               26591 non-null  float64
 3   behavioral_antiviral_meds    26636 non-null  float64
 4   behavioral_avoidance         26499 non-null  float64
 5   behavioral_face_mask         26688 non-null  float64
 6   behavioral_wash_hands        26665 non-null  float64
 7   behavioral_large_gatherings  26620 non-null  float64
 8   behavioral_outside_home      26625 non-null  float64
 9   behavioral_touch_face        26579 non-null  float64
 10  doctor_recc_h1n1             24547 non-null  float64
 11  doctor_recc_seasonal         24547 non-null  float64
 12  chronic_med_condition        25736 non-null  float64
 13  child_under_6_mo

## Data Cleaning


- Fill missing values in categorical columns with the most frequently occurring value.
- Fill missing values in numerical columns with the median.

In [7]:
df.isna().sum().sort_values(ascending=False)

income_poverty                 4423
doctor_recc_h1n1               2160
doctor_recc_seasonal           2160
rent_or_own                    2042
employment_status              1463
marital_status                 1408
education                      1407
chronic_med_condition           971
child_under_6_months            820
health_worker                   804
opinion_seas_sick_from_vacc     537
opinion_seas_risk               514
opinion_seas_vacc_effective     462
opinion_h1n1_sick_from_vacc     395
opinion_h1n1_vacc_effective     391
opinion_h1n1_risk               388
household_children              249
household_adults                249
behavioral_avoidance            208
behavioral_touch_face           128
h1n1_knowledge                  116
h1n1_concern                     92
behavioral_large_gatherings      87
behavioral_outside_home          82
behavioral_antiviral_meds        71
behavioral_wash_hands            42
behavioral_face_mask             19
age_group                   

In [10]:
# filling in missing categorical values with the most frequent value
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')

categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
df[categorical_cols] = imputer.fit_transform(df[categorical_cols])



In [None]:
# filling in missing numerical values with the median
num_imputer = SimpleImputer(strategy='median')
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
df[numerical_cols] = num_imputer.fit_transform(df[numerical_cols])
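
The two imputation cells above fit on the full `df`. Since the plan is to split before cleaning, here is a sketch of the same two strategies fitted on the training rows only and then applied to the test rows. `X_train` and `X_test` are made-up stand-ins, not the notebook's actual variables.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up stand-ins for the split features.
X_train = pd.DataFrame({
    "education": ["College Graduate", np.nan, "12 Years", "College Graduate"],
    "h1n1_concern": [1.0, np.nan, 3.0, 2.0],
}).astype({"education": object})
X_test = pd.DataFrame({
    "education": [np.nan, "12 Years"],
    "h1n1_concern": [np.nan, 0.0],
}).astype({"education": object})

cat_cols = X_train.select_dtypes(exclude="number").columns.tolist()
num_cols = X_train.select_dtypes(include="number").columns.tolist()

# Learn the fill values from the training rows only.
cat_imputer = SimpleImputer(strategy="most_frequent").fit(X_train[cat_cols])
num_imputer = SimpleImputer(strategy="median").fit(X_train[num_cols])

# Apply the same learned values to both parts.
X_train_imp, X_test_imp = X_train.copy(), X_test.copy()
for frame in (X_train_imp, X_test_imp):
    frame[cat_cols] = cat_imputer.transform(frame[cat_cols])
    frame[num_cols] = num_imputer.transform(frame[num_cols])

print(X_test_imp)
```

The missing test values are filled with the training mode and training median, so nothing about the test rows influences the fill values.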


In [13]:
df.isna().sum().sort_values(ascending=False)

household_children             0
opinion_h1n1_vacc_effective    0
h1n1_concern                   0
h1n1_knowledge                 0
behavioral_antiviral_meds      0
behavioral_avoidance           0
behavioral_face_mask           0
behavioral_wash_hands          0
behavioral_large_gatherings    0
behavioral_outside_home        0
behavioral_touch_face          0
doctor_recc_h1n1               0
doctor_recc_seasonal           0
chronic_med_condition          0
child_under_6_months           0
health_worker                  0
opinion_h1n1_risk              0
household_adults               0
opinion_h1n1_sick_from_vacc    0
opinion_seas_vacc_effective    0
opinion_seas_risk              0
opinion_seas_sick_from_vacc    0
age_group                      0
education                      0
race                           0
sex                            0
income_poverty                 0
marital_status                 0
rent_or_own                    0
employment_status              0
hhs_geo_re

In [None]:
df['h']

## Data Preparation
 -