# Predicting H1N1 Vaccination

![img](./images/cdc.jpg)

## Overview

Vaccination rates in the US need to increase. High vaccination rates add a layer of protection against a disease by slowing its spread, with the hope of achieving herd immunity, and vaccinated people are less likely to become severely sick or die from the disease. As such, the CDC wants to create a model that can predict whether someone has gotten a vaccine. I will present a model that specifically predicts whether or not someone got the H1N1 vaccine during the 2009-2010 H1N1 pandemic, as survey data is available for this vaccine. The model performed with an accuracy of 84% on unseen data and an F1 score of about 0.53. Using this model as the basis, a new model can then be created to locate the regions where many unvaccinated people live. Strategies to get these individuals vaccinated, such as free vaccine clinics, can then be discussed after that landscape has been examined.

## Problem

The CDC wants to create a model that can accurately predict who has been vaccinated and who has not. Its predictions would then feed into another model that locates areas with low vaccination rates.

## Data

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, \
ExtraTreesClassifier, VotingClassifier, StackingRegressor
from sklearn.metrics import plot_confusion_matrix, recall_score,\
    accuracy_score, precision_score, f1_score, roc_auc_score, plot_roc_curve
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import RFE

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImPipeline

from model import *
from get_features import *
import pickle
import warnings

The `model` module was adapted from Flatiron's Workflow with Pipelines lecture, with adjustments and one additional method added in `model.py`. The `get_features` module was taken from Haupt J.'s GitHub.

In [None]:
df = pd.read_csv('./data/training_set_features.csv')
df2 = pd.read_csv('./data/training_set_labels.csv')

A test feature set was provided as well, but unfortunately the labels file that contains the target is withheld for the DrivenData competition. As such, for the purposes of model testing, a train-test split will be performed on the training data provided.

## Data Inspection

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
df

In [None]:
df2

In [None]:
df.info()

In [None]:
df.education.value_counts()

In [None]:
df.employment_status.value_counts()

In [None]:
df.employment_industry.value_counts()

In [None]:
df.employment_occupation.value_counts()

The `employment_status` counts help explain why the `employment_industry` and `employment_occupation` columns have so many missing values: respondents who are not working have no industry or occupation to report.

In [None]:
df.health_insurance.value_counts()

In [None]:
df.race.value_counts()

We do not want to introduce racial bias into our model, so we will drop this feature.

In [None]:
df.child_under_6_months.value_counts()

In [None]:
df.hhs_geo_region.value_counts()

In [None]:
df.isna().sum().sum()
#too many nulls in general to drop all of them

In [None]:
df2.isna().sum().sum()

## Data Cleaning before Train-test split

In [None]:
df3 = pd.concat([df,df2], axis = 1)

In [None]:
df3

In [None]:
df3.employment_status.value_counts()

In [None]:
df3.employment_status.isna().sum()

In [None]:
#unemployed + not in labor force
10231 + 1453 

In [None]:
df3.employment_industry.isna().sum()

There are more nulls in `employment_industry` than the total number of people not actively working.

In [None]:
# how many nulls should remain once employment industry is set to "not applicable" for respondents who are not working
13330 - 11684 

It makes sense that those who are not active in the workforce would not have an employment occupation or industry. As such, we will set the occupation and industry to `not_applicable` wherever `employment_status` is `Not in Labor Force` or `Unemployed`.

In [None]:
# replace some NaNs based on employment status: respondents who are not working
# would not have an industry or occupation
df3.loc[df3['employment_status'] == "Not in Labor Force", 'employment_industry'] = "not_applicable"
df3.loc[df3['employment_status'] == "Not in Labor Force", 'employment_occupation'] = "not_applicable"

In [None]:
df3.employment_industry.isna().sum()

In [None]:
df3.employment_occupation.isna().sum()

In [None]:
df3.loc[df3['employment_status'] == "Unemployed", 'employment_industry'] = "not_applicable"
df3.loc[df3['employment_status'] == "Unemployed", 'employment_occupation'] = "not_applicable"

In [None]:
df3.employment_industry.isna().sum()

In [None]:
df3.employment_occupation.value_counts()

In [None]:
df3.describe()

We see that the numeric columns hold either binary or multi-level values.

In [None]:
df3.h1n1_vaccine.value_counts()

In [None]:
df3.seasonal_vaccine.value_counts()

In [None]:
X = df3.drop(['respondent_id','race', 'h1n1_vaccine', 'seasonal_vaccine'],
            axis = 1)
y = df3['h1n1_vaccine']

###  Train-test-split 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 1)

In [None]:
X_train

In [None]:
train_df = pd.concat([X_train,y_train], axis =1)

In [None]:
sns.pairplot(train_df, y_vars='h1n1_vaccine')  # the plots look consistent with the data dictionary's column descriptions

##### Preprocessing and Transformation

In [None]:
y_train.value_counts(normalize = True)

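There is a slight class imbalance, which we can address with SMOTE. We will oversample with `sampling_strategy=0.35`, that is, until the minority class has 0.35 times as many rows as the majority class, so that our target classes are closer to an even split.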
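To make the 0.35 setting concrete, the sketch below works through the arithmetic with hypothetical class counts (200 vaccinated, 800 unvaccinated; these are not the dataset's real counts). It assumes a float `sampling_strategy` is the target minority-to-majority ratio after resampling, as described in the imbalanced-learn documentation. The helper name `smote_target_counts` is made up for illustration.

```python
def smote_target_counts(n_minority, n_majority, ratio):
    """Estimate class sizes after oversampling the minority to ratio * majority."""
    # Oversampling never removes rows, so the target is at least the current count
    target_minority = max(n_minority, round(ratio * n_majority))
    n_synthetic = target_minority - n_minority
    minority_share = target_minority / (target_minority + n_majority)
    return n_synthetic, minority_share


# Hypothetical counts, chosen only to illustrate the arithmetic
n_synthetic, minority_share = smote_target_counts(200, 800, 0.35)
print(n_synthetic)               # synthetic minority rows that would be added
print(round(minority_share, 3))  # minority share of the resampled training set
```

With these counts the minority class grows from 20% to roughly 26% of the rows, so a ratio of 0.35 narrows the imbalance without reaching an even split.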

In [None]:
numeric = ['h1n1_concern', 'h1n1_knowledge', 'opinion_h1n1_vacc_effective',
                  'opinion_h1n1_risk', 'opinion_h1n1_sick_from_vacc',
                  'opinion_seas_vacc_effective', 'opinion_seas_risk',
                  'opinion_seas_sick_from_vacc']

cat_cols = ['behavioral_antiviral_meds', 'behavioral_avoidance',
           'behavioral_face_mask','behavioral_wash_hands',
           'behavioral_large_gatherings', 'behavioral_outside_home',
           'behavioral_touch_face', 'doctor_recc_h1n1',
           'doctor_recc_seasonal', 'chronic_med_condition',
           'child_under_6_months', 'health_worker',
           'health_insurance', 'sex', 'income_poverty',
           'marital_status', 'rent_or_own', 'employment_status',
           'hhs_geo_region', 'census_msa', 'household_adults',
           'household_children', 'employment_industry', 'employment_occupation', 'age_group', 'education']

cat_pipe = Pipeline(steps=[('cat_impute', SimpleImputer(strategy='most_frequent')),
                              ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore'))])
scale_pipe = Pipeline(steps=[('scale_impute', SimpleImputer(strategy='most_frequent')),
                              ('scale', StandardScaler())])

In [None]:
ct = ColumnTransformer(transformers=[
    ('cat', cat_pipe, cat_cols),
    ('scale', scale_pipe, numeric)
])

In [None]:
#for our simple model
ct_no_cat = ColumnTransformer(transformers=[
    ('scale', scale_pipe, numeric)
])

### Simple models and their corresponding dummy classifier

In [None]:
dummy_simple = ImPipeline(steps=[
    ('ct', ct_no_cat),
    ('sm', SMOTE(sampling_strategy= 0.35, random_state=1)),
    ('dummy', DummyClassifier(strategy='most_frequent'))
]).fit(X_train, y_train)

In [None]:
dummy_results = ModelWithCV(dummy_simple, 'dummy', X_train, y_train)

In [None]:
dummy_results.print_summary()

Using all our numerical columns as our baseline

In [None]:
X_simple = X_train[numeric]
y_simple = y_train

In [None]:
X_simple

In [None]:
X_simple.columns == numeric

In [None]:
X_simple_pipe = ImPipeline(steps = [
    ('ct', ct_no_cat),
    ('sm', SMOTE(sampling_strategy= 0.35, random_state=1)),
    ('logreg', LogisticRegression(random_state=1))
]).fit(X_simple,y_simple)

In [None]:
simple_log = ModelWithCV(X_simple_pipe, 'logreg', X_simple, y_simple, cv_now = True)

In [None]:
log_score = simple_log.cv_mean

In [None]:
fig, ax = plt.subplots()

ax = simple_log.plot_cv(ax)

In [None]:
#plot_confusion_matrix(X_simple_pipe, X_simple, y_simple);
simple_log.print_summary()

The F1 score, AUC score, and accuracy are all better than the dummy's.
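Because the classes are imbalanced, accuracy alone can flatter a classifier that never predicts the vaccinated class, which is why F1 is tracked alongside it. The sketch below uses hypothetical confusion-matrix counts (not results from this notebook) to show how a most-frequent dummy and a real model can share the same accuracy while differing sharply in F1. The function name is illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or absent
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical fold: 200 vaccinated (positive class), 800 unvaccinated
dummy = classification_metrics(tp=0, fp=0, fn=200, tn=800)      # always predicts "unvaccinated"
model = classification_metrics(tp=100, fp=100, fn=100, tn=700)  # finds half of the positives

print(dummy)
print(model)
```

In this made-up case both classifiers are 80% accurate, but only the second one identifies any vaccinated respondents, and F1 reflects that difference.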

#### simple svc

In [None]:
#using kernel = linear as it needs to be this to get features
svc2_pipe = ImPipeline(steps = [
    ('ct', ct_no_cat),
    ('sm', SMOTE(sampling_strategy= 0.35, random_state=1)), 
    ('svc2', SVC(random_state=1, kernel='linear'))]).fit(X_simple,y_simple)

In [None]:
sv2_results = ModelWithCV(svc2_pipe, 'svc2', X_simple, y_simple, cv_now = True)

In [None]:
sv2_score = sv2_results.cv_mean

In [None]:
fig, ax = plt.subplots()

ax = sv2_results.plot_cv(ax)

In [None]:
sv2_results.print_summary()

This has a better AUC score, but the accuracy and F1 are the same as the dummy's, so a linear-kernel SVC is not a good fit for this data.

#### simple KNN 

In [None]:
knn_p = ImPipeline(steps = [
    ('ct', ct_no_cat),
    ('sm', SMOTE(sampling_strategy= 0.35, random_state=1)),
    ('knn', KNeighborsClassifier())]).fit(X_simple,y_simple)

In [None]:
#knn = KNeighborsClassifier().fit(X_simple_trans, y_simple)

In [None]:
knn_result = ModelWithCV(knn_p, 'knn', X_simple, y_simple, cv_now = True)

In [None]:
knn_score = knn_result.cv_mean

In [None]:
knn_result.print_summary()

Lower accuracy, but better F1 and AUC scores.

#### simple dtree

In [None]:
dtree_pipe = ImPipeline(steps = [
    ('ct', ct_no_cat),
    ('sm', SMOTE(sampling_strategy= 0.35, random_state=1)),
    ('dt', DecisionTreeClassifier(random_state=1))]).fit(X_simple,y_simple)

In [None]:
dtree_result = ModelWithCV(dtree_pipe, 'dt', X_simple, y_simple, cv_now = True)

In [None]:
dtree_score = dtree_result.cv_mean

In [None]:
dtree_result.print_summary()

The decision tree has lower accuracy but a higher AUC and F1 score than the dummy.

##### let's put multiple default models in voting!

In [None]:
voting = VotingClassifier(estimators= [
    ('lr', X_simple_pipe),
    ('knn', knn_p),
    ('dt', dtree_pipe)],
).fit(X_simple,y_simple)

In [None]:
voting_result = ModelWithCV(voting, 'voting', X_simple, y_simple, cv_now = True)

In [None]:
voting_score = voting_result.cv_mean

In [None]:
voting_score

In [None]:
plot_confusion_matrix(voting, X_simple, y_simple);

In [None]:
preds = voting.predict(X_simple)

f1_score(y_simple, preds)
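`VotingClassifier` uses hard voting by default: each fitted estimator predicts a label and the most common label wins. A minimal sketch of that rule follows, with made-up predictions standing in for the three pipelines above. The tie-break toward the smallest label is an assumption of this sketch and cannot trigger with three binary voters.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Majority vote across models; each inner list holds one model's labels."""
    voted = []
    # zip(*...) regroups the predictions so each tuple holds all votes for one sample
    for sample_votes in zip(*predictions_per_model):
        counts = Counter(sample_votes)
        top = max(counts.values())
        # Assumed tie-break: prefer the smallest label among the most voted
        voted.append(min(label for label, n in counts.items() if n == top))
    return voted


# Hypothetical predictions for four respondents
lr_preds = [0, 1, 1, 0]
knn_preds = [0, 1, 0, 0]
dt_preds = [1, 1, 0, 1]

ensemble_preds = hard_vote([lr_preds, knn_preds, dt_preds])
print(ensemble_preds)
```

A single model that disagrees with the other two is outvoted, which is how the ensemble can smooth over the individual weaknesses of the simple models.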

### Addition of more features

We will use all of our features this time, scaling and transforming them, to see whether the models predict any better.

In [None]:
dummy_pipe = ImPipeline(steps=[
    ('ct', ct),
    ('sm', SMOTE(sampling_strategy= 0.35, random_state=1)),
    ('dummy', DummyClassifier(strategy='most_frequent'))
]).fit(X_train, y_train)

In [None]:
dummy_results = ModelWithCV(dummy_pipe, 'dummy', X_train, y_train)

In [None]:
dummy_results.print_summary()

###### logistic regression

In [None]:
#no grid search performed
logreg_pipe = ImPipeline(steps = [
    ('ct', ct),
    ('sm', SMOTE(sampling_strategy= 0.35, random_state=1)),
    ('logreg', LogisticRegression(random_state=1))
]).fit(X_train,y_train)

In [None]:
logreg_result = ModelWithCV(logreg_pipe,'log_reg',X_train,y_train)

In [None]:
logreg_result.print_summary()

When all of our features are included, the accuracy increases from 78-79% to about 83%. It is also better than our dummy in all aspects.

###### knn model

In [None]:
knn_pipe = ImPipeline(steps = [
    ('ct', ct),
    ('sm', SMOTE(sampling_strategy= 0.35, random_state=1)),
    ('knn', KNeighborsClassifier())
]).fit(X_train,y_train)

In [None]:
# knn_pickle = 'knn_pipe.sav'
# pickle.dump(knn_pipe, open(knn_pickle, 'wb'))

In [None]:
# knn_pipe = pickle.load(open('knn_pipe.sav', 'rb'))

In [None]:
knn1_results = ModelWithCV(knn_pipe,'knn',X_train,y_train)

In [None]:
knn1_results.print_summary()

The F1 score and ROC-AUC are both higher than the logistic regression's, but the accuracy is worse than both the logistic regression's and the dummy's.

###### Decision Tree model

In [None]:
dt_pipe = ImPipeline(steps = [
    ('ct', ct),
    ('sm', SMOTE(sampling_strategy= 0.35, random_state=1)),
    ('dt', DecisionTreeClassifier(random_state=1))
]).fit(X_train,y_train)

In [None]:
# dt_pickle = 'dt_pipe.sav'
# pickle.dump(dt_pipe, open(dt_pickle, 'wb'))

In [None]:
# dt_pipe = pickle.load(open('dt_pipe.sav', 'rb'))

In [None]:
dt_results = ModelWithCV(dt_pipe,'dt',X_train,y_train) #0.7581627558662006

In [None]:
dt_results.print_summary()

The cross-validated accuracy is about 75%, while the ROC-AUC, F1 score, and confusion matrix computed on the training data are perfect (scores of 1 and no misclassifications). This indicates that the decision tree overfit our training data.
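The gap between perfect training scores and a lower cross-validated score is the signature of memorization. The toy classifier below is not a decision tree; it is a hypothetical `Memorizer` that simply stores every training row, used here on made-up data to show why a perfect score on rows the model has already seen says little about unseen rows.

```python
from collections import Counter


class Memorizer:
    """Toy classifier that memorizes training rows, much like a fully grown tree."""

    def fit(self, X, y):
        self.lookup = {tuple(row): label for row, label in zip(X, y)}
        # Unseen rows fall back to the most common training label
        self.fallback = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.lookup.get(tuple(row), self.fallback) for row in X]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical data: every row is unique and the labels carry no learnable pattern
X_train = [(i, i % 7) for i in range(20)]
y_train = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0]
X_valid = [(i, i % 7) for i in range(20, 30)]
y_valid = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]

model = Memorizer().fit(X_train, y_train)
train_acc = accuracy(y_train, model.predict(X_train))
valid_acc = accuracy(y_valid, model.predict(X_valid))
print(train_acc, valid_acc)
```

The training accuracy is perfect by construction, while the held-out accuracy is no better than always guessing the majority label. Comparing the two numbers is what exposes the overfit.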

###### SVC model

In [None]:
svm_pipe = ImPipeline(steps = [
    ('ct', ct),
    ('sm', SMOTE(sampling_strategy= 0.35, random_state=1)),
    ('svm', SVC(random_state=1))
]).fit(X_train,y_train)

In [None]:
# svm_pickle = 'svm_pipe.sav'
# pickle.dump(svm_pipe, open(svm_pickle, 'wb'))

In [None]:
# svm_pipe = pickle.load(open('svm_pipe.sav', 'rb'))

In [None]:
#takes a long time to load
cv_svm = ModelWithCV(svm_pipe,'svm',X_train,y_train)

In [None]:
cv_svm.print_summary()

The F1 score is slightly worse than our KNN model's, by about 0.04, while the accuracy is higher, which the confusion matrix confirms visually. The ROC-AUC is the same as the KNN's at 0.9.

##### voting model

In [None]:
# found the parameters for the logistic regression, so we'll input this in our voting classifier
logreg2_pipe = ImPipeline(steps = [
    ('ct', ct),
    ('sm', SMOTE(sampling_strategy= 0.35, random_state=1)),
    ('logreg', LogisticRegression(random_state=1,C=1, max_iter=50, penalty = 'l1', solver='saga' ))
]).fit(X_train,y_train)

In [None]:
voting_complex = VotingClassifier(estimators= [
    ('lr', logreg2_pipe),  # use the tuned logistic regression defined above
    ('knn', knn_pipe),
    ('dt', dt_pipe)
], n_jobs=-1).fit(X_train,y_train)

In [None]:
# voting_pickle = 'voting.sav'
# pickle.dump(voting_complex, open(voting_pickle, 'wb'))

In [None]:
# voting_complex = pickle.load(open('voting.sav', 'rb'))

In [None]:
voting_complex_results = ModelWithCV(voting_complex, 'voting', X_train, y_train) # 0.834648027958063

In [None]:
#voting_complex_results.print_summary()

In [None]:
plot_confusion_matrix(voting_complex, X_train, y_train)

In [None]:
preds = voting_complex.predict(X_train)

f1_score(y_train, preds)

In [None]:
voting_complex_results.cv_mean

##### let's try out bagging instead 

In [None]:
#bagging- no need to pickle as it doesn't take too long to run
bagdt_pipe = ImPipeline(steps=[
    ('ct', ct),
    ('sm', SMOTE(sampling_strategy= 0.35, random_state=1)),
    ('bag', BaggingClassifier(random_state= 1))
]).fit(X_train, y_train)

In [None]:
bagdt_result = ModelWithCV(bagdt_pipe, 'bagging', X_train, y_train)

In [None]:
bagdt_result.print_summary()

Better than the decision tree by itself, but it also appears to be heavily overfit: the F1 score is close to 1 and the ROC-AUC is 1, while the accuracy is 82%.

##### random forest

In [None]:
rf_pipe = ImPipeline(steps=[
    ('ct', ct),
    ('sm', SMOTE(sampling_strategy= 0.35, random_state=1)),
    ('rf', RandomForestClassifier(n_jobs=-1))
]).fit(X_train, y_train)

In [None]:
rf_result = ModelWithCV(rf_pipe, 'rf', X_train, y_train)

In [None]:
rf_result.print_summary()

### Model tuning

We will tune hyperparameters with a grid search on both the bagging and random forest models to reduce their overfitting, and then see how well they perform.

##### gridsearch bagging 

In [None]:
bagdt_pipe = ImPipeline(steps=[
    ('ct', ct),
    ('sm', SMOTE(sampling_strategy= 0.35, random_state=1)),
    ('bag', BaggingClassifier(random_state= 1))
]).fit(X_train, y_train)

In [None]:
bag_params = {
    'bag__n_estimators' : [10,100,1000],
    'bag__max_features' : [5,10,15,20],
}

In [None]:
bag_grid = GridSearchCV(estimator=bagdt_pipe, param_grid=bag_params, n_jobs=-1).fit(X_train,y_train)

In [None]:
# bag_pickle = 'bag_gridsearch.sav'
# pickle.dump(bag_grid, open(bag_pickle, 'wb'))

In [None]:
# bag_grid = pickle.load(open('bag_gridsearch.sav', 'rb'))

In [None]:
bag_grid.best_params_

In [None]:
bag_tune = ModelWithCV(bag_grid.best_estimator_,'bag_tune', X_train, y_train)

In [None]:
bag_tune.print_summary()

Both best parameter values landed on the upper boundary of the grid we set. While it would be ideal to run another grid search with a wider range, we will move on for now, as we still have to grid search our random forest.
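Checking whether a best value sits on the edge of its grid can be automated. The helper below is a hypothetical sketch, not part of `model.py`. It reuses the bagging grid from above and takes the best parameters to be the upper grid values, following the note above rather than a recorded output.

```python
def params_on_grid_edge(param_grid, best_params):
    """Flag numeric parameters whose best value is the min or max of its grid."""
    flagged = {}
    for name, best in best_params.items():
        values = param_grid.get(name, [])
        # Only numeric grids have a meaningful "edge"; skip strings and booleans
        numeric = [v for v in values
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        if len(numeric) < 2 or best not in numeric:
            continue
        if best == max(numeric):
            flagged[name] = "upper"
        elif best == min(numeric):
            flagged[name] = "lower"
    return flagged


bag_params = {
    "bag__n_estimators": [10, 100, 1000],
    "bag__max_features": [5, 10, 15, 20],
}
# Assumed best parameters, inferred from the note that both hit the upper boundary
assumed_best = {"bag__n_estimators": 1000, "bag__max_features": 20}

on_edge = params_on_grid_edge(bag_params, assumed_best)
print(on_edge)
```

Any parameter flagged as `upper` is a candidate for a follow-up search with larger values, since the best setting may lie outside the range that was tried.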

##### grid search random forest

In [None]:
rf_pipe = ImPipeline(steps=[
    ('ct', ct),
    ('sm', SMOTE(sampling_strategy= 0.35, random_state=1)),
    ('rf', RandomForestClassifier(n_jobs=-1))
]).fit(X_train, y_train)

In [None]:
rf_params = {
    'rf__n_estimators' : [10 ,100,200],
    'rf__criterion' : ['gini', 'entropy'],
    'rf__max_depth' : [5,10,20,25],
    'rf__min_samples_split' : [100,500,1000]
}

In [None]:
rf_grid = GridSearchCV(estimator=rf_pipe, param_grid=rf_params, n_jobs=-1,verbose=3).fit(X_train,y_train)

In [None]:
rf_pickle = 'rf_gridsearch.sav'
# use a context manager so the file handle is closed after writing
with open(rf_pickle, 'wb') as f:
    pickle.dump(rf_grid, f)

In [None]:
with open('rf_gridsearch.sav', 'rb') as f:
    rf_grid = pickle.load(f)

In [None]:
rf_grid.best_score_

In [None]:
rf_grid.best_params_

In [None]:
rf_tune = ModelWithCV(rf_grid.best_estimator_,'rf_tune', X_train, y_train)

In [None]:
rf_tune.print_summary()

The accuracy is about the same as our SVC model's at about 83%, but the F1 and ROC-AUC are slightly lower than the SVC's. This puts the model in contention for being the best model. The random forest can give us information on feature importance, while the SVC gives a slightly better F1 of 0.6.

**An SVC grid search was not performed because of its time complexity: with this type of algorithm, the fitting time grows at least quadratically with the number of rows.**
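A rough back-of-the-envelope estimate shows why grid searches get expensive. The sketch below counts the fits for the random forest grid used above, assuming `GridSearchCV`'s default of 5-fold cross-validation, and illustrates quadratic growth with hypothetical row counts. Both helper names are made up, and the quadratic exponent is a lower-bound assumption, not a measured timing.

```python
from math import prod


def n_grid_fits(param_grid, cv=5):
    """Number of model fits in an exhaustive grid search (excluding the final refit)."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates * cv


def relative_fit_cost(n_rows, base_rows, exponent=2):
    """How much slower one fit gets if cost grows as n_rows ** exponent."""
    return (n_rows / base_rows) ** exponent


rf_params = {
    "rf__n_estimators": [10, 100, 200],
    "rf__criterion": ["gini", "entropy"],
    "rf__max_depth": [5, 10, 20, 25],
    "rf__min_samples_split": [100, 500, 1000],
}

total_fits = n_grid_fits(rf_params)            # 72 candidates x 5 folds
slowdown = relative_fit_cost(20_000, 10_000)   # hypothetical row counts
print(total_fits, slowdown)
```

Even a modest grid multiplies into hundreds of fits, and if each fit slows down quadratically as rows are added, the total time quickly becomes impractical for a kernel SVC.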

#### RFE with 60 features

In [None]:
rf_pipe2 = ImPipeline(steps=[
    ('ct', ct),
    ('sm', SMOTE(sampling_strategy= 0.35, random_state=1)),
    ('rfe', RFE(RandomForestClassifier(criterion='gini', max_depth=25, min_samples_split=100, n_estimators = 100),
               n_features_to_select= 60)),
    ('rf', RandomForestClassifier(criterion='gini', max_depth=25, min_samples_split=100, n_estimators = 100))
]).fit(X_train, y_train)

In [None]:
rf_pipe2.score(X_train, y_train)

In [None]:
ModelWithCV(rf_pipe2, 'rf_pipe2', X_train, y_train).print_summary()

In [None]:
features = pd.DataFrame(rf_pipe2.named_steps.rfe.support_.flatten(), index=get_feature_names(ct))

In [None]:
features

In [None]:
features[features[0]== True]

#### RFE with 100 features

In [None]:
rf_pipe3 = ImPipeline(steps=[
    ('ct', ct),
    ('sm', SMOTE(sampling_strategy= 0.35, random_state=1)),
    ('rfe', RFE(RandomForestClassifier(criterion='gini', max_depth=25, min_samples_split=100, n_estimators = 100),
               n_features_to_select= 100)),
    ('rf', RandomForestClassifier(criterion='gini', max_depth=25, min_samples_split=100, n_estimators = 100))
], verbose=True).fit(X_train, y_train)

In [None]:
ModelWithCV(rf_pipe3, 'rf_pipe3', X_train, y_train).print_summary()

#### RFE with 25 features

In [None]:
rf_pipe4 = ImPipeline(steps=[
    ('ct', ct),
    ('sm', SMOTE(sampling_strategy= 0.35, random_state=1)),
    ('rfe', RFE(RandomForestClassifier(criterion='gini', max_depth=25, min_samples_split=100, n_estimators = 100),
               n_features_to_select= 25)),
    ('rf', RandomForestClassifier(criterion='gini', max_depth=25, min_samples_split=100, n_estimators = 100))
], verbose=True).fit(X_train, y_train)

In [None]:
ModelWithCV(rf_pipe4, 'rf_pipe4', X_train, y_train).print_summary()

#### RFE with all features

In [None]:
rf_pipe5 = ImPipeline(steps=[
    ('ct', ct),
    ('sm', SMOTE(sampling_strategy= 0.35, random_state=1)),
    ('rfe', RFE(RandomForestClassifier(criterion='gini', max_depth=25, min_samples_split=100, n_estimators = 100),
               n_features_to_select= 122)),
    ('rf', RandomForestClassifier(criterion='gini', max_depth=25, min_samples_split=100, n_estimators = 100))
], verbose=True).fit(X_train, y_train)

In [None]:
ModelWithCV(rf_pipe5, 'rf_pipe5', X_train, y_train).print_summary()

In [None]:
feature1 = pd.DataFrame(rf_pipe5.named_steps.rf.feature_importances_, index=get_feature_names(ct))

In [None]:
feature1.sort_values(0, ascending=False)[:10]

We map the one-hot encoded columns back to the original features by manually looking at the columns.

In [None]:
cat_cols = ['behavioral_antiviral_meds', 'behavioral_avoidance',
           'behavioral_face_mask','behavioral_wash_hands',
           'behavioral_large_gatherings', 'behavioral_outside_home',
           'behavioral_touch_face', 'doctor_recc_h1n1',
           'doctor_recc_seasonal', 'chronic_med_condition',
           'child_under_6_months', 'health_worker',
           'health_insurance', 'sex', 'income_poverty',
           'marital_status', 'rent_or_own', 'employment_status',
           'hhs_geo_region', 'census_msa', 'household_adults',
           'household_children', 'employment_industry', 'employment_occupation', 'age_group', 'education']

In [None]:
X_train

Feature importance according to our data:

- `behavioral_touch_face` (yes/no)
- opinion of the risk of H1N1
- H1N1 vaccine effectiveness
- seasonal flu vaccine risk
- seasonal flu vaccine effectiveness
- doctor recommendation for the H1N1 vaccine
- child under 6 months
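The manual mapping from one-hot encoded columns back to survey questions could also be done programmatically. The sketch below assumes encoded names follow a `<column>_<category>` pattern, which may not match what `get_feature_names(ct)` actually returns. The importances are invented for illustration and both helper names are hypothetical.

```python
def source_column(encoded_name, columns):
    """Return the original column an encoded feature came from, or None."""
    matches = [c for c in columns
               if encoded_name == c or encoded_name.startswith(c + "_")]
    # Prefer the longest match so 'doctor_recc_h1n1' beats a shorter prefix
    return max(matches, key=len) if matches else None


def importance_by_column(importances, columns):
    """Sum encoded-feature importances per original column, largest first."""
    totals = {}
    for name, value in importances.items():
        key = source_column(name, columns) or name
        totals[key] = totals.get(key, 0.0) + value
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


columns = ["doctor_recc_h1n1", "doctor_recc_seasonal",
           "opinion_h1n1_risk", "health_worker"]
# Invented importances, for illustration only
encoded_importances = {
    "doctor_recc_h1n1_1.0": 0.20,
    "doctor_recc_h1n1_0.0": 0.10,
    "opinion_h1n1_risk": 0.15,
    "health_worker_1.0": 0.05,
}

grouped = importance_by_column(encoded_importances, columns)
print(grouped)
```

Summing the importance of every category that belongs to one survey question gives a per-question ranking, which is easier to read than a list of individual dummy columns.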

### Final Model Evaluation

We will evaluate two models: the SVC is a slightly better predictor but is a black box because of the kernel it uses, while the random forest can be used to gather the feature importances of the model.

In [None]:
rf_pipe2 = ImPipeline(steps=[
    ('ct', ct),
    ('sm', SMOTE(sampling_strategy= 0.35, random_state=1)),
    ('rfe', RFE(RandomForestClassifier(criterion='gini', max_depth=25, min_samples_split=100, n_estimators = 100),
               n_features_to_select= 60)),
    ('rf', RandomForestClassifier(criterion='gini', max_depth=25, min_samples_split=100, n_estimators = 100))
]).fit(X_train, y_train)

In [None]:
final_model = rf_pipe2

In [None]:
ModelWithCV(final_model, 'final_model', X_test, y_test).print_summary()

In [None]:
svm_pipe = ImPipeline(steps = [
    ('ct', ct),
    ('sm', SMOTE(sampling_strategy= 0.35, random_state=1)),
    ('svm', SVC(random_state=1))
]).fit(X_train,y_train)

In [None]:
ModelWithCV(svm_pipe, 'final_svc', X_test, y_test).print_summary()

Both models performed about the same, with the SVC slightly better, as expected based on the difference seen during training. Overall, both performed only slightly worse on unseen data.

## Next Steps

Further tuning of the model is needed to reduce the misclassification of those who did or did not get the H1N1 vaccine. The data could also be refined given access to the original dataset, because the region in which each individual was located is masked. As such, we are unable to do an initial analysis of where the unvaccinated respondents are. We would also want to test the model against the COVID-19 vaccine confidence survey, which shares some of the questions asked in the H1N1 survey, to see how well it predicts whether someone received the COVID-19 vaccine and whether it generalizes to other vaccines.