<a href="https://colab.research.google.com/github/ricardo-arl/flushotcomp/blob/master/Flu_Shot_Comp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Colab Notebook for Flu Shot Competition

Source: https://www.drivendata.org/competitions/66/flu-shot-learning/page/210/

In [None]:
%pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

Collecting https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
  Using cached https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
Building wheels for collected packages: pandas-profiling
  Building wheel for pandas-profiling (setup.py) ... done
  Created wheel for pandas-profiling: filename=pandas_profiling-2.9.0rc1-py2.py3-none-any.whl size=258106 sha256=aaaa6fe503e70d14b85977a3ddf77ff568a7362308c52ba24580a5824596482b
  Stored in directory: /tmp/pip-ephem-wheel-cache-n8wpqhyy/wheels/56/c2/dd/8d945b0443c35df7d5f62fa9e9ae105a2d8b286302b92e0109
Successfully built pandas-profiling


In [None]:
#Basic Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#EDA
from pandas_profiling import ProfileReport

#Processing
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline

#Metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score,roc_auc_score
from sklearn.model_selection import cross_val_score

#Modeling
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

In [None]:
data_features = pd.read_csv('/content/drive/My Drive/Flu Shot Comp/training_set_features.csv')

In [None]:
data_labels = pd.read_csv('/content/drive/My Drive/Flu Shot Comp/training_set_labels.csv')

In [None]:
test_features = pd.read_csv('/content/drive/My Drive/Flu Shot Comp/test_set_features.csv')

In [None]:
print(data_features.shape)
print(test_features.shape)

(26707, 36)
(26708, 36)


After importing the data, let's take a look at the DataFrames.

In [None]:
data_features.head(5)

Unnamed: 0,respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,doctor_recc_h1n1,doctor_recc_seasonal,chronic_med_condition,child_under_6_months,health_worker,health_insurance,opinion_h1n1_vacc_effective,opinion_h1n1_risk,opinion_h1n1_sick_from_vacc,opinion_seas_vacc_effective,opinion_seas_risk,opinion_seas_sick_from_vacc,age_group,education,race,sex,income_poverty,marital_status,rent_or_own,employment_status,hhs_geo_region,census_msa,household_adults,household_children,employment_industry,employment_occupation
0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,3.0,1.0,2.0,2.0,1.0,2.0,55 - 64 Years,< 12 Years,White,Female,Below Poverty,Not Married,Own,Not in Labor Force,oxchjgsf,Non-MSA,0.0,0.0,,
1,1,3.0,2.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,5.0,4.0,4.0,4.0,2.0,4.0,35 - 44 Years,12 Years,White,Male,Below Poverty,Not Married,Rent,Employed,bhuqouqj,"MSA, Not Principle City",0.0,0.0,pxcmvdjn,xgwztkwe
2,2,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,0.0,0.0,,3.0,1.0,1.0,4.0,1.0,2.0,18 - 34 Years,College Graduate,White,Male,"<= $75,000, Above Poverty",Not Married,Own,Employed,qufhixun,"MSA, Not Principle City",2.0,0.0,rucpziij,xtkaffoo
3,3,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,,3.0,3.0,5.0,5.0,4.0,1.0,65+ Years,12 Years,White,Female,Below Poverty,Not Married,Rent,Not in Labor Force,lrircsnp,"MSA, Principle City",0.0,0.0,,
4,4,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,,3.0,3.0,2.0,3.0,1.0,4.0,45 - 54 Years,Some College,White,Female,"<= $75,000, Above Poverty",Married,Own,Employed,qufhixun,"MSA, Not Principle City",1.0,0.0,wxleyezf,emcorrxb


Variable definitions, according to the competition page:

For all binary variables: 0 = No; 1 = Yes.

h1n1_concern - Level of concern about the H1N1 flu.
0 = Not at all concerned; 1 = Not very concerned; 2 = Somewhat concerned; 3 = Very concerned.

h1n1_knowledge - Level of knowledge about H1N1 flu.
0 = No knowledge; 1 = A little knowledge; 2 = A lot of knowledge.

behavioral_antiviral_meds - Has taken antiviral medications. (binary)

behavioral_avoidance - Has avoided close contact with others with flu-like symptoms. (binary)

behavioral_face_mask - Has bought a face mask. (binary)

behavioral_wash_hands - Has frequently washed hands or used hand sanitizer. (binary)

behavioral_large_gatherings - Has reduced time at large gatherings. (binary)

behavioral_outside_home - Has reduced contact with people outside of own household. (binary)

behavioral_touch_face - Has avoided touching eyes, nose, or mouth. (binary)

doctor_recc_h1n1 - H1N1 flu vaccine was recommended by doctor. (binary)

doctor_recc_seasonal - Seasonal flu vaccine was recommended by doctor. (binary)

chronic_med_condition - Has any of the following chronic medical conditions: asthma or another lung condition, diabetes, a heart condition, a kidney condition, sickle cell anemia or other anemia, a neurological or neuromuscular condition, a liver condition, or a weakened immune system caused by a chronic illness or by medicines taken for a chronic illness. (binary)

child_under_6_months - Has regular close contact with a child under the age of six months. (binary)

health_worker - Is a healthcare worker. (binary)

health_insurance - Has health insurance. (binary)

opinion_h1n1_vacc_effective - Respondent's opinion about H1N1 vaccine effectiveness.
1 = Not at all effective; 2 = Not very effective; 3 = Don't know; 4 = Somewhat effective; 5 = Very effective.

opinion_h1n1_risk - Respondent's opinion about risk of getting sick with H1N1 flu without vaccine.
1 = Very Low; 2 = Somewhat low; 3 = Don't know; 4 = Somewhat high; 5 = Very high.

opinion_h1n1_sick_from_vacc - Respondent's worry of getting sick from taking H1N1 vaccine.
1 = Not at all worried; 2 = Not very worried; 3 = Don't know; 4 = Somewhat worried; 5 = Very worried.

opinion_seas_vacc_effective - Respondent's opinion about seasonal flu vaccine effectiveness.
1 = Not at all effective; 2 = Not very effective; 3 = Don't know; 4 = Somewhat effective; 5 = Very effective.

opinion_seas_risk - Respondent's opinion about risk of getting sick with seasonal flu without vaccine.
1 = Very Low; 2 = Somewhat low; 3 = Don't know; 4 = Somewhat high; 5 = Very high.

opinion_seas_sick_from_vacc - Respondent's worry of getting sick from taking seasonal flu vaccine.
1 = Not at all worried; 2 = Not very worried; 3 = Don't know; 4 = Somewhat worried; 5 = Very worried.

age_group - Age group of respondent.

education - Self-reported education level.

race - Race of respondent.

sex - Sex of respondent.

income_poverty - Household annual income of respondent with respect to 2008 Census poverty thresholds.

marital_status - Marital status of respondent.

rent_or_own - Housing situation of respondent.

employment_status - Employment status of respondent.

hhs_geo_region - Respondent's residence using a 10-region geographic classification defined by the U.S. Dept. of Health and Human Services. Values are represented as short random character strings.

census_msa - Respondent's residence within metropolitan statistical areas (MSA) as defined by the U.S. Census.

household_adults - Number of other adults in household, top-coded to 3.

household_children - Number of children in household, top-coded to 3.

employment_industry - Type of industry respondent is employed in. Values are represented as short random character strings.

employment_occupation - Type of occupation of respondent. Values are represented as short random character strings.

Usually, if the DataFrame is small, I use the Pandas Profiling report to explore the dataset a bit. But I feel we sometimes become dependent on tools, so I want to spend more time interpreting the data and the relationships between variables.

In [None]:
data_labels

Unnamed: 0,respondent_id,h1n1_vaccine,seasonal_vaccine
0,0,0,0
1,1,0,1
2,2,0,0
3,3,0,1
4,4,0,0
...,...,...,...
26702,26702,0,0
26703,26703,0,0
26704,26704,0,1
26705,26705,0,0


In [None]:
df = data_features.merge(data_labels,how='inner',on='respondent_id')

In [None]:
percent_missing = (df.isnull().sum() / len(df))*100
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})
missing_value_df

Unnamed: 0,percent_missing
respondent_id,0.0
h1n1_concern,0.344479
h1n1_knowledge,0.434343
behavioral_antiviral_meds,0.265848
behavioral_avoidance,0.778822
behavioral_face_mask,0.071142
behavioral_wash_hands,0.157262
behavioral_large_gatherings,0.325757
behavioral_outside_home,0.307036
behavioral_touch_face,0.479275


In [None]:
missing_value_df.shape

(38, 1)

In [None]:
profile = ProfileReport(df, title="Pandas Profiling Report")

In [None]:
profile

Now that we have a general idea of the dataset, we will impute the null values and drop the columns with more than 20% nulls. It might be worth imputing those columns with a "no data" category instead, to keep as much information as possible in columns that could be useful.
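A minimal sketch of both options, on a small made-up DataFrame. The values, the `toy` names, and the `"no data"` label are illustrative assumptions, not taken from the competition data. It finds the columns whose share of nulls exceeds 20%, then either drops them or keeps them with an explicit missing category.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real features (hypothetical values).
toy = pd.DataFrame({
    "age_group": ["18 - 34 Years", "65+ Years", "35 - 44 Years",
                  "45 - 54 Years", "65+ Years", "55 - 64 Years"],
    "employment_industry": [np.nan, "pxcmvdjn", np.nan, np.nan, "wxleyezf", np.nan],
    "health_worker": [0.0, 1.0, 0.0, 0.0, np.nan, 0.0],
})

# Fraction of nulls per column; isnull().mean() equals isnull().sum() / len(toy).
missing_frac = toy.isnull().mean()

# Columns above the 20% threshold.
high_missing = missing_frac[missing_frac > 0.20].index.tolist()

# Option 1: drop the sparse columns.
toy_dropped = toy.drop(columns=high_missing)

# Option 2: keep them and mark nulls with an explicit category.
toy_flagged = toy.copy()
toy_flagged[high_missing] = toy_flagged[high_missing].fillna("no data")
```

With option 2, a one-hot encoder would treat `"no data"` as one more category, so the fact that a value was missing stays available to the model.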

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26707 entries, 0 to 26706
Data columns (total 38 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26615 non-null  float64
 2   h1n1_knowledge               26591 non-null  float64
 3   behavioral_antiviral_meds    26636 non-null  float64
 4   behavioral_avoidance         26499 non-null  float64
 5   behavioral_face_mask         26688 non-null  float64
 6   behavioral_wash_hands        26665 non-null  float64
 7   behavioral_large_gatherings  26620 non-null  float64
 8   behavioral_outside_home      26625 non-null  float64
 9   behavioral_touch_face        26579 non-null  float64
 10  doctor_recc_h1n1             24547 non-null  float64
 11  doctor_recc_seasonal         24547 non-null  float64
 12  chronic_med_condition        25736 non-null  float64
 13  child_under_6_mo

In [None]:
df.columns

Index(['respondent_id', 'h1n1_concern', 'h1n1_knowledge',
       'behavioral_antiviral_meds', 'behavioral_avoidance',
       'behavioral_face_mask', 'behavioral_wash_hands',
       'behavioral_large_gatherings', 'behavioral_outside_home',
       'behavioral_touch_face', 'doctor_recc_h1n1', 'doctor_recc_seasonal',
       'chronic_med_condition', 'child_under_6_months', 'health_worker',
       'health_insurance', 'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk',
       'opinion_h1n1_sick_from_vacc', 'opinion_seas_vacc_effective',
       'opinion_seas_risk', 'opinion_seas_sick_from_vacc', 'age_group',
       'education', 'race', 'sex', 'income_poverty', 'marital_status',
       'rent_or_own', 'employment_status', 'hhs_geo_region', 'census_msa',
       'household_adults', 'household_children', 'employment_industry',
       'employment_occupation', 'h1n1_vaccine', 'seasonal_vaccine'],
      dtype='object')

In [None]:
y1 = df['h1n1_vaccine']
y2 = df['seasonal_vaccine']

In [None]:
df = df.drop(['respondent_id','employment_industry','employment_occupation','health_insurance','h1n1_vaccine','seasonal_vaccine'], axis = 1)

In [None]:
cat_col = df.select_dtypes(include=['object']).columns
bool_col = ['behavioral_antiviral_meds', 'behavioral_avoidance',
       'behavioral_face_mask', 'behavioral_wash_hands',
       'behavioral_large_gatherings', 'behavioral_outside_home',
       'behavioral_touch_face', 'doctor_recc_h1n1', 'doctor_recc_seasonal',
       'chronic_med_condition', 'child_under_6_months', 'health_worker']
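# Note: h1n1_concern and h1n1_knowledge appear in none of these lists, so the
# ColumnTransformer defined later drops them (its default is remainder='drop').
ord_col = ['opinion_h1n1_vacc_effective', 'opinion_h1n1_risk',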
       'opinion_h1n1_sick_from_vacc', 'opinion_seas_vacc_effective',
       'opinion_seas_risk', 'opinion_seas_sick_from_vacc','household_adults',
       'household_children']

In [None]:
cat = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# Note: scikit-learn 1.2 and later call this parameter `sparse_output` instead of `sparse`.
ohe = OneHotEncoder(sparse = False, handle_unknown='ignore')
ord_enc = OrdinalEncoder()  # named ord_enc so the built-in ord() is not shadowed

In [None]:
categorical_transformer = Pipeline(steps=[('imputer', cat),
                                          ('onehot', ohe)])

boolean_transformer = Pipeline(steps=[('imputer', cat),
                                      ('onehot', ohe)])

ordinal_transformer = Pipeline(steps=[('imputer', cat),
                                      ('ordinal', ord_enc)])

preprocessor = ColumnTransformer(transformers=[('categorical', categorical_transformer, cat_col),
                                               ('boolean', boolean_transformer, bool_col),
                                               ('ordinal', ordinal_transformer, ord_col)])
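By default, `ColumnTransformer` uses `remainder='drop'`, so any column that is not assigned to a transformer is silently left out of the model input. In this notebook, `h1n1_concern` and `h1n1_knowledge` are in none of the three column lists. The sketch below shows one way to check for uncovered columns before fitting. It uses a small made-up frame, and all `toy_` names and the `make_encoder` helper are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame with one column deliberately left out of the lists below.
toy = pd.DataFrame({
    "sex": ["Female", "Male", np.nan, "Male"],
    "health_worker": [0.0, 1.0, np.nan, 0.0],
    "h1n1_concern": [1.0, 3.0, 2.0, np.nan],
})
toy_cat_col = ["sex"]
toy_bool_col = ["health_worker"]


def make_encoder():
    """Impute with the most frequent value, then one-hot encode."""
    return Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])


toy_preprocessor = ColumnTransformer(transformers=[
    ("categorical", make_encoder(), toy_cat_col),
    ("boolean", make_encoder(), toy_bool_col),
])

# Columns that no transformer receives; with remainder='drop' they vanish.
uncovered = [c for c in toy.columns if c not in toy_cat_col + toy_bool_col]

# Only the listed columns contribute to the encoded output.
encoded = toy_preprocessor.fit_transform(toy)
```

Here `uncovered` holds `h1n1_concern`, and `encoded` has four columns: two categories for `sex` and two for `health_worker`. Passing `remainder='passthrough'` would keep uncovered columns, but it would not impute them.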

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df, y1, test_size=0.3)
X_train2, X_test2, y_train2, y_test2 = train_test_split(df, y2, test_size=0.3)
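The split above is unseeded, so the scores further down change from run to run. A hedged sketch of a reproducible, stratified alternative follows. The synthetic data, the seed 42, and the `toy_` names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
toy_X = pd.DataFrame({"feature": rng.normal(size=1000)})
# Imbalanced binary target with roughly 20% positives (made-up).
toy_y = pd.Series((rng.random(1000) < 0.2).astype(int), name="target")

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42, stratify=toy_y
)

# Repeating the call with the same seed reproduces the same split.
X_tr_again, _, _, _ = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42, stratify=toy_y
)
```

Stratifying matters most when a target is imbalanced, because an unlucky random split can otherwise leave the test set with a noticeably different positive rate.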

Modeling Basics

In [None]:
rf = RandomForestClassifier()
lg = lgb.LGBMClassifier()
xg = XGBClassifier()

In [None]:
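# Note: the three pipelines share one preprocessor object, and Pipeline.fit
# refits in place and returns the pipeline itself.
pipe_rf = make_pipeline(preprocessor, rf)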
pipe_lg = make_pipeline(preprocessor, lg)
pipe_xg = make_pipeline(preprocessor, xg)
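One thing to keep in mind when a single pipeline is fitted on two targets: `Pipeline.fit` returns the pipeline itself, so a second `fit` overwrites the first model. The sketch below illustrates this with `sklearn.base.clone`. `LogisticRegression` and the synthetic targets `y_a` and `y_b` are stand-ins chosen for speed, not the models or data used in this notebook.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y_a = (X[:, 0] > 0).astype(int)  # first made-up target
y_b = (X[:, 1] > 0).astype(int)  # second made-up target

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# fit returns the pipeline itself, so both names point at one object,
# and that object now holds the model for y_b only.
model_a = pipe.fit(X, y_a)
model_b = pipe.fit(X, y_b)
same_object = model_a is model_b

# clone returns an unfitted copy with the same parameters,
# which gives one independent fitted model per target.
model_a = clone(pipe).fit(X, y_a)
model_b = clone(pipe).fit(X, y_b)
```

After the cloned fits, `model_a` and `model_b` are separate objects, each with its own fitted preprocessing step.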

Random Forest

In [None]:
rf_model = pipe_rf.fit(X_train,y_train)
rf_pred = rf_model.predict_proba(X_test)
rf_score = roc_auc_score(y_test,rf_pred[:, 1])
rf_score

0.8240508815166582

In [None]:
rf_model2 = pipe_rf.fit(X_train2,y_train2)
rf_pred2 = rf_model2.predict_proba(X_test2)
rf_score2 = roc_auc_score(y_test2,rf_pred2[:, 1])
rf_score2

0.8430265691186964

LightGBM

In [None]:
lg_model = pipe_lg.fit(X_train,y_train)
lg_pred = lg_model.predict_proba(X_test)
lg_score = roc_auc_score(y_test,lg_pred[:, 1])
lg_score

0.83548705815057

In [None]:
lg_model2 = pipe_lg.fit(X_train2,y_train2)
lg_pred2 = lg_model2.predict_proba(X_test2)
lg_score2 = roc_auc_score(y_test2,lg_pred2[:, 1])
lg_score2

0.8561206116208565

XGBoost

In [None]:
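# Fit a clone: Pipeline.fit returns the pipeline itself, so fitting pipe_xg on the
# seasonal target in the next cell would otherwise overwrite this H1N1 model.
from sklearn.base import clone
xg_model = clone(pipe_xg).fit(X_train,y_train)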
xg_pred = xg_model.predict_proba(X_test)
xg_score = roc_auc_score(y_test,xg_pred[:, 1])
xg_score

0.8380513309573644

In [None]:
xg_model2 = pipe_xg.fit(X_train2,y_train2)
xg_pred2 = xg_model2.predict_proba(X_test2)
xg_score2 = roc_auc_score(y_test2,xg_pred2[:, 1])
xg_score2

0.8568123994572826

For both targets, XGBoost with its default parameters scored best, narrowly ahead of LightGBM (ROC AUC 0.838 vs. 0.835 for H1N1 and 0.857 vs. 0.856 for seasonal). Next, we will tune the hyperparameters.
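The tuning step itself is not shown in this notebook. As a rough sketch of how it could look with the already imported `GridSearchCV`, the example below searches a small grid scored by ROC AUC. To stay self-contained it uses synthetic data and a `RandomForestClassifier` in place of the XGBoost pipeline. The grid values and the `toy_` names are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the real features.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

toy_pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))

# make_pipeline names each step after its lowercased class name, and
# parameters of a step are addressed as <step name>__<parameter>.
param_grid = {
    "randomforestclassifier__n_estimators": [25, 50],
    "randomforestclassifier__max_depth": [3, None],
}

# 3-fold cross-validation over the 4 combinations, scored by ROC AUC.
search = GridSearchCV(toy_pipe, param_grid, scoring="roc_auc", cv=3)
search.fit(X_toy, y_toy)

best_params = search.best_params_
best_auc = search.best_score_
```

For the pipelines in this notebook the same pattern would apply with the matching step prefix, for example `xgbclassifier__max_depth` for a pipeline built with `make_pipeline(preprocessor, xg)`.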

Predict

In [None]:
test = test_features.drop(['respondent_id','employment_industry','employment_occupation','health_insurance'], axis = 1)

In [None]:
submission = pd.read_csv('/content/submission_format.csv')

In [None]:
h1n1 = xg_model.predict_proba(test)

In [None]:
h1n1[:, 1]

array([0.12156226, 0.04580515, 0.43874004, ..., 0.13092348, 0.06804745,
       0.5391636 ], dtype=float32)

In [None]:
submission['h1n1_vaccine'] = h1n1[:, 1]

In [None]:
seasonal = xg_model2.predict_proba(test)

In [None]:
seasonal[:, 1]

array([0.23135146, 0.0540988 , 0.7575805 , ..., 0.18539461, 0.35388353,
       0.61423296], dtype=float32)

In [None]:
submission['seasonal_vaccine'] = seasonal[:, 1]

In [None]:
submission

Unnamed: 0,respondent_id,h1n1_vaccine,seasonal_vaccine
0,26707,0.121562,0.231351
1,26708,0.045805,0.054099
2,26709,0.438740,0.757581
3,26710,0.526917,0.819088
4,26711,0.233618,0.515052
...,...,...,...
26703,53410,0.377419,0.551616
26704,53411,0.129752,0.353451
26705,53412,0.130923,0.185395
26706,53413,0.068047,0.353884


In [None]:
submission.to_csv('submission1.csv',index = False)
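Before uploading, a few sanity checks on the submission frame can catch misaligned rows or invalid probabilities. The sketch below runs them on a three-row stand-in. The `toy_` frames and probability values are hypothetical, not the real submission format file or model output.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for submission_format.csv and predict_proba output.
toy_format = pd.DataFrame({
    "respondent_id": [26707, 26708, 26709],
    "h1n1_vaccine": 0.5,
    "seasonal_vaccine": 0.5,
})
toy_h1n1 = np.array([[0.88, 0.12], [0.95, 0.05], [0.56, 0.44]])
toy_seasonal = np.array([[0.77, 0.23], [0.95, 0.05], [0.24, 0.76]])

# Column 1 of predict_proba holds the probability of the positive class.
toy_submission = toy_format.copy()
toy_submission["h1n1_vaccine"] = toy_h1n1[:, 1]
toy_submission["seasonal_vaccine"] = toy_seasonal[:, 1]

label_cols = ["h1n1_vaccine", "seasonal_vaccine"]

# Same number of rows as the format file, no nulls, probabilities in [0, 1].
same_rows = len(toy_submission) == len(toy_format)
no_nulls = bool(toy_submission[label_cols].notnull().all().all())
in_range = bool(toy_submission[label_cols].apply(lambda s: s.between(0, 1)).all().all())
checks_passed = same_rows and no_nulls and in_range
```

Assigning a NumPy array to a column relies on row order alone, so the test features need to be in the same order as the `respondent_id` column of the format file.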