# DataHack, IIT Guwahati

## A Data Analytics Hackathon: Vaccine Uptake Prediction Model
This notebook outlines the process of predicting vaccine uptake using logistic regression. The dataset contains demographic, behavioral, and health-related survey responses for each individual, and the models estimate each respondent's probability of receiving the xyz vaccine and the seasonal vaccine.


In [99]:
import pandas as pd
import numpy as np

## Loading the data

In [103]:
# Training data: features and labels
train_data = pd.read_csv('/Users/barupatisaivarun/Downloads/dataset and all/training_set_features.csv')
train_labels = pd.read_csv('/Users/barupatisaivarun/Downloads/dataset and all/training_set_labels.csv')
# Test data: features only
test_data = pd.read_csv('/Users/barupatisaivarun/Downloads/dataset and all/test_set_features.csv')

# Print column names to ensure 'respondent_id' is present
print("Train Data Columns:", train_data.columns.tolist())
print("Train Labels Columns:", train_labels.columns.tolist())
print("Test Data Columns:", test_data.columns.tolist())

Train Data Columns: ['respondent_id', 'xyz_concern', 'xyz_knowledge', 'behavioral_antiviral_meds', 'behavioral_avoidance', 'behavioral_face_mask', 'behavioral_wash_hands', 'behavioral_large_gatherings', 'behavioral_outside_home', 'behavioral_touch_face', 'doctor_recc_xyz', 'doctor_recc_seasonal', 'chronic_med_condition', 'child_under_6_months', 'health_worker', 'health_insurance', 'opinion_xyz_vacc_effective', 'opinion_xyz_risk', 'opinion_xyz_sick_from_vacc', 'opinion_seas_vacc_effective', 'opinion_seas_risk', 'opinion_seas_sick_from_vacc', 'age_group', 'education', 'race', 'sex', 'income_poverty', 'marital_status', 'rent_or_own', 'employment_status', 'hhs_geo_region', 'census_msa', 'household_adults', 'household_children', 'employment_industry', 'employment_occupation']
Train Labels Columns: ['respondent_id', 'xyz_vaccine', 'seasonal_vaccine']
Test Data Columns: ['respondent_id', 'xyz_concern', 'xyz_knowledge', 'behavioral_antiviral_meds', 'behavioral_avoidance', 'behavioral_face_mask

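## Data preprocessing
This section casts 'respondent_id' to an integer type, previews the training data and labels, and counts the missing values in each column. The missing values themselves are filled later, in the encoding step.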

In [107]:
# Convert respondent_id to int (in case it was read as float) so ids compare equal across tables
train_data['respondent_id'] = train_data['respondent_id'].astype(int)
train_labels['respondent_id'] = train_labels['respondent_id'].astype(int)
test_data['respondent_id'] = test_data['respondent_id'].astype(int)

# Display the first few rows of the training dataset and training labels to understand their structure
print(train_data.head())
print(train_labels.head())

# Check for missing values in each column
print(train_data.isnull().sum())
print(train_labels.isnull().sum())


   respondent_id  xyz_concern  xyz_knowledge  behavioral_antiviral_meds  \
0              0          1.0            0.0                        0.0   
1              1          3.0            2.0                        0.0   
2              2          1.0            1.0                        0.0   
3              3          1.0            1.0                        0.0   
4              4          2.0            1.0                        0.0   

   behavioral_avoidance  behavioral_face_mask  behavioral_wash_hands  \
0                   0.0                   0.0                    0.0   
1                   1.0                   0.0                    1.0   
2                   1.0                   0.0                    0.0   
3                   1.0                   0.0                    1.0   
4                   1.0                   0.0                    1.0   

   behavioral_large_gatherings  behavioral_outside_home  \
0                          0.0                      1.0  

## Merging the data
Combining feature set and label set to a single dataframe

In [109]:
# Merge training data with labels on 'respondent_id'
train_merged = pd.merge(train_data, train_labels, on='respondent_id', how='inner')
print("Columns after merge:", train_merged.columns.tolist())


Columns after merge: ['respondent_id', 'xyz_concern', 'xyz_knowledge', 'behavioral_antiviral_meds', 'behavioral_avoidance', 'behavioral_face_mask', 'behavioral_wash_hands', 'behavioral_large_gatherings', 'behavioral_outside_home', 'behavioral_touch_face', 'doctor_recc_xyz', 'doctor_recc_seasonal', 'chronic_med_condition', 'child_under_6_months', 'health_worker', 'health_insurance', 'opinion_xyz_vacc_effective', 'opinion_xyz_risk', 'opinion_xyz_sick_from_vacc', 'opinion_seas_vacc_effective', 'opinion_seas_risk', 'opinion_seas_sick_from_vacc', 'age_group', 'education', 'race', 'sex', 'income_poverty', 'marital_status', 'rent_or_own', 'employment_status', 'hhs_geo_region', 'census_msa', 'household_adults', 'household_children', 'employment_industry', 'employment_occupation', 'xyz_vaccine', 'seasonal_vaccine']
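
The inner merge above silently drops any respondent that is missing from one of the two tables. A minimal sketch of how that could be checked, assuming small made-up frames (`features` and `labels` are hypothetical stand-ins, not the hackathon data) and using the `validate` and `indicator` options of `pd.merge`:

```python
import pandas as pd

# Toy stand-ins for the feature and label tables (hypothetical values).
features = pd.DataFrame({"respondent_id": [0, 1, 2], "xyz_concern": [1.0, 3.0, 1.0]})
labels = pd.DataFrame({"respondent_id": [0, 1, 3], "xyz_vaccine": [0, 1, 0]})

# validate="one_to_one" raises MergeError if an id is duplicated on either side;
# indicator=True adds a "_merge" column saying which table each row came from.
checked = pd.merge(
    features, labels, on="respondent_id", how="outer",
    validate="one_to_one", indicator=True,
)

# Rows not present in both tables are the ones an inner merge would drop.
unmatched = checked.loc[checked["_merge"] != "both", "respondent_id"].tolist()
print("Unmatched respondent ids:", unmatched)
```

If the list of unmatched ids is empty on the real data, the inner merge keeps every labelled respondent.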


In [49]:
from sklearn.preprocessing import StandardScaler

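## Encoding categorical variables
This step first fills missing numeric values with the training medians. It then converts the categorical variables into numeric form with one-hot encoding and aligns the training and test columns so both frames share the same features.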

In [111]:
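# Fill missing numeric values with the training-set medians, computed once and reused for the test set.
# Missing categorical values are left as-is; get_dummies encodes them as all zeros.
train_medians = train_merged.median(numeric_only=True)
train_merged.fillna(train_medians, inplace=True)
test_data.fillna(train_medians, inplace=True)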


# Encode categorical variables excluding 'respondent_id'
categorical_vars = train_merged.select_dtypes(include=['object']).columns.tolist()
if 'respondent_id' in categorical_vars:
    categorical_vars.remove('respondent_id')
train_merged = pd.get_dummies(train_merged, columns=categorical_vars, drop_first=True)
test_data = pd.get_dummies(test_data, columns=categorical_vars, drop_first=True)

# Align features
train_merged, test_data = train_merged.align(test_data, join='outer', axis=1, fill_value=0)
test_data = test_data.drop(columns=['xyz_vaccine', 'seasonal_vaccine'], errors='ignore')

# Fail early if any object-typed (non-numeric) column survived the encoding
if not train_merged.select_dtypes(include=['object']).columns.empty:
    raise ValueError("Object-typed columns remain in the training features after encoding.")

# Print the aligned test columns to confirm 'respondent_id' is still present
print("Test Data Columns after alignment:", test_data.columns.tolist())


Test Data Columns after alignment: ['age_group_35 - 44 Years', 'age_group_45 - 54 Years', 'age_group_55 - 64 Years', 'age_group_65+ Years', 'behavioral_antiviral_meds', 'behavioral_avoidance', 'behavioral_face_mask', 'behavioral_large_gatherings', 'behavioral_outside_home', 'behavioral_touch_face', 'behavioral_wash_hands', 'census_msa_MSA, Principle City', 'census_msa_Non-MSA', 'child_under_6_months', 'chronic_med_condition', 'doctor_recc_seasonal', 'doctor_recc_xyz', 'education_< 12 Years', 'education_College Graduate', 'education_Some College', 'employment_industry_atmlpfrs', 'employment_industry_cfqqtusy', 'employment_industry_dotnnunm', 'employment_industry_fcxhlnwr', 'employment_industry_haxffmxo', 'employment_industry_ldnlellj', 'employment_industry_mcubkhph', 'employment_industry_mfikgejo', 'employment_industry_msuufmds', 'employment_industry_nduyfdeo', 'employment_industry_phxvnwax', 'employment_industry_pxcmvdjn', 'employment_industry_qnlwzans', 'employment_industry_rucpziij',
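
The fill, encode, and align steps can be traced on a toy example. The sketch below uses made-up frames (`train` and `test` are hypothetical, with one category level that appears only in training) and swaps the notebook's `align` call for `reindex`, which is one alternative way to give the test frame exactly the training columns in the same order:

```python
import pandas as pd

# Hypothetical toy frames: 'sex' has a missing value, and the race level
# "Other" appears only in the training frame.
train = pd.DataFrame({
    "xyz_concern": [1.0, None, 3.0],
    "sex": ["Female", None, "Male"],
    "race": ["White", "Black", "Other"],
})
test = pd.DataFrame({
    "xyz_concern": [2.0, None],
    "sex": ["Male", "Female"],
    "race": ["White", "Black"],
})

# Numeric gaps are filled with the training median in both frames.
medians = train.median(numeric_only=True)
train = train.fillna(medians)
test = test.fillna(medians)

# One-hot encode; a missing category becomes zeros in all of its dummy columns.
cat_cols = ["sex", "race"]
train_enc = pd.get_dummies(train, columns=cat_cols, drop_first=True)
test_enc = pd.get_dummies(test, columns=cat_cols, drop_first=True)

# Give the test frame the training columns, in the same order;
# dummy columns never seen in the test data are filled with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc)
```

Keeping the column order identical matters because the fitted model matches features to coefficients by position or name, depending on the input type.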

In [59]:
from sklearn.linear_model import LogisticRegression

## Model training
We train one logistic regression model per target: one for the xyz vaccine and one for the seasonal vaccine.

In [113]:

# Ensure the labels are properly named and present
if 'xyz_vaccine' not in train_merged.columns or 'seasonal_vaccine' not in train_merged.columns:
    raise ValueError("One or more target columns are missing from the training data.")

# Prepare the features and targets for training
X_train = train_merged.drop(['respondent_id', 'xyz_vaccine', 'seasonal_vaccine'], axis=1, errors='ignore')
y_train_xyz = train_merged['xyz_vaccine']
y_train_seasonal = train_merged['seasonal_vaccine']

# Initialize and train logistic regression models
model_xyz = LogisticRegression(max_iter=1000)
model_xyz.fit(X_train, y_train_xyz)

model_seasonal = LogisticRegression(max_iter=1000)
model_seasonal.fit(X_train, y_train_seasonal)
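
For intuition about what the fitted models produce, a binary logistic regression passes a linear score through the sigmoid function to get a probability. The sketch below checks this on synthetic data (the arrays `X` and `y` are made up for illustration and are not the survey features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small synthetic binary problem (made-up data, not the survey).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Linear score w.x + b, squashed into (0, 1) by the logistic (sigmoid) function.
scores = X @ model.coef_.ravel() + model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-scores))

# predict_proba returns [P(class 0), P(class 1)] per row; column 1 is the positive class.
sklearn_proba = model.predict_proba(X)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```

This is why the prediction step later takes column 1 of `predict_proba`: it is the estimated probability of the positive class.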


In [65]:
from sklearn.metrics import roc_auc_score

## Model evaluation and predictions

In [115]:
# Prepare test features
X_test = test_data.drop('respondent_id', axis=1, errors='ignore')

# Predict probabilities for the test data
final_probabilities_xyz = model_xyz.predict_proba(X_test)[:, 1]
final_probabilities_seasonal = model_seasonal.predict_proba(X_test)[:, 1]

# Prepare the submission DataFrame
submission = pd.DataFrame({
    'respondent_id': test_data['respondent_id'],
    'xyz_vaccine_label': final_probabilities_xyz,
    'seasonal_vaccine_label': final_probabilities_seasonal
})
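
The test set has no labels, so the models cannot be scored on it directly. One common option, sketched here under the assumption that part of the labelled training data is held out for validation, is to compute ROC AUC with the `roc_auc_score` function imported above. The arrays below are synthetic stand-ins for `X_train` and `y_train_xyz`, and the 80/20 split is an illustrative choice, not something the notebook specifies:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training features and one target (made-up data).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

# Hold out 20% of the labelled rows; stratify keeps the class ratio in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_fit, y_fit)

# ROC AUC is computed from predicted probabilities, not hard 0/1 labels.
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"Validation ROC AUC: {val_auc:.3f}")
```

The same pattern could be repeated for the seasonal target, giving one validation score per model.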


## Vaccine predictions


In [117]:
# Save the submission file
submission.to_csv('vaccine_predictions.csv', index=False)
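
Before uploading, it can help to sanity-check the submission frame. The helper below, `find_submission_problems`, is a hypothetical name introduced for this sketch; it assumes the column names used in this notebook and is demonstrated on made-up values:

```python
import pandas as pd

def find_submission_problems(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means none were detected."""
    prob_cols = ["xyz_vaccine_label", "seasonal_vaccine_label"]
    problems = []
    if not df["respondent_id"].is_unique:
        problems.append("duplicate respondent_id values")
    if df[prob_cols].isnull().any().any():
        problems.append("missing probabilities")
    # Comparisons with NaN are False, so missing values are reported only once above.
    if ((df[prob_cols] < 0) | (df[prob_cols] > 1)).any().any():
        problems.append("probabilities outside [0, 1]")
    return problems

# Made-up miniature submissions: one valid, one with an out-of-range value.
good = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "xyz_vaccine_label": [0.1, 0.8, 0.35],
    "seasonal_vaccine_label": [0.6, 0.2, 0.9],
})
bad = good.assign(xyz_vaccine_label=[0.1, 1.4, 0.35])

print(find_submission_problems(good))
print(find_submission_problems(bad))
```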


* The final vaccine uptake predictions have been saved in the file "vaccine_predictions.csv". For each individual in the test dataset, the file contains two probability estimates: one for receiving the xyz vaccine and one for receiving the seasonal vaccine.

* GitHub link: https://github.com/saivarun2108/DataHack