# Predict Immunization Dropouts

In this notebook, I give a summary of my analysis and thought process. My analysis can be divided into 5 main sections:

1. [Problem Outline](#1)
2. [Exploratory Data Analysis](#2)
3. [Model + Training](#3)
4. [Analysing Results](#4)
5. [Conclusion](#5)

In [1]:
# SETUP

#general imports
import numpy as np
import pandas as pd

#model building imports
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn import model_selection

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

#visualization imports
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

Using TensorFlow backend.


# 1. Problem Outline<a id='1'></a>

- Goal: Maximize the number of patients who complete their full vaccinations.
- Type of problem: Supervised, classification, prediction
- Specific ML task: Classify the patients as (1) 'need intervention' and (0) 'don't need intervention'
- Evaluation Metric: ROC AUC score
- False Negative vs False Positive: FN more costly than FP
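To make the metric and the cost asymmetry concrete, here is a minimal sketch on invented toy data (the labels and scores below are not from this dataset). It computes the ROC AUC and then counts false negatives and false positives at two decision thresholds, assuming scikit-learn's `roc_auc_score` and `confusion_matrix`. The threshold values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy labels and predicted probabilities, invented purely for illustration.
# 1 = 'need intervention', 0 = 'don't need intervention'.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.45, 0.3, 0.6, 0.4, 0.2, 0.1])

# ROC AUC is threshold-free: it measures how often a positive is ranked above a negative.
auc = roc_auc_score(y_true, y_score)


def errors_at(threshold):
    """Return (false negatives, false positives) when predicting 1 for scores >= threshold."""
    y_pred = (y_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return int(fn), int(fp)


fn_default, fp_default = errors_at(0.5)
fn_low, fp_low = errors_at(0.25)

print(f"ROC AUC: {auc:.4f}")
print(f"threshold 0.50 -> FN={fn_default}, FP={fp_default}")
print(f"threshold 0.25 -> FN={fn_low}, FP={fp_low}")
```

Lowering the threshold trades extra false positives for fewer false negatives, which is the direction this problem favors since a missed dropout is the costlier error.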

# 2. Exploratory Data Analysis<a id='2'></a>

Refer to `1-exploratory-data-analysis.ipynb` for the data cleaning and analysis.

First, the data was cleaned by:
- removing duplicate values
- imputing missing data
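The two cleaning steps can be sketched as follows. This is only an illustration on a made-up frame: the actual imputation strategy lives in the EDA notebook and is not shown here, so filling with the most frequent value (the mode) is an assumption, as are the toy rows.

```python
import numpy as np
import pandas as pd

# Made-up records; column names mirror the dataset, values are invented.
raw = pd.DataFrame({
    "pat_id": [1, 1, 2, 3, 4],
    "fac_id": [51.0, 51.0, np.nan, 161.0, 51.0],
    "gender": ["f", "f", "f", None, "m"],
})

# Step 1: drop exact duplicate rows.
cleaned = raw.drop_duplicates().copy()

# Step 2: impute missing values with the most frequent value of each column
# (assumed strategy; the original notebook may use a different one).
for col in ["fac_id", "gender"]:
    most_frequent = cleaned[col].mode().iloc[0]
    cleaned[col] = cleaned[col].fillna(most_frequent)

print(cleaned)
```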

Second, since the model has to predict which class each patient belongs to, I annotated the data by assigning a class to each patient. Patients who would NOT complete their full vaccinations by 6 months of age were labelled (1) 'need intervention'; the remaining patients were labelled (0) 'don't need intervention'.
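A rough sketch of this labelling step is shown below. The dose-level table, its column names (`vaccine`, `age_months`) and the `REQUIRED` dose counts are all hypothetical placeholders, not the real schema or the real immunization schedule; the actual logic is in the EDA notebook.

```python
import pandas as pd

# Hypothetical dose records: one row per dose received (invented data).
doses = pd.DataFrame({
    "pat_id": [1, 1, 1, 2, 2, 3],
    "vaccine": ["opv", "opv", "dtp", "opv", "dtp", "opv"],
    "age_months": [1, 3, 3, 2, 7, 2],
})

# Placeholder requirement, NOT a real schedule: doses needed by 6 months.
REQUIRED = {"opv": 2, "dtp": 1}

# Count the doses each patient received by 6 months of age, per vaccine.
by_6m = doses[doses["age_months"] <= 6]
counts = by_6m.groupby(["pat_id", "vaccine"]).size().unstack(fill_value=0)

# Make sure every patient and every required vaccine is present (missing = 0 doses).
all_patients = sorted(doses["pat_id"].unique())
counts = counts.reindex(index=all_patients, columns=list(REQUIRED), fill_value=0)

# A patient is complete only if every required vaccine count is met.
complete = pd.Series(True, index=counts.index)
for vaccine, needed in REQUIRED.items():
    complete &= counts[vaccine] >= needed

# Y = True means 'need intervention', matching the boolean Y column used later.
labels = (~complete).rename("Y")
print(labels)
```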

Finally, I created a new data table, "training_data", that contains data on patients only up to 4 months of age, together with their class labels, and with categorical values one-hot encoded.

# 3. Model + Training<a id='3'></a>

#### Baseline Model

In [12]:
#load the processed data

data = pd.read_csv("proc_data/training_baseline.csv")
data.head()

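| | pat_id | fac_id | gender | region | district | Y | n-opv | n-dtp |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 51.0 | f | Ghanzi | Ghanzi | False | 3.0 | 2.0 |
| 1 | 2 | 89.0 | f | Chobe | Chobe | True | 1.0 | 0.0 |
| 2 | 3 | 161.0 | m | Central | Tutume | True | 2.0 | 1.0 |
| 3 | 4 | 168.0 | f | Central | Lethlakane | True | 1.0 | 1.0 |
| 4 | 5 | 183.0 | m | Central | Tuli | True | 3.0 | 2.0 |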


In [13]:
# as a baseline score, let's predict the most frequent class for every patient

# how many patients need intervention vs. don't?
data.Y.value_counts()

True     31673
False    17102
Name: Y, dtype: int64

In [14]:
# baseline accuracy - if the model predicted everyone needs intervention
# (this cell uses a True count of 31031, whereas value_counts() above reports 31673)
31031/(31031+17102) * 100

64.46928302827581
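The majority-class baseline can also be computed from the counts directly rather than typing them in by hand, which avoids stale numbers. A minimal sketch, with the class counts copied from the `value_counts()` output above:

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({True: 31673, False: 17102})

# Always predicting the most frequent class gives this accuracy.
majority_class = counts.idxmax()
baseline_accuracy = counts.max() / counts.sum() * 100

print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.2f}%")
```

With these counts the baseline comes out at roughly 64.9%. In the notebook itself, `data.Y.value_counts(normalize=True).max() * 100` would give the same quantity.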

In [15]:
data.describe(include = "all")

| | pat_id | fac_id | gender | region | district | Y | n-opv | n-dtp |
|---|---|---|---|---|---|---|---|---|
| count | 48775.0 | 46329.0 | 47803 | 47174 | 47174 | 48775 | 48775.0 | 48775.0 |
| unique | | | 2 | 15 | 24 | 2 | | |
| top | | | m | North-West | Ngamiland East | True | | |
| freq | | | 23931 | 16184 | 14523 | 31673 | | |
| mean | 25021.350118 | 173.139567 | | | | | 2.635428 | 1.732117 |
| std | 14435.084986 | 99.793493 | | | | | 0.968813 | 0.943496 |
| min | 1.0 | 1.0 | | | | | 0.0 | 0.0 |
| 25% | 12522.5 | 87.0 | | | | | 2.0 | 1.0 |
| 50% | 25036.0 | 173.0 | | | | | 3.0 | 2.0 |
| 75% | 37522.5 | 260.0 | | | | | 3.0 | 2.0 |


In [17]:
# convert categorical variables to one-hot-encoded

# Categorical boolean mask
categorical_feature_mask = data.dtypes==object
# filter categorical columns using mask and turn it into a list
categorical_cols = data.columns[categorical_feature_mask].tolist()

categorical_cols

['gender', 'region', 'district']

In [None]:
categorical_feature_mask

In [None]:
# generate binary values using get_dummies
# (start from `data`, the frame loaded above; `training_data` does not exist before this line)
training_data = pd.get_dummies(data, columns=categorical_cols)
training_data.head()

In [None]:
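# split into train/test set, stratified so both sets keep the same class ratio
RND = 42  # assumed seed: RND is not defined in the cells shown here
X_train, X_test, y_train, y_test = train_test_split(training_data.drop(columns = ['Y', 'pat_id']), 
                                                    training_data['Y'],
                                                    random_state=RND, 
                                                    stratify = training_data['Y'])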

In [None]:
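from keras.regularizers import L1L2

model = Sequential()
# one hidden layer of 392 units with L2 weight regularization;
# the input size is taken from the data (392 in the original run)
model.add(Dense(392, input_dim=X_train.shape[1], activation='relu', kernel_regularizer=L1L2(l1=0.0, l2=0.1)))
# sigmoid output: probability of class 1 ('need intervention')
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test))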

In [None]:
# final model accuracy on test set - 0.8622

In [None]:
# model.evaluate returns [loss, accuracy]; avoid shadowing the built-in eval()
scores = model.evaluate(x=X_test, y=y_test)
print('Accuracy on test set: {:.2f}'.format(scores[1]))

# 4. Analysing Results<a id='4'></a>

# 5. Conclusion<a id='5'></a>