# Predict Immunization Dropouts

This notebook summarizes my analysis and thought process. It is divided into five main sections:

1. [Problem Outline](#1)
2. [Exploratory Data Analysis](#2)
3. [Model + Training](#3)
4. [Analysing Results](#4)
5. [Conclusion](#5)

In [None]:
# SETUP

#general imports
import numpy as np
import pandas as pd

#model building imports
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn import model_selection

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

#visualization imports
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

RND = 2020

# 1. Problem Outline<a id='1'></a>

- Goal: Maximize the number of patients who complete their full course of vaccinations.
- Type of problem: Supervised classification (prediction)
- Specific ML task: Classify each patient as (1) 'need intervention' or (0) 'don't need intervention'
- Evaluation metric: Area under the ROC curve (ROC AUC)
- False negative vs. false positive: A false negative (a patient who needs intervention but is not flagged) is more costly than a false positive (an unnecessary intervention)
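To make the metric and the cost asymmetry concrete, here is a minimal sketch on made-up labels and scores (not project data). It computes the ROC AUC and shows how lowering the decision threshold trades false negatives for false positives. The helper `fn_fp_counts` and the thresholds 0.5 and 0.3 are illustrative assumptions, not choices made in this analysis.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy labels and predicted probabilities (made up for illustration only).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.10, 0.31, 0.20, 0.60, 0.80, 0.40, 0.90, 0.05, 0.55, 0.35])

# ROC AUC is threshold-independent: it measures how well the scores
# rank positive cases above negative ones.
auc = roc_auc_score(y_true, y_prob)

def fn_fp_counts(threshold):
    """Return (false negatives, false positives) at a decision threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return int(fn), int(fp)

# A lower threshold flags more patients: fewer misses, more false alarms.
fn_default, fp_default = fn_fp_counts(0.5)
fn_low, fp_low = fn_fp_counts(0.3)

print(f"ROC AUC: {auc:.3f}")
print(f"threshold 0.5 -> FN={fn_default}, FP={fp_default}")
print(f"threshold 0.3 -> FN={fn_low}, FP={fp_low}")
```

Because false negatives are the costlier error here, a threshold below 0.5 may be worth considering once a model is trained, even though ROC AUC itself does not depend on the threshold.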

# 2. Exploratory Data Analysis<a id='2'></a>

(Refer to exploratory-data-analyis.ipynb for initial data analysis and cleaning)

In [1]:
#load the clean data

#### Available Raw Data:

In [3]:
# patient data

patients_df = pd.read_csv("raw_data/patients_db_v2.csv")
patients_df.head()

| Unnamed: 0.1 | Unnamed: 0 | pat_id | fac_id | dob | gender | long | lat | region | district |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 51.0 | 2019-01-22 | f | | | Ghanzi | Ghanzi |
| 1 | 1 | 2 | 89.0 | 2019-11-12 | f | 24.877556 | -18.370709 | Chobe | Chobe |
| 2 | 2 | 3 | 161.0 | 2019-11-03 | m | 25.249672 | -20.490189 | Central | Tutume |
| 3 | 3 | 4 | 168.0 | 2019-04-17 | f | 25.579269 | -21.412151 | Central | Lethlakane |
| 4 | 4 | 5 | 183.0 | 2018-12-08 | m | 28.487746 | -22.571451 | Central | Tuli |


In [4]:
# vaccination record data

immun_df = pd.read_csv("raw_data/immunization_db_v2.csv")
immun_df.head()

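| Unnamed: 0.1 | Unnamed: 0 | pat_id | vaccine | im_date | successful | reason_unsuccesful |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | OPV | 2019-01-31 | True | |
| 1 | 1 | 2 | OPV | 2019-11-12 | True | |
| 2 | 2 | 3 | OPV | 2019-11-03 | True | |
| 3 | 3 | 4 | OPV | 2019-06-01 | True | |
| 4 | 4 | 5 | OPV | 2018-12-24 | True | |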


#### Processing raw data into a form useful for training:

First, the data was cleaned by:
- removing duplicate values
- imputing missing data
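The two cleaning steps above can be sketched roughly as follows. The toy frame reuses column names from `patients_db_v2.csv`, but its rows are made up, and median imputation of the coordinates is an assumed strategy; the actual cleaning lives in the exploratory notebook and may differ.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame using column names from patients_db_v2.csv.
toy_patients = pd.DataFrame({
    "pat_id": [1, 2, 2, 3],
    "gender": ["f", "f", "f", "m"],
    "long": [np.nan, 24.877556, 24.877556, 25.249672],
    "lat": [np.nan, -18.370709, -18.370709, -20.490189],
})

# Step 1: drop exact duplicate rows, then keep one row per patient.
cleaned = toy_patients.drop_duplicates().drop_duplicates(subset="pat_id")

# Step 2: impute missing coordinates with the column median
# (an assumed strategy, chosen here only for illustration).
cleaned = cleaned.assign(
    long=cleaned["long"].fillna(cleaned["long"].median()),
    lat=cleaned["lat"].fillna(cleaned["lat"].median()),
)

print(cleaned)
```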

Second, since the model has to predict which class each patient belongs to, I annotated the data by assigning a class to each patient. Patients who would NOT have received their full set of vaccinations by 6 months of age were classified as (1) 'need intervention'; the remaining patients were classified as (0) 'don't need intervention'.
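A rough sketch of this labelling rule is shown below, on made-up rows that use the column names from the two raw tables. The set `REQUIRED_VACCINES` is a hypothetical schedule introduced only for illustration (the real list of required vaccines is not stated here), and "6 months" is approximated with `pd.DateOffset(months=6)`.

```python
import pandas as pd

# Hypothetical schedule: the real set of required vaccines is an assumption.
REQUIRED_VACCINES = {"OPV", "BCG"}

toy_patients = pd.DataFrame({
    "pat_id": [1, 2, 3],
    "dob": pd.to_datetime(["2019-01-22", "2019-11-12", "2019-11-03"]),
})
toy_immun = pd.DataFrame({
    "pat_id": [1, 1, 2, 3, 3],
    "vaccine": ["OPV", "BCG", "OPV", "OPV", "BCG"],
    "im_date": pd.to_datetime(
        ["2019-01-31", "2019-03-01", "2019-11-12", "2019-11-03", "2020-08-01"]
    ),
    "successful": [True, True, True, True, True],
})

# Attach each patient's date of birth to their vaccination records.
records = toy_immun.merge(toy_patients, on="pat_id")
cutoff = records["dob"] + pd.DateOffset(months=6)

# Keep only successful vaccinations given before the 6-month cutoff.
on_time = records[records["successful"] & (records["im_date"] <= cutoff)]
received = on_time.groupby("pat_id")["vaccine"].apply(set)

# Patients with no on-time record become NaN after reindexing.
received = received.reindex(toy_patients["pat_id"])
complete = received.apply(
    lambda v: isinstance(v, set) and REQUIRED_VACCINES <= v
).astype(bool)

# 1 = 'need intervention', 0 = 'don't need intervention'.
labels = (~complete).astype(int).rename("needs_intervention")
print(labels)
```

In this toy example, patient 1 completes both vaccines on time, patient 2 never receives the second one, and patient 3 receives it after the cutoff, so only patient 1 is labelled 0.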

Finally, I created a new data table, "training_data", that contains patient data only up to 4 months of age, together with the class labels, and with categorical values one-hot encoded.
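As an illustration of how such a table might be assembled, the sketch below restricts vaccination events to the first 4 months of life and one-hot encodes the categorical columns with `pd.get_dummies`. The rows are made up, and the feature `doses_by_4m` is a hypothetical example of an early-life feature, not necessarily one used in the actual `training_data`.

```python
import pandas as pd

# Made-up patient rows with labels already attached.
toy = pd.DataFrame({
    "pat_id": [1, 2, 3],
    "gender": ["f", "f", "m"],
    "region": ["Ghanzi", "Chobe", "Central"],
    "needs_intervention": [0, 1, 1],
})

# Made-up vaccination events (patient 3 has none).
events = pd.DataFrame({
    "pat_id": [1, 1, 1, 2],
    "dob": pd.to_datetime(["2019-01-22"] * 3 + ["2019-11-12"]),
    "im_date": pd.to_datetime(
        ["2019-01-31", "2019-03-01", "2019-09-01", "2019-11-12"]
    ),
    "successful": [True, True, True, True],
})

# Only use information available by 4 months of age.
age_cutoff = events["dob"] + pd.DateOffset(months=4)
early = events[events["im_date"] <= age_cutoff]

# Hypothetical feature: successful doses received in the first 4 months.
doses = (
    early.groupby("pat_id")["successful"].sum()
    .rename("doses_by_4m")
    .reset_index()
)

training_data = toy.merge(doses, on="pat_id", how="left")
training_data["doses_by_4m"] = training_data["doses_by_4m"].fillna(0)

# One-hot encode the categorical columns.
training_data = pd.get_dummies(training_data, columns=["gender", "region"])
print(training_data.columns.tolist())
```

Cutting the features off at 4 months while labelling at 6 months leaves a window in which an intervention can still take place.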

In [5]:
# Training data


- Will you have to do any feature engineering or feature extraction?

- Does it need normalization?

- What to do with missing data?

- If there's class imbalance in the data, how do you plan on handling it?

- How to evaluate whether your train set and test set come from the same distribution, and what to do if they don't?

- If you have data of different types, say both texts, numbers, and images, how are you planning on combining them?

- What biases might be present in the data? How would you correct them?

# 3. Model + Training<a id='3'></a>

Start with a simple model, such as logistic regression, and then build up to more complex models such as random forest or XGBoost.
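A minimal sketch of that progression is given below, comparing a logistic regression baseline with a random forest by cross-validated ROC AUC. It runs on synthetic data from `make_classification`, not on the project data, and the hyperparameters and the use of `class_weight="balanced"` are assumptions made for illustration. XGBoost is left out only to avoid an extra dependency.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

RND = 2020  # same seed as in the setup cell

# Synthetic, imbalanced stand-in for the real training table.
X, y = make_classification(
    n_samples=500,
    n_features=10,
    n_informative=5,
    weights=[0.8, 0.2],
    random_state=RND,
)

# Simple baseline first, then a more complex model.
candidates = {
    "logistic_regression": LogisticRegression(
        max_iter=1000, class_weight="balanced"
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=RND
    ),
}

# Mean ROC AUC over 5 cross-validation folds for each candidate.
cv_auc = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}

for name, score in cv_auc.items():
    print(f"{name}: mean ROC AUC = {score:.3f}")
```

If the more complex model does not clearly beat the baseline on the real data, the simpler and more interpretable model is usually the safer choice.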

# 4. Analysing Results<a id='4'></a>

# 5. Conclusion<a id='5'></a>