# Codeclub Docker tutorial

This document is a brief case study on how a data analysis can be run in a Docker container. We will use data on diabetes readmissions from the UCI repository: https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008#

The specification of the Docker environment, installed modules and data can be found in the Dockerfile provided with this tutorial.

In [71]:
# Import all necessary modules. All modules are already pre-installed in the Docker image
import numpy as np
import pandas as pd

from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

We copied the data into the container when we created the image. Now we only have to set the correct path and read the data in with pandas. 

In [2]:
data_path = "/home/data/diabetic_data.csv"
df = pd.read_csv(data_path)
df.head()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),?,6,25,1,1,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),?,1,1,7,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),?,1,1,7,2,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),?,1,1,7,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),?,1,1,7,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
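
The `?` entries in the `weight` column above look like placeholders for missing values. As a minimal sketch, assuming that is what they mean, pandas can be told to parse them as NaN while reading. The inline CSV and the names `toy_csv`, `toy_df` and `missing_share` are made up for illustration and stand in for the real file.

```python
import io

import pandas as pd

# Toy CSV standing in for diabetic_data.csv (values are illustrative only)
toy_csv = io.StringIO(
    "encounter_id,race,weight\n"
    "1,Caucasian,?\n"
    "2,AfricanAmerican,[75-100)\n"
    "3,Caucasian,?\n"
)

# na_values lists the strings that pandas should treat as missing
toy_df = pd.read_csv(toy_csv, na_values="?")

# Fraction of missing entries per column
missing_share = toy_df.isna().mean()
print(missing_share)
```

Checking the share of missing values per column before modelling helps decide which variables are usable as predictors.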


For simplicity's sake, we use only a small number of predictor variables. Our outcome is readmission to hospital within 30 days, so readmissions after more than 30 days are counted as not readmitted.

In [72]:
X = df[['age', 'gender', 'race']]

y = df['readmitted']
y = y.replace('>30', 'NO')
y = y.replace('<30', 'YES')
y.unique()

array(['NO', 'YES'], dtype=object)
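
The two `replace` calls collapse the original labels into a binary outcome. As a sketch, the same recoding can be written as one explicit dictionary mapping, followed by a look at the class balance. The toy series and the names `toy_y`, `label_map`, `toy_y_bin` and `class_share` are hypothetical and only reuse the labels that appear in this notebook.

```python
import pandas as pd

# Toy outcome column using the labels seen in `readmitted`
toy_y = pd.Series(["NO", ">30", "<30", "NO", ">30"])

# One explicit mapping from original label to binary outcome
label_map = {"NO": "NO", ">30": "NO", "<30": "YES"}
toy_y_bin = toy_y.map(label_map)

# Share of each class in the recoded outcome
class_share = toy_y_bin.value_counts(normalize=True)
print(class_share)
```

Unlike `replace`, `map` turns any label missing from the dictionary into NaN, which makes unexpected values easy to spot.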

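Scikit-learn requires categorical variables to be encoded as numbers. We apply scikit-learn's OneHotEncoder, dropping the first category of each variable and adding a constant column (the intercept) to obtain a full-rank matrix. Note that LogisticRegression also fits its own intercept by default (see `fit_intercept=True` in the output further down), so the constant column is redundant here.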

In [83]:
enc = OneHotEncoder(drop='first')
X_enc = np.hstack((
    np.ones((X.shape[0], 1)), 
    enc.fit_transform(X).toarray()
))
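
To see why the first category is dropped, here is a small sketch on made-up data (the frame `toy_X` and all `toy_*` names are hypothetical). It compares the rank of the design matrix with and without `drop='first'` once a constant column is added.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy predictors standing in for the real columns
toy_X = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male", "Female"],
    "race": ["Caucasian", "AfricanAmerican", "AfricanAmerican", "Asian", "Caucasian"],
})
ones = np.ones((toy_X.shape[0], 1))

# Drop the first category of each variable, then prepend the intercept column
toy_enc = OneHotEncoder(drop="first")
toy_design = np.hstack((ones, toy_enc.fit_transform(toy_X).toarray()))

# Keep every category: the dummies of each variable sum to the intercept column
toy_full = np.hstack((ones, OneHotEncoder().fit_transform(toy_X).toarray()))

print(toy_design.shape, np.linalg.matrix_rank(toy_design))
print(toy_full.shape, np.linalg.matrix_rank(toy_full))
```

In the first matrix the rank equals the number of columns. In the second it is lower, because the dummy columns of each variable add up to the constant column.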

We then fit a logistic regression on the encoded matrix and the outcome vector.

In [84]:
clf = LogisticRegression(random_state=42, solver='newton-cg')
clf.fit(X_enc, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=42, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)

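Unsurprisingly, this very simple model, based only on age, sex and ethnicity, does poorly: the AUC computed below is about 0.53, barely above the 0.5 expected from random guessing, even though it is evaluated on the same data the model was fitted on. However, you now have a Docker image in which you can further explore this dataset to your heart's desire :)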

In [86]:
probs = clf.predict_proba(X_enc)
roc_auc_score(y, probs[:,1])

0.5283278747843857
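
The AUC above is computed on the same rows the model was fitted on. A common next step is to evaluate on held-out data instead. The following is a minimal sketch of that idea, using synthetic data in place of `X_enc` and `y`; the split size, the seed and all `toy_*` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded predictors and the binary outcome
rng = np.random.default_rng(42)
toy_X = rng.integers(0, 2, size=(500, 3)).astype(float)
toy_y = np.where(rng.random(500) < 0.2 + 0.2 * toy_X[:, 0], "YES", "NO")

# Hold out 25% of the rows, keeping the class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y
)

toy_clf = LogisticRegression(solver="newton-cg")
toy_clf.fit(X_train, y_train)

# predict_proba columns follow toy_clf.classes_, so look up "YES" explicitly
yes_col = list(toy_clf.classes_).index("YES")
test_auc = roc_auc_score(y_test, toy_clf.predict_proba(X_test)[:, yes_col])
print(test_auc)
```

Looking up the column through `classes_` avoids relying on the alphabetical ordering of the labels.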