# Docker Tutorial

Before you get started: if you still need practice getting a feel for Docker, work through the beginner tutorial in the Docker labs before starting this one.  Docker is intuitive, so a few examples should be all it takes to get comfortable.

* [Docker Tutorials](https://github.com/docker/labs/blob/master/beginner/readme.md)

This tutorial is loosely based on the second tutorial in that series, `Webapps with Docker`, so going through both of those tutorials along with this one will provide a lot of context for how to use Docker in a number of different ways.

> You will need access to a terminal to run through this tutorial.  JupyterLab or any open terminal will work.  We will create some of the files you need from within this notebook, but Docker is a command-line tool.

In [4]:
import os
import sys
import joblib
import requests
import numpy as np
import pandas as pd

from collections import Counter
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn import ensemble

In [17]:
## preprocessing pipeline
numeric_features = ['age', 'num_streams']
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')),
                                      ('scaler', StandardScaler())])

categorical_features = ['country', 'subscriber_type']
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                                          ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features),
                                               ('cat', categorical_transformer, categorical_features)])


def load_aavail_data():
    data_dir = os.path.join(".")
    df = pd.read_csv(os.path.join(data_dir,r"aavail-target.csv"))

    ## pull out the target and remove unneeded columns
    _y = df.pop('is_subscriber')
    ## flip the label so that 1 marks a non-subscriber (is_subscriber == 0)
    y = np.zeros(_y.size)
    y[_y==0] = 1 
    df.drop(columns=['customer_id','customer_name'],inplace=True)
    return(df,y)

## Docker command reference

Here is a quick reference to keep your Docker commands accessible.

| command | description |
|:--|:--|
|`docker container ls`| List all running containers|
|`docker ps` | List all running containers|
|`docker container ls -a` | List all containers, even those not running|
|`docker container stop CONTAINER_ID_OR_NAME` | Gracefully stop the specified container|
|`docker container kill CONTAINER_ID_OR_NAME` | Force shutdown of the specified container|
|`docker container rm CONTAINER_ID_OR_NAME` | Remove specified container from this machine|
|`docker container rm $(docker container ls -a -q)` | Remove all containers|
|`docker image ls -a` | List all images on this machine|
|`docker image rm IMAGE_ID_OR_NAME` | Remove specified image from this machine|
|`docker image rm $(docker image ls -a -q)` | Remove all images from this machine|
|`docker login` | Log in this CLI session using your Docker credentials|
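
Because Docker is a command-line tool, these commands can also be issued from a notebook cell through Python's `subprocess` module. The following is a minimal sketch, not part of the original tutorial: `docker_cmd` and `run_docker` are hypothetical helper names, and the code simply returns `None` when the `docker` executable is not found or the command fails.

```python
import shutil
import subprocess


def docker_cmd(*args):
    """Build the argument list for a docker CLI call."""
    return ["docker", *args]


def run_docker(*args, timeout=30):
    """Run a docker command; return its stdout, or None if docker is missing or the call fails."""
    if shutil.which("docker") is None:
        return None
    try:
        result = subprocess.run(docker_cmd(*args), capture_output=True,
                                text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None


# list all containers, even those not running (same as the table entry above)
output = run_docker("container", "ls", "-a")
print(output)
```

Passing the command as a list of arguments avoids shell quoting issues, which is why the sketch does not use `shell=True`.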

In [10]:
## make a directory for the tutorial
if not os.path.isdir("docker-tutorial"):
    os.mkdir("docker-tutorial")
    
if os.path.split(os.getcwd())[-1] != 'docker-tutorial':  
    os.chdir("docker-tutorial")
print(os.getcwd())

/root/data/docker-tutorial


## Persist a machine learning model

Visit the docs to learn more about [model persistence in scikit-learn](https://scikit-learn.org/stable/modules/model_persistence.html).  Be careful with sensitive data and pickle files, since the data can easily be extracted.
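
To make the warning about pickle files concrete, here is a small sketch using only the standard library `pickle` module, which `joblib` builds on. The `artifact` dictionary is a hypothetical stand-in for a fitted object that happens to keep a copy of some training rows; it is not the pipeline used in this tutorial.

```python
import pickle

# Hypothetical stand-in for a fitted object that kept a copy of some training rows.
artifact = {
    "params": {"n_estimators": 100, "max_depth": 2},
    "training_rows": [{"customer_name": "Jane Doe", "num_streams": 9}],
}

payload = pickle.dumps(artifact)

# The name is readable in the raw bytes, without even unpickling.
name_visible_in_bytes = b"Jane Doe" in payload

# Unpickling gives back everything the object stored.
restored = pickle.loads(payload)

print(name_visible_in_bytes)
print(restored["training_rows"])
```

This is one reason `load_aavail_data` drops `customer_id` and `customer_name` before the model is fit. Unpickling can also execute arbitrary code, so only load model files that come from a source you trust.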

In [11]:
## load data (you may need to adjust the location of the data to match your system)
X,y = load_aavail_data()

## train-test split to check model performance (assumes you have already grid-searched to tune the model)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
params = {'n_estimators': 100,'max_depth':2}   
clf = ensemble.RandomForestClassifier(**params)
pipe = Pipeline(steps=[('pre', preprocessor),
                       ('clf',clf)])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(classification_report(y_test,y_pred))

## retrain using all of the data
pipe.fit(X, y)
saved_model = 'aavail-rf.joblib'
joblib.dump(pipe, saved_model)

              precision    recall  f1-score   support

         0.0       0.83      0.91      0.87       142
         1.0       0.71      0.55      0.62        58

    accuracy                           0.81       200
   macro avg       0.77      0.73      0.75       200
weighted avg       0.80      0.81      0.80       200



['aavail-rf.joblib']

## Create a simple Flask app

In [12]:
%%writefile app.py

from flask import Flask, jsonify, request
import joblib
import socket
import json
import pandas as pd
import os

app = Flask(__name__)

@app.route("/")
def hello():
    html = "<h3>Hello {name}!</h3>" \
           "<b>Hostname:</b> {hostname}<br/>"
    return html.format(name=os.getenv("NAME", "world"), hostname=socket.gethostname())

@app.route('/predict', methods=['GET','POST'])
def predict():
    
    ## input checking
    if not request.json:
        print("ERROR: API (predict): did not receive request data")
        return jsonify([])

    query = request.json
    query = pd.DataFrame(query)
    
    ## pd.DataFrame always yields a 2-D object, so no reshape is needed

    y_pred = model.predict(query)
    
    return(jsonify(y_pred.tolist()))        
            
if __name__ == '__main__':
    saved_model = 'aavail-rf.joblib'
    model = joblib.load(saved_model)
    app.run(host='0.0.0.0', port=8080,debug=True)

Writing app.py
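
The input check in `predict` only tests that a JSON body was received. A stricter check could also confirm that the four feature columns used by the preprocessing pipeline are present before the data reaches the model. The helper below is a sketch under that assumption; `validate_query` and `REQUIRED_COLUMNS` are hypothetical names and are not part of `app.py`.

```python
REQUIRED_COLUMNS = ("country", "age", "subscriber_type", "num_streams")


def validate_query(payload, required=REQUIRED_COLUMNS):
    """Return a list of problems with a column-oriented payload; an empty list means it looks usable."""
    if not isinstance(payload, dict) or not payload:
        return ["payload must be a non-empty JSON object"]

    # every required column must be present
    problems = [f"missing column: {name}" for name in required if name not in payload]

    # each column that is present must hold a collection of values
    lengths = {}
    for name in required:
        if name not in payload:
            continue
        values = payload[name]
        if isinstance(values, (dict, list)):
            lengths[name] = len(values)
        else:
            problems.append(f"column {name} must be a list or an object")

    # all columns should describe the same number of rows
    if len(set(lengths.values())) > 1:
        problems.append(f"columns have different lengths: {lengths}")
    return problems


good = {"country": {"0": "singapore"}, "age": {"0": 33},
        "subscriber_type": {"0": "aavail_basic"}, "num_streams": {"0": 14}}
print(validate_query(good))
print(validate_query({"age": [28]}))
```

Inside the endpoint, a non-empty result could be returned to the client with an error status instead of being passed on to `model.predict`.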


## Test the Flask app

Move into your `docker-tutorial` directory

```bash
$ cd docker-tutorial
```

Start the app

```bash
$ python app.py
```

Then go to [http://0.0.0.0:8080/](http://0.0.0.0:8080/).  If your browser does not accept `0.0.0.0`, try `http://localhost:8080/` instead.

Stop the server.  We will relaunch it in a few moments from within Docker.

## Create the Dockerfile

Before we build an image from the Dockerfile, we first need to create a `requirements.txt` file.

In [13]:
%%writefile requirements.txt

cython
numpy
flask
pandas
scikit-learn

Writing requirements.txt


In [14]:
%%writefile Dockerfile

# Use an official Python runtime as a parent image
FROM python:3.7.5-stretch

RUN apt-get update && apt-get install -y \
python3-dev \
build-essential    
        
# Set the working directory to /app
WORKDIR /app

# Copy the current directory contents into the container at /app
ADD . /app

# Install any needed packages specified in requirements.txt
RUN pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt

# Make port 8080 (the port app.py listens on) available outside this container
EXPOSE 8080

# Define environment variable
ENV NAME World

# Run app.py when the container launches
CMD ["python", "app.py"]

Writing Dockerfile


## Build the Docker image and run it

Step one: build the image (from the directory that was created with this notebook)
 
```bash
    ~$ cd docker-tutorial
    ~$ docker build -t example-ml-app .
```

Check that the image is there.

```bash
    ~$ docker image ls
```

You may notice images that you no longer use.  You may delete them with

```bash
    ~$ docker image rm IMAGE_ID_OR_NAME
```

Run the container

```bash
docker run -p 4000:8080 example-ml-app
```

## Test the running app

First go to [http://0.0.0.0:4000/](http://0.0.0.0:4000/) to ensure the app is running and accessible.  If your browser does not accept `0.0.0.0`, try `http://localhost:4000/` instead.
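
The same check can be done from Python by trying to open a TCP connection to the mapped port. The sketch below uses only the standard library; `port_is_open` is a hypothetical helper name, and `localhost` assumes Docker is running on the same machine as this notebook.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 4000 is the host port from `docker run -p 4000:8080`
if port_is_open("localhost", 4000):
    print("something is listening on port 4000")
else:
    print("nothing is listening on port 4000; is the container running?")
```

A successful connection only shows that something is listening on the port; requesting the `/` route is still the way to confirm that it is the app.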

In [15]:
## create some new data
X_new_data = {}
X_new_data['country'] = ['united_states','united_states','singapore','united_states','singapore']
X_new_data['age'] = [28,30,33,24,39]
X_new_data['subscriber_type'] = ['aavail_premium','aavail_basic','aavail_basic','aavail_basic','aavail_unlimited']
X_new_data['num_streams'] = [9,19,14,33,20]
X_new = pd.DataFrame(X_new_data)
X_new.head()

| | country | age | subscriber_type | num_streams |
|:--|:--|:--|:--|:--|
| 0 | united_states | 28 | aavail_premium | 9 |
| 1 | united_states | 30 | aavail_basic | 19 |
| 2 | singapore | 33 | aavail_basic | 14 |
| 3 | united_states | 24 | aavail_basic | 33 |
| 4 | singapore | 39 | aavail_unlimited | 20 |


In [22]:
import json
json.dumps(X_new.to_dict())

'{"country": {"0": "united_states", "1": "united_states", "2": "singapore", "3": "united_states", "4": "singapore"}, "age": {"0": 28, "1": 30, "2": 33, "3": 24, "4": 39}, "subscriber_type": {"0": "aavail_premium", "1": "aavail_basic", "2": "aavail_basic", "3": "aavail_basic", "4": "aavail_unlimited"}, "num_streams": {"0": 9, "1": 19, "2": 14, "3": 33, "4": 20}}'
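
Note that the integer row labels appear as strings (`"0"`, `"1"`, ...) in the JSON output, because JSON object keys are always strings. The sketch below reproduces this with plain dictionaries and the standard library only; the two-row `columns` dictionary is a made-up example in the same column-oriented layout.

```python
import json

# column -> {row label -> value}, the same layout as the output above
columns = {"age": {0: 28, 1: 30}, "country": {0: "united_states", 1: "singapore"}}

encoded = json.dumps(columns)
decoded = json.loads(encoded)

print(list(columns["age"].keys()))   # integer row labels before encoding
print(list(decoded["age"].keys()))   # string row labels after the round trip

# a row-oriented alternative avoids carrying row labels at all
records = [dict(zip(columns, row))
           for row in zip(*(col.values() for col in columns.values()))]
print(records)
```

In pandas, the row-oriented layout corresponds to `to_dict(orient="records")`. Either layout can be passed to `pd.DataFrame` on the receiving side, so the choice here is mostly a matter of preference.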

In [18]:
import requests


## data needs to be in dict format for JSON
query = X_new.to_dict()

## test the Flask API
#port = 8080
#r = requests.post('http://0.0.0.0:{}/predict'.format(port),json=query)

## test the Docker API
port = 4000
r = requests.post('http://0.0.0.0:{}/predict'.format(port),json=query)

## the endpoint returns JSON, so decode it directly
response = r.json()
print(response)

If the container is not running when this cell executes, the request fails with an error like the following:

ConnectionError: HTTPConnectionPool(host='0.0.0.0', port=4000): Max retries exceeded with url: /predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f047cb7c690>: Failed to establish a new connection: [Errno 111] Connection refused'))

## Continued learning

In this tutorial we showed how to add a `predict` endpoint to the Flask app.  As an exercise, go back and edit the Flask app to add a training endpoint that accepts new data as input.