# Data Engineering

What follows is a quick scan of the technologies and problems that come up when putting a machine learning model to use.

I don't expect everyone to follow every step, so I'll be making this notebook available to anyone who wants it for future reference.

For now, just think of it as a way to start getting familiar with various technologies and how they fit together.

Machine Learning
====

In just a few lines of code...

In [1]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

We will be using sample data included in the scikit-learn library.

In [2]:
# Read in Data
bunch = load_iris()
data = bunch["data"]
targets = bunch["target"]
feature_names = bunch["feature_names"]
target_names = bunch["target_names"]

print(feature_names)
print(data[:10])
print()
print(target_names)
print(targets)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]

['setosa' 'versicolor' 'virginica']
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


Before we train, let's simplify the problem a bit by turning it from multiclass classification into binary classification. This means we will only be interested in one type of flower instead of three. All this classifier will do is tell us whether a specimen is a "virginica" or not.

In [3]:
targets = np.array([1 if i == 2 else 0 for i in targets])
targets

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
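The list comprehension above does the job. As a side note, the same conversion can be written as a single vectorized comparison. The sketch below uses a small made-up label array so it runs on its own; `labels`, `binary` and `loop_version` are illustrative names, not part of the notebook.

```python
import numpy as np

# Stand-in for the iris label vector: 0 = setosa, 1 = versicolor, 2 = virginica
labels = np.array([0, 0, 1, 1, 2, 2])

# Comparing the whole array at once yields booleans; astype(int) turns them into 0/1
binary = (labels == 2).astype(int)

# The list comprehension used in the notebook, for comparison
loop_version = np.array([1 if i == 2 else 0 for i in labels])

print(binary)
print((binary == loop_version).all())
```

Both forms give the same labels; the vectorized one just avoids the Python-level loop.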

In [4]:
# Train the Model
data_train, data_test, labels_train, labels_test = train_test_split(data, targets)
model = RandomForestClassifier()
model.fit(data_train, labels_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [5]:
# Make predictions on testing dataset
results = model.predict(data_test)

# Calculate score of model
score = accuracy_score(labels_test, results)
print("Model Accuracy:", score)

Model Accuracy: 0.9473684210526315


Is that everything? Are we done?
====

A lot of data science and machine learning tutorials and talks end there. Kaggle (a platform that hosts machine learning competitions) notoriously only cares about that last number. Higher is better. Highest wins the prize. It's not uncommon for ML notebooks to contain hundreds of lines of code, and still end with that number.

Job done, right?

As a data engineer, the job has only begun. The first thing your boss is going to ask when he sees the number above is "How can I sell it?" So let's talk about what it takes to actually use this for something useful.


Persistence
===

First thing we need to do is save our model to disk.

Joblib is a fairly basic utility library. It doesn't do any one particular thing; instead it has helpers for a lot of different little things. Scikit-learn didn't bother making their own serializer and instead recommends we use the one built into joblib, because it handles large amounts of numeric data exceptionally well (and models are at their core just a giant bundle of numbers).

In [6]:
import joblib

In [7]:
joblib.dump(model, "model.pkl")

['model.pkl']
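Before building on the saved file, it is worth checking that what comes back from disk behaves like what went in. The sketch below is a self-contained round trip under a few assumptions of mine: it trains its own small forest (10 trees, fixed `random_state`) and writes to a temporary directory so it doesn't touch the `model.pkl` created above.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train a small throwaway model so the example stands alone
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Dump to a temporary file, then load it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The reloaded model should make exactly the same predictions
same = bool((model.predict(X) == restored.predict(X)).all())
print("Predictions match after reload:", same)
```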

Preparing for a basic API
===

The first step is to create a model management object. This object holds our model and makes sure it is loaded once, when our server starts. We don't want our server doing an expensive disk read every time someone makes a call to our API.

In [8]:
class ModelManager:
    def __init__(self, path_to_model):
        self.model = joblib.load(path_to_model)

    def score(self, data):
        return self.model.predict(data)

We also cannot send numpy arrays (scikit-learn's native format) over the internet as they are. So instead we are going to send and receive everything as JSON.

We will need a function to translate the JSON we receive into the array of numbers our model requires.

We will also need a function for the reverse transformation, so that we can send the prediction back to our user.
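To make the problem concrete, here is a minimal sketch of what happens when a numpy array is handed straight to the standard `json` module, and the usual way around it. The `prediction` array is a made-up stand-in for a model output.

```python
import json

import numpy as np

# Stand-in for what model.predict() returns: a numpy array with one label
prediction = np.array([1])

# The json module does not know how to encode a numpy array
try:
    json.dumps({"prediction": prediction})
    serializable = True
except TypeError:
    serializable = False

# Converting to plain Python types first makes it encodable
payload = json.dumps({"prediction": prediction.tolist()})

print(serializable)
print(payload)
```

This is why the helper functions below convert to and from plain Python values at the edges of the API.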

In [9]:
feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [10]:
def dict_to_numpy(data):
    # Note how our list of keys is hard coded. This will be important later.
    keys = [
        'sepal length (cm)',
        'sepal width (cm)',
        'petal length (cm)',
        'petal width (cm)'
    ]

    # scikit-learn expects a 2D array (one row per sample), so we reshape to a single row.
    return np.array([data[key] for key in keys]).reshape(1, -1)

def numpy_to_dict(array):
    return {
        "prediction": "virginica" if int(array[0]) == 1 else "not virginica",
    }

Awesome, now let's put all of this together and build our endpoint function.

In [11]:
# We need our manager to get instantiated when the server goes up, not when we call the endpoint.
model_manager = ModelManager("model.pkl")

def api_endpoint(data):
    array = dict_to_numpy(data)
    result = model_manager.score(array)

    return numpy_to_dict(result)

In [13]:
sample_request = {
    "sepal length (cm)": 5,
    "sepal width (cm)": 2,
    "petal length (cm)": 2,
    "petal width (cm)": 1
}

api_endpoint(sample_request)

{'prediction': 'not virginica'}

Setting up an actual API
===

There are numerous web application frameworks in Python. The most famous is Django. Django is super powerful and allows us to build extremely complicated web-based applications in Python. However, it is also super complicated and way too much power for our purposes. Instead we are going to be using Flask.

Flask is a lightweight web application framework that is designed to be super easy to use. It will allow us to stand up a relatively simple application in only a few lines of code.

Sadly, a Flask server doesn't play well with a Jupyter notebook (it would block the cell it runs in), so let's switch to a standalone script.

Waitress is another web library. It is the actual web server (a WSGI server) that will be hosting our Flask app.

(see api.py)
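Since api.py isn't reproduced in this notebook, here is a rough sketch of what such a file could look like. It is my guess at the shape, not the actual file: `create_app`, `run_server` and `FEATURES` are hypothetical names, the 400 response for missing fields is an addition of mine, and the `/score` route and port 8080 are taken from the calls made later on.

```python
from flask import Flask, jsonify, request

# Hard-coded feature order, matching dict_to_numpy above
FEATURES = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]


def create_app(model):
    """Build the Flask app around any object with a scikit-learn style predict()."""
    app = Flask(__name__)

    @app.route("/score", methods=["POST"])
    def score():
        data = request.get_json(silent=True) or {}
        missing = [name for name in FEATURES if name not in data]
        if missing:
            return jsonify({"error": f"missing fields: {missing}"}), 400
        row = [[data[name] for name in FEATURES]]  # one row, four columns
        label = int(model.predict(row)[0])
        return jsonify({"prediction": "virginica" if label == 1 else "not virginica"})

    return app


def run_server(path_to_model="model.pkl"):
    """Load the model once, then serve on port 8080 with waitress (blocks forever)."""
    import joblib
    from waitress import serve

    serve(create_app(joblib.load(path_to_model)), host="0.0.0.0", port=8080)
```

In a real api.py the file would end by calling `run_server()` under an `if __name__ == "__main__":` guard. Passing the model into `create_app` keeps the "load once at startup" idea from `ModelManager` and makes the endpoint easy to try with a stub model.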

Calling the API
===

To make calls to our API we will make use of another library called requests. Requests can do a lot, but we only need it to make a single HTTP POST request.

In [18]:
import requests

In [26]:
response = requests.post("http://0.0.0.0:8080/score", json=sample_request)

In [27]:
response.json()

{'prediction': 'not virginica'}

An issue with versions
===

So everything is all fine and well, until someone does something unspeakable.

They update the version of some of our libraries....

A PKL file doesn't actually save the whole model. It only saves the data stored on the model object, not the library code that knows what to do with that data. Usually this is enough to reproduce the trained model exactly, but only if the sklearn version is the same as when it was trained.
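You can see this for yourself by looking inside a pickle. The sketch below uses `fractions.Fraction` from the standard library as a stand-in for an estimator (my choice, purely so the snippet has no dependencies); joblib builds on the same pickle mechanism.

```python
import pickle
from fractions import Fraction

# Pickle an ordinary object from a library
blob = pickle.dumps(Fraction(3, 4))

# The bytes name the module and class to import at load time...
has_class_path = b"fractions" in blob and b"Fraction" in blob

# ...but contain none of the class's methods (limit_denominator is one of them)
has_method_code = b"limit_denominator" in blob

# Loading re-imports whatever version of the class is installed right now
restored = pickle.loads(blob)

print(has_class_path, has_method_code, restored)
```

So the file is only a recipe of the form "import this class, then fill it with this data". If the installed class has changed since the file was written, the data may no longer fit.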

As an example, let's break some stuff.

The call to our server still worked. But sklearn gave us some fairly serious warnings. We shouldn't ignore them. We got lucky this time, but that won't always be the case.

(Note: I upgraded from scikit-learn 0.22.2 to latest. Had I upgraded from 0.20.0 to latest everything would have broken. This was a very serious actual issue that needed solving).

So we can't update scikit-learn. But you will probably notice that on a more complicated system we won't be able to upgrade anything. Scikit-learn was kind enough to give us a warning, but other technologies aren't as kind.

Sadly, this means two things are true. 

1) We can never update our software.

2) Two models trained on different computers (and so possibly with different library versions) cannot simultaneously run on the same system.
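This doesn't solve either problem, but one cheap safeguard is to write down the library versions at training time and compare them when the model is loaded, so a mismatch is caught even when a library isn't kind enough to warn. The sketch below is one way to do that with the standard library only; `snapshot_versions`, `find_mismatches` and the idea of a JSON "sidecar" stored next to model.pkl are my own illustration.

```python
import json
from importlib import metadata


def snapshot_versions(packages):
    """Record the installed version of each package (None if not installed)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def find_mismatches(recorded, current):
    """Return {package: (recorded, current)} for every version that differs."""
    return {
        name: (version, current.get(name))
        for name, version in recorded.items()
        if current.get(name) != version
    }


# At training time: this string would be written to a file next to model.pkl
recorded = snapshot_versions(["scikit-learn", "numpy", "joblib"])
sidecar = json.dumps(recorded)

# At load time: compare against the current environment before trusting the model
mismatches = find_mismatches(json.loads(sidecar), snapshot_versions(recorded))
print(mismatches)
```

An empty dictionary means the environment matches the one the model was trained in. Anything else is a reason to stop and look before serving predictions.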

Containers to the rescue!!
===

As my final trick I am going to wrap my entire Python environment into a Docker container.

But first we need to save our Python environment using pip, Python's default package manager.

In [21]:
import os

In [22]:
os.system("pip3 freeze > requirements.txt")

0

(See Dockerfile for final result)

In [29]:
# The following are command line calls that likely won't work in jupyter.
# I am recording them here anyway for reference.

# Build the image, then run the server.
os.system("docker build -t ml_server .")
os.system("docker run -itp 8080:8080 ml_server") # Note: we have to explicitly publish ports from docker containers.

256
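That 256 is worth decoding. On a Unix-like system `os.system` returns an encoded wait status rather than a plain exit code, and 256 corresponds to exit code 1. That is a non-zero code, which suggests the last command did not succeed from inside the notebook, as the comment in the cell predicted. The sketch below assumes Python 3.9+ (for `os.waitstatus_to_exitcode`) and uses a harmless failing command in place of docker.

```python
import os
import subprocess
import sys

# The value os.system returned above
status = 256

# Decode the wait status into the exit code the command actually reported
exit_code = os.waitstatus_to_exitcode(status)
print("exit code:", exit_code)

# subprocess.run reports the exit code directly, with no decoding needed.
# A tiny Python one-liner that exits with code 1 stands in for the docker call.
failing = subprocess.run([sys.executable, "-c", "raise SystemExit(1)"])
print("returncode:", failing.returncode)
```

For anything beyond a quick note to self, `subprocess.run` is usually the easier tool, since it also accepts `check=True` to raise an error on failure.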

And now we test our new docker server.

In [15]:
import requests

In [19]:
response = requests.post("http://0.0.0.0:8080/score", json=sample_request)
response.json()

{'prediction': 'not virginica'}

All is well.

At least until we start trying to sync code at train time and inference time.

But that is a nightmare for another day.