In [1]:
import pickle
import json
import pandas as pd
# the category_encoders package is not part of the
# sklearn core packages. It is a 3rd party package, but it makes
# our lives a lot easier because it can deal with encoding
# strings whereas sklearn's OneHotEncoder cannot
import category_encoders
from sklearn.preprocessing import Imputer, FunctionTransformer, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# Preserving your model

In this learning unit you will learn how to preserve your model so that
the value it generates can be used in a different process or program from the
one in which it was fitted. There are several ways to do this,
but for this hackathon we will be using two tools:

1. pickle from the python standard library
1. pipelines from scikit-learn

## The journey

### Train
We are going to first train a model on the classic titanic dataset. We will use
this one because it has both categorical and numeric features, with missing values in both.

### Serialize
Once the model has been trained as part of a pipeline, we will [serialize](https://en.wikipedia.org/wiki/Serialization)
it using the [pickle](https://docs.python.org/3/library/pickle.html) package
that is found in python's core.
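
If you have never used pickle before, here is a minimal sketch of the round trip. The `fake_model` dictionary is just a stand-in invented for this illustration, not the pipeline we will fit later.

```python
import pickle

# Any picklable python object works; a dict stands in for a fitted model here.
fake_model = {"coef": [0.5, -1.2], "intercept": 0.1}

# Serialize to bytes (pickle.dump does the same thing but writes to a file).
payload = pickle.dumps(fake_model)

# Deserialize: we get back an equal, but separate, object.
restored = pickle.loads(payload)
print(restored == fake_model)  # True
print(restored is fake_model)  # False
```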

### Predict on new data
After we are confident we can retrieve the pickled model from disk, we will
show how to prepare a brand new observation for prediction with the model.

Let's get started! We're not going to spend much time preparing the dataset or working on model performance
because it's not the focus of this learning unit. So let's power through the first few steps!

In [2]:
# read original dataset from disk and take a look at it
df = pd.read_csv('titanic.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
# let's get rid of a few features that don't hold anything
# particularly useful and take another peek
df = df.drop(['Ticket', 'Name', 'PassengerId'], axis=1)
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,male,22.0,1,0,7.25,,S
1,1,1,female,38.0,1,0,71.2833,C85,C
2,1,3,female,26.0,0,0,7.925,,S
3,1,1,female,35.0,1,0,53.1,C123,S
4,0,3,male,35.0,0,0,8.05,,S


In [4]:
# now let's split it into X_train and y_train
X_train, y_train = df.drop('Survived', axis=1), df.Survived

## Let's build the pipeline

Okay, the next bit of necessary code isn't very much at all but
is very dense. So let's take things one at a time to understand
the motivation.

We'll begin with just the model itself - a logistic regression
and see what happens.

In [5]:
clf = LogisticRegression()
clf.fit(X_train, y_train)



ValueError: could not convert string to float: 'Q'

We know this game - scikit classifiers don't know how to deal
with non-numerical data. Since we already know about pipelines,
let's try to put together a pipeline that has a OneHotEncoder
in an attempt to deal with the non-numeric data.

In [6]:
pipeline = make_pipeline(
    OneHotEncoder(
        categorical_features=list(
            X_train.select_dtypes(include=['object']).columns)),
    LogisticRegression()
)
pipeline.fit(X_train, y_train)



IndexError: arrays used as indices must be of integer (or boolean) type

Well this is incredibly stupid - the [OneHotEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
from scikit can't deal with strings which means that we would have
to map all the strings to codes before anything goes through
the pipeline and we definitely don't want this kind of headache.

Fortunately, there's a package out there from Scikit Contrib called
[categorical_encoding](http://contrib.scikit-learn.org/categorical-encoding/index.html)
that can be installed via `pip install category_encoders`. So let's
give it a spin and see if it magically fixes all of our problems!

This one is a bit smarter, as the [default behavior](http://contrib.scikit-learn.org/categorical-encoding/onehot.html)
is for all string columns to be dummified. One last thing to note is that
the `handle_unknown` keyword is set to `ignore`. This way, if the encoder
later runs into a value that it hasn't seen before, it won't throw
an error. Silently ignoring stuff that you haven't seen before is not a
perfect solution but hey, it's better than crashing!
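
To make the idea of ignoring unseen values concrete, here is a toy sketch in plain python. It is not how category_encoders is implemented, and the helper names `fit_categories` and `one_hot` are made up for this example. It only shows the general idea: a value that was never seen while fitting gets no indicator set at all.

```python
def fit_categories(values):
    """Remember the distinct categories seen during training, in first-seen order."""
    seen = []
    for value in values:
        if value not in seen:
            seen.append(value)
    return seen


def one_hot(value, categories):
    """One indicator per known category; an unseen value yields all zeros."""
    return [1 if value == category else 0 for category in categories]


categories = fit_categories(["S", "C", "S", "Q"])
print(one_hot("C", categories))  # [0, 1, 0]
print(one_hot("X", categories))  # [0, 0, 0], the unseen value is silently ignored
```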

In [7]:
pipeline = make_pipeline(
    category_encoders.OneHotEncoder(handle_unknown='ignore'),
    LogisticRegression(),
)
pipeline.fit(X_train, y_train)



ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Hooray! A new error! We're making some kind of progress!

What's happening now is that some of the numeric columns have NaNs
in them, so we need to use an [Imputer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html)
for those bad boys.
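
In case mean imputation is new to you, this is roughly what the `mean` strategy boils down to for a single column. It is a plain python sketch with made-up helper names (`fit_mean`, `impute`) and made-up numbers, not scikit's implementation: learn the mean of the values that are present, then fill the holes with it.

```python
import math


def fit_mean(values):
    """Mean of the values that are actually present (NaNs are skipped)."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present)


def impute(values, fill):
    """Replace every NaN with the fill value learned during fitting."""
    return [fill if math.isnan(v) else v for v in values]


ages = [22.0, float("nan"), 26.0, 36.0]
mean_age = fit_mean(ages)
print(mean_age)                # 28.0
print(impute(ages, mean_age))  # [22.0, 28.0, 26.0, 36.0]
```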

In [8]:
pipeline = make_pipeline(
    category_encoders.OneHotEncoder(handle_unknown='impute'),
    Imputer(strategy='mean'),
    LogisticRegression(),
)
pipeline.fit(X_train, y_train)



Pipeline(memory=None,
         steps=[('onehotencoder',
                 OneHotEncoder(cols=['Sex', 'Cabin', 'Embarked'],
                               drop_invariant=False, handle_missing='value',
                               handle_unknown='impute', return_df=True,
                               use_cat_names=False, verbose=0)),
                ('imputer',
                 Imputer(axis=0, copy=True, missing_values='NaN',
                         strategy='mean', verbose=0)),
                ('logisticregression',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='warn', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='warn', tol=0.0001, verbose=0,
                                    warm_start=F

Hooray! Now we are using a pipeline and training a classifier without doing
any explicit preprocessing of the dataset!

With the pipeline we should now be able to move on with our lives and start
processing new observations.

## Predicting on new observations

Let's construct a new observation using a format that is technology agnostic: json.

We will assume that a new observation has come over the wire using a transport
protocol such as HTTP. What this means is that it arrives as a string:

In [9]:
new_obs_str = '{"Age": 22.0, "Cabin": null, "Embarked": "S", "Fare": 7.25, "Parch": 0, "Pclass": 3, "Sex": "male", "SibSp": 1}'

Great, now we've got a new observation as a json string. This is desirable because no matter what
programming language or environment we are in, we know that there will be support for deserialization
into a native type. In ruby these are hashes, in javascript they are objects, and in python they are
dictionaries.

So let's turn our string into a dictionary - it's a great starting point to do anything we may need.

In [10]:
new_obs_dict = json.loads(new_obs_str)
print('type {}'.format(type(new_obs_dict)))


type <class 'dict'>


In [11]:
new_obs_dict

{'Age': 22.0,
 'Cabin': None,
 'Embarked': 'S',
 'Fare': 7.25,
 'Parch': 0,
 'Pclass': 3,
 'Sex': 'male',
 'SibSp': 1}

Not so fast... scikit models don't know how to deal with dictionaries!

Well we know that when we trained the model, the pipeline took a pandas
dataframe so that's what we should be passing into `predict_proba` as well.

With that in mind, let's take a few lines of code to transform the dictionary
into a pandas dataframe. Note that a series isn't good enough, it must be
a full dataframe even if it's just for a single observation.

Although it's only a few lines of code, it's pretty dense so be sure to
read the comments in order to fully understand what's going on here.

In [12]:
# First step is to create a dataframe with the columns in the correct
# order. You can get the correct order by getting the columns from
# the X_train dataframe with which the model was trained. Doing this
# will preserve the correct order.

# Also note that you must pass the dictionary as an entry
# in a list, even if there is only a single one... scikit models
# always assume things are being processed in batches.
obs = pd.DataFrame([new_obs_dict], columns=X_train.columns.tolist())

# Now you need to make sure that the types are correct so that the
# pipeline steps will have things as expected.
obs = obs.astype(X_train.dtypes)

In [13]:
obs

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,3,male,22.0,1,0,7.25,,S


In [14]:
# Finally you can call predict_proba and the pipeline will function
# as expected.
pipeline.predict_proba(obs)

array([[0.90499549, 0.09500451]])
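
If you find yourself doing this more than once, the steps above can be wrapped in a small helper. This is only a sketch: `observation_to_frame` is a name made up for this example, and the two-column `columns` and `dtypes` below are stand-ins for `X_train.columns.tolist()` and `X_train.dtypes`.

```python
import json

import pandas as pd


def observation_to_frame(obs_json, columns, dtypes):
    """Turn one json observation into a single-row dataframe."""
    obs_dict = json.loads(obs_json)
    # Wrap the dict in a list so pandas builds one row; `columns` fixes the order.
    frame = pd.DataFrame([obs_dict], columns=columns)
    # Cast each column to the dtype seen at training time.
    return frame.astype(dtypes)


# Stand-ins for the training metadata (hypothetical, two columns only).
columns = ["Pclass", "Sex"]
dtypes = pd.Series({"Pclass": "int64", "Sex": "object"})

# The keys arrive in a different order than the training columns.
frame = observation_to_frame('{"Sex": "male", "Pclass": 3}', columns, dtypes)
print(frame.columns.tolist())  # ['Pclass', 'Sex']
```

The result can then be passed to `predict_proba` just like `obs` above.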

Alright, we are feeling pretty cool right about now. We have a trained
model, we can process new observations! Now that we have this under control
let's take a look at preserving our model so that we can keep
this sweetness for posterity.

## Serialization of the necessary components

Okay let's take a moment to imagine that we will need to use
this in a totally different environment than the one that
the model was trained in. We are in a different notebook, a
different python process, maybe even a flask server (hint hint).

We won't want to have to carry around the training set in order
to re-train the model, so what we will want to do is save it to
disk so that it can be transferred somewhere else and used
later on.

Remember that serialization is just the process of storing something
so that it can be deserialized and used later on. So let's think
about what it was that we needed in order to be able to call
`predict_proba` on a new observation. It is:

1. The column names in the correct order
1. The fitted pipeline
1. The dtypes of the columns of the training set

One at a time let's look at serializing these:

### Serializing the columns in the correct order

Probably the most well-known serialization format for data is [json](https://en.wikipedia.org/wiki/JSON).

This is great because it's robust and technology agnostic. So let's serialize the
columns of the training set in the correct order:

In [15]:
with open('columns.json', 'w') as fh:
    json.dump(X_train.columns.tolist(), fh)
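
A quick way to convince yourself that nothing is lost is to read the file straight back. Here is a small self-contained sketch of that check. It writes to a temporary directory so it doesn't touch the real `columns.json`, and the three-item `columns` list is a stand-in for `X_train.columns.tolist()`.

```python
import json
import os
import tempfile

columns = ["Pclass", "Sex", "Age"]  # stand-in for X_train.columns.tolist()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "columns.json")
    # Write the list of column names as json...
    with open(path, "w") as fh:
        json.dump(columns, fh)
    # ...and read it back while the temporary file still exists.
    with open(path) as fh:
        loaded = json.load(fh)

print(loaded == columns)  # True, same names in the same order
```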

The dtypes are a different story. When you call `X_train.dtypes`
you get a pandas Series of dtype objects, which are python objects
rather than raw data, so json can't handle them and we will use pickle instead:

In [16]:
with open('dtypes.pickle', 'wb') as fh:
    pickle.dump(X_train.dtypes, fh)

Finally we need to serialize the fitted pipeline. Unfortunately we
don't have something as clean as json to do this with since
the pipeline is a python object and not just raw data like the
column names are.

So in order to preserve the pipeline, we need a library that can write a python object (the fitted pipeline) to disk in a way that python can load again. The most common library for serializing python objects is [pickle](https://docs.python.org/3/library/pickle.html), which is part of python's standard library. However, scikit-learn's documentation recommends [joblib](http://scikit-learn.org/stable/modules/model_persistence.html), a pickle replacement that is more efficient on objects that carry large numpy arrays, as fitted scikit-learn estimators often do.

In [17]:
from sklearn.externals import joblib
joblib.dump(pipeline, 'pipeline.pickle') 



['pipeline.pickle']

**NOTE**: One thing to consider when serializing/deserializing scikit-learn estimators is that joblib/pickle don't store the definitions of the estimators (the code), only a reference to them. This means two things:

- All of the libraries you use to build the pipeline on your laptop need to be available (installed) on the machine where you deploy it.

- All of the custom code you use in the pipeline (for example, any custom transformer you build) needs to be defined as well when you load the pipeline back.
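
You can see this for yourself by peeking at a pickle payload. The sketch below uses `OrderedDict` from the standard library as a stand-in for an estimator class: the bytes name the module and the class, but none of the class's code is in there, which is why the class must be importable wherever you unpickle.

```python
import pickle
from collections import OrderedDict

original = OrderedDict(a=1)
payload = pickle.dumps(original)

# The payload names the module and class, but holds none of the class's code.
print(b"collections" in payload)  # True
print(b"OrderedDict" in payload)  # True

# Loading works here because the collections module is importable.
restored = pickle.loads(payload)
print(restored == original)  # True
```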

Alrighty then, we have now serialized all necessary components
of our model in order to be able to process new observations!

Move on to the next notebook to see how we can de-serialize and use
all of this work in a completely different process.