# Learning notebook 1 - Train and serialize

In [5]:
import json
import pandas as pd
import pickle
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In this learning unit you will learn how to preserve your model so that the value it generates can be used in a different process or program from the one in which it was fitted. There are a few ways to do this, but in this specialization we will use the following tools:

1. [pickle](https://docs.python.org/3/library/pickle.html) from the Python standard library
1. [pipelines](https://scikit-learn.org/stable/modules/compose.html#pipelines-and-composite-estimators) from scikit-learn

This will be our journey:

1. **Train**: We are going to first train a model on the classic Titanic dataset. We will use this one because it has categorical and numeric features and missing values in both types.

2. **Serialize**: Once the model has been trained as part of a pipeline, we will [serialize](https://en.wikipedia.org/wiki/Serialization) it, i.e. preserve it for later, using the [pickle](https://docs.python.org/3/library/pickle.html) module from Python's standard library.

3. **Predict on new data**: After we are confident that we can retrieve the pickled model from the disk, we will show how to prepare a brand new observation for prediction with the model.

## 1. Train your model

Let's get started! We're not going to spend much time preparing the data set or working on the model performance because it's not the focus of this learning unit. So let's power through the first few steps!

First we read the data set and take a look at it:

In [6]:
df = pd.read_csv('titanic.csv')
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Let's get rid of a few features that don't hold anything particularly useful and take another peek:

In [7]:
df = df.drop(['Ticket', 'Name', 'PassengerId'], axis=1)
df.head()

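| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | NaN | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | NaN | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | C123 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | NaN | S |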


Now let's separate the features and the target:

In [8]:
X_train, y_train = df.drop('Survived', axis=1), df.Survived

### 1.1 Build the pipeline

Okay, the next bit of code is short but dense, so let's take things one step at a time to understand
the motivation.

We'll begin by confidently throwing a logistic regression at the data and see what happens.

In [9]:
# NOTE: We're using a try/except block here because (SPOILER ALERT)
# the fit function is going to fail and we want the notebook to look clean.
# Go ahead and remove the try/except block to see the error's stack trace

try:
    clf = LogisticRegression()
    clf.fit(X_train, y_train)
except ValueError as e:
    print(e)

could not convert string to float: 'male'


We know this game: scikit-learn classifiers don't know how to deal with non-numerical data. Since we already know about pipelines,
let's put together a pipeline that uses a OneHotEncoder to deal with the non-numeric data.

We'll use the [OneHotEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). It dummifies every column it is given, so we use a `ColumnTransformer` to send only the string columns to it and the remaining numeric columns to a `SimpleImputer`. Setting `handle_unknown="ignore"` makes the encoder output all zeros for a category it did not see during fitting, instead of raising an error. `SimpleImputer` fills in missing values in the data. By default, it assumes that the missing values are NaNs, but we could specify other types of missing values with the `missing_values` keyword. With the `strategy` keyword we define how to fill in the missing values: in this case, we're replacing them with the mean value of the feature where the missing value occurs.

In [10]:
categorical_features = X_train.select_dtypes(include=['object']).columns
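# Every non-string column is treated as numeric; selecting by dtype
# (instead of a set difference) keeps the original column order.
numeric_features = X_train.select_dtypes(exclude=['object']).columns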

preprocessor = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy='mean'), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)
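
To see what the two transformers do, here is a minimal sketch on a made-up frame. The `toy` data, the `fare` and `port` column names, and the `sparse_threshold=0` setting (used only to get a dense array that is easy to read) are assumptions for illustration and not part of the Titanic pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one numeric column with a gap, one string column
toy = pd.DataFrame({
    "fare": [10.0, np.nan, 30.0],
    "port": ["S", "C", "S"],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="mean"), ["fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["port"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Output columns: imputed fare, then one column per category (C, S)
encoded = toy_preprocessor.fit_transform(toy)

# A category that was never seen during fit ("Q") becomes all zeros
unseen = pd.DataFrame({"fare": [5.0], "port": ["Q"]})
encoded_unseen = toy_preprocessor.transform(unseen)

print(encoded)
print(encoded_unseen)
```

The missing `fare` is replaced by the mean of the other two values, and the unseen port does not raise an error.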

In [11]:
pipeline = make_pipeline(
    preprocessor,
    LogisticRegression(max_iter=1000)
)
pipeline.fit(X_train, y_train)
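
If you ever need to reach inside a fitted pipeline, it helps to know that `make_pipeline` names each step after its lowercased class name. A small sketch on made-up data (`X_toy`, `y_toy` and `toy_pipeline` are hypothetical names, not the Titanic objects):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data with one missing value
X_toy = np.array([[0.0], [1.0], [np.nan], [3.0]])
y_toy = np.array([0, 0, 1, 1])

toy_pipeline = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
toy_pipeline.fit(X_toy, y_toy)

# Steps are addressable by the lowercased class name
step_names = list(toy_pipeline.named_steps)
fitted_classifier = toy_pipeline.named_steps["logisticregression"]

# The imputer learned the mean of the observed values during fit
fill_value = toy_pipeline.named_steps["simpleimputer"].statistics_[0]

print(step_names)
print(fill_value)
```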

Hooray! Now we are using a pipeline and training a classifier without doing any explicit preprocessing of the dataset!

With the pipeline we should now be able to move on and start processing new observations.

### 1.2 Predict on new observations

Let's construct a new observation using a format that is technology agnostic: [JSON](https://en.wikipedia.org/wiki/JSON).

We will assume that a new observation has come over the wire using a protocol
such as HTTP, which means that it will arrive as a JSON string (a string whose content follows the JSON format).

In [12]:
new_obs_str = '{"Age": 22.0, "Cabin": NaN, "Embarked": "S", "Fare": 7.25, "Parch": 0, "Pclass": 3, "Sex": "male", "SibSp": 1}'

Great, now we've got a new observation as a JSON string. This is desirable because no matter what
programming language or environment we are in, we know that there will be support for deserialization
into a native type. In Ruby these are hashes, in JavaScript they are objects, and in Python they are
dictionaries.

So let's turn our JSON string into a dictionary - it's a great starting point to do anything we may need.

In [13]:
new_obs_dict = json.loads(new_obs_str)
print('type {}'.format(type(new_obs_dict)))

type <class 'dict'>


In [14]:
new_obs_dict

{'Age': 22.0,
 'Cabin': nan,
 'Embarked': 'S',
 'Fare': 7.25,
 'Parch': 0,
 'Pclass': 3,
 'Sex': 'male',
 'SibSp': 1}
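
One detail in the payload above deserves a note: `NaN` is not part of the JSON standard. Python's `json` module accepts it by default, but stricter parsers may reject it, and `null` is the portable way to mark a missing value. A minimal sketch of the difference (the shortened payloads and the `reject_constant` helper are hypothetical, introduced only for this illustration):

```python
import json
import math

payload = '{"Age": 22.0, "Cabin": NaN}'

# Default behaviour: NaN is accepted and becomes a float nan
parsed = json.loads(payload)
cabin_is_nan = isinstance(parsed["Cabin"], float) and math.isnan(parsed["Cabin"])


def reject_constant(name):
    """Hypothetical hook that refuses NaN, Infinity and -Infinity."""
    raise ValueError("non-standard JSON constant: {}".format(name))


# Emulating a strict parser with the parse_constant hook
try:
    json.loads(payload, parse_constant=reject_constant)
    strict_ok = True
except ValueError:
    strict_ok = False

# The standard-compliant spelling of a missing value is null
parsed_null = json.loads('{"Age": 22.0, "Cabin": null}')

print(cabin_is_nan, strict_ok, parsed_null["Cabin"])
```

If a sender uses `null`, the value arrives as `None`, so you would have to decide how to map it to a missing value before prediction.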

Not so fast... scikit-learn models don't know how to deal with dictionaries! We know that when we trained the model, the pipeline took a pandas dataframe, so that's what we should pass into the pipeline's `predict_proba` as well.

With that in mind, let's take a few lines of code to transform the dictionary
into a pandas dataframe. Note that a series isn't good enough, it must be
a full dataframe, even if it's just for a single observation.

The first step is to create a dataframe with the columns in the correct order. You can get the correct order by getting the columns from the `X_train` dataframe with which the model was trained. We're passing the dictionary in a list to ensure row indexing (see discussion [here](https://stackoverflow.com/questions/17839973/constructing-dataframe-from-values-in-variables-yields-valueerror-if-using-all)).

In [15]:
obs = pd.DataFrame([new_obs_dict], columns=X_train.columns.tolist())

Now you need to make sure that the column types match the ones the pipeline saw during training.

In [16]:
obs = obs.astype(X_train.dtypes)
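
Here is the same two-step recipe on a made-up frame, so you can see what each step fixes. The `train` frame and the `new` dictionary are hypothetical; notice that the dictionary keys arrive in a different order and that `Pclass` arrives as a string.

```python
import pandas as pd

# Hypothetical stand-in for the training features
train = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Fare": [7.25, 71.2833],
})

# Keys in a different order, and Pclass sent as text
new = {"Fare": 8.05, "Sex": "male", "Pclass": "3"}

# Step 1: wrapping the dict in a list gives one row; columns= fixes the order
row = pd.DataFrame([new], columns=train.columns.tolist())

# Step 2: cast every column to the dtype used during training
row = row.astype(train.dtypes)

print(row)
print(row.dtypes)
```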

In [17]:
obs

| | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | NaN | S |


Finally you can call `predict_proba` and the pipeline will output the probabilities of the negative and positive class, in that order (did not survive and survived in this case).

In [18]:
pipeline.predict_proba(obs)

array([[0.90828634, 0.09171366]])
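
If you'd rather not rely on remembering the column order, the classifier's `classes_` attribute tells you which column belongs to which label. A small sketch on made-up data (`X_toy`, `y_toy` and `toy_clf` are hypothetical names):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: small values belong to class 0, large ones to class 1
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

toy_clf = LogisticRegression().fit(X_toy, y_toy)

# One row per observation, one column per class, in the order of classes_
proba = toy_clf.predict_proba(np.array([[0.0]]))
classes = toy_clf.classes_.tolist()

# Look up the positive class column by label instead of hard-coding index 1
p_positive = proba[0, classes.index(1)]

print(classes)
print(proba)
print(p_positive)
```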

Alright, we are feeling pretty cool right now. We have a trained model and we can process new observations! Now that we have this under control, let's learn how to preserve our model so that we can keep this sweetness for posterity.

## 2. Serialization of the necessary components

Okay let's take a moment to imagine that we will need to use this model in a totally different environment than the one it was trained in.

For instance, if we want to share the model with a friend: we wouldn't want them to retrain it (some models take hours to train), and we wouldn't want to have to send them the training data (it might be very large and it might be confidential data that shouldn't be shared).

Even within the same computer, we may want to use the model in a different notebook than the one it was trained in. Or in a different Python process. Or in a Flask server (hint hint).

What we want to do is save the model to disk so that it can be transferred somewhere else and used later on.

Remember that serialization is just the process of storing the state of an object so that it can be used later on. Let's think about what we need in order to be able to call `predict_proba` on a new observation. It is:

1. The column names in the correct order
1. The fitted pipeline
1. The dtypes of the columns of the training set

One at a time, let's look at serializing these.

### 2.1 Serializing the columns in the correct order

Probably the most well-known serialization format for data is JSON. This is great because it's robust and technology agnostic. Let's serialize the columns of the training set in the correct order:

In [19]:
with open('columns.json', 'w') as fh:
    json.dump(X_train.columns.tolist(), fh)

The column names are just strings, so this simply writes them into a text file in JSON format.
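
A quick way to convince yourself that nothing is lost is to write the list and read it straight back. This sketch uses a temporary directory so it doesn't touch the `columns.json` we just created; the hard-coded `columns` list simply mirrors the feature names shown earlier.

```python
import json
import os
import tempfile

columns = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Cabin", "Embarked"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "columns.json")

    # Write the list as JSON text
    with open(path, "w") as fh:
        json.dump(columns, fh)

    # Read it back into a plain Python list
    with open(path) as fh:
        restored = json.load(fh)

print(restored == columns)
```

JSON arrays preserve order, which is why a plain list is enough to remember the column order.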

### 2.2 Serializing the fitted pipeline

Now we need to serialize the fitted pipeline. Unfortunately we don't have something as clean as JSON for this since the pipeline is a Python object and not just text like the column names.

In order to preserve the pipeline, we will need to use a library to export a Python object (the fitted pipeline) onto the disk in such a way that Python can reload it again. The most common library for serialization of Python objects is [pickle](https://docs.python.org/3/library/pickle.html) which is part of the Python standard library. However, the scikit-learn documentation also covers [joblib](http://scikit-learn.org/stable/modules/model_persistence.html), a library whose `dump` and `load` functions work like pickle's but are more efficient for objects that carry large NumPy arrays, as fitted scikit-learn estimators often do, so we'll use it.

In [20]:
import joblib
joblib.dump(pipeline, 'pipeline.pickle') 

['pipeline.pickle']
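
As a quick sanity check, a dumped estimator can be loaded again with `joblib.load` and should give the same predictions. This is only a sketch on a made-up model written to a temporary directory (`X_toy`, `y_toy`, `fitted` and the `toy_model.pickle` file name are hypothetical), not the Titanic pipeline itself:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy model
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
fitted = LogisticRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pickle")
    joblib.dump(fitted, path)      # write the fitted estimator to disk
    reloaded = joblib.load(path)   # read it back as a new object

# The reloaded estimator behaves like the original one
same_predictions = np.allclose(
    fitted.predict_proba(X_toy), reloaded.predict_proba(X_toy)
)
print(same_predictions)
```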

**NOTE**: One thing to consider when serializing/deserializing scikit-learn estimators is that joblib/pickle don't store the definitions of the estimators (the code). This means two things:

- All of the libraries you use to build the pipeline on your laptop need to be installed on the machine where you deploy it, ideally in the same versions.

- All of the custom code you use in the pipeline (for example any custom transformers) needs to be defined or importable when you unpickle the pipeline.
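
You can see this for yourself with plain pickle. The sketch below pickles a `Fraction` from the standard library (chosen only as a convenient stand-in for an estimator) and looks at the resulting bytes: they contain the module and class names, which is the reference pickle uses to find the code again when loading.

```python
import pickle
from fractions import Fraction

payload = pickle.dumps(Fraction(3, 4))

# The bytes reference the class by module and name...
mentions_module = b"fractions" in payload
mentions_class = b"Fraction" in payload

# ...and are far too short to contain the class's source code
payload_size = len(payload)

# Loading works here because the fractions module is importable
restored = pickle.loads(payload)

print(mentions_module, mentions_class, payload_size)
print(restored)
```

On a machine where the referenced module cannot be imported, the load step would fail, which is exactly the situation the note above warns about.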

### 2.3 Serializing the dtypes of the columns

We have a similar situation with the dtypes. When you call `X_train.dtypes`, you get a pandas Series of dtype
objects rather than plain text, so we use pickle to serialize them as well.

In [21]:
with open('dtypes.pickle', 'wb') as fh:
    pickle.dump(X_train.dtypes, fh)
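
To check that the dtypes survive the trip, here is a small in-memory round trip on a made-up frame (the `frame` variable is hypothetical; `pickle.dumps` and `pickle.loads` do the same as `dump` and `load` but without a file):

```python
import pickle

import pandas as pd

# Hypothetical one-row frame with an integer, a string and a float column
frame = pd.DataFrame({"Pclass": [3], "Sex": ["male"], "Fare": [7.25]})

payload = pickle.dumps(frame.dtypes)
restored = pickle.loads(payload)

# The result is again a Series indexed by column name
kind = type(restored).__name__
same_dtypes = restored.to_dict() == frame.dtypes.to_dict()

print(kind)
print(restored)
print(same_dtypes)
```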

Alrighty then, we have now serialized all the necessary components
of our model!

Move on to the next notebook to see how we can deserialize and use
all of this work in a different process (notebook).