### Persistence of Vectorizer and Classifier

So far, we have trained the vectorizer, feature selector and classifier and then used them immediately. In practice, this is one of the least likely scenarios. You generally train your classifier once and then want to reuse it as often as you'd like. In order to do so, we need to persist the vectorizer, feature selector and classifier.


#### Pipeline Serialization

In the last notebook, I showed that the `Pipeline` structure gives a nice way to combine the vectorizer, feature selector and classifier into a single component (for a sequential pipeline). Instead of serializing two or possibly three structures, we will use the `Pipeline` to handle serialization and deserialization. Everything I show in this notebook also applies to the individual components of the system (namely the vectorizer, feature selector and classifier) when they are persisted separately. However, a `Pipeline` is the preferred way to persist your whole __machine learning pipeline__.

In [1]:
%matplotlib inline
import csv
import matplotlib.pyplot as plt
import numpy as np
import os
from sklearn import cross_validation
from sklearn import ensemble
from sklearn.feature_extraction import text
from sklearn import feature_extraction
from sklearn import feature_selection
from sklearn import linear_model
from sklearn import metrics
from sklearn import naive_bayes
from sklearn import pipeline
from sklearn import svm
from sklearn import tree
from sklearn import externals

_DATA_DIR = 'data'
_NYT_DATA_PATH = os.path.join(_DATA_DIR, 'nyt_title_data.csv')
_SERIALIZATION_DIR = 'serializations'
_SERIALIZED_PIPELINE_NAME = 'pipe.pickle'
_SERIALIZATION_PATH = os.path.join(_SERIALIZATION_DIR, _SERIALIZED_PIPELINE_NAME)

In [2]:
with open(_NYT_DATA_PATH) as nyt:
    nyt_data = []
    nyt_labels = []
    csv_reader = csv.reader(nyt)
    for line in csv_reader:
        nyt_labels.append(int(line[0]))
        nyt_data.append(line[1])

In [3]:
X = np.array([''.join(el) for el in nyt_data])
y = np.array([el for el in nyt_labels])

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y)

vectorizer = text.TfidfVectorizer(min_df=2,
                                  ngram_range=(1, 2),
                                  stop_words='english',
                                  strip_accents='unicode',
                                  norm='l2')

In [4]:
# The final step is a ridge classifier, so name it generically rather than "svm".
pipe = pipeline.Pipeline([("vectorizer", vectorizer), ("classifier", linear_model.RidgeClassifier())])

In [5]:
pipe.fit(X_train, y_train)

Pipeline(steps=[('vectorizer', TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=2,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf...copy_X=True, fit_intercept=True,
        max_iter=None, normalize=False, solver='auto', tol=0.001))])

Up to here, everything is the same as in the previous notebook, so there should be no surprises.

#### Create a Serialization Directory if it does not exist already

In [6]:
if not os.path.exists(_SERIALIZATION_DIR):
    os.makedirs(_SERIALIZATION_DIR)
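
On Python 3.2 and later, the check and the creation can be collapsed into one call with `exist_ok=True`, which also avoids the small window in which another process could create the directory between the two steps. A minimal sketch, using a temporary base directory as a stand-in for the notebook's working directory:

```python
import os
import tempfile

# Hypothetical location for this sketch; the notebook uses 'serializations'
# relative to the working directory.
base_dir = tempfile.mkdtemp()
serialization_dir = os.path.join(base_dir, "serializations")

# Creates the directory (and any missing parents) if needed.
os.makedirs(serialization_dir, exist_ok=True)

# Calling it again is harmless: no error is raised when the directory exists.
os.makedirs(serialization_dir, exist_ok=True)
```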

In [7]:
externals.joblib.dump(pipe, _SERIALIZATION_PATH)

['serializations/pipe.pickle',
 'serializations/pipe.pickle_01.npy',
 'serializations/pipe.pickle_02.npy',
 'serializations/pipe.pickle_03.npy',
 'serializations/pipe.pickle_04.npy',
 'serializations/pipe.pickle_05.npy']

>  joblib.dump returns a list of filenames. Each individual numpy array contained in the clf object is serialized as a separate file on the filesystem. All files are required in the same folder when reloading the model with joblib.load.

At this point, serialization is complete and the pipeline is ready for deployment. If we were not using `Pipeline`, we would need at least two serializations, one for the vectorizer and one for the classifier (the feature selector would be a third if one is used). In order to deploy the model, let's deserialize the `pipeline` in a very similar manner.

In [8]:
pipe = externals.joblib.load(_SERIALIZATION_PATH)

In [9]:
pipe

Pipeline(steps=[('vectorizer', TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=2,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf...copy_X=True, fit_intercept=True,
        max_iter=None, normalize=False, solver='auto', tol=0.001))])

We successfully persisted our pipeline and loaded it back into the namespace, ready to be applied to our test set.
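
Newer scikit-learn releases no longer ship `sklearn.externals.joblib`, so there the same round trip is usually written with the standalone `joblib` package. The following is a minimal sketch under that assumption. The toy titles and their integer labels are made up for illustration and are not the NYT data, and the temporary directory stands in for `_SERIALIZATION_DIR`.

```python
import os
import tempfile

import joblib  # standalone package, assumed to be installed
from sklearn import linear_model, pipeline
from sklearn.feature_extraction import text

# Toy corpus with arbitrary integer labels (illustrative only).
titles = [
    "senate passes budget bill",
    "president signs budget bill",
    "team wins championship game",
    "coach praises team after game",
]
labels = [20, 20, 29, 29]

pipe = pipeline.Pipeline([
    ("vectorizer", text.TfidfVectorizer()),
    ("classifier", linear_model.RidgeClassifier()),
])
pipe.fit(titles, labels)

# Persist the fitted pipeline to disk.
serialization_path = os.path.join(tempfile.mkdtemp(), "pipe.pickle")
joblib.dump(pipe, serialization_path)

# Load it back and confirm that it predicts exactly like the original.
restored = joblib.load(serialization_path)
same_predictions = bool((restored.predict(titles) == pipe.predict(titles)).all())
print(same_predictions)
```

Comparing the predictions of the original and the restored pipeline is a cheap sanity check to run right after serialization, before the file is shipped anywhere.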

> Never, ever unpickle untrusted data! `pickle` (Python's serialization module), which `joblib` uses under the hood, can execute arbitrary code while loading, so a malicious file can compromise your system.
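
One common precaution is to record a checksum of the serialized file when it is written and to refuse to load a file whose checksum differs. The sketch below illustrates the idea with the standard library only. The helper names `sha256_of_file`, `load_if_trusted` and `pickle_loader` are hypothetical, and a small dictionary stands in for the fitted pipeline. This does not make pickle safe: it only detects a file that changed after the digest was recorded, and the digest itself has to come from a source you trust.

```python
import hashlib
import hmac
import os
import pickle
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_if_trusted(path, expected_sha256, loader):
    """Call loader(path) only if the file matches the expected digest."""
    actual = sha256_of_file(path)
    if not hmac.compare_digest(actual, expected_sha256):
        raise ValueError("checksum mismatch, refusing to load %s" % path)
    return loader(path)


def pickle_loader(path):
    """Plain pickle loader used as a stand-in for joblib.load."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Write a stand-in "model" and record its digest at serialization time.
model_stub = {"weights": [0.1, 0.2]}
model_path = os.path.join(tempfile.mkdtemp(), "model.pickle")
with open(model_path, "wb") as fh:
    pickle.dump(model_stub, fh)
trusted_digest = sha256_of_file(model_path)

# Later, at deployment time, load only if the digest still matches.
restored = load_if_trusted(model_path, trusted_digest, pickle_loader)
```

In this notebook, the `loader` argument would be `externals.joblib.load`, and every file written by `joblib.dump` would need its own digest.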

Let's check that the pipeline works on the test dataset.

In [10]:
# Store predictions under a new name so the true labels in y_test are not overwritten.
y_pred = pipe.predict(X_test)

In [11]:
y_pred

array([12, 19, 16, 19, 16, 16, 19, 19, 19, 19, 20, 16, 19, 16, 16, 16, 19,
       19, 16, 19, 19, 19, 19,  3, 19, 15, 19, 12, 19, 15, 15, 12, 15,  3,
       20, 16,  3, 19, 20, 16, 19, 16, 19, 19, 20, 16, 29, 19, 20, 16, 19,
       16, 20, 19, 19, 20, 19,  3, 29, 16, 16, 19, 20, 20, 15,  3, 12, 16,
       19, 19, 16, 19, 19, 16, 16, 16, 29, 19, 20, 19, 19, 20, 19, 19, 20,
       16, 20, 19, 19, 19, 29, 16, 20, 19, 19, 16, 29, 19, 12, 19, 16, 16,
       19, 19, 16, 19, 29, 20, 16, 19, 19, 19, 15, 20, 19, 20, 20, 16, 19,
        3, 16, 19, 20, 19, 16, 19, 19, 29, 19, 19, 19, 20, 15, 19, 16, 20,
       16, 20, 19, 20, 19, 20, 29, 16, 19, 16, 16, 19, 12, 20, 19, 19, 19,
       20, 16, 19, 12, 20, 19, 15, 20, 16, 16, 19, 19, 15, 29, 20, 12, 16,
       20, 15, 19, 12, 20, 19, 19, 20, 12, 20, 16, 20, 20, 16, 20, 19, 19,
       19, 19, 16, 19, 19,  3, 16, 20, 19, 16, 20, 12, 16, 16,  3, 16, 19,
       20, 20, 19, 19, 19, 20, 19, 16, 16, 12, 19, 19, 12,  3, 16, 29, 16,
       12, 16, 19,  3,  3
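
Raw predictions only show that the restored pipeline runs. To judge its quality, the predictions are usually compared against the held-out labels. A minimal sketch with `metrics.accuracy_score`, again on a made-up toy corpus with arbitrary integer labels rather than the NYT data:

```python
from sklearn import linear_model, metrics, pipeline
from sklearn.feature_extraction import text

# Toy train/test split (illustrative only).
train_titles = [
    "senate passes budget bill",
    "president signs budget bill",
    "team wins championship game",
    "coach praises team after game",
]
train_labels = [20, 20, 29, 29]
test_titles = ["senate debates new bill", "team loses final game"]
test_labels = [20, 29]

pipe = pipeline.Pipeline([
    ("vectorizer", text.TfidfVectorizer()),
    ("classifier", linear_model.RidgeClassifier()),
])
pipe.fit(train_titles, train_labels)

# Keep predictions and true labels in separate variables so they can be compared.
y_pred = pipe.predict(test_titles)
accuracy = metrics.accuracy_score(test_labels, y_pred)
print(accuracy)
```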

### Takeaways
- `Pipeline` is not just useful for abstraction; it also makes the model easier to maintain, persist and deploy.
- Do not use a lossy compression technique to compress the serialized files. (Tarballs would work fine.)
- If you want to recreate the exact environment you trained in, always use `virtualenv` and note the version numbers of the libraries that you are using. (See Notebook 0.)
- Some of the algorithms support a `partial_fit` function for online learning ([SGD Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html), [Perceptron](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html#sklearn.linear_model.Perceptron), [MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB)). If you receive data incrementally and want to improve your classifier over time, you may want to persist your models and then call `partial_fit` on them when new data arrives. Works like a charm!
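
The last takeaway can be sketched as follows. This is an illustration under a few assumptions, not the setup used in this notebook: it uses the standalone `joblib` package, swaps `TfidfVectorizer` for the stateless `HashingVectorizer` so that no vocabulary has to be refitted between updates, and keeps the vectorizer and classifier separate because `Pipeline` itself does not expose `partial_fit`. The titles and labels are made up.

```python
import os
import tempfile

import joblib  # standalone package, assumed to be installed
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stateless vectorizer: transform() works without a prior fit().
vectorizer = HashingVectorizer(n_features=2 ** 10)

# partial_fit needs the full set of classes on its first call.
all_classes = np.array([16, 19])

# First batch of (made-up) training data.
first_titles = ["markets rally on strong earnings", "election results announced tonight"]
first_labels = [16, 19]
clf = SGDClassifier(random_state=0)
clf.partial_fit(vectorizer.transform(first_titles), first_labels, classes=all_classes)

# Persist the partially trained classifier.
model_path = os.path.join(tempfile.mkdtemp(), "sgd.pickle")
joblib.dump(clf, model_path)

# Later: reload, update with a new batch, and persist again.
clf = joblib.load(model_path)
new_titles = ["earnings season lifts markets", "voters await election results"]
new_labels = [16, 19]
clf.partial_fit(vectorizer.transform(new_titles), new_labels)
joblib.dump(clf, model_path)

predictions = clf.predict(vectorizer.transform(["markets react to earnings"]))
```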