# Pipeline and GridSearch: two great tools of scikit-learn

The `Pipeline` object chains a sequence of transforms with a final estimator.

A pipeline sequentially applies a list of transforms and then a final estimator. Intermediate steps of the pipeline must be "transforms", that is, they must implement the `fit` and `transform` methods. The final estimator only needs to implement `fit`.
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.
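
To make the `fit`/`transform` contract concrete, here is a minimal pure-Python sketch of what a pipeline does when it is fitted and used for prediction. It is not scikit-learn's implementation: `MiniPipeline`, `Lowercase`, `WordCounts` and `OverlapClassifier` are hypothetical toy classes invented for this illustration, and the four short documents are made up.

```python
class Lowercase:
    """Toy transform: lowercases every document (nothing to learn)."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [doc.lower() for doc in X]


class WordCounts:
    """Toy transform: learns a vocabulary, then maps documents to count vectors."""
    def fit(self, X, y=None):
        self.vocabulary_ = sorted({word for doc in X for word in doc.split()})
        return self

    def transform(self, X):
        return [[doc.split().count(word) for word in self.vocabulary_] for doc in X]


class OverlapClassifier:
    """Toy final estimator: picks the class whose summed training counts overlap most."""
    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        width = len(X[0])
        self.totals_ = {c: [0] * width for c in self.classes_}
        for row, label in zip(X, y):
            self.totals_[label] = [t + v for t, v in zip(self.totals_[label], row)]
        return self

    def predict(self, X):
        def score(row, c):
            return sum(v * t for v, t in zip(row, self.totals_[c]))
        return [max(self.classes_, key=lambda c: score(row, c)) for row in X]


class MiniPipeline:
    """Hypothetical stand-in for a pipeline: a list of (name, object) steps."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        # Every intermediate step is fitted, then used to transform the data
        for _, transform in self.steps[:-1]:
            X = transform.fit(X, y).transform(X)
        # The final estimator only needs fit
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # At prediction time the already-fitted transforms are only applied
        for _, transform in self.steps[:-1]:
            X = transform.transform(X)
        return self.steps[-1][1].predict(X)


docs = ["Pixels and SHADERS", "shaders render pixels",
        "Doctors treat patients", "patients need DOCTORS"]
labels = ["graphics", "graphics", "medicine", "medicine"]

pipe = MiniPipeline([("lower", Lowercase()),
                     ("counts", WordCounts()),
                     ("clf", OverlapClassifier())])
pipe.fit(docs, labels)
predictions = pipe.predict(["render the pixels", "the doctors"])
print(predictions)
```

The key point of the sketch is that the transforms are fitted only inside `fit`, so the vocabulary learnt from the training documents is reused unchanged when predicting on new text.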

In [4]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.datasets import fetch_20newsgroups

In [5]:
# Load the training and test posts for four of the newsgroup categories:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

## 1. Building a pipeline

In [7]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

The names **vect**, **tfidf** and **clf** (classifier) are arbitrary. We shall see their use in the section on grid search, below. We can now train the model with a single command:

In [8]:
# Everything in one step!
text_clf = text_clf.fit(train.data, train.target)

In [9]:
# Prediction is also done in a single step
predicted = text_clf.predict(test.data)

In [10]:
np.mean(predicted == test.target)

0.83488681757656458
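
The expression `np.mean(predicted == test.target)` works because the comparison yields one boolean per document and the mean of those booleans is the fraction of correct predictions. A small pure-Python sketch of the same computation, using made-up labels (the `accuracy` helper and the toy lists are assumptions for illustration only):

```python
def accuracy(predicted, expected):
    """Fraction of positions where the predicted label equals the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    matches = [p == e for p, e in zip(predicted, expected)]
    # True counts as 1 and False as 0, so the sum is the number of hits
    return sum(matches) / len(matches)


toy_predicted = [0, 1, 2, 2, 3]
toy_expected = [0, 1, 2, 1, 3]
toy_accuracy = accuracy(toy_predicted, toy_expected)  # 4 of the 5 labels match
print(toy_accuracy)
```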

**With Pipelines we can build new models very quickly**

We achieved 83.5% accuracy. Let’s see if we can do better with a linear support vector machine (SVM), which is widely regarded as one of the best text classification algorithms (although it’s also a bit slower than naïve Bayes). We can change the learner by just plugging a different classifier object into our pipeline:

In [11]:
from sklearn.linear_model import SGDClassifier

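# Recent scikit-learn releases use max_iter instead of the removed n_iter argument;
# tol=None disables early stopping so that exactly 5 passes over the data are made.
text_clf2 = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                                            max_iter=5, tol=None, random_state=42))])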

text_clf2.fit(train.data, train.target)
predicted2 = text_clf2.predict(test.data)
np.mean(predicted2 == test.target)

0.9127829560585885

## 2. GridSearch: a nice way to tune parameters

Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes. Typical examples include C, kernel and gamma for Support Vector Classifier, alpha for Lasso, etc.

It is possible and recommended to search the hyper-parameter space for the best cross-validation score.
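
The idea of picking a hyper-parameter by its cross-validation score can be sketched without any library. The example below is only an illustration under stated assumptions: `k_fold_indices` and `fit_and_score` are hypothetical helpers, the folds are contiguous and unshuffled, the data is a made-up one-dimensional toy set, and `offset` plays the role of a hyper-parameter that is chosen by us rather than learnt.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


def fit_and_score(offset, x, y, train, validation):
    """Fit a toy threshold model on the training fold, return validation accuracy."""
    class0 = [x[i] for i in train if y[i] == 0]
    class1 = [x[i] for i in train if y[i] == 1]
    # Learnt from the training fold: the midpoint between the two class means
    midpoint = (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2
    # Not learnt: the hyper-parameter shifts the decision threshold
    threshold = midpoint + offset
    hits = [int(x[i] > threshold) == y[i] for i in validation]
    return sum(hits) / len(hits)


x = [1, 9, 2, 8, 3, 7]
y = [0, 1, 0, 1, 0, 1]

cv_scores = {}
for offset in (-3, 0, 3):
    fold_scores = [fit_and_score(offset, x, y, train, validation)
                   for train, validation in k_fold_indices(len(x), n_folds=3)]
    cv_scores[offset] = sum(fold_scores) / len(fold_scores)

# The candidate with the highest mean validation score wins
best_offset = max(cv_scores, key=cv_scores.get)
print(best_offset, cv_scores)
```

Each candidate value is scored only on data that was held out while the model was fitted, which is what makes the comparison between candidates fair.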

In [12]:
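# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV now lives in model_selection
from sklearn.model_selection import GridSearchCV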

In [13]:
# Use the <step name>__<parameter name> notation to indicate which parameter we are referring to:
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3)}
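
To see what this grid amounts to, the following pure-Python sketch enumerates every combination and splits each `step__parameter` key back into its step name and parameter name. It only mimics the idea; `expand_grid` and `route_to_steps` are hypothetical helpers written for this illustration and are not part of scikit-learn.

```python
from itertools import product

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3)}


def expand_grid(param_grid):
    """Return every combination of the grid as a list of dicts."""
    names = sorted(param_grid)
    return [dict(zip(names, values))
            for values in product(*(param_grid[name] for name in names))]


def route_to_steps(candidate):
    """Split 'step__param' keys into {step: {param: value}}."""
    routed = {}
    for key, value in candidate.items():
        step, _, param = key.partition('__')
        routed.setdefault(step, {})[param] = value
    return routed


candidates = expand_grid(parameters)
n_candidates = len(candidates)  # 2 * 2 * 2 combinations
first_routed = route_to_steps(candidates[0])
print(n_candidates)
print(first_routed)
```

Because every combination is tried and each one is cross-validated, the cost of a grid search grows multiplicatively with the number of values listed per parameter.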

In [14]:
gs_clf = GridSearchCV(text_clf, param_grid=parameters, n_jobs=-1)

In [15]:
gs_clf.fit(train.data, train.target)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        st...False,
         use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False), 'clf__alpha': (0.01, 0.001)},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [18]:
gs_clf.best_estimator_

Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        st...se,
         use_idf=False)), ('clf', MultinomialNB(alpha=0.001, class_prior=None, fit_prior=True))])


In [19]:
gs_clf.best_params_

{'clf__alpha': 0.001, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)}

In [20]:
gs_clf.best_score_

0.97873283119184762

Note that `best_score_` is the mean cross-validated score obtained on the training data, so it is not directly comparable to the test-set accuracies computed earlier.