# Pipelines and Feature Extraction with Sklearn

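This notebook shows how to build a scikit-learn pipeline that chains text feature extraction with a classifier, and how to tune the whole pipeline with a grid search. It uses the following components: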

* CountVectorizer: convert a collection of text documents to a matrix of token counts
* TfidfTransformer: transform a count matrix to a normalized tf or tf-idf representation
* SGDClassifier: linear classifiers (SVM, logistic regression, among others) trained with stochastic gradient descent (SGD)
* GridSearchCV: exhaustive search over specified parameter values for an estimator (model)
* Pipeline: a chain of transforms with a final estimator
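
As a rough illustration of the first two components, the sketch below runs `CountVectorizer` and `TfidfTransformer` on a three-document toy corpus. The corpus and the variable names are made up for this example and are not part of the 20 newsgroups data used later.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, for illustration only
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Step 1: token counts, one row per document and one column per term
vect = CountVectorizer()
counts = vect.fit_transform(docs)

# Step 2: reweight the counts with tf-idf (rows are L2-normalized by default)
tfidf = TfidfTransformer().fit_transform(counts)

vocab = sorted(vect.vocabulary_)  # the learned terms, in column order
the_count = counts[0, vect.vocabulary_["the"]]  # occurrences of "the" in the first document

dense = tfidf.toarray()
row_norms = (dense ** 2).sum(axis=1) ** 0.5

print(vocab)
print(counts.shape, the_count)
print(row_norms)
```

With the default tokenizer, only tokens of two or more characters are kept, so the vocabulary here has ten terms. Both matrices are sparse, which matters once the corpus has tens of thousands of terms.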

## Imports

In [None]:
from pprint import pprint
from time import time

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

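# Silence all warnings (for example convergence warnings from SGDClassifier) to keep the output short
import warnings
warnings.filterwarnings('ignore')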

## Import Data

In [None]:
# Load some categories from the training set
categories = [
    'alt.atheism',
    'talk.religion.misc',
]

In [None]:
print("Loading 20 newsgroups dataset for categories:")
print(categories)

In [None]:
data = fetch_20newsgroups(subset='train', categories=categories)
print("%d documents" % len(data.filenames))
print("%d categories" % len(data.target_names))

## Define a Pipeline

In [None]:
# Define a pipeline combining a text feature extractor with a simple classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
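
The step names given to `Pipeline` are what make the individual stages addressable: a parameter of a step is referred to as `<step name>__<parameter name>`. The following minimal sketch rebuilds the same pipeline and sets two parameters through that naming scheme; the chosen values are arbitrary and only serve as an illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

# Set parameters of individual steps via the "<step>__<param>" convention
# (values are arbitrary examples)
pipeline.set_params(vect__ngram_range=(1, 2), clf__alpha=1e-5)

# The same steps are reachable by name
step_names = [name for name, _ in pipeline.steps]
print(step_names)
print(pipeline.named_steps['vect'].ngram_range)
print(pipeline.get_params()['clf__alpha'])
```

Grid search relies on the same convention, which is why a single search can cover the vectorizer and the classifier together.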

## Select Parameters to Use

In [None]:
# Uncommenting more parameters widens the search, but the number of candidates
# (and with it the run time) grows multiplicatively with every added parameter
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    #'clf__max_iter': (10, 50, 80),  # called n_iter in old scikit-learn releases
}
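
To see how quickly the grid grows, the candidates can be counted before running anything. The sketch below uses `ParameterGrid` from `sklearn.model_selection` on a copy of the active (uncommented) parameters above; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the active parameters from the grid above
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
}

grid = ParameterGrid(parameters)
n_candidates = len(grid)  # 3 * 2 * 2 * 2
print("%d candidates" % n_candidates)

# Adding one more parameter with 4 values multiplies the count by 4
bigger = dict(parameters)
bigger['vect__max_features'] = (None, 5000, 10000, 50000)
n_bigger = len(ParameterGrid(bigger))
print("%d candidates with max_features" % n_bigger)
```

Each candidate is fitted once per cross-validation fold, so the total number of fits is the candidate count times the number of folds.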

## Perform the Grid Search

In [None]:
# Find the best parameters for both the feature extraction and the classifier
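# The grid has 3 x 2 x 2 x 2 = 24 candidates, and each one is fitted once per CV fold:
# 120 fits with the default of 5 folds (scikit-learn >= 0.22), 72 fits with the older default of 3 folds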
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

print("Performing grid search...")
print("Pipeline:", [name for name, _ in pipeline.steps])
print("Parameters:")
pprint(parameters)

# Start time
t0 = time()

# Perform the search
grid_search.fit(data.data, data.target)

# Print out the metrics
print("Completed in %0.3fs" % (time() - t0))
print()

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
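
Beyond the best score, the fitted search object exposes per-candidate results in `cv_results_` and can be used directly for prediction (the best pipeline is refit on all the data by default). The sketch below shows both on a tiny made-up corpus so that it runs offline in a moment; the documents, labels, and the reduced grid are assumptions for illustration and do not reproduce the 20 newsgroups results.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: label 0 = pets, label 1 = weather
toy_docs = [
    "my cat chased the mouse", "the dog fetched a stick",
    "kittens purr when happy", "puppies bark at the door",
    "heavy rain fell all night", "the storm brought strong wind",
    "sunny skies are expected", "snow covered the cold hills",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(random_state=0)),
])

# Reduced grid: 2 x 2 = 4 candidates
toy_parameters = {
    'vect__ngram_range': ((1, 1), (1, 2)),
    'clf__penalty': ('l2', 'elasticnet'),
}

# cv=2 keeps two documents of each class in every training fold
toy_search = GridSearchCV(toy_pipeline, toy_parameters, cv=2)
toy_search.fit(toy_docs, toy_labels)

# One mean cross-validated score per candidate
mean_scores = toy_search.cv_results_['mean_test_score']
for params, score in zip(toy_search.cv_results_['params'], mean_scores):
    print("%0.3f  %r" % (score, params))

# The search object predicts with the refitted best pipeline
prediction = toy_search.predict(["rain and wind tonight"])
print(prediction)
```

On a corpus this small the scores are not meaningful; the point is only the shape of the results and the fact that raw text goes in and a class label comes out.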