# Session 11 - Preprocessing

## Session agenda

1. Types of features. Examples of feature extraction from texts, images and other objects.
2. Basics of the sklearn.preprocessing module.
3. Data standardization, normalization and binarization.
4. Working with missing values.
5. Polynomial features and custom transformations.
6. Pipelining data analysis tasks with sklearn.pipeline.

## Types of features. Examples of feature extraction from texts, images and other objects
The sklearn library has a dedicated module, sklearn.feature_extraction, which handles feature extraction for common kinds of objects (e.g. texts and images).

The sklearn.feature_extraction module provides vectorizer and transformer classes that share the same basic API and let you extract and transform features from different kinds of objects. These classes use NumPy arrays and SciPy sparse matrices to represent the underlying data.

This functionality is somewhat complementary to that of pandas, so you have to decide which library to use for your data preprocessing and, if necessary, convert from one data structure to the other.

In [1]:
from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]

vector = DictVectorizer()

print(vector.fit_transform(measurements).toarray())
# Note: get_feature_names() was removed in scikit-learn 1.2;
# on newer versions call get_feature_names_out() instead.
print(vector.get_feature_names())

[[ 1.  0.  0. 33.]
 [ 0.  1.  0. 12.]
 [ 0.  0.  1. 18.]]
['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']
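
The conversion between pandas and scikit-learn data structures mentioned earlier could look roughly like the sketch below. The DataFrame `df` and the variable names are assumptions made for illustration; `to_dict(orient="records")` and the `feature_names_` attribute are used because they are available across library versions.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical DataFrame mirroring the measurements list used above
df = pd.DataFrame({
    "city": ["Dubai", "London", "San Francisco"],
    "temperature": [33.0, 12.0, 18.0],
})

# pandas -> list of dicts -> dense NumPy array
records = df.to_dict(orient="records")
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records)

# NumPy array -> pandas, reusing the learned feature names as column labels
encoded = pd.DataFrame(X, columns=vec.feature_names_)
print(encoded)
```

The string column is one-hot encoded, while the numeric column is passed through unchanged.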


### Extracting features from texts.
We will now look at a couple of examples showing how simple it can be to extract features from texts using sklearn.feature_extraction. We will use typical models such as bag-of-words.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
vectorizer = CountVectorizer()
# CountVectorizer has a very versatile constructor with many configurable parameters.
# Uncomment the next line to display its documentation.
#CountVectorizer?

In [4]:
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

In [5]:
X = vectorizer.fit_transform(corpus)
print(X)

  (0, 1)	1
  (0, 2)	1
  (0, 6)	1
  (0, 3)	1
  (0, 8)	1
  (1, 5)	2
  (1, 1)	1
  (1, 6)	1
  (1, 3)	1
  (1, 8)	1
  (2, 4)	1
  (2, 7)	1
  (2, 0)	1
  (2, 6)	1
  (3, 1)	1
  (3, 2)	1
  (3, 6)	1
  (3, 3)	1
  (3, 8)	1


In [6]:
analyze = vectorizer.build_analyzer()
print(analyze("This is a text document to analyze."))
print(vectorizer.get_feature_names())
print(X.toarray())

['this', 'is', 'text', 'document', 'to', 'analyze']
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]


In [7]:
print(vectorizer.vocabulary_)
print(vectorizer.vocabulary_.get('document'))

{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
1


In [8]:
print(vectorizer.transform(['Something completely new.']).toarray())

[[0 0 0 0 0 0 0 0 0]]


In [9]:
#Count vectorizers also support creation of N-gram models.
#Here is an example of combined 1-gram and 2-gram model.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1)
analyze = bigram_vectorizer.build_analyzer()
print(analyze('Bi-grams are cool!'))

['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool']


In [10]:
X_2 = bigram_vectorizer.fit_transform(corpus)
print(X_2.toarray())

feature_index = bigram_vectorizer.vocabulary_.get('is this')
print(X_2[:, feature_index])

[[0 0 1 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0]
 [0 0 1 0 0 1 1 0 0 2 1 1 1 0 1 0 0 0 1 1 0]
 [1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 1 1 0 0 0]
 [0 0 1 1 1 1 0 1 0 0 0 0 1 1 0 0 0 0 1 0 1]]
  (3, 0)	1


In [11]:
# Character (symbol) N-gram models can be created by specifying the 'char_wb' analyzer.
# More generally, you can build your own model by supplying a custom analyzer.
digram_char_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2), min_df=1)
counts = digram_char_vectorizer.fit_transform(['words', 'wprds'])
print(digram_char_vectorizer.get_feature_names())
print(counts.toarray().astype(int))

[' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp']
[[1 1 1 0 1 1 1 0]
 [1 1 0 1 1 1 0 1]]


### Tf–idf term weighting
In a large text corpus, some words are very frequent (e.g. “the”, “a”, “is” in English) and therefore carry very little meaningful information about the actual contents of a document. If we fed the raw count data directly to a classifier, those very frequent terms would overshadow the frequencies of rarer yet more interesting terms.
In order to re-weight the count features into floating point values suitable for use by a classifier, it is very common to apply the tf–idf transform.
Tf means term-frequency, while tf–idf means term-frequency times inverse document-frequency:
$$\text{tf-idf}(t,d)=\text{tf}(t,d) \times \text{idf}(t).$$
With the default settings of TfidfTransformer, ```TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)```, the term frequency (the number of times a term occurs in a given document) is multiplied by the idf component, which is computed as
$$\text{idf}(t) = \log{\frac{1 + n_d}{1+\text{df}(t)}} + 1,$$
where $n_d$ is the total number of documents and $\text{df}(t)$ is the number of documents that contain term $t$. The resulting tf-idf vectors are then normalized by the Euclidean norm:
$$v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}.$$
This was originally a term weighting scheme developed for information retrieval (as a ranking function for search engine results) that has also found good use in document classification and clustering.
The tf-idf values computed by scikit-learn’s ```TfidfTransformer``` and ```TfidfVectorizer``` differ slightly from the standard textbook notation, which defines the idf as
$$\text{idf}(t) = \log{\frac{n_d}{1+\text{df}(t)}}.$$
In ```TfidfTransformer``` and ```TfidfVectorizer``` with ```smooth_idf=False```, the “1” is added to the idf itself instead of to the idf’s denominator:
$$\text{idf}(t) = \log{\frac{n_d}{\text{df}(t)}} + 1.$$

In [12]:
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer(smooth_idf=False)

counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]

tfidf = transformer.fit_transform(counts)
print(tfidf.toarray())
print(transformer.idf_)

[[0.81940995 0.         0.57320793]
 [1.         0.         0.        ]
 [1.         0.         0.        ]
 [1.         0.         0.        ]
 [0.47330339 0.88089948 0.        ]
 [0.58149261 0.         0.81355169]]
[1.         2.79175947 2.09861229]
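
To make the `smooth_idf=False` formula concrete, the sketch below recomputes the idf values and the first tf-idf row by hand with NumPy, using the same `counts` matrix as the cell above. It is only a cross-check of the formula, not a description of how the library is implemented internally; the variable names are chosen for illustration.

```python
import numpy as np

counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]], dtype=float)

n_d = counts.shape[0]                 # total number of documents
df = (counts > 0).sum(axis=0)         # documents containing each term
idf = np.log(n_d / df) + 1.0          # idf(t) = log(n_d / df(t)) + 1

raw = counts * idf                    # tf(t, d) * idf(t)
# Euclidean (l2) normalization of every row
tfidf = raw / np.linalg.norm(raw, axis=1, keepdims=True)

print(idf)
print(tfidf[0])
```

The first term occurs in all six documents, so its idf is exactly 1; the rarer second and third terms receive larger weights.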


In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_3 = vectorizer.fit_transform(corpus)
print(X_3.toarray())

[[0.         0.43877674 0.54197657 0.43877674 0.         0.
  0.35872874 0.         0.43877674]
 [0.         0.27230147 0.         0.27230147 0.         0.85322574
  0.22262429 0.         0.27230147]
 [0.55280532 0.         0.         0.         0.55280532 0.
  0.28847675 0.55280532 0.        ]
 [0.         0.43877674 0.54197657 0.43877674 0.         0.
  0.35872874 0.         0.43877674]]


### The hashing trick
The above vectorization scheme is simple, but the fact that it holds an in-memory mapping from the string tokens to the integer feature indices (the vocabulary_ attribute) causes several problems when dealing with large datasets:

1. The larger the corpus, the larger the vocabulary grows, and hence the memory use too. Fitting also requires allocating intermediate data structures whose size is proportional to that of the original dataset.
2. Building the word mapping requires a full pass over the dataset, hence it is not possible to fit text classifiers in a strictly online manner.
3. Pickling and un-pickling vectorizers with a large vocabulary_ can be very slow (typically much slower than pickling / un-pickling flat data structures such as a NumPy array of the same size).
4. It is not easy to split the vectorization work into concurrent sub-tasks, because the vocabulary_ attribute would have to be shared state with a fine-grained synchronization barrier: the mapping from token string to feature index depends on the order of first occurrence of each token and hence would have to be shared, potentially harming the concurrent workers’ performance to the point of making them slower than the sequential variant.

It is possible to overcome those limitations by combining the “hashing trick” (feature hashing), implemented by the ```sklearn.feature_extraction.FeatureHasher``` class, with the text preprocessing and tokenization features of ```CountVectorizer```.
This combination is implemented in ```HashingVectorizer```, a transformer class that is mostly API compatible with ```CountVectorizer```. ```HashingVectorizer``` is stateless, meaning that you don’t have to call fit on it:

In [14]:
from sklearn.feature_extraction.text import HashingVectorizer
print('Hashing vectorizer example:')
hv = HashingVectorizer()
print(hv.transform(corpus))

print('Reducing feature space:')
hv = HashingVectorizer(n_features=10)
print(hv.transform(corpus))


Hashing vectorizer example:
  (0, 144749)	0.4472135954999579
  (0, 170062)	0.4472135954999579
  (0, 286878)	-0.4472135954999579
  (0, 351664)	-0.4472135954999579
  (0, 989160)	-0.4472135954999579
  (1, 144749)	0.35355339059327373
  (1, 170062)	0.35355339059327373
  (1, 286878)	-0.35355339059327373
  (1, 351664)	-0.35355339059327373
  (1, 544379)	0.7071067811865475
  (2, 178949)	0.5
  (2, 180525)	-0.5
  (2, 286878)	-0.5
  (2, 948532)	-0.5
  (3, 144749)	0.4472135954999579
  (3, 170062)	0.4472135954999579
  (3, 286878)	-0.4472135954999579
  (3, 351664)	-0.4472135954999579
  (3, 989160)	-0.4472135954999579
Reducing feature space:
  (0, 2)	0.0
  (0, 6)	-0.5773502691896258
  (0, 7)	0.5773502691896258
  (0, 8)	-0.5773502691896258
  (1, 2)	0.0
  (1, 5)	0.8164965809277261
  (1, 7)	0.4082482904638631
  (1, 8)	-0.4082482904638631
  (2, 1)	0.5
  (2, 4)	-0.5
  (2, 5)	-0.5
  (2, 8)	-0.5
  (3, 2)	0.0
  (3, 6)	-0.5773502691896258
  (3, 7)	0.5773502691896258
  (3, 8)	-0.5773502691896258
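
The idea behind the hashing trick can be sketched in a few lines of plain Python. This is a simplified illustration, not scikit-learn's implementation: `hashed_counts` is a hypothetical helper, and `md5` is used here only as a convenient deterministic stand-in for the hash function.

```python
import hashlib

def hashed_counts(tokens, n_features=10):
    """Map tokens to a fixed-length vector without storing a vocabulary."""
    vec = [0.0] * n_features
    for tok in tokens:
        # md5 is only a deterministic stand-in for the hash function here
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        index = h % n_features                    # column chosen by the hash
        sign = 1.0 if (h >> 63) == 0 else -1.0    # top bit decides the sign
        vec[index] += sign
    return vec

doc_a = hashed_counts("this is the first document".split())
doc_b = hashed_counts("is this the first document".split())
print(doc_a)
print(doc_a == doc_b)  # same bag of words, so the same vector
```

Because the column index depends only on the token itself, no fitting pass is needed and documents can be processed independently. The price is that different tokens may collide in one column; with a signed hash, colliding tokens can cancel each other, which is a likely explanation for the stored `0.0` entries in the `n_features=10` output above.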


In [15]:
# Author: Lars Buitinck
# License: BSD 3 clause

from __future__ import print_function
from collections import defaultdict
import re
import sys
from time import time

import numpy as np

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction import DictVectorizer, FeatureHasher


def n_nonzero_columns(X):
    """Returns the number of non-zero columns in a CSR matrix X."""
    return len(np.unique(X.nonzero()[1]))


def tokens(doc):
    """Extract tokens from doc.

    This uses a simple regex to break strings into tokens. For a more
    principled approach, see CountVectorizer or TfidfVectorizer.
    """
    return (tok.lower() for tok in re.findall(r"\w+", doc))


def token_freqs(doc):
    """Extract a dict mapping tokens from doc to their frequencies."""
    freq = defaultdict(int)
    for tok in tokens(doc):
        freq[tok] += 1
    return freq


categories = [
    'alt.atheism',
    'comp.graphics',
    'comp.sys.ibm.pc.hardware',
    'misc.forsale',
    'rec.autos',
    'sci.space',
    'talk.religion.misc',
]
# Uncomment the following line to use a larger set (11k+ documents)
#categories = None

print(__doc__)
print("    The default number of features is 2**18.")

try:
    n_features = int(input())
except (ValueError, EOFError):
    # Fall back to the default when no valid integer is entered
    n_features = 2 ** 18

print("Loading 20 newsgroups training data")
raw_data = fetch_20newsgroups(subset='train', categories=categories).data
data_size_mb = sum(len(s.encode('utf-8')) for s in raw_data) / 1e6
print("%d documents - %0.3fMB" % (len(raw_data), data_size_mb))
print()

print("DictVectorizer")
t0 = time()
vectorizer = DictVectorizer()
vectorizer.fit_transform(token_freqs(d) for d in raw_data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration))
print("Found %d unique terms" % len(vectorizer.get_feature_names()))
print()

print("FeatureHasher on frequency dicts")
t0 = time()
hasher = FeatureHasher(n_features=n_features)
X = hasher.transform(token_freqs(d) for d in raw_data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration))
print("Found %d unique terms" % n_nonzero_columns(X))
print()

print("FeatureHasher on raw tokens")
t0 = time()
hasher = FeatureHasher(n_features=n_features, input_type="string")
X = hasher.transform(tokens(d) for d in raw_data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration))
print("Found %d unique terms" % n_nonzero_columns(X))

Automatically created module for IPython interactive environment
    The default number of features is 2**18.
100


Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


Loading 20 newsgroups training data
3803 documents - 6.245MB

DictVectorizer
done in 1.562829s at 3.996MB/s
Found 47928 unique terms

FeatureHasher on frequency dicts
done in 1.117980s at 5.586MB/s
Found 100 unique terms

FeatureHasher on raw tokens
done in 1.093077s at 5.713MB/s
Found 100 unique terms


### Extracting features from images

In [16]:
import numpy as np
from sklearn.feature_extraction import image

one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))
print('R channel of imaginary picture')
print(one_image[:, :, 0])

patches = image.extract_patches_2d(one_image, (2, 2), max_patches=2, random_state=0)
print('Shape of created patches:')
print(patches.shape)
print('R channels of created patches:')
print(patches[:, :, :, 0])

patches = image.extract_patches_2d(one_image, (2, 2))
print('Shape from another extraction method:')
print(patches.shape)
print('Patch 4 (R channel):')
print(patches[4, :, :, 0])

R channel of imaginary picture
[[ 0  3  6  9]
 [12 15 18 21]
 [24 27 30 33]
 [36 39 42 45]]
Shape of created patches:
(2, 2, 2, 3)
R channels of created patches:
[[[ 0  3]
  [12 15]]

 [[15 18]
  [27 30]]]
Shape from another extraction method:
(9, 2, 2, 3)
Patch 4 (R channel):
[[15 18]
 [27 30]]


In [17]:
five_images = np.arange(5 * 4 * 4 * 3).reshape(5, 4, 4, 3)
patches = image.PatchExtractor((2, 2)).transform(five_images)
print(patches.shape)


(45, 2, 2, 3)
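
The patch counts above follow from sliding a window over every valid position: a 4x4 image contains (4 - 2 + 1) * (4 - 2 + 1) = 9 patches of size 2x2, and five such images give 45. The NumPy sketch below reproduces this by hand; `all_patches` is a hypothetical helper written for illustration, and the row-by-row ordering is an assumption that happens to agree with the "Patch 4" output shown earlier.

```python
import numpy as np

def all_patches(img, patch_h, patch_w):
    """Collect every patch_h x patch_w window, scanning row by row."""
    height, width = img.shape[:2]
    return np.array([
        img[i:i + patch_h, j:j + patch_w]
        for i in range(height - patch_h + 1)
        for j in range(width - patch_w + 1)
    ])

one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))
manual_patches = all_patches(one_image, 2, 2)
print(manual_patches.shape)
print(manual_patches[4, :, :, 0])
```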


## Basics of sklearn.preprocessing module
The sklearn.preprocessing module provides a number of common transformers that can be applied to the extracted features. All transformers adhere to the ```Transformer``` API.

Let us review a couple of examples.

### Data standardization

In [18]:
from sklearn import preprocessing
import numpy as np
X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
X_scaled = preprocessing.scale(X)
print(X_scaled)
print('Mean: ',X_scaled.mean(axis=0))
print('Std: ', X_scaled.std(axis=0))

[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]
Mean:  [0. 0. 0.]
Std:  [1. 1. 1.]


In [19]:
scaler = preprocessing.StandardScaler().fit(X)
print('Scaler mean: ', scaler.mean_)
print('Scaler std: ', scaler.scale_)

print(scaler.transform(X))
print(scaler.transform([[-1.,  1., 0.]]))

Scaler mean:  [1.         0.         0.33333333]
Scaler std:  [0.81649658 0.81649658 1.24721913]
[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]
[[-2.44948974  1.22474487 -0.26726124]]


In [20]:
# There are a number of other scalers that can be used.
# Check out MinMaxScaler, MaxAbsScaler and RobustScaler.
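
As a brief illustration of the scalers named above, the sketch below applies them to the same small matrix used in the standardization cells. The variable names are chosen for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, RobustScaler

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# MinMaxScaler maps every column onto the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)

# MaxAbsScaler divides every column by its largest absolute value,
# so zeros stay zeros (convenient for sparse data)
X_maxabs = MaxAbsScaler().fit_transform(X)

# RobustScaler centers on the median and scales by the interquartile
# range, which makes it less sensitive to outliers
robust = RobustScaler().fit(X)

print(X_minmax)
print(X_maxabs)
print(robust.center_)
```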

### Data normalization
Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.
This assumption is the base of the Vector Space Model often used in text classification and clustering contexts.
The function normalize provides a quick and easy way to perform this operation on a single array-like dataset, either using the l1 or l2 norms:

In [21]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l2')
print(X_normalized)

[[ 0.40824829 -0.40824829  0.81649658]
 [ 1.          0.          0.        ]
 [ 0.          0.70710678 -0.70710678]]


In [22]:
normalizer = preprocessing.Normalizer().fit(X) # fit does nothing
print(normalizer.transform(X))
print(normalizer.transform([[-1.,  1., 0.]]))

[[ 0.40824829 -0.40824829  0.81649658]
 [ 1.          0.          0.        ]
 [ 0.          0.70710678 -0.70710678]]
[[-0.70710678  0.70710678  0.        ]]
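
The two norms mentioned above can also be written out directly in NumPy, which makes it clear that normalization works per sample (row) rather than per feature (column). The sketch below is only a cross-check against `preprocessing.normalize`; the variable names are illustrative.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# l2: divide each row by its Euclidean length
X_l2 = X / np.linalg.norm(X, ord=2, axis=1, keepdims=True)
# l1: divide each row by the sum of its absolute values
X_l1 = X / np.abs(X).sum(axis=1, keepdims=True)

print(np.allclose(X_l2, preprocessing.normalize(X, norm='l2')))
print(np.allclose(X_l1, preprocessing.normalize(X, norm='l1')))
```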


### Data binarization
Feature binarization is the process of thresholding numerical features to get boolean values. This can be useful for downstream probabilistic estimators that assume the input data is distributed according to a multivariate Bernoulli distribution. For instance, this is the case for sklearn.neural_network.BernoulliRBM.
It is also common among the text processing community to use binary feature values (probably to simplify the probabilistic reasoning) even if normalized counts (a.k.a. term frequencies) or TF-IDF valued features often perform slightly better in practice.

In [23]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]

binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing
print(binarizer.transform(X))

[[1. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]]


In [24]:
binarizer = preprocessing.Binarizer(threshold=1.1)
print(binarizer.transform(X))

[[0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 0.]]


### Working with missing values in sklearn

In [25]:
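# Imputer transformer
# Note: Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# newer versions provide sklearn.impute.SimpleImputer instead.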
import numpy as np
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X)) 

[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]




In [26]:
import scipy.sparse as sp
X = sp.csc_matrix([[1, 2], [0, 3], [7, 6]])
imp = Imputer(missing_values=0, strategy='mean', axis=0)
imp.fit(X)
X_test = sp.csc_matrix([[0, 2], [6, 0], [7, 6]])
print(imp.transform(X_test))  

[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]
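
On recent scikit-learn versions, the first imputation example above could be written with `SimpleImputer`. This is a sketch of the equivalent call: `SimpleImputer` always works column-wise, so there is no `axis` argument, and the missing-value marker is passed as `np.nan` rather than the string `'NaN'`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
# Column means learned here: (1 + 7) / 2 = 4 and (2 + 3 + 6) / 3 = 3.67
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

X = [[np.nan, 2], [6, np.nan], [7, 6]]
X_filled = imp.transform(X)
print(X_filled)
```

The missing entries are replaced by the means learned during `fit`, not by the means of the data passed to `transform`.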




### Polynomial features
Often it’s useful to add complexity to the model by considering nonlinear features of the input data. A simple and common method is to use polynomial features, which add higher-order and interaction terms of the original features.

In [27]:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X = np.arange(6).reshape(3, 2)
print(X)
poly = PolynomialFeatures(2)
results = poly.fit_transform(X)                             
print(results)

[[0 1]
 [2 3]
 [4 5]]
[[ 1.  0.  1.  0.  0.  1.]
 [ 1.  2.  3.  4.  6.  9.]
 [ 1.  4.  5. 16. 20. 25.]]
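
For two input features $(a, b)$ and degree 2, the output columns above are $1, a, b, a^2, ab, b^2$. The sketch below rebuilds the same matrix by hand using `itertools`, to show where each column comes from; it is an illustration of the column layout, not the library's implementation.

```python
from itertools import combinations_with_replacement

import numpy as np

X = np.arange(6).reshape(3, 2)

columns = []
for degree in range(3):  # degrees 0, 1 and 2
    # every multiset of column indices of the given size is one output term
    for combo in combinations_with_replacement(range(X.shape[1]), degree):
        # the product over an empty selection is 1, which gives the bias column
        columns.append(np.prod(X[:, list(combo)], axis=1))

manual = np.column_stack(columns)
print(manual)
```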


### Custom transformers
Often, you will want to convert an existing Python function into a transformer to assist in data cleaning or processing. You can implement a transformer from an arbitrary function with FunctionTransformer.

In [28]:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p)
X = np.array([[0, 1], [2, 3]])
print(transformer.transform(X))

[[0.         0.69314718]
 [1.09861229 1.38629436]]
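
`FunctionTransformer` suits stateless functions. When a transformation has to learn something from the training data, one common alternative is to write a small class that derives from `BaseEstimator` and `TransformerMixin`. The `QuantileClipper` below is a hypothetical example written for illustration; it is not part of scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class QuantileClipper(BaseEstimator, TransformerMixin):
    """Clip every feature to quantile bounds learned during fit."""

    def __init__(self, lower=0.05, upper=0.95):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # per-column bounds, estimated on the training data only
        self.low_ = np.quantile(X, self.lower, axis=0)
        self.high_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.low_, self.high_)

X_train = np.array([[0.], [1.], [2.], [3.], [100.]])
clipper = QuantileClipper(lower=0.0, upper=0.75).fit(X_train)
clipped = clipper.transform([[50.], [-5.], [2.]])
print(clipped)
```

Inheriting from `TransformerMixin` provides `fit_transform`, and `BaseEstimator` provides `get_params` and `set_params`, so the class can be used as a pipeline step.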




## Pipelining data analysis tasks with sklearn.pipeline
sklearn.pipeline allows you to chain the steps of a data analysis task. Conceptually, it resembles the chain-of-responsibility pattern: each step passes its output on to the next one.

Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves two purposes here:

* Convenience: You only have to call fit and predict once on your data to fit a whole sequence of estimators.
* Joint parameter selection: You can grid search over parameters of all estimators in the pipeline at once.

All estimators in a pipeline, except the last one, must be transformers (i.e. must have a transform method). The last estimator may be any type (transformer, classifier, etc.).
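
A minimal sketch of these rules, assuming a toy two-class dataset invented for illustration: the first step is a transformer, the last one is a classifier, and a parameter of any step is addressed as `<step name>__<parameter>`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: two well-separated groups (made up for this example)
X = np.array([[0., 0.], [1., 1.], [2., 2.], [8., 8.], [9., 9.], [10., 10.]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = Pipeline([
    ('scale', StandardScaler()),      # intermediate step: must be a transformer
    ('model', LogisticRegression()),  # final step: any estimator
])

# A single fit call fits the scaler, transforms X and then fits the classifier
clf.fit(X, y)
pred = clf.predict([[0.5, 0.5], [9.5, 9.5]])
print(pred)

# The same naming scheme is used for grid search parameter names
clf.set_params(model__C=10.0)
print([name for name, _ in clf.steps])
```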

In [29]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures

imputer_transform = ('fill_NaNs', Imputer(missing_values='NaN', strategy='mean', axis=0))
scaler_transform = ('scale', StandardScaler())
polynomial_features_transform = ('polynomial_features', PolynomialFeatures())

pipeline = Pipeline(steps = [imputer_transform,
                             scaler_transform,
                             polynomial_features_transform]
                   )
X_train = [[1, 2], [np.nan, 3], [7, 6]]

pipeline.fit(X_train)
print(pipeline.transform(X_train))

X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(pipeline.transform(X))

[[ 1.         -1.22474487 -0.98058068  1.5         1.20096115  0.96153846]
 [ 1.          0.         -0.39223227  0.         -0.          0.15384615]
 [ 1.          1.22474487  1.37281295  1.5         1.68134561  1.88461538]]
[[ 1.          0.         -0.98058068  0.         -0.          0.96153846]
 [ 1.          0.81649658  0.          0.66666667  0.          0.        ]
 [ 1.          1.22474487  1.37281295  1.5         1.68134561  1.88461538]]




In [30]:
%%file pipeline_test.py
# Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Peter Prettenhofer <peter.prettenhofer@gmail.com>
#         Mathieu Blondel <mathieu@mblondel.org>
# License: BSD 3 clause

from __future__ import print_function

from pprint import pprint
from time import time
import logging

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

print(__doc__)

# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')


###############################################################################
# Load some categories from the training set
categories = [
    'alt.atheism',
    'talk.religion.misc',
]
# Uncomment the following to do the analysis on all the categories
#categories = None

print("Loading 20 newsgroups dataset for categories:")
print(categories)

data = fetch_20newsgroups(subset='train', categories=categories)
print("%d documents" % len(data.filenames))
print("%d categories" % len(data.target_names))
print()

###############################################################################
# define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    #'clf__max_iter': (10, 50, 80),  # this parameter was called n_iter in older scikit-learn versions
}

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(data.data, data.target)
    print("done in %0.3fs" % (time() - t0))
    print()

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

Overwriting pipeline_test.py


In [31]:
%run pipeline_test.py

Module created for script run in IPython
Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc']
857 documents
2 categories

Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__alpha': (1e-05, 1e-06),
 'clf__penalty': ('l2', 'elasticnet'),
 'vect__max_df': (0.5, 0.75, 1.0),
 'vect__ngram_range': ((1, 1), (1, 2))}
Fitting 3 folds for each of 24 candidates, totalling 72 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    9.2s
[Parallel(n_jobs=-1)]: Done  72 out of  72 | elapsed:   19.1s finished


done in 20.140s

Best score: 0.938
Best parameters set:
	clf__alpha: 1e-06
	clf__penalty: 'elasticnet'
	vect__max_df: 0.5
	vect__ngram_range: (1, 2)


