## Feature extraction 4. Using Multiple Classifiers

In [None]:
# Author: Guillaume Lussier <lussier.guillaume@gmail.com>
# base of work http://scikit-learn.org/stable/modules/feature_extraction.html
# Date: Feb2017
# ipython file, kernel 2.7, required modules: sklearn, numpy, pprint, time, logging 

### Section8 : Pipeline of Multiple Classifiers

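Multiple estimators can be combined in several ways. One way is to chain them in what scikit-learn calls a pipeline.  
An estimator pipeline applies its steps in sequence: each intermediate step transforms the data and passes the result to the next step, and the final step (here a classifier) is fitted on the transformed data.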
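
To make the chaining concrete, here is a minimal sketch comparing a manual vectorizer + classifier sequence with the equivalent `Pipeline`. The four toy documents and their labels are invented for illustration and are not part of the 20 newsgroups data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical, for illustration only)
docs = [
    "the rocket reached orbit",
    "the satellite left orbit",
    "the cipher uses a secret key",
    "the key encrypts the message",
]
labels = [0, 0, 1, 1]  # 0 = space, 1 = crypto

# Manual chaining: vectorize first, then fit the classifier on the result
vect = TfidfVectorizer()
X = vect.fit_transform(docs)
clf = LogisticRegression()
clf.fit(X, labels)
manual_pred = clf.predict(vect.transform(docs))

# Same two steps expressed as a pipeline
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('logreg', LogisticRegression()),
])
pipe.fit(docs, labels)
pipe_pred = pipe.predict(docs)

# Both approaches run the same operations, so the predictions agree
same_predictions = bool(np.array_equal(manual_pred, pipe_pred))
print(same_predictions)
```

The pipeline version has the practical advantage that the whole chain behaves as a single estimator, which is what allows it to be handed to a parameter search.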

In the previous works we chained a vectorizer with a classifier. The vectorizers we used were a simple count vectorizer and a tf-idf vectorizer. The classifiers were Logistic Regression and multinomial Naive Bayes.

One of the difficulties of using several estimators is to choose the parameters to be used with each of them, especially as the output of one step affects the result of the next one. The sklearn GridSearchCV class can help with identifying the best parameters.
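
Parameters of a pipeline step are addressed as `<step name>__<parameter name>` (two underscores), which is the naming a parameter grid relies on. A minimal sketch, assuming the step names `tfidf` and `logreg`; the values set here are arbitrary examples:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('logreg', LogisticRegression()),
])

# '<step name>__<parameter name>' reaches into one step of the pipeline
pipe.set_params(tfidf__max_df=0.5, logreg__C=0.1)

# The values are now set on the underlying estimators
tfidf_max_df = pipe.named_steps['tfidf'].max_df
logreg_C = pipe.get_params()['logreg__C']
print("max_df=%s C=%s" % (tfidf_max_df, logreg_C))
```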

An example with CountVectorizer, TfidfTransformer, SGDClassifier and a pipeline can be found here:
http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py  
Below we continue to work on the basis and results obtained in our previous work (fextraction1, 2 and 3). We will use TfidfVectorizer and LogisticRegression.

Introduction of the libraries, description of the parameters

In [1]:
# sklearn data set
from sklearn.datasets import fetch_20newsgroups

# sklearn text feature extraction pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# basic libraries
from sklearn import metrics
import numpy as np
from pprint import pprint
from time import time

# this is to configure python logging to handle warning messages 
import logging
logging.basicConfig()


# TfidfVectorizer
# tf / term frequency
# idf / inverse document frequency
# max_df: terms with a document frequency higher than this value are ignored (float: proportion of documents, int: absolute count)
# min_df: cut-off, terms with a document frequency lower than this value are ignored (float: proportion of documents, int: absolute count)
# analyzer='word': default value, features will be made of word n-grams
# ngram_range=tuple (min_n, max_n): default (1, 1), n-grams used such that min_n <= n <= max_n
# vocabulary: default None, if not given, a vocabulary is determined from the input documents.
# max_features: default None, if not None, build a vocabulary with only top max_features ordered by term frequency across the corpus.
# stop_words: None, english
# example : TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
# note: the effect of the different parameters on the tf-idf vectorizer & fit have been discussed in fextraction2 

# LogisticRegression
# penalty: str, ‘l1’ or ‘l2’, default: ‘l2’
# C: float, default: 1.0, positive float, smaller values specify stronger regularization
# fit_intercept : bool, default: True, constant (a.k.a. bias or intercept) added to the decision function
# random_state : int seed, RandomState instance, default: None, seed of the pseudo random number generator to use when shuffling the data
# solver : {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’}, default: ‘liblinear’
# n_jobs : int, default: 1, Number of CPU cores used during cross-validation loop. If -1, all cores are used
# example : LogisticRegression(random_state=0)


Loading of the data set.

In [2]:
categories = [
    'comp.graphics',
    'sci.crypt',
    'sci.space',
    'talk.religion.misc',
]
# Uncomment the following to do the analysis on all the categories
#categories = None

print("Loading 20 newsgroups dataset for categories:")
print(categories)

# fetching 20newsgroups data set with filtering of the header/footers/quotes to be more realistic
data = fetch_20newsgroups(subset='train', shuffle=True, random_state=1, 
                          remove=('headers', 'footers', 'quotes'),
                          categories=categories)

print("Data loaded from 20 newsgroups:")
print("%d documents" % len(data.filenames))
print("%d categories" % len(data.target_names))
print()

Loading 20 newsgroups dataset for categories:
['comp.graphics', 'sci.crypt', 'sci.space', 'talk.religion.misc']
Data loaded from 20 newsgroups:
2149 documents
4 categories
()


Definition of the pipeline and parameters

In [3]:
pipeline = Pipeline([
#    ('vect', CountVectorizer()), # we use a TfidfVectorizer so the CountVectorizer is not needed
    ('tfidf', TfidfVectorizer()),
    ('logreg', LogisticRegression()),
])

# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    #'vect__max_df': (0.5, 0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000, 50000),
    #'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'tfidf__max_df': (0.5, 0.75, 0.95), # max document frequency (proportion) for a word to be kept as a feature
    'tfidf__min_df': (2, 10, 50), # minimum number of documents a word must appear in to be kept as a feature
    'tfidf__max_features': (2000, 20000, 50000), # max number of features in the model
    # stop_words = 'english' will not be used here as it renders the impact of max_df less visible
    'logreg__C': (0.1, 0.5, 1.0), # smaller values specify stronger regularization
    'logreg__fit_intercept': (True, False), # whether a bias constant is added to the decision function
    'logreg__solver': ('liblinear', 'sag'),
}
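
The combinatorial cost mentioned in the comment above can be checked before launching the search. The sketch below counts the candidates of the active grid with `itertools.product`; the fold count of 3 is taken from the search log, not set in the code above.

```python
from itertools import product

# Same active entries as the grid defined above
parameters = {
    'tfidf__max_df': (0.5, 0.75, 0.95),
    'tfidf__min_df': (2, 10, 50),
    'tfidf__max_features': (2000, 20000, 50000),
    'logreg__C': (0.1, 0.5, 1.0),
    'logreg__fit_intercept': (True, False),
    'logreg__solver': ('liblinear', 'sag'),
}
n_folds = 3  # assumption: number of cross-validation folds reported in the log

# One candidate per element of the Cartesian product of all value tuples
candidates = list(product(*parameters.values()))
n_candidates = len(candidates)
n_fits = n_candidates * n_folds
print("%d candidates, %d fits" % (n_candidates, n_fits))
# prints "324 candidates, 972 fits"
```

Each additional parameter with k values multiplies the number of fits by k, which is why the `vect__` entries are left commented out.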

Execution of the pipeline and results

In [4]:
if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(data.data, data.target)
    print("done in %0.3fs" % (time() - t0))
    print()

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

Performing grid search...
('pipeline:', ['tfidf', 'logreg'])
parameters:
{'logreg__C': (0.1, 0.5, 1.0),
 'logreg__fit_intercept': (True, False),
 'logreg__solver': ('liblinear', 'sag'),
 'tfidf__max_df': (0.5, 0.75, 0.95),
 'tfidf__max_features': (2000, 20000, 50000),
 'tfidf__min_df': (2, 10, 50)}
Fitting 3 folds for each of 324 candidates, totalling 972 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   26.6s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  4.4min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  7.8min
[Parallel(n_jobs=-1)]: Done 972 out of 972 | elapsed:  9.5min finished


done in 573.207s
()
Best score: 0.863
Best parameters set:
	logreg__C: 1.0
	logreg__fit_intercept: False
	logreg__solver: 'liblinear'
	tfidf__max_df: 0.5
	tfidf__max_features: 20000
	tfidf__min_df: 2
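
Beyond the single best candidate, the fitted search object keeps the score of every candidate in its `cv_results_` attribute, which helps to see how close the runners-up are. A minimal sketch on an invented toy corpus (the documents, the small grid and `cv=2` are assumptions chosen to keep the example fast):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical): 4 "space" and 4 "crypto" documents
docs = [
    "rocket launch reaches orbit",
    "satellite orbit around the moon",
    "astronauts on the space station",
    "telescope observes a distant planet",
    "cipher with a secret key",
    "encryption key protects the message",
    "public key signature scheme",
    "hash function and password storage",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('logreg', LogisticRegression()),
])
grid = {
    'tfidf__ngram_range': ((1, 1), (1, 2)),
    'logreg__C': (0.1, 1.0),
}
search = GridSearchCV(pipe, grid, cv=2)
search.fit(docs, labels)

# List every candidate, best rank first
results = search.cv_results_
order = sorted(range(len(results['params'])),
               key=lambda i: results['rank_test_score'][i])
for i in order:
    print("rank %d  score %0.3f  %r" % (results['rank_test_score'][i],
                                        results['mean_test_score'][i],
                                        results['params'][i]))
```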


What do the results tell us?

TF-IDF Vectorizer  
1. min_df = 2: the lowest value tested, which removes as few rare words as possible
2. max_df = 50%: the strongest filtering of the most common words among the values tested, so that each category is described by its more specific words
3. max_features = 20000: as we have about 30K features, this is consistent with the previous trials, where we saw that limiting the features to 2000 was too restrictive (and did not improve runtime as much as could be expected)
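
The effect of `min_df` and `max_df` on the vocabulary can be seen on a very small example. This is a sketch on an invented four-document corpus, not on the newsgroups data; `vocab` is a helper defined here for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "the" is in all 4 documents, "sat" in 2
docs = ["the cat sat", "the dog sat", "the bird flew", "the fish swam"]

def vocab(**kwargs):
    """Return the sorted vocabulary kept by a TfidfVectorizer fitted on docs."""
    return sorted(TfidfVectorizer(**kwargs).fit(docs).vocabulary_)

all_terms = vocab()                        # no filtering
no_common = vocab(max_df=0.5)              # drop terms in more than 50% of documents
no_rare = vocab(min_df=2)                  # drop terms in fewer than 2 documents
both = vocab(max_df=0.5, min_df=2)         # apply both cut-offs

print(all_terms)   # ['bird', 'cat', 'dog', 'fish', 'flew', 'sat', 'swam', 'the']
print(no_common)   # ['bird', 'cat', 'dog', 'fish', 'flew', 'sat', 'swam']
print(no_rare)     # ['sat', 'the']
print(both)        # ['sat']
```

Note that "sat" survives `max_df=0.5` because it appears in exactly 50% of the documents, and only terms strictly above the threshold are removed.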

Log Reg Classifier  
1. C = 1.0: the weakest regularization among the values tested (and the default value) works best here; this can depend on the data provided
2. fit_intercept = False: no bias term is added to the decision function
3. solver = liblinear: liblinear gives the better score on this small dataset; sag is intended for larger datasets, so this could change with a much larger dataset
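
Since C is the inverse of the regularization strength, smaller values of C pull the coefficients closer to zero. The sketch below illustrates this on a small invented numeric dataset (the points and labels are assumptions, unrelated to the text data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 2D data (hypothetical): class 0 near the origin, class 1 further away
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [2, 2], [3, 2], [2, 3], [3, 3]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Fit the same model with the three C values used in the grid
norms = {}
for C in (0.1, 0.5, 1.0):
    clf = LogisticRegression(C=C).fit(X, y)
    norms[C] = float(np.linalg.norm(clf.coef_))
    print("C=%.1f  coefficient norm=%.3f" % (C, norms[C]))
```

The coefficient norm grows with C, which is the sense in which C = 1.0 is the least regularized of the three candidates.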

In the end, for our dataset the grid search over the pipeline indicates that we should keep as many features as possible (not remove rare words too aggressively) and ignore the most common words (the ideal here is to use the English stop_words list or a very strong max_df).  
This is very close to the results we observed when changing the parameters of the tf-idf vectorizer and the logistic regression in the previous works.

### Section9 : Pipeline Refining

Let us refine the results of the first pipeline obtained in Section8.

In [14]:
# narrower grid around the best tf-idf values found in Section8
# parameters not listed here keep the values set in the pipeline definition
parameters = {
    'tfidf__max_df': (0.3, 0.4, 0.5), # max document frequency (proportion) for a word to be kept as a feature
    'tfidf__max_features': (20000, 30000, 40000), # max number of features in the model
}

In [13]:
if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(data.data, data.target)
    print("done in %0.3fs" % (time() - t0))
    print()

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

Performing grid search...
('pipeline:', ['tfidf', 'logreg'])
parameters:
{'tfidf__max_df': (0.3, 0.4, 0.5),
 'tfidf__max_features': (20000, 30000, 40000)}
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[Parallel(n_jobs=-1)]: Done  27 out of  27 | elapsed:   21.5s finished
done in 23.177s
()
Best score: 0.858
Best parameters set:
	tfidf__max_df: 0.3
	tfidf__max_features: 20000

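Interestingly, within this refined grid the best score is obtained by filtering even more of the common words, removing any word present in more than 30% of the documents. Note that the best score here (0.858) is slightly lower than in Section8 (0.863): this grid does not list the classifier parameters, so the logistic regression runs with the values set in the pipeline definition (the defaults) and not with the best values found in Section8, which is a likely cause of the difference.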
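
A grid that only lists tf-idf parameters leaves the classifier at whatever values the pipeline was built with. One possible way to refine the vectorizer while keeping the Section8 classifier settings is to set them directly in the pipeline definition. The sketch below shows the idea on an invented toy corpus; the documents, the `max_df` values and `cv=2` are assumptions for illustration, and the solver is left at the library default.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical): 4 "space" and 4 "crypto" documents
docs = [
    "rocket launch reaches orbit",
    "satellite orbit around the moon",
    "astronauts on the space station",
    "telescope observes a distant planet",
    "cipher with a secret key",
    "encryption key protects the message",
    "public key signature scheme",
    "hash function and password storage",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Classifier settings fixed to the best values reported in Section8
tuned_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('logreg', LogisticRegression(C=1.0, fit_intercept=False)),
])

# Only the vectorizer is searched; the classifier keeps the values above
refined = {'tfidf__max_df': (0.5, 1.0)}
search = GridSearchCV(tuned_pipeline, refined, cv=2)
search.fit(docs, labels)

fitted_logreg = search.best_estimator_.named_steps['logreg']
print("fit_intercept=%s C=%s" % (fitted_logreg.fit_intercept, fitted_logreg.C))
print(search.best_params_)
```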