## Feature extraction 4. Using Multiple Classifiers

In [None]:
# Author: Guillaume Lussier <lussier.guillaume@gmail.com>
# base of work http://scikit-learn.org/stable/modules/feature_extraction.html
# Date: Feb2017
# ipython file, kernel 2.7, required modules: sklearn, numpy, pprint, time, logging 

### Section 8: Multiple Classifiers

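Several estimators can be combined in different ways. One way is to chain them in what is called a pipeline.  
An estimator pipeline applies its steps in sequence: each intermediate step transforms the data and passes the result on, and the final estimator is fit on the output of the last transformation.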
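
As a minimal sketch of what chaining means, the cell below fits a two-step pipeline on a tiny corpus and compares it with the same two steps applied by hand. The corpus, the labels and the step names `tfidf` and `clf` are invented for illustration and are not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus and labels, made up for this illustration only.
docs = [
    "the cat sat on the mat",
    "dogs chase the cat",
    "stocks fell on the market",
    "the market rallied as stocks rose",
]
labels = [0, 0, 1, 1]
new_docs = ["the cat chased the dogs", "stocks and the market"]

# Pipeline: the vectorizer transforms the text, the classifier is fit on the result.
# The step names are arbitrary.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=0)),
])
pipeline.fit(docs, labels)
predictions = pipeline.predict(new_docs)

# The same two steps written out by hand.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(docs)
classifier = LogisticRegression(random_state=0).fit(X_train, labels)
manual_predictions = classifier.predict(vectorizer.transform(new_docs))

# Both approaches run the same computation, so the predictions agree.
print(list(predictions) == list(manual_predictions))
```

The pipeline form keeps the `fit` and `transform` bookkeeping in one object, which is what makes a joint parameter search over all steps practical.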

In the previous notebooks we chained a vectorizer with a classifier. The vectorizers were a simple count vectorizer and a tf-idf vectorizer; the classifiers were Logistic Regression and multinomial Naive Bayes.

One of the difficulties of using several estimators is choosing the parameters for each of them, especially as the parameters of one step can affect the result of the next. The sklearn GridSearchCV class can help identify the best parameters by cross-validating every combination in a given parameter grid.
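
A rough sketch of such a search over a pipeline is given below. It assumes `GridSearchCV` is imported from `sklearn.model_selection` (available from scikit-learn 0.18). The toy corpus, the step names and the grid values are placeholders chosen for illustration, not recommended settings. Parameters of a pipeline step are addressed as `<step name>__<parameter name>`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus, made up for this illustration: 4 documents per class.
docs = [
    "the cat sat on the mat",
    "dogs chase the cat",
    "my dog likes the cat",
    "the cat and the dog sleep",
    "stocks fell on the market",
    "the market rallied as stocks rose",
    "investors sold stocks today",
    "the stock market closed higher",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=0)),
])

# Keys follow the '<step>__<parameter>' convention; values are illustrative.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# cv=2 only because the toy corpus is tiny; real data would use more folds.
grid_search = GridSearchCV(pipeline, param_grid, cv=2)
grid_search.fit(docs, labels)

print(grid_search.best_params_)
print(grid_search.best_score_)
```

Because the vectorizer and the classifier are tuned together, each candidate value of `clf__C` is evaluated against each candidate `tfidf__ngram_range`, which covers the interaction between the two steps.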

##### 8.1 Introduction of the libraries, description of the parameters

In [None]:
# sklearn data set
from sklearn.datasets import fetch_20newsgroups

# sklearn text feature extraction pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# basic libraries
from sklearn import metrics
import numpy as np
from pprint import pprint
from time import time

# this is to configure python logging to handle warning messages 
import logging
logging.basicConfig()


# TfidfVectorizer
# tf / term frequency
# idf / inverse document frequency
# max_df: terms with a document frequency higher than this value are ignored (float: proportion of documents, int: absolute count)
# min_df: cut-off, terms with a document frequency lower than this value are ignored (float: proportion of documents, int: absolute count)
# analyzer='word': default value, features will be made of word n-grams
# ngram_range=tuple (min_n, max_n): default (1, 1), n-grams used are those with min_n <= n <= max_n
# vocabulary: default None, if not given, a vocabulary is determined from the input documents.
# max_features: default None, if not None, build a vocabulary with only top max_features ordered by term frequency across the corpus.
# stop_words: default None, or 'english' to use the built-in English stop word list
# example : TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
# note: the effect of the different parameters on the tf-idf vectorizer & fit has been discussed in fextraction2

# LogisticRegression
# penalty: str, 'l1' or 'l2', default: 'l2'
# C: float, default: 1.0, smaller values specify stronger regularization
# fit_intercept : bool, default: True, constant (a.k.a. bias or intercept) added to the decision function
# random_state : int seed, RandomState instance, default: None, seed of the pseudo random number generator to use when shuffling the data
# solver : {'newton-cg', 'lbfgs', 'liblinear', 'sag'}, default: 'liblinear' (in the sklearn version used here; releases from 0.22 on default to 'lbfgs')
# n_jobs : int, default: 1, Number of CPU cores used during cross-validation loop. If -1, all cores are used
# example : LogisticRegression(random_state=0)
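
To make a few of the parameters above concrete, here is a small sketch on an invented four-document corpus. It shows how `min_df` and `max_df` prune the vocabulary and how a smaller `C` shrinks the learned coefficients. The corpus, the labels and the chosen values are assumptions made for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus, made up for this illustration.
# Document frequencies: apple 4, banana 3, cherry 2, date 1, elderberry 1.
docs = [
    "apple banana cherry",
    "apple banana",
    "apple cherry date",
    "apple banana elderberry",
]
labels = [0, 0, 1, 1]

# Default settings keep every term.
default_vectorizer = TfidfVectorizer().fit(docs)
print(sorted(default_vectorizer.vocabulary_))

# max_df=0.9 drops terms found in more than 90% of documents ('apple'),
# min_df=2 drops terms found in fewer than 2 documents ('date', 'elderberry').
filtered_vectorizer = TfidfVectorizer(max_df=0.9, min_df=2).fit(docs)
print(sorted(filtered_vectorizer.vocabulary_))  # ['banana', 'cherry']

# Smaller C means stronger regularization, hence smaller coefficients.
X = default_vectorizer.transform(docs)
strong_reg = LogisticRegression(C=0.01, random_state=0).fit(X, labels)
weak_reg = LogisticRegression(C=100.0, random_state=0).fit(X, labels)
print(np.linalg.norm(strong_reg.coef_), np.linalg.norm(weak_reg.coef_))
```

The vocabulary is read from the fitted `vocabulary_` attribute rather than from a feature-name method, since the name of that method differs between scikit-learn releases.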
