# Hyper-parameter tuning part 2:
## Classify two of "20 newsgroups"
### Tuning the vectorizer and classifier with a pipeline


#### Created by John Burt

Based on example code [from Scikit-learn documentation](http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py) 

In this example, I implement a classifier that predicts which of two selected newsgroups, "alt.atheism" or "talk.religion.misc", a post comes from. Unlike the first example, I also tune how words are transformed into vectors by TfidfVectorizer.

#### General classifier procedure:
- vectorize the text content into features using TfidfVectorizer
- train a stochastic gradient descent classifier (SGDClassifier) to classify text as one of the given newsgroup categories.
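To make the vectorizing step concrete, here is a minimal sketch on a hypothetical three-document toy corpus (not the newsgroups data). It shows how `ngram_range` and `min_df`, two of the TfidfVectorizer parameters tuned later, change the vocabulary that becomes the feature set. The variable names are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to illustrate the vectorizer settings
toy_docs = [
    "hockey game tonight",
    "baseball game tonight",
    "hockey season opens",
]

# Unigrams only: every distinct word becomes a feature
unigram_vect = TfidfVectorizer(ngram_range=(1, 1)).fit(toy_docs)
print(sorted(unigram_vect.vocabulary_))

# Bigrams only: every distinct pair of adjacent words becomes a feature
bigram_vect = TfidfVectorizer(ngram_range=(2, 2)).fit(toy_docs)
print(sorted(bigram_vect.vocabulary_))

# min_df=2 drops terms that appear in fewer than two documents
frequent_vect = TfidfVectorizer(min_df=2).fit(toy_docs)
print(sorted(frequent_vect.vocabulary_))

# Each document becomes one row of a sparse tf-idf matrix
X_toy = unigram_vect.transform(toy_docs)
print(X_toy.shape)
```

Larger n-gram ranges grow the vocabulary quickly, which is one reason these settings are worth tuning together with the classifier.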

#### Hyperparameter tuning procedure:
- Create a pipeline object that contains the two stages of the training procedure: TfidfVectorizer and SGDClassifier
- Use a grid search to find the TfidfVectorizer and SGDClassifier parameters that give best accuracy.
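A grid search addresses the parameters of each pipeline stage by prefixing the parameter name with the step name and a double underscore. The sketch below assumes the step names `vect` and `clf`, and lists the names a pipeline exposes so that a parameter grid can be written without guessing. It builds an untrained pipeline with library defaults, so it is only an illustration of the naming scheme.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Two-stage pipeline; the step names 'vect' and 'clf' are our own choice
pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", SGDClassifier()),
])

# Every tunable parameter is exposed as <step name>__<parameter name>
tunable = sorted(pipe.get_params().keys())
vect_params = [name for name in tunable if name.startswith("vect__")]
clf_params = [name for name in tunable if name.startswith("clf__")]
print(vect_params)
print(clf_params)

# The same names work with set_params, which is what a grid search
# relies on when it tries each candidate combination
pipe.set_params(vect__max_df=0.5, clf__alpha=1e-4)
print(pipe.named_steps["vect"].max_df, pipe.named_steps["clf"].alpha)
```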


### About the 20 Newsgroups text dataset:
The [20 newsgroups dataset](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups) is easily accessible from the scikit-learn module. It comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and the other for testing (or performance evaluation). The split between the train and test sets is based on whether a message was posted before or after a specific date.

The 20 newsgroup names are:
- alt.atheism, comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, soc.religion.christian, talk.politics.guns, talk.politics.mideast, talk.politics.misc, talk.religion.misc

In [1]:
# remove warnings
import warnings
warnings.filterwarnings('ignore')
# ---

# set matplotlib environment and import some basics
%matplotlib inline
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
import numpy as np
import pandas as pd
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 100 # set to None to see entire text (-1 is deprecated since pandas 1.0)

# import the scikit modules to be used
from sklearn.datasets import fetch_20newsgroups

# specify some categories from the training set
# Alternative pair to try:
# categories = [
#     'rec.sport.baseball',
#     'rec.sport.hockey',
# ]
categories = [
    'alt.atheism',
    'talk.religion.misc',
]
# Uncomment the following to do the analysis on all the categories
#categories = None

print("Loading 20 newsgroups dataset for categories:")
print(categories)

# load the training dataset
data = fetch_20newsgroups(subset='train', categories=categories)

print("\n%d total samples"%(len(data.target)))
for index, catname in enumerate(data.target_names):
    print("  %d samples of %s"%(np.sum(data.target==index), catname))


Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc']

857 total samples
  480 samples of alt.atheism
  377 samples of talk.religion.misc


In [9]:
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Define TfidfVectorizer defaults: 
# Doing it this way allows us to define our own default params based on prior gridsearch runs.
vect_defaults = {
    "analyzer":'word', 
    "max_features" : None,
    "max_df" : 0.5, # Filters out terms that occur in more than half of the docs (max_df=0.5)
    "min_df" : 2, # Filters out terms that occur in only one document (min_df=2).
    "ngram_range":(1, 3), # unigrams through trigrams
    "stop_words" : "english", # strips out English "stop words"; set to None to keep them
    "use_idf" : True
    }

# Define SGDClassifier defaults: 
# Doing it this way allows us to define our own default params.
clf_defaults = {
    "alpha": 1e-03, 
    "penalty" : 'l2',
    "max_iter" : 10 # maximum number of epochs; named n_iter before scikit-learn 0.19 (n_iter was removed in 0.21)
    }

# Create a pipeline combining a text vectorizer with SGDClassifier classifier
pipeline = Pipeline([    
    ('vect', TfidfVectorizer(**vect_defaults)),
    ('clf', SGDClassifier(**clf_defaults)),
])

# Define the parameters and values we want to test.
# Uncommenting more parameters will give better exploring power but will
#   increase processing time in a combinatorial way. I suggest tuning <= 3
#   parameters at a time.
# Note the naming format: <pipeline step name>__<parameter name>
parameters = {
    #'vect__stop_words': ('english', None),
    #'vect__analyzer': ('word', 'char_wb'),
    #'vect__analyzer': ('word', 'char', 'char_wb'),
    'vect__max_df': (0.1, 0.25, 0.5, 0.75),
    'vect__min_df': (1,2,5),
    #'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2), (2, 2)),  # unigrams only, unigrams + bigrams, or bigrams only
    'vect__use_idf': (True, False),
    
    #'clf__alpha': (1e-02, 1e-03, 1e-04, 1e-05, 1e-06),
    'clf__penalty': ( 'none', 'l2', 'l1', 'elasticnet'),
    #'clf__max_iter': (5, 10, 20, 50, 80),  # called clf__n_iter before scikit-learn 0.19
}
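# find the best parameters for both the feature extraction and the
# classifier; cv=3 makes explicit the 3-fold cross-validation reported in the
# output (the GridSearchCV default in scikit-learn < 0.22)
grid_search = GridSearchCV(pipeline, parameters, cv=3, n_jobs=-1, verbose=1)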

print("Performing grid search...")
print("parameters:")
print(parameters)
t0 = time()
grid_search.fit(data.data, data.target)
print("done in %0.3fs" % (time() - t0))
print()

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))




Performing grid search...
parameters:
{'vect__max_df': (0.1, 0.25, 0.5, 0.75), 'vect__ngram_range': ((1, 1), (1, 2), (2, 2)), 'vect__use_idf': (True, False), 'vect__min_df': (1, 2, 5), 'clf__penalty': ('none', 'l2', 'l1', 'elasticnet')}
Fitting 3 folds for each of 288 candidates, totalling 864 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    7.1s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   29.4s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 864 out of 864 | elapsed:  2.5min finished


done in 153.702s

Best score: 0.953
Best parameters set:
	clf__penalty: 'l2'
	vect__max_df: 0.25
	vect__min_df: 1
	vect__ngram_range: (2, 2)
	vect__use_idf: False
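
The "288 candidates, totalling 864 fits" line in the output follows directly from the size of the grid: 4 × 3 × 3 × 2 × 4 parameter combinations, each fitted on 3 folds. As a small check, the sketch below counts the combinations with scikit-learn's `ParameterGrid`, using a copy of the `parameters` dict from the cell above. The `clf__alpha` values added at the end are the ones from the commented-out line, included only to show how quickly the grid grows.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the grid searched above
parameters = {
    'vect__max_df': (0.1, 0.25, 0.5, 0.75),
    'vect__min_df': (1, 2, 5),
    'vect__ngram_range': ((1, 1), (1, 2), (2, 2)),
    'vect__use_idf': (True, False),
    'clf__penalty': ('none', 'l2', 'l1', 'elasticnet'),
}

n_folds = 3  # number of cross-validation folds reported in the output
n_candidates = len(ParameterGrid(parameters))
n_fits = n_candidates * n_folds
print("%d candidates x %d folds = %d fits" % (n_candidates, n_folds, n_fits))

# Adding one more parameter multiplies the work by its number of values
bigger = dict(parameters)
bigger['clf__alpha'] = (1e-02, 1e-03, 1e-04, 1e-05, 1e-06)
n_bigger = len(ParameterGrid(bigger))
print("with clf__alpha: %d candidates, %d fits" % (n_bigger, n_bigger * n_folds))
```

This is the combinatorial growth the comment in the code warns about, and it is why tuning a few parameters at a time keeps run time manageable.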
