## Feature extraction 2. Text Extended Examples

In [58]:
# Author: Guillaume Lussier <lussier.guillaume@gmail.com>
# base of work http://scikit-learn.org/stable/modules/feature_extraction.html
# Date: Jan2017
# ipython file, kernel 2.7, required modules: sklearn, numpy, pprint, time, logging 

Because of the number of vectorizers and classifiers in this document, running the whole document can take a minute or two.

### Section4 :  Fitting continued

In the previous analysis we used a filtered training set but did not filter the test set. Let us compare the results of the classifier on a filtered test set.
We use the sklearn fetch_20newsgroups with parameters to remove the headers, footers and quotes from the dataset.
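
To give an idea of what this kind of filtering does, here is a minimal sketch of header removal. It assumes that the header is simply everything before the first blank line of a post; `strip_header` and the sample post are hypothetical and this is not the scikit-learn implementation.

```python
def strip_header(post):
    """Drop everything before the first blank line (assumed to be the header)."""
    _header, _blank, body = post.partition("\n\n")
    return body


# hypothetical post, for illustration only
sample_post = (
    "From: someone@example.com\n"
    "Subject: Re: orbit question\n"
    "\n"
    "The launch window opens next week.\n"
)
print(strip_header(sample_post))
```

Footers and quotes need more heuristics (signature blocks, lines starting with ">"), which this sketch does not attempt.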

In [59]:
# redefine the 20newsgroups dataset
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
import numpy as np
from pprint import pprint
from time import time

# this is to configure python logging to handle warning messages 
import logging
logging.basicConfig()

# Categories and Corpus - filtered corpus
print("Loading and filering sklearn.20newsgroup dataset, remove headers, footers, quotes")
corpus_train_20ng = fetch_20newsgroups(subset='train', shuffle=True, random_state=1, 
                                    remove=('headers', 'footers', 'quotes'))
list_train = list(corpus_train_20ng.target_names)
pprint(list_train)

# Test Set
#filtered test set
corpus_test_20ng = fetch_20newsgroups(subset='test', shuffle=True, random_state=1, 
                                    remove=('headers', 'footers', 'quotes'))
# unfiltered test set
#corpus_test_20ng = fetch_20newsgroups(subset='test', shuffle=True, random_state=1)
list_test = list(corpus_test_20ng.target_names)
pprint(list_test)

# TF-IDF Vectorizer, filtered corpus
print("Vectorizer on filtered data train group")
vectorizer_20ng = TfidfVectorizer()
vectors = vectorizer_20ng.fit_transform(corpus_train_20ng.data)
pprint(vectors.shape)
#(11314, 101631)

# Classifier 'multinomial Naive Bayes' on trimmed filtered corpus 20newsgroups set
print("Creating Naive Bayes classifier on filtered data group")
classifier_20ng = MultinomialNB(alpha=.01)
classifier_20ng.fit(vectors, corpus_train_20ng.target)

# TF-IDF Vectorizer, test corpus
print("Vectorizer on filtered data test group")
vectors_test = vectorizer_20ng.transform(corpus_test_20ng.data)
pprint(vectors_test.shape)
#(7532, 101631)

print("F1 Score on sampled/filtered set")
predictor = classifier_20ng.predict(vectors_test)
metrics.f1_score(corpus_test_20ng.target, predictor, average='macro')
#0.68286112952505695 (filtered test set, where headers, footers and quotes have been removed)
#0.77414478112872853 (unfiltered test set, same result as in fextraction1 exercise with only train set filtered)

Loading and filering sklearn.20newsgroup dataset, remove headers, footers, quotes
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
Vectorizer on filtered data train group
(11314, 101631)
Creating Naive Bayes classifier on f

0.68286112952505695

F1 score is worse on the filtered test set (no headers, footers or quotes) than on the full unfiltered test set, even though we are using a filtered training data set.
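
The score reported by `metrics.f1_score(..., average='macro')` is the unweighted mean of the per-category F1 scores, so a small category counts as much as a large one. A minimal plain-Python sketch of that average, on hypothetical toy labels (the helper `macro_f1` is for illustration only):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*tp / (2*tp + fp + fn), taken as 0 when the class never shows up
        denom = 2 * tp + fp + fn
        scores.append(2.0 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# toy labels: class 0 has F1 = 2/3, class 1 has F1 = 0.8
print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]))
```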

In [60]:
# function to provide the top 10 features (words) of a category for the provided classifier 
def show_top10(classifier, vectorizer, categories):
    feature_names = np.asarray(vectorizer.get_feature_names())
    for i, category in enumerate(categories):
        top10 = np.argsort(classifier.coef_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))
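
On recent scikit-learn releases `get_feature_names()` and the `coef_` attribute of `MultinomialNB` are no longer available; `get_feature_names_out()` and `feature_log_prob_` play the same roles. The ranking itself does not depend on the library and can be sketched in plain Python. The helper `top_k_per_class` and the toy weights below are hypothetical, for illustration only.

```python
def top_k_per_class(weights, feature_names, categories, k=10):
    """For each category, return the k features with the largest weights,
    in ascending order of weight (same ordering as np.argsort(...)[-k:])."""
    result = {}
    for category, row in zip(categories, weights):
        order = sorted(range(len(row)), key=lambda j: row[j])
        result[category] = [feature_names[j] for j in order[-k:]]
    return result


# toy weights: one row per category, one column per feature
weights = [[0.1, 0.9, 0.5],
           [0.7, 0.2, 0.3]]
top = top_k_per_class(weights, ["car", "space", "orbit"], ["sci.space", "rec.autos"], k=2)
for category in sorted(top):
    print("%s: %s" % (category, " ".join(top[category])))
```

With a fitted model, `weights` would be `classifier.feature_log_prob_` and `feature_names` would be `vectorizer.get_feature_names_out()`.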

In [61]:
show_top10(classifier_20ng, vectorizer_20ng, corpus_train_20ng.target_names)

alt.atheism: not in and it you is that of to the
comp.graphics: you in graphics it is for of and to the
comp.os.ms-windows.misc: file of you for and is it to windows the
comp.sys.ibm.pc.hardware: with scsi for of drive is it and to the
comp.sys.mac.hardware: that apple for of mac it and is to the
comp.windows.x: for this it in of is and window to the
misc.forsale: or in shipping offer 00 to and sale the for
rec.autos: is that in it of you and to car the
rec.motorcycles: for that in of you it and bike to the
rec.sport.baseball: year was is that of in and to he the
rec.sport.hockey: hockey team that game of he and in to the
sci.crypt: in be it is that key and of to the
sci.electronics: that for in it you is and of to the
sci.med: this you that in it and is to of the
sci.space: for that it is in and space of to the
soc.religion.christian: you it in god and is that to of the
talk.politics.guns: it gun is you in and that of to the
talk.politics.mideast: it is israel that you in and to of th

As we have seen earlier, the top 10 features, even with a classifier, are not necessarily very meaningful. In fextraction1 we have seen how to use a larger top feature set to remove commonalities and improve results. We are going to see below how to use the parameters of the tf-idf vectorization to improve filtering before the classifier.

### Section5 :  Comparing TF-IDF, TF and post-treatment filtering as seen on Sections 3 & 4

The post-treatment we have done is centered on removing from the classifier top features the ones that are too common across categories. This is similar to term weighting and inverse document frequency filtering. Let us compare the results.
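
As a reminder of the idea, here is a minimal sketch of such a post-treatment: words that show up in the top list of too many categories are dropped. The helper `drop_common_words` and its `max_categories` threshold are hypothetical and not necessarily what fextraction1 does; the three word lists are taken from the top 10 printed above.

```python
from collections import Counter


def drop_common_words(top_words, max_categories=1):
    """Remove words that appear in the top list of more than max_categories categories."""
    spread = Counter(word for words in top_words.values() for word in set(words))
    return {
        category: [w for w in words if spread[w] <= max_categories]
        for category, words in top_words.items()
    }


top_words = {
    "comp.graphics": "you in graphics it is for of and to the".split(),
    "comp.os.ms-windows.misc": "file of you for and is it to windows the".split(),
    "misc.forsale": "or in shipping offer 00 to and sale the for".split(),
}
filtered = drop_common_words(top_words)
for category in sorted(filtered):
    print("%s: %s" % (category, " ".join(filtered[category])))
```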

Let us compare to a basic CountVectorizer first.

In [62]:
from sklearn.feature_extraction.text import CountVectorizer

# TF Vectorizer, filtered corpus
print("Vectorizer on filtered data train group")
vectorizer_20ng2 = CountVectorizer()
vectors2 = vectorizer_20ng2.fit_transform(corpus_train_20ng.data)
pprint(vectors2.shape)
#(11314, 101631)

# Classifier 'multinomial Naive Bayes' on trimmed filtered corpus 20newsgroups set
print("Creating Naive Bayes classifier on filtered data group")
classifier_20ng2 = MultinomialNB(alpha=.01)
classifier_20ng2.fit(vectors2, corpus_train_20ng.target)

# TF Vectorizer, test corpus
print("Vectorizer on filtered data test group")
vectors_test2 = vectorizer_20ng2.transform(corpus_test_20ng.data)
pprint(vectors_test2.shape)
#(7532, 101631)

print("F1 Score on sampled/filtered set")
predictor2 = classifier_20ng2.predict(vectors_test2)
pprint(metrics.f1_score(corpus_test_20ng.target, predictor2, average='macro'))
#0.6203806145034193 (with count vectorizer)

Vectorizer on filtered data train group
(11314, 101631)
Creating Naive Bayes classifier on filtered data group
Vectorizer on filtered data test group
(7532, 101631)
F1 Score on sampled/filtered set
0.6203806145034193


The vector sizes are the same with and without tf-idf, if no parameters are given.
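
The sizes match because tf-idf only reweights the counts: both vectorizers build the same vocabulary. A minimal plain-Python sketch on three hypothetical documents, assuming a smoothed idf `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row (which, as far as I know, corresponds to the `TfidfVectorizer` defaults):

```python
import math
import re
from collections import Counter


def tokenize(text):
    # lowercase words of two or more characters
    return re.findall(r"\b\w\w+\b", text.lower())


def count_and_tfidf(docs):
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocabulary = sorted(set(term for c in counts for term in c))
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(term for c in counts for term in c)
    idf = dict((t, math.log((1.0 + n_docs) / (1.0 + df[t])) + 1.0) for t in vocabulary)
    count_rows, tfidf_rows = [], []
    for c in counts:
        weighted = [c[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weighted)) or 1.0
        count_rows.append([c[t] for t in vocabulary])
        tfidf_rows.append([w / norm for w in weighted])
    return vocabulary, count_rows, tfidf_rows


docs = ["the car is fast", "the bike is fast too", "space is big"]
vocabulary, count_rows, tfidf_rows = count_and_tfidf(docs)
print("%d terms, count row: %d columns, tf-idf row: %d columns"
      % (len(vocabulary), len(count_rows[0]), len(tfidf_rows[0])))
```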

In [63]:
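# pass the count-based classifier (classifier_20ng2) together with its own vectorizer
# the listing recorded below came from a run that passed classifier_20ng (the tf-idf classifier),
# which is why it repeats the tf-idf result
show_top10(classifier_20ng2, vectorizer_20ng2, corpus_train_20ng.target_names)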

alt.atheism: not in and it you is that of to the
comp.graphics: you in graphics it is for of and to the
comp.os.ms-windows.misc: file of you for and is it to windows the
comp.sys.ibm.pc.hardware: with scsi for of drive is it and to the
comp.sys.mac.hardware: that apple for of mac it and is to the
comp.windows.x: for this it in of is and window to the
misc.forsale: or in shipping offer 00 to and sale the for
rec.autos: is that in it of you and to car the
rec.motorcycles: for that in of you it and bike to the
rec.sport.baseball: year was is that of in and to he the
rec.sport.hockey: hockey team that game of he and in to the
sci.crypt: in be it is that key and of to the
sci.electronics: that for in it you is and of to the
sci.med: this you that in it and is to of the
sci.space: for that it is in and space of to the
soc.religion.christian: you it in god and is that to of the
talk.politics.guns: it gun is you in and that of to the
talk.politics.mideast: it is israel that you in and to of th

Compared to the tf-idf vectorizer (without parameters):  
alt.atheism: not in and it you is that of to the  
comp.graphics: you in graphics it is for of and to the  
comp.os.ms-windows.misc: file of you for and is it to windows the  
comp.sys.ibm.pc.hardware: with scsi for of drive is it and to the  
comp.sys.mac.hardware: that apple for of mac it and is to the  
comp.windows.x: for this it in of is and window to the  
misc.forsale: or in shipping offer 00 to and sale the for  
rec.autos: is that in it of you and to car the  
rec.motorcycles: for that in of you it and bike to the  
rec.sport.baseball: year was is that of in and to he the  
rec.sport.hockey: hockey team that game of he and in to the  
sci.crypt: in be it is that key and of to the  
sci.electronics: that for in it you is and of to the  
sci.med: this you that in it and is to of the  
sci.space: for that it is in and space of to the  
soc.religion.christian: you it in god and is that to of the  
talk.politics.guns: it gun is you in and that of to the  
talk.politics.mideast: it is israel that you in and to of the  
talk.politics.misc: are it is you in and that of to the  
talk.religion.misc: not it in you is and that to of the  

In [64]:
# TF-IDF Vectorizer with parameters, filtered corpus
# here we use the tf-idf vectorizer from sklearn
# TERM WEIGHTING
# tf / term frequency
# idf / inverse document frequency
# max_df: terms with a document frequency higher than this value are ignored (a float is a proportion of documents)
# min_df: cut-off, terms with a document frequency lower than this value are ignored (an integer is an absolute count)
# analyzer='word': default value, features will be made of word n-grams
# ngram_range=tuple (min_n, max_n): default (1, 1), n-grams used such that min_n <= n <= max_n
# vocabulary: default None, if not given, a vocabulary is determined from the input documents.
# max_features: default None, if not None, build a vocabulary with only top max_features ordered by term frequency across the corpus.
t0 = time()

print("Vectorizer on filtered data train group")
vectorizer_20ng3 = TfidfVectorizer(max_df=0.95, min_df=2, 
                                   stop_words='english')
vectors3 = vectorizer_20ng3.fit_transform(corpus_train_20ng.data)
pprint(vectors3.shape)
#(11314, 39116)
# 39116 features are extracted with tf-idf, english stop words, max_df at 95% and min_df at 2

# Classifier 'multinomial Naive Bayes' on trimmed filtered corpus 20newsgroups set
print("Creating Naive Bayes classifier on filtered data group")
classifier_20ng3 = MultinomialNB(alpha=.01)
classifier_20ng3.fit(vectors3, corpus_train_20ng.target)

# TF-IDF Vectorizer, test corpus
print("Vectorizer on filtered data test group")
vectors_test3 = vectorizer_20ng3.transform(corpus_test_20ng.data)
pprint(vectors_test3.shape)
#(7532, 39116)

print("F1 Score on sampled/filtered set")
predictor3 = classifier_20ng3.predict(vectors_test3)
pprint(metrics.f1_score(corpus_test_20ng.target, predictor3, average='macro'))
#0.68007259851926749 (with tf-idf & parameters)

print("done in %0.3fs." % (time() - t0))
#time 5.3s

Vectorizer on filtered data train group
(11314, 39116)
Creating Naive Bayes classifier on filtered data group
Vectorizer on filtered data test group
(7532, 39116)
F1 Score on sampled/filtered set
0.68007259851926749
done in 5.007s.


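F1 score is about the same as with the tf-idf vectorizer without parameters (0.680 against 0.683) and well above the count vectorizer (0.620), with a much smaller vector size (hence memory footprint).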

In [65]:
show_top10(classifier_20ng3, vectorizer_20ng3, corpus_train_20ng.target_names)

alt.atheism: islam atheists say just religion atheism think don people god
comp.graphics: format looking 3d know program file files thanks image graphics
comp.os.ms-windows.misc: program problem thanks drivers use driver files dos file windows
comp.sys.ibm.pc.hardware: monitor disk thanks pc ide controller bus card scsi drive
comp.sys.mac.hardware: know monitor quadra does simms problem thanks drive apple mac
comp.windows.x: windows xterm x11r5 use application thanks widget motif server window
misc.forsale: asking email price sell new condition shipping 00 offer sale
rec.autos: don ford new good dealer just engine like cars car
rec.motorcycles: don helmet just riding like motorcycle ride bikes dod bike
rec.sport.baseball: braves players pitching hit runs games game baseball team year
rec.sport.hockey: league year nhl games season players play hockey team game
sci.crypt: people use escrow nsa keys government chip clipper encryption key
sci.electronics: good voltage thanks used does know

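Using a tf-idf vectorizer with english stop words (removing most common english words) is clearly much more effective. F1 score on the filtered test set is nearly unchanged (0.680 against 0.683 for tf-idf without parameters) and remains below the 0.77 obtained on the unfiltered test set, but the top10 lists show much less overfit with more meaningful words.  
The matrix sizes are also much smaller, going from 101631 to 39116 features. Most of this reduction comes from min_df=2, which drops the words seen in a single document: as section 5.5 shows, stop words and max_df alone still leave 101323 features.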

#### 5.1 max_features impact on tf-idf

Let us compare the impact of the number of features on the F1 score and fitting (extracted top 10) when using the term frequency and inverse document frequency vectorization.

In [66]:
t0 = time()

print("Vectorizer on filtered data train group")
vectorizer_20ng3b = TfidfVectorizer(max_df=0.95, min_df=2, 
                                   max_features=1000,
                                   stop_words='english')
vectors3b = vectorizer_20ng3b.fit_transform(corpus_train_20ng.data)
pprint(vectors3b.shape)
#(11314, 1000)
# features extracted are limited to 1000

# Classifier 'multinomial Naive Bayes' on trimmed filtered corpus 20newsgroups set
print("Creating Naive Bayes classifier on filtered data group")
classifier_20ng3b = MultinomialNB(alpha=.01)
classifier_20ng3b.fit(vectors3b, corpus_train_20ng.target)

# TF-IDF Vectorizer, test corpus
print("Vectorizer on filtered data test group")
vectors_test3b = vectorizer_20ng3b.transform(corpus_test_20ng.data)
pprint(vectors_test3b.shape)
#(7532, 1000)

print("F1 Score on sampled/filtered set")
predictor3b = classifier_20ng3b.predict(vectors_test3b)
pprint(metrics.f1_score(corpus_test_20ng.target, predictor3b, average='macro'))
#0.50290117598718154 (with tf-idf and features limited to 1000)

print("done in %0.3fs." % (time() - t0))
#time 5.2s

Vectorizer on filtered data train group
(11314, 1000)
Creating Naive Bayes classifier on filtered data group
Vectorizer on filtered data test group
(7532, 1000)
F1 Score on sampled/filtered set
0.50290117598718154
done in 4.879s.


F1 score is down from 0.68 to 0.5 when we limit the features of the tf-idf vectorization to 1000.

In [67]:
show_top10(classifier_20ng3b, vectorizer_20ng3b, corpus_train_20ng.target_names)

alt.atheism: said does say just atheism religion think don people god
comp.graphics: does format 3d know program file files thanks image graphics
comp.os.ms-windows.misc: using problem driver drivers thanks use files dos file windows
comp.sys.ibm.pc.hardware: monitor disk pc thanks ide controller bus scsi card drive
comp.sys.mac.hardware: scsi monitor know use does problem thanks drive apple mac
comp.windows.x: code program application using thanks use widget motif server window
misc.forsale: price email asking sell new condition 00 shipping offer sale
rec.autos: think know don new good just engine like cars car
rec.motorcycles: think ve good right know don like just dod bike
rec.sport.baseball: just good players think hit runs games game team year
rec.sport.hockey: think year games nhl players play season hockey team game
sci.crypt: escrow people use nsa keys government chip clipper encryption key
sci.electronics: ve don good current does know used power like use
sci.med: time think l

If we compare this top10 to the previous one we can see more basic words. Because features are limited to the 1000 most frequent terms of the corpus, the words that mostly appear in a single category (and so less often across all documents) have been removed from the features.  
The extraction has become less meaningful for each document category, but we cut from more than 39K features to 1K. The processing time on the other hand has not been reduced significantly.
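
A minimal sketch of why a feature cap favours generic words: terms are ranked by their total count across the whole corpus, so a word used a little in every category outranks a word used only in one. The helper `limit_vocabulary`, the alphabetical tie-break and the toy documents are hypothetical.

```python
from collections import Counter


def limit_vocabulary(tokenized_docs, max_features):
    """Keep the max_features terms with the highest total count across the corpus."""
    totals = Counter(term for doc in tokenized_docs for term in doc)
    # highest total first, ties broken alphabetically (an arbitrary choice)
    ranked = sorted(totals, key=lambda term: (-totals[term], term))
    return sorted(ranked[:max_features])


tokenized_docs = [
    ["car", "engine", "car", "good"],
    ["bike", "helmet", "good"],
    ["car", "dealer", "good", "good"],
]
print(limit_vocabulary(tokenized_docs, 2))
```

Here the generic word "good" survives the cap while "helmet" and "dealer", each specific to one document, do not.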

#### 5.2 max frequency impact on tf-idf

In the previous examples we have used a max document frequency of 95%: any word present in more than 95% of the documents was removed from the extracted features. Let us see if it impacts the size and/or the fitting.
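
The document frequency cut-offs can be sketched in plain Python as follows. This assumes, as in this notebook, that `min_df` is given as an absolute number of documents and `max_df` as a proportion; the helper `filter_by_document_frequency` and the toy documents are hypothetical.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, min_df=1, max_df=1.0):
    """Return the terms whose document frequency lies within [min_df, max_df * n_docs]."""
    n_docs = len(tokenized_docs)
    # count each term once per document
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    max_count = max_df * n_docs
    return sorted(t for t, n in df.items() if min_df <= n <= max_count)


tokenized_docs = [
    ["space", "nasa", "like"],
    ["space", "orbit", "like"],
    ["car", "engine", "like"],
    ["car", "dealer", "like"],
]
# "like" is in every document: kept without a max_df limit, dropped at 95%
print(filter_by_document_frequency(tokenized_docs, min_df=2, max_df=1.0))
print(filter_by_document_frequency(tokenized_docs, min_df=2, max_df=0.95))
```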

In [68]:
t0 = time()

print("Vectorizer on filtered data train group")
vectorizer_20ng3c = TfidfVectorizer(min_df=2, 
                                   stop_words='english')
vectors3c = vectorizer_20ng3c.fit_transform(corpus_train_20ng.data)
pprint(vectors3c.shape)
#(11314, 39116)

# Classifier 'multinomial Naive Bayes' on trimmed filtered corpus 20newsgroups set
print("Creating Naive Bayes classifier on filtered data group")
classifier_20ng3c = MultinomialNB(alpha=.01)
classifier_20ng3c.fit(vectors3c, corpus_train_20ng.target)

# TF-IDF Vectorizer, test corpus
print("Vectorizer on filtered data test group")
vectors_test3c = vectorizer_20ng3c.transform(corpus_test_20ng.data)
pprint(vectors_test3c.shape)
#(7532, 39116)

print("F1 Score on sampled/filtered set")
predictor3c = classifier_20ng3c.predict(vectors_test3c)
pprint(metrics.f1_score(corpus_test_20ng.target, predictor3c, average='macro'))
#0.68007259851926749

print("done in %0.3fs." % (time() - t0))
#time 4.8s

Vectorizer on filtered data train group
(11314, 39116)
Creating Naive Bayes classifier on filtered data group
Vectorizer on filtered data test group
(7532, 39116)
F1 Score on sampled/filtered set
0.68007259851926749
done in 4.782s.


Result comparison for TF-IDF vectorization

with max_df = 95%
Vectorizer on filtered data train group
(11314, 39116)
Creating Naive Bayes classifier on filtered data group
Vectorizer on filtered data test group
(7532, 39116)
F1 Score on sampled/filtered set
0.68007259851926749
done in 4.731s.

without max_df
Vectorizer on filtered data train group
(11314, 39116)
Creating Naive Bayes classifier on filtered data group
Vectorizer on filtered data test group
(7532, 39116)
F1 Score on sampled/filtered set
0.68007259851926749
done in 4.779s.


In [69]:
show_top10(classifier_20ng3c, vectorizer_20ng3c, corpus_train_20ng.target_names)

alt.atheism: islam atheists say just religion atheism think don people god
comp.graphics: format looking 3d know program file files thanks image graphics
comp.os.ms-windows.misc: program problem thanks drivers use driver files dos file windows
comp.sys.ibm.pc.hardware: monitor disk thanks pc ide controller bus card scsi drive
comp.sys.mac.hardware: know monitor quadra does simms problem thanks drive apple mac
comp.windows.x: windows xterm x11r5 use application thanks widget motif server window
misc.forsale: asking email price sell new condition shipping 00 offer sale
rec.autos: don ford new good dealer just engine like cars car
rec.motorcycles: don helmet just riding like motorcycle ride bikes dod bike
rec.sport.baseball: braves players pitching hit runs games game baseball team year
rec.sport.hockey: league year nhl games season players play hockey team game
sci.crypt: people use escrow nsa keys government chip clipper encryption key
sci.electronics: good voltage thanks used does know

The features extracted are the same as with the tf-idf at 95% max document frequency.  
The max document frequency set at 95% has no impact: with the common english words already filtered, no remaining word is present in more than 95% of the documents.

#### 5.3 max frequency impact on tf-idf - try stronger filtering

If we change the max frequency filtering from 95% to 10% we should see an impact, which would confirm our previous results.

In [70]:
t0 = time()

print("Vectorizer on filtered data train group")
vectorizer_20ng3d = TfidfVectorizer(max_df=0.10, min_df=2, 
                                   stop_words='english')
vectors3d = vectorizer_20ng3d.fit_transform(corpus_train_20ng.data)
pprint(vectors3d.shape)
#(11314, 39096)

# Classifier 'multinomial Naive Bayes' on trimmed filtered corpus 20newsgroups set
print("Creating Naive Bayes classifier on filtered data group")
classifier_20ng3d = MultinomialNB(alpha=.01)
classifier_20ng3d.fit(vectors3d, corpus_train_20ng.target)

# TF-IDF Vectorizer, test corpus
print("Vectorizer on filtered data test group")
vectors_test3d = vectorizer_20ng3d.transform(corpus_test_20ng.data)
pprint(vectors_test3d.shape)
#(7532, 39096)

print("F1 Score on sampled/filtered set")
predictor3d = classifier_20ng3d.predict(vectors_test3d)
pprint(metrics.f1_score(corpus_test_20ng.target, predictor3d, average='macro'))
#0.67754321082070557

print("done in %0.3fs." % (time() - t0))
#time 4.8s

Vectorizer on filtered data train group
(11314, 39096)
Creating Naive Bayes classifier on filtered data group
Vectorizer on filtered data test group
(7532, 39096)
F1 Score on sampled/filtered set
0.67754321082070557
done in 4.894s.


Result comparison of the max_df parameter impact

with max_df = 95%
Vectorizer on filtered data train group
(11314, 39116)
Creating Naive Bayes classifier on filtered data group
Vectorizer on filtered data test group
(7532, 39116)
F1 Score on sampled/filtered set
0.68007259851926749
done in 4.731s.

with max_df = 10%
Vectorizer on filtered data train group
(11314, 39096)
Creating Naive Bayes classifier on filtered data group
Vectorizer on filtered data test group
(7532, 39096)
F1 Score on sampled/filtered set
0.67754321082070557
done in 4.847s.

Only 20 words are present in more than 10% of all the documents. Removing them doesn't change the results a lot, but we could investigate the efficiency of the max_df parameter when the common english words have not first been filtered with stop_words='english'.


In [71]:
show_top10(classifier_20ng3d, vectorizer_20ng3d, corpus_train_20ng.target_names)

alt.atheism: moral said objective morality bible islam atheists religion atheism god
comp.graphics: hi software format looking 3d program file files image graphics
comp.os.ms-windows.misc: using card program problem drivers driver files dos file windows
comp.sys.ibm.pc.hardware: drives monitor disk pc ide controller bus card scsi drive
comp.sys.mac.hardware: centris scsi lc monitor quadra simms problem drive apple mac
comp.windows.x: program using windows xterm x11r5 application widget server motif window
misc.forsale: interested asking email price sell condition shipping 00 offer sale
rec.autos: toyota drive auto price oil ford dealer engine cars car
rec.motorcycles: honda dog bmw helmet riding motorcycle ride bikes dod bike
rec.sport.baseball: braves players hit pitching runs games game baseball team year
rec.sport.hockey: league year nhl games season players play hockey team game
sci.crypt: secure security escrow nsa keys government chip clipper encryption key
sci.electronics: work 

Let us compare this word extraction to the original from the tf-idf:  
alt.atheism: islam atheists say just religion atheism think don people god  
comp.graphics: looking format 3d know program file files thanks image graphics  
comp.os.ms-windows.misc: card problem thanks driver drivers use files dos file windows  
comp.sys.ibm.pc.hardware: monitor disk thanks pc ide controller bus card scsi drive  
comp.sys.mac.hardware: know monitor does quadra simms thanks problem drive apple mac  
comp.windows.x: using windows x11r5 use application thanks widget server motif window  
misc.forsale: asking email sell price condition new shipping offer 00 sale  
rec.autos: don ford new good dealer just engine like cars car  
rec.motorcycles: don just helmet riding like motorcycle ride bikes dod bike  
rec.sport.baseball: braves players pitching hit runs games game baseball team year  
rec.sport.hockey: league year nhl games season players play hockey team game  
sci.crypt: people use escrow nsa keys government chip clipper encryption key  
sci.electronics: don thanks voltage used know does like circuit power use  
sci.med: skepticism cadre dsl banks chastity n3jxp pitt gordon geb msg  
sci.space: just lunar earth shuttle like moon launch orbit nasa space  
soc.religion.christian: believe faith christian christ bible people christians church jesus god  
talk.politics.guns: just law firearms government fbi don weapons people guns gun  
talk.politics.mideast: said arabs arab turkish people armenians armenian jews israeli israel  
talk.politics.misc: know state clinton president just think tax don government people  
talk.religion.misc: think don koresh objective christians bible people christian jesus god  

Let us compare one category:  
sci.space: data spacecraft lunar earth shuttle moon launch orbit nasa space  
sci.space: just lunar earth shuttle like moon launch orbit nasa space  
The differences are:   
sci.space: data spacecraft  
sci.space: just like  

With the extra filtering on document frequency we removed two additional common words: just and like. These two words can carry strong meaning elsewhere, but for a category like sci.space they can probably be removed without losing information. They were replaced by data and spacecraft, which make more sense for the sci.space category.
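
This kind of comparison can be automated for every category. A minimal sketch, using the two sci.space lists above (the helper `list_difference` is hypothetical):

```python
def list_difference(first, second):
    """Words present in one top-10 list but not in the other, in original order."""
    only_first = [w for w in first if w not in second]
    only_second = [w for w in second if w not in first]
    return only_first, only_second


with_max_df_10 = "data spacecraft lunar earth shuttle moon launch orbit nasa space".split()
with_max_df_95 = "just lunar earth shuttle like moon launch orbit nasa space".split()
only_10, only_95 = list_difference(with_max_df_10, with_max_df_95)
print("only with max_df=10%%: %s" % " ".join(only_10))
print("only with max_df=95%%: %s" % " ".join(only_95))
```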

#### 5.4 max frequency as a replacement for the english stop words filtering of TfidfVectorizer

Can the max frequency be used to replace the english words filtering (stop_words='english')?  
Let us compare fitting results from a max_df of 80% to a max_df of 30%.  

In [72]:
t0 = time()

print("Vectorizer on filtered data train group")
vectorizer_20ng3e = TfidfVectorizer(max_df=0.8, min_df=2)
vectors3e = vectorizer_20ng3e.fit_transform(corpus_train_20ng.data)
pprint(vectors3e.shape)
#(11314, 39422)

# Classifier 'multinomial Naive Bayes' on trimmed filtered corpus 20newsgroups set
print("Creating Naive Bayes classifier on filtered data group")
classifier_20ng3e = MultinomialNB(alpha=.01)
classifier_20ng3e.fit(vectors3e, corpus_train_20ng.target)

# TF-IDF Vectorizer, test corpus
print("Vectorizer on filtered data test group")
vectors_test3e = vectorizer_20ng3e.transform(corpus_test_20ng.data)
pprint(vectors_test3e.shape)
#(7532, 39422)

print("F1 Score on sampled/filtered set")
predictor3e = classifier_20ng3e.predict(vectors_test3e)
pprint(metrics.f1_score(corpus_test_20ng.target, predictor3e, average='macro'))
#0.68092009218862148

print("done in %0.3fs." % (time() - t0))
#time 4.8s

Vectorizer on filtered data train group
(11314, 39422)
Creating Naive Bayes classifier on filtered data group
Vectorizer on filtered data test group
(7532, 39422)
F1 Score on sampled/filtered set
0.68092009218862148
done in 4.814s.


Result comparison for TF-IDF vectorization

with english stop words filtering  
Vectorizer on filtered data train group
(11314, 39116)
Creating Naive Bayes classifier on filtered data group
Vectorizer on filtered data test group
(7532, 39116)
F1 Score on sampled/filtered set
0.68007259851926749
done in 4.731s.

with max_df = 80% and no stop words  
Vectorizer on filtered data train group
(11314, 39422)
Creating Naive Bayes classifier on filtered data group
Vectorizer on filtered data test group
(7532, 39422)
F1 Score on sampled/filtered set
0.68092009218862148
done in 4.815s.

The sizes of the resulting vectors are close.

In [73]:
show_top10(classifier_20ng3e, vectorizer_20ng3e, corpus_train_20ng.target_names)

alt.atheism: are not in and it you is that to of
comp.graphics: that you in graphics it is for and of to
comp.os.ms-windows.misc: in file of you for and is it to windows
comp.sys.ibm.pc.hardware: have with scsi for drive of is it and to
comp.sys.mac.hardware: with that apple for of mac it and is to
comp.windows.x: server for this it in of is and window to
misc.forsale: of or in shipping offer 00 to and sale for
rec.autos: on is that in it of you and to car
rec.motorcycles: my for that in of you it and bike to
rec.sport.baseball: they year was is that of in and to he
rec.sport.hockey: was hockey team that game of he and in to
sci.crypt: this in be it is that key and of to
sci.electronics: on that for it in you is and of to
sci.med: are this you that in it and is to of
sci.space: be for that it is in and space of to
soc.religion.christian: not you it in god and is that to of
talk.politics.guns: they it gun is you in and that of to
talk.politics.mideast: not it is israel you that in and t

The features extracted without stop words and with an 80% document frequency filtering do not fit the data very well. There are still too many common english words, and little of the specificity of each category has been extracted in the top 10 features.

In [74]:
t0 = time()

print("Vectorizer on filtered data train group")
vectorizer_20ng3e1 = TfidfVectorizer(max_df=0.3, min_df=2)
vectors3e1 = vectorizer_20ng3e1.fit_transform(corpus_train_20ng.data)
pprint(vectors3e1.shape)
#(11314, 39398)

# Classifier 'multinomial Naive Bayes' on trimmed filtered corpus 20newsgroups set
print("Creating Naive Bayes classifier on filtered data group")
classifier_20ng3e1 = MultinomialNB(alpha=.01)
classifier_20ng3e1.fit(vectors3e1, corpus_train_20ng.target)

# TF-IDF Vectorizer, test corpus
print("Vectorizer on filtered data test group")
vectors_test3e1 = vectorizer_20ng3e1.transform(corpus_test_20ng.data)
pprint(vectors_test3e1.shape)
#(7532, 39398)

print("F1 Score on sampled/filtered set")
predictor3e1 = classifier_20ng3e1.predict(vectors_test3e1)
pprint(metrics.f1_score(corpus_test_20ng.target, predictor3e1, average='macro'))
#0.68088798109027371

print("done in %0.3fs." % (time() - t0))
#time 4.8s

Vectorizer on filtered data train group
(11314, 39398)
Creating Naive Bayes classifier on filtered data group
Vectorizer on filtered data test group
(7532, 39398)
F1 Score on sampled/filtered set
0.68088798109027371
done in 5.509s.


Result comparison for TF-IDF vectorization

with english stop words filtering  
Vectorizer on filtered data train group
(11314, 39116)
Creating Naive Bayes classifier on filtered data group
Vectorizer on filtered data test group
(7532, 39116)
F1 Score on sampled/filtered set
0.68007259851926749
done in 4.731s.

with max_df = 80% and no stop words  
Vectorizer on filtered data train group
(11314, 39422)
Creating Naive Bayes classifier on filtered data group
Vectorizer on filtered data test group
(7532, 39422)
F1 Score on sampled/filtered set
0.68092009218862148
done in 4.815s.

with max_df = 30% and no stop words  
Vectorizer on filtered data train group
(11314, 39398)
Creating Naive Bayes classifier on filtered data group
Vectorizer on filtered data test group
(7532, 39398)
F1 Score on sampled/filtered set
0.68088798109027371
done in 5.109s.

The sizes of the resulting vectors are close: the max frequency filtering does not remove many extracted features, and the resulting size is close to the one obtained with stop words filtering.

In [75]:
show_top10(classifier_20ng3e1, vectorizer_20ng3e1, corpus_train_20ng.target_names)

alt.atheism: one people was so we your they do what god
comp.graphics: would me program file files there thanks image any graphics
comp.os.ms-windows.misc: thanks use drivers driver there my files dos file windows
comp.sys.ibm.pc.hardware: thanks pc any ide controller bus my card scsi drive
comp.sys.mac.hardware: there any one problem thanks what my drive apple mac
comp.windows.x: use my application do thanks widget any motif server window
misc.forsale: price sell please me new condition shipping offer 00 sale
rec.autos: me any your out about was cars they my car
rec.motorcycles: motorcycle one ride bikes me your dod was my bike
rec.sport.baseball: runs games game baseball team his they year was he
rec.sport.hockey: nhl season players play they was hockey team game he
sci.crypt: nsa would will government keys chip they clipper encryption key
sci.electronics: what power circuit anyone use any one they would there
sci.med: n3jxp chastity there pitt about gordon geb was msg my
sci.space: 

Extracted features for each category are now closer to the results obtained with the English stop words. The max_df parameter can filter data in a similar way to the English stop word list, but the results are less accurate and a very strong max_df filter may be needed.
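
To make the max_df idea concrete, here is a minimal pure-Python sketch of document frequency filtering on a toy corpus. It is not scikit-learn's implementation: the helper names (`document_frequency`, `filter_by_max_df`) and the toy sentences are assumptions made for illustration, and tokenization is a plain whitespace split.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        # a set so that a term repeated inside one document counts once
        df.update(set(doc.lower().split()))
    return df

def filter_by_max_df(docs, max_df):
    """Keep the terms found in at most a max_df fraction of the documents."""
    df = document_frequency(docs)
    limit = max_df * len(docs)
    return sorted(term for term, count in df.items() if count <= limit)

# hypothetical toy corpus: "the" and "is" appear in every document
toy_docs = [
    "the car is fast",
    "the bike is fast",
    "the key is secret",
    "the moon is far",
]

vocab_all = filter_by_max_df(toy_docs, 1.0)   # nothing is filtered
vocab_half = filter_by_max_df(toy_docs, 0.5)  # drops terms in more than half the documents

print("max_df=1.0: %s" % " ".join(vocab_all))
print("max_df=0.5: %s" % " ".join(vocab_half))
```

On this toy corpus only the two words present in every document are dropped, which mirrors the small change in vocabulary size observed above: few words are frequent enough to cross the threshold unless max_df is set very low.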

#### 5.5 min frequency impact on tf-idf

We first studied the impact of the maximum document frequency; let us now see how the minimum document frequency (min_df) affects the results and the size of the models.

In [76]:
t0 = time()

print("Vectorizer on filtered data train group")
# min_df is left at its default value (1), so the rarest words are kept
vectorizer_20ng3f = TfidfVectorizer(max_df=0.95, 
                                   stop_words='english')
vectors3f = vectorizer_20ng3f.fit_transform(corpus_train_20ng.data)
pprint(vectors3f.shape)
#(11314, 101323)

# Classifier 'multinomial Naive Bayes' on trimmed filtered corpus 20newsgroups set
print("Creating Naive Bayes classifier on filtered data group")
classifier_20ng3f = MultinomialNB(alpha=.01)
classifier_20ng3f.fit(vectors3f, corpus_train_20ng.target)

# TF-IDF Vectorizer, test corpus
print("Vectorizer on filtered data test group")
vectors_test3f = vectorizer_20ng3f.transform(corpus_test_20ng.data)
pprint(vectors_test3f.shape)
#(7532, 101323)

print("F1 Score on sampled/filtered set")
predictor3f = classifier_20ng3f.predict(vectors_test3f)
pprint(metrics.f1_score(corpus_test_20ng.target, predictor3f, average='macro'))
#0.68443899192121638

print("done in %0.3fs." % (time() - t0))
#time 4.9s

Vectorizer on filtered data train group
(11314, 101323)
Creating Naive Bayes classifier on filtered data group
Vectorizer on filtered data test group
(7532, 101323)
F1 Score on sampled/filtered set
0.68443899192121638
done in 5.820s.


Result comparison for TF-IDF vectorization

basic results with max_df, min_df and stop words   
Vectorizer on filtered data train group
(11314, 39116)
Creating Naive Bayes classifier on filtered data group
Vectorizer on filtered data test group
(7532, 39116)
F1 Score on sampled/filtered set
0.68007259851926749
done in 4.731s.

without min_df = 2  
Vectorizer on filtered data train group
(11314, 101323)
Creating Naive Bayes classifier on filtered data group
Vectorizer on filtered data test group
(7532, 101323)
F1 Score on sampled/filtered set
0.68443899192121638
done in 4.884s.

F1 scores are close, but the matrices are much smaller when the rarest words are removed (39116 columns with min_df = 2 versus 101323 without).
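
The size reduction comes from the large number of terms that occur in a single document. The following pure-Python sketch illustrates this on a toy corpus; it is not how scikit-learn is implemented, and the `vocabulary_size` helper and the toy documents are assumptions made for this example.

```python
from collections import Counter

def vocabulary_size(docs, min_df=1):
    """Number of distinct terms that occur in at least min_df documents."""
    df = Counter()
    for doc in docs:
        # a set so that a term repeated inside one document counts once
        df.update(set(doc.lower().split()))
    return sum(1 for count in df.values() if count >= min_df)

# hypothetical toy corpus: each document carries some tokens seen nowhere else
toy_docs = [
    "scsi drive controller n3jxp",
    "scsi drive floppy xdmcp",
    "drive controller ide whirrr",
]

size_all = vocabulary_size(toy_docs, min_df=1)
size_trimmed = vocabulary_size(toy_docs, min_df=2)

print("min_df=1: %d terms" % size_all)
print("min_df=2: %d terms" % size_trimmed)
```

Terms seen in only one document cannot help the classifier generalize to other documents of the same category, which is consistent with the F1 score barely moving while the number of columns drops sharply.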

In [77]:
show_top10(classifier_20ng3f, vectorizer_20ng3f, corpus_train_20ng.target_names)

alt.atheism: islam atheists say just religion atheism think don people god
comp.graphics: looking format 3d know program file files thanks image graphics
comp.os.ms-windows.misc: card problem thanks driver drivers use files dos file windows
comp.sys.ibm.pc.hardware: monitor disk thanks pc ide controller bus card scsi drive
comp.sys.mac.hardware: know monitor does quadra simms thanks problem drive apple mac
comp.windows.x: using windows x11r5 use application thanks widget server motif window
misc.forsale: asking email sell price condition new shipping offer 00 sale
rec.autos: don ford new good dealer just engine like cars car
rec.motorcycles: don just helmet riding like motorcycle ride bikes dod bike
rec.sport.baseball: braves players pitching hit runs games game baseball team year
rec.sport.hockey: league year nhl games season players play hockey team game
sci.crypt: people use escrow nsa keys government chip clipper encryption key
sci.electronics: don thanks voltage used know does lik

The extracted features still fit the categories well when the rarest words are kept.

What do we learn from these different results on TF-IDF vectorization and filtering?  
1. dictionary filtering using stop_words
    This is the most effective way to remove less meaningful words and fit the results. It does require a stop word dictionary for the language being filtered (only English is built into sklearn).
    
2. max document frequency filtering
    This is similar to dictionary filtering: it aims at removing the most common (and potentially less meaningful) features. It can require a strong filtering parameter; the F1 scores in our examples were not strongly affected even with max_df as low as 30%.
    
3. min document frequency filtering
    The first effect of this filter is to reduce the size of the matrices. It removes the least frequent words, so we do not see an effect on the top 10 extracted features, but it can produce more false negatives.


### Section6 : Using other models

In all the previous examples we used a multinomial Naive Bayes (MultinomialNB) classifier. Let us compare results with other models.

In [78]:
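# top words helper for models that are matrix decompositions rather than classifiers
def print_top_n_words(model, vectorizer, corpus, n_top_words):
    # note: newer scikit-learn releases replace this method with get_feature_names_out()
    feature_names = np.asarray(vectorizer.get_feature_names())
    # corpus is kept in the signature for the callers below but is not needed here
    for i, topic in enumerate(model.components_):
        # argsort is ascending, so the reversed slice yields the n largest weights first
        print("Topic %d: %s" % (i, " ".join([feature_names[j]
                        for j in topic.argsort()[:-n_top_words - 1:-1]])))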
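
The slice `[:-n_top_words - 1:-1]` used in the helper above can look cryptic. Here is a small standalone sketch of the same trick in plain Python, with a hypothetical five-word vocabulary and made-up weights; `sorted` over the indices stands in for `argsort`.

```python
def top_n_indices(weights, n):
    """Indices of the n largest weights, largest first."""
    # ascending order of weight, like numpy's argsort
    ascending = sorted(range(len(weights)), key=lambda j: weights[j])
    # walk backwards from the end and stop after n items
    return ascending[:-n - 1:-1]

# hypothetical vocabulary and topic weights, for illustration only
feature_names = ["drive", "god", "key", "space", "windows"]
topic_weights = [0.10, 0.05, 0.70, 0.90, 0.30]

top3 = [feature_names[j] for j in top_n_indices(topic_weights, 3)]
print(" ".join(top3))
```

Each row of `model.components_` plays the role of `topic_weights` here: one weight per vocabulary term, where a larger weight means the term matters more to that topic.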

In [79]:
from sklearn.decomposition import NMF

# we use vectorizer_20ng3 and vectors3 from previous tf-idf examples
print("Vectorizer used is tf-idf")
# vectors3
#(11314, 39116)
# vectors_test3
#(7532, 39116)

t0 = time()
# Fit the Non Negative Matrix Factorization matrix decomposition
print("Creating Non Negative Matrix Factorization matrix decomposition")
model_nmf = NMF(n_components=20,
                random_state=1,
                alpha=.1,
                l1_ratio=.5)
# NMF is unsupervised: the target passed as y is ignored by fit
model_nmf.fit(vectors3, corpus_train_20ng.target)

print("done in %0.3fs." % (time() - t0))
#time 2.738s for 10 components
#time 11.662 for 20 components

print("Topics in Non Negative Matrix Factorization matrix decomposition model")
print_top_n_words(model_nmf, vectorizer_20ng3, corpus_train_20ng, 10)

Vectorizer used is tf-idf
Creating Non Negative Matrix Factorization matrix decomposition
done in 11.719s.
Topics in Non Negative Matrix Factorization matrix decomposition model
Topic 0: don people just like think know time good ve right
Topic 1: god believe bible faith truth existence belief hell heaven atheism
Topic 2: thanks mail does know advance hi info looking anybody address
Topic 3: drive scsi ide disk drives hard controller floppy hd cd
Topic 4: 00 sale 50 shipping 20 10 price 15 new 25
Topic 5: geb dsl n3jxp chastity cadre shameful pitt intellect skepticism surrender
Topic 6: windows dos ms running version os microsoft nt using drivers
Topic 7: window manager application motif display server screen xterm widget program
Topic 8: game espn games baseball hockey detroit leafs wings night blues
Topic 9: car cars dealer engine miles owner buy speed ford tires
Topic 10: bobbe ico beauchaine tek queens bronx sank manhattan com blew
Topic 11: key keys bit des bits public escrow 80 pg

Note: NMF time complexity is polynomial in the number of components.  
Even with a number of topics equal to the number of original categories in the 20 newsgroups dataset, the extracted topics are not the same as the categories. Some similarities can still be found, for example for the sci.space category.  
tf-idf + multinomial Naive Bayes  
sci.space: just lunar earth shuttle like moon launch orbit nasa space  
tf-idf + Non Negative Matrix Factorization  
Topic 14: space nasa shuttle launch station sci gov orbit moon lunar  
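
One simple way to quantify such a similarity is the overlap between the two top-10 word lists. The sketch below computes a Jaccard index on the two lists quoted above; the `jaccard` helper is an assumption made for this example and is not part of the notebook or of scikit-learn.

```python
def jaccard(words_a, words_b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(words_a), set(words_b)
    return len(set_a & set_b) / float(len(set_a | set_b))

# top 10 words copied from the two results quoted above
nb_sci_space = "just lunar earth shuttle like moon launch orbit nasa space".split()
nmf_topic_14 = "space nasa shuttle launch station sci gov orbit moon lunar".split()

shared = sorted(set(nb_sci_space) & set(nmf_topic_14))
score = jaccard(nb_sci_space, nmf_topic_14)

print("shared words: %s" % " ".join(shared))
print("Jaccard index: %.3f" % score)
```

Seven of the ten words are shared between the two lists. Applying the same measure to every (category, topic) pair would give a rough picture of which NMF topics line up with a newsgroup category and which do not.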

In [80]:
from sklearn.decomposition import LatentDirichletAllocation

# we use vectorizer_20ng3 and vectors3 from previous tf-idf examples
print("Vectorizer used is tf-idf")
# vectors3
#(11314, 39116)
# vectors_test3
#(7532, 39116)

t0 = time()
# Fit the Latent Dirichlet Allocation matrix decomposition
print("Creating Latent Dirichlet Allocation model")
# note: newer scikit-learn releases name this parameter n_components
model_lda = LatentDirichletAllocation(n_topics=10,
                                      max_iter=5,
                                      learning_method='online',
                                      learning_offset=50., # tau_0
                                      random_state=0)
# LDA is unsupervised: the target passed as y is ignored by fit
model_lda.fit(vectors3, corpus_train_20ng.target)

print("done in %0.3fs." % (time() - t0))
#time 21.025s

print("Topics in Latent Dirichlet Allocation model")
print_top_n_words(model_lda, vectorizer_20ng3, corpus_train_20ng, 10)

Vectorizer used is tf-idf
Creating Latent Dirichlet Allocation model
done in 22.159s.
Topics in Latent Dirichlet Allocation model
Topic 0: like just know don people think does use thanks good
Topic 1: ax ik kr rb oy w1 vv whirrr ky w7
Topic 2: intellect geb shameful cadre pitt chastity n3jxp dsl skepticism gordon
Topic 3: keller ivy quakers kkeller upenn sas champs xarchie usl ites
Topic 4: transoft humanist spreads basalts bensen wingo regolith 3he ppb crust
Topic 5: cache ram rear hd automatics miles card anyways simm odometer
Topic 6: trc zmed16 amoco sandiego graig nettles 8330 704 phys bickering
Topic 7: rectum siberian 712 1261 269 676 70k sputter whooooooooshhhhhh game
Topic 8: team game armenian players armenians season turkish games turkey play
Topic 9: decnet trademark wimp xdmcp retentive anal sparky helmeted storming survivalist


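Note: time complexity is proportional to (data_samples * iterations) for LDA.  
The results from NMF and LDA are quite different: the NMF topics are mostly readable and some resemble the newsgroup categories, while most of the LDA topics above are dominated by rare tokens. LDA is usually fitted on raw term counts rather than tf-idf weights, which may contribute to this difference.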

The extracted feature words tend to support the use of a 'multinomial Naive Bayes' classifier over the matrix reduction techniques such as NMF or LDA.