## Feature extraction 3. More Classifiers

In [130]:
# Author: Guillaume Lussier <lussier.guillaume@gmail.com>
# base of work http://scikit-learn.org/stable/modules/feature_extraction.html
# Date: Feb2017
# ipython file, kernel 2.7, required modules: sklearn, numpy, pprint, time, logging 

### Section7 : Using other classifiers from sklearn

Let us now use other classifiers from the sklearn library.

In [92]:
#redefining the basic sklearn environment for text classification
# redefine the 20newsgroups dataset
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
import numpy as np
from pprint import pprint
from time import time

# this is to configure python logging to handle warning messages 
import logging
logging.basicConfig()

# Categories and Corpus - filtered corpus
print("Loading and filering sklearn.20newsgroup dataset, remove headers, footers, quotes")
corpus_train_20ng = fetch_20newsgroups(subset='train', shuffle=True, random_state=1, 
                                    remove=('headers', 'footers', 'quotes'))
list_train = list(corpus_train_20ng.target_names)
#pprint(list_train)

# Test Set
#filtered test set
corpus_test_20ng = fetch_20newsgroups(subset='test', shuffle=True, random_state=1, 
                                    remove=('headers', 'footers', 'quotes'))
list_test = list(corpus_test_20ng.target_names)
#pprint(list_test)

# TF-IDF Vectorizer, filtered corpus
print("Vectorizer tf-idf filtering on data train group")
vectorizer_20ng = TfidfVectorizer(max_df=0.95, min_df=2, 
                                   stop_words='english')
vectors_train = vectorizer_20ng.fit_transform(corpus_train_20ng.data)
pprint(vectors_train.shape)
#(11314, 39116)

# Classifier 'multinomial Naive Bayes' on trimmed filtered corpus 20newsgroups set
print("Creating Naive Bayes classifier on filtered data group")
classifier_20ng_mNB = MultinomialNB(alpha=.01)
classifier_20ng_mNB.fit(vectors_train, corpus_train_20ng.target)

# TF-IDF Vectorizer, test corpus
print("Vectorizer tf-idf filtering on data test group")
vectors_test = vectorizer_20ng.transform(corpus_test_20ng.data)
pprint(vectors_test.shape)
#(7532, 39116)

print("F1 Score on sampled/filtered set:")
predictor_mNB = classifier_20ng_mNB.predict(vectors_test)
pprint(metrics.f1_score(corpus_test_20ng.target, predictor_mNB, average='macro'))
#0.68286112952505695 (filtered test set, where headers, footers and quotes have been removed)

print("Extracted Top10 Features per Category/Topic:")
# note: show_top10 is defined in the next cell (In [93]) and must be run first
show_top10(classifier_20ng_mNB, vectorizer_20ng, corpus_train_20ng.target_names)

Loading and filering sklearn.20newsgroup dataset, remove headers, footers, quotes
Vectorizer tf-idf filtering on data train group
(11314, 39116)
Creating Naive Bayes classifier on filtered data group
Vectorizer tf-idf filtering on data test group
(7532, 39116)
F1 Score on sampled/filtered set:
0.68007259851926749
Extracted Top10 Features per Category/Topic:
alt.atheism: islam atheists say just religion atheism think don people god
comp.graphics: format looking 3d know program file files thanks image graphics
comp.os.ms-windows.misc: program problem thanks drivers use driver files dos file windows
comp.sys.ibm.pc.hardware: monitor disk thanks pc ide controller bus card scsi drive
comp.sys.mac.hardware: know monitor quadra does simms problem thanks drive apple mac
comp.windows.x: windows xterm x11r5 use application thanks widget motif server window
misc.forsale: asking email price sell new condition shipping 00 offer sale
rec.autos: don ford new good dealer just engine like cars car
rec.

In [93]:
# function to provide the top 10 features (words) of a category for the provided classifier
# this works for classifiers with a coef_ attribute
def show_top10(classifier, vectorizer, categories):
    feature_names = np.asarray(vectorizer.get_feature_names())
    for i, category in enumerate(categories):
        top10 = np.argsort(classifier.coef_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))
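
To make the np.argsort step in show_top10 concrete, here is a minimal sketch on a toy weight matrix. The feature names, category names and weights are made up for illustration (they do not come from the 20newsgroups run), and top_n_words is a hypothetical helper, not part of sklearn.

```python
import numpy as np

# Toy stand-ins (hypothetical values, not taken from the 20newsgroups run)
feature_names = np.asarray(["car", "engine", "god", "jesus", "space", "orbit"])
categories = ["autos", "religion", "space"]
coef = np.array([
    [2.0, 1.5, -0.3, -0.2, 0.1, 0.0],    # weights for "autos"
    [-0.5, -0.4, 1.8, 2.2, 0.0, -0.1],   # weights for "religion"
    [0.2, 0.0, -0.1, -0.3, 2.5, 1.9],    # weights for "space"
])

def top_n_words(coef, feature_names, n):
    # np.argsort sorts ascending, so the last n indices of each row
    # point at the n largest weights of that category
    return [list(feature_names[np.argsort(row)[-n:]]) for row in coef]

top2 = top_n_words(coef, feature_names, 2)
for category, words in zip(categories, top2):
    print("%s: %s" % (category, " ".join(words)))
```

Because the sort is ascending, the strongest word of each category comes last in its line, as in the outputs above. On recent scikit-learn releases get_feature_names has been replaced by get_feature_names_out, so show_top10 may need adapting there.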

#### 7.1 Linear Model - Logistic Regression

In [94]:
from sklearn.linear_model import LogisticRegression

# we reuse vectorizer_20ng, vectors_train and vectors_test built above
print("Vectorizer used is tf-idf")
# vectors_train
#(11314, 39116)
# vectors_test
#(7532, 39116)

t0 = time()
# Fit the Logistic Regression classifier (the message printed below still says "Random Forest")
print("Fiting Random Forest Classifier")
logreg = LogisticRegression(random_state=0)
logreg.fit(vectors_train, corpus_train_20ng.target)

print("F1 Score - Logistic Regression")
predictor_lr = logreg.predict(vectors_test)
pprint(metrics.f1_score(corpus_test_20ng.target, predictor_lr, average='macro'))
#0.67388146101858992

print("done in %0.3fs." % (time() - t0))
#time 5.173s

print("Topics with a Logistic Regression Classifier")
show_top10(logreg, vectorizer_20ng, corpus_train_20ng.target_names)

Vectorizer used is tf-idf
Fiting Random Forest Classifier
F1 Score - Logistic Regression
0.67388146101858992
done in 4.485s.
Topics with a Logistic Regression Classifier
alt.atheism: punishment atheist motto deletion islamic bobby atheists islam religion atheism
comp.graphics: polygon pov cview files tiff format images 3d image graphics
comp.os.ms-windows.misc: win3 risc fonts files drivers driver cica ax file windows
comp.sys.ibm.pc.hardware: bios 486 monitor drive card ide controller bus pc scsi
comp.sys.mac.hardware: nubus powerbook duo simms lc centris se quadra apple mac
comp.windows.x: widgets sun application mit x11r5 xterm widget motif server window
misc.forsale: new asking email interested 00 condition sell shipping offer sale
rec.autos: vw gt auto toyota oil dealer engine ford cars car
rec.motorcycles: harley dog bmw riding helmet motorcycle ride bikes dod bike
rec.sport.baseball: phillies ball pitching stadium hit cubs runs braves year baseball
rec.sport.hockey: puck playoff

Let us compare to the multinomial Naive Bayes classifier results:
alt.atheism: islam atheists say just religion atheism think don people god
comp.graphics: format looking 3d know program file files thanks image graphics
comp.os.ms-windows.misc: program problem thanks drivers use driver files dos file windows
comp.sys.ibm.pc.hardware: monitor disk thanks pc ide controller bus card scsi drive
comp.sys.mac.hardware: know monitor quadra does simms problem thanks drive apple mac
comp.windows.x: windows xterm x11r5 use application thanks widget motif server window
misc.forsale: asking email price sell new condition shipping 00 offer sale
rec.autos: don ford new good dealer just engine like cars car
rec.motorcycles: don helmet just riding like motorcycle ride bikes dod bike
rec.sport.baseball: braves players pitching hit runs games game baseball team year
rec.sport.hockey: league year nhl games season players play hockey team game
sci.crypt: people use escrow nsa keys government chip clipper encryption key
sci.electronics: good voltage thanks used does know like power circuit use
sci.med: skepticism cadre dsl banks n3jxp chastity pitt gordon geb msg
sci.space: just lunar earth shuttle like moon launch orbit nasa space
soc.religion.christian: believe faith christian christ bible people christians church jesus god
talk.politics.guns: just law firearms government fbi don weapons people guns gun
talk.politics.mideast: jewish arabs arab turkish people armenians armenian jews israeli israel
talk.politics.misc: know state president clinton just think tax don government people
talk.religion.misc: think don koresh objective christians bible people christian jesus god

Let us use the list-comparison functions defined in fextraction1 to see which extracted words differ between the two classifiers.

In [95]:
# different words in 2 lists, with list comprehension (faster for big lists)
def dwl(list1, list2):
    return([x for x in (set(list1)^set(list2))])

def unique1(list1, list2):
    return([x for x in (set(list1)-set(list2))])

def unique2(list1, list2):
    return([x for x in (set(list2)-set(list1))])

# extract the words that differ between 2 classifiers' top-n lists; both must use the same categories/topics and the same vectorizer
def unique_top(n, classifier1, classifier2, vectorizer, categories):
    feature_names = np.asarray(vectorizer.get_feature_names())
    for i, category in enumerate(categories):
        features_top_1 = feature_names[np.argsort(classifier1.coef_[i])[-n:]]
        features_top_2 = feature_names[np.argsort(classifier2.coef_[i])[-n:]]
        print("%s: %s" % (category, " ".join(dwl(features_top_1, features_top_2))))

def unique1_top(n, classifier1, classifier2, vectorizer, categories):
    feature_names = np.asarray(vectorizer.get_feature_names())
    for i, category in enumerate(categories):
        features_top_1 = feature_names[np.argsort(classifier1.coef_[i])[-n:]]
        features_top_2 = feature_names[np.argsort(classifier2.coef_[i])[-n:]]
        print("%s: %s" % (category, " ".join(unique1(features_top_1, features_top_2))))
        
def unique2_top(n, classifier1, classifier2, vectorizer, categories):
    feature_names = np.asarray(vectorizer.get_feature_names())
    for i, category in enumerate(categories):
        features_top_1 = feature_names[np.argsort(classifier1.coef_[i])[-n:]]
        features_top_2 = feature_names[np.argsort(classifier2.coef_[i])[-n:]]
        print("%s: %s" % (category, " ".join(unique2(features_top_1, features_top_2))))
        
# same words in 2 lists, with list comprehension (faster for big lists)
def cwl(list1, list2):
    set2 = set(list2)  # build the lookup set once, not once per element
    return([x for x in list1 if x in set2])

# extract the words shared by 2 classifiers' top-n lists; both must use the same categories/topics and the same vectorizer
def same_top(n, classifier1, classifier2, vectorizer, categories):
    feature_names = np.asarray(vectorizer.get_feature_names())
    for i, category in enumerate(categories):
        features_top_1 = feature_names[np.argsort(classifier1.coef_[i])[-n:]]
        features_top_2 = feature_names[np.argsort(classifier2.coef_[i])[-n:]]
        print("%s: %s" % (category, " ".join(cwl(features_top_1, features_top_2))))
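
As a quick check of what these helpers compute, the sketch below applies the same set operations to the two alt.atheism top 10 lists printed earlier in this notebook. It uses plain Python sets directly rather than the functions above, and the variable names are only illustrative.

```python
# Top 10 words for alt.atheism, copied from the outputs above
nb_top10 = "islam atheists say just religion atheism think don people god".split()
lr_top10 = "punishment atheist motto deletion islamic bobby atheists islam religion atheism".split()

nb_set, lr_set = set(nb_top10), set(lr_top10)

in_one_only = sorted(nb_set ^ lr_set)             # symmetric difference, as in dwl
nb_only = sorted(nb_set - lr_set)                 # as in unique1
lr_only = sorted(lr_set - nb_set)                 # as in unique2
common = [w for w in nb_top10 if w in lr_set]     # as in cwl, keeps list order

print("in one list only: %s" % " ".join(in_one_only))
print("Naive Bayes only: %s" % " ".join(nb_only))
print("Logistic Regression only: %s" % " ".join(lr_only))
print("common: %s" % " ".join(common))
```

Sets are unordered, so the helpers above may print the unique words in a different order from one run to the next; sorting is used here only to make the output stable.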

In [96]:
# words unique to Logistic Regression or multinomial Naive Bayes
unique_top(10, classifier_20ng_mNB, logreg, vectorizer_20ng, corpus_train_20ng.target_names)

alt.atheism: don just bobby atheist people punishment islamic say deletion god motto think
comp.graphics: polygon pov tiff know looking program thanks file images cview
comp.os.ms-windows.misc: use win3 fonts risc cica program thanks dos ax problem
comp.sys.ibm.pc.hardware: bios thanks 486 disk
comp.sys.mac.hardware: lc know duo thanks powerbook nubus drive centris does problem se monitor
comp.windows.x: windows sun use thanks widgets mit
misc.forsale: price interested
rec.autos: gt don like just auto vw toyota good oil new
rec.motorcycles: don like just dog harley bmw
rec.sport.baseball: ball stadium phillies players game cubs team games
rec.sport.hockey: league playoffs puck games leafs year
sci.crypt: use people privacy crypto escrow security
sci.electronics: 8051 good like thanks use know current does used radio amp output electronics ground
sci.med: pain cancer doctor food medical cadre disease diet pitt geb patients gordon treatment dsl banks n3jxp skepticism chastity
sci.space: 

In [97]:
# words common to Logistic Regression and multinomial Naive Bayes
same_top(10, classifier_20ng_mNB, logreg, vectorizer_20ng, corpus_train_20ng.target_names)

alt.atheism: islam atheists religion atheism
comp.graphics: format 3d files image graphics
comp.os.ms-windows.misc: drivers driver files file windows
comp.sys.ibm.pc.hardware: monitor pc ide controller bus card scsi drive
comp.sys.mac.hardware: quadra simms apple mac
comp.windows.x: xterm x11r5 application widget motif server window
misc.forsale: asking email sell new condition shipping 00 offer sale
rec.autos: ford dealer engine cars car
rec.motorcycles: helmet riding motorcycle ride bikes dod bike
rec.sport.baseball: braves pitching hit runs baseball year
rec.sport.hockey: nhl season players play hockey team game
sci.crypt: nsa keys government chip clipper encryption key
sci.electronics: voltage power circuit
sci.med: msg
sci.space: lunar earth shuttle moon launch orbit nasa space
soc.religion.christian: faith christian christ christians church jesus god
talk.politics.guns: firearms fbi weapons guns gun
talk.politics.mideast: jewish arabs arab turkish armenians armenian jews israeli 

The top 10 features extracted by Logistic Regression and multinomial Naive Bayes differ noticeably: depending on the category, the two lists share as few as one word (sci.med) and as many as nine (misc.forsale).  
The shared words are clearly the most central features of each category.  
Let us now compare the words that are unique to each classifier.

In [98]:
# words unique to multinomial Naive Bayes
print("Unique Topics with a multinomial Naive Bayes:")
unique1_top(10, classifier_20ng_mNB, logreg, vectorizer_20ng, corpus_train_20ng.target_names)
print("")
# words unique to Logistic Regression
print("Unique Topics with a Logistic Regression Classifier:")
unique2_top(10, classifier_20ng_mNB, logreg, vectorizer_20ng, corpus_train_20ng.target_names)

Unique Topics with a multinomial Naive Bayes:
alt.atheism: don just people god say think
comp.graphics: looking program thanks file know
comp.os.ms-windows.misc: dos use program thanks problem
comp.sys.ibm.pc.hardware: disk thanks
comp.sys.mac.hardware: monitor drive does thanks problem know
comp.windows.x: windows use thanks
misc.forsale: price
rec.autos: new good don like just
rec.motorcycles: don like just
rec.sport.baseball: players game games team
rec.sport.hockey: league games year
sci.crypt: use escrow people
sci.electronics: use good like used does thanks know
sci.med: n3jxp pitt geb dsl chastity banks cadre skepticism gordon
sci.space: like just
soc.religion.christian: believe bible people
talk.politics.guns: law people don just government
talk.politics.mideast: people
talk.politics.misc: state don know just think
talk.religion.misc: bible don think people

Unique Topics with a Logistic Regression Classifier:
alt.atheism: bobby atheist islamic deletion motto punishment
comp.gr

Looking at the words unique to each classifier, the Logistic Regression results are more specific to each category, whereas multinomial Naive Bayes tends to keep generic words (don, just, thanks, people).  
Since the F1 scores are very close (0.680 vs 0.674), this favors Logistic Regression for this feature extraction problem.


#### 7.2 Ensemble Method - Random Forest Classifier

In [99]:
from sklearn.ensemble import RandomForestClassifier

# we reuse vectorizer_20ng, vectors_train and vectors_test built above
print("Vectorizer used is tf-idf")
# vectors_train
#(11314, 39116)
# vectors_test
#(7532, 39116)

t0 = time()
# Fit the Random Forest classifier
print("Fiting Random Forest Classifier")
random_forest = RandomForestClassifier(n_estimators=100, # number of trees in the forest
                                       max_features="auto", # auto, sqrt, log2
                                       max_depth=None, # integer or None
                                       min_samples_split=20, # minimum number of samples required to split an internal node
                                       n_jobs=-1, # number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores
                                       random_state=0) # seed used by the random number generator
random_forest.fit(vectors_train, corpus_train_20ng.target)

print("F1 Score - Random Forest Classifier")
predictor_rf = random_forest.predict(vectors_test)
pprint(metrics.f1_score(corpus_test_20ng.target, predictor_rf, average='macro'))
#
print("done in %0.3fs." % (time() - t0))
#

print("Topics with a Random Forest Classifier")
#show_top10(random_forest, vectorizer_20ng, corpus_train_20ng.target_names)
#print_top_n_words(random_forest, vectorizer_20ng, corpus_train_20ng, 10)

Vectorizer used is tf-idf
Fiting Random Forest Classifier
F1 Score - Random Forest Classifier
0.60335367250287308
done in 11.575s.
Topics with a Random Forest Classifier


Impact of number of estimators (trees):  
1. n_estimators=10, min_samples_split=2  
F1 Score : 0.50706915419130261  
done in 3.562s.
2. n_estimators=100, min_samples_split=2  
F1 Score : 0.59718046613774967
done in 29.829s.
3. n_estimators=1, min_samples_split=2  
F1 Score : 0.31805565386545032  
done in 0.911s.  

The fit-and-predict time grows roughly linearly with the number of estimators (about 8 times longer for 10 times more trees between 10 and 100), and the F1 score improves with each increase.
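
The timings listed above can be checked with a little arithmetic. The sketch below divides each reported time by its number of trees and fits a straight line through the three points with np.polyfit. It assumes the three single-run timings are representative, which is a strong assumption with n_jobs=-1 and only three measurements.

```python
import numpy as np

# Timings reported above for min_samples_split=2
n_trees = np.array([1, 10, 100])
seconds = np.array([0.911, 3.562, 29.829])

# Cost per tree: constant if the runtime were exactly proportional
per_tree = seconds / n_trees

# Least-squares line: seconds ~ slope * n_trees + intercept
slope, intercept = np.polyfit(n_trees, seconds, 1)

for n, s, p in zip(n_trees, seconds, per_tree):
    print("n_estimators=%3d  total=%7.3fs  per tree=%.3fs" % (n, s, p))
print("fitted slope (s per tree): %.3f" % slope)
print("fitted intercept (s): %.3f" % intercept)
```

The per-tree cost drops from about 0.9s for a single tree to about 0.3s for 100 trees, which suggests a fixed overhead plus a cost proportional to the number of trees rather than strict proportionality.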

Impact of the minimum number of samples required to split a node (min_samples_split):  
1. n_estimators=10, min_samples_split=2  
F1 Score : 0.50706915419130261  
done in 3.562s.
2. n_estimators=10, min_samples_split=20  
F1 Score : 0.54781375149316192  
done in 1.586s.
3. n_estimators=10, min_samples_split=5  
F1 Score : 0.53908788926598206   
done in 2.411s.  

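Raising min_samples_split from 2 to 20 shortens the run time and improves the F1 score: the trees stop splitting earlier, so they are cheaper to build and, on this data, generalize slightly better.  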

Combining a higher number of estimators with a larger min_samples_split:  
n_estimators=100, min_samples_split=20  
F1 Score - Random Forest: 0.60335367250287308  
done in 10.589s.  
The results are better but still inferior to Logistic Regression:  
F1 Score - Logistic Regression : 0.67388146101858992  
done in 4.939s.


As with the matrix decomposition models in fextraction2, sklearn.ensemble.RandomForestClassifier has no coef_ attribute, so the show_top10 function cannot be used to display the top extracted features. It exposes feature_importances_ instead, a single vector of importances over all features.

In [127]:
# function to provide the top n features from a random forest
# feature_importances_ is global to the forest, so categories is unused
def randomtrees_top_n(classifier, vectorizer, categories, n):
    feature_names = np.asarray(vectorizer.get_feature_names())
    # use the classifier passed in, not the global random_forest
    top_n = np.argsort(classifier.feature_importances_)[-n:]
    print("Top %d: %s" % (n, " ".join(feature_names[top_n])))

In [129]:
pprint(random_forest.n_classes_)
pprint(random_forest.classes_)
pprint(random_forest.n_features_)
np.argsort(random_forest.feature_importances_)[-5:]
# [34358,  9259, 16901, 22034, 32570, 30791, 16466,  6974,  8308, 37783]
# [30791, 16466,  6974,  8308, 37783]

20L
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])
39116


array([30791, 16466,  6974,  8308, 37783], dtype=int64)

In [123]:
randomtrees_top_n(random_forest, vectorizer_20ng, corpus_train_20ng.target_names, 10)

Top 10: team clipper gun mac space sale god bike car windows


The random forest top 10 are all present in the common words extracted by the Logistic Regression and multinomial Naive Bayes:  
alt.atheism: islam atheists religion atheism  
comp.graphics: format 3d files image graphics  
comp.os.ms-windows.misc: drivers driver files file **windows**  
comp.sys.ibm.pc.hardware: monitor pc ide controller bus card scsi drive  
comp.sys.mac.hardware: quadra simms apple **mac**  
comp.windows.x: xterm x11r5 application widget motif server window  
misc.forsale: asking email sell new condition shipping 00 offer **sale**  
rec.autos: ford dealer engine cars **car**  
rec.motorcycles: helmet riding motorcycle ride bikes dod **bike**  
rec.sport.baseball: braves pitching hit runs baseball year  
rec.sport.hockey: nhl season players play hockey **team** game  
sci.crypt: nsa keys government chip **clipper** encryption key  
sci.electronics: voltage power circuit  
sci.med: msg  
sci.space: lunar earth shuttle moon launch orbit nasa **space**  
soc.religion.christian: faith christian christ christians church jesus god  
talk.politics.guns: firearms fbi weapons guns **gun**  
talk.politics.mideast: jewish arabs arab turkish armenians armenian jews israeli israel  
talk.politics.misc: president clinton tax government people  
talk.religion.misc: koresh objective christians christian jesus **god**  

The random forest was effective in extracting the top words over the whole 20newsgroups data. However, the extracted words are not matched to the original categories, as they are for Logistic Regression or multinomial Naive Bayes.  
Depending on which result is needed (overall data or per category), one type of classifier will be easier to use directly.
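
The difference between per-category and overall features comes down to the shape of what each classifier exposes. The sketch below fits both models on a tiny made-up corpus (six hypothetical documents in three classes, not the 20newsgroups data) and prints the shapes of coef_ and feature_importances_. The names docs, labels, logreg_toy and forest_toy are illustrative only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus: 6 documents, 3 classes
docs = [
    "the car engine needs oil",
    "a new car dealer sold the ford",
    "the shuttle reached orbit",
    "nasa plans a moon launch",
    "the hockey team won the game",
    "players skate in the hockey season",
]
labels = [0, 0, 1, 1, 2, 2]

X = TfidfVectorizer().fit_transform(docs)
n_features = X.shape[1]

logreg_toy = LogisticRegression(random_state=0).fit(X, labels)
forest_toy = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, labels)

# One row of weights per class: features can be ranked per category
print(logreg_toy.coef_.shape)
# One importance per feature for the whole forest: a single global ranking
print(forest_toy.feature_importances_.shape)
```

With three classes, coef_ has one row per category, while feature_importances_ is a single vector, which is why the random forest yields one top 10 for the whole corpus.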

To Do: in fextraction4 we will discuss combining multiple base estimators/classifiers to improve generalizability and robustness over a single estimator (note that random forest is already an ensemble method that achieves this).