# Classification of Newsgroups by Text

## Newsgroups Data

We will take a look at part of the 20 Newsgroups dataset, another common dataset for classification.

More specifically, it is a collection of messages posted to 20 different newsgroups. We will focus on just a few categories.

There will most likely be some misclassification between messages in `alt.atheism` and `talk.religion.misc`, as well as between `comp.graphics` and `sci.space`. Let's see...

In [1]:
from sklearn.datasets import fetch_20newsgroups

# We will use four of the twenty newsgroups
categories = ['alt.atheism',
              'talk.religion.misc',
              'comp.graphics',
              'sci.space']

twenty_train_subset = fetch_20newsgroups(subset='train', categories=categories)
twenty_test_subset = fetch_20newsgroups(subset='test', categories=categories)



Now we have lists of messages (as strings) in the `.data` members.

In [2]:
twenty_train_subset.data[0:2]

 u"Subject: Re: Biblical Backing of Koresh's 3-02 Tape (Cites enclosed)\nFrom: kmcvay@oneb.almanac.bc.ca (Ken Mcvay)\nOrganization: The Old Frog's Almanac\nLines: 20\n\nIn article <20APR199301460499@utarlg.uta.edu> b645zaw@utarlg.uta.edu (stephen) writes:\n\n>Seems to me Koresh is yet another messenger that got killed\n>for the message he carried. (Which says nothing about the \n\nSeems to be, barring evidence to the contrary, that Koresh was simply\nanother deranged fanatic who thought it neccessary to take a whole bunch of\nfolks with him, children and all, to satisfy his delusional mania. Jim\nJones, circa 1993.\n\n>In the mean time, we sure learned a lot about evil and corruption.\n>Are you surprised things have gotten that rotten?\n\nNope - fruitcakes like Koresh have been demonstrating such evil corruption\nfor centuries.\n-- \nThe Old Frog's Almanac - A Salute to That Old Frog Hisse'f, Ryugen Fisher \n     (604) 245-3205 (v32) (604) 245-4366 (2400x4) SCO XENIX 2.3.2 GT \n  Ladys

In [3]:
print(twenty_train_subset.target[0:2])
print(twenty_train_subset.target_names)

[1 3]
['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']


## Count Vectorization

Recall how we can generate features from text using the `CountVectorizer`. We "fit" the vectorizer on a body of text, which gives it a total vocabulary. When we then call `transform` with the fitted vectorizer, it returns a sparse matrix of size `documents x vocabulary`, where each element is the number of times that word occurs in that document.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

# Turn the text documents into vectors of word frequencies
vectorizer = CountVectorizer()
vectorizer.fit(twenty_train_subset.data)

X_train = vectorizer.transform(twenty_train_subset.data)

Again, this makes a matrix of word counts, where each row is a document and each column is a word: the cell `matrix[document, word]` contains the count of `word` in `document`.

We can expand on this by setting `ngram_range`. This parameter lets each column represent not only a single word but also a sequence of consecutive words.

We can also provide the `stop_words` argument, which strips out common English stop words such as "and", "or", and "other" that carry little meaning on their own.

In [5]:
# Include every 1-gram, 2-gram, and 3-gram
vectorizer = CountVectorizer(ngram_range=(1,3),stop_words='english')
vectorizer.fit(twenty_train_subset.data)
X_train = vectorizer.transform(twenty_train_subset.data)

In [7]:
import pandas as pd
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
pd.DataFrame(X_train.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,00,00 00,00 00 gmt,00 00 pm,00 01,00 01 gmt,00 01 mainly,00 04,00 04 gmt,00 06,...,zyxel 14 epimntl,zyxel v32bis,zyxel v32bis v32,ªl,ªl r0506048,ªl r0506048 csie,º_________________________________________________º_____________________º,ºnd,ºnd sun,ºnd sun eclipsed
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Additionally, we could use a tf-idf representation, which stands for Term Frequency - Inverse Document Frequency.

This value is the product of two intermediate values, the Term Frequency and the Inverse Document Frequency.

The Term Frequency is equivalent to the CountVectorizer features: the number of times that a word appears in the document. This is our most basic representation of text.

To establish Inverse Document Frequency, first let's define Document Frequency. This is the fraction of documents that a particular word appears in. For example, the word `the` might appear in 100% of documents, while a word like `Syria` would likely have a low document frequency. Inverse Document Frequency is simply 1 / Document Frequency (although in practice the log is usually taken).

So tf-idf is Term Frequency * Inverse Document Frequency, which is similar to Term Frequency / Document Frequency. The intuition is that the words with high weight are those that appear often in this document, appear in few other documents, or both (that is, words that are somehow characteristic of this document).
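The definition above can be worked through by hand on a tiny corpus. The sketch below first computes the textbook weighting `tf * log(1 / df)`, then reproduces what I understand to be scikit-learn's default variant (a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, followed by scaling each row to unit L2 norm) and compares it with `TfidfVectorizer`. The corpus and variable names are made up for illustration, and the code assumes Python 3.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus; vocabulary order is card, graphics, shuttle, space
toy_docs = ["space space shuttle", "space graphics", "graphics card"]

counts = CountVectorizer().fit_transform(toy_docs).toarray()  # term frequency
n_docs = counts.shape[0]

# Document frequency as a fraction: share of documents containing each word
n_containing = (counts > 0).sum(axis=0)
doc_freq = n_containing / n_docs

# Textbook weighting from the text above: tf * log(1 / df)
tfidf_textbook = counts * np.log(1.0 / doc_freq)

# scikit-learn's default variant: smoothed idf, then unit L2 norm per row
idf_smooth = np.log((1 + n_docs) / (1 + n_containing)) + 1
tfidf_manual = counts * idf_smooth
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

tfidf_sklearn = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

In the first document "space" occurs twice and "shuttle" once, yet under the textbook weighting "shuttle" ends up with the larger weight because it appears in only one of the three documents.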

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer


vectorizer = TfidfVectorizer()

# This is short hand for calling fit, and then transform
X_train = vectorizer.fit_transform(twenty_train_subset.data)

We can put this together with our other tricks as well.

In [9]:
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,3))
X_train = vectorizer.fit_transform(twenty_train_subset.data)

For computational ease, let's just look at single-word patterns (unigrams).

In [10]:
vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(twenty_train_subset.data)

# Exercise - Naive Bayes vs. Decision Trees

We originally reviewed the CountVectorizer concept during our exploration of Naive Bayes. This time, we'll use a tf-idf vectorizer to create our X_train matrix, and compare the performance of Naive Bayes and Decision Trees.


1. Using Naive Bayes, what is your cross validation score on the training data (5 fold using StratifiedShuffleSplit) with a tf-idf matrix?
2. What is your score on the test set when you fit on the entire training set using Naive Bayes?
3. With a similar setup, how well do decision trees perform on training and test? Try comparing the performance of the trees with various max_depth.
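For the cross validation in question 1, it may help to see what `StratifiedShuffleSplit` does on its own. This is a minimal sketch with made-up imbalanced labels (the `toy_*` names and the 80/20 ratio are assumptions for illustration); it uses the `sklearn.model_selection` API under Python 3.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1
toy_y = np.array([0] * 80 + [1] * 20)
toy_X = np.zeros((100, 1))  # feature values do not matter for the split itself

splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

test_class1_share = []
for train_idx, test_idx in splitter.split(toy_X, toy_y):
    # Each held-out set keeps the class ratio of the full label vector
    test_class1_share.append(toy_y[test_idx].mean())

print(test_class1_share)
```

Each of the five random splits holds out 20 samples, and stratification keeps the share of class 1 in every held-out set at 20%.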


**Notes:**
* Using MultinomialNB may provide the best results when creating a Naive Bayes model.
* Note, you will have to define y_train, X_test, and y_test variables.
* You will also have to `transform` the data into a tf-idf matrix for the test set (**Don't call fit again! Remember that `fit` is run on the training set to get a total vocabulary. Once we have this vocabulary, we run transform on new data to transform the known words**)
* Remember not to contaminate your training and test sets. Ideally, the test set should be looked at once at the end. Otherwise, you could end up fitting to the test set.
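The fit-on-train, transform-on-test rule from the notes can be seen on a toy example. The sentences below are invented for illustration and the `toy_*` names are not part of the exercise; the sketch assumes Python 3.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training and test text
toy_train = ["god and religion", "orbit and moon"]
toy_test = ["moon landing"]  # "landing" never appears in the training text

toy_vectorizer = TfidfVectorizer()
toy_train_matrix = toy_vectorizer.fit_transform(toy_train)  # learns the vocabulary
toy_test_matrix = toy_vectorizer.transform(toy_test)        # reuses it, no refit

# Same number of columns, so a model trained on one can score the other
print(toy_train_matrix.shape, toy_test_matrix.shape)

# Only "moon" is in the learned vocabulary; "landing" is silently dropped
print(toy_test_matrix.nnz)
```

Calling `fit` again on the test text would build a different vocabulary, and the columns of the two matrices would no longer line up.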


In [27]:
from sklearn.naive_bayes import MultinomialNB

# Fit the vectorizer on the training text only, never on the test text
vectorizer = TfidfVectorizer(stop_words='english')
vectorizer.fit(twenty_train_subset.data)

X_train = vectorizer.transform(twenty_train_subset.data)
y_train = twenty_train_subset.target

# The test text is only transformed, using the vocabulary learned above
X_test = vectorizer.transform(twenty_test_subset.data)
y_test = twenty_test_subset.target




In [28]:
# sklearn.model_selection replaces the removed sklearn.cross_validation module
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier
import numpy as np


mm = MultinomialNB()
mm.fit(X_train, y_train)
# 5 stratified shuffle splits of the training data (10% held out per split by default)
print("Training:", np.mean(cross_val_score(mm, X_train, y_train,
                                           cv=StratifiedShuffleSplit(n_splits=5))))

print("Testing:", mm.score(X_test, y_test))

Training: 0.932352941176
Testing: 0.875092387288


In [29]:
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
print("Training:", np.mean(cross_val_score(dt, X_train, y_train,
                                           cv=StratifiedShuffleSplit(n_splits=5))))

# Score the decision tree itself (dt), not the Naive Bayes model (mm)
print("Testing:", dt.score(X_test, y_test))

Training: 0.816666666667
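For the `max_depth` part of question 3, one possible approach is to loop over a few depths and record training and test accuracy for each. The sketch below uses synthetic data from `make_classification` as a stand-in for the tf-idf matrix so that it runs without downloading anything; the data, the depth values, and the variable names are all assumptions for illustration (Python 3, recent scikit-learn).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 classes, like the newsgroup subset
X_syn, y_syn = make_classification(n_samples=600, n_features=30, n_informative=10,
                                   n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25,
                                          stratify=y_syn, random_state=0)

depth_scores = {}
for depth in [1, 2, 5, 10, None]:  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    depth_scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for depth, (train_acc, test_acc) in depth_scores.items():
    print(depth, round(train_acc, 3), round(test_acc, 3))
```

With the real data, `X_tr`, `y_tr`, `X_te`, and `y_te` would be replaced by `X_train`, `y_train`, `X_test`, and `y_test`. A widening gap between the two accuracies as depth grows is the usual sign of overfitting.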


# Bagging (Bootstrap Aggregation)

Let's explore how bootstrap aggregation can give us stronger results.

Bagging tends to work well when the base classifiers (weak learners) have a low bias but high variance. By averaging the predictions over these base classifiers, we can reduce the overall variance of the final predictions.
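As a rough illustration of this idea, the sketch below cross-validates a single fully grown tree (low bias, high variance) against a bagged ensemble of such trees. It runs on synthetic data rather than the newsgroups matrix, and the sample sizes and ensemble size are arbitrary choices for the example (Python 3, recent scikit-learn).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), so the example runs without a download
X_syn, y_syn = make_classification(n_samples=500, n_features=20, n_informative=8,
                                   random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
# The base estimator is passed positionally because its keyword name
# differs between scikit-learn releases
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                                 random_state=0)

tree_cv = cross_val_score(single_tree, X_syn, y_syn, cv=5)
bag_cv = cross_val_score(bagged_trees, X_syn, y_syn, cv=5)
print("single tree:", tree_cv.mean(), "bagged trees:", bag_cv.mean())
```

The exact numbers depend on the data and the random seed, so they are worth rerunning with a few different seeds before drawing conclusions.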

In [31]:
from sklearn.ensemble import BaggingClassifier

# Let's view the arguments in the model
# By default, DecisionTreeClassifier used, 
#  but can also use any base classifier
model_bag = BaggingClassifier()

# look at parameters of Bagging Classifier, base estimator/# estimator

In [32]:
model_bag.fit(X_train,y_train)

BaggingClassifier(base_estimator=None, bootstrap=True,
         bootstrap_features=False, max_features=1.0, max_samples=1.0,
         n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
         verbose=0)

In [35]:
model_bag.score(X_train,y_train)
model_bag.score(X_test,y_test)

0.73392461197339243

# Exercise - Bagging

* How does bagging with 1, 5, 10, 20, and 30 estimators compare to a single base decision tree classifier on the training set?
* How about on the testing set?
* Plot the improvements on both training and testing sets

If you finish early, feel free to build a bagging ensemble using a different base classifier, such as NaiveBayes or SVM. You may see that SVM does poorly with a default rbf kernel.

In [36]:
from sklearn.ensemble import BaggingClassifier

train_res, test_res = [],[]

for i in [1,5,10,20,30]:
    model_bag = BaggingClassifier(n_estimators = i)
    model_bag.fit(X_train,y_train)
    train_res.append(model_bag.score(X_train,y_train))
    test_res.append(model_bag.score(X_test,y_test))

In [55]:
import matplotlib.pyplot as plt
%matplotlib inline

# Same ensemble sizes as the loop above
n_est = [1, 5, 10, 20, 30]

plt.plot(n_est, train_res, label='train')
plt.plot(n_est, test_res, label='test')
plt.legend()

In [38]:
model_bag.score(X_train,y_train)
model_bag.score(X_test,y_test)

0.47302291204730229

Bagging works very well when the base classifier has a low bias but a high variance.

------
# Random Forests

Random Forests are very popular ensemble classifiers. They are relatively simple to use (very few parameters to set and easy to avoid overfitting). 

The only parameter we are really worried about is the number of trees we want to create - n_estimators in sklearn.

In [39]:
from sklearn.ensemble import RandomForestClassifier

# By default, each split considers sqrt(n_features) candidate features
model_rf = RandomForestClassifier()

# play with number of estimators

We can fit the model on our 20-newsgroup training data from above.

In [40]:
model_rf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [41]:
model_rf.score(X_train,y_train)

0.99606686332350047

We can access the results and parameters of each individual estimator in the random forest through `estimators_`

In [46]:
model_rf.estimators_[0:2]
#model_rf.estimators_[0].feature_importances_

[DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
             max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             random_state=1487170897, splitter='best'),
 DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
             max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             random_state=1526732950, splitter='best')]

Random Forests can quantify the importance of features. Unlike with logistic regression, we don't have coefficients that tell us relative impact. But we can keep track of which features give us the best splits.
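To show where the forest-level numbers come from, the sketch below averages the `feature_importances_` of the individual trees and compares the result with the forest's own attribute. It uses synthetic data, and the claim that the two match reflects my understanding of how scikit-learn aggregates importances, not something stated in this notebook (Python 3 assumed).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption) with a handful of informative features
X_syn, y_syn = make_classification(n_samples=400, n_features=12, n_informative=4,
                                   random_state=0)
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_syn, y_syn)

# Each tree records how much impurity each feature removed across its splits
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Averaging over trees gives the forest-level ranking
averaged = per_tree.mean(axis=0)
top_features = np.argsort(averaged)[::-1][:4]

print(top_features)
print(np.allclose(averaged, forest.feature_importances_))
```

The importances sum to 1, so each value can be read as a feature's share of the total impurity reduction across the forest.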

In [47]:
# This prints the top 20 most important features
# (get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names())
print(sorted(zip(model_rf.feature_importances_,
                 vectorizer.get_feature_names_out()),
             reverse=True)[:20])

[(0.033403920713236876, u'space'), (0.015530845362056961, u'graphics'), (0.015217538475433496, u'god'), (0.011202707017213485, u'people'), (0.010414370872908332, u'allan'), (0.010136770990195833, u'religion'), (0.0089300192367453325, u'atheists'), (0.0086567996133415093, u'christian'), (0.0086032054738690127, u'moon'), (0.0063799087451154399, u'access'), (0.0057260991062291393, u'solntze'), (0.0053645054506590763, u'islam'), (0.0052561053112685436, u'windows'), (0.0052000400044451911, u'cco'), (0.0050943711341091165, u'help'), (0.00489501423569705, u'orbit'), (0.0047142302995528754, u'christ'), (0.0043697983630181843, u'article'), (0.0043346079384999868, u'spacecraft'), (0.0042279566880664681, u'3d')]


# Exercise - Random Forest Estimator Size

* Compare training vs testing performance as you increase the number of estimators (for instance, from 1 to 100). Plotting will help visualize this.
* In addition, anecdotally compare the performance to that of bagged trees.

Extra: If you finish early, look at the top feature importances for a random forest. Are they similar to the feature importances coming out of a bagged model? Since the BaggingClassifier is generic and does not have a `feature_importances_` attribute, you may need to roll your own bagging classifier (i.e. write your own function that bootstraps samples and fits decision trees in a for loop, while accessing the `feature_importances_` of each individual tree).
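One way to approach the extra task is sketched below: draw bootstrap samples by hand, fit one tree per sample, and average the per-tree importances. The function name `bagged_tree_importances` and the synthetic data are hypothetical choices for this example, not an existing scikit-learn API (Python 3 assumed).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def bagged_tree_importances(X, y, n_estimators=10, seed=0):
    """Fit trees on bootstrap samples; return the trees and mean importances."""
    rng = np.random.RandomState(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_estimators):
        # Bootstrap: draw n_samples row indices with replacement
        idx = rng.randint(0, n_samples, size=n_samples)
        tree = DecisionTreeClassifier(random_state=rng.randint(2**31 - 1))
        trees.append(tree.fit(X[idx], y[idx]))
    importances = np.mean([t.feature_importances_ for t in trees], axis=0)
    return trees, importances

# Synthetic stand-in data (assumption)
X_syn, y_syn = make_classification(n_samples=300, n_features=10, n_informative=3,
                                   random_state=0)
trees, importances = bagged_tree_importances(X_syn, y_syn)
print(np.argsort(importances)[::-1][:3])
```

Row indexing with `X[idx]` also works on SciPy CSR matrices, so the same function should accept the sparse tf-idf `X_train`; the resulting ranking can then be paired with the vectorizer's feature names as in the random forest cell.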

In [60]:
import matplotlib.pyplot as plt
%matplotlib inline

n_est = [1, 5, 10, 20, 50, 100]
train, test = [], []

for i in n_est:
    model_rf = RandomForestClassifier(n_estimators=i)
    model_rf.fit(X_train, y_train)
    train.append(model_rf.score(X_train, y_train))
    test.append(model_rf.score(X_test, y_test))

plt.plot(n_est, train, label='train')
plt.plot(n_est, test, label='test')

plt.legend()

For ensemble methods, we would like our base classifiers to be *better than random*, as well as *uncorrelated*.

Bagging reduces variance by averaging over many trees. Random forests purposefully add "randomness" by considering only a random subset of features at each split, which decreases the correlation between the trees. Think of a random forest as an improved bagging method.
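One rough way to look at the correlation between trees is to measure how often pairs of trees in an ensemble make the same prediction on held-out data. The sketch below does this for bagged trees and for a random forest on synthetic data; the helper `mean_pairwise_agreement` is a hypothetical name, and agreement is only a simple proxy for correlation (Python 3, recent scikit-learn assumed).

```python
import itertools
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def mean_pairwise_agreement(predictions):
    """Average share of samples on which two trees give the same prediction."""
    pairs = itertools.combinations(predictions, 2)
    return float(np.mean([np.mean(a == b) for a, b in pairs]))

# Synthetic stand-in data (assumption): few informative features among many
X_syn, y_syn = make_classification(n_samples=600, n_features=40, n_informative=5,
                                   random_state=0)
X_fit, y_fit, X_new = X_syn[:400], y_syn[:400], X_syn[400:]

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=15,
                        random_state=0).fit(X_fit, y_fit)
rf = RandomForestClassifier(n_estimators=15, max_features='sqrt',
                            random_state=0).fit(X_fit, y_fit)

# Bagged trees may choose among all features at every split;
# forest trees only see a random subset of features at each split
bag_preds = [est.predict(X_new[:, feats])
             for est, feats in zip(bag.estimators_, bag.estimators_features_)]
rf_preds = [tree.predict(X_new) for tree in rf.estimators_]

bag_agreement = mean_pairwise_agreement(bag_preds)
rf_agreement = mean_pairwise_agreement(rf_preds)
print("bagging agreement:", bag_agreement)
print("forest agreement: ", rf_agreement)
```

Lower agreement means the trees disagree more often, which gives averaging more to work with. How large the difference is depends on the data, so treat the output as something to explore.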

-----
# Boosting

AdaBoost is a boosting algorithm by Freund and Schapire that fits a sequence of weak learners. Each step improves upon the last one by increasing the weight of misclassified samples and decreasing the weight of correctly classified samples. The weak learners are also given weights based on their error rates.
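The reweighting loop can be written out by hand for the two-class case. The sketch below is the textbook binary AdaBoost with labels in {-1, +1} and decision stumps, run on synthetic data. It is meant to illustrate the mechanics described above and is not scikit-learn's multi-class SAMME implementation; all variable names are illustrative (Python 3 assumed).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data (assumption), relabeled to {-1, +1}
X_syn, y01 = make_classification(n_samples=300, n_features=10, n_informative=4,
                                 random_state=0)
y_pm = np.where(y01 == 1, 1, -1)

n_rounds = 20
weights = np.full(len(y_pm), 1.0 / len(y_pm))  # start with uniform sample weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X_syn, y_pm, sample_weight=weights)
    pred = stump.predict(X_syn)
    err = np.sum(weights[pred != y_pm])    # weighted error rate of this learner
    if err <= 0 or err >= 0.5:             # perfect, or no better than chance
        break
    alpha = 0.5 * np.log((1 - err) / err)  # learner weight, larger for lower error
    # Misclassified samples (y * pred = -1) gain weight, correct ones lose weight
    weights = weights * np.exp(-alpha * y_pm * pred)
    weights = weights / weights.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted vote over all stumps
ensemble_pred = np.sign(sum(a * s.predict(X_syn) for a, s in zip(alphas, stumps)))
train_accuracy = np.mean(ensemble_pred == y_pm)
print("rounds used:", len(alphas), "training accuracy:", train_accuracy)
```

Printing `weights` after a few rounds shows the mass concentrating on the samples that the earlier stumps keep getting wrong.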

There are many components in AdaBoost. Let's see how they function.

In [61]:
from sklearn.ensemble import AdaBoostClassifier

In [62]:
# The default base estimator is a decision tree stump (max_depth=1)
#  Notice learning_rate: it shrinks the contribution of each successive tree
model_boost = AdaBoostClassifier()

SAMME works off of discrete misclassification errors, while SAMME.R uses the predicted probabilities to reweight the samples and base classifiers.

In [63]:
model_boost.fit(X_train,y_train)
#large number of parameters, tune learning rate through grid search

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)

In [65]:
model_boost.score(X_train,y_train)
model_boost.score(X_test,y_test)

0.7464892830746489

Boosting tends to perform better when the base classifier has a high bias but low variance.

In other words, bagging and random forests work well with base models that **OVERFIT**, while boosting works well with base models that **UNDERFIT**.

# Exercise - Stumps vs. Dense Trees

Remember that in order for ensemble learners to work well, they require that the base classifiers are better than random guessing, and tend to get 'different questions wrong'. 

Boosting often does surprisingly well when using decision tree stumps (i.e. Decision Tree Classifiers with `max_depth=1`).

Create a plot showing in-sample and out-of-sample performance for an AdaBoost ensemble classifier whose `base_estimator` is a Decision Tree Classifier with max depth = 1 and with max depth = 10, as you increase the number of estimators.

The x-axis should be `n_estimators`, while the y-axis is the score (or error = 1 - score). There should be two lines, one for stumps and one for depth-10 trees.

Warning: Make sure your code works correctly, because this can take time! I encourage you to print out diagnostics as you go to help you see that it is working.

Also, just using stumps, observe the performance in and out of sample as you increase the n_estimators from 1 to 100. You may want to test various ranges, rather than every number between 1 and 100.
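A possible starting point for this exercise is sketched below. Instead of refitting for every ensemble size, it fits once per tree depth and uses `staged_score`, which yields the accuracy after each boosting round. The data is synthetic and the names are illustrative; with the newsgroups data you would substitute `X_train`, `y_train`, `X_test`, and `y_test` (Python 3, recent scikit-learn assumed).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
X_syn, y_syn = make_classification(n_samples=600, n_features=20, n_informative=6,
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3,
                                          random_state=0)

curves = {}
for depth in [1, 10]:
    # Base tree passed positionally; its keyword name varies across releases
    booster = AdaBoostClassifier(DecisionTreeClassifier(max_depth=depth),
                                 n_estimators=50, random_state=0)
    booster.fit(X_tr, y_tr)
    # One fit gives the whole curve: accuracy after 1, 2, ... boosting rounds
    curves[depth] = (list(booster.staged_score(X_tr, y_tr)),
                     list(booster.staged_score(X_te, y_te)))

for depth, (train_curve, test_curve) in curves.items():
    print(depth, len(train_curve), train_curve[-1], test_curve[-1])
```

Each curve can be plotted against `range(1, len(curve) + 1)`. A curve may be shorter than `n_estimators`, since boosting can stop early when a base tree already fits the weighted training data perfectly, which is more likely with the deeper trees.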

If you finish early, feel free to play around with the GBDT (Gradient Boosted Decision Tree), aka GBT (Gradient Boosted Tree), aka GBM (Gradient Boosted Model) within sklearn. [GBDT](http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting)

Gradient boosting can be thought of as a generic form of boosting that can take in any differentiable loss function.
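To make the "differentiable loss" idea concrete, here is a hand-rolled sketch of gradient boosting for regression with squared loss. For that loss the negative gradient is simply the residual, so each new tree is fit to the current residuals. The toy data, the learning rate, and the tree depth are arbitrary choices for illustration, and this is not scikit-learn's `GradientBoostingRegressor` implementation (Python 3 assumed).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy regression problem: a noisy sine curve
rng = np.random.RandomState(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())  # initial constant model
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(50):
    # For squared loss L = (y - F)^2 / 2, the negative gradient is the residual
    residual = y_toy - prediction
    tree = DecisionTreeRegressor(max_depth=2).fit(X_toy, residual)
    # Take a small step in the direction the tree suggests
    prediction = prediction + learning_rate * tree.predict(X_toy)
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print(mse_history[0], mse_history[-1])
```

Swapping in a different loss only changes how `residual` is computed: it becomes the negative gradient of that loss with respect to the current prediction.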