## Lab: Text Classification

### Stop Words

Text documents often contain many occurrences of the same word. For example, in a document written in _English_, words such as _a_, _the_, _of_, and _it_ likely occur very frequently. When classifying a document based on the number of times specific words occur in it, these words can bias the result, especially since they are common in **all** text documents you might want to classify. The concept of [_stop words_](https://en.wikipedia.org/wiki/Stop_words) addresses this problem: stop words are the most commonly occurring words, which are removed during tokenization in order to improve subsequent classification.

We can easily specify that the __English__ stop words should be excluded during tokenization by using the `stop_words` parameter. Note that _stop word_ dictionaries for other languages, or even specific domains, exist and can be used instead. We demonstrate the removal of stop words by using a `CountVectorizer` in the following simple example.

-----

In [None]:
# Import the vectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Sample sentence to tokenize
my_text = 'This module introduced many concepts in text analysis.'

# cv1 keeps every token; cv2 drops the built-in English stop words
cv1 = CountVectorizer(lowercase=True)
cv2 = CountVectorizer(stop_words='english', lowercase=True)

tk_func1 = cv1.build_analyzer()
tk_func2 = cv2.build_analyzer()

import pprint
pp = pprint.PrettyPrinter(indent=2, depth=1, width=80, compact=True)

print('Tokenization:')
pp.pprint(tk_func1(my_text))

print()

print('Tokenization (stop words removed):')
pp.pprint(tk_func2(my_text))
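
To see how the `stop_words` setting changes the vocabulary a vectorizer learns, and not just the token list, the following minimal sketch fits three vectorizers on the same sample sentence. The custom stop word list (`'module'`, `'text'`) is a made-up illustration of a domain-specific list, not a recommended one.

```python
from sklearn.feature_extraction.text import CountVectorizer

my_text = 'This module introduced many concepts in text analysis.'

# No stop words, the built-in English list, and a hypothetical custom list
cv_all = CountVectorizer(lowercase=True)
cv_english = CountVectorizer(stop_words='english', lowercase=True)
cv_custom = CountVectorizer(stop_words=['module', 'text'], lowercase=True)

for name, vec in [('none', cv_all), ('english', cv_english), ('custom', cv_custom)]:
    vec.fit([my_text])
    # vocabulary_ maps each kept token to its column index
    print(name, sorted(vec.vocabulary_))
```

Any token that appears in the stop word list never receives a column, so it cannot contribute to the counts used later for classification.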

## Stemming

So far, we have looked at several techniques to remove redundant or unimportant features. For example, we changed the case of all text to lowercase and we removed stop words. However, there is still the issue of different forms of the same word, for example _compute_, _computer_, _computed_, and _computing_. [Stemming](https://en.wikipedia.org/wiki/Stemming) is the process of reducing words to their root, or basic form, by removing affixes (in practice mostly suffixes). After stemming, token frequencies reflect the use of the root token rather than being spread across multiple similar tokens.

The most widely used stemmer, or program/method that performs stemming, is the _Porter Stemmer_, which was originally published in 1980 by Martin Porter. An improved version was released in 2000, which fixed a number of errors. NLTK includes the Porter Stemmer, which can be used with scikit-learn by creating a function that tokenizes text documents and passing this function to the `CountVectorizer` via the `tokenizer` parameter. By performing stemming inside this tokenize function, we can return a list of stemmed tokens for each document. The following code cells first apply the Porter stemmer to a list of example words, and then tokenize a short text with NLTK, drop the punctuation tokens, and stem each remaining token.

-----


In [None]:
import string
import nltk
from nltk.stem.porter import PorterStemmer

example_words = ["python","pythoner","pythoning","pythoned","pythonly"]
stemmer = PorterStemmer()

for w in example_words:
    print(stemmer.stem(w))

In [None]:
new_text = "It is important to be very pythonly while you are pythoning with python. \
All pythoners have pythoned poorly at least once."

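# Tokenize with NLTK (this needs the NLTK tokenizer data to be installed),
# then drop punctuation tokens
tokens = nltk.word_tokenize(new_text)
tokens = [token for token in tokens if token not in string.punctuation]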

for w in tokens:
    print(stemmer.stem(w))
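
The idea of stemming inside a tokenizer function that is handed to `CountVectorizer` can be sketched as follows. The helper name `stem_tokenize` is hypothetical, and a simple regular expression stands in for `nltk.word_tokenize` so that the sketch does not depend on extra NLTK data; treat it as one possible arrangement rather than the definitive one.

```python
import re

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()

def stem_tokenize(text):
    # Hypothetical helper: split on runs of letters, then stem each token.
    # Swap the regex for any tokenizer you prefer.
    tokens = re.findall(r'[a-z]+', text.lower())
    return [stemmer.stem(token) for token in tokens]

# The vectorizer calls stem_tokenize on every document it sees
cv_stem = CountVectorizer(tokenizer=stem_tokenize)
counts = cv_stem.fit_transform(['compute computer computed computing'])

print(cv_stem.vocabulary_)
print(counts.toarray())
```

Because the word forms are mapped to a shared stem before counting, their occurrences accumulate in one column instead of being spread over several.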

-----

## Classification

We have identified the features (the tokens in the training documents) that we can use to classify the documents. Before we apply a classification technique to the newsgroups data, be aware that many issues might affect the classification process. In the context of this notebook, the data is similar to emails, so you may want to exclude email address information (like _com_, _edu_, etc.), proper names, and items such as dates and monetary amounts. In addition, the content in some categories will clearly overlap, such as _alt.atheism_ and _soc.religion.christian_.

Issues like this demonstrate the **need** for manual intervention and introspection during the machine learning process. You would want to continually analyze classification results to ensure you understand what is occurring and why it is occurring.

-----

### Naive Bayes Classifier

One of the simplest techniques for performing text classification is the [Naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier). Fundamentally, this method applies Bayes' theorem by (naively) assuming independence between the features. In scikit-learn, we will use a [Multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) model, where we treat each feature independently. Thus we calculate the likelihood of a feature corresponding to each training label, and the accumulation of these likelihoods provides our overall classification. By working with log-likelihoods, this accumulation becomes a simple sum.
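
As a small illustration of the log-likelihood sum, the sketch below fits a `MultinomialNB` on a toy corpus (the documents and labels are made up for this example) and then recomputes the class scores by hand from the fitted `feature_log_prob_` and `class_log_prior_` attributes.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus with made-up labels: 0 = space, 1 = hockey
docs = ['rocket launch orbit', 'orbit moon rocket',
        'hockey goal ice', 'ice hockey team']
labels = [0, 0, 1, 1]

vec = CountVectorizer()
X = vec.fit_transform(docs)
model = MultinomialNB().fit(X, labels)

# Tokens outside the learned vocabulary ('to', 'the') are ignored
new_counts = vec.transform(['rocket to the moon']).toarray()

# Per-token log-likelihoods weighted by count, summed, plus the log prior
scores = new_counts @ model.feature_log_prob_.T + model.class_log_prior_
manual_label = model.classes_[scores.argmax(axis=1)]

print(scores)
print(manual_label, model.predict(new_counts))
```

The label with the largest summed score matches what `predict` returns, which is the "simple sum" described above.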

-----

In [None]:
# Load the predefined training and testing subsets
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(data_home='../../../../datasets/DSA-8630/newsgroups/', subset='train', shuffle=True, random_state=23)
test = fetch_20newsgroups(data_home='../../../../datasets/DSA-8630/newsgroups/', subset='test', shuffle=True, random_state=23)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

cv = CountVectorizer()

train_counts = cv.fit_transform(train['data'])
test_data = cv.transform(test['data'])

nb = MultinomialNB()

clf = nb.fit(train_counts, train['target'])
predicted = clf.predict(test_data)


print("NB prediction accuracy = {0:5.1f}%".format(100.0 * clf.score(test_data, test['target'])))

The code below does the same thing as the code above, but is implemented using the `Pipeline` class in scikit-learn. [Pipelines](http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html) allow you to chain transformers and estimators together so that you can use them as a single unit. Here the vectorizer => classifier sequence is easier to work with as a single `Pipeline` object. Calling `fit()` on the pipeline first fits the `CountVectorizer`, which learns the vocabulary dictionary of all tokens in the input data `train['data']`, and then fits the classifier on the resulting counts.

In [None]:
from sklearn.pipeline import Pipeline

tools = [('cv', CountVectorizer()), ('nb', MultinomialNB())]
clf = Pipeline(tools)

clf = clf.fit(train['data'], train['target'])
predicted = clf.predict(test['data'])

print("NB prediction accuracy = {0:5.1f}%".format(100.0 * clf.score(test['data'], test['target'])))
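
The equivalence between the manual two-step version and the pipeline can be checked on a toy corpus that needs no download. The documents and labels below are made up for this sketch.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus with made-up labels: 0 = space, 1 = hockey
docs = ['rocket launch orbit', 'orbit moon rocket',
        'hockey goal ice', 'ice hockey team']
labels = [0, 0, 1, 1]
new_docs = ['moon rocket', 'hockey team']

# Manual version: vectorize first, then fit and predict
vec = CountVectorizer()
manual = MultinomialNB().fit(vec.fit_transform(docs), labels)
manual_pred = manual.predict(vec.transform(new_docs))

# Pipeline version: raw text goes in, the steps run in order
pipe = Pipeline([('cv', CountVectorizer()), ('nb', MultinomialNB())])
pipe.fit(docs, labels)
pipe_pred = pipe.predict(new_docs)

print(manual_pred, pipe_pred)

# Fitted steps remain accessible by the names given in the step list
print(len(pipe.named_steps['cv'].vocabulary_))
```

A practical benefit of the pipeline form is that the vectorizer cannot accidentally be refit on the test data, since `predict` only calls `transform` on the earlier steps.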

## TF-IDF

Previously, we have simply used the number of times a token (i.e., word, or more generally an n-gram) occurs in a document to classify the document. Even with the removal of stop words, however, this can still overemphasize tokens that might generally occur across many documents (e.g., names or general concepts). An alternative technique that often provides robust improvements in classification accuracy is to employ the frequency of token occurrence, normalized over the frequency with which the token occurs in all documents. In this manner, we give higher weight in the classification process to tokens that are more strongly tied to a particular label. 

Formally, this concept is known as [term frequency–inverse document frequency](https://en.wikipedia.org/wiki/Tf–idf) (or tf-idf). scikit-learn provides this functionality via the [TfidfTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html), which can follow a vectorizer such as `CountVectorizer`, or via the [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer), which combines both steps into a single transformer.
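
A minimal sketch on a made-up three-document corpus shows both points: a token that occurs in every document receives a lower inverse document frequency than a rare one, and the one-step and two-step routes give the same matrix when both use their default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Made-up corpus: 'the' occurs in every document, 'rocket' in only one
docs = ['the rocket reached orbit',
        'the team won the game',
        'the moon is bright']

tfidf_vec = TfidfVectorizer()
X_direct = tfidf_vec.fit_transform(docs)

# idf_ holds one inverse document frequency per vocabulary entry
vocab = tfidf_vec.vocabulary_
idf_the = tfidf_vec.idf_[vocab['the']]
idf_rocket = tfidf_vec.idf_[vocab['rocket']]
print('idf(the)    =', idf_the)
print('idf(rocket) =', idf_rocket)

# Two-step route: raw counts first, then the tf-idf reweighting
counts = CountVectorizer().fit_transform(docs)
X_two_step = TfidfTransformer().fit_transform(counts)
same = np.allclose(X_direct.toarray(), X_two_step.toarray())
print(same)
```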

-----

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tools = [('tf', TfidfVectorizer()), ('nb', MultinomialNB())]
clf = Pipeline(tools)

# set_params() on the pipeline sets parameters of its steps. The method works
# on simple estimators as well as on nested objects (such as pipelines).
# Pipeline parameters have the form <component>__<parameter>, so here
# tf__stop_words sets the stop_words parameter of the 'tf' step.
clf.set_params(tf__stop_words = 'english')

clf = clf.fit(train['data'], train['target'])
predicted = clf.predict(test['data'])

print("NB (TF-IDF with Stop Words) prediction accuracy = {0:5.1f}%".format(100.0 * clf.score(test['data'], test['target'])))
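
The `<component>__<parameter>` naming can be inspected without fitting anything. In this sketch the `nb__alpha=0.5` value is only an illustration of reaching a second step, not a tuned setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe = Pipeline([('tf', TfidfVectorizer()), ('nb', MultinomialNB())])

# <step name>__<parameter> reaches into the named step
pipe.set_params(tf__stop_words='english', nb__alpha=0.5)

# get_params() reports the nested parameters under the same names
print(pipe.get_params()['tf__stop_words'])

# The underlying estimator objects were updated in place
print(pipe.named_steps['tf'].stop_words)
print(pipe.named_steps['nb'].alpha)
```

The same parameter names are what tools such as grid search use to address individual steps of a pipeline.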

-----

### Logistic Regression

[Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression) models the probability of a categorical outcome, such as a yes/no decision or a win/loss result. When there are many labels, one approach is to use the fact that logistic regression can quantify the likelihood that a vector is in or out of a particular category. By computing this likelihood for every category and choosing the highest, we can determine the best label for each test vector. [scikit-learn](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) provides an implementation that can be easily used for our classification problem.
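
The in-or-out-of-each-category idea can be made explicit with `OneVsRestClassifier`, which fits one binary logistic regression per label. This is a sketch on a made-up corpus; `LogisticRegression` used on its own may handle multiple classes differently depending on the scikit-learn version and settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy corpus with made-up labels: 0 = space, 1 = hockey, 2 = autos
docs = ['rocket launch orbit', 'orbit moon rocket',
        'hockey goal ice', 'ice hockey team',
        'engine car wheel', 'car road engine']
labels = [0, 0, 1, 1, 2, 2]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# One binary "this label vs. everything else" model per category
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, labels)

new_X = vec.transform(['rocket orbit', 'car engine'])
proba = ovr.predict_proba(new_X)

# The best label is the category with the highest likelihood
best = ovr.classes_[proba.argmax(axis=1)]
print(proba.round(3))
print(best, ovr.predict(new_X))
```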

-----

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfTransformer

clf = Pipeline([('vect', CountVectorizer(stop_words = 'english')),
                ('tfidf', TfidfTransformer()),
                ('lr', LogisticRegression())])


clf = clf.fit(train['data'], train['target'])
predicted = clf.predict(test['data'])

print("LR (TF-IDF with Stop Words) prediction accuracy = {0:5.1f}%".format(100.0 * clf.score(test['data'], test['target'])))
