## Text Mining

Text mining, an application of natural language processing (NLP), is the process of analyzing unstructured text data to extract structured information and high-value knowledge.  The usefulness of text mining spans many industries such as media/publishing, banking/insurance, and biotech/pharmaceutical.  Applications of text mining include:

- named entity recognition
- identifying facts in arcane text
- document classification
- identifying similar content
- building taxonomies and ontologies
- sentiment analysis

In this demo, we will walk through a document classification problem using newsgroup posts.  It is derived from an example in the scikit-learn documentation:
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

### Load file contents and categories

We load sample newsgroup data here.  The full dataset contains posts from 20 categories, but to keep the problem bounded, we'll start by training a classification model on four categories.  Later, we can extend our methodology and train our model on all 20 categories.

In [1]:
from sklearn.datasets import fetch_20newsgroups

In [2]:
categories = ['alt.atheism',
              'soc.religion.christian',
              'comp.graphics', 
              'sci.med']

twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

The data is stored in a scikit-learn `Bunch` object, a dictionary-like container whose keys can also be accessed as attributes.  These are the four categories:

In [3]:
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

Let's look at a sample post in the comp.graphics newsgroup.

In [4]:
for line in twenty_train.data[0].split('\n'):
    print(line)

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.



In [5]:
twenty_train.target_names[twenty_train.target[0]]

'comp.graphics'

### Extract feature vectors

To train a classification model, we need to boil the documents down into feature vectors.  We use the "bag of words" approach: each distinct word that appears in the corpus is a feature.  For a particular document, the value of each feature can be a simple 0 or 1 (whether or not the word appears) or the number of times the word occurs.  For example, using the word count approach on a 1-sentence document:

The boy chased the red ball.

Features:
the = 2, boy = 1, chased = 1, red = 1, ball = 1
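
As a minimal sketch of the idea, the counts above can be reproduced with the Python standard library alone.  The regular expression used for splitting words is a simplifying assumption, not the tokenizer that scikit-learn uses.

```python
import re
from collections import Counter

sentence = 'The boy chased the red ball.'

# Lowercase the text and pull out runs of letters as word tokens
# (a simplified tokenizer, assumed here for illustration).
tokens = re.findall(r'[a-z]+', sentence.lower())

# Word-count features: how many times each word occurs.
word_counts = Counter(tokens)

# Binary features: 1 for every word that occurs at least once.
word_presence = {word: 1 for word in word_counts}

print(dict(word_counts))
print(word_presence)
```

With the count approach "the" gets a value of 2, while with the binary approach every word in the sentence gets a value of 1.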

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [7]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2257, 35788)

Scikit-learn contains functions that automatically create the feature vectors for the 2,257 documents in our training corpus.  `CountVectorizer` splits each document into words, a preprocessing step known as **tokenization**, and then counts the occurrences of each word.  More advanced possibilities for preprocessing include document truncation, lemmatization, filtering of stop words, and feature normalization.  The value of these techniques depends heavily on the corpus.
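
As one example of these options, the sketch below compares the vocabulary learned with and without stop-word filtering.  It reuses the one-sentence document from earlier rather than the newsgroup corpus, and the variable names are ours; whether the built-in English stop list helps on a real corpus is something to check case by case.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['The boy chased the red ball.']

# Default settings: lowercase the text and keep every word of two or more characters.
default_vect = CountVectorizer()
default_vect.fit(docs)

# Same vectorizer, but dropping words on scikit-learn's built-in English stop list.
filtered_vect = CountVectorizer(stop_words='english')
filtered_vect.fit(docs)

# vocabulary_ maps each word to its column index; sorting it lists the feature words.
print(sorted(default_vect.vocabulary_))
print(sorted(filtered_vect.vocabulary_))
```

The second vocabulary no longer contains "the", so the feature matrix has one column fewer.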

In [8]:
X_train_counts

<2257x35788 sparse matrix of type '<type 'numpy.int64'>'
	with 365886 stored elements in Compressed Sparse Row format>

Alternatively, we can weight the bags of words using [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (term frequency-inverse document frequency).  This weighting favors words that are common in a particular document but rare in the overall training corpus.  These tend to be high-value words that aid in classification.

In [9]:
tfidf_transformer = TfidfTransformer(use_idf=True).fit(X_train_counts)
X_train_tfidf = tfidf_transformer.transform(X_train_counts)
X_train_tfidf.shape

(2257, 35788)
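
To make the weighting concrete, here is a hand-rolled sketch on a hypothetical three-document corpus.  It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which to our understanding matches the `TfidfTransformer` defaults; the corpus and the helper names `smoothed_idf` and `tfidf` are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical corpus, already tokenized and lowercased.
corpus = [
    ['the', 'boy', 'chased', 'the', 'red', 'ball'],
    ['the', 'dog', 'chased', 'the', 'boy'],
    ['the', 'red', 'car'],
]
n_docs = len(corpus)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for doc in corpus for term in set(doc))


def smoothed_idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf(doc):
    """Return L2-normalized tf-idf weights for one tokenized document."""
    counts = Counter(doc)
    weights = {term: count * smoothed_idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


weights = tfidf(corpus[0])
for term, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print('%-7s %.3f' % (term, weight))
```

In the first document "the" occurs twice as often as "ball", yet because "the" appears in every document and "ball" in only one, its tf-idf weight ends up only slightly larger.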

### Train a linear categorization model

Now that we have our features, we can train a classification model -- we'll start with a naïve Bayes classifier.  

In [10]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier

from sklearn.pipeline import Pipeline

In [11]:
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

Now that we have a basic model, we'll need to test it using some unseen data.

In [12]:
docs_new = ['I haven\'t been to church since 2002', 'This new visualization tool is awesome']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

In [13]:
predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

"I haven't been to church since 2002" => soc.religion.christian
'This new visualization tool is awesome' => comp.graphics


Scikit-learn offers a Pipeline class that makes the vectorizer => transformer => classifier process work with a single command.  This is useful when running Monte Carlo simulations to tune model parameters.
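
As a rough sketch of parameter tuning with a pipeline, the example below runs an exhaustive grid search (`GridSearchCV`, rather than random sampling) over a few settings.  The toy corpus, its labels, and the parameter values are made up for illustration and are far too small to say anything about the newsgroup data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: label 0 = religion, label 1 = graphics.
toy_docs = [
    'church service on sunday morning',
    'the church choir sang a hymn',
    'prayer and faith in the church',
    'sunday prayer at the old church',
    'render the image with a graphics card',
    'convert the image file to another format',
    'graphics software for image editing',
    'plot the image using graphics tools',
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipeline = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', MultinomialNB())])

# Pipeline parameters are addressed as '<step name>__<parameter name>'.
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': [True, False],
    'clf__alpha': [0.01, 1.0],
}

# Every combination is refit and scored with 2-fold cross-validation.
search = GridSearchCV(toy_pipeline, param_grid, cv=2)
search.fit(toy_docs, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

Because the whole chain is a single estimator, each candidate setting refits the vectorizer, the transformer, and the classifier together on the training folds only.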

In this pipeline, we plug in a linear support vector machine (SVM), which works well for a broad spectrum of text mining applications.  `SGDClassifier` with `loss='hinge'` fits a linear SVM using stochastic gradient descent.

In [14]:
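text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     # max_iter=5 with tol=None runs exactly five epochs; these replace
                     # the n_iter argument that newer scikit-learn releases removed
                     ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                                           max_iter=5, tol=None, random_state=42))])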

text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

### Evaluate the categorization model

We'll grab a larger set of newsgroup posts to evaluate our model.  A larger test set allows us to spot broader trends and areas for improvement. 

In [15]:
import numpy as np 
from sklearn import metrics

In [16]:
twenty_test = fetch_20newsgroups(subset='test',
    categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)

0.9127829560585885

Let's look at the results in the form of [precision, recall, and F-score](https://en.wikipedia.org/wiki/Precision_and_recall).

A more detailed error analysis is possible by looking at the [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix).

We can see that "alt.atheism" posts are most often misclassified as "soc.religion.christian" posts:

In [17]:
print(metrics.classification_report(twenty_test.target, predicted,
                                    target_names=twenty_test.target_names))
print(metrics.confusion_matrix(twenty_test.target, predicted))

                        precision    recall  f1-score   support

           alt.atheism       0.95      0.81      0.87       319
         comp.graphics       0.88      0.97      0.92       389
               sci.med       0.94      0.90      0.92       396
soc.religion.christian       0.90      0.95      0.93       398

           avg / total       0.92      0.91      0.91      1502

[[258  11  15  35]
 [  4 379   3   3]
 [  5  33 355   3]
 [  5  10   4 379]]
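
The report and the confusion matrix carry the same information.  As a small sketch, the per-class precision, recall, and F1 scores can be recomputed from the matrix printed above, where rows are the true classes and columns are the predicted classes.  The variable names are ours.

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
labels = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
confusion = [
    [258, 11, 15, 35],
    [4, 379, 3, 3],
    [5, 33, 355, 3],
    [5, 10, 4, 379],
]
n_classes = len(labels)

# Accuracy: correctly classified posts (the diagonal) over all posts.
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(n_classes))
accuracy = correct / total

precision, recall, f1 = [], [], []
for i in range(n_classes):
    true_pos = confusion[i][i]
    predicted_as_i = sum(confusion[r][i] for r in range(n_classes))  # column sum
    actually_i = sum(confusion[i])                                    # row sum
    p = true_pos / predicted_as_i
    r = true_pos / actually_i
    precision.append(p)
    recall.append(r)
    f1.append(2 * p * r / (p + r))

for name, p, r, f in zip(labels, precision, recall, f1):
    print('%-24s precision=%.2f  recall=%.2f  f1=%.2f' % (name, p, r, f))
print('accuracy=%.4f' % accuracy)
```

The low recall for "alt.atheism" comes from the 61 posts in its row that were assigned to other classes, 35 of them to "soc.religion.christian".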
