# Text Analysis with Scikit-learn

### Info

- Scott Bailey - Stanford Libraries

With great thanks to Javier de la Rosa and Peter Broadwell, who wrote chunks of this workshop during previous iterations.

### What are we covering today?

- What do we mean by text analysis?
- What is scikit-learn and why would we use it?
- Where does scikit-learn fit into a full text analysis pipeline?
- Turning texts into numbers
- The scikit-learn API
- Text classification
- Building a pipeline to test multiple algorithms

### Requirements

If running this notebook locally, please install the following packages before continuing.

- `nltk`
- `jupyter`
- `scikit-learn`
- `lime`

## Goals

By the end of the workshop, we hope you'll have a solid sense of how and why we turn text into numbers, a sense of the scikit-learn standard API, and knowledge of building a pipeline in scikit-learn for text classification. 

## A typical ML classification workflow
The workflow for a classification task usually runs as follows:
1. **Collect** or create **labeled data**
2. **Transform** that data into a numeric representation
  - Each numeric value representing a characteristic of the data is called a **feature**
  - The set of all features representing a single item of input data is called the **feature vector**; each feature vector is paired with that item's label
  - The whole labeled data set is split into two parts (at least) to train, evaluate and refine the model: a **training set** and a **test set**
3. **Train** (learn/fit) a model on a part of the transformed labeled data (the training set)
4. **Test** the model predictions on the test set to evaluate its performance
5. **Assess** your model and revisit each of the previous steps, if necessary
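
As a preview, the five steps above can be sketched in a few lines. This is a minimal illustration on a made-up toy dataset (the texts, labels, and variable names such as `toy_texts` are assumptions for this example, not part of the workshop data); each class it uses is introduced properly later on.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# 1. Collect labeled data (a made-up toy set, for illustration only)
toy_texts = [
    "the team won the match", "a late goal won the game",
    "the striker scored twice", "the coach praised the team",
    "the recipe needs two eggs", "bake the cake for an hour",
    "stir the sauce and add salt", "chop the onions and fry them",
]
toy_labels = ["sports"] * 4 + ["cooking"] * 4

# 2. Split into training and test sets, then transform texts into feature vectors
texts_train, texts_test, labels_train, labels_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels)
toy_vectorizer = CountVectorizer()
features_train = toy_vectorizer.fit_transform(texts_train)
features_test = toy_vectorizer.transform(texts_test)

# 3. Train a model on the training set
toy_model = MultinomialNB()
toy_model.fit(features_train, labels_train)

# 4. Test: predict labels for the held-out texts
predictions = toy_model.predict(features_test)

# 5. Assess: compare predictions with the real labels
print("accuracy:", accuracy_score(labels_test, predictions))
```

With only eight texts the score itself means little; the point is the shape of the workflow.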

## `scikit-learn`

In Python, one solid choice for machine learning is the library [`scikit-learn`](http://scikit-learn.org/stable/):
- Simple and efficient tools for data mining and data analysis
- Open source, commercially usable (BSD license), and reusable in various contexts
- Built on other popular Python libraries:
  - NumPy (core numerical processing tools and data structures)
  - SciPy (scientific computing functions, including clustering algorithms)
  - Matplotlib (plotting and data visualization, intended to resemble MATLAB's viz features, but free)

<figure>
  <img  src="http://scikit-learn.org/stable/_static/ml_map.png" width="75%" />  <figcaption><div align="left" style="padding-top: 4px;">Source: <a href="http://scikit-learn.org/stable/tutorial/machine_learning_map/">`scikit-learn`: Choosing the right estimator</a></div></figcaption>
</figure>

### scikit-learn's simple and unified API
`scikit-learn` provides classes for most of the central machine learning tasks and methods, including those in the classification workflow above. These classes share many of the same interface points, with the goal of making it easier to swap or chain algorithms.

For example, all `Estimators` (classes containing the actual learning algorithms)  provide the following interfaces:
- `fit()`: load (labeled) input data and compute/"learn" various qualities (parameters) of the data
- `predict()`: make predictions/inferences about other data after training

Some `Estimators` and all feature extractors/`Vectorizers` also provide a `transform()` interface, which modifies and outputs data based on the parameters learned by running `fit()` on the data. The interface `fit_transform()` runs both of these steps on the same input data.
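
To make the relationship between these interfaces concrete, here is a small sketch using `CountVectorizer` (covered in detail below). The two example sentences and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

api_docs = ["red fish blue fish", "one fish two fish"]  # made-up example texts

# Two steps: learn the vocabulary with fit(), then apply it with transform()
two_step = CountVectorizer()
two_step.fit(api_docs)
counts_two_step = two_step.transform(api_docs)

# One step: fit_transform() learns and applies on the same data
one_step = CountVectorizer()
counts_one_step = one_step.fit_transform(api_docs)

# Both routes give the same document-term matrix
same_result = np.array_equal(counts_two_step.toarray(), counts_one_step.toarray())
print(same_result)

# transform() reuses the learned vocabulary; words it never saw are ignored
unseen = two_step.transform(["green fish"]).toarray()
print(unseen)
```

The last call shows why we only `fit()` on training data: new texts are described purely in terms of the vocabulary learned earlier.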

## Text classification


Given the specific task of assigning a category to a new text based on a set of labeled input texts:
1. **Collect and label.** In the case of Twitter data, for example, you'd download a bunch of tweets using the Twitter API, extract their texts from the JSON format of the API response, and then assign one or more labels to each tweet (the easiest way is just to use the tweet's hashtags as labels).
2. **Transform.** There are many strategies for turning your textual data into numbers, and `scikit-learn` has built-in tools for most of them. Usually you'll begin by making a list of the words that appear in a document (a "bag of words"), but you'll probably also want to count the *frequencies* of the words in each document (BoW counts). In `scikit-learn`, this is done by a type of transformer called a `CountVectorizer`. We also must choose whether to exclude uncommon words (i.e., words that only appear in a few documents) or very common words ("stopwords"). These high-level settings as a whole are called **hyperparameters** (different from **parameters**, which are the values learned by the model from the training data that enable it to make predictions).
3. **Train.** You'll need to choose which learning model/algorithm to use, either by reading the documentation or talking to your friendly neighborhood data scientist. After choosing a model, we train it on part of the labeled input data (the training set).
4. **Test.** We then use the trained model to predict/infer the labels of the remainder of the labeled input data (the test/validation set).
5. **Assess.** Apply one or more metrics (scores) to evaluate how well the predicted labels match the actual labels of the test/validation set. If the performance is unsatisfactory, we'll need to backtrack, possibly all the way to step #1, getting more labeled data and applying different transformations/hyper-parameters as needed, and/or trying a different model.
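
The distinction between hyperparameters and parameters in step 2 can be seen directly on `scikit-learn` objects. The sketch below uses made-up texts and labels (`hp_texts`, `hp_labels`) purely for illustration: hyperparameters are passed in when the object is created, while learned parameters appear as attributes ending in an underscore after `fit()`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

hp_texts = ["good good movie", "bad movie", "good plot", "bad bad plot"]  # made up
hp_labels = ["pos", "neg", "pos", "neg"]                                  # made up

# Hyperparameters: chosen by us, before any training happens
hp_vectorizer = CountVectorizer(lowercase=True, max_features=3)
hp_model = MultinomialNB(alpha=1.0)
print(hp_model.get_params())

# Parameters: learned from the data during fit()
hp_features = hp_vectorizer.fit_transform(hp_texts)
hp_model.fit(hp_features, hp_labels)
print(hp_vectorizer.vocabulary_)          # learned word-to-column mapping
print(hp_model.feature_log_prob_.shape)   # one row per class, one column per word
```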

We'll begin by installing any Python libraries we need that aren't already available in Google Colab. 

In [1]:
!pip install lime

Collecting lime

In [0]:
# For the basic documentation on text feature extraction,
# see https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
from sklearn.feature_extraction.text import CountVectorizer

In [0]:
documents = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

In [0]:
count_vectorizer = CountVectorizer()
count_vectorizer.fit(documents)
print("Vocabulary size:", len(count_vectorizer.vocabulary_))
count_vectorizer.vocabulary_

Vocabulary size: 9


{'and': 0,
 'document': 1,
 'first': 2,
 'is': 3,
 'one': 4,
 'second': 5,
 'the': 6,
 'third': 7,
 'this': 8}

In [0]:
counts = count_vectorizer.transform(documents)
print(counts)
print("   ^  ^         ^\n   |  |         |\n  doc word_id count")

  (0, 1)	1
  (0, 2)	1
  (0, 3)	1
  (0, 6)	1
  (0, 8)	1
  (1, 1)	2
  (1, 3)	1
  (1, 5)	1
  (1, 6)	1
  (1, 8)	1
  (2, 0)	1
  (2, 3)	1
  (2, 4)	1
  (2, 6)	1
  (2, 7)	1
  (2, 8)	1
  (3, 1)	1
  (3, 2)	1
  (3, 3)	1
  (3, 6)	1
  (3, 8)	1
   ^  ^         ^
   |  |         |
  doc word_id count


In [0]:
# This will go through the entire vocabulary, but only show counts from the first doc
doc = 0
for word, word_id in count_vectorizer.vocabulary_.items():
    print(word, ":", counts[doc, word_id])

this : 1
is : 1
the : 1
first : 1
document : 1
second : 0
and : 0
third : 0
one : 0


In [0]:
# We previously used both the fit and transform methods.
# Vectorizers typically have a single fit_transform method we can use to do both
# in one step.
counts = count_vectorizer.fit_transform(documents)
print(counts)

  (0, 8)	1
  (0, 3)	1
  (0, 6)	1
  (0, 2)	1
  (0, 1)	1
  (1, 8)	1
  (1, 3)	1
  (1, 6)	1
  (1, 1)	2
  (1, 5)	1
  (2, 8)	1
  (2, 3)	1
  (2, 6)	1
  (2, 0)	1
  (2, 7)	1
  (2, 4)	1
  (3, 8)	1
  (3, 3)	1
  (3, 6)	1
  (3, 2)	1
  (3, 1)	1


`CountVectorizer` also has options to disregard stopwords (`stop_words`), count n-grams (multiple adjacent words, `ngram_range`) instead of single words, cap the size of the vocabulary (`max_features`), normalize case and accents (`lowercase`, `strip_accents`), or keep only terms within a document-frequency range (`min_df`, `max_df`). It is worth exploring the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
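
A short sketch of a few of these options in action. The two example sentences are made up, and the parameter values are arbitrary choices for illustration rather than recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer

option_docs = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]

# Drop English stopwords and count single words plus adjacent word pairs
bigram_vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2))
bigram_vectorizer.fit(option_docs)
print(sorted(bigram_vectorizer.vocabulary_))

# Keep only terms that occur in at least two documents
shared_vectorizer = CountVectorizer(min_df=2)
shared_vectorizer.fit(option_docs)
print(sorted(shared_vectorizer.vocabulary_))
```

Note that stopwords are removed before the word pairs are formed, so pairs can join words that were not adjacent in the original sentence.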

### Activity

Fill in the text in the code cell below so after transforming it using a `CountVectorizer`, the counts are as shown (order of words is not important).

```
flowers : 1
garden : 1
up : 1
some : 1
the : 1
morning : 1
place : 1
every : 1
pick : 1
from : 1
my : 1
by : 1
```

**Hint**: No special parameters are needed to get this output.

**Stretch goal**: What word is missing from the token counts? How would you figure out why it's missing, and how to get it back (if desired)?

In [0]:
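# Replace the placeholder text so the counts match the list above.
# CountVectorizer expects a list of documents; a bare string raises a ValueError.
documents = ["my by from pick every morning by place garden flowers the "]
counts = count_vectorizer.fit_transform(documents)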

### Activity

Take a look at the documentation for the `CountVectorizer`. Create a new vectorizer instance that modifies some of the parameters, such as `ngram_range` or `lowercase`, and run some text of your choice through that vectorizer. 

## Training and testing

`scikit-learn` provides  functions to split a labeled dataset into training and testing sets.

**Note**: Many machine learning approaches call for splitting the labeled data into three sets:
- **training** data (usually the largest set) for the initial model training
- **validation** data, which is then used to evaluate the initial performance of the model and subsequently fine-tune the model settings and **hyperparameters** in the hopes of getting better results
- **testing** data is "held out" until all model tuning is completed and then is used to give a final evaluation score or *benchmark* of the model's performance.

![train test validation split](https://cdn-images-1.medium.com/max/800/1*Nv2NNALuokZEcV6hYEHdGA.png)

Train, test, and validation splits<br>Source: Tarang Shah, [About Train, Validation and Test Sets in Machine Learning](https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7)
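
One common way to get three sets is to call `train_test_split` twice. This is a sketch under assumed proportions (60/20/20) with stand-in data; the names `items` and `item_labels` are hypothetical.

```python
from sklearn.model_selection import train_test_split

items = list(range(100))               # stand-ins for 100 documents
item_labels = [i % 2 for i in items]   # made-up binary labels

# First hold out 20% as the final test set
rest_items, test_items, rest_labels, test_labels = train_test_split(
    items, item_labels, test_size=0.20, random_state=42)

# Then split the remainder into training (75%) and validation (25%)
train_items, val_items, train_labels, val_labels = train_test_split(
    rest_items, rest_labels, test_size=0.25, random_state=42)

print(len(train_items), len(val_items), len(test_items))  # 60 20 20
```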

To keep things simple for this tutorial, we'll mostly just use training and test sets.

Additionally, in real-world applications, it is highly recommended to split the data set randomly in several different ways (*folds*) and then to compare the performance of the model on the validation/test data across all of these. This approach is called **cross-validation.** The result from a single split can be a fluke or outlier, leading to an unrealistic evaluation of the model. Cross-validation gives us a much clearer picture of the likely performance of the model given arbitrary data, and also can be a way to "stretch" the training data when the available training set is small.

![4-fold cross validation](https://upload.wikimedia.org/wikipedia/commons/1/1c/K-fold_cross_validation_EN.jpg)

4-fold cross validation<br>Source: Wikimedia Commons
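
To see what the folds look like, here is a minimal sketch using `KFold` on eight stand-in document indices. The data and the `random_state` value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

fold_data = np.arange(8)  # indices standing in for 8 labeled documents

kfold = KFold(n_splits=4, shuffle=True, random_state=0)
held_out = []  # collects every index that lands in a test fold
for fold_number, (train_idx, test_idx) in enumerate(kfold.split(fold_data), start=1):
    print("fold", fold_number, "train:", train_idx, "test:", test_idx)
    held_out.extend(test_idx.tolist())

# Each document is held out exactly once across the four folds
print(sorted(held_out))
```

In each fold the model would be trained on the `train` indices and scored on the `test` indices; the four scores are then averaged.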

## The text corpus

For our text classification example, we will be using the [Brown corpus](https://www.nltk.org/book/ch02.html) included in the Natural Language Toolkit (`nltk`), which contains more than a million words of English from 500 texts, where each text is categorized into one of 15 genres. We will consider two of these genre categories: `news` (e.g., the Chicago *Tribune*'s society reportage) and `lore` (popular lore). The goal will be to create a classifier able to assign a text to `news` or `lore` based solely on its textual contents.

In [0]:
import nltk
nltk.download('brown')
from nltk.corpus import brown

In [0]:
for category in brown.categories():
    print(category, len(brown.fileids(category)))

In [0]:
import random

dataset = []
categories = ['lore', 'news']
# Two closely related genres; assign these to `categories` to rerun on a harder task
new_categories = ['fiction', 'romance']
for category in categories:
    for fileid in brown.fileids(category):
        text = " ".join(brown.words(fileids=fileid))
        dataset.append((text, category))

random.shuffle(dataset)
print(len(dataset), "documents:", ", ".join(" ".join((str(len(brown.fileids(c))), c)) for c in categories))

In [0]:
#Let's take a quick look at some texts to see what we have.
dataset[0]

From the dataset, we can now separate the labels and the texts into two different variables. By convention, the variable containing the labels is named `y`, and the one containing the input features (in our case, the texts) is named `X`: the output `y` is modeled as a function of the inputs `X`, which is the core abstraction in `scikit-learn`. But bare letters are confusing when you're trying to learn a new concept, so we'll add some explanatory info to the variable names after `X_` and `y_`.

In [0]:
import numpy as np  # scikit-learn works internally with NumPy arrays

texts = []
labels = []
for text, label in dataset:
    texts.append(text)
    labels.append(label)
    
X_texts = np.array(texts)
y_labels = np.array(labels)

In [0]:
from sklearn.model_selection import train_test_split

(X_texts_train, X_texts_test,
 y_labels_train, y_labels_test) = train_test_split(X_texts, y_labels, test_size=0.25, random_state=42)

print("{} training documents".format(*X_texts_train.shape))
print("{} testing documents".format(*X_texts_test.shape))

In [0]:
# Don't forget to transform the text!
#vectorizer = CountVectorizer(stop_words='english')
vectorizer = CountVectorizer()
X_features_train = vectorizer.fit_transform(X_texts_train)
X_features_test = vectorizer.transform(X_texts_test)

print("{} training documents with {} features per document".format(*X_features_train.shape))
print("{} testing documents with {} features per document".format(*X_features_test.shape))

## Classification (prediction)

Let's start with one of the Naïve Bayes classifiers.

**Naïve Bayes** is a family of classifiers based on Bayes' Theorem of probability, which describes the probability of an event based on prior knowledge of possibly relevant conditions. Although its formulation can get confusing, all the math boils down to counting, multiplication and division, making Naïve Bayes (NB) classifiers very fast. On the other hand, NB makes the assumption that all of the features in the data set are equally important and independent, which is obviously not true for words. Despite this, Naïve Bayes classifiers often perform well as text classifiers.

<div align="left"><b>"All models are wrong but some are useful" - George Box (1978)</b></div>

`scikit-learn` includes several Naïve Bayes algorithms; three of the most common are:
- Gaussian: assumes that features follow a normal distribution.
- Multinomial: good for discrete counts, like in text classification problems using counts of words.
- Bernoulli: useful for feature vectors that are binary (i.e. zeros and ones), like classic bag of words.

Given that our feature vectors are counts of the words in each document with some additional vocabulary constraints, we will use `MultinomialNB`.
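
To illustrate the claim that the math "boils down to counting", the sketch below recomputes the per-class word probabilities of a `MultinomialNB` by hand and compares them with what the model learned. The tiny dataset is made up, and the hand calculation assumes the default additive smoothing (`alpha=1.0`).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

nb_texts = ["ball goal goal", "goal team", "egg flour", "egg egg sugar"]  # made up
nb_labels = np.array(["sports", "sports", "cooking", "cooking"])          # made up

nb_vectorizer = CountVectorizer()
nb_counts = nb_vectorizer.fit_transform(nb_texts).toarray()

nb_model = MultinomialNB(alpha=1.0)
nb_model.fit(nb_counts, nb_labels)

# By hand: per class, add up the word counts, add alpha, divide by the total
manual_log_prob = []
for label in nb_model.classes_:
    word_totals = nb_counts[nb_labels == label].sum(axis=0) + 1.0
    manual_log_prob.append(np.log(word_totals / word_totals.sum()))
manual_log_prob = np.array(manual_log_prob)

# The hand-computed values match the model's learned parameters
matches = np.allclose(manual_log_prob, nb_model.feature_log_prob_)
print(matches)
```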

In [0]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()

classifier.fit(X_features_train, y_labels_train)

In [0]:
np.shape(classifier.feature_log_prob_)

Now we can predict the categories of previously unseen texts and assess how good our classifier is at classifying them.

In [0]:
samples = [
    "This issue raises new and troubling questions.",
    "Suddenly the cave entrance collapsed, trapping them inside."
]
transformed_samples = vectorizer.transform(samples)
classifier.predict(transformed_samples)

If we want to know which words are being used to decide whether a text is `news` or `lore`, we need to take into account the word vocabulary built by the vectorizer we used to transform our data, and the feature (word) probabilities calculated for each label by the model at training time.

In [0]:
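def most_informative_features(classifier, vectorizer=None, n=10):
    # If no vectorizer is given, assume `classifier` is a Pipeline whose first
    # step is the vectorizer and whose last step is the Naïve Bayes model.
    if vectorizer is None:
        vectorizer = classifier.steps[0][1]
        classifier = classifier.steps[-1][1]
    class_labels = classifier.classes_
    # get_feature_names() was removed in scikit-learn 1.2;
    # on older versions, call vectorizer.get_feature_names() instead.
    feature_names = vectorizer.get_feature_names_out()
    # Sort (log probability, word) pairs and keep the n most probable per class
    topn_class1 = sorted(zip(classifier.feature_log_prob_[0], feature_names))[-n:]
    topn_class2 = sorted(zip(classifier.feature_log_prob_[1], feature_names))[-n:]
    for prob, feat in reversed(topn_class2):
        print(class_labels[1], prob, feat)
    print()
    for prob, feat in reversed(topn_class1):
        print(class_labels[0], prob, feat)

most_informative_features(classifier, vectorizer)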

In [0]:
%%capture --no-display
# Don't worry about this particular chunk of code

%matplotlib inline
from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline

def explain(entry, clf, vectorizer=None, n=10):
    if vectorizer is None:
        class_names = clf.steps[-1][1].classes_.tolist()  # steps holds (name, estimator) pairs
        pipeline = clf
    else:
        class_names = clf.classes_.tolist()
        pipeline = make_pipeline(vectorizer, clf)
    explainer = LimeTextExplainer(class_names=class_names)
    exp = explainer.explain_instance(entry, pipeline.predict_proba, num_features=n)
    exp.show_in_notebook()

explain("Reports indicated that the cave entrance collapsed, trapping them inside.", classifier, vectorizer)
# explain("They went to a beautiful restaurant, and drank wine together.", classifier, vectorizer)

### Activity

The list of "most informative features" seems to include a lot of really common words, aka "stopwords." The model might perform better if we ignore them. Which code block above would we modify to exclude stopwords from the feature set? (Hint: it's pretty far back). Consulting the documentation might also be helpful. 

Make this modification and then see how it changes the results.

## Model evaluation and selection

Changing model hyperparameters without a way to assess the classifier's performance can get problematic. If, for example, you find more intuitive informative features but the classifier is wrong 50% of the time, it doesn't much matter how good you think those features are. In some cases, classifiers that perform very well make use of unexpected or counterintuitive features.

The only real way to assess a classifier's performance is to have it predict labels for unseen data and then compare the predicted labels with the real labels. This is why separating training and testing data is so important.

In [0]:
y_pred = classifier.predict(X_features_test)
y_pred

In [0]:
print("Label     Predicted    Result\n-----     ---------    ------")
for i, real_label in enumerate(y_labels_test):
    predicted_label = y_pred[i]
    if real_label == predicted_label:
        result =  "hit"
    else:
        result = "miss"
    print(real_label, "    ", predicted_label, "       ", result)

Here we've ended up with mostly hits, perhaps because of the size of the corpus and the clear differences in the types of texts. We can rerun this with other categories to make sure we get some misses. From there, we can look at how to understand our hits and misses. Since they're related genres, let's pull the `romance` and `fiction` chunks of the corpus.

While hit vs. miss gives an idea of what's going on, it doesn't tell us much about the nature of the misses. Were the `romance` texts being classified as `fiction`, or the other way around? This might not seem very important, but if instead of texts you were testing whether a person has a deadly disease (where the labels are positive, has the disease, or negative, does not have the disease), you might be more interested in classifiers with a lower false negative rate.

Taking this into account, we could count how many `romance` texts have been correctly classified as `romance` (we will call this `TR`, as in true romance), and how many have been incorrectly classified as `fiction` (`FF`, as in false fiction); and the same for `fiction`: how many `fiction` texts have been classified as `fiction` (`TF`, as in true fiction), and how many as `romance` (`FR`, as in false romance). If we put this into a table, we get a confusion matrix. 

![Confusion Matrix c/o Towards Data Science](https://miro.medium.com/max/712/1*Z54JgbS4DUwWSknhDCvNTQ.png)

In [0]:
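from sklearn.metrics import confusion_matrix

# Rows are the real labels and columns the predicted labels, in sorted order.
# The TF/FR/FF/TR legend assumes the cells above were rerun with
# categories = ['fiction', 'romance'].
print(" TF FR")
print(confusion_matrix(y_labels_test, y_pred))
print(" FF TR")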
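
To check how the four cells line up, here is a small sketch with made-up true and predicted labels (not results from the Brown corpus). Passing `labels=` fixes the row and column order explicitly.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels, for illustration only
true_genres = ["fiction", "fiction", "fiction", "romance", "romance", "romance"]
predicted_genres = ["fiction", "fiction", "romance", "fiction", "romance", "romance"]

genre_order = ["fiction", "romance"]
matrix = confusion_matrix(true_genres, predicted_genres, labels=genre_order)

# Rows are true labels, columns are predicted labels
(tf, fr), (ff, tr) = matrix
print("TF:", tf, "FR:", fr)  # TF: 2 FR: 1
print("FF:", ff, "TR:", tr)  # FF: 1 TR: 2
```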

Let's look at a few other measures of a classifier.

**Accuracy** is one of the most important metrics for evaluating classifiers. It's defined as the ratio of correct predictions ("hits") to the total number of predictions. We could code it up in a line of Python, but why not just use `scikit-learn`'s [built-in function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)?

In [0]:
from sklearn.metrics import accuracy_score

accuracy_score(y_labels_test, y_pred)
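
For comparison, the "one line of Python" version might look like the sketch below, shown on made-up labels and predictions so that it runs on its own.

```python
import numpy as np
from sklearn.metrics import accuracy_score

actual = np.array(["news", "lore", "news", "news", "lore"])   # made-up labels
guessed = np.array(["news", "lore", "lore", "news", "lore"])  # made-up predictions

manual_accuracy = np.mean(actual == guessed)  # fraction of hits
print(manual_accuracy, accuracy_score(actual, guessed))  # 0.8 0.8
```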

Two other evaluation metrics that often prove useful are *precision* and *recall*. Like accuracy, these can be calculated (averaged) for the entire model, or considered separately for each category. We will use the latter approach here. In this context, the metrics can be defined as follows:

- **Precision**: out of the test texts the model classified as fiction, what fraction of them were actually fiction?
- **Recall**: out of the total number of fiction texts in the test set, what fraction of them did the model find (i.e., correctly classify as fiction)?
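
Using the `TF`/`FR`/`FF`/`TR` naming from the confusion matrix, both definitions reduce to simple arithmetic. The counts below are hypothetical, chosen only to make the calculation easy to follow.

```python
# Hypothetical confusion-matrix counts, not results from this workshop
tf_count = 8   # fiction texts classified as fiction
fr_count = 2   # fiction texts classified as romance
ff_count = 4   # romance texts classified as fiction
tr_count = 6   # romance texts classified as romance

# Precision: of everything predicted as fiction, how much really was fiction?
fiction_precision = tf_count / (tf_count + ff_count)

# Recall: of all real fiction texts, how many did the model find?
fiction_recall = tf_count / (tf_count + fr_count)

print(round(fiction_precision, 3), fiction_recall)  # 0.667 0.8
```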

### Activity for post-workshop

As an optional activity, use the precision and recall functions provided by `scikit-learn` to evaluate this classifier. 

We're going to move on to think about how we can automate our classifier build and evaluation to test different hyperparameters for making the model more accurate. 

In [0]:
def get_accuracy(X, y, vectorizer):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

    X_train = vectorizer.fit_transform(X_train)
    X_test = vectorizer.transform(X_test)

    clf = MultinomialNB()
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)

    return accuracy_score(y_test, y_pred)

### Activity

`get_accuracy` is a function that takes input data as `X`, its labels as `y`, and a vectorizer. 

We know that both vocabulary size and stopwords can affect a classifier. Write some code that takes advantage of `get_accuracy` to test out different vocabulary sizes (10000, 1000, 500, 300, 100) and keeping or removing stopwords. Which combination performs best?

We've been looking at how vectorizer and model parameters affect the accuracy of our classifier, but the particular split used to create the training and test data can also affect it. One common strategy to keep a single split from skewing our evaluation is to split the dataset randomly several times into different **folds**, calculate whatever measure you want on each, and then average the scores. This process is called **cross-validation**, and it helps us notice when the model has learned the training data too well (**over-fitting**).

In order to cross-validate our model, we need to create a pipeline in `scikit-learn` by combining both the vectorizer and the classifier, then calling the `cross_val_score` function.

In [0]:
from sklearn.model_selection import cross_val_score

for stopwords in [None, 'english']:
    for vocabulary_size in [10000, 1000, 500, 300, 150, 100]:
        print("stopwords:", stopwords)
        print("vocabulary_size:", vocabulary_size)
        vectorizer = CountVectorizer(
            max_features=vocabulary_size, stop_words=stopwords)
        
        pipe = make_pipeline(vectorizer, MultinomialNB())
        accuracies = cross_val_score(pipe, X_texts_train, y_labels_train, cv=10)
        
        print("accuracy:", accuracies.mean())
        print()

Looking at these cross-validated scores, which are more reliable than a single split, it looks like we could get a classifier with up to ~70% accuracy. However, if we want a more intuitive model with better explanatory power, we could go with the model that removes stop words, even though its accuracy is lower.

In [0]:
vectorizer = CountVectorizer(max_features=500, stop_words="english")
X_train = vectorizer.fit_transform(X_texts_train)
X_test = vectorizer.transform(X_texts_test)

clf = MultinomialNB()
clf.fit(X_train, y_labels_train)
y_pred = clf.predict(X_test)

most_informative_features(clf, vectorizer, n=10)
print()

explain("They told their friend that they found love high atop a mountain in the Sierras.", clf, vectorizer)

What if we want to test whether a different classification algorithm might be more effective? Using the pipeline and a simple loop, we can at least start to check this out. You should always be thinking about which algorithms are actually appropriate to your dataset, but when you're prototyping, sometimes you just throw a bunch of algorithms at the problem and see what comes out. 

In [0]:
# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC, LinearSVC

for clf in (RandomForestClassifier(), LinearSVC(), SVC()):
    for stopwords in [None, 'english']:
        for vocabulary_size in [10000, 1000, 500, 300, 150, 100]:
            print("model:", clf)
            print("stopwords:", stopwords)
            print("vocabulary_size:", vocabulary_size)
            vectorizer = CountVectorizer(
                max_features=vocabulary_size, stop_words=stopwords)

            pipe = make_pipeline(vectorizer, clf)
            accuracies = cross_val_score(pipe, X_texts_train, y_labels_train, cv=10)

            print("accuracy:", accuracies.mean())
            print()

## Other libraries of interest:
- SpaCy - NLP - https://spacy.io/
- Textacy - NLP + text analysis - https://github.com/chartbeat-labs/textacy
- Gensim - semantic modeling - https://radimrehurek.com/gensim/
- AllenNLP - https://github.com/allenai/allennlp
