# Supervised learning: the standard pipeline

This notebook introduces **supervised learning** in Python. As described more fully in lecture, "supervised" learning represents a set of algorithms that classify items (e.g., a document, paragraph, or sentence) into predefined labels (or classes) based on human-annotated "training" data.

We will outline a general **pipeline** that you can follow when working with supervised text classification. The pipeline includes:

1. Collecting training data
2. Text preprocessing and normalization
3. Feature extraction
4. Model estimation and selection (and parameter tuning)
5. Final model estimation and reporting performance

This notebook covers items 2-5 in Python and `sklearn`. Before getting started, let's load the necessary libraries:

In [1]:
import os
os.chdir('/Users/tcoan/git_repos/ncrm-spring-school')

import pandas as pd # Use to read data
import numpy as np  # Numpy is used in various places

# Tokenization and stopword removal
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Vectorization
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Load classifiers
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

# Load functions for model selection and performance
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.model_selection import GridSearchCV

## Collecting data

There are a number of different ways to collect training data. In the social sciences, we tend to use the standard operating procedure of quantitative content analysis:

1. Construct **coding guidelines** to code documents into substantively meaningful classes based on relevant theory.
2. Train coders on a small, random sample of documents, **calculate reliability**, and discuss disagreements. **Repeat** this process until reliability is sufficiently high.
3. Code the remaining documents in the training set. Ideally, you would have multiple coders per document; however, it is common to have a single coder after the training phase.

See <a href="https://www.amazon.co.uk/Content-Analysis-Introduction-Its-Methodology/dp/1412983150">Krippendorff's (2013) classic book</a> for a detailed introduction. More recently, researchers have been using crowd-sourcing to annotate documents. For a detailed introduction to this approach, see <a href="http://kenbenoit.net/pdfs/Crowd_sourced_data_coding_APSR.pdf">Benoit et al. (2016)</a>.

### Movie reviews

We will use the well-known "movie review" data from [Pang and Lee's (2004)](http://www.cs.cornell.edu/people/pabo/movie-review-data/) article on using supervised methods for sentiment analysis. This is the same data that we used for lexicon-based sentiment analysis several weeks ago. To refresh, the data has the following fields:

* **id**: A unique ID for each movie review.
* **positive**: An integer equal to 1 if the review is positive and 0 if negative.
* **text**: The text content of each review (already converted to lowercase).

We can load the data in the usual way:

In [2]:
reviews_df = pd.read_csv('data/movie_reviews.csv')

# Convert to a list of dicts!
reviews = reviews_df.to_dict('records')

And we can examine a single "review" as follows:

In [3]:
# Print the last review
print(reviews[-1])

{'id': 'cv999_14636', 'positive': 0, 'text': 'two party guys bob their heads to haddaway\'s dance hit " what is love ? " \nwhile getting themselves into trouble in nightclub after nightclub . \nit\'s barely enough to sustain a three-minute _saturday_night_live_ skit , but _snl_ producer lorne michaels , _clueless_ creator amy heckerling , and paramount pictures saw something in the late night television institution\'s recurring " roxbury guys " sketch that would presumably make a good feature . \nemphasis on the word " presumably . " \n_a_night_at_the_roxbury_ takes an already-thin concept and tediously stretches it far beyond the breaking point--and that of viewers\' patience levels . \nthe first five minutes or so of _roxbury_ play very much like one of the original " roxbury guys " skits . \nwith " what is love ? " \nblaring on the soundtrack , the brotherly duo of doug and steve butabi ( chris kattan and will ferrell ) bob their heads , scope out " hotties " at clubs , and then bum

# Text preprocessing and normalization

As with dictionary and unsupervised approaches, the first step is **preprocessing**. We will do the following preprocessing:

1. **Tokenize** into unigrams (i.e., words)
2. **Remove stopwords and punctuation**

Let's start with **tokenization**:

In [4]:
# Tokenize entire corpus of reviews
tokens = [word_tokenize(row['text']) for row in reviews]

In [5]:
print(tokens[0])

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.', 'they', 'get', 'into', 'an', 'accident', '.', 'one', 'of', 'the', 'guys', 'dies', ',', 'but', 'his', 'girlfriend', 'continues', 'to', 'see', 'him', 'in', 'her', 'life', ',', 'and', 'has', 'nightmares', '.', 'what', "'s", 'the', 'deal', '?', 'watch', 'the', 'movie', 'and', '``', 'sorta', '``', 'find', 'out', '.', '.', '.', 'critique', ':', 'a', 'mind-fuck', 'movie', 'for', 'the', 'teen', 'generation', 'that', 'touches', 'on', 'a', 'very', 'cool', 'idea', ',', 'but', 'presents', 'it', 'in', 'a', 'very', 'bad', 'package', '.', 'which', 'is', 'what', 'makes', 'this', 'review', 'an', 'even', 'harder', 'one', 'to', 'write', ',', 'since', 'i', 'generally', 'applaud', 'films', 'which', 'attempt', 'to', 'break', 'the', 'mold', ',', 'mess', 'with', 'your', 'head', 'and', 'such', '(', 'lost', 'highway', '&', 'memento', ')', ',', 'but', 'there', 'are', 'good', 'and', 'bad', 'ways'

Next, we remove common **stopwords** and **punctuation**:

In [6]:
# Load common English stopwords
stops = stopwords.words('english')
extended_stops = stops + ['``', "'s", "'ve"]

# Define a function that removes stopwords AND punctuation
def remove_stops(text, stops):
    # Build the set once for fast lookups; dropping tokens of length 1
    # removes single punctuation marks (and single-character words)
    stop_set = set(stops)
    return [token for token in text if (token.lower() not in stop_set) and (len(token) > 1)]

tokens = [remove_stops(doc, extended_stops) for doc in tokens]

In [7]:
print(tokens[0])

['plot', 'two', 'teen', 'couples', 'go', 'church', 'party', 'drink', 'drive', 'get', 'accident', 'one', 'guys', 'dies', 'girlfriend', 'continues', 'see', 'life', 'nightmares', 'deal', 'watch', 'movie', 'sorta', 'find', 'critique', 'mind-fuck', 'movie', 'teen', 'generation', 'touches', 'cool', 'idea', 'presents', 'bad', 'package', 'makes', 'review', 'even', 'harder', 'one', 'write', 'since', 'generally', 'applaud', 'films', 'attempt', 'break', 'mold', 'mess', 'head', 'lost', 'highway', 'memento', 'good', 'bad', 'ways', 'making', 'types', 'films', 'folks', "n't", 'snag', 'one', 'correctly', 'seem', 'taken', 'pretty', 'neat', 'concept', 'executed', 'terribly', 'problems', 'movie', 'well', 'main', 'problem', 'simply', 'jumbled', 'starts', 'normal', 'downshifts', 'fantasy', 'world', 'audience', 'member', 'idea', 'going', 'dreams', 'characters', 'coming', 'back', 'dead', 'others', 'look', 'like', 'dead', 'strange', 'apparitions', 'disappearances', 'looooot', 'chase', 'scenes', 'tons', 'weird

In [8]:
# Prepare data for sklearn. First, get the text
texts = [' '.join(doc) for doc in tokens]

# And then get the "class" variable
positive = [row['positive'] for row in reviews]

In [10]:
positive[0]

0

While all of our 2,000 movie reviews are labeled, in real-world analysis, you will have (labeled) training data, as well as unlabeled data. We will simulate this more realistic scenario by "pretending" that the last 500 reviews are unlabeled:

In [11]:
# Split texts
texts_labeled = texts[0:1500]
texts_unlabeled = texts[1500:]

# Split class
positive_labeled = positive[0:1500]

In [12]:
positive_labeled[0]

0

# Feature extraction

**Feature extraction** (also called **vectorization**) is a process whereby we extract meaningful features or attributes from raw textual data that will be used in a classification algorithm. We have already extracted **bag-of-words** features using the `CountVectorizer`:

In [13]:
# Initialize vectorizer
vectorizer = CountVectorizer()

# Transform the text data using the chosen vectorizer
X = vectorizer.fit_transform(texts_labeled)

What does `CountVectorizer()` actually do? It generates what's called a **document-term matrix** in which our documents are on the rows, our **vocabulary** (all of the unique terms in our corpus) is in the columns, and the cells represent the number of times (the count) a term shows up in each document.
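To make the idea concrete, here is a minimal sketch that builds a tiny document-term matrix by hand. The three toy documents and the names `docs`, `vocab`, and `dtm` are made up for illustration, and the simple whitespace split is an assumption: `CountVectorizer` applies its own tokenization rules, so treat this as a picture of the output rather than of how `sklearn` works internally.

```python
from collections import Counter

# Toy corpus (hypothetical, not the movie review data)
docs = ["good movie good plot", "bad movie", "good acting bad plot"]

# Vocabulary: all unique terms in the corpus, sorted alphabetically
vocab = sorted({term for doc in docs for term in doc.split()})

# One row per document, one column per vocabulary term;
# each cell holds the number of times the term appears in the document
dtm = []
for doc in docs:
    counts = Counter(doc.split())
    dtm.append([counts[term] for term in vocab])

print(vocab)
for row in dtm:
    print(row)
```

Each row has one entry per vocabulary term, so with a realistic vocabulary most entries in any row are zero.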

In [16]:
print(X.shape)

(1500, 35267)


So we have 1,500 documents (which we already knew) and 35,267 unique words (or tokens) in our corpus. We can look at these 35k tokens using:

In [17]:
vectorizer.get_feature_names_out()

array(['00', '000', '0009f', ..., 'zwigoff', 'zycie', 'zzzzzzz'],
      dtype=object)

We can look up the number of times the token '0009f' is used in the first movie review as follows:

In [19]:
print(X[0,2])

0


It doesn't show up! This isn't really surprising: only a handful of the 35k words will show up in any one movie review. Most of the cells in the **document-term matrix** will be zeros. This is what is called a **sparse matrix**.

### TF-IDF

Another commonly used vectorization method is applying **term frequency-inverse document frequency** weights.

The TF-IDF weights are defined as follows:

\begin{equation}
tfidf = tf \times idf
\end{equation}

Where $tf$ represents the relative term frequency of a word in a document and the $idf$ is defined as:

\begin{equation}
idf(t) = 1 + \log\frac{N}{1+df(t)}
\end{equation}

Where $N$ represents the total number of documents in the corpus and $df(t)$ represents the number of documents in which term $t$ is present. Often the $tfidf$ weights are normalized to range between 0 and 1 by dividing each document's weight vector by its L2 norm:

\begin{equation}
tfidf_{norm} = \frac{tfidf}{||tfidf||}
\end{equation}
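As a worked example, here is a minimal sketch that applies the three equations above by hand to a toy corpus. The documents and the helper names `idf` and `tfidf_normalized` are made up for illustration, and the numbers will not match `TfidfVectorizer` exactly, since `sklearn` uses its own variant of the formula.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already tokenized
docs = [["good", "movie", "good", "plot"],
        ["bad", "movie"],
        ["good", "acting", "bad", "plot"]]
N = len(docs)

# df(t): number of documents in which term t is present
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # idf(t) = 1 + log(N / (1 + df(t)))
    return 1 + math.log(N / (1 + df[term]))

def tfidf_normalized(doc):
    counts = Counter(doc)
    # Relative term frequency times idf
    weights = {t: (c / len(doc)) * idf(t) for t, c in counts.items()}
    # Divide by the L2 norm of the document's weight vector
    norm = math.sqrt(sum(w ** 2 for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

weights_doc0 = tfidf_normalized(docs[0])
print(weights_doc0)
```

After normalization, the squared weights of each document sum to 1, and "good" (which appears twice in the first document) gets the largest weight.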

TF-IDF vectorization in `sklearn` is easy. We simply load the `TfidfVectorizer` and perform the same process as above:

In [20]:
# Initialize the vectorizer
vectorizer = TfidfVectorizer()

# Generate TF-IDF weights
X_tfidf = vectorizer.fit_transform(texts_labeled)

We can compare our counts and TF-IDF weights:

In [21]:
print('Counts =')
print(X[0,np.nonzero(X_tfidf[0,:])[1]].todense())

print('TF-IDF weights =')
print(X_tfidf[0,np.nonzero(X_tfidf[0,:])[1]].todense())

Counts =
[[ 1  1  1  2  1  1 10  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
   1  1  1  1  1  1  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
   1  1  1  1  1  1  2  1  1  1  1  1  1  1  1  1  1  1  1  2  1  1  2  1
   2  1  1  1  1  1  1  1  1  1  1  1  1  2  2  2  1  2  1  2  1  3  1  1
   2  1  1  1  2  1  2  1  1  1  2  5  2  1  1  1  1  1  2  2  1  1  1  2
   1  1  1  1  1  6  1  1  1  1  1  2  1  1  2  1  1  1  1  1  3  2  2  2
   1  2  1  1  2  1  2  2  1  1  1  1  1  2  2  1  1  1  1  1  1  1  5  1
   1  1  1  1  1  1  1  2  2  2  2  1  1  1  1  1  2  1  1  2  1  1  3  1
   1  1  2  1  2  2  1  1  2  3  1  1  1  6  1  1  1  1  2  1  1  1  1  3
   1  3  1  1  1  1  2  1  4  2  1]]
TF-IDF weights =
[[0.06591754 0.06871883 0.06871883 0.12599666 0.05595759 0.05668382
  0.39422847 0.04510487 0.06871883 0.05315631 0.06177449 0.05872692
  0.04319636 0.02592255 0.06677836 0.04790615 0.03994304 0.02

The `TfidfVectorizer` smooths the IDF (i.e., adds 1 to the document frequencies to avoid division by zero) and normalizes using the **L2 norm** by default.
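For reference, the smoothed IDF that `sklearn` documents for the default setting (`smooth_idf=True`) differs slightly from the textbook definition given earlier: it uses the natural log and adds 1 to both the numerator and the denominator, as if one extra document contained every term. Also, by default $tf$ is the raw count rather than the relative frequency (the L2 normalization then rescales each document):

\begin{equation}
idf(t) = \ln\frac{1 + N}{1 + df(t)} + 1
\end{equation}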

## Other feature extraction decisions

There are a number of other items to consider when vectorizing. One thing that can really change classification performance is switching from **unigrams** to **ngrams**. This is easy to implement in `sklearn`:

In [22]:
# Use unigrams and bigrams
vectorizer_bigram = TfidfVectorizer(ngram_range=(1,2))

# Generate TF-IDF weights
X_bigram = vectorizer_bigram.fit_transform(texts_labeled)

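I also sometimes restrict the maximum and minimum number of documents in which a token may appear (here, dropping tokens found in more than 95% of documents or in fewer than 10 documents):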

In [23]:
# Use unigrams and bigrams
vectorizer_bigram = TfidfVectorizer(max_df = .95, min_df = 10, ngram_range=(1,2))

# Generate TF-IDF weights
X_bigram = vectorizer_bigram.fit_transform(texts_labeled)

# Fitting a naive Bayes model in `sklearn`

Fitting our model in `sklearn` is really, really easy. Let's get a feel for how this works using a simple **naive Bayes** model:

In [24]:
# Initialize vectorizer
vectorizer = CountVectorizer()

# Transform the text data using the chosen vectorizer
X = vectorizer.fit_transform(texts_labeled)

# Prepare our "class" variable
y = np.array(positive_labeled)

# Initialize the classifier
clf = MultinomialNB()

# Fit the model
clf_fit = clf.fit(X, y)

We can generate predictions for reviews (i.e., our `y` variable) using the `predict` method:

In [25]:
y_predict = clf_fit.predict(X)

In [26]:
print(y_predict[0:10])

[0 1 1 0 1 0 1 0 1 0]


And we can quickly look at how accurate our model is at predicting the movie reviews based on our text (`X`):

In [27]:
clf_fit.score(X, y)

0.9806666666666667

Wowzers, 98% accurate! We are done, right? <font color='red'>WRONG</font>: this is **in-sample** fit, and what we really care about is **out-of-sample** performance.

# Model performance

How well does our model perform? Let's get a sense of how one would assess **out-of-sample** performance using `sklearn`. First, we need to split our data into a training and testing set:

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1234)

print("Training set size = %s" % y_train.shape)
print("Test set size = %s" % y_test.shape)

In [None]:
X_train.shape

Next, we train our model using the training set and predict the classes for our test set:

In [None]:
# Initialize the classifier
clf = MultinomialNB()

# Fit the model
clf_fit = clf.fit(X_train, y_train)

# Generate predictions
y_predict = clf_fit.predict(X_test)

### Performance metrics

With our prediction in hand, we can now calculate a range of **performance metrics**. All of these metrics start with what's referred to as a **confusion matrix**: 

<img src="https://miro.medium.com/v2/resize:fit:712/format:webp/1*Z54JgbS4DUwWSknhDCvNTQ.png">

From this matrix, we can define the following:

\begin{equation}
    Accuracy = \frac{TP + TN}{TP + FP + FN + TN}
  \end{equation}
  
  \begin{equation}
    Precision = \frac{TP}{TP + FP}
  \end{equation}
  
  \begin{equation}
    Recall = \frac{TP}{TP + FN}
  \end{equation}
  
  \begin{equation}
    Specificity = \frac{TN}{TN + FP}
  \end{equation}
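A quick worked example of these four definitions, using made-up confusion-matrix counts (they are not results from the movie review model):

```python
# Hypothetical confusion-matrix counts
TP, FP, FN, TN = 40, 10, 20, 30

accuracy = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
specificity = TN / (TN + FP)

print(f"Accuracy    = {accuracy:.3f}")
print(f"Precision   = {precision:.3f}")
print(f"Recall      = {recall:.3f}")
print(f"Specificity = {specificity:.3f}")
```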

The most commonly employed metric is **accuracy**. However, we need to be very careful when using accuracy alone. For example, consider Dallas Raines, the most accurate meteorologist in history:

<img src="http://farm6.static.flickr.com/5260/5516412091_06fea7fdb8.jpg" style="width=400;height=300">
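Here is a minimal sketch of why accuracy alone can mislead, using made-up numbers: suppose it rains on only 5 of 100 days and a forecaster always predicts "no rain".

```python
# Hypothetical data: 1 = rain, 0 = no rain; it rains on 5 of 100 days
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100  # always predict "no rain"

TP = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
TN = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
FN = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (TP + TN) / len(y_true)
recall = TP / (TP + FN)

print(f"Accuracy = {accuracy:.2f}")
print(f"Recall   = {recall:.2f}")
```

The forecaster is right 95% of the time without ever catching a single rainy day.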

If accuracy falls apart for "imbalanced classes," then what are our alternatives? The most common alternative (and what I tend to use in my own work) is some combination of **precision** and **recall**:

<img src="https://upload.wikimedia.org/wikipedia/commons/2/26/Precisionrecall.svg" style="width=300;height=400">

These measures are formally combined in the **F1-score**:

\begin{equation}
  F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}
\end{equation}
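The formula is the harmonic mean of precision and recall, which punishes a lopsided pair much more than a simple average would. A small sketch with made-up values (the helper name `f1` is for illustration only):

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall (defined here as 0 if both are 0)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Both pairs have the same arithmetic mean (0.6)
balanced = f1(0.6, 0.6)
lopsided = f1(1.0, 0.2)

print(f"Balanced pair: F1 = {balanced:.3f}")
print(f"Lopsided pair: F1 = {lopsided:.3f}")
```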

The F1-score takes into account the tradeoff between precision and recall. All of these metrics (and others as well) are easy to calculate in `sklearn`:

In [None]:
print(f'Accuracy = {clf_fit.score(X_test, y_test)}')
print(f'Precision = {precision_score(y_test, y_predict)}')
print(f'Recall = {recall_score(y_test, y_predict)}')
print(f'F1 score = {f1_score(y_test, y_predict)}')

## Cross-validation

In the last section, we randomly split our data into 70% training and 30% testing. The realized sample, however, is just one of many samples that we could have pulled. Here, we will look at **k-fold cross-validation**. What do we mean by cross-validation? Take a look at the following:

<img src="https://upload.wikimedia.org/wikipedia/commons/1/1c/K-fold_cross_validation_EN.jpg">
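The same idea in a few lines of plain Python: shuffle the row indices, cut them into k folds, and let each fold take one turn as the test set. This is only a sketch of the logic (the helper `k_fold_indices` is made up, and `sklearn` forms its folds differently), but the key property is the same: every observation is used for testing exactly once.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Split the shuffled indices into k folds of near-equal size
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        # Training set = every fold except the current test fold
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n=10, k=5))
for fold, (train_idx, test_idx) in enumerate(splits):
    print(fold, sorted(test_idx))
```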

Once again, cross-validation is easy in `sklearn`. Here is the relevant code for 10-fold CV using our movies data the **long way**:

In [None]:
# Setup classifier
clf = MultinomialNB()

# Get the k folds
kf = KFold(n_splits=10, shuffle = True, random_state=50)

# Loop over folds and calculate performance measure
results = []
for k, (train_idx, test_idx) in enumerate(kf.split(X)):
    # Fit model
    cfit = clf.fit(X[train_idx], y[train_idx])
    
    # Get predictions
    y_pred = cfit.predict(X[test_idx])
    
    # Write results
    result = {'fold': k,
              'precision': precision_score(y[test_idx], y_pred),
              'recall': recall_score(y[test_idx], y_pred),
              'f1': f1_score(y[test_idx], y_pred)}
              
    results.append(result)

In [None]:
print(results)

And we can then pull out any information we want from our `results` list:

In [None]:
# Get the average F1 score
mean_f1 = np.mean(np.array([row['f1'] for row in results]))

print("Average, cross-validated F1 score = %s" % mean_f1)

While this code clearly outlines the steps taken for cross-validation, it is a bit cumbersome. Luckily, `sklearn` has a utility function to streamline this process:

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
print(cross_val_score(clf, X, y, cv=10, scoring="precision"))

We can average over the folds using `np.mean()`:

In [None]:
np.mean(cross_val_score(clf, X, y, cv=10, scoring="precision"))
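Note that in the cells above, `X` was vectorized once on all labeled reviews, so the vocabulary is built from the held-out folds too. One way to keep feature extraction inside each training fold is to wrap the vectorizer and the classifier in a pipeline. A minimal sketch, where the toy corpus and the names `toy_texts`, `toy_y`, and `pipe` are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (1 = positive, 0 = negative)
toy_texts = ["great fun film", "great acting", "fun great plot", "loved great film",
             "boring bad film", "bad acting", "awful boring plot", "hated bad film"]
toy_y = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer and classifier are refit together on each training fold
pipe = make_pipeline(CountVectorizer(), MultinomialNB())

# Stratified folds keep the class proportions similar across folds
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=50)
scores = cross_val_score(pipe, toy_texts, toy_y, cv=cv, scoring="f1")

print(scores.mean())
```

With a corpus this small the scores themselves mean little; the point is only the structure of the call.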

# Model selection

We've seen how to estimate a Naive Bayes classifier. However, this is only one of many models available in `sklearn` (for a list of available classifiers, see http://scikit-learn.org/stable/supervised_learning.html). When selecting a model, the typical process is to try a bunch of different models and select the "best" based on out-of-sample performance.

Let's take a look at one model that has been shown to be extremely accurate for text classification problems: the **support vector machine**.

### Support vector machines

There are two versions of **support vector classifiers** available in `sklearn`, the `LinearSVC()` (which is restricted to the linear case, but is super fast!) and the `SVC()` implementation (which is highly flexible, but a bit slow). For an extremely clear introduction to what these models are actually doing, see <a href="https://www.youtube.com/watch?v=N1vOgolbjSc">this video</a>.

We can fit these models using the exact same procedure as specified above for Naive Bayes:

In [None]:
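# Use the TF-IDF weights instead of counts. Re-create the TF-IDF vectorizer
# under the name `vectorizer` so that later calls to `vectorizer.transform()`
# produce features that match the ones this model is trained on.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts_labeled)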

# Split data into training and testing sets. Same splits as Naive Bayes.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1234)

# Initialize the classifier
clf = LinearSVC()

# Fit the model using the training data
# generated above.
clf_fit = clf.fit(X_train, y_train)

# Generate predictions
y_predict = clf_fit.predict(X_test)

# Output performance metrics
print('Precision = %s' % precision_score(y_test, y_predict))
print('Recall = %s' % recall_score(y_test, y_predict))
print('F1 score = %s' % f1_score(y_test, y_predict))
print("Accuracy score = %s" % accuracy_score(y_test, y_predict))

We fit the `SVC()` version of the model in the exact same way. However, `SVC` is more flexible and can handle nonlinear decision boundaries. Let's use the popular [radial basis function kernel](https://en.wikipedia.org/wiki/Radial_basis_function_kernel) to fit a nonlinear classifier:

In [None]:
# Initialize the classifier using the rbf kernel
clf = SVC(kernel = 'rbf')

# Fit the model using the training data
# generated above.
clf_fit = clf.fit(X_train, y_train)

# Generate predictions
y_predict = clf_fit.predict(X_test)

# Output performance metrics
print('Precision = %s' % precision_score(y_test, y_predict))
print('Recall = %s' % recall_score(y_test, y_predict))
print('F1 score = %s' % f1_score(y_test, y_predict))

Ouch -- not as good of a model! We can look more at where the model is getting confused by viewing the confusion matrix (note: truth is on the rows and prediction is on the columns):

In [None]:
from sklearn.metrics import confusion_matrix

# Print the confusion matrix
print(confusion_matrix(y_test, y_predict))

So complexity does not always help for **out-of-sample** classification. We can also use `SVC()` with a linear kernel to get very similar results as `LinearSVC()`:

In [None]:
# Initialize the classifier
clf = SVC(kernel = 'linear')

# Fit the model using the training data
# generated above.
clf_fit = clf.fit(X_train, y_train)

# Generate predictions
y_predict = clf_fit.predict(X_test)

# Output performance metrics
print('Precision = %s' % precision_score(y_test, y_predict))
print('Recall = %s' % recall_score(y_test, y_predict))
print('F1 score = %s' % f1_score(y_test, y_predict))

### Logistic regression classifier

My desert island classifier is logistic regression. We use the same syntax to estimate a logistic regression in `sklearn`:

In [None]:
# Load the model
from sklearn.linear_model import LogisticRegression

In [None]:
# Initialize the classifier
clf = LogisticRegression()

# Fit the model using the training data
# generated above.
clf_fit = clf.fit(X_train, y_train)

# Generate predictions
y_predict = clf_fit.predict(X_test)

# Output performance metrics
print('Precision = %s' % precision_score(y_test, y_predict))
print('Recall = %s' % recall_score(y_test, y_predict))
print('F1 score = %s' % f1_score(y_test, y_predict))

## Parameter tuning

As we discussed in class, one of the most important parameters in an SVC is the regularization parameter `C`. Let's see how one would "tune" this parameter in `sklearn` using the **grid search** method. Here, we define a (coarse) grid of parameter values, estimate a classifier, cross-validate to get out-of-sample performance, and save the results. `sklearn` has a `GridSearchCV` class that makes this very easy.

First, we set up the parameter grid:

In [None]:
"""
param_grid = [
  {'C': [.5, 1, 1.5], 'loss': ['hinge']},
  {'C': [.5, 1, 1.5], 'loss': ['squared_hinge']},
 ]
"""

param_grid = [
  {'C': [.95, .98, 1, 1.02, 1.03]}
 ]

In [None]:
# Initialize model
lsvc = LinearSVC()

# Initialize grid search
grid = GridSearchCV(lsvc, param_grid=param_grid, cv=5, scoring = 'f1')

# Run grid search
grid.fit(X, y)

After running the grid search algorithm, we can print out the optimal results:

In [None]:
print("The best model based on CV =")
print(grid.best_estimator_)

print("\nThe score of the best model = ")
print(grid.best_score_)

For more information on parameter tuning in `sklearn` (including a number of great worked examples), see http://scikit-learn.org/stable/modules/grid_search.html.

In [None]:
grid.cv_results_

## Estimating a final model

We will often spend quite a bit of time in the model selection phase, trying different models, tuning parameters, and comparing out-of-sample performance. However, once we've decided on a model, then we need to use the model to classify unannotated data.

This is quite easy in `sklearn` and not much changes in terms of implementation. First, we re-estimate our model based on **all available data** and then project out to the remaining sample.

In [None]:
# Re-estimate the best fitting model on all
# of the data
clf = LinearSVC(C = 1.03)

# Fit the model
clf_fit = clf.fit(X, y)

Next, we need to prepare our unannotated data using the same vectorization process as used for the model:

In [None]:
X_unlabeled = vectorizer.transform(texts_unlabeled)

And then we predict the new classes as usual:

In [None]:
y_predict = clf_fit.predict(X_unlabeled)
print(y_predict[0:10])

While we "pretended" that the remaining 500 reviews were unlabeled, we can actually check the performance of our classifier on them:

In [None]:
# Pull the class values for the final 500
positive_unlabeled = positive[1500:]

# Save as a numpy array
y_unlabeled = np.array(positive_unlabeled)

# Compare via the F1 score
print('F1 score = %s' % f1_score(y_unlabeled, y_predict))

# Multiclass classification

Up until now, we have assumed that we have two classes: positive and negative reviews. What if we have more classes? It turns out that extending our binary classifiers to handle multiclass data is quite easy in `sklearn`. We will, however, need a new dataset. Let's use some data from my own research on classifying skeptical claims related to climate change:

In [None]:
import json

# Read "CARDS" JSON file
with open('cards.json', 'r') as jfile:
    cards = json.load(jfile)

print('Loaded %s annotated paragraphs.' % len(cards))

The `cards` data object is a list of dictionaries with the following fields:

* **id**: A unique ID for each paragraph.
* **claim**: The skeptical "claim" associated with the paragraph
* **processed_text**: The pre-processed text content of each paragraph (lowercase and simple stopword removal).

The **claim** variable consists of the following "classes":

* 0 = No claim
* 1 = It's not happening
* 2 = It's not us
* 3 = It won't be bad and could be beneficial
* 4 = Climate solutions won't work
* 5 = Climate science is unreliable

Here's an example of what the data looks like:

In [None]:
print(cards[0])

Next, we get our data ready for `sklearn`:

In [None]:
# Parse text and classes
texts = [row['processed_text'] for row in cards]
claims = [row['claim'] for row in cards]

# Vectorize. Let's use TF-IDF weighting:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# And get the label (or class) data in a format sklearn likes:
y = np.array(claims)

Split into testing and training sets for evaluation:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1234)

## One-vs-rest classification

There are two primary ways to do multi-class (and multi-label) classification: **one-vs-rest** and **one-vs-one** classification. Let's start with the former. In a nutshell, one-vs-rest classification fits a **binary classifier** to each class (for us, class = 0, class = 1 ... class = 5), treating the class of interest as a 1 and everything else as a zero. The prediction is carried out by simply choosing the class with the highest predicted score (a probability or, for models such as `LinearSVC`, a decision-function value).
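The prediction step can be sketched in a few lines. The scores below are made up (they stand in for the output of six fitted "class k vs. rest" classifiers on three documents), and the names `scores` and `predictions` are for illustration only:

```python
# Hypothetical scores from six binary "class k vs. rest" classifiers
# (rows = documents, columns = classes 0 through 5)
scores = [
    [ 1.2, -0.5, -1.0, -0.8, -0.3, -0.9],
    [-0.4, -0.2,  0.7, -1.1, -0.6,  0.1],
    [-0.9, -0.7, -0.8, -0.2, -0.1, -0.3],
]

# Predicted class = position of the largest score in each row
predictions = [row.index(max(row)) for row in scores]

print(predictions)
```

Notice the third document: every binary classifier leans toward "rest" (all scores are negative), but a class is still chosen because we only compare the scores with each other.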

We can carry out this process for **any classifier** in `sklearn` by using the `OneVsRestClassifier()` class:

In [None]:
from sklearn.multiclass import OneVsRestClassifier

# Initialize 1 v rest classifier using a
# linear SVC
clf=OneVsRestClassifier(LinearSVC())

# Fit the training data
clf_fit = clf.fit(X_train, y_train)

# Predict the new classes
y_predict_1vrest = clf_fit.predict(X_test)

print('Here are the first 10 predictions:')
print(y_predict_1vrest[0:10])

## One-vs-one classification

The second option is to train a **one-vs-one** classifier. Here, we train a binary classifier for each "pair" of classes (e.g., class 0 vs. class 5) and choose the "predicted" class by majority voting (i.e., the class that is predicted the most in all of the pairwise comparisons).

In `sklearn`, we use the `OneVsOneClassifier` class to do this sort of classification:
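A minimal sketch of the voting logic for a single document. The function `pairwise_vote` is a made-up stand-in for a trained pairwise classifier (it simply votes for whichever class label is closer to an arbitrary number), so only the bookkeeping is meant to be taken seriously:

```python
from collections import Counter
from itertools import combinations

classes = [0, 1, 2, 3, 4, 5]

# One binary classifier per pair of classes: K * (K - 1) / 2 in total
pairs = list(combinations(classes, 2))
print(len(pairs))

# Hypothetical stand-in for the classifier trained on classes a and b
def pairwise_vote(a, b, score=2.2):
    return a if abs(a - score) <= abs(b - score) else b

# Collect one vote per pair and predict the class with the most votes
votes = Counter(pairwise_vote(a, b) for a, b in pairs)
predicted = votes.most_common(1)[0][0]

print(votes)
print(predicted)
```

With six classes we already need 15 pairwise classifiers, which is why one-vs-one gets expensive as the number of classes grows.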

In [None]:
from sklearn.multiclass import OneVsOneClassifier

# Initialize 1 v 1 classifier using a
# linear SVC
clf=OneVsOneClassifier(LinearSVC())

# Fit the training data
clf_fit = clf.fit(X_train, y_train)

# Predict the new classes
y_predict_1v1 = clf_fit.predict(X_test)

print('Here are the first 10 predictions:')
print(y_predict_1v1[0:10])

## Model performance for multiclass problems

We check performance in a very similar way as well. Let's start by looking at the **F1 score** for each class:

In [None]:
print("OneVRest: F1 score for each class = ")
print(f1_score(y_test, y_predict_1vrest, average = None))

print("\nOneVOne: F1 score for each class = ")
print(f1_score(y_test, y_predict_1v1, average = None))

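The one major difference between binary and multi-class performance evaluation centers on the `average` parameter -- i.e., to get overall model performance we need to average in some way. How should we take this average? Two common options are **macro-averaging**, which computes the metric for each class and takes the unweighted mean (so every class counts equally), and **micro-averaging**, which pools the true positives, false positives, and false negatives across classes before computing the metric (so frequent classes dominate). In `sklearn`: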

In [None]:
print("OneVRest: macro-averaged F1 score = ")
print(f1_score(y_test, y_predict_1vrest, average = 'macro'))

print("\nOneVRest: micro-averaged F1 score = ")
print(f1_score(y_test, y_predict_1vrest, average = 'micro'))

Wow, that is quite a bit of difference between the two averages! What's going on here? Answer: **class imbalance**
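To see how the two averages can diverge, here is a small sketch with made-up per-class counts: one large class that the model handles well and two small classes that it handles poorly. The numbers and the helper `f1_from_counts` are hypothetical, not results from the CARDS data.

```python
def f1_from_counts(tp, fp, fn):
    # F1 written directly in terms of counts: 2TP / (2TP + FP + FN)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical per-class counts as (TP, FP, FN)
counts = {0: (90, 10, 5), 1: (2, 3, 6), 2: (1, 2, 4)}

# Macro: average the per-class F1 scores, so every class counts equally
per_class = {k: f1_from_counts(*c) for k, c in counts.items()}
macro_f1 = sum(per_class.values()) / len(per_class)

# Micro: pool the counts across classes, then compute a single F1
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1_from_counts(tp, fp, fn)

print(f"Per-class F1 = {per_class}")
print(f"Macro F1 = {macro_f1:.3f}")
print(f"Micro F1 = {micro_f1:.3f}")
```

The micro average is dominated by the large class, while the macro average is pulled down by the two small classes.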

## Class imbalance

Why is the performance so bad for the above model? Why are the macro and micro averages so different? One reason is class imbalance. Let's look at the frequency of the various classes in our dataset:

In [None]:
def get_freq(obj):
    unique, counts = np.unique(obj, return_counts=True)
    print(np.asarray((unique, counts)).T)
    return unique, counts

print(get_freq(y))

The "no claim" (0) class is far more frequent than the other classes. This is an example of imbalanced classes, which is a major headache for using supervised learning algorithms in practice. What can we do?

One potential solution that could help is **weighting**. This solution is easy to implement in sklearn:

In [None]:
# Force a more balanced class weighting
clf=OneVsRestClassifier(LinearSVC(class_weight='balanced'))

# Fit the training data
clf_fit = clf.fit(X_train, y_train)

# Predict the new classes
y_predict_1vrest = clf_fit.predict(X_test)

print("Per-class F1 score = ")
print(f1_score(y_test, y_predict_1vrest, average = None))

print("\nOneVRest: macro-averaged F1 score = ")
print(f1_score(y_test, y_predict_1vrest, average = 'macro'))

print("\nOneVRest: micro-averaged F1 score = ")
print(f1_score(y_test, y_predict_1vrest, average = 'micro'))
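What does `class_weight='balanced'` do? The `sklearn` documentation describes the weights as inversely proportional to class frequencies, computed as `n_samples / (n_classes * np.bincount(y))`. Here is a sketch of that formula on a made-up label vector (`y_toy` and `weights` are illustrative names). Inside a one-vs-rest wrapper, each binary "class vs. rest" problem should get its own pair of weights, since the inner classifier only ever sees two classes.

```python
from collections import Counter

# Hypothetical, imbalanced label vector
y_toy = [0] * 80 + [1] * 15 + [2] * 5

counts = Counter(y_toy)
n_samples, n_classes = len(y_toy), len(counts)

# "balanced" heuristic: n_samples / (n_classes * number of cases in class k)
weights = {k: n_samples / (n_classes * c) for k, c in counts.items()}

for k in sorted(weights):
    print(f"class {k}: weight = {weights[k]:.3f}")
```

Rare classes get large weights, so mistakes on them cost more when the model is fit.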

There are many other ways (aside from weighting) that you can approach the class imbalance problem. While we do not have time to look at all the available solutions, many options are available in the imbalanced-learn library in Python (see http://contrib.scikit-learn.org/imbalanced-learn/stable/).