# Text Classification

This lab explores a new dataset for text classification tasks using naïve Bayes and logistic regression.

### Outcomes

- Train and test naïve Bayes and logistic regression classifiers using an established library.
- Apply evaluation metrics to the classifiers and display examples of misclassifications.
- Examine learned model parameters to explain how each classifier makes a decision.

### Overview

The first part of the notebook loads a new Twitter dataset, which is described in [this paper](https://arxiv.org/pdf/2010.12421.pdf), then extracts feature vectors from each sample.
The next part involves implementing and evaluating the classifiers using scikit-learn.


# 1. Preparing the Data


In [1]:
import os
import sys

path = os.path.abspath(os.path.join(".."))

if path not in sys.path:
    sys.path.append(path)

In [2]:
from Modules.datasets import TweetEvalDataset

train = TweetEvalDataset("sentiment", "train")
test = TweetEvalDataset("sentiment", "test")

train_texts: list[str] = []
train_labels: list[int] = []

for item in train.iter():
    train_texts.append(item["text"])
    train_labels.append(item["label"])

test_texts: list[str] = []
test_labels: list[int] = []

for item in test.iter():
    test_texts.append(item["text"])
    test_labels.append(item["label"])

Found cached dataset tweet_eval (/Users/qr23940/Documents/git/dialogue_and_narrative/notebooks/data_cache/tweet_eval/sentiment/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)
Found cached dataset tweet_eval (/Users/qr23940/Documents/git/dialogue_and_narrative/notebooks/data_cache/tweet_eval/sentiment/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)


The next step is to convert the tokenised text of each tweet to a feature vector that we can use as input to a classifier. The feature vector needs to be a numerical vector of a fixed size. For the bag-of-words representation, the feature vector for a tweet records how many times each word in the vocabulary occurs in that tweet.

For this, we can use the CountVectorizer class: [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
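
As a rough illustration of the bag-of-words idea, the sketch below builds a vocabulary from training texts only and then counts tokens against it. It is a simplified stand-in, not CountVectorizer itself: the helper names `build_vocabulary` and `vectorise`, the whitespace tokenisation, and the toy sentences are all assumptions made for this example.

```python
from collections import Counter


def build_vocabulary(texts: list[str]) -> dict[str, int]:
    """Assign a column index to every distinct token in the given texts."""
    tokens = sorted({token for text in texts for token in text.lower().split()})
    return {token: index for index, token in enumerate(tokens)}


def vectorise(text: str, vocabulary: dict[str, int]) -> list[int]:
    """Count occurrences of each vocabulary token; ignore unknown tokens."""
    vector = [0] * len(vocabulary)
    for token, count in Counter(text.lower().split()).items():
        if token in vocabulary:
            vector[vocabulary[token]] = count
    return vector


# Hypothetical toy data, not taken from the tweet dataset.
toy_train = ["good good film", "bad film"]
toy_vocabulary = build_vocabulary(toy_train)

# "plot" never appeared in toy_train, so it has no column and is dropped.
toy_vector = vectorise("good film good plot", toy_vocabulary)
print(toy_vocabulary)
print(toy_vector)
```

Because the vocabulary fixes the number and order of columns, every text vectorised with it has the same length, whichever split it comes from.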

**TODO 1.1:** Why do we need to fit the CountVectorizer on the train set?


In [3]:
from Modules.term_vectors.document_term_matrix import DocumentTermMatrix

document_term_matrix = DocumentTermMatrix(train_texts)
document_term_matrix_test = document_term_matrix.transform(test_texts)

# 2. Naive Bayes Classifier

The code above has obtained the feature vectors and lists of labels. The data is now ready for use
with scikit-learn's classifiers.

**TODO 2.1:** Train a classifier using the [MultinomialNB class.](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) You will need to look at the linked documentation to see how to construct and train the model.


In [4]:
# pyright: reportUnknownMemberType=false

from sklearn.naive_bayes import MultinomialNB

naive_bayes = MultinomialNB()
naive_bayes.fit(document_term_matrix.matrix, train_labels)

**TODO 2.2:** Again use the documentation to write code to obtain predictions on the test set.


In [5]:
# pyright: reportUnknownMemberType=false
# pyright: reportUnknownVariableType=false

pred_labels_naive_bayes = naive_bayes.predict(document_term_matrix_test)

**TODO 2.3:** Compute accuracy, precision, recall and F1 scores on the test set using [scikit-learn's metrics library.](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules) Review the documentation to see the different options for evaluating classifiers.


In [6]:
# pyright: reportUnknownArgumentType=false
# pyright: reportUnknownVariableType=false

from typing import Any, Literal
import numpy.typing
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

Average = Literal["micro", "macro", "samples", "weighted", "binary"] | None


def print_metrics(
    test_labels: list[int],
    pred_labels: numpy.typing.NDArray[Any],
    average: Average = "macro",
):
    accuracy = accuracy_score(test_labels, pred_labels)
    print(f"accuracy = {accuracy:.3f}")
    precision = precision_score(test_labels, pred_labels, average=average)
    print(f"precision = {precision:.3f}")
    recall = recall_score(test_labels, pred_labels, average=average)
    print(f"recall = {recall:.3f}")
    f1 = f1_score(test_labels, pred_labels, average=average)
    print(f"f1 = {f1:.3f}")


print_metrics(test_labels, pred_labels_naive_bayes)

accuracy = 0.581
precision = 0.570
recall = 0.588
f1 = 0.576
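
`print_metrics` defaults to `average="macro"`, which computes each score per class and then takes the unweighted mean over classes. The sketch below works through that calculation by hand on hypothetical toy labels. The function name `macro_scores` is an assumption for this example, and a score with a zero denominator is set to 0 here.

```python
def macro_scores(
    true_labels: list[int], pred_labels: list[int]
) -> tuple[float, float, float]:
    """Return macro-averaged (precision, recall, F1)."""
    classes = sorted(set(true_labels) | set(pred_labels))
    pairs = list(zip(true_labels, pred_labels))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        # Per-class counts, treating class c as "positive" and the rest as "negative".
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical toy labels for three classes.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
macro_precision, macro_recall, macro_f1 = macro_scores(toy_true, toy_pred)
print(f"precision = {macro_precision:.3f}")
print(f"recall = {macro_recall:.3f}")
print(f"f1 = {macro_f1:.3f}")
```

Macro averaging gives every class equal weight regardless of how many examples it has, so a rare class that is handled badly lowers the score as much as a common one.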


**TODO 2.4:** Print out the ten features with the strongest association with each class. Hint: use the `feature_log_prob_` attribute of the MultinomialNB object. You may also need NumPy's `argsort()` function.

Beware offensive words below!


In [7]:
# pyright: reportUnknownArgumentType=false
# pyright: reportUnknownMemberType=false

import numpy
import numpy.typing
from typing import Any


def print_top_features_log_prob(
    feature_names: numpy.typing.NDArray[Any],
    classes: numpy.typing.NDArray[Any],
    classes_feature_log_probs: numpy.typing.NDArray[Any],
    n: int = 10,
):
    # Find the probability of the features over all classes.
    feature_probs = numpy.sum(numpy.exp(classes_feature_log_probs), axis=0)

    # For each class...
    for class_name, class_feature_log_probs in zip(
        classes, classes_feature_log_probs
    ):
        print(f"class = {class_name}")

        # Find the probability of the features for the class.
        class_feature_probs = numpy.exp(class_feature_log_probs)

        features = sorted(
            (
                # Find the ratio of the probabilities of the feature for the
                # class and over all classes.
                (feature_name, class_feature_prob / feature_prob)
                for feature_name, class_feature_prob, feature_prob in zip(
                    feature_names,
                    class_feature_probs,
                    feature_probs,
                )
            ),
            key=lambda feature: feature[1],
            reverse=True,
        )

        # Print the N features with the highest ratios.
        for feature, prob_ratio in features[:n]:
            print(f"  {feature} = {prob_ratio:.3f}")


print_top_features_log_prob(
    document_term_matrix.get_feature_names(),
    naive_bayes.classes_,
    naive_bayes.feature_log_prob_,
)

class = 0
  insult = 0.923
  militants = 0.923
  disgusting = 0.917
  blah = 0.914
  asshole = 0.912
  idiots = 0.912
  cunt = 0.897
  graduated = 0.897
  moron = 0.891
  pathetic = 0.888
class = 1
  betis = 0.813
  zap = 0.812
  departure = 0.805
  lucie = 0.805
  exo = 0.800
  obi = 0.790
  thine = 0.776
  tiffany = 0.776
  moderate = 0.765
  pacquiao = 0.765
class = 2
  amazing = 0.930
  congrats = 0.905
  excited = 0.892
  blessed = 0.886
  awesome = 0.885
  exciting = 0.874
  happy = 0.870
  delicious = 0.868
  enjoyed = 0.857
  congratulations = 0.856


Performance metrics are only one of the ways we need to evaluate classifiers. Metrics summarise the performance of a classifier across many different examples in the test set, but they don't tell us what the model is good at or what kinds of mistakes it makes. For this, we need to examine its errors and try to identify patterns, which helps us come up with improvements to the model.

**TODO 2.5:** As a first error analysis step, print out some examples of misclassified tweets, along with their predicted and true labels.


In [8]:
# pyright: reportUnknownArgumentType=false


def print_misclassified(
    tweets: list[str],
    labels: list[int],
    pred_labels: numpy.typing.NDArray[Any],
    n: int = 10,
):
    index = 0
    for tweet, label, pred_label in zip(tweets, labels, pred_labels):
        if index >= n:
            break
        if label != pred_label:
            print(f"true label = {label}, predicted label = {pred_label}")
            print(tweet)
            index += 1


print_misclassified(test_texts, test_labels, pred_labels_naive_bayes)

true label = 1, predicted label = 0
@user @user what do these '1/2 naked pics' have to do with anything? They're not even like that.
true label = 0, predicted label = 1
@user Wow,first Hugo Chavez and now Fidel Castro. Danny Glover, Michael Moore, Oliver Stone, and Sean Penn are running out of heroes.
true label = 1, predicted label = 0
Savchenko now Saakashvili took drug test live on Ukraine TV. To prove they are not drug-fueled loonies?
true label = 1, predicted label = 2
How many more days until opening day? 😩
true label = 2, predicted label = 1
Twitter's #ThankYouObama Shows Heartfelt Gratitude To POTUS
true label = 0, predicted label = 1
When Ryan privatizes SS, Medicare, Medicaid, & does away with ACA, what will Trump's base feel about "change" then? That's a big one right?!
true label = 1, predicted label = 0
@user ohhh ok i see 🤔 what if u have medical marijuana clearance? Does that make a difference
true label = 1, predicted label = 0
OPINION: The Anti-#Trump #Riots Are a #Smo
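
One possible way to look for patterns in errors like those listed above is to count how often each (true label, predicted label) pair occurs. The sketch below does this with hypothetical toy labels. The helper name `confusion_counts` is an assumption, and the same idea could be applied to `test_labels` and the predicted labels.

```python
from collections import Counter


def confusion_counts(
    true_labels: list[int], pred_labels: list[int]
) -> Counter:
    """Count occurrences of each (true label, predicted label) pair."""
    return Counter(zip(true_labels, pred_labels))


# Hypothetical toy labels for three classes.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

toy_confusion = confusion_counts(toy_true, toy_pred)

# Keep only the off-diagonal pairs, i.e. the misclassifications.
toy_errors = {
    pair: count for pair, count in toy_confusion.items() if pair[0] != pair[1]
}
for (true_label, pred_label), count in sorted(toy_errors.items()):
    print(f"true = {true_label}, predicted = {pred_label}: {count}")
```

If one pair dominates, for example neutral tweets predicted as negative, that points to where reading individual examples is most likely to be informative.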

# 3. Logistic Regression Classifier

**TODO 3.1:** Train a classifier using the [LogisticRegression class.](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)


In [9]:
# pyright: reportUnknownMemberType=false

from sklearn.linear_model import LogisticRegression

logistic_regression = LogisticRegression()
logistic_regression.fit(document_term_matrix.matrix, train_labels)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


**TODO 3.2:** Obtain predictions on the test set.


In [10]:
# pyright: reportUnknownMemberType=false
# pyright: reportUnknownVariableType=false

pred_labels_logistic_regression = logistic_regression.predict(
    document_term_matrix_test
)

**TODO 3.3:** Compute accuracy, precision, recall and F1 scores on the test set using [scikit-learn's metrics library.](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules)


In [11]:
# pyright: reportUnknownArgumentType=false

print_metrics(test_labels, pred_labels_logistic_regression)

**TODO 3.3:** Print out the ten features with the highest weights for each class. Hint: use the `coef_` attribute of the LogisticRegression object.


In [12]:
# pyright: reportUnknownArgumentType=false
# pyright: reportUnknownMemberType=false


def print_top_features_coef(
    feature_names: numpy.typing.NDArray[Any],
    classes: numpy.typing.NDArray[Any],
    classes_coef: numpy.typing.NDArray[Any],
    n: int = 10,
) -> None:
    for class_name, coefficients in zip(classes, classes_coef):
        print(f"class = {class_name}")
        features = sorted(
            zip(feature_names, coefficients),
            key=lambda feature: feature[1],
            reverse=True,
        )
        for feature, coefficient in features[:n]:
            print(f"  {feature} = {coefficient:.3f}")


print_top_features_coef(
    document_term_matrix.get_feature_names(),
    logistic_regression.classes_,
    logistic_regression.coef_,
)

class = 0
  worst = 2.921
  sucks = 2.721
  horrible = 2.380
  disappointed = 2.335
  idiot = 2.325
  ruined = 2.248
  terrible = 2.242
  bullshit = 2.238
  disappointing = 2.217
  stupid = 2.214
class = 1
  paterno = 1.604
  dickens = 1.429
  load = 1.327
  bama = 1.304
  clooney = 1.251
  exo = 1.211
  rush = 1.196
  participate = 1.173
  compare = 1.151
  capital = 1.142
class = 2
  congrats = 2.656
  exciting = 2.474
  congratulations = 2.362
  amazing = 2.272
  excited = 2.249
  brilliant = 2.118
  impressive = 2.103
  happy = 2.072
  fantastic = 2.052
  awesome = 1.907
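
To see how weights like these lead to a decision, the sketch below scores a tweet under a softmax (multinomial) formulation. Each class score is the intercept plus the sum of weight times count, and the scores are then normalised into probabilities. The weights here are hypothetical toy values, not the learned `coef_`, and the sketch assumes the fitted model follows this formulation.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Convert scores to probabilities, subtracting the max for stability."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def class_probabilities(
    counts: dict[str, int],
    weights: list[dict[str, float]],
    intercepts: list[float],
) -> list[float]:
    """Score each class as intercept + sum(weight * count), then normalise."""
    scores = [
        intercept
        + sum(class_weights.get(term, 0.0) * n for term, n in counts.items())
        for class_weights, intercept in zip(weights, intercepts)
    ]
    return softmax(scores)


# Hypothetical weights for classes 0 (negative), 1 (neutral), 2 (positive).
toy_weights = [
    {"worst": 2.0, "happy": -1.0},
    {"worst": -0.5, "happy": -0.5},
    {"worst": -1.5, "happy": 1.5},
]
toy_intercepts = [0.0, 0.0, 0.0]

toy_probs = class_probabilities({"worst": 1}, toy_weights, toy_intercepts)
print([f"{prob:.3f}" for prob in toy_probs])
```

A large positive weight pushes probability towards its class every time the word occurs, which is why the highest-weighted features are a useful summary of what the model has learned.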


**TODO 3.4:** Print out some examples of misclassified tweets, along with their predicted and true labels.

**TODO 3.5:** What differences do you find between the results of the naïve Bayes and logistic regression classifiers? Are there any common kinds of mistakes that either classifier makes?


In [13]:
# pyright: reportUnknownArgumentType=false

print_misclassified(test_texts, test_labels, pred_labels_logistic_regression)

true label = 1, predicted label = 0
@user @user what do these '1/2 naked pics' have to do with anything? They're not even like that.
true label = 2, predicted label = 1
I think I may be finally in with the in crowd #mannequinchallenge  #grads2014 @user
true label = 0, predicted label = 1
@user Wow,first Hugo Chavez and now Fidel Castro. Danny Glover, Michael Moore, Oliver Stone, and Sean Penn are running out of heroes.
true label = 1, predicted label = 0
Savchenko now Saakashvili took drug test live on Ukraine TV. To prove they are not drug-fueled loonies?
true label = 2, predicted label = 1
Twitter's #ThankYouObama Shows Heartfelt Gratitude To POTUS
true label = 0, predicted label = 1
When Ryan privatizes SS, Medicare, Medicaid, & does away with ACA, what will Trump's base feel about "change" then? That's a big one right?!
true label = 0, predicted label = 2
Swampbitch Nasty Pelosi  loves yelling 'Fire' in the crowded swamp. #blackfriday @user
true label = 1, predicted label = 0
@user
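
As a possible starting point for comparing the two classifiers, the sketch below tallies the examples on which both are right, both are wrong, or only one is wrong. The function name `compare_errors` and the toy labels are assumptions for illustration. The same function could be called with `test_labels`, `pred_labels_naive_bayes` and `pred_labels_logistic_regression`.

```python
from collections import Counter


def compare_errors(
    true_labels: list[int], pred_first: list[int], pred_second: list[int]
) -> Counter:
    """Tally agreement between two classifiers' predictions and the truth."""
    tally: Counter = Counter()
    for true_label, first, second in zip(true_labels, pred_first, pred_second):
        if first == true_label and second == true_label:
            tally["both correct"] += 1
        elif first != true_label and second != true_label:
            tally["both wrong"] += 1
        elif first != true_label:
            tally["only first wrong"] += 1
        else:
            tally["only second wrong"] += 1
    return tally


# Hypothetical toy labels and predictions.
toy_true = [0, 1, 2, 1]
toy_first = [0, 0, 2, 2]
toy_second = [0, 1, 1, 2]

toy_tally = compare_errors(toy_true, toy_first, toy_second)
for outcome, count in sorted(toy_tally.items()):
    print(f"{outcome}: {count}")
```

Examples that both classifiers get wrong are often hard for any bag-of-words model, whereas examples only one gets wrong hint at differences between the models themselves.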

# 4. N-grams and Lexicon Features

We can try to improve the classifiers using some richer features.

**TODO 4.1:** Use bigram features as well as unigrams (single tokens). To do this, change the `ngram_range` parameter in the CountVectorizer, then try running the best classifier again.
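
To make concrete what an n-gram range of (1, 2) adds, the sketch below lists the unigrams and bigrams of a short token sequence. It is a simplified illustration with whitespace tokenisation. The helper name `extract_ngrams` is an assumption, and the vectoriser's own tokenisation may differ.

```python
def extract_ngrams(tokens: list[str], min_n: int, max_n: int) -> list[str]:
    """Return all n-grams with min_n <= n <= max_n, joined by spaces."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start : start + n]))
    return ngrams


# Hypothetical example sentence.
toy_tokens = "not good at all".split()
toy_ngrams = extract_ngrams(toy_tokens, 1, 2)
print(toy_ngrams)
```

Bigrams such as "not good" let a model treat negated phrases as features in their own right, at the cost of a much larger and sparser vocabulary.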


In [14]:
# pyright: reportUnknownArgumentType=false
# pyright: reportUnknownMemberType=false

from Modules.term_vectors.document_term_matrix import DocumentTermMatrix

document_term_matrix_2 = DocumentTermMatrix(train_texts, ngram_range=(1, 2))
document_term_matrix_2_test = document_term_matrix_2.transform(test_texts)

In [15]:
# pyright: reportUnknownArgumentType=false
# pyright: reportUnknownMemberType=false
# pyright: reportUnknownVariableType=false

from sklearn.naive_bayes import MultinomialNB

naive_bayes_2 = MultinomialNB()
naive_bayes_2.fit(document_term_matrix_2.matrix, train_labels)

pred_labels_naive_bayes_2 = naive_bayes_2.predict(document_term_matrix_2_test)

print_metrics(test_labels, pred_labels_naive_bayes_2)

print_top_features_log_prob(
    document_term_matrix_2.get_feature_names(),
    naive_bayes_2.classes_,
    naive_bayes_2.feature_log_prob_,
)

print_misclassified(test_texts, test_labels, pred_labels_naive_bayes_2)


In [None]:
# pyright: reportUnknownArgumentType=false
# pyright: reportUnknownMemberType=false
# pyright: reportUnknownVariableType=false
from sklearn.linear_model import LogisticRegression

logistic_regression_2 = LogisticRegression()
logistic_regression_2.fit(document_term_matrix_2.matrix, train_labels)

pred_labels_logistic_regression_2 = logistic_regression_2.predict(
    document_term_matrix_2_test
)

print_metrics(test_labels, pred_labels_logistic_regression_2)

print_top_features_coef(
    document_term_matrix_2.get_feature_names(),
    logistic_regression_2.classes_,
    logistic_regression_2.coef_,
)

print_misclassified(test_texts, test_labels, pred_labels_logistic_regression_2)

For sentiment analysis, we can also make use of lexicons. Lexicons are lists of words associated with a particular property, such as positive sentiment. Because these lists were constructed in advance, we don't need to learn the associations between words and sentiment classes purely from the training data. This is useful because some words may be present in the test data but occur rarely, or never at all, in the training set.

Here is one way we can use a lexicon to create some new features:


In [None]:
# pyright: reportMissingTypeStubs=false
# pyright: reportUnknownArgumentType=false
# pyright: reportUnknownMemberType=false
# pyright: reportUnknownVariableType=false

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import numpy as np

nltk.download("vader_lexicon")

analyser = SentimentIntensityAnalyzer()

vocabulary = document_term_matrix.vocabulary

lex_pos_scores = np.zeros((1, len(vocabulary)))
lex_neg_scores = np.zeros((1, len(vocabulary)))

for index, term in enumerate(vocabulary):
    if term in analyser.lexicon and analyser.lexicon[term] > 0:
        lex_pos_scores[0, index] = 1
    elif term in analyser.lexicon and analyser.lexicon[term] < 0:
        lex_neg_scores[0, index] = 1

In [None]:
# pyright: reportMissingTypeStubs=false
# pyright: reportUnknownArgumentType=false
# pyright: reportUnknownMemberType=false
# pyright: reportUnknownVariableType=false

import numpy as np

lex_pos_train = np.sum(
    document_term_matrix.matrix.multiply(lex_pos_scores), axis=1
)

print(f"max positive train = {np.max(lex_pos_train)}")

lex_pos_test = np.sum(
    document_term_matrix_test.multiply(lex_pos_scores), axis=1
)

print(f"max positive test = {np.max(lex_pos_test)}")

lex_neg_train = np.sum(
    document_term_matrix.matrix.multiply(lex_neg_scores), axis=1
)

print(f"max negative train = {np.max(lex_neg_train)}")

lex_neg_test = np.sum(
    document_term_matrix_test.multiply(lex_neg_scores), axis=1
)

print(f"max negative test = {np.max(lex_neg_test)}")
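
The matrix products above amount to counting, for each tweet, how many of its tokens have a positive lexicon score and how many have a negative one. The sketch below shows the same idea directly on tokens. The lexicon entries and scores are hypothetical toy values (not taken from the VADER lexicon), and `lexicon_counts` is a name assumed for this example.

```python
def lexicon_counts(
    tokens: list[str], lexicon: dict[str, float]
) -> tuple[int, int]:
    """Count tokens with positive and with negative lexicon scores."""
    positive = sum(1 for token in tokens if lexicon.get(token, 0.0) > 0)
    negative = sum(1 for token in tokens if lexicon.get(token, 0.0) < 0)
    return positive, negative


# Hypothetical lexicon: word -> sentiment score (sign is what matters here).
toy_lexicon = {"amazing": 2.0, "happy": 1.5, "worst": -2.5}

toy_tokens = "what an amazing amazing day , worst traffic though".split()
toy_positive, toy_negative = lexicon_counts(toy_tokens, toy_lexicon)
print(f"positive = {toy_positive}, negative = {toy_negative}")
```

Repeated words count each time they occur, which matches multiplying a row of term counts by a 0/1 indicator vector and summing.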

Finally, we can append the counts to the feature vector and treat them as extra features:


In [None]:
# pyright: reportMissingTypeStubs=false
# pyright: reportUnknownVariableType=false
from scipy.sparse import hstack

lex_train = hstack((document_term_matrix.matrix, lex_pos_train, lex_neg_train))
lex_test = hstack((document_term_matrix_test, lex_pos_test, lex_neg_test))

**TODO 4.2:** Use the new `lex_train` and `lex_test` feature matrices to train and evaluate your classifier.
Does adding the lexicon features improve performance?


In [None]:
# pyright: reportUnknownArgumentType=false
# pyright: reportUnknownMemberType=false
# pyright: reportUnknownVariableType=false
from sklearn.naive_bayes import MultinomialNB

naive_bayes_3 = MultinomialNB()
naive_bayes_3.fit(lex_train, train_labels)

pred_labels_naive_bayes_3 = naive_bayes_3.predict(lex_test)

print_metrics(test_labels, pred_labels_naive_bayes_3)

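# The last two columns of lex_train are the lexicon counts, so name them
# too ("lex_pos" and "lex_neg" are labels chosen here for display).
print_top_features_log_prob(
    numpy.append(
        document_term_matrix.get_feature_names(), ["lex_pos", "lex_neg"]
    ),
    naive_bayes_3.classes_,
    naive_bayes_3.feature_log_prob_,
)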

print_misclassified(test_texts, test_labels, pred_labels_naive_bayes_3)

In [None]:
# pyright: reportUnknownArgumentType=false
# pyright: reportUnknownMemberType=false
# pyright: reportUnknownVariableType=false
from sklearn.linear_model import LogisticRegression

logistic_regression_3 = LogisticRegression()
logistic_regression_3.fit(lex_train, train_labels)

pred_labels_logistic_regression_3 = logistic_regression_3.predict(lex_test)

print_metrics(test_labels, pred_labels_logistic_regression_3)

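# The last two columns of lex_train are the lexicon counts, so name them
# too ("lex_pos" and "lex_neg" are labels chosen here for display).
print_top_features_coef(
    numpy.append(
        document_term_matrix.get_feature_names(), ["lex_pos", "lex_neg"]
    ),
    logistic_regression_3.classes_,
    logistic_regression_3.coef_,
)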

print_misclassified(test_texts, test_labels, pred_labels_logistic_regression_3)