# Text Classification

This lab explores a new dataset for text classification tasks using naïve Bayes and logistic regression.

### Outcomes
* Train and test NB and LR classifiers using an established library.
* Apply evaluation metrics to the classifiers and display examples of misclassifications.
* Examine learned model parameters to explain how each classifier makes a decision.

### Overview

The first part of the notebook loads a new Twitter dataset, which is described in [this paper](https://arxiv.org/pdf/2010.12421.pdf), then extracts feature vectors from each sample.
The next part involves implementing and evaluating the classifiers using scikit-learn, and the final part tries to improve them with n-gram and lexicon features.

# 1. Preparing the Data 

In [None]:
from datasets import load_dataset

cache_dir = "./data_cache"

train_dataset = load_dataset(
    "tweet_eval",
    name="sentiment",
    split="train",
    ignore_verifications=True,
    cache_dir=cache_dir,
)

print(f"Training dataset with {len(train_dataset)} instances loaded")

test_dataset = load_dataset(
    "tweet_eval",
    name="sentiment",
    split="test",
    ignore_verifications=True,
    cache_dir=cache_dir,
)

print(f"Test dataset with {len(test_dataset)} instances loaded")

In [None]:
train_dataset[0]

In [None]:
# Put the data into lists ready for the next steps.
# Reading a whole column at once avoids fetching one row at a time in a Python loop.
train_tweets = list(train_dataset['text'])
train_labels = list(train_dataset['label'])

print(train_tweets[2])

In [None]:
# Do the same for the test split.
test_tweets = list(test_dataset['text'])
test_labels = list(test_dataset['label'])

print(test_tweets[2])

The next step is to convert the text of each tweet to a feature vector that we can use as input to a classifier. The feature vector needs to be a numerical vector of a fixed size. For the bag-of-words representation, the feature vector for a tweet has one entry per word in the vocabulary, holding the number of times that word occurs in the tweet.

For this, we can use the CountVectorizer class: [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
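
As a rough illustration of the bag-of-words representation, the sketch below runs CountVectorizer on a two-document toy corpus. The corpus and the `toy_` variable names are made up for this example and are not part of the lab data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration (not taken from the tweet dataset)
toy_docs = ["good good movie", "bad movie"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each term to its column index; list the terms in column order
toy_terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
toy_matrix = toy_counts.toarray()

print(toy_terms)   # one column per vocabulary term
print(toy_matrix)  # one row per document, entries are word counts
```

Each row has the same length regardless of how long the document is, which is what makes it usable as classifier input.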

**TODO 1.1:** Why do we need to fit the CountVectorizer on the training set?

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

vectorizer.fit(train_tweets)
X_train = vectorizer.transform(train_tweets)
X_test = vectorizer.transform(test_tweets)

In [None]:
print(vectorizer.vocabulary_)

# 2. Naive Bayes Classifier

The code above has obtained the feature vectors and lists of labels. The data is now ready for use
with scikit-learn's classifiers.

**TODO 2.1:** Train a classifier using the [MultinomialNB class.](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) You will need to look at the linked documentation to see how to construct and train the model.

In [None]:
# WRITE YOUR CODE HERE


**TODO 2.2:** Again use the documentation to write code to obtain predictions on the test set.

In [None]:
# WRITE YOUR CODE HERE


**TODO 2.3:** Compute accuracy, precision, recall and F1 scores on the test set using [scikit-learn's metrics library.](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules) Review the documentation to see the different options for evaluating classifiers.
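
To get a feel for the metric functions before applying them to the real predictions, here is a small sketch on hand-made labels. The `toy_` lists are invented, and macro averaging is just one of the options described in the documentation, chosen here as an assumption.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hand-made labels for a three-class problem (illustration only)
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

toy_acc = accuracy_score(toy_true, toy_pred)

# average="macro" computes each metric per class, then takes the unweighted mean
toy_p, toy_r, toy_f1, _ = precision_recall_fscore_support(
    toy_true, toy_pred, average="macro", zero_division=0
)

print(f"accuracy={toy_acc:.3f} precision={toy_p:.3f} recall={toy_r:.3f} f1={toy_f1:.3f}")
```

With more than two classes, the choice of averaging changes the reported numbers, so it is worth stating which one you used.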

In [None]:
# WRITE YOUR CODE HERE


**TODO 2.4:** Print out the ten features with the strongest association with each class. Hint: use the `feature_log_prob_` attribute of the MultinomialNB object. You may also need NumPy's `argsort()` function.

Beware offensive words below!
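
If `argsort()` is unfamiliar, the sketch below shows how it can pick out the highest-scoring entries of an array. The scores and names are made up; in the lab you would apply the same idea to one row of the model's parameters.

```python
import numpy as np

# Made-up scores and matching names (illustration only)
toy_scores = np.array([0.1, 0.7, 0.3, 0.9])
toy_names = np.array(["a", "b", "c", "d"])

# argsort returns indices in ascending order of score;
# reverse it and keep the first two to get the two largest
top2 = np.argsort(toy_scores)[::-1][:2]
top_names = toy_names[top2].tolist()

print(top_names)
```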

In [None]:
# WRITE YOUR CODE HERE



Performance metrics are only one of the ways in which we need to evaluate classifiers. Metrics summarise the performance of a classifier across many different examples in the test set, but they don't tell us what the model is good at or what kinds of mistakes it makes. For this, we need to examine its errors and try to identify patterns, which helps us to come up with improvements to the model.

**TODO 2.5:** As a first error analysis step, print out some examples of misclassified tweets, along with their predicted and true labels.

In [None]:
# WRITE YOUR CODE HERE


# 3. Logistic Regression Classifier

**TODO 3.1:** Train a classifier using the [LogisticRegression class.](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [None]:
# WRITE YOUR CODE HERE


**TODO 3.2:** Obtain predictions on the test set.

In [None]:
# WRITE YOUR CODE HERE


**TODO 3.3:** Compute accuracy, precision, recall and F1 scores on the test set using [scikit-learn's metrics library.](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules)

In [None]:
# WRITE YOUR CODE HERE


**TODO 3.4:** Print out the ten features with the highest weights for each class. Hint: use the `coef_` attribute of the LogisticRegression object.

In [None]:
# WRITE YOUR CODE HERE


**TODO 3.5:** Print out some examples of misclassified tweets, along with their predicted and true labels.

**TODO 3.6:** What differences do you find between the results of the NB and LR classifiers? Are there any common kinds of mistakes that either classifier makes?

In [None]:
# WRITE YOUR CODE HERE


# 4. N-grams and Lexicon Features

We can try to improve the classifiers using some richer features.

**TODO 4.1:** Use bigram features as well as unigrams (single tokens). To do this, change the `ngram_range` parameter in the CountVectorizer, then try running the best classifier again.
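
To see what bigram features look like, the sketch below fits a vectorizer on a single invented sentence. The sentence and the `toy_` names are assumptions for illustration; the tokenisation is CountVectorizer's default.

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps single tokens and adds pairs of adjacent tokens
toy_bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
toy_bigram_vectorizer.fit(["not good at all"])

toy_features = sorted(toy_bigram_vectorizer.vocabulary_)
print(toy_features)
```

A bigram such as "not good" can carry sentiment that neither "not" nor "good" expresses alone, at the cost of a much larger vocabulary.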

In [None]:
# WRITE YOUR CODE HERE


In [None]:
# repeat the above with the new matrices...


For sentiment analysis, we can also make use of lexicons. Lexicons are lists of words associated with a particular property, such as positive sentiment. Because these lists were constructed in advance, we don't need to learn the associations between words and sentiment classes purely from the training data. This is useful because some words may be present in the test data but occur rarely, or never at all, in the training set.  

Here is one way we can use a lexicon to create some new features:

In [None]:
import nltk
import numpy as np
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
analyser = SentimentIntensityAnalyzer()  # a class that provides word-sentiment scores based on a lexicon

# Mark each vocabulary term as positive or negative according to its VADER lexicon score
vocabulary = vectorizer.vocabulary_  # maps each term to its column index in the feature matrix

lex_pos_scores = np.zeros((1, len(vocabulary)))
lex_neg_scores = np.zeros((1, len(vocabulary)))

# Use the stored column index rather than enumerate(): the dict's iteration order
# is not guaranteed to match the column order of X_train.
for term, idx in vocabulary.items():
    score = analyser.lexicon.get(term, 0)
    if score > 0:
        lex_pos_scores[0, idx] = 1
    elif score < 0:
        lex_neg_scores[0, idx] = 1
In [None]:
# Count the positive-lexicon words in each tweet.
# Multiplying the feature vectors by the 0/1 lexicon indicators keeps only the counts
# of positive words; summing over the vocabulary then gives the total positive count
# per tweet:
lex_pos_train = np.sum(X_train.multiply(lex_pos_scores), axis=1)
lex_pos_test = np.sum(X_test.multiply(lex_pos_scores), axis=1)

print(np.max(lex_pos_train))
print(np.max(lex_pos_test))

# Do the same for negative scores:
lex_neg_train = np.sum(X_train.multiply(lex_neg_scores), axis=1)
lex_neg_test = np.sum(X_test.multiply(lex_neg_scores), axis=1)

print(np.max(lex_neg_train))
print(np.max(lex_neg_test))

Finally, we can append the counts to the feature vector and treat them as extra features:

In [None]:
from scipy.sparse import hstack

X_train = hstack((X_train, lex_pos_train, lex_neg_train))
X_test = hstack((X_test, lex_pos_test, lex_neg_test))
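
The shapes involved in the last two cells can be checked on a tiny example. This is a sketch with an invented 2x3 count matrix and an invented positive-term mask, using the same multiply-then-sum pattern as above; the `toy_` names are not part of the lab code.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Invented count matrix: 2 documents, 3 vocabulary terms
toy_X = csr_matrix(np.array([[1, 0, 2], [0, 1, 0]]))
# Invented mask: columns 0 and 2 are treated as "positive" terms
toy_pos_mask = np.array([[1, 0, 1]])

# The mask is broadcast across rows; summing each row counts the positive words
toy_pos_counts = np.asarray(toy_X.multiply(toy_pos_mask).sum(axis=1))

# Append the counts as one extra feature column
toy_X_extended = hstack((toy_X, csr_matrix(toy_pos_counts))).tocsr()

print(toy_pos_counts.ravel())
print(toy_X_extended.toarray())
```

The extended matrix has one more column than the original, so any classifier must be retrained on it rather than reused.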

**TODO 4.2:** Use the new X_train and X_test feature vectors to train and evaluate your classifier. 
Does adding the lexicon features improve performance?

In [None]:
# WRITE YOUR CODE HERE
