# Lab 5: Text Classification

This lab explores a new dataset for text classification tasks using naïve Bayes and logistic regression.

### Learning Outcomes
* Be able to train and test naïve Bayes and logistic regression classifiers using scikit-learn.
* Know how to apply evaluation metrics to the classifiers and display examples of misclassifications.
* Be able to examine learned model parameters and explain how each classifier makes a decision.

### Outline

1. Load a new Twitter dataset, which is described in [this paper](https://arxiv.org/pdf/2010.12421.pdf), then extracts feature vectors from each sample.
1. Training and evaluating naïve Bayes using Scikit-learn.
1. Training and evaluating logistic regression using Scikit-learn.
1. Optional extension: lemmatization and bigram features.
1. Optional extensions: lexicon features.

### How To Complete This Lab

Read the text and the code then look for 'TODOs' that instruct you to complete some missing code. Look out for 'QUESTIONS' which you should try to answer before moving on to the next cell. Aim to work through the lab during the scheduled lab hours. To get help, you can talk to TAs or the lecturer during the labs, post questions to Blackboard (anonymously) or on Teams in the QA channel (with your name), or ask a question in the Wednesday live sessions. 

As you work through the notebooks, please make a note of any code that is unclear to you.

The labs *will not be marked*. However, they will prepare you for the coursework, so try to keep up with the weekly labs and have fun with the exercises! To understand what's going on inside the methods we use here, make sure to watch the lecture videos for the same week.

# 1. Preparing the Data 

This time we are using part of the Tweet Eval dataset, which contains seven Twitter datasets for various social media classification tasks. Here, we'll focus on the sentiment analysis data. 
Run the code below to download the data from [HuggingFace's datasets hub](https://huggingface.co/datasets/tweet_eval):

In [1]:
from datasets import load_dataset

cache_dir = "./data_cache"

# The data is already divided into training and test sets.
# Load the training set:
train_dataset = load_dataset(
    "tweet_eval",
    name="sentiment",
    split="train",
    #ignore_verifications=True,
    cache_dir=cache_dir,
)
print(f"Training dataset with {len(train_dataset)} instances loaded")

Reusing dataset tweet_eval (./data_cache/tweet_eval/sentiment/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)


Training dataset with 45615 instances loaded


In [2]:
# Load the test set:
test_dataset = load_dataset(
    "tweet_eval",
    name="sentiment",
    split="test",
    #ignore_verifications=True,
    cache_dir=cache_dir,
)
print(f"Test dataset with {len(test_dataset)} instances loaded")

Reusing dataset tweet_eval (./data_cache/tweet_eval/sentiment/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)


Test dataset with 12284 instances loaded


Let's take a look at one of the instances in the training set:

In [3]:
train_dataset[0]

{'text': '"QT @user In the original draft of the 7th book, Remus Lupin survived the Battle of Hogwarts. #HappyBirthdayRemusLupin"',
 'label': 2}

The next step is to tokenise the text of each tweet and convert it to a bag of words, ready for input to a classifier. 
To do this, we will use the scikit-learn library. 

In [4]:
# Put the data into lists ready for the next steps...
train_tweets = [sample['text'] for sample in train_dataset]
train_labels = [sample['label'] for sample in train_dataset]

In [5]:
test_tweets = [sample['text'] for sample in test_dataset]
test_labels = [sample['label'] for sample in test_dataset]

To extract a bag of words, we can use the CountVectorizer class ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)).
This class outputs the bag of words as a feature vector, where the length of the vector is equal to the size of the vocabulary, and the values are the counts of each words in a document. 

Run the code below to obtain feature vectors for the training and test samples:

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk import word_tokenize

# CountVectorizer can do its own tokenization, but for consistency we want to
# carry on using WordNetTokenizer. We write a small wrapper class to enable this:
class Tokenizer(object):
    def __call__(self, tweets):
        return word_tokenize(tweets)

vectorizer = CountVectorizer(tokenizer=Tokenizer())  # construct the vectorizer

vectorizer.fit(train_tweets)  # Learn the vocabulary
X_train = vectorizer.transform(train_tweets)  # extract training set bags of words
X_test = vectorizer.transform(test_tweets)  # extract test set bags of words



The fit() method sets the vectorizer up by extracting a vocabulary from some text data. 

QUESTION: Why do we fit the CountVectorizer on the training set?

The vectorizer stores the vocabulary as a dictionary that maps a token to its index in the feature vector. The code below looks up the indexes of some example words:

In [7]:
import reprlib

vocabulary = vectorizer.vocabulary_
print(vocabulary['the'])
print(vocabulary['horse'])
print(vocabulary['smile'])

print(f'Vocabulary size = {len(vocabulary)}')

45903
23568
42626
Vocabulary size = 51903


# 2. Naive Bayes Classifier

The code above has obtained the feature vectors and lists of labels. The data is now ready for use
with scikit-learn's classifiers.

Scikit-learn contains several different variants of naïve Bayes for different kinds of data. For our bag of words data, we need to use the [MultinomialNB class](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB).


TODO 2.1: Look at the documentation for MultinomalNB and write code to train a NB classifier using `X_train` and `train_labels`.

In [8]:
# WRITE YOUR CODE HERE

from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

MultinomialNB()

Now we have a trained model, we would like to evaluate its performance on some test data. 

TODO 2.2: Refer to the documentation again and predict the labels for the test set. Use `X_test` as the inputs to the classifier.

In [9]:
# WRITE YOUR CODE HERE
y_test_pred = classifier.predict(X_test)

We can compute standard metrics for classifier performance using [scikit-learn's metrics libary](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules). A useful function for multi-class classification (when there are more than two classes) is the [classification report function](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-report).

TODO 2.3: Refer again to the documentation, and compute accuracy, precision, recall and F1 scores on the test set. 

In [10]:
# WRITE YOUR CODE HERE
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

acc = accuracy_score(test_labels, y_test_pred)
print(f'Accuracy = {acc}')

prec = precision_score(test_labels, y_test_pred, average='macro')
print(f'Precision (macro average) = {prec}')

rec = recall_score(test_labels, y_test_pred, average='macro')
print(f'Recall (macro average) = {rec}')

f1 = f1_score(test_labels, y_test_pred, average='macro')
print(f'F1 score (macro average) = {f1}')

# We can get all of these with a per-class breakdown using classification_report:
print(classification_report(test_labels, y_test_pred))

Accuracy = 0.5891403451644416
Precision (macro average) = 0.5856910631417499
Recall (macro average) = 0.5819831225339609
F1 score (macro average) = 0.5717257637198551
              precision    recall  f1-score   support

           0       0.66      0.43      0.52      3972
           1       0.61      0.68      0.64      5937
           2       0.49      0.64      0.56      2375

    accuracy                           0.59     12284
   macro avg       0.59      0.58      0.57     12284
weighted avg       0.60      0.59      0.58     12284



Now, let's examine the classifier that we learned. If you don't follow what's happening here, you may wish to refer back to the slides on naïve Bayes classifiers or to [Jurafsky and Martin's textbook](https://web.stanford.edu/~jurafsky/slp3/4.pdf). 

Previously, we trained a MultinomialNB classifier. The trained classifier object stores all the probabilities that it learned during training, which are needed to make predictions. The log of the likelihoods of each word given the class are represented by the attribute `feature_log_prob_`. So, if your classifier object is named `classifier`, you can access the likelihoods with `classifier.feature_log_prob_`.

TODO 2.4: Print out the likelihood of the words 'happy' and 'hate' in each class. Hint: look up the index of the chosen words in `vocabulary`. The rows of `feature_log_prob` correspond to classes, and the columns to words.

In [11]:
import numpy as np

### CHANGE THE NAME OF THE CLASSIFIER VARIABLE BELOW TO USE YOUR TRAINED CLASSIFIER
feat_likelihoods = np.exp(classifier.feature_log_prob_)  # Use exponential to convert the logs back to probabilities
###

# WRITE YOUR CODE HERE
print(feat_likelihoods[:, vocabulary['happy']])
print(feat_likelihoods[:, vocabulary['hate']])

[0.00016529 0.00010482 0.00175175]
[5.09280976e-04 8.95763610e-05 5.33420957e-05]


The sentiment classes are negative (0), neutral (1) and positive (2). 

QUESTION: Which class has the strongest association with 'happy' and with 'hate'?

A key part of evaluating a classifier is investigating the errors it makes to better understand its limitations. 

TODO 2.5: Complete the code below to print out some misclassified tweets along with their predicted and true labels.

In [12]:
error_indexes = y_test_pred != test_labels  # compare predictions to gold labels

# get the text of tweets where the classifier made an error:
tweets_err = np.array(test_tweets)[error_indexes]

# WRITE YOUR CODE HERE
pred_err = y_test_pred[error_indexes]
gold_err = np.array(test_labels)[error_indexes]

for i in range(10):  # just print the first ten
    print(f'Tweet: {tweets_err[i]}; true label = {gold_err[i]}, prediction = {pred_err[i]}.')

Tweet: @user @user what do these '1/2 naked pics' have to do with anything? They're not even like that.; true label = 1, prediction = 0.
Tweet: @user Wow,first Hugo Chavez and now Fidel Castro. Danny Glover, Michael Moore, Oliver Stone, and Sean Penn are running out of heroes.; true label = 0, prediction = 1.
Tweet: Twitter's #ThankYouObama Shows Heartfelt Gratitude To POTUS; true label = 2, prediction = 1.
Tweet: @user @user @user @user @user @user take away illegals and dead people and Trump wins popular vote too.; true label = 0, prediction = 1.
Tweet: When Ryan privatizes SS, Medicare, Medicaid, & does away with ACA, what will Trump's base feel about "change" then? That's a big one right?!; true label = 0, prediction = 1.
Tweet: @user ohhh ok i see 🤔 what if u have medical marijuana clearance? Does that make a difference; true label = 1, prediction = 0.
Tweet: @user alt-right was adopted by Deplorables. Average middle Americans.  I've now moved to Libertarian. @user; true label = 1

# 3. Logistic Regression Classifier

Another simple, linear classifier is logistic regression. This classifier does not rely on the conditional independence assumption, so can better model features that are highly correlated with each other. Scikit-learn provides the [logisticRegression class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), which has a very similar interface to the naïve Bayes classifier.

TODO 3.1: Train a logistic regression classifier, referring to the scikit-learn documentation as required.

In [13]:
# WRITE YOUR CODE HERE

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train, train_labels)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()

TODO 3.2: Obtain predictions on the test set.

In [14]:
# WRITE YOUR CODE HERE
y_test_pred = classifier.predict(X_test)

TODO 3.3: Compute accuracy, precision, recall and F1 scores on the test set using [scikit-learn's metrics libary.](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules)

In [15]:
# WRITE YOUR CODE HERE
acc = accuracy_score(test_labels, y_test_pred)
print(f'Accuracy = {acc}')

prec = precision_score(test_labels, y_test_pred, average='macro')
print(f'Precision (macro average) = {prec}')

rec = recall_score(test_labels, y_test_pred, average='macro')
print(f'Recall (macro average) = {rec}')

f1 = f1_score(test_labels, y_test_pred, average='macro')
print(f'F1 score (macro average) = {f1}')

print(classification_report(test_labels, y_test_pred))

Accuracy = 0.5942689677629437
Precision (macro average) = 0.5932110127221071
Recall (macro average) = 0.5688677245063735
F1 score (macro average) = 0.5685911119288147
              precision    recall  f1-score   support

           0       0.66      0.41      0.51      3972
           1       0.60      0.73      0.66      5937
           2       0.52      0.57      0.54      2375

    accuracy                           0.59     12284
   macro avg       0.59      0.57      0.57     12284
weighted avg       0.60      0.59      0.59     12284



QUESTION: How does the performance of logistic regression compare with naïve Bayes?

The logistic regression classifier works by learning a weight for each feature that indicates its importance in predicting a class. These weights are stored in the `coef_` attribute of the LogisticRegression object, which has rows corresponding to classes, and columns corresponding to words in the vocabulary. 

TODO 3.4: Print out the weights for 'happy' and 'hate' for each class.

In [16]:
### WRITE YOUR CODE HERE
print(classifier.coef_[:, vocabulary['happy']])
print(classifier.coef_[:, vocabulary['hate']])

[-0.77760673 -1.38021482  2.15782156]
[ 1.93711456 -0.18070966 -1.7564049 ]


QUESTION: Are the weights what you would expect to see?

The code below prints out the words with the highest weights for each class. We use numpy's `argsort` function to get the indexes of the sorted weights. Run the code below to show the result: 

In [17]:
n_feats_to_show = 10

# Flip the index so that values are keys and keys are values:
keys = vectorizer.vocabulary_.values()
values = vectorizer.vocabulary_.keys()
vocab_inverted = dict(zip(keys, values))

for c, weights_c in enumerate(classifier.coef_):
    print(f'\nWeights for class {c}:\n')
    strongest_idxs = np.argsort(weights_c)[-n_feats_to_show:]

    for idx in strongest_idxs:
        print(f'{vocab_inverted[idx]} with weight {weights_c[idx]}')


Weights for class 0:

fuck with weight 2.2729152707901554
ruined with weight 2.285801114997877
bullshit with weight 2.2952242372160514
worse with weight 2.308384547714172
horrible with weight 2.365476671867839
disappointed with weight 2.4626579983985653
terrible with weight 2.5510821394237997
stupid with weight 2.691283932646167
worst with weight 3.036680045685413
sucks with weight 3.153974847543261

Weights for class 1:

compare with weight 1.0878548371424837
load with weight 1.0920254709419257
capital with weight 1.0996527728738106
clemson with weight 1.1398262253842062
gucci with weight 1.1699364397002159
saturday\u002c with weight 1.2517803958964133
bama with weight 1.2592073255204252
yakub with weight 1.2740288265343915
paterno with weight 1.2778264994490574
rush with weight 1.340868379595107

Weights for class 2:

loved with weight 2.156808169881905
happy with weight 2.1578215552179896
proud with weight 2.244511292957456
congratulations with weight 2.2572528359704065
fantastic w

TODO 3.5: Use the same code as for naïve Bayes to print out examples of misclassified tweets and their labels. Hint: you should be able to compy and paste your code from above :) 

In [18]:
error_indexes = y_test_pred != test_labels  # compare predictions to gold labels

# get the text of tweets where the classifier made an error:
tweets_err = np.array(test_tweets)[error_indexes]

# WRITE YOUR CODE HERE
pred_err = y_test_pred[error_indexes]
gold_err = np.array(test_labels)[error_indexes]

for i in range(10):  # just print the first ten
    print(f'Tweet: {tweets_err[i]}; true label = {gold_err[i]}, prediction = {pred_err[i]}.')

Tweet: @user @user what do these '1/2 naked pics' have to do with anything? They're not even like that.; true label = 1, prediction = 0.
Tweet: I think I may be finally in with the in crowd #mannequinchallenge  #grads2014 @user; true label = 2, prediction = 1.
Tweet: @user Wow,first Hugo Chavez and now Fidel Castro. Danny Glover, Michael Moore, Oliver Stone, and Sean Penn are running out of heroes.; true label = 0, prediction = 1.
Tweet: Savchenko now Saakashvili took drug test live on Ukraine TV. To prove they are not drug-fueled loonies?; true label = 1, prediction = 0.
Tweet: Twitter's #ThankYouObama Shows Heartfelt Gratitude To POTUS; true label = 2, prediction = 1.
Tweet: An interesting security vulnerability - albeit not for the everyday car thief; true label = 1, prediction = 2.
Tweet: When Ryan privatizes SS, Medicare, Medicaid, & does away with ACA, what will Trump's base feel about "change" then? That's a big one right?!; true label = 0, prediction = 1.
Tweet: Swampbitch Nast

# 4. Lemmatization and N-grams

 4. Optional: Lemmatization and N-grams

You only need to do this section if you finish the previous sections before the end of the lab.

In the previous lab, we tried out lemmatization. This is useful for reducing the size of the vocabulary. Could it help us here?

To apply lemmatization, we have to go back to the CountVectorizer and define a new tokenizer that will carry out the extra step of lemmatization. Run the code below to test this out:

In [19]:
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 

class LemmaTokenizer(object):
    
    def __init__(self):
        self.wnl = WordNetLemmatizer()
        
    def __call__(self, tweets):
        return [self.wnl.lemmatize(self.wnl.lemmatize(self.wnl.lemmatize(tok, pos='n'), pos='v'), pos='a') for tok in word_tokenize(tweets)]
    
vectorizer = CountVectorizer(tokenizer=LemmaTokenizer())

vectorizer.fit(train_tweets)
X_train = vectorizer.transform(train_tweets)
X_test = vectorizer.transform(test_tweets)

# Print out some of the features in the vocabulary:
print(list(vectorizer.vocabulary_)[:20])



['``', 'qt', '@', 'user', 'in', 'the', 'original', 'draft', 'of', '7th', 'book', ',', 'remus', 'lupin', 'survive', 'battle', 'hogwarts', '.', '#', 'happybirthdayremuslupin']


In [20]:
print(f'Vocabulary size: {len(vectorizer.vocabulary_)}')

Vocabulary size: 45312


TODO 4.1: Now, repeat your training of the logistic regression using the new features, and compare its performance with the previous classifers.

In [21]:
### WRITE YOUR OWN CODE HERE
classifier = LogisticRegression()
classifier.fit(X_train, train_labels)
y_test_pred = classifier.predict(X_test)

acc = accuracy_score(test_labels, y_test_pred)
print(f'Accuracy = {acc}')

prec = precision_score(test_labels, y_test_pred, average='macro')
print(f'Precision (macro average) = {prec}')

rec = recall_score(test_labels, y_test_pred, average='macro')
print(f'Recall (macro average) = {rec}')

f1 = f1_score(test_labels, y_test_pred, average='macro')
print(f'F1 score (macro average) = {f1}')

print(classification_report(test_labels, y_test_pred))

Accuracy = 0.6015141647671768
Precision (macro average) = 0.5945523448181844
Recall (macro average) = 0.5753188319941674
F1 score (macro average) = 0.5777716456197073
              precision    recall  f1-score   support

           0       0.66      0.46      0.54      3972
           1       0.61      0.72      0.66      5937
           2       0.52      0.55      0.53      2375

    accuracy                           0.60     12284
   macro avg       0.59      0.58      0.58     12284
weighted avg       0.61      0.60      0.60     12284



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


QUESTION: Did lemmatization bring about any improvements on this dataset?

The bag of words is a very simple representation of the tweets that does not capture enough information to make accurate sentiment classifications. Another way to improve it could be to use bigrams instead of single words as our features. Bigrams are pairs of words that occur one after another in the text. Bigrams are a kind of 'n-gram', where 'n=2'. 

To extract bigrams, we again modify our CountVectorizer. This class has a parameter `ngram_range`, which determines the range of sizes of n-grams the vectorizer will include. If we set `ngram_range=(1,1)` we have our standard bag of words. If we set it to `ngram_range=(2,2)`, we use bigrams instead. Choosing If we set `ngram_range=(1,2)` will use both single tokens (unigrams) and bigrams.

TODO 4.2: Create a new CountVectorizer that extracts bigram features instead of unigrams (single tokens) and uses the LemmaTokenizer.

In [22]:
### WRITE YOUR CODE HERE
vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(), ngram_range=(2,2))

vectorizer.fit(train_tweets)
X_train = vectorizer.transform(train_tweets)
X_test = vectorizer.transform(test_tweets)
###

# Print out some of the features in the vocabulary:
print(list(vectorizer.vocabulary_)[:20])



['`` qt', 'qt @', '@ user', 'user in', 'in the', 'the original', 'original draft', 'draft of', 'of the', 'the 7th', '7th book', 'book ,', ', remus', 'remus lupin', 'lupin survive', 'survive the', 'the battle', 'battle of', 'of hogwarts', 'hogwarts .']


TODO 4.3: Now, repeat your training of the logistic regression or naïve Bayes classifier using the new features, and compare its performance with the previous classifers.

In [23]:
### WRITE YOUR OWN CODE HERE
classifier = LogisticRegression()
classifier.fit(X_train, train_labels)
y_test_pred = classifier.predict(X_test)

acc = accuracy_score(test_labels, y_test_pred)
print(f'Accuracy = {acc}')

prec = precision_score(test_labels, y_test_pred, average='macro')
print(f'Precision (macro average) = {prec}')

rec = recall_score(test_labels, y_test_pred, average='macro')
print(f'Recall (macro average) = {rec}')

f1 = f1_score(test_labels, y_test_pred, average='macro')
print(f'F1 score (macro average) = {f1}')

print(classification_report(test_labels, y_test_pred))

Accuracy = 0.544041028980788
Precision (macro average) = 0.5659706975652837
Recall (macro average) = 0.48593295054513086
F1 score (macro average) = 0.4633922975094517
              precision    recall  f1-score   support

           0       0.67      0.16      0.26      3972
           1       0.55      0.83      0.66      5937
           2       0.48      0.47      0.47      2375

    accuracy                           0.54     12284
   macro avg       0.57      0.49      0.46     12284
weighted avg       0.57      0.54      0.49     12284



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


QUESTION: Do bigrams improve performance on this dataset?

# 5. Optional: Lexicon Features

You only need to do this part if you finish the other parts before the end of the lab session. 

The NLTK library contains sentiment lexicons, which are lists of words with negative or positive connotations. 

In [24]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

analyser = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/es1595/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Now have a look at the sentiment scores for some words in the lexicon by running the code below. What do the scores mean and why do some words have no score?

In [25]:
testwords = ['happy', 'wonderful', 'horrible', 'boring', 'tablecloth', 'not']

for word in testwords:
    if word in analyser.lexicon:
        print(f'{word}: {analyser.lexicon[word]}')
    else:
        print(f'{word}: NOT IN LEXICON')

happy: 2.7
wonderful: 2.7
horrible: -2.5
boring: -1.3
tablecloth: NOT IN LEXICON
not: NOT IN LEXICON


Now we would like to use this function to compute counts of all positive and negative words. Let's start by recording whether the words in our vocabulary are positive or negative:

In [26]:
# get the Vader lexicon scores for each word in our vocabulary
vectorizer = CountVectorizer(tokenizer=Tokenizer())

vectorizer.fit(train_tweets)
X_train = vectorizer.transform(train_tweets)
X_test = vectorizer.transform(test_tweets)

# Print out some of the features in the vocabulary:
print(list(vectorizer.vocabulary_)[:20])

vocabulary = vectorizer.vocabulary_

lex_pos_scores = np.zeros((1, len(vocabulary)))
lex_neg_scores = np.zeros((1, len(vocabulary)))

for i, term in enumerate(vocabulary):
    if term in analyser.lexicon and analyser.lexicon[term] > 0:
        lex_pos_scores[0, i] = 1
    elif term in analyser.lexicon and analyser.lexicon[term] < 0:
        lex_neg_scores[0, i] = 1



['``', 'qt', '@', 'user', 'in', 'the', 'original', 'draft', 'of', '7th', 'book', ',', 'remus', 'lupin', 'survived', 'battle', 'hogwarts', '.', '#', 'happybirthdayremuslupin']


Now let's compute the counts of positive and negative words in the dataset:

In [27]:
# Compute the scores for each instance in the data set. 

# Multiply the lexicon scores by the feature vectors, then sum over the 
# vocabulary to get the total positive and total negative counts:
lex_pos_train = np.sum(X_train.multiply(lex_pos_scores), axis=1)
lex_pos_test = np.sum(X_test.multiply(lex_pos_scores), axis=1)

lex_neg_train = np.sum(X_train.multiply(lex_neg_scores), axis=1)
lex_neg_test = np.sum(X_test.multiply(lex_neg_scores), axis=1)

Finally, we can append the counts to the feature vector and treat them as extra features:

In [28]:
from scipy.sparse import hstack

X_train = hstack((X_train, lex_pos_train, lex_neg_train))
X_test = hstack((X_test, lex_pos_test, lex_neg_test))

TODO 5.1: Use the new X_train and X_test feature vectors to train and evaluate your classifier. 
Does adding the lexicon features improve performance?

In [29]:
### WRITE YOUR OWN CODE HERE
classifier = LogisticRegression()
classifier.fit(X_train, train_labels)
y_test_pred = classifier.predict(X_test)

acc = accuracy_score(test_labels, y_test_pred)
print(f'Accuracy = {acc}')

prec = precision_score(test_labels, y_test_pred, average='macro')
print(f'Precision (macro average) = {prec}')

rec = recall_score(test_labels, y_test_pred, average='macro')
print(f'Recall (macro average) = {rec}')

f1 = f1_score(test_labels, y_test_pred, average='macro')
print(f'F1 score (macro average) = {f1}')

print(classification_report(test_labels, y_test_pred))

Accuracy = 0.5990719635297949
Precision (macro average) = 0.5984972931963172
Recall (macro average) = 0.5686547210579194
F1 score (macro average) = 0.5727623439699606
              precision    recall  f1-score   support

           0       0.66      0.42      0.52      3972
           1       0.60      0.74      0.66      5937
           2       0.54      0.54      0.54      2375

    accuracy                           0.60     12284
   macro avg       0.60      0.57      0.57     12284
weighted avg       0.61      0.60      0.59     12284



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
