# Lab 5: Text Classification

This lab explores a new dataset for text classification tasks using naïve Bayes and logistic regression.

### Learning Outcomes
* Be able to train and test naïve Bayes and logistic regression classifiers using scikit-learn.
* Know how to apply evaluation metrics to the classifiers and display examples of misclassifications.
* Be able to examine learned model parameters and explain how each classifier makes a decision.

### Outline

1. Load a new Twitter dataset, which is described in [this paper](https://arxiv.org/pdf/2010.12421.pdf), then extract feature vectors from each sample.
1. Train and evaluate naïve Bayes using scikit-learn.
1. Train and evaluate logistic regression using scikit-learn.
1. Optional extension: lemmatization and bigram features.
1. Optional extension: lexicon features.

### How To Complete This Lab

Read the text and the code then look for 'TODOs' that instruct you to complete some missing code. Look out for 'QUESTIONS' which you should try to answer before moving on to the next cell. Aim to work through the lab during the scheduled lab hours. To get help, you can talk to TAs or the lecturer during the labs, post questions to Blackboard (anonymously) or on Teams in the QA channel (with your name), or ask a question in the Wednesday live sessions. 

As you work through the notebooks, please make a note of any code that is unclear to you.

The labs *will not be marked*. However, they will prepare you for the coursework, so try to keep up with the weekly labs and have fun with the exercises! To understand what's going on inside the methods we use here, make sure to watch the lecture videos for the same week.

# 1. Preparing the Data 

This time we are using part of the TweetEval dataset, which contains seven Twitter datasets for various social media classification tasks. Here, we'll focus on the sentiment analysis data.
Run the code below to download the data from [HuggingFace's datasets hub](https://huggingface.co/datasets/tweet_eval):

In [1]:
from datasets import load_dataset

cache_dir = "./data_cache"

# The data is already divided into training and test sets.
# Load the training set:
train_dataset = load_dataset(
    "tweet_eval",
    name="sentiment",
    split="train",
    #ignore_verifications=True,
    cache_dir=cache_dir,
)
print(f"Training dataset with {len(train_dataset)} instances loaded")


Dataset tweet_eval downloaded and prepared to ./data_cache/tweet_eval/sentiment/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343. Subsequent calls will reuse this data.
Training dataset with 45615 instances loaded




In [8]:
# Load the test set:
test_dataset = load_dataset(
    "tweet_eval",
    name="sentiment",
    split="test",
    #ignore_verifications=True,
    cache_dir=cache_dir,
)
print(f"Test dataset with {len(test_dataset)} instances loaded")

Reusing dataset tweet_eval (./data_cache/tweet_eval/sentiment/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)


Test dataset with 12284 instances loaded


Let's take a look at one of the instances in the training set:

In [5]:
train_dataset[0]

{'text': '"QT @user In the original draft of the 7th book, Remus Lupin survived the Battle of Hogwarts. #HappyBirthdayRemusLupin"',
 'label': 2}

The next step is to tokenise the text of each tweet and convert it to a bag of words, ready for input to a classifier. 
To do this, we will use the scikit-learn library. 

In [6]:
# Put the data into lists ready for the next steps...
train_tweets = [sample['text'] for sample in train_dataset]
train_labels = [sample['label'] for sample in train_dataset]

In [30]:
test_tweets = [sample['text'] for sample in test_dataset]
test_labels = [sample['label'] for sample in test_dataset]

To extract a bag of words, we can use the CountVectorizer class ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)).
This class outputs the bag of words as a feature vector, where the length of the vector is equal to the size of the vocabulary, and the values are the counts of each word in a document.
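To make the representation concrete, here is a minimal pure-Python sketch of a bag of words built from two toy documents. It is not how CountVectorizer is implemented: the documents, the whitespace tokenisation and the helper name `bag_of_words` are assumptions made for illustration only.

```python
from collections import Counter

# Two toy documents (hypothetical examples, not taken from the dataset).
docs = ["the cat sat", "the cat saw the dog"]

# Vocabulary: each distinct token gets one column index (alphabetical order here).
tokens = sorted({tok for doc in docs for tok in doc.split()})
vocab = {tok: idx for idx, tok in enumerate(tokens)}


def bag_of_words(doc, vocab):
    """Return a count vector with one entry per vocabulary item."""
    counts = Counter(doc.split())
    vector = [0] * len(vocab)
    for tok, count in counts.items():
        if tok in vocab:
            vector[vocab[tok]] = count
    return vector


vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Each vector has the same length as the vocabulary, and most entries are zero, which is why the real feature matrices are stored in a sparse format.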

Run the code below to obtain feature vectors for the training and test samples:

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk import word_tokenize

# CountVectorizer can do its own tokenization, but for consistency we want to
# carry on using NLTK's word_tokenize. We write a small wrapper class to enable this:
class Tokenizer(object):
    def __call__(self, tweet):
        return word_tokenize(tweet)

vectorizer = CountVectorizer(tokenizer=Tokenizer())  # construct the vectorizer

vectorizer.fit(train_tweets)  # Learn the vocabulary
X_train = vectorizer.transform(train_tweets)  # extract training set bags of words
X_test = vectorizer.transform(test_tweets)  # extract test set bags of words



The fit() method sets the vectorizer up by extracting a vocabulary from some text data. 

QUESTION: Why do we fit the CountVectorizer on the training set?

ANSWER: The vocabulary is part of the model, so it should be learned from the training data only. The test set stands in for unseen data: if we also fitted the CountVectorizer on it, information about the test tweets would leak into the features, and the evaluation would no longer tell us how well the model generalises. Words in the test set that are not in the training vocabulary are simply ignored by transform().
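The effect of fitting on the training set only can be seen in a toy sketch: a word that appears only in a test document has no column, so it is dropped. The class `ToyVectorizer` below is hypothetical and only mimics the fit/transform pattern; it is not scikit-learn's implementation.

```python
class ToyVectorizer:
    """Hypothetical stand-in that mimics the fit/transform pattern."""

    def fit(self, docs):
        # Learn the vocabulary from the documents passed in (training data only).
        tokens = sorted({tok for doc in docs for tok in doc.split()})
        self.vocabulary_ = {tok: idx for idx, tok in enumerate(tokens)}
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for tok in doc.split():
                idx = self.vocabulary_.get(tok)
                if idx is not None:  # unseen words have no column, so skip them
                    row[idx] += 1
            rows.append(row)
        return rows


train_docs = ["good film", "bad film"]
test_docs = ["good good soundtrack"]

toy = ToyVectorizer().fit(train_docs)
X_toy_test = toy.transform(test_docs)
print(toy.vocabulary_)
print(X_toy_test)
```

Here 'soundtrack' never occurred in the training documents, so it contributes nothing to the test vector.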

The vectorizer stores the vocabulary as a dictionary that maps a token to its index in the feature vector. The code below looks up the indexes of some example words:

In [14]:

vocabulary = vectorizer.vocabulary_
print(vocabulary['the'])
print(vocabulary['horse'])
print(vocabulary['smile'])

print(f'Vocabulary size = {len(vocabulary)}')

45903
23568
42626
Vocabulary size = 51903


# 2. Naïve Bayes Classifier

The code above has obtained the feature vectors and lists of labels. The data is now ready for use
with scikit-learn's classifiers.

Scikit-learn contains several different variants of naïve Bayes for different kinds of data. For our bag of words data, we need to use the [MultinomialNB class](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB).


TODO 2.1: Look at the documentation for MultinomialNB and write code to train a NB classifier using `X_train` and `train_labels`.

In [27]:
# WRITE YOUR CODE HERE
from sklearn.naive_bayes import MultinomialNB

nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, train_labels)

MultinomialNB()

(45615, 51903)
(45615,)



Now we have a trained model, we would like to evaluate its performance on some test data. 

TODO 2.2: Refer to the documentation again and predict the labels for the test set. Use `X_test` as the inputs to the classifier.

In [31]:
# WRITE YOUR CODE HERE
import numpy as np

predicted_test_labels = nb_classifier.predict(X_test)

print(np.shape(X_test))
print(np.shape(test_labels))


(12284, 51903)
(12284,)


We can compute standard metrics for classifier performance using [scikit-learn's metrics library](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules). A useful function for multi-class classification (when there are more than two classes) is the [classification report function](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-report).

TODO 2.3: Refer again to the documentation, and compute accuracy, precision, recall and F1 scores on the test set. 

In [34]:
# WRITE YOUR CODE HERE
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

micro_f1 = f1_score(test_labels, predicted_test_labels, average='micro')
macro_f1 = f1_score(test_labels, predicted_test_labels, average='macro')
precision = precision_score(test_labels, predicted_test_labels, average='macro')
recall = recall_score(test_labels, predicted_test_labels, average='macro')

print(f'Accuracy: {accuracy_score(test_labels, predicted_test_labels)}')
print(f'Micro F1 Score: {micro_f1}')
print('Note: the following values all use macro aggregation')
print(f'Macro F1 Score: {macro_f1}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')

Accuracy: 0.5891403451644416
Micro F1 Score: 0.5891403451644416
Note: the following values all use macro aggregation
Macro F1 Score: 0.5717257637198551
Precision: 0.5856910631417499
Recall: 0.5819831225339609
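
The accuracy and micro F1 above are identical, which is expected when every sample has exactly one label: micro averaging pools true positives, false positives and false negatives over all classes, whereas macro averaging takes the unweighted mean of the per-class scores. The sketch below works this through by hand on made-up labels; the function name and the toy data are assumptions for illustration.

```python
def per_class_scores(y_true, y_pred, cls):
    """Return (tp, fp, fn, f1) for one class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return tp, fp, fn, f1


# Made-up gold labels and predictions for three classes.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2, 2]
labels = sorted(set(y_true))

scores = [per_class_scores(y_true, y_pred, c) for c in labels]

# Macro: average the per-class F1 scores, treating every class equally.
macro_f1 = sum(f1 for _, _, _, f1 in scores) / len(labels)

# Micro: pool the counts over all classes, then compute a single F1.
tp = sum(s[0] for s in scores)
fp = sum(s[1] for s in scores)
fn = sum(s[2] for s in scores)
micro_p = tp / (tp + fp)
micro_r = tp / (tp + fn)
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy}, micro F1={micro_f1}, macro F1={macro_f1}")
```

Macro averaging is usually the more informative summary when the classes are imbalanced, because a poorly handled minority class pulls the average down.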


Now, let's examine the classifier that we learned. If you don't follow what's happening here, you may wish to refer back to the slides on naïve Bayes classifiers or to [Jurafsky and Martin's textbook](https://web.stanford.edu/~jurafsky/slp3/4.pdf). 

Previously, we trained a MultinomialNB classifier. The trained classifier object stores all the probabilities that it learned during training, which are needed to make predictions. The log-likelihood of each word given each class is stored in the attribute `feature_log_prob_`. So, if your classifier object is named `classifier`, you can access the log-likelihoods with `classifier.feature_log_prob_`.
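
As a rough illustration of what these values contain, the sketch below computes smoothed word likelihoods from toy counts using P(w | c) = (count(w, c) + α) / (Σ count(w', c) + α·|V|), then takes the log. The counts are made up, and α = 1 (add-one smoothing) is used here because it is the default `alpha` of MultinomialNB.

```python
import math

# Hypothetical word counts per class; each list is aligned with `words`.
words = ["happy", "hate", "the"]
counts = {
    "negative": [1, 4, 10],
    "positive": [6, 0, 9],
}
alpha = 1.0  # add-one (Laplace) smoothing

feature_log_prob = {}
for cls, row in counts.items():
    # Denominator: total count in the class plus alpha for every vocabulary item.
    total = sum(row) + alpha * len(words)
    feature_log_prob[cls] = [math.log((c + alpha) / total) for c in row]

for cls, log_probs in feature_log_prob.items():
    probs = [math.exp(lp) for lp in log_probs]
    print(cls, dict(zip(words, probs)))
```

Note that 'hate' never occurs in the toy positive class, yet smoothing still gives it a small non-zero probability, so a single unseen word cannot force the class probability to zero.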

TODO 2.4: Print out the likelihood of the words 'happy' and 'hate' in each class. Hint: look up the index of the chosen words in `vocabulary`. The rows of `feature_log_prob_` correspond to classes, and the columns to words.

In [43]:
### CHANGE THE NAME OF THE CLASSIFIER VARIABLE BELOW TO USE YOUR TRAINED CLASSIFIER
feat_likelihoods = np.exp(nb_classifier.feature_log_prob_)  # Use exponential to convert the logs back to probabilities
###

# WRITE YOUR CODE HERE
happy_index = vocabulary['happy']
hate_index = vocabulary['hate']

classes = {
    0: 'negative',
    1: 'neutral',
    2: 'positive'
}

for clss, clssname in classes.items():
    happy_prob = feat_likelihoods[clss, happy_index]
    hate_prob = feat_likelihoods[clss, hate_index]
    print(f'The probability of the word \'happy\' being in class \'{clssname}\' is: {happy_prob}')
    print(f'The probability of the word \'hate\' being in class \'{clssname}\' is: {hate_prob}')

# These probabilities are small, but correct. Each one is P(word | class), and within a
# class the probabilities of all the words in the vocabulary must sum to one.


The probability of the word 'happy' being in class 'negative' is: 0.00016529294824543762
The probability of the word 'hate' being in class 'negative' is: 0.0005092809756751326
The probability of the word 'happy' being in class 'neutral' is: 0.00010482340115725026
The probability of the word 'hate' being in class 'neutral' is: 8.957636098892291e-05
The probability of the word 'happy' being in class 'positive' is: 0.0017517544215263088
The probability of the word 'hate' being in class 'positive' is: 5.334209566158069e-05


The sentiment classes are negative (0), neutral (1) and positive (2). 

QUESTION: Which class has the strongest association with 'happy' and with 'hate'?

ANSWER: The positive class (class 2) has the strongest association with the word 'happy'. The negative class (class 0) has the strongest association with the word 'hate'.

A key part of evaluating a classifier is investigating the errors it makes to better understand its limitations. 
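
The idea can be sketched with plain Python lists before turning to array indexing: pair each text with its gold label and prediction, and keep the cases where the two disagree. The tweets and labels below are made up for illustration.

```python
# Made-up texts with gold labels and predictions (0 = negative, 1 = neutral, 2 = positive).
tweets = ["great day", "worst film ever", "see you at 5"]
gold = [2, 0, 1]
predicted = [2, 1, 1]

# Keep only the samples where the prediction differs from the gold label.
errors = [
    (text, g, p)
    for text, g, p in zip(tweets, gold, predicted)
    if g != p
]

for text, g, p in errors:
    print(f"Tweet: {text}; true label = {g}, prediction = {p}.")
```

With NumPy, the same selection is done with a boolean mask, which requires the data to be an array rather than a Python list.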

TODO 2.5: Complete the code below to print out some misclassified tweets along with their predicted and true labels.

In [50]:
error_indexes = predicted_test_labels != np.array(test_labels)  # compare predictions to gold labels

# get the text of tweets where the classifier made an error:
tweets_err = np.array(test_tweets)[error_indexes]

### WRITE YOUR CODE HERE
# test_labels is a Python list, so convert it to an array before applying the boolean mask:
pred_err = predicted_test_labels[error_indexes]
label_err = np.array(test_labels)[error_indexes]
####

for i in range(10):  # just print the first ten
    print(f'Tweet: {tweets_err[i]}; true label = {label_err[i]}, prediction = {pred_err[i]}.')

# 3. Logistic Regression Classifier

Another simple, linear classifier is logistic regression. This classifier does not rely on the conditional independence assumption, so it can better model features that are highly correlated with each other. Scikit-learn provides the [LogisticRegression class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), which has a very similar interface to the naïve Bayes classifier.

TODO 3.1: Train a logistic regression classifier, referring to the scikit-learn documentation as required.

In [51]:
# WRITE YOUR CODE HERE
from sklearn.linear_model import LogisticRegression

lr_classifier = LogisticRegression()

lr_classifier.fit(X_train, train_labels)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()

TODO 3.2: Obtain predictions on the test set.

In [52]:
# WRITE YOUR CODE HERE
predictions_logreg = lr_classifier.predict(X_test)

TODO 3.3: Compute accuracy, precision, recall and F1 scores on the test set using [scikit-learn's metrics library.](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules)

In [53]:
# WRITE YOUR CODE HERE
accuracy_lr = accuracy_score(test_labels, predictions_logreg)
f1_score_lr = f1_score(test_labels, predictions_logreg, average='macro')
precision_lr = precision_score(test_labels, predictions_logreg, average='macro')
recall_lr = recall_score(test_labels, predictions_logreg, average='macro')

print(f'Accuracy: {accuracy_lr}')
print(f'F1 Score: {f1_score_lr}')
print(f'Precision: {precision_lr}')
print(f'Recall: {recall_lr}')


Accuracy: 0.5943503744708564
F1 Score: 0.5685870965819503
Precision: 0.5932357044681887
Recall: 0.568726803590132


QUESTION: How does the performance of logistic regression compare with naïve Bayes?

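ANSWER: The two are very close. Logistic regression has slightly higher accuracy (0.594 vs. 0.589) and macro precision (0.593 vs. 0.586), but slightly lower macro recall (0.569 vs. 0.582) and macro F1 (0.569 vs. 0.572). Neither classifier performs well on this task. Note also that the solver stopped before converging, so the logistic regression result may change with a larger `max_iter`.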

The logistic regression classifier works by learning a weight for each feature that indicates its importance in predicting a class. These weights are stored in the `coef_` attribute of the LogisticRegression object, which has rows corresponding to classes, and columns corresponding to words in the vocabulary. 
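
To show how such weights turn into a decision, the sketch below scores one toy feature vector against each class and converts the scores to probabilities with a softmax. This assumes a softmax (multinomial) formulation; depending on the scikit-learn version and settings, a one-vs-rest scheme may be used instead. All weights, biases and the vocabulary are made-up values.

```python
import math

class_names = ["negative", "neutral", "positive"]
toy_vocab = {"happy": 0, "hate": 1, "today": 2}

# Hypothetical weights: one row per class, one column per word.
weights = [
    [-1.0, 2.0, 0.0],   # negative
    [0.0, -0.5, 0.2],   # neutral
    [1.5, -1.5, 0.1],   # positive
]
bias = [0.0, 0.1, 0.0]

# Bag-of-words vector for the text "happy today".
x = [1, 0, 1]

# Score for each class: dot product of its weight row with x, plus its bias.
scores = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

# Softmax turns the scores into probabilities (subtract the max for stability).
exps = [math.exp(s - max(scores)) for s in scores]
probs = [e / sum(exps) for e in exps]

predicted = class_names[probs.index(max(probs))]
print(dict(zip(class_names, probs)), "->", predicted)
```

A positive weight pushes the score of its class up when the word is present, and a negative weight pushes it down, which is why the weights can be read as evidence for or against each class.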

TODO 3.4: Print out the weights for 'happy' and 'hate' for each class.

In [56]:
### WRITE YOUR CODE HERE

feat_weights = lr_classifier.coef_  # rows are classes, columns are words

happy_index = vocabulary['happy']
hate_index = vocabulary['hate']

classes = {
    0: 'negative',
    1: 'neutral',
    2: 'positive'
}

for clss, clssname in classes.items():
    # Read the values from the logistic regression weights, not the naïve Bayes likelihoods:
    happy_weight = feat_weights[clss, happy_index]
    hate_weight = feat_weights[clss, hate_index]
    print(f'The weight of the word \'happy\' in class \'{clssname}\' is: {happy_weight}')
    print(f'The weight of the word \'hate\' in class \'{clssname}\' is: {hate_weight}')


QUESTION: Are the weights what you would expect to see?

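ANSWER: We would expect 'happy' to have its highest weight in the positive class and 'hate' to have its highest weight in the negative class. Unlike the naïve Bayes likelihoods, the weights can be negative: a negative weight means the word counts as evidence against that class.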

The code below prints out the words with the highest weights for each class. We use numpy's `argsort` function to get the indexes of the sorted weights. Run the code below to show the result: 

In [69]:
n_feats_to_show = 5

# Flip the index so that values are keys and keys are values:
keys = vectorizer.vocabulary_.values()
values = vectorizer.vocabulary_.keys()
vocab_inverted = dict(zip(keys, values))

for c, weights_c in enumerate(lr_classifier.coef_):
    print(f'\nWeights for class {c}:\n')
    strongest_idxs = np.flip(np.argsort(weights_c)[-n_feats_to_show:])

    for idx in strongest_idxs:
        print(f'{vocab_inverted[idx]} with weight {weights_c[idx]}')


Weights for class 0:

sucks with weight 3.1526321346127872
worst with weight 3.0342095726396345
stupid with weight 2.6889778179393637
terrible with weight 2.549632390159403
disappointed with weight 2.461803150472878

Weights for class 1:

rush with weight 1.3409116343549219
paterno with weight 1.278296534704651
yakub with weight 1.27282481391156
bama with weight 1.2597113294061348
saturday\u002c with weight 1.2514631288868407

Weights for class 2:

amazing with weight 3.0595739864995113
congrats with weight 2.774124946022347
exciting with weight 2.6281831636826163
awesome with weight 2.553376158873567
brilliant with weight 2.4569745211237803
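
The indexing trick used above can be sketched in plain Python: sort the indexes by weight in ascending order (the role played by `argsort`), take the last few, and reverse them so the largest weight comes first. The weights and the inverted vocabulary below are made up.

```python
# Made-up weights for one class and a made-up index-to-word mapping.
weights_c = [0.2, -1.3, 3.1, 0.0, 2.4]
toy_vocab_inverted = {0: "today", 1: "thanks", 2: "sucks", 3: "the", 4: "stupid"}
n_to_show = 2

# Indexes ordered from the smallest weight to the largest (like np.argsort).
order = sorted(range(len(weights_c)), key=lambda i: weights_c[i])

# Take the last n indexes and reverse them so the strongest comes first.
strongest = order[-n_to_show:][::-1]

top = [(toy_vocab_inverted[i], weights_c[i]) for i in strongest]
print(top)
```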


TODO 3.5: Use the same code as for naïve Bayes to print out examples of misclassified tweets and their labels. Hint: you should be able to copy and paste your code from above :)

In [None]:
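error_indexes = predictions_logreg != np.array(test_labels)  # compare predictions to gold labels

# get the text of tweets where the classifier made an error:
tweets_err = np.array(test_tweets)[error_indexes]

### WRITE YOUR CODE HERE
# Select the predicted and gold labels for the misclassified tweets:
pred_err = predictions_logreg[error_indexes]
gold_err = np.array(test_labels)[error_indexes]
###

for i in range(10):  # just print the first ten
    print(f'Tweet: {tweets_err[i]}; true label = {gold_err[i]}, prediction = {pred_err[i]}.')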

# 4. Optional: Lemmatization and N-grams

You only need to do this section if you finish the previous sections before the end of the lab.

In the previous lab, we tried out lemmatization. This is useful for reducing the size of the vocabulary. Could it help us here?

To apply lemmatization, we have to go back to the CountVectorizer and define a new tokenizer that will carry out the extra step of lemmatization. Run the code below to test this out:

In [70]:
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 

class LemmaTokenizer(object):
    
    def __init__(self):
        self.wnl = WordNetLemmatizer()
        
    def __call__(self, tweets):
        return [self.wnl.lemmatize(self.wnl.lemmatize(self.wnl.lemmatize(tok, pos='n'), pos='v'), pos='a') for tok in word_tokenize(tweets)]
    
vectorizer = CountVectorizer(tokenizer=LemmaTokenizer())

vectorizer.fit(train_tweets)
X_train = vectorizer.transform(train_tweets)
X_test = vectorizer.transform(test_tweets)

# Print out some of the features in the vocabulary:
print(list(vectorizer.vocabulary_)[:20])



['``', 'qt', '@', 'user', 'in', 'the', 'original', 'draft', 'of', '7th', 'book', ',', 'remus', 'lupin', 'survive', 'battle', 'hogwarts', '.', '#', 'happybirthdayremuslupin']


In [71]:
print(f'Vocabulary size: {len(vectorizer.vocabulary_)}')

Vocabulary size: 45312


TODO 4.1: Now, repeat your training of the logistic regression using the new features, and compare its performance with the previous classifiers.

In [73]:
### WRITE YOUR OWN CODE HERE
lr_clf_lem = LogisticRegression()

lr_clf_lem.fit(X_train, train_labels)

lr_lem_pred = lr_clf_lem.predict(X_test)




In [74]:
accuracy_lr_lem = accuracy_score(test_labels, lr_lem_pred)
print(f'Accuracy: {accuracy_lr_lem}')

Accuracy: 0.6015955714750896


QUESTION: Did lemmatization bring about any improvements on this dataset?

ANSWER: Yes, slightly. With lemmatization, accuracy rose from 0.594 to 0.602, while the vocabulary shrank from 51,903 to 45,312 features. The other metrics could be calculated and compared too.

The bag of words is a very simple representation of the tweets that does not capture enough information to make accurate sentiment classifications. Another way to improve it could be to use bigrams instead of single words as our features. Bigrams are pairs of words that occur one after another in the text. Bigrams are a kind of 'n-gram', where 'n=2'. 

To extract bigrams, we again modify our CountVectorizer. This class has a parameter `ngram_range`, which determines the range of sizes of n-grams the vectorizer will include. If we set `ngram_range=(1,1)` we have our standard bag of words. If we set it to `ngram_range=(2,2)`, we use bigrams instead. Setting `ngram_range=(1,2)` uses both single tokens (unigrams) and bigrams.
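
The three settings can be illustrated with a small pure-Python helper that joins consecutive tokens. The function `extract_ngrams` is hypothetical and only imitates the meaning of `ngram_range`; it is not the vectorizer's own code.

```python
def extract_ngrams(tokens, ngram_range):
    """Return all n-grams with min_n <= n <= max_n as space-joined strings."""
    min_n, max_n = ngram_range
    features = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the token list.
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features


tokens = ["the", "battle", "of", "hogwarts"]

unigrams = extract_ngrams(tokens, (1, 1))
bigrams = extract_ngrams(tokens, (2, 2))
both = extract_ngrams(tokens, (1, 2))

print(unigrams)
print(bigrams)
print(both)
```

A text with k tokens yields k - 1 bigrams, and most bigrams are rare, so the bigram vocabulary is typically much larger and sparser than the unigram one.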

TODO 4.2: Create a new CountVectorizer that extracts bigram features instead of unigrams (single tokens) and uses the LemmaTokenizer.

In [75]:
### WRITE YOUR CODE HERE
vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(), ngram_range=(2, 2))
###

vectorizer.fit(train_tweets)
X_train = vectorizer.transform(train_tweets)
X_test = vectorizer.transform(test_tweets)


# Print out some of the features in the vocabulary:
print(list(vectorizer.vocabulary_)[:20])



['`` qt', 'qt @', '@ user', 'user in', 'in the', 'the original', 'original draft', 'draft of', 'of the', 'the 7th', '7th book', 'book ,', ', remus', 'remus lupin', 'lupin survive', 'survive the', 'the battle', 'battle of', 'of hogwarts', 'hogwarts .']


TODO 4.3: Now, repeat your training of the logistic regression or naïve Bayes classifier using the new features, and compare its performance with the previous classifiers.

In [76]:
### WRITE YOUR OWN CODE HERE
lr_clf_lem = LogisticRegression()

lr_clf_lem.fit(X_train, train_labels)




LogisticRegression()

In [77]:
lr_lem_preds = lr_clf_lem.predict(X_test)

accuracy_lr_lem = accuracy_score(test_labels, lr_lem_preds)
print(f'Accuracy: {accuracy_lr_lem}')

Accuracy: 0.5436339954412244


QUESTION: Do bigrams improve performance on this dataset?

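ANSWER: No. Using bigrams alone, accuracy dropped to 0.544, compared with 0.602 for the lemmatized unigram features.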

# 5. Optional: Lexicon Features

You only need to do this part if you finish the other parts before the end of the lab session. 

The NLTK library contains sentiment lexicons, which are lists of words with negative or positive connotations. 

In [78]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

analyser = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/lukekirwan/nltk_data...


Now have a look at the sentiment scores for some words in the lexicon by running the code below.

QUESTION: What do the scores mean and why do some words have no score?

ANSWER: The sign of a score gives the polarity of the word (positive or negative sentiment) and its magnitude gives the strength. Words with no score are not in the lexicon, most likely because they carry no clear positive or negative sentiment on their own (e.g., 'tablecloth'). Some words, such as 'not', may also have been left out of the lexicon on purpose.

In [79]:
testwords = ['happy', 'wonderful', 'horrible', 'boring', 'tablecloth', 'not']

for word in testwords:
    if word in analyser.lexicon:
        print(f'{word}: {analyser.lexicon[word]}')
    else:
        print(f'{word}: NOT IN LEXICON')

happy: 2.7
wonderful: 2.7
horrible: -2.5
boring: -1.3
tablecloth: NOT IN LEXICON
not: NOT IN LEXICON


Now we would like to use this lexicon to compute counts of all positive and negative words. Let's start by recording whether the words in our vocabulary are positive or negative:

In [80]:
# get the Vader lexicon scores for each word in our vocabulary
vectorizer = CountVectorizer(tokenizer=Tokenizer())

vectorizer.fit(train_tweets)
X_train = vectorizer.transform(train_tweets)
X_test = vectorizer.transform(test_tweets)

# Print out some of the features in the vocabulary:
print(list(vectorizer.vocabulary_)[:20])

vocabulary = vectorizer.vocabulary_

lex_pos_scores = np.zeros((1, len(vocabulary)))
lex_neg_scores = np.zeros((1, len(vocabulary)))

for i, term in enumerate(vocabulary):
    if term in analyser.lexicon and analyser.lexicon[term] > 0:
        lex_pos_scores[0, i] = 1
    elif term in analyser.lexicon and analyser.lexicon[term] < 0:
        lex_neg_scores[0, i] = 1



['``', 'qt', '@', 'user', 'in', 'the', 'original', 'draft', 'of', '7th', 'book', ',', 'remus', 'lupin', 'survived', 'battle', 'hogwarts', '.', '#', 'happybirthdayremuslupin']
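
One detail worth checking in the loop above: `enumerate(vocabulary)` yields positions in dictionary order, whereas the feature column of a term is the value stored in `vocabulary[term]`. The printed list starts with the tokens of the first tweet, while 'the' was shown earlier to have index 45903, so the two orderings need not agree. The sketch below, with a made-up vocabulary and lexicon, shows the indicator vectors being filled by stored column instead.

```python
# Made-up vocabulary: keys in order of first appearance, values are feature columns.
toy_vocabulary = {"the": 3, "awful": 0, "film": 1, "great": 2}
toy_lexicon = {"great": 3.1, "awful": -2.0}  # made-up sentiment scores

pos_flags = [0] * len(toy_vocabulary)
neg_flags = [0] * len(toy_vocabulary)

# Use the stored column index, not the position in the dictionary.
for term, col in toy_vocabulary.items():
    score = toy_lexicon.get(term, 0.0)
    if score > 0:
        pos_flags[col] = 1
    elif score < 0:
        neg_flags[col] = 1

# For comparison: positions given by enumerate() differ from the stored columns.
positions = {term: i for i, term in enumerate(toy_vocabulary)}

print(pos_flags, neg_flags)
print(positions)
```

If the positions were used as columns here, 'great' would be flagged in column 3, which actually belongs to 'the'.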


Now let's compute the counts of positive and negative words in the dataset:

In [81]:
# Compute the scores for each instance in the data set. 

# Multiply the lexicon scores by the feature vectors, then sum over the 
# vocabulary to get the total positive and total negative counts:
lex_pos_train = np.sum(X_train.multiply(lex_pos_scores), axis=1)
lex_pos_test = np.sum(X_test.multiply(lex_pos_scores), axis=1)

lex_neg_train = np.sum(X_train.multiply(lex_neg_scores), axis=1)
lex_neg_test = np.sum(X_test.multiply(lex_neg_scores), axis=1)

In [89]:
print(f'Positive word count: {np.sum(lex_pos_train) + np.sum(lex_pos_test)}')
print(f'Negative word count: {np.sum(lex_neg_train) + np.sum(lex_neg_test)}')

Positive word count: 81573.0
Negative word count: 21649.0


Finally, we can append the counts to the feature vector and treat them as extra features:

In [90]:
from scipy.sparse import hstack

X_train = hstack((X_train, lex_pos_train, lex_neg_train))
X_test = hstack((X_test, lex_pos_test, lex_neg_test))
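
The two steps above (multiply by the indicator vector and sum, then append the totals as extra columns) can be sketched on a toy dense matrix in plain Python. The counts and indicator vectors are made up, and real feature matrices are sparse, so this is an illustration of the arithmetic only.

```python
# Toy count matrix: one row per document, one column per vocabulary item.
X_toy = [
    [1, 0, 2, 0],
    [0, 1, 0, 1],
]
# Made-up indicator vectors: 1 marks a positive (or negative) lexicon word.
pos_flags = [0, 0, 1, 0]
neg_flags = [1, 0, 0, 0]


def lexicon_count(row, flags):
    """Multiply counts by the indicator element-wise and sum over the vocabulary."""
    return sum(count * flag for count, flag in zip(row, flags))


# Append the two totals to each row as extra feature columns.
X_toy_extended = [
    row + [lexicon_count(row, pos_flags), lexicon_count(row, neg_flags)]
    for row in X_toy
]
print(X_toy_extended)
```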

TODO 5.1: Use the new X_train and X_test feature vectors to train and evaluate your classifier. 
Does adding the lexicon features improve performance?

In [92]:
### WRITE YOUR OWN CODE HERE
lr_classifier = LogisticRegression()

lr_classifier.fit(X_train, train_labels)



LogisticRegression()

In [93]:
preds = lr_classifier.predict(X_test)

accuracy_lr = accuracy_score(test_labels, preds)

print(accuracy_lr)

0.5990719635297949


ANSWER: No real improvement. Accuracy with the lexicon features is 0.599, compared with 0.594 for logistic regression without them.