
# Assignment 5: Evaluation and Sentiment Lexicons

Course: COMP 3703 - Natural Language Processing

Due: Wednesday, February 15 at 11:59pm Mountain time

---

YOUR NAME HERE: Xavier Zuvekas

---

# Question 1: Sentiment Lexicons

This assignment very much builds off of the previous assignment. You will effectively create the same model as before using word counts in a Naive Bayes classifier. However, the steps of that process will be a little more formalized here since you will create a second model to compare to the first.

To start, import everything we will need by running the code cell below.

In [None]:
### RUN THIS ###
import nltk
import random
nltk.download('vader_lexicon')
nltk.download('movie_reviews')
nltk.download('punkt')

from nltk.sentiment import vader
from nltk.classify.naivebayes import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.probability import FreqDist

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Similar to the previous assignment, the `movie_reviews` corpus is what we will use and some formatting is done for you here. This formatting code is not quite the same code that was provided last time, however. The documents are gathered for you in a variable called `raw_docs`; each entry is a tuple holding the words of one document as a list of strings (unigrams) along with the review label (`'pos'` or `'neg'`).

Note that `raw_docs` is shuffled for you here.
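
For reference, a list shaped like `raw_docs` could be built roughly as follows. This is only a sketch, not the course's provided cell: `build_raw_docs` and its `seed` parameter are hypothetical names, and the `corpus` argument is assumed to expose `categories()`, `fileids(category)`, and `words(fileid)` the way `nltk.corpus.movie_reviews` does.

```python
import random

def build_raw_docs(corpus, seed=None):
    """Collect (tokens, label) tuples from a categorized corpus and shuffle them.

    `corpus` is assumed to provide categories(), fileids(category) and
    words(fileid), as nltk.corpus.movie_reviews does.
    """
    docs = [
        (list(corpus.words(fileid)), category)
        for category in corpus.categories()
        for fileid in corpus.fileids(category)
    ]
    # Shuffle so that positive and negative reviews are interleaved.
    random.Random(seed).shuffle(docs)
    return docs

# Hypothetical usage in this notebook (needs the downloads from the first cell):
# raw_docs = build_raw_docs(movie_reviews)
# print(raw_docs[:3])
```

Shuffling matters because the later train/test split takes the first and last portions of the list; an unshuffled list grouped by label would leave one class out of the test set.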

Run the following code cell to make the `raw_docs` variable and print the first few tuples so you can see the structure of the dataset.

In [None]:
from nltk import FreqDist

word_counts = []
for doc in raw_docs:
    tokens = doc[0]
    fdist = FreqDist(tokens)
    label = doc[1]
    word_counts.append((fdist, label))
out = word_counts[:5]
print(out)

[(FreqDist({',': 38, 'the': 32, '.': 29, 'of': 26, 'a': 22, '-': 20, "'": 17, 'it': 17, 's': 15, 'that': 10, ...}), 'pos'), (FreqDist({',': 86, 'the': 71, 'of': 48, 'a': 41, 'and': 41, '.': 38, 'that': 33, "'": 29, 'to': 26, '-': 23, ...}), 'pos'), (FreqDist({',': 29, '.': 23, "'": 20, 'the': 20, 's': 13, 'to': 12, 'it': 11, 'and': 9, 'of': 9, 'a': 8, ...}), 'pos'), (FreqDist({',': 34, '.': 27, 'the': 26, 'a': 16, "'": 13, 'of': 11, 'and': 10, 'to': 10, 's': 10, 'it': 10, ...}), 'neg'), (FreqDist({'the': 83, ',': 82, '.': 51, "'": 41, 'trek': 36, 'a': 33, 'and': 32, 'of': 31, 'star': 27, ':': 25, ...}), 'pos')]


## Part (b)

Here is where the sentiment lexicon comes into play. The lexicon we will use is from the VADER sentiment analyzer in the `nltk.sentiment` package. Run the following code cell to create the `lexicon` variable, a Python dictionary whose keys are a pre-defined set of words and whose values are scores reflecting how positive or negative each word is.

In [None]:
### RUN THIS ###
lexicon = vader.SentimentIntensityAnalyzer().make_lex_dict() # Get lexicon

print("happy:", lexicon['happy'])        # Very positive
print("terrible:", lexicon['terrible'])  # Very negative
print("okay:", lexicon['okay'])          # Fairly neutral

happy: 2.7
terrible: -2.1
okay: 0.9


The sentiment values in `lexicon` will be used as the features of another model, entirely separate from the one based on word counts. Following the same general structure as in part (a), create *another new* list based on `raw_docs`. For each tuple in the list,
* the first element is a Python dictionary containing:
  * keys: all words in the document that appear in `lexicon`
  * values: the score from `lexicon` associated with each key
* the second element is the label associated with the document, as before.

*Recall that you can check if a key `key` is present in a dictionary with a conditional such as* `if key in dict`.

Note that **word counts are not accounted for with this feature set**. For example, the review
```
I am so happy I saw this great movie.
```
has the same feature set as
```
I am so happy happy happy happy happy I saw this great movie.
```
The feature set for *both* reviews would be
```
{'happy': 2.7,
 'great': 3.1}
```
since `'happy'` and `'great'` are the only tokens in these reviews that appear in `lexicon`. The number of times `'happy'` appears does not change the score.
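
The point can be checked with a small sketch. It assumes a toy two-entry lexicon containing only the two scores quoted above (the real VADER lexicon is far larger), and `sentiment_features` is a hypothetical helper name rather than part of NLTK.

```python
# Toy lexicon holding only the two scores quoted in the text above.
toy_lexicon = {'happy': 2.7, 'great': 3.1}

def sentiment_features(tokens, lexicon):
    """Map each distinct token found in `lexicon` to its score."""
    return {word: lexicon[word] for word in set(tokens) if word in lexicon}

review_a = "I am so happy I saw this great movie .".split()
review_b = "I am so happy happy happy happy happy I saw this great movie .".split()

features_a = sentiment_features(review_a, toy_lexicon)
features_b = sentiment_features(review_b, toy_lexicon)

# Repeating 'happy' adds no new key and does not change its value.
print(features_a == features_b)
```

Because a dictionary holds each key once, the repeated token cannot contribute more than a single entry.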

Print out the first few elements of your new sentiment feature-extracted list to confirm the formatting.

In [None]:
### YOUR CODE HERE ###
sentiment_docs = []
for doc, label in raw_docs:
    sentiment_dict = {}
    for word in set(doc):
        if word in lexicon:
            sentiment_dict[word] = lexicon[word]
    sentiment_docs.append((sentiment_dict, label))
print(sentiment_docs[:5])


[({'mock': -1.8, 'horror': -2.7, 'blockbuster': 2.9, 'outrage': -2.3, 'caring': 2.2, 'anti': -1.3, 'propaganda': -1.0, 'great': 3.1, 'silly': 0.1, 'dizzy': -0.9, 'fresh': 1.3, 'save': 2.2, 'trite': -0.8, 'killing': -3.4, 'punishing': -2.6, 'devotees': 0.5, 'romantic': 1.7, 'thrilling': 2.1, 'indoctrinated': -0.4, 'rich': 2.6, 'talent': 1.8, 'better': 1.9, 'amusement': 1.5, 'entertainment': 1.8, 'battle': -1.6, 'energy': 1.1, 'strengths': 1.7, 'controversial': -0.8, 'brilliant': 2.8, 'super': 2.9, 'lack': -1.3, 'war': -2.9, 'no': -1.2, 'create': 1.1, 'conflicts': -1.6, 'novel': 1.3, 'worth': 0.9, 'enthusiastically': 2.6, 'hypocritical': -2.0, 'pleasures': 1.9, 'intelligence': 2.1, 'faithfully': 1.8, 'violent': -2.9, 'fun': 2.3, 'hilarious': 1.7, 'like': 1.5, 'battles': -1.6, 'true': 1.8, 'joys': 2.2, 'wish': 1.7, 'plays': 1.0, 'fight': -1.6, 'special': 1.7, 'nonsense': -1.7, 'threatened': -2.0, 'played': 1.4}, 'pos'), ({'debt': -1.5, 'unaware': -0.8, 'inspirational': 2.3, 'ironic': -0.5

## Part (c)

Since we now have two different datasets to create two classifiers for, we will want to reduce the amount of redundant code we will write by writing functions. For this part, you will write a function for splitting the lists we made into training and test sets.

Define the function `split_train_and_test` such that:
* the first parameter is the dataset, presumed to be a list 
* the second parameter is the percentage (presumed to be a value between 0.0 and 1.0) to be used for the training set; the default value for this parameter should be 0.9

*If you are not familiar with how to set default values for parameters in Python, see [this page](https://www.w3schools.com/python/gloss_python_function_default_parameter.asp).*
  
The function should then get separate lists for the train (the first portion of the list) and test (the last portion of the list) sets, then return both lists as a tuple, i.e. `(train_set, test_set)`.

Just define the function here. We will call it in the next part.

In [None]:
### YOUR CODE HERE ###
def split_train_and_test(dataset, train_percent=0.9):

    # Determine the size of the training set based on the percentage
    train_size = int(len(dataset) * train_percent)
    
    # Split the dataset into training and testing sets
    train_set = dataset[:train_size]
    test_set = dataset[train_size:]
    
    # Return the training and testing sets as a tuple
    return (train_set, test_set)
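
As a quick sanity check, the function can be tried on a toy list. The sketch below repeats the same logic in compact form so that it runs on its own; the list `toy` is made up for illustration.

```python
def split_train_and_test(dataset, train_percent=0.9):
    # Same logic as the cell above, repeated so this snippet runs on its own.
    train_size = int(len(dataset) * train_percent)
    return dataset[:train_size], dataset[train_size:]

toy = list(range(10))

# Default split: the first 90% goes to training.
default_train, default_test = split_train_and_test(toy)
print(len(default_train), len(default_test))

# Overriding the default with a 50/50 split.
custom_train, custom_test = split_train_and_test(toy, train_percent=0.5)
print(len(custom_train), len(custom_test))
```

Note that `int()` truncates, so when the product is not a whole number the training set gets the smaller size and the extra item lands in the test set.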


## Part (d)

Call your `split_train_and_test` function to split your two formatted lists into train and test. You should end up with 
* a training set and test set for the documents with *word counts* as the feature set, and 
* a training set and test set for the documents with *sentiment scores* as the feature set.

**Train two separate Naive Bayes Classifiers**, one for each training set. Remember that `NaiveBayesClassifier.train(train_set)` is the function call for creating a classifier this way.

Print the 10 most informative features for each classifier to confirm the models were trained.

In [None]:
### YOUR CODE HERE ###
# Split the word_counts list into train and test sets
train_set_wc, test_set_wc = split_train_and_test(word_counts)

# Train a NaiveBayesClassifier on the word_counts train_set
classifier_wc = NaiveBayesClassifier.train(train_set_wc)

# Print the 10 most informative features for the word_counts classifier.
# show_most_informative_features prints the table itself and returns None,
# so it is not wrapped in print().
print("Word Count Classifier")
classifier_wc.show_most_informative_features(10)

# Split the sentiment_docs list into train and test sets
train_set_sen, test_set_sen = split_train_and_test(sentiment_docs)

# Train a NaiveBayesClassifier on the sentiment_docs train_set
classifier_sen = NaiveBayesClassifier.train(train_set_sen)

# Print the 10 most informative features for the sentiment classifier
print("Sentiment Classifier")
classifier_sen.show_most_informative_features(10)


Word Count Classifier
Most Informative Features
               ludicrous = 1                 neg : pos    =     14.9 : 1.0
            breathtaking = 1                 pos : neg    =     13.5 : 1.0
               wonderful = 2                 pos : neg    =     13.4 : 1.0
                  avoids = 1                 pos : neg    =     12.7 : 1.0
                  boring = 2                 neg : pos    =     11.6 : 1.0
              astounding = 1                 pos : neg    =     11.4 : 1.0
                     bad = 5                 neg : pos    =     11.2 : 1.0
               affecting = 1                 pos : neg    =     10.8 : 1.0
              apparently = 2                 neg : pos    =     10.5 : 1.0
                  stupid = 2                 neg : pos    =     10.4 : 1.0
Sentiment Classifier
Most Informative Features
               ludicrous = -1.5              neg : pos    =     14.9 : 1.0
                  avoids = -0.7              pos : neg    =     12.7 : 1.0


# Question 2: Evaluation

*This question references the variables from question 1, so be sure to complete all previous parts before starting this one.*

Evaluation can be done using many methods and many metrics, as discussed in lecture. For this question, we will ultimately be calculating F1 scores for each model.

## Part (a)

Define a function called `confusion_matrix` such that
* the first parameter is a classifier with the function `classify` (which `NaiveBayesClassifier` does)
* the second parameter is a test set with documents formatted as we did in question 1

`confusion_matrix` should calculate the number of True Positives, False Positives, True Negatives, and False Negatives and return them in that order as a tuple. *(Remember that despite the name sounding like "two-ple", tuples can contain any number of elements.)* Similar to how you calculated accuracy in the previous assignment, calculating TP, FP, TN, and FN can be done by
* iterating over the test set, classifying each document on its features
* comparing the predicted label with the document's actual label

This part only defines this function. It will be called in the next part.

In [None]:
### YOUR CODE HERE ###
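def confusion_matrix(classifier, test_set):
    """Return (TP, FP, TN, FN), treating 'pos' as the positive class."""
    tp = fp = tn = fn = 0
    for features, label in test_set:
        predicted_label = classifier.classify(features)
        if label == 'pos':
            # Actual positive: correct prediction is a TP, otherwise an FN
            if predicted_label == 'pos':
                tp += 1
            else:
                fn += 1
        else:
            # Actual negative: predicting 'pos' is an FP, otherwise a TN
            if predicted_label == 'pos':
                fp += 1
            else:
                tn += 1
    # Show the raw counts in the order TP, FP, TN, FN
    print(tp, fp, tn, fn)
    return tp, fp, tn, fn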
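
The same tally can also be written with `collections.Counter` over (actual, predicted) label pairs. The sketch below is an alternative formulation, not the required solution: `KeywordClassifier` is a hypothetical stand-in that exposes `classify()` like `NaiveBayesClassifier`, `confusion_counts` is a made-up name, and the feature values in `toy_test_set` are illustrative only.

```python
from collections import Counter

class KeywordClassifier:
    """Hypothetical stand-in exposing classify(), like NaiveBayesClassifier."""
    def classify(self, features):
        return 'pos' if 'great' in features else 'neg'

def confusion_counts(classifier, test_set):
    """Tally (actual, predicted) label pairs, then read off TP, FP, TN, FN."""
    pairs = Counter(
        (label, classifier.classify(features)) for features, label in test_set
    )
    tp = pairs[('pos', 'pos')]
    fp = pairs[('neg', 'pos')]
    tn = pairs[('neg', 'neg')]
    fn = pairs[('pos', 'neg')]
    return tp, fp, tn, fn

toy_test_set = [
    ({'great': 3.1}, 'pos'),  # true positive
    ({'great': 3.1}, 'neg'),  # false positive
    ({'bad': -1.0}, 'neg'),   # true negative
    ({'bad': -1.0}, 'pos'),   # false negative
    ({'great': 3.1}, 'pos'),  # true positive
]
print(confusion_counts(KeywordClassifier(), toy_test_set))  # (2, 1, 1, 1)
```

A `Counter` returns 0 for pairs that never occur, so no cell of the matrix needs special handling.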


## Part (b)

Define a function called `f1_score` that, much like `confusion_matrix`, takes a classifier and a test set as parameters. The function should
* call `confusion_matrix` with the given parameters in order to calculate TP, FP, TN, and FN, then
* calculate and return the F1-score using the following formulas
$$
Precision = \frac{TP}{TP + FP}
$$

$$
Recall = \frac{TP}{TP + FN}
$$

$$
F_1 = \frac{2PR}{P+R}
$$

Call `f1_score` on each classifier and its corresponding test set and print the results.

In [None]:
### YOUR CODE HERE ###
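def f1_score(classifier, test_set):
    tp, fp, tn, fn = confusion_matrix(classifier, test_set)
    # Guard against division by zero when nothing is predicted 'pos'
    # or the test set has no 'pos' documents.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)

    return f1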


word_count_f1 = f1_score(classifier_wc, test_set_wc)
sentiment_f1 = f1_score(classifier_sen, test_set_sen)

print("F1-score for word count classifier:", word_count_f1)
print("F1-score for sentiment classifier:", sentiment_f1)


87 47 62 4
82 45 64 9
F1-score for word count classifier: 0.7733333333333333
F1-score for sentiment classifier: 0.7522935779816514
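
As a check on the printed values, substituting the precision and recall formulas into $F_1$ gives an equivalent form in terms of the raw counts, which are printed in the order TP, FP, TN, FN:

$$
F_1 = \frac{2PR}{P+R} = \frac{2\,TP}{2\,TP + FP + FN}
$$

For the word count classifier in the run shown here ($TP = 87$, $FP = 47$, $FN = 4$):

$$
F_1 = \frac{2 \cdot 87}{2 \cdot 87 + 47 + 4} = \frac{174}{225} \approx 0.7733
$$

For the sentiment classifier ($TP = 82$, $FP = 45$, $FN = 9$):

$$
F_1 = \frac{2 \cdot 82}{2 \cdot 82 + 45 + 9} = \frac{164}{218} \approx 0.7523
$$

Both agree with the printed scores. True negatives do not appear in the formula, so $F_1$ ignores how many negative reviews were classified correctly.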


Before answering the last part of this question, you can (optionally) perform a sort of "soft" cross-validation with your code by re-running all of your code cells to see how much the final F1 scores change each time. *(In Colab, go to Runtime -> Run all.)* Each run will probably take about 5-10 seconds depending on your connection. You should see that the results do not change too much each time.
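
A more systematic variant of this idea is k-fold cross-validation, where each slice of the data takes a turn as the test set. The following is a minimal sketch and not part of the assignment; `k_fold_splits` is a hypothetical helper, and it assumes the dataset is a list that has already been shuffled.

```python
def k_fold_splits(dataset, k=5):
    """Yield (train_set, test_set) pairs, using each fold once as the test set."""
    fold_size = len(dataset) // k
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any leftover items.
        end = len(dataset) if i == k - 1 else start + fold_size
        yield dataset[:start] + dataset[end:], dataset[start:end]

toy = list(range(10))
for train_set, test_set in k_fold_splits(toy, k=5):
    print(test_set, len(train_set))
```

Training and scoring a classifier on each pair and averaging the F1 scores would give a steadier estimate than any single split.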

## Part (c)

What do you observe about the resulting F1 scores? Are they similar? If so, why do you think the performance does not differ much between the two models? If one score is notably higher than the other, why might the higher-scoring model perform better?

**Provide your thoughts in the text cell below.** Your answer should have at least a couple sentences with insight regarding the differences between the features used in each classifier.

The resulting F1 scores for the word count and sentiment classifiers are similar, with a difference of only about 0.02, and a total accuracy ranging from 75%-82% across runs. This suggests that both classifiers are performing reasonably well and that the choice of feature set here does not make a significant difference in performance.

It's worth noting that the most informative features differ between the two classifiers. The word count classifier places the most weight on specific words like "ludicrous" and "breathtaking," while the sentiment classifier emphasizes words like "hatred" and "sucks" but also shares features such as "ludicrous" and "avoids" with the word count classifier.

These differences show that although the two classifiers use different feature sets built from the same reviews, both feature sets are effective at distinguishing positive from negative reviews with reasonable accuracy.