<a href="https://colab.research.google.com/github/xlzuvekas/Machine-Learning/blob/main/COMP_3703_Assignment_4_Naive_Bayes_and_Entropy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 4: Naive Bayes and Entropy

Course: COMP 3703 - Natural Language Processing

Due: Wednesday, February 8 at 11:59pm Mountain time

---

YOUR NAME HERE: Xavier Zuvekas

---

# Question 1: Naive Bayes

This problem involves quite a few `nltk` function calls. To avoid long explanations of how to use them, the code cells below set up most of the problem for you.

First, import everything we need and download the appropriate packages.

In [None]:
import nltk
nltk.download('movie_reviews')
nltk.download('punkt')

from nltk.classify.naivebayes import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.probability import FreqDist

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


The documents we want are in the `movie_reviews` corpus, but they are not formatted in a way that is nice for training a Naive Bayes classifier. Conveniently, the review text is already tokenized and lowercased, so we don't have to worry about that. The code below reformats the documents for you, with comments explaining what's happening on each line.

In [None]:
docs = []                                                    # Our custom dataset list
for label in movie_reviews.categories():                     # Categories are 'neg' and 'pos'
  files = movie_reviews.fileids(label)                       # Get list of file names associated with label
  for file in files:
    raw_words_from_file = movie_reviews.words(file)          # Get words in file (one movie review)
    word_set_with_features = FreqDist(raw_words_from_file)   # Use word counts as features
    words_with_label = (word_set_with_features, label)       # Put features together with label in a tuple
    docs.append(words_with_label)                            # Add tuple to our docs list

docs[:5]                                                     # Print a few docs to see formatting

[(FreqDist({',': 44, 'the': 38, '.': 34, 'it': 25, 'and': 20, 'to': 16, 'of': 16, "'": 16, 'a': 14, 'that': 13, ...}),
  'neg'),
 (FreqDist({',': 18, '.': 14, 'the': 13, "'": 13, 'a': 13, 's': 9, 'and': 8, 'of': 8, 'movie': 5, 'that': 4, ...}),
  'neg'),
 (FreqDist({'the': 42, ',': 31, '.': 22, 'is': 17, "'": 15, 'that': 13, 'of': 13, 'and': 12, 'it': 10, 'a': 10, ...}),
  'neg'),
 (FreqDist({',': 26, "'": 24, 'the': 21, '"': 20, '.': 20, 's': 18, '-': 15, 'to': 14, 'and': 11, 'of': 11, ...}),
  'neg'),
 (FreqDist({',': 54, 'the': 48, '.': 37, 'to': 26, 'a': 20, 'is': 18, 'and': 17, "'": 15, 'of': 15, '-': 15, ...}),
  'neg')]

Ultimately, you should end up with a variable called `docs` that is a list of tuples. Each tuple has a set of features - represented with the `FreqDist` class from the last assignment - and the associated label for that review, either `neg` or `pos`. Remember that features are any quantifiable characteristic of the text. In this case, we use counts of each token in the review text as features.

For example, if the (completely made up) review is
```
This is a bad movie that is bad because it's bad.
```
the actual word list in the `movie_reviews` corpus may have pre-tokenized this review to be something like
```
['this', 'is', 'a', 'bad', 'movie', 'that', 'is', 'bad', 'because', 'it', '\'', 's', 'bad', '.']
```
Notice that alphabetical characters are lowercased and punctuation is included. Making a `FreqDist` out of this text would produce
```
FreqDist({'bad': 3, 'is': 2, 'this': 1, 'a': 1, 'movie': 1, 'that': 1, ...})
```
The `NaiveBayesClassifier` expects features to be provided in this type of dictionary format, with the keys being the feature and the values being the associated numerical quantity.
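The counting step can be reproduced without downloading the corpus. The sketch below uses Python's built-in `collections.Counter` as a stand-in for `FreqDist` (in NLTK 3, `FreqDist` is a subclass of `Counter`), applied to the made-up review from above. It is an illustration, not part of the assignment setup.

```python
from collections import Counter

# Tokens from the made-up review, as they might appear pre-tokenized
tokens = ['this', 'is', 'a', 'bad', 'movie', 'that', 'is', 'bad',
          'because', 'it', "'", 's', 'bad', '.']

# Count each token; the counts serve as the features
features = Counter(tokens)

print(features.most_common(2))   # [('bad', 3), ('is', 2)]
print(features['movie'])         # 1
print(features['good'])          # 0 (tokens that never appear count as zero)
```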

## Part (a)

Before we start training and classifying, as with any machine learning dataset, we must partition the data into two parts: **the train set and the test set** (we are not using the dev set here). The exact cutoff for how much should be in each set is neither standardized nor provably optimal, so let's just go with **90% train, 10% test**.

Split the `docs` list into two separate lists:
* a train set list with the **first 90%** of the documents
* a test set list with the **last 10%** of the documents

Print the lengths of each set to show how many documents are in each set.

*Hint: remember if you do floating point computation but want an integer, you can cast the result to an int with `int()`.*

In [None]:
### YOUR CODE HERE ###

train_set = docs[:int(0.9 * len(docs))]
test_set = docs[int(0.9 * len(docs)):]

print("Length of train set:", len(train_set))
print("Length of test set:", len(test_set))

Length of train set: 1800
Length of test set: 200


## Part (b)

Now to actually train the classifier. `NaiveBayesClassifier` is already imported for you. The method we will use to get a trained classifier is
```
NaiveBayesClassifier.train(your_train_set_variable)
```
Store the result in a variable as your classifier. The resulting model has a few useful functions.

Print out the output from the following functions on your classifier:
* `your_classifier_variable.labels()` - should show that `neg` and `pos` are the labels in our dataset
* `your_classifier_variable.most_informative_features(10)` - shows the features (the tokens in this case) that are most useful for classifying a review as `neg` or `pos`

Do the most informative features make sense to you as useful tokens for distinguishing between positive and negative reviews? You may optionally leave a comment below on your observations.

In [None]:
### YOUR CODE HERE ###
classifier = NaiveBayesClassifier.train(train_set)
print("Labels:", classifier.labels())
print("Most Informative Features:", classifier.most_informative_features(10))

Labels: ['neg', 'pos']
Most Informative Features: [('sucks', 1), ('wonderful', 2), ('breathtaking', 1), ('avoids', 1), ('stupid', 2), ('outstanding', 1), ('terrific', 2), ('boring', 2), ('bad', 5), ('captures', 1)]


## Part (c)

Finally, we must test the classifier. We will later learn more formal approaches to evaluation, but for now we will use **accuracy** as our measurement of success. Accuracy is simply calculated as:

$$
\text{Accuracy} = \frac{\text{Number of reviews in test set classified correctly}}{\text{Number of reviews in test set}}
$$

The relevant function call here is `your_classifier_variable.classify(feature_set)`, which will produce one of the two labels `neg` or `pos` based on the model's internal calculations. The precise mathematics of those calculations are discussed in the lecture slides and textbook.
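For reference, a common textbook form of the Naive Bayes decision rule is sketched below. The symbols are notation introduced here rather than taken from the course materials: $\hat{y}$ is the predicted label, $f_1, \dots, f_n$ are the features of the review, and $P(y)$ is the prior probability of label $y$. NLTK's implementation may differ in details such as smoothing.

$$
\hat{y} = \arg\max_{y \in \{\text{neg},\, \text{pos}\}} \; P(y) \prod_{i=1}^{n} P(f_i \mid y)
$$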

Calculate and print the accuracy of your classifier by iterating over the test set, classifying each review using its feature set, counting the number of reviews classified correctly, and dividing by the number of reviews in the test set.

*You may be surprised by how high (close to 1.0) the accuracy is. This is addressed in part (d).*

In [None]:
### YOUR CODE HERE ###
correct = 0
for review in test_set:
  feature_set, label = review
  predicted_label = classifier.classify(feature_set)
  if predicted_label == label:
    correct += 1

accuracy = correct / len(test_set)
print("Accuracy:", accuracy)

Accuracy: 0.69


## Part (d)

In part (c), you should find that the accuracy is really, really good. Like almost 100%. There are certainly classifiers that can produce this accuracy, but this is unrealistic for a first shot at a Naive Bayes classifier. Here we will explore why.

Recall that `docs` is the overall dataset of reviews and feature sets that you partitioned in part (a). The way `docs` was constructed was by appending on all the reviews associated with the label `neg`, then appending on all the reviews associated with the label `pos`. Since there happens to be an equal number of positive and negative reviews, that means the first 50% of the documents had the label `neg` and the last 50% had the label `pos`.

You then split the list into the first 90% being train and the last 10% being test. **Why, then, is our accuracy so high? What is it about this partition that made it very easy for your Naive Bayes classifier to classify this specific test set?**

**Provide your answer in the text cell below.** To further confirm your answer, try running the following code cell, which shuffles (randomizes the positioning of the elements in) the `docs` variable, then re-run your code in parts (a), (b), and (c) to re-train your classifier on the shuffled documents and see if your accuracy changes. You will likely get an accuracy of around `.65` to `.75` rather than `.97` or above.

YOUR ANSWER HERE

The accuracy is high because of the way the `docs` list was constructed and partitioned into the train and test sets. As mentioned, the first 50% of the documents had the label `neg` and the last 50% had the label `pos`. Since the train set was made up of the first 90% of the documents, it contained all 1,000 negative reviews but only 800 of the positive ones. The test set, the last 10%, consisted entirely of positive reviews. Because the test set contains only one class, its accuracy reflects only how often the classifier outputs `pos`, not how well it separates the two classes, so a classifier that leans toward `pos` scores very highly on it.

After shuffling `docs` and re-running parts (a), (b), and (c), a lower accuracy of 0.69 was observed. With both labels present in the train and test sets, this is the more realistic estimate; the earlier high accuracy was an artifact of the unshuffled partition.
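The class balance of the unshuffled partition can be checked directly. The sketch below uses stand-in labels rather than the real `docs` list; the 1,000/1,000 split is an assumption based on the 2,000 documents and equal class sizes described above, and `split_counts` is a hypothetical helper name.

```python
import random
from collections import Counter

# Stand-in labels mirroring how `docs` was built: all 'neg', then all 'pos'
labels = ['neg'] * 1000 + ['pos'] * 1000

def split_counts(labels, train_fraction=0.9):
    """Return label counts for the first-90% / last-10% partition."""
    cutoff = int(train_fraction * len(labels))
    return Counter(labels[:cutoff]), Counter(labels[cutoff:])

train_counts, test_counts = split_counts(labels)
print(train_counts)  # Counter({'neg': 1000, 'pos': 800})
print(test_counts)   # Counter({'pos': 200})

# After shuffling, the same partition puts both labels in the test set
shuffled = labels[:]
random.Random(0).shuffle(shuffled)   # fixed seed only for repeatability
shuffled_train, shuffled_test = split_counts(shuffled)
print(shuffled_test)
```

The unshuffled test set holds a single label, which is why its accuracy says little about how well the two classes are separated.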

In [None]:
import random
random.shuffle(docs) # Re-run parts (a), (b), and (c) to see a more realistic accuracy

# Question 2: Entropy with Wordle

You may have come across a fun little word game called Wordle before. It started off as a personal project by Josh Wardle (you can see where he got the name from) who designed the game for his partner, Palak Shah, who is a fan of word games.

Since the game was released online, it grew immediately in popularity, to the point where [it is now hosted on the New York Times](https://www.nytimes.com/games/wordle/index.html) website and is featured in the NYT phone app alongside their famous crossword puzzles.

Linked below is a YouTube video by the wonderful channel [3Blue1Brown](https://www.youtube.com/@3blue1brown), which has loads of videos with fascinating mathematics, computer science, and other STEM-related topics explained with brilliant visuals. This video explores Wordle, how it's played, and how a Wordle session is an exercise in information theory and entropy.

**Watch the video, then answer the questions below.** Closed captioning is available if you need it.

Video link: https://www.youtube.com/watch?v=v68zYyaEmEA 

*If you enjoyed that video, you might consider (optionally) watching the follow-up video:* https://www.youtube.com/watch?v=fRed0Xmc2Wg&t=378s

---
**Explain in your own words: what does it mean for an observation to have 1 bit of information?**

---
**Explain in your own words: what is entropy in this context?** The video offers a few different definitions. What definition makes the most sense to you?

YOUR ANSWER HERE

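The information carried by an observation that has probability $p$ is $I = -\log_2(p)$.

An observation has 1 bit of information when $p = 1/2$: it cuts the set of remaining possibilities in half, like the answer to a yes/no question whose two outcomes are equally likely. Each additional bit halves the possibilities again, so 2 bits leaves a quarter of them and 3 bits leaves an eighth.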
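The formula $I = -\log_2(p)$ can be checked numerically. A minimal sketch, where `information_bits` is a hypothetical helper name:

```python
import math

def information_bits(p):
    """Information, in bits, of an observation that has probability p."""
    return -math.log2(p)

print(information_bits(0.5))    # 1.0  (possibilities cut in half)
print(information_bits(0.25))   # 2.0  (cut to a quarter)
print(information_bits(0.125))  # 3.0  (cut to an eighth)
```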

YOUR ANSWER HERE

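In this context, entropy is the expected amount of information (in bits) that a guess provides, averaged over every feedback pattern the guess could produce and weighted by how likely each pattern is. It measures how much a guess is expected to narrow down the pool of possible words. Choosing the guess with the highest entropy gives a higher expected information gain than guessing randomly.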
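The entropy of a guess can be sketched in a few lines. This is an illustration under stated assumptions, not the video's code: the four-word candidate list is made up, every candidate is assumed equally likely, and `feedback` and `guess_entropy` are hypothetical helper names.

```python
import math
from collections import Counter

def feedback(guess, answer):
    """Wordle-style feedback: 'g' green, 'y' yellow, '-' gray."""
    result = ['-'] * len(guess)
    remaining = Counter()
    # First pass: mark greens and count the unmatched answer letters
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = 'g'
        else:
            remaining[a] += 1
    # Second pass: mark yellows while unmatched copies of the letter remain
    for i, g in enumerate(guess):
        if result[i] == '-' and remaining[g] > 0:
            result[i] = 'y'
            remaining[g] -= 1
    return ''.join(result)

def guess_entropy(guess, candidates):
    """Expected information (bits) of a guess over equally likely answers."""
    patterns = Counter(feedback(guess, answer) for answer in candidates)
    total = len(candidates)
    return sum((n / total) * -math.log2(n / total) for n in patterns.values())

candidates = ['crane', 'crate', 'trace', 'slate']  # made-up word pool

print(guess_entropy('crane', candidates))  # 2.0: every answer gives a distinct pattern
print(guess_entropy('fuzzy', candidates))  # 0.0: every answer gives the same all-gray pattern
```

With four equally likely answers, a guess that produces four distinct patterns yields the maximum of 2 bits, while a guess that always produces the same pattern yields none.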

---
**From a human perspective, how do you think applying the knowledge or the approach described in this video to a game like Wordle affects the gameplay experience?** This is a completely subjective question. You might describe how your style of Wordle gameplay might change or not change knowing what you've learned from this video. You might compare this to other programs that have been designed to play Starcraft or chess or go. You might be ambivalent and compartmentalize your personal enjoyment of a game from the more logical aspects of it. Feel free to answer how you wish.

YOUR ANSWER HERE

From my perspective, approaching a game such as Wordle through the lens of information theory is extremely intriguing, despite its potential to 'take the fun out of the game.' Games are as fun as you make them, and given that Wordle is a single-player game, there is no harm in 'cheating.' The approach simply offers a different, albeit more informed, way to play. The key distinction is that Wordle is a single-player rather than a multiplayer game. If an algorithm were used in games like chess or go, it would certainly change the gameplay experience in an unintended manner, giving one player a severe advantage.