# Exercise Set 2

## Part 1: Getting Started with Naïve Bayes

For our first Naïve Bayes classifier, we will use the Movie Reviews corpus, which can be found [here](https://www.kaggle.com/datasets/nltkdata/movie-review). We will be using `movie_review.csv` specifically in these tasks.

### 1.1 Loading in the Dataset
First, we need to load the data from the file. Create a pandas dataframe containing the reviews. What information does the dataset contain? How many distinct tags are in the dataset, and how many instances of each tag do we have?

In [None]:
import pandas as pd

df = ...
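
As a rough illustration of the kind of inspection asked for above, the sketch below uses a tiny hand-made dataframe rather than the real file. The column names `text` and `tag` and the tag values are assumptions made for the example; check the actual columns of `movie_review.csv` after loading it.

```python
import pandas as pd

# Toy stand-in for the real corpus; column names and values are assumed
toy_df = pd.DataFrame({
    "text": ["a great film", "dull and boring", "truly wonderful", "a boring mess"],
    "tag": ["pos", "neg", "pos", "neg"],
})

print(toy_df.head())             # first few rows
print(toy_df.columns.tolist())   # which columns are available

# Number of instances per tag
tag_counts = toy_df["tag"].value_counts()
print(tag_counts)
```

The same calls should work on a dataframe loaded with `pd.read_csv`, provided the tag column is named accordingly.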

### 1.2 Splitting the Data

Next, we shuffle the data and split it into training (80%), validation (10%), and test (10%) sets. Scikit-Learn provides a function, `train_test_split`, that splits a dataset into two parts. (For Scikit-Learn installation help, check [here](https://scikit-learn.org/stable/install.html).) Fill in the blanks (represented by ...) in the following code block to split the data. You’ll have to apply this function twice to get a three-way split. Double-check that the size of each dataset corresponds to what you expect.

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into a train set and a temporary test set (the held-out 20%)
x_train, x_test, y_train, y_test = ...

# Split the held-out test data further into val and test sets
x_val, x_test, y_val, y_test = ...
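
To make the two-step split concrete, here is a minimal sketch on 100 made-up documents. The `toy_` names, the `random_state` value, and the data itself are assumptions for illustration only; they are not part of the exercise solution.

```python
from sklearn.model_selection import train_test_split

# 100 made-up documents with alternating labels (illustrative data only)
toy_x = [f"doc {i}" for i in range(100)]
toy_y = ["pos" if i % 2 == 0 else "neg" for i in range(100)]

# First split: 80% train, 20% held out (shuffling is on by default)
toy_x_train, toy_x_rest, toy_y_train, toy_y_rest = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=0
)

# Second split: halve the held-out part into validation and test
toy_x_val, toy_x_test, toy_y_val, toy_y_test = train_test_split(
    toy_x_rest, toy_y_rest, test_size=0.5, random_state=0
)

print(len(toy_x_train), len(toy_x_val), len(toy_x_test))  # 80 10 10
```

Note that the second call uses `test_size=0.5` because it operates on the held-out 20%, not on the full dataset.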

### 1.3 Tokenization and Bag-of-Words with Scikit-Learn

For basic tokenization and bag-of-words feature extraction, we can use the `CountVectorizer` class from scikit-learn. We can fit the `CountVectorizer` and transform the data in one step with the method `fit_transform(raw_documents=...)`.
([Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer))

The following commands may help you inspect the features:

```python
len(cv.vocabulary_)        # Number of words in the BOW
cv.vocabulary_["boring"]   # Index of "boring" in the feature matrix
```

Instantiate a vectorizer, then fit it to and transform the training dataset. How many words are in the bag-of-words?

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = ...
x_train_bow = ...

print("The number of words in the BOW is:", ...)
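
The following sketch shows what `CountVectorizer` produces on a two-sentence toy corpus (the sentences are invented for illustration). With default settings, the tokenizer keeps only tokens of two or more characters, so the word "a" does not end up in the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented documents, not taken from the corpus
toy_docs = ["a boring film", "a great great film"]

toy_cv = CountVectorizer()
toy_bow = toy_cv.fit_transform(toy_docs)   # sparse document-term matrix

print(len(toy_cv.vocabulary_))             # 3 ("a" is dropped by the default tokenizer)
print(toy_cv.vocabulary_["boring"])        # 0 (column indices follow sorted vocabulary order)
print(toy_bow.toarray())                   # [[1 1 0], [0 1 2]] for boring, film, great
```

Each row of the matrix is one document and each column counts one vocabulary word.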

### 1.4 Training and Prediction with a Multinomial Naïve Bayes Classifier

Now we train a multinomial Naïve Bayes classifier, again using an existing scikit-learn implementation: `MultinomialNB` ([Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)).


Using the training BOW and corresponding tags, train the `MultinomialNB` using the `fit` method. 

In [None]:
from sklearn.naive_bayes import MultinomialNB

# Instantiate the classifier object
nb = ...
# Train the classifier
nb.fit(..., ...)

Let's evaluate this model on the validation set. Remember, we first need to transform the validation set into the bag-of-words
representation, using the same `CountVectorizer` already used for training. Then, predict the tags for the validation set using the classifier's `predict` method.

In [None]:
x_val_bow = ...
predicted_y_val = ...
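
The full fit, transform, and predict sequence can be sketched on invented toy data as follows. The key point is that the vectorizer is fitted on the training documents only, and the validation documents are passed through `transform` (not `fit_transform`) so that both matrices share the same columns. All `toy_` names and sentences are assumptions for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training and validation data
toy_train = ["great film", "wonderful acting", "boring plot", "dull boring film"]
toy_train_tags = ["pos", "pos", "neg", "neg"]
toy_val = ["boring acting", "wonderful film"]

# Fit the vocabulary on the training data only
toy_cv = CountVectorizer()
toy_train_bow = toy_cv.fit_transform(toy_train)

# Train the classifier on the training counts and tags
toy_nb = MultinomialNB()
toy_nb.fit(toy_train_bow, toy_train_tags)

# Reuse the fitted vectorizer: transform only, never refit on validation data
toy_val_bow = toy_cv.transform(toy_val)
toy_pred = toy_nb.predict(toy_val_bow)
print(toy_pred)
```

Words in the validation documents that were never seen during training are simply ignored by `transform`.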

### 1.5 Evaluating the Classifier

Now that we have the predicted tags for the validation set, let's evaluate the results. We can compare the predictions with the gold labels and compute different metrics using the `metrics` module in Scikit-Learn ([Documentation](https://scikit-learn.org/stable/api/sklearn.metrics.html)). 

We can calculate the accuracy, precision, recall, and f-score of the predictions as shown below:

In [None]:
from sklearn import metrics

acc = metrics.accuracy_score(y_val, predicted_y_val)
precision, recall, fscore, _ = metrics.precision_recall_fscore_support(y_val, predicted_y_val)

print(f"Accuracy: {acc}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F-score: {fscore}")

For precision, recall, and f-score, each class has its own score, but the output above does not show which score belongs to which class. Luckily, Scikit-Learn provides a “classification report” that shows performance metrics for all classes in a structured output:

In [None]:
print(metrics.classification_report(y_val, predicted_y_val))

You can also display the raw confusion matrix, but note that scikit-learn displays the gold labels as the
rows and the predicted labels as columns:

In [None]:
print(metrics.confusion_matrix(y_val, predicted_y_val))
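
To see how the rows, columns, and per-class scores line up, the sketch below uses five invented gold/predicted labels and passes an explicit `labels` list so that the class order is unambiguous. Without `labels`, scikit-learn uses the sorted order of the labels it finds. The label values here are assumptions for the example.

```python
from sklearn import metrics

# Invented gold and predicted labels
toy_gold = ["pos", "pos", "neg", "neg", "neg"]
toy_pred = ["pos", "neg", "neg", "neg", "pos"]
toy_labels = ["neg", "pos"]   # fixes the order of rows, columns, and scores

# Rows are gold labels, columns are predicted labels
toy_cm = metrics.confusion_matrix(toy_gold, toy_pred, labels=toy_labels)
print(toy_cm)   # [[2 1], [1 1]]

# Per-class scores come back in the same order as toy_labels
toy_p, toy_r, toy_f, toy_support = metrics.precision_recall_fscore_support(
    toy_gold, toy_pred, labels=toy_labels
)
for label, p, r in zip(toy_labels, toy_p, toy_r):
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
```

Here the first row `[2 1]` means that two gold `neg` reviews were predicted `neg` and one was predicted `pos`.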

How does the model do? How could you improve the performance?

### 1.6 Improve the Classifier

Using some of the ideas you listed in 1.5, create two other classifiers by changing some parameters in either the feature extraction (`CountVectorizer`) or the training (`MultinomialNB`) step. Select the best of the three models based on validation set performance and use that model for predicting and evaluating the test set (don’t forget to vectorize the test set with the appropriate `CountVectorizer`!).

*Note: The corpus is already lowercased and tokenized, so there are limited options in that respect, but you may still try to improve tokenization.*
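
One possible way to organize the comparison is sketched below on invented toy data. The three configurations (default settings, added bigrams via `ngram_range`, and a smaller smoothing value via `alpha`) are only examples of parameters you might vary, not recommended settings, and the scores on such a small toy set are not meaningful in themselves.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Invented data, standing in for the real train and validation splits
toy_train = ["great film", "wonderful acting", "boring plot", "dull boring film"]
toy_train_tags = ["pos", "pos", "neg", "neg"]
toy_val = ["boring acting", "wonderful film", "not great", "dull plot"]
toy_val_tags = ["neg", "pos", "neg", "neg"]

# Each candidate pairs its own vectorizer with its own classifier
candidates = {
    "baseline": (CountVectorizer(), MultinomialNB()),
    "bigrams": (CountVectorizer(ngram_range=(1, 2)), MultinomialNB()),
    "low_alpha": (CountVectorizer(), MultinomialNB(alpha=0.1)),
}

val_scores = {}
for name, (vec, clf) in candidates.items():
    clf.fit(vec.fit_transform(toy_train), toy_train_tags)
    # Validation data goes through the vectorizer fitted for this candidate
    val_scores[name] = accuracy_score(toy_val_tags, clf.predict(vec.transform(toy_val)))

best_name = max(val_scores, key=val_scores.get)
print(val_scores, "-> best on validation:", best_name)
```

Keeping each vectorizer together with its classifier makes it harder to accidentally transform the test set with the wrong vocabulary.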