## IT 720: NLP, Assignment 2

When you submit the assignment, make sure that you:

- add your name to the file name
- shut down the kernel one last time, restart it, and run your code from start to finish.
- leave the output in each cell to allow the grader to see it
- if there is a bug that you cannot resolve, leave the error message in the output cell so the grader can see it

### Rubric for the numbered sections below where you must write your own code, 175 points total
1.  10 points: Train, Valid, Test Split
2.  20 points: BiLSTM
3.  20 points: BiLSTM with Classifier Head
4.  20 points: Fit Encoder
5.  20 points: Visualize Training Loss and Accuracy
6.  20 points: Create Sentence Embeddings
7.  20 points: Define and Train FFClassifier
8.  15 points: Combine Data and Retrain
9.  10 points: Confusion Matrix
10. 20 points: Two UMAP Scatterplots


### Assignment 2 – Neural Sentence Embeddings with BiLSTM + Feedforward Classifier

#### The first 3 code cells are provided to help you get started, as well as a few other cells below.

In this assignment, your code will:

- Build a neural **sentence encoder** using a bi‑directional LSTM
- Learn the word (token) embeddings
- Use the encoder to generate **fixed‑dimensional embeddings** for movie reviews.
- Train a **feed‑forward neural network (FFNN)** classifier on top of these embeddings.
- Evaluate performance and conduct **error analysis** on misclassified examples.

This solution notebook contains:

1. Data loading and (minimal) preprocessing  
2. BiLSTM sentence encoder construction and training  
3. Embedding extraction for all reviews  
4. Feed‑forward classifier training
5. Evaluation and confusion matrix  
6. Error analysis with concrete examples


In [None]:
# This cell contains all packages and functions you need to do the assignment.
# You may substitute PyTorch code instead of Keras/Tensorflow, but you would
# have to replace many imports here with similar imports for PyTorch.

import nltk
import os
import random
# !pip install umap-learn --quiet  # One time installation
import umap # We use this to visualize the movie review vector space

import matplotlib.pyplot as plt
import numpy             as np
import seaborn           as sns
import tensorflow        as tf


from nltk                    import word_tokenize
from nltk.corpus             import movie_reviews
from sklearn.metrics         import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

from tensorflow.keras.callbacks              import EarlyStopping
from tensorflow.keras.layers                 import Input, Embedding, Bidirectional, LSTM, Dense, Dropout
from tensorflow.keras.models                 import Model, Sequential
from tensorflow.keras.optimizers             import Adam
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text     import Tokenizer

print("TensorFlow version:", tf.__version__)


## Reproducibility and setup

We set random seeds for `numpy`, Python's `random`, and TensorFlow to reduce variance in results.  
Note: in real training, results will still vary slightly across runs and hardware, but this helps.


In [None]:
SEED = 34

random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

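# The next line fixes Python's hash seed. It does not control GPU determinism, and it only
# affects Python processes started after it is set, so results may still vary across environments.
# If this generates an error, just delete it. It's optional.
os.environ["PYTHONHASHSEED"] = str(SEED)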


## Loading the movie reviews corpus

We reuse the same **NLTK `movie_reviews`** dataset as in Assignment 1.

- Each document is a full review.
- Each review is labeled with a **sentiment**: `pos` or `neg`.
- We will:
  - Extract raw tokenized reviews.
  - Convert them into text sequences (space‑joined tokens).
  - Assign integer labels: `0` for negative, `1` for positive.


In [None]:
# Load the movie reviews
nltk.download('movie_reviews')

# movie_reviews.fileids() gives list like ['neg/cv000_29416.txt', 'pos/cv000_29590.txt', ...]
fileids = movie_reviews.fileids()
print("Number of documents:", len(fileids))
print("Example fileid:", fileids[0])

# Extract documents and labels
documents = []
labels    = []    # We will add 0 for negative review and 1 for positive

for fid in fileids:
    # tokens: list of strings
    tokens = movie_reviews.words(fid)
    # join back into a whitespace-separated string
    text = " ".join(tokens)
    documents.append(text)

    # label: 'pos' or 'neg'
    label = movie_reviews.categories(fid)[0]
    labels.append(1 if label == "pos" else 0)

print("\nExample document:")
print(documents[0][:500], "...")
print("\nLabel (0=neg, 1=pos):", labels[0])


# Your own code will begin here.

## 1. Train/validation/test split

Add your own code to split the dataset into:

- **Train**: for fitting the BiLSTM encoder and the FFNN.
- **Validation**: for early stopping during BiLSTM training.
- **Test**: held‑out data for final evaluation.

Using scikit-learn's `train_test_split`, first create an 80-20 split into training and test data. Then apply a second 80-20 split to that training portion. This leaves 64% of the original 2000 reviews (1280) for training, 16% (320) held out for validation, and 20% (400) held out for testing.

Please create the following variable names in your code for the training, validation and test sets:
- X_train, y_train
- X_val,   y_val
- X_test,  y_test

After you have set the variables, print the shapes of X_train, X_val, and X_test.
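
The sizes produced by the two-stage split can be checked with a little arithmetic before calling `train_test_split`. The sketch below is only an illustration: it shuffles integer indices with Python's `random` module instead of using scikit-learn, and the names `indices`, `train_val`, `train_idx`, `val_idx`, and `test_idx` are hypothetical.

```python
import random

N_REVIEWS = 2000      # size of the movie_reviews corpus
TEST_PERCENT = 20     # held out in the first split
VAL_PERCENT = 20      # held out of the remaining 80% in the second split

rng = random.Random(34)            # any fixed seed works; 34 matches SEED above
indices = list(range(N_REVIEWS))
rng.shuffle(indices)

# First split: 80% train+val, 20% test
n_test = N_REVIEWS * TEST_PERCENT // 100
test_idx, train_val = indices[:n_test], indices[n_test:]

# Second split: 80% of the remainder for training, 20% for validation
n_val = len(train_val) * VAL_PERCENT // 100
val_idx, train_idx = train_val[:n_val], train_val[n_val:]

print(len(train_idx), len(val_idx), len(test_idx))   # 1280 320 400
```

The shapes you print for X_train, X_val, and X_test should match these three counts.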

In [None]:
X = np.array(documents)
y = np.array(labels)

# Use X and y to now create your six variables for training, validation and test data.

# First split: train+val vs test


# Second split: train vs val


# Print out shapes




## Tokenization and integer encoding

We now:

1. Fit a Keras `Tokenizer` on the training text.
2. Convert texts to integer sequences.
3. Pad sequences to a fixed maximum length.
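
To make the three steps concrete, here is a small pure-Python sketch of integer encoding with an out-of-vocabulary token and post-padding. It is a simplified stand-in, not the Keras implementation (it only lowercases and splits on whitespace), and the helper names `build_word_index`, `encode`, and `pad_post` are hypothetical.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for the OOV token, 0 for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate([oov_token] + ranked, start=1)}

def encode(text, word_index, num_words):
    """Map words to integers; unknown words and words ranked at num_words or beyond become 1."""
    ids = []
    for word in text.lower().split():
        idx = word_index.get(word, 1)
        ids.append(idx if idx < num_words else 1)
    return ids

def pad_post(ids, maxlen):
    """Truncate at the end, then pad at the end with zeros."""
    ids = ids[:maxlen]
    return ids + [0] * (maxlen - len(ids))

train_texts = ["a good film", "a bad film", "a good good cast"]
word_index = build_word_index(train_texts)

encoded = encode("a good plot", word_index, num_words=4)
print(pad_post(encoded, maxlen=5))   # [2, 3, 1, 0, 0]
```

Here "plot" never appeared in the training texts, so it is encoded as 1, and the two trailing zeros are padding.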


In [None]:
# This cell is provided to you and shows how to create and use a simple
# whole-word based tokenizer using Keras' Tokenizer utility. The Keras
# documentation is not very good regarding tokenizers. However, the following
# HuggingFace page does provide good detail on several approaches to
# tokenization, and is a good supplement to the somewhat limited discussion
# that you can find in Jurafsky and Martin's chapter on 'Words and Tokens'
# https://huggingface.co/docs/transformers/fast_tokenizers

MAX_VOCAB_SIZE = 10000   # This will limit our total vocab size to most frequent 10000 in the data
MAX_SEQ_LEN    = 256     # Truncate/pad to this length so every review is same length
EMBEDDING_DIM  = 128     # Dimensionality of learned token embeddings

# The next line will add '<OOV>' to the list of vocabulary words and assign it the integer 1.
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE, oov_token="<OOV>")

# When fit_on_texts is called, it counts word frequencies and assigns a unique integer to every word,
# most frequent first. Integer 1 has already been assigned to the special vocab token <OOV>, so with
# num_words=10000 only indices 1 through 9,999 are kept when texts are converted to sequences, i.e.
# <OOV> plus the 9,998 most frequent words. Any other word found in the texts just gets assigned the 1.
tokenizer.fit_on_texts(X_train)

def texts_to_padded_sequences(texts):
    seqs = tokenizer.texts_to_sequences(texts)
    return pad_sequences(seqs, maxlen=MAX_SEQ_LEN, padding='post', truncating='post')

X_train_seq = texts_to_padded_sequences(X_train)
X_val_seq   = texts_to_padded_sequences(X_val)
X_test_seq  = texts_to_padded_sequences(X_test)

print("Train sequences shape:\t\t",    X_train_seq.shape)
print("Validation sequences shape:\t", X_val_seq.shape)
print("Test sequences shape:\t\t",     X_test_seq.shape)

# Each review has now been converted to a sequence of integers. Remember, 1 represents any low frequency word
# which could be a misspelling or simply a very rare word.
# Let's look at the tokenization of the first review in the training sequences:

print('\nThe tokenization of the first review in X_train_seq is:')
X_train_seq[0]

## 2. Building the BiLSTM sentence encoder

Build a Keras model that takes a sequence of token IDs and returns a **fixed‑dimensional sentence embedding**.
In doing so, the model will also learn embeddings for the tokens.

Architecture:

- Input: integer sequence of length `MAX_SEQ_LEN`.
- Embedding layer: maps tokens to `EMBEDDING_DIM`‑dimensional learned vectors.
- BiLSTM: processes the sequence in both directions.
- We set `return_sequences=False` to get a single vector per sequence.
- A final Dense layer that reduces the dimensionality of the review representation and adds nonlinearity. This has been supplied for you.

The final encoder outputs a sentence embedding vector of reduced dimensionality, i.e., fewer dimensions than the output of the biLSTM.

The final sentence embedding is shrunk for three reasons:

a. Fewer dimensions may result in less overfitting because the representation has less capacity. You will see
   that overfitting may be a huge issue for this assignment, mainly because of the tiny size of the dataset.

b. A 'bottleneck' on the dimensionality of the sentence embeddings allows you to make changes elsewhere
   (for example, to the representation capacity of the biLSTM) and still arrive at the same final dimensionality.

c. The tanh activation maps the embedding space into a bounded, centered region. This often
   makes downstream classifiers behave more predictably.
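
The effect of the bottleneck can be sketched with NumPy alone. The example below is not the Keras layer itself: it uses random stand-in values for the biLSTM output and hypothetical, randomly initialised weights `W` and `b`, just to show the change in shape (128 to 48) and the bounded range produced by tanh.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, lstm_dim, bottleneck_dim = 4, 128, 48   # 128 = 64 units x 2 directions

# Stand-in for the biLSTM output of four reviews (random values, not real activations)
lstm_output = rng.normal(scale=3.0, size=(batch_size, lstm_dim))

# Hypothetical weights and bias of the bottleneck Dense layer
W = rng.normal(scale=0.1, size=(lstm_dim, bottleneck_dim))
b = np.zeros(bottleneck_dim)

# Dense layer with tanh activation: affine map followed by an elementwise squash
sentence_embedding = np.tanh(lstm_output @ W + b)

print(sentence_embedding.shape)                                    # (4, 48)
print(sentence_embedding.min() >= -1, sentence_embedding.max() <= 1)   # True True
```

Whatever the scale of the biLSTM output, every coordinate of the sentence embedding stays between -1 and 1.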


In [None]:
# Build sequential LSTM model in Keras (or PyTorch, if you prefer).
# You can choose your own hyperparameter values, e.g. number of units for the layers,
# names of layers, etc. unless otherwise noted.

# Begin with an Input layer that takes reviews with shape=(MAX_SEQ_LEN,)
# Add an Embedding layer to learn word (token) embeddings
#     with input dimensions of MAX_VOCAB_SIZE and output dimensions of EMBEDDING_DIM defined above
# Add a bidirectional LSTM with return_sequences=False and that outputs 64 dimensions
#     in each direction producing 128 dimensions for final LSTM output
# Follow the biLSTM with a Dense layer that outputs only 48 dimensions. This is the 'bottleneck' mentioned above.
# This Dense layer should use a 'tanh' activation function.
# The output of this Dense layer will be the input to a small Classifier to perform the review classification
# Give this Dense layer a name by using name="sentence_embedding".  We will reference that name later.












## 3. Training the BiLSTM encoder for sentiment classification

Before we train our biLSTM, we need to give it a supervisory signal for training that is relevant to our task of classifying movie review sentiments. Without that signal, the learned token embeddings will not be able to contribute useful information to the desired task.  Therefore, we will attach a simple classification head to our encoder solely for this purpose. Do not confuse the classification head here with the Feed Forward Classifier that we will create later after we train the encoder.  

To train the encoder, we attach a classification head by simply following the sentence_embedding layer with:
- Dropout layer for regularization, choosing your own value for the dropout rate.
- Dense layer with 1 unit and sigmoid activation for binary sentiment.

Make sure you have saved your entire encoder with its classifier head into a variable called 'biLSTM'.
Compile biLSTM with "binary_crossentropy" for the loss function, the Adam optimizer, and metrics=["accuracy"].

Call biLSTM.summary() to see a description of your complete encoder. You should see 6 layers in the output:
1. Input
2. Embedding
3. Bidirectional
4. Dense
5. Dropout
6. Dense
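
As a sanity check on the summary, the trainable parameter count of each layer can be worked out by hand. The arithmetic below is a sketch that assumes the sizes given above (vocabulary 10000, embedding dimension 128, 64 LSTM units per direction, a 48-unit bottleneck) and the usual layer defaults with bias terms enabled. If you chose different values, your numbers will differ.

```python
MAX_VOCAB_SIZE = 10000
EMBEDDING_DIM = 128
LSTM_UNITS = 64        # per direction
BOTTLENECK_DIM = 48

# Embedding: one EMBEDDING_DIM-dimensional vector per vocabulary index
embedding_params = MAX_VOCAB_SIZE * EMBEDDING_DIM

# One LSTM direction: 4 weight blocks (input, forget, and output gates plus the cell candidate),
# each with input weights, recurrent weights, and a bias
lstm_one_direction = 4 * (EMBEDDING_DIM + LSTM_UNITS + 1) * LSTM_UNITS
bilstm_params = 2 * lstm_one_direction

# Bottleneck Dense layer: 128 inputs -> 48 outputs, plus 48 biases
bottleneck_params = (2 * LSTM_UNITS) * BOTTLENECK_DIM + BOTTLENECK_DIM

# Output Dense layer: 48 inputs -> 1 sigmoid unit, plus 1 bias
output_params = BOTTLENECK_DIM * 1 + 1

total_params = embedding_params + bilstm_params + bottleneck_params + output_params

print(embedding_params, bilstm_params, bottleneck_params, output_params)  # 1280000 98816 6192 49
print(total_params)                                                       # 1385057
```

The Input and Dropout layers have no trainable parameters, so they contribute nothing to the total.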


In [None]:
# Finish the rest of the encoder as just described.
# Compile it and call biLSTM.summary() to view the results.






## 4. Training with early stopping

We train the model and use `EarlyStopping` on validation loss (`val_loss`) to try to prevent overfitting.
Use `restore_best_weights=True` so that, if early stopping is triggered, the model is returned with the weights from the epoch that had the best validation loss.

Save the results of the training into the variable 'history', which we will use
to plot the loss and accuracy curves below.

Key hyperparameters are:

- Batch size
- Number of epochs
- Learning rate (set above for Adam, if you wish to change the default)
- Dropout rate (set above)
- The `patience` for early stopping (i.e., how many epochs with no improvement in validation loss are tolerated before training stops)
- Use `verbose=1` if you wish to see the output for each epoch during training.
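
To see how `patience` interacts with the validation loss, the stopping rule can be imitated in a few lines of plain Python. This is a simplified sketch of the behaviour described above, not the Keras callback itself. The function name `simulate_early_stopping` and the loss values are made up for illustration.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based, for a list of per-epoch val_loss values."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0                      # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch      # training stops here
    return len(val_losses) - 1, best_epoch    # ran through every epoch without stopping

val_losses = [0.69, 0.60, 0.55, 0.57, 0.58, 0.59, 0.50]   # made-up values
stop_epoch, best_epoch = simulate_early_stopping(val_losses, patience=2)
print(stop_epoch, best_epoch)   # 4 2
```

With a patience of 2, training stops at epoch 4 and never reaches the lower loss at epoch 6. Restoring the best weights would return the model from epoch 2. A larger patience trades longer training for a chance to get past such plateaus.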


## 5. Visualizing training and validation curves

Using matplotlib, write code for two plots:
1. Training and Validation loss
2. Training and validation accuracy

Each plot should show the epochs on the horizontal axis, the metric (loss or accuracy) on the vertical axis, and a legend that distinguishes the training curve from the validation curve by color.

These plots will be useful to visually display any:
- Underfitting vs overfitting behavior.
- The point at which early stopping kicks in.

Make sure you have saved the results of your model's fit into 'history'

## Baseline: BiLSTM classifier performance

Before we decouple the encoder and classifier, we can measure how well the full BiLSTM model performs on the test set. This gives a baseline:

- Test accuracy
- Classification report for Precision/recall/F1

Later, we compare this to the FFClassifier trained on learned embeddings.  You can copy/adapt this code later and use it for other models.


In [None]:
# Predict probabilities with threshold at 0.5
y_test_pred_proba = biLSTM.predict(X_test_seq)
y_test_pred       = (y_test_pred_proba >= 0.5).astype(int)

print("BiLSTM model test accuracy:", accuracy_score(y_test, y_test_pred))
print("\nClassification report (BiLSTM model):")
print(classification_report(y_test, y_test_pred, target_names=["neg", "pos"]))


## 6. Create sentence embeddings from the trained encoder

Now use the **encoder part only** (up to and including the `sentence_embedding` layer) to generate fixed‑dimensional vectors for each review.

Steps:

1. Extract the encoder inputs from:  `biLSTM.input`
2. Extract the encoder outputs from: `biLSTM.get_layer("sentence_embedding").output`
3. Use the Keras `Model` class to construct a new model from those inputs and outputs, saving the model into a new variable, e.g. maybe you want to use 'encoder_only_model'.
4. Use the 'predict' method of that new model on each of `X_train_seq`, `X_val_seq`, and `X_test_seq` to convert each movie review to a sentence embedding created by the bidirectional LSTM that was previously trained.
5. Store those embeddings into new variables, e.g. X_train_emb for the training reviews.
6. Call `encoder_only_model.summary()` (if that's the name you used) to verify the architecture
7. Print the shapes of the three new variables where you saved the results of Step 5.


In [None]:
# Rebuild the encoder model that outputs the sentence embedding









## 7. Define, compile, and train a new feed‑forward neural network (FFClassifier) on the embeddings that you just saved.

We now use Keras to build a new **feed‑forward neural network** with 3 layers:

- Input layer: to accept the sentence embeddings that you just created and saved.
- Hidden layer: a Dense layer with a configurable number of units (your choice) and using "relu".
- Output layer: a Dense layer with 1 unit and the "sigmoid" activation function.

Compile this network as before, define EarlyStopping as before and fit the classifier, again saving it into 'history' or a different history variable of your choice.

Make sure you use the embeddings you saved for the movie reviews for both training and validation.

Predict the sentiment on the VALIDATION set in the same way that we predicted the test data with the earlier model (i.e., using 0.5 as the class threshold), but make sure you use the VALIDATION set embeddings that you just saved.

Print the validation data accuracy and classification report, again as you have already done above.

## 8. Final training on train+validation and evaluation on test

1. Stack the training and validation embeddings, i.e., combine them into a single set of data.
2. As you did near the beginning, use train_test_split on this combined data to create a new 90% split for training and 10% for validation, saving each into new variable names.
3. Redefine and retrain the FFClassifier on these training and validation datasets.
4. Evaluate this final model on predictions for the held‑out test set, i.e., `X_test_emb` if that is the variable name you used for the test embedding data.
5. Print your final accuracy and classification report.

This gives us the final performance of the **BiLSTM embeddings + FFNN classifier** pipeline.
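
The stacking in step 1 can be sketched with NumPy as follows. The arrays here are random stand-ins with the shapes expected from the 64/16/20 split and a 48-dimensional sentence embedding. The combined names `X_trainval_emb` and `y_trainval` are hypothetical, and the real embeddings and labels would come from your own variables.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for the saved embeddings and labels (not real data)
X_train_emb = rng.normal(size=(1280, 48))
X_val_emb = rng.normal(size=(320, 48))
y_train = rng.integers(0, 2, size=1280)
y_val = rng.integers(0, 2, size=320)

# Stack the embedding rows; concatenate the 1-D labels in the same order
X_trainval_emb = np.vstack([X_train_emb, X_val_emb])
y_trainval = np.concatenate([y_train, y_val])

print(X_trainval_emb.shape, y_trainval.shape)   # (1600, 48) (1600,)

# A 90/10 split of the 1600 combined rows leaves 1440 for training and 160 for validation
n_val_new = len(X_trainval_emb) * 10 // 100
print(len(X_trainval_emb) - n_val_new, n_val_new)   # 1440 160
```

The important detail is that the embeddings and labels are combined in the same order, so each row still lines up with its label.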


## 9. Confusion matrix for the Final classifier

Using matplotlib (or just Seaborn's `heatmap`), plot a confusion matrix on the test data to show:
- Which class is harder to predict.
- Whether the model is biased toward positive or negative predictions.
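
Before plotting, it may help to see how a 2x2 confusion matrix answers both questions. The sketch below builds one by hand with NumPy from made-up labels and made-up sigmoid outputs. The values are illustrative only, not results from this assignment.

```python
import numpy as np

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])                                   # made-up labels
y_proba = np.array([[0.1], [0.6], [0.7], [0.2], [0.9], [0.6], [0.3], [0.8]])  # made-up outputs, shape (n, 1)

# Threshold at 0.5 and flatten (n, 1) to (n,)
y_pred = (y_proba >= 0.5).astype(int).ravel()

# Rows are true classes, columns are predicted classes
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

print(cm)
# [[2 2]
#  [1 3]]

recall_per_class = cm.diagonal() / cm.sum(axis=1)   # fraction of each true class predicted correctly
predicted_counts = cm.sum(axis=0)                   # how often each class is predicted

print(recall_per_class)    # [0.5  0.75]
print(predicted_counts)    # [3 5]
```

In this toy case the negative class is harder to predict (recall 0.5 versus 0.75), and the model leans toward positive predictions (5 of 8). An array like `cm` is what a heatmap of the confusion matrix displays.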


## Error analysis: inspecting misclassified reviews

We now examine some **false positives** and **false negatives** to understand:

- Which linguistic phenomena confuse the model.
- How the representation (BiLSTM embeddings) might fail.
- How data sparsity, sarcasm, negation, etc. show up in errors.

I am giving you this cell.


In [None]:
# Identify misclassified indices

y_test_pred = y_test_pred.ravel()   # flatten (n, 1) predictions to shape (n,)
misclassified_indices = np.where(y_test != y_test_pred)[0]
print('Misclassified indices:\n', misclassified_indices)

print("\nNumber of misclassified examples:", len(misclassified_indices))

# Helper function to print a few examples
def show_misclassified_examples(X_text, y_true, y_pred, indices, n=5, truncated_text=True):
    selected = np.random.choice(indices, size=min(n, len(indices)), replace=False)
    for idx in selected:
        print("="*80)
        print(f"Index: {idx}")
        print(f"True label: {y_true[idx]} ({'pos' if y_true[idx]==1 else 'neg'})")
        print(f"Predicted label: {y_pred[idx]} ({'pos' if y_pred[idx]==1 else 'neg'})")

        if truncated_text:
            print("\nReview text (truncated to 800 chars):\n")
            print(X_text[idx][:800])
        else:
            print("\nReview full text:\n")
            print(X_text[idx])
        print("\n")

truncated_text = True  # Truncate reviews to first 800 characters
num_reviews    = 10     # Display this many reviews
show_misclassified_examples(X_test, y_test, y_test_pred, misclassified_indices,
                            n=num_reviews, truncated_text=truncated_text)


### 10. Use UMAP (imported at the beginning) to reduce the dimensionality of the test embeddings and plot them in 2 dimensions.

Individual instructions for each code cell below:
1. Print shape of test embeddings and accuracy score of the test data from Step 8 above.
2. Define a UMAP model to reduce X_test_emb (if that was your variable for the test embeddings) to 2 dimensions. Use 'cosine' as the metric and your own values for n_neighbors, min_dist, and random_state. Call the fit_transform method on the test embeddings and save the result in a variable. Print the new shape, which should be (400, 2).
3. Create scatterplot of the ground truth labels (i.e. y_test) in the 2 dimensions with labeled axes, title, and legend. Color the plotted points by negative or positive sentiment.
4. Create similar scatterplot to display plotted points that are either correct or incorrect predictions (You can use `y_test == y_test_pred` to get those)


In [None]:
# Print shape of test embeddings and accuracy score from Step 8




In [None]:
# Run UMAP to reduce embeddings to 2D





In [None]:
# Scatterplot of ground-truth sentiment clusters




In [None]:
# Compare with scatterplot of correct vs incorrect predictions

correct = (y_test == y_test_pred)




Notice how the correct/incorrect predictions are as thoroughly mixed over the embedding space as the ground truth labels.

This is because:

- Only 2000 reviews total
- Reviews are long, messy, and full of mixed sentiment
- Many reviews contain both praise and criticism
- The writing style varies wildly (sarcasm, rhetorical questions, plot summary)
- Labels are sometimes ambiguous or borderline

#### Some topics for further thought.

1. **Reducing overfitting:**
   - Yaser Abu-Mostafa of Caltech says that learning how to deal with overfitting is what separates amateurs from professionals.
   - If your model above seems to be overfitting, what do you think are the causes and how would you proceed to reduce overfitting?

2. **Embeddings as features:**
   - In Assignment 1, you used N‑gram counts or TF–IDF as features.
   - Here, you used **neural sentence embeddings**.
   - In what sense are these embeddings "just another kind of feature"?  
     In what sense are they fundamentally different?

3. **Error patterns:**
   - Look at some false positives and false negatives.
   - What linguistic patterns seem especially hard for the model?
   - How might you extend the model to handle these cases better (without jumping to Transformers yet)?

4. **Bias–variance and data limitations:**
   - How might model capacity (e.g., size of the BiLSTM or FFClassifier) affect underfitting vs overfitting here?
   - If you had access to more data, what changes would you expect in performance and error patterns?
