# **CIS 4190/5190 Fall 2025 - Homework 4**

**Before starting, you must click on the "Copy To Drive" option in the top bar. Go to File --> Save a Copy to Drive. This is the master notebook so <u>you will not be able to save your changes without copying it </u>! Once you click on that, make sure you are working on that version of the notebook so that your work is saved**

In [None]:
from __future__ import division
import random
import numpy as np
import pandas as pd
import os
import sys
import matplotlib.pyplot as plt
from numpy.linalg import *
np.random.seed(42)  # don't change this line or one below
random.seed(42)

In [None]:
# For autograder only, do not modify this cell.
# True for Google Colab, False for autograder
NOTEBOOK = (os.getenv('IS_AUTOGRADER') is None)
if NOTEBOOK:
    print("[INFO, OK] Google Colab.")
else:
    print("[INFO, OK] Autograder.")

# **PennGrader Setup**

First, you'll need to set up the PennGrader, an autograder we are going to use throughout the semester. The PennGrader will automatically grade your answers and provide instant feedback. Unless otherwise stated, you can resubmit up to a reasonable number of attempts (e.g. 100 attempts per day). **We will only record your latest score in our backend database**.

After finishing each homework assignment, you must submit your iPython notebook to Gradescope before the homework deadline. Gradescope will then retrieve and display your scores from our backend database.

In [None]:
%%capture
!pip install penngrader-client

In [None]:
%%writefile student_config.yaml
grader_api_url: 'https://23whrwph9h.execute-api.us-east-1.amazonaws.com/default/Grader23'
grader_api_key: 'flfkE736fA6Z8GxMDJe2q8Kfk8UDqjsG3GVqOFOa'

In [None]:
from penngrader.grader import *

In [None]:
# PLEASE ENSURE YOUR PENN-ID IS ENTERED CORRECTLY. IF NOT, THE AUTOGRADER WON'T KNOW
# WHO TO ASSIGN POINTS TO IN OUR BACKEND.
STUDENT_ID = 12345678          # YOUR PENN-ID GOES HERE AS AN INTEGER

Run the following cell to initialize the autograder. This autograder will let you submit your code directly from this notebook and immediately get a score.

**NOTE:** Remember we store your submissions and check against other students' submissions... so, not that you would, but no cheating.

In [None]:
grader = PennGrader('student_config.yaml', 'cis5190_f25_HW4', STUDENT_ID, STUDENT_ID)

In [None]:
# Serialization code needed by the autograder
import inspect, sys
from IPython.core.magics.code import extract_symbols

def new_getfile(object, _old_getfile=inspect.getfile):
    if not inspect.isclass(object):
        return _old_getfile(object)

    # Lookup by parent module (as in current inspect)
    if hasattr(object, '__module__'):
        object_ = sys.modules.get(object.__module__)
        if hasattr(object_, '__file__'):
            return object_.__file__

    # If parent module is __main__, lookup by methods (NEW)
    for name, member in inspect.getmembers(object):
        if inspect.isfunction(member) and object.__qualname__ + '.' + member.__name__ == member.__qualname__:
            return inspect.getfile(member)
    else:
        raise TypeError('Source for {!r} not found'.format(object))
inspect.getfile = new_getfile

def grader_serialize(obj):
    cell_code = "".join(inspect.linecache.getlines(new_getfile(obj)))
    class_code = extract_symbols(cell_code, obj.__name__)[0][0]
    return class_code

# **1. [4190: 24 autograded; 5190: 24 autograded] Natural Language Processing**

#### Stanford Sentiment Treebank (SST)

We'll introduce the [Stanford Sentiment Treebank](https://nlp.stanford.edu/sentiment/index.html) (SST) dataset, and use a Naive Bayes model as a simple baseline. The SST was introduced by [(Socher et al. 2013)](http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf). It consists of 11,855 sentences drawn from a corpus of movie reviews (originally from Rotten Tomatoes), each labeled with sentiment on a five-point scale, and it is widely used as a benchmark for text classification.

An example of the five-point scale is:
```
sentence: [A warm , funny , engaging film .]
label:    4 (very positive)
```

**Note:** Unlike most classification datasets, SST is also a _treebank_, which means each sentence is associated with a tree structure that decomposes it into subphrases. So for the example above, we'd also have sentiment labels for `[warm , funny]` and `[engaging film .]` and so on. The tree structure comes in handy for complex NLP tasks, and we will use it briefly to analyze an example that has negation. The data is distributed as serialized trees in [S-expression](https://en.wikipedia.org/wiki/S-expression) form, like this:
```
(4 (4 (2 A) (4 (3 (3 warm) (2 ,)) (3 funny))) (3 (2 ,) (3 (4 (4 engaging) (2 film)) (2 .))))
```

We've downloaded the dataset and parsed the S-expressions into a dataframe.
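To make the S-expression format concrete, here is a minimal sketch of how such a string could be turned into a tree and read back as a sentence. The helper names `parse_sexpr` and `leaves` are hypothetical and written only for this illustration; they are not the preprocessing code used to build the dataframe.

```python
def parse_sexpr(text):
    """Parse an SST-style S-expression into nested (label, body) tuples.

    A leaf is (label, word); an internal node is (label, [child, ...]).
    Hypothetical helper for illustration only.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = int(tokens[pos])
        pos += 1
        if tokens[pos] != "(":
            # Leaf: a single word follows the label.
            node = (label, tokens[pos])
            pos += 1
        else:
            # Internal node: one or more child subtrees follow the label.
            children = []
            while tokens[pos] == "(":
                children.append(parse_node())
            node = (label, children)
        assert tokens[pos] == ")"
        pos += 1
        return node

    return parse_node()


def leaves(node):
    """Return the words of a parsed tree in left-to-right order."""
    _, body = node
    if isinstance(body, str):
        return [body]
    return [word for child in body for word in leaves(child)]


example = "(4 (4 (2 A) (4 (3 (3 warm) (2 ,)) (3 funny))) (3 (2 ,) (3 (4 (4 engaging) (2 film)) (2 .))))"
tree = parse_sexpr(example)
root_label = tree[0]      # sentiment label of the whole sentence
sentence = leaves(tree)   # tokens of the whole sentence
```

Every internal node carries its own label, which is where the subphrase-level sentiment annotations come from. In practice `nltk.tree.Tree.fromstring` can parse the same format.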



In [None]:
!pip3 install wget

In [None]:
from __future__ import division
import random, os, sys, re, json, time, datetime, shutil
import itertools, collections
from collections import defaultdict, Counter
from importlib import reload

# NLTK, NumPy, and Pandas.
import nltk
from nltk.tree import Tree
import numpy as np
from numpy import random as rd
import pandas as pd

In [None]:
# Constants for use by other modules.
START_TOKEN = u"<s>"
END_TOKEN   = u"</s>"
UNK_TOKEN   = u"<unk>"

#### Datasets:
Next, we will download the dataset from GitHub to your local runtime. After a successful download, you may verify that all datasets are present in your Colab instance.

- [train parquet file](https://www.cis.upenn.edu/~myatskar/teaching/cis519/a5/train.parquet)

- [dev parquet file](https://www.cis.upenn.edu/~myatskar/teaching/cis519/a5/dev.parquet)

- [test parquet file](https://www.cis.upenn.edu/~myatskar/teaching/cis519/a5/test.parquet)

- [tokens in training data](https://www.cis.upenn.edu/~myatskar/teaching/cis519/a5/train_tokens.txt)

In [None]:
if NOTEBOOK:
  if not os.path.exists("train.parquet"):
    !wget https://raw.githubusercontent.com/upenn/cis-4190-5190-fall-25/main/hw4/train.parquet
  if not os.path.exists("dev.parquet"):
    !wget https://raw.githubusercontent.com/upenn/cis-4190-5190-fall-25/main/hw4/dev.parquet
  if not os.path.exists("test.parquet"):
    !wget https://raw.githubusercontent.com/upenn/cis-4190-5190-fall-25/main/hw4/test.parquet
  if not os.path.exists("train_tokens.txt"):
    !wget https://raw.githubusercontent.com/upenn/cis-4190-5190-fall-25/main/hw4/train_tokens.txt

In [None]:
train_file = "train.parquet"
dev_file = "dev.parquet"
test_file = "test.parquet"
vocab_file = "train_tokens.txt"

Below is some helper code to load and build the dataset; you do not need to modify it.

In [None]:
class SSTDataset(object):

    Example_fields = ["tokens", "ids", "label", "is_root", "root_id"]
    Example = collections.namedtuple("Example", Example_fields)


    def canonicalize(self, raw_tokens):
        wordset=(self.vocab.wordset if self.vocab else None)
        return canonicalize_words(raw_tokens, wordset=wordset)

    def __init__(self,train_file,dev_file,test_file,vocab_file,V=20000):
        self.vocab = None
        self.train = pd.read_parquet(train_file)
        self.dev = pd.read_parquet(dev_file)
        self.test = pd.read_parquet(test_file)
        train_words =[]
        with open(vocab_file) as f:
            train_words = f.readlines()
        train_words = [w.strip() for w in train_words]
        # # Build vocabulary over training set
        self.vocab = Vocabulary(train_words, size=V)
        print("Train set has {:,} words".format(self.vocab.size))
        self.target_names = [0,1]

    def get_filtered_split(self, split='train',is_root = True):
        df = getattr(self, split)
        if is_root:
            df = df[df.is_root]
        return df

    def as_padded_array(self, split='train', max_len=40, pad_id=0,is_root = True):
        df = self.get_filtered_split(split,is_root)
        x, ns = pad_np_array(df.ids, max_len=max_len, pad_id=pad_id)
        y = np.empty((1,1))
        if split != 'test':
            y  = np.array(df.label, dtype=np.int32)
        return x, ns, y

    def as_sparse_bow(self, split='train',is_root = True):
        from scipy import sparse
        df = self.get_filtered_split(split,is_root)
        x = id_lists_to_sparse_bow(df['ids'], self.vocab.size)
        if split != 'test':
            return x, np.array(df.label, dtype=np.int32)
        return x

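def require_package(package_name):
    import importlib.util
    import subprocess
    import sys
    # pkgutil.find_loader is deprecated; importlib.util.find_spec is the supported lookup.
    if importlib.util.find_spec(package_name) is None:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', package_name])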

def canonicalize_digits(word):
    if any([c.isalpha() for c in word]): return word
    word = re.sub(r"\d", "DG", word)  # raw string avoids an invalid escape sequence warning
    if word.startswith("DG"):
        word = word.replace(",", "") # remove thousands separator
    return word

def canonicalize_word(word, wordset=None, digits=True):
    word = word.lower()
    if digits:
        if (wordset is not None) and (word in wordset): return word
        word = canonicalize_digits(word) # try to canonicalize numbers
    if (wordset is None) or (word in wordset):
        return word
    else:
        return UNK_TOKEN

def canonicalize_words(words, **kw):
    return [canonicalize_word(word, **kw) for word in words]


def pad_np_array(example_ids, max_len=250, pad_id=0):
    arr = np.full([len(example_ids), max_len], pad_id, dtype=np.int32)
    ns = np.zeros([len(example_ids)], dtype=np.int32)
    for i, ids in enumerate(example_ids):
        cpy_len = min(len(ids), max_len)
        arr[i,:cpy_len] = ids[:cpy_len]
        ns[i] = cpy_len
    return arr, ns

def id_lists_to_sparse_bow(id_lists, vocab_size):
    from scipy import sparse
    ii = []  # row indices (example ids)
    jj = []  # column indices (token ids)
    for row_id, ids in enumerate(id_lists):
        ii.extend([row_id]*len(ids))
        jj.extend(ids)
    x = sparse.csr_matrix((np.ones_like(ii), (ii, jj)),
                          shape=[len(id_lists), vocab_size])
    return x

class Vocabulary(object):

    START_TOKEN = START_TOKEN
    END_TOKEN   = END_TOKEN
    UNK_TOKEN   = UNK_TOKEN

    def __init__(self, tokens, size=None,
                 progressbar=lambda l:l):
        self.unigram_counts = Counter()
        self.bigram_counts = defaultdict(lambda: Counter())
        prev_word = None
        for word in progressbar(tokens):  # Make a single pass through tokens
            self.unigram_counts[word] += 1
            self.bigram_counts[prev_word][word] += 1
            prev_word = word
        self.bigram_counts.default_factory = None  # make into a normal dict

        # Leave space for "<s>", "</s>", and "<unk>"
        top_counts = self.unigram_counts.most_common(None if size is None else (size - 3))
        vocab = ([self.START_TOKEN, self.END_TOKEN, self.UNK_TOKEN] +
                 [w for w,c in top_counts])

        # Assign an id to each word, by frequency
        self.id_to_word = dict(enumerate(vocab))
        self.word_to_id = {v:k for k,v in self.id_to_word.items()}
        self.size = len(self.id_to_word)
        if size is not None:
            assert(self.size <= size)

        # For convenience
        self.wordset = set(self.word_to_id.keys())

        # Store special IDs
        self.START_ID = self.word_to_id[self.START_TOKEN]
        self.END_ID = self.word_to_id[self.END_TOKEN]
        self.UNK_ID = self.word_to_id[self.UNK_TOKEN]

    def words_to_ids(self, words):
        return [self.word_to_id.get(w, self.UNK_ID) for w in words]

    def ids_to_words(self, ids):
        return [self.id_to_word[i] for i in ids]

    def ordered_words(self):
        """Return a list of words, ordered by id."""
        return self.ids_to_words(range(self.size))


In [None]:
ds = SSTDataset(train_file,dev_file, test_file,vocab_file,V=20000)

A few members of the `SSTDataset()` class that we will be using are:
- **`ds.vocab`**: a `Vocabulary` object managing the model vocabulary.
- **`ds.{train,dev,test}`**: a Pandas DataFrame containing the _processed_ examples, including all subphrases. `label` is the target label, `is_root` denotes whether this example is a root node (full sentence), and `tokens` are the tokenized words from the original sentence.

Note that `get_filtered_split`, `as_padded_array`, and `as_sparse_bow` take an `is_root` argument. If you set `is_root=True`, they return only examples corresponding to whole sentences. If you set `is_root=False`, they return examples for all phrases.

In [None]:
is_root = False

## **1.1 [4190: 16 autograded; 5190: 16 autograded] [Deep Averaging Networks](https://people.cs.umass.edu/~miyyer/pubs/2015_acl_dan.pdf)**

We are going to implement a Deep Averaging Network (DAN).

![dan](https://miro.medium.com/max/904/1*0LezMYWUk3pXptoMdO5M_Q.png)


Vector space models for natural language processing (NLP) represent words using low dimensional vectors called embeddings. To apply vector space
models to sentences or documents, one must first select an appropriate composition function, which combines multiple words into a single vector.

Composition functions fall into two classes: unordered and syntactic. Unordered functions treat input texts as bags of word embeddings, while syntactic functions take word order and sentence structure
into account. Syntactic functions outperform unordered functions on many tasks. However, there is a tradeoff: syntactic functions require more training time and computing resources.

The deep averaging network (DAN) is a deep unordered model that obtains near state-of-the-art accuracies on a variety of sentence and document-level tasks with just minutes of training time on an average laptop computer. It works in three simple steps:
1. Take the vector average of the embeddings associated with an input sequence of tokens
2. Pass that average through one or more feedforward layers
3. Perform (linear) classification on the final layer's representation

Furthermore, DANs can be effectively trained on data that have high syntactic variance. The model works by magnifying tiny but meaningful differences in the vector average.

We are going to use DANs for the same classification problem.
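The three DAN steps can be sketched in a few lines of NumPy. This is only an illustration under assumed toy sizes and random weights (`n_tokens`, `d_hidden`, `W1`, `W2`, and so on are made up here); the model you will build later in this section uses PyTorch and pretrained embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, chosen only for illustration.
n_tokens, d_embed, d_hidden, n_classes = 5, 8, 6, 2

token_embeddings = rng.normal(size=(n_tokens, d_embed))  # one row per token

# Step 1: average the token embeddings into a single sentence vector.
avg = token_embeddings.mean(axis=0)                      # shape (d_embed,)

# Step 2: pass the average through a feedforward layer (ReLU activation here).
W1, b1 = rng.normal(size=(d_embed, d_hidden)), np.zeros(d_hidden)
hidden = np.maximum(0.0, avg @ W1 + b1)                  # shape (d_hidden,)

# Step 3: linear classification on the final representation.
W2, b2 = rng.normal(size=(d_hidden, n_classes)), np.zeros(n_classes)
logits = hidden @ W2 + b2                                # shape (n_classes,)

# Softmax turns the logits into class probabilities.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# The model is unordered: reversing the token order leaves the average unchanged.
same_when_reordered = np.allclose(token_embeddings[::-1].mean(axis=0), avg)
```

Because the average discards word order, any permutation of the tokens produces the same prediction, which is exactly the "unordered" property described above.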

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

In [None]:
import os
import time
import glob
import numpy as np

import sys
from argparse import ArgumentParser

### **1.1.1 [4190: 6 autograded; 5190: 6 autograded] [Glove Embeddings](https://nlp.stanford.edu/projects/glove/)**
We are downloading pretrained GloVe word vectors that have been trained on Common Crawl data, a snapshot of the whole web.
These embeddings serve as excellent initializations for the embeddings our model needs.
Please download the GloVe embeddings.

In [None]:
!wget -nc https://huggingface.co/stanfordnlp/glove/resolve/main/glove.840B.300d.zip
!unzip glove.840B.300d.zip
!ls -lat

In [None]:
glove_file = "glove.840B.300d.txt"
train_x, train_ns, train_y = ds.as_padded_array("train",is_root = is_root)
dev_x, dev_ns, dev_y = ds.as_padded_array("dev",is_root = is_root)
test_x, test_ns,_  = ds.as_padded_array("test",is_root = is_root)

print("Training set: x = {:s}, ns = {:s}, y = {:s}".format(str(train_x.shape), str(train_ns.shape),
                                                str(train_y.shape)))
print("Validation set: x = {:s}, ns = {:s}, y = {:s}".format(str(dev_x.shape), str(dev_ns.shape),
                                                str(dev_y.shape)))
print("Test set:     x = {:s}, ns = {:s}".format(str(test_x.shape), str(test_ns.shape)))

In [None]:
#look at the format of the file
!head glove.840B.300d.txt

#### **1.1.1.1 [2 pts autograded] Get Glove embeddings**
In this section we want to populate the `glove_map` dictionary with a mapping from each word to its embedding. Remember: the embedding should be an `np.array` of floats (the `np.float` alias was removed in NumPy 1.24, so use `float` or `np.float64` as the dtype). The dictionary should only contain words that are present in the train vocabulary.


**Hint:**
For getting the word and corresponding embedding from the glove file, remember to refer to the above structure of the word to embedding mapping.
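As a small illustration of the file layout, each row of a GloVe text file is a word followed by its space-separated vector components. The line below is made up and has only three components (real rows in this file have 300), so treat this as a sketch of the format rather than the full solution.

```python
# A made-up line in the "word v1 v2 ... vd" layout, read as bytes like the real file.
line = b"film 0.12 -0.45 0.30\n"

# Decode to text, drop the trailing newline, and split on single spaces.
parts = line.decode("utf-8").rstrip().split(" ")
word = parts[0]                                  # first field is the word
vector = [float(value) for value in parts[1:]]   # remaining fields are the vector
```

Converting `vector` to a NumPy array and filtering by the vocabulary are left to the function above.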

In [None]:
#takes about 1 minute to read through the whole file and find the words we need.
def get_glove_mapping(vocab, file):
    """
    Gets the mapping of words from the vocabulary to pretrained embeddings

    INPUT:
    vocab       - set of vocabulary words
    file        - file with pretrained embeddings

    OUTPUT:
    glove_map   - mapping of words in the vocabulary to the pretrained embedding

    """

    glove_map = {}
    with open(file,'rb') as fi:
        for l in fi:
            try:
                # STUDENT TODO START:

                # 1. Decode the bytes into string and split into words

                # 2. The first element of the array is the word

                # 3. Only process if the word is in the vocabulary set

                # 4. Take the rest of the elements as the vector

                # 5. Assign the vector to the glove map for the corresponding word

                # STUDENT TODO END
            except Exception:
                # Some lines contain URLs, which will raise an exception.
                pass
    return glove_map

In [None]:
vocab_set = set(ds.vocab.ordered_words())
glove_map = get_glove_mapping(vocab_set,glove_file)

In [None]:
def test_glove_embedding(glove_map):
    assert(len(glove_map.keys()) == 15506)
    assert("November" not in glove_map.keys())

if NOTEBOOK:
    test_glove_embedding(glove_map)

In [None]:
# PennGrader Grading Cell
if NOTEBOOK:
    grader.grade(test_case_id = 'test_glove_embedding', answer = list(glove_map.keys()))

#### **1.1.1.2 [2 pts autograded] Dimensions required for the weight matrix**

Fill in the dimensions required for the weight matrix.

In [None]:
# STUDENT TODO START:
d_out = #number of outputs
n_embed = #size of the dictionary of embeddings
d_embed = # the size of each embedding vector
dims =(d_out,n_embed,d_embed)
# STUDENT TODO END

In [None]:
def test_dimensions(dims):
    d_out,n_embed,d_embed = dims
    assert(n_embed == 16474)
    assert(d_out == 2)
    assert(d_embed == 300)

if NOTEBOOK:
    test_dimensions(dims)

In [None]:
# PennGrader Grading Cell
if NOTEBOOK:
    grader.grade(test_case_id = 'test_dimensions', answer = dims)

#### **1.1.1.3 [2 pts autograded] Initializing the weight matrix**

Create `weights_matrix` for the parameters to be learnt. Initialize the row of the weight matrix for a particular id with the GloVe embedding of the word with that id. If you do not find a particular word, initialize its row with `np.random.normal`.

Hint: `ds.vocab.ordered_words()` can give you the mapping of id to words. `glove_map` has the embeddings you need.

In [None]:
def get_weight_matrix(n_embed, d_embed, glove_map):
    """
    Initialize the weight matrix

    INPUT:
    n_embed         - size of the dictionary of embeddings
    d_embed         - the size of each embedding vector
    glove_map       - mapping of words in the vocabulary to the pretrained embedding

    OUTPUT:
    weights_matrix  - matrix of mapping from word id to embedding

    """
    # STUDENT TODO START:

    # 1. Initialize zero matrix with the dimensions

    # 2. Iterate through the vocabulary words

    # -- if the word is found in the glove map, set its row of the matrix to that embedding

    # -- else, assign to a random vector of normal distribution

    # STUDENT TODO END
    return weights_matrix

In [None]:
weights_matrix = get_weight_matrix(n_embed, d_embed, glove_map)
weight_data = (weights_matrix.shape, weights_matrix[:155])

In [None]:
def test_weight_matrix(weight_data):
    mat1 = [-0.18994 ,  0.11016 , -0.46874 ,  0.24375 ,  0.18241 ,  0.2649  ,
       -0.025122, -0.58228 , -0.23545 ,  0.20763 ]
    shape = (16474, 300)
    # The shape must match, and the sampled entries must agree within tolerance.
    assert(shape == weight_data[0])
    for i in range(0,10):
        assert(abs(mat1[i] - weight_data[1][150][200+i]) < 0.002)
if NOTEBOOK:
    test_weight_matrix(weight_data)

In [None]:
# PennGrader Grading Cell
if NOTEBOOK:
    grader.grade(test_case_id = 'test_weight_matrix', answer = weight_data)

#### **1.1.1.4 Creating Embedding Layer**
Use the weight matrix to create the embedding layer by using `nn.Embedding`.

In [None]:
def create_emb_layer(weights_matrix, non_trainable=False):
    """
    Create the embedding layer

    INPUT:
    weights_matrix  - matrix of mapping from word id to embedding
    non_trainable   - Flag for whether the weight matrix should be trained.
                      If it is set to True, don't update the gradients

    OUTPUT:
    emb_layer       - embedding layer

    """
    # STUDENT TODO START:

    # 1. Extract the dimensions from weights_matrix

    # 2. Create an embedding layer using the dimensions

    # 3. Convert to tensor and update the embedding layer weight

    # 4. If non_trainable is set to True, don't update the gradients

    # STUDENT TODO END

    return emb_layer

#### **1.1.1.5 Defining the Dataloader**

For ease of batch processing, we define the following dataset class so that we can use PyTorch's `DataLoader`.

Note how the mask is created for word dropout.

In [None]:
class SSTpytorchDataset(Dataset):
    def __init__(self, sst_ds, word_dropout = 0.3, split='train'):
        super(SSTpytorchDataset, self).__init__()
        assert split in ['train', 'test', 'dev'], "Error!"
        self.ds = sst_ds
        self.split = split
        self.word_dropout = word_dropout
        self.data_x, self.data_ns, self.data_y = self.ds.as_padded_array(split,is_root =is_root)
        self.mask = np.zeros_like(self.data_x)

    def __len__(self):
        return self.data_x.shape[0]

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        y = 2
        if self.split != 'test':
            y = self.data_y[idx]

        #Returning the mask for the dataloader

        mask = np.zeros(len(self.data_x[idx]))
        sentl = self.data_ns[idx]
        total_dropped = 0
        for j in range(0,sentl):
            mask[j] = 1
            if self.split == 'train':
                rv = random.random()
                if rv  < self.word_dropout:
                    mask[j] = 0
                    total_dropped+=1
        if total_dropped >= sentl:
            mask[0] = 1
        for i in range(sentl,len(self.data_x[idx])):
            mask[i] = 0
        self.mask[idx] = mask
        return self.data_x[idx], self.data_ns[idx], self.mask[idx], y


### **1.1.2 [10 pts autograded] Training**

####  Masked Averaging

In this section, you will need to compute the average word embedding of tokens in the input. One complication is that sentences come in different lengths, and we will need to keep track of this to correctly average.

When a sentence is input into our network, it is mapped to a list of token ids, up to some maximum length. We construct a matrix, M, where each row corresponds to a sentence, and entries are integers representing tokens. Some sentences are, of course, shorter than this maximum length. For these sentences, we fill in the remaining elements of M with a special pad index, up to the max length, indicating that we are beyond the end of the sentence. The dataloader takes care of this for you. When averaging, we need to ignore these elements.

Regardless of whether a token is a pad or a real token, the first step is to look up an embedding for the index in our embedding table (the first line of the forward method). At this point we will have retrieved some vectors that correspond to the pad tokens as well. We need to ignore these, and only average vectors that correspond to non-pad symbols.

To help do so, NLP applications often introduce a mask as part of the input. The mask is a binary vector for every sentence, where each position encodes whether the token is really from the sentence or should instead be ignored. The shape of the mask is batch_size by maximum_length. Again, the dataloader has taken care of this for you. Your job will be to use this mask to ignore the embedding components we don't want to average over.

You have to perform the following steps:

1. Change the view of the mask so it extends to the embedding size. It starts as `batch` by `max_sent`, but we need it to be `batch` by `max_sent` by `d_embed`. The [expand](https://pytorch.org/docs/stable/generated/torch.Tensor.expand.html) PyTorch function will help.
2. Pointwise multiply the expanded mask with the embeddings, to eliminate the tokens that aren't in the mask, and sum the rest (this is the `numerator` of our average). Remember: the mask is a binary vector, so the zeros correspond to elements we don't want in our average. The output of this sum should be `batch` by `d_embed`.
3. Calculate the number of words in each sentence (this is the `denominator` of our average).
4. Return `x = numerator/denominator`, the average.
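The arithmetic of these steps can be checked by hand on a tiny batch. The sketch below uses NumPy broadcasting on made-up numbers purely to show what the numerator, denominator, and average should look like; your implementation should use PyTorch tensors and `expand` as described above.

```python
import numpy as np

# Toy batch: 2 sentences, max length 3, embedding size 2 (values are made up).
embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],   # third position is padding
    [[2.0, 0.0], [9.0, 9.0], [9.0, 9.0]],   # second and third positions are padding
])
mask = np.array([
    [1, 1, 0],
    [1, 0, 0],
])

# Give the mask a trailing axis so it applies across the embedding dimension.
expanded = mask[:, :, None]                        # shape (2, 3, 1)

numerator = (embeddings * expanded).sum(axis=1)    # shape (2, 2): sum of real tokens
denominator = mask.sum(axis=1, keepdims=True)      # shape (2, 1): number of real tokens
average = numerator / denominator                  # shape (2, 2)

# For comparison: a plain mean also averages in the padding vectors.
naive_average = embeddings.mean(axis=1)
```

Here the first sentence averages `[1, 2]` and `[3, 4]` to `[2, 3]`, and the second keeps its single real token `[2, 0]`, whereas the plain mean is pulled toward the padding values.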

#### Defining the architecture for Deep Averaging Networks

In [None]:
import random

class DAN(nn.Module):

    def __init__(self,
                 n_embed=20000,
                 d_embed=300,
                 d_hidden=100,
                 d_out=2,
                 layer_dropout = 0.2,
                 word_dropout = 0.3,
                 embeddings=None,
                 depth = 0):
        super(DAN, self).__init__()

        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.embed = create_emb_layer(weights_matrix,False)

        self.fc_out = nn.Linear(d_hidden, d_out)
        self.word_dropout = word_dropout

    def masked_mean(self,v, mask):
        """
        Create the masked mean

        INPUT:
        v       - input
        mask    - mask that has 0 and 1 for all the tokens in the input
                  0 corresponds to a token we should not include in the average and 1 otherwise

        OUTPUT:
        x       - average

        """
        (batch, max_sent, d_embed) = v.size() # these values will be useful for expanding the mask
        # STUDENT TODO START:
        # 1. Reshape the mask and expand to the embeddings

        # 2. Sum the number of non-masked tokens

        # 3. Zero out the masked embeddings and sum the remaining embeddings

        # 4. Take the average embedding as x

        # STUDENT TODO END
        return x

    def forward(self, text_ids, mask):
        embeddings = self.embed(text_ids) #this is a matrix of embeddings, one for each id, of size batch_size X max_sent_size X embedding dimension
        avg = self.masked_mean(embeddings,mask) #should return the average of the embeddings, ignoring the embeddings corresponding to the pad token
        output = self.fc_out(avg) #final classification layer
        return output

#### Training Loop

In [None]:
criterion = nn.CrossEntropyLoss()

batch_size = 64
epochs = 3
dev_every = 100
lr = 0.001
save_path = "best_model"
drop_out = 0
word_dropout = 0.01
weight_decay = 1e-5

In [None]:

def train(lr = .005, drop_out = 0, word_dropout = .3, batch_size = 16, weight_decay = 1e-5,args = None):
    if args is not None:
      drop_out = args["drop_out"]

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    trainset = SSTpytorchDataset(ds, word_dropout, 'train')
    testset = SSTpytorchDataset(ds, word_dropout, 'test')
    devset = SSTpytorchDataset(ds, word_dropout, 'dev')

    train_iter = DataLoader(trainset, batch_size, shuffle=True, num_workers=0)
    test_iter = DataLoader(testset, batch_size, shuffle=False, num_workers=0)
    dev_iter = DataLoader(devset, batch_size, shuffle=False, num_workers=0)

    model = DAN(n_embed=n_embed, d_embed=d_embed, d_hidden=300, d_out=d_out, layer_dropout=drop_out, word_dropout = word_dropout )
    model.to(device)

    optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay = weight_decay)


    acc, val_loss = evaluate(dev_iter, model, device)
    best_acc = acc

    print(
        'epoch |   %        |  loss  |  avg   |val loss|   acc   |  best  | time | save |')
    print(
        'val   |            |        |        | {:.4f} | {:.4f} | {:.4f} |      |      |'.format(
            val_loss, acc, best_acc))

    iterations = 0
    last_val_iter = 0
    train_loss = 0
    start = time.time()
    _save_ckp = ''
    for epoch in range(epochs):
        n_correct, n_total, train_loss = 0, 0, 0
        last_val_iter = 0
        for batch_idx, batch in enumerate(train_iter):
            # switch model to training mode, clear gradient accumulators
            model.train()
            optimizer.zero_grad()

            iterations += 1

            data, ns, mask, label = batch

            data = data.to(device)
            label = label.to(device).long()
            mask = mask.to(device).long()
            mask.requires_grad = False

            answer = model(data,mask)
            loss = criterion(answer, label)

            loss.backward()
            optimizer.step()

            train_loss += loss.item()
            print('\r {:4d} | {:4d}/{} | {:.4f} | {:.4f} |'.format(
                epoch, batch_size * (batch_idx + 1), len(trainset), loss.item(),
                       train_loss / (iterations - last_val_iter)), end='')

            if iterations > 0 and iterations % dev_every == 0:
                acc, val_loss= evaluate(dev_iter, model, device)

                if acc > best_acc:
                    best_acc = acc
                    torch.save(model.state_dict(), save_path)
                    _save_ckp = '*'

                print(
                    ' {:.4f} | {:.4f} | {:.4f} | {:.2f} | {:4s} |'.format(
                        val_loss, acc, best_acc, (time.time() - start) / 60,
                        _save_ckp))

                train_loss = 0
                last_val_iter = iterations
    model.load_state_dict(torch.load(save_path)) #this will be the best model
    test_y_pred = evaluate(test_iter,model, device,"test")
    print("\nValidation Accuracy : ", evaluate(dev_iter,model, device))
    return best_acc, test_y_pred


In [None]:
def evaluate(loader, model, device, split = "dev"):
    model.eval()
    n_correct, n = 0, 0
    losses = []
    y_pred = []
    with torch.no_grad():
        for batch_idx, batch in enumerate(loader):
            data, ns, mask, label = batch
            data = data.to(device)
            label = label.to(device).long()
            mask = mask.to(device).long()
            answer = model(data,mask)
            if split != "test":
                n_correct += (torch.max(answer, 1)[1].view(label.size()) == label).sum().item()
                n += answer.shape[0]
                loss = criterion(answer, label)
                losses.append(loss.data.cpu().numpy())
            else:
                y_pred.extend(torch.max(answer, 1)[1].view(label.size()).tolist())
    if split != "test":
        acc = 100. * n_correct / n
        loss = np.mean(losses)
        return acc, loss
    else:
        return y_pred


Run this to get the validation accuracy on the dev dataset and the predictions of the test dataset.

In [None]:
torch.manual_seed(1234)

epochs = 3
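# Pass arguments by keyword so each value reaches the intended parameter.
dev_value, test_y_pred = train(lr=lr, drop_out=drop_out, word_dropout=word_dropout,
                               batch_size=batch_size, weight_decay=weight_decay)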

In [None]:
# PennGrader Grading Cell
if NOTEBOOK:
    grader.grade(test_case_id = 'test_dan_predictions', answer = test_y_pred)


## **1.2 [4190: 8 autograded; 5190: 10 autograded] Transformers**

In lecture we discussed the Transformer, a very popular model architecture. The original paper that proposed the Transformer is [Attention Is All You Need (Vaswani et al. 2017)](https://arxiv.org/abs/1706.03762), and you can read it if interested.

Recall that it is a composition of self-attention layers. Here is a diagram of the architecture:
![transformer architecture](https://d2l.ai/_images/transformer.svg)

Self-attention is therefore essential for Transformers, and in this homework question your task is to implement the multi-head attention block in a Transformer.

### **1.2.1 Helper functions**

There is no code that you need to write here, but you do need to run this section!

In [None]:
!pip install torchtext

In [None]:
import torch.nn as nn
import torch
import torch.nn.functional as F
import math,copy,re
import warnings
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
warnings.simplefilter("ignore")

In [None]:
cur_device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
class PositionalEncoder(nn.Module):
    def __init__(self, embed_dim, max_len=300, device=cur_device):
        super().__init__()
        self.position_embedding = torch.zeros((1, max_len, embed_dim)).to(device)
        i = torch.arange(max_len, dtype=torch.float32).reshape(-1, 1)
        j2 = torch.arange(0, embed_dim, step=2, dtype=torch.float32)
        x = i / torch.pow(10000, j2 / embed_dim)
        self.position_embedding[..., 0::2] = torch.sin(x)
        self.position_embedding[..., 1::2] = torch.cos(x)

    def forward(self, x):
        x_plus_p = x + self.position_embedding[:, : x.shape[1]]
        return x_plus_p
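
As a rough illustration of what `PositionalEncoder` precomputes, here is a small NumPy sketch of the same sinusoidal table. The function name `sinusoidal_table` is made up for this example, and it assumes `embed_dim` is even so that the sine and cosine columns pair up.

```python
import numpy as np

def sinusoidal_table(max_len, embed_dim):
    """Return a (max_len, embed_dim) table of sinusoidal position encodings."""
    pos = np.arange(max_len, dtype=np.float64).reshape(-1, 1)   # (max_len, 1)
    j2 = np.arange(0, embed_dim, 2, dtype=np.float64)           # even indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, j2 / embed_dim)            # (max_len, embed_dim / 2)
    table = np.zeros((max_len, embed_dim))
    table[:, 0::2] = np.sin(angles)   # even feature columns
    table[:, 1::2] = np.cos(angles)   # odd feature columns
    return table

pe = sinusoidal_table(max_len=6, embed_dim=4)
print(pe.shape)   # (6, 4)
print(pe[0])      # position 0: sin(0) = 0 in even columns, cos(0) = 1 in odd columns
```

Each position gets a distinct pattern, and later column pairs oscillate more slowly, which is what lets the model tell nearby positions apart from distant ones.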

In [None]:
class ResidualNorm(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x, residual):
        return self.norm(x + residual)


class Feedforward(nn.Module):
    def __init__(self, embed_dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(embed_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, embed_dim)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

### **1.2.2 [8 pts autograded] Multihead Attention**

Recall that the attention mechanism requires three main components:

 - the query vectors Q
 - the key vectors K
 - the value vectors V

For self-attention, all three are calculated from the same input using three different learnable weight matrices. Intuitively, we measure how similar each query is to each key and use the resulting attention scores to form a weighted sum of the values.

For multi-head attention, each head attends over its own slice of (Q, K, V): the embedding dimension is split into `n_heads` chunks of size `head_dim` so that all heads can be computed in parallel. This is done by our helper function `mha_transform_input`. At the end you also need to transform the output back to the original shape with `mha_transform_output` so that it can be used as input to later layers.

**Note:** EVERY computation with the 3 vectors must be done **in this order: Q,K,V**
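
For reference, the computation described above is usually written as follows. The notation is taken from Vaswani et al. (2017): $d_k$ is the per-head key dimension (`head_dim` in this notebook), $h$ is the number of heads, and $W_i^Q, W_i^K, W_i^V, W^O$ are the learnable projection matrices. In this notebook the padding mask is applied to the scores before the softmax.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
$$

$$
\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V), \qquad
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O
$$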

In [None]:
def masked_softmax(x, mask):
    """
    x:   (B*H, T, T)
    mask:(B*H, T)  # 1 = keep, 0 = mask
    """
    mask = mask.to(dtype=torch.bool)
    # expand to (B*H, 1, T) so it masks along the key dimension
    x = x.masked_fill(~mask.unsqueeze(1), torch.finfo(x.dtype).min)
    attn = F.softmax(x, dim=-1)
    return attn


def mha_transform_input(x, n_heads, head_dim):
    """Restructure the input tensors to compute the heads in parallel
    Requires that head_dim = embed_dim / n_heads
    Args:
      x (n_batch, n_tokens, embed_dim): input tensor, one of queries, keys, or values
      n_heads (int): the number of attention heads
      head_dim (int): the dimensionality of each head
    Returns:
      (n_batch*n_heads, n_tokens, head_dim): 3D Tensor containing all the input heads
    """
    n_batch, n_tokens, _ = x.shape
    x = x.reshape((n_batch, n_tokens, n_heads, head_dim))
    x = x.permute(0, 2, 1, 3)
    return x.reshape((n_batch * n_heads, n_tokens, head_dim))


def mha_transform_output(x, n_heads, head_dim):
    """Restructures the output back to the original format
    Args:
      x (n_batch*n_heads, n_tokens, head_dim): multi-head representation tensor
      n_heads (int): the number of attention heads
      head_dim (int): the dimensionality of each head
    Returns:
      (n_batch, n_tokens, embed_dim): 3D Tensor with the heads concatenated back together
    """
    n_concat, n_tokens, _ = x.shape
    n_batch = n_concat // n_heads
    x = x.reshape((n_batch, n_heads, n_tokens, head_dim))
    x = x.permute(0, 2, 1, 3)
    return x.reshape((n_batch, n_tokens, n_heads * head_dim))


class ScaledDotProductAttention(nn.Module):
    def __init__(self, head_dim):
        super().__init__()
        self.head_dim = head_dim

    def forward(self, queries, keys, values, mask):
        """
        Args:
          queries (n_batch, n_tokens, embed_dim): queries (Q) tensor
          keys (n_batch, n_tokens, embed_dim): keys (K) tensor
          values (n_batch, n_tokens, embed_dim): values (V) tensor
          mask (n_batch, n_tokens): binary mask tensor
        Returns:
          (n_batch, n_tokens, embed_dim): scaled dot product attention tensor
        """
        # STUDENT TODO START:
        # 1. Calculate the batched dot product of queries and keys

        # 2. Scale it by dividing by the square root of the head dimension (self.head_dim)

        # 3. Pass the scaled dot product through masked_softmax to get attention weights

        # 4. Compute final attention using the attention weights and values

        # STUDENT TODO END
        return attention


class MultiHeadAttention(nn.Module):
    def __init__(self, n_heads, embed_dim):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = embed_dim // n_heads

        self.attention = ScaledDotProductAttention(self.head_dim)

        # STUDENT TODO START:
        # Define the weight matrices for each of Q, K, and V (in this order)
        # You can do this with fully connected linear layers
        # Remember to set bias=False so that each layer is a pure weight matrix

        # STUDENT TODO END

        self.out_fc = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, queries, keys, values, mask):
        """
        Args:
          queries (n_batch, n_tokens, embed_dim): queries (Q) tensor
          keys (n_batch, n_tokens, embed_dim): keys (K) tensor
          values (n_batch, n_tokens, embed_dim): values (V) tensor
          mask (n_batch, n_tokens): binary mask tensor
        Returns:
          (n_batch, n_tokens, embed_dim): multi-head attention tensor
        """
        # STUDENT TODO START:

        # For each of Q, K, and V
        # 1. Multiply by its corresponding weight matrix (pass it through the Linear layer)

        # 2. Use mha_transform_input to transform it into multi-head form

        # 3. Calculate the attention results: pass in Q, K, V, mask

        # 4. Use mha_transform_output to transform it back into the correct size

        # 5. Pass the results through the output fully connected layer

        # STUDENT TODO END
        return attention
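
To build intuition for the helper functions above, here is a small NumPy sketch of the head split and merge and of the masked softmax. The names `split_heads`, `merge_heads`, and `masked_softmax_np` are made-up stand-ins for the torch helpers, and the toy sizes are arbitrary. It is an illustration under those assumptions, not the graded implementation.

```python
import numpy as np

def split_heads(x, n_heads, head_dim):
    # (B, T, E) -> (B*H, T, head_dim), mirroring mha_transform_input
    b, t, _ = x.shape
    x = x.reshape(b, t, n_heads, head_dim).transpose(0, 2, 1, 3)
    return x.reshape(b * n_heads, t, head_dim)

def merge_heads(x, n_heads, head_dim):
    # (B*H, T, head_dim) -> (B, T, E), mirroring mha_transform_output
    bh, t, _ = x.shape
    b = bh // n_heads
    x = x.reshape(b, n_heads, t, head_dim).transpose(0, 2, 1, 3)
    return x.reshape(b, t, n_heads * head_dim)

def masked_softmax_np(scores, mask):
    # scores: (B*H, T, T); mask: (B*H, T) with 1 = keep, 0 = mask
    keep = mask.astype(bool)[:, None, :]                    # broadcast over the query axis
    scores = np.where(keep, scores, np.finfo(scores.dtype).min)
    scores = scores - scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

B, T, H, D = 2, 5, 2, 3                       # toy sizes; embed_dim = H * D = 6
x = np.arange(B * T * H * D, dtype=np.float64).reshape(B, T, H * D)
heads = split_heads(x, H, D)
print(heads.shape)                            # (4, 5, 3)
print(np.array_equal(merge_heads(heads, H, D), x))   # True: the round trip is lossless

mask = np.array([[1, 1, 1, 1, 0],             # sequence 0: last token is padding
                 [1, 1, 1, 0, 0]])            # sequence 1: last two tokens are padding
mask_per_head = np.repeat(mask, H, axis=0)    # row b*H + h holds the mask of sequence b
attn = masked_softmax_np(np.zeros((B * H, T, T)), mask_per_head)
print(attn[0, 0])                             # uniform over the 4 real tokens, 0 on padding
```

Note how the mask has to be repeated once per head so that its first dimension matches the `(B*H, T, T)` score tensor; with a batch size of 1, broadcasting hides this requirement.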

#### Test your Multihead Attention implementation

In [None]:
embed_dim = 4
head_dim = 4
my_scaled = ScaledDotProductAttention(head_dim)

torch.manual_seed(522)
src_tokens = torch.Tensor([[[7,1,6,5],[8,2,2,3],[5,3,4,2],[1,4,2,1],[10,1,9,7]]]).to(cur_device)
src_mask = torch.IntTensor([[1,1,1,1,0]]).to(cur_device)

# HINT: scaled_answer should have shape (1, 5, 4)
scaled_answer = my_scaled(src_tokens, src_tokens, src_tokens, src_mask).cpu().numpy()

In [None]:
# PennGrader Grading Cell
if NOTEBOOK:
    grader.grade(test_case_id = 'test_scaled_dot_product', answer = scaled_answer)

In [None]:
n_heads = 2
embed_dim = 2
my_att = MultiHeadAttention(n_heads, embed_dim)

torch.manual_seed(522)
src_tokens = torch.Tensor([[[2, 7],[3, 8],[4, 5],[9, 1],[2, 10]]])
src_mask = torch.IntTensor([[1,1,1,1,0]])

# HINT: att_answer should have shape (1, 5, 2)
att_answer = my_att(src_tokens, src_tokens, src_tokens, src_mask).detach().numpy()

In [None]:
# PennGrader Grading Cell
if NOTEBOOK:
    grader.grade(test_case_id = 'test_multihead_attention', answer = att_answer)

#### Full Transformers model code

These code blocks implement a full transformer model using PyTorch, including all essential encoder and decoder components: embedding, positional encoding, multi-head attention, feedforward layers, and normalization. You can use the provided test case to verify that your own implementation matches the required architecture and produces the expected output shape for sequence-to-sequence tasks. This setup ensures your model is ready for tasks like translation, summarization, or next-token prediction.

In [None]:
class EncoderBlock(nn.Module):
    def __init__(self, n_heads, embed_dim, hidden_dim):
        super().__init__()
        self.attention = MultiHeadAttention(n_heads, embed_dim)
        self.norm1 = ResidualNorm(embed_dim)
        self.feedforward = Feedforward(embed_dim, hidden_dim)
        self.norm2 = ResidualNorm(embed_dim)

    def forward(self, src_tokens, src_mask):
        """
        Args:
          src_tokens (n_batch, n_tokens, embed_dim): the source sequence
          src_mask (n_batch, n_tokens): binary mask over the source
        Returns:
          (n_batch, n_tokens, embed_dim): the encoder state
        """
        # First compute self-attention on the source tokens by passing them in
        # as the queries, keys, and values to the attention module.
        self_attention = self.attention(src_tokens, src_tokens, src_tokens, src_mask)
        # Next compute the norm of the self-attention result with a residual
        # connection from the source tokens
        normed_attention = self.norm1(self_attention, src_tokens)
        # Pass the normed attention result through the feedforward component
        ff_out = self.feedforward(normed_attention)
        # Finally compute the norm of the feedforward output with a residual
        # connection from the normed attention output
        out = self.norm2(ff_out, normed_attention)
        return out

In [None]:
class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, n_heads, n_blocks):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim).to(cur_device)
        self.positional_encoding = PositionalEncoder(embed_dim).to(cur_device)
        self.encoder_blocks = nn.ModuleList(
            [EncoderBlock(n_heads, embed_dim, hidden_dim) for _ in range(n_blocks)]
        ).to(cur_device)

    def forward(self, src_tokens, src_mask):
        x = self.embedding(src_tokens)
        x = self.positional_encoding(x)
        for block in self.encoder_blocks:
            x = block(x, src_mask)
        return x

In [None]:
class DecoderBlock(nn.Module):
    def __init__(self, n_heads, embed_dim, hidden_dim):
        super().__init__()
        self.self_attention = MultiHeadAttention(n_heads, embed_dim)
        self.norm1 = ResidualNorm(embed_dim)
        self.encoder_attention = MultiHeadAttention(n_heads, embed_dim)
        self.norm2 = ResidualNorm(embed_dim)
        self.feedforward = Feedforward(embed_dim, hidden_dim)
        self.norm3 = ResidualNorm(embed_dim)

    def forward(self, tgt_tokens, tgt_mask, encoder_state, src_mask):
        """
        Args:
          tgt_tokens (n_batch, n_tokens, embed_dim): the target sequence
          tgt_mask (n_batch, n_tokens): binary mask over the target tokens
          encoder_state (n_batch, n_tokens, embed_dim): the output of the encoder pass
          src_mask (n_batch, n_tokens): binary mask over the source tokens
        Returns:
          (n_batch, n_tokens, embed_dim): the decoder state
        """
        # First compute self-attention on the target tokens by passing them in
        # as the queries, keys, and values to the attention module along with the
        # target mask.
        self_attention = self.self_attention(tgt_tokens, tgt_tokens, tgt_tokens, tgt_mask)
        # Next compute the norm of the self-attention result with a residual
        # connection from the target tokens
        normed_self_attention = self.norm1(self_attention, tgt_tokens)
        # Compute the encoder attention by using the normed self-attention output as
        # the queries and the encoder state as the keys and values along with the
        # source mask.
        encoder_attention = self.encoder_attention(normed_self_attention, encoder_state, encoder_state, src_mask)
        # Next compute the norm of the encoder attention result with a residual
        # connection from the normed self-attention
        normed_encoder_attention = self.norm2(encoder_attention, normed_self_attention)
        # Pass the normed encoder attention result through the feedforward component
        ff_out = self.feedforward(normed_encoder_attention)
        # Finally compute the norm of the feedforward output with a residual
        # connection from the normed attention output
        out = self.norm3(ff_out, normed_encoder_attention)
        return out

In [None]:
class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, n_heads, n_blocks):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim).to(cur_device)
        self.positional_encoding = PositionalEncoder(embed_dim).to(cur_device)
        self.decoder_blocks = nn.ModuleList(
            [DecoderBlock(n_heads, embed_dim, hidden_dim) for _ in range(n_blocks)]
        ).to(cur_device)

    def forward(self, tgt_tokens, tgt_mask, encoder_state, src_mask):
        x = self.embedding(tgt_tokens)
        x = self.positional_encoding(x)
        for block in self.decoder_blocks:
            x = block(x, tgt_mask, encoder_state, src_mask)
        return x

In [None]:
class Transformer(nn.Module):
    def __init__(
        self, src_vocab_size, tgt_vocab_size, embed_dim, hidden_dim, n_heads, n_blocks
    ):
        super().__init__()
        self.encoder = Encoder(src_vocab_size, embed_dim, hidden_dim, n_heads, n_blocks)
        self.decoder = Decoder(tgt_vocab_size, embed_dim, hidden_dim, n_heads, n_blocks)
        self.out = nn.Linear(embed_dim, tgt_vocab_size).to(cur_device)

    def forward(self, src_tokens, src_mask, tgt_tokens, tgt_mask):
        # Compute the encoder output state from the source tokens and mask
        encoder_state = self.encoder(src_tokens, src_mask)
        # Compute the decoder output state from the target tokens and mask as well
        # as the encoder state and source mask
        decoder_state = self.decoder(tgt_tokens, tgt_mask, encoder_state, src_mask)
        # Compute the vocab scores by passing the decoder state through the output
        # linear layer
        out = self.out(decoder_state)
        return out

#### **1.2.2.3 [2 pts autograded] [5190 Extra] Test your implementation with the entire Transformer implementation**

In [None]:
# Test for Transformer
torch.manual_seed(522)
src_vocab_size = tgt_vocab_size = 5
n_blocks, n_heads, batch_size, embed_dim, hidden_dim = 10, 2, 1, 4, 8
src_tokens = tgt_tokens = torch.IntTensor([[0,1,2,3,4]]).to(cur_device)
src_mask = tgt_mask = torch.IntTensor([[1,1,1,1,1]]).to(cur_device)

transformer = Transformer(src_vocab_size, tgt_vocab_size, embed_dim, hidden_dim, n_heads, n_blocks)

# HINT: trans_answer should have shape (1, 5, 5)
trans_answer = transformer(src_tokens, src_mask, tgt_tokens, tgt_mask).cpu().detach().numpy()

In [None]:
# PennGrader Grading Cell
if NOTEBOOK:
    grader.grade(test_case_id = 'test_transformer', answer = trans_answer)

# **2. [4190: 2 autograded + 14 manual; 5190: 2 autograded + 22 manual] Reinforcement Learning Section**

Install Dependencies and Imports:

In [None]:
if NOTEBOOK:
  !apt-get update
  !apt-get -qq -y install libnvtoolsext1 > /dev/null
  !ln -snf /usr/lib/x86_64-linux-gnu/libnvrtc-builtins.so.8.0 /usr/lib/x86_64-linux-gnu/libnvrtc-builtins.so
  !apt-get -qq -y install xvfb freeglut3-dev ffmpeg> /dev/null
  !pip -q install gymnasium[classic_control]
  !pip -q install pyglet
  !pip -q install pyopengl
  !pip -q install pyvirtualdisplay
  !apt-get install xvfb

In [None]:
%matplotlib inline

import gymnasium as gym
import itertools
import matplotlib
import numpy as np
import sys
import collections
import pandas as pd
from collections import namedtuple
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import sklearn.pipeline
import sklearn.preprocessing


import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical, Normal

from sklearn.kernel_approximation import RBFSampler

matplotlib.style.use('ggplot')

import matplotlib.animation
import imageio
from IPython.display import HTML, display, Image
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1024, 768))
virtual_display.start()

In [None]:
seed = 5190 # for the autograded test_featurize_state, use seed 5190
env = gym.make("MountainCarContinuous-v0", render_mode="rgb_array")
env.reset()
env.action_space.seed(seed)
env.observation_space.seed(seed)
env.action_space.sample()
env.observation_space.sample()

### **2.1 [2 pts autograded] Implement State Featurization for RL Function Approximation**

The Mountain Car environment provides states as raw, continuous vectors (position and velocity). To make these more suitable for RL algorithms using function approximation, we first fit a normalization transform on many observed states using `StandardScaler`, so each feature has mean 0 and unit variance. Next, we employ a feature map using multiple Radial Basis Function (RBF) kernels: this step expands the state into a high-dimensional feature vector, where each new dimension reflects similarity to a set of reference points in the normalized space at different scales. This data-driven featurization pipeline ensures that our RL agent’s neural networks or linear models receive well-scaled, expressive inputs, improving both learning stability and generalization.

The final result is that, for any input state, we obtain a 400-dimensional feature vector (4 RBF samplers × 100 components each) that encodes approximate similarities to different parts of the normalized state space at four length scales. This featurized vector is then used as input for value or policy function approximators (neural networks or linear models), enabling stable, efficient function approximation in continuous reinforcement learning environments.
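
To make the two steps concrete, here is a NumPy-only sketch of standardization followed by random Fourier features, the technique that `RBFSampler` is based on. The helper `random_fourier_features`, the sampled "raw states", and the large `n_components` value are assumptions chosen so that the kernel approximation is easy to see; the notebook itself uses scikit-learn with 100 components per sampler.

```python
import numpy as np

def random_fourier_features(X, gamma, n_components, rng):
    """Map X (n_samples, n_features) to features whose inner products
    approximate the RBF kernel exp(-gamma * ||x - y||^2)."""
    n_features = X.shape[1]
    W = np.sqrt(2.0 * gamma) * rng.normal(size=(n_features, n_components))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_components)
    return np.sqrt(2.0 / n_components) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
# Hypothetical raw states: position in [-1.2, 0.6], velocity in [-0.07, 0.07]
raw = rng.uniform([-1.2, -0.07], [0.6, 0.07], size=(1000, 2))

# Step 1: standardize each feature to zero mean and unit variance
mean, std = raw.mean(axis=0), raw.std(axis=0)
scaled = (raw - mean) / std

# Step 2: expand two scaled states into random Fourier features
pair = scaled[:2]
Z = random_fourier_features(pair, gamma=1.0, n_components=20000, rng=rng)
approx = Z[0] @ Z[1]                                       # feature-space inner product
exact = np.exp(-1.0 * np.sum((pair[0] - pair[1]) ** 2))    # true RBF kernel value
print(approx, exact)   # the two numbers should be close
```

Without step 1 the velocity coordinate, which is far smaller in magnitude than the position, would barely influence the kernel distances.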

In [None]:
# Feature Preprocessing: Normalize to zero mean and unit variance
# We use a few samples from the observation space to do this
np.random.seed(seed)
observation_examples = np.array([env.observation_space.sample() for x in range(10000)])
scaler = sklearn.preprocessing.StandardScaler()
scaler.fit(observation_examples)

# Used to convert a state to a featurized representation.
# We use RBF kernels with different variances to cover different parts of the space
featurizer = sklearn.pipeline.FeatureUnion([
        ("rbf1", RBFSampler(gamma=5.0, n_components=100, random_state=seed)),
        ("rbf2", RBFSampler(gamma=2.0, n_components=100, random_state=seed)),
        ("rbf3", RBFSampler(gamma=1.0, n_components=100, random_state=seed)),
        ("rbf4", RBFSampler(gamma=0.5, n_components=100, random_state=seed))
        ])
featurizer.fit(scaler.transform(observation_examples))

Complete the `featurize_state` function by applying the fitted scaler and RBF-based featurizer to a new state. Your goal is to convert a raw continuous state into a normalized and high-dimensional feature vector suitable for policy and value function approximation in reinforcement learning. Fill in the TODO block with code to (1) scale the state and (2) transform it using the featurizer.

In [None]:
def featurize_state(state):
    """
    Returns the featurized representation for a state.
    """
    # STUDENT TODO START:
    # 1. Apply scaler to the state

    # 2. Transform the state using the featurizer

    # STUDENT TODO END

    return featurized[0]

In [None]:
# PennGrader Grading Cell
if NOTEBOOK:
    state = np.array([-0.8248636 ,  0.02986798])
    feat_state = featurize_state(state)
    grader.grade(test_case_id = 'test_featurize_state', answer = feat_state)


### **2.2 [14 pts manually graded] Policy Gradient with Neural Network Policies (REINFORCE)**

In policy gradient reinforcement learning, neural networks can be used to directly parameterize the agent’s policy—specifically, how actions are chosen from states. The `PolicyEstimator` constructs a probabilistic policy for continuous actions by outputting the mean (μ) and standard deviation (σ) parameters of a Normal (Gaussian) distribution, typically via neural network layers.

During an episode, the agent samples actions from this distribution, which provides exploration. Policy parameters are updated with the REINFORCE algorithm to maximize the expected **return**, where the **return** is the discounted sum of rewards collected along a trajectory. The update follows the gradient of the log-probabilities of the taken actions, weighted by the observed returns. To encourage further exploration, an entropy bonus term can be included in the policy loss.

Use the code below to implement the missing logic in your `PolicyEstimator` class.

- Complete the network heads for μ and σ, their initialization, and the parameterization of the action distribution.
- Implement both the forward and update steps for policy gradients, including handling the entropy regularization term.
- *Note: In this context, "reward" refers to the instantaneous feedback from the environment; "return" refers to the (discounted) sum of these rewards over an episode, which is what the REINFORCE update maximizes. "Undiscounted return" is the simple sum of rewards, used for monitoring learning progress and reporting results.*
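
The quantities mentioned above can be checked by hand for a single step. The sketch below uses only the standard library and the closed-form expressions for a Normal distribution; the helper names, the example numbers, and the exact way the loss terms are combined are assumptions for illustration, not necessarily the form used in the staff solution.

```python
import math

def softplus(z):
    # log(1 + exp(z)); maps any real number to a positive one
    return math.log1p(math.exp(z))

def normal_log_prob(a, mu, sigma):
    # log density of N(mu, sigma^2) evaluated at a
    return -((a - mu) ** 2) / (2 * sigma ** 2) - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def normal_entropy(sigma):
    # differential entropy of N(mu, sigma^2); it does not depend on mu
    return 0.5 + 0.5 * math.log(2 * math.pi) + math.log(sigma)

mu, raw_sigma = 0.0, 0.0             # zero-initialized linear heads output 0 for every state
sigma = softplus(raw_sigma) + 1e-5   # log(2) + 1e-5
action, G, beta = 0.5, 2.0, 0.1      # hypothetical sampled action, return, entropy weight

log_p = normal_log_prob(action, mu, sigma)
entropy = normal_entropy(sigma)
loss = -log_p * G - beta * entropy   # minimizing this raises log-prob of high-return actions
print(sigma, log_p, entropy, loss)
```

A larger σ raises the entropy term, so the bonus pushes against the policy collapsing to a near-deterministic action too early.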

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal
import math


class PolicyEstimator(nn.Module):
    """
    Policy: linear μ, linear σ (softplus), Normal policy,
    entropy bonus, and REINFORCE-style loss using provided action & target.
    """
    def __init__(
        self,
        state_dim=400,
        learning_rate=5e-2,
        entropy_rate=1e-1,
        device='cuda',
    ):
        super().__init__()

        # STUDENT TODO START:
        # 1. Define the neural network heads for mu and sigma

        # 2. Initialize the weights and biases to zero (for controlled learning start).

        # STUDENT TODO END

        # Optimizer setup
        self.optimizer = torch.optim.Adam(self.parameters(), lr=learning_rate)
        self.featurize_state = featurize_state
        self.entropy_rate = entropy_rate

        # STUDENT TODO START:
        # Initialize and store device for computations (CPU/GPU).

        # STUDENT TODO END

    def forward(self, state_tensor: torch.Tensor):
        """
        state_tensor: (400,) or (N, 400)
        returns: dist (Normal), mu (N,1), sigma (N,1)
        """
        x = state_tensor

        # STUDENT TODO START:
        # 1. Compute the mean (mu) of the action distribution by passing x through the mu head.

        # 2. Compute the standard deviation (sigma) by passing x through the sigma head and applying softplus activation.
        # (Add a small epsilon like 1e-5 to avoid zero std.)

        # 3. Construct a Normal (Gaussian) distribution parameterized by mu and sigma.

        # STUDENT TODO END
        return dist, mu, sigma

    @torch.no_grad()
    def predict(self, state):
        """
        Samples and returns an action
        """
        s = torch.as_tensor(self.featurize_state(state), dtype=torch.float32, device=self.device)
        dist, _, _ = self.forward(s)
        action = dist.sample()
        return action.squeeze()

    def update(self, state, target, action):
        """
        state: raw state; will be featurized to shape (400,)
        target: scalar advantage/return (float)
        action: scalar action actually taken (float)
        returns: loss value (float)
        """
        # STUDENT TODO START:
        # 1. Zero out gradients for the optimizer (standard PyTorch preparation).

        # 2. Featurize the state and format as torch tensor; batchify it for the forward pass.

        # 3. Pass state through the forward method to get the action distribution.

        # 4. Prepare action and target tensors for loss calculation.

        # 5. Compute the log-probability of the action and policy distribution entropy.

        # 6. Construct the policy gradient loss, including an entropy bonus for exploration.

        # 7. Backpropagate loss and perform optimizer step to update policy parameters.

        # STUDENT TODO END
        return float(loss.item())


**REINFORCE** is a simple, foundational policy gradient algorithm in reinforcement learning that directly updates an agent's policy using gradients computed from the returns of complete episodes. The agent collects trajectories by sampling actions from its policy, then uses the discounted return that follows each action to increase the likelihood of better actions in future episodes via gradient ascent on the expected return. This algorithm requires neither a separate value function nor an environment model, which makes it widely applicable but also prone to high variance and instability during training.
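
The return that follows step $t$ is $G_t = \sum_{k \ge 0} \gamma^k r_{t+k}$. One common way to compute it for every step of a finished episode is a single backward pass using $G_t = r_t + \gamma G_{t+1}$. The helper name `discounted_returns` and the toy reward sequence below are made up for this illustration.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # G_t = r_t + gamma * G_{t+1}
        returns[t] = running
    return returns

rewards = [-1.0, -1.0, 10.0]
print(discounted_returns(rewards, 0.5))   # [1.0, 4.0, 10.0]
print(discounted_returns(rewards, 1.0))   # [8.0, 9.0, 10.0]
```

With `gamma=1.0` the first entry is the undiscounted return of the whole episode, which is the quantity tracked for monitoring.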

In [None]:
def reinforce(env, estimator_policy, num_episodes, discount_factor=1.0):
    """
    REINFORCE with function approximation.
    estimator_policy: has .predict(state) -> action and .update(state, target, action)
                      .update(state, return, action) updates the policy
    """
    # stats["episode_rewards"] stores the undiscounted return of each episode
    stats = {"episode_rewards": np.zeros(num_episodes, dtype=np.float32),
             "episode_lengths": np.zeros(num_episodes, dtype=np.float32)}

    Transition = collections.namedtuple(
        "Transition", ["state", "action", "reward", "next_state"]
    )

    for i_episode in range(num_episodes):
        # Reset environment and initialize state
        out = env.reset(seed=seed)
        if hasattr(env, "action_space"):
          env.action_space.seed(seed)
        if hasattr(env, "observation_space"):
          env.observation_space.seed(seed)
        state = out[0] if hasattr(out, "__getitem__") else out

        episode = [] # type list[Transition]

        # STUDENT TODO START:

        # 1. Play through the episode, collecting transitions
        # Make sure to write to the stats buffer for every episode

        # 2. Calculate (discounted) returns for each step

        # 3. Update policy parameters for each transition

        # 4. Print progress
        print(
            "\rEpisode {}/{}  Cumulative Reward:{}".format(
                i_episode + 1, num_episodes, stats["episode_rewards"][i_episode]
            ),
            end=""
        )
        # STUDENT TODO END

    return stats


In the cell below, we’ll train a policy for 50 episodes using your REINFORCE implementation with function approximation. The process relies on stochastic exploration, so outcomes can vary substantially from run to run.

To keep results reproducible and consistent with the course staff's testing, we set the random seed to `5190` (`random.seed`, `np.random.seed`, `torch.manual_seed`). This makes runs repeatable, which helps with debugging if your agent does not converge as expected.

With a correct REINFORCE implementation, you should see your agent reach good performance (undiscounted returns above 80) in fewer than 50 episodes or under 3 minutes of training. If it takes longer than 5 minutes or you aren't seeing learning after several episodes, carefully re-check your policy update logic and return calculations. You can also try different settings of `num_episodes` and `discount_factor`, and even different seeds, for experimentation.

In [None]:
seed = 5190 # try changing seed if you find your agent not learning
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

policy_estimator = PolicyEstimator(
    learning_rate=1e-4,
)

# Run training
num_episodes = 50
discount_factor = 0.90

stats = reinforce(env, policy_estimator,
                     num_episodes=num_episodes,
                     discount_factor=discount_factor)

print(stats)

### Plots and Visualizations

In this cell, we visualize the learning process of your REINFORCE agent using `matplotlib` and `pandas`. Three plots are generated:

* **Undiscounted return per episode**: Shows how the agent’s cumulative (undiscounted) reward evolves as it learns.

* **Smoothed undiscounted returns**: Highlights the learning trend and helps reveal any gradual improvements or plateaus in performance.

* **Episode length per episode**: Indicates how efficiently the agent completes each episode; in this task, shorter episodes typically mean better policies.

Your plots will be manually checked to confirm that the training trajectory is reasonable and interpretable. Specifically, the agent should achieve an average undiscounted return of more than 80 within the 50 episodes. If your agent’s performance plateaus much lower or your curves do not show learning, please revisit your implementation, debug its logic, and check your reward calculations.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# assuming 'stats' contains episode-level undiscounted returns:
episode_undiscounted_returns = stats["episode_rewards"]  # cumulative rewards (undiscounted return)
episode_lengths = stats["episode_lengths"]

# Plot: Undiscounted return per episode
plt.figure(figsize=(12,5))
plt.plot(episode_undiscounted_returns, label="Undiscounted Return (Cumulative Reward per Episode)")
plt.xlabel("Episode")
plt.ylabel("Undiscounted Return")
plt.title("Undiscounted Return per Episode - REINFORCE (MountainCar)")
plt.legend()
plt.show()

# Plot: Rolling average for smoother trend (window=3)
rolling_avg = pd.Series(episode_undiscounted_returns).rolling(window=3, min_periods=1).mean()
plt.figure(figsize=(12,5))
plt.plot(rolling_avg, label="Rolling Avg Undiscounted Return (window=3)")
plt.xlabel("Episode")
plt.ylabel("Undiscounted Return")
plt.title("Smoothed Undiscounted Return - REINFORCE (MountainCar)")
plt.legend()
plt.show()

# Plot: Episode lengths (how quickly solved)
plt.figure(figsize=(12,5))
plt.plot(episode_lengths, label="Episode Length")
plt.xlabel("Episode")
plt.ylabel("Length")
plt.title("Episode Length per Episode - REINFORCE (MountainCar)")
plt.legend()
plt.show()


In [None]:
def visualize_policy(policy, max_steps=400, fps=40, seed=seed):
    obs, info = env.reset(seed=seed)

    frames = []
    policy.eval()
    for t in range(max_steps):
        frame = env.render()  # (H, W, 3) uint8
        frames.append(frame)
        action = policy.predict(obs).unsqueeze(0)
        obs, reward, terminated, truncated, info = env.step([action.item()])
        if terminated or truncated:
            break
    # env.close()

    path = "/content/mountaincar_policy.gif"
    imageio.mimsave(path, frames, duration=1.0 / fps)
    display(Image(filename=path))

In [None]:
visualize_policy(policy_estimator)

### **2.3 [0 pts] Behavioral Cloning**

In this section, you’ll run helper functions to collect expert data and evaluate continuous policies in the Mountain Car environment.

- `collect_expert_data(...)` rolls out the provided expert policy for several episodes, collecting states, actions, and per-episode rewards as the car attempts to reach the flag. An episode counts as a success when the car reaches the goal (the flag at the top of the hill). The resulting dataset is used later for behavioral cloning.

- `test_agent(...)` runs your policy and records its final position, running success rate, and episode rewards, so that you can compare learned and expert policies and see how reliably your agent reaches the goal.

These functions produce the expert demonstrations and the evaluation measurements used in the rest of this section. Expert rollouts supply the training data for behavioral cloning, and the evaluation confirms whether your agent consistently achieves high reward and reaches the flag.

No code is necessary from this point on, but run these cells and use the resulting data to assess your agent in the next part.
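
As a rough sketch of how outputs shaped like those of `test_agent(...)` could be summarized, consider the snippet below. The helper name `summarize_eval` and the sample numbers are made up for illustration and are not part of the assignment; the goal position of 0.45 is the termination threshold given in the Gymnasium documentation for `MountainCarContinuous-v0`.

```python
import numpy as np

GOAL_POSITION = 0.45  # flag position in MountainCarContinuous-v0 (per Gymnasium docs)

def summarize_eval(position_list, success_list):
    """Summarize outputs shaped like those of test_agent (hypothetical helper)."""
    positions = np.asarray(position_list, dtype=np.float64)
    return {
        "episodes": int(positions.size),
        "mean_final_position": float(positions.mean()),
        "fraction_at_goal": float(np.mean(positions >= GOAL_POSITION)),
        # success_list holds the running success rate, so its last entry
        # is the success rate over all episodes.
        "final_success_rate": float(success_list[-1]),
    }

# Made-up example: 4 episodes, 3 of which end at or past the flag.
example_positions = [0.46, 0.47, -0.52, 0.45]
example_running_success = [1.0, 1.0, 2 / 3, 0.75]
summary = summarize_eval(example_positions, example_running_success)
print(summary)
```

Only the last entry of the running success rate is needed for a final number; the full list is more useful for plotting how the rate evolves across episodes.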

In [None]:
def collect_expert_data(env, policy, max_episodes=50):
    """
    Roll out a continuous policy on Gymnasium's MountainCarContinuous-v0 and
    return (states, actions, rewards, success_count), where rewards holds one
    total (undiscounted) reward per episode.
    """

    states, actions, rewards = [], [], []
    success = 0

    for _ in range(max_episodes):
        obs, _ = env.reset(seed=seed)
        ep_reward = 0.0
        for t in range(1, 1000):
            pred = policy.predict(obs)
            if isinstance(pred, (np.ndarray, list)) and np.size(pred) == 1:
                a = float(np.ravel(pred)[0])
            elif hasattr(pred, "item"):
                a = float(pred.item())
            else:
                a = float(pred)

            next_obs, reward, terminated, truncated, _ = env.step([a])

            states.append(np.array(obs, dtype=np.float32))
            actions.append(np.array([a], dtype=np.float32))
            ep_reward += float(reward)

            if terminated:
                success += 1
            if terminated or truncated:
                break
            obs = next_obs
        rewards.append(ep_reward)

    print(f"Avg Reward ({max_episodes} episodes): {np.mean(rewards):.3f}")

    return states, actions, rewards, success

In [None]:
def test_agent(env, policy, max_episodes=50):
    """
    Evaluate a continuous policy on Gymnasium's MountainCarContinuous-v0.

    Assumes: policy.predict(obs.reshape(1, -1)) -> a single action (array, tensor,
    or float) in [-1, 1], e.g. a fitted scikit-learn regressor.
    Returns: (position_list, success_list, frames)  # frames kept empty for API parity
    """

    position_list, success_list, reward_list = [], [], []
    frames = []  # kept for signature compatibility
    success = 0

    for i in range(max_episodes):
        obs, _ = env.reset(seed=seed)
        ep_reward = 0.0
        last_obs = obs

        for t in range(1, 1000):
            # Reshape obs to be a 2D array (1 sample, N features) for policy.predict
            pred = policy.predict(obs.reshape(1, -1))
            if isinstance(pred, (np.ndarray, list)) and np.size(pred) == 1:
                a = float(np.ravel(pred)[0])
            elif hasattr(pred, "item"):
                a = float(pred.item())
            else:
                a = float(pred)

            obs, reward, terminated, truncated, _ = env.step([a])

            ep_reward += float(reward)
            last_obs = obs

            if terminated:
                success += 1
            if terminated or truncated:
                break

        position_list.append(float(last_obs[0]))          # final x-position
        reward_list.append(ep_reward)
        success_list.append(success / (i + 1))            # running success rate

    print(f"Avg Reward ({max_episodes} episodes): {np.mean(reward_list):.3f}")
    # print(f"Successes: {success}/{max_episodes}")

    return position_list, success_list, frames

#### **2.3.1 BC from REINFORCE (<5 min)**

Behavioral Cloning (BC) is a supervised learning technique in which the agent learns to imitate expert actions by fitting a regression model, such as a decision tree, to state-action pairs collected from expert demonstrations. In the next code blocks, BC serves as a baseline, allowing direct comparison between imitation-based policies and reinforcement learning approaches such as REINFORCE and actor-critic. You can assess how well a policy trained from demonstrations alone performs in the Mountain Car environment and contrast it with the REINFORCE policy, which learns from reward feedback.

As before, if your agent is not learning within 5 minutes, you might want to try other seed values.
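
To make the supervised-learning view of BC concrete, here is a minimal, self-contained sketch that does not use the environment at all. It assumes a hand-written `toy_expert` (hypothetical, not the REINFORCE policy) and samples states uniformly from the documented observation ranges instead of collecting rollouts.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def toy_expert(state):
    """Hypothetical expert: push in the direction the car is already moving."""
    position, velocity = state
    return 1.0 if velocity >= 0.0 else -1.0

# Sample states from the documented observation ranges:
# position in [-1.2, 0.6], velocity in [-0.07, 0.07].
states = np.column_stack([
    rng.uniform(-1.2, 0.6, size=500),
    rng.uniform(-0.07, 0.07, size=500),
])
actions = np.array([toy_expert(s) for s in states])

# Behavioral cloning: supervised regression from states to expert actions.
bc_policy = DecisionTreeRegressor(random_state=0)
bc_policy.fit(states, actions)

# The cloned policy is queried with a 2D array of shape (n_samples, n_features).
test_state = np.array([[-0.5, 0.03]])
cloned_action = float(bc_policy.predict(test_state)[0])
train_error = float(np.mean(np.abs(bc_policy.predict(states) - actions)))
print(cloned_action, train_error)
```

In the notebook the states come from expert rollouts rather than uniform sampling, so the cloned policy is only trained on the part of the state space the expert actually visits.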

In [None]:
import sklearn.tree as tree

In [None]:
torch.manual_seed(seed)
np.random.seed(seed)

# Collect expert demonstrations from the trained REINFORCE policy
print('Expert')
states, actions, _, _ = collect_expert_data(env, policy_estimator)

# Train BC policy
print('BC')
bc_clf = tree.DecisionTreeRegressor()
bc_clf = bc_clf.fit(states, actions)
position, successes, frames = test_agent(env, bc_clf)

#### **2.3.2 BC from Actor-Critic**

Now we'll train another expert policy using an Actor-Critic algorithm (learning should reach a reward above 80 in under 50 episodes/5 minutes) and collect demonstration data to train another BC policy. Again, please try different seed values if your agent is not learning correctly.
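
For intuition, the one-step temporal-difference (TD) quantities that drive both updates in an actor-critic loop can be worked through by hand. The sketch below uses made-up numbers, and `td_quantities` is a hypothetical helper, not part of the notebook.

```python
def td_quantities(reward, value_state, value_next, discount_factor, done):
    """Return (td_target, td_error) for a single transition."""
    # Terminal transitions do not bootstrap from the next state's value.
    bootstrap = 0.0 if done else value_next
    td_target = reward + discount_factor * bootstrap
    td_error = td_target - value_state
    return td_target, td_error

# Made-up numbers: small negative step reward, critic estimates
# V(s) = 2.0 for the current state and V(s') = 3.0 for the next state.
target, error = td_quantities(reward=-0.1, value_state=2.0, value_next=3.0,
                              discount_factor=0.9, done=False)
print(target, error)

# At a terminal step, only the immediate reward counts toward the target.
terminal_target, terminal_error = td_quantities(reward=100.0, value_state=2.0,
                                                value_next=3.0,
                                                discount_factor=0.9, done=True)
print(terminal_target, terminal_error)
```

The critic is regressed toward the TD target, while the TD error is passed to the actor as the advantage signal: a positive value means the action turned out better than the critic expected.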

In [None]:
class ValueEstimator(nn.Module):
    """
    PyTorch value function: linear head, MSE loss, Adam.
    """
    def __init__(
        self,
        state_dim=400,
        learning_rate=1e-1,
        device='cuda',
    ):
        super().__init__()

        # 1. Define the value head as a linear layer mapping featurized state to scalar value.
        self.value_head = nn.Linear(state_dim, 1)

        # 2. Initialize value head weights and biases to zero for reproducibility.
        nn.init.zeros_(self.value_head.weight)
        nn.init.zeros_(self.value_head.bias)

        # Optimizer setup
        self.optimizer = torch.optim.Adam(self.parameters(), lr=learning_rate)
        self.featurize_state = featurize_state

        # Initialize and store device for computations (CPU/GPU).
        self.device = torch.device(device)
        self.to(self.device)

    @torch.no_grad()
    def predict(self, state):
        # 1. Featurize the input state and convert it to a torch tensor appropriate for model input.
        s = torch.as_tensor(self.featurize_state(state),dtype=torch.float32, device=self.device).unsqueeze(0)
        # 2. Pass this tensor through the value_head network to get the predicted value.
        v = self.value_head(s)

        # Return the value as a squeezed scalar for easy downstream use.
        return v.squeeze()

    def update(self, state, target):
        # 1. Zero out gradients for the optimizer to prepare for the update step.
        self.optimizer.zero_grad()
        # 2. Featurize the input state, convert to tensor, and batchify for the network input.
        s = torch.as_tensor(self.featurize_state(state), dtype=torch.float32, device=self.device).unsqueeze(0)
        # 3. Convert the target to a tensor and shape appropriately for MSE calculation.
        tgt = torch.as_tensor(target, dtype=torch.float32, device=self.device).view(1, 1)
        # 4. Forward pass: predict the value for the featurized state.
        v = self.value_head(s)
        # 5. Calculate the mean squared error (MSE) loss between prediction and target value.
        loss = F.mse_loss(v, tgt)
        # 6. Backpropagate the loss and update the network parameters using the optimizer.
        loss.backward()
        self.optimizer.step()

        return loss

In [None]:
def actor_critic(env, estimator_policy, estimator_value, num_episodes, discount_factor=1.0):
    """
    Actor-Critic with function approximation (PyTorch estimators).
    estimator_policy: has .predict(state) -> action and .update(state, target, action)
    estimator_value:  has .predict(state) -> value  and .update(state, target)
    """
    stats = {"episode_rewards":np.zeros(num_episodes, dtype=np.float32),
             "episode_lengths":np.zeros(num_episodes, dtype=np.float32)}

    Transition = collections.namedtuple(
        "Transition", ["state", "action", "reward", "next_state", "done"]
    )

    for i_episode in range(num_episodes):
        # 1. Reset the environment and obtain the initial state.
        out = env.reset(seed=seed)
        state = out[0]
        episode = []

        for t in itertools.count():
            # 2. Use the policy network to select an action given the current state.
            action = estimator_policy.predict(state).unsqueeze(0)

            # 3. Step the environment using the selected action and observe results.
            step_out = env.step([action.item()])
            next_state, reward, terminated, truncated, _info = step_out
            done = bool(terminated or truncated)

            episode.append(Transition(state, action, reward, next_state, done))

            # 4. Update episode reward and episode length statistics.
            stats["episode_rewards"][i_episode] += float(reward)
            if done:
                stats["episode_lengths"][i_episode] = t + 1

            # 5. Compute the value of the next state and calculate TD target and TD error for learning.
            value_next = estimator_value.predict(next_state) if not done else 0.0
            td_target = reward + discount_factor * value_next
            td_error = td_target - estimator_value.predict(state)

            # 6. Update the value function ("critic") using the TD target.
            estimator_value.update(state, td_target)

            # 7. Update the policy ("actor") using the TD error (advantage) and actual action taken.
            estimator_policy.update(state, td_error, action)

            print(
                "\rStep {} @ Episode {}/{} Reward:{}".format(
                    t, i_episode + 1, num_episodes,
                    stats["episode_rewards"][i_episode - 1] if i_episode > 0 else 0.0
                ),
                end=""
            )

            if done:
                break

            # 8. Move to the next state for the next time step/decision.
            state = next_state

    return stats


In [None]:
torch.manual_seed(seed)
np.random.seed(seed)

policy_estimator = PolicyEstimator(
    learning_rate=1e-4,
)
value_estimator = ValueEstimator(
    learning_rate=1e-1,
)

# Run training (same episode count / discount)
num_episodes = 50
discount_factor = 0.90

stats = actor_critic(env, policy_estimator, value_estimator,
                     num_episodes=num_episodes,
                     discount_factor=discount_factor)

print(stats)

In [None]:
# Collect expert demonstrations from the trained actor-critic policy
print('Expert')
states, actions, _, _ = collect_expert_data(env, policy_estimator)

# Train BC policy
print('BC')
bc_clf = tree.DecisionTreeRegressor()
bc_clf = bc_clf.fit(states, actions)
position, successes, frames = test_agent(env, bc_clf)

In [None]:
visualize_policy(policy_estimator)

In [None]:
env.close()

### **2.4 [8 pts manual 5190 only] Analysis of Reinforcement and Imitation Learning Methods**


#### **2.4.1 [5190 2 pts manual] Actor-Critic Methods**
Why does our actor-critic algorithm, which learns the state-value function $V(s)$ in addition to the policy $\pi$, require fewer iterations (experiences in the environment) than the policy-gradient algorithm (which learns $\pi$ directly) to achieve similar returns?

**Answer:** [Your answer here]

#### **2.4.2 [5190 2 pts manual] Reward Alignment**
Review the [documentation](https://gymnasium.farama.org/environments/classic_control/mountain_car_continuous/#reward) of the reward function. Does the instantaneous reward encourage efficient solutions? Justify your answer.

**Answer:** [Your answer here]

#### **2.4.3 [5190 2 pts manual] Hyperparameters**

Suppose we would like to modify the `reinforce()` call to encourage more efficient solutions. Which hyperparameter(s) could we change, and how?

**Answer:** [Your answer here]

#### **2.4.4 [5190 2 pts manual] Behavioral Cloning**

For each of the 4 approaches in the previous section (REINFORCE, Actor-Critic, and BC using each of these as the expert), report the average undiscounted return over 50 episodes for the trained policy. Why might it be possible for a BC policy to outperform its expert?

**Answer:** [Your answer here]

# **Submit to Gradescope**
Congratulations! You've finished the homework. Don't forget to submit your final notebook on [Gradescope](https://www.gradescope.com).