## Homework 1a: Introduction to PyTorch

Welcome to CS 288.  The purpose of this first assignment is to make sure that you are familiar with all the tools you need to complete the programming assignments for the course.  We will walk you through the process of building a model with PyTorch in Colab.  Most of it will be structured as a tutorial, but we will ask you to fill in code and submit at the end. 

### Colab

Our assignments will be given to you as Jupyter notebooks, and we intend for you to run them using Google Colab.
Colab is an online editor that also provides free access to a GPU.
To get started, make a copy of the assignment by clicking `File->Save a copy in drive...`.  You will need to be logged into a Google account, such as your @berkeley.edu account.

To access a GPU, go to `Edit->Notebook settings` and in the `Hardware accelerator` dropdown choose `GPU`. 
As soon as you run a code cell, you will be connected to a cloud instance with a GPU.
Try running the code cell below to check that a GPU is connected (select the cell then either click the play button at the top left or press `Ctrl+Enter` or `Shift+Enter`).

In [1]:
import torch
from torch.utils.data import DataLoader

if torch.cuda.is_available():
    print('Found GPU')
else:
    print('Did not find GPU')

Found GPU
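
The check above only reports whether a GPU is present. A common pattern (not part of the assignment template) is to pick a device once and move tensors to it, so that the same code also runs on a CPU-only machine. A minimal sketch; the name `device` is our own choice:

```python
import torch

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Tensors (and modules) are moved with .to(device); this works on either device.
x = torch.zeros(2, 3).to(device)
print('Using device:', device, '| tensor lives on:', x.device.type)
```

The rest of this notebook calls `.cuda()` directly, which is fine as long as a GPU runtime is selected.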


When you run a code cell, Colab executes it on a temporary cloud instance.  Every time you open the notebook, you will be assigned a different machine, and all compute state and files saved on the previous machine will be lost.  You may therefore need to re-download datasets or rerun code after a reset.  If you save output files that you don't want to lose, download them to your personal computer before moving on to something else: click the > arrow at the top left of the page under the menus to expand the sidebar, select `Files`, right-click the file you want, and click `Download`.  Alternatively, you can mount your Google Drive to the temporary cloud instance's local filesystem using the following code snippet and save files under the specified directory (note that you will have to grant permission every time you run this).


In [2]:
# mount Google drive
from google.colab import drive
#drive.mount('/content/drive')

# now you can see files
#!echo -e "\nNumber of Google drive files in /content/drive/My Drive/:"
#!ls -l "/content/drive/My Drive/" | wc -l
# by the way, you can run any linux command by putting a ! at the start of the line

# by default everything gets executed and saved in /content/
#!echo -e "\nCurrent directory:"
#!pwd

Many of the assignments will require training a model for some period of time, often on the order of 20-30 minutes.  There are some important limitations to Colab that you should be aware of when running code for this amount of time.  If you close the window or put your computer to sleep, Colab will disconnect you from the compute machine and your code will stop running.   There are also timeouts for inactivity (somewhere on the order of 30 minutes), so if you want to leave code running, be sure to check back periodically.  After a timeout, your compute machine will be disconnected and the files on it will be lost.

A few other notes about using Colab:
* The `Runtime` menu has many different run options, such as `Run all` or `Run after` so you don't have to run each code block individually.
* Some people have run into CUDA device assert errors that did not originate from their code.  Restarting the runtime should fix this (unless there actually is a problem with your code). If your code is causing CUDA device assert errors, then debugging on the CPU may be easier.
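
To illustrate why debugging on the CPU can be easier: an out-of-range index passed to `nn.Embedding` is one typical cause of device asserts on the GPU, whereas on the CPU the same mistake raises an ordinary Python exception at the offending call. A small sketch with made-up sizes and indices:

```python
import torch
from torch import nn

# An embedding table with 128 rows accepts indices 0..127 only.
embedding = nn.Embedding(128, 4)
bad_ids = torch.tensor([[94, 200, 36]])  # 200 is out of range

try:
    embedding(bad_ids)  # runs on the CPU because nothing was moved to the GPU
    error_message = None
except IndexError as e:
    error_message = str(e)

print('CPU error:', error_message)
```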

If at some point you want to run longer jobs or connect to multiple GPUs, there are coupons for Google Compute Cloud available for students in the course. You could deploy your own cloud instance and run JupyterHub to recreate a similar environment to Colab. However, the course staff cannot offer technical support for this kind of configuration; you're on your own to set it up.

### Part-of-Speech Tagging

You'll be trying to predict the most common [part of speech](https://web.stanford.edu/~jurafsky/slp3/8.pdf) for a word from its characters.  This project will focus on [word types rather than tokens](https://en.wikipedia.org/wiki/Type%E2%80%93token_distinction) and will not use any context. This task is different from (and simpler than) a standard part-of-speech tagging task, which predicts part-of-speech tags for tokens in their sentential context.

Many words can have multiple different parts of speech, but in this project we will associate each word only with its most common part of speech in the [Brown Corpus](https://www1.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/corpora/list/private/brown/brown.html), which has been manually labeled with part-of-speech tags.  

Words are lowercased and filtered for length and frequency. Punctuation and numbers are removed. Any real NLP application would have to deal with the actual contents of text instead of filtering in this way, but we're just warming up.
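
As a rough illustration of the token-to-type reduction described above, here is a sketch on a handful of made-up tagged tokens (not taken from the Brown Corpus). The minimum count of 2 is chosen only to keep the toy example small:

```python
from collections import Counter, defaultdict

# Toy tagged tokens (made up for illustration).
tokens = [('The', 'DET'), ('run', 'NOUN'), ('run', 'VERB'), ('Run', 'VERB'),
          ('the', 'DET'), ('3', 'NUM')]

# Group the tags observed for each lowercased word.
tags_by_word = defaultdict(list)
for word, tag in tokens:
    tags_by_word[word.lower()].append(tag)

# Keep alphabetic words seen at least twice; label each with its most common tag.
types = {w: Counter(tags).most_common(1)[0][0]
         for w, tags in tags_by_word.items()
         if w.isalpha() and len(tags) >= 2}

print(len(tokens), 'tokens ->', len(types), 'types:', types)
```

Here `run` occurs three times as a token but counts as a single type, labeled `VERB` because that tag is the more frequent of the two.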

Below, we provide you with code to load the dataset. Please don't change the cell below, or you may confuse our autograder.

In [3]:
import nltk
import random

from nltk.corpus import brown
from collections import defaultdict, Counter

nltk.download('brown')
nltk.download('universal_tagset')

brown_tokens = brown.tagged_words(tagset='universal')
print('Tagged tokens example: ', brown_tokens[:5])
print('Total # of word tokens:', len(brown_tokens))

max_word_len = 20

def most_common(s):
    "Return the most common element in a sequence."
    return Counter(s).most_common(1)[0][0]

def most_common_tags(tagged_words, min_count=3, max_len=max_word_len):
    "Return a dictionary of the most common tag for each word, filtering a bit."
    counts = defaultdict(list)
    for w, t in tagged_words:
        counts[w.lower()].append(t)
    return {w: most_common(tags) for w, tags in counts.items() if 
            w.isalpha() and len(w) <= max_len and len(tags) >= min_count}

brown_types = most_common_tags(brown_tokens)
print('Tagged types example: ', sorted(brown_types.items())[:5])
print('Total # of word types:', len(brown_types))

def split(items, test_size):
    "Randomly split into train, validation, and test sets with a fixed seed."
    random.Random(288).shuffle(items)
    once, twice = test_size, 2 * test_size
    return items[:-twice], items[-twice:-once], items[-once:]

val_test_size = 1000
all_data_raw = split(sorted(brown_types.items()), val_test_size)
train_data_raw, validation_data_raw, test_data_raw = all_data_raw
all_tags = sorted(set(brown_types.values()))
print('Tag options:', all_tags)

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.
Tagged tokens example:  [('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN')]
Total # of word tokens: 1161192
Tagged types example:  [('a', 'DET'), ('aaron', 'NOUN'), ('ab', 'NOUN'), ('abandon', 'VERB'), ('abandoned', 'VERB')]
Total # of word types: 18954
Tag options: ['ADJ', 'ADP', 'ADV', 'CONJ', 'DET', 'NOUN', 'NUM', 'PRON', 'PRT', 'VERB', 'X']


Note that this "universal" tagset is considerably simpler than the [Penn Treebank tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) we have seen in class. You're welcome to insert additional cells and explore the data. Our autograders don't rely on any particular structure of the notebook.

First, let's run a baseline that predicts `NOUN` for every word. A predictor function takes a list of tagged words and returns a list of predicted tags. We've also provided some helper functions here to evaluate model outputs.  You don't need to fill in any code in this cell.



In [4]:
def noun_predictor(raw_data):
    "A predictor that always predicts NOUN."
    predictions = []
    for word, _ in raw_data:
        predictions.append('NOUN')
    return predictions

def accuracy(predictions, targets):
    """Return the accuracy percentage of a list of predictions.
    
    predictions has only the predicted tags
    targets has tuples of (word, tag)
    """
    assert len(predictions) == len(targets)
    n_correct = 0
    for predicted_tag, (word, gold_tag) in zip(predictions, targets):
        if predicted_tag == gold_tag:
            n_correct += 1

    return n_correct / len(targets) * 100.0

def evaluate(predictor, raw_data):
    return accuracy(predictor(raw_data), raw_data)

def print_sample_predictions(predictor, raw_data, k=10):
    "Print the first k predictions."
    d = raw_data[:k]
    print('Sample predictions:', 
          [(word, guess) for (word, _), guess in zip(d, predictor(d))])

print('noun baseline validation accuracy:', 
      evaluate(noun_predictor, validation_data_raw))
print_sample_predictions(noun_predictor, validation_data_raw)

noun baseline validation accuracy: 55.1
Sample predictions: [('salem', 'NOUN'), ('unsympathetic', 'NOUN'), ('downwind', 'NOUN'), ('exodus', 'NOUN'), ('avoiding', 'NOUN'), ('informal', 'NOUN'), ('padded', 'NOUN'), ('tantalizing', 'NOUN'), ('farce', 'NOUN'), ('berger', 'NOUN')]
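
The baseline above hard-codes `NOUN`. A variant that reads the majority tag off labeled data follows the same predictor interface. This is only a sketch: `make_majority_predictor` is a hypothetical helper, and the data is made up.

```python
from collections import Counter

def make_majority_predictor(train_pairs):
    "Return a predictor that always outputs the most frequent tag in train_pairs."
    majority_tag = Counter(tag for _, tag in train_pairs).most_common(1)[0][0]

    def predictor(raw_data):
        return [majority_tag for _ in raw_data]

    return predictor

# Toy (word, tag) pairs, made up for illustration.
toy_train = [('dog', 'NOUN'), ('run', 'VERB'), ('cat', 'NOUN')]
toy_eval = [('tree', 'NOUN'), ('quickly', 'ADV')]

predictor = make_majority_predictor(toy_train)
predictions = predictor(toy_eval)
n_correct = sum(p == gold for p, (_, gold) in zip(predictions, toy_eval))
print(predictions, 'accuracy:', n_correct / len(toy_eval) * 100.0)
```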


### Building a PyTorch Classifier

We will be using the deep learning framework PyTorch for all our projects.
If you haven't used PyTorch at all before, we recommend you check out the tutorials on the PyTorch website: https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html.  Throughout this project and the others in this course, you will need to reference the documentation at https://pytorch.org/docs/stable/index.html.  We'll be using PyTorch version 1.10, which comes pre-installed with Colab.  In this project, we'll walk you through the process of defining and training your neural network model, but future projects will have less guidance.

Below, we've provided a baseline network as a PyTorch Module that will learn a single parameter per part-of-speech tag. This model has the capacity to learn that `'NOUN'` is the most common tag and predict that. It can't do better. Use this network as you are developing your training and prediction code, then replace it with your actual network later. 

In [5]:
import torch
from torch import nn
import torch.nn.functional as F

class BaselineNetwork(nn.Module):
    def __init__(self, n_outputs):
        super().__init__()

        # learn a vector of size n_outputs, initialized with all zeros
        self.param = nn.Parameter(torch.zeros(n_outputs)) 

    def forward(self, chars, mask):
        # return the same outputs (self.param) for each example in a batch
        return self.param.expand(chars.size(0), -1)

print("PyTorch version: {}".format(torch.__version__))

PyTorch version: 1.10.0+cu111


To train or evaluate a neural model, we'll need to transform the raw data from strings into tensors.  We've provided the following function to perform the transformation for you. Each word is prepended with the `^` character and appended with `$` so that these boundary characters are available to the network.

In [6]:
def make_matrices(data_raw):
    """Convert a list of (word, tag) pairs into tensors with appropriate padding.
    
    character_matrix holds character codes for each word, 
      indexed as [word_index, character_index]
    character_mask masks valid characters (1 for valid, 0 invalid), 
      indexed similarly so that all inputs can have a constant length
    pos_labels holds part-of-speech values for each word as integer indices
    """
    max_len = max_word_len + 2  # leave room for word start/end symbols
    character_matrix = torch.zeros(len(data_raw), max_len, dtype=torch.int64) 
    character_mask = torch.zeros(len(data_raw), max_len, dtype=torch.float32)
    pos_labels = torch.zeros(len(data_raw), dtype=torch.int64)
    for word_i, (word, pos) in enumerate(data_raw):
        for char_i, c in enumerate('^' + word + '$'):
            character_matrix[word_i, char_i] = ord(c)
            character_mask[word_i, char_i] = 1
        pos_labels[word_i] = all_tags.index(pos)
    return torch.utils.data.TensorDataset(character_matrix, character_mask, pos_labels)

validation_data = make_matrices(validation_data_raw)

print('Sample datapoint after preprocessing:', validation_data[0])
print('Raw datapoint:', validation_data_raw[0])

Sample datapoint after preprocessing: (tensor([ 94, 115,  97, 108, 101, 109,  36,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0]), tensor([1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.]), tensor(5))
Raw datapoint: ('salem', 'NOUN')
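
To check your understanding of the encoding, you can map the character codes back to a string. The sketch below uses plain Python lists copied from the sample datapoint above rather than tensors:

```python
# Character codes and mask copied from the sample datapoint printed above.
codes = [94, 115, 97, 108, 101, 109, 36] + [0] * 15
mask = [1.0] * 7 + [0.0] * 15

# Keep only positions where the mask is 1 and map codes back to characters.
decoded = ''.join(chr(c) for c, m in zip(codes, mask) if m == 1.0)
print(decoded)  # ^salem$
```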


The output of a `BaselineNetwork` is a matrix of dimension (batch_size, num_pos_labels) containing logits, or unnormalized log probabilities. To get probabilities from this matrix, you would run `F.softmax(x, dim=1)`, which exponentiates the logits and then normalizes each row to sum to 1.  The cell below generates an output distribution for the first example of the validation set, which is uniform because the network param was initialized to zero.

In PyTorch, it is common to return pre-activation values from modules (e.g. the values before running the final softmax or sigmoid operation).  PyTorch has loss functions that combine the softmax/sigmoid operation into the loss operation for more numerical stability.  Be sure you know what type of values a network returns, as this will affect your training and prediction code.

In [7]:
# Create a network and copy its parameters to the GPU.
untrained_baseline = BaselineNetwork(len(all_tags)).cuda()
untrained_baseline.eval()

# Select the first validation example.
example = validation_data[0]
chars, mask, _ = example
print(_)
# Networks only process batches. Create a batch of size one.
chars_batch, mask_batch = chars.unsqueeze(0), mask.unsqueeze(0)

# Copy batch to the GPU.
chars_batch, mask_batch = chars_batch.cuda(), mask_batch.cuda()

# Run the untrained network.
logits = untrained_baseline(chars_batch, mask_batch)

# Convert to a distribution.
output_distribution = F.softmax(logits, dim=1).squeeze().tolist()

# Inspect the distribution, which should be uniform.
list(zip(all_tags, output_distribution))

tensor(5)


[('ADJ', 0.09090909361839294),
 ('ADP', 0.09090909361839294),
 ('ADV', 0.09090909361839294),
 ('CONJ', 0.09090909361839294),
 ('DET', 0.09090909361839294),
 ('NOUN', 0.09090909361839294),
 ('NUM', 0.09090909361839294),
 ('PRON', 0.09090909361839294),
 ('PRT', 0.09090909361839294),
 ('VERB', 0.09090909361839294),
 ('X', 0.09090909361839294)]
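
To make the point about logits and combined loss functions concrete, the following sketch (made-up logits and labels, run on the CPU) checks that `F.cross_entropy` applied to logits matches the negative log probability computed from an explicit softmax:

```python
import torch
import torch.nn.functional as F

# Made-up logits for a batch of 2 examples and 3 classes, with gold labels.
logits = torch.tensor([[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]])
labels = torch.tensor([0, 2])

# softmax turns each row of logits into a probability distribution.
probs = F.softmax(logits, dim=1)

# cross_entropy takes the logits directly ...
loss_from_logits = F.cross_entropy(logits, labels)
# ... and matches the negative log probability of the gold labels, averaged.
loss_from_probs = -torch.log(probs[torch.arange(2), labels]).mean()

print(probs.sum(dim=1), loss_from_logits.item(), loss_from_probs.item())
```

Passing probabilities instead of logits to `F.cross_entropy` would apply the softmax twice, which is the kind of mistake the note above warns about.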

Finally, time to write some code!

In the cell below, define a predictor for a network by following the instructions in the comments. The predictor takes a list of (word, tag) pairs and returns a list of part-of-speech tags as strings; the tags in the input are not used for prediction.

For this assignment, we've provided more fine-grained instructions as comments in the code template.  You are free to explore methods and architectures other than the ones we specified in the comments, but we highly recommend starting with them, as they will help you reach the required accuracies and give lots of best practices to use in later projects.

In [8]:
def predict_using(network):
    def predictor(raw_data):
        """Return a list of part-of-speech tags as strings, one for each word.

        raw_data - a list of (word, tag) pairs.
        """
        output=[]
        with torch.no_grad(): # turns off automatic differentiation, which isn't required but helps save memory

            # YOUR CODE HERE
            # * put `network` into evaluation mode (turning off dropout) using `.eval()`
            #   then back into train mode at the end of the function with `.train()`
            #   this is easy to forget and could lead to lower accuracy without warning
            #   see https://pytorch.org/docs/stable/_modules/torch/nn/modules/module.html#Module.train for more info
            # * use `make_matrices` to get a preprocessed dataset from `raw_data`
            # * make a DataLoader to iterate over the preprocessed dataset from `make_matrices`, but don't use shuffling or your outputs will be in the wrong order
            # * iterate with the data loader (there will be a pos_labels vector, but don't use it - we want to be able to use our model on new inputs where we don't know the answer)
            #  * run `network` to get outputs
            #  * get the ids of the predicted parts of speech with an argmax operation
            #  * convert the predictions to strings using `all_tags`
            # * return your predictions

            # BEGIN SOLUTION
            network.eval()
            # shuffle=False keeps predictions in the same order as raw_data
            data_loader = torch.utils.data.DataLoader(make_matrices(raw_data), shuffle=False)
            for chars_batch, mask_batch, _ in data_loader:
                chars_batch, mask_batch = chars_batch.cuda(), mask_batch.cuda()
                logits = network(chars_batch, mask_batch)
                # softmax preserves ordering, so argmax over the logits suffices;
                # dim=1 handles any batch size
                predicted_ids = logits.argmax(dim=1).tolist()
                output.extend(all_tags[i] for i in predicted_ids)
            network.train()
            # END SOLUTION
            return output  
    
    return predictor

# The predictions of an untrained model should be arbitrary.
print_sample_predictions(predict_using(untrained_baseline), validation_data_raw)

Sample predictions: [('salem', 'ADJ'), ('unsympathetic', 'ADJ'), ('downwind', 'ADJ'), ('exodus', 'ADJ'), ('avoiding', 'ADJ'), ('informal', 'ADJ'), ('padded', 'ADJ'), ('tantalizing', 'ADJ'), ('farce', 'ADJ'), ('berger', 'ADJ')]
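
The argmax step also works on batches larger than one. A sketch with made-up logits and a shortened tag list; because softmax preserves the ordering of the logits, the argmax can be taken on the logits directly:

```python
import torch

tags = ['ADJ', 'NOUN', 'VERB']  # shortened tag list for illustration
# Made-up logits for a batch of 3 words (rows) over the 3 tags (columns).
logits = torch.tensor([[0.1, 2.0, -1.0],
                       [1.5, 0.2, 0.3],
                       [-0.5, 0.0, 3.0]])

# argmax over dim=1 gives one tag id per row.
predicted_ids = logits.argmax(dim=1)
predicted_tags = [tags[i] for i in predicted_ids.tolist()]
print(predicted_tags)  # ['NOUN', 'ADJ', 'VERB']
```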



Fill in the training function for the neural network below. This function should train any network.  

Then, you'll have all the parts needed to train and evaluate the baseline network.  You should get the same accuracy as the all-noun baseline.  Make sure your train function prints validation scores so that you see score outputs here.

In [9]:
import tqdm.notebook

def train(network, n_epochs=25):
    # YOUR CODE HERE
    # * use `make_matrices` to get a preprocessed dataset from `train_data_raw`
    # * make a DataLoader from torch.utils.data to iterate over your dataset
    #   it can handle batching and shuffling of the data, you just need to pass it the `batch_size` and `shuffle` parameters
    # * move `network` to GPU using `.cuda()`
    # * make an optimizer from torch.optim with your network parameters
    #   `Adam` with its default hyperparameters often works pretty well without any tuning
    #   later you can explore other optimizers, as well as learning rate schedules

    # BEGIN SOLUTION
    # shuffle=True reshuffles the training examples every epoch
    data_loader = torch.utils.data.DataLoader(make_matrices(train_data_raw), shuffle=True)
    network.cuda()
    optimizer = torch.optim.Adam(network.parameters())
    best_score = 0
    # END SOLUTION
    
    predictor = predict_using(network)
    for epoch in range(n_epochs):
        print('Epoch', epoch)
        for batch in tqdm.notebook.tqdm(data_loader, leave=False):
            chars_batch, mask_batch, pos_batch = batch
            assert network.training, 'make sure your network is in train mode with `.train()`'
            # YOUR CODE HERE
            # * call zero_grad on your optimizer
            #   warning: this is easy to forget and you won't get an error if you do - you might just get lower accuracies
            # * move the batch inputs to GPU
            # * run your network
            # * compute a loss; you can use `F.cross_entropy`, which combines a softmax operation with
            #   a cross-entropy loss operation for multi-class classification
            # * call `.backward()` on your loss and `.step()` on your optimizer

            # BEGIN SOLUTION
            optimizer.zero_grad()
            chars_batch, mask_batch, pos_batch = chars_batch.cuda(), mask_batch.cuda(), pos_batch.cuda()
            logits = network(chars_batch, mask_batch)
            loss = F.cross_entropy(logits, pos_batch)
            loss.backward()
            optimizer.step()
            # END SOLUTION

        validation_score = evaluate(predictor, validation_data_raw)
        print('Validation score:', validation_score)

        # YOUR CODE HERE
        # * if the validation score is better than your previous best score, save the model
        #   use `network.state_dict()` and `torch.save` (https://pytorch.org/docs/stable/notes/serialization.html)
        #   this gives us a form of early stopping in case the model starts overfitting

        # BEGIN SOLUTION
        if best_score<validation_score:
          best_score=validation_score
          torch.save(network.state_dict(),'network.pt')
        # END SOLUTION

    # YOUR CODE HERE
    # * load the best model from the file where you saved it using `torch.load` and `network.load_state_dict`
    #   and return it

    # BEGIN SOLUTION
    network.load_state_dict(torch.load('network.pt'))
    return network
    # END SOLUTION

trained_baseline_network = train(BaselineNetwork(len(all_tags)), 2)
print_sample_predictions(predict_using(trained_baseline_network), 
                         validation_data_raw)

Epoch 0
Validation score: 55.1
Epoch 1
Validation score: 55.1
Sample predictions: [('salem', 'NOUN'), ('unsympathetic', 'NOUN'), ('downwind', 'NOUN'), ('exodus', 'NOUN'), ('avoiding', 'NOUN'), ('informal', 'NOUN'), ('padded', 'NOUN'), ('tantalizing', 'NOUN'), ('farce', 'NOUN'), ('berger', 'NOUN')]
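
The save-best-then-reload pattern used in `train` can be tried in isolation. This sketch uses an in-memory buffer instead of a file and a throwaway `nn.Linear` model, both our own choices for illustration:

```python
import io
import torch
from torch import nn

model = nn.Linear(4, 2)

# Save a snapshot of the current parameters (a stand-in for "best so far").
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
best_weight = model.weight.detach().clone()

# Simulate further training that changes the parameters.
with torch.no_grad():
    model.weight.add_(1.0)

# Restore the snapshot; the parameters return to the saved values.
buffer.seek(0)
model.load_state_dict(torch.load(buffer))
print(torch.equal(model.weight, best_weight))  # True
```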


It's time to actually define a non-trivial neural network.  We'll start with a pretty simple model that takes embeddings of the characters of a word, pools them, and runs a feedforward network.  Fill in your code for `PoolingNetwork` below.  A correct implementation should get a validation score over 66%.


In [20]:
class PoolingNetwork(nn.Module):
    def __init__(self, n_outputs): # pass whatever arguments you need
        super().__init__() # you will get an error if you don't call the parent class __init__

        # YOUR CODE HERE
        # create Modules from torch.nn (imported as nn)
        # here you will need nn.Embedding and two nn.Linear
        # you may find it easier to start with the `forward` method and as you need components come back to place them here

        # BEGIN SOLUTION
        self.dim=256
        self.hidden=64
        self.embedding=nn.Embedding(128,self.dim)
        self.model = nn.Sequential(
            nn.Linear(self.dim,self.hidden),
            nn.ReLU(),
            nn.Dropout(p=0.2),
            nn.Linear(self.hidden,n_outputs)  # use the constructor argument instead of a hard-coded 11
        )
        # END SOLUTION

    def forward(self, chars, mask): # the main method that runs this module
        # for this network, `chars` should be an int64 tensor of character ids with size (batch, n_chars)
        #   note that sometimes PyTorch puts sequence dimensions before the batch, so you will need to make sure you know which you are using
        # `mask` is a float32 tensor of size (batch, n_chars) that is 1.0 if the character at that position in `chars` is valid (else 0.0)
        # the function returns a float32 tensor of size (batch, n_pos)
        
        # we recommend that you return pre-activation values from modules (e.g. the values before running softmax or sigmoid)
        # pytorch has loss functions that combine the softmax/sigmoid operation into the loss operation for more numerical stability

        # YOUR CODE HERE
        # Your code should do the following:
        # 1) get character embeddings
        # 2) multiply embeddings by `mask` (you will need to use `view` or `unsqueeze` to make the broadcasting work correctly;
        #    see https://pytorch.org/docs/stable/notes/broadcasting.html)
        # 3) pool over the characters of each word with the Tensor `mean` function
        # 4) run a linear layer
        # 5) apply an activation (ReLU is a decent default choice; look in torch.nn.functional, which is imported as F)
        # 6) run dropout; you can either make a nn.Dropout module in __init__ or use F.dropout, but if you use F.dropout, be sure to pass training=self.training to
        #    make sure dropout gets turned off during evaluation
        # 7) run your second linear layer and return the output

        # BEGIN SOLUTION
        char_embedding = self.embedding(chars)
        mask=mask.unsqueeze(-1)
        masked = char_embedding*mask
        pooled = torch.mean(masked,1)
        output = self.model(pooled)
        return output
        # END SOLUTION
trained_pooling_network = train(PoolingNetwork(len(all_tags)),25)
pooling_predictor = predict_using(trained_pooling_network)

Epoch 0
Validation score: 64.5
Epoch 1
Validation score: 65.2
Epoch 2
Validation score: 65.4
Epoch 3
Validation score: 65.0
Epoch 4
Validation score: 65.10000000000001
Epoch 5
Validation score: 65.60000000000001
Epoch 6
Validation score: 65.7
Epoch 7
Validation score: 65.8
Epoch 8
Validation score: 64.60000000000001
Epoch 9
Validation score: 64.7
Epoch 10
Validation score: 65.8
Epoch 11
Validation score: 66.0
Epoch 12
Validation score: 64.7
Epoch 13
Validation score: 65.8
Epoch 14
Validation score: 66.3
Epoch 15
Validation score: 65.2
Epoch 16
Validation score: 66.0
Epoch 17
Validation score: 64.9
Epoch 18
Validation score: 65.0
Epoch 19
Validation score: 65.3
Epoch 20
Validation score: 64.7
Epoch 21
Validation score: 64.4
Epoch 22
Validation score: 65.5
Epoch 23
Validation score: 65.5
Epoch 24
Validation score: 64.9


And look at some outputs.

In [22]:
pooling_predictor = predict_using(trained_pooling_network)

In [23]:
print_sample_predictions(pooling_predictor, validation_data_raw)

Sample predictions: [('salem', 'NOUN'), ('unsympathetic', 'NOUN'), ('downwind', 'NOUN'), ('exodus', 'NOUN'), ('avoiding', 'VERB'), ('informal', 'NOUN'), ('padded', 'VERB'), ('tantalizing', 'VERB'), ('farce', 'NOUN'), ('berger', 'NOUN')]
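
One detail of the pooling step is worth knowing: after masking, `mean` over the character dimension still divides by the padded length rather than by the number of valid characters. That is what the instructions ask for. The sketch below, on made-up values, contrasts it with a length-normalized alternative that you are free to try:

```python
import torch

# One word with 2 valid characters padded to length 4; embedding size 2.
embeddings = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]])
mask = torch.tensor([[1.0, 1.0, 0.0, 0.0]])

# unsqueeze adds a trailing dimension so the mask broadcasts over embeddings.
masked = embeddings * mask.unsqueeze(-1)          # (1, 4, 2)

# Plain mean divides by the padded length (4).
padded_mean = masked.mean(dim=1)                  # [[1.0, 1.5]]
# Length-normalized mean divides by the number of valid characters (2).
valid_mean = masked.sum(dim=1) / mask.sum(dim=1, keepdim=True)  # [[2.0, 3.0]]

print(padded_mean, valid_mean)
```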


For this next part, we'll give you a little more freedom to experiment.  Think about what types of information could be useful for predicting parts of speech.  Think about what the pooling model is missing.  Implement an improved model that reaches a validation score above 80%.

One way to reach the required accuracy is to operate over character n-grams before pooling.
There are several ways to implement this, but if you need help, you can use the following steps between the creation of embeddings and the mask/pool operations to process bigrams:
1. create two slices of the embedding tensor, one with the first character cut off and one with the last cut off
2. concatenate the two sliced tensors along the embedding dimension with `torch.cat`
3. run a linear layer with activation on the concatenated embeddings
4. cut off the first character of the mask tensor
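
The four steps above can be sketched on a toy tensor as follows. The sizes, the random stand-in for the embedding output, and the name `bigram_layer` are all assumptions for illustration; this is one possible reading of the steps, not the reference solution:

```python
import torch
from torch import nn

batch, n_chars, dim = 2, 5, 4
embeddings = torch.randn(batch, n_chars, dim)   # stand-in for embedding output
mask = torch.tensor([[1.0, 1.0, 1.0, 0.0, 0.0],
                     [1.0, 1.0, 1.0, 1.0, 1.0]])

# 1. two slices: one without the first character, one without the last
right, left = embeddings[:, 1:], embeddings[:, :-1]        # each (2, 4, 4)
# 2. concatenate along the embedding dimension
bigrams = torch.cat((left, right), dim=2)                  # (2, 4, 8)
# 3. linear layer with an activation
bigram_layer = nn.Linear(2 * dim, dim)
hidden = torch.relu(bigram_layer(bigrams))                 # (2, 4, 4)
# 4. drop the first mask position so the mask lines up with the bigrams
bigram_mask = mask[:, 1:].unsqueeze(-1)                    # (2, 4, 1)
pooled = (hidden * bigram_mask).mean(dim=1)                # (2, 4)

print(bigrams.shape, hidden.shape, pooled.shape)
```

Dropping the first mask position means a bigram counts as valid only when its second character is valid, so bigrams that run into the padding are zeroed out before pooling.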

In [24]:
class ImprovedNetwork(nn.Module):
    def __init__(self, n_outputs): # pass whatever arguments you need
        super().__init__()

        # YOUR CODE HERE
        # create Modules from torch.nn (imported as nn)

        # BEGIN SOLUTION
        self.dim=256
        self.hidden=64
        self.embedding=nn.Embedding(128,self.dim)
        self.n_chars = nn.Sequential(
            nn.Linear(3*self.dim,2*self.dim),
            nn.ReLU())
        self.model = nn.Sequential(
            nn.Linear(2*self.dim,self.hidden),
            nn.ReLU(),
            nn.Dropout(),
            nn.Linear(self.hidden,n_outputs)  # use the constructor argument instead of a hard-coded 11
        )
        # END SOLUTION

    def forward(self, chars, mask):
        # for this network, `chars` should be an int64 tensor of character ids with size (batch, n_chars)
        # `mask` is a float32 tensor of size (batch, n_chars) that is 1.0 if the character at that position in `chars` is valid (else 0.0)
        # the function returns a float32 tensor of size (batch, n_pos)

        # YOUR CODE HERE

        # BEGIN SOLUTION
        char_embedding = self.embedding(chars)             # (batch, n_chars, dim)
        # unsqueeze works for any batch size and word length
        mask = mask.unsqueeze(-1)                          # (batch, n_chars, 1)
        # Position i of n_grams concatenates the embeddings of characters
        # i+2, i+1, and i, i.e. the character trigram starting at position i.
        n_grams = torch.cat((char_embedding[:, 2:], char_embedding[:, 1:-1],
                             char_embedding[:, :-2]), 2)   # (batch, n_chars - 2, 3 * dim)
        n_out = self.n_chars(n_grams)
        # There are only n_chars - 2 trigrams, so the mask is sliced to match;
        # a trigram is valid only if its last character (position i+2) is valid.
        masked = n_out * mask[:, 2:]
        pooled = torch.mean(masked, 1)
        output = self.model(pooled)
        return output
        # END SOLUTION

trained_improved_network = train(ImprovedNetwork(len(all_tags)),25)
improved_predictor = predict_using(trained_improved_network)

Epoch 0
Validation score: 78.5
Epoch 1
Validation score: 79.5
Epoch 2
Validation score: 79.4
Epoch 3
Validation score: 80.10000000000001
Epoch 4
Validation score: 80.60000000000001
Epoch 5
Validation score: 79.60000000000001
Epoch 6
Validation score: 79.7
Epoch 7
Validation score: 79.7
Epoch 8
Validation score: 79.60000000000001
Epoch 9
Validation score: 79.60000000000001
Epoch 10
Validation score: 79.9
Epoch 11
Validation score: 78.60000000000001
Epoch 12
Validation score: 77.60000000000001
Epoch 13
Validation score: 78.4
Epoch 14
Validation score: 78.0
Epoch 15
Validation score: 78.3
Epoch 16
Validation score: 75.9
Epoch 17
Validation score: 77.5
Epoch 18
Validation score: 77.60000000000001
Epoch 19
Validation score: 78.3
Epoch 20
Validation score: 77.2
Epoch 21
Validation score: 77.7
Epoch 22
Validation score: 76.7
Epoch 23
Validation score: 76.7
Epoch 24
Validation score: 76.8


We can also get a feel for what our model learned by providing some of our own inputs that aren't real words (yet).

In [39]:
print_sample_predictions(improved_predictor, validation_data_raw)

print_sample_predictions(improved_predictor, [['successive','X'], ['under','X']])
print_sample_predictions(pooling_predictor, [['successive','X'], ['under','X']])

Sample predictions: [('salem', 'NOUN'), ('unsympathetic', 'ADJ'), ('downwind', 'NOUN'), ('exodus', 'NOUN'), ('avoiding', 'VERB'), ('informal', 'ADJ'), ('padded', 'VERB'), ('tantalizing', 'VERB'), ('farce', 'NOUN'), ('berger', 'NOUN')]
Sample predictions: [('successive', 'ADJ'), ('under', 'NOUN')]
Sample predictions: [('successive', 'NOUN'), ('under', 'NOUN')]


Finally, you need to run your model on the test set and save the outputs.  You'll turn in your predictions for us to grade.

In [26]:
def save_predictions(predictions, filename):
    """Save predictions to a file.
    
    predictions is a list of strings.
    """
    with open(filename, 'w') as f:
        for pred in predictions:
            f.write(pred)
            f.write('\n')

print('test score pooling:', evaluate(pooling_predictor, test_data_raw))
print('test score improved:', evaluate(improved_predictor, test_data_raw))

test_predictions = pooling_predictor(test_data_raw)
save_predictions(test_predictions, 'predicted_test_outputs_pooling.txt')
test_predictions = improved_predictor(test_data_raw)
save_predictions(test_predictions, 'predicted_test_outputs_improved.txt')

# Check that your test set looks like we expect it to
import hashlib
m = hashlib.md5()
m.update(str(test_data_raw).encode('utf-8'))
assert m.digest() == b'*N\xf6\xbe\xed\xde\xe8q)\xb9GG\xa6\x15UI'

test score pooling: 68.89999999999999
test score improved: 82.1
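
Before submitting, it can be worth reading a predictions file back and checking that it contains one tag per line. A sketch that writes a made-up list to a temporary file in the same one-tag-per-line format as `save_predictions`:

```python
import os
import tempfile

predictions = ['NOUN', 'VERB', 'ADJ']  # made-up predictions

# Write one tag per line, mirroring the format used by save_predictions.
path = os.path.join(tempfile.mkdtemp(), 'predictions.txt')
with open(path, 'w') as f:
    for pred in predictions:
        f.write(pred + '\n')

# Read the file back and compare with what was written.
with open(path) as f:
    loaded = f.read().splitlines()

print(len(loaded), loaded == predictions)  # 3 True
```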


In [40]:
print(test_data_raw)

[('shocks', 'NOUN'), ('bess', 'NOUN'), ('successive', 'ADJ'), ('checking', 'VERB'), ('canvass', 'NOUN'), ('under', 'ADP'), ('marches', 'NOUN'), ('greeting', 'NOUN'), ('region', 'NOUN'), ('pianist', 'NOUN'), ('hastened', 'VERB'), ('deprived', 'VERB'), ('visibly', 'ADV'), ('leap', 'VERB'), ('policies', 'NOUN'), ('neuroses', 'NOUN'), ('chilled', 'VERB'), ('mammalian', 'ADJ'), ('author', 'NOUN'), ('masterpiece', 'NOUN'), ('possessing', 'VERB'), ('rejoicing', 'VERB'), ('abilities', 'NOUN'), ('personal', 'ADJ'), ('repeating', 'VERB'), ('transformation', 'NOUN'), ('biting', 'VERB'), ('holding', 'VERB'), ('stub', 'NOUN'), ('electron', 'NOUN'), ('modernization', 'NOUN'), ('roadway', 'NOUN'), ('mayflower', 'NOUN'), ('neighborhood', 'NOUN'), ('meetings', 'NOUN'), ('linearly', 'ADV'), ('pennant', 'NOUN'), ('deliberate', 'ADJ'), ('saluted', 'VERB'), ('privy', 'ADJ'), ('admonitions', 'NOUN'), ('ruger', 'NOUN'), ('budget', 'NOUN'), ('liable', 'ADJ'), ('pack', 'NOUN'), ('conferred', 'VERB'), ('confine

### Gradescope

To download this notebook, go to `File->Download .ipynb`.  Please rename the file to match the name in our file list.  You can download other outputs, like `predicted_test_outputs_improved.txt`, by clicking the > arrow near the top left and finding them under `Files`.

When submitting your Jupyter notebooks, make sure everything runs correctly if the cells are executed in order starting from a fresh session.  Note that a cell that runs in your current session may still rely on code that you have since changed or deleted.  If the code doesn't take too long to run, we recommend re-running everything with `Runtime->Restart and run all...`.