In [1]:
import collections, time, math, random

# Writing a simple model in PyTorch

This notebook shows you how to get started with PyTorch and provides some skeleton code. You can make a copy of the notebook and write your solution in it, or you can download it (**File &rarr; Download .py**) and work on it locally.

## Setup

Clone the HW1 repository. (If you rerun the notebook, you'll get an error that directory `hw1` already exists, which you can ignore.)

In [2]:
!git clone https://github.com/ND-CSE-40657/hw1

Cloning into 'hw1'...
remote: Enumerating objects: 47, done.
remote: Counting objects: 100% (47/47), done.
remote: Compressing objects: 100% (36/36), done.
remote: Total 79 (delta 16), reused 36 (delta 10), pack-reused 32
Unpacking objects: 100% (79/79), done.


Import PyTorch. If you want to run on your own computer, you'll need to install PyTorch, which is usually as simple as `pip install torch`.

In [3]:
import torch
print(f'Using Torch v{torch.__version__}')

Using Torch v1.7.0+cu101


Check for a GPU. A GPU is not necessary for this assignment -- in fact, for the size of model we're training, it probably makes things slower. To enable/disable GPU, go to **Runtime &rarr; Change runtime type &rarr; Hardware accelerator** and select **GPU** (to enable the GPU) or **None** (to disable the GPU).

In [4]:
if torch.cuda.device_count() > 0:
    print(f'Using GPU ({torch.cuda.get_device_name(0)})')
    device = 'cuda'
else:
    print('Using CPU')
    device = 'cpu'

Using CPU


## Read and preprocess data

Read in the data files. Each line becomes a list of characters, with the trailing newline stripped and an `<EOS>` symbol appended.

In [5]:
def read_data(filename):
    # Use a context manager so the file is closed after reading.
    with open(filename) as f:
        return [list(line.rstrip('\n')) + ['<EOS>'] for line in f]
traindata = read_data('hw1/data/train')
devdata = read_data('hw1/data/dev')
testdata = read_data('hw1/data/test')

Create a vocabulary containing the most frequent words (here, the "words" are characters) and two special symbols, `<EOS>` and `<UNK>`.

In [6]:
class Vocab:
    def __init__(self, counts, size):
        self.size = size
        words = {'<EOS>', '<UNK>'}
        for word, _ in counts.most_common():
            words.add(word)
            if len(words) == size:
                break
        self.num_to_word = list(words)    
        self.word_to_num = {word:num for num, word in enumerate(self.num_to_word)}

    def numberize(self, word):
        if word in self.word_to_num:
            return self.word_to_num[word]
        else: 
            return self.word_to_num['<UNK>']

    def denumberize(self, num):
        return self.num_to_word[num]

chars = collections.Counter()
for line in traindata:
    chars.update(line)
vocab = Vocab(chars, 100) # For our data, 100 is a good size.
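
As a quick illustration of what the vocabulary does, here is a minimal, self-contained sketch on a toy corpus. The helper `build_vocab` is hypothetical (it is not part of the homework code); it mirrors the logic of `Vocab` but keeps the entries in a list so that the numbering is reproducible. The point to notice is that any character outside the vocabulary gets the same number as `<UNK>`.

```python
import collections

def build_vocab(counts, size):
    """Hypothetical stand-in for Vocab: the special symbols first,
    then the most frequent symbols, up to `size` entries in total."""
    words = ['<EOS>', '<UNK>']
    for word, _ in counts.most_common():
        if len(words) >= size:
            break
        if word not in words:
            words.append(word)
    return words, {word: num for num, word in enumerate(words)}

# Toy corpus: two "lines", each a list of characters ending in <EOS>.
toy_data = [list('banana') + ['<EOS>'], list('bandana') + ['<EOS>']]
toy_counts = collections.Counter()
for line in toy_data:
    toy_counts.update(line)

num_to_word, word_to_num = build_vocab(toy_counts, 4)

def numberize(word):
    # Symbols that are not in the vocabulary fall back to <UNK>.
    return word_to_num.get(word, word_to_num['<UNK>'])

print(num_to_word)
print([numberize(c) for c in 'band'])
```

With a vocabulary of size 4, only `a` and `n` fit next to the two special symbols, so `b` and `d` are both numberized as `<UNK>`.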

## Define the model

Now we want to define a unigram language model. The parameters of the model are _logits_ $\mathbf{s}$, which are unconstrained real numbers, and we will apply a softmax to change them into probabilities (which are nonnegative and sum to one).

\begin{align}
P(i) &= [\operatorname{softmax} \mathbf{s}]_i \\
&= \frac{\exp s_i}{\sum_{i'} \exp s_{i'}}.
\end{align}
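
To make the formula concrete, here is a small worked example in plain Python (no PyTorch needed). The logit values are arbitrary and chosen only for illustration. It checks that the results are positive and sum to one, and that adding a constant to every logit leaves the probabilities unchanged.

```python
import math

def softmax(s):
    """Turn a list of logits into probabilities: exp(s_i) / sum_j exp(s_j)."""
    exps = [math.exp(x) for x in s]
    total = sum(exps)
    return [e / total for e in exps]

s = [2.0, 0.0, -1.0]   # example logits (arbitrary values)
p = softmax(s)
print(p)               # larger logits get larger probabilities
print(sum(p))          # the probabilities sum to one, up to rounding

# Adding the same constant to every logit does not change the result.
p_shifted = softmax([x + 5.0 for x in s])
print(p_shifted)
```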

Create an array (a `Tensor`) of logits, one for each word in the vocabulary.

In [7]:
logits = torch.normal(mean=0, std=0.01, 
                      size=(vocab.size,), 
                      requires_grad=True, 
                      device=device)

The function `torch.normal` creates an array of random numbers, normally distributed (here with mean zero and standard deviation 0.01).

The `size` argument says that it should be a one-dimensional array with `vocab.size` elements, one for each word in the vocabulary.

The next two arguments are important. The `requires_grad` argument tells PyTorch that we will want to compute gradients with respect to `logits`, because we want to learn its values. The `device` argument says where to store the array.

It will be useful to keep a list of all the parameters of the model:

In [8]:
parameters = [logits]

Next, we write code to convert the logits into probabilities -- actually, log-probabilities. PyTorch has a function, `torch.log_softmax`, that does a softmax and a log together; it's more numerically stable than doing them in two steps. (Even though `logits` has only one dimension, we still have to say `dim=0` to specify which dimension the softmax should be computed over.)

In [9]:
def logprobs():
    return torch.log_softmax(logits, dim=0)

This returns an array of floats like you'd expect, but it also remembers _how_ it was computed. PyTorch will use this information to compute gradients for learning.
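
Returning to the remark about numerical stability: one common way to compute a log-softmax stably is to subtract the largest logit before exponentiating (the log-sum-exp trick). The sketch below is plain Python for illustration only. It is not PyTorch's actual implementation, and the names `naive_log_softmax` and `stable_log_softmax` are made up here.

```python
import math

def naive_log_softmax(s):
    # Softmax first, then log: math.exp overflows for large logits.
    exps = [math.exp(x) for x in s]
    total = sum(exps)
    return [math.log(e / total) for e in exps]

def stable_log_softmax(s):
    # Compute s_i - logsumexp(s), shifting by the maximum so that
    # every exponent is at most zero.
    m = max(s)
    lse = m + math.log(sum(math.exp(x - m) for x in s))
    return [x - lse for x in s]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]

print(stable_log_softmax(small))
print(stable_log_softmax(large))   # approximately the same as for `small`

try:
    naive_log_softmax(large)
except OverflowError as err:
    print('naive version failed:', err)
```

On moderate inputs the two versions agree; the difference only shows up when the logits are large in magnitude.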

## Train the model

Next, we create an optimizer, whose job is to adjust a set of parameters to minimize a loss function. Here, we're using `SGD` (stochastic gradient descent); alternatives include `Adagrad` and `Adam`. Different optimizers take different options. Here, `lr` stands for "learning rate", and we usually try different powers of ten until we get the best results on the dev set.

In [10]:
o = torch.optim.SGD(parameters, lr=0.1)
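
To make "adjust a set of parameters to minimize a loss function" concrete, here is a single gradient-descent step for the unigram model, computed by hand in plain Python. It assumes vanilla SGD (no momentum, weight decay, or clipping), where each logit is updated as $s_i \leftarrow s_i - \text{lr} \cdot \partial L / \partial s_i$, and it uses the standard result that the gradient of $-\log [\operatorname{softmax} \mathbf{s}]_k$ with respect to $s_i$ is $[\operatorname{softmax} \mathbf{s}]_i$ minus one if $i = k$. The toy values and function names are illustrative only.

```python
import math

def softmax(s):
    m = max(s)                      # shift by the maximum for stability
    exps = [math.exp(x - m) for x in s]
    total = sum(exps)
    return [e / total for e in exps]

def nll(s, k):
    """Negative log-probability of symbol k under logits s."""
    return -math.log(softmax(s)[k])

def sgd_step(s, k, lr):
    """One vanilla SGD step on the loss nll(s, k)."""
    p = softmax(s)
    grad = [p[i] - (1.0 if i == k else 0.0) for i in range(len(s))]
    return [s[i] - lr * grad[i] for i in range(len(s))]

s = [0.0, 0.0, 0.0]     # three symbols, initially equally likely
k = 1                   # the symbol we observed
before = nll(s, k)
s = sgd_step(s, k, lr=0.1)
after = nll(s, k)
print(before, after)    # the loss on the observed symbol goes down
print(s)                # the observed symbol's logit went up, the others down
```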

Next, we run through the training data a few times (epochs). For each line, we move the parameters a little bit to decrease the loss function. If you want to rerun the training, go to **Run &rarr; Restart and run all** or **Runtime &rarr; Run all**. In the CPU run logged below, each epoch took about 3 minutes.

In [13]:
prev_dev_acc = None

for epoch in range(100):
    epoch_start = time.time()

    # Run through the training data

    random.shuffle(traindata) # Important

    train_loss = 0
    train_chars = 0
    for chars in traindata:
        nums = [vocab.numberize(char) for char in chars]

        # Compute the negative log-likelihood of this line,
        # which is the thing we want to minimize.
        loss = 0.
        for i in nums:
            train_chars += 1
            loss -= logprobs()[i]

        # Keep a running total of negative log-likelihood.
        # The .item() turns a one-element tensor into an ordinary float,
        # including detaching the history of how it was computed,
        # so we don't save the history across sentences.
        train_loss += loss.item()

        # Compute gradient of loss with respect to parameters.
        o.zero_grad()   # important: this must come first
        loss.backward()

        # Clip gradients (not needed here, but helpful for RNNs)
        torch.nn.utils.clip_grad_norm_(parameters, 1.0)

        # Do one step of gradient descent.
        o.step()

    # Run through the development data

    dev_chars = dev_correct = 0
    for chars in devdata:
        nums = [vocab.numberize(char) for char in chars]
        for i in nums:
            dev_chars += 1

            # Find the character with highest predicted probability.
            # The .item() is needed for comparing with i.
            best = logprobs().argmax().item()
            if best == i:
                dev_correct += 1

    dev_acc = dev_correct/dev_chars
    print(f'time={time.time()-epoch_start} train_ppl={math.exp(train_loss/train_chars)} dev_acc={dev_acc}')

    # If dev accuracy did not improve, halve the learning rate
    if prev_dev_acc is not None and dev_acc <= prev_dev_acc:
        o.param_groups[0]['lr'] *= 0.5
        print(f"lr={o.param_groups[0]['lr']}")

    # When the learning rate gets too low, stop training
    if o.param_groups[0]['lr'] < 0.01:
        break

    prev_dev_acc = dev_acc

time=176.4177610874176 train_ppl=27.169235823154356 dev_acc=0.16477499004380725
time=173.23859167099 train_ppl=27.17237523692376 dev_acc=0.16477499004380725
lr=0.05
time=173.773264169693 train_ppl=27.117303524319112 dev_acc=0.16477499004380725
lr=0.025
time=172.58394718170166 train_ppl=27.090992142791837 dev_acc=0.16477499004380725
lr=0.0125
time=176.56773948669434 train_ppl=27.079781865860916 dev_acc=0.16477499004380725
lr=0.00625
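
The log above can be read with two small calculations, sketched here in plain Python. `train_ppl` is the exponential of the average negative log-likelihood per character, and the learning rate is halved after every epoch in which dev accuracy fails to improve, until it falls below 0.01. The function names `perplexity` and `simulate_schedule` are hypothetical helpers written for this illustration; they replay the rule from the training loop without training anything.

```python
import math

def perplexity(total_nll, num_chars):
    """exp of the average negative log-likelihood per character."""
    return math.exp(total_nll / num_chars)

# A model that gives every character probability 1/100 has perplexity 100
# (up to floating-point rounding).
uniform_nll = 1000 * -math.log(1 / 100)
print(perplexity(uniform_nll, 1000))

def simulate_schedule(dev_accs, lr=0.1, min_lr=0.01):
    """Replay the halving rule on a list of per-epoch dev accuracies.
    Returns the learning rate in effect at the end of each epoch run."""
    history = []
    prev = None
    for acc in dev_accs:
        if prev is not None and acc <= prev:
            lr *= 0.5               # no improvement: halve the learning rate
        history.append(lr)
        if lr < min_lr:             # too small: stop training
            break
        prev = acc
    return history

# A dev accuracy that never changes, as in the log above.
schedule = simulate_schedule([0.1648] * 100)
print(schedule)
```

With a constant dev accuracy the rule stops after five epochs, which matches the five `time=...` lines and four `lr=...` lines in the log.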
