# NLP 2: RNNs
This notebook is based on fastai's **[Chapter 12](https://github.com/fastai/fastbook/blob/master/12_nlp_dive.ipynb)**.

Please read that chapter before looking at this review.

*I suggest opening this notebook in Colab (where it can be easier to use GPU).*
*If you want to run it locally, set up the **deep-learning** environment in your terminal with `conda env create -f environment.yml` and activate it in your preferred IDE.*

In [2]:
### FOR COLAB USERS ###
! [ -e /content ] && pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

In [3]:
# ### FOR LOCAL USERS ###
# import fastai
# print(fastai.__version__)

# ! pip install -Uqq fastbook
# import fastbook
# fastbook.setup_book()

In [4]:
from fastbook import *
from fastai.text.all import *

# Data

## Download

In [5]:
# path
path = untar_data(URLs.HUMAN_NUMBERS)
Path.BASE_PATH = path
path.ls()

(#2) [Path('train.txt'),Path('valid.txt')]

In [6]:
# show an example
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())
lines

(#9998) ['one \n','two \n','three \n','four \n','five \n','six \n','seven \n','eight \n','nine \n','ten \n'...]

In [7]:
# first 100 characters
text = ' . '.join([l.strip() for l in lines])
text[:100]

'one . two . three . four . five . six . seven . eight . nine . ten . eleven . twelve . thirteen . fo'

## Preprocessing

In [8]:
# tokenize -- split on spaces
tokens = text.split(' ')
tokens[:10]

['one', '.', 'two', '.', 'three', '.', 'four', '.', 'five', '.']

In [9]:
# create vocab (all unique tokens)
vocab = L(*tokens).unique()
vocab

(#30) ['one','.','two','three','four','five','six','seven','eight','nine'...]

In [10]:
# token map (come up with index for where the token is in the vocab above)
token_map = {w:i for i,w in enumerate(vocab)}
token_map

{'one': 0,
 '.': 1,
 'two': 2,
 'three': 3,
 'four': 4,
 'five': 5,
 'six': 6,
 'seven': 7,
 'eight': 8,
 'nine': 9,
 'ten': 10,
 'eleven': 11,
 'twelve': 12,
 'thirteen': 13,
 'fourteen': 14,
 'fifteen': 15,
 'sixteen': 16,
 'seventeen': 17,
 'eighteen': 18,
 'nineteen': 19,
 'twenty': 20,
 'thirty': 21,
 'forty': 22,
 'fifty': 23,
 'sixty': 24,
 'seventy': 25,
 'eighty': 26,
 'ninety': 27,
 'hundred': 28,
 'thousand': 29}

In [11]:
# numericalize
nums = L(token_map[i] for i in tokens)
nums

(#63095) [0,1,2,1,3,1,4,1,5,1...]
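
The three preprocessing steps above (split, build vocab, numericalize) can be sketched in plain Python on a toy string. This is only an illustration: `build_vocab`, `numericalize`, and `decode` are hypothetical helper names, not fastai functions, and the toy text is a shortened version of the real one.

```python
def build_vocab(tokens):
    """Return the unique tokens in order of first appearance."""
    seen = {}
    for tok in tokens:
        if tok not in seen:
            seen[tok] = len(seen)
    return list(seen)

def numericalize(tokens, token_map):
    """Replace each token with its index in the vocab."""
    return [token_map[tok] for tok in tokens]

def decode(ids, vocab):
    """Map indices back to tokens (the inverse of numericalize)."""
    return [vocab[i] for i in ids]

toy_text = "one . two . three . four"
toy_tokens = toy_text.split(" ")
toy_vocab = build_vocab(toy_tokens)
toy_map = {tok: i for i, tok in enumerate(toy_vocab)}
toy_nums = numericalize(toy_tokens, toy_map)

print(toy_vocab)  # ['one', '.', 'two', 'three', 'four']
print(toy_nums)   # [0, 1, 2, 1, 3, 1, 4]
```

Decoding `toy_nums` with `decode` gives back `toy_tokens`, which is a quick sanity check that the token map is consistent.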

# Aside: Embeddings
- Embedding layer = a simple lookup table that stores one vector (embedding) per token, for a vocabulary of fixed size
- Can think of this sort of like a dictionary: token index in, vector out

### Create a vocab of 5 words

In [12]:
# vocab
tmp_vocab = {"test":0, "example":1, "evie":2, "cat":3, "dog":4}
tmp_vocab

{'test': 0, 'example': 1, 'evie': 2, 'cat': 3, 'dog': 4}

### Create embedding structure
This is sort of like a dictionary with keys = indices, values = embeddings

In [13]:
# set up parameters
vocab_size = len(tmp_vocab) # number of tokens (here, 5 words in the vocab)
embedding_dim = 3 # number of dimensions for each token's embedding (here, 3 dimensions)

In [14]:
# create embedding structure
embedding = nn.Embedding(vocab_size, embedding_dim)
embedding

Embedding(5, 3)

### Get embeddings
These embeddings are random values, since we did not give it pretrained values

In [15]:
# get indices for two tokens in the vocab
tok1 = tmp_vocab['evie']
tok2 = tmp_vocab['cat']
tok1, tok2

(2, 3)

In [16]:
# input = list of indices (sort of like a dict key)
input = torch.LongTensor([2,3]) # tokens with indices 2 and 3
input

tensor([2, 3])

In [17]:
# output = corresponding word embeddings (sort of like a dict value)
output = embedding(input)
output

tensor([[ 2.2082, -0.6380,  0.4617],
        [ 0.2674,  0.5349,  0.8094]], grad_fn=<EmbeddingBackward0>)

### Get embeddings using pretrained weights

#### Pretrained embedding weights for the tokens in the vocab

In [18]:
# there is one "row" per token (5 "rows"), with 3 dimensions each ("columns")
weight = torch.FloatTensor(
  [
    [0.0, 5.0, 10.0],
    [1.0, 6.0, 11.0],
    [2.0, 7.0, 12.0],
    [3.0, 8.0, 13.0],
    [4.0, 9.0, 14.0],
  ]
)
print(weight)

tensor([[ 0.,  5., 10.],
        [ 1.,  6., 11.],
        [ 2.,  7., 12.],
        [ 3.,  8., 13.],
        [ 4.,  9., 14.]])


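#### Load the pretrained embeddings
Nothing is trained here: `from_pretrained` just copies the weight matrix into the layer. Essentially, this creates a dictionary where the keys are the indices and the values are the rows of `weight`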

In [19]:
embedding = nn.Embedding.from_pretrained(weight)
embedding

Embedding(5, 3)

#### Get embeddings for tokens with index 2 and 3 in the vocab
These embeddings come directly from the pretrained weights

In [20]:
# input = list of indices (sort of like a dict key)
input = torch.LongTensor([2,3])
input

tensor([2, 3])

In [21]:
# output = corresponding word embeddings (sort of like a dict value)
output = embedding(input)
output

tensor([[ 2.,  7., 12.],
        [ 3.,  8., 13.]])

#### Summary

In [22]:
# Get embeddings for "evie"
word_index = tmp_vocab['evie']
word_index = torch.LongTensor([word_index]) # make it a tensor
embedding(word_index)

tensor([[ 2.,  7., 12.]])
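
One way to see why an embedding is "just a lookup": selecting row `i` of the weight table gives the same result as multiplying a one-hot vector by that table. The sketch below uses plain Python lists instead of PyTorch so the arithmetic is visible; `lookup`, `one_hot`, and `vec_mat` are hypothetical helpers written for this illustration.

```python
# Same 5x3 weight table as the pretrained example above
weight_rows = [
    [0.0, 5.0, 10.0],
    [1.0, 6.0, 11.0],
    [2.0, 7.0, 12.0],
    [3.0, 8.0, 13.0],
    [4.0, 9.0, 14.0],
]

def lookup(index, table):
    """Embedding lookup: return the row at `index`."""
    return table[index]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(vec, table):
    """Multiply a row vector by the table (one entry of vec per table row)."""
    n_cols = len(table[0])
    return [sum(v * row[c] for v, row in zip(vec, table)) for c in range(n_cols)]

evie = 2  # index of "evie" in tmp_vocab
by_lookup = lookup(evie, weight_rows)
by_matmul = vec_mat(one_hot(evie, len(weight_rows)), weight_rows)

print(by_lookup)  # [2.0, 7.0, 12.0]
print(by_matmul)  # [2.0, 7.0, 12.0]
```

The lookup is the cheaper of the two, since it skips the multiplications by zero.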

# Baseline Language Model
- Goal = predict each word based on the previous 3 words
- Inputs = 3 words
- Outputs = Probability of each possible next word (from the vocab)

### Example of inputs and outputs
- x = 3 tokens
- y = 1 token

In [23]:
# small sample
toks = tokens[0:25]
print("index \t\t input \t\t\t output".upper())

# get sample inputs, the first input's token index, and true outputs
for i in range(0,len(toks)-4,3):
  x = toks[i:i+3]
  y = toks[i+3]
  print(i, "\t", x, "\t\t", [y])

INDEX 		 INPUT 			 OUTPUT
0 	 ['one', '.', 'two'] 		 ['.']
3 	 ['.', 'three', '.'] 		 ['four']
6 	 ['four', '.', 'five'] 		 ['.']
9 	 ['.', 'six', '.'] 		 ['seven']
12 	 ['seven', '.', 'eight'] 		 ['.']
15 	 ['.', 'nine', '.'] 		 ['ten']
18 	 ['ten', '.', 'eleven'] 		 ['.']


In [24]:
# same example with the numericalized version
n = nums[0:25]
print("index \t input \t\t\t output".upper())

for i in range(0,len(n)-4,3):
  x = n[i:i+3]
  y = n[i+3]
  print(i, "\t", x, "\t\t", [y])

INDEX 	 INPUT 			 OUTPUT
0 	 [0, 1, 2] 		 [1]
3 	 [1, 3, 1] 		 [4]
6 	 [4, 1, 5] 		 [1]
9 	 [1, 6, 1] 		 [7]
12 	 [7, 1, 8] 		 [1]
15 	 [1, 9, 1] 		 [10]
18 	 [10, 1, 11] 		 [1]


In [25]:
# same numericalized example, just as tensors
seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0,len(nums)-4,3))
seqs

(#21031) [(tensor([0, 1, 2]), 1),(tensor([1, 3, 1]), 4),(tensor([4, 1, 5]), 1),(tensor([1, 6, 1]), 7),(tensor([7, 1, 8]), 1),(tensor([1, 9, 1]), 10),(tensor([10,  1, 11]), 1),(tensor([ 1, 12,  1]), 13),(tensor([13,  1, 14]), 1),(tensor([ 1, 15,  1]), 16)...]
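
The windowing logic can be pulled into a small function to make the index arithmetic explicit. This is a plain-list sketch; `make_windows` is a hypothetical name, and the loop bound deliberately mirrors the `len(nums)-4` used in the cell above.

```python
def make_windows(ids, context=3, stride=3):
    """Pair each `context`-token window with the token that follows it."""
    return [
        (ids[i:i + context], ids[i + context])
        # same bound as the notebook: len(ids) - 4 when context is 3
        for i in range(0, len(ids) - context - 1, stride)
    ]

toy_ids = [0, 1, 2, 1, 3, 1, 4, 1, 5, 1]
windows = make_windows(toy_ids)
for x_win, y_tok in windows:
    print(x_win, "->", y_tok)
# [0, 1, 2] -> 1
# [1, 3, 1] -> 4
```

Because the stride equals the context length, consecutive windows do not overlap, so each token is used as an input at most once.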

### Data

In [26]:
# data loaders
bs = 10 # small batch size for now
cut = int(len(seqs) * 0.8) # where to split into train/valid sets (sequential split: first 80% train, last 20% valid)
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=bs, shuffle=False)
dls

<fastai.data.core.DataLoaders at 0x79c5d1532da0>

In [27]:
# sample -- first batch
x,y = first(dls.train)
print("x shape:", x.shape)
print(x)
print()
print("y shape:", y.shape)
print(y)

x shape: torch.Size([10, 3])
tensor([[ 0,  1,  2],
        [ 1,  3,  1],
        [ 4,  1,  5],
        [ 1,  6,  1],
        [ 7,  1,  8],
        [ 1,  9,  1],
        [10,  1, 11],
        [ 1, 12,  1],
        [13,  1, 14],
        [ 1, 15,  1]])

y shape: torch.Size([10])
tensor([ 1,  4,  1,  7,  1, 10,  1, 13,  1, 16])


### Create Predictions
For this baseline model, we will always predict the most common token

In [28]:
# count the number of times each token in the vocab appears in the validation set
n,counts = 0,torch.zeros(len(vocab))

for x,y in dls.valid:
  n += y.shape[0]
  for i in range_of(vocab):
    counts[i] += (y==i).long().sum()

counts

tensor([106., 637., 159., 107., 106., 159., 108., 106., 464., 442.,   6.,   7.,   6.,   6.,   7.,   6.,   6.,   7.,   6.,   6.,  64.,  63.,  63.,  64.,  63.,  63.,  66.,  66., 600., 638.])

In [29]:
# find the most common token
idx = torch.argmax(counts)

print("index of most common token:", idx)
print("most common token:", vocab[idx.item()])

index of most common token: tensor(29)
most common token: thousand


In [30]:
# if you always predict that token, how accurate will you be?
print("number of counts for the most common token:", counts[idx].item())
print("total number of targets in the validation set:", n)
print("accuracy, if always predict most common token:", (counts[idx].item()/n))

number of counts for the most common token: 638.0
total number of targets in the validation set: 4207
accuracy, if always predict most common token: 0.15165200855716662
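
The same baseline can be written compactly with `collections.Counter`. This is a sketch on plain Python lists rather than on the `DataLoaders`; `majority_baseline` is a hypothetical helper, and the toy targets are the `y` values from the first training batch shown earlier.

```python
from collections import Counter

def majority_baseline(targets):
    """Return (most common target, accuracy of always predicting it)."""
    counts = Counter(targets)
    token, hits = counts.most_common(1)[0]
    return token, hits / len(targets)

toy_targets = [1, 4, 1, 7, 1, 10, 1, 13, 1, 16]
token, acc = majority_baseline(toy_targets)
print(token, acc)  # 1 0.5

# Same arithmetic as the validation-set result above: 638 hits out of 4207 targets
print(round(638 / 4207, 4))  # 0.1517
```

Any model worth keeping should beat this number on the validation set.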


# RNN 1: Unrolled representation
Unrolled = the representation of an RNN before refactoring with a for loop

## Model architecture
- NN architecture:
  - First linear layer inputs = first word's embedding
  - Second linear layer inputs = second word's embedding + the first layer's output activations
  - Third linear layer inputs = third word's embedding + the second layer's output activations
  - Why? It takes the context of every word into account -- every word is interpreted in the information context of any words preceding it
- Each of these three layers will use the same weight matrix
  -  The way that one word impacts the activations from previous words should not change depending on the position of a word
  - In other words, activation values will change as the data moves through the layers, but the layer weights themselves will not change from layer to layer
  - Aka a layer does not learn one sequence position -- it must learn to handle all positions
  - Since layer weights don't change, you can think of the sequential layers as "the same layer" repeated
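
Written as a recurrence, the bullets above amount to the following. The notation is introduced here for illustration and does not come from the original notebook: $e(x_t)$ is the embedding of the $t$-th token, and $W$, $b$ are the shared hidden-to-hidden weights and bias.

$$
h_0 = 0, \qquad h_t = \mathrm{ReLU}\big(W\,(h_{t-1} + e(x_t)) + b\big), \quad t = 1, 2, 3
$$

The same $W$ and $b$ appear at every step, which is what "the same layer repeated" means. Only $h_t$ changes from step to step.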

## Data

In [31]:
# recall our vocab
print(vocab)

['one', '.', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety', 'hundred', 'thousand']


In [32]:
# x -- first 3 words in a sequence
# numbers = indices of the tokens in the vocab
x,y = first(dls.train)
print("first batch of x:")
print(x)

first batch of x:
tensor([[ 0,  1,  2],
        [ 1,  3,  1],
        [ 4,  1,  5],
        [ 1,  6,  1],
        [ 7,  1,  8],
        [ 1,  9,  1],
        [10,  1, 11],
        [ 1, 12,  1],
        [13,  1, 14],
        [ 1, 15,  1]])


In [33]:
# y -- 4th word in a sequence
print("first batch of y:")
print(y)

first batch of y:
tensor([ 1,  4,  1,  7,  1, 10,  1, 13,  1, 16])


## Step 1. Get activations for the FIRST token in the sequence

### Layer 1 (embedding layer)
- Get hidden state (the embeddings for token 1)
- Input --> hidden

#### Input
- Indices of the first token from each sample in the batch
- Shape = [batch size]
  - batch size = the number of samples you have

In [34]:
# input = the first COLUMN of x (first word of each sequence)
input_tok1 = x[:,0]

print("number of inputs:", input_tok1.shape)
print("numericalized input:", input_tok1)
print("corresponding tokens:", vocab[input_tok1])

number of inputs: torch.Size([10])
numericalized input: tensor([ 0,  1,  4,  1,  7,  1, 10,  1, 13,  1])
corresponding tokens: ['one', '.', 'four', '.', 'seven', '.', 'ten', '.', 'thirteen', '.']


#### Define embedding layer architecture
- Shape = [vocab size, number of hidden dimensions]
  - number of hidden dimensions = number of dimensions you want for the embeddings

In [35]:
# choose number of embedding dimensions
n_hidden = 5
print("number of hidden dimensions:", n_hidden)

# get vocab size
vocab_sz = len(vocab)
print("vocab size:", vocab_sz)

number of hidden dimensions: 5
vocab size: 30


In [36]:
# create embedding layer architecture
input_to_hidden = nn.Embedding(vocab_sz, n_hidden)
input_to_hidden

Embedding(30, 5)

#### Get hidden state
- Turn the raw input into embeddings, using the embedding architecture just defined
- Hidden state for the first word = embeddings for the first word
  - These are randomly initialized for now
  - Will be updated with SGD later
- Shape = [batch size, number of hidden dimensions]

In [37]:
# get hidden state (embeddings for the first word of each sequence)
hidden_tok1_emb = input_to_hidden(input_tok1)
print("hidden state -- embeddings for first word of each sequence")
print(hidden_tok1_emb)

hidden state -- embeddings for first word of each sequence
tensor([[ 0.3057, -0.7746,  0.0349,  0.3211,  1.5736],
        [-0.8455,  1.3123,  0.6872, -1.2347, -0.4879],
        [ 1.8113,  0.1606,  0.3672,  0.1754, -1.1845],
        [-0.8455,  1.3123,  0.6872, -1.2347, -0.4879],
        [ 0.2311,  0.0087, -0.1423,  0.1971, -1.1441],
        [-0.8455,  1.3123,  0.6872, -1.2347, -0.4879],
        [ 0.7281, -0.7106, -0.6021,  0.9604,  0.4048],
        [-0.8455,  1.3123,  0.6872, -1.2347, -0.4879],
        [-0.4502, -0.6788,  0.5743,  0.1877, -0.3576],
        [-0.8455,  1.3123,  0.6872, -1.2347, -0.4879]], grad_fn=<EmbeddingBackward0>)


In [38]:
print("batch size: ", len(x))
print("number of hidden dimensions:", n_hidden)
print("hidden layer shape:", hidden_tok1_emb.shape)

batch size:  10
number of hidden dimensions: 5
hidden layer shape: torch.Size([10, 5])


### Layer 2 (linear layer)
- Matrix multiplication (hidden state * initialized linear params)
- Hidden --> hidden

#### Define linear layer architecture
- Initializes the parameter matrices (weights and bias)
- Weights shape = [number of hidden dimensions, number of hidden dimensions]
- Bias shape = [number of hidden dimensions]

In [39]:
hidden_to_hidden = nn.Linear(n_hidden, n_hidden)

print('Structure:\t',hidden_to_hidden)
print('Weights shape:\t',hidden_to_hidden.weight.shape)
print('Bias shape:\t',hidden_to_hidden.bias.shape)

Structure:	 Linear(in_features=5, out_features=5, bias=True)
Weights shape:	 torch.Size([5, 5])
Bias shape:	 torch.Size([5])


#### Update the hidden state
- Apply the initialized linear parameters (weights/bias) to the hidden state (from the embedding layer)
- The linear function does matrix multiplication with the initialized hidden state (the first word from our x samples) and parameter matrices
- Result = updated hidden state
- Shape = [batch size, number of hidden dimensions]

In [40]:
hidden_tok1_lin = hidden_to_hidden(hidden_tok1_emb)
print("updated hidden state:")
print(hidden_tok1_lin)

updated hidden state:
tensor([[-0.6069, -1.0809,  0.2330,  0.1919, -0.4181],
        [-0.3987,  0.6445,  1.0581,  0.4980,  1.3290],
        [-0.5230,  0.2812,  0.0594,  0.0501,  0.1410],
        [-0.3987,  0.6445,  1.0581,  0.4980,  1.3290],
        [-0.3221, -0.1254,  0.0772,  0.1889,  0.6168],
        [-0.3987,  0.6445,  1.0581,  0.4980,  1.3290],
        [-0.5450, -1.0688, -0.1006, -0.1317, -0.3340],
        [-0.3987,  0.6445,  1.0581,  0.4980,  1.3290],
        [-0.4286, -0.4842,  0.0209,  0.6692,  0.7102],
        [-0.3987,  0.6445,  1.0581,  0.4980,  1.3290]], grad_fn=<AddmmBackward0>)


In [41]:
print("shape of updated hidden state:", hidden_tok1_lin.shape)

shape of updated hidden state: torch.Size([10, 5])
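
To make the matrix multiplication concrete, here is what a linear layer computes, written with plain Python lists. `nn.Linear` stores its weight as `[out_features, in_features]` and computes `x @ weight.T + bias`; the `linear` helper and all toy values below are assumptions for illustration (2 samples, 3 hidden dimensions), not the notebook's actual parameters.

```python
def linear(batch, weight, bias):
    """Compute batch @ weight.T + bias, with weight shaped [out, in]."""
    return [
        [sum(x_i * w_i for x_i, w_i in zip(row, w_row)) + b
         for w_row, b in zip(weight, bias)]
        for row in batch
    ]

toy_hidden = [[1.0, 0.0, -1.0],
              [0.5, 2.0, 0.0]]
toy_weight = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]]
toy_bias = [0.0, 1.0, -1.0]

updated = linear(toy_hidden, toy_weight, toy_bias)
print(updated)  # [[1.0, 1.0, -1.0], [0.5, 3.0, 1.5]]
```

The output has one row per sample and one column per output feature, which is why the hidden state keeps the shape `[batch size, number of hidden dimensions]`.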


### Layer 3 (output layer)
- Get activations by applying ReLU to the updated hidden state
- ReLU means replace every negative number with 0 (positive numbers pass through unchanged)
- Shape = [batch size, number of hidden dimensions]

In [42]:
# apply relu to the (updated) hidden state
activations_tok1 = F.relu(hidden_tok1_lin)
activations_tok1

tensor([[0.0000, 0.0000, 0.2330, 0.1919, 0.0000],
        [0.0000, 0.6445, 1.0581, 0.4980, 1.3290],
        [0.0000, 0.2812, 0.0594, 0.0501, 0.1410],
        [0.0000, 0.6445, 1.0581, 0.4980, 1.3290],
        [0.0000, 0.0000, 0.0772, 0.1889, 0.6168],
        [0.0000, 0.6445, 1.0581, 0.4980, 1.3290],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.6445, 1.0581, 0.4980, 1.3290],
        [0.0000, 0.0000, 0.0209, 0.6692, 0.7102],
        [0.0000, 0.6445, 1.0581, 0.4980, 1.3290]], grad_fn=<ReluBackward0>)

In [43]:
print("shape of the activations:", activations_tok1.shape)

shape of the activations: torch.Size([10, 5])
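
As a quick check, ReLU can be applied by hand to the first row of the updated hidden state printed earlier. The `relu` function here is a plain-Python stand-in for `F.relu`, written for illustration.

```python
def relu(values):
    """Replace every negative number with 0."""
    return [max(0.0, v) for v in values]

# first row of the updated hidden state above
first_row = [-0.6069, -1.0809, 0.2330, 0.1919, -0.4181]
print(relu(first_row))  # [0.0, 0.0, 0.233, 0.1919, 0.0]
```

This matches the first row of `activations_tok1`.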


### Summary of what happened

In [44]:
# token 1 summary
print('raw input for token 1\n'.upper(), input_tok1)
print()
print('vocab for token 1\n'.upper(), vocab[input_tok1])
print()
print('embedding structure applied --> hidden state\n'.upper(), hidden_tok1_emb)
print()
print('linear function applied --> updated hidden state\n'.upper(), hidden_tok1_lin)
print()
print('relu applied --> activations for token 1\n'.upper(), activations_tok1)

RAW INPUT FOR TOKEN 1
 tensor([ 0,  1,  4,  1,  7,  1, 10,  1, 13,  1])

VOCAB FOR TOKEN 1
 ['one', '.', 'four', '.', 'seven', '.', 'ten', '.', 'thirteen', '.']

EMBEDDING STRUCTURE APPLIED --> HIDDEN STATE
 tensor([[ 0.3057, -0.7746,  0.0349,  0.3211,  1.5736],
        [-0.8455,  1.3123,  0.6872, -1.2347, -0.4879],
        [ 1.8113,  0.1606,  0.3672,  0.1754, -1.1845],
        [-0.8455,  1.3123,  0.6872, -1.2347, -0.4879],
        [ 0.2311,  0.0087, -0.1423,  0.1971, -1.1441],
        [-0.8455,  1.3123,  0.6872, -1.2347, -0.4879],
        [ 0.7281, -0.7106, -0.6021,  0.9604,  0.4048],
        [-0.8455,  1.3123,  0.6872, -1.2347, -0.4879],
        [-0.4502, -0.6788,  0.5743,  0.1877, -0.3576],
        [-0.8455,  1.3123,  0.6872, -1.2347, -0.4879]], grad_fn=<EmbeddingBackward0>)

LINEAR FUNCTION APPLIED --> UPDATED HIDDEN STATE
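 tensor([[-0.6069, -1.0809,  0.2330,  0.1919, -0.4181],
        [-0.3987,  0.6445,  1.0581,  0.4980,  1.3290],
        [-0.5230,  0.2812,  0.0594,  0.0501,  0.1410],
        [-0.3987,  0.6445,  1.0581,  0.4980,  1.3290],
        [-0.3221, -0.1254,  0.0772,  0.1889,  0.6168],
        [-0.3987,  0.6445,  1.0581,  0.4980,  1.3290],
        [-0.5450, -1.0688, -0.1006, -0.1317, -0.3340],
        [-0.3987,  0.6445,  1.0581,  0.4980,  1.3290],
        [-0.4286, -0.4842,  0.0209,  0.6692,  0.7102],
        [-0.3987,  0.6445,  1.0581,  0.4980,  1.3290]], grad_fn=<AddmmBackward0>)

RELU APPLIED --> ACTIVATIONS FOR TOKEN 1
 tensor([[0.0000, 0.0000, 0.2330, 0.1919, 0.0000],
        [0.0000, 0.6445, 1.0581, 0.4980, 1.3290],
        [0.0000, 0.2812, 0.0594, 0.0501, 0.1410],
        [0.0000, 0.6445, 1.0581, 0.4980, 1.3290],
        [0.0000, 0.0000, 0.0772, 0.1889, 0.6168],
        [0.0000, 0.6445, 1.0581, 0.4980, 1.3290],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.6445, 1.0581, 0.4980, 1.3290],
        [0.0000, 0.0000, 0.0209, 0.6692, 0.7102],
        [0.0000, 0.6445, 1.0581, 0.4980, 1.3290]], grad_fn=<ReluBackward0>)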

## Step 2. Get activations for the SECOND word in the sequence

### Layer 1 (embedding layer)
Update hidden state based on the embeddings of the SECOND token and the activations from the FIRST token

#### Input
Indices of the second token from each sample in the batch

In [45]:
# input = the second COLUMN of x (the second word in each sequence)
# shape = [batch size]
input_tok2 = x[:,1]
print(input_tok2.shape)
print(input_tok2)

torch.Size([10])
tensor([ 1,  3,  1,  6,  1,  9,  1, 12,  1, 15])


#### Get embeddings for SECOND token

In [46]:
# use the same embedding architecture
input_to_hidden

Embedding(30, 5)

In [47]:
# get embeddings
# shape = [batch size, number of hidden dimensions]
hidden_tok2_emb = input_to_hidden(input_tok2)
hidden_tok2_emb.shape

torch.Size([10, 5])

#### Update hidden state with the output of the first token
Activations from the FIRST token + embeddings for the SECOND token

In [48]:
# updated hidden state
# shape = [batch size, number of hidden dimensions]
hidden_tok2_updated = activations_tok1 + hidden_tok2_emb
hidden_tok2_updated.shape

torch.Size([10, 5])

### Layer 2 (linear layer)
Update the hidden state with the parameters (weights/bias)

In [49]:
# use the same linear architecture (weights and biases)
hidden_to_hidden

Linear(in_features=5, out_features=5, bias=True)

In [50]:
# apply linear
# shape = [batch size, number of hidden dimensions]
hidden_tok2_lin = hidden_to_hidden(hidden_tok2_updated)
print(hidden_tok2_lin.shape)

torch.Size([10, 5])


### Layer 3 (output layer)
Get activations for the second token

In [51]:
# apply relu
# shape = [batch size, number of hidden dimensions]
activations_tok2 = F.relu(hidden_tok2_lin)
print(activations_tok2.shape)

torch.Size([10, 5])


## Step 3. Get activations for the THIRD word in the sequence

### Layer 1 (embedding layer)
Update hidden state based on the embeddings of the THIRD token and the activations from the SECOND token

In [52]:
# input = the third COLUMN of x (the third word in each sequence)
input_tok3 = x[:,2]
print(input_tok3.shape)

torch.Size([10])


In [53]:
# get embeddings (using the same embedding architecture)
hidden_tok3_emb = input_to_hidden(input_tok3)
hidden_tok3_emb.shape

torch.Size([10, 5])

In [54]:
# update hidden state (by including the activations from the second token)
hidden_tok3_updated = activations_tok2 + hidden_tok3_emb
hidden_tok3_updated.shape

torch.Size([10, 5])

### Layer 2 (linear layer)
Update the hidden state with parameters (weights / biases)

In [55]:
# update hidden state (using the same linear architecture)
hidden_tok3_lin = hidden_to_hidden(hidden_tok3_updated)
print(hidden_tok3_lin.shape)

torch.Size([10, 5])


### Layer 3 (output layer)
Get activations for the third token

In [56]:
# apply relu
activations_tok3 = F.relu(hidden_tok3_lin)
print(activations_tok3.shape)

torch.Size([10, 5])


## Step 4. Get final outputs (activations)
- Goal: Predict the next (4th) word in the sequence
- Hidden --> output

In [57]:
# recall vocab shape
print("number of tokens in the vocab:", vocab_sz)

number of tokens in the vocab: 30


In [58]:
# recall y
print("y shape:", y.shape)
print(y)

y shape: torch.Size([10])
tensor([ 1,  4,  1,  7,  1, 10,  1, 13,  1, 16])


### Define linear architecture
- Matrix multiplication (hidden state * initialized linear params)
- Initializes the parameter matrices (weights and bias)
- Weights shape = [vocab size, number of hidden dimensions]
- Bias shape = [vocab size]

In [59]:
# the hidden to output linear layer
hidden_to_output = nn.Linear(n_hidden, vocab_sz)

print('Structure:\t',hidden_to_output)
print('Weights shape:\t',hidden_to_output.weight.shape)
print('Bias shape:\t',hidden_to_output.bias.shape)

Structure:	 Linear(in_features=5, out_features=30, bias=True)
Weights shape:	 torch.Size([30, 5])
Bias shape:	 torch.Size([30])


### Get the outputs
- Apply the initialized linear parameters (from the architecture above) to the hidden state we created above (aka the activations of the third token)
- The linear function does matrix multiplication with the hidden state and the initialized parameter matrices
- Result = predictions of the next word in the sequence
- Shape = [batch size, vocab size]
  - In other words, for each sequence (1 sample of the batch), get an activation for EVERY word in the vocab
  - Vocab size = number of "classes"

In [60]:
# apply these initialized linear parameters to the hidden state
outputs = hidden_to_output(activations_tok3)

In [61]:
# note the shape is the batch size x vocab size
print(outputs.shape)
# outputs

torch.Size([10, 30])
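
These raw outputs are not yet probabilities. A softmax turns them into probabilities, and cross-entropy (the loss used for training later) is the negative log of the probability given to the true token. The sketch below shows both in plain Python for a single sample; the four-word toy vocabulary and the values in `toy_logits` are assumptions for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[target])

toy_logits = [2.0, 0.5, -1.0, 0.0]
probs = softmax(toy_logits)
predicted = probs.index(max(probs))

print(predicted)             # 0
print(round(sum(probs), 6))  # 1.0

# the loss is lower when the true class is the one with the highest score
print(cross_entropy(toy_logits, 0) < cross_entropy(toy_logits, 2))  # True
```

A useful reference point: if all 30 scores were equal, every token would get probability 1/30 and the loss would be `math.log(30)`, roughly 3.4.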


## Same Model -- Class Format
Unrolled representation

### Version 1
Every step is completely written out for each token in the x sequence

In [62]:
class LMModel1(Module):
  def __init__(self, vocab_sz, n_hidden):
    """
    Initialize parameters
    """
    self.i_h = nn.Embedding(vocab_sz, n_hidden) # initialize embeddings
    self.h_h = nn.Linear(n_hidden, n_hidden) # initialize parameters (weights/bias) for the hidden linear layers
    self.h_o = nn.Linear(n_hidden,vocab_sz) # initialize parameters for the final layer (getting output)

  def forward(self, x):
    """
    Calculate the activations for each of the 3 layers
    i_h (input to hidden) = embedding layer
    h_h (hidden to hidden) = linear layer that creates the activations for the next word
    h_o (hidden to output) = final linear layer that predicts the fourth word
    """
    ### Get activations for the FIRST word in the sequence ###
    # Layer 1 (embedding layer): Get hidden state (the embeddings for word 1)
    h = self.i_h(x[:,0]) # first word = x[:,0]

    # Layer 2 (linear layer): Matrix multiplication (hidden state * initialized linear params)
    h = self.h_h(h)

    # Layer 3 (output layer): Get activations with ReLu (replace every negative number with 0)
    h = F.relu(h)

    ### Get activations for the SECOND word in the sequence ###
    # Layer 1 (embedding layer): Update hidden state (embeddings for word 2 + activations from word 1)
    h = h + self.i_h(x[:,1]) # second word = x[:,1]

    # Layer 2 (linear layer): Matrix multiplication (hidden state * initialized linear params)
    h = self.h_h(h)

    # Layer 3 (output layer): Get activations with ReLu
    h = F.relu(h)

    ### Get activations for the THIRD word in the sequence ###
    # Layer 1 (embedding layer): Update hidden state (embeddings for word 3 + activations from word 2)
    h = h + self.i_h(x[:,2]) # third word = x[:,2]

    # Layer 2 (linear layer): Matrix multiplication
    h = self.h_h(h)

    # Layer 3 (output layer): Get activations with ReLu
    h = F.relu(h)

    ### Get Final Activations -- predicting 4th word in the sequence ###
    # Final linear layer: Matrix multiplication (hidden state * initialized linear params)
    # Note: often, softmax is applied to these outputs, but not in this function
    out = self.h_o(h)

    return out

### Version 2
- Same model as above, but written in a more RNN-y format
  - A neural network that is defined using a loop is called a recurrent neural network (RNN)
  - An RNN is not a complicated new architecture, but simply a refactoring of a multilayer neural network using a for loop
- Hidden state = the activations that are updated at each step of a recurrent neural network

In [63]:
class LMModel2(Module):
  def __init__(self, vocab_sz, n_hidden):
    """
    Initialize parameters
    """
    self.i_h = nn.Embedding(vocab_sz, n_hidden)
    self.h_h = nn.Linear(n_hidden, n_hidden)
    self.h_o = nn.Linear(n_hidden,vocab_sz)

  def forward(self, x):
    """
    Calculate the activations
    """
    ### Initialize the hidden state ###
    # h will be (re)set to 0 on EVERY call to forward (i.e. for every batch of 3-token sequences)
    # this is not ideal -- we will fix this later with `stateful`
    h = 0

    ### Get/update the hidden state ###
    # this is done for each of the tokens in the sequence
    # note "3" can be replaced with any length
    for i in range(3):
      h = h + self.i_h(x[:,i]) # get embeddings and update hidden state
      h = F.relu(self.h_h(h)) # linear layer, followed by activation function (ReLU)

    ### Get final activations (predictions) ###
    out = self.h_o(h)

    return out
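
The claim that the loop is only a refactoring can be checked numerically. The sketch below rebuilds both forward passes with plain Python lists and confirms they produce the same hidden state. All parameter values (`emb`, `w_hh`, `b_hh`) are toy assumptions, the helpers are hypothetical, and the final output layer is left out because it is identical in both versions.

```python
def relu(vec):
    return [max(0.0, v) for v in vec]

def linear(vec, weight, bias):
    """weight is [out, in], as in nn.Linear."""
    return [sum(x * w for x, w in zip(vec, row)) + b for row, b in zip(weight, bias)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

# Toy parameters: 3-token vocab, 2 hidden dimensions
emb = [[0.5, -1.0], [1.0, 0.25], [-0.5, 2.0]]
w_hh = [[1.0, 0.5], [-0.5, 1.0]]
b_hh = [0.0, 0.25]

def forward_unrolled(seq):
    """Every step written out, as in LMModel1."""
    h = relu(linear(emb[seq[0]], w_hh, b_hh))
    h = relu(linear(add(h, emb[seq[1]]), w_hh, b_hh))
    h = relu(linear(add(h, emb[seq[2]]), w_hh, b_hh))
    return h

def forward_loop(seq):
    """Same computation with a loop, as in LMModel2."""
    h = [0.0] * len(b_hh)  # zero hidden state, like `h = 0`
    for tok in seq:
        h = relu(linear(add(h, emb[tok]), w_hh, b_hh))
    return h

seq = [0, 2, 1]
print(forward_unrolled(seq) == forward_loop(seq))  # True
```

Starting the loop from a zero hidden state is what makes the first iteration equal to the plain embedding lookup in the unrolled version.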

## Train Model
- Plug in batches of x data into model architecture
- Define the number of hidden dimensions you want
- Get predicted y values (using the model)
- Get loss / metrics
- Optimize with SGD
- Do this over multiple epochs

In [64]:
# get dataloaders using larger batch size
bs = 64
seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0,len(nums)-4,3))
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=bs, shuffle=False)

In [65]:
# make modeling choices
n_hidden = 64
model = LMModel1(len(vocab), n_hidden)

# train
learn = Learner(
  dls,
  model,
  loss_func=F.cross_entropy,
  metrics=accuracy
)

learn.fit_one_cycle(4, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.784861,2.03109,0.47421,00:04
1,1.403965,1.836548,0.468743,00:05
2,1.395346,1.708262,0.49275,00:03
3,1.37143,1.670374,0.494176,00:02


In [66]:
# show that the results are (roughly) the same in version 2
model = LMModel2(len(vocab), n_hidden)
learn = Learner(dls, model, loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.813884,1.904099,0.465652,00:01
1,1.391428,1.797721,0.468505,00:02
2,1.416528,1.664593,0.493701,00:02
3,1.368125,1.71018,0.42049,00:02


# RNN 2: Stateful version of the model
- The previous models were not stateful
  - The hidden state was reset to zero at the start of every call to `forward`, i.e. for every new 3-token input sequence
  - That is problematic because the samples are fed to the model in order, so it sees long stretches of the original text; resetting the hidden state to 0 each time throws away everything it has seen before
  - It also means that the model doesn't know where it is in the overall counting sequence
- Stateful models
  - The model remembers its activations between calls to `forward` (each call processes one batch)
  - In other words, it carries activations over from one batch to the next

## Data
- Need to make sure the samples are going to be seen in a certain order
- If the first line of the first batch is our `dset[0]` then the second batch should have `dset[1]` as the first line, so that the model sees the text flowing -- refer back to **_08_nlp_basics.ipynb_** for details on this

### Previous Version

In [67]:
# bs = 64
x,y = first(dls.train)
print(x[:5])
print(y[:5])

tensor([[0, 1, 2],
        [1, 3, 1],
        [4, 1, 5],
        [1, 6, 1],
        [7, 1, 8]])
tensor([1, 4, 1, 7, 1])


### Rearranged Version

#### Divide the dataset into m groups (of size `batch_size`)

In [68]:
def group_chunks(ds, bs):
  """
  Group the dataset into groups with a special order
  """
  m = len(ds) // bs # divide the dataset into m groups (of size batch_size)
  new_ds = L() # initialize new dataset

  # rearrange datasets so...
    # batch 1 will be (0, m, 2m, ..., (bs-1)m)
    # batch 2 will be (1, m+1, 2m+1, ..., (bs-1)m+1)
    # etc.
  for i in range(m):
    new_ds += L(ds[i + m*j] for j in range(bs))

  return new_ds
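
A toy run makes the reordering easier to see. The sketch below uses a list-based copy of the same index arithmetic (`group_chunks_plain` is a hypothetical name) on 12 stand-in samples with a batch size of 3.

```python
def group_chunks_plain(ds, bs):
    """List-based version of group_chunks (same index arithmetic)."""
    m = len(ds) // bs
    return [ds[i + m * j] for i in range(m) for j in range(bs)]

toy_ds = list(range(12))  # stand-in for 12 consecutive samples
toy_bs = 3
grouped = group_chunks_plain(toy_ds, toy_bs)
batches = [grouped[k:k + toy_bs] for k in range(0, len(grouped), toy_bs)]
print(batches)  # [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]

# Row 0 of every batch, read in order, is a contiguous stretch of the dataset
print([batch[0] for batch in batches])  # [0, 1, 2, 3]
```

Each row position therefore follows its own contiguous stream of text from batch to batch, which is what lets a carried-over hidden state make sense.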

#### Dataloaders

In [69]:
bs = 64
cut = int(len(seqs) * 0.8) # where to split into train/valid sets

grouped_dls = DataLoaders.from_dsets(
  group_chunks(seqs[:cut], bs), # train set
  group_chunks(seqs[cut:], bs), # valid set
  bs=bs, # batch size
  drop_last = True, # drop the last batch (which will not have the shape of bs)
  shuffle=False # make sure the texts are read in order
)

#### View data
- At each epoch, the model will see a chunk of contiguous text of size 3*m (since each text is of size 3) on each line of the batch

In [70]:
# first batch
x,y = first(grouped_dls.train)
print(x.shape, y.shape)
x[:5], y[:5]

torch.Size([64, 3]) torch.Size([64])


(tensor([[ 0,  1,  2],
         [11,  1,  2],
         [25,  7,  1],
         [ 5,  1,  5],
         [28, 12,  1]]),
 tensor([ 1, 28,  3, 28,  7]))

## Model Architecture
- Stateful
  - The model remembers the activations between calls to `forward` (`forward` is called once per batch)
  - Only resets the hidden state at the beginning of each epoch and before each validation phase
  - The model will have the same activations for whatever sequence length we pick, because the hidden state will remember the last activation from the previous batch
- Backpropagation Through Time (BPTT)
  - Treats a neural net with effectively one layer per time step (usually refactored using a loop) as one big model, and calculates gradients on it in the usual way
  - To avoid running out of memory and time, we usually use truncated BPTT, which "detaches" the history of computation steps in the hidden state every few time steps
  - Here: only calculate gradients through the most recent sequence-length tokens (3 in this model), instead of the whole stream
  - "Detach" the gradient history that is no longer needed

In [71]:
class LMModel3(Module):
  def __init__(self, vocab_sz, n_hidden):
    """
    Initialize the parameters for each layer
    Initialize the hidden state (h)
    """
    self.i_h = nn.Embedding(vocab_sz, n_hidden)
    self.h_h = nn.Linear(n_hidden, n_hidden)
    self.h_o = nn.Linear(n_hidden, vocab_sz)

    # initializing the hidden state here allows the model to be stateful
    self.h = 0

  def forward(self, x):
    """
    Get activations for each token in the x sequence (of length 3)
    """
    for i in range(3):
      self.h = self.h + self.i_h(x[:,i]) # get embeddings and update hidden state
      self.h = F.relu(self.h_h(self.h)) # linear layer and relu layer

    out = self.h_o(self.h) # get outputs -- predictions for the 4th word
    self.h = self.h.detach() # detach gradient -- remove its history (for memory size purposes)
    return out

  def reset(self):
    """
    Reset the hidden state at the beginning of each epoch and before each validation phase
    Helps make sure we start with a clean state before reading continuous chunks of text
    This is called in the learner
    """
    self.h = 0
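
The effect of `detach()` can be seen on a tiny example with a single scalar "weight". This is a minimal sketch with toy values, not the model above: it only shows that a detached hidden state keeps its value but no longer carries gradient history back to earlier steps.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)  # a single shared "weight" (toy value)

h = torch.tensor(1.0)   # hidden state carried between calls
h = h * w               # "batch 1": h now has graph history back to w
print(h.requires_grad)  # True

h = h.detach()          # keep the value (2.0), drop the history
print(h.requires_grad)  # False

loss = h * w            # "batch 2": only this step is in the graph
loss.backward()
print(w.grad)           # tensor(2.)
```

Without the `detach()`, `loss` would be `w * w` and the gradient would be `2 * w = 4`, because backpropagation would also flow through the first step. With it, the earlier step is treated as a constant.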


### Train

In [72]:
# train and fit
learn = Learner(
  grouped_dls,
  LMModel3(len(vocab), 64), # vocab and n_hidden
  loss_func=F.cross_entropy,
  metrics=accuracy,
  cbs=ModelResetter # reset the hidden state of the model at the beginning of each epoch and before each validation phase
)

learn.fit_one_cycle(10, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.682869,1.876281,0.409135,00:01
1,1.245189,1.751637,0.449519,00:01
2,1.095319,1.587953,0.503846,00:01
3,1.022529,1.643607,0.552885,00:01
4,0.992957,1.651556,0.539423,00:02
5,0.929095,1.819821,0.547356,00:02
6,0.877613,1.85066,0.563702,00:02
7,0.832252,1.670566,0.598558,00:01
8,0.792325,1.787299,0.599279,00:01
9,0.785346,1.722843,0.603365,00:01


# RNN 3: Predicting the Next Word after Each Word
- Input = word
- Output = next word
- This approach gives more signal -- it uses the intermediate predictions to predict the second and third words, in addition to the 4th word
- Very deep network -- not great because it can result in very large / very small gradients (will deal with this later)

## Data

### Previous
Predict 1 word after every 3 words

In [73]:
x,y = first(dls.train)

print("X \t\t\t Y")
for i in range(len(x[:10])):
  print(x[i],"\t", y[i])

X 			 Y
tensor([0, 1, 2]) 	 tensor(1)
tensor([1, 3, 1]) 	 tensor(4)
tensor([4, 1, 5]) 	 tensor(1)
tensor([1, 6, 1]) 	 tensor(7)
tensor([7, 1, 8]) 	 tensor(1)
tensor([1, 9, 1]) 	 tensor(10)
tensor([10,  1, 11]) 	 tensor(1)
tensor([ 1, 12,  1]) 	 tensor(13)
tensor([13,  1, 14]) 	 tensor(1)
tensor([ 1, 15,  1]) 	 tensor(16)


### Now
- After every word, predict the next word
- For each sequence in the sample, there are two lists of the same size
  - The first list is x
  - The second list is y
  - y is offset by one element from x
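
The offset between x and y can be seen on a tiny list of plain Python integers. This is a sketch that mirrors the slicing used in the next cells; `toy_nums` and `toy_sl` are illustrative stand-ins for `nums` and `sl`.

```python
# toy stand-in for `nums` (token ids) and a short sequence length
toy_nums = list(range(10))
toy_sl = 4

pairs = [
    (toy_nums[i:i + toy_sl], toy_nums[i + 1:i + toy_sl + 1])  # (x, y) with y shifted by one token
    for i in range(0, len(toy_nums) - toy_sl - 1, toy_sl)
]

for x_seq, y_seq in pairs:
    print(x_seq, "->", y_seq)
```

Each element of `y_seq` is the token that follows the element at the same position in `x_seq`, so every position in the sequence gets its own target.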

In [74]:
# make choices
sl = 4 # sequence length (number of tokens in the sequence)
bs = 10 # batch size (number of samples)

In [75]:
# create new sequences
new_seqs = L(
  (
    tensor(nums[i:i+sl]), # x
    tensor(nums[i+1:i+sl+1]) # y
  )
  for i in range(0,len(nums)-sl-1,sl) # sl tells you how long the sequence is
)

# first set of x and y
new_seqs[0]

(tensor([0, 1, 2, 1]), tensor([1, 2, 1, 3]))

In [76]:
# explain more clearly
for i in range(3):
  x = new_seqs[i][0]
  y = new_seqs[i][1]

  x_vocab = [L(vocab[o] for o in new_seqs[i][0])]
  y_vocab = [L(vocab[o] for o in new_seqs[i][1])]

  print("SAMPLE", i+1)
  print(f"X (raw input and vocab):\t{x}\t{x_vocab}")
  print(f"Y (raw input and vocab):\t{y}\t{y_vocab}")
  print()

SAMPLE 1
X (raw input and vocab):	tensor([0, 1, 2, 1])	[['one', '.', 'two', '.']]
Y (raw input and vocab):	tensor([1, 2, 1, 3])	[['.', 'two', '.', 'three']]

SAMPLE 2
X (raw input and vocab):	tensor([3, 1, 4, 1])	[['three', '.', 'four', '.']]
Y (raw input and vocab):	tensor([1, 4, 1, 5])	[['.', 'four', '.', 'five']]

SAMPLE 3
X (raw input and vocab):	tensor([5, 1, 6, 1])	[['five', '.', 'six', '.']]
Y (raw input and vocab):	tensor([1, 6, 1, 7])	[['.', 'six', '.', 'seven']]



In [77]:
# data loaders
cut = int(len(new_seqs) * 0.8) # train / valid split

new_dls = DataLoaders.from_dsets(
  group_chunks(new_seqs[:cut], bs), # at each epoch, the model will see a chunk of contiguous text of size sl*m on each line of the batch
  group_chunks(new_seqs[cut:], bs),
  bs=bs, drop_last=True, shuffle=False
)

In [78]:
# sample
x,y = first(new_dls.train)

print("X \t\t\t\t\t Y")
for i in range(len(x)):
  print(x[i], "\t", y[i])

X 					 Y
tensor([0, 1, 2, 1]) 	 tensor([1, 2, 1, 3])
tensor([ 1,  0, 29,  0]) 	 tensor([ 0, 29,  0, 28])
tensor([28, 24,  2,  1]) 	 tensor([24,  2,  1,  0])
tensor([28, 22,  4,  1]) 	 tensor([22,  4,  1,  2])
tensor([28, 20,  6,  1]) 	 tensor([20,  6,  1,  3])
tensor([ 6,  1,  4, 29]) 	 tensor([ 1,  4, 29,  2])
tensor([ 8,  1,  4, 29]) 	 tensor([ 1,  4, 29,  9])
tensor([ 1,  5, 29,  7]) 	 tensor([ 5, 29,  7, 28])
tensor([ 1,  6, 29,  5]) 	 tensor([ 6, 29,  5, 28])
tensor([28,  2,  1,  7]) 	 tensor([ 2,  1,  7, 29])


## Get y preds using an RNN model architecture
- Stateful
- Outputs a prediction after every word (rather than just at the end of a sequence)

### Step 1. Initialize layer structures

#### Embedding layer (input --> hidden)
- Creates the embedding structure
- vocab_sz = number of tokens in the vocab
- n_hidden = number of hidden dimensions (for the embeddings)

In [79]:
vocab_sz = len(vocab)
n_hidden = 5

input_to_hidden = nn.Embedding(vocab_sz, n_hidden)
input_to_hidden

Embedding(30, 5)

#### Linear (hidden) layer (hidden --> hidden)
- Initializes the parameter matrices (weights/bias)
- Creates the activations for the next token in the sequence

In [80]:
hidden_to_hidden = nn.Linear(n_hidden, n_hidden)

print('Structure:\t', hidden_to_hidden)
print('Weights shape:\t', hidden_to_hidden.weight.shape)
print('Bias shape:\t', hidden_to_hidden.bias.shape)

Structure:	 Linear(in_features=5, out_features=5, bias=True)
Weights shape:	 torch.Size([5, 5])
Bias shape:	 torch.Size([5])


#### Linear (output) layer (hidden --> output)
- Outputs a score (logit) for every token in the vocab; softmax (applied inside the cross-entropy loss) turns these scores into probabilities for the next token


In [81]:
hidden_to_output = nn.Linear(n_hidden, vocab_sz)

print('Structure:\t',hidden_to_output)
print('Weights shape:\t',hidden_to_output.weight.shape)
print('Bias shape:\t',hidden_to_output.bias.shape)

Structure:	 Linear(in_features=5, out_features=30, bias=True)
Weights shape:	 torch.Size([30, 5])
Bias shape:	 torch.Size([30])


#### Initialize hidden state

In [82]:
hidden = 0

### Step 2. First token (of 4-token sequence)

In [83]:
# raw input = the first COLUMN of x
input = x[:,0]
print(input.shape)
input

torch.Size([10])


tensor([ 0,  1, 28, 28, 28,  6,  8,  1,  1, 28])

In [84]:
# get embeddings
emb = input_to_hidden(input)
print(emb.shape)
emb

torch.Size([10, 5])


tensor([[-0.0541,  1.4377, -0.6071, -0.6359, -1.4333],
        [-0.3883,  0.5718, -0.7854, -0.2312,  0.0647],
        [-1.6021, -1.0879, -0.6479, -0.6746, -0.3957],
        [-1.6021, -1.0879, -0.6479, -0.6746, -0.3957],
        [-1.6021, -1.0879, -0.6479, -0.6746, -0.3957],
        [ 0.2785,  2.4130,  0.4558, -0.9273, -0.4406],
        [-0.2804,  0.3452,  0.5821,  0.0870,  0.7165],
        [-0.3883,  0.5718, -0.7854, -0.2312,  0.0647],
        [-0.3883,  0.5718, -0.7854, -0.2312,  0.0647],
        [-1.6021, -1.0879, -0.6479, -0.6746, -0.3957]], grad_fn=<EmbeddingBackward0>)

In [85]:
# update hidden state with embeddings
# since the hidden state was initialized to 0, here it's the same as the embeddings
hidden = hidden + emb
print(hidden.shape)
hidden

torch.Size([10, 5])


tensor([[-0.0541,  1.4377, -0.6071, -0.6359, -1.4333],
        [-0.3883,  0.5718, -0.7854, -0.2312,  0.0647],
        [-1.6021, -1.0879, -0.6479, -0.6746, -0.3957],
        [-1.6021, -1.0879, -0.6479, -0.6746, -0.3957],
        [-1.6021, -1.0879, -0.6479, -0.6746, -0.3957],
        [ 0.2785,  2.4130,  0.4558, -0.9273, -0.4406],
        [-0.2804,  0.3452,  0.5821,  0.0870,  0.7165],
        [-0.3883,  0.5718, -0.7854, -0.2312,  0.0647],
        [-0.3883,  0.5718, -0.7854, -0.2312,  0.0647],
        [-1.6021, -1.0879, -0.6479, -0.6746, -0.3957]], grad_fn=<AddBackward0>)

In [86]:
# apply linear layer
hidden = hidden_to_hidden(hidden)
print(hidden.shape)
hidden

torch.Size([10, 5])


tensor([[-1.7006e-01, -1.5248e-01,  2.4186e-01, -1.4339e-01, -8.7209e-01],
        [ 5.9632e-01, -1.9563e-01, -1.5672e-01,  3.2954e-01, -3.3494e-01],
        [ 1.0002e+00, -9.9780e-01,  4.7206e-01,  3.8699e-01,  6.9360e-04],
        [ 1.0002e+00, -9.9780e-01,  4.7206e-01,  3.8699e-01,  6.9360e-04],
        [ 1.0002e+00, -9.9780e-01,  4.7206e-01,  3.8699e-01,  6.9360e-04],
        [-5.5977e-02,  9.2950e-01,  3.3538e-01, -5.3880e-01, -1.3467e+00],
        [ 6.4640e-01,  4.8039e-01,  2.0160e-01,  1.3311e-01, -1.2474e-01],
        [ 5.9632e-01, -1.9563e-01, -1.5672e-01,  3.2954e-01, -3.3494e-01],
        [ 5.9632e-01, -1.9563e-01, -1.5672e-01,  3.2954e-01, -3.3494e-01],
        [ 1.0002e+00, -9.9780e-01,  4.7206e-01,  3.8699e-01,  6.9360e-04]], grad_fn=<AddmmBackward0>)

In [87]:
# apply relu
hidden = F.relu(hidden)
print(hidden.shape)
hidden

torch.Size([10, 5])


tensor([[0.0000e+00, 0.0000e+00, 2.4186e-01, 0.0000e+00, 0.0000e+00],
        [5.9632e-01, 0.0000e+00, 0.0000e+00, 3.2954e-01, 0.0000e+00],
        [1.0002e+00, 0.0000e+00, 4.7206e-01, 3.8699e-01, 6.9360e-04],
        [1.0002e+00, 0.0000e+00, 4.7206e-01, 3.8699e-01, 6.9360e-04],
        [1.0002e+00, 0.0000e+00, 4.7206e-01, 3.8699e-01, 6.9360e-04],
        [0.0000e+00, 9.2950e-01, 3.3538e-01, 0.0000e+00, 0.0000e+00],
        [6.4640e-01, 4.8039e-01, 2.0160e-01, 1.3311e-01, 0.0000e+00],
        [5.9632e-01, 0.0000e+00, 0.0000e+00, 3.2954e-01, 0.0000e+00],
        [5.9632e-01, 0.0000e+00, 0.0000e+00, 3.2954e-01, 0.0000e+00],
        [1.0002e+00, 0.0000e+00, 4.7206e-01, 3.8699e-01, 6.9360e-04]], grad_fn=<ReluBackward0>)

In [88]:
# get outputs!
# one "row" for each sample in the batch; one "column" for each token in vocab
outputs = hidden_to_output(hidden)
print(outputs.shape)

torch.Size([10, 30])


In [89]:
# save outputs to list (will save an output for every token in the sequence)
outs = []
outs.append(outputs)
len(outs)

1

### Step 3. Second token

In [90]:
# raw input from SECOND column of x
input = x[:,1]
print(input.shape)

# get embeddings
emb = input_to_hidden(input)
print(emb.shape)

# update hidden state -- activations from first layer + embeddings of second column
hidden = hidden + emb
print(hidden.shape)

# apply linear function
hidden = hidden_to_hidden(hidden)
print(hidden.shape)

# apply relu
hidden = F.relu(hidden)
print(hidden.shape)
print()

# get outputs
outputs = hidden_to_output(hidden)

# save to output list
outs.append(outputs)
print(len(outs))
print(outs[1].shape)

torch.Size([10])
torch.Size([10, 5])
torch.Size([10, 5])
torch.Size([10, 5])
torch.Size([10, 5])

2
torch.Size([10, 30])


### Steps 3, 4, 5. Repeat for tokens 2, 3, and 4 (in the 4-token sequence)

In [91]:
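### TOKEN 2 ###
# NOTE: token 2 was already processed in the cell above, so running this cell
# updates the hidden state and appends an output for token 2 a second time
# (that is why len(outs) is 3 here, and 5 after token 4, instead of 2 and 4)

# raw input from SECOND column of x
input = x[:,1]

# get embeddings
emb = input_to_hidden(input)

# update hidden state -- activations from first layer + embeddings of second column
hidden = hidden + emb

# apply linear function
hidden = hidden_to_hidden(hidden)

# apply relu
hidden = F.relu(hidden)

# get outputs
outputs = hidden_to_output(hidden)

# save to output list
outs.append(outputs)
len(outs)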

3

In [92]:
# TOKEN 3
input = x[:,2] # raw input from THIRD column of x
emb = input_to_hidden(input)
hidden = hidden + emb # update hidden state -- activations from second layer + embeddings of third column
hidden = hidden_to_hidden(hidden)
hidden = F.relu(hidden)
outputs = hidden_to_output(hidden)
outs.append(outputs)
len(outs)

4

In [93]:
# TOKEN 4
input = x[:,3] # raw input from FOURTH column of x
emb = input_to_hidden(input)
hidden = hidden + emb # update hidden state -- activations from third layer + embeddings of fourth column
hidden = hidden_to_hidden(hidden)
hidden = F.relu(hidden)
outputs = hidden_to_output(hidden)
outs.append(outputs)
len(outs)

5

### Repeat Steps 2-5 in a more efficient way
- Loop for all tokens in the sequence
- Recall, `sl` = sequence length

In [94]:
tokens_all = range(sl)
outs = []

for i in tokens_all:
  print("TOKEN", i+1)

  input = x[:,i]
  print("input shape:", input.shape)

  emb = input_to_hidden(input)
  print("embedding shape:", emb.shape)

  hidden = hidden + emb
  print("hidden shape:", hidden.shape)

  hidden = hidden_to_hidden(hidden)
  print("hidden shape after linear:", hidden.shape)

  hidden = F.relu(hidden)
  print("hidden shape after relu:", hidden.shape)

  outputs = hidden_to_output(hidden)
  print("outputs shape:", outputs.shape)

  outs.append(outputs)
  print("----\n")

print("total number of outputs:", len(outs))

TOKEN 1
input shape: torch.Size([10])
embedding shape: torch.Size([10, 5])
hidden shape: torch.Size([10, 5])
hidden shape after linear: torch.Size([10, 5])
hidden shape after relu: torch.Size([10, 5])
outputs shape: torch.Size([10, 30])
----

TOKEN 2
input shape: torch.Size([10])
embedding shape: torch.Size([10, 5])
hidden shape: torch.Size([10, 5])
hidden shape after linear: torch.Size([10, 5])
hidden shape after relu: torch.Size([10, 5])
outputs shape: torch.Size([10, 30])
----

TOKEN 3
input shape: torch.Size([10])
embedding shape: torch.Size([10, 5])
hidden shape: torch.Size([10, 5])
hidden shape after linear: torch.Size([10, 5])
hidden shape after relu: torch.Size([10, 5])
outputs shape: torch.Size([10, 30])
----

TOKEN 4
input shape: torch.Size([10])
embedding shape: torch.Size([10, 5])
hidden shape: torch.Size([10, 5])
hidden shape after linear: torch.Size([10, 5])
hidden shape after relu: torch.Size([10, 5])
outputs shape: torch.Size([10, 30])
----

total number of outputs: 4


### Step 6. Examining the outputs -- predictions of next word

- `outs` = a list with one output per token in the sequence (here, 4 outputs because each sequence has 4 tokens)
  - The number of outputs always equals the sequence length
- The output for each token holds the predicted y values (scores) for the next token
  - Shape: [batch size, vocab size]
  - One "row" for each sample in the batch
  - Each "column" refers to a token in the vocab

In [95]:
# outs -- list of all outputs (one per token in the sequence length)
print(len(outs))

# output from one token (token 1)
  # 10 "rows" for the 10 samples in the batch
  # 30 "columns" because there's 30 tokens in our vocab
print(outs[0].shape)

4
torch.Size([10, 30])


#### Rearrange the outputs
- Stack all of the outputs together into one tensor
  - Shape: [batch size, sequence length, vocab size]
  - Dimension 0 = each sample in the batch
  - Dimension 1 = the location in the sequence (0 to sl-1)
  - Dimension 2 = each token in the vocab
- Each of the items in dimension 0 (a sample) now corresponds to the predictions for one sample
  - Predictions for each location in the sequence
  - Predictions for every token in the vocab

In [96]:
# stacked outputs in a single tensor
stacked_outs = torch.stack(outs, dim=1)
print(stacked_outs.shape)

torch.Size([10, 4, 30])


In [97]:
# sample 1
print(stacked_outs[0].shape)

torch.Size([4, 30])


## Loss: Cross entropy

### Target values (the true Y values)
The true next token in the sequence

In [98]:
# recall x (shape = [batch size, sequence length])
print(x.shape)
x

torch.Size([10, 4])


tensor([[ 0,  1,  2,  1],
        [ 1,  0, 29,  0],
        [28, 24,  2,  1],
        [28, 22,  4,  1],
        [28, 20,  6,  1],
        [ 6,  1,  4, 29],
        [ 8,  1,  4, 29],
        [ 1,  5, 29,  7],
        [ 1,  6, 29,  5],
        [28,  2,  1,  7]])

In [99]:
# recall y (shape = [batch_size, sequence_length])
print(y.shape)
y

torch.Size([10, 4])


tensor([[ 1,  2,  1,  3],
        [ 0, 29,  0, 28],
        [24,  2,  1,  0],
        [22,  4,  1,  2],
        [20,  6,  1,  3],
        [ 1,  4, 29,  2],
        [ 1,  4, 29,  9],
        [ 5, 29,  7, 28],
        [ 6, 29,  5, 28],
        [ 2,  1,  7, 29]])

In [100]:
# flatten y values to be a vector (so can do cross entropy on them)
# shape: [batch size*sequence length]
flattened_y = y.view(-1)
print(flattened_y.shape)
flattened_y

torch.Size([40])


tensor([ 1,  2,  1,  3,  0, 29,  0, 28, 24,  2,  1,  0, 22,  4,  1,  2, 20,  6,  1,  3,  1,  4, 29,  2,  1,  4, 29,  9,  5, 29,  7, 28,  6, 29,  5, 28,  2,  1,  7, 29])

### Predicted Y values (outputs of the RNN)
- Shape = [batch size, sequence length, vocab size]
- dimension 0 = each sample in the batch
- dimension 1 = the location in the sequence (0 to sl-1)
- dimension 2 = each token in the vocab

In [101]:
# recall RNN outputs = predicted y values
print(stacked_outs.shape)

torch.Size([10, 4, 30])


In [102]:
# flatten these values to be a matrix (so can compare in cross entropy function below)
# shape: [(batch size*sequence length), vocab size]
flattened_outs = stacked_outs.view(-1, len(vocab))
flattened_outs.shape

torch.Size([40, 30])
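
To check that the flattened predictions and the flattened targets still line up row by row, here is a small sketch with made-up sizes (`toy_bs`, `toy_sl`, `toy_vocab` are illustrative, not the notebook's values).

```python
import torch

toy_bs, toy_sl, toy_vocab = 2, 3, 5   # illustrative sizes
preds = torch.arange(toy_bs * toy_sl * toy_vocab, dtype=torch.float32).view(toy_bs, toy_sl, toy_vocab)
targs = torch.arange(toy_bs * toy_sl).view(toy_bs, toy_sl)

flat_preds = preds.view(-1, toy_vocab)   # [bs*sl, vocab]
flat_targs = targs.view(-1)              # [bs*sl]

b, t = 1, 2                # sample 1, position 2 in the sequence
k = b * toy_sl + t         # the row it lands on after flattening

print(torch.equal(flat_preds[k], preds[b, t]))
print(flat_targs[k].item() == targs[b, t].item())
```

Because both tensors are flattened in the same (row-major) order, row `k` of the predictions is always compared with element `k` of the targets.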

### Cross entropy loss

In [103]:
loss = F.cross_entropy(flattened_outs, flattened_y)
loss

tensor(3.5575, grad_fn=<NllLossBackward0>)
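
As a sanity check on what `F.cross_entropy` computes, the sketch below reproduces it by hand on random scores: softmax, pick the probability of the true token, take the negative log, and average. The tensors here are made up for illustration.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 30)             # 4 predictions over a 30-token vocab (raw scores)
targets = torch.tensor([1, 2, 1, 3])    # true next-token ids

probs = F.softmax(logits, dim=1)        # scores -> probabilities (each row sums to 1)
manual = -torch.log(probs[torch.arange(4), targets]).mean()
builtin = F.cross_entropy(logits, targets)
print(manual.item(), builtin.item())

# a model that spreads probability evenly over 30 tokens has loss ln(30)
uniform = F.cross_entropy(torch.zeros(4, 30), targets)
print(uniform.item(), math.log(30))
```

The uniform case gives about 3.40, which is close to the 3.5575 above: with freshly initialized parameters the model is barely better (here, slightly worse) than guessing evenly across the vocab.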

## Create model class for all steps
Recall general deep learning steps:
- Get data
- Initialize params
- Run inputs through RNN architecture
- Get outputs
- Get loss
- Optimize with SGD and learning rate
- Step parameters
- Iterate

### Data

In [104]:
# make sequence length longer and batch size bigger
sl = 16
bs = 64

# create sequences (same process as above)
new_seqs = L(
  (tensor(nums[i:i+sl]), tensor(nums[i+1:i+sl+1]))
  for i in range(0,len(nums)-sl-1,sl)
)

# train / valid split
cut = int(len(new_seqs) * 0.8)

# data loaders
new_dls = DataLoaders.from_dsets(
  group_chunks(new_seqs[:cut], bs), # at each epoch, the model will see a chunk of contiguous text of size sl*m on each line of the batch
  group_chunks(new_seqs[cut:], bs),
  bs=bs, drop_last=True, shuffle=False
)

In [105]:
# show sample
x,y = first(new_dls.train)
for i in range(len(x[:3])):
  print("X:", x[i])
  print("Y:", y[i])
  print()

X: tensor([0, 1, 2, 1, 3, 1, 4, 1, 5, 1, 6, 1, 7, 1, 8, 1])
Y: tensor([1, 2, 1, 3, 1, 4, 1, 5, 1, 6, 1, 7, 1, 8, 1, 9])

X: tensor([ 2, 28, 11,  1,  2, 28, 12,  1,  2, 28, 13,  1,  2, 28, 14,  1])
Y: tensor([28, 11,  1,  2, 28, 12,  1,  2, 28, 13,  1,  2, 28, 14,  1,  2])

X: tensor([ 6,  1,  3, 28, 25,  7,  1,  3, 28, 25,  8,  1,  3, 28, 25,  9])
Y: tensor([ 1,  3, 28, 25,  7,  1,  3, 28, 25,  8,  1,  3, 28, 25,  9,  1])



### Model class

In [106]:
class LMModel4(Module):
  def __init__(self, vocab_sz, n_hidden):
    """
    Initialize layers and hidden state
    """
    self.i_h = nn.Embedding(vocab_sz, n_hidden)
    self.h_h = nn.Linear(n_hidden, n_hidden)
    self.h_o = nn.Linear(n_hidden,vocab_sz)
    self.h = 0

  def forward(self, x):
    """
    Get activations (predicted tokens)
    """
    # initialize outputs
    outs = []

    # one step per token in the x sequences (sl tokens per sequence)
    for i in range(sl):
      # add the embedding of the current token to the hidden state
      self.h = self.h + self.i_h(x[:,i])

      # get activations
      self.h = F.relu(self.h_h(self.h))

      # get outputs
      out = self.h_o(self.h)
      outs.append(out)

    # detach gradient -- remove its history (for memory size purposes)
    self.h = self.h.detach()

    # stack outputs
    outs = torch.stack(outs, dim=1)
    return outs

  # function to reset the hidden state
  def reset(self):
    self.h = 0

### Create loss function that includes a flattening step
This is what we did above

In [107]:
def loss_func(preds, targets):
  loss = F.cross_entropy(
    preds.view(-1, len(vocab)), # reshape preds to have 2 dimensions: (batch size*sequence length) x vocab size
    targets.view(-1) # reshape targets to have 1 dimension: (batch size*sequence length)
  )
  return loss

### Create learner, train, fit

In [108]:
n_hidden = 64

learn = Learner(
  new_dls,
  LMModel4(len(vocab), n_hidden),
  loss_func=loss_func,
  metrics=accuracy,
  cbs=ModelResetter
)

learn.fit_one_cycle(15, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.227531,3.086109,0.226156,00:00
1,2.307408,1.944043,0.467367,00:00
2,1.74972,1.90168,0.438802,00:01
3,1.488089,1.969942,0.474935,00:01
4,1.307086,2.102713,0.472005,00:01
5,1.200593,2.241389,0.511719,00:01
6,1.091871,2.195455,0.55127,00:01
7,1.00158,2.262634,0.541341,00:00
8,0.912107,2.321158,0.569661,00:00
9,0.858945,2.33203,0.573812,00:00


# RNN 4: Multilayer RNN
- Pass the activations from our recurrent neural network into a second recurrent neural network
- Note -- this still suffers from the "vanishing gradients" / "exploding gradients" problem (two opposite failure modes with the same cause). We will address this when we use an LSTM below.

## Data

In [109]:
def create_dls(sl, bs, nums):
  """
  In the book, they use the same number (64) for the batch size and number of hidden dimensions. I find that confusing.
  I'm making this function to easily go back and forth between dimensions we set and what the book says for results comparison.
  """
  # create sequences (same process as above)
  new_seqs = L(
    (tensor(nums[i:i+sl]), tensor(nums[i+1:i+sl+1]))
    for i in range(0,len(nums)-sl-1,sl)
  )

  # train / valid split
  cut = int(len(new_seqs) * 0.8)

  # data loaders
  dls = DataLoaders.from_dsets(
    group_chunks(new_seqs[:cut], bs), # at each epoch, the model will see a chunk of contiguous text of size sl*m on each line of the batch
    group_chunks(new_seqs[cut:], bs),
    bs=bs, drop_last=True, shuffle=False
  )
  return dls

In [110]:
# small size (easier to understand what's happening)
sl = 4
bs = 10

dls = create_dls(sl, bs, nums)

# first batch of x and y
x,y = first(dls.train)
x

tensor([[ 0,  1,  2,  1],
        [ 1,  0, 29,  0],
        [28, 24,  2,  1],
        [28, 22,  4,  1],
        [28, 20,  6,  1],
        [ 6,  1,  4, 29],
        [ 8,  1,  4, 29],
        [ 1,  5, 29,  7],
        [ 1,  6, 29,  5],
        [28,  2,  1,  7]])

## Step 1. Initialize layers and hidden state

### Embedding layer
- Creates the embeddings for each token of x

In [111]:
vocab_sz = len(vocab)
n_hidden = 5

emb_layer = nn.Embedding(vocab_sz, n_hidden)
emb_layer

Embedding(30, 5)

### Hidden (RNN) layer
- Previously, the hidden layer was linear
- This RNN layer creates activations for the next word in the sequence
- RNN inputs: [batch size, sequence length, input size]
- RNN outputs: [batch size, sequence length, output size]

In [112]:
n_layers = 2

rnn_layer = nn.RNN(
  n_hidden, # number of features in the inputs
  n_hidden, # number of features in the hidden state -- the outputs of this layer (here, same as inputs)
  num_layers = n_layers, # number of recurrent layers
  batch_first=True # inputs and outputs are shaped [batch size, sequence length, features] instead of [sequence length, batch size, features]
)

print('Structure:\t',rnn_layer)

Structure:	 RNN(5, 5, num_layers=2, batch_first=True)


### Output (linear) layer
- Creates the predictions of the next word (one score for each token in the vocab)

In [113]:
output_layer = nn.Linear(n_hidden, vocab_sz)

print('Structure:\t',output_layer)
print('Weights shape:\t',output_layer.weight.shape)
print('Bias shape:\t',output_layer.bias.shape)

Structure:	 Linear(in_features=5, out_features=30, bias=True)
Weights shape:	 torch.Size([30, 5])
Bias shape:	 torch.Size([30])


### Hidden State
- Shape = [number of layers, batch size, number of hidden dimensions]
- All 0s

In [114]:
hidden = torch.zeros(n_layers, bs, n_hidden)
print(hidden.shape)

torch.Size([2, 10, 5])


## Step 2. Get Y Preds using the RNN architecture
Note: don't need to loop over the tokens (time steps) or the stacked layers here -- `nn.RNN` handles both internally!

### Input
- Shape = [bs, sl]

In [115]:
# raw input = ALL of x
input = x
print(input.shape)

torch.Size([10, 4])


### Get embeddings
- Note: don't need to add the embeddings to the hidden state by hand -- `nn.RNN` combines them internally, using separate weight matrices for the input and the hidden state, followed by tanh
- Shape = [bs, sl, n hidden]

In [116]:
emb = emb_layer(input)
print(emb.shape)

torch.Size([10, 4, 5])


### Run RNN
- Inputs
  - Embeddings of the raw x: [bs, sl, n_hidden]
  - Hidden state: [rnn_layers, bs, n_hidden]
- Outputs
  - Activations: [bs, sl, n_hidden]
  - Updated hidden state: [rnn_layers, bs, n_hidden]

In [117]:
results, hidden = rnn_layer(emb, hidden)
print("activations (results) shape:", results.shape)
print("updated hidden state shape:", hidden.shape)

activations (results) shape: torch.Size([10, 4, 5])
updated hidden state shape: torch.Size([2, 10, 5])
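
To see what `nn.RNN` does with the embeddings and the hidden state, the sketch below rebuilds a single-layer `nn.RNN` pass as an explicit loop over tokens and compares the results. The sizes and the random `emb` tensor are illustrative; note that `nn.RNN` uses tanh by default, not the ReLU used in the hand-written models above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_bs, toy_sl, nh = 3, 4, 5                      # illustrative sizes
rnn = nn.RNN(nh, nh, num_layers=1, batch_first=True)
emb = torch.randn(toy_bs, toy_sl, nh)             # stand-in for the embedding output
h0 = torch.zeros(1, toy_bs, nh)                   # [n_layers, bs, n_hidden]

out, h_last = rnn(emb, h0)

# the same computation written as an explicit loop over the tokens
h = h0[0]
steps = []
for t in range(toy_sl):
    h = torch.tanh(emb[:, t] @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
                   + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0)
    steps.append(h)
manual = torch.stack(steps, dim=1)                # [bs, sl, n_hidden]

print(torch.allclose(manual, out, atol=1e-5))
print(torch.allclose(out[:, -1], h_last[0], atol=1e-5))
```

The second check shows the relationship between the two return values: for a single layer, the returned hidden state is the activation at the last position of the sequence.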


### Get Y preds (with the output layer)
- Preds shape: [bs, sl, vocab size]
- dim 0 = each sample in the batch
- dim 1 = the location in the sequence (0 to sl-1)
- dim 2 = each token in the vocab

In [118]:
outputs = output_layer(results)
print(outputs.shape)

torch.Size([10, 4, 30])


## Step 3. Train, fine-tune

### Model architecture as a class

In [119]:
class LMModel5(Module):
  def __init__(self, vocab_sz, n_hidden, n_layers):
    self.i_h = nn.Embedding(vocab_sz, n_hidden)
    self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True)
    self.h_o = nn.Linear(n_hidden, vocab_sz)
    self.h = torch.zeros(n_layers, bs, n_hidden)

  def forward(self, x):
    res,h = self.rnn(self.i_h(x), self.h)
    self.h = h.detach()
    return self.h_o(res)

  def reset(self):
    self.h.zero_()

### Train!

In [120]:
# use the book's sl and bs
sl = 16
bs = 64

# dataloaders
dls = create_dls(sl, bs, nums)

In [121]:
# set number of hidden dimensions and RNN layers
n_hidden = 64
n_layers = 2

# create learner
learn = Learner(
  dls,
  LMModel5(len(vocab), n_hidden, n_layers),
  loss_func=CrossEntropyLossFlat(),
  metrics=accuracy,
  cbs=ModelResetter
)

# fit
learn.fit_one_cycle(15, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.106834,2.695658,0.446777,00:00
1,2.186411,1.820358,0.471191,00:01
2,1.723488,1.941531,0.335531,00:01
3,1.494312,1.874899,0.452474,00:01
4,1.318809,1.705663,0.50651,00:01
5,1.159192,1.723225,0.539714,00:01
6,1.033712,1.723945,0.545573,00:00
7,0.944427,1.670675,0.554036,00:00
8,0.87111,1.674439,0.564372,00:00
9,0.803593,1.68773,0.568115,00:00


# LSTM Cell (not a whole LSTM model!)
**Problem with deep networks: risk of exploding / disappearing activations**
- Deep neural networks = each layer is another matrix multiplication (multiplying numbers and adding them up)
- These deep models can create problems because they require multiplying by a matrix many times, which can produce very big or very small results
- This is a problem because the way computers store numbers (known as "floating point") means that they become less and less accurate the further away the numbers get from zero
- This inaccuracy means that the gradients calculated for updating the weights often end up as zero or infinity for deep networks (called "vanishing / exploding gradients") --> in SGD, the weights are either not updated at all or jump to infinity. Either way, they won't improve with training
- LSTMs (Long Short-Term Memory): a type of layer that avoids exploding activations
- GRUs (Gated Recurrent Units): another type of layer that avoids exploding activations (not covered here)
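
The effect of repeated multiplication can be seen with plain Python floats. This is a scalar sketch of the idea (a real network multiplies by matrices, but the behaviour is similar); `small` and `big` are illustrative names.

```python
small, big = 1.0, 1.0
for _ in range(2000):
    small *= 0.5   # repeatedly shrink
    big *= 2.0     # repeatedly grow

print(small, big)  # prints: 0.0 inf
```

After enough steps the shrinking value underflows to exactly zero and the growing value overflows to infinity, which is the scalar analogue of vanishing and exploding gradients.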

**LSTMs vs RNNs**
- RNNs
  - 1 hidden state (the output of the RNN at the previous time step). Goal = predict next token
  - It is bad at remembering what happened much earlier in a document
- LSTMs
  - Have 2 hidden states:
    - (1) *Hidden state* (confusing name, I know...). Goal = predict next token
    - (2) *Cell state*. Goal = long and short term memory
  - Better at remembering things now!

## LSTM step by step for single layer
*Note: this would never be done in practice, since the outputs of the layers go into other layers (like the multilayered RNN). There would never be a single output like this (this is showing a single loop from one layer).*

*But I find that I need to see this to get a sense of what the gates actually do*

### Make decisions about structure

In [122]:
bs = 10
n_hidden = 8
n_inputs = 8 # n_inputs is the same as n_hidden for our example (because of the way we defined the embeddings)
vocab_sz = len(vocab)

# n_layers = 1 # not using rnn layers for this example
sl = 4 # not using sequences for this example, but used to create the data below

In [123]:
# back to small version of data
dls = create_dls(sl, bs, nums)
x,y = first(dls.train)
x.shape, y.shape

(torch.Size([10, 4]), torch.Size([10, 4]))

### Initialize hidden and cell states
- Hidden state
  - Shape: [bs, n hidden]
  - Focus: the NEXT token to predict
  - Goals: have the right information for the output layer to predict the correct next token; retain memory of everything that happened in the sentence
- Cell state
  - Shape: [bs, n hidden]
  - Focus: keep long short-term memory
  - Goals: longer-term state
- (note: only one layer for now)

In [124]:
# initialize hidden state and cell state to have 0s
state = [torch.zeros(bs, n_hidden) for _ in range(2)]
hidden, cell = state[0], state[1]

print(hidden.shape)
print(cell.shape)

torch.Size([10, 8])
torch.Size([10, 8])


### Inputs

In [125]:
# raw x input -- one column
i = 0
input = x[:,i]
input.shape

torch.Size([10])

In [126]:
# define embedding structure
input_to_hidden = nn.Embedding(vocab_sz, n_hidden)
input_to_hidden

Embedding(30, 8)

In [127]:
# get embeddings
emb = input_to_hidden(input)
emb.shape

torch.Size([10, 8])

### Combine embeddings with (initialized or previous) hidden state
- Concatenate the hidden state and the embeddings into one bigger tensor (along the feature dimension)
- Because they are concatenated rather than added, the dimension of the embeddings (n_inputs) can be different from the dimension of the hidden state (n_hidden)

In [128]:
updated_hidden = torch.cat([hidden, emb], dim=1)
updated_hidden.shape

torch.Size([10, 16])

**Note: as we'll see below, all the neural nets (gates) are linear layers with (n_inputs + n_hidden) inputs and (n_hidden) outputs.**
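
A quick sketch of that shape bookkeeping, with an input size that deliberately differs from the hidden size (the values `ni = 3` and `nh = 8` and the random `emb` are assumptions for illustration only):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_bs, ni, nh = 10, 3, 8                    # input size differs from hidden size on purpose
hidden = torch.zeros(toy_bs, nh)
emb = torch.randn(toy_bs, ni)

combined = torch.cat([hidden, emb], dim=1)   # [bs, nh + ni]
gate = nn.Linear(ni + nh, nh)                # every gate maps (ni + nh) features -> nh features
gate_out = torch.sigmoid(gate(combined))     # values between 0 and 1

print(combined.shape, gate_out.shape)
```

Whatever the input size is, each gate's output has the same shape as the hidden and cell states, which is what allows the element-wise multiplications in the following steps.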

### Forget Gate
- NN (linear layer, followed by sigmoid)
- Inputs -- (updated) hidden layer
- Outputs -- scalars between 0 and 1
- Decides which information to keep vs. throw away

In [129]:
# forget gate
forget_gate = nn.Linear(n_inputs + n_hidden, n_hidden) # initialize params
lin = forget_gate(updated_hidden) # linear layer
forget_gate_outputs = torch.sigmoid(lin) # sigmoid

forget_gate_outputs.shape

torch.Size([10, 8])

### Update Cell State
- Multiply the forget gate outputs by the cell state -- discard values closer to 0, keep values closer to 1
- Gives the LSTM the ability to forget things about its long-term state (e.g., if coming across an `xxbos` token or a `.` in this example)

In [130]:
updated_cell = cell * forget_gate_outputs
updated_cell.shape

torch.Size([10, 8])
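
With made-up numbers, the element-wise effect of the forget gate looks like this (plain Python lists, purely illustrative):

```python
cell_vals = [2.0, -1.0, 4.0]     # made-up cell state values
forget_vals = [0.0, 0.5, 1.0]    # made-up forget gate outputs (each between 0 and 1)

kept = [c * f for c, f in zip(cell_vals, forget_vals)]
print(kept)  # prints: [0.0, -0.5, 4.0]
```

A gate value of 0 erases the entry, 1 keeps it unchanged, and anything in between scales it down.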

### Input Gate
- NN (linear layer, followed by sigmoid)
- *Note: this is the same process as forget gate (but the linear functions have different initialized parameters)*
- Inputs -- (updated) hidden layer
- Outputs -- scalars between 0 and 1
- Decides which elements of the cell state to update (values close to 1) or not (values close to 0)

In [131]:
# input gate
input_gate = nn.Linear(n_inputs + n_hidden, n_hidden) # initialize params
lin = input_gate(updated_hidden) # linear layer
input_gate_outputs = torch.sigmoid(lin) # sigmoid

input_gate_outputs.shape

torch.Size([10, 8])

### Cell gate
- NN (linear layer, followed by tanh). tanh = sigmoid function rescaled to the range -1 to 1
- Inputs -- (updated) hidden layer
- Outputs -- numbers with range of -1 to +1
- If the input gate thinks a value should be updated, the cell gate determines what the updated values actually are

In [132]:
cell_gate = nn.Linear(n_inputs + n_hidden, n_hidden) # initialize params
lin = cell_gate(updated_hidden) # linear layer
cell_gate_outputs = torch.tanh(lin) # tanh

cell_gate_outputs.shape

torch.Size([10, 8])
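
The statement that tanh is a rescaled sigmoid corresponds to the identity tanh(x) = 2 * sigmoid(2x) - 1. A small numerical check using only the standard library (the `sigmoid` helper is defined here for illustration):

```python
import math

def sigmoid(x):
    """Logistic sigmoid for a single float."""
    return 1.0 / (1.0 + math.exp(-x))

for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    rescaled = 2.0 * sigmoid(2.0 * x) - 1.0   # sigmoid stretched from (0, 1) to (-1, 1)
    print(x, math.tanh(x), rescaled)
```

The two columns agree up to floating-point rounding, so the cell gate can propose both positive and negative updates, while the sigmoid gates only scale values between 0 and 1.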

### Update Cell State
- Multiply the input gate outputs and the cell gate outputs
- Add these products to the updated cell (that was updated by the forget gate)
- Lets the LSTM replace information if need be (e.g., replace a gender pronoun if the forget gate removed it)

In [133]:
updated_cell2 = updated_cell + input_gate_outputs * cell_gate_outputs
print(updated_cell2.shape)

torch.Size([10, 8])


### Output gate
- NN (linear layer, followed by a sigmoid)
- Inputs -- (updated) hidden layer
- Outputs -- scalars between 0 and 1
- Determines which information to use from the cell state to generate the output

In [134]:
output_gate = nn.Linear(n_inputs + n_hidden, n_hidden) # initialize params
lin = output_gate(updated_hidden) # linear layer
output_gate_outputs = torch.sigmoid(lin) # sigmoid

output_gate_outputs.shape

torch.Size([10, 8])

### Update hidden and cell states
Updated hidden state is also the output

In [135]:
# the new cell state is the cell state after the forget-gate and input-gate updates
# (no tanh is applied to the value that gets stored, matching the LSTMCell class below)
new_cell_state = updated_cell2
new_cell_state.shape

torch.Size([10, 8])

In [136]:
# new hidden state = output gate * tanh(new cell state) -- tanh is only applied on the way to the hidden state
new_hidden_state = output_gate_outputs * torch.tanh(new_cell_state)
new_hidden_state.shape

torch.Size([10, 8])

In [137]:
# keep track of the new hidden and cell states
new_state = (new_hidden_state, new_cell_state)

## All together in class

In [138]:
class LSTMCell(Module):
  def __init__(self, ni, nh):
    # the way we define the embeddings right now, the number of inputs (ni)
    # is the same as the number of hidden dimensions (nh)
    self.forget_gate = nn.Linear(ni + nh, nh)
    self.input_gate  = nn.Linear(ni + nh, nh)
    self.cell_gate   = nn.Linear(ni + nh, nh)
    self.output_gate = nn.Linear(ni + nh, nh)

  def forward(self, input, state):
    # same steps as above!
    h,c = state
    h = torch.cat([h, input], dim=1)
    forget = torch.sigmoid(self.forget_gate(h))
    c = c * forget
    inp = torch.sigmoid(self.input_gate(h))
    cell = torch.tanh(self.cell_gate(h))
    c = c + inp * cell
    out = torch.sigmoid(self.output_gate(h))
    h = out * torch.tanh(c)
    return h, (h,c)

In [139]:
# more efficient version (uses parallelization)
class LSTMCell(Module):
  def __init__(self, ni, nh):
    self.ih = nn.Linear(ni,4*nh)
    self.hh = nn.Linear(nh,4*nh)

  def forward(self, input, state):
    h,c = state

    # one big multiplication for all the gates is better than 4 smaller ones
    gates = (self.ih(input) + self.hh(h)).chunk(4, 1) # chunk -- splits our tensor into four pieces
    ingate,forgetgate,outgate = map(torch.sigmoid, gates[:3])
    cellgate = gates[3].tanh()

    c = (forgetgate*c) + (ingate*cellgate)
    h = outgate * c.tanh()

    return h, (h,c)
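
A cell only handles one time step, so processing a whole sequence means calling it in a loop and carrying the state along. Below is a minimal sketch of that loop. It assumes plain `torch.nn.Module` rather than fastai's `Module` (hence the explicit `super().__init__()`); the class name `FusedLSTMCell` and the random inputs are hypothetical and only there to check shapes.

```python
import torch
from torch import nn

class FusedLSTMCell(nn.Module):
    """Same computation as the efficient LSTMCell, written against plain nn.Module."""
    def __init__(self, ni, nh):
        super().__init__()
        self.ih = nn.Linear(ni, 4 * nh)
        self.hh = nn.Linear(nh, 4 * nh)

    def forward(self, x, state):
        h, c = state
        # one big linear step, then split into the four gates
        gates = (self.ih(x) + self.hh(h)).chunk(4, 1)
        ingate, forgetgate, outgate = map(torch.sigmoid, gates[:3])
        cellgate = gates[3].tanh()
        c = forgetgate * c + ingate * cellgate
        h = outgate * c.tanh()
        return h, (h, c)

bs, sl, ni, nh = 10, 4, 8, 8
cell = FusedLSTMCell(ni, nh)
x = torch.randn(bs, sl, ni)                         # stand-in for a batch of embeddings
state = (torch.zeros(bs, nh), torch.zeros(bs, nh))  # (hidden, cell), both start at zero

outs = []
for t in range(sl):                # one call per position in the sequence
    h, state = cell(x[:, t], state)
    outs.append(h)
outs = torch.stack(outs, dim=1)    # bs x sl x nh

print(outs.shape)
```

The stacked result has the same `bs x sl x n_hidden` layout that `nn.LSTM(..., batch_first=True)` returns for its activations, and the last hidden state is the final slice along the sequence dimension.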

# Full LSTM Model

## Step by Step

### Initialize params (in layers and state)

In [140]:
# make decisions
bs = 10
n_hidden = 8
n_inputs = 8
vocab_sz = len(vocab)
n_layers = 2 # number of rnn layers
sl = 4

In [141]:
# embedding layer
input_to_hidden = nn.Embedding(vocab_sz, n_hidden)
input_to_hidden

Embedding(30, 8)

In [142]:
# lstm layer (instead of the basic rnn layer)
rnn = nn.LSTM(
  n_hidden, # number of features in the inputs
  n_hidden, # number of features in the hidden state (here, same as inputs)
  num_layers = n_layers, # number of recurrent layers
  batch_first=True # has to do with the shape of the inputs
)

rnn

LSTM(8, 8, num_layers=2, batch_first=True)

In [143]:
# output layer
hidden_to_output = nn.Linear(n_hidden, vocab_sz)
hidden_to_output

Linear(in_features=8, out_features=30, bias=True)

In [144]:
# initialize hidden state and cell state with 0s
state = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]
hidden, cell = state[0], state[1]
hidden.shape, cell.shape

(torch.Size([2, 10, 8]), torch.Size([2, 10, 8]))

### Forward step
Note: there is no need to add the embeddings to the hidden state by hand -- `nn.LSTM` combines them internally through its gates.

In [145]:
# raw input = ALL of x
input = x
input.shape

torch.Size([10, 4])

In [146]:
# get embeddings
emb = input_to_hidden(input)
emb.shape

torch.Size([10, 4, 8])

In [147]:
# LSTM (a type of RNN)
  # inputs: embeddings, (hidden state, cell state)
  # outputs: activations, (hidden state, cell state)
results, state = rnn(emb, state)
print("activations shape:", results.shape)
print("hidden state shape:", state[0].shape)
print("cell state shape:", state[1].shape)

activations shape: torch.Size([10, 4, 8])
hidden state shape: torch.Size([2, 10, 8])
cell state shape: torch.Size([2, 10, 8])


In [148]:
# get predicted y values
  # dimension 0 = each sample in the batch size
  # dimension 1 = the location in the sequence (0-sl)
  # dimension 2 = each token in the vocab
outputs = hidden_to_output(results)
outputs.shape

torch.Size([10, 4, 30])

In [149]:
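# detach -- keep the state values but drop their gradient history
# (so backprop does not reach back into earlier batches)
state = [state_.detach() for state_ in state]

# reset state -- zero it in place (e.g., when starting a new pass over the data)
for state_ in state:
  state_.zero_()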

## As a Class

In [150]:
class LMModel6(Module):
  def __init__(self, vocab_sz, n_hidden, n_layers):
    self.i_h = nn.Embedding(vocab_sz, n_hidden)
    self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
    self.h_o = nn.Linear(n_hidden, vocab_sz)

    # hidden AND cell states
    self.state = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]

  def forward(self, x):
    res,state = self.rnn(self.i_h(x), self.state)
    self.state = [h_.detach() for h_ in state]
    return self.h_o(res)

  def reset(self):
    for state in self.state: state.zero_()

In [151]:
# back to big version of data
sl, bs = 16, 64
dls = create_dls(sl, bs, nums)

In [152]:
# train
n_hidden = 64
n_layers = 2

learn = Learner(
  dls,
  LMModel6(len(vocab), n_hidden, n_layers),
  loss_func=CrossEntropyLossFlat(),
  metrics=accuracy,
  cbs=ModelResetter
)

learn.fit_one_cycle(15, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,3.032064,2.761434,0.29484,00:01
1,2.233891,2.206143,0.216471,00:01
2,1.651446,1.998826,0.393717,00:01
3,1.364812,2.169563,0.469727,00:01
4,1.140272,2.139261,0.509521,00:01
5,0.931072,2.333035,0.518311,00:01
6,0.73096,1.916736,0.541585,00:01
7,0.539785,1.410841,0.636719,00:01
8,0.361171,1.231609,0.691569,00:01
9,0.22938,1.136598,0.714925,00:01


# Regularizing an LSTM
- RNNs tend to overfit (even using LSTMs and GRU cells)
- Data augmentation is hard to use for text because it typically requires another model (e.g., translate text into another language and then back to the original language)
- Other regularization techniques can be used to reduce overfitting: dropout, activation regularization, temporal activation regularization

## Dropout
- Randomly change some activations to 0 at training time -- makes sure all neurons actively work toward the output
- At test time, you use all activations
- Helps with overfitting / generalization -- neurons cannot rely on any single other neuron, so they learn to cooperate, and the added noise in the activations makes the model more robust
- Using dropout before passing the output of the LSTM to the final layer will help reduce overfitting
- If you apply dropout with probability `p`, rescale the surviving activations by dividing them by `1-p` --> on average a fraction `p` is zeroed and `1-p` is kept, so the division keeps the expected sum of the activations unchanged


### Initialize

In [153]:
# create sample activations
act = tensor(range(1,11)).float()
act

tensor([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.])

In [154]:
# choose proportion of activations to dropout (p)
p = 0.4

In [155]:
# choose if training (use dropout) or evaluation (no dropout)
training=True

### Forward step

In [156]:
# creates a new, uninitialized tensor (called mask) with the same shape as act -- its values are arbitrary until the next step fills them in
mask = act.new(*act.shape)
mask

tensor([2.3759e-34, 3.2165e-41, 0.0000e+00, 0.0000e+00, 1.4013e-45, 0.0000e+00, 0.0000e+00, 8.0000e+00, 1.1210e-43, 0.0000e+00])

In [157]:
# randomly make some values of mask 0s (with probability p) and the rest 1s (with probability 1-p)
mask = mask.bernoulli_(1-p)
mask

tensor([0., 0., 0., 1., 1., 0., 1., 0., 1., 0.])

In [158]:
# multiply the inputs (act) and mask together -- keeps some of the original values and replaces others with 0s
tmp = act*mask
tmp

tensor([0., 0., 0., 4., 5., 0., 7., 0., 9., 0.])

In [159]:
# divide the mask by (1-p) so the kept activations are scaled up (see below) -- note that div_ is in-place, so mask itself is rescaled too
dropout = act*mask.div_(1-p)
dropout

tensor([ 0.0000,  0.0000,  0.0000,  6.6667,  8.3333,  0.0000, 11.6667,  0.0000, 15.0000,  0.0000])

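#### To understand why the scaling is done (dividing the result by (1-p))
With this one mask, the total `w` of the rescaled activations is closer to the original than the total of the unscaled activations with dropout; averaged over many random masks, rescaling recovers the original total, while the unscaled total stays too small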

In [160]:
def get_values(input):
  num_nonzeros = torch.count_nonzero(input) # count number of non-zero elements
  s = torch.sum(input) # get the sum of all elements
  v = s / num_nonzeros # the "weight" of each tensor
  w = num_nonzeros * v # the total "weight" of all of the non-zero tensors

  return num_nonzeros, s, v, w

In [161]:
# original activations
num_nonzeros,s,v,w = get_values(act)

print("number of non-zeros:", num_nonzeros)
print("the sum of all elements:", s)
print("the weight of each tensor:", v)
print("the total weight of all the non-zero tensors:", w)

number of non-zeros: tensor(10)
the sum of all elements: tensor(55.)
the weight of each tensor: tensor(5.5000)
the total weight of all the non-zero tensors: tensor(55.)


In [162]:
# activations with dropout (but no scaling)
num_nonzeros,s,v,w = get_values(tmp)

print("number of non-zeros:", num_nonzeros)
print("the sum of all elements:", s)
print("the weight of each tensor:", v)
print("the total weight of all the non-zero tensors:", w)

number of non-zeros: tensor(4)
the sum of all elements: tensor(25.)
the weight of each tensor: tensor(6.2500)
the total weight of all the non-zero tensors: tensor(25.)


In [163]:
# the rescaled activations (divide each activation by 1-p)
num_nonzeros,s,v,w = get_values(dropout)

print("number of non-zeros:", num_nonzeros)
print("the sum of all elements:", s)
print("the weight of each tensor:", v)
print("the total weight of all the non-zero tensors:", w)

number of non-zeros: tensor(4)
the sum of all elements: tensor(41.6667)
the weight of each tensor: tensor(10.4167)
the total weight of all the non-zero tensors: tensor(41.6667)
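
A single mask is noisy, so the totals above do not match the original exactly. The sketch below averages over many random masks to show that dividing by `1-p` keeps the total the same in expectation. It uses plain Python lists instead of tensors to keep the arithmetic visible; the `dropout` helper, the seed, and the number of trials are assumptions made for this illustration.

```python
import random

def dropout(acts, p, rng):
    """Zero each activation with probability p; divide the survivors by (1-p)."""
    return [0.0 if rng.random() < p else a / (1 - p) for a in acts]

rng = random.Random(0)                    # fixed seed so the run is repeatable
acts = [float(i) for i in range(1, 11)]   # same sample activations: 1..10
p = 0.4
n_trials = 20_000

# average the total of the dropped-out, rescaled activations over many random masks
avg_sum = sum(sum(dropout(acts, p, rng)) for _ in range(n_trials)) / n_trials

print("original sum:", sum(acts))
print("average sum after dropout + rescaling:", round(avg_sum, 2))
```

The average should land close to the original total of 55, which is why the next layer sees activations on the same scale during training (with dropout) and evaluation (without).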


### As a class

In [164]:
class Dropout(Module):
  def __init__(self, p):
    """
    Choose proportion of activations to dropout (p)
    """
    self.p = p

  def forward(self, x):
    """
    At evaluation time, use all neurons
    During training, only keep some of the neurons (keep 1-p)
    """
    if not self.training: # training attribute is available in any PyTorch nn.Module
      return x

    # creates a tensor of random 0s (probability p) and 1s (probability 1-p)
    # which is then multiplied with the input
    mask = x.new(*x.shape).bernoulli_(1-self.p) # use the instance's p, not a global

    # divide by 1-p
    res = x * mask.div_(1-self.p)
    return res

## Activation Regularization
- Tries to make the final activations produced by the LSTM as small as possible -- aka regularizes the final activations
- Store the activations, then add the means of the squares of them to the loss (along with a multiplier alpha)
- `loss += alpha * activations.pow(2).mean()`
- Often applied on the DROPPED-OUT activations (to not penalize the activations we turned into zeros afterward)

In [165]:
# reminder of sample activations (the y preds)
print(act)

# make true y values half 0s and half ones
targets = tensor([0,0,0,0,0,1,1,1,1,1]).float()
print(targets)

tensor([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.])
tensor([0., 0., 0., 0., 0., 1., 1., 1., 1., 1.])


In [166]:
# get the loss
loss = F.cross_entropy(act, targets)
print(loss)

tensor(12.2931)


In [167]:
# add a penalty to the loss
alpha = 2 # hyperparameter we can set (a multiplier)

In [168]:
# square each of the values in the activations tensor -- get the mean of these values
tmp_act = act.pow(2)
tmp_act_mean = tmp_act.mean()

print(tmp_act)
print(tmp_act_mean)

tensor([  1.,   4.,   9.,  16.,  25.,  36.,  49.,  64.,  81., 100.])
tensor(38.5000)


In [169]:
# multiply this value (the mean of the squared activations) by the alpha parameter
penalty = alpha * tmp_act_mean
print(penalty)

tensor(77.)


In [170]:
# update the loss with the penalty
updated_loss = loss + penalty
print(updated_loss)

tensor(89.2932)


## Temporal Activation Regularization
- Linked to the fact that we are predicting tokens in a sequence -- the outputs of our LSTMs should somewhat make sense when we read them in order
- Adds a penalty to the loss to make the difference between two consecutive activations as small as possible
- `loss += beta * (activations[:,1:] - activations[:,:-1]).pow(2).mean()`
- Applied on the NON-DROPPED-OUT activations (because those zeros create big differences between two consecutive time steps)

In [171]:
# recall original y
print(y.shape) # bs x sl

torch.Size([10, 4])


In [172]:
# recall activations (from the lstm)
activ, state = rnn(emb, state)
print(activ.shape) # bs x sl x n_hidden

torch.Size([10, 4, 8])


In [173]:
# recall: each x consists of a sequence of 4 tokens
# note: the dimension in the middle (of the activations) has to do with the location in the sequence
# for example, this gets all of the activations that correspond to the FIRST word in every x sequence
first_token = activ[:,0,:]
print(first_token.shape)

torch.Size([10, 8])


In [174]:
# this gets all of the activations that correspond to the SECOND word in every x sequence
second_token = activ[:,1,:]
print(second_token.shape)

torch.Size([10, 8])


In [175]:
# get the activations for the second, third, AND fourth tokens (in each x)
tokens_234 = activ[:,1:,:]
print(tokens_234.shape)

torch.Size([10, 3, 8])


In [176]:
# get the activations for the first, second, AND third tokens (in each x)
tokens_123 = activ[:,:-1,:]
print(tokens_123.shape)

torch.Size([10, 3, 8])


In [177]:
# get the loss
# recall outputs have shape bs x sl x vocab size
loss = F.cross_entropy(
  outputs.view(-1, len(vocab)), # reshape preds to have 2 dimensions: (batch size*sequence length) x vocab size
  y.view(-1) # reshape targets to have 1 dimension: (batch size*sequence length)
)
loss

tensor(3.4257, grad_fn=<NllLossBackward0>)

In [178]:
# the penalty to the loss will make the difference between two consecutive activations as small as possible
# add a penalty to the loss
beta = 1 # hyperparameter we can set (a multiplier)

In [179]:
# Temporal activation regularization wants to minimize the DIFFERENCE between two tokens
diff = (tokens_234 - tokens_123)
print(diff.shape)

torch.Size([10, 3, 8])


In [180]:
# square each of the differences -- get the mean of these values
tmp_act = diff.pow(2).mean()
print(tmp_act)

tensor(0.0002, grad_fn=<MeanBackward0>)


In [181]:
# multiply this value (the mean of the squared differences) by the beta parameter
penalty = beta * tmp_act
print(penalty)

tensor(0.0002, grad_fn=<MulBackward0>)


In [182]:
# update the loss with the penalty
updated_loss = loss + penalty
print(updated_loss)

tensor(3.4259, grad_fn=<AddBackward0>)
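
The two penalties can be combined into one small function. This is a sketch of the idea only, not fastai's implementation: `ar_tar_penalty` is a hypothetical helper, and the toy tensors are made up so the arithmetic can be checked by hand. It assumes `raw` holds the LSTM activations before dropout and `out` the same activations after dropout.

```python
import torch

def ar_tar_penalty(raw, out, alpha=2.0, beta=1.0):
    """
    raw: LSTM activations before dropout, shape (bs, sl, n_hidden)
    out: the same activations after dropout, same shape
    """
    # AR: mean of squared activations, on the DROPPED-OUT activations
    ar = alpha * out.pow(2).mean()
    # TAR: mean squared difference between consecutive time steps,
    # on the NON-DROPPED-OUT activations
    tar = beta * (raw[:, 1:] - raw[:, :-1]).pow(2).mean()
    return ar + tar

# toy activations: bs=2, sl=3, n_hidden=1, growing by 1 at each time step
raw = torch.tensor([[[0.], [1.], [2.]],
                    [[0.], [1.], [2.]]])
out = raw.clone()   # pretend dropout kept everything, to keep the numbers simple

penalty = ar_tar_penalty(raw, out)
print(penalty)
```

By hand: the squared activations average to (0 + 1 + 4) / 3, so AR contributes 2 * 5/3, and every consecutive difference is 1, so TAR contributes 1 * 1. The penalty is added to the usual cross-entropy loss.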


## Weight-Tied Regularized LSTM
- In a language model, the input embeddings represent a mapping from English words to activations, and the output hidden layer represents a mapping from activations to English words
- We might expect, intuitively, that these mappings could be the same
- We can represent this in PyTorch by assigning the same weight matrix to each of these layers: `self.h_o.weight = self.i_h.weight`
- Weight tying: share one parameter matrix between the embedding layer (`i_h`) and the output layer that predicts tokens (`h_o`); this works because both weights have shape `vocab_sz x n_hidden`
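
A quick way to see why the assignment is legal and what it does: the two layers store weights of the same shape, and after tying they point at the same `Parameter`. The snippet below is a small standalone check, with sizes chosen to match the toy setup (30 tokens, 8 hidden units); the value `123.0` is arbitrary.

```python
import torch
from torch import nn

vocab_sz, n_hidden = 30, 8

i_h = nn.Embedding(vocab_sz, n_hidden)   # weight shape: (vocab_sz, n_hidden)
h_o = nn.Linear(n_hidden, vocab_sz)      # weight shape: (out_features, in_features) = (vocab_sz, n_hidden)

print(i_h.weight.shape, h_o.weight.shape)

h_o.weight = i_h.weight   # tie: both layers now hold the same Parameter

# an in-place change made through one layer is visible through the other
with torch.no_grad():
    i_h.weight[0, 0] = 123.0
print(h_o.weight[0, 0].item())
```

Because the matrix is shared, it receives gradients from both the embedding lookup and the output projection, and the model has fewer parameters to fit.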

In [183]:
class LMModel7(Module):
  """
  Includes dropout and weight tying
  """
  def __init__(self, vocab_sz, n_hidden, n_layers, p):
    self.i_h = nn.Embedding(vocab_sz, n_hidden)
    self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
    self.drop = nn.Dropout(p) # proportion of activations to drop out (prevent overfitting)
    self.h_o = nn.Linear(n_hidden, vocab_sz)
    self.h_o.weight = self.i_h.weight # weight tying
    self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]

  def forward(self, x):
    raw,h = self.rnn(self.i_h(x), self.h)
    out = self.drop(raw)
    self.h = [h_.detach() for h_ in h]
    return self.h_o(out),raw,out

  def reset(self):
    for h in self.h: h.zero_()

In [184]:
# RNNRegularizer applies activation regularization (AR) and temporal activation regularization (TAR)
  # alpha = for AR
  # beta = for TAR
n_hidden = 64 # size of the hidden state (LMModel7's second argument), not the batch size
n_rnn_layers = 2
dropout_prop = 0.4

learn = Learner(
  dls,
  LMModel7(len(vocab), n_hidden, n_rnn_layers, dropout_prop),
  loss_func=CrossEntropyLossFlat(),
  metrics=accuracy,
  cbs=[ModelResetter, RNNRegularizer(alpha=2, beta=1)]
)

In [185]:
# this learner does the same thing as the learner above
# but as a text learner (does the two callbacks automatically)
learn = TextLearner(
  dls,
  LMModel7(len(vocab), n_hidden, n_rnn_layers, dropout_prop),
  loss_func=CrossEntropyLossFlat(),
  metrics=accuracy
)

In [186]:
# add even more regularization when training (weight decay)
learn.fit_one_cycle(15, 1e-2, wd=0.1)

epoch,train_loss,valid_loss,accuracy,time
0,2.639513,2.030643,0.45638,00:01
1,1.584239,1.332765,0.689697,00:01
2,0.84208,0.922213,0.791911,00:01
3,0.421628,0.691511,0.815674,00:01
4,0.220742,0.668626,0.837891,00:01
5,0.124636,0.428851,0.87321,00:01
6,0.075268,0.375319,0.878418,00:01
7,0.050386,0.402634,0.883301,00:01
8,0.036881,0.412301,0.88501,00:01
9,0.028361,0.371559,0.896566,00:01


### AWD-LSTM architecture for text classification (from *08_nlp_basics.ipynb*)
- Includes more dropout:
  - Embedding dropout (inside the embedding layer, drops some random rows of the embedding matrix)
  - Input dropout (applied after the embedding layer)
  - Weight dropout (applied to the weights of the LSTM at each training step)
  - Hidden dropout (applied to the hidden state between two layers)
- Since fine-tuning those five dropout values (including the dropout before the output layer) is complicated, fastai has determined good defaults and allows the overall magnitude of dropout to be tuned with the `drop_mult` parameter we saw in that chapter (which multiplies each dropout value).
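
The idea of a single multiplier can be sketched in a few lines. The dictionary keys and base values below are illustrative assumptions for this sketch, not necessarily fastai's defaults, and `scale_dropouts` is a hypothetical helper.

```python
# illustrative base dropout values -- assumptions for this sketch
base_dropouts = {
    "embed_p": 0.02,    # embedding dropout
    "input_p": 0.25,    # input dropout
    "weight_p": 0.2,    # weight dropout
    "hidden_p": 0.15,   # hidden dropout
    "output_p": 0.1,    # dropout before the output layer
}

def scale_dropouts(dropouts, drop_mult):
    """Multiply every dropout probability by the same factor."""
    return {name: p * drop_mult for name, p in dropouts.items()}

scaled = scale_dropouts(base_dropouts, drop_mult=0.5)
for name, p in scaled.items():
    print(f"{name}: {p:.3f}")
```

This leaves one knob to tune instead of five: a value below 1 weakens all the dropout layers together, and a value above 1 strengthens them while keeping their relative sizes.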

### Transformers architecture
- Also works well for "sequence-to-sequence" problems (where the dependent variable is itself a variable-length sequence, such as language translation)