# PyTorch Tutorial - Neural Networks
Prof. Dorien Herremans, with many thanks to Nelson Lui for the base text. 

**To edit the notebook**:

There are two ways to edit the notebook.

You can either open it in the "playground", where you can change and run cells. After closing the tab, your changes will be lost. To do so, press "File" > "Open in playground".

Alternatively, you can make a copy of this notebook to your own Google Drive account through "File" > "Save a copy in Drive..."

**Activating the GPU on Colab**:

Colab now gives you 12 hours of free GPU time (before you have to request a new node).
Simply select "GPU" in the Accelerator drop-down in Notebook Settings (either through the Edit menu or the command palette at cmd/ctrl-shift-P).

# Setting up the notebook on Colab

Let's check whether we are using the GPU environment and whether CUDA is available:

In [None]:
# Import PyTorch and other libraries
import torch
import numpy as np
from tqdm import tqdm

print("PyTorch version:")
print(torch.__version__)
print("GPU Detected:")
print(torch.cuda.is_available())

# Define a flag for later that records whether a GPU is available:
using_GPU = torch.cuda.is_available()

PyTorch version:
1.8.1+cu101
GPU Detected:
True


# Computation Graphs

A computation graph is simply a way to define a sequence of operations to go from input to model output. 

You can think of the nodes in the graph as representing operations, and the edges as representing the tensors going in and out.

For example, say we wanted to build a linear regression model. This has the form $\hat y = Wx + b$. 

In this equation, $x$ is our input, $W$ is a learned weight matrix, $b$ is a learned bias, and $\hat y$ is the predicted output. 

As a computation graph, this looks like:

![Linear Regression Computation Graph](https://imgur.com/IcBhTjS.png)

When implementing deep learning models, you're basically designing and specifying computation graphs. It's a bit like playing with Legos in that you're stringing together a bunch of blocks (the operations) to achieve a final desired output.
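
To make the graph idea concrete, here is a minimal sketch of the linear regression above written with plain tensors. The sizes (3 input features, 2 outputs) are assumptions chosen only for illustration. Every operation on a tensor that requires gradients adds a node to the graph, and `grad_fn` names the operation that produced a tensor.

```python
import torch

# Hypothetical sizes for illustration: 3 input features, 2 outputs.
x = torch.randn(3)                         # input
W = torch.randn(2, 3, requires_grad=True)  # learned weight matrix
b = torch.randn(2, requires_grad=True)     # learned bias

# Two operations are recorded in the graph: a matrix-vector product and an add.
y_hat = W @ x + b

print(y_hat.shape)    # torch.Size([2])
print(y_hat.grad_fn)  # the operation that produced y_hat (the add)

# Backpropagating a scalar through the graph fills in W.grad and b.grad.
y_hat.sum().backward()
print(W.grad.shape, b.grad.shape)
```

Since the scalar here is just the sum of the outputs, each row of `W.grad` equals `x` and `b.grad` is all ones, which is an easy way to sanity-check the graph.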

# The building blocks of deep learning models

`torch.nn` makes it easy to build neural nets by providing functions for specifying arbitrary computation graphs and abstractions for putting them all together. We'll start by covering a few classes in the `torch.nn` module that form basic building blocks of many deep learning applications.

The classes below are all callable, so you can use them with `outputs = YourDeepLearningBlock(its_inputs)`

In [None]:
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

## Linear Layers (Affine Transforms)

A linear layer (also known as an affine transform) defines a function:

$$f(x) = Wx + b$$

This linear transform is a core part of deep learning. $W$ and $b$ are the parameters of this layer, where $W$ is a learned weight matrix and $b$ is a learned bias vector.

`nn.Linear()` takes two construction parameters: the dimensionality of the input and the dimensionality of the desired output.

In [None]:
# Create a Linear layer. Input should have 5 dimensions, output will have 3.
lin = nn.Linear(5, 3)
# Data is a matrix of shape (2, 5). Can we use the linear layer on it?
data = torch.randn(2, 5)

# Yes! Running the data matrix through the layer outputs shape (2, 3).
print(lin(data))

tensor([[-0.3185, -0.3782, -0.2782],
        [-1.0956, -0.7800,  0.0323]], grad_fn=<AddmmBackward>)


In [None]:
# What about a matrix of shape (2, 4, 5)?
data = torch.randn(2, 4, 5)
# This works as well! As long as the last dimension is the specified
# input dimension to the Linear layer, you're good.
# Output shape: (2, 4, 3)
print(lin(data))

tensor([[[-1.1062, -0.0990,  0.6715],
         [ 0.2896, -0.0345, -0.7421],
         [-0.7320, -0.3075, -0.4137],
         [-0.9439,  0.3548, -0.0080]],

        [[-0.2755, -0.6129,  0.4344],
         [ 0.2238, -0.4358, -0.2388],
         [ 1.3256, -1.2502, -0.1006],
         [-0.0343, -0.5950, -0.7378]]], grad_fn=<AddBackward0>)


In [None]:
# But (5, 2) is an incompatible shape (uncomment and run to see error)
data = torch.randn(5, 2)
# print(lin(data))

In [None]:
# But we can transpose it using t()!
# Now its shape is (2, 5) and all is fine.
print(lin(data.t()))

tensor([[-0.0688,  0.2682, -0.5533],
        [ 0.0644,  0.0635, -1.1504]], grad_fn=<AddmmBackward>)
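
As a sanity check on the formula $f(x) = Wx + b$, the sketch below reproduces a linear layer by hand. It relies on the fact that `nn.Linear` stores its weight with shape `(out_features, in_features)`, so a batch of row vectors is multiplied by the transpose. The variable names are only for illustration.

```python
import torch
import torch.nn as nn

lin = nn.Linear(5, 3)
print(lin.weight.shape)  # torch.Size([3, 5])
print(lin.bias.shape)    # torch.Size([3])

x = torch.randn(2, 5)

# Each row of x is an input vector, so we compute x W^T + b.
manual = x @ lin.weight.t() + lin.bias

# The layer and the hand-written version agree up to floating-point error.
print(torch.allclose(lin(x), manual, atol=1e-6))
```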


## Nonlinearities / Activation Functions

Since composing linear transformations gives you a linear transformation, we don't gain any representational power by just chaining `Linear` layers.

In deep learning, we add nonlinearities after our Linear transforms, which lets us build more powerful models.
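
To see why chaining `Linear` layers alone adds nothing, here is a small sketch (layer sizes are arbitrary assumptions) that collapses two stacked layers into one affine transform with $W = W_2 W_1$ and $b = W_2 b_1 + b_2$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin1 = nn.Linear(4, 6)
lin2 = nn.Linear(6, 2)
x = torch.randn(3, 4)

# Collapse the two layers into a single weight matrix and bias.
W = lin2.weight @ lin1.weight            # shape (2, 4)
b = lin2.weight @ lin1.bias + lin2.bias  # shape (2,)

stacked = lin2(lin1(x))
collapsed = x @ W.t() + b

# The two-layer stack computes exactly one affine transform.
print(torch.allclose(stacked, collapsed, atol=1e-5))

# With a nonlinearity in between, no single (W, b) reproduces the result in general.
with_relu = lin2(torch.relu(lin1(x)))
print(with_relu.shape)
```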

PyTorch comes with a veritable zoo of nonlinearities.

In [None]:
data = torch.randn(2, 3)
print(data)

# Nonlinearities are layers too!
relu = nn.ReLU()
print(relu)
print(relu(data))

tanh = nn.Tanh()
print(tanh)
print(tanh(data))

sigmoid = nn.Sigmoid()
print(sigmoid)
print(sigmoid(data))

tensor([[-1.5414,  0.6249, -0.3737],
        [ 0.1418,  0.9413,  1.4098]])
ReLU()
tensor([[0.0000, 0.6249, 0.0000],
        [0.1418, 0.9413, 1.4098]])
Tanh()
tensor([[-0.9124,  0.5545, -0.3572],
        [ 0.1408,  0.7358,  0.8874]])
Sigmoid()
tensor([[0.1763, 0.6513, 0.4077],
        [0.5354, 0.7194, 0.8037]])


If you'd prefer to not create a class for the nonlinearity, you can also call it functionally as below:

In [None]:
data = torch.randn(2, 3)
print(data)

# Nonlinearities can also be used functionally, with no need to create a class!
print("ReLu:")
print(torch.relu(data))

print("tanh:")
print(torch.tanh(data))

print("Sigmoid:")
print(torch.sigmoid(data))

tensor([[ 0.0731, -1.1085, -0.0794],
        [-0.5666,  2.1620, -0.4659]])
ReLu:
tensor([[0.0731, 0.0000, 0.0000],
        [0.0000, 2.1620, 0.0000]])
tanh:
tensor([[ 0.0729, -0.8035, -0.0792],
        [-0.5128,  0.9739, -0.4349]])
Sigmoid:
tensor([[0.5183, 0.2481, 0.4802],
        [0.3620, 0.8968, 0.3856]])


## Dropout

Dropout is used to regularize our models by randomly setting some outputs to 0. 

This helps to prevent overfitting by encouraging the model to look beyond specific spurious patterns and find features that generalize.

**Note that we should only apply dropout during training!**

In [None]:
data = torch.randn(2, 3)
print(data)

# Create a Dropout layer and call it on input
# Here, the probability of zeroing an element is 0.5
dropout = nn.Dropout(0.5)
print(dropout)
print(dropout(data))

# Use dropout functionally. With training=False, the input is returned unchanged
# (note that F.dropout defaults to training=True).
print("Functional dropout, training=False")
print(F.dropout(data, 0.5, training=False))

# Set training=True, so things are dropped out
print("Functional dropout, training=True")
print(F.dropout(data, 0.5, training=True))

tensor([[ 2.2049, -0.6618, -0.3987],
        [-0.0396, -0.8184, -1.5733]])
Dropout(p=0.5, inplace=False)
tensor([[4.4098, -0.0000, -0.0000],
        [-0.0000, -0.0000, -0.0000]])
Functional dropout, training=False
tensor([[ 2.2049, -0.6618, -0.3987],
        [-0.0396, -0.8184, -1.5733]])
Functional dropout, training=True
tensor([[ 0.0000, -0.0000, -0.7973],
        [-0.0000, -0.0000, -3.1465]])
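
Two details in the output above are worth spelling out. Surviving elements are scaled by $1 / (1 - p)$ so that the expected value of each output is unchanged, which is why `2.2049` became `4.4098` with $p = 0.5$. Also, an `nn.Dropout` layer follows the module's train/eval mode, so you rarely need to pass `training=` by hand. A minimal sketch, using an all-ones input so the scaling is easy to see:

```python
import torch
import torch.nn as nn

dropout = nn.Dropout(p=0.5)
data = torch.ones(2, 4)

# In train mode, each element is either zeroed or scaled to 1 / (1 - 0.5) = 2.
dropout.train()
train_out = dropout(data)
print(train_out)

# In eval mode, dropout is the identity.
dropout.eval()
eval_out = dropout(data)
print(eval_out)
```

Calling `.eval()` on a whole model switches every dropout layer inside it at once.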


## RNNs (LSTMs / GRUs in particular) 

RNNs encode a sequence of vectors as another sequence of vectors; the final output is commonly taken as a representation of the whole input sequence. They're powerful models for sequential data, and thus they're very popular in NLP. Their API takes a tensor of shape `(sequence length, batch size, dim)` if `batch_first=False` (default) and `(batch size, sequence length, dim)` if `batch_first=True`.

In [None]:
# Some batch-first data. Batch size is 2, sequence length is 3, 
# and length of each feature vector is 4.
data = torch.randn(2, 3, 4)

input_size = 4
hidden_dim = 5
# Create an LSTM RNN with hidden dim of 5
# batch_first=True since semantics of shape is
# (batch_size, sequence_length, num_features)
lstm = nn.LSTM(input_size, hidden_dim, batch_first=True)


lstm_output_tuple = lstm(data)
print(type(lstm_output_tuple))

lstm_output, (lstm_hidden_state, lstm_cell_state) = lstm_output_tuple
print("LSTM Outputs: ")
print(lstm_output)

print("Final LSTM Output: ")
print(lstm_output[:, -1])

print("LSTM Hidden State: ")
print(lstm_hidden_state)

print("LSTM Cell State: ")
print(lstm_cell_state)

<class 'tuple'>
LSTM Outputs: 
tensor([[[-0.2372,  0.0695,  0.0314, -0.2503, -0.0216],
         [-0.2535,  0.1016,  0.1237, -0.3354, -0.1271],
         [-0.3613,  0.0689,  0.0064, -0.0630, -0.0270]],

        [[-0.2001,  0.0386, -0.0416, -0.1907,  0.0331],
         [-0.2543, -0.0578,  0.0335, -0.1742, -0.0367],
         [-0.0218, -0.0860, -0.3245,  0.0958,  0.0857]]],
       grad_fn=<TransposeBackward0>)
Final LSTM Output: 
tensor([[-0.3613,  0.0689,  0.0064, -0.0630, -0.0270],
        [-0.0218, -0.0860, -0.3245,  0.0958,  0.0857]],
       grad_fn=<SelectBackward>)
LSTM Hidden State: 
tensor([[[-0.3613,  0.0689,  0.0064, -0.0630, -0.0270],
         [-0.0218, -0.0860, -0.3245,  0.0958,  0.0857]]],
       grad_fn=<StackBackward>)
LSTM Cell State: 
tensor([[[-0.7591,  0.1134,  0.0123, -0.1152, -0.1213],
         [-0.0285, -0.1945, -0.5226,  0.2445,  0.6596]]],
       grad_fn=<StackBackward>)
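
Notice in the output above that the final LSTM output and the hidden state hold the same numbers. For a single-layer, unidirectional LSTM this always holds; the hidden state just carries an extra leading dimension for `num_layers * num_directions`. The sketch below checks this and shows that `nn.GRU` has the same interface, except that it returns no cell state.

```python
import torch
import torch.nn as nn

data = torch.randn(2, 3, 4)  # (batch_size, sequence_length, num_features)

lstm = nn.LSTM(input_size=4, hidden_size=5, batch_first=True)
output, (h_n, c_n) = lstm(data)
print(output.shape)  # torch.Size([2, 3, 5])
print(h_n.shape)     # torch.Size([1, 2, 5])

# The last time step of the output is the final hidden state.
print(torch.allclose(output[:, -1], h_n[0]))

# A GRU returns only (output, h_n); there is no cell state.
gru = nn.GRU(input_size=4, hidden_size=5, batch_first=True)
gru_output, gru_h_n = gru(data)
print(gru_output.shape, gru_h_n.shape)
```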


# Structuring PyTorch models

At the highest level, `nn.Module` defines what most would refer to as a "model". It's a convenient way to encapsulate the trainable parameters of a model or a component of your model, and subclassing it gives you methods for moving your model to the GPU, saving it, loading it, etc.

When you're building your own model, you're going to subclass `nn.Module`. Critically, you also need to override the `__init__()` and `forward()` functions.

*   In `__init__()`, you should take arguments that modify how the model runs (e.g. # of layers, # of hidden units, output sizes). You'll also set up most of the layers that you use in the forward pass here.
*   In `forward()`, you define the "forward pass" of your model, or the operations needed to transform input to output. **You can use any of the Tensor operations in the forward pass.**
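
Before the full example, here is about the smallest module that follows this pattern. It is only a sketch: the class name `TinyRegressor` and its sizes are made up for illustration.

```python
import torch
import torch.nn as nn

class TinyRegressor(nn.Module):  # hypothetical example class
    def __init__(self, input_size, hidden_dim):
        super().__init__()
        # Layers assigned as attributes are registered automatically,
        # so their weights show up in .parameters().
        self.hidden = nn.Linear(input_size, hidden_dim)
        self.output = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # Any tensor operation is allowed here.
        return self.output(torch.relu(self.hidden(x)))

model = TinyRegressor(input_size=3, hidden_dim=8)

# Call the module itself rather than .forward() directly.
predictions = model(torch.randn(4, 3))
print(predictions.shape)  # torch.Size([4, 1])

# (3*8 + 8) + (8*1 + 1) = 41 trainable values
num_params = sum(p.numel() for p in model.parameters())
print(num_params)  # 41
```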



### Feed-forward neural net

To improve upon the logistic regression model we saw last week, we can add some intermediate layers (called hidden layers), nonlinearities, and dropout for regularization. This is essentially a multi-layer feed-forward neural net, and its implementation as a module is outlined below:

In [None]:
class FeedForwardNN(nn.Module):
  # input_size: Dimensionality of input feature vector.
  # num_classes: The number of classes in the classification problem.
  # num_hidden: The number of hidden (intermediate) layers to use.
  # hidden_dim: The size of each of the hidden layers.
  # dropout: The proportion of units to drop out after each layer.
  def __init__(self, input_size, num_classes, num_hidden, hidden_dim, dropout):
    # Always call the superclass (nn.Module) constructor first!
    super(FeedForwardNN, self).__init__()
    
    # Set up the hidden layers.
    assert num_hidden > 0
    # A special ModuleList to store our hidden layers.
    self.hidden_layers = nn.ModuleList([])
    # First hidden layer maps from input_size -> hidden_dim.
    self.hidden_layers.append(nn.Linear(input_size, hidden_dim))
    # Subsequent hidden layers map from hidden_dim -> hidden_dim.
    # Note that they can map to any dimensionality --- as long as the final
    # output is a distribution over your classes!
    for i in range(num_hidden - 1):
      self.hidden_layers.append(nn.Linear(hidden_dim, hidden_dim))
    
    # Set up the dropout layer.
    self.dropout = nn.Dropout(dropout)
    
    # Set up the final transform to a distribution over classes.
    self.output_projection = nn.Linear(hidden_dim, num_classes)
    
    # Set up the nonlinearity to use between layers.
    self.nonlinearity = nn.ReLU()
    
  # Forward's sole argument is the input.
  # input is of shape (batch_size, input_size)
  def forward(self, x):
    # Apply each hidden layer, followed by dropout and the nonlinearity.
    for hidden_layer in self.hidden_layers:
      x = hidden_layer(x)
      x = self.dropout(x)
      x = self.nonlinearity(x)
      
    # Output layer: project x to a distribution over classes.
    out = self.output_projection(x)
    
    # Log-softmax the out tensor to get a log-probability distribution
    # over classes for each example.
    out_distribution = F.log_softmax(out, dim=-1)
    return out_distribution

# Training PyTorch models: Losses and Optimizers

By now, we've learned how to construct models in PyTorch. In this section, we'll go over how to calculate your model's loss and how to optimize the parameters to minimize the loss.

## Loss Functions

Intuitively, loss functions serve to tell your model how poorly it's doing --- the purpose of training is to adjust the weights of our model to minimize the loss.

A loss function takes a true output $y$ and a model-predicted output $\hat y$ and calculates the loss. If $y = \hat y$, our model produced the correct output and thus our loss is 0. The further our predicted $\hat y$ from the true $y$, the higher our loss is.

PyTorch comes with a large collection of loss functions. The most commonly used loss for classification is negative log likelihood (`nn.NLLLoss` or the closely related `nn.CrossEntropyLoss`). The difference between the two for classification problems is that `nn.NLLLoss` expects the model output to already be log-softmax normalized, which is easy to do with the `nn.LogSoftmax` layer. `nn.CrossEntropyLoss`, on the other hand, applies the log-softmax itself; you can think of it as `nn.LogSoftmax` + `nn.NLLLoss`. Which to use depends on whether you want to add the extra `nn.LogSoftmax` to your model's `forward()`.

A common loss used for regression problems is the mean squared error (`nn.MSELoss`).
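
As a quick illustration with made-up numbers, `nn.MSELoss` averages the squared differences between predictions and targets:

```python
import torch
import torch.nn as nn

# Made-up predictions and targets for three examples.
predictions = torch.tensor([2.5, 0.0, 2.0])
targets = torch.tensor([3.0, -0.5, 2.0])

mse = nn.MSELoss()
loss = mse(predictions, targets)

# The same value computed by hand: (0.25 + 0.25 + 0.0) / 3
manual = ((predictions - targets) ** 2).mean()

print(loss)
print(torch.allclose(loss, manual))
```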

Here's a usage example of the `CrossEntropyLoss`.

In [None]:
# 3 examples, unnormalized scores over 4 classes.
model_output = torch.rand(3, 4, requires_grad = True)

# The correct labels.
targets = torch.LongTensor([1, 0, 3])

# CrossEntropyLoss
cross_entropy = nn.CrossEntropyLoss()
# Loss, averaged across all 3 batch elements.
# Can call this functionally: avg_loss = F.cross_entropy(model_output, targets)
avg_loss = cross_entropy(model_output, targets)
print("CrossEntropyLoss averaged across all 3 batch elements:")
print(avg_loss)

# Backpropagate wrt avg_loss
avg_loss.backward()
# Print out the gradients of model_output
print("Gradients of model_output")
print(model_output.grad)

CrossEntropyLoss averaged across all 3 batch elements:
tensor(1.4086, grad_fn=<NllLossBackward>)
Gradients of model_output
tensor([[ 0.0766, -0.2424,  0.0632,  0.1026],
        [-0.2602,  0.1009,  0.1067,  0.0526],
        [ 0.1028,  0.0446,  0.1046, -0.2520]])


And here's a snippet showing that `LogSoftmax` + `NLLLoss` is the same as `CrossEntropyLoss`.

In [None]:
nll = nn.NLLLoss()
log_softmax_model_output = F.log_softmax(model_output, dim=-1)
# Loss, averaged across all 3 batch elements.
# Can call this functionally: avg_loss = F.nll_loss(log_softmax_model_output, targets)
avg_loss = nll(log_softmax_model_output, targets)
print("Negative-Log Likelihood averaged across all 3 batch elements:")
print(avg_loss)

Negative-Log Likelihood averaged across all 3 batch elements:
tensor(1.4086, grad_fn=<NllLossBackward>)


## Optimizers

Now that we can calculate the loss and backpropagate through our model (with `.backward()`), we can update the weights and try to reduce the loss!

PyTorch includes a variety of optimizers that do exactly this, from the standard SGD to more recent techniques like Adam and RMSProp.

At construction, PyTorch optimizers take the parameters to optimize. When we run an input through our model, calculate the loss, and backpropagate, the gradients are automatically stored in the `.grad` attribute of each parameter (parameters are tensors with `requires_grad=True`). With these gradients, the optimizer can update the weights.

Optimizers live in the `torch.optim` module.

In [None]:
import torch.optim as optim

To get the parameters of our model, we can just call `.parameters()` on a `Module`. Below, we create an instance of our previously-defined feed forward neural network and get its parameters.

In [None]:
input_size = 784
num_classes = 10
num_hidden = 2
hidden_dim = 50
dropout = 0.2
ffnn_clf = FeedForwardNN(input_size, num_classes, num_hidden, 
                         hidden_dim, dropout)
print(ffnn_clf)

parameters = ffnn_clf.parameters()

print("Shapes of model parameters:")
print([x.size() for x in list(parameters)])

FeedForwardNN(
  (hidden_layers): ModuleList(
    (0): Linear(in_features=784, out_features=50, bias=True)
    (1): Linear(in_features=50, out_features=50, bias=True)
  )
  (dropout): Dropout(p=0.2, inplace=False)
  (output_projection): Linear(in_features=50, out_features=10, bias=True)
  (nonlinearity): ReLU()
)
Shapes of model parameters:
[torch.Size([50, 784]), torch.Size([50]), torch.Size([50, 50]), torch.Size([50]), torch.Size([10, 50]), torch.Size([10])]


Now, to create an optimizer for this model, we construct an optimizer instance and pass it the parameters of the model. Here we use stochastic gradient descent (SGD).

In [None]:
ffnn_optim = optim.SGD(ffnn_clf.parameters(), lr=0.5)

Let's try using our optimizer to take a gradient update on our model! We'll generate a few random examples, and run them through our model (the forward pass).

In [None]:
# Make some fake data for our model.
# 5 examples in the batch, each example has 784 features.
sample_input = torch.randn(5, 784)
# Multiclass classification, 10 possible classes.
sample_labels = torch.LongTensor([0, 3, 9, 6, 2])

# Run the sample_input through ffnn_clf to get a distribution
# over our classes
sample_predictions = ffnn_clf(sample_input)
print("Predicted distribution over classes: ")
print(sample_predictions)
print("Target Labels:")
print(sample_labels)

Predicted distribution over classes: 
tensor([[-2.4123, -2.2653, -2.3019, -2.2186, -2.3577, -2.2580, -2.5144, -2.1066,
         -2.3739, -2.2742],
        [-2.3874, -2.2367, -2.2141, -2.2145, -2.4727, -2.3362, -2.4876, -2.1779,
         -2.2908, -2.2611],
        [-2.2994, -2.0950, -2.4574, -2.3188, -2.3064, -2.2903, -2.4334, -2.2391,
         -2.3431, -2.2889],
        [-2.4188, -2.3447, -2.4864, -2.1886, -2.4201, -2.2330, -2.5731, -1.8513,
         -2.3474, -2.3567],
        [-2.5830, -2.2838, -2.4189, -2.1783, -2.3462, -2.2080, -2.4593, -2.0148,
         -2.4167, -2.2387]], grad_fn=<LogSoftmaxBackward>)
Target Labels:
tensor([0, 3, 9, 6, 2])


Now let's calculate the loss of our model on these examples.

In [None]:
nll_loss = F.nll_loss(sample_predictions, sample_labels)
print("Average NLL Loss:")
print(nll_loss)

Average NLL Loss:
tensor(2.3815, grad_fn=<NllLossBackward>)


Let's print the gradients of one of the parameter matrices in our model, to ensure it's `None`. We haven't done backprop yet, so there shouldn't be any gradients.

In [None]:
print(list(ffnn_clf.parameters())[0].grad)

None


Now we can backpropagate with respect to the loss to calculate the gradients for the parameters of our model with `.backward()`. It's also good practice to call `optimizer.zero_grad()` before `loss.backward()`, which ensures that gradients left over from a previous backward pass are cleared before backprop.

In [None]:
ffnn_optim.zero_grad()
nll_loss.backward()
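
The reason `zero_grad()` matters is that `.backward()` adds to any gradient already stored in `.grad` rather than overwriting it. A minimal sketch with a toy two-element parameter (not part of the model above) shows the accumulation:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

# The loss is sum(w^2), so its gradient is 2 * w = [2, 4].
(w ** 2).sum().backward()
first = w.grad.clone()

# A second backward pass without zero_grad() adds to the stored gradient.
(w ** 2).sum().backward()
accumulated = w.grad.clone()

# Clearing first gives a fresh gradient again.
optimizer.zero_grad()
(w ** 2).sum().backward()
fresh = w.grad.clone()

print(first)        # tensor([2., 4.])
print(accumulated)  # tensor([4., 8.])
print(fresh)        # tensor([2., 4.])
```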

Let's check our gradients now...

In [None]:
print(list(ffnn_clf.parameters())[0].grad)

tensor([[-0.0019,  0.0030,  0.0064,  ..., -0.0051,  0.0124,  0.0005],
        [-0.0059,  0.0096,  0.0077,  ..., -0.0115,  0.0263,  0.0025],
        [-0.0065,  0.0001, -0.0040,  ..., -0.0037,  0.0136,  0.0085],
        ...,
        [ 0.0012, -0.0033,  0.0038,  ...,  0.0011, -0.0004, -0.0005],
        [ 0.0056, -0.0097, -0.0039,  ...,  0.0058, -0.0165, -0.0038],
        [-0.0076, -0.0146, -0.0155,  ...,  0.0146, -0.0168, -0.0010]])


Now that we have gradients for each of our parameters, we can update them by using `optimizer.step()`.

In [None]:
# save the old value of the parameter for comparison later
old_parameter = list(ffnn_clf.parameters())[0].detach().clone()

# Make a gradient update with our optimizer
ffnn_optim.step()

new_parameter = list(ffnn_clf.parameters())[0].detach()

print("Difference between weight matrix before and after update:")
print(old_parameter - new_parameter)


Difference between weight matrix before and after update:
tensor([[-9.5985e-04,  1.5247e-03,  3.1813e-03,  ..., -2.5291e-03,
          6.2233e-03,  2.4290e-04],
        [-2.9426e-03,  4.8146e-03,  3.8305e-03,  ..., -5.7310e-03,
          1.3148e-02,  1.2534e-03],
        [-3.2531e-03,  7.2472e-05, -1.9972e-03,  ..., -1.8707e-03,
          6.8135e-03,  4.2494e-03],
        ...,
        [ 6.1779e-04, -1.6681e-03,  1.8901e-03,  ...,  5.4004e-04,
         -2.2479e-04, -2.3867e-04],
        [ 2.7868e-03, -4.8669e-03, -1.9510e-03,  ...,  2.9181e-03,
         -8.2476e-03, -1.9001e-03],
        [-3.8144e-03, -7.3181e-03, -7.7710e-03,  ...,  7.3142e-03,
         -8.4062e-03, -5.1304e-04]])


If you're familiar with the SGD update rule, you know that:

$$ \theta^{t+1} = \theta^{t} - \left( \eta \cdot \nabla L \left(\theta^{t} \right) \right)$$

Where $\theta^{t}$ is the weight at time $t$, $\eta$ is the learning rate, and $\nabla L(\theta^{t})$ is the gradient. Since $\eta = 0.5$, it makes perfect sense that the difference between the weight matrices printed above is exactly half of the gradient.
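
Rather than comparing printed numbers by eye, the update rule can be checked numerically. The sketch below uses a small stand-in layer and a made-up loss instead of the model above, and it applies to plain SGD only (no momentum or weight decay).

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
layer = nn.Linear(4, 2)  # small stand-in model
lr = 0.5
optimizer = optim.SGD(layer.parameters(), lr=lr)

# A made-up scalar loss, just to produce gradients.
loss = layer(torch.randn(3, 4)).pow(2).mean()
optimizer.zero_grad()
loss.backward()

old_weight = layer.weight.detach().clone()
grad = layer.weight.grad.clone()
optimizer.step()
new_weight = layer.weight.detach()

# theta_old - theta_new should equal lr * gradient (up to float rounding).
print(torch.allclose(old_weight - new_weight, lr * grad, atol=1e-6))
```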

# Example: Classification on FashionMNIST

Let's use the `FeedForwardNN` model we built earlier to do a simple classification task! This example is meant to be an annotated walkthrough of how to build, train, and evaluate a model in PyTorch. We'll use the [FashionMNIST dataset](https://github.com/zalandoresearch/fashion-mnist), where we are tasked with classifying grayscale images of clothes into 10 different classes.

## Loading Data

We'll start by loading the data with `torchvision` --- knowing how to use torchvision isn't the point of this tutorial, so it's relatively unannotated.

In [None]:
import torchvision
from torchvision.datasets import FashionMNIST

train_dataset = FashionMNIST(root='./torchvision-data', 
                             train=True, 
                             transform=torchvision.transforms.ToTensor(),
                             download=True)

test_dataset = FashionMNIST(root='./torchvision-data', train=False, 
                            transform=torchvision.transforms.ToTensor())

`train_dataset` and `test_dataset` are both instances of `FashionMNIST`, a subclass of PyTorch's `torch.utils.data.Dataset`. The main benefit of subclassing this abstract class is that we can use `torch.utils.data.DataLoader`s to handle batching our examples and iterating over them. We'll create `DataLoader`s for our datasets now.

In [None]:
from torch.utils.data import DataLoader

# Data-related hyperparameters
batch_size = 64

# Set up a DataLoader for the training dataset.
train_dataloader = DataLoader(
    dataset=train_dataset, batch_size=batch_size, shuffle=True)

# Set up a DataLoader for the test dataset.
test_dataloader = DataLoader(
    dataset=test_dataset, batch_size=batch_size)

Let's take a look at what's inside our datasets. `torch.utils.data.Dataset`s are indexable, so we can easily peek inside.

In [None]:
# Print the first training example
print(train_dataset[0])

(tensor([[[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000, 0.0039, 0.0000, 0.0000, 0.0510,
          0.2863, 0.0000, 0.0000, 0.0039, 0.0157, 0.0000

From this output, we can see that the dataset elements are tuples of `(data_tensor, label)`. `data_tensor` is a `FloatTensor` of shape `(1, 28, 28)` (since the image is 28x28 with a single channel), and `label` is an integer from 0 to 9 (since there are 10 classes in the data).

Let's similarly look at what the `DataLoader` produces.

In [None]:
# Grab just the first batch; list(train_dataloader) would load every batch.
next(iter(train_dataloader))

[tensor([[[[0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           ...,
           [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000]]],
 
 
         [[[0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           ...,
           [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000]]],
 
 
         [[[0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
           [0.0000

As we can see, the `DataLoader` groups examples into batches of size `batch_size` (64 in the code above). Thus, the shape of the returned image tensor is `(64, 1, 28, 28)`, since we essentially stacked `batch_size` examples together. Similarly, the labels are now a `LongTensor` of size `batch_size`.

Note that the label for a single example was a Python `int` --- the dataloader automatically grouped them into a `LongTensor` of the appropriate size.
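
To see the batching behaviour without downloading anything, here is a sketch with a toy dataset. `ToyDataset` is a hypothetical class that returns blank 28x28 images with integer labels; it is not part of torchvision.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):  # hypothetical stand-in for FashionMNIST
    def __len__(self):
        return 10

    def __getitem__(self, idx):
        # One blank single-channel image and a plain Python int label.
        return torch.zeros(1, 28, 28), idx % 3

loader = DataLoader(ToyDataset(), batch_size=4)
images, labels = next(iter(loader))

print(images.shape)  # torch.Size([4, 1, 28, 28])
print(labels.dtype)  # torch.int64, i.e. a LongTensor

# The last batch is smaller when the dataset size isn't a multiple of batch_size.
batch_sizes = [len(batch_labels) for _, batch_labels in loader]
print(batch_sizes)   # [4, 4, 2]
```

Any class with `__len__` and `__getitem__` can be wrapped in a `DataLoader` this way.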

## Building our model

Now we can construct a `FeedForwardNN` instance that we'll train. Each FashionMNIST example is a `28x28` image, which we get as a Tensor of shape `(1, 28, 28)`.

We'll flatten out each example to a vector of size `(784,)` for compatibility with our model.
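
For a whole batch, the flattening keeps the batch dimension and merges the rest. A small sketch on random data shows two equivalent ways to do it:

```python
import torch

# A fake batch with the same shape the DataLoader produces.
images = torch.randn(64, 1, 28, 28)

# -1 lets PyTorch infer the batch dimension from the remaining size.
flat_view = images.view(-1, 784)

# torch.flatten with start_dim=1 merges everything after the batch dimension.
flat_alt = torch.flatten(images, start_dim=1)

print(flat_view.shape)  # torch.Size([64, 784])
print(torch.equal(flat_view, flat_alt))
```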

In [None]:
# Hyperparameters of our model.
num_hidden = 2
hidden_dim = 512
dropout = 0.2

fashionmnist_ffnn_clf = FeedForwardNN(input_size=784, num_classes=10, 
                                      num_hidden=num_hidden, 
                                      hidden_dim=hidden_dim, dropout=dropout)
print(fashionmnist_ffnn_clf)

FeedForwardNN(
  (hidden_layers): ModuleList(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): Linear(in_features=512, out_features=512, bias=True)
  )
  (dropout): Dropout(p=0.2, inplace=False)
  (output_projection): Linear(in_features=512, out_features=10, bias=True)
  (nonlinearity): ReLU()
)


If we're using a GPU, we'll move the model to the GPU which should speed up training. We do this with the same `.cuda()` method we used for Tensors.

In [None]:
if using_GPU:
  fashionmnist_ffnn_clf = fashionmnist_ffnn_clf.cuda()

# Check if the Module is on GPU by checking if a parameter is on GPU
print("Model on GPU?:")
print(next(fashionmnist_ffnn_clf.parameters()).is_cuda)

Model on GPU?:
True
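
A common alternative to branching on a GPU flag is to pick a `torch.device` once and move everything with `.to(device)`, so the same code runs with or without a GPU. The sketch below uses a small stand-in model rather than the classifier above.

```python
import torch
import torch.nn as nn

# Fall back to the CPU when no GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(784, 10).to(device)   # stand-in model
batch = torch.randn(5, 784).to(device)  # inputs must be on the same device

out = model(batch)
print(out.shape)   # torch.Size([5, 10])
print(out.device)
```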


## Construct other classes we need for training: loss and optimizer

Now, we'll set up a criterion for calculating the loss and an Optimizer for updating our parameters.

In [None]:
# Set up criterion for calculating loss
nll_criterion = nn.NLLLoss()

lr = 0.1
momentum = 0.9
# Set up an optimizer for updating the parameters of fashionmnist_ffnn_clf
ffnn_optimizer = optim.SGD(fashionmnist_ffnn_clf.parameters(), 
                           lr=lr, momentum=momentum)

## Train the model!

Now, we'll implement the procedure to train the model --- this is typically called the "train loop" since we loop over our batches, performing the forward pass, calculating a loss, backpropping, and then updating our parameters. This is the bulk of the code necessary to train the model.

This block looks pretty long, but that's mostly because of the comments :)

In [None]:
# Number of epochs (passes through the dataset) to train the model for.
num_epochs = 10

# A counter for the number of gradient updates we've performed.
num_iter = 0

# Iterate `num_epochs` times.
for epoch in range(num_epochs):
  print("Starting epoch {}".format(epoch + 1))
  # Iterate over the train_dataloader, unpacking the images and labels
  for (images, labels) in train_dataloader:
    # Reshape images from (batch_size, 1, 28, 28) to (batch_size, 784), since
    # that's what our model expects. Remember that -1 does shape inference!
    reshaped_images = images.view(-1, 784)
    
    # Tensors track gradients on their own (since PyTorch 0.4), so there is
    # no need to wrap reshaped_images and labels in the deprecated Variable.
    
    # If we're using the GPU, move reshaped_images and labels to the GPU.
    if using_GPU:
      reshaped_images = reshaped_images.cuda()
      labels = labels.cuda()
      
    # Run the forward pass through the model to get predicted log distribution.
    # predicted shape: (batch_size, 10) (since there are 10 classes)
    predicted = fashionmnist_ffnn_clf(reshaped_images)
    
    # Calculate the loss
    batch_loss = nll_criterion(predicted, labels)
    
    # Clear the gradients as we prepare to backprop.
    ffnn_optimizer.zero_grad()
    
    # Backprop (backward pass), which calculates gradients.
    batch_loss.backward()
    
    # Take a gradient step to update parameters.
    ffnn_optimizer.step()
    
    # Increment gradient update counter.
    num_iter += 1
    
    # Calculate test set loss and accuracy every 500 gradient updates
    # It's standard to have this as a separate evaluate function, but
    # we'll place it inline for didactic purposes.
    if num_iter % 500 == 0:
      # Set model to eval mode, which turns off dropout.
      fashionmnist_ffnn_clf.eval()
      # Counters for the num of examples we get right / total num of examples.
      num_correct = 0
      total_examples = 0
      total_test_loss = 0
      
      # Iterate over the test dataloader. torch.no_grad() disables gradient
      # tracking, which we don't need at test time; this speeds up inference.
      # (It replaces the removed Variable(..., volatile=True) idiom.)
      with torch.no_grad():
        for (test_images, test_labels) in test_dataloader:
          # Reshape images from (batch_size, 1, 28, 28) to (batch_size, 784) again
          reshaped_test_images = test_images.view(-1, 784)

          # If we're using the GPU, move tensors to the GPU.
          if using_GPU:
            reshaped_test_images = reshaped_test_images.cuda()
            test_labels = test_labels.cuda()

          # Run the forward pass to get predicted distribution.
          predicted = fashionmnist_ffnn_clf(reshaped_test_images)

          # Calculate loss for this test batch. This is averaged, so multiply
          # by the number of examples in batch to get a total.
          total_test_loss += nll_criterion(
              predicted, test_labels) * test_labels.size(0)

          # Get predicted labels (argmax over the class dimension).
          _, predicted_labels = torch.max(predicted, 1)

          # Count the number of examples in this batch
          total_examples += test_labels.size(0)

          # Count the total number of correctly predicted labels.
          # predicted_labels == test_labels generates a boolean tensor that is
          # True where they match, so we can sum to get the num correct.
          num_correct += torch.sum(predicted_labels == test_labels)
      accuracy = 100 * num_correct / total_examples
      average_test_loss = total_test_loss / total_examples
      print("Iteration {}. Test Loss {}. Test Accuracy {}.".format(
          num_iter, average_test_loss, accuracy))
      # Set the model back to train mode, which activates dropout again.
      fashionmnist_ffnn_clf.train()

Starting epoch 1




Iteration 500. Test Loss 0.6602397561073303. Test Accuracy 75.08999633789062.
Starting epoch 2
Iteration 1000. Test Loss 0.5724758505821228. Test Accuracy 79.48999786376953.
Iteration 1500. Test Loss 0.5400569438934326. Test Accuracy 81.80999755859375.
Starting epoch 3
Iteration 2000. Test Loss 0.5474987030029297. Test Accuracy 79.48999786376953.
Iteration 2500. Test Loss 0.500251293182373. Test Accuracy 82.07999420166016.
Starting epoch 4
Iteration 3000. Test Loss 0.5006492137908936. Test Accuracy 83.27999877929688.
Iteration 3500. Test Loss 0.5060549974441528. Test Accuracy 83.00999450683594.
Starting epoch 5
Iteration 4000. Test Loss 0.5307899117469788. Test Accuracy 82.13999938964844.
Iteration 4500. Test Loss 0.4667015075683594. Test Accuracy 83.88999938964844.
Starting epoch 6
Iteration 5000. Test Loss 0.4831177592277527. Test Accuracy 83.06999969482422.
Iteration 5500. Test Loss 0.46682479977607727. Test Accuracy 83.19999694824219.
Starting epoch 7
Iteration 6000. Test Loss 0.46

## Extending this example

A good exercise to extend this example would be to look into how convolutional neural nets (CNNs) work, build a `ConvolutionalNeuralNet` classification `Module`, train it on the FashionMNIST dataset, and compare the feed-forward neural net with the CNN. We explore CNNs next week. 

# Example: Sentiment classification with RNN

More examples and details at the source + base text: https://github.com/bentrevett/pytorch-sentiment-analysis


The task tackled here is sentiment analysis of IMDb movie reviews. In this notebook, we'll actually get decent results. We will take the first base model from the URL above and build upon it so that it performs better.

We will use:
- packed padded sequences
- pre-trained word embeddings
- a different RNN architecture (LSTM)
- bidirectional RNN
- multi-layer RNN
- regularization
- Adam optimizer

The general architecture will be: pretrained word embeddings (to capture meaning of the words), fed into an LSTM which captures longer term time dependencies of sentences, to predict the sentiment. 

This will allow us to achieve ~84% test accuracy.

## Preparing Data

We will set a seed first. Why do we do this? Because then our random number generator is initialized in the same way each time, so we all get the same 'random' results. Then we will define the `Field`s and get the train/valid/test splits. Note that the `Field` API and this dataset loader moved to `torchtext.legacy` in torchtext 0.9, but they are still accessible [as per here](https://colab.research.google.com/github/pytorch/text/blob/master/examples/legacy_tutorial/migration_tutorial.ipynb#scrollTo=7DRXJFgzriaH). 

Since our sentences are all of a different length, we typically 'pad' them, so they all become of the same length. We take it one step further however, and use *packed padded sequences*, which will make our RNN only process the non-padded elements of our sequence, and for any padded element the `output` will be a zero tensor. More info on this [here](https://stackoverflow.com/questions/51030782/why-do-we-pack-the-sequences-in-pytorch). 

To use packed padded sequences, we have to tell the RNN how long the actual sequences are. We do this by setting `include_lengths = True` for our `TEXT` field. This will cause `batch.text` to now be a tuple with the first element being our sentence (a numericalized tensor that has been padded) and the second element being the actual lengths of our sentences.

In [None]:
import torchtext
import torch
from torchtext.legacy import data
from torchtext.legacy import datasets


# setting the seed so our random output is actually deterministic
SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# defining our input fields (text) and labels. 
# We use the Spacy function because it provides strong support for tokenization in languages other than English
TEXT = data.Field(tokenize = 'spacy', include_lengths = True)
LABEL = data.LabelField(dtype = torch.float)


# IS THE NEXT CELL SUPER SLOW?? 
# Note: spaCy tokenization can be very slow. As an alternative to get past this step (but with worse results), you could use the code below for TEXT instead of the spaCy tokenizer:
# def tokenize(s):
#     return s.split(' ')
# TEXT = data.Field(tokenize=tokenize, include_lengths = True)

We then load the IMDb dataset, which is included in the torchtext package. You can find more information about the built in datasets in torchtext here: https://torchtext.readthedocs.io/en/latest/datasets.html

Note: applying the spaCy tokenizer, which is done in the command below, can take about 5-10 minutes. If it doesn't work for you, you can replace the `TEXT = ...` line with the commented code in the cell above to define your own super naive tokenizer for the purpose of this demo. 

In [None]:
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)  # datasets here refers to torchtext.legacy.datasets

Then we create the validation set from our training set. We will use this to tune our hyperparameters. 

In [None]:
import random 

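# random.seed() returns None, so we seed the generator first and then pass its state.
random.seed(SEED)
train_data, valid_data = train_data.split(random_state = random.getstate())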

What does this dataset look like? Datasets from torchtext are iterables, so let's get one example: 

In [None]:
example = next(iter(test_data))

This contains the text, tokenized (meaning each word is an element of a list): 

In [None]:
example

<torchtext.legacy.data.example.Example at 0x7fef0b40c290>

And a label (in this case 'pos' for positive sentiment)

In [None]:
example.label

'pos'

In [None]:
example.text

['I',
 'thought',
 'Anywhere',
 'But',
 'Here',
 'was',
 'a',
 'good',
 'movie.It',
 'stars',
 'two',
 'wonderful',
 'actresses,',
 'Susan',
 'Sarandon',
 'and',
 'Natlie',
 'Portman,',
 'which',
 'when',
 'I',
 'heard',
 'they',
 'were',
 'in',
 'a',
 'movie',
 'together',
 'I',
 'resist',
 'watching',
 'it.Overall,',
 'it',
 'was',
 'a',
 'pretty',
 'enjoyable',
 'movie.It',
 'had',
 "it's",
 'moments',
 'where',
 'I',
 'felt',
 'as',
 'if',
 'they',
 'tried',
 'to',
 'hard,',
 'and',
 'there',
 'was',
 'also',
 'some',
 'really',
 'overdone',
 'and',
 'worn-out',
 'material,',
 'but',
 'there',
 "wasn't",
 'anything',
 'in',
 'the',
 'movie',
 'that',
 'I',
 'absolutely',
 'hated.I',
 'even',
 'liked',
 'how',
 'they',
 'used',
 'the',
 'pop-up',
 'performance',
 'of',
 'the',
 'uncredited',
 'Thora',
 'Birch,',
 'and',
 'all',
 'the',
 'little',
 'happy/sad',
 'moments',
 'are',
 'touching',
 'and',
 'effective.If',
 'you',
 'want',
 'to',
 'watch',
 'this',
 'movie,',
 'go',
 'ahe

Next comes loading the pre-trained word embeddings. Instead of initializing our word embeddings randomly, we initialize them with pre-trained vectors, namely the GloVe vectors from Stanford. There are other alternatives you can use here. 

We get these vectors simply by specifying which vectors we want and passing it as an argument to `build_vocab`. `TorchText` handles downloading the vectors and associating them with the correct words in our vocabulary.

Here, we'll be using the `"glove.6B.100d"` vectors. `glove` is the algorithm used to calculate the vectors; go [here](https://nlp.stanford.edu/projects/glove/) for more. `6B` indicates these vectors were trained on 6 billion tokens and `100d` indicates these vectors are 100-dimensional, meaning there are 100 values (or coordinates) for each word.

You can see the other available vectors [here](https://github.com/pytorch/text/blob/master/torchtext/vocab.py#L113).

The theory is that these pre-trained vectors already have words with similar semantic meaning close together in vector space, e.g. "terrible", "awful", "dreadful" are nearby. This gives our embedding layer a good initialization as it does not have to learn these relations from scratch.

**Note**: these vectors are about 862MB.

By default, TorchText will initialize words that are in your vocabulary but not in your pre-trained embeddings (in other words: new words) to zero. We don't want this, and instead initialize them randomly by setting `unk_init` to `torch.Tensor.normal_`. This will now initialize those words via a Gaussian distribution.

We limit the number of unique words we use to 25,000. All less frequent words will be replaced by the `<unk>` token.
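
The effect of a size-limited vocabulary can be sketched in plain Python. The helper `build_vocab` and the toy corpus below are assumptions made for illustration and are not the torchtext implementation; the point is only that the most frequent words keep their own index, everything else maps to `<unk>`, and two special tokens are added on top of the size limit.

```python
from collections import Counter

# Illustrative sketch only: `build_vocab` is a hypothetical helper, not torchtext code.
def build_vocab(tokenized_sentences, max_size):
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Special tokens come first, followed by the `max_size` most frequent words.
    itos = ["<unk>", "<pad>"] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

corpus = [["good", "film"], ["good", "plot"], ["bad", "film"], ["good", "cast"]]
itos, stoi = build_vocab(corpus, max_size=2)

print(itos)  # ['<unk>', '<pad>', 'good', 'film']
# Words outside the top `max_size` fall back to the <unk> index.
print([stoi.get(tok, stoi["<unk>"]) for tok in ["good", "cast", "film"]])  # [2, 0, 3]
```

With `max_size = 25_000`, the same logic gives 25,002 entries in total: 25,000 words plus `<unk>` and `<pad>`.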

In [None]:
MAX_VOCAB_SIZE = 25_000

# Build the vocabulary from the training set only, so no information leaks from the test set.
TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_) # how to initialize unseen words not in glove

LABEL.build_vocab(train_data)

.vector_cache/glove.6B.zip: 862MB [02:42, 5.29MB/s]                           
100%|█████████▉| 398460/400000 [00:20<00:00, 19632.60it/s]

We create the iterators, placing the tensors on the GPU if one is available.

One consideration for packed padded sequences is that all of the tensors within a batch need to be sorted by their lengths. This is handled in the batch iterator by setting `sort_within_batch = True`.

In [None]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    sort_within_batch = True,
    device = device)

## Build the Model

Alright, we have our data pipeline set up and are ready to define the model. 

### Different RNN Architecture

We'll be using the RNN architecture called a Long Short-Term Memory (LSTM). Why is an LSTM better than a standard RNN? Standard RNNs suffer from the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem). LSTMs mitigate this by having an extra recurrent state called a _cell_, $c$ - which can be thought of as the "memory" of the LSTM - and by using multiple _gates_ which control the flow of information into and out of the memory. For more information, go [here](https://colah.github.io/posts/2015-08-Understanding-LSTMs/). We can simply think of the LSTM as a function of $x_t$, $h_{t-1}$ and $c_{t-1}$, instead of just $x_t$ and $h_{t-1}$.

$$(h_t, c_t) = \text{LSTM}(x_t, h_{t-1}, c_{t-1})$$

Thus, the model using an LSTM looks something like (with the embedding layers omitted):

![](https://github.com/bentrevett/pytorch-sentiment-analysis/raw/79bb86abc9e89951a5f8c4a25ca5de6a491a4f5d/assets/sentiment2.png)

The initial cell state, $c_0$, like the initial hidden state, is initialized to a tensor of all zeros. The sentiment prediction is still, however, only made using the final hidden state, not the final cell state, i.e. $\hat{y}=f(h_T)$.
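
The recurrence can be sketched with `nn.LSTMCell`, which performs a single time step: it takes the current input together with the previous hidden and cell states and returns the new ones. The sizes below are toy values picked for illustration, not the ones used in the model later on.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen for illustration only.
seq_len, batch_size, input_dim, hidden_dim = 5, 3, 4, 6

cell = nn.LSTMCell(input_dim, hidden_dim)
x = torch.randn(seq_len, batch_size, input_dim)

# Both the hidden state and the cell state start as all-zero tensors.
h = torch.zeros(batch_size, hidden_dim)
c = torch.zeros(batch_size, hidden_dim)

for t in range(seq_len):
    # One time step: new (h, c) from the current input and the previous (h, c).
    h, c = cell(x[t], (h, c))

# Only the final hidden state would be passed on to the classifier.
print(h.shape)  # torch.Size([3, 6])
```

In practice `nn.LSTM` runs this loop for us over the whole sequence, which is what the model below relies on.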


### Bidirectional RNN

In this lab we are going one step above the traditional LSTM, by using a **bidirectional** one, which may be even more effective. 

The concept behind a bidirectional RNN is simple. As well as having an RNN processing the words in the sentence from the first to the last (a forward RNN), we have a second RNN processing the words in the sentence from the **last to the first** (a backward RNN). At time step $t$, the forward RNN is processing word $x_t$, and the backward RNN is processing word $x_{T-t+1}$. 

In PyTorch, the hidden state (and cell state) tensors returned by the forward and backward RNNs are stacked on top of each other in a single tensor. 

We make our sentiment prediction using a concatenation of the last hidden state from the forward RNN (obtained from final word of the sentence), $h_T^\rightarrow$, and the last hidden state from the backward RNN (obtained from the first word of the sentence), $h_T^\leftarrow$, i.e. $\hat{y}=f(h_T^\rightarrow, h_T^\leftarrow)$   

The image below shows a bi-directional RNN, with the forward RNN in orange, the backward RNN in green and the linear layer in silver.  

![](https://github.com/bentrevett/pytorch-sentiment-analysis/raw/79bb86abc9e89951a5f8c4a25ca5de6a491a4f5d/assets/sentiment3.png)

### Multi-layer RNN

Multi-layer RNNs (also called *deep RNNs*) are another simple concept. The idea is that we add additional RNNs on top of the initial standard RNN, where each RNN added is another *layer*. The hidden state output by the first (bottom) RNN at time-step $t$ will be the input to the RNN above it at time step $t$. The prediction is then made from the final hidden state of the final (highest) layer.

The image below shows a multi-layer unidirectional RNN, where the layer number is given as a superscript. Also note that each layer needs its own initial hidden state, $h_0^L$.

![](https://github.com/bentrevett/pytorch-sentiment-analysis/raw/79bb86abc9e89951a5f8c4a25ca5de6a491a4f5d/assets/sentiment4.png)

### Regularization

Although we've added improvements to our model, each one adds additional parameters. Without going into too much detail on overfitting, the more parameters you have in your model, the higher the probability that your model will overfit (memorize the training data, causing a low training error but high validation/testing error, i.e. poor generalization to new, unseen examples). To combat this, we use regularization. More specifically, we use a method of regularization called *dropout*. Dropout works by randomly *dropping out* (setting to 0) neurons in a layer during a forward pass. The probability that each neuron is dropped out is set by a hyperparameter, and each neuron with dropout applied is considered independently. One theory about why dropout works is that a model with parameters dropped out can be seen as a "weaker" model (one with fewer parameters). The predictions from all these "weaker" models (one for each forward pass) get averaged together within the parameters of the model. Thus, your one model can be thought of as an ensemble of weaker models, none of which are over-parameterized and thus should not overfit.
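
As a rough illustration of the mechanism, here is a plain-Python sketch of "inverted" dropout on a list of activations. The helper `dropout` is a hypothetical stand-in written for this example; `nn.Dropout` applies the same idea to tensors, including the rescaling of surviving activations by $1/(1-p)$ during training.

```python
import random

# Illustrative sketch only: `dropout` is a hypothetical helper, not the PyTorch layer.
def dropout(activations, p, training, rng):
    if not training or p == 0.0:
        return list(activations)  # evaluation mode: activations pass through unchanged
    keep_scale = 1.0 / (1.0 - p)  # rescale survivors so the expected value is unchanged
    return [0.0 if rng.random() < p else a * keep_scale for a in activations]

rng = random.Random(1234)
activations = [1.0] * 10

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)

print(train_out)  # each entry is either 0.0 (dropped) or 2.0 (kept and rescaled)
print(eval_out)   # identical to the input
```

The difference between the two calls is the reason the model has to be switched between training and evaluation mode.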

### Implementation Details

Another addition to this model is that we are not going to learn the embedding for the `<pad>` token. This is because we want to explicitly tell our model that padding tokens are irrelevant to determining the sentiment of a sentence. This means the embedding for the pad token will remain at what it is initialized to (we initialize it to all zeros later). We do this by passing the index of our pad token as the `padding_idx` argument to the `nn.Embedding` layer.
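
The effect of `padding_idx` can be checked on a tiny embedding layer. The sizes and the pad index below are toy values assumed for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes and pad index chosen for illustration only.
pad_idx = 1
embedding = nn.Embedding(num_embeddings=5, embedding_dim=3, padding_idx=pad_idx)

# A batch containing real tokens (2, 3, 4) and padding (1).
tokens = torch.tensor([[2, 3, 1], [4, 1, 1]])
embedding(tokens).sum().backward()

print(embedding.weight[pad_idx])       # the pad row starts out as zeros
print(embedding.weight.grad[pad_idx])  # and receives a zero gradient
print(embedding.weight.grad[2])        # real tokens receive non-zero gradients
```

Because the gradient of the pad row is always zero, the optimizer never moves it away from its initial value.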

To use an LSTM instead of the standard RNN, we use `nn.LSTM` instead of `nn.RNN`. Also, note that the LSTM returns the `output` and a tuple of the final `hidden` state and the final `cell` state, whereas the standard RNN only returned the `output` and final `hidden` state. 

As the final hidden state of our LSTM has both a forward and a backward component, which will be concatenated together, the size of the input to the `nn.Linear` layer is twice that of the hidden dimension size.

Bidirectionality and additional layers are implemented by passing values for the `bidirectional` and `num_layers` arguments of the RNN/LSTM. 

Dropout is implemented by initializing an `nn.Dropout` layer (the argument is the probability of dropping out each neuron) and using it within the `forward` method after each layer we want to apply dropout to. **Note**: never use dropout on the input or output layers (`text` or `fc` in this case), you only ever want to use dropout on intermediate layers. The LSTM has a `dropout` argument which adds dropout on the connections between hidden states in one layer to hidden states in the next layer. 

As we are passing the lengths of our sentences to be able to use packed padded sequences, we have to add a second argument, `text_lengths`, to `forward`. 

Before we pass our embeddings to the RNN, we need to pack them, which we do with `nn.utils.rnn.pack_padded_sequence`. This will cause our RNN to only process the non-padded elements of our sequence. The RNN will then return `packed_output` (a packed sequence) as well as the `hidden` and `cell` states (both of which are tensors). Without packed padded sequences, `hidden` and `cell` are tensors from the last element in the sequence, which will most probably be a pad token; when using packed padded sequences, they are both from the last non-padded element in the sequence. 

We then unpack the output sequence, with `nn.utils.rnn.pad_packed_sequence`, to transform it from a packed sequence to a tensor. The elements of `output` from padding tokens will be zero tensors (tensors where every element is zero). Usually, we only have to unpack output if we are going to use it later on in the model. Although we aren't in this case, we still unpack the sequence just to show how it is done.
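
A small, self-contained sketch of the pack and unpack round trip is shown here. It uses a single-layer unidirectional LSTM with toy sizes (assumptions for illustration, not the model's settings) so that the two claims above are easy to verify: padded positions come back as zeros, and `hidden` holds the state from the last non-padded step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes for illustration; tensors are [seq len, batch size, features].
max_len, batch_size, emb_dim, hid_dim = 4, 3, 5, 2
embedded = torch.randn(max_len, batch_size, emb_dim)
lengths = torch.tensor([4, 2, 1])  # true lengths, sorted in decreasing order

rnn = nn.LSTM(emb_dim, hid_dim)

packed = nn.utils.rnn.pack_padded_sequence(embedded, lengths)
packed_output, (hidden, cell) = rnn(packed)
output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)

print(output.shape)    # torch.Size([4, 3, 2])
print(output_lengths)  # tensor([4, 2, 1])
# Positions past a sequence's true length are zero vectors.
print(output[2:, 1])
# `hidden` comes from the last non-padded step of each sequence.
print(torch.allclose(hidden[0, 1], output[1, 1]))  # True
```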

The final hidden state, `hidden`, has a shape of _**[num layers * num directions, batch size, hid dim]**_. These are ordered: **[forward_layer_0, backward_layer_0, forward_layer_1, backward_layer_1, ..., forward_layer_n, backward_layer_n]**. As we want the final (top) layer forward and backward hidden states, we get the top two hidden layers from the first dimension, `hidden[-2,:,:]` and `hidden[-1,:,:]`, and concatenate them together before passing them to the linear layer (after applying dropout). 
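
The ordering can be checked on a small two-layer bidirectional LSTM. The sizes below are toy values assumed for illustration, and the input is left unpadded to keep the comparison simple.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes for illustration only.
seq_len, batch_size, emb_dim, hid_dim, n_layers = 7, 3, 5, 4, 2

rnn = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers, bidirectional=True)
x = torch.randn(seq_len, batch_size, emb_dim)
output, (hidden, cell) = rnn(x)

print(hidden.shape)  # torch.Size([4, 3, 4]) = [n_layers * 2, batch size, hid dim]
print(output.shape)  # torch.Size([7, 3, 8]) = [seq len, batch size, hid dim * 2]

# Top-layer forward state: matches the forward half of the output at the last step.
print(torch.allclose(hidden[-2], output[-1, :, :hid_dim]))  # True
# Top-layer backward state: matches the backward half of the output at the first step.
print(torch.allclose(hidden[-1], output[0, :, hid_dim:]))   # True

# Concatenating both gives the [batch size, hid dim * 2] input to the linear layer.
final = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
print(final.shape)  # torch.Size([3, 8])
```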

In [None]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        
        self.rnn = nn.LSTM(embedding_dim, #'100'
                           hidden_dim, 
                           num_layers=n_layers, #set to two: makes our LSTM 'deep'
                           bidirectional=bidirectional, #bidirectional or not
                           dropout=dropout) #we add dropout for regularization
        
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        
        self.dropout = nn.Dropout(dropout) #dropout layer
        
    def forward(self, text, text_lengths):
        
        embedded = self.dropout(self.embedding(text)) ## change the text to the embedding
        
        #pack sequence
        # note, we move text_lengths to cpu due to a small bug in current pytorch: https://github.com/pytorch/pytorch/issues/43227
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu()) #use packed padding
        
        packed_output, (hidden, cell) = self.rnn(packed_embedded) #feed to rnn
        
        #unpack sequence
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output) #unpack the padding
        
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)) #add dropout
            
        return self.fc(hidden)

Like before, we'll create an instance of our RNN class, with the new parameters and arguments for the number of layers, bidirectionality and dropout probability.

To ensure the pre-trained vectors can be loaded into the model, the `EMBEDDING_DIM` must be equal to that of the pre-trained GloVe vectors loaded earlier, which is 100 if you recall from above. 

We get our pad token index from the vocabulary, getting the actual string representing the pad token from the field's `pad_token` attribute, which is `<pad>` by default.

In [None]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2 #this makes our LSTM deep
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token] # index of the <pad> token; passed to the embedding layer so its embedding is not updated during training

model = RNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PAD_IDX)

We'll print out the number of parameters in our model. 

Notice how we have almost twice as many parameters as the simple RNN baseline from the source tutorial!

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 4,810,857 trainable parameters
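
This number can be reproduced by hand. The sketch below assumes a vocabulary of 25,002 entries (25,000 words plus `<unk>` and `<pad>`) and the hyperparameters defined above; the helper `lstm_direction_params` is a hypothetical name used only for this calculation.

```python
# Hyperparameters as defined above; the vocabulary has 25,000 words plus <unk> and <pad>.
vocab_size, emb_dim, hid_dim, out_dim = 25_002, 100, 256, 1

def lstm_direction_params(input_size, hidden_size):
    # Four gates, each with input weights, recurrent weights and two bias vectors.
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

embedding = vocab_size * emb_dim
layer_0 = 2 * lstm_direction_params(emb_dim, hid_dim)      # forward + backward
layer_1 = 2 * lstm_direction_params(2 * hid_dim, hid_dim)  # input is the concatenated states
linear = 2 * hid_dim * out_dim + out_dim                   # weights + bias

total = embedding + layer_0 + layer_1 + linear
print(f"{total:,}")  # 4,810,857
```

More than half of the parameters (2,500,200) sit in the embedding layer.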


The final addition is copying the pre-trained word embeddings we loaded earlier into the `embedding` layer of our model.

We retrieve the embeddings from the field's vocab, and check they're the correct size, _**[vocab size, embedding dim]**_ 

In [None]:
pretrained_embeddings = TEXT.vocab.vectors

print(pretrained_embeddings.shape)

torch.Size([25002, 100])


We then replace the initial weights of the `embedding` layer with the pre-trained embeddings.

**Note**: this should always be done on the `weight.data` and not the `weight`!

In [None]:
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.9742, -0.9265, -0.8306,  ...,  1.0355, -1.2270, -1.4588],
        [-0.4788, -1.9278, -2.0450,  ...,  0.3158,  0.4361, -2.0683],
        [-1.3864,  0.0110,  0.5185,  ...,  0.7835, -1.1318, -0.1939]])

As our `<unk>` and `<pad>` tokens aren't in the pre-trained vocabulary, they have been initialized using `unk_init` (an $\mathcal{N}(0,1)$ distribution) when building our vocab. It is preferable to initialize them both to all zeros to explicitly tell our model that, initially, they are irrelevant for determining sentiment. 

We do this by manually setting their row in the embedding weights matrix to zeros. We get their row by finding the index of the tokens, which we have already done for the padding index.

**Note**: like initializing the embeddings, this should be done on the `weight.data` and not the `weight`!

In [None]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print(model.embedding.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.9742, -0.9265, -0.8306,  ...,  1.0355, -1.2270, -1.4588],
        [-0.4788, -1.9278, -2.0450,  ...,  0.3158,  0.4361, -2.0683],
        [-1.3864,  0.0110,  0.5185,  ...,  0.7835, -1.1318, -0.1939]])


We can now see the first two rows of the embedding weights matrix have been set to zeros. As we passed the index of the pad token to the `padding_idx` of the embedding layer, it will remain zeros throughout training; the `<unk>` token embedding, however, will be learned.

## Train the Model

Now to training the model.

The only change we'll make here is changing the optimizer from `SGD` to `Adam`. SGD updates all parameters with the same learning rate and choosing this learning rate can be tricky. `Adam` adapts the learning rate for each parameter, giving parameters that are updated more frequently lower learning rates and parameters that are updated infrequently higher learning rates. More information about `Adam` (and other optimizers) can be found [here](http://ruder.io/optimizing-gradient-descent/index.html).

To change `SGD` to `Adam`, we simply change `optim.SGD` to `optim.Adam`. Also note that we do not have to provide an initial learning rate for Adam, as PyTorch specifies a sensible default initial learning rate.
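
For reference, the standard Adam update for a parameter $\theta$ with gradient $g_t$ at step $t$ can be written as follows. The symbols are the conventional ones and are not used elsewhere in this notebook: $\alpha$ is the learning rate, $\beta_1$ and $\beta_2$ are decay rates, and $\epsilon$ is a small constant for numerical stability.

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$

$$\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Dividing by $\sqrt{\hat{v}_t}$ is what gives each parameter its own effective step size. In `optim.Adam`, the defaults are `lr=1e-3`, `betas=(0.9, 0.999)` and `eps=1e-8`.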

In [None]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

The rest of the steps for training the model are unchanged.

We define the criterion and place the model and criterion on the GPU (if available)...

In [None]:
criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)
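
`nn.BCEWithLogitsLoss` combines a sigmoid with binary cross-entropy, which is why the model outputs raw scores (logits) rather than probabilities. The following plain-Python sketch computes the same quantity for intuition. The helpers and toy values are assumptions for illustration, and the real implementation uses a more numerically stable formulation.

```python
import math

# Illustrative sketch only: plain-Python stand-ins, not the PyTorch implementation.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logits, targets):
    # Mean over examples of -[y * log(p) + (1 - y) * log(1 - p)], with p = sigmoid(logit).
    losses = [-(y * math.log(sigmoid(x)) + (1.0 - y) * math.log(1.0 - sigmoid(x)))
              for x, y in zip(logits, targets)]
    return sum(losses) / len(losses)

logits = [2.0, -1.0, 0.0]   # raw model outputs (any real number)
targets = [1.0, 0.0, 1.0]   # 1.0 = positive, 0.0 = negative

print(round(bce_with_logits(logits, targets), 4))  # 0.3778
```
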

We implement the function to calculate accuracy...

In [None]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc

We define a function for training our model. 

As we have set `include_lengths = True`, our `batch.text` is now a tuple with the first element being the numericalized tensor and the second element being the actual lengths of each sequence. We separate these into their own variables, `text` and `text_lengths`, before passing them to the model.

**Note**: as we are now using dropout, we must remember to use `model.train()` to ensure the dropout is "turned on" while training.

In [None]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        text, text_lengths = batch.text
        
        predictions = model(text, text_lengths).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Then we define a function for testing our model, again remembering to separate `batch.text`.

**Note**: as we are now using dropout, we must remember to use `model.eval()` to ensure the dropout is "turned off" while evaluating.

In [None]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            text, text_lengths = batch.text
            
            predictions = model(text, text_lengths).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

And also create a nice function to tell us how long our epochs are taking.

In [None]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Finally, we train our model...

In [None]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 39s
	Train Loss: 0.675 | Train Acc: 57.50%
	 Val. Loss: 0.601 |  Val. Acc: 69.73%
Epoch: 02 | Epoch Time: 0m 38s
	Train Loss: 0.656 | Train Acc: 61.39%
	 Val. Loss: 0.642 |  Val. Acc: 62.66%
Epoch: 03 | Epoch Time: 0m 38s
	Train Loss: 0.602 | Train Acc: 68.03%
	 Val. Loss: 0.589 |  Val. Acc: 68.49%
Epoch: 04 | Epoch Time: 0m 38s
	Train Loss: 0.478 | Train Acc: 77.48%
	 Val. Loss: 0.405 |  Val. Acc: 83.26%
Epoch: 05 | Epoch Time: 0m 38s
	Train Loss: 0.402 | Train Acc: 82.35%
	 Val. Loss: 0.351 |  Val. Acc: 85.09%


...and get our new and vastly improved test accuracy!

In [None]:
model.load_state_dict(torch.load('tut2-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.353 | Test Acc: 84.82%


## User Input

We can now use our model to predict the sentiment of any sentence we give it. As it has been trained on movie reviews, the sentences provided should also be movie reviews.

When using a model for inference it should always be in evaluation mode. If this tutorial is followed step-by-step then it should already be in evaluation mode (from doing `evaluate` on the test set), however we explicitly set it to avoid any risk.

Our `predict_sentiment` function does a few things:
- sets the model to evaluation mode
- tokenizes the sentence, i.e. splits it from a raw string into a list of tokens
- indexes the tokens by converting them into their integer representation from our vocabulary
- gets the length of our sequence
- converts the indexes, which are a Python list, into a PyTorch tensor
- adds a batch dimension by `unsqueeze`ing 
- converts the length into a tensor
- squashes the output prediction from a real number to a value between 0 and 1 with the `sigmoid` function
- converts the tensor holding a single value into a Python float with the `item()` method

We are expecting reviews with a negative sentiment to return a value close to 0 and positive reviews to return a value close to 1.
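
The tokenize, index and length steps from the list above can be sketched without the model. The tiny vocabulary below is a hypothetical stand-in for `TEXT.vocab.stoi`, and a whitespace split stands in for the spaCy tokenizer; both are assumptions for illustration. Words that are not in the vocabulary fall back to index 0, the `<unk>` token.

```python
from collections import defaultdict

# Hypothetical stand-in for TEXT.vocab.stoi: unknown words map to index 0 (<unk>).
stoi = defaultdict(int, {"<unk>": 0, "<pad>": 1, "this": 2, "film": 3, "is": 4, "great": 5})

# A naive whitespace tokenizer stands in for spaCy here.
tokenized = "this film is stupendous".split()
indexed = [stoi[t] for t in tokenized]
length = [len(indexed)]

print(indexed)  # [2, 3, 4, 0]
print(length)   # [4]
```

This is why the function works on any input sentence: unseen words simply become `<unk>`.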

In [None]:
import spacy
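# 'en_core_web_sm' is spaCy's small English model (the 'en' shortcut was removed in spaCy 3)
nlp = spacy.load('en_core_web_sm')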

def predict_sentiment(model, sentence):
    model.eval()
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    length = [len(indexed)]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    length_tensor = torch.LongTensor(length)
    prediction = torch.sigmoid(model(tensor, length_tensor))
    return prediction.item()

An example negative review...

In [None]:
predict_sentiment(model, "This film is terrible")

0.009660288691520691

An example positive review...

In [None]:
predict_sentiment(model, "This film is great")

0.9771714210510254