Lab 3 - Introduction to PyTorch
==========================

This is based on the [Introduction to PyTorch](https://pytorch.org/tutorials/beginner/basics/intro.html) and [text classification with the torchtext library](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html) tutorials from the official PyTorch website.

# PyTorch

PyTorch is one of the most widely used libraries for implementing models in NLP (and in AI more broadly). Today, we'll learn the basics of the library, including the low-level building blocks of all neural networks.

First, we need to install it (note, this can take a little while):

In [1]:
!pip3 install torch torchvision torchaudio


Now we can import it and see what version we have.

In [2]:
import torch
print(torch.__version__)

2.2.1+cu121


# Tensors

Tensors are a specialized data structure, very similar to arrays and matrices.
In PyTorch, tensors encode:
- The inputs to a model
- The outputs of a model
- A model's parameters (e.g., weights)

Tensors are similar to [NumPy’s](https://numpy.org/) ndarrays, except that tensors can run on GPUs or other hardware accelerators. In fact, tensors and
NumPy arrays can often share the same underlying memory. Tensors are also optimized for automatic differentiation (we'll see more about that later in this lab).

Let's import numpy to explore the relationship between the data types:

In [3]:
import numpy as np

### Initializing a Tensor

Tensors can be initialized in various ways. Take a look at the following examples:

**Directly from data**

Tensors can be created directly from data. The data type is automatically inferred.



In [4]:
data = [[1, 2],[3, 4]]
x_data = torch.tensor(data)
print(data, "has type", x_data.dtype)

float_data = [[1.0, 2.0],[3.0, 4.0]]
xf_data = torch.tensor(float_data)
print(float_data, "has type", xf_data.dtype)

[[1, 2], [3, 4]] has type torch.int64
[[1.0, 2.0], [3.0, 4.0]] has type torch.float32


**From a NumPy array**

Tensors can be created from NumPy arrays.



In [5]:
np_array = np.array(data)
x_np = torch.from_numpy(np_array)
print("From NumPy ndarray:\n", x_np)
print("From List:\n", x_data)

From NumPy ndarray:
 tensor([[1, 2],
        [3, 4]])
From List:
 tensor([[1, 2],
        [3, 4]])


**From another tensor:**

Tensors can be created from other tensors. The new tensor retains the properties (shape, datatype) of the argument tensor, unless explicitly overridden.

These two functions show examples of creating tensors with the same shape as an existing one, but new data.

In [6]:
x_ones = torch.ones_like(x_data)
print(f"Ones Tensor: \n {x_ones} \n")

x_rand = torch.rand_like(x_data, dtype=torch.float)
print(f"Random Tensor: \n {x_rand} \n")

Ones Tensor: 
 tensor([[1, 1],
        [1, 1]]) 

Random Tensor: 
 tensor([[0.4493, 0.5064],
        [0.5436, 0.8545]]) 



**From just a set of dimensions:**

`shape` is a tuple of tensor dimensions. In the functions below, it determines the dimensionality of the output tensor. The contents of the tensor are determined by the function called.



In [7]:
shape = (2,3,)
rand_tensor = torch.rand(shape)
ones_tensor = torch.ones(shape)
zeros_tensor = torch.zeros(shape)
print(f"Random Tensor: \n {rand_tensor} \n")
print(f"Ones Tensor: \n {ones_tensor} \n")
print(f"Zeros Tensor: \n {zeros_tensor}")

Random Tensor: 
 tensor([[0.3957, 0.3586, 0.3756],
        [0.5573, 0.5490, 0.4902]]) 

Ones Tensor: 
 tensor([[1., 1., 1.],
        [1., 1., 1.]]) 

Zeros Tensor: 
 tensor([[0., 0., 0.],
        [0., 0., 0.]])


### Attributes of a Tensor

Tensor attributes describe their shape, datatype, and the device on which they are stored.



In [8]:
tensor = torch.rand(3,4)

print(f"Shape of tensor: {tensor.shape}")
print(f"Datatype of tensor: {tensor.dtype}")
print(f"Device tensor is stored on: {tensor.device}")

Shape of tensor: torch.Size([3, 4])
Datatype of tensor: torch.float32
Device tensor is stored on: cpu


### Operations on Tensors

Our models are implemented as a series of operations on tensors, e.g., multiplying tensors together, and applying non-linear functions to them.

For details on the tensor operations available, see [this page in the PyTorch documentation](https://pytorch.org/docs/stable/torch.html).

Each of these operations can be run on the GPU (typically at higher speeds than on a CPU).

Beyond the operations defined below, some other useful ones include:
- `torch.Tensor.view(-1)`, which flattens the tensor into a single dimension (useful, for example, when converting batched input into a single flat input) [documentation](https://pytorch.org/docs/stable/generated/torch.Tensor.view.html)
- `torch.squeeze`, which removes dimensions of size 1 [documentation](https://pytorch.org/docs/stable/generated/torch.squeeze.html)
- `torch.unsqueeze`, which adds a dimension of size 1 [documentation](https://pytorch.org/docs/stable/generated/torch.unsqueeze.html)
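
As a quick illustration of these three operations, here is a minimal sketch on a small made-up tensor (the variable names are ours, chosen only for this example):

```python
import torch

batch = torch.arange(6).reshape(2, 3)  # a 2x3 tensor holding the values 0..5

flat = batch.view(-1)              # flatten into one dimension: shape (6,)
with_extra = batch.unsqueeze(0)    # add a size-1 dimension at the front: shape (1, 2, 3)
restored = with_extra.squeeze(0)   # remove that size-1 dimension again: shape (2, 3)

print(flat.shape, with_extra.shape, restored.shape)
```

Note that `view` needs the tensor's memory to be laid out contiguously, which is the case here.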

By default, tensors are created on the CPU. We need to explicitly move tensors to the GPU using
the ``.to`` method (after checking for GPU availability). Keep in mind that copying large tensors
across devices can be expensive in terms of time and memory!



In [None]:
# We move our tensor to the GPU if available
if torch.cuda.is_available():
    print("GPU is available, moving tensor to it")
    tensor = tensor.to('cuda')
else:
    print("No GPU available")

Now let's see how a few operations work.

**Indexing:**

Indexing into a tensor is similar to accessing an element of a list. The tensor has a multidimensional structure and you can access parts of that structure.

In [None]:
tensor = torch.rand(4, 4)
print("Complete tensor:")
print(tensor)
print('\nFirst row:')
print(tensor[0])
print('\nType of the first row:')
print(type(tensor[0]))

**Slicing:**

Sometimes we want to get a part of the tensor that does not correspond to one chunk we can index into. For example, in a 2-D tensor we may want to get a column of the tensor:

In [None]:
tensor = torch.rand(4, 4)
print("Complete tensor:")
print(tensor)
print('\nFirst column: ', tensor[:, 0])
print('\nLast column:', tensor[..., -1])

Note that indexing and slicing as shown here give you a view of the tensor rather than a copy.

If you modify the slice then the corresponding part of the tensor will be adjusted:

In [None]:
tensor = torch.rand(4, 4)
print("Original tensor")
print(tensor)
print("\nA column of the tensor")
print(tensor[:,1])
tensor[:,1] = 0
print("\nThe updated tensor")
print(tensor)
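
To make the view behaviour explicit, here is a small sketch (with made-up variable names) that keeps the slice in its own variable, changes it in place, and then compares that with an independent copy made by `clone()`:

```python
import torch

base = torch.zeros(3, 3)

column = base[:, 1]  # a view onto the middle column, no data is copied
column += 5          # in-place update through the view also changes `base`

copied = base[:, 1].clone()  # clone() makes an independent copy
copied += 1                  # this does not touch `base`

print(base)
print(copied)
```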

**Joining tensors**

You can use `torch.cat` to concatenate a sequence of tensors along a given dimension.

In [None]:
tensor1 = torch.rand(2, 3)
tensor2 = torch.rand(2, 3)
print("Initial tensors:")
print(tensor1)
print(tensor2)
print("\nCombine along dimension 0")
combined = torch.cat([tensor1, tensor2], dim=0)
print(combined)
print("\nCombine along dimension 1")
combined = torch.cat([tensor1, tensor2], dim=1)
print(combined)

We can also combine tensors and create a new dimension in the process with `torch.stack`:

In [None]:
tensor1 = torch.rand(2, 3)
tensor2 = torch.rand(2, 3)
print("Initial tensors:")
print(tensor1)
print(tensor2)
print("\nStacked")
combined = torch.stack([tensor1, tensor2], dim=0)
print(combined)
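
The difference between `torch.cat` and `torch.stack` is easiest to see in the resulting shapes. A minimal sketch, using two made-up 2x3 tensors:

```python
import torch

a = torch.zeros(2, 3)
b = torch.ones(2, 3)

cat_rows = torch.cat([a, b], dim=0)   # joins along an existing dimension: (4, 3)
cat_cols = torch.cat([a, b], dim=1)   # joins along an existing dimension: (2, 6)
stacked = torch.stack([a, b], dim=0)  # creates a new leading dimension: (2, 2, 3)

print(cat_rows.shape, cat_cols.shape, stacked.shape)
```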

**Arithmetic operations**



In [None]:
# This computes the matrix multiplication between two tensors. y1, y2, y3 will have the same value
y1 = tensor @ tensor.T
y2 = tensor.matmul(tensor.T)

y3 = torch.rand_like(tensor)
torch.matmul(tensor, tensor.T, out=y3)


# This computes the element-wise product. z1, z2, z3 will have the same value
z1 = tensor * tensor
z2 = tensor.mul(tensor)

z3 = torch.rand_like(tensor)
torch.mul(tensor, tensor, out=z3)
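
Since the tensors above are random, it can help to check the two kinds of product by hand on a tiny example. This sketch uses a fixed 2x2 matrix of our own choosing:

```python
import torch

m = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])

matrix_product = m @ m.T  # rows of m dotted with rows of m
elementwise = m * m       # each entry multiplied by itself

print(matrix_product)  # [[5, 11], [11, 25]]
print(elementwise)     # [[1, 4], [9, 16]]
```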

**Single-element tensors** If you have a one-element tensor, for example by aggregating all
values of a tensor into one value, you can convert it to a Python
numerical value using ``item()``:



In [None]:
agg = tensor.sum()
agg_item = agg.item()
print(agg_item, type(agg_item))

--------------





### Connect with NumPy

It is possible to access the same memory as both a PyTorch tensor and a NumPy array. This allows the data to be modified by either library.


In [None]:
t = torch.ones(5)
print(f"t: {t} (PyTorch tensor)")
n = t.numpy()
print(f"n: {n} (NumPy array)")

print("\nAdd one to the PyTorch tensor and we now have:")
t.add_(1)
print(f"t: {t}")
print(f"n: {n}")
print("\nAdd one to the NumPy array and we now have:")
np.add(n, 1, out=n)
print(f"t: {t}")
print(f"n: {n}")

print("\nWe can go the other way too:")
n = np.ones(5)
t = torch.from_numpy(n)
print(f"t: {t}")
print(f"n: {n}")
np.add(n, 1, out=n)
print(f"t: {t}")
print(f"n: {n}")
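
Not every conversion shares memory. As a small sketch of the difference: `torch.from_numpy` shares memory with the array, while `torch.tensor` copies the data, so later changes to the array are only visible through the former.

```python
import numpy as np
import torch

n = np.zeros(3)
shared = torch.from_numpy(n)  # shares memory with the NumPy array
copied = torch.tensor(n)      # copies the data into a new tensor

n[0] = 7.0  # modify the NumPy array in place

print(shared)  # the change is visible here
print(copied)  # the copy is unchanged
```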

# Datasets & DataLoaders


Code for processing data samples can get messy and hard to maintain; we ideally want our dataset code
to be decoupled from our model training code for better readability and modularity.
PyTorch provides two data primitives: `torch.utils.data.DataLoader` and `torch.utils.data.Dataset`
that allow you to use pre-loaded datasets as well as your own data.
`Dataset` stores the samples and their corresponding labels, and `DataLoader` wraps an iterable around
the `Dataset` to enable easy access to the samples.

PyTorch domain libraries provide a number of pre-loaded datasets (such as AG_NEWS) that
subclass `torch.utils.data.Dataset` and implement functions specific to the particular data.
They can be used to prototype and benchmark your model. You can find them
here: [Image Datasets](https://pytorch.org/vision/stable/datasets.html),
[Text Datasets](https://pytorch.org/text/stable/datasets.html), and
[Audio Datasets](https://pytorch.org/audio/stable/datasets.html)




## Loading a Dataset

We are going to see an example of a text dataset in PyTorch. First, we need to install two libraries (note that the order in which they are installed is important):

In [9]:
!pip3 install "portalocker>=2.0.0"
!pip3 install torchtext

Collecting torchtext
  Downloading torchtext-0.17.1-cp39-cp39-manylinux1_x86_64.whl (2.0 MB)
Collecting tqdm
  Downloading tqdm-4.66.2-py3-none-any.whl (78 kB)
Collecting torchdata==0.7.1
  Downloading torchdata-0.7.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.7 MB)
Installing collected packages: tqdm, torchdata, torchtext
Successfully installed torchdata-0.7.1 torchtext-0.17.1 tqdm-4.66.2


Here is an example of how to load the [AG News](https://huggingface.co/datasets/ag_news) dataset from TorchText.
AG News is a dataset of articles from four categories: “World”, “Sports”, “Business”, and “Sci/Tech”.

We load the [AG News Dataset](https://pytorch.org/text/stable/datasets.html#ag-news) with the following parameters:
 - `root` is the path where the train/test data is stored
 - `split` specifies whether to get the training or test data

In [10]:
from torch.utils.data import Dataset
from torchtext.datasets import AG_NEWS

training_data = AG_NEWS(
    root = '.data',
    split = 'train'
)

test_data = AG_NEWS(
    root = '.data',
    split = 'test'
)

# Get one element from the data
next_train = iter(training_data)
print(next(next_train))

(3, "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.")


## Creating a Custom Dataset for your files

We are going to load the same data as from Lab 2 (https://edstem.org/au/courses/14541/lessons/50326/slides/340219), this time into PyTorch, rather than scikit-learn.

A custom Dataset class must implement three functions: `__init__`, `__len__`, and `__getitem__`.

In the next few sections, we'll explain each function so you can implement the data reader at the bottom of this section.

### `__init__`

The `__init__` function is run once when instantiating the Dataset object. We read the data file and store the relevant parts. We also record some transforms, which will be covered in more detail in the next section.

In [12]:
def __init__(self, json_file, transform=None, target_transform=None):
    self.labels = []
    self.inputs = []
    # Each line of the file is parsed as one JSON object (`json` is imported in the complete class below)
    with open(json_file) as f:
        for line in f:
            data = json.loads(line.strip())
            for msg, tlabel in zip(data['messages'], data['sender_labels']):
                self.inputs.append(msg)
                self.labels.append(tlabel)
    self.transform = transform
    self.target_transform = target_transform

### `__len__`

The `__len__` function returns the number of samples in our dataset.



In [13]:
def __len__(self):
    return len(self.labels)

### `__getitem__`

The `__getitem__` function loads and returns a sample from the dataset at the given index `idx`.
Based on the index, it gets the data, calls the transform functions on them (if applicable), and returns the
text and a corresponding label in a tuple.

In many cases, there is further preprocessing that occurs here, so we can provide a tensor rather than strings.

In [14]:
def __getitem__(self, idx):
    text = self.inputs[idx]
    label = self.labels[idx]
    if self.transform:
        text = self.transform(text)
    if self.target_transform:
        label = self.target_transform(label)
    return text, label

### The complete data loader

Below we have put these pieces together to make the dataset loader.

In [15]:
import json

class CustomTextDataset(Dataset):
    def __init__(self, json_file, transform=None, target_transform=None):
        self.labels = []
        self.inputs = []
        # Each line of the file is parsed as one JSON object
        with open(json_file) as f:
            for line in f:
                data = json.loads(line.strip())
                for msg, tlabel in zip(data['messages'], data['sender_labels']):
                    self.inputs.append(msg)
                    self.labels.append(tlabel)
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        text = self.inputs[idx]
        label = self.labels[idx]
        if self.transform:
            text = self.transform(text)
        if self.target_transform:
            label = self.target_transform(label)
        return text, label
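
To see how the three methods are used without needing the Lab 2 file, here is a sketch with a hypothetical in-memory dataset. `ToyTextDataset`, its texts, and its labels are made up for illustration. It has the same three methods as the class above, but takes lists instead of a file path.

```python
from torch.utils.data import Dataset, DataLoader


class ToyTextDataset(Dataset):
    """Hypothetical in-memory dataset with the same three methods."""

    def __init__(self, texts, labels, transform=None, target_transform=None):
        self.inputs = list(texts)
        self.labels = list(labels)
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        text = self.inputs[idx]
        label = self.labels[idx]
        if self.transform:
            text = self.transform(text)
        if self.target_transform:
            label = self.target_transform(label)
        return text, label


toy = ToyTextDataset(
    ["Hello there", "See you SOON", "Good luck"],
    [True, False, True],
    transform=str.lower,    # lowercase each text
    target_transform=int,   # turn True/False into 1/0
)

print(len(toy))  # calls __len__
print(toy[1])    # calls __getitem__

# A DataLoader calls the same two methods to build batches
loader = DataLoader(toy, batch_size=2, shuffle=False)
texts, labels = next(iter(loader))
print(list(texts), labels)
```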

## Preparing the AG News data for training with DataLoaders

Now let's return to the AG News data.
The `Dataset` retrieves our dataset's features and labels one sample at a time. While training a model, we typically want to
pass samples in "minibatches", reshuffle the data at every epoch to reduce model overfitting, and use Python's `multiprocessing` to speed up data retrieval.

`DataLoader` is an iterable that abstracts this complexity for us in an easy API.



In [16]:
from torch.utils.data import DataLoader

training_data = AG_NEWS(
    root = '.data',
    split = 'train'
)

train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True)

## Iterate through the DataLoader

We have loaded that dataset into the ``DataLoader`` and can iterate through the dataset as needed.
Each iteration below returns a batch of ``training_labels`` and ``training_texts`` (each containing ``batch_size=64`` items).
Because we specified ``shuffle=True``, after we iterate over all batches the data is shuffled (for finer-grained control over
the data loading order, take a look at [Samplers](https://pytorch.org/docs/stable/data.html#data-loading-order-and-sampler)).



In [17]:
training_labels, training_texts = next(iter(train_dataloader))
print("First text:")
print(training_texts[0])
print("\nFirst label:")
print(training_labels[0])
print("\nNumber of labels:")
print(training_labels.size())

First text:
Krispy Kreme operating chief quits WINSTON-SALEM, N.C. -- Krispy Kreme Doughnuts Inc.'s John Tate quit as chief operating officer amid a government probe of the chain's accounting and earnings forecasts.

First label:
tensor(3)

Number of labels:
torch.Size([64])
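
Batching is easier to see on a tiny dataset. The sketch below uses `TensorDataset` with ten made-up samples instead of AG News. With `batch_size=4`, the loader yields batches of 4, 4 and 2 samples, and every sample appears exactly once per pass.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.arange(10).float().unsqueeze(1)  # 10 samples with 1 feature each
labels = torch.arange(10) % 2                     # made-up binary labels

dataset = TensorDataset(features, labels)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_sizes = []
seen = []
for batch_features, batch_labels in loader:
    batch_sizes.append(len(batch_labels))
    seen.extend(batch_features.squeeze(1).tolist())

print(batch_sizes)   # the last batch is smaller
print(sorted(seen))  # every sample is visited once
```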


--------------




### Prepare data processing pipelines

Our data is currently strings, but a neural network needs numbers. To convert to numbers we will create a vocabulary, which maps from tokens to integer IDs.

The first step is to build a vocabulary with the raw training dataset. Here we use the built-in factory function `build_vocab_from_iterator`, which accepts an iterator that yields lists (or iterators) of tokens. Users can also pass special symbols to be added to the vocabulary.

In [None]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("basic_english")
train_iter = AG_NEWS(split="train")

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

print(vocab(['here', 'is', 'an', 'example']))

Using this vocabulary, we can define functions that will convert strings into numbers: a list of token IDs for the text, and an integer class index for the label.

In [None]:
def text_pipeline(text):
    return vocab(tokenizer(text))

def label_pipeline(label):
    return int(label) - 1

print(text_pipeline('here is the an example'))
print(label_pipeline('10'))
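
To make the idea of a vocabulary concrete, here is a simplified pure-Python sketch. It is not how torchtext implements it. `build_toy_vocab` and `encode` are hypothetical helpers that map tokens to integer IDs and fall back to the `<unk>` ID for unseen tokens.

```python
from collections import Counter


def build_toy_vocab(token_lists, unk_token="<unk>"):
    """Assign ID 0 to the unknown token, then IDs by descending frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Ties in frequency are broken alphabetically (an arbitrary choice for this sketch)
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate([unk_token] + ordered)}


def encode(tokens, vocab, unk_token="<unk>"):
    """Convert tokens to IDs, using the unknown-token ID for unseen tokens."""
    return [vocab.get(tok, vocab[unk_token]) for tok in tokens]


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
toy_vocab = build_toy_vocab(corpus)

print(toy_vocab)
print(encode(["the", "cat", "flew"], toy_vocab))  # "flew" is unseen
```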

### Generate data batch and iterator

The data produced by the `DataLoader` above is not ready to be used by a model because all the words are strings. Now, with the preprocessing functions, we can change that.

Before sending data to the model, the `collate_fn` function works on a batch of samples generated by the `DataLoader`. The input to `collate_fn` is a batch of data whose size is the `DataLoader`'s batch size, and `collate_fn` processes it according to the data processing pipelines declared previously. Make sure that `collate_fn` is declared as a top-level `def`, which ensures that the function is available in each worker.

In this example, the text entries in the original data batch input are packed into a list and concatenated as a single tensor for the input of `nn.EmbeddingBag`. The offset is a tensor of delimiters to represent the beginning index of the individual sequence in the text tensor. Label is a tensor saving the labels of individual text entries.

In [None]:
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for _label, _text in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)

train_iter = AG_NEWS(split="train")
dataloader = DataLoader(
    train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch
)
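
The offsets are the least obvious part of `collate_batch`. The sketch below repeats the same calculation on three made-up lists of token IDs (lengths 3, 2 and 4), so no dataset or vocabulary is needed. The sequences start at positions 0, 3 and 5 of the concatenated tensor.

```python
import torch

token_id_lists = [[4, 7, 9], [2, 5], [1, 3, 6, 8]]  # made-up token IDs

lengths = [0]
pieces = []
for ids in token_id_lists:
    piece = torch.tensor(ids, dtype=torch.int64)
    pieces.append(piece)
    lengths.append(piece.size(0))

# Drop the last length, then take the running total to get each start index
offsets = torch.tensor(lengths[:-1]).cumsum(dim=0)
flat_text = torch.cat(pieces)

print(flat_text)  # all token IDs in one 1-D tensor
print(offsets)    # where each sequence begins
```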

# Build the Neural Network

Neural networks are comprised of layers/modules that perform operations on data.
The [torch.nn](https://pytorch.org/docs/stable/nn.html) namespace provides all the building blocks you need to
build your own neural network. Every module in PyTorch subclasses the [nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html).
A neural network is a module itself that consists of other modules (layers). This nested structure allows for
building and managing complex architectures easily.

In the following sections, we'll build a neural network to classify text in the AG News dataset.

The model is composed of the nn.EmbeddingBag layer plus a linear layer for the classification purpose. nn.EmbeddingBag with the default mode of “mean” computes the mean value of a “bag” of embeddings. Although the text entries here have different lengths, nn.EmbeddingBag module requires no padding here since the text lengths are saved in offsets.

Additionally, since nn.EmbeddingBag accumulates the average across the embeddings on the fly, it can improve performance and memory efficiency when processing a sequence of tensors.
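
As a small check of what "mean of a bag" means, this sketch (with a made-up vocabulary size of 10 and embedding dimension of 4) compares the output of `nn.EmbeddingBag` with the average of the corresponding rows of its weight matrix, computed by hand:

```python
import torch
from torch import nn

torch.manual_seed(0)
bag = nn.EmbeddingBag(10, 4, mode="mean")  # 10 tokens in the vocabulary, 4-dimensional vectors

# Two sequences packed into one tensor: [1, 2, 3] and [4, 5]
flat_text = torch.tensor([1, 2, 3, 4, 5])
offsets = torch.tensor([0, 3])

output = bag(flat_text, offsets)  # one vector per sequence: shape (2, 4)

# The same result computed by hand from the embedding weights
manual_first = bag.weight[flat_text[0:3]].mean(dim=0)
manual_second = bag.weight[flat_text[3:5]].mean(dim=0)

print(output.shape)
print(torch.allclose(output[0], manual_first), torch.allclose(output[1], manual_second))
```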

In [None]:
import os
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchtext import datasets

## Get Device for Training
We want to be able to train our model on a hardware accelerator like the GPU or MPS,
if available. Let's check to see if [torch.cuda](https://pytorch.org/docs/stable/notes/cuda.html)
or [torch.backends.mps](https://pytorch.org/docs/stable/notes/mps.html) are available, otherwise we use the CPU.



In [None]:
device = 'cpu'
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
print(f"Using {device} device")

## Define the Class
We define our neural network by subclassing ``nn.Module``, and
initialize the neural network layers in ``__init__``. Every ``nn.Module`` subclass implements
the operations on input data in the ``forward`` method.



In [None]:
class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        # Create the embedding, which goes from token IDs to a vector, which is the mean of word vectors
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
        # Create a linear layer, which is a matrix of weights we multiply the embedding by to get scores
        self.fc = nn.Linear(embed_dim, num_class)
        # Call the function below to initialise the values of the weights
        self.init_weights()

    def init_weights(self):
        # This function sets the starting / initial value of the weights for each part of the model
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        # This function does the actual computation. When text comes in as a set of token IDs,
        # it runs the embedding, then the linear layer, to get scores for each possible label
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

We build a model with the embedding dimension of 64. The vocab size is equal to the length of the vocabulary instance. The number of classes is equal to the number of labels. We also set a seed for the random number generator so that results are consistent across students:

In [None]:
train_iter = AG_NEWS(split="train")
num_class = len(set([label for (label, text) in train_iter]))
vocab_size = len(vocab)
emsize = 64
torch.manual_seed(0)
model = TextClassificationModel(vocab_size, emsize, num_class).to(device)

To use the model, we pass it the input data. This executes the model's ``forward``,
along with some [background operations](https://github.com/pytorch/pytorch/blob/270111b7b611d174967ed204776985cefca9c144/torch/nn/modules/module.py#L866).
Do not call ``model.forward()`` directly!

Calling the model on the input returns a 2-dimensional tensor, with dim=0 corresponding to each input in the batch and dim=1 corresponding to the raw predicted values (logits) for each of the 4 classes.
We get the prediction probabilities by passing it through an instance of the ``nn.Softmax`` module.

We haven't trained the model yet, so the code below will run, but its output will be somewhat random.

In [None]:
X = torch.tensor(text_pipeline("This is a sample sentence"), dtype=torch.int64)
X = X.to(device)
offsets = torch.tensor([0])
offsets = offsets.to(device)
logits = model(X, offsets)
pred_probab = nn.Softmax(dim=1)(logits)
y_pred = pred_probab.argmax(1)
print(f"Predicted class: {y_pred}")
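
To see what the softmax step does in isolation, here is a sketch with hand-picked logits for a single input and four classes (the numbers are made up). The probabilities sum to 1, and the largest logit gives the predicted class.

```python
import torch
from torch import nn

logits = torch.tensor([[2.0, 1.0, 0.1, -1.0]])  # one input, four class scores

probabilities = nn.Softmax(dim=1)(logits)  # normalise across the class dimension
predicted = probabilities.argmax(dim=1)    # index of the most likely class

print(probabilities)
print(probabilities.sum(dim=1))
print(predicted)
```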

--------------




## Model Layers

Let's break down the layers in the classification model. To illustrate it, we
will take a sample text input and see what happens as we pass it through the network.


In [None]:
input_text = "This is a sample"
processed_text = torch.tensor(text_pipeline(input_text), dtype=torch.int64)
print(processed_text.size())

### nn.EmbeddingBag
We initialize the [nn.EmbeddingBag](https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html) layer to convert the token IDs of each input into a single vector, the mean of the token embeddings (the minibatch dimension (at dim=0) is maintained).



In [None]:
embed_dim = 128
embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
embedded_text = embedding(processed_text, torch.tensor([0]))

### nn.Linear
The [linear layer](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)
is a module that applies a linear transformation on the input using its stored weights and biases.




In [None]:
linear_layer = nn.Linear(embed_dim, num_class)
after_linear = linear_layer(embedded_text)
print(after_linear.size())
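
The linear transformation can also be written out by hand. As a sketch with made-up sizes (3 input features, 2 outputs), the layer's output matches `x @ W.T + b` computed from its stored weight and bias:

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(3, 2)  # weight has shape (2, 3), bias has shape (2,)
x = torch.rand(4, 3)     # a batch of 4 inputs

from_layer = layer(x)
by_hand = x @ layer.weight.T + layer.bias

print(from_layer.shape)
print(torch.allclose(from_layer, by_hand, atol=1e-6))
```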

## Model Parameters
Many layers inside a neural network are *parameterized*, i.e. have associated weights
and biases that are optimized during training. Subclassing ``nn.Module`` automatically
tracks all fields defined inside your model object, and makes all parameters
accessible using your model's ``parameters()`` or ``named_parameters()`` methods.

In this example, we iterate over each parameter, and print its size and a preview of its values.




In [None]:
print(f"Model structure: {model}\n\n")

for name, param in model.named_parameters():
    print(f"Layer: {name} | Size: {param.size()} | Values : {param[:2]} \n")
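
A common use of ``named_parameters()`` is counting how many values a model has to learn. This sketch uses a hypothetical `TinyModel` with the same two kinds of layers but made-up sizes, so the numbers can be checked by hand: 10 x 4 embedding weights, 2 x 4 linear weights, and 2 biases.

```python
from torch import nn


class TinyModel(nn.Module):
    """Hypothetical small model: an embedding bag followed by a linear layer."""

    def __init__(self):
        super().__init__()
        self.embedding = nn.EmbeddingBag(10, 4)
        self.fc = nn.Linear(4, 2)

    def forward(self, text, offsets):
        return self.fc(self.embedding(text, offsets))


tiny = TinyModel()
sizes = {name: param.numel() for name, param in tiny.named_parameters()}
total = sum(sizes.values())

print(sizes)
print(total)
```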

--------------

# Automatic Differentiation with ``torch.autograd``

When training neural networks, the most frequently used algorithm is
**back propagation**. In this algorithm, parameters (model weights) are
adjusted according to the **gradient** of the loss function with respect
to the given parameter.

To compute those gradients, PyTorch has a built-in differentiation engine
called ``torch.autograd``. It supports automatic computation of gradients for any
computational graph.

Consider the simplest one-layer neural network, with input ``x``,
parameters ``w`` and ``b``, and some loss function. It can be defined in
PyTorch in the following manner:


In [None]:
import torch

x = torch.ones(5)  # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w)+b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

## Tensors, Functions and Computational graph

This code defines a small **computational graph**. In this network, ``w`` and ``b`` are **parameters**, which we need to
optimize. Thus, we need to be able to compute the gradients of loss
function with respect to those variables. In order to do that, we set
the ``requires_grad`` property of those tensors.



<div class="alert alert-info"><h4>Note</h4><p>You can set the value of ``requires_grad`` when creating a
          tensor, or later by using the ``x.requires_grad_(True)`` method.</p></div>



A function that we apply to tensors to construct a computational graph is
in fact an object of class ``Function``. This object knows how to
compute the function in the *forward* direction, and also how to compute
its derivative during the *backward propagation* step. A reference to
the backward propagation function is stored in the ``grad_fn`` property of a
tensor. You can find more information about ``Function`` [in the
documentation](https://pytorch.org/docs/stable/autograd.html#function).




In [None]:
print(f"Gradient function for z = {z.grad_fn}")
print(f"Gradient function for loss = {loss.grad_fn}")

## Computing Gradients

To optimize the weights of the neural network, we need to
compute the derivatives of our loss function with respect to parameters,
namely, we need $\frac{\partial loss}{\partial w}$ and
$\frac{\partial loss}{\partial b}$ under some fixed values of
``x`` and ``y``. To compute those derivatives, we call
``loss.backward()``, and then retrieve the values from ``w.grad`` and
``b.grad``:




In [None]:
loss.backward()
print(w.grad)
print(b.grad)
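
We can sanity-check these gradients against the formula worked out by hand. For binary cross-entropy with logits, averaged over the $n$ outputs, the derivative with respect to each logit is $(\sigma(z_i) - y_i)/n$. The gradient for ``b`` is that vector, and the gradient for ``w`` is its outer product with ``x``. The sketch below rebuilds the same small network (with a fixed seed, which is our addition) and compares the two:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.ones(5)   # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w) + b
loss = F.binary_cross_entropy_with_logits(z, y)
loss.backward()

# Gradients derived by hand, computed without tracking
with torch.no_grad():
    dloss_dz = (torch.sigmoid(z) - y) / y.numel()
    expected_b_grad = dloss_dz
    expected_w_grad = torch.outer(x, dloss_dz)

print(torch.allclose(b.grad, expected_b_grad, atol=1e-6))
print(torch.allclose(w.grad, expected_w_grad, atol=1e-6))
```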

<div class="alert alert-info"><h4>Note</h4><p>- We can only obtain the ``grad`` properties for the leaf
    nodes of the computational graph, which have ``requires_grad`` property
    set to ``True``. For all other nodes in our graph, gradients will not be
    available.
  - We can only perform gradient calculations using
    ``backward`` once on a given graph, for performance reasons. If we need
    to do several ``backward`` calls on the same graph, we need to pass
    ``retain_graph=True`` to the ``backward`` call.</p></div>
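
A minimal sketch of both points in the note, using a made-up two-element parameter rather than the tensors from this lab. It also shows that a second ``backward`` call adds to the stored gradient instead of replacing it.

```python
import torch

w_leaf = torch.tensor([1.0, 2.0], requires_grad=True)
toy_loss = (w_leaf ** 2).sum()

# Keep the graph alive so that backward can be called a second time.
toy_loss.backward(retain_graph=True)
first = w_leaf.grad.clone()    # d(loss)/dw = 2 * w

# The second call accumulates into w_leaf.grad.
toy_loss.backward()
second = w_leaf.grad.clone()

print(first)
print(second)

# Only leaf tensors keep a .grad by default; toy_loss is not a leaf.
print(w_leaf.is_leaf, toy_loss.is_leaf)
```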




## Disabling Gradient Tracking

By default, all tensors with ``requires_grad=True`` track their
computational history and support gradient computation. However, there
are some cases when we do not need this, for example, when we have
trained the model and just want to apply it to some input data, i.e. we
only want to do *forward* computations through the network. We can stop
tracking computations by surrounding our computation code with a
``torch.no_grad()`` block:




In [None]:
z = torch.matmul(x, w)+b
print(z.requires_grad)

with torch.no_grad():
    z = torch.matmul(x, w)+b
print(z.requires_grad)

Another way to achieve the same result is to use the ``detach()`` method
on the tensor:




In [None]:
z = torch.matmul(x, w)+b
z_det = z.detach()
print(z_det.requires_grad)

There are two main reasons you might want to disable gradient tracking:
  - To mark some parameters in your neural network as **frozen parameters**.
  - To **speed up computations** when you are only doing a forward pass, because computations on tensors that do
    not track gradients are more efficient.
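
Freezing parameters can be sketched as below. The two-layer network ``frozen_demo`` is hypothetical and stands in for any model whose first layer should stay fixed.

```python
import torch
from torch import nn

# Hypothetical model, used only to illustrate freezing.
frozen_demo = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 2))

# Freeze the first layer: autograd will not compute gradients for it.
for param in frozen_demo[0].parameters():
    param.requires_grad_(False)

output = frozen_demo(torch.randn(3, 4)).sum()
output.backward()

print(frozen_demo[0].weight.grad)        # frozen layer: no gradient stored
print(frozen_demo[2].weight.grad.shape)  # trainable layer: gradient available
```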



## More on Computational Graphs
Conceptually, autograd keeps a record of data (tensors) and all executed
operations (along with the resulting new tensors) in a directed acyclic
graph (DAG) consisting of
[Function](https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function)
objects. In this DAG, leaves are the input tensors and roots are the output
tensors. By tracing this graph from roots to leaves, you can
automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:

- run the requested operation to compute a resulting tensor
- maintain the operation’s *gradient function* in the DAG.

The backward pass kicks off when ``.backward()`` is called on the DAG
root. ``autograd`` then:

- computes the gradients from each ``.grad_fn``,
- accumulates them in the respective tensor’s ``.grad`` attribute
- using the chain rule, propagates all the way to the leaf tensors.

<div class="alert alert-info"><h4>Note</h4><p>**DAGs are dynamic in PyTorch**
  An important thing to note is that the graph is recreated from scratch; after each
  ``.backward()`` call, autograd starts populating a new graph. This is
  exactly what allows you to use control flow statements in your model;
  you can change the shape, size and operations at every iteration if
  needed.</p></div>
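
The effect of a dynamic graph can be seen in a small sketch where ordinary Python control flow decides how many operations are recorded. The function name ``scale_until_large`` and the threshold of 10 are made up for this illustration.

```python
import torch

def scale_until_large(x, w):
    # The number of multiplications depends on the data, so each call
    # records a graph with a different number of nodes.
    out, steps = x, 0
    while out.abs().sum() < 10 and steps < 5:
        out = out * w
        steps += 1
    return out.sum(), steps

w_dyn = torch.tensor(2.0, requires_grad=True)

# Starting from 1: 1 -> 2 -> 4 -> 8 -> 16, i.e. four multiplications (w^4).
loss_a, steps_a = scale_until_large(torch.tensor([1.0]), w_dyn)
loss_a.backward()
grad_a = w_dyn.grad.clone()   # d(w^4)/dw = 4 * w^3

# Starting from 3: 3 -> 6 -> 12, i.e. two multiplications (3 * w^2).
w_dyn.grad.zero_()
loss_b, steps_b = scale_until_large(torch.tensor([3.0]), w_dyn)
loss_b.backward()
grad_b = w_dyn.grad.clone()   # d(3 * w^2)/dw = 6 * w

print(steps_a, grad_a.item())
print(steps_b, grad_b.item())
```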



--------------




# Optimizing Model Parameters

Now that we have a model and data, it's time to train, validate and test our model by optimizing its parameters on
our data. Training a model is an iterative process: in each iteration the model makes a guess about the output, calculates
the error in its guess (*loss*), collects the derivatives of the error with respect to its parameters, and **optimizes** these parameters using gradient descent.


## Hyperparameters

Hyperparameters are adjustable parameters that let you control the model optimization process.
Different hyperparameter values can impact model training and convergence rates
([read more](https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html) about hyperparameter tuning).

We define the following hyperparameters for training:
 - **Number of Epochs** - the number of times to iterate over the dataset
 - **Batch Size** - the number of data samples propagated through the network before the parameters are updated
 - **Learning Rate** - how much to update the model's parameters at each batch/epoch. Smaller values yield slow learning, while large values may result in unpredictable behavior during training.
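
To make the relationship between these values concrete, the short calculation below counts parameter updates. The dataset size is hypothetical and is not the size of the dataset used in this lab.

```python
import math

# Hypothetical dataset size, chosen only to make the arithmetic concrete.
num_samples = 10_000
batch_size = 64
epochs = 5

# The last batch of an epoch may be smaller than batch_size, hence the ceiling.
updates_per_epoch = math.ceil(num_samples / batch_size)
total_updates = updates_per_epoch * epochs

print(updates_per_epoch)  # 157
print(total_updates)      # 785
```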




In [None]:
learning_rate = 5
batch_size = 64
epochs = 5

## Optimization Loop

Once we set our hyperparameters, we can then train and optimize our model with an optimization loop. Each
iteration of the optimization loop is called an **epoch**.

Each epoch consists of two main parts:
 - **The Train Loop** - iterate over the training dataset and try to converge to optimal parameters.
 - **The Validation/Test Loop** - iterate over the test dataset to check if model performance is improving.

Let's briefly familiarize ourselves with some of the concepts used in the training loop. Jump ahead to
the Full Implementation section to see the complete optimization loop.

### Loss Function

When presented with some training data, our untrained network is unlikely to give the correct
answer. A **loss function** measures how far the obtained result is from the target value,
and it is the loss function that we want to minimize during training. To calculate the loss, we make a
prediction using the inputs of a given data sample and compare it against the true label.

Common loss functions include [nn.MSELoss](https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss) (Mean Square Error) for regression tasks, and
[nn.NLLLoss](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html#torch.nn.NLLLoss) (Negative Log Likelihood) for classification.
[nn.CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss) combines `nn.LogSoftmax` and `nn.NLLLoss`.

We pass our model's output logits to `torch.nn.CrossEntropyLoss`, which will normalize the logits and compute the prediction error.
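
The claim that `nn.CrossEntropyLoss` combines `nn.LogSoftmax` and `nn.NLLLoss` can be checked numerically. The logits and targets below are made up for the check and have nothing to do with the lab's dataset.

```python
import torch
from torch import nn

torch.manual_seed(0)
demo_logits = torch.randn(4, 3)            # 4 samples, 3 classes (made-up sizes)
demo_targets = torch.tensor([0, 2, 1, 2])  # class indices

# One step: cross-entropy applied directly to the raw logits.
ce = nn.CrossEntropyLoss()(demo_logits, demo_targets)

# Two steps: log-softmax over the class dimension, then negative log likelihood.
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(demo_logits), demo_targets)

print(torch.allclose(ce, nll))
```

This is why the model should output raw logits when `nn.CrossEntropyLoss` is used: applying a softmax beforehand would normalize the scores twice.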



In [None]:
# Initialize the loss function
loss_function = torch.nn.CrossEntropyLoss()

### Optimizer

Optimization is the process of adjusting model parameters to reduce model error in each training step. **Optimization algorithms** define how this process is performed (in this example we use Stochastic Gradient Descent).
All optimization logic is encapsulated in the ``optimizer`` object. Here, we use the SGD optimizer; PyTorch also provides many [different optimizers](https://pytorch.org/docs/stable/optim.html),
such as Adam and RMSprop, which work better for different kinds of models and data.

We initialize the optimizer by registering the model's parameters that need to be trained, and passing in the learning rate hyperparameter.



In [None]:
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

Inside the training loop, optimization happens in three steps:
 * Call ``optimizer.zero_grad()`` to reset the gradients of model parameters. Gradients by default add up; to prevent double-counting, we explicitly zero them at each iteration.
 * Backpropagate the prediction loss with a call to ``loss.backward()``. PyTorch deposits the gradients of the loss w.r.t. each parameter.
 * Once we have our gradients, we call ``optimizer.step()`` to adjust the parameters by the gradients collected in the backward pass.
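
The three steps can be sketched on a toy regression problem. Everything here (the data, the one-layer model, the learning rate of 0.1) is an assumption made for the illustration, and the `toy_` names are used to avoid overwriting the lab's own `model` and `optimizer`.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy data following y = 2x.
toy_inputs = torch.linspace(-1, 1, 16).unsqueeze(1)
toy_targets = 2 * toy_inputs

toy_model = nn.Linear(1, 1)
toy_loss_fn = nn.MSELoss()
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)

losses = []
for step in range(200):
    toy_optimizer.zero_grad()                              # 1. reset gradients
    step_loss = toy_loss_fn(toy_model(toy_inputs), toy_targets)
    step_loss.backward()                                   # 2. backpropagate
    toy_optimizer.step()                                   # 3. update parameters
    losses.append(step_loss.item())

print(losses[0], losses[-1])
```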



## Full Implementation
We define `train_loop` that loops over our optimization code, and `test_loop` that
evaluates the model's performance against our test data.

We'll also redefine our data loading code here to make sure it is configured correctly (sometimes in the process of experimenting above, it gets modified).


In [None]:
import time

from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for _label, _text in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)

train_iter = AG_NEWS(split="train")
train_dataloader = DataLoader(
    train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch
)
test_iter = AG_NEWS(split="test")
test_dataloader = DataLoader(
    test_iter, batch_size=8, shuffle=False, collate_fn=collate_batch
)

def train_loop(dataloader, epoch, loss_function, optimizer, model):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = loss_function(predicted_label, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print(
                "| epoch {:3d} | {:5d}/{:5d} batches "
                "| accuracy {:8.3f}".format(
                    epoch, idx, len(dataloader), total_acc / total_count
                )
            )
            total_acc, total_count = 0, 0
            start_time = time.time()


def test_loop(dataloader, loss_function, model):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            loss = loss_function(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc / total_count

We initialize the loss function and optimizer, and pass them to ``train_loop`` and ``test_loop``.
Feel free to increase the number of epochs to track the model's improving performance.



In [None]:
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset

# Hyperparameters
learning_rate = 5
batch_size = 64
epochs = 5

loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu = None
train_iter, test_iter = AG_NEWS()
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)
num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = random_split(
    train_dataset, [num_train, len(train_dataset) - num_train]
)

train_dataloader = DataLoader(
    split_train_, batch_size=batch_size, shuffle=True, collate_fn=collate_batch
)
valid_dataloader = DataLoader(
    split_valid_, batch_size=batch_size, shuffle=True, collate_fn=collate_batch
)
test_dataloader = DataLoader(
    test_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_batch
)

for epoch in range(1, epochs + 1):
    epoch_start_time = time.time()
    train_loop(train_dataloader, epoch, loss_function, optimizer, model)
    accu_val = test_loop(valid_dataloader, loss_function, model)
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val
    print("-" * 59)
    print(
        "| end of epoch {:3d} | time: {:5.2f}s | "
        "valid accuracy {:8.3f} ".format(
            epoch, time.time() - epoch_start_time, accu_val
        )
    )
    print("-" * 59)


# Save and Load the Model

Finally, we will look at how to save and load a model.


## Saving and Loading Model Weights
PyTorch models store the learned parameters in an internal
state dictionary, called ``state_dict``. These can be persisted via the ``torch.save``
function:



In [None]:
torch.save(model.state_dict(), 'model_weights.pth')

To load model weights, you need to create an instance of the same model first, and then load the parameters
using the ``load_state_dict()`` method.



In [None]:
new_model = TextClassificationModel(vocab_size, emsize, num_class)  # a fresh, untrained model with the same architecture
new_model.load_state_dict(torch.load('model_weights.pth'))
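
The same round trip can be tried in isolation. This sketch uses a small `nn.Linear` layer as a hypothetical stand-in for the trained model and an in-memory buffer in place of a file on disk.

```python
import io

import torch
from torch import nn

torch.manual_seed(0)
source_layer = nn.Linear(3, 2)   # stand-in for a trained model

# A state_dict maps parameter names to tensors.
print(list(source_layer.state_dict().keys()))

# Save to an in-memory buffer so that nothing is written to disk.
buffer = io.BytesIO()
torch.save(source_layer.state_dict(), buffer)
buffer.seek(0)

# Same architecture, but different random initial weights.
restored_layer = nn.Linear(3, 2)
restored_layer.load_state_dict(torch.load(buffer))

print(torch.equal(source_layer.weight, restored_layer.weight))
```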

If we plan to use our model, but not train it further, then we call `model.eval()`. This is a standard way to tell the model to set itself up for use, rather than training (e.g., it disables dropout). Failing to do this could yield inconsistent outputs.



In [None]:
new_model.eval()
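
The difference between the two modes is easiest to see on a dropout layer by itself. This is a sketch with an arbitrary dropout probability of 0.5; the model in this lab may not contain dropout at all.

```python
import torch
from torch import nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
ones = torch.ones(1000)

# Training mode: elements are zeroed at random and the survivors are
# scaled by 1 / (1 - p), so every output is either 0 or 2.
dropout.train()
train_out = dropout(ones)

# Evaluation mode: dropout does nothing and the input passes through unchanged.
dropout.eval()
eval_out = dropout(ones)

print((train_out == 0).float().mean().item())  # fraction dropped, close to 0.5
print(torch.equal(eval_out, ones))
```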

## Saving and Loading Models with Shapes
When loading model weights, we needed to instantiate the model class first, because the class
defines the structure of a network. We might want to save the structure of this class together with
the model, in which case we can pass `model` (and not `model.state_dict()`) to the save function:



In [None]:
torch.save(model, 'model.pth')

We can then load the model like this:



In [None]:
model = torch.load('model.pth')

This approach uses Python's [pickle](https://docs.python.org/3/library/pickle.html) module to serialize the model, so it relies on the actual class definition being available when the model is loaded.



# Task 1

Adapt the model above to have a hidden layer with 100 dimensions and a tanh activation. The model should:

1. Use an embedding to represent words
2. Use a linear layer to convert them to the hidden dimension
3. Apply a tanh activation function (see [this page](https://pytorch.org/docs/stable/generated/torch.tanh.html#torch.tanh))
4. Use a linear layer to get scores across the possible labels

In your implementation, use `torch.manual_seed(0)` so that the results are easy for us to check.

# Task 2

For the original model, let's look at some properties of the embeddings. You may find `model.embedding.weight.data.tolist()` useful as a way to convert the embeddings into a Python list of lists.

1. What are the minimum and maximum values in the embeddings?
2. If you put the values in buckets of width 0.1 (i.e., 0 to 0.1, 0.1 to 0.2, 0.2 to 0.3, etc.), what is the distribution? (You can be approximate; any form of rounding is fine.)

This type of analysis can reveal if there are strange asymmetries in the weights we are learning.