# Week 5: home assignment

## Assignment structure

- DIY: loss scaling (3 points)
- Efficient batching for language modelling (5 points)
- Profiling of the pipeline (2 points)

Your grade for the assignment is the sum of the points for the sections above. Maximum is 10 points.

## DIY: loss scaling (1.5 + 1.5 points)

Let's use a semantic segmentation pipeline for this section. Your task is to train the model in the AMP mode with loss scaler implemented by you. You **can use** `torch.cuda.amp.autocast` and you **cannot use** `torch.cuda.amp.GradScaler()` (you may only for checking your solution).

Let us remind what loss scaling is. Loss scaling is used to avoid the gradient underflow problem, when computing gradients in FP16 precision. The issue here is that while training in full precision, we might acquire rather small values in the gradients, which will vanish when we cast a tensor to a half precision. To fix the problem the following solution is used:

- make a forward pass for the model and compute the loss
- multiply loss value to some factor
- call `.backward()`
- update model's master weights with **unscaled** FP32 gradients

**Note.** Loss scaling might be done in two different ways: static and dynamic ones. In static mode, you choose a factor for scaling only once and use it for the whole training procedure. In dynamic mode you recompute the factor each time you scale the loss. 

For static scaling you will get **1.5 points**, for dynamic scaling you will get additional **1.5 points**. The task is done if you managed to stably achieve high accuracy values (0.985+) within 5 training epochs. For a start, you can run the training in a full precision mode, then try to run in an AMP mode with and without PyTorch loss scaler. You will observe that adding a scaler gives you additional accuracy points.

**Hint.** To make sure that you're doing everything right, you might want to examine gradients' values: (almost) no zeros must be present there.

In [1]:
# Download and unpack data
!wget https://www.dropbox.com/s/tc1qo73rrm3gt3m/CARVANA.zip  # Carvana dataset
!unzip -q CARVANA.zip
!rm -rf ./train/.DS_Store
!rm -rf ./train_masks/.DS_Store

--2022-03-07 13:47:38--  https://www.dropbox.com/s/tc1qo73rrm3gt3m/CARVANA.zip
Resolving www.dropbox.com (www.dropbox.com)... 162.125.70.18, 2620:100:6026:18::a27d:4612
Connecting to www.dropbox.com (www.dropbox.com)|162.125.70.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/tc1qo73rrm3gt3m/CARVANA.zip [following]
--2022-03-07 13:47:38--  https://www.dropbox.com/s/raw/tc1qo73rrm3gt3m/CARVANA.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucbe7aeba3b62c4d346ece644953.dl.dropboxusercontent.com/cd/0/inline/BhAv_omEPaOc0AnpPe6-LnXmrTrkSKTiXlde0ulfzDxpmSZ516DVmzaadVIDfjED7GbSuJCTRiLaB7Wtk1P3sTnc03CL6ZQHCuMu991StcrEMPVYEQR1VUBi5tbfkGBWtxKvnfDVSkgFnpRfK2ojK7d_/file# [following]
--2022-03-07 13:47:39--  https://ucbe7aeba3b62c4d346ece644953.dl.dropboxusercontent.com/cd/0/inline/BhAv_omEPaOc0AnpPe6-LnXmrTrkSKTiXlde0ulfzDxpmSZ516DVmzaadVIDfjED7GbSuJCTRiLaB7Wtk1P3sTnc03

## Efficient batching for language modelling (1 + 1 + 3 points)

In this part we suggest you examine the efficiency of the three batching approaches we discussed during the seminar. Let us remind you shortly:

**BRAIN**: pad everything to a fixed `max_length`

**BIG BRAIN**: pad only in the `collate_fn`

**ULTRA DUPER BIG BRAIN**: presort data to sample sequences smartly, preserving similar examples length in the batch

___
More formally, we suggest you download [WikiText-103 dataset](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/) and implement all of the mentioned approaches. Use the training part for all of the task's sub-problems.

- For naive batching, you will need to implement a Pytorch Dataset class that will parse training data from the source files of the dataset and pad everything to a `max_length=640` of the training samples. For sequences longer than 640 tokens just truncate the overflowing part. **(1 point)**
- For the second approach, you will need to implement the approach from the seminar for this dataset. More specifically, you needed to pad sequences only up to maximum sample length in the current batch. **(1 point)**
- Finally, for the third approach, you will need to make a small trick. While initializing the dataset, you need to split it into the several bins (let's say, python lists) by samples length. For the task we suggest you uniformly split the samples list sorted by sample length. Conduct experiments for 1, 5, 10, 25, 50 bins. While calling a `__getitem__` method, you firstly sample a bin number, then sample the needed examples number form the bin and pad them with collator from the second subtask. **(3 points)**

For each of the implemented methods mock one training epoch and provide min, max, mean and median batch processing times. Use a `pandas.DataFrame` to display the results in the notebook. For mocking a training epoch we suggest you construct a small GPT-2-like model: use `nn.Embedding` layer, `PositionalEncoding` class from `transformer.py` file and a single `nn.TransformerDecoder` layer with hidden size 1024 and 8 heads. For tokenization use `torchtext.data.utils.get_tokenizer("basic_english")`. Run one epoch **without a backward pass**. Make sure you've [warmed up](https://forums.developer.nvidia.com/t/why-warm-up/48565) GPU before computing the statistics and do not forget about asynchronous CUDA kernels execution.

**Note.** In the third subtask you might want to use (not obligatory) a `batch_sampler` in the data loader. For that, you need to inspect the corresponding Pytorch docs [section](https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler).

## Profiling (2 points)

In this section, you're given a training script for a Transformer model on WikiText2 dataset. Your task is to examine the bottlenecks of the model. You can find the model script in the `transformer.py` file. As you might notice, this is a PyTorch Transformer implementation.

We suppose that in this task you use [PyTorch profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html). However, feel free to use any other profiler that we've discussed. In the training function, you can vary the number of steps that are done during one epoch. We suggest you use only one epoch since our goal is not to train a model but to profile its performance.

To complete the task, provide a detailed description of the model performance:
- Forward pass
    - Inspect PositionalEncoding layer
    - Inspect the Embedding layer
    - Inspect Attention layer (both self attention and projections computations)
- Backward pass
    - How long does it take compared to a forward pass?
    
Provide corresponding profiler's outputs and analyse them. We assume that you will analyse all of the mentioned model parts and other parts if you think it is reasonable (their time consumption is comparable).

In [1]:
%load_ext autoreload

In [2]:
%autoreload 2
# code sourse: https://pytorch.org/tutorials/beginner/transformer_tutorial.html
import math
from typing import Tuple

import torch
from torch import nn, Tensor
import torch.nn.functional as F
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from torch.utils.data import dataset
from torchtext.datasets import WikiText2
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from tqdm.auto import trange

from transformer import generate_square_subsequent_mask, TransformerModel
from torch.profiler import profile, record_function, ProfilerActivity

In [3]:
train_iter = WikiText2(split="train")
tokenizer = get_tokenizer("basic_english")
vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

def data_process(raw_text_iter: dataset.IterableDataset) -> Tensor:
    """Converts raw text into a flat Tensor."""
    data = [torch.tensor(vocab(tokenizer(item)), dtype=torch.long) for item in raw_text_iter]
    return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))

# train_iter was "consumed" by the process of building the vocab,
# so we have to create it again
train_iter, val_iter, test_iter = WikiText2()
train_data = data_process(train_iter)
val_data = data_process(val_iter)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def batchify(data: Tensor, bsz: int) -> Tensor:
    """Divides the data into bsz separate sequences, removing extra elements
    that wouldn't cleanly fit.

    Args:
        data: Tensor, shape [N]
        bsz: int, batch size

    Returns:
        Tensor of shape [N // bsz, bsz]
    """
    seq_len = data.size(0) // bsz
    data = data[:seq_len * bsz]
    data = data.view(bsz, seq_len).t().contiguous()
    return data.to(device)

batch_size = 20
eval_batch_size = 10
train_data = batchify(train_data, batch_size)  # shape [seq_len, batch_size]
val_data = batchify(val_data, eval_batch_size)

In [4]:
bptt = 35
def get_batch(source: Tensor, i: int) -> Tuple[Tensor, Tensor]:
    """
    Args:
        source: Tensor, shape [full_seq_len, batch_size]
        i: int

    Returns:
        tuple (data, target), where data has shape [seq_len, batch_size] and
        target has shape [seq_len * batch_size]
    """
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].reshape(-1)
    return data, target

In [5]:
ntokens = len(vocab)  # size of vocabulary
emsize = 200  # embedding dimension
d_hid = 200  # dimension of the feedforward network model in nn.TransformerEncoder
nlayers = 2  # number of nn.TransformerEncoderLayer in nn.TransformerEncoder
nhead = 2  # number of heads in nn.MultiheadAttention
dropout = 0.2  # dropout probability

model = TransformerModel(ntokens, emsize, nhead, d_hid, nlayers, dropout).to(device)

In [6]:
import copy
import time

criterion = nn.CrossEntropyLoss()
lr = 5.0  # learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95)

def train(model: nn.Module) -> None:
    model.train()  # turn on train mode
    total_loss = 0.
    log_interval = 200
    start_time = time.time()
    src_mask = generate_square_subsequent_mask(bptt).to(device)

    num_batches = len(train_data) // bptt
    i = 0
    for batch in trange(0, train_data.size(0) - 1, bptt, desc="Epoch progress: "):
        data, targets = get_batch(train_data, i)
        batch_size = data.size(0)
        if batch_size != bptt:  # only on last batch
            src_mask = src_mask[:batch_size, :batch_size]
        with record_function("forward"):
            output = model(data, src_mask)
        loss = criterion(output.view(-1, ntokens), targets)

        # feel free to comment out this 
        optimizer.zero_grad()
        with record_function("backward"):
            loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()

        total_loss += loss.item()
        if batch % log_interval == 0 and batch > 0:
            lr = scheduler.get_last_lr()[0]
            ms_per_batch = (time.time() - start_time) * 1000 / log_interval
            cur_loss = total_loss / log_interval
            ppl = math.exp(cur_loss)
            print(f"| epoch {epoch:3d} | {batch:5d}/{num_batches:5d} batches | "
                  f"lr {lr:02.2f} | ms/batch {ms_per_batch:5.2f} | "
                  f"loss {cur_loss:5.2f} | ppl {ppl:8.2f}")
            total_loss = 0
            start_time = time.time()
        i += 1
            

def evaluate(model: nn.Module, eval_data: Tensor) -> float:
    model.eval()  # turn on evaluation mode
    total_loss = 0.
    src_mask = generate_square_subsequent_mask(bptt).to(device)
    with torch.no_grad():
        for i in range(0, eval_data.size(0) - 1, bptt):
            data, targets = get_batch(eval_data, i)
            batch_size = data.size(0)
            if batch_size != bptt:
                src_mask = src_mask[:batch_size, :batch_size]
            output = model(data, src_mask)
            output_flat = output.view(-1, ntokens)
            total_loss += batch_size * criterion(output_flat, targets).item()
    return total_loss / (len(eval_data) - 1)

In [None]:
best_val_loss = float("inf")
epochs = 1
best_model = None

for epoch in range(1, epochs + 1):
    epoch_start_time = time.time()
    
    with profile(activities=[
        ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
        train(model)
    val_loss = evaluate(model, val_data)
    val_ppl = math.exp(val_loss)
    elapsed = time.time() - epoch_start_time
    print("-" * 89)
    print(f"| end of epoch {epoch:3d} | time: {elapsed:5.2f}s | "
          f"valid loss {val_loss:5.2f} | valid ppl {val_ppl:8.2f}")
    print("-" * 89)

#     if val_loss < best_val_loss:
#         best_val_loss = val_loss
#         best_model = copy.deepcopy(model)

#     scheduler.step()

Epoch progress:   0%|          | 0/2929 [00:00<?, ?it/s]

| epoch   1 |  1400/ 2928 batches | lr 5.00 | ms/batch 24.74 | loss  1.53 | ppl     4.62
| epoch   1 |  2800/ 2928 batches | lr 5.00 | ms/batch  4.31 | loss  0.68 | ppl     1.97
| epoch   1 |  4200/ 2928 batches | lr 5.00 | ms/batch  4.51 | loss  0.46 | ppl     1.58


In [None]:
stat = prof.key_averages()

In [10]:
print(stat.table(sort_by="cpu_time_total", row_limit=100))

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                               backward        29.75%       27.478s        29.95%       27.663s       9.445ms       0.000us         0.00%       3.012ms       1.028us          2929  
                                                forward         0.47%     435.287ms        26.12%       24.125s       8.237ms       0.000us         0.00%        3.803s       1.298ms          2929  
         

In [11]:
print(stat.table(sort_by="cuda_time_total", row_limit=100))

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                               aten::mm         2.64%        2.441s         4.70%        4.340s      54.880us        8.615s        38.89%        8.615s     108.940us         79083  
       autograd::engine::evaluate_function: MmBackward0         0.48%     446.109ms         5.17%        4.775s     181.157us       0.000us         0.00%        6.300s     239.005us         26361  
         

In [None]:
for i in range(10):
    print(123 ** i)

1
123
15129
1860867
228886641
28153056843
3462825991689
425927596977747
52389094428262881
6443858614676334363


In [6]:
(45**67) % 4

1

In [18]:
import math
import numpy as np

In [12]:
(3+math.sqrt(7))

5.645751311064591

In [15]:
5.6457513110645 ** 3

179.95554457618744

In [16]:
5.645751311064591 ** 3

179.9555445761961

In [25]:
dp = np.zeros(200, dtype=int)

In [26]:
dp[0] = 1

In [27]:
for i in range(1, 101):
    dp[i] = dp[i-1]
    if i >= 3:
        dp[i] += dp[i-3]
    

In [29]:
int(dp[100])

24382819596721629