<h1><center>Deep Learning Experiment Tracking with Weights and Biases</center></h1>
                                                      
<center><img src = "https://i.imgur.com/1sm6x8P.png" width = "750" height = "500"/></center>                                                                                               

<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Contents</center></h2>

1. [But What is a TPU?](#but-what-is-a-tpu)  
2. [Because Libraries are Inevitable](#because-libraries-are-inevitable)  
3. [Always a Neat Config](#always-a-neat-config)  
4. [Preparing The Dataset](#preparing-the-dataset)  
5. [BERT is All We Need](#bert-is-all-we-need)  
6. [Fit and Run](#fit-and-run)  
7. [Where did I learn All This?](#where-did-i-learn-all-this)

<a id="but-what-is-a-tpu"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>But What is a TPU?</center></h2>

### Introduction

**TPU** stands for Tensor Processing Unit.  
  
TPUs are hardware accelerators specialized for deep learning tasks. For an explanation of what TPUs are and how they work, please go through the following videos:
- [Tensor Processing Units: History and hardware](https://www.youtube.com/watch?v=MXxN4fv01c8)
- [Diving into the TPU v2 and v3](https://www.youtube.com/watch?v=kBjYK3K3P6M)

### Key Points

- Each TPU v3 board has 8 TPU cores and 128 GB of memory (16 GB per core).
- Each TPU core consists of two kinds of units: a Matrix Multiply Unit (MXU), which runs matrix multiplications, and a Vector Processing Unit (VPU), which handles all other tasks such as activations, softmax, etc.  
  
- TPU v2/v3 use a dtype called bfloat16, which combines the range of a 32-bit floating point number with the storage space of a 16-bit one. This lets more matrices fit in memory and therefore allows more matrix multiplications. The extra speed comes at the cost of precision: bfloat16 keeps fewer mantissa bits than the standard 16-bit floating point format (float16), so it represents fewer significant digits. That is acceptable because neural networks can work at reduced precision while maintaining their accuracy.  
  
- The ideal batch size for TPUs is 128 data items per TPU core, but the hardware can already show good utilization from 8 data items per TPU core.
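
As a rough illustration of the range/precision trade-off described above, the sketch below emulates bfloat16 in plain Python by keeping only the top 16 bits of a float32 bit pattern. This is a simplification: it truncates, whereas real hardware typically rounds, and the helper name `to_bfloat16` is made up for this example.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by zeroing the low 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Keep the sign bit, all 8 exponent bits and the top 7 mantissa bits
    bits &= 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


# A value far beyond the float16 maximum (65504) stays finite,
# because bfloat16 keeps the full float32 exponent.
print(to_bfloat16(3.0e38))

# Precision is what gets sacrificed: only a few significant digits survive.
print(to_bfloat16(3.14159265))
```

The second call shows why the format suits neural networks better than general numerical work: the magnitude is preserved, while the low-order digits are dropped.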

### Simple Explanation

- Any deep learning framework first defines a computation graph, which a processing chip then executes to train a neural network. Similarly, the TPU does not run Python code directly; it runs the computation graph defined by your program. The graph is first converted into TPU machine code: under the hood, a compiler called XLA (Accelerated Linear Algebra) transforms the graph of computation nodes into TPU machine code. This compiler also performs many advanced optimizations on your code and your memory layout.

- In TensorFlow, the conversion from computation graph to TPU machine code happens automatically as work is sent to the TPU. PyTorch had no such built-in support, so the PyTorch/XLA package (`torch_xla`) was created to bring XLA into the build chain explicitly.

![TPU](https://lh5.googleusercontent.com/NjGqp60oF_3Bu4Q63dprSivZ77BgVnaPEp0Olk1moFm8okcmMfPXs7PIJBgL9LB5QCtqlmM4WTepYxPC5Mq_i_0949sWSpq8pKvfPAkHnFJWuHjrNVLPN2_a0eggOlteV7mZB_Z9)

### Changes required from GPU Code to TPU Code
  
 GPU -> TPU 
- `optimizer.step()` -> ` xm.optimizer_step(optimizer)`
- `device = "cuda"`  -> `device = xm.xla_device()`

<a id="because-libraries-are-inevitable"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Because Libraries are Inevitable</center></h2>

In [None]:
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --apt-packages libomp5 libopenblas-dev
!pip install wandb -q

In [None]:
import os
os.environ['WANDB_SILENT'] = 'true'

**WANDB STEP 1:** Connect with your API Key

In [None]:
# Import wandb
import wandb

wandb.login()

`torch_xla` is what lets PyTorch run on the TPU, and `transformers` provides the BERT model and tokenizer.

In [None]:
import torch
import pandas as pd
from scipy import stats
import numpy as np

from tqdm import tqdm
from collections import OrderedDict, namedtuple
import torch.nn as nn
from torch.optim import lr_scheduler
import joblib

import logging
import transformers
from transformers import AdamW, get_linear_schedule_with_warmup, get_constant_schedule
import sys
from sklearn import metrics, model_selection

import warnings
import torch_xla
import torch_xla.debug.metrics as met
import torch_xla.distributed.data_parallel as dp
import torch_xla.distributed.parallel_loader as pl
import torch_xla.utils.utils as xu
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
import torch_xla.test.test_utils as test_utils
import random  # used by seed_everything()

warnings.filterwarnings("ignore")

<a id="always-a-neat-config"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Always a Neat Config</center></h2>

In [None]:
def seed_everything(seed):
    """
    Seeds basic parameters for reproducibility of results
    
    Arguments:
        seed {int} -- Number of the seed
    """
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything(42)

<a id="preparing-the-dataset"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Preparing The Dataset</center></h2>

In [None]:
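# NOTE: BERTBaseUncased is defined in the "BERT is All We Need" section below; run that cell before this one.
mx = BERTBaseUncased(bert_path="../input/bert-base-multilingual-uncased/")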
df_train1 = pd.read_csv("../input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv", usecols=["comment_text", "toxic"]).fillna("none")
df_train2 = pd.read_csv("../input/jigsaw-multilingual-toxic-comment-classification/jigsaw-unintended-bias-train.csv", usecols=["comment_text", "toxic"]).fillna("none")
df_train_full = pd.concat([df_train1, df_train2], axis=0).reset_index(drop=True)
df_train = df_train_full.sample(frac=1).reset_index(drop=True).head(200000)

df_valid = pd.read_csv('../input/jigsaw-multilingual-toxic-comment-classification/validation.csv', 
                       usecols=["comment_text", "toxic"])

df_train = pd.concat([df_train, df_valid], axis=0).reset_index(drop=True)
df_train = df_train.sample(frac=1).reset_index(drop=True)

In [None]:
class BERTDatasetTraining:
    def __init__(self, comment_text, targets, tokenizer, max_length):
        self.comment_text = comment_text
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.targets = targets

    def __len__(self):
        return len(self.comment_text)

    def __getitem__(self, item):
        comment_text = str(self.comment_text[item])
        comment_text = " ".join(comment_text.split())

        inputs = self.tokenizer.encode_plus(
            comment_text,
            None,
            add_special_tokens=True,
            max_length=self.max_length,
            truncation=True,
        )
        ids = inputs["input_ids"]
        token_type_ids = inputs["token_type_ids"]
        mask = inputs["attention_mask"]
        
        padding_length = self.max_length - len(ids)
        
        ids = ids + ([0] * padding_length)
        mask = mask + ([0] * padding_length)
        token_type_ids = token_type_ids + ([0] * padding_length)
        
        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(self.targets[item], dtype=torch.float)
        }

<a id="bert-is-all-we-need"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>BERT is All We Need</center></h2>

## BERT - Bidirectional Encoder Representations from Transformers
- BERT is a method of pre-training language representations, meaning that we train a general-purpose "language understanding" model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). 
- BERT outperforms previous methods because it is the first unsupervised, *deeply bidirectional system* for pre-training NLP.
  
- *Unsupervised* means that BERT was trained using only a plain text corpus, which is important because an enormous amount of plain text data is publicly available on the web in many languages.

### Types of Pre-trained Representations: 
  
1. **Context Free** -    
Context-free models such as `word2vec` or `GloVe` generate a single "word embedding" representation for each word in the vocabulary, so `bank` would have the same representation in `bank deposit` and `river bank`.
  
2. **Contextual** - 
Contextual representations can further be unidirectional or bidirectional.

    a. **Unidirectional or Shallow Bidirectional** -   
    BERT was built upon recent work in pre-training contextual representations (including `Semi-supervised Sequence Learning`, `Generative Pre-Training`, `ELMo`, and `ULMFit`), but crucially these models are all unidirectional or shallowly bidirectional.   

    This means that each word is only contextualized using the words to its left (or right).   

    For example, in the sentence `I made a bank deposit` the unidirectional representation of `bank` is only based on `I made a` but not `deposit`. 

    Some previous work does combine the representations from separate left-context and right-context models, but only in a "shallow" manner. 
    
    b. **Deeply Bidirectional** -   
    BERT represents `bank` using both its left and right context — `I made a` ... `deposit` — starting from the very bottom of a deep neural network, so it is deeply bidirectional.
    
    BERT uses a simple approach for this.   
      
    We mask out 15% of the words in the input, run the entire sequence through a deep bidirectional Transformer encoder, and then predict only the masked words.   
    
    For example:  
    `Original` : the man went to the store. he bought a gallon of milk.  
    `Input` : the man went to the [MASK1] . he bought a [MASK2] of milk.  
    `Labels` : [MASK1] = store; [MASK2] = gallon
    
    In order to learn relationships between sentences, we also train on a simple task which can be generated from any monolingual corpus.  
      
    Given two sentences A and B, is B the actual next sentence that comes after A, or just a random sentence from the corpus?  
      
      `Sentence A`: the man went to the store .  
      `Sentence B`: he bought a gallon of milk .  
      `Label`: IsNextSentence  
      
      `Sentence A`: the man went to the store .  
      `Sentence B`: penguins are flightless .  
      `Label`: NotNextSentence
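
The masking step described above can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's actual pre-training code: the function name `mask_tokens` is hypothetical, the text is split on whitespace instead of being run through a WordPiece tokenizer, and every selected position is replaced by `[MASK]` (the original recipe sometimes substitutes a random token or leaves the token unchanged).

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels), where labels maps position -> original token."""
    rng = random.Random(seed)
    # Mask roughly mask_prob of the tokens, and always at least one
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))

    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]  # what the model has to predict
        masked[pos] = "[MASK]"     # what the model actually sees
    return masked, labels


tokens = "the man went to the store . he bought a gallon of milk .".split()
masked, labels = mask_tokens(tokens)
print(" ".join(masked))
print(labels)
```

The model only receives `masked` as input; the loss is computed on the positions stored in `labels`, which is what forces it to use both the left and the right context.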

In [None]:
class BERTBaseUncased(nn.Module):
    def __init__(self, bert_path):
        super(BERTBaseUncased, self).__init__()
        self.bert_path = bert_path
        self.bert = transformers.BertModel.from_pretrained(self.bert_path)
        self.bert_drop = nn.Dropout(0.3)
        self.out = nn.Linear(768 * 2, 1)

    def forward(
            self,
            ids,
            mask,
            token_type_ids
    ):
        o1, o2 = self.bert(
            ids,
            attention_mask=mask,
            token_type_ids=token_type_ids,
            return_dict = False
        )
        
        apool = torch.mean(o1, 1)
        mpool, _ = torch.max(o1, 1)
        cat = torch.cat((apool, mpool), 1)

        bo = self.bert_drop(cat)
        p2 = self.out(bo)
        return p2

# Model Inputs

## Input IDs:
The input ids are often the only required parameters to be passed to the model as input. They are token indices, numerical representations of tokens building the sequences that will be used as input by the model.

In [None]:
import transformers

In [None]:
tokenizer = transformers.BertTokenizer.from_pretrained("../input/bert-base-multilingual-uncased/", do_lower_case=True)

sequence = "A Titan RTX has 24GB of VRAM"

In [None]:
tokenized_sequence = tokenizer.tokenize(sequence)

The tokens are either words or subwords. Here, for instance, “VRAM” wasn’t in the model vocabulary, so it has been split into “V”, “RA” and “M”. To indicate that those tokens are not separate words but parts of the same word, a double-hash prefix is added to “RA” and “M”:

In [None]:
print(tokenized_sequence)
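
To make the subword splitting concrete, here is a minimal sketch of greedy longest-match-first splitting, the general idea behind WordPiece tokenization. It is an illustration under assumptions: `toy_vocab` is a made-up vocabulary and `wordpiece` is a hypothetical helper, not the `transformers` implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest pieces found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"A", "Titan", "has", "V", "##RA", "##M"}
print(wordpiece("VRAM", toy_vocab))
print(wordpiece("Titan", toy_vocab))
```

A word that is in the vocabulary comes back as a single piece, while an out-of-vocabulary word is broken into a first piece plus `##`-prefixed continuations.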

These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding the sentence to the tokenizer:

In [None]:
inputs = tokenizer(sequence)

In [None]:
encoded_sequence = inputs["input_ids"]
print(encoded_sequence)

Note that the tokenizer automatically adds “special tokens” (if the associated model relies on them) which are special IDs the model sometimes uses.

If we decode the previous sequence of ids, we can see the special tokens that the tokenizer added:

In [None]:
decoded_sequence = tokenizer.decode(encoded_sequence)

In [None]:
print(decoded_sequence)

## Attention Mask

The attention mask is an optional argument used when batching sequences together.  
This argument indicates to the model which tokens should be attended to, and which should not.

In [None]:
sequence_a = "This is a short sequence."
sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."

encoded_sequence_a = tokenizer(sequence_a)["input_ids"]
encoded_sequence_b = tokenizer(sequence_b)["input_ids"]

print("encoded_sequence_a: ", encoded_sequence_a)
print("encoded_sequence_b: ", encoded_sequence_b)

In [None]:
# Length of Encoded Versions
len(encoded_sequence_a), len(encoded_sequence_b)

The two encoded sequences have different lengths, so we can’t put them together in the same tensor as-is. The first sequence needs to be padded up to the length of the second one, or the second one needs to be truncated down to the length of the first one.

In the first case, the list of IDs will be extended by the padding indices. We can pass a list to the tokenizer and ask it to pad like this:

In [None]:
padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)

We can see that 0s have been added on the right of the first sentence to make it the same length as the second one:

In [None]:
print(padded_sequences["input_ids"])

This can then be converted into a tensor in PyTorch or TensorFlow.   

The attention mask is a binary tensor indicating the position of the padded indices so that the model does not attend to them. For the `BertTokenizer`, `1` indicates a value that should be attended to, while `0` indicates a padded value.   
This attention mask is in the dictionary returned by the tokenizer under the key “attention_mask”.

In [None]:
print(padded_sequences["attention_mask"])
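
The padding and attention mask logic can also be written out by hand. The sketch below is a plain-Python illustration, assuming a padding id of `0`; the helper name `pad_batch` and the token ids are made up and do not come from a real vocabulary.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build the matching masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids = []
    attention_mask = []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 = real token the model should attend to, 0 = padding to ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for one short and one long sequence
batch = pad_batch([[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

After padding, both rows have the same length, so the batch can be stacked into a single tensor.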

## Token Type IDs

Some models’ purpose is to do classification on pairs of sentences or question answering. 
  
These require two different sequences to be joined in a single “input_ids” entry, which usually is performed with the help of special tokens, such as the classifier (`[CLS]`) and separator (`[SEP]`) tokens. For example, the BERT model builds its two sequence input as such:

`[CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]`

We can use our tokenizer to automatically generate such a sentence by passing the two sequences to `tokenizer` as two arguments (and not a list, like before) like this:

In [None]:
sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"

encoded_dict = tokenizer(sequence_a, sequence_b)
decoded = tokenizer.decode(encoded_dict["input_ids"])

In [None]:
print(decoded)

This is enough for some models to understand where one sequence ends and where another begins. However, other models, such as BERT, also deploy token type IDs (also called segment IDs). They are represented as a binary mask identifying the two types of sequence in the model.  

The tokenizer returns this mask as the `token_type_ids` entry:

In [None]:
encoded_dict['token_type_ids']

The first sequence, the “context” used for the question, has all its tokens represented by a `0`, whereas the second sequence, corresponding to the “question”, has all its tokens represented by a `1`.
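
The layout of a sentence pair and its token type IDs can be sketched in plain Python as follows. The helper `build_pair` is hypothetical, and the inputs are whitespace-split words, not real tokenizer output.

```python
def build_pair(tokens_a, tokens_b):
    """Join two token lists BERT-style and mark which segment each token belongs to."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sequence A and the first [SEP];
    # segment 1 covers sequence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair(
    "HuggingFace is based in NYC".split(),
    "Where is HuggingFace based ?".split(),
)
for token, segment in zip(tokens, token_type_ids):
    print(segment, token)
```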

<a id="fit-and-run"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Fit and Run</center></h2>

**WANDB STEP 2:** Next, we initialize wandb with `wandb.init`, passing the name of the project where we want to save our runs.

**WANDB STEP 3:** In this example we log the training loss, epoch and ROC AUC score with `wandb.log`. We also call `wandb.watch(model)` so that wandb tracks the model's gradients.

In [None]:
def _run():
    with wandb.init(project = "jigsaw-tpu"):

        def loss_fn(outputs, targets):
            return nn.BCEWithLogitsLoss()(outputs, targets.view(-1, 1))
        
        def train_loop_fn(data_loader, model, optimizer, device, scheduler=None):
            
            wandb.watch(model)
            model.train()
            for bi, d in enumerate(data_loader):
                ids = d["ids"]
                mask = d["mask"]
                token_type_ids = d["token_type_ids"]
                targets = d["targets"]

                ids = ids.to(device, dtype=torch.long)
                mask = mask.to(device, dtype=torch.long)
                token_type_ids = token_type_ids.to(device, dtype=torch.long)
                targets = targets.to(device, dtype=torch.float)

                optimizer.zero_grad()
                outputs = model(
                    ids=ids,
                    mask=mask,
                    token_type_ids=token_type_ids
                )

                loss = loss_fn(outputs, targets)
                
                wandb.log({"loss": loss}) # Log the Training Loss to W&B

                if bi % 10 == 0:
                    xm.master_print(f'batch index = {bi}, loss = {loss}')

                loss.backward()
                xm.optimizer_step(optimizer)
                if scheduler is not None:
                    scheduler.step()

        def eval_loop_fn(data_loader, model, device):
            model.eval()
            fin_targets = []
            fin_outputs = []
            # Gradients are not needed for evaluation, so skip tracking them
            with torch.no_grad():
                for bi, d in enumerate(data_loader):
                    ids = d["ids"]
                    mask = d["mask"]
                    token_type_ids = d["token_type_ids"]
                    targets = d["targets"]

                    ids = ids.to(device, dtype=torch.long)
                    mask = mask.to(device, dtype=torch.long)
                    token_type_ids = token_type_ids.to(device, dtype=torch.long)
                    targets = targets.to(device, dtype=torch.float)

                    outputs = model(
                        ids=ids,
                        mask=mask,
                        token_type_ids=token_type_ids
                    )

                    # Move results to the CPU and collect them as plain Python lists
                    targets_np = targets.cpu().detach().numpy().tolist()
                    outputs_np = outputs.cpu().detach().numpy().tolist()
                    fin_targets.extend(targets_np)
                    fin_outputs.extend(outputs_np)

            return fin_outputs, fin_targets


        MAX_LEN = 192
        TRAIN_BATCH_SIZE = 64
        EPOCHS = 2

        tokenizer = transformers.BertTokenizer.from_pretrained("../input/bert-base-multilingual-uncased/", do_lower_case=True)

        train_targets = df_train.toxic.values
        valid_targets = df_valid.toxic.values

        train_dataset = BERTDatasetTraining(
            comment_text=df_train.comment_text.values,
            targets=train_targets,
            tokenizer=tokenizer,
            max_length=MAX_LEN
        )

        train_sampler = torch.utils.data.distributed.DistributedSampler(
              train_dataset,
              num_replicas=xm.xrt_world_size(),
              rank=xm.get_ordinal(),
              shuffle=True)

        train_data_loader = torch.utils.data.DataLoader(
            train_dataset,
            batch_size=TRAIN_BATCH_SIZE,
            sampler=train_sampler,
            drop_last=True,
            num_workers=1
        )

        valid_dataset = BERTDatasetTraining(
            comment_text=df_valid.comment_text.values,
            targets=valid_targets,
            tokenizer=tokenizer,
            max_length=MAX_LEN
        )

        valid_sampler = torch.utils.data.distributed.DistributedSampler(
              valid_dataset,
              num_replicas=xm.xrt_world_size(),
              rank=xm.get_ordinal(),
              shuffle=False)

        valid_data_loader = torch.utils.data.DataLoader(
            valid_dataset,
            batch_size=16,
            sampler=valid_sampler,
            drop_last=False,
            num_workers=1
        )

        device = xm.xla_device()
        model = mx.to(device)

        param_optimizer = list(model.named_parameters())
        no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
        optimizer_grouped_parameters = [
            {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.001},
            {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}]

        # Scale the learning rate by the number of TPU cores. This scaling is optional;
        # training also works without multiplying by xm.xrt_world_size().
        lr = 0.4 * 1e-5 * xm.xrt_world_size()

        num_train_steps = int(len(train_dataset) / TRAIN_BATCH_SIZE / xm.xrt_world_size() * EPOCHS)
        xm.master_print(f'num_train_steps = {num_train_steps}, world_size={xm.xrt_world_size()}')

        optimizer = AdamW(optimizer_grouped_parameters, lr=lr)
        scheduler = get_linear_schedule_with_warmup(
            optimizer,
            num_warmup_steps=0,
            num_training_steps=num_train_steps
        )

        for epoch in range(EPOCHS):
            para_loader = pl.ParallelLoader(train_data_loader, [device])
            train_loop_fn(para_loader.per_device_loader(device), model, optimizer, device, scheduler=scheduler)

            para_loader = pl.ParallelLoader(valid_data_loader, [device])
            o, t = eval_loop_fn(para_loader.per_device_loader(device), model, device)
            xm.save(model.state_dict(), "model.bin")
            auc = metrics.roc_auc_score(np.array(t) >= 0.5, o)
            xm.master_print(f'AUC = {auc}')
            
            wandb.log({'Epoch': epoch, 'ROC AUC Score':auc}) # Log the Epoch and ROC AUC Score
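
The step count and learning rate arithmetic used in `_run()` can be checked without a TPU. The sketch below is plain Python under stated assumptions: the dataset size of 204,800 rows is a hypothetical round number, the helper names are made up, and `linear_lr` mirrors my understanding of how a linear schedule with warmup scales the base learning rate; it is not the `transformers` implementation.

```python
def steps_per_run(n_examples, batch_size, world_size, epochs):
    """Optimizer steps taken by each core when the data is sharded across cores."""
    return int(n_examples / batch_size / world_size * epochs)


def linear_lr(step, base_lr, total_steps, warmup_steps=0):
    """Linear warmup to base_lr, then linear decay down to zero."""
    if step < warmup_steps:
        return base_lr * (step / max(1, warmup_steps))
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / max(1, total_steps - warmup_steps))


world_size = 8                      # one process per TPU core
base_lr = 0.4 * 1e-5 * world_size   # same scaling as in _run()
total_steps = steps_per_run(
    n_examples=204_800,             # hypothetical dataset size
    batch_size=64,
    world_size=world_size,
    epochs=2,
)

print("steps per core:", total_steps)
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, base_lr, total_steps))
```

Because each core only sees its own shard of the data, the number of scheduler steps has to be divided by the world size; otherwise the learning rate would never decay all the way to zero.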

## Important Methods

### 1. `ParallelLoader`  
- ParallelLoader loads the training data onto each device, i.e. onto each TPU core.
- Wraps an existing PyTorch DataLoader with background data upload.

### 2. `Spawn Function`
- This is the most important one to understand in order to use multiprocessing and multiple TPU cores effectively.
- The spawn function creates multiple copies of the computation graph, one for each core (XLA device). It also makes copies of the data on which the model is trained.
- `spawn()` takes a function (the "map function"), a tuple of arguments (the placeholder flags dict), the number of processes to create, and whether to create these new processes by "forking" or "spawning."
- In the code below, `spawn()` creates eight processes, one for each Cloud TPU core, and calls `_mp_fn()` (the map function) on each process. The inputs to `_mp_fn()` are an index (zero through seven) and the placeholder flags. When the processes acquire their device, they automatically acquire their corresponding Cloud TPU core.

### 3. `Map Function`
- The map function is the function that is called on each of the n replicated processes.
- PyTorch XLA makes `nprocs` copies as soon as the spawn function is called, one for each device, and the map function is the first thing called on each of them. It takes two arguments: the process index (zero to n-1) and the placeholder flags, a dictionary that can hold configuration for your model such as max_len, epochs, num_workers, etc.

**WANDB STEP 4:** After the runs complete, call `wandb.finish()` to finish the wandb instance.

In [None]:
# Start training processes
def _mp_fn(rank, flags):
    # Map function: runs once in each spawned process (one per TPU core)
    torch.set_default_tensor_type('torch.FloatTensor')
    _run()

FLAGS={}
xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=8, start_method='fork')

wandb.finish() # Finish the instance

<a id="where-did-i-learn-all-this"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Where did I learn All This?</center></h2>

- [Attention is All you Need](https://arxiv.org/abs/1706.03762)
- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)
- [bert multi lingual tpu training (8 cores) w/ valid](https://www.kaggle.com/abhishek/bert-multi-lingual-tpu-training-8-cores-w-valid)
- [Pytorch-XLA: Understanding TPU's and XLA](https://www.kaggle.com/tanulsingh077/pytorch-xla-understanding-tpu-s-and-xla)
- [Huggingface Glossary](https://huggingface.co/transformers/glossary.html)