<h1><center>Jigsaw: PyTorch Lightning⚡ + FP16 + GPU/TPU + W&B</center></h1>
                                                      
<center><img src = "https://jigsaw.google.com/static/images/social-share.jpg?cache=df11f5c" width = "750" height = "500"/></center>                                                                          

**I have written a detailed blog post on Jarvislabs AI that explains the concepts covered in this notebook in more depth. You can read it [here](https://jarvislabs.ai/blogs/jigsaw).**

<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Contents</center></h2>

> | S.No       |                   Heading                |
> | :------------- | :-------------------:                |         
> |  01 |  [**Competition Overview**](#competition-overview)  |                   
> |  02 |  [**Libraries**](#libraries)                        |  
> |  03 |  [**Global Config**](#global-config)                |
> |  04 |  [**Weights and Biases**](#weights-and-biases)      |
> |  05 |  [**Utilities**](#utilities)                |
> |  06 |  [**Dataset**](#dataset)  |
> |  07 |  [**Datamodule**](#datamodule)   |
> |  08 |  [**Model**](#model)   |
> |  09 |  [**Understanding Mixed Precision**](#understanding-mixed-precision) |
> |  10 |  [**Understanding TPUs**](#understanding-tpus) |
> |  11 |  [**Train**](#train) |

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:maroon; border:0; color:white' role="tab" aria-controls="home"><center>If you find this notebook useful, do give me an upvote; it helps keep up my motivation. This notebook will be updated frequently, so keep checking for further developments.</center></h3>

---

<a id="competition-overview"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Competition Overview</center></h2>

## **<span style="color:orange;">Description</span>**


In this competition, we will be asking you to score a set of about fourteen thousand comments. Pairs of comments were presented to expert raters, who marked one of two comments more harmful — each according to their own notion of toxicity. In this contest, when you provide scores for comments, they will be compared with several hundred thousand rankings. Your average agreement with the raters will determine your individual score. In this way, we hope to focus on ranking the severity of comment toxicity from innocuous to outrageous, where the middle matters as much as the extremes.

Can you build a model that produces scores that rank each pair of comments the same way as our professional raters?
  
---

## **<span style="color:orange;">Evaluation Metric</span>**

Submissions are evaluated on Average Agreement with Annotators. For the ground truth, annotators were shown two comments and asked to identify which of the two was more toxic. Pairs of comments can be, and often are, rated by more than one annotator, and may have been ordered differently by different annotators.

For each of the approximately 200,000 pair ratings in the ground truth test data, we use your predicted toxicity `score` to rank the comment pair. The pair receives a `1` if this ranking matches the annotator ranking, or `0` if it does not match.

The final score is the average across all the pair evaluations.
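
The metric described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the competition's scoring code: the function name `average_agreement`, the toy comment ids, and the choice to count ties as disagreements are all assumptions made for the example.

```python
def average_agreement(pairs, scores):
    """Fraction of annotated pairs that the predicted scores rank the same way.

    pairs  : list of (less_toxic_id, more_toxic_id) tuples, one per annotator rating
    scores : dict mapping comment id -> predicted toxicity score
    """
    # A pair earns 1 when the comment the annotator called more toxic scores higher.
    # Assumption: a tie counts as a disagreement.
    hits = sum(1 for less, more in pairs if scores[more] > scores[less])
    return hits / len(pairs)


# Hypothetical toy data: three comments, four annotator ratings.
scores = {"a": 0.1, "b": 0.7, "c": 0.4}
pairs = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "b")]  # the last two ratings conflict

agreement = average_agreement(pairs, scores)
print(agreement)  # 0.75
```

Because annotators can order the same pair differently (the last two ratings here), no set of scores can reach 1.0 on this toy data.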

---

<a id="libraries"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Libraries</center></h2>

To run the model on a TPU, uncomment and run the cell below, then   
replace the `gpus = -1` argument with `tpu_cores = 1` or `tpu_cores = 8` in the `Trainer` call.

In [None]:
# ! curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
# ! python pytorch-xla-env-setup.py --version 1.7 --apt-packages libomp5 libopenblas-dev

In [None]:
# Necessities
import wandb
import pandas as pd

# PyTorch
import torch
import torch.nn as nn
from torch.optim import AdamW, lr_scheduler
from torch.utils.data import Dataset, DataLoader

# Transformers
# AdamW comes from torch.optim above; the copy shipped in transformers is deprecated
from transformers import AutoTokenizer, AutoModel

# PyTorch Lightning
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger, WandbLogger
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping

# Colored Terminal Text
from colorama import Fore, Back, Style
b_ = Fore.BLUE
y_ = Fore.YELLOW
sr_ = Style.RESET_ALL

# Aesthetics
import warnings
warnings.simplefilter('ignore')

wandb.login()

---

To build this notebook I have taken inspiration from [Debarshi's](https://www.kaggle.com/debarshichanda) [starter notebook](https://www.kaggle.com/debarshichanda/pytorch-w-b-jigsaw-starter)

**What will be different in this notebook?**
1. Code has been written in PyTorch Lightning
2. Multi-GPU Training Compatible
3. TPU Training Compatible
4. Uses Mixed Precision Training to reduce Training Time Significantly
5. Uses Weights and Biases as a logger

---

<a id="global-config"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Global Config</center></h2>

In [None]:
CONFIG = {"seed": 42,
          "epochs": 2,
          "model_name": "../input/roberta-base",
          "tokenizer": AutoTokenizer.from_pretrained("../input/roberta-base"),
          "train_file_path": "../input/jigsaw-folds/train_5folds.csv",
          "checkpoint_directory_path": "./checkpoints",
          "train_batch_size": 32,
          "valid_batch_size": 64,
          "max_length": 128,
          "learning_rate": 1e-4,
          "scheduler": 'CosineAnnealingLR',
          "min_lr": 1e-6,
          "T_max": 500,
          "weight_decay": 1e-6,
          "n_fold": 5,
          "n_accumulate": 1,
          "num_classes": 1,
          "margin": 0.5,
          "num_workers": 2,
          "device": torch.device("cuda" if torch.cuda.is_available() else "cpu"),
          "infra" : "Kaggle",
          "competition" : 'Jigsaw',
          "_wandb_kernel" : 'neuracort',
          "wandb" : True
          }

# Seed
pl.seed_everything(seed=CONFIG["seed"])

---

<a id="weights-and-biases"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Weights and Biases</center></h2>

<center><img src = "https://i.imgur.com/1sm6x8P.png" width = "750" height = "500"/></center>        

**Weights & Biases** is the machine learning platform for developers to build better models faster.

You can use W&B's lightweight, interoperable tools to

- quickly track experiments,
- version and iterate on datasets,
- evaluate model performance,
- reproduce models,
- visualize results and spot regressions,
- and share findings with colleagues.
  
Set up W&B in 5 minutes, then quickly iterate on your machine learning pipeline with the confidence that your datasets and models are tracked and versioned in a reliable system of record.

In this notebook I use W&B to log the training and validation losses and to visualize each run.

---

<a id="utilities"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Utilities</center></h2>

In [None]:
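def fetch_scheduler(optimizer):
    """Build the LR scheduler named in CONFIG['scheduler'], or return None."""

    if CONFIG['scheduler'] is None:
        return None

    if CONFIG['scheduler'] == 'CosineAnnealingLR':
        scheduler = lr_scheduler.CosineAnnealingLR(
            optimizer,
            T_max=CONFIG['T_max'],
            eta_min=CONFIG['min_lr']
        )

    elif CONFIG['scheduler'] == 'CosineAnnealingWarmRestarts':
        # Note: this branch needs a 'T_0' entry, which CONFIG above does not define
        scheduler = lr_scheduler.CosineAnnealingWarmRestarts(
            optimizer,
            T_0=CONFIG['T_0'],
            eta_min=CONFIG['min_lr']
        )

    else:
        # Fail loudly instead of returning an unbound `scheduler`
        raise ValueError(f"Unknown scheduler: {CONFIG['scheduler']}")

    return scheduler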

---
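
For intuition about the `CosineAnnealingLR` option, here is a small sketch of the closed-form cosine curve it follows between the starting rate and `eta_min`. The helper name `cosine_annealing_lr` is hypothetical, and the formula is only meant for steps between 0 and `T_max` with no other changes to the learning rate; it is an illustration, not PyTorch's implementation.

```python
import math


def cosine_annealing_lr(step, base_lr, eta_min, t_max):
    """Closed-form cosine annealing, valid for 0 <= step <= t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max))


# Values mirror CONFIG: learning_rate=1e-4, min_lr=1e-6, T_max=500.
lr_start = cosine_annealing_lr(0, 1e-4, 1e-6, 500)    # starts at the base rate
lr_mid = cosine_annealing_lr(250, 1e-4, 1e-6, 500)    # halfway between the two rates
lr_end = cosine_annealing_lr(500, 1e-4, 1e-6, 500)    # ends at eta_min
print(lr_start, lr_mid, lr_end)
```

How far along this curve a run actually gets depends on how often the scheduler is stepped relative to `T_max`.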

In [None]:
# W&B Logger
wandb_logger = WandbLogger(
    project='jigsaw-lightning', 
#     group='nlp', 
    job_type='train', 
    anonymous='allow', 
    config=CONFIG
)

---

<a id="dataset"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Dataset</center></h2>

In [None]:
class JigsawDataset(Dataset):
    def __init__(self, df, tokenizer, max_length):
        self.df = df
        self.max_len = max_length
        self.tokenizer = tokenizer
        self.more_toxic = df['more_toxic'].values
        self.less_toxic = df['less_toxic'].values
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, index):
        more_toxic = self.more_toxic[index]
        less_toxic = self.less_toxic[index]
        inputs_more_toxic = self.tokenizer.encode_plus(
                                more_toxic,
                                truncation=True,
                                add_special_tokens=True,
                                max_length=self.max_len,
                                padding='max_length'
                            )
        inputs_less_toxic = self.tokenizer.encode_plus(
                                less_toxic,
                                truncation=True,
                                add_special_tokens=True,
                                max_length=self.max_len,
                                padding='max_length'
                            )
        target = 1
        
        more_toxic_ids = inputs_more_toxic['input_ids']
        more_toxic_mask = inputs_more_toxic['attention_mask']
        
        less_toxic_ids = inputs_less_toxic['input_ids']
        less_toxic_mask = inputs_less_toxic['attention_mask']
        
        
        return {
            'more_toxic_ids': torch.tensor(more_toxic_ids, dtype=torch.long),
            'more_toxic_mask': torch.tensor(more_toxic_mask, dtype=torch.long),
            'less_toxic_ids': torch.tensor(less_toxic_ids, dtype=torch.long),
            'less_toxic_mask': torch.tensor(less_toxic_mask, dtype=torch.long),
            'target': torch.tensor(target, dtype=torch.long)
        }

---

<a id="datamodule"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Datamodule</center></h2>

In [None]:
class JigsawDataModule(pl.LightningDataModule):

  def __init__(self, df_train, df_valid):
    super().__init__()
    self.df_train = df_train
    self.df_valid = df_valid

  def setup(self, stage=None):
    
    self.train_dataset = JigsawDataset(
        self.df_train, 
        tokenizer = CONFIG['tokenizer'], 
        max_length=CONFIG['max_length']
    )
    
    self.valid_dataset = JigsawDataset(
        self.df_valid, 
        tokenizer=CONFIG['tokenizer'], 
        max_length=CONFIG['max_length']
    )

  def train_dataloader(self):
    return DataLoader(
      self.train_dataset,
      batch_size=CONFIG['train_batch_size'],
      num_workers=CONFIG["num_workers"],
      shuffle=True,
      pin_memory=True, 
      drop_last=True
    )

  def val_dataloader(self):
    return DataLoader(
      self.valid_dataset,
      batch_size=CONFIG['valid_batch_size'],
      num_workers=CONFIG["num_workers"],
      shuffle=False, 
      pin_memory=True
    )

---

<a id="model"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Model</center></h2>

## **<span style="color:orange;">Margin Ranking Loss</span>**

Creates a criterion that measures the loss given two inputs x1 and x2 (1D mini-batch tensors) and a label y (a 1D mini-batch tensor containing 1 or -1).
  
If y = 1, the first input is assumed to rank higher (have a larger value) than the second input, and vice versa for y = -1.

The loss function for each pair of samples in the mini-batch is:

` loss(x1, x2, y) = max(0, -y * (x1 - x2) + margin)`

Refer to [docs](https://pytorch.org/docs/stable/generated/torch.nn.MarginRankingLoss.html) to understand all the parameters.
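
To make the formula concrete, here is a plain-Python sketch that applies it to three pairs and averages the result (mean reduction is the default for `nn.MarginRankingLoss`). The function name and the scores are made up for illustration; the real training code uses the PyTorch criterion.

```python
def margin_ranking_loss(x1, x2, y, margin=0.5):
    """Mean of max(0, -y * (x1 - x2) + margin) over all pairs."""
    per_pair = [max(0.0, -t * (a - b) + margin) for a, b, t in zip(x1, x2, y)]
    return sum(per_pair) / len(per_pair)


# Hypothetical model scores for three (more_toxic, less_toxic) pairs.
more_toxic_scores = [2.0, 0.25, -1.0]
less_toxic_scores = [1.0, 0.0, 0.5]
targets = [1, 1, 1]  # the first input should always rank higher

loss = margin_ranking_loss(more_toxic_scores, less_toxic_scores, targets, margin=0.5)
print(loss)  # 0.75
```

The first pair is ordered correctly by more than the margin, so it contributes 0. The second is ordered correctly but only by 0.25, so it still contributes 0.25. The third is in the wrong order and contributes 2.0.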

In [None]:
class JigsawModel(pl.LightningModule):
    
    def __init__(self, model_name):
        super(JigsawModel, self).__init__()
        self.model = AutoModel.from_pretrained(model_name)
        self.drop = nn.Dropout(p=0.2)
        self.fc = nn.Linear(768, CONFIG['num_classes'])
        
    def forward(self, ids, mask):        
        out = self.model(input_ids=ids,attention_mask=mask,
                         output_hidden_states=False)
        out = self.drop(out[1])
        outputs = self.fc(out)
                    
        return outputs
    
    def training_step(self, batch, batch_idx):
        more_toxic_ids = batch['more_toxic_ids']
        more_toxic_mask = batch['more_toxic_mask']
        less_toxic_ids = batch['less_toxic_ids']
        less_toxic_mask = batch['less_toxic_mask']
        targets = batch['target']
        
        more_toxic_outputs = self(more_toxic_ids, more_toxic_mask)
        less_toxic_outputs = self(less_toxic_ids, less_toxic_mask)
        
        loss = self.criterion(more_toxic_outputs, less_toxic_outputs, targets)
        
        self.log("train_loss", loss, prog_bar=True, logger=True)
        
        return {"loss": loss}
    
    def validation_step(self, batch, batch_idx):
        more_toxic_ids = batch['more_toxic_ids']
        more_toxic_mask = batch['more_toxic_mask']
        less_toxic_ids = batch['less_toxic_ids']
        less_toxic_mask = batch['less_toxic_mask']
        targets = batch['target']
        
        more_toxic_outputs = self(more_toxic_ids, more_toxic_mask)
        less_toxic_outputs = self(less_toxic_ids, less_toxic_mask)
        
        loss = self.criterion(more_toxic_outputs, less_toxic_outputs, targets)
        
        self.log("val_loss", loss, prog_bar=True, logger=True)
        
        return {'val_loss': loss}      
        
    def configure_optimizers(self):
        
        optimizer = AdamW(self.parameters(), lr=CONFIG['learning_rate'], weight_decay=CONFIG['weight_decay'])
        scheduler = fetch_scheduler(optimizer)
        
        # fetch_scheduler returns None when no scheduler is configured
        if scheduler is None:
            return optimizer
        
        return dict(
            optimizer = optimizer,
            lr_scheduler = scheduler
        )
    
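    def criterion(self, outputs1, outputs2, targets):
        # Flatten the (batch, 1) scores to (batch,) so they line up with the 1D targets
        return nn.MarginRankingLoss(margin=CONFIG['margin'])(
            outputs1.view(-1), outputs2.view(-1), targets
        )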

---

<a id="understanding-mixed-precision"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Understanding Mixed Precision</center></h2>

Read [NVIDIA Docs](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) for additional examples.

## **<span style="color:orange;">Introduction</span>**

**There are numerous benefits to using numerical formats with lower precision than 32-bit floating point**: 

1. They require less memory, enabling the training and deployment of larger neural networks.   
2. They require less memory bandwidth, thereby speeding up data transfer operations.     
3. Math operations run much faster in reduced precision, especially on GPUs with Tensor Core support for that precision.
  
Mixed precision training achieves all these benefits while ensuring that no task-specific accuracy is lost compared to full precision training. It does so by identifying the steps that require full precision and using 32-bit floating point for only those steps while using 16-bit floating point everywhere else.

---

## **<span style="color:orange;">Mixed Precision Training</span>**

Mixed precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. 
  
Since the introduction of Tensor Cores in the Volta and Turing architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures.
  
**Using mixed precision training requires two steps:**

1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
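
The second step can be illustrated without a GPU. The sketch below uses Python's `struct` module to round values to half precision and shows a tiny gradient vanishing in FP16 unless it is scaled first. The gradient value, the scale factor of 1024, and the helper name `to_fp16` are assumptions chosen for the demonstration; real frameworks pick and adjust the scale for you.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8      # hypothetical gradient, below the smallest positive FP16 value
scale = 1024.0   # hypothetical loss scale (a power of two)

unscaled = to_fp16(grad)          # underflows to zero: the update is lost
scaled = to_fp16(grad * scale)    # large enough to survive in FP16
recovered = scaled / scale        # unscale in higher precision before the update

print(unscaled)  # 0.0
print(scaled, recovered)
```

Scaling the loss multiplies every gradient by the same factor during backpropagation, so small values stay representable; dividing by the factor afterwards restores their true magnitude.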
  
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA® 8 in the NVIDIA Deep Learning SDK.
  
Mixed precision is the combined use of different numerical precisions in a computational method.
  
> **Half precision** **(also known as FP16)** data uses less memory than higher-precision FP32 or FP64 data, which allows training and deployment of larger networks, and FP16 data transfers take less time than FP32 or FP64 transfers.
>   
> **Single precision** **(also known as FP32)** is the common 32-bit floating point format (`float` in C-derived programming languages); the 64-bit format is known as double precision (`double`).

Deep Neural Networks (DNNs) have led to breakthroughs in a number of areas, including image processing and understanding, language modeling, language translation, speech processing, game playing, and many others. DNN complexity has been increasing to achieve these results, which in turn has increased the computational resources required to train these networks. 
  
One way to lower the required resources is to use lower-precision arithmetic, which has the following benefits.
  
> **Decrease the required amount of memory**  
> Half-precision floating point format (FP16) uses 16 bits, compared to 32 bits for single precision (FP32). Lowering the required memory enables training of larger models or training with larger mini-batches.
>   
> **Shorten the training or inference time**  
> Execution time can be sensitive to memory or arithmetic bandwidth. Half-precision halves the number of bytes accessed, thus reducing the time spent in memory-limited layers. NVIDIA GPUs offer up to 8x more half precision arithmetic throughput when compared to single-precision, thus speeding up math-limited layers.

---

<a id="understanding-tpus"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Understanding TPUs</center></h2>

To view the original PyTorch implementation by Tanul, refer to [this](https://www.kaggle.com/tanulsingh077/pytorch-xla-understanding-tpu-s-and-xla) notebook.

## **<span style="color:orange;">What are TPUs? How do they work? How are they different from a GPU?</span>**

You might wonder why it matters how TPUs work. It is not a must, but to exploit something fully we should know how it works, right?
TPUs are hardware accelerators specialized for deep learning tasks. For an explanation of what TPUs are and how they work, please go through the following videos:
* [video1](https://www.youtube.com/watch?v=MXxN4fv01c8)
* [video2](https://www.youtube.com/watch?v=kBjYK3K3P6M)<br><br>
It's important to understand the underlying concepts of PyTorch XLA. If you want to dig even deeper, [here](https://codelabs.developers.google.com/codelabs/keras-flowers-data/#2) is an article by Google explaining everything about TPUs.

---

## **<span style="color:orange;">Key Takeaways</span>**

These are the key takeaways from the videos and article above:

* Each TPU v3 board has 8 TPU cores and 64 GB of memory.
* TPUs consist of two units: a Matrix Multiply Unit (MXU), which runs matrix multiplications, and a Vector Processing Unit (VPU) for all other tasks such as activations, softmax, etc.
* TPU v2/v3 use a dtype called bfloat16, which combines the range of a 32-bit floating-point number with the storage space of a 16-bit one. This lets more matrices fit in memory and so allows more matrix multiplications. The speed comes at the cost of precision, since bfloat16 can represent fewer decimal places than the standard 16-bit floating-point format (FP16), but that is okay because neural networks can work at reduced precision while maintaining their accuracy.
* The ideal batch size for TPUs is 128 data items per TPU core, but the hardware can already show good utilization from 8 data items per TPU core.
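
The range-versus-precision trade-off in the bfloat16 takeaway can be checked with the standard library alone. In this sketch bfloat16 is emulated by keeping only the top 16 bits of a float32, which is a simplification (hardware typically rounds instead of truncating), and the helper names are hypothetical.

```python
import struct


def to_fp16(x):
    """Round to IEEE 754 half precision (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bfloat16(x):
    """Emulate bfloat16 by truncating a float32 to its top 16 bits
    (1 sign bit, 8 exponent bits, 7 mantissa bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


# Precision: FP16 keeps more mantissa bits, so it lands closer to 1.001.
fp16_val = to_fp16(1.001)
bf16_val = to_bfloat16(1.001)
print(fp16_val, bf16_val)  # 1.0009765625 1.0

# Range: bfloat16 shares float32's exponent, so 1e20 fits; FP16 cannot hold it.
bf16_big = to_bfloat16(1e20)
try:
    to_fp16(1e20)
    fp16_overflows = False
except OverflowError:
    fp16_overflows = True
print(bf16_big, fp16_overflows)
```

Here FP16 is the more precise format near 1.0 but cannot hold the large value, while bfloat16 keeps it to within about one percent.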

---

**Now we move on to the final question: do TPUs run the Python code directly, or is something else working under the hood without getting the credit?**

![](https://3s81si1s5ygj3mzby34dq6qf-wpengine.netdna-ssl.com/wp-content/uploads/2018/12/bfloat.jpg)

---

## **<span style="color:orange;">Under the Hood</span>**

* We know that any deep learning framework first defines a computation graph, which is then executed by a processing chip to train a neural network. Similarly, the TPU does not run Python code directly; it runs the computation graph defined by your program. However, the computation graph is first converted into TPU machine code. Under the hood, a compiler called XLA (Accelerated Linear Algebra compiler) transforms the graph of computation nodes into TPU machine code. This compiler also performs many advanced optimizations on your code and your memory layout.
* In TensorFlow, the conversion from computation graph to TPU machine code takes place automatically as work is sent to the TPU, whereas there was no such support for PyTorch, and thus the XLA module was created to include XLA in our build chain explicitly.

![](https://lh5.googleusercontent.com/NjGqp60oF_3Bu4Q63dprSivZ77BgVnaPEp0Olk1moFm8okcmMfPXs7PIJBgL9LB5QCtqlmM4WTepYxPC5Mq_i_0949sWSpq8pKvfPAkHnFJWuHjrNVLPN2_a0eggOlteV7mZB_Z9)

---

<a id="train"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Train</center></h2>

In [None]:
if __name__ == "__main__":    
    for fold in range(0, 1): # Replace `1` with `CONFIG['n_fold']` to run for all folds
        print(f"{y_}====== Fold: {fold} ======{sr_}")
        
        df = pd.read_csv(CONFIG['train_file_path'])        
        
        logger = TensorBoardLogger("lightning_logs", name="toxic-comments")

        checkpoint_callback = ModelCheckpoint(
          dirpath=CONFIG["checkpoint_directory_path"],
          filename= f"fold_{fold}_roberta-base",
          save_top_k=1,
          verbose=True,
          monitor="val_loss",
          mode="min"
        )

        early_stopping_callback = EarlyStopping(monitor='val_loss', patience=2)

        trainer = pl.Trainer(
          logger=wandb_logger,
          callbacks=[checkpoint_callback, early_stopping_callback],
          max_epochs=CONFIG['epochs'],
          gpus=-1,
          progress_bar_refresh_rate=30,
          precision=16,                # Activate fp16 Training
#           accelerator = 'dp'         # Un-comment for Multi-GPU Training
        )

        df_train = df[df.kfold != fold].reset_index(drop=True)
        df_valid = df[df.kfold == fold].reset_index(drop=True)  

        data_module = JigsawDataModule(df_train, df_valid)

        model = JigsawModel(CONFIG['model_name'])    
        trainer.fit(model, data_module)

## [Check out the run page here $\rightarrow$](https://wandb.ai/ishandutta/jigsaw-lightning/runs/23m009ta?workspace=user-ishandutta)

---

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:maroon; border:0; color:white' role="tab" aria-controls="home"><center>If you find this notebook useful, do give me an upvote; it helps keep up my motivation. This notebook will be updated frequently, so keep checking for further developments.</center></h3>

--- 

## **<span style="color:orange;">Let's have a Talk!</span>**
> ### Reach out to me on [LinkedIn](https://www.linkedin.com/in/ishandutta0098)

---