<font color='#2980B9'><center><h2>SWA-LP & Interpreting Transformers Interactively</h2></center></font>

<br>

<font color='#3498DB'><h2>Introduction</h2></font>
SWA-LP stands for Stochastic Weight Averaging for low precision training. SWA-LP can match the performance of full-precision training, even with all numbers quantized down to 8 bits. 

This notebook implements the Stochastic Weight Averaging technique with NVIDIA Apex on Transformers using PyTorch. It also shows how to interactively interpret Transformers using LIT (Language Interpretability Tool), a platform for NLP model understanding.


<font color='#3498DB'><h2>Idea</h2></font>
Building on ideas from my earlier discussions and kernels, and working within limited compute resources, this kernel explains how to fine-tune Transformers with SWA and Apex AMP. It also shows how to combine strategies such as Weighted Layer Pooling, the MADGRAD optimizer, and Grouped LLRD.

We will also see how we can implement visualizations for salience maps, attention, as well as aggregate analysis including metrics, embedding spaces, and flexible slicing for interpreting transformer models.

*Note: We will be using the CommonLit Readability Prize competition data.*

<font color='#3498DB'><h2>Overview</h2></font>
Before we jump into the code, let's build an understanding of some techniques and how they work. Having at least a high-level idea of each technique will make the code easier to follow.

<font color='#3498DB'><h3>Stochastic Weight Averaging</h3></font>
**Paper**: [Averaging Weights Leads to Wider Optima and Better Generalization](https://arxiv.org/pdf/1803.05407.pdf)  
**Blog**: [PyTorch 1.6 now includes Stochastic Weight Averaging](https://pytorch.org/blog/pytorch-1.6-now-includes-stochastic-weight-averaging/)

SWA produces an ensemble by combining weights of the same network at different stages of training and then uses this model with combined weights to make predictions. 
![swa](https://miro.medium.com/max/1766/1*_USiR_z8PKaDuIcAs9xomw.png)
There are two important ingredients that make SWA work. 

- First, SWA uses a modified learning rate schedule so that SGD (or other optimizers such as Adam) continues to bounce around the optimum and explore diverse models instead of simply converging to a single solution. For example, we can use the standard decaying learning rate strategy for the first 75% of training time and then set the learning rate to a reasonably high constant value for the remaining 25% of the time (see Figure below). 
![swa_training](https://pytorch.org/assets/images/nswapytorch2.jpg)
- The second ingredient is to take an average of the weights (typically an equal average) of the networks traversed by SGD. For example, we can maintain a running average of the weights obtained at the end of every epoch within the last 25% of training time. After training is complete, we then set the weights of the network to the computed SWA averages.

- Another important detail is the batch normalization. Batch normalization layers compute running statistics of activations during training. Note that the SWA averages of the weights are never used to make predictions during training. So the batch normalization layers do not have the activation statistics computed at the end of training. We can compute these statistics by doing a single forward pass on the train data with the SWA model.  
![swa2](https://miro.medium.com/max/502/1*Afu2bqxzC6p1BpIRTDWJtg.png)    

Thus we only need to train one model, and store two models in memory during training. For prediction, we only need the running average model.
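
As a framework-free illustration of the two ingredients above, the sketch below keeps an equal running average of weight snapshots taken during the last 25% of epochs. The helper names `lr_at` and `swa_update`, the learning-rate values, and the toy "training step" are assumptions made purely for illustration; the notebook itself relies on `torch.optim.swa_utils` rather than this code.

```python
def lr_at(epoch, total_epochs, base_lr=0.1, swa_lr=0.05, swa_frac=0.25):
    """Decaying LR for the first 75% of epochs, then a constant SWA LR."""
    swa_start = int(total_epochs * (1 - swa_frac))
    if epoch >= swa_start:
        return swa_lr
    # linear decay from base_lr toward swa_lr across the non-SWA phase
    return base_lr - (base_lr - swa_lr) * epoch / swa_start


def swa_update(avg_weights, new_weights, num_averaged):
    """Equal running average: avg <- avg + (new - avg) / (n + 1)."""
    return [a + (w - a) / (num_averaged + 1) for a, w in zip(avg_weights, new_weights)]


total_epochs = 8
swa_start = int(total_epochs * 0.75)  # epoch 6 starts the last 25% of training
weights = [0.0, 0.0]                  # stand-in for the model parameters
swa_weights, num_averaged = None, 0

for epoch in range(total_epochs):
    lr = lr_at(epoch, total_epochs)
    # toy "training step": pretend each epoch moves the weights by the learning rate
    weights = [w + lr for w in weights]
    if epoch >= swa_start:
        if swa_weights is None:
            swa_weights = list(weights)  # first snapshot initializes the average
        else:
            swa_weights = swa_update(swa_weights, weights, num_averaged)
        num_averaged += 1

print(f"snapshots averaged: {num_averaged}")
```

Only `weights` is trained; `swa_weights` is the second copy kept in memory and is the one that would be used for prediction.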

<font color='#3498DB'><h3>MADGRAD Optimizer</h3></font>
**Paper**: [Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization](https://arxiv.org/abs/2101.11075)

MADGRAD is a novel optimization method in the family of AdaGrad adaptive gradient methods. MADGRAD shows excellent performance on deep learning optimization problems from multiple fields, including classification and image-to-image tasks in vision, and recurrent and bidirectionally-masked models in natural language processing.
![madgrad](https://github.com/facebookresearch/madgrad/raw/master/figures/nlp.png?raw=true)
For each of these tasks, MADGRAD matches or outperforms both SGD and ADAM in test set performance, even on problems for which adaptive methods normally perform poorly.

<font color='#3498DB'><h4>Things to Note</h4></font>
 - You may need to use a lower weight decay than you are accustomed to. Often 0.
 - You should do a full learning rate sweep as the optimal learning rate will be different from SGD or Adam. On NLP models gradient clipping also helped.

<font color='#3498DB'><h3>Language Interpretability Tool (LIT)</h3></font>
**Paper**: [The Language Interpretability Tool: Extensible, Interactive Visualizations and Analysis for NLP Models](https://www.aclweb.org/anthology/2020.emnlp-demos.15.pdf)    
**Blog**: [The Language Interpretability Tool (LIT): Interactive Exploration and Analysis of NLP Models](https://ai.googleblog.com/2020/11/the-language-interpretability-tool-lit.html)    
**Official Page**: [Language Interpretability Tool](https://pair-code.github.io/lit/)  
**Examples**: [GitHub](https://github.com/PAIR-code/lit/tree/main/lit_nlp/examples)

LIT is an open-source platform for visualization and understanding of NLP models. LIT contains many built-in capabilities but is also customizable, with the ability to add custom interpretability techniques, metrics calculations, counterfactual generators, visualizations, and more. 

| Built-in capabilities | Supported task types | Framework agnostic |
| :-: | :-: | :-: |
| Salience maps | Classification | TensorFlow 1.x |
| Attention visualization | Regression | TensorFlow 2.x |
| Metrics calculations | Text generation / seq2seq | PyTorch |
| Counterfactual generation | Masked language models | Notebook compatibility |
| Model and datapoint comparison | Span labeling | Custom inference code |
| Embedding visualization | Multi-headed models | Remote Procedure Calls |
| And more... | And more... | And more... |

LIT can be run as a standalone server, or inside Python notebook environments such as Colab and Jupyter.

![lit](https://pair-code.github.io/lit/assets/images/lit-tweet.gif)

<font color='#3498DB'><h3>NVIDIA Apex - AMP</h3></font>
 - Covered in the [Speeding up Transformer w/ Optimization Strategies](https://www.kaggle.com/rhtsingh/speeding-up-transformer-w-optimization-strategies) notebook.

<font color='#3498DB'><h3>Weighted Layers Pooling</h3></font>
 - Covered in the [Utilizing Transformer Representations Efficiently](https://www.kaggle.com/rhtsingh/utilizing-transformer-representations-efficiently) notebook.

<font color='#3498DB'><h3>Grouped Layerwise Learning Rate Decay</h3></font>
 - Covered in the [Guide to HuggingFace Schedulers & Differential LRs](https://www.kaggle.com/rhtsingh/guide-to-huggingface-schedulers-differential-lrs) notebook.
 
 
**That's pretty much everything we need to know. We will cover the implementation and other nitty-gritty details while coding. If anything is still unclear, we can discuss it in the comments.**

<font color='#3498DB'><h2>Code</h2></font>

<font color='#3498DB'><h3>Install Dependencies</h3></font>
First, we do the necessary setup: installing NVIDIA Apex, the MADGRAD optimizer, and the Language Interpretability Tool. The preliminary setup commands for both Kaggle and Google Colab are given below.

*Note: If you're using Google Colab make sure to download kaggle.json by creating a new API token from your account.*

<font color='#3498DB'><h4>Colab Setup</h4></font>

In [None]:
# from google.colab import files
# files.upload() # Upload your Kaggle API Token 
# !mkdir ~/.kaggle
# !mv kaggle.json ~/.kaggle
# !chmod 600 ~/.kaggle/kaggle.json
# !kaggle competitions download -c commonlitreadabilityprize
# !unzip train.csv.zip

In [None]:
# %%writefile setup.sh
# export CUDA_HOME=/usr/local/cuda-10.1
# git clone https://github.com/NVIDIA/apex
# cd apex
# pip install -v --disable-pip-version-check --no-cache-dir ./

In [None]:
# %%capture
# !sh setup.sh
# !pip -q install madgrad
# !pip -q install lit_nlp
# !pip -q install transformers

<font color='#3498DB'><h4>Kaggle Setup</h4></font>

*Note: Installing NVIDIA-Apex will take some time.*

In [None]:
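%%writefile setup.sh
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
# step back out of the clone before deleting it; from inside the repo, ./apex points at apex/apex
cd ..
rm -rf ./apex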

In [None]:
%%capture
!sh setup.sh
!pip -q install madgrad
!pip -q install lit_nlp

<font color='#3498DB'><h3>Import Dependencies</h3></font>

Here we import the required dependencies and define a few utility functions. `optimal_num_of_loader_workers` picks the number of workers for our dataloaders, and `fix_all_seeds` seeds the random number generators for reproducibility.

In [None]:
import os
import gc
gc.enable()
import math
import json
import time
import random
import multiprocessing
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

import numpy as np
import pandas as pd
from tqdm import tqdm, trange
from sklearn import model_selection

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Parameter
import torch.optim as optim
from torch.utils.data import (
    Dataset, DataLoader,
    SequentialSampler, RandomSampler
)

try:
    from apex import amp
    APEX_INSTALLED = True
except ImportError:
    APEX_INSTALLED = False

from madgrad import MADGRAD

try:
    from torch.optim.swa_utils import (
        AveragedModel, update_bn, SWALR
    )
    SWA_AVAILABLE = True
except ImportError:
    SWA_AVAILABLE = False

import transformers
from transformers import (
    WEIGHTS_NAME,
    AdamW,
    AutoConfig,
    AutoModel,
    AutoTokenizer,
    get_cosine_schedule_with_warmup,
    get_linear_schedule_with_warmup,  # used by make_scheduler for non-cosine decay
    logging,
    MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING,
)
# silence transformers log messages below error level
logging.set_verbosity_error()

def fix_all_seeds(seed):
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def optimal_num_of_loader_workers():
    num_cpus = multiprocessing.cpu_count()
    num_gpus = torch.cuda.device_count()
    optimal_value = min(num_cpus, num_gpus*4) if num_gpus else num_cpus - 1
    return optimal_value

print(f"Apex AMP Installed :: {APEX_INSTALLED}")
print(f"SWA Available :: {SWA_AVAILABLE}")
MODEL_CONFIG_CLASSES = list(MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)

<font color='#3498DB'><h3>Data & Train-Validation Split</h3></font>

Here we load our data and use the usual 5-fold `KFold` split with `random_state=2021`.

In [None]:
train = pd.read_csv('../input/commonlitreadabilityprize/train.csv', low_memory=False)
def create_folds(data, num_splits):
    data["kfold"] = -1
    kf = model_selection.KFold(n_splits=num_splits, shuffle=True, random_state=2021)
    for f, (t_, v_) in enumerate(kf.split(X=data)):
        data.loc[v_, 'kfold'] = f
    return data
train = create_folds(train, num_splits=5)

<font color='#3498DB'><h3>Training Config</h3></font>

The Config class is where we define our training hyperparameters, output path, etc. We define model, tokenizer, optimizer, scheduler, swa, and training configurations.

In [None]:
class Config:
    # model
    num_labels = 1
    model_type = 'roberta'
    model_name_or_path = 'roberta-base'
    config_name = 'roberta-base'
    fp16 = APEX_INSTALLED  # mixed precision only when Apex imported successfully
    fp16_opt_level = "O1"

    # tokenizer
    tokenizer_name = 'roberta-base'
    max_seq_length = 250

    # train
    epochs = 10
    train_batch_size = 24
    eval_batch_size = 16

    # optimizer
    optimizer_type = 'MADGRAD'
    learning_rate = 2e-5
    weight_decay = 1e-5
    epsilon = 1e-6
    max_grad_norm = 1.0

    # stochastic weight averaging
    swa = True
    swa_start = 7
    swa_learning_rate = 1e-4
    anneal_epochs = 3
    anneal_strategy = 'cos'

    # scheduler
    decay_name = 'cosine-warmup'
    warmup_ratio = 0.03

    # logging
    logging_steps = 10

    # evaluate
    output_dir = 'output'
    seed = 2021

<font color='#3498DB'><h3>Average Meter</h3></font>

This helper keeps running statistics (latest value, average, min, max) that we use for logging metrics.

In [None]:
class AverageMeter(object):
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0
        self.max = 0
        self.min = 1e5

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count
        if val > self.max:
            self.max = val
        if val < self.min:
            self.min = val

<font color='#3498DB'><h3>Dataset Retriever</h3></font>

The dataset retriever stores the samples and their corresponding labels. When indexed, it tokenizes a sample into features (`input_ids` and `attention_mask`) and converts them to tensors for the model inputs.

In [None]:
class DatasetRetriever(Dataset):
    def __init__(self, data, tokenizer, max_len, is_test=False):
        super(DatasetRetriever, self).__init__()
        self.data = data
        self.is_test = is_test
        self.excerpts = self.data.excerpt.values.tolist()
        if not self.is_test:
            self.targets = self.data.target.values.tolist()
        self.tokenizer = tokenizer
        self.max_len = max_len
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, item):
        excerpt = self.excerpts[item]
        features = self.convert_examples_to_features(
            excerpt, self.tokenizer, 
            self.max_len
        )
        features = {key : torch.tensor(value, dtype=torch.long) for key, value in features.items()}
        if not self.is_test:
            label = self.targets[item]
            features['labels'] = torch.tensor(label, dtype=torch.double)
        return features
    
    def convert_examples_to_features(self, example, tokenizer, max_len):
        features = tokenizer.encode_plus(
            example.replace('\n', ''), 
            max_length=max_len, 
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
        )
        return features

<font color='#3498DB'><h3>Model</h3></font>

Here we define our model. It combines weighted layer pooling over the CLS embeddings from each layer, multi-sample dropout, and custom layer initialization, all of which I have explained in other kernels.
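
To make the weighted layer pooling more concrete, here is a small stdlib-only sketch of the softmax weighting the model starts from. It mirrors the initialization used in the model code (`-3` for every hidden state except the last, which gets `0`), but the scalar stand-ins for the CLS embeddings and the `softmax` helper are illustrative assumptions, not the notebook's implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


num_hidden_layers = 12  # roberta-base; hidden states = embedding output + 12 layers
layer_logits = [-3.0] * num_hidden_layers + [0.0]  # mirrors weights_init in Model
layer_probs = softmax(layer_logits)

# toy CLS "embeddings": one scalar per hidden state instead of a full vector
cls_per_layer = [float(i) for i in range(num_hidden_layers + 1)]
pooled = sum(p * h for p, h in zip(layer_probs, cls_per_layer))

print(f"weight on last layer: {layer_probs[-1]:.3f}")
print(f"weight on each earlier layer: {layer_probs[0]:.3f}")
```

With this initialization the last layer starts with most of the weight, and training is free to shift weight toward earlier layers because `layer_weights` is a learnable parameter.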

In [None]:
class Model(nn.Module):
    def __init__(
        self, model_name, 
        config
    ):
        super(Model, self).__init__()
        self.config = config
        self.roberta = AutoModel.from_pretrained(
            model_name, 
            config=config
        )
        self.dropout = nn.Dropout(p=0.2)
        self.high_dropout = nn.Dropout(p=0.5)
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-5)
        self._init_weights(self.layer_norm)
        self.regressor = nn.Linear(config.hidden_size, config.num_labels)
        self._init_weights(self.regressor)
        
        weights_init = torch.zeros(config.num_hidden_layers + 1).float()
        weights_init.data[:-1] = -3
        self.layer_weights = torch.nn.Parameter(weights_init)
 
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
 
    def forward(
        self, input_ids=None,
        attention_mask=None, labels=None
    ):
        outputs = self.roberta(
            input_ids,
            attention_mask=attention_mask,
        )
        all_hidden_states = outputs[2]
        
        # weighted layer pooling
        cls_embeddings = torch.stack(
            [self.dropout(layer[:, 0]) for layer in all_hidden_states], 
            dim=2
        )
        cls_output = (
            torch.softmax(self.layer_weights, dim=0) * cls_embeddings
        ).sum(-1)
        cls_output = self.layer_norm(cls_output)
        
        # multi-sample dropout
        logits = torch.mean(
            torch.stack(
                [self.regressor(self.high_dropout(cls_output)) for _ in range(5)],
                dim=0,
            ),
            dim=0,
        )
 
        # calculate loss
        loss = None
        if labels is not None:
            # regression task
            loss_fn = torch.nn.MSELoss()
            logits = logits.view(-1).to(labels.dtype)
            loss = torch.sqrt(loss_fn(logits, labels.view(-1)))
        output = (logits,) + outputs[2:]
        
        del all_hidden_states, cls_embeddings
        del cls_output, logits
        gc.collect()
        
        return ((loss,) + output) if loss is not None else output

<font color='#3498DB'><h3>Grouped Optimizer Parameters & LLRD</h3></font>

We will be using Grouped LLRD (Layer-wise Learning Rate Decay) since, in my experiments, it showed better performance and generalization than simple LLRD.
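
The grouping can be summarized with a small lookup: given a parameter name, which learning rate does it end up with? The function `grouped_lr` below is a hypothetical helper written only for illustration; it restates the grouping rules of the optimizer setup in this notebook (factor `2.6`, head multiplier `20`) and ignores weight decay.

```python
def grouped_lr(param_name, base_lr=2e-5, factor=2.6, head_multiplier=20):
    """Learning rate a parameter receives under the three-group LLRD scheme."""
    low = [f"layer.{i}." for i in range(0, 4)]    # encoder layers 0-3
    mid = [f"layer.{i}." for i in range(4, 8)]    # encoder layers 4-7
    high = [f"layer.{i}." for i in range(8, 12)]  # encoder layers 8-11

    if "roberta" not in param_name:
        # pooling weights, layer norm and regressor sit outside the backbone
        return base_lr * head_multiplier
    if any(tag in param_name for tag in low):
        return base_lr / factor
    if any(tag in param_name for tag in mid):
        return base_lr
    if any(tag in param_name for tag in high):
        return base_lr * factor
    # embeddings and pooler fall back to the optimizer's default learning rate
    return base_lr


for name in [
    "roberta.embeddings.word_embeddings.weight",
    "roberta.encoder.layer.1.attention.self.query.weight",
    "roberta.encoder.layer.10.output.dense.weight",
    "regressor.weight",
]:
    print(f"{name}: {grouped_lr(name):.2e}")
```

The trailing dot in tags such as `layer.1.` matters: without it, `layer.1` would also match `layer.10` and `layer.11` and place them in the wrong group.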

In [None]:
def get_optimizer_grouped_parameters(args, model):
    no_decay = ["bias", "LayerNorm.weight"]
    group1=['layer.0.','layer.1.','layer.2.','layer.3.']
    group2=['layer.4.','layer.5.','layer.6.','layer.7.']    
    group3=['layer.8.','layer.9.','layer.10.','layer.11.']
    group_all=['layer.0.','layer.1.','layer.2.','layer.3.','layer.4.','layer.5.','layer.6.','layer.7.','layer.8.','layer.9.','layer.10.','layer.11.']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in model.roberta.named_parameters() if not any(nd in n for nd in no_decay) and not any(nd in n for nd in group_all)],'weight_decay': args.weight_decay},
        {'params': [p for n, p in model.roberta.named_parameters() if not any(nd in n for nd in no_decay) and any(nd in n for nd in group1)],'weight_decay': args.weight_decay, 'lr': args.learning_rate/2.6},
        {'params': [p for n, p in model.roberta.named_parameters() if not any(nd in n for nd in no_decay) and any(nd in n for nd in group2)],'weight_decay': args.weight_decay, 'lr': args.learning_rate},
        {'params': [p for n, p in model.roberta.named_parameters() if not any(nd in n for nd in no_decay) and any(nd in n for nd in group3)],'weight_decay': args.weight_decay, 'lr': args.learning_rate*2.6},
        {'params': [p for n, p in model.roberta.named_parameters() if any(nd in n for nd in no_decay) and not any(nd in n for nd in group_all)],'weight_decay': 0.0},
        {'params': [p for n, p in model.roberta.named_parameters() if any(nd in n for nd in no_decay) and any(nd in n for nd in group1)],'weight_decay': 0.0, 'lr': args.learning_rate/2.6},
        {'params': [p for n, p in model.roberta.named_parameters() if any(nd in n for nd in no_decay) and any(nd in n for nd in group2)],'weight_decay': 0.0, 'lr': args.learning_rate},
        {'params': [p for n, p in model.roberta.named_parameters() if any(nd in n for nd in no_decay) and any(nd in n for nd in group3)],'weight_decay': 0.0, 'lr': args.learning_rate*2.6},
        {'params': [p for n, p in model.named_parameters() if args.model_type not in n], 'lr':args.learning_rate*20, "weight_decay": 0.0},
    ]
    return optimizer_grouped_parameters

<font color='#3498DB'><h3>Utilities</h3></font>

Below we define our utility functions which will initialize our different components.

In [None]:
def make_model(args, output_attentions=False):
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name)
    config = AutoConfig.from_pretrained(args.config_name)
    config.update({'num_labels':args.num_labels})
    config.update({"output_hidden_states":True})
    if output_attentions:
        config.update({"output_attentions":True})
    model = Model(args.model_name_or_path, config=config)
    return model, config, tokenizer

def make_optimizer(args, model):
    optimizer_grouped_parameters = get_optimizer_grouped_parameters(args, model)
    if args.optimizer_type == "AdamW":
        optimizer = AdamW(
            optimizer_grouped_parameters,
            lr=args.learning_rate,
            eps=args.epsilon,
            # Config defines no `use_bertadam`; default to standard bias correction
            correct_bias=not getattr(args, "use_bertadam", False)
        )
    else:
        optimizer = MADGRAD(
            optimizer_grouped_parameters,
            lr=args.learning_rate,
            eps=args.epsilon,
            weight_decay=args.weight_decay
        )
    return optimizer

def make_scheduler(
    args, optimizer, 
    num_warmup_steps, 
    num_training_steps
):
    if args.decay_name == "cosine-warmup":
        scheduler = get_cosine_schedule_with_warmup(
            optimizer,
            num_warmup_steps=num_warmup_steps,
            num_training_steps=num_training_steps
        )
    else:
        scheduler = get_linear_schedule_with_warmup(
            optimizer,
            num_warmup_steps=num_warmup_steps,
            num_training_steps=num_training_steps
        )
    return scheduler    

def make_loader(
    args, data, 
    tokenizer, fold
):
    train_set, valid_set = data[data['kfold']!=fold], data[data['kfold']==fold]

    train_dataset = DatasetRetriever(train_set, tokenizer, args.max_seq_length)
    valid_dataset = DatasetRetriever(valid_set, tokenizer, args.max_seq_length)
    print(f"Num examples Train= {len(train_dataset)}, Num examples Valid={len(valid_dataset)}")
    
    train_sampler = RandomSampler(train_dataset)
    valid_sampler = SequentialSampler(valid_dataset)

    train_dataloader = DataLoader(
        train_dataset,
        batch_size=args.train_batch_size,
        sampler=train_sampler,
        num_workers=optimal_num_of_loader_workers(),
        pin_memory=True,
        drop_last=False 
    )

    valid_dataloader = DataLoader(
        valid_dataset,
        batch_size=args.eval_batch_size, 
        sampler=valid_sampler,
        num_workers=optimal_num_of_loader_workers(),
        pin_memory=True, 
        drop_last=False
    )

    return train_dataloader, valid_dataloader

<font color='#3498DB'><h3>Trainer</h3></font>

Here we define our Trainer class, the main engine of the pipeline. It contains the changes needed to train with SWA and Apex AMP.
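
The way the trainer switches between the regular scheduler and SWA depends only on the epoch index and `swa_start`. The sketch below replays that gating with the values from `Config` (`epochs = 10`, `swa_start = 7`); the phase labels are descriptive strings made up for this illustration.

```python
epochs, swa_start = 10, 7  # values taken from Config

schedule = []
for epoch in range(epochs):
    if (epoch + 1) < swa_start:
        # regular phase: the warmup/cosine scheduler steps after every batch
        schedule.append((epoch, "base scheduler"))
    else:
        # SWA phase: weights are averaged and SWALR steps once at the end of the epoch
        schedule.append((epoch, "SWA"))

swa_epochs = [epoch for epoch, phase in schedule if phase == "SWA"]
print(swa_epochs)  # [6, 7, 8, 9]
```

With these settings the averaged model is built from four end-of-epoch snapshots, which is why the SWA model can only be evaluated meaningfully after training finishes.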

In [None]:
class Trainer:
    def __init__(
        self, model, tokenizer, 
        optimizer, scheduler, 
        swa_model=None, swa_scheduler=None
    ):
        self.model = model
        self.tokenizer = tokenizer
        self.optimizer = optimizer
        self.scheduler = scheduler
        self.swa_model = swa_model
        self.swa_scheduler = swa_scheduler

    def train(
        self, args, 
        train_dataloader, 
        epoch, result_dict
    ):
        count = 0
        losses = AverageMeter()
        
        self.model.zero_grad()
        self.model.train()
        
        fix_all_seeds(args.seed)
        for batch_idx, batch_data in enumerate(train_dataloader):
            input_ids, attention_mask, labels = \
                batch_data['input_ids'], batch_data['attention_mask'], batch_data['labels']
            input_ids, attention_mask, labels = \
                input_ids.cuda(), attention_mask.cuda(), labels.cuda()

            outputs = self.model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels
            )

            loss, logits = outputs[:2]
            
            if args.fp16:
                with amp.scale_loss(loss, self.optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                loss.backward()

            count += labels.size(0)
            losses.update(loss.item(), input_ids.size(0))

            if args.fp16:
                torch.nn.utils.clip_grad_norm_(amp.master_params(self.optimizer), args.max_grad_norm)
            else:
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), args.max_grad_norm)

            self.optimizer.step()
            if not args.swa:
                self.scheduler.step()
            else:
                if (epoch+1) < args.swa_start:
                    self.scheduler.step()
            self.optimizer.zero_grad()

            if (batch_idx % args.logging_steps == 0) or (batch_idx+1)==len(train_dataloader):
                _s = str(len(str(len(train_dataloader.sampler))))
                ret = [
                    ('Epoch: {:0>2} [{: >' + _s + '}/{} ({: >3.0f}%)]').format(epoch, count, len(train_dataloader.sampler), 100 * count / len(train_dataloader.sampler)),
                    'Train Loss: {: >4.5f}'.format(losses.avg),
                ]
                print(', '.join(ret))
            
        if args.swa and (epoch+1) >= args.swa_start:
            self.swa_model.update_parameters(self.model)
            self.swa_scheduler.step()

        result_dict['train_loss'].append(losses.avg)
        return result_dict

<font color='#3498DB'><h3>Evaluator</h3></font>

Here we define our evaluator class which we will use to evaluate our model performance and save results.

*Note: We have two evaluate functions. The first is for evaluating with the original model and second is for evaluating with the swa_model after the training is completed.*

In [None]:
class Evaluator:
    def __init__(self, model, swa_model):
        self.model = model
        self.swa_model = swa_model
    
    def save(self, result, output_dir):
        with open(f'{output_dir}/result_dict.json', 'w') as f:
            f.write(json.dumps(result, sort_keys=True, indent=4, ensure_ascii=False))

    def evaluate(self, valid_dataloader, epoch, result_dict):
        losses = AverageMeter()
        for batch_idx, batch_data in enumerate(valid_dataloader):
            self.model = self.model.eval()
            input_ids, attention_mask, labels = \
                batch_data['input_ids'], batch_data['attention_mask'], batch_data['labels']
            input_ids, attention_mask, labels = \
                input_ids.cuda(), attention_mask.cuda(), labels.cuda()
            with torch.no_grad():            
                outputs = self.model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    labels=labels
                )
                loss, logits = outputs[:2]
                losses.update(loss.item(), input_ids.size(0))
        print('----Validation Results Summary----')
        print('Epoch: [{}] Valid Loss: {: >4.5f}'.format(epoch, losses.avg))
        result_dict['val_loss'].append(losses.avg)        
        return result_dict
    
    def swa_evaluate(self, valid_dataloader, epoch, result_dict):
        losses = AverageMeter()
        for batch_idx, batch_data in enumerate(valid_dataloader):
            self.swa_model = self.swa_model.eval()
            input_ids, attention_mask, labels = \
                batch_data['input_ids'], batch_data['attention_mask'], batch_data['labels']
            input_ids, attention_mask, labels = \
                input_ids.cuda(), attention_mask.cuda(), labels.cuda()
            with torch.no_grad():            
                outputs = self.swa_model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    labels=labels
                )
                loss, logits = outputs[:2]
                losses.update(loss.item(), input_ids.size(0))
        print('----SWA Validation Results Summary----')
        print('Epoch: [{}] Valid Loss: {: >4.5f}'.format(epoch, losses.avg))
        result_dict['swa_loss'].append(losses.avg)        
        return result_dict

<font color='#3498DB'><h3>Initialize Training</h3></font>

We initialize all our training components using the below method. It will help us in initializing model, scheduler, optimizer, train and eval loader, mixed precision training, swa and our results dict.
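
For intuition about the scheduler arguments computed during initialization, the sketch below derives the step counts and evaluates a warmup-plus-cosine multiplier. The loader length of `95` batches is a hypothetical number, and `cosine_with_warmup` is a stand-in intended to mirror the shape of `get_cosine_schedule_with_warmup` with its default settings, not the library function itself.

```python
import math


def cosine_with_warmup(step, num_warmup_steps, num_training_steps):
    """LR multiplier: linear warmup to 1.0, then half-cosine decay to 0.0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


batches_per_epoch = 95            # hypothetical len(train_dataloader)
epochs, warmup_ratio = 10, 0.03   # values taken from Config

num_training_steps = batches_per_epoch * epochs
num_warmup_steps = int(warmup_ratio * num_training_steps)
print(f"training steps: {num_training_steps}, warmup steps: {num_warmup_steps}")

for step in (0, num_warmup_steps, num_training_steps // 2, num_training_steps):
    print(step, round(cosine_with_warmup(step, num_warmup_steps, num_training_steps), 4))
```

Each parameter group's learning rate is its own base rate multiplied by this factor, so the grouped rates keep their relative ratios throughout the schedule.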

In [None]:
def init_training(args, data, fold):
    fix_all_seeds(args.seed)
    
    if not os.path.exists(args.output_dir):
        os.makedirs(args.output_dir)
    
    # model
    model, model_config, tokenizer = make_model(args)
    if torch.cuda.device_count() >= 1:
        print('Model pushed to {} GPU(s), type {}.'.format(
            torch.cuda.device_count(), 
            torch.cuda.get_device_name(0))
        )
        model = model.cuda() 
    else:
        raise ValueError('CPU training is not supported')
    
    # data loaders for training and evaluation
    train_dataloader, valid_dataloader = make_loader(args, data, tokenizer, fold)

    # optimizer
    optimizer = make_optimizer(args, model)

    # scheduler
    num_training_steps = len(train_dataloader) * args.epochs
    if args.warmup_ratio > 0:
        num_warmup_steps = int(args.warmup_ratio * num_training_steps)
    else:
        num_warmup_steps = 0
    print(f"Total Training Steps: {num_training_steps}, Total Warmup Steps: {num_warmup_steps}")
    scheduler = make_scheduler(args, optimizer, num_warmup_steps, num_training_steps)

    # stochastic weight averaging
    swa_model = AveragedModel(model)
    swa_scheduler = SWALR(
        optimizer, swa_lr=args.swa_learning_rate, 
        anneal_epochs=args.anneal_epochs, 
        anneal_strategy=args.anneal_strategy
    )

    print(f"Total Training Steps: {num_training_steps}, Total Warmup Steps: {num_warmup_steps}, SWA Start Epoch: {args.swa_start}")

    # mixed precision training with NVIDIA Apex
    if args.fp16:
        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
    
    result_dict = {
        'epoch':[], 
        'train_loss': [], 
        'val_loss' : [], 
        'swa_loss': [],
        'best_val_loss': np.inf
    }

    return (
        model, model_config, tokenizer, optimizer, scheduler, 
        train_dataloader, valid_dataloader, result_dict,
        swa_model, swa_scheduler
    )

<font color='#3498DB'><h3>Run</h3></font>

This is our main function that will stitch everything together i.e. training, evaluating, saving the best model, updating bn for our swa model, etc.

*Note: Training takes ~1 hr to complete on Tesla P100-PCIE-16GB provided by Kaggle and Colab.*

In [None]:
def run(data, fold):
    args = Config()
    model, model_config, tokenizer, optimizer, scheduler, train_dataloader, \
        valid_dataloader, result_dict, swa_model, swa_scheduler = init_training(args, data, fold)
    
    trainer = Trainer(model, tokenizer, optimizer, scheduler, swa_model, swa_scheduler)
    evaluator = Evaluator(model, swa_model)

    train_time_list = []
    valid_time_list = []

    for epoch in range(args.epochs):
        result_dict['epoch'].append(epoch)

        # Train
        torch.cuda.synchronize()
        tic1 = time.time()
        result_dict = trainer.train(
            args, train_dataloader, 
            epoch, result_dict
        )
        torch.cuda.synchronize()
        tic2 = time.time() 
        train_time_list.append(tic2 - tic1)
        
        # Evaluate
        torch.cuda.synchronize()
        tic3 = time.time()
        result_dict = evaluator.evaluate(
            valid_dataloader, epoch, result_dict
        )
        torch.cuda.synchronize()
        tic4 = time.time() 
        valid_time_list.append(tic4 - tic3)
            
        output_dir = os.path.join(args.output_dir, f"checkpoint-fold-{fold}")
        if result_dict['val_loss'][-1] < result_dict['best_val_loss']:
            print("{} Epoch, Best epoch was updated! Valid Loss: {: >4.5f}".format(epoch, result_dict['val_loss'][-1]))
            result_dict["best_val_loss"] = result_dict['val_loss'][-1]        
            
            os.makedirs(output_dir, exist_ok=True)
            torch.save(model.state_dict(), f"{output_dir}/pytorch_model.bin")
            model_config.save_pretrained(output_dir)
            tokenizer.save_pretrained(output_dir)
            print(f"Saving model checkpoint to {output_dir}.")
    
            #torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
            #torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
            #print(f"Saving optimizer and scheduler states to {output_dir}.")
        print()
        
    if args.swa:
        update_bn(train_dataloader, swa_model, device=torch.device('cuda'))
    result_dict = evaluator.swa_evaluate(valid_dataloader, epoch, result_dict)
    
    evaluator.save(result_dict, output_dir)
    torch.save(swa_model.state_dict(), f"{output_dir}/swa_pytorch_model.bin")
    
    print()
    print(f"Total Training Time: {np.sum(train_time_list)}secs, Average Training Time per Epoch: {np.mean(train_time_list)}secs.")
    print(f"Total Validation Time: {np.sum(valid_time_list)}secs, Average Validation Time per Epoch: {np.mean(valid_time_list)}secs.")
    
    # Drop the references first so that empty_cache() can actually release GPU memory
    del trainer, evaluator
    del model, model_config, tokenizer
    del optimizer, scheduler
    del train_dataloader, valid_dataloader, result_dict
    del swa_model, swa_scheduler
    gc.collect()
    torch.cuda.empty_cache()

In [None]:
for fold in range(5):
    print();print()
    print('-'*50)
    print(f'FOLD: {fold}')
    print('-'*50)
    run(train, fold)
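
To make the averaging step less abstract, here is a minimal pure-Python sketch of the equal-weight running average that SWA keeps. It assumes `swa_model` behaves like `torch.optim.swa_utils.AveragedModel` with its default averaging rule; the function name and the toy weights are illustrative and not part of this notebook.

```python
# Pure-Python stand-in for the running average of weights kept by SWA.
# `update_running_average` and the toy snapshots are hypothetical names/values.

def update_running_average(avg_weights, new_weights, num_averaged):
    """Fold one more weight snapshot into the running average."""
    return [
        avg + (new - avg) / (num_averaged + 1)
        for avg, new in zip(avg_weights, new_weights)
    ]

# Three snapshots of a toy two-parameter model taken at successive epochs.
snapshots = [[1.0, 4.0], [2.0, 2.0], [3.0, 0.0]]

swa_weights = list(snapshots[0])
for n, weights in enumerate(snapshots[1:], start=1):
    swa_weights = update_running_average(swa_weights, weights, n)

print(swa_weights)  # the plain mean of the three snapshots
```

Because the averaged weights were never used in a forward pass during training, any batch-norm running statistics are stale, which is why `update_bn` is called on the training data before the SWA model is evaluated.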

<font color='#3498DB'><h2>Interpreting Transformers with LIT</h2></font>

Here we will implement model interpretability using LIT, along with an explanation of how each component works.

<font color='#3498DB'><h3>How It Works</h3></font>
The LitWidget object constructor takes a dict mapping model names to model objects, and a dict mapping dataset names to dataset objects. Those will be the datasets and models displayed in LIT. 

It also optionally takes in a height parameter for how tall to render the LIT UI in pixels (it defaults to 1000 pixels). Running the constructor will cause the LIT server to be started in the background, loading the models and datasets and enabling the UI to be served.

Render the LIT UI in an output cell by calling the render method on the LitWidget object. The LIT UI can be rendered multiple times in separate cells if desired. The widget also contains a stop method to shut down the LIT server.

<font color='#E74C3C'><h5>Thing to Note</h5></font>
LIT won't work on Kaggle [due to this issue](https://www.kaggle.com/product-feedback/89671). Running LIT in a Kaggle notebook throws the error **kkb-production.jupyter-proxy.kaggle.net is taking too long to respond.** Kaggle has disabled this feature because it caused a large slowdown in notebook startup time.

The code below is platform-independent, however, and will run on Google Colab, on a local machine, or elsewhere. I used Colab and will share snapshots here.

<font color='#3498DB'><h3>Import Dependencies</h3></font>
Import the LIT-specific dependencies.

In [None]:
import re
from lit_nlp.api import dataset as lit_dataset
from lit_nlp.api import types as lit_types
from lit_nlp.api import model as lit_model
from lit_nlp.lib import utils

In [None]:
# download snapshots
! conda install -y gdown
!gdown --id 1-RO8zoPGuX4HI1KsvH_Urjg6XxDf-Lq0
!gdown --id 1-Xcg0lBn6yehLkQnzadrWoCdLg9IGH3-
!gdown --id 1-UbtAiZsgCgo0SvnNa9uLzO6coPsODv7
!gdown --id 1-Qm2BYi-STYXfJ6DAK1yb88T7RBMerCh

<font color='#3498DB'><h3>Implement Dataset</h3></font>
We will inherit from the `lit_dataset.Dataset` class for this. The class stores the examples so that our model can fetch them for further preprocessing, prediction, etc. Its `spec` method tells LIT the type of each field.

In [None]:
class CommonLitData(lit_dataset.Dataset):
    def __init__(self, df, fold, split='val'):
        self._examples = self.load_datapoints(df, fold, split)
    
    def load_datapoints(self, df, fold, split):
        if split == 'val':
            df = df[df['kfold']==fold].reset_index()
        else:
            df = df[df['kfold']!=fold].reset_index()
        return [{
            "excerpt": row["excerpt"],
            "label": row["target"],
        } for _, row in df.iterrows()]

    def spec(self):
        return {
            'excerpt': lit_types.TextSegment(),
            'label': lit_types.RegressionScore(),
        }
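
To show what the fold filter in `load_datapoints` selects, here is a small sketch that runs without pandas or LIT. The rows are made-up placeholders rather than competition data, and `select_split` is a hypothetical helper that only mirrors the filtering logic above.

```python
# Toy rows standing in for the competition DataFrame (values are made up).
rows = [
    {"excerpt": "First passage.", "target": -0.3, "kfold": 0},
    {"excerpt": "Second passage.", "target": 0.1, "kfold": 1},
    {"excerpt": "Third passage.", "target": -1.2, "kfold": 1},
]

def select_split(rows, fold, split="val"):
    """Return LIT-style examples for the held-out fold ('val') or for the rest."""
    keep_val = split == "val"
    return [
        {"excerpt": r["excerpt"], "label": r["target"]}
        for r in rows
        if (r["kfold"] == fold) == keep_val
    ]

val_examples = select_split(rows, fold=1, split="val")
train_examples = select_split(rows, fold=1, split="train")
print(len(val_examples), len(train_examples))
```

Each example keeps only the keys declared in `spec`, so the field names in the data and in the spec stay in sync.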

<font color='#3498DB'><h3>Implement Model</h3></font>
This is our core engine, and here we inherit from the `lit_model.Model` class. We override the `predict_minibatch` method, which follows a standard transformer pipeline: define the model, load the model weights, tokenize the examples into features, and pass those to the model. We then add the CLS embeddings, attentions, and gradients to our outputs.

*Note 1: This might seem a bit complicated at first sight and took quite some time for me as well to understand completely. So if it is difficult, break down everything into pieces and run sequentially. That will help in understanding better.*

*Note 2: The implementation for Saliency Maps and Integrated Gradients isn't complete, so I've set `compute_grads` to False. I will complete it very soon.*

In [None]:
class CommonLitModel(lit_model.Model):
    compute_grads = False
    def __init__(self, args):
        self.model, self.config, self.tokenizer = make_model(args, output_attentions=True)
        self.model.eval()

    def max_minibatch_size(self):
        return 8

    def predict_minibatch(self, inputs):
        encoded_input = self.tokenizer.batch_encode_plus(
            [ex["excerpt"].replace("\n", "") for ex in inputs],
            add_special_tokens=True,
            max_length=256,
            padding="max_length",
            truncation=True,
            return_attention_mask=True
        )
        encoded_input = {
            key : torch.tensor(value, dtype=torch.long) for key, value in encoded_input.items()
        }
        
        if torch.cuda.is_available():
            self.model.cuda()
            for tensor in encoded_input:
                encoded_input[tensor] = encoded_input[tensor].cuda()
    
        with torch.set_grad_enabled(self.compute_grads):
            outputs = self.model(encoded_input['input_ids'], encoded_input['attention_mask'])
            if self.model.config.output_attentions:
                logits, hidden_states, output_attentions = outputs[0], outputs[1], outputs[2]
            else:
                logits, hidden_states = outputs[0], outputs[1]

        batched_outputs = {
            "input_ids": encoded_input["input_ids"],
            "ntok": torch.sum(encoded_input["attention_mask"], dim=1),
            "cls_emb": hidden_states[-1][:, 0],
            "score": torch.squeeze(logits, dim=-1)
        }
        
        if self.model.config.output_attentions:
            assert len(output_attentions) == self.model.config.num_hidden_layers
            for i, layer_attention in enumerate(output_attentions[-2:]):
                batched_outputs[f"layer_{i}/attention"] = layer_attention

        if self.compute_grads:
            scalar_pred_for_gradients = batched_outputs["score"]
            batched_outputs["input_emb_grad"] = torch.autograd.grad(
                scalar_pred_for_gradients,
                hidden_states[0],
                grad_outputs=torch.ones_like(scalar_pred_for_gradients)
            )[0]

        # detach() is required before numpy() for tensors that still track gradients
        detached_outputs = {k: v.detach().cpu().numpy() for k, v in batched_outputs.items()}
        for output in utils.unbatch_preds(detached_outputs):
            ntok = output.pop("ntok")
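            # Keep the special tokens so that `tokens` lines up with the attention
            # and gradient arrays below, which are all sliced to `ntok`.
            output["tokens"] = self.tokenizer.convert_ids_to_tokens(
                output.pop("input_ids")[:ntok]
            )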
            if self.compute_grads:
                output["token_grad_sentence"] = output["input_emb_grad"][:ntok]
            if self.model.config.output_attentions:
                for key in output:
                    if not re.match(r"layer_(\d+)/attention", key):
                        continue
                    output[key] = output[key][:, :ntok, :ntok].transpose((0, 2, 1))
                    output[key] = output[key].copy()
            yield output

    def input_spec(self) -> lit_types.Spec:
        return {
            "excerpt": lit_types.TextSegment(),
            "label": lit_types.RegressionScore()
        }

    def output_spec(self) -> lit_types.Spec:
        ret = {
            "tokens": lit_types.Tokens(),
            "score": lit_types.RegressionScore(parent="label"),
            "cls_emb": lit_types.Embeddings()
        }
        if self.compute_grads:
            ret["token_grad_sentence"] = lit_types.TokenGradients(
                align="tokens"
            )
        if self.model.config.output_attentions:
            for i in range(2): # self.model.config.num_hidden_layers
                ret[f"layer_{i}/attention"] = lit_types.AttentionHeads(
                    align_in="tokens", align_out="tokens")
        return ret
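
The post-processing loop at the end of `predict_minibatch` can be hard to follow, so here is a pure-Python sketch of the same idea: split the batched outputs into one dict per example, then trim the padding using `ntok`. The `unbatch` helper is a simplified stand-in for `utils.unbatch_preds`, not LIT's implementation, and the token ids are arbitrary placeholders.

```python
def unbatch(batched):
    """Turn a dict of per-batch lists into one dict per example."""
    keys = list(batched)
    return [dict(zip(keys, values)) for values in zip(*batched.values())]

PAD = 0  # placeholder padding id, chosen only for this illustration

batched = {
    "input_ids": [[11, 7, 8, 12, PAD, PAD], [11, 9, 12, PAD, PAD, PAD]],
    "ntok": [4, 3],          # number of real (non-padding) tokens per example
    "score": [-0.5, 0.25],
}

examples = []
for output in unbatch(batched):
    ntok = output.pop("ntok")
    # Drop the padding positions, as the model does for tokens and attention.
    output["input_ids"] = output["input_ids"][:ntok]
    examples.append(output)

print(examples)
```

Every per-token output has to be trimmed to the same length, otherwise LIT cannot align it with the `tokens` field declared in `output_spec`.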

<font color='#3498DB'><h3>Run</h3></font>

Now we load our 5-fold validation data and models, pass them to `notebook.LitWidget`, and call `widget.render()`.

This opens an interface like the one shown below.

In [None]:
def create_model(path):
    args = Config()
    args.config_name = path
    args.model_name_or_path = path
    args.tokenizer_name = path
    return CommonLitModel(args)

datasets = {
    'validation_0':CommonLitData(train, fold=0, split='val'),
    'validation_1':CommonLitData(train, fold=1, split='val'),
    'validation_2':CommonLitData(train, fold=2, split='val'),
    'validation_3':CommonLitData(train, fold=3, split='val'),
    'validation_4':CommonLitData(train, fold=4, split='val'),
}

models = {
    'model_0':create_model('output/checkpoint-fold-0/'),
    'model_1':create_model('output/checkpoint-fold-1/'),
    'model_2':create_model('output/checkpoint-fold-2/'),
    'model_3':create_model('output/checkpoint-fold-3/'),
    'model_4':create_model('output/checkpoint-fold-4/'),
}


from lit_nlp import notebook
widget = notebook.LitWidget(models, datasets, height=800)
# widget.render() -->> uncomment this line to render

<font color='#3498DB'><h3>Main</h3></font>

**Modules, groups, and workspaces** form the building blocks of LIT. Modules are discrete windows in which you can perform a specific set of tasks or analyses. Workspaces display combinations of modules known as groups, so you can view different visualizations and interpretability methods side-by-side.

![img1](main_screen.PNG)

LIT is divided into two workspaces - a Main workspace in the upper half of the interface, and a Group-based workspace in the lower half.

The Main workspace contains core modules that play a role in many analyses. By default, these include:

 - **Embeddings** - explore UMAP and TSNE embeddings from your model.
 - **Data Table** - explore, navigate, and make selections from your dataset.
 - **Datapoint Editor** - deep-dive into individual examples from your dataset.
 - **Slice Editor** - create and manage slices of interest from your dataset through your LIT session.
 
<font color='#3498DB'><h3>Models and Datasets</h3></font>
![img2](models.PNG)
At the very top, you’ll see the LIT toolbar. Here, you can quickly check which models have been loaded, configure LIT, or share a URL to your session. Below that is a toolbar which makes it easier to perform actions applied across all of LIT. Here you can:

 - Select data points by relationship, or by slice.
 - Choose a feature to color data points, across all modules.
 - Track the datapoint you’re looking at, navigate to the next, mark a datapoint as a favorite, or clear your selection.
 - Select the active models and dataset, including multiple models to compare.

![img3](validation_data.PNG)

The footer at the very bottom of the interface will display any error messages. 

<font color='#3498DB'><h3>Explanations</h3></font>
![img4](lime_attention.PNG)
In the Group-based workspace, modules that offer related insights are organized together under tabs. LIT offers a few default groups based on common analysis workflows: performance, predictions, explanations, and counterfactuals.

 - Use the Performance group to compare the performance of models across the entire dataset, or on individual slices.
 - Explore model results on individual data points in the Predictions group.
 - Investigate salience maps and attention for different data points in the Explanations group.
 - Generate data points using automated generators in the Counterfactuals group, and evaluate your model on them instantly.

<font color='#3498DB'><h2>Extras</h2></font>

This section is dedicated to interpreting the learning rate during Stochastic Weight Averaging. Run the code below to get a better sense of how the learning rate changes when using SWA:

``` python
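# Assumed completion of a truncated line, mirroring how make_model is called above
args = Config()
model, model_config, tokenizer = make_model(args)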
optimizer = make_optimizer(args, model)
scheduler = make_scheduler(args, optimizer, 1, 10)
swa_scheduler = SWALR(optimizer, swa_lr=1e-6, anneal_epochs=3, anneal_strategy='cos')
swa_start = 7
for epoch in range(10):
    optimizer.step()
    if (epoch+1) >= swa_start:
        print("starting swa", epoch)
        swa_scheduler.step()
    
    if (epoch+1) < swa_start:
        print('using simple scheduler')
        scheduler.step()
    print(optimizer.param_groups[0]['lr'])
```
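
If you want to see the shape of the SWA schedule without building a model, the sketch below computes a cosine anneal from a starting learning rate down to `swa_lr` in pure Python. To my understanding this matches what `SWALR` does with `anneal_strategy='cos'`, but treat it as an approximation: the function name and the starting learning rate of `2e-5` are assumptions made for illustration.

```python
import math

def cosine_annealed_lr(start_lr, swa_lr, step, anneal_epochs):
    """Learning rate after `step` SWA scheduler steps, annealed with a cosine curve."""
    t = min(1.0, step / max(1, anneal_epochs))   # progress through the anneal, in [0, 1]
    alpha = (1 - math.cos(math.pi * t)) / 2      # rises from 0 to 1 along a cosine
    return swa_lr * alpha + start_lr * (1 - alpha)

# Assumed starting LR of 2e-5; swa_lr and anneal_epochs match the snippet above.
schedule = [cosine_annealed_lr(2e-5, 1e-6, step, anneal_epochs=3) for step in range(5)]
for step, lr in enumerate(schedule):
    print(step, lr)
```

The learning rate starts at the value left by the regular scheduler, reaches `swa_lr` after `anneal_epochs` steps, and then stays constant, which is the regime in which the weights are averaged.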

<font color='#3498DB'><h2>Thanks and Please Do Upvote!</h2></font>