## PyTorch/TPU BERT Fine-Tune Demo
(Run each cell separately; do not use "Run all".)

- This notebook is part of a series of tutorials on using PyTorch on Cloud TPUs. PyTorch can use Cloud TPU cores as devices through the PyTorch/XLA package. For more on PyTorch/XLA, see its GitHub repository or its documentation. There is also a "Getting Started" Colab notebook. Additional Colab notebooks, like this one, are available in the PyTorch/XLA GitHub repository.

<h3>  &nbsp;&nbsp;Use Colab Cloud TPU&nbsp;&nbsp; <a href="https://cloud.google.com/tpu/"><img valign="middle" src="https://raw.githubusercontent.com/GoogleCloudPlatform/tensorflow-without-a-phd/master/tensorflow-rl-pong/images/tpu-hexagon.png" width="50"></a></h3>

* On the main menu, click Runtime and select **Change runtime type**. Set "TPU" as the hardware accelerator.
* The cell below makes sure you have access to a TPU on Colab.


In [1]:
import os
# .get() avoids a KeyError, so the assertion message is shown when no TPU is attached
assert os.environ.get('COLAB_TPU_ADDR'), 'Make sure to select TPU from Edit > Notebook settings > Hardware accelerator'
os.environ['COLAB_TPU_ADDR']

'10.100.241.34:8470'

### [RUNME] Install Colab TPU compatible PyTorch/TPU wheels and dependencies
This may take about 2 minutes.

In [2]:
# Installs PyTorch, PyTorch/XLA, and Torchvision
# Copy this cell into your own notebooks to use PyTorch on Cloud TPUs 
# Warning: this may take a couple minutes to run

import collections
from datetime import datetime, timedelta
import os
import requests
import threading

_VersionConfig = collections.namedtuple('_VersionConfig', 'wheels,server')
VERSION = "torch_xla==nightly"
CONFIG = {
    'xrt==1.15.0': _VersionConfig('1.15', '1.15.0'),
    'torch_xla==nightly': _VersionConfig('nightly', 'XRT-dev{}'.format(
        (datetime.today() - timedelta(1)).strftime('%Y%m%d'))),
}[VERSION]
DIST_BUCKET = 'gs://tpu-pytorch/wheels'
TORCH_WHEEL = 'torch-{}-cp36-cp36m-linux_x86_64.whl'.format(CONFIG.wheels)
TORCH_XLA_WHEEL = 'torch_xla-{}-cp36-cp36m-linux_x86_64.whl'.format(CONFIG.wheels)
TORCHVISION_WHEEL = 'torchvision-{}-cp36-cp36m-linux_x86_64.whl'.format(CONFIG.wheels)

# Update TPU XRT version
def update_server_xrt():
  print('Updating server-side XRT to {} ...'.format(CONFIG.server))
  url = 'http://{TPU_ADDRESS}:8475/requestversion/{XRT_VERSION}'.format(
      TPU_ADDRESS=os.environ['COLAB_TPU_ADDR'].split(':')[0],
      XRT_VERSION=CONFIG.server,
  )
  print('Done updating server-side XRT: {}'.format(requests.post(url)))

update = threading.Thread(target=update_server_xrt)
update.start()

# Install Colab TPU compat PyTorch/TPU wheels and dependencies
!pip uninstall -y torch torchvision
!gsutil cp "$DIST_BUCKET/$TORCH_WHEEL" .
!gsutil cp "$DIST_BUCKET/$TORCH_XLA_WHEEL" .
!gsutil cp "$DIST_BUCKET/$TORCHVISION_WHEEL" .
!pip install "$TORCH_WHEEL"
!pip install "$TORCH_XLA_WHEEL"
!pip install "$TORCHVISION_WHEEL"
!pip install transformers
!sudo apt-get install libomp5
update.join()

Updating server-side XRT to XRT-dev20200214 ...
Uninstalling torch-1.5.0a0+ecd3c25:
  Successfully uninstalled torch-1.5.0a0+ecd3c25
Uninstalling torchvision-0.6.0a0+3e94dff:
  Successfully uninstalled torchvision-0.6.0a0+3e94dff
Copying gs://tpu-pytorch/wheels/torch-nightly-cp36-cp36m-linux_x86_64.whl...
-
Operation completed over 1 objects/80.0 MiB.                                     
Copying gs://tpu-pytorch/wheels/torch_xla-nightly-cp36-cp36m-linux_x86_64.whl...
- [1 files][111.4 MiB/111.4 MiB]                                                
Operation completed over 1 objects/111.4 MiB.                                    
Done updating server-side XRT: <Response [200]>
Copying gs://tpu-pytorch/wheels/torchvision-nightly-cp36-cp36m-linux_x86_64.whl...
/ [1 files][  2.5 MiB/  2.5 MiB]                                                
Operation completed over 1 objects/2.5 MiB.                                      
Processing ./torch-nightly-cp36-cp36m-linux_x86_64.whl
ERROR: fast

# [IMP] Using the Kaggle Google QUEST Competition Dataset for Demonstration Purposes


In [3]:
# Downloads the dataset, which is hosted on my Google Drive
!pip install gdown
!gdown --id "1OHOc7ltJYDRrCc2zZGKne5L9gSo-UThs" --output "quest.zip"
!unzip -q "quest.zip"
#### If the download above fails, get the data from the Kaggle "Google QUEST" competition instead.
#### The dataset was uploaded to Google Drive here: https://drive.google.com/file/d/1OHOc7ltJYDRrCc2zZGKne5L9gSo-UThs

Downloading...
From: https://drive.google.com/uc?id=1OHOc7ltJYDRrCc2zZGKne5L9gSo-UThs
To: /content/quest.zip
5.09MB [00:00, 160MB/s]
replace sample_submission.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: An


In [4]:
# Standard library imports
import gc
import glob
import math
import multiprocessing
import os
import random
import sys
import time
from contextlib import contextmanager

# Third-party imports
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
from scipy import stats
from scipy.stats import spearmanr
from sklearn.model_selection import GroupKFold, train_test_split
from torch.utils import data
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm
from transformers import (
    BertTokenizer, BertModel, BertPreTrainedModel, BertForSequenceClassification, BertConfig,
    WEIGHTS_NAME, CONFIG_NAME, AdamW, get_linear_schedule_with_warmup,
    get_cosine_schedule_with_warmup,
)

# PyTorch/XLA imports
import torch_xla.core.xla_model as xm
import torch_xla.distributed.data_parallel as dp # http://pytorch.org/xla/index.html#running-on-multiple-xla-devices-with-multithreading
import torch_xla.distributed.xla_multiprocessing as xmp # http://pytorch.org/xla/index.html#running-on-multiple-xla-devices-with-multiprocessing
import torch_xla.distributed.parallel_loader as pl

# Using Multiple Cloud TPU Cores

Working with multiple Cloud TPU cores is different from training on a single Cloud TPU core. With a single Cloud TPU core you simply acquire the device and run operations on it directly. To use multiple Cloud TPU cores you must use additional processes, one per Cloud TPU core. This indirection and multiplicity make multicore training a little more complex than training on a single core, but they are necessary to maximize performance.
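
Because each process drives one core, each process should also read only its own slice of the data. The sketch below illustrates that idea without any framework: it mimics the round-robin split a distributed sampler performs when shuffling is off. The function name `shard_indices` is hypothetical, and the padding rule (wrap around to the start so every replica gets the same number of samples) is an assumption modeled on `torch.utils.data.distributed.DistributedSampler`.

```python
import math


def shard_indices(dataset_len, num_replicas, rank):
    """Return the dataset indices that one replica (process) would read."""
    if dataset_len <= 0 or num_replicas <= 0:
        raise ValueError("dataset_len and num_replicas must be positive")
    per_replica = math.ceil(dataset_len / num_replicas)
    total_size = per_replica * num_replicas
    indices = list(range(dataset_len))
    # Pad by wrapping around so every replica gets the same number of samples.
    while len(indices) < total_size:
        indices += indices[: total_size - len(indices)]
    # Round-robin assignment: replica `rank` takes every num_replicas-th index.
    return indices[rank:total_size:num_replicas]


if __name__ == "__main__":
    for rank in range(4):
        print(rank, shard_indices(10, 4, rank))
```

Equal-sized shards matter here because every process must run the same number of steps for the cross-core synchronization to line up.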

In [0]:
from sklearn.model_selection import GroupKFold
from torch.optim import Optimizer
from torch.optim.lr_scheduler import _LRScheduler
from tqdm.notebook import tqdm

def run(index):

    def seed_everything(seed):
        # Sets a common random seed - both for initialization and ensuring graph is the same
        random.seed(seed)
        os.environ['PYTHONHASHSEED'] = str(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)

    def _convert_to_transformer_inputs(title, question, answer, tokenizer, max_sequence_length):
        # Converts tokenized input to ids, masks and segments for transformer (including bert)

        def return_id(str1, str2, truncation_strategy, length):

            inputs = tokenizer.encode_plus(str1, str2, add_special_tokens=True, 
                                          max_length=length, truncation_strategy=truncation_strategy
                                          )
            
            input_ids =  inputs["input_ids"]
            input_masks = [1] * len(input_ids)
            input_segments = inputs["token_type_ids"]
            
            padding_length = length - len(input_ids)
            padding_id = tokenizer.pad_token_id
            
            input_ids = input_ids + ([padding_id] * padding_length)
            input_masks = input_masks + ([0] * padding_length)
            input_segments = input_segments + ([0] * padding_length)
            
            return [input_ids, input_masks, input_segments]
        
        input_ids_q, input_masks_q, input_segments_q = return_id(
            title + ' ' + question, None, 'longest_first', max_sequence_length
        )
        
        input_ids_a, input_masks_a, input_segments_a = return_id(
            answer, None, 'longest_first', max_sequence_length
        )
        
        return [input_ids_q, input_masks_q, input_segments_q,
                input_ids_a, input_masks_a, input_segments_a]

    def compute_input_arrays(df, columns, tokenizer, max_sequence_length):
        
        input_ids_q, input_masks_q, input_segments_q = [], [], []
        input_ids_a, input_masks_a, input_segments_a = [], [], []
        
        for _, instance in df[columns].iterrows():
            
            t, q, a = instance.question_title, instance.question_body, instance.answer

            ids_q, masks_q, segments_q, ids_a, masks_a, segments_a = _convert_to_transformer_inputs(t, q, a, tokenizer, max_sequence_length)
            
            input_ids_q.append(ids_q)
            input_masks_q.append(masks_q)
            input_segments_q.append(segments_q)

            input_ids_a.append(ids_a)
            input_masks_a.append(masks_a)
            input_segments_a.append(segments_a)
            
        return [
            np.asarray(input_ids_q, dtype=np.int32), 
            np.asarray(input_masks_q, dtype=np.int32),
            np.asarray(input_segments_q, dtype=np.int32),
            np.asarray(input_ids_a, dtype=np.int32), 
            np.asarray(input_masks_a, dtype=np.int32), 
            np.asarray(input_segments_a, dtype=np.int32),
        ]

    def compute_output_arrays(df, columns):
        return np.asarray(df[columns])

    def compute_spearmanr_ignore_nan(trues, preds):
        
        rhos = []
        for tcol, pcol in zip(np.transpose(trues), np.transpose(preds)):
            rhos.append(spearmanr(tcol, pcol).correlation)
        return np.nanmean(rhos)

    def train_model(train_loader, length, model, optimizer, criterion, scheduler=None):
        
        max_grad_norm = 1.0
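        # Note: tracker.add() is never called below, so the logged Rate and
        # GlobalRate values stay at 0.00.
        tracker = xm.RateTracker()
        avg_loss = 0.
        model.train()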
        len_loader = length
        tk0 = enumerate(train_loader)

        for idx, batch in tk0:
            
            # Move the six input tensors to the device as int64 and the labels as float32
            *features, labels = batch
            (input_ids_q, input_masks_q, input_segments_q,
             input_ids_a, input_masks_a, input_segments_a) = (
                t.to(device, dtype=torch.long) for t in features
            )
            labels = labels.to(device, dtype=torch.float)
            
            logits = model(
                q_id = input_ids_q, q_mask = input_masks_q, q_atn = input_segments_q, 
                a_id = input_ids_a, a_mask = input_masks_a, a_atn = input_segments_a
            )
            
            # Computes loss
            loss = criterion(logits, labels) 
            # BackWard pass
            loss.backward()

            # Note: optimizer_step uses the implicit Cloud TPU context to
            #  coordinate and synchronize gradient updates across processes.
            #  This means that each process's network has the same weights after
            #  this is called.
            # Warning: this coordination requires the actions performed in each 
            #  process are the same. In more technical terms, the graph that
            #  PyTorch/XLA generates must be the same across processes.
            xm.optimizer_step(optimizer) # Note: barrier=True not needed when using ParallelLoader 
            
            optimizer.zero_grad()
            avg_loss += loss.item() / len_loader

            if idx % 100 == 0:
              print('[xla:{}] ({}) Loss={:.5f} Rate={:.2f} GlobalRate={:.2f} Time={}'.format(
                xm.get_ordinal(), idx, loss.item(), tracker.rate(),
                tracker.global_rate(), time.asctime()), flush=True
              )

            del input_ids_q, input_masks_q, input_segments_q, input_ids_a, input_masks_a, input_segments_a, labels
        gc.collect()
        return avg_loss

    def val_model(val_loader, model, length, criterion, val_shape, batch_size=4):

        avg_val_loss = 0
        model.eval()
        len_loader = length

        valid_preds = []
        original    = []

        # Gradients are not needed for validation, so skip building the autograd graph
        with torch.no_grad():
            for idx, batch in enumerate(val_loader):

                # Move the six input tensors to the device as int64 and the labels as float32
                *features, labels = batch
                (input_ids_q, input_masks_q, input_segments_q,
                 input_ids_a, input_masks_a, input_segments_a) = (
                    t.to(device, dtype=torch.long) for t in features
                )
                labels = labels.to(device, dtype=torch.float)

                logits = model(
                    q_id = input_ids_q, q_mask = input_masks_q, q_atn = input_segments_q,
                    a_id = input_ids_a, a_mask = input_masks_a, a_atn = input_segments_a
                )

                avg_val_loss += criterion(logits, labels).item() / len_loader
                valid_preds.append(logits.detach().cpu().squeeze().numpy())
                original.append(labels.detach().cpu().squeeze().numpy())

        return avg_val_loss, np.vstack(valid_preds), np.vstack(original)

    class QuestDataset(torch.utils.data.Dataset):
        def __init__(self, inputs, labels=None):
            
            self.inputs = inputs
            self.labels = labels

        def __getitem__(self, idx):
            
            input_ids_q       = self.inputs[0][idx]
            input_masks_q     = self.inputs[1][idx]
            input_segments_q  = self.inputs[2][idx]
            
            input_ids_a       = self.inputs[3][idx]
            input_masks_a     = self.inputs[4][idx]
            input_segments_a  = self.inputs[5][idx]
            
            if self.labels is not None:
                labels = self.labels[idx]
                return input_ids_q, input_masks_q, input_segments_q, input_ids_a, input_masks_a, input_segments_a, labels
            return input_ids_q, input_masks_q, input_segments_q, input_ids_a, input_masks_a, input_segments_a

        def __len__(self):
            return len(self.inputs[0])

    class BertForSequenceClassification_TF_Port(BertPreTrainedModel):
        
        def __init__(self, config):

            super(BertForSequenceClassification_TF_Port, self).__init__(config)
            
            self.config     = config
            self.activation = nn.Tanh() # ----added Tanh
            self.num_labels = config.num_labels
            self.bert       = BertModel(config)
            self.dropout    = nn.Dropout(config.hidden_dropout_prob)
            # Question and answer embeddings are concatenated, hence hidden_size*2
            self.classifier = nn.Linear(config.hidden_size*2, config.num_labels)
            self.init_weights()

        def forward(self, q_id, a_id, q_mask, a_mask, q_atn, a_atn):
            
            q_embedding = self.bert(q_id, attention_mask = q_mask, token_type_ids = q_atn)
            a_embedding = self.bert(a_id, attention_mask = a_mask, token_type_ids = a_atn)
            q = torch.mean(q_embedding[0], 1)
            a = torch.mean(a_embedding[0], 1)
            logits = self.classifier(self.dropout(self.activation(torch.cat([q, a], 1))))
            return logits

    SEED = 42
    MAX_SEQUENCE_LENGTH = 384
    seed_everything(SEED)
    
    PATH = './'
    BERT_PATH = './'

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    dfx     = pd.read_csv("train.csv", nrows = 1024).fillna('none')
    sample  = pd.read_csv("sample_submission.csv")
    df_test = pd.read_csv("test.csv").fillna("")
    
    target_cols = list(sample.drop("qa_id", axis = 1).columns)
    df_train, df_valid = train_test_split(dfx, random_state=42, test_size=0.2)
    
    df_train.reset_index(drop=True, inplace=True)
    df_valid.reset_index(drop=True, inplace=True)

    output_categories = list(df_train.columns[11:])
    input_categories  = list(df_train.columns[[1,2,5]])
    
    train_targets = df_train[target_cols].values
    valid_targets = df_valid[target_cols].values

    bert_model_config      = 'bert_config.json'
    bert_config            = BertConfig.from_pretrained('bert-base-uncased')
    bert_config.num_labels = 30
    bert_config.output_hidden_states = True

    BATCH_SIZE = 4 # kept small; larger values may run out of memory (OOM)
    epochs = 7
    ACCUM_STEPS = 1

    scores      = []
    valid_preds = []
    test_preds  = []

    xm.master_print("Preparing train datasets....")
    train_outputs     = compute_output_arrays(df_train, output_categories)
    train_outputs     = torch.tensor(train_outputs, dtype=torch.float32)
    train_inputs      = compute_input_arrays(df_train, input_categories, tokenizer, MAX_SEQUENCE_LENGTH)
    
    xm.master_print("Preparing Valid datasets....")
    valid_outputs     = compute_output_arrays(df_valid, output_categories)
    valid_outputs     = torch.tensor(valid_outputs, dtype=torch.float32)
    valid_inputs      = compute_input_arrays(df_valid, input_categories, tokenizer, MAX_SEQUENCE_LENGTH)

    test_inputs = compute_input_arrays(df_test, input_categories, tokenizer, MAX_SEQUENCE_LENGTH)

    xm.master_print("Preparing Dataloaders....")
    
    # Note: each process has its own (identical) copies of each dataset.
    train_set     = QuestDataset(inputs=train_inputs, labels=train_outputs)

    # Creates the (distributed) train sampler, which let this process only access 
    # its portion of the training dataset.
    train_sampler = torch.utils.data.distributed.DistributedSampler(
            train_set,
            num_replicas=xm.xrt_world_size(),
            rank=xm.get_ordinal(),
            shuffle=True,
            )
    train_loader = torch.utils.data.DataLoader(train_set, 
                                               batch_size=BATCH_SIZE, 
                                               sampler = train_sampler,
                                               drop_last=True,
                                               )
    train_dl_len = len(train_loader)

    valid_set = QuestDataset(inputs=valid_inputs, labels=valid_outputs)
    valid_sampler = torch.utils.data.distributed.DistributedSampler(
            valid_set,
            num_replicas = xm.xrt_world_size(),
            rank=xm.get_ordinal(),
            shuffle=False,
            )
    valid_loader = torch.utils.data.DataLoader(valid_set, 
                                               batch_size = BATCH_SIZE, 
                                               drop_last = False,
                                               sampler = valid_sampler,
                                               )
        
    val_dl_len = len(valid_loader)

    best_score       = -1.
    best_param_score = None
    best_epoch_score = None
        
    learning_rate = 4e-5 * xm.xrt_world_size() # Scale learning rate to num cores
    # Get loss function, device, optimizer and model
    # Acquires the (unique) Cloud TPU core corresponding to this process's index
    
    device = xm.xla_device()
        
    model = BertForSequenceClassification_TF_Port.from_pretrained('bert-base-uncased', config=bert_config).to(device)

    param_optimizer = list(model.named_parameters())
    no_decay = ['bias', 'LayerNorm.weight']
    
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.00}
    ]

    optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate, eps=4e-5)
    criterion = nn.BCEWithLogitsLoss()
    xm.master_print("Training....")

    for epoch in range(epochs):

        xm.master_print(f"Epoch.... {epoch}")

        start_time          = time.time()
        train_para_loader   = pl.ParallelLoader(train_loader, [device])
        avg_loss            = train_model(train_para_loader.per_device_loader(device), 
                                          train_dl_len, model, optimizer, criterion, 
                                          scheduler=None
                                          )

        valid_para_loader = pl.ParallelLoader(valid_loader, [device])
        avg_val_loss, preds, original = val_model(valid_para_loader.per_device_loader(device), 
                                                  model, val_dl_len, criterion, 
                                                  val_shape=valid_outputs.shape[0],  
                                                  batch_size=BATCH_SIZE
                                                  )
        spear = []

        for jj in range(preds.shape[1]):
          # Column jj of the labels and of the sigmoid-activated predictions
          p1 = original[:, jj]
          p2 = torch.sigmoid(torch.from_numpy(preds[:, jj])).numpy()
          # nan_to_num maps an undefined correlation (constant column) to 0
          coef, _ = np.nan_to_num(stats.spearmanr(p1, p2))
          spear.append(coef)
        score = np.mean(spear)
        elapsed_time = time.time() - start_time

        # master_print only prints on the master ordinal, so no extra check is needed
        xm.master_print('Epoch {}/{} \t loss={:.4f} \t val_loss={:.4f} \t score={:.6f} \t time={:.2f}s'.format(
            epoch + 1, epochs, avg_loss, avg_val_loss, score, elapsed_time))

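        if best_score < score:
            best_score = score
            # state_dict() returns references to the live parameters, so copy them
            # to the CPU to keep a snapshot that later epochs cannot overwrite
            best_param_score = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
            best_epoch_score = epoch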
        
        xm.master_print("Finished training epoch {}".format(epoch))

    xm.master_print(f'Best model came from Epoch {best_epoch_score+1} with score of {best_score}',)
    model.load_state_dict(best_param_score)
    scores.append(best_score)
    
    xm.master_print('Individual Fold Scores:')
    xm.master_print(scores)
    xm.master_print(f'QUEST BERT-base CV score: {np.mean(scores)}')
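
The comment on `xm.optimizer_step` in `train_model` says that every process holds the same weights after each step. A toy, pure-Python sketch of why that holds is given here; it does not use PyTorch/XLA. It assumes the cross-process reduction averages the gradients before the update, and `all_reduce_mean`, `sgd_step`, and `synchronized_step` are hypothetical helpers, not library calls.

```python
def all_reduce_mean(per_replica_grads):
    """Average each gradient entry across replicas (stand-in for the cross-core reduction)."""
    n = len(per_replica_grads)
    return [sum(vals) / n for vals in zip(*per_replica_grads)]


def sgd_step(weights, grads, lr):
    """Plain gradient-descent update on a list of scalar weights."""
    return [w - lr * g for w, g in zip(weights, grads)]


def synchronized_step(replica_weights, per_replica_grads, lr=0.1):
    """Every replica applies the same averaged gradient to its own copy of the weights."""
    shared = all_reduce_mean(per_replica_grads)
    return [sgd_step(w, shared, lr) for w in replica_weights]


# Four replicas start from identical weights but see different data,
# so their local gradients differ.
start = [1.0, -2.0]
replica_weights = [list(start) for _ in range(4)]
per_replica_grads = [[0.4, 0.0], [0.0, 0.8], [0.8, 0.4], [0.4, 0.4]]

updated = synchronized_step(replica_weights, per_replica_grads)

if __name__ == "__main__":
    for rank, weights in enumerate(updated):
        print(rank, weights)
```

In this sketch, a replica that applied only its local gradient would drift away from the others, which is the kind of divergence the warning in the code comment is guarding against.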

```spawn()``` takes a function (the "map function"), an optional tuple of arguments, the number of processes to create, and whether to create these new processes by "forking" or "spawning." This notebook passes no extra arguments. While spawning new processes is generally recommended, Colab only supports forking.

```spawn()``` will create eight processes, one for each Cloud TPU core, and call ```run()``` (the map function) in each process. The only input to ```run()``` here is an ```index``` (zero through seven). When the processes acquire their device, they automatically acquire their corresponding Cloud TPU core.
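
The launch pattern can be sketched with the standard library alone. The snippet below is an analogy, not the `xmp.spawn` implementation: it forks `nprocs` processes and calls a map function with each process's index. `launch`, `_worker`, and `toy_run` are hypothetical names, and the `fork` start method is assumed to be available (as it is on Linux).

```python
import multiprocessing as mp


def toy_run(index):
    """Stand-in for run(index); a real map function would acquire its device here."""
    return index * index


def _worker(fn, index, queue):
    # Each child reports its index together with the map function's result.
    queue.put((index, fn(index)))


def launch(fn, nprocs):
    """Fork nprocs processes, call fn(index) in each, and collect results by index."""
    ctx = mp.get_context("fork")
    queue = ctx.Queue()
    procs = [ctx.Process(target=_worker, args=(fn, i, queue)) for i in range(nprocs)]
    for p in procs:
        p.start()
    # Drain the queue before joining so children are never blocked on a full pipe.
    results = dict(queue.get(timeout=30) for _ in procs)
    for p in procs:
        p.join()
    return results


if __name__ == "__main__":
    print(launch(toy_run, nprocs=8))
```

Forking copies the parent's memory into each child, so the map function and anything defined earlier in the notebook are visible in every process without being re-imported.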

In [6]:
import warnings
warnings.filterwarnings("ignore") # silence warnings from the TPU runtime and the metric calculations

xmp.spawn(run, nprocs=8, start_method='fork') # nprocs=1 also works (with a slightly better score). NB: the hyperparameters have not been tuned.

Preparing train datasets....
Preparing Valid datasets....
Preparing Dataloaders....
Training....
Epoch.... 0
[xla:0] (0) Loss=0.70650 Rate=0.00 GlobalRate=0.00 Time=Sat Feb 15 05:44:57 2020
[xla:5] (0) Loss=0.70750 Rate=0.00 GlobalRate=0.00 Time=Sat Feb 15 05:44:57 2020
[xla:4] (0) Loss=0.71442 Rate=0.00 GlobalRate=0.00 Time=Sat Feb 15 05:44:57 2020
[xla:2] (0) Loss=0.71106 Rate=0.00 GlobalRate=0.00 Time=Sat Feb 15 05:44:57 2020
[xla:3] (0) Loss=0.70956 Rate=0.00 GlobalRate=0.00 Time=Sat Feb 15 05:44:57 2020
[xla:6] (0) Loss=0.71739 Rate=0.00 GlobalRate=0.00 Time=Sat Feb 15 05:44:57 2020
[xla:1] (0) Loss=0.70790 Rate=0.00 GlobalRate=0.00 Time=Sat Feb 15 05:44:57 2020
[xla:7] (0) Loss=0.71384 Rate=0.00 GlobalRate=0.00 Time=Sat Feb 15 05:44:57 2020
Epoch 1/7 	 loss=0.4847 	 val_loss=0.4016 	 score=0.180821 	 time=123.40s
Finished training epoch 0
Epoch.... 1
[xla:0] (0) Loss=0.45904 Rate=0.00 GlobalRate=0.00 Time=Sat Feb 15 05:46:44 2020
[xla:3] (0) Loss=0.44036 Rate=0.00 GlobalRate=0.00
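
The `score` in the log above is the mean of the per-column Spearman rank correlations between the labels and the predictions, with undefined correlations counted as 0. A dependency-free sketch of that computation follows. The notebook itself relies on `scipy.stats.spearmanr` and `np.nan_to_num`; the helper names below are hypothetical and exist only for illustration.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the ranks; NaN when either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)


def mean_columnwise_spearman(labels, preds):
    """labels and preds are lists of rows; columns with NaN correlation count as 0."""
    scores = []
    for true_col, pred_col in zip(zip(*labels), zip(*preds)):
        rho = spearman(true_col, pred_col)
        scores.append(0.0 if math.isnan(rho) else rho)
    return sum(scores) / len(scores)
```

Because the metric depends only on ranks, applying a monotonic function such as the sigmoid to the predictions does not change the result.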

Additional notebooks demonstrating how to run PyTorch on Cloud TPUs can be found in the PyTorch/XLA contrib folder. While Colab provides a free Cloud TPU, training is even faster on Google Cloud Platform, especially when using multiple Cloud TPUs in a Cloud TPU pod. Scaling from a single Cloud TPU, as in this notebook, to many Cloud TPUs in a pod is easy, too: you use the same code as this notebook and just spawn more processes.