# Divide Hugging Face Transformers training time by 2 or more

This notebook is based on a work introduced in the article [Divide Hugging Face Transformers training time by 2 or more with dynamic padding and uniform length batching](https://towardsdatascience.com/divide-hugging-face-transformers-training-time-by-2-or-more-21bf7129db9q-21bf7129db9e?source=friends_link&sk=10a45a0ace94b3255643d81b6475f409).  

This notebook focuses on a small part of the article. If you want to go deeper, need more explanation, or want to read about other training time optimizations, check the article.

The basic idea behind the optimization is to **avoid computations whose results we know we are going to throw away**.

Here we will focus on one of the tricks presented in the article, called `dynamic padding`. It is very simple to implement.

Training neural networks on a batch of sequences requires them to have the exact same length to build the batch matrix representation. Because real life NLP datasets are always made of texts of variable lengths, we often need to make some sequences shorter by truncating them, and some others longer by adding at the end a repeated fake token called `pad` token. 

Because the `pad` token doesn’t represent a real word/subword/signal, its contribution is cancelled through the attention mask matrix, which marks every pad position with a 0, so the pad tokens do not affect the loss.  
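To make the two previous paragraphs concrete, here is a minimal sketch of fixed-length padding and truncation with the matching attention mask. The token ids, `PAD_ID`, and the helper `pad_or_truncate` are hypothetical and chosen for illustration only; a real tokenizer also keeps its special tokens when it truncates, which this simplified version does not.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical pad token id, chosen for illustration


def pad_or_truncate(ids: List[int], target_len: int,
                    pad_id: int = PAD_ID) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly target_len long."""
    kept = ids[:target_len]                # truncate sequences that are too long
    n_pad = target_len - len(kept)         # number of pad tokens to append
    input_ids = kept + [pad_id] * n_pad    # pad at the end of the sequence
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = pad
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 7592, 102], target_len=6)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 102], target_len=6)

print(short_ids, short_mask)  # [101, 7592, 102, 0, 0, 0] [1, 1, 1, 0, 0, 0]
print(long_ids, long_mask)    # [101, 1, 2, 3, 4, 5] [1, 1, 1, 1, 1, 1]
```

Both sequences end up with the same length, so they can be stacked in one batch matrix, and the mask records which positions carry real tokens.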

> In `dynamic padding`, we limit the number of added `pad` tokens to reach the length of the longest sequence of each mini batch instead of a fixed value set for the whole train set. Because the number of added tokens changes across mini batches, we call it `dynamic`.

The effect of such optimization targeting sequence length can be dramatic because **the complexity of BERT is quadratic with the sequence length**, as written in the original BERT repo README: _“…attention is quadratic to the sequence length. In other words, a batch of 64 sequences of length 512 is much more expensive than a batch of 256 sequences of length 128.”_.
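The comparison quoted from the BERT README can be checked with a rough back-of-the-envelope computation. The sketch below assumes the attention cost is simply proportional to `batch_size * seq_len ** 2` (one score per pair of tokens in each sequence); it ignores constant factors, the number of heads, and the layers whose cost grows only linearly with the sequence length. The function name `attention_cost` is made up for this illustration.

```python
def attention_cost(batch_size: int, seq_len: int) -> int:
    """Relative self-attention cost: one score per pair of tokens, per sequence."""
    return batch_size * seq_len ** 2


# Both batches contain exactly the same number of tokens...
tokens_long = 64 * 512
tokens_short = 256 * 128
print(tokens_long, tokens_short)  # 32768 32768

# ...but the pairwise attention term is 4 times larger for the long sequences.
long_batch = attention_cost(batch_size=64, seq_len=512)
short_batch = attention_cost(batch_size=256, seq_len=128)
print(long_batch, short_batch)    # 16777216 4194304
print(long_batch // short_batch)  # 4
```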

Let's try it ourselves.

## Setup

Experiments are run on an Nvidia `P100`, a 16 GB GPU.

In [1]:
!nvidia-smi

Sun Jun  7 18:06:52 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

### Installation

In [2]:
# Installing the transformers (algo) and nlp (data)
!pip install nlp transformers



### Prepare data

Download the MNLI dataset (classification task) with the `nlp` package, then convert the format to something easier to work with.

In [3]:
import nlp
dataset = nlp.load_dataset('glue', 'mnli')

# Let's get an idea of the data format
print(dataset)
print(dataset["train"][0])

Downloading and preparing dataset glue/mnli (download: 298.29 MiB, generated: 78.65 MiB, total: 376.95 MiB) to /root/.cache/huggingface/datasets/glue/mnli/1.0.0...


Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/mnli/1.0.0. Subsequent calls will reuse this data.
{'train': Dataset(schema: {'premise': 'string', 'hypothesis': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 392702), 'validation_matched': Dataset(schema: {'premise': 'string', 'hypothesis': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 9815), 'validation_mismatched': Dataset(schema: {'premise': 'string', 'hypothesis': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 9832), 'test_matched': Dataset(schema: {'premise': 'string', 'hypothesis': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 9796), 'test_mismatched': Dataset(schema: {'premise': 'string', 'hypothesis': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 9847)}
{'premise': 'Conceptually cream skimming has two basic dimensions - product and geography.', 'hypothesis': 'Product and geography are what make cream skimming work. ', 'label': 1, 'idx': 0}


Data are provided as a list of dicts. We convert them to a list of dataclass instances so they are easier to manipulate.

In [0]:
from dataclasses import dataclass, field

@dataclass
class Example:
    text_a: str
    text_b: str
    label: int

# to simplify code below, we convert list of dict provided by nlp package to list of Example
train = [Example(text_a=item["premise"], text_b=item["hypothesis"], label=item["label"]) for item in dataset["train"]]
valid = [Example(text_a=item["premise"], text_b=item["hypothesis"], label=item["label"]) for item in dataset["validation_matched"]]

## Dynamic padding


On `MNLI`, the shortest sequences are fewer than 20 tokens long. If you set the max length to 512 tokens, you add 492 pad tokens to a 20-token sequence, and then perform computations over those 492 noisy tokens.

Because learning (gradient descent) is performed at the mini batch level, we have the opportunity to limit the padding effect: we first search for the longest sequence in the mini batch, and then `pad` the other sequences to that length.  
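To get a feel for how much padding this saves, the sketch below counts the pad tokens added under both strategies. The sequence lengths are made-up values, not measured on `MNLI`, and the two helper functions are hypothetical names introduced for this illustration.

```python
from typing import List


def count_pad_fixed(lengths: List[int], max_len: int) -> int:
    """Pad tokens added when every sequence is padded to max_len."""
    return sum(max_len - min(n, max_len) for n in lengths)


def count_pad_dynamic(lengths: List[int], batch_size: int, max_len: int) -> int:
    """Pad tokens added when each mini batch is padded to its own longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        # sequences longer than max_len would be truncated first
        batch = [min(n, max_len) for n in lengths[start:start + batch_size]]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total


lengths = [20, 35, 18, 60, 120, 25, 40, 33]  # hypothetical token counts

fixed = count_pad_fixed(lengths, max_len=512)
dynamic = count_pad_dynamic(lengths, batch_size=4, max_len=512)
print(fixed, dynamic)  # 3745 369
```

With these toy lengths, padding per mini batch adds about ten times fewer pad tokens than padding everything to 512.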

Those operations can be performed in the `collate_fn` function.

Below, we define a custom `Dataset` class (`TextDataset`), which pads to the maximum length only when asked to, and a custom `collate_fn` (the `collate_batch` method of the `SmartCollator` class, a `DataCollator`), which performs the `dynamic padding` when possible.


In [0]:
import logging
import os
import random
import time

from typing import Dict, Optional
from typing import List

import numpy as np
import torch
from torch.utils.data.dataset import Dataset, IterableDataset
from torch.utils.tensorboard import SummaryWriter
from transformers import AutoTokenizer, EvalPrediction, Trainer, HfArgumentParser, TrainingArguments, \
    AutoModelForSequenceClassification, set_seed, AutoConfig
from transformers import PreTrainedTokenizer, DataCollator, PreTrainedModel

set_seed(123)

@dataclass
class Features:
    input_ids: List[int]
    attention_mask: List[int]
    label: int


class TextDataset(Dataset):
    def __init__(self, tokenizer: PreTrainedTokenizer, 
                 pad_to_max_length: bool, 
                 max_len: int,
                 examples: List[Example]) -> None:
        self.tokenizer = tokenizer
        self.max_len = max_len
        self.examples: List[Example] = examples
        self.current = 0
        self.pad_to_max_length = pad_to_max_length

    def encode(self, ex: Example) -> Features:
        encode_dict = self.tokenizer.encode_plus(text=ex.text_a,
                                                 text_pair=ex.text_b,
                                                 add_special_tokens=True,
                                                 max_length=self.max_len,
                                                 pad_to_max_length=self.pad_to_max_length,
                                                 return_token_type_ids=False,
                                                 return_attention_mask=True,
                                                 return_overflowing_tokens=False,
                                                 return_special_tokens_mask=False,
                                                 )
        return Features(input_ids=encode_dict["input_ids"],
                        attention_mask=encode_dict["attention_mask"],
                        label=ex.label)

    def __getitem__(self, idx) -> Features:
        return self.encode(ex=self.examples[idx])

    def __len__(self):
        return len(self.examples)


def pad_seq(seq: List[int], max_batch_len: int, pad_value: int) -> List[int]:
    # right-pad seq with pad_value up to max_batch_len (seq must not be longer than max_batch_len)
    return seq + (max_batch_len - len(seq)) * [pad_value]


@dataclass
class SmartCollator(DataCollator):
    pad_token_id: int

    def collate_batch(self, batch: List[Features]) -> Dict[str, torch.Tensor]:
        batch_inputs = list()
        batch_attention_masks = list()
        labels = list()
        max_size = max([len(ex.input_ids) for ex in batch])
        for item in batch:
            batch_inputs += [pad_seq(item.input_ids, max_size, self.pad_token_id)]
            batch_attention_masks += [pad_seq(item.attention_mask, max_size, 0)]
            labels.append(item.label)

        return {"input_ids": torch.tensor(batch_inputs, dtype=torch.long),
                "attention_mask": torch.tensor(batch_attention_masks, dtype=torch.long),
                "labels": torch.tensor(labels, dtype=torch.long)
                }

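def load_transformers_model(pretrained_model_name_or_path: str,
                            use_cuda: bool,
                            ) -> PreTrainedModel:
    # Assumed body (mirrors the loading code used in the next cell):
    # build the config, load the classifier, optionally move it to the GPU.
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path=pretrained_model_name_or_path,
                                        num_labels=3)
    model = AutoModelForSequenceClassification.from_pretrained(
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        config=config)
    if use_cuda and torch.cuda.is_available():
        model = model.to(torch.device("cuda"))
    return model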


### Load models, tokenizer, datasets

In [6]:
max_sequence_len = 512 # longest sequences are >> 256 tokens, we choose to not apply any truncation.

# We will use the old classical BERT in its base cased flavor
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="bert-base-cased")

config = AutoConfig.from_pretrained(pretrained_model_name_or_path="bert-base-cased",
                                    num_labels=3)
model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path="bert-base-cased",
    config=config)

# a very simple accuracy function, nothing fancy
def compute_metrics(p: EvalPrediction) -> Dict:
    preds = np.argmax(p.predictions, axis=1)
    return {"acc": (preds == p.label_ids).mean()}

# default parameters for the training, in particular we limit ourselves to 1 epoch
args = TrainingArguments(output_dir="/tmp/test_dynamic_padding",
                         seed=123,
                         num_train_epochs=1,
                         per_device_train_batch_size=8,  # max batch size without OOM exception, because of the large max token length
                         per_device_eval_batch_size=8,
                         evaluate_during_training=True,
                         logging_steps=5000,
                         save_steps=0,
                        )


## Training

We launch the training without and with the optimization, so we can compare times and accuracy.  
We limit ourselves to 1 epoch and don't try to tweak the hyperparameters, as it is not the focus of this work.  
Timing is printed at the end of the training.  

### Launch training **without** dynamic padding

In [7]:
# to disable dynamic padding, we just pad to max at the dataset level.
train_set = TextDataset(tokenizer=tokenizer,
                        max_len=max_sequence_len,
                        examples=train,
                        pad_to_max_length=True)  # here we pad to max

valid_set = TextDataset(tokenizer=tokenizer,
                        max_len=max_sequence_len,
                        examples=valid,
                        pad_to_max_length=True)  # here we pad to max

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_set,
    data_collator=SmartCollator(pad_token_id=tokenizer.pad_token_id),
    eval_dataset=valid_set,
    compute_metrics=compute_metrics,
)
start_time = time.time()
trainer.train()
print(f"training took {(time.time() - start_time) / 60:.2f}mn")
result = trainer.evaluate()
print(result)

{"loss": 0.8291655193090439, "learning_rate": 4.490710560625815e-05, "epoch": 0.10185788787483703, "step": 5000}


{"eval_loss": 0.7747655408343068, "eval_acc": 0.6946510443199185, "epoch": 0.10185788787483703, "step": 5000}
{"loss": 0.7280739153057337, "learning_rate": 3.98142112125163e-05, "epoch": 0.20371577574967406, "step": 10000}


{"eval_loss": 0.6899689566936061, "eval_acc": 0.7277636271013754, "epoch": 0.20371577574967406, "step": 10000}
{"loss": 0.7019654942885041, "learning_rate": 3.4721316818774444e-05, "epoch": 0.3055736636245111, "step": 15000}


{"eval_loss": 0.6908178972695159, "eval_acc": 0.7316352521650535, "epoch": 0.3055736636245111, "step": 15000}
{"loss": 0.6748402434438467, "learning_rate": 2.9628422425032598e-05, "epoch": 0.4074315514993481, "step": 20000}


{"eval_loss": 0.6458180906295097, "eval_acc": 0.7500764136525726, "epoch": 0.4074315514993481, "step": 20000}
{"loss": 0.6491672154977918, "learning_rate": 2.4535528031290745e-05, "epoch": 0.5092894393741851, "step": 25000}


{"eval_loss": 0.6239076226269606, "eval_acc": 0.7470198675496689, "epoch": 0.5092894393741851, "step": 25000}
{"loss": 0.611613671438396, "learning_rate": 1.944263363754889e-05, "epoch": 0.6111473272490222, "step": 30000}


{"eval_loss": 0.6136362143134139, "eval_acc": 0.7624044829342843, "epoch": 0.6111473272490222, "step": 30000}
{"loss": 0.5900442306846381, "learning_rate": 1.4349739243807042e-05, "epoch": 0.7130052151238592, "step": 35000}


{"eval_loss": 0.5895616474220592, "eval_acc": 0.7814569536423841, "epoch": 0.7130052151238592, "step": 35000}
{"loss": 0.5694037208631635, "learning_rate": 9.25684485006519e-06, "epoch": 0.8148631029986962, "step": 40000}


{"eval_loss": 0.5539280138393681, "eval_acc": 0.7827814569536424, "epoch": 0.8148631029986962, "step": 40000}
{"loss": 0.5479105526492, "learning_rate": 4.163950456323338e-06, "epoch": 0.9167209908735332, "step": 45000}


{"eval_loss": 0.5272483711647813, "eval_acc": 0.7940906775343861, "epoch": 0.9167209908735332, "step": 45000}


training took 393.49mn


{"eval_loss": 0.5366932701656463, "eval_acc": 0.7972491085073866, "epoch": 1.0, "step": 49088}
{'eval_loss': 0.5366932701656463, 'eval_acc': 0.7972491085073866, 'epoch': 1.0}


### Launch training **with** dynamic padding

In [8]:
# to use dynamic padding, we just don't pad to max, collate_fn will automatically take care of the size.
train_set = TextDataset(tokenizer=tokenizer,
                        max_len=max_sequence_len,
                        examples=train,
                        pad_to_max_length=False)  # here we don't pad to max

valid_set = TextDataset(tokenizer=tokenizer,
                        max_len=max_sequence_len,
                        examples=valid,
                        pad_to_max_length=False)  # here we don't pad to max

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_set,
    data_collator=SmartCollator(pad_token_id=tokenizer.pad_token_id),
    eval_dataset=valid_set,
    compute_metrics=compute_metrics,
)
start_time = time.time()
trainer.train()
print(f"training took {(time.time() - start_time) / 60:.2f}mn")
result = trainer.evaluate()
print(result)

{"loss": 0.6106544892862439, "learning_rate": 4.490710560625815e-05, "epoch": 0.10185788787483703, "step": 5000}


{"eval_loss": 0.782951954852398, "eval_acc": 0.7198166072338258, "epoch": 0.10185788787483703, "step": 5000}
{"loss": 0.6309424709320068, "learning_rate": 3.98142112125163e-05, "epoch": 0.20371577574967406, "step": 10000}


{"eval_loss": 0.7008725743081964, "eval_acc": 0.7402954661232807, "epoch": 0.20371577574967406, "step": 10000}
{"loss": 0.6247799652352929, "learning_rate": 3.4721316818774444e-05, "epoch": 0.3055736636245111, "step": 15000}


{"eval_loss": 0.6811887560914762, "eval_acc": 0.7407030056036679, "epoch": 0.3055736636245111, "step": 15000}
{"loss": 0.6021737526774407, "learning_rate": 2.9628422425032598e-05, "epoch": 0.4074315514993481, "step": 20000}


{"eval_loss": 0.6461482198746494, "eval_acc": 0.7589403973509934, "epoch": 0.4074315514993481, "step": 20000}
{"loss": 0.5894900860860944, "learning_rate": 2.4535528031290745e-05, "epoch": 0.5092894393741851, "step": 25000}


{"eval_loss": 0.6154360700316709, "eval_acc": 0.7624044829342843, "epoch": 0.5092894393741851, "step": 25000}
{"loss": 0.562518992201984, "learning_rate": 1.944263363754889e-05, "epoch": 0.6111473272490222, "step": 30000}


{"eval_loss": 0.6169206781438822, "eval_acc": 0.7760570555272542, "epoch": 0.6111473272490222, "step": 30000}
{"loss": 0.5486260669246316, "learning_rate": 1.4349739243807042e-05, "epoch": 0.7130052151238592, "step": 35000}


{"eval_loss": 0.6215495994738087, "eval_acc": 0.7815588385124809, "epoch": 0.7130052151238592, "step": 35000}
{"loss": 0.5369351287633181, "learning_rate": 9.25684485006519e-06, "epoch": 0.8148631029986962, "step": 40000}


{"eval_loss": 0.5896087555668659, "eval_acc": 0.7807437595517066, "epoch": 0.8148631029986962, "step": 40000}
{"loss": 0.5313110311329364, "learning_rate": 4.163950456323338e-06, "epoch": 0.9167209908735332, "step": 45000}


{"eval_loss": 0.5516463185993753, "eval_acc": 0.7928680590932247, "epoch": 0.9167209908735332, "step": 45000}


training took 84.16mn


{"eval_loss": 0.5473661853612295, "eval_acc": 0.7944982170147733, "epoch": 1.0, "step": 49088}
{'eval_loss': 0.5473661853612295, 'eval_acc': 0.7944982170147733, 'epoch': 1.0}


## Conclusion

On a P100, training for a single epoch took **393mn** (about 6h33) without the optimization and **84mn** (about 1h24) with it, which divides the training time by `4.7`. Note that accuracy is almost the same in both cases (0.797 without vs 0.794 with).
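The ratio can be recomputed from the timings printed by the two training cells and the 49088 iterations shown in the logs. This is only a rough sketch: the measured time also includes the evaluation passes run every 5000 steps, so the steps per second below underestimate pure training throughput. Variable names are chosen for this illustration.

```python
baseline_minutes = 393.49  # printed by the run without dynamic padding
dynamic_minutes = 84.16    # printed by the run with dynamic padding
train_steps = 49088        # iterations per epoch, from the training logs

speedup = baseline_minutes / dynamic_minutes
time_saved = 1 - dynamic_minutes / baseline_minutes

# Average steps per second over the whole run (evaluation time included)
baseline_steps_per_s = train_steps / (baseline_minutes * 60)
dynamic_steps_per_s = train_steps / (dynamic_minutes * 60)

print(f"speedup: x{speedup:.1f}")  # speedup: x4.7
print(f"time saved: {time_saved:.0%}")
print(f"steps/s: {baseline_steps_per_s:.2f} -> {dynamic_steps_per_s:.2f}")
```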

More than 20 experiments were run for the article, showing that these training time reduction tricks are safe to use.

The article is available [here](https://towardsdatascience.com/divide-hugging-face-transformers-training-time-by-2-or-more-21bf7129db9q-21bf7129db9e?source=friends_link&sk=10a45a0ace94b3255643d81b6475f409).