In [6]:
#| include: false
import random
from typing import Dict, List, Tuple

import torch
from torch.utils.data import DataLoader, Dataset

from transformers import AutoTokenizer

Suppose we have the following hypothetical dataset.

In [8]:
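class RandomDataset(Dataset):
    """Toy map-style dataset: 32 (text, label) pairs with random labels."""

    def __init__(self):
        super().__init__()

    def __len__(self):
        return 32

    def __getitem__(self, idx):
        # Labels are drawn uniformly from {0, 1, 2, 3} on every access.
        return f"hello {idx}", random.randint(0, 3)


rand_ds = RandomDataset()
rand_dl = DataLoader(rand_ds, batch_size=4)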

Printing out the first batch, we see that the first element is just a tuple of strings, while the second has automagically been converted into a tensor.

In [10]:
next(iter(rand_dl))

[('hello 0', 'hello 1', 'hello 2', 'hello 3'), tensor([0, 1, 1, 0])]
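
To make the behaviour above concrete, here is a rough pure-Python sketch of the grouping step: the list of `(text, label)` samples is transposed into one group per field. The helper name `transpose_batch` is hypothetical, and PyTorch's real default collate function does more than this (for example, it turns the integer labels into a tensor), so treat it as an illustration rather than the actual implementation.

```python
from typing import List, Tuple


def transpose_batch(
    batch: List[Tuple[str, int]]
) -> Tuple[Tuple[str, ...], Tuple[int, ...]]:
    """Group a list of (text, label) samples by field (hypothetical helper)."""
    texts, labels = zip(*batch)
    return texts, labels


samples = [("hello 0", 0), ("hello 1", 1), ("hello 2", 1), ("hello 3", 0)]
texts, labels = transpose_batch(samples)
print(texts)   # ('hello 0', 'hello 1', 'hello 2', 'hello 3')
print(labels)  # (0, 1, 1, 0)
```

The strings stay strings because there is no obvious way to stack them into a tensor, which is the gap a custom collate function can fill.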

## The Collate Function

With a collate function we can convert these strings into numbers as well. This leads to cleaner code, since data preprocessing is kept away from model code. In my case it also led to a slightly faster run time per epoch, though I'm not entirely sure why.

The following code takes in a list whose length is the batch size, where **each element** is a string and its corresponding label. It then passes the strings through the Hugging Face tokenizer, which converts them into numerical token ids. More importantly, note that you now have to convert `y` to a `torch.LongTensor` yourself, as otherwise it would remain a tuple. This is an extra step that PyTorch was previously taking care of for you.

In [11]:
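class CollateFn:
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    def __call__(
        self, batch: List[Tuple[str, int]]
    ) -> Tuple[Dict[str, List[List[int]]], torch.LongTensor]:
        # Transpose the list of (text, label) samples into two tuples.
        x, y = zip(*batch)
        # The tokenizer returns a dict-like object holding lists of token ids;
        # the labels must now be converted to a tensor by hand.
        return self.tokenizer(list(x)), torch.LongTensor(y)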

We can add an instance of the above class to our dataloader, which leads us to the following results:

In [12]:
collate_fn = CollateFn()
rand_dl = DataLoader(rand_ds, batch_size=4, collate_fn=collate_fn)
next(iter(rand_dl))

({'input_ids': [[101, 19082, 121, 102], [101, 19082, 122, 102], [101, 19082, 123, 102], [101, 19082, 124, 102]], 'token_type_ids': [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]},
 tensor([2, 1, 3, 1]))
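
Note that the token ids in the output above are plain Python lists, and they have equal length only because every string happens to tokenize to four ids. With a Hugging Face tokenizer, passing `padding=True` and `return_tensors="pt"` to the call would return padded tensors instead. The sketch below illustrates the same padding idea without downloading a model: the whitespace `toy_tokenize` function and its `VOCAB` are made up for illustration and stand in for the real tokenizer, while `pad_sequence` is PyTorch's own utility.

```python
from typing import Dict, List, Tuple

import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical toy vocabulary; id 0 is reserved for padding.
VOCAB: Dict[str, int] = {"<pad>": 0, "<unk>": 1, "hello": 2, "big": 3, "world": 4}


def toy_tokenize(text: str) -> List[int]:
    """Split on whitespace and map each word to an id (stand-in for a real tokenizer)."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.split()]


def pad_collate(
    batch: List[Tuple[str, int]]
) -> Tuple[Dict[str, torch.Tensor], torch.Tensor]:
    texts, labels = zip(*batch)
    ids = [torch.tensor(toy_tokenize(t), dtype=torch.long) for t in texts]
    # Pad every sequence to the length of the longest one in the batch.
    input_ids = pad_sequence(ids, batch_first=True, padding_value=VOCAB["<pad>"])
    # 1 marks real tokens, 0 marks padding.
    attention_mask = (input_ids != VOCAB["<pad>"]).long()
    features = {"input_ids": input_ids, "attention_mask": attention_mask}
    return features, torch.tensor(labels, dtype=torch.long)


features, targets = pad_collate([("hello world", 2), ("hello big big world", 1)])
print(features["input_ids"].shape)  # torch.Size([2, 4])
```

A function like this can be passed to `DataLoader` through `collate_fn` in the same way as the `CollateFn` instance above.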

## Summary
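- A collate function turns a list of individual samples into a batch, which makes it a convenient place for preprocessing such as tokenization.
- Once you supply your own collate function, you need to transform **all outputs** yourself, not just the one you want to change (here, the labels had to be converted to a tensor by hand).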

## Shameless self promotion
If you enjoyed the tutorial, buy me a coffee, or better yet [buy my course](https://www.udemy.com/course/machine-learning-and-data-science-2021/?referralCode=E79228C7436D74315787) (usually 90% off).