# Creating an NLP Data Loader

In PyTorch, the data loader plays an indispensable role in managing this vast amount of data. For natural language processing (NLP) tasks like yours, data often comes in variable lengths due to differing sentence structures and lengths across languages. The data loader efficiently batches these variable-length sequences, ensuring that your models are trained on diverse examples in every iteration. This batching is crucial for harnessing the power of parallel computation on GPUs, thus expediting the training process.

Furthermore, the data loader aids in shuffling the data set, which is vital for preventing models from memorizing the sequence of training data and promoting better generalization. Especially for NLP tasks, where data might be ordered or clustered by topics, shuffling ensures that the model remains robust and doesn't develop biases based on the order of input.

Lastly, in the world of NLP, preprocessing steps such as tokenization, padding, and numericalization are paramount. The data loader in PyTorch provides hooks that allow us to seamlessly integrate these preprocessing steps, ensuring that the raw textual data is transformed into a format that's amenable for deep learning models.
In PyTorch, the data loader plays an indispensable role in managing this vast amount of data.

**Data set**

A data set in **PyTorch** is an object that represents a collection of data samples. Each data sample typically consists of one or more input features and their corresponding target labels. You can also use your data set to transform your data as needed.

**Data loader**

A data loader in **PyTorch** is responsible for efficiently loading and batching data from a data set. It abstracts away the process of iterating over a data set, shuffling, and dividing it into batches for training. In NLP applications, the data loader is used to process and transform your text data, rather than just the data set.

Data loaders have several key parameters, including the data set to load from, batch size (determining how many samples per batch), shuffle (whether to shuffle the data for each epoch), and more. Data loaders also provide an iterator interface, making it easy to iterate over batches of data during training.

In [15]:
import torchtext
print(f"Torchtext v{torchtext.__version__}")

import pandas as pd
from torch.utils.data import Dataset, DataLoader
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import multi30k, Multi30k
from typing import Iterable, List
from torch.nn.utils.rnn import pad_sequence

import torch
import torch.nn as nn
import torch.optim as optim
print(f"Torch v{torch.__version__}")

import spacy

import numpy as np
import random

Torchtext v0.18.0
Torch v2.3.1


## Custom data set and data loader in PyTorch

In [2]:
sentences = [
    "If you want to know what a man's like, take a good look at how he treats his inferiors, not his equals.",
    "Fame's a fickle friend, Harry.",
    "It is our choices, Harry, that show what we truly are, far more than our abilities.",
    "Soon we must all face the choice between what is right and what is easy.",
    "Youth can not know how age thinks and feels. But old men are guilty if they forget what it was to be young.",
    "You are awesome!"
]

class CustomDataset(Dataset):
  def __init__(self, sentences):
    self.sentences = sentences

  def __len__(self):
    return len(self.sentences)

  def __getitem__(self, idx):
    return self.sentences[idx]

batch_size = 2
custom_dataset = CustomDataset(sentences)
dataloader = DataLoader(custom_dataset, batch_size=batch_size, shuffle=True)
for batch in dataloader:
  print(batch)

["If you want to know what a man's like, take a good look at how he treats his inferiors, not his equals.", 'Youth can not know how age thinks and feels. But old men are guilty if they forget what it was to be young.']
["Fame's a fickle friend, Harry.", 'Soon we must all face the choice between what is right and what is easy.']
['You are awesome!', 'It is our choices, Harry, that show what we truly are, far more than our abilities.']


Deep learning models can only comprehend numerical data, and words are meaningless to them. Therefore, the next step is to convert these sentences into tensors. Let's see how to do this.

## Creating tensors for custom data set

In [3]:
sentences = [
    "If you want to know what a man's like, take a good look at how he treats his inferiors, not his equals.",
    "Fame's a fickle friend, Harry.",
    "It is our choices, Harry, that show what we truly are, far more than our abilities.",
    "Soon we must all face the choice between what is right and what is easy.",
    "Youth can not know how age thinks and feels. But old men are guilty if they forget what it was to be young.",
    "You are awesome!"
]

class CustomDataset(Dataset):
  def __init__(self, sentences, tokenizer, vocab):
    self.sentences = sentences
    self.tokenizer = tokenizer
    self.vocab = vocab

  def __len__(self):
    return len(self.sentences)

  def __getitem__(self, idx):
    tokens = self.tokenizer(self.sentences[idx])
    tensor_indices = [self.vocab[token] for token in tokens]
    return torch.tensor(tensor_indices)

tokenizer = get_tokenizer('basic_english')
vocab = build_vocab_from_iterator(map(tokenizer, sentences))
custom_dataset = CustomDataset(sentences, tokenizer, vocab)

print('Custom Dataset Length:', len(custom_dataset))
print('Sample Items:')
for i in range(6):
  sample_item = custom_dataset[i]
  print(f"Item {i + 1}: {sample_item}")



Custom Dataset Length: 6
Sample Items:
Item 1: tensor([11, 19, 63, 17, 13,  2,  3, 47,  6, 16, 45,  0, 55,  3, 41, 46, 24, 10,
        43, 61,  9, 44,  0, 14,  9, 33,  1])
Item 2: tensor([35,  6, 16,  3, 38, 40,  0,  8,  1])
Item 3: tensor([12,  5, 15, 31,  0,  8,  0, 57, 53,  2, 18, 62,  4,  0, 36, 49, 56, 15,
        21,  1])
Item 4: tensor([54, 18, 50, 23, 34, 58, 30, 27,  2,  5, 52,  7,  2,  5, 32,  1])
Item 5: tensor([66, 29, 14, 13, 10, 22, 60,  7, 37,  1, 28, 51, 48,  4, 42, 11, 59, 39,
         2, 12, 64, 17, 26, 65,  1])
Item 6: tensor([19,  4, 25, 20])


## Custom collate function

A collate function is employed in the context of data loading and batching in machine learning, particularly when dealing with variable-length data, such as sequences (e.g., text, time series, sequences of events). Its primary purpose is to prepare and format individual data samples (examples) into batches that can be efficiently processed by machine learning models.

When batch_first=True, output will be in [batch_size x seq_len] shape, otherwise it will be in [seq_len x batch_size] shape. Some models accept input with [batch_size x seq_len] shape while some other models need the input to be of [seq_len x batch_size] shape. Keep in mind that this parameter takes care of putting the input in the desired shape.

In [4]:
def collate_fn(batch):
  padded_batch = pad_sequence(batch, batch_first=True, padding_value=0)
  return padded_batch

In [5]:
dataloader = DataLoader(custom_dataset, batch_size=batch_size, collate_fn=collate_fn)

for batch in dataloader:
  for row in batch:
    for idx in row:
      words = [vocab.get_itos()[idx] for idx in row]
    print(words)

['if', 'you', 'want', 'to', 'know', 'what', 'a', 'man', "'", 's', 'like', ',', 'take', 'a', 'good', 'look', 'at', 'how', 'he', 'treats', 'his', 'inferiors', ',', 'not', 'his', 'equals', '.']
['fame', "'", 's', 'a', 'fickle', 'friend', ',', 'harry', '.', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',']
['it', 'is', 'our', 'choices', ',', 'harry', ',', 'that', 'show', 'what', 'we', 'truly', 'are', ',', 'far', 'more', 'than', 'our', 'abilities', '.']
['soon', 'we', 'must', 'all', 'face', 'the', 'choice', 'between', 'what', 'is', 'right', 'and', 'what', 'is', 'easy', '.', ',', ',', ',', ',']
['youth', 'can', 'not', 'know', 'how', 'age', 'thinks', 'and', 'feels', '.', 'but', 'old', 'men', 'are', 'guilty', 'if', 'they', 'forget', 'what', 'it', 'was', 'to', 'be', 'young', '.']
['you', 'are', 'awesome', '!', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',']


Looking into the result, you can see that the first dimension is the batch. For example, first batch is the first sentence: "['if', 'you', 'want', 'to', 'know', 'what', 'a', 'man', "'", 's', 'like', ',', 'take', 'a', 'good', 'look', 'at', 'how', 'he', 'treats', 'his', 'inferiors', ',', 'not', 'his', 'equals', '.']". 


In [6]:
def collate_fn_bfFALSE(batch):
  padded_batch = pad_sequence(batch, padding_value=0)
  return padded_batch

In [7]:
dataloader_bfFalse = DataLoader(custom_dataset, batch_size=batch_size, collate_fn=collate_fn_bfFALSE)

for seq in dataloader_bfFalse:
  for row in seq:
    words = [vocab.get_itos()[idx] for idx in row]
    print(words)

['if', 'fame']
['you', "'"]
['want', 's']
['to', 'a']
['know', 'fickle']
['what', 'friend']
['a', ',']
['man', 'harry']
["'", '.']
['s', ',']
['like', ',']
[',', ',']
['take', ',']
['a', ',']
['good', ',']
['look', ',']
['at', ',']
['how', ',']
['he', ',']
['treats', ',']
['his', ',']
['inferiors', ',']
[',', ',']
['not', ',']
['his', ',']
['equals', ',']
['.', ',']
['it', 'soon']
['is', 'we']
['our', 'must']
['choices', 'all']
[',', 'face']
['harry', 'the']
[',', 'choice']
['that', 'between']
['show', 'what']
['what', 'is']
['we', 'right']
['truly', 'and']
['are', 'what']
[',', 'is']
['far', 'easy']
['more', '.']
['than', ',']
['our', ',']
['abilities', ',']
['.', ',']
['youth', 'you']
['can', 'are']
['not', 'awesome']
['know', '!']
['how', ',']
['age', ',']
['thinks', ',']
['and', ',']
['feels', ',']
['.', ',']
['but', ',']
['old', ',']
['men', ',']
['are', ',']
['guilty', ',']
['if', ',']
['they', ',']
['forget', ',']
['what', ',']
['it', ',']
['was', ',']
['to', ',']
['be', ',']
['

It can be seen that the first dimension is now the sequence instead of batch, which means sentences will break so that each row includes a token from each sequence. For example the first row, (['if', 'fame']), includes the first tokens of all the sequences in that batch. You need to be aware of this standard to avoid any confusion when working with recurrent neural networks (RNNs) and transformers.

In [8]:
for batch in dataloader:
  print(batch)
  print("Length of sequences in the batch:", batch.shape[1])

tensor([[11, 19, 63, 17, 13,  2,  3, 47,  6, 16, 45,  0, 55,  3, 41, 46, 24, 10,
         43, 61,  9, 44,  0, 14,  9, 33,  1],
        [35,  6, 16,  3, 38, 40,  0,  8,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0]])
Length of sequences in the batch: 27
tensor([[12,  5, 15, 31,  0,  8,  0, 57, 53,  2, 18, 62,  4,  0, 36, 49, 56, 15,
         21,  1],
        [54, 18, 50, 23, 34, 58, 30, 27,  2,  5, 52,  7,  2,  5, 32,  1,  0,  0,
          0,  0]])
Length of sequences in the batch: 20
tensor([[66, 29, 14, 13, 10, 22, 60,  7, 37,  1, 28, 51, 48,  4, 42, 11, 59, 39,
          2, 12, 64, 17, 26, 65,  1],
        [19,  4, 25, 20,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0]])
Length of sequences in the batch: 25


You will see that each batch has a fixed size for all the sequences within the batch.

You also have the option to utilize the collate function for tasks such as tokenization, converting tokenized indices, and transforming the result into a tensor. It's important to note that the original data set remains untouched by these transformations.


In [9]:
class CustomDataset(Dataset):
  def __init__(self, sentences):
    self.sentences = sentences

  def __len__(self):
    return len(self.sentences)

  def __getitem__(self, idx):
    return self.sentences[idx]

custom_dataset = CustomDataset(sentences)

In [10]:
def collate_fn(batch):  
  tensor_batch = []
  for sample in batch:
    tokens = tokenizer(sample)
    tensor_batch.append(torch.tensor([vocab[token] for token in tokens]))
  
  # Pad sequences within the batch to have equal lengths using pad_sequence
  # batch_first=True ensures that the tensors have shape (batch_size, max_sequence_length)
  padded_batch = pad_sequence(tensor_batch, batch_first=True)
  return padded_batch

In [11]:
dataloader = DataLoader(
  dataset=custom_dataset,
  batch_size=batch_size,
  shuffle=True,
  collate_fn=collate_fn
)

for batch in dataloader:
  print(batch)
  print("Shape of sample", len(batch))

tensor([[66, 29, 14, 13, 10, 22, 60,  7, 37,  1, 28, 51, 48,  4, 42, 11, 59, 39,
          2, 12, 64, 17, 26, 65,  1],
        [19,  4, 25, 20,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0]])
Shape of sample 2
tensor([[35,  6, 16,  3, 38, 40,  0,  8,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0],
        [11, 19, 63, 17, 13,  2,  3, 47,  6, 16, 45,  0, 55,  3, 41, 46, 24, 10,
         43, 61,  9, 44,  0, 14,  9, 33,  1]])
Shape of sample 2
tensor([[54, 18, 50, 23, 34, 58, 30, 27,  2,  5, 52,  7,  2,  5, 32,  1,  0,  0,
          0,  0],
        [12,  5, 15, 31,  0,  8,  0, 57, 53,  2, 18, 62,  4,  0, 36, 49, 56, 15,
         21,  1]])
Shape of sample 2


## Exercise

Create a data loader with a collate function that processes batches of French text (provided below). Sort the data set on sequences length. Then tokenize, numericalize and pad the sequences. Sorting the sequences will minimize the number of `<PAD>`tokens added to the sequences, which enhances the model's performance. Prepare the data in batches of size 4 and print them.

In [12]:
import sys
import subprocess
# Install spaCy model if not already installed
try:
  nlp = spacy.load('fr_core_news_sm')
  print("spaCy model 'fr_core_news_sm' is already installed")
except OSError:
  print("Installing spaCy model 'fr_core_news_sm'...")
  subprocess.check_call([sys.executable, "-m", "spacy", "download", "fr_core_news_sm"])
  print("Model installed successfully!")

spaCy model 'fr_core_news_sm' is already installed


In [13]:
corpus = [
  "Ceci est une phrase.",
  "C'est un autre exemple de phrase.",
  "Voici une troisième phrase.",
  "Il fait beau aujourd'hui.",
  "J'aime beaucoup la cuisine française.",
  "Quel est ton plat préféré ?",
  "Je t'adore.",
  "Bon appétit !",
  "Je suis en train d'apprendre le français.",
  "Nous devons partir tôt demain matin.",
  "Je suis heureux.",
  "Le film était vraiment captivant !",
  "Je suis là.",
  "Je ne sais pas.",
  "Je suis fatigué après une longue journée de travail.",
  "Est-ce que tu as des projets pour le week-end ?",
  "Je vais chez le médecin cet après-midi.",
  "La musique adoucit les mœurs.",
  "Je dois acheter du pain et du lait.",
  "Il y a beaucoup de monde dans cette ville.",
  "Merci beaucoup !",
  "Au revoir !",
  "Je suis ravi de vous rencontrer enfin !",
  "Les vacances sont toujours trop courtes.",
  "Je suis en retard.",
  "Félicitations pour ton nouveau travail !",
  "Je suis désolé, je ne peux pas venir à la réunion.",
  "À quelle heure est le prochain train ?",
  "Bonjour !",
  "C'est génial !"
]

def collate_fn_fr(batch):
  tensor_batch = []
  for sample in batch:
    tokens = tokenizer(sample)
    tensor_batch.append(torch.tensor([vocab[token] for token in tokens]))

  padded_batch = pad_sequence(tensor_batch, batch_first=True)
  return padded_batch

tokenizer = get_tokenizer('spacy', language='fr_core_news_sm')
vocab = build_vocab_from_iterator(map(tokenizer, corpus))

sorted_data = sorted(corpus, key=lambda x: len(tokenizer(x)))
dataloader = DataLoader(
  sorted_data, 
  batch_size=4, 
  shuffle=False, 
  collate_fn=collate_fn_fr
)

for batch in dataloader:
  print(batch) 

tensor([[ 27,   2,   0],
        [ 26,  45,   2],
        [ 35,   8,   2],
        [ 25, 101,   2]])
tensor([[  1, 105,  41,   0],
        [  1,   3,  76,   0],
        [  1,   3,  82,   0],
        [ 11,   4,  74,   2]])
tensor([[ 28,   4,  10,   9,   0],
        [ 38,  10, 107,   9,   0],
        [ 12,  69,  51,  49,   0],
        [  1,  16, 103,  17,   0]])
tensor([[  1,   3,  14, 100,   0,   0],
        [ 37,   4,  19,  92,  95,   7],
        [ 33,  71, 122, 117,  52,   2],
        [ 32,  85,  42,  80,  87,   0]])
tensor([[ 30,  18,  19,  88,  21,   2,   0],
        [ 31,  43,   8,  15,  57,  73,   0],
        [ 36,  62,  90, 110,  60,  83,   0],
        [ 34, 112, 104, 106, 108,  56,   0]])
tensor([[ 11,   4, 111,  50,  68,   5,   9,   0],
        [  1, 113,  55,   6,  86,  53,  47,   0],
        [  1,   3,  98,   5, 116,  99,  66,   2],
        [120,  97,  75,   4,   6,  93,  20,   7]])
tensor([[  1,   3,  14,  20,  58,  44,   6,  72,   0,   0],
        [  1,  63,  40,  13,  89, 

## Data loader for German-English translation task

This section sets the stage for German-English machine translation using the torchtext and spaCy libraries. It adjusts data set URLs for the Multi30k data set, configures tokenizers for both languages, and establishes vocabularies with special tokens. This foundation is crucial for building and training effective translation models.

### Translation data set

Fetch a language translation data set called Multi30k. You will modify its default training and validation URLs, and then retrieve and print the first pair of German-English sentences from the training set

In [14]:
multi30k.URL["train"] = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0205EN-SkillsNetwork/training.tar.gz"
multi30k.URL["valid"] = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0205EN-SkillsNetwork/validation.tar.gz"

SRC_LANGUAGE = 'de'
TGT_LANGUAGE = 'en'

train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
dataset = iter(train_iter)

for n in range(5):
  src, tgt = next(dataset)
  print(f"sample {str(n+1)}")
  print(f"Source ({SRC_LANGUAGE}: {src}\nTarget ({TGT_LANGUAGE}): {tgt}")

sample 1
Source (de: Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.
Target (en): Two young, White males are outside near many bushes.
sample 2
Source (de: Mehrere Männer mit Schutzhelmen bedienen ein Antriebsradsystem.
Target (en): Several men in hard hats are operating a giant pulley system.
sample 3
Source (de: Ein kleines Mädchen klettert in ein Spielhaus aus Holz.
Target (en): A little girl climbing into a wooden playhouse.
sample 4
Source (de: Ein Mann in einem blauen Hemd steht auf einer Leiter und putzt ein Fenster.
Target (en): A man in a blue shirt is standing on a ladder cleaning a window.
sample 5
Source (de: Zwei Männer stehen am Herd und bereiten Essen zu.
Target (en): Two men are at the stove preparing food.


### Tokenizer setup

The tokenizer, set up using spaCy, breaks down text into smaller units or tokens, facilitating precise language processing and ensuring that words and punctuations are appropriately segmented for the translation task.

In [21]:
import sys
import subprocess
# Install spaCy model if not already installed
try:
  nlp = spacy.load('de_core_news_sm')
  print("spaCy model 'de_core_news_sm' is already installed")
except OSError:
  print("Installing spaCy model 'de_core_news_sm'...")
  subprocess.check_call([sys.executable, "-m", "spacy", "download", "de_core_news_sm"])
  print("Model installed successfully!")

spaCy model 'de_core_news_sm' is already installed


In [25]:
try:
  nlp = spacy.load('en_core_web_sm')
  print("spaCy model 'en_core_web_sm' is already installed")
except OSError:
  print("Installing spaCy model 'en_core_web_sm'...")
  subprocess.check_call([sys.executable, "-m", "spacy", "download", "en_core_web_sm"])
  print("Model installed successfully!")

Installing spaCy model 'en_core_web_sm'...
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m45.2 MB/s[0m eta [36m0:00:00[0m [36m0:00:01[0m
[?25hInstalling collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


Model installed successfully!


In [28]:
german, english = next(dataset)
print(f"Source German ({SRC_LANGUAGE}): {german}\nTarget English  ({TGT_LANGUAGE}): { english }")

token_transform = {}
token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')

print(token_transform['de'](german))
print(token_transform['en'](english))

Source German (de): Ein schwarzer Hund und ein gefleckter Hund kämpfen.
Target English  (en): A black dog and a spotted dog are fighting
['Ein', 'schwarzer', 'Hund', 'und', 'ein', 'gefleckter', 'Hund', 'kämpfen', '.']
['A', 'black', 'dog', 'and', 'a', 'spotted', 'dog', 'are', 'fighting']


### Special symbols

In [None]:
# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3

# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

### Tokens to indices transformation (Vocab)

The code initializes a dictionary vocab_transform and then builds vocabularies for both German (de) and English (en) languages from the ```train_iter dataset``` using the helper ```function yield_tokens```. These vocabularies are then stored in the vocab_transform dictionary. The vocabularies are built with certain constraints like a minimum frequency for tokens and the inclusion of special symbols at the beginning.


In [31]:
vocab_transform = {}

def yield_tokens(data_iter: Iterable, language: str) -> List[str]:
  language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}
  for data_sample in data_iter:
    yield token_transform[language](data_sample[language_index[language]])

for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
  train_iterator = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
  sorted_dataset = sorted(train_iterator, key=lambda x: len(x[0].split()))
  vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(sorted_dataset, ln),
                                                               min_freq=1,
                                                               specials=special_symbols,
                                                               special_first=True)
  vocab_transform[ln].set_default_index(UNK_IDX)

seq_en = vocab_transform['en'](token_transform['en'](english))
print(f"English text string: {english}\nEnglish sequence: {seq_en}")

seq_de = vocab_transform['de'](token_transform['de'](german))
print(f"German text string: {german}\nGerman sequence: {seq_de}")

English text string: A black dog and a spotted dog are fighting
English sequence: [6, 26, 34, 11, 4, 1763, 34, 17, 679]
German text string: Ein schwarzer Hund und ein gefleckter Hund kämpfen.
German sequence: [5, 117, 33, 9, 15, 2715, 33, 384, 4]


In [32]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [34]:
# function to add BOS/EOS, flip source sentence and create tensor for input sequence indices
def tensor_transform_s(token_ids: List[int]):
  return torch.cat((torch.tensor([BOS_IDX]),
                    torch.flip(torch.tensor(token_ids), dims=(0,)),
                    torch.tensor([EOS_IDX])))

def tensor_transform_t(token_ids: List[int]):
  return torch.cat((torch.tensor([BOS_IDX]),
                    torch.tensor(token_ids),
                    torch.tensor([EOS_IDX])))


seq_en = tensor_transform_s(seq_en)
print(seq_en)

seq_de = tensor_transform_t(seq_de)
print(seq_de)

tensor([   2,  679,   17,   34, 1763,    4,   11,   34,   26,    6,    3])
tensor([   2,    5,  117,   33,    9,   15, 2715,   33,  384,    4,    3])


In [35]:
def sequential_transforms(*transforms):
  def func(txt_input):
    for transform in transforms:
      txt_input = transform(txt_input)
    return txt_input
  return func

text_transform = {}
text_transform[SRC_LANGUAGE] = sequential_transforms(
                                        token_transform[SRC_LANGUAGE], # Tokenization
                                        vocab_transform[SRC_LANGUAGE], # Numericalization
                                        tensor_transform_s)            # Add BOS/EOS, create tensor

text_transform[TGT_LANGUAGE] = sequential_transforms(
                                        token_transform[TGT_LANGUAGE], # Tokenization
                                        vocab_transform[TGT_LANGUAGE], # Numericalization
                                        tensor_transform_s)            # Add BOS/EOS, create tensor
                                        

### Processing data in batches

The collate_fn function builds upon the utilities you established earlier. It performs the text_transform to a batch of raw data. Furthermore, it ensures consistent sequence lengths within the batch through padding. This transformation readies the data for input to a transformer model designed for language translation tasks.


In [43]:
def collate_fn(batch):
  src_batch, tgt_batch = [], []
  for src_sample, tgt_sample in batch:
    src_sequences = text_transform[SRC_LANGUAGE](src_sample.rstrip("\n"))
    src_sequences = torch.tensor(src_sequences, dtype=torch.int64)
    tgt_sequences = text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n"))
    tgt_sequences = torch.tensor(tgt_sequences, dtype=torch.int64)
    src_batch.append(src_sequences)
    tgt_batch.append(tgt_sequences)
  
  src_batch = pad_sequence(src_batch, padding_value=PAD_IDX, batch_first=True)
  tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX, batch_first=True)

  return src_batch.to(device), tgt_batch.to(device)

BATCH_SIZE = 4

train_iterator = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
sorted_train_iterator = sorted(train_iterator, key=lambda x: len(x[0].split()))
train_dataloader = DataLoader(sorted_train_iterator, 
                              batch_size=BATCH_SIZE, 
                              collate_fn=collate_fn, 
                              drop_last=True)

valid_iterator = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
sorted_valid_iterator = sorted(valid_iterator, key=lambda x: len(x[0].split()))
valid_dataloader = DataLoader(sorted_valid_iterator, 
                              batch_size=BATCH_SIZE, 
                              collate_fn=collate_fn, 
                              drop_last=True)

src, trg = next(iter(train_dataloader))
src, trg


  src_sequences = torch.tensor(src_sequences, dtype=torch.int64)
  tgt_sequences = torch.tensor(tgt_sequences, dtype=torch.int64)


(tensor([[    2,     3,     1,     1,     1],
         [    2,  5510,     3,     1,     1],
         [    2,  5510,     3,     1,     1],
         [    2,  1701,     8, 12642,     3]]),
 tensor([[   2,    3,    1,    1,    1,    1,    1,    1,    1,    1,    1],
         [   2,    5, 3428,  692,  115, 9953,  172,  259, 4623, 6650,    3],
         [   2,    5, 2187, 2808,   71, 3823, 1650, 3913,  110,  216,    3],
         [   2,   37,  109,  202, 3398,    6,    3,    1,    1,    1,    1]]))