In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
from datasets import load_dataset

raw_dataset = load_dataset("glue", "mrpc")

train_dataset = raw_dataset["train"]
validation_dataset = raw_dataset["validation"]
test_dataset = raw_dataset["test"]

In [4]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sentence_1 = train_dataset[0]["sentence1"]
sentence_2 = train_dataset[0]["sentence2"]

# To check whether two sentences are similar,
# we pass both sentences to the tokenizer as a pair

inputs = tokenizer(sentence_1, sentence_2, return_tensors="pt")

for key, value in inputs.items():
    print(f"{key}: {value.numpy().tolist()}")

input_ids: [[101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102]]
token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]


In [5]:
ids = inputs["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids.numpy().tolist()[0]))

# the tokens follow the pattern `[CLS] sentence1 [SEP] sentence2 [SEP]`

['[CLS]', 'am', '##ro', '##zi', 'accused', 'his', 'brother', ',', 'whom', 'he', 'called', '"', 'the', 'witness', '"', ',', 'of', 'deliberately', 'di', '##stor', '##ting', 'his', 'evidence', '.', '[SEP]', 'referring', 'to', 'him', 'as', 'only', '"', 'the', 'witness', '"', ',', 'am', '##ro', '##zi', 'accused', 'his', 'brother', 'of', 'deliberately', 'di', '##stor', '##ting', 'his', 'evidence', '.', '[SEP]']
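
The layout seen in the output above can be reproduced with a minimal pure-Python sketch. It is only an illustration of the `[CLS] sentence1 [SEP] sentence2 [SEP]` pattern, not the tokenizer's actual implementation; `build_pair_inputs` and the short token lists are hypothetical names and data made up for this example.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Toy layout of a BERT-style sentence pair (illustration only)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence 1 and the first [SEP];
    # segment 1 covers sentence 2 and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # No padding has been added, so every position is attended to.
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask


tokens, token_type_ids, attention_mask = build_pair_inputs(
    ["he", "left", "."], ["he", "went", "away", "."]
)
print(tokens)
print(token_type_ids)
print(attention_mask)
```

This matches the real output above, where the first 25 positions (`[CLS]`, the 23 tokens of sentence 1, and the first `[SEP]`) have `token_type_ids` of 0 and the remaining 25 have 1.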


In [6]:
tokenized_dataset = tokenizer(
    raw_dataset["train"]["sentence1"],
    raw_dataset["train"]["sentence2"],
    padding=True,
    truncation=True,
)



This works well, but it has the disadvantage of returning a dictionary (with our keys, `input_ids`, `attention_mask`, and `token_type_ids`, and values that are lists of lists).
It will also only work if you have enough RAM to store your whole dataset during the tokenization (whereas the datasets from the Hugging Face Datasets library are Apache Arrow files stored on the disk, so you only keep the samples you ask for loaded in memory).

### 1. **Tokenization Returns a Dictionary:**
When you use the `tokenizer` to process the dataset, it returns a dictionary. For example, when you tokenize sentences with a Hugging Face tokenizer:

```python
tokenized_dataset = tokenizer(
    raw_dataset["train"]["sentence1"],
    raw_dataset["train"]["sentence2"],
    padding=True,
    truncation=True,
)
```

The result (`tokenized_dataset`) is a dictionary-like object containing three keys:
- `input_ids`: The vocabulary IDs of the tokens (words or subwords, plus special tokens such as `[CLS]` and `[SEP]`).
- `attention_mask`: A mask indicating which tokens should be attended to (1 for tokens, 0 for padding).
- `token_type_ids`: Used to distinguish between the two sentences in a pair (for models like BERT).

Each key maps to a **list of lists**:
- Each inner **list** corresponds to one tokenized sentence or sentence pair.
- So, if you're tokenizing many sentences, all of this information lives in one large in-memory dictionary, which can take up significant memory.

### 2. **Memory Concerns:**
When you tokenize the dataset this way, you end up storing all tokenized data in **RAM (random access memory)**. For small datasets, this is manageable. However, for large datasets, this approach can quickly fill up your system's RAM because:
- Each tokenized sentence is stored as a list of integers (`input_ids`, `attention_mask`, etc.).
- For a large dataset with thousands or millions of samples, this can become inefficient, as you are storing all the tokenized sentences in memory at once.
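
To get a feel for the scale, here is a back-of-the-envelope estimate. The corpus size, padded length, and the helper `padded_size_bytes` are hypothetical values and names chosen for illustration; the figure assumes each value is stored as a compact 64-bit integer, and plain Python lists of integers usually cost more than that.

```python
def padded_size_bytes(num_samples, padded_length, num_fields=3, bytes_per_value=8):
    """Rough lower bound on the size of a fully padded, tokenized dataset.

    num_fields=3 stands for input_ids, attention_mask and token_type_ids.
    """
    return num_samples * padded_length * num_fields * bytes_per_value


# Hypothetical corpus: one million sentence pairs, all padded to 128 tokens.
size = padded_size_bytes(num_samples=1_000_000, padded_length=128)
print(f"{size / 1e9:.2f} GB")
```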

### 3. **🤗 Datasets Library and Apache Arrow:**
The **🤗 Datasets library** solves this memory problem by using **Apache Arrow**, a columnar storage format that is optimized for efficient reading/writing of large datasets, while keeping memory usage low.

Here’s how the 🤗 Datasets library works with Arrow files:
- **Disk-backed storage**: Instead of keeping the whole dataset in memory, the data is stored on disk in the Arrow format. Only the samples you are working on at any given moment are loaded into memory.
- **Efficient access**: You can retrieve samples as you need them without having to load the entire dataset into memory. This is particularly useful when working with large datasets like those for NLP tasks (e.g., millions of text samples).

### 4. **The Disadvantage of Tokenizing Everything at Once:**
When you tokenize the entire dataset at once (as in the code above), you lose the benefits of disk-backed datasets:
- **RAM limitations**: The whole tokenized dataset must fit in memory. If your dataset is large, you may run out of memory, causing performance issues or crashes.
- **No incremental loading**: All samples are tokenized and stored at once, even though you might not need all of them in memory simultaneously.

### 5. **Alternatives:**
Instead of tokenizing everything at once, you can use the dataset's **`map()`** method (`Dataset.map()`, or `DatasetDict.map()` to process every split), which tokenizes the dataset in a more efficient, disk-backed manner:

```python
def tokenize_function(example):
    # No padding here: padding is left to the data collator, which pads per batch
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

# Apply tokenization to the dataset using map, without loading everything into memory
tokenized_dataset = raw_dataset.map(tokenize_function, batched=True)
```

### Advantages of Using `Dataset.map()`:
- The `map()` method applies the tokenization **batch by batch**: it processes a group of samples, adds the new columns to the dataset, and moves on, so the full tokenized output never has to sit in memory at once.
- It leverages **Apache Arrow** to keep most of the dataset on disk, only loading and tokenizing batches as needed.
- This approach scales much better to large datasets, because memory use depends mostly on the batch size rather than on the size of the dataset.
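
The batch-by-batch idea can be sketched in plain Python. This is a toy illustration and makes no claim about how 🤗 Datasets is implemented internally; `batched_map` and `toy_tokenize` are hypothetical helpers, and whitespace splitting stands in for a real tokenizer.

```python
def batched_map(rows, function, batch_size=1000):
    """Yield transformed batches one at a time instead of building one big result."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        yield function(batch)


def toy_tokenize(batch):
    # Stand-in for a real tokenizer: split each sentence on whitespace.
    return [sentence.split() for sentence in batch]


sentences = ["a b c", "d e", "f", "g h i j", "k"]
for processed in batched_map(sentences, toy_tokenize, batch_size=2):
    # In a disk-backed setup, each processed batch would be written out here,
    # so only one batch is held in memory at any time.
    print(processed)
```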

In [7]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

In [16]:
tokenized_dataset = raw_dataset.map(tokenize_function, batched=True) # optionally you can use num_proc=4 to use multiple cores

print("Actual Dataset: ", train_dataset)
print("Actual Dataset: ", validation_dataset)
print("Actual Dataset: ", test_dataset)

print()
print("*" * 100)
print()

print("Tokenized Dataset: ", tokenized_dataset)

Map: 100%|██████████| 3668/3668 [00:00<00:00, 10015.57 examples/s]
Map: 100%|██████████| 408/408 [00:00<00:00, 9556.20 examples/s]
Map: 100%|██████████| 1725/1725 [00:00<00:00, 11845.29 examples/s]


Actual Dataset:  Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 3668
})
Actual Dataset:  Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 408
})
Actual Dataset:  Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 1725
})

****************************************************************************************************

Tokenized Dataset:  DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})


### Tokenization Process Breakdown:
**Tokenization = Splitting into Tokens + Converting to IDs:**

- **Splitting into tokens**: The tokenizer breaks each sentence into subword tokens, based on the tokenizer’s vocabulary. Common words stay whole, while rarer words are split into pieces. In the output above, for example, the name at the start of the sentence became `'am', '##ro', '##zi'`, whereas `'accused'` stayed a single token.
- **Converting tokens to `input_ids`**: Each token is then mapped to a unique integer ID in the tokenizer's vocabulary. In the output above, `'accused'` is mapped to 5496, and the special tokens `[CLS]` and `[SEP]` are mapped to 101 and 102 (actual IDs depend on the tokenizer's vocabulary).
- **`input_ids`**: This is the main output of tokenization. It represents the tokenized form of the sentences, where each word or subword is replaced by its corresponding integer ID from the tokenizer’s vocabulary.
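
The two steps described above (splitting into subwords, then mapping to IDs) can be sketched with a toy greedy longest-match splitter in the WordPiece style. This is a simplified illustration under stated assumptions: `toy_vocab` and its IDs are made up, and a real BERT tokenizer also handles normalization, punctuation, and special tokens.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary with made-up IDs.
toy_vocab = {"[UNK]": 0, "i": 1, "love": 2, "token": 3, "##ization": 4}

# Step 1: split into tokens. Step 2: convert tokens to IDs.
tokens = [p for w in "i love tokenization".split() for p in wordpiece(w, toy_vocab)]
ids = [toy_vocab[t] for t in tokens]
print(tokens)
print(ids)
```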

**`token_type_ids` (used by BERT; not every model uses them):**

- Since you are using BERT, which supports input pairs (sentence pairs for tasks like classification), `token_type_ids` are added to distinguish between the two input sentences.
- Sentence 1, together with `[CLS]` and the first `[SEP]`, is marked with 0 for all its tokens.
- Sentence 2, together with the final `[SEP]`, is marked with 1 for all its tokens.
- This helps the model understand which tokens belong to which sentence.

**`attention_mask` (not specific to BERT):**

- The attention mask specifies which tokens the model should attend to and which are just padding tokens (added so that all inputs in a batch have the same length).
- It is a sequence of 1s and 0s, where 1 indicates that the token should be attended to (it’s part of the original input) and 0 indicates that the token is just padding and should be ignored.
- Unlike `token_type_ids`, the attention mask is needed whenever inputs are padded, whichever tokenizer you use.

__DYNAMIC PADDING__:

All tensors in a batch must be padded to the same length, but depending on the dataset the sentences can vary widely in length.

Suppose the shortest sentence is around 20 tokens and the longest is around 200 tokens. If every sentence is padded to 200 tokens, a lot of memory and computation is wasted on padding. Dynamic padding avoids this: each sentence is padded only to the longest sentence in its own batch, not to the longest sentence in the whole dataset.
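
As a minimal sketch of the idea, the function below pads each batch only to its own longest sequence and builds the matching attention mask. It is a hypothetical helper written for illustration, not the library's collator; `pad_batch`, the sample IDs, and the default `pad_id=0` are assumptions.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch and build the mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks a padding position.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the target length is computed per call, a batch of short sentences stays short, regardless of how long the longest sentence in the rest of the dataset is.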

In [17]:
sentences_pair_1 = [len(sentence['sentence1']) for sentence in train_dataset]
sentences_pair_2 = [len(sentence['sentence2']) for sentence in train_dataset]

print(min(sentences_pair_1), max(sentences_pair_1))
print(min(sentences_pair_2), max(sentences_pair_2))

# sentence1: the min length is 38 characters and the max is 226
# sentence2: the min length is 42 characters and the max is 215
# (len() on a string counts characters, not tokens, but it shows how much the lengths vary)

# if all sentences were padded together, each one would be padded to the length of the longest sentence,
# which wastes memory because unnecessary padding is added to the shorter sentences

38 226
42 215


In [31]:
samples = tokenized_dataset["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}

for i in samples['input_ids']:
    print(len(i))

50
59
47
67
59
50
62
32


In [35]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
batch = data_collator(samples)
print({k: v.shape for k, v in batch.items()})   # every sequence is padded to the longest one in this batch (67 tokens)

{'input_ids': torch.Size([8, 67]), 'token_type_ids': torch.Size([8, 67]), 'attention_mask': torch.Size([8, 67]), 'labels': torch.Size([8])}
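
The saving for this particular batch can be worked out from the eight token counts printed earlier. The fixed length of 128 below is a hypothetical comparison point, not a value taken from this notebook, and `padding_positions` is a helper name made up for the example.

```python
# Token counts printed earlier for the 8 samples in this batch.
lengths = [50, 59, 47, 67, 59, 50, 62, 32]


def padding_positions(lengths, target_length):
    """Number of pad positions when every sequence is padded to target_length."""
    return sum(target_length - n for n in lengths)


per_batch = padding_positions(lengths, max(lengths))  # dynamic padding: pad to 67
fixed_128 = padding_positions(lengths, 128)           # hypothetical fixed length of 128
print(per_batch, fixed_128)
```

With dynamic padding, 110 of the 8 × 67 = 536 positions are padding; padding the same batch to 128 tokens would instead spend 598 of 1024 positions on padding.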
