### Exercise 2.1: Byte Pair Encoding of Unknown Words  

Try the BPE tokenizer from the `tiktoken` library on the unknown words **"Akwirw ier"** and print the individual token IDs. Then call the `decode` method on each resulting integer to reproduce the mapping shown in **Figure 2.11**. Lastly, call `decode` on the full list of token IDs to check whether it reconstructs the original input, **"Akwirw ier"**.

A detailed discussion and implementation of BPE is beyond the scope of this book. However, in short, BPE builds its vocabulary by iteratively merging frequent characters into subwords and frequent subwords into words.  

For example, BPE starts by adding all individual characters to its vocabulary (`"a"`, `"b"`, ...). Then, it merges frequently occurring character combinations into subwords. For instance, `"d"` and `"e"` may be merged into the subword `"de"`, which is common in words like **"define"**, **"depend"**, **"made"**, and **"hidden"**. These merges are determined by a frequency cutoff. 
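
The merge step described above can be sketched in a few lines of pure Python. This is only a toy illustration under simplifying assumptions, not the algorithm `tiktoken` actually runs: it works on characters rather than bytes, performs a single merge, and uses a made-up four-word corpus. The helper names `most_frequent_pair` and `merge_pair` are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pair_counts = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pair_counts[(left, right)] += 1
    return pair_counts.most_common(1)[0][0] if pair_counts else None


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in one word with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


# Toy corpus (hypothetical): each word starts as a list of single characters
toy_words = [list(w) for w in ["define", "depend", "made", "hidden"]]

# One merge step: find the most frequent adjacent pair and merge it everywhere
best_pair = most_frequent_pair(toy_words)
toy_words = [merge_pair(w, best_pair) for w in toy_words]

print(best_pair)
print(toy_words)
```

In this toy corpus the pair `("d", "e")` occurs four times, more often than any other pair, so it is merged into `"de"` first. A full BPE trainer would repeat this step until a target vocabulary size or frequency cutoff is reached.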

In [1]:
from importlib.metadata import version

print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

torch version: 2.4.1
tiktoken version: 0.7.0


In [4]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

In [5]:
integers = tokenizer.encode("Akwirw ier")
print(integers)

[33901, 86, 343, 86, 220, 959]


In [6]:
for i in integers:
    print(f"{i} -> {tokenizer.decode([i])}")

33901 -> Ak
86 -> w
343 -> ir
86 -> w
220 ->  
959 -> ier


In [7]:
tokenizer.encode("Ak")

[33901]

In [8]:
tokenizer.encode("w")

[86]

In [9]:
tokenizer.encode(" ")

[220]

In [10]:
tokenizer.encode("ier")

[959]

In [11]:
tokenizer.decode([33901, 86, 343, 86, 220, 959])

'Akwirw ier'

### Exercise 2.2: Data Loaders with Different Strides and Context Sizes  

To better understand how the data loader works, try running it with different settings, such as:  
- `max_length=2` and `stride=2`  
- `max_length=8` and `stride=2`  
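
To see what these two settings do before running the real data loader, the sketch below applies the same sliding-window loop used in the `GPTDatasetV1` class to a short list of stand-in integers. The function name `sliding_windows` and the toy IDs `0..11` are assumptions made for illustration; no tokenizer or PyTorch is involved.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs


# Stand-in token IDs (hypothetical): 0..11 instead of real tokenizer output
toy_ids = list(range(12))

for max_length, stride in [(2, 2), (8, 2)]:
    print(f"max_length={max_length}, stride={stride}")
    for inp, tgt in sliding_windows(toy_ids, max_length, stride):
        print("  ", inp, "->", tgt)
```

With `max_length=2` and `stride=2`, consecutive input windows sit side by side without sharing tokens. With `max_length=8` and `stride=2`, each window shares six tokens with the next one.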

Batch sizes of `1`—as we have used so far—are useful for illustration. If you have experience with deep learning, you may know that smaller batch sizes require less memory but lead to noisier model updates. As in regular deep learning, **batch size is a trade-off and a hyperparameter** to experiment with when training LLMs.  

Before moving on to the final sections of this chapter (which focus on creating embedding vectors from token IDs), let's briefly explore sampling with a batch size greater than `1`:  

```python
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

print("Inputs:\n", inputs)
print("\nTargets:\n", targets)
```

This produces the following output:

#### Inputs:
```
tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])
```

#### Targets:
```
tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])
```

Note that we **increase the stride to 4**, matching `max_length`. The input windows then sit side by side across the text: no words are skipped, and consecutive windows do not overlap. Heavy overlap would show the model the same tokens many times, which could lead to overfitting.
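
The relationship between stride and window size can be checked with simple arithmetic. The following sketch is an illustration only: `window_stats` is a hypothetical helper, and the sequence length of 100 tokens is an arbitrary stand-in. It reports how many windows a setting yields and how many tokens consecutive input windows share or skip.

```python
def window_stats(num_tokens, max_length, stride):
    """Summarize how a (max_length, stride) choice samples a token sequence."""
    # Same start positions as the sliding-window loop in GPTDatasetV1
    starts = list(range(0, num_tokens - max_length, stride))
    return {
        "num_windows": len(starts),
        # Tokens shared by two consecutive input windows
        "overlap": max(0, max_length - stride),
        # Tokens skipped between two consecutive input windows
        "skipped": max(0, stride - max_length),
    }


# 100 is an arbitrary (hypothetical) sequence length for illustration
for stride in (1, 4, 8):
    print(stride, window_stats(num_tokens=100, max_length=4, stride=stride))
```

A stride smaller than `max_length` produces overlapping windows, a stride equal to `max_length` produces neither overlap nor gaps, and a larger stride skips tokens.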

In the final two sections of this chapter, we will implement **embedding layers** that convert token IDs into continuous vector representations—an essential input format for LLMs.


In [15]:
import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


def create_dataloader(txt, batch_size=4, max_length=256, stride=128):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(dataset, batch_size=batch_size)

    return dataloader


with open("01_main-chapter-code/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

tokenizer = tiktoken.get_encoding("gpt2")
encoded_text = tokenizer.encode(raw_text)

vocab_size = 50257
output_dim = 256
max_len = 4
context_length = max_len

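# Token embeddings: one vector per entry in the vocabulary
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
# Positional embeddings: one vector per position in the context window
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)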

In [16]:
dataloader = create_dataloader(raw_text, batch_size=4, max_length=2, stride=2)

for batch in dataloader:
    x, y = batch
    break

x

tensor([[  40,  367],
        [2885, 1464],
        [1807, 3619],
        [ 402,  271]])

In [17]:
dataloader = create_dataloader(raw_text, batch_size=4, max_length=8, stride=2)

for batch in dataloader:
    x, y = batch
    break

x

tensor([[   40,   367,  2885,  1464,  1807,  3619,   402,   271],
        [ 2885,  1464,  1807,  3619,   402,   271, 10899,  2138],
        [ 1807,  3619,   402,   271, 10899,  2138,   257,  7026],
        [  402,   271, 10899,  2138,   257,  7026, 15632,   438]])