### Exercise 2.1: Byte Pair Encoding of Unknown Words  

Try the BPE tokenizer from the `tiktoken` library on the unknown words **"Akwirw ier"** and print the individual token IDs. Then, call the `decode` function on each resulting integer to reproduce the mapping shown in **Figure 2.11**. Lastly, call the `decode` method on the token IDs to check if it can reconstruct the original input, **"Akwirw ier"**.  

A detailed discussion and implementation of BPE are beyond the scope of this book. In short, however, BPE builds its vocabulary by iteratively merging frequent characters into subwords and frequent subwords into words.  

For example, BPE starts by adding all individual characters to its vocabulary (`"a"`, `"b"`, ...). Then, it merges frequently occurring character combinations into subwords. For instance, `"d"` and `"e"` may be merged into the subword `"de"`, which is common in words like **"define"**, **"depend"**, **"made"**, and **"hidden"**. These merges are determined by a frequency cutoff. 
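
The merging idea described above can be sketched in a few lines of plain Python. This is a simplified toy illustration, not the `tiktoken` implementation: the helper names `get_pair_counts` and `merge_pair`, the four-word corpus, and the cutoff value `min_frequency = 2` are all assumptions chosen for the example.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (hypothetical): each word starts out as a tuple of characters
corpus = ["define", "depend", "made", "hidden"]
words = dict(Counter(tuple(word) for word in corpus))

# Step 1: the vocabulary starts with all individual characters
vocab = sorted({ch for word in corpus for ch in word})

# Step 2: repeatedly merge the most frequent pair until it falls below the cutoff
min_frequency = 2  # assumed cutoff for this toy example
merges = []
while True:
    counts = get_pair_counts(words)
    if not counts:
        break
    best_pair, freq = counts.most_common(1)[0]
    if freq < min_frequency:
        break
    words = merge_pair(words, best_pair)
    merges.append(best_pair)
    vocab.append("".join(best_pair))

print("Merges:", merges)
print("Vocabulary:", vocab)
```

In this tiny corpus, `"d"` followed by `"e"` is the only pair that occurs often enough to be merged, so `"de"` is the single subword added to the vocabulary. A production tokenizer works on bytes, uses a much larger corpus, and performs many thousands of merges.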

### Exercise 2.2: Data Loaders with Different Strides and Context Sizes  

To better understand how the data loader works, try running it with different settings, such as:  
- `max_length=2` and `stride=2`  
- `max_length=8` and `stride=2`  
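
Before running the real data loader, it can help to see the effect of these two settings on a short list of stand-in token IDs. The sketch below is a plain-Python approximation that assumes the loader slides a window of `max_length` tokens over the text in steps of `stride`, with targets shifted by one position. The function `sliding_windows` is a hypothetical helper for illustration, not the chapter's `create_dataloader_v1`.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

# Stand-in token IDs 0..11 so that positions are easy to read off
token_ids = list(range(12))

for max_length, stride in [(2, 2), (8, 2)]:
    print(f"max_length={max_length}, stride={stride}")
    for inputs, targets in sliding_windows(token_ids, max_length, stride):
        print("  ", inputs, "->", targets)
```

With `max_length=2` and `stride=2`, the windows sit side by side without overlap. With `max_length=8` and `stride=2`, consecutive windows share six of their eight tokens.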

A batch size of `1`, as we have used so far, is useful for illustration. If you have experience with deep learning, you may know that smaller batch sizes require less memory but lead to noisier model updates. As in regular deep learning, **batch size is a trade-off and a hyperparameter** to experiment with when training LLMs.  

Before moving on to the final sections of this chapter (which focus on creating embedding vectors from token IDs), let's briefly explore sampling with a batch size greater than `1`:  

```python
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

print("Inputs:\n", inputs)
print("\nTargets:\n", targets)
```

This produces the following output:

#### Inputs:
```
tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])
```

#### Targets:
```
tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])
```

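Note that we **increased the stride to 4**, matching `max_length=4`. Because the stride equals the window length, consecutive input windows sit side by side: no tokens are skipped, so the dataset is fully utilized, and no token appears in more than one input window. Avoiding this overlap matters because heavily overlapping samples could otherwise lead to overfitting.  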
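
To make the relationship between stride and overlap concrete, the following sketch counts how many input windows contain each token position for several stride values. It assumes the same sliding-window scheme as the data loader; `input_coverage` is a hypothetical helper, and the 21 stand-in token positions are chosen only to keep the numbers small.

```python
from collections import Counter

def input_coverage(num_tokens, max_length, stride):
    """Count how many input windows contain each token position."""
    counts = Counter()
    for start in range(0, num_tokens - max_length, stride):
        for pos in range(start, start + max_length):
            counts[pos] += 1
    return counts

num_tokens, max_length = 21, 4

for stride in (1, 4, 5):
    counts = input_coverage(num_tokens, max_length, stride)
    # The last token can only serve as a target, so it is excluded here
    skipped = [p for p in range(num_tokens - 1) if counts[p] == 0]
    print(f"stride={stride}: max repeats={max(counts.values())}, skipped={skipped}")
```

A stride smaller than `max_length` makes tokens reappear in several input windows, a stride equal to `max_length` uses each position exactly once, and a larger stride leaves gaps between windows.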

In the final two sections of this chapter, we will implement **embedding layers** that convert token IDs into continuous vector representations, which serve as the input format for LLMs.
