## Here I'll be creating a `Dataset` and `DataLoader` class :)

In [3]:
import torch 
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import numpy as np

Let's load the token ids :)

In [4]:
## Load the .bin file we created before (the token_ids list)
token_id_bin_path = "../../data/processed/Initial/initial_token_ids.bin"

In [None]:
token_ids = np.fromfile(token_id_bin_path, dtype=np.uint16)

token_ids[:5]

array([ 104, 3077, 4615, 2115,  209], dtype=uint16)
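A raw `.bin` file stores only bytes, with no dtype or shape header, so the `dtype` passed to `np.fromfile` has to match the one used when the file was written. Here is a minimal sketch of that round trip, assuming the ids were written as `uint16`; `sample_ids` and `demo_path` are made-up names for a throwaway temp file, not the real dataset:

```python
import os
import tempfile

import numpy as np

# Hypothetical sample ids, written the way the real .bin is assumed to be written
sample_ids = np.array([104, 3077, 4615, 2115, 209], dtype=np.uint16)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_token_ids.bin")
    sample_ids.tofile(demo_path)  # raw bytes only, no header

    same_dtype = np.fromfile(demo_path, dtype=np.uint16)
    wrong_dtype = np.fromfile(demo_path, dtype=np.uint8)

print(same_dtype.tolist())  # the original five ids
print(len(wrong_dtype))     # 10: each 2-byte id was read as two 1-byte values
```

Reading with the wrong dtype raises no error. It just returns different numbers, so it is worth checking the first few ids after loading.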

In [6]:
## Define the context length
block_size = 64    # number of tokens in each training sample


Let's create a `Dataset` class that returns token sequences of length `block_size` (the context length).

### Dataset Explanation

In the custom dataset class, the `__len__(self)` function should return the number of **valid training samples**.

#### Example:

Suppose:

```python
tokens = [10, 11, 12, 13, 14, 15, 16]
block_size = 4
```

Then the valid **input sequences (`X`)** of length `block_size` are:

- `[10, 11, 12, 13]`
- `[11, 12, 13, 14]`
- `[12, 13, 14, 15]`

You **cannot go beyond index 2**, because at `idx = 3`:

```python
X = [13, 14, 15, 16]
```

To predict the next token (`Y`), you'd need one more token after `16` — but it doesn't exist.

#### Therefore:
The number of valid training samples is:

```
len = len(tokens) - block_size = 7 - 4 = 3
```
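The count above can be checked by listing every window directly. This is a small sketch in plain Python using the same toy `tokens` list; the targets `y` are the same window shifted right by one token:

```python
tokens = [10, 11, 12, 13, 14, 15, 16]
block_size = 4

num_samples = len(tokens) - block_size
for idx in range(num_samples):
    x = tokens[idx : idx + block_size]
    y = tokens[idx + 1 : idx + block_size + 1]
    print(idx, x, y)

# At idx = 3 the target slice runs off the end and comes back one token short
short_y = tokens[3 + 1 : 3 + block_size + 1]
print(len(short_y))  # 3, not block_size
```

Python slicing does not raise an error when it runs past the end of the list. It silently returns a shorter list, which is why `__len__` has to enforce the limit.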


### Explanation of `__getitem__(self, idx)`

This method is used to retrieve a **single training sample** from the dataset.

#### What it does:

- It slices `self.data` to generate a training pair:
  - `x` is a sequence of `block_size` tokens starting at index `idx`.
  - `y` is the next sequence of `block_size` tokens, shifted by 1 — representing the expected next-token predictions for each element in `x`.

#### Example:

Suppose:

```python
self.data = [10, 11, 12, 13, 14, 15, 16]
block_size = 4
idx = 0
```

Then:

- `x = [10, 11, 12, 13]`
- `y = [11, 12, 13, 14]`

So the model learns to predict `y[i]` given the tokens `x[0]` through `x[i]`, that is, to predict the **next token** at every position in the input sequence.

This is how autoregressive training works in language models.
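To make the "every position" part concrete, a single `(x, y)` pair can be unrolled into `block_size` separate prediction tasks. This sketch uses the same toy values as the example; `pairs` is just an illustrative name:

```python
data = [10, 11, 12, 13, 14, 15, 16]
block_size = 4
idx = 0

x = data[idx : idx + block_size]
y = data[idx + 1 : idx + block_size + 1]

pairs = []
for i in range(block_size):
    context = x[: i + 1]  # everything the model has seen up to position i
    target = y[i]         # the token that follows that context
    pairs.append((context, target))
    print(context, "->", target)
```

One sample of length 4 therefore provides four training signals, one per position.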


In [21]:
class TokenDataset(Dataset):
    def __init__(self, token_ids, block_size):
        self.block_size = block_size
        # Keep the ids in a NumPy array (for easy slicing)
        self.data = np.array(token_ids, dtype=np.uint16)

    def __len__(self):
        # Number of start indices that still leave one token for the last target
        return self.data.shape[0] - self.block_size

    def __getitem__(self, idx):
        # Cast to int64 before converting: some PyTorch versions cannot convert uint16 arrays
        X = torch.from_numpy(self.data[idx : idx + self.block_size].astype(np.int64))
        # Targets are the same window shifted right by one token
        y = torch.from_numpy(self.data[idx + 1 : idx + self.block_size + 1].astype(np.int64))

        return X, y

In [22]:
## Quickly split the token ids: the first 80% is training data, the rest is validation data
split_idx = int(0.8 * len(token_ids))
train_token_ids = token_ids[:split_idx]
val_token_ids = token_ids[split_idx:]

len(train_token_ids), len(val_token_ids)


(47108, 11778)

47108 training token ids, 11778 validation token ids.

In [36]:
token_dataset = TokenDataset(train_token_ids, block_size)
trainloader = DataLoader(token_dataset, batch_size=32, shuffle=True, drop_last=True)


In [37]:
## Get the first sample
token_dataset[0]


(tensor([ 104, 3077, 4615, 2115,  209,   12, 6642,   27, 7661,   58, 4661, 7883,
           73, 1148,   72,    5, 1433, 7880, 2732,  248,  819, 2649,   72,    5,
         2726, 7870, 7496,   27,    5, 3310,   58, 1996, 7883, 4751,   24,   34,
          947, 3290, 4079, 3219, 4751,   16, 7873, 4137, 1148,  148, 4234, 4582,
         7870,  185, 2058, 2115,   73, 7317,  270,  911, 4834,  148, 7621,   90,
          134, 1387, 6951,   27]),
 tensor([3077, 4615, 2115,  209,   12, 6642,   27, 7661,   58, 4661, 7883,   73,
         1148,   72,    5, 1433, 7880, 2732,  248,  819, 2649,   72,    5, 2726,
         7870, 7496,   27,    5, 3310,   58, 1996, 7883, 4751,   24,   34,  947,
         3290, 4079, 3219, 4751,   16, 7873, 4137, 1148,  148, 4234, 4582, 7870,
          185, 2058, 2115,   73, 7317,  270,  911, 4834,  148, 7621,   90,  134,
         1387, 6951,   27, 2147]))

Looking Good!!

In [38]:
for batch_features, batch_labels in trainloader:
    print("Input (x):", batch_features.shape)   # (batch_size, block_size)
    print("Target (y):", batch_labels.shape)  # (batch_size, block_size)
    print("First sample:")
    print("x:", batch_features[0][:10])
    print("y:", batch_labels[0][:10])
    break

Input (x): torch.Size([32, 64])
Target (y): torch.Size([32, 64])
First sample:
x: tensor([1084, 1502,  843, 2888,   34,  742, 7140, 7870,  223,  167])
y: tensor([1502,  843, 2888,   34,  742, 7140, 7870,  223,  167,   23])


The batches are currently shaped `(batch_size, block_size)`,
meaning 32 samples (rows) of 64 tokens each are processed at once.
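As a quick sanity check, the number of batches per epoch can be worked out from the numbers already shown: 47108 training token ids, `block_size = 64`, `batch_size = 32` and `drop_last=True`. This sketch is only the arithmetic and does not touch the real loader:

```python
num_train_tokens = 47108
block_size = 64
batch_size = 32

num_samples = num_train_tokens - block_size  # what TokenDataset.__len__ returns
num_batches = num_samples // batch_size      # drop_last=True discards the partial batch
leftover = num_samples % batch_size          # samples skipped in each epoch

print(num_samples, num_batches, leftover)  # 47044 1470 4
```

With `shuffle=True` the 4 leftover samples differ from epoch to epoch, so no sample is excluded permanently.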