In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Overview

In this notebook, we will explore **two approaches to tokenization** for Large Language Models:

1. **Manual Tokenization**  
   - We will implement a custom tokenization process using **regular expressions (regex)**.  
   - This approach will split text into tokens such that:  
     - Each **word** and **special character** (`, . : ; ? _ ! " ( ) ' --`) is treated as an individual token.  
     - **Whitespace** will be used for splitting but will **not appear as tokens**.  
   - This helps us understand the basic mechanics of tokenization.

2. **Byte Pair Encoding (BPE) Tokenization**  
   - We will use a library-based implementation of **BPE tokenization**, which is widely used in LLMs like GPT.  
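   - BPE starts from individual bytes and repeatedly merges the most frequent adjacent pair into a new token, so common words become single tokens while rare words are split into subword pieces.  
   - Because any string can be broken into known pieces, this approach needs no `<|unk|>` fallback and is **closer to what real-world models use**.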
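
The merge idea behind BPE can be sketched on a toy corpus. This is a minimal illustration, not GPT-2's actual training procedure: it works on characters rather than bytes, and the word counts and the helper names `most_frequent_pair` and `merge_pair` are hypothetical, chosen only for this example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency, and return the top one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # On ties, max() returns the pair that was counted first.
    return max(pairs, key=pairs.get)


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Hypothetical toy corpus: each word is a tuple of characters with a made-up count.
toy = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):
    pair = most_frequent_pair(toy)
    toy = merge_pair(toy, pair)
    merges.append(pair)

print(merges)
print(list(toy))
```

On this corpus the first merges build up the frequent suffix `est` (`e` + `s`, then `es` + `t`), which shows how frequent character sequences end up as single tokens.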


### **1. Manual Tokenization**

In [2]:
with open('/content/drive/My Drive/LLM/Nebula by ChatGPT.txt', 'r', encoding='utf-8') as f:
  raw_text = f.read()

print(len(raw_text))
print(raw_text[:99])

8716
Nebula: The Cosmic Cradles of Creation
Preface
When you look at the night sky, you see thousands of


#### 1.1 Rule-based Tokenization
Split the given text into tokens based on the following rules:

- **Split words and the following special characters as individual tokens:**  
  `, . : ; ? _ ! " ( ) ' --`  
- **Split on whitespace, but do not include whitespace as tokens**  
- **Remove any empty strings after splitting**  
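
Before applying these rules to the full text, it can help to check them on a short string. The sketch below uses the same regular expression as the next cell; the sample sentence is made up for illustration.

```python
import re

# Hypothetical sample sentence containing punctuation, a double dash, and spaces.
sample = 'Hello, world. Is this-- a test?'

# The capture group keeps the delimiters (punctuation, "--", whitespace) in the result.
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

# Drop whitespace-only and empty strings left over from the split.
tokens = [t.strip() for t in tokens if t.strip()]

print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```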

In [3]:
import re

preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['Nebula', ':', 'The', 'Cosmic', 'Cradles', 'of', 'Creation', 'Preface', 'When', 'you', 'look', 'at', 'the', 'night', 'sky', ',', 'you', 'see', 'thousands', 'of', 'stars', 'scattered', 'like', 'glitter', 'across', 'a', 'black', 'canvas', '.', 'It’s']


In [4]:
print(len(preprocessed))

1662


#### 1.2 Building the Vocabulary

- After tokenizing the text, we need to create a **vocabulary** – a mapping of each unique token to a unique integer ID.
- Steps:
  1. **Get all unique tokens** from the tokenized text using `set(preprocessed)`.
  2. **Sort the tokens** to maintain a consistent order.
  3. **Add special tokens**:
     - `<|endoftext|>` → Marks the end of the text sequence.
     - `<|unk|>` → Represents unknown tokens (tokens not in the vocabulary).
  4. **Create a dictionary** where:
     - **Key = token**
     - **Value = integer ID** (starting from 0)

This vocabulary will be used for encoding text into numerical representations.
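
The steps above can be tried on a toy token list first. This is a small sketch with made-up tokens; the `dict.get` fallback to the `<|unk|>` ID is one possible way to handle unknown tokens and is an assumption of this example.

```python
# Hypothetical token list standing in for the real tokenized text.
tokens = ['the', 'stars', ',', 'the', 'dust', '.']

# Steps 1 and 2: unique tokens in a stable, sorted order.
all_tokens = sorted(set(tokens))

# Step 3: append the special tokens after the regular ones.
all_tokens.extend(['<|endoftext|>', '<|unk|>'])

# Step 4: map each token to an integer ID starting from 0.
vocab = {token: idx for idx, token in enumerate(all_tokens)}
print(vocab)

# Tokens missing from the vocabulary fall back to the <|unk|> ID.
unk_id = vocab['<|unk|>']
ids = [vocab.get(t, unk_id) for t in ['the', 'nebula', '.']]
print(ids)  # [4, 6, 1]
```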


In [5]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

663


In [6]:
vocab = {token:integer for integer,token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 10:
        break

('!', 0)
('&', 1)
('(', 2)
(')', 3)
(',', 4)
('.', 5)
('//hubblesite', 6)
('//www', 7)
('000', 8)
('1', 9)
('10', 10)


In [7]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}

#### 1.3 Implementing a Tokenizer

We define a custom `Tokenizer` class to **encode text into token IDs** and **decode token IDs back to text**, based on our vocabulary.

#### **Key Components**

1. **Initialization (`__init__`)**
   - `self.str_to_int`: A dictionary mapping tokens → integer IDs.
   - `self.int_to_str`: A reverse dictionary mapping integer IDs → tokens.

2. **`encode(text)`**
   - Splits the input text using the same regex rules from earlier.
   - Removes whitespace tokens and keeps special characters as tokens.
   - Replaces any unknown token with `<|unk|>`.
   - Converts tokens into their corresponding integer IDs using the vocabulary.
   - **Returns**: A list of token IDs.

3. **`decode(ids)`**
   - Converts token IDs back into tokens using the reverse dictionary.
   - Joins tokens with spaces.
   - Uses a regex substitution to remove unwanted spaces before punctuation.
   - **Returns**: The reconstructed text.

#### **Purpose**
This tokenizer demonstrates a **basic rule-based encoding and decoding process**. Production tokenizers expose the same encode/decode interface but rely on learned subword vocabularies rather than whole words.
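
The decoding step (join with spaces, then remove the space before punctuation) can be checked in isolation. The token list below is made up for illustration; the regular expression is the one used in `decode`.

```python
import re

# Hypothetical tokens, as they would come out of the reverse vocabulary lookup.
tokens = ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']

# Joining with spaces puts a space before every punctuation mark.
joined = " ".join(tokens)

# Remove whitespace that directly precedes one of the listed punctuation characters.
text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', joined)

print(joined)  # Hello , world . Is this -- a test ?
print(text)    # Hello, world. Is this -- a test?
```

The double dash keeps its surrounding spaces because `-` is not in the character class, so decoding does not always restore the original spacing exactly.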

In [8]:
class Tokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

#### 1.4 Encoding and Decoding

We use the tokenizer to convert text into token IDs and then decode those IDs back into text.  
- **Encoding**: Splits text into tokens and maps each token to its corresponding ID from the vocabulary. Unknown tokens are replaced with `<|unk|>`.  
- **Decoding**: Converts token IDs back to tokens, joins them into a string, and removes extra spaces before punctuation.  

Running both steps back to back exercises the full **encode → decode cycle**. The round trip reproduces the input only for tokens that are in the vocabulary; every other token comes back as `<|unk|>`.


In [9]:
tokenizer = Tokenizer(vocab)
text = 'Nebulae are vast clouds of gas and dust in space, acting as the birthplaces and graveyards of stars, shaping the life cycle of the cosmos. They come in stunning forms—glowing, dark, or shattered remnants—each telling a story of creation, destruction, and renewal in the universe.'

In [10]:
text

'Nebulae are vast clouds of gas and dust in space, acting as the birthplaces and graveyards of stars, shaping the life cycle of the cosmos. They come in stunning forms—glowing, dark, or shattered remnants—each telling a story of creation, destruction, and renewal in the universe.'

In [11]:
tokenizer.encode(text)

[126,
 207,
 632,
 273,
 486,
 365,
 203,
 314,
 400,
 570,
 4,
 191,
 211,
 602,
 664,
 203,
 664,
 486,
 580,
 4,
 664,
 602,
 428,
 664,
 486,
 602,
 664,
 5,
 165,
 283,
 400,
 588,
 664,
 4,
 299,
 4,
 493,
 664,
 664,
 664,
 187,
 584,
 486,
 296,
 4,
 307,
 4,
 203,
 664,
 400,
 602,
 628,
 5]

In [12]:
tokenizer.decode(tokenizer.encode(text))

'Nebulae are vast clouds of gas and dust in space, acting as the <|unk|> and <|unk|> of stars, <|unk|> the life <|unk|> of the <|unk|>. They come in stunning <|unk|>, dark, or <|unk|> <|unk|> <|unk|> a story of creation, destruction, and <|unk|> in the universe.'

### **2. Byte Pair Tokenizer**

In [13]:
# !pip3 install tiktoken

In [14]:
import tiktoken

#### BPE Tokenization using `tiktoken` (GPT-2 Encoding)

In this step, we use the **`tiktoken`** library to apply **Byte Pair Encoding (BPE)** tokenization, the same method used in GPT models.

##### **What happens**
- **Tokenizer Initialization**:  
  `tiktoken.get_encoding("gpt2")` loads the GPT-2 BPE tokenizer.
  
- **Encoding**:  
  `tokenizer.encode(text, allowed_special={"<|endoftext|>"})` converts the text into a list of integer token IDs.  
  - Uses GPT-2’s BPE vocabulary.
  - Allows the special token `<|endoftext|>` to remain intact.

- **Decoding**:  
  `tokenizer.decode(integers)` converts the token IDs back into the original text.

##### **Purpose**
This demonstrates **library-based BPE tokenization**. Unlike the rule-based tokenizer above, it splits unfamiliar words into known subword pieces instead of mapping them to `<|unk|>`, which is why it is the standard choice for LLMs.


In [15]:
tokenizer = tiktoken.get_encoding("gpt2")

In [16]:
text = (
    'Nebulae are enormous clouds of gas and dust scattered across the universe, serving as the cradles where stars are born. These majestic formations can span hundreds of light-years and display breathtaking colors when illuminated by young, hot stars. <|endoftext|> Some nebulae glow brightly as emission regions, while others appear dark, blocking the starlight behind them. They play a crucial role in recycling elements, giving rise to new stars and planetary systems over billions of years. Studying nebulae helps scientists understand the origin of stars, planets, and even life itself.'
)

In [17]:
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[45, 1765, 377, 3609, 389, 9812, 15114, 286, 3623, 290, 8977, 16830, 1973, 262, 6881, 11, 7351, 355, 262, 1067, 324, 829, 810, 5788, 389, 4642, 13, 2312, 45308, 30648, 460, 11506, 5179, 286, 1657, 12, 19002, 290, 3359, 35589, 7577, 618, 35162, 416, 1862, 11, 3024, 5788, 13, 220, 50256, 2773, 497, 15065, 3609, 19634, 35254, 355, 25592, 7652, 11, 981, 1854, 1656, 3223, 11, 12013, 262, 3491, 2971, 2157, 606, 13, 1119, 711, 257, 8780, 2597, 287, 25914, 4847, 11, 3501, 4485, 284, 649, 5788, 290, 27047, 3341, 625, 13188, 286, 812, 13, 3604, 1112, 497, 15065, 3609, 5419, 5519, 1833, 262, 8159, 286, 5788, 11, 14705, 11, 290, 772, 1204, 2346, 13]


In [18]:
tokenizer.decode(integers)

'Nebulae are enormous clouds of gas and dust scattered across the universe, serving as the cradles where stars are born. These majestic formations can span hundreds of light-years and display breathtaking colors when illuminated by young, hot stars. <|endoftext|> Some nebulae glow brightly as emission regions, while others appear dark, blocking the starlight behind them. They play a crucial role in recycling elements, giving rise to new stars and planetary systems over billions of years. Studying nebulae helps scientists understand the origin of stars, planets, and even life itself.'

### **3. Input-Target Pairs**

In [19]:
enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

2066


*The BPE tokenizer produces 2066 tokens for the full text, compared with 1662 tokens from the rule-based split.*

#### 3.1 Creating Input and Target Sequences

We define a fixed context size of 8 tokens. The input sequence (`x`) consists of the first 8 tokens of the encoded text, and the target sequence (`y`) is a window of the same length starting 4 tokens later. The `y` line is padded in the printout so that the tokens shared by the two windows line up.

The 4-token offset is used here only to make the overlap between the two windows easy to see. For actual next-token prediction the target is the input shifted by a single position, which is what the `GPTDataset` class in section 3.4 does.
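
For comparison, the shift-by-one layout that next-token prediction relies on can be sketched with plain lists. The IDs are the first nine token IDs printed in this notebook; the `context_size` of 4 is an arbitrary choice for this example.

```python
# First nine token IDs of the encoded text, copied from the outputs in this notebook.
token_ids = [45, 1765, 4712, 25, 383, 32011, 3864, 324, 829]
context_size = 4  # assumed value, chosen only for this sketch

x = token_ids[:context_size]
y = token_ids[1:context_size + 1]  # same window, shifted right by one token

print(f"x: {x}")
print(f"y:      {y}")

# Each prefix of x is a context, and the token right after it is the target.
for i in range(1, context_size + 1):
    context = token_ids[:i]
    desired = token_ids[i]
    print(context, "---->", desired)
```

With this layout, `y[k]` is always the token that follows `x[k]`, so one window yields `context_size` prediction tasks.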


In [20]:
context_size = 8

x = enc_text[:context_size]
y = enc_text[4:context_size+4]

print(f"x: {x}")
print(f"y:                     {y}")

x: [45, 1765, 4712, 25, 383, 32011, 3864, 324]
y:                     [383, 32011, 3864, 324, 829, 286, 21582, 198]


#### 3.2 Generating Context-Target Pairs

We iterate through the encoded text to generate training samples for next-token prediction. For each step:

- **Context**: A sequence of tokens from the start up to the current position.
- **Desired Token**: The next token that should be predicted given the context.

This approach demonstrates how language models learn by predicting the next token based on the preceding tokens.


In [21]:
for i in range(4, context_size+4):
    context = enc_text[:i]
    desired = enc_text[i]

    print(context, "---->", desired)

[45, 1765, 4712, 25] ----> 383
[45, 1765, 4712, 25, 383] ----> 32011
[45, 1765, 4712, 25, 383, 32011] ----> 3864
[45, 1765, 4712, 25, 383, 32011, 3864] ----> 324
[45, 1765, 4712, 25, 383, 32011, 3864, 324] ----> 829
[45, 1765, 4712, 25, 383, 32011, 3864, 324, 829] ----> 286
[45, 1765, 4712, 25, 383, 32011, 3864, 324, 829, 286] ----> 21582
[45, 1765, 4712, 25, 383, 32011, 3864, 324, 829, 286, 21582] ----> 198


#### 3.3 Displaying Context-Target Pairs in Text Form

We decode each context and its target token back into readable text, which makes it easier to see what the model receives as input and what it is expected to output.

Note that the targets are tokens, not words: `Cradles` is split into the subword pieces ` Cr`, `ad`, and `les`, so the model is trained to predict the next token given the preceding context.


In [22]:
for i in range(4, context_size+4):
    context = enc_text[:i]
    desired = enc_text[i]

    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

Nebula: ---->  The
Nebula: The ---->  Cosmic
Nebula: The Cosmic ---->  Cr
Nebula: The Cosmic Cr ----> ad
Nebula: The Cosmic Crad ----> les
Nebula: The Cosmic Cradles ---->  of
Nebula: The Cosmic Cradles of ---->  Creation
Nebula: The Cosmic Cradles of Creation ----> 



#### 3.4 Creating a Dataset and DataLoader for GPT Training

In this step, we prepare the text data for training by converting it into overlapping input-target sequences using **PyTorch's Dataset and DataLoader classes**.

##### **Key Components**

1. **GPTDataset Class**
   - **Initialization (`__init__`)**:
     - Encodes the raw text into token IDs using the GPT-2 BPE tokenizer.
     - Slides a window over the token IDs and builds an **input chunk** and a **target chunk** at each position:
       - `input_chunk`: A sequence of `max_length` tokens.
       - `target_chunk`: The same window shifted one position to the right, so each target is the token that follows the corresponding input token.
     - Stores all input-target pairs as tensors.
   - **`__len__`**: Returns the total number of input-target pairs.
   - **`__getitem__`**: Retrieves a single input-target pair by index.

2. **`create_dataloader` Function**
   - Wraps the dataset into a **DataLoader** for efficient batching.
   - Parameters:
     - `batch_size`: Number of sequences per batch.
     - `max_length`: Context size for each input sequence.
     - `stride`: Step size between the start positions of consecutive chunks. Chunks overlap when `stride < max_length`.

3. **Example**
   - Creates a DataLoader for the raw text with:
     - `batch_size = 8`
     - `max_length = 4`
     - `stride = 4`
   - Fetches the first batch and prints **inputs** and **targets**.

##### **Purpose**
This process converts raw text into a structured dataset of input-target pairs, enabling **efficient batch training for next-token prediction models like GPT**. With `stride` equal to `max_length`, as in the example, consecutive input chunks do not overlap.
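
The effect of `stride` is easiest to see without PyTorch. The sketch below uses the same loop bounds as the `GPTDataset` class on a made-up list of ten token IDs; `sliding_windows` is a hypothetical helper written for this example.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input_chunk, target_chunk) pairs using the same indexing as GPTDataset."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs


# Hypothetical token IDs 0..9 so that positions are easy to follow.
ids = list(range(10))

no_overlap = sliding_windows(ids, max_length=4, stride=4)
overlap = sliding_windows(ids, max_length=4, stride=1)

print(len(no_overlap), no_overlap)
print(len(overlap), overlap[:2])
```

With `stride=4` the two input chunks share no tokens, while `stride=1` yields six heavily overlapping chunks from the same ten tokens. A smaller stride gives more training samples at the cost of more repetition between them.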


In [27]:
import torch
from torch.utils.data import Dataset, DataLoader


class GPTDataset(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

In [28]:
def create_dataloader(txt, batch_size=4, max_length=256,
                      stride=128, shuffle=True, drop_last=True,
                      num_workers=0):

    tokenizer = tiktoken.get_encoding("gpt2")

    dataset = GPTDataset(txt, tokenizer, max_length, stride)

    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

In [29]:
dataloader = create_dataloader(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   45,  1765,  4712,    25],
        [  383, 32011,  3864,   324],
        [  829,   286, 21582,   198],
        [ 6719,  2550,   198,  2215],
        [  345,   804,   379,   262],
        [ 1755,  6766,    11,   345],
        [  766,  4138,   286,  5788],
        [16830,   588, 31133,  1973]])

Targets:
 tensor([[ 1765,  4712,    25,   383],
        [32011,  3864,   324,   829],
        [  286, 21582,   198,  6719],
        [ 2550,   198,  2215,   345],
        [  804,   379,   262,  1755],
        [ 6766,    11,   345,   766],
        [ 4138,   286,  5788, 16830],
        [  588, 31133,  1973,   257]])


### **4. Input Embeddings**

#### 4.1 Word Embeddings

This example demonstrates how to use **`torch.nn.Embedding`** to map discrete token indices into dense vector representations.




In [30]:
input_ids = torch.tensor([2, 3, 5, 1])

In [31]:
vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

In [32]:
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


In [33]:
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)
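
Comparing the two outputs above shows that the embedding layer is a row lookup: each input ID selects the matching row of the weight matrix. The dependency-free sketch below reproduces this with plain lists, using the weight values printed above, and also shows the equivalent one-hot matrix product. It is an illustration of the idea, not of how PyTorch implements the layer internally.

```python
# Weight matrix copied from the printed embedding_layer.weight (6 tokens x 3 dimensions).
weight = [
    [0.3374, -0.1778, -0.1690],
    [0.9178, 1.5810, 1.3010],
    [1.2753, -0.2010, -0.1606],
    [-0.4015, 0.9666, -1.1481],
    [-1.1589, 0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]
input_ids = [2, 3, 5, 1]

# Embedding lookup: pick row `i` of the weight matrix for each token ID `i`.
looked_up = [weight[i] for i in input_ids]

# Equivalent view: multiply a one-hot vector by the weight matrix.
def one_hot_matmul(token_id):
    one_hot = [1.0 if j == token_id else 0.0 for j in range(len(weight))]
    return [
        sum(one_hot[j] * weight[j][d] for j in range(len(weight)))
        for d in range(len(weight[0]))
    ]

via_one_hot = [one_hot_matmul(i) for i in input_ids]

print(looked_up)
print(via_one_hot == looked_up)
```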


#### 4.2 Positional Embeddings

In [34]:
vocab_size = 50257
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

In [36]:
max_length = 4
dataloader = create_dataloader(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

In [37]:
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[   45,  1765,  4712,    25],
        [  383, 32011,  3864,   324],
        [  829,   286, 21582,   198],
        [ 6719,  2550,   198,  2215],
        [  345,   804,   379,   262],
        [ 1755,  6766,    11,   345],
        [  766,  4138,   286,  5788],
        [16830,   588, 31133,  1973]])

Inputs shape:
 torch.Size([8, 4])


In [38]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


In [39]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

In [40]:
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


In [41]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])
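
The addition in the last cell relies on broadcasting: the `[4, 256]` positional tensor is added to every one of the 8 sequences in the `[8, 4, 256]` token tensor. The sketch below spells that out with nested lists on much smaller, made-up shapes and values, purely to show which indices line up.

```python
# Hypothetical small shapes standing in for [8, 4, 256].
batch, seq_len, dim = 2, 3, 2

# Made-up token embeddings, shape [batch, seq_len, dim]; values encode their own indices.
token_emb = [
    [[float(b * 100 + t * 10 + d) for d in range(dim)] for t in range(seq_len)]
    for b in range(batch)
]

# Made-up positional embeddings, shape [seq_len, dim]; one vector per position.
pos_emb = [[0.5 * (t + 1)] * dim for t in range(seq_len)]

# Broadcasting: the same pos_emb[t] is added at position t of every sequence in the batch.
input_emb = [
    [
        [token_emb[b][t][d] + pos_emb[t][d] for d in range(dim)]
        for t in range(seq_len)
    ]
    for b in range(batch)
]

print(input_emb[0])
print(input_emb[1])
```

Because the positional vector depends only on the position index, the same token receives a different input embedding at different positions, which is how the model can tell them apart.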
