# $$ Text-Data $$


**Our focus is on step 1 of stage 1: implementing the data sampling pipeline.**

### Pretraining Insight:
LLMs: <br/>
✅ Process text `one word` at a time <br/>
✅ Are trained on next-word prediction <br/>
✅ Have millions to billions of parameters <br/>
→ At this scale, they achieve impressive capabilities

### Why Data Preparation Is Needed
Before training an LLM:

✅ The training dataset must be prepared <br/>
Data must be:

- Tokenized
- Vectorized
- Sampled correctly

### What Is an Embedding?
Word embeddings convert words into numerical vectors that preserve semantic meaning.

### Why Embeddings Are Needed
- Neural networks cannot process raw text directly.
- Text is categorical, not numerical.
- Neural networks require:
✅ Continuous-valued numerical vectors.

### Embedding Models
Embeddings can be created using:
- A neural network layer.
- A pretrained embedding model

Different data types require: <br/>
✅ Different embedding models. <br/>
Text ≠ Audio ≠ Video

### Types of Text Embeddings
Besides individual words, embeddings can also be created for:
- Sentences
- Paragraphs
- Entire documents

Sentence and paragraph embeddings are commonly used in Retrieval-Augmented Generation (RAG) systems, which combine:
- Generation (text creation)
- Retrieval (searching external knowledge)

## Word2Vec
**One of the earliest and most popular word embedding methods is `Word2Vec`.**

How it works: <br/>
It trains a neural network to:
- Predict the context from a word.
- Or predict the word from its context.
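These two setups are commonly called skip-gram (word → context) and CBOW (context → word). As a rough illustration of the first, the sketch below builds (center, context) training pairs from a tokenized sentence. The function name `skipgram_pairs` and its `window` parameter are made up for this example and are not part of any Word2Vec library.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbors."""
    pairs = []
    for i, center in enumerate(tokens):
        # Look at up to `window` tokens on each side of the center word
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the quick brown fox".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

A network trained on such pairs has to give similar internal representations to words that share many context words, which is where the clustering effect described below comes from.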

**Main Idea:**

Words that appear in similar contexts tend to have similar meanings.

As a result, when visualized in 2D space:

> Similar words appear close together <br/>
> Related terms form clusters

<div style="text-align: center; margin-top: 20px;">
  <img 
    src="https://raw.githubusercontent.com/salavii/llm-from-scratch/main/images/2D Visualization of Word Embedding Space.png"
    style="width: 750px; border-radius: 10px; display: block; margin-left: auto; margin-right: auto;"
  >

  <p style="font-size: 16px; color: #333; font-weight: bold; margin-top: 10px;">
    This figure visualizes word embeddings projected into a two-dimensional space, where semantically similar words appear closer together. <br/>
    It demonstrates how relationships between concepts like animals, locations, and adjectives emerge in embedding space
  </p>
</div>


<br/>

## Embeddings in LLMs vs Word2Vec
In classical ML, embeddings can be generated using pretrained models such as Word2Vec.

In LLMs, embeddings are part of the model itself and are:
- Learned from scratch
- Updated during training

**Advantage:** LLM embeddings are optimized for:
- The specific task
- The specific dataset

**Real LLM embeddings are high-dimensional (hundreds to thousands of dimensions).**

<div style="text-align: center; margin-top: 20px;">
  <img 
    src="https://raw.githubusercontent.com/salavii/llm-from-scratch/main/images/Text Tokenization to GPT Input Pipeline.png"
    style="width: 750px; border-radius: 10px; display: block; margin-left: auto; margin-right: auto;"
  >

  <p style="font-size: 16px; color: #333; font-weight: bold; margin-top: 10px;">
    This diagram illustrates the full pipeline from raw input text to token IDs and embeddings fed into a GPT-like decoder-only transformer. <br/>
    It shows how tokenization, embedding lookup, and postprocessing connect input text to generated output
  </p>
</div>


## Dataset

In [None]:
# Import the module for making HTTP requests and downloading files
import urllib.request
url = ("https://raw.githubusercontent.com/salavii/llm-from-scratch/refs/heads/main/data/the-verdict.txt")
file_path = "the-verdict.txt"

urllib.request.urlretrieve(url, file_path) # Download the file from the given URL and store it as "the-verdict.txt"


In [1]:
# Open the text file in read mode using UTF-8 encoding
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
print("Total number of character:", len(raw_text))
print(raw_text[:99])

Total number of character: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


### Goal:
Tokenize a 20,479-character short story into:
- Individual words
- Special characters

So that these tokens can later be converted into `embeddings` for LLM training.

### How Do We Split the Text into Tokens?
To illustrate the basic idea of splitting text into tokens, we:
- Use Python’s **re (regular expression)** library.

Apply re.split to:
- Split text based on whitespace

This is only for demonstration.
Later, we will switch to a prebuilt tokenizer, and no regex knowledge will be required.

In [2]:
import re
text = "Hello, world. This, is a test."

result = re.split(r'(\s)', text)        # Split the text based on whitespace characters
                                        # Parentheses ensure that the spaces are also included in the output
print(result)


['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


### Limitations of Simple Whitespace Tokenization
A simple whitespace-based tokenization:

>Separates most words correctly ✅ <br/>
>But still leaves punctuation attached to words ❌

### We do not convert all text to lowercase because:
Capitalization helps:
- Distinguish *proper nouns* from *common nouns*.
- Understand sentence structure.
- Learn correct text generation with proper capitalization.

### Improved Tokenization Strategy

To improve tokenization, we split the text based on:
- Whitespace → \s
- Commas and periods → [,.]

In [3]:
result = re.split(r'([,.]|\s)', text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


A small remaining problem is that the list still includes whitespace characters. <br/>
Optionally, we can remove these redundant characters safely as follows:

In [4]:
result = [item for item in result if item.strip()]  # Remove whitespace-only tokens from the token list
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


### Handling Whitespace in Tokenization
When building a simple tokenizer, whether to keep or remove whitespace depends on the **application**.

#### Removing whitespaces:
- Reduces memory usage.
- Improves computational efficiency.

#### Keeping whitespaces:
- Preserves exact text structure.
- Is important for:
  > Programming languages <br/>
  > Indentation-sensitive data

In this project:

✅ Whitespaces are removed for simplicity. <br/>
✅ A later tokenizer will include whitespaces again.
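To make the trade-off concrete, here is a small sketch that tokenizes a two-line Python snippet with and without whitespace tokens. The snippet stored in `code` is a made-up example; the split pattern is the same whitespace pattern used earlier.

```python
import re

code = "if x:\n    return 1"

# Same split pattern as the simple tokenizer above: whitespace is captured as separate items
pieces = re.split(r'(\s)', code)

# Variant 1: drop whitespace-only (and empty) items
without_ws = [p for p in pieces if p.strip()]

# Variant 2: keep whitespace items, drop only the empty strings produced by re.split
with_ws = [p for p in pieces if p != ""]

print(without_ws)
print("".join(with_ws) == code)      # the original text can be rebuilt exactly
print(" ".join(without_ws) == code)  # the newline and indentation are lost
```

With whitespace removed, the token list is shorter, but the line break and the indentation that give the snippet its meaning cannot be recovered.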

### Extending the Tokenization Scheme
The current tokenizer works well for simple examples. <br/>
However, real-world text also contains:
- Question marks ?
- Quotation marks "
- Double dashes --
- Other special characters

Therefore:

**The tokenization rules must be extended to correctly separate these symbols as well**

In [5]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


Now that we have a basic tokenizer working, let’s apply it to Edith Wharton’s entire
short story:

In [6]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))

4690


# Converting tokens into token IDs
After tokenization, tokens are still represented as:
> Python strings (text)

However, LLMs: <br/>
Operate only on `numerical values.`

Therefore:<br/>
- Each token must be converted into a unique integer.
- This integer is called a **Token ID.**

This step is an *intermediate stage* before converting token IDs into:

**Embedding vectors**

### Building the Vocabulary
To map tokens to token IDs, we must first build a **vocabulary.**

A vocabulary:
- Contains all unique tokens
- Assigns one unique integer to each word and special character.

<br/>

Let’s create a list of all unique tokens and sort them
alphabetically to determine the vocabulary size:

In [7]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(f'vocabulary size is: {vocab_size}')

vocabulary size is: 1130


Now, we create the vocabulary and print its first 51 entries for illustration purposes.

In [8]:
# Build the vocabulary by assigning a unique integer ID to each token
vocab = {token:integer for integer,token in enumerate(all_words)}

for i, item in enumerate(vocab.items()):      # Loop through the vocabulary items with an index counter
    print(item)
    if i >= 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)


## Converting Tokens to Token IDs using Vocabulary
After tokenizing a new text, each token is mapped to a unique integer using an existing vocabulary. <br/>
This vocabulary is built once from the full training dataset and is reused for all future text. <br/>
As a result, any new sentence can be converted into a sequence of token IDs that the model can process. <br/>
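As a minimal sketch of this "build once, reuse later" idea, the example below builds a vocabulary from a tiny stand-in training text and then reuses it to convert a new sentence into token IDs. The names `toy_training_text`, `tokenize`, and `toy_vocab` are illustrative only; the split pattern is the one used for the short story above.

```python
import re

# Toy "training" text; in the notebook the vocabulary comes from the full short story
toy_training_text = "the cat sat. the dog sat."
pattern = r'([,.:;?_!"()\']|--|\s)'


def tokenize(text):
    """Split on punctuation and whitespace, dropping empty and whitespace-only items."""
    return [t.strip() for t in re.split(pattern, text) if t.strip()]


# Build the vocabulary once from the training text
toy_vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokenize(toy_training_text))))}

# Reuse it for a new sentence that only contains known tokens
new_ids = [toy_vocab[tok] for tok in tokenize("the dog sat.")]

print(toy_vocab)
print(new_ids)
```

A sentence containing a token that is missing from `toy_vocab` would raise a `KeyError` in the last lookup.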

## Encoding and Decoding with a Tokenizer
When an LLM processes text, it operates on token IDs, not `raw strings`. <br/>
We therefore need two directions of mapping:
- **Encoding**
  - Text → Tokens → Token IDs
  - Implemented by an `encode` method
  - Uses a vocabulary that maps each token (string) to a unique integer ID

- **Decoding**
  - Token IDs → Tokens → Text
  - Implemented by a `decode` method
  - Uses an inverse vocabulary that maps each integer ID back to its corresponding token

By combining `encode` and `decode`, the tokenizer:
- Converts human-readable text into numerical input for the model
- Converts the model’s numerical output back into human-readable text

In [9]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab                              # token string -> token ID
        self.int_to_str = {i: s for s, i in vocab.items()}   # token ID -> token string

    def encode(self, text):
        # Use the same split pattern that was used to build the vocabulary
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space that join() inserted before punctuation
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text
    

In [10]:
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know," 
       Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


Next, let’s see whether we can turn these token IDs back into text using the decode
method:

In [11]:
print(tokenizer.decode(ids))

" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


So far, so good.<br/>
Let’s now apply it to a new text sample not contained in the training set:

In [12]:
text = "Hello, do you like tea?"
print(tokenizer.encode(text))

KeyError: 'Hello'

The problem is that the word `Hello` does not appear in the short story “The Verdict”. <br/>
Hence, it is not contained in the vocabulary. This highlights the need for
*large* and *diverse* training sets to extend the vocabulary when working on LLMs.

## Adding special context tokens
In real-world scenarios, a tokenizer must be able to handle:
- `Unknown words` that are not present in the vocabulary.
- `Special context` tokens that provide additional information to the model.

To address this, we extend the vocabulary and tokenizer to include two special tokens:
- **<|unk|>** → Represents unknown tokens (out-of-vocabulary words)
- **<|endoftext|>** → Marks the end of a document or text sequence

We implement these changes in a new tokenizer version called SimpleTokenizerV2, which:
> Maps any unseen word to <|unk|> <br/>
> Maps <|endoftext|>, when it appears in the text, to its own token ID

These special tokens help the LLM:
- Deal robustly with words it has not encountered during training
- Better understand document boundaries and context

## Using <|endoftext|> to Separate Independent Text Sources

When training GPT-like large language models on multiple independent documents or texts, all sources are usually concatenated into a single long training sequence. To prevent the model from confusing unrelated texts, a special token called `<|endoftext|>` is inserted between consecutive texts.

This token acts as a clear boundary marker that signals the end of one document and the start of another.
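As a small illustration, the sketch below joins three made-up documents into one training text with the boundary marker between them. The document strings and the variable names are assumptions for this example only.

```python
# Three hypothetical, unrelated documents
docs = [
    "First document. It has two sentences.",
    "Second, unrelated document.",
    "Third document.",
]

separator = " <|endoftext|> "

# Concatenate all documents into one long training text
joined_text = separator.join(docs)
print(joined_text)

# The marker appears once between each pair of neighboring documents
print(joined_text.count("<|endoftext|>"))

# Splitting on the marker recovers the original documents
print(joined_text.split(separator) == docs)
```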

<br/>

Let’s now modify the vocabulary to include these two special tokens.

In [13]:
# Remove duplicate tokens, convert to list, and sort alphabetically
all_tokens = sorted(list(set(preprocessed)))

# Add special tokens for unknown words and document separation
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

# Create the vocabulary: map each token to a unique integer ID
vocab = {token:integer for integer,token in enumerate(all_tokens)}
print(len(vocab.items()))

1132


Let’s print the last five entries of the updated vocabulary:

In [14]:
for item in list(vocab.items())[-5:]:    # the last five (token, ID) pairs
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


In [15]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s, i in vocab.items()}

    def encode (self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        preprocessed = [
            item if item in self.str_to_int else "<|unk|>" for item in preprocessed
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
        
    def decode (self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)   
        return text
    

In [16]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


Next, let’s tokenize the sample text using the SimpleTokenizerV2.

In [17]:
tokenizer = SimpleTokenizerV2(vocab)
print(tokenizer.encode(text))

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]


In [18]:
print(tokenizer.decode(tokenizer.encode(text)))

<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.


# Byte pair encoding
To move beyond simple word-level tokenization, we now explore a more advanced tokenization scheme called **Byte Pair Encoding (BPE)**. BPE-based tokenizers were used to train large language models such as `GPT-2`, `GPT-3`, and the original model behind ChatGPT.

Implementing BPE from scratch can be relatively complex, so instead of reimplementing the algorithm, we use the open source Python library `tiktoken`. This library provides a highly efficient BPE tokenizer with a Rust-based backend.

In [19]:
from importlib.metadata import version
import tiktoken

print("tiktoken version:", version("tiktoken"))


tiktoken version: 0.12.0


We can instantiate the BPE tokenizer from tiktoken as follows:

In [20]:
# Load GPT-2 compatible BPE tokenizer
tokenizer = tiktoken.get_encoding("gpt2")
print("Tokenizer Loaded ✅")

Tokenizer Loaded ✅


The usage of this tokenizer is similar to the SimpleTokenizerV2 we implemented via an encode method:

In [21]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces "
    "of someunknownPlace."
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]


We can then convert the token IDs back into text using the decode method, similar to
our SimpleTokenizerV2:

In [22]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.


### Key Observations about BPE Tokenization
1. **Vocabulary Size and `<|endoftext|>` Token**
- The GPT-2 BPE tokenizer has a total vocabulary of **50,257 tokens**.
- The token `<|endoftext|>` is assigned the **largest token ID: 50256**.
- This token is mainly used to mark **the end of a document or text sequence.**

2. **No Need for `<|unk|>` in BPE**
- Unlike simple tokenizers, **BPE does not require an `<|unk|>` (unknown) token.**
- When the tokenizer encounters an unknown word, it:
    - Splits it into **subword units**
    - Or even into **individual characters**
    - This guarantees that **any word can always be represented**

3. **Why This is Important for LLMs** <br/>
The model can handle the following without ever failing due to “unknown vocabulary”:
- New words
- Names
- Technical terms
- Misspellings
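To give a feel for how BPE arrives at subword units, here is a toy, character-level sketch of the merge procedure: repeatedly find the most frequent pair of adjacent symbols and merge it into one symbol. This is not how `tiktoken` is implemented (the GPT-2 tokenizer works on bytes and ships with pretrained merges); the corpus and the helper names `most_frequent_pair`, `merge_pair`, and `apply_merges` are assumptions made for illustration.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


def apply_merges(word, merges):
    """Tokenize a single word by replaying the learned merges in order."""
    state = {tuple(word): 1}
    for pair in merges:
        state = merge_pair(state, pair)
    return list(next(iter(state)))


# Toy corpus: word -> frequency; each word starts as a tuple of characters
corpus = {"low": 5, "lower": 2, "lowest": 3}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(2):  # learn two merges
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(apply_merges("slow", merges))  # a word that never appeared in the corpus
```

The unseen word "slow" is still representable: it falls back to the single character "s" plus the learned subword "low", which mirrors why no `<|unk|>` token is needed.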

### Byte pair encoding of unknown words 
Try the BPE tokenizer from the tiktoken library on the unknown words “Akwirw ier” and
print the individual token IDs. Then, call the decode function on each of the resulting
integers in this list. Lastly, call the
decode method on the token IDs to check whether it can reconstruct the original
input, “Akwirw ier.”

In [23]:
TEXT = "Akwirw ier"
token_ids = tokenizer.encode(TEXT, allowed_special={"<|endoftext|>"})   # named token_ids to avoid shadowing the built-in int
print(token_ids)

[33901, 86, 343, 86, 220, 959]


In [24]:
for token_id in token_ids:
    print(f"{token_id} : {tokenizer.decode([token_id])}")


33901 : Ak
86 : w
343 : ir
86 : w
220 :  
959 : ier


In [25]:
decoded_text = tokenizer.decode(token_ids)   # named decoded_text to avoid shadowing the built-in str
print(decoded_text)

Akwirw ier


# Data Sampling with Sliding Window
During LLM training, the text is converted into `input–target` pairs, where the model sees a sequence of tokens as input and learns to **predict the next token as the target.** <br/>
To generate these training pairs, a **sliding window** approach is used. This moving window scans across the text step by step and continuously creates overlapping training examples.

<div style="text-align: center; margin-top: 20px;">
  <img 
    src="https://raw.githubusercontent.com/salavii/llm-from-scratch/main/images/sliding window.png"
    style="width: 750px; border-radius: 10px; display: block; margin-left: auto; margin-right: auto;"
  >

  <p style="font-size: 16px; color: #333; font-weight: bold; margin-top: 10px;">
   This figure shows how Large Language Models (LLMs) are trained to predict one word at a time using a next-token prediction  task.
  </p>
</div>


In [26]:
enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5145


Next, we remove the first 50 tokens from the dataset for demonstration purposes,
as it results in a slightly more interesting text passage in the next steps:

In [27]:
enc_sample = enc_text[50:]

### Creating Input–Target Pairs (x and y)
A simple and intuitive way to build training data for next-token prediction is to create two variables:

- x → the input token sequence
- y → the target token sequence (x shifted by one position)

In [28]:
context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


In [29]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, "----->", desired)
    

[290] -----> 4920
[290, 4920] -----> 2241
[290, 4920, 2241] -----> 287
[290, 4920, 2241, 287] -----> 257


Let’s repeat the previous code but convert the token IDs
into text:

In [30]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(tokenizer.decode(context), "----->", tokenizer.decode([desired]))

 and ----->  established
 and established ----->  himself
 and established himself ----->  in
 and established himself in ----->  a


## Preparing Tensors with a Data Loader
At this stage, we have already constructed the **input–target** pairs required for LLM training. <br/>
Before converting tokens into embeddings, one final step remains: implementing an **efficient data loader**.

The `data loader` iterates over the tokenized dataset and returns:
- an **input tensor** containing the token sequences seen by the LLM,
- a **target tensor** containing the next-token labels that the model must predict.

These tensors are returned in the form of **PyTorch tensors**, which are multidimensional numerical arrays used for deep learning computations.

Although tokens are shown as text strings in illustrations, the actual implementation operates **directly on token IDs**, since the BPE tokenizer’s `encode` method performs both tokenization and ID conversion in a single step.

In [31]:
import torch
from torch.utils.data import Dataset, DataLoader

In [32]:
class GPTDatasetV1(Dataset):
    def __init__(self, text, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenizes the entire text
        token_ids = tokenizer.encode(text)

        # Uses a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]

            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    # Returns the total number of rows in the dataset
    def __len__(self):
        return len(self.input_ids)

    # Returns a single row from the dataset
    def __getitem__(self, idx):        
        return self.input_ids[idx], self.target_ids[idx]

    
    

#### This dataset is later combined with a PyTorch DataLoader, which:
- Groups samples into batches
- Enables efficient iteration during training
- Improves performance and memory usage
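The batching behaviour can be pictured with a plain-Python sketch. This is not how PyTorch's `DataLoader` is implemented; the helper `make_batches` is a hypothetical stand-in that only mimics the effect of the `batch_size` and `drop_last` arguments on an unshuffled dataset.

```python
def make_batches(samples, batch_size, drop_last=True):
    """Group consecutive samples into lists of length `batch_size`."""
    batches = []
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        if len(batch) < batch_size and drop_last:
            continue  # skip the incomplete final batch
        batches.append(batch)
    return batches


samples = list(range(10))  # stand-in for 10 dataset rows

full_only = make_batches(samples, batch_size=4, drop_last=True)
with_remainder = make_batches(samples, batch_size=4, drop_last=False)

print(full_only)
print(with_remainder)
```

Dropping the incomplete last batch keeps every batch the same size, which avoids an unusually small final batch during training.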

In [33]:
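def CreateDataLoader_V1(text, batch_size=4, max_length=256, stride=128,
                        shuffle=True, drop_last=True, num_workers=0):

    # Initialize the GPT-2 BPE tokenizer instead of relying on a global variable
    tokenizer = tiktoken.get_encoding("gpt2")

    # Build the sliding-window dataset (it tokenizes the text internally)
    dataset = GPTDatasetV1(text, tokenizer, max_length, stride)
    dataloader = DataLoader(dataset,
                            batch_size=batch_size,
                            shuffle=shuffle,
                            drop_last=drop_last,      # drop the last batch if it is smaller than batch_size
                            num_workers=num_workers
                           )
    return dataloader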
    

In [34]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

dataloader = CreateDataLoader_V1( raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)
data_iter = iter(dataloader)     
first_batch = next(data_iter)
print(first_batch)


[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


In [35]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


**Exercise 2.2 Data loaders with different strides and context sizes.** <br/>
To develop more intuition for how the data loader works, try to run it with different
settings such as `max_length=2 and stride=2`, and `max_length=8 and stride=2`.

In [36]:
dataloader_E = CreateDataLoader_V1(raw_text, batch_size=1, max_length= 2, stride=2, shuffle= False)
data_iter_E = iter(dataloader_E)
first_batch_E = next(data_iter_E)
print(first_batch_E)

[tensor([[ 40, 367]]), tensor([[ 367, 2885]])]


<div style="text-align: center; margin-top: 20px;">
  <img 
    src="https://raw.githubusercontent.com/salavii/llm-from-scratch/main/images/stride.png"
    style="width: 650px; border-radius: 10px; display: block; margin-left: auto; margin-right: auto;"
  >
</div>


Let’s look briefly at how we can use the data loader to sample with a batch size
greater than 1:

In [37]:
dataloader = CreateDataLoader_V1(raw_text, batch_size=8, max_length=4, stride=4, shuffle= False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print(f"input is:\n {inputs} \n\n target is:\n {targets}")

input is:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]]) 

 target is:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])


Setting the stride to `4`, which equals the context length (`max_length=4`), lets us use the dataset fully without skipping any tokens between windows.

Because the stride equals the context length, the extracted sequences do not overlap.
This eliminates redundant training samples and reduces the risk of overfitting, since the model does not repeatedly see nearly identical input windows.
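The effect of the stride can be checked on a toy sequence without PyTorch. The helper `sliding_windows` below is a hypothetical stand-in that uses the same loop bounds as `GPTDatasetV1`, and `toy_ids` replaces real token IDs.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; the target is the input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


toy_ids = list(range(10))  # stand-in for real token IDs

overlapping = sliding_windows(toy_ids, max_length=4, stride=1)
disjoint = sliding_windows(toy_ids, max_length=4, stride=4)

print(len(overlapping), overlapping[:2])
print(len(disjoint), disjoint)
```

With `stride=1`, neighboring input windows share three of their four tokens; with `stride=4`, each token appears in at most one input window, at the cost of fewer training samples.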

# Creating token embeddings
To train an LLM, token IDs must be converted into continuous numerical vectors called **embeddings.** <br/>
Neural networks cannot learn semantic meaning from integers alone, so an embedding layer maps each token ID to a dense vector.

### Why embeddings are needed:
- Token IDs (e.g., `2`, `15`, `501`) carry no meaning by themselves.
- Embeddings transform these IDs into vectors that capture relationships between words.
- During training, **backpropagation** updates these vectors so the model learns semantics (e.g., cat and dog become closer in vector space).

Let’s see how the token ID to embedding vector conversion works with a hands-on
example. Suppose we have the following four input tokens with IDs 2, 3, 5, and 1:

In [38]:
input_ids = torch.tensor([2, 3, 5, 1])

For simplicity, suppose we have a small vocabulary of only 6 words (instead
of the 50,257 tokens in the BPE tokenizer vocabulary), and we want to create embeddings
of size 3 (in GPT-3, the embedding size is 12,288 dimensions):

In [39]:
vocab_size = 6
output_dim = 3

Using the `vocab_size` and `output_dim`, **we can instantiate an embedding layer in
PyTorch, setting the random seed to 123 for reproducibility purposes:**

In [40]:
torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


When we create an embedding layer in PyTorch, it generates a weight matrix that stores the embedding vectors for all tokens.

Structure of the embedding matrix:
- Shape: [vocab_size, embedding_dim]
- One row per token ID
- One column per embedding dimension
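Because there is one row per token ID, an embedding lookup amounts to selecting a row of the weight matrix. The plain-Python sketch below uses a made-up 4 × 2 matrix (`weights`) and two hypothetical helpers to show that row selection gives the same result as multiplying a one-hot vector with the matrix.

```python
# Hypothetical 4 x 2 embedding matrix: 4 token IDs, 2 dimensions each
weights = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8],
]


def lookup(token_id):
    """Embedding lookup: pick the row with index `token_id`."""
    return weights[token_id]


def one_hot_matmul(token_id):
    """Same result via a one-hot vector multiplied with the weight matrix."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(len(weights))]
    dim = len(weights[0])
    return [
        sum(one_hot[i] * weights[i][d] for i in range(len(weights)))
        for d in range(dim)
    ]


print(lookup(2))
print(one_hot_matmul(2))
```

The lookup form is preferred in practice because it avoids building and multiplying large one-hot vectors.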

In [41]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


In [42]:
print(embedding_layer(input_ids))


tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


## Next Step: Adding Positional Information

While embeddings capture the meaning of tokens,
**they do not encode the order of tokens in the sequence.**

To allow the model to understand sentence structure,
we must add **positional encodings** to the embeddings.

This enables the LLM to distinguish between differently ordered sequences.

## Encoding word positions
### Why Embeddings Alone Are Not Enough

While token embeddings provide a continuous vector representation of each token,
they do **not** encode the position of tokens in the sequence.

Self-attention does not inherently understand the order of tokens.
A token ID always maps to the same embedding vector, regardless of whether it
appears at the beginning, middle, or end of a sentence.

**Result:**
The model cannot distinguish between sequences with the same words but different order.

To fix this, we need **positional** encodings, which provide information about
the position of each token in the input sequence.
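A toy example can make this concrete. In the sketch below, the 2-dimensional vectors in `token_emb` and `pos_emb` are invented for illustration. Without position vectors, two sentences with the same words in a different order produce exactly the same collection of vectors; once a position vector is added, they no longer do.

```python
# Hypothetical 2-dimensional token and position embeddings
token_emb = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}
pos_emb = [[0.5, 0.0], [0.0, 0.5], [0.25, 0.25]]


def embed(tokens, use_positions):
    """Look up each token's vector and optionally add the vector for its position."""
    vectors = []
    for pos, tok in enumerate(tokens):
        vec = list(token_emb[tok])
        if use_positions:
            vec = [t + p for t, p in zip(vec, pos_emb[pos])]
        vectors.append(vec)
    return vectors


a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

# Compare the sentences as unordered collections of vectors
same_without_pos = sorted(embed(a, False)) == sorted(embed(b, False))
same_with_pos = sorted(embed(a, True)) == sorted(embed(b, True))

print(same_without_pos)
print(same_with_pos)
```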

## Relative vs. Absolute Positional Embeddings
LLMs need positional information because self-attention does not inherently understand token order.

#### Relative Positional Embeddings
These embeddings focus on **distances between tokens** (“how far apart”) rather than absolute positions. <br/>
They help the model generalize better to sequences of different lengths, even ones unseen during training.

#### Absolute Positional Embeddings
Each position (0, 1, 2, …) has its own learnable embedding vector.<br/>
`GPT models` use this type and optimize these embeddings during training.

#### Why this matters
Both methods enrich the LLM with knowledge of ordering and token relationships, enabling more coherent and context-aware predictions.

### Increasing the Embedding Size
For realistic LLM inputs, token IDs are mapped into high-dimensional embedding vectors.
Here, we use 256 dimensions—much smaller than GPT-3’s 12,288, but sufficient for experimentation.

The BPE tokenizer used earlier has a vocabulary size of 50,257, which determines the number of rows in the embedding matrix.

In [44]:
vocab_size = 50257
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

In [49]:
max_length = 4
dataloader = CreateDataLoader_V1(
                                raw_text, batch_size= 8,
                                max_length= max_length, 
                                stride= max_length, shuffle= False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print (f"Token IDs: \n {inputs} \n\n Inputs shape: \n {inputs.shape}")

Token IDs: 
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]]) 

 Inputs shape: 
 torch.Size([8, 4])


Let’s now use the embedding layer to embed these token IDs into 256-dimensional
vectors:

In [51]:
token_embeddings = token_embedding_layer(inputs)
print (token_embeddings.shape)

torch.Size([8, 4, 256])


The 8 × 4 × 256–dimensional tensor output shows that each token ID is now embedded
as a 256-dimensional vector.
For a GPT model’s absolute embedding approach, we just need to create another
embedding layer that has the same embedding dimension as the `token_embedding_layer`:

In [57]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
# torch.arange(context_length) yields the position indices 0, 1, ..., context_length - 1
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


In [58]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])
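The addition above works even though the two tensors have different shapes, because the `[4, 256]` positional tensor is broadcast across the batch dimension. The sketch below reproduces that behaviour with nested Python lists on much smaller, made-up shapes (batch of 2, 3 tokens, 2 dimensions); the helper `shape` and the values in `tok` and `pos` are assumptions for illustration.

```python
def shape(nested):
    """Return the shape of a regularly nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return dims


# Hypothetical token embeddings: 2 sequences, 3 tokens each, 2 dimensions
tok = [
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    [[4.0, 4.0], [5.0, 5.0], [6.0, 6.0]],
]
# Hypothetical positional embeddings: one vector per position
pos = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]

# Add the vector for position p to the token at position p in every sequence
summed = [
    [[t + p for t, p in zip(tok_vec, pos_vec)] for tok_vec, pos_vec in zip(seq, pos)]
    for seq in tok
]

print(shape(tok), shape(pos), shape(summed))
print(summed[0][0], summed[1][0])
```

The first token of every sequence receives the same position vector, so the result keeps the shape of the token embeddings.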
