# Chapter 2: Working with text data

<div style="max-width:800px">
    
![](images/2.0.png)

</div>

## 2.1 Understand word embeddings

<div style="max-width:800px">
    
![](images/2.1.png)

</div>

## 2.2 Tokenizing text
* Splitting the input text into individual tokens, a required step for creating embeddings for an LLM
* These tokens can be individual words or special characters, including punctuation characters

In [1]:
import urllib.request

url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/refs/heads/main/ch02/01_main-chapter-code/the-verdict.txt"
file_path = "the-verdict.txt"

urllib.request.urlretrieve(url, file_path)

('the-verdict.txt', <http.client.HTTPMessage at 0x1077967e0>)

In [2]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print(len(raw_text))
print(raw_text[:99])

20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


In [3]:
import re
text = "Hello, world. This, is a test."
result = re.split(r'([,.]|\s)', text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


In [4]:
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


In [5]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [6]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))
print(preprocessed[:30])

4690
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


## 2.3 Converting tokens into token IDs
* #### Convert these tokens from strings into an integer representation to produce token IDs. This is an intermediate step before converting token IDs into embedding vectors

#### To get token IDs:
#### 1. Build a vocabulary from these previously generated tokens - Each unique token is added to the vocabulary in alphabetical order

  
#### 2. Each unique token is mapped to a unique integer called token ID


#### For Example:

#### Apple -> 0

#### Axle -> 1

#### .

#### .

#### .

#### Zebra -> 10,000


<div style="max-width:1000px">
    
![](images/2.3.png)

</div>

In [7]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)

1130


In [8]:
vocab =  {token:index for index, token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)
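
#### The listing above begins with the punctuation marks and capitalized words because Python's sorted() compares strings by Unicode code point, so these punctuation marks and uppercase letters come before lowercase letters. A minimal sketch with a hypothetical toy token list (not taken from the-verdict.txt):

```python
# Toy token list; these tokens are illustrative only.
toy_tokens = ["zebra", "Apple", "axle", "!", "Axle", "apple", "!"]

# Deduplicate, sort by code point, then number the tokens starting from 0.
toy_vocab = {token: idx for idx, token in enumerate(sorted(set(toy_tokens)))}

for token, idx in toy_vocab.items():
    print(f"{token!r} -> {idx}")
```

#### Here "!" receives ID 0, the capitalized words follow, and the lowercase words come last, which mirrors the order of the real vocabulary.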


In [9]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [10]:
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know," Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


In [11]:
print(tokenizer.decode(ids))

" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.

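#### The decoded text is close to the original but not identical: it starts with " It' s rather than "It's. decode joins all tokens with single spaces and then only removes whitespace that comes before a punctuation character, so the space that follows an opening quote or an apostrophe is kept. A minimal sketch that applies the two regular expressions from SimpleTokenizerV1 to strings directly, without a vocabulary (the helper names split_tokens and join_tokens are hypothetical):

```python
import re

def split_tokens(text):
    # Same splitting rule as SimpleTokenizerV1.encode, minus the ID lookup.
    parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [part.strip() for part in parts if part.strip()]

def join_tokens(tokens):
    # Same joining rule as SimpleTokenizerV1.decode, minus the ID lookup.
    text = " ".join(tokens)
    return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

original = '"It\'s done," he said.'
tokens = split_tokens(original)
restored = join_tokens(tokens)

print(tokens)
print(restored)
print(restored == original)
```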

In [12]:
text = "Hello, do you like tea?"
print(tokenizer.encode(text))

KeyError: 'Hello'

## 2.4 Adding special context tokens
* #### Add an "<|unk|>" token to the vocabulary so the tokenizer can represent words that are not in the vocabulary
* #### Add an "<|endoftext|>" token to the vocabulary to mark a document boundary. When training an LLM on independent documents, it is common to insert this token before each document that follows a previous one, to signify to the LLM that these text sources are unrelated even though they are concatenated for training

![](images/2.4.png)

In [13]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token:integer for integer, token in enumerate(all_tokens)}
print(len(vocab.items()))

1132


In [14]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


In [15]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [16]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))
print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [17]:
tokenizer = SimpleTokenizerV2(vocab)
print(tokenizer.encode(text))

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]


In [18]:
print(tokenizer.decode(tokenizer.encode(text)))

<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.

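#### Note that the round trip through <|unk|> is lossy: every out-of-vocabulary word maps to the same ID, so decode cannot recover which word was there. A minimal sketch with a hypothetical five-entry vocabulary (not the 1,132-entry vocab built above):

```python
# Hypothetical toy vocabulary; the IDs are illustrative only.
toy_vocab = {"do": 0, "like": 1, "tea": 2, "you": 3, "<|unk|>": 4}
id_to_token = {idx: token for token, idx in toy_vocab.items()}

def encode(words):
    # Fall back to the <|unk|> ID for any word missing from the vocabulary.
    return [toy_vocab.get(word, toy_vocab["<|unk|>"]) for word in words]

def decode(ids):
    return " ".join(id_to_token[i] for i in ids)

ids_a = encode(["Hello", "you", "like", "tea"])
ids_b = encode(["Goodbye", "you", "like", "tea"])

print(ids_a)          # both sentences produce the same ID sequence
print(ids_b)
print(decode(ids_a))  # the original first word cannot be recovered
```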

## 2.5 Byte pair encoding


#### BPE tokenizer was used to train GPT-2, GPT-3 and the original model used in ChatGPT

#### Doesn't use <|unk|> token - then how does it deal with unknown words?

#### The algorithm underlying BPE breaks down words that aren't in its predefined vocab into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words. So, if BPE encounters an unknown word, it can represent it as a sequence of subword tokens or characters

<div style="max-width:1000px">
    
![](images/2.5_1.png)

</div>


#### The ability to break down unknown words into individual characters ensures that the tokenizer and, consequently, the LLM that is trained with it can process any text, even if it contains words that were not present in its training data.
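
#### To make the subword idea concrete, here is a minimal character-level sketch of BPE. It is not the GPT-2 tokenizer, which works on bytes and ships with a fixed, pretrained merge table. The toy corpus, the number of merges, and the helper names merge_pair, learn_merges, and bpe_encode are all assumptions made for illustration. The sketch repeatedly merges the most frequent adjacent pair of symbols, then reuses the learned merges to split a word that never appeared in the corpus:

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_merges(words, num_merges):
    """Learn up to `num_merges` merge rules, starting from single characters."""
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[tuple(merge_pair(list(symbols), best))] += freq
        corpus = new_corpus
    return merges

def bpe_encode(word, merges):
    """Split `word` into characters, then apply the merges in learned order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

toy_corpus = ["low", "low", "lower", "newest", "newest", "widest"]
merges = learn_merges(toy_corpus, num_merges=6)

pieces = bpe_encode("lowest", merges)  # "lowest" is not in the toy corpus
print(merges)
print(pieces)
```

#### With this toy corpus the unseen word "lowest" comes out as the two pieces "low" and "est", both learned from other words. A word made of characters with no learned merges would simply stay split into single characters, which is why no <|unk|> token is needed.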


In [19]:
import tiktoken

In [20]:
tokenizer = tiktoken.get_encoding("gpt2")
text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someUnknownPlace."
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 20035, 27271, 13]


In [21]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of someUnknownPlace.


## 2.6 Data sampling with a sliding window

#### Introduce the Dataset and DataLoader classes from PyTorch
* #### Dataset -> defines how each individual record in the dataset is loaded and structured
* #### DataLoader -> decides how the data is shuffled and assembled into batches

#### batch_size -> how many records in a batch

#### max_length -> how many tokens in a record

#### stride -> how many indices to slide forward (imagine a sliding window)

#### drop_last -> drops the last batch if it is shorter than the specified batch size

#### num_workers -> number of CPU processes used for preprocessing
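
#### A minimal plain-Python sketch of how these parameters interact, before bringing in PyTorch. The helper names make_records and make_batches and the token IDs are hypothetical; the real DataLoader additionally handles shuffling and worker processes:

```python
def make_records(token_ids, max_length, stride):
    """Slide a window of max_length tokens forward by stride each step."""
    # Stop at len - max_length so one more token remains after every window;
    # the shifted targets introduced later in this section need it.
    return [
        token_ids[start:start + max_length]
        for start in range(0, len(token_ids) - max_length, stride)
    ]

def make_batches(records, batch_size, drop_last):
    """Group records into batches; optionally drop a short final batch."""
    batches = [records[i:i + batch_size] for i in range(0, len(records), batch_size)]
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches

token_ids = list(range(100, 111))  # 11 made-up token IDs: 100 ... 110
records = make_records(token_ids, max_length=4, stride=2)

print(records)
print(len(make_batches(records, batch_size=3, drop_last=True)))
print(len(make_batches(records, batch_size=3, drop_last=False)))
```

#### The 11 IDs yield four records of four tokens each. With batch_size=3 the fourth record forms a short batch on its own, which drop_last=True discards.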

<h4>
The next step is to generate input-target pairs for training the LLM.
</h4>

![](images/2.6_1.png)


In [22]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5145


In [23]:
enc_sample = enc_text[50:]

#### Generating the input-target pairs -> LLMs are pretrained by trying to predict the next word in a text.

#### Create two variables, x and y, where x contains the input tokens and y contains the targets (which are the inputs shifted by +1)

In [24]:
context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1: context_size + 1]

print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


In [25]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, "---->", desired)
    

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


In [26]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))
    

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


#### We've now created the input-target pairs we can use for LLM training.

#### The last thing we must do before we turn these tokens into embeddings is implement an efficient data loader that iterates over the input dataset and returns the inputs and targets as PyTorch tensors. We want two tensors in particular: an input tensor containing the text that the LLM sees, and a target tensor containing the targets for the LLM to predict.

<div style="max-width:800px">
    
![](images/2.6_2.png)

</div>

## ---------------------------------------------

#### Explaining Dataset and DataLoader classes from PyTorch

In [28]:
import torch
X_train = torch.randn(5, 2)
y_train = torch.tensor([0, 0, 0, 1, 1])
X_test = torch.randn(2, 2)
y_test = torch.tensor([0, 1])

print(X_train)
print(y_train)
print(X_test)
print(y_test)

tensor([[-1.5605, -0.4598],
        [-0.8939, -2.0963],
        [ 0.1645,  0.2967],
        [ 0.7719, -0.1005],
        [ 2.6913,  0.8113]])
tensor([0, 0, 0, 1, 1])
tensor([[ 0.3835,  1.2684],
        [-1.9615,  1.1403]])
tensor([0, 1])


In [29]:
# Dataset class instantiates objects that define how each data record is loaded

from torch.utils.data import Dataset

class ToyDataset(Dataset):            
    def __init__(self, X, y):        # set up attributes that we can access later in the __getitem__ and __len__ methods (filepaths, file objects, database connectors, etc)
        self.features = X
        self.labels = y

    def __getitem__(self, index):    # define instructions for returning exactly one item from the dataset via an index
        one_x = self.features[index]
        one_y = self.labels[index]
        return one_x, one_y
        
    def __len__(self):               # contains instructions for retrieving the length of the dataset 
        return self.labels.shape[0]

train_ds = ToyDataset(X_train, y_train)
test_ds = ToyDataset(X_test, y_test)

In [30]:
# DataLoader class handles how the data is shuffled and assembled into batches

from torch.utils.data import DataLoader

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=0,
    drop_last=True
)

test_loader = DataLoader(
    dataset=test_ds,
    batch_size=2,
    shuffle=False,
    num_workers=0
)

In [31]:
for idx, (x, y) in enumerate(train_loader):
    print(f"Batch {idx + 1}:", x, y)

Batch 1: tensor([[-0.8939, -2.0963],
        [ 0.1645,  0.2967]]) tensor([0, 0])
Batch 2: tensor([[ 2.6913,  0.8113],
        [ 0.7719, -0.1005]]) tensor([1, 1])


## ---------------------------------------------

In [32]:
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt)
        for i in range(0, len(token_ids) - max_length, stride): # stop at len(token_ids) - max_length so that input_chunk and target_chunk (shifted by one) always have the same size
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

In [33]:
def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )
    return dataloader

In [34]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

dataloader = create_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


In [35]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


#### The first batch contains two tensors: the first tensor stores the input token IDs, and the second tensor stores the target token IDs. Since max_length is set to 4, each of the two tensors contains four token IDs.

#### To understand the meaning of stride=1, take a look at the second batch. If we compare the first and second batches, we notice that the second batch's token IDs are shifted by one position (for example, the second ID in the first batch's input is 367, which is the first ID of the second batch's input). The stride setting dictates the number of positions the inputs shift across batches.

<div style="max-width:800px">
    
![](images/2.6_3.png)

</div>

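#### To see the effect of stride without PyTorch, the sketch below applies the same windowing loop as GPTDatasetV1 to the first nine token IDs of the-verdict.txt (copied from the loader outputs in this chapter). The helper name window_pairs is hypothetical. With stride=1 consecutive inputs share three token IDs; with stride equal to max_length they share none:

```python
def window_pairs(token_ids, max_length, stride):
    """Return (input, target) windows; targets are the inputs shifted by one."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

token_ids = [40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899]

for stride in (1, 4):
    pairs = window_pairs(token_ids, max_length=4, stride=stride)
    first_input, second_input = pairs[0][0], pairs[1][0]
    shared = [t for t in first_input if t in second_input]
    print(f"stride={stride}: {first_input} then {second_input}, shared IDs: {shared}")
```

#### Setting stride equal to max_length uses every token as an input once, which avoids overlap between records; more overlap can increase overfitting.
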
## 2.7 Creating token embeddings

#### The last step in preparing the input text for LLM training is to convert the token IDs into embedding vectors, as shown below.

<div style="max-width:800px">
    
![](images/2.7_1.png)

</div>

#### We initialize these embedding weights with random values. This serves as the starting point for the LLM's learning process. 

#### Let's demonstrate how the conversion from token ID to embedding vector works

#### Suppose we have the following four input tokens with IDs 2, 3, 5, and 1

In [41]:
input_ids = torch.tensor([2, 3, 5, 1])

#### Suppose we have a small vocabulary of only 6 tokens (instead of the 50,257 tokens in the GPT-2 BPE tokenizer's vocabulary), and we want to create embeddings of size 3:

In [42]:
vocab_size = 6
output_dim = 3

#### Using vocab_size and output_dim, we can instantiate an embedding layer in PyTorch

In [43]:
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)

Parameter containing:
tensor([[-0.9416, -0.6812, -0.4983],
        [-0.2112,  0.3875,  0.3281],
        [-0.5844, -0.4397, -0.5584],
        [ 0.0242, -0.4170, -0.3838],
        [ 1.6351, -1.0261,  0.1873],
        [ 0.5751,  0.5681,  1.7211]], requires_grad=True)


#### The weight matrix of the embedding layer contains small, random values. These values are optimized during LLM training.

#### Moreover, we can see the weight matrix has six rows and three columns. There is one row for each of the six possible tokens in the vocabulary, and there is one column for each of the three embedding dimensions.

#### Now, let's apply it to a token ID to obtain the embedding vector:

In [44]:
print(embedding_layer(torch.tensor([3])))

tensor([[ 0.0242, -0.4170, -0.3838]], grad_fn=<EmbeddingBackward0>)


#### You can see that token ID 3 returns the fourth row of the embedding layer's weight matrix as this token's embedding (Python indexing starts at zero, so this is the row with index 3). In other words, the embedding layer is essentially a lookup operation that retrieves rows from the embedding layer's weight matrix via token ID.

#### Let's convert all four input IDs to embeddings:

In [45]:
print(embedding_layer(input_ids))

tensor([[-0.5844, -0.4397, -0.5584],
        [ 0.0242, -0.4170, -0.3838],
        [ 0.5751,  0.5681,  1.7211],
        [-0.2112,  0.3875,  0.3281]], grad_fn=<EmbeddingBackward0>)


#### Each row in this output matrix is obtained via a lookup operation from the embedding weight matrix, as shown below

<div style="max-width:800px">
    
![](images/2.7_2.png)

</div>

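#### One way to convince yourself that the embedding layer is only a lookup: selecting rows by token ID gives the same result as multiplying one-hot vectors by the weight matrix. The sketch below does this with plain Python lists, using the weight values printed above (your values will differ, since they are randomly initialized). The helper names lookup and one_hot_matmul are hypothetical:

```python
# Weight values copied from the printed embedding_layer.weight above.
weights = [
    [-0.9416, -0.6812, -0.4983],
    [-0.2112,  0.3875,  0.3281],
    [-0.5844, -0.4397, -0.5584],
    [ 0.0242, -0.4170, -0.3838],
    [ 1.6351, -1.0261,  0.1873],
    [ 0.5751,  0.5681,  1.7211],
]
input_ids = [2, 3, 5, 1]

def lookup(ids):
    # Row selection: token ID i picks the row with index i.
    return [weights[i] for i in ids]

def one_hot_matmul(ids):
    # Build a one-hot vector per token ID and multiply it by the weight matrix.
    vocab_size, dim = len(weights), len(weights[0])
    rows = []
    for token_id in ids:
        one_hot = [1.0 if j == token_id else 0.0 for j in range(vocab_size)]
        rows.append([
            sum(one_hot[j] * weights[j][d] for j in range(vocab_size))
            for d in range(dim)
        ])
    return rows

print(lookup(input_ids))
print(lookup(input_ids) == one_hot_matmul(input_ids))
```

#### The lookup is simply the cheaper way to compute the same thing, because it skips all the multiplications by zero.
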
#### Next we add a small modification to these embedding vectors to encode positional information about a token within a text.

## 2.8 Encoding word positions

#### The self-attention mechanism doesn't have a notion of position or order for the tokens within a sequence, so it is helpful to inject additional position information into the LLM.

#### To achieve this we can use either of two broad categories of position-aware embeddings: relative positional embeddings and absolute positional embeddings. 

#### Absolute positional embeddings are directly associated with specific positions in a sequence. For each position in the input sequence, a unique embedding is added to the token's embedding to convey its exact location, as shown below.

<div style="max-width:800px">
    
![](images/2.8_1.png)

</div>

#### Relative positional embeddings focus on the relative position or distance between tokens. This means the model learns the relationship in terms of "how far apart" rather than "at which exact position." The advantage is that the model can generalize better to sequences of varying lengths. OpenAI's GPT models use absolute positional embeddings that are optimized during the training process rather than being fixed or predefined like the positional encodings in the original transformer model.

#### Let's create initial positional embeddings to create LLM inputs:

In [36]:
# Encode the input tokens into 256-dimensional vector representations

vocab_size = 50257 # vocab size of the GPT-2 BPE tokenizer
output_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

In [37]:
max_length = 4
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=max_length, stride=max_length, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Token ids:\n", inputs)
print("\nInput shape: \n", inputs.shape) # data batch consists of 8 samples with 4 tokens each

Token ids:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Input shape: 
 torch.Size([8, 4])


In [38]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape) # each token is now embedded as a 256-dimensional vector

torch.Size([8, 4, 256])


In [39]:
# For GPT's absolute embedding approach, we need to create another embedding layer that has the same embedding dimension as the "token_embedding_layer"

context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length)) # input to pos_embedding_layer is usually a placeholder vector that contains the sequence of numbers 0, 1, 2, ..., context_length - 1
print(pos_embeddings.shape)

torch.Size([4, 256])


#### The input to pos_embedding_layer is usually a placeholder vector, torch.arange(context_length), which contains the sequence of numbers 0, 1, ..., up to the maximum input length - 1.

#### As you can see, the pos_embeddings tensor contains four 256-dimensional vectors. We can now add these directly to the token embeddings.

In [40]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])
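
#### The addition above relies on broadcasting: the [4, 256] positional tensor is added to each of the 8 sequences in the [8, 4, 256] batch. A small plain-Python sketch of the same rule with made-up integer values (a batch of 2 sequences, 3 positions, 2 dimensions), showing that the same token vector ends up as a different input vector at a different position:

```python
# Made-up values; shapes are [batch=2][positions=3][dim=2] and [positions=3][dim=2].
toy_token_embeddings = [
    [[1, 1], [5, 5], [1, 1]],   # sequence 0: the same token at positions 0 and 2
    [[7, 7], [1, 1], [9, 9]],   # sequence 1
]
toy_pos_embeddings = [[10, 20], [30, 40], [50, 60]]

# Add the embedding of position p to the token at position p in every sequence.
toy_input_embeddings = [
    [
        [tok + pos for tok, pos in zip(token_vec, pos_vec)]
        for token_vec, pos_vec in zip(sequence, toy_pos_embeddings)
    ]
    for sequence in toy_token_embeddings
]

print(toy_input_embeddings[0])
print(toy_input_embeddings[1])
```

#### In sequence 0 the token vector [1, 1] becomes [11, 21] at position 0 and [51, 61] at position 2, so the model can tell the two occurrences apart.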


#### These input_embeddings can now be processed by