# Chapter 2

__Figure 2.1 A mental model of the three main stages of coding an LLM, pretraining the LLM on a general text dataset, and finetuning it on a labeled dataset. This chapter will explain and code the data preparation and sampling pipeline that provides the LLM with the text data for pretraining.__

![LLM coding mental model](https://drek4537l1klr.cloudfront.net/raschka/v-7/Figures/ch02__image001.png)

## 2.2 tokenizing text

__Figure 2.4 A view of the text processing steps covered in this section in the context of an LLM. Here, we split an input text into individual tokens, which are either words or special characters, such as punctuation characters. In upcoming sections, we will convert the text into token IDs and create token embeddings.__

![LLM text processing steps](https://drek4537l1klr.cloudfront.net/raschka/v-7/Figures/ch02__image007.png)

In [2]:
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/"
       "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
       "the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

('the-verdict.txt', <http.client.HTTPMessage at 0x112290770>)

In [11]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
print("Total number of chars:", len(raw_text))
print(raw_text[:99])

Total number of chars: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


### text sample sizes

In [5]:
import re
text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)
print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


### when to remove whitespace

In [37]:
text = "Hello, world. Is this-- a test?"
regex_string = r'([,.:;?_!()\']|--|\s)'
REGEX_STRING = regex_string
result = re.split(regex_string, text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


__Figure 2.5 The tokenization scheme we implemented so far splits text into individual words and punctuation characters. In the specific example shown in this figure, the sample text gets split into 10 individual tokens.__

![tokenizing scheme](https://drek4537l1klr.cloudfront.net/raschka/v-7/Figures/ch02__image009.png)

In [13]:
preprocessed = re.split(regex_string, raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))

4606


In [14]:
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


## 2.3 converting tokens into token IDs

__Figure 2.6 We build a vocabulary by tokenizing the entire text in a training dataset into individual tokens. These individual tokens are then sorted alphabetically, and duplicate tokens are removed. The unique tokens are then aggregated into a vocabulary that defines a mapping from each unique token to a unique integer value. The depicted vocabulary is purposefully small for illustration purposes and contains no punctuation or special characters for simplicity.__

![building a vocabulary of unique token IDs](https://drek4537l1klr.cloudfront.net/raschka/v-7/Figures/ch02__image011.png)

__Figure 2.7 Starting with a new text sample, we tokenize the text and use the vocabulary to convert the text tokens into token IDs. The vocabulary is built from the entire training set and can be applied to the training set itself and any new text samples. The depicted vocabulary contains no punctuation or special characters for simplicity.__

In [15]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)

1158


In [17]:
vocab = {token:integer for integer, token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('!', 0)
('"', 1)
('"Ah', 2)
('"Be', 3)
('"Begin', 4)
('"By', 5)
('"Come', 6)
('"Destroyed', 7)
('"Don', 8)
('"Gisburns"', 9)
('"Grindles', 10)
('"Hang', 11)
('"Has', 12)
('"How', 13)
('"I', 14)
('"If', 15)
('"It', 16)
('"Jack', 17)
('"Money', 18)
('"Moon-dancers"', 19)
('"Mr', 20)
('"Mrs', 21)
('"My', 22)
('"Never', 23)
('"Of', 24)
('"Oh', 25)
('"Once', 26)
('"Only', 27)
('"Or', 28)
('"That', 29)
('"The', 30)
('"Then', 31)
('"There', 32)
('"This', 33)
('"We', 34)
('"Well', 35)
('"What', 36)
('"When', 37)
('"Why', 38)
('"Yes', 39)
('"You', 40)
('"but', 41)
('"deadening', 42)
('"dragged', 43)
('"effects"', 44)
('"interesting"', 45)
('"lift', 46)
('"obituary"', 47)
('"strongest', 48)
('"strongly"', 49)
('"sweetly"', 50)


In [18]:
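class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab  # token string -> token ID
        self.int_to_str = {i: s for s, i in vocab.items()}  # inverse map: token ID -> token string

    def encode(self, text):
        # Split on punctuation, double dashes, and whitespace, then drop empty pieces
        preprocessed = re.split(regex_string, text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        # Raises KeyError for any token that is not in the vocabulary
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        # Map IDs back to token strings and join them with single spaces
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replaces each match of regex_string with itself, so the text is unchanged
        text = re.sub(regex_string, r'\1', text)
        return text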

### a misspelling, 'sad', results in a token not recognised in the vocab

In [23]:
tokenizer = SimpleTokenizerV1(vocab)
text = """It's the last he painted, you know.
       Mrs, Gisburn sad with pardonale pride."""
ids = tokenizer.encode(text)
print(ids)

KeyError: 'sad'

### corrected 'sad' to 'said'

__Figure 2.8 Tokenizer implementations share two common methods: an encode method and a decode method. The encode method takes in the sample text, splits it into individual tokens, and converts the tokens into token IDs via the vocabulary. The decode method takes in token IDs, converts them back into text tokens, and concatenates the text tokens into natural text.__

![all tokenizers share 2 methods encode and decode](https://drek4537l1klr.cloudfront.net/raschka/v-7/Figures/ch02__image015.png)

In [25]:
tokenizer = SimpleTokenizerV1(vocab)
text = """It's the last he painted, you know.
       Mrs, Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[95, 51, 880, 1015, 633, 564, 776, 54, 1154, 627, 56, 104, 54, 82, 881, 1136, 784, 823, 56]


In [26]:
print(tokenizer.decode(ids))

It ' s the last he painted , you know . Mrs , Gisburn said with pardonable pride .
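The decoded text above keeps a space in front of every punctuation mark because the tokens are simply joined with single spaces. A minimal sketch of one way to tidy this up, assuming a hypothetical helper `join_tokens` (not part of this notebook) and a hand-picked punctuation set: the pattern matches whitespace followed by a punctuation character and keeps only the punctuation.

```python
import re

def join_tokens(tokens):
    """Join token strings with spaces, then drop spaces before punctuation.

    Hypothetical helper for illustration; the punctuation set is an assumption.
    """
    text = " ".join(tokens)
    # \s+ consumes the whitespace, group 1 captures the punctuation that follows;
    # substituting with \1 keeps the punctuation and discards the whitespace
    return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)

cleaned = join_tokens(
    ["It", "'", "s", "the", "last", "he", "painted", ",", "you", "know", "."]
)
print(cleaned)  # It' s the last he painted, you know.
```

The space after the apostrophe remains because the pattern only removes whitespace that comes before a punctuation character.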


### now formally introducing a token not recognised in the vocab

In [28]:
text = """Hello, do you like tea?"""
print(tokenizer.encode(text))

KeyError: 'Hello'

## 2.4 adding special context tokens

__Figure 2.9 We add special tokens to a vocabulary to deal with certain contexts. For instance, we add an `<|unk|>` token to represent new and unknown words that were not part of the training data and thus not part of the existing vocabulary. Furthermore, we add an `<|endoftext|>` token that we can use to separate two unrelated text sources.__

![special context tokens](https://drek4537l1klr.cloudfront.net/raschka/v-7/Figures/ch02__image017.png)

__Figure 2.10 When working with multiple independent text sources, we add `<|endoftext|>` tokens between these texts. These `<|endoftext|>` tokens act as markers, signaling the start or end of a particular segment, allowing for more effective processing and understanding by the LLM.__

![delimiting separate text sources](https://drek4537l1klr.cloudfront.net/raschka/v-7/Figures/ch02__image019.png)

In [38]:
EOT_TOKEN = "<|endoftext|>"
UNKNOWN_TOKEN = "<|unk|>"

In [39]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend([EOT_TOKEN, UNKNOWN_TOKEN])
vocab = {token:integer for integer, token in enumerate(all_tokens)}

print(len(vocab.items()))

1160


In [40]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1155)
('your', 1156)
('yourself', 1157)
('<|endoftext|>', 1158)
('<|unk|>', 1159)


In [59]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab  # token string -> token ID
        self.int_to_str = {i: s for s, i in vocab.items()}  # token ID -> token string

    def encode(self, text):
        print('here')  # debug trace
        preprocessed = re.split(regex_string, text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        # Replace every token that is missing from the vocabulary with <|unk|>
        preprocessed = [item if item in self.str_to_int
                        else UNKNOWN_TOKEN for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        # Map IDs back to token strings and join them with single spaces
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replaces each match of regex_string with itself, so the text is unchanged
        text = re.sub(regex_string, r'\1', text)
        return text


In [60]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = f" {EOT_TOKEN} ".join((text1, text2))
print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [61]:
tokenizer = SimpleTokenizerV2(vocab)
ids = tokenizer.encode(text)
print(ids)

here
[1159, 54, 386, 1154, 659, 1002, 59, 1158, 94, 1015, 984, 1011, 752, 1015, 1159, 56]


In [62]:
print(tokenizer.decode(ids))

<|unk|> , do you like tea ? <|endoftext|> In the sunlit terraces of the <|unk|> .


## 2.5 byte pair encoding

In [63]:
pip freeze | grep tiktoken

tiktoken==0.7.0
Note: you may need to restart the kernel to use updated packages.


In [6]:
from importlib.metadata import version
import tiktoken
print("tiktoken version:", version("tiktoken"))

tiktoken version: 0.7.0


In [65]:
tokenizer = tiktoken.get_encoding("gpt2")

In [66]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces "
    "of someunkonPlace."
)
integers = tokenizer.encode(text, allowed_special={EOT_TOKEN})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 2954, 261, 27271, 13]


In [67]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunkonPlace.


__Figure 2.11 BPE tokenizers break down unknown words into subwords and individual characters. This way, a BPE tokenizer can parse any word and doesn’t need to replace unknown words with special tokens, such as `<|unk|>`.__

![byte pair encoding](https://drek4537l1klr.cloudfront.net/raschka/v-7/Figures/ch02__image021.png)

### BPE

Try the BPE tokenizer from the tiktoken library on the unknown words “Akwirw ier” and print the individual token IDs. 

In [68]:
string = "Akwirw ier"
tokens = tokenizer.encode(string)
print(tokens)

[33901, 86, 343, 86, 220, 959]


Then, call the decode function on each of the resulting integers in this list to reproduce the mapping shown in figure 2.11.

In [70]:
for token in tokens:
    print(token, tokenizer.decode([token]))

33901 Ak
86 w
343 ir
86 w
220  
959 ier


Lastly, call the decode method on the token IDs to check whether it can reconstruct the original input, “Akwirw ier.”

In [71]:
print(tokenizer.decode(tokens))

Akwirw ier
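To make the subword idea behind these splits more concrete, here is a toy sketch of the core merge loop of byte pair encoding: repeatedly find the most frequent adjacent pair of symbols and fuse it into a single symbol. It works on characters of one short string and is not how tiktoken is implemented; the function names `most_frequent_pair`, `merge_pair`, and `toy_bpe` are hypothetical.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent pair of symbols, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def toy_bpe(text, num_merges):
    """Start from single characters and apply up to `num_merges` merges."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens

print(toy_bpe("abab ab", 1))  # ['ab', 'ab', ' ', 'ab']
```

Because merges only fuse neighbouring symbols, joining the resulting tokens always reproduces the input string, which mirrors the lossless round trip shown above.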


## 2.6 data sampling using the sliding window approach

__Figure 2.12 Given a text sample, extract input blocks as subsamples that serve as input to the LLM, and the LLM’s prediction task during training is to predict the next word that follows the input block. During training, we mask out all words that are past the target. Note that the text shown in this figure would undergo tokenization before the LLM can process it; however, this figure omits the tokenization step for clarity.__

![sampling with a sliding window approach](https://drek4537l1klr.cloudfront.net/raschka/v-7/Figures/ch02__image023.png)

In [72]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5145


In [73]:
enc_sample = enc_text[50:]

In [74]:
context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


In [75]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


In [77]:
for i in range(1, context_size+1):
    context = tokenizer.decode(enc_sample[:i])
    desired = tokenizer.decode([enc_sample[i]])
    print(context, "---->", desired)

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


__Figure 2.13 To implement efficient data loaders, we collect the inputs in a tensor, x, where each row represents one input context. A second tensor, y, contains the corresponding prediction targets (next words), which are created by shifting the input by one position.__

![efficient data loader](https://drek4537l1klr.cloudfront.net/raschka/v-7/Figures/ch02__image025.png)

In [78]:
pip freeze | grep torch

torch==2.3.1
Note: you may need to restart the kernel to use updated packages.


In [1]:
import torch

In [3]:
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text once
        token_ids = tokenizer.encode(txt)

        # Slide a window of max_length tokens over the text in steps of stride;
        # the targets are the inputs shifted one position to the right
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        # Number of input-target pairs in the dataset
        return len(self.input_ids)

    def __getitem__(self, idx):
        # One (input, target) pair of token ID tensors
        return self.input_ids[idx], self.target_ids[idx]

In [4]:
def create_dataloader_v1(txt,
                         batch_size=4,
                         max_length=256,
                         stride=128,
                         shuffle=True,
                         drop_last=True,
                         num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

In [37]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


In [39]:
type(first_batch[0])

torch.Tensor

In [40]:
tokenizer = tiktoken.get_encoding("gpt2")

# Decode and print the decoded representation of the first batch
decoded_texts = [tokenizer.decode_batch(batch.tolist()) for batch in first_batch]
print(decoded_texts)

[['I HAD always'], [' HAD always thought']]


In [41]:
second_batch = next(data_iter)
print(second_batch)
decoded_texts = [tokenizer.decode_batch(batch.tolist()) for batch in second_batch]
print(decoded_texts)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]
[[' HAD always thought'], ['AD always thought Jack']]


__Figure 2.14 When creating multiple batches from the input dataset, we slide an input window across the text. If the stride is set to 1, we shift the input window by one position when creating the next batch. If we set the stride equal to the input window size, we can prevent overlaps between the batches.__

![stride](https://drek4537l1klr.cloudfront.net/raschka/v-7/Figures/ch02__image027.png)
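The effect of the stride can be checked without PyTorch. The sketch below mirrors the windowing loop of `GPTDatasetV1` on a plain list of integers; the helper name `sliding_windows` and the toy input `list(range(10))` are assumptions made for illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one position."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

ids = list(range(10))  # stand-in for real token IDs

# stride=1: consecutive windows share max_length - 1 tokens
overlapping = sliding_windows(ids, max_length=4, stride=1)

# stride=max_length: consecutive input windows do not overlap
disjoint = sliding_windows(ids, max_length=4, stride=4)

print(len(overlapping))  # 6
print(disjoint)          # [([0, 1, 2, 3], [1, 2, 3, 4]), ([4, 5, 6, 7], [5, 6, 7, 8])]
```

With a stride smaller than `max_length`, the same token appears in several inputs; with a stride equal to `max_length`, each token is used as an input at most once.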

### `max_length=2` and `stride=2`

In [46]:
dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=2, stride=2, shuffle=False
)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)
decoded_texts = [tokenizer.decode_batch(batch.tolist()) for batch in first_batch]
print(decoded_texts)

[tensor([[ 40, 367]]), tensor([[ 367, 2885]])]
[['I H'], [' HAD']]


In [47]:
second_batch = next(data_iter)
print(second_batch)
decoded_texts = [tokenizer.decode_batch(batch.tolist()) for batch in second_batch]
print(decoded_texts)

[tensor([[2885, 1464]]), tensor([[1464, 1807]])]
[['AD always'], [' always thought']]


In [48]:
third_batch = next(data_iter)
print(third_batch)
decoded_texts = [tokenizer.decode_batch(batch.tolist()) for batch in third_batch]
print(decoded_texts)

[tensor([[1807, 3619]]), tensor([[3619,  402]])]
[[' thought Jack'], [' Jack G']]


In [49]:
fourth_batch = next(data_iter)
print(fourth_batch)
decoded_texts = [tokenizer.decode_batch(batch.tolist()) for batch in fourth_batch]
print(decoded_texts)

[tensor([[402, 271]]), tensor([[  271, 10899]])]
[[' Gis'], ['isburn']]


### `max_length=8` and `stride=2`

In [50]:
dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=8, stride=2, shuffle=False
)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)
decoded_texts = [tokenizer.decode_batch(batch.tolist()) for batch in first_batch]
print(decoded_texts)

[tensor([[  40,  367, 2885, 1464, 1807, 3619,  402,  271]]), tensor([[  367,  2885,  1464,  1807,  3619,   402,   271, 10899]])]
[['I HAD always thought Jack Gis'], [' HAD always thought Jack Gisburn']]


In [51]:
second_batch = next(data_iter)
print(second_batch)
decoded_texts = [tokenizer.decode_batch(batch.tolist()) for batch in second_batch]
print(decoded_texts)

[tensor([[ 2885,  1464,  1807,  3619,   402,   271, 10899,  2138]]), tensor([[ 1464,  1807,  3619,   402,   271, 10899,  2138,   257]])]
[['AD always thought Jack Gisburn rather'], [' always thought Jack Gisburn rather a']]


In [59]:
dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=8, stride=2, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print(type(inputs))
print("Inputs:\n", inputs)
decoded_inputs = tokenizer.decode_batch(inputs.tolist())
print(type(decoded_inputs))
print("Inputs:\n", decoded_inputs)

<class 'torch.Tensor'>
Inputs:
 tensor([[  40,  367, 2885, 1464, 1807, 3619,  402,  271]])
<class 'list'>
Inputs:
 ['I HAD always thought Jack Gis']


In [58]:
dataloader = create_dataloader_v1(
    raw_text,
    batch_size=8,
    max_length=4,
    stride=4,
    shuffle=False
)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
decoded_inputs = tokenizer.decode_batch(inputs.tolist())
print("Inputs:\n", decoded_inputs)
print("Targets:\n", targets)
decoded_targets = tokenizer.decode_batch(targets.tolist())
print("Targets:\n", decoded_targets)

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])
Inputs:
 ['I HAD always', ' thought Jack Gis', 'burn rather a cheap', ' genius--though a', ' good fellow enough--', 'so it was no', ' great surprise to me', ' to hear that,']
Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])
Targets:
 [' HAD always thought', ' Jack Gisburn', ' rather a cheap genius', '--though a good', ' fellow enough--so', ' it was no great', ' surprise to me to', ' hear that, in']


## 2.7 create token embeddings

__Figure 2.15 Preparing the input text for an LLM involves tokenizing text, converting text tokens to token IDs, and converting token IDs into embedding vectors. In this section, we use the token IDs created in previous sections to create the token embedding vectors.__

![token embeddings step](https://drek4537l1klr.cloudfront.net/raschka/v-7/Figures/ch02__image029.png)

In [60]:
input_ids = torch.tensor([2, 3, 5, 1])

In [61]:
vocab_size = 6
output_dim = 3

### instantiate an embedding layer

In [62]:
torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


### apply token ID to get an embedding vector

In [63]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


In [64]:
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


__Figure 2.16 Embedding layers perform a lookup operation, retrieving the embedding vector corresponding to the token ID from the embedding layer’s weight matrix. For instance, the embedding vector of the token ID 5 is the sixth row of the embedding layer weight matrix (it is the sixth instead of the fifth row because Python starts counting at 0). For illustration purposes, we assume that the token IDs were produced by the small vocabulary we used in section 2.3.__

![looking up embedding layers](https://drek4537l1klr.cloudfront.net/raschka/v-7/Figures/ch02__image031.png)
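The lookup described in the figure can be reproduced with plain Python lists. The sketch below copies the weight matrix from the printed output above (rounded to four decimals) and shows that selecting rows by token ID gives the same result as multiplying a one-hot vector by the weight matrix. The helper names `lookup` and `one_hot_matmul` are hypothetical and only illustrate the idea; they are not PyTorch APIs.

```python
# Weight matrix copied from the printed embedding_layer.weight (rounded values)
weights = [
    [0.3374, -0.1778, -0.1690],
    [0.9178, 1.5810, 1.3010],
    [1.2753, -0.2010, -0.1606],
    [-0.4015, 0.9666, -1.1481],
    [-1.1589, 0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]

def lookup(weights, token_ids):
    """Embedding lookup: pick row `tid` of the weight matrix for each token ID."""
    return [weights[tid] for tid in token_ids]

def one_hot_matmul(weights, token_ids):
    """Same result via a one-hot vector multiplied by the weight matrix."""
    vocab_size, dim = len(weights), len(weights[0])
    rows = []
    for tid in token_ids:
        one_hot = [1.0 if j == tid else 0.0 for j in range(vocab_size)]
        rows.append(
            [sum(one_hot[j] * weights[j][k] for j in range(vocab_size))
             for k in range(dim)]
        )
    return rows

token_ids = [2, 3, 5, 1]
print(lookup(weights, token_ids))
print(lookup(weights, token_ids) == one_hot_matmul(weights, token_ids))  # True
```

The row-indexing version does the same job with far less arithmetic, which is why an embedding layer is implemented as a lookup rather than as a matrix product.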

## 2.8 encoding word positions

__Figure 2.17 The embedding layer converts a token ID into the same vector representation regardless of where it is located in the input sequence. For example, the token ID 5, whether it’s in the first or third position in the token ID input vector, will result in the same embedding vector.__

![encoding word positions](https://drek4537l1klr.cloudfront.net/raschka/v-7/Figures/ch02__image033.png)

__Figure 2.18 Positional embeddings are added to the token embedding vector to create the input embeddings for an LLM. The positional vectors have the same dimension as the original token embeddings. The token embeddings are shown with value 1 for simplicity.__

![positional embeddings](https://drek4537l1klr.cloudfront.net/raschka/v-7/Figures/ch02__image035.png)

In [65]:
output_dim = 256
vocab_size = 50257
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

In [67]:
max_length = 4
dataloader = create_dataloader_v1(
    raw_text,
    batch_size=8,
    max_length=max_length,
    stride=max_length,
    shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Tokens:\n", inputs)
print("\nInputs shape:\n", inputs.shape)


Tokens:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])


In [69]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


In [71]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


In [72]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])
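The addition above works even though the two tensors have different shapes: PyTorch broadcasts the `[4, 256]` positional tensor across the batch dimension, so the same four position vectors are added to each of the 8 samples. A minimal pure-Python sketch of that behaviour, using a hypothetical helper `add_positions` and tiny made-up values (token embeddings set to 1, as in Figure 2.18):

```python
def add_positions(token_embeddings, pos_embeddings):
    """Add the position vector for slot p to the token vector at slot p in every sample."""
    return [
        [
            [t + p for t, p in zip(tok_vec, pos_vec)]
            for tok_vec, pos_vec in zip(sample, pos_embeddings)
        ]
        for sample in token_embeddings
    ]

# 2 samples, 3 positions, 2 dimensions; all token embedding values are 1.0
batch = [[[1.0, 1.0] for _ in range(3)] for _ in range(2)]

# One vector per position, shared by every sample in the batch
pos = [[0.5, 0.25], [1.5, 2.0], [3.0, 4.0]]

result = add_positions(batch, pos)
print(result[0])  # [[1.5, 1.25], [2.5, 3.0], [4.0, 5.0]]
print(result[0] == result[1])  # True
```

Both samples end up identical here only because every token embedding was set to 1; with real token embeddings, each sample differs while the positional offsets stay the same.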


__Figure 2.19 As part of the input processing pipeline, input text is first broken up into individual tokens. These tokens are then converted into token IDs using a vocabulary. The token IDs are converted into embedding vectors to which positional embeddings of a similar size are added, resulting in input embeddings that are used as input for the main LLM layers.__

![input text processing pipeline](https://drek4537l1klr.cloudfront.net/raschka/v-7/Figures/ch02__image037.png)

## 2.9 Summary
* LLMs can't process raw text, so textual data must first be converted into numerical vectors, known as __embeddings__. Embeddings transform discrete data (like words or images) into continuous vector spaces, making them compatible with neural network operations.
* As the first step, raw text is broken into __tokens__, which can be words or characters. Then, the tokens are converted into integer representations, termed __token IDs__.
* Special tokens, such as `<|unk|>` and `<|endoftext|>`, can be added to enhance the model’s understanding and handle various contexts, such as unknown words or marking the boundary between unrelated texts.
* The __byte pair encoding (BPE)__ tokenizer used for LLMs like GPT-2 and GPT-3 can efficiently handle unknown words by breaking them down into subword units or individual characters.
* We use a __sliding window approach__ on tokenized data to generate input-target pairs for LLM training.
* Embedding layers in PyTorch function as a __lookup operation__, _retrieving vectors corresponding to token IDs_. The resulting embedding vectors provide continuous representations of tokens, which is crucial for training deep learning models like LLMs.
* While token embeddings provide consistent vector representations for each token, they lack a sense of the token’s position in a sequence. To rectify this, two main types of __positional embeddings__ exist: absolute and relative. OpenAI’s GPT models utilize absolute positional embeddings that are added to the token embedding vectors and are optimized during the model training.