## 2.2 Tokenizing text

In [1]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


In [2]:
import re
text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)
print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


Let's modify the regular expression so that it splits on whitespace (\s) as well as on commas and periods ([,.]):

In [3]:
result = re.split(r'([,.]|\s)', text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


As we can see, this creates empty strings. Let's remove them, together with the whitespace-only items:

In [4]:
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']
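The empty strings in the intermediate result are a side effect of how `re.split` treats delimiters that touch each other, for example a comma directly followed by a space. The following is a minimal sketch using only Python's standard `re` module; the short sample string is chosen just for illustration.

```python
import re

text = "Hello, world."

# Without a capturing group, the delimiters are dropped from the result.
no_group = re.split(r'[,.]|\s', text)

# With a capturing group, the delimiters are kept as separate items.
with_group = re.split(r'([,.]|\s)', text)

# Empty strings appear wherever two delimiters touch (", " here) or where
# a delimiter ends the string. Filtering on item.strip() removes them
# together with the whitespace-only items.
tokens = [item for item in with_group if item.strip()]

print(no_group)    # ['Hello', '', 'world', '']
print(with_group)  # ['Hello', ',', '', ' ', 'world', '.', '']
print(tokens)      # ['Hello', ',', 'world', '.']
```

Dropping whitespace keeps the token list short, but it also means the exact spacing of the input cannot be recovered later.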


Let's modify it a bit further so that it can also handle other types of punctuation, such as question marks, quotation marks, and double dashes:

In [5]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


This works well, and we are now ready to apply this tokenization to the raw text:

In [6]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))

4690


The print statement above outputs 4690, which is the number of tokens in this text **(without whitespace)**.

Let's print the first 30 tokens for a quick visual check:

In [7]:
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


## 2.3 Converting tokens into token IDs

In [8]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)


1130


In [9]:
all_words[:12]

['!', '"', "'", '(', ')', ',', '--', '.', ':', ';', '?', 'A']

In [10]:
vocab = {token:integer for integer,token in enumerate(all_words)}


In [11]:
len(vocab)

1130

In [12]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i > 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)
('His', 51)


Let's now put it all together into a tokenizer class:

In [13]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text



- We can use the tokenizer to encode (that is, tokenize) texts into integers.
- These integers can later be embedded and used as input for the LLM.

In [14]:
tokenizer = SimpleTokenizerV1(vocab)

# all symbols in the """   """ are part of text.
text = """"It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride."""

ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


- We can decode the integers back into text


In [15]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

In [16]:
tokenizer.decode(tokenizer.encode(text))

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'
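Note that the decoded string is not identical to the original input: the line break and indentation are gone, and extra spaces appear around the quotation mark and the apostrophe. The sketch below reproduces this effect with standalone functions that copy the splitting and joining rules of `SimpleTokenizerV1` but skip the ID lookup. The function names `tokenize` and `detokenize` are introduced here for illustration only.

```python
import re

SPLIT_PATTERN = r'([,.:;?_!"()\']|--|\s)'

def tokenize(text):
    # Same splitting rule as SimpleTokenizerV1.encode, without the ID lookup.
    return [item.strip() for item in re.split(SPLIT_PATTERN, text) if item.strip()]

def detokenize(tokens):
    # Same joining rule as SimpleTokenizerV1.decode.
    text = " ".join(tokens)
    return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

original = '"It\'s the last he painted, you know,"'
tokens = tokenize(original)
restored = detokenize(tokens)

print(tokens)
print(restored)
print(restored == original)  # False: the spacing is not preserved
```

The round trip is stable at the token level (tokenizing the restored string yields the same tokens), but it is lossy at the character level because whitespace was discarded during tokenization.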

So far, so good. We implemented a tokenizer capable of tokenizing and de-tokenizing
text based on a snippet from the training set. Let's now apply it to a new text sample that is not contained in the training set:

In [17]:
#text = "Hello, do you like tea?"
#print(tokenizer.encode(text))  # print gives error -> KeyError "Hello"

The problem is that the word "Hello" does not occur in the short story **The Verdict**. Hence, it is not contained in the vocabulary. This highlights the need to consider large and diverse training sets to extend the vocabulary when working on LLMs.

In the next section, we will also discuss **additional special tokens** that can be used to provide further context for an LLM during training.

## 2.4 Adding special context tokens

- The code above produces an error because the word "Hello" is not contained in the vocabulary.
- To deal with such cases, we can add special tokens like "<|unk|>" to the vocabulary to represent unknown words.
- Since we are already extending the vocabulary, let's add another token called "<|endoftext|>", which is used in GPT-2 training to denote the end of a text (it is also used between concatenated texts, for example when the training dataset consists of multiple articles, books, etc.).

In [18]:
len(preprocessed)

4690

In [19]:
# adding special tokens: unk and endoftext
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer, token in enumerate(all_tokens)}

In [20]:
len(vocab.items())

1132

In [21]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)


('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


- Based on the code output above, we can confirm that the two new special tokens were successfully incorporated into the vocabulary. Next, we adapt `SimpleTokenizerV1` into a new class, `SimpleTokenizerV2`, that replaces unknown words with `<|unk|>`:

In [22]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [item if item in self.str_to_int     # A
                        else "<|unk|>" for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])

        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)     #B
        return text
    
#A replace unknown words by <|unk|> tokens
#B Replace spaces before the specified punctuations

In [23]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [24]:
tokenizer = SimpleTokenizerV2(vocab)

In [25]:
tokenizer.encode(text)

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]

- As we can see, the token IDs contain 1130 for the `<|endoftext|>` separator token as well as two 1131 tokens, which are used for unknown words.

In [26]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'
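Every word that is mapped to `<|unk|>` is lost for good, as the decoded string shows. One way to judge how well a vocabulary covers a new text is to measure the share of tokens that would become `<|unk|>`. The sketch below assumes a hypothetical helper `unknown_rate` and a tiny toy vocabulary; neither is part of the book's code.

```python
import re

def tokenize(text):
    # Same splitting rule as in the tokenizer classes above.
    parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [p.strip() for p in parts if p.strip()]

def unknown_rate(text, vocab):
    # Hypothetical helper: share of tokens that would map to <|unk|>.
    tokens = tokenize(text)
    if not tokens:
        return 0.0, []
    unknown = [t for t in tokens if t not in vocab]
    return len(unknown) / len(tokens), unknown

# Toy vocabulary for illustration only (not the one built from the story).
toy_vocab = {",", "?", "do", "you", "like", "tea", "<|endoftext|>", "<|unk|>"}

rate, unknown = unknown_rate("Hello, do you like tea?", toy_vocab)
print(unknown)        # the tokens that are missing from the vocabulary
print(f"{rate:.2%}")  # their share among all tokens
```

A high rate suggests that the vocabulary was built from text that is too small or too different from the new input.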

> Note:
So far, we have discussed tokenization as an essential step in processing text as input to
LLMs. Depending on the LLM, some researchers also consider additional special tokens such
as the following:
- [BOS] (beginning of sequence): This token marks the start of a text. It
signifies to the LLM where a piece of content begins.

- [EOS] (end of sequence): This token is positioned at the end of a text,
and is especially useful when concatenating multiple unrelated texts,
similar to <|endoftext|>. For instance, when combining two different
Wikipedia articles or books, the [EOS] token indicates where one article
ends and the next one begins.

- [PAD] (padding): When training LLMs with batch sizes larger than one,
the batch might contain texts of varying lengths. To ensure all texts have
the same length, the shorter texts are extended or "padded" using the
[PAD] token, up to the length of the longest text in the batch.


> Note:
Note that the tokenizer used for GPT models does not need any of these tokens mentioned
above but only uses an `<|endoftext|>` token for simplicity. The <|endoftext|> is
analogous to the `[EOS]` token mentioned above. Also, `<|endoftext|>` is used for padding
as well. However, as we'll explore in subsequent chapters when training on batched inputs,
we typically use a `mask`, meaning we don't attend to padded tokens. Thus, the specific
token chosen for padding becomes inconsequential.
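To make the padding and masking idea concrete, here is a minimal sketch in plain Python. It assumes that the `<|endoftext|>` ID (50256) is reused as the padding ID and that a mask value of 1 marks a real token and 0 marks padding. The function name `pad_batch` and the mask convention are illustrative choices, not taken from the book.

```python
PAD_ID = 50256  # assumption: reuse the <|endoftext|> token ID for padding

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad all sequences to the length of the longest one and build a mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padded position to be ignored.
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask

batch = [[15496, 11, 466], [554, 262], [13]]
padded, mask = pad_batch(batch)

for row, m in zip(padded, mask):
    print(row, m)
```

Because the mask tells the model which positions to ignore, the concrete value of `PAD_ID` does not influence the result.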

Moreover, the tokenizer used for GPT models also **doesn't use an `<|unk|>`** token for out-of-vocabulary words. Instead, GPT models use a **byte pair encoding tokenizer**, which breaks down words into subword units and which we will discuss in the next section.

## 2.5 Byte pair encoding

- GPT-2 used byte pair encoding (BPE) as its tokenizer.

- It allows the model to break down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words.

- For instance, if GPT-2's vocabulary doesn't have the word "unfamiliarword," it might tokenize it as ["unfam", "iliar", "word"] or some other subword breakdown, depending on its trained BPE merges.

- The original BPE tokenizer can be found here: https://github.com/openai/gpt-2/blob/master/src/encoder.py

- In this chapter, we are using the BPE tokenizer from OpenAI's open-source tiktoken library, which implements its core algorithms in Rust to improve computational performance.

- The author of the book created a notebook in `./bytepair_encoder` that compares these two implementations side by side (tiktoken was about 5x faster on the sample text).

In [27]:
# pip install tiktoken

In [28]:
import importlib
import importlib.metadata
import tiktoken

print("tiktoken version: ", importlib.metadata.version("tiktoken"))

tiktoken version:  0.7.0


In [29]:
tokenizer = tiktoken.get_encoding("gpt2")

In [30]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."
)

In [31]:
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]


In [32]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.


We can make two noteworthy observations based on the token IDs and decoded text
above. First, the `<|endoftext|>` token is assigned a relatively large token ID, namely
`50256`. In fact, the BPE tokenizer that was used to train models such as GPT-2, GPT-3,
and the original model used in ChatGPT has a total vocabulary size of `50,257`, with
`<|endoftext|>` being assigned the largest token ID.

Second, the BPE tokenizer encodes and decodes unknown words, such as "someunknownPlace", correctly, because it breaks unknown words down into subwords and individual characters:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/11.webp" width="400px">

> EXERCISE 2.1 BYTE PAIR ENCODING OF UNKNOWN WORDS
- Try the BPE tokenizer from the tiktoken library on the unknown words "Akwirw ier"
and print the individual token IDs. Then, call the decode function on each of the
resulting integers in this list to reproduce the mapping shown in Figure 2.11. Lastly,
call the decode method on the token IDs to check whether it can reconstruct the
original input, "Akwirw ier".


In [33]:
text_sample = "Akwirw ier"
example_integers = tokenizer.encode(text_sample)
print(example_integers)

[33901, 86, 343, 86, 220, 959]


In [34]:
example_strings = tokenizer.decode(example_integers)
print(example_strings)

Akwirw ier


Exercise 2.1 is implemented in the `exercise_solutions.ipynb` file.

A detailed discussion and implementation of BPE is out of the scope of this book, but in
short, it builds its vocabulary by iteratively merging frequent characters into subwords and frequent subwords into words. For example, BPE starts with adding all individual single characters to its vocabulary ("a", "b", ...). In the next stage, it merges character combinations that frequently occur together into subwords. For example, "d" and "e" may be merged into the subword "de," which is common in many English words like "define", "depend", "made", and "hidden". The merges are determined by a frequency cutoff.
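The merging idea described above can be sketched in a few lines of plain Python. This is a simplified, character-level toy version under stated assumptions: it always merges the single most frequent adjacent pair and stops after a fixed number of merges, whereas the real GPT-2 tokenizer works on bytes and uses a pre-trained merge table. The function names and the toy corpus are illustrative.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, or None if there is none."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus; each word starts out as a tuple of single characters.
corpus = ["define", "depend", "made", "hidden", "define", "depend"]
words = dict(Counter(tuple(w) for w in corpus))

merges = []
for _ in range(3):  # assumption: stop after a fixed number of merges
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges[0])  # ('d', 'e') is the most frequent pair in this toy corpus
print(list(words))
```

In this corpus the pair ("d", "e") occurs in every word, so it is merged first into the subword "de", matching the example in the text.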


## 2.6 Data sampling with a sliding window

- We train LLMs to generate one word at a time, so we want to prepare the training data accordingly where the next word in a sequence represents the target to predict:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/12.webp" width="500px">

To get started, we first tokenize the entire short story The Verdict, which we worked with earlier, using the BPE tokenizer introduced in the previous section:

In [35]:
with open("the-verdict.txt", "r", encoding='utf-8') as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5145


- Next, we remove the first 50 tokens from the dataset for demonstration purposes as it
results in a slightly more interesting text passage in the next steps:


In [36]:
enc_sample = enc_text[50:]

- One of the easiest and most intuitive ways to create the input-target pairs for the next-word prediction task is to create two variables, `x` and `y`, where `x` contains the `input tokens` and `y` contains the `targets`, which are the inputs shifted by 1:


In [37]:
context_size = 4 #A
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f'x: {x}')
print(f"y:      {y}")

#A the context size determines how many tokens are included in the input

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


In [38]:
for i in range(1, context_size + 1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


Everything left of the arrow (---->) refers to the input an LLM would receive, and the token
ID on the right side of the arrow represents the target token ID that the LLM is supposed to
predict.


In [39]:
for i in range(1, context_size + 1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


- We will take care of the next-word prediction in a later chapter, after we have covered the attention mechanism

- For now, we implement a simple data loader that iterates over the input dataset and returns the inputs and targets shifted by one


- We use a sliding window approach, changing the position by +1:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/13.webp" width="500px">

- For the efficient data loader implementation, we will use PyTorch's built-in Dataset and
DataLoader classes. 

In [40]:
import torch 
print(f'PyTorch version: {torch.__version__}')

PyTorch version: 2.0.1+cu118


In [41]:
# A dataset for batched inputs and targets

from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt)                           #A

        for i in range(0, len(token_ids) - max_length, stride):     #B
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + 1 + max_length]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):                                              #C
        return len(self.input_ids)
    
    def __getitem__(self, idx):                                     #D
        return self.input_ids[idx], self.target_ids[idx]
    
#A Tokenize the entire text
#B Use a sliding window to chunk the book into overlapping sequences of max_length
#C Return the total number of rows in the dataset
#D Return a single row from the dataset
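To see what the `stride` argument does without involving PyTorch, the same windowing loop can be run on a short list of toy IDs. This is a minimal sketch; the function name `sliding_windows` and the toy data are chosen here for illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Same loop as in GPTDatasetV1, but returning plain Python lists."""
    inputs, targets = [], []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs.append(token_ids[i:i + max_length])
        targets.append(token_ids[i + 1:i + 1 + max_length])
    return inputs, targets

toy_ids = list(range(10))  # stand-in for real token IDs

# stride=1: consecutive windows overlap in all but one position.
x1, y1 = sliding_windows(toy_ids, max_length=4, stride=1)

# stride=4 (equal to max_length): the input windows do not overlap.
x4, y4 = sliding_windows(toy_ids, max_length=4, stride=4)

print(len(x1), x1[0], y1[0])  # 6 [0, 1, 2, 3] [1, 2, 3, 4]
print(x4)                     # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(y4)                     # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The upper bound `len(token_ids) - max_length` guarantees that every target chunk still has `max_length` elements, which is why the last tokens of the text may not start a window of their own.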

- The following code will use the `GPTDatasetV1` to load the inputs in batches via a `PyTorch DataLoader`:

In [42]:
def create_dataloader_v1(txt, batch_size=4, max_length=256,
        stride=128, shuffle=True, drop_last=True, num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")                       #A
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)      #B
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,                                        #C
        num_workers=num_workers                                     #D
    )
    
    return dataloader

#A Initialize the tokenizer
#B Create dataset
#C drop_last=True drops the last batch if it is shorter than the specified batch_size to prevent loss 
    # spikes during training
#D The number of CPU processes to use for preprocessing

- Let's test the dataloader with a batch size of 1 for an LLM with a context size of 4:


In [43]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print(f'Original text length: {len(raw_text)}')
print(f"Tokenized text length: {len(tokenizer.encode(raw_text))}")

Original text length: 20479
Tokenized text length: 5145


In [44]:
dataloader = create_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)

In [45]:
len(dataloader)  # (tokenized_length - max_length) / stride => (5145 - 4) / 1 = 5141

5141

In [46]:
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)  # the first tensor stores input_ids, while second target_ids

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


> Note that an input size of 4 is relatively small and only chosen for illustration purposes. It is common to train LLMs with input sizes of at least `256`.


- To illustrate the meaning of stride=1, let's fetch another batch from this dataset:


In [47]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


> If we compare the first with the second batch, we can see that the second batch's token
IDs are shifted by one position compared to the first batch.

- The figure below shows an example that uses a stride equal to the context length (here: 4):

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/14.webp" width="500px">

- We can also create batched outputs

- Note that we increase the stride here so that we don't have overlaps between the batches, since more overlap could lead to increased overfitting

In [48]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

In [49]:
len(dataloader)  # 160 batches, each containing 8 samples

160
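The batch counts reported by `len(dataloader)` can be verified with a small calculation that mirrors the `range` call in `GPTDatasetV1` and the `drop_last` behavior of the data loader. The helper name `num_batches` is hypothetical.

```python
def num_batches(n_tokens, max_length, stride, batch_size, drop_last=True):
    # Number of samples produced by range(0, n_tokens - max_length, stride).
    n_samples = len(range(0, n_tokens - max_length, stride))
    if drop_last:
        # An incomplete final batch is discarded.
        return n_samples // batch_size
    # Otherwise the final, shorter batch is kept (ceiling division).
    return -(-n_samples // batch_size)

print(num_batches(5145, max_length=4, stride=1, batch_size=1))  # 5141
print(num_batches(5145, max_length=4, stride=4, batch_size=8))  # 160
print(num_batches(5145, max_length=4, stride=4, batch_size=8, drop_last=False))  # 161
```

With a stride of 4 the dataset holds 1286 samples, so 160 full batches of 8 are returned and the remaining 6 samples are dropped.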

In [50]:
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

print(f'Inputs shape: {inputs.shape}')     # 8 means batch_size, 4 means -> max_length
print(f'Targets shape: {targets.shape}\n')


print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs shape: torch.Size([8, 4])
Targets shape: torch.Size([8, 4])

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])


## 2.7 Creating token embeddings

- The data is now almost ready for an LLM.
- As a last step, let's embed the tokens in a continuous vector representation using an embedding layer.
- Usually, these embedding layers are part of the LLM itself and are updated (trained) during model training.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/15.webp" width="500px">

- Suppose we have the following four input examples with input ids 2, 3, 5, and 1 (after tokenization):


In [51]:
input_ids = torch.tensor([2, 3, 5, 1])

- For the sake of simplicity, suppose we have a small vocabulary of only 6 words and we want to create embeddings of size 3:


In [52]:
vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

- This would result in a 6x3 weight matrix:


In [54]:
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


- For those who are familiar with one-hot encoding, the embedding layer approach above is essentially just a more efficient way of implementing one-hot encoding followed by matrix multiplication in a fully-connected layer.

- Because the embedding layer is just a more efficient implementation that is equivalent to the one-hot encoding and matrix-multiplication approach, it can be seen as a neural network layer that can be optimized via backpropagation.

- To convert a token with id 3 into a 3-dimensional vector, we do the following:


In [55]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


- If we compare the embedding vector for token ID 3 to the previous embedding matrix, we see that it is identical to the 4th row.

- In other words, the embedding layer is essentially a look-up operation that retrieves rows from the embedding layer's weight matrix via a token ID.
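The equivalence between the lookup and a one-hot encoding followed by a matrix multiplication can be checked by hand. The sketch below uses plain Python lists and the (rounded) weight values printed above; `one_hot` and `vecmat` are small helper functions written for this illustration, not PyTorch functions.

```python
# Weight matrix copied from the printed embedding_layer.weight (rounded).
weights = [
    [ 0.3374, -0.1778, -0.1690],
    [ 0.9178,  1.5810,  1.3010],
    [ 1.2753, -0.2010, -0.1606],
    [-0.4015,  0.9666, -1.1481],
    [-1.1589,  0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vecmat(vector, matrix):
    """Multiply a row vector by a matrix (lists of lists)."""
    n_cols = len(matrix[0])
    return [
        sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(n_cols)
    ]

token_id = 3
via_one_hot = vecmat(one_hot(token_id, len(weights)), weights)
via_lookup = weights[token_id]

print(via_one_hot)
print(via_lookup)
```

Both approaches select the row at index 3; the lookup simply skips the multiplications by zero.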


In [56]:
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


- An embedding layer is essentially a look-up operation:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/16.webp" width="550px">

This section covered how we create embedding vectors from token IDs. The next and final section of this chapter will add a small modification to these embedding vectors to encode
positional information about a token within a text.

## 2.8 Encoding word positions

The previously introduced embedding layer always maps the same token ID to the same vector representation, regardless of where the token ID is positioned in the input sequence, as illustrated in the next figure:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/17.webp" width="500px">


- In principle, the deterministic, position-independent embedding of the token ID is good for
reproducibility purposes. However, since the self-attention mechanism of LLMs itself is also
position-agnostic, it is helpful to inject additional position information into the LLM.

- Absolute positional embeddings are directly associated with specific positions in a
sequence. For each position in the input sequence, a unique embedding is added to the
token's embedding to convey its exact location. For instance, the first token will have a
specific positional embedding, the second token another distinct embedding, and so on, as
illustrated in the next figure:


<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/18.webp" width="500px">

- Positional embeddings are added to the token embedding vector to create the input embeddings
for an LLM. The positional vectors have the same dimension as the original token embeddings. The token
embeddings are shown with value 1 for simplicity.
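A tiny numeric sketch illustrates why the positional vectors matter. All values below are made up for illustration: token ID 7 appears twice in the sequence, receives the same token embedding both times, and only becomes distinguishable after the positional vectors are added.

```python
# Toy lookup tables with made-up values (3-dimensional embeddings).
token_table = {7: [0.5, -1.0, 2.0], 9: [1.5, 0.0, -0.5]}
pos_table = [[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [0.3, 0.3, 0.3]]

token_ids = [7, 9, 7]  # the same token ID occurs at positions 0 and 2

# Token embeddings depend only on the token ID.
token_vecs = [token_table[t] for t in token_ids]

# Input embeddings = token embedding + positional embedding, element-wise.
input_vecs = [
    [t + p for t, p in zip(tok, pos)]
    for tok, pos in zip(token_vecs, pos_table)
]

print(token_vecs[0] == token_vecs[2])  # True: position is invisible
print(input_vecs[0] == input_vecs[2])  # False: position is now encoded
```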


Previously, we focused on very small embedding sizes in this chapter for illustration
purposes. We now consider more realistic and useful embedding sizes and encode the input
tokens into a 256-dimensional vector representation. This is smaller than what the original
GPT-3 model used (in GPT-3, the embedding size is 12,288 dimensions) but still reasonable
for experimentation. Furthermore, we assume that the token IDs were created by the BPE tokenizer that we used earlier, which has a vocabulary size of 50,257:


In [57]:
vocab_size = 50257
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

- If we sample data from the data loader, we embed each token in each batch into a 256-dimensional vector.
- If we have a batch size of 8 with four tokens each, the result will be an `8 x 4 x 256` tensor.

In [58]:
max_length = 4
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=max_length, stride=max_length, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

In [61]:
print(f'Token IDs: \n {inputs}')
print(f'Target IDs: \n {targets}')

print(f'\n Inputs shape: {inputs.shape} and Targets shape {targets.shape}')

Token IDs: 
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])
Target IDs: 
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])

 Inputs shape: torch.Size([8, 4]) and Targets shape torch.Size([8, 4])


- As we can see, the token ID tensor is 8x4-dimensional, meaning that the data batch
consists of 8 text samples with 4 tokens each.

- Let's now use the embedding layer to embed these token IDs into 256-dimensional
vectors:


In [62]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


- As we can tell based on the 8x4x256-dimensional tensor output, each token ID is now
embedded as a 256-dimensional vector.


- For a GPT model's absolute embedding approach, we just need to create another embedding layer that has the same dimension as the `token_embedding_layer`

In [64]:
context_length = max_length 
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


> `Note:`
The `context_length` variable represents the supported input size of the LLM. Here, we set it equal to the
maximum length of the input text. In practice, input text can be longer than the supported context length, in which case we have to truncate the text.

- We can now add these directly to the token embeddings: PyTorch broadcasts the 4x256-dimensional `pos_embeddings` tensor across the batch and adds it to the 4x256-dimensional token embedding tensor of each of the 8 samples:

In [66]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])
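The addition works even though the two tensors have different shapes, because the positional tensor is reused for every sample in the batch. The sketch below spells this out with nested Python lists and much smaller, made-up dimensions (batch size 2, context length 3, embedding size 2). The `shape` helper is written for this illustration only.

```python
def shape(nested):
    """Return the dimensions of a regularly nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

batch_size, context_length, dim = 2, 3, 2

# Made-up token embeddings of shape (batch_size, context_length, dim).
token_embeddings = [
    [[float(b * 100 + t * 10 + d) for d in range(dim)]
     for t in range(context_length)]
    for b in range(batch_size)
]

# Made-up positional embeddings of shape (context_length, dim).
pos_embeddings = [[0.5 * (t + 1)] * dim for t in range(context_length)]

# The same positional vectors are added to every sample in the batch.
input_embeddings = [
    [[tok + pos for tok, pos in zip(tok_vec, pos_vec)]
     for tok_vec, pos_vec in zip(sample, pos_embeddings)]
    for sample in token_embeddings
]

print(shape(token_embeddings), shape(pos_embeddings), shape(input_embeddings))
```

The result keeps the shape of the token embeddings, just as the `[8, 4, 256]` output above keeps the batch dimension.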


- The `input_embeddings` we created, as summarized in Figure 2.19, are the embedded input
examples that can now be processed by the main LLM modules, which we will begin
implementing in chapter 3.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/19.webp" width="500px">

- Figure 2.19 As part of the input processing pipeline, input text is first broken up into individual tokens. These
tokens are then converted into token IDs using a vocabulary. The token IDs are converted into embedding
vectors to which positional embeddings of a similar size are added, resulting in input embeddings that are used
as input for the main LLM layers.


## 2.9 Summary

- LLMs require textual data to be converted into numerical vectors, known
as embeddings, since they can't process raw text. Embeddings transform
discrete data (like words or images) into continuous vector spaces,
making them compatible with neural network operations.

- As the first step, raw text is broken into tokens, which can be words or
characters. Then, the tokens are converted into integer representations,
termed token IDs.

- Special tokens, such as `<|unk|>` and `<|endoftext|>`, can be added to
enhance the model's understanding and handle various contexts, such as
unknown words or marking the boundary between unrelated texts.

- The byte pair encoding (BPE) tokenizer used for LLMs like GPT-2 and GPT-3 can efficiently handle unknown words by breaking them down into
subword units or individual characters.

- We use a sliding window approach on tokenized data to generate input-target pairs for LLM training.

- Embedding layers in PyTorch function as a lookup operation, retrieving
vectors corresponding to token IDs. The resulting embedding vectors
provide continuous representations of tokens, which is crucial for training
deep learning models like LLMs.

- While token embeddings provide consistent vector representations for
each token, they lack a sense of the token's position in a sequence. To
rectify this, two main types of positional embeddings exist: absolute and
relative. OpenAI's GPT models utilize absolute positional embeddings that
are added to the token embedding vectors and are optimized during the
model training.

## 2.10 Roadmap by Sebastian Raschka for effective learning

A suggestion for an effective 11-step LLM summer study plan:
1) Read Chapters 1 and 2 on implementing the data loading pipeline (https://www.manning.com/books/build-a-large-language-model-from-scratch & https://github.com/rasbt/LLMs-from-scratch).

2) Watch Karpathy's video on training a BPE tokenizer from scratch (https://www.youtube.com/watch?v=zduSFxRajkE).
3) Read Chapters 3 and 4 on implementing the model architecture.
4) Watch Karpathy's video on pretraining the LLM.
5) Read Chapter 5 on pretraining the LLM and then loading pretrained weights.
6) Read Appendix D on adding additional bells and whistles to the training loop.
7) Read Chapters 6 and 7 on finetuning the LLM.
8) Read Appendix E on parameter-efficient finetuning with LoRA.
9) Check out Karpathy's repo on coding the LLM in C code (https://github.com/karpathy/llm.c).
10) Check out LitGPT to see how multi-GPU training is implemented and how different LLM architectures compare (https://github.com/Lightning-AI/litgpt).
11) Build something cool and share it with the world.

**The first step is done**; the next step is Karpathy's video on training a BPE tokenizer from scratch.