## Reading in a short story as a text sample into Python

## Step1: Creating Tokens

In [3]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of character:", len(raw_text))
print(raw_text[:99])

Total number of character: 20480
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


#### Our goal is to tokenize this 20,480-character short story into individual words and special characters that we can then turn into embeddings for LLM training.

#### Note that it's common to process millions of articles and hundreds of thousands of books -- many gigabytes of text -- when working with LLMs. However, for educational purposes, it's sufficient to work with smaller text samples like a single book to illustrate the main ideas behind the text processing steps and to make it possible to run it in reasonable time on consumer hardware.

#### How can we best split this text to obtain a list of tokens? For this, we go on a small excursion and use Python's regular expression library re for illustration purposes. (Note that you don't have to learn or memorize any regular expression syntax since we will transition to a pre-built tokenizer later in this chapter.)

#### Using some simple example text, we can use the re.split command with the following syntax to split a text on whitespace characters:

In [8]:
# Import Python's regular expression library for illustration purposes
import re

text = "Hello, world. This is a Test."
# r'(\s)' splits on whitespace; the capture group keeps each whitespace character in the result
result = re.split(r'(\s)', text)

print(result)


['Hello,', ' ', 'world.', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'Test.']


#### The result is a list of words and whitespace characters, with punctuation still attached to the words.

#### Let's modify the regular expression so that it splits on whitespace (\s) as well as commas and periods ([,.]):

In [11]:
result = re.split(r'([,.]|\s)', text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'Test', '.', '']


#### We can see that words and punctuation characters are now separate list entries, just as we wanted.

#### A small remaining issue is that the list still includes whitespace characters and empty strings. Optionally, we can remove these redundant entries safely as follows:

In [14]:
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', 'is', 'a', 'Test', '.']


### REMOVING WHITESPACES OR NOT

#### When developing a simple tokenizer, whether we should encode whitespaces as separate characters or just remove them depends on our application and its requirements. Removing whitespaces reduces the memory and computing requirements. However, keeping whitespaces can be useful if we train models that are sensitive to the exact structure of the text (for example, Python code, which is sensitive to indentation and spacing). Here, we remove whitespaces for simplicity and brevity of the tokenized outputs. Later, we will switch to a tokenization scheme that includes whitespaces.
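
As a rough illustration of this trade-off, the sketch below tokenizes a two-line Python snippet twice, once keeping whitespace tokens and once dropping them. The sample string `code` and the helper name `tokenize` are made up for this example; the regular expression is the one used above.

```python
import re

def tokenize(text, keep_whitespace):
    """Split on punctuation and whitespace; optionally keep whitespace tokens."""
    pieces = re.split(r'([,.]|\s)', text)
    if keep_whitespace:
        # Drop only the empty strings that re.split produces between adjacent delimiters
        return [p for p in pieces if p != ""]
    # Drop empty strings and whitespace-only tokens
    return [p for p in pieces if p.strip()]

code = "if x:\n    y"  # hypothetical sample in which indentation carries meaning

with_ws = tokenize(code, keep_whitespace=True)
without_ws = tokenize(code, keep_whitespace=False)

print(with_ws)                    # newline and indentation survive as tokens
print(without_ws)                 # shorter, but the layout is gone
print("".join(with_ws) == code)   # the text can be rebuilt only from with_ws
```

With whitespace kept, the token list is three times as long here, but joining it restores the snippet exactly; without it, the indentation cannot be recovered.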

#### The tokenization scheme we devised above works well on the simple sample text. Let's modify it a bit further so that it can also handle other types of punctuation, such as question marks, quotation marks, and the double-dashes we have seen earlier in the first 100 characters of Edith Wharton's short story, along with additional special characters:

In [17]:
text = "Hello, world, Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', ',', 'Is', 'this', '--', 'a', 'test', '?']


### Now that we have a basic tokenizer working, let's apply it to Edith Wharton's entire short story:

In [19]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


In [20]:
print(len(preprocessed))


4690


##  Step2: Creating Token IDs

#### In the previous section, we tokenized Edith Wharton's short story and assigned it to a Python variable called preprocessed. Let's now create a list of all unique tokens and sort them alphabetically to determine the vocabulary size:

In [23]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

1130


#### After determining that the vocabulary size is 1,130 via the above code, we create the vocabulary and print its first 51 entries for illustration purposes:

In [25]:
vocab = {token:integer for integer, token in enumerate(all_words)}

## enumerate pairs each token in the alphabetically sorted list with its index, so every unique token is assigned its own integer ID
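
To make the role of `enumerate` concrete, here is a small sketch on a hand-made token list. The names `toy_tokens`, `toy_words`, and `toy_vocab` are assumptions for illustration and are not taken from the story.

```python
toy_tokens = ["the", "cat", ",", "the", "dog"]

# Unique tokens, sorted; punctuation sorts before letters in Unicode order
toy_words = sorted(set(toy_tokens))

# enumerate yields (index, token) pairs; the comprehension flips them into token -> ID
toy_vocab = {token: integer for integer, token in enumerate(toy_words)}

print(toy_words)
print(toy_vocab)
```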

In [26]:
for i, item in enumerate (vocab.items()):
    print(item)
    if i>=50:
        break 

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)


#### As we can see, based on the output above, the dictionary contains individual tokens associated with unique integer labels.

### Later in this book, when we want to convert the outputs of an LLM from numbers back into text, we also need a way to turn token IDs into text.

#### For this, we can create an inverse version of the vocabulary that maps token IDs back to corresponding text tokens.
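
A minimal sketch of such an inverse mapping, using a small hypothetical vocabulary (`toy_vocab`) rather than the one built from the story:

```python
toy_vocab = {",": 0, "cat": 1, "dog": 2, "the": 3}  # hypothetical vocabulary

# Swap keys and values so that IDs map back to tokens
toy_inverse = {integer: token for token, integer in toy_vocab.items()}

toy_ids = [3, 1, 0, 3, 2]
toy_decoded = [toy_inverse[i] for i in toy_ids]
print(toy_decoded)
```

This only works because every token has a unique ID; if two tokens shared an ID, one of them would be lost when the dictionary is inverted.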

### Let's implement a complete tokenizer class in Python.

#### The class will have an encode method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs via the vocabulary.
#### In addition, we implement a decode method that carries out the reverse integer-to-string mapping to convert the token IDs back into text.

### Step 1: Store the vocabulary as a class attribute for access in the encode and decode methods

### Step 2: Create an inverse vocabulary that maps token IDs back to the original text tokens

### Step 3: Process input text into token IDs

### Step 4: Convert token IDs back into text

### Step 5: Replace spaces before the specified punctuation

In [31]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text 
        

### Let's instantiate a new tokenizer object from the SimpleTokenizerV1 class and tokenize a passage from Edith Wharton's short story to try it out in practice:

In [33]:
tokenizer = SimpleTokenizerV1(vocab)

text = """"it's the last he painted , you know, " Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 585, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


### The code above prints the token IDs shown. Next, let's see if we can turn these token IDs back into text using the decode method:

In [35]:
tokenizer.decode(ids)

'" it\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

### Based on the output above, we can see that the decode method converted the token IDs back into the original text, apart from minor whitespace differences around punctuation.

#### So far, so good. We implemented a tokenizer capable of tokenizing and de-tokenizing text based on a snippet from the training set.
#### Let's now apply it to a new text sample that is not contained in the training set:

In [38]:
# text = "Hello, do you like tea?"
# print(tokenizer.encode(text))

#### Running the commented-out code above raises a KeyError. The problem is that the word "Hello" was not used in the short story The Verdict. Hence, it is not contained in the vocabulary. This highlights the need to consider large and diverse training sets to extend the vocabulary when working on LLMs.
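
The failure can be reproduced without the story file. The sketch below uses a small hypothetical vocabulary (`toy_vocab`) and a plain dictionary lookup, which is what the encode method relies on:

```python
toy_vocab = {"do": 0, "you": 1, "like": 2, "tea": 3, "?": 4}  # hypothetical vocabulary

tokens = ["Hello", "do", "you", "like", "tea", "?"]

try:
    toy_ids = [toy_vocab[t] for t in tokens]
except KeyError as err:
    # A plain dictionary lookup fails on the first token missing from the vocabulary
    missing = err.args[0]
    print("Not in vocabulary:", missing)
```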



## ADDING SPECIAL CONTEXT TOKENS
#### In the previous section, we implemented a simple tokenizer and applied it to a passage from the training set.
#### In this section, we will modify this tokenizer to handle unknown words.
#### In particular, we will turn the vocabulary and the SimpleTokenizerV1 class from the previous section into a new class, SimpleTokenizerV2, that supports two new tokens, <|unk|> and <|endoftext|>.

### We can modify the tokenizer to use an <|unk|> token if it encounters a word that is not part of the vocabulary. Furthermore, we add an <|endoftext|> token between unrelated texts.
#### For example, when training GPT-like LLMs on multiple independent documents or books, it is common to insert a token before each document or book that follows a previous text source.

#### Let's now modify the vocabulary to include these two special tokens, <|unk|> and <|endoftext|>, by adding them to the list of all unique words that we created in the previous section:

In [43]:
# Add the two special context tokens with list.extend, which appends additional entries to the list
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer, token in enumerate(all_tokens)}

In [44]:
len(vocab.items())

1132

### Based on the output above, the new vocabulary size is 1,132 (the vocabulary size in the previous section was 1,130).

### As an additional quick check, let's print the last 5 entries of the updated vocabulary:

In [46]:
for i, item in enumerate(list(vocab.items()) [-5:]):
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


### A simple text tokenizer that handles unknown words
#### Step 1: Replace unknown words with <|unk|> tokens
#### Step 2: Replace spaces before the specified punctuation

In [48]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}
    
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int 
            else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
        
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

In [49]:
# Using <|endoftext|> to join two unrelated texts

tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = "<|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea?<|endoftext|> In the sunlit terraces of the palace.


In [50]:
tokenizer.encode(text)

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]

In [51]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

#### Based on comparing the de-tokenized text above with the original input text, we know that the training dataset, Edith Wharton's short story The Verdict, did not contain the words "Hello" and "palace."

### So far, we have discussed tokenization as an essential step in processing text as input to LLMs. Depending on the LLM, some researchers also consider additional special tokens such as the following:

#### [BOS] (beginning of sequence): This token marks the start of a text. It signifies to the LLM where a piece of content begins.

#### [EOS] (end of sequence): This token is positioned at the end of a text, and is especially useful when concatenating multiple unrelated texts, similar to <|endoftext|>. For instance, when combining two different Wikipedia articles or books, the [EOS] token indicates where one article ends and the next one begins.

#### [PAD] (padding): When training LLMs with batch sizes larger than one, the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are extended or "padded" using the [PAD] token, up to the length of the longest text in the batch.
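
A minimal sketch of padding, assuming a hypothetical `PAD_ID` of 0 and a made-up batch of token ID lists:

```python
PAD_ID = 0  # hypothetical ID reserved for the [PAD] token

batch = [
    [5, 8, 2],
    [7, 1],
    [4, 9, 6, 3],
]

# Pad every sequence on the right up to the longest sequence in the batch
max_len = max(len(seq) for seq in batch)
padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]

for row in padded:
    print(row)
```

After padding, all rows have the same length and can be stacked into a single tensor.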

#### Note that the tokenizer used for GPT models does not need any of the tokens mentioned above; it only uses an <|endoftext|> token for simplicity.

#### The tokenizer used for GPT models also doesn't use an <|unk|> token for out-of-vocabulary words. Instead, GPT models use a byte pair encoding tokenizer, which breaks down words into subword units.

### BYTE PAIR ENCODING

### Since implementing BPE can be relatively complicated, we will use an existing open-source Python library called tiktoken.
#### This library implements the BPE algorithm very efficiently based on source code in Rust.

In [56]:
#! pip3 install tiktoken



In [57]:
import importlib.metadata  # import the submodule explicitly so that importlib.metadata is defined
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.12.0


#### Once installed, we can instantiate the BPE tokenizer from tiktoken as follows:

In [59]:
tokenizer = tiktoken.get_encoding("gpt2")

#### The usage of this tokenizer is similar to SimpleTokenizerV2 we implemented previously via an encode method:

In [61]:
text = (
    "Hello, do you like tea? <|endoftext|> in the sunlit terraces"
    "of someunknownplace."
)
integers = tokenizer.encode(text,allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 287, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 5372, 13]


#### The code above prints the token IDs shown.

#### We can then convert the token IDs back into the text using the decode method, similar to our SimpleTokenizerV2 earlier:

In [64]:
strings = tokenizer.decode(integers)

print(strings)

Hello, do you like tea? <|endoftext|> in the sunlit terracesof someunknownplace.


#### The BPE tokenizer above encodes and decodes unknown words, such as "someunknownplace", correctly. The BPE tokenizer can handle any unknown word. How does it achieve this without using <|unk|> tokens?

#### The algorithm underlying BPE breaks down words that aren't in its predefined vocabulary into smaller subword units or even individual characters. This enables it to handle out-of-vocabulary words.
#### So, thanks to the BPE algorithm, if the tokenizer encounters an unfamiliar word during tokenization, it can represent it as a sequence of subword tokens or characters.
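
As a rough sketch of this idea, the function below applies a list of merge rules to a single word, starting from individual characters. The function name `bpe_encode` and the `toy_merges` list are invented for this example. The real GPT-2 tokenizer learned its merges from a large corpus and operates on bytes rather than characters, so its actual split of a word will differ.

```python
def bpe_encode(word, merges):
    """Apply merge rules, in priority order, to one word (simplified sketch)."""
    symbols = list(word)  # start from single characters
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Fuse every adjacent occurrence of the pair (left, right)
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge rules, listed from highest to lowest priority
toy_merges = [
    ("u", "n"), ("k", "n"), ("o", "w"), ("kn", "ow"),
    ("p", "l"), ("a", "c"), ("pl", "ac"), ("plac", "e"),
]

known_parts = bpe_encode("unknownplace", toy_merges)
no_merges = bpe_encode("xyz", toy_merges)

print(known_parts)  # subword units built from the merges
print(no_merges)    # falls back to single characters
```

Because the fallback is always a sequence of smaller units, joining the pieces reproduces the original word, and no <|unk|> placeholder is needed.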


## CREATING INPUT-TARGET PAIRS

#### In this section, we implement a data loader that fetches the input-target pairs using a sliding window approach.
#### To get started, we will first tokenize the whole short story The Verdict, which we worked with earlier, using the BPE tokenizer:

In [69]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5146


#### Executing the code above will return 5146, the total number of tokens in the training set after applying the BPE tokenizer.
#### Next, we remove the first 50 tokens from the dataset for demonstration purposes, as this results in a slightly more interesting text passage in the next steps:

In [71]:
enc_sample = enc_text[50:]

#### One of the easiest and most intuitive ways to create the input-target pairs for the next-word prediction task is to create two variables, x and y, where x contains the input tokens and y contains the targets, which are the inputs shifted by 1:

#### The context size determines how many tokens are included in the input

In [74]:
context_size = 4  # length of the input
# A context size of 4 means that the model is trained to look at a sequence of 4 tokens
# to predict the next token in the sequence.
# The input x is the first 4 tokens [1, 2, 3, 4], and the target y is the next 4 tokens [2, 3, 4, 5]

x = enc_sample[:context_size]
y = enc_sample[1:context_size + 1]

print(f"x: {x}")
print(f"y:    {y}")

x: [290, 4920, 2241, 287]
y:    [4920, 2241, 287, 257]


#### Processing the inputs along with the targets, which are the inputs shifted by one position, we can then create the next-word prediction tasks as follows:

In [76]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


#### Everything left of the arrow (---->) refers to the input an LLM would receive, and the token ID on the right side of the arrow represents the target token ID that the LLM is supposed to predict.

#### For illustration purposes, let's repeat the previous code but convert the token IDs into text (decoding):

In [79]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(tokenizer.decode(context), "--->", tokenizer.decode([desired]))

 and --->  established
 and established --->  himself
 and established himself --->  in
 and established himself in --->  a


### We've now created the input-target pairs that we can use for LLM training in upcoming chapters.
#### There's only one more task before we can turn the tokens into embeddings: implementing an efficient data loader that iterates over the input dataset and returns the inputs and targets as PyTorch tensors, which can be thought of as multidimensional arrays.

#### In particular, we are interested in returning two tensors: an input tensor containing the text that the LLM sees and a target tensor that includes the targets for the LLM to predict.

## IMPLEMENTING A DATA LOADER

### For the efficient data loader implementation, we will use PyTorch's built-in Dataset and DataLoader classes.

<div class="alert alert-block alert-warning">

Step1: Tokenize the entire text

Step2: Use a sliding window to chunk the book into overlapping sequences of max_length

Step3: Return the total number of rows in the dataset

Step4: Return a single row from the dataset

</div>

In [85]:
import torch  # needed for torch.tensor in GPTDatasetV1
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

### Now the following code will use the GPTDatasetV1 to load the inputs in batches via a PyTorch DataLoader:

#### Step 1: Initialize the tokenizer
#### Step 2: Create the dataset
#### Step 3: drop_last=True drops the last batch if it is shorter than the specified batch_size to prevent loss spikes during training
#### Step 4: num_workers sets the number of CPU processes to use for preprocessing

In [88]:
def create_dataloader_v1(txt, batch_size=4, max_length=256, 
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader
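
The effect of drop_last on the number of batches can be sketched with plain arithmetic. The helper `num_batches` is a hypothetical name introduced for this illustration, and the sample counts are made up; it is meant to mirror the batch count of a DataLoader over a dataset like GPTDatasetV1, not to replace it.

```python
import math

def num_batches(num_samples, batch_size, drop_last):
    """Number of batches per epoch for a dataset with num_samples rows."""
    if drop_last:
        # An incomplete final batch is discarded
        return num_samples // batch_size
    # An incomplete final batch is kept as a shorter batch
    return math.ceil(num_samples / batch_size)

with_drop = num_batches(10, 4, drop_last=True)
without_drop = num_batches(10, 4, drop_last=False)

print(with_drop, without_drop)
```

With 10 samples and a batch size of 4, dropping the last batch leaves two full batches, while keeping it adds a third batch containing only 2 samples.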

#### Let's test the dataloader with a batch size of 1 for an LLM with a context size of 4.
#### This will develop an intuition of how the GPTDatasetV1 class and the create_dataloader_v1 function work together:

In [90]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

### Convert the dataloader into a Python iterator to fetch the next entry via Python's built-in next() function

In [92]:
import torch
print("PyTorch version:", torch.__version__)
dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

PyTorch version: 2.8.0+cpu
[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


#### The first_batch variable contains two tensors: the first tensor stores the input token IDs, and the second tensor stores the target token IDs.
#### Since the max_length is set to 4, each of the two tensors contains 4 token IDs.
#### Note that an input size of 4 is relatively small and only chosen for illustration purposes. It is common to train LLMs with input sizes of at least 256.

#### To illustrate the meaning of stride=1, let's fetch another batch from this dataset:

In [95]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


In [96]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])


#### Note that we increased the stride to 4. This uses the dataset fully (we don't skip a single word) while avoiding any overlap between the batches, since more overlap could lead to increased overfitting.
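
To compare the two stride settings side by side without loading the story, the sketch below reuses the loop from GPTDatasetV1 on plain Python lists. The function name `sliding_windows` and the stand-in token IDs are assumptions made for this example.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) chunks using the same loop as GPTDatasetV1."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

toy_ids = list(range(10))  # stand-in token IDs 0..9

overlapping = sliding_windows(toy_ids, max_length=4, stride=1)
disjoint = sliding_windows(toy_ids, max_length=4, stride=4)

print([inp for inp, _ in overlapping[:2]])  # consecutive inputs share 3 of 4 tokens
print([inp for inp, _ in disjoint])         # consecutive inputs share no tokens
print(len(overlapping), len(disjoint))
```

A stride of 1 shifts each window by one position, so neighbouring inputs overlap heavily; a stride equal to max_length produces fewer windows that do not overlap at all.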

# CREATING TOKEN EMBEDDING

### Let's illustrate how the token ID to embedding vector conversion works with a hands-on example.

In [100]:
input_ids = torch.tensor([2,3,5,1])

### Using the vocab_size and output_dim, we can instantiate an embedding layer in PyTorch, setting the random seed to 123 for reproducibility purposes:

In [102]:
vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

#### The print statement in the code prints the embedding layer's underlying weight matrix:

In [104]:
print(embedding_layer.weight)


Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


#### After we instantiated the embedding layer, let's now apply it to a token ID to obtain the embedding vector:

In [106]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


### In other words, the embedding layer is essentially a look-up operation that retrieves rows from the embedding layer's weight matrix via a token ID.

#### Let's now apply it to all four input IDs, torch.tensor([2, 3, 5, 1]):

In [109]:
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


#### Each row in this output matrix is obtained via a lookup operation from the embedding weight matrix.
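
The lookup can also be sketched in plain Python, using the weight values printed above. The helper names `lookup` and `one_hot_matmul` are introduced only for this illustration; the second one shows the commonly cited equivalence between selecting a row and multiplying a one-hot vector by the weight matrix.

```python
# Weight matrix copied from the printed embedding_layer.weight (6 tokens x 3 dimensions)
weights = [
    [0.3374, -0.1778, -0.1690],
    [0.9178, 1.5810, 1.3010],
    [1.2753, -0.2010, -0.1606],
    [-0.4015, 0.9666, -1.1481],
    [-1.1589, 0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]

def lookup(token_id):
    """Embedding as a lookup: return the row at index token_id."""
    return weights[token_id]

def one_hot_matmul(token_id):
    """Same row, obtained by multiplying a one-hot vector with the weight matrix."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(len(weights))]
    return [
        sum(one_hot[row] * weights[row][col] for row in range(len(weights)))
        for col in range(len(weights[0]))
    ]

rows = [lookup(i) for i in [2, 3, 5, 1]]
for row in rows:
    print(row)
print(one_hot_matmul(3) == lookup(3))
```

The rows printed for the IDs 2, 3, 5, and 1 match the rows of the tensor output above, in the same order.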

# POSITIONAL EMBEDDING

### Previously, we focused on very small embedding sizes in this chapter for illustration purposes.
### We now consider more realistic and useful embedding sizes and encode the input tokens into a 256-dimensional vector representation.


In [157]:
vocab_size = 50257
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

### Using the token_embedding_layer above, if we sample data from the data loader, we embed each token in each batch into a 256-dimensional vector. If we have a batch size of 8 with four tokens each, the result will be an 8 x 4 x 256 tensor.

In [160]:
max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
    

In [164]:
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])


#### As we can see, the token ID tensor is 8 x 4-dimensional, meaning that the data batch consists of 8 text samples with 4 tokens each.

#### Let's now use the embedding layer to embed these token IDs into 256-dimensional vectors:

In [174]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


As we can tell based on the 8 x 4 x 256-dimensional tensor output, each token ID is now embedded as a 256-dimensional vector.

In [177]:
context_length = max_length 
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

In [181]:
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


As we can see, the positional embedding tensor consists of four 256-dimensional vectors. We can now add these directly to the token embeddings: PyTorch broadcasts the 4 x 256-dimensional pos_embeddings tensor across the batch, adding it to each of the eight 4 x 256-dimensional token embedding tensors.
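
A minimal sketch of this addition, using random stand-in tensors with the same shapes as token_embeddings and pos_embeddings. The `*_demo` names are assumptions chosen so that the sketch does not overwrite the variables defined above.

```python
import torch

torch.manual_seed(123)

# Stand-ins with the same shapes as token_embeddings and pos_embeddings
token_emb_demo = torch.randn(8, 4, 256)
pos_emb_demo = torch.randn(4, 256)

# Broadcasting adds the same 4 x 256 positional tensor to each of the 8 samples
input_emb_demo = token_emb_demo + pos_emb_demo

print(input_emb_demo.shape)
```

The result keeps the 8 x 4 x 256 shape of the token embeddings, and every sample in the batch receives the same positional offsets.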