## Implementing GPT-2

![GPT-2](/home/znyd/hacking/meow-former/images/gpt_2_impl_plan.png)




## Data Preparation and Sampling

### Input text -> tokens (words or subwords) -> token IDs (via a vocabulary) -> embeddings -> decoder -> post-processing -> output

<div style="display: flex; justify-content: space-around;">
  <img src="/home/znyd/hacking/meow-former/images/text_to_embed.png" alt="Image 1" width="600">
  <img src="/home/znyd/hacking/meow-former/images/vocab.png" alt="Image 2" width="600">
</div>


#### Here we implement a simple tokenization scheme that splits text into words and punctuation marks.

In [2]:
with open('the-verdict.txt', 'r', encoding='utf-8') as f:
    loaded_txt = f.read()

print(len(loaded_txt))
print(loaded_txt[:50])

20479
I HAD always thought Jack Gisburn rather a cheap g


In [3]:
import re

result = re.split(r'([,.:;?_!"()\']|--|\s)', loaded_txt)
tokens = [item.strip() for item in result if item.strip()]
print(len(tokens))
print(tokens[:30])

4690
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
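To see what the split pattern does in isolation, here is a small sketch on a made-up sentence (the sample string is an assumption for illustration, not taken from the story). The capture group in the pattern makes `re.split` keep the delimiters, and the list comprehension then drops empty and whitespace-only pieces.

```python
import re

# Hypothetical sample text, chosen only to exercise the pattern
sample = "Hello, world. Is this-- a test?"

# The parentheses form a capture group, so the matched delimiters
# (punctuation, "--", whitespace) are returned alongside the words
parts = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

# Drop empty strings and whitespace-only pieces
sample_tokens = [p.strip() for p in parts if p.strip()]
print(sample_tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```

Note that whitespace itself is discarded, which keeps the vocabulary small but means the exact original spacing cannot be reconstructed from the tokens.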


In [4]:
unique_word = sorted(list(set(tokens)))
vocab_size = len(unique_word)
print(vocab_size)


1130


In [5]:
vocab = {token: idx for idx, token in enumerate(unique_word)}
# Print the first 52 (token, id) pairs of the vocabulary
for idx, item in enumerate(vocab.items()):
    print(item)
    if idx > 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)
('His', 51)


In [6]:
class TokenizerV1:
    def __init__(self, vocab):
        self.stoi = vocab
        self.itos = {i:s for s, i in vocab.items()} 

    def encode(self, txt_inp):
        processed = re.split(r'([,.:;?_!"()\']|--|\s)', txt_inp)
        tokens = [item.strip() for item in processed if item.strip()]
        token_ids = [self.stoi[s] for s in tokens]
        return token_ids
    
    def decode(self, token_ids):
        tokens = " ".join([self.itos[ids] for ids in token_ids])
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', tokens)
        return text

    


In [7]:
tokenizer_v1 = TokenizerV1(vocab)
text = """"It's the last he painted, you know,"
Mrs. Gisburn said with pardonable pride."""
ids = tokenizer_v1.encode(text)
print(ids)
print(tokenizer_v1.decode(ids))

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]
" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.

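The decoded text above contains `It' s` and a space after the opening quote. A minimal sketch of why, using a hand-written token list (an assumption for illustration): `decode` joins all tokens with spaces and then removes whitespace only *before* the listed punctuation, so a space that follows an apostrophe or an opening quote survives.

```python
import re

# Hypothetical token list, mirroring how "It's fine." is split by the tokenizer
sample_tokens = ['It', "'", 's', 'fine', '.']

joined = " ".join(sample_tokens)  # "It ' s fine ."

# Same substitution as in decode(): strip whitespace that precedes punctuation
decoded = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)
print(decoded)
# It' s fine.
```

A round trip through this tokenizer is therefore not guaranteed to reproduce the input exactly.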

<h3>So far this works well, but encoding fails as soon as the input contains a word that is not in our vocabulary.</h3>

In [8]:
input_text = "Hello, Mrs. Gisburn"  # named input_text to avoid shadowing the built-in input()
token_ids = tokenizer_v1.encode(input_text)
print(token_ids)

KeyError: 'Hello'

<h4>
To handle this, we add an <b>&lt;|unk|&gt;</b> token as a placeholder for unknown words and an <b>&lt;|endoftext|&gt;</b> token to separate unrelated paragraphs or documents.
</h4>

In [9]:
unique_word.append('<|endoftext|>')
unique_word.append('<|unk|>')
vocab = {token:integer for integer,token in enumerate(unique_word)}
print(len(vocab.items()))

1132


In [10]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


In [11]:
class TokenizerV2:
    def __init__(self, vocab):
        self.stoi = vocab
        self.itos = {i:s for s, i in vocab.items()} 

    def encode(self, txt_inp):
        processed = re.split(r'([,.:;?_!"()\']|--|\s)', txt_inp)
        tokens = [item.strip() for item in processed if item.strip()]
        # Map any token that is not in the vocabulary to <|unk|>
        tokens = [tkn if tkn in self.stoi else "<|unk|>" for tkn in tokens]
        token_ids = [self.stoi[s] for s in tokens]
        return token_ids
    
    def decode(self, token_ids):
        tokens = " ".join([self.itos[ids] for ids in token_ids])
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', tokens)
        return text

In [12]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
print(text)

tokenizer_v2 = TokenizerV2(vocab)
ids = tokenizer_v2.encode(text)
print(ids)
txt = tokenizer_v2.decode(ids)
print(txt)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.
[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]
<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.


<h4>
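However, the &lt;|unk|&gt; placeholder discards the original word, so we cannot recover what it was. A word-level scheme also needs a very large vocabulary to cover real text, and it still cannot tokenize unusual or unseen words.
<br>
<br>
That is why we use a subword tokenizer based on byte pair encoding (BPE), which breaks unknown words into smaller known pieces.
Next, we use the GPT-2 BPE tokenizer from tiktoken.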
</h4>

<div>
<img src="/home/znyd/hacking/meow-former/images/sub_word_bpe.png" width="600">
</div>

In [13]:
import tiktoken

tokenizer = tiktoken.get_encoding('gpt2')

In [14]:
text = (
"Hello, do you like tea? <|endoftext|> In the sunlit terraces"
"of someunknownPlace."
)
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)
strings = tokenizer.decode(integers)
print(strings)
print()
print("see we can even tokenize this weird word 'someunknownPlace' with BPE")

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]
Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.

see we can even tokenize this weird word 'someunknownPlace' with BPE
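To make the idea behind BPE more concrete, here is a toy sketch of the merge step: repeatedly find the most frequent adjacent pair of symbols and fuse it into one symbol. This is only an illustration under simplifying assumptions (character-level symbols, a four-word corpus, hypothetical helper names `most_frequent_pair` and `merge_pair`). It is not how tiktoken is implemented, and the real GPT-2 tokenizer works on bytes with a fixed, pre-trained merge table.

```python
from collections import Counter

def most_frequent_pair(seqs):
    """Return the most common adjacent symbol pair across all sequences."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one fused symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# Toy corpus (an assumption for illustration); start from single characters
words = ["low", "lower", "lowest", "slow"]
seqs = [list(w) for w in words]

# Apply two merge rounds
for _ in range(2):
    pair = most_frequent_pair(seqs)
    seqs = [merge_pair(s, pair) for s in seqs]

print(seqs)
# [['low'], ['low', 'e', 'r'], ['low', 'e', 's', 't'], ['s', 'low']]
```

After two merges the frequent chunk `low` has become a single symbol, while rarer endings stay split into characters. This is why a BPE tokenizer can still represent a word such as `someunknownPlace`: it falls back to smaller known pieces instead of an unknown-token placeholder.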


#### Create input-target token pairs for training next-token prediction

<div style="display: flex; justify-content: space-around;">
<img src="/home/znyd/hacking/meow-former/images/inp-target_pair.png">
<img src="/home/znyd/hacking/meow-former/images/sliding_window_txt-target.png">
</div>
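A minimal sketch of what an input-target pair means for next-token prediction, using plain Python lists. The IDs are the first five GPT-2 token IDs of the story as they appear in the dataloader output further down; `context_size` is an illustrative name. The target sequence is the input shifted by one position, so every prefix of the input has the following token as its label.

```python
# First five token IDs of the text (taken from the dataloader output in this notebook)
token_ids = [40, 367, 2885, 1464, 1807]
context_size = 4

x = token_ids[:context_size]        # input
y = token_ids[1:context_size + 1]   # target: the input shifted by one

# Each growing prefix of the input predicts the token that follows it
steps = [(token_ids[:i], token_ids[i]) for i in range(1, context_size + 1)]
for context, desired in steps:
    print(context, "->", desired)
# [40] -> 367
# [40, 367] -> 2885
# [40, 367, 2885] -> 1464
# [40, 367, 2885, 1464] -> 1807
```

One input row of length 4 therefore yields four training signals, one per position.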

### Dataloader

In [15]:
import torch
from torch.utils.data import DataLoader, Dataset

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt)

        for i in range(0, len(token_ids)-max_length, stride):
            inp_chunk = token_ids[i:i+max_length]
            target_chunk = token_ids[i+1:i+max_length+1]
            self.input_ids.append(torch.tensor(inp_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]



In [16]:
def dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
    # Tokenize with the GPT-2 BPE tokenizer and wrap the sliding-window dataset in a DataLoader
    tokenizer = tiktoken.get_encoding('gpt2')
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

In [17]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

dataloader = dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=4,
    shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])
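In the output above, `stride` equals `max_length`, so consecutive input rows do not overlap. The sketch below shows the effect of a smaller stride using the same window loop as `GPTDatasetV1`, on a dummy list of integers; `make_windows` is a hypothetical helper written for this illustration.

```python
def make_windows(token_ids, max_length, stride):
    """Return the input chunks produced by a sliding window (same loop bounds as GPTDatasetV1)."""
    return [
        token_ids[i:i + max_length]
        for i in range(0, len(token_ids) - max_length, stride)
    ]

ids = list(range(10))  # dummy token IDs 0..9

no_overlap = make_windows(ids, max_length=4, stride=4)
overlap = make_windows(ids, max_length=4, stride=2)

print(no_overlap)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(overlap)     # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7]]
```

A smaller stride produces more training samples from the same text, at the cost of repeated tokens across samples.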


<h4>
Token Embedding and Positional Embedding
</h4>

In [18]:
# Basic working of embedding layer

input_ids = torch.tensor([2, 3, 5, 1])

vocab_size = 6
output_dim = 3

embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)
print()
print("Here every row represents one token ID, and every token has a 3-dimensional embedding")

Parameter containing:
tensor([[ 1.4069,  0.1522, -0.3106],
        [-1.2768,  0.7557, -1.1393],
        [-1.0281,  0.0889,  1.0486],
        [ 0.1386, -0.0095,  0.5327],
        [ 0.6682, -0.4070, -0.2591],
        [ 0.7890,  0.3928, -1.2502]], requires_grad=True)

Here every row represents one token ID, and every token has a 3-dimensional embedding


In [19]:
print("Retrieving Embedding Vector for a Token")
print()
print(embedding_layer(torch.tensor([3])))
print()
print(embedding_layer(input_ids))

Retrieving Embedding Vector for a Token

tensor([[ 0.1386, -0.0095,  0.5327]], grad_fn=<EmbeddingBackward0>)

tensor([[-1.0281,  0.0889,  1.0486],
        [ 0.1386, -0.0095,  0.5327],
        [ 0.7890,  0.3928, -1.2502],
        [-1.2768,  0.7557, -1.1393]], grad_fn=<EmbeddingBackward0>)


<h5>

The embedding layer approach described here is essentially just a more efficient way of
implementing one-hot encoding followed by matrix multiplication in a fully connected layer,
which is illustrated in the supplementary code on GitHub at
https://mng.bz/ZEB5.<br> Because the embedding layer is just a more efficient
implementation equivalent to the one-hot encoding and matrix-multiplication approach,
it can be seen as a neural network layer that can be optimized via backpropagation.
</h5>
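The equivalence described above can be checked directly. The following is a small sketch, not the supplementary code referenced in the note: it builds the same toy layer (6 tokens, 3 dimensions) with an arbitrary seed, then compares the lookup result with an explicit one-hot matrix multiplied by the layer's weight matrix.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)  # arbitrary seed, only so the random weights are repeatable

vocab_size, output_dim = 6, 3
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
input_ids = torch.tensor([2, 3, 5, 1])

# Path 1: direct row lookup in the weight matrix
looked_up = embedding_layer(input_ids)

# Path 2: one-hot encode the IDs, then multiply by the same weight matrix
one_hot = F.one_hot(input_ids, num_classes=vocab_size).float()  # shape (4, 6)
multiplied = one_hot @ embedding_layer.weight                   # shape (4, 3)

print(torch.allclose(looked_up, multiplied))  # True
```

The lookup avoids materializing the mostly-zero one-hot matrix, which matters when the vocabulary has tens of thousands of entries.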

<h4>In principle, the deterministic, position-independent embedding of the token ID is
good for reproducibility purposes. However, since the self-attention mechanism of
LLMs itself is also position-agnostic, it is helpful to inject additional position
information into the LLM.
</h4>
<div>
<img src="/home/znyd/hacking/meow-former/images/positional_emb.png">
</div>
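A small sketch of the problem and the fix, using toy sizes and randomly initialized layers (all names and sizes here are assumptions for illustration). The same token ID appears at positions 0 and 2: its token embedding is identical in both places, and only after adding a position-specific vector do the two rows differ.

```python
import torch

torch.manual_seed(0)  # arbitrary seed for repeatable random weights

tok_layer = torch.nn.Embedding(6, 3)   # toy vocabulary of 6 tokens, 3 dimensions
pos_layer = torch.nn.Embedding(4, 3)   # one vector per position, context length 4

ids = torch.tensor([2, 5, 2, 1])       # token 2 occurs at positions 0 and 2

tok = tok_layer(ids)
same_before = torch.equal(tok[0], tok[2])   # identical: the lookup ignores position

combined = tok + pos_layer(torch.arange(4))
same_after = torch.equal(combined[0], combined[2])  # differ once positions are added

print(same_before, same_after)
```

With randomly initialized position vectors, rows 0 and 2 of `pos_layer` are different, so the combined embeddings let later layers tell the two occurrences apart.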

In [20]:
vocab_size = 50257
output_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

In [21]:
dataloader = dataloader_v1(
    raw_text,
    batch_size=8,
    max_length=4,
    stride=4,
    shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])


In [None]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


#### Basic absolute positional embedding (one learnable vector per position)

In [30]:
print(torch.arange(4))

tensor([0, 1, 2, 3])


In [29]:
context_length = 4 
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(pos_embeddings.shape)
print(pos_embeddings)

torch.Size([4, 256])
tensor([[ 0.3184, -0.9020,  0.1182,  ...,  0.7687,  0.8159,  0.4627],
        [-1.6514,  1.0246, -1.1728,  ..., -2.3351, -0.7295,  0.3487],
        [-0.8057, -0.3397, -0.2613,  ..., -1.1176,  1.2932,  0.7838],
        [ 0.4690, -1.3245, -1.5091,  ...,  0.8208, -0.0506,  1.1770]],
       grad_fn=<EmbeddingBackward0>)


In [31]:
# Finally the input embedding 

input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])
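The addition above combines a `[8, 4, 256]` tensor with a `[4, 256]` tensor. This works through PyTorch broadcasting: the positional embeddings are reused for every sequence in the batch. A minimal sketch with stand-in tensors (zeros for the token embeddings, random values for the positions, both assumptions for illustration) makes this visible.

```python
import torch

batch_size, context_length, output_dim = 8, 4, 256

# Stand-ins: zeros for token embeddings so the sum exposes the positional part
token_embeddings = torch.zeros(batch_size, context_length, output_dim)
pos_embeddings = torch.randn(context_length, output_dim)

# [8, 4, 256] + [4, 256] -> the [4, 256] tensor is broadcast over the batch dimension
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)  # torch.Size([8, 4, 256])

# Every sequence in the batch received exactly the same positional vectors
same_for_all = all(
    torch.equal(input_embeddings[b], pos_embeddings) for b in range(batch_size)
)
print(same_for_all)  # True
```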


## Now it's time for the Attention Mechanism