
In [1]:
!pip install tiktoken



### Imports

In [2]:
import torch
import torch.nn as nn
import tiktoken

`cfg` is a dictionary whose fields describe the LLM:
- `vocab_size` - vocabulary size
- `context_length` - maximum number of tokens the model can process at once
- `emb_dim` - embedding dimension, which is both the input and the output dimension of each transformer block
- `n_heads` - number of attention heads inside a transformer block
- `n_layers` - number of transformer blocks
- `drop_rate` - dropout rate
- `qkv_bias` - whether the query, key, and value projections include a bias term

GPT2 LLM architecture:

token embedding + position embedding -> dropout -> transformer blocks -> layer norm -> output head -> logits

1. Why is `qkv_bias` disabled in LLMs?
2. Why is dropout applied just before the transformer blocks?

### Layer Normalization

Why Layer Normalization?
1. Batch norm computes statistics across the batch dimension, so token representations from different sequences are mixed, which is undesirable.
2. Sequences usually contain different numbers of tokens, and the representations of padding tokens should not enter the statistics.
3. Because LLMs are parameter-heavy, the batch size used for training and for testing may differ due to hardware constraints. Layer norm avoids this inconsistency by normalizing each token's representation across its feature dimension, independently of the other tokens and sequences in the batch.
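The contrast between the two normalizations can be checked by hand. The following is a minimal sketch in plain Python (no PyTorch). The helper names `layer_norm` and `batch_norm_feature` and the toy vectors are assumptions made for illustration, and the learnable scale and shift are left out.

```python
from statistics import fmean, pvariance

EPS = 1e-5

def layer_norm(token):
    """Normalize one token vector using its own mean and (biased) variance."""
    mean = fmean(token)
    var = pvariance(token, mu=mean)
    return [(v - mean) / (var + EPS) ** 0.5 for v in token]

def batch_norm_feature(batch, feature):
    """Normalize one feature using statistics taken across the batch."""
    column = [token[feature] for token in batch]
    mean = fmean(column)
    var = pvariance(column, mu=mean)
    return [(v - mean) / (var + EPS) ** 0.5 for v in column]

token = [1.0, 2.0, 3.0, 4.0]
neighbour_a = [0.0, 0.0, 0.0, 0.0]
neighbour_b = [10.0, -5.0, 7.0, 2.0]

# Layer norm never looks at the neighbour, so the result is the same in any batch.
ln_out = layer_norm(token)

# Batch norm: the normalized value of the token's first feature depends on the neighbour.
bn_with_a = batch_norm_feature([token, neighbour_a], feature=0)[0]
bn_with_b = batch_norm_feature([token, neighbour_b], feature=0)[0]

print(ln_out)
print(bn_with_a, bn_with_b)
```

The layer-norm output has mean close to 0 and variance close to 1 regardless of the batch, while the batch-norm value for the same token changes sign when only its neighbour changes.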

In [3]:
class LayerNorm(nn.Module):
  def __init__(self, emb_dim):
    super(LayerNorm, self).__init__()
    self.eps = 1e-5

    # learnable scale and shift parameters
    self.scale = nn.Parameter(torch.ones(emb_dim))
    self.shift = nn.Parameter(torch.zeros(emb_dim))

  def forward(self, x):
    # calculate mean and variance along emb_dim (feature dimension)
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)

    # normalize input --> scale and shift it to return output
    normalized_x = (x - mean)/torch.sqrt(var + self.eps)
    return self.scale * normalized_x + self.shift

### GELU activation

$GELU(x) \approx 0.5\cdot x\cdot(1 + \tanh[\sqrt{\dfrac{2}{\pi}}\cdot(x + 0.044715\cdot x^3)])$

In [4]:
class GELU(nn.Module):
  def __init__(self):
    super(GELU, self).__init__()

  def forward(self, x):
    return 0.5 * x * (1.0 + torch.tanh(
        torch.sqrt(torch.tensor(2.0 / torch.pi)) * (x + 0.044715*x**3)
        ))
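As a quick sanity check of the approximation, the sketch below compares it with the exact definition $GELU(x) = x\cdot\Phi(x)$, where $\Phi$ is the standard normal CDF. It uses only the `math` module; the function names `gelu_exact` and `gelu_tanh` are illustrative and not part of the notebook.

```python
import math

def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation, the same formula as in the GELU class above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# evaluate both versions on a grid over [-5, 5]
xs = [i / 10 for i in range(-50, 51)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f'largest gap on [-5, 5]: {max_gap:.2e}')
```

On this grid the two versions agree closely, which is why the cheaper tanh form is a reasonable stand-in.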

### Feed Forward Network

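The feed forward network expands each token's representation to 4 × `emb_dim`, applies GELU, and projects it back to `emb_dim`. This is where the model learns and generalizes from the specific patterns in the dataset.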

In [5]:
class FeedForward(nn.Module):
  def __init__(self, cfg):
    super(FeedForward, self).__init__()

    self.layers = nn.Sequential(nn.Linear(cfg['emb_dim'], 4 * cfg['emb_dim']),
                                GELU(),
                                nn.Linear(4 * cfg['emb_dim'], cfg['emb_dim']))

  def forward(self, x):
    return self.layers(x)

### Masked Multi Head Self Attention

This was implemented in `Self_Attention_Mechanism.ipynb`, so we simply take it from there.

In [6]:
class MultiHeadSelfAttention(nn.Module):
  def __init__(self, d_in, d_out, context_length, num_heads, drop_rate, qkv_bias=False):
    super(MultiHeadSelfAttention, self).__init__()

    self.in_dim = d_in
    self.out_dim = d_out
    self.num_heads = num_heads
    self.head_dim = self.out_dim//self.num_heads
    self.context_length = context_length

    self.W_query = nn.Linear(self.in_dim, self.out_dim, bias=qkv_bias)
    self.W_key = nn.Linear(self.in_dim, self.out_dim, bias=qkv_bias)
    self.W_value = nn.Linear(self.in_dim, self.out_dim, bias=qkv_bias)
    # combines the outputs from all heads
    self.out_proj = nn.Linear(self.out_dim, self.out_dim)

    self.dropout = nn.Dropout(drop_rate)
    self.register_buffer('mask',
                         torch.triu(torch.ones(self.context_length, self.context_length), diagonal=1))

  def forward(self, x):
    num_seq, num_tokens, _ = x.shape

    query = self.W_query(x)
    key = self.W_key(x)
    value = self.W_value(x)

    # split the last dimension into heads; each projection is reshaped from its own tensor
    query = query.view(num_seq, num_tokens, self.num_heads, self.head_dim)
    key = key.view(num_seq, num_tokens, self.num_heads, self.head_dim)
    value = value.view(num_seq, num_tokens, self.num_heads, self.head_dim)

    attention_scores = query.transpose(1, 2) @ key.transpose(1, 2).transpose(2, 3)   # --> num_seq, num_heads, num_tokens, num_tokens
    attention_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
    attention_weights = torch.softmax(attention_scores/self.head_dim ** 0.5, dim=-1)
    drop_attention_weights = self.dropout(attention_weights)

    context_vector = drop_attention_weights @ value.transpose(1, 2) # --> num_seq, num_heads, num_tokens, head_dim
    context_vector = context_vector.transpose(1, 2) # --> num_seq, num_tokens, num_heads, head_dim --> transpose is not the same as view/reshape
    # --> create same memory mapping as if created from scratch
    context_vector = context_vector.contiguous().view(num_seq, num_tokens, self.out_dim) # --> num_seq, num_tokens, out_dim
    context_vector = self.out_proj(context_vector) # --> num_seq, num_tokens, out_dim
    return context_vector
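To see what the causal mask and the softmax do together, here is a minimal sketch in plain Python for a single head and a single sequence. The helper name `causal_attention_weights` and the toy scores are assumptions made for illustration, and dropout is omitted.

```python
import math

def causal_attention_weights(scores, head_dim):
    """Row i may attend to positions 0..i only; each row is softmax-normalized."""
    weights = []
    for i, row in enumerate(scores):
        # future positions (j > i) get -inf, the rest are scaled by sqrt(head_dim)
        masked = [s / math.sqrt(head_dim) if j <= i else -math.inf
                  for j, s in enumerate(row)]
        peak = max(masked)                           # subtract the max for numerical stability
        exps = [math.exp(m - peak) for m in masked]  # exp(-inf) is exactly 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

toy_scores = [[0.5, 2.0, -1.0],
              [1.0, 1.0, 3.0],
              [0.2, 0.4, 0.6]]
weights = causal_attention_weights(toy_scores, head_dim=4)
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its row is `[1.0, 0.0, 0.0]`. Every row sums to 1, and all entries above the diagonal are zero.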

### Transformer Block Architecture

Connects all the above components into one block.

(input -> Layer Norm 1 -> Masked Multi Head Self Attention -> Dropout) -> (Layer Norm 2 -> Feed Forward Network -> Dropout) -> output

Input and output dimensions are the same so that multiple blocks can be stacked on top of each other. Each parenthesized group in the architecture above is wrapped in a residual connection from its input to its output.

Each operation is applied position-wise on every sequence; that is, every token in every sequence goes through each of the above operations. Tokens within a sequence interact with each other only through the self-attention mechanism. All other operations are applied at each token position independently of the others.

In [7]:
class TransformerBlock(nn.Module):
  def __init__(self, cfg):
    super(TransformerBlock, self).__init__()
    self.att = MultiHeadSelfAttention(cfg['emb_dim'], cfg['emb_dim'],
                                      cfg['context_length'], cfg['n_heads'],
                                      cfg['drop_rate'], cfg['qkv_bias'])

    self.ff = FeedForward(cfg)
    self.norm1 = LayerNorm(cfg['emb_dim'])
    self.norm2 = LayerNorm(cfg['emb_dim'])
    self.drop_shortcut = nn.Dropout(cfg['drop_rate'])

  def forward(self, x):
    x = x + self.drop_shortcut(self.att(self.norm1(x)))
    x = x + self.drop_shortcut(self.ff(self.norm2(x)))
    return x

### GPT2 architecture

token embedding + position embedding -> dropout -> transformer blocks -> layer norm -> output head -> logits

Combining all the above modules into the GPT2 architecture.

1. Tokenizer takes care of generating token ids from text.
2. Token ids are converted into token embeddings, and position embeddings are added to them in the first part of GPT2.
3. A dropout is then applied.
4. The resulting tensor is passed through the transformer blocks.
5. Next, the final layer norm and the output head follow sequentially.
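The steps above can be followed as a sequence of tensor shapes. The sketch below creates no tensors and only does the bookkeeping; `trace_shapes` is a hypothetical helper, and the batch size and sequence length are arbitrary example values.

```python
def trace_shapes(cfg, batch_size, num_tokens):
    """Return (stage, shape) pairs for one forward pass; no tensors are created."""
    if num_tokens > cfg['context_length']:
        raise ValueError('sequence is longer than context_length')
    hidden = (batch_size, num_tokens, cfg['emb_dim'])
    return [
        ('token ids', (batch_size, num_tokens)),
        ('token embeddings', hidden),
        ('position embeddings (broadcast over the batch)', (num_tokens, cfg['emb_dim'])),
        ('sum of embeddings, after dropout', hidden),
        ('after transformer blocks', hidden),
        ('after final layer norm', hidden),
        ('logits from output head', (batch_size, num_tokens, cfg['vocab_size'])),
    ]

cfg = {'vocab_size': 50257, 'context_length': 1024, 'emb_dim': 768}
for stage, shape in trace_shapes(cfg, batch_size=2, num_tokens=4):
    print(f'{stage}: {shape}')
```

Only the output head changes the last dimension, from `emb_dim` to `vocab_size`, giving one score per vocabulary entry at every position.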

For training stability, the GPT2 architecture contains many skip connections and layer normalizations.

Why is there no bias in the final output layer?

In [8]:
class GPT2Model(nn.Module):
  def __init__(self, cfg):
    super(GPT2Model, self).__init__()

    self.tok_emb = nn.Embedding(cfg['vocab_size'], cfg['emb_dim'])
    self.pos_emb = nn.Embedding(cfg['context_length'], cfg['emb_dim'])
    self.drop_emb = nn.Dropout(cfg['drop_rate'])
    self.trf_blocks = nn.Sequential(*[TransformerBlock(cfg) for _ in range(cfg['n_layers'])])
    # * unpacks the list into the positional arguments that nn.Sequential expects
    self.final_norm = LayerNorm(cfg['emb_dim'])
    self.out_head = nn.Linear(cfg['emb_dim'], cfg['vocab_size'], bias=False)

  def forward(self, x): # --> batch_size, num_tokens
    tok_emb = self.tok_emb(x)
    # position ids 0 .. num_tokens-1, created on the same device as the input;
    # `device` is an argument of torch.arange, not of the embedding call
    pos_emb = self.pos_emb(torch.arange(x.shape[1], device=x.device))
    x = tok_emb + pos_emb

    x = self.drop_emb(x)

    x = self.trf_blocks(x)

    x = self.final_norm(x)
    x = self.out_head(x)
    return x

### Number of parameters in GPT2 model.

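GPT2 uses *weight tying*: the token embedding layer and the output head share the same weight matrix.

Many modern LLMs use separate weights for these two layers, which can improve training and model performance, and we do the same here. As a result, the smallest configuration has about 163M parameters instead of 124M; subtracting the output head (38,597,376 parameters) gives roughly 124M.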
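The effect of weight tying on the parameter count can be worked out from the configuration alone. The following is a sketch of the arithmetic for the architecture in this notebook (attention output projection with bias, two biased linear layers in the feed forward network, two layer norms per block). `count_params` and its `tie_weights` flag are hypothetical names introduced here, and the total also counts the final layer norm.

```python
def count_params(cfg, tie_weights=False):
    """Closed-form parameter counts for the architecture described in this notebook."""
    d, vocab = cfg['emb_dim'], cfg['vocab_size']
    ctx, layers = cfg['context_length'], cfg['n_layers']
    qkv_bias = d if cfg['qkv_bias'] else 0
    attn = 3 * (d * d + qkv_bias) + (d * d + d)   # W_query, W_key, W_value + out_proj (with bias)
    ffn = (d * 4 * d + 4 * d) + (4 * d * d + d)   # two linear layers with biases
    norms = 2 * 2 * d                             # scale and shift of norm1 and norm2
    counts = {
        'tok_emb': vocab * d,
        'pos_emb': ctx * d,
        'trf_blocks': layers * (attn + ffn + norms),
        'final_norm': 2 * d,
        'out_head': 0 if tie_weights else d * vocab,  # a tied head reuses tok_emb
    }
    counts['total'] = sum(counts.values())
    return counts

cfg_124m = {'vocab_size': 50257, 'context_length': 1024, 'emb_dim': 768,
            'n_heads': 12, 'n_layers': 12, 'drop_rate': 0.1, 'qkv_bias': False}

untied = count_params(cfg_124m)
tied = count_params(cfg_124m, tie_weights=True)
print(f"untied total: {untied['total']:,}")
print(f"tied total:   {tied['total']:,}")
```

With tying, the count drops by exactly `vocab_size * emb_dim`, which is where the "124M" in the model name comes from.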

In [9]:
GPT2_CONFIG_124M = {'vocab_size': 50257,
                    'context_length': 1024,
                    'emb_dim': 768,
                    'n_heads': 12,
                    'n_layers': 12,
                    'drop_rate': 0.1,
                    'qkv_bias': False}

GPT2_MEDIUM = {'vocab_size': 50257,
               'context_length': 1024,
               'emb_dim': 1024,
               'n_heads': 16,
               'n_layers': 24,
               'drop_rate': 0.1,
               'qkv_bias': False}

GPT2_LARGE = {'vocab_size': 50257,
              'context_length': 1024,
              'emb_dim': 1280,
              'n_heads': 20,
              'n_layers': 36,
              'drop_rate': 0.1,
              'qkv_bias': False}

GPT2_XL = {'vocab_size': 50257,
           'context_length': 1024,
           'emb_dim': 1600,
           'n_heads': 25,
           'n_layers': 48,
           'drop_rate': 0.1,
           'qkv_bias': False}
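`MultiHeadSelfAttention` computes `head_dim` with integer division, so `emb_dim` should be a multiple of `n_heads`; otherwise the reshape into heads no longer matches `out_dim`. A small check, using a hypothetical helper `head_dim_for`, confirms this for the four configurations:

```python
def head_dim_for(emb_dim, n_heads):
    """Per-head dimension; raises if emb_dim cannot be split evenly across heads."""
    if emb_dim % n_heads != 0:
        raise ValueError(f'emb_dim={emb_dim} is not divisible by n_heads={n_heads}')
    return emb_dim // n_heads

# (emb_dim, n_heads) pairs copied from the four configurations above
sizes = {'124M': (768, 12), 'medium': (1024, 16), 'large': (1280, 20), 'xl': (1600, 25)}
head_dims = {name: head_dim_for(*pair) for name, pair in sizes.items()}
print(head_dims)
```

All four configurations use the same per-head dimension of 64; the larger models add heads and layers rather than widening each head.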

In [10]:
def calculate_num_params(gpt_model, exclude_output=False):
  # size in MB (base 1024), assuming 4 bytes per float32 parameter
  size_mb = lambda x: (4 * x)/(1024 ** 2)
  # total number of elements over an iterable of parameters
  params = lambda x: sum(p.numel() for p in x)

  tok_emb = params(gpt_model.tok_emb.parameters())
  pos_emb = params(gpt_model.pos_emb.parameters())
  trf_blocks = params(gpt_model.trf_blocks.parameters())
  out_head = params(gpt_model.out_head.parameters())

  # the final layer norm (2 * emb_dim parameters) is not counted in the total
  if exclude_output:
    total = tok_emb + pos_emb + trf_blocks
  else:
    total = tok_emb + pos_emb + trf_blocks + out_head

  print('Trainable Parameters ... ')
  print(f'Token Embedding layer: {tok_emb}, size: {size_mb(tok_emb)} MB')
  print(f'Position Embedding layer: {pos_emb}, size: {size_mb(pos_emb)} MB')
  print(f'Transformer Blocks: {trf_blocks}, size: {size_mb(trf_blocks)} MB')
  print(f'Output Head: {out_head}, size: {size_mb(out_head)} MB')
  print(f'Total params: {total}, size: {size_mb(total)} MB')
  print('###################################################')

  print('Trainable parameters in attention and feed forward blocks ...')
  trf_blk = gpt_model.trf_blocks[0]
  num_trf_blks = len(gpt_model.trf_blocks)
  # all blocks are identical, so count the first block and multiply by the number of blocks
  attn = params(trf_blk.att.parameters()) * num_trf_blks
  ffns = params(trf_blk.ff.parameters()) * num_trf_blks
  print(f'Attention Layer Parameters: {attn}, size: {size_mb(attn)} MB')
  print(f'Feed Forward Network Parameters: {ffns}, size: {size_mb(ffns)} MB')
  print('###################################################')
  return

In [11]:
gpt2_small = GPT2Model(GPT2_CONFIG_124M)
calculate_num_params(gpt2_small)

Trainable Parameters ... 
Token Embedding layer: 38597376, size: 147.2373046875 MB
Position Embedding layer: 786432, size: 3.0 MB
Transformer Blocks: 85026816, size: 324.3515625 MB
Output Head: 38597376, size: 147.2373046875 MB
Total params: 163008000, size: 621.826171875 MB
###################################################
Trainable parameters in attention and feed forward blocks ...
Attention Layer Parameters: 28320768, size: 108.03515625 MB
Feed Forward Network Parameters: 56669184, size: 216.17578125 MB
###################################################


In [12]:
gpt2_medium = GPT2Model(GPT2_MEDIUM)
calculate_num_params(gpt2_medium)

Trainable Parameters ... 
Token Embedding layer: 51463168, size: 196.31640625 MB
Position Embedding layer: 1048576, size: 4.0 MB
Transformer Blocks: 302235648, size: 1152.9375 MB
Output Head: 51463168, size: 196.31640625 MB
Total params: 406210560, size: 1549.5703125 MB
###################################################
Trainable parameters in attention and feed forward blocks ...
Attention Layer Parameters: 100687872, size: 384.09375 MB
Feed Forward Network Parameters: 201449472, size: 768.46875 MB
###################################################


In [13]:
gpt2_large = GPT2Model(GPT2_LARGE)
calculate_num_params(gpt2_large)

Trainable Parameters ... 
Token Embedding layer: 64328960, size: 245.3955078125 MB
Position Embedding layer: 1310720, size: 5.0 MB
Transformer Blocks: 708249600, size: 2701.7578125 MB
Output Head: 64328960, size: 245.3955078125 MB
Total params: 838218240, size: 3197.548828125 MB
###################################################
Trainable parameters in attention and feed forward blocks ...
Attention Layer Parameters: 235975680, size: 900.17578125 MB
Feed Forward Network Parameters: 472089600, size: 1800.87890625 MB
###################################################


In [14]:
# gpt2_xl = GPT2Model(GPT2_XL) # --> system crashes due to limited RAM
# calculate_num_params(gpt2_xl)
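The size of the XL model can be estimated without instantiating it. The sketch below uses a closed form for this notebook's architecture with `qkv_bias=False` and untied weights: each block has `12 * d * d + 10 * d` parameters (attention, feed forward network, and two layer norms). `total_params` and `size_gib` are hypothetical helper names.

```python
def total_params(vocab, ctx, d, layers):
    """Untied total for this notebook's architecture with qkv_bias=False."""
    per_block = 12 * d * d + 10 * d   # attention + feed forward + two layer norms
    # token embedding and output head, position embedding, blocks, final layer norm
    return 2 * vocab * d + ctx * d + layers * per_block + 2 * d

def size_gib(num_params, bytes_per_param=4):
    """Memory for the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024 ** 3

xl_params = total_params(vocab=50257, ctx=1024, d=1600, layers=48)
print(f'{xl_params:,} parameters')
print(f'float32: {size_gib(xl_params):.2f} GiB, float16: {size_gib(xl_params, 2):.2f} GiB')
```

This gives roughly 1.6 billion parameters, or about 6 GiB of float32 weights. The three models already in memory add up to about 5.2 GiB according to the sizes printed above, so running out of RAM at this point is plausible on a small machine.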

### Text Generation with GPT2 Models

text -> (`token_ids` -> GPT Model -> `next_token_id`) -> end of generation

Append `next_token_id` to `token_ids` and keep generating until a specified number of tokens is reached.

`next_token_id` is chosen greedily: from the model's output at the last position, take the token in the vocabulary that has the maximum probability.

The implementation below does not handle two conditions:
1. Generating beyond `context_length`: we assume that `num_tokens_to_generate` << `context_length`, so the prompt plus the generated tokens fit in the context window.
2. Stopping at the `EOS` token: there is no check for it, so generation always continues until `num_tokens_to_generate` is reached.

The outputs will not make sense because the models are not trained.

Also, every time a new model is created, its parameters are initialized again. Since the parameters differ each time, the outputs differ too, as we can see by re-creating the models and running the cells below again.
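Both unchecked conditions can be handled with little extra code. The following is a minimal sketch in plain Python, not the notebook's implementation: `next_logits` is a hypothetical callable that stands in for the GPT model, `toy_model` is a made-up deterministic stand-in, and `eos_id` is an assumed parameter.

```python
def greedy_generate(next_logits, token_ids, context_length, max_new_tokens, eos_id=None):
    """Greedy decoding with a sliding context window and an optional EOS stop."""
    ids = list(token_ids)
    for _ in range(max_new_tokens):
        window = ids[-context_length:]           # keep only the latest tokens
        logits = next_logits(window)             # scores for the next token
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break                                # stop as soon as EOS is produced
    return ids

VOCAB_SIZE = 5

def toy_model(window):
    """Hypothetical stand-in model: always favours (last token + 1) mod VOCAB_SIZE."""
    target = (window[-1] + 1) % VOCAB_SIZE
    return [1.0 if i == target else 0.0 for i in range(VOCAB_SIZE)]

with_eos = greedy_generate(toy_model, [0], context_length=3, max_new_tokens=10, eos_id=4)
without_eos = greedy_generate(toy_model, [0], context_length=3, max_new_tokens=10)
print(with_eos)      # [0, 1, 2, 3, 4]
print(without_eos)   # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0]
```

Taking the argmax of the logits gives the same token as taking the argmax of the softmax probabilities, because softmax preserves ordering.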

In [15]:
def generate_text_simple(gpt_model, tokenizer, input_text, context_length, num_tokens_to_generate=20):
  input_ids = torch.tensor(tokenizer.encode(input_text))

  if len(input_ids.shape) == 1:
    input_ids = input_ids.unsqueeze(0)
    # print(input_ids.shape)

  for _ in range(num_tokens_to_generate):
    # keep only the latest `context_length` tokens; [:, :context_length] would
    # keep the oldest tokens instead, and generation could not move forward
    input_ids_trunc = input_ids[:, -context_length:]
    with torch.no_grad():
      model_outs = gpt_model(input_ids_trunc)[:, -1, :] # --> logits at the last position
    probs = torch.softmax(model_outs, dim=-1)
    next_token_id = torch.argmax(probs, dim=-1, keepdim=True) # --> keepdim is specified so that concatenation will be easy
    input_ids = torch.cat([input_ids, next_token_id], dim=1)

  input_ids = input_ids.cpu().numpy() # can this tokenizer encode & decode a batch of text sequences?
  # no, so we use a loop over all sequences
  for tid in input_ids:
    # print(tid.shape)
    print(tokenizer.decode(tid.tolist()))
  return

In [16]:
tokenizer = tiktoken.get_encoding('gpt2')
input_text = 'Hello, I am '
num_tokens_to_generate = 20
context_length = 1024

In [17]:
generate_text_simple(gpt2_small.eval(), tokenizer, input_text, context_length, num_tokens_to_generate=num_tokens_to_generate)

Hello, I am  spawned OLEDcomedPat Spoiler orthodox displayonut substit attendedletters constitutional Baptist Launcher commonplacetion vaginal comma Sam Closed


In [18]:
generate_text_simple(gpt2_medium.eval(), tokenizer, input_text, context_length, num_tokens_to_generate=num_tokens_to_generate)

Hello, I am asmajet riddled Sty sportsKill Iro Thumbnails acceptable allegation preserved Squid Node teenager influounced abbre   Most overturn


In [19]:
generate_text_simple(gpt2_large.eval(), tokenizer, input_text, context_length, num_tokens_to_generate=num_tokens_to_generate)

Hello, I am  bere Taste unlimitedAnyoneaq Courtney Absolute NY bubbles frequently portableFormer Joberedithfoliosignantdream 1975 mand harsh
