<a href="https://colab.research.google.com/github/somnathsingh31/inside_LLM_Architecture/blob/main/self_attention_OPT_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -U bitsandbytes

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer

If you are using a pretrained model, it's important to use the associated pretrained tokenizer. This ensures the text is split the same way as the pretraining corpus and that each token maps to the same index (the token-to-index mapping is usually called the vocab) as it did during pretraining.
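
To make the point about matching vocabularies concrete, here is a minimal sketch with two made-up vocabularies (`vocab_a` and `vocab_b` are hypothetical and unrelated to OPT's real vocab). It only illustrates that IDs produced under one mapping mean something else under another.

```python
# Two hypothetical vocabularies that assign different IDs to the same tokens.
vocab_a = {"<s>": 0, "the": 1, "quick": 2, "fox": 3}
vocab_b = {"<s>": 0, "fox": 1, "the": 2, "quick": 3}

def encode(tokens, vocab):
    """Map each token string to its integer ID in the given vocabulary."""
    return [vocab[t] for t in tokens]

def decode(ids, vocab):
    """Map each integer ID back to its token string."""
    inverse = {i: t for t, i in vocab.items()}
    return [inverse[i] for i in ids]

tokens = ["<s>", "the", "quick", "fox"]
ids_a = encode(tokens, vocab_a)
# Reading those IDs with the wrong vocabulary scrambles the text.
mismatched = decode(ids_a, vocab_b)
print(ids_a)       # [0, 1, 2, 3]
print(mismatched)  # ['<s>', 'fox', 'the', 'quick']
```

A model's embedding table is indexed by these IDs, so a mismatched tokenizer would look up the wrong rows.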

In [None]:
# Load the tokenizer that matches the pretrained model (quantization options apply to the model, not the tokenizer)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

In [3]:
# Load the model: Open Pre-trained Transformer (OPT), a suite of decoder-only pre-trained transformers.
# load_in_8bit=True quantizes the linear layers to 8 bits via bitsandbytes.
OPT = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", load_in_8bit=True)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
`low_cpu_mem_usage` was None, now set to True since model is quantized.


pytorch_model.bin:   0%|          | 0.00/2.63G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

In [4]:
inp = "The quick brown fox jumps over the lazy dog"

In [5]:
# return_tensors selects the output tensor type: 'pt' for PyTorch or 'tf' for TensorFlow
inp_tokenized = tokenizer(inp, return_tensors='pt')

In [7]:
print(inp_tokenized)

{'input_ids': tensor([[    2,   133,  2119,  6219, 23602, 13855,    81,     5, 22414,  2335]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In [8]:
print(inp_tokenized['input_ids'].size())

torch.Size([1, 10])


In [9]:
print(OPT.model)

OPTModel(
  (decoder): OPTDecoder(
    (embed_tokens): Embedding(50272, 2048, padding_idx=1)
    (embed_positions): OPTLearnedPositionalEmbedding(2050, 2048)
    (final_layer_norm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
    (layers): ModuleList(
      (0-23): 24 x OPTDecoderLayer(
        (self_attn): OPTAttention(
          (k_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
          (v_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
          (q_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
          (out_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
        )
        (activation_fn): ReLU()
        (self_attn_layer_norm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear8bitLt(in_features=2048, out_features=8192, bias=True)
        (fc2): Linear8bitLt(in_features=8192, out_features=2048, bias=True)
        (final_layer_norm): LayerNorm((2048,), eps=1e-05, e

## OPTDecoder

The decoder component of the transformer model. It consists of:

- **embed_tokens**: An embedding layer that maps input tokens to vectors.
  - Vocabulary size: 50,272 tokens.
  - Embedding dimension: 2,048.
  - Padding token index: 1.

- **embed_positions**: A learned positional embedding layer.
  - Table size: 2,050 rows (2,048 positions plus an offset of 2 that OPT adds to every position index).
  - Embedding dimension: 2,048.

- **final_layer_norm**: A layer normalization layer.
  - Normalizes the output of the decoder.
  - Epsilon (added to the variance for numerical stability): 1e-5.

- **layers**: A list of 24 identical decoder layers (OPTDecoderLayer).

### OPTDecoderLayer

A single decoder layer. It consists of:

- **self_attn**: A self-attention mechanism (OPTAttention).
  - Query, key, and value projections (Linear8bitLt).
  - Output projection (Linear8bitLt).

- **activation_fn**: A ReLU activation function.

- **self_attn_layer_norm**: A layer normalization layer.
  - Normalizes the output of self-attention.
  - Epsilon: 1e-5.

- **fc1 and fc2**: Two linear layers (Linear8bitLt) with:
  - Input dimension: 2,048.
  - Output dimension: 8,192 (fc1) and 2,048 (fc2).

- **final_layer_norm**: A layer normalization layer.
  - Normalizes the output of the decoder layer.
  - Epsilon: 1e-5.

### Linear8bitLt

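A linear layer with 8-bit quantized weights, provided by the bitsandbytes library. It replaces the standard Linear layers here because the model was loaded with `load_in_8bit=True`, which reduces memory use at inference time.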
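
The idea behind 8-bit weights can be sketched with simple absmax quantization. This is a simplified illustration only, not the actual bitsandbytes implementation (which also handles outlier features separately); the function names and weight values are made up.

```python
def quantize_absmax(row):
    """Quantize a list of floats to int8-range integers using one scale per row."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the quantized integers."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.031, 0.9]  # made-up values
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in one byte plus a shared scale, at the cost of a rounding error of at most half a quantization step.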


In the model above, the trainable parameters are the weights and biases of the following layers:

### Embedding layers:
- **embed_tokens**: weights (50,272 x 2,048)
- **embed_positions**: weights (2,050 x 2,048)

### Linear layers within OPTDecoderLayer:
- **k_proj**: weights (2,048 x 2,048) and bias (2,048)
- **v_proj**: weights (2,048 x 2,048) and bias (2,048)
- **q_proj**: weights (2,048 x 2,048) and bias (2,048)
- **out_proj**: weights (2,048 x 2,048) and bias (2,048)
- **fc1**: weights (2,048 x 8,192) and bias (8,192)
- **fc2**: weights (8,192 x 2,048) and bias (2,048)

### LayerNorm layers:
- **final_layer_norm** (in OPTDecoder): weights (2,048) and bias (2,048)
- **self_attn_layer_norm** (in OPTDecoderLayer): weights (2,048) and bias (2,048)
- **final_layer_norm** (in OPTDecoderLayer): weights (2,048) and bias (2,048)

> **Note**: The ReLU activation function does not have any trainable parameters.

In total, there are 24 OPTDecoderLayer instances, each with its own set of parameters. Adding everything up:

- **Embedding layers**: 50,272 x 2,048 + 2,050 x 2,048 = 107,155,456
- **Linear layers**: 24 x (4 x (2,048 x 2,048 + 2,048) + (2,048 x 8,192 + 8,192) + (8,192 x 2,048 + 2,048)) = 1,208,401,920
- **LayerNorm layers**: 24 x 2 x (2,048 + 2,048) + 1 x (2,048 + 2,048) = 200,704

This sums to 1,315,758,080, or approximately **1.3 billion trainable parameters**.
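
The arithmetic can be checked with a few lines of plain Python. The variable names are illustrative; the sizes come from the printed model above.

```python
vocab_size, pos_rows, d_model, d_ffn, n_layers = 50_272, 2_050, 2_048, 8_192, 24

def linear_params(in_features, out_features):
    """Weights plus biases of one linear layer."""
    return in_features * out_features + out_features

embeddings = vocab_size * d_model + pos_rows * d_model
attention = 4 * linear_params(d_model, d_model)           # q, k, v, out projections
feed_forward = linear_params(d_model, d_ffn) + linear_params(d_ffn, d_model)  # fc1 + fc2
layer_norm = 2 * d_model                                  # weight + bias
per_layer = attention + feed_forward + 2 * layer_norm     # two LayerNorms per decoder layer
total = embeddings + n_layers * per_layer + layer_norm    # plus the decoder's final LayerNorm
print(f"{total:,}")  # 1,315,758,080
```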


The embedding layer is available as the decoder's `embed_tokens` attribute. It converts a batch of token IDs of size [1, 10] into a tensor of size [1, 10, 2048], one 2,048-dimensional vector per token.

In [10]:
embeded_input = OPT.model.decoder.embed_tokens(inp_tokenized['input_ids'])
print(embeded_input)

tensor([[[-0.0407,  0.0519,  0.0574,  ..., -0.0263, -0.0355, -0.0260],
         [-0.0371,  0.0220, -0.0096,  ...,  0.0265, -0.0166, -0.0030],
         [-0.0455, -0.0236, -0.0121,  ...,  0.0043, -0.0166,  0.0193],
         ...,
         [ 0.0007,  0.0267,  0.0257,  ...,  0.0622,  0.0421,  0.0279],
         [-0.0126,  0.0347, -0.0352,  ..., -0.0393, -0.0396, -0.0102],
         [-0.0115,  0.0319,  0.0274,  ..., -0.0472, -0.0059,  0.0341]]],
       device='cuda:0', dtype=torch.float16, grad_fn=<EmbeddingBackward0>)


In [11]:
print(embeded_input.size())

torch.Size([1, 10, 2048])
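
Conceptually, an embedding layer is a table lookup: each token ID selects one row. A minimal sketch with tiny made-up sizes and random values (not OPT's real weights):

```python
import random

random.seed(0)
vocab_size, d_model = 12, 4  # tiny made-up sizes; OPT-1.3b uses 50,272 and 2,048
table = [[random.uniform(-0.05, 0.05) for _ in range(d_model)] for _ in range(vocab_size)]

def embed(input_ids, table):
    """Replace every token ID with the matching row of the embedding table."""
    return [[table[token_id] for token_id in sequence] for sequence in input_ids]

input_ids = [[2, 5, 7, 5]]  # batch of 1 sequence with 4 hypothetical token IDs
embedded = embed(input_ids, table)
shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)  # (1, 4, 4)
```

The same ID always yields the same vector (positions 1 and 3 above), which is why positional information has to be added separately.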


In [12]:
# Positional embedding: this layer takes the attention mask and derives the position indices from it
embed_pos_input = OPT.model.decoder.embed_positions(inp_tokenized['attention_mask'])

print("Layer: \t", OPT.model.decoder.embed_positions)
print("Size: \t", embed_pos_input.size())
print("Output: \t", embed_pos_input)


Layer: 	 OPTLearnedPositionalEmbedding(2050, 2048)
Size: 	 torch.Size([1, 10, 2048])
Output: 	 tensor([[[-8.1406e-03, -2.6221e-01,  6.0768e-03,  ...,  1.7273e-02,
          -5.0621e-03, -1.6220e-02],
         [-8.0585e-05,  2.5000e-01, -1.6632e-02,  ..., -1.5419e-02,
          -1.7838e-02,  2.4948e-02],
         [-9.9411e-03, -1.4978e-01,  1.7557e-03,  ...,  3.7117e-03,
          -1.6434e-02, -9.9087e-04],
         ...,
         [ 3.6979e-04, -7.7454e-02,  1.2955e-02,  ...,  3.9330e-03,
          -1.1642e-02,  7.8506e-03],
         [-2.6779e-03, -2.2446e-02, -1.6754e-02,  ..., -1.3142e-03,
          -7.8583e-03,  2.0096e-02],
         [-8.6288e-03,  1.4233e-01, -1.9012e-02,  ..., -1.8463e-02,
          -9.8572e-03,  8.7662e-03]]], device='cuda:0', dtype=torch.float16,
       grad_fn=<EmbeddingBackward0>)
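
The reason this layer accepts an attention mask can be sketched as follows. This is a simplified reading of how the Hugging Face OPT positional embedding derives row indices in the library version used here (cumulative sum of the mask, minus one, plus an offset of 2); `position_ids` is an illustrative helper, not a library function.

```python
from itertools import accumulate

OFFSET = 2  # OPT shifts every position index by 2, hence 2,050 rows for 2,048 positions

def position_ids(attention_mask):
    """Turn one attention-mask row into row indices of the positional table."""
    running = list(accumulate(attention_mask))  # cumulative sum of the mask
    positions = [c * m - 1 for c, m in zip(running, attention_mask)]
    return [p + OFFSET for p in positions]

print(position_ids([1] * 10))      # [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
print(position_ids([0, 0, 1, 1]))  # [1, 1, 2, 3]
```

With left padding, real tokens still start at the first usable position, and padded slots all map to the same row.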


In [13]:
embed_position_inut = embeded_input + embed_pos_input


In [15]:
# There are 24 such decoder layers in the OPT model. Each layer contains self-attention, an activation function, self_attn_layer_norm, two fully connected layers (fc1, fc2), and final_layer_norm.
OPT.model.decoder.layers[0]

OPTDecoderLayer(
  (self_attn): OPTAttention(
    (k_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
    (v_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
    (q_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
    (out_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
  )
  (activation_fn): ReLU()
  (self_attn_layer_norm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
  (fc1): Linear8bitLt(in_features=2048, out_features=8192, bias=True)
  (fc2): Linear8bitLt(in_features=8192, out_features=2048, bias=True)
  (final_layer_norm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
)

**Three outputs:**

The self_attn module in the OPT model returns a tuple of three values (in the transformers version used here):

**attention_output:** The weighted sum of the value vectors, passed through out_proj, shaped as (batch_size, sequence_length, embed_dim).

**attention_weights:** The attention weights, shaped as (batch_size, num_heads, sequence_length, sequence_length). They are only returned when output_attentions=True; otherwise this slot is None.

**past_key_value:** The cached key and value states, which let the decoder avoid recomputing earlier tokens during generation.
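
What "weighted sum of the values" means can be shown with a single-head sketch in plain Python. It assumes scaled dot-product attention with a causal mask, as a decoder-only model applies during normal use; the q, k, v values are made up and the projections are omitted.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.
    q, k, v are lists of seq_len vectors, each of length head_dim."""
    head_dim = len(q[0])
    outputs, weights = [], []
    for i, q_i in enumerate(q):
        # Token i may only attend to positions 0..i.
        scores = [sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(head_dim)
                  for j in range(i + 1)]
        w = softmax(scores) + [0.0] * (len(q) - i - 1)
        weights.append(w)
        outputs.append([sum(w[j] * v[j][d] for j in range(len(v)))
                        for d in range(len(v[0]))])
    return outputs, weights

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
attn_out, attn_w = causal_attention(q, k, v)
print(attn_out[0])  # [1.0, 2.0]
```

The first token can only see itself, so its output equals its own value vector; later tokens mix the values of everything before them.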

In [30]:
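# Call the attention module directly on the summed embeddings.
# No attention_mask is passed, so no causal mask is applied in this call.
attention_output, _, _ = OPT.model.decoder.layers[0].self_attn(embed_position_inut)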

In [32]:
print("Layer: \t", OPT.model.decoder.layers[0].self_attn)
print("Size: \t", attention_output.size(), "\n")
print("Output: \t", attention_output)

Layer: 	 OPTAttention(
  (k_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
  (v_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
  (q_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
  (out_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
)
Size: 	 torch.Size([1, 10, 2048]) 

Output: 	 tensor([[[-0.0119, -0.0110,  0.0056,  ...,  0.0094,  0.0013,  0.0093],
         [-0.0119, -0.0110,  0.0056,  ...,  0.0095,  0.0013,  0.0093],
         [-0.0119, -0.0110,  0.0056,  ...,  0.0095,  0.0013,  0.0093],
         ...,
         [-0.0119, -0.0110,  0.0056,  ...,  0.0095,  0.0013,  0.0093],
         [-0.0119, -0.0110,  0.0056,  ...,  0.0095,  0.0013,  0.0093],
         [-0.0119, -0.0110,  0.0056,  ...,  0.0095,  0.0013,  0.0093]]],
       device='cuda:0', dtype=torch.float16, grad_fn=<MatMul8bitLtBackward>)


In [35]:
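# Activation function output (applied to the attention output for illustration;
# inside the layer's forward pass, ReLU is applied between fc1 and fc2)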
activation_fn_output = OPT.model.decoder.layers[0].activation_fn(attention_output)

In [36]:
print("Layer: \t", OPT.model.decoder.layers[0].activation_fn)
print("Size: \t", activation_fn_output.size(), "\n")
print("Output: \t", activation_fn_output)

Layer: 	 ReLU()
Size: 	 torch.Size([1, 10, 2048]) 

Output: 	 tensor([[[0.0000, 0.0000, 0.0056,  ..., 0.0094, 0.0013, 0.0093],
         [0.0000, 0.0000, 0.0056,  ..., 0.0095, 0.0013, 0.0093],
         [0.0000, 0.0000, 0.0056,  ..., 0.0095, 0.0013, 0.0093],
         ...,
         [0.0000, 0.0000, 0.0056,  ..., 0.0095, 0.0013, 0.0093],
         [0.0000, 0.0000, 0.0056,  ..., 0.0095, 0.0013, 0.0093],
         [0.0000, 0.0000, 0.0056,  ..., 0.0095, 0.0013, 0.0093]]],
       device='cuda:0', dtype=torch.float16, grad_fn=<ReluBackward0>)


In [37]:
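# self_attn_layer_norm output (inside the layer's forward pass, this norm is
# applied to the input of self-attention rather than after the activation)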
self_attn_layer_norm_output = OPT.model.decoder.layers[0].self_attn_layer_norm(activation_fn_output)

In [38]:
print("Layer: \t", OPT.model.decoder.layers[0].self_attn_layer_norm)
print("Size: \t", self_attn_layer_norm_output.size(), "\n")
print("Output: \t", self_attn_layer_norm_output)

Layer: 	 LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
Size: 	 torch.Size([1, 10, 2048]) 

Output: 	 tensor([[[-0.0297, -0.0003, -0.0809,  ..., -0.0103, -0.0086, -0.0093],
         [-0.0298, -0.0004, -0.0807,  ..., -0.0100, -0.0086, -0.0092],
         [-0.0298, -0.0004, -0.0803,  ..., -0.0098, -0.0090, -0.0091],
         ...,
         [-0.0298, -0.0003, -0.0810,  ..., -0.0098, -0.0091, -0.0091],
         [-0.0297, -0.0003, -0.0809,  ..., -0.0099, -0.0089, -0.0091],
         [-0.0298, -0.0004, -0.0802,  ..., -0.0100, -0.0089, -0.0092]]],
       device='cuda:0', dtype=torch.float16, grad_fn=<NativeLayerNormBackward0>)
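
What LayerNorm computes for each token vector can be sketched in plain Python. The input values and the identity `gamma`/`beta` are made up; the real layer uses learned weights of size 2,048.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one vector to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]  # made-up activations for one token
normed = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
print(normed)
```

The `eps` term is the 1e-05 shown in the printed layer; it keeps the division stable when the variance is close to zero.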


In [40]:
#fc1 output
fc1_output = OPT.model.decoder.layers[0].fc1(self_attn_layer_norm_output)
print("Layer: \t", OPT.model.decoder.layers[0].fc1)
print("Size: \t", fc1_output.size(), "\n")
print("Output: \t", fc1_output)

Layer: 	 Linear8bitLt(in_features=2048, out_features=8192, bias=True)
Size: 	 torch.Size([1, 10, 8192]) 

Output: 	 tensor([[[0.1697, 0.4006, 0.1063,  ..., 0.2170, 0.1257, 0.0748],
         [0.1711, 0.4009, 0.1044,  ..., 0.2166, 0.1252, 0.0759],
         [0.1687, 0.4016, 0.1060,  ..., 0.2180, 0.1233, 0.0735],
         ...,
         [0.1694, 0.4028, 0.1036,  ..., 0.2180, 0.1251, 0.0746],
         [0.1696, 0.4023, 0.1064,  ..., 0.2188, 0.1250, 0.0756],
         [0.1685, 0.4023, 0.1066,  ..., 0.2168, 0.1241, 0.0732]]],
       device='cuda:0', dtype=torch.float16, grad_fn=<MatMul8bitLtBackward>)


In [41]:
#fc2 output
fc2_output = OPT.model.decoder.layers[0].fc2(fc1_output)
print("Layer: \t", OPT.model.decoder.layers[0].fc2)
print("Size: \t", fc2_output.size(), "\n")
print("Output: \t", fc2_output)

Layer: 	 Linear8bitLt(in_features=8192, out_features=2048, bias=True)
Size: 	 torch.Size([1, 10, 2048]) 

Output: 	 tensor([[[ 2.3164,  1.1621,  3.1191,  ..., -0.6309, -0.8525,  0.4644],
         [ 2.3164,  1.1729,  3.1309,  ..., -0.6323, -0.8521,  0.4561],
         [ 2.3027,  1.1777,  3.1289,  ..., -0.6587, -0.8359,  0.4490],
         ...,
         [ 2.3047,  1.1504,  3.1152,  ..., -0.6270, -0.8496,  0.4744],
         [ 2.3027,  1.1699,  3.1348,  ..., -0.6494, -0.8403,  0.4480],
         [ 2.2949,  1.1787,  3.1348,  ..., -0.6533, -0.8354,  0.4607]]],
       device='cuda:0', dtype=torch.float16, grad_fn=<MatMul8bitLtBackward>)
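
For reference, the feed-forward block expands each token vector with fc1, applies ReLU, and projects back with fc2 (the cell above feeds fc1's output straight into fc2). A toy sketch with made-up weights and sizes:

```python
def linear(x, weight, bias):
    """y = W x + b for one vector; weight has shape (out_features, in_features)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    """Set negative entries to zero."""
    return [max(0.0, v) for v in x]

def feed_forward(x, w1, b1, w2, b2):
    """Expand with fc1, apply ReLU, then project back with fc2."""
    return linear(relu(linear(x, w1, b1)), w2, b2)

# Toy sizes: d_model = 2, d_ffn = 4 (OPT-1.3b uses 2,048 and 8,192)
w1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
b2 = [0.0, 0.0]
y = feed_forward([3.0, -2.0], w1, b1, w2, b2)
print(y)  # [3.0, -2.0]
```

These particular weights were chosen so the block reproduces its input, which makes the data flow easy to follow by hand.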


In [42]:
# final_layer_norm output
final_layer_norm_output = OPT.model.decoder.layers[0].final_layer_norm(fc2_output)
print("Layer: \t", OPT.model.decoder.layers[0].final_layer_norm)
print("Size: \t", final_layer_norm_output.size(), "\n")
print("Output: \t", final_layer_norm_output)

Layer: 	 LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
Size: 	 torch.Size([1, 10, 2048]) 

Output: 	 tensor([[[ 0.9146,  0.3237,  1.5127,  ..., -0.1781, -0.5068,  0.3682],
         [ 0.9175,  0.3301,  1.5225,  ..., -0.1794, -0.5078,  0.3650],
         [ 0.9111,  0.3323,  1.5205,  ..., -0.1915, -0.5000,  0.3618],
         ...,
         [ 0.9092,  0.3184,  1.5117,  ..., -0.1763, -0.5054,  0.3728],
         [ 0.9102,  0.3281,  1.5225,  ..., -0.1870, -0.5020,  0.3611],
         [ 0.9077,  0.3328,  1.5244,  ..., -0.1892, -0.5000,  0.3672]]],
       device='cuda:0', dtype=torch.float16, grad_fn=<NativeLayerNormBackward0>)


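The embedded input has size torch.Size([1, 10, 2048]), the same as the final output above, so a decoder layer preserves the shape of its input and the 24 layers can be stacked one after another. Note that the cells above apply the sub-modules in the order they are printed. The model's actual forward pass orders them differently (layer norm before attention and before fc1, ReLU between fc1 and fc2, and a residual connection around each half), so the values shown illustrate shapes rather than the real layer output.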
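
The order in which a decoder layer combines its sub-modules during a forward pass can be sketched as below. It assumes the pre-layer-norm arrangement that OPT-1.3b uses; the sub-layers are passed in as plain callables and replaced by hypothetical stand-ins that only scale their input, so the data flow stays visible.

```python
def add(a, b):
    """Element-wise sum of two vectors (the residual connection)."""
    return [x + y for x, y in zip(a, b)]

def decoder_layer_forward(h, self_attn_layer_norm, self_attn, final_layer_norm,
                          fc1, activation_fn, fc2):
    """Pre-layer-norm ordering with two residual connections."""
    residual = h
    h = self_attn(self_attn_layer_norm(h))            # normalize, then attend
    h = add(residual, h)                              # first residual connection
    residual = h
    h = fc2(activation_fn(fc1(final_layer_norm(h))))  # normalize, then feed-forward
    return add(residual, h)                           # second residual connection

def scale(factor):
    """Return a stand-in sublayer that multiplies every element by `factor`."""
    return lambda v: [factor * x for x in v]

out = decoder_layer_forward(
    [1.0, 2.0],
    self_attn_layer_norm=scale(1.0), self_attn=scale(0.5),
    final_layer_norm=scale(1.0), fc1=scale(2.0),
    activation_fn=lambda v: [max(0.0, x) for x in v], fc2=scale(0.5),
)
print(out)  # [3.0, 6.0]
```

Because of the residual connections, each layer adds a correction to its input instead of replacing it.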