<a href="https://colab.research.google.com/github/suhel-24/FML/blob/main/Decoder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook demonstrates how the decoder-only Transformer architecture of a pre-trained large language model works, offering insight into its inner workings using the Hugging Face `transformers` library.

To begin, we load the model and tokenizer using `AutoModelForCausalLM` and `AutoTokenizer`, respectively. Then, we proceed to tokenize a sample phrase, which will serve as our input in the following steps.

**Ensure that your runtime type is T4-GPU**

## Install required packages
The Colab environment comes with `transformers` pre-installed. However, we want to load the model in quantized form. Quantization means representing the model's parameters with fewer bits so that the model fits in less memory. For that, we need to install the following two packages.

In [1]:
!pip install accelerate

Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.26.1


In [2]:
!pip install bitsandbytes

Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.42.0


In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# This line loads the pre-trained causal language model from the
# "facebook/opt-1.3b" checkpoint. load_in_8bit=True means the linear-layer
# weights are loaded in 8-bit precision, which reduces the memory needed
# to hold the model.
OPT = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", load_in_8bit=True)

# This line loads the tokenizer that was used to preprocess the
# text data for the "facebook/opt-1.3b" model.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

# Assigns this English sentence to a variable called inp.
inp = "The quick brown fox jumps over the lazy dog"

# Tokenize the inp sentence.
# It converts the sentence into tokens that the transformer models
# can understand. return_tensors="pt" means the tokenized data is
# returned as PyTorch tensors.
inp_tokenized = tokenizer(inp, return_tensors="pt")

# Prints the size of the input_ids tensor: [batch size, number of tokens].
# Here that is one sentence of 10 tokens.
print(inp_tokenized['input_ids'].size())

# Prints the details of inp_tokenized which includes input_ids and
# attention_mask. It is the tokenized form of the input sentence.
print(inp_tokenized)


torch.Size([1, 10])
{'input_ids': tensor([[    2,   133,  2119,  6219, 23602, 13855,    81,     5, 22414,  2335]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


Let's unpack the results:

```
input_ids: tensor([[ 2, 133, 2119, 6219, 23602, 13855, 81, 5, 22414, 2335]])
```

This represents your tokenized input sentence:

> "The quick brown fox jumps over the lazy dog"

Each number corresponds to a token in the transformer model's vocabulary, which is used to represent words or subwords.

```
attention_mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
```
The attention mask is a binary tensor indicating the position of the padded indices so that the model does not attend to them.

For the Transformer models, this mask shows which tokens should be attended to (1) and which should not be (0).

In this case, there is no padding, so all values are 1.

If the sentence had been padded (to match the length of the longest sentence in a batch or meet a set sequence length for the model), the padding tokens would be represented as 0 in this mask. **Padding is useful when your input is batched and a batch has varying length sentences**.

As a note, the specific numeric values in the `input_ids` tensor will depend on the specific tokenizer used and the vocabulary it was trained with. Each unique word or subword in the vocabulary is assigned a unique numerical ID.

For example, in the output:

```
  <SOS>  The    quick  brown  fox    jumps  over   the    lazy   dog
[ 2,     133,   2119,  6219,  23602, 13855, 81,    5,     22414, 2335]
```
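Here, `<SOS>` denotes the **start-of-sentence** token (ID 2), which the tokenizer prepends automatically. OPT's tokenizer writes this token as `</s>`.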

Let us now see how padding works.


In [None]:
inp = ["I like Machine Learning", "The sun rises in the east"]
inp_tokenized = tokenizer(inp, return_tensors="pt", padding=True)
print(inp_tokenized['input_ids'].size())
print(inp_tokenized)

torch.Size([2, 7])
{'input_ids': tensor([[    2,   100,   101, 14969, 13807,     1,     1],
        [    2,   133,  3778, 10185,    11,     5,  3017]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1]])}


Let us understand the output again

```
Tokenized tensor:
    <SOS>   I     like  Machine Learning <PAD>  <PAD>
  [    2,   100,   101, 14969,  13807,       1,     1],
    <SOS>   The    sun  rises     in    the   east
  [    2,   133,  3778, 10185,    11,     5,  3017]
  
attention_mask
  [1, 1, 1, 1, 1, 0, 0], <-- 0 is padding as first sentence is short
  [1, 1, 1, 1, 1, 1, 1]


```
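
To make the padding logic concrete, here is a minimal plain-Python sketch that rebuilds the padded IDs and the attention mask by hand. The helper name `pad_batch` is hypothetical and only for illustration; the real tokenizer does considerably more (for example, it can also pad on the left). The token IDs and the padding ID of 1 are copied from the output above.

```python
def pad_batch(sequences, pad_id=1):
    """Right-pad token-ID lists to a common length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)              # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)   # 1 = attend, 0 = ignore
    return input_ids, attention_mask


# Unpadded token IDs, copied from the tokenizer output above.
batch = [
    [2, 100, 101, 14969, 13807],
    [2, 133, 3778, 10185, 11, 5, 3017],
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

The shorter sentence receives two padding IDs and two zeros in its mask row, while the longer one is left unchanged.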

## The Model
We load Facebook's Open Pre-trained Transformer model with 1.3B parameters (facebook/opt-1.3b) in the 8-bit format, a memory-saving approach to efficiently utilize GPU resources. The tokenizer object loads the required vocabulary to interact with the model and will be used to convert the sample input (inp variable) to the token IDs and attention mask.

Let’s look at the model’s architecture by printing its `.model` attribute.

In [None]:
print(OPT.model)

OPTModel(
  (decoder): OPTDecoder(
    (embed_tokens): Embedding(50272, 2048, padding_idx=1)
    (embed_positions): OPTLearnedPositionalEmbedding(2050, 2048)
    (final_layer_norm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
    (layers): ModuleList(
      (0-23): 24 x OPTDecoderLayer(
        (self_attn): OPTAttention(
          (k_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
          (v_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
          (q_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
          (out_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
        )
        (activation_fn): ReLU()
        (self_attn_layer_norm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear8bitLt(in_features=2048, out_features=8192, bias=True)
        (fc2): Linear8bitLt(in_features=8192, out_features=2048, bias=True)
        (final_layer_norm): LayerNorm((2048,), eps=1e-05, e

The model is decoder-only, a common characteristic among transformer-based language models. It is similar in architecture to GPT (Generative Pretrained Transformer). Here's a high-level explanation of the main components:

1. `Embedding(50272, 2048, padding_idx=1)`: This creates an embedding lookup table of size `(50272, 2048)`. Here, 50272 is the size of your vocabulary and 2048 is the dimension of each embedded word vector. The `padding_idx` parameter specifies which index in your lookup table will be used for padding.

2. `OPTLearnedPositionalEmbedding(2050, 2048)`: This layer incorporates the position of each token in the sequence. It is a matrix of size `(2050, 2048)`, holding one 2048-dimensional vector per position index. The table has 2050 rows rather than 2048 because the OPT implementation shifts position indices by an offset of 2, so the maximum input length is 2048 tokens. It's called 'learned' because the model learns these positional embeddings during training.

3. `LayerNorm((2048,), eps=1e-05, elementwise_affine=True)`: This performs layer normalization, which rescales each token's 2048-dimensional feature vector to zero mean and unit variance and then applies a learned per-feature scale and shift (`elementwise_affine=True`).

4. `ModuleList(0-23: 24 x OPTDecoderLayer...`: Your model has 24 identical decoder layers stacked on each other. Each of these layers consists of:
   
   - `OPTAttention`: An attention mechanism that allows the model to focus on different parts of the input while generating each word in the output. It also includes a series of linear layers (`Linear8bitLt`) for projecting the inputs into the appropriate dimensions for calculating the attention scores.
   
   - `activation_fn`: ReLU (Rectified Linear Unit) is the activation function used in this model. It is applied between the two fully connected layers, `fc1` and `fc2`.
   
   - `self_attn_layer_norm`, shown as `LayerNorm((2048,), eps=1e-05, elementwise_affine=True)`: The layer normalization step paired with the attention block. Whether it is applied before or after the attention computation depends on the model's `do_layer_norm_before` configuration flag.
   
   - `fc1` and `fc2` (Fully Connected layers): Each `OPTDecoderLayer` in the model contains two linear layers, which transform the data to a specified number of features (from 2048 to 8192, then back to 2048).

`Linear8bitLt` indicates that 8-bit quantization is applied to the weights of these layers, which reduces the memory needed to hold the model, often without a significant decrease in output quality.
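
To give a feel for what 8-bit quantization does to a weight vector, here is a minimal sketch of absolute-maximum (absmax) quantization in plain Python. This is an illustration under simplifying assumptions, not the `bitsandbytes` implementation, which is more elaborate (for instance, it treats outlier values separately). The function names `quantize_absmax` and `dequantize` are hypothetical.

```python
def quantize_absmax(weights):
    """Map floats to integers in [-127, 127] using a single scale for the vector."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 for all-zero input
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in q_weights]


weights = [0.12, -0.5, 0.33, 0.0, 0.9]
q_weights, scale = quantize_absmax(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_weights)
print(restored)
print(max_error)
```

Each weight is now stored as a small integer plus one shared scale, and the rounding error per weight is at most half the scale.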

The above is a commonly used variant of the transformer architectures, implemented with quantization for optimized performance. Such architectures have been quite successful in various natural language processing tasks, including language translation, summarization, and more.

Now that we understand the model, we use the `decoder` attribute to access its inner components. Examining the `layers` attribute confirms that the decoder is composed of 24 stacked layers with the same architecture. To begin, we look at the embedding layer.

The embedding layer is accessible through the `.embed_tokens` attribute of the decoder, and we pass our tokenized inputs to it. As you can see, the embedding layer transforms a tensor of IDs of size [2, 7] into one of size [2, 7, 2048]. This representation is then passed through the decoder layers.

In [None]:
embedded_input = OPT.model.decoder.embed_tokens(inp_tokenized['input_ids'])
print("Layer:\t", OPT.model.decoder.embed_tokens)
print("Size:\t", embedded_input.size())
print("Output:\t", embedded_input)

Layer:	 Embedding(50272, 2048, padding_idx=1)
Size:	 torch.Size([2, 7, 2048])
Output:	 tensor([[[-0.0407,  0.0519,  0.0574,  ..., -0.0263, -0.0355, -0.0260],
         [-0.0207,  0.0689, -0.0345,  ...,  0.0536, -0.0303, -0.0147],
         [ 0.0460,  0.0240,  0.0114,  ...,  0.0053, -0.0050, -0.0035],
         ...,
         [ 0.0121, -0.0246, -0.0299,  ..., -0.0241, -0.0199,  0.0188],
         [ 0.0168, -0.0312, -0.0161,  ..., -0.0373, -0.0273,  0.0114],
         [ 0.0168, -0.0312, -0.0161,  ..., -0.0373, -0.0273,  0.0114]],

        [[-0.0407,  0.0519,  0.0574,  ..., -0.0263, -0.0355, -0.0260],
         [-0.0371,  0.0220, -0.0096,  ...,  0.0265, -0.0166, -0.0030],
         [-0.0135,  0.0299,  0.0194,  ..., -0.0290,  0.0076,  0.0114],
         ...,
         [ 0.0359,  0.0118,  0.0179,  ...,  0.0498,  0.0327, -0.0034],
         [ 0.0007,  0.0267,  0.0257,  ...,  0.0622,  0.0421,  0.0279],
         [ 0.0310, -0.0077,  0.0362,  ..., -0.0366, -0.0258, -0.0346]]],
       device='cuda:0', dtype
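
Conceptually, the embedding layer is a table lookup: each token ID selects one row of the table. The sketch below shows this in plain Python with toy sizes (a vocabulary of 10 and vectors of length 4, both chosen only for illustration; the model uses 50272 and 2048). The random table stands in for the learned weights.

```python
import random

random.seed(0)
vocab_size, hidden_size = 10, 4   # toy sizes for illustration only

# One row of `hidden_size` floats per vocabulary entry.
embedding_table = [
    [random.uniform(-0.1, 0.1) for _ in range(hidden_size)]
    for _ in range(vocab_size)
]


def embed(batch_ids):
    """Replace every token ID with its row from the table."""
    return [[embedding_table[token_id] for token_id in seq] for seq in batch_ids]


batch_ids = [[2, 5, 7, 1, 1], [2, 3, 4, 6, 8]]   # shape [2, 5]
embedded = embed(batch_ids)
shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)
```

Because the lookup depends only on the ID, identical IDs always yield identical vectors. The same effect is visible in the output above, where the two padding positions of the first sentence share one vector and both sentences start with the same vector.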



Next, the positional embedding layer uses the attention mask to work out the position of each token and returns one vector per position, giving the model a sense of token order. The following code calls the decoder's `.embed_positions` layer to generate the positional embeddings. As the output shows, the layer produces a distinct vector for each real position (padded positions come out as zeros here), and this vector is added to the output of the embedding layer.

In [None]:
embed_pos_input = OPT.model.decoder.embed_positions(inp_tokenized['attention_mask'])
print("Layer:\t", OPT.model.decoder.embed_positions)
print("Size:\t", embed_pos_input.size())
print("Output:\t", embed_pos_input)

Layer:	 OPTLearnedPositionalEmbedding(2050, 2048)
Size:	 torch.Size([2, 7, 2048])
Output:	 tensor([[[-8.1406e-03, -2.6221e-01,  6.0768e-03,  ...,  1.7273e-02,
          -5.0621e-03, -1.6220e-02],
         [-8.0585e-05,  2.5000e-01, -1.6632e-02,  ..., -1.5419e-02,
          -1.7838e-02,  2.4948e-02],
         [-9.9411e-03, -1.4978e-01,  1.7557e-03,  ...,  3.7117e-03,
          -1.6434e-02, -9.9087e-04],
         ...,
         [ 9.8038e-03,  2.4780e-01,  1.4900e-02,  ..., -2.9709e-02,
          -4.4937e-03, -1.2398e-03],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00]],

        [[-8.1406e-03, -2.6221e-01,  6.0768e-03,  ...,  1.7273e-02,
          -5.0621e-03, -1.6220e-02],
         [-8.0585e-05,  2.5000e-01, -1.6632e-02,  ..., -1.5419e-02,
          -1.7838e-02,  2.4948e-02],
         [-9.9411e-03, -1.4978e-01,  1.7557e-03,  ...
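
The sketch below shows one way the position indices can be derived from the attention mask: take the running count of real tokens, zero it out at padded positions, subtract one, and add an offset. The formula and the offset of 2 are assumptions based on the Hugging Face OPT implementation and may differ between library versions; the helper name `position_ids` is hypothetical.

```python
from itertools import accumulate


def position_ids(mask_row, offset=2):
    """Position index per token, derived from one row of the attention mask."""
    running = accumulate(mask_row)   # cumulative count of real tokens
    positions = [count * m - 1 for count, m in zip(running, mask_row)]
    return [p + offset for p in positions]


attention_mask = [
    [1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
]
ids = [position_ids(row) for row in attention_mask]
print(ids)
```

Under these assumptions, every padded position maps to the same index, which is consistent with the identical all-zero rows for the padded positions in the output above.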

Lastly, the self-attention component! We use the first layer’s self-attention component by indexing into the layers and accessing the `.self_attn` module.

In [None]:
embed_position_input = embedded_input + embed_pos_input
hidden_states, _, _ = OPT.model.decoder.layers[0].self_attn(embed_position_input)
print("Layer:\t", OPT.model.decoder.layers[0].self_attn)
print("Size:\t", hidden_states.size())
print("Output:\t", hidden_states)


Layer:	 OPTAttention(
  (k_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
  (v_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
  (q_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
  (out_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=True)
)
Size:	 torch.Size([2, 7, 2048])
Output:	 tensor([[[-0.0118, -0.0071,  0.0049,  ...,  0.0088,  0.0003,  0.0098],
         [-0.0118, -0.0071,  0.0049,  ...,  0.0088,  0.0003,  0.0098],
         [-0.0118, -0.0071,  0.0049,  ...,  0.0088,  0.0003,  0.0098],
         ...,
         [-0.0118, -0.0071,  0.0049,  ...,  0.0088,  0.0003,  0.0098],
         [-0.0118, -0.0071,  0.0049,  ...,  0.0088,  0.0003,  0.0098],
         [-0.0118, -0.0071,  0.0049,  ...,  0.0088,  0.0003,  0.0098]],

        [[-0.0131, -0.0083,  0.0046,  ...,  0.0082,  0.0013,  0.0118],
         [-0.0130, -0.0083,  0.0045,  ...,  0.0082,  0.0013,  0.0118],
         [-0.0131, -0.0083,  0.0045,  ...,  0.0082,  0.00

The self-attention component comprises the query, key, and value projection layers mentioned above, followed by a final output projection. It takes the sum of the embedded input and the positional embedding as input. In real-world use, the model also passes the attention mask to this component so that it knows which portions of the input to ignore (omitted from the sample code for simplicity).
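
To illustrate how a mask enters the attention computation, here is a minimal single-head sketch of scaled dot-product attention in plain Python. It is a simplification under stated assumptions: the input vectors are used directly as queries, keys, and values (the real layer first applies the `q_proj`, `k_proj`, and `v_proj` projections and uses several heads), and the function name `causal_attention` is hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def causal_attention(queries, keys, values, mask):
    """Single-head attention over one sequence.

    Position i may attend to position j only if j <= i and mask[j] == 1.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = [
            dot(q, k) * scale if (j <= i and mask[j] == 1) else float("-inf")
            for j, k in enumerate(keys)
        ]
        weights = softmax(scores)          # disallowed positions get weight 0
        all_weights.append(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs, all_weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # 4 positions, 2 features
mask = [1, 1, 1, 0]                                    # last position is padding
outputs, weights = causal_attention(x, x, x, mask)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1, later positions never receive weight from earlier ones, and the padded position is never attended to.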

The rest of the architecture applies layer normalization and a feed-forward network with a non-linearity (ReLU), as listed in the model printout above.
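
The remaining sub-block of a decoder layer can be sketched in plain Python as follows. This is an illustration under assumptions, not the library code: normalization is placed before the feed-forward layers, the learned scale and shift of the layer norm are omitted, and the sizes are toy values (2 and 8, keeping the 1:4 ratio of 2048 to 8192). All function names are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, weight, bias):
    """Dense layer; `weight` has one row per output feature."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def feed_forward_block(x, w1, b1, w2, b2):
    """Normalize, expand, apply ReLU, project back, then add the residual."""
    h = layer_norm(x)
    h = relu(linear(h, w1, b1))            # plays the role of fc1: hidden -> 4 * hidden
    h = linear(h, w2, b2)                  # plays the role of fc2: 4 * hidden -> hidden
    return [a + b for a, b in zip(x, h)]   # residual connection


hidden, inner = 2, 8
w1 = [[0.1 * (i + j) for j in range(hidden)] for i in range(inner)]
b1 = [0.0] * inner
w2 = [[0.05 * (i - j) for j in range(inner)] for i in range(hidden)]
b2 = [0.0] * hidden

x = [1.0, 3.0]
normed = layer_norm(x)
out = feed_forward_block(x, w1, b1, w2, b2)
print(normed)
print(out)
```

The output has the same length as the input, which is what lets 24 such layers be stacked one after another.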

> 💡 If you are interested in learning the Transformer architecture in more detail and implementing a GPT-like network from scratch, we recommend watching the following video from Andrej Karpathy:


> https://www.youtube.com/watch?v=kCc8FmEb1nY

Let us now use this model to generate some text. First, let us look at the documentation of the `generate` method.

In [None]:
help(OPT.generate)

Help on method generate in module transformers.generation.utils:

generate(inputs: Optional[torch.Tensor] = None, generation_config: Optional[transformers.generation.configuration_utils.GenerationConfig] = None, logits_processor: Optional[transformers.generation.logits_process.LogitsProcessorList] = None, stopping_criteria: Optional[transformers.generation.stopping_criteria.StoppingCriteriaList] = None, prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor], List[int]]] = None, synced_gpus: Optional[bool] = None, assistant_model: Optional[ForwardRef('PreTrainedModel')] = None, streamer: Optional[ForwardRef('BaseStreamer')] = None, negative_prompt_ids: Optional[torch.Tensor] = None, negative_prompt_attention_mask: Optional[torch.Tensor] = None, **kwargs) -> Union[transformers.generation.utils.GreedySearchEncoderDecoderOutput, transformers.generation.utils.GreedySearchDecoderOnlyOutput, transformers.generation.utils.SampleEncoderDecoderOutput, transformers.generation.utils.Sampl

Let us now try to generate some text. We need to map the input to CUDA (GPU) before feeding it to the model as the model is using GPU.

**Remember**: GPU memory is used for the model, input and output.

In [None]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
inputs = tokenizer.encode("I love", return_tensors='pt').to(device) # map to cuda if you are using a GPU
outputs = OPT.generate(inputs, max_length=50, do_sample=True, temperature=0.7, top_p=0.8)
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded_output)

I love this. I’m in a very similar situation to you. I’m still trying to get my head around it.
I’m so sorry to hear that. It’s a very strange feeling to
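
The `temperature` and `top_p` arguments in the call above control how the next token is sampled. The plain-Python sketch below illustrates the idea on a toy list of five logits: temperature rescales the logits before the softmax, and top-p (nucleus) filtering keeps only the most likely tokens whose cumulative probability reaches `top_p`. This is a simplified sketch, not the library's implementation, and the function name `top_p_sample` and the logit values are made up for illustration.

```python
import math
import random


def top_p_sample(logits, temperature=0.7, top_p=0.8, rng=random):
    """Sample one token index using temperature scaling and top-p filtering."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Draw one of the kept tokens in proportion to its probability.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0, -3.0]   # toy scores for a 5-token vocabulary
rng = random.Random(0)
samples = [top_p_sample(logits, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))
```

With these settings only the two most likely tokens survive the filter, so unlikely tokens are never drawn. A lower temperature or a lower `top_p` makes the output more predictable; higher values make it more varied.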


In [None]:
inputs = tokenizer.encode("Looking at the horizion across the ocean with the sun rising from the", return_tensors='pt').to(device) # map to cuda if you are using a GPU
outputs = OPT.generate(inputs, max_length=50, do_sample=True, temperature=0.7, top_p=0.8)
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded_output)

Looking at the horizion across the ocean with the sun rising from the horizon.
This. I was on a cruise with my parents and we were on the Caribbean. We were in the middle of the ocean and I was looking up at
