# Chapter 6 - Parameter Efficient Fine Tuning

LoRA, QLoRA

Parameter-efficient fine-tuning (PEFT) is a family of techniques for adapting a pretrained model to a new task efficiently. These techniques are designed to use less compute and far less memory and storage than full fine-tuning.

LoRA stands for low-rank adaptation, a technique for adapting a pretrained model to a specific downstream NLP task. In the previous chapter we looked at full fine-tuning of a pretrained model. Compared with full fine-tuning, LoRA offers two advantages:

* A reduced number of weight updates, since only the small low-rank matrices are trained.
* The fine-tuned weights are kept separate from the pretrained weights, so only a small set of task-specific weights has to be stored for each task.

In [1]:
import os
from pathlib import Path
import sys
import warnings

warnings.simplefilter("ignore")
current_path = Path(os.getcwd())
parent_path  = str(current_path.parent.absolute())
sys.path.append(parent_path)



In [61]:
from chapter2.gptlikemodel import SLLM, SLLMConfig


def print_no_parameters(model, custom_msg=""):
    total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total parameters: {total_params:,} {custom_msg}")
    

def get_model():
    # Initialize the model class
    config = SLLMConfig()
    print(f"Model configuration {config}\n")
    model = SLLM(config)
    print_no_parameters(model)

    return model, config

In [62]:
model,config = get_model()
model

Model configuration SLLMConfig(d_model=128, d_head=128, bias=False, dropout=0.0, context_window=50, n_heads=2, vocab_size=52000, n_layers=2)

Total parameters: 13,698,976 


SLLM(
  (embedding_block): EmbeddingsBlock(
    (token_embdgs): Embedding(52000, 128)
    (pos_embdgs): Embedding(50, 128)
    (droput): Dropout(p=0.0, inplace=False)
  )
  (transformer_blocks): ModuleList(
    (0-1): 2 x TransformerBlock(
      (ln1): LayerNorm()
      (mha): MultiHeadAttentionV1(
        (projection_out): Linear(in_features=256, out_features=128, bias=True)
        (Wq): Linear(in_features=128, out_features=256, bias=False)
        (Wk): Linear(in_features=128, out_features=256, bias=False)
        (Wv): Linear(in_features=128, out_features=256, bias=False)
      )
      (ln2): LayerNorm()
      (mlp): MLP(
        (ln_1): Linear(in_features=128, out_features=128, bias=False)
        (gelu): GELU(approximate='none')
        (c_proj): Linear(in_features=128, out_features=128, bias=False)
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
  )
  (final_norm): LayerNorm()
  (out_head): Linear(in_features=128, out_features=52000, bias=True)
)

## LoRA

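LoRA (low-rank adaptation) augments existing weight matrices in a model with a low-rank counterpart. To see why this works, consider the gradient descent weight update:

$$w \leftarrow w + \eta \, \Delta w$$

where $\eta$ is the learning rate and $\Delta w$ is the update direction, i.e. the negative gradient of the loss with respect to $w$.

Once the weights are updated, the input $x$ is projected onto them, $x \cdot w$. Writing the update out explicitly, the projection becomes

$$x \cdot (w + \eta \, \Delta w) = x \cdot w + \eta \, (x \cdot \Delta w)$$

The first term uses only the pretrained weights, so the update can live in a separate matrix. LoRA freezes $w$ and learns that separate term as the product of two small matrices, $A$ of shape (in_dim, rank) and $B$ of shape (rank, out_dim), scaled by a constant `alpha`.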
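
The split of the projection into a frozen term and an update term can be checked numerically, along with the saving from factoring the update into two thin matrices. This is a minimal sketch: the sizes (`d = 128`, `r = 16`) mirror the MLP layers of the model above, and the variable names are illustrative rather than part of the chapter's code.

```python
import torch

torch.manual_seed(0)

d, r = 128, 16                                # layer width and low-rank dimension (illustrative)
x = torch.randn(4, d, dtype=torch.float64)    # a batch of 4 input vectors
w = torch.randn(d, d, dtype=torch.float64)    # stands in for a frozen pretrained weight

# Two thin matrices whose product stands in for the full update delta_w
A = torch.randn(d, r, dtype=torch.float64)
B = torch.randn(r, d, dtype=torch.float64)
delta_w = A @ B                               # shape (d, d)

# Projecting onto the updated weight equals the frozen path plus the low-rank path
merged = x @ (w + delta_w)
split = x @ w + (x @ A) @ B
same_output = torch.allclose(merged, split)

full_update_params = delta_w.numel()          # 128 * 128
lora_params = A.numel() + B.numel()           # 128 * 16 + 16 * 128

print(same_output, full_update_params, lora_params)
```

With these sizes the factored update needs a quarter of the values of the full matrix, and the gap widens as the layer width grows relative to the rank.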

In [69]:
import torch
import torch.nn as nn
import math

class LoRALayer(nn.Module):
    def __init__(self,in_dim, out_dim, rank, alpha):
        super().__init__()
        self.A = torch.nn.Parameter(torch.empty(in_dim, rank))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        
        self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))
        
        self.alpha = alpha
    
    def forward(self, x):
        x = self.alpha * (x @ self.A @ self.B)
        return x

class LinearLoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora  = LoRALayer(linear.in_features, linear.out_features, rank, alpha)
    
    def forward(self, x):
        return self.linear(x) + self.lora(x)
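
Because the LoRA branch is a plain matrix product, it can in principle be folded into the frozen weight after training, so that inference uses a single linear layer again. The sketch below is not part of the chapter's code. It rebuilds the two paths with bare tensors and assumes the same convention as `LoRALayer`: `A` of shape `(in_dim, rank)`, `B` of shape `(rank, out_dim)`, and an output scaled by `alpha`. The sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

in_dim, out_dim, rank, alpha = 8, 6, 2, 30    # illustrative sizes

linear = nn.Linear(in_dim, out_dim, bias=False).double()
A = torch.randn(in_dim, rank, dtype=torch.float64)
B = torch.randn(rank, out_dim, dtype=torch.float64)   # non-zero, as if already trained

x = torch.randn(3, in_dim, dtype=torch.float64)

with torch.no_grad():
    # Two-path forward pass: frozen layer plus scaled low-rank branch
    two_path = linear(x) + alpha * (x @ A @ B)

    # nn.Linear stores its weight as (out_dim, in_dim), so the update is transposed
    merged = nn.Linear(in_dim, out_dim, bias=False).double()
    merged.weight.copy_(linear.weight + alpha * (A @ B).T)
    one_path = merged(x)

outputs_match = torch.allclose(two_path, one_path)
print(outputs_match)
```

Keeping the branch separate during training is what allows the adapter to be stored on its own; merging is an optional step for deployment.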

In [70]:
model,config = get_model()

print_no_parameters(model, "before freezing")

for param in model.parameters():
    param.requires_grad = False

print_no_parameters(model, "after freezing")

replace_modules = ["ln_1", "c_proj"]
rank = 16
alpha = 30

for module in list(model.modules()):
    for name, submodule in list(module.named_children()):
        if isinstance(submodule, torch.nn.Linear):
            if name in replace_modules:
                print(f"replace {name}")
                # Attach the wrapper to the parent module, not to the Linear layer itself
                setattr(module, name, LinearLoRA(submodule, rank, alpha))
                
print_no_parameters(model, "to train with LoRA")


Model configuration SLLMConfig(d_model=128, d_head=128, bias=False, dropout=0.0, context_window=50, n_heads=2, vocab_size=52000, n_layers=2)

Total parameters: 13,698,976 
Total parameters: 13,698,976 before freezing
Total parameters: 0 after freezing
replace ln_1
replace c_proj
replace ln_1
replace c_proj
Total parameters: 16,384 to train with LoRA


In [71]:
model  # ln_1 and c_proj in each MLP are now LinearLoRA wrappers

## Adapters

Save only the layers we have replaced; these are the adapters. Later, load the pretrained model weights
and swap the stored adapter weights back into the corresponding layers.
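
One way to do this is to filter the model's `state_dict` down to the tensors that were trained and to load them back with `strict=False`. The sketch below is an assumption about how this could look, not the chapter's implementation. `TinyAdapted` and `adapter_state_dict` are hypothetical names, the small module stands in for a model with frozen base weights plus a trainable adapter, and an in-memory buffer stands in for a file on disk.

```python
import io
import torch
import torch.nn as nn

class TinyAdapted(nn.Module):
    """Hypothetical stand-in: a frozen base layer plus a small trainable adapter."""
    def __init__(self):
        super().__init__()
        self.base = nn.Linear(8, 8, bias=False)
        self.adapter = nn.Linear(8, 8, bias=False)
        for p in self.base.parameters():
            p.requires_grad = False

    def forward(self, x):
        return self.base(x) + self.adapter(x)

def adapter_state_dict(model):
    # Keep only the tensors that are trainable, i.e. the adapter weights
    trainable = {name for name, p in model.named_parameters() if p.requires_grad}
    return {k: v for k, v in model.state_dict().items() if k in trainable}

torch.manual_seed(0)
trained = TinyAdapted()

# Save just the adapter weights
buffer = io.BytesIO()
torch.save(adapter_state_dict(trained), buffer)
buffer.seek(0)

# Later: rebuild the model around the same pretrained base, then load the adapter on top
restored = TinyAdapted()
restored.base.load_state_dict(trained.base.state_dict())
result = restored.load_state_dict(torch.load(buffer), strict=False)

saved_keys = sorted(adapter_state_dict(trained).keys())
adapter_restored = torch.equal(restored.adapter.weight, trained.adapter.weight)
print(saved_keys, result.missing_keys, adapter_restored)
```

The `missing_keys` reported by `load_state_dict` are the base weights, which is expected here because they come from the pretrained checkpoint rather than from the adapter file.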


## Prompt Tuning

In the previous chapter, we did a full fine-tuning of a model for a classification task. With prompt tuning, we can turn any task into a generation task while the model itself stays frozen. The technique was introduced in {cite}`lester2021powerscaleparameterefficientprompt`. Quoting from the paper, prompt tuning is "a simple yet effective mechanism for learning 'soft prompts' to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signal from any number of labeled examples."

The paper describes the mechanism as follows:

> Given a series of $n$ tokens, $\{x_1, x_2, \dots, x_n\}$, the first thing T5 does is embed the tokens, forming a matrix $X_e \in \mathbb{R}^{n \times e}$ where $e$ is the dimension of the embedding space. Our soft-prompts are represented as a parameter $P_e \in \mathbb{R}^{p \times e}$, where $p$ is the length of the prompt. Our prompt is then concatenated to the embedded input forming a single matrix $[P_e; X_e] \in \mathbb{R}^{(p+n) \times e}$ which then flows through the encoder-decoder as normal. Our models are trained to maximize the probability of $Y$, but only the prompt parameters $P_e$ are updated.
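
The shapes in this description can be traced with a few tensors. The following is a minimal sketch under assumed toy sizes (vocabulary of 100, `e = 8`, `n = 6`, `p = 4`). The summed output is only a stand-in for a real loss, used to show that gradients reach the prompt and not the frozen embedding table.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, e = 100, 8     # toy vocabulary and embedding width (illustrative)
n, p = 6, 4                # input length and soft-prompt length

token_embedding = nn.Embedding(vocab_size, e)
token_embedding.weight.requires_grad = False      # frozen, like the rest of the model

P_e = nn.Parameter(torch.randn(p, e))             # the only trainable tensor

x = torch.randint(0, vocab_size, (n,))            # token ids x_1 ... x_n
X_e = token_embedding(x)                          # shape (n, e)
combined = torch.cat((P_e, X_e), dim=0)           # [P_e; X_e], shape (p + n, e)

# Stand-in loss, used only to show where gradients flow
combined.sum().backward()

print(combined.shape, P_e.grad is not None, token_embedding.weight.grad is None)
```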

In [14]:
from chapter1.simplebooks import get_tokenizer

tokenizer = get_tokenizer()
prompt_tuning_init_text = "Classify if the feedback is a complaint or no complaint.\n"
pt_init_text_tokens = torch.tensor(tokenizer(prompt_tuning_init_text)["input_ids"])
pt_init_text_tokens

Loading tokenizer from /home/gopi/Documents/small_llm/llmbook/data/simplebooks-tokenizer


tensor([20978,  7634,   475,   260,  2432,  3235,   357,   259, 19962,   493,
          467, 19962,    14,   199])
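
The name `prompt_tuning_init_text` suggests starting the soft prompt from the embeddings of these tokens instead of from random values; initialising prompt vectors from existing vocabulary embeddings is one of the options discussed by Lester et al. A minimal sketch of that idea follows. It assumes a toy embedding table, and the token ids are hypothetical placeholders for the tokenized text, not the ids printed above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, d_model = 100, 8                      # toy sizes (illustrative)
token_embedding = nn.Embedding(vocab_size, d_model)

# Hypothetical token ids standing in for the tokenized initialisation text
init_token_ids = torch.tensor([5, 17, 42, 3])

# One trainable row per prompt position, seeded with the matching token embedding
soft_prompt = nn.Embedding(len(init_token_ids), d_model)
with torch.no_grad():
    soft_prompt.weight.copy_(token_embedding(init_token_ids))

starts_from_tokens = torch.equal(
    soft_prompt.weight, token_embedding.weight[init_token_ids]
)
print(soft_prompt.weight.shape, starts_from_tokens)
```

The soft prompt is a separate table with one row per prompt position, so it is indexed by position (0 to 3 here) and not by vocabulary id.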

In [44]:
import torch
import torch.nn as nn

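class PrefixEmbedding(nn.Module):
    """Prepends a trainable soft prompt to the embedded input sequence."""
    def __init__(self, pt_tokens, config):
        
        super().__init__()
        self.pt_tokens       = pt_tokens
        no_tokens            = len(pt_tokens)
        self.pt_embedding    = nn.Embedding(no_tokens, config.d_model)
        # Row indices 0..no_tokens-1; the vocabulary ids in pt_tokens would be
        # out of range for an embedding table with only no_tokens rows
        self.register_buffer("pt_index", torch.arange(no_tokens))
    
    def forward(self, x):
        # Assumes x is (seq, d_model) or (batch, seq, d_model)
        pt = self.pt_embedding(self.pt_index)
        if x.dim() == 3:
            # Repeat the prompt for every example in the batch
            pt = pt.unsqueeze(0).expand(x.size(0), -1, -1)
        # Concatenate along the sequence dimension so the prompt comes first
        return torch.cat((pt, x), dim=-2)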



In [55]:

model, config = get_model()
prefix_block = PrefixEmbedding(pt_init_text_tokens, config)

# freeze all the weights
for param in model.parameters():
    param.requires_grad = False

total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters to train after freezing: {total_params:,}")



Model configuration SLLMConfig(d_model=128, d_head=128, bias=False, dropout=0.0, context_window=50, n_heads=2, vocab_size=52000, n_layers=2)

Total parameters: 13,698,976
Total parameters to train after freezing: 0


In [56]:
from collections import OrderedDict

blocks = OrderedDict({"token_embedding": model.embedding_block, "pt_embedding": prefix_block})
model.embedding_block = nn.Sequential(blocks)
#nn.Sequential(model.embedding_block, prefix_block)

total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters after adding prompt tuning: {total_params:,}")

Total parameters after adding prompt tuning: 1,792


In [57]:
len(pt_init_text_tokens) * config.d_model

1792

In [58]:
model

SLLM(
  (embedding_block): Sequential(
    (token_embedding): EmbeddingsBlock(
      (token_embdgs): Embedding(52000, 128)
      (pos_embdgs): Embedding(50, 128)
      (droput): Dropout(p=0.0, inplace=False)
    )
    (pt_embedding): PrefixEmbedding(
      (pt_embedding): Embedding(14, 128)
    )
  )
  (transformer_blocks): ModuleList(
    (0-1): 2 x TransformerBlock(
      (ln1): LayerNorm()
      (mha): MultiHeadAttentionV1(
        (projection_out): Linear(in_features=256, out_features=128, bias=True)
        (Wq): Linear(in_features=128, out_features=256, bias=False)
        (Wk): Linear(in_features=128, out_features=256, bias=False)
        (Wv): Linear(in_features=128, out_features=256, bias=False)
      )
      (ln2): LayerNorm()
      (mlp): MLP(
        (ln_1): Linear(in_features=128, out_features=128, bias=False)
        (gelu): GELU(approximate='none')
        (c_proj): Linear(in_features=128, out_features=128, bias=False)
        (dropout): Dropout(p=0.0, inplace=False)
   

In [49]:
for name,module in model.embedding_block.named_children():
    for params in module.parameters():
        print(name)
        print(params.shape)
        print(params)


0
torch.Size([52000, 128])
Parameter containing:
tensor([[ 0.3087,  0.2148,  0.6442,  ..., -0.7668,  0.4021,  0.6716],
        [-0.3794, -0.6720, -0.8701,  ..., -0.1167,  0.3699, -0.1928],
        [-0.1089,  0.7103,  1.2057,  ..., -0.1386,  1.9621,  1.0058],
        ...,
        [ 1.4903,  1.5516,  0.5210,  ...,  1.2449,  1.5994,  1.8209],
        [-0.0917,  0.5116, -1.5981,  ..., -0.7242, -1.8690,  0.0202],
        [-0.9190,  0.5129,  1.3088,  ..., -1.1461, -0.9235, -1.7763]])
0
torch.Size([50, 128])
Parameter containing:
tensor([[-1.2415,  0.2271,  1.0032,  ..., -0.6076,  1.3255, -1.4367],
        [ 1.5665,  1.4603,  1.2462,  ..., -1.2420,  1.5477, -0.7403],
        [ 0.4539, -0.9131, -0.5536,  ..., -1.1211,  0.8547,  0.7545],
        ...,
        [ 0.1233,  0.7568,  0.8532,  ..., -0.4519,  1.8104, -1.1059],
        [-1.0984, -0.3848,  0.4627,  ..., -1.9657, -0.5742, -1.7771],
        [ 1.3583, -1.6562,  0.6895,  ..., -1.3460,  0.3275,  0.8947]])
1
torch.Size([14, 128])
Parameter con

## Prefix Tuning

## P-Tuning