<a href="https://colab.research.google.com/github/shahabday/DSR-LLM-finetuning/blob/main/02_PEFT_LoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## PEFT LoRA

Let's start by installing `peft` and loading a model:

In [None]:
!pip install peft

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model_id = "EleutherAI/pythia-160m"

model = AutoModelForCausalLM.from_pretrained(base_model_id)

The `GPTNeoXSdpaAttention` class is deprecated in favor of simply modifying the `config._attn_implementation`attribute of the `GPTNeoXAttention` class! It will be removed in v4.48


In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
model

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 768)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-11): 12 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXSdpaAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=768, out_features=2304, bias=True)
          (dense): Linear(in_features=768, out_features=768, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=768, out_features=3072, bias=True)
          (dense_4h_to_h): Linear(in_features=3072, out_features=768, bias=True)
      

In [None]:
print_trainable_parameters(model)

trainable params: 162322944 || all params: 162322944 || trainable%: 100.0


It has about 162M parameters, in line with the "160m" in its name, and they're all trainable.

We can use LoRA to get low-rank matrices for all the big linear layers in the model (our `target_modules`):

In [None]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
        "embed_out",
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()

trainable params: 1,588,224 || all params: 163,911,168 || trainable%: 0.9690


Thanks to LoRA, there are now only about 1.6M trainable parameters, a bit less than 1% of the total!
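The trainable-parameter count can be reproduced by hand. The sketch below assumes each targeted `Linear` layer gets one pair of matrices, A with shape `(r, in_features)` and B with shape `(out_features, r)`, and that `embed_out` maps the hidden size 768 to the vocabulary size 50304 (the truncated printout above does not show that layer, so this shape is an assumption). The helper name `lora_param_count` is made up for illustration.

```python
# LoRA adds r * (in_features + out_features) parameters per targeted Linear layer.
R = 8
NUM_LAYERS = 12

# (in_features, out_features) of the targeted layers inside each GPTNeoXLayer,
# read off the model printout.
PER_BLOCK_SHAPES = {
    "query_key_value": (768, 2304),
    "dense": (768, 768),
    "dense_h_to_4h": (768, 3072),
    "dense_4h_to_h": (3072, 768),
}

# Assumption: embed_out is a Linear layer from hidden size to vocabulary size.
EMBED_OUT_SHAPE = (768, 50304)


def lora_param_count(in_features, out_features, r=R):
    """Parameters in one (A, B) pair: A is (r, in), B is (out, r)."""
    return r * in_features + out_features * r


per_block = sum(lora_param_count(i, o) for i, o in PER_BLOCK_SHAPES.values())
total_lora = NUM_LAYERS * per_block + lora_param_count(*EMBED_OUT_SHAPE)

base_params = 162_322_944  # reported by print_trainable_parameters(model)
print(f"per block:  {per_block:,}")
print(f"LoRA total: {total_lora:,}")
print(f"trainable%: {100 * total_lora / (base_params + total_lora):.4f}")
```

Under these assumptions the total comes out to 1,588,224, matching the figure reported by `print_trainable_parameters()`.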

Notice that A and B matrices were created for each targeted linear layer:

In [None]:
peft_model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): GPTNeoXForCausalLM(
      (gpt_neox): GPTNeoXModel(
        (embed_in): Embedding(50304, 768)
        (emb_dropout): Dropout(p=0.0, inplace=False)
        (layers): ModuleList(
          (0-11): 12 x GPTNeoXLayer(
            (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (post_attention_dropout): Dropout(p=0.0, inplace=False)
            (post_mlp_dropout): Dropout(p=0.0, inplace=False)
            (attention): GPTNeoXSdpaAttention(
              (rotary_emb): GPTNeoXRotaryEmbedding()
              (query_key_value): lora.Linear(
                (base_layer): Linear(in_features=768, out_features=2304, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (defaul

Let's take a closer look:

In [None]:
lin = peft_model.base_model.model.gpt_neox.layers[0].attention.query_key_value
lin

lora.Linear(
  (base_layer): Linear(in_features=768, out_features=2304, bias=True)
  (lora_dropout): ModuleDict(
    (default): Dropout(p=0.05, inplace=False)
  )
  (lora_A): ModuleDict(
    (default): Linear(in_features=768, out_features=8, bias=False)
  )
  (lora_B): ModuleDict(
    (default): Linear(in_features=8, out_features=2304, bias=False)
  )
  (lora_embedding_A): ParameterDict()
  (lora_embedding_B): ParameterDict()
  (lora_magnitude_vector): ModuleDict()
)

In [None]:
lin.lora_A, lin.lora_B

(ModuleDict(
   (default): Linear(in_features=768, out_features=8, bias=False)
 ),
 ModuleDict(
   (default): Linear(in_features=8, out_features=2304, bias=False)
 ))

In [None]:
print_trainable_parameters(lin.base_layer)
print_trainable_parameters(lin.lora_A)
print_trainable_parameters(lin.lora_B)

trainable params: 0 || all params: 1771776 || trainable%: 0.0
trainable params: 6144 || all params: 6144 || trainable%: 100.0
trainable params: 18432 || all params: 18432 || trainable%: 100.0
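These three numbers follow directly from the layer shapes. A quick check, assuming the base layer's count includes its bias vector and the adapter matrices have no bias (as the printout indicates):

```python
in_features, out_features, r = 768, 2304, 8

# Frozen base layer: weight matrix plus bias vector
base_count = in_features * out_features + out_features
# lora_A.weight has shape (r, in_features)
lora_a_count = r * in_features
# lora_B.weight has shape (out_features, r)
lora_b_count = out_features * r

print(base_count, lora_a_count, lora_b_count)
print(f"adapter size relative to base layer: {100 * (lora_a_count + lora_b_count) / base_count:.2f}%")
```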


Now, let's see how the output is produced under the hood:

In [None]:
torch.manual_seed(42)
x = torch.randn(1, 5, 768).float()

In [None]:
previous_dtype = x.dtype

# The frozen base layer produces the original (pre-LoRA) output
result = lin.base_layer(x)

for active_adapter in lin.active_adapters:
    # Skip adapters that have no LoRA weights on this layer
    if active_adapter not in lin.lora_A.keys():
        continue

    lora_A = lin.lora_A[active_adapter]
    lora_B = lin.lora_B[active_adapter]
    dropout = lin.lora_dropout[active_adapter]
    scaling = lin.scaling[active_adapter]
    # Match the adapter weights' dtype before the low-rank projection
    x = x.to(lora_A.weight.dtype)

    # Add the scaled low-rank update: B(A(dropout(x))) * scaling
    result += lora_B(lora_A(dropout(x))) * scaling

result = result.to(previous_dtype)
result

tensor([[[-0.7590,  0.8904, -1.9062,  ..., -0.2870, -0.3660,  0.3444],
         [ 0.6140,  0.5264, -0.9148,  ...,  0.5473,  0.2673,  0.2375],
         [ 0.4470,  0.2263,  0.6217,  ..., -0.0528, -0.2111, -0.5163],
         [-2.4433,  1.6739, -0.1461,  ..., -0.4212,  0.1469,  0.1895],
         [ 0.8305, -1.2330,  0.0519,  ...,  0.3947, -0.0971,  0.4093]]],
       grad_fn=<AsStridedBackward0>)
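Written as a formula, the loop above computes the following for a single active adapter. Here $W_0$ and $b_0$ denote the frozen base layer's weight and bias (names chosen for this note), and the scaling factor is assumed to be `lora_alpha / r`, which is what `lin.scaling` should hold with the default (non-rank-stabilized) configuration used here:

$$
h = W_0\,x + b_0 + \frac{\alpha}{r}\, B\, A\, \mathrm{dropout}(x),
\qquad
W_0 \in \mathbb{R}^{2304 \times 768},\;
A \in \mathbb{R}^{8 \times 768},\;
B \in \mathbb{R}^{2304 \times 8}
$$

With $\alpha = 16$ and $r = 8$, the low-rank update is multiplied by $2$ before being added to the base output. Dropout only perturbs $x$ while the module is in training mode.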

We can also try out a real sentence as input:

In [None]:
# AutoTokenizer was already imported alongside the model class
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

inputs = tokenizer("The capital of Argentina is", return_tensors="pt")

First, the tokenizer converts the sentence into token IDs:

In [None]:
inputs['input_ids']

tensor([[  510,  5347,   273, 23881,   310]])

Then, the model looks up the input embedding for each token ID:

In [None]:
embed = peft_model.base_model.model.gpt_neox.embed_in(inputs['input_ids'])
embed

tensor([[[ 0.0002,  0.0048, -0.0329,  ...,  0.0067,  0.0170,  0.0054],
         [ 0.0592, -0.0108, -0.0004,  ..., -0.0036, -0.0077, -0.0434],
         [ 0.0005, -0.0028,  0.0054,  ...,  0.0027, -0.0020, -0.0037],
         [-0.0086, -0.0027,  0.0196,  ..., -0.0363,  0.0144,  0.0205],
         [-0.0053, -0.0024,  0.0104,  ...,  0.0185,  0.0039, -0.0238]]])

Next, the embeddings go through the first layer's input layer norm:

In [None]:
lnorm = peft_model.base_model.model.gpt_neox.layers[0].input_layernorm
lnorm

LayerNorm((768,), eps=1e-05, elementwise_affine=True)

In [None]:
norm_embed = lnorm(embed)
norm_embed

tensor([[[-0.0725,  0.2857, -1.0191,  ...,  0.1744,  0.5764,  0.1426],
         [ 1.6593, -0.1973,  0.0192,  ..., -0.1529, -0.1523, -1.3258],
         [-0.0525,  0.0059,  0.3059,  ...,  0.0892, -0.0103, -0.2053],
         [-0.3531,  0.0372,  0.6126,  ..., -1.0641,  0.4470,  0.5608],
         [-0.3176,  0.0268,  0.5030,  ...,  0.7324,  0.2268, -1.0857]]])
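Layer normalization rescales each token's vector to roughly zero mean and unit variance across its features, which is why the normalized values are so much larger in magnitude than the raw embeddings. A minimal sketch in plain Python, using a made-up six-element vector and leaving out the learned per-feature weight and bias that `elementwise_affine=True` adds:

```python
import math


def layer_norm(values, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(values) / len(values)
    # Biased variance (divide by N), the form layer normalization uses
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


# Hypothetical toy embedding with small values, similar in scale to the ones above
token_embedding = [0.0002, 0.0048, -0.0329, 0.0067, 0.0170, 0.0054]
normed = layer_norm(token_embedding)

normed_mean = sum(normed) / len(normed)
normed_var = sum((v - normed_mean) ** 2 for v in normed) / len(normed)
print([round(v, 4) for v in normed])
print(normed_mean, normed_var)
```

Because `eps` is not negligible next to the tiny variance of raw embeddings, the output variance lands slightly below 1.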

What if we pass these normalized embeddings to the linear layer we're experimenting with?

In [None]:
result = lin.base_layer(norm_embed)
result.shape

torch.Size([1, 5, 2304])

The variable `result` contains the output we're trying to replicate.

Now, let's use matrices A and B to manually compute this output and compare it with the one above:

In [None]:
active_adapter = 'default'
lora_A = lin.lora_A[active_adapter]
lora_B = lin.lora_B[active_adapter]

In [None]:
lora_A.weight.shape, lora_B.weight.shape

(torch.Size([8, 768]), torch.Size([2304, 8]))

In [None]:
lora_A.state_dict(), lora_B.state_dict()

(OrderedDict([('weight',
               tensor([[ 0.0150,  0.0034, -0.0276,  ..., -0.0118, -0.0242, -0.0050],
                       [-0.0207, -0.0076, -0.0316,  ...,  0.0236,  0.0075, -0.0293],
                       [ 0.0247,  0.0306,  0.0109,  ...,  0.0133, -0.0126, -0.0100],
                       ...,
                       [ 0.0020,  0.0129,  0.0020,  ..., -0.0160, -0.0157, -0.0047],
                       [ 0.0258, -0.0268, -0.0022,  ...,  0.0063,  0.0017,  0.0089],
                       [-0.0122, -0.0147, -0.0073,  ..., -0.0279,  0.0154, -0.0129]]))]),
 OrderedDict([('weight',
               tensor([[0., 0., 0.,  ..., 0., 0., 0.],
                       [0., 0., 0.,  ..., 0., 0., 0.],
                       [0., 0., 0.,  ..., 0., 0., 0.],
                       ...,
                       [0., 0., 0.,  ..., 0., 0., 0.],
                       [0., 0., 0.,  ..., 0., 0., 0.],
                       [0., 0., 0.,  ..., 0., 0., 0.]]))]))

Did you notice anything?

Matrix B is initialized with **zeros**, so the model's original behavior is preserved before the "add-on" - the adapter - is trained.

In [None]:
low_ranked = lora_A(norm_embed)
low_ranked

tensor([[[ 0.0803, -0.2491, -0.2248,  0.1994,  0.3437,  0.6364, -0.8091,
           0.5596],
         [-0.0426, -0.6635, -0.8990, -0.4588,  1.0424, -0.0181, -0.4272,
           1.1195],
         [ 0.3390, -0.1111,  0.2337,  0.3571,  0.1301,  0.1260,  0.1835,
           0.3713],
         [ 0.5135,  0.8373, -0.4839,  0.5041,  0.0950,  0.4152,  0.1603,
           0.9721],
         [ 0.8137, -0.5629, -0.0342, -0.0405,  0.4985, -0.4471,  0.0960,
           0.2506]]], grad_fn=<UnsafeViewBackward0>)

In [None]:
delta = lora_B(low_ranked)
delta

tensor([[[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]], grad_fn=<UnsafeViewBackward0>)
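The all-zero delta is a direct consequence of B starting at zero: whatever A projects the input to, multiplying by a zero matrix gives zero. A toy sketch in plain Python, with made-up sizes (4 input features, rank 2, 3 output features) and a hypothetical `matvec` helper:

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


# Toy sizes, chosen only for illustration
x = [0.5, -1.0, 2.0, 0.25]
A = [[0.1, -0.2, 0.3, 0.0],
     [0.0, 0.4, -0.1, 0.2]]          # non-zero init, shape (r, in_features)
B = [[0.0, 0.0] for _ in range(3)]   # zero init, shape (out_features, r)
scaling = 16 / 8                     # lora_alpha / r, as in the config above

low_ranked = matvec(A, x)            # non-zero, like lora_A(norm_embed)
delta = [scaling * d for d in matvec(B, low_ranked)]

print(low_ranked)
print(delta)
```

The intermediate low-rank activations are non-zero, yet the delta added to the base output is zero everywhere, so the adapted layer starts out identical to the frozen one.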