## Chapter 3: Low-Rank Adaptation (LoRA)

### Spoilers

In this chapter, we will:

- Understand what a low-rank adapter is and why it’s useful
- Prepare the quantized model for training
- Use `peft` to create and attach adapters to a base model
- Discuss configuration options for targeting layers for training

### Setup

In [None]:
# If you're running on Colab
!pip install datasets bitsandbytes trl

In [None]:
# If you're running on runpod.io's Jupyter Template
#!pip install datasets bitsandbytes trl transformers peft huggingface-hub accelerate safetensors pandas matplotlib

### Imports

In [None]:
import numpy as np
import torch
import torch.nn as nn
from copy import deepcopy
from numpy.linalg import matrix_rank
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

### The Goal

We attach adapters to the huge linear layers in an LLM to drastically reduce the number of trainable parameters. We can easily shrink the number of trainable parameters down to less than 1% of their original number. By reducing both computation (fewer gradients to compute) and memory footprint (fewer parameters tracked by the optimizer), we achieve significant efficiency gains. Keep in mind, however, that low-rank adapters are unlikely to match the performance of full-model tuning, and their effectiveness may vary depending on the base model and the task.
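
To make the reduction concrete, here is a minimal back-of-the-envelope sketch. It assumes a single linear layer without bias and a rank-`r` adapter made of two matrices, one of shape `r x d_in` and one of shape `d_out x r`. The helper name `adapter_fraction` is hypothetical and used only for illustration.

```python
def adapter_fraction(d_in, d_out, r):
    """Ratio of adapter parameters to the frozen layer's parameters."""
    frozen = d_in * d_out          # weights of the original layer (no bias)
    adapter = r * (d_in + d_out)   # first matrix: r * d_in, second: d_out * r
    return adapter / frozen

# Compare two square layers of different sizes, both with rank 8
for d in (1024, 4096):
    print(d, f"{adapter_fraction(d, d, r=8):.4%}")
```

For a square layer the ratio simplifies to `2 * r / d`, so the larger the layer, the larger the savings. The fraction for a whole model is typically lower still, because adapters are usually attached to only some of its layers.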

### Pre-Reqs

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch3/matmul.png?raw=True)
<center>Figure 3.1 - Matrix multiplication</center>

### Low-Rank Adaptation in a Nutshell

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch3/two_matrices.png?raw=True)
<center>Figure 3.2 - Multiplying two low-rank matrices</center>

In [None]:
base_layer = nn.Linear(1024, 1024, bias=False)
base_layer.weight.shape, base_layer.weight.numel()

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch3/lowrank_matrices.png?raw=True)
<center>Figure 3.3 - Frozen weights and low-rank matrices</center>

In [None]:
torch.manual_seed(11)
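r = 8  # rank of the adapter
# A maps the input down to r dimensions; B maps it back up to the output size
layer_A = nn.Linear(base_layer.in_features, r, bias=False)
layer_B = nn.Linear(r, base_layer.out_features, bias=False)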
layer_A, layer_B

In [None]:
layer_A.weight.numel(), layer_B.weight.numel()

In [None]:
composite = layer_B.weight @ layer_A.weight
composite.shape, composite.numel()

In [None]:
matrix_rank(composite.detach().numpy())

$$
\Large
\text{output} = X @ (W + B @ A)^T
$$
<center>Equation 3.1 - Adding the resulting product to the weights</center>

In [None]:
torch.manual_seed(19)
batch = torch.randn(1, 1024)

batch @ (base_layer.weight.data + layer_B.weight @ layer_A.weight).T

$$
\Large
\text{output} = \underbrace{X @ W^T}_{O_W} + \underbrace{X @ (B @ A)^T}_{O_{AB}}
$$
<center>Equation 3.2 - Using two forward passes</center>

![](https://github.com/dvgodoy/FineTuningLLMs/blob/main/images/ch3/forward.png?raw=True)
<center>Figure 3.4 - Using two forward passes</center>

In [None]:
regular_output = batch @ base_layer.weight.data.T
additional_output = batch @ (layer_B.weight @ layer_A.weight).T
regular_output, additional_output

$$
\Large
\text{additional} = X @ (B @ A)^T = \underbrace{\underbrace{(X @ A^T)}_{O_A} @ B^T}_{O_{AB}}
$$
<center>Equation 3.3 - Chaining the adapter’s forward passes</center>
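
The identity in Equation 3.3 follows from the transpose of a product, $(B @ A)^T = A^T @ B^T$, together with the associativity of matrix multiplication. As a sketch, assuming $A$ has shape $r \times d_{in}$ and $B$ has shape $d_{out} \times r$ (the symbols $d_{in}$ and $d_{out}$ are introduced here only for illustration):

$$
\Large
X @ (B @ A)^T = X @ (A^T @ B^T) = (X @ A^T) @ B^T
$$

For a single input row, the chained form takes roughly $r \cdot (d_{in} + d_{out})$ multiplications, against $d_{in} \cdot d_{out}$ for multiplying by the full composite matrix. With $d_{in} = d_{out} = 1024$ and $r = 8$, that is 16,384 versus 1,048,576.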

In [None]:
out_A = (batch @ layer_A.weight.T)
additional_output = out_A @ layer_B.weight.T
additional_output

In [None]:
regular_output = base_layer(batch)
out_A = layer_A(batch)
additional_output = layer_B(out_A)
output = regular_output + additional_output
regular_output, additional_output, output

$$
\Large
\text{output} = X @ W^T + \frac{\alpha}{r}\left[X @ (B @ A)^T\right]
$$
<center>Equation 3.4 - LoRA’s alpha</center>

In [None]:
alpha = 2*r
output = regular_output + (alpha / r) * additional_output
output
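
Since the scaling factor is a constant, it can be folded back into a single weight matrix, in the spirit of Equation 3.1. A sketch of that algebra, using the same symbols as above:

$$
\Large
\text{output} = X @ W^T + \frac{\alpha}{r}\left[X @ (B @ A)^T\right] = X @ \left(W + \frac{\alpha}{r} B @ A\right)^T
$$

With `alpha = 2*r`, as in the cell above, the adapter's contribution is simply doubled. Folding the scaled product into $W$ is, in essence, what merging an adapter into its base layer amounts to.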

### The Road So Far

In [None]:
# Use BF16 only if the GPU supports it natively; otherwise fall back to FP32
supported = torch.cuda.is_bf16_supported(including_emulation=False)
compute_dtype = (torch.bfloat16 if supported else torch.float32)

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=compute_dtype
)

model_q4 = AutoModelForCausalLM.from_pretrained("facebook/opt-350m",
                                                device_map='cuda:0',
                                                torch_dtype=compute_dtype,
                                                quantization_config=nf4_config)

### Parameter Types and Gradients

****
**Summary of "Parameter Types and Gradients"**
- quantization only freezes the linear layers that have been quantized
- after quantization, a model can be prepared using the `prepare_model_for_kbit_training()` function
  - it freezes **all** layers
  - it casts every non-quantized 16-bit layer to FP32 to improve training
  - it enables gradient checkpointing
- you'll be able to unfreeze layers of your choice later on using the LoRA configuration
****
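
One way to check the points above at a glance is to count parameters grouped by data type and by whether they are trainable. The helper below is a minimal sketch, not part of any library: the name `parameter_summary` is hypothetical, and it only assumes a PyTorch-style model that exposes `named_parameters()`.

```python
from collections import Counter

def parameter_summary(model):
    """Count parameter elements grouped by (dtype, trainable or frozen)."""
    counts = Counter()
    for _, param in model.named_parameters():
        status = "trainable" if param.requires_grad else "frozen"
        counts[(str(param.dtype), status)] += param.numel()
    return dict(counts)
```

Calling `parameter_summary(model_q4)` before preparing the model and `parameter_summary(prepared_model)` afterward should make the freezing and the cast to FP32 visible. Keep in mind that quantized weights are stored packed (as `torch.uint8`), so their element counts will look smaller than the number of original weights.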

In [None]:
def trainable_parms(model):
    parms = [(name, param.dtype) for name, param in model.named_parameters() if param.requires_grad]
    return parms

trainable_parms(model_q4.model)

#### `prepare_model_for_kbit_training()`

In [None]:
prepared_model = prepare_model_for_kbit_training(model_q4,
                                        use_gradient_checkpointing=True,
                                        gradient_checkpointing_kwargs={'use_reentrant': False})
prepared_model

In [None]:
trainable_parms(prepared_model)

In [None]:
def parms_of_dtype(model, dtype=torch.float32):
    parms = [name for name, param in model.named_parameters() if param.dtype == dtype]
    return parms

In [None]:
parms_of_dtype(prepared_model)

In [None]:
prepared_model.get_memory_footprint()/1e6

### PEFT

"_🤗 PEFT (Parameter-Efficient Fine-Tuning) is a library for efficiently adapting large pretrained models to various downstream applications without fine-tuning all of a model’s parameters because it is prohibitively costly. PEFT methods only fine-tune a small number of (extra) model parameters - significantly decreasing computational and storage costs - while yielding performance comparable to a fully fine-tuned model. This makes it more accessible to train and store large language models (LLMs) on consumer hardware._

_PEFT is integrated with the Transformers, Diffusers, and Accelerate libraries to provide a faster and easier way to load, train, and use large models for inference._"

****
**Summary of "PEFT"**
- the basic configuration below should work well in many cases
```python
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, config)
```
- ranks of 8, 16, or 32 are typical, but using higher values shouldn’t significantly impact the model’s memory footprint.
- the scaling factor, `lora_alpha`, is typically twice the rank.
- if your model has `Conv1D` layers, add `fan_in_fan_out=True` to your configuration
- if your model was recently released, you may need to specify the `target_modules` manually
  - typically, use the names of the massive linear layers in the attention module.
- by default, only the adapters are trainable
  - if you'd like to train other layers, such as layer norms, add them to the `modules_to_save` argument
  - if you're adding your own tokens to the tokenizer, you'll need to also train vocabulary-related layers such as embeddings and the model's head
****
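
To get a feel for why the rank has little effect on memory, here is a rough count of adapter parameters for different ranks. It is a sketch under stated assumptions: square `1024 x 1024` projections, 24 decoder layers, and two targeted projections per layer (values assumed here for OPT-350M), with adapter weights kept in FP32. The helper name `lora_parameter_count` is hypothetical.

```python
def lora_parameter_count(r, hidden_size=1024, n_layers=24, targets_per_layer=2):
    """Adapter parameters when every targeted layer is hidden_size x hidden_size."""
    per_module = r * (hidden_size + hidden_size)  # A: r x hidden, B: hidden x r
    return per_module * targets_per_layer * n_layers

for r in (8, 16, 32, 64):
    n = lora_parameter_count(r)
    # assuming 4 bytes per FP32 adapter weight
    print(f"r={r:>2}: {n:>9,} parameters, ~{n * 4 / 1e6:.1f} MB in FP32")
```

The count grows linearly with the rank, but even at `r=64` the adapter weights stay in the tens of megabytes under these assumptions. Gradients and optimizer states for the adapters come on top of that, and they scale the same way.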

In [None]:
lora_config = LoraConfig()
lora_config

In [None]:
config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

### `target_modules`

Since new models and architectures are released on a weekly basis, chances are that your currently installed version of the PEFT library has no preconfigured list of target layers for the model you're using. In this case, you’ll be greeted with the following error:

***
`` ValueError: Please specify `target_modules` in `peft_config` ``
***

To fix it, you need the names of the layers you'd like to adapt (printing the model shows them). Once you have the names, you can use yet another configuration argument, `target_modules`, which is either the name or a list of the names of the modules to which you want to apply the adapters.
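
One way to collect candidate names is to walk the model's modules and keep the linear ones. The helper below is a minimal sketch: the name `linear_layer_names` is hypothetical, it only assumes a PyTorch-style model exposing `named_modules()`, and matching on the class-name prefix is a heuristic meant to catch `Linear` as well as quantized variants such as `Linear4bit`.

```python
def linear_layer_names(model, exclude=("lm_head",)):
    """Collect the distinct short names of linear-like modules in a model."""
    names = set()
    for full_name, module in model.named_modules():
        # Heuristic: class names such as Linear or Linear4bit
        if type(module).__name__.startswith("Linear"):
            short_name = full_name.split(".")[-1]  # drop the parent path
            if short_name not in exclude:
                names.add(short_name)
    return sorted(names)
```

You could call `linear_layer_names(model_q4)`, pick the names you want (typically the attention projections), and pass them as `target_modules` in your `LoraConfig`. The model's head is excluded by default here, since it is usually not a target for adapters.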

**Supported Models**

If you'd like to check if a given model's architecture is already supported by the installed version of the `peft` package, you can do the following:

In [None]:
from peft.utils.constants import TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING
TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING.keys()

In [None]:
TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING['phi']

#### The PEFT Model

In [None]:
peft_model = get_peft_model(prepared_model, config, adapter_name='default')
peft_model

In [None]:
TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING['opt']

In [None]:
lin = peft_model.base_model.model.model.decoder.layers[0].self_attn.q_proj
lin

In [None]:
peft_model.print_trainable_parameters()

In [None]:
trainable_parms(peft_model.base_model.model)

#### `modules_to_save`

In [None]:
config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    modules_to_save=['layer_norm']
)

In [None]:
# Since the model is modified in-place, we need to unload adapters
# from previous configuration to avoid mixing them.
# In a regular workflow, you'd load configuration only once and
# this wouldn't be needed.
_ = peft_model.unload()

In [None]:
peft_model = get_peft_model(prepared_model, config)
peft_model.print_trainable_parameters()

#### Embeddings

In [None]:
config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    modules_to_save=['layer_norm', 'embed_tokens']
)

In [None]:
# Since the model is modified in-place, we need to unload adapters
# from previous configuration to avoid mixing them.
# In a regular workflow, you'd load configuration only once and
# this wouldn't be needed.
_ = peft_model.unload()

In [None]:
peft_model = get_peft_model(prepared_model, config)
peft_model.print_trainable_parameters()

In [None]:
config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['embed_tokens', 'q_proj', 'v_proj']
)

In [None]:
# Since the model is modified in-place, we need to unload adapters
# from previous configuration to avoid mixing them.
# In a regular workflow, you'd load configuration only once and
# this wouldn't be needed.
_ = peft_model.unload()

In [None]:
peft_model = get_peft_model(prepared_model, config)
peft_model.print_trainable_parameters()

In [None]:
lin = peft_model.base_model.model.model.decoder.embed_tokens
lin

#### Managing Adapters

In [None]:
peft_model.load_adapter('dvgodoy/opt-350m-lora-yoda', adapter_name='yoda')
lora_A = peft_model.base_model.model.model.decoder.layers[0].self_attn.q_proj.lora_A
lora_A

In [None]:
peft_model.add_adapter(adapter_name='third', peft_config=config)
lora_A

In [None]:
peft_model.delete_adapter(adapter_name='third')
lora_A

In [None]:
peft_model.peft_config.keys()

In [None]:
peft_model.active_adapter

In [None]:
peft_model.set_adapter('yoda')
peft_model.active_adapter

```python
with peft_model.disable_adapter():
    original_outputs = peft_model(inputs)

original_outputs = peft_model.base_model(inputs)
```

In [None]:
peft_model.merge_adapter(adapter_names=['yoda'])
lora_A

In [None]:
peft_model.unload()
peft_model.base_model.model.model.decoder.layers[0].self_attn

### Coming Up in "Fine-Tuning LLMs"

Low-rank adapters saved the day by swooping in and enabling fast and cheap fine-tuning for LLMs. These humongous models, although powerful, are masters of a single trade—predicting the next token—thus remaining limited by the structure of their inputs. A new kind of input must be developed to enable these creatures to chat. Learn more about the incredible tale of chat templates in the next chapter of "Fine-Tuning LLMs."