## Set the stage - Data, Model, Library, Pre-training

Hugging Face has developed the `peft` library to facilitate the efficient adaptation of pretrained language models to various downstream applications without fine-tuning all of the model's parameters. The `peft` library supports multiple fine-tuning methods, one of which is LoRA (Low-Rank Adaptation), and it can be applied to various model types, not only transformers. Currently, `peft` allows LoRA fine-tuning of Linear, Embedding, and Conv2D layers.

There is an abundance of tutorials and blogs on applying LoRA fine-tuning to Large Language Models (LLMs) such as LLaMA. Grasping the methodology and rationale behind LoRA while applying it to large models is challenging because of the inherent complexity of those models. To build understanding, let's implement LoRA on a multilayer perceptron (MLP), use it to train a model for a binary classification task, and assess parameter efficiency during fine-tuning.

In [None]:
import torch
from torch import nn
import torch.nn.functional as F

In [None]:
torch.manual_seed(0)

<torch._C.Generator at 0x7e03e06b9610>

Let's create a toy dataset consisting of random data for a classification task. There is a little bit of signal in the data, so we should expect that the loss of the model can improve during training.

In [None]:
X = torch.rand((1000, 20))  # 1000 samples with 20 features each, drawn uniformly from [0, 1)
y = (torch.sin(X.sum(1)) > 0).long()  # label of shape (1000,): 1 if the sine of the row sum is > 0, else 0; cast to torch.int64

In [None]:
# Distribution of data between both classes
unique, counts = torch.unique(y, return_counts=True)
distribution = dict(zip(unique.tolist(), counts.tolist()))
distribution

{0: 663, 1: 337}

In [None]:
n_train = 800
batch_size = 64

In [None]:
train_dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X[:n_train], y[:n_train]),
    batch_size=batch_size,
    shuffle=True,
)
eval_dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X[n_train:], y[n_train:]),
    batch_size=batch_size,
)

As a model, we use a simple multilayer perceptron (MLP). For demonstration purposes, we use a very large number of hidden units. This is overkill for this task, but it helps to demonstrate the advantages of `peft`. In more realistic settings, models will also be quite large on average, so this is not far-fetched.

In [None]:
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.seq = nn.Sequential(
            nn.Linear(20, 2000),
            nn.ReLU(),
            nn.Linear(2000, 200),
            nn.ReLU(),
            nn.Linear(200, 2),
            nn.LogSoftmax(dim=-1),  # emit log-probabilities over the two classes
        )

    def forward(self, X):
        return self.seq(X)

Here are just a few training hyper-parameters and a simple function that performs the training and evaluation loop.

In [None]:
lr = 0.002
batch_size = 64
max_epochs = 35
device = 'cpu' if not torch.cuda.is_available() else 'cuda'
print(device)

cpu


In [None]:
# Both the data and the model have to be on the same device,
# so xb and yb are moved to `device` here, and the model is moved when it is created.
def train(model, optimizer, criterion, train_dataloader, eval_dataloader, epochs):
    for epoch in range(epochs):
        model.train()
        train_loss = 0
        for xb, yb in train_dataloader:
            xb = xb.to(device)
            yb = yb.to(device)
            outputs = model(xb)
            loss = criterion(outputs, yb)
            train_loss += loss.detach().float()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        model.eval()
        eval_loss = 0
        for xb, yb in eval_dataloader:
            xb = xb.to(device)
            yb = yb.to(device)
            with torch.no_grad():
                outputs = model(xb)
            loss = criterion(outputs, yb)
            eval_loss += loss.detach().float()

        eval_loss_total = (eval_loss / len(eval_dataloader)).item()
        train_loss_total = (train_loss / len(train_dataloader)).item()
        print(f"{epoch=:<2}  {train_loss_total=:.4f}  {eval_loss_total=:.4f}")

In [None]:
base_model = MLP().to(device)
optimizer = torch.optim.Adam(base_model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
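A side note on pairing `nn.LogSoftmax` in the model with `nn.CrossEntropyLoss` as the criterion: `nn.CrossEntropyLoss` applies a log-softmax to its input internally, so the model's outputs are normalized twice. The sketch below, using NumPy and a hypothetical `log_softmax` helper on made-up logits, suggests why this is harmless: applying log-softmax to values that are already log-probabilities leaves them unchanged.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 2))           # made-up raw scores: 4 samples, 2 classes

log_probs = log_softmax(logits)            # what the final LogSoftmax layer emits
log_probs_twice = log_softmax(log_probs)   # what the criterion would compute on top of it

print(np.allclose(log_probs, log_probs_twice))
```

The second normalization subtracts the log of the summed probabilities, which is log(1) = 0, so the loss value is the same as if the model had returned raw logits.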

In [None]:
base_model

MLP(
  (seq): Sequential(
    (0): Linear(in_features=20, out_features=2000, bias=True)
    (1): ReLU()
    (2): Linear(in_features=2000, out_features=200, bias=True)
    (3): ReLU()
    (4): Linear(in_features=200, out_features=2, bias=True)
    (5): LogSoftmax(dim=-1)
  )
)

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
print_trainable_parameters(base_model)

trainable params: 442602 || all params: 442602 || trainable%: 100.0
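The total of 442,602 can be reproduced by hand from the layer sizes. The short sketch below uses a hypothetical helper, `linear_param_count`, which is not part of PyTorch or `peft`:

```python
# Layer sizes of the MLP defined above: 20 -> 2000 -> 200 -> 2
layer_dims = [(20, 2000), (2000, 200), (200, 2)]

def linear_param_count(in_features, out_features, bias=True):
    """Weights (in * out) plus one bias per output unit."""
    return in_features * out_features + (out_features if bias else 0)

per_layer = [linear_param_count(i, o) for i, o in layer_dims]
total = sum(per_layer)
print(per_layer, total)  # [42000, 400200, 402] 442602
```

Almost all of the parameters sit in the middle layer (2000 x 200), which is why adapting that layer with a low-rank update saves the most.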


In [None]:
# Let's train the base model
%time train(base_model, optimizer, criterion, train_dataloader, eval_dataloader, epochs=20)

epoch=0   train_loss_total=0.6282  eval_loss_total=0.5629
epoch=1   train_loss_total=0.5087  eval_loss_total=0.4953
epoch=2   train_loss_total=0.3993  eval_loss_total=0.4251
epoch=3   train_loss_total=0.3303  eval_loss_total=0.4426
epoch=4   train_loss_total=0.3449  eval_loss_total=0.3773
epoch=5   train_loss_total=0.3435  eval_loss_total=0.3848
epoch=6   train_loss_total=0.2730  eval_loss_total=0.3268
epoch=7   train_loss_total=0.2616  eval_loss_total=0.3372
epoch=8   train_loss_total=0.2213  eval_loss_total=0.3446
epoch=9   train_loss_total=0.2236  eval_loss_total=0.3269
epoch=10  train_loss_total=0.2067  eval_loss_total=0.3505
epoch=11  train_loss_total=0.1907  eval_loss_total=0.3439
epoch=12  train_loss_total=0.1981  eval_loss_total=0.3375
epoch=13  train_loss_total=0.1900  eval_loss_total=0.3734
epoch=14  train_loss_total=0.1905  eval_loss_total=0.3414
epoch=15  train_loss_total=0.1686  eval_loss_total=0.4842
epoch=16  train_loss_total=0.1666  eval_loss_total=0.3911
epoch=17  trai

We achieved an evaluation loss that is better than a random outcome. In fine-tuning exercises, the primary focus is typically on the model's performance on a specific downstream task. Showcasing performance improvements with LoRA fine-tuning on our MLP, pretrained on a toy dataset, is not ideal, because LoRA's advantages are more pronounced in larger models with clearly defined downstream tasks. Nonetheless, our objective here is to assess how LoRA improves parameter efficiency and to deepen our understanding of the algorithm, since following the implementation of LoRA at the code level on large models can be quite challenging.
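To put "better than a random outcome" into numbers, one can compute the loss of two trivial predictors. This is only a rough sketch: it assumes that the class counts printed earlier for all 1000 samples also describe the evaluation split reasonably well, and the variable names are illustrative.

```python
import math

counts = {0: 663, 1: 337}   # class distribution printed earlier (all 1000 samples)
n = sum(counts.values())
priors = [c / n for c in counts.values()]

# Cross-entropy of a constant predictor that always outputs the class priors
prior_baseline = -sum(p * math.log(p) for p in priors)
# Cross-entropy of a constant predictor that always outputs 50/50
uniform_baseline = math.log(2)

print(f"{prior_baseline=:.4f}  {uniform_baseline=:.4f}")
```

Both baselines come out above 0.63, whereas the evaluation loss logged above stays below 0.5 from epoch 1 onward.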

## Finetuning with LoRA
Using our model, pre-trained for 20 epochs, as the `base_model`, we'll apply LoRA using Hugging Face's `peft` library. Make sure that you have the latest version of `peft` installed.

In [None]:
!python -m pip install --upgrade peft

Collecting peft
  Downloading peft-0.7.1-py3-none-any.whl (168 kB)
Collecting accelerate>=0.21.0 (from peft)
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Installing collected packages: accelerate, peft
Successfully installed accelerate-0.26.1 peft-0.7.1


In [None]:
import copy
import os

# Set this variable to 1 to not get default messages. Ignore bnb warnings
os.environ["BITSANDBYTES_NOWELCOME"] = "1"

In [None]:
import peft

We already established that we will inject a few extra sets of parameters, called adapters, between the layers of the pretrained base model, and train only these adapters while keeping the base model's parameters frozen. Where and how to inject the adapters in the base model is still an open question. Both parts are discussed in the LoRA paper (https://arxiv.org/pdf/2106.09685.pdf). As a first step, let's list the modules of the base model.


In [None]:
# Let's identify the names of the modules, so that we attach adapters to the appropriate ones.
[(n, type(m)) for n, m in base_model.named_modules()]

[('', __main__.MLP),
 ('seq', torch.nn.modules.container.Sequential),
 ('seq.0', torch.nn.modules.linear.Linear),
 ('seq.1', torch.nn.modules.activation.ReLU),
 ('seq.2', torch.nn.modules.linear.Linear),
 ('seq.3', torch.nn.modules.activation.ReLU),
 ('seq.4', torch.nn.modules.linear.Linear),
 ('seq.5', torch.nn.modules.activation.LogSoftmax)]

### Where to inject the adapters?

In the current scenario with a 5-layer MLP, the exact layer for adapter insertion is not particularly critical. In larger models, however, each layer serves a distinct function and contributes differently to learning. For instance, the linear layers of a transformer carry most of the learned information, unlike its layer normalization layers. To limit catastrophic forgetting of the original model, adapter modules should be attached to these impactful layers. Section 7.1 of the LoRA paper¹ reports experiments on which weight matrices are best adapted and which ones can remain undisturbed.

Note: Not all layer types can be fine-tuned with LoRA. At the moment, `Linear`, `Embedding`, `Conv2D` and `Conv1D` are supported.

### How to inject the adapters?
Now let's address the 'how' part of the question. Let's say we choose to place adapters on the linear layers seq.0 and seq.2 of the base model, and call these layers adoptees. Adapters can be placed either in sequence with or in parallel to the adoptee layers. Since adapters are small compared to the adoptee layers, running them in sequence is inefficient on a GPU on two counts: GPU memory won't be fully utilized, and, because GPUs are designed for parallel execution, the extra sequential layers add latency.

The authors of LoRA proposed placing adapters in parallel to the adoptee layers. This design keeps the adoptee's weight matrix and the adapter's matrix separate throughout the fine-tuning process. Adoptee and adapter must have the same input and output dimensions so that the parallel connection can be accommodated.

LoRA decomposes the adapter matrix into two low-rank matrices (lora_A and lora_B) with a very small rank. The adapter is designed such that the output of lora_B applied after lora_A is compatible with the output of the adoptee layer. Only lora_A and lora_B are learned for the specific downstream task.
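The shape compatibility and the low rank of the decomposition can be checked numerically. The sketch below uses NumPy with random stand-in matrices; the dimensions (20 inputs, 2000 outputs) mirror `seq.0`, and the rank of 3 is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
in_features, out_features, r = 20, 2000, 3   # seq.0 dimensions and an assumed rank

W = rng.normal(size=(out_features, in_features))   # stand-in for the adoptee weight
lora_A = rng.normal(size=(r, in_features))         # down-projection: in_features -> r
lora_B = rng.normal(size=(out_features, r))        # up-projection: r -> out_features

delta_W = lora_B @ lora_A                          # the adapter's effective weight matrix
print(delta_W.shape == W.shape)                    # same shape as the adoptee weight
print(np.linalg.matrix_rank(delta_W))              # rank is at most r
```

However large the adoptee is, the product of the two small matrices has the adoptee's shape while its rank never exceeds r.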

Let's define the LoRA configuration. We set the LoRA rank to 3 and select the layers seq.0 and seq.2 for LoRA fine-tuning. lora_A and lora_B layers are created alongside both seq.0 and seq.2. For seq.0, the number of parameters in lora_A (20 x 3) plus the number of parameters in lora_B (3 x 2000) is 6,060, far fewer than the 40,000 weights (20 x 2000) of seq.0 itself. However, the output dimension of lora_A followed by lora_B and that of seq.0 are both 2000, irrespective of the value of r! The two outputs can therefore be added and passed to the next module of the network.

In [None]:
config = peft.LoraConfig(
    r=3,
    target_modules=["seq.0", "seq.2"],
)

In [None]:
base_model_pretrained = copy.deepcopy(base_model)  # Let's keep a copy of the pretrained model for later use
peft_model = peft.get_peft_model(base_model, config)
optimizer = torch.optim.Adam(peft_model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
peft_model.print_trainable_parameters()

trainable params: 12,660 || all params: 455,262 || trainable%: 2.780816321151337
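The 12,660 trainable parameters and the 2.78% share follow directly from the rank and the shapes of the two targeted layers. A minimal sketch, with `lora_param_count` as a hypothetical helper that is not part of `peft`:

```python
r = 3
# (in_features, out_features) of the two targeted layers
targets = {"seq.0": (20, 2000), "seq.2": (2000, 200)}

def lora_param_count(in_features, out_features, rank):
    """lora_A holds in * rank weights, lora_B holds rank * out weights; neither has a bias."""
    return in_features * rank + rank * out_features

per_layer = {name: lora_param_count(i, o, r) for name, (i, o) in targets.items()}
trainable = sum(per_layer.values())
base_params = 442_602                      # frozen parameters of the base model
total = base_params + trainable

print(per_layer, trainable, total, f"{100 * trainable / total:.2f}%")
```

The count grows linearly with r, so doubling the rank doubles the number of trainable parameters while the frozen part stays fixed.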


We see that only ~2.8% of the parameters are actually trainable, which is what we like to see. Now let's see what the architecture of the model with LoRA weights looks like:

In [None]:
peft_model

PeftModel(
  (base_model): LoraModel(
    (model): MLP(
      (seq): Sequential(
        (0): lora.Linear(
          (base_layer): Linear(in_features=20, out_features=2000, bias=True)
          (lora_dropout): ModuleDict(
            (default): Identity()
          )
          (lora_A): ModuleDict(
            (default): Linear(in_features=20, out_features=3, bias=False)
          )
          (lora_B): ModuleDict(
            (default): Linear(in_features=3, out_features=2000, bias=False)
          )
          (lora_embedding_A): ParameterDict()
          (lora_embedding_B): ParameterDict()
        )
        (1): ReLU()
        (2): lora.Linear(
          (base_layer): Linear(in_features=2000, out_features=200, bias=True)
          (lora_dropout): ModuleDict(
            (default): Identity()
          )
          (lora_A): ModuleDict(
            (default): Linear(in_features=2000, out_features=3, bias=False)
          )
          (lora_B): ModuleDict(
            (default): Linear(in

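The forward pass of the adapted model can be summarized in pseudo-code:

def forward(x):
    seq0_out = seq0.base_layer(x)             # frozen adoptee
    lora_A_out = seq0.lora_A(x)               # project down to rank r
    lora_B_out = seq0.lora_B(lora_A_out)      # project back up to the output size
    lora_B_out = lora_B_out * alpha           # scale the low-rank update
    seq0_lora_out = seq0_out + lora_B_out     # combine the parallel branches
    seq0_lora_out = ReLU(seq0_lora_out)
    # seq.2 is wrapped the same way and receives seq0_lora_out as its input
    return LogSoftmax(seq4(ReLU(seq2_with_lora(seq0_lora_out))))

In the pseudo-code above, alpha is a scaling factor applied to the low-rank update before it is added to the adoptee's output. It balances the pretrained model's knowledge against the new task-specific adaptation. In the LoRA paper the update is scaled by alpha / r, and `peft` exposes alpha as the `lora_alpha` option of `LoraConfig`, so the factor equals 1 whenever `lora_alpha` equals `r`.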
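For readers who prefer something runnable, here is a minimal PyTorch sketch of a single LoRA-wrapped linear layer. The class name `LoRALinear`, its arguments, and the chosen values are illustrative assumptions; this is not the `peft` implementation, only the same idea in a few lines.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Hypothetical minimal LoRA wrapper around a frozen nn.Linear (not the peft class)."""

    def __init__(self, base_layer: nn.Linear, r: int = 3, alpha: float = 3.0):
        super().__init__()
        self.base_layer = base_layer
        for p in self.base_layer.parameters():
            p.requires_grad_(False)                  # freeze the adoptee
        self.lora_A = nn.Linear(base_layer.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base_layer.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)           # the adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        # Parallel branches: frozen base output plus the scaled low-rank update
        return self.base_layer(x) + self.lora_B(self.lora_A(x)) * self.scaling

torch.manual_seed(0)
base = nn.Linear(20, 2000)
layer = LoRALinear(base, r=3, alpha=3.0)
x = torch.rand(4, 20)
out = layer(x)

print(out.shape)                     # same output size as the base layer
print(torch.equal(out, base(x)))     # identical to the base layer before any training
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))
```

Only the two small matrices are trainable, and because the up-projection starts at zero in this sketch, the wrapped layer initially reproduces the base layer exactly.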

### How to initialize lora_A and lora_B?
If both lora_A and lora_B were initialized to 0, the gradient of the loss with respect to each of them would also be 0, because the gradient of one matrix is proportional to the other. The adapter would then receive no updates at all. More generally, initializing all weights to the same value means every neuron receives the same update; letting neurons learn different aspects of the data requires symmetry breaking, and an all-zero initialization never breaks the symmetry.

Initializing both of them randomly may destabilise the training. While this breaks symmetry, the product of lora_B and lora_A is then non-zero from the first step, so at the beginning of fine-tuning the network might produce outputs that are significantly off-target, and the optimizer first has to correct this random perturbation. There are techniques to mitigate such instabilities, such as lower learning rates, smaller initial values, and warm-up periods at the start of training.

LoRA gets the best of both worlds: lora_A is initialized with a random Gaussian and lora_B is set to 0, so their product is 0. At the start of fine-tuning only the base model is in play and the adapters contribute nothing to the output, so there are no instabilities during the initial training stages, yet lora_B receives non-zero gradients and the adapters can start to learn.

Let's verify this:

In [None]:
lora_B = peft_model.state_dict()['base_model.model.seq.0.lora_B.default.weight']
lora_A = peft_model.state_dict()['base_model.model.seq.0.lora_A.default.weight']
print(lora_B.size())
print(lora_A.size())
print(torch.all(lora_B @ lora_A == 0))

torch.Size([2000, 3])
torch.Size([3, 20])
tensor(True)
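The effect of the initialization on the very first gradient step can be illustrated with plain tensors. The sketch below is an assumption-laden toy: it looks at the adapter branch alone, uses a made-up regression target, and `first_step_grads` is a hypothetical helper.

```python
import torch

torch.manual_seed(0)
x = torch.rand(8, 20)
target = torch.rand(8, 2000)   # made-up target, only used to obtain a non-zero loss

def first_step_grads(A_init, B_init):
    """Return the gradient norms of A and B for a loss on the adapter branch alone."""
    A = A_init.clone().requires_grad_(True)   # shape (r, in_features)
    B = B_init.clone().requires_grad_(True)   # shape (out_features, r)
    out = x @ A.T @ B.T                       # low-rank branch only
    loss = ((out - target) ** 2).mean()
    loss.backward()
    return A.grad.norm().item(), B.grad.norm().item()

r = 3
zeros_A, zeros_B = torch.zeros(r, 20), torch.zeros(2000, r)
rand_A = torch.randn(r, 20)

print(first_step_grads(zeros_A, zeros_B))   # both gradients vanish: nothing can be learned
print(first_step_grads(rand_A, zeros_B))    # A's gradient vanishes, B's does not
```

With a random lora_A and a zero lora_B, the first update moves lora_B away from zero, after which lora_A also starts to receive gradients.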


Additional Remark:

In Figure 1 of the LoRA paper, the parameters initially named A and B are later referred to as B and A, respectively, starting from Section 4 (Method and Implementation). Despite this switch in nomenclature, it's important to note that both the PEFT implementation and the paper discuss the same parameters. This naming discrepancy is also highlighted in an issue I raised, which can be viewed here: https://github.com/huggingface/peft/issues/983.

When fine-tuning is performed on the same task and data as used in pretraining, it is observed that the loss remains relatively consistent with the last epoch of pretraining, indicating that the training process remains stable. This stability aligns with the effects of the lora_A and lora_B initialization as proposed in the LoRA paper.

In [None]:
%time train(peft_model, optimizer, criterion, train_dataloader, eval_dataloader, epochs=5)

epoch=0   train_loss_total=0.1001  eval_loss_total=0.3855
epoch=1   train_loss_total=0.0944  eval_loss_total=0.3875
epoch=2   train_loss_total=0.0926  eval_loss_total=0.3918
epoch=3   train_loss_total=0.0909  eval_loss_total=0.3921
epoch=4   train_loss_total=0.0889  eval_loss_total=0.3913
CPU times: user 526 ms, sys: 13.1 ms, total: 539 ms
Wall time: 641 ms


### Finetune on the downstream task with peft
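Let's define a downstream task that is related but not identical to the pretraining task. A decreasing trend in the loss indicates that the adapters are learning and adapting to the new task. In the run below, the loss drops steadily and then plateaus near 0.69, which is about ln 2, the loss of a classifier that assigns equal probability to both classes.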

In [None]:
X = torch.rand((500, 20))  # 500 samples with 20 features each, drawn uniformly from [0, 1)
y = (X.sum(1) > 10).long()  # label of shape (500,): 1 if the row sum is > 10, else 0; cast to torch.int64
n_train = 300
batch_size = 64
train_dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X[:n_train], y[:n_train]),
    batch_size=batch_size,
    shuffle=True,
)
eval_dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X[n_train:], y[n_train:]),
    batch_size=batch_size,
)
%time train(peft_model, optimizer, criterion, train_dataloader, eval_dataloader, epochs=10)

epoch=0   train_loss_total=2.9943  eval_loss_total=2.4758
epoch=1   train_loss_total=2.3308  eval_loss_total=1.9671
epoch=2   train_loss_total=1.7741  eval_loss_total=1.2765
epoch=3   train_loss_total=1.1337  eval_loss_total=0.8434
epoch=4   train_loss_total=0.7536  eval_loss_total=0.6921
epoch=5   train_loss_total=0.6938  eval_loss_total=0.6921
epoch=6   train_loss_total=0.6928  eval_loss_total=0.6921
epoch=7   train_loss_total=0.6931  eval_loss_total=0.6921
epoch=8   train_loss_total=0.6937  eval_loss_total=0.6921
epoch=9   train_loss_total=0.6935  eval_loss_total=0.6921
CPU times: user 560 ms, sys: 703 µs, total: 561 ms
Wall time: 613 ms


To verify the correct application of LoRA, we need to identify which parameters were updated and which remained unchanged. Let's print all the named parameters of the pretrained base model and the peft model and compare their weights. Only the extra LoRA adapter layers in the peft model should have been updated.

In [None]:
print("** Pretrained based model's parameters **")
for name, param in base_model_pretrained.named_parameters():
  print(name)
print()
print("** Peft model's parameters **")
for name, param in peft_model.named_parameters():
  print(name)

** Pretrained based model's parameters **
seq.0.weight
seq.0.bias
seq.2.weight
seq.2.bias
seq.4.weight
seq.4.bias

** Peft model's parameters **
base_model.model.seq.0.base_layer.weight
base_model.model.seq.0.base_layer.bias
base_model.model.seq.0.lora_A.default.weight
base_model.model.seq.0.lora_B.default.weight
base_model.model.seq.2.base_layer.weight
base_model.model.seq.2.base_layer.bias
base_model.model.seq.2.lora_A.default.weight
base_model.model.seq.2.lora_B.default.weight
base_model.model.seq.4.weight
base_model.model.seq.4.bias


In [None]:
print(torch.equal(base_model_pretrained.state_dict()['seq.0.weight'], peft_model.state_dict()['base_model.model.seq.0.base_layer.weight']))
print(torch.equal(base_model_pretrained.state_dict()['seq.0.bias'], peft_model.state_dict()['base_model.model.seq.0.base_layer.bias']))
print(torch.equal(base_model_pretrained.state_dict()['seq.2.weight'], peft_model.state_dict()['base_model.model.seq.2.base_layer.weight']))
print(torch.equal(base_model_pretrained.state_dict()['seq.2.bias'], peft_model.state_dict()['base_model.model.seq.2.base_layer.bias']))
print(torch.equal(base_model_pretrained.state_dict()['seq.4.weight'], peft_model.state_dict()['base_model.model.seq.4.weight']))
print(torch.equal(base_model_pretrained.state_dict()['seq.4.bias'], peft_model.state_dict()['base_model.model.seq.4.bias']))

True
True
True
True
True
True


In [None]:
print_trainable_parameters(base_model_pretrained)
print_trainable_parameters(peft_model)

trainable params: 442602 || all params: 442602 || trainable%: 100.0
trainable params: 12660 || all params: 455262 || trainable%: 2.780816321151337


12,660 extra parameters are added to the base_model, increasing the total number of parameters from 442,602 to 455,262. Only these extra parameters are trainable in the peft_model.

### Finetune with full rank adaptors as well
In addition to low-rank adapters, you can also fine-tune full-rank adapters using the `peft` library. Full-rank adapters are essentially replicas of the layers being adapted. They can be saved independently and merged later. This is another useful feature of the `peft` library and can be enabled with the `modules_to_save` option. In some cases this can improve performance on the fine-tuning task.

In [None]:
config_1 = peft.LoraConfig(
    r=3,
    target_modules=["seq.0", "seq.2"],
    modules_to_save=["seq.4"],
)

In [None]:
copy_1 = copy.deepcopy(base_model_pretrained)  # keep the original as is and work on a copy
peft_model_1 = peft.get_peft_model(copy_1, config_1)
optimizer = torch.optim.Adam(peft_model_1.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
peft_model_1.print_trainable_parameters()

trainable params: 13,062 || all params: 455,664 || trainable%: 2.8665859054039817


In [None]:
peft_model_1

PeftModel(
  (base_model): LoraModel(
    (model): MLP(
      (seq): Sequential(
        (0): lora.Linear(
          (base_layer): Linear(in_features=20, out_features=2000, bias=True)
          (lora_dropout): ModuleDict(
            (default): Identity()
          )
          (lora_A): ModuleDict(
            (default): Linear(in_features=20, out_features=3, bias=False)
          )
          (lora_B): ModuleDict(
            (default): Linear(in_features=3, out_features=2000, bias=False)
          )
          (lora_embedding_A): ParameterDict()
          (lora_embedding_B): ParameterDict()
        )
        (1): ReLU()
        (2): lora.Linear(
          (base_layer): Linear(in_features=2000, out_features=200, bias=True)
          (lora_dropout): ModuleDict(
            (default): Identity()
          )
          (lora_A): ModuleDict(
            (default): Linear(in_features=2000, out_features=3, bias=False)
          )
          (lora_B): ModuleDict(
            (default): Linear(in

A replica of `seq.4` is incorporated as-is without applying low rank approximation. This results in the addition of 402 extra parameters to the base model, causing the total number of training parameters to increase by 402. The additional 2 parameters beyond 400 are due to bias being set to true. It's noteworthy that in LoRA layers, the bias is set to False by default.

In [None]:
# %time train(peft_model_1, optimizer, criterion, train_dataloader, eval_dataloader, epochs=max_epochs)

### Merging the adaptors
Parameter-efficient fine-tuning techniques can increase inference latency, because the additional adapter modules expand the network. LoRA adapters, however, are designed so that they can be merged into the adoptee matrices when needed, which removes the additional inference time. As demonstrated earlier, the product of seq.0.lora_B and seq.0.lora_A has the same shape as the weight matrix of seq.0, and similarly for seq.2. Thanks to this alignment, merging amounts to an element-wise addition of the (scaled) product to the adoptee's weight matrix.


In [None]:
peft_model_unmerged = copy.deepcopy(peft_model)
peft_model_merged_and_unloaded = peft_model.merge_and_unload()

In [None]:
print_trainable_parameters(peft_model_merged_and_unloaded)

trainable params: 0 || all params: 442602 || trainable%: 0.0


As we can see above, the total number of parameters is back to 442,602 and none are trainable. The inference time will now be the same as it was for the base_model.
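What merging does to a single layer can be sketched in plain NumPy. The matrices below are random stand-ins with the dimensions of `seq.0`, and the scaling value of 1.0 is an assumption; the point is only that folding the low-rank product into the weight gives the same output as running the two branches in parallel.

```python
import numpy as np

rng = np.random.default_rng(0)
in_features, out_features, r, scaling = 20, 2000, 3, 1.0   # scaling is an assumed value

W = rng.normal(size=(out_features, in_features))   # stand-in for the frozen weight
lora_A = rng.normal(size=(r, in_features))
lora_B = rng.normal(size=(out_features, r))
x = rng.normal(size=(4, in_features))

# Unmerged: two parallel branches, summed at the output
y_unmerged = x @ W.T + scaling * (x @ lora_A.T @ lora_B.T)

# Merged: fold the low-rank update into the weight once, then use a single matmul
W_merged = W + scaling * (lora_B @ lora_A)
y_merged = x @ W_merged.T

print(W_merged.shape == W.shape, np.allclose(y_unmerged, y_merged))
```

After merging, the layer has exactly as many parameters as before fine-tuning, which matches the drop back to 442,602 seen above.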

In [None]:
# Iterate over the unmerged copy: after merge_and_unload() the LoRA layers are no longer part of peft_model
for name, param in peft_model_unmerged.base_model.named_parameters():
    if "lora" not in name:
        print(f"Parameter     {name:<35} | {param.numel():>15} parameters | not updated")
        continue

    print(f"New parameter {name:<35} | {param.numel():>15} parameters | updated")

Parameter     model.seq.0.base_layer.weight       |           40000 parameters | not updated
Parameter     model.seq.0.base_layer.bias         |            2000 parameters | not updated
New parameter model.seq.0.lora_A.default.weight   |              60 parameters | updated
New parameter model.seq.0.lora_B.default.weight   |            6000 parameters | updated
Parameter     model.seq.2.base_layer.weight       |          400000 parameters | not updated
Parameter     model.seq.2.base_layer.bias         |             200 parameters | not updated
New parameter model.seq.2.lora_A.default.weight   |            6000 parameters | updated
New parameter model.seq.2.lora_B.default.weight   |             600 parameters | updated
Parameter     model.seq.4.weight                  |             400 parameters | not updated
Parameter     model.seq.4.bias                    |               2 parameters | not updated


## Sharing the model through Hugging Face Hub
You need a valid Hugging Face account and a write access token to push the model to the Hub. You may or may not want to add the token as a git credential. Either way, you will be able to log in.

In [None]:
# Paste your own write access token when prompted; never hard-code or publish a token
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.c

### Pushing the model to HF Hub
Create a model id and push the peft_model to Hugging Face Hub.

In [None]:
user = "s3pi"  # put your user name here
model_name = "peft-lora-with-MLP-model_"
model_id = f"{user}/{model_name}"

In [None]:
peft_model_unmerged.push_to_hub(model_id)

adapter_model.safetensors:   0%|          | 0.00/51.1k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/s3pi/peft-lora-with-MLP-model_/commit/e395ff324600c273782f66b6900698ca366248ac', commit_message='Upload model', commit_description='', oid='e395ff324600c273782f66b6900698ca366248ac', pr_url=None, pr_revision=None, pr_num=None)

As the upload log shows, the adapter is merely 51 kB. This figure can also be derived from the 12,660 parameters, each stored in 32 bits: 12,660 * 4 bytes is approximately 51 kB. In contrast, the base model comprises 442,602 parameters, amounting to about 1,770 kB, considerably larger than the adapter. This size difference becomes particularly significant for large language models.
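The size arithmetic is easy to reproduce. The sketch below assumes 4 bytes per parameter (32-bit floats) and counts only the raw weights, so any metadata stored in the file is not included.

```python
BYTES_PER_FLOAT32 = 4

adapter_params = 12_660
base_params = 442_602

adapter_bytes = adapter_params * BYTES_PER_FLOAT32
base_bytes = base_params * BYTES_PER_FLOAT32

print(f"adapter: {adapter_bytes / 1000:.1f} kB")
print(f"base:    {base_bytes / 1000:.1f} kB")
print(f"ratio:   {base_bytes / adapter_bytes:.1f}x")
```

The raw adapter weights come to 50,640 bytes, slightly below the 51.1k shown in the upload log, which would be consistent with a small amount of file metadata.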

### Loading the model from HF Hub
Now, it only takes one step to load the model from HF Hub. To do this, we can use `PeftModel.from_pretrained`, passing our pretrained base model and the model ID:

In [None]:
loaded_model = peft.PeftModel.from_pretrained(base_model_pretrained, model_id)
type(loaded_model)

peft.peft_model.PeftModel

In [None]:
loaded_model.print_trainable_parameters()

trainable params: 0 || all params: 455,262 || trainable%: 0.0


In [None]:
loaded_model_merged_and_unloaded = loaded_model.merge_and_unload()
print_trainable_parameters(loaded_model_merged_and_unloaded)

trainable params: 0 || all params: 442602 || trainable%: 0.0


Let's check that the two models produce the same output:

In [None]:
y_peft = peft_model_merged_and_unloaded(X.to(device))
y_loaded = loaded_model_merged_and_unloaded(X.to(device))
torch.allclose(y_peft, y_loaded)

True

### Clean up
Finally, as a clean up step, you may want to delete the repo.

In [None]:
from huggingface_hub import delete_repo

In [None]:
delete_repo(model_id)

## Performance

In Section 7 of the LoRA paper, it's demonstrated that low-rank adaptation matrices can amplify features that are important for specific downstream tasks, features that were learned but not strongly emphasized in the general pre-trained model. The authors use a metric called the amplification factor to show that a small r (such as 2) can yield a higher amplification factor than a larger r (like 64). This finding implies that only a limited number of directions (or features, in this case 2) in the weight space are crucial for adapting the model to a specific task. This insight is valuable for efficiently adapting large pre-trained models like GPT-3: because only a small number of directions need to be modified for a specific task, the number of parameters that need to be trained stays small. For different downstream tasks, a distinct set of feature directions is likely to be amplified.


## Limitations
- LoRA adapters are only a few megabytes whereas the pretrained model is several gigabytes, and both are needed during inference. The memory saving at inference time is therefore small, although fine-tuning requires significantly less memory than fine-tuning all the layers of the pretrained model. QLoRA, which quantizes the base model, is one approach to this problem.
- An adapter is trained for each sub-task. If a batch contains data for multiple tasks, the different adapters cannot be applied within that batch at the same time.


## Extras

### Amplification Factor

In [None]:
import numpy as np

# Use the unmerged copy: after merge_and_unload() the LoRA weights are folded into the base layer
state = peft_model_unmerged.state_dict()
Delta_W = (state['base_model.model.seq.2.lora_B.default.weight'] @ state['base_model.model.seq.2.lora_A.default.weight']).cpu().numpy()
W = base_model_pretrained.state_dict()['seq.2.weight'].cpu().numpy()

U, _, VT = np.linalg.svd(Delta_W)  # singular vectors of the update
W_projected = U.T @ W @ VT  # multiply W by the orthogonal singular-vector matrices of Delta_W
norm_Delta_W = np.linalg.norm(Delta_W, 'fro')
norm_W_projected = np.linalg.norm(W_projected, 'fro')
amplification_factor = norm_Delta_W / norm_W_projected
norm_Delta_W, norm_W_projected, amplification_factor

(0.79638445, 17.702658, 0.044986717)
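One caveat about the cell above: the full `U` and `VT` returned by `np.linalg.svd` are square orthogonal matrices, and multiplying `W` by them does not change its Frobenius norm, so the projected norm equals the norm of `W` itself. As far as the paper's description goes, the projection is meant to be onto the r-dimensional subspace of the update, that is, onto its top-r singular vectors. The sketch below shows that variant on synthetic stand-in matrices (the shapes mirror `seq.2`, everything else is an assumption), so its numbers are not comparable to the output above.

```python
import numpy as np

rng = np.random.default_rng(0)
out_features, in_features, r = 200, 2000, 3   # shapes of seq.2 and the rank used above

# Synthetic stand-ins (assumptions): a dense W and a rank-r update Delta_W = B @ A
W = rng.normal(scale=0.02, size=(out_features, in_features))
Delta_W = rng.normal(scale=0.02, size=(out_features, r)) @ rng.normal(scale=0.02, size=(r, in_features))

U, _, VT = np.linalg.svd(Delta_W, full_matrices=False)
U_r, V_r = U[:, :r], VT[:r, :].T              # top-r left and right singular vectors

W_projected = U_r.T @ W @ V_r                 # r x r projection of W onto Delta_W's subspace
amplification_factor = np.linalg.norm(Delta_W, "fro") / np.linalg.norm(W_projected, "fro")

# With the full square matrices instead, the norm is simply the norm of W
U_full, _, VT_full = np.linalg.svd(Delta_W, full_matrices=True)
full_norm = np.linalg.norm(U_full.T @ W @ VT_full.T, "fro")

print(W_projected.shape, np.isclose(full_norm, np.linalg.norm(W, "fro")))
print(amplification_factor)
```

Restricting to the top-r directions measures how much of `W` already lies in the subspace that the update uses, which is the quantity the amplification factor compares against.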

### Low Rank Approximation

In [None]:

# Create a random 5x5 matrix
A = np.random.rand(5, 5)

# Perform Singular Value Decomposition (SVD)
U, Sigma, VT = np.linalg.svd(A, full_matrices=False)

# Choose the rank for approximation
rank = 3

# Construct low-rank matrices B and C
B = U[:, :rank]
C = np.dot(np.diag(Sigma[:rank]), VT[:rank, :])

# Construct an approximation of the original matrix using the selected rank
A_approx = np.dot(B, C)

# Print the original matrix A, low-rank matrices B and C, and the approximated matrix A_approx
print("Original Matrix A:")
print(np.round(A, 2))
print("\nLow-rank Matrix B:")
print(np.round(B, 2))
print("\nLow-rank Matrix C:")
print(np.round(C, 2))
print("\nApproximated Matrix A_approx:")
print(np.round(A_approx, 2))

Original Matrix A:
[[0.97 0.57 0.71 0.15 0.12]
 [0.5  0.34 0.19 0.76 0.46]
 [0.08 0.17 0.45 0.79 0.68]
 [0.08 0.28 0.13 0.07 0.44]
 [0.77 0.92 0.06 0.1  0.47]]

Low-rank Matrix B:
[[-0.55  0.46  0.67]
 [-0.46 -0.34  0.  ]
 [-0.4  -0.71  0.12]
 [-0.2  -0.1  -0.34]
 [-0.53  0.4  -0.65]]

Low-rank Matrix C:
[[-1.22 -1.08 -0.72 -0.82 -0.89]
 [ 0.51  0.36 -0.05 -0.72 -0.45]
 [ 0.13 -0.29  0.45  0.11 -0.29]]

Approximated Matrix A_approx:
[[ 0.99  0.57  0.67  0.19  0.08]
 [ 0.39  0.38  0.35  0.62  0.56]
 [ 0.14  0.14  0.37  0.85  0.64]
 [ 0.15  0.28 -0.    0.2   0.32]
 [ 0.77  0.91  0.07  0.09  0.49]]
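How good such an approximation is depends on the singular values that are thrown away. For a truncated SVD, the Frobenius-norm error equals the square root of the sum of the squared discarded singular values (a property associated with the Eckart-Young theorem). The sketch below checks this for every rank on a fresh random 5x5 matrix, so its numbers differ from the output above.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 5))
U, Sigma, VT = np.linalg.svd(A, full_matrices=False)

errors, matches = [], []
for rank in range(1, 6):
    A_approx = U[:, :rank] @ np.diag(Sigma[:rank]) @ VT[:rank, :]
    error = np.linalg.norm(A - A_approx, "fro")
    # Energy left in the discarded singular values
    expected = np.sqrt(np.sum(Sigma[rank:] ** 2))
    errors.append(error)
    matches.append(bool(np.isclose(error, expected)))
    print(rank, round(float(error), 4), matches[-1])
```

The error shrinks as the rank grows and vanishes at full rank; LoRA relies on the analogous observation that a useful weight update can often be captured with a small rank.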
