![NVIDIA Logo](images/nvidia.png)

# LoRA

In this notebook we explore the conceptual underpinnings behind the second major PEFT technique of this workshop: LoRA.

---

## Learning Objectives

By the time you complete this notebook you will:
- Understand the structure and functionality of the LoRA PEFT technique.

---

## LoRA Presentation

In [None]:
from llm_utils.slides import load_lora_slides
load_lora_slides()

---

## LoRA Simplified

For the remainder of this notebook we will construct a simplified LoRA mechanism to help develop your intuition about how LoRA works.

## Imports

In [None]:
import numpy as np

---

## LoRA

Low-Rank Adaptation (**LoRA**) is a PEFT technique where we modify the weights of an LLM using low-rank matrices. In LoRA, instead of directly altering the original weights of the LLM, we introduce trainable low-rank matrices that capture the desired adaptations.

The key advantage of LoRA lies in its ability to make significant modifications to the LLM's behavior while training only a small number of additional parameters. This is achieved by decomposing the weight adjustments into lower-dimensional spaces using the introduced matrices. During the LoRA training process, the original weights of the LLM remain frozen, and only the parameters of the low-rank matrices are updated.

This approach allows for substantial customization of the LLM's responses and capabilities without the computational burden of training the entire model. It is often said that we "apply LoRA to an LLM" or "use a LoRA-modified LLM", but it's important to note that the core LLM itself does not undergo retraining; rather, it's the additional low-rank matrices that are fine-tuned to achieve the desired outcome.
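
To make the "frozen weights, trainable low-rank matrices" idea concrete, here is a minimal sketch of a training loop in plain NumPy. It is not how a real framework trains LoRA: the toy sizes, the single input `x`, the made-up `y_target`, the squared-error loss, the learning rate, and all variable names are assumptions chosen purely for illustration. The zero initialization of one low-rank matrix, so that the adjustment starts at zero, follows common LoRA practice.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 6, 8, 2  # toy sizes (assumed for illustration)

W_frozen = rng.standard_normal((d_in, d_out))                # pretrained weights, never updated
A_lora = rng.standard_normal((rank, d_in)) / np.sqrt(d_in)   # trainable, small random init
B_lora = np.zeros((rank, d_out))                             # trainable, zero init: adjustment starts at 0

x = rng.standard_normal(d_in)    # a single toy input
y_target = x @ W_frozen + 0.5    # hypothetical target the adapter should learn

W_before = W_frozen.copy()
learning_rate = 0.01
losses = []
for _ in range(100):
    z = A_lora @ x                                  # project the input down to the adapter dimension
    error = x @ W_frozen + z @ B_lora - y_target    # frozen path + low-rank path, minus target
    losses.append(0.5 * float(error @ error))       # squared-error loss
    grad_B = np.outer(z, error)                     # gradient of the loss w.r.t. B_lora
    grad_A = np.outer(B_lora @ error, x)            # gradient of the loss w.r.t. A_lora
    A_lora -= learning_rate * grad_A                # only the low-rank matrices are updated
    B_lora -= learning_rate * grad_B

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
print("W unchanged:", np.array_equal(W_frozen, W_before))
```

Note that `W_frozen` never appears on the left-hand side of an update: gradients are computed and applied only for the two low-rank matrices.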

## Visual LoRA

We will now build a toy implementation of LoRA. We will use the following image as a point of reference as we define the various components in code.

![LoRA](images/lora_medium.png)

---

## LLM Weight Matrix

LLMs consist of thousands to tens of thousands of **weight matrices**. In this simplified simulation we will treat the LLM as a single **weight matrix** (`W`) with small input and output dimensions `d` and `h`.

In [None]:
d = 6  # Input dimension
h = 8  # Output dimension
W = np.random.randn(d, h)  # Weight matrix W with dimensions d x h

In [None]:
W.shape

For our mock LLM / small **weight matrix** the total number of tune-able parameters is:

In [None]:
W.size

![LoRA](images/lora_medium.png)

---

## LoRA (Low Rank) Matrices

When setting up LoRA PEFT we supply the **adapter dimension**, which is the (low) rank `r` of the **LoRA matrices**.

In [None]:
r = 2  # Rank for LoRA matrices

For each **weight matrix** two **low-rank matrices** are created using the low-rank **adapter dimension** and the **weight matrix's** input and output dimensions.

In [None]:
A = np.random.randn(r, d)  # Low-rank matrix A with dimensions (adapter dimension) x (weight matrix input dimension)
B = np.random.randn(r, h)  # Low-rank matrix B with dimensions (adapter dimension) x (weight matrix output dimension)

In [None]:
A.shape

In [None]:
B.shape

The total number of tune-able parameters for the 2 **low-rank matrices**:

In [None]:
A.size + B.size

Percentage of **low-rank matrix** tune-able parameters compared to the **weight-matrix**:

In [None]:
f'{(A.size + B.size) / W.size*100:.2f}%'

![LoRA](images/lora_medium.png)

---

## Low-Rank Factorization for Weight Matrix Approximation

Multiplied together, the two **low-rank matrices** form a low-rank approximation of the adjustment we want to make to the original high-dimensional **weight matrix**. The matrix product of `A` (transposed) and `B`, referred to as the **approximation matrix**...

In [None]:
AB = np.dot(A.T, B)

...represents the adjustment to the original **weight matrix** in a reduced, low-rank form that needs fewer trainable parameters than a full-size update. The **approximation matrix** is the same size and shape as the **weight matrix**.

In [None]:
AB.shape

In [None]:
AB.size
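
Although the **approximation matrix** has as many entries as the **weight matrix**, its rank can never exceed the adapter dimension, because it is built from two matrices that each have only `r` rows. The sketch below checks this with `np.linalg.matrix_rank`; it uses its own hypothetical names (`A_demo`, `B_demo`, `AB_demo`) so that it does not overwrite the notebook's variables.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 6, 8, 2  # same toy sizes as above (assumed)

A_demo = rng.standard_normal((rank, d_in))
B_demo = rng.standard_normal((rank, d_out))
AB_demo = A_demo.T @ B_demo  # same construction as np.dot(A.T, B)

print("shape:", AB_demo.shape)                        # full d_in x d_out shape
print("rank: ", np.linalg.matrix_rank(AB_demo))       # bounded by the adapter dimension
print("entries:", AB_demo.size, "built from", A_demo.size + B_demo.size, "parameters")
```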

![LoRA](images/lora_medium.png)

---

## Generating Output with Modified Weight Matrix

The input and output vectors match the input dimension and output dimension of the **weight matrix** respectively.

In [None]:
input_vector = np.random.randn(d)  # Input vector of length d
output_vector = np.dot(input_vector, W)  # Output vector of length h

In [None]:
input_vector.shape

In [None]:
output_vector.shape

During inference, the **approximation matrix** is added to the **weight matrix**.

In [None]:
modified_output_vector = np.dot(input_vector, W + AB)  # Output vector using modified W

In [None]:
modified_output_vector.shape
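
Adding the **approximation matrix** to the **weight matrix** is one way to compute the modified output. An equivalent option is to leave `W` untouched and route the input through the two low-rank matrices on a separate path, then add the results. The sketch below, using hypothetical `_demo` names to avoid overwriting the notebook's variables, shows that both routes produce the same output up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 6, 8, 2  # toy sizes (assumed)

W_demo = rng.standard_normal((d_in, d_out))
A_demo = rng.standard_normal((rank, d_in))
B_demo = rng.standard_normal((rank, d_out))
x_demo = rng.standard_normal(d_in)

# Route 1: merge the low-rank product into the weight matrix, then apply it.
merged = x_demo @ (W_demo + A_demo.T @ B_demo)

# Route 2: keep W as-is and add the output of the low-rank path.
low_rank_path = (x_demo @ A_demo.T) @ B_demo  # d_in -> rank -> d_out
unmerged = x_demo @ W_demo + low_rank_path

print(np.allclose(merged, unmerged))
```

Route 2 never materializes the full `d_in x d_out` product, which is why the low-rank matrices can be stored and swapped separately from the base weights.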

![LoRA](images/lora_medium.png)

---

## Exercise: Rerun With Larger Input and Output Dimensions

In this pass of the notebook we saw that the tune-able parameters of the **low-rank matrices** combined amounted to about 58% of those in the **weight matrix**. Go back to the top of the notebook and increase the size of the input and output dimensions and watch how the increase in size affects this percentage. You might try somewhat realistic values like:
- `d` = 1024 or 2048
- `h` = 1024 or 2048
- `r` = 32
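
If you would like to compare several settings side by side without re-running every cell, a small helper along these lines may be useful. The function name `lora_param_ratio` is hypothetical; it simply applies the same counting used earlier in the notebook, `(A.size + B.size) / W.size`, to arbitrary dimensions.

```python
def lora_param_ratio(d, h, r):
    """Return the LoRA parameter count as a fraction of a d x h weight matrix."""
    lora_params = r * d + r * h  # A is r x d, B is r x h
    full_params = d * h          # W is d x h
    return lora_params / full_params


# Loop variables are named differently from the notebook's d, h, r to avoid overwriting them.
for d_in, d_out, rank in [(6, 8, 2), (1024, 1024, 32), (2048, 2048, 32)]:
    ratio = lora_param_ratio(d_in, d_out, rank)
    print(f"d={d_in}, h={d_out}, r={rank}: {ratio * 100:.2f}%")
```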