<a href="https://colab.research.google.com/github/sharma-himanshukumar/LLM_Learning/blob/main/LLM_finetuning_with_lora_%26_qlora.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LoRA and QLoRA

## Overview

**LoRA (Low-Rank Adaptation)** is a technique for fine-tuning large pre-trained language models efficiently. Instead of updating every parameter, LoRA freezes the pre-trained weights and trains a pair of small low-rank matrices whose product approximates the weight update. This greatly reduces the number of trainable parameters and the memory needed for gradients and optimizer state.

**QLoRA (Quantized Low-Rank Adaptation)** extends LoRA by quantizing the frozen base model to further reduce memory. In the QLoRA paper, the pre-trained weights are stored in a 4-bit format (NF4) and dequantized on the fly for computation, while the low-rank matrices stay in higher precision (e.g., bfloat16) and are the only parameters trained.

## Concepts

### LoRA (Low-Rank Adaptation)

- Decomposes the weight update into two low-rank matrices `A` and `B`.
- Instead of updating the full weight matrix `W`, LoRA represents the update as the matrix product `ΔW = A @ B`, where `A` is `d_out × r`, `B` is `r × d_in`, and the rank `r` is much smaller than either dimension.
- During fine-tuning, only `A` and `B` are trained; `W` stays frozen.
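To make the saving concrete, the sketch below counts trainable parameters for a full update versus a rank-`r` update. The helper name `lora_param_counts` and the 4096 × 4096 layer size are illustrative assumptions, not values from this notebook.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in            # updating W directly
    lora = rank * (d_out + d_in)   # A is d_out x rank, B is rank x d_in
    return full, lora

# Hypothetical layer size, chosen only for illustration
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(f"full update: {full:,} parameters")
print(f"LoRA update: {lora:,} parameters ({100 * lora / full:.2f}% of full)")
```

The LoRA count grows linearly with the layer dimensions, while the full update grows with their product, so the gap widens for larger layers.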

### QLoRA (Quantized Low-Rank Adaptation)

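- Quantizes the frozen base weights `W` (4-bit NF4 in the QLoRA paper) to cut memory; the low-rank matrices `A` and `B` stay in higher precision and are the only trained parameters.
- Quantization maps floating-point values to a lower-bit representation (e.g., 4-bit or int8), together with a scale factor used to recover approximate values.
- During the forward pass, the quantized weights are dequantized on the fly and combined with the `A`/`B` update.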
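A back-of-the-envelope estimate shows why the bit width of the stored weights matters. The function below is a hypothetical helper, and the 7-billion-parameter model size is an assumed example; the estimate covers weight storage only and ignores quantization scales, activations, gradients, and optimizer state.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 7_000_000_000  # assumed model size, for illustration only

for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>17}: {weight_memory_gb(num_params, bits):5.1f} GB")
```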

## Code Example

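The following code implements a toy LoRA layer in PyTorch, then applies a simple int8 quantize/dequantize round trip to the low-rank matrices to show how quantization perturbs the resulting weights. It is a simplified illustration of quantization error, not a full QLoRA implementation (which quantizes the base weights rather than `A` and `B`).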

```python
import torch
import torch.nn as nn

# Define a simple model with LoRA applied to one layer
class LoRAModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, rank=4):
        super().__init__()
        self.fc = nn.Linear(input_dim, hidden_dim)
        # Low-rank factors: A is (hidden_dim x rank), B is (rank x input_dim).
        # Random init keeps the update visible when printed. In actual LoRA
        # training, one factor usually starts at zero so that ΔW starts at 0,
        # and self.fc.weight is frozen.
        self.lora_A = nn.Parameter(torch.randn(hidden_dim, rank))
        self.lora_B = nn.Parameter(torch.randn(rank, input_dim))

    def forward(self, x):
        W = self.fc.weight
        lora_update = torch.matmul(self.lora_A, self.lora_B)  # ΔW = A @ B
        updated_W = W + lora_update
        x = torch.matmul(x, updated_W.T) + self.fc.bias
        return x

# Symmetric absmax quantization to signed integers.
# Values are stored as int8, so num_bits must be <= 8.
def quantize_tensor(tensor, num_bits=8):
    tensor = tensor.detach()  # quantization is not part of the autograd graph
    qmax = 2 ** (num_bits - 1) - 1
    scale = tensor.abs().max().clamp(min=1e-8) / qmax
    quantized = torch.clamp(torch.round(tensor / scale), -qmax, qmax).to(torch.int8)
    return quantized, scale

# Recover an approximate float tensor from the integers and the scale
def dequantize_tensor(quantized, scale):
    return quantized.float() * scale

# Instantiate the model
input_dim = 10
hidden_dim = 6
model = LoRAModel(input_dim, hidden_dim)

# Apply LoRA
lora_A = model.lora_A
lora_B = model.lora_B
W = model.fc.weight
lora_update = torch.matmul(lora_A, lora_B)
updated_W_lora = W + lora_update

# Simulate quantization of the low-rank matrices (toy stand-in for QLoRA)
quantized_A, scale_A = quantize_tensor(lora_A)
quantized_B, scale_B = quantize_tensor(lora_B)
dequantized_A = dequantize_tensor(quantized_A, scale_A)
dequantized_B = dequantize_tensor(quantized_B, scale_B)
lora_update_quantized = torch.matmul(dequantized_A, dequantized_B)
updated_W_qlora = W + lora_update_quantized

# Print the differences between the full-precision and quantized updates
print("Original Weight Matrix W:\n", W)
print("\nLow-Rank Update ΔW (LoRA):\n", lora_update)
print("\nUpdated Weight Matrix W' (LoRA):\n", updated_W_lora)
print("\nLow-Rank Update ΔW (QLoRA):\n", lora_update_quantized)
print("\nUpdated Weight Matrix W' (QLoRA):\n", updated_W_qlora)
```
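One way to quantify what the round trip does is to measure the error it introduces. The sketch below defines a standalone absmax quantizer (`absmax_roundtrip` is a hypothetical helper that mirrors the functions above) and prints the largest per-element error, which should stay within about half a quantization step.

```python
import torch

def absmax_roundtrip(tensor, num_bits=8):
    """Quantize to signed integers with one absmax scale, then dequantize."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = tensor.abs().max().clamp(min=1e-8) / qmax
    quantized = torch.clamp(torch.round(tensor / scale), -qmax, qmax)
    return quantized * scale, scale

torch.manual_seed(0)  # fixed seed so repeated runs give the same numbers
x = torch.randn(6, 4)

for bits in (8, 4):
    x_hat, scale = absmax_roundtrip(x, num_bits=bits)
    max_err = (x - x_hat).abs().max().item()
    print(f"{bits}-bit: step={scale.item():.5f}, max error={max_err:.5f}")
```

Lowering the bit width enlarges the step size, so the worst-case error grows accordingly.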

## Key Points

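- **LoRA**: Reduces the number of *trainable* parameters by representing the weight update as the product of low-rank matrices `A` and `B`.
- **QLoRA**: Combines LoRA with a quantized frozen base model (4-bit in the original paper) to reduce memory usage. The toy code above quantizes `A` and `B` instead, purely to keep the example small.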
- **Quantization Methods**:
  - **Static Quantization**: Quantizes weights and activations based on calibration data.
  - **Dynamic Quantization**: Quantizes weights statically and activations dynamically during inference.
  - **Quantization Aware Training (QAT)**: Simulates quantization during training to better adapt the model to lower precision.
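Quantization-aware training is commonly implemented with "fake quantization": the forward pass uses rounded values, while gradients bypass the non-differentiable rounding through a straight-through estimator. The function below is a minimal hand-written sketch of that idea (the name `fake_quantize` is an assumption for illustration); it is not PyTorch's built-in QAT workflow.

```python
import torch

def fake_quantize(x, num_bits=8):
    """Round x to a signed integer grid in the forward pass, identity in backward."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    # Straight-through estimator: the detached term carries the rounding in the
    # forward pass but contributes no gradient, so d(output)/dx is 1.
    return x + (x_q - x).detach()

torch.manual_seed(0)
w = torch.randn(4, 3, requires_grad=True)
w_q = fake_quantize(w, num_bits=4)
w_q.sum().backward()

print("max change from rounding:", (w_q - w).abs().max().item())
print("gradient passes straight through:", torch.equal(w.grad, torch.ones_like(w)))
```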

### Advantages of QLoRA

- **Efficiency**: Reduces memory requirements by training only low-rank matrices on top of a quantized base model.
- **Performance**: Can retain most of the quality of full-precision fine-tuning, because the low-rank matrices are trained on top of the quantized weights and can compensate for part of the quantization error.
- **Scalability**: Makes it feasible to fine-tune larger models on a given GPU, since the frozen weights take far less memory than in full precision.

By integrating low-rank adaptations with quantization, QLoRA provides a powerful approach to efficiently fine-tune and deploy large language models.


Output of the code above (truncated in the source):

```
Original Weight Matrix W:
 Parameter containing:
tensor([[ 0.2674,  0.1404, -0.2016,  0.0831, -0.2406, -0.0549,  0.2625,  0.2510,
         -0.1550, -0.1925],
        [-0.2031, -0.0693, -0.1186,  0.2365,  0.0907, -0.3049,  0.2085, -0.2339,
         -0.0290,  0.0925],
        [ 0.1454,  0.0029,  0.1722, -0.0569,  0.1638,  0.1947,  0.0819, -0.0532,
         -0.1233, -0.3128],
        [ 0.1417,  0.1994,  0.2274,  0.0310,  0.2003,  0.2961,  0.0400,  0.1053,
         -0.0526, -0.0381],
        [ 0.1214,  0.1669,  0.0250, -0.2070,  0.0904, -0.0866,  0.2056, -0.1328,
          0.1986,  0.2691],
        [ 0.2832, -0.0298,  0.1418,  0.2959,  0.0232,  0.0285,  0.2896, -0.2381,
         -0.0884, -0.0879]], requires_grad=True)

Low-Rank Update ΔW (LoRA):
 tensor([[ 0.6609,  0.3984, -0.8163,  1.0553,  0.0283,  0.9763,  0.9529,  3.1155,
          1.9216,  1.0668],
        [-1.2604, -2.4124, -0.4593,  0.0493, -0.1857,  1.7256, -0.6232,  1.3840,
          0.6693,  2.9433],
        [ 1.2750,  0.8068,  0
```

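The code below measures the serialized (on-disk) size of the model's parameters before and after the quantization round trip. Because the quantized matrices are dequantized back to float32 before being stored in the model, the measured size does not change; an actual saving requires storing the int8 tensors themselves.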

```python
import os
import torch
import torch.nn as nn

# Define a simple model with LoRA applied to one layer
class LoRAModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, rank=4):
        super().__init__()
        self.fc = nn.Linear(input_dim, hidden_dim)
        self.lora_A = nn.Parameter(torch.randn(hidden_dim, rank))  # Low-rank matrix A
        self.lora_B = nn.Parameter(torch.randn(rank, input_dim))   # Low-rank matrix B

    def forward(self, x):
        W = self.fc.weight
        lora_update = torch.matmul(self.lora_A, self.lora_B)
        updated_W = W + lora_update
        x = torch.matmul(x, updated_W.T) + self.fc.bias
        return x

# Symmetric absmax quantization to signed integers (stored as int8, so num_bits <= 8)
def quantize_tensor(tensor, num_bits=8):
    tensor = tensor.detach()  # quantization is not part of the autograd graph
    qmax = 2 ** (num_bits - 1) - 1
    scale = tensor.abs().max().clamp(min=1e-8) / qmax
    quantized = torch.clamp(torch.round(tensor / scale), -qmax, qmax).to(torch.int8)
    return quantized, scale

# Recover an approximate float32 tensor from the integers and the scale
def dequantize_tensor(quantized, scale):
    return quantized.float() * scale

# Serialized size of the model's state dict in MB (1 MB = 1e6 bytes)
def get_model_size(model, path="temp.p"):
    torch.save(model.state_dict(), path)
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

# Instantiate the model
input_dim = 10
hidden_dim = 6
model = LoRAModel(input_dim, hidden_dim)

# Size of the original model with LoRA
original_model_size = get_model_size(model)
print(f"Original model size (with LoRA): {original_model_size:.4f} MB")

# Quantize and dequantize the low-rank matrices
quantized_A, scale_A = quantize_tensor(model.lora_A)
quantized_B, scale_B = quantize_tensor(model.lora_B)
dequantized_A = dequantize_tensor(quantized_A, scale_A)
dequantized_B = dequantize_tensor(quantized_B, scale_B)

# Replace the low-rank matrices with their dequantized (float32) versions
with torch.no_grad():
    model.lora_A.copy_(dequantized_A)
    model.lora_B.copy_(dequantized_B)

# Size after the round trip; the parameters are still float32, so it is unchanged
quantized_model_size = get_model_size(model)
print(f"Quantized model size (with QLoRA): {quantized_model_size:.4f} MB")

# Summary of sizes
print(f"Size reduction: {original_model_size - quantized_model_size:.4f} MB")
```

### Explanation

1. **Model Definition**:
   - `LoRAModel`: A simple model with a linear layer and low-rank adaptation matrices `A` and `B`.

2. **Quantization Functions**:
   - `quantize_tensor`: Maps a tensor to signed integers of the specified bit width, using a single absmax scale factor (symmetric quantization).
   - `dequantize_tensor`: Converts a quantized tensor back to an approximate floating-point tensor by multiplying by the scale.

3. **Model Size Calculation**:
   - `get_model_size`: Saves the model's state dictionary to a temporary file and calculates its size in megabytes (MB).

4. **Applying LoRA and QLoRA**:
   - The original model size is measured after initializing the model with LoRA.
   - Low-rank matrices `A` and `B` are quantized and then dequantized to simulate the effect of quantization.
   - The dequantized (float32) matrices are assigned back to the model.
   - The model size is measured again after the replacement.

5. **Printing the Results**:
   - The two sizes and their difference are printed.

Because the matrices are stored as float32 again after dequantization, the two sizes are identical and the reported reduction is zero. The round trip changes the values (it introduces quantization error) but not the storage format; a real memory saving only appears when the integer tensors and their scales are stored instead.

Output of the size comparison:

```
Original model size (with LoRA): 0.0024 MB
Quantized model size (with QLoRA): 0.0024 MB
Size reduction: 0.0000 MB
```
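The zero reduction above follows from the storage dtype, which can be checked directly by counting bytes. This sketch (using the hypothetical helper `tensor_bytes`) compares float32 factors with the same shapes as `lora_A` and `lora_B` against int8 copies plus one float32 scale per matrix.

```python
import torch

def tensor_bytes(t):
    """Bytes occupied by the tensor's elements (excludes file or object overhead)."""
    return t.numel() * t.element_size()

# Same shapes as lora_A (6 x 4) and lora_B (4 x 10) in the example above
A = torch.randn(6, 4)
B = torch.randn(4, 10)
fp32_bytes = tensor_bytes(A) + tensor_bytes(B)

# int8 storage: one byte per element, plus one float32 scale per matrix.
# The zeros are placeholders; only dtype and shape matter for the size.
A_int8 = torch.zeros_like(A, dtype=torch.int8)
B_int8 = torch.zeros_like(B, dtype=torch.int8)
scale_bytes = 2 * torch.tensor(0.0).element_size()
int8_bytes = tensor_bytes(A_int8) + tensor_bytes(B_int8) + scale_bytes

print(f"float32 factors:       {fp32_bytes} bytes")
print(f"int8 factors + scales: {int8_bytes} bytes")
```

At this size the difference is far smaller than the serialization overhead of the saved file, so it would barely register in a file-size measurement even if the int8 tensors were stored.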


The next cell applies PyTorch's built-in static and dynamic post-training quantization to a small network and compares the saved model sizes.

```python
import os
import torch
import torch.nn as nn
import torch.quantization as quantization

# Define a simple neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(10, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Instantiate the model
model = SimpleNN()

# Example input
example_input = torch.randn(1, 10)

# Static Quantization
def static_quantization(model, example_input):
    # Post-training static quantization expects the model in eval mode
    model.eval()

    # Prepare the model for static quantization
    model.qconfig = quantization.get_default_qconfig('fbgemm')
    model_fused = quantization.fuse_modules(model, [['fc1', 'relu']])
    model_prepared = quantization.prepare(model_fused)

    # Calibrate the model with example data (observers record activation ranges)
    model_prepared(example_input)

    # Convert to a quantized model.
    # Note: running inference on the converted model requires quantized inputs
    # (e.g., QuantStub/DeQuantStub in the model); this example only inspects
    # the converted modules and their size.
    quantized_model = quantization.convert(model_prepared)
    return quantized_model

# Dynamic Quantization
def dynamic_quantization(model):
    # Convert to a dynamically quantized model
    quantized_model = quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )
    return quantized_model

# Apply static quantization
quantized_model_static = static_quantization(model, example_input)
print("Static Quantization:\n", quantized_model_static)

# Apply dynamic quantization
quantized_model_dynamic = dynamic_quantization(model)
print("Dynamic Quantization:\n", quantized_model_dynamic)

# Compare model sizes (serialized state dict, in MB)
def get_model_size(model):
    torch.save(model.state_dict(), "temp.p")
    size = os.path.getsize("temp.p") / 1e6
    os.remove("temp.p")
    return size

print(f"Original model size: {get_model_size(model)} MB")
print(f"Static quantized model size: {get_model_size(quantized_model_static)} MB")
print(f"Dynamic quantized model size: {get_model_size(quantized_model_dynamic)} MB")
```



Output:

```
Static Quantization:
 SimpleNN(
  (fc1): QuantizedLinearReLU(in_features=10, out_features=10, scale=0.009280134923756123, zero_point=0, qscheme=torch.per_channel_affine)
  (relu): Identity()
  (fc2): QuantizedLinear(in_features=10, out_features=1, scale=0.0022189663723111153, zero_point=127, qscheme=torch.per_channel_affine)
)
Dynamic Quantization:
 SimpleNN(
  (fc1): DynamicQuantizedLinear(in_features=10, out_features=10, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
  (relu): ReLU()
  (fc2): DynamicQuantizedLinear(in_features=10, out_features=1, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
Original model size: 0.002424 MB
Static quantized model size: 0.004546 MB
Dynamic quantized model size: 0.00342 MB
```
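The quantized files above are larger than the original, which is plausible for a network this small: the saved file is mostly serialization overhead rather than weights. As a rough check, the sketch below rebuilds the same two layers with `nn.Sequential` (a stand-in for `SimpleNN`, used here as an assumption to keep the block self-contained) and counts the raw parameter bytes.

```python
import torch.nn as nn

# Stand-in with the same layer shapes as SimpleNN above
net = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 1))

num_params = sum(p.numel() for p in net.parameters())
param_bytes = sum(p.numel() * p.element_size() for p in net.parameters())

print(f"parameters: {num_params}")
print(f"raw float32 parameter data: {param_bytes} bytes")
```

The float32 file reported above is 2,424 bytes, so the parameters themselves account for only about a fifth of it. File-size comparisons become meaningful only once the weights dominate the file, which is the case for realistically sized models.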
