# 📚 Notebook 6.3: Parameter-Efficient Fine-Tuning (PEFT) with LoRA and QLoRA - Finetuning gpt2(1.5B) on Alpaca(50k)

Welcome back to the series! 🎉 In *Notebook 6.2*, we explored foundational techniques for fine-tuning large language models, and now, it’s time to tackle an even more efficient approach. This notebook focuses on **Parameter-Efficient Fine-Tuning (PEFT)**, a method designed to fine-tune large models with minimal computational resources.

### What's the Goal? 🏆

In this notebook, we’ll dive into two PEFT techniques, **Low-Rank Adaptation (LoRA)** and **Quantized LoRA (QLoRA)**. These methods enable us to adapt large-scale models like GPT-2-XL for specific tasks with fewer trainable parameters and reduced memory usage, making fine-tuning accessible even on limited hardware.

### What's Inside? 🔍

This notebook is structured to give you a practical and theoretical understanding of PEFT, along with a hands-on implementation. Here’s what we’ll cover:

#### **Section 1: Introduction to LoRA and QLoRA** 🧠
We’ll explore the motivation and mechanics behind LoRA and QLoRA, including:
- How LoRA introduces low-rank matrices to optimize memory usage
- The role of quantization in QLoRA to further reduce resource needs

#### **Section 2: Implementing LoRA from Scratch** 🔄
In this section, we’ll go through:
- Building a custom LoRA layer in PyTorch
- Testing its integration within the GPT-2 architecture


#### **Section 3: Fine-Tuning GPT-2-XL on the Alpaca Dataset** 🦙
Finally, we’ll put our implementation to the test by fine-tuning GPT-2-XL:
- Preparing the Alpaca instruction-following dataset (50,000 entries)
- Applying LoRA and QLoRA for fine-tuning and evaluating the results

Let’s dive into this exploration of efficient fine-tuning, and see how LoRA and QLoRA can make LLM fine-tuning more accessible than ever! 🌟

# Introduction to PEFT

Parameter-Efficient Fine-Tuning (PEFT) is a technique that reduces the computational cost of fine-tuning large language models. Instead of adjusting every weight in the model, PEFT allows us to tune only a small set of parameters, leveraging the knowledge already embedded in the model's layers. LoRA and QLoRA are two powerful PEFT methods, enabling us to train with fewer resources while maintaining performance.

## Key Concepts in LoRA and QLoRA:

- **Efficiency**: LoRA introduces low-rank matrices to reduce the number of trainable parameters.
- **Quantization in QLoRA**: By using lower precision for certain weights, QLoRA further reduces memory needs without compromising much on accuracy.
- **Application in NLP**: PEFT techniques are valuable for adapting models like GPT-2 for specific tasks, making fine-tuning feasible on consumer-grade hardware.

By utilizing PEFT methods, we aim to efficiently train a GPT-2-XL model on the Alpaca dataset for instruction-following tasks, achieving significant memory savings and faster training.

<p align="center">
    <img src="images/peft.jpg" alt="PEFT Overview" />
</p>

## LET'S GET STARTED 🚀


## Section 1: LoRA and QLoRA - Low-Rank Adaptation for Large Language Models

### What is LoRA?

LoRA, or **Low-Rank Adaptation**, is a technique created to simplify and optimize the fine-tuning of large language models. It allows for flexible, resource-efficient fine-tuning, which reduces computational requirements significantly.

### Why Use LoRA?

Fine-tuning large models is often costly in terms of both time and resources. Additionally, the same base model may need to be fine-tuned multiple times for different tasks. LoRA offers an efficient solution by allowing multiple adapters to be plugged into a base model as needed, achieving high accuracy without the typical computational costs.

The core insight behind LoRA is that many model parameters are redundant. By reducing redundancy, LoRA captures essential model information with fewer parameters, resulting in a much more efficient tuning process.

### How Does LoRA Work?

Rather than modifying the main weight matrix W directly, LoRA introduces two smaller matrices, W_a and W_b, whose product approximates the adaptation needed. This setup significantly reduces the number of trainable parameters without sacrificing model adaptability.

<p align="center">
    <img src="images/lora_f.png" alt="PEFT Overview" />
</p>

#### Example

Consider a weight matrix W with dimensions M x N = 5000 x 1000. The total number of parameters is:

num_params = M * N = 5000 * 1000 = 5,000,000


With LoRA, two smaller matrices are created:
- W_a with dimensions M x r
- W_b with dimensions N x r

where r is a small rank (this example uses r = 1; values such as 4, 8, or 16 are common in practice). The update to W is the product W_a * W_b^T, which has the same M x N shape as W. The number of trainable parameters with LoRA is then:


num_params = M * r + N * r


For example, if r = 1:

W_a = 5000 * 1 = 5000
W_b = 1000 * 1 = 1000
num_params = W_a + W_b = 6000
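
The arithmetic above can be reproduced for a few ranks with a short script. This is only a sketch: the helper name `lora_param_count` and the extra ranks 4, 8, and 16 are illustrative choices, not part of the original example.

```python
# Parameter counts for a full update versus a LoRA update of an M x N weight matrix.
M, N = 5000, 1000


def lora_param_count(m: int, n: int, r: int) -> int:
    """Parameters in W_a (m x r) plus W_b (n x r)."""
    return m * r + n * r


full = M * N
for r in (1, 4, 8, 16):
    lora = lora_param_count(M, N, r)
    print(f"r={r:>2}: {lora:>7,} LoRA params, {100 * lora / full:.2f}% of {full:,}")
```

Even at r = 16 the adapter stays under 2% of the full matrix, which is why the rank can be raised well above 1 without giving up most of the savings.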


### Fine-Tuning with LoRA: Backpropagation

During fine-tuning, backpropagation is performed only on Wa and Wb, not on the original weight matrix W. This setup preserves the base model weights, reducing memory and computational costs since only the smaller LoRA matrices are updated during training.
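
A minimal sketch of this behaviour, assuming tiny illustrative sizes and a made-up squared-output loss: the base matrix is created without gradient tracking, so `backward()` only fills in gradients for the two small matrices.

```python
import torch

torch.manual_seed(0)
M, N, r = 6, 4, 1  # tiny sizes, chosen only for illustration

W = torch.randn(M, N)                        # frozen base weight, no gradient tracking
W_a = torch.randn(M, r, requires_grad=True)  # trainable
W_b = torch.zeros(N, r, requires_grad=True)  # trainable, zero so the update starts at 0

x = torch.randn(3, N)                        # a batch of 3 inputs
y = x @ (W + W_a @ W_b.T).T                  # forward pass with the adapted weight
loss = y.pow(2).mean()                       # placeholder loss for the demo
loss.backward()

print(W.requires_grad, W.grad)               # False None
print(W_a.grad.shape, W_b.grad.shape)
```

Because one of the two matrices starts at zero, the adapted weight equals W on the first step, so training begins from the behaviour of the base model.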


## qLoRA: Quantized Low-Rank Adaptation

### What is qLoRA?

qLoRA, or **Quantized Low-Rank Adaptation**, builds on LoRA by introducing quantization. Quantization stores the frozen base-model weights in lower precision (e.g., 4-bit instead of 16-bit or 32-bit), which sharply reduces memory use without losing significant accuracy.

<p align="center">
    <img src="images/qlora.jpeg" alt="PEFT Overview" />
</p>

### Why Use qLoRA?

While LoRA reduces the number of trainable parameters with low-rank matrices, the frozen base model still has to sit in memory. qLoRA shrinks that footprint by quantizing the base weights and training the low-rank matrices on top of them in higher precision. This combination makes it possible to fine-tune large models in resource-constrained environments.

### How Does qLoRA Work?

qLoRA applies quantization to the frozen base weight matrix W, using fewer bits to represent each of its values, while the low-rank matrices W_a and W_b stay in higher precision and receive the gradient updates. During the forward and backward pass the quantized weights are dequantized on the fly. This decreases the memory footprint, which is what makes fine-tuning large models feasible on hardware with limited capacity.


## Mathematical Intuition Behind LoRA and qLoRA

### Hypothesis: Redundancy and Low-Rank Approximation

The idea behind LoRA and qLoRA is that many parameters in large models are redundant and can be approximated by a smaller set of parameters. This reduction is possible without sacrificing significant information in the model.

### Singular Value Decomposition (SVD) and Low-Rank Approximation

#### What is Singular Value Decomposition (SVD)?

SVD decomposes a matrix \( W \) as:

\[
W = U \Sigma V^T
\]

where:
- \( U \) and \( V \) are orthogonal matrices.
- \( \Sigma \) is a diagonal matrix with singular values that reflect the importance of each component in \( W \).

#### Linear Redundancy in Large Models

In large models, many singular values of a weight matrix W (and, according to the LoRA hypothesis, of the update applied to W during fine-tuning) are near zero, indicating redundancy. Such a matrix can be approximated by keeping only the largest singular values, reducing its complexity without significant information loss.



## Section 2: Implementation from Scratch

### Why Implement from Scratch?

While there are countless out-of-the-box tools and libraries that make it easier to fine-tune large models and adapt them using techniques like LoRA and qLoRA, our approach has always been to understand everything from first principles. By building our solutions from scratch, we gain a deeper understanding of the underlying mechanics, which empowers us to optimize and customize the methods to our specific needs.

In this section, we’ll break down the process of implementing LoRA and qLoRA from scratch, including the mathematical foundations, the architecture, and the steps involved in training the model. Once we’ve understood the implementation, we will integrate **qLoRA** and plug it into the LoRA model to test the results.

## First, let's start with SVD, which LoRA builds on:

In [None]:
import torch

# Step 1: Create a random matrix W with 5 rows and 3 columns
# This matrix represents the weight matrix in our SVD process
W = torch.randn(5, 3)

# Print the original matrix W
# This matrix will be decomposed using Singular Value Decomposition
print("Original Matrix W:\n", W)


Original Matrix W:
 tensor([[-1.0328,  0.0464,  0.9360],
        [-0.8173, -0.7201, -1.4533],
        [-1.3988, -0.6507, -0.5185],
        [-1.4505, -1.2822,  0.5440],
        [-0.0512, -1.2063,  0.5130]])


In [None]:

# Step 2: Perform Singular Value Decomposition (SVD)
# torch.linalg.svd performs SVD on the matrix W
# U, S, and Vh are the three components resulting from the SVD
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

# Print the resulting matrices
# U contains the left singular vectors
# S contains the singular values (a 1D tensor)
# Vh is the transpose of V (the right singular vectors)
print("\nU matrix:\n", U)
print("\nSingular values (S):\n", S)
print("\nV transpose (Vh):\n", Vh)



U matrix:
 tensor([[-0.2597,  0.4954,  0.5552],
        [-0.3932, -0.7234, -0.0544],
        [-0.5251, -0.2361,  0.2857],
        [-0.6586,  0.3152, -0.1077],
        [-0.2616,  0.2760, -0.7717]])

Singular values (S):
 tensor([2.9002, 1.9518, 1.1983])

V transpose (Vh):
 tensor([[ 0.7905,  0.6113,  0.0373],
        [-0.0315, -0.0202,  0.9993],
        [-0.6116,  0.7912, -0.0033]])


In [None]:

# Step 3: Reconstruct the original matrix using the formula W = U * S * V^T
# Since S is a 1D array, we need to convert it to a diagonal matrix
S_matrix = torch.diag(S)

# Reconstruct W by multiplying U, S, and Vh (the transpose of V)
W_reconstructed = U @ S_matrix @ Vh

# Step 4: Print the reconstructed matrix and check if it is close to the original matrix
print("\nReconstructed Matrix W (U * S * V^T):\n", W_reconstructed)

# Check if the reconstruction is close to the original matrix
print("\nReconstruction Close to Original:", torch.allclose(W, W_reconstructed))



Reconstructed Matrix W (U * S * V^T):
 tensor([[-1.0328,  0.0464,  0.9360],
        [-0.8173, -0.7201, -1.4533],
        [-1.3988, -0.6507, -0.5185],
        [-1.4505, -1.2822,  0.5440],
        [-0.0512, -1.2063,  0.5130]])

Reconstruction Close to Original: True
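
To connect the reconstruction above with low-rank approximation, the sketch below keeps only the k largest singular values. The helper `truncate` and the fixed seed are assumptions made for this illustration; the point is that the error of a rank-k approximation is exactly the energy in the singular values that were dropped.

```python
import torch

torch.manual_seed(0)
W = torch.randn(5, 3)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)


def truncate(U, S, Vh, k):
    """Rank-k approximation built from the k largest singular values."""
    return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]


for k in (1, 2, 3):
    W_k = truncate(U, S, Vh, k)
    err = torch.linalg.norm(W - W_k)          # Frobenius norm of the residual
    dropped = torch.sqrt((S[k:] ** 2).sum())  # energy in the discarded singular values
    print(f"rank {k}: error={err:.4f}, sqrt(sum of dropped S^2)={dropped:.4f}")
```

A random matrix like this one has no small singular values, so the truncation error is large; the bet behind LoRA is that fine-tuning updates are much closer to low rank than random noise is.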


## Concrete Example: Training on the MNIST Dataset

In this section, we'll go through a practical example using the classic MNIST dataset:

1. **Training the Model**:  
   We’ll train a model (intentionally larger than necessary) to simulate a more complex scenario and evaluate how well it performs.

2. **Evaluating Model Performance**:  
   After training, we'll evaluate the model's performance, focusing on how well it performs across each individual class (digit), and highlight any potential shortcomings.

3. **Introducing LoRA**:  
   Finally, we’ll introduce Low-Rank Adaptation (LoRA) to see how it improves the model’s performance and efficiency compared to the original approach.

By doing this, we aim to better understand the impact of LoRA in optimizing the model's results.


In [None]:
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import torch.nn.functional as F
import torch.optim as optim
from tqdm import  tqdm

# Define transformations for the dataset
# - ToTensor() converts the image into a tensor.
# - Normalize() normalizes the image with the given mean and standard deviation values for MNIST.
transform = transforms.Compose([
    transforms.ToTensor(),  # Converts the images into tensor format, which is needed by PyTorch
    transforms.Normalize((0.1307,), (0.3081,))  # Normalizes the tensor by the mean and standard deviation of MNIST
])

# Load the training dataset

train_data = datasets.MNIST('mnist', train=True, download=True, transform=transform)

# Create a DataLoader to load the training data in batches of 10 images at a time.

train_loader = DataLoader(train_data, batch_size=10, shuffle=True)

# Load the test dataset

test_data = datasets.MNIST('mnist', train=False, download=True, transform=transform)

# Create a DataLoader for the test dataset

test_loader = DataLoader(test_data, batch_size=10, shuffle=False)



In [None]:
# We use larger hidden layers than necessary here to simulate a bigger model with many parameters

class Mnist(nn.Module):
    def __init__(self,hidden_size1=1000, hidden_size2=2000):
        super(Mnist, self).__init__()
        self.fc1 = nn.Linear(784, hidden_size1)
        self.fc2 = nn.Linear(hidden_size1, hidden_size2)
        self.fc3 = nn.Linear(hidden_size2, 10)

    def forward(self, x):
        x = x.view(-1, 784)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

model = Mnist()

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model.to(device)

def train(train_loader, net, epochs=5, total_iterations_limit=None):
    cross_el = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(net.parameters(), lr=0.001)

    total_iterations = 0

    for epoch in range(epochs):
        net.train()

        loss_sum = 0
        num_iterations = 0

        data_iterator = tqdm(train_loader, desc=f'Epoch {epoch+1}')
        if total_iterations_limit is not None:
            data_iterator.total = total_iterations_limit
        for data in data_iterator:
            num_iterations += 1
            total_iterations += 1
            x, y = data
            x = x.to(device)
            y = y.to(device)
            optimizer.zero_grad()
            output = net(x.view(-1, 28*28))
            loss = cross_el(output, y)
            loss_sum += loss.item()
            avg_loss = loss_sum / num_iterations
            data_iterator.set_postfix(loss=avg_loss)
            loss.backward()
            optimizer.step()

            if total_iterations_limit is not None and total_iterations >= total_iterations_limit:
                return

train(train_loader, model, epochs=1)

Epoch 1: 100%|██████████| 6000/6000 [00:51<00:00, 117.43it/s, loss=0.237]


Let's save the model weights so we can compare them after the LoRA implementation.

In [None]:
og_weights = {}

for name, param in model.named_parameters():
    og_weights[name] = param.clone().detach()


In [None]:
def test():
    correct = 0
    total = 0

    wrong_counts = [0 for i in range(10)]

    with torch.no_grad():
        for data in tqdm(test_loader, desc='Testing'):
            x, y = data
            x = x.to(device)
            y = y.to(device)
            output = model(x.view(-1, 784))
            for idx, i in enumerate(output):
                if torch.argmax(i) == y[idx]:
                    correct +=1
                else:
                    wrong_counts[y[idx]] +=1
                total +=1
    print(f'Accuracy: {round(correct/total, 3)}')
    for i in range(len(wrong_counts)):
        print(f'wrong counts for the digit {i}: {wrong_counts[i]}')

test()

Testing: 100%|██████████| 1000/1000 [00:03<00:00, 288.03it/s]

Accuracy: 0.963
wrong counts for the digit 0: 57
wrong counts for the digit 1: 17
wrong counts for the digit 2: 40
wrong counts for the digit 3: 59
wrong counts for the digit 4: 22
wrong counts for the digit 5: 35
wrong counts for the digit 6: 38
wrong counts for the digit 7: 27
wrong counts for the digit 8: 25
wrong counts for the digit 9: 52





# Fine-Tuning MNIST Model with LoRA on Class 9

We will fine-tune the MNIST model using **LoRA (Low-Rank Adaptation)** to improve performance specifically on **class 9** (digit 9), one of the weaker classes in the test above with 52 misclassifications. Instead of retraining the entire model, we will keep the original weights frozen and train only the small LoRA matrices on examples of the digit 9.

Let's look at the network parameters before implementing LoRA.

In [None]:
num_params = sum(p.numel() for p in model.parameters())
print(f"Number of parameters: {num_params}")

for name, param in model.named_parameters():
    print(f"Parameter {name}: {param.size()}")


Number of parameters: 2807010
Parameter fc1.weight: torch.Size([1000, 784])
Parameter fc1.bias: torch.Size([1000])
Parameter fc2.weight: torch.Size([2000, 1000])
Parameter fc2.bias: torch.Size([2000])
Parameter fc3.weight: torch.Size([10, 2000])
Parameter fc3.bias: torch.Size([10])


Now let's implement the LoRA parametrization according to the original paper [https://arxiv.org/abs/2106.09685]


In [None]:
class LoRaParams(nn.Module):
    def __init__(self, feat_in, feat_out, rank, alpha=1.0, device='cpu'):
        super(LoRaParams, self).__init__()

        # Low-rank decomposition matrices A and B.
        # A starts at zero, so B @ A is zero and training begins from the original weights.
        self.A = nn.Parameter(torch.zeros(rank, feat_out, device=device))  # (rank, feat_out)
        self.B = nn.Parameter(torch.randn(feat_in, rank, device=device))   # (feat_in, rank)
        self.scale = alpha / rank

        # Flag toggled by enable_disable_lora()
        self.enabled = True

    def forward(self, og_weights):
        # Apply LoRA if enabled, otherwise return the original weights
        if self.enabled:
            # B @ A has shape (feat_in, feat_out); nn.Linear stores its weight as
            # (feat_out, feat_in), so the update is transposed before it is added
            return og_weights + torch.matmul(self.B, self.A).T * self.scale
        else:
            # Return original weights if LoRA is not enabled
            return og_weights


### Injecting LoRA Parametrization into the Model using `torch.nn.utils.parametrize`

To integrate the LoRA (Low-Rank Adaptation) parametrization into the model, we will use PyTorch's `torch.nn.utils.parametrize` module. It lets us modify the weights of specific layers by applying low-rank updates, adapting the model without retraining it from scratch.

### Steps for LoRA Injection:

1. **Identify Target Layers**: Determine which layers will benefit from low-rank adaptation. In this case, we apply LoRA to the weight matrices of all three linear layers.
  
2. **Apply LoRA Parametrization**: For each of the selected layers, we replace their weights with LoRA-parametrized weights during the forward pass.

3. **Use `parametrize.register_parametrization`**: This function registers the LoRA module on a layer's weight, so the LoRA matrices become trainable parameters of the model and the layer's weight is recomputed through the module whenever it is accessed.
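
The steps above can be sketched on a single small layer. The class `AddLowRank`, the layer sizes, and the `fill_(0.5)` call that stands in for training are all hypothetical; only `register_parametrization` and `is_parametrized` come from PyTorch.

```python
import torch
import torch.nn as nn
from torch.nn.utils import parametrize


class AddLowRank(nn.Module):
    """Hypothetical parametrization: returns W + B @ A for a weight of shape (out, in)."""

    def __init__(self, out_features, in_features, rank=1):
        super().__init__()
        self.B = nn.Parameter(torch.zeros(out_features, rank))  # zero: no change at start
        self.A = nn.Parameter(torch.randn(rank, in_features))
        self.enabled = True

    def forward(self, W):
        if self.enabled:
            return W + self.B @ self.A
        return W


torch.manual_seed(0)
layer = nn.Linear(4, 3)
W0 = layer.weight.detach().clone()

parametrize.register_parametrization(layer, "weight", AddLowRank(3, 4))
print(parametrize.is_parametrized(layer, "weight"))  # True
print([name for name, _ in layer.named_parameters()])

# Simulate a trained adapter, then switch it off again.
with torch.no_grad():
    layer.parametrizations.weight[0].B.fill_(0.5)
changed = not torch.allclose(layer.weight, W0)
layer.parametrizations.weight[0].enabled = False
restored = torch.allclose(layer.weight, W0)
print(changed, restored)
```

The original tensor is kept untouched under `parametrizations.weight.original`, which is why disabling the adapter returns the layer to its pretrained behaviour.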


In [None]:
from torch.nn.utils import parametrize

# Define the correct LoRaParametrization for each layer
def linear_layer_parametrize(layer, rank=1, alpha=1, device=device):
    feat_in = layer.in_features
    feat_out = layer.out_features
    lora_params = LoRaParams(feat_in, feat_out, rank=rank, alpha=alpha, device=device)
    return lora_params

# Register LoRA parametrizations
parametrize.register_parametrization(model.fc1, 'weight', linear_layer_parametrize(model.fc1, device=device))
parametrize.register_parametrization(model.fc2, 'weight', linear_layer_parametrize(model.fc2, device=device))
parametrize.register_parametrization(model.fc3, 'weight', linear_layer_parametrize(model.fc3,  device=device))


def enable_disable_lora(enabled=True):
    for layer in [model.fc1, model.fc2, model.fc3]:
        layer.parametrizations["weight"][0].enabled = enabled

for name, param in model.named_parameters():
    if 'weight' in name:
        print(f"{name}: {param.shape}")


fc1.parametrizations.weight.original: torch.Size([1000, 784])
fc1.parametrizations.weight.0.A: torch.Size([1, 1000])
fc1.parametrizations.weight.0.B: torch.Size([784, 1])
fc2.parametrizations.weight.original: torch.Size([2000, 1000])
fc2.parametrizations.weight.0.A: torch.Size([1, 2000])
fc2.parametrizations.weight.0.B: torch.Size([1000, 1])
fc3.parametrizations.weight.original: torch.Size([10, 2000])
fc3.parametrizations.weight.0.A: torch.Size([1, 10])
fc3.parametrizations.weight.0.B: torch.Size([2000, 1])


Let's count the LoRA parameters and compare them with the total number of parameters.

In [None]:
# Assuming the model is a simple neural network with LoRA applied
def count_params(model):
    total_params = sum(p.numel() for p in model.parameters())
    return total_params

def count_lora_params(model):
    lora_params = 0
    # Iterate through the layers where LoRA is applied (e.g., fc1, fc2, fc3)
    for name, param in model.named_parameters():
        # Check if the parameter is a LoRA parameter (e.g., A or B)
        if 'A' in name or 'B' in name:
            lora_params += param.numel()
    return lora_params


# Calculate total number of parameters
total_params = count_params(model)

# Calculate total LoRA parameters (A and B matrices)
lora_params = count_lora_params(model)

# Calculate percentage of LoRA parameters
lora_percentage = (lora_params / total_params) * 100

print(f"Total Parameters: {total_params}")
print(f"LoRA Parameters: {lora_params}")
print(f"Percentage of LoRA Parameters: {lora_percentage:.2f}%")


Total Parameters: 2813804
LoRA Parameters: 6794
Percentage of LoRA Parameters: 0.24%


As you can see, adding the LoRA parameters increased the original model by 6794 parameters, so the total is now 2813804. But how can a model that has grown in size train much faster and with minimal loss in accuracy?

### Why Does a Model Increase in Size with LoRA Parameters but Train Faster with Minimal Accuracy Loss?

When adding **Low-Rank Adaptation (LoRA)** to a model, the overall number of parameters increases due to the introduction of additional low-rank matrices (`A` and `B`). However, the model can still be fine-tuned faster and with minimal loss in accuracy for the following reasons:

1. **Efficient Use of Parameters**: LoRA introduces low-rank matrices that are much smaller than the full weight matrices. Here they add 6,794 parameters to a model of 2,807,010, about 0.24%.

2. **Frozen Pretrained Weights**: LoRA freezes the original pretrained weights and only trains the low-rank matrices. No gradients or optimizer state have to be kept for the frozen weights, which reduces memory use and the work the optimizer does in each step.

3. **Low-Rank Updates**: The weight update is constrained to rank `r`, so far fewer values need to be learned than in a full update. The hypothesis behind LoRA is that such a low-rank update is enough to adapt a pretrained model to a new task.

4. **Minimal Impact on Accuracy**: Because `A` is initialized to zero, the adapted model starts out identical to the pretrained one, and training only moves it within a small low-rank subspace.

5. **Faster Fine-Tuning**: By only modifying a small portion of the model, LoRA allows for quicker fine-tuning, reducing the time required for training while still achieving good performance.

6. **Reduced Overfitting**: The small number of trainable parameters limits how far the model can drift toward the fine-tuning data, which can help against overfitting.

In conclusion, **LoRA** increases the model size only slightly but makes fine-tuning much cheaper by restricting training to low-rank adaptations. This keeps the pretrained weights intact and reduces the time and memory spent fine-tuning, making it an efficient technique for model adaptation.
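
As a rough illustration of the memory argument, the sketch below estimates the float32 storage needed for gradients and Adam's optimizer state, using the parameter counts printed earlier in this notebook. The helper name and the factor of three (one gradient plus Adam's two moment estimates per trainable value) are simplifying assumptions; activations are ignored.

```python
# Rough float32 memory for gradients and Adam state, using the counts printed above.
BYTES_PER_FLOAT32 = 4
base_params = 2_807_010   # parameters of the original MNIST model
lora_params = 6_794       # parameters in the A and B matrices


def training_overhead_bytes(trainable: int) -> int:
    """Gradient (1 copy) plus Adam's two moment estimates (2 copies) per trainable value."""
    return trainable * BYTES_PER_FLOAT32 * 3


full = training_overhead_bytes(base_params)
lora = training_overhead_bytes(lora_params)
print(f"full fine-tuning: {full / 1e6:.1f} MB of gradients and optimizer state")
print(f"LoRA fine-tuning: {lora / 1e6:.3f} MB of gradients and optimizer state")
```

The forward and backward passes still run through the whole frozen network, so the time per step shrinks far less than this memory figure does.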


In [None]:

# Freeze all layers except LoRA layers
def freeze_layers_except_lora(model):
    for name, param in model.named_parameters():
        # Freeze layers that are not LoRA parameters (i.e., layers that do not contain 'A' or 'B' in the name)
        if 'A' not in name and 'B' not in name:
            param.requires_grad = False
        else:
            # Unfreeze LoRA layers (those with 'A' or 'B' in the name)
            param.requires_grad = True

# Apply the freeze function to the model
freeze_layers_except_lora(model)

# Check if layers are frozen properly by printing the requires_grad status
for name, param in model.named_parameters():
    print(f"{name}: requires_grad={param.requires_grad}")


fc1.bias: requires_grad=False
fc1.parametrizations.weight.original: requires_grad=False
fc1.parametrizations.weight.0.A: requires_grad=True
fc1.parametrizations.weight.0.B: requires_grad=True
fc2.bias: requires_grad=False
fc2.parametrizations.weight.original: requires_grad=False
fc2.parametrizations.weight.0.A: requires_grad=True
fc2.parametrizations.weight.0.B: requires_grad=True
fc3.bias: requires_grad=False
fc3.parametrizations.weight.original: requires_grad=False
fc3.parametrizations.weight.0.A: requires_grad=True
fc3.parametrizations.weight.0.B: requires_grad=True


In [None]:
mnist_trainset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
# Keep only the samples of the digit 9 (boolean mask over the labels)
digit9_mask = mnist_trainset.targets == 9
mnist_trainset.data = mnist_trainset.data[digit9_mask]
mnist_trainset.targets = mnist_trainset.targets[digit9_mask]
# Create a dataloader for the training
train_loader = torch.utils.data.DataLoader(mnist_trainset, batch_size=10, shuffle=True)

# Train the network with LoRA only on the digit 9 and only for 100 batches (hoping that it would improve the performance on the digit 9)
train(train_loader, model, epochs=1, total_iterations_limit=100)


Epoch 1:  99%|█████████▉| 99/100 [00:00<00:00, 127.56it/s, loss=0.0616]


In [None]:
enable_disable_lora(enabled=True)

test()

Testing: 100%|██████████| 1000/1000 [00:03<00:00, 265.72it/s]

Accuracy: 0.767
wrong counts for the digit 0: 460
wrong counts for the digit 1: 150
wrong counts for the digit 2: 106
wrong counts for the digit 3: 205
wrong counts for the digit 4: 481
wrong counts for the digit 5: 318
wrong counts for the digit 6: 73
wrong counts for the digit 7: 335
wrong counts for the digit 8: 202
wrong counts for the digit 9: 3





### Observing the Speed and Accuracy with LoRA Fine-Tuning

Did you see how fast the fine-tuning ran? With LoRA (Low-Rank Adaptation), 100 batches of training, which took under a second, were enough to significantly improve accuracy on the targeted class, digit "9": its misclassifications dropped from 52 to 3.

However, it's worth noting that accuracy on the other classes declined, and overall accuracy fell from 0.963 to 0.767. This drop occurs because the fine-tuning data contained only the digit "9", so the low-rank update pushes the model toward that class, a trade-off common in narrowly targeted fine-tuning. In this case, though, the reduction in accuracy on other classes is acceptable, given the simplicity of the dataset and our specific objective of enhancing performance for a single class.

The takeaway? With only about 0.24% of the model's parameters being trained, LoRA allowed us to fine-tune efficiently and sharply reduce errors on the targeted class! 🚀

## Adding Quantization to the Equation

Quantization involves reducing the precision of the model’s weights and activations, typically from 32-bit floating point numbers (float32) to lower bit-width formats such as:

- 16-bit (half precision, float16)
- 8-bit (int8)
- 4-bit (int4)

This can significantly reduce memory usage and computational overhead without sacrificing too much performance, especially when applied during fine-tuning.
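
To put these bit-widths in perspective, the sketch below estimates the storage needed for the weights of a 1.5B-parameter model such as GPT-2-XL. The parameter count is rounded and the helper name is illustrative; quantization metadata such as scaling factors is ignored.

```python
# Approximate weight storage for a 1.5B-parameter model at different precisions.
NUM_PARAMS = 1_500_000_000  # rounded size of GPT-2-XL, as in the notebook title


def weight_gigabytes(num_params: int, bits: int) -> float:
    """Storage in GB (10^9 bytes) for num_params values of the given bit-width."""
    return num_params * bits / 8 / 1e9


for label, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label:>7}: {weight_gigabytes(NUM_PARAMS, bits):.2f} GB")
```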

Here are the steps for implementing **QLoRA** (Quantized Low-Rank Adaptation) :

## How the Quantization Works
Here is the breakdown of the steps involved in quantizing a tensor:

1- Determine the Maximum Value:

We first calculate the maximum absolute value in the tensor. This will be used to scale the tensor values so that they fit within the range of the desired bit-width.
Scaling:

The scaling factor is determined based on the bit-width. For example, for an 8-bit quantization, the range is from -127 to 127 (since -2^(n-1) + 1 to 2^(n-1) - 1 for an n-bit signed integer). The scaling factor ensures the tensor's values fit within this range.
The scaling factor is calculated as:

scale = max absolute value of tensor(2(bit-width−1)−1)
scale=
(2
(bit-width−1)
 −1)
max absolute value of tensor
​

Quantization:

Each element of the tensor is divided by the scaling factor, rounded to the nearest integer, and then multiplied by the scaling factor again. This maps the floating-point values into integer values that fit the specified bit-width.
Casting to Integer Type:

Finally, the tensor is cast to an integer type (e.g., torch.int8 for 8-bit).



In [None]:
# Step 1: Create an example tensor (input tensor)
tensor = torch.tensor([0.5, -1.2, 2.3, 3.1, -0.4], dtype=torch.float32)

# Step 2: Find the maximum absolute value in the tensor
# This is used to calculate the scaling factor for quantization
max_val = tensor.abs().max()  # max_val = 3.1
print(f"Maximum value: {max_val}")

# Step 3: Compute the scale factor for quantization
# For 8-bit quantization, the range is [-127, 127] (signed int8).
# The scaling factor ensures that the max value in the tensor fits within the 8-bit range.
scale = max_val / (2 ** (8 - 1) - 1)  # scale = 3.1 / 127 = 0.0244
print(f"Scaling factor: {scale}")

# Step 4: Perform quantization
# 1. Divide the tensor by the scaling factor to scale it.
# 2. Round the values to the nearest integer (these are the integer codes).
# 3. Multiply by the scaling factor again to restore the range of the tensor.
quantized_codes = torch.round(tensor / scale)
quantized_tensor = quantized_codes * scale
print(f"Quantized tensor (scaled and rounded): {quantized_tensor}")

# Step 5: Cast the integer codes to int8 type for 8-bit quantization
# Cast the codes, not the rescaled floats: casting the rescaled values would
# truncate them to [0, -1, 2, 3, 0] and throw away almost all of the information.
quantized_tensor = quantized_codes.to(torch.int8)
print(f"Quantized tensor as int8: {quantized_tensor}")


Maximum value: 3.0999999046325684
Scaling factor: 0.024409448727965355
Quantized tensor (scaled and rounded): tensor([ 0.4882, -1.1961,  2.2945,  3.1000, -0.3906])
Quantized tensor as int8: tensor([ 20, -49,  94, 127, -16], dtype=torch.int8)
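Since QLoRA works at 4 bits rather than 8, it helps to see how the approximation error grows as the bit-width shrinks. The sketch below wraps the same absmax scheme in two hypothetical helpers, `absmax_quantize` and `dequantize` (names chosen for illustration), and compares the round-trip error at 8 and 4 bits on the example tensor. It assumes the input is not all zeros and stores the codes in `int8` in both cases, since PyTorch has no native 4-bit dtype.

```python
import torch

def absmax_quantize(x: torch.Tensor, bits: int):
    """Symmetric absmax quantization. Returns integer codes (stored as int8) and the scale."""
    qmax = 2 ** (bits - 1) - 1                     # 127 for 8-bit, 7 for 4-bit
    scale = x.abs().max() / qmax                   # assumes x is not all zeros
    codes = torch.round(x / scale).clamp(-qmax, qmax).to(torch.int8)
    return codes, scale

def dequantize(codes: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Map integer codes back to approximate float values."""
    return codes.to(torch.float32) * scale

x = torch.tensor([0.5, -1.2, 2.3, 3.1, -0.4], dtype=torch.float32)

errors = {}
for bits in (8, 4):
    codes, scale = absmax_quantize(x, bits)
    errors[bits] = (x - dequantize(codes, scale)).abs().max().item()
    print(f"{bits}-bit codes: {codes.tolist()}  max abs error: {errors[bits]:.4f}")
```

With only 15 usable levels at 4 bits, the error is roughly an order of magnitude larger than at 8 bits, which is why QLoRA relies on a data type tailored to the weight distribution rather than plain uniform rounding.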


# QLoRA Components

## 1. 4-Bit NormalFloat (NF4)
In QLoRA, the pre-trained base weights are frozen and stored in a 4-bit data type called NormalFloat (NF4), while the low-rank matrices \( A \) and \( B \) stay in higher precision and are the only parameters that are trained. A 4-bit value can take only 16 distinct levels. Rather than spacing these levels evenly, NF4 places them at quantiles of a normal distribution, because pre-trained weights are roughly normally distributed around zero. The weights are quantized in small blocks, each with its own scaling constant, and are dequantized to the compute precision whenever they are needed in a forward or backward pass. This cuts the memory needed to hold the model, which is what makes fine-tuning LLMs (Large Language Models) feasible on limited hardware.

### Process:
- **Blockwise scaling**: Each block of frozen weights is divided by its own absolute maximum so the values fall within \( [-1, 1] \).
- **Quantization**: Each scaled value is mapped to the nearest of the 16 NF4 levels and stored as a 4-bit index.
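The idea behind a normal-quantile 4-bit data type can be sketched in plain Python. This is a simplified stand-in, not the actual NF4 table used by bitsandbytes (which is constructed differently and, for example, contains an exact zero). The function names, the block size of 64, and the synthetic Gaussian weights are all assumptions made for illustration.

```python
import random
from statistics import NormalDist

def normal_quantile_codebook(bits: int = 4):
    """2**bits levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    quantiles = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    top = max(abs(q) for q in quantiles)
    return [q / top for q in quantiles]

def quantize_block(block, levels):
    """Scale a block by its absolute maximum, then store the index of the nearest level."""
    absmax = max(abs(w) for w in block)  # assumes the block is not all zeros
    codes = [min(range(len(levels)), key=lambda i: abs(w / absmax - levels[i]))
             for w in block]
    return codes, absmax

def dequantize_block(codes, absmax, levels):
    """Look up each level and undo the blockwise scaling."""
    return [levels[c] * absmax for c in codes]

random.seed(0)
levels = normal_quantile_codebook()
weights = [random.gauss(0.0, 0.02) for _ in range(64)]  # one synthetic block of 64 weights
codes, absmax = quantize_block(weights, levels)
restored = dequantize_block(codes, absmax, levels)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(set(codes))} of {len(levels)} levels used, max abs error = {max_err:.5f}")
```

Because the levels are denser near zero, where most weights live, typical weights land close to a level even though only 16 values are available.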
  
## 2. Double Quantization
Double quantization refers to quantizing the quantization constants themselves. Blockwise quantization stores one scaling constant per block of weights, and with small blocks these constants add a noticeable memory overhead. QLoRA therefore applies a second round of quantization to the constants, storing them in 8-bit instead of 32-bit. The 4-bit weights themselves are untouched; only the bookkeeping around them shrinks. This is particularly useful in resource-constrained environments, where every fraction of a bit per parameter counts.

### Process:
- **First Quantization**: Quantize the frozen weights to 4-bit in blocks, producing one scaling constant per block.
- **Second Quantization**: Quantize those scaling constants to a lower bit-width (e.g., 8-bit), again in blocks.
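The saving from double quantization is easiest to see as overhead in bits per parameter. The sketch below assumes the setup described in the QLoRA paper (weight blocks of 64, 32-bit constants quantized to 8 bits in groups of 256); treat these block sizes as assumptions rather than something this notebook verifies. `constant_overhead_bits` is a hypothetical helper.

```python
def constant_overhead_bits(block_size: int, constant_bits: int) -> float:
    """Extra bits per weight spent on one scaling constant shared by `block_size` weights."""
    return constant_bits / block_size

# Single quantization: one 32-bit constant per block of 64 weights
single = constant_overhead_bits(64, 32)

# Double quantization: one 8-bit constant per 64 weights,
# plus one 32-bit second-level constant per 256 first-level constants
double = constant_overhead_bits(64, 8) + constant_overhead_bits(64 * 256, 32)

num_params = 1_500_000_000  # assumption: GPT-2-XL rounded to 1.5B parameters
saved_mb = (single - double) * num_params / 8 / 1e6

print(f"overhead without double quantization: {single:.3f} bits/param")
print(f"overhead with double quantization:    {double:.3f} bits/param")
print(f"approximate saving for 1.5B params:   {saved_mb:.0f} MB")
```

The saving is modest for a 1.5B-parameter model but scales linearly with model size.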

## 3. Paged Optimizer
The paged optimizer is designed to survive the memory spikes that occur during training, for example when a batch contains an unusually long sequence. Optimizer states are allocated in paged memory, so when the GPU runs out of memory they are automatically moved ("paged") to CPU RAM and brought back when the optimizer update needs them. This does not shrink the model; it prevents out-of-memory errors on hardware with limited GPU memory.

### Process:
- **Paging**: Optimizer states are moved to CPU memory when GPU memory is exhausted.
- **Optimization**: The pages are transferred back to the GPU when the optimizer step needs them.
  
## 4. LoRA (Low-Rank Adaptation)
LoRA (Low-Rank Adaptation) is a technique used to reduce the number of trainable parameters in large models, such as LLMs. Instead of updating the full weight matrix, LoRA freezes it and learns a low-rank update formed by two smaller matrices \( A \) and \( B \), where the rank controls the capacity of the update. The rank can be adjusted based on the desired trade-off between model performance and memory usage. LoRA helps in adapting large pre-trained models to new tasks by injecting these low-rank matrices into the model, with little additional computational overhead.

### Process:
- **Low-rank update**: The change to the weight matrix is approximated as \( \mathbf{A} \times \mathbf{B} \), where \( \mathbf{A} \) and \( \mathbf{B} \) are much smaller than the original matrix.
- **Model adaptation**: By adding the low-rank update to the original (frozen) weights, you modify the behavior of the pre-trained model while training only a small number of parameters.
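As a minimal sketch of the update rule, the snippet below builds a frozen weight and a rank-1 update for a 784-to-1000 layer (the shape of `fc1` in this notebook) and checks that the adapted layer starts out identical to the original. It follows the convention of the LoRA paper, with `B` of shape (out, rank) initialised to zero and `A` of shape (rank, in); the class later in this notebook names and shapes its matrices differently. The `alpha` value and the helper name `lora_weight` are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
d_in, d_out, rank, alpha = 784, 1000, 1, 1.0

W = torch.randn(d_out, d_in)          # frozen pre-trained weight
A = torch.randn(rank, d_in) * 0.01    # trainable, small random init
B = torch.zeros(d_out, rank)          # trainable, zero init so the update starts at zero

def lora_weight(W, A, B, alpha, rank):
    """Effective weight: frozen W plus the scaled low-rank update B @ A."""
    return W + (alpha / rank) * (B @ A)

x = torch.randn(3, d_in)
same_at_init = torch.allclose(x @ W.T, x @ lora_weight(W, A, B, alpha, rank).T)

full_params = W.numel()
lora_params = A.numel() + B.numel()
print("output unchanged at init:", same_at_init)
print(f"trainable: {lora_params} of {full_params} ({100 * lora_params / full_params:.2f}%)")
```

Zero-initialising one of the two matrices is what guarantees that training starts from the pre-trained model's behavior.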

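## Let's now try to implement this:
## LoRA + Quantization = QLoRA

The class below is a simplified illustration: it applies plain min-max quantization to the LoRA matrices and to their product. The NF4 quantization of the frozen base weights is handled later in this notebook by bitsandbytes.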


In [None]:
class QLoRaParams(nn.Module):
    def __init__(self, feat_in, feat_out, rank, alpha=1.0, device='cpu', quantization_bit=4, lora=True):
        super(QLoRaParams, self).__init__()

        # Initialize low-rank matrices A and B
        self.A = nn.Parameter(torch.zeros(rank, feat_out).to(device))  # (rank, feat_out)
        self.B = nn.Parameter(torch.randn(feat_in, rank).to(device))  # (feat_in, rank)
        self.scale = alpha / rank

        self.lora = lora
        self.quantization_bit = quantization_bit
        self.device = device

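    def quantize(self, tensor):
        """
        Quantize tensor to the specified bit-width and map it back to floats
        ("fake" quantization), so the result can be used in a normal matmul.
        """
        min_val, max_val = tensor.min(), tensor.max()
        range_val = max_val - min_val + 1e-6  # Prevent division by zero

        # Scale tensor values to the quantization range [0, 2**bits - 1]
        scale = (2 ** self.quantization_bit - 1) / range_val
        quantized_tensor = torch.round((tensor - min_val) * scale)

        # Map back to the original range
        dequantized = (quantized_tensor / scale) + min_val

        # torch.round has zero gradient almost everywhere, which would starve A and B
        # of gradient. Straight-through estimator: the forward pass uses the quantized
        # values, the backward pass treats quantization as the identity.
        return tensor + (dequantized - tensor).detach()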

    def forward(self, og_weights):
        """
        Apply LoRA with quantization and return updated weights.
        """
        if self.lora:
            # Quantize A and B matrices
            quantized_A = self.quantize(self.A)
            quantized_B = self.quantize(self.B)

            # Compute low-rank update
            low_rank_update = torch.matmul(quantized_B, quantized_A).view(og_weights.shape) * self.scale

            # Quantize the low-rank update
            quantized_low_rank = self.quantize(low_rank_update)

            # Add quantized update to original weights
            return og_weights + quantized_low_rank
        else:
            # If LoRA is disabled, return original weights
            return og_weights
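One thing to watch when quantizing tensors that are being trained: `torch.round` has zero gradient almost everywhere, so a rounding step can silently block learning. A common workaround is the straight-through estimator (STE), which uses the quantized value in the forward pass and passes the gradient through unchanged in the backward pass. The sketch below demonstrates the difference on a toy tensor; `fake_quantize` is a hypothetical helper, and detaching the min/max statistics is a simplifying assumption.

```python
import torch

def fake_quantize(x, bits=4, straight_through=False):
    """Min-max quantize x to 2**bits levels and map it back to floats."""
    lo, hi = x.min().detach(), x.max().detach()   # treat the range as constants
    scale = (2 ** bits - 1) / (hi - lo + 1e-6)
    x_q = torch.round((x - lo) * scale) / scale + lo
    if straight_through:
        # Forward: x_q. Backward: gradient of the identity.
        return x + (x_q - x).detach()
    return x_q

x = torch.tensor([0.11, -0.42, 0.37, 0.90], requires_grad=True)

fake_quantize(x).sum().backward()
grad_plain = x.grad.clone()        # gradient through plain rounding

x.grad = None
fake_quantize(x, straight_through=True).sum().backward()
grad_ste = x.grad.clone()          # gradient with the straight-through estimator

print("plain rounding:  ", grad_plain)
print("straight-through:", grad_ste)
```

With plain rounding every entry of the gradient is zero, while the straight-through version yields the gradient of the unquantized sum.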


In [None]:
from torch.nn.utils import parametrize

# Define the correct LoRaParametrization for each layer
def linear_layer_parametrize(layer, rank=1, alpha=1, device=device):
    feat_in = layer.in_features
    feat_out = layer.out_features
    lora_params = QLoRaParams(feat_in, feat_out, rank=rank, alpha=alpha, device=device)
    return lora_params

# Register LoRA parametrizations
parametrize.register_parametrization(model.fc1, 'weight', linear_layer_parametrize(model.fc1, device=device))
parametrize.register_parametrization(model.fc2, 'weight', linear_layer_parametrize(model.fc2, device=device))
parametrize.register_parametrization(model.fc3, 'weight', linear_layer_parametrize(model.fc3,  device=device))


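def enable_disable_lora(enabled=False):
    for layer in [model.fc1, model.fc2, model.fc3]:
        # A layer can carry several stacked parametrizations (see the printed names below);
        # toggle the QLoRA ones through the `lora` flag that QLoRaParams.forward checks.
        for parametrization in layer.parametrizations["weight"]:
            if isinstance(parametrization, QLoRaParams):
                parametrization.lora = enabled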

for name, param in model.named_parameters():
    if 'weight' in name:
        print(f"{name}: {param.shape}")


fc1.parametrizations.weight.original: torch.Size([1000, 784])
fc1.parametrizations.weight.0.A: torch.Size([1, 1000])
fc1.parametrizations.weight.0.B: torch.Size([784, 1])
fc1.parametrizations.weight.1.A: torch.Size([1, 1000])
fc1.parametrizations.weight.1.B: torch.Size([784, 1])
fc2.parametrizations.weight.original: torch.Size([2000, 1000])
fc2.parametrizations.weight.0.A: torch.Size([1, 2000])
fc2.parametrizations.weight.0.B: torch.Size([1000, 1])
fc2.parametrizations.weight.1.A: torch.Size([1, 2000])
fc2.parametrizations.weight.1.B: torch.Size([1000, 1])
fc3.parametrizations.weight.original: torch.Size([10, 2000])
fc3.parametrizations.weight.0.A: torch.Size([1, 10])
fc3.parametrizations.weight.0.B: torch.Size([2000, 1])
fc3.parametrizations.weight.1.A: torch.Size([1, 10])
fc3.parametrizations.weight.1.B: torch.Size([2000, 1])


In [None]:

# Freeze all layers except QLoRA layers
def freeze_layers_except_lora(model):
    for name, param in model.named_parameters():
        # LoRA parameters are registered as "...A" and "...B". Match the final name
        # component rather than any 'A'/'B' substring, which could hit unrelated names.
        is_lora = name.split('.')[-1] in ('A', 'B')
        # Train only the LoRA matrices; freeze everything else
        param.requires_grad = is_lora

# Apply the freeze function to the model
freeze_layers_except_lora(model)

# Check if layers are frozen properly by printing the requires_grad status
for name, param in model.named_parameters():
    print(f"{name}: requires_grad={param.requires_grad}")


fc1.bias: requires_grad=False
fc1.parametrizations.weight.original: requires_grad=False
fc1.parametrizations.weight.0.A: requires_grad=True
fc1.parametrizations.weight.0.B: requires_grad=True
fc1.parametrizations.weight.1.A: requires_grad=True
fc1.parametrizations.weight.1.B: requires_grad=True
fc2.bias: requires_grad=False
fc2.parametrizations.weight.original: requires_grad=False
fc2.parametrizations.weight.0.A: requires_grad=True
fc2.parametrizations.weight.0.B: requires_grad=True
fc2.parametrizations.weight.1.A: requires_grad=True
fc2.parametrizations.weight.1.B: requires_grad=True
fc3.bias: requires_grad=False
fc3.parametrizations.weight.original: requires_grad=False
fc3.parametrizations.weight.0.A: requires_grad=True
fc3.parametrizations.weight.0.B: requires_grad=True
fc3.parametrizations.weight.1.A: requires_grad=True
fc3.parametrizations.weight.1.B: requires_grad=True


In [None]:
# Assuming the model is a simple neural network with LoRA applied
def count_params(model):
    total_params = sum(p.numel() for p in model.parameters())
    return total_params

def count_lora_params(model):
    lora_params = 0
    # Iterate through all parameters and keep the LoRA ones (registered as "...A" / "...B")
    for name, param in model.named_parameters():
        # Match the final name component instead of any 'A'/'B' substring
        if name.split('.')[-1] in ('A', 'B'):
            lora_params += param.numel()
    return lora_params


# Calculate total number of parameters
total_params = count_params(model)

# Calculate total LoRA parameters (A and B matrices)
lora_params = count_lora_params(model)

# Calculate percentage of LoRA parameters
lora_percentage = (lora_params / total_params) * 100

print(f"Total Parameters: {total_params}")
print(f"LoRA Parameters: {lora_params}")
print(f"Percentage of LoRA Parameters: {lora_percentage:.2f}%")


Total Parameters: 2820598
LoRA Parameters: 13588
Percentage of LoRA Parameters: 0.48%
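These totals can be reproduced by hand from the shapes printed earlier. Note that each layer shows two parametrizations (indices `0` and `1`), so the `A`/`B` pairs are counted twice. The short check below uses only the layer sizes and rank visible in the printed output.

```python
# (in_features, out_features) of the three linear layers, taken from the printed shapes
layers = {"fc1": (784, 1000), "fc2": (1000, 2000), "fc3": (2000, 10)}
rank = 1
num_adapters = 2  # parametrizations 0 and 1 are both registered on every layer

base_params = sum(i * o + o for i, o in layers.values())           # weights + biases
per_adapter = sum(rank * (i + o) for i, o in layers.values())      # one A and one B per layer
adapter_params = num_adapters * per_adapter
total_params = base_params + adapter_params

print(f"Base parameters:    {base_params}")
print(f"Adapter parameters: {adapter_params}")
print(f"Total parameters:   {total_params}")
print(f"Adapter share:      {100 * adapter_params / total_params:.2f}%")
```

A single rank-1 adapter set accounts for 6,794 parameters, so the 13,588 reported above corresponds to the two stacked parametrizations.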


## Let’s Recap Before Moving Forward  

Up to this point, we’ve introduced the concept of **LoRA** (Low-Rank Adaptation) and implemented it from scratch. We tested this implementation with a simple linear model on the MNIST dataset.  

Next, we discussed **quantization** and integrated its implementation into the LoRA framework to create a quantized LoRA (QLoRA).  

We briefly mentioned that QLoRA has four key components:  

1. **4-bit NormalFloat (NF4)**  
   - This is achieved by loading the model using the **BitsAndBytes** library, which allows us to load a quantized version of the model with 4-bit precision.  

2. **Double Quantization**  
   - Our toy `QLoRaParams` class applies quantization twice (to `A` and `B`, then to their product). For the real model, it is enabled with `bnb_4bit_use_double_quant=True` when loading through BitsAndBytes.  

3. **Paged Optimizer**  
   - Not implemented yet.  

4. **LoRA Application**  
   - Completed in our earlier implementation.  


### What's Missing?  

At this point, the only remaining piece is the **paged optimizer**, which keeps training from running out of GPU memory during memory spikes. This will be the next step in our implementation journey!  

Let’s dive in. 🚀

#### **Section 3: Fine-Tuning GPT-2-XL on the Alpaca Dataset** 🦙  

The **Alpaca dataset** is a curated collection designed to enhance instruction-following capabilities in language models. It provides examples of user prompts and corresponding responses, covering a wide range of instructions. This dataset helps fine-tune models to better understand and generate human-like responses to instructions, making them more effective and versatile in real-world applications.  

To fine-tune GPT-2-XL on this dataset for instruction-following tasks, we will follow a systematic approach involving preprocessing, model quantization, and training. Below is the step-by-step flow of the project:  

1. **Load and Preprocess the Dataset**  
   - Format the dataset to match the input-output requirements of GPT-2-XL for instruction-following.  
   - Create efficient data loaders for streamlined training.  

2. **Load the Quantized Model**  
   - Use **BitsAndBytes** to load a 4-bit quantized version of GPT-2-XL, which reduces the memory needed to hold the model.  

3. **Inject the QLoRA Layer**  
   - Add **QLoRA layers** into the GPT-2-XL architecture to enable parameter-efficient fine-tuning while preserving the original model's weights.  

4. **Fine-Tune the Model**  
   - Train the model on the Alpaca dataset with the injected QLoRA layers, using a **paged optimizer** to avoid out-of-memory errors during training.  

By following this structured workflow, we can adapt GPT-2-XL to excel in instruction-following tasks while keeping computational overhead minimal. Let's get started! 🚀

## 1- Loading and Preparing the Data

In [None]:
import json
import os

# Path to the local dataset file
dataset_file = "/content/alpaca_data.json"

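# Load and inspect the data
if not os.path.isfile(dataset_file):
    # Fail early with a clear message instead of a NameError on `data` below
    raise FileNotFoundError(f"Dataset file not found: {dataset_file}")

with open(dataset_file, 'r', encoding='utf-8') as file:
    data = json.load(file)

print("Number of entries:", len(data))
# Print the first three entries of the list
print("Entry to the list:\n", data[:3])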


Number of entries: 52002
Entry to the list:
 [{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}, {'instruction': 'What are the three primary colors?', 'input': '', 'output': 'The three primary colors are red, blue, and yellow.'}, {'instruction': 'Describe the structure of an atom.', 'input': '', 'output': 'An atom is made up of a nucleus, which contains protons and neutrons, surrounded by electrons that travel in orbits around the nucleus. The protons and neutrons have a positive charge, while the electrons have a negative charge, resulting in an overall neutral atom. The number of each particle determines the atomic number and the type of atom.'}]


We will perform the same data preparation as we did in the previous notebook.

In [None]:
def format_input(entry):
    # Create the instruction text using the instruction from the entry
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"  # Include the instruction in the formatted text
    )

    # Check if the input exists; if so, format it accordingly
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    # Combine the instruction text and input text (if any) and return the result
    return instruction_text + input_text


In [None]:
model_input = format_input(data[50])
desired_response = f"\n\n### Response:\n{data[50]['output']}"

print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Edit the following sentence to make it more concise.

### Input:
He ran to the bus stop in order to catch the bus that was due to arrive in five minutes.

### Response:
He ran to the bus stop, due to arrive in five minutes.


In [None]:
from sklearn.model_selection import train_test_split

train_data, temp_data = train_test_split(data, test_size=0.15, random_state=42)  # 85% train, 15% temp

# Split temp data into 5% validation and 10% test
val_data, test_data = train_test_split(temp_data, test_size=(10/15), random_state=42)  # 5% val, 10% test

print("Train data size:", len(train_data))
print("Validation data size:", len(val_data))
print("Test data size:", len(test_data))

Train data size: 44201
Validation data size: 2600
Test data size: 5201


In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        # Initialize the dataset with data and a tokenizer
        self.data = data
        self.encode_data = []

        # Loop through each entry in the dataset
        for entry in data:
            # Format the input entry into a structured text
            instruction_text = format_input(entry) ## using the function defined above
            # Prepare the desired response format
            desired_response = f"\n\n### Response:\n{entry['output']}"
            # Combine instruction and response into full text
            full_text = instruction_text + desired_response
            # Encode the full text using the provided tokenizer and store it
            self.encode_data.append(
                tokenizer.encode(full_text)
            )

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.encode_data[index]


In [None]:
! pip install tiktoken
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.encode("<|endoftext|>", allowed_special = {"<|endoftext|>"}))

[50256]


In [None]:
def collate_fn3(batch, pad_token_id=50256, ignore_index=-100, allowed_max_len=None, device="cpu"):
    # Determine the maximum length of the sequences in the batch
    batch_max_length = max(len(item) + 1 for item in batch)
    input_ls, target_ls = [], []

    # Iterate through each item in the batch
    for item in batch:
        # Create a copy of the item and append the padding token
        new_item = item.copy()
        new_item += [pad_token_id]

        # Pad the item to the maximum length
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))

        # Separate inputs and targets; inputs exclude the last token, targets exclude the first token
        inputs = torch.tensor(padded[:-1])  # Input tensor
        target = torch.tensor(padded[1:])   # Target tensor

        # Create a mask for the padding tokens in the target tensor
        mask = target == pad_token_id
        indices = torch.nonzero(mask).squeeze()  # Get indices of padding tokens

        # If there are multiple padding tokens, ignore the subsequent ones in the loss calculation
        if indices.numel() > 1:
            target[indices[1:]] = ignore_index  # Set ignore index for padding

        # Optionally limit the length of inputs and targets
        if allowed_max_len is not None:
            inputs = inputs[:allowed_max_len]
            target = target[:allowed_max_len]

        # Append the processed inputs and targets to the respective lists
        input_ls.append(inputs)
        target_ls.append(target)

    # Stack the lists into tensors and move them to the specified device
    input_tensor = torch.stack(input_ls).to(device)
    target_tensor = torch.stack(target_ls).to(device)

    return input_tensor, target_tensor  # Return the batched input and target tensors
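To see what the collate function does to a single sequence, here is a list-based restatement of its padding, shifting, and masking logic. `pad_and_shift` is a hypothetical helper written for illustration, and the token ids in the toy batch are made up.

```python
def pad_and_shift(tokens, batch_max_length, pad_token_id=50256, ignore_index=-100):
    """Pad one sequence, build next-token targets, and mask all but the first pad target."""
    padded = tokens + [pad_token_id] * (batch_max_length - len(tokens))
    inputs, targets = padded[:-1], padded[1:]
    seen_pad = False
    for i, tok in enumerate(targets):
        if tok == pad_token_id:
            if seen_pad:
                targets[i] = ignore_index  # excluded from the loss
            seen_pad = True                # the first pad doubles as the end-of-text target
    return inputs, targets

batch = [[11, 12, 13], [21, 22, 23, 24, 25]]              # made-up token ids
batch_max_length = max(len(item) + 1 for item in batch)   # +1 for the end-of-text token

for item in batch:
    inputs, targets = pad_and_shift(item, batch_max_length)
    print("inputs: ", inputs)
    print("targets:", targets)
```

The first padding token is kept as a target so the model learns when to stop; the remaining ones are set to `-100`, which PyTorch's cross-entropy loss ignores by default.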


In [None]:
from functools import partial

# Note the call parentheses: without them the function object is always truthy
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

custom_collate_fn = partial(collate_fn3,
                            device=device,
                            allowed_max_len=1024)

num_workers = 0
batch_size = 1

train_dataset = InstructionDataset(train_data, tokenizer)
val_dataset = InstructionDataset(val_data, tokenizer)
test_dataset = InstructionDataset(test_data, tokenizer)

train_dataloader = DataLoader(
    train_dataset,
    collate_fn=custom_collate_fn,
    batch_size=batch_size,
    shuffle=True,
    drop_last=True,
    num_workers=num_workers
)

val_dataloader = DataLoader(
    val_dataset,
    collate_fn=custom_collate_fn,
    batch_size=batch_size,
    shuffle=False,
    drop_last=True,
    num_workers=num_workers
)

test_dataloader = DataLoader(
    test_dataset,
    collate_fn=custom_collate_fn,
    batch_size=batch_size,
    shuffle=False,
    drop_last=True,
    num_workers=num_workers
)

print("Train dataloader:", len(train_dataloader))
print("Validation dataloader:", len(val_dataloader))
print("Test dataloader:", len(test_dataloader))

Train dataloader: 44201
Validation dataloader: 2600
Test dataloader: 5201


In [None]:
# Since we are working with limited RAM, make sure to clean up memory from time to time
# Clean up temporary variables and free memory
import gc

# Remove temp_data (no longer needed after splitting)
del temp_data

# Clear any large unused variables
del model_input, desired_response  # Output used for testing only

# Delete the data after the split to free up memory
del data
gc.collect()


# You can also delete any other intermediate variables if needed:
# del train_data, val_data, test_data
# gc.collect()

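# Optionally, after creating DataLoader objects, delete unnecessary intermediate variables.
# Note: this only removes the names; the DataLoaders still hold references to the datasets,
# so the encoded data itself stays in memory.
del train_dataset, val_dataset, test_dataset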
gc.collect()

print("Memory cleaned up.")

torch.cuda.empty_cache()  # Clear GPU memory
gc.collect()  # Run garbage collection


Memory cleaned up.


0

## 2- Loading the Quantized Model  

To fine-tune large language models efficiently, it's crucial to manage memory and computational costs. This is where **BitsAndBytes** comes into play.  

### **Introduction to BitsAndBytes**  
**BitsAndBytes** (the `bitsandbytes` package) is a library designed to enable efficient training and inference of large-scale language models by using low-bit precision. Instead of relying on traditional 16-bit or 32-bit floating-point formats, BitsAndBytes supports **4-bit and 8-bit quantization**, significantly reducing memory usage, usually with only a small loss in model quality.  

Some key features of BitsAndBytes include:  
- **Quantized Model Loading**: Load pre-trained models in quantized formats (e.g., 4-bit) to save memory.  
- **Compatibility**: Integrates with popular frameworks like Hugging Face Transformers.  
- **Fine-Tuning Ready**: Well suited to adapting large language models using efficient fine-tuning methods like **QLoRA**.  

### **Why Use BitsAndBytes?**  
Quantization with BitsAndBytes allows us to:  
1. **Reduce Memory Footprint**: Handle large models like GPT-2-XL on smaller hardware setups.  
2. **Free Memory for Training**: The memory saved on weights can go to activations, longer sequences, or larger batches. Note that 4-bit weights are dequantized to the compute dtype for each matrix multiplication, so quantization by itself does not guarantee faster training.  
3. **Retain Performance**: Minimize the trade-off between accuracy and efficiency.  

In this section, we'll leverage BitsAndBytes to load a **4-bit quantized version** of GPT-2-XL, preparing it for the integration of QLoRA layers and fine-tuning. Let's dive into the process! 🚀

In [None]:
! pip install transformers accelerate bitsandbytes


Collecting bitsandbytes
  Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl (122.4 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.44.1


In [None]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

model_id = "gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(model_id)

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)


print("Model loaded successfully!")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
`low_cpu_mem_usage` was None, now default to True since model is quantized.


Model loaded successfully!


### **Understanding the BitsAndBytesConfig**

The `BitsAndBytesConfig` defines how the model is quantized when it is loaded. Here's a breakdown of the key components:

---

#### **Quantization Parameters**
1. **`load_in_4bit=True`**  
   - Specifies that the model weights should be loaded in **4-bit precision**.  
   - **Benefit**: Significantly reduces memory requirements compared to the usual 16-bit or 32-bit precision.

2. **`bnb_4bit_quant_type="nf4"`**  
   - Indicates the quantization type.  
   - **NF4 (NormalFloat 4)**:  
     - A 4-bit data type whose 16 levels sit at quantiles of a normal distribution, which matches how pre-trained weights are typically distributed.  
     - Usually gives lower quantization error than evenly spaced 4-bit levels.

3. **`bnb_4bit_use_double_quant=True`**  
   - Enables **double quantization**, which involves:  
     1. Quantizing the weights in blocks, which produces one scaling constant per block.  
     2. Quantizing those scaling constants as well.  
   - **Benefit**: Further reduces memory usage with little additional numerical degradation.

4. **`bnb_4bit_compute_dtype=torch.bfloat16`**  
   - Sets the compute precision to **bfloat16 (Brain Floating Point 16-bit)**.  
   - **Why bfloat16?**  
     - Balances precision and memory efficiency.  
     - Optimized for modern accelerators like GPUs and TPUs.

---

#### **Under the Hood: Benefits of This Configuration**
1. **Memory Efficiency**:  
   - Reducing weights to **4 bits** drastically decreases memory usage compared to standard **16/32-bit weights**.

2. **Speed**:  
   - Smaller weights mean faster data transfers between GPU/CPU and memory.

3. **Accuracy**:  
   - Using **NF4 quantization** retains higher accuracy compared to simpler schemes like uniform quantization.

4. **Compatibility**:  
   - Leveraging **bfloat16** for computations ensures compatibility with modern GPUs while minimizing numerical instability.

---

This configuration allows large language models like **GPT-2-XL** to be fine-tuned or used for inference efficiently, even on **consumer-grade hardware**.

## Now let's test our loaded model

In [None]:
# Verify model type and configuration
print("Model type:", type(model_nf4))
print("Model configuration:", model_nf4.config)

# Check the dtype of model parameters to confirm quantization
print("\nModel parameter data types:")
for name, param in model_nf4.named_parameters():
    print(f"{name}: {param.dtype}")

# Run a simple forward pass
test_input = tokenizer("Hello, world!", return_tensors="pt").input_ids

# Move to device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_nf4 = model_nf4.to(device)
test_input = test_input.to(device)

print("\nRunning a forward pass...")
with torch.no_grad():
    output = model_nf4.generate(test_input, max_length=10)
print("Model output:", tokenizer.decode(output[0], skip_special_tokens=True))

# Check memory usage
if torch.cuda.is_available():
    print("\nMemory allocated (MB):", torch.cuda.memory_allocated(device) / 1024**2)
else:
    print("\nRunning on CPU; memory usage check skipped.")


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Model type: <class 'transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel'>
Model configuration: GPT2Config {
  "_attn_implementation_autoset": true,
  "_name_or_path": "gpt2-xl",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 1600,
  "n_head": 25,
  "n_inner": null,
  "n_layer": 48,
  "n_positions": 1024,
  "output_past": true,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_quant_storage": "uint8",
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": true,
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_has_fp16_weight": false,
    "llm_int8_skip_modules": null,
    "llm_int8_threshold": 6.0,
    "load_in_4bit": true,
    "load_i

## Methods to Inject QLoRA Layers into the Model

We have three methods for injecting QLoRA layers into the model:

### 1. **Wrapper Class Approach**  
This involves copying the model’s source code from the Hugging Face implementation and creating a wrapper class on top of the original model. While not the most practical, this approach helps us understand what’s happening under the hood. We'll test it to gain insights, even though it's not ideal for larger-scale use.

### 2. **Compact Custom Implementation**  
This method involves designing a compact class, as we've been doing in this notebook. We inject the QLoRA layers into the model by looping through and identifying the appropriate layer for insertion. This approach is more efficient and scalable compared to the wrapper class.

### 3. **PEFT from Hugging Face** (Production-Ready)  
The most production-ready solution is the PEFT (Parameter-Efficient Fine-Tuning) library from Hugging Face. It has its own API to learn, but it is the best optimized of the three for production environments and integrates with the existing Hugging Face tools.
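The second approach, looping through the model and swapping in adapted layers, can be sketched on a toy network. Everything here is illustrative: `LoRALinear` and `inject_lora` are hypothetical names, the rank and alpha values are arbitrary, and the `isinstance(child, nn.Linear)` check is an assumption that fits this toy model. Hugging Face's GPT-2 implements its attention projections with a `Conv1D` module rather than `nn.Linear`, so the check would need adjusting for the unquantized model.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze the original layer
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

def inject_lora(module: nn.Module, rank: int = 4, alpha: float = 8.0) -> nn.Module:
    """Recursively replace every nn.Linear child with a LoRALinear wrapper."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, rank, alpha))
        else:
            inject_lora(child, rank, alpha)
    return module

torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
x = torch.randn(2, 16)

before = toy(x)
inject_lora(toy)
after = toy(x)

trainable = [n for n, p in toy.named_parameters() if p.requires_grad]
print("outputs unchanged after injection:", torch.allclose(before, after))
print("trainable parameters:", trainable)
```

Wrapping the layer instead of editing its weights keeps the original parameters untouched, which makes it easy to remove or swap adapters later.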



### 1. **Wrapper Class Approach**  

# Reimplementing LoRA in GPT Model

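The goal is to adapt the self-attention mechanism in a GPT model (such as `GPT2` from Hugging Face) to incorporate Low-Rank Adaptation (LoRA). We'll replace the original self-attention module with a new class `LoraGPT2SelfAttention` that adds LoRA matrices. The class and method names used below (`GPT2SelfAttention`, `query`, `value`, `transpose_for_scores`) are illustrative: in Hugging Face Transformers the GPT-2 attention module is `GPT2Attention`, which computes query, key, and value with a single fused `c_attn` projection, so the code should be read as a sketch of the idea rather than a drop-in patch.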

### Step 1: Define `LoraGPT2SelfAttention`

We extend the `GPT2SelfAttention` module to include LoRA matrices. These matrices will modify the query and value components while keeping the rest of the attention mechanism unchanged.

```python
class LoraGPT2SelfAttention(GPT2SelfAttention):
    """
    Extends GPT2SelfAttention with LoRA (Low-Rank Adaptation) matrices.
    LoRA enhances efficiency by only updating the query and value matrices.
    """
    def __init__(self, r=8, *args, **kwargs):
        super().__init__(*args, **kwargs)
        d = self.attn_head_size
        
        # Initialize LoRA matrices for query and value
        self.lora_query_matrix_B = nn.Parameter(torch.zeros(d, r))
        self.lora_query_matrix_A = nn.Parameter(torch.randn(r, d))
        self.lora_value_matrix_B = nn.Parameter(torch.zeros(d, r))
        self.lora_value_matrix_A = nn.Parameter(torch.randn(r, d))
    
    def lora_query(self, x):
        """
        Applies LoRA to the query component.
        """
        lora_query_weights = torch.matmul(self.lora_query_matrix_B, self.lora_query_matrix_A)
        return self.query(x) + F.linear(x, lora_query_weights)

    def lora_value(self, x):
        """
        Applies LoRA to the value component.
        """
        lora_value_weights = torch.matmul(self.lora_value_matrix_B, self.lora_value_matrix_A)
        return self.value(x) + F.linear(x, lora_value_weights)
```

### Step 2: Modify the `forward` Method

To apply the LoRA modifications, we overwrite the `forward` method of `GPT2SelfAttention`. We replace calls to the original `query` and `value` functions with our LoRA-enhanced versions.

```python
class LoraGPT2SelfAttention(GPT2SelfAttention):
    def forward(self, hidden_states, *args, **kwargs):
        # Use LoRA-enhanced query and value
        mixed_query_layer = self.lora_query(hidden_states)
        
        # Regular key layer, no LoRA
        key_layer = self.transpose_for_scores(self.key(hidden_states))
        
        # Use LoRA-enhanced value
        value_layer = self.transpose_for_scores(self.lora_value(hidden_states))
        
        # ... (rest of the forward code, unchanged)
```

### Step 3: Replace Attention Modules in GPT Model

We replace the `GPT2SelfAttention` module in the original GPT model with `LoraGPT2SelfAttention`. This ensures that all attention layers use the LoRA mechanism.

```python
class LoraWrapperGPT2(nn.Module):
    def __init__(self, task_type, num_classes=None, dropout_rate=0.1, model_id="gpt2",
                 lora_rank=8, train_biases=True, train_embedding=False, train_layer_norms=True):
        """
        A wrapper for GPT2 with Low-Rank Adaptation (LoRA) for various NLP tasks.
        """
        super().__init__()
        self.model_id = model_id
        self.lora_rank = lora_rank  # read by replace_multihead_attention_recursion()
        self.tokenizer = GPT2Tokenizer.from_pretrained(model_id)
        self.model = GPT2Model.from_pretrained(model_id)
        self.model_config = self.model.config

        # Add task-specific layer
        d_model = self.model_config.n_embd
        self.finetune_head_norm = nn.LayerNorm(d_model)
        self.finetune_head_dropout = nn.Dropout(dropout_rate)
        self.finetune_head_classifier = nn.Linear(d_model, num_classes)

        # Set up LoRA model for training
        self.replace_multihead_attention()
        self.freeze_parameters_except_lora_and_bias()

    def replace_multihead_attention(self):
        """
        Replaces all attention modules in GPT2 with LoRA-modified ones.
        """
        self.replace_multihead_attention_recursion(self.model)

    def replace_multihead_attention_recursion(self, model):
        """
        Recursively replaces GPT2SelfAttention with LoraGPT2SelfAttention in the model.
        """
        for name, module in model.named_children():
            if isinstance(module, GPT2SelfAttention):
                new_layer = LoraGPT2SelfAttention(r=self.lora_rank, config=self.model_config)
                new_layer.load_state_dict(module.state_dict(), strict=False)
                setattr(model, name, new_layer)
            else:
                self.replace_multihead_attention_recursion(module)
```

### Step 4: Fine-Tuning for NLP Tasks

The wrapper also allows fine-tuning for different NLP tasks (e.g., GLUE, SQuAD). The attention layers are replaced by LoRA-enabled ones, while ensuring that only LoRA parameters and select others (biases, layer norms) are trainable.

```python
class LoraWrapperGPT2(nn.Module):
    # ... (same as above)

    def freeze_parameters_except_lora_and_bias(self):
        """
        Freezes all model parameters except the LoRA matrices, the biases,
        and the task-specific classifier head.
        """
        for name, param in self.model.named_parameters():
            # Parameters carry no name of their own; the name comes from named_parameters()
            param.requires_grad = "lora" in name or name.endswith("bias")
        # The base GPT2Model has no lm_head, so only the fine-tuning head is unfrozen here
        for param in self.finetune_head_classifier.parameters():
            param.requires_grad = True
```

### Conclusion

We have successfully integrated LoRA into the GPT2 self-attention mechanism. This new model efficiently adapts query and value components during fine-tuning, while preserving the original model's parameters. The wrapper enables fine-tuning for specific tasks while using LoRA for efficient adaptation.
```
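
To make the mechanics concrete without depending on a particular attention class, here is a minimal, self-contained sketch of the same idea applied to a single frozen `nn.Linear`. The class name `TinyLoraLinear`, the layer sizes, and the small random scale for `A` are assumptions made for this illustration and are not part of the wrapper above. The check at the end confirms that zero-initializing `B` leaves the layer's output unchanged before any training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLoraLinear(nn.Module):
    """Hypothetical helper: a frozen linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weights stay fixed
        # B starts at zero, so B @ A is zero and the layer behaves like `base` at first
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)

    def forward(self, x):
        delta_w = self.lora_B @ self.lora_A  # shape: (out_features, in_features)
        return self.base(x) + F.linear(x, delta_w)

torch.manual_seed(0)
base = nn.Linear(16, 32)
layer = TinyLoraLinear(base, r=4)
x = torch.randn(2, 16)

same_at_init = torch.allclose(layer(x), base(x))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in layer.parameters() if not p.requires_grad)
print(f"Same output at init: {same_at_init}")
print(f"Trainable parameters: {trainable}, frozen parameters: {frozen}")
```

The trainable count is `r * (in_features + out_features)`, which stays small even when the frozen layer is large.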


### 2. **Compact Custom Implementation**  


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

def quantize_tensor(tensor, num_bits=8):
    """
    Simple symmetric uniform (simulated) quantization function.
    Scales tensor values to [-1, 1], rounds them onto a signed integer grid, and rescales.
    """
    scale = tensor.detach().abs().max().clamp(min=1e-8)  # Max value for scaling; avoids 0/0 for all-zero tensors (e.g. B at init)
    q_max = 2 ** (num_bits - 1) - 1  # Largest signed integer for the given bit width
    tensor_scaled = tensor / scale  # Scale tensor to the range [-1, 1]
    tensor_quantized = torch.round(tensor_scaled * q_max) / q_max  # Snap to the grid, back in the range [-1, 1]
    tensor_dequantized = tensor_quantized * scale  # Rescale to the original range
    # torch.round has zero gradient, so pass gradients straight through to `tensor`
    return tensor + (tensor_dequantized - tensor).detach()

class LoraLinear(nn.Linear):
    """
    Extends a PyTorch linear layer with Low-Rank Adaptation (LoRA) and double quantization.
    LoRA adds two matrices to the layer, allowing for efficient training of large models.
    """
    def __init__(self, in_features, out_features, r=8, num_bits=8, *args, **kwargs):
        super().__init__(in_features, out_features, *args, **kwargs)

        # Initialize LoRA matrices
        self.lora_matrix_B = nn.Parameter(torch.zeros(out_features, r))  # B matrix of shape (out_features, r)
        self.lora_matrix_A = nn.Parameter(torch.randn(r, in_features))  # A matrix of shape (r, in_features)

        # Store the quantization bit-width for later use
        self.num_bits = num_bits

        # Freeze the original weight matrix (no gradients)
        self.weight.requires_grad = False

    def forward(self, x):
        # Quantize A and B matrices
        qA = quantize_tensor(self.lora_matrix_A, num_bits=self.num_bits)
        qB = quantize_tensor(self.lora_matrix_B, num_bits=self.num_bits)

        # Compute LoRA weight adjustment by multiplying quantized B and A
        lora_weights = torch.matmul(qB, qA)

        # Quantize the result of the multiplication
        quantized_lora_weights = quantize_tensor(lora_weights, num_bits=self.num_bits)

        # Apply the original and LoRA-adjusted linear transformations
        return super().forward(x) + F.linear(x, quantized_lora_weights)  # x @ W + x @ LoRA_weights


## Key Points:
- **LoRA Mechanism:** Only a small subset of parameters (the LoRA matrices `lora_matrix_A` and `lora_matrix_B`) is updated during training, which makes the process efficient.
- **Quantization:** In this compact implementation, the LoRA matrices and their product are passed through a simulated quantizer in the forward pass. In QLoRA proper, it is the frozen base weights that are stored in 4-bit precision to reduce memory usage, while the adapters stay in higher precision.
- **Freezing Original Weights:** The original weights are frozen, ensuring that only the low-rank components are fine-tuned.

This implementation is beneficial in transfer learning scenarios, where a pretrained model can be adapted to a new task without retraining all the parameters. Instead, the low-rank matrices efficiently modify the behavior of the pretrained model.
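
As a quick illustration of what the bit width controls, the sketch below applies symmetric uniform quantization to a plain Python list and measures the worst-case rounding error. The helper names `fake_quantize` and `max_abs_error` and the sample values are assumptions made for this example. It is a simplified stand-in, not the NF4 scheme used later in the notebook.

```python
def fake_quantize(values, num_bits):
    """Round each value onto a symmetric grid with 2**(num_bits - 1) - 1 positive levels."""
    scale = max(abs(v) for v in values) or 1.0  # fall back to 1.0 for an all-zero input
    q_max = 2 ** (num_bits - 1) - 1
    return [round(v / scale * q_max) / q_max * scale for v in values]

def max_abs_error(values, num_bits):
    """Largest absolute difference between the original and the quantized values."""
    quantized = fake_quantize(values, num_bits)
    return max(abs(v - q) for v, q in zip(values, quantized))

weights = [0.8, -0.35, 0.02, 0.5, -1.0, 0.11]  # toy values, not real model weights
err_8bit = max_abs_error(weights, 8)
err_4bit = max_abs_error(weights, 4)

print(f"8-bit max error: {err_8bit:.4f}")
print(f"4-bit max error: {err_4bit:.4f}")
```

The error is bounded by half a grid step, `scale / (2 * q_max)`, so every bit removed roughly doubles the worst-case error.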

Now let's replace the old layers with QLoRA layers as follows:

In [None]:
def replace_with_qlora(model, rank=8, num_bits=4):
    def to_lora(layer):
        # Only swap layers that are (subclasses of) a standard Linear layer
        if not isinstance(layer, torch.nn.Linear):
            return layer
        new_layer = LoraLinear(layer.in_features, layer.out_features, r=rank, num_bits=num_bits)
        # A new LoraLinear starts with randomly initialized base weights, so carry the
        # pretrained values over when they are stored in the same shape. Packed 4-bit
        # weights have a different shape and would need to be dequantized first.
        if layer.weight.shape == new_layer.weight.shape:
            new_layer.weight.data.copy_(layer.weight.data)
        if layer.bias is not None:
            new_layer.bias.data.copy_(layer.bias.data)
        return new_layer

    # Iterate over each GPT2Block inside the transformer (model.transformer is a GPT2Model)
    for block in model.transformer.h:  # transformer.h is a ModuleList of GPT2Block modules
        # Attention projections (GPT2SdpaAttention)
        block.attn.c_attn = to_lora(block.attn.c_attn)
        block.attn.c_proj = to_lora(block.attn.c_proj)
        # Feed-forward projections (GPT2MLP)
        block.mlp.c_fc = to_lora(block.mlp.c_fc)
        block.mlp.c_proj = to_lora(block.mlp.c_proj)

    return model

Let's apply the function and count the parameters after the injection:

In [None]:
model_nf4_lora = replace_with_qlora(model_nf4, rank=8, num_bits=4)


def count_params(model):
    # Total parameters in the model (includes LoRA and original weights)
    total_params = sum(p.numel() for p in model.parameters())
    return total_params

def count_lora_params(model):
    lora_params = 0
    # Iterate through the layers in the model to find LoRA parameters
    for name, param in model.named_parameters():
        # Check if the parameter is part of a LoraLinear layer (the LoRA layers)
        if 'lora_matrix_A' in name or 'lora_matrix_B' in name:
            lora_params += param.numel()
    return lora_params

# Assuming `model_nf4_lora` is the model with LoRA applied
# Calculate total number of parameters in the model
total_params = count_params(model_nf4_lora)

# Calculate total LoRA parameters (A and B matrices)
lora_params = count_lora_params(model_nf4_lora)

# Calculate percentage of LoRA parameters
lora_percentage = (lora_params / total_params) * 100

print(f"Total Parameters: {total_params}")
print(f"LoRA Parameters: {lora_params}")
print(f"Percentage of LoRA Parameters: {lora_percentage:.2f}%")

print(model_nf4_lora)

Total Parameters: 1567441600
LoRA Parameters: 9830400
Percentage of LoRA Parameters: 0.63%
GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1600)
    (wpe): Embedding(1024, 1600)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-47): 48 x GPT2Block(
        (ln_1): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): LoraLinear(in_features=1600, out_features=4800, bias=True)
          (c_proj): LoraLinear(in_features=1600, out_features=1600, bias=True)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): LoraLinear(in_features=1600, out_features=6400, bias=True)
          (c_proj): LoraLinear(in_features=6400, out_features=1600, bias=True)
          (act): NewGELUActivation()
          (dropout): Dropout(
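
The LoRA count reported above can be reproduced by hand: each adapted layer adds `r * (in_features + out_features)` parameters. The short check below uses the layer shapes from the printed model and `rank=8`. The helper name `lora_params_per_layer` is introduced only for this example.

```python
def lora_params_per_layer(in_features, out_features, r):
    # A has shape (r, in_features) and B has shape (out_features, r)
    return r * (in_features + out_features)

rank = 8
num_blocks = 48
# (in_features, out_features) of the four LoraLinear layers in each GPT2Block
layer_shapes = [(1600, 4800), (1600, 1600), (1600, 6400), (6400, 1600)]

per_block = sum(lora_params_per_layer(i, o, rank) for i, o in layer_shapes)
lora_total = per_block * num_blocks
total_params = 1_567_441_600  # total reported by count_params above
share_percent = lora_total / total_params * 100

print(f"LoRA parameters per block: {per_block}")
print(f"LoRA parameters overall:   {lora_total}")
print(f"Share of all parameters:   {share_percent:.2f}%")
```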

Freezing the model graph except for the LoRA parameters:

In [None]:
def freeze_all_except_lora(model):
    # Iterate through all the modules in the model
    for name, param in model.named_parameters():
        # Check if the parameter belongs to a LoraLinear layer
        if isinstance(param, nn.Parameter) and 'lora_matrix' in name:
            # Don't freeze LoRA parameters
            param.requires_grad = True
        else:
            # Freeze other parameters
            param.requires_grad = False

# Apply this function to the model
freeze_all_except_lora(model_nf4_lora)

def check_freezing_status(model):
    for name, param in model.named_parameters():
        print(f"{name}: requires_grad={param.requires_grad}")

check_freezing_status(model_nf4_lora)


transformer.wte.weight: requires_grad=False
transformer.wpe.weight: requires_grad=False
transformer.h.0.ln_1.weight: requires_grad=False
transformer.h.0.ln_1.bias: requires_grad=False
transformer.h.0.attn.c_attn.weight: requires_grad=False
transformer.h.0.attn.c_attn.bias: requires_grad=False
transformer.h.0.attn.c_attn.lora_matrix_B: requires_grad=True
transformer.h.0.attn.c_attn.lora_matrix_A: requires_grad=True
transformer.h.0.attn.c_proj.weight: requires_grad=False
transformer.h.0.attn.c_proj.bias: requires_grad=False
transformer.h.0.attn.c_proj.lora_matrix_B: requires_grad=True
transformer.h.0.attn.c_proj.lora_matrix_A: requires_grad=True
transformer.h.0.ln_2.weight: requires_grad=False
transformer.h.0.ln_2.bias: requires_grad=False
transformer.h.0.mlp.c_fc.weight: requires_grad=False
transformer.h.0.mlp.c_fc.bias: requires_grad=False
transformer.h.0.mlp.c_fc.lora_matrix_B: requires_grad=True
transformer.h.0.mlp.c_fc.lora_matrix_A: requires_grad=True
transformer.h.0.mlp.c_proj.wei
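
Once everything except the LoRA matrices is frozen, the optimizer only needs the parameters that still require gradients. The sketch below shows this on a toy module. `ToyLora`, its sizes, and the learning rate are assumptions for illustration, and the freezing rule mirrors the name-based check used above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLora(nn.Linear):
    """Toy stand-in for a LoRA layer, using the same parameter names as LoraLinear."""
    def __init__(self, in_features, out_features, r=2):
        super().__init__(in_features, out_features)
        self.lora_matrix_B = nn.Parameter(torch.zeros(out_features, r))
        self.lora_matrix_A = nn.Parameter(torch.randn(r, in_features))

    def forward(self, x):
        return super().forward(x) + F.linear(x, self.lora_matrix_B @ self.lora_matrix_A)

torch.manual_seed(0)
model = nn.Sequential(ToyLora(8, 8), nn.Linear(8, 1))

# Freeze every parameter whose name does not contain "lora_matrix"
for name, param in model.named_parameters():
    param.requires_grad = "lora_matrix" in name

# Hand only the trainable parameters to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-2)

base_before = model[0].weight.clone()
lora_b_before = model[0].lora_matrix_B.clone()

loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()
optimizer.step()

base_unchanged = torch.equal(model[0].weight, base_before)
lora_b_updated = not torch.equal(model[0].lora_matrix_B, lora_b_before)
print(f"Base weight unchanged: {base_unchanged}")
print(f"LoRA B updated: {lora_b_updated}")
```

Passing only the trainable parameters also keeps the optimizer state (such as Adam's moment estimates) proportional to the adapter size rather than to the full model.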

Again, let's delete the variables we don't need anymore. We don't want the kernel to crash and lose all the progress we've made (trust me, I learned this the hard way).

In [None]:
import gc

from accelerate import Accelerator

# Select the target device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Place the LoRA model on the selected device
model_nf4_lora.to(device)

# Free memory by deleting names that are no longer needed.
# model_nf4_lora is kept because it is the model we are about to train.
del model_nf4, total_params, lora_params, lora_percentage

# Manually clean up unused variables
gc.collect()  # Run garbage collection to release memory

# Check memory usage on the current device (GPU if available, otherwise CPU)
if torch.cuda.is_available():
    print("\nMemory allocated (MB):", torch.cuda.memory_allocated() / 1024**2)
else:
    print("\nRunning on CPU; memory usage check skipped.")

# Now the model_nf4_lora is ready to be trained using accelerate
print("Model ready for training with accelerate!")





Memory allocated (MB): 6013.6416015625
Model ready for training with accelerate!
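
The figure reported above is close to a simple back-of-the-envelope estimate: parameter count times bytes per parameter. The sketch below uses the total from the earlier count and compares 32-bit storage with an idealized 4-bit layout. It ignores activations, buffers, and quantization metadata, so treat the numbers as rough estimates, and note that `weights_mib` is a helper introduced only for this example.

```python
def weights_mib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in MiB."""
    return num_params * bits_per_param / 8 / 1024**2

total_params = 1_567_441_600  # total reported by count_params earlier

fp32_mib = weights_mib(total_params, 32)
four_bit_mib = weights_mib(total_params, 4)

print(f"fp32 weights:  {fp32_mib:,.0f} MiB")
print(f"4-bit weights: {four_bit_mib:,.0f} MiB")
```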


In [None]:
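import os
import torch
# "!" runs in a subshell and cannot change this kernel's environment, so set the variable from Python
# (it takes effect only if it is set before PyTorch initializes its CUDA allocator)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
torch.cuda.empty_cache()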


### 3. **PEFT from Hugging Face** (Production-Ready)  

# The Five Commandments of Low-Rank Adaptation

1. **Utilize LoRA**  
   Leverage LoRA for efficient model fine-tuning, focusing on keeping parameter sizes minimal.

2. **Employ the PEFT Library**  
   Use the PEFT library for LoRA implementation, avoiding the need for complex coding.

3. **Extend LoRA Adaptations**  
   Apply LoRA to all linear layers to enhance the overall model capabilities.

4. **Keep Biases and Layer Norms Trainable**  
   Maintain biases and layer norms as trainable parameters since they are critical for model adaptability and don’t require low-rank adaptations.

5. **Apply Quantized-LoRA (QLoRA)**  
   Use QLoRA to conserve GPU VRAM and enable the training of larger models efficiently.

Hugging Face and PEFT (Parameter-Efficient Fine-Tuning) have revolutionized fine-tuning by making it simpler and more accessible. With libraries like Transformers and PEFT, users can quickly adapt large models using techniques like LoRA without deep dives into complex code. These tools focus on efficiency, enabling fine-tuning with minimal compute resources and parameter updates.

However, in alignment with our motto *"Understand it before you use it,"* we take a hands-on approach. By building and experimenting with alternative methods from scratch, we deepen our understanding of how fine-tuning truly works, empowering us to innovate beyond pre-built solutions.

Now let's walk through the whole project step by step:

In [None]:
!pip install peft datasets


## 1- Load the Quantized model using BitsandBytes  

In [None]:
from transformers import AutoModelForCausalLM , BitsAndBytesConfig , AutoTokenizer

model_id = "gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(model_id)

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)


print("Model loaded successfully!")



Model loaded successfully!
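
The 4-bit loading above relies on quantizing weights in small blocks, each with its own scaling constant, so that a single outlier does not degrade the precision of the whole tensor. The sketch below illustrates that blockwise idea in plain Python with a uniform 4-bit grid. It is a simplified stand-in under stated assumptions (bitsandbytes uses the NF4 codebook and optimized kernels), and `quantize_blockwise`, the block size, and the toy values are chosen only for this example.

```python
def quantize_blockwise(values, block_size=4, num_bits=4):
    """Quantize each block with its own absmax scale and return the dequantized values."""
    q_max = 2 ** (num_bits - 1) - 1
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) or 1.0  # one scaling constant per block
        out.extend(round(v / scale * q_max) / q_max * scale for v in block)
    return out

# The first block contains an outlier (50.0); the second block holds only small values
weights = [0.10, -0.20, 50.0, 0.05, 0.10, -0.20, 0.30, 0.05]

per_block = quantize_blockwise(weights, block_size=4)
whole_tensor = quantize_blockwise(weights, block_size=len(weights))

print("per-block scales:", [round(v, 3) for v in per_block])
print("single scale:    ", [round(v, 3) for v in whole_tensor])
```

With a single scale, the outlier stretches the grid so far that every small value rounds to zero. With per-block scales, the second block keeps its own resolution.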


In [None]:
print(model_nf4)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1600)
    (wpe): Embedding(1024, 1600)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-47): 48 x GPT2Block(
        (ln_1): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Linear4bit(in_features=1600, out_features=4800, bias=True)
          (c_proj): Linear4bit(in_features=1600, out_features=1600, bias=True)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Linear4bit(in_features=1600, out_features=6400, bias=True)
          (c_proj): Linear4bit(in_features=6400, out_features=1600, bias=True)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1600,), eps=1e-05, ele

## 2-Preparing the Dataset

Dataset preparation relies on a custom `InstructionDataset` class. We compare two versions of this class below: they achieve similar goals but are designed for distinct workflows. Here's an expanded explanation of the differences:

---

### **1. First Class: Suitable for Custom PyTorch Training Loop**
#### **Key Characteristics:**
- Pre-encodes all data during initialization:
  - Combines input (`instruction`, `input`) and output (`output`) into a single text format.
  - Uses `tokenizer.encode()` to preprocess and store the tokenized data upfront.

#### **Advantages:**
- **Preprocessed Efficiency:**
  - Tokenization is completed once during initialization, saving time during training epochs.
- **Simple Output Structure:**
  - The `__getitem__` method returns pre-tokenized sequences as plain tensors. This works well when you control the training loop and manage batching manually.

#### **Limitations:**
- **Lacks Flexibility for Adjustments:**
  - Hardcoded tokenization settings (e.g., no dynamic padding or truncation). Changes like modifying `max_length` require reinitializing the dataset.
- **No Attention Masks:**
  - Hugging Face models rely on `attention_mask` to distinguish padding tokens from meaningful input. This is missing, potentially causing issues during training.
- **Not Hugging Face Compatible:**
  - Hugging Face's `Trainer` expects inputs like `input_ids`, `attention_mask`, and `labels` in a dictionary format. Additional preprocessing would be required to use this dataset with the `Trainer`.

---

### **2. Second Class: Flexible and Hugging Face Trainer Compatible**
#### **Key Characteristics:**
- Dynamically tokenizes data during `__getitem__`:
  - Separately tokenizes input (`instruction + input`) and target (`output`).
  - Adds padding and truncation dynamically based on `max_length`.
  - Returns `input_ids`, `attention_mask`, and `labels` as a dictionary.

#### **Advantages:**
- **Dynamic Tokenization:**
  - Allows adjustments to tokenization parameters like `max_length` without reinitializing the dataset.
  - Suitable for a variety of tasks, including those requiring sequence-to-sequence modeling or causal language modeling.
- **Attention Mask Support:**
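  - Generates `attention_mask` so that padding tokens are ignored by the attention layers, which is crucial for models like GPT-2. Padded positions in `labels` still count toward the loss unless they are set to `-100`, the index that PyTorch's cross-entropy loss ignores by default.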
- **Hugging Face `Trainer` Ready:**
  - Directly provides inputs (`input_ids`, `attention_mask`) and outputs (`labels`) in the format expected by Hugging Face models and `Trainer`.

#### **Limitations:**
- **Tokenization Overhead:**
  - Repeated tokenization during each `__getitem__` call adds some computational overhead compared to pre-encoded data.

---

### **When to Use Each Class**
#### **First Class:**
- Use it when:
  - You're implementing a custom PyTorch training loop.
  - You want to pre-tokenize data for efficiency and control batching manually.
  - The task doesn't require attention masks or specific input-output separation.

#### **Second Class:**
- Use it when:
  - You're working with Hugging Face's ecosystem (e.g., `Trainer`, `DataCollator`).
  - You need flexibility in handling different tokenization parameters (e.g., padding, truncation).
  - Your task involves sequence-to-sequence learning or other NLP tasks requiring explicit `input_ids`, `attention_mask`, and `labels`.

---

### **Technical Comparison**
| Feature                          | First Class                       | Second Class                       |
|-----------------------------------|------------------------------------|-------------------------------------|
| **Tokenization**                  | Pre-encoded during initialization | Dynamic during `__getitem__`       |
| **Attention Mask**                | Not included                      | Included                           |
| **Output Format**                 | Single tensor                     | Dictionary with `input_ids`, `attention_mask`, and `labels` |
| **Adjustable `max_length`**       | Requires reinitialization          | Dynamic                            |
| **Hugging Face Compatibility**    | No                                | Yes                                |
| **Efficiency**                    | Faster during training             | More flexible, but slightly slower |

---

In summary, the **first class** is ideal for a **custom PyTorch training loop**, while the **second class** is better suited for **Hugging Face's Trainer** and tasks requiring greater flexibility and model compatibility.

In [None]:
from torch.utils.data import Dataset
import torch

class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=1024):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]

        # Combine 'instruction' and 'input' for the input text
        input_text = item['instruction']
        if item['input']:
            input_text += f"\n{item['input']}"

        # Output is the 'output' key
        target_text = item['output']

        # Tokenize input and target text
        inputs = self.tokenizer(input_text, truncation=True, padding='max_length', max_length=self.max_length, return_tensors="pt")
        targets = self.tokenizer(target_text, truncation=True, padding='max_length', max_length=self.max_length, return_tensors="pt")

        return {
            'input_ids': inputs['input_ids'].squeeze(),
            'attention_mask': inputs['attention_mask'].squeeze(),
            'labels': targets['input_ids'].squeeze()
        }
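
For causal language modeling, a common alternative is to place the prompt and the response in one sequence and mask the positions that should not contribute to the loss with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below works on plain lists of token IDs so that it runs without a tokenizer. `build_causal_lm_example` and the toy IDs are assumptions for illustration and are not the class used in this notebook.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_causal_lm_example(prompt_ids, response_ids, max_length, pad_id, eos_id):
    """Concatenate prompt and response, pad to max_length, and mask non-response labels."""
    ids = (prompt_ids + response_ids + [eos_id])[:max_length]
    num_real = len(ids)
    num_prompt = min(len(prompt_ids), num_real)
    num_pad = max_length - num_real

    input_ids = ids + [pad_id] * num_pad
    attention_mask = [1] * num_real + [0] * num_pad
    # Only response tokens (and the closing EOS) are scored; prompt and padding are ignored
    labels = [IGNORE_INDEX] * num_prompt + ids[num_prompt:] + [IGNORE_INDEX] * num_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

example = build_causal_lm_example(
    prompt_ids=[11, 12, 13],   # toy IDs standing in for a tokenized instruction
    response_ids=[21, 22],     # toy IDs standing in for a tokenized output
    max_length=8,
    pad_id=50256,
    eos_id=50256,
)
for key, value in example.items():
    print(key, value)
```

Because `labels` lines up position by position with `input_ids`, this layout matches what Hugging Face causal language models expect: they shift the labels by one position internally when computing the loss.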


## Load the GPT-2 Tokenizer and Set a Padding Token

To prepare our tokenizer for the Hugging Face Trainer, we load the GPT-2 tokenizer and give it a padding token. GPT-2 does not define one by default, so we reuse the EOS (End of Sequence) token, which marks the end of a sequence, for padding as well. This step is **crucial** here because the dataset pads every example to `max_length`.

The tokenizer is then ready to be used with the Hugging Face Trainer, ensuring compatibility and proper handling of sequences.


In [None]:
from transformers import AutoTokenizer

# Instantiate the tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Set padding token explicitly
tokenizer.pad_token = tokenizer.eos_token

# Verify the padding token
print(f"Padding token set to: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")

# Prepare the dataset
train_dataset = InstructionDataset(train_data, tokenizer)
val_dataset = InstructionDataset(val_data, tokenizer)

# Test a sample
for i in range(3):
    print(train_dataset[i])  # Inspect the structure of the tokenized data




Padding token set to: <|endoftext|> (ID: 50256)
{'input_ids': tensor([ 2061,   318,   262,  ..., 50256, 50256, 50256]), 'attention_mask': tensor([1, 1, 1,  ..., 0, 0, 0]), 'labels': tensor([  464, 14069,  6264,  ..., 50256, 50256, 50256])}
{'input_ids': tensor([ 4550,   530,  1627,  ..., 50256, 50256, 50256]), 'attention_mask': tensor([1, 1, 1,  ..., 0, 0, 0]), 'labels': tensor([ 4299,  1720, 33529,  ..., 50256, 50256, 50256])}
{'input_ids': tensor([30003,  6525,   262,  ..., 50256, 50256, 50256]), 'attention_mask': tensor([1, 1, 1,  ..., 0, 0, 0]), 'labels': tensor([15363,    70,  9215,  ..., 50256, 50256, 50256])}


## 3-Configuring LoRA (Low-Rank Adaptation)

In this step, we configure the LoRA settings using the `LoraConfig` from the PEFT library. LoRA is a technique that adds low-rank updates to specific parts of a neural network, reducing the number of parameters to be fine-tuned while still achieving effective adaptation. Here's a breakdown of the configuration options:

- **r**: The rank of the low-rank update matrices. Smaller values mean fewer trainable parameters.
- **lora_alpha**: A scaling factor for the low-rank update; the update is multiplied by `lora_alpha / r`.
- **target_modules**: Specifies which modules LoRA is applied to. We target the fused attention projection (`c_attn`), the output projections (`c_proj`, a name that matches both the attention and the MLP projection), and the first MLP layer (`c_fc`).
- **lora_dropout**: A dropout rate applied to the LoRA layers to prevent overfitting.
- **bias**: We specify `none` to keep all bias terms frozen.
- **task_type**: Defines the task we're targeting. In this case, we set it to `CAUSAL_LM` for causal language modeling.
- **modules_to_save**: Lists modules (here the final LayerNorm and the language modeling head) that receive no LoRA adapters but are trained in full and saved together with the adapter checkpoint.

After configuring the settings, a confirmation message is printed.
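
To see how `r` and `lora_alpha` interact, it helps to write out the adapted layer. Assuming PEFT's default scaling rule (`lora_alpha / r`, without rank-stabilized scaling), a layer with frozen weight $W_0$ computes:

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\; A \in \mathbb{R}^{r \times d_{\text{in}}}
$$

With the values used here, $\alpha / r = 16 / 8 = 2$, and each adapted layer adds $r\,(d_{\text{in}} + d_{\text{out}})$ trainable parameters, for example $8 \times (1600 + 4800) = 51{,}200$ for `c_attn`.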



In [None]:
from peft import LoraConfig, get_peft_model

# Define the LoRA configuration
lora_config = LoraConfig(
    r=8,                              # Low-rank dimension
    lora_alpha=16,                    # Scaling factor
    target_modules=["c_attn", "c_proj", "c_fc"],  # Targeted modules
    lora_dropout=0.1,                 # Dropout rate for LoRA layers
    bias="none",                      # Leave biases untouched
    task_type="CAUSAL_LM",            # Task type
    modules_to_save=["ln_f", "lm_head"]  # Train these in full (no LoRA) and save them with the adapter
)

# Confirmation message
print("LoRA config created!")

LoRA config created!


Apply the config to the model

In [None]:
model_nf4_lora = get_peft_model(model_nf4, lora_config)

print("LoRA applied to the quantized model!")


LoRA applied to the quantized model!


## 4-Configuring Training Arguments

In this step, we configure the training arguments using the `TrainingArguments` from the Hugging Face `transformers` library. These settings control various aspects of the training process, such as where to save the model, how often to evaluate, and the optimization parameters. Here's an explanation of each parameter:

- **output_dir**: Directory where model checkpoints will be saved.
- **evaluation_strategy**: Specifies when to perform evaluation. Here, we evaluate the model at the end of each epoch. Recent versions of `transformers` rename this argument to `eval_strategy`.
- **save_strategy**: Defines when to save the model. In this case, we save the model after each epoch.
- **learning_rate**: The learning rate for the optimizer, which controls the step size during optimization.
- **per_device_train_batch_size**: The batch size used during training on each device.
- **per_device_eval_batch_size**: The batch size used during evaluation on each device.
- **num_train_epochs**: Number of epochs to train the model.
- **weight_decay**: Regularization parameter that helps prevent overfitting.
- **logging_dir**: Directory where training logs will be saved.
- **logging_steps**: Defines how frequently the metrics should be logged (every 10 steps in this case).
- **load_best_model_at_end**: Ensures the best model (based on evaluation) is loaded at the end of training.

Once configured, these arguments will be used to control the training process.
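
These settings also determine how long training runs. The sketch below estimates the number of optimizer steps from the batch size and epoch count. The helper `total_training_steps` is introduced for this example, and the 50,000-example figure is an assumption taken from the dataset size in the notebook title (the actual training split may be smaller).

```python
import math

def total_training_steps(num_examples, per_device_batch_size, num_epochs,
                         num_devices=1, grad_accum_steps=1):
    """Optimizer steps for a run, assuming the last partial batch of each epoch is kept."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs

steps = total_training_steps(num_examples=50_000, per_device_batch_size=2, num_epochs=5)
log_entries = steps // 10  # one log entry every logging_steps=10 steps

print(f"Optimizer steps: {steps}")
print(f"Logged entries:  {log_entries}")
```

A small per-device batch size keeps memory use low but multiplies the number of steps, which is worth keeping in mind when choosing `num_train_epochs` and `logging_steps`.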



In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",           # Directory for saving model checkpoints
    evaluation_strategy="epoch",     # Evaluate at the end of each epoch
    save_strategy="epoch",           # Save model at the end of each epoch
    learning_rate=2e-5,              # Learning rate for the optimizer
    per_device_train_batch_size=2,   # Batch size for training
    per_device_eval_batch_size=2,    # Batch size for evaluation
    num_train_epochs=5,              # Number of training epochs
    weight_decay=0.01,               # Weight decay for regularization
    logging_dir="./logs",            # Directory for logs
    logging_steps=10,                # Log metrics every 10 steps
    load_best_model_at_end=True,     # Load the best model after training
)




Setting up the trainer with the training arguments:

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model_nf4_lora,                # PEFT fine-tuned model
    args=training_args,              # Training arguments
    train_dataset=train_dataset,     # Training dataset
    eval_dataset=val_dataset,        # Validation dataset
)


## And Finally, We TRAIN!!! 🚀

See how simple and convenient it is to set up training with Hugging Face's Transformers and PEFT! But now, the difference is—you have a **pretty good understanding** of what's going on **under the hood**. You've configured the tokenizer, LoRA, and training arguments, giving you full control over your training process.

What's even more exciting is that with this knowledge, you can **appreciate** the **amazing engineering** happening behind the scenes in these open-source libraries, especially in **Hugging Face**, which is truly pushing the **state-of-the-art (SOTA)** in the field of natural language processing.




In [None]:
trainer.train()


Epoch,Training Loss,Validation Loss
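
The `wandb` prompts in the output above appear because, in Transformers versions where `report_to` defaults to all installed integrations, `Trainer` reports to Weights & Biases whenever the `wandb` package is installed. If you would rather train without logging in, two options are sketched below. Both assume the settings are applied before `TrainingArguments` and `Trainer` are created, and `extra_training_kwargs` is a hypothetical name for keyword arguments you would pass alongside the ones shown earlier.

```python
import os

# Option 1: keep wandb installed but record runs locally, without the login prompt.
# WANDB_MODE is read by the wandb client; "offline" stores runs on disk only.
os.environ["WANDB_MODE"] = "offline"

# Option 2: tell Trainer not to report to any experiment tracker at all.
# Assumption: these kwargs are merged into the TrainingArguments(...) call.
extra_training_kwargs = {"report_to": "none"}

print(os.environ["WANDB_MODE"], extra_training_kwargs)
```

Either option alone should be enough. Option 2 is the more explicit choice if you do not plan to use an experiment tracker.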




## Congratulations on Completing This Milestone! 🎉

What an incredible achievement! 🎉 You've successfully completed your **third notebook on fine-tuning**, and that’s no small feat! Fine-tuning is one of the most crucial weapons in your **LLM arsenal**, and by mastering LoRA and QLoRA, you've unlocked new levels of efficiency in adapting large language models. You've come a long way, and your journey in the world of deep learning continues to be both thrilling and rewarding.

### Summary: What We Covered

In this notebook, we explored **LoRA** (Low-Rank Adaptation) and **QLoRA** (Quantized LoRA) in detail, understanding how these techniques make the fine-tuning process more efficient while retaining the model's performance. Here’s a recap of what we accomplished:

1. **Understanding LoRA and QLoRA**:  
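   We started by diving deep into LoRA, which injects trainable low-rank update matrices into a frozen pre-trained model, and QLoRA, which applies the same idea on top of a quantized base model. Both significantly reduce the number of parameters that need to be trained, and QLoRA also cuts the memory needed to hold the base weights.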

2. **Applying LoRA to the Alpaca Dataset**:  
   We applied LoRA and QLoRA to fine-tune GPT-2-XL on the **Alpaca dataset**, training only the added low-rank parameters instead of the full model.

3. **Three Methods of Implementation**:  
   - **Manual Model Modifications**:
     We manually changed the model's source code to incorporate LoRA and QLoRA, understanding the mechanics of the process.  
   - **Using a Compact Class**:  
     We built a compact class to facilitate the injection of LoRA and QLoRA, streamlining the process.  
   - **Production-Ready Approach**:  
     Finally, we used **Hugging Face Transformers**, **bitsandbytes**, and **PEFT** for a more production-oriented, scalable approach to fine-tuning.
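
To put rough numbers on the savings recapped above, here is a small back-of-the-envelope sketch. It uses GPT-2-XL's hidden size of 1600 and its fused attention projection (`c_attn`, 1600 × 4800). The rank `r = 8` is an assumed value for illustration, not necessarily the one used in this notebook, and the memory figures ignore quantization constants and any layers left unquantized.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out weight: A (d_in x r) plus B (r x d_out)."""
    return rank * (d_in + d_out)


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Raw storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


D_MODEL = 1600          # GPT-2-XL hidden size
D_QKV = 3 * D_MODEL     # fused query/key/value projection (c_attn)
RANK = 8                # assumption: illustrative LoRA rank

full = D_MODEL * D_QKV
lora = lora_params(D_MODEL, D_QKV, RANK)
print(f"full c_attn weight: {full:,} params")
print(f"LoRA adapter:       {lora:,} params ({100 * lora / full:.2f}% of full)")

# Assumption: ~1.5e9 parameters, as in the notebook title.
for bits in (16, 4):
    print(f"{bits}-bit base weights: ~{weight_memory_gb(1.5e9, bits):.2f} GB")
```

Under these assumptions, the adapter for one attention projection is well under 1% of the size of the weight it adapts, which is where LoRA's efficiency comes from. The 4-bit storage of the frozen base weights is the extra saving QLoRA adds.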

### What’s Next?

We’ve come a long way, and I want to thank you for riding along on this journey with me! 🚀 But we still have several key aspects of LLMs to address. Here's a sneak peek at what’s to come:

- **Inference**:  
  Although we’ve done some inference on our trained models and even built a simple UI for it, it’s time to dive into **optimization techniques** that make inference faster and less resource-hungry at deployment time.

- **Models**:  
  So far, we’ve focused on **GPT-like architectures**, but other important models such as **LLaMA** and **BERT** remain to be covered. Don't worry: it will be much easier now that you have a solid grasp of LLM concepts.

- **RAG (Retrieval-Augmented Generation)**:  
  I’m considering diving into this topic next. It’s more of an **engineering solution** than pure deep learning, but it’s a critical part of modern LLM applications. Stay tuned!

I’m excited to keep progressing with you. The world of LLMs is vast, and we’ve only scratched the surface. Let’s keep pushing forward! 💡
