<h1 align="center" style="color:green;font-size: 3em;">Homework 2:
Implementing Fine-tuning Techniques</h1>



# Part 1: Introduction

In this homework, you will implement various fine-tuning methods as described in different papers, specifically LoRA and IA3, and answer some conceptual questions about these techniques. Additionally, you will get an introduction to Hugging Face, a platform offering a wide range of models and datasets.

**Instructions:**
- Follow the notebook sections to implement various fine-tuning techniques.
- Complete the code cells marked with `TODO`.
- Ensure your code runs correctly by the end of the notebook.

# Part 2: Import Libraries

In [2]:
!pip install datasets -q


In [3]:

# importing required libraries
import torch
import torch.nn as nn
import collections
import random
import numpy as np
import math
import matplotlib.pyplot as plt
import warnings

from torch.optim import AdamW
from typing import List
from torch.nn import functional as F
from tqdm import tqdm
from datasets import load_dataset, Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSeq2SeqLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    T5ForSequenceClassification,
    T5Tokenizer,
    Trainer,
    TrainingArguments,
)
from torch.utils.data import DataLoader

warnings.simplefilter("ignore")
print(torch.__version__)

2.5.0+cu121


In [4]:
device = "cuda" if torch.cuda.is_available() else "cpu"

# Part 5: LoRA Adapters

In this section, we will implement LoRA (Low-Rank Adaptation) and inject it into our causal model. Specifically, we will inject LoRA into the **key, query, and value** matrices of each transformer block.

Recall from the LoRA paper that LoRA makes fine-tuning cheaper by freezing the pretrained weights instead of retraining them. For each adapted weight matrix, it learns a low-rank update expressed as the product of two much smaller matrices, A and B, and only those two matrices are trained. This sharply reduces the number of trainable parameters and the optimizer memory, while the paper reports quality on par with full fine-tuning.

For more information, read the [paper](https://arxiv.org/pdf/2106.09685).

By using LoRA in our causal model, we aim to achieve efficient fine-tuning with minimal computational cost, focusing on the key, query, and value matrices within each transformer block.
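To make the low-rank update concrete, the adapted layer computes $h = W_0 x + \frac{\alpha}{r} B A x$, where $W_0$ is frozen and only $A$ and $B$ are trained. The sketch below is an illustration rather than part of the assignment code: it uses a random stand-in for $W_0$, the 768-dimensional projection size of OPT-125m, and the `r = 8`, `alpha = 8` values used later in this notebook.

```python
import torch

d_in, d_out, r, alpha = 768, 768, 8, 8   # projection size of OPT-125m; r and alpha as used later

W0 = torch.randn(d_out, d_in)            # frozen pretrained weight (random stand-in)
A = torch.randn(r, d_in) * 0.02          # small Gaussian init
B = torch.zeros(d_out, r)                # zero init, so the update B @ A starts at 0

x = torch.randn(4, d_in)
h = x @ W0.T + (alpha / r) * (x @ A.T @ B.T)   # h = W0 x + (alpha / r) B A x

full_params = W0.numel()                 # parameters in the frozen matrix
lora_params = A.numel() + B.numel()      # parameters LoRA actually trains
print(full_params, lora_params, lora_params / full_params)
```

With these sizes, the two LoRA matrices hold about 2% as many parameters as the single matrix they adapt, and the output is unchanged at initialization because `B` is zero.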


## 5.1 LoRA class

<h3>Task:</h3>
First, let's implement the LoRA class based on how it is defined in the paper.

In [33]:
class LoRALayer():
    def __init__(
        self,
        r: int,
        lora_alpha: int,
        lora_dropout: float,
    ):
        self.r = r
        self.lora_alpha = lora_alpha
        # Optional dropout
        if lora_dropout > 0.:
            self.lora_dropout = nn.Dropout(p=lora_dropout)
        else:
            self.lora_dropout = lambda x: x

class LoRAAdapter(nn.Module, LoRALayer):
    def __init__(
        self,
        existing_layer: nn.Module,
        in_features,
        out_features,
        r: int = 0,
        lora_alpha: int = 1,
        lora_dropout: float = 0.,
        **kwargs
    ):
        nn.Module.__init__(self)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout)
        self.existing_layer = existing_layer

        ## TODO: Finish this
        # Check the dtype of the existing layer
        existing_dtype = None
        for param in existing_layer.parameters():
            existing_dtype = param.dtype
            break

        self.r = r
        if r > 0:
            self.lora_A = nn.Parameter(torch.zeros(r, in_features,dtype=existing_dtype))
            self.lora_B = nn.Parameter(torch.zeros(out_features, r, dtype=existing_dtype))
            self.scaling = self.lora_alpha / self.r
        self.reset_parameters()

    ## TODO: Resets the two matrices (A and B) based on how the paper does it
    def reset_parameters(self):
        if self.r > 0:
            # Initialize A from a small zero-mean Gaussian (the paper uses a random Gaussian for A)
            torch.nn.init.normal_(self.lora_A, mean=0, std=0.02)
            # Initialize B as zeros
            torch.nn.init.zeros_(self.lora_B)

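    def train(self, mode: bool = True):
        # Use nn.Module.train so the wrapped layer and any LoRA dropout switch
        # modes together, and so the call returns self like other modules
        return nn.Module.train(self, mode)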

    ## TODO: Finish this method
    def forward(self, x: torch.Tensor):
        if self.r > 0:
            # Original layer transformation
            original_output = self.existing_layer(x)

            # LoRA transformation
            # 1. Apply dropout
            lora_input = self.lora_dropout(x)
            # 2. First low-rank transformation (Ax)
            lora_output = F.linear(lora_input, self.lora_A)
            # 3. Second low-rank transformation (BAx)
            lora_output = F.linear(lora_output, self.lora_B)
            # 4. Scale the output
            return original_output + (lora_output * self.scaling)
        # If r = 0, just use the original layer
        return self.existing_layer(x)
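One way to see why `reset_parameters` zeroes only one of the two matrices is to look at the gradients on the first step. The sketch below is a standalone illustration with made-up sizes, not the adapter class itself: with `B = 0` and a random `A`, the output is unchanged but `B` still receives a gradient, whereas zeroing both matrices leaves neither with any gradient.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_out, r = 16, 16, 4               # toy sizes chosen for illustration
x = torch.randn(8, d_in)

def first_step_grads(a_init_std):
    """Return the summed absolute gradients of A and B for a zero-initialized B."""
    A = torch.nn.Parameter(torch.randn(r, d_in) * a_init_std)
    B = torch.nn.Parameter(torch.zeros(d_out, r))
    out = F.linear(F.linear(x, A), B)    # B A x, same call order as forward()
    out.sum().backward()
    return A.grad.abs().sum().item(), B.grad.abs().sum().item()

grad_A, grad_B = first_step_grads(0.02)    # Gaussian A, zero B
grad_A0, grad_B0 = first_step_grads(0.0)   # both matrices zero
print(grad_A, grad_B, grad_A0, grad_B0)
```

In the first case `B` gets a nonzero gradient and starts learning immediately. In the second case both gradients are exactly zero, so training could never leave the initial point.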

## 5.2 Inject into the model

Recall in LoRA that we want to freeze the pre-trained model and only train our adapter weights `lora_A` and `lora_B`.  

<hr>
<h3>Task:</h3>

Complete `mark_only_lora_as_trainable` so that only those weights require gradients.

In [34]:
# TODO: Finish the method
def mark_only_lora_as_trainable(model: nn.Module) -> None:

    # First, freeze all parameters
    for param in model.parameters():
        param.requires_grad = False

    # Then, unfreeze only LoRA parameters
    for name, module in model.named_modules():
        if isinstance(module, LoRALayer):
            if hasattr(module, 'lora_A'):
                module.lora_A.requires_grad = True
                print("lora_A")
            if hasattr(module, 'lora_B'):
                module.lora_B.requires_grad = True
                print("lora_B")
            # nn.Dropout has no learnable parameters, so this loop is normally a no-op
            if hasattr(module, 'lora_dropout') and isinstance(module.lora_dropout, nn.Module):
                for param in module.lora_dropout.parameters():
                    param.requires_grad = True

    return
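An alternative to walking the modules is to filter on parameter names. The sketch below shows that variant on a toy model. `ToyAdapter` is a hypothetical stand-in for the adapter class, and the name-based rule assumes every LoRA parameter name contains `lora_`.

```python
import torch
import torch.nn as nn

class ToyAdapter(nn.Module):             # hypothetical stand-in for the real adapter
    def __init__(self, d=4, r=2):
        super().__init__()
        self.existing_layer = nn.Linear(d, d)
        self.lora_A = nn.Parameter(torch.zeros(r, d))
        self.lora_B = nn.Parameter(torch.zeros(d, r))

toy = nn.Sequential(ToyAdapter(), nn.Linear(4, 4))

# Freeze everything except parameters whose name marks them as LoRA weights
for name, param in toy.named_parameters():
    param.requires_grad = "lora_" in name

trainable = sorted(n for n, p in toy.named_parameters() if p.requires_grad)
print(trainable)
```

Listing the trainable names after freezing is a quick check that nothing from the base model was left unfrozen.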


Finally, we want to write the code that will inject the LoRA adapters into our causal model.

<hr>
<h3>Task: </h3>

Complete the following methods so that we can correctly inject our LoRA adapters into the model.

`match_submodules`: Returns a list of names of layers in a model whose names match a specified key.

`get_submodule`: Retrieves a specific submodule from a model based on its name.

`replace_submodule`: Replaces a specific submodule in a model with a new module at a given path.

```
Code Hint:
You can use Python's built-in setattr and getattr functions to get and replace submodules.
```
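As an illustration of the hint above, the following sketch walks a dotted path with `getattr` and swaps the final module with `setattr`. The `Toy` and `Block` classes and the `layers.1.q_proj` path are made up for the example. `nn.Identity` stands in for the adapter.

```python
import torch.nn as nn

class Block(nn.Module):                  # hypothetical block with one projection
    def __init__(self):
        super().__init__()
        self.q_proj = nn.Linear(4, 4)

class Toy(nn.Module):                    # hypothetical model with two blocks
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([Block(), Block()])

toy = Toy()
*parent_parts, leaf = "layers.1.q_proj".split(".")

# Walk down to the parent of the target; ModuleList entries are reachable by "0", "1", ...
parent = toy
for part in parent_parts:
    parent = getattr(parent, part)

# Replace the target module on its parent
setattr(parent, leaf, nn.Identity())
print(toy)
```

Only the module at the given path changes. The other block keeps its original `Linear` layer.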


`inject_adapter`: Replaces all submodules in a model that match any string in a list with a new module created by an adapter function.

```
Code Hint:
Remember to move the adapters to the same device as the model (the GPU, if available)
```

```
Code Hint:
Here is an example of `inject_adapter` usage:
inject_adapter(model, ["query_key_value"], lambda x: LoRAAdapter(x, r=8,lora_alpha=8, in_features=x.in_features, out_features=x.out_features))
```


In [35]:
# TODO: Finish the method
def match_submodules(model: nn.Module, key:str) -> List[str]:

    result = []
    for name, _ in model.named_modules():
        if key in name:
            result.append(name)
    print("result ", result)
    return result


def get_submodule(model: nn.Module, module_name:str):
    return model.get_submodule(module_name)

# TODO: Finish the method
def replace_submodule(model: nn.Module, module_path: str, new_module):

    path_parts = module_path.split('.')

    if len(path_parts) == 1:
        setattr(model, module_path, new_module)
        return

    # Get the parent module
    parent_path = '.'.join(path_parts[:-1])
    parent_module = get_submodule(model, parent_path)

    # Replace the module in its parent
    setattr(parent_module, path_parts[-1], new_module)


# TODO: Finish the method
def inject_adapter(model: nn.Module, match_on: List[str], adapter_fn):

    # Find all modules that match any of the strings in match_on
    matched_modules = []
    print(match_on)
    for key in match_on:
        matched_modules.extend(match_submodules(model, key))
    print(matched_modules)
    # Replace each matched module with a LoRA adapter
    for module_path in matched_modules:
        # Get the original module
        original_module = get_submodule(model, module_path)

        # Create a new adapter module
        new_module = adapter_fn(original_module)

        # Move to the same device as the original module
        if next(original_module.parameters(), None) is not None:
            device = next(original_module.parameters()).device
            new_module = new_module.to(device)

        # Replace the original module with the adapter
        replace_submodule(model, module_path, new_module)

    return model
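One caveat of substring matching is that, once adapters are injected, a key such as `q_proj` also matches the nested `q_proj.existing_layer`, so calling `inject_adapter` a second time would wrap the inner layers as well. The sketch below contrasts substring matching with a stricter rule that compares only the last path component. `match_leaf_names` and `Wrapper` are hypothetical names used for illustration, not part of the assignment.

```python
from typing import List
import torch.nn as nn

def match_leaf_names(model: nn.Module, key: str) -> List[str]:
    """Hypothetical stricter matcher: the last path component must equal key."""
    return [name for name, _ in model.named_modules() if name.split(".")[-1] == key]

class Wrapper(nn.Module):                # stand-in for an already injected adapter
    def __init__(self):
        super().__init__()
        self.existing_layer = nn.Linear(4, 4)

toy = nn.ModuleDict({"q_proj": Wrapper(), "out_proj": nn.Linear(4, 4)})

substring_hits = [name for name, _ in toy.named_modules() if "q_proj" in name]
exact_hits = match_leaf_names(toy, "q_proj")
print(substring_hits)
print(exact_hits)
```

For a single injection pass, as in this notebook, both rules select the same layers.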


## 5.3 Evaluation on a benchmark

Next, we want to inject the LoRA adapter into the causal model we defined earlier. Let's also check how many parameters this model has and how many of them are trainable.

<hr>
<h3>Task:</h3>

Re-initialize the causal model and check the model architecture.

```
Code Hint:
The name of the model is "facebook/opt-125m"
```

In [36]:
# TODO: Re-initialize the causal model
causal_model_name =  "facebook/opt-125m"
causal_model = AutoModelForCausalLM.from_pretrained(causal_model_name, torch_dtype=torch.bfloat16, device_map="auto")
causal_tokenizer = AutoTokenizer.from_pretrained(causal_model_name)

# TODO: Check the model architecture
causal_model

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), ep

Next, we want to call the inject_adapter method on our causal model and see how this changed our model architecture.

<hr>
<h3>Task:</h3>

Calculate and print the total number of parameters as well as the number of trainable parameters after we inject LoRA into our model.

In [37]:
#inject_adapter(causal_model, ["q_proj_k_proj_v_proj"], lambda x: LoRAAdapter(x, r=8, lora_alpha=8, in_features=x.in_features, out_features=x.out_features))
inject_adapter(causal_model, ['q_proj','k_proj','v_proj'], lambda x: LoRAAdapter(x, r=8, lora_alpha=8, in_features=x.in_features, out_features=x.out_features))
mark_only_lora_as_trainable(causal_model)

# TODO: Calculate total parameters and total trainable parameters
total_params = sum(p.numel() for p in causal_model.parameters())
trainable_params = sum(p.numel() for p in causal_model.parameters() if p.requires_grad)


print(f"Total Parameters: {total_params}")
print(f"Trainable Parameters: {trainable_params}")

['q_proj', 'k_proj', 'v_proj']
result  ['model.decoder.layers.0.self_attn.q_proj', 'model.decoder.layers.1.self_attn.q_proj', 'model.decoder.layers.2.self_attn.q_proj', 'model.decoder.layers.3.self_attn.q_proj', 'model.decoder.layers.4.self_attn.q_proj', 'model.decoder.layers.5.self_attn.q_proj', 'model.decoder.layers.6.self_attn.q_proj', 'model.decoder.layers.7.self_attn.q_proj', 'model.decoder.layers.8.self_attn.q_proj', 'model.decoder.layers.9.self_attn.q_proj', 'model.decoder.layers.10.self_attn.q_proj', 'model.decoder.layers.11.self_attn.q_proj']
result  ['model.decoder.layers.0.self_attn.k_proj', 'model.decoder.layers.1.self_attn.k_proj', 'model.decoder.layers.2.self_attn.k_proj', 'model.decoder.layers.3.self_attn.k_proj', 'model.decoder.layers.4.self_attn.k_proj', 'model.decoder.layers.5.self_attn.k_proj', 'model.decoder.layers.6.self_attn.k_proj', 'model.decoder.layers.7.self_attn.k_proj', 'model.decoder.layers.8.self_attn.k_proj', 'model.decoder.layers.9.self_attn.k_proj', 'mo
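As a sanity check on the trainable count, the expected number can be worked out by hand. The sketch below assumes the configuration shown above: 12 decoder layers, three adapted projections per layer, 768 input and output features, and `r = 8`, with only `lora_A` and `lora_B` trainable.

```python
n_layers, n_proj, d_model, r = 12, 3, 768, 8   # values read from the architecture printout above

per_adapter = r * d_model + d_model * r        # lora_A (r x d) plus lora_B (d x r)
expected_trainable = n_layers * n_proj * per_adapter
print(expected_trainable)
```

If the printed trainable count differs from this value, a base parameter was probably left unfrozen or an adapter was not injected.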

Finally, run the cell below to check the new model's architecture. If the key, value, and query matrices are all now replaced by a LoRA adapter, you are good to go!

In [38]:
# Check the new model architecture
causal_model

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): LoRAAdapter(
              (existing_layer): Linear(in_features=768, out_features=768, bias=True)
            )
            (v_proj): LoRAAdapter(
              (existing_layer): Linear(in_features=768, out_features=768, bias=True)
            )
            (q_proj): LoRAAdapter(
              (existing_layer): Linear(in_features=768, out_features=768, bias=True)
            )
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1)

## 5.4: Finetuning your LoRA adapters on Wikitext

In this next part, we will finally finetune the LoRA adapter of our causal model on a small subset of the training set of Wikitext. If all went correctly, we should notice that the perplexity over our test set went down!

Since we are only using a small subset of the training set and a low chunk size, you shouldn't expect the perplexity to go down by much (<1 point).

**Note:** This code may take some time to run (you are literally training a large language model), so make sure you are confident in the code you completed above. **Please record the final perplexity score** (you may even want to screenshot it as proof).

First, let's define our finetuning function:

In [39]:
def finetune_causal_model(model, train_dataset, epochs=1, learning_rate=1e-4):
        def tokenize_function(examples):
            result = causal_tokenizer(examples["text"], padding="max_length", truncation=True, max_length=256) #256 chosen for Colab's GPU size
            result["labels"] = result["input_ids"].copy()
            return result

        train_dataset = Dataset.from_dict(train_dataset)
        tokenized_dataset = train_dataset.map(tokenize_function, batched=True)
        data_collator = DataCollatorForLanguageModeling(causal_tokenizer, mlm=False)
        training_args = TrainingArguments(
            output_dir="/content",
            evaluation_strategy="epoch",
            per_device_train_batch_size=8,
            learning_rate=learning_rate,
            weight_decay=0.01,
            num_train_epochs=epochs,
        )
        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=tokenized_dataset,
            eval_dataset=tokenized_dataset,
            data_collator=data_collator,
        )
        trainer.train()

Next, let's load our training dataset.

A few interesting things to note: The training dataset can be quite large with respect to our compute resources, so we're only going to use a small fraction of it (the first tenth of the chunks). Also, we are going to split the text into 256-character chunks so that the attention gradients can fit on Colab's GPU.


In [40]:
wiki_training_dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
chunks = []

# Chunk length in characters, kept small so training fits on Colab's GPU
chunk_size = 256


def split_into_chunks(text, chunk_size):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

for example in wiki_training_dataset:
    text = example['text']
    text_chunks = split_into_chunks(text, chunk_size)
    chunks.extend(text_chunks)

processed_train_dataset = {'text':chunks[:len(chunks)//10]}
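Note that `split_into_chunks` slices by characters, not tokens, and that empty lines produce no chunks at all. A small standalone run of the same function makes both points visible:

```python
def split_into_chunks(text, chunk_size):
    # Same slicing as above: consecutive windows of chunk_size characters
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

short_chunks = split_into_chunks("abcdefghij", 4)
empty_chunks = split_into_chunks("", 4)
print(short_chunks)   # ['abcd', 'efgh', 'ij']
print(empty_chunks)   # []
```

Because the last chunk of each line can be much shorter than `chunk_size`, many training examples consist mostly of padding after tokenization.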

Finally, calculate the perplexity of our fine-tuned model on the Wikitext test set.

In [41]:
# Import wikitext dataset
causal_test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
causal_test_encodings = causal_tokenizer("\n\n".join(causal_test["text"]), return_tensors="pt")

In [42]:
def calc_perplexity(model, encodings, stride):
    max_length = 1024
    seq_len = encodings.input_ids.size(1)

    nlls = []
    prev_end_loc = 0
    for begin_loc in tqdm(range(0, seq_len, stride)):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc
        input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)  # `device` is set in Part 2
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100

        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)
            neg_log_likelihood = outputs.loss

        nlls.append(neg_log_likelihood)

        prev_end_loc = end_loc
        if end_loc == seq_len:
            break
    return torch.exp(torch.stack(nlls).mean())
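The sliding window in `calc_perplexity` can be easier to follow on a toy sequence. The sketch below mirrors only the index bookkeeping of the loop, using small made-up values for `seq_len`, `max_length`, and `stride`. In each window, only the last `trg_len` tokens are scored, and the earlier tokens are masked with `-100` so they serve as context only.

```python
def window_plan(seq_len, max_length, stride):
    """Return (begin_loc, end_loc, trg_len) per window, mirroring the loop in calc_perplexity."""
    plan, prev_end_loc = [], 0
    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc   # tokens not yet scored by an earlier window
        plan.append((begin_loc, end_loc, trg_len))
        prev_end_loc = end_loc
        if end_loc == seq_len:
            break
    return plan

plan = window_plan(seq_len=10, max_length=4, stride=2)
print(plan)   # [(0, 4, 4), (2, 6, 2), (4, 8, 2), (6, 10, 2)]
```

The `trg_len` values sum to the sequence length, so every token is scored exactly once, each with up to `max_length - trg_len` tokens of preceding context.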

In [43]:
finetune_causal_model(causal_model, processed_train_dataset)
calc_perplexity(causal_model, causal_test_encodings, 256)

Map:   0%|          | 0/5752 [00:00<?, ? examples/s]

Epoch,Training Loss,Validation Loss
1,4.3532,4.172896


100%|█████████▉| 1120/1124 [02:33<00:00,  7.32it/s]


tensor(24.2500, device='cuda:0', dtype=torch.bfloat16)
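The result above is a `bfloat16` tensor, and near 24 that dtype can only represent steps of 0.125. Since the expected improvement is under one point, the rounding can hide part of it. The sketch below illustrates the effect with made-up perplexity values and shows one possible mitigation, casting the per-window losses to `float32` before averaging. This is a suggestion rather than part of the assignment, and it cannot recover precision already lost inside the model's own `bfloat16` loss.

```python
import torch

# Two different perplexities collapse to the same bfloat16 value
ppl_a = torch.tensor(24.2).to(torch.bfloat16).item()
ppl_b = torch.tensor(24.3).to(torch.bfloat16).item()
print(ppl_a, ppl_b)   # 24.25 24.25

# Possible mitigation: average the window losses in float32 (toy loss values)
nlls = [torch.tensor(3.18, dtype=torch.bfloat16), torch.tensor(3.20, dtype=torch.bfloat16)]
ppl_fp32 = torch.exp(torch.stack(nlls).float().mean())
print(ppl_fp32.dtype)
```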

## 5.5 Conceptual Questions
**Question:** What do you think the benefits of using LoRA are?  What might be some drawbacks?

  **Your Answer:**

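  LoRA adds no inference latency once the adapter weights are merged into the base weights, since the merged model has the same structure as the original. Fine-tuning is cheaper because only the small A and B matrices are trained, and quality is close to that of full fine-tuning. Drawbacks: the full base model must still be loaded, and the rank may need tuning, since complex tasks can require a larger rank or even full fine-tuning.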
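The claim about inference latency rests on merging: the update can be folded into the base weight as $W' = W_0 + \frac{\alpha}{r} B A$, after which a single linear layer reproduces the adapter's output. The sketch below checks this numerically on a toy layer. The sizes and the nonzero `B` (standing in for a trained adapter) are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d, r, alpha = 16, 4, 4                   # toy sizes chosen for illustration
scaling = alpha / r

base = nn.Linear(d, d)
A = torch.randn(r, d) * 0.02
B = torch.randn(d, r) * 0.02             # nonzero, as if training had already happened
x = torch.randn(8, d)

with torch.no_grad():
    # Unmerged path: base layer plus the scaled low-rank branch
    adapter_out = base(x) + scaling * F.linear(F.linear(x, A), B)

    # Merged path: fold the update into one weight matrix
    merged = nn.Linear(d, d)
    merged.weight.copy_(base.weight + scaling * (B @ A))
    merged.bias.copy_(base.bias)
    merged_out = merged(x)

print(torch.allclose(adapter_out, merged_out, atol=1e-5))
```

The adapter implemented in this notebook keeps the two paths separate, so it does pay for the extra matrix products until the weights are merged.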

<hr>

**Question:** Discuss the trade-offs between model size, speed, and accuracy when using LoRA in LLMs.

  **Your Answer:**

Model Size Trade-offs:

Advantages:

- Drastically reduced trainable parameter count (often less than 1 percent of the original model)
- Much smaller storage requirements for task-specific adaptations
- Multiple tasks can share the same base model

Disadvantages:

- Still needs to load the full base model in memory
- Each task requires separate LoRA weights
- May need a larger rank for complex tasks

Speed Trade-offs:

Advantages:

- Faster training (up to 25% faster than full fine-tuning)
- No additional inference latency when merged
- Quick task switching by loading different LoRA weights

Disadvantages:

- Initial model loading still takes the same time
- Memory I/O can be a bottleneck when switching tasks
- Merging weights adds a minor one-time overhead

Accuracy Trade-offs:

Advantages:

- Can match or exceed full fine-tuning performance on some tasks
- More stable training due to fewer trainable parameters
- Less prone to catastrophic forgetting

Disadvantages:

- May require tuning the rank for optimal performance
- Not all layers benefit equally from LoRA
- Some complex tasks might need full fine-tuning

  