## Step 1: Mounting Google Drive

In [2]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Navigate to the repo folder
%cd /content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer

# List repo contents
!ls

Mounted at /content/drive
/content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer
data  deployment  LICENSE  notebooks  project_plan.md  qa_pairs  README.md  scripts


## Step 2: Importing Libraries

In [51]:
import os
import json
from pathlib import Path
import re
from google.colab import files

## Step 3: Setting Paths

In [4]:
BASE_DIR = "/content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer"
PDF_DIR = os.path.join(BASE_DIR, "data", "QA_corpus")
QA_DIR = os.path.join(BASE_DIR, "qa_pairs")

os.makedirs(QA_DIR, exist_ok=True)

## Step 4: Creating Function to Generate QA Pairs

In [31]:
def generate_QA_pair(paper_number, publication_year, short_id, title, qa_pairs):
  """Write one paper's QA pairs to QA_DIR as <paper_number>_<publication_year>_<short_id>.json."""
  paper_id = f"{paper_number}_{publication_year}_{short_id}"
  data = {
    "paper_id": paper_id,
    "title": title,
    "qa_pairs": qa_pairs
  }
  filename = f"{paper_id}.json"
  with open(os.path.join(QA_DIR, filename), "w", encoding="utf-8") as f:
    json.dump(data, f, indent=4)
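
A quick way to sanity-check a file written by the function above is to load it back and confirm the expected keys are present. The sketch below is only illustrative: `validate_qa_file` and `REQUIRED_KEYS` are hypothetical names that do not appear elsewhere in this notebook, and the demo writes to a temporary folder rather than to `QA_DIR`.

```python
import json
import os
import tempfile

# Keys that generate_QA_pair writes into every file.
REQUIRED_KEYS = {"paper_id", "title", "qa_pairs"}

def validate_qa_file(path):
    """Load a QA JSON file and check its basic structure (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for i, pair in enumerate(data["qa_pairs"]):
        if not pair.get("question") or not pair.get("answer"):
            raise ValueError(f"qa_pairs[{i}] needs a non-empty question and answer")
    return data

# Demo on a throwaway file so nothing in the real qa_pairs folder is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "00_2023_demo.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(
            {
                "paper_id": "00_2023_demo",
                "title": "Demo",
                "qa_pairs": [{"question": "Q?", "answer": "A."}],
            },
            f,
        )
    demo = validate_qa_file(demo_path)

print(demo["paper_id"], len(demo["qa_pairs"]))
```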

## Step 5: Listing PDFs in QA Corpus Folder

In [40]:
# List PDFs in QA corpus folder
pdf_files = sorted([f for f in os.listdir(PDF_DIR) if f.endswith(".pdf")])

print("Available Papers:")
for i, file in enumerate(pdf_files):
    print(f"[{i}] {file}")

# Select your paper here
paper_index = 0  # Change this index to select a different paper
pdf_path = os.path.join(PDF_DIR, pdf_files[paper_index])
pdf_name = os.path.splitext(pdf_files[paper_index])[0]

print(f"\n Selected: {pdf_files[paper_index]}")

Available Papers:
[0] ADALORA:_ADAPTIVE_BUDGET_ALLOCATION_FOR_PARAMETER-EFFICIENT_FINE-TUNING.pdf
[1] AutoLoRA:_Automatically_Tuning_Matrix_Ranks_in_Low-Rank_Adaptation_Based_on_Meta_Learning.pdf
[2] Balancing_Continuous_Pre-Training_and_Instruction_Fine-Tuning:
__Optimizing_Instruction-Following_in.pdf
[3] CURLoRA:_Stable_LLM_Continual_Fine-Tuning_and_Catastrophic_Forgetting
__Mitigation.pdf
[4] DELIFT:_Data_Efficient_Language_model_Instruction_Fine_Tuning.pdf
[5] FINETUNED_LANGUAGE_MODELS_ARE_ZERO-SHOT_LEARNERS.pdf
[6] Few-Shot_Parameter-Efficient_Fine-Tuning_is_Better_and_Cheaper_than_In-Context_Learning.pdf
[7] Instruction_Tuning_for_Large_Language_Models:_A_Survey.pdf
[8] LLAMA-ADAPTER:_EFFICIENT_FINE-TUNING_OF_LARGE_LANGUAGE_MODELS_WITH_ZERO-INITIALIZED_ATTENTION.pdf
[9] LLM-Adapters:_An_Adapter_Family_for_Parameter-Efficient_Fine-Tuning_of_Large_Language_Models.pdf
[10] LORA:_LOW-RANK_ADAPTATION_OF_LARGE_LANGUAGE_MODELS.pdf
[11] LoRA_vs_Full_Fine-tuning:_An_Illusion_of_Equivalen

## Step 6: Generating QA pairs

### **00: AdaLoRA — Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning**

**Abstract**  
Fine-tuning large pre-trained language models on downstream tasks has become an important paradigm in NLP. However, common practice fine-tunes all of the parameters in a pre-trained model, which becomes prohibitive when a large number of downstream tasks are present. Therefore, many fine-tuning methods are proposed to learn incremental updates of pre-trained weights in a parameter efficient way, e.g., low-rank increments. These methods often evenly distribute the budget of incremental updates across all pre-trained weight matrices, and overlook the varying importance of different weight parameters. As a consequence, the fine-tuning performance is suboptimal.

To bridge this gap, we propose **AdaLoRA**, which adaptively allocates the parameter budget among weight matrices according to their importance score. In particular, AdaLoRA parameterizes the incremental updates in the form of **singular value decomposition (SVD)**. Such a novel approach allows us to effectively **prune the singular values** of unimportant updates, which is essentially to reduce their parameter budget but circumvent intensive exact SVD computations. We conduct extensive experiments with several pre-trained models on **natural language processing**, **question answering**, and **natural language generation** to validate the effectiveness of AdaLoRA. Results demonstrate that AdaLoRA manifests notable improvement over baselines, especially in **low budget** settings.
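
To make the pruning idea in the abstract concrete, here is a minimal NumPy sketch of an update kept in SVD-like form, with the least important singular values zeroed out. It is not the authors' implementation: the abstract only says an "importance score" is used, so the magnitude-based score, the `budget` argument, and all variable names below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 5, 4

# Incremental update kept in factored form: delta_W = P @ diag(lam) @ Q.
P = rng.normal(size=(d_out, r))
lam = np.array([0.9, 0.05, 0.4, 0.01])
Q = rng.normal(size=(r, d_in))

def prune_singular_values(lam, scores, budget):
    """Keep the `budget` highest-scoring singular values and zero the rest."""
    keep = np.argsort(scores)[-budget:]
    mask = np.zeros_like(lam)
    mask[keep] = 1.0
    return lam * mask

# Assumption: use the magnitude of each singular value as its importance score.
scores = np.abs(lam)
lam_pruned = prune_singular_values(lam, scores, budget=2)

# The pruned update is built without ever running an exact SVD.
delta_W = P @ np.diag(lam_pruned) @ Q
print(lam_pruned)
print(np.linalg.matrix_rank(delta_W))
```

Zeroing a singular value removes one rank-1 component from the update, which is how pruning lowers the parameter budget of a given weight matrix.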

**Introduction**

Large Language Models (LLMs) (Dai et al., 2019; Radford et al., 2019; Zhang et al., 2022; Raffel et al., 2020; Devlin et al., 2018) have stimulated widespread attention in both academia and industry. Driven by massive corpora and advanced hardware, LLMs exhibit remarkable understanding and generative
ability, propelling language tasks to a higher level. Recently, significant progress has been made on instruction-following models, e.g., ChatGPT (OpenAI, 2023a) and GPT-4 (OpenAI, 2023b), which follow language instructions and generate contextual responses. However, the further prevalence of
instruction models is largely impeded by the closed-source restriction and high development costs. To alleviate this, Stanford Alpaca (Taori et al., 2023) proposes to fine-tune an open-source LLM, i.e., LLaMA (Touvron et al., 2023) into an instruction-following model, which is affordable and replicable. Starting from 175 human-written instruction-output pairs (Wang et al., 2022a), Alpaca leverages GPT-3.5 (Brown et al., 2020) to expand the training data to 52K in a self-instruct manner. Supervised by this, Alpaca fine-tunes the entire 7B parameters in LLaMA, producing an exceptional
instruction model that performs similarly to GPT-3.5. Despite Alpaca’s effectiveness, a complete fine-tuning of large-scale LLaMA is still time-consuming, computation-intensive, and cumbersome
to transfer to different downstream scenarios.

In this paper, we introduce **LLaMA-Adapter**, an efficient fine-tuning method that adapts LLaMA into a well-performed instruction-following model. Trained by Alpaca’s instruction-output data, our approach freezes the entire LLaMA model, and proposes a zero-initialized attention mechanism with superior resource efficiency. Specifically, in LLaMA’s higher transformer layers, we append a set of learnable adaption prompts as prefixes to the word tokens. Then, to avoid the noise from randomly initialized prompts at the early training stage, we equip the frozen self-attention layers with a learnable gating factor. The gating mechanism is initialized by zeros, and controls the feature interaction between prompt and word tokens, within the process of attention calculation. Such a strategy can first preserve the original knowledge in LLaMA, and progressively inject the new instructional signals during training. This contributes to a more stable learning process and better
instruction-following capacity of the final model.
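
The zero-initialized gating described above can be sketched in a few lines of NumPy. This is an illustrative simplification rather than the paper's code: it assumes the prompt scores and word scores are normalized by separate softmaxes and that the prompt contribution is scaled by `tanh(gate)`, and it omits the causal mask and multi-head structure. The function and variable names are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gated_attention(q, k_prompt, v_prompt, k_words, v_words, gate):
    """Attention over word tokens plus a gated contribution from adaption prompts."""
    d = q.shape[-1]
    s_prompt = softmax(q @ k_prompt.T / np.sqrt(d))
    s_words = softmax(q @ k_words.T / np.sqrt(d))
    # tanh(0) == 0, so a zero gate leaves the frozen model's output unchanged.
    return np.tanh(gate) * (s_prompt @ v_prompt) + s_words @ v_words

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))
k_words, v_words = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
k_prompt, v_prompt = rng.normal(size=(2, 8)), rng.normal(size=(2, 8))

plain = softmax(q @ k_words.T / np.sqrt(8)) @ v_words
out_zero = gated_attention(q, k_prompt, v_prompt, k_words, v_words, gate=0.0)
out_open = gated_attention(q, k_prompt, v_prompt, k_words, v_words, gate=1.0)

print(np.allclose(out_zero, plain))
print(np.allclose(out_open, plain))
```

With the gate at zero the prompts contribute nothing, which matches the idea of preserving the original knowledge at the start of training and injecting the new signal gradually as the gate is learned.
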
Overall, our LLaMA-Adapter exhibits four main characteristics, as shown in Figure 1.

• **1.2M Parameters**. Instead of updating the full 7B parameters, we freeze the pre-trained
LLaMA and only learn the zero-initialized attention mechanism with 1.2M parameters. This,
however, reveals comparable instruction-following proficiency with the 7B Alpaca.

• **One-hour Fine-tuning**. Thanks to our lightweight adaption modules with zero-initialized
gating, the training convergence of LLaMA-Adapter costs less than one hour on 8 A100
GPUs, which are three times faster than Alpaca.

• **Plug with Expertise**. For different scenarios, it is flexible to insert their respective adapters to endow LLaMA with different expert knowledge or new modality input. Thus, it suffices to store a 1.8M adapter within each context, other than a complete copy of the 13G LLaMA.

• **Multi-modal Reasoning**. Besides language instruction, our approach can also incorporate an image encoder via zero-initialized attention to become a multi-modal LLM. Compared to concurrent works (Liu et al., 2023b; Zhu et al., 2023), LLaMA-Adapter showcases higher tuning efficiency with competitive reasoning capacity on MME (Fu et al., 2023), MMBench (Liu et al., 2023c), and LVLM-eHub (Xu et al., 2023) benchmarks.

In addition to instruction tuning, our zero-initialized attention can be generalized to traditional vision and language tasks for parameter-efficient fine-tuning. We apply our approach to the pre-trained ViT (Dosovitskiy et al., 2020), RoBERTa (Liu et al., 2019), and CLIP (Radford et al., 2021),
respectively for fine-tuning vision, language, and vision-language models. On a wide range of
downstream tasks, we demonstrate the effectiveness of our proposed method for traditional tasks.

In [38]:
qa_pairs = [
    {
        "question": "What problem does AdaLoRA aim to solve in the context of fine-tuning large language models?",
        "answer": "AdaLoRA addresses the inefficiency of uniformly distributing the parameter budget across all weight matrices during fine-tuning. It proposes an adaptive allocation strategy that prioritizes important parameters, thus improving performance under constrained budgets."
    },
    {
        "question": "How does AdaLoRA allocate the parameter budget among weight matrices?",
        "answer": "AdaLoRA uses an importance scoring mechanism to assign more parameters to critical weight matrices and fewer to less important ones. This allocation is realized through a low-rank approximation using singular value decomposition (SVD)."
    },
    {
        "question": "What advantage does AdaLoRA's use of singular value decomposition provide?",
        "answer": "By representing incremental updates via SVD, AdaLoRA can prune unimportant singular values, thus reducing computational overhead and improving parameter efficiency without performing exact, expensive SVD computations."
    },
    {
        "question": "Why is full fine-tuning of large pre-trained models often impractical in real-world applications involving many downstream tasks?",
        "answer": "Full fine-tuning requires updating and storing a separate copy of the model for each downstream task, which becomes prohibitively expensive in terms of memory and computation, especially for large models like BERT, T5, or GPT-3 that have hundreds of millions to billions of parameters."
    },
    {
        "question": "What are the two primary approaches to parameter-efficient fine-tuning described in the introduction?",
        "answer": "The first approach involves adding small neural modules—like adapters, prompts, or prefixes—to a frozen base model and fine-tuning only those additions. The second approach models the incremental update of pre-trained weights in a parameter-efficient way without altering the model architecture, using methods like diff pruning or LoRA."
    },
    {
        "question": "How does LoRA improve parameter efficiency in fine-tuning compared to full fine-tuning?",
        "answer": "LoRA improves efficiency by representing the incremental updates as a low-rank matrix—specifically, the product of two smaller matrices. This significantly reduces the number of trainable parameters while preserving or even improving performance, and it avoids the complexity of handling sparse matrices like in diff pruning."
    },
    {
        "question": "What limitation of LoRA does AdaLoRA aim to overcome?",
        "answer": "LoRA uses a fixed rank for all weight matrices during fine-tuning, which assumes all matrices are equally important. AdaLoRA addresses this by dynamically allocating different parameter budgets to different weight matrices based on their relative importance, allowing more effective use of limited resources."
    }
]

In [41]:
title = "AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning"
generate_QA_pair("00", 2023, "adalora", title, qa_pairs)

### **01: AutoLoRA: Automatically Tuning Matrix Ranks in Low-Rank Adaptation Based on Meta Learning**

**Abstract**

Large-scale pretraining followed by task-specific finetuning has achieved great success in various NLP tasks. Since finetuning all parameters of large pretrained models poses substantial computational and memory challenges, several efficient finetuning methods have been developed. Among them, low-rank adaptation (LoRA), which finetunes low-rank incremental update matrices on top of frozen pre-trained weights, has proven particularly effective. Nonetheless, LoRA’s uniform rank assignment across all layers, along with its reliance on an exhaustive search to find the best rank, leads to high computation costs and suboptimal finetuning performance. To address these limitations, we introduce AutoLoRA, a meta learning based framework for automatically identifying the optimal rank of each LoRA layer. AutoLoRA associates each rank-1 matrix in a low-rank update matrix with a selection variable, which determines whether the rank-1 matrix should be discarded. A meta learning based method is developed to learn these selection variables. The optimal rank is determined by thresholding the values of these variables. Our comprehensive experiments on natural language understanding, generation, and sequence labeling demonstrate the effectiveness of AutoLoRA.

**Introduction**

Large Language Models (LLMs) have demonstrated state-of-the-art performance across a variety of NLP tasks, spanning from Natural Language Understanding (NLU) to Natural Language Generation (NLG), a trajectory highlighted by the success of models like ChatGPT. Their success largely stems from a two-stage process: initial pretraining on vast amounts of unlabeled texts, followed by finetuning on specific downstream tasks. However, as models scale up, for instance transitioning from RoBERTa-large’s 355 million parameters to GPT-3’s staggering 175 billion parameters, finetuning becomes highly expensive in computation.

To address this challenge, many efficient finetuning methods have been developed. For instance, the Adapters method inserts lightweight layers (called adapters) into pretrained networks. During fine-tuning, only these adapters are updated while the pretrained layers are kept frozen. One limitation of this method is that the adapters incur additional computation overhead during inference. Another approach, prefix tuning, introduces trainable prefix parameters which are prepended to the input sequence while making the pretrained model parameters frozen. Nevertheless, determining the optimal length of the prefix can be tricky. A prefix that is too short cannot capture enough information, while an overlong prefix may largely reduce the maximum length of the input sequence. To address these limitations, LoRA proposes to add low-rank incremental update matrices to pretrained weight matrices. During finetuning, only the incremental matrices are trained while the pretrained ones are frozen. The low-rank parameterization significantly reduces the number of finetuning parameters.
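
The low-rank update described in this paragraph can be illustrated with a small NumPy sketch that compares trainable parameter counts. The shapes are arbitrary, and initializing `B` to zeros is an assumption for the demo (it is not stated in the text above); none of the names come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 32, 4

W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
B = np.zeros((d_out, r))                 # trainable; zeros here is a demo assumption
A = rng.normal(size=(r, d_in)) * 0.01    # trainable

def lora_forward(x):
    """Apply the frozen weight plus the low-rank incremental update B @ A."""
    return x @ (W + B @ A).T

full_params = W.size
lora_params = A.size + B.size

x = rng.normal(size=(2, d_in))
y = lora_forward(x)
print(full_params, lora_params)
```

Only `A` and `B` would be trained, so the number of finetuning parameters scales with `r * (d_in + d_out)` instead of `d_in * d_out`.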

While achieving parameter-efficient finetuning without increasing inference costs, LoRA has two limitations. First, the update matrices at different layers share the same rank, without considering the varying properties across layers. Different layers in a pretrained model have varying importance to a downstream task and should be adapted differently, which requires the number of trainable parameters to be layer-specific. Employing a uniform rank across all layers compromises this purpose, which renders some layers to be under-parameterized (leading to suboptimal fine-tuning performance) while others unnecessarily over-parameterized (leading to computation inefficiency). Second, obtaining the optimal rank in LoRA typically involves an extensive manual hyperparameter search, which is time-consuming and poses scalability issues.

To address the aforementioned limitations of LoRA, we introduce the AutoLoRA framework to automatically determine the optimal rank for each LoRA layer. In AutoLoRA, we first decompose an update matrix into the product of two low-rank
matrices (with rank k), in alignment with the LoRA methodology. This product can be expressed as the summation of k rank-1 matrices. For each rank-1
matrix, we assign a continuous trainable selection variable α ∈ [0, 1] indicating the matrix’s relative importance in the summation. After learning, if α is close to zero, the corresponding rank-1 matrix is removed from the summation. These selection variables effectively control the rank of an update
matrix. Learning α directly on a training dataset together with the update matrices can result in over-fitting, and the network learned in this way lacks generalization ability. To mitigate this problem, we formulate the search process of α as a meta learning problem. First, we finetune the weights in the rank-1 matrices on a training dataset. Second, we optimize the α values by minimizing the loss on a validation dataset. These two steps iterate until convergence. Subsequently, we derive the optimal rank of each LoRA layer by thresholding the learned α values. Once the optimal rank is identified for each layer, the weights in the low-rank update matrices are retrained on the combination of training and validation data. An overview of our proposed method is illustrated in Figure 1.
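
The rank-1 decomposition and thresholding step described above can be sketched as follows. This only illustrates the final selection step: the α values are made up rather than learned, the alternating training/validation optimization is not shown, and the `threshold` value and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, k = 6, 5, 4

B = rng.normal(size=(d_out, k))
A = rng.normal(size=(k, d_in))

# Pretend these selection variables were already learned on validation data.
alpha = np.array([0.92, 0.03, 0.55, 0.08])

def update_matrix(B, A, alpha):
    """Sum of k rank-1 matrices, each weighted by its selection variable."""
    return sum(alpha[j] * np.outer(B[:, j], A[j, :]) for j in range(len(alpha)))

# With every alpha equal to 1 the sum reduces to the plain LoRA product B @ A.
full_update = update_matrix(B, A, np.ones(k))

# Thresholding drops rank-1 terms whose alpha is close to zero.
threshold = 0.1  # hypothetical cut-off
keep = alpha > threshold
rank = int(keep.sum())
B_kept, A_kept = B[:, keep], A[keep, :]

print(rank, B_kept.shape, A_kept.shape)
```

After this step the text says the retained low-rank matrices are retrained on the combined training and validation data, which the sketch leaves out.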

The major contributions of this paper are summarized as follows.

• We propose AutoLoRA, a meta learning based approach that can automatically determine the optimal and layer-specific ranks of update matrices, alleviating the burden of manually tuning them as in LoRA.

• Extensive experiments on natural language understanding and generation tasks demonstrate the effectiveness of AutoLoRA.

**Conclusions and Future Work**

In this paper, we introduce AutoLoRA, a meta learning based framework designed to automatically search for the optimal ranks for LoRA layers. Our method associates each rank-1 matrix in LoRA updates with a selection variable and formulates the rank-tuning problem as optimizing the selection variables via meta learning. Thresholding is applied to derive discrete rank values from continuous selection variables and retraining is performed to bridge the gap incurred by thresholding. Comprehensive experiments show the efficacy of AutoLoRA across various NLP tasks. Similar to the LoRA method, the LoRA layers in AutoLoRA are manually specified, which may be suboptimal. As a future work, we will investigate how to automatically select LoRA layers, by developing a meta learning framework similar to
that in Eq.(5).

In [32]:
qa_pairs = [
    {
        "question": "What problem does AutoLoRA aim to solve in traditional LoRA-based fine-tuning?",
        "answer": "AutoLoRA addresses two core limitations of traditional LoRA: (1) the uniform rank assignment across all layers, which neglects layer-specific importance, leading to suboptimal or inefficient fine-tuning; and (2) the need for exhaustive manual hyperparameter searches to determine optimal ranks."
    },
    {
        "question": "How does AutoLoRA represent each update matrix in the fine-tuning process?",
        "answer":  "AutoLoRA decomposes each update matrix into the product of two low-rank matrices, consistent with the LoRA methodology. This product is then expressed as a sum of rank-1 matrices, each associated with a trainable selection variable α ∈ [0, 1]."
    },
    {
        "question": "What is the role of the selection variable α in AutoLoRA?",
        "answer": "The α variable controls whether a given rank-1 matrix should be retained. If α is close to zero, the corresponding matrix is discarded. The optimal rank of each layer is determined by thresholding these α values after training."
    },
    {
        "question": "How does AutoLoRA determine the optimal rank of each LoRA layer?",
        "answer": "AutoLoRA introduces selection variables associated with each rank-1 matrix in a low-rank update. These variables are learned via a meta-learning method and used to determine the optimal rank by thresholding their values."
    },
    {
        "question": "Why is learning α directly on the training dataset problematic, and how does AutoLoRA address it?",
        "answer": "Directly learning α from training data can lead to overfitting and poor generalization. AutoLoRA mitigates this by framing α-optimization as a meta learning problem: update weights on training data, then update α by minimizing loss on a separate validation set."
    },
    {
        "question": "What distinguishes AutoLoRA from adapter and prefix tuning methods in terms of inference overhead?",
        "answer": "Unlike adapter and prefix tuning, which introduce additional parameters that incur runtime overhead, AutoLoRA does not increase inference cost. Only the low-rank update matrices are trained, and their integration does not burden inference."
    },
    {
        "question": "How does AutoLoRA improve computational efficiency compared to standard LoRA?",
        "answer": "AutoLoRA avoids exhaustive grid searches for optimal ranks by learning them automatically, layer-wise. This reduces computational cost and ensures that model capacity is allocated where it is most beneficial."
    },
    {
        "question": "What is AutoLoRA, and why is it important?",
        "answer": "AutoLoRA is a meta learning-based framework that optimizes the rank of LoRA layers in large language models. It improves parameter efficiency and fine-tuning performance while eliminating costly manual tuning."
    },
    {
        "question": "How does AutoLoRA relate to the broader challenge of scaling large language models?",
        "answer": "As LLMs grow larger, full fine-tuning becomes increasingly resource-intensive. AutoLoRA offers a scalable alternative by fine-tuning only select low-rank matrices with learned rank assignments, thereby conserving resources without sacrificing performance."
    }
]

In [27]:
# Same signature as the Step 4 definition, so calls that pass paper_number keep working
def generate_QA_pair(paper_number, publication_year, short_id, title, qa_pairs):
  data = {
    "paper_id": f"{paper_number}_{publication_year}_{short_id}",
    "title": title,
    "qa_pairs": qa_pairs
  }
  filename = f"{paper_number}_{publication_year}_{short_id}.json"
  with open(os.path.join(QA_DIR, filename), "w") as f:
    json.dump(data, f, indent=4)
  return

In [34]:
title = "AutoLoRA: Automatically Tuning Matrix Ranks in Low-Rank Adaptation Based on Meta Learning"
generate_QA_pair("01", 2024, "autolora", title, qa_pairs)

### **02: Balancing Continuous Pre-Training and Instruction Fine-Tuning: Optimizing Instruction-Following in LLMs**

**Abstract**

Large Language Models (LLMs) for public use require continuous pre-training to remain up-to-date with the latest data. The models also need to be fine-tuned with specific instructions to maintain their ability to follow instructions accurately. Typically, LLMs are released in two versions: the Base LLM, pre-trained on diverse data, and the instruction-refined LLM, additionally trained with specific instructions for better instruction following. The question arises as to which model should undergo continuous pre-training to maintain its instruction-following abilities while also staying current with the latest data. In this study, we delve into the intricate relationship between continuous pre-training and instruction finetuning of the LLMs and investigate the impact of continuous pre-training on the instruction following abilities of both the base and its instruction finetuned model. Further, the instruction fine-tuning process is computationally intense and requires a substantial number of hand-annotated examples for the model to learn effectively. This study aims to find the most compute-efficient strategy to gain up-to-date knowledge and instruction-following capabilities without requiring any instruction data and fine-tuning. We empirically prove our findings on the LLaMa 3, 3.1 and Qwen 2, 2.5 family of base and instruction models, providing a comprehensive exploration of our hypotheses across varying sizes of pre-training data corpus and different LLMs settings.

**Introduction**

Recently, autoregressive large language models (LLM) showed remarkable progress across a wide range of natural language tasks, natural language understanding, mathematical reasoning, and coding across various domains. These LLMs are pre-trained with a causal language modeling objective to predict the next token(s) in a given sequence until it is complete, termed as Base models. These base models exhibit a remarkable ability to generate linguistically coherent text, however not necessarily aligning their generations with human preference and needs. Thus, LLMs often require a fine-tuning step, Instruction fine-tuning to bridge the gap between the base model’s fundamental objective and the practical needs of human users termed as Instruction models.

Instruction fine-tuning is an expensive task and generally requires a significant amount of labeled data depending on the type of optimization technique used. This can be expensive and time-consuming to collect and annotate such a big dataset. Algorithmically, it requires training of reward model and RLHF, PPO, DPO fine-tuning which further adds to the complexity of the task.

Parallelly, to stay abreast with the latest data, the base model needs to be either re-pre-trained on a combination of old and newly collected data or continuously
pre-trained on the newly collected data yielding to the new base model. For
example, the LLaMa 3.1 base model is pre-trained with more and high-quality data over the LLaMa 3 base model. Similarly, Qwen
2.5 family base models have more knowledge and improved capabilities over Qwen 2 family models.

Continuous pre-training of the LLM generally results in forgetting previously learned information; several methods have been proposed to maintain the base model performance on previously learned tasks, such as Xie et al. (2023) and Ibrahim et al. (2024). However, there has been no research focusing on
the influence of continuous training on instruction models. As continuous pre-training is vital for acquiring new knowledge, and instruction tuning is necessary to learn instruction following capabilities, it is required to have both the capabilities to any instruction model. This raises a series of
natural questions:

**a** What happens to the instruction capabilities
when we continuously pre-train the instruction
model to gain new knowledge?

**b** If lost, how to regain instruction capabilities?

**c** Is it necessary to add resource-extensive
instruction-fine-tuning after updating the
knowledge of the base model?

We approach this problem empirically by studying two different settings. In the first setting, we continuously pre-train the instruction model on a specific dataset and observe its performance on the LLM harness framework from EleutherAI. Whereas in another setting we continuously pre-train the base model with the same data and then instruction fine-tune the continuously pre-trained base model. Finally, we compare the instruction capabilities of instruction models from
both settings. Since instruction fine-tuning is an expensive task, we discovered a simple yet efficient approach to regain the instruction capability of the continuous pre-trained base model, given that the instruction-tuned model of the original base model is available. Our main findings and the contributions of this work are as follows:

• Continuous pre-training of an instruction
model results in catastrophic forgetting of the
instruction capabilities and, therefore should
be avoided.

• Continuous pre-training base model and then
instruction tuning preserve both the domain
knowledge and the instruction capabilities.

• Instruction capabilities are portable across the same ancestor models. That is, we can extract the instruction capability by simply subtracting the weight of the base model from the weights of its instruction-tuned model.

• No traditional instruction tuning is required for a continuous pre-trained base model instead the instruction capabilities are ported.
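
The weight subtraction mentioned in the findings above can be sketched on toy weight dictionaries. The subtraction step is stated in the text; adding the residual back onto the continuously pre-trained base model is how the porting is assumed to work here, and the function names, parameter names, and values are all hypothetical.

```python
import numpy as np

def instruction_residual(instruct_weights, base_weights):
    """Per-parameter difference: instruction-tuned weights minus base weights."""
    return {name: instruct_weights[name] - base_weights[name] for name in base_weights}

def apply_residual(new_base_weights, residual):
    """Add the residual onto a continuously pre-trained base model of the same ancestor."""
    return {name: new_base_weights[name] + residual[name] for name in new_base_weights}

# Toy single-layer "models" sharing the same parameter names and shapes.
base = {"layer.weight": np.array([[1.0, 2.0], [3.0, 4.0]])}
instruct = {"layer.weight": np.array([[1.5, 2.0], [3.0, 3.0]])}
new_base = {"layer.weight": np.array([[2.0, 2.0], [2.0, 2.0]])}

residual = instruction_residual(instruct, base)
ported = apply_residual(new_base, residual)
print(ported["layer.weight"])
```

Both steps need matching parameter names and shapes, which is one reason the approach is limited to models from the same ancestor.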

To our knowledge, we are the first ones to systematically conduct this analysis and discover the portability of the instruction capabilities across models from the same ancestor. We empirically prove all our findings on LLaMa 3, LLaMa 3.1, Qwen2, and Qwen 2.5 families of base and instruct models. We comprehensively test our hypothesis in
breadth and depth with varying sizes of pre-training data corpus across different LLMs settings in Section 3.

**Conclusion**

In conclusion, this study delves into the effects of continuous pre-training on base and instruction-tuned large language models (LLMs) and their instruction capabilities. The findings suggest that while continuous pre-training of instruction models may lead to catastrophic forgetting of instruction capabilities, a more efficient approach is to continuously pre-train the base model with new
data, followed by instruction tuning. This method preserves both domain knowledge and instruction capabilities. Interestingly, the study also reveals that instruction capabilities are transferable across models from the same ancestor, eliminating the need for additional instruction tuning for a continuously pre-trained base model. We empirically demonstrated this analysis on the LLaMa 3 and LLaMa 3.1 family of base and instruction models.

**Limitations**

While our hypothesis is validated for models with 8 billion parameters, we observe a noticeable variation in performance when applied to smaller models, particularly those with around 1.5 billion parameters. Furthermore, the scalability of our proposed strategy for models smaller than 1.5 billion parameters remains uncertain. This presents an
intriguing avenue for future research, where further exploration could investigate whether modifications or optimizations are needed to maintain the same level of effectiveness for these smaller models.

A critical challenge that emerges with the instruction residual method is the reliance on the availability of both the base language model and its instruction fine-tuned counterpart. The approach fundamentally depends on the residual differences between these two models to function effectively. In the absence of either the base model or the fine-tuned model, the instruction residual method cannot be employed. This limitation highlights a bottleneck in the methodology, especially when resources or computational constraints prevent the simultaneous availability of both models. Future work could explore potential ways to mitigate this dependency, perhaps by developing alternative techniques that either reduce the need for dual-model structures or enhance the portability of instruction-based fine-tuning across a wider range of model sizes.

In [35]:
qa_pairs = [
    {
        "question": "What are the two primary phases in training large language models (LLMs), and why is instruction fine-tuning necessary?",
        "answer": "The two primary phases are large-scale pretraining on diverse unlabeled data, followed by task-specific instruction fine-tuning. While pretraining equips models with general linguistic capabilities, instruction fine-tuning aligns the model’s behavior with human intent, enhancing its ability to follow explicit instructions."
    },
    {
        "question": "Why is continuous pre-training essential for LLMs, and what problem does it pose for instruction-tuned models?",
        "answer":  "Continuous pre-training ensures LLMs stay updated with new knowledge. However, when applied to instruction-tuned models, it causes catastrophic forgetting, diminishing their instruction-following capabilities."
    },
    {
        "question": "What empirical strategy do the authors propose to preserve both updated knowledge and instruction-following ability?",
        "answer": "The authors propose continuously pre-training the base model, then performing instruction fine-tuning afterward. This sequence maintains both domain knowledge and the capacity to follow instructions, avoiding the drawbacks of directly pretraining the instruction-tuned model."
    },
    {
        "question": "What is the “instruction residual” method, and how does it work?",
        "answer": "The instruction residual method extracts the difference in weights between a base model and its instruction-tuned counterpart and applies that delta to a newly updated base model, transferring instruction-following capabilities without redoing instruction fine-tuning."
    },
    {
        "question": "Under what conditions can the instruction residual method be applied effectively?",
        "answer": "The method can be applied only when both the base model and its instruction fine-tuned counterpart are available, because it depends on the weight residual between the two. The authors also report that it works well for 8B parameter models, while its effectiveness for smaller models (around 1.5B parameters) remains an open question."
    },
    {
        "question": "What key experimental insight did the authors discover about model size and the efficacy of their approach?",
        "answer": "The strategy works well for 8B parameter models but shows variation in effectiveness for smaller models, especially those around 1.5B parameters. The scalability of the approach to such models remains an open research question."
    },
    {
        "question": "What limitations do the authors acknowledge regarding their methodology?",
        "answer": "Two main limitations are identified: (1) its uncertain scalability to smaller models, and (2) dependency on having both base and instruction-tuned versions, which may not always be feasible due to computational or resource constraints."
    },
    {
        "question": "What problem does this paper aim to solve in the context of instruction tuning for LLMs?",
        "answer": "It addresses how to maintain both updated knowledge and instruction-following ability in LLMs without repeatedly performing costly instruction fine-tuning."
    },
    {
        "question": "How does this paper differ from previous work on continual pre-training or catastrophic forgetting?",
        "answer": "Unlike prior work focused mainly on base models, this paper examines the unique effects of continual pretraining on instruction-tuned models and proposes a novel weight residual transfer strategy for preserving instruction-following ability."
    },
    {
        "question": "What practical takeaway does the paper offer for training up-to-date instruction-following LLMs?",
        "answer": "Instead of repeatedly fine-tuning updated instruction models, practitioners can simply reuse instruction residuals from earlier models and apply them to newer base models—saving both time and compute."
    }
]

In [37]:
title = "Balancing Continuous Pre-Training and Instruction Fine-Tuning: Optimizing Instruction-Following in LLMs"
generate_QA_pair("02", 2024, "balancing_pretrain_finetune", title, qa_pairs)
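
Since the QA pairs are curated by hand, a quick sanity check before writing the file can catch copy-paste slips such as the same answer being reused for two questions. The helper below is a minimal sketch; `find_duplicate_answers` and `sample_pairs` are hypothetical names introduced only for illustration.

```python
from collections import defaultdict

def find_duplicate_answers(qa_pairs):
    """Map each answer that appears more than once to the questions that share it."""
    questions_by_answer = defaultdict(list)
    for pair in qa_pairs:
        questions_by_answer[pair["answer"].strip()].append(pair["question"])
    return {a: qs for a, qs in questions_by_answer.items() if len(qs) > 1}

# Toy input in the same shape as the qa_pairs lists in this notebook.
sample_pairs = [
    {"question": "Q1", "answer": "Same text."},
    {"question": "Q2", "answer": "Same text."},
    {"question": "Q3", "answer": "Different text."},
]

duplicates = find_duplicate_answers(sample_pairs)
for answer, questions in duplicates.items():
    print(f"Answer shared by {questions}: {answer}")
```

Running the same function on a real `qa_pairs` list before calling `generate_QA_pair` would flag any answer that needs to be rewritten.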

### **03: CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting Mitigation**

**Abstract**

This paper introduces CURLoRA, a novel approach to fine-tuning large language models (LLMs) that leverages CUR matrix decomposition in the context of Low-Rank Adaptation (LoRA). Our method addresses two critical challenges in LLM fine-tuning: mitigating catastrophic forgetting during continual learning and reducing the number of trainable parameters. We propose a unique modification to the CUR decomposition process, utilizing inverted probabilities for column and row selection, which acts as an implicit regularization, and initializing the U matrix as a zero matrix, and only fine-tuning it. We demonstrate through experiments on multiple datasets that CURLoRA outperforms standard LoRA in mitigating catastrophic forgetting. It maintains model stability and performance across tasks while significantly reducing the number of trainable parameters. Our results show that CURLoRA achieves very good and stable task accuracy while maintaining base model’s perplexity scores fixed compared to LoRA upon continual fine-tuning, particularly in scenarios with limited data.

**Introduction**

Large Language Models (LLMs) have revolutionized natural language processing, demonstrating remarkable capabilities across a wide range of tasks. However, fine-tuning these large models for specific tasks requires substantial computational resources, making it challenging to adapt these models efficiently, especially when working with limited datasets and in resource-constrained environments. Parameter-Efficient Fine-Tuning (PEFT) methods have gained a lot of attention because they make fine-tuning large models accessible and possible. Low-Rank Adaptation (LoRA) has emerged as an efficient PEFT method, enabling fine-tuning large language models on custom tasks while decreasing the number of trainable parameters and hence requiring fewer resources. LoRA works by decomposing pre-trained weight matrices into low-rank matrices and fine-tuning these instead of the original matrix. Although LoRA has proven to be highly effective and promising, it still faces challenges with catastrophic forgetting. Catastrophic forgetting in LLMs is a critical issue where the model loses previously acquired knowledge when fine-tuned on new tasks. It occurs due to the overwriting of previously learned (pre-trained) weights during the fine-tuning process. In LoRA, this often happens as the adapted output can significantly deviate from the original:

$y=xW+xW_{adapted}=x(W+AB)$

where $W \in \mathbb{R}^{m \times n}$ is the original weight matrix, and $AB$ is the low-rank update obtained by multiplying $A \in \mathbb{R}^{m \times r}$ by $B \in \mathbb{R}^{r \times n}$, where $r < n$. This work introduces CURLoRA, a novel approach that applies low-rank adaptation (LoRA) to pre-trained weight matrices using CUR matrix decomposition instead of random initialization of the low-rank A or B matrices. We propose a unique modification to the CUR decomposition process and demonstrate its effectiveness in mitigating catastrophic forgetting while also reducing the number of trainable parameters. While LoRA successfully reduces computational costs by decomposing weight updates into low-rank matrices, it still suffers from catastrophic forgetting. CURLoRA leverages CUR decomposition with inverted probabilities and initializes the U matrix as zero to further mitigate this issue.
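
To make the idea concrete, here is a minimal NumPy sketch of a CUR-style adapter with a zero-initialized U. Several details are assumptions made for illustration and are not confirmed by the text above: selection probabilities are taken proportional to squared column and row norms, "inverted" is implemented as 1/p followed by renormalization, sampling is done without replacement, and the function names are hypothetical.

```python
import numpy as np

def inverted_probs(squared_norms):
    """Turn norm-based probabilities into inverted ones (assumed: 1/p, renormalized)."""
    p = squared_norms / squared_norms.sum()
    inv = 1.0 / (p + 1e-12)
    return inv / inv.sum()

def curlora_factors(W, rank, rng):
    """Pick `rank` columns and rows of W with inverted probabilities; U starts at zero."""
    col_p = inverted_probs((W ** 2).sum(axis=0))
    row_p = inverted_probs((W ** 2).sum(axis=1))
    cols = rng.choice(W.shape[1], size=rank, replace=False, p=col_p)
    rows = rng.choice(W.shape[0], size=rank, replace=False, p=row_p)
    C = W[:, cols]              # (m, rank), kept frozen
    R = W[rows, :]              # (rank, n), kept frozen
    U = np.zeros((rank, rank))  # the only matrix that would be trained
    return C, U, R

rng = np.random.default_rng(0)
m, n, rank = 8, 6, 2
W = rng.normal(size=(m, n))
x = rng.normal(size=(3, m))

C, U, R = curlora_factors(W, rank, rng)
y_base = x @ W
y_adapted = x @ (W + C @ U @ R)  # same form as y = x(W + AB), with AB replaced by CUR

print("outputs identical at initialization:", np.allclose(y_base, y_adapted))
print("trainable parameters, LoRA vs CUR-style:", rank * (m + n), "vs", U.size)
```

Because U is zero at the start, the adapted output equals the base output before any training step, and only the `rank x rank` entries of U would receive updates.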

**Conclusion**

This paper introduced CURLoRA, a novel approach to fine-tuning large language models that leverages CUR matrix decomposition to mitigate catastrophic forgetting and improve computational efficiency. Through theoretical analysis and empirical experiments, we demonstrated that CURLoRA outperforms standard LoRA in maintaining model stability and performance across tasks while significantly reducing the number of trainable parameters. Key contributions of this work include:

• A novel modification to CUR decomposition using inverted probabilities for column and row selection and initializing the U matrix as zeros. Sampling columns and rows based on inverted probabilities distinguishes CURLoRA from traditional CUR, offering better stability and performance.

• Theoretical analysis of how CURLoRA addresses catastrophic forgetting.

• Empirical evidence of CURLoRA’s effectiveness across multiple tasks and evaluation metrics with multiple models.

Our results suggest that CURLoRA is a promising approach for efficient and stable fine-tuning of large language models, particularly in scenarios with limited fine-tuning data. CURLoRA’s approach to mitigating catastrophic forgetting has broad implications for continual learning in NLP and beyond. Future research could explore its integration with other adaptation techniques to enhance model robustness.

In [43]:
qa_pairs = [
    {
        "question": "What core problem does CURLoRA seek to address in the context of large language model fine-tuning?",
        "answer": "CURLoRA addresses two main challenges: mitigating catastrophic forgetting during continual fine-tuning and reducing the number of trainable parameters required for adaptation."
    },
    {
        "question": "How does CURLoRA differ from standard LoRA in its matrix decomposition strategy?",
        "answer":  "CURLoRA replaces the traditional random initialization in LoRA with CUR matrix decomposition, using inverted probabilities for selecting columns and rows and initializing the U matrix as zeros—this serves as a form of implicit regularization."
    },
    {
        "question": "What is the purpose of initializing the U matrix as a zero matrix in CURLoRA?",
        "answer": "Because U starts as a zero matrix, the CUR update is zero at initialization, so fine-tuning begins exactly from the pretrained weights. Since U is also the only matrix that is fine-tuned, deviations from the pretrained model stay limited, which reduces the risk of catastrophic forgetting."
    },
    {
        "question": "Why is catastrophic forgetting a problem in LoRA-based fine-tuning?",
        "answer": "In LoRA, the adapted weight output can deviate significantly from the original weight matrix due to low-rank updates, which may overwrite previously learned knowledge and result in forgetting prior tasks."
    },
    {
        "question": "What is the role of inverted probability sampling in CURLoRA’s CUR decomposition?",
        "answer": "Inverted probability sampling favors the columns and rows that standard CUR sampling would be least likely to select. According to the authors, this acts as an implicit regularization and gives better stability and performance than traditional CUR."
    },
    {
        "question": "How does CURLoRA improve computational efficiency compared to traditional fine-tuning methods?",
        "answer": "CURLoRA reduces the number of trainable parameters by fine-tuning only the U matrix derived from CUR decomposition, requiring fewer resources while maintaining model performance."
    },
    {
        "question": "In what types of scenarios does CURLoRA particularly excel, according to the authors?",
        "answer": "CURLoRA is especially effective in resource-constrained settings and when fine-tuning on limited datasets, where it maintains performance without overwriting prior knowledge."
    },
    {
        "question": "What empirical evidence supports the claims made about CURLoRA?",
        "answer": "Experiments across multiple datasets show that CURLoRA outperforms standard LoRA in mitigating catastrophic forgetting, maintaining model stability, and preserving base model perplexity across continual tasks."
    },
    {
        "question": "Summarize the CURLoRA paper in simple terms for a non-expert audience.",
        "answer": "CURLoRA is a new way to fine-tune large language models that helps them remember what they’ve already learned while adapting to new tasks. It uses a mathematical trick called CUR decomposition to update only a small part of the model, making it both efficient and stable."
    },
    {
        "question": "What are the practical implications of CURLoRA for continual learning in NLP?",
        "answer": "CURLoRA offers a pathway to fine-tune models incrementally without sacrificing previous knowledge, making it ideal for applications that require models to stay updated over time without retraining from scratch."
    }
]

In [45]:
title = "CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting Mitigation"
generate_QA_pair("03", 2024, "curlora", title, qa_pairs)

### **04: DELIFT: Data Efficient Language model Instruction Fine Tuning**

**Abstract**

Fine-tuning large language models (LLMs) is crucial for task specialization but often becomes resource-intensive due to redundant or uninformative data. Existing data selection methods typically rely either on computationally expensive gradient-based metrics or static embeddings that fail to adapt dynamically to the model’s evolving state, thus limiting their practical effectiveness. To address this, we propose DELIFT (Data Efficient Language model Instruction Fine-Tuning), leveraging a novel, computationally efficient utility metric inspired by In-Context Learning (ICL). Our ICL-based metric measures the informational value of each data sample by quantifying its effectiveness as an in-context example in improving model predictions for other samples, reflecting its actual contribution relative to the model’s current state. Integrated with tailored submodular optimization methods, DELIFT systematically selects diverse, informative subsets optimized specifically for each fine-tuning stage: instruction tuning, task-specific adaptation, and continual fine-tuning. Experimental results across multiple datasets and model scales show DELIFT reduces fine-tuning data requirements by up to 70% without compromising performance, consistently outperforming existing methods by up to 26% in effectiveness and efficiency.

**Introduction**

Large Language Models (LLMs) have become indispensable for solving a variety of natural language processing tasks, ranging from question answering and summarization to complex dialogue and reasoning. Despite their remarkable adaptability, fine-tuning LLMs often requires enormous computational resources and time, especially when a significant portion of the training data is either redundant or uninformative. This challenge grows more critical with increasing model and dataset sizes, posing a key limitation to the broader deployment of LLMs.

Existing data selection methods generally fall under two paradigms: (1) static embedding-based approaches that compute sample similarities without reflecting the model’s evolving state, and (2) gradient-based methods that offer more model-specific feedback but often entail prohibitive computational overhead, especially for large-scale models. Although both paradigms can yield initial benefits, they often fail to account for how a model’s knowledge shifts over multiple fine-tuning phases: **(1) Instruction Tuning**, which enhances the model’s ability to follow diverse instructions; **(2) Task-Specific Fine-Tuning**, which focuses on refining domain expertise; and **(3) Continual Fine-Tuning**, which incrementally incorporates new knowledge while mitigating catastrophic forgetting.
Thus, a natural question arises:

***Can we develop a unified, computationally efficient data selection framework that adapts to all stages of fine-tuning and maximizes model performance while minimizing data redundancy?***

In this paper, we introduce DELIFT (Data-Efficient Language Model Instruction Fine-Tuning), a single-stop solution designed to address data selection across all fine-tuning stages within a single framework. DELIFT is grounded in information theory yet uses the practical intuition of in-context examples to assess the ’information gain’ of each data sample relative to the current state of a model. Specifically, we propose a new utility metric that captures how effectively one sample improves the model’s prediction of another. By combining these pairwise utilities with submodular optimization, DELIFT generates diverse, nonredundant subsets uniquely tailored to each fine-tuning phase.
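
The combination of pairwise utilities and submodular selection can be illustrated with a small greedy routine. This is a sketch under explicit assumptions: the text only says "submodular optimization", so the facility-location objective used here is an illustrative choice rather than the authors' confirmed method; the utility matrix is made-up toy data (computing real utilities would require querying a model); and `greedy_facility_location` is a hypothetical name.

```python
import numpy as np

def greedy_facility_location(utility, budget):
    """Greedily pick `budget` samples that maximize total coverage.

    utility[i, j] is how much sample i helps the model predict sample j.
    Coverage of a set S is sum_j max_{i in S} utility[i, j].
    """
    n = utility.shape[0]
    selected = []
    best_cover = np.zeros(n)  # best utility each sample j receives so far
    for _ in range(budget):
        # Marginal gain of adding each candidate i to the current selection.
        gains = np.maximum(utility, best_cover).sum(axis=1) - best_cover.sum()
        if selected:
            gains[selected] = -np.inf  # never pick the same sample twice
        pick = int(np.argmax(gains))
        selected.append(pick)
        best_cover = np.maximum(best_cover, utility[pick])
    return selected

# Toy utilities: samples 0 and 1 are nearly redundant, sample 2 is unrelated.
toy_utility = np.array([
    [1.0, 0.9, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

subset = greedy_facility_location(toy_utility, budget=2)
print(subset)
```

In this toy case the routine skips the redundant sample and keeps one representative of each group, which is the kind of diverse, nonredundant subset the paragraph above describes.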

We evaluated DELIFT on various tasks and model scales, consistently observing that it can prune up to 70% of the training data without hurting performance - and often improving it - outperforming existing methods by up to 26% in efficiency and effectiveness. In doing so, we show that careful utility-driven data selection can be far more effective than sheer data volume, opening the door to more resource-friendly and targeted fine-tuning.

Our primary contributions are as follows:

1. A unified information-theoretic data selection paradigm that leverages pairwise utilities grounded in conditional pointwise mutual information, making it adaptable to instruction tuning, task-specific adaptation, and continual fine-tuning.
2. A single-stop, submodular optimization framework that integrates these utilities to provide diverse, high-value subsets for each fine-tuning stage without incurring prohibitive computation.
3. Extensive empirical validation showing up to 70% data reduction with minimal (and sometimes zero) performance loss across multiple domains, demonstrating substantial gains in both efficacy and efficiency.

The remainder of this paper is organized as follows. Section 2 reviews prior work on data-efficient strategies for fine-tuning LLMs and situates our approach within the literature. Section 3 introduces our information-theoretic utility metric and describes how it integrates with submodular optimization to enable data selection across diverse fine-tuning stages. Section 4 presents comprehensive experiments demonstrating the effectiveness and efficiency of our framework on multiple tasks and models. Finally, Section 5 discusses the broader implications of our results, outlines limitations, and suggests directions for future research.

**Related Work**

**Data Subset Selection for Deep Neural Networks.** Selecting an informative subset of training samples is a longstanding strategy to reduce computational costs and enhance model generalization. Model-Independent Approaches. Traditional model-independent techniques, such as clustering or distance metrics on pre-trained embeddings, capture broad semantic similarities but do not reflect the model’s changing state, limiting their effectiveness during iterative fine-tuning. Model-Dependent Approaches. Model-dependent methods incorporate the model’s evolving knowledge by analyzing gradients or loss values, often outperforming static approaches. However, performing gradient or influence estimations at scale becomes prohibitively expensive for large models. Techniques like LESS alleviate some overhead via parameter-efficient fine-tuning (e.g., LoRA), yet still incur repeated gradient or influence calculations that scale poorly with dataset size. Subset Selection with LLM Feedback. Another emerging direction leverages LLM feedback to score or filter training samples. For instance, SelectIT employs self-reflection prompts to rate data quality, while filtering approaches using GPT-4 rely on external heuristics. Though these provide a form of model-aware sampling, they typically lack a principled theoretical grounding. In addition, all these approaches primarily target a single fine-tuning stage, limiting their adaptability for instruction tuning, task-specific adaptation, or continual learning.

**Our Contribution.** In contrast, we present a unified, information-theoretic framework that operates effectively across all fine-tuning stages: instruction tuning, task-specific adaptation, and continual fine-tuning. Our novel utility metric quantifies how one data point aids the prediction of another, mirroring the model’s evolving knowledge. Integrated within a submodular selection paradigm, this approach balances diversity, coverage, and informativeness throughout the entire fine-tuning pipeline. As a result, we bridge the gap left by existing methods that are either restricted to a single phase or computationally infeasible at scale, demonstrating consistent performance improvements and notable efficiency gains.

**Conclusion**

In this paper, we introduced DELIFT, a novel approach to data-efficient fine-tuning of large language models by employing a versatile pairwise utility metric combined with submodular optimization techniques for optimal data selection. Empirical evaluations showed that DELIFT can reduce data and computational requirements by up to 70% while achieving performance comparable to the full dataset, and outperforming existing data selection methods by up to 26% in effectiveness. These results suggest that DELIFT offers a promising method for improving the accessibility of LLM adaptation, especially for resource-constrained scenarios. However, our approach has limitations, including potential sensitivity to the quality and diversity of initial data and the risk of bias amplification inherent in the selected data. Future work will explore integrating DELIFT with data augmentation techniques to improve robustness, incorporating fairness constraints to mitigate biases, and extending the approach to emerging model architectures and multimodal learning. Our ongoing efforts are directed toward ensuring that DELIFT contributes to responsible and equitable AI development while maximizing efficiency.

In [47]:
qa_pairs = [
  {
    "question": "What core limitation of existing data selection methods does DELIFT aim to overcome?",
    "answer": "DELIFT addresses the limitations of existing data selection methods which rely either on computationally expensive gradient-based metrics or static embeddings that fail to adapt to the model’s evolving state."
  },
  {
    "question": "What is the key insight behind the utility metric proposed in DELIFT?",
    "answer": "The utility metric measures the informational value of a data sample by evaluating how effectively it improves the model’s prediction for other samples, inspired by in-context learning and grounded in conditional pointwise mutual information."
  },
  {
    "question": "How does DELIFT ensure data efficiency across different fine-tuning stages?",
    "answer": "DELIFT integrates its utility metric with submodular optimization to select diverse, informative subsets tailored to three fine-tuning stages: instruction tuning, task-specific adaptation, and continual fine-tuning."
  },
  {
    "question": "How does DELIFT differ from traditional model-independent and model-dependent data selection methods?",
    "answer": "DELIFT offers a unified, model-aware framework that adapts to the evolving state of the model and operates across all fine-tuning stages, unlike traditional approaches that are either static or computationally expensive and limited to specific phases."
  },
  {
    "question": "What empirical benefits does DELIFT offer over baseline methods?",
    "answer": "DELIFT reduces training data by up to 70% without sacrificing model performance and outperforms existing data selection methods by up to 26% in both effectiveness and efficiency across multiple datasets and model scales."
  },
  {
    "question": "What are the primary contributions of the DELIFT framework?",
    "answer": "The main contributions include: a unified data selection framework grounded in information theory, an efficient submodular optimization pipeline, and empirical evidence of significant data and computational savings without performance degradation."
  },
  {
    "question": "What are the limitations of DELIFT noted by the authors?",
    "answer": "The authors acknowledge DELIFT’s sensitivity to the quality and diversity of the initial dataset and the risk of bias amplification in the selected data, suggesting future work should incorporate fairness constraints and data augmentation."
  },
  {
    "question": "What problem does DELIFT solve in the context of fine-tuning large language models?",
    "answer": "DELIFT solves the problem of inefficient and resource-heavy fine-tuning by intelligently selecting the most informative data samples, thereby reducing redundancy and computational cost while maintaining or improving performance."
  },
  {
    "question": "Why is data selection important for instruction fine-tuning?",
    "answer": "Data selection is crucial for instruction fine-tuning because the process is resource-intensive, and removing redundant or low-value samples can significantly reduce training cost and time without degrading model quality."
  },
  {
    "question": "Can DELIFT be applied to all stages of LLM fine-tuning?",
    "answer": "Yes, DELIFT is explicitly designed to adapt to all stages of fine-tuning—including instruction tuning, task-specific adaptation, and continual fine-tuning—making it a comprehensive data selection solution."
  }
]

In [49]:
title = "DELIFT: Data Efficient Language model Instruction Fine Tuning"
generate_QA_pair("04", 2025, "delift", title, qa_pairs)
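
After writing the files, it can be useful to read one back and confirm it has the expected shape. The sketch below assumes the JSON layout produced by `generate_QA_pair` (keys `paper_id`, `title`, `qa_pairs`); the helper name `load_qa_file` is hypothetical, and a temporary directory with a placeholder record is used so the example runs without Google Drive.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"paper_id", "title", "qa_pairs"}

def load_qa_file(path):
    """Load one QA JSON file and check that the expected top-level keys exist."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"{path} is missing keys: {sorted(missing)}")
    return data

# Write a placeholder file to a temporary folder, then read it back.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "04_2025_delift.json")
demo_record = {
    "paper_id": "04_2025_delift",
    "title": "placeholder title",
    "qa_pairs": [{"question": "placeholder question", "answer": "placeholder answer"}],
}
with open(demo_path, "w", encoding="utf-8") as f:
    json.dump(demo_record, f, indent=4)

loaded = load_qa_file(demo_path)
print(loaded["paper_id"], len(loaded["qa_pairs"]))
```

In the notebook itself, the same function could be pointed at `os.path.join(QA_DIR, filename)` for each generated file.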

## Step 8: Downloading the Notebook

In [None]:
%cd /content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer/notebooks
files.download("03_qa_curation.ipynb")

/content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer/notebooks