## Step 1: Mounting Google Drive

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Navigate to the repo folder
%cd /content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer

# List repo contents
!ls

Mounted at /content/drive
/content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer
data  deployment  LICENSE  notebooks  project_plan.md  qa_pairs  README.md  scripts


## Step 2: Importing Libraries

In [None]:
import os
import json
from pathlib import Path
import re
from google.colab import files

## Step 3: Setting Paths

In [None]:
BASE_DIR = "/content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer"
PDF_DIR = os.path.join(BASE_DIR, "data", "QA_corpus")
QA_DIR = os.path.join(BASE_DIR, "qa_pairs")

os.makedirs(QA_DIR, exist_ok=True)

## Step 4: Creating Function to Generate QA Pairs

In [None]:
def generate_QA_pair(paper_number, publication_year, short_id, title, qa_pairs):
  data = {
    "paper_id": f"{paper_number}_{publication_year}_{short_id}",
    "title": title,
    "qa_pairs": qa_pairs
  }
  filename = f"{paper_number}_{publication_year}_{short_id}.json"
  with open(os.path.join(QA_DIR, filename), "w") as f:
    json.dump(data, f, indent=4)
  return
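
A quick way to confirm that a file written this way is well-formed is to round-trip it through `json.load`. The sketch below is self-contained and only illustrative: `write_qa_file` is a hypothetical variant of the function above that takes the output directory as an argument (`out_dir` is not a parameter in the notebook), and it writes to a temporary directory rather than to Drive. `check_qa_file` is likewise a hypothetical helper.

```python
import json
import os
import tempfile


def write_qa_file(out_dir, paper_number, publication_year, short_id, title, qa_pairs):
    """Write one paper's QA pairs to <out_dir>/<paper_id>.json and return the path."""
    paper_id = f"{paper_number}_{publication_year}_{short_id}"
    path = os.path.join(out_dir, f"{paper_id}.json")
    data = {"paper_id": paper_id, "title": title, "qa_pairs": qa_pairs}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)
    return path


def check_qa_file(path):
    """Reload a QA file and verify each pair has non-empty question and answer strings."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    for pair in data["qa_pairs"]:
        for key in ("question", "answer"):
            if not isinstance(pair.get(key), str) or not pair[key].strip():
                raise ValueError(f"{path}: missing or empty '{key}'")
    return data


with tempfile.TemporaryDirectory() as tmp:
    sample = [{"question": "What is LoRA?", "answer": "A low-rank adaptation method."}]
    path = write_qa_file(tmp, "99", 2024, "demo", "Demo Paper", sample)
    loaded = check_qa_file(path)
    print(loaded["paper_id"], len(loaded["qa_pairs"]))
```

Running a check like this after each `generate_QA_pair` call would catch an empty answer or a misspelled key before the file is used downstream.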

## Step 5: Listing PDFs in QA Corpus Folder

In [None]:
# List PDFs in QA corpus folder
pdf_files = sorted([f for f in os.listdir(PDF_DIR) if f.endswith(".pdf")])

print("Available Papers:")
for i, file in enumerate(pdf_files):
    print(f"[{i}] {file}")

# Select your paper here
paper_index = 0  # Change this index to select a different paper
pdf_path = os.path.join(PDF_DIR, pdf_files[paper_index])
pdf_name = os.path.splitext(pdf_files[paper_index])[0]

print(f"\n Selected: {pdf_files[paper_index]}")

Available Papers:
[0] ADALORA:_ADAPTIVE_BUDGET_ALLOCATION_FOR_PARAMETER-EFFICIENT_FINE-TUNING.pdf
[1] AutoLoRA:_Automatically_Tuning_Matrix_Ranks_in_Low-Rank_Adaptation_Based_on_Meta_Learning.pdf
[2] Balancing_Continuous_Pre-Training_and_Instruction_Fine-Tuning:
__Optimizing_Instruction-Following_in.pdf
[3] CURLoRA:_Stable_LLM_Continual_Fine-Tuning_and_Catastrophic_Forgetting
__Mitigation.pdf
[4] DELIFT:_Data_Efficient_Language_model_Instruction_Fine_Tuning.pdf
[5] FINETUNED_LANGUAGE_MODELS_ARE_ZERO-SHOT_LEARNERS.pdf
[6] Few-Shot_Parameter-Efficient_Fine-Tuning_is_Better_and_Cheaper_than_In-Context_Learning.pdf
[7] Instruction_Tuning_for_Large_Language_Models:_A_Survey.pdf
[8] LLAMA-ADAPTER:_EFFICIENT_FINE-TUNING_OF_LARGE_LANGUAGE_MODELS_WITH_ZERO-INITIALIZED_ATTENTION.pdf
[9] LLM-Adapters:_An_Adapter_Family_for_Parameter-Efficient_Fine-Tuning_of_Large_Language_Models.pdf
[10] LORA:_LOW-RANK_ADAPTATION_OF_LARGE_LANGUAGE_MODELS.pdf
[11] LoRA_vs_Full_Fine-tuning:_An_Illusion_of_Equivalen

## Step 6: Generating QA pairs

### **00: AdaLoRA — Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning**

**Abstract**  
Fine-tuning large pre-trained language models on downstream tasks has become an important paradigm in NLP. However, common practice fine-tunes all of the parameters in a pre-trained model, which becomes prohibitive when a large number of downstream tasks are present. Therefore, many fine-tuning methods are proposed to learn incremental updates of pre-trained weights in a parameter efficient way, e.g., low-rank increments. These methods often evenly distribute the budget of incremental updates across all pre-trained weight matrices, and overlook the varying importance of different weight parameters. As a consequence, the fine-tuning performance is suboptimal.

To bridge this gap, we propose **AdaLoRA**, which adaptively allocates the parameter budget among weight matrices according to their importance score. In particular, AdaLoRA parameterizes the incremental updates in the form of **singular value decomposition (SVD)**. Such a novel approach allows us to effectively **prune the singular values** of unimportant updates, which is essentially to reduce their parameter budget but circumvent intensive exact SVD computations. We conduct extensive experiments with several pre-trained models on **natural language processing**, **question answering**, and **natural language generation** to validate the effectiveness of AdaLoRA. Results demonstrate that AdaLoRA manifests notable improvement over baselines, especially in **low budget** settings.
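
The idea of pruning singular values in an SVD-form update can be illustrated with a small NumPy sketch. This is not the authors' implementation: the update is written as `P @ diag(lam) @ Q`, and the magnitude of each entry of `lam` stands in for the importance score, which is an assumption made purely for illustration. The names `P`, `lam`, `Q`, and `prune_to_budget` are hypothetical.

```python
import numpy as np


def prune_to_budget(lam, budget):
    """Zero out all but the `budget` largest-magnitude entries of `lam`."""
    lam = np.asarray(lam, dtype=float)
    if budget >= lam.size:
        return lam.copy()
    # argsort is ascending, so the last `budget` indices are the largest entries
    keep = np.argsort(np.abs(lam))[lam.size - budget:]
    pruned = np.zeros_like(lam)
    pruned[keep] = lam[keep]
    return pruned


rng = np.random.default_rng(0)
d_out, d_in, r = 6, 5, 4

P = rng.normal(size=(d_out, r))          # left factor
Q = rng.normal(size=(r, d_in))           # right factor
lam = np.array([0.9, 0.05, -0.6, 0.01])  # entries playing the role of singular values

lam_pruned = prune_to_budget(lam, budget=2)
delta_w = P @ np.diag(lam_pruned) @ Q    # incremental update after pruning

print(lam_pruned)                        # only the entries 0.9 and -0.6 remain nonzero
print(np.linalg.matrix_rank(delta_w))    # rank of the pruned update
```

Because pruning acts on the entries of `lam` directly, no SVD of `delta_w` is ever computed, which mirrors the abstract's point about avoiding exact SVD computations.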

**Introduction**

Large Language Models (LLMs) (Dai et al., 2019; Radford et al., 2019; Zhang et al., 2022; Raffel et al., 2020; Devlin et al., 2018) have stimulated widespread attention in both academia and industry. Driven by massive corpora and advanced hardware, LLMs exhibit remarkable understanding and generative ability, propelling language tasks to a higher level. Recently, significant progress has been made on instruction-following models, e.g., ChatGPT (OpenAI, 2023a) and GPT-4 (OpenAI, 2023b), which follow language instructions and generate contextual responses. However, the further prevalence of instruction models is largely impeded by the closed-source restriction and high development costs. To alleviate this, Stanford Alpaca (Taori et al., 2023) proposes to fine-tune an open-source LLM, i.e., LLaMA (Touvron et al., 2023) into an instruction-following model, which is affordable and replicable. Starting from 175 human-written instruction-output pairs (Wang et al., 2022a), Alpaca leverages GPT-3.5 (Brown et al., 2020) to expand the training data to 52K in a self-instruct manner. Supervised by this, Alpaca fine-tunes the entire 7B parameters in LLaMA, producing an exceptional instruction model that performs similarly to GPT-3.5. Despite Alpaca’s effectiveness, a complete fine-tuning of large-scale LLaMA is still time-consuming, computation-intensive, and cumbersome to transfer to different downstream scenarios.

In this paper, we introduce **LLaMA-Adapter**, an efficient fine-tuning method that adapts LLaMA into a well-performed instruction-following model. Trained by Alpaca’s instruction-output data, our approach freezes the entire LLaMA model, and proposes a zero-initialized attention mechanism with superior resource efficiency. Specifically, in LLaMA’s higher transformer layers, we append a set of learnable adaption prompts as prefixes to the word tokens. Then, to avoid the noise from randomly initialized prompts at the early training stage, we equip the frozen self-attention layers with a learnable gating factor. The gating mechanism is initialized by zeros, and controls the feature interaction between prompt and word tokens, within the process of attention calculation. Such a strategy can first preserve the original knowledge in LLaMA, and progressively inject the new instructional signals during training. This contributes to a more stable learning process and better
instruction-following capacity of the final model.
Overall, our LLaMA-Adapter exhibits four main characteristics, as shown in Figure 1.

• **1.2M Parameters**. Instead of updating the full 7B parameters, we freeze the pre-trained
LLaMA and only learn the zero-initialized attention mechanism with 1.2M parameters. This,
however, reveals comparable instruction-following proficiency with the 7B Alpaca.

• **One-hour Fine-tuning**. Thanks to our lightweight adaption modules with zero-initialized
gating, the training convergence of LLaMA-Adapter costs less than one hour on 8 A100
GPUs, which are three times faster than Alpaca.

• **Plug with Expertise**. For different scenarios, it is flexible to insert their respective adapters to endow LLaMA with different expert knowledge or new modality input. Thus, it suffices to store a 1.8M adapter within each context, other than a complete copy of the 13G LLaMA.
• **Multi-modal Reasoning**. Besides language instruction, our approach can also incorporate an image encoder via zero-initialized attention to become a multi-modal LLM. Compared to concurrent works (Liu et al., 2023b; Zhu et al., 2023), LLaMA-Adapter showcases higher tuning efficiency with competitive reasoning capacity on MME (Fu et al., 2023), MMBench (Liu et al., 2023c), and LVLM-eHub (Xu et al., 2023) benchmarks.

In addition to instruction tuning, our zero-initialized attention can be generalized to traditional vision and language tasks for parameter-efficient fine-tuning. We apply our approach to the pre-trained ViT (Dosovitskiy et al., 2020), RoBERTa (Liu et al., 2019), and CLIP (Radford et al., 2021), respectively for fine-tuning vision, language, and vision-language models. On a wide range of downstream tasks, we demonstrate the effectiveness of our proposed method for traditional tasks.

In [None]:
qa_pairs = [
    {
        "question": "What problem does AdaLoRA aim to solve in the context of fine-tuning large language models?",
        "answer": "AdaLoRA addresses the inefficiency of uniformly distributing the parameter budget across all weight matrices during fine-tuning. It proposes an adaptive allocation strategy that prioritizes important parameters, thus improving performance under constrained budgets."
    },
    {
        "question": "How does AdaLoRA allocate the parameter budget among weight matrices?",
        "answer": "AdaLoRA uses an importance scoring mechanism to assign more parameters to critical weight matrices and fewer to less important ones. This allocation is realized through a low-rank approximation using singular value decomposition (SVD)."
    },
    {
        "question": "What advantage does AdaLoRA's use of singular value decomposition provide?",
        "answer": "By representing incremental updates via SVD, AdaLoRA can prune unimportant singular values, thus reducing computational overhead and improving parameter efficiency without performing exact, expensive SVD computations."
    },
    {
        "question": "Why is full fine-tuning of large pre-trained models often impractical in real-world applications involving many downstream tasks?",
        "answer": "Full fine-tuning requires updating and storing a separate copy of the model for each downstream task, which becomes prohibitively expensive in terms of memory and computation, especially for large models like BERT, T5, or GPT-3 that have hundreds of millions to billions of parameters."
    },
    {
        "question": "What are the two primary approaches to parameter-efficient fine-tuning described in the introduction?",
        "answer": "The first approach involves adding small neural modules—like adapters, prompts, or prefixes—to a frozen base model and fine-tuning only those additions. The second approach models the incremental update of pre-trained weights in a parameter-efficient way without altering the model architecture, using methods like diff pruning or LoRA."
    },
    {
        "question": "How does LoRA improve parameter efficiency in fine-tuning compared to full fine-tuning?",
        "answer": "LoRA improves efficiency by representing the incremental updates as a low-rank matrix—specifically, the product of two smaller matrices. This significantly reduces the number of trainable parameters while preserving or even improving performance, and it avoids the complexity of handling sparse matrices like in diff pruning."
    },
    {
        "question": "What limitation of LoRA does AdaLoRA aim to overcome?",
        "answer": "LoRA uses a fixed rank for all weight matrices during fine-tuning, which assumes all matrices are equally important. AdaLoRA addresses this by dynamically allocating different parameter budgets to different weight matrices based on their relative importance, allowing more effective use of limited resources."
    }
]

In [None]:
title = "AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning"
generate_QA_pair("00", 2023, "adalora", title, qa_pairs)

### **01: AutoLoRA: Automatically Tuning Matrix Ranks in Low-Rank Adaptation Based on Meta Learning**

**Abstract**

Large-scale pretraining followed by task-specific finetuning has achieved great success in various NLP tasks. Since finetuning all parameters of large pretrained models poses substantial computational and memory challenges, several efficient finetuning methods have been developed. Among them, low-rank adaptation (LoRA), which finetunes low-rank incremental update matrices on top of frozen pre-trained weights, has proven particularly effective. Nonetheless, LoRA’s uniform rank assignment across all layers, along with its reliance on an exhaustive search to find the best rank, leads to high computation costs and suboptimal finetuning performance. To address these limitations, we introduce AutoLoRA, a meta learning based framework for automatically identifying the optimal rank of each LoRA layer. AutoLoRA associates each rank-1 matrix in a low-rank update matrix with a selection variable, which determines whether the rank-1 matrix should be discarded. A meta learning based method is developed to learn these selection variables. The optimal rank is determined by thresholding the values of these variables. Our comprehensive experiments on natural language understanding, generation, and sequence labeling demonstrate the effectiveness of AutoLoRA.

**Introduction**

Large Language Models (LLMs) have demonstrated state-of-the-art performance across a variety of NLP tasks, spanning from Natural Language Understanding (NLU) to Natural Language Generation (NLG), a trajectory highlighted by the success of models like ChatGPT. Their success largely stems from a two-stage process: initial pretraining on vast amounts of unlabeled texts, followed by finetuning on specific downstream tasks. However, as models scale up, for instance transitioning from RoBERTa-large’s 355 million parameters to GPT-3’s staggering 175 billion parameters, finetuning becomes highly expensive in computation.

To address this challenge, many efficient finetuning methods have been developed. For instance, the Adapters method inserts lightweight layers (called adapters) into pretrained networks. During fine-tuning, only these adapters are updated while the pretrained layers are kept frozen. One limitation of this method is that the adapters incur additional computation overhead during inference. Another approach, prefix tuning, introduces trainable prefix parameters which are prepended to the input sequence while making the pretrained model parameters frozen. Nevertheless, determining the optimal length of the prefix can be tricky. A prefix that is too short cannot capture enough information, while an overlong prefix may largely reduce the maximum length of the input sequence. To address these limitations, LoRA proposes to add low-rank incremental update matrices to pretrained weight matrices. During finetuning, only the incremental matrices are trained while the pretrained ones are frozen. The low-rank parameterization significantly reduces the number of finetuning parameters.

While achieving parameter-efficient finetuning without increasing inference costs, LoRA has two limitations. First, the update matrices at different layers share the same rank, without considering the varying properties across layers. Different layers in a pretrained model have varying importance to a downstream task and should be adapted differently, which requires the number of trainable parameters to be layer-specific. Employing a uniform rank across all layers compromises this purpose, which renders some layers to be under-parameterized (leading to suboptimal fine-tuning performance) while others unnecessarily over-parameterized (leading to computation inefficiency). Second, obtaining the optimal rank in LoRA typically involves an extensive manual hyperparameter search, which is time-consuming and poses scalability issues.

To address the aforementioned limitations of LoRA, we introduce the AutoLoRA framework to automatically determine the optimal rank for each LoRA layer. In AutoLoRA, we first decompose an update matrix into the product of two low-rank
matrices (with rank k), in alignment with the LoRA methodology. This product can be expressed as the summation of k rank-1 matrices. For each rank-1
matrix, we assign a continuous trainable selection variable α ∈ [0, 1] indicating the matrix’s relative importance in the summation. After learning, if α is close to zero, the corresponding rank-1 matrix is removed from the summation. These selection variables effectively control the rank of an update
matrix. Learning α directly on a training dataset together with the update matrices can result in over-fitting, and the network learned in this way lacks generalization ability. To mitigate this problem, we formulate the search process of α as a meta learning problem. First, we finetune the weights in the rank-1 matrices on a training dataset. Second, we optimize the α values by minimizing the loss on a validation dataset. These two steps iterate until convergence. Subsequently, we derive the optimal rank of each LoRA layer by thresholding the learned α values. Once the optimal rank is identified for each layer, the weights in the low-rank update matrices are retrained on the combination of training and validation data. An overview of our proposed method is illustrated in Figure 1.

The major contributions of this paper are summarized as follows.

• We propose AutoLoRA, a meta learning based approach that can automatically determine the optimal and layer-specific ranks of update matrices, alleviating the burden of manually tuning them as in LoRA.

• Extensive experiments on natural language understanding and generation tasks demonstrate the effectiveness of AutoLoRA.

**Conclusions and Future Work**

In this paper, we introduce AutoLoRA, a meta learning based framework designed to automatically search for the optimal ranks for LoRA layers. Our method associates each rank-1 matrix in LoRA updates with a selection variable and formulates the rank-tuning problem as optimizing the selection variables via meta learning. Thresholding is applied to derive discrete rank values from continuous selection variables and retraining is performed to bridge the gap incurred by thresholding. Comprehensive experiments show the efficacy of AutoLoRA across various NLP tasks. Similar to the LoRA method, the LoRA layers in AutoLoRA are manually specified, which may be suboptimal. As a future work, we will investigate how to automatically select LoRA layers, by developing a meta learning framework similar to
that in Eq.(5).
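
The decomposition described in the introduction above can be checked numerically. The sketch below is an illustration rather than the authors' code: it shows that the product of two rank-k factors equals a sum of k rank-1 matrices, weights each term by a selection variable α, and thresholds α to obtain a rank. The α values and the threshold of 0.1 are made-up numbers, and the meta-learning loop that would learn α is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, k = 6, 5, 4

B = rng.normal(size=(d_out, k))  # columns b_j
A = rng.normal(size=(k, d_in))   # rows a_j

# The product B @ A equals the sum of k rank-1 outer products b_j a_j^T.
rank1_terms = [np.outer(B[:, j], A[j, :]) for j in range(k)]
assert np.allclose(B @ A, sum(rank1_terms))

# One selection variable per rank-1 term (illustrative values, not learned here).
alpha = np.array([0.70, 0.02, 0.25, 0.03])
weighted_update = sum(a * term for a, term in zip(alpha, rank1_terms))

# Weighting the terms is the same as inserting diag(alpha) between the factors.
assert np.allclose(weighted_update, B @ np.diag(alpha) @ A)

# Thresholding alpha decides which rank-1 terms survive.
threshold = 0.1  # assumed value for illustration
keep = alpha > threshold
selected_rank = int(keep.sum())

# Update restricted to the retained terms, as it would look before retraining.
pruned_update = B[:, keep] @ A[keep, :]

print(selected_rank)                         # number of alphas above the threshold
print(np.linalg.matrix_rank(pruned_update))  # rank of the retained update
```

In the paper's procedure the retained factors would then be retrained on the combined training and validation data. Here they are simply sliced out to show the resulting shape and rank.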

In [None]:
qa_pairs = [
    {
        "question": "What problem does AutoLoRA aim to solve in traditional LoRA-based fine-tuning?",
        "answer": "AutoLoRA addresses two core limitations of traditional LoRA: (1) the uniform rank assignment across all layers, which neglects layer-specific importance, leading to suboptimal or inefficient fine-tuning; and (2) the need for exhaustive manual hyperparameter searches to determine optimal ranks."
    },
    {
        "question": "How does AutoLoRA represent each update matrix in the fine-tuning process?",
        "answer":  "AutoLoRA decomposes each update matrix into the product of two low-rank matrices, consistent with the LoRA methodology. This product is then expressed as a sum of rank-1 matrices, each associated with a trainable selection variable α ∈ [0, 1]."
    },
    {
        "question": "What is the role of the selection variable α in AutoLoRA?",
        "answer": "The α variable controls whether a given rank-1 matrix should be retained. If α is close to zero, the corresponding matrix is discarded. The optimal rank of each layer is determined by thresholding these α values after training."
    },
    {
        "question": "How does AutoLoRA determine the optimal rank of each LoRA layer?",
        "answer": "AutoLoRA introduces selection variables associated with each rank-1 matrix in a low-rank update. These variables are learned via a meta-learning method and used to determine the optimal rank by thresholding their values."
    },
    {
        "question": "Why is learning α directly on the training dataset problematic, and how does AutoLoRA address it?",
        "answer": "Directly learning α from training data can lead to overfitting and poor generalization. AutoLoRA mitigates this by framing α-optimization as a meta learning problem: update weights on training data, then update α by minimizing loss on a separate validation set."
    },
    {
        "question": "What distinguishes AutoLoRA from adapter and prefix tuning methods in terms of inference overhead?",
        "answer": "Unlike adapter and prefix tuning, which introduce additional parameters that incur runtime overhead, AutoLoRA does not increase inference cost. Only the low-rank update matrices are trained, and their integration does not burden inference."
    },
    {
        "question": "How does AutoLoRA improve computational efficiency compared to standard LoRA?",
        "answer": "AutoLoRA avoids exhaustive grid searches for optimal ranks by learning them automatically, layer-wise. This reduces computational cost and ensures that model capacity is allocated where it is most beneficial."

    },
    {
        "question": "What is AutoLoRA, and why is it important?",
        "answer": "AutoLoRA is a meta learning-based framework that optimizes the rank of LoRA layers in large language models. It improves parameter efficiency and fine-tuning performance while eliminating costly manual tuning."
    },
    {
        "question": "How does AutoLoRA relate to the broader challenge of scaling large language models?",
        "answer": "As LLMs grow larger, full fine-tuning becomes increasingly resource-intensive. AutoLoRA offers a scalable alternative by fine-tuning only select low-rank matrices with learned rank assignments, thereby conserving resources without sacrificing performance."
    }
]

In [None]:
def generate_QA_pair(paper_number, publication_year, short_id, title, qa_pairs):
  # paper_number stays in the signature so the five-argument calls in this notebook work
  data = {
    "paper_id": f"{paper_number}_{publication_year}_{short_id}",
    "title": title,
    "qa_pairs": qa_pairs
  }
  filename = f"{paper_number}_{publication_year}_{short_id}.json"
  with open(os.path.join(QA_DIR, filename), "w") as f:
    json.dump(data, f, indent=4)
  return

In [None]:
title = "AutoLoRA: Automatically Tuning Matrix Ranks in Low-Rank Adaptation Based on Meta Learning"
generate_QA_pair("01", 2024, "autolora", title, qa_pairs)

### **02: Balancing Continuous Pre-Training and Instruction Fine-Tuning: Optimizing Instruction-Following in LLMs**

**Abstract**

Large Language Models (LLMs) for public use require continuous pre-training to remain up-to-date with the latest data. The models also need to be fine-tuned with specific instructions to maintain their ability to follow instructions accurately. Typically, LLMs are released in two versions: the Base LLM, pre-trained on diverse data, and the instruction-refined LLM, additionally trained with specific instructions for better instruction following. The question arises as to which model should undergo continuous pre-training to maintain its instruction-following abilities while also staying current with the latest data. In this study, we delve into the intricate relationship between continuous pre-training and instruction finetuning of the LLMs and investigate the impact of continuous pre-training on the instruction following abilities of both the base and its instruction finetuned model. Further, the instruction fine-tuning process is computationally intense and requires a substantial number of hand-annotated examples for the model to learn effectively. This study aims to find the most compute-efficient strategy to gain up-to-date knowledge and instruction-following capabilities without requiring any instruction data and fine-tuning. We empirically prove our findings on the LLaMa 3, 3.1 and Qwen 2, 2.5 family of base and instruction models, providing a comprehensive exploration of our hypotheses across varying sizes of pre-training data corpus and different LLMs settings.

**Introduction**

Recently, autoregressive large language models
(LLM) showed remarkable progress across a wide
range of natural language tasks, natural language understanding, mathematical reasoning, and coding across various domains. These LLMs are pre-trained with a causal language modeling objective to predict the next token(s) in a given sequence until
it is complete, termed as Base models. These base models exhibit a remarkable ability to generate linguistically coherent text, however not necessarily aligning their generations with human preference and needs. Thus, LLMs often require a fine-tuning step, Instruction fine-tuning to bridge the gap between the base model’s fundamental objective and the practical needs of human users termed as Instruction models.

Instruction fine-tuning is an expensive task and generally requires a significant amount of labeled data depending on the type of optimization technique used. This can be expensive and time-consuming to collect and annotate such a big dataset. Algorithmically, it requires training of reward model and RLHF, PPO, DPO fine-tuning which further adds to the complexity of the task.

Parallelly, to stay abreast with the latest data, the base model needs to be either re-pre-trained on a combination of old and newly collected data or continuously
pre-trained on the newly collected data yielding to the new base model. For
example, the LLaMa 3.1 base model is pre-trained with more and high-quality data over the LLaMa 3 base model. Similarly, Qwen
2.5 family base models have more knowledge and improved capabilities over Qwen 2 family models.

Continuous pre-training of the LLM generally results in forgetting previously learned information, several methods have been proposed to maintain the base model performance on previously learned tasks such as Xie et al. (2023); Ibrahim et al. (2024). However, there has been no research focusing on the influence of continuous training on instruction models. As continuous pre-training is vital for acquiring new knowledge, and instruction tuning is necessary to learn instruction following capabilities, it is required to have both the capabilities to any instruction model. This raises a series of natural questions:

**a** What happens to the instruction capabilities
when we continuously pre-train the instruction
model to gain new knowledge?

**b** If lost, how to regain instruction capabilities?

**c** Is it necessary to add resource-extensive
instruction-fine-tuning after updating the
knowledge of the base model?

We approach this problem empirically by studying two different settings. In the first setting, we continuously pre-train the instruction model on a specific dataset and observe its performance on the LLM harness framework from EleutherAI. Whereas in another setting we continuously pre-train the base model with the same data and then instruction fine-tune the continuously pre-trained base model. Finally, we compare the instruction capabilities of instruction models from both settings. Since instruction fine-tuning is an expensive task, we discovered a simple yet efficient approach to regain the instruction capability of the continuous pre-trained base model, given that the instruction-tuned model of the original base model is available. Our main findings and the contributions of this work are as follows:

• Continuous pre-training of an instruction
model results in catastrophic forgetting of the
instruction capabilities and, therefore should
be avoided.

• Continuous pre-training base model and then
instruction tuning preserve both the domain
knowledge and the instruction capabilities.

• Instruction capabilities are portable across the same ancestor models. That is, we can extract the instruction capability by simply subtracting the weight of the base model from the weights of its instruction-tuned model.

• No traditional instruction tuning is required for a continuous pre-trained base model instead the instruction capabilities are ported.

To our knowledge, we are the first ones to systematically conduct this analysis and discover the portability of the instruction capabilities across models from the same ancestor. We empirically prove all our findings on LLaMa 3, LLaMa 3.1, Qwen 2, and Qwen 2.5 families of base and instruct models. We comprehensively test our hypothesis in breadth and depth with varying sizes of pre-training data corpus across different LLMs settings in Section 3.

**Conclusion**

In conclusion, this study delves into the effects of continuous pre-training on base and instruction-tuned large language models (LLMs) and their instruction capabilities. The findings suggest that while continuous pre-training of instruction models may lead to catastrophic forgetting of instruction capabilities, a more efficient approach is to continuously pre-train the base model with new
data, followed by instruction tuning. This method preserves both domain knowledge and instruction capabilities. Interestingly, the study also reveals that instruction capabilities are transferable across models from the same ancestor, eliminating the need for additional instruction tuning for a continuously pre-trained base model. We empirically demonstrated this analysis on the LLaMa 3 and LLaMa 3.1 family of base and instruction models.
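
The weight arithmetic behind this transfer (subtracting the base model's weights from those of its instruction-tuned model, as the introduction describes) can be sketched with plain NumPy arrays standing in for model checkpoints. This is a toy under stated assumptions: real checkpoints would be framework state dicts, the parameter names are invented, and the function names `instruction_residual` and `apply_residual` are hypothetical.

```python
import numpy as np


def instruction_residual(base, instruct):
    """Per-tensor difference: instruction-tuned weights minus base weights."""
    if base.keys() != instruct.keys():
        raise KeyError("base and instruction-tuned models must share parameter names")
    return {name: instruct[name] - base[name] for name in base}


def apply_residual(new_base, residual):
    """Add the residual to an updated base model with the same architecture."""
    if new_base.keys() != residual.keys():
        raise KeyError("updated base model must share parameter names with the residual")
    return {name: new_base[name] + residual[name] for name in new_base}


rng = np.random.default_rng(0)
shapes = {"layer0.weight": (4, 4), "layer0.bias": (4,)}  # invented parameter names

base = {n: rng.normal(size=s) for n, s in shapes.items()}
# Stand-ins for an instruction-tuned model and a continuously pre-trained base model.
instruct = {n: w + 0.1 * rng.normal(size=w.shape) for n, w in base.items()}
new_base = {n: w + 0.05 * rng.normal(size=w.shape) for n, w in base.items()}

residual = instruction_residual(base, instruct)
new_instruct = apply_residual(new_base, residual)

# Sanity check: adding the residual back to the original base recovers the instruct model.
recovered = apply_residual(base, residual)
print(all(np.allclose(recovered[n], instruct[n]) for n in base))
```

The key checks in both helpers reflect the method's dependency on having a base model and an instruction-tuned model with matching parameters. Whether the ported model actually follows instructions is an empirical question that this arithmetic cannot answer.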

**Limitations**

While our hypothesis is validated for models with 8 billion parameters, we observe a noticeable variation in performance when applied to smaller models, particularly those with around 1.5 billion parameters. Furthermore, the scalability of our proposed strategy for models smaller than 1.5 billion parameters remains uncertain. This presents an
intriguing avenue for future research, where further exploration could investigate whether modifications or optimizations are needed to maintain the same level of effectiveness for these smaller models.

A critical challenge that emerges with the instruction residual method is the reliance on the availability of both the base language model and its instruction fine-tuned counterpart. The approach fundamentally depends on the residual differences between these two models to function effectively. In the absence of either the base model or the fine-tuned model, the instruction residual method cannot be employed. This limitation highlights a bottleneck in the methodology, especially when resources or computational constraints prevent the simultaneous availability of both models. Future work could explore potential ways to mitigate this dependency, perhaps by developing alternative techniques that either reduce the need for dual-model structures or enhance the portability of instruction-based fine-tuning across a wider range of model sizes.

In [None]:
qa_pairs = [
    {
        "question": "What are the two primary phases in training large language models (LLMs), and why is instruction fine-tuning necessary?",
        "answer": "The two primary phases are large-scale pretraining on diverse unlabeled data, followed by task-specific instruction fine-tuning. While pretraining equips models with general linguistic capabilities, instruction fine-tuning aligns the model’s behavior with human intent, enhancing its ability to follow explicit instructions."
    },
    {
        "question": "Why is continuous pre-training essential for LLMs, and what problem does it pose for instruction-tuned models?",
        "answer":  "Continuous pre-training ensures LLMs stay updated with new knowledge. However, when applied to instruction-tuned models, it causes catastrophic forgetting, diminishing their instruction-following capabilities."
    },
    {
        "question": "What empirical strategy do the authors propose to preserve both updated knowledge and instruction-following ability?",
        "answer": "The authors propose continuously pre-training the base model, then performing instruction fine-tuning afterward. This sequence maintains both domain knowledge and the capacity to follow instructions, avoiding the drawbacks of directly pretraining the instruction-tuned model."
    },
    {
        "question": "What is the “instruction residual” method, and how does it work?",
        "answer": "The instruction residual method extracts the difference in weights between a base model and its instruction-tuned counterpart and applies that delta to a newly updated base model, transferring instruction-following capabilities without redoing instruction fine-tuning."
    },
    {
        "question": "Under what conditions can the instruction residual method be applied effectively?",
        "answer": "The method requires access to both the base model and its instruction fine-tuned counterpart, because the residual is computed from the difference between their weights; if either is unavailable, it cannot be applied. The authors also report that it works well for models around 8B parameters, while its effectiveness varies for smaller models."
    },
    {
        "question": "What key experimental insight did the authors discover about model size and the efficacy of their approach?",
        "answer": "The strategy works well for 8B parameter models but shows variation in effectiveness for smaller models, especially those around 1.5B parameters. The scalability of the approach to such models remains an open research question."
    },
    {
        "question": "What limitations do the authors acknowledge regarding their methodology?",
        "answer": "Two main limitations are identified: (1) its uncertain scalability to smaller models, and (2) dependency on having both base and instruction-tuned versions, which may not always be feasible due to computational or resource constraints."
    },
    {
        "question": "What problem does this paper aim to solve in the context of instruction tuning for LLMs?",
        "answer": "It addresses how to maintain both updated knowledge and instruction-following ability in LLMs without repeatedly performing costly instruction fine-tuning."
    },
    {
        "question": "How does this paper differ from previous work on continual pre-training or catastrophic forgetting?",
        "answer": "Unlike prior work focused mainly on base models, this paper examines the unique effects of continual pretraining on instruction-tuned models and proposes a novel weight residual transfer strategy for preserving instruction-following ability."
    },
    {
        "question": "What practical takeaway does the paper offer for training up-to-date instruction-following LLMs?",
        "answer": "Instead of repeatedly fine-tuning updated instruction models, practitioners can simply reuse instruction residuals from earlier models and apply them to newer base models—saving both time and compute."
    }
]

In [None]:
title = "Balancing Continuous Pre-Training and Instruction Fine-Tuning: Optimizing Instruction-Following in LLMs"
generate_QA_pair("02", 2024, "balancing_pretrain_finetune", title, qa_pairs)

### **03: CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting Mitigation**

**Abstract**

This paper introduces CURLoRA, a novel approach to fine-tuning large language models (LLMs) that leverages CUR matrix decomposition in the context of Low-Rank Adaptation (LoRA). Our method addresses two critical challenges in LLM fine-tuning: mitigating catastrophic forgetting during continual learning and reducing the number of trainable parameters. We propose a unique modification to the CUR decomposition process: utilizing inverted probabilities for column and row selection, which acts as an implicit regularization, and initializing the U matrix as a zero matrix and fine-tuning only it. We demonstrate through experiments on multiple datasets that CURLoRA outperforms standard LoRA in mitigating catastrophic forgetting. It maintains model stability and performance across tasks while significantly reducing the number of trainable parameters. Our results show that CURLoRA achieves very good and stable task accuracy while keeping the base model’s perplexity scores fixed compared to LoRA upon continual fine-tuning, particularly in scenarios with limited data.

**Introduction**

Large Language Models (LLMs) have revolutionized natural language processing, demonstrating remarkable capabilities across a wide range of tasks. However, fine-tuning these large models for specific tasks requires a lot of computational resources, making it challenging to adapt these models efficiently, especially when working with limited datasets and in resource-constrained environments. Parameter-Efficient Fine-Tuning (PEFT) methods have gained a lot of attention because they make fine-tuning large models accessible and possible. Low-Rank Adaptation (LoRA) has emerged as an efficient PEFT method, enabling fine-tuning large language models on custom tasks while decreasing the number of trainable parameters, hence requiring fewer resources. LoRA works by decomposing pre-trained weight matrices into low-rank matrices and fine-tuning these instead of the original matrix. Although LoRA has proven to be excellent and promising, it still faces challenges with catastrophic forgetting. Catastrophic forgetting in LLMs is a critical issue where the model loses previously acquired knowledge when fine-tuned on new tasks. It occurs due to the overwriting of previously learned (pre-trained) weights during the fine-tuning process. In LoRA, this often happens as the adapted output can significantly deviate from the original:

$y=xW+xW_{adapted}=x(W+AB)$

where $W \in \mathbb{R}^{m \times n}$ is the original weight matrix, and $AB$ is the low-rank update from multiplying $A \in \mathbb{R}^{m \times r}$ by $B \in \mathbb{R}^{r \times n}$, where $r < n$. This work introduces CURLoRA, a novel approach that applies low-rank adaptation (LoRA) to pre-trained weight matrices using CUR matrix decomposition instead of random initialization of the low-rank A or B matrices. We propose a unique modification to the CUR decomposition process and demonstrate its effectiveness in mitigating catastrophic forgetting while also reducing the number of trainable parameters. While LoRA successfully reduces computational costs by decomposing weight updates into low-rank matrices, it still suffers from catastrophic forgetting. CURLoRA leverages CUR decomposition with inverted probabilities and initializes the U matrix as zero to further mitigate this issue.
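
To make the idea concrete, here is a rough sketch of a CUR-style update on a toy weight matrix, written with plain Python lists to stay dependency-free. Several details are assumptions rather than something stated in the excerpt above: sampling probabilities are taken from squared column and row norms before being inverted, the adapted weight is formed as $W + CUR$, and all helper names are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverted_probabilities(squared_norms):
    """Invert norm-based probabilities and renormalize (assumes no zero norms)."""
    total = sum(squared_norms)
    inverse = [total / n for n in squared_norms]  # 1 / (n / total)
    scale = sum(inverse)
    return [v / scale for v in inverse]


def sample_without_replacement(probs, k, rng):
    """Draw k distinct indices, weighting each draw by the remaining probabilities."""
    remaining = list(range(len(probs)))
    chosen = []
    for _ in range(k):
        weights = [probs[i] for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return sorted(chosen)


rng = random.Random(0)
W = [[1.0, 2.0, 0.5, 0.1],
     [0.3, 1.5, 2.0, 0.2],
     [0.7, 0.4, 1.0, 3.0]]  # toy pre-trained weight, m = 3, n = 4
rank = 2

col_norms = [sum(row[j] ** 2 for row in W) for j in range(len(W[0]))]
row_norms = [sum(v ** 2 for v in row) for row in W]
cols = sample_without_replacement(inverted_probabilities(col_norms), rank, rng)
rows = sample_without_replacement(inverted_probabilities(row_norms), rank, rng)

C = [[row[j] for j in cols] for row in W]   # m x r, kept frozen
R = [W[i] for i in rows]                    # r x n, kept frozen
U = [[0.0] * rank for _ in range(rank)]     # r x r, the only trainable matrix

delta = matmul(matmul(C, U), R)             # all zeros while U is zero
W_adapted = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

x = [[1.0, -1.0, 2.0]]
assert matmul(x, W_adapted) == matmul(x, W)  # output unchanged at initialization
print("trainable parameters:", rank * rank, "vs LoRA:", rank * (len(W) + len(W[0])))
```

With `U` at zero the adapted output equals the original output, and only the `rank * rank` entries of `U` would be trained, compared with `rank * (m + n)` entries for LoRA's `A` and `B`.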

**Conclusion**

This paper introduced CURLoRA, a novel approach to fine-tuning large language models that leverages CUR matrix decomposition to mitigate catastrophic forgetting and improve computational efficiency. Through theoretical analysis and empirical experiments, we demonstrated that CURLoRA outperforms standard LoRA in maintaining model stability and performance across tasks while significantly reducing the number of trainable parameters. Key contributions of this work include:

• A novel modification to CUR decomposition using inverted probabilities for column and row selection and initializing the U matrix as zeros. Sampling columns and rows based on inverted probabilities distinguishes CURLoRA from traditional CUR, offering better stability and performance.

• Theoretical analysis of how CURLoRA addresses catastrophic forgetting.

• Empirical evidence of CURLoRA’s effectiveness across multiple tasks and evaluation metrics with multiple models.

Our results suggest that CURLoRA is a promising approach for efficient and stable fine-tuning of large language models, particularly in scenarios with limited fine-tuning data. CURLoRA’s approach to mitigating catastrophic forgetting has broad implications for continual learning in NLP and beyond. Future research could explore its integration with other adaptation techniques to enhance model robustness.

In [None]:
qa_pairs = [
    {
        "question": "What core problem does CURLoRA seek to address in the context of large language model fine-tuning?",
        "answer": "CURLoRA addresses two main challenges: mitigating catastrophic forgetting during continual fine-tuning and reducing the number of trainable parameters required for adaptation."
    },
    {
        "question": "How does CURLoRA differ from standard LoRA in its matrix decomposition strategy?",
        "answer": "CURLoRA replaces the traditional random initialization in LoRA with CUR matrix decomposition, using inverted probabilities for selecting columns and rows and initializing the U matrix as zeros. The inverted sampling serves as a form of implicit regularization."
    },
    {
        "question": "What is the purpose of initializing the U matrix as a zero matrix in CURLoRA?",
        "answer": "Because U starts as a zero matrix, the CUR-based update contributes nothing at the start of training, so the model initially behaves like the pretrained model. Only U is then fine-tuned, which limits deviations from the pretrained weights and reduces the risk of catastrophic forgetting."
    },
    {
        "question": "Why is catastrophic forgetting a problem in LoRA-based fine-tuning?",
        "answer": "In LoRA, the adapted weight output can deviate significantly from the original weight matrix due to low-rank updates, which may overwrite previously learned knowledge and result in forgetting prior tasks."
    },
    {
        "question": "What is the role of inverted probability sampling in CURLoRA’s CUR decomposition?",
        "answer": "Inverted probability sampling favors the columns and rows that standard CUR sampling would be less likely to select. The authors describe this as an implicit regularization that distinguishes CURLoRA from traditional CUR and offers better stability and performance."
    },
    {
        "question": "How does CURLoRA improve computational efficiency compared to traditional fine-tuning methods?",
        "answer": "CURLoRA reduces the number of trainable parameters by fine-tuning only the U matrix derived from CUR decomposition, requiring fewer resources while maintaining model performance."
    },
    {
        "question": "In what types of scenarios does CURLoRA particularly excel, according to the authors?",
        "answer": "CURLoRA is especially effective in resource-constrained settings and when fine-tuning on limited datasets, where it maintains performance without overwriting prior knowledge."
    },
    {
        "question": "What empirical evidence supports the claims made about CURLoRA?",
        "answer": "Experiments across multiple datasets show that CURLoRA outperforms standard LoRA in mitigating catastrophic forgetting, maintaining model stability, and preserving base model perplexity across continual tasks."
    },
    {
        "question": "Summarize the CURLoRA paper in simple terms for a non-expert audience.",
        "answer": "CURLoRA is a new way to fine-tune large language models that helps them remember what they’ve already learned while adapting to new tasks. It uses a mathematical trick called CUR decomposition to update only a small part of the model, making it both efficient and stable."
    },
    {
        "question": "What are the practical implications of CURLoRA for continual learning in NLP?",
        "answer": "CURLoRA offers a pathway to fine-tune models incrementally without sacrificing previous knowledge, making it ideal for applications that require models to stay updated over time without retraining from scratch."
    }
]

In [None]:
title = "CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting Mitigation"
generate_QA_pair("03", 2024, "curlora", title, qa_pairs)

### **04: DELIFT: Data Efficient Language model Instruction Fine-Tuning**

**Abstract**

Fine-tuning large language models (LLMs) is crucial for task specialization but often becomes resource-intensive due to redundant or uninformative data. Existing data selection methods typically rely either on computationally expensive gradient-based metrics or static embeddings that fail to adapt dynamically to the model’s evolving state, thus limiting their practical effectiveness. To address this, we propose DELIFT (Data Efficient Language model Instruction Fine-Tuning), leveraging a novel, computationally efficient utility metric inspired by In-Context Learning (ICL). Our ICL-based metric measures the informational value of each data sample by quantifying its effectiveness as an in-context example in improving model predictions for other samples, reflecting its actual contribution relative to the model’s current state. Integrated with tailored submodular optimization methods, DELIFT systematically selects diverse, informative subsets optimized specifically for each fine-tuning stage: instruction tuning, task-specific adaptation, and continual fine-tuning. Experimental results across multiple datasets and model scales show DELIFT reduces fine-tuning data requirements by up to 70% without compromising performance, consistently outperforming existing methods by up to 26% in effectiveness and efficiency.

**Introduction**

Large Language Models (LLMs) have become indispensable for solving a variety of natural language processing tasks, ranging from question answering and summarization to complex dialogue and reasoning. Despite their remarkable adaptability, fine-tuning LLMs often requires enormous computational resources and time, especially when a significant portion of the training data is either redundant or uninformative. This challenge grows more critical with increasing model and dataset sizes, posing a key limitation to the broader deployment of LLMs.

Existing data selection methods generally fall under two paradigms: (1) static embedding-based approaches that compute sample similarities without reflecting the model’s evolving state, and (2) gradient-based methods that offer more model-specific feedback but often entail prohibitive computational overhead, especially for large-scale models. Although both paradigms can yield initial benefits, they often fail to account for how a model’s knowledge shifts over multiple fine-tuning phases: **(1) Instruction Tuning**, which enhances the model’s ability to follow diverse instructions; **(2) Task-Specific Fine-Tuning**, which focuses on refining domain expertise; and **(3) Continual Fine-Tuning**, which incrementally incorporates new knowledge while mitigating catastrophic forgetting.

Thus, a natural question arises:

***Can we develop a unified, computationally efficient data selection framework that adapts to all
stages of fine-tuning and maximizes model performance while minimizing data redundancy?***

In this paper, we introduce DELIFT (Data-Efficient Language Model Instruction Fine-Tuning), a single-stop solution designed to address data selection across all fine-tuning stages within a single framework. DELIFT is grounded in information theory yet uses the practical intuition of in-context examples to assess the ‘information gain’ of each data sample relative to the current state of a model. Specifically, we propose a new utility metric that captures how effectively one sample improves the model’s prediction of another. By combining these pairwise utilities with submodular optimization, DELIFT generates diverse, nonredundant subsets uniquely tailored to each fine-tuning phase.
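
As a rough illustration of how pairwise utilities and submodular selection fit together, the sketch below runs a greedy selection over a hand-made utility matrix. It is not DELIFT's implementation: the facility-location objective is one common submodular function chosen here as an assumption, the utility values are invented for the example (in DELIFT they would come from the ICL-based metric), and `greedy_select` and `facility_location_gain` are hypothetical names.

```python
def facility_location_gain(utility, covered, candidate):
    """Marginal gain of adding `candidate`: total improvement in best utility per sample."""
    return sum(max(utility[candidate][j] - covered[j], 0.0) for j in range(len(covered)))


def greedy_select(utility, budget):
    """Greedily pick `budget` samples under a facility-location objective.

    utility[i][j] is a hypothetical pairwise score: how much sample i, used as an
    in-context example, improves the model's prediction on sample j.
    """
    n = len(utility)
    covered = [0.0] * n  # best utility each sample j receives from the selection so far
    selected = []
    for _ in range(min(budget, n)):
        candidates = [i for i in range(n) if i not in selected]
        best = max(candidates, key=lambda i: facility_location_gain(utility, covered, i))
        selected.append(best)
        covered = [max(c, u) for c, u in zip(covered, utility[best])]
    return selected


# Samples 0 and 1 help each other a lot (near-duplicates); so do samples 2 and 3.
utility = [
    [1.0, 0.75, 0.0, 0.0],
    [0.75, 1.0, 0.25, 0.0],
    [0.0, 0.0, 1.0, 0.75],
    [0.0, 0.0, 0.5, 1.0],
]
print(greedy_select(utility, 2))  # [1, 2]
```

The selection takes one sample from each redundant pair rather than two similar ones, which is the nonredundancy property the paragraph above describes.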

We evaluated DELIFT on various tasks and model scales, consistently observing that it can prune up to 70% of the training data without hurting performance, and often while improving it, outperforming existing methods by up to 26% in efficiency and effectiveness. In doing so, we show that careful utility-driven data selection can be far more effective than sheer data volume, opening the door to more resource-friendly and targeted fine-tuning.

Our primary contributions are as follows.

1. A unified information-theoretic data selection paradigm that leverages pairwise utilities grounded in conditional pointwise mutual information, making it adaptable to instruction tuning, task-specific adaptation, and continual fine-tuning.
2. A single-stop, submodular optimization framework that integrates these utilities to provide diverse, high-value subsets for each fine-tuning stage without incurring prohibitive computation.
3. Extensive empirical validation showing up to 70% data reduction with minimal (and sometimes zero) performance loss across multiple domains, demonstrating substantial gains in both efficacy and efficiency.

The remainder of this paper is organized as follows. Section 2 reviews prior work on data-efficient strategies for fine-tuning LLMs and situates our approach within the literature. Section 3 introduces our information-theoretic utility metric and describes how it integrates with submodular optimization to enable data selection across diverse fine-tuning stages. Section 4 presents comprehensive experiments demonstrating the effectiveness and efficiency of our framework on multiple tasks and models. Finally, Section 5 discusses the broader implications of our results, outlines limitations, and suggests directions for future research.

**Related Work**

**Data Subset Selection for Deep Neural Networks.** Selecting an informative subset of training samples is a longstanding strategy to reduce computational costs and enhance model generalization.

*Model-Independent Approaches.* Traditional model-independent techniques, such as clustering or distance metrics on pre-trained embeddings, capture broad semantic similarities but do not reflect the model’s changing state, limiting their effectiveness during iterative fine-tuning.

*Model-Dependent Approaches.* Model-dependent methods incorporate the model’s evolving knowledge by analyzing gradients or loss values, often outperforming static approaches. However, performing gradient or influence estimations at scale becomes prohibitively expensive for large models. Techniques like LESS alleviate some overhead via parameter-efficient fine-tuning (e.g., LoRA), yet still incur repeated gradient or influence calculations that scale poorly with dataset size.

*Subset Selection with LLM Feedback.* Another emerging direction leverages LLM feedback to score or filter training samples. For instance, SelectIT employs self-reflection prompts to rate data quality, while filtering approaches using GPT-4 rely on external heuristics. Though these provide a form of model-aware sampling, they typically lack a principled theoretical grounding. In addition, all these approaches primarily target a single fine-tuning stage, limiting their adaptability for instruction tuning, task-specific adaptation, or continual learning.

**Our Contribution.** In contrast, we present a unified, information-theoretic framework that operates effectively across all fine-tuning stages: instruction tuning, task-specific adaptation, and continual fine-tuning. Our novel utility metric quantifies how one data point aids the prediction of another, mirroring the model’s evolving knowledge. Integrated within a submodular selection paradigm, this approach balances diversity, coverage, and informativeness throughout the entire fine-tuning pipeline. As a result, we bridge the gap left by existing methods that are either restricted to a single phase or computationally infeasible at scale, demonstrating consistent performance improvements and notable efficiency gains.

**Conclusion**

In this paper, we introduced DELIFT, a novel approach to data-efficient fine-tuning of large language models by employing a versatile pairwise utility metric combined with submodular optimization techniques for optimal data selection. Empirical evaluations showed that DELIFT can reduce data and computational requirements by up to 70% while achieving performance comparable to the full dataset, and outperforming existing data selection methods by up to 26% in effectiveness. These results suggest that DELIFT offers a promising method for improving the accessibility of LLM adaptation, especially for resource-constrained scenarios. However, our approach has limitations, including potential sensitivity to the quality and diversity of initial data and the risk of bias amplification inherent in the selected data. Future work will explore integrating DELIFT with data augmentation techniques to improve robustness, incorporating fairness constraints to mitigate biases, and extending the approach to emerging model architectures and multimodal learning. Our ongoing efforts are directed toward ensuring that DELIFT contributes to responsible and equitable
AI development while maximizing efficiency.

In [None]:
qa_pairs = [
  {
    "question": "What core limitation of existing data selection methods does DELIFT aim to overcome?",
    "answer": "DELIFT addresses the limitations of existing data selection methods which rely either on computationally expensive gradient-based metrics or static embeddings that fail to adapt to the model’s evolving state."
  },
  {
    "question": "What is the key insight behind the utility metric proposed in DELIFT?",
    "answer": "The utility metric measures the informational value of a data sample by evaluating how effectively it improves the model’s prediction for other samples, inspired by in-context learning and grounded in conditional pointwise mutual information."
  },
  {
    "question": "How does DELIFT ensure data efficiency across different fine-tuning stages?",
    "answer": "DELIFT integrates its utility metric with submodular optimization to select diverse, informative subsets tailored to three fine-tuning stages: instruction tuning, task-specific adaptation, and continual fine-tuning."
  },
  {
    "question": "How does DELIFT differ from traditional model-independent and model-dependent data selection methods?",
    "answer": "DELIFT offers a unified, model-aware framework that adapts to the evolving state of the model and operates across all fine-tuning stages, unlike traditional approaches that are either static or computationally expensive and limited to specific phases."
  },
  {
    "question": "What empirical benefits does DELIFT offer over baseline methods?",
    "answer": "DELIFT reduces training data by up to 70% without sacrificing model performance and outperforms existing data selection methods by up to 26% in both effectiveness and efficiency across multiple datasets and model scales."
  },
  {
    "question": "What are the primary contributions of the DELIFT framework?",
    "answer": "The main contributions include: a unified data selection framework grounded in information theory, an efficient submodular optimization pipeline, and empirical evidence of significant data and computational savings without performance degradation."
  },
  {
    "question": "What are the limitations of DELIFT noted by the authors?",
    "answer": "The authors acknowledge DELIFT’s sensitivity to the quality and diversity of the initial dataset and the risk of bias amplification in the selected data, suggesting future work should incorporate fairness constraints and data augmentation."
  },
  {
    "question": "What problem does DELIFT solve in the context of fine-tuning large language models?",
    "answer": "DELIFT solves the problem of inefficient and resource-heavy fine-tuning by intelligently selecting the most informative data samples, thereby reducing redundancy and computational cost while maintaining or improving performance."
  },
  {
    "question": "Why is data selection important for instruction fine-tuning?",
    "answer": "Data selection is crucial for instruction fine-tuning because the process is resource-intensive, and removing redundant or low-value samples can significantly reduce training cost and time without degrading model quality."
  },
  {
    "question": "Can DELIFT be applied to all stages of LLM fine-tuning?",
    "answer": "Yes, DELIFT is explicitly designed to adapt to all stages of fine-tuning—including instruction tuning, task-specific adaptation, and continual fine-tuning—making it a comprehensive data selection solution."
  }
]

In [None]:
title = "DELIFT: Data Efficient Language model Instruction Fine-Tuning"
generate_QA_pair("04", 2025, "delift", title, qa_pairs)

### **05: Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning**

**Abstract**

Few-shot in-context learning (ICL) enables pre-trained language models to perform a previously-unseen task without any gradient-based training by feeding a small number of training examples as part of the input. ICL incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made. Parameter-efficient fine-tuning (PEFT) (e.g. adapter modules, prompt tuning, sparse update methods, etc.) offers an alternative paradigm where a small set of parameters are trained to enable a model to perform the new task. In this paper, we rigorously compare few-shot ICL and PEFT and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs. Along the way, we introduce a new PEFT method called (IA)$^3$ that scales activations by learned vectors, attaining stronger performance while only introducing a relatively tiny amount of new parameters. We also propose a simple recipe based on the T0 model called T-Few that can be applied to new tasks without task-specific tuning or modifications. We validate the effectiveness of T-Few on completely unseen tasks by applying it to the RAFT benchmark, attaining super-human performance for the first time and outperforming the state-of-the-art by 6% absolute. All of the code used in our experiments is publicly available.

**Introduction**

Pre-trained language models have become a cornerstone of natural language processing, thanks to the fact that they can dramatically improve data efficiency on tasks of interest – i.e., using a pre-trained language model for initialization often produces better results with less labeled data. A historically common approach has been to use the pre-trained model’s parameters for initialization before performing gradient-based fine-tuning on a downstream task of interest. While fine-tuning has produced many state-of-the-art results, it results in a model that is specialized for a single task with an entirely new set of parameter values, which can become impractical when fine-tuning a model on many downstream tasks. An alternative approach is in-context learning (ICL), which induces a model to perform a downstream task by inputting prompted examples. Few-shot prompting converts a small collection of input-target pairs into (typically) human-understandable instructions and examples, along with a single unlabeled example for which a prediction is desired. Notably, ICL requires no gradient-based training and therefore allows a single model to immediately perform a wide variety of tasks. Performing ICL therefore solely relies on the capabilities that a model learned during pre-training. These characteristics have led to a great deal of recent interest in ICL methods.

Despite the practical benefits of ICL, it has several major drawbacks. First, processing all prompted input-target pairs every time the model makes a prediction incurs significant compute costs. Second, ICL typically produces inferior performance compared to fine-tuning. Finally, the exact formatting
of the prompt (including the wording and ordering of examples) can have significant and unpredictable impact on the model’s performance, far beyond inter-run variation of fine-tuning. Recent work has also demonstrated that ICL can perform well even when provided with incorrect labels, raising questions as to how much learning is taking place at all.

An additional paradigm for enabling a model to perform a new task with minimal updates is parameter-efficient fine-tuning (PEFT), where a pre-trained model is fine-tuned by only updating a small number of added or selected parameters. Recent methods have matched the performance of fine-tuning the full model while only updating or adding a small fraction (e.g. 0.01%) of the full model’s parameters. Furthermore, certain PEFT methods allow mixed-task batches where different examples in a batch are processed differently, making both PEFT and ICL viable for multitask models.

While the benefits of PEFT address some shortcomings of fine-tuning (when compared to ICL), there has been relatively little focus on whether PEFT methods work well when very little labeled data is available. Our primary goal in this paper is to close this gap by proposing a recipe – i.e., a model, a PEFT method, and a fixed set of hyperparameters – that attains strong performance on novel, unseen tasks while only updating a tiny fraction of the model’s parameters. Specifically, we base our approach on the T0 model, a variant of T5 fine-tuned on a multitask mixture of prompted datasets. To improve performance on classification and multiple-choice tasks, we add unlikelihood and length normalization-based loss terms. In addition, we develop (IA)$^3$, a PEFT method that multiplies intermediate activations by learned vectors. (IA)$^3$ attains stronger performance than full-model fine-tuning while updating up to 10,000× fewer parameters. Finally, we demonstrate the benefits of pre-training the (IA)$^3$ parameters before fine-tuning. Our overall recipe, which we dub “T-Few”, performs significantly better than ICL (even against 16× larger models) and outperforms humans for the first time on the real-world few-shot learning benchmark RAFT while requiring dramatically less compute and allowing for mixed-task batches during inference. To facilitate the use of T-Few on new problems and future research on PEFT, we release our code.
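
The core operation of (IA)$^3$, multiplying intermediate activations by learned vectors, can be sketched in a few lines. This is only an illustration on a toy activation vector: the helper name `ia3_scale`, the all-ones initialization, and the example values after "training" are assumptions, not details given in the text above.

```python
def ia3_scale(activations, learned_vector):
    """Rescale each activation dimension by its learned scalar (element-wise)."""
    return [a * l for a, l in zip(activations, learned_vector)]


d_ff = 4
hidden = [0.5, -1.0, 2.0, 0.0]  # toy intermediate activations for one token

# Assumed initialization: all ones, so the pre-trained model is unchanged at the start.
l_ff = [1.0] * d_ff
assert ia3_scale(hidden, l_ff) == hidden

# After (hypothetical) training, the vector amplifies or suppresses dimensions.
l_ff = [2.0, 1.0, 0.5, 1.0]
print(ia3_scale(hidden, l_ff))  # [1.0, -1.0, 1.0, 0.0]
```

Each scaled site adds only one scalar per activation dimension (here `d_ff` values), which is why the number of new parameters stays small relative to the weight matrices around it.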

After providing background on ICL and PEFT in the following section, we discuss the design of T-Few in section 3. In section 4, we present experiments comparing T-Few to strong ICL baselines. Finally, we discuss related work in appendix B and conclude in section 5.

**Background**

In this section, we provide an overview of ICL and PEFT with a focus on characterizing the computation, memory, and on-disk storage costs of making a prediction. Real-world costs depend on implementation and hardware, so we report costs in terms of FLOPs for computation and bytes for memory and storage, respectively. Additional related work is discussed in appendix B.

**2.1 Few-shot in-context learning (ICL)**

ICL aims to induce a model to perform a task by feeding in concatenated and prompted input-target examples (called “shots”) along with an unlabeled query example. Taking the cycled letter task from Brown et al. as an example, a 4-shot input or context would be “Please unscramble the letters into a word, and write that word: asinoc = casino, yfrogg = froggy, plesim = simple, iggestb = biggest, astedro =”, for which the desired output would be “roasted”. ICL induces an autoregressive language model to perform this task by feeding in the context and sampling from the model. For classification tasks, each label is associated with a string (e.g. “positive” and “negative” for sentiment analysis) and a label is assigned by choosing the label string that the model assigns the highest probability to. For multiple-choice tasks (e.g. choosing between N possible answers to a question), the model’s prediction is similarly determined by which choice is assigned the highest probability.
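
The prompt construction and label selection described above can be sketched as follows. The helper names are hypothetical, and the label scores stand in for log-probabilities that a real model would supply; they are invented for illustration.

```python
def build_few_shot_prompt(instruction, shots, query):
    """Concatenate an instruction, k labeled shots, and one unlabeled query."""
    parts = [f"{source} = {target}" for source, target in shots]
    parts.append(f"{query} =")
    return f"{instruction} " + ", ".join(parts)


def pick_label(label_scores):
    """Rank classification: return the label string with the highest score."""
    return max(label_scores, key=label_scores.get)


shots = [("asinoc", "casino"), ("yfrogg", "froggy"),
         ("plesim", "simple"), ("iggestb", "biggest")]
prompt = build_few_shot_prompt(
    "Please unscramble the letters into a word, and write that word:",
    shots,
    "astedro",
)
print(prompt)

# Hypothetical log-probabilities a model might assign to each label string.
print(pick_label({"positive": -1.2, "negative": -3.4}))  # positive
```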

The primary advantage of ICL is that it enables a single model to perform many tasks immediately without fine-tuning. This also enables mixed-task batches, where different examples in a batch of data correspond to different tasks by using different contexts in the input. ICL is also typically performed with only a limited number of labeled examples – called few-shot learning – making it data-efficient.

Despite these advantages, ICL comes with significant practical drawbacks: First, making a prediction is dramatically more expensive because the model needs to process all of the in-context labeled examples. Specifically, ignoring the quadratic complexity of self-attention operations in Transformer
language models (which are typically small compared to the costs of the rest of the model), processing the k training examples for k-shot ICL increases the computational cost by approximately k + 1 times compared to processing the unlabeled example alone. Memory costs similarly scale approximately linearly with k, though during inference the memory costs are typically dominated by storing the model’s parameters. Separately, there is a small amount of on-disk storage required for storing the in-context examples for a given task. For example, storing 32 examples for a task where the prompted input and target for each example is 512 tokens long would require about 66 kilobytes of storage on disk (32 examples × 512 tokens × 32 bits).
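
The two cost estimates in this paragraph follow from simple arithmetic, reproduced below as a quick check. The function names are hypothetical; the inputs are the figures quoted in the text.

```python
def icl_compute_multiplier(k):
    """Approximate cost of k-shot ICL relative to the unlabeled example alone."""
    return k + 1


def in_context_storage_bytes(num_examples, tokens_per_example, bits_per_token=32):
    """On-disk size of the stored in-context examples, in bytes."""
    return num_examples * tokens_per_example * bits_per_token // 8


print(icl_compute_multiplier(32))         # 33
print(in_context_storage_bytes(32, 512))  # 65536 bytes, about 66 kilobytes
```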

Beyond the aforementioned costs, ICL also exhibits unintuitive behavior. Zhao et al. showed that the ordering of examples in the context heavily influences the model’s predictions. Min et al. showed that ICL can still perform well even if the labels of the in-context examples are swapped (i.e. made incorrect), which raises questions about whether ICL is really “learning” from the labeled examples.

Various approaches have been proposed to mitigate these issues. One way to decrease computational costs is to cache the key and value vectors for in-context examples. This is possible because decoder-only Transformer language models have a causal masking pattern, so the model’s activations for the context do not depend on the unlabeled example. In an extreme case, 32-shot ICL with 512 tokens per in-context example would result in over 144 gigabytes of cached key and value vectors for the GPT-3 model (32 examples × 512 tokens × 96 layers × 12288 d_model × 32 bits each for the key
and value vectors). Separately, Min et al. proposed ensemble ICL, where instead of using the output probability from concatenating the k training examples, the output probabilities of the model on each training example (i.e. 1-shot ICL for each of the k examples) are multiplied together. This
lowers the non-parameter memory cost by a factor of k/2 but increases the computational cost by a factor of 2. In terms of task performance, Min et al. find that ensemble ICL outperforms the standard concatenative variant.
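
Both ideas in this paragraph can be checked with a short sketch: the size of the key/value cache and the ensemble rule that multiplies per-shot probabilities. The helper names and the per-shot probabilities are hypothetical; the GPT-3 dimensions are the ones quoted in the text.

```python
import math


def kv_cache_bytes(num_examples, tokens_per_example, num_layers, d_model, bits_per_value=32):
    """Bytes needed to cache both key and value vectors for the whole context."""
    values_per_kind = num_examples * tokens_per_example * num_layers * d_model
    return 2 * values_per_kind * bits_per_value // 8  # factor 2: keys and values


def ensemble_icl_scores(per_shot_probs):
    """Multiply each label's probability across the k separate 1-shot contexts."""
    labels = per_shot_probs[0].keys()
    return {label: math.prod(p[label] for p in per_shot_probs) for label in labels}


gpt3_cache = kv_cache_bytes(32, 512, 96, 12288)
print(gpt3_cache / 2**30)  # cache size in GiB

# Hypothetical label probabilities from two separate 1-shot contexts.
per_shot = [
    {"positive": 0.5, "negative": 0.5},
    {"positive": 0.8, "negative": 0.2},
]
combined = ensemble_icl_scores(per_shot)
print(max(combined, key=combined.get))
```

Under these numbers the cache comes to 144 GiB (about 155 GB in decimal units), consistent with the figure quoted above.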

**2.2 Parameter-efficient fine-tuning**

While standard fine-tuning updates all parameters of the pre-trained model, it has been demonstrated that it is possible to instead update or add a relatively small number of parameters. Early methods proposed adding adapters, which are small trainable feed-forward networks inserted between
the layers in the fixed pre-trained model. Since then, various sophisticated PEFT methods have been proposed, including methods that choose a sparse subset of parameters to train, produce low-rank updates, perform optimization in a lower-dimensional subspace, add low-rank adapters using hypercomplex multiplication, and more. Relatedly, prompt tuning and prefix tuning concatenate learned continuous embeddings to the model’s input or activations to induce it to perform a task; this can be seen as a PEFT method. State-of-the-art PEFT methods can match the performance of fine-tuning all of the model’s parameters while updating only a tiny fraction
(e.g. 0.01%) of the model’s parameters.

PEFT drastically reduces the memory and storage requirements for training and saving the model. In addition, certain PEFT methods straightforwardly allow mixed-task batches – for example, prompt tuning enables a single model to perform many tasks simply by concatenating different prompt embeddings to each example in the batch. On the other hand, PEFT methods that re-parameterize the model are costly or onerous for mixed-task batches. Separately, different PEFT methods increase the computation and memory required to perform inference by different amounts. For example, adapters effectively add additional (small) layers to the model, resulting in small but
non-negligible increases in computational costs and memory. An additional cost incurred by PEFT is the cost of fine-tuning itself, which must be performed once and is then amortized as the model is used for inference. However, we will show that PEFT can be dramatically more computationally
efficient when considering both fine-tuning and inference while achieving better accuracy than ICL.

**Conclusion**

We introduced T-Few, a parameter-efficient few-shot learning recipe that attains higher accuracy than few-shot ICL at a lower computational cost. T-Few uses (IA)³, a new PEFT method that rescales inner activations with learned vectors. Using (IA)³ produces better performance than fine-tuning
the full model while only introducing a tiny amount of additional parameters. T-Few also uses two additional loss terms that encourage the model to output lower probabilities for incorrect choices and account for the length of different answer choices. When applying T-Few as-is (with no task-specific
hyperparameter tuning or other changes) to the RAFT benchmark, we attained super-human performance for the first time and outperformed prior submissions by a large margin. Through detailed characterization of computational costs, we found that T-Few uses over 1,000× fewer FLOPs
during inference than few-shot ICL with GPT-3 and only requires 30 minutes to train on a single NVIDIA A100 GPU. Since all of our experiments were on classification tasks, we are interested in applying T-Few to generative tasks such as summarization and question answering in future work. We hope our results provide a new perspective on how best to perform few-shot learning with large language models.
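
The core operation of (IA)³, rescaling activations element-wise with a learned vector, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `ia3_rescale` is a hypothetical name, the activations are toy values, and initializing the vector to ones (so the frozen model is unchanged before training) is an assumption of this sketch.

```python
def ia3_rescale(activations, learned_vector):
    """Rescale each activation vector element-wise by a learned vector."""
    return [[a * l for a, l in zip(row, learned_vector)] for row in activations]


# Toy activations: two positions, hidden size 4.
acts = [[0.5, -1.0, 2.0, 0.0], [1.0, 1.0, 1.0, 1.0]]

# Assumption: a vector of ones leaves the frozen model's activations untouched.
identity_vector = [1.0, 1.0, 1.0, 1.0]
unchanged = ia3_rescale(acts, identity_vector)

# After training, the vector would hold learned per-dimension scales (made up here).
trained_vector = [2.0, 1.0, 0.5, 1.0]
rescaled = ia3_rescale(acts, trained_vector)
print(rescaled)
```

Only the vector is trained, so the number of new parameters per rescaled activation equals its hidden dimension. That is why the method adds so few parameters.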

## Step 7: Downloading the Notebook

In [None]:
qa_pairs = [
  {
    "question": "What is the main computational drawback of few-shot in-context learning (ICL) compared to parameter-efficient fine-tuning (PEFT)?",
    "answer": "ICL incurs high computational costs because it processes all in-context training examples during every prediction, increasing inference time and memory usage significantly, whereas PEFT fine-tunes a small number of parameters and avoids this repeated overhead."
  },
  {
    "question": "What novel PEFT method is introduced in the paper and how does it work?",
    "answer": "The paper introduces (IA)³, a parameter-efficient fine-tuning method that rescales intermediate activations by learned vectors, improving performance while introducing very few new parameters."
  },
  {
    "question": "How does T-Few improve over both ICL and standard PEFT methods?",
    "answer": "T-Few combines the T0 model, the (IA)³ method, and additional loss functions to outperform ICL and even full fine-tuning, achieving better accuracy with significantly reduced compute and memory costs."
  },
  {
    "question": "What role do the unlikelihood and length normalization loss terms play in T-Few?",
    "answer": "These loss terms help T-Few produce more accurate predictions by discouraging high-probability outputs for incorrect answers and adjusting for varying lengths of answer choices."
  },
  {
    "question": "What empirical result supports T-Few’s superiority in few-shot settings?",
    "answer": "T-Few achieved super-human performance on the RAFT benchmark without any task-specific tuning, outperforming previous methods by 6% and using over 1,000× fewer FLOPs than few-shot ICL with GPT-3."
  },
  {
    "question": "Why is PEFT particularly suited for multitask learning scenarios?",
    "answer": "Certain PEFT methods, like prompt tuning, allow for mixed-task batches by attaching different prompts to each input, enabling a single model to handle multiple tasks simultaneously without interference."
  },
  {
    "question": "What are some disadvantages of ICL identified in the paper?",
    "answer": "ICL suffers from high inference costs, strong sensitivity to the ordering of in-context examples, and questionable learning behavior (e.g., it may perform well even with incorrectly labeled examples)."
  },
  {
    "question": "What is the core idea behind few-shot learning in this paper?",
    "answer": "Few-shot learning refers to adapting a model to perform a new task using only a small number of labeled examples, either via ICL (by providing examples as input) or PEFT (by updating a few parameters)."
  },
  {
    "question": "What is the main contribution of this paper to the few-shot learning literature?",
    "answer": "The paper introduces a PEFT-based approach, T-Few, which is computationally efficient and achieves state-of-the-art few-shot performance without task-specific modifications or large model sizes."
  },
  {
    "question": "Why might someone prefer PEFT over ICL when deploying LLMs at scale?",
    "answer": "PEFT offers better accuracy with fewer resources, lower inference cost, and improved stability compared to ICL, making it more practical and scalable for real-world applications."
  }
]

In [None]:
title = "Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning"
generate_QA_pair("05", 2022, "fewshot", title, qa_pairs)

### **06: Finetuned Language Models are Zero-Shot Learners**

**Abstract**

This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning—finetuning language models on a collection of datasets described via instructions—substantially improves zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction tune it on
over 60 NLP datasets verbalized via natural language instruction templates. We
evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 datasets that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.

**Introduction**

Language models (LMs) at scale, such as GPT-3 (Brown et al., 2020), have been shown to perform few-shot learning remarkably well. They are less successful at zero-shot learning, however. For example, GPT-3’s zero-shot performance is much worse than few-shot performance on tasks such as reading comprehension, question answering, and natural language inference. One potential reason is that, without few-shot exemplars, it is harder for models to perform well on prompts that are not similar to the format of the pretraining data.

In this paper, we explore a simple method to improve the zero-shot performance of large language models, which would expand their reach to a broader audience. We leverage the intuition that NLP tasks can be described via natural language instructions, such as “Is the sentiment of this movie review positive or negative?” or “Translate ‘how are you’ into Chinese.” We take a pretrained language model of 137B parameters and perform instruction tuning—finetuning the model on a mixture of more than 60 NLP datasets expressed via natural language instructions. We refer to this resulting
model as FLAN, for Finetuned Language Net.

To evaluate the zero-shot performance of FLAN on unseen tasks, we group NLP datasets into clusters based on their task types and hold out each cluster for evaluation while instruction tuning FLAN on all other clusters. For example, as shown in Figure 1, to evaluate FLAN’s ability to perform
natural language inference, we instruction tune the model on a range of other NLP tasks such as commonsense reasoning, translation, and sentiment analysis. As this setup ensures that FLAN has not seen any natural language inference tasks in instruction tuning, we then evaluate its ability to
perform zero-shot natural language inference.
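
The hold-out scheme described above can be sketched as a loop over task clusters. The cluster contents below are a small illustrative subset, and `leave_one_cluster_out` is a hypothetical helper, not code from the paper.

```python
def leave_one_cluster_out(clusters):
    """Yield (held_out_cluster, training_datasets, evaluation_datasets) splits."""
    for held_out, eval_datasets in clusters.items():
        train = [
            dataset
            for name, datasets in clusters.items()
            if name != held_out
            for dataset in datasets
        ]
        yield held_out, train, list(eval_datasets)


# Illustrative subset of task clusters (assumed for this sketch).
clusters = {
    "nli": ["ANLI", "RTE"],
    "commonsense": ["StoryCloze"],
    "reading_comprehension": ["BoolQ"],
}

splits = {name: (train, evals) for name, train, evals in leave_one_cluster_out(clusters)}
for name, (train, evals) in splits.items():
    print(f"hold out {name}: tune on {train}, evaluate on {evals}")
```

Each split corresponds to a separate instruction-tuned checkpoint, since the training mixture differs depending on which cluster is held out.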

Our evaluations show that FLAN substantially improves the zero-shot performance of the base 137B-parameter model. FLAN’s zero-shot also outperforms 175B-parameter GPT-3’s zero-shot on 20 of 25 datasets that we evaluate, and even outperforms GPT-3’s few-shot by a large margin on ANLI,
RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. In ablation studies, we find that increasing the number of task clusters in instruction tuning improves performance on unseen tasks and that the benefits of instruction tuning emerge only with sufficient model scale.

Instruction tuning is a simple method that, as depicted in Figure 2, combines appealing aspects of both the pretrain–finetune and prompting paradigms by using supervision via finetuning to improve language model’s responses to inference-time text interactions. Our empirical results
demonstrate promising abilities of language models to perform tasks described purely via instructions.

**Results**

We evaluate FLAN on natural language inference, reading comprehension, closed-book QA, translation, commonsense reasoning, coreference resolution, and struct-to-text. As described in §2.2, we evaluate on unseen tasks by grouping datasets into task clusters and holding out each cluster for evaluation while instruction tuning on all remaining clusters (i.e., each evaluation task cluster uses a different checkpoint). For each dataset, we evaluate the mean of performance on all templates, which proxies the expected performance given a typical natural language instruction. As a dev set is sometimes available for manual prompt engineering (Brown et al., 2020), for each dataset we also obtain the test set performance using the template with the best dev set performance.
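
The two per-dataset numbers described here, the mean over all templates and the test score of the template that did best on dev, can be computed as in the sketch below. The template names and scores are invented for illustration, and `summarize_templates` is a hypothetical helper.

```python
from statistics import mean


def summarize_templates(dev_scores, test_scores):
    """Return (mean test score, best dev template, test score of that template)."""
    mean_test = mean(test_scores.values())
    best_template = max(dev_scores, key=dev_scores.get)
    return mean_test, best_template, test_scores[best_template]


# Hypothetical accuracies for three instruction templates on one dataset.
dev_scores = {"template_1": 60.0, "template_2": 70.0, "template_3": 65.0}
test_scores = {"template_1": 58.0, "template_2": 66.0, "template_3": 68.0}

mean_test, best_template, best_dev_test = summarize_templates(dev_scores, test_scores)
print(mean_test, best_template, best_dev_test)
```

Note that the template chosen on dev is not necessarily the one with the highest test score, which is why the two quantities are reported separately.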

For comparison, we report zero and few-shot results for LaMDA-PT using the same prompts as GPT-3 (as LaMDA-PT is not suitable for natural instructions without instruction tuning). This baseline provides the most direct ablation of how much instruction tuning helps. Instruction tuning significantly improves LaMDA-PT on most datasets.

We also show the zero-shot performances of GPT-3 175B and GLaM 64B/64E, as reported in their respective papers. With the best dev template, zero-shot FLAN outperforms zero-shot GPT-3 on 20 of 25 datasets and even surpasses GPT-3’s few-shot performance on 10 datasets. With the best dev template, zero-shot FLAN outperforms zero-shot GLaM on 13 of 19 available datasets and one-shot GLaM on 11 of 19 datasets.

Overall, we observe that instruction tuning is very effective on tasks naturally verbalized as instructions (e.g., NLI, QA, translation, struct-to-text) and is less effective on tasks directly formulated as language modeling, where instructions would be largely redundant (e.g., commonsense reasoning and coreference resolution tasks that are formatted as finishing an incomplete sentence or paragraph).

**Discussion**

Our paper has explored a simple question in zero-shot prompting: does finetuning a model on a collection of tasks phrased as instructions improve its performance on unseen tasks? We operationalize this question via instruction tuning, a simple method that combines appealing aspects of both
the pretrain–finetune and prompting paradigms. Our instruction-tuned model, FLAN, improves performance against an untuned model and surpasses zero-shot GPT-3 on the majority of tasks that we evaluate on. Ablation studies reveal that performance on unseen tasks improves with the number of instruction tuning task clusters, and, interestingly, that performance improvements from instruction tuning emerge only with sufficient model scale. Moreover, instruction tuning can be combined with other prompting methods such as few-shot prompting and prompt tuning.

The diverse capabilities of language models at scale have drawn attention to the tradeoffs between specialist models (one model per task) and generalist models, for which our study has potential implications. Although one might
expect labeled data to have the most natural role in improving specialist models, instruction tuning demonstrates how labeled data can be used to help large language models perform many, unseen tasks. In other words, the positive effect of instruction tuning on cross-task generalization shows that
task-specific training is complementary to general language modeling and motivates further research on generalist models.

As for limitations of our study, there is a degree of subjectivity in assigning tasks to clusters (though we try to use accepted categorizations in the literature), and we only explore the use of relatively
short instructions of typically a single sentence (cf. detailed instructions given to crowd-workers). A limitation for our evaluation is that individual examples might have appeared in the models’ pretraining data, which includes web documents, though in post-hoc analysis (Appendix C) we do not find any evidence that data overlap substantially impacted the results. Finally, the scale of FLAN 137B makes it costly to serve. Future work on instruction tuning could include gathering/generating even more task clusters for finetuning, cross-lingual experiments, using FLAN to generate data for training downstream classifiers, and using finetuning to improve model behavior with respect to bias and fairness.

**Conclusion**

This paper has explored a simple method for improving the ability of language models at scale to perform zero-shot tasks based purely on instructions. Our instruction-tuned model, FLAN, compares favorably against GPT-3 and signals the potential ability for language models at scale to follow instructions. We hope that our paper will spur further research on instructions-based NLP, zero-shot learning, and using labeled data to improve large language models.

In [None]:
qa_pairs = [
  {
    "question": "What is the central hypothesis of the paper 'Finetuned Language Models are Zero-Shot Learners'?",
    "answer": "The central hypothesis is that instruction tuning—finetuning large language models on a collection of datasets expressed via natural language instructions—substantially improves zero-shot performance on unseen tasks."
  },
  {
    "question": "What is FLAN and how was it created?",
    "answer": "FLAN (Finetuned Language Net) is a 137B parameter language model created by instruction tuning a pretrained model on over 60 NLP datasets, each described using natural language instruction templates."
  },
  {
    "question": "How does FLAN's zero-shot performance compare to GPT-3's?",
    "answer": "FLAN outperforms GPT-3 (175B) in zero-shot performance on 20 out of 25 datasets and even surpasses GPT-3's few-shot performance on six benchmarks including ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze."
  },
  {
    "question": "What evaluation strategy was used to ensure FLAN was tested on unseen tasks?",
    "answer": "FLAN was evaluated using a leave-one-cluster-out strategy, where datasets were grouped by task type and each cluster was held out during instruction tuning to ensure zero-shot evaluation on entirely unseen tasks."
  },
  {
    "question": "What did the ablation studies in the paper reveal about instruction tuning?",
    "answer": "The studies showed that increasing the number of instruction-tuning task clusters improves generalization to unseen tasks, and that the benefits of instruction tuning only emerge at sufficient model scale."
  },
  {
    "question": "What types of tasks benefit most from instruction tuning according to the results?",
    "answer": "Tasks that are naturally verbalized via instructions—such as natural language inference, question answering, translation, and structured text generation—benefit the most from instruction tuning."
  },
  {
    "question": "What limitations of the FLAN study do the authors acknowledge?",
    "answer": "The authors acknowledge subjectivity in assigning tasks to clusters, the use of only relatively short (typically single-sentence) instructions, possible overlap between evaluation examples and the pretraining data, and the high cost of serving a 137B-parameter model."
  },
  {
    "question": "Why is FLAN’s performance improvement on unseen tasks significant for model generalization?",
    "answer": "It suggests that task-specific labeled data, when phrased as instructions, can improve cross-task generalization, offering a path toward building generalist models rather than narrowly specialized ones."
  },
  {
    "question": "How does instruction tuning differ from few-shot prompting or traditional finetuning?",
    "answer": "Instruction tuning uses supervised finetuning on many tasks described via instructions, blending the benefits of pretraining and prompting. Unlike few-shot prompting, it trains the model directly on instruction-following, and unlike traditional finetuning, it supports generalization to unseen tasks."
  },
  {
    "question": "What are the practical implications of FLAN's success for future NLP research?",
    "answer": "FLAN's success highlights the value of instruction tuning for scalable zero-shot learning and motivates further research into multi-task generalization, bias mitigation, and instruction-based learning paradigms."
  },
  {
    "question": "What makes instruction tuning a promising strategy for zero-shot learning?",
    "answer": "It teaches models to follow natural language instructions across diverse tasks, enabling them to generalize better to new instructions without requiring additional examples or task-specific finetuning."
  },
  {
    "question": "How might FLAN's approach influence the development of generalist language models?",
    "answer": "By showing that instruction tuning enhances performance on unseen tasks, FLAN paves the way for generalist models that can handle a wide variety of tasks without requiring separate models or extensive manual adaptation."
  }
]

In [None]:
title = "Finetuned Language Models are Zero-Shot Learners"
generate_QA_pair("06", 2022, "zeroshot", title, qa_pairs)

### **07: Instruction Tuning for Large Language Models: A Survey**

**Abstract**

This paper surveys research works in the quickly advancing field of instruction tuning (IT), which can also be referred to as supervised fine-tuning (SFT), a crucial technique to enhance the capabilities and controllability of large language models (LLMs). Instruction tuning refers to the process of further training LLMs on a dataset consisting of (INSTRUCTION, OUTPUT) pairs in a supervised fashion, which bridges the gap between the
next-word prediction objective of LLMs and the users’ objective of having LLMs adhere to human instructions. In this work, we make a systematic review of the literature, including the general methodology of SFT, the construction of SFT datasets, the training of SFT models, and its use across different
modalities, domains, and applications, along with analysis of aspects that influence the outcome of SFT (e.g., generation of instruction outputs, size of the instruction dataset, etc.). We also review the potential pitfalls of SFT and criticism against it, point out current deficiencies of existing strategies, and suggest some avenues for fruitful
research.

**Introduction**

The field of large language models (LLMs) has witnessed remarkable progress in recent years. LLMs such as GPT-3, PaLM, and LLaMA have demonstrated
impressive capabilities across a wide range of natural language tasks. One
of the major issues with LLMs is the mismatch between the training objective and users’ objective: LLMs are typically trained on minimizing the contextual word prediction error on large corpora; while users want the model to "follow their instructions helpfully and safely."

To address this mismatch, instruction tuning (IT), which can also be referred to as supervised fine-tuning (SFT), is proposed, serving as an effective technique to enhance the capabilities and controllability of large language
models. It involves further training LLMs using (INSTRUCTION, OUTPUT) pairs, where INSTRUCTION denotes the human instruction for the model, and OUTPUT denotes the desired output that follows the INSTRUCTION. The benefits of SFT are threefold: (1) Finetuning an LLM on the instruction dataset bridges the gap between the next-word prediction objective of LLMs and the
users’ objective of instruction following; (2) SFT allows for a more controllable and predictable model behavior compared to standard LLMs. The
instructions serve to constrain the model’s outputs to align with the desired response characteristics or domain knowledge, providing a channel for humans to intervene with the model’s behaviors; and (3) SFT is computationally efficient and can help LLMs rapidly adapt to a specific domain without extensive retraining or architectural changes.
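
An (INSTRUCTION, OUTPUT) pair as described above can be represented and serialized into training text roughly as follows. This is a minimal sketch: the `InstructionExample` class, the `to_training_text` helper, and the response separator are all hypothetical choices, not a format prescribed by the survey.

```python
from dataclasses import dataclass


@dataclass
class InstructionExample:
    instruction: str  # the human instruction given to the model
    output: str       # the desired output that follows the instruction


def to_training_text(example, separator="\n### Response:\n"):
    """Join instruction and output into one supervised training string."""
    return f"{example.instruction}{separator}{example.output}"


example = InstructionExample(
    instruction="Translate 'how are you' into French.",
    output="Comment allez-vous ?",
)
text = to_training_text(example)
print(text)
```

During supervised fine-tuning the model would be trained with its usual next-token objective on strings like this one, which is how the instruction dataset bridges pretraining and instruction following.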

Despite its effectiveness, SFT also poses challenges: (1) Crafting high-quality instructions that properly cover the desired target behaviors
is non-trivial: existing instruction datasets are usually limited in quantity, diversity, and creativity; (2) there has been an increasing concern that SFT only improves on tasks that are heavily supported in the SFT training dataset; and (3) there has been an intense criticism that SFT only captures surface-level patterns and styles (e.g., the output format) rather than comprehending and learning the task. Improving instruction
adherence and handling unanticipated model responses remain open research problems. These challenges highlight the importance of further investigations, analysis, and summarization in this field, to optimize the fine-tuning process and better understand the behavior of instruction tuned LLMs.

In the literature, there has been an increasing research interest in analysis and discussions on LLMs, including pre-training methods, reasoning abilities, downstream applications, but rarely on the topic of LLM instruction tuning. This survey attempts to fill this blank, organizing the most up-to-date state of knowledge on this quickly advancing field. Specifically,

• Section 2 presents the general methodology employed in instruction tuning.

• Section 3 outlines the construction process of commonly-used SFT representative datasets.

• Section 4 presents representative instruction tuned models.

• Section 5 reviews multi-modality techniques and datasets for instruction tuning, including images, speech, and video.

• Section 6 reviews efforts to adapt LLMs to different domains and applications using the SFT strategy.

• Section 7 reviews explorations to make instruction tuning more efficient, reducing the computational and time costs associated with adapting large models.

• Section 8 presents the evaluation of SFT models, analysis on them, along with criticism against them.

**Conclusion**

This work surveys recent advances in the fast-growing field of instruction tuning, which can also be referred to as supervised fine-tuning (SFT). We make a systematic review of the literature, including the general methodology of SFT, the construction of SFT datasets, the training of SFT models, and SFT’s applications to different modalities, domains, and applications. We also review analysis on SFT models to discover both their advantages and potential pitfalls. We hope this work will act as a stimulus to motivate further endeavors to address the deficiencies of current SFT models.

In [None]:
qa_pairs = [
  {
    "question": "What is instruction tuning (IT), and why is it also referred to as supervised fine-tuning (SFT)?",
    "answer": "Instruction tuning (IT), also called supervised fine-tuning (SFT), refers to the process of further training a large language model using supervised datasets composed of (INSTRUCTION, OUTPUT) pairs. This process adapts the model from its original objective of next-token prediction to a behavior more aligned with following human instructions."
  },
  {
    "question": "What are the three main benefits of instruction tuning as described in the survey?",
    "answer": "The three main benefits of instruction tuning are: (1) bridging the gap between next-word prediction and instruction-following objectives, (2) enabling more controllable and predictable model behavior through human-aligned constraints, and (3) allowing efficient adaptation to specific domains without requiring extensive retraining or architectural changes."
  },
  {
    "question": "According to the survey, what are the key challenges faced in instruction tuning?",
    "answer": "Key challenges include crafting high-quality and diverse instruction datasets, the limited generalization of models to tasks not present in the training data, and concerns that SFT may teach surface-level formatting patterns rather than genuine task understanding."
  },
  {
    "question": "How does instruction tuning help align LLM behavior with user expectations?",
    "answer": "Instruction tuning helps align LLM behavior with user expectations by training the model to produce outputs that adhere to specific instructions, thereby enabling more interpretable, goal-directed, and human-centered responses."
  },
  {
    "question": "What criticism is leveled against current SFT models according to the authors?",
    "answer": "Criticisms include the claim that SFT models primarily learn output formatting or stylistic patterns instead of genuinely understanding the task, as well as their tendency to perform well only on tasks heavily represented in the training data."
  },
  {
    "question": "Why is this paper significant within the landscape of LLM research?",
    "answer": "The paper fills a notable gap by offering a comprehensive and structured review of instruction tuning techniques, datasets, methodologies, and evaluations—a topic underexplored in contrast to pretraining and downstream application studies."
  },
  {
    "question": "What future directions do the authors recommend for improving SFT techniques?",
    "answer": "The authors recommend improving instruction dataset quality and diversity, developing more robust evaluation metrics, investigating multi-modal SFT techniques, and enhancing generalization across unseen tasks and domains."
  },
  {
    "question": "How does instruction tuning differ from the original pretraining objective of LLMs?",
    "answer": "While pretraining optimizes for contextual word prediction on large corpora, instruction tuning trains models to follow specific human-given tasks, thus aligning model behavior with explicit, user-oriented goals."
  },
  {
    "question": "What is the role of instruction datasets in supervised fine-tuning?",
    "answer": "Instruction datasets provide supervised examples of tasks formatted as (INSTRUCTION, OUTPUT) pairs. Their quality, diversity, and coverage significantly affect the generalization ability and instruction-following precision of the fine-tuned model."
  },
  {
    "question": "How can instruction tuning impact real-world applications of LLMs?",
    "answer": "Instruction tuning enhances the ability of LLMs to perform complex tasks in domains such as healthcare, education, and legal reasoning by tailoring model outputs to human-specified formats and expectations, thereby improving trust, usability, and safety."
  },
  {
    "question": "Why might a practitioner choose SFT over other fine-tuning strategies?",
    "answer": "Practitioners might choose SFT because it is more computationally efficient, supports domain-specific adaptation without modifying the base architecture, and produces more predictable, instruction-aligned behavior than traditional next-token finetuning."
  }
]

In [None]:
title = "Instruction Tuning for Large Language Models: A Survey"
generate_QA_pair("07", 2024, "instruction_tuning_llm", title, qa_pairs)

### **08: LLaMA-Adapter: Efficient Fine-Tuning Of Large Language Models With Zero-Initialized Attention**

**Abstract**

With the rising tide of large language models (LLMs), there has been a growing interest in developing general-purpose instruction-following models, e.g., Chat-GPT. To this end, we present LLaMA-Adapter, a lightweight adaption method for efficient instruction tuning of LLaMA. Using 52K self-instruct demonstrations, LLaMA-Adapter only introduces 1.2M learnable parameters upon the frozen LLaMA 7B model, and costs less than one hour for fine-tuning. Specifically, a zero-initialized attention mechanism is proposed. It adopts a learnable zero gating to adaptively inject the instructional cues into LLaMA within self-attention layers, contributing to a stable training process and superior final performance. In this way, LLaMA-Adapter can generate high-quality responses to diverse language instructions, comparable to Alpaca with fully fine-tuned 7B parameters. Besides language commands, by incorporating an image encoder, our approach can be simply extended to a Multi-modal LLM for image-conditioned instruction following, which achieves superior multi-modal reasoning capacity on several popular benchmarks (MME, MMBench, LVLM-eHub). Furthermore, we also verify the proposed zero-initialized attention mechanism for fine-tuning other pre-trained models (ViT, RoBERTa, CLIP) on traditional vision and language tasks, demonstrating the effectiveness and generalizability of our approach.

**Introduction**

Large Language Models (LLMs) have stimulated widespread attention in both academia and industry. Driven by massive corpora and advanced hardware, LLMs exhibit remarkable understanding and generative ability, propelling language tasks to a higher level. Recently, significant progress has been made on instruction-following models, e.g., ChatGPT (OpenAI, 2023a) and GPT-4 (OpenAI, 2023b), which follow language instructions and generate contextual responses. However, the further prevalence of instruction models is largely impeded by the closed-source restriction and high development costs.

To alleviate this, Stanford Alpaca proposes to fine-tune an open-source LLM, i.e., LLaMA into an instruction-following model, which is affordable and replicable. Starting from 175 human-written instruction-output pairs, Alpaca leverages GPT-3.5 to expand the training data to 52K in a self-instruct manner. Supervised by this, Alpaca fine-tunes the entire 7B parameters in LLaMA, producing an exceptional instruction model that performs similarly to GPT-3.5. Despite Alpaca’s effectiveness, a complete fine-tuning of large-scale LLaMA is still time-consuming, computation-intensive, and cumbersome to transfer to different downstream scenarios.

In this paper, we introduce LLaMA-Adapter, an efficient fine-tuning method that adapts LLaMA into a well-performed instruction-following model. Trained by Alpaca’s instruction-output data, our approach freezes the entire LLaMA model, and proposes a zero-initialized attention mechanism with superior resource efficiency. Specifically, in LLaMA’s higher transformer layers, we append a set of learnable adaption prompts as prefixes to the word tokens. Then, to avoid the noise from randomly initialized prompts at the early training stage, we equip the frozen self-attention layers with a learnable gating factor. The gating mechanism is initialized by zeros, and controls the feature interaction between prompt and word tokens, within the process of attention calculation. Such a strategy can first preserve the original knowledge in LLaMA, and progressively inject the new instructional signals during training. This contributes to a more stable learning process and better instruction-following capacity of the final model.
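
The gating idea described above can be illustrated with a small sketch. It assumes a single query with scalar values, separate softmax normalization for prompt and word scores, and a `tanh`-scaled gate on the prompt part; these details, and the names `softmax` and `gated_attention`, are illustrative assumptions rather than the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_attention(prompt_scores, word_scores, prompt_values, word_values, gate):
    """Attention output for one query (scalar values, for illustration).

    Prompt and word scores are normalized separately; the prompt part is
    scaled by tanh(gate), so gate = 0 removes the prompts' contribution.
    """
    prompt_weights = [math.tanh(gate) * w for w in softmax(prompt_scores)]
    word_weights = softmax(word_scores)
    out = sum(w * v for w, v in zip(prompt_weights, prompt_values))
    out += sum(w * v for w, v in zip(word_weights, word_values))
    return out


prompt_scores, prompt_values = [0.3, -1.2], [5.0, -4.0]
word_scores, word_values = [1.0, 0.5, -0.2], [1.0, 2.0, 3.0]

# Output of the frozen model alone (no adaption prompts).
frozen_only = sum(w * v for w, v in zip(softmax(word_scores), word_values))
at_init = gated_attention(prompt_scores, word_scores, prompt_values, word_values, gate=0.0)
later = gated_attention(prompt_scores, word_scores, prompt_values, word_values, gate=0.5)

print(math.isclose(at_init, frozen_only))  # gate = 0: prompts contribute nothing
print(math.isclose(later, frozen_only))    # gate != 0: prompts shift the output
```

With the gate at zero the randomly initialized prompts cannot disturb the frozen model's output, which is the stability argument the paragraph makes; as the gate moves away from zero during training, the prompt values are mixed in gradually.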

Overall, our LLaMA-Adapter exhibits four main characteristics, as shown in Figure 1.

• 1.2M Parameters. Instead of updating the full 7B parameters, we freeze the pre-trained LLaMA and only learn the zero-initialized attention mechanism with 1.2M parameters. This, however, reveals comparable instruction-following proficiency with the 7B Alpaca.

• One-hour Fine-tuning. Thanks to our lightweight adaption modules with zero-initialized gating, the training convergence of LLaMA-Adapter costs less than one hour on 8 A100 GPUs, which is three times faster than Alpaca.

• Plug with Expertise. For different scenarios, it is flexible to insert their respective adapters to endow LLaMA with different expert knowledge or new modality input. Thus, it suffices to store a 1.8M adapter within each context, other than a complete copy of the 13G LLaMA.

• Multi-modal Reasoning. Besides language instruction, our approach can also incorporate an image encoder via zero-initialized attention to become a multi-modal LLM. Compared to concurrent works, LLaMA-Adapter showcases higher tuning efficiency with competitive reasoning capacity on MME, MMBench, and LVLM-eHub benchmarks.

In addition to instruction tuning, our zero-initialized attention can be generalized to traditional vision and language tasks for parameter-efficient fine-tuning. We apply our approach to the pre-trained ViT, RoBERTa, and CLIP, respectively for fine-tuning vision, language, and vision-language models. On a wide range of downstream tasks, we demonstrate the effectiveness of our proposed method for traditional tasks.

**Conclusion**

In this paper, we propose LLaMA-Adapter, an efficient adaption method for tuning instruction-following models. For better training stability and final performance, we introduce the zero-initialized attention mechanism with a learnable gating factor, which increasingly incorporates instructional signals, while preserving the pre-trained knowledge in LLaMA. With only 1.2M parameters and one-hour training, our approach effectively fine-tunes LLaMA with superior efficiency compared to the 7B-parameter Alpaca. LLaMA-Adapter can be generalized to image-conditioned generation as a multi-modal LLM, achieving competitive results on various visual question answering benchmarks.
On traditional vision and language tasks, our zero-initialized attention also attains favorable fine-tuning performance, which indicates strong generalization capacity.

In [None]:
qa_pairs = [
  {
    "question": "What is the primary goal of LLaMA-Adapter, and how does it differ from full fine-tuning methods like Alpaca?",
    "answer": "LLaMA-Adapter aims to efficiently transform LLaMA into an instruction-following model using only 1.2 million learnable parameters, avoiding the need to update the full 7 billion parameters as in Alpaca. It achieves this via a lightweight mechanism called zero-initialized attention, offering significant gains in training speed and resource efficiency."
  },
  {
    "question": "How does the zero-initialized attention mechanism contribute to stable training in LLaMA-Adapter?",
    "answer": "The zero-initialized attention mechanism introduces a learnable gating factor initialized to zero, which regulates the interaction between adaptation prompts and original token embeddings. This setup minimizes early training noise and allows gradual injection of instructional signals, stabilizing learning while preserving pre-trained knowledge."
  },
  {
    "question": "What are the four main characteristics of LLaMA-Adapter highlighted in the paper?",
    "answer": "The four characteristics are: (1) Only 1.2M parameters are learned; (2) Fine-tuning converges in under one hour using 8 A100 GPUs; (3) Modularity enables domain-specific adapters to be plugged in without retraining the full model; and (4) The method can be extended to multi-modal instruction following with image encoders."
  },
  {
    "question": "How does LLaMA-Adapter extend to multi-modal reasoning, and how does it perform?",
    "answer": "LLaMA-Adapter incorporates an image encoder through the same zero-initialized attention mechanism to handle image-conditioned instruction following. Compared to concurrent works, it shows higher tuning efficiency with competitive reasoning capacity on the MME, MMBench, and LVLM-eHub benchmarks."
  },
  {
    "question": "How does the adapter approach facilitate specialization for different downstream tasks or modalities?",
    "answer": "Adapters are lightweight modules that can be easily inserted into the frozen LLaMA model, enabling specialization for different domains or input modalities without retraining the entire model. This design allows storing task-specific adapters separately rather than duplicating the full model."
  },
  {
    "question": "What models beyond LLaMA were evaluated using the zero-initialized attention method, and what were the results?",
    "answer": "The zero-initialized attention mechanism was tested on ViT, RoBERTa, and CLIP for vision, language, and vision-language tasks respectively. In all cases, the method showed strong generalization and effective fine-tuning performance, confirming its versatility across modalities."
  },
  {
    "question": "Why is the gating factor in zero-initialized attention initialized to zero, and what advantage does this offer?",
    "answer": "Initializing the gating factor to zero ensures that during early training stages, the model relies entirely on its pre-trained knowledge. Instructional signals are introduced gradually as training progresses, reducing disruption and improving training stability."
  },
  {
    "question": "How does LLaMA-Adapter compare to Alpaca in terms of efficiency and performance?",
    "answer": "While Alpaca fine-tunes all 7B parameters, LLaMA-Adapter matches its performance using only 1.2M parameters and completes training in one hour, making it significantly more efficient without sacrificing instruction-following quality."
  },
  {
    "question": "What is instruction tuning, and why is it important for aligning LLMs with user intent?",
    "answer": "Instruction tuning is a supervised fine-tuning method that trains models on (INSTRUCTION, OUTPUT) pairs to help them better follow user commands. It bridges the gap between unsupervised language modeling and goal-directed behavior expected by human users."
  },
  {
    "question": "Why is parameter-efficient fine-tuning increasingly important in LLM development?",
    "answer": "As model sizes grow, updating all parameters becomes computationally costly and resource-intensive. Parameter-efficient methods like adapters and prompt tuning allow rapid specialization with fewer resources, enabling broader accessibility and faster iteration in both research and deployment."
  }
]

In [None]:
title = "LLaMA-Adapter: Efficient Fine-Tuning Of Large Language Models With Zero-Initialized Attention"
generate_QA_pair("08", 2024, "llama_adapter", title, qa_pairs)

### **09: LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models**

**Abstract**

The success of large language models (LLMs), like GPT-4 and ChatGPT, has led to the development of numerous cost-effective and accessible alternatives that are created by finetuning open-access LLMs with task-specific data (e.g., ChatDoctor) or instruction data (e.g., Alpaca). Among the various fine-tuning methods, adapter-based parameter-efficient fine-tuning (PEFT) is undoubtedly one of the most attractive topics, as it only requires fine-tuning a few external parameters instead of the entire LLMs while achieving comparable or even better performance. To enable further research on PEFT methods of LLMs, this paper presents LLM-Adapters, an easy-to-use framework that integrates various adapters into LLMs and can execute these adapter-based PEFT methods of LLMs for different tasks. The framework includes state-of-the-art open-access LLMs such as LLaMA, BLOOM, and GPT-J, as well as widely used adapters such as Series adapters, Parallel adapter, Prompt-based learning and Reparametrization-based methods. Moreover, we conduct extensive empirical studies on the impact of adapter types, placement locations, and hyper-parameters to the best design for each adapter-based methods. We evaluate the effectiveness of the adapters on fourteen datasets from two different reasoning tasks, Arithmetic Reasoning and Commonsense Reasoning. The results demonstrate that using adapter-based PEFT in smaller-scale LLMs (7B) with few extra trainable parameters yields comparable, and in some cases superior, performance to powerful LLMs (175B) in zero-shot inference on both reasoning tasks.

**Introduction**

Large language models (LLMs), such as Chat-GPT (OpenAI, 2022) and GPT-4 (OpenAI, 2023), have demonstrated unprecedented performance across various natural language processing (NLP) tasks and multi-modal tasks. These LLMs often possess sizes exceeding hundreds of billions of parameters and are closed-source. Consequently, this has spurred the development of accessible and cost-effective alternatives such as LLaMA. These alternatives involve fine-tuning open-source LLMs utilizing either task-specific data (e.g., ChatDoctor) or instructional data. However, full-model fine-tuning (FFT) is computationally and storage-intensive, thereby presenting significant challenges in practical implementation.

Prior to the emergence of FFT of LLMs (e.g., LLaMA), a compelling solution called parameter-efficient fine-tuning (PEFT) has been proposed in the NLP field, specifically for pre-trained models, offering a promising approach for efficiently fine-tuning LLMs. The advantage of PEFT lies in its ability to fine-tune only a small set of external parameters rather than the entire backbone model while still achieving comparable or even superior performance. Moreover, PEFT can effectively mitigate catastrophic forgetting in comparison to FFT. As shown in Table 1, the advantage of PEFT has resulted in the developing of diverse PEFT modules, encompassing series adapters, parallel adapters, reparameterization-based methods, and prompt-based learning methods.

By incorporating these PEFT modules into backbone models (i.e., LLMs), we can capitalize on the remarkable capabilities of backbone models without requiring extensive computational resources. This opens up opportunities for a broader range of applications, enabling even those with limited access to high-performance computing to harness the power of LLMs in their specific tasks. Despite the success of PEFT for pre-trained models, it remains unclear which PEFT module, in combination with which layer and hyperparameter configuration, is most suitable for a given task or dataset when meeting LLMs. Therefore, further investigation is needed to determine the optimal PEFT setup that maximizes performance across different tasks and datasets.

Motivated by this, in this paper, we conduct a comprehensive empirical study of PEFT of three representative open-source LLMs, including BLOOM, GPT-J, and LLaMA. Specifically, we undertake an empirical study to address the following three research questions: (i) What is the optimal placement and configuration of different PEFT methods? (ii) How’s the performance of different adapters across downstream tasks? And (iii) What are the differences in performance between in-distribution (ID) and out-of-distribution (OOD) scenarios for PEFT methods? The findings of our study are as follows:

1. The optimal placement for the series adapter, parallel adapter, and LoRA is after the MLP layers, parallel with the MLP layers, and located after both the Attention layers and MLP layers simultaneously, respectively;

2. Smaller language models with the PEFT approach can attain competitive or superior performance on specific tasks compared to larger language models. For instance, LLaMA-13B with LoRA can outperform GPT-3.5 (>175B) on MultiArith, AddSub, and SingleEq;

3. The ID fine-tuned LLaMA-13B with adapters outperforms ChatGPT on commonsense reasoning tasks, indicating that smaller language models have the potential to outperform larger language models on specific tasks with ID fine-tuning data.
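
The placement terms in finding 1 can be made concrete with a toy sketch. It assumes the common formulations in which a series adapter transforms the sublayer output (with a residual connection) and a parallel adapter reads the sublayer input and is added to its output; the elementwise `mlp` and `make_adapter` functions are hypothetical stand-ins, not the LLM-Adapters code.

```python
def mlp(x):
    """Stand-in for a frozen MLP sublayer (toy elementwise function)."""
    return [2.0 * v + 1.0 for v in x]


def make_adapter(down_scale, up_scale):
    """Toy bottleneck adapter: scale down, apply ReLU, scale up (elementwise)."""
    def adapter(x):
        hidden = [max(0.0, down_scale * v) for v in x]
        return [up_scale * h for h in hidden]
    return adapter


def series_block(x, adapter):
    """Series adapter: applied to the MLP output, with a residual connection."""
    h = mlp(x)
    return [a + b for a, b in zip(h, adapter(h))]


def parallel_block(x, adapter):
    """Parallel adapter: reads the MLP input and is added to the MLP output."""
    return [a + b for a, b in zip(mlp(x), adapter(x))]


x = [1.0, -2.0]
adapter = make_adapter(down_scale=0.5, up_scale=0.1)
print(series_block(x, adapter))
print(parallel_block(x, adapter))
```

The two blocks differ only in which tensor the adapter sees: the series variant depends on the MLP output, while the parallel variant can be computed alongside the MLP from the same input.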

Our contributions can be summarized as follows:
• We conduct a comprehensive empirical study of various PEFT methods applied in different open-source LLMs.

• To facilitate our empirical study, we construct two high-quality training datasets to enhance PEFT performance in math reasoning and commonsense reasoning tasks.

• We develop a user-friendly framework, LLM-Adapter, which seamlessly integrates diverse adapters into LLMs, empowering researchers to implement adapter-based PEFT methods for a wide range of tasks.

• We conduct extensive experiments to answer the three research questions to serve as inspiration for future research.

**Conclusion**

In this paper, we develop a user-friendly framework, LLM-Adapter, which seamlessly integrates diverse adapters into LLMs, empowering researchers to implement adapter-based PEFT methods for a wide range of tasks. To evaluate different PEFT methods on downstream tasks, we construct two high-quality fine-tuning datasets to enhance PEFT performance on math reasoning and commonsense reasoning tasks. By utilizing the LLM-Adapter toolkit and the constructed fine-tuning datasets, we conduct a comprehensive empirical study and answer the research questions on the optimal placement and configuration of different PEFT methods, the impact of adapter architectures, and the influence of ID and OOD scenarios. We hope this work will encourage further research on PEFT methods for LLMs.

**Limitations**

There are two limitations to this work. Firstly, due to constrained computing resources, we were unable to evaluate the performance of larger language models such as LLaMA-33B and LLaMA-65B. It is anticipated that these larger models, possessing enhanced language understanding capabilities, would yield superior performance. Secondly, this paper does not delve into the exploration of combining different adapters. Given the extensive search space associated with the combination of various PEFT methods, we intend to explore this direction in future research endeavors.

In [None]:
qa_pairs = [
  {
    "question": "What is the main objective of the LLM-Adapter framework introduced in this paper?",
    "answer": "The LLM-Adapter framework aims to provide a user-friendly, modular platform that integrates diverse adapter-based parameter-efficient fine-tuning (PEFT) methods into large language models, allowing researchers to efficiently apply and evaluate these methods across a wide range of NLP tasks."
  },
  {
    "question": "Which open-source LLMs and adapter types are supported in the LLM-Adapter framework?",
    "answer": "The framework supports open-source models like LLaMA, BLOOM, and GPT-J. It integrates various PEFT techniques including series adapters, parallel adapters, reparameterization-based methods, and prompt-based learning approaches."
  },
  {
    "question": "How does adapter placement affect performance in different PEFT methods according to the empirical study?",
    "answer": "The study finds that optimal adapter placement varies by method: series adapters perform best after MLP layers, parallel adapters work well when placed in parallel with MLP layers, and LoRA achieves the best performance when inserted after both the Attention and MLP layers."
  },
  {
    "question": "What do the authors find regarding the performance of smaller LLMs like LLaMA-13B compared to larger models like GPT-3.5?",
    "answer": "The authors observe that smaller models like LLaMA-13B, when equipped with PEFT methods such as LoRA, can outperform larger models like GPT-3.5 on specific tasks like MultiArith, AddSub, and SingleEq."
  },
  {
    "question": "What are the key research questions addressed in this empirical study?",
    "answer": "The study investigates: (1) the optimal placement and configuration for different PEFT methods, (2) the comparative performance of different adapters on downstream tasks, and (3) how PEFT methods perform in in-distribution (ID) versus out-of-distribution (OOD) scenarios."
  },
  {
    "question": "How does the use of in-distribution fine-tuning data affect performance on commonsense reasoning tasks?",
    "answer": "The study shows that in-distribution fine-tuning using adapters allows smaller models like LLaMA-13B to outperform even ChatGPT on commonsense reasoning tasks, highlighting the importance of domain-aligned tuning data."
  },
  {
    "question": "What kinds of datasets did the authors construct for evaluating PEFT performance?",
    "answer": "The authors constructed two high-quality fine-tuning datasets designed to enhance PEFT performance on math reasoning and commonsense reasoning tasks, enabling robust evaluation across different adapter configurations."
  },
  {
    "question": "What are the two main limitations acknowledged in this study?",
    "answer": "First, the study did not evaluate larger models like LLaMA-33B or LLaMA-65B due to resource constraints. Second, the work did not explore combinations of different adapter types, which remains a promising direction for future research."
  },
  {
    "question": "What is parameter-efficient fine-tuning (PEFT) and why is it important?",
    "answer": "PEFT is a technique where only a small subset of parameters is fine-tuned instead of the entire model, reducing computational costs while preserving or even enhancing performance. It enables efficient model adaptation without full retraining."
  },
  {
    "question": "How does adapter-based fine-tuning compare to full model fine-tuning (FFT) in terms of computational efficiency?",
    "answer": "Adapter-based fine-tuning is significantly more computationally efficient than FFT, as it avoids updating all model parameters. It also helps mitigate issues like catastrophic forgetting and is easier to scale across multiple tasks or domains."
  }
]

In [None]:
title = "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models"
generate_QA_pair("09", 2023, "llm_adapters", title, qa_pairs)

### **10: LoRA VS Full Fine-Tuning: An Illusion of Equivalence**

**Abstract**

Fine-tuning is a crucial paradigm for adapting pre-trained large language models to downstream tasks. Recently, methods like Low-Rank Adaptation (LoRA) have been shown to match the performance of fully fine-tuned models on various tasks with an extreme reduction in the number of trainable parameters. Even in settings where both methods learn similarly accurate models, are their learned solutions really equivalent? We study how different fine-tuning methods change pre-trained models by analyzing the model’s weight matrices through the lens of their spectral properties. We find that full fine-tuning and LoRA yield weight matrices whose singular value decompositions exhibit very different structure; moreover, the fine-tuned models themselves show distinct generalization behaviors when tested outside the adaptation task’s distribution. More specifically, we first show that the weight matrices trained with LoRA have new, high-ranking singular vectors, which we call intruder dimensions. Intruder dimensions do not appear during full fine-tuning. Second, we show that LoRA models with intruder dimensions, despite achieving similar performance to full fine-tuning on the target task, become worse models of the pre-training distribution and adapt less robustly to multiple tasks sequentially. Higher-rank, rank-stabilized LoRA models closely mirror full fine-tuning, even when performing on par with lower-rank LoRA models on the same tasks. These results suggest that models updated with LoRA and full fine-tuning access different parts of parameter space, even when they perform equally on the fine-tuned distribution. We conclude by examining why intruder dimensions appear in LoRA fine-tuned models, why they are undesirable, and how their effects can be minimized.


**Introduction**

Adapting large, pre-trained models to downstream tasks via fine-tuning is a computation- and data-efficient way to create domain-specific models for a variety of tasks. The simplest approach is to fine-tune all parameters of the pre-trained model on downstream task data. However, as pre-trained models grow larger, full fine-tuning becomes increasingly challenging and expensive. Recently, parameter-efficient fine-tuning (PEFT) methods, especially low-rank adaptation, have been shown to enable fine-tuning with only a fraction of the trainable parameters. But even when fine-tuning with LoRA matches the performance of full fine-tuning, are the solutions learned by the two methods really equivalent?

While full fine-tuning treats every parameter as trainable, LoRA treats the learned update to a weight matrix as the product of two low-rank matrices. While this parameterization is empirically effective, a principled explanation of the mechanism by which it matches the full fine-tuning performance has remained elusive. One explanation is offered by the intrinsic dimension hypothesis, which posits that the update learned via fine-tuning has an intrinsically low rank, suggesting that LoRA might recover an approximately equivalent solution to full fine-tuning. However, prior work has observed differences in the ability of LoRA and full fine-tuning to independently change the angle and magnitude with which a neuron transforms its input (Liu et al., 2024). Moreover, other work has also observed that LoRA has difficulty matching the performance of full fine-tuning on harder tasks, like code generation and long-form text generation. Therefore, it is unclear if these findings indicate a limit in LoRA’s ability to fit to a specific downstream task, or if these methods learn inherently different solutions.

In this paper, we show that full fine-tuning and LoRA learn different solutions with characteristic differences in their spectral properties (as shown in Fig. 1) and different generalization behaviors outside the target task distribution. We observe:

1. LoRA and full fine-tuning produce structurally different parameter updates, characterized by the existence of intruder dimensions. These are singular vectors, with large associated singular values, that are approximately orthogonal to the singular vectors in a pre-trained weight matrix. In contrast, fully fine-tuned models remain spectrally similar to the pre-trained model and do not contain intruder dimensions.

2. Behaviorally, LoRA fine-tuned models with intruder dimensions forget more of the pre-training distribution and exhibit less robust continual learning compared to full fine-tuning: LoRA fine-tuned models with intruder dimensions are inferior to fully fine-tuned models outside the adaptation task’s distribution, despite matching accuracy in distribution. However, higher-rank LoRA fine-tuned models, with identical adaptation task performance, more closely resemble full fine-tuned models on these measures. Very high rank LoRA models (e.g., full-rank LoRA) also forget more of their pre-training distribution, highlighting the fact that LoRA is not exempt from the general tradeoff between expressive power and generalization.

3. Even when a low-rank LoRA performs well on a target task, a higher-rank parameterization may still be preferable. While we observe that our low-rank LoRAs (r ≤ 8) fit our downstream task distribution as well as full fine-tuning and high-rank LoRAs, using a high-rank (r = 64) leads to models that both exhibit better generalization and robust adaptability. However, in order to take advantage of higher ranks, the LoRA updated models must be rank-stabilized.
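
The excerpt does not define "rank-stabilized". The sketch below assumes it refers to rank-stabilized LoRA (rsLoRA), which scales the low-rank update by `alpha / sqrt(r)` instead of the standard LoRA factor `alpha / r`; the function names and the value `alpha = 16` are illustrative only.

```python
import math


def lora_scale(alpha, rank):
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / rank


def rslora_scale(alpha, rank):
    """Rank-stabilized scaling: divide by sqrt(rank) instead of rank."""
    return alpha / math.sqrt(rank)


alpha = 16  # illustrative value, not taken from the paper
for rank in (1, 8, 64):
    print(rank, lora_scale(alpha, rank), rslora_scale(alpha, rank))
```

Under the standard factor the update is scaled down sharply as the rank grows, while the square-root factor shrinks it far more slowly, which is one way to read the requirement that high-rank LoRA be rank-stabilized.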


**Conclusion**

The paper describes the finding that LoRA and full fine-tuning, with equal performance on the fine-tuning task, can have solutions with very different generalization behaviors outside the fine-tuning task distribution. We found that LoRA and full fine-tuning yield models with significant differences in the spectral properties of their weight matrices: LoRA models often contain “intruder dimensions”, high-ranking singular vectors approximately orthogonal to the singular vectors of pre-trained weight matrices. The existence of intruder dimensions correlates with the fine-tuned model forgetting more of the pre-training distribution as well as forgetting more when trained on tasks sequentially in a continual learning setup.
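
The "approximately orthogonal" criterion can be sketched as a cosine-similarity test. The code assumes the singular vectors have already been computed by an SVD routine and are passed in as plain lists; the helper names, the use of the absolute cosine (singular vectors are only defined up to sign), and the threshold `eps = 0.6` are illustrative assumptions rather than the paper's exact procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def is_intruder(tuned_vec, pretrained_vecs, eps):
    """True if tuned_vec has |cosine| below eps with every pre-trained vector."""
    best = max(abs(cosine(tuned_vec, p)) for p in pretrained_vecs)
    return best < eps


# Toy pre-trained singular vectors: the standard basis of R^4.
pretrained = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

aligned = [0.98, 0.2, 0.0, 0.0]   # close to the first pre-trained vector
spread = [0.5, 0.5, 0.5, 0.5]     # equally far from all of them

print(is_intruder(aligned, pretrained, eps=0.6))
print(is_intruder(spread, pretrained, eps=0.6))
```

A vector that stays close to some pre-trained singular vector is not flagged, whereas one with low similarity to all of them is, matching the description of intruder dimensions as new directions absent from the pre-trained spectrum.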

In [None]:
qa_pairs = [
  {
    "question": "What key question does this paper investigate regarding LoRA and full fine-tuning?",
    "answer": "The paper investigates whether LoRA and full fine-tuning, despite achieving similar accuracy on downstream tasks, actually learn equivalent solutions, particularly in terms of their internal parameter structure and generalization behavior."
  },
  {
    "question": "What are intruder dimensions in the context of LoRA, and how do they differ from full fine-tuning?",
    "answer": "Intruder dimensions are high-ranking singular vectors introduced by LoRA that are approximately orthogonal to the pre-trained weight matrix's singular vectors. These dimensions do not appear in fully fine-tuned models, which tend to preserve the spectral structure of the original model."
  },
  {
    "question": "How do LoRA fine-tuned models perform outside the adaptation task distribution compared to full fine-tuned models?",
    "answer": "LoRA fine-tuned models, particularly those with intruder dimensions, perform worse than full fine-tuned models outside the adaptation task distribution. They exhibit more forgetting of the pre-training distribution and are less robust in continual learning scenarios."
  },
  {
    "question": "What does the paper conclude about LoRA's ability to generalize compared to full fine-tuning?",
    "answer": "The paper concludes that even when LoRA matches full fine-tuning on in-distribution performance, it generalizes less effectively. LoRA models often fail to retain pretraining knowledge and struggle with robust adaptation unless configured with sufficiently high and stabilized ranks."
  },
  {
    "question": "What is rank stabilization in LoRA, and why is it necessary?",
    "answer": "The paper states that LoRA-updated models must be rank-stabilized in order to take advantage of higher ranks. Rank-stabilized LoRA scales the low-rank update by alpha / sqrt(r) rather than alpha / r, which keeps the update from shrinking as the rank r grows. With this scaling, high-rank LoRA models (e.g., r = 64) generalize better and adapt more robustly than low-rank ones."
  },
  {
    "question": "How does increasing the LoRA rank affect the model's performance and generalization?",
    "answer": "Higher LoRA ranks (e.g., r = 64) tend to produce models with better generalization and robustness, more closely resembling full fine-tuned models. However, excessive rank without stabilization can lead to loss of pre-training information, mirroring the tradeoffs seen in full fine-tuning."
  },
  {
    "question": "Why is the intrinsic dimension hypothesis relevant to understanding LoRA's performance?",
    "answer": "The intrinsic dimension hypothesis suggests that task-specific updates may lie in a low-rank subspace, providing a theoretical rationale for LoRA’s success. However, this paper shows that despite this, LoRA and full fine-tuning differ meaningfully in their parameter updates and generalization behavior."
  },
  {
    "question": "What trade-off does LoRA face when increasing its expressive power through higher ranks?",
    "answer": "While increasing LoRA rank improves generalization, it also leads to a higher risk of forgetting pre-trained knowledge—highlighting the classic trade-off between task-specific expressivity and broad generalization."
  },
  {
    "question": "What makes LoRA a popular alternative to full fine-tuning?",
    "answer": "LoRA is popular because it enables fine-tuning of large language models with significantly fewer trainable parameters, reducing computational cost while still achieving strong task-specific performance."
  },
  {
    "question": "What is the paper's overall conclusion about the equivalence between LoRA and full fine-tuning?",
    "answer": "The paper concludes that LoRA and full fine-tuning are not equivalent despite similar in-distribution performance. They explore different regions of parameter space, exhibit distinct spectral properties, and differ in their ability to generalize and retain pre-trained knowledge."
  }
]

In [None]:
title = "LoRA VS Full Fine-Tuning: An Illusion of Equivalence"
generate_QA_pair("10", 2024, "lora_vs_fft", title, qa_pairs)

### **11: LoRA: Low-Rank Adaptation of Large Language Models**

**Abstract**

An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example – deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA.

**Introduction**

Many applications in natural language processing rely on adapting one large-scale, pre-trained language model to multiple downstream applications. Such adaptation is usually done via fine-tuning, which updates all the parameters of the pre-trained model. The major downside of fine-tuning is that the new model contains as many parameters as in the original model. As larger models are trained every few months, this changes from a mere “inconvenience” for GPT-2 or RoBERTa large to a critical deployment challenge for GPT-3 with 175 billion trainable parameters.

Many sought to mitigate this by adapting only some parameters or learning external modules for new tasks. This way, we only need to store and load a small number of task-specific parameters in addition to the pre-trained model for each task, greatly boosting the operational efficiency when deployed. However, existing techniques often introduce inference latency by extending model depth or reduce the model’s usable sequence length (Section 3). More importantly, these methods often fail to match the fine-tuning baselines, posing a trade-off between efficiency and model quality.

We take inspiration from Li et al. (2018a); Aghajanyan et al. (2020) which show that the learned over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed Low-Rank Adaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers’ change during adaptation instead, while keeping the pre-trained weights frozen, as shown in Figure 1. Using GPT-3 175B as an example, we show that a very low rank (i.e., r in Figure 1 can be one or two) suffices even when the full rank (i.e., d) is as high as 12,288, making LoRA both storage and compute-efficient.
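
The hypothesis can be stated compactly. The notation below follows the method section of the LoRA paper, which is not part of the excerpt above, so the symbols $W_0$, $B$, $A$, and $k$ should be read as assumptions relative to this notebook. A frozen weight matrix $W_0 \in \mathbb{R}^{d \times k}$ receives a trainable update $\Delta W$ that is constrained to be a product of two thin matrices:

$$
h = W_0 x + \Delta W x = W_0 x + B A x, \qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k)
$$

As a rough per-matrix illustration, take a square matrix with $d = k = 12{,}288$ and $r = 2$:

$$
\underbrace{d^2}_{\text{full update}} = 150{,}994{,}944, \qquad \underbrace{2dr}_{\text{LoRA}} = 49{,}152, \qquad \frac{d^2}{2dr} = \frac{d}{2r} = 3{,}072
$$

This ratio applies to a single matrix only. The 10,000-fold reduction quoted in the abstract is a model-wide figure that also depends on which weight matrices are adapted.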

LoRA possesses several key advantages.

• A pre-trained model can be shared and used to build many small LoRA modules for different tasks. We can freeze the shared model and efficiently switch tasks by replacing the matrices A and B in Figure 1, reducing the storage requirement and task-switching overhead significantly.

• LoRA makes training more efficient and lowers the hardware barrier to entry by up to 3 times when using adaptive optimizers since we do not need to calculate the gradients or maintain the optimizer states for most parameters. Instead, we only optimize the injected, much smaller low-rank matrices.

• Our simple linear design allows us to merge the trainable matrices with the frozen weights when deployed, introducing no inference latency compared to a fully fine-tuned model, by construction.

• LoRA is orthogonal to many prior methods and can be combined with many of them, such as prefix-tuning. We provide an example in Appendix E.
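
The merging and task-switching advantages listed above can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the toy sizes, the random stand-ins for "trained" factors, and the helper names `lora_forward` and `merge` are all assumptions made for illustration, and the scaling factor used in the LoRA paper is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 2                 # toy dimensions (assumed); r is the LoRA rank

W0 = rng.normal(size=(d, k))        # frozen pre-trained weight
A = rng.normal(size=(r, k))         # trainable low-rank factor
B_trained = rng.normal(size=(d, r)) # random stand-in for a trained factor

def lora_forward(x, W0, A, B):
    """Unmerged forward pass: frozen path plus low-rank path."""
    return W0 @ x + B @ (A @ x)

def merge(W0, A, B):
    """Fold the low-rank update into a single dense matrix for deployment."""
    return W0 + B @ A

x = rng.normal(size=(k,))
unmerged = lora_forward(x, W0, A, B_trained)
merged = merge(W0, A, B_trained) @ x

# The merged matrix gives the same output with a single matmul,
# which is why no extra inference latency is introduced.
print("outputs match:", np.allclose(unmerged, merged))

trainable = A.size + B_trained.size
print("trainable parameters:", trainable, "| frozen parameters:", W0.size)
```

Switching tasks in this picture amounts to keeping `W0` and replacing only the pair `(A, B)`, which is far smaller than the dense weight.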

**Conclusion**

Fine-tuning enormous language models is prohibitively expensive in terms of the hardware required and the storage/switching cost for hosting independent instances for different tasks. We propose LoRA, an efficient adaptation strategy that neither introduces inference latency nor reduces input sequence length while retaining high model quality. Importantly, it allows for quick task-switching when deployed as a service by sharing the vast majority of the model parameters. While we focused on Transformer language models, the proposed principles are generally applicable to any neural networks with dense layers.

There are many directions for future works. 1) LoRA can be combined with other efficient adaptation methods, potentially providing orthogonal improvement. 2) The mechanism behind fine-tuning or LoRA is far from clear: how are features learned during pre-training transformed to do well on downstream tasks? We believe that LoRA makes it more tractable to answer this than full fine-tuning. 3) We mostly depend on heuristics to select the weight matrices to apply LoRA to. Are there more principled ways to do it? 4) Finally, the rank-deficiency of $\Delta W$ suggests that $W$ could be rank-deficient as well, which can also be a source of inspiration for future works.

In [None]:
qa_pairs = [
  {
    "question": "What key challenge in adapting large language models does LoRA aim to address?",
    "answer": "LoRA addresses the challenge of full fine-tuning's computational inefficiency and storage overhead by enabling task adaptation without updating the full set of model parameters, making it feasible to adapt very large models like GPT-3 with significantly fewer trainable parameters."
  },
  {
    "question": "How does LoRA modify the standard fine-tuning process for Transformers?",
    "answer": "LoRA freezes the pre-trained model weights and instead injects trainable low-rank matrices into each layer of the Transformer architecture, allowing efficient training of only the adaptation component while keeping the core model unchanged."
  },
  {
    "question": "What is meant by the 'intrinsic rank' hypothesis that motivates LoRA?",
    "answer": "The intrinsic rank hypothesis suggests that the updates required during fine-tuning lie in a low-dimensional subspace, implying that low-rank adaptations can capture the necessary task-specific information without needing full-rank parameter updates."
  },
  {
    "question": "What empirical benefits does LoRA demonstrate over full fine-tuning on models like GPT-3 and RoBERTa?",
    "answer": "LoRA achieves comparable or superior performance to full fine-tuning on models like GPT-3 and RoBERTa while reducing trainable parameters by up to 10,000 times and requiring 3× less GPU memory, with no added inference latency."
  },
  {
    "question": "How does LoRA enable efficient task-switching in a deployed system?",
    "answer": "LoRA allows the pre-trained model to remain frozen while swapping out small, task-specific low-rank adaptation matrices, enabling fast and memory-efficient switching between tasks without reloading or duplicating the full model."
  },
  {
    "question": "Why does LoRA introduce no additional inference latency?",
    "answer": "Because LoRA's trainable matrices can be merged with the frozen pre-trained weights after training, the final model operates just like a standard Transformer without requiring extra computation during inference."
  },
  {
    "question": "How does LoRA compare to other parameter-efficient methods like adapters or prefix tuning?",
    "answer": "LoRA avoids the inference latency introduced by adapters and the input sequence reduction caused by prefix tuning, offering a more efficient and latency-free alternative while remaining compatible with such methods."
  },
  {
    "question": "What makes LoRA particularly appealing for users with limited computational resources?",
    "answer": "LoRA dramatically reduces the number of trainable parameters and the optimizer state size, making fine-tuning feasible on limited hardware, and allows for fast deployment and task adaptation without retraining large models."
  },
  {
    "question": "Is LoRA specific to language models, or can it generalize to other types of neural networks?",
    "answer": "While LoRA is demonstrated on Transformer-based language models, its underlying principles are applicable to any neural network architecture involving dense layers, making it broadly generalizable."
  },
  {
    "question": "What future research directions are suggested in the paper regarding LoRA?",
    "answer": "Future research directions include combining LoRA with other adaptation methods, understanding how LoRA transforms pre-trained features for downstream tasks, identifying principled ways to choose LoRA injection points, and investigating rank-deficiency in both updates and weights."
  }
]

In [None]:
title = "LoRA: Low-Rank Adaptation of Large Language Models"
generate_QA_pair("11", 2021, "lora", title, qa_pairs)

### **12: Non-instructional Fine-tuning: Enabling Instruction-Following Capabilities in Pre-trained Language Models without Instruction-Following Data**

**Abstract**

Instruction fine-tuning is crucial for today’s large language models (LLMs) to learn to follow instructions and align with human preferences. Conventionally, supervised data, including the instruction and the correct response, is required for instruction fine-tuning. To obtain such data, some researchers prompted well-trained models like GPT-4 to generate instructions and correct responses. In this paper, we propose a novel approach that uses the first half of a random text from OpenWebText as the instruction and GPT-3.5-turbo or GPT-4-turbo to complete the text as the response. Despite the data being “non-instructional”, we found that pretrained LLMs fine-tuned on this data can gain instruction-following capabilities. This observation is verified by fine-tuning several well-known pre-trained LLMs (e.g., LLaMA 2-7B, LLaMA-3-8B, LLaMA-3-70B, Mistral-7B-v0.1). The “non-instructional data” also improved some models that underwent supervised fine-tuning and human preference alignment. Our LLaMA-3-70B-Instruct fine-tuned through “non-instructional data” is comparable with LLaMA-3.1-70B-Instruct on the Arena Hard leaderboard. We analyzed the “non-instructional data” and ensured it is devoid of content related to instruction fine-tuning. Our findings will inspire further investigation into how to develop instruction-following capabilities without explicit instruction-related data.
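
The data construction described in the abstract can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the abstract does not say how the "first half" is measured, so splitting on whitespace-separated words is an assumption, and `build_non_instructional_sample` and `placeholder_complete` are hypothetical names. In the paper the continuation comes from GPT-3.5-turbo or GPT-4-turbo; here a placeholder function stands in for that teacher model.

```python
from typing import Callable, Dict

def build_non_instructional_sample(
    text: str, complete: Callable[[str], str]
) -> Dict[str, str]:
    """Use the first half of `text` as the 'instruction' and a model
    continuation of that half as the 'response'."""
    words = text.split()
    half = len(words) // 2          # assumption: split by word count
    first_half = " ".join(words[:half])
    return {"instruction": first_half, "response": complete(first_half)}

def placeholder_complete(prefix: str) -> str:
    """Hypothetical stand-in for a teacher-model call that continues the text."""
    return "<continuation generated by a teacher model>"

sample = build_non_instructional_sample(
    "the quick brown fox jumps over", placeholder_complete
)
print(sample)
```

Passing the completion function as an argument keeps the sketch independent of any particular model API.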

**Introduction**

In recent years, large language models (LLMs) like GPT-3 and LLAMA have showcased remarkable natural language processing capabilities across diverse domains. Previous studies have introduced instruction fine-tuning to align LLM training objectives with user goals. These methods involve either fine-tuning the model on various tasks using human-annotated prompts and feedback, or supervised fine-tuning utilizing public benchmarks and datasets augmented with manually or automatically generated instructions. Among these approaches, Self-Instruct tuning stands out as a simple and effective method of aligning LLMs with human intent. This is achieved by learning from instruction-following data generated by state-of-the-art instruction-tuned teacher LLMs.

This paper finds that LLMs with instruction-following capabilities can be learned from “non-instructional data.” In this context, “non-instructional data” refers to content that does not contain any explicit instructions. We employed publicly available datasets, such as OpenWebText, for ChatGPT to continue writing. We demonstrate that data generated through distillation with continuous writing, even without explicit instructions, can enhance the capacity of LLMs to understand and execute tasks. This paper investigates novel methodologies that empower LLMs to learn human instructions from a wider range of data, thus eliminating the need for manually annotated or explicitly generated instructional data. Our contributions are summarized as follows:

1. Introduce a simple framework for generating non-instructional datasets to finetune LLMs, enabling them to more effectively follow human instructions.

2. Propose methods for generating non-instructional data: conditional distillation and knowledge distillation with continuous writing.

3. Propose a method of fine-tuning various LLMs using datasets generated by a novel approach. This method retains pre-fine-tuning scores on the Open LLM Leaderboard and significantly improves performance on the Arena Hard and MT-Bench benchmarks. Notably, our fine-tuned Meta-Llama-3-8b model demonstrated substantial gains on Arena Hard compared to other strong SFT datasets, and the fine-tuned Meta-Llama-3-70b-Instruct model achieved the highest recorded score of 57.0, surpassing even the more advanced Meta-Llama-3.1-70b-Instruct. These results underscore the effectiveness of our fine-tuning approach in enhancing the instruction-following capabilities of large language models.

4. Introduce the use of lora-base for model enhancement, demonstrating its effectiveness in improving performance. This technique involves merging the LoRA module fine-tuned on the foundation (base) model with the Instruct model, showcasing improvements across various benchmarks without additional training overhead.
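
The lora-base idea in the last contribution can be pictured as adding a low-rank update, learned against the base model, onto the Instruct model's weights. The sketch below is one plausible reading of that sentence, not the authors' code: the toy sizes, the random stand-ins for the base weight, the Instruct weight, and the trained LoRA factors, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 32, 32, 4                       # toy sizes and rank (assumed)

W_base = rng.normal(size=(d, k))          # foundation (base) model weight
# Stand-in for the Instruct model: the base weight plus some further tuning.
W_instruct = W_base + 0.1 * rng.normal(size=(d, k))

# Stand-ins for LoRA factors that would be trained with W_base frozen.
B = 0.01 * rng.normal(size=(d, r))
A = rng.normal(size=(r, k))
delta = B @ A                             # low-rank update learned on the base model

# lora-base (as read here): apply the base-trained update to the Instruct weight.
W_lora_base = W_instruct + delta

print("rank of the update:", np.linalg.matrix_rank(delta))
print("merged weight shape:", W_lora_base.shape)
```

Because the update is simply added to existing weights, no further gradient steps on the Instruct model are needed, which matches the "without additional training overhead" claim in the list above.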

**Conclusion**

This work introduces a novel approach for enabling instruction-following capabilities in pre-trained language models by relying on “non-instructional data”. Comprehensive experiments with various well-known pre-trained LLMs, including LLaMA and Mistral series models on several benchmarks, validate the effectiveness of our approach, with performance even surpassing models tuned on traditional instruction data. Further analysis reveals that the enhanced instruction-following capabilities do not stem from latent instructional content in the non-instructional datasets. This work may open up new avenues for training instruction-following LLMs because, compared to typical instruction-following datasets, which are usually generated in a supervised manner, the generation of non-instructional data is more scalable and less labor-intensive. For future work, we will further investigate how LLMs develop instruction-following abilities from non-instructional data.

**Limitations**

Our study reveals several limitations. Firstly, the mechanisms through which non-instructional data confers instruction-following abilities remain unclear, necessitating further research. Secondly, more comprehensive comparisons with GPT-4 and GPT-4-Turbo distilled Alpaca data are required. The impact of increasing data volume on model performance also needs investigation. Additionally, expert evaluations are necessary to confirm whether the improvements on MT-Bench and Arena Hard reflect genuine advances or merely mimic the stylistic tendencies of GPT-4 and Claude-3. Lastly, the generalizability of our findings to broader real-world tasks remains uncertain, warranting further exploration.

In [None]:
qa_pairs = [
  {
    "question": "What is the central claim of the paper regarding non-instructional fine-tuning?",
    "answer": "The paper claims that instruction-following capabilities can emerge in pre-trained language models even when fine-tuned on non-instructional data—text continuations without explicit instructions—challenging the assumption that explicit supervision is necessary for instruction alignment."
  },
  {
    "question": "How is 'non-instructional data' defined in this study?",
    "answer": "'Non-instructional data' refers to text samples that contain no explicit instructions. In this study, it consists of the first half of a randomly selected OpenWebText sample, used as the 'instruction', and a continuation generated by GPT-3.5-turbo or GPT-4-turbo, used as the 'response'."
  },
  {
    "question": "What novel methods are introduced for generating non-instructional data?",
    "answer": "The authors introduce conditional distillation and knowledge distillation with continuous writing, where teacher models such as GPT-3.5-turbo or GPT-4-turbo continue a text prefix without any explicit task framing."
  },
  {
    "question": "Which models were fine-tuned using non-instructional data and evaluated in this study?",
    "answer": "The models fine-tuned using non-instructional data include LLaMA 2-7B, LLaMA-3-8B, LLaMA-3-70B, and Mistral-7B-v0.1, all of which demonstrated improved instruction-following capabilities on standard benchmarks."
  },
  {
    "question": "What performance benchmarks were used to evaluate the fine-tuned models?",
    "answer": "The study used benchmarks such as Arena Hard, MT-Bench, and the Open LLM Leaderboard to evaluate instruction-following capability and general performance improvements after non-instructional fine-tuning."
  },
  {
    "question": "What notable performance did LLaMA-3-70B-Instruct achieve in this study?",
    "answer": "LLaMA-3-70B-Instruct, fine-tuned on non-instructional data, achieved a score of 57.0 on the Arena Hard benchmark, surpassing the more advanced Meta-LLaMA-3.1-70B-Instruct model."
  },
  {
    "question": "What is the role of LoRA in the proposed fine-tuning approach?",
    "answer": "The study introduces a technique it calls lora-base: the LoRA module fine-tuned on the foundation (base) model is merged with the Instruct model, improving performance across various benchmarks without additional training overhead."
  },
  {
    "question": "Why might non-instructional data offer a more scalable alternative to traditional instruction datasets?",
    "answer": "Unlike instruction datasets that require manual annotation or teacher-model prompting, non-instructional data can be generated automatically via language model completions, making it less labor-intensive and more scalable."
  },
  {
    "question": "What limitation does the paper acknowledge regarding the mechanism of instruction learning?",
    "answer": "The exact mechanism by which non-instructional data enables instruction-following behavior in LLMs remains unclear, highlighting the need for deeper theoretical and empirical analysis."
  },
  {
    "question": "How does this paper challenge conventional assumptions about supervised fine-tuning?",
    "answer": "By demonstrating that LLMs can acquire instruction-following abilities from non-instructional text, the paper challenges the assumption that explicit (instruction, output) supervision is necessary for aligning models with human intent."
  }
]

In [None]:
title = "Non-instructional Fine-tuning: Enabling Instruction-Following Capabilities in Pre-trained Language Models without Instruction-Following Data"
generate_QA_pair("12", 2024, "NIFT", title, qa_pairs)

### **13: Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey**

**Abstract**

Large models represent a groundbreaking advancement in multiple application fields, enabling remarkable achievements across various tasks. However, their unprecedented scale comes with significant computational costs. These models, often consisting of billions of parameters, require vast amounts of computational resources for execution. Especially, the expansive scale and computational demands pose considerable challenges when customizing them for particular downstream tasks, particularly over the hardware platforms constrained by computational capabilities.

Parameter Efficient Fine-Tuning (PEFT) provides a practical solution by efficiently adjusting the large models over the various downstream tasks. In particular, PEFT refers to the process of adjusting the parameters of a pre-trained large model to adapt it to a specific task or domain while minimizing the number of additional parameters introduced or computational resources required. This approach is particularly important when dealing with large-scale language models with high parameter counts, as fine-tuning these models from scratch can be computationally expensive and resource-intensive, posing considerable challenges in the supporting system platform design.

In this survey, we present comprehensive studies of various PEFT algorithms, examining their performance and computational overhead. Moreover, we provide an overview of applications developed using different PEFT algorithms and discuss common techniques employed to mitigate computation costs for PEFT. In addition to providing an extensive survey from an algorithmic standpoint, we also examine various real-world system designs to investigate the implementation costs associated with different PEFT approaches. This survey serves as a valuable resource for researchers aiming to understand both the PEFT algorithm and its system implementation, offering detailed insights into recent advancements and practical applications.

**Introduction**

Large Models (LMs) have recently captured considerable public interest. Their ability to understand context and nuances enables them to proficiently handle diverse tasks across multiple domains, including natural language processing (NLP), computer vision (CV), etc. In the field of NLP, Large Language Models (LLMs) have achieved significant advancements across various tasks including text generation, translation, personalized chat-bots, and summarization, demonstrating remarkable proficiency.

Earlier studies [1] have suggested that LLMs exhibit high levels of generalization, enabling them to apply their acquired knowledge to new tasks not included in their original training. This capability is commonly known as zero-shot learning. Nevertheless, fine-tuning remains essential to further enhance LLMs for optimal performance on new user datasets and tasks.

Due to its scale, a widely adopted strategy for fine-tuning LLMs involves adjusting a limited number of LLM parameters while keeping the remainder unchanged. This technique, termed Parameter-Efficient-Fine-Tuning (PEFT), involves selectively adjusting a small proportion of their parameters while keeping the rest unaltered. Furthermore, the application of PEFT extends beyond the realm of NLP and quickly attracts interest in the CV community for handling fine-tuning vision models with large parameters, such as Vision Transformers (ViT) and diffusion models, as well as interdisciplinary models such as vision-language models.

In this survey, we systematically review and categorize recent advancements in PEFT algorithms as well as the system implementation costs associated with various PEFT algorithms across diverse scenarios. Figure 1 presents the overview content for this survey. In Section II, we present some fundamental concepts for LLM and PEFT, including computational flow for LLM, basic knowledge of PEFT, commonly used datasets and tasks, and evaluation benchmarks. We categorize all types of PEFT algorithms in Section III according to their computational flow. In Section III-A, we detail additive algorithms that either introduce new weight parameters or modify activations. Algorithms that only require fine-tuning of existing parameters are categorized as selective approaches, which are introduced in Section III-B. In Section III-C, we explore reparameterized PEFT, which constructs a (low-dimensional) reparameterization of original model parameters for training while transforming the weights back to maintain the inference speed. Additionally, there exist algorithms that combine the above techniques, and we have classified these as hybrid approaches, elaborating on them in Section III-D. We also investigate strategies for further reducing the computational complexity of different PEFT algorithms, including KV-cache management, pruning, quantization, and memory optimization, in Section IV.

In Section V, we expand the scope of this survey beyond the computational perspective to involve various potential application scenarios. Specifically, we explore innovations that apply PEFT techniques to different model architectures, including LLMs (Section V-A), Vision Transformer (Section V-B), Vision-Language alignment models (Section V-C), and Diffusion models (Section V-D), for varied downstream tasks, underscoring PEFT’s versatility and applicability in a range of scenarios. After that, in Section VI, we explore the system design challenge for PEFT methods. The discussion includes three advanced system solutions for practical PEFT deployment: PEFT query serving (Section VI-B), distributed tuning (Section VI-C), and concurrent PEFT tuning (Section VI-D). Finally, in Section VII, we summarize our survey and propose several potential future directions from both algorithmic and systemic perspectives, aiming to offer valuable insights for further research and development in the field.
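
To make the additive, selective, and reparameterized categories concrete, the sketch below counts trainable parameters for one toy linear layer under a representative method from each. The choices are illustrative assumptions rather than examples taken from the survey excerpt: a bottleneck adapter for the additive case, bias-only tuning (in the style of BitFit) for the selective case, and LoRA for the reparameterized case. The sizes `d`, `r`, and `bottleneck` are also assumed.

```python
d = 1024          # hidden size of a toy square linear layer (assumed)
r = 8             # LoRA rank (assumed)
bottleneck = 16   # adapter bottleneck dimension (assumed)

counts = {
    # Full fine-tuning: every weight and bias of the d x d layer is trainable.
    "full fine-tuning": d * d + d,
    # Additive: a new down-projection and up-projection, each with a bias.
    "additive (adapter)": (d * bottleneck + bottleneck) + (bottleneck * d + d),
    # Selective: train only the existing bias vector, freeze the weight.
    "selective (bias-only)": d,
    # Reparameterized: train low-rank factors B (d x r) and A (r x d).
    "reparameterized (LoRA)": 2 * d * r,
}

full = counts["full fine-tuning"]
for name, n in counts.items():
    print(f"{name:>24}: {n:>9,} trainable ({100 * n / full:.2f}% of full)")
```

A hybrid method in the survey's sense would combine more than one of these rows, so its count would be the sum of the parts it uses.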

**Conclusion**

In the current era dominated by large models and large datasets, PEFT stands out as a highly attractive method for efficiently adapting models to downstream tasks. This technique gains its appeal by addressing the significant challenges posed by traditional full-model fine-tuning, which often places substantial computational and data demands. This survey offers a comprehensive examination of the most recent advancements in PEFT, including algorithmic design, computational efficiency, application scenarios, and system implementation for PEFT. It offers a comprehensive taxonomy and explanation that serves as an excellent guidance and knowledge base, which enables readers of various levels and disciplines to swiftly grasp the core concepts of PEFT.

For further research on PEFT, we propose a series of possible directions from both algorithm and system perspectives, hoping to inspire more researchers to engage in further studies in these areas.

A. Simplify hyperparameter tuning

The effectiveness of PEFT is often sensitive to its hyperparameters, such as the bottleneck dimension of the adapter, the rank of LoRA, and the arrangement of various additive PEFT layers. Manually tuning these hyperparameters will cost lots of effort. Therefore, future efforts could focus on developing methods that are less dependent on manual tuning of these parameters, or automatically find the optimal configuration settings. Several studies have started to address this issue, but there’s a need for simpler and more efficient solutions for optimizing these hyperparameters.

B. Establish a unified benchmark

Despite the existence of libraries like HuggingFace’s PEFT and AdapterHub, a comprehensive benchmark for PEFT is still lacking. This gap hinders the ability to fairly compare the performance and efficiency of different PEFT approaches. A well-accepted, up-to-date benchmark akin to MMDetection for object detection would enable researchers to validate their methods against a standard set of tasks and metrics, fostering innovation and collaboration within the community.

C. Enhance training efficiency

The presumed parameter efficiency of PEFT is not always consistent with computational and memory savings during training. Given that trainable parameters are intertwined within the pre-trained model’s architecture, computing and storing activations and gradients for the full model often become necessary during fine-tuning. This oversight calls for a rethinking of what constitutes efficiency. As outlined in Section IV, potential solutions lie in the integration of model compression techniques such as pruning and quantization, alongside innovations specifically designed to optimize memory during PEFT tuning. Further research into enhancing the computational efficiency of PEFT methodologies is imperative.

D. Explore scaling laws

The design and effectiveness of PEFT methods originally developed for smaller Transformer models do not necessarily scale with larger models. As the size of foundation models increases, identifying and adapting PEFT strategies that remain effective is crucial. This investigation will aid in customizing PEFT methodologies to suit the evolving landscape of large model architectures.

E. Serve more models and tasks

The rise of large foundation models across various domains presents new opportunities for PEFT. Designing PEFT methods tailored to the unique characteristics of models, such as Sora, Mamba, and LVM, can unlock new application scenarios and opportunities.

F. Enhancing data privacy

Trusting centralized systems to serve or fine-tune personalized PEFT modules is yet another issue for system developers. Multiple types of inversion attacks have been proposed to reconstruct user’s data by hijacking the intermediate results. One perspective of future trust-worthy LLM system design involves developing an encryption protocol for both personal data and intermediate training and inference results.

G. PEFT with model compression

Model compression is one of the most effective ways to make LLM executable on resource-limited devices. Yet, the impact of model compression techniques on the performance of PEFT algorithms running on hardware remains another systemic challenge. Common compression techniques such as quantization and pruning necessitate dedicated hardware platforms to expedite the process, and building such hardware platforms for compressed models is yet another direction for future research.

In [None]:
qa_pairs = [
  {
    "question": "What is Parameter-Efficient Fine-Tuning (PEFT) and why is it important for large models?",
    "answer": "PEFT refers to fine-tuning a pre-trained large model by adjusting only a small subset of its parameters, thereby reducing computational and memory costs. It is especially important for large models with billions of parameters, where full fine-tuning becomes prohibitively expensive in terms of system resources and deployment feasibility."
  },
  {
    "question": "What are the four main categories of PEFT algorithms surveyed in this paper?",
    "answer": "The survey categorizes PEFT algorithms into four groups: (1) Additive approaches, which introduce new parameters or modify activations; (2) Selective approaches, which fine-tune only a subset of existing parameters; (3) Reparameterized methods, which learn a low-dimensional representation of the parameter changes; and (4) Hybrid approaches, which combine elements of the above strategies."
  },
  {
    "question": "How do additive PEFT methods differ from selective methods?",
    "answer": "Additive methods inject new trainable components, such as adapters or side networks, into the model architecture, while selective methods only fine-tune a small fraction of the existing parameters, such as biases or attention layers, without modifying the architecture."
  },
  {
    "question": "What are some techniques discussed in the survey for reducing PEFT’s computational complexity?",
    "answer": "The survey discusses strategies such as key-value cache management, pruning, quantization, and memory optimization to reduce the computational burden and memory requirements during PEFT training and inference."
  },
  {
    "question": "Which model types beyond NLP are being targeted by recent PEFT research, according to the paper?",
    "answer": "Recent PEFT research extends beyond NLP to include Vision Transformers (ViT), vision-language alignment models, and diffusion models, indicating the broad applicability of PEFT across diverse deep learning domains."
  },
  {
    "question": "What three system solutions for practical PEFT deployment are discussed in the paper?",
    "answer": "The paper discusses three advanced system solutions: (1) PEFT query serving, which deals with deploying multiple fine-tuned modules efficiently; (2) distributed tuning, which addresses large-scale PEFT across nodes; and (3) concurrent PEFT tuning, which involves optimizing multiple fine-tuning jobs simultaneously."
  },
  {
    "question": "Why is the absence of a unified benchmark a problem for PEFT research?",
    "answer": "Without a standardized benchmark, it is difficult to compare the performance and efficiency of different PEFT methods fairly. This lack of consistency inhibits collaborative progress, reproducibility, and meaningful evaluation across studies."
  },
  {
    "question": "How does the paper suggest improving PEFT’s training efficiency despite its parameter-efficient design?",
    "answer": "The paper notes that although PEFT reduces the number of trainable parameters, it still often requires full model activations and gradients, which are computationally expensive. To improve efficiency, it recommends integrating model compression techniques like pruning and quantization, and designing memory-optimized training schemes."
  },
  {
    "question": "What future direction is proposed to address hyperparameter tuning challenges in PEFT?",
    "answer": "The paper advocates for research into automatic or simplified hyperparameter tuning strategies, particularly for sensitive parameters such as LoRA rank or adapter bottleneck dimensions, to reduce manual labor and improve accessibility."
  },
  {
    "question": "What systemic challenge does the paper highlight in relation to data privacy and PEFT?",
    "answer": "The paper warns that centralized PEFT systems may be vulnerable to inversion attacks capable of reconstructing user data. It suggests the development of encryption protocols to protect both personal data and intermediate training/inference results as a key area for future trustworthy system design."
  }
]

In [None]:
title = "Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey"
generate_QA_pair("13", 2024, "PEFT_for_LM", title, qa_pairs)
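
Before writing a file, it can help to sanity-check the list of QA pairs. The following is a minimal sketch, assuming each entry should be a dict with non-empty `question` and `answer` strings; `validate_qa_pairs` is a hypothetical helper and is not part of the notebook's existing code.

```python
def validate_qa_pairs(qa_pairs):
    """Return a list of human-readable problems found in qa_pairs (hypothetical helper)."""
    problems = []
    seen_questions = set()
    for i, pair in enumerate(qa_pairs):
        if not isinstance(pair, dict):
            problems.append(f"entry {i}: expected a dict, got {type(pair).__name__}")
            continue
        # Both keys must be present and hold non-blank strings.
        for key in ("question", "answer"):
            value = pair.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {i}: missing or empty '{key}'")
        # Flag questions that repeat an earlier one (case-insensitive).
        question = pair.get("question")
        if isinstance(question, str):
            normalized = question.strip().lower()
            if normalized and normalized in seen_questions:
                problems.append(f"entry {i}: duplicate question")
            seen_questions.add(normalized)
    return problems


# Illustrative data only; not taken from any paper.
sample = [
    {"question": "What is PEFT?", "answer": "Parameter-efficient fine-tuning."},
    {"question": "What is PEFT?", "answer": ""},
]
for problem in validate_qa_pairs(sample):
    print(problem)
```

An empty return value means no problems were found, so a call such as `assert not validate_qa_pairs(qa_pairs)` could sit just before `generate_QA_pair(...)`.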

### **14: Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment**

**Abstract**

With the continuous growth in the number of parameters of transformer-based pretrained language models (PLMs), particularly the emergence of large language models (LLMs) with billions of parameters, many natural language processing (NLP) tasks have demonstrated remarkable success. However, the enormous size and computational demands of these models pose significant challenges for adapting them to specific downstream tasks, especially in environments with limited computational resources. Parameter Efficient Fine-Tuning (PEFT) offers an effective solution by reducing the number of fine-tuning parameters and memory usage while achieving comparable performance to full fine-tuning. The demands for fine-tuning PLMs, especially LLMs, have led to a surge in the development of PEFT methods, as depicted in Fig. 1. In this paper, we present a comprehensive and systematic review of PEFT methods for PLMs. We summarize these PEFT methods, discuss their applications, and outline future directions. Furthermore, we conduct experiments using several representative PEFT methods to better understand their effectiveness in parameter efficiency and memory efficiency. By offering insights into the latest advancements and practical applications, this survey serves as an invaluable resource for researchers and practitioners seeking to navigate the challenges and opportunities presented by PEFT in the context of PLMs.

**Introduction**

Transformer-based PLMs have demonstrated remarkable performance across a wide range of NLP tasks. To fully harness the potential of PLMs, fine-tuning is employed to adapt the PLMs to task-specific data to enhance performance on downstream tasks. However, traditional fine-tuning involves updating all the pretrained parameters of PLMs, which is time-consuming and computationally expensive. As the size of PLMs continues to increase, from models like BERT with 110 million parameters to T5 with 770 million parameters, computational resource requirements become a significant challenge. The advent of LLMs, exemplified by Falcon with a staggering 180 billion parameters, further exacerbates the computational demands. To perform task-specific full fine-tuning with Falcon-180B, a minimum of 5120 GB of computational resources may be required. The enormous computational resource requirements are prohibitive for anyone but the superpower players to utilize LLMs for task-specific fine-tuning.

To address this challenge, a prominent method known as PEFT has emerged as a viable solution to compensate for the tremendous computational cost of full parameter fine-tuning. PEFT involves employing various deep learning techniques to reduce the number of trainable parameters while still maintaining comparable performance to full fine-tuning. In addition, PEFT updates only a small number of additional parameters or updates a subset of the pretrained parameters, preserving the knowledge captured by the PLM while adapting it to the target task and reducing the risk of catastrophic forgetting. Furthermore, since the fine-tuning dataset is typically much smaller than the pretraining dataset, performing full fine-tuning to update all the pretrained parameters may lead to overfitting, which PEFT circumvents by updating the pretrained parameters selectively or not at all.
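
To make the reduction in trainable parameters concrete, here is a rough count for a single d × d linear layer with bias. This is only an illustrative sketch: the low-rank and bias-only schemes are generic examples of parameter-efficient updates rather than methods described in the passage above, and the sizes `d = 4096` and `rank = 8` are assumed values.

```python
def trainable_params(d, rank):
    """Count trainable parameters for one d x d linear layer with bias under three schemes."""
    full = d * d + d          # every weight and bias entry is updated
    low_rank = 2 * d * rank   # two factors (d x rank and rank x d); base weights stay frozen
    bias_only = d             # only the bias vector is updated
    return {"full": full, "low_rank": low_rank, "bias_only": bias_only}


# Assumed sizes, chosen only for illustration.
counts = trainable_params(d=4096, rank=8)
for name, n in counts.items():
    share = 100 * n / counts["full"]
    print(f"{name:>9}: {n:>10,d} trainable ({share:.3f}% of full)")
```

The exact percentages depend on the layer shapes and on which layers receive the update, so the numbers here indicate orders of magnitude only.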

Recently, there has been a significant surge in interest regarding PEFT methods, as demonstrated by the growing number of studies depicted in Fig. 1. This has also led to a few surveys on PEFT approaches for PLMs. However, the existing surveys have certain limitations. Ding et al. conducted a comprehensive study on PEFT methods, but this survey did not cover much of the latest work in the field and only four PEFT methods were quantitatively experimented with. Lialin et al. delved into the ideas and operational implementations of PEFT methods in detail but did not perform relevant experiments. In this work, we address these gaps comprehensively. We meticulously categorize the PEFT methods, providing detailed explanations of the ideas and specific implementations of each method. We compare the similarities and differences among various types of PEFT methods, facilitating a better understanding of the evolving landscape of PEFT. Moreover, we conduct extensive fine-tuning experiments with 11 representative PEFT methods.

In this paper, we aim to provide a comprehensive and systematic study of PEFT methods for PLMs in NLP. We undertake an in-depth exploration of these PEFT methods and present a comprehensive taxonomy scheme in Section III. By categorizing PEFT methods into additive fine-tuning, partial fine-tuning, reparameterized fine-tuning, hybrid fine-tuning, and unified fine-tuning, we establish a structured framework for understanding these PEFT approaches, as depicted in Fig. 2. In Section IV, we conduct quantitative investigations and analyses to assess the performance, parameter efficiency, and memory usage of these PEFT approaches. Our quantitative studies primarily focus on natural language understanding (NLU), machine translation (MT), and natural language generation (NLG) tasks. Additionally, we extensively explore the applications of PEFT in multi-task learning, cross-lingual transfer, and backdoor attack and defense, underscoring its effectiveness. Furthermore, our research also unveils potential directions for future investigations in this rapidly evolving field. To summarize, the main contributions of this survey can be outlined as follows:

• We present a comprehensive analysis and review of PEFT methods for transformer-based PLMs.
• We identify the key techniques and approaches employed in PEFT methods, and classify them into additive, partial, reparameterized, hybrid, and unified fine-tuning methods.
• We conduct extensive experiments to evaluate the effectiveness of several representative PEFT methods, specifically examining their impact on parameter efficiency and memory usage.

**Conclusion**

This paper presents a comprehensive and structured study of PEFT methods for PLMs. By classifying the PEFT methods in NLP, we identify the main techniques and challenges associated with them. We employ several representative PEFT methods to fine-tune encoder-based RoBERTa, encoder-decoder-based T5, and decoder-based LLaMA on various downstream tasks. Experimental results reveal that most PEFT methods significantly improve parameter efficiency and achieve comparable or even better performance compared to full fine-tuning. Additionally, most PEFT methods lower the memory footprint, with QLoRA drastically reducing the computational memory requirement and alleviating the memory challenge when fine-tuning LLMs. Furthermore, we introduce common applications of PEFT methods and outline future research directions. As the development of LLMs continues, there is a clear need to develop PEFT methods that can effectively reduce computational resource demands and memory usage during fine-tuning. This survey aims to provide a bird’s-eye view of PEFT methods for PLMs and to inspire further research in this area.

In [None]:
qa_pairs = [
  {
    "question": "What is the core motivation behind the development of Parameter-Efficient Fine-Tuning (PEFT) methods for Pretrained Language Models (PLMs)?",
    "answer": "The primary motivation for PEFT is to address the computational and memory inefficiencies of full fine-tuning, especially as PLMs grow to billions of parameters. PEFT allows model adaptation by updating only a small fraction of parameters, preserving pre-trained knowledge while avoiding overfitting and reducing resource costs."
  },
  {
    "question": "What are the five categories of PEFT methods identified in the paper’s taxonomy?",
    "answer": "The paper classifies PEFT methods into five categories: (1) Additive Fine-Tuning, (2) Partial Fine-Tuning, (3) Reparameterized Fine-Tuning, (4) Hybrid Fine-Tuning, and (5) Unified Fine-Tuning. This taxonomy provides a structured framework for understanding the diverse strategies within PEFT."
  },
  {
    "question": "How does additive fine-tuning differ from partial fine-tuning within the PEFT framework?",
    "answer": "Additive fine-tuning introduces new trainable components (e.g., adapters or soft prompts) without modifying the original model weights, while partial fine-tuning involves updating only a selected subset of the existing parameters within the pre-trained model, such as the final layers or attention blocks."
  },
  {
    "question": "What experimental evidence does the paper provide to support the efficacy of PEFT methods in terms of parameter efficiency and memory savings?",
    "answer": "Through experiments on encoder-based RoBERTa, encoder-decoder-based T5, and decoder-based LLaMA models, the paper shows that most PEFT methods achieve comparable or superior performance to full fine-tuning while significantly reducing trainable parameter counts and memory usage. Notably, QLoRA achieves dramatic reductions in memory footprint."
  },
  {
    "question": "Why is PEFT considered a potential solution to catastrophic forgetting in fine-tuning PLMs?",
    "answer": "PEFT mitigates catastrophic forgetting by preserving the majority of the pre-trained model's parameters and only updating a small subset, thus maintaining the original knowledge while adapting to new tasks without overwriting core representations."
  },
  {
    "question": "What are some of the broader applications of PEFT methods explored in the paper?",
    "answer": "The paper explores PEFT’s applications in multi-task learning, cross-lingual transfer, and backdoor attack and defense, highlighting the flexibility and robustness of PEFT approaches across diverse use cases and threat models."
  },
  {
    "question": "What limitations in previous PEFT surveys does this paper aim to address?",
    "answer": "Previous surveys either lacked coverage of recent methods or failed to conduct empirical evaluations. This paper addresses both gaps by providing an up-to-date taxonomy of PEFT methods and conducting extensive experiments on eleven representative methods across multiple tasks and architectures."
  },
  {
    "question": "Why is QLoRA highlighted as particularly effective among PEFT methods?",
    "answer": "QLoRA stands out due to its ability to drastically reduce the memory footprint required during fine-tuning without compromising model performance, making it especially suitable for adapting large models on memory-constrained hardware."
  },
  {
    "question": "What role does PEFT play in democratizing access to large language models?",
    "answer": "By enabling model fine-tuning with a fraction of the parameters and memory requirements, PEFT allows researchers and practitioners with limited computational resources to adapt and deploy powerful language models, thus lowering the barrier to entry."
  },
  {
    "question": "What future directions does the paper propose for the continued development of PEFT methods?",
    "answer": "The paper encourages future work in areas such as developing unified PEFT frameworks, automating hyperparameter selection, improving cross-task generalizability, and optimizing PEFT for low-resource environments and multilingual applications."
  }
]

In [None]:
title = "Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment"
generate_QA_pair("14", 2023, "PEFT_for_PLM", title, qa_pairs)

### **15: Parameter-Efficient Transfer Learning with Diff Pruning**

**Abstract**


The large size of pretrained networks makes them difficult to deploy for multiple tasks in storage-constrained settings. Diff pruning enables parameter-efficient transfer learning that scales well with new tasks. The approach learns a task-specific “diff” vector that extends the original pretrained parameters. This diff vector is adaptively pruned during training with a differentiable approximation to the L0-norm penalty to encourage sparsity. As the number of tasks increases, diff pruning remains parameter-efficient, as it requires storing only a small diff vector for each task. Since it does not require access to all tasks during training, it is attractive in on-device deployment settings where tasks arrive in stream or even from different providers. Diff pruning can match the performance of finetuned baselines on the GLUE benchmark while only modifying 0.5% of the pretrained model’s parameters per task and scales favorably in comparison to popular pruning approaches.

**Introduction**

Task-specific finetuning of pretrained deep networks is the dominant paradigm in contemporary NLP, achieving state-of-the-art results across a suite of natural language understanding tasks. While straightforward and empirically effective, this approach is difficult to scale to multi-task memory-constrained settings (e.g. for on-device applications), as it requires shipping and storing a full set of model parameters for each task. Inasmuch as these models are learning generalizable, task-agnostic language representations through self-supervised pretraining, finetuning the entire model for each task seems especially profligate.


A popular approach to parameter-efficiency is to learn smaller compressed models for each task. Such approaches face a steep sparsity/performance tradeoff and keep a substantial amount of nonzero parameters per task (e.g. 10%-30%). Multi-task learning and feature-based transfer allow for more parameter-efficient transfer learning per task. These methods train a small number of additional parameters (e.g. a linear layer) on top of a shared model. However, multi-task learning generally requires access to all tasks during training to prevent catastrophic forgetting (French, 1999), while feature-based transfer learning (e.g. based on task-agnostic sentence representations) is typically outperformed by finetuning (Howard & Ruder, 2018).

An appealing middle ground is to finetune an extension of the base model for specific tasks. This approach captures the training benefits of finetuning while maintaining the task modularity of feature-based transfer. For example, Adapters use smaller, task-specific modules that are inserted between layers of a model. This approach does not require access to all tasks during training, targeting realistic settings where new tasks arrive in a stream. Houlsby et al. (2019) find that adapter layers can match the performance of fully finetuned BERT on the GLUE benchmark while requiring 3.6% additional parameters (on average) per task.

Diff pruning is a new extension to pretrained models with the goal of even more parameter-efficient transfer learning. Instead of modifying the architecture of the model, diff pruning extends the base model through a task-specific difference vector.

In order to learn this vector, we reparameterize the task-specific model parameters as θ_task = θ_pretrained + δ_task, where the pretrained parameter vector θ_pretrained is fixed and the task-specific diff vector δ_task is finetuned. The diff vector is regularized with a differentiable approximation to the L0-norm penalty to encourage sparsity.
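
The excerpt does not spell out which differentiable L0 approximation is used. One widely used option is the hard-concrete relaxation, sketched below in plain Python under that assumption. The diff is written as an elementwise product of a relaxed binary gate `z` and a dense vector `w`; the stretch limits `l = -0.1` and `r = 1.1`, the helper names, and the toy numbers are all assumed for illustration and may differ from the paper's settings.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_gate(alpha, l=-0.1, r=1.1, rng=random):
    """Sample one relaxed binary gate z in [0, 1] from a hard-concrete distribution."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)   # keep u away from 0 and 1 to avoid log(0)
    s = sigmoid(math.log(u) - math.log(1.0 - u) + alpha)
    s_stretched = s * (r - l) + l                   # stretch from (0, 1) to (l, r)
    return min(1.0, max(0.0, s_stretched))          # clamp, so exact 0 and 1 can occur


def expected_l0(alphas, l=-0.1, r=1.1):
    """Differentiable surrogate for the L0 norm: the sum of P(gate != 0) over all gates."""
    return sum(sigmoid(a - math.log(-l / r)) for a in alphas)


rng = random.Random(0)
theta_pretrained = [0.5, -1.2, 0.3, 2.0]   # frozen
w = [0.1, -0.4, 0.05, 0.2]                 # dense task-specific values (trainable)
alphas = [-4.0, 3.0, -4.0, 3.0]            # gate logits (trainable); negative favors z = 0

z = [sample_gate(a, rng=rng) for a in alphas]
delta_task = [zi * wi for zi, wi in zip(z, w)]
theta_task = [p + d for p, d in zip(theta_pretrained, delta_task)]

print("gates:", [round(zi, 3) for zi in z])
print("expected L0 penalty:", round(expected_l0(alphas), 3))
```

In training, `expected_l0(alphas)` would be added to the task loss with a weighting coefficient, so that gradient descent pushes most gate logits negative and leaves only a few nonzero entries in the diff.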

Diff pruning can become extremely parameter-efficient, as it only requires storing the nonzero positions and weights of the diff vector for each task. The cost of storing the shared pretrained model remains constant and is amortized across multiple tasks. On the GLUE benchmark, diff pruning can match the performance of the fully finetuned BERT baselines while finetuning only 0.5% of the pretrained parameters per task. As the number of tasks increases, diff pruning outperforms popular pruning-based methods in the amount of storage required.
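
The storage argument can be illustrated with a small sketch: keep one shared pretrained vector and, per task, only a mapping from nonzero positions to diff values. The function names and the toy vectors are hypothetical, and a real implementation would work on tensors rather than Python lists.

```python
def to_sparse_diff(theta_pretrained, theta_task, tol=0.0):
    """Keep only the positions where the task parameters differ from the pretrained ones."""
    return {
        i: t - p
        for i, (p, t) in enumerate(zip(theta_pretrained, theta_task))
        if abs(t - p) > tol
    }


def apply_sparse_diff(theta_pretrained, sparse_diff):
    """Rebuild task parameters from the shared pretrained vector plus a sparse diff."""
    theta_task = list(theta_pretrained)
    for i, d in sparse_diff.items():
        theta_task[i] += d
    return theta_task


# Toy values chosen to be exactly representable in floating point.
theta_pretrained = [0.5, -1.25, 0.25, 2.0, 0.0, 1.0, -0.5, 0.75]
theta_task_a = [0.5, -1.0, 0.25, 2.0, 0.0, 1.0, -0.5, 0.75]   # one entry changed

diffs = {"task_a": to_sparse_diff(theta_pretrained, theta_task_a)}
restored = apply_sparse_diff(theta_pretrained, diffs["task_a"])

print("stored entries for task_a:", len(diffs["task_a"]), "of", len(theta_pretrained))
print("round trip matches:", restored == theta_task_a)
```

Adding a new task only adds another small entry to `diffs`; the pretrained vector is stored once regardless of how many tasks there are.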

**Results**

5.1 Results on GLUE

Our main results on the GLUE benchmark are shown in Table 1. Structured diff pruning can match the performance of a fully finetuned BERT_LARGE model while only requiring 0.5% additional parameters per task. Diff pruning without structured sparsity also performs well, though slightly worse than the structured approach. Non-adaptive diff pruning, which magnitude prunes the diff vector without learning the binary mask z_τ, performs significantly worse, indicating the importance of learning the masking vector. Compared to Adapters, diff pruning obtains similar performance while requiring many fewer parameters per task, making it a potential alternative for parameter-efficient transfer learning.

5.2 Results on SQuAD

To demonstrate the effectiveness of our approach beyond the GLUE tasks, we additionally experiment on SQuAD (Rajpurkar et al., 2016), an extractive question answering dataset where the model has to select the answer span to a question given a Wikipedia paragraph. To make direct comparisons with Houlsby et al. (2019), we run all experiments on SQuAD v1.1. For diff pruning, we use the same general hyperparameters as our full finetuning baseline (see section A.1). As shown in Figure 1 (right), diff pruning is able to achieve comparable or better performance with only 1.0% additional parameters. Interestingly, diff pruning measurably improves upon the full finetuning baseline while modifying fewer parameters, which indicates that diff pruning can have a useful regularization effect on top of parameter-efficiency.

**Conclusion**

We propose diff pruning as a simple approach for parameter-efficient transfer learning with pre-trained models. Experiments on standard NLP benchmarks and models show that diff pruning can match the performance of fully finetuned baselines while requiring only a few additional parameters per task, and can sometimes have a regularization effect and improve upon regular fine-tuning. We also propose a structured variant of diff pruning which provides further improvements. Avenues for future work include (i) injecting parameter-efficiency objectives directly into the pretraining process (to pretrain models that are better suited towards sparse transfer learning), and (ii) combining diff pruning with other techniques (e.g. adapters, model compression) to achieve even greater parameter-efficiency.

In [None]:
qa_pairs = [
  {
    "question": "What is the central innovation of diff pruning for parameter-efficient transfer learning?",
    "answer": "Diff pruning introduces a task-specific 'diff' vector that extends pretrained model parameters without modifying the base weights. This vector is adaptively pruned during training using a differentiable L0-norm approximation to promote sparsity, allowing highly efficient adaptation with minimal parameter overhead."
  },
  {
    "question": "How does diff pruning reparameterize model weights during fine-tuning?",
    "answer": "Diff pruning reparameterizes task-specific model weights as θ_task = θ_pretrained + δ_task, where θ_pretrained remains fixed and only the difference vector δ_task is optimized, allowing for sparse and efficient adaptation."
  },
  {
    "question": "Why is diff pruning well-suited for on-device or multi-task deployment scenarios?",
    "answer": "Diff pruning is ideal for on-device and multi-task settings because it requires storing only a sparse task-specific diff vector, while the shared pretrained model remains constant across tasks. This enables efficient task switching without catastrophic forgetting and with minimal storage costs."
  },
  {
    "question": "What role does the differentiable approximation to the L0-norm play in diff pruning?",
    "answer": "The differentiable L0-norm encourages sparsity in the diff vector by acting as a regularizer during training, allowing the model to learn compact task-specific updates while preserving performance."
  },
  {
    "question": "How does diff pruning compare to adapter-based methods in terms of parameter efficiency?",
    "answer": "While adapter-based methods like Houlsby adapters require around 3.6% additional parameters per task on average, diff pruning achieves similar performance with as little as 0.5% added parameters per task, making it significantly more parameter-efficient."
  },
  {
    "question": "What are the empirical results of diff pruning on the GLUE benchmark?",
    "answer": "On the GLUE benchmark, structured diff pruning matches the performance of fully fine-tuned BERT models while only modifying 0.5% of parameters per task. The structured variant performs better than unstructured or non-adaptive variants."
  },
  {
    "question": "How did diff pruning perform on the SQuAD v1.1 dataset compared to full fine-tuning?",
    "answer": "On SQuAD v1.1, diff pruning achieved comparable or superior performance to full fine-tuning while modifying only 1% of the parameters, suggesting both efficiency and potential regularization benefits."
  },
  {
    "question": "What are intruder dimensions, and are they relevant in the context of diff pruning?",
    "answer": "The paper does not mention intruder dimensions, so it makes no claims about them. What it does describe is a sparse, task-specific difference vector added to fixed pretrained weights; whether that avoids artifacts attributed to other methods such as LoRA is outside the scope of this paper."
  },
  {
    "question": "What future directions for research does the paper propose regarding diff pruning?",
    "answer": "The paper suggests two directions: (i) incorporating parameter-efficiency objectives into the pretraining stage to better support sparse adaptation, and (ii) combining diff pruning with other techniques like adapters or model compression for enhanced efficiency."
  },
  {
    "question": "Why is diff pruning considered a 'middle ground' between full fine-tuning and feature-based transfer learning?",
    "answer": "Diff pruning captures the performance benefits of fine-tuning while maintaining the modularity and storage efficiency of feature-based approaches. It avoids the rigidity of fixed features and the redundancy of duplicating entire model weights for each task."
  }
]

In [None]:
title = "Parameter-Efficient Transfer Learning with Diff Pruning"
generate_QA_pair("15", 2021, "PETL_with_DP", title, qa_pairs)

### **16: Preference-Oriented Supervised Fine-Tuning: Favoring Target Model Over Aligned Large Language Models**

**Abstract**

Alignment, endowing a pre-trained large language model (LLM) with the ability to follow instructions, is crucial for its real-world applications. Conventional supervised fine-tuning (SFT) methods formalize it as causal language modeling typically with a cross-entropy objective, requiring a large amount of high-quality instruction-response pairs. However, the quality of widely used SFT datasets cannot be guaranteed due to the high cost and intensive labor for the creation and maintenance in practice. To overcome the limitations associated with the quality of SFT datasets, we introduce a novel preference-oriented supervised fine-tuning approach, namely PoFT. The intuition is to boost SFT by imposing a particular preference: favoring the target model over aligned LLMs on the same SFT data. This preference encourages the target model to predict a higher likelihood than that predicted by the aligned LLMs, incorporating assessment information on data quality (i.e., predicted likelihood by the aligned LLMs) into the training process. Extensive experiments are conducted, and the results validate the effectiveness of the proposed method. PoFT achieves stable and consistent improvements over the SFT baselines across different training datasets and base models. Moreover, we prove that PoFT can be integrated with existing SFT data filtering methods to achieve better performance, and further improved by following preference optimization procedures, such as DPO.

**Introduction**

Large language models (LLMs) such as ChatGPT have exhibited successful and potent applications in comprehending human queries and delivering plausible responses. This ability has proven to be crucial in real-world applications, e.g. AI assistants and recommendation systems. To equip LLMs with this ability, alignment methods are usually applied to pre-trained language models. Alignment enables pre-trained models to comprehend the context and generate responses suitable to human interactions. Typical alignment methods can be broadly categorized into two types: Supervised Fine-Tuning (SFT) and Preference Alignment (PA).

Supervised fine-tuning (SFT) is an essential phase of alignment, wherein the task is framed as causal language modeling performed on a pre-trained language model with instruction-response data D = {⟨x, y⟩}. Generally, it leverages the cross-entropy objective function in optimization, equipping the pre-trained language model with the ability to follow instructions and generate coherent sequences. Several studies are dedicated to exploring SFT training strategies to enhance the alignment of LLMs. However, due to the intrinsic traits of modeling, the optimization process heavily depends on the availability of high-quality ⟨x, y⟩ data, which hinders its performance. Traditionally, the prevalent large-scale SFT datasets in earlier research, such as Alpaca and ShareGPT, were mainly developed via AI distillation or human-and-AI interaction. Assuring the quality of these datasets can be challenging, as the filtration and curation processes demand significant human resources and efforts.

Instead of solely aligning the instruction and responses, preference alignment (PA), such as InstructGPT and Direct Preference Optimization (DPO), optimizes the LLMs based on chosen-rejected data ⟨x, y+, y−⟩. These PA methods provide exceptional benefits in model alignment, enabling LLMs to align more accurately with AI/human preferences. In particular, DPO employs the Bradley-Terry (BT) ranking objective (Bradley and Terry 1952) in its optimization process to perform direct preference comparison.

Given the limitations of SFT in processing quality-limited data, we leverage the benefits of the BT preference model and incorporate it into the SFT framework, by proposing a Preference-oriented supervised Fine-Tuning method, called PoFT. Specifically, it applies the BT objective to different models by imposing a particular preference: favoring the target model over the aligned LLMs, given the same ⟨x, y⟩ data. Within this framework, the aligned LLMs act as baselines for the target model, prompting it to attain higher preference scores than those of the aligned LLMs on SFT data. Here, we assume these LLMs could discern data that contribute positively to model optimization, thereby providing valid data quality assessments. Moreover, we would like to emphasize that we use the BT model to rank models rather than to rank data. This means we are fundamentally not a PA approach but rather an SFT approach since we require only ⟨x, y⟩ and not ⟨x, y+, y−⟩. For that matter, we show our approach is indeed orthogonal to PA since PoFT can be combined with PA methods to further enhance the overall alignment performance (e.g., first PoFT and then DPO).

Despite leveraging the preference modeling with BT, at its essence, PoFT remains faithful to the SFT paradigm, relying on instruction-response data. As an enhanced SFT method, PoFT’s objective offers a remarkable advantage over the conventional SFT objective cross-entropy (CE), i.e., PoFT is more stable and robust when training with quality-limited data. Specifically, the introduction of aligned LLMs provides quality assessments on each sample ⟨x, y⟩, which decreases its sensitivity towards the data quality. In practice, by analyzing the gradient updates, we observe that PoFT assigns dynamic weights (namely coefficient defined in section 3) to different samples {⟨x, y⟩} by the aligned LLMs. These weights guide parameter optimization, reducing the negative effect of low-quality data. In contrast, the CE objective treats all the data equally, without differentiating data samples based on their quality, thus exposing it to vulnerabilities to low-quality data.
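
The dynamic weighting described above can be illustrated with a small sketch of a Bradley-Terry loss in which the target model is preferred over an aligned LLM on the same ⟨x, y⟩ pair. This is not the paper's exact objective: using the sequence log-likelihood as the preference score is an assumption, the function name is hypothetical, and the paper defines its own reward function and coefficient in a section not reproduced here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bt_loss_and_weight(logp_target, logp_aligned):
    """Bradley-Terry loss for 'target beats aligned' and the gradient weight on logp_target.

    loss = -log sigmoid(logp_target - logp_aligned)
    d(loss)/d(logp_target) = -sigmoid(logp_aligned - logp_target) = -weight
    """
    margin = logp_target - logp_aligned
    loss = -math.log(sigmoid(margin))
    weight = sigmoid(-margin)
    return loss, weight


# Same target log-likelihood on two samples; the aligned LLM rates them differently.
logp_target = -2.0
loss_good, w_good = bt_loss_and_weight(logp_target, logp_aligned=-1.0)  # aligned LLM finds y likely
loss_bad, w_bad = bt_loss_and_weight(logp_target, logp_aligned=-4.0)    # aligned LLM finds y unlikely

print(f"sample rated likely by aligned LLM:   loss={loss_good:.3f}, weight={w_good:.3f}")
print(f"sample rated unlikely by aligned LLM: loss={loss_bad:.3f}, weight={w_bad:.3f}")
```

Under these assumptions, a sample that the aligned LLM considers unlikely receives a smaller gradient weight, whereas a plain cross-entropy objective would apply the same weight to both samples.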

In summary, our contributions are three-fold:

• Innovative SFT Training Methodology With Preference Modeling. We present a novel method, called PoFT. This new methodology effortlessly integrates aligned LLMs for preference modeling - a fresh perspective that leads to a boost in the optimization process.

• Analytical Insight into PoFT’s Stability. Through rigorous mathematical analysis, we provide theoretical explanations that shed light on the inherent characteristics of PoFT in gradient update.

• Comprehensive Validation of Methodology. We validate the effectiveness of PoFT through extensive experiments on different base models, demonstrating that PoFT achieves superior performance over the CE objective across diverse training datasets. Our ablation studies indicate PoFT’s stability over increasing epochs and enhanced resilience to noise data. Impressively, our experiments prove that the integration of the PoFT and SFT filtering methods can lead to further performance enhancement. Moreover, the two-step training followed by DPO also shows promising alignment performance.

**Conclusion**

In this paper, we present PoFT, a novel and effective preference-oriented SFT method by applying the Bradley-Terry objective for modeling preferences between different models. Specifically, given the same SFT data, we intentionally define a preference: favoring the target model over aligned LLMs. This preference encourages the target model to generate higher preference scores when compared to the aligned LLMs. In essence, the aligned LLMs provide assessments of the data quality in the optimization process, varying the effects of SFT data. We conduct extensive experiments on diverse training datasets and different base models to verify the efficacy of PoFT compared to the baselines (the CE objective). Furthermore, we prove its stability towards noise data and validate the effectiveness of the designed objectives by conducting ablation studies on the reward functions and aligned LLMs. Furthermore, PoFT can be combined with other SFT Filtering methods to attain enhanced performance outcomes. Notably, integrating PoFT with DPO has the potential to yield even superior performance.

In [None]:
qa_pairs = [
  {
    "question": "What is the core motivation behind the development of Preference-Oriented Supervised Fine-Tuning (PoFT)?",
    "answer": "PoFT was developed to address the limitations of conventional supervised fine-tuning, particularly its sensitivity to low-quality instruction-response pairs. By incorporating preference modeling that favors the target model over aligned LLMs, PoFT introduces a robustness mechanism that implicitly evaluates data quality during training."
  },
  {
    "question": "How does PoFT differ fundamentally from traditional preference alignment methods like DPO?",
    "answer": "While traditional preference alignment methods like DPO require ⟨x, y+, y−⟩ tuples to compare responses, PoFT operates within the supervised fine-tuning paradigm, using only ⟨x, y⟩ pairs. It defines preferences not between responses but between models, aiming to make the target model outperform aligned LLMs on the same data."
  },
  {
    "question": "How does the Bradley-Terry (BT) objective function operate within PoFT?",
    "answer": "In PoFT, the BT objective is used to model a preference between the target model and an aligned LLM on the same ⟨x, y⟩ pair. The loss encourages the target model to assign a higher log-likelihood to the correct response than the aligned model, effectively integrating quality-aware preference signals into the optimization process."
  },
  {
    "question": "Why is PoFT considered more robust than conventional SFT with cross-entropy (CE) loss in the presence of low-quality data?",
    "answer": "PoFT dynamically assigns importance weights to training samples based on the log-likelihood assigned by the aligned LLM. This means that examples with lower quality, as assessed by the aligned model, have less influence on training. In contrast, CE treats all samples equally, making it vulnerable to noise and poor-quality data."
  },
  {
    "question": "Can PoFT be integrated with other methods, and if so, how does it perform in combination with DPO?",
    "answer": "Yes, PoFT is orthogonal to preference alignment methods and can be combined with DPO in a two-stage training process. Experimental results demonstrate that PoFT followed by DPO leads to further alignment improvements compared to using either method alone."
  },
  {
    "question": "How do aligned LLMs function in the PoFT training pipeline?",
    "answer": "Aligned LLMs serve as comparative baselines that implicitly assess the quality of the instruction-response pair. Their predicted likelihoods are used to guide the preference modeling, encouraging the target model to surpass their confidence on each training example."
  },
  {
    "question": "What empirical results support the effectiveness of PoFT across different base models and datasets?",
    "answer": "PoFT consistently outperforms CE-based SFT across various training datasets and LLM backbones. It demonstrates better alignment performance, increased robustness to noise, and stability across training epochs, as shown through ablation studies and benchmark evaluations."
  },
  {
    "question": "What theoretical justification does the paper provide for PoFT’s stability during training?",
    "answer": "The paper provides a gradient-based analysis demonstrating that PoFT’s use of preference-based weighting leads to smoother gradient updates, reducing variance caused by low-quality samples and contributing to more stable and resilient model optimization."
  },
  {
    "question": "Why might PoFT be particularly advantageous when training on instruction data generated through AI distillation (e.g., Alpaca, ShareGPT)?",
    "answer": "Because AI-distilled instruction data often varies in quality, PoFT’s reliance on aligned LLMs to assign implicit quality scores helps mitigate the risk of overfitting to suboptimal examples, making it a natural fit for such semi-automatically curated datasets."
  },
  {
    "question": "What broader insight does PoFT offer for the future of instruction tuning in large language models?",
    "answer": "PoFT suggests that instruction tuning can benefit significantly from model-level preference comparisons rather than solely relying on explicit labels or human preferences. This opens up a new paradigm where even noisy or imperfect data can be used effectively when coupled with reliable model baselines for implicit supervision."
  }
]

In [None]:
title = "Preference-Oriented Supervised Fine-Tuning: Favoring Target Model Over Aligned Large Language Models"
generate_QA_pair("16", 2024, "POSFT", title, qa_pairs)

### **17: QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models**

**Abstract**

Recent years have witnessed a rapid development of large language models (LLMs). Despite the strong ability in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs, especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators which increase the degree of freedom of quantization while decreasing that of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM’s weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness in different fine-tuning datasets and downstream scenarios.

**Introduction**

Recently, large language models (LLMs) have shown unprecedented performance across a wide range of language understanding tasks and served as the foundation of state-of-the-art chat systems. The diversity of real-world applications calls for a pipeline in which LLMs can be fine-tuned to fit different scenarios and quantized to be deployed onto edge devices (e.g., mobile phones), and the key issue is to get rid of the heavy computational burden brought by the large number of parameters of LLMs.

There are two lines of research for this purpose. The first one is parameter-efficient fine-tuning (PEFT) which introduced a small number of learnable parameters while keeping most pre-trained parameters unchanged. Among them, low-rank adaptation (LoRA) (Hu et al., 2021), a popular PEFT algorithm, proposed to fine-tune low-rank matrices to complement the pre-trained weights. Despite the comparable performance to full-parameter fine-tuning, the memory usage of LoRA is still large, especially when the base LLM is large (e.g., LLaMA-65B). The second one studies parameter quantization where the trained weights are quantized into low-bit integers or floating point numbers. Although these methods can alleviate the computational burden, they often report unsatisfying accuracy especially when the quantization bit width is low.

Hence, it is an important topic to integrate PEFT with quantization. A naive solution is to perform post-training quantization (PTQ) after PEFT, but it reports unsatisfying accuracy especially when the quantization bit width is low. Advanced methods exist, but they are either computationally expensive in the fine-tuning stage or unable to maintain the quantized property after fine-tuning. In this paper, we propose a simple yet effective method for quantization-aware low-rank adaptation (QA-LoRA). Our idea is based on the imbalanced degrees of freedom for quantization and adaptation. Specifically, each column of the pre-trained weight matrix is accompanied by only one pair of scaling and zero parameters but many more LoRA parameters. This imbalance not only results in large quantization errors (which harm the LLM’s accuracy), but also makes it difficult to integrate the auxiliary weights into the main model. QA-LoRA addresses the issue by introducing group-wise operators which increase the degree of freedom of low-bit quantization (each group is quantized individually) and decrease that of LoRA (each group shares the adaptation parameters). QA-LoRA enjoys two-fold benefits: (i) an efficient fine-tuning stage thanks to the LLM’s weights being quantized into low-bit integers; (ii) a lightweight, fine-tuned model without the need for PTQ which often incurs loss of accuracy.

QA-LoRA is easily implemented and applies to a wide range of scenarios. We evaluate QA-LoRA on the LLaMA and LLaMA2 model families and validate it on various language understanding benchmarks. Figure 1 compares the 5-shot accuracy on the MMLU benchmark of QA-LoRA and the direct baseline, QLoRA with and without PTQ, when both methods are fine-tuned on the Alpaca dataset. QA-LoRA consistently outperforms QLoRA with PTQ on top of LLMs of different scales (the advantage becomes more significant when the quantization bit width is lower) and is on par with QLoRA without PTQ. Note that during inference, QA-LoRA has exactly the same complexity as QLoRA with PTQ and is much more efficient than QLoRA without PTQ. Hence, QA-LoRA serves as an effective and off-the-shelf method for joint quantization and adaptation of LLMs.

**Experiments**

**4.1 SETTINGS**

Foundation models. We establish QA-LoRA upon the LLaMA and LLaMA2 families. In particular, we fine-tune the 7B, 13B, 33B, and 65B models of LLaMA and the 7B and 13B models of LLaMA2.

**Evaluation metrics.**

Following QLoRA, we evaluate both the zero-shot and few-shot performance of the LLMs on the Massively Multitask Language Understanding (MMLU) benchmark. It consists of 57 language tasks including humanities, STEM, social science, etc. We use the official MMLU evaluation script and prompts. We further assess the zero-shot common sense reasoning ability on tasks covering HellaSwag, PIQA, WinoGrande, ARC, BoolQ, and OpenBookQA. We adopt lm-eval-harness to produce the Common Sense QA results.

**Quantization.**

We adopt GPTQ in the quantization step, and our approach is open to other PTQ methods. We use the same settings to quantize the QLoRA fine-tuned models and pre-trained LLaMA models. In the main experiments, we conduct a group-wise asymmetric quantization (with a group size of 32). We set the act-order variable to be false and the true-sequential variable to be true.
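
To make "group-wise asymmetric quantization" concrete, here is a minimal pure-Python sketch that quantizes one weight column in groups, storing one (scaling, zero) pair per group. It uses plain min-max rounding as a stand-in; it is not GPTQ, and the function names and toy data are hypothetical.

```python
def quantize_groupwise(column, group_size=32, bits=4):
    """Asymmetric min-max quantization applied independently to each group."""
    levels = 2 ** bits - 1  # e.g. 15 for INT4, so codes lie in 0..15
    codes, params = [], []
    for start in range(0, len(column), group_size):
        group = column[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant groups
        codes.append([round((w - lo) / scale) for w in group])
        params.append((scale, lo))  # one (scaling, zero) pair per group
    return codes, params

def dequantize_groupwise(codes, params):
    """Map integer codes back to approximate floating-point weights."""
    out = []
    for group_codes, (scale, zero) in zip(codes, params):
        out.extend(scale * q + zero for q in group_codes)
    return out

def max_error(column, group_size):
    """Largest absolute reconstruction error for a given group size."""
    codes, params = quantize_groupwise(column, group_size)
    restored = dequantize_groupwise(codes, params)
    return max(abs(a - b) for a, b in zip(column, restored))

# Toy column whose two halves live in very different value ranges.
column = [0.01 * i for i in range(32)] + [5.0 + 0.01 * i for i in range(32)]
error_per_column = max_error(column, group_size=64)  # a single pair for the column
error_per_group = max_error(column, group_size=32)   # one pair per group of 32
```

On this toy column the per-group error is much smaller than the per-column error, which is the extra degree of freedom that the group-wise setting buys.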

**Datasets and training details.**

We choose Alpaca and FLAN v2 as our fine-tuning datasets. Alpaca contains 52K instruction-following data generated from text-davinci-003 (GPT 3.5). FLAN v2 is a collection of 1,836 tasks combining the mixture with CoT, Muffin, T0-SF, and NIV2. To save the tuning cost, we randomly sample a 320K subset from the FLAN v2 collection. Following QLoRA (Dettmers et al., 2023a), we use a paged AdamW optimizer, a maximum gradient norm of 0.3, and a batch size of 16 in the tuning period. We choose the constant learning rate schedule and set the learning rate to be 2 × 10⁻⁵ for the 7B and 13B models and 1 × 10⁻⁵ for the 33B and 65B models. The number of fine-tuning steps is 10K for Alpaca and 20K for FLAN v2. All experiments are conducted on Tesla V100 GPUs. We use one GPU for the 7B, 13B, and 33B models and two GPUs for the 65B models.

**4.2 MAIN RESULTS AND EFFICIENCY**
**Comparison against recent competitors on LLaMA for MMLU.**

We first apply QA-LoRA to fine-tune the LLaMA models for MMLU. Table 1 summarizes the results with respect to different model sizes, fine-tuning datasets, and bit widths. Besides the base LLaMA models, we also compare QA-LoRA against QLoRA (Dettmers et al., 2023a), the most related work, and PEQA, a recent quantization method that does not use LoRA. We report both the original QLoRA (the inference stage involves FP16 computation) and the variant after GPTQ (for fair comparison). QA-LoRA consistently outperforms both competitors (QLoRA w/ GPTQ and PEQA) in both 0-shot and 5-shot accuracy. The advantage is more significant when the model size is small (e.g., 7B and 13B) or the bit width is small (e.g., INT3 or even INT2 is used), demonstrating that QA-LoRA is a strong solution in the scenarios that require computational efficiency. In some cases, the INT4 version of QA-LoRA performs even better than the original version of QLoRA while the inference speed is much faster (see the next paragraph). We further demonstrate some examples of QA-LoRA in Appendix A, where one can see a qualitative comparison between QA-LoRA and QLoRA w/ GPTQ. QA-LoRA mainly benefits from the quantization-aware adaptation; otherwise, the post-training quantization will not be compensated, resulting in unstable results.

**The efficiency of QA-LoRA.**

A clear advantage of QA-LoRA lies in its computational efficiency. Table 2 compares QA-LoRA to QLoRA in terms of the learnable parameters and training time during the fine-tuning stage. The significant advantage of QA-LoRA in training time mainly comes from the use of INT4 quantization. Compared to NF4 quantization used by QLoRA, INT4 operators have been optimized by CUDA and are much faster in execution. Additionally, during the inference stage, QA-LoRA is also more than 50% faster than QLoRA because the fine-tuned model (after weight integration) is still in INT4, unlike QLoRA that converts it back to FP16.

**Commonsense QA results.**

We also evaluate QA-LoRA for 0-shot commonsense QA based on LLaMA-7B. Results are summarized in Table 3. Similar to the MMLU results, the 4-bit QA-LoRA is comparable with the mixed-precision QLoRA and outperforms the post-quantized QLoRA by an average of 2.0%. The advantage becomes more significant in low-bit scenarios, e.g., the 2-bit QA-LoRA reports a remarkable accuracy gain of 15.0% over the 2-bit post-quantized QLoRA.

**On LLaMA2 models.**

We further validate the effectiveness of our method on LLaMA2. As shown in Table 4, we fine-tune the 7B and 13B models of LLaMA2 and test them on MMLU. Compared to the original FP16 models, the INT4 models fine-tuned with FLAN v2 are consistently better, while those with Alpaca report slightly lower accuracy. These experiments validate that QA-LoRA is generalized to other pre-trained model families.

**4.3 ABLATIVE STUDIES**
**Impact of the quantization group size.**

We investigate different settings of L, the hyper-parameter that controls the degrees of freedom for both quantization and low-rank adaptation. Results are reported in Table 5, where the group size (i.e., D_in/L) is displayed instead of L. Recall that a larger L (corresponding to a smaller group size) implies a larger degree of freedom, i.e., a smaller quantization loss, and a larger number of adaptation parameters. Meanwhile, it also requires a larger amount of storage and computation, though negligible as long as L ≫ 1. One can observe that a larger L (e.g., group size is 32) often leads to higher accuracy, and the advantage becomes more significant when the quantization bit width is small, implying that a larger quantization loss needs to be compensated by a larger degree of freedom.
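
The trade-off controlled by L can be made tangible with a small count. The sketch below assumes one (scaling, zero) pair per group per output column, and assumes the input-side LoRA matrix has shape L × rank because each group shares its adaptation parameters. Those shapes are my reading of the method, and the layer size and rank are hypothetical.

```python
def degrees_of_freedom(d_in, d_out, rank, group_size):
    """Count quantization pairs and LoRA parameters for one weight matrix."""
    if d_in % group_size != 0:
        raise ValueError("group_size must divide d_in")
    num_groups = d_in // group_size  # this is L
    quant_pairs = num_groups * d_out  # (scaling, zero) pairs, assumed per group and column
    lora_params = num_groups * rank + rank * d_out  # assumed shapes: L x rank and rank x d_out
    return num_groups, quant_pairs, lora_params

# Hypothetical 4096 x 4096 layer with rank 64.
small_groups = degrees_of_freedom(4096, 4096, rank=64, group_size=32)
large_groups = degrees_of_freedom(4096, 4096, rank=64, group_size=128)
```

Going from group size 128 to 32 quadruples L and the number of quantization pairs, while the LoRA parameter count grows only slightly, in line with the remark that the extra cost stays small.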

**Impact of fine-tuning datasets.**
We also evaluate QA-LoRA on more datasets such as Self-instruct, Longform, and Chip2 (LAION, 2023). Results are summarized in Table 6. Compared to Alpaca and FLAN v2, these datasets are relatively small, and thus the fine-tuned models report a bit weaker accuracy on MMLU. Note that, with LLaMA-13B as the foundation model, QA-LoRA consistently outperforms QLoRA with mixed precision, meanwhile being much faster in the inference stage.

**Impact of the size of fine-tuning datasets.**
Lastly, we evaluate QA-LoRA on different subsets of FLAN v2. The dataset size is varied over 160K, 240K, 320K, 400K, and 480K. LLaMA-7B is used as the foundation model. As shown in Figure 3, low-bit quantization asks for more data, yet 320K is sufficient for both the INT2 and INT4 variants of QA-LoRA.

**Conclusion**

In this paper, we propose QA-LoRA as an efficient method that introduces quantization-awareness into the low-rank adaptation of LLMs. At the core of QA-LoRA lies the group-wise operations for both quantization and low-rank adaptation, and the key insight comes from balancing the degrees of freedom of both sides. QA-LoRA is easily implemented, generalized across various foundation models and language understanding tasks, and computationally efficient in both fine-tuning and inference stages. Extensive experiments on the LLaMA model families validate the effectiveness of QA-LoRA.

In [None]:
qa_pairs = [
  {
    "question": "What is the central motivation behind QA-LoRA, and what core problem does it address?",
    "answer": "QA-LoRA is motivated by the need to make large language models deployable on resource-constrained devices. It addresses the imbalanced degrees of freedom in quantization and adaptation by introducing group-wise operations that simultaneously enhance quantization flexibility and reduce the parameter overhead of adaptation."
  },
  {
    "question": "How does QA-LoRA differ from traditional LoRA and post-training quantization (PTQ) approaches?",
    "answer": "Unlike traditional LoRA, which does not address quantization, and PTQ, which is applied after fine-tuning and often degrades performance, QA-LoRA integrates quantization into the fine-tuning process. This enables the model to adapt while being quantized, preserving accuracy and reducing inference complexity without requiring costly re-quantization."
  },
  {
    "question": "What role do group-wise operations play in the QA-LoRA framework?",
    "answer": "Group-wise operations in QA-LoRA increase the degrees of freedom in quantization by allowing each group of weights to be quantized independently. Simultaneously, they reduce the number of adaptation parameters by sharing them across groups. This balance mitigates quantization loss and ensures efficient adaptation."
  },
  {
    "question": "Why does QA-LoRA outperform QLoRA, particularly at low bit-widths such as INT2 and INT3?",
    "answer": "QA-LoRA introduces quantization-awareness during training, allowing it to compensate for quantization loss as part of the optimization process. QLoRA, when followed by PTQ, lacks this adaptive correction, leading to accuracy degradation at low bit-widths. QA-LoRA’s group-wise structure helps maintain performance even in aggressive quantization settings."
  },
  {
    "question": "How does QA-LoRA achieve superior computational efficiency during both training and inference compared to QLoRA?",
    "answer": "During training, QA-LoRA uses INT4 quantization, which benefits from CUDA-optimized operators, leading to faster execution. In inference, it retains its quantized structure, unlike QLoRA which reverts to FP16. This allows QA-LoRA to be over 50% faster than QLoRA while maintaining or exceeding its accuracy."
  },
  {
    "question": "How does the quantization group size affect QA-LoRA’s performance, particularly at low bit-widths?",
    "answer": "A smaller group size (i.e., larger L) increases quantization granularity, reducing quantization loss and enhancing accuracy, especially in low-bit scenarios. The experiments show that group sizes like 32 yield better performance, demonstrating that fine-grained control is key to balancing accuracy and compression."
  },
  {
    "question": "In the experiments, how did QA-LoRA perform on smaller or lower-resource fine-tuning datasets compared to larger datasets like FLAN v2?",
    "answer": "On smaller datasets such as Self-Instruct or Longform, QA-LoRA maintained a performance edge over QLoRA, albeit with slightly lower overall accuracy than with FLAN v2. This indicates QA-LoRA's robustness, although low-bit quantization benefits more from larger datasets due to its higher representational constraints."
  },
  {
    "question": "Why is quantization-aware adaptation preferable to post-training quantization, especially in deployment settings?",
    "answer": "Quantization-aware adaptation enables the model to learn in the quantized space, preserving task performance even under aggressive compression. PTQ, by contrast, applies quantization after learning, which often introduces mismatch and accuracy degradation. QA-LoRA’s approach eliminates this post-hoc correction need."
  },
  {
    "question": "What potential implications does QA-LoRA have for deploying LLMs on edge devices?",
    "answer": "QA-LoRA provides a viable pathway to deploy powerful LLMs on edge devices by combining low-bit quantization with efficient adaptation. Its ability to retain high performance in a compressed, quantized state makes it ideal for mobile, IoT, and low-latency applications where resource constraints are paramount."
  },
  {
    "question": "How does QA-LoRA contribute to the broader research goal of making LLMs more accessible and environmentally sustainable?",
    "answer": "By significantly reducing memory usage, training time, and inference latency, QA-LoRA lowers the barrier to entry for LLM deployment and reduces the carbon footprint of large-scale fine-tuning. It supports democratization of LLMs without sacrificing quality, aligning efficiency with performance in a scalable manner."
  }
]

In [None]:
title = "QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models"
generate_QA_pair("17", 2023, "qalora", title, qa_pairs)

### **18: QLORA: Efficient Finetuning of Quantized LLMs**

**Abstract**

We present QLORA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLORA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLORA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights (b) Double Quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) Paged Optimizers to manage memory spikes. We use QLORA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.

**Introduction**

Finetuning large language models (LLMs) is a highly effective way to improve their performance, and to add desirable or remove undesirable behaviors. However, finetuning very large models is prohibitively expensive; regular 16-bit finetuning of a LLaMA 65B parameter model requires more than 780 GB of GPU memory. While recent quantization methods can reduce the memory footprint of LLMs, such techniques only work for inference and break down during training.

We demonstrate for the first time that it is possible to finetune a quantized 4-bit model without any performance degradation. Our method, QLORA, uses a novel high-precision technique to quantize a pretrained model to 4-bit, then adds a small set of learnable Low-rank Adapter weights that are tuned by backpropagating gradients through the quantized weights.

QLORA reduces the average memory requirements of finetuning a 65B parameter model from >780GB of GPU memory to less than 48GB without degrading the runtime or predictive performance compared to a 16-bit fully finetuned baseline. This marks a significant shift in accessibility of LLM finetuning: the largest publicly available models to date are now finetunable on a single GPU. Using QLORA, we train the Guanaco family of models, with the second best model reaching 97.8% of the performance level of ChatGPT on the Vicuna benchmark, while being trainable in less than 12 hours on a single consumer GPU; using a single professional GPU over 24 hours we achieve 99.3% with our largest model, essentially closing the gap to ChatGPT on the Vicuna benchmark. When deployed, our smallest Guanaco model (7B parameters) requires just 5 GB of memory and outperforms a 26 GB Alpaca model by more than 20 percentage points on the Vicuna benchmark (Table 6).
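
The memory figures quoted above can be sanity-checked with back-of-the-envelope arithmetic. The 12-bytes-per-parameter breakdown below (16-bit weights, 16-bit gradients, two 32-bit Adam moments) is an assumed accounting that is not stated in this excerpt, and it ignores activations.

```python
def finetune_memory_gb(num_params, bytes_per_param):
    """Memory in GB (10^9 bytes) for a given per-parameter byte budget."""
    return num_params * bytes_per_param / 1e9

NUM_PARAMS = 65e9  # 65B parameter model

# Assumed accounting for regular 16-bit finetuning with Adam:
# 2 bytes weights + 2 bytes gradients + 8 bytes optimizer state.
full_16bit_gb = finetune_memory_gb(NUM_PARAMS, 2 + 2 + 8)

# 4-bit base weights only (0.5 bytes per parameter); adapters, quantization
# constants, and activations come on top of this.
base_4bit_gb = finetune_memory_gb(NUM_PARAMS, 0.5)
```

Under this assumption the full finetuning budget comes to about 780 GB before activations, while the frozen 4-bit base weights take about 32.5 GB, which leaves headroom within 48 GB for adapters and activations.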

QLORA introduces multiple innovations designed to reduce memory use without sacrificing performance: (1) 4-bit NormalFloat, an information theoretically optimal quantization data type for normally distributed data that yields better empirical results than 4-bit Integers and 4-bit Floats. (2) Double Quantization, a method that quantizes the quantization constants, saving an average of about 0.37 bits per parameter (approximately 3 GB for a 65B model). (3) Paged Optimizers, using NVIDIA unified memory to avoid the gradient checkpointing memory spikes that occur when processing a mini-batch with a long sequence length. We combine these contributions into a better tuned LoRA approach that includes adapters at every network layer and thereby avoids almost all of the accuracy tradeoffs seen in prior work.
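
The Double Quantization savings can be reproduced with a short calculation. The block size of 64, the 32-bit first-level constants, and the 8-bit second-level constants with a block size of 256 are assumptions not stated in this excerpt; they are chosen because they reproduce the quoted 0.37 bits per parameter.

```python
def single_quant_overhead_bits(block_size=64, const_bits=32):
    """Bits per parameter spent on one quantization constant per block."""
    return const_bits / block_size

def double_quant_overhead_bits(block_size=64, const_bits=32,
                               second_block_size=256, second_const_bits=8):
    """Overhead when the constants themselves are quantized (assumed setup)."""
    first_level = second_const_bits / block_size
    second_level = const_bits / (block_size * second_block_size)
    return first_level + second_level

saved_bits = single_quant_overhead_bits() - double_quant_overhead_bits()
saved_gb_65b = saved_bits * 65e9 / 8 / 1e9  # bits -> bytes -> GB
```

With these assumed values the saving is roughly 0.373 bits per parameter, or about 3 GB for a 65B model, matching the figures in the text.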

Furthermore, we also provide an extensive analysis of chatbot performance that uses both human raters and GPT-4 for evaluation. We use tournament-style benchmarking where models compete against each other in matches to produce the best response for a given prompt. The winner of a match is judged by either GPT-4 or human annotators. The tournament results are aggregated into Elo scores [16, 17] which determine the ranking of chatbot performance. We find that GPT-4 and human evaluations largely agree on the rank of model performance in the tournaments, but we also find there are instances of strong disagreement. As such, we highlight that model-based evaluation, while providing a cheap alternative to human annotation, also has its uncertainties.
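
For readers unfamiliar with Elo aggregation, the sketch below applies the standard Elo update to a list of pairwise match results. The starting rating of 1000, the K-factor of 32, and the model names are illustrative assumptions, not values taken from this excerpt.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def run_tournament(matches, initial=1000.0, k=32.0):
    """Aggregate (winner, loser) pairs into Elo ratings, in the given order."""
    ratings = {}
    for winner, loser in matches:
        r_w = ratings.setdefault(winner, initial)
        r_l = ratings.setdefault(loser, initial)
        # The winner gains exactly what the loser gives up.
        gain = k * (1.0 - expected_score(r_w, r_l))
        ratings[winner] = r_w + gain
        ratings[loser] = r_l - gain
    return ratings

# Hypothetical judged matches: (winner, loser).
matches = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
ratings = run_tournament(matches)
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

Elo ratings depend on the order in which matches are processed, so in practice one would likely average over several shuffled orderings before reporting a ranking.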

We augment our chatbot benchmark results with a qualitative analysis of Guanaco models. Our analysis highlights success and failure cases that were not captured by the quantitative benchmarks.

We release all model generations with human and GPT-4 annotations to facilitate further study. We open-source our codebase and CUDA kernels and integrate our methods into the Hugging Face transformers stack, making them easily accessible to all. We release a collection of adapters for 7/13/33/65B size models, trained on 8 different instruction following datasets, for a total of 32 different open sourced, finetuned models.

**Limitations and Discussions**

We have shown evidence that our method, QLORA, can replicate 16-bit full finetuning performance with a 4-bit base model and Low-rank Adapters (LoRA). Despite this evidence, we did not establish that QLORA can match full 16-bit finetuning performance at 33B and 65B scales. Due to the immense resource costs, we leave this study to future work.

Another limitation is the evaluation of instruction finetuning models. While we provide evaluations on MMLU, the Vicuna benchmark, and the OA benchmark, we did not evaluate on other benchmarks such as BigBench, RAFT, and HELM, and it is not ensured that our evaluations generalize to these benchmarks. On the other hand, we perform a very broad study on MMLU and develop new methods for evaluating chatbots.

From the evidence presented, it appears that the performance on these benchmarks likely depends on how similar the finetuning data is to the benchmark dataset. For example, FLAN v2 is similar to MMLU but dissimilar to chatbot benchmarks, and vice versa for the Chip2 dataset, and both models score accordingly on the MMLU and Vicuna benchmarks. This highlights that not only are better benchmarks and evaluation needed, but that one needs to be careful about what one is evaluating in the first place. Do we want to create models that do well on classroom high school and college knowledge, or do we want to do well on chatbot conversation ability? Maybe something else? Because it is always easier to evaluate on an existing benchmark compared to creating a new one, certain benchmarks can steer the community towards a certain direction. We should ensure as a community that the benchmarks measure what we care about.

While we provide a detailed evaluation for general chatbot performance, another limitation is that we only do a limited responsible AI evaluation of Guanaco. We evaluate the likelihood of Guanaco-65B to generate a socially biased sequence of tokens compared to other models in Table 8. We see that the average score in Guanaco-65B is much lower than other raw pretrained models. As such, it seems that finetuning on the OASST1 dataset reduces the bias of the LLaMA base model. While these results are encouraging, it is unclear if Guanaco also does well when assessed on other types of biases. We leave further evaluation of analyzing biases in Guanaco and similar chatbots to future work.

An additional limitation is that we did not evaluate different bit-precisions, such as using 3-bit base models, or different adapter methods. Besides LoRA, there is also a wide variety of Parameter Efficient FineTuning (PEFT) methods that have been shown to work well. However, it is unclear if these methods scale to large models. We used LoRA as many results established its robustness, but other adapters might yield better performance. Since finetuning after quantization seems to recover most of the information that is lost during quantization, this might enable much more aggressive quantization. For example, 3-bit GPTQ quantization of the base model with LoRA might also yield 16-bit full finetuning performance after finetuning.

**Broader Impact**

Our QLORA finetuning method is the first method that enables the finetuning of 33B parameter models on a single consumer GPU and 65B parameter models on a single professional GPU, while not degrading performance relative to a full finetuning baseline. We have demonstrated that our best 33B model trained on the Open Assistant dataset can rival ChatGPT on the Vicuna benchmark. Since instruction finetuning is an essential tool to transform raw pretrained LLMs into ChatGPT-like chatbots, we believe that our method will make finetuning widespread and common in particular for the researchers that have the least resources, a big win for the accessibility of state of the art NLP technology. QLORA can be seen as an equalizing factor that helps to close the resource gap between large corporations and small teams with consumer GPUs.

Another potential source of impact is deployment to mobile phones. We believe our QLORA method might enable the critical milestone of enabling the finetuning of LLMs on phones and other low resource settings. While 7B models were shown to be able to be run on phones before, QLORA is the first method that would enable the finetuning of such models. We estimate that with an iPhone 12 Plus, QLORA can finetune 3 million tokens per night while the phone is charging. While finetuned 7B models do not reach the quality of ChatGPT, we believe that the quality is good enough to enable novel applications that have not been possible before due to privacy or LLM quality issues. QLORA can help enable privacy-preserving usage of LLMs, where users can own and manage their own data and models, while simultaneously making LLMs easier to deploy.

However, finetuning is a dual-use technology that can be abused to cause harm. Widespread use of LLMs has known dangers, but we believe that equalizing access to a technology that is quickly becoming ubiquitous will allow for better, more independent analysis than keeping the power of LLMs in the hands of large corporations that do not release models or source code for auditing.

All in all, we believe that QLORA will have a broadly positive impact making the finetuning of high quality LLMs much more widely and easily accessible.


In [None]:
qa_pairs = [
  {
    "question": "What is the core innovation of QLoRA that enables fine-tuning a 65B parameter model on a single 48GB GPU?",
    "answer": "QLoRA enables efficient fine-tuning by combining 4-bit quantization of the pretrained model with Low-Rank Adapters (LoRA), backpropagating gradients through the frozen quantized weights. This dramatically reduces memory usage without sacrificing performance."
  },
  {
    "question": "How does QLoRA address the challenge of memory spikes during training with long sequence lengths?",
    "answer": "QLoRA introduces 'Paged Optimizers' which use NVIDIA’s unified memory to manage memory spikes during gradient checkpointing, particularly when processing mini-batches with long sequences. This innovation allows stable training even on limited hardware."
  },
  {
    "question": "What is 4-bit NormalFloat (NF4), and why is it preferable over other 4-bit formats in QLoRA?",
    "answer": "4-bit NormalFloat (NF4) is a quantization datatype designed to be information-theoretically optimal for normally distributed weights. It yields better empirical results than standard 4-bit integers or floats, maintaining performance despite the aggressive compression."
  },
  {
    "question": "How does QLoRA’s 'Double Quantization' contribute to its overall memory efficiency?",
    "answer": "Double Quantization compresses the quantization constants themselves, saving approximately 0.37 bits per parameter. For a 65B model, this equates to around 3GB in memory savings, significantly improving QLoRA’s efficiency at scale."
  },
  {
    "question": "Why is QLoRA particularly well-suited for instruction tuning on resource-limited hardware?",
    "answer": "QLoRA supports fine-tuning of very large models on single GPUs without degrading accuracy, making it ideal for instruction tuning in environments lacking access to multi-GPU clusters. Its ability to train 33B and 65B models with minimal memory makes high-performance alignment tasks feasible for small teams."
  },
  {
    "question": "What evaluation methodology did QLoRA use to assess chatbot performance, and what were the key findings?",
    "answer": "QLoRA used tournament-style benchmarking where models competed to generate the best response, judged by either GPT-4 or human annotators. The Elo scoring system was used to rank models. Results showed strong agreement between human and GPT-4 evaluations, validating the use of LLMs for performance benchmarking."
  },
  {
    "question": "What performance did the Guanaco-65B model achieve on the Vicuna benchmark, and how resource-efficient was its training?",
    "answer": "The Guanaco-65B model reached 99.3% of ChatGPT’s performance on the Vicuna benchmark after just 24 hours of training on a single professional GPU. This demonstrated QLoRA’s ability to train state-of-the-art models with limited resources."
  },
  {
    "question": "What are the main limitations identified in the QLoRA paper regarding model scale and evaluation breadth?",
    "answer": "The paper notes that while QLoRA achieves strong results, it does not conclusively establish parity with 16-bit full fine-tuning at 33B and 65B scales. Additionally, it lacks evaluation on some key benchmarks like BigBench and HELM, and performs only limited bias assessment."
  },
  {
    "question": "How does QLoRA’s open-source approach and hardware accessibility impact the research community and democratization of LLMs?",
    "answer": "By making it possible to fine-tune massive models on consumer or single professional GPUs, QLoRA lowers the barrier to entry for high-quality LLM research. This empowers smaller research teams and fosters transparency, reducing reliance on opaque, corporate-controlled models."
  },
  {
    "question": "In what ways could QLoRA potentially enable privacy-preserving applications on edge devices?",
    "answer": "Because QLoRA enables fine-tuning large models on-device with low memory, it could allow users to customize models locally, retaining sensitive data without uploading it to external servers. This opens up avenues for privacy-respecting AI applications on smartphones and other edge devices."
  }
]

In [None]:
title = "QLORA: Efficient Finetuning of Quantized LLMs"
generate_QA_pair("18", 2023, "qlora", title, qa_pairs)
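
The Double Quantization figures quoted in the QA pairs above (about 0.37 bits per parameter, roughly 3GB for a 65B model) can be sanity-checked with a few lines of arithmetic. The sketch below assumes the configuration described in the QLoRA paper: one 32-bit constant per block of 64 weights, re-quantized to 8 bits with one 32-bit second-level constant per 256 first-level constants. Those block sizes are not stated in this notebook, so treat them as assumptions; the function names are made up for illustration.

```python
# Back-of-the-envelope check of the Double Quantization savings.
# Assumed (not stated in this notebook): block size 64 for the weights and
# block size 256 for the second-level quantization of the constants.
WEIGHT_BLOCK = 64
CONSTANT_BLOCK = 256


def overhead_bits_per_param(double_quant: bool) -> float:
    """Bits of quantization-constant overhead per model parameter."""
    if not double_quant:
        # One 32-bit constant shared by every block of 64 weights.
        return 32 / WEIGHT_BLOCK
    # 8-bit constants, plus one 32-bit constant per 256 of them.
    return 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONSTANT_BLOCK)


def saved_gigabytes(num_params: float) -> float:
    """Memory saved by Double Quantization, in GB (1 GB = 1e9 bytes)."""
    saved = overhead_bits_per_param(False) - overhead_bits_per_param(True)
    return saved * num_params / 8 / 1e9


saved_bits = overhead_bits_per_param(False) - overhead_bits_per_param(True)
print(f"saved bits per parameter: {saved_bits:.3f}")
print(f"saved memory for 65B parameters: {saved_gigabytes(65e9):.2f} GB")
```

Under these assumptions the result lands close to the 0.37 bits and 3GB stated in the answer.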

### **19: Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark**

**Abstract**

In the evolving landscape of natural language processing (NLP), fine-tuning pre-trained Large Language Models (LLMs) with first-order (FO) optimizers like SGD and Adam has become standard. Yet, as LLMs grow in size, the substantial memory overhead from back-propagation (BP) for FO gradient computation presents a significant challenge. Addressing this issue is crucial, especially for applications like on-device training where memory efficiency is paramount. This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during LLM fine-tuning, building on the initial concept introduced by Malladi et al. (2023). Unlike traditional ZO-SGD methods, our work expands the exploration to a wider array of ZO optimization techniques, through a comprehensive, first-of-its-kind benchmarking study across five LLM families (RoBERTa, OPT, LLaMA, Vicuna, Mistral), three task complexities, and five fine-tuning schemes. Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance. We further introduce novel enhancements to ZO optimization, including block-wise descent, hybrid training, and gradient sparsity. Our study offers a promising direction for achieving further memory-efficient LLM fine-tuning.

**Introduction**

Fine-tuning pre-trained large language models (LLMs) has become the de-facto standard in the current paradigms of natural language processing (NLP). First-order (FO) optimizers, e.g., SGD (Amari, 1993) and Adam (Kingma & Ba, 2014), have been the predominant choices for LLM fine-tuning. However, as LLMs continue to scale, they encounter significant memory overhead due to the back-propagation (BP) required for FO gradient computation. For example, computing the gradient of the LLM OPT-13B requires 12× more memory than model inference. This leads to the challenge of achieving memory-efficient fine-tuning in LLMs. Advancements in addressing this challenge could also facilitate technological breakthroughs in related areas, such as on-device training, where memory efficiency is in high demand.

To enhance memory efficiency, an emerging solution is to replace a BP-required FO optimization method with a BP-free optimizer during LLM fine-tuning. This was initially proposed by Malladi et al. (2023), where the FO gradient is approximated using a finite difference of function values. Despite its new application to LLM fine-tuning, the underlying optimization principle used in Malladi et al. (2023) is commonly known as zeroth-order (ZO) optimization, and the function value-based gradient estimate is referred to as the ZO gradient estimate. Malladi et al. (2023) employed the classical ZO stochastic gradient descent (ZO-SGD) algorithm (Ghadimi & Lan, 2013), termed MeZO, to fine-tune the pre-trained LLMs and leveraged the BP-free characteristics of ZO optimization to reduce memory costs. However, from the perspective of ZO optimization, in addition to ZO-SGD, many other ZO optimization methods have not yet been explored in the context of LLM fine-tuning. Thus, it remains elusive whether there are potential improvements in accuracy and/or efficiency that can be achieved through a benchmarking study of ZO optimization for LLM fine-tuning. This yields the primary question to be explored:

(Q) Can we establish a benchmark for ZO optimization in LLM fine-tuning, explore the overlooked optimization principles, and advance the current state of the art?

To address (Q), our work introduces several key innovations compared to the most relevant work. We explore a broader range of ZO optimization methods beyond ZO-SGD and examine various task and model types, as well as evaluation metrics. We conduct a detailed comparative analysis of different ZO optimization methods, shedding light on the often-overlooked forward gradient method and other ZO optimization techniques in LLM fine-tuning. This benchmarking study helps reveal the pros and cons of these methods in accuracy and efficiency. Extended from the gained insights, we propose to further improve ZO optimization-based LLM fine-tuning using techniques of block-wise descent, hybrid ZO and FO training, and gradient sparsity. In summary, our key contributions are listed below.

• We create the first benchmark for ZO optimization in the context of LLM fine-tuning. This benchmark includes our investigations into 6 BP-free or ZO optimization methods, 5 LLM families, 3 tasks of varying complexities, and 5 fine-tuning schemes, covering both full-parameter and parameter-efficient fine-tuning (PEFT) approaches.

• Assisted by our benchmark, we reveal a range of previously overlooked optimization principles and insights for LLM fine-tuning with ZO optimization. These include the significance of aligning tasks to enhance ZO optimization, the role of forward gradient as an LLM fine-tuning baseline, and the trade-offs between algorithm complexity, fine-tuning accuracy, query and memory efficiency.

• In addition to a holistic assessment of existing ZO optimization methods for LLM fine-tuning, we introduce novel enhancements to ZO optimization, including block-wise ZO optimization, hybrid ZO and FO fine-tuning, and sparsity-induced ZO optimization. These proposed techniques aim to improve the accuracy of ZO LLM fine-tuning while maintaining memory efficiency.

**Conclusion**

This work explores the application of zeroth-order (ZO) optimization in fine-tuning LLMs. ZO optimization approximates gradients using loss differences, eliminating the need for back-propagation and activation storage. While MeZO has made strides in adapting ZO optimization for LLMs, understanding the full ZO landscape remains an open question. To address this question, we broaden the scope by considering various ZO optimization methods, task types, and evaluation metrics. We conduct the first benchmark study of different ZO optimization techniques, shedding light on their accuracy and efficiency. We also uncover the overlooked ZO optimization principles, such as task alignment and the role of forward gradient. Leveraging these insights, we propose techniques like block-wise descent, hybrid ZO and FO training, and gradient sparsity to enhance ZO optimization-based LLM fine-tuning. The proposed enhancements can further improve the fine-tuning accuracy while maintaining the memory efficiency.
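
To illustrate what "approximating gradients using loss differences" can look like, here is a minimal sketch of ZO-SGD with a two-point randomized gradient estimate, which is one common form of ZO estimator. It is not the paper's or MeZO's implementation: the toy quadratic loss stands in for a fine-tuning loss, and the names `loss`, `zo_gradient`, and `zo_sgd`, along with the step size, smoothing parameter, and step count, are illustrative assumptions.

```python
import random


def loss(theta):
    """Toy quadratic loss standing in for an LLM fine-tuning loss."""
    target = [1.0, -2.0, 0.5]
    return sum((t - s) ** 2 for t, s in zip(theta, target))


def zo_gradient(f, theta, mu=1e-3, rng=random):
    """Two-point randomized gradient estimate.

    Uses only two evaluations of f (forward passes); no back-propagation.
    """
    u = [rng.gauss(0.0, 1.0) for _ in theta]           # random direction
    plus = f([t + mu * d for t, d in zip(theta, u)])   # f(theta + mu * u)
    minus = f([t - mu * d for t, d in zip(theta, u)])  # f(theta - mu * u)
    scale = (plus - minus) / (2.0 * mu)                # directional slope
    return [scale * d for d in u]


def zo_sgd(f, theta, lr=0.02, steps=2000, seed=0):
    """Plain SGD updates driven by the ZO gradient estimate."""
    rng = random.Random(seed)
    for _ in range(steps):
        grad = zo_gradient(f, theta, rng=rng)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


theta = zo_sgd(loss, [0.0, 0.0, 0.0])
print("final loss:", loss(theta))
```

Each update needs only two loss evaluations, so no activations have to be stored for a backward pass. The price is a noisy gradient estimate, which is why the sketch uses many small steps.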



In [None]:
qa_pairs = [
  {
    "question": "What fundamental challenge does zeroth-order (ZO) optimization aim to solve in the context of LLM fine-tuning?",
    "answer": "ZO optimization addresses the significant memory overhead caused by back-propagation during first-order optimization, offering a BP-free alternative that approximates gradients using function value differences, thus enabling more memory-efficient fine-tuning of large language models."
  },
  {
    "question": "How does ZO optimization compute gradients without back-propagation?",
    "answer": "ZO optimization estimates gradients by evaluating the change in loss values resulting from small perturbations to the model parameters, rather than computing gradients via back-propagation through the model’s layers."
  },
  {
    "question": "What is the significance of the forward gradient method highlighted in this paper?",
    "answer": "The forward gradient method serves as a baseline in ZO optimization for LLM fine-tuning, offering a simple yet effective way to estimate gradients without back-propagation, and its role had been previously underappreciated in the context of LLMs."
  },
  {
    "question": "What novel enhancements to ZO optimization are proposed in this work?",
    "answer": "The paper introduces block-wise descent, which divides parameters into groups for localized updates; hybrid ZO and FO training, which combines ZO efficiency with FO accuracy; and sparsity-induced ZO optimization, which leverages sparse updates to further reduce memory consumption."
  },
  {
    "question": "How does block-wise ZO optimization improve the fine-tuning process?",
    "answer": "Block-wise ZO optimization improves performance by breaking the parameter space into manageable segments, allowing for more efficient and scalable gradient estimation, which reduces computational complexity and enhances accuracy."
  },
  {
    "question": "In what ways does task alignment influence the effectiveness of ZO optimization?",
    "answer": "Task alignment significantly impacts the performance of ZO optimization; aligning the optimization procedure with the nature and complexity of the task helps mitigate the noise in gradient estimation, leading to better fine-tuning outcomes."
  },
  {
    "question": "What trade-offs are observed between algorithmic complexity and fine-tuning accuracy in ZO methods?",
    "answer": "The study reveals that more complex ZO algorithms can yield higher accuracy but often at the cost of increased query count and computation time. Simpler methods like MeZO are more efficient but less accurate, highlighting a key trade-off between complexity and performance."
  },
  {
    "question": "Why is the exploration of ZO optimization particularly relevant for on-device LLM fine-tuning?",
    "answer": "On-device training typically operates under severe memory constraints, making BP-free methods like ZO optimization attractive because they eliminate the need for storing activations and gradients, enabling feasible fine-tuning of LLMs on edge hardware."
  },
  {
    "question": "How does this work advance the field beyond Malladi et al. (2023)?",
    "answer": "While Malladi et al. introduced MeZO using ZO-SGD, this work expands the scope by benchmarking six different ZO optimization methods across five LLM families, introduces new techniques to improve accuracy and memory efficiency, and systematically analyzes optimization principles."
  },
  {
    "question": "What broader implications does this study suggest for the future of LLM optimization?",
    "answer": "The study suggests that ZO optimization could redefine the paradigm of LLM fine-tuning by enabling high-performance, low-memory training regimes, thus democratizing access to LLM customization and enabling novel applications in low-resource settings."
  }
]

In [None]:
title = "Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark"
generate_QA_pair("19", 2024, "zeroth_order_optimization", title, qa_pairs)

### **20: Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning**

**Abstract**

This paper presents a systematic overview of parameter-efficient fine-tuning methods, covering over 50 papers published between early 2019 and mid-2024. These methods aim to address the challenges of fine-tuning large language models by training only a small subset of parameters. We provide a taxonomy that covers a broad range of methods and present a detailed method comparison with a specific focus on real-life efficiency in fine-tuning multibillion-scale language models. We also conduct an extensive head-to-head experimental comparison of 15 diverse PEFT methods, evaluating their performance and efficiency on models up to 11B parameters. Our findings reveal that methods previously shown to surpass a strong LoRA baseline face difficulties in resource-constrained settings, where hyperparameter optimization is limited and the network is fine-tuned only for a few epochs. Finally, we provide a set of practical recommendations for using PEFT methods and outline potential future research directions.

**Introduction**

In October 2018, BERT Large with 350 million parameters was the biggest Transformer model ever trained. At the time, contemporary hardware struggled to fine-tune this model. The section “Out-of-memory issues” on BERT’s GitHub specifies the maximum batch size for BERT Large given 12GB of GPU RAM and 512 tokens as zero. Five years in, publicly available models grew to 176 billion parameters, i.e. by a factor of 500. Published literature includes models up to 1 trillion parameters. However, single-GPU RAM increased less than 10 times due to the high cost of HBM memory. Model size scales almost two orders of magnitude quicker than computational resources making fine-tuning the largest models to downstream tasks infeasible for most and impractical for everyone.

In-context learning thus became the new normal, the standard way to pass downstream task training data to billion-scale language models. However, the limited context length imposed by the transformer architecture, the absence of ICL abilities in moderately large language models, the quadratic increase in computational cost with an increase in context length (or demonstrations in ICL), and the sensitivity of ICL performance present challenges in the utility, reliability, and efficiency of ICL. In cases where the model performs at par or better in the ICL setting compared to the fine-tuned model, fine-tuning is still a lucrative strategy due to the impractical inference cost of ICL. Thus, we, as a community of researchers and engineers, need efficient ways to train on downstream task data.

Parameter-efficient fine-tuning (PEFT) aims to resolve this problem by only training a small set of parameters, which might be a subset of the existing model parameters or a set of newly added parameters. These methods differ in parameter and memory efficiency, training speed, final model quality, and additional inference costs (if any).
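
To make "a set of newly added parameters" concrete, the sketch below compares the number of trainable parameters when a single weight matrix is fully fine-tuned with the number when the matrix is frozen and a LoRA-style low-rank update is trained instead. The layer size (4096 × 4096) and the rank (8) are hypothetical values chosen for illustration, not settings taken from the paper.

```python
def full_finetuning_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def low_rank_update_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of a low-rank update B @ A,
    with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, for illustration only.
d_in, d_out, rank = 4096, 4096, 8

full = full_finetuning_params(d_in, d_out)
low_rank = low_rank_update_params(d_in, d_out, rank)

print(f"full fine-tuning : {full:,} trainable parameters")
print(f"low-rank update  : {low_rank:,} trainable parameters")
print(f"fraction trained : {low_rank / full:.4%}")
```

A small trainable fraction does not by itself imply low memory use; the survey's comparison section makes that point explicitly.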

In the last few years, more than a hundred PEFT papers have been published, with several studies providing a good overview of the most popular methods, such as Adapters, BitFit, LoRA, Compacter, and Soft Prompts.

Pfeiffer et al. (2023) presented a survey on modular deep learning, providing an overview of several similar methods from the perspective of modularity and multi-task inference. Our focus differs by concentrating on PEFT methods, specifically for fine-tuning large language models, where minimizing RAM consumption and training time without sacrificing performance is crucial.

This survey presents a systematic overview, comparison, and taxonomy of 30 parameter-efficient fine-tuning methods. Over the last year, research efforts have also focused on replicating the success of PEFT in the pre-training regime. Hence, we also discuss a few prominent methods that aim to achieve efficiency gains during pre-training. We discuss 30 methods in-depth, covering over 50 papers published from early 2019 to mid-2024. We highlight the current unresolved challenges in PEFT, including the limited theoretical understanding, the performance gap between PEFT and traditional fine-tuning, and reporting issues.

We conduct the most extensive experimental comparison of PEFT methods, evaluating 14 methods and their variations across five datasets and three model sizes (0.7B, 3B, and 11B). The study includes a detailed comparison of these methods’ efficiency in terms of GPU memory consumption and throughput. Our findings reveal that methods previously shown to outperform LoRA struggle to do so in resource-constrained settings, and that hybrid PEFT methods exhibit high hyperparameter sensitivity.

We found that Kronecker-based reparametrizations, while not enhancing memory efficiency compared to matrix-product counterparts, can improve training and inference speeds with efficient implementation. Surprisingly, Layer Norm tuning performs exceptionally well compared to most PEFT methods in our study. We also note a significant discrepancy between reported and actual trainable parameter counts in PEFT methods. This leads to unforeseen outcomes, such as the high computational costs of Prompt Tuning and Hybrid methods. Our code is available on GitHub.

In conclusion, we suggest several avenues for improvement, such as developing standardized PEFT benchmarks, conducting in-depth studies on hyperparameters and interpretability, exploring the difference in training dynamics of reparametrized neural networks, further improving training and inference efficiency of PEFT methods, and utility of PEFT methods with quantized backbone models.

**11 Comparison of PEFT Methods**

We provide a detailed comparison of the trainable and updated parameters for 28 PEFT methods in Table 2 and discuss this issue further in Section ??. Generally, sparse methods tend to have more trainable than changed parameters, while reparametrization methods often have fewer trainable parameters due to the nature of reparametrization.

In our study, we consider five key dimensions essential to benchmark the effectiveness of parameter-efficient fine-tuning methods. These dimensions include storage efficiency, memory efficiency, computational efficiency, inference overhead, and downstream performance metrics (e.g. accuracy). Our analysis of the published literature shows that while these dimensions are interconnected, improvements along one of the axes do not necessarily translate into improvements along the others. For example, optimizing for parameter efficiency alone does not guarantee reduced RAM usage. Table 1 summarizes our findings.

Despite their significance, PEFT performance metrics such as memory efficiency, training speed, and inference overhead (e.g., throughput) are only occasionally quantified in the papers. However, presenting these metrics only helps to further analyze a particular PEFT method of interest, in isolation. Head-to-head comparison across different PEFT methods is still challenging, primarily because of the impact of experimental setup on performance metrics. Thus, to address this gap, we fix the experimental set-up and perform a large-scale experimental comparison of PEFT methods as a part of this survey. We discuss details of the experimental setup and our results in the following sections.

**11.1 Experimental Comparison: Setup**

Our experimental comparison is designed to provide a comprehensive evaluation of PEFT methods, exceeding the scope and depth of existing studies such as Ding et al. (2022). We have carefully selected 14 PEFT methods representing diverse categories within our taxonomy – including Additive, Selective, Reparametrization-based, and Hybrid methods – to ensure a broad and inclusive analysis. Notably, we exclude sparse selective methods, acknowledging their limited practicality on modern hardware and their primary focus on storage efficiency. Furthermore, our study omits BitFit in the context of T5 networks, which do not utilize biases. In addition to these PEFT methods, we include a full fine-tuning baseline to provide a reference point for both downstream task performance and efficiency metrics. Unlike Ding et al. (2022), which limited their experimental comparison to four PEFT methods and focused primarily on downstream performance and memory consumption, our experimental design covers a wider range of methods and evaluates them in all the following: memory efficiency, training speed, inference speed, and downstream performance.

**Datasets** We use both natural language understanding (NLU) and natural language generation (NLG) tasks for the comparison of the methods.

While the GLUE benchmark (Wang et al., 2018) is commonly used to evaluate parameter-efficient fine-tuning methods in the existing literature, many models now match or exceed human performance on GLUE tasks, some even with no or minimal training. This makes GLUE less effective at evaluating fine-tuning procedure performance. More recently, proposed alternatives to GLUE include MMLU, HELM, and BigBench (Srivastava et al., 2022). MMLU emphasizes zero- or few-shot evaluations, making it unsuitable for assessing fine-tuning. Both HELM and BigBench present computational challenges due to their task diversity, especially when comparing a broad range of methods and models of up to 11B parameters.

In contrast, SuperGLUE tasks remain both demanding (with only a few models surpassing the human baseline) and computationally manageable. Specifically, we selected BoolQ, RTE, and COPA for this study. BoolQ is a yes/no question-answering dataset mainly evaluating models’ world knowledge. COPA focuses on commonsense causal reasoning, for example, “Premise: The man broke his toe. What was the CAUSE of this? Alternative 1: He got a hole in his sock. Alternative 2: He dropped a hammer on his foot.” RTE is a natural language inference dataset where, given a premise, the model needs to predict if the hypothesis can be inferred from it, contradicts it, or is not relevant.

For more diverse comparisons, we also include one natural language generation dataset: CNN-Dailymail, a large (300K training examples) summarization dataset. From the surveyed literature, we found that summarization usually highlights differences between PEFT methods and full fine-tuning, making this dataset particularly useful.

**Models** To compare parameter-efficient fine-tuning methods, we apply them to three sizes of T5: Large (0.7B), 3B, and 11B (Raffel et al., 2019). The range from 0.7B to 11B models not only tests each method’s effectiveness at different scales but also presents common challenges associated with large-scale model training. A key aspect of this comparison is to demonstrate how PEFT methods can address practical issues such as memory constraints. For instance, the 11B model allows us to compare PEFT methods’ performance and efficiency in one of the most relevant practical cases when full fine-tuning does not fit even into 80GB of GPU memory.

**PEFT Methods** We use the following fine-tuning methods in our comparison:

• Full tuning – regular fine-tuning of all model parameters

• Houlsby – Adapters inserted after the attention and FCN layers of the Transformer as described in Section 5

• Pfeiffer – Adapters inserted only after the FCN layers

• Parallel Adapter – scaled parallel adapter as described in (He et al., 2022a)

• IA3 – (IA)³ learns re-scaling vectors for keys, values, and hidden FFN activations (Section 7.2)

• Prefix Tuning – learns a prefix added to keys and values and uses FCN reparametrization for these parameters (Section 6.2)

• Prompt Tuning – learns a prefix added to keys directly (Section 6.1)

• LN tuning – fine-tune only layer-norm parameters

• LoRA (q and v) – LoRA applied to the query and value networks only (Section 9.2)

• LoRA (all linear) – LoRA applied to all linear layers in the model

• KronA – LoRA-like Kronecker product-based reparametrization of the weight matrices (Section 9.3)

• MAM – Mix-and-Match Adapters (Section 10.2)

• Compacter – Kronecker product-based reparametrization of Adapter layers as described in Section 10.4

• Compacter++ – Compacter layers that are only applied after the FCN in the Transformer, similar to the idea of Pfeiffer vs Houlsby Adapters

• UniPELT – a hybrid method that combines LoRA, Prefix Tuning, and Adapters through gating (Section 10.3)

**Metrics** In our evaluation, we focus on assessing PEFT method efficiency in terms of memory consumption, training speed, and inference speed, and then compare models on downstream metrics.

To quantify memory efficiency, we track the maximum RAM consumption during training using torch.cuda.max_memory_allocated(). Training speed is quantified by the number of input tokens processed per second during training and for inference – during evaluation. We do not explicitly merge reparametrization-based methods into model weights during evaluation to present results for a typical use-case when methods like LoRA are used in the adapter fashion. When merged, there should be no difference between reparametrization-based methods and regular training in terms of the inference speed. We use accuracy for SuperGLUE datasets and ROUGE-L for summarization.
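
The remark about merging can be illustrated with a tiny numerical sketch. It assumes a LoRA-style update of the form W + BA (the usual scaling factor is omitted for brevity) and uses small hand-written list-based matrix helpers so that it runs without dependencies. All matrix values are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Frozen weight W (2x3) and a rank-1 update B @ A with B (2x1), A (1x3).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
B = [[0.5], [-1.0]]
A = [[1.0, 2.0, 3.0]]
x = [[1.0], [2.0], [3.0]]  # input as a column vector

# Adapter fashion: the low-rank path costs extra matmuls on every forward pass.
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)))

# Merged: fold B @ A into W once, then inference is a single matmul again.
W_merged = add(W, matmul(B, A))
y_merged = matmul(W_merged, x)

print("adapter:", y_adapter)
print("merged :", y_merged)
```

Both paths give the same output. The merged form has the same inference cost as the original layer, whereas the adapter form keeps the update separate at the cost of extra computation.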

**Implementation details and hyperparameters** All models are fine-tuned in text-to-text fashion following Raffel et al. (2019). We use Adapters and PEFT libraries for most of the methods and implement several methods in our repository from scratch. When using existing implementations, we utilize default architecture hyperparameters for the method from the corresponding library, which are usually close to the hyperparameters reported in the method’s original paper.

For all NLU datasets, we perform a learning rate sweep over the values {1e-3, 1e-4, 1e-5} and select the best, which we then train on two more random seeds to estimate the standard deviation. Our preliminary experiments indicate a negligible impact of weight decay in our setup (less than 0.01). Due to computational constraints, CNN/Dailymail experiments only use one seed and a learning rate of 1e-3, which we found to perform consistently well across PEFT methods. We estimate the standard deviation of CNN/Dailymail runs via several random seeds on randomly selected PEFT methods and find it to be lower than 0.5 ROUGE-L points, which we use for the standard deviation estimate for the rest of the methods.

We use a maximum input sequence length of 512 tokens in all our experiments. In NLU experiments, the maximum output sequence length is 8, and for summarization, it is 128 tokens.

Each NLU model undergoes training for either 3 epochs or a minimum of 100 update steps. For CNN/Dailymail, the training duration is set to one epoch (9 thousand update steps). We use a batch size of 32 in all our experiments, utilizing gradient accumulation to achieve this batch size when needed. While all SuperGLUE tasks converge by the end of training in most of our experiments, we observe that CNN/Dailymail continues to improve throughout the training and does not plateau. Our setup thereby favors methods exhibiting faster learning, which is especially relevant for low-resource scenarios commonly faced in PEFT applications.
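
The relationship between batch size, gradient accumulation, and update steps described above reduces to simple arithmetic. In the sketch below, the micro-batch size of 8 is a hypothetical per-device value, and the dataset size is the rounded figure of 300K training examples quoted earlier for the summarization dataset. The helper names are illustrative.

```python
import math


def accumulation_steps(effective_batch: int, micro_batch: int) -> int:
    """Number of micro-batches accumulated before each optimizer update."""
    if effective_batch % micro_batch != 0:
        raise ValueError("effective batch must be a multiple of the micro-batch")
    return effective_batch // micro_batch


def update_steps_per_epoch(num_examples: int, effective_batch: int) -> int:
    """Optimizer updates needed to see every training example once."""
    return math.ceil(num_examples / effective_batch)


EFFECTIVE_BATCH = 32       # batch size used in all experiments
MICRO_BATCH = 8            # hypothetical: what fits on the GPU at once
NUM_EXAMPLES = 300_000     # rounded size of the summarization training set

print("accumulation steps:", accumulation_steps(EFFECTIVE_BATCH, MICRO_BATCH))
print("updates per epoch :", update_steps_per_epoch(NUM_EXAMPLES, EFFECTIVE_BATCH))
```

With the rounded dataset size this gives a little over nine thousand updates for one epoch, in line with the figure quoted in the text.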

In total, we train three models of size 0.7-11B parameters with 14 PEFT methods, five runs for each of the three NLU datasets, and one for the summarization dataset. Together with the full fine-tuning baseline, this brings the total experiment count to around 700. We report raw (non-aggregated) results in Appendix B.
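
The "around 700" figure can be reproduced with a short calculation. The sketch assumes that the five runs per NLU dataset are the three learning-rate runs plus two extra seeds, and that the full fine-tuning baseline receives the same number of runs as each PEFT method. Both are readings of the setup description rather than details confirmed by the paper.

```python
NUM_MODEL_SIZES = 3        # T5-Large (0.7B), 3B, 11B
NUM_PEFT_METHODS = 14
NUM_NLU_DATASETS = 3       # BoolQ, RTE, COPA
RUNS_PER_NLU_DATASET = 5   # assumed: 3 learning rates + 2 extra seeds
RUNS_PER_NLG_DATASET = 1   # summarization, single seed

runs_per_method_and_size = (
    NUM_NLU_DATASETS * RUNS_PER_NLU_DATASET + RUNS_PER_NLG_DATASET
)

peft_runs = NUM_MODEL_SIZES * NUM_PEFT_METHODS * runs_per_method_and_size

# Assumption: the full fine-tuning baseline gets the same number of runs.
total_runs = NUM_MODEL_SIZES * (NUM_PEFT_METHODS + 1) * runs_per_method_and_size

print("PEFT runs:", peft_runs)
print("including full fine-tuning:", total_runs)
```

Under these assumptions the count falls between 672 and 720, which is consistent with the approximate total reported.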

**Hardware setup** We estimate throughput using a single A100 40GB GPU for most of the experiments, with several exceptions due to out-of-memory issues. UniPELT, MAM, and Prefix Tuning for T5-11B were trained with a single A100 80GB GPU, which should give comparable throughput numbers to the A100 40GB. Full fine-tuning T5-11B experiments were performed with two A100 80GB GPUs using model-parallel distributed training. RAM estimates for this model training are the total memory consumption of both GPUs, which should give an estimate comparable to the rest of the experiments, as optimizer states are not shared between GPUs in model-parallel training.

**11.2 Comparison Results: Downstream Performance**

Table 4 shows downstream metrics averaged over the datasets. Scores are averaged, and standard deviations are aggregated using the Euclidean mean of per-dataset variances. This table compares the downstream performance of PEFT methods across model scales. Non-aggregated results for all our experiments are available in Appendix B. We note a few key observations:

**Houlsby Adapters and LoRA consistently perform the best** Houlsby Adapters and LoRA are the only methods that consistently achieve full-tuning performance with little to no effort in hyperparameter tuning.

**Hybrid methods are especially sensitive to hyperparameters** MAM Adapters and UniPELT were consistently hard to train. While the results in Table 4 include only the best model from our sweep over three learning rates, additional experiments to improve MAM and UniPELT only marginally improved their performance. We attribute this to the generally poor performance of Prompt Tuning when trained in a compute-limited scenario.

**Prefix Tuning and Prompt Tuning significantly differ in performance** Prefix Tuning and Prompt Tuning are two different PEFT methods that are easy to confuse in terms of naming and concept. Both methods use the idea of continuous prompt optimization, but Prefix Tuning reparametrizes the trainable prefix via a fully-connected network (Section 6.2). In contrast, Prompt Tuning directly optimizes the prefix, albeit at the cost of slower convergence and typically much larger prefix length. We observe significant differences between these methods in our experiments. Both of them suffer from slow convergence, which substantially hurts performance in our setup. However, Prompt Tuning never outperformed the constant prediction baseline. Additionally, Prompt Tuning was extremely sensitive to the random seed (especially for T5-large and 3B models), as observed by its high standard deviation from the mean.

**Multiple methods underperform their reported values** Multiple methods that had claimed to outperform Adapters or LoRA (virtually all other methods) do not perform well in our setup. This includes most of the methods with the exception of Parallel Adapter, Compacter, and KronA, which perform on par with the best methods in several cases, especially for 11B models.

**Pfeiffer Adapters Perform Significantly Worse Than Houlsby** Pfeiffer et al. (2020a) observe that inserting adapters only after the FCN in the Transformer achieves similar performance as inserting adapters after both FCN and Attention (MHA) layers. However, in our experiments we find a significant and consistent difference of up to 15 points that increases with model scale. This highlights the importance of evaluating methods for both small and large models.

**Layer Norm Tuning is unexpectedly competitive** Layer Norm parameters are rarely used for parameter-efficient fine-tuning; we found only a few mentions of the method (AkbarTajari et al., 2022; Liu et al., 2022). However, it can be implemented in one line of code and shows performance competitive to full fine-tuning for T5-Large and T5-11B. We want to highlight this result and recommend using LN tuning as a baseline for future PEFT work.
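
To illustrate why Layer Norm tuning is so cheap to implement, here is a minimal sketch that freezes every parameter except those whose name contains a given substring. The helper `enable_layer_norm_tuning`, the `"layer_norm"` substring, and the toy parameter names are assumptions made for illustration; they are not taken from the paper's code.

```python
from types import SimpleNamespace

def enable_layer_norm_tuning(named_parameters, pattern="layer_norm"):
    """Freeze everything except parameters whose name contains `pattern`."""
    trainable = []
    for name, param in named_parameters:
        # The core of the method is this single assignment.
        param.requires_grad = pattern in name
        if param.requires_grad:
            trainable.append(name)
    return trainable

# Stand-in parameters so the sketch runs without a deep learning framework.
# The names imitate a T5-style module tree and are hypothetical.
toy_parameters = [
    ("encoder.block.0.layer.0.SelfAttention.q.weight", SimpleNamespace(requires_grad=True)),
    ("encoder.block.0.layer.0.layer_norm.weight", SimpleNamespace(requires_grad=True)),
    ("encoder.final_layer_norm.weight", SimpleNamespace(requires_grad=True)),
]
trainable_names = enable_layer_norm_tuning(toy_parameters)
print(trainable_names)
```

With a PyTorch model, the same function could be called on `model.named_parameters()`, provided the Layer Norm parameters follow a naming pattern like the one assumed here.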

**11.3 Comparison Results: Efficiency**

Table 5 presents a detailed comparison of efficiency and performance for the 14 PEFT methods compared in our study. We show the actual number of trainable parameters (as opposed to changed parameters), the maximum GPU memory consumption during training, and the throughput in ktok/s (thousands of tokens per second) both during training and inference.

**All PEFT methods reduce memory consumption** As expected, all methods from our study significantly reduce memory consumption. The smallest improvement we see is 4GB in the UniPELT and T5-Large combination, which is quite considerable because it is 10% of the GPU RAM. The biggest improvement is 71.5GB in the Compacter++ and T5-11B combination. This allows fine-tuning T5-11B on a single 40GB GPU instead of two 80GB GPUs and dramatically improves training speed by a factor of more than two.

**Smaller models (less than 1B) can train slower with PEFT** Any PEFT (Parameter-Efficient Fine-Tuning) method that adds parameters to the network involves additional forward (and potentially backward) pass overhead. For sufficiently large models or when only a few parameters are added, this overhead can be negligible. However, if the method adds too many parameters, it can lead to slower training compared to regular fine-tuning. We observe this in T5-Large models, which are small only compared to billion-scale models, as they have 738M parameters. For instance, applying LoRA to all T5-Large parameters results in a 20% training slowdown. Similar slowdowns are noted for MAM adapters, Compacter, and UniPELT, with 20%, 5%, and 40% slower training, respectively, compared to full fine-tuning. Despite these slowdowns, they all offer memory improvements.

**PEFT significantly affects inference speed** In all PEFT methods that add trainable parameters to the network, we observe a significant slowdown in inference speed. The slowdown ranges from 33-55% for T5-Large, 20-60% for T5-3B, and 20-55% for T5-11B (absolute points). Within the set of additive methods, we observe that Pfeiffer adapters and (IA)³ offer the best inference speeds. It is important to note that in our throughput estimation for reparametrization-based methods, we did not merge the method parameters into the network. If merged, they would have the same inference speed as regular fine-tuning, as no additional parameters are present. However, methods like LoRA are increasingly used in modular approaches without merging LoRA parameters. The results from Table 5 are relevant for these scenarios.
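
The point about merging can be checked on a toy example. The sketch below uses a LoRA-style update `W + B @ A` with hypothetical values and omits LoRA's scaling factor for brevity. It shows that the merged and unmerged forms produce the same output, while only the unmerged form pays for extra matrix products on every forward pass.

```python
def matmul(a, b):
    """Plain-Python matrix product for small nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Toy sizes: d_out = 2, d_in = 3, rank r = 1. All values are hypothetical.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]      # frozen weight, shape (2, 3)
B = [[0.5], [1.0]]          # shape (2, 1)
A = [[1.0, 2.0, 0.0]]       # shape (1, 3)
x = [[1.0], [1.0], [1.0]]   # input column vector, shape (3, 1)

# Unmerged: two extra matrix products on every forward pass.
unmerged = add(matmul(W, x), matmul(B, matmul(A, x)))

# Merged: fold the update into W once, then a single product per forward pass.
W_merged = add(W, matmul(B, A))
merged = matmul(W_merged, x)

print(unmerged, merged)
```

Keeping the update unmerged is what makes modular use possible, since several `(B, A)` pairs can share one frozen `W`. That is the setting the throughput numbers apply to.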

**Kronecker-Based Reparametrizations Do Not Improve Memory Efficiency, But Improve Speed** Across different model scales, we observe that extremely parameter-efficient methods like Compacter and KronA, which employ Kronecker products to enhance parameter efficiency, do not significantly reduce memory usage. Despite training with two orders of magnitude fewer parameters than LoRA, the memory consumption of Compacter and KronA is nearly identical to that of LoRA. For instance, LoRA optimizes 20 million parameters for T5-11B, while KronA and Compacter each optimize less than 0.5 million. Nevertheless, all methods consume approximately 28.6GB of GPU memory. This result becomes intuitive in hindsight: beyond a certain point, the memory used for optimizer states and gradients becomes negligible, overshadowed by other factors such as model weights and hidden states. Nevertheless, we observe significant training and inference speed improvements with KronA over LoRA. This likely occurs due to the efficient Kronecker-vector product implementation in KronA (Section 9.3).
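
A back-of-the-envelope calculation makes the "negligible" claim concrete. The sketch assumes 4-byte values and an Adam-style optimizer that keeps two extra values per trainable parameter. Both are assumptions, since the passage above does not state the precision or optimizer. The parameter counts and the 28.6GB figure come from the text.

```python
def trainable_state_bytes(n_trainable, bytes_per_value=4, optimizer_slots=2):
    """Bytes for one gradient plus `optimizer_slots` optimizer values per parameter."""
    return n_trainable * bytes_per_value * (1 + optimizer_slots)

GB = 1e9
total_reported_gb = 28.6  # approximate figure quoted in the text for T5-11B

for name, n in [("LoRA", 20_000_000), ("KronA/Compacter (upper bound)", 500_000)]:
    gb = trainable_state_bytes(n) / GB
    share = 100 * gb / total_reported_gb
    print(f"{name}: {gb:.3f} GB, {share:.2f}% of {total_reported_gb} GB")
```

Under these assumptions, both methods spend well under 1% of the total on gradients and optimizer states, so a 40x reduction in trainable parameters barely moves the overall footprint.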

In conclusion, our experimental comparison shows several expected results, such as significant improvements in memory consumption and speed. However, we also observed some surprising results. Notably, we observed that methods like Layer Norm Tuning, which are often overlooked, can be unexpectedly effective. Additionally, the effects of various PEFT methods on inference speed, especially in larger models, highlight the complex trade-offs between efficiency and performance. These insights emphasize the need for a comprehensive evaluation of PEFT methods, taking into account not only memory and speed but also their scalability across different model sizes.

**12 Challenges and guidelines**

Survey papers tend to discuss reporting issues, and this one is no exception. We identified several challenges and inconsistencies that make it difficult to evaluate PEFT methods and draw direct comparisons between different PEFT methods, which warrant discussion.

**Reporting parameter count** One of the primary challenges stems from the difference in the way researchers report parameter counts. These inconsistencies arise from the inherent complexity of the problem. Parameter counts can be categorized into three types: the number of trainable parameters, the number of changed parameters between the original and fine-tuned models, and the rank of the difference between the original and fine-tuned models. These parameter counts are not equivalent. For example, IntrinsicSAID (Section 9.1) learns a low-rank (∼100-1000) transformation of model parameters. However, it changes all (100%) of the model’s parameters. DiffPruning (Section 8.2) learns an update of 0.5% of the parameters, but it actually trains 200% of the parameters: fine-tuning the model and learning the binary mask. For reparameterization-based methods (Sections 9.2, 9.3, 10.4), memory requirements may vary depending on the implementation design choices. Of the three types, the number of trainable parameters is the most reliable predictor of memory efficiency. However, it is still imperfect: Ladder-side Tuning trains more parameters than LoRA or BitFit, but it uses less RAM by avoiding backpropagation to the main network.
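
The three counts can diverge sharply even for a single weight matrix. As a minimal sketch, consider a LoRA-style rank-r update `W + B @ A` to a `d_out x d_in` matrix. The function name and dimensions are hypothetical, and the "changed" count assumes the update is merged into `W` and is dense.

```python
def lora_style_counts(d_out, d_in, rank):
    """Three ways to count parameters for a rank-r update W + B @ A."""
    trainable = rank * (d_out + d_in)   # entries of B (d_out x r) and A (r x d_in)
    changed = d_out * d_in              # a merged dense update moves every entry of W
    return {"trainable": trainable, "changed": changed, "rank": rank}

counts = lora_style_counts(d_out=1024, d_in=1024, rank=8)
print(counts)
```

Here the same method could be reported as training about 1.6% of the matrix, changing 100% of it, or having rank 8, depending on which count a paper chooses.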

**Reporting efficiency** Evaluating the efficiency of PEFT methods solely based on parameter count is challenging due to the non-linear relationship between parameter count and efficiency. Efficiency in training time is better assessed through memory consumption and training speed. Most PEFT categories, except for Sparse-selective methods, significantly improve RAM usage. However, the IntrinsicSAID (Aghajanyan et al., 2020) method, which is Reparametrization-based, can result in higher memory usage than full training due to the Fastfood transformation’s demands. Our experiments revealed that modularity in hybrid PEFT methods comes at the cost of notably higher memory consumption. This emphasizes the need for studies to report memory consumption to help practitioners make informed decisions. We also noticed considerable variability in training speed even with similar RAM usage, suggesting that RAM consumption should be considered alongside training speed. After training, the storage space required for the changed parameters is crucial for evaluating PEFT methods. Unlike full fine-tuning, which alters all model parameters, PEFT only requires saving a subset, significantly improving storage efficiency. However, methods like IPT require saving different parameter sets at various training stages, making clear reporting of space requirements essential. Inference latency is another critical factor in practice. Additive methods typically introduce overhead because they require computations on both the original network and the added parameters, whereas Selective methods do not, as they operate on existing model weights. Moreover, additive and reparametrization-based methods offer advantages in multi-task inference by reducing memory usage from O(NM) to O(M + NA), where A is the number of added weights per task. Some additive methods, like LST, can also enhance inference speed by using the original network solely as a feature extractor. For further details on multi-task training and inference, we refer readers to Modular Deep Learning.
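
The multi-task saving from O(NM) to O(M + NA) is easy to see with numbers. The sketch below reads N as the number of tasks and M as the backbone size, which is an assumption, since only A is defined in the passage above. The sizes are hypothetical.

```python
def full_finetuning_storage(n_tasks, model_params):
    """O(NM): one full copy of the model per task."""
    return n_tasks * model_params

def peft_storage(n_tasks, model_params, added_per_task):
    """O(M + NA): one shared backbone plus a small set of weights per task."""
    return model_params + n_tasks * added_per_task

# Hypothetical sizes: 10 tasks, an 11B-parameter backbone, 20M added weights per task.
N, M, A = 10, 11_000_000_000, 20_000_000
full = full_finetuning_storage(N, M)
peft = peft_storage(N, M, A)
print(full, peft, full / peft)
```

With these numbers the shared-backbone setup holds roughly a tenth of the weights, and the gap widens as the number of tasks grows.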

**Model sizes** Another challenge arises from the variation in model sizes used in the evaluation of PEFT methods. It is important to assess methods by fine-tuning models of different sizes, especially those with more than 1B and fewer than 20B parameters. With the increase in the backbone model size, the need and usefulness of PEFT methods increase rapidly. Several studies have demonstrated that larger models require fewer parameters to be updated during fine-tuning, both in terms of percentage and, when the model is large enough, sometimes even in absolute terms (Li and Liang, 2021). We would like to particularly stress this, considering that even recent papers often focus solely on BERT. Furthermore, in our experiments, Layer Norm tuning (AkbarTajari et al., 2022) was the only consistently efficient method at different scales, while maintaining a competitive performance on downstream tasks. For all other methods, efficiency and performance vary considerably across model sizes. Thus, model size must be considered when reporting PEFT methods.

**Method Implementation** Another issue encountered is the state of published implementations. Many codebases are simply copies of the Transformers library or other repositories with only minor modifications. These copies often do not use git forks, making it difficult to identify the differences unless they are highlighted in the README file. But even when differences are easy to find, the code is frequently not readable or reusable. Users are often required to install a modified version of the Transformers library, which conflicts with the most recent version and lacks documentation or examples of how to reuse the method outside of the existing codebase. Despite these challenges, there are some methods with reusable implementations worth highlighting, such as LoRA and Compacter. These implementations stand out for their user-friendliness and adaptability, providing a solid foundation for further research and development.

**Comparison** Intuitively, the presented PEFT method should be compared against popular approaches (e.g., LoRA, BitFit, Adapters) and the methods that share conceptual and architectural similarities with the presented method. However, the absence of standard benchmarks and metrics complicates the comparison of PEFT methods. New methods are often evaluated on different model/dataset combinations, making it challenging to draw meaningful conclusions. We would like to highlight the papers that report a variety of metrics on standard datasets, simplifying comparison to other methods. For example, KronA evaluated T5-base on the GLUE benchmark and reported accuracy, training time, and inference time while maintaining the same number of trainable parameters. UniPELT (Mao et al., 2021) assessed BERT on the GLUE benchmark and reported accuracy, training time, and inference latency, although it used different parameter counts for various methods. LST evaluated different T5 sizes on the GLUE benchmark, reporting metrics such as accuracy, training time, the number of updated parameters, and memory usage. MAM (He et al., 2022a) applied multiple models to the XSUM benchmark and reported accuracy across a range of trainable parameters, although memory comparisons were not provided. However, even these papers lack full comparability due to differences in their evaluation settings, such as varying parameter counts or the absence of certain metrics like memory comparisons. These inconsistencies highlight the need for a standardized benchmark and unified metrics to facilitate more accurate comparisons and evaluations of PEFT methods. Based on our survey and experiments we identified the principal qualities of each of the categories and summarized them in this section and Table 6.

**13 Discussion**

The growing accessibility of large language models and the democratization of their inference through low-bit quantization (Dettmers et al., 2022; Dettmers and Zettlemoyer, 2022) have enabled the research community to study, experiment, and tackle new tasks with relatively modest compute budgets. Parameter-efficient fine-tuning is the next step that allows us not just to infer, but to modify these models.

Some methods, including Adapters, Prompt Tuning, LoRA, and (IA)³, have shown their practicality at scale (Table 2). However, in practice, matching the performance of full fine-tuning remains a challenge. One of the reasons is high sensitivity to hyperparameters, with optimal hyperparameters often significantly deviating from those used in full fine-tuning due to the varying number of trainable parameters. For instance, the optimal learning rate for parameter-efficient fine-tuning is generally much higher than that for full fine-tuning. The research community should promote in-depth investigations into the impact of hyperparameters on these methods and find reasonable defaults, as parameter-efficient fine-tuning of large models can be noticeably costly at the 20-100B scale. Additionally, efforts should be directed towards developing methods that minimize hyperparameter sensitivity, such as pre-training new parameters.

Examining the taxonomy of methods and the progress made thus far, it is evident that low-rank reparameterization has been remarkably successful in enhancing parameter efficiency. LoRA-style (Section 9.2) and Kronecker-product (Sections 10.4 and 9.3) reparameterizations both decrease the number of trainable parameters while requiring minimal extra computation. A possible future direction for finding new PEFT models is exploring different reparametrization techniques with favorable trainable parameter count vs. rank ratio.

Another possible direction for improvement is utilizing what we know about how transformer models process texts (Rogers et al., 2020). Most of the PEFT methods work uniformly for the model, while we know that models process input differently at different layers. Utilizing this knowledge or building systems that have an adaptive number of parameters per layer could further improve parameter efficiency and accuracy.

In many respects, fine-tuning large language models faces the same challenges as those encountered in edge machine learning – we consistently face constraints on memory, computation, and even energy consumption. Techniques like quantization and pruning that are widely used in edge machine learning now benefit large language models. As we move forward, it is not only plausible but also likely that more ideas could be exchanged between these two areas. Cross-disciplinary collaboration could facilitate the exchange of ideas, accelerating innovation and progress in parameter-efficient fine-tuning.




In [None]:
qa_pairs = [
  {
    "question": "What fundamental problem does parameter-efficient fine-tuning (PEFT) seek to address in large language models?",
    "answer": "PEFT addresses the prohibitive memory and computational costs of full fine-tuning by training only a small subset of parameters, enabling fine-tuning of large language models on resource-constrained hardware without significant loss in performance."
  },
  {
    "question": "What are the five key dimensions used in this paper to benchmark PEFT methods?",
    "answer": "The five dimensions are: storage efficiency, memory efficiency, computational efficiency, inference overhead, and downstream performance metrics such as accuracy or ROUGE-L."
  },
  {
    "question": "How does the paper categorize PEFT methods and what is the rationale behind excluding some from the comparison?",
    "answer": "The paper categorizes PEFT methods into Additive, Selective, Reparametrization-based, and Hybrid. Sparse-selective methods are excluded from the experiments due to their limited practicality on modern hardware and their narrow focus on storage efficiency."
  },
  {
    "question": "What were the key experimental findings regarding the performance of Houlsby Adapters and LoRA?",
    "answer": "Houlsby Adapters and LoRA were the only methods that consistently achieved full fine-tuning performance across model scales, and they did so with little to no hyperparameter tuning effort."
  },
  {
    "question": "Why is Layer Norm tuning considered a surprising baseline, and how did it perform?",
    "answer": "Layer Norm tuning is rarely studied in PEFT literature, yet in this study it performed competitively with full fine-tuning for T5-Large and T5-11B models, making it a simple, efficient, and effective baseline."
  },
  {
    "question": "What trade-off was observed between training speed and model size for PEFT methods?",
    "answer": "While PEFT methods reduce memory usage, they can slow down training for smaller models like T5-Large due to the overhead of added parameters. However, for larger models, this overhead becomes negligible, making PEFT more advantageous at scale."
  },
  {
    "question": "How do reparametrization-based methods like KronA and Compacter affect training and inference?",
    "answer": "Although KronA and Compacter train roughly two orders of magnitude fewer parameters than LoRA, their memory consumption is nearly identical to LoRA's. KronA, however, shows significantly faster training and inference than LoRA, likely due to its efficient Kronecker-vector product implementation."
  },
  {
    "question": "What challenges are associated with comparing PEFT methods across papers?",
    "answer": "Challenges include inconsistent reporting of parameter counts, differing evaluation setups, lack of standardized benchmarks, and absence of unified metrics, which make it difficult to draw fair comparisons between methods."
  },
  {
    "question": "What causes hybrid PEFT methods like UniPELT and MAM to perform poorly in this study?",
    "answer": "Hybrid methods were consistently hard to train and highly sensitive to hyperparameters, and additional tuning improved them only marginally. The authors attribute this to the generally poor performance of Prompt Tuning when trained in a compute-limited scenario."
  },
  {
    "question": "How does the performance of Prompt Tuning compare to Prefix Tuning, and what might explain the difference?",
    "answer": "Prompt Tuning underperforms Prefix Tuning, never outperforming a constant prediction baseline. The difference lies in Prefix Tuning's reparametrization of prefixes via a fully connected network, whereas Prompt Tuning directly optimizes longer prefixes, resulting in slower convergence. Prompt Tuning was also extremely sensitive to the random seed."
  },
  {
    "question": "Why is fine-tuning still considered more practical than in-context learning (ICL) despite the latter’s popularity?",
    "answer": "Fine-tuning, once completed, offers significantly lower inference costs and greater reliability compared to ICL, which suffers from limited context length, quadratic compute scaling, and sensitivity to prompt formatting."
  },
  {
    "question": "What insights does this paper offer for future PEFT method development?",
    "answer": "The paper suggests exploring new reparametrization techniques, leveraging insights into how Transformers process text across layers, and creating adaptive PEFT methods that vary parameter allocation per layer, aiming to improve both efficiency and accuracy."
  },
  {
    "question": "How might PEFT methods intersect with ideas from edge machine learning?",
    "answer": "Both domains share constraints on memory, compute, and energy, making techniques like quantization and pruning highly transferable. Cross-disciplinary collaboration could yield innovations benefiting both edge devices and large-scale model fine-tuning."
  },
  {
    "question": "What broader impact does parameter-efficient fine-tuning have on the accessibility of LLM research and deployment?",
    "answer": "PEFT democratizes the ability to adapt large models by lowering hardware requirements, enabling smaller teams and independent researchers to fine-tune billion-scale LLMs efficiently, thus broadening participation in state-of-the-art NLP development."
  }
]

In [None]:
title = "Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning"
generate_QA_pair("20", 2024, "scale_down_to_scale_up", title, qa_pairs)

### **21: Targeted Efficient Fine-tuning: Optimizing Parameter Updates with Data-Driven Sample Selection**

**Abstract**

Fine-tuning all parameters of Large Language Models (LLMs) is computationally expensive. Parameter-Efficient Fine-Tuning (PEFT) methods address this by selectively fine-tuning specific parameters. Most PEFT methods center on selecting or introducing a set of parameters to be fine-tuned. However, few methods consider the impact of data samples on parameter selection. Representative data-driven methods include the FISH Mask based method, which randomly selects a portion of the data samples as the basis for selecting parameters. However, this random sample selection cannot select optimal parameters when the data distribution is unstable. In this work, we introduce a data-centric approach and propose the Iterative Range Decreasing (IRD) algorithm to optimize the sample-parameter pair selection in FISH Mask. IRD iteratively refines the selection by identifying subsets of samples and parameters exhibiting higher Fisher information. We demonstrate the effectiveness and rationality of the proposed strategy by conducting experiments on the GLUE benchmark. Experimental results show that our strategy optimizes the parameter selection and achieves better performance than some typical baseline methods.

**Introduction**

Large language models have demonstrated remarkable capabilities across various fields through auto-regressive training on vast datasets from the internet. As generalist models, they are not optimized for any specific task during training. Therefore, supervised fine-tuning is usually applied to further improve performance on specific problems. With the rise of transfer learning, fine-tuning the parameters of a pre-trained model on a downstream task is becoming increasingly popular. However, as the parameter size of pre-trained models keeps growing (e.g., GPT-3, 175B parameters), it becomes challenging to fine-tune all parameters of LLMs. Consequently, various methods have been proposed to reduce GPU memory consumption and training time by freezing most of the parameters in the network and tuning only some of them. This approach is called Parameter-Efficient Fine-Tuning (PEFT) or Delta Tuning.

Several exemplary PEFT studies have been proposed, including but not limited to LoRA [14], Adapter [13], and Prompt Tuning [20]. These methodologies center on designing new additional parameter structures or carefully selecting specific parameters from existing networks. These existing works rarely pay attention to the impact of data on parameter selection before model training, even when data is needed for parameter selection. They simply select parameters or design additional structures empirically, and then evaluate the effectiveness of the method by fine-tuning on the complete dataset.

Data-oriented approaches [7, 32, 37] offer a promising direction for optimizing parameter selection in PEFT. One such method, FISH Mask [31], leverages the Fisher Information Matrix (FIM) to estimate the importance of each parameter with respect to the training data. Specifically, FISH Mask calculates the FIM based on a randomly selected subset of training data, using the FIM’s diagonal entries to rank parameter importance. This ranking then guides the selection of parameters to be fine-tuned. However, the assumption of independent and identically distributed data and the use of random sampling for FIM calculation in FISH Mask limit its effectiveness in real-world scenarios where data distributions can be complex and non-uniform. This raises a critical research question: How can we improve the sample selection process in FISH Mask to achieve better parameter selection and fine-tuning performance?
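
The ranking step described above can be sketched in a few lines. The diagonal of the FIM is estimated here as the average squared per-sample gradient, and the top-k entries form the mask. The gradient values, function names, and the choice of k are hypothetical, and this is a simplified illustration rather than the FISH Mask reference implementation.

```python
def diagonal_fisher(per_sample_grads):
    """Average of squared per-sample gradients, one value per parameter."""
    n = len(per_sample_grads)
    n_params = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(n_params)]

def top_k_mask(scores, k):
    """Binary mask that keeps the k parameters with the largest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]

# Hypothetical log-likelihood gradients: 3 samples x 4 parameters.
grads = [
    [0.1, -2.0, 0.0, 0.5],
    [0.2, 1.0, 0.0, -0.5],
    [-0.1, 3.0, 0.1, 0.5],
]
fisher = diagonal_fisher(grads)
mask = top_k_mask(fisher, k=2)
print(fisher, mask)
```

Because the estimate is an average over whichever samples are passed in, a different subset of samples can produce a different mask. That sensitivity to the sample subset is what the research question above targets.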

To address this limitation, we propose the Iterative Range Decreasing (IRD) algorithm, which iteratively refines the selection of samples and parameters based on their Fisher information. IRD identifies a sample set with higher Fisher information content, leading to more informed parameter selection. In summary, the contributions of this work are:

• We investigated Parameter-Efficient Fine-Tuning (PEFT) methods based on selective strategies and observed that high-quality training samples are crucial for selecting optimal fine-tuning parameters. Analysis of the FISH Mask method revealed that random sample selection for calculating the Fisher Information Matrix (FIM) limits its performance.

• To address this limitation, we propose an Iterative Range Decreasing (IRD) algorithm to optimize the sample selection process. IRD identifies a sample set with higher Fisher information, which we utilize for parameter selection during fine-tuning.

• Extensive experiments on various representative benchmarks demonstrate the superiority of our proposed strategy. Using only 0.2% of model parameters for fine-tuning, our method significantly outperforms existing approaches, including widely used methods such as LoRA.

**5.4. Experimental Results**

We conduct extensive experiments on the GLUE benchmark to verify the effectiveness of the IRD algorithm. Owing to page limits, we present figures instead of specific values, and distinguish the values through color shades and different marks. Section 5.4.1 uses an example to show how Table 3 is replaced by Fig. 2, and all other experimental results are drawn the same way. The remaining numerical values corresponding to the illustrated experimental results are placed in the Appendix.

**5.4.1. Case on SST-2 task: An Example**

In Fig. 2, we present a comparative visualization of the FISH Mask method versus IRD, tested on the SST-2 task using the BERT model. The result for each GLUE task is structured into two 4 by 4 matrices: the one on the left displays the results of the FISH Mask, while the one on the right showcases the results from IRD. Each block within these matrices corresponds to the fine-tuned score of a unique combination of Mask Sparsity and FISH Mask samples, as detailed in Table 3. The varying shades of green within the cells signify the performance metrics, with darker hues denoting the highest values achieved in each matrix. Unexplored combinations are indicated by grey blocks.

The lower right corner of the result table and the 4 by 4 matrix both correspond to the original sample-parameter set pair, and are reduced in size in turn through the IRD algorithm. The sample set size decreases from right to left, and the parameter size decreases from bottom to top. Therefore, using the IRD algorithm to search from the lower right corner to the upper left corner will produce a green ladder-like effect that ascends from the lower right to the upper left corner.

Notably, triangles pointing upwards in the right matrix denote outcomes that surpass their counterparts in the left matrix, a relationship mirrored by the arrow indicators in Table 3, with downward-pointing triangles indicating the opposite. Additionally, the dark green blocks with red borders highlight the overall maximum results across both 4 by 4 matrices. By opting for graphical representation over tabular data, we prioritize clarity and efficient space usage in conveying our experimental findings. In the experiments, the initial number of FISH Mask samples is 128 and the initial mask sparsity is 2.5, set according to the parameter-sample scaling law of FISH Mask shown in Fig. 1.
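
The upward and downward triangles amount to a cell-by-cell comparison of the two matrices. A minimal sketch of that tally follows, with `None` standing in for the grey unexplored blocks. The scores are made up and the function name is hypothetical.

```python
def count_arrows(baseline, candidate):
    """Count cells where candidate beats (up) or trails (down) the baseline.

    Cells set to None are unexplored and skipped, like the grey blocks.
    """
    up = down = 0
    for base_row, cand_row in zip(baseline, candidate):
        for base, cand in zip(base_row, cand_row):
            if base is None or cand is None:
                continue
            if cand > base:
                up += 1
            elif cand < base:
                down += 1
    return up, down

# Hypothetical 2 x 2 excerpts; the scores are made up, not taken from Fig. 2.
fish_mask = [[None, 90.1],
             [91.0, 92.0]]
ird = [[None, 90.5],
       [90.8, 92.0]]
arrows = count_arrows(fish_mask, ird)
print(arrows)
```

Ties contribute to neither count, so the two totals need not add up to the number of explored cells.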

**5.4.2. Results with BERT on GLUE**

Table 2 shows the detailed performance of IRD and baseline methods on the GLUE benchmark. We conducted 8 experiments based on the BERT-large foundation model. At a scale of 0.5% of the parameters, LoRA achieves the best performance on 6 out of 9 GLUE subtasks. With the BERT-base foundation model and 0.1% of the parameters fine-tuned, IRD achieves the best performance on 8 out of 9 GLUE subtasks. At a smaller parameter scale (0.02%), IRD achieves performance similar to LoRA with only half the number of fine-tuned parameters, and outperforms FISH Mask. The parameter scale of LoRA is determined by its rank and cannot be linearly adjusted to 0.02%.

Fig. 3 shows the detailed results of experiments conducted on the GLUE benchmark with the BERT-base model, excluding SST-2. Across these tasks, IRD shows better results (more upward arrows than downward arrows) than the FISH Mask method in CoLA, RTE, STS-B, and MRPC, while it underperforms in WNLI. In QQP, QNLI, MNLI-m, and MNLI-mm, IRD and FISH Mask are tied. In conclusion, the experimental results show that IRD is better than FISH Mask: there are more upward arrows than downward arrows (30 > 22), which supports the reasonableness and superiority of IRD.

**5.4.3. Results with GPT-2 on GLUE**

To demonstrate the impact of IRD on transformer-based decoder-only models, we run experiments on the GPT-2 pre-trained model. Fig. 4 shows the results of experiments conducted on the GLUE benchmark using the GPT-2 model. Because we set a different max_seq_length for each task in GPT-2, as listed in Section 5.3, we only run the experiments on six tasks from the GLUE benchmark due to constraints in training time and GPU usage. In SST-2, STS-B, WNLI, and MRPC, IRD achieves better performance than FISH Mask. This result demonstrates the generalizability of IRD across different foundation models.

**5.4.4. Results with LLaMA on GLUE**

To further investigate the efficacy of IRD on LLaMA 3.2, we conduct experiments using the LLaMA pre-trained model. Fig. 6 presents the results on the GLUE benchmark. Similar to the GPT-2 experiments, we adjust the max_seq_length parameter for each task. Due to computational constraints, we focus on four tasks from the GLUE benchmark. On the MRPC, CoLA, and STS-B tasks, IRD demonstrates superior performance. This suggests that the benefits of IRD might be more pronounced in larger models like LLaMA, and it further supports the effectiveness and generalizability of the IRD strategy across different foundation model architectures and scales. We used the full parameter search of the LLaMA 1B model, and the experimental results verified the effectiveness of the IRD method. On larger LLMs, we can apply IRD to only several deeper layers of the model to improve efficiency. Many existing studies have also confirmed that the last few layers of the model contain more common sense knowledge information.

5.5. Compared with LoRA

As a PEFT method, LoRA has attracted a huge amount of attention for its low computing cost and satisfactory performance. LoRA and methods based on it have become a widely compared benchmark. In order to verify the effectiveness of the IRD algorithm, we compared it with LoRA on a variety of different parameter scales. The experimental results are shown in Table 2. They show that LoRA achieves better performance when using 0.5% of the parameters of BERT. Meanwhile, IRD shows better performance when using 0.1% of the parameters of the foundation model. Since the parameter scale of LoRA depends on the rank of the additive matrix, we set the rank to 1 and get the smallest parameter scale of LoRA on BERT-base without changing the other settings (number of layers or number of matrices). LoRA only achieved similar results to IRD with twice the parameter scale (0.04% versus 0.02%). This reveals that LoRA's fine-tuning effect is not as good as that of IRD when the parameter scale is small.

Based on the structure of the LoRA model, it has three disadvantages compared to IRD: (1) Limited by the rank attribute of LoRA, it is impossible to finely adjust the parameter scale linearly. (2) The fine-tuning performance at a smaller parameter scale is not as good as that of IRD. (3) Compared with the selective-based PEFT method, LoRA requires additional parameter storage space. At the same time, we also believe that LoRA does have better computing efficiency and lower computational complexity.
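
The rank constraint in point (1) can be made concrete with a little arithmetic. The sketch below is an illustration only, not the paper's experimental setup: it assumes rank-`r` LoRA adapters on two 768×768 projection matrices in each of 12 layers and a backbone of roughly 110M parameters (typical BERT-base figures, used here as assumptions). Each adapted matrix adds `r * (d_in + d_out)` trainable parameters, so the trainable fraction can only move in fixed steps as the rank grows, whereas a mask-based method can select any number of parameters.

```python
def lora_trainable_params(d_in, d_out, rank, n_matrices):
    """Trainable parameters added by LoRA: rank * (d_in + d_out) per adapted matrix."""
    return rank * (d_in + d_out) * n_matrices


# Assumed, illustrative configuration (not taken from the paper).
HIDDEN = 768
N_LAYERS = 12
MATRICES_PER_LAYER = 2
BACKBONE_PARAMS = 110_000_000


def lora_fraction(rank):
    """Fraction of the backbone size that the LoRA adapters represent at a given rank."""
    n = lora_trainable_params(HIDDEN, HIDDEN, rank, N_LAYERS * MATRICES_PER_LAYER)
    return n / BACKBONE_PARAMS


for r in (1, 2, 4, 8):
    n = lora_trainable_params(HIDDEN, HIDDEN, r, N_LAYERS * MATRICES_PER_LAYER)
    print(f"rank={r}: {n:,} trainable params ({100 * lora_fraction(r):.4f}% of backbone)")
```

Under these assumptions rank 1 already corresponds to about 0.03% of the backbone, so a smaller budget can only be reached by changing which layers or matrices are adapted, not by lowering the rank.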

5.6. Analysis

In this work, we design different types of experiments to evaluate the IRD algorithm. The chosen GLUE benchmark can fully verify the generalization of the model and facilitates comparison with other methods. A thorough comparison with the mainstream selective-based PEFT method, without changing the dataset, is sufficient to demonstrate the effectiveness of our method. Besides, comparing with FISH-Mask under different foundation models further demonstrates our method’s efficacy. A contrastive study shows our method has improved on more corresponding squares and also achieved optimal values on more tasks. These results demonstrate IRD is an effective optimization algorithm because the reverse settings get worse results.

It is worth noting that when the optimal result (red border square) appears in the lower right corner of the 4 by 4 matrix, it means that the model has achieved the best result on the initial sample-parameter size. In this case, the FISH-Mask method and the IRD algorithm draw. This is because the optimal sample-parameter pair appears outside the initial range, and the IRD algorithm cannot achieve better results by continuing to decrease the sample-parameter range.
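
The draw condition described above can be checked mechanically. The following is a minimal sketch with made-up scores: `scores[i][j]` is a hypothetical validation metric, and the last row and last column are assumed to correspond to the initial (largest) sample and parameter ranges, so the lower-right cell is the starting point. The helper names are illustrative and do not come from the paper.

```python
def best_cell(scores):
    """Return (row, col) of the highest score in a grid (first match wins on ties)."""
    best = (0, 0)
    for i, row in enumerate(scores):
        for j, value in enumerate(row):
            if value > scores[best[0]][best[1]]:
                best = (i, j)
    return best


def is_draw(scores):
    """True when the optimum sits in the lower-right corner, i.e. at the initial range."""
    n = len(scores)
    return best_cell(scores) == (n - 1, n - 1)


# Hypothetical 4x4 grids; the values are invented for illustration.
improved = [
    [0.81, 0.83, 0.82, 0.80],
    [0.82, 0.86, 0.84, 0.81],
    [0.80, 0.83, 0.82, 0.80],
    [0.79, 0.81, 0.80, 0.79],
]
draw = [
    [0.78, 0.79, 0.80, 0.80],
    [0.79, 0.80, 0.81, 0.82],
    [0.79, 0.81, 0.82, 0.83],
    [0.80, 0.82, 0.83, 0.85],
]

print(best_cell(improved), is_draw(improved))
print(best_cell(draw), is_draw(draw))
```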

**6. Conclusion**

In this paper, we adopt a data-oriented perspective to optimize PEFT methods before training. Based on this methodology, we propose the IRD algorithm to optimize the FISH Mask based method. Besides, we designed and conducted experiments to verify the effectiveness of the proposed algorithm, and the experimental results also verified our methodology. We hope our efforts can inspire research in related fields, directing more attention towards data-driven PEFT methods. In future work, we will explore a set of general data-oriented PEFT optimization algorithms instead of just optimizing a certain model. In addition, we will explore the relationship between data and parameters in PEFT methods more deeply.

In [None]:
qa_pairs = [
  {
    "question": "What limitation of the FISH Mask method does the IRD algorithm aim to address in the context of PEFT?",
    "answer": "The IRD algorithm addresses the limitation of random sample selection in FISH Mask, which fails to account for complex, non-uniform data distributions. IRD refines sample and parameter selection by iteratively identifying subsets with higher Fisher information, leading to more effective fine-tuning."
  },
  {
    "question": "How does the Iterative Range Decreasing (IRD) algorithm work to optimize sample-parameter pair selection?",
    "answer": "IRD starts with the full sample-parameter space and iteratively reduces the range by focusing on subsets with high Fisher information, effectively ascending from the lower-right to the upper-left of a matrix that maps sample and parameter sparsity to performance. This strategy improves parameter selection before training."
  },
  {
    "question": "How does IRD perform relative to LoRA when the fine-tuned parameter scale is small?",
    "answer": "IRD outperforms LoRA at smaller parameter scales. For instance, at 0.02% parameter tuning, IRD achieves comparable or better results than LoRA, which requires at least 0.04% to match its performance. Moreover, IRD allows finer control over parameter scale than LoRA, which is constrained by rank-based configurations."
  },
  {
    "question": "What role does the Fisher Information Matrix (FIM) play in IRD and FISH Mask methods?",
    "answer": "Both methods use the FIM to estimate parameter importance based on training data. While FISH Mask calculates FIM using randomly selected samples, IRD improves this by iteratively selecting sample sets with higher Fisher information to guide more optimal parameter selection for fine-tuning."
  },
  {
    "question": "What experimental evidence supports IRD's generalizability across foundation model architectures?",
    "answer": "Experiments on BERT, GPT-2, and LLaMA demonstrate that IRD outperforms or matches FISH Mask across multiple GLUE tasks, highlighting its effectiveness across both encoder-only and decoder-only transformer models, and across a range of parameter scales."
  },
  {
    "question": "How does IRD perform in GLUE benchmark tasks compared to FISH Mask under the BERT-base model?",
    "answer": "IRD achieves better performance than FISH Mask in tasks such as CoLA, RTE, STS-B, and MRPC, draws in QQP, QNLI, MNLI-m, and MNLI-mm, and only underperforms in WNLI. Overall, IRD achieves more improvements (30 upward arrows) than regressions (22 downward arrows), validating its superiority."
  },
  {
    "question": "In what experimental scenario do FISH Mask and IRD produce similar results, and why?",
    "answer": "FISH Mask and IRD perform similarly when the optimal sample-parameter pair lies at the initial selection range (i.e., the lower-right corner of the sample-parameter matrix). In such cases, IRD’s iterative reduction does not yield further gains, resulting in a draw between the methods."
  },
  {
    "question": "Why is LoRA less flexible than IRD in adjusting parameter scale for fine-tuning?",
    "answer": "LoRA’s parameter scale is tied to the rank of its additive matrix, which limits granularity in scaling. In contrast, IRD can precisely adjust the percentage of fine-tuned parameters, enabling superior control in resource-constrained settings."
  },
  {
    "question": "What are the primary advantages of IRD over existing PEFT methods like LoRA and FISH Mask?",
    "answer": "IRD provides better fine-tuning performance at smaller parameter scales, greater flexibility in parameter scaling, and improved sample selection through data-centric optimization. These qualities make it particularly suitable for scenarios with limited computational resources."
  },
  {
    "question": "Why is data selection an important but often overlooked component in parameter-efficient fine-tuning?",
    "answer": "Most PEFT methods focus on architectural or parameter selection without explicitly considering how data quality or distribution affects parameter importance. Data selection is crucial because the informativeness of training samples can significantly influence which parameters should be fine-tuned."
  },
  {
    "question": "What broader impact could data-driven PEFT methods like IRD have on the field of efficient LLM training?",
    "answer": "Data-driven PEFT methods could shift the paradigm from architecture-centric to data-centric optimization, enabling more adaptive, robust, and efficient fine-tuning pipelines. This has implications for low-resource model customization and broader accessibility of LLM technologies."
  }
]

In [None]:
title = "Targeted Efficient Fine-tuning: Optimizing Parameter Updates with Data-Driven Sample Selection"
generate_QA_pair("21", 2024, "targeted_efficient_fine_tuning", title, qa_pairs)
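
A quick structural check on `qa_pairs` can catch a missing key or an empty answer before a file is written. The helper below is a hypothetical addition (`validate_qa_pairs` is not part of the original notebook); it assumes each entry should be a dict with non-empty string values under `question` and `answer`, matching the structure used in the cells above.

```python
def validate_qa_pairs(qa_pairs):
    """Return a list of problems found; an empty list means the structure looks fine."""
    problems = []
    for i, pair in enumerate(qa_pairs):
        if not isinstance(pair, dict):
            problems.append(f"entry {i}: not a dict")
            continue
        for key in ("question", "answer"):
            value = pair.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {i}: missing or empty '{key}'")
    return problems


# Small made-up example: the second entry has an empty answer.
example = [
    {"question": "What does IRD optimize?", "answer": "Sample-parameter pair selection."},
    {"question": "Incomplete entry", "answer": ""},
]
print(validate_qa_pairs(example))
```

Calling `validate_qa_pairs(qa_pairs)` just before `generate_QA_pair(...)` and only saving when the result is empty is one way to use it.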

### **22: The Flan Collection: Designing Data and Methods for Effective Instruction Tuning**

**Abstract**

We study the design decisions of publicly available instruction tuning methods, and break down the development of Flan 2022 models. Through careful ablation studies on the Flan Collection of instruction tuning tasks and methods, we tease apart the effect of design decisions that enable Flan-T5 to outperform prior work by 3-17%+ across evaluation settings. We find task balancing and enrichment techniques are overlooked but critical to effective instruction tuning, and in particular, training with mixed prompt settings (zero-shot, few-shot, and chain-of-thought) actually yields stronger (2%+) performance in all settings. In further experiments, we show Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks—motivating instruction-tuned models as more computationally-efficient starting checkpoints for new tasks. Finally, to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available.

**Introduction**

Large language models such as PaLM, Chinchilla, and ChatGPT among others have unlocked new capabilities in performing natural language processing (NLP) tasks from reading instructive prompts. Prior art has shown that instruction tuning—finetuning language models on a collection of NLP tasks formatted with instructions—further enhances the ability of language models to perform an unseen task from an instruction.

In this work, we evaluate the methods and results of open sourced instruction generalization efforts, comparing their finetuning techniques and methods. And in particular, we identify and evaluate the critical methodological improvements in the “Flan 2022 Collection”, which is the term we use for the collection of data and methods for data augmentation and instruction tuning, first implemented and used in Chung et al. (2022). Where Chung et al. (2022) focuses on the emergent and state-of-the-art results of combining Flan 2022 with PaLM 540B, this work focuses in on the details of the instruction tuning methods themselves, ablating individual factors, and comparing them directly to prior work by keeping the pretrained model size and checkpoint consistent.

The Flan 2022 Collection offers the most extensive publicly available set of tasks and methods for instruction tuning, which we have compiled in one place. We have also supplemented this with hundreds more of our own high-quality templates, richer formatting patterns, and data augmentations. We show that a model trained on this collection outperforms other public collections on all tested evaluation benchmarks, including the original Flan 2021, T0++, Super-Natural Instructions, and the concurrent work on OPT-IML. As shown in Figure 1, this includes 4.2%+ and 8.5% improvements on the MMLU and BIG-Bench Hard evaluation benchmarks respectively, for equally sized models.

Analysis of the Flan 2022 method suggests the strong results stem both from the larger and more diverse set of tasks and from a set of simple finetuning and data augmentation techniques. In particular, training on a mix of examples templatized with zero-shot, few-shot, and chain-of-thought prompts improves performance in every one of these settings, together. For instance, adding just 10% few-shot prompts improves zero-shot prompting results by 2%+. Additionally, enriching task diversity by inverting input-output pairs, as used in prior work, along with balancing task sources, are both shown to be critical to performance. The resulting Flan-T5 model converges faster and at a higher performance than T5 models in single-task finetuning, suggesting instruction-tuned models offer a more computationally-efficient starting checkpoint for downstream applications, corroborating Aribandi et al. (2021) and Liu et al. (2022b).
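
A minimal sketch of the mixed-prompt idea, assuming a simple `{"input", "target"}` example format and hypothetical `Q:`/`A:` template strings; the actual Flan templates are far richer. Each training example is rendered as few-shot with probability `few_shot_fraction` (the default of 0.1 mirrors the 10% figure above) and as zero-shot otherwise. Chain-of-thought rendering is left out to keep the sketch short.

```python
import random


def render_zero_shot(example):
    """Render a single example as a bare question prompt."""
    return f"Q: {example['input']}\nA:"


def render_few_shot(example, exemplars):
    """Prefix the question with solved exemplars drawn from other examples."""
    shots = "\n\n".join(f"Q: {e['input']}\nA: {e['target']}" for e in exemplars)
    return f"{shots}\n\nQ: {example['input']}\nA:"


def build_mixture(examples, few_shot_fraction=0.1, n_shots=2, seed=0):
    """Render each example as few-shot with the given probability, else zero-shot."""
    rng = random.Random(seed)
    mixture = []
    for idx, ex in enumerate(examples):
        if rng.random() < few_shot_fraction:
            pool = examples[:idx] + examples[idx + 1:]  # never use the example itself
            exemplars = rng.sample(pool, k=min(n_shots, len(pool)))
            prompt, kind = render_few_shot(ex, exemplars), "few_shot"
        else:
            prompt, kind = render_zero_shot(ex), "zero_shot"
        mixture.append({"prompt": prompt, "target": ex["target"], "kind": kind})
    return mixture


# Toy data, invented for illustration.
examples = [{"input": f"{i} + 1 = ?", "target": str(i + 1)} for i in range(1000)]
mixture = build_mixture(examples)
share = sum(m["kind"] == "few_shot" for m in mixture) / len(mixture)
print(f"few-shot share: {share:.2%}")
```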

We hope making these findings and resources publicly available will unify resources around instruction tuning and accelerate research into more general-purpose language models. We summarize this work’s core contributions as follows:

• Methodological: Show that training with mixed zero- and few-shot prompts yields much better performance in both settings (Section 3.2).

• Methodological: Measure and demonstrate the critical techniques to effective instruction tuning: scaling (Section 3.3), enriching task variety with input inversion (Section 3.4), adding chain-of-thought training data, and balancing different data sources (Section 3.5).

• Results: Demonstrate these technical choices yield 3-17% Held-Out task improvements over existing open source instruction tuning collections (Figure 1).

• Results: Demonstrate Flan-T5 serves as a stronger and more computationally-efficient starting checkpoint for single-task finetuning (Section 4).

• Open source the new Flan 2022 task collection, templates, and methods for public research.
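
Input inversion, listed among the techniques above, can be sketched as follows. This assumes a plain `{"input", "target"}` pair and a hypothetical instruction string; the only point illustrated is that each original pair yields an additional training example in which the model is asked to produce the original input from the original output.

```python
def invert_example(example, instruction="Write a question whose answer is:"):
    """Create the inverted task: given the original target, produce the original input."""
    return {
        "input": f"{instruction} {example['target']}",
        "target": example["input"],
    }


def enrich_with_inversion(examples):
    """Return the original examples followed by their inverted counterparts."""
    return list(examples) + [invert_example(ex) for ex in examples]


# Made-up example pair.
original = [{"input": "What is the capital of France?", "target": "Paris"}]
enriched = enrich_with_inversion(original)
for ex in enriched:
    print(ex)
```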

**Instruction Tuning Enhances Single-Task Finetuning**

In applied settings, machine learning practitioners deploy NLP models finetuned (FT) specifically for a single target task, usually where finetuning data is already available. While prior work has shown the benefits of intermediate finetuning or multi-task finetuning for downstream tasks, this has not been studied extensively for instruction-tuned models.

We evaluate Flan 2022 instruction tuning as an intermediary step before single target finetuning, to understand if Flan-T5 would serve as a better starting checkpoint for applied practitioners. We evaluate three settings in Figure 5: finetuning T5 directly on the target task as the conventional baseline (blue bars), using Flan-T5 without further finetuning (beige bars), and finetuning Flan-T5 further on the target task (red bars).

**Pareto Improvements to Single Task Finetuning.** For both sets of Held-In and Held-Out tasks examined, finetuning Flan-T5 offers a Pareto improvement over finetuning T5 directly. In some instances, usually where finetuning data is limited for a task, Flan-T5 without further finetuning outperforms T5 with task finetuning.

**Faster Convergence & Computational Benefits.** Using Flan-T5 as a starting checkpoint has an added benefit in training efficiency. As demonstrated in Figure 6, Flan-T5 converges much more quickly than T5 during single target finetuning, as well as peaking at higher accuracies. These convergence results also suggest there are strong green-AI incentives for the NLP community to adopt instruction-tuned models, like Flan-T5, for single-task finetuning, rather than conventional non-instruction-tuned models. While instruction tuning is more computationally-expensive than single-task finetuning, it is a one-time cost. In contrast, pretrained models that require extensive finetuning become more costly when aggregating over many millions of additional training steps. Instruction-tuned models offer a promising solution to significantly reduce the number of finetuning steps across a wide swathe of tasks, if they are adopted as a new standard starting point for single-task finetuning.
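
The one-time-cost argument can be written as a break-even calculation. All numbers below are hypothetical placeholders, not measurements from the paper: `c_instruct` is the one-time instruction-tuning cost in training steps, and `s_base` and `s_flan` are the per-task finetuning steps needed when starting from the plain and the instruction-tuned checkpoint, respectively.

```python
def total_steps_base(n_tasks, s_base):
    """Total finetuning steps when every task starts from the plain checkpoint."""
    return n_tasks * s_base


def total_steps_instruct(n_tasks, s_flan, c_instruct):
    """One-time instruction tuning plus shorter per-task finetuning."""
    return c_instruct + n_tasks * s_flan


def break_even_tasks(s_base, s_flan, c_instruct):
    """Smallest number of tasks at which the instruction-tuned route is strictly cheaper."""
    if s_flan >= s_base:
        return None  # no per-task saving, so the one-time cost is never recovered
    return c_instruct // (s_base - s_flan) + 1


# Hypothetical integer step counts, chosen only to make the arithmetic easy to follow.
S_BASE, S_FLAN, C_INSTRUCT = 50_000, 10_000, 2_000_000

n = break_even_tasks(S_BASE, S_FLAN, C_INSTRUCT)
print(n, total_steps_base(n, S_BASE), total_steps_instruct(n, S_FLAN, C_INSTRUCT))
```

With these placeholder values the two routes cost the same at 50 tasks, and the instruction-tuned checkpoint becomes cheaper from the 51st task onward.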


In [None]:
qa_pairs = [
  {
    "question": "What are the key design choices in Flan 2022 that contribute to its superior instruction tuning performance compared to prior work?",
    "answer": "Flan 2022 introduces several critical design choices: mixing zero-shot, few-shot, and chain-of-thought prompt formats; balancing task sources; and enriching task diversity through input-output inversion. These collectively improve performance by 3–17% over previous instruction-tuning methods."
  },
  {
    "question": "How does training with a mixture of zero-shot, few-shot, and chain-of-thought prompts affect model performance in Flan 2022?",
    "answer": "Training with a mix of prompt types results in better performance across all prompt settings. Notably, including just 10% few-shot prompts improves zero-shot performance by over 2%, demonstrating strong cross-format generalization benefits."
  },
  {
    "question": "What role does input-output inversion play in Flan 2022's instruction tuning framework?",
    "answer": "Input-output inversion enriches task diversity by encouraging the model to generalize across unconventional task formulations. This augmentation strategy strengthens the model's robustness and contributes significantly to performance gains across evaluation benchmarks."
  },
  {
    "question": "In what ways does Flan-T5 serve as a more computationally-efficient starting checkpoint for downstream tasks compared to vanilla T5?",
    "answer": "Flan-T5 requires fewer training steps to converge, reaches higher performance peaks, and outperforms vanilla T5 especially on low-data tasks. Despite the initial cost of instruction tuning, Flan-T5 reduces the need for extensive task-specific finetuning, offering long-term computational savings."
  },
  {
    "question": "What evidence is provided in the paper to support the claim that Flan-T5 improves convergence during single-task finetuning?",
    "answer": "Figure 6 in the paper demonstrates that Flan-T5 converges faster and reaches higher accuracy than vanilla T5 when finetuned on single tasks. This indicates not only performance benefits but also reduced training cost and time."
  },
  {
    "question": "How does the Flan 2022 collection compare against previous instruction-tuning datasets like T0++, Super-Natural Instructions, and OPT-IML?",
    "answer": "Flan 2022 outperforms these previous collections across multiple benchmarks. For instance, it achieves over 4.2% improvement on MMLU and 8.5% on BIG-Bench Hard, even when using models of equivalent size, due to its richer task set and superior tuning methodology."
  },
  {
    "question": "Why does the paper argue that instruction-tuned models like Flan-T5 are more 'green AI' friendly?",
    "answer": "Instruction tuning incurs a one-time computational cost, but significantly reduces the finetuning burden across many downstream tasks. This efficiency makes models like Flan-T5 more environmentally sustainable when aggregated over many applications."
  },
  {
    "question": "What are the main methodological contributions of the Flan 2022 work as highlighted by the authors?",
    "answer": "The paper introduces mixed-prompt training, input-output inversion, task balancing, and scaling strategies as critical components of effective instruction tuning, backed by extensive ablation studies and benchmark evaluations."
  },
  {
    "question": "Why is task balancing considered crucial in the Flan 2022 instruction tuning pipeline?",
    "answer": "Task balancing ensures that no single data source dominates the training signal, allowing the model to generalize better across diverse tasks. This balance contributes to the robust performance improvements observed across both seen and unseen evaluation sets."
  },
  {
    "question": "What broader research impact does the Flan 2022 collection aim to have on the NLP community?",
    "answer": "By releasing its datasets, templates, and instruction-tuning methods, the Flan 2022 collection seeks to standardize instruction-tuning benchmarks and accelerate progress toward more general-purpose, instruction-following language models in both academia and industry."
  }
]

In [None]:
title = "The Flan Collection: Designing Data and Methods for Effective Instruction Tuning"
generate_QA_pair("22", 2023, "flan_collection", title, qa_pairs)
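
File names produced by `generate_QA_pair` follow the pattern `<paper_number>_<publication_year>_<short_id>.json`. The check below is a hypothetical helper (not part of the original notebook) that assumes a numeric paper number, a four-digit year, and a lowercase snake_case short ID, which matches the calls shown above. It could be used to spot misnamed files when listing the `qa_pairs` folder.

```python
import re

# Assumed naming convention: digits, four-digit year, lowercase snake_case ID.
FILENAME_PATTERN = re.compile(r"^(\d+)_(\d{4})_([a-z0-9]+(?:_[a-z0-9]+)*)\.json$")


def parse_qa_filename(filename):
    """Return (paper_number, year, short_id), or None if the name does not match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    number, year, short_id = match.groups()
    return number, int(year), short_id


print(parse_qa_filename("22_2023_flan_collection.json"))
print(parse_qa_filename("22-2023-flan.json"))
```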