## Step 1: Mounting Google Drive

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Navigate to the repo folder
%cd /content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer

# List repo contents
!ls

Mounted at /content/drive
/content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer
data  deployment  LICENSE  notebooks  project_plan.md  README.md  scripts


## Step 2: Installing PyMuPDF

In [2]:
!pip install -q pymupdf


## Step 3: Importing Libraries

In [3]:
import os
import fitz  # PyMuPDF
import json
from IPython.display import display, Markdown

## Step 4: Setting Paths

In [4]:
BASE_DIR = "/content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer"
PDF_DIR = os.path.join(BASE_DIR, "data", "QA_corpus")
QA_DIR = os.path.join(BASE_DIR, "qa_pairs")

os.makedirs(QA_DIR, exist_ok=True)
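
If `PDF_DIR` is missing (for example because Drive is not mounted or the folder was renamed), the failure only appears later, when the PDFs are listed. A minimal sketch of an early check, assuming a hypothetical helper named `prepare_dirs` that is not part of the original notebook and is demonstrated here against a temporary directory rather than the real Drive path:

```python
import os
import tempfile

def prepare_dirs(base_dir):
    """Hypothetical helper: build the corpus/output paths and validate them."""
    pdf_dir = os.path.join(base_dir, "data", "QA_corpus")
    qa_dir = os.path.join(base_dir, "qa_pairs")
    # Fail early with a readable message if the corpus folder is absent.
    if not os.path.isdir(pdf_dir):
        raise FileNotFoundError(f"PDF folder not found: {pdf_dir}")
    # Create the output folder if it does not exist yet.
    os.makedirs(qa_dir, exist_ok=True)
    return pdf_dir, qa_dir

# Demonstration against a throwaway directory tree (not the real Drive path)
with tempfile.TemporaryDirectory() as tmp_base:
    os.makedirs(os.path.join(tmp_base, "data", "QA_corpus"))
    demo_pdf_dir, demo_qa_dir = prepare_dirs(tmp_base)
    qa_dir_created = os.path.isdir(demo_qa_dir)
```

In the notebook itself, the same function could be called with `BASE_DIR` in place of the temporary directory.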

## Step 5: Defining Utility Functions

This section defines three helper functions to extract text from PDFs, display text cleanly, and save question-answer pairs.

**extract_text_from_pdf**: This function takes the file path of a PDF and extracts text from its first `max_pages` pages (4 by default). It opens the PDF with the `fitz` module (PyMuPDF), iterates over those pages, and concatenates the text of each page into a single string, which it returns.

**display_text_as_markdown**: This function takes a text string and displays it in a user-friendly format using Markdown within the Jupyter Notebook. It removes leading/trailing whitespace and wraps the text in triple backticks to create a code block, enhancing readability.

**save_qa_pairs**: This function takes a `filename` and `qa_pairs` (a list or dictionary) and saves the question-answer pairs to a JSON file. It opens the file in write mode, uses `json.dump` to write the data in JSON format with indentation, and prints a confirmation message.





In [5]:
def extract_text_from_pdf(filepath, max_pages=4):
    """Extracts text from the first few pages of the PDF."""
    text = ""
    with fitz.open(filepath) as doc:
        for page in doc[:max_pages]:
            text += page.get_text()
    return text

def display_text_as_markdown(text):
    """Display large blocks of text cleanly."""
    display(Markdown("```\n" + text.strip() + "\n```"))

def save_qa_pairs(filename, qa_pairs):
    """Save QA pairs to a JSON file."""
    with open(filename, 'w') as f:
        json.dump(qa_pairs, f, indent=2)
    print(f"Saved {len(qa_pairs)} pairs to {filename}")
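
Saved pairs usually need to be read back in a later step, and extracted text often contains non-ASCII symbols. The following is a minimal sketch of a round trip under those assumptions. `save_qa_pairs_utf8` and `load_qa_pairs` are hypothetical names, and the `question`/`answer` keys in the sample are an assumed schema, not one defined by the notebook:

```python
import json
import os
import tempfile

def save_qa_pairs_utf8(filename, qa_pairs):
    """Hypothetical variant of save_qa_pairs that pins the file encoding."""
    with open(filename, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps symbols such as ∆ readable in the file.
        json.dump(qa_pairs, f, indent=2, ensure_ascii=False)

def load_qa_pairs(filename):
    """Hypothetical counterpart: read QA pairs back from a JSON file."""
    with open(filename, "r", encoding="utf-8") as f:
        return json.load(f)

# Round trip with made-up sample data (assumed schema)
sample_pairs = [
    {"question": "What does ∆ denote?", "answer": "The incremental update."}
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_qa.json")
    save_qa_pairs_utf8(sample_path, sample_pairs)
    reloaded_pairs = load_qa_pairs(sample_path)
```

Pinning `encoding="utf-8"` avoids depending on the platform default encoding when the same file is opened outside Colab.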

## Step 6: Listing PDFs in QA Corpus Folder

This cell lists the PDF files in `PDF_DIR`, prints each one with its index, selects one through the `paper_index` variable, and derives `pdf_path` and `pdf_name` for the following steps.

In [6]:
# List PDFs in QA corpus folder
pdf_files = sorted([f for f in os.listdir(PDF_DIR) if f.endswith(".pdf")])

print("Available Papers:")
for i, file in enumerate(pdf_files):
    print(f"[{i}] {file}")

# Select your paper here
paper_index = 0  # Change this index to select a different paper
pdf_path = os.path.join(PDF_DIR, pdf_files[paper_index])
pdf_name = os.path.splitext(pdf_files[paper_index])[0]

print(f"\n Selected: {pdf_files[paper_index]}")

Available Papers:
[0] ADALORA:_ADAPTIVE_BUDGET_ALLOCATION_FOR_PARAMETER-EFFICIENT_FINE-TUNING.pdf
[1] AutoLoRA:_Automatically_Tuning_Matrix_Ranks_in_Low-Rank_Adaptation_Based_on_Meta_Learning.pdf
[2] Balancing_Continuous_Pre-Training_and_Instruction_Fine-Tuning:
__Optimizing_Instruction-Following_in.pdf
[3] CURLoRA:_Stable_LLM_Continual_Fine-Tuning_and_Catastrophic_Forgetting
__Mitigation.pdf
[4] DELIFT:_Data_Efficient_Language_model_Instruction_Fine_Tuning.pdf
[5] FINETUNED_LANGUAGE_MODELS_ARE_ZERO-SHOT_LEARNERS.pdf
[6] Few-Shot_Parameter-Efficient_Fine-Tuning_is_Better_and_Cheaper_than_In-Context_Learning.pdf
[7] Instruction_Tuning_for_Large_Language_Models:_A_Survey.pdf
[8] LLAMA-ADAPTER:_EFFICIENT_FINE-TUNING_OF_LARGE_LANGUAGE_MODELS_WITH_ZERO-INITIALIZED_ATTENTION.pdf
[9] LLM-Adapters:_An_Adapter_Family_for_Parameter-Efficient_Fine-Tuning_of_Large_Language_Models.pdf
[10] LORA:_LOW-RANK_ADAPTATION_OF_LARGE_LANGUAGE_MODELS.pdf
[11] LoRA_vs_Full_Fine-tuning:_An_Illusion_of_Equivalen
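
Because the paper is chosen by editing `paper_index` by hand, an out-of-range value produces a bare `IndexError`, and a negative value silently selects from the end of the list. A minimal sketch of a stricter selection step, assuming a hypothetical helper `select_pdf` that is not part of the original notebook:

```python
def select_pdf(pdf_files, paper_index):
    """Hypothetical helper: return the chosen filename or raise a readable error."""
    if not pdf_files:
        raise FileNotFoundError("No PDF files were found in the corpus folder.")
    # Reject negative indices too, so a typo cannot pick a paper from the end.
    if not 0 <= paper_index < len(pdf_files):
        raise IndexError(
            f"paper_index must be between 0 and {len(pdf_files) - 1}, got {paper_index}."
        )
    return pdf_files[paper_index]

# Made-up file names, used only for demonstration
example_files = ["a.pdf", "b.pdf"]
chosen = select_pdf(example_files, 1)
```

In the notebook, `select_pdf(pdf_files, paper_index)` could replace the direct `pdf_files[paper_index]` lookups.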

## Step 7: Displaying Parsed Text

In [7]:
parsed_text = extract_text_from_pdf(pdf_path, max_pages=4)
display_text_as_markdown(parsed_text)

```
Published as a conference paper at ICLR 2023
ADALORA: ADAPTIVE BUDGET ALLOCATION FOR
PARAMETER-EFFICIENT FINE-TUNING
Qingru Zhang†∗, Minshuo Chen‡, Alexander Bukharin†, Nikos Karampatziakis⋄,
Pengcheng He⋄, Yu Cheng⋄, Weizhu Chen⋄and Tuo Zhao†
†Georgia Institute of Technology
‡Princeton University
⋄Microsoft Azure AI
{qingru.zhang,abukharin3,tourzhao}@gatech.edu
mc0750@princeton.edu
{nikosk,penhe,yu.cheng,wzchen}@microsoft.com
ABSTRACT
Fine-tuning large pre-trained language models on downstream tasks has become
an important paradigm in NLP. However, common practice fine-tunes all of the
parameters in a pre-trained model, which becomes prohibitive when a large number
of downstream tasks are present. Therefore, many fine-tuning methods are proposed
to learn incremental updates of pre-trained weights in a parameter efficient way,
e.g., low-rank increments. These methods often evenly distribute the budget
of incremental updates across all pre-trained weight matrices, and overlook the
varying importance of different weight parameters. As a consequence, the fine-
tuning performance is suboptimal. To bridge this gap, we propose AdaLoRA,
which adaptively allocates the parameter budget among weight matrices according
to their importance score. In particular, AdaLoRA parameterizes the incremental
updates in the form of singular value decomposition. Such a novel approach
allows us to effectively prune the singular values of unimportant updates, which
is essentially to reduce their parameter budget but circumvent intensive exact
SVD computations. We conduct extensive experiments with several pre-trained
models on natural language processing, question answering, and natural language
generation to validate the effectiveness of AdaLoRA. Results demonstrate that
AdaLoRA manifests notable improvement over baselines, especially in the low
budget settings. Our code is publicly available at https://github.com/
QingruZhang/AdaLoRA.
1 INTRODUCTION
Pre-trained language models (PLMs) have manifested superior performance in various natural
language processing tasks (Devlin et al., 2019; Liu et al., 2019; He et al., 2021b; Radford et al.,
2019; Brown et al., 2020). The most common way to adapt pre-trained models to down-stream
tasks is to fine-tune all the parameters (full fine-tuning, Qiu et al. (2020); Raffel et al. (2020)).
However, pre-trained models typically incurs large memory footprint. For example, BERT model
(Devlin et al., 2019) consists up to 300 million parameters; T5 (Raffel et al., 2020) comprises up
to 11 billion parameters and GPT-3 (Brown et al., 2020) contains up to 175 billion parameters.
When building a NLP system upon these pre-trained models, we usually handle multiple tasks
that arrive simultaneously (Radford et al., 2019). Given a large number of down-stream tasks, full
fine-tuning requires that each task maintains a separated copy of large models. The resulting memory
consumption is prohibitively expensive.
To address this issue, researchers have proposed two main lines of research to reduce the fine-tuning
parameters, while maintaining or even improving the performance of PLMs. Specifically, one line
of research focuses on adding small neural modules to PLMs and fine-tune only these modules for
each task – the base model is kept frozen and shared across tasks. In this way, only a small number
of task-specific parameters are introduced and updated, greatly enhancing the practicality of large
models. For example, adapter tuning (Houlsby et al., 2019; Rebuffi et al., 2017; Pfeiffer et al., 2020;
∗Work was done during Qingru Zhang’s internship at Microsoft Azure AI.
arXiv:2303.10512v2  [cs.CL]  20 Dec 2023
[Bar charts, y-axis: MNLI Matched Acc. (a) Selected weight matrix: Wq, Wk, Wv, Wo, Wf1, Wf2 with values 88.58, 88.98, 89.36, 89.28, 89.91, 89.99. (b) Selected layers: 1,2,3; 4,5,6; 7,8,9; 10,11,12 with values 77.87, 85.82, 88.15, 88.6.]
Figure 1: Given the total trainable parameters as 0.28M, we apply LoRA only to selected weight matrices (left)
or selected layers (right) of DeBERTaV3-base and compare the fine-tuning performance on MNLI-m. Figure 1a:
we only fine-tune a selected type of weight matrix of every transformer layer, including query/key/value
projection (Wq, Wk, Wv), output projection (Wo) in the self-attention, and two weight matrices (Wf1, Wf2) in
two-layer FFNs. In Figure 1b, we apply LoRA to every weight matrix of the selected layers.
He et al., 2022) inserts small neural modules called adapters between the layers of the base model.
Prefix tuning (Li & Liang, 2021) and prompt tuning (Lester et al., 2021) attach additional trainable
prefix tokens to the input or hidden layers of the base model. These methods have shown to achieve
comparable performance to full fine-tuning, while only updating less than 1% of the original model
parameters, significantly releasing the memory consumption.
Another line of research proposes to model the incremental update of the pre-trained weights in a
parameter-efficient way, without modifying the model architecture (Zaken et al., 2021; Guo et al.,
2020; Hu et al., 2022). Given a pre-trained weight matrix $W^{(0)}$ (see footnote 1), for example, diff pruning (Guo et al.,
2020) models its incremental update $\Delta$ as a sparse matrix. Diff pruning initializes $\Delta$ as the same
dimension as $W^{(0)}$ and then prunes $\Delta$ element-wise based on the magnitude of the entries. As such,
diff pruning can increase the parameter efficiency substantially by adaptively retaining important
updates and pruning unimportant ones. Nonetheless, diff pruning has several limitations. First, it
relies on low-level implementation to speed up the computation of unstructured sparse matrices,
which is not well supported by existing deep learning frameworks. Therefore, we have to store $\Delta$ as
a dense matrix during training. Second, it needs to update every entry of $\Delta$ with their gradients and
then prune them. This results in similar computational cost as full fine-tuning (Guo et al., 2020).
To overcome these drawbacks, Hu et al. (2022) propose a method named LoRA, which parameterizes
$\Delta$ as a low-rank matrix by the product of two much smaller matrices:
$$W = W^{(0)} + \Delta = W^{(0)} + BA, \tag{1}$$
where $W^{(0)}, \Delta \in \mathbb{R}^{d_1 \times d_2}$, $A \in \mathbb{R}^{r \times d_2}$ and $B \in \mathbb{R}^{d_1 \times r}$ with $r \ll \{d_1, d_2\}$. During fine-tuning, only
A and B are updated. The rank r is chosen to be much smaller than the dimension of W (e.g., $r = 8$
when $d_1 = d_2 = 1024$). With less than 0.5% additional trainable parameters, the training overhead
can be reduced up to 70%, compared to full fine-tuning. However, LoRA achieves comparable or
even better performance than full fine-tuning (Hu et al., 2022). Meanwhile, the product of two small
matrices is more friendly to implement and deploy than unstructured sparse matrices in diff pruning.
LoRA still has limitations as it prespecifies the rank r of each incremental matrix $\Delta$ identical. This
ignores the fact that the importance of weight matrices varies significantly across modules and layers
when fine-tuning pre-trained models. To illustrate this point, we present a concrete example in
Figure 1. We compare the performance of LoRA when fine-tuning specific modules or layers with
the same number of trainable parameters. Figure 1a shows that fine-tuning feed-forward networks
(FFN) achieves better performance than self-attention modules. In addition, Figure 1b demonstrates
that weight matrices in top layers are more important than those in bottom layers.
Adding more trainable parameters to the critical weight matrices can lead to better model performance.
In contrast, adding more parameters to those less important weight matrices yields very marginal
gains or even hurt model performance. Given the parameter budget, i.e., the number of total trainable
parameters, we always prefer to allocate more parameters to those important modules. Distributing
the budget evenly to all weight matrices/layers, like LoRA and other methods (e.g., adapter and prefix
tuning), often gives suboptimal performance. To this end, a natural question is:
How can we allocate the parameter budget adaptively according to importance
of modules to improve the performance of parameter-efficient fine-tuning?
Footnote 1: Unless specified otherwise, we use $W^{(0)}$ to denote any pre-trained weight matrix.
To answer this question, we propose a new method – AdaLoRA (Adaptive Low-Rank Adaptation),
which dynamically allocates the parameter budget among weight matrices during LoRA-alike fine-
tuning. Specifically, AdaLoRA adjusts the rank of incremental matrices to control their budget.
Critical incremental matrices are assigned with high rank such that they can capture more fine-grained
and task-specific information. Less importance ones are pruned to have lower rank to prevent
overfitting and save the computational budget. There are some methods to control the rank of matrices
in the existing literature of matrix approximation (Cai et al., 2010; Koltchinskii et al., 2011; Toh &
Yun, 2010). Most of them directly compute singular value decomposition (SVD) of a matrix and
then truncate the smallest singular values. Such an operation can manipulate the rank explicitly
and, more importantly, minimize the difference between the resulting matrix and the original matrix.
However, for fine-tuning large models, it becomes prohibitively expensive to iteratively apply SVD
for a large number of high-dimensional weight matrices. Therefore, instead of computing SVD
exactly, we parameterize $\Delta$ as $\Delta = P \Lambda Q$ to mimic SVD. The diagonal matrix $\Lambda$ contains singular
values while the orthogonal matrices P and Q represent left/right singular vectors of $\Delta$. To regularize
the orthogonality of P and Q, an additional penalty is added to training loss. Such a parameterization
avoids the intensive computations of SVD. Besides, another advantage is that we only need to drop the
unimportant singular values while the singular vectors are maintained. This preserves the possibility
of future recovery and stabilizes the training. See a detailed comparison to LoRA in Section 3.
Based on our SVD parameterization, AdaLoRA dynamically adjusts the rank of $\Delta = P \Lambda Q$ by
importance scoring. Specifically, we divide the incremental matrix $P \Lambda Q$ into triplets, where each
triplet $G_i$ contains the i-th singular value and the corresponding singular vectors. To quantify the
importance of triplets, we propose a novel importance metric, which takes account of the contribution
of every entry in $G_i$ to the model performance (Sanh et al., 2020; Liang et al., 2021; Zhang et al.,
2022). Triplets with low importance scores are granted low priority and hence the singular values are
zeroed out. Triplets with high importance are retained for fine-tuning. Moreover, we also propose
a global budget scheduler to facilitate the training. In particular, we start from an initial parameter
budget, which is slightly higher than the final budget, and then gradually reduce it until matching
the target. Such a scheduler can improve the training stability and model performance. Please see
Section 3 for a detailed description of our importance metric and budget scheduler.
We conduct extensive experiments on a wide range of tasks and models to demonstrate the
effectiveness of AdaLoRA. Specifically, we evaluate the performance using DeBERTaV3-base (He et al.,
2021a) on natural language understanding (GLUE, Wang et al. (2019)) and question answering
(SQuADv1, Rajpurkar et al. (2016) and SQuADv2, Rajpurkar et al. (2018)) datasets. We also apply
our methods to BART-large (Lewis et al., 2019) and evaluate the performance on natural language
generation (XSum, Narayan et al. (2018) and CNN/DailyMail, Hermann et al. (2015)) tasks. We
show AdaLoRA consistently outperforms the baseline, especially under low budget settings. For
example, with less than 0.1% trainable parameters of full fine-tuning, AdaLoRA achieves a 1.2% F1
improvement on the SQuAD2.0 dataset compared with state-of-the-art approaches.
2 BACKGROUND
Transformer-based Models. A typical transformer model consists of L stacked blocks, where each
block contains two submodules: a multi-head attention (MHA) and a fully connected FFN. Given the
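input sequence $X \in \mathbb{R}^{n \times d}$, MHA performs the attention function in parallel h heads:
$$\mathrm{MHA}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W_o, \quad \mathrm{head}_i = \mathrm{Softmax}\left( X W_{q_i} (X W_{k_i})^\top / \sqrt{d_h} \right) X W_{v_i},$$
where $W_o \in \mathbb{R}^{d \times d}$ is an output projection and $W_{q_i}, W_{k_i}, W_{v_i} \in \mathbb{R}^{d \times d_h}$ are query, key and value
projections of head i. $d_h$ is typically set to $d/h$. The other important module is a FFN which consists
of two linear transformations with a ReLU activation in between: $\mathrm{FFN}(X) = \mathrm{ReLU}(X W_{f_1} + b_1) W_{f_2} + b_2$,
where $W_{f_1} \in \mathbb{R}^{d \times d_m}$ and $W_{f_2} \in \mathbb{R}^{d_m \times d}$. Finally, a residual connection is used
followed by a layer normalization (Ba et al., 2016).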
Low Rank Adaptation. LoRA (Hu et al., 2022) models the incremental update of the pre-trained
weights by the product of two small matrices. For $h = W^{(0)} x$, the modified forward pass is:
$$h = W^{(0)} x + \Delta x = W^{(0)} x + BAx, \tag{2}$$
where $W^{(0)}, \Delta \in \mathbb{R}^{d_1 \times d_2}$, $A \in \mathbb{R}^{r \times d_2}$ and $B \in \mathbb{R}^{d_1 \times r}$ with $r \ll \{d_1, d_2\}$. A typically adopts a
random Gaussian initialization while B is initialized with zero to have $\Delta = 0$ at the beginning of
training. We further denote $A_{i*}$ as the i-th row of A, $B_{*i}$ as the i-th column of B, and $G_i = \{A_{i*}, B_{*i}\}$
as the i-th doublet. Hu et al. (2022) only apply LoRA to query and value projections (i.e., $W_q$ and
$W_v$) in the MHAs. He et al. (2022) extend it to weight matrices of FFNs (i.e., $W_{f_1}$ and $W_{f_2}$), leading
to the performance improvement. Meanwhile, they propose a unified view of various efficient tuning
methods including adapter tuning, prefix tuning and LoRA.
3 ADALORA METHOD
Our method contains two important components: (i) SVD-based adaptation, which formulates
the incremental matrices in the form of singular value decomposition; (ii) Importance-aware rank
allocation, which prunes redundant singular values based on our newly-designed importance metric.
3.1 SVD-BASED ADAPTATION
As mentioned in Section 1, we propose to parameterize the incremental updates of the pre-trained
weight matrices in the form of singular value decomposition:
$$W = W^{(0)} + \Delta = W^{(0)} + P \Lambda Q, \tag{3}$$
where $P \in \mathbb{R}^{d_1 \times r}$ and $Q \in \mathbb{R}^{r \times d_2}$ represent the left/right singular vectors of $\Delta$ and the diagonal
matrix $\Lambda \in \mathbb{R}^{r \times r}$ contains the singular values $\{\lambda_i\}_{1 \le i \le r}$ with $r \ll \min(d_1, d_2)$. We further denote
$G_i = \{P_{*i}, \lambda_i, Q_{i*}\}$ as the triplet containing the i-th singular value and vectors. In practice, since
$\Lambda$ is diagonal, we only need to save it as a vector in $\mathbb{R}^r$. $\Lambda$ is initialized with zero while P and Q
adopt a random Gaussian initialization to ensure $\Delta = 0$ at the beginning of training. To enforce the
orthogonality of P and Q, i.e., $P^\top P = Q Q^\top = I$, we utilize the following regularizer (see footnote 2):
$$R(P, Q) = \|P^\top P - I\|_F^2 + \|Q Q^\top - I\|_F^2. \tag{4}$$
In our method, $\Lambda$ is iteratively pruned to adjust the rank after each gradient descent step. As mentioned
in Section 1, one can directly compute SVD for every $\Delta$ to manipulate singular values. The
computational complexity, however, is $O(\min(d_1, d_2)\, d_1 d_2)$. It becomes extremely expensive to
iteratively apply SVD for a large number of high-dimensional incremental matrices. In contrast, our
parameterization avoids intensive SVD computation, greatly releasing the computational overhead.
We remark that one can also apply structured pruning to LoRA to control the rank (i.e., prune BA
doublet-wise in (1)), whereas it has the following disadvantages. First, when a doublet is measured as
unimportant, we have to prune all of its elements. It makes scarcely possible to reactivate the pruned
doublets as their entries are all zeroed out and not trained. In contrast, AdaLoRA only masks out
the singular values based on (3) while the singular vectors are always maintained. It preserves the
potential of future recovery for the triplets dropped by mistake. Second, A and B of LoRA are not
orthogonal, meaning the doublets can be dependent with each other. Discarding the doublets can
incur larger variation from the original matrix than truncating the smallest singular values. Therefore,
the incremental matrices are often altered dramatically after each step of rank allocation, which
causes training instability and even hurts generalization. To demonstrate this point, we present an
ablation study in Section 4.4, which compares AdaLoRA with structured pruning for LoRA.
3.2 IMPORTANCE-AWARE RANK ALLOCATION
We apply the SVD-based adaptation (3) to every weight matrix including $W_q$, $W_k$, $W_v$, $W_{f_1}$ and
$W_{f_2}$ of each transformer layer. In order to control the budget, we iteratively prune singular values
in correspondence to their importance score during the training. For clear reference, we use k to
index the incremental matrix, i.e., $\Delta_k = P_k \Lambda_k Q_k$ for $k = 1, \ldots, n$, where n is the number of
adapted weight matrices. We denote the i-th triplet of $\Delta_k$ as $G_{k,i} = \{P_{k,*i}, \lambda_{k,i}, Q_{k,i*}\}$ and its
importance score as $S_{k,i}$. We further denote the parameter sets $\mathcal{P} = \{P_k\}_{k=1}^n$, $\mathcal{E} = \{\Lambda_k\}_{k=1}^n$,
$\mathcal{Q} = \{Q_k\}_{k=1}^n$ and training cost as $\mathcal{C}(\mathcal{P}, \mathcal{E}, \mathcal{Q})$. With the regularization (4), the training objective
is given by $\mathcal{L}(\mathcal{P}, \mathcal{E}, \mathcal{Q}) = \mathcal{C}(\mathcal{P}, \mathcal{E}, \mathcal{Q}) + \gamma \sum_{k=1}^n R(P_k, Q_k)$, where $\gamma > 0$ is the regularization
coefficient. At the t-th step, we first take a stochastic gradient step to update $P_k^{(t)}$, $\Lambda_k^{(t)}$ and $Q_k^{(t)}$ for
$k = 1, \ldots, n$. Specifically, for $\Lambda_k^{(t)}$
$$\tilde{\Lambda}_k^{(t)} = \Lambda_k^{(t)} - \eta \nabla_{\Lambda_k} \mathcal{L}(\mathcal{P}^{(t)}, \mathcal{E}^{(t)}, \mathcal{Q}^{(t)}), \tag{5}$$
Footnote 2: We present the experiments in Appendix G to verify the effectiveness of the regularization.
```
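
Text extracted page by page typically carries running headers and bare page numbers, which add noise when the text is later used to write question-answer pairs. The following is a rough sketch of a cleanup pass, assuming a hypothetical helper `clean_extracted_text` that is not part of the original notebook. The rules are heuristics: any line consisting only of a 1 to 3 digit number is treated as a page number (which also drops stand-alone section numbers and plot tick labels), and a long line that occurs more than once is kept only the first time:

```python
import re
from collections import Counter

def clean_extracted_text(text, min_repeats=2, min_header_len=20):
    """Hypothetical helper: drop repeated header lines and bare page numbers."""
    lines = text.splitlines()
    # Count how often each non-empty line occurs in the whole text.
    counts = Counter(line.strip() for line in lines if line.strip())
    seen_headers = set()
    cleaned = []
    for line in lines:
        stripped = line.strip()
        # Heuristic: a short all-digit line is assumed to be a page number.
        if re.fullmatch(r"\d{1,3}", stripped):
            continue
        # Keep the first copy of a repeated long line, drop later copies.
        if len(stripped) > min_header_len and counts[stripped] >= min_repeats:
            if stripped in seen_headers:
                continue
            seen_headers.add(stripped)
        cleaned.append(line)
    return "\n".join(cleaned)

# Small made-up sample that mimics two page starts
sample_text = (
    "Published as a conference paper at ICLR 2023\n"
    "ABSTRACT\n"
    "Some text.\n"
    "1\n"
    "Published as a conference paper at ICLR 2023\n"
    "More text."
)
cleaned_sample = clean_extracted_text(sample_text)
```

In the notebook this could be applied as `clean_extracted_text(parsed_text)` before display; the thresholds would likely need tuning per paper.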