# Appendix: Parameter-Efficient Fine-Tuning beyond LoRA

This appendix surveys several extensions to Low-Rank Adaptation (LoRA) that have recently appeared in the parameter-efficient fine-tuning (PEFT) literature. We focus on methods that are implemented in [🤗 PEFT](https://huggingface.co/docs/peft/en/package_reference/adalora) so that they are easy to experiment with. The goal is to understand *why* each variant was introduced and to provide a lightweight, reproducible comparison of their trainable parameter counts on a small language model.

## Recap: What LoRA optimises

LoRA proposes to freeze the pretrained weights $W_0$ of a linear layer and to learn a low-rank update $\Delta W = BA$ with $A \in \mathbb{R}^{r \times d_{\text{in}}}$ and $B \in \mathbb{R}^{d_{\text{out}} \times r}$. The forward pass becomes

$$ h = W_0 x + \frac{\alpha}{r} BAx, $$

where $r$ is the rank hyper-parameter and $\alpha$ rescales the update. Optimising only $A$ and $B$ drastically reduces the number of trainable parameters compared to the dense matrix $W_0$.
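To make the saving concrete, the following minimal sketch counts the parameters of the two factors against the dense matrix. The helper names and the dimensions are illustrative assumptions, not values taken from a particular model.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters in A (r x d_in) plus B (d_out x r)."""
    return r * d_in + d_out * r


def dense_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the full weight matrix W_0 (d_out x d_in), bias ignored."""
    return d_in * d_out


# Hypothetical layer size chosen only for illustration.
d_in, d_out, r = 768, 768, 8
lora = lora_param_count(d_in, d_out, r)
dense = dense_param_count(d_in, d_out)
print(f"LoRA: {lora:,}  dense: {dense:,}  ratio: {lora / dense:.2%}")
```

The LoRA count grows linearly in $r$, so doubling the rank doubles the adapter size while the frozen matrix stays untouched.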

## LoRA extensions covered here

Below is a concise overview of the LoRA-inspired variants explored in this appendix. Each approach introduces an additional inductive bias on top of the low-rank structure.

| Method | Main idea | Extra knobs | Intuition |
| --- | --- | --- | --- |
| **AdaLoRA** (Adaptive LoRA) | Dynamically reallocates rank budget during training | $(r_t)$ schedule, importance metrics | Concentrate capacity on the most useful layers. |
| **LoHa** (Low-Rank Hadamard Product) | Forms the update as an element-wise (Hadamard) product of two low-rank products | Rank $r$, scaling $\alpha$ | Reaches an effective rank of up to $r^2$ with two pairs of rank-$r$ factors. |
| **DoRA** (Weight-Decomposed Low-Rank Adaptation) | Separates weight *direction* and *magnitude* | None beyond LoRA's (enabled by a flag) | Trains a magnitude vector alongside a low-rank update of the direction. |
| **HRA** (Householder Reflection Adaptation) | Multiplies the frozen weight by a chain of Householder reflections | # reflections $m$ (`r` in PEFT) | Applies an orthogonal transform that preserves norms. |
| **LoRA+ / other tweaks** | Separate learning rates for $A$ and $B$, rank-stabilised scaling, dropout tricks | Learning-rate ratio, scaling flag | Often complementary to the above and compatible with PEFT APIs. |

### AdaLoRA

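AdaLoRA parameterises each update in an SVD-like form, $\Delta W = P \Lambda Q$, where $\Lambda$ is diagonal and $P$, $Q$ are kept approximately orthogonal by a regulariser. Training starts from an initial rank and progressively prunes the least important singular triplets until a target budget is reached; importance is estimated from smoothed sensitivity scores built from weight-gradient products. The optimisation objective for a weight matrix $W$ becomes

$$ \min_{P, \Lambda, Q} \ \mathcal{L}(W_0 + P \Lambda Q) + \gamma \left( \lVert P^\top P - I \rVert_F^2 + \lVert Q Q^\top - I \rVert_F^2 \right), \quad \text{s.t. } \lVert \operatorname{diag}(\Lambda) \rVert_0 \le r_t, $$

where the per-matrix rank $r_t$ evolves during training and all ranks are drawn from a shared global budget. The PEFT implementation automates this via a scheduler that shrinks the global budget between a warm-up phase of $t_{\text{init}}$ steps and a final phase of $t_{\text{final}}$ steps, during which the budget is held fixed.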
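The budget mechanics can be sketched in a few lines. The snippet below assumes the cubic decay schedule described in the AdaLoRA paper and a simple global top-k selection over importance scores; the function names, the scores, and the layer labels are hypothetical and only serve to show how capacity ends up unevenly distributed.

```python
def adalora_budget(step, *, init_budget, target_budget, tinit, tfinal, total_step):
    """Rank budget at a given step: fixed warm-up, cubic decay, fixed tail."""
    if step <= tinit:
        return init_budget
    if step > total_step - tfinal:
        return target_budget
    progress = (step - tinit) / (total_step - tfinal - tinit)
    return int((init_budget - target_budget) * (1 - progress) ** 3 + target_budget)


def keep_top_triplets(scores, budget):
    """Keep the `budget` highest-scoring (layer, index) triplets globally."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:budget])


schedule = [
    adalora_budget(s, init_budget=12, target_budget=6, tinit=10, tfinal=50, total_step=100)
    for s in range(0, 101, 10)
]

# Made-up importance scores for two layers with three triplets each.
scores = {
    ("attn", 0): 0.9, ("attn", 1): 0.8, ("attn", 2): 0.1,
    ("mlp", 0): 0.7, ("mlp", 1): 0.05, ("mlp", 2): 0.02,
}
kept = keep_top_triplets(scores, budget=3)
rank_per_layer = {layer: sum(1 for (l, _) in kept if l == layer) for layer in ("attn", "mlp")}
print(schedule, rank_per_layer)
```

Because the selection is global, a layer with many high-scoring triplets keeps a larger rank than one whose triplets score poorly, which is the reallocation effect the method relies on.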

### LoHa

LoHa (Low-Rank Hadamard Product) replaces the single product $BA$ with the element-wise (Hadamard) product of two low-rank products. With two pairs of factors $(B_1, A_1)$ and $(B_2, A_2)$, each of rank $r$, the update is

$$ \Delta W = (B_1 A_1) \odot (B_2 A_2), $$

where $\odot$ denotes element-wise multiplication. Each factor pair has rank at most $r$, but their Hadamard product can reach rank $r^2$, so LoHa trades roughly twice the parameters of LoRA at the same $r$ for a much higher attainable rank. In practice LoHa drops into the same API through `LoHaConfig`, which exposes the rank `r` and a scaling `alpha` rather than an additional multiplier.
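As a small illustration of the Hadamard-product form $\Delta W = (B_1 A_1) \odot (B_2 A_2)$ that, to our knowledge, PEFT's LoHa uses, the sketch below builds two rank-2 products whose element-wise product is the $4 \times 4$ identity, which has rank 4. The matrices are hand-picked toy values and the helpers are hypothetical.

```python
def matmul(X, Y):
    """Plain list-of-rows matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def hadamard(X, Y):
    """Element-wise product of two equally shaped matrices."""
    return [[x * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Two factor pairs, each of rank r = 2 (shapes 4x2 and 2x4).
B1 = [[1, 0], [1, 0], [0, 1], [0, 1]]
A1 = [[1, 1, 0, 0], [0, 0, 1, 1]]
B2 = [[1, 0], [0, 1], [1, 0], [0, 1]]
A2 = [[1, 0, 1, 0], [0, 1, 0, 1]]

first = matmul(B1, A1)    # block-diagonal matrix of ones, rank 2
second = matmul(B2, A2)   # checkerboard-like matrix, rank 2
delta_w = hadamard(first, second)

for row in delta_w:
    print(row)
```

Neither factor product can exceed rank 2, yet the combined update is full rank, which is the $r^2$ effect in its simplest form.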

### DoRA

Weight-Decomposed Low-Rank Adaptation splits each pretrained weight into a direction and a magnitude. The magnitude is a trainable vector $m$, initialised to the vector-wise norm $\lVert W_0 \rVert_c$ of the pretrained weight (one entry per column in the paper's notation), and the direction is adapted with the usual low-rank matrices $(A, B)$:

$$ W_0 = m \, \frac{W_0}{\lVert W_0 \rVert_c} \quad \Rightarrow \quad W' = m \, \frac{W_0 + BA}{\lVert W_0 + BA \rVert_c}. $$

The decoupling lets magnitude and direction change independently, which the DoRA authors report brings the learning behaviour closer to that of full fine-tuning. PEFT exposes DoRA as a flag on the LoRA config, `LoraConfig(use_dora=True)`, rather than as a separate config class; the layer then keeps track of both the low-rank factors and the magnitude vector.
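The magnitude and direction split can be checked numerically on a toy matrix. The sketch below follows the DoRA paper's column-wise norm convention, $W' = m \cdot (W_0 + \Delta) / \lVert W_0 + \Delta \rVert_c$; the helper names and values are hypothetical, and library implementations may normalise along a different axis.

```python
import math


def column_norms(W):
    """Euclidean norm of every column of a list-of-rows matrix."""
    return [math.sqrt(sum(row[j] ** 2 for row in W)) for j in range(len(W[0]))]


def dora_weight(W0, delta, m):
    """Rescale each column of (W0 + delta) to the magnitude stored in m."""
    V = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W0, delta)]
    norms = column_norms(V)
    return [[m[j] * V[i][j] / norms[j] for j in range(len(norms))] for i in range(len(V))]


W0 = [[3.0, 0.0], [4.0, 2.0]]
m = column_norms(W0)                      # magnitude initialised from W0
zero = [[0.0, 0.0], [0.0, 0.0]]
W_init = dora_weight(W0, zero, m)         # no update yet: reproduces W0

delta = [[1.0, 0.0], [0.0, 0.0]]          # stands in for a low-rank update BA
W_new = dora_weight(W0, delta, m)         # direction changes, magnitude does not
print(W_init, column_norms(W_new))
```

With a zero update the decomposition returns the pretrained weight, and after a direction update the column norms still equal $m$, so any change in scale has to come from training $m$ itself.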

### Householder Reflection Adaptation (HRA)

HRA adapts a layer by multiplying its frozen weight with a chain of Householder reflections, each of which is an orthogonal matrix $H_i = I - 2 \frac{v_i v_i^\top}{\lVert v_i \rVert^2}$. Composing $m$ reflections yields an orthogonal transform $Q = \prod_{i=1}^{m} H_i$ that preserves norms. HRA parameterises $Q$ solely via the trainable vectors $v_i$, leading to

$$ W' = W_0 Q = W_0 \prod_{i=1}^{m} H_i. $$

By construction, HRA focuses on rotation-like adaptations, which can be beneficial when the norms and pairwise angles of the inputs to the frozen weight should be preserved. In PEFT the number of reflections is set through the `r` field of `HRAConfig`.
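The norm-preserving property of a product of Householder reflections is easy to verify numerically. The sketch below uses plain Python lists and arbitrary example vectors; the helper names are hypothetical.

```python
import math


def householder(v):
    """Return H = I - 2 v v^T / ||v||^2 as a list of rows."""
    norm_sq = sum(x * x for x in v)
    n = len(v)
    return [
        [(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / norm_sq for j in range(n)]
        for i in range(n)
    ]


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def norm(vec):
    return math.sqrt(sum(t * t for t in vec))


# Compose two reflections built from arbitrary non-zero vectors.
Q = matmul(householder([1.0, 2.0, 2.0]), householder([0.0, 1.0, -1.0]))
gram = matmul(transpose(Q), Q)            # should be (numerically) the identity

x = [1.0, -2.0, 0.5]
Qx = [sum(q * xi for q, xi in zip(row, x)) for row in Q]
print(norm(x), norm(Qx))
```

Since $Q^\top Q = I$, the vector $Qx$ that reaches the frozen weight has the same length as $x$; only its orientation changes.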

### Other practical tweaks

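- **LoRA+** trains the $A$ and $B$ factors with different learning rates, using a larger rate for $B$, to make feature learning more efficient.
- **Rank-stabilised LoRA (rsLoRA)** changes the scaling factor from $\alpha / r$ to $\alpha / \sqrt{r}$ so that the size of the update does not collapse as the rank grows.
- **AdapterDrop** removes adapters from the lower transformer layers to reduce training and inference cost; dropout options such as `lora_dropout` play the regularising role instead.

These refinements are orthogonal to the structural extensions above. Some are exposed as configuration flags (for example `use_rslora=True` in `LoraConfig`), while learning-rate changes such as LoRA+ belong to the optimiser setup.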
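Two of these tweaks reduce to one-line changes, which the following sketch spells out. It assumes that rsLoRA replaces the factor $\alpha / r$ with $\alpha / \sqrt{r}$ and that LoRA+ gives the $B$ factors a learning rate that is a fixed multiple of the one used for $A$. The helper names are hypothetical; the parameter names are placeholders modelled on PEFT's `lora_A` / `lora_B` naming.

```python
import math


def lora_scaling(alpha: float, r: int, use_rslora: bool = False) -> float:
    """Scaling applied to BAx: alpha / r, or alpha / sqrt(r) for rsLoRA."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


def loraplus_learning_rates(param_names, lr: float, lr_ratio: float):
    """Assign lr to A-type factors and lr * lr_ratio to B-type factors."""
    rates = {}
    for name in param_names:
        rates[name] = lr * lr_ratio if "lora_B" in name else lr
    return rates


# Standard scaling shrinks quickly with the rank; rsLoRA shrinks more slowly.
for r in (4, 16, 64):
    print(r, lora_scaling(16, r), lora_scaling(16, r, use_rslora=True))

# Placeholder parameter names for illustration only.
names = ["h.0.attn.c_attn.lora_A.weight", "h.0.attn.c_attn.lora_B.weight"]
rates = loraplus_learning_rates(names, lr=0.001, lr_ratio=16.0)
print(rates)
```

In a real training script the per-parameter rates would be turned into optimiser parameter groups; the dictionary here only shows which factor receives which rate.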

## Lightweight empirical comparison

The next cells instantiate adapters for a tiny GPT-2 model to compare how many parameters each method trains. The point is not to run a full fine-tuning loop but to sanity-check that the adapters are easy to attach and that the parameter budgets differ as expected.

In [None]:
# Install dependencies (uncomment if running on a fresh environment)
# !pip install --quiet torch transformers accelerate peft pandas

In [None]:
from __future__ import annotations

import inspect
from dataclasses import dataclass
from typing import Any, Dict

import pandas as pd
import torch
from transformers import AutoModelForCausalLM

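from peft import (
    AdaLoraConfig,
    HRAConfig,
    LoHaConfig,
    LoraConfig,
    TaskType,
    get_peft_model,
)

BASE_MODEL = "hf-internal-testing/tiny-random-GPT2"
# Attention projection in GPT-2 blocks. Note that GPT-2 implements it as a
# transformers Conv1D layer; depending on the PEFT version, some adapter types
# (possibly LoHa and HRA) accept only nn.Linear or nn.Conv2d targets.
TARGET_MODULES = ["c_attn"]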

In [None]:
@dataclass
class AdapterResult:
    name: str
    trainable_params: int
    total_params: int
    percent_trainable: float


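def prepare_config(config_cls, *, extra: Dict[str, Any] | None = None, **common_kwargs: Any):
    """Build a config, passing only keyword arguments that `config_cls` accepts.

    Values in `extra` take precedence over `common_kwargs`. Names the config
    class does not declare are silently skipped, so a misspelt key in `extra`
    has no effect.
    """
    extra = extra or {}
    # PEFT configs are dataclasses, so __init__ lists every supported field.
    sig = inspect.signature(config_cls.__init__)
    kwargs = {}
    for name in sig.parameters:
        if name == "self":
            continue
        if name in extra:
            kwargs[name] = extra[name]
        elif name in common_kwargs:
            kwargs[name] = common_kwargs[name]
    return config_cls(**kwargs)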


def count_trainable_parameters(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def count_total_parameters(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())


def evaluate_adapter(config_cls, name: str, *, extra: Dict[str, Any] | None = None) -> AdapterResult:
    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
    model.requires_grad_(False)

    config = prepare_config(
        config_cls,
        extra=extra,
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=TARGET_MODULES,
        task_type=TaskType.CAUSAL_LM,
    )
    peft_model = get_peft_model(model, config)
    trainable = count_trainable_parameters(peft_model)
    total = count_total_parameters(peft_model)
    return AdapterResult(name, trainable, total, 100.0 * trainable / total)

In [None]:
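results = [
    evaluate_adapter(LoraConfig, "LoRA"),
    # AdaLoRA starts at init_r and prunes down to target_r, so init_r > target_r.
    # Field names follow PEFT's AdaLoraConfig (deltaT, total_step); recent PEFT
    # releases require total_step and check that tinit < total_step - tfinal.
    evaluate_adapter(
        AdaLoraConfig,
        "AdaLoRA",
        extra=dict(
            init_r=12, target_r=6, beta1=0.85, beta2=0.85,
            tinit=10, tfinal=50, deltaT=10, total_step=100,
        ),
    ),
    # LoHa has no extra multiplier; its capacity is controlled by r alone.
    evaluate_adapter(LoHaConfig, "LoHa"),
    # DoRA is enabled through a flag on LoraConfig, not a separate config class.
    evaluate_adapter(LoraConfig, "DoRA", extra=dict(use_dora=True)),
    # In HRA, r is the number of Householder reflections.
    evaluate_adapter(HRAConfig, "HRA"),
]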

df = pd.DataFrame(
    {
        "Adapter": [r.name for r in results],
        "Trainable params": [r.trainable_params for r in results],
        "Total params": [r.total_params for r in results],
        "% trainable": [f"{r.percent_trainable:.4f}%" for r in results],
    }
)

df

The table above compares how many parameters each variant trains relative to the full model. The differences follow from the structure of each method: AdaLoRA's count is governed by its initial rank `init_r` rather than `r` and includes one extra vector of singular values per adapted layer; LoHa trains two pairs of low-rank factors, roughly doubling the LoRA count at the same `r`; DoRA adds one magnitude vector per adapted layer on top of the LoRA factors; and HRA stores only one vector per reflection, so it can come in below LoRA.
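A back-of-the-envelope estimate helps to sanity-check the measured counts. The formulas below are assumptions about how each method stores its parameters for a single adapted linear layer (in particular, one DoRA magnitude entry per output feature and one length-$d_{\text{in}}$ vector per HRA reflection); actual numbers depend on the PEFT version and on the layer shapes. The dimensions are hypothetical.

```python
def estimate_trainable(d_in: int, d_out: int, r: int, init_r: int) -> dict:
    """Rough per-layer trainable parameter estimates under stated assumptions."""
    return {
        "LoRA": r * (d_in + d_out),                   # A and B
        "AdaLoRA": init_r * (d_in + d_out) + init_r,  # P, Q and diagonal Lambda
        "LoHa": 2 * r * (d_in + d_out),               # two pairs of factors
        "DoRA": r * (d_in + d_out) + d_out,           # LoRA plus magnitude vector (assumed size d_out)
        "HRA": r * d_in,                              # one vector per reflection (assumed length d_in)
    }


# Hypothetical layer shape, not read from the model used above.
estimates = estimate_trainable(d_in=32, d_out=96, r=8, init_r=12)
for name, count in estimates.items():
    print(f"{name:8s} {count:6d}")
```

If the measured numbers deviate from such estimates by more than bias terms or bookkeeping tensors, that is usually a sign that a configuration key was not picked up.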

### Comparing adapter parameter counts

To make the comparison more explicit, the following bar plot visualises the trainable parameter budgets. The differences remain tiny relative to the full model, but they highlight which methods introduce additional learnable tensors beyond the vanilla LoRA factors.

In [None]:
import plotly.express as px

fig = px.bar(df, x="Adapter", y="Trainable params", text="% trainable", title="Trainable parameters per adapter type")
fig.update_traces(textposition="outside")
fig.update_layout(yaxis_title="Parameters", xaxis_title="Adapter")
fig

## Takeaways

- All of the covered methods are exposed through the same `get_peft_model` workflow, so swapping adapters is mostly a matter of changing the config class (or, for DoRA, setting a flag on `LoraConfig`).
- Adaptive or structured variants differ modestly in trainable parameters, some above and some below vanilla LoRA, but all stay within the PEFT regime.
- Choosing between them depends on the task: AdaLoRA is attractive when different layers need different capacities, LoHa offers a higher attainable rank for a given factor size, HRA restricts the adaptation to norm-preserving transforms, and DoRA lets weight magnitude and direction adapt independently.

These adapters are complementary to the optimisation strategies introduced earlier in the book, and they can be combined with quantisation or pruning techniques from the other appendices.