
LoRA tutorial #368

Merged: 8 commits merged into pytorch:main from lora-tutorial on Feb 14, 2024

Conversation

ebsmothers (Contributor)

Changelog

First pass at a tutorial for LoRA fine-tuning. Basically the flow here is:

  • What is LoRA/how does it work
  • What does a PyTorch native LoRA look like, what components are available in TorchTune (more for hackers who wanna piece everything together)
  • How to run the LoRA finetune recipe, how to experiment with configs like LoRA modules and rank

A couple things that I was hoping to include here but I don't think our repo is quite ready for:

  • Actually compare results from the different LoRA configs properly (i.e. run generations or evaluations on some ckpts). But
    • we don't really have an eval story yet, and
    • imo the generation UX is too inconsistent with the rest of our recipes to include in a tutorial (we should fix this).
  • Do some simple memory profiling to show memory savings
    • If we had something easily configurable I would do this; I started writing a script from scratch here but felt it was a bit long and would distract from the point of the tutorial.
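
A minimal sketch of the kind of measurement meant here, using only standard PyTorch CUDA memory APIs; `run_step` is a placeholder for whatever forward/backward/optimizer step a recipe performs:

```python
import torch

def measure_peak_memory(run_step, device: str = "cuda") -> float:
    """Run one training step and report peak GPU memory in GB."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)
    run_step()  # placeholder: one forward + backward + optimizer step
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"Peak memory allocated: {peak_gb:.2f} GB")
    return peak_gb
```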

Note: as is, the LoRA configs given in the tutorial are not strictly in line with what's in our repo. We need to first land #347 to expose rank and alpha, but there is still some ongoing discussion there.

Test plan

[Screenshot: 2024-02-11 at 10:59 AM]


netlify bot commented Feb 11, 2024

Deploy Preview for torchtune-preview ready!

🔨 Latest commit: 4062a46
🔍 Latest deploy log: https://app.netlify.com/sites/torchtune-preview/deploys/65cc0984b1453e0008ecda36
😎 Deploy Preview: https://deploy-preview-368--torchtune-preview.netlify.app

@facebook-github-bot added the CLA Signed label on Feb 11, 2024
@@ -65,5 +65,5 @@ tune --nnodes 1 --nproc_per_node 2 finetune_lora --config alpaca_llama2_lora_fin

To run the generation recipe, run this command from inside the main `/torchtune` directory:
```
- python -m recipes.alpaca_generate --native-checkpoint-path /tmp/finetune-llm/model_0.ckpt --tokenizer-path ~/llama/tokenizer.model --input "What is some cool music from the 1920s?"
+ python -m recipes.alpaca_generate --native-checkpoint-path /tmp/finetune-llm/model_0.ckpt --tokenizer-path ~/llama/tokenizer.model --instruction "What is some cool music from the 1920s?"
```

ebsmothers (Author):

I think this is correct? At least based on the samples e.g. here

Contributor:

no this should remain input. instruction specifies the task

ebsmothers (Author):

Sorry maybe I am being dumb here, but what about the examples given in the HF dataset? See the examples below:

[Screenshot: example rows from the HF dataset, 2024-02-12]

Contributor:

yep these examples are the same, the default instruction for the generate script is actually "Answer the question", and then the input is the question to be answered. That's the same as "Convert the given equation" (instruction) and "3x+5y=9" (input).

With your change the instruction becomes "What is some cool music from the 1920s?" with no input, much like the first two examples, vs the previous "Answer the question." instruction paired with "What is some cool music from the 1920s?" as the input. They're both valid, so this change is actually OK, but I wanted to point out the slight nuance.
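
For anyone following the thread, a sketch of how `instruction` and `input` typically compose, assuming the generate script follows the standard Stanford Alpaca prompt template (the exact wording used in the repo may differ):

```python
def alpaca_prompt(instruction: str, input: str = "") -> str:
    """Assumed Stanford Alpaca prompt format; illustrative only."""
    if input:
        return (
            "Below is an instruction that describes a task, paired with an input that "
            "provides further context. Write a response that appropriately completes "
            f"the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        f"appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:\n"
    )

# The two variants under discussion:
# --input:       alpaca_prompt("Answer the question", "What is some cool music from the 1920s?")
# --instruction: alpaca_prompt("What is some cool music from the 1920s?")
```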

ebsmothers (Author):

Ah thanks for clarifying, I missed the default value of "Answer the question." for the instruction. Sounds like this is not technically wrong then? In that case I will revert the change, but imo this is kinda unintuitive and we should revisit.

the loaded :code:`state_dict` are as expected. TorchTune's LoRA recipe does this by default via
:func:`torchtune.modules.peft.validate_state_dict_for_lora`.

Once we've loaded the base model weights, we also want to set only LoRA parameters to trainable.

Contributor:

Why wouldn't we just do this automatically when initializing lora_model?

ebsmothers (Author):

Good question. So while that may save lines of code, my philosophy here is that (a) it is better to be explicit than implicit, and (b) we shouldn't integrate details of training into modeling components any more than is strictly necessary (otherwise our modeling components become hard to extend). So the model builder will return the architecture, but it doesn't do stuff like load weights, freeze base model parameters, wrap in FSDP, or any of that. All of that should be done in the recipe. This way a user who wants things to "just run" can use the recipe and not have to worry about which params are trainable, while a user who wants to customize or extend things can use our modeling components out of the box more easily.
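
To make the "explicit in the recipe" approach concrete, a minimal sketch of freezing everything except the LoRA parameters after building the model. It assumes LoRA parameters are identifiable by name (e.g. names containing "lora"); this is not the exact logic in torchtune's recipe:

```python
from torch import nn

def set_only_lora_trainable(model: nn.Module) -> None:
    """Freeze all parameters except those belonging to LoRA layers (matched by name)."""
    for name, param in model.named_parameters():
        param.requires_grad = "lora" in name

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / {total:,} params ({100 * trainable / total:.2f}%)")
```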


.. code-block:: bash

tune --nnodes 1 --nproc_per_node 2 lora_finetune --config alpaca_llama2_lora_finetune

Contributor:

Out of curiosity why do our tutorials use 2 GPUs? Is it so that we can show off distributed?

ebsmothers (Author):

Yeah, I think we do not have a clear story here; we should come up with a better philosophy across all our tutorials. My heuristic here is basically "if you're running on reasonable hardware (read: 4090), this won't OOM". However, the full finetune tutorial suggests running on 4 devices, which with a 4090 should OOM (I think). I guess there are two types of problems we want to avoid here:

(1) We make it seem like a given finetune won't work when it actually does (e.g. in this case things will run fine on 1x A100, but that may not be obvious from the command)
(2) We give a command that will OOM on certain hardware but don't make that clear enough

Unless we aggressively define supported hardware types or explicitly enumerate a ton of caveats, I feel like the best solution here is to continue beefing up the supported hardware table in our readme (maybe move some version to our tutorials page), and point to that. At the same time that's also one extra bit of indirection we have to do each time we give a CLI command.

Either way, maybe it's worth adding a separate tutorial around distributed and some of our other training utilities so that we explicitly show usage of e.g. single-device, no FSDP runs contrasted with multi-device runs with FSDP enabled.

Contributor:

Previous comment isn't a blocker, I was just curious.

That said, I tried running this command and got an error that wandb wasn't found. We should make sure it's included in our default install.

ebsmothers (Author):

Thanks for flagging this. Actually I think we decided not to include wandb in our core install, so this is an issue with the YAML. I am changing the default logger to disk in #347.

Contributor:

should we detail what type of GPUs are used here (esp VRAM)? also would be nice to see a comparison with the full finetune command to show that indeed you can use LoRA to fine-tune the same model with less resources

ebsmothers (Author):

> should we detail what type of GPUs are used here (esp VRAM)?

Yeah we could say "on two GPUs (each having VRAM >= 23GB)" instead of "on two GPUs", wdyt?

> would be nice to see a comparison with the full finetune command to show that indeed you can use LoRA to fine-tune the same model with less resources

I agree. The problem is we don't have any profiling utilities that can be easily integrated/demonstrated in a tutorial (if you can think of a way to do it let me know). But this is absolutely something I want to include in a follow-up.

Contributor:

This is a great point. I think we should explicitly call out the hardware we use for the tutorial, i.e. this assumes we use N A100s with 80GB memory. To map this to your setup, please look at this table.

# and add to the original model's outputs
return frozen_out + (self.alpha / self.rank) * lora_out

There are some other details around initialization which we omit here, but otherwise that's

Contributor:

Link to lora.py for users interested in the full implementation?
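
For readers who want the gist without opening the file, a self-contained sketch of a LoRA linear layer in plain PyTorch, including the standard initialization from the paper (A gets a Kaiming/Gaussian init, B starts at zero so the adapter is a no-op at initialization). Illustrative only, not torchtune's actual lora.py:

```python
import math
import torch
from torch import nn

class LoRALinear(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, rank: int, alpha: float):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        # Frozen base projection
        self.linear = nn.Linear(in_dim, out_dim, bias=False)
        self.linear.weight.requires_grad = False
        # Trainable low-rank factors
        self.lora_a = nn.Linear(in_dim, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_dim, bias=False)
        # Standard LoRA init: A ~ Kaiming-uniform, B = 0, so the update starts at zero
        nn.init.kaiming_uniform_(self.lora_a.weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_b.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        frozen_out = self.linear(x)
        lora_out = self.lora_b(self.lora_a(x))
        # add the scaled low-rank update to the original model's output
        return frozen_out + (self.alpha / self.rank) * lora_out
```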

@RdoubleA (Contributor) left a comment:

Overall an excellent tutorial, very pleasant to read and follow along. I just have many nit suggestions for beefing it up a bit.

@@ -56,7 +56,7 @@ To run the recipe without any changes on 4 GPUs, launch a training run using Tun

.. code-block:: bash

- tune --nnodes 1 --nproc_per_node 4 --config alpaca_llama2_full_finetune
+ tune --nnodes 1 --nproc_per_node 4 full_finetune --config alpaca_llama2_full_finetune

Contributor:

good catch

Contributor:

guilty as charged :(

This guide will teach you about `LoRA <https://arxiv.org/abs/2106.09685>`_, a parameter-efficient finetuning technique,
and show you how you can use TorchTune to finetune a Llama2 model with LoRA.
If you already know what LoRA is and want to get straight to running
your own LoRA finetune in TorchTune, you can jump to :ref:`this section<lora_recipe_label>`.

Contributor:

nit:

Suggested change:
- your own LoRA finetune in TorchTune, you can jump to :ref:`this section<lora_recipe_label>`.
+ your own LoRA finetune in TorchTune, you can jump to :ref:`the recipe<lora_recipe_label>`.

ebsmothers (Author):

hmmm this isn't the recipe though, it's the section of the tutorial showing how to run the recipe. I could directly use the section title instead, e.g.

Suggested change:
- your own LoRA finetune in TorchTune, you can jump to :ref:`this section<lora_recipe_label>`.
+ your own LoRA finetune in TorchTune, you can jump to :ref:`LoRA finetuning recipe in TorchTune<lora_recipe_label>`.

What is LoRA?
-------------

`LoRA <https://arxiv.org/abs/2106.09685>`_ is a parameter-efficient finetuning technique that adds a trainable

Contributor:

Suggested change:
- `LoRA <https://arxiv.org/abs/2106.09685>`_ is a parameter-efficient finetuning technique that adds a trainable
+ `LoRA <https://arxiv.org/abs/2106.09685>`_ is a parameter-efficient finetuning technique that adds trainable

-------------

`LoRA <https://arxiv.org/abs/2106.09685>`_ is a parameter-efficient finetuning technique that adds a trainable
low-rank decomposition to different layers of a neural network, then freezes

Contributor:

Suggested change:
- low-rank decomposition to different layers of a neural network, then freezes
+ low-rank decomposition matrices to different layers of a neural network, then freezes

low-rank decomposition to different layers of a neural network, then freezes
the network's remaining parameters. LoRA is most commonly applied to
transformer models, in which case it is common to add the low-rank matrices
to some of the self-attention projections in each transformer layer.

Contributor:

I'd want to emphasize that it's parallel to linear layers, since many don't know what attention projections are

Suggested change:
- to some of the self-attention projections in each transformer layer.
+ to some of the linear projections in each transformer layer's self attention.

and V projections. This means a LoRA decomposition of rank :code:`r=8` will reduce the number of trainable
parameters for a given projection from :math:`4096 * 4096 \approx 16.8M` to :math:`8 * 8192 \approx 65K`, a
reduction of over 99%.
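
As a quick sanity check on those numbers (embedding dim 4096, rank r=8, a single projection):

```python
embed_dim, rank = 4096, 8

full_params = embed_dim * embed_dim   # 16,777,216 trainable params in the original projection
lora_params = 2 * rank * embed_dim    # A: (r x in_dim) + B: (out_dim x r) = 65,536 params

print(f"{100 * (1 - lora_params / full_params):.2f}% reduction")  # 99.61% reduction
```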

Contributor:

consider adding a few sentences about why we shouldn't just always go with LoRA fine-tuning (idk this answer) and when you would want to do full-finetuning vs LoRA. may be a bit out of scope for the tutorial but I think it's important to convey since we have both these recipes that users will have to choose from

ebsmothers (Author):

Sorry to be a broken record 😅. Again, I don't wanna make any general claims or suggest best practices here; tbh I don't think we have trained enough models on enough datasets to give that guidance. I agree this would be useful, but maybe we can have another tutorial that is more focused on a modeling deep-dive (think something like this tutorial from Sebastian Raschka).

(Feel free to verify this for yourself.)

Why does this matter? TorchTune makes it easy to load checkpoints for LoRA directly from our Llama2
model without any wrappers or custom checkpoint conversion logic.

Contributor:

love this point

ebsmothers (Author):

Thanks @rohan-varma 😄


tune --nnodes 1 --nproc_per_node 2 lora_finetune --config alpaca_llama2_lora_finetune \
--override lora_attn_modules='q_proj,k_proj,v_proj,output_proj' \
lora_rank=16 output_dir=./lora_experiment_1

Contributor:

non-blocking, but it would be awesome if we could show comparable loss curves or eval metrics compared to full-finetuning. probably in a future followup once the eval story is more clear

ebsmothers (Author):

Yeah, I was actually trying to do eval and/or generation as part of this, but as I alluded to in the summary, the eval is not ready and the generation UX is clunky. Re loss curves: I would like to do this, but then I introduce a tensorboard/wandb dep in the tutorial. Is this something we're OK with? I guess we can add them with the caveat "you can install this optional dep to reproduce these".

Contributor:

We can show the curves and then comment on how we generated them without explicitly adding the dependency?

ebsmothers (Author):

Yep this sounds good to me



.. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites

* Be familiar with the :ref:`overview of TorchTune<overview_label>`

Contributor:

Suggested change:
- * Be familiar with the :ref:`overview of TorchTune<overview_label>`
+ * Be familiar with :ref:`TorchTune<overview_label>`

transformer models, in which case it is common to add the low-rank matrices
to some of the self-attention projections in each transformer layer.

By finetuning with LoRA (as opposed to finetuning all model parameters),

Contributor:

Maybe link in the full finetuning tutorial here?


By finetuning with LoRA (as opposed to finetuning all model parameters),
you can expect to see memory savings due to a substantial reduction in the
number of gradient parameters. When using an optimizer with momentum,

Contributor:

Hmm is gradient parameters a very common term? Why not "learnable parameters"?

ebsmothers (Author):

Oh sorry, I mean the actual grad values, not the parameters. I am trying to explicitly distinguish between model params (the total # of which increases slightly when using LoRA) and the memory used by gradients (which decreases dramatically).
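
To put rough numbers on that distinction: a back-of-the-envelope sketch assuming fp32 gradients (4 bytes/param) plus AdamW's two fp32 moment states (8 bytes/param), ignoring weights, activations, sharding, and mixed precision, so real recipe numbers will differ. The LoRA trainable-parameter count below is illustrative:

```python
def grad_plus_adamw_gb(trainable_params: int) -> float:
    """Approximate gradient (4 B/param) + AdamW state (8 B/param) memory, fp32."""
    return trainable_params * (4 + 8) / 1024**3

full_finetune = 7_000_000_000   # all ~7B params trainable
lora_finetune = 4_000_000       # illustrative: a few million trainable LoRA params

print(f"full finetune: ~{grad_plus_adamw_gb(full_finetune):.0f} GB")   # ~78 GB
print(f"LoRA finetune: ~{grad_plus_adamw_gb(lora_finetune):.2f} GB")   # ~0.04 GB
```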

How does LoRA work?
-------------------

LoRA replaces weight update matrices with a low-rank approximation. In general, weight updates

Contributor:

Extreme nit: "In general, weight updates for a given linear layer mapping dimension in_dim to dimension out_dim can have rank" reads a bit weirdly. At least I was confused. Maybe something like:

"In general, weight updates for a given linear layer Linear(in_dim, out_dim) can have rank" makes it easier to understand? Of course this assumes that people are aware of the PyTorch syntax. Feel free to disregard.

ebsmothers (Author):

I think your version is clearer. Can also probably link to nn.Linear docs just in case it's not clear
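
As a tiny illustration of the rank constraint under discussion: however large `in_dim` and `out_dim` are, the product of the two LoRA factors has rank at most r (dimensions below just mirror the tutorial's example):

```python
import torch

in_dim, out_dim, r = 4096, 4096, 8
A = torch.randn(r, in_dim)    # LoRA "down" projection
B = torch.randn(out_dim, r)   # LoRA "up" projection

delta_W = B @ A               # implied (out_dim, in_dim) weight update
print(delta_W.shape)                       # torch.Size([4096, 4096])
print(torch.linalg.matrix_rank(delta_W))   # tensor(8): rank is capped at r
```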

# The default settings for lora_llama2_7b will match those for llama2_7b
# We just need to define which layers we want LoRA applied to.
# We can choose from ["q_proj", "k_proj", "v_proj", and "output_proj"]
lora_model = lora_llama2_7b(lora_attn_modules=["q_proj", "v_proj"])

Contributor:

Sorry if I misremember, but I thought one of the tutorials claimed that applying lora to q, k and v should be the default. Is that not true?

ebsmothers (Author):

So yes, this tutorial says that the best performance comes when applied to all layers. At the same time, a lot of references use Q and V as defaults, e.g. lit-gpt, HF PEFT (that 2nd one took some digging). So I am torn on what to do here, but feel it is better to do the intuitive thing than the "best" thing for our default values. Lmk if you disagree here

@kartikayk (Contributor):

Thanks for putting this together! Overall this looks great and is one of the higher-quality tutorials we have. It would have been awesome to actually show how some of the params impact eval, memory footprint, etc. in greater detail, but I don't think we have the tooling set up for that. Something to add to the backlog though. I think LoRA and QLoRA are the showstoppers, and so the easier we make it for users to understand these, the more people will use them. Left some nits. Accepting the PR, but I'll let you address the ongoing comments.

@ebsmothers merged commit dd009da into pytorch:main on Feb 14, 2024
17 checks passed
@ebsmothers deleted the lora-tutorial branch on February 14, 2024 at 01:10
Labels: CLA Signed
5 participants