QLoRA tutorial #693

Merged: 32 commits merged into main on Apr 15, 2024

Conversation

@rohan-varma (Member) commented Apr 11, 2024

Context

  • Adds a QLoRA tutorial to accompany the LoRA tutorial in torchtune. As discussed with @ebsmothers, the nuances and details in QLoRA are enough to warrant its own tutorial. Will also link the LoRA tutorial to QLoRA and backlink the QLoRA tutorial to the LoRA tutorial.

Changelog

  • ...

Test plan

[image]


pytorch-bot bot commented Apr 11, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/693

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit fa6b0a3 with merge base 5b0dc57:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Apr 11, 2024 (label managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed).
@rohan-varma rohan-varma marked this pull request as draft April 11, 2024 22:27
@rohan-varma rohan-varma marked this pull request as ready for review April 12, 2024 23:16
@rohan-varma rohan-varma changed the title [WIP] QLoRA tutorial QLoRA tutorial Apr 12, 2024
docs/source/examples/qlora_finetune.rst: 3 resolved review comments (outdated)
accuracy.

The `QLoRA paper <https://arxiv.org/abs/2305.14314>`_ introduces two key abstractions to decrease memory usage and avoid accuracy degradation: the bespoke 4-bit NormalFloat
type, and a double quantization method that quantizes the quantization parameters themselves to save even more memory. TorchTune uses
Contributor:
🤯
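For readers who want intuition for those two abstractions, here is a rough, self-contained sketch of blockwise 4-bit quantization with double-quantized scales. It is purely illustrative: it uses uniform int4 levels rather than the NF4 code book, and the block and group sizes are arbitrary, so it should not be read as torchao's or torchtune's actual implementation.

.. code-block:: python

    # Illustrative sketch only: uniform 4-bit blockwise quantization plus
    # "double quantization" of the per-block scales. NOT the NF4 code book
    # or torchao's NF4Tensor; block sizes and levels are arbitrary choices.
    import torch

    def quantize_blockwise_4bit(w, block_size=64, scale_group=256):
        blocks = w.reshape(-1, block_size)
        scales = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)  # one fp32 scale per block
        q = (blocks / scales * 7).round().clamp(-8, 7).to(torch.int8)     # values in the 4-bit range

        # Double quantization: quantize the per-block fp32 scales to int8,
        # keeping only one fp32 "scale of scales" per group of blocks.
        s = scales.reshape(-1, scale_group)
        s_max = s.abs().amax(dim=1, keepdim=True)
        q_scales = (s / s_max * 127).round().clamp(-128, 127).to(torch.int8)
        return q, q_scales, s_max

    def dequantize_blockwise_4bit(q, q_scales, s_max, shape):
        scales = (q_scales.float() / 127 * s_max).reshape(-1, 1)  # recover per-block scales
        return (q.float() / 7 * scales).reshape(shape)

    w = torch.randn(4096, 4096)
    q, q_scales, s_max = quantize_blockwise_4bit(w)
    w_hat = dequantize_blockwise_4bit(q, q_scales, s_max, w.shape)
    print((w - w_hat).abs().mean())  # small reconstruction error

The point of the second step is that naively storing one fp32 scale per 64 weights is itself a nontrivial overhead, so quantizing the scales (and keeping only a coarse fp32 scale per group) recovers most of that memory.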

docs/source/examples/qlora_finetune.rst: resolved review comment (outdated)
parameters are still held in the original precision, and activations, gradients, and optimizer states still exist in the higher precision to preserve
accuracy.

The `QLoRA paper <https://arxiv.org/abs/2305.14314>`_ introduces two key abstractions to decrease memory usage and avoid accuracy degradation: the bespoke 4-bit NormalFloat
Contributor:
nit: I feel like you're linking to the paper too many times, usually I just do it once upon introduction

Contributor:
nah, I think this is fine. you don't know which paragraph someone will start reading at if they're skimming or jumping around

Contributor:
I think you should link every other reference to the paper to keep people on their toes.

Comment on lines 39 to 41
quantization is done through the method highlighted in the original `QLoRA paper <https://arxiv.org/abs/2305.14314>`_. Adapter
parameters are still held in the original precision, and activations, gradients, and optimizer states still exist in the higher precision to preserve
accuracy.
Contributor:
One thing that'd be nice: basically take the diagram in the LoRA tutorial demonstrating full finetune -> LoRA and add one more for LoRA -> QLoRA. (The diagrams take a bit of time so I feel this is more of a nice-to-have at this point.) But if you're interested let me know and I can dig up the original

Member Author:
Definitely interested! Will punt this out to a follow up though.

docs/source/examples/qlora_finetune.rst: resolved review comment (outdated)
accuracy.

The `QLoRA paper <https://arxiv.org/abs/2305.14314>`_ introduces two key abstractions to decrease memory usage and avoid accuracy degradation: the bespoke 4-bit NormalFloat
type, and a double quantization method that quantizes the quantization parameters themselves to save even more memory. TorchTune uses
Contributor:
Is it worth giving some more details on either of these two optimizations, or do you think it's too in the weeds?

Member Author:
IMO it's too in the weeds and not worth re-explaining what's already in the paper. Can add a line directing folks to the paper, but IMO it's already clear enough to read the paper for this sort of detail.

docs/source/examples/qlora_finetune.rst: resolved review comment (outdated)
Comment on lines 179 to 185
in a typical LoRA training flow.

To achieve this, when using TorchTune's ``qlora_llama2_7b`` builder, we automatically register a hook, :code:`reparametrize_as_dtype_state_dict_post_hook`,
that runs after calling ``.state_dict()`` on the top level model. This hook converts ``NF4Tensors`` back to their original precision, while also offloading these
converted tensors to the CPU. This offloading is to avoid peaking memory by maintaining an entire bf16/fp32 copy of the ``state_dict``
on GPU, which could lead to potential OOMs during checkpoint save, even if memory is appropriately managed during
training.
Contributor:
Personally I think this whole section could be more easily illuminated via a code block with e.g. .state_dict() with peak memory printed without the state dict hook, then add the hook and do the same thing.
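In the spirit of that suggestion, a rough sketch of what such a measurement could look like is below. It assumes the ``qlora_llama2_7b`` builder is importable from ``torchtune.models.llama2`` and that, as described in the quoted text, the builder registers the state-dict hook for you; the exact numbers will vary by setup.

.. code-block:: python

    # Sketch (assumes torchtune.models.llama2.qlora_llama2_7b exists and, as
    # described above, registers reparametrize_as_dtype_state_dict_post_hook).
    # Measures peak GPU memory around a .state_dict() call.
    import torch
    from torchtune.models.llama2 import qlora_llama2_7b

    with torch.device("cuda"):
        model = qlora_llama2_7b()

    torch.cuda.reset_peak_memory_stats()
    sd = model.state_dict()  # hook upcasts NF4Tensors to the original dtype and offloads them to CPU
    peak_gib = torch.cuda.max_memory_allocated() / (1024 ** 3)
    print(f"peak GPU memory during state_dict(): {peak_gib:.2f} GiB")
    print(next(iter(sd.values())).device)  # expected: cpu, per the offloading described above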

Comment on lines 226 to 252
As well as during training:

.. code-block:: python

    Memory Stats::
        GPU peak memory allocation: 14.40 GB
        GPU peak memory reserved: 15.57 GB
        GPU peak memory active: 14.40 GB

Comparing to the memory usage during model initialization for QLoRA, we see about a 35% decrease in peak memory reserved:

.. code-block:: python

    Memory Stats after model init::
        GPU peak memory allocation: 7.36 GB
        GPU peak memory reserved: 9.13 GB
        GPU peak memory active: 7.36 GB

As well as a 40% decrease in peak memory reserved during training:

.. code-block:: python

    Memory Stats::
        GPU peak memory allocation: 5.54 GB
        GPU peak memory reserved: 9.29 GB
        GPU peak memory active: 5.54 GB
Contributor:
I'm a bit torn here.. while I think it's nice to print the raw output that someone will see in the logs, personally I would just do it one time, e.g. "you will see something like this" and then put it in a table (QLoRA, LoRA) x (peak init memory, peak training memory). Then it's a bit more digestible

Member Author:
Added table
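For reference, numbers like the ones quoted above can be read off PyTorch's built-in CUDA memory counters. This is a generic stand-in using ``torch.cuda`` APIs, not the logging helper the recipe itself uses.

.. code-block:: python

    # Generic stand-in for the "Memory Stats" lines above, built on torch.cuda
    # counters (not the recipe's own logging utility).
    import torch

    def log_memory_stats(prefix="Memory Stats"):
        gib = 1024 ** 3
        stats = torch.cuda.memory_stats()
        print(f"{prefix}::")
        print(f"  GPU peak memory allocation: {torch.cuda.max_memory_allocated() / gib:.2f} GB")
        print(f"  GPU peak memory reserved: {torch.cuda.max_memory_reserved() / gib:.2f} GB")
        print(f"  GPU peak memory active: {stats['active_bytes.all.peak'] / gib:.2f} GB")

    # e.g. call once after model init and once after a training step
    log_memory_stats("Memory Stats after model init")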

-----------------------------------------

Putting it all together, we can now finetune a model using TorchTune's `LoRA recipe <https://github.com/pytorch/torchtune/blob/48626d19d2108f92c749411fbd5f0ff140023a25/recipes/lora_finetune.py>`_,
with a `<QLoRA configuration https://github.com/pytorch/torchtune/blob/main/recipes/configs/llama2/7B_qlora_single_device.yaml>`_.
Contributor:
This hyperlink format is so annoying lol

Suggested change
with a `<QLoRA configuration https://github.com/pytorch/torchtune/blob/main/recipes/configs/llama2/7B_qlora_single_device.yaml>`_.
with a `QLoRA configuration <https://github.com/pytorch/torchtune/blob/main/recipes/configs/llama2/7B_qlora_single_device.yaml>`_.


.. code-block:: bash

    tune run lora_finetune_single_device --config recipes/configs/llama2/7B_qlora_single_device.yaml
Contributor:
Should be able to scrap the recipes/configs/ part here, right? (Same comment for the LoRA command below)

Member Author:
Done

docs/source/examples/qlora_finetune.rst: resolved review comment (outdated)

A comparison of the smoothed loss curves between QLoRA and LoRA can be seen below (purple being the QLoRA loss curve).

.. image:: /_static/img/qlora_experiment.png
Contributor:
Btw you can also add explicit labels to the two lines in the figure, e.g. iconic-pasma-57 -> LoRA and azure-bird-56 -> QLoRA

Contributor:
Sorry to be annoying, but can you filter the x-axis to [0, 1000] or something in wandb and reupload? Otherwise it looks weird that one is running longer.

Comment on lines 31 to 32
`QLoRA <https://arxiv.org/abs/2305.14314>`_ builds on top of `LoRA <https://arxiv.org/abs/2106.09685>`_ to enable additional
memory efficiency on top of LoRA. In LoRA, model parameters can be thought of as existing in two partitions: adapters, which are
Contributor:
nit: multiple instances of "on top of LoRA" is repetitive


In this tutorial, we'll learn about `QLoRA <https://arxiv.org/abs/2305.14314>`_, an enhancement on top of
`LoRA <https://arxiv.org/abs/2106.09685>`_ that maintains frozen model parameters in 4-bit quantized precision, thereby reducing memory usage. We'll
walk through how QLoRA can be utilized within TorchTune to finetune a Llama2-7b model in < 10 GB of memory.
Contributor:
nit: you should find and replace TorchTune -> torchtune as we have done it everywhere else since after you first opened this PR

`QLoRA <https://arxiv.org/abs/2305.14314>`_ builds on top of `LoRA <https://arxiv.org/abs/2106.09685>`_ to enable further
memory savings. In LoRA, model parameters can be thought of as existing in two partitions: adapters, which are
low-rank matrices added to different layers of a neural network, and base model parameters, which are parameters that are part of
the original model. In vanilla LoRA style training, both these parameters are held in the same precision (typically fp32 or bf16), and
Contributor:
nit

Suggested change
the original model. In vanilla LoRA style training, both these parameters are held in the same precision (typically fp32 or bf16), and
the original model. In vanilla LoRA-style training, both these parameters are held in the same precision (typically fp32 or bf16), and
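To make the adapter/base-parameter split concrete, here is a minimal generic LoRA linear layer in plain PyTorch. It is a sketch, not torchtune's implementation; the rank and alpha values below are arbitrary. Under QLoRA, the frozen base weight would additionally be stored in 4-bit quantized form.

.. code-block:: python

    # Minimal generic LoRA linear sketch (not torchtune's implementation).
    # The base weight is frozen; only the low-rank A/B adapters get gradients.
    import torch
    import torch.nn as nn

    class LoRALinearSketch(nn.Module):
        def __init__(self, in_dim, out_dim, rank=8, alpha=16.0):
            super().__init__()
            self.base = nn.Linear(in_dim, out_dim, bias=False)
            self.base.weight.requires_grad_(False)              # frozen base model parameter
            self.lora_a = nn.Linear(in_dim, rank, bias=False)   # trainable adapter
            self.lora_b = nn.Linear(rank, out_dim, bias=False)  # trainable adapter
            nn.init.zeros_(self.lora_b.weight)                  # adapters start as a no-op update
            self.scaling = alpha / rank

        def forward(self, x):
            return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

    layer = LoRALinearSketch(4096, 4096)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(trainable)  # only the adapter parameters (A and B) are trainable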

Comment on lines 45 to 46
the `NF4Tensor <https://github.com/pytorch-labs/ao/blob/b9beaf351e27133d189b57d6fa725b1a7824a457/torchao/dtypes/nf4tensor.py#L153>`_ abstraction from the `TorchAO library <https://github.com/pytorch-labs/ao>`_ to build QLoRA components as specified in the paper.
The `TorchAO library <https://github.com/pytorch-labs/ao>`_ is a PyTorch-native library that allows you to quantize and prune your models.
Contributor:
nit: I think we want TorchAO -> torchao
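For context, a sketch of what using that abstraction might look like is below. Only the ``NF4Tensor`` class is confirmed by the link above; the ``to_nf4`` helper and its import path are assumptions about the torchao version installed, so check against your local copy.

.. code-block:: python

    # Sketch of quantizing a weight with torchao's NF4 representation. Only
    # NF4Tensor is confirmed by the link above; to_nf4 and its import path are
    # assumptions about the installed torchao version.
    import torch
    from torchao.dtypes.nf4tensor import NF4Tensor, to_nf4

    weight = torch.randn(4096, 4096, dtype=torch.bfloat16)
    nf4_weight = to_nf4(weight)  # blockwise 4-bit NormalFloat with double-quantized scales
    print(isinstance(nf4_weight, NF4Tensor))  # True: weight now stored at roughly 4 bits per element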

Next, there are a couple of details essential to checkpointing (i.e. ``state_dict``) of QLoRA-enabled models.
To integrate well with TorchTune's :ref:`checkpointing <checkpointing_label>`, we need to convert ``NF4Tensors`` back to their
original precision (generally fp32/bf16). This allows QLoRA-trained checkpoints to interoperate well with the rest of the ecosystem, within
TorchTune and beyond (i.e. checkpoint conversion, post-training quantization, evaluation, inference). This conversion process also allows LoRA adapter weights to be merged back into the base model as done
Contributor:
nit. Also might remove checkpoint conversion since idk that it's really an ecosystem thing in the same way the other items are

Suggested change
TorchTune and beyond (i.e. checkpoint conversion, post-training quantization, evaluation, inference). This conversion process also allows LoRA adapter weights to be merged back into the base model as done
TorchTune and beyond (e.g. checkpoint conversion, post-training quantization, evaluation, inference). This conversion process also allows LoRA adapter weights to be merged back into the base model as done

Comment on lines 118 to 119
converted tensors to the CPU. This offloading is to avoid peaking memory by maintaining an entire bf16/fp32 copy of the ``state_dict``
on GPU.
Contributor:
nit: this sentence could be a little unclear, seemingly implying that the way we avoid peaking memory is by maintaining an entire bf16/fp32 copy on GPU.


.. code-block:: bash

    tune run lora_finetune_single_device --config llama2/7B_lora_single_device.yaml compile=True
Contributor:
Suggested change
tune run lora_finetune_single_device --config llama2/7B_lora_single_device.yaml compile=True
tune run lora_finetune_single_device --config llama2/7B_qlora_single_device.yaml compile=True

Member Author:
Oh shoot, .yaml is not correct; we need to remove that as well.
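Putting both fixes together (config name corrected to the QLoRA config, ``.yaml`` suffix dropped), the command would presumably end up as follows; this is reconstructed from the review comments above, not quoted from the final tutorial.

.. code-block:: bash

    tune run lora_finetune_single_device --config llama2/7B_qlora_single_device compile=True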


1|228|Loss: 0.8158286809921265: 1%| | 228/25880 [11:59<1:48:16, 3.95it/s

A comparison of the smoothed loss curves between QLoRA and LoRA can be seen below (purple being the QLoRA loss curve).
Contributor:
You can remove this now that you've added the legend

Suggested change
A comparison of the smoothed loss curves between QLoRA and LoRA can be seen below (purple being the QLoRA loss curve).
A comparison of the smoothed loss curves between QLoRA and LoRA can be seen below.

Comment on lines 274 to 275
In the next section, we'll learn about how to use QLoRA in TorchTune to build a QLoRA quantized Llama2-7b model, as well as some nuances around
checkpointing that are important to be aware of to avoid spiking memory usage.
Contributor:
No longer applicable

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

❗ No coverage uploaded for pull request base (main@5b0dc57).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #693   +/-   ##
=======================================
  Coverage        ?   26.69%           
=======================================
  Files           ?      145           
  Lines           ?     6147           
  Branches        ?        0           
=======================================
  Hits            ?     1641           
  Misses          ?     4506           
  Partials        ?        0           

☔ View full report in Codecov by Sentry.

@ebsmothers (Contributor) left a comment:
OK left a handful of other comments, please make sure they're addressed before landing. Modulo that I think this is looking good!

@rohan-varma rohan-varma merged commit 0914d5c into main Apr 15, 2024
27 checks passed
@joecummings joecummings deleted the qlora_tutorial branch April 16, 2024 02:11