
Separate LoRA recipe into single and multi GPU #454

Merged: 25 commits merged into main from the sep branch on Mar 8, 2024

Conversation

rohan-varma (Member) commented Mar 5, 2024

Context

  • For recipe UX, we'd like to offer separate recipes for single-GPU and multi-GPU finetunes, for a few reasons:
    • A dedicated single-GPU finetune lets us iterate specifically on that experience (UX, performance, and memory efficiency) while building for single-GPU users, without having to immediately worry about composability with FSDP or other distributed techniques.
    • Single-GPU users pay less "UX tax": the recipe code contains no FSDP wrapping or meta-device logic that's irrelevant to their use case, which makes it more concise, clearer, and easier to understand.
    • It lets us evolve the multi-GPU recipes to adopt techniques such as distributed_state_dict and async checkpointing, which are useful distributed techniques but aren't (immediately) planned for single-GPU usage.

Changelog

  • Separate the recipe into single GPU and multi-GPU finetuning scripts, each with their own configs.
  • Remove FSDP and distributed-init related logic from the single GPU finetune.
  • Remove device-specification related logic from the multi GPU finetune. Multi-device finetuning will initially not support CPU (it didn't in the past either).
  • Add some basic validation around --nnodes and --nproc-per-node in the tune CLI to ensure these variables are configured correctly.
  • For the single GPU finetune, removes all autocast-related logic and enables full bf16 training via a "full_bf16" flag, with the appropriate checks / validations to ensure it is working. An example command running with full_bf16 is:

    tune lora_finetune_single_device --config recipes/configs/alpaca_llama2_lora_finetune_single_device.yaml --override model_checkpoint=/home/rvarm1/local/dev/assets/llama2-7b-01242024 seed=18 tokenizer_checkpoint=/home/rvarm1/local/dev/assets/tokenizer.model output_dir=/tmp/lora_debug device=cuda batch_size=2 enable_activation_checkpointing=True full_bf16=True

    This runs in < 16 GB of VRAM.

Test plan

Single GPU
  • LoRA single GPU on this PR: tune lora_finetune_single_device --config recipes/configs/alpaca_llama2_lora_finetune_single_device.yaml --override model_checkpoint=/home/rvarm1/local/dev/assets/llama2-7b-01242024 seed=18 tokenizer_checkpoint=/home/rvarm1/local/dev/assets/tokenizer.model output_dir=/tmp/lora_debug device=cuda batch_size=2 enable_activation_checkpointing=False full_bf16=True - loss values: https://gist.github.com/rohan-varma/b2ef84e27d42814d60688a19d6a965ad
  • Current LoRA on single GPU: tune lora_finetune --config recipes/configs/alpaca_llama2_lora_finetune.yaml --override model_checkpoint=/home/rvarm1/local/dev/assets/llama2-7b-01242024 seed=18 tokenizer_checkpoint=/home/rvarm1/local/dev/assets/tokenizer.model output_dir=/tmp/lora_debug device=cuda:1 enable_fsdp=False - loss values: https://gist.github.com/rohan-varma/b3b0230f4a96c3e171a5a76ca8691e20
Multi GPU
  • LoRA multi gpu on this PR: tune --nproc-per-node 2 lora_finetune_distributed --config recipes/configs/alpaca_llama2_lora_finetune_distributed.yaml --override model_checkpoint=/home/rvarm1/local/dev/assets/llama2-7b-01242024 seed=18 tokenizer_checkpoint=/home/rvarm1/local/dev/assets/tokenizer.model output_dir=/tmp/lora_debug device=cuda batch_size=2 enable_activation_checkpointing=False, loss: https://gist.github.com/rohan-varma/52a1e754b783ba25644cb0d20afec026
  • LoRA multi gpu previously: tune --nnodes 1 --nproc_per_node 2 lora_finetune --config recipes/configs/alpaca_llama2_lora_finetune.yaml --override model_checkpoint=/home/rvarm1/local/dev/assets/llama2-7b-01242024 seed=18 tokenizer_checkpoint=/home/rvarm1/local/dev/assets/tokenizer.model output_dir=/tmp/lora_debug enable_fsdp=True &> out &, loss: https://gist.github.com/rohan-varma/8d2ea4d8a1070f8e06b165f77e2a9e19
Single CPU

(Note: multi-device finetuning on CPUs is unsupported.)
Can be launched with tune lora_finetune_single_device --config recipes/configs/alpaca_llama2_lora_finetune_single_device.yaml --override model_checkpoint=/home/rvarm1/local/dev/assets/llama2-7b-01242024 seed=18 tokenizer_checkpoint=/home/rvarm1/local/dev/assets/tokenizer.model output_dir=/tmp/lora_debug device=cpu

Recipe CI
  • test_lora_finetune is updated to verify loss values for both multi-GPU and single-GPU runs.

Discussion points

  • For single GPU, whether we still want to use a DistributedSampler with rank=0 and world_size=1 or not. Not using it simplifies the UX a little: since there's no distributed training, the user doesn't have to deal with a distributed sampler, we just pass a shuffle flag to the DataLoader, and there's no set_epoch call. On the other hand, keeping the DistributedSampler might make the switch to the distributed recipe a bit easier for a recipe author, since the data setup (in this particular case, map-style datasets with random sampling) wouldn't need to change when going from single GPU to distributed. For now, in the spirit of removing all distributed-related pieces, I've removed the use of DistributedSampler as well (see the sketch after this list).
  • Testing: for now, both the single- and multi-"GPU" tests run in CI via pytest parametrization, where the multi-GPU CI test actually runs on CPU with world size = 1 but enables FSDP. Proper distributed CI is still pending: Enable distributed CI #219
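To make the trade-off concrete, here's a minimal sketch of the two data-setup styles (toy dataset and parameter values are illustrative, not the recipe's actual code):

```
from torch.utils.data import DataLoader, Dataset, DistributedSampler

class ToyDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return idx

ds = ToyDataset()

# Distributed-style setup: the sampler shards and shuffles per rank, and the
# training loop must call sampler.set_epoch(epoch) at the start of every epoch.
sampler = DistributedSampler(ds, num_replicas=1, rank=0, shuffle=True)
distributed_loader = DataLoader(ds, batch_size=2, sampler=sampler)

# Single-device setup (what this PR does): no sampler and no set_epoch call;
# the DataLoader's own shuffle flag is enough.
single_device_loader = DataLoader(ds, batch_size=2, shuffle=True)
```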

Open items

@rohan-varma marked this pull request as draft March 5, 2024 23:20
@facebook-github-bot added the CLA Signed label Mar 5, 2024

@rohan-varma changed the title from "[WIP] Separate recipes into single and multi GPU" to "Separate LoRA recipe into single and multi GPU" Mar 6, 2024
Comment on lines 52 to 53
To finetune with LoRA, you can use the either the `lora_finetune_single_device` or `lora_finetune_multi_gpu ` recipes with the `alpaca_llama2_lora_finetune_single_device.yaml` or `alpaca_llama2_lora_finetune_multi_gpu.yaml` configs for single device and
multi-gpu finetune respectively. E.g. on two devices
Contributor

Suggested change
To finetune with LoRA, you can use the either the `lora_finetune_single_device` or `lora_finetune_multi_gpu ` recipes with the `alpaca_llama2_lora_finetune_single_device.yaml` or `alpaca_llama2_lora_finetune_multi_gpu.yaml` configs for single device and
multi-gpu finetune respectively. E.g. on two devices
You can finetune LoRA on a single GPU using the `lora_finetune_single_device` recipe with the `alpaca_llama2_lora_finetune_single_device.yaml` config. To do so on multiple GPUs, use `lora_finetune_multi_gpu ` with `alpaca_llama2_lora_finetune_multi_gpu.yaml`. E.g. on two devices.

lora_weights_state_dict: Optional[Dict[str, Any]] = None,
) -> nn.Module:
with self._device:
model = config.instantiate(cfg_model)
Contributor

Can we include the bf16 changes here? It should be as simple as model.to(torch.bfloat16)?

Member Author

Small nit, but since we're instantiating on GPU here, .to(bf16) would not be optimal for peak memory: the model would first be created in fp32, which is the default dtype.

We wanna do something like:

prev_dtype = torch.get_default_dtype()
torch.set_default_dtype(torch.bfloat16)
with self._device:
    model = config.instantiate(cfg_model)
torch.set_default_dtype(prev_dtype)

and also expose this via config. The config for this raises some interesting questions: if we have a flag like full_bf16, how will it play with our dtype flag (which controls autocast)? I.e. if the user sets full_bf16=True and dtype=fp32, should we throw, or let training proceed and have torch.autocast upcast some tensors to fp32 (which may help accuracy for some cases depending on the ops, but costs more memory due to the larger dtype)? Or should we keep things as simple as possible and ensure that if full_bf16 is true, the autocast dtype is also configured as bf16 (which essentially amounts to disabling autocast)? If we do that, then autocast is really only used when full_bf16=False (so base model dtypes are fp32) and we autocast down to fp16 / bf16.

Another way to view this is that we have a base_dtype and a compute_dtype: base_dtype is what our parameters are in by default (so bf16 for full bf16 finetuning), and compute_dtype is passed into the autocast context manager to control the dtype computation is actually done in. This might be useful as there's a clearly understandable separation between these two config flags.
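As a rough illustration of the full-bf16 init pattern, here is a hand-rolled helper; this is a sketch, not an existing torchtune utility:

```
import contextlib

import torch
import torch.nn as nn

@contextlib.contextmanager
def default_dtype(dtype: torch.dtype):
    """Temporarily swap torch's global default dtype, restoring it on exit."""
    prev = torch.get_default_dtype()
    torch.set_default_dtype(dtype)
    try:
        yield
    finally:
        torch.set_default_dtype(prev)

# Parameters are allocated directly in bf16, so peak memory never includes the
# fp32 copy that a post-hoc model.to(torch.bfloat16) would have materialized.
with default_dtype(torch.bfloat16):
    model = nn.Linear(16, 16)
assert model.weight.dtype == torch.bfloat16
```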

cfg_dataset,
tokenizer=self._tokenizer,
)
sampler = DistributedSampler(
Contributor

I saw some discussion around this. My vote would be to keep the DistributedSampler. This is slightly different from the DCP case since DS has been around a lot longer and there's tons of documentation and tutorials around how to use this, set it up for different use cases etc. I put it in the same bucket as using torchrun for the single device setting.

Another added benefit (at least in my mind) is that with all of the FSDP-related complexity around state dict management for LoRA multi-device, it might be easier for someone to just set the number of GPUs on this recipe to 2 for a DDP use case. I might be wrong here; would that likely just run out of the box? We don't need to prioritize that change, but if using the DistributedSampler helps a user play around with it, then given the above factors my vote would go to keeping it.

Member Author

For DDP, we would need to add some logic to wrap the model with DDP itself, which would likely be simpler than FSDP as there's less configuration. But we'd need explicit DDP support here, which IMO would make the "single device" name a bit confusing.

Contributor

Yeah if we're casting votes here I would push for (a) no DistributedSampler and (b) no DDP in this recipe. Imo single device should mean single device through and through. But my opinions are not a blocker for this PR

@@ -49,16 +49,18 @@ tune --nnodes 1 --nproc_per_node 4 finetune_llm --config alpaca_llama2_finetune

### LoRA finetune

To finetune with LoRA, you can use the `finetune_lora` recipe with the `alpaca_llama2_lora_finetune.yaml` config. E.g. on two devices
To finetune with LoRA, you can use the either the `lora_finetune_single_device` or `lora_finetune_multi_gpu ` recipes with the `alpaca_llama2_lora_finetune_single_device.yaml` or `alpaca_llama2_lora_finetune_multi_gpu.yaml` configs for single device and
Contributor

nit: I find the distinction single_device vs multi_gpu a bit unclear (specifically the switch from "device" to "gpu"). While I understand the rationale for naming it this way, could we change multi_gpu -> distributed?

Member Author

Yeah, I didn't go with multi_device since the multi-GPU recipe doesn't support devices beyond GPU. But this makes sense.

kwargs_values["model"].update(
{
"lora_attn_modules": test_lora_attn_modules,
"apply_lora_to_mlp": False,
"lora_rank": 8,
"lora_alpha": 16,
# Note: multi-gpu just signifies to run the
Contributor

Thanks for adding this note!

cfg_dataset: DictConfig,
shuffle: bool,
batch_size: int,
) -> DataLoader:
Contributor

Should this be Tuple[Sampler, DataLoader]?

Contributor

bump
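Something along these lines, i.e. a sketch of just the corrected annotation; the actual recipe helper and its body may differ:

```
from typing import Tuple

from omegaconf import DictConfig
from torch.utils.data import DataLoader, DistributedSampler

def _setup_data(
    self,
    cfg_dataset: DictConfig,
    shuffle: bool,
    batch_size: int,
) -> Tuple[DistributedSampler, DataLoader]:
    # Body elided; the point is only that both the sampler and the
    # dataloader are returned, so the annotation should reflect that.
    ...
```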

@@ -34,6 +34,7 @@

FSDPPolicyType: Type = Callable[[nn.Module, bool, int], bool]

_valid_distributed_single_node_nnodes = ["1:1", "1"]
Contributor

Sorry dumb q: what is this 1:1 business?

Member Author

1:1 is torchrun's default for --nnodes and relates to elastic autoscaling: it specifies the minimum and maximum number of nodes, both 1 in this case.
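Roughly the kind of check the CLI adds; this is an illustrative sketch where everything besides _valid_distributed_single_node_nnodes (function name, argument handling, error messages) is made up, not the actual tune CLI code:

```
_valid_distributed_single_node_nnodes = ["1:1", "1"]

def validate_single_node_launch(nnodes: str, nproc_per_node: str) -> None:
    """Illustrative validation: these distributed recipes are single-node only."""
    if nnodes not in _valid_distributed_single_node_nnodes:
        raise ValueError(
            f"--nnodes must be one of {_valid_distributed_single_node_nnodes} "
            f"for single-node finetuning, got {nnodes!r}"
        )
    # torchrun also accepts symbolic values such as "gpu" or "auto" here.
    if nproc_per_node not in ("gpu", "auto", "cpu") and int(nproc_per_node) < 1:
        raise ValueError(f"--nproc-per-node must be >= 1, got {nproc_per_node!r}")

# Example: this would raise, since launching across 2 nodes isn't supported.
# validate_single_node_launch(nnodes="2", nproc_per_node="8")
```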

# Dataset and Sampler
dataset:
_component_: torchtune.datasets.AlpacaDataset
train_on_input: False
Contributor

Shall we make this default to True, as @kartikayk suggested?

# Tokenizer
tokenizer:
_component_: torchtune.models.llama2.llama2_tokenizer
path: /home/rvarm1/local/dev/assets/tokenizer.model
Contributor

Shall we make this path more general?

Member Author

Will revert all the debug vestiges prior to landing.

```

FSDP and activation checkpointing are enabled by default, and LoRA weights are added to the Q and V projections in self-attention. If you additionally want to apply LoRA to K and would like to reduce the LoRA rank from the default of eight, you can run
For both recipes, activation checkpointing is enabled by default, and LoRA weights are added to the Q and V projections in self-attention. FSDP is enabled by default for
multi-gpu recipe. If you additionally want to apply LoRA to K and would like to reduce the LoRA rank from the default of eight, you can run
Member Author

nit (self-review): "multi-gpu recipe" here should read "distributed recipe".

SLR722 (Contributor) commented Mar 7, 2024

Would you please test with enable_activation_checkpointing=False once you've finished all the debugging? The current implementation seems to OOM on my side when activation checkpointing is disabled.

@rohan-varma marked this pull request as ready for review March 7, 2024 16:59
ebsmothers (Contributor) left a comment

Left a bunch of comments but nothing major. You'll need to make some changes to the recipe test to get that passing, lmk if you need any help with that (and sorry in advance for the annoyance). Modulo that, looks good! Will stamp once CI is green

@@ -49,16 +49,17 @@ tune --nnodes 1 --nproc_per_node 4 finetune_llm --config alpaca_llama2_finetune

### LoRA finetune

To finetune with LoRA, you can use the `finetune_lora` recipe with the `alpaca_llama2_lora_finetune.yaml` config. E.g. on two devices
You can finetune LoRA on a single GPU using the `lora_finetune_single_device` recipe with the `alpaca_llama2_lora_finetune_single_device.yaml` config. To do so on multiple GPUs, use `lora_finetune_distributed ` with `alpaca_llama2_lora_finetune_distributed.yaml`. E.g. on two devices, you can run the following:
Contributor

nit: do we wanna say You can finetune LoRA on a "single GPU" or "single device"? No strong preference myself, just wanna make sure we're clear on that

Contributor

Also, might even be worth it to explicitly add one sentence like "we provide separate recipes and configs for fine-tuning LoRA on a single device vs. multiple devices as each case necessitates different training techniques" (idk, something like that)

Member Author

Done!

with self._device:
model = config.instantiate(cfg_model)

# Note: this needs to be set before wrapping with FSDP
Contributor

No longer relevant 😃

cfg_dataset: DictConfig,
shuffle: bool,
batch_size: int,
) -> DataLoader:
Contributor

bump

input_ids = input_ids.to(self._device)
labels = labels.to(self._device)

logits = self._model(input_ids)
Contributor

Did we get rid of autocast here? I thought from the docstring we still support mixed precision too, lmk if I'm misunderstanding though

Member Author

Yeah, for single device at least we're getting rid of autocast in favor of full bf16. The next PR making the distributed recipe consistent with full bf16 will include some more thinking around our autocast / mixed precision offerings, but overall I think we should just take autocast out.
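For concreteness, here's a minimal sketch of a training step once autocast is gone, with everything in bf16. The toy model and the fp32 loss cast are illustrative choices, not necessarily what the recipe does:

```
import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 4).to(device=device, dtype=torch.bfloat16)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

inputs = torch.randn(2, 8, device=device, dtype=torch.bfloat16)
labels = torch.randint(0, 4, (2,), device=device)

# No torch.autocast context: parameters, activations, and gradients are all bf16.
logits = model(inputs)
loss = F.cross_entropy(logits.float(), labels)  # loss computed in fp32 for stability
loss.backward()
optimizer.step()
optimizer.zero_grad()
```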

Comment on lines 11 to 12
from recipes.lora_finetune_distributed import LoRAFinetuneDistributedRecipe
from recipes.lora_finetune_single_device import LoRAFinetuneRecipeSingleDevice
Contributor

👋

Comment on lines 16 to 17
from recipes.lora_finetune_distributed import LoRAFinetuneDistributedRecipe
from recipes.lora_finetune_single_gpu import LoRAFinetuneSingleDeviceRecipe
Contributor

This will fail, need to check out the newest version of these tests

Member Author

sorry, could you clarify why this is expected to fail?

Member Author

oh. It's totally changed

@@ -82,3 +84,19 @@ def test_set_float32_precision(self) -> None:
assert torch.get_float32_matmul_precision() == "high"
assert torch.backends.cudnn.allow_tf32
assert torch.backends.cuda.matmul.allow_tf32

def test_set_default_dtype(self):
dtype = torch.bfloat16
Contributor

Dumb q: will torch.get_default_dtype() ever naturally return bf16?

Member Author

Only if the user calls torch.set_default_dtype(torch.bfloat16) beforehand, though I guess that isn't "naturally"?
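Roughly the behavior such a test pins down; this is a self-contained sketch using torch directly, and the actual test in this PR may exercise a torchtune utility and differ in its details:

```
import torch

def test_default_dtype_roundtrip():
    # Out of the box the default is fp32; get_default_dtype() only returns
    # bf16 if something explicitly set it.
    assert torch.get_default_dtype() == torch.float32

    prev = torch.get_default_dtype()
    torch.set_default_dtype(torch.bfloat16)
    try:
        assert torch.get_default_dtype() == torch.bfloat16
        assert torch.empty(1).dtype == torch.bfloat16  # new tensors pick up the default
    finally:
        torch.set_default_dtype(prev)

    assert torch.get_default_dtype() == torch.float32
```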

SLR722 (Contributor) commented Mar 7, 2024

Do we also want to add true bf16 support for distributed recipe?

rohan-varma (Member Author) replied

Do we also want to add true bf16 support for distributed recipe?

Definitely! We will address this systematically in a follow-up PR: bf16 for distributed, plus a clear story around autocast vs. pure bf16.

@rohan-varma merged commit 01891fc into main Mar 8, 2024
17 checks passed
@joecummings deleted the sep branch March 18, 2024 20:51