
Separate LoRA recipe into single and multi GPU #454

Merged: 25 commits merged into main from the sep branch on Mar 8, 2024

Conversation

rohan-varma (Member) commented Mar 5, 2024

Context

  • For recipe UX, we'd like to offer separate recipes for single-GPU and multi-GPU finetunes, for a few reasons:
    • A dedicated single-GPU finetune lets us iterate specifically on that experience (UX, performance, and memory efficiency) while building for single-GPU users, without having to immediately worry about composability with FSDP or other distributed techniques.
    • Single-GPU users pay less "UX tax": the recipe code contains no FSDP wrapping or meta-device logic that's irrelevant to their use case, which makes it more concise, clearer, and easier to understand.
    • It lets us evolve the multi-GPU recipes to adopt techniques such as distributed_state_dict and async checkpointing, which are useful distributed techniques but aren't (immediately) planned for single-GPU usage.

Changelog

  • Separate the recipe into single GPU and multi-GPU finetuning scripts, each with their own configs.
  • Remove FSDP and distributed-init related logic from the single GPU finetune.
  • Remove device-specification related logic from the multi GPU finetune. Multi-device finetuning will initially not support CPU (it didn't in the past either).
  • Add some basic validation around --nnodes and --nproc-per-node in the tune CLI to ensure these variables are configured correctly.
  • For the single GPU finetune, removes all autocast-related logic and enables full bf16 training via a "full_bf16" flag, with the appropriate checks / validations to ensure it is working. An example command running with full_bf16 is:

    tune lora_finetune_single_device --config recipes/configs/alpaca_llama2_lora_finetune_single_device.yaml --override model_checkpoint=/home/rvarm1/local/dev/assets/llama2-7b-01242024 seed=18 tokenizer_checkpoint=/home/rvarm1/local/dev/assets/tokenizer.model output_dir=/tmp/lora_debug device=cuda batch_size=2 enable_activation_checkpointing=True full_bf16=True

    This runs in < 16 GB of VRAM.

Test plan

Single GPU
  • LoRA single GPU on this PR: tune lora_finetune_single_device --config recipes/configs/alpaca_llama2_lora_finetune_single_device.yaml --override model_checkpoint=/home/rvarm1/local/dev/assets/llama2-7b-01242024 seed=18 tokenizer_checkpoint=/home/rvarm1/local/dev/assets/tokenizer.model output_dir=/tmp/lora_debug device=cuda batch_size=2 enable_activation_checkpointing=False full_bf16=True - loss values: https://gist.github.com/rohan-varma/b2ef84e27d42814d60688a19d6a965ad
  • Current LoRA on single GPU: tune lora_finetune --config recipes/configs/alpaca_llama2_lora_finetune.yaml --override model_checkpoint=/home/rvarm1/local/dev/assets/llama2-7b-01242024 seed=18 tokenizer_checkpoint=/home/rvarm1/local/dev/assets/tokenizer.model output_dir=/tmp/lora_debug device=cuda:1 enable_fsdp=False - loss values: https://gist.github.com/rohan-varma/b3b0230f4a96c3e171a5a76ca8691e20
Multi GPU
  • LoRA multi gpu on this PR: tune --nproc-per-node 2 lora_finetune_distributed --config recipes/configs/alpaca_llama2_lora_finetune_distributed.yaml --override model_checkpoint=/home/rvarm1/local/dev/assets/llama2-7b-01242024 seed=18 tokenizer_checkpoint=/home/rvarm1/local/dev/assets/tokenizer.model output_dir=/tmp/lora_debug device=cuda batch_size=2 enable_activation_checkpointing=False, loss: https://gist.github.com/rohan-varma/52a1e754b783ba25644cb0d20afec026
  • LoRA multi gpu previously: tune --nnodes 1 --nproc_per_node 2 lora_finetune --config recipes/configs/alpaca_llama2_lora_finetune.yaml --override model_checkpoint=/home/rvarm1/local/dev/assets/llama2-7b-01242024 seed=18 tokenizer_checkpoint=/home/rvarm1/local/dev/assets/tokenizer.model output_dir=/tmp/lora_debug enable_fsdp=True &> out &, loss: https://gist.github.com/rohan-varma/8d2ea4d8a1070f8e06b165f77e2a9e19
Single CPU

(Note: multi-device finetuning on CPUs is unsupported.)
Can be launched with tune lora_finetune_single_device --config recipes/configs/alpaca_llama2_lora_finetune_single_device.yaml --override model_checkpoint=/home/rvarm1/local/dev/assets/llama2-7b-01242024 seed=18 tokenizer_checkpoint=/home/rvarm1/local/dev/assets/tokenizer.model output_dir=/tmp/lora_debug device=cpu

Recipe CI
  • test_lora_finetune is updated to verify loss values for both multi-GPU and single-GPU runs.

Discussion points

  • For single GPU, whether we still want to use a DistributedSampler with rank=0 and world_size=1 or not. Not using it simplifies the UX a little: since there's no distributed training, the user doesn't have to deal with a distributed sampler, we just pass a shuffle flag to the DataLoader, and there's no set_epoch call. On the other hand, keeping the DistributedSampler might make the switch to the distributed recipe a bit easier for a recipe author, since the data setup (in this particular case, map-style datasets with random sampling) wouldn't need to change when going from single GPU to distributed. For now, in the spirit of removing all distributed-related pieces, I've removed the use of DistributedSampler as well (see the sketch after this list).
  • Testing: for now, both the single- and multi-"GPU" tests run in CI via pytest parametrization, where the multi-GPU CI test actually runs on CPU with world size = 1 but enables FSDP. Proper distributed CI is still pending: Enable distributed CI #219
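To make the trade-off concrete, here's a minimal sketch of the two data-setup styles (toy dataset and parameter values are illustrative, not the recipe's actual code):

```
from torch.utils.data import DataLoader, Dataset, DistributedSampler

class ToyDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return idx

ds = ToyDataset()

# Distributed-style setup: the sampler shards and shuffles per rank, and the
# training loop must call sampler.set_epoch(epoch) at the start of every epoch.
sampler = DistributedSampler(ds, num_replicas=1, rank=0, shuffle=True)
distributed_loader = DataLoader(ds, batch_size=2, sampler=sampler)

# Single-device setup (what this PR does): no sampler and no set_epoch call;
# the DataLoader's own shuffle flag is enough.
single_device_loader = DataLoader(ds, batch_size=2, shuffle=True)
```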

Open items

@rohan-varma marked this pull request as draft March 5, 2024 23:20
@facebook-github-bot added the CLA Signed label Mar 5, 2024

@rohan-varma changed the title from "[WIP] Separate recipes into single and multi GPU" to "Separate LoRA recipe into single and multi GPU" Mar 6, 2024
Comment on lines 52 to 53
To finetune with LoRA, you can use the either the `lora_finetune_single_device` or `lora_finetune_multi_gpu ` recipes with the `alpaca_llama2_lora_finetune_single_device.yaml` or `alpaca_llama2_lora_finetune_multi_gpu.yaml` configs for single device and
multi-gpu finetune respectively. E.g. on two devices
Contributor

Suggested change
To finetune with LoRA, you can use the either the `lora_finetune_single_device` or `lora_finetune_multi_gpu ` recipes with the `alpaca_llama2_lora_finetune_single_device.yaml` or `alpaca_llama2_lora_finetune_multi_gpu.yaml` configs for single device and
multi-gpu finetune respectively. E.g. on two devices
You can finetune LoRA on a single GPU using the `lora_finetune_single_device` recipe with the `alpaca_llama2_lora_finetune_single_device.yaml` config. To do so on multiple GPUs, use `lora_finetune_multi_gpu ` with `alpaca_llama2_lora_finetune_multi_gpu.yaml`. E.g. on two devices.

lora_weights_state_dict: Optional[Dict[str, Any]] = None,
) -> nn.Module:
with self._device:
model = config.instantiate(cfg_model)
Contributor

Can we include the bf16 changes here? It should be as simple as model.to(torch.bfloat16)?

Member Author

Small nit, but since we're instantiating on GPU here, .to(bf16) would not be optimal for peak memory: the model would first be created in fp32, which is the default dtype.

We wanna do something like:

prev_dtype = torch.get_default_dtype()
torch.set_default_dtype(torch.bfloat16)
with self._device:
    model = config.instantiate(cfg_model)
torch.set_default_dtype(prev_dtype)

and also expose this via config. The config for this raises some interesting questions: if we have a flag like full_bf16, how will it play with our dtype flag (which controls autocast)? I.e. if the user sets full_bf16=True and dtype=fp32, should we throw, or let training proceed and have torch.autocast upcast some tensors to fp32 (which may help accuracy for some cases depending on the ops, but costs more memory due to the larger dtype)? Or should we keep things as simple as possible and ensure that if full_bf16 is true, the autocast dtype is also configured as bf16 (which essentially amounts to disabling autocast)? If we do that, then autocast is really only used when full_bf16=False (so base model dtypes are fp32) and we autocast down to fp16 / bf16.

Another way to view this is that we have a base_dtype and a compute_dtype: base_dtype is what our parameters are in by default (so bf16 for full bf16 finetuning), and compute_dtype is passed into the autocast context manager to control the dtype computation is actually done in. This might be useful as there's a clearly understandable separation between these two config flags.
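As a rough illustration of the full-bf16 init pattern, here is a hand-rolled helper; this is a sketch, not an existing torchtune utility:

```
import contextlib

import torch
import torch.nn as nn

@contextlib.contextmanager
def default_dtype(dtype: torch.dtype):
    """Temporarily swap torch's global default dtype, restoring it on exit."""
    prev = torch.get_default_dtype()
    torch.set_default_dtype(dtype)
    try:
        yield
    finally:
        torch.set_default_dtype(prev)

# Parameters are allocated directly in bf16, so peak memory never includes the
# fp32 copy that a post-hoc model.to(torch.bfloat16) would have materialized.
with default_dtype(torch.bfloat16):
    model = nn.Linear(16, 16)
assert model.weight.dtype == torch.bfloat16
```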

cfg_dataset,
tokenizer=self._tokenizer,
)
sampler = DistributedSampler(
Contributor

I saw some discussion around this. My vote would be to keep the DistributedSampler. This is slightly different from the DCP case since DS has been around a lot longer and there's tons of documentation and tutorials around how to use this, set it up for different use cases etc. I put it in the same bucket as using torchrun for the single device setting.

Another added benefit (at least in my mind) is that with all of the FSDP-related complexity around state dict management for LoRA multi-device, it might be easier for someone to just set the number of GPUs on this recipe to 2 for a DDP use case. I might be wrong here; would that likely just run out of the box? We don't need to prioritize that change, but if using the DistributedSampler helps a user play around with it, then given the above factors my vote would go to keeping it.

Member Author

For DDP, we would need to add some logic to wrap the model with DDP itself, which would likely be simpler than FSDP as there's less configuration. But we'd need explicit DDP support here, which IMO would make the "single device" name a bit confusing.

Contributor

Yeah if we're casting votes here I would push for (a) no DistributedSampler and (b) no DDP in this recipe. Imo single device should mean single device through and through. But my opinions are not a blocker for this PR

@@ -49,16 +49,18 @@ tune --nnodes 1 --nproc_per_node 4 finetune_llm --config alpaca_llama2_finetune

### LoRA finetune

To finetune with LoRA, you can use the `finetune_lora` recipe with the `alpaca_llama2_lora_finetune.yaml` config. E.g. on two devices
To finetune with LoRA, you can use the either the `lora_finetune_single_device` or `lora_finetune_multi_gpu ` recipes with the `alpaca_llama2_lora_finetune_single_device.yaml` or `alpaca_llama2_lora_finetune_multi_gpu.yaml` configs for single device and
Contributor

nit: I find the distinction single_device vs multi_gpu a bit unclear (specifically the switch from "device" to "gpu"). While I understand the rationale for naming it this way, could we change multi_gpu -> distributed?

Member Author

Yeah, I didn't go with multi_device since the multi-GPU recipe doesn't support devices beyond GPU. But this makes sense.

kwargs_values["model"].update(
{
"lora_attn_modules": test_lora_attn_modules,
"apply_lora_to_mlp": False,
"lora_rank": 8,
"lora_alpha": 16,
# Note: multi-gpu just signifies to run the
Contributor

Thanks for adding this note!

cfg_dataset: DictConfig,
shuffle: bool,
batch_size: int,
) -> DataLoader:
Contributor

Should this be Tuple[Sampler, DataLoader]?

Contributor

bump
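Something along these lines, i.e. a sketch of just the corrected annotation; the actual recipe helper and its body may differ:

```
from typing import Tuple

from omegaconf import DictConfig
from torch.utils.data import DataLoader, DistributedSampler

def _setup_data(
    self,
    cfg_dataset: DictConfig,
    shuffle: bool,
    batch_size: int,
) -> Tuple[DistributedSampler, DataLoader]:
    # Body elided; the point is only that both the sampler and the
    # dataloader are returned, so the annotation should reflect that.
    ...
```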

@@ -34,6 +34,7 @@

FSDPPolicyType: Type = Callable[[nn.Module, bool, int], bool]

_valid_distributed_single_node_nnodes = ["1:1", "1"]
Contributor

Sorry dumb q: what is this 1:1 business?

Member Author

1:1 is torchrun's default for --nnodes and relates to elastic autoscaling: it specifies the minimum and maximum number of nodes, both 1 in this case.
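Roughly the kind of check the CLI adds; this is an illustrative sketch where everything besides _valid_distributed_single_node_nnodes (function name, argument handling, error messages) is made up, not the actual tune CLI code:

```
_valid_distributed_single_node_nnodes = ["1:1", "1"]

def validate_single_node_launch(nnodes: str, nproc_per_node: str) -> None:
    """Illustrative validation: these distributed recipes are single-node only."""
    if nnodes not in _valid_distributed_single_node_nnodes:
        raise ValueError(
            f"--nnodes must be one of {_valid_distributed_single_node_nnodes} "
            f"for single-node finetuning, got {nnodes!r}"
        )
    # torchrun also accepts symbolic values such as "gpu" or "auto" here.
    if nproc_per_node not in ("gpu", "auto", "cpu") and int(nproc_per_node) < 1:
        raise ValueError(f"--nproc-per-node must be >= 1, got {nproc_per_node!r}")

# Example: this would raise, since launching across 2 nodes isn't supported.
# validate_single_node_launch(nnodes="2", nproc_per_node="8")
```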

# Dataset and Sampler
dataset:
_component_: torchtune.datasets.AlpacaDataset
train_on_input: False
Contributor

Shall we make this default to True, as @kartikayk suggested?

# Tokenizer
tokenizer:
_component_: torchtune.models.llama2.llama2_tokenizer
path: /home/rvarm1/local/dev/assets/tokenizer.model
Contributor

Shall we make this path more general?

Member Author

Will revert all the debug vestiges prior to landing.

```

FSDP and activation checkpointing are enabled by default, and LoRA weights are added to the Q and V projections in self-attention. If you additionally want to apply LoRA to K and would like to reduce the LoRA rank from the default of eight, you can run
For both recipes, activation checkpointing is enabled by default, and LoRA weights are added to the Q and V projections in self-attention. FSDP is enabled by default for
multi-gpu recipe. If you additionally want to apply LoRA to K and would like to reduce the LoRA rank from the default of eight, you can run
Member Author

nit (self-review): "multi-gpu recipe" here should read "distributed recipe".

SLR722 (Contributor) commented Mar 7, 2024

Would you please test with enable_activation_checkpointing=False once you've finished all the debugging? The current implementation seems to OOM on my side when activation checkpointing is disabled.

@rohan-varma marked this pull request as ready for review March 7, 2024 16:59
ebsmothers (Contributor) left a comment

Left a bunch of comments but nothing major. You'll need to make some changes to the recipe test to get that passing, lmk if you need any help with that (and sorry in advance for the annoyance). Modulo that, looks good! Will stamp once CI is green

@@ -49,16 +49,17 @@ tune --nnodes 1 --nproc_per_node 4 finetune_llm --config alpaca_llama2_finetune

### LoRA finetune

To finetune with LoRA, you can use the `finetune_lora` recipe with the `alpaca_llama2_lora_finetune.yaml` config. E.g. on two devices
You can finetune LoRA on a single GPU using the `lora_finetune_single_device` recipe with the `alpaca_llama2_lora_finetune_single_device.yaml` config. To do so on multiple GPUs, use `lora_finetune_distributed ` with `alpaca_llama2_lora_finetune_distributed.yaml`. E.g. on two devices, you can run the following:
Contributor

nit: do we wanna say You can finetune LoRA on a "single GPU" or "single device"? No strong preference myself, just wanna make sure we're clear on that

Contributor

Also, might even be worth it to explicitly add one sentence like "we provide separate recipes and configs for fine-tuning LoRA on a single device vs. multiple devices as each case necessitates different training techniques" (idk, something like that)

Member Author

Done!

with self._device:
model = config.instantiate(cfg_model)

# Note: this needs to be set before wrapping with FSDP
Contributor

No longer relevant 😃

cfg_dataset: DictConfig,
shuffle: bool,
batch_size: int,
) -> DataLoader:
Contributor

bump

input_ids = input_ids.to(self._device)
labels = labels.to(self._device)

logits = self._model(input_ids)
Contributor

Did we get rid of autocast here? I thought from the docstring we still support mixed precision too, lmk if I'm misunderstanding though

Member Author

Yeah, for single device at least we're getting rid of autocast in favor of full bf16. The next PR making the distributed recipe consistent with full bf16 will include some more thinking around our autocast / mixed precision offerings, but overall I think we should just take autocast out.
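For concreteness, here's a minimal sketch of a training step once autocast is gone, with everything in bf16. The toy model and the fp32 loss cast are illustrative choices, not necessarily what the recipe does:

```
import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 4).to(device=device, dtype=torch.bfloat16)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

inputs = torch.randn(2, 8, device=device, dtype=torch.bfloat16)
labels = torch.randint(0, 4, (2,), device=device)

# No torch.autocast context: parameters, activations, and gradients are all bf16.
logits = model(inputs)
loss = F.cross_entropy(logits.float(), labels)  # loss computed in fp32 for stability
loss.backward()
optimizer.step()
optimizer.zero_grad()
```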

Comment on lines 11 to 12
from recipes.lora_finetune_distributed import LoRAFinetuneDistributedRecipe
from recipes.lora_finetune_single_device import LoRAFinetuneRecipeSingleDevice
Contributor

👋

Comment on lines 16 to 17
from recipes.lora_finetune_distributed import LoRAFinetuneDistributedRecipe
from recipes.lora_finetune_single_gpu import LoRAFinetuneSingleDeviceRecipe
Contributor

This will fail, need to check out the newest version of these tests

Member Author

sorry, could you clarify why this is expected to fail?

Member Author

oh. It's totally changed

@@ -82,3 +84,19 @@ def test_set_float32_precision(self) -> None:
assert torch.get_float32_matmul_precision() == "high"
assert torch.backends.cudnn.allow_tf32
assert torch.backends.cuda.matmul.allow_tf32

def test_set_default_dtype(self):
dtype = torch.bfloat16
Contributor

Dumb q: will torch.get_default_dtype() ever naturally return bf16?

Member Author

Only if the user calls torch.set_default_dtype(torch.bfloat16) beforehand, though I guess that isn't "naturally"?
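Roughly the behavior such a test pins down; this is a self-contained sketch using torch directly, and the actual test in this PR may exercise a torchtune utility and differ in its details:

```
import torch

def test_default_dtype_roundtrip():
    # Out of the box the default is fp32; get_default_dtype() only returns
    # bf16 if something explicitly set it.
    assert torch.get_default_dtype() == torch.float32

    prev = torch.get_default_dtype()
    torch.set_default_dtype(torch.bfloat16)
    try:
        assert torch.get_default_dtype() == torch.bfloat16
        assert torch.empty(1).dtype == torch.bfloat16  # new tensors pick up the default
    finally:
        torch.set_default_dtype(prev)

    assert torch.get_default_dtype() == torch.float32
```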

SLR722 (Contributor) commented Mar 7, 2024

Do we also want to add true bf16 support for distributed recipe?

rohan-varma (Member Author) replied

Do we also want to add true bf16 support for distributed recipe?

Definitely! We will address this systematically in a follow-up PR: bf16 for distributed, plus a clear story around autocast vs. pure bf16.

@rohan-varma merged commit 01891fc into main Mar 8, 2024
17 checks passed
@joecummings deleted the sep branch March 18, 2024 20:51