Separate LoRA recipe into single and multi GPU #454
Conversation
recipes/README.md
Outdated
To finetune with LoRA, you can use the either the `lora_finetune_single_device` or `lora_finetune_multi_gpu ` recipes with the `alpaca_llama2_lora_finetune_single_device.yaml` or `alpaca_llama2_lora_finetune_multi_gpu.yaml` configs for single device and
multi-gpu finetune respectively. E.g. on two devices
Suggested change:
You can finetune LoRA on a single GPU using the `lora_finetune_single_device` recipe with the `alpaca_llama2_lora_finetune_single_device.yaml` config. To do so on multiple GPUs, use `lora_finetune_multi_gpu ` with `alpaca_llama2_lora_finetune_multi_gpu.yaml`. E.g. on two devices.
    lora_weights_state_dict: Optional[Dict[str, Any]] = None,
) -> nn.Module:
    with self._device:
        model = config.instantiate(cfg_model)
Can we include the bf16 changes here? It should be as simple as model.to(torch.bfloat16)?
Small nit, but since we're instantiating on GPU here, `.to(bf16)` would not be optimal for peak memory as the model would still be created in fp32, which is the default dtype. We wanna do:

torch.set_default_dtype(bf16)
with self._device:
    model = instantiate()
torch.set_default_dtype(prev_dtype)
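As a self-contained sketch of that pattern (the `build_model` callable below just stands in for `config.instantiate(cfg_model)`, and the try/finally ensures the previous default dtype is restored even if instantiation throws):

```python
from typing import Callable

import torch
import torch.nn as nn


def instantiate_in_bf16(build_model: Callable[[], nn.Module], device: torch.device) -> nn.Module:
    # Swap the default dtype so parameters are allocated in bf16 from the start,
    # avoiding the transient fp32 copy that a post-hoc .to(torch.bfloat16) would create.
    prev_dtype = torch.get_default_dtype()
    torch.set_default_dtype(torch.bfloat16)
    try:
        with device:  # torch.device works as a context manager in recent PyTorch
            model = build_model()
    finally:
        torch.set_default_dtype(prev_dtype)
    return model


# Hypothetical usage inside the recipe:
# model = instantiate_in_bf16(lambda: config.instantiate(cfg_model), self._device)
```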
and also expose this via config. The config for this raises some interesting questions - if we have a flag like `full_bf16`, how will this play with our `dtype` flag? i.e. if the user sets `full_bf16=True` and `dtype=fp32` (which controls autocast), should we throw or let training proceed and have torch.autocast upcast some tensors to fp32 (this may help accuracy for some cases depending on the ops, but cost more memory due to the larger dtype)? Should we just make things as simple as possible and ensure that if `full_bf16` is true, the autocast dtype should also be configured as bf16 (which essentially amounts to disabling autocast)? If we do that, then autocast is really only used when `full_bf16=False` (so base model dtypes are in fp32) and we autocast down to fp16 / bf16.
Another way to view this is that we have a `base_dtype` and a `compute_dtype`. `base_dtype` is what our parameters are in by default (so bf16 for full bf16 finetuning), `compute_dtype` is passed into the `autocast` context manager to control the actual dtypes computation is done in. This might be useful as there's a clearly understandable separation btwn these 2 config flags.
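A rough sketch of how those two flags could interact (illustrative names only, not an existing torchtune API): when `compute_dtype` matches the parameter dtype, autocast is effectively disabled, which is the "full bf16" case described above.

```python
import torch
import torch.nn as nn


def forward_with_dtypes(model: nn.Module, batch: torch.Tensor, compute_dtype: torch.dtype) -> torch.Tensor:
    # base_dtype is whatever the parameters were created in (e.g. bf16 for full-bf16 finetuning).
    base_dtype = next(model.parameters()).dtype
    # Only enable autocast when compute_dtype actually differs from the parameter dtype;
    # otherwise it amounts to a no-op, per the discussion above.
    use_autocast = compute_dtype != base_dtype
    with torch.autocast(device_type=batch.device.type, dtype=compute_dtype, enabled=use_autocast):
        return model(batch)
```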
    cfg_dataset,
    tokenizer=self._tokenizer,
)
sampler = DistributedSampler(
I saw some discussion around this. My vote would be to keep the DistributedSampler. This is slightly different from the DCP case since DS has been around a lot longer and there's tons of documentation and tutorials around how to use this, set it up for different use cases etc. I put it in the same bucket as using torchrun for the single device setting.
Another added benefit (at least in my mind) is that with all of the FSDP related complexity around state dict management for LoRA multi-device, it might be easier for someone to just set num_gpus on this to 2 for their DDP use case. I might be wrong here, but that'll likely just run OOTB? We don't need to prioritize that change, but if using the DS helps a user play around with it, then given the above factors my vote would go to using DS.
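For reference, `DistributedSampler` also runs fine on a single device without any process group if you pass the ranks explicitly, which is part of why I'd expect this to mostly work OOTB - a toy sketch (not the recipe's code):

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(100))

# With explicit num_replicas/rank no process group is needed, so the same sampler
# code path works on one device and under torchrun with N processes.
sampler = DistributedSampler(dataset, num_replicas=1, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffles deterministically across epochs
    for batch in loader:
        pass
```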
For DDP, we would need to add some logic to wrap the model with DDP itself, which would likely be simpler than FSDP as there's less configuration. But we'd need explicit DDP support here, which IMO would make the "single device" name a bit confusing.
Yeah if we're casting votes here I would push for (a) no DistributedSampler and (b) no DDP in this recipe. Imo single device should mean single device through and through. But my opinions are not a blocker for this PR
recipes/README.md
Outdated
@@ -49,16 +49,18 @@ tune --nnodes 1 --nproc_per_node 4 finetune_llm --config alpaca_llama2_finetune

### LoRA finetune

To finetune with LoRA, you can use the `finetune_lora` recipe with the `alpaca_llama2_lora_finetune.yaml` config. E.g. on two devices
To finetune with LoRA, you can use the either the `lora_finetune_single_device` or `lora_finetune_multi_gpu ` recipes with the `alpaca_llama2_lora_finetune_single_device.yaml` or `alpaca_llama2_lora_finetune_multi_gpu.yaml` configs for single device and
nit: I find the distinction `single_device` vs `multi_gpu` a bit unclear (specifically the switch from "device" to "gpu"). While I understand the rationale for naming this way, could we change `multi_gpu` -> `distributed`?
Yeah, I didn't do multi device since the multi GPU recipe doesn't support devices beyond GPU. But this makes sense
recipes/tests/test_lora_finetune.py
Outdated
kwargs_values["model"].update(
    {
        "lora_attn_modules": test_lora_attn_modules,
        "apply_lora_to_mlp": False,
        "lora_rank": 8,
        "lora_alpha": 16,
        # Note: multi-gpu just signifies to run the
Thanks for adding this note!
    cfg_dataset: DictConfig,
    shuffle: bool,
    batch_size: int,
) -> DataLoader:
Should be Tuple[Sampler, Dataloader]?
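i.e. something along these lines (a sketch of just the signature, with imports inferred from the diff above):

```python
from typing import Tuple

from omegaconf import DictConfig
from torch.utils.data import DataLoader, DistributedSampler


def _setup_data(
    self,
    cfg_dataset: DictConfig,
    shuffle: bool,
    batch_size: int,
) -> Tuple[DistributedSampler, DataLoader]:
    ...
```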
bump
@@ -34,6 +34,7 @@

FSDPPolicyType: Type = Callable[[nn.Module, bool, int], bool]

_valid_distributed_single_node_nnodes = ["1:1", "1"]
Sorry dumb q: what is this 1:1 business?
1:1 is the default for torchrun and is related to autoscaling - it means the min and max # of nodes are both 1
# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.AlpacaDataset
  train_on_input: False
shall we make it default to True as @kartikayk suggested?
# Tokenizer
tokenizer:
  _component_: torchtune.models.llama2.llama2_tokenizer
  path: /home/rvarm1/local/dev/assets/tokenizer.model
shall we make this path more general?
Will revert all debug vestiges prior to commit.
recipes/README.md
Outdated
```

FSDP and activation checkpointing are enabled by default, and LoRA weights are added to the Q and V projections in self-attention. If you additionally want to apply LoRA to K and would like to reduce the LoRA rank from the default of eight, you can run
For both recipes, activation checkpointing is enabled by default, and LoRA weights are added to the Q and V projections in self-attention. FSDP is enabled by default for
multi-gpu recipe. If you additionally want to apply LoRA to K and would like to reduce the LoRA rank from the default of eight, you can run
distributed recipe
Would you please test with 'enable_activation_checkpointing': False when you have finished all the debugging? The current implementation seems to OOM on my side when activation checkpointing is disabled.
Left a bunch of comments but nothing major. You'll need to make some changes to the recipe test to get that passing, lmk if you need any help with that (and sorry in advance for the annoyance). Modulo that, looks good! Will stamp once CI is green
recipes/README.md
Outdated
@@ -49,16 +49,17 @@ tune --nnodes 1 --nproc_per_node 4 finetune_llm --config alpaca_llama2_finetune

### LoRA finetune

To finetune with LoRA, you can use the `finetune_lora` recipe with the `alpaca_llama2_lora_finetune.yaml` config. E.g. on two devices
You can finetune LoRA on a single GPU using the `lora_finetune_single_device` recipe with the `alpaca_llama2_lora_finetune_single_device.yaml` config. To do so on multiple GPUs, use `lora_finetune_distributed ` with `alpaca_llama2_lora_finetune_distributed.yaml`. E.g. on two devices, you can run the following:
nit: do we wanna say "You can finetune LoRA on a single GPU" or "...on a single device"? No strong preference myself, just wanna make sure we're clear on that
Also, might even be worth it to explicitly add one sentence like "we provide separate recipes and configs for fine-tuning LoRA on a single device vs. multiple devices as each case necessitates different training techniques" (idk, something like that)
Done!
with self._device:
    model = config.instantiate(cfg_model)

# Note: this needs to be set before wrapping with FSDP
No longer relevant 😃
    cfg_dataset: DictConfig,
    shuffle: bool,
    batch_size: int,
) -> DataLoader:
bump
input_ids = input_ids.to(self._device)
labels = labels.to(self._device)

logits = self._model(input_ids)
Did we get rid of autocast here? I thought from the docstring we still support mixed precision too, lmk if I'm misunderstanding though
Yeah, for single device at least we're getting rid of autocast over full bf16. Next PR making distributed consistent w/full bf16 will have some more thinking around our autocast / mixed precision offerings, but overall I think we should just take it out.
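To spell out the distinction with a toy example (illustrative module, not recipe code): mixed precision keeps fp32 parameters and relies on autocast, while full bf16 converts parameters up front and drops the autocast context entirely.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 16)

# Mixed precision: parameters stay fp32, autocast casts eligible ops down to bf16.
model_fp32 = nn.Linear(16, 16)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model_fp32(x)

# Full bf16: parameters and inputs are bf16 up front, no autocast context at all.
model_bf16 = nn.Linear(16, 16).to(torch.bfloat16)
out = model_bf16(x.to(torch.bfloat16))
```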
from recipes.lora_finetune_distributed import LoRAFinetuneDistributedRecipe
from recipes.lora_finetune_single_device import LoRAFinetuneRecipeSingleDevice
👋
tests/recipes/test_lora_finetune.py
Outdated
from recipes.lora_finetune_distributed import LoRAFinetuneDistributedRecipe
from recipes.lora_finetune_single_gpu import LoRAFinetuneSingleDeviceRecipe
This will fail, need to check out the newest version of these tests
sorry, could you clarify why this is expected to fail?
oh. It's totally changed
@@ -82,3 +84,19 @@ def test_set_float32_precision(self) -> None:
        assert torch.get_float32_matmul_precision() == "high"
        assert torch.backends.cudnn.allow_tf32
        assert torch.backends.cuda.matmul.allow_tf32

    def test_set_default_dtype(self):
        dtype = torch.bfloat16
Dumb q: will `torch.get_default_dtype()` ever naturally return bf16?
If user calls `torch.set_default_dtype(bf16)` before, though I guess this isn't "naturally"?
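The new test does essentially that - a rough sketch of the shape (not necessarily the exact code in this PR), restoring the previous default so a bf16 default doesn't leak into other tests:

```python
import torch


def test_set_default_dtype():
    prev = torch.get_default_dtype()
    try:
        torch.set_default_dtype(torch.bfloat16)
        assert torch.get_default_dtype() == torch.bfloat16
        # Tensors created without an explicit dtype now pick up bf16.
        assert torch.empty(2).dtype == torch.bfloat16
    finally:
        torch.set_default_dtype(prev)
    assert torch.get_default_dtype() == prev
```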
Do we also want to add true bf16 support for distributed recipe?
Definitely! We will address this systematically in a follow up PR - bf16 for distributed + a clear story around autocast vs pure bf16
Context
- The distributed recipe will want to use techniques such as `distributed_state_dict` and async checkpointing, which aren't (immediately) planned for single GPU usage but are useful distributed techniques.

Changelog
- Removes `device` specification related logic for the multi GPU finetune; multi device finetuning will initially not support CPU (it didn't in the past either).
- Removes `autocast` related logic and enables full bf16 training via a "full_bf16" flag, and adds the appropriate checks / validations to ensure this is working. To run with full_bf16, an example command is `tune lora_finetune_single_device --config recipes/configs/alpaca_llama2_lora_finetune_single_device.yaml --override model_checkpoint=/home/rvarm1/local/dev/assets/llama2-7b-01242024 seed=18 tokenizer_checkpoint=/home/rvarm1/local/dev/assets/tokenizer.model output_dir=/tmp/lora_debug device=cuda batch_size=2 enable_activation_checkpoint=True full_bf16=True` and this runs in < 16GB of VRAM.

Test plan

Single GPU
- `tune lora_finetune_single_device --config recipes/configs/alpaca_llama2_lora_finetune_single_device.yaml --override model_checkpoint=/home/rvarm1/local/dev/assets/llama2-7b-01242024 seed=18 tokenizer_checkpoint=/home/rvarm1/local/dev/assets/tokenizer.model output_dir=/tmp/lora_debug device=cuda batch_size=2 enable_activation_checkpointing=False full_bf16=True` - loss values: https://gist.github.com/rohan-varma/b2ef84e27d42814d60688a19d6a965ad
- `tune lora_finetune --config recipes/configs/alpaca_llama2_lora_finetune.yaml --override model_checkpoint=/home/rvarm1/local/dev/assets/llama2-7b-01242024 seed=18 tokenizer_checkpoint=/home/rvarm1/local/dev/assets/tokenizer.model output_dir=/tmp/lora_debug device=cuda:1 enable_fsdp=False` - loss values: https://gist.github.com/rohan-varma/b3b0230f4a96c3e171a5a76ca8691e20

Multi GPU
- `tune --nproc-per-node 2 lora_finetune_distributed --config recipes/configs/alpaca_llama2_lora_finetune_distributed.yaml --override model_checkpoint=/home/rvarm1/local/dev/assets/llama2-7b-01242024 seed=18 tokenizer_checkpoint=/home/rvarm1/local/dev/assets/tokenizer.model output_dir=/tmp/lora_debug device=cuda batch_size=2 enable_activation_checkpointing=False`, loss: https://gist.github.com/rohan-varma/52a1e754b783ba25644cb0d20afec026
- `tune --nnodes 1 --nproc_per_node 2 lora_finetune --config recipes/configs/alpaca_llama2_lora_finetune.yaml --override model_checkpoint=/home/rvarm1/local/dev/assets/llama2-7b-01242024 seed=18 tokenizer_checkpoint=/home/rvarm1/local/dev/assets/tokenizer.model output_dir=/tmp/lora_debug enable_fsdp=True &> out &`, loss: https://gist.github.com/rohan-varma/8d2ea4d8a1070f8e06b165f77e2a9e19

Single CPU (note multi device on CPUs is unsupported)
- Can be launched with `tune lora_finetune_single_gpu --config recipes/configs/alpaca_llama2_lora_finetune_single_gpu.yaml --override model_checkpoint=/home/rvarm1/local/dev/assets/llama2-7b-01242024 seed=18 tokenizer_checkpoint=/home/rvarm1/local/dev/assets/tokenizer.model output_dir=/tmp/lora_debug device=cpu`

Recipe CI
- `test_lora_finetune` is updated to verify loss values for both multi GPU and single GPU.

Discussion points
- Whether to keep `DistributedSampler` in the single device recipe. Without it, we can just pass the `shuffle` flag to DataLoader and don't have to call `set_epoch`. On the other hand, keeping the distributed sampler might make the switch to the distributed recipe a bit easier (for a recipe author) as they don't need to change data (in this particular case, for map style datasets w/ random sampling) when going from single GPU to distributed. For now, in the spirit of removing all distributed related things I've removed use of DistributedSampler as well.

Open items