Separate full finetune into multi-gpu and single device recipes #482

Merged: 30 commits from ft into main on Mar 19, 2024

Conversation


@rohan-varma (Member) commented Mar 11, 2024

Context

  • Similar to Separate LoRA recipe into single and multi GPU #454, we'd like to separate our full finetune recipes into single-device and multi-device versions.
  • Full bf16 training, enabled via the dtype flag, is also added for both recipes, in accordance with RFC [RFC] Configuring low precision training in torchtune #504.
  • Note that this PR does not address full memory efficiency: both finetunes still use 25 GB+ of memory. Memory efficiency work will come as part of a follow-up PR.
  • Added a print_peak_memory util to print peak memory during training (a minimal sketch is included below). We still need to log this to wandb, which will be done in follow-up PRs.
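
A minimal sketch of what such a memory utility might look like, using only standard torch.cuda APIs and assuming a CUDA device; the name print_peak_memory and this exact signature are illustrative (the util in this PR ends up as get_memory_summary, later renamed memory_stats_log in review):

```python
import torch


def print_peak_memory(prefix: str, device: torch.device) -> None:
    """Print allocated, reserved, and peak CUDA memory in GB."""
    if device.type != "cuda":
        return
    print(
        f"{prefix} | "
        f"allocated: {torch.cuda.memory_allocated(device) / 1000**3:.2f} GB | "
        f"reserved: {torch.cuda.memory_reserved(device) / 1000**3:.2f} GB | "
        f"peak: {torch.cuda.max_memory_allocated(device) / 1000**3:.2f} GB"
    )
```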

Changelog

  • See above

Caveats

  • As mentioned above, full memory efficiency is follow-up work and is not yet enabled.

Test plan

  • Run recipe tests:
  • Run full ft: tune full_finetune_single_device --config recipes/configs/alpaca_llama2_full_finetune_single_device.yaml
  • Run distributed full ft: tune --nproc_per_node 2 full_finetune_distributed --config recipes/configs/alpaca_llama2_full_finetune_distributed.yaml

Comparison to fp32 runs

Loss curves for bf16 and fp32 are comparable. Still need to run e2e evals for bf16 runs for both full and LoRA finetunes.

bf16 loss curve: [image]

fp32 loss curve: [image]

@rohan-varma marked this pull request as draft March 11, 2024 09:29
@facebook-github-bot added the CLA Signed label Mar 11, 2024

netlify bot commented Mar 11, 2024

Deploy Preview for torchtune-preview ready!

🔨 Latest commit: b3dbe4e
🔍 Latest deploy log: https://app.netlify.com/sites/torchtune-preview/deploys/65f9dca6265ee5000846c0fb
😎 Deploy Preview: https://deploy-preview-482--torchtune-preview.netlify.app


pytorch-bot bot commented Mar 16, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/482

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit b3dbe4e with merge base 20c323a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@rohan-varma marked this pull request as ready for review March 18, 2024 17:46
@@ -6,7 +6,7 @@
# Tokenizer
tokenizer:
_component_: torchtune.models.llama2.llama2_tokenizer
path: /tmp/llama2/tokenizer.model
path: /home/rvarm1/local/dev/assets/tokenizer.model
Member Author

will revert these prior to land

Contributor

bumping this

@rohan-varma changed the title from "[WIP] Separate full finetune into multi-gpu and single device recipes" to "Separate full finetune into multi-gpu and single device recipes" Mar 18, 2024
@@ -227,9 +228,13 @@ def _setup_model(
)

model.load_state_dict(model_state_dict)

# Validate model was loaded in with the expected dtype.
utils.validate_expected_param_dtype(model, dtype=self._training_precision)
Contributor

Why do we need this?

Member Author

This validates that all params in the model are of the expected type. It would be useful for catching issues where some parameters don't end up as fp32, maybe due to an accidental overwrite, a state_dict hook manipulating them, etc. Can take it out if needed.
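
For context, a hedged sketch of what such a dtype-validation helper can look like, based on the behavior described above; the actual torchtune util is utils.validate_expected_param_dtype, and its real signature and implementation may differ:

```python
import torch
from torch import nn


def validate_expected_param_dtype(model: nn.Module, dtype: torch.dtype) -> None:
    """Raise if any parameter in ``model`` does not have the expected dtype."""
    for name, param in model.named_parameters():
        if param.dtype != dtype:
            raise ValueError(
                f"Parameter '{name}' has dtype {param.dtype}, expected {dtype}."
            )
```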

)
self._optimizer.step()
if log_this_iteration:
get_memory_summary(
Contributor

These logs seem overly intrusive and overly frequent. We're currently printing this every N steps where N is tied to how frequently we log other metrics. I don't think we need memory stats to be logged this frequently. Can we just move this to one place (eg: end of iteration) and specify this with a different frequency?

Member Author

Different frequency sounds good. The reason for multiple calls within an iteration is to help debug memory spikes during different portions of training. For example, if we only log once at the end, we don't know whether memory peaked in the forward pass, the backward pass, or the optimizer step. More granular logs clearly show where memory usage spikes and isolate the memory debugging to that spot.

Contributor

Yeah I also wonder what the best approach here is. Personally I have been copy-pasting stuff analogous to this a ton and it'd be nice to just have an easy way to configure it, so I think this is a nice step in that direction. But do agree it's a bit intrusive. While it's useful for us, do you think most users will be debugging memory spikes in forward/backward/optimizer step on a regular basis? My inclination is no, but lmk your thoughts

Member Author

Moved to end of iteration for now.
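
As an illustration of the resolution, a standalone sketch of peak-memory logging at the end of an iteration, gated on its own, coarser frequency; the constant and function name here are assumptions rather than the recipe's exact attributes (the PR hardcodes the interval to 100):

```python
import torch

LOG_PEAK_MEMORY_EVERY_N_STEPS = 100  # hardcoded to 100 in this PR


def maybe_log_peak_memory(step: int) -> None:
    """Log peak CUDA memory once per N steps, at the end of an iteration."""
    if torch.cuda.is_available() and step % LOG_PEAK_MEMORY_EVERY_N_STEPS == 0:
        peak_gb = torch.cuda.max_memory_allocated() / 1000**3
        print(f"step {step}: peak memory {peak_gb:.2f} GB")
```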


# logging attributes
self._output_dir = cfg.output_dir
self._log_every_n_steps = cfg.log_every_n_steps if cfg.log_every_n_steps else 10
Contributor

Why 10?

Member Author

Responded below!

# logging attributes
self._output_dir = cfg.output_dir
self._log_every_n_steps = cfg.log_every_n_steps if cfg.log_every_n_steps else 1
self._log_every_n_steps = cfg.log_every_n_steps if cfg.log_every_n_steps else 10
Contributor

Why 10? I would want to log loss a lot more frequently than this, right?

Member Author

Each time a loss is logged, it requires a CPU/GPU synchronization, which in traces shows up as a long GPU-side wait. I think having this explicit host sync every iteration is unnecessarily expensive. When training for some nontrivially large number of steps, I feel like we don't lose much by logging the loss every 10 iterations instead of every 1.
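
To illustrate the tradeoff: reading loss.item() is the host/device synchronization in question, so gating it behind a logging interval stalls the GPU only once every N steps. The toy model, data, and names below are assumptions for a self-contained example, not torchtune code:

```python
import torch
from torch import nn

log_every_n_steps = 10
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    x = torch.randn(8, 16, device=device)
    y = torch.randint(0, 4, (8,), device=device)
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step % log_every_n_steps == 0:
        # loss.item() is the explicit host sync; it now happens once every
        # N steps instead of every iteration.
        print(f"step {step}: loss {loss.item():.4f}")
```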

Contributor

Separately, IMO we should not spread our defaults across multiple places. Right now most defaults are in the yaml file; I think we should stay consistent with that here.

model.load_state_dict(model_state_dict)

# Validate model was loaded in with the expected dtype.
utils.validate_expected_param_dtype(model, dtype=self._training_precision)
Contributor

Same question as above.

Member Author

Responded in the other comment.

recipes/full_finetune_single_device.py (outdated comment, resolved)
self._sampler.set_epoch(curr_epoch)

for idx, batch in enumerate(
pbar := tqdm(self._dataloader, disable=not (rank == 0))
Contributor

Don't need to disable this

input_ids = input_ids.to(self._device)
labels = labels.to(self._device)
if log_this_iteration:
get_memory_summary(
Contributor

Same comment about this as distributed recipe. Let's reduce the frequency of these logs.

Member Author

What about logging every 100 steps, but keeping the placement within the iteration (after forward, after backward, etc.)?

Member Author

Made it just end of iteration for now.

# logging attributes
self._output_dir = cfg.output_dir
self._log_every_n_steps = cfg.log_every_n_steps if cfg.log_every_n_steps else 1
self._log_every_n_steps = cfg.log_every_n_steps if cfg.log_every_n_steps else 10
self._log_peak_memory_every_n_steps = 100
Contributor

Similar comment here, just define in the config?

Member Author

There's been discussion in the past about what should be configurable so as to not bloat configs. I'll defer to @kartikayk on this.

Member Author

In #514 we hardcoded 100, so sticking with that in a variable for now seems reasonable.

"If using tune CLI, please specify --nnodes 1 and --nproc_per_node [num_gpus]"
)

init_process_group(backend="gloo" if cfg.device == "cpu" else "nccl")
Contributor

Is gloo moot if we don't support CPU training?

Member Author

For unit tests, until we have GPU support.

Comment on lines 38 to +40
- FSDP and activation checkpointing. This is enabled by default but can be
configured using the ``enable_fsdp`` and ``enable_activation_checkpointing`` flags.
- Mixed precision training - fp32, fp16 and bf16 are supported.
- Full bf16 training via setting the ``dtype`` flag to bf16.
Contributor

(Comment is on L38-39.) We should make sure we're aligned on the right default for AC, as #514 changes the default for distributed LoRA to no AC.

Member Author

The default has been on for memory efficiency, and I can't tell why #514 turns it off by default (it doesn't appear to be in the PR description), so I'm sticking with leaving it on for now.


# Update the number of steps when the weights are updated
self.total_training_steps += 1
loss.backward()
Contributor

Did we lose grad accumulation in here somewhere?

Member Author

Great call (but again, CI didn't catch it, unfortunate)

Contributor

I think our grad accumulation test is not running for the distributed recipe. I will look into setting this up with the distributed tests
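
For reference, a minimal, self-contained sketch of the gradient accumulation pattern in question, with a toy model and random data (illustrative only; not the recipe's exact code):

```python
import torch
from torch import nn

gradient_accumulation_steps = 4
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
total_training_steps = 0

for idx in range(32):
    x = torch.randn(8, 16, device=device)
    y = torch.randint(0, 4, (8,), device=device)
    # Scale the loss so the accumulated gradient matches a full-batch step.
    loss = loss_fn(model(x), y) / gradient_accumulation_steps
    loss.backward()  # gradients accumulate across micro-batches
    if (idx + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        # Only count a training step when the weights are actually updated.
        total_training_steps += 1
```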

Comment on lines 322 to 324
log_this_iteration = (
self.total_training_steps % self._log_every_n_steps == 0
)
Contributor

nit: this isn't really making the code clearer. If anything, I would define a variable for self.total_training_steps % self._log_peak_memory_every_n_steps.

logits = logits.transpose(1, 2)
# Compute loss
loss = self._loss_fn(logits, labels)
if self.total_training_steps % self._log_peak_memory_every_n_steps == 0:
Contributor

Wondering about all these logs when we have grad accumulation turned on. In that case are we logging all this stuff separately for every iteration of the step?

Member Author

Yeah, made the memory log at just the end of iteration for now.

@@ -94,7 +95,8 @@ def fetch_checkpointer(self, ckpt):
if ckpt == "small_test_ckpt_meta":
return "FullModelMetaCheckpointer"

def test_loss(self, capsys, pytestconfig, tmpdir, monkeypatch):
@pytest.mark.parametrize("single_device", [False])
Contributor

Any particular reason for this?

Member Author

Adding True back in; it was just for testing.

@@ -53,6 +53,7 @@
"transform_opt_state_dict",
"validate_checkpoint",
"get_autocast",
"get_memory_summary",
Contributor

Should this be memory_stats_log?

@ebsmothers (Contributor) left a comment

Two more quick comments, otherwise looks good! Just make sure to run both recipes in the final state before landing.

Memory Allocated: {torch.cuda.memory_allocated() / 1000**3:.2f} GB
Memory Reserved: {torch.cuda.memory_reserved() / 1000**3:.2f} GB
Peak Memory: {torch.cuda.max_memory_allocated() / 1000**3:.2f} GB
def memory_stats_log(
Member Author

cc @kartikayk, this may cause an API conflict with #524.

@rohan-varma merged commit 65aec15 into main Mar 19, 2024
21 checks passed
@joecummings deleted the ft branch April 11, 2024 15:40