[Unified TorchTrainer] Add PyTorch Lightning Trainer Utilities #37989

woshiyyya · 2023-08-01T20:50:39Z

Why are these changes needed?

Add Lightning-related utilities for the unified TorchTrainer.

One basic example using the new API
User guides for Lightning + TorchTrainer

POC example: Text Classification with Ray Data + Lightning

Rendered Doc: https://anyscale-ray--37989.com.readthedocs.build/en/37989/

New utilities

Strategy classes: RayDDPStrategy, RayFSDPStrategy, RayDeepSpeedStrategy
Environment class: RayLightningEnvironment
Metrics and checkpoint reporting: RayTrainReportCallback
prepare_trainer(): Checks that the user has configured the pl.Trainer correctly

Users can inject these utilities into their Lightning code to run it in TorchTrainer.
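As a rough illustration of how the utilities listed above might fit together, here is a minimal sketch. It assumes the import paths `ray.train.lightning`, `ray.train.torch.TorchTrainer`, and `ray.train.ScalingConfig` from the released Ray Train API; `TinyRegressor`, the random dataset, and `build_trainer` are hypothetical names used only for this example and are not part of the PR.

```python
def train_func():
    # Imports are deferred into the function body; it runs on each Ray Train worker.
    import pytorch_lightning as pl
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Assumed import path for the utilities added in this PR.
    from ray.train.lightning import (
        RayDDPStrategy,
        RayLightningEnvironment,
        RayTrainReportCallback,
        prepare_trainer,
    )

    class TinyRegressor(pl.LightningModule):  # hypothetical toy model
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(4, 1)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = torch.nn.functional.mse_loss(self.layer(x), y)
            self.log("train_loss", loss)
            return loss

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.01)

    # Random data, purely for illustration.
    data = TensorDataset(torch.randn(64, 4), torch.randn(64, 1))

    trainer = pl.Trainer(
        max_epochs=1,
        accelerator="auto",
        devices="auto",
        strategy=RayDDPStrategy(),             # Ray-aware DDP strategy
        plugins=[RayLightningEnvironment()],   # Ray-aware cluster environment
        callbacks=[RayTrainReportCallback()],  # report metrics and checkpoints
        enable_progress_bar=False,
    )
    trainer = prepare_trainer(trainer)  # validate the pl.Trainer configuration
    trainer.fit(TinyRegressor(), train_dataloaders=DataLoader(data, batch_size=16))


def build_trainer(num_workers=2):
    # Hypothetical helper: wrap the per-worker function in a TorchTrainer.
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    return TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=False),
    )
```

With Ray and PyTorch Lightning installed, calling `build_trainer().fit()` would be expected to launch the distributed run; the worker count and CPU-only setting here are arbitrary choices for the sketch.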

Metrics and checkpoint reporting

We propose letting users define the reporting logic themselves in a Lightning callback.

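import os
from tempfile import TemporaryDirectory

# Import paths below are assumed from the ray.train and Lightning APIs
import ray.train
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import Callback
from ray.train import Checkpoint


class RayTrainReportCallback(Callback):
    def on_train_epoch_end(self, trainer: Trainer, pl_module: LightningModule) -> None:
        with TemporaryDirectory() as tmpdir:
            # Fetch metrics and convert tensors to plain Python numbers
            metrics = trainer.callback_metrics
            metrics = {k: v.item() for k, v in metrics.items()}

            # Save the checkpoint to a local temporary directory
            ckpt_path = os.path.join(tmpdir, f"ckpt_epoch_{trainer.current_epoch}.pth")
            trainer.save_checkpoint(ckpt_path, weights_only=False)

            # Report metrics and checkpoint to the Ray Train session
            checkpoint = Checkpoint.from_directory(tmpdir)
            ray.train.report(metrics=metrics, checkpoint=checkpoint)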

Users can choose whether or not to report the metrics to Ray Train.

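import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

def train_loop_per_worker():
    # Option 1: report metrics and checkpoints to Ray Train
    trainer = pl.Trainer(
        callbacks=[RayTrainReportCallback()]
    )

    # Option 2: keep using Lightning's original checkpointing logic
    # (kwargs stands for the user's own ModelCheckpoint arguments)
    trainer = pl.Trainer(
        callbacks=[ModelCheckpoint(**kwargs)]
    )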

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

matthewdeng

Looks really clean. I think we should showcase one example with the previous API and compare it with the new API.

python/ray/train/lightning/lightning_utils.py

python/ray/train/tests/lightning_test_utils.py

python/ray/train/lightning/lightning_trainer.py

scottsun94 · 2023-08-03T01:05:44Z

If the user chooses not to report to Ray Train, we won't be able to print any related metrics in the train output, right?

What will happen to this training result table?

Training completed after 0 iterations at 2023-06-26 16:06:41. Total running time: 24min 41s
╭──────────────────────────────────────────────────────╮
│ Training result                                      │
├──────────────────────────────────────────────────────┤
│ config/train_loop_config/args   ...weight_decay=0.0) │
╰──────────────────────────────────────────────────────╯

python/ray/train/tests/test_torch_lightning_train.py

python/ray/train/lightning/lightning_trainer.py

doc/source/train/distributed-pytorch/checkpoints.rst

krfricke

Read half of it, will continue later today. Looks great so far! Would be great to get a version that renders in RTD for preview.

krfricke · 2023-08-08T14:23:32Z

doc/source/train/distributed-pytorch/checkpoints.rst

+                datamodule = MyLightningDataModule(...)
+
+                trainer = pl.Trainer(
+                    ...


Nit: let's use comments here so that the Python code would at least parse correctly (even if it isn't executed).

Suggested change

-                    ...
+                    # ...

Technically this actually works in Python!
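The remark about `...` being valid Python can be checked with a short sketch like the one below. The snippet is modeled on the diff excerpt above; the closing parenthesis is added here only so the fragment is complete, and `MyLightningDataModule` and `pl` are never executed, only parsed.

```python
import ast

# Documentation-style snippet with `...` placeholders (parsed, never run).
snippet = """
datamodule = MyLightningDataModule(...)

trainer = pl.Trainer(
    ...
)
"""

# ast.parse raises SyntaxError if the snippet is not valid Python.
tree = ast.parse(snippet)

# The second statement is `trainer = pl.Trainer(...)`; inspect its first argument.
trainer_call = tree.body[1].value
first_arg = trainer_call.args[0]

# The placeholder parses as the Ellipsis constant, an ordinary expression.
is_ellipsis = isinstance(first_arg, ast.Constant) and first_arg.value is Ellipsis
print(is_ellipsis)
```

So a bare `...` inside a call is syntactically fine: it is passed as a positional `Ellipsis` argument. Using `# ...` instead avoids implying that `Ellipsis` is a meaningful argument if a reader copies the snippet and runs it.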

woshiyyya · 2023-08-08T20:53:17Z

@krfricke Updated rendered doc: https://anyscale-ray--37989.com.readthedocs.build/en/37989/
cc @matthewdeng

matthewdeng

Generally looks great to me. I think we can polish and shift around some of the documentation content in a follow-up PR.

doc/source/train/api/api.rst

doc/source/train/distributed-pytorch/converting-existing-training-loop.rst

krfricke

This is great; just minor nits.

doc/source/train/distributed-pytorch/monitoring-logging.rst

doc/source/train/distributed-pytorch/experiment-tracking.rst

…roject#37989) Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com> Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Signed-off-by: Shreyas Krishnaswamy <shrekris@anyscale.com>

…roject#37989) Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com> Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Signed-off-by: NripeshN <nn2012@hw.ac.uk>

…roject#37989) Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com> Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Signed-off-by: harborn <gangsheng.wu@intel.com>

…roject#37989) Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com> Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

…roject#37989) Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com> Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Signed-off-by: e428265 <arvind.chandramouli@lmco.com>

…roject#37989) Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com> Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Signed-off-by: Victor <vctr.y.m@example.com>

woshiyyya added 3 commits August 1, 2023 13:49

init

3a7dc6e

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

Merge remote-tracking branch 'upstream/master' into train/unified-api…

c7b91a7

…/add_lightning_utilities

add docstring

4ef88c9

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

woshiyyya marked this pull request as ready for review August 1, 2023 21:02

woshiyyya requested a review from matthewdeng August 1, 2023 21:02

woshiyyya assigned matthewdeng Aug 1, 2023

matthewdeng reviewed Aug 2, 2023

woshiyyya and others added 2 commits August 1, 2023 23:35

Apply suggestions from code review

7b5da30

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>

add deepspeed tests

95e2b9b

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

woshiyyya mentioned this pull request Aug 2, 2023

[Ray 2.7 Examples][1/n] Revamp the LightningTrainer CoLA Example #38009

Merged

8 tasks

woshiyyya added 3 commits August 2, 2023 16:35

fix import error

4df1ec6

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

Merge remote-tracking branch 'upstream/master' into train/unified-api…

3914a21

…/add_lightning_utilities

make report callback public

21a8ce3

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

scottsun94 reviewed Aug 3, 2023

python/ray/train/tests/test_torch_lightning_train.py (outdated, resolved)

woshiyyya added 3 commits August 3, 2023 00:40

fix lint

abdf842

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

Merge remote-tracking branch 'upstream/master' into train/unified-api…

d3b01f0

…/add_lightning_utilities

use new ray.train api

0ecd61a

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

woshiyyya added the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") Aug 3, 2023

matthewdeng reviewed Aug 7, 2023

python/ray/train/lightning/lightning_trainer.py (resolved)

woshiyyya added 4 commits August 7, 2023 12:02

switch to new api in LightningTrainer

0de4bc4

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

Merge remote-tracking branch 'upstream/master' into train/unified-api…

2b2589e

…/add_lightning_utilities

WIP: add lightning user guides

7c1b1e2

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

change default ckpt name

e8285ab

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

woshiyyya requested review from richardliaw, krfricke, xwjiang2010, amogkam, Yard1 and maxpumperla as code owners August 7, 2023 22:32

pcmoritz reviewed Aug 8, 2023

doc/source/train/distributed-pytorch/checkpoints.rst (outdated, resolved)

add mnist example

ba81c15

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

krfricke reviewed Aug 8, 2023

woshiyyya and others added 5 commits August 8, 2023 10:25

fix doc lint

89c7388

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

Merge branch 'master' into train/unified-api/add_lightning_utilities

0d9bdaf

fix

c186f5b

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

fixing

6925a05

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

Merge remote-tracking branch 'upstream/master' into train/unified-api…

079ddec

…/add_lightning_utilities

fix ut

7c19ffe

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

woshiyyya requested review from krfricke and matthewdeng August 9, 2023 00:13

woshiyyya added 2 commits August 8, 2023 17:20

Merge remote-tracking branch 'upstream/master' into train/unified-api…

8a65e36

…/add_lightning_utilities

update semgrep

9d9ed17

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

matthewdeng approved these changes Aug 9, 2023

Apply suggestions from code review

87aa763

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>

krfricke approved these changes Aug 9, 2023

woshiyyya and others added 4 commits August 9, 2023 12:07

address comments

74717af

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

Merge branch 'master' into train/unified-api/add_lightning_utilities

a4213a2

fix func name

4e06c89

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

fix ckpt path

1396883

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

matthewdeng merged commit 0dd32ed into ray-project:master Aug 10, 2023
120 of 124 checks passed

This was referenced Aug 21, 2023

[train] Implement TorchTrainer subclass simplifications #38295

Closed

[Train][Ray 2.7] Revamp Ray Train examples with new APIs #38681

Closed
