This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

Compatibility with PTL 1.6 #159

Closed · wants to merge 4 commits

Conversation

@krfricke commented Jun 16, 2022

Todos:

@amogkam (Collaborator) commented Jun 17, 2022

DDP Sharded and Horovod also need to be P0. We have to make sure the entire library works with 1.6.


@amogkam (Collaborator) commented Jun 17, 2022

Could we also merge in #129 to rename everything to strategy, following the 1.6 convention?
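
For reference, a minimal sketch of the 1.6 naming convention being referred to. `RayPlugin` is the current class name in ray_lightning; `RayStrategy` is an assumption about what the renamed class might be called, used here only for illustration:

```python
import pytorch_lightning as pl

# PTL <= 1.5 convention: Ray's training type plugin is passed via `plugins=...`:
#   trainer = pl.Trainer(plugins=[RayPlugin(num_workers=2)])
#
# PTL 1.6 convention: the same concept is called a "strategy" and is passed via
# `strategy=...` (RayStrategy is a hypothetical renamed class, for illustration):
#   trainer = pl.Trainer(strategy=RayStrategy(num_workers=2))

# Built-in strategies can also be selected by name in 1.6:
trainer = pl.Trainer(strategy="ddp", accelerator="cpu", devices=2, max_epochs=1)
```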

@JiahaoYao (Contributor) commented:

Error: (error screenshot omitted)

• enable GPU training

@JiahaoYao (Contributor) commented:

(screenshot omitted)

- [ ] multi-node GPU case: it hangs forever.

@JiahaoYao (Contributor) commented:

to_state_stream / load_state_stream seem to be useful: without them, the weights are not changed after training (screenshot omitted), which fails this check from the test:

# Assumption: earlier in the test the initial parameter norms are captured, e.g.
# initial_values = torch.tensor([torch.sum(torch.abs(x)) for x in model.parameters()])
post_train_values = torch.tensor(
    [torch.sum(torch.abs(x)) for x in model.parameters()])
assert trainer.state.finished, f"Trainer failed with {trainer.state}"
# Check that the model is actually changed post-training.
assert torch.norm(initial_values - post_train_values) > 0.1

@JiahaoYao (Contributor) commented:

For Q1:

Check if we need to_state_stream / load_state_stream P(0)

Yes, we need this. On the remote task, the weights are not dumped to or loaded from a checkpoint on the hard disk; to_state_stream / load_state_stream provide an elegant way to fetch the weights and pass them back to the driver.
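
A minimal sketch of that idea (this is not the ray_lightning implementation; the helper bodies below are illustrative assumptions that only mirror the to_state_stream / load_state_stream names): serialize the state dict into an in-memory byte stream on the worker, return it through Ray, and load it back into the driver's model without touching the hard disk.

```python
import io

import ray
import torch
import torch.nn as nn


def to_state_stream(model: nn.Module) -> bytes:
    # Serialize the state dict into an in-memory byte stream instead of
    # writing a checkpoint file to disk.
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getvalue()


def load_state_stream(model: nn.Module, state_stream: bytes) -> nn.Module:
    # Restore the weights on the driver from the byte stream.
    model.load_state_dict(torch.load(io.BytesIO(state_stream)))
    return model


@ray.remote
def train_and_return_weights() -> bytes:
    # Stand-in for the remote training task: train, then stream the weights back.
    model = nn.Linear(4, 2)
    # ... trainer.fit(model) would happen here ...
    return to_state_stream(model)


if __name__ == "__main__":
    ray.init()
    driver_model = nn.Linear(4, 2)
    state_stream = ray.get(train_and_return_weights.remote())
    driver_model = load_state_stream(driver_model, state_stream)
    ray.shutdown()
```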

@JiahaoYao (Contributor) commented:

Passed all the tests except multi-GPU on one node for this version (https://github.com/JiahaoYao/ray_lightning/tree/3df599a8bb1ac917bf6352b8baef63ad64f21595).

@JiahaoYao (Contributor) commented:

# https://github.com/PyTorchLightning/pytorch-lightning/discussions/8561
# is fixed.
ddp_kwargs.pop("parallel_devices", None)
ddp_kwargs.pop("cluster_environment", None)
A collaborator commented:
Let's keep this?
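
For context, a minimal sketch of the defensive kwarg-filtering pattern the snippet above implements (the helper below is hypothetical, not the actual ray_lightning code): keys that the Ray launcher constructs itself are popped before the remaining kwargs are forwarded to the upstream DDP strategy.

```python
from typing import Any, Dict

# Keys that the Ray launcher sets up on its own; user-supplied values for them
# could conflict with what is configured on the remote workers.
_RESERVED_DDP_KEYS = ("parallel_devices", "cluster_environment")


def filter_ddp_kwargs(ddp_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Drop reserved keys before forwarding kwargs to the DDP strategy."""
    filtered = dict(ddp_kwargs)
    for key in _RESERVED_DDP_KEYS:
        filtered.pop(key, None)
    return filtered


# Example: `parallel_devices` is silently dropped, everything else passes through.
print(filter_ddp_kwargs({"find_unused_parameters": False, "parallel_devices": [0, 1]}))
# -> {'find_unused_parameters': False}
```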

@JiahaoYao (Contributor) commented:

test_ddp.py passed

@JiahaoYao (Contributor) commented:

We can close this PR @krfricke @amogkam.

@krfricke closed this Aug 3, 2022