This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

Compatibility with PTL 1.6 #159

Closed · wants to merge 4 commits

Conversation

@krfricke commented Jun 16, 2022

Todos:

@amogkam (Collaborator) commented Jun 17, 2022

DDP Sharded and Horovod also need to be P0. We have to make sure the entire library works with 1.6.


@amogkam (Collaborator) commented Jun 17, 2022

Could we also merge in #129 to rename everything to strategy, following the 1.6 convention?
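
For reference, a minimal sketch of the 1.6 naming convention being referred to. `RayPlugin` is the current class name in ray_lightning; `RayStrategy` is an assumption about what the renamed class might be called, used here only for illustration:

```python
import pytorch_lightning as pl

# PTL <= 1.5 convention: Ray's training type plugin is passed via `plugins=...`:
#   trainer = pl.Trainer(plugins=[RayPlugin(num_workers=2)])
#
# PTL 1.6 convention: the same concept is called a "strategy" and is passed via
# `strategy=...` (RayStrategy is a hypothetical renamed class, for illustration):
#   trainer = pl.Trainer(strategy=RayStrategy(num_workers=2))

# Built-in strategies can also be selected by name in 1.6:
trainer = pl.Trainer(strategy="ddp", accelerator="cpu", devices=2, max_epochs=1)
```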

@JiahaoYao (Contributor) commented:

Error: (error screenshot omitted)

• enable GPU training

@JiahaoYao (Contributor) commented:

(screenshot omitted)

- [ ] multi-node GPU case: it hangs forever.

@JiahaoYao (Contributor) commented:

to_state_stream / load_state_stream seem to be useful: without them, the weights are not changed after training (screenshot omitted), which fails this check from the test:

# Assumption: earlier in the test the initial parameter norms are captured, e.g.
# initial_values = torch.tensor([torch.sum(torch.abs(x)) for x in model.parameters()])
post_train_values = torch.tensor(
    [torch.sum(torch.abs(x)) for x in model.parameters()])
assert trainer.state.finished, f"Trainer failed with {trainer.state}"
# Check that the model is actually changed post-training.
assert torch.norm(initial_values - post_train_values) > 0.1

@JiahaoYao (Contributor) commented:

For Q1:

Check if we need to_state_stream / load_state_stream P(0)

Yes, we need this. On the remote task, the weights are not dumped to or loaded from a checkpoint on the hard disk; to_state_stream / load_state_stream provide an elegant way to fetch the weights and pass them back to the driver.
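
A minimal sketch of that idea (this is not the ray_lightning implementation; the helper bodies below are illustrative assumptions that only mirror the to_state_stream / load_state_stream names): serialize the state dict into an in-memory byte stream on the worker, return it through Ray, and load it back into the driver's model without touching the hard disk.

```python
import io

import ray
import torch
import torch.nn as nn


def to_state_stream(model: nn.Module) -> bytes:
    # Serialize the state dict into an in-memory byte stream instead of
    # writing a checkpoint file to disk.
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getvalue()


def load_state_stream(model: nn.Module, state_stream: bytes) -> nn.Module:
    # Restore the weights on the driver from the byte stream.
    model.load_state_dict(torch.load(io.BytesIO(state_stream)))
    return model


@ray.remote
def train_and_return_weights() -> bytes:
    # Stand-in for the remote training task: train, then stream the weights back.
    model = nn.Linear(4, 2)
    # ... trainer.fit(model) would happen here ...
    return to_state_stream(model)


if __name__ == "__main__":
    ray.init()
    driver_model = nn.Linear(4, 2)
    state_stream = ray.get(train_and_return_weights.remote())
    driver_model = load_state_stream(driver_model, state_stream)
    ray.shutdown()
```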

@JiahaoYao (Contributor) commented:

Passed all the tests except multi-GPU on one node for this version (https://github.com/JiahaoYao/ray_lightning/tree/3df599a8bb1ac917bf6352b8baef63ad64f21595).

@JiahaoYao (Contributor) commented:

# https://github.com/PyTorchLightning/pytorch-lightning/discussions/8561
# is fixed.
ddp_kwargs.pop("parallel_devices", None)
ddp_kwargs.pop("cluster_environment", None)
A collaborator commented:
Let's keep this?
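
For context, a minimal sketch of the defensive kwarg-filtering pattern the snippet above implements (the helper below is hypothetical, not the actual ray_lightning code): keys that the Ray launcher constructs itself are popped before the remaining kwargs are forwarded to the upstream DDP strategy.

```python
from typing import Any, Dict

# Keys that the Ray launcher sets up on its own; user-supplied values for them
# could conflict with what is configured on the remote workers.
_RESERVED_DDP_KEYS = ("parallel_devices", "cluster_environment")


def filter_ddp_kwargs(ddp_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Drop reserved keys before forwarding kwargs to the DDP strategy."""
    filtered = dict(ddp_kwargs)
    for key in _RESERVED_DDP_KEYS:
        filtered.pop(key, None)
    return filtered


# Example: `parallel_devices` is silently dropped, everything else passes through.
print(filter_ddp_kwargs({"find_unused_parameters": False, "parallel_devices": [0, 1]}))
# -> {'find_unused_parameters': False}
```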

@JiahaoYao (Contributor) commented:

test_ddp.py passed

@JiahaoYao (Contributor) commented:

We can close this PR @krfricke @amogkam.

@krfricke closed this Aug 3, 2022