Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename variable horovod_num_processes in ReturnnTrainingJob and ReturnnRasrTrainingJob #456

Open
christophmluscher opened this issue Oct 20, 2023 · 2 comments

Comments

@christophmluscher
Copy link
Contributor

The jobs got extended to enable multi-GPU usage for the torch backend (see #444 and #445). The horovod_num_processes variable name is now incorrect. This change needs to be done carefully since this is a potentially hash-breaking change.
Analog to distributed_launch_command rename horovod_num_processes to distributed_num_processes?

@albertz @Judyxujj @JackTemaki comments?

@albertz
Copy link
Member

albertz commented Oct 20, 2023

Yes, this makes sense.

distributed_num_processes is a good name.

However, I currently don't have a good idea how to do this without breaking setups. Do you?

@albertz
Copy link
Member

albertz commented Oct 20, 2023

Ok, we could maybe define a custom _sis_hash for the ReturnnTrainingJob, and there use exactly the old name (horovod_num_processes), and then in __init__, we can do any handling we want for kwargs, so supporting both the new name and the old name. I think sth like this would work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants