
Support to Multi-node Training #989

Closed

BiEchi opened this issue Nov 8, 2023 · 3 comments
Labels
enhancement: Feature that is not a new algorithm or an algorithm enhancement
on hold: Won't be worked on for now, but maybe later

Comments

@BiEchi

BiEchi commented Nov 8, 2023

  • I have marked all applicable categories:
    • exception-raising bug
    • RL algorithm bug
    • documentation request (i.e. "X is missing from the documentation.")
    • new feature request
  • I have visited the source website
  • I have searched through the issue tracker for duplicates
  • I have mentioned version numbers, operating system and environment, where applicable:
    import tianshou, gymnasium as gym, torch, numpy, sys
    print(tianshou.__version__, gym.__version__, torch.__version__, numpy.__version__, sys.version, sys.platform)

Hello, we're using Tianshou for a large model, which requires multi-node training. I see the multi-GPU training tutorial here: https://tianshou.readthedocs.io/en/master/tutorials/cheatsheet.html#multi-gpu. However, I'm not sure whether that approach can be extended to multi-node training. Any pointers to previous attempts would be greatly appreciated!

@MischaPanch
Collaborator

In principle this should work since tianshou supports ray for parallelization, which in turn supports multiple nodes. @Trinkle23897 may know more.

I haven't used tianshou for LLMs yet (trlx seemed like a good fit for that) but would love to hear experiences and use cases. Could you provide some info on your use case and why you want to use tianshou, or is it non-disclosable for your case?
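For reference, a minimal sketch of what the ray-based route could look like, assuming a standard Ray cluster and tianshou's `RayVectorEnv`; the address and port below are placeholders, and this has not been validated in a multi-node setting:

```shell
# On the head node: start the Ray head process.
# 6379 is Ray's default GCS port; replace HEAD_IP with the head node's address.
ray start --head --port=6379

# On each worker node: join the cluster formed above.
ray start --address='HEAD_IP:6379'

# In the training script, connect to the already-running cluster instead of
# starting a local Ray instance, and use tianshou's RayVectorEnv so that
# environment workers can be scheduled across all nodes, e.g.:
#   import ray; ray.init(address="auto")
#   from tianshou.env import RayVectorEnv
#   envs = RayVectorEnv([make_env for _ in range(num_envs)])
```

Note this parallelizes environment rollouts across nodes; distributing the gradient computation itself would be a separate concern (e.g. via torch.distributed).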

@MischaPanch MischaPanch added the question Further information is requested label Nov 8, 2023
@BiEchi
Author

BiEchi commented Nov 9, 2023

Thank you for the prompt response. Unfortunately, we cannot provide additional details on the specific use cases, as this is an ongoing research project. Thank you again for your help and your understanding!

@MischaPanch
Collaborator

I see, thanks. Currently, multi-node parallelization and LLM support are not high on my radar, so I won't be able to help with this until at least 2024. But other users/devs might have more experience with this.

@MischaPanch MischaPanch added the enhancement (Feature that is not a new algorithm or an algorithm enhancement) and on hold (Won't be worked on for now, but maybe later) labels and removed the question (Further information is requested) label Jan 8, 2024
@BiEchi BiEchi closed this as completed Feb 2, 2024