TRAIN TF-AGENTS WITH MULTIPLE GPUs #289
Hi,
I finally got my VM up and running with:
2x Tesla P100
NVIDIA driver 440.33.01
CUDA 10.2
tensorflow==2.1.0
tf_agents==0.3.0
I started training a custom model/env based on the SAC agent v2 train loop, but only one GPU is used.
My question: at the moment, is tf-agents able to manage distributed training on multiple GPUs, or should I use only one?
It should work with multiple GPUs. You can specify that in your training script just as you would in other TensorFlow training code.
If you run sac/examples/v2/train_eval_rnn.py as-is on a VM like the one listed above, you'll see only one GPU working. Could you please provide an example of how to run train_eval_rnn.py on multiple GPUs?
@oars any advice on how to make the MirroredStrategy work here? I wonder if you need to build the iterator inside the MirroredStrategy scope.
Do you mean something like this?
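A minimal sketch of the idea, assuming the `replay_buffer`, `batch_size`, `train_sequence_length`, and a `train_step` that wraps `agent.train` and returns a scalar loss, as in train_eval_rnn.py:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    dataset = replay_buffer.as_dataset(
        sample_batch_size=batch_size,
        num_steps=train_sequence_length + 1).prefetch(3)
    # Let the strategy shard each global batch across the replicas.
    iterator = iter(strategy.experimental_distribute_dataset(dataset))

@tf.function
def distributed_train_step(iterator):
    experience, _ = next(iterator)
    # strategy.run in TF >= 2.2; experimental_run_v2 in TF 2.1.
    per_replica_loss = strategy.experimental_run_v2(
        train_step, args=(experience,))
    return strategy.reduce(
        tf.distribute.ReduceOp.MEAN, per_replica_loss, axis=None)
```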
I can give it a try tomorrow and let you know.
With this approach, operations are still always placed on a single GPU.
I am having the exact same problem. Any solutions or workarounds? Thanks.
I made some progress by combining some snippets of code from the official TensorFlow docs. Here's what I did so far, based on https://github.com/tensorflow/agents/blob/master/tf_agents/agents/sac/examples/v2/train_eval_rnn.py but with custom networks and a custom environment.
I'm probably mixing up batch_size, global_batch_size, and tf_env.batch_size.
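For reference, a sketch of how those three values relate under MirroredStrategy (variable names here are illustrative):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
num_replicas = strategy.num_replicas_in_sync  # e.g. 2 for two GPUs

per_replica_batch_size = 256
# The dataset should be batched with the global size; MirroredStrategy
# then splits each global batch evenly across the replicas.
global_batch_size = per_replica_batch_size * num_replicas

# tf_env.batch_size is unrelated to training: it is the number of
# parallel environments driving data collection.
```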
Glad to hear you're making progress. I'd be grateful if you could share your final solution. In the meantime, I'm trying to see if I can make it work; I'll share mine here as well if I find a working one.
Assigning to me to identify the multi-GPU example or get one made. This is a common problem, and we should publish a definitive example to give people a starting point.
Hi @tfboyd, any update on this?
@JCMiles what errors are you getting with your implementation? One thing you want to make sure of is creating your network variables and the dataset iterator within the strategy scope. I am not sure from your code snippets where you create your networks. The only thing that needs to stay outside the strategy scope is the replay buffer itself, so it doesn't get replicated on the GPUs. Something like this:
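(A minimal sketch of that ordering; the constructors follow the SAC v2 example, and the environment and hyperparameter values here are placeholders, not from the original snippet.)

```python
import tensorflow as tf
from tf_agents.agents.ddpg import critic_network
from tf_agents.agents.sac import sac_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import actor_distribution_network
from tf_agents.replay_buffers import tf_uniform_replay_buffer

tf_env = tf_py_environment.TFPyEnvironment(suite_gym.load('Pendulum-v0'))
observation_spec = tf_env.observation_spec()
action_spec = tf_env.action_spec()

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Networks (and therefore their variables) are created inside the
    # scope, so the variables are mirrored onto every GPU.
    actor_net = actor_distribution_network.ActorDistributionNetwork(
        observation_spec, action_spec, fc_layer_params=(256, 256))
    critic_net = critic_network.CriticNetwork(
        (observation_spec, action_spec), joint_fc_layer_params=(256, 256))
    tf_agent = sac_agent.SacAgent(
        tf_env.time_step_spec(), action_spec,
        actor_network=actor_net,
        critic_network=critic_net,
        actor_optimizer=tf.keras.optimizers.Adam(3e-4),
        critic_optimizer=tf.keras.optimizers.Adam(3e-4),
        alpha_optimizer=tf.keras.optimizers.Adam(3e-4))
    tf_agent.initialize()

# The replay buffer stays OUTSIDE the scope so its storage is not
# replicated on each GPU.
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=tf_agent.collect_data_spec,
    batch_size=tf_env.batch_size,
    max_length=100000)

with strategy.scope():
    # The dataset iterator is created inside the scope and distributed,
    # so each replica consumes its own shard of every batch.
    dataset = replay_buffer.as_dataset(
        sample_batch_size=256, num_steps=2).prefetch(3)
    iterator = iter(strategy.experimental_distribute_dataset(dataset))
```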
@tfboyd @sguada @ebrevdo let's discuss internally an example we can share.
@egonina
Traceback (most recent call last):
  File "/home/user/test_agent/train/module.py", line 321, in train_step
@anj-s, can you PTAL at this use of DistributionStrategy? Thanks!
From the error message, it looks like we are not populating a field that indicates whether there is data remaining in the dataset (i.e., data that has not been processed yet). Given that this is a single-machine MirroredStrategy example, I am not sure why we would not have populated this field. I need a reproducible example to dig into this. @JCMiles @egonina can you provide me with a reproducible example?
Just add @egonina's implementation to agents/sac/examples/v2/train_eval_rnn.py.
I've made train_eval.py work on multiple GPUs. Attached is the source code. Hope it helps.
Hi,
[03-04|23:44:08] [14931] INFO 219 coordinator | Error reported to Coordinator: batch_reduce() missing 1 required positional argument: 'value_destination_pairs'
Hello everyone, I found the problem and fixed it in train_eval_rnn as well.
I was passing the callable itself as the cross_device_ops param; that caused the problem. I have only one small question now before closing this issue:
[03-05|14:06:21] [26463] INFO 760 cross_device_ops| batch_all_reduce: 112 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
What are those 4 IndexedSlices, and why is efficient allreduce not supported?
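For anyone hitting the same batch_reduce() error: the fix described above amounts to passing a constructed instance rather than the class itself. A minimal sketch, assuming NCCL all-reduce as in the log above:

```python
import tensorflow as tf

# Passing the class/callable itself triggers the
# "batch_reduce() missing 1 required positional argument" error:
# strategy = tf.distribute.MirroredStrategy(
#     cross_device_ops=tf.distribute.NcclAllReduce)

# Passing a constructed instance works:
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.NcclAllReduce(num_packs=1))
```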
I'm using 2 Tesla P100s, so I suppose the IndexedSlices are not related to the number of GPUs. Meanwhile, I'm running some performance tests and I'm facing some unexpected downtime. Still investigating...
Quick update: to run a train cycle on multiple GPUs with "n" parallel environments, I found a specific setup to be necessary, roughly along the lines of the sketch below.
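A minimal sketch of pairing "n" parallel environments with MirroredStrategy (the environment name and values are illustrative, not from the original setup):

```python
import tensorflow as tf
from tf_agents.environments import (parallel_py_environment, suite_gym,
                                    tf_py_environment)

num_parallel_environments = 4  # the "n" above

# Collection: n Python environments stepped in separate processes.
tf_env = tf_py_environment.TFPyEnvironment(
    parallel_py_environment.ParallelPyEnvironment(
        [lambda: suite_gym.load('Pendulum-v0')] * num_parallel_environments))

# Training: one replica per visible GPU.
strategy = tf.distribute.MirroredStrategy()
print('Replicas in sync:', strategy.num_replicas_in_sync)
print('Env batch size:', tf_env.batch_size)  # == num_parallel_environments
```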
Hope this helps. |
Assigning to me to test with the new updated distributed scripts. Using the new distributed collect is the best (maybe only) way to really drive multiple GPUs. ParallelPyEnvironment might work, but we do not use that approach very often, as we prefer to use a lot of CPU-only machines to drive data to the GPU server(s). https://github.com/tensorflow/agents/tree/master/tf_agents/experimental/distributed/examples/sac I will close this issue after testing the linked example on multiple GPUs and verifying that it uses both. The example is unlikely to drive a high usage rate because the networks are very small.