
[RLlib] Issue #21671: Handle callbacks and model metrics for TorchPolicy while using multi-GPU optimizers #21697

Merged

merged 5 commits into ray-project:master from XuehaiPan:torch-model-metrics on Feb 23, 2022

Conversation

XuehaiPan
Contributor

@XuehaiPan commented Jan 19, 2022

Why are these changes needed?

In:

ray/rllib/agents/trainer.py

Lines 1314 to 1321 in 82103bf

# Use simple optimizer (only for multi-agent or tf-eager; all other
# cases should use the multi-GPU optimizer, even if only using 1 GPU).
# TODO: (sven) rename MultiGPUOptimizer into something more
# meaningful.
if self.config.get("simple_optimizer") is True:
    train_results = train_one_step(self, train_batch)
else:
    train_results = multi_gpu_train_one_step(self, train_batch)

ray/rllib/agents/trainer.py

Lines 1346 to 1356 in 82103bf

if config.get("simple_optimizer") is True:
    train_op = train_op.for_each(TrainOneStep(workers))
else:
    train_op = train_op.for_each(
        MultiGPUTrainOneStep(
            workers=workers,
            sgd_minibatch_size=config.get("sgd_minibatch_size",
                                          config["train_batch_size"]),
            num_sgd_iter=config.get("num_sgd_iter", 1),
            num_gpus=config["num_gpus"],
            _fake_gpus=config["_fake_gpus"]))

Unless config["simple_optimizer"] is explicitly set to True, the multi-GPU optimizer is used whenever policies are trained on GPU (even with only 1 GPU). That means multi_gpu_train_one_step is used rather than train_one_step when training with GPU(s), i.e. the learn_on_loaded_batch() code path, where the on_learn_on_batch() callback and model metrics are currently not handled.
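For illustration, a minimal setup that hits the affected code path (a sketch only; PPO on CartPole and the custom metric are arbitrary choices, not taken from the original report). With num_gpus > 0 and simple_optimizer left at its default, the callback below is never invoked before this fix:

from ray.rllib.agents.callbacks import DefaultCallbacks
from ray.rllib.agents.ppo import PPOTrainer

class MyCallbacks(DefaultCallbacks):
    def on_learn_on_batch(self, *, policy, train_batch, result, **kwargs):
        # On the multi-GPU path (learn_on_loaded_batch), this hook was skipped.
        result["custom_train_batch_size"] = train_batch.count

trainer = PPOTrainer(config={
    "env": "CartPole-v0",
    "framework": "torch",
    "num_gpus": 1,  # forces multi_gpu_train_one_step, even with a single GPU
    "callbacks": MyCallbacks,
})
print(trainer.train())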

Related issue number

#21671 (This PR only fixes TorchPolicy)

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
@gjoliver
Member

yeah, this seems reasonable. we basically are not calling on_learn_on_batch() callbacks in learn_on_loaded_batch().
but I think @sven1977 should approve this, since he has the most context on this.
ping if you don't hear from Sven soon; we will make sure this gets taken care of.
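For concreteness, a rough sketch of that kind of change, written here as a TorchPolicy subclass rather than the actual diff in this PR; how the host-side batch is recovered from the loaded buffers is an assumption:

from ray.rllib.policy.torch_policy import TorchPolicy

class CallbackAwareTorchPolicy(TorchPolicy):
    """Sketch only: also fire on_learn_on_batch() on the multi-GPU path."""

    def learn_on_loaded_batch(self, offset=0, buffer_index=0):
        # Assumed detail: load_batch_into_buffer() keeps per-device slices in
        # self._loaded_batches; the first slice stands in for the train batch
        # that user callbacks expect to see.
        batch = self._loaded_batches[buffer_index][0]

        learn_stats = {}
        self.callbacks.on_learn_on_batch(
            policy=self, train_batch=batch, result=learn_stats)

        results = super().learn_on_loaded_batch(offset, buffer_index)
        results.update(learn_stats)
        return results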

Member

@avnishn left a comment

This PR makes sense, but we will need to take the reproduction script and turn it into a test. I can help with this, as it's non-trivial to reason about our CI. We could probably do this with the fake GPU towers, right @sven1977? We also need to extend this to TensorFlow. Thanks for contributing this @XuehaiPan!
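For reference, the fake-GPU idea roughly means running such a test with a config along these lines (a sketch; MyCallbacks stands for whatever callback/metric assertions the test ends up making):

# Sketch of a fake-GPU test config: exercises the multi-GPU code path
# (multi_gpu_train_one_step / learn_on_loaded_batch) on CPU-only CI machines.
config = {
    "env": "CartPole-v0",
    "framework": "torch",
    "num_gpus": 2,
    "_fake_gpus": True,        # tower-based multi-GPU logic without real GPUs
    "callbacks": MyCallbacks,  # hypothetical: asserts on_learn_on_batch() ran
}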

@bveeramani
Member

‼️ ACTION REQUIRED ‼️

We've switched our code formatter from YAPF to Black (see #21311).

To prevent issues with merging your code, here's what you'll need to do:

  1. Install Black:
pip install -I black==21.12b0
  2. Format changed files with Black:
curl -o format-changed.sh https://gist.githubusercontent.com/bveeramani/42ef0e9e387b755a8a735b084af976f2/raw/7631276790765d555c423b8db2b679fd957b984a/format-changed.sh
chmod +x ./format-changed.sh
./format-changed.sh
rm format-changed.sh
  3. Commit your changes:
git add --all
git commit -m "Format Python code with Black"
  4. Merge master into your branch:
git pull upstream master
  5. Resolve merge conflicts (if necessary).

After running these steps, you'll have the updated format.sh.

@avnishn
Member

avnishn commented Feb 23, 2022

@sven1977 I have a question -- I was trying to see how to make this work for TF multi-GPU experiments. I have little experience with TF, so I had some questions that I was hoping you could answer:

What's the best way to go about calling self.callbacks.on_learn_on_batch here?

In the single-node TF case, we keep a reference to the sample batch that's going to be passed to self.callbacks.on_learn_on_batch when we load the batch into the multi-GPU tower.

In the multi-device case, we store this in the multi-GPU tower.

We need to be able to get data out of the multi-GPU tower in the form of a sample batch, and only then will we be able to support this change for TensorFlow multi-GPU. I'm not sure how to go about doing that.

Another approach here is to store the entire batch in a variable, like we do in the single-device case, and then pass that batch to self.callbacks.on_learn_on_batch. However, I think this would double our memory footprint (please correct me if I'm wrong), since we'd have to hold onto two copies of the batch (one inside the multi-GPU tower and one inside the policy).
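A sketch of that second approach, just to make the trade-off concrete (attribute and method bodies are hypothetical, not RLlib's actual TF code):

class TFMultiGPUPolicySketch:
    """Sketch only: the 'store the entire batch in a variable' alternative."""

    def load_batch_into_buffer(self, batch, buffer_index=0):
        # Hypothetical attribute: keep a host-side reference to the full batch
        # next to the data already copied into the multi-GPU towers. This extra
        # copy is what the memory-footprint concern above refers to.
        self._callback_batch = batch
        # ... existing tower-loading logic would follow here ...

    def learn_on_loaded_batch(self, offset=0, buffer_index=0):
        learn_stats = {}
        self.callbacks.on_learn_on_batch(
            policy=self, train_batch=self._callback_batch, result=learn_stats)
        # ... existing per-tower SGD step would follow, merging learn_stats
        # into its returned results ...
        return learn_stats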

Contributor

@sven1977 left a comment

Hey @XuehaiPan, sorry for the delay and thanks for this great fix! I must have missed this PR.
Let's get this merged! :)

@sven1977
Contributor

@avnishn, you are right. For tf, things unfortunately work slightly differently, as e.g. stats_fn is called on each individual tower batch (for torch, it's only called once with the entire batch, which is much cleaner).

I was hoping that our ongoing "Ray Train" integration would make these problems all go away. Let's get this merged here first for torch, then we can fix tf static graph (tf2 currently does NOT support multi-GPU!) later.
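Loosely illustrated with stand-ins (a toy sketch of the calling pattern, not RLlib's actual call sites):

# Toy stand-ins to show where stats_fn gets called in each framework.
def stats_fn(policy, batch):
    return {"batch_size": len(batch)}

whole_train_batch = list(range(128))                         # full train batch
tower_batches = [whole_train_batch[i::2] for i in range(2)]  # split across 2 towers

# Torch multi-GPU: stats_fn is called once, on the entire train batch.
torch_stats = stats_fn(policy=None, batch=whole_train_batch)

# TF static-graph multi-GPU: stats_fn is called per tower, on that tower's
# shard only, so the per-tower results have to be aggregated afterwards.
tf_stats = [stats_fn(policy=None, batch=b) for b in tower_batches]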

@sven1977 sven1977 merged commit 018ebbf into ray-project:master Feb 23, 2022
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Feb 27, 2022
@XuehaiPan XuehaiPan deleted the torch-model-metrics branch August 23, 2022 06:32