
[RLlib] [CI] Deflake longer running RLlib learning tests for off policy algorithms. Fix seeding issue in TransformedAction Environments #21685

Merged: 18 commits merged into ray-project:master on Feb 4, 2022

Conversation

@avnishn (Member) commented Jan 18, 2022

Hitting 10k timesteps does not seem to be achievable within the 900 second time limit, which means the reward threshold must be met for the experiment to terminate. This algorithm is seed sensitive on this environment, so I experimented with the reward threshold across various seeds and ended up fixing the reward threshold for a specific seed. This should still give us regression information while eliminating the possibility that the test flakes.

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@richardliaw (Contributor)

Hey @avnishn, I don't understand why we have to make this change.

We've slowly lowered this reward threshold from -300 to -1000 over the last 7 months; is there a reason why it used to work and now it doesn't?

@avnishn (Member, Author) commented Jan 19, 2022

So here's my best guess, @richardliaw.

At some point we probably tried to decrease the runtime of these tests by looking at successful reward curves and picking a reward threshold that SAC should have hit on this environment within a certain number of timesteps; in this case, -700 within 10k timesteps.

If you remove the 10k timestep stopping criterion, then SAC can achieve a reward of -150 in 30k timesteps (as of today). The original regression test had no stopping criterion based on the number of timesteps trained on, so it probably worked, but its runtime was slower.

I just increased the number of timesteps for the stopping criterion and raised the reward threshold back to -150. Let's see if the Buildkite runner can run for this many timesteps in the allotted 900s.
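
For concreteness, the stop criteria being discussed map onto a Tune-style stop dict roughly like the one below (a minimal sketch using the numbers from this comment; the actual test is driven by a YAML tuned example, and its final values may differ):

```python
# Sketch of the stop criteria under discussion (values taken from this comment):
stop = {
    "episode_reward_mean": -150,  # reward threshold that declares success
    "timesteps_total": 30000,     # raised from the old 10k budget
}
# The 900s figure is the CI wall-clock budget the run has to fit inside,
# enforced by the test runner rather than by these stop keys.
```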

@avnishn (Member, Author) commented Jan 19, 2022

I moved the Pendulum SAC tests to GPU and that seemed to fix the problem. This is ready to merge.

@gjoliver (Member) left a comment

Looks reasonable. A couple of questions.

@@ -97,6 +97,9 @@
if args.framework in ["tf2", "tfe"]:
    exp["config"]["eager_tracing"] = True

if int(os.environ.get("RLLIB_NUM_GPUS", "0")):
    exp["config"]["num_gpus"] = 1
Member:

Why do you want to force this? We should respect users' input.

Member Author:

We only inject this environment variable in our tests. I wanted to add GPU to the YAML file for the tuned example, but couldn't, because then it wouldn't run on most users' laptops. So instead I look for the env variable, which is injected by the GPU runners on Buildkite, so that tests on the GPU runner take advantage of the GPU on the runner.

Member:

Oh, I understand why you need to check the env var this way; I am more suggesting the following:

num_gpus = int(os.environ.get("RLLIB_NUM_GPUS", "0"))
if num_gpus:
    exp["config"]["num_gpus"] = num_gpus

I was only curious about the hardcoded number 1 there. I don't see why we want to force the test to only use 1 GPU.

Member Author:

Oh, silly me. Yeah, you're right.

n_step: 3
rollout_fragment_length: 1
prioritized_replay: true
prioritized_replay: False
Member:

Is this necessary? It's harder to say where the difference comes from if we change multiple variables in a single PR.
I would hope using prioritized_replay helps learning, though maybe it slows things down a bit?

Member Author:

I can change it back, but it's possible that this hurts and doesn't help. I'll remove it and ablate the change, then post the learning curve in the channel.

Member:

Cool, I am really curious to see the learning curve. If we need this, we can keep it, no problem, but I would love to have a bit better understanding.

@avnishn changed the title from "[RLlib] [CI] Lower the pendulum-sac reward threshold" to "[RLlib] [CI] [WIP] Lower the pendulum-sac reward threshold" on Jan 19, 2022
rllib/BUILD Outdated
@@ -46,8 +46,7 @@
# Additional tags are:
# - "team:ml": Indicating that all tests in this file are the responsibility of
# the ML Team.
# - "needs_gpu": Indicating that a test needs to have a GPU in order to run.
# - "gpu": Indicating that a test may (but doesn't have to) be run in the GPU
# - "gpu_rllib_X": Indicating that a test needs to be run in the GPU
Contributor:

What does the "X" mean?

Member Author:

The GPU runner Buildkite tags for RLlib are named gpu_rllib_1 and gpu_rllib_2. I thought X would be a nice placeholder for a number.

rllib/BUILD Outdated
@@ -198,13 +197,33 @@ py_test(

# DDPG
py_test(
name = "learning_tests_pendulum_ddpg",
name = "learning_tests_pendulum_ddpg_tf",
Contributor:

Wait, shouldn't we handle the framework setting the same way as the non-GPU tests? Meaning, no framework specifier, and the Buildkite pipeline definitions pick the correct command line options?

I understand we probably need more than one GPU BK job then. Let me know if this is what you tried to avoid. Happy to do it this way for some limited amount of time.

Member Author:

We could do it the old way, but that would require 3 GPU BK runners instead of 2. I'm not opposed, I just didn't want to add BK runners unnecessarily.

I could add more runners, though. With 4-5 tests on the runner, the GPU setup was taking about 30 minutes, with 20 minutes of tests. Perhaps it was a one-off, but that led to a 50 minute test run, which is probably at the limit of what we want in terms of test time.

Contributor:

Cool, let's add another runner, then. This shouldn't be a problem.

@avnishn changed the title from "[RLlib] [CI] [WIP] Lower the pendulum-sac reward threshold" to "[RLlib] [CI] [WIP] Deflake longer running RLlib tests that require a gpu" on Jan 19, 2022
@avnishn (Member, Author) commented Jan 20, 2022

Using @krfricke's awesome tool for reproducing CI runners, I discovered that the flaky tests don't actually run any faster on GPU. I'm going to revert some of my changes here, including adding multi-GPU runners for the flaky tests.

I also discovered that there wasn't a performance regression in these tests, but rather that Pendulum-v1 has a different reward function and different dynamics that make it harder to learn than v0, on which the test's stop criteria and hparams were based.

See my new comment below.

@avnishn (Member, Author) commented Jan 20, 2022

There was also one other hparam changed since the test was created: n_step: 1 -> 3. What this did was replace the reward for the current timestep with the summed discounted return over 3 steps. I'm not sure why we did this, but when I ablated the change, going back from 3 to 1 seemed to bring the learning speed back to the initial speed from when the test was first created. When I shared my initial thoughts this morning, I hadn't realized that I had changed this hparam from 3 back to 1; I only caught it when looking at my changes with a diff tool.
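
For intuition, here is a minimal sketch of the n-step reward term being described (illustrative only, not RLlib's implementation; the bootstrapped value at step t+n is omitted):

```python
def n_step_return(rewards, gamma=0.99, n=3):
    """Summed discounted return over the next n rewards (illustrative only)."""
    return sum(gamma**i * r for i, r in enumerate(rewards[:n]))

# n_step=1 uses only the immediate reward r_t in the TD target;
# n_step=3 uses r_t + gamma * r_{t+1} + gamma**2 * r_{t+2}.
print(n_step_return([-1.0, -0.5, -0.2], n=1))  # -1.0
print(n_step_return([-1.0, -0.5, -0.2], n=3))  # -1.0 + 0.99 * -0.5 + 0.9801 * -0.2
```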

@avnishn (Member, Author) commented Jan 20, 2022

Meanwhile @gjoliver, running this experiment with prioritized replay vs. without prioritized replay seems not to have made a difference:

[image: learning curves for the 6 runs]

These are 6 runs, where 3 have prioritized replay, and 3 don't.

Meanwhile, changing n_step from 3 -> 1 has reduced the number of timesteps needed to meet the stopping criteria from ~30k to ~10k.

@avnishn (Member, Author) commented Jan 20, 2022

I'm having strange problems with the tuned example that runs SAC on the Pendulum transformed-actions environment.

I've fixed the seed of my experiment and am otherwise using the same hparams, but I'm seeing that the results of the experiment differ across runs:

[image: reward curves differing across runs]

This reminds me of some previous issues that we have had.

@avnishn changed the title from "[RLlib] [CI] [WIP] Deflake longer running RLlib tests that require a gpu" to "[RLlib] [CI] [WIP] Deflake longer running RLlib tests" on Jan 20, 2022
@gjoliver (Member)

> Meanwhile @gjoliver, running this experiment with prioritized replay vs. without prioritized replay seems not to have made a difference. [image]
>
> These are 6 runs, where 3 have prioritized replay, and 3 don't.
>
> Meanwhile, changing n_step from 3 -> 1 has reduced the number of timesteps needed to meet the stopping criteria from ~30k to ~10k.

Very cool! Then let's keep that diff out of this change.
The n_step difference is quite puzzling, but it's RL, so I am not surprised :)

@avnishn changed the title from "[RLlib] [CI] [WIP] Deflake longer running RLlib tests" to "[RLlib] [CI] [WIP] Deflake longer running RLlib tests. Fix seeding issue in TransformedAction Environments" on Jan 21, 2022
@avnishn (Member, Author) commented Jan 21, 2022

Here's a Google Colab describing some changes that I'm going to make to the TransformedActionSpaceEnv environments in rllib.examples:

https://colab.research.google.com/drive/1BhoZNuG-9NDHqZ8bN7BgPJ_9i0jWWEnr?usp=sharing

It explains how seeding was not implemented correctly by TransformedActionSpaceEnv, which contributed to the flakiness in the tests that use this environment.
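
To illustrate the kind of bug being described, here is a hypothetical sketch (not the actual RLlib class; it assumes the pre-0.26 gym API with env.seed()): a wrapper-style env must forward seed() to the env it wraps, otherwise fixing the experiment seed has no effect on the wrapped dynamics.

```python
import gym


class TransformedActionEnvSketch(gym.Env):
    """Hypothetical action-transforming env, for illustration only."""

    def __init__(self, config=None):
        self.env = gym.make("Pendulum-v1")
        self.observation_space = self.env.observation_space
        self.action_space = self.env.action_space

    def seed(self, seed=None):
        # The crucial piece: forward the seed to the wrapped env. If this
        # method is missing, runs differ even when the experiment seed is fixed.
        return self.env.seed(seed)

    def reset(self):
        return self.env.reset()

    def step(self, action):
        # (Action rescaling omitted for brevity.)
        return self.env.step(action)
```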

@avnishn self-assigned this on Jan 21, 2022
@avnishn changed the title from "[RLlib] [CI] [WIP] Deflake longer running RLlib tests. Fix seeding issue in TransformedAction Environments" to "[RLlib] [CI] Deflake longer running RLlib learning tests for off policy algorithms. Fix seeding issue in TransformedAction Environments" on Jan 21, 2022
@avnishn (Member, Author) commented Jan 27, 2022

The final extremely flaky tests were the DDPG continuous learning tests. I updated them after running each of them for 12 seeds (I had to, given our seeding issue), and then found a rough lower bound for the reward after a certain number of timesteps runnable on the CI. The threshold has been lowered on the fake-GPU DDPG Pendulum test, but that is because I also reduced the number of timesteps allowed to hit the reward threshold.
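
For context, a multi-seed sweep like the one described could be run with Tune roughly as follows (a sketch only; the seed count matches the comment, while the algorithm name, env, and stop values are illustrative rather than the exact CI settings):

```python
from ray import tune

tune.run(
    "DDPG",
    config={
        "env": "Pendulum-v1",
        "seed": tune.grid_search(list(range(12))),  # 12 seeds, as described above
    },
    stop={"timesteps_total": 20000},  # fixed timestep budget; value is illustrative
)
```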

@richardliaw could you please merge when tests are passing?

@richardliaw (Contributor) commented Jan 28, 2022 via email

@bveeramani (Member)

‼️ ACTION REQUIRED ‼️

We've switched our code formatter from YAPF to Black (see #21311).

To prevent issues with merging your code, here's what you'll need to do:

  1. Install Black
pip install -I black==21.12b0
  2. Format changed files with Black
curl -o format-changed.sh https://gist.githubusercontent.com/bveeramani/42ef0e9e387b755a8a735b084af976f2/raw/7631276790765d555c423b8db2b679fd957b984a/format-changed.sh
chmod +x ./format-changed.sh
./format-changed.sh
rm format-changed.sh
  3. Commit your changes.
git add --all
git commit -m "Format Python code with Black"
  4. Merge master into your branch.
git pull upstream master
  5. Resolve merge conflicts (if necessary).

After running these steps, you'll have the updated format.sh.

@avnishn (Member, Author) commented Feb 4, 2022

@sven1977 I ended up adding a new test runner for the tf2 eager long-running continuous off-policy learning tests. In that target I lowered the reward threshold for the tf2 eager Pendulum SAC and DDPG tests, since on tf2 eager they run at half the speed of their tf1 and torch counterparts. Could you PTAL and merge if you're OK with this?

@gjoliver (Member) commented Feb 4, 2022

What?? Is this a performance regression?
We should get to the bottom of this instead of trying to get the tests to pass.

@avnishn (Member, Author) commented Feb 4, 2022

> What?? Is this a performance regression?
>
> We should get to the bottom of this instead of trying to get the tests to pass.

I think we've been aware for a while that tf2 running in eager execution mode is 1.7 times slower than tf2 graph mode and tf1.

But I could be wrong. I'll let @sven1977 comment on that.
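
For reference, the relevant knobs here are the same config keys that appear in this PR's diff above; a minimal sketch (standard RLlib config keys, values illustrative):

```python
config = {
    "framework": "tf2",     # run under TF2, which executes eagerly by default
    "eager_tracing": True,  # trace eager code into tf.functions to recover most of the graph-mode speed
}
```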

.buildkite/pipeline.ml.yml: 3 outdated review comments (resolved)
rllib/BUILD: 3 outdated review comments (resolved)
@sven1977 (Contributor) left a comment

Looks great! Thanks for fixing and stabilizing these!

@sven1977 (Contributor) commented Feb 4, 2022

Yeah, n_step=1 being better than n_step=3 is weird. But we did add thorough n-step tests lately and I don't think we have a bug there anymore (as we used to). Still something to keep on our radar.

@sven1977 (Contributor) commented Feb 4, 2022

Just waiting for tests ...

@sven1977 merged commit 0d2ba41 into ray-project:master on Feb 4, 2022
Labels: none. Projects: none. Linked issues: none.
5 participants