
[RLlib] [CI] Deflake longer running RLlib learning tests for off policy algorithms. Fix seeding issue in TransformedAction Environments #21685

Merged: 18 commits merged into ray-project:master on Feb 4, 2022

Conversation

@avnishn (Member) commented Jan 18, 2022

Hitting 10k timesteps does not seem to be achievable within the 900 second time limit, which means the reward threshold must be met for the experiment to terminate. This algorithm is seed sensitive on this environment, so I experimented with the reward threshold across various seeds and ended up fixing the reward threshold for a specific seed. This should still give us regression information while eliminating the possibility that the test flakes.

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@richardliaw (Contributor)

Hey @avnishn, I don't understand why we have to make this change.

We've slowly lowered this reward threshold from -300 to -1000 over the last 7 months; is there a reason why it used to work and now it doesn't?

@avnishn (Member, Author) commented Jan 19, 2022

So here's my best guess, @richardliaw.

At some point we probably tried to decrease the runtime of these tests by looking at successful reward curves and picking a reward threshold that SAC should have hit on this environment within a certain number of timesteps; in this case, -700 within 10k timesteps.

If you remove the 10k timestep stopping criterion, then SAC can achieve a reward of -150 in 30k timesteps (as of today). The original regression test had no stopping criterion based on the number of timesteps trained on, so it probably worked, but its runtime was slower.

I just increased the number of timesteps for the stopping criterion and raised the reward threshold back to -150. Let's see if the Buildkite runner can run for this many timesteps in the allotted 900s.
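
For concreteness, the stop criteria being discussed map onto a Tune-style stop dict roughly like the one below (a minimal sketch using the numbers from this comment; the actual test is driven by a YAML tuned example, and its final values may differ):

```python
# Sketch of the stop criteria under discussion (values taken from this comment):
stop = {
    "episode_reward_mean": -150,  # reward threshold that declares success
    "timesteps_total": 30000,     # raised from the old 10k budget
}
# The 900s figure is the CI wall-clock budget the run has to fit inside,
# enforced by the test runner rather than by these stop keys.
```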

@avnishn (Member, Author) commented Jan 19, 2022

I moved the Pendulum SAC tests to GPU and that seemed to fix the problem. This is ready to merge.

@gjoliver (Member) left a comment

Looks reasonable. A couple of questions.

@@ -97,6 +97,9 @@
if args.framework in ["tf2", "tfe"]:
    exp["config"]["eager_tracing"] = True

if int(os.environ.get("RLLIB_NUM_GPUS", "0")):
    exp["config"]["num_gpus"] = 1
Member:

Why do you want to force this? We should respect users' input.

Member Author:

We only inject this environment variable in our tests. I wanted to add GPU to the YAML file for the tuned example, but couldn't, because then it wouldn't run on most users' laptops. So instead I look for the env variable, which is injected by the GPU runners on Buildkite, so that tests on the GPU runner take advantage of the GPU on the runner.

Member:

Oh, I understand why you need to check the env var this way; I am more suggesting the following:

num_gpus = int(os.environ.get("RLLIB_NUM_GPUS", "0"))
if num_gpus:
    exp["config"]["num_gpus"] = num_gpus

I was only curious about the hardcoded number 1 there. I don't see why we want to force the test to only use 1 GPU.

Member Author:

Oh, silly me. Yeah, you're right.

n_step: 3
rollout_fragment_length: 1
prioritized_replay: true
prioritized_replay: False
Member:

Is this necessary? It's harder to say where the difference comes from if we change multiple variables in a single PR.
I would hope using prioritized_replay helps learning, though maybe it slows things down a bit?

Member Author:

I can change it back, but it's possible that this hurts and doesn't help. I'll remove it and ablate the change, then post the learning curve in the channel.

Member:

Cool, I am really curious to see the learning curve. If we need this, we can keep it, no problem, but I would love to have a bit better understanding.

@avnishn changed the title from "[RLlib] [CI] Lower the pendulum-sac reward threshold" to "[RLlib] [CI] [WIP] Lower the pendulum-sac reward threshold" on Jan 19, 2022
rllib/BUILD Outdated
@@ -46,8 +46,7 @@
# Additional tags are:
# - "team:ml": Indicating that all tests in this file are the responsibility of
# the ML Team.
# - "needs_gpu": Indicating that a test needs to have a GPU in order to run.
# - "gpu": Indicating that a test may (but doesn't have to) be run in the GPU
# - "gpu_rllib_X": Indicating that a test needs to be run in the GPU
Contributor:

What does the "X" mean?

Member Author:

The GPU runner Buildkite tags for RLlib are named gpu_rllib_1 and gpu_rllib_2. I thought X would be a nice placeholder for a number.

rllib/BUILD Outdated
@@ -198,13 +197,33 @@ py_test(

# DDPG
py_test(
name = "learning_tests_pendulum_ddpg",
name = "learning_tests_pendulum_ddpg_tf",
Contributor:

Wait, shouldn't we handle the framework setting the same way as the non-GPU tests? Meaning, no framework specifier, and the Buildkite pipeline definitions pick the correct command line options?

I understand we probably need more than one GPU BK job then. Let me know if this is what you tried to avoid. Happy to do it this way for some limited amount of time.

Member Author:

We could do it the old way, but that would require 3 GPU BK runners instead of 2. I'm not opposed, I just didn't want to add BK runners unnecessarily.

I could add more runners, though. With 4-5 tests on the runner, the GPU setup was taking about 30 minutes, with 20 minutes of tests. Perhaps it was a one-off, but that led to a 50 minute test run, which is probably at the limit of what we want in terms of test time.

Contributor:

Cool, let's add another runner, then. This shouldn't be a problem.

@avnishn changed the title from "[RLlib] [CI] [WIP] Lower the pendulum-sac reward threshold" to "[RLlib] [CI] [WIP] Deflake longer running RLlib tests that require a gpu" on Jan 19, 2022
@avnishn (Member, Author) commented Jan 20, 2022

Using @krfricke's awesome tool for reproducing CI runners, I discovered that the flaky tests don't actually run any faster on GPU. I'm going to revert some of my changes here, including adding multi-GPU runners for the flaky tests.

I also discovered that there wasn't a performance regression in these tests, but rather that Pendulum-v1 has a different reward function and different dynamics that make it harder to learn than v0, on which the test's stop criteria and hparams were based.

See my new comment below.

@avnishn (Member, Author) commented Jan 20, 2022

There was also one other hparam changed since the test was created: n_step: 1 -> 3. What this did was replace the reward for the current timestep with the summed discounted return over 3 steps. I'm not sure why we did this, but when I ablated the change, going back from 3 to 1 seemed to bring the learning speed back to the initial speed from when the test was first created. When I shared my initial thoughts this morning, I hadn't realized that I had changed this hparam from 3 back to 1; I only caught it when looking at my changes with a diff tool.
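
For intuition, here is a minimal sketch of the n-step reward term being described (illustrative only, not RLlib's implementation; the bootstrapped value at step t+n is omitted):

```python
def n_step_return(rewards, gamma=0.99, n=3):
    """Summed discounted return over the next n rewards (illustrative only)."""
    return sum(gamma**i * r for i, r in enumerate(rewards[:n]))

# n_step=1 uses only the immediate reward r_t in the TD target;
# n_step=3 uses r_t + gamma * r_{t+1} + gamma**2 * r_{t+2}.
print(n_step_return([-1.0, -0.5, -0.2], n=1))  # -1.0
print(n_step_return([-1.0, -0.5, -0.2], n=3))  # -1.0 + 0.99 * -0.5 + 0.9801 * -0.2
```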

@avnishn (Member, Author) commented Jan 20, 2022

Meanwhile @gjoliver, running this experiment with prioritized replay vs. without prioritized replay seems not to have made a difference:

[image: learning curves for the 6 runs]

These are 6 runs, where 3 have prioritized replay, and 3 don't.

Meanwhile, changing n_step from 3 -> 1 has reduced the number of timesteps needed to meet the stopping criteria from ~30k to ~10k.

@avnishn (Member, Author) commented Jan 20, 2022

I'm having strange problems with the tuned example that runs SAC on the Pendulum transformed-actions environment.

I've fixed the seed of my experiment and am otherwise using the same hparams, but I'm seeing that the results of the experiment differ across runs:

[image: reward curves differing across runs]

This reminds me of some previous issues that we have had.

@avnishn changed the title from "[RLlib] [CI] [WIP] Deflake longer running RLlib tests that require a gpu" to "[RLlib] [CI] [WIP] Deflake longer running RLlib tests" on Jan 20, 2022
@gjoliver (Member)

> Meanwhile @gjoliver, running this experiment with prioritized replay vs. without prioritized replay seems not to have made a difference. [image]
>
> These are 6 runs, where 3 have prioritized replay, and 3 don't.
>
> Meanwhile, changing n_step from 3 -> 1 has reduced the number of timesteps needed to meet the stopping criteria from ~30k to ~10k.

Very cool! Then let's keep that diff out of this change.
The n_step difference is quite puzzling, but it's RL, so I am not surprised :)

@avnishn changed the title from "[RLlib] [CI] [WIP] Deflake longer running RLlib tests" to "[RLlib] [CI] [WIP] Deflake longer running RLlib tests. Fix seeding issue in TransformedAction Environments" on Jan 21, 2022
@avnishn (Member, Author) commented Jan 21, 2022

Here's a Google Colab describing some changes that I'm going to make to the TransformedActionSpaceEnv environments in rllib.examples:

https://colab.research.google.com/drive/1BhoZNuG-9NDHqZ8bN7BgPJ_9i0jWWEnr?usp=sharing

It explains how seeding was not implemented correctly by TransformedActionSpaceEnv, which contributed to the flakiness in the tests that use this environment.
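
To illustrate the kind of bug being described, here is a hypothetical sketch (not the actual RLlib class; it assumes the pre-0.26 gym API with env.seed()): a wrapper-style env must forward seed() to the env it wraps, otherwise fixing the experiment seed has no effect on the wrapped dynamics.

```python
import gym


class TransformedActionEnvSketch(gym.Env):
    """Hypothetical action-transforming env, for illustration only."""

    def __init__(self, config=None):
        self.env = gym.make("Pendulum-v1")
        self.observation_space = self.env.observation_space
        self.action_space = self.env.action_space

    def seed(self, seed=None):
        # The crucial piece: forward the seed to the wrapped env. If this
        # method is missing, runs differ even when the experiment seed is fixed.
        return self.env.seed(seed)

    def reset(self):
        return self.env.reset()

    def step(self, action):
        # (Action rescaling omitted for brevity.)
        return self.env.step(action)
```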

@avnishn self-assigned this on Jan 21, 2022
@avnishn changed the title from "[RLlib] [CI] [WIP] Deflake longer running RLlib tests. Fix seeding issue in TransformedAction Environments" to "[RLlib] [CI] Deflake longer running RLlib learning tests for off policy algorithms. Fix seeding issue in TransformedAction Environments" on Jan 21, 2022
@avnishn (Member, Author) commented Jan 27, 2022

The final extremely flaky tests were the DDPG continuous learning tests. I updated them after running each of them for 12 seeds (I had to, given our seeding issue), and then found a rough lower bound for the reward after a certain number of timesteps runnable on the CI. The threshold has been lowered on the fake-GPU DDPG Pendulum test, but that is because I also reduced the number of timesteps allowed to hit the reward threshold.
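
For context, a multi-seed sweep like the one described could be run with Tune roughly as follows (a sketch only; the seed count matches the comment, while the algorithm name, env, and stop values are illustrative rather than the exact CI settings):

```python
from ray import tune

tune.run(
    "DDPG",
    config={
        "env": "Pendulum-v1",
        "seed": tune.grid_search(list(range(12))),  # 12 seeds, as described above
    },
    stop={"timesteps_total": 20000},  # fixed timestep budget; value is illustrative
)
```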

@richardliaw could you please merge when tests are passing?

@richardliaw (Contributor) commented Jan 28, 2022 via email

@bveeramani (Member)

‼️ ACTION REQUIRED ‼️

We've switched our code formatter from YAPF to Black (see #21311).

To prevent issues with merging your code, here's what you'll need to do:

  1. Install Black
pip install -I black==21.12b0
  2. Format changed files with Black
curl -o format-changed.sh https://gist.githubusercontent.com/bveeramani/42ef0e9e387b755a8a735b084af976f2/raw/7631276790765d555c423b8db2b679fd957b984a/format-changed.sh
chmod +x ./format-changed.sh
./format-changed.sh
rm format-changed.sh
  3. Commit your changes.
git add --all
git commit -m "Format Python code with Black"
  4. Merge master into your branch.
git pull upstream master
  5. Resolve merge conflicts (if necessary).

After running these steps, you'll have the updated format.sh.

@avnishn (Member, Author) commented Feb 4, 2022

@sven1977 I ended up adding a new test runner for the tf2 eager long-running continuous off-policy learning tests. In that target I lowered the reward threshold for the tf2 eager Pendulum SAC and DDPG tests, since on tf2 eager they run at half the speed of their tf1 and torch counterparts. Could you PTAL and merge if you're OK with this?

@gjoliver (Member) commented Feb 4, 2022

What?? Is this a performance regression?
We should get to the bottom of this instead of trying to get the tests to pass.

@avnishn (Member, Author) commented Feb 4, 2022

> What?? Is this a performance regression?
>
> We should get to the bottom of this instead of trying to get the tests to pass.

I think we've been aware for a while that tf2 running in eager execution mode is 1.7 times slower than tf2 graph mode and tf1.

But I could be wrong. I'll let @sven1977 comment on that.
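
For reference, the relevant knobs here are the same config keys that appear in this PR's diff above; a minimal sketch (standard RLlib config keys, values illustrative):

```python
config = {
    "framework": "tf2",     # run under TF2, which executes eagerly by default
    "eager_tracing": True,  # trace eager code into tf.functions to recover most of the graph-mode speed
}
```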

.buildkite/pipeline.ml.yml: 3 outdated review comments (resolved)
rllib/BUILD: 3 outdated review comments (resolved)
@sven1977 (Contributor) left a comment

Looks great! Thanks for fixing and stabilizing these!

@sven1977 (Contributor) commented Feb 4, 2022

Yeah, n_step=1 being better than n_step=3 is weird. But we did add thorough n-step tests lately and I don't think we have a bug there anymore (as we used to). Still something to keep on our radar.

@sven1977 (Contributor) commented Feb 4, 2022

Just waiting for tests ...

@sven1977 merged commit 0d2ba41 into ray-project:master on Feb 4, 2022
Labels: none. Projects: none. Linked issues: none.
5 participants