
[RLlib] Move bandits into main agents folder; Make RecSim adapter more accessible; #21773

Merged

Conversation

@sven1977 sven1977 (Contributor) commented Jan 21, 2022

  • Move the bandit algorithms (LinTS and LinUCB) from rllib/contrib into the main rllib.agents folder.
  • Rename the registry names for these algos from "contrib/Lin[TS|UCB]" to "BanditLin[TS|UCB]" (see the usage sketch below).
  • Clean up the RecSim environment adapter and create 3 out-of-the-box RLlib-ready RecSim environments (interest evolution, long-term satisfaction, and interest exploration).
  • Add simple compilation tests for the bandit algorithms.
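
As a rough usage sketch of the renamed registry entries (the import path of the example env below is an assumption and not stated in this PR's description):

from ray import tune
# NOTE: the exact module of WheelBanditEnv is assumed here; per this PR it lives
# somewhere under rllib/examples/env/.
from ray.rllib.examples.env.bandit_envs_discrete import WheelBanditEnv

# Old (now deprecated): tune.run("contrib/LinTS") / tune.run("contrib/LinUCB")
# New registry names introduced by this PR:
tune.run(
    "BanditLinTS",
    config={"env": WheelBanditEnv},
    stop={"training_iteration": 10},
)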

TODO (follow up PR):

  • Run benchmarks to use for hard-task nightly/weekly learning tests.

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@sven1977 sven1977 changed the title [WIP RLlib] Move bandits into main agents folder. [RLlib] Move bandits into main agents folder; Make RecSim adapter more accessible; Jan 21, 2022
@sven1977 sven1977 requested a review from avnishn January 21, 2022 10:32
@sven1977 sven1977 marked this pull request as ready for review January 21, 2022 10:32
@@ -81,14 +78,14 @@ def make_model_and_action_dist(policy, obs_space, action_space, config):

     # TODO: Have a separate model catalogue for bandits
     if exploration_config:
-        if exploration_config["type"] == TS_PATH:
+        if exploration_config["type"] == "ThompsonSampling":
Contributor:

maybe make this backwards compatible?

Member:

+1

Contributor Author (sven1977):

Not sure this would work: The exploration type is now a single "ThompsonSampling" string as we moved the ThompsonSampling class into our built-in RLlib utils/exploration package.

The two supported exploration types are pretty much hard-coded into these two bandit Trainers (UCB and TS).
Also, the old TS_PATH (-> ThompsonSampling path) no longer exists as it was inside the contrib folder.
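
For reference, the user-facing config now presumably refers to these built-in exploration classes by their plain string names, roughly like this (a sketch based on the strings in the diff above, not the actual default configs):

# Thompson sampling bandit (BanditLinTS):
ts_config = {"exploration_config": {"type": "ThompsonSampling"}}

# Upper-confidence-bound bandit (BanditLinUCB):
ucb_config = {"exploration_config": {"type": "UpperConfidenceBound"}}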

Contributor:

OK, this may not be the right level to do backward-compat messaging. Can you make sure that users' code from previous Ray versions will automatically raise an actionable error message to help them migrate to the new changes?

Contributor Author (sven1977):

This is done through the added "location" check. The new test_bandits test case makes sure this works and users will get a deprecation message when they do stuff like:

tune.run("contrib/LinTS")

# OR

from ray.rllib.contrib.bandits.agents.lin_ts import LinTSTrainer
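
For comparison, the post-PR equivalents would look roughly like this (the new module path of the trainer class is an assumption; only the registry name "BanditLinTS" and the class name BanditLinTSTrainer are confirmed by this PR):

tune.run("BanditLinTS")

# OR (module path assumed):

from ray.rllib.agents.bandit.bandit import BanditLinTSTrainer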

@avnishn avnishn (Member) left a comment

LGTM. Left a few comments and asked some clarifying questions. Thanks for doing this and for all the hard work.

@@ -81,14 +78,14 @@ def make_model_and_action_dist(policy, obs_space, action_space, config):

     # TODO: Have a separate model catalogue for bandits
     if exploration_config:
-        if exploration_config["type"] == TS_PATH:
+        if exploration_config["type"] == "ThompsonSampling":
             if isinstance(original_space, spaces.Dict):
                 assert "item" in original_space.spaces, \
                     "Cannot find 'item' key in observation space"
                 model_cls = ParametricLinearModelThompsonSampling
             else:
                 model_cls = DiscreteLinearModelThompsonSampling
-        elif exploration_config["type"] == UCB_PATH:
+        elif exploration_config["type"] == "UpperConfidenceBound":
Member:

Same here, in terms of keeping this backwards compatible.

Contributor Author (sven1977):

Please see my comment above.

@@ -32,7 +32,7 @@ def plot_model_weights(means, covs):
 if __name__ == "__main__":
     num_iter = 10
     print("Running training for %s time steps" % num_iter)
-    trainer = LinTSTrainer(env=WheelBanditEnv)
+    trainer = BanditLinTSTrainer(env=WheelBanditEnv)
Member:

Is this a toy environment? Asking for clarification, not w.r.t. the PR.

Contributor Author (sven1977):

Yes, this is one of our "bandit-friendly" example envs, now in the rllib/examples/env/... folder.

        # Force good learning behavior (this is a very simple env).
        self.assertTrue(results["episode_reward_mean"] == 10.0)

    def test_deprecated_locations(self):
Member:

I like this

    def tearDownClass(cls) -> None:
        ray.shutdown()

    def test_bandit_lin_ts_compilation(self):
Member:

Do you think it's worth it if we started writing some unit tests for algorithms that check whether the loss functions, etc. have the right outputs? It's extra work, but I've found in the past that unit tests like these heavily reduce the time it takes to debug later performance regressions. That, and it was a good exercise for building knowledge of the algorithms we were writing.

Member:

I can definitely start this pattern on my own whenever I implement an algorithm next.

Member:

+1
Not necessarily this PR though.
E.g., if we add a tf impl for the bandits, we can write unit tests to make sure those loss funcs are the same.

Contributor Author (sven1977):

Great idea @avnishn. Some of our algos do have these tests, e.g. PPO, DQN, PG, MARWIL, but not all. Yes, we should ideally have these for all algos. I agree that it helps reduce bugs at a small cost!
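
To make the idea concrete, here is a toy, self-contained illustration of such a loss-output test (plain NumPy, not RLlib code; the TD-style loss is just a stand-in for whatever loss an algorithm defines):

import numpy as np

def td_loss(q_values, rewards, next_q_values, gamma=0.9):
    # Mean squared TD error; stands in for "the algorithm's loss function".
    targets = rewards + gamma * next_q_values
    return float(np.mean((q_values - targets) ** 2))

# Hand-written batch with an easily verifiable expected loss:
# targets = [0.9, 1.9] -> squared errors = [0.01, 0.01] -> mean = 0.01
q = np.array([1.0, 2.0])
r = np.array([0.0, 1.0])
next_q = np.array([1.0, 1.0])
assert abs(td_loss(q, r, next_q) - 0.01) < 1e-8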

"contrib/LinUCB",
config=UCB_CONFIG,
"BanditLinUCB",
config=config,
stop={"training_iteration": training_iterations},
Member:

Could we add a reward-based stopping criterion once we finish benchmarking these?

Contributor Author (sven1977):

Definitely need to clean up these scripts. These are the original ones from our contributors.

Contributor Author (sven1977):

Can we do this in a follow-up PR?
I just moved these here; these are not new scripts. Let's go through them in another PR and make sure the bandits learn properly.

Examples:
    >>> env_ctx = EnvContext({"a": 1, "b": 2}, worker_index=0)
    >>> env_ctx.set_defaults({"a": -42, "c": 3})
    >>> print(env_ctx)
Member:

I like this

Member:

If I understand this function correctly, it removes clutter from a lot of the example environments that we've added.
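
A rough sketch of how an example env could use it (the env class and its config keys here are hypothetical; only EnvContext.set_defaults() itself is from this PR):

import gym
from gym.spaces import Discrete
from ray.rllib.env.env_context import EnvContext

class MyExampleEnv(gym.Env):
    def __init__(self, config: EnvContext):
        # Fill in missing keys once instead of sprinkling config.get(...) calls around.
        config.set_defaults({"num_arms": 5, "noise": 0.0})
        self.num_arms = config["num_arms"]
        self.noise = config["noise"]
        self.action_space = Discrete(self.num_arms)
        self.observation_space = Discrete(1)

    def reset(self):
        return 0

    def step(self, action):
        # Dummy dynamics: reward 1.0 for arm 0, episode ends after one step.
        return 0, float(action == 0), True, {}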

rllib/env/wrappers/recsim.py (resolved)
 from ray.rllib.utils.error import UnsupportedSpaceException


 class TestRecSimWrapper(unittest.TestCase):
     def test_observation_space(self):
-        env = make_recsim_env(config={})
+        env = InterestEvolutionRecSimEnv()
         obs = env.reset()
Member:

We could also check whether a sampled observation is contained in the observation space, although I'm not sure how good a test that is. We could sample a few thousand observations and check whether they're contained in the obs space (it's a relatively cheap operation).

Contributor Author (sven1977):

I think we are doing this right below, though only twice: after env.reset() and after step().

Added some more checks for the action space test.

Not sure I fully understand what you are aiming at, though, but please feel free to suggest more checks here.
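
For reference, a sketch of the kind of containment check being discussed (the import path of InterestEvolutionRecSimEnv is assumed; the class itself appears in the test diff above):

# Sketch only; not part of this PR's test file.
from ray.rllib.env.wrappers.recsim import InterestEvolutionRecSimEnv  # path assumed

env = InterestEvolutionRecSimEnv()
obs = env.reset()
assert env.observation_space.contains(obs)
for _ in range(1000):
    obs, reward, done, info = env.step(env.action_space.sample())
    assert env.observation_space.contains(obs)
    if done:
        obs = env.reset()
        assert env.observation_space.contains(obs)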


    return multi_action


def rllib_gym_wrapper(recsim_gym_env: gym.Env,
Member:

recsys_gym_wrapper?

Contributor Author (sven1977):

Oh no! Great catch! :D

Contributor Author (sven1977):

fixed

…its_become_1st_class_citizens

# Conflicts:
#	doc/source/rllib/rllib-algorithms.rst
@avnishn avnishn (Member) left a comment

Had some comments about seeding the learning tests, but otherwise LGTM.

        results = None
        for i in range(num_iterations):
            results = trainer.train()
            check_train_results(results)
Member:

Can we seed this test?
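
For reference, RLlib trainers accept a "seed" key in their config, so seeding this test could look roughly like this (sketch only; the surrounding test code is not shown here):

config["seed"] = 42  # fix the RNG seeds used by the trainer and its workers
trainer = BanditLinTSTrainer(config=config, env=WheelBanditEnv)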

@gjoliver gjoliver (Member) left a comment

lgtm

@sven1977 sven1977 merged commit 893536e into ray-project:master Jan 27, 2022