
[RLlib] Exploration API: Policy changes needed for forward pass noisifications. #7798

Conversation

@sven1977 sven1977 (Contributor) commented Mar 29, 2020

This PR contains the following Policy changes to allow different types of forward-pass noisifications:

  • All major Policy classes (tf, eager, torch) allow action sampling behavior to be set up in one of three ways (see the sketch after this list):
    • action_sampler_fn: Fully customized action sampling, returning only actions and logp. If provided, the Policy will not automatically use its Exploration object; any exploration has to be done inside the action_sampler_fn itself.
    • action_distribution_fn: Customized generator for the distribution inputs and distribution class. The Policy's Exploration is used automatically (before_compute_actions and get_exploration_action).
    • Neither of the above: Distribution inputs and class are determined in the default fashion, via a forward pass through the Model.

This removes the need for a log_likelihood_fn and thus simplifies DQN and SAC.

  • All Policy.compute_actions() methods now return (by default, and if available) the extra-action-fetches key ACTION_DIST_INPUTS.
    This is currently only used by the ParameterNoise class, but should be useful information for other cases as well.

  • The ActionDistribution object is now passed directly into Exploration.get_exploration_action(). Before, distribution-inputs and distribution-class were passed in separately.

  • Each Exploration class now requires the Model upon construction, so the Model is no longer passed into e.g. Exploration.get_exploration_action(). This requires the Model to be built before the Exploration object in all Policy classes.
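
To make the dispatch order above concrete, here is a minimal, self-contained sketch of the described flow. This is not RLlib code: SimplePolicy, SimpleExploration, CategoricalDist and toy_model are hypothetical stand-ins, the real RLlib signatures take additional arguments (state batches, seq lens, etc.), and state/RNN handling is omitted.

```python
# Minimal, self-contained sketch of the action-sampling dispatch described
# above. NOT actual RLlib code: SimplePolicy, SimpleExploration,
# CategoricalDist and toy_model are hypothetical stand-ins.
import numpy as np


class CategoricalDist:
    """Toy ActionDistribution built directly from model outputs (logits)."""

    def __init__(self, inputs):
        exp = np.exp(inputs - inputs.max(axis=-1, keepdims=True))
        self.probs = exp / exp.sum(axis=-1, keepdims=True)

    def sample(self):
        return np.array([np.random.choice(len(p), p=p) for p in self.probs])

    def logp(self, actions):
        return np.log(self.probs[np.arange(len(actions)), actions])


class SimpleExploration:
    """Toy Exploration: gets the Model at construction time (as in this PR)
    and is handed the ready-made ActionDistribution object later on."""

    def __init__(self, model):
        self.model = model

    def before_compute_actions(self, timestep=None):
        pass  # e.g. ParameterNoise would (re-)noisify the model weights here

    def get_exploration_action(self, action_dist, explore=True, timestep=None):
        if explore:
            actions = action_dist.sample()
        else:
            actions = action_dist.probs.argmax(axis=-1)
        return actions, action_dist.logp(actions)


class SimplePolicy:
    def __init__(self, model, action_sampler_fn=None,
                 action_distribution_fn=None):
        self.model = model                           # Model is built first ...
        self.exploration = SimpleExploration(model)  # ... then Exploration.
        self.action_sampler_fn = action_sampler_fn
        self.action_distribution_fn = action_distribution_fn

    def compute_actions(self, obs_batch, explore=True, timestep=None):
        if self.action_sampler_fn:
            # 1) Fully custom sampling; exploration (if any) is entirely the
            #    responsibility of the provided function.
            actions, logp = self.action_sampler_fn(self, self.model, obs_batch)
            dist_inputs = None
        else:
            # Exploration hook before each forward pass.
            self.exploration.before_compute_actions(timestep=timestep)
            if self.action_distribution_fn:
                # 2) Custom distribution inputs and distribution class.
                dist_inputs, dist_class = self.action_distribution_fn(
                    self, self.model, obs_batch)
            else:
                # 3) Default: forward pass through the Model.
                dist_inputs, dist_class = self.model(obs_batch), CategoricalDist
            action_dist = dist_class(dist_inputs)
            actions, logp = self.exploration.get_exploration_action(
                action_dist, explore=explore, timestep=timestep)
        # ACTION_DIST_INPUTS is returned as an extra fetch (if available).
        extra_fetches = {"action_dist_inputs": dist_inputs, "action_logp": logp}
        return actions, [], extra_fetches


if __name__ == "__main__":
    def toy_model(obs):
        # Toy "model": linear map from obs (batch, 4) to logits (batch, 2).
        return obs @ np.ones((4, 2))

    policy = SimplePolicy(toy_model)
    print(policy.compute_actions(np.random.rand(3, 4)))
```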


sven1977 and others added 30 commits March 19, 2020 11:33
…oration_API_parameter_noise_api_only

Conflicts:
	rllib/policy/policy.py
	rllib/tests/test_checkpoint_restore.py
…oration_API_parameter_noise_api_only

Conflicts:
	rllib/agents/dqn/tests/test_dqn.py
Co-Authored-By: Eric Liang <ekhliang@gmail.com>
Co-Authored-By: Eric Liang <ekhliang@gmail.com>
Co-Authored-By: Eric Liang <ekhliang@gmail.com>
Co-Authored-By: Eric Liang <ekhliang@gmail.com>
…com/sven1977/ray into exploration_api_minimal_param_noise

Conflicts:
	rllib/utils/exploration/exploration.py
@@ -95,17 +92,8 @@ def compute_actions(self,
**kwargs):
return list(state_batches[0]), state_batches, {}

def learn_on_batch(self, samples):
Contributor

Consider keeping these; they are there to make it clear these are no-ops.

run_heuristic_vs_learned(use_lstm=False)
# run_with_custom_entropy_loss()
parser = argparse.ArgumentParser()
parser.add_argument("--stop", type=int, default=1000)
Contributor

Consider keeping parser args at the top of the file by convention.

Contributor Author

👍

Contributor Author

Done.

run_same_policy(args)
run_heuristic_vs_learned(args, use_lstm=True)
run_heuristic_vs_learned(args, use_lstm=False)
run_with_custom_entropy_loss(args)
Contributor

Hmm I can see this being a bit confusing to run as an example since there are four different runs in the output.

Could we at least add print()s in between?

Contributor Author

Yeah, I just fixed that test case. For some reason, 3 of these have always been commented out and weren't working. I'll add the print()s.
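
For reference, a rough sketch of what those separators could look like in the example's main block (hypothetical wording; it assumes the run_* helpers and the argparse parser shown in the snippets above):

```python
# Hypothetical sketch of the print() separators discussed above; assumes the
# run_* helpers and the argparse parser defined elsewhere in this example.
if __name__ == "__main__":
    args = parser.parse_args()

    print("Run 1/4: same policy for both agents")
    run_same_policy(args)

    print("\nRun 2/4: heuristic vs. learned policy (LSTM)")
    run_heuristic_vs_learned(args, use_lstm=True)

    print("\nRun 3/4: heuristic vs. learned policy (no LSTM)")
    run_heuristic_vs_learned(args, use_lstm=False)

    print("\nRun 4/4: custom entropy loss")
    run_with_custom_entropy_loss(args)
```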

Contributor Author

Done.

timestep=timestep)
else:
# Exploration hook before each forward pass.
self.exploration.before_compute_actions(
Contributor

Shouldn't the hook be called in all cases including action_sampler_fn?

Contributor Author

Not sure. I was thinking of action_sampler_fn as a completely custom way of doing things, so the user would have to apply exploration themselves.
That said: no one is using action_sampler_fn right now anyway, so I guess it doesn't matter much.

@ericl ericl (Contributor) left a comment

Looks good, a few minor comments.

@sven1977
Contributor Author

Cool, thanks! Will fix these.

sven1977 and others added 3 commits March 31, 2020 08:54
Co-Authored-By: Eric Liang <ekhliang@gmail.com>
Co-Authored-By: Eric Liang <ekhliang@gmail.com>
Co-Authored-By: Eric Liang <ekhliang@gmail.com>
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23998/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/24023/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/24024/

@sven1977
Contributor Author

@ericl This can be merged now. Tests are all ok.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/24030/

@ericl ericl (Contributor) left a comment

LGTM. Looks like some merge conflicts, though.

…oration_api_minimal_param_noise_2nd_stage

Conflicts:
	rllib/policy/tf_policy.py
	rllib/policy/torch_policy.py
	rllib/policy/torch_policy_template.py
@sven1977
Contributor Author

sven1977 commented Apr 1, 2020

Merged, waiting for re-testing.

@sven1977
Contributor Author

sven1977 commented Apr 1, 2020

@ericl Please merge. Re-tests ok after merging.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/24077/

@ericl ericl merged commit e153e31 into ray-project:master Apr 1, 2020
@sven1977 sven1977 deleted the exploration_api_minimal_param_noise_2nd_stage branch April 1, 2020 08:26
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.