Add Pytorch TRPO #1018
Conversation
src/garage/torch/algos/trpo.py
Outdated
            rewards)
        self._optimizer.step(closure)

    def _build_closure(self, itr, paths, valids, obs, actions, rewards):
is there a reason you did this instead of just using lambdas?
i'd prefer you just # noqa: E731 them rather than add 50 lines of boilerplate. also, the rule forbids assigning them, not passing them as function arguments. seems to me you could:
def _optimizer(...):
    self._optimizer.step(
        compute_loss=lambda: self._compute_loss(itr, paths, valids, obs, actions, rewards),
        compute_kl=lambda: self._compute_kl_constraint(obs))
up to you. i think either is a better option than what you had to do here.
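The suggestion above hinges on a detail of PEP 8: E731 only forbids *assigning* a lambda to a name, not passing one as an argument. A minimal stdlib sketch of the pattern (the names `step`, `compute_loss`, and `compute_kl` here are illustrative stand-ins, not garage's actual API):

```python
# Hypothetical sketch: lambdas passed as keyword arguments (allowed by E731)
# instead of a separate _build_closure helper. Stand-in names throughout.

def step(compute_loss, compute_kl):
    """Stand-in optimizer step that evaluates both closures."""
    return compute_loss(), compute_kl()


def optimize(obs, actions, rewards):
    # Each lambda captures the local variables it needs at call time.
    return step(
        compute_loss=lambda: sum(rewards) - len(actions),
        compute_kl=lambda: len(obs) * 2,
    )


result = optimize(obs=[1, 2], actions=[0, 1], rewards=[1.0, 2.0])
```

Because the lambdas close over the caller's locals, no intermediate closure-builder method is needed.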
src/garage/torch/algos/vpg.py
Outdated
@@ -188,7 +198,7 @@ def _compute_loss(self, itr, paths, valids, obs, actions, rewards):
             objective += self._policy_ent_coeff * policy_entropies

         valid_objectives = loss_function_utils.filter_valids(objective, valids)
-        return torch.cat(valid_objectives).mean()
+        return -1 * torch.cat(valid_objectives).mean()
why not just -torch.cat(valid_objectives).mean()?
        """Take an optimization step.

        Args:
            closure (tuple[function]): Functions to compute loss and
why not separate arguments?
This is overriding torch.optim.Optimizer's step function, which only takes one parameter.
ah i see.
it seems clear to me that torch intends closure to be a single function. If you want to pseudo-implement their API, I think it would be more analogous to amend the signature to accept a second closure (which computes the constraint).
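A minimal sketch of that suggestion, keeping the spirit of torch's single-closure `step()` while accepting an optional second closure for the constraint (`ToyOptimizer`, `f_loss`, and `f_constraint` are hypothetical names, not garage's or torch's actual classes):

```python
# Hedged sketch: an optimizer step that takes a loss closure plus an
# optional constraint closure. Illustrative only; not garage's API.

class ToyOptimizer:
    def __init__(self, max_constraint=0.01):
        self._max_constraint = max_constraint

    def step(self, f_loss, f_constraint=None):
        """Evaluate the loss closure; optionally check a constraint closure."""
        loss = f_loss()
        if f_constraint is not None and f_constraint() > self._max_constraint:
            raise ValueError('constraint violated')
        return loss


opt = ToyOptimizer()
result = opt.step(f_loss=lambda: 0.5, f_constraint=lambda: 0.001)
```

Making the constraint a separate, optional parameter keeps the signature compatible with callers that only supply a loss closure.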
src/garage/torch/algos/trpo.py
Outdated
                 env_spec,
                 policy,
                 baseline,
                 max_path_length=500,
this is really high for most environments, which often are <100
Code is looking pretty great, how are the benchmarks?
@@ -15,13 +15,16 @@ def set_seed(seed):

     """
     seed %= 4294967294
-    global seed_
+    global seed_  # pylint: disable=global-statement
you can disable this function-wide here. the reason is fairly obvious.
from garage.torch.utils import update_tensor_list_from_flat_tensor


def build_hessian_vector_product(func, params, reg_coeff=1e-5):
does this need to be public?
        computation 6.1 (1994): 147-160.`

    Args:
        func (function): A function that returns a torch.Tensor. Hessian of
Callable is the type for this.
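For context on why this helper exists: TRPO solves H x = g by conjugate gradient, using only Hessian-vector products H @ v rather than H itself. A minimal pure-Python sketch of that consumption pattern (here `hvp` multiplies an explicit 2x2 matrix for illustration; the real helper builds the product from autograd without materializing H, and all names here are illustrative):

```python
# Pure-Python conjugate gradient that only needs a Hessian-vector product.
# The explicit matrix H is a stand-in for an autograd-derived product.

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    x = [0.0] * len(g)
    r = list(g)                # residual g - H @ x, with x = 0
    p = list(r)
    rs_old = sum(ri * ri for ri in r)
    for _ in range(iters):
        hp = hvp(p)
        alpha = rs_old / sum(pi * hpi for pi, hpi in zip(p, hp))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hpi for ri, hpi in zip(r, hp)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


H = [[4.0, 1.0], [1.0, 3.0]]


def hvp(v):
    # Matrix-vector product standing in for a Pearlmutter-style autograd HVP.
    return [sum(hij * vj for hij, vj in zip(row, v)) for row in H]


x = conjugate_gradient(hvp, [1.0, 2.0])
```

For a symmetric positive-definite 2x2 system, CG converges in at most two iterations; the solution here is x = [1/11, 7/11].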
Hmm, I'm also not sure what happened here. It seems like the variance of the PyTorch plots is much lower. Did you ensure that the initial standard deviation of the policies is the same in both benchmarks? What about other hyperparameters? There's a reason for every metric which the TF TRPO implementation plots, so make sure your version is plotting every one of those too. How do the Policy/MeanKL plots compare? Policy/Entropy? Baseline/ExplainedVariance? Check all of these before resorting to more drastic debugging.
I'd go ahead and take a look at the plots for mean kl, mean mu, mean std. Those will probably be the most helpful in debugging the decreased performance on cheetah @utkarshjp7
Are you using the params from the original TRPO paper to get the results that you posted?
@ryanjulian The hyperparameters are all the same for both versions. I added logging to PyTorch TRPO and started benchmarks with 5 trials. I will post the results once it's finished.
Great! Note that you can upload your results to https://tensorboard.dev for easy reviewing.
Please prioritize this PR, since it is blocking your MAML implementation.
The latest benchmark data can be found here -> It is the same as the plots I posted before, but with the additional logging. The main difference I observed is that the lower bound of mean policy KL in TensorFlow is 6.5e-3 while in PyTorch it's 0.0.
In future runs I recommend you make the name of the PyTorch policy also "GaussianMLPPolicy" so that these plots overlap. They're hard to analyze apart. Also, LinearFeatureBaseline/ExplainedVariance is a really essential debugging stat for on-policy RL, so make sure to add that to the torch version. Perhaps we should resurrect the @krzentner WDYT?
Some disorganized thoughts on these plots: looking at
Notice that KL, LossAfter, and dLoss are all 0 for itrs 486 and 487. This suggests to me that your optimizer is silently failing to create an update. Perhaps it can't calculate an update if the loss is > 0?
Force-pushed from e11a24d to 8dd2d23
Codecov Report
@@ Coverage Diff @@
## master #1018 +/- ##
==========================================
+ Coverage 85.07% 85.18% +0.11%
==========================================
Files 157 160 +3
Lines 7478 7616 +138
Branches 930 955 +25
==========================================
+ Hits 6362 6488 +126
- Misses 932 935 +3
- Partials 184 193 +9
Continue to review full report at Codecov.
@@ -61,6 +64,7 @@ def __init__(
             center_adv=True,
             positive_adv=False,
             optimizer=None,
+            optimizer_args=None,
i'm really not a fan of this pattern of passing dicts of args for constructors into other constructors. can we somehow construct-and-pass the optimizer instead? or just flatten these args into the parent constructor?
@krzentner your thoughts?
I agree, it is less ambiguous in the code if we construct and pass the optimizer.
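A stdlib sketch contrasting the two patterns under discussion (`SGDish`, `AlgoWithArgsDict`, and `AlgoWithInstance` are hypothetical stand-ins, not garage classes):

```python
# Hedged sketch: forwarding a dict of constructor args vs. constructing
# the optimizer up front and passing the instance. Stand-in names only.

class SGDish:
    def __init__(self, lr=0.01):
        self.lr = lr


class AlgoWithArgsDict:
    # Pattern under review: the algo constructs the optimizer from a dict.
    def __init__(self, optimizer=None, optimizer_args=None):
        optimizer_args = optimizer_args or {}
        self.optimizer = (optimizer or SGDish)(**optimizer_args)


class AlgoWithInstance:
    # Suggested pattern: the caller constructs and passes the optimizer.
    def __init__(self, optimizer=None):
        self.optimizer = optimizer if optimizer is not None else SGDish()


a = AlgoWithArgsDict(optimizer_args={'lr': 0.1})
b = AlgoWithInstance(optimizer=SGDish(lr=0.1))
```

The second form keeps the valid argument set visible at the call site and lets the optimizer's own constructor validate its arguments, instead of deferring errors to an opaque `**optimizer_args` expansion.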
src/garage/torch/algos/vpg.py
Outdated
@@ -308,5 +361,12 @@ def _log(self, itr, paths):
         tabular.record('StdReturn', np.std(undiscounted_returns))
         tabular.record('MaxReturn', np.max(undiscounted_returns))
         tabular.record('MinReturn', np.min(undiscounted_returns))
+        tabular.record('{0}/LossBefore'.format(self.policy.name), loss_before)
you can use tabular.prefix for this: https://github.com/rlworkgroup/dowel/blob/master/src/dowel/tabular_input.py
you might also be interested in tabular.record_misc_stat
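To illustrate the prefix idea, here is a simplified stand-in (this is a mock, not dowel itself): a context manager prepends a scope to every recorded key, so each record call doesn't need to repeat `'{0}/'.format(self.policy.name)`.

```python
# Simplified mock of a prefix-scoped tabular logger. Not dowel's actual
# implementation; illustrates the pattern only.
import contextlib


class Tabular:
    def __init__(self):
        self._prefix = ''
        self.data = {}

    @contextlib.contextmanager
    def prefix(self, prefix):
        # Push the scope, restore it even if recording raises.
        old, self._prefix = self._prefix, self._prefix + prefix
        try:
            yield
        finally:
            self._prefix = old

    def record(self, key, value):
        self.data[self._prefix + key] = value


tabular = Tabular()
with tabular.prefix('GaussianMLPPolicy/'):
    tabular.record('LossBefore', 1.5)
    tabular.record('LossAfter', 1.2)
```

Every key recorded inside the `with` block lands under the `GaussianMLPPolicy/` namespace.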
src/garage/torch/policies/base.py
Outdated
@@ -3,37 +3,88 @@


 class Policy(abc.ABC):
-    """
-    Policy base class without Parameterzied.
+    """Policy base class without Parameterzied.
i think "without Parameterized" is pretty outdated here.
src/garage/torch/policies/base.py
Outdated
            * torch.Tensor: Predicted action.
            * dict:
                * list[float]: Mean of the distribution
                * list[float]: Standard deviation of logarithmic values of
it is actually the log of the stddev, not the stddev of the log
src/garage/torch/policies/base.py
Outdated
            * torch.Tensor: Predicted actions.
            * dict:
                * list[float]: Mean of the distribution
                * list[float]: Standard deviation of logarithmic values of
log(std) not std(log)
            * torch.Tensor: Predicted action.
            * dict:
                * list[float]: Mean of the distribution
                * list[float]: Standard deviation of logarithmic values of
log(std)
            * torch.Tensor: Predicted actions.
            * dict:
                * list[float]: Mean of the distribution
                * list[float]: Standard deviation of logarithmic values of
log(std)

     """

-    def __init__(self, env_spec, **kwargs):
+    def __init__(self, env_spec, name='GaussianMLPPolicy', **kwargs):
please don't swallow **kwargs in the constructor. (I know you didn't write this, but let's remove it)
If we remove it, then how can one change the non-linearity of a layer or the initialization of weights? Or is the point here that no one should be changing that through GaussianMLPPolicy?
I see what you mean. We should explicitly mention each keyword argument that this class supports?
yes. always.
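A sketch of what "explicitly mention each keyword argument" might look like (the parameter names `hidden_nonlinearity` and `init_std` are hypothetical examples, not necessarily garage's actual signature):

```python
# Hedged sketch: every supported keyword is named in the signature instead
# of being swallowed by **kwargs. Parameter names are illustrative.

class BasePolicy:
    def __init__(self, env_spec, name):
        self.env_spec = env_spec
        self.name = name


class GaussianMLPPolicy(BasePolicy):
    def __init__(self, env_spec, name='GaussianMLPPolicy',
                 hidden_nonlinearity='tanh', init_std=1.0):
        # Each argument is named and discoverable; a typo in a keyword
        # now raises TypeError instead of being silently forwarded.
        super().__init__(env_spec, name)
        self.hidden_nonlinearity = hidden_nonlinearity
        self.init_std = init_std


policy = GaussianMLPPolicy(env_spec=None, init_std=2.0)
```

The trade-off is a longer signature, but callers (and docstring tools) can see exactly which options the class supports.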
# pylint: disable=not-callable  # https://github.com/pytorch/pytorch/issues/24807  # noqa: E501


class TestConjugateGradientOptimizer:
note that pytest is perfectly happy running test functions outside of classes, as long as they are named test_*
Force-pushed from 6a8fb49 to 81ecd2b
I'm still worried that PyTorch may be systematically just a little bit worse. What do these look like with 10 trials? And what's the average performance gap for each at itr 999?
Okay, this is as close to equal as algos get. I think the performance is ready. Did you have a chance to look at the runtime?
I ran the Python profiler, and it seems the overhead in PyTorch is computing gradients (the
Force-pushed from 81ecd2b to 296da04
I think that this is a general issue in PyTorch which we can investigate later. You can try some of the suggestions here: pytorch/pytorch#975 Is your CPU usage near 100% (on all cores) during the optimization phase? Anyway, please don't block commit on this one since it looks like it's framework-wide.
LGTM
Force-pushed from 296da04 to 6b05023
Implemented Trust Region Policy Optimization in PyTorch.
Benchmarks are currently running and should be finished by tomorrow. I opened this PR to get some feedback since initial results and tests looked good.