
[RLlib] APPO/IMPALA: Enable using 2 separate optimizers for policy and vf (and 2 learning rates) on the old API stack. #40927

Conversation


@sven1977 sven1977 commented Nov 3, 2023

APPO/IMPALA: Enable using 2 separate optimizers for policy and value function (and 2 learning rates) on the old API stack.

Note that this feature had already existed for tf/tf2, but not for torch.

  • Added additional learning tests for APPO (torch) and Impala (tf/tf2 + torch).
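
For reference, a minimal sketch of how a user might enable this on the old API stack. The _separate_vf_optimizer / _lr_vf setting names are taken from the pre-existing tf code path and are assumed here to be what the torch policy now honors as well; this is not copied from the merged code:

    from ray.rllib.algorithms.impala import ImpalaConfig

    config = (
        ImpalaConfig()
        .environment("CartPole-v1")
        .framework("torch")
        .training(
            lr=0.0005,                    # learning rate for the policy optimizer
            _separate_vf_optimizer=True,  # build a second optimizer for the value function
            _lr_vf=0.0001,                # learning rate for the value-function optimizer
        )
    )

As discussed in the review thread below, this split only makes sense for models whose policy and value networks do not share parameters.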

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Comment on lines +181 to +203
    # Figure out, which parameters of the model belong to the value
    # function (and which to the policy net).
    dummy_batch = self._lazy_tensor_dict(
        self._get_dummy_batch_from_view_requirements()
    )
    # Zero out all gradients (set to None)
    for param in self.model.parameters():
        param.grad = None
    # Perform a dummy forward pass (through the policy net, which should be
    # separated from the value function in this particular user setup).
    out = self.model(dummy_batch)
    # Perform a (dummy) backward pass to be able to see, which params have
    # gradients and are therefore used for the policy computations (vs vf
    # computations).
    torch.sum(out[0]).backward()  # [0] -> Model returns out and state-outs.
    # Collect policy vs value function params separately.
    policy_params = []
    value_params = []
    for param in self.model.parameters():
        if param.grad is None:
            value_params.append(param)
        else:
            policy_params.append(param)
kouroshHakha (Contributor):

I don't understand the need for this. Why can't you directly index into the model and ask it for .value.parameters() and .policy.parameters()? There should be a better way than treating self.model as a black box with only the knowledge that a forward pass on the model directly uses the parameters belonging to the policy. Also, what if there are shared parameters between the value and policy components? This will lump them into the policy's optimizer, and they won't get updated based on the value function's loss.
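
(A hypothetical sketch of the suggested direct access, purely for illustration; policy_net / value_net are made-up attribute names, since, as discussed below, ModelV2 prescribes no such convention:)

    # Made-up attribute names -- this only works if the model actually exposes them.
    policy_params = list(self.model.policy_net.parameters())
    value_params = list(self.model.value_net.parameters())

    # Shared layers (e.g. a common encoder) would show up in both lists here,
    # whereas the dummy-backward heuristic assigns them to the policy optimizer
    # only, so the vf loss would never update them.
    shared = {id(p) for p in policy_params} & {id(p) for p in value_params}
    assert not shared, "policy and value nets share parameters"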

sven1977 (Contributor, Author):

Good point. The problem here is that the API is NOT defined at all: some users might have self.policy, others self.policy_net, etc.
The only thing required of you, if you want a value function to be present, is to implement the self.value_function() method. Take a look at our torch default models (ModelV2): they all differ in how they store the (separate) value sub-networks. It's quite a mess. I'm with you that this is not how we should normally solve this, but since this is the old API stack, which will be 100% retired very soon, I'm personally fine with it. Suggestions?
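
To make the point concrete, a minimal sketch of the kind of old-stack custom model described here; the attribute name _my_vf_branch is deliberately arbitrary, since the only contract is the value_function() method:

    import torch.nn as nn
    from ray.rllib.models.torch.torch_modelv2 import TorchModelV2

    class MyOldStackModel(TorchModelV2, nn.Module):
        def __init__(self, obs_space, action_space, num_outputs, model_config, name):
            TorchModelV2.__init__(
                self, obs_space, action_space, num_outputs, model_config, name
            )
            nn.Module.__init__(self)
            in_size = int(obs_space.shape[0])
            # Policy head and value branch stored under user-chosen names.
            self._pi_head = nn.Linear(in_size, num_outputs)
            self._my_vf_branch = nn.Linear(in_size, 1)
            self._last_flat_in = None

        def forward(self, input_dict, state, seq_lens):
            self._last_flat_in = input_dict["obs_flat"].float()
            return self._pi_head(self._last_flat_in), state

        def value_function(self):
            # The required hook; the policy cannot know which sub-module backs it.
            return self._my_vf_branch(self._last_flat_in).squeeze(-1)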

kouroshHakha (Contributor):

Thanks for the explanation; I figured that might be the reason. We should be explicit about this in the comments.

@@ -229,6 +229,7 @@ def _import_leela_chess_zero():
"DreamerV3": _import_dreamerv3,
"DT": _import_dt,
"IMPALA": _import_impala,
"Impala": _import_impala,
kouroshHakha (Contributor):

Wait, where is this coming from? It will mess with our telemetry.

sven1977 (Contributor, Author):

Let me explain: We added a new test case in this PR, which is IMPALA (separate policy and vf) on CartPole. The new tuned_example file is a Python file (I'm trying to create as few new YAML files as possible nowadays). Hence, in there I'm using the ImpalaConfig() class/object. For whatever reason, it doesn't seem to work well with tune.run_experiments.

I hadn't thought about telemetry. Let me see whether there is a better way that would not break things ...

sven1977 (Contributor, Author):

Ah, ok, this is the culprit here (in rllib/train.py).

        experiments = {
            f"default_{uuid.uuid4().hex}": {
                "run": algo_config.__class__.__name__.replace("Config", ""),
                "env": config.get("env"),
                "config": config,
                "stop": stop,
            }
        }

Ok, let me provide a better fix.

kouroshHakha (Contributor):

Yeah, we should import the name directly from the registry if possible, or avoid tune.run_experiments?

sven1977 (Contributor, Author):

All algo config objects know what their corresponding algo class is, so this is now solved much more elegantly.
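
Presumably (not copied from the merged code), the rllib/train.py snippet above now looks roughly like this, relying on the algo_class attribute that every AlgorithmConfig carries:

        experiments = {
            f"default_{uuid.uuid4().hex}": {
                # Pass the algorithm class itself (Tune accepts a Trainable
                # class), instead of re-deriving a registry string from the
                # config class name ("ImpalaConfig" -> "Impala" != "IMPALA").
                "run": algo_config.algo_class,
                "env": config.get("env"),
                "config": config,
                "stop": stop,
            }
        }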

@kouroshHakha kouroshHakha (Contributor) left a comment

One big comment about how the value vs policy parameters are retrieved.

@sven1977 sven1977 (Contributor, Author) commented Nov 3, 2023

Fixed. @kouroshHakha, thanks for the review! Please take another look.

@kouroshHakha kouroshHakha (Contributor) commented:

The tests are failing. Let's hold off on merging until the issue is resolved. @can-anyscale, can you tell me what is wrong with the tests? All RLlib tests are complaining about a missing gRPC plugin.
https://buildkite.com/ray-project/premerge/builds/10829#_

kouroshHakha and others added 6 commits November 3, 2023 12:30
@sven1977 sven1977 merged commit 8711328 into ray-project:master Nov 4, 2023
24 of 26 checks passed
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Nov 29, 2023