[RLlib] Removes device infos from state when saving RModules to checkpoints/states. #43906

Merged

Conversation

@simonsays1980 (Collaborator) commented Mar 12, 2024

Why are these changes needed?

When loading an RLModule on CPU from a checkpoint/state that was created from a replica on GPU, an error occurs. This PR fixes the error by making the module save its state in the form of numpy.ndarrays.
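
For illustration, a minimal standalone sketch (plain PyTorch, not RLlib code; the file name is made up) of the failure mode this PR avoids:

import torch

# On a GPU machine: saving the raw state dict embeds the device ("cuda:0")
# in the serialized tensors.
model = torch.nn.Linear(4, 2).to("cuda")
torch.save(model.state_dict(), "module_state.pt")

# On a CPU-only machine, a plain torch.load() of that file raises
# "RuntimeError: Attempting to deserialize object on a CUDA device but
# torch.cuda.is_available() is False ..." unless devices are remapped:
state = torch.load("module_state.pt", map_location="cpu")

# Saving the state as numpy arrays instead (as this PR does) strips the
# device info from the checkpoint, so no remapping is needed on load.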

Related issue number

Closes #43905

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
@sven1977 (Contributor) commented:

I'm not sure I understand the exact need for this additional option to the API. My main argument against this would be: When stuff gets stored to a checkpoint, it should be stored in a device-independent fashion. So the issue at hand here is NOT the loading from the checkpoint, but the saving to the checkpoint beforehand, which - I'm guessing - probably happened in torch.cuda tensors, NOT in numpy format.

Can we rather take the opposite approach to keep the mental model of what a checkpoint should be clean? Always save weights (and other tensor/matrix states) as numpy arrays, never as torch or tf tensors. When loading from a checkpoint, the sequence should be something like the following (see the sketch after this list):
0) RLModule.from_checkpoint(dir=...)

  1. load numpy arrays from dir file (pickle?)
  2. pack these numpy arrays into a state dict mapping
  3. Call RLModule.set_state(state_dict=...) -> This automatically converts the numpy contents of state_dict into torch tensors (with self.device as the device of the already existing RLModule object), then performs a torch.nn.Module.load_state_dict() operation using these (cuda?) tensors.
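
A rough sketch of the sequence proposed above; the helper names and the pickle file name are placeholders for illustration, not the actual RLlib API:

import pathlib
import pickle

import torch


def load_numpy_state(checkpoint_dir):
    # 1) Load the numpy-only state from the checkpoint directory (pickled dict).
    with open(pathlib.Path(checkpoint_dir) / "module_state.pkl", "rb") as f:
        return pickle.load(f)  # dict: parameter name -> numpy array


def set_state(module: torch.nn.Module, numpy_state, device):
    # 2) + 3) Convert the numpy arrays into tensors on the module's own
    # device, then hand them to torch.nn.Module.load_state_dict().
    tensor_state = {
        k: torch.as_tensor(v, device=device) for k, v in numpy_state.items()
    }
    module.load_state_dict(tensor_state)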

@simonsays1980 (Collaborator, Author) commented:

> (quoting @sven1977's comment above in full)

I agree with your argument that we should ensure that checkpointing is device-independent. This should be the cleanest way of doing it. We should investigate where exactly this device-dependent checkpointing takes place and fix the problem there.

I am, however, not so sure that step 3 describes how the workflow runs right now. Here is what makes me wonder: if torch.nn.Module.load_state_dict() used self.device, then it should not matter whether the state contains numpy.ndarrays or torch.Tensors (with their own device attribute), because it would always use the module's device. But no matter what, always checkpointing numpy.ndarrays will avoid this error anyway.
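
For what it's worth, a small plain-PyTorch sketch of where the device mismatch actually bites: load_state_dict() copies values into the module's existing parameters (so cross-device copies are fine), but torch.load() of a file containing CUDA tensors already fails on a CPU-only machine unless map_location is passed. A numpy-only checkpoint never hits that code path:

import torch

cpu_module = torch.nn.Linear(4, 2)  # parameters live on CPU

if torch.cuda.is_available():
    # load_state_dict() copies into the module's existing parameters, so it
    # happily accepts a state dict of CUDA tensors for a CPU module.
    gpu_state = {k: v.cuda() for k, v in cpu_module.state_dict().items()}
    cpu_module.load_state_dict(gpu_state)

# The error from the linked issue happens one step earlier: on a machine
# without a GPU, torch.load() cannot even deserialize a file whose tensors
# were saved on "cuda:0" unless map_location="cpu" is given.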

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…tore state dict now in numpy format which makes it device-agnostic.

Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
@simonsays1980 simonsays1980 changed the title Adds device placement when loading 'RModule's from checkpoints/states. Removes device infos from state when saving 'RModule's to checkpoints/states. Mar 13, 2024
@simonsays1980 simonsays1980 changed the title Removes device infos from state when saving 'RModule's to checkpoints/states. Removes device infos from state when saving RModules to checkpoints/states. Mar 13, 2024
"""Loads the module from a checkpoint directory.

Args:
checkpoint_dir_path: The directory to load the checkpoint from.
map_location: The device on which the module resides.
Contributor

Remove this line?

@@ -367,6 +367,7 @@ def load_state(
            modules_to_load: The modules whose state is to be loaded from the path. If
                this is None, all modules that are checkpointed will be loaded into this
                marl module.
+           map_location: The device the module resides on.
Contributor

Remove this line?

@@ -117,10 +118,13 @@ def _module_state_file_name(self) -> pathlib.Path:
    @override(RLModule)
    def save_state(self, dir: Union[str, pathlib.Path]) -> None:
        path = str(pathlib.Path(dir) / self._module_state_file_name())
-       torch.save(self.state_dict(), path)
+       torch.save(convert_to_numpy(self.state_dict()), path)
Contributor

Perfect! This should work.
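
A small round-trip sketch of the change above (module and file name are illustrative), assuming convert_to_numpy from ray.rllib.utils.numpy, which turns a (nested) struct of tensors into numpy arrays:

import torch
from ray.rllib.utils.numpy import convert_to_numpy

# Saving: the state dict is converted to numpy arrays first, so the file
# carries no device information at all.
module = torch.nn.Linear(4, 2)
torch.save(convert_to_numpy(module.state_dict()), "module_state.pt")

# Loading: identical on CPU-only and GPU machines. weights_only=False is
# only required on newer torch versions where it defaults to True.
numpy_state = torch.load("module_state.pt", weights_only=False)
module.load_state_dict(
    {k: torch.as_tensor(v) for k, v in numpy_state.items()}
)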

@sven1977 (Contributor) left a comment

Looks good! Thanks for this important fix, @simonsays1980!

Just two nits on the docstrings.

@sven1977 sven1977 changed the title Removes device infos from state when saving RModules to checkpoints/states. [RLlib] Removes device infos from state when saving RModules to checkpoints/states. Mar 19, 2024
Comment on lines +691 to +694
    def load_state(
        self,
        dir: Union[str, pathlib.Path],
    ) -> None:
Contributor

Suggested change
-    def load_state(
-        self,
-        dir: Union[str, pathlib.Path],
-    ) -> None:
+    def load_state(self, dir: Union[str, pathlib.Path]) -> None:

@sven1977 sven1977 merged commit 94fd80f into ray-project:master Mar 19, 2024
5 checks passed
stephanie-wang pushed a commit to stephanie-wang/ray that referenced this pull request Mar 27, 2024
Development

Successfully merging this pull request may close these issues.

[RLlib] - TorchRLModule cannot be loaded on CPU after training on GPU