
[QUESTION] LossModule's functional parameters duplicate weights in memory #1769

Closed
wbinventor opened this issue Jan 3, 2024 · 10 comments

@wbinventor

wbinventor commented Jan 3, 2024

The LossModule.convert_to_functional(...) method creates a deep copy of the parameters. If I understand correctly, this leads to the parameters being duplicated in memory and a larger memory footprint than necessary. Is my understanding correct? If so, why is this necessary? Is there any way for the LossModule to simply hold a single reference to the weights of, e.g., its actor and critic TorchModules?

This can be seen in the following line:

self.__dict__[module_name] = deepcopy(module)

This is the specific snippet of code containing the deep copy that this question pertains to:

        # set the functional module: we need to convert the params to non-differentiable params
        # otherwise they will appear twice in parameters
        with params.apply(
            self._make_meta_params, device=torch.device("meta")
        ).to_module(module):
            # avoid buffers and params being exposed
            self.__dict__[module_name] = deepcopy(module)
@wbinventor wbinventor changed the title LossModule's functional parameters duplicate weights in memory [QUESTION] LossModule's functional parameters duplicate weights in memory Jan 3, 2024
@wbinventor
Author

I am confused by this deepcopy of the weights. It seems that after just one gradient update, loss_module.parameters() will differ from the actor/critic TensorDictModule used to generate rollouts to insert into a replay buffer, since two separate copies of the weights are used to generate rollouts vs. compute the loss. I'm sure I must be missing something here, so any clarifications would be greatly appreciated!

@vmoens
Contributor

vmoens commented Jan 4, 2024

Thanks for posting this!
The important bit in the code snippet you linked is the context manager: we take the parameters, send them to the "meta" device (i.e., create a stateless copy of the parameters), and then temporarily populate the module with them.
This is the module instance we will copy.
After we exit the context manager, the module retrieves its original parameters, and only a stateless module is stored.

It seems that after just one gradient update, loss_module.parameters() will differ from the actor/critic TensorDictModule used to generate rollouts to insert into a replay buffer, since two separate copies of the weights are used to generate rollouts vs. compute the loss.

Not really, since we always call the stored module with a functional call!
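
A minimal sketch of what such a functional call looks like, using tensordict's TensorDict.from_module / to_module (this illustrates the general pattern only; the exact code path inside LossModule may differ):

import torch
from torch import nn
from tensordict import TensorDict

module = nn.Linear(3, 4)

# an independent set of parameters, e.g. the ones registered on the loss module
params = TensorDict.from_module(module).clone()

x = torch.randn(2, 3)
with params.to_module(module):  # temporarily swap `params` into the module
    y = module(x)               # this forward pass reads `params`, not the module's own weights
# on exit, the module's original parameters are restored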

I hope that clarifies things!

@vmoens vmoens closed this as completed Jan 4, 2024
@wbinventor
Author

wbinventor commented Jan 4, 2024

Thanks, @vmoens! That clarification about the context manager is helpful 🙏

However, can you confirm if my understanding is correct that as a result of the deepcopy in LossModule, the memory footprint is (approximately) twice as large since the actor/critic model parameters are duplicated?

At least when I instantiate a LossModule, I notice that the memory footprint (e.g., CUDA memory) approximately doubles as a result of this deepcopy. This is an issue for large models, so I'm trying to understand whether this duplication of parameters can be avoided.

@vmoens
Contributor

vmoens commented Jan 4, 2024

However, can you confirm if my understanding is correct that as a result of the deepcopy in LossModule, the memory footprint is (approximately) twice as large since the actor/critic model parameters are duplicated?

No, a tensor on the "meta" device has no content, so it has (approximately) zero memory footprint.
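
(A minimal standalone sketch of this in plain PyTorch: a "meta" tensor keeps its shape and dtype but allocates no storage.)

import torch
from torch import nn

layer = nn.Linear(1024, 1024)
meta_weight = layer.weight.to("meta")  # stateless copy: shape/dtype kept, no storage allocated
print(meta_weight.is_meta)             # True
print(meta_weight.shape)               # torch.Size([1024, 1024])
# There is no data to read back: e.g. meta_weight[0, 0].item() raises a RuntimeError.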

If the memory increases by a factor of 2x, there must be an issue somewhere; this isn't the intended behaviour (it's a bug).

@wbinventor
Author

Ok, that's what I thought as well re: "meta" device behavior.

I can very clearly see the memory footprint double when the deepcopy on line 290 of LossModule is called. Should this be reported separately as a bug?

@vmoens vmoens reopened this Jan 4, 2024
@vmoens
Contributor

vmoens commented Jan 4, 2024

Nope, I will have a look and push a patch!

@vmoens
Contributor

vmoens commented Jan 4, 2024

Do you have any way to check that the memory doubles?
This piece of code indicates that the parameters are on the "meta" device, as expected:

from torchrl.modules import MLP, QValueActor
from torchrl.data import OneHotDiscreteTensorSpec
from torchrl.objectives import DQNLoss
n_obs, n_act = 4, 3
value_net = MLP(in_features=n_obs, out_features=n_act)
spec = OneHotDiscreteTensorSpec(n_act)
actor = QValueActor(value_net, in_keys=["observation"], action_space=spec)
loss = DQNLoss(actor, action_space=spec)
list(loss.value_network.parameters())
[Parameter containing:
 tensor(..., device='meta', size=(32, 4)),
 Parameter containing:
 tensor(..., device='meta', size=(32,)),
 Parameter containing:
 tensor(..., device='meta', size=(32, 32)),
 Parameter containing:
 tensor(..., device='meta', size=(32,)),
 Parameter containing:
 tensor(..., device='meta', size=(32, 32)),
 Parameter containing:
 tensor(..., device='meta', size=(32,)),
 Parameter containing:
 tensor(..., device='meta', size=(3, 32)),
 Parameter containing:
 tensor(..., device='meta', size=(3,))]
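
(One way to check, as a sketch: assuming a CUDA device is available and reusing actor and spec from the snippet above, compare the allocated memory before and after building the loss.)

import torch

actor = actor.cuda()
before = torch.cuda.memory_allocated()
loss = DQNLoss(actor, action_space=spec)
after = torch.cuda.memory_allocated()
print(f"extra memory from building the loss: {(after - before) / 2**20:.2f} MiB")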

@vmoens
Contributor

vmoens commented Jan 4, 2024

Are you sure the memory footprint isn't twice as big because of the target parameters?
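
(For context, a sketch: when delayed/target networks are enabled, the loss module keeps a real, detached copy of the parameters, and that copy does consume memory. The delay_value argument and the target_value_network_params attribute below follow torchrl's naming for DQNLoss, reusing actor and spec from the earlier snippet; treat the exact names as an assumption.)

loss = DQNLoss(actor, action_space=spec, delay_value=True)
# target parameters are real (non-meta), detached tensors, so they do occupy memory
print(loss.target_value_network_params)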

@wbinventor
Author

I was unable to reproduce this with some simplified code (e.g., the example scripts), and have determined that my full code contains some other, non-parameter tensor attributes on my nn.Module that are duplicated by the deepcopy. I'm closing this issue since the deepcopy of "meta" parameters is not the reason that the memory footprint doubles.

@vmoens
Contributor

vmoens commented Jan 5, 2024

Got it
Unfortunately, as of now, we have to copy the module. We took care of making this work for parameters and buffers, but tensors stored as plain attributes (neither buffers nor parameters) are tricky to handle in general: they usually won't be part of your state dict, they won't be cast when you call module.cuda(), etc. They are generally regarded as improper usage of nn.Module.
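
(A minimal sketch of this standard PyTorch behaviour, unrelated to torchrl internals: a registered buffer is tracked by nn.Module, while a plain tensor attribute is not.)

import torch
from torch import nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        self.register_buffer("scale", torch.ones(4))  # tracked: in state_dict, moved by .cuda()
        self.mask = torch.ones(4)                      # plain attribute: not tracked

net = Net()
print("scale" in net.state_dict())  # True
print("mask" in net.state_dict())   # False
# net.cuda() would move `linear` and `scale` to the GPU, but leave `mask` on the CPU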

We could design an ad-hoc strategy to avoid deepcopying the non-parameter, non-buffer tensors, but I'm not sure that this is what users will want (I can imagine that some users will want them to be copied, while others won't).

If you feel like this deepcopy is causing trouble in your use case, I'd be happy to look into an adequate solution.
