[BUG] Buffer crashes on extend #2065

Closed
matteobettini opened this issue Apr 8, 2024 · 4 comments
Labels: bug (Something isn't working)

Comments

matteobettini commented Apr 8, 2024

The buffer hangs on the extend operation and then crashes with "Process finished with exit code 137 (interrupted by signal 9: SIGKILL)".

To reproduce, check out #2054

from torchrl.data import LazyTensorStorage, RandomSampler, TensorDictReplayBuffer
from torchrl.envs import MeltingpotEnv

if __name__ == "__main__":
    env = MeltingpotEnv(substrate="commons_harvest__open")
    td = env.rollout(max_steps=100, break_when_any_done=False)

    buffer = TensorDictReplayBuffer(
        storage=LazyTensorStorage(1_000_000, device="cpu"),
        sampler=RandomSampler(),
        batch_size=10,
    )

    buffer.extend(td) # Hangs and then -> Process finished with exit code 137 (interrupted by signal 9:SIGKILL)

This is where it hangs (specifically on expand().clone()):

if is_tensor_collection(data):
    out = (
        data.expand(max_size_along_dim0(data.shape))
        .clone()
        .zero_()
        .to(self.device)
    )
elif is_tensor_collection(data):
    out = (
        data.expand(max_size_along_dim0(data.shape))
        .clone()
        .zero_()
        .to(self.device)
    )

PS: are these two identical if branches?

vmoens commented Apr 8, 2024

> PS: are these two identical if branches?

Yeah, it used to be one for tensorclass and one for tensordict, but then they became identical and I didn't catch it :)

vmoens commented Apr 8, 2024

I think your buffer is simply too big to fit in memory. Using LazyMemmapStorage or a smaller buffer size solved the issue.
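To see why a 1,000,000-entry buffer can exhaust RAM, here is a rough back-of-the-envelope sketch. It assumes the storage preallocates max_size copies of one item on the first extend, as the expand(...).clone() call in the snippet above suggests. The field names, shapes and dtypes below are hypothetical placeholders, not the actual MeltingpotEnv specs, and estimate_storage_bytes is an illustrative helper, not a torchrl function.

```python
from math import prod


def estimate_storage_bytes(max_size, item_specs):
    """Estimate bytes needed to preallocate `max_size` entries.

    `item_specs` maps a field name to (per-item shape, bytes per element).
    """
    per_item = sum(prod(shape) * itemsize for shape, itemsize in item_specs.values())
    return max_size * per_item


# Hypothetical per-step fields (assumed for illustration only).
specs = {
    "rgb_observation": ((3, 64, 64), 1),  # uint8 image: 12,288 bytes
    "reward": ((1,), 4),                  # float32 scalar: 4 bytes
}

needed = estimate_storage_bytes(1_000_000, specs)
print(f"Preallocation would need about {needed / 1e9:.1f} GB")
```

With these made-up shapes the estimate already exceeds 12 GB for a single image field, so comparing the result with the machine's RAM before picking the buffer size (or switching to a disk-backed storage such as LazyMemmapStorage) is a cheap sanity check.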

vmoens closed this as completed Apr 8, 2024

matteobettini commented

If this is the case, I think there should be an error here. It is currently crashing silently.
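A minimal sketch of what such an explicit error could look like in user code is given below. This is not something torchrl does; ensure_fits and total_physical_memory_bytes are hypothetical helpers, and the os.sysconf keys used are assumed to be available (they are on Linux).

```python
import os


def total_physical_memory_bytes():
    """Total physical RAM in bytes (assumes a POSIX system exposing these sysconf keys)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def ensure_fits(required_bytes, available_bytes):
    """Raise a readable error instead of letting the OS kill the process."""
    if required_bytes > available_bytes:
        raise MemoryError(
            f"Storage needs {required_bytes} bytes but only "
            f"{available_bytes} bytes are available"
        )


# A small request passes silently on any realistic machine.
ensure_fits(1024, total_physical_memory_bytes())

# An oversized request fails loudly before any allocation happens.
try:
    ensure_fits(10, 5)
except MemoryError as err:
    print(f"Refusing to allocate: {err}")
```

Total RAM is only an upper bound, since other processes also use memory, so a real check would likely compare against currently free memory with some safety margin.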

vmoens commented Apr 8, 2024

It actually does:

Exit code 137 indicates that the process was killed with SIGKILL (128 + 9), which occurs when a container's memory exceeds the memory limit provided in the pod specification.
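For readers unfamiliar with the convention, shells report a process killed by signal N as exit code 128 + N, so 137 corresponds to signal 9. The sketch below decodes such codes with the standard library; describe_exit_code is a hypothetical helper name, and the signal names assume a POSIX system.

```python
import signal


def describe_exit_code(code):
    """Translate a shell-style exit code into a human-readable description."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

SIGKILL cannot be caught by the process, which is why no Python traceback appears: the out-of-memory killer terminates the interpreter before it can raise anything.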
