Conversation

@albertbou92
Contributor

@albertbou92 albertbou92 commented Dec 24, 2022

Description

In the init method of the SyncDataCollector class, a small number of steps is taken with the policy to determine the relevant keys of the output TensorDict. When the policy device and the environment device differ, this can raise a RuntimeError, since the input provided to the policy lives on the environment device.

This PR simply ensures that the TensorDict provided to the policy is on the policy device, and then moves the output TensorDict back to the environment device.
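A minimal sketch of the idea (hypothetical function and names, not the actual SyncDataCollector code; .to() works the same way on a TensorDict as on a plain tensor):

```python
import torch

def dry_run_policy(policy, env_output, policy_device, env_device):
    """Sketch of the fix: colocate the policy input with the policy before
    the dry run, then move the result back to the environment device."""
    td = env_output.to(policy_device)  # cast the env's output to the policy device
    td = policy(td)                    # no device mismatch: input and policy colocated
    return td.to(env_device)           # restore the environment device for the collector

# Toy usage with a plain tensor standing in for a TensorDict:
out = dry_run_policy(lambda t: t + 1, torch.zeros(3),
                     torch.device("cpu"), torch.device("cpu"))
```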

Types of changes

What types of changes does your code introduce? Remove all that do not apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)
  • Example (update in the folder of examples)

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

  • I have read the CONTRIBUTION guide (required)
  • My change requires a change to the documentation.
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have updated the documentation accordingly.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 24, 2022
Collaborator

@vmoens vmoens left a comment

Thanks for this! Can you add a test where device and passing_device differ, to check for non-regression of this bug fix?

@albertbou92
Contributor Author

albertbou92 commented Dec 27, 2022

Added tests for all device/passing_device combinations for SyncDataCollector, MultiSyncDataCollector and MultiaSyncDataCollector.

I noticed that for MultiSyncDataCollector, the yielded TensorDicts have their device attribute set to None.
I think the problem may be in the TensorDict.cat method, which returns a TensorDict whose nested "next" TensorDict has its device set to None.

As a sanity check, I added a line of code to make sure the TensorDicts yielded by MultiSyncDataCollector are cast to the correct device.

@vmoens
Collaborator

vmoens commented Dec 28, 2022

@albertbou92 interesting

I noticed that for MultiSyncDataCollector, the yielded TensorDicts have their device attribute set to None.

Does this happen when the passing devices match too? Or just when there's more than one?
torch.cat should preserve the device (since we're sure that all devices match). If it doesn't, it's a bug. When the devices don't match, we first cast things to CPU before calling cat. It's definitely not ultra fast, so we may want to parametrise that feature.

which returns a TensorDict whose "next" TensorDict has device set to None.

You mean that the tensordict has a device but the nested tensordict doesn't? Weird...
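The expectation about torch.cat can be checked with plain CPU tensors (a trivial sketch; the issue reported above concerns TensorDict's cat handling of nested tensordicts, not torch.cat on tensors):

```python
import torch

# torch.cat preserves the common device of its inputs: concatenating
# CPU tensors yields a CPU tensor (likewise for tensors on one CUDA device).
a = torch.ones(2, 3)
b = torch.zeros(1, 3)
c = torch.cat([a, b], dim=0)  # shape (3, 3), still on the CPU
```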

@albertbou92 albertbou92 reopened this Dec 28, 2022
@albertbou92
Contributor Author

Does this happen when the passing devices match too?

Yes, it happens for all device combinations.

@albertbou92
Contributor Author

You mean that the tensordict has a device but the nested tensordict doesn't? Weird...

Exactly.

@vmoens
Collaborator

vmoens commented Dec 28, 2022

You mean that the tensordict has a device but the nested tensordict doesn't? Weird...

Exactly.

OK, then it's a tensordict bug. I'll look into it.

@albertbou92
Contributor Author

albertbou92 commented Jan 2, 2023

Should I remove the line of code that makes sure the TensorDicts yielded by MultiSyncDataCollector are cast to the correct device?

We could keep it as a sanity check; it won't add overhead if the TensorDict is already on the correct device.

@vmoens
Collaborator

vmoens commented Jan 2, 2023

Should I remove the line of code that makes sure the TensorDicts yielded by MultiSyncDataCollector are cast to the correct device?

We could keep it as a sanity check; it won't add overhead if the TensorDict is already on the correct device.

Which line of code are we talking about?

By the way: can you check if the device bug with nested tensordicts is still present?

@albertbou92
Contributor Author

albertbou92 commented Jan 2, 2023

The device bug with nested TensorDicts seems to be fixed now!

In my code, I added an out_buffer = out_buffer.to(prev_device) call to the MultiSyncDataCollector iterator method, after concatenating all the individual TensorDicts, to make sure the yielded TensorDict was placed on the passing_device. Now that the cat method is fixed, it might be unnecessary.

I can either remove it or keep it as a sanity check.

@vmoens
Collaborator

vmoens commented Jan 2, 2023

If it's not necessary, I would remove it. These things take little time individually, but they quickly pile up into a considerable overhead.

@albertbou92
Contributor Author

OK, removed that checking line.

The PR now just fixes the bug in SyncDataCollector when the policy device and passing_device differ, and adds tests to make sure all collectors work with all device combinations.

del collector


@pytest.mark.parametrize("device", ["cuda", "cpu"])
Collaborator

You should probably use `get_available_devices` here, as some tests run on machines that don't have a CUDA device.
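A helper along these lines (a sketch of what such a utility might look like; torchrl's actual get_available_devices may differ) parametrizes only over devices that exist on the current machine:

```python
import torch

def get_available_devices():
    """Enumerate test devices: always the CPU, plus any visible CUDA devices."""
    devices = [torch.device("cpu")]
    if torch.cuda.is_available():
        devices += [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
    return devices

# Used as: @pytest.mark.parametrize("device", get_available_devices())
# so CPU-only CI machines skip nothing and CUDA machines get full coverage.
```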



@pytest.mark.parametrize("device", ["cuda", "cpu"])
@pytest.mark.parametrize("passing_device", ["cuda", "cpu"])
Collaborator

Same as above

del collector


@pytest.mark.parametrize(
Collaborator

Does it make sense to test on CPU only? Maybe we can skip it in that case.

Contributor Author

OK, yes, I can remove it in that case.

@vmoens vmoens merged commit de9f488 into pytorch:main Jan 2, 2023
@albertbou92 albertbou92 deleted the bugfix_collector_init branch January 18, 2024 10:08