[pytorch] Fix serialization memory lifetime issue. #30603


Closed
wants to merge 4 commits into from

Conversation

@jjlilley (Contributor) commented Dec 2, 2019

Stack from ghstack:

The Pickler object needs to be kept in scope until the data is written out to the
final serialized string. tensorData in particular is a reference to memory
owned by the (by then descoped) Pickler object.

Noticed this by inspection. In practice, this potential read-after-free
is limited to non-CPU tensors, and any such use happens very soon after the free.

Differential Revision: D18760463
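A minimal, self-contained sketch of the lifetime problem being described, using hypothetical toy types rather than the actual PyTorch Pickler/WriteableTensorData API: a container of raw pointers is populated from an object that owns the underlying memory, and is then read after that object has gone out of scope.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the Pickler (not the actual PyTorch API): it owns
// the buffers that back the views it hands out.
struct ToyPickler {
  std::vector<std::string> buffers_;
  // Returns non-owning pointers; they dangle once the ToyPickler is destroyed.
  std::vector<const char*> tensorData() const {
    std::vector<const char*> views;
    for (const auto& b : buffers_) {
      views.push_back(b.data());
    }
    return views;
  }
};

std::string buggySerialize() {
  std::vector<const char*> tensorData;
  {
    ToyPickler pickler;
    pickler.buffers_.push_back("some tensor bytes");
    tensorData = pickler.tensorData();
  }  // pickler (and the memory behind the views) is destroyed here
  // Read-after-free: the pointers in tensorData no longer refer to live memory.
  return std::string(tensorData[0]);
}
```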

jjlilley pushed a commit that referenced this pull request Dec 2, 2019
The Pickler object needs to be kept in scope until the data is written out to the
final serialized string. tensorData in particular is a reference to memory
owned by the (by then descoped) Pickler object.

Noticed this by inspection, though it's not as bad as it looks - for device==cpu,
there is no actual issue since the memory is the same as the input args.

Differential Revision: [D18760463](https://our.internmc.facebook.com/intern/diff/D18760463/)

ghstack-source-id: 94741950
Pull Request resolved: #30603
@jjlilley jjlilley requested a review from suo December 2, 2019 07:32
@jjlilley (Contributor, Author) commented Dec 2, 2019

Sorry about this one...

FWIW, it's not clear to me that this is related to any of the test issues (especially because the rollback failed to make those failures go away), but I spent some more time this evening staring at the code and noticed this memory issue.

@jjlilley jjlilley requested review from satgera and pietern December 2, 2019 16:44
@pietern (Contributor) left a comment


The reason for keeping the pickler alive is not immediately clear from looking at the code. I can imagine this breaking again in the future. I recommend leaving the pickler lifetime as-is, but either 1) capturing the vector of tensors from pickler.tensorData() by value, or 2) adding an at::Tensor field to the tuple you put in entries so the data pointer is colocated with the object that extends its lifetime.
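A sketch of option 1 using hypothetical toy types (ToyPickler, ToyTensorData; not the actual PyTorch classes): the vector returned by tensorData() is captured by value into a variable whose scope covers the final write, so the copy, rather than the descoped pickler, owns the memory that is read later.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for WriteableTensorData: each element owns the bytes
// it exposes, so keeping the vector alive keeps the data() pointers valid.
struct ToyTensorData {
  std::string bytes_;
  const char* data() const { return bytes_.data(); }
  std::size_t sizeInBytes() const { return bytes_.size(); }
};

// Hypothetical stand-in for the Pickler; tensorData() returns owning values.
struct ToyPickler {
  std::vector<ToyTensorData> tensors_;
  std::vector<ToyTensorData> tensorData() const { return tensors_; }
};

std::string serializeByValue() {
  // Declared in function scope so the data() pointers stay valid for the
  // final write, even after the pickler below has gone out of scope.
  std::vector<ToyTensorData> tensorData;
  {
    ToyPickler pickler;
    pickler.tensors_.push_back(ToyTensorData{"some tensor bytes"});
    tensorData = pickler.tensorData();  // option 1: capture by value
  }  // pickler is destroyed; tensorData still owns the bytes
  std::string out;
  for (const auto& t : tensorData) {
    out.append(t.data(), t.sizeInBytes());
  }
  return out;
}
```

The point of the pattern is that the variable extending the lifetime sits next to the code that still reads the data, so a future refactor is less likely to reintroduce the dangling reference.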

@jjlilley (Contributor, Author) commented Dec 2, 2019

Thanks for the suggestion; I agree that it would be good to avoid future regressions.
I'm uploading a version that does (1) - keeping the WriteableTensorData vector alive.

jjlilley pushed a commit that referenced this pull request Dec 2, 2019
Pull Request resolved: #30603

ghstack-source-id: 94756036

Differential Revision: [D18760463](https://our.internmc.facebook.com/intern/diff/D18760463/)
@@ -201,12 +203,13 @@ std::string wireSerialize(
     pickler.protocol();
     pickler.pushIValue(tensors);
     pickler.stop();
-    auto writeable_tensors = pickler.tensorData();
+    // tensorData is in function scope so that the data() pointers stay valid.
+    tensorData = pickler.tensorData();
A reviewer (Contributor) commented on this change:

How expensive is it to capture by value here? I would assume WriteableTensorData is just a thin wrapper. If not, shall we add a microbenchmark to see how much perf we lose?

@jjlilley (Contributor, Author) commented Dec 2, 2019

Fortunately, we do have a benchmark for it checked in:
buck build @mode/opt caffe2/torch/fb/distributed/thriftRpcBackend/test:ProcessGroupAgentBench
buck-out/opt/gen/caffe2/torch/fb/distributed/thriftRpcBackend/test/ProcessGroupAgentBench

In practice, the results end up being about the same either way (time per iteration, iterations per second):
Serialize_Wire_64    1.99us     502.38K
Serialize_Wire_1M    276.07us   3.62K

This is mostly because the copy-by-value is largely a refcount operation. If I run it several times in a row, the variation between runs is more than the variation between the two impls.

I'm completely happy using the other version as well (that just extends the life of the Pickler) - it should be slightly more efficient, though not in a very measurable manner.

I simply want to get a fix for this read-after-free checked in. :)
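A rough illustration of the refcount point above, assuming (as suggested earlier in the thread) that each element is a thin wrapper holding a reference-counted handle to the tensor's storage rather than the bytes themselves; the types here are hypothetical stand-ins, not the actual WriteableTensorData implementation.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical thin wrapper: copying it copies a shared_ptr (a refcount bump),
// not the tensor bytes themselves.
struct ToyWriteableTensorData {
  std::shared_ptr<std::vector<char>> storage_;
  const char* data() const { return storage_->data(); }
  std::size_t sizeInBytes() const { return storage_->size(); }
};

int main() {
  std::vector<ToyWriteableTensorData> original;
  original.push_back({std::make_shared<std::vector<char>>(1 << 20)});  // ~1 MiB payload

  // "Copy by value": one refcount increment per element, no byte copies,
  // which is consistent with the benchmark numbers barely moving.
  std::vector<ToyWriteableTensorData> copy = original;

  // Both vectors still reference the same underlying storage.
  return copy[0].data() == original[0].data() ? 0 : 1;
}
```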

@mrshenli (Contributor) left a comment

LGTM! Thanks!

@facebook-github-bot facebook-github-bot deleted the gh/jjlilley/33/head branch December 10, 2019 15:19
wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020
Summary:
Pull Request resolved: pytorch#30603


Test Plan: existing test suite at buck test mode/dev-nosan caffe2/test:rpc_fork

Differential Revision: D18760463

fbshipit-source-id: 9de890d66626aa48f13ca376dd9bd50b92e0cb00
4 participants