Add 'share_from_this' to 'torch::jit::Graph' #87343
Force-pushed: 18affda to 7e96826
🕺
Wow, I closed the original issue because it wouldn't reproduce, but this fix really tackles the root cause. Fantastic job
If that works, do you think we can modify the original `Graph* owningGraph()` in torch/csrc/jit/ir/ir.h to include that logic? This way, we would always get the proper shared version during export while keeping the existing behavior otherwise.
We would need to check whether the `const Graph*`, `const Block*`, and `const Value*` variants need this too.
Something like:
py::bool_ is_in_onnx_export =
    py::module::import("torch.onnx.__init__").attr("is_in_onnx_export");

struct Value {
 public:
  Graph* owningGraph();
  const Graph* owningGraph() const;
};

struct TORCH_API Node {
 public:
  Graph* owningGraph() {
    if (py::cast<bool>(is_in_onnx_export)) {
      return graph_->shared_from_this();
    } else {
      return graph_;
    }
  }
  const Graph* owningGraph() const {
    return graph_;
  }
  Block* owningBlock() {
    if (py::cast<bool>(is_in_onnx_export)) {
      return owning_block_->shared_from_this();
    } else {
      return owning_block_;
    }
  }
  const Block* owningBlock() const {
    return owning_block_;
  }
};

struct Block {
  Graph* owningGraph() {
    if (py::cast<bool>(is_in_onnx_export)) {
      return graph_->shared_from_this();
    } else {
      return graph_;
    }
  }
  const Graph* owningGraph() const {
    return graph_;
  }
  Node* owningNode() {
    if (py::cast<bool>(is_in_onnx_export)) {
      return owning_node_->shared_from_this();
    } else {
      return owning_node_;
    }
  }
  const Node* owningNode() const {
    return owning_node_;
  }
};

inline Graph* Value::owningGraph() {
  if (py::cast<bool>(is_in_onnx_export)) {
    return node()->owningGraph()->shared_from_this();
  } else {
    return node()->owningGraph();
  }
}

inline const Graph* Value::owningGraph() const {
  return node()->owningGraph();
}
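One detail worth noting about the proposal above: `shared_from_this()` returns a `std::shared_ptr<T>`, not a `T*`, so an accessor declared as `Graph*` could not return it directly. A minimal sketch of how an owning accessor might sit next to the existing raw-pointer one is shown below. `MiniGraph`, `MiniNode`, and `owningGraphShared` are hypothetical stand-ins for illustration only and are not part of the PyTorch API.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for torch::jit::Graph and torch::jit::Node.
struct MiniGraph : std::enable_shared_from_this<MiniGraph> {};

struct MiniNode {
  explicit MiniNode(MiniGraph* g) : graph_(g) {}

  // Existing-style accessor: hands out a non-owning raw pointer.
  MiniGraph* owningGraph() const { return graph_; }

  // Owning variant (hypothetical name): joins the control block that already
  // manages the graph. Since C++17 this throws std::bad_weak_ptr if no
  // shared_ptr currently owns *graph_.
  std::shared_ptr<MiniGraph> owningGraphShared() const {
    return graph_->shared_from_this();
  }

 private:
  MiniGraph* graph_;  // non-owning back-pointer
};

// Returns the use_count seen by the original owner while a second owner,
// obtained through the node, is still alive.
long use_count_with_extra_owner() {
  auto graph = std::make_shared<MiniGraph>();
  MiniNode node(graph.get());
  std::shared_ptr<MiniGraph> extra = node.owningGraphShared();
  assert(extra.get() == node.owningGraph());  // same object, shared ownership
  return graph.use_count();
}
```

Keeping the raw-pointer accessor unchanged and adding a separate owning accessor avoids changing the return type that existing callers rely on, which is the intrusiveness concern raised in this thread.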
Force-pushed: 7e96826 to 0b784a3
@thiagocrepaldi Yeah, ideally we should get rid of all the raw pointers and replace them with smart pointers, but I fear that would be too intrusive.
Yeah, that makes sense, it might be too intrusive. Maybe @malfet could look into our proposal and comment on the idea.
This sounds like a good idea, but to make it much more explicit, can we make the Graph constructor private and instead expose a static inline function that creates it as a shared pointer? I.e., I want to prevent anyone from introducing the following:
auto g = new Graph(scope);
return g->shared_from_this();
as it causes a memory leak, doesn't it?
I'm uncertain about the impact. It is BC-breaking, and every Graph creation call is affected. It seems a lot of JIT code passes raw pointers around. See the incomplete attempt at #87747 for reference. I'm less inclined to block this PR until all of that is resolved.
Starting from C++17, this code will throw
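For reference, a small sketch of the two points discussed above. Since C++17, calling `shared_from_this()` on an object that no `std::shared_ptr` owns throws `std::bad_weak_ptr` (before C++17 it was undefined behavior). A private constructor plus a static factory rules that pattern out at compile time. `LooseGraph`, `GuardedGraph`, and `create` are hypothetical names for illustration, not the actual `torch::jit::Graph` interface; build with `-std=c++17` for the guaranteed behavior.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical type with a public constructor, mirroring the risky pattern.
struct LooseGraph : std::enable_shared_from_this<LooseGraph> {};

// Hypothetical type that can only be created through a factory, so every
// instance is owned by a shared_ptr from the start.
class GuardedGraph : public std::enable_shared_from_this<GuardedGraph> {
 public:
  static std::shared_ptr<GuardedGraph> create(std::string scope) {
    // std::make_shared cannot reach the private constructor, so use new here.
    return std::shared_ptr<GuardedGraph>(new GuardedGraph(std::move(scope)));
  }
  const std::string& scope() const { return scope_; }

 private:
  explicit GuardedGraph(std::string scope) : scope_(std::move(scope)) {}
  std::string scope_;
};

// Returns true if shared_from_this() throws on an object that is not owned
// by any shared_ptr. A unique_ptr owns the object so the demo does not leak.
bool unowned_shared_from_this_throws() {
  auto owner = std::make_unique<LooseGraph>();
  try {
    auto sp = owner->shared_from_this();
    (void)sp;
  } catch (const std::bad_weak_ptr&) {
    return true;
  }
  return false;
}
```

With the factory in place, `new GuardedGraph(...)` outside the class fails to compile, so the only way to get an instance is `GuardedGraph::create(...)`, and `shared_from_this()` is then always safe to call.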
Force-pushed: 0b784a3 to 1867e03
@malfet wdyt?
@pytorchbot merge
Hmm, are we making any promises about stability of C++ API? (I.e.
The impact is mostly on JIT, I'll leave that for folks from JIT team to comment. The changes on ONNX and exporter side are quite small.
Avoid passing raw pointer of 'torch::jit::Graph' to python. Otherwise, it will corrupt the `internals::registered_instance` of pybind11, caching a holder for python w.r.t the raw pointer of 'torch::jit::Graph', while not increasing the use count of the existing shared_ptr. The behavior afterwards is random and probably undefined. Most of the time it works, if the holder is deallocated timely on python side, and the cache then cleared from `internals::registered_instance`. Things are back to normal. Otherwise, it fails with either segfault or a runtime error of message "Unable to cast from non-held to held instance". One of such scenarios is normally and correctly returning a shared_ptr of that 'torch::jit::Graph' to python. Pybind finds the holder via cache. Due to this, the shared_ptr use_count will not increase. If there is no other use on C++ side, the graph will be freed, while python still has access, via the holder created previously. @t-vi had a great analysis and solution to this exact problem at pytorch#51833 which I hope I had seen before debugging this issue... ~~I'm building the PR based on the original commit. @t-vi please let me know if you'd prefer otherwise.~~ Sending the PR separately due to CLA issues. Need to check in CI if adding `enable_shared_from_this` breaks other stuff. Fixes pytorch#51833, and CI issues in pytorch#87258, pytorch#86182. cc @malfet, @kit1980 for changes on JIT IR. Pull Request resolved: pytorch#87343 Approved by: https://github.com/justinchuby, https://github.com/AllenTiTaiWang, https://github.com/malfet
Avoid passing raw pointer of 'torch::jit::Graph' to python. Otherwise, it will corrupt the `internals::registered_instance` of pybind11, caching a holder for python w.r.t the raw pointer of 'torch::jit::Graph', while not increasing the use count of the existing shared_ptr.

The behavior afterwards is random and probably undefined. Most of the time it works, if the holder is deallocated timely on python side, and the cache then cleared from `internals::registered_instance`. Things are back to normal. Otherwise, it fails with either segfault or a runtime error of message "Unable to cast from non-held to held instance". One of such scenarios is normally and correctly returning a shared_ptr of that 'torch::jit::Graph' to python. Pybind finds the holder via cache. Due to this, the shared_ptr use_count will not increase. If there is no other use on C++ side, the graph will be freed, while python still has access, via the holder created previously.

@t-vi had a great analysis and solution to this exact problem at #51833 which I hope I had seen before debugging this issue...

~~I'm building the PR based on the original commit. @t-vi please let me know if you'd prefer otherwise.~~ Sending the PR separately due to CLA issues.

Need to check in CI if adding `enable_shared_from_this` breaks other stuff.

Fixes #51833, and CI issues in #87258, #86182.

cc @malfet, @kit1980 for changes on JIT IR.
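The ownership problem described above can be modeled without pybind11. The sketch below is a simplified illustration, not pybind11's actual holder code: a "holder" built from a raw pointer gets its own control block and does not keep the graph alive, while a holder obtained via `shared_from_this()` does. `PlainGraph` and `SharedAwareGraph` are hypothetical types, and the no-op deleter exists only so the demo avoids a double free.

```cpp
#include <cassert>
#include <memory>

struct PlainGraph {};
struct SharedAwareGraph : std::enable_shared_from_this<SharedAwareGraph> {};

// Models a binding layer that only sees a raw pointer and builds an
// independent holder. Returns whether the graph outlives its C++ owner.
bool raw_holder_keeps_graph_alive() {
  auto graph = std::make_shared<PlainGraph>();
  std::weak_ptr<PlainGraph> watcher = graph;
  // Separate control block; the no-op deleter avoids a double free here.
  std::shared_ptr<PlainGraph> holder(graph.get(), [](PlainGraph*) {});
  assert(graph.use_count() == 1);  // the holder did not join the owners
  graph.reset();                   // last real owner goes away
  return !watcher.expired();       // false: holder now dangles
}

// Models a holder that joins the existing owners via shared_from_this().
bool shared_holder_keeps_graph_alive() {
  auto graph = std::make_shared<SharedAwareGraph>();
  std::weak_ptr<SharedAwareGraph> watcher = graph;
  std::shared_ptr<SharedAwareGraph> holder = graph->shared_from_this();
  assert(graph.use_count() == 2);  // the holder is a real co-owner
  graph.reset();
  return !watcher.expired();       // true: holder keeps the graph alive
}
```

In the first function the holder must not be dereferenced after `graph.reset()`, which corresponds to the case where Python still has access to a graph that C++ has already freed.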