Add 'share_from_this' to 'torch::jit::Graph' #87343

BowenBao · 2022-10-20T01:20:32Z

Avoid passing raw pointer of 'torch::jit::Graph' to python. Otherwise, it will corrupt the
internals::registered_instance of pybind11, caching a holder for python w.r.t the raw
pointer of 'torch::jit::Graph', while not increasing the use count of the existing shared_ptr.

The behavior afterwards is random and probably undefined.
Most of the time it works, if the holder is deallocated timely on python side, and the
cache then cleared from internals::registered_instance. Things are back to normal.
Otherwise, it fails with either segfault or a runtime error of message "Unable to cast
from non-held to held instance". One of such scenarios is normally and correctly
returning a shared_ptr of that 'torch::jit::Graph' to python. Pybind finds the holder via
cache. Due to this, the shared_ptr use_count will not increase. If there is no other use
on C++ side, the graph will be freed, while python still has access, via the holder created
previously.

@t-vi had a great analysis and solution to this exact problem at #51833 which I hope
I had seen before debugging this issue... I'm building the PR based on the original
commit. @t-vi please let me know if you'd prefer otherwise. Sending the PR separately
due to CLA issues.

Need to check in CI if adding enable_shared_from_this breaks other stuff.

Fixes #51833, and CI issues in #87258, #86182.

cc @malfet, @kit1980 for changes on JIT IR.

pytorch-bot · 2022-10-20T01:20:35Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/87343

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures, 1 Pending

As of commit 1867e03:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla · 2022-10-20T01:20:37Z

The committers listed above are authorized under a signed CLA.

✅ login: BowenBao / name: Bowen Bao (0b784a3)

torch/csrc/jit/passes/onnx.cpp

test/onnx/test_custom_ops.py

justinchuby

🕺

thiagocrepaldi

Wow, I closed the original issue because it wouldn't reproduce, but this fix really tackles the root cause. Fantastic job

If that works, do you think we can modify the original Graph* owningGraph() with that logic inside at torch/csrc/jit/ir/ir.h? This way, we are always getting the proper shared version during export, but keeping existing behavior otherwise

We would need to check whether the const Graph *, const Block *` and const Value *` would need this too

Something like

py::bool_ is_in_onnx_export =
      py::module::import("torch.onnx.__init__").attr("is_in_onnx_export");

struct Value {
 public:
  Graph* owningGraph();
  const Graph* owningGraph() const;
};

struct TORCH_API Node {
 public:
  Graph* owningGraph() {
    if (py::cast<bool>(is_in_onnx_export)) {
      return graph_->shared_from_this();
    } else {
      return graph_;
    }
  }
  const Graph* owningGraph() const {
    return graph_;
  }
  Block* owningBlock() {
    if (py::cast<bool>(is_in_onnx_export)) {
      return owning_block_->shared_from_this();
    } else {
      return owning_block_;
    }
  }
  const Block* owningBlock() const {
    return owning_block_;
  }
};

struct Block {
  Graph* owningGraph() {
    if (py::cast<bool>(is_in_onnx_export)) {
      return graph_->shared_from_this();
    } else {
      return graph_;
    }
  }
  const Graph* owningGraph() const {
      return graph_;
  }
  Node* owningNode() {
    if (py::cast<bool>(is_in_onnx_export)) {
      return owning_node_->shared_from_this();
    } else {
      return owning_node_;
    }
  }
  const Node* owningNode() const {
    return owning_node_;
  }
};

inline Graph* Value::owningGraph() {
    if (py::cast<bool>(is_in_onnx_export)) {
      return node()->owningGraph()->shared_from_this();
    } else {
      return node()->owningGraph();
    }
}

inline const Graph* Value::owningGraph() const {
    return node()->owningGraph();
}

BowenBao · 2022-10-20T22:57:36Z

@thiagocrepaldi yeah ideally we should get rid of all the raw pointers and replace with smart pointers. I fear it would be too intrusive.

thiagocrepaldi · 2022-10-20T23:26:34Z

@thiagocrepaldi yeah ideally we should get rid of all the raw pointers and replace with smart pointers. I fear it would be too intrusive.

Yeah, that makes sense, it might be too intrusive. Maybe @malfet could look into our proposal and comment on the idea.
The new comments do help, but depending on final user to do the proper pointer handling might be a reason for future regressions.

BowenBao · 2022-10-24T16:06:22Z

@kit1980 @malfet please help import, this fixes an important issue that causes random segfault in exporter.

malfet

This sounds like a good idea, but to make it much more explicit, can we make Graph constructor private and instead expose a static inline function that creates it as shared pointer? I.e. I want to prevent one from introducing following:

  auto g = new Graph(scope);
  return g->shared_from_this();

as it causes memory leak, isn't it?

BowenBao · 2022-10-26T00:54:45Z

make Graph constructor private and instead expose a static inline function that creates it as shared pointer

I'm uncertain on the impact. It is bc breaking and every Graph creation call is affected. It seems a lot of jit code was passing around raw pointers. An incomplete attempt at #87747 for reference. I'm less inclined to block this PR until all is resolved.

as it causes memory leak, isn't it?

Starting from C++17, this code will throw std::bad_weak_ptr.

BowenBao · 2022-10-26T20:12:36Z

@malfet wdyt?

BowenBao · 2022-10-28T22:30:44Z

@malfet @kit1980 friendly ping, we have PRs that are blocked on this.

malfet · 2022-10-28T22:34:23Z

@pytorchbot merge

malfet · 2022-10-28T22:37:32Z

I'm uncertain on the impact. It is bc breaking and every Graph creation call is affected. It seems a lot of jit code was passing around raw pointers. An incomplete attempt at #87747 for reference. I'm less inclined to block this PR until all is resolved.

Hmm, are we making any promises about stability of C++ API? (I.e. torch::jit::Graph is not mentioned in any public documentation, is it? And #87747 feels like a way it should be, isn't it?

BowenBao · 2022-10-28T22:46:57Z

I'm uncertain on the impact. It is bc breaking and every Graph creation call is affected. It seems a lot of jit code was passing around raw pointers. An incomplete attempt at #87747 for reference. I'm less inclined to block this PR until all is resolved.

Hmm, are we making any promises about stability of C++ API? (I.e. torch::jit::Graph is not mentioned in any public documentation, is it? And #87747 feels like a way it should be, isn't it?

The impact is mostly on JIT, I'll leave that for folks from JIT team to comment. The changes on ONNX and exporter side are quite small.

pytorchmergebot · 2022-10-28T23:51:38Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

@t-vi

Avoid passing raw pointer of 'torch::jit::Graph' to python. Otherwise, it will corrupt the `internals::registered_instance` of pybind11, caching a holder for python w.r.t the raw pointer of 'torch::jit::Graph', while not increasing the use count of the existing shared_ptr. The behavior afterwards is random and probably undefined. Most of the time it works, if the holder is deallocated timely on python side, and the cache then cleared from `internals::registered_instance`. Things are back to normal. Otherwise, it fails with either segfault or a runtime error of message "Unable to cast from non-held to held instance". One of such scenarios is normally and correctly returning a shared_ptr of that 'torch::jit::Graph' to python. Pybind finds the holder via cache. Due to this, the shared_ptr use_count will not increase. If there is no other use on C++ side, the graph will be freed, while python still has access, via the holder created previously. @t-vi had a great analysis and solution to this exact problem at pytorch#51833 which I hope I had seen before debugging this issue... ~~I'm building the PR based on the original commit. @t-vi please let me know if you'd prefer otherwise.~~ Sending the PR separately due to CLA issues. Need to check in CI if adding `enable_shared_from_this` breaks other stuff. Fixes pytorch#51833, and CI issues in pytorch#87258, pytorch#86182. cc @malfet, @kit1980 for changes on JIT IR. Pull Request resolved: pytorch#87343 Approved by: https://github.com/justinchuby, https://github.com/AllenTiTaiWang, https://github.com/malfet

@t-vi

Avoid passing raw pointer of 'torch::jit::Graph' to python. Otherwise, it will corrupt the `internals::registered_instance` of pybind11, caching a holder for python w.r.t the raw pointer of 'torch::jit::Graph', while not increasing the use count of the existing shared_ptr. The behavior afterwards is random and probably undefined. Most of the time it works, if the holder is deallocated timely on python side, and the cache then cleared from `internals::registered_instance`. Things are back to normal. Otherwise, it fails with either segfault or a runtime error of message "Unable to cast from non-held to held instance". One of such scenarios is normally and correctly returning a shared_ptr of that 'torch::jit::Graph' to python. Pybind finds the holder via cache. Due to this, the shared_ptr use_count will not increase. If there is no other use on C++ side, the graph will be freed, while python still has access, via the holder created previously. @t-vi had a great analysis and solution to this exact problem at pytorch#51833 which I hope I had seen before debugging this issue... ~~I'm building the PR based on the original commit. @t-vi please let me know if you'd prefer otherwise.~~ Sending the PR separately due to CLA issues. Need to check in CI if adding `enable_shared_from_this` breaks other stuff. Fixes pytorch#51833, and CI issues in pytorch#87258, pytorch#86182. cc @malfet, @kit1980 for changes on JIT IR. Pull Request resolved: pytorch#87343 Approved by: https://github.com/justinchuby, https://github.com/AllenTiTaiWang, https://github.com/malfet

BowenBao added module: onnx Related to torch.onnx release notes: jit release notes category release notes: onnx torch.onnx related changes that should show up in the release notes topic: bug fixes topic category labels Oct 20, 2022

BowenBao requested a review from abock as a code owner October 20, 2022 01:20

pytorchbot added the open source label Oct 20, 2022

BowenBao force-pushed the onnx_graph_enable_shared_from_this branch from 18affda to 7e96826 Compare October 20, 2022 17:10

BowenBao mentioned this pull request Oct 20, 2022

[ONNX] Auto test based on OpInfo #86182

Closed

justinchuby reviewed Oct 20, 2022

View reviewed changes

torch/csrc/jit/passes/onnx.cpp Show resolved Hide resolved

justinchuby reviewed Oct 20, 2022

View reviewed changes

test/onnx/test_custom_ops.py Outdated Show resolved Hide resolved

justinchuby approved these changes Oct 20, 2022

View reviewed changes

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 20, 2022

thiagocrepaldi reviewed Oct 20, 2022

View reviewed changes

BowenBao force-pushed the onnx_graph_enable_shared_from_this branch from 7e96826 to 0b784a3 Compare October 20, 2022 22:49

titaiwangms approved these changes Oct 21, 2022

View reviewed changes

BowenBao linked an issue Oct 21, 2022 that may be closed by this pull request

[ONNX] Improve torch.onnx diagnostics #84659

Closed

17 tasks

BowenBao added the onnx-needs-import This PR is related to ONNX, but touches files outside of merge rule patterns, and hence needs import label Oct 24, 2022

kit1980 self-assigned this Oct 24, 2022

malfet requested changes Oct 25, 2022

View reviewed changes

Add 'share_from_this' to 'torch::jit::Graph'

1867e03

BowenBao force-pushed the onnx_graph_enable_shared_from_this branch from 0b784a3 to 1867e03 Compare October 26, 2022 20:11

bdhirsh added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Oct 27, 2022

malfet approved these changes Oct 28, 2022

View reviewed changes

pytorchmergebot added the Merged label Oct 28, 2022

pytorchmergebot closed this in 376acf7 Oct 28, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add 'share_from_this' to 'torch::jit::Graph' #87343

Add 'share_from_this' to 'torch::jit::Graph' #87343

BowenBao commented Oct 20, 2022 •

edited

pytorch-bot bot commented Oct 20, 2022 •

edited

linux-foundation-easycla bot commented Oct 20, 2022 •

edited

justinchuby left a comment

thiagocrepaldi left a comment

BowenBao commented Oct 20, 2022

thiagocrepaldi commented Oct 20, 2022

BowenBao commented Oct 24, 2022

malfet left a comment •

edited

BowenBao commented Oct 26, 2022 •

edited

BowenBao commented Oct 26, 2022

BowenBao commented Oct 28, 2022

malfet commented Oct 28, 2022

malfet commented Oct 28, 2022

BowenBao commented Oct 28, 2022

pytorchmergebot commented Oct 28, 2022

Add 'share_from_this' to 'torch::jit::Graph' #87343

Add 'share_from_this' to 'torch::jit::Graph' #87343

Conversation

BowenBao commented Oct 20, 2022 • edited

pytorch-bot bot commented Oct 20, 2022 • edited

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/87343

✅ No Failures, 1 Pending

linux-foundation-easycla bot commented Oct 20, 2022 • edited

justinchuby left a comment

Choose a reason for hiding this comment

thiagocrepaldi left a comment

Choose a reason for hiding this comment

BowenBao commented Oct 20, 2022

thiagocrepaldi commented Oct 20, 2022

BowenBao commented Oct 24, 2022

malfet left a comment • edited

Choose a reason for hiding this comment

BowenBao commented Oct 26, 2022 • edited

BowenBao commented Oct 26, 2022

BowenBao commented Oct 28, 2022

malfet commented Oct 28, 2022

malfet commented Oct 28, 2022

BowenBao commented Oct 28, 2022

pytorchmergebot commented Oct 28, 2022

Merge started

BowenBao commented Oct 20, 2022 •

edited

pytorch-bot bot commented Oct 20, 2022 •

edited

linux-foundation-easycla bot commented Oct 20, 2022 •

edited

malfet left a comment •

edited

BowenBao commented Oct 26, 2022 •

edited