Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add 'share_from_this' to 'torch::jit::Graph' #87343

Closed

Conversation

BowenBao
Copy link
Collaborator

@BowenBao BowenBao commented Oct 20, 2022

Avoid passing raw pointer of 'torch::jit::Graph' to python. Otherwise, it will corrupt the
internals::registered_instance of pybind11, caching a holder for python w.r.t the raw
pointer of 'torch::jit::Graph', while not increasing the use count of the existing shared_ptr.

The behavior afterwards is random and probably undefined.
Most of the time it works, if the holder is deallocated timely on python side, and the
cache then cleared from internals::registered_instance. Things are back to normal.
Otherwise, it fails with either segfault or a runtime error of message "Unable to cast
from non-held to held instance". One of such scenarios is normally and correctly
returning a shared_ptr of that 'torch::jit::Graph' to python. Pybind finds the holder via
cache. Due to this, the shared_ptr use_count will not increase. If there is no other use
on C++ side, the graph will be freed, while python still has access, via the holder created
previously.

@t-vi had a great analysis and solution to this exact problem at #51833 which I hope
I had seen before debugging this issue... I'm building the PR based on the original
commit. @t-vi please let me know if you'd prefer otherwise.
Sending the PR separately
due to CLA issues.

Need to check in CI if adding enable_shared_from_this breaks other stuff.

Fixes #51833, and CI issues in #87258, #86182.

cc @malfet, @kit1980 for changes on JIT IR.

@BowenBao BowenBao added module: onnx Related to torch.onnx release notes: jit release notes category release notes: onnx torch.onnx related changes that should show up in the release notes topic: bug fixes topic category labels Oct 20, 2022
@BowenBao BowenBao requested a review from abock as a code owner October 20, 2022 01:20
@pytorch-bot
Copy link

pytorch-bot bot commented Oct 20, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/87343

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures, 1 Pending

As of commit 1867e03:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Oct 20, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: BowenBao / name: Bowen Bao (0b784a3)

Copy link
Collaborator

@justinchuby justinchuby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🕺

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 20, 2022
Copy link
Collaborator

@thiagocrepaldi thiagocrepaldi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, I closed the original issue because it wouldn't reproduce, but this fix really tackles the root cause. Fantastic job

If that works, do you think we can modify the original Graph* owningGraph() with that logic inside at torch/csrc/jit/ir/ir.h? This way, we are always getting the proper shared version during export, but keeping existing behavior otherwise

We would need to check whether the const Graph *, const Block *` and const Value *` would need this too

Something like

py::bool_ is_in_onnx_export =
      py::module::import("torch.onnx.__init__").attr("is_in_onnx_export");

struct Value {
 public:
  Graph* owningGraph();
  const Graph* owningGraph() const;
};

struct TORCH_API Node {
 public:
  Graph* owningGraph() {
    if (py::cast<bool>(is_in_onnx_export)) {
      return graph_->shared_from_this();
    } else {
      return graph_;
    }
  }
  const Graph* owningGraph() const {
    return graph_;
  }
  Block* owningBlock() {
    if (py::cast<bool>(is_in_onnx_export)) {
      return owning_block_->shared_from_this();
    } else {
      return owning_block_;
    }
  }
  const Block* owningBlock() const {
    return owning_block_;
  }
};

struct Block {
  Graph* owningGraph() {
    if (py::cast<bool>(is_in_onnx_export)) {
      return graph_->shared_from_this();
    } else {
      return graph_;
    }
  }
  const Graph* owningGraph() const {
      return graph_;
  }
  Node* owningNode() {
    if (py::cast<bool>(is_in_onnx_export)) {
      return owning_node_->shared_from_this();
    } else {
      return owning_node_;
    }
  }
  const Node* owningNode() const {
    return owning_node_;
  }
};

inline Graph* Value::owningGraph() {
    if (py::cast<bool>(is_in_onnx_export)) {
      return node()->owningGraph()->shared_from_this();
    } else {
      return node()->owningGraph();
    }
}

inline const Graph* Value::owningGraph() const {
    return node()->owningGraph();
}

@BowenBao BowenBao force-pushed the onnx_graph_enable_shared_from_this branch from 7e96826 to 0b784a3 Compare October 20, 2022 22:49
@BowenBao
Copy link
Collaborator Author

@thiagocrepaldi yeah ideally we should get rid of all the raw pointers and replace with smart pointers. I fear it would be too intrusive.

@thiagocrepaldi
Copy link
Collaborator

@thiagocrepaldi yeah ideally we should get rid of all the raw pointers and replace with smart pointers. I fear it would be too intrusive.

Yeah, that makes sense, it might be too intrusive. Maybe @malfet could look into our proposal and comment on the idea.
The new comments do help, but depending on final user to do the proper pointer handling might be a reason for future regressions.

@BowenBao BowenBao linked an issue Oct 21, 2022 that may be closed by this pull request
17 tasks
@BowenBao BowenBao added the onnx-needs-import This PR is related to ONNX, but touches files outside of merge rule patterns, and hence needs import label Oct 24, 2022
@BowenBao
Copy link
Collaborator Author

@kit1980 @malfet please help import, this fixes an important issue that causes random segfault in exporter.

@kit1980 kit1980 self-assigned this Oct 24, 2022
Copy link
Contributor

@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like a good idea, but to make it much more explicit, can we make Graph constructor private and instead expose a static inline function that creates it as shared pointer? I.e. I want to prevent one from introducing following:

  auto g = new Graph(scope);
  return g->shared_from_this();

as it causes memory leak, isn't it?

@BowenBao
Copy link
Collaborator Author

BowenBao commented Oct 26, 2022

make Graph constructor private and instead expose a static inline function that creates it as shared pointer

I'm uncertain on the impact. It is bc breaking and every Graph creation call is affected. It seems a lot of jit code was passing around raw pointers. An incomplete attempt at #87747 for reference. I'm less inclined to block this PR until all is resolved.

as it causes memory leak, isn't it?

Starting from C++17, this code will throw std::bad_weak_ptr.

@BowenBao BowenBao force-pushed the onnx_graph_enable_shared_from_this branch from 0b784a3 to 1867e03 Compare October 26, 2022 20:11
@BowenBao
Copy link
Collaborator Author

@malfet wdyt?

@bdhirsh bdhirsh added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Oct 27, 2022
@BowenBao
Copy link
Collaborator Author

@malfet @kit1980 friendly ping, we have PRs that are blocked on this.

@malfet
Copy link
Contributor

malfet commented Oct 28, 2022

@pytorchbot merge

@malfet
Copy link
Contributor

malfet commented Oct 28, 2022

I'm uncertain on the impact. It is bc breaking and every Graph creation call is affected. It seems a lot of jit code was passing around raw pointers. An incomplete attempt at #87747 for reference. I'm less inclined to block this PR until all is resolved.

Hmm, are we making any promises about stability of C++ API? (I.e. torch::jit::Graph is not mentioned in any public documentation, is it? And #87747 feels like a way it should be, isn't it?

@BowenBao
Copy link
Collaborator Author

I'm uncertain on the impact. It is bc breaking and every Graph creation call is affected. It seems a lot of jit code was passing around raw pointers. An incomplete attempt at #87747 for reference. I'm less inclined to block this PR until all is resolved.

Hmm, are we making any promises about stability of C++ API? (I.e. torch::jit::Graph is not mentioned in any public documentation, is it? And #87747 feels like a way it should be, isn't it?

The impact is mostly on JIT, I'll leave that for folks from JIT team to comment. The changes on ONNX and exporter side are quite small.

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Nov 5, 2022
Avoid passing raw pointer of 'torch::jit::Graph' to python. Otherwise, it will corrupt the
`internals::registered_instance` of pybind11, caching a holder for python w.r.t the raw
pointer of 'torch::jit::Graph', while not increasing the use count of the existing shared_ptr.

The behavior afterwards is random and probably undefined.
Most of the time it works, if the holder is deallocated timely on python side, and the
cache then cleared from `internals::registered_instance`. Things are back to normal.
Otherwise, it fails with either segfault or a runtime error of message "Unable to cast
from non-held to held instance". One of such scenarios is normally and correctly
returning a shared_ptr of that 'torch::jit::Graph' to python. Pybind finds the holder via
cache. Due to this, the shared_ptr use_count will not increase. If there is no other use
on C++ side, the graph will be freed, while python still has access, via the holder created
previously.

@t-vi had a great analysis and solution to this exact problem at pytorch#51833 which I hope
I had seen before debugging this issue... ~~I'm building the PR based on the original
commit. @t-vi please let me know if you'd prefer otherwise.~~ Sending the PR separately
due to CLA issues.

Need to check in CI if adding `enable_shared_from_this` breaks other stuff.

Fixes pytorch#51833, and CI issues in pytorch#87258, pytorch#86182.

cc @malfet, @kit1980 for changes on JIT IR.
Pull Request resolved: pytorch#87343
Approved by: https://github.com/justinchuby, https://github.com/AllenTiTaiWang, https://github.com/malfet
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
Avoid passing raw pointer of 'torch::jit::Graph' to python. Otherwise, it will corrupt the
`internals::registered_instance` of pybind11, caching a holder for python w.r.t the raw
pointer of 'torch::jit::Graph', while not increasing the use count of the existing shared_ptr.

The behavior afterwards is random and probably undefined.
Most of the time it works, if the holder is deallocated timely on python side, and the
cache then cleared from `internals::registered_instance`. Things are back to normal.
Otherwise, it fails with either segfault or a runtime error of message "Unable to cast
from non-held to held instance". One of such scenarios is normally and correctly
returning a shared_ptr of that 'torch::jit::Graph' to python. Pybind finds the holder via
cache. Due to this, the shared_ptr use_count will not increase. If there is no other use
on C++ side, the graph will be freed, while python still has access, via the holder created
previously.

@t-vi had a great analysis and solution to this exact problem at pytorch#51833 which I hope
I had seen before debugging this issue... ~~I'm building the PR based on the original
commit. @t-vi please let me know if you'd prefer otherwise.~~ Sending the PR separately
due to CLA issues.

Need to check in CI if adding `enable_shared_from_this` breaks other stuff.

Fixes pytorch#51833, and CI issues in pytorch#87258, pytorch#86182.

cc @malfet, @kit1980 for changes on JIT IR.
Pull Request resolved: pytorch#87343
Approved by: https://github.com/justinchuby, https://github.com/AllenTiTaiWang, https://github.com/malfet
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged module: onnx Related to torch.onnx onnx-needs-import This PR is related to ONNX, but touches files outside of merge rule patterns, and hence needs import open source release notes: jit release notes category release notes: onnx torch.onnx related changes that should show up in the release notes topic: bug fixes topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ONNX] Improve torch.onnx diagnostics Segfault printing torch::jit::Graph in ONNX test suite traceback
9 participants