Conversation

sypneiwski
Contributor

Description

This adds Python hooks into PyTorch that allow the user to register their own callbacks for events such as tensor allocation, stream allocation, event record / wait etc.
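
For a concrete picture, here is a minimal usage sketch. It assumes the torch.utils._cuda_trace module that the tests below import; the registration helper names are assumptions for illustration, not taken verbatim from this PR's diff.

import torch
import torch.utils._cuda_trace as cuda_trace

# Hypothetical callbacks: each hook is assumed to receive the raw CUDA
# handle or pointer as a plain int.
def on_event_creation(event: int) -> None:
    print(f"CUDA event created: {event:#x}")

def on_memory_allocation(ptr: int) -> None:
    print(f"CUDA allocation at: {ptr:#x}")

# Registration helper names are assumed for illustration.
cuda_trace.register_callback_for_cuda_event_creation(on_event_creation)
cuda_trace.register_callback_for_cuda_memory_allocation(on_memory_allocation)

# Traced operations now invoke the callbacks.
x = torch.empty(128, device="cuda")  # triggers the allocation hook
torch.cuda.Event().record()          # triggers the event-creation hook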

@facebook-github-bot
Contributor

facebook-github-bot commented Aug 4, 2022

✅ No Failures (0 Pending)

As of commit a4b8809 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@lw left a comment (Contributor)

Thanks for sending it out! At first glance it looks like everything is there already! Hopefully we won't need to iterate too much and can land it quickly!

One overall comment: in this PR the CUDATrace is a global static, not a thread-local, so perhaps we should remove the "TLS" from the names of those files?

@@ -5,6 +5,19 @@
namespace c10 {
namespace impl {

template<typename... Ts>
static void noop_trace_cuda_fn(const PyInterpreter*, Ts...) {}
Contributor

Can't we technically apply this pattern everywhere?

Actually, I've been a bit irritated at the need to manually do function pointers and disarm them one by one, and I'm wondering if there's a way we can use plain old virtual methods to do this.

// WARNING: This class has to be written very carefully, because it may be
// possible for a Tensor to have a reference to an interpreter corresponding to
// a shared library that has ALREADY BEEN UNLOADED.  This makes blindly calling
// virtual methods very dangerous, because the vtable may be garbage at that
// point (on a good day, you might get "pure virtual method called").
//
// The idea to solve this problem is we always leak PyInterpreters (so they
// always stay live even after dlclose), and disarm the "virtual methods" by
// replacing them with function pointers that just no-op.  This can't be done
// with a traditional C++ vtable, so we have to roll our own.

I wonder if we can solve the problem that was specified here by doing a second layer of indirection. So the outer PyInterpreter is non-virtual but contains a pointer to an object with the virtual methods, and then unloading simply involves swapping out that pointer for a different one that is guaranteed to be in a stable shared library. The cost is an extra indirection to get to the actual function pointer, maybe that's a small price to pay for dramatically decreasing the boilerplate here (when this class was originally written, it was like... two function pointers. Now there's so many lol.)
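
A minimal C++ sketch of that double-indirection idea (not this PR's implementation; all names are illustrative):

#include <cstdint>

// The inner object carries the actual virtual methods.
struct PyInterpreterVTable {
  virtual ~PyInterpreterVTable() = default;
  virtual void trace_gpu_event_creation(uintptr_t event) const = 0;
  // ... one virtual method per hook ...
};

// Lives in a stable shared library, so its vtable stays valid forever.
struct NoopPyInterpreterVTable : PyInterpreterVTable {
  void trace_gpu_event_creation(uintptr_t) const override {}
};

// The outer object is deliberately non-virtual: even after the Python
// library is dlclose'd, calling through it stays safe, because disarm()
// already swapped impl_ to the guaranteed-live no-op vtable.
struct PyInterpreter {
  const PyInterpreterVTable* impl_;

  void trace_gpu_event_creation(uintptr_t event) const {
    impl_->trace_gpu_event_creation(event);  // one extra indirection
  }

  void disarm() noexcept {
    static NoopPyInterpreterVTable noop;  // lives in the stable library
    impl_ = &noop;  // a single pointer swap disarms every hook at once
  }
};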

Contributor

Could we start by using this template to "auto-generate" the noop variants of these hooks? We'd still have to assign each of them manually when disarming, but at least we'd remove tons of static function definitions.
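
Roughly what that could look like (a sketch; the member names are illustrative, not from this PR):

#include <cstdint>

struct PyInterpreter;

// One variadic no-op template covers every hook signature.
template <typename... Ts>
static void noop_trace_cuda_fn(const PyInterpreter*, Ts...) {}

struct PyInterpreter {
  // Illustrative subset of the hook function pointers.
  void (*trace_cuda_event_creation_fn_)(const PyInterpreter*, uintptr_t);
  void (*trace_cuda_event_record_fn_)(const PyInterpreter*, uintptr_t, uintptr_t);

  void disarm() noexcept {
    // Each assignment is still manual, but the no-op bodies are
    // instantiated from the template rather than hand-written statics.
    trace_cuda_event_creation_fn_ = &noop_trace_cuda_fn<uintptr_t>;
    trace_cuda_event_record_fn_ = &noop_trace_cuda_fn<uintptr_t, uintptr_t>;
  }
};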


@sypneiwski
Contributor Author

The main open issue is the speed of the fast path. The current solution has all of these tricks enabled to make it fast: the C10_UNLIKELY macro, the inlined get_trace function, and the static boolean (sketched below). Based on the Callgrind instruction-count benchmark, the slowdown is ~0.087%. Is this acceptable? Are you concerned by the potential synchronisation issue with the static boolean?
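
For reference, the fast path being described looks roughly like this (a sketch; the names and surrounding machinery are illustrative):

#include <atomic>

// A plain boolean guards the hot path; the properly synchronized atomic
// is only consulted once the flag has been flipped.
static bool cuda_trace_armed = false;
static std::atomic<const void*> cuda_trace{nullptr};

inline const void* get_trace() {
  // C10_UNLIKELY wraps __builtin_expect; the builtin is used directly
  // here so the sketch is self-contained. It keeps the traced path out
  // of the hot code layout.
  if (__builtin_expect(cuda_trace_armed, 0)) {
    return cuda_trace.load(std::memory_order_acquire);
  }
  return nullptr;  // fast path: one non-atomic load, one predictable branch
}

void set_trace(const void* tracer) {
  cuda_trace.store(tracer, std::memory_order_release);
  cuda_trace_armed = true;  // benign race: a reader may miss a few events
                            // from just before the flag became visible
}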

@ezyang
Contributor

ezyang commented Aug 8, 2022

The perf seems acceptable to me

@ngimel
Collaborator

ngimel commented Aug 8, 2022

My 2c: 0.087% is acceptable, and I don't see sync issues with the static boolean (you'd still have your proper atomic once the static boolean is flipped, and everyone should see the same value of the static boolean at any time)

ezyang and others added 2 commits August 9, 2022 06:20
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
@lw left a comment (Contributor)

Great! If you change CUDATrace to GPUTrace, and the CI confirms that everything now works, you can go ahead and merge this!

@lw left a comment (Contributor)

Based on what I just realized about how the tests are run (see the comments I left), I've changed my previous suggestion of just adding these tests to test_cuda.py: it's probably better to define them in a new file, so that they run as a separate process. Moreover, for reproducibility, it might be better to isolate each test method in its own subprocess.

There's a similar CUDA-related test suite that needs each of its test cases to run in a dedicated process: test_cuda_primary_ctx.py. You can look at that as an example and copy-paste it into a new test_cuda_trace.py file. There are a few things to pay attention to:

  • Ensure you exit early if TEST_CUDA is false
  • Ensure you call run_tests() as the main function
  • Add it to the list of tests that require subprocesses, here
  • Add it to the list of tests that cannot run in parallel, here
  • Potentially, add it to the list of slow tests (since starting a subprocess and loading PyTorch is very slow), here

(I found all these instances by searching for test_cuda_primary_ctx in the codebase.) A minimal skeleton is sketched below.
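
A minimal skeleton of that file could look like this (modeled on the test_cuda_primary_ctx.py pattern; the registration helper name is assumed for illustration):

import sys

import torch
import torch.utils._cuda_trace as cuda_trace
from torch.testing._internal.common_utils import TestCase, run_tests

TEST_CUDA = torch.cuda.is_available()

# Exit early if CUDA is unavailable, before any test gets collected.
if not TEST_CUDA:
    print("CUDA not available, skipping tests", file=sys.stderr)
    TestCase = object  # noqa: F811

# NOTE: this needs to be run in a brand new process, so that the hooks
# and CUDA state are in a known-clean state for each run.
class TestCudaTrace(TestCase):
    def test_event_creation_callback(self):
        seen = []
        # Registration helper name is assumed for illustration.
        cuda_trace.register_callback_for_cuda_event_creation(seen.append)
        torch.cuda.Event().record()
        self.assertEqual(len(seen), 1)

if __name__ == "__main__":
    run_tests()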

import torch.utils._cuda_trace as cuda_trace
from torch.testing._internal.common_utils import TestCase, run_tests

# NOTE: this needs to be run in a brand new process
Contributor

Nit: it could be nice to explain why

Contributor Author

Done!

@@ -30,6 +30,46 @@ using Stack = std::vector<c10::IValue>;
namespace c10 {
namespace impl {

struct C10_API PyInterpreter;
@malfet Aug 10, 2022 (Contributor)

Hmm, shouldn't it be renamed to something other than PyInterpreter? The name is a bit misleading (i.e. this is just an opaque TracerContext pointer, isn't it?), as it technically could be used from C++ or Java frontends to trace memory allocation, etc.

Contributor

It is a PyInterpreter though; we allocate one per Python interpreter (there can be multiple in torchdeploy).

@sypneiwski sypneiwski self-assigned this Aug 11, 2022
@sypneiwski
Contributor Author

@pytorchbot merge -g

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here

@github-actions
Contributor

Hey @sypneiwski.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@sypneiwski sypneiwski added the 'release notes: cuda' and 'topic: new features' labels Aug 11, 2022
facebook-github-bot pushed a commit that referenced this pull request Aug 11, 2022
Summary:
### Description
This adds Python hooks into PyTorch that allow the user to register their own callbacks for events such as tensor allocation, stream allocation, event record / wait etc.

Pull Request resolved: #82824
Approved by: https://github.com/lw, https://github.com/ezyang, https://github.com/malfet

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/916def84d49043aba5185f75b49a11a5a82248e6

Reviewed By: seemethere

Differential Revision: D38624093

Pulled By: sypneiwski

fbshipit-source-id: fbdeb427f9d23120f0916391fdbbe5085d52fcc5