CUDA trace Python hooks #82824
Conversation
Dr. CI: ✅ No failures (0 pending) as of commit a4b8809.
Thanks for sending it out! At first glance it looks like everything is there already! Hopefully we won't iterate too much on it and can land it quickly!
One overall comment: in this PR the CUDATrace is a global static, not a thread-local, so perhaps we should remove the "TLS" from the names of those files?
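For reference, a hedged sketch of the distinction being pointed out (the CUDATrace stub here is illustrative, not the PR's real definition):

```cpp
// Illustrative stub; the real CUDATrace in c10 holds the registered hooks.
struct CUDATrace { /* hook table */ };

// What this PR does: one process-wide instance, shared by all threads.
static CUDATrace cuda_trace;

// What a name containing "TLS" would suggest instead: one instance per thread.
// thread_local CUDATrace cuda_trace;
```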
Force-pushed from 6336293 to 493257b
c10/core/impl/PyInterpreter.cpp (outdated)

```cpp
@@ -5,6 +5,19 @@
namespace c10 {
namespace impl {

template <typename... Ts>
static void noop_trace_cuda_fn(const PyInterpreter*, Ts...) {}
```
Can't we technically apply this pattern everywhere?
Actually, I've been a bit irritated by the need to manually define function pointers and disarm them one by one, and I'm wondering if there's a way we can use plain old virtual methods to do this.
```cpp
// WARNING: This class has to be written very carefully, because it may be
// possible for a Tensor to have a reference to an interpreter corresponding to
// a shared library that has ALREADY BEEN UNLOADED. This makes blindly calling
// virtual methods very dangerous, because the vtable may be garbage at that
// point (on a good day, you might get "pure virtual method called").
//
// The idea to solve this problem is we always leak PyInterpreters (so they
// always stay live even after dlclose), and disarm the "virtual methods" by
// replacing them with function pointers that just no-op. This can't be done
// with a traditional C++ vtable, so we have to roll our own.
```
I wonder if we can solve the problem that was specified here by doing a second layer of indirection. So the outer PyInterpreter is non-virtual but contains a pointer to an object with the virtual methods, and then unloading simply involves swapping out that pointer for a different one that is guaranteed to be in a stable shared library. The cost is an extra indirection to get to the actual function pointer; maybe that's a small price to pay for dramatically decreasing the boilerplate here (when this class was originally written, it was like... two function pointers. Now there's so many lol.)
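A possible shape for that idea, as a sketch only (names like PyInterpreterVTable and NoopVTable are hypothetical, not the actual c10 code):

```cpp
#include <atomic>
#include <cstdint>

// Inner object carrying the real virtual methods; the outer PyInterpreter
// stays non-virtual and only holds a pointer to one of these.
struct PyInterpreterVTable {
  virtual ~PyInterpreterVTable() = default;
  virtual void trace_gpu_event_record(std::uintptr_t event, std::uintptr_t stream) = 0;
  // ...one virtual method per hook, instead of one raw function pointer each
};

// Defined in a library that is never unloaded (e.g. libc10), so this vtable
// stays valid even after the Python extension module is dlclose'd.
struct NoopVTable final : PyInterpreterVTable {
  void trace_gpu_event_record(std::uintptr_t, std::uintptr_t) override {}
};

struct PyInterpreter {
  explicit PyInterpreter(PyInterpreterVTable* vt) : vtable_(vt) {}

  // Disarm on unload: one pointer swap instead of nulling out every
  // function pointer by hand.
  void disarm() noexcept {
    static NoopVTable noop;
    vtable_.store(&noop, std::memory_order_release);
  }

  void trace_gpu_event_record(std::uintptr_t event, std::uintptr_t stream) const {
    // The extra indirection mentioned above: load the inner pointer, then
    // make an ordinary virtual call through it.
    vtable_.load(std::memory_order_acquire)->trace_gpu_event_record(event, stream);
  }

 private:
  std::atomic<PyInterpreterVTable*> vtable_;
};
```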
Could we start by using this template to "auto-generate" the noop variants of these hooks? We'd still have to assign each of them manually when disarming, but at least we'd remove tons of static function definitions.
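A hedged sketch of what that could look like (the hook names and the TraceHooks struct are illustrative, not the exact PR code):

```cpp
#include <cstdint>

struct PyInterpreter;

// A single variadic template covers every hook signature, so the no-op
// variants no longer need to be written out one by one.
template <typename... Ts>
static void noop_fn(const PyInterpreter*, Ts...) {}

// Illustrative subset of the hook table; the real struct has more entries.
struct TraceHooks {
  void (*trace_gpu_event_record)(const PyInterpreter*, std::uintptr_t, std::uintptr_t);
  void (*trace_gpu_stream_creation)(const PyInterpreter*, std::uintptr_t);
};

// Disarming still assigns each pointer manually, but every assignment now
// reuses an instantiation of the same template.
static void disarm(TraceHooks& hooks) {
  hooks.trace_gpu_event_record = &noop_fn<std::uintptr_t, std::uintptr_t>;
  hooks.trace_gpu_stream_creation = &noop_fn<std::uintptr_t>;
}
```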
The main open issue is the speed of the fast path. The current solution has all these tricks enabled to make it fast:
The perf seems acceptable to me.
My 2c: 0.087% is acceptable, and I don't see sync issues with a static boolean (you'd still have your proper atomic once the static boolean is flipped, and everyone should see the same value of the static boolean at any time).
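For illustration, a minimal sketch of the pattern under discussion: a plain, once-flipped flag in front of properly synchronized state (all names here are made up, not the PR's actual code):

```cpp
#include <atomic>

// The flag is written exactly once, from false to true, when tracing is
// first enabled, and is never cleared. A reader racing with the flip may
// miss one early trace, but once it observes `true` it goes through the
// properly synchronized atomic below — the trade-off the comments above
// consider acceptable.
static bool trace_enabled = false;
static std::atomic<const void*> trace_state{nullptr};

void enable_tracing(const void* state) {
  trace_state.store(state, std::memory_order_release);
  trace_enabled = true;  // the one-time flip discussed above
}

inline const void* maybe_get_trace_state() {
  // Hot path: a single well-predicted branch, no atomic load or fence.
  if (!trace_enabled) {
    return nullptr;
  }
  return trace_state.load(std::memory_order_acquire);
}
```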
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Great! If you change CUDATrace to GPUTrace, and the CI confirms that everything now works, you can go ahead and merge this!
Based on what I just realized about how the tests are run (see the comments I left), I've changed my previous suggestion to just add these tests to test_cuda.py: it's probably better to define them in a new file, so that they run as a separate process. Even better, for reproducibility, it might be best to isolate each test method in its own subprocess.
There's a similar CUDA-related test suite that needs each of its test cases to run in a dedicated process: test_cuda_primary_ctx.py. You can use it as an example and copy it to a new test_cuda_trace.py file. There are a few things to pay attention to:
- Ensure you exit early if TEST_CUDA is false
- Ensure you call run_tests() as the main function
- Add it to the list of tests that require subprocesses, here
- Add it to the list of tests that cannot run in parallel, here
- Potentially, add it to the list of slow tests (since starting a subprocess and loading PyTorch is very slow), here
(I found all these instances by searching for where test_cuda_primary_ctx occurs in the codebase.)
test/test_cuda_trace.py (outdated)

```python
import torch.utils._cuda_trace as cuda_trace
from torch.testing._internal.common_utils import TestCase, run_tests

# NOTE: this needs to be run in a brand new process
```
Nit: could be nice to explain why
Done!
```cpp
@@ -30,6 +30,46 @@ using Stack = std::vector<c10::IValue>;
namespace c10 {
namespace impl {

struct C10_API PyInterpreter;
```
Hmm, shouldn't it be renamed to something other than PyInterpreter, as the name is a bit misleading (i.e., this is just an opaque TracerContext pointer, isn't it?), since it can technically be used from C++ or Java frontends to trace memory allocation, etc.
It is a PyInterpreter though; we allocate one per Python interpreter (there can be multiple in torchdeploy).
@pytorchbot merge -g
@pytorchbot successfully started a merge job. Check the current status here.
Hey @sypneiwski.
Summary:

### Description

This adds Python hooks into PyTorch that allow the user to register their own callbacks for events such as tensor allocation, stream allocation, event record / wait, etc.

Pull Request resolved: #82824
Approved by: https://github.com/lw, https://github.com/ezyang, https://github.com/malfet
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/916def84d49043aba5185f75b49a11a5a82248e6
Reviewed By: seemethere
Differential Revision: D38624093
Pulled By: sypneiwski
fbshipit-source-id: fbdeb427f9d23120f0916391fdbbe5085d52fcc5