CUDA trace Python hooks #82824
Conversation
Dr. CI: ✅ No failures (0 pending) as of commit a4b8809.
Thanks for sending it out! At first glance it looks like everything is there already! Hopefully we won't iterate too much on it and can land it quickly!
One overall comment: in this PR the CUDATrace is a global static, not a thread-local, so perhaps we should remove the "TLS" from the names of those files?
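For reference, a hedged sketch of the distinction being pointed out (the CUDATrace stub here is illustrative, not the PR's real definition):

```cpp
// Illustrative stub; the real CUDATrace in c10 holds the registered hooks.
struct CUDATrace { /* hook table */ };

// What this PR does: one process-wide instance, shared by all threads.
static CUDATrace cuda_trace;

// What a name containing "TLS" would suggest instead: one instance per thread.
// thread_local CUDATrace cuda_trace;
```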
Force-pushed from 6336293 to 493257b
c10/core/impl/PyInterpreter.cpp (outdated)

```cpp
@@ -5,6 +5,19 @@
namespace c10 {
namespace impl {

template <typename... Ts>
static void noop_trace_cuda_fn(const PyInterpreter*, Ts...) {}
```
Can't we technically apply this pattern everywhere?
Actually, I've been a bit irritated by the need to manually define function pointers and disarm them one by one, and I'm wondering if there's a way we can use plain old virtual methods to do this.
```cpp
// WARNING: This class has to be written very carefully, because it may be
// possible for a Tensor to have a reference to an interpreter corresponding to
// a shared library that has ALREADY BEEN UNLOADED. This makes blindly calling
// virtual methods very dangerous, because the vtable may be garbage at that
// point (on a good day, you might get "pure virtual method called").
//
// The idea to solve this problem is we always leak PyInterpreters (so they
// always stay live even after dlclose), and disarm the "virtual methods" by
// replacing them with function pointers that just no-op. This can't be done
// with a traditional C++ vtable, so we have to roll our own.
```
I wonder if we can solve the problem that was specified here by doing a second layer of indirection. So the outer PyInterpreter is non-virtual but contains a pointer to an object with the virtual methods, and then unloading simply involves swapping out that pointer for a different one that is guaranteed to be in a stable shared library. The cost is an extra indirection to get to the actual function pointer; maybe that's a small price to pay for dramatically decreasing the boilerplate here (when this class was originally written, it was like... two function pointers. Now there's so many lol.)
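A possible shape for that idea, as a sketch only (names like PyInterpreterVTable and NoopVTable are hypothetical, not the actual c10 code):

```cpp
#include <atomic>
#include <cstdint>

// Inner object carrying the real virtual methods; the outer PyInterpreter
// stays non-virtual and only holds a pointer to one of these.
struct PyInterpreterVTable {
  virtual ~PyInterpreterVTable() = default;
  virtual void trace_gpu_event_record(std::uintptr_t event, std::uintptr_t stream) = 0;
  // ...one virtual method per hook, instead of one raw function pointer each
};

// Defined in a library that is never unloaded (e.g. libc10), so this vtable
// stays valid even after the Python extension module is dlclose'd.
struct NoopVTable final : PyInterpreterVTable {
  void trace_gpu_event_record(std::uintptr_t, std::uintptr_t) override {}
};

struct PyInterpreter {
  explicit PyInterpreter(PyInterpreterVTable* vt) : vtable_(vt) {}

  // Disarm on unload: one pointer swap instead of nulling out every
  // function pointer by hand.
  void disarm() noexcept {
    static NoopVTable noop;
    vtable_.store(&noop, std::memory_order_release);
  }

  void trace_gpu_event_record(std::uintptr_t event, std::uintptr_t stream) const {
    // The extra indirection mentioned above: load the inner pointer, then
    // make an ordinary virtual call through it.
    vtable_.load(std::memory_order_acquire)->trace_gpu_event_record(event, stream);
  }

 private:
  std::atomic<PyInterpreterVTable*> vtable_;
};
```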
Could we start by using this template to "auto-generate" the noop variants of these hooks? We'd still have to assign each of them manually when disarming, but at least we'd remove tons of static function definitions.
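A hedged sketch of what that could look like (the hook names and the TraceHooks struct are illustrative, not the exact PR code):

```cpp
#include <cstdint>

struct PyInterpreter;

// A single variadic template covers every hook signature, so the no-op
// variants no longer need to be written out one by one.
template <typename... Ts>
static void noop_fn(const PyInterpreter*, Ts...) {}

// Illustrative subset of the hook table; the real struct has more entries.
struct TraceHooks {
  void (*trace_gpu_event_record)(const PyInterpreter*, std::uintptr_t, std::uintptr_t);
  void (*trace_gpu_stream_creation)(const PyInterpreter*, std::uintptr_t);
};

// Disarming still assigns each pointer manually, but every assignment now
// reuses an instantiation of the same template.
static void disarm(TraceHooks& hooks) {
  hooks.trace_gpu_event_record = &noop_fn<std::uintptr_t, std::uintptr_t>;
  hooks.trace_gpu_stream_creation = &noop_fn<std::uintptr_t>;
}
```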
The main open issue is the speed of the fast path. The current solution has all these tricks enabled to make it fast:
The perf seems acceptable to me.
My 2c: 0.087% is acceptable, and I don't see sync issues with a static boolean (you'd still have your proper atomic once the static boolean is flipped, and everyone should see the same value of the static boolean at any time).
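For illustration, a minimal sketch of the pattern under discussion: a plain, once-flipped flag in front of properly synchronized state (all names here are made up, not the PR's actual code):

```cpp
#include <atomic>

// The flag is written exactly once, from false to true, when tracing is
// first enabled, and is never cleared. A reader racing with the flip may
// miss one early trace, but once it observes `true` it goes through the
// properly synchronized atomic below — the trade-off the comments above
// consider acceptable.
static bool trace_enabled = false;
static std::atomic<const void*> trace_state{nullptr};

void enable_tracing(const void* state) {
  trace_state.store(state, std::memory_order_release);
  trace_enabled = true;  // the one-time flip discussed above
}

inline const void* maybe_get_trace_state() {
  // Hot path: a single well-predicted branch, no atomic load or fence.
  if (!trace_enabled) {
    return nullptr;
  }
  return trace_state.load(std::memory_order_acquire);
}
```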
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Great! If you change CUDATrace to GPUTrace, and the CI confirms that everything now works, you can go ahead and merge this!
Based on what I just realized about how the tests are run (see the comments I left), I've changed my previous suggestion to just add these tests to test_cuda.py: it's probably better to define them in a new file, so that they run as a separate process. Even better, for reproducibility, it might be best to isolate each test method in its own subprocess.
There's a similar CUDA-related test suite that needs each of its test cases to run in a dedicated process: test_cuda_primary_ctx.py. You can use it as an example and copy it to a new test_cuda_trace.py file. There are a few things to pay attention to:
- Ensure you exit early if TEST_CUDA is false
- Ensure you call run_tests() as the main function
- Add it to the list of tests that require subprocesses, here
- Add it to the list of tests that cannot run in parallel, here
- Potentially, add it to the list of slow tests (since starting a subprocess and loading PyTorch is very slow), here
(I found all these instances by searching for where test_cuda_primary_ctx occurs in the codebase.)
test/test_cuda_trace.py (outdated)

```python
import torch.utils._cuda_trace as cuda_trace
from torch.testing._internal.common_utils import TestCase, run_tests

# NOTE: this needs to be run in a brand new process
```
Nit: could be nice to explain why
Done!
```cpp
@@ -30,6 +30,46 @@ using Stack = std::vector<c10::IValue>;
namespace c10 {
namespace impl {

struct C10_API PyInterpreter;
```
Hmm, shouldn't it be renamed to something other than PyInterpreter, as the name is a bit misleading (i.e., this is just an opaque TracerContext pointer, isn't it?), since it can technically be used from C++ or Java frontends to trace memory allocation, etc.
It is a PyInterpreter though; we allocate one per Python interpreter (there can be multiple in torchdeploy).
@pytorchbot merge -g
@pytorchbot successfully started a merge job. Check the current status here.
Hey @sypneiwski.
Summary:

### Description

This adds Python hooks into PyTorch that allow the user to register their own callbacks for events such as tensor allocation, stream allocation, event record / wait, etc.

Pull Request resolved: #82824
Approved by: https://github.com/lw, https://github.com/ezyang, https://github.com/malfet
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/916def84d49043aba5185f75b49a11a5a82248e6
Reviewed By: seemethere
Differential Revision: D38624093
Pulled By: sypneiwski
fbshipit-source-id: fbdeb427f9d23120f0916391fdbbe5085d52fcc5