Avoid reference invalidation in cuda SpectralOps' plan_caches #31861
@@ -261,9 +261,10 @@ static inline Tensor _run_cufft(
  return output;
}

// The cuFFT plan cache, defined in CuFFTUtils.h
This comment wasn't correct. `plan_caches` is not declared in `CuFFTUtils.h` and is only used in this file, so I made it `static`.
💊 CircleCI build failures summary and remediations
As of commit 7769b5f:
Detailed failure analysis: One may explore the probable reasons each build failed interactively on the Dr. CI website.
🕵️ 1 new failure recognized by patterns
The following build failures do not appear to be due to upstream breakage:
pytorch_xla_linux_xenial_py3_6_clang7_build (1/1)
Step: "Build"
Nice catch! Thanks for fixing.
Could the crash reproducer be added as a slow test, maybe? Anyway, nice work!
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
…h#31861) Summary: Fixes pytorch#31412 The root cause is `plan_caches` being resized in one thread while another holds a reference to an existing `CuFFTParamsLRUCache` which then becomes invalidated. I was able to reproduce the crash very reliably without this fix applied and no longer see it. Being a race condition, it's hard to say for sure though. Pull Request resolved: pytorch#31861 Differential Revision: D19312314 Pulled By: ezyang fbshipit-source-id: 06e4561128d503f2d70cdfe1982be0f3db2a8cf8
Fixes #31412
The root cause is `plan_caches` being resized in one thread while another holds a reference to an existing `CuFFTParamsLRUCache`, which then becomes invalidated.
I was able to reproduce the crash very reliably without this fix applied and no longer see it. Being a race condition, it's hard to say for sure though.
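For readers unfamiliar with this class of bug, here is a minimal sketch of one common way to avoid it. It is not the code from this PR: `PlanCache`, `caches`, `caches_mutex`, and `get_cache` are hypothetical names, and `PlanCache` is only a stand-in for `CuFFTParamsLRUCache`. The idea is that a vector holding the cache objects by value moves them when it reallocates, so a reference obtained earlier dangles. Storing owning pointers instead keeps each cache at a fixed heap address, and a mutex serializes the resize itself.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a per-device plan cache.
struct PlanCache {
  std::size_t max_size = 1023;  // arbitrary illustrative default
};

// The vector owns pointers, not the caches themselves. Resizing it moves
// the pointers, but each PlanCache stays where it was allocated.
static std::vector<std::unique_ptr<PlanCache>> caches;
static std::mutex caches_mutex;

// Returns a reference that remains valid after later calls grow the vector.
PlanCache& get_cache(int device_index) {
  assert(device_index >= 0);
  // Guard the lookup and the possible resize against concurrent callers.
  std::lock_guard<std::mutex> guard(caches_mutex);
  const auto idx = static_cast<std::size_t>(device_index);
  if (idx >= caches.size()) {
    caches.resize(idx + 1);  // new slots start out as null pointers
  }
  if (!caches[idx]) {
    caches[idx] = std::make_unique<PlanCache>();  // created lazily
  }
  return *caches[idx];
}
```

In this sketch the mutex only protects the vector. A thread that keeps using the returned `PlanCache&` after `get_cache` returns still needs whatever synchronization the cache itself requires.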