Avoid reference invalidation in cuda SpectralOps' plan_caches #31861

Closed
peterbell10 wants to merge 1 commit.

Conversation

peterbell10 (Collaborator) commented:

Fixes #31412

The root cause is plan_caches being resized in one thread while another holds a reference to an existing CuFFTParamsLRUCache which then becomes invalidated.

I was able to reproduce the crash very reliably without this fix applied and no longer see it. Being a race condition, it's hard to say for sure though.
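
For readers unfamiliar with the failure mode, here is a minimal sketch of the race and of one way to keep the returned reference stable. The `CuFFTParamsLRUCache` below is a trivial stand-in for the real ATen class, and the accessor illustrates the idea rather than reproducing the actual diff:

```cpp
#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

// Stand-in for ATen's CuFFTParamsLRUCache; the real class holds cuFFT plans.
struct CuFFTParamsLRUCache {
  std::mutex mutex;
  // ... plan storage elided ...
};

// Problematic pattern: a vector holding the caches by value may relocate every
// element when it is resized for a new device index, so a reference handed out
// to one thread dangles once another thread triggers the resize:
//
//   static std::vector<CuFFTParamsLRUCache> plan_caches;
//
// One way to make the returned reference stable is to keep each cache behind
// its own heap allocation: a resize then only moves the unique_ptrs, never
// the cache objects themselves.
static std::mutex plan_caches_mutex;
static std::vector<std::unique_ptr<CuFFTParamsLRUCache>> plan_caches;

CuFFTParamsLRUCache& cufft_get_plan_cache(int64_t device_index) {
  std::lock_guard<std::mutex> guard(plan_caches_mutex);
  if (device_index >= static_cast<int64_t>(plan_caches.size())) {
    plan_caches.resize(device_index + 1);
  }
  if (!plan_caches[device_index]) {
    plan_caches[device_index] = std::make_unique<CuFFTParamsLRUCache>();
  }
  return *plan_caches[device_index];  // stays valid across later resizes
}
```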

@@ -261,9 +261,10 @@ static inline Tensor _run_cufft(
return output;
}

// The cuFFT plan cache, defined in CuFFTUtils.h
peterbell10 (Collaborator, Author) commented:

This comment wasn't correct: plan_caches is not declared in CuFFTUtils.h and is only used in this file, so I made it static.

@kostmo (Member) commented Jan 4, 2020:

💊 CircleCI build failures summary and remediations

As of commit 7769b5f:

  • 1/1 failures introduced in this PR

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

🕵️ 1 new failure recognized by patterns

The following build failures do not appear to be due to upstream breakage:

See CircleCI build pytorch_xla_linux_xenial_py3_6_clang7_build (1/1)

Step: "Build"

Jan 04 21:01:47 caused by: Connection refused (os error 111)
Jan 04 21:01:47 +++ eval 'extract_trap_cmd ' 
Jan 04 21:01:47 ++++ extract_trap_cmd 
Jan 04 21:01:47 ++++ printf '%s\n' '' 
Jan 04 21:01:47 +++ printf '%s\n' cleanup 
Jan 04 21:01:47 ++ trap -- ' 
Jan 04 21:01:47 cleanup' EXIT 
Jan 04 21:01:47 ++ which sccache 
Jan 04 21:01:47 ++ sccache --stop-server 
Jan 04 21:01:47 Stopping sccache server... 
Jan 04 21:01:47 error: couldn't connect to server 
Jan 04 21:01:47 caused by: Connection refused (os error 111) 
Jan 04 21:01:47 ++ true 
Jan 04 21:01:47 ++ rm /var/lib/jenkins/sccache_error.log 
Jan 04 21:01:47 rm: cannot remove '/var/lib/jenkins/sccache_error.log': No such file or directory 
Jan 04 21:01:47 ++ true 
Jan 04 21:01:47 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
Jan 04 21:01:47 ++ SCCACHE_IDLE_TIMEOUT=1200 
Jan 04 21:01:47 ++ RUST_LOG=sccache::server=error 
Jan 04 21:01:47 ++ sccache --start-server 
Jan 04 21:01:47 Starting sccache server... 
Jan 04 21:01:47 ++ sccache --zero-stats 

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions on the GitHub issue tracker.

@ssnl (Collaborator) left a comment:

Nice catch! Thanks for fixing.

@ezyang (Contributor) commented Jan 8, 2020:

> I was able to reproduce the crash very reliably without this fix applied and no longer see it

Could the crash reproducer be added as a slow test, maybe? Anyway, nice work!
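
A hypothetical sketch of what such a slow test could look like, based on the linked issue (torch.stft crashing under DataParallel-style multi-threaded, multi-GPU use); the test name, sizes, and thread layout are illustrative and not taken from this PR:

```python
# Hypothetical reproducer sketch (not part of this PR): run torch.stft from
# several threads on different CUDA devices at once, so one thread can grow
# the per-device plan-cache vector while another still holds a reference.
import threading

import torch


def _stft_worker(device, iters=200):
    x = torch.randn(4, 16000, device=device)
    for _ in range(iters):
        torch.stft(x, n_fft=512, return_complex=True)
    torch.cuda.synchronize(device)


def test_stft_plan_cache_race():
    # Needs at least two GPUs so different threads touch different
    # plan-cache slots and force the vector to grow.
    if torch.cuda.device_count() < 2:
        return
    threads = [
        threading.Thread(target=_stft_worker, args=(f"cuda:{i}",))
        for i in range(torch.cuda.device_count())
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    test_stft_plan_cache_race()
```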

@facebook-github-bot (Contributor) left a comment:

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) commented:

@ezyang merged this pull request in 54777b1.

wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020
…h#31861)

Summary:
Fixes pytorch#31412

The root cause is `plan_caches` being resized in one thread while another holds a reference to an existing `CuFFTParamsLRUCache` which then becomes invalidated.

I was able to reproduce the crash very reliably without this fix applied and no longer see it. Being a race condition, it's hard to say for sure though.
Pull Request resolved: pytorch#31861

Differential Revision: D19312314

Pulled By: ezyang

fbshipit-source-id: 06e4561128d503f2d70cdfe1982be0f3db2a8cf8

Successfully merging this pull request may close these issues.

torch.stft causes segmentation fault with data parallel.
6 participants