Avoid reference invalidation in cuda SpectralOps' plan_caches #31861

Closed
peterbell10 wants to merge 1 commit.

Conversation

peterbell10 (Collaborator) commented:

Fixes #31412

The root cause is plan_caches being resized in one thread while another holds a reference to an existing CuFFTParamsLRUCache which then becomes invalidated.

I was able to reproduce the crash very reliably without this fix applied and no longer see it. Being a race condition, it's hard to say for sure though.
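
For readers unfamiliar with the failure mode, here is a minimal sketch of the race and of one way to keep the returned reference stable. The `CuFFTParamsLRUCache` below is a trivial stand-in for the real ATen class, and the accessor illustrates the idea rather than reproducing the actual diff:

```cpp
#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

// Stand-in for ATen's CuFFTParamsLRUCache; the real class holds cuFFT plans.
struct CuFFTParamsLRUCache {
  std::mutex mutex;
  // ... plan storage elided ...
};

// Problematic pattern: a vector holding the caches by value may relocate every
// element when it is resized for a new device index, so a reference handed out
// to one thread dangles once another thread triggers the resize:
//
//   static std::vector<CuFFTParamsLRUCache> plan_caches;
//
// One way to make the returned reference stable is to keep each cache behind
// its own heap allocation: a resize then only moves the unique_ptrs, never
// the cache objects themselves.
static std::mutex plan_caches_mutex;
static std::vector<std::unique_ptr<CuFFTParamsLRUCache>> plan_caches;

CuFFTParamsLRUCache& cufft_get_plan_cache(int64_t device_index) {
  std::lock_guard<std::mutex> guard(plan_caches_mutex);
  if (device_index >= static_cast<int64_t>(plan_caches.size())) {
    plan_caches.resize(device_index + 1);
  }
  if (!plan_caches[device_index]) {
    plan_caches[device_index] = std::make_unique<CuFFTParamsLRUCache>();
  }
  return *plan_caches[device_index];  // stays valid across later resizes
}
```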

@@ -261,9 +261,10 @@ static inline Tensor _run_cufft(
return output;
}

// The cuFFT plan cache, defined in CuFFTUtils.h
peterbell10 (Collaborator, Author) commented:

This comment wasn't correct: plan_caches is not declared in CuFFTUtils.h and is only used in this file, so I made it static.

@kostmo (Member) commented Jan 4, 2020:

💊 CircleCI build failures summary and remediations

As of commit 7769b5f:

  • 1/1 failures introduced in this PR

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

🕵️ 1 new failure recognized by patterns

The following build failures do not appear to be due to upstream breakage:

See CircleCI build pytorch_xla_linux_xenial_py3_6_clang7_build (1/1)

Step: "Build"

Jan 04 21:01:47 caused by: Connection refused (os error 111)
Jan 04 21:01:47 +++ eval 'extract_trap_cmd ' 
Jan 04 21:01:47 ++++ extract_trap_cmd 
Jan 04 21:01:47 ++++ printf '%s\n' '' 
Jan 04 21:01:47 +++ printf '%s\n' cleanup 
Jan 04 21:01:47 ++ trap -- ' 
Jan 04 21:01:47 cleanup' EXIT 
Jan 04 21:01:47 ++ which sccache 
Jan 04 21:01:47 ++ sccache --stop-server 
Jan 04 21:01:47 Stopping sccache server... 
Jan 04 21:01:47 error: couldn't connect to server 
Jan 04 21:01:47 caused by: Connection refused (os error 111) 
Jan 04 21:01:47 ++ true 
Jan 04 21:01:47 ++ rm /var/lib/jenkins/sccache_error.log 
Jan 04 21:01:47 rm: cannot remove '/var/lib/jenkins/sccache_error.log': No such file or directory 
Jan 04 21:01:47 ++ true 
Jan 04 21:01:47 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
Jan 04 21:01:47 ++ SCCACHE_IDLE_TIMEOUT=1200 
Jan 04 21:01:47 ++ RUST_LOG=sccache::server=error 
Jan 04 21:01:47 ++ sccache --start-server 
Jan 04 21:01:47 Starting sccache server... 
Jan 04 21:01:47 ++ sccache --zero-stats 

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions on the GitHub issue tracker.

@ssnl (Collaborator) left a comment:

Nice catch! Thanks for fixing.

@ezyang (Contributor) commented Jan 8, 2020:

> I was able to reproduce the crash very reliably without this fix applied and no longer see it

Could the crash reproducer be added as a slow test, maybe? Anyway, nice work!
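
A hypothetical sketch of what such a slow test could look like, based on the linked issue (torch.stft crashing under DataParallel-style multi-threaded, multi-GPU use); the test name, sizes, and thread layout are illustrative and not taken from this PR:

```python
# Hypothetical reproducer sketch (not part of this PR): run torch.stft from
# several threads on different CUDA devices at once, so one thread can grow
# the per-device plan-cache vector while another still holds a reference.
import threading

import torch


def _stft_worker(device, iters=200):
    x = torch.randn(4, 16000, device=device)
    for _ in range(iters):
        torch.stft(x, n_fft=512, return_complex=True)
    torch.cuda.synchronize(device)


def test_stft_plan_cache_race():
    # Needs at least two GPUs so different threads touch different
    # plan-cache slots and force the vector to grow.
    if torch.cuda.device_count() < 2:
        return
    threads = [
        threading.Thread(target=_stft_worker, args=(f"cuda:{i}",))
        for i in range(torch.cuda.device_count())
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    test_stft_plan_cache_race()
```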

@facebook-github-bot (Contributor) left a comment:

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) commented:

@ezyang merged this pull request in 54777b1.

wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020
…h#31861)

Summary:
Fixes pytorch#31412

The root cause is `plan_caches` being resized in one thread while another holds a reference to an existing `CuFFTParamsLRUCache` which then becomes invalidated.

I was able to reproduce the crash very reliably without this fix applied and no longer see it. Being a race condition, it's hard to say for sure though.
Pull Request resolved: pytorch#31861

Differential Revision: D19312314

Pulled By: ezyang

fbshipit-source-id: 06e4561128d503f2d70cdfe1982be0f3db2a8cf8

Successfully merging this pull request may close these issues.

torch.stft causes segmentation fault with data parallel.
6 participants