Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock with primary caching is on #4947

Closed
abey79 opened this issue Jan 29, 2024 · 4 comments · Fixed by #4980
Closed

Deadlock with primary caching is on #4947

abey79 opened this issue Jan 29, 2024 · 4 comments · Fixed by #4980
Assignees
Labels
🪳 bug Something isn't working 💣 crash crash, deadlock/freeze, do-no-start
Milestone

Comments

@abey79
Copy link
Contributor

abey79 commented Jan 29, 2024

I can consistently reproduce a dead-lock on main (as of 4371783) that happens only with primary caching on.

Repro step:

  • reset the blueprint:
    image

  • left space view over middle one:
    image

Deadlock occurs then.

Alternative repro:

  • start with primary caching off
  • enable primary caching

=> deadlock immediately occurs

Stack trace:

image

Repro RRD (800+MB, on private slack):
https://rerunio.slack.com/archives/C041NHU952S/p1706543181xxxx99?thread_ts=1706204242.101919&cid=C041NHU952S

@abey79 abey79 added the 🪳 bug Something isn't working label Jan 29, 2024
@abey79 abey79 added this to the 0.13 milestone Jan 29, 2024
@teh-cmc teh-cmc self-assigned this Jan 29, 2024
@teh-cmc teh-cmc added the 💣 crash crash, deadlock/freeze, do-no-start label Jan 29, 2024
@abey79
Copy link
Contributor Author

abey79 commented Jan 30, 2024

Another suspicious thread:
image

@abey79
Copy link
Contributor Author

abey79 commented Jan 30, 2024

And that error seems to popup shortly after the deadlock:
image

@abey79
Copy link
Contributor Author

abey79 commented Jan 30, 2024

With RAYON_NUM_THREADS=1 or RAYON_NUM_THREADS=2, the deadlock doesn't happen. With RAYON_NUM_THREADS=3 it happens.

@teh-cmc
Copy link
Member

teh-cmc commented Jan 30, 2024

Reproduced together by duplicating a view a bunch of times

teh-cmc added a commit that referenced this issue Jan 31, 2024
This adds support for reentrancy in the latest-at and range caches in
order to support a very nasty edge-case where two rayon tasks (i.e.
space views) that query the same exact data (e.g. because they are
clones of each other) end up running concurrently _on the same thread_.
This can happen because we execute space views through multiple nested
layers of parallel iterators, and because rayon's scheduler is a
work-stealing one, this effectively means one thread can jump to
anywhere in the code at any point, and might do so while holding a lock
it shouldn't.
This becomes a problem now that querying data involves mutations and
locks, due to the presence of a cache on the path.

There is a lot of complexity we could add on top of what this PR already
does in order to make this edge-case more efficient, but there is no
reason to go there unless there is any indication that this is good not
enough in practice (i.e. you don't even notice it's going on).
If it turns out that we can see glitches in practice, we'll go there.

Taking a step back, it's important to realize that this is just another
side-effect of our current "immediate mode querying model", where each
space view computes its dataset on its own at the very last second while
it is rendering, therefore mixing up computing the data and using the
data, running identical queries multiple times, etc.
We already know that we want to --and have to-- move away from this
model in order to make our upcoming features possible (on-disk data,
component conversions, data overrides, external store hub, etc); so I'd
rather not sink any more complexity than the bare minimum required in
this thing -- it has to go away anyhow.


- Fixes #4947 



https://github.com/rerun-io/rerun/assets/2910679/b8d8db83-cafb-46ff-ad38-3e41c9d43cd9



https://github.com/rerun-io/rerun/assets/2910679/c3be6250-31ba-4398-9109-356c9b9a8b4e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🪳 bug Something isn't working 💣 crash crash, deadlock/freeze, do-no-start
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants