Deadlock when primary caching is on #4947
With |
Reproduced together by duplicating a view a bunch of times |
This was referenced Jan 30, 2024
teh-cmc added a commit that referenced this issue Jan 31, 2024
This adds support for reentrancy in the latest-at and range caches in order to support a very nasty edge-case where two rayon tasks (i.e. space views) that query the same exact data (e.g. because they are clones of each other) end up running concurrently _on the same thread_.

This can happen because we execute space views through multiple nested layers of parallel iterators. Because rayon's scheduler is a work-stealing one, a thread can effectively jump to anywhere in the code at any point, and might do so while holding a lock it shouldn't. This becomes a problem now that querying data involves mutations and locks, due to the presence of a cache on the path.

There is a lot of complexity we could add on top of what this PR already does in order to make this edge-case more efficient, but there is no reason to go there unless there is any indication that this is not good enough in practice (i.e. as long as you don't even notice it's going on). If it turns out that we can see glitches in practice, we'll go there.

Taking a step back, it's important to realize that this is just another side-effect of our current "immediate mode querying model", where each space view computes its dataset on its own at the very last second while it is rendering, therefore mixing up computing the data and using the data, running identical queries multiple times, etc. We already know that we want to --and have to-- move away from this model in order to make our upcoming features possible (on-disk data, component conversions, data overrides, external store hub, etc); so I'd rather not sink any more complexity than the bare minimum required in this thing -- it has to go away anyhow.

- Fixes #4947

https://github.com/rerun-io/rerun/assets/2910679/b8d8db83-cafb-46ff-ad38-3e41c9d43cd9
https://github.com/rerun-io/rerun/assets/2910679/c3be6250-31ba-4398-9109-356c9b9a8b4e
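To make the failure mode in the commit message concrete, here is a minimal sketch using only the Rust standard library. It is not the code from the PR: `QueryCache`, `compute`, the `while_locked` hook, and `demo_reentrant_query` are hypothetical names invented for illustration, and the fallback strategy (bypass the cache when the lock cannot be taken) is an assumption about one possible way to tolerate reentrancy. The `while_locked` closure stands in for a work-stealing scheduler running another task on the same thread while the lock is still held.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, TryLockError};

/// Hypothetical stand-in for a query cache; not rerun's actual cache type.
pub struct QueryCache {
    entries: Mutex<HashMap<u64, u64>>,
}

impl QueryCache {
    pub fn new() -> Self {
        Self {
            entries: Mutex::new(HashMap::new()),
        }
    }

    /// Stand-in for an expensive store query.
    fn compute(key: u64) -> u64 {
        key * 2
    }

    /// Returns `(value, served_through_cache)`.
    ///
    /// `while_locked` simulates the scheduler running another task on this
    /// thread while the cache lock is held (an assumption of this sketch).
    pub fn query(&self, key: u64, while_locked: impl FnOnce()) -> (u64, bool) {
        // A blocking `lock()` here would never return if this thread already
        // holds the lock. `try_lock()` reports the contention instead.
        match self.entries.try_lock() {
            Ok(mut entries) => {
                while_locked();
                let value = *entries.entry(key).or_insert_with(|| Self::compute(key));
                (value, true)
            }
            // Lock is busy (possibly held by this very thread): skip the
            // cache and answer the query directly. Slower, but it cannot hang.
            Err(TryLockError::WouldBlock) => (Self::compute(key), false),
            // Treat a poisoned lock the same way to keep the sketch short.
            Err(TryLockError::Poisoned(_)) => (Self::compute(key), false),
        }
    }
}

/// Runs a nested query for the same key while the outer query holds the lock.
pub fn demo_reentrant_query() -> ((u64, bool), (u64, bool)) {
    let cache = QueryCache::new();
    let mut inner = None;
    let outer = cache.query(21, || {
        inner = Some(cache.query(21, || {}));
    });
    (outer, inner.expect("nested query should have run"))
}
```

In this sketch the outer query goes through the cache while the nested one falls back to an uncached computation, so both return the same value and neither blocks. Note that the fallback also triggers when a different thread holds the lock, which trades some cache efficiency for simplicity.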
I can consistently reproduce a deadlock on main (as of 4371783) that happens only with primary caching on.
Repro step:
reset the blueprint:
left space view over middle one:
Deadlock occurs then.
Alternative repro:
=> deadlock immediately occurs
Stack trace:
Repro RRD (800+MB, on private slack):
https://rerunio.slack.com/archives/C041NHU952S/p1706543181xxxx99?thread_ts=1706204242.101919&cid=C041NHU952S