Deadlock with primary caching is on #4947

abey79 · 2024-01-29T15:51:31Z

I can consistently reproduce a dead-lock on main (as of 4371783) that happens only with primary caching on.

Repro step:

reset the blueprint:
left space view over middle one:

Deadlock occurs then.

Alternative repro:

start with primary caching off
enable primary caching

=> deadlock immediately occurs

Stack trace:

Repro RRD (800+MB, on private slack):
https://rerunio.slack.com/archives/C041NHU952S/p1706543181xxxx99?thread_ts=1706204242.101919&cid=C041NHU952S

The text was updated successfully, but these errors were encountered:

abey79 · 2024-01-30T07:41:03Z

Another suspicious thread:

abey79 · 2024-01-30T07:41:28Z

And that error seems to popup shortly after the deadlock:

abey79 · 2024-01-30T07:50:21Z

With RAYON_NUM_THREADS=1 or RAYON_NUM_THREADS=2, the deadlock doesn't happen. With RAYON_NUM_THREADS=3 it happens.

teh-cmc · 2024-01-30T10:49:08Z

Reproduced together by duplicating a view a bunch of times

This adds support for reentrancy in the latest-at and range caches in order to support a very nasty edge-case where two rayon tasks (i.e. space views) that query the same exact data (e.g. because they are clones of each other) end up running concurrently _on the same thread_. This can happen because we execute space views through multiple nested layers of parallel iterators, and because rayon's scheduler is a work-stealing one, this effectively means one thread can jump to anywhere in the code at any point, and might do so while holding a lock it shouldn't. This becomes a problem now that querying data involves mutations and locks, due to the presence of a cache on the path. There is a lot of complexity we could add on top of what this PR already does in order to make this edge-case more efficient, but there is no reason to go there unless there is any indication that this is good not enough in practice (i.e. you don't even notice it's going on). If it turns out that we can see glitches in practice, we'll go there. Taking a step back, it's important to realize that this is just another side-effect of our current "immediate mode querying model", where each space view computes its dataset on its own at the very last second while it is rendering, therefore mixing up computing the data and using the data, running identical queries multiple times, etc. We already know that we want to --and have to-- move away from this model in order to make our upcoming features possible (on-disk data, component conversions, data overrides, external store hub, etc); so I'd rather not sink any more complexity than the bare minimum required in this thing -- it has to go away anyhow. - Fixes #4947 https://github.com/rerun-io/rerun/assets/2910679/b8d8db83-cafb-46ff-ad38-3e41c9d43cd9 https://github.com/rerun-io/rerun/assets/2910679/c3be6250-31ba-4398-9109-356c9b9a8b4e

abey79 added the 🪳 bug Something isn't working label Jan 29, 2024

abey79 added this to the 0.13 milestone Jan 29, 2024

teh-cmc self-assigned this Jan 29, 2024

teh-cmc added the 💣 crash crash, deadlock/freeze, do-no-start label Jan 29, 2024

teh-cmc mentioned this issue Jan 29, 2024

Primary cache: fix crash when using Clear #4949

Merged

4 tasks

This was referenced Jan 30, 2024

Deadlock playground: disable top-level system exec parallelism #4968

Closed

Primary cache: support reentrancy #4980

Merged

teh-cmc closed this as completed in #4980 Jan 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deadlock with primary caching is on #4947

Deadlock with primary caching is on #4947

abey79 commented Jan 29, 2024

abey79 commented Jan 30, 2024

abey79 commented Jan 30, 2024

abey79 commented Jan 30, 2024

teh-cmc commented Jan 30, 2024

Deadlock with primary caching is on #4947

Deadlock with primary caching is on #4947

Comments

abey79 commented Jan 29, 2024

abey79 commented Jan 30, 2024

abey79 commented Jan 30, 2024

abey79 commented Jan 30, 2024

teh-cmc commented Jan 30, 2024