
test_remove_node_during_mv_insert_4_nodes: sstable - failed reading index: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem) #12603

Closed
bhalevy opened this issue Jan 22, 2023 · 4 comments
bhalevy commented Jan 22, 2023

Regression seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/179/testReport/materialized_views_test/TestMaterializedViews/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split013___test_remove_node_during_mv_insert_4_nodes/
With Scylla version aab5954.

failed on teardown with "AssertionError: Unexpected errors found: [('node3', ['ERROR 2023-01-19 06:13:55,457 [shard 0] sstable - failed reading index for /jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-9ngjriog/test/node3/data/mview/users-51f7426097bf11ed978b68c684f7542b/me-42-big-Data.db: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)', 'ERROR 2023-01-19 06:13:55,460 [shard 0] sstable - failed reading index for /jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-9ngjriog/test/node3/data/mview/users-51f7426097bf11ed978b68c684f7542b/me-8-big-Data.db: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)' ...

@denesb please look into this ASAP.

bhalevy added the triage/master (Looking for assignee) and P1 (Urgent) labels on Jan 22, 2023
DoronArazii added this to the 5.2 milestone on Jan 22, 2023
denesb commented Jan 24, 2023

There are reports like this in the logs of node3:

INFO  2023-01-19 06:13:55,453 [shard 0] reader_concurrency_semaphore - Semaphore _read_concurrency_sem with 99/100 count and 20408624/10192158 memory resources: timed out, dumping permit diagnostics:
permits	count	memory	table/description/state
99	99	19M	mview.users/shard-reader/waiting_for_memory
249	0	16K	mview.users/multishard-mutation-query/active/unused
150	0	0B	mview.users/shard-reader/waiting_for_admission

This looks like a kind of deadlock: the top-level multishard reader is waiting for its shard readers, the shard readers are all blocked waiting for memory, and the semaphore is waiting for the multishard reader to release memory before it can let them make progress.
I think we need a special breakout condition here.
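The cycle described above can be sketched as a toy wait-for graph. This is a minimal model, not Scylla code; all the names below are illustrative:

```python
def find_cycle(waits_for):
    """Detect a cycle in a wait-for graph.

    waits_for maps each actor to the single thing it is blocked on;
    a cycle means nothing in it can ever make progress: a deadlock.
    """
    for start in waits_for:
        seen = set()
        node = start
        while node in waits_for:
            if node in seen:
                return True
            seen.add(node)
            node = waits_for[node]
    return False

# Wait-for relations matching the diagnostics above (names illustrative):
# the multishard reader waits on its shard readers, the shard readers
# wait on memory, and memory is only freed once the multishard reader
# consumes their output and releases it -- a cycle, hence no progress.
waits = {
    "multishard_reader": "shard_readers",
    "shard_readers": "memory",
    "memory": "multishard_reader",
}
print(find_cycle(waits))  # True
```

A timeout (like the semaphore's existing one) breaks the symptom but not the cycle, which is why a dedicated breakout condition is needed.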

denesb commented Jan 31, 2023

I have been unsuccessful so far in reproducing this. I saw 3 reports that all look the same: 99 reads waiting for memory in what seems to be a deadlock, with lots of waiters for admission.
The reason I think a deadlock is involved is that no permit holding count resources is in any active/* state. The multishard mutation reader (which is in an active/* state) is not doing any actual reading; it is just merging the output of the shard readers, which all seem to be blocked.
I suspect the bug is in the interaction between inactive reads and read serialization.

DoronArazii added the type/bug label and removed the triage/master (Looking for assignee) label on Feb 1, 2023
denesb commented Feb 1, 2023

After staring at the code hard enough for long enough it finally confessed and I reproduced this with a unit test, on a smaller scale:

Semaphore test_reader_concurrency_semaphore_blessed_read_goes_inactive with 2/3 count and 5120/3072 memory resources: user request, dumping permit diagnostics:
permits count   memory  table/description/state
1       1       3K      ks.cf/permit3/waiting_for_memory
1       1       2K      ks.cf/permit2/waiting_for_memory
1       0       0B      ks.cf/permit1/waiting_for_admission

3       2       5K      total

The problem happens when a permit becomes blessed, is then registered inactive, and is then evicted, and thus has to wait for readmission. If the timing is just right, the permit is queued on the admission queue while still being the blessed permit. If the semaphore is above the serialize limit and reads don't release memory, the situation seen above can happen: all reads get queued on memory, making no progress, because the blessed permit is waiting for admission. Deadlock.
The solution is to release the blessed status upon becoming inactive.
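The fix can be sketched with a toy model of the blessed-permit rule. This is not Seastar's reader_concurrency_semaphore API; the classes and names below are hypothetical and only capture the state transition being fixed:

```python
class Permit:
    """Toy permit; not Seastar's reader_permit."""
    def __init__(self, name):
        self.name = name
        self.inactive = False


class ToySemaphore:
    """Minimal model of the blessed-permit rule: past the serialize
    threshold, only the blessed permit may consume more memory."""
    def __init__(self):
        self.blessed = None

    def bless(self, permit):
        self.blessed = permit

    def may_consume_memory(self, permit):
        # Past the serialize threshold, only the blessed permit proceeds.
        return self.blessed is None or self.blessed is permit

    def register_inactive(self, permit):
        permit.inactive = True
        # The fix: an inactive (and possibly evicted) permit cannot make
        # progress, so it must give up its blessed status here; otherwise
        # every other permit stays queued on memory while this one waits
        # for readmission.
        if self.blessed is permit:
            self.blessed = None


sem = ToySemaphore()
p1, p2 = Permit("permit1"), Permit("permit2")
sem.bless(p1)
print(sem.may_consume_memory(p2))  # False: p1 holds the blessing
sem.register_inactive(p1)          # the fix strips the blessing here
print(sem.may_consume_memory(p2))  # True: p2 can now make progress
```

Without the `register_inactive` release, `p2` would stay blocked on memory indefinitely while `p1` sits in the admission queue, which is exactly the deadlock in the dump above.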

denesb added a commit to denesb/scylla that referenced this issue Feb 1, 2023

When the memory consumption of the semaphore reaches the configured
serialize threshold, all but the blessed permit are blocked from
consuming any more memory. This ensures that past this limit, only one
permit at a time can consume memory.
Such a blessed permit can be registered inactive. Before this patch, it
would still retain its blessed status when doing so. This could result
in this permit being re-queued for admission if it was evicted in the
meantime, potentially resulting in a complete deadlock of the semaphore:
* admission queue permits cannot be admitted because there is no memory
* admitted permits are all queued on memory, as none of them are blessed

This patch strips the blessed status from the permit when it is
registered as inactive. It also adds a unit test to verify this happens.

Fixes: scylladb#12603
avikivity (Member) commented:
No vulnerable branches, not backporting.
