test_remove_node_during_mv_insert_4_nodes: sstable - failed reading index: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem) #12603
Comments
There are reports like this in the logs of node3:
This seems to be a kind of deadlock: the top-level multishard reader is waiting for its shard readers. Its shard readers are all blocked on memory, while the semaphore is waiting for the multishard readers to release some memory so that it can make progress.
I have been unsuccessful so far in reproducing this. I saw 3 reports that all look the same: 99 reads waiting for memory in what seems to be a deadlock, with lots of waiters for admission.
After staring at the code hard enough for long enough it finally confessed and I reproduced this with a unit test, on a smaller scale:
The problem happens when a permit becomes blessed, is then registered inactive, and is then evicted and thus has to wait for readmission. If the situation is just right, the permit is queued on the admission queue while still being the blessed permit. If the semaphore is above the serialize limit and reads don't release memory, the situation seen above can happen: all reads get queued on memory, and none can make progress because the blessed permit is waiting for admission. Deadlock.
When the memory consumption of the semaphore reaches the configured serialize threshold, all but the blessed permit is blocked from consuming any more memory. This ensures that past this limit, only one permit at a time can consume memory. Such a blessed permit can be registered inactive. Before this patch, it would still retain its blessed status when doing so. This could result in this permit being re-queued on the admission queue after a possible eviction, potentially resulting in a complete deadlock of the semaphore:
* admission queue permits cannot be admitted because there is no memory
* admitted permits are all queued on memory, as none of them are blessed
This patch strips the blessed status from the permit when it is registered as inactive. It also adds a unit test to verify this happens.
Fixes: scylladb#12603
When the memory consumption of the semaphore reaches the configured serialize threshold, all but the blessed permit is blocked from consuming any more memory. This ensures that past this limit, only one permit at a time can consume memory. Such a blessed permit can be registered inactive. Before this patch, it would still retain its blessed status when doing so. This could result in this permit being re-queued for admission if it was evicted in the meantime, potentially resulting in a complete deadlock of the semaphore:
* admission queue permits cannot be admitted because there is no memory
* admitted permits are all queued on memory, as none of them are blessed
This patch strips the blessed status from the permit when it is registered as inactive. It also adds a unit test to verify this happens.
Fixes: scylladb#12603
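To make the sequence of events easier to follow, here is a minimal toy model of the scenario described in the commit message. It is not the actual semaphore code: `ToySemaphore`, its methods, the `strip_blessed_on_inactive` flag and all the numbers are hypothetical names and values chosen for illustration. The model assumes that an evicted permit gives back its own memory, that the other reads never release theirs, and that nothing is admitted while consumption is at or above the serialize threshold.

```python
class ToySemaphore:
    """Toy model of the blessed-permit logic; not Scylla's real semaphore."""

    def __init__(self, serialize_threshold, strip_blessed_on_inactive):
        self.serialize_threshold = serialize_threshold
        self.strip_blessed_on_inactive = strip_blessed_on_inactive
        self.memory = {}                 # permit name -> memory it currently holds
        self.blessed = None              # the only permit allowed to consume past the threshold
        self.waiting_for_admission = []
        self.waiting_for_memory = []

    @property
    def consumed(self):
        return sum(self.memory.values())

    def consume(self, permit, amount):
        """Return True if the memory was granted, False if the permit got queued."""
        if self.consumed >= self.serialize_threshold:
            if self.blessed is None:
                self.blessed = permit    # first requester past the threshold gets blessed
            if self.blessed != permit:
                self.waiting_for_memory.append(permit)
                return False
        self.memory[permit] = self.memory.get(permit, 0) + amount
        return True

    def register_inactive(self, permit):
        # The fix: an inactive permit must not keep the blessed status.
        if self.strip_blessed_on_inactive and self.blessed == permit:
            self.blessed = None

    def evict(self, permit):
        # Assumption: eviction frees the permit's memory and sends it back to admission.
        self.memory.pop(permit, None)
        self.waiting_for_admission.append(permit)

    def is_deadlocked(self):
        # Past the threshold only the blessed permit may consume, but it is not even admitted.
        return (self.consumed >= self.serialize_threshold
                and self.blessed in self.waiting_for_admission)


def run_scenario(strip_blessed_on_inactive):
    sem = ToySemaphore(serialize_threshold=100,
                       strip_blessed_on_inactive=strip_blessed_on_inactive)
    sem.consume("A", 10)
    sem.consume("B", 60)
    sem.consume("C", 60)                 # consumption is now 130, past the threshold
    sem.consume("A", 10)                 # A becomes the blessed permit
    sem.register_inactive("A")
    sem.evict("A")                       # A must wait for readmission; 120 still consumed
    b_granted = sem.consume("B", 5)
    sem.consume("C", 5)
    return b_granted, sem.is_deadlocked()


if __name__ == "__main__":
    print("without fix:", run_scenario(strip_blessed_on_inactive=False))
    print("with fix:   ", run_scenario(strip_blessed_on_inactive=True))
```

Without the fix, B and C are both queued on memory while A, still blessed, sits in the admission queue. With the fix, A loses the blessed status when it is registered inactive, so B picks it up and can keep consuming memory.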
No vulnerable branches, not backporting.
Regression seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/179/testReport/materialized_views_test/TestMaterializedViews/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split013___test_remove_node_during_mv_insert_4_nodes/
With scylla version aab5954
@denesb please look into this asap