Hinted handoff: use after free during shutdown #4836
Closed
vladzcloudius pushed a commit to vladzcloudius/scylla that referenced this issue Aug 12, 2019
space_watchdog blocks on the end_point_manager::file_update_mutex(). Therefore when we shut scylla down we need to first stop the space_watchdog (which is a part of resource_manager) and only after that shut all instances of db::hints::manager down. Otherwise there may be a use-after-free event.

However we should also make sure that hints do not break the disk space contract during the time frame when space_watchdog is down but hints::managers are still up. In order to ensure that we will disable hints storing before stopping space_watchdog.

Fixes scylladb#4836

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
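The ordering described in this commit message can be sketched roughly as follows. This is a single-threaded model in standard C++, not Scylla or Seastar code: ep_manager, hints_manager, space_watchdog, forbid_hints(), block_on_first_end_point() and shut_down() are hypothetical stand-ins, and the "blocked on file_update_mutex()" state is modelled by a raw pointer the watchdog holds.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an end point manager (owner of file_update_mutex()).
struct ep_manager {};

// Hypothetical stand-in for db::hints::manager.
class hints_manager {
    bool _can_store = true;
    std::vector<std::unique_ptr<ep_manager>> _eps;
public:
    hints_manager() { _eps.push_back(std::make_unique<ep_manager>()); }
    // Disables hints storing so the disk space contract holds without a watchdog.
    void forbid_hints() { _can_store = false; }
    // Returns false once storing has been disabled.
    bool store_hint() const { return _can_store; }
    std::vector<std::unique_ptr<ep_manager>>& end_points() { return _eps; }
    // Destroys every end point manager owned by this manager.
    void stop() { _eps.clear(); }
};

// Hypothetical stand-in for the space_watchdog owned by resource_manager.
class space_watchdog {
    std::vector<hints_manager*> _managers;
    ep_manager* _waiting_on = nullptr;  // models a scan blocked on file_update_mutex()
    bool _stopped = false;
public:
    explicit space_watchdog(std::vector<hints_manager*> managers)
        : _managers(std::move(managers)) {}

    // Models the watchdog getting blocked on the first end point manager it finds.
    void block_on_first_end_point() {
        if (_stopped) return;
        for (auto* m : _managers) {
            if (!m->end_points().empty()) {
                _waiting_on = m->end_points().front().get();
                return;
            }
        }
    }
    bool is_blocked() const { return _waiting_on != nullptr; }
    // After stop() the watchdog no longer references any end point manager.
    void stop() { _stopped = true; _waiting_on = nullptr; }
};

// Shutdown in the order the commit message asks for; returns the steps taken.
std::vector<std::string> shut_down(space_watchdog& watchdog,
                                   std::vector<hints_manager*>& managers) {
    std::vector<std::string> steps;
    for (auto* m : managers) m->forbid_hints();   // 1. disable hints storing
    steps.push_back("forbid_hints");
    watchdog.stop();                              // 2. stop the space_watchdog
    steps.push_back("stop_space_watchdog");
    for (auto* m : managers) m->stop();           // 3. only now destroy the managers
    steps.push_back("stop_managers");
    return steps;
}
```

Swapping steps 2 and 3 in this model would leave _waiting_on pointing at a destroyed ep_manager, which is the dangling reference the issue reports.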
Refs #5089
avikivity added a commit that referenced this issue Oct 2, 2019
space_watchdog blocks on the end_point_manager::file_update_mutex(). Therefore when we shut scylla down we need to first stop the space_watchdog (which is a part of resource_manager) and only after that shut all instances of db::hints::manager down. Otherwise there may be a use-after-free event.

However we should also make sure that hints do not break the disk space contract during the time frame when space_watchdog is down but hints::managers are still up. In order to ensure that we will disable hints storing before stopping space_watchdog.

Fixes #4836

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20190812201857.5716-2-vladz@scylladb.com>
avikivity added a commit that referenced this issue Oct 3, 2019
…Vlad

"
Fix races that may lead to use-after-free events and file system level exceptions during shutdown and drain.

The root cause of use-after-free events in question is that space_watchdog blocks on end_point_hints_manager::file_update_mutex() and we need to make sure this mutex is alive as long as it's accessed even if the corresponding end_point_hints_manager instance is destroyed in the context of manager::drain_for().

File system exceptions may occur when space_watchdog attempts to scan a directory while it's being deleted from the drain_for() context. In case of such an exception new hints generation is going to be blocked - including for materialized views, till the next space_watchdog round (in 1s).

Issues that are fixed are #4685 and #4836.

Tested as follows:
1) Patched the code in order to trigger the race with (a lot) higher probability and running slightly modified hinted handoff replace dtest with a debug binary for 100 times. Side effect of this testing was discovering of #4836.
2) Using the same patch as above tested that there are no crashes and nodes survive stop/start sequences (they were not without this series) in the context of all hinted handoff dtests. Ran the whole set of tests with dev binary for 10 times.
"

* 'hinted_handoff_race_between_drain_for_and_space_watchdog_no_global_lock-v2' of https://github.com/vladzcloudius/scylla:
  hinted handoff: fix a race on a directory removal between space_watchdog and drain_for()
  hinted handoff: make taking file_update_mutex safe
  db::hints::manager::drain_for(): fix alignment
  db::hints::manager: serialize calls to drain_for()
  db::hints: cosmetics: identation and missing method qualifier
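One common way to keep a mutex alive for as long as it is accessed, even after its owner is destroyed, is to hold it through shared ownership. The sketch below illustrates that idea with std::shared_ptr and std::mutex from the standard library. It is an assumption for illustration and is not taken from the series: ep_hints_manager, file_update_mutex_ptr() and scan_with_lock() are hypothetical names, and the actual patches may solve this differently (Scylla is built on Seastar primitives rather than std::mutex).

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical stand-in for end_point_hints_manager.
class ep_hints_manager {
    // The mutex lives on the heap and is reference counted.
    std::shared_ptr<std::mutex> _file_update_mutex = std::make_shared<std::mutex>();
public:
    // Hands out shared ownership: the mutex stays alive while any caller
    // still holds the returned pointer, even if this object is destroyed.
    std::shared_ptr<std::mutex> file_update_mutex_ptr() const {
        return _file_update_mutex;
    }
};

// What a watchdog-like scanner would do: it is given its own reference
// to the mutex and locks through that reference.
bool scan_with_lock(const std::shared_ptr<std::mutex>& mutex_ref) {
    std::lock_guard<std::mutex> guard(*mutex_ref);
    // ... inspect the hints directory while holding the lock ...
    return true;
}

// Models drain_for() destroying the manager while the scanner still
// holds a reference to its mutex.
bool demo_mutex_outlives_owner() {
    auto ep = std::make_unique<ep_hints_manager>();
    std::shared_ptr<std::mutex> held = ep->file_update_mutex_ptr();
    ep.reset();  // the manager is gone, the mutex is not
    return held.use_count() == 1 && scan_with_lock(held);
}
```

The scanner has to take its reference before the manager can be destroyed; taking it afterwards would still be a use-after-free, which is why the ordering of those two steps matters as much as the ownership model.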
avikivity added a commit that referenced this issue Oct 5, 2019
…Vlad

"
Fix races that may lead to use-after-free events and file system level exceptions during shutdown and drain.

The root cause of use-after-free events in question is that space_watchdog blocks on end_point_hints_manager::file_update_mutex() and we need to make sure this mutex is alive as long as it's accessed even if the corresponding end_point_hints_manager instance is destroyed in the context of manager::drain_for().

File system exceptions may occur when space_watchdog attempts to scan a directory while it's being deleted from the drain_for() context. In case of such an exception new hints generation is going to be blocked - including for materialized views, till the next space_watchdog round (in 1s).

Issues that are fixed are #4685 and #4836.

Tested as follows:
1) Patched the code in order to trigger the race with (a lot) higher probability and running slightly modified hinted handoff replace dtest with a debug binary for 100 times. Side effect of this testing was discovering of #4836.
2) Using the same patch as above tested that there are no crashes and nodes survive stop/start sequences (they were not without this series) in the context of all hinted handoff dtests. Ran the whole set of tests with dev binary for 10 times.
"

Fixes #4685
Fixes #4836

* 'hinted_handoff_race_between_drain_for_and_space_watchdog_no_global_lock-v2' of https://github.com/vladzcloudius/scylla:
  hinted handoff: fix a race on a directory removal between space_watchdog and drain_for()
  hinted handoff: make taking file_update_mutex safe
  db::hints::manager::drain_for(): fix alignment
  db::hints::manager: serialize calls to drain_for()
  db::hints: cosmetics: identation and missing method qualifier

(cherry picked from commit 3cb081e)
Installation details
HEAD: 77686ab
Description
ASAN caught a use-after-free during the regular shutdown process while unit testing the fix for #4685. (I had to patch space_watchdog's code in order to simulate the race, see the patch below.)
It turns out that the resource manager shutdown is not performed in the correct order. It shuts down the managers (which in turn shut down all underlying end_point_managers and delete all end_point_manager instances) and only after that shuts down the space_watchdog, which may have been blocked waiting for an <some ep_manager object>::file_update_mutex().
The proper shutdown procedure should be:
1) resource_manager, which should stop the space_watchdog.
2) managers.
Patch used to trigger the race