Add more logging for gossiper::lock_endpoint and storage_service::handle_state_normal #16733

Merged

Conversation

kbr-scylla
Contributor

In a longevity test reported in #16668 we observed that
NORMAL state is not being properly handled for a node that replaced
another node. Either handle_state_normal is not being called, or it is
but getting stuck in the middle. We could not tell which from the logs,
and attempts at creating a local reproducer failed.

Thus the plan is to continue debugging using the longevity test, but we need
more logs. To check whether handle_state_normal was called and which branches
were taken, include some INFO level logs there. Also, detect deadlocks inside
gossiper::lock_endpoint by reporting an error message if lock_endpoint
waits for the lock for too long.

Ref: #16668

In a longevity test reported in scylladb#16668 we observed that
NORMAL state is not being properly handled for a node that replaced
another node. Either handle_state_normal is not being called, or it is
but getting stuck in the middle. We could not tell which from the logs,
and attempts at creating a local reproducer failed.

Improve the INFO level logging in handle_state_normal to aid debugging
in the future.

The volume of logging is still constant per node. Even though some log
messages report all tokens owned by a node, handle_state_normal calls
are still rare. The most "spammy" situation is when a node starts and
calls handle_state_normal for every other node in the cluster, but it is
a once-per-startup event.
The original code extracted only the function_name from the
source_location for logging. We'll use more information from the
source_location in later commits.
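
A minimal, hypothetical sketch of the idea, using standard C++20 `std::source_location` rather than Scylla's actual logger API (names here are illustrative only):

```cpp
#include <iostream>
#include <source_location>
#include <string_view>

// Hypothetical stand-in for the logger: prints file, line and function taken
// from the caller's std::source_location, not just function_name.
void info_log(std::string_view msg,
              std::source_location loc = std::source_location::current()) {
    std::cout << "INFO [" << loc.file_name() << ':' << loc.line() << ' '
              << loc.function_name() << "] " << msg << '\n';
}

int main() {
    // In handle_state_normal, calls like this would mark which branch ran.
    info_log("handle_state_normal: endpoint is replacing an existing node");
}
```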
@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 2 hr 19 min
  • Builder: i-094afabb8ff7d4857 (m5d.12xlarge)

@kbr-scylla
Contributor Author

@bhalevy please review -- I believe this is necessary to progress on debugging #16668 (I don't have any other ideas)

In a longevity test reported in scylladb#16668 we observed that
NORMAL state is not being properly handled for a node that replaced
another node. Either handle_state_normal is not being called, or it is
but getting stuck in the middle. We could not tell which from the logs,
and attempts at creating a local reproducer failed.

One hypothesis is that `gossiper` is stuck on `lock_endpoint`. We dealt
with gossiper deadlocks in the past (e.g. scylladb#7127).

Modify the code so it reports an error if `lock_endpoint` waits for the
lock for more than a minute. When the issue reproduces again in
longevity, we will see if `lock_endpoint` got stuck.
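
A rough sketch of the watchdog idea, written against standard C++ primitives rather than Scylla's own locking and logging facilities (the real `lock_endpoint` does not use `std::timed_mutex`; this only illustrates the "report if waiting too long" pattern):

```cpp
#include <chrono>
#include <iostream>
#include <mutex>

// Hypothetical per-endpoint lock. Keep trying to acquire it, but print an
// error every time a full minute passes without success, so a stuck
// lock_endpoint leaves a trace in the logs.
std::timed_mutex endpoint_lock;

void lock_endpoint_with_watchdog() {
    using namespace std::chrono_literals;
    while (!endpoint_lock.try_lock_for(1min)) {
        std::cerr << "ERROR: still waiting on lock_endpoint after one minute; "
                     "possible deadlock\n";
    }
    // ... endpoint state is modified here under the lock ...
    endpoint_lock.unlock();
}
```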
@kbr-scylla
Contributor Author

v2: reverse conditions in lock_endpoint to reduce nesting
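
(For illustration only, not the actual diff: "reversing conditions" here means turning nested `if` blocks into early returns, roughly as below.)

```cpp
// Before: nested conditions.
void report_if_stuck_nested(bool acquired, bool waited_too_long) {
    if (!acquired) {
        if (waited_too_long) {
            // report possible deadlock
        }
    }
}

// After ("v2"): conditions reversed into early returns, same behaviour.
void report_if_stuck_flat(bool acquired, bool waited_too_long) {
    if (acquired) {
        return;
    }
    if (!waited_too_long) {
        return;
    }
    // report possible deadlock
}
```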

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 1 hr 50 min
  • Builder: spider3.cloudius-systems.com

@scylladb-promoter merged commit 5f44ae8 into scylladb:master Jan 12, 2024
4 checks passed