Add more logging for gossiper::lock_endpoint and storage_service::handle_state_normal #16733

Merged

Conversation

kbr-scylla
Contributor

In a longevity test reported in #16668 we observed that
NORMAL state is not being properly handled for a node that replaced
another node. Either handle_state_normal is not being called, or it is
but getting stuck in the middle. We could not tell which from the logs,
and attempts at creating a local reproducer failed.

Thus the plan is to continue debugging using the longevity test, but we need
more logs. To check whether handle_state_normal was called and which branches
were taken, include some INFO level logs there. Also, detect deadlocks inside
gossiper::lock_endpoint by reporting an error message if lock_endpoint
waits for the lock for too long.

Ref: #16668

In a longevity test reported in scylladb#16668 we observed that
NORMAL state is not being properly handled for a node that replaced
another node. Either handle_state_normal is not being called, or it is
but getting stuck in the middle. We could not tell which from the logs,
and attempts at creating a local reproducer failed.

Improve the INFO level logging in handle_state_normal to aid debugging
in the future.

The volume of logging is still constant per node. Even though some log
messages report all tokens owned by a node, handle_state_normal calls
are still rare. The most "spammy" situation is when a node starts and
calls handle_state_normal for every other node in the cluster, but it is
a once-per-startup event.
The original code extracted only the function_name from the
source_location for logging. We'll use more information from the
source_location in later commits.
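
A minimal, hypothetical sketch of the idea, using standard C++20 `std::source_location` rather than Scylla's actual logger API (names here are illustrative only):

```cpp
#include <iostream>
#include <source_location>
#include <string_view>

// Hypothetical stand-in for the logger: prints file, line and function taken
// from the caller's std::source_location, not just function_name.
void info_log(std::string_view msg,
              std::source_location loc = std::source_location::current()) {
    std::cout << "INFO [" << loc.file_name() << ':' << loc.line() << ' '
              << loc.function_name() << "] " << msg << '\n';
}

int main() {
    // In handle_state_normal, calls like this would mark which branch ran.
    info_log("handle_state_normal: endpoint is replacing an existing node");
}
```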
@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 2 hr 19 min
  • Builder: i-094afabb8ff7d4857 (m5d.12xlarge)

@kbr-scylla
Contributor Author

@bhalevy please review -- I believe this is necessary to progress on debugging #16668 (I don't have any other ideas)

In a longevity test reported in scylladb#16668 we observed that
NORMAL state is not being properly handled for a node that replaced
another node. Either handle_state_normal is not being called, or it is
but getting stuck in the middle. We could not tell which from the logs,
and attempts at creating a local reproducer failed.

One hypothesis is that `gossiper` is stuck on `lock_endpoint`. We dealt
with gossiper deadlocks in the past (e.g. scylladb#7127).

Modify the code so it reports an error if `lock_endpoint` waits for the
lock for more than a minute. When the issue reproduces again in
longevity, we will see if `lock_endpoint` got stuck.
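
A rough sketch of the watchdog idea, written against standard C++ primitives rather than Scylla's own locking and logging facilities (the real `lock_endpoint` does not use `std::timed_mutex`; this only illustrates the "report if waiting too long" pattern):

```cpp
#include <chrono>
#include <iostream>
#include <mutex>

// Hypothetical per-endpoint lock. Keep trying to acquire it, but print an
// error every time a full minute passes without success, so a stuck
// lock_endpoint leaves a trace in the logs.
std::timed_mutex endpoint_lock;

void lock_endpoint_with_watchdog() {
    using namespace std::chrono_literals;
    while (!endpoint_lock.try_lock_for(1min)) {
        std::cerr << "ERROR: still waiting on lock_endpoint after one minute; "
                     "possible deadlock\n";
    }
    // ... endpoint state is modified here under the lock ...
    endpoint_lock.unlock();
}
```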
@kbr-scylla
Contributor Author

v2: reverse conditions in lock_endpoint to reduce nesting
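
(For illustration only, not the actual diff: "reversing conditions" here means turning nested `if` blocks into early returns, roughly as below.)

```cpp
// Before: nested conditions.
void report_if_stuck_nested(bool acquired, bool waited_too_long) {
    if (!acquired) {
        if (waited_too_long) {
            // report possible deadlock
        }
    }
}

// After ("v2"): conditions reversed into early returns, same behaviour.
void report_if_stuck_flat(bool acquired, bool waited_too_long) {
    if (acquired) {
        return;
    }
    if (!waited_too_long) {
        return;
    }
    // report possible deadlock
}
```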

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 1 hr 50 min
  • Builder: spider3.cloudius-systems.com

@scylladb-promoter merged commit 5f44ae8 into scylladb:master Jan 12, 2024
4 checks passed