Segmentation fault on node configured to use client encryption #14299
Comments
And we don't know if it's a regression?
My bad, it's a regression for sure. I suspect it might be related to recent build changes in seastar, but those merged in seastar 27 days ago and I don't know when they were introduced into scylla, so I'm not quite sure...
Well, the stack trace suggests we get an empty shared_ptr and crash on the first dereference. It is not 100% clear to me who the topmost caller is, though...
happened again: during a nemesis that restarts scylla, and it keeps on coredumping when scylla starts again.
Installation details
Kernel Version: 5.15.0-1038-aws
Cluster size: 6 nodes (i3.4xlarge)
Scylla Nodes used in this run:
OS / Image: `` (aws: undefined_region)
Test:
Logs and commands
Logs:
Seems like it's quite easily reproduced, and it has also been spotted on enterprise.
@elcallio setting this crash as P1 (regression) to hopefully deal with it quickly on master. /Cc @eliransin
Suspect f86dd85
A bit scary that no unit test or dtest caught this. I'll look into writing a dtest reproducer for this one...
Got a dtest reproducer for it: it seems to be a combination of raft being enabled and server encryption.
@elcallio this issue fails our tests that use TLS, and it seems like a regression introduced by certificate-based authorization.
How would f86dd85 even be related? That code changes literally nothing with setting up TLS connectors, nor the underlying TLS infrastructure. It only looks at data once incoming client connections are up. The above trace is literally in socket creation.
@elcallio - any other ideas of what we need to revert?
bisect? Assuming @fruch's repro is sound?
It can't be that commit because the report is older than the merge. I'll dequeue the revert.
I love temporal evidence. I'll run @fruch's repro and see if I can make sense of the crash.
@elcallio - scylladb/seastar@f461641 perhaps?
It cannot be certificate-based authorization. Merge date: f86dd85 Tue Jun 27 12:52:14 2023 +0300
@mykaul scylladb/seastar@f461641 only changes things that happen after a connect, and even then only when demanded. I very much doubt it. Let me repro the issue.
So the problem is that Adding a neat little assert:
will demonstrate this excellently by transforming the segfault to a much earlier and more traceable crash. I would suggest probably making message service TLS init earlier iff code requires sending messages before listeners are up.
So if you want a commit to blame, I would suggest 38f65e5
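As a rough illustration of the failure mode described above, here is a minimal C++ sketch of a service whose TLS state is only created in `start_listen()`, with an assert guarding the outgoing path. All names (`lazy_tls_credentials`, `lazy_messaging_sketch`, `tls_ready`, `connect_tls`) are hypothetical and are not Scylla or Seastar APIs; it only assumes the shape of the bug as reported in this thread.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for TLS credentials; not a Seastar or Scylla type.
struct lazy_tls_credentials {
    std::string cert_name;
};

// Hypothetical messaging service that only builds its TLS state in start_listen().
class lazy_messaging_sketch {
    std::shared_ptr<lazy_tls_credentials> _credentials; // empty until start_listen()
public:
    void start_listen() {
        _credentials = std::make_shared<lazy_tls_credentials>(
            lazy_tls_credentials{"node-cert"});
    }

    bool tls_ready() const { return static_cast<bool>(_credentials); }

    // Outgoing connection path, reachable before start_listen() in this sketch.
    std::string connect_tls(const std::string& peer) const {
        // The guard: fail here with a readable message instead of
        // dereferencing an empty shared_ptr further down.
        assert(tls_ready() && "TLS credentials used before start_listen()");
        return peer + " via " + _credentials->cert_name;
    }
};
```

Calling `connect_tls()` on a freshly constructed object trips the assert (in a build without `NDEBUG`), which is the "earlier and more traceable crash"; after `start_listen()` the same call succeeds.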
Fixes scylladb#14299 failure_detector can try sending messages to TLS endpoints before start_listen has been called (why?). Need TLS initialized before this. So do on service creation.
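The ordering change this commit message describes can be sketched as follows, assuming a simplified service with hypothetical names (`eager_tls_credentials`, `eager_messaging_sketch`); this is not the actual patch, only an illustration of creating the TLS state at construction so that senders running before `start_listen()` never see an empty pointer.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for TLS credentials; not a Seastar or Scylla type.
struct eager_tls_credentials {
    std::string cert_name;
};

// Hypothetical messaging service that builds its TLS state on creation.
class eager_messaging_sketch {
    std::shared_ptr<eager_tls_credentials> _credentials; // set in the constructor
    bool _listening = false;
public:
    explicit eager_messaging_sketch(std::string cert_name)
        : _credentials(std::make_shared<eager_tls_credentials>(
              eager_tls_credentials{std::move(cert_name)})) {}

    // Listening no longer doubles as TLS initialization.
    void start_listen() { _listening = true; }
    bool listening() const { return _listening; }

    // Safe to call before start_listen(): credentials exist since construction.
    std::string connect_tls(const std::string& peer) const {
        assert(_credentials && "credentials are created in the constructor");
        return peer + " via " + _credentials->cert_name;
    }
};
```

In this shape, an early sender such as a failure detector can open outgoing TLS connections regardless of whether the listeners are up yet.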
We need this in 5.3 (upgrade tests from 5.3.0-rc0 are failing because of it).
@fruch do you need a permission to add labels? ^^
I have it, I wasn't sure which one to add, or whether 'backport candidate' was enough.
@DoronArazii We need to figure out whether this regression might find its way to 5.2 if there is a chance of a backport one day.
The test is gating on master. I'll test it on 5.2 and 2023.1, and if it works I'll backport it.
Tested, and it was working on 5.2/2023.1. Backported the test as gating to 5.2/2023.1.
Removing 'backport candidate' label.
Issue description
Node-1 keeps crashing since boot, with the following callstack:
Impact
The node becomes unavailable at unpredictable times and interferes with the test logic, failing all kinds of nemeses.
How frequently does it reproduce?
So far it has been seen once
Installation details
Kernel Version: 5.15.0-1038-aws
Scylla version (or git commit hash):
5.4.0~dev-20230618.b7627085cb13
with build-id a2d9adc050ce01f3543f876ea72d863b1ca6e615
Cluster size: 6 nodes (i3.4xlarge)
Scylla Nodes used in this run:
OS / Image: `` (aws: undefined_region)
Test:
longevity-100gb-4h-test
Test id:
54f56776-c13f-4749-9b6c-913dc109eb46
Test name:
scylla-master/longevity/longevity-100gb-4h-test
Test config file(s):
Logs and commands
$ hydra investigate show-monitor 54f56776-c13f-4749-9b6c-913dc109eb46
$ hydra investigate show-logs 54f56776-c13f-4749-9b6c-913dc109eb46
Logs:
Jenkins job URL
Argus