
[Multitenant] c-s failure during terminate k8s host (after kill scylla) #1693

Closed
soyacz opened this issue Jan 18, 2024 · 2 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

soyacz commented Jan 18, 2024

What happened?

In a multitenant scenario (2 tenants), a terminate-k8s-host nemesis ran after the kill_scylla nemesis.
Although kill_scylla finished without issue and the restarted pod was serving data (the logs show `init - serving`, and the node appears as UN in nodetool status), after terminating a k8s node (a different node than the one the restarted Scylla node was on), c-s failed with:

2023-12-31 06:49:01.601: (CassandraStressLogEvent Severity.CRITICAL) period_type=one-time event_id=bd52e5dc-a047-46b9-8996-66c70b6ed107 during_nemesis=TerminateKubernetesHostThenDecommissionAndAddScyllaNode: type=OperationOnKey regex=Operation x10 on key\(s\) \[ line_number=1834 node=Node sct-loaders-2-eu-north-1-1 [None | None] (seed: False)
java.io.IOException: Operation x10 on key(s) [4c3650304b4c4d4d3231]: Error executing: (UnavailableException): Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive)

So it looks like c-s still could not reach the previously restarted Scylla node, even though it should have been available by then.
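For context on the error above: with QUORUM consistency, a majority of replicas must respond, so for the usual replication factor of 3 (an assumption here, consistent with the "2 required but only 1 alive" message) one unreachable replica is tolerated but two are not. A minimal sketch, not from the report:

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a QUORUM read/write: floor(RF/2) + 1."""
    return replication_factor // 2 + 1

def quorum_available(replication_factor: int, alive_replicas: int) -> bool:
    """True if enough replicas are alive to satisfy QUORUM."""
    return alive_replicas >= quorum(replication_factor)

rf = 3  # assumed replication factor, matching "2 required" in the error
print(quorum(rf))                 # 2 replicas required
print(quorum_available(rf, 2))    # True: one replica down is tolerated
print(quorum_available(rf, 1))    # False: UnavailableException territory
```

This is why the failure points at two replicas being simultaneously unreachable: the terminated k8s host plus the earlier-restarted node still being seen as down by the driver.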

What did you expect to happen?

c-s should have continued to work despite one node being down at that moment.

How can we reproduce it (as minimally and precisely as possible)?

The issue does not reproduce consistently.

Scylla Operator version

v1.12.0-alpha.1-49-g4cc81a6

Kubernetes platform name and version

Installation details

Kernel Version: 5.10.199-190.747.amzn2.x86_64
Scylla version (or git commit hash): 5.5.0~dev-20231230.f1dea4bc8ad8 with build-id 66aced09bf964b091dc2efe9bd4d4cacfc9f93db

Operator Image: scylladb/scylla-operator:latest
Operator Helm Version: v1.12.0-alpha.0-144-g60f7824
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run:
No resources left at the end of the run

OS / Image: `` (k8s-eks: undefined_region)

Test: longevity-scylla-operator-3h-multitenant-eks
Test id: b14de382-c46c-460e-8c7c-7c764dccd5a0
Test name: scylla-operator/operator-master/eks/longevity-scylla-operator-3h-multitenant-eks
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor b14de382-c46c-460e-8c7c-7c764dccd5a0
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs b14de382-c46c-460e-8c7c-7c764dccd5a0

Logs:

Jenkins job URL
Argus

Please attach the must-gather archive.

Anything else we need to know?

No response

@soyacz soyacz added the kind/bug Categorizes issue or PR as related to a bug. label Jan 18, 2024
@scylla-operator-bot scylla-operator-bot bot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jan 18, 2024
@tnozicka tnozicka added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jan 18, 2024
@scylla-operator-bot scylla-operator-bot bot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jan 18, 2024
@tnozicka

this seems similar to #1077

tnozicka commented May 6, 2024

closing as a dupe of #1077

@tnozicka tnozicka closed this as completed May 6, 2024
3 participants