Cassandra timeout during SIMPLE write query at consistency ALL in the AddRemoveDc nemesis #10274

Closed
ilya-rarov opened this issue Mar 24, 2022 · 8 comments

@ilya-rarov

Installation details

Kernel Version: 5.13.0-1017-aws

Scylla version (or git commit hash): 5.1.dev-20220317.c45050895403 with build-id 076f513b6143670def988c7626389b270411f8f7

Cluster size: 4 nodes (i3.2xlarge)

Scylla Nodes used in this run:

  • longevity-lwt-3h-master-db-node-e59d59d8-5 (13.48.29.176 | 10.0.1.195) (shards: 8)
  • longevity-lwt-3h-master-db-node-e59d59d8-4 (16.16.26.25 | 10.0.3.113) (shards: 8)
  • longevity-lwt-3h-master-db-node-e59d59d8-3 (16.16.26.1 | 10.0.3.151) (shards: 8)
  • longevity-lwt-3h-master-db-node-e59d59d8-2 (16.170.214.99 | 10.0.1.62) (shards: 8)
  • longevity-lwt-3h-master-db-node-e59d59d8-1 (13.53.122.27 | 10.0.1.100) (shards: 8)

OS / Image: ami-0f50a374dd30afb62 (aws: eu-north-1)

Test: longevity-lwt-3h-test

Test id: e59d59d8-3892-4fda-8e50-e651387aff96

Test name: longevity/longevity-lwt-3h-test

Test config file(s):

Issue description

Before the AddRemoveDc nemesis started, we made sure the cluster was up and running:

2022-03-22 15:52:50.430: (ClusterHealthValidatorEvent Severity.NORMAL) period_type=begin event_id=bc326832-8fe7-4e7b-b120-b859642ec5dd
2022-03-22 15:53:55.477: (ClusterHealthValidatorEvent Severity.NORMAL) period_type=end event_id=bc326832-8fe7-4e7b-b120-b859642ec5dd duration=1m5s message=Cluster health check finished

During the nemesis, a new DC (eu-north_nemesis_dc) was created and a new node (Node-5) was added to it:

2022-03-22 16:06:56.171: (ScyllaServerStatusEvent Severity.NORMAL) period_type=begin event_id=dac26da1-dbeb-40a1-99a2-bd3f6c753bc3 node=longevity-lwt-3h-master-db-node-e59d59d8-5
2022-03-22 16:07:06.444: (JMXServiceEvent Severity.NORMAL) period_type=begin event_id=a0046713-7c2f-40fd-800d-909a049265bf node=longevity-lwt-3h-master-db-node-e59d59d8-5
2022-03-22 16:08:28.537: (BootstrapEvent Severity.NORMAL) period_type=begin event_id=18f9f47e-6fcf-4d4a-adc7-26000ed7907d node=longevity-lwt-3h-master-db-node-e59d59d8-5
2022-03-22 16:08:30.351: (BootstrapEvent Severity.NORMAL) period_type=end event_id=18f9f47e-6fcf-4d4a-adc7-26000ed7907d duration=1s node=longevity-lwt-3h-master-db-node-e59d59d8-5

A rebuild operation was performed on the new DC:

2022-03-22 16:10:36.940: (InfoEvent Severity.NORMAL) period_type=not-set event_id=bc8b8b30-1a71-4ab7-931f-e1eececeb390: message=execute rebuild on new datacenter
2022-03-22 16:10:36.940: (NodetoolEvent Severity.NORMAL) period_type=begin event_id=f4821f0c-09ed-44b7-93e8-1904a7826fe7: nodetool_command=rebuild node=longevity-lwt-3h-master-db-node-e59d59d8-5 options=-- eu-north
2022-03-22 16:10:39.948: (NodetoolEvent Severity.NORMAL) period_type=end event_id=f4821f0c-09ed-44b7-93e8-1904a7826fe7 duration=3s: nodetool_command=rebuild node=longevity-lwt-3h-master-db-node-e59d59d8-5 options=-- eu-north

A repair operation was performed on each node:

2022-03-22 16:10:39.948: (NodetoolEvent Severity.NORMAL) period_type=begin event_id=3f1daf57-da4f-4c36-ac96-570bdbfc96be: nodetool_command=repair node=longevity-lwt-3h-master-db-node-e59d59d8-1 options=-pr
2022-03-22 16:11:46.653: (NodetoolEvent Severity.NORMAL) period_type=end event_id=3f1daf57-da4f-4c36-ac96-570bdbfc96be duration=1m6s: nodetool_command=repair node=longevity-lwt-3h-master-db-node-e59d59d8-1 options=-pr
2022-03-22 16:11:46.653: (NodetoolEvent Severity.NORMAL) period_type=begin event_id=7113797d-d777-4789-b84f-ab40f78bd8d1: nodetool_command=repair node=longevity-lwt-3h-master-db-node-e59d59d8-2 options=-pr
2022-03-22 16:14:29.455: (NodetoolEvent Severity.NORMAL) period_type=end event_id=7113797d-d777-4789-b84f-ab40f78bd8d1 duration=2m42s: nodetool_command=repair node=longevity-lwt-3h-master-db-node-e59d59d8-2 options=-pr
2022-03-22 16:14:29.455: (NodetoolEvent Severity.NORMAL) period_type=begin event_id=613e7d5d-c526-4462-984c-dd23b74989d2: nodetool_command=repair node=longevity-lwt-3h-master-db-node-e59d59d8-3 options=-pr
2022-03-22 16:18:24.049: (NodetoolEvent Severity.NORMAL) period_type=end event_id=613e7d5d-c526-4462-984c-dd23b74989d2 duration=3m54s: nodetool_command=repair node=longevity-lwt-3h-master-db-node-e59d59d8-3 options=-pr
2022-03-22 16:18:24.050: (NodetoolEvent Severity.NORMAL) period_type=begin event_id=19a04946-a729-49ad-a0f4-f27e36509606: nodetool_command=repair node=longevity-lwt-3h-master-db-node-e59d59d8-4 options=-pr
2022-03-22 16:20:57.239: (NodetoolEvent Severity.NORMAL) period_type=end event_id=19a04946-a729-49ad-a0f4-f27e36509606 duration=2m33s: nodetool_command=repair node=longevity-lwt-3h-master-db-node-e59d59d8-4 options=-pr
2022-03-22 16:20:57.239: (NodetoolEvent Severity.NORMAL) period_type=begin event_id=aa956d72-9334-4f14-9418-09712d896a71: nodetool_command=repair node=longevity-lwt-3h-master-db-node-e59d59d8-5 options=-pr
2022-03-22 16:21:07.476: (NodetoolEvent Severity.NORMAL) period_type=end event_id=aa956d72-9334-4f14-9418-09712d896a71 duration=10s: nodetool_command=repair node=longevity-lwt-3h-master-db-node-e59d59d8-5 options=-pr

All the nodes had the same schema version before c-s was started:

2022-03-22T16:21:20+00:00 longevity-lwt-3h-master-db-node-e59d59d8-1 !    INFO |  [shard 3] schema_tables - Schema version changed to a18e10c9-accd-3e6e-ae3f-bd45d4f15984
2022-03-22T16:21:20+00:00 longevity-lwt-3h-master-db-node-e59d59d8-2 !    INFO |  [shard 3] schema_tables - Schema version changed to a18e10c9-accd-3e6e-ae3f-bd45d4f15984
2022-03-22T16:21:20+00:00 longevity-lwt-3h-master-db-node-e59d59d8-3 !    INFO |  [shard 3] schema_tables - Schema version changed to a18e10c9-accd-3e6e-ae3f-bd45d4f15984
2022-03-22T16:21:20+00:00 longevity-lwt-3h-master-db-node-e59d59d8-4 !    INFO |  [shard 3] schema_tables - Schema version changed to a18e10c9-accd-3e6e-ae3f-bd45d4f15984
2022-03-22T16:21:20+00:00 longevity-lwt-3h-master-db-node-e59d59d8-5 !    INFO |  [shard 3] schema_tables - Schema version changed to a18e10c9-accd-3e6e-ae3f-bd45d4f15984
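
For reference, the same check can be reproduced from any client by comparing schema_version in system.local with the versions the coordinator reports in system.peers. A minimal sketch with the DataStax 3.x Java driver (the same driver cassandra-stress shades); the contact point is node 1's private IP from this run, the rest is illustrative:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class SchemaVersionCheck {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("10.0.1.100").build();
                 Session session = cluster.connect()) {
                // Schema version held by the coordinator itself
                Row local = session.execute("SELECT schema_version FROM system.local").one();
                System.out.println("local: " + local.getUUID("schema_version"));
                // Schema versions the coordinator has gossiped for every peer
                for (Row peer : session.execute("SELECT peer, schema_version FROM system.peers")) {
                    System.out.println(peer.getInet("peer") + ": " + peer.getUUID("schema_version"));
                }
            }
        }
    }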

Finally, the following c-s command was started:

2022-03-22 16:21:15.727: (CassandraStressEvent Severity.NORMAL) period_type=begin event_id=e7980949-34c0-4b14-ac22-7c35f7518e1f: node=Node longevity-lwt-3h-master-loader-node-e59d59d8-1 [13.53.125.186 | 10.0.1.69] (seed: False)
stress_cmd=cassandra-stress write no-warmup cl=ALL n=10000 -schema 'keyspace=keyspace_new_dc replication(strategy=NetworkTopologyStrategy,eu-north=3,eu-north_nemesis_dc=1) compression=LZ4Compressor compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native compression=lz4 -rate threads=5 -pop seq=1..10000 -log interval=5

It almost immediately returned a critical error that failed the whole test:

2022-03-22 16:21:46.132: (CassandraStressLogEvent Severity.CRITICAL) period_type=one-time event_id=e7980949-34c0-4b14-ac22-7c35f7518e1f: type=OperationOnKey regex=Operation x10 on key\(s\) \[ line_number=134 node=Node longevity-lwt-3h-master-loader-node-e59d59d8-1 [13.53.125.186 | 10.0.1.69] (seed: False)
java.io.IOException: Operation x10 on key(s) [323634374f3036334d30]: Error executing: (WriteTimeoutException): Cassandra timeout during SIMPLE write query at consistency ALL (4 replica were required but only 3 acknowledged the write)
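
For context: with replication(strategy=NetworkTopologyStrategy,eu-north=3,eu-north_nemesis_dc=1) every partition has four replicas in total, and cl=ALL succeeds only if all four acknowledge the write, so a single slow or not-yet-ready replica in the new DC is enough to produce this timeout. A minimal sketch of how the DataStax 3.x driver surfaces it, using a hypothetical table keyspace_new_dc.t (the stress run writes to its own schema):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.exceptions.WriteTimeoutException;

    public class WriteAtConsistencyAll {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("10.0.1.100").build();
                 Session session = cluster.connect()) {
                // Hypothetical table, used only for illustration
                SimpleStatement insert = new SimpleStatement(
                        "INSERT INTO keyspace_new_dc.t (pk, v) VALUES ('k1', 'v1')");
                insert.setConsistencyLevel(ConsistencyLevel.ALL); // all 4 replicas must ack
                try {
                    session.execute(insert);
                } catch (WriteTimeoutException e) {
                    // Mirrors the failure above: required = 4, received = 3
                    System.err.printf("Timed out at %s: %d/%d replicas acknowledged%n",
                            e.getConsistencyLevel(),
                            e.getReceivedAcknowledgements(),
                            e.getRequiredAcknowledgements());
                }
            }
        }
    }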
  • Restore Monitor Stack command: $ hydra investigate show-monitor e59d59d8-3892-4fda-8e50-e651387aff96
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs e59d59d8-3892-4fda-8e50-e651387aff96

Logs:

Jenkins job URL

@enaydanov
Contributor

Installation details

Kernel Version: 5.13.0-1019-aws

Scylla version (or git commit hash): 5.1.dev-20220317.c45050895403 with build-id 076f513b6143670def988c7626389b270411f8f7

Cluster size: 4 nodes (i3.2xlarge)

Scylla Nodes used in this run:

  • longevity-lwt-3h-master-db-node-4474867f-5 (52.215.235.122 | 10.0.2.206) (shards: 8)
  • longevity-lwt-3h-master-db-node-4474867f-4 (34.245.34.73 | 10.0.1.138) (shards: 8)
  • longevity-lwt-3h-master-db-node-4474867f-3 (54.154.28.232 | 10.0.0.190) (shards: 8)
  • longevity-lwt-3h-master-db-node-4474867f-2 (34.244.117.244 | 10.0.0.207) (shards: 8)
  • longevity-lwt-3h-master-db-node-4474867f-1 (54.72.128.216 | 10.0.0.171) (shards: 8)

OS / Image: ami-0d27db767997d0ec4 (aws: eu-west-1)

Test: longevity-lwt-3h-test

Test id: 4474867f-7e92-4ec4-af38-da3a541a7847

Test name: longevity/longevity-lwt-3h-test

Test config file(s):

Issue description

Almost the same issue, but the c-s command failed with a different error:

Error executing command: "echo TAG: loader_idx:0-cpu_idx:0-keyspace_idx:1; STRESS_TEST_MARKER=T4X8YHQIMEC84R3IQYDI; cassandra-stress write no-warmup cl=ALL n=10000 -schema 'keyspace=keyspace_new_dc replication(strategy=NetworkTopologyStrategy,eu-west=3,eu-west_nemesis_dc=1) compression=LZ4Compressor compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native compression=lz4 -rate threads=5 -pop seq=1..10000 -log interval=5 -node 10.0.0.171 -errors skip-unsupported-columns"; Exit status: 1
STDOUT: sis_dc; Host: /10.0.2.206; Rack: 1a
Datatacenter: eu-west; Host: /10.0.0.190; Rack: 1a
Datatacenter: eu-west; Host: /10.0.0.207; Rack: 1a
Datatacenter: eu-west; Host: /10.0.1.138; Rack: 1a
Datatacenter: eu-west; Host: /10.0.0.171; Rack: 1a
STDERR: WARN  06:11:59,683 No schema agreement from live replicas after 10 s. The schema may not be up to date on some nodes.
java.lang.RuntimeException: Encountered exception creating schema
 at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpacesNative(SettingsSchema.java:100)
 at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpaces(SettingsSchema.java:69)
 at org.apache.cassandra.stress.settings.StressSettings.maybeCreateKeyspaces(StressSettings.java:228)
 at org.apache.cassandra.stress.StressAction.run(StressAction.java:58)
 at org.apache.cassandra.stress.Stress.run(Stress.java:143)
 at org.apache.cassandra.stress.Stress.main(Stress.java:62)
Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Keyspace 'keyspace_new_dc' does not exist
 at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:50)
 at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35)
 at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:293)
 at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58)
 at org.apache.cassandra.stress.util.JavaDriverClient.execute(JavaDriverClient.java:190)
 at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpacesNative(SettingsSchema.java:86)
 ... 5 more
Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Keyspace 'keyspace_new_dc' does not exist
 at com.datastax.driver.core.Responses$Error.asException(Responses.java:181)
 at com.datastax.driver.core.DefaultResultSetFuture.onSet(DefaultResultSetFuture.java:215)
 at com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:270)
 at com.datastax.driver.core.RequestHandler.access$2600(RequestHandler.java:62)
 at com.datastax.driver.core.RequestHandler$SpeculativeExecution.setFinalResult(RequestHandler.java:1052)
 at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:855)
 at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1360)
 at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1278)
 at com.datastax.shaded.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
 at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
 at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
 at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
 at com.datastax.shaded.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
 at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
 at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
 at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
 at com.datastax.shaded.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
 at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
 at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
 at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
 at com.datastax.shaded.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
 ...
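
For context, and as an assumption about the mechanism rather than something confirmed in these logs: c-s creates keyspace_new_dc through one coordinator, and the driver then waits for cluster-wide schema agreement, up to its default maxSchemaAgreementWaitSeconds of 10 s (matching the warning above); if agreement is not reached, a follow-up statement routed to another node can still fail with "Keyspace 'keyspace_new_dc' does not exist". A minimal sketch of waiting longer for agreement before touching a freshly created keyspace:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class SchemaAgreementWait {
        public static void main(String[] args) throws InterruptedException {
            try (Cluster cluster = Cluster.builder().addContactPoint("10.0.0.171").build();
                 Session session = cluster.connect()) {
                session.execute("CREATE KEYSPACE IF NOT EXISTS keyspace_new_dc "
                        + "WITH replication = {'class': 'NetworkTopologyStrategy', "
                        + "'eu-west': 3, 'eu-west_nemesis_dc': 1}");
                // Keep polling past the driver's built-in wait until all live nodes agree
                long deadline = System.currentTimeMillis() + 60_000;
                while (!cluster.getMetadata().checkSchemaAgreement()
                        && System.currentTimeMillis() < deadline) {
                    Thread.sleep(1_000);
                }
                if (!cluster.getMetadata().checkSchemaAgreement()) {
                    throw new IllegalStateException("schema still not in agreement, aborting");
                }
                session.execute("USE keyspace_new_dc");
            }
        }
    }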
  • Restore Monitor Stack command: $ hydra investigate show-monitor 4474867f-7e92-4ec4-af38-da3a541a7847
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 4474867f-7e92-4ec4-af38-da3a541a7847

Logs:

Jenkins job URL

@rayakurl rayakurl added the triage/master Looking for assignee label Mar 30, 2022
@slivne
Contributor

slivne commented Apr 3, 2022

I expect this is the same issue as #10296

Gossip not being in sync across the nodes will cause this failure as well.

@slivne
Contributor

slivne commented Apr 3, 2022

I did not check the complete sequence, but you can. I expect a simple reproducer can also be created for this case.

@slivne slivne added this to the 5.1 milestone Apr 3, 2022
@eliransin
Contributor

@asias can we somehow verify from the logs whether this is indeed the same cause (old nodes not leaving the cluster)?

@DoronArazii

@asias ^^

@DoronArazii

@eliransin / @asias ping

@roydahan

roydahan commented Oct 9, 2022

I don't think this issue still happens: we have run this longevity and this nemesis many times since, and I don't see further reports of it reproducing.
If we still see it, please re-open it.

@asias
Contributor

asias commented Oct 9, 2022

I am closing this.

In #10296, we requested that more info be collected when the issue happens:


    nodetool status
    nodetool gossipinfo
    nodetool ring
    nodetool describecluster

@asias asias closed this as completed Oct 9, 2022
@DoronArazii DoronArazii modified the milestones: 5.1, 5.2 Nov 8, 2022