The "nodetool repair -pr system_auth" command takes 22 minutes (instead of 3 seconds) #15159

Closed · 2 tasks
yarongilor opened this issue Aug 24, 2023 · 20 comments
Labels: area/repair, Field-Tier1 (Urgent issues as per FieldEngineering request), P1 (Urgent)

@yarongilor

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

At test start, a nodetool repair of the system_auth keyspace is executed on all nodes in parallel.
It usually takes about 3 seconds, but in several recent runs it took more than 22 minutes.
SCT enforces a 20-minute (1200-second) timeout on the command, so the test failed as follows:

< t:2023-08-12 03:25:19,693 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr system_auth "...
< t:2023-08-12 03:25:19,694 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr system_auth "...
< t:2023-08-12 03:25:19,698 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr system_auth "...
< t:2023-08-12 03:25:19,703 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr system_auth "...
< t:2023-08-12 03:45:20,120 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > Command: "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr system_auth "
< t:2023-08-12 03:45:20,121 f:decorators.py   l:72   c:sdcm.utils.decorators p:DEBUG > '_run': failed with 'RetryableNetworkException('Command did not complete within 1200 seconds!\n\nCommand: "/usr/bin/nodetool -u cassandra -pw \'cassandra\'  repair -pr system_auth "\n\nStdout:\n\n[2023-08-12 03:25:20,660] Starting repair command #5, repairing 1 ranges for keyspace system_auth (parallelism=SEQUENTIAL, full=true)\n\nStderr:\n\n\n\n')', retrying [#0]
< t:2023-08-12 03:45:20,128 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > Command: "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr system_auth "
< t:2023-08-12 03:45:20,128 f:decorators.py   l:72   c:sdcm.utils.decorators p:DEBUG > '_run': failed with 'RetryableNetworkException('Command did not complete within 1200 seconds!\n\nCommand: "/usr/bin/nodetool -u cassandra -pw \'cassandra\'  repair -pr system_auth "\n\nStdout:\n\n[2023-08-12 03:25:20,847] Starting repair command #1, repairing 1 ranges for keyspace system_auth (parallelism=SEQUENTIAL, full=true)\n\nStderr:\n\n\n\n')', retrying [#0]
< t:2023-08-12 03:45:20,176 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > Command: "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr system_auth "
< t:2023-08-12 03:45:20,176 f:decorators.py   l:72   c:sdcm.utils.decorators p:DEBUG > '_run': failed with 'RetryableNetworkException('Command did not complete within 1200 seconds!\n\nCommand: "/usr/bin/nodetool -u cassandra -pw \'cassandra\'  repair -pr system_auth "\n\nStdout:\n\n[2023-08-12 03:25:20,720] Starting repair command #5, repairing 1 ranges for keyspace system_auth (parallelism=SEQUENTIAL, full=true)\n\nStderr:\n\n\n\n')', retrying [#0]
< t:2023-08-12 03:45:20,314 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > Command: "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr system_auth "
< t:2023-08-12 03:45:20,314 f:decorators.py   l:72   c:sdcm.utils.decorators p:DEBUG > '_run': failed with 'RetryableNetworkException('Command did not complete within 1200 seconds!\n\nCommand: "/usr/bin/nodetool -u cassandra -pw \'cassandra\'  repair -pr system_auth "\n\nStdout:\n\n[2023-08-12 03:25:20,604] Starting repair command #5, repairing 1 ranges for keyspace system_auth (parallelism=SEQUENTIAL, full=true)\n\nStderr:\n\n\n\n')', retrying [#0]

Node setup then failed due to the repair timeout on system_auth:

< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR > Exception in setUp. Will call tearDown
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR > Traceback (most recent call last):
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 164, in wrapper
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     return method(*args, **kwargs)
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 119, in inner
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     res = func(*args, **kwargs)
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 825, in setUp
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     self.set_system_auth_rf(db_cluster=db_cluster)
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 912, in set_system_auth_rf
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     parallel_objects.run(_nodetool_repair)
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 444, in run
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     raise ParallelObjectException(results=results)
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR > sdcm.utils.common.ParallelObjectException: Node longevity-50gb-12h-master-db-node-18cdaa5c-1 [52.48.20.193 | 10.4.10.40] (seed: True):
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >  Traceback (most recent call last):
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 428, in run
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     result = future.result(time_out)
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 460, in result
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     raise TimeoutError()
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR > concurrent.futures._base.TimeoutError
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR > Node longevity-50gb-12h-master-db-node-18cdaa5c-2 [54.194.113.110 | 10.4.8.155] (seed: True):
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >  Traceback (most recent call last):
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 428, in run
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     result = future.result(time_out)
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 460, in result
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     raise TimeoutError()
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR > concurrent.futures._base.TimeoutError
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR > Node longevity-50gb-12h-master-db-node-18cdaa5c-3 [34.244.252.187 | 10.4.8.141] (seed: True):
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >  Traceback (most recent call last):
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 428, in run
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     result = future.result(time_out)
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 460, in result
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     raise TimeoutError()
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR > concurrent.futures._base.TimeoutError
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR > Node longevity-50gb-12h-master-db-node-18cdaa5c-4 [34.242.216.29 | 10.4.11.104] (seed: True):
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >  Traceback (most recent call last):
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 428, in run
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     result = future.result(time_out)
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 460, in result
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     raise TimeoutError()
< t:2023-08-12 03:50:19,698 f:tester.py       l:171  c:sdcm.tester          p:ERROR > concurrent.futures._base.TimeoutError
< t:2023-08-12 03:50:19,703 f:tester.py       l:229  c:silence              p:DEBUG > Finished 'Sending test end event'. No errors were silenced.
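
For reference, the repair command and its timeout can be exercised outside SCT. Below is a minimal sketch (plain Python standard library, not SCT code) that runs the same nodetool command shown in the log above and reports whether it finished within the 1200-second limit from the error message:

import subprocess
import time

# Same command SCT issues on each node; the timeout matches the
# "did not complete within 1200 seconds" error above.
CMD = "/usr/bin/nodetool -u cassandra -pw 'cassandra' repair -pr system_auth"
TIMEOUT_S = 1200

start = time.monotonic()
try:
    subprocess.run(CMD, shell=True, check=True, timeout=TIMEOUT_S)
    print(f"repair finished in {time.monotonic() - start:.1f} s")
except subprocess.TimeoutExpired:
    print(f"repair did not finish within {TIMEOUT_S} s")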

This repair is expected to take ~3 seconds, as seen in the previous run:

< t:2023-08-06 08:34:17,539 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr system_auth "...
< t:2023-08-06 08:34:17,541 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr system_auth "...
< t:2023-08-06 08:34:17,542 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr system_auth "...
< t:2023-08-06 08:34:17,548 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr system_auth "...
< t:2023-08-06 08:34:21,066 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr system_auth " finished with status 0
< t:2023-08-06 08:34:21,067 f:cluster.py      l:2623 c:sdcm.cluster_aws     p:DEBUG > Node longevity-50gb-12h-master-db-node-d0849e72-4 [54.171.224.8 | 10.4.9.53] (seed: True): Command '/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr system_auth ' duration -> 3.5151809980000053 s
< t:2023-08-06 08:34:21,073 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr system_auth " finished with status 0
< t:2023-08-06 08:34:21,073 f:cluster.py      l:2623 c:sdcm.cluster_aws     p:DEBUG > Node longevity-50gb-12h-master-db-node-d0849e72-3 [34.242.226.85 | 10.4.8.252] (seed: True): Command '/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr system_auth ' duration -> 3.5238472599999113 s
< t:2023-08-06 08:34:21,097 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr system_auth " finished with status 0
< t:2023-08-06 08:34:21,097 f:cluster.py      l:2623 c:sdcm.cluster_aws     p:DEBUG > Node longevity-50gb-12h-master-db-node-d0849e72-2 [54.78.180.130 | 10.4.8.14] (seed: True): Command '/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr system_auth ' duration -> 3.5486178639998798 s
< t:2023-08-06 08:34:21,101 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr system_auth " finished with status 0
< t:2023-08-06 08:34:21,101 f:cluster.py      l:2623 c:sdcm.cluster_aws     p:DEBUG > Node longevity-50gb-12h-master-db-node-d0849e72-1 [34.246.162.61 | 10.4.9.155] (seed: True): Command '/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr system_auth ' duration -> 3.5554634939999232 s
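
The ~3.5 s per-node durations above come from SCT running the repair on all nodes concurrently. A hypothetical, simplified equivalent of that step (not the actual ParallelObject/libssh2 runner; the SSH user and plain ssh invocation are assumptions) would look roughly like this:

import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

# Private IPs taken from the "Scylla Nodes used in this run" list below.
NODES = ["10.4.10.40", "10.4.8.155", "10.4.8.141", "10.4.11.104"]
REPAIR = "/usr/bin/nodetool -u cassandra -pw 'cassandra' repair -pr system_auth"

def repair_node(ip):
    started = time.monotonic()
    # SSH user is an assumption; SCT uses its own libssh2-based runner.
    subprocess.run(["ssh", f"scyllaadm@{ip}", REPAIR], check=True, timeout=1200)
    return ip, time.monotonic() - started

with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
    for ip, seconds in pool.map(repair_node, NODES):
        print(f"{ip}: repair took {seconds:.1f} s")  # ~3.5 s in a healthy run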

Impact

The immediate impact is a very long repair duration (it is not yet clear how relevant this is to customers).

How frequently does it reproduce?

Reproduced in several recent runs.

Installation details

Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-50gb-12h-master-db-node-18cdaa5c-4 (34.242.216.29 | 10.4.11.104) (shards: 8)
  • longevity-50gb-12h-master-db-node-18cdaa5c-3 (34.244.252.187 | 10.4.8.141) (shards: 13)
  • longevity-50gb-12h-master-db-node-18cdaa5c-2 (54.194.113.110 | 10.4.8.155) (shards: 12)
  • longevity-50gb-12h-master-db-node-18cdaa5c-1 (52.48.20.193 | 10.4.10.40) (shards: 14)

OS / Image: ami-037a9a5f6d0cf7444 (aws: undefined_region)

Test: longevity-150gb-asymmetric-cluster-12h-test
Test id: 18cdaa5c-d2d0-49cf-aebf-273830fac739
Test name: scylla-master/longevity/longevity-150gb-asymmetric-cluster-12h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 18cdaa5c-d2d0-49cf-aebf-273830fac739
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 18cdaa5c-d2d0-49cf-aebf-273830fac739

Logs:

Jenkins job URL
Argus

@yarongilor (Author)

Run 12 had the following log output:

2023-08-12T03:25:21+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | scylla[5498]:  [shard  0] repair - repair[9e45e542-2f21-4093-9e49-b2e7c7fd3d3f]: starting user-requested repair for keyspace system_auth, repair id 1, options {{trace -> false}, {jobThreads -> 1}, {incremental -> false}, {parallelism -> parallel}, {primaryRange -> true}}
2023-08-12T03:25:21+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | scylla[5498]:  [shard  0] repair - primary-range repair
2023-08-12T03:25:21+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | scylla[5498]:  [shard  0] repair - repair[9e45e542-2f21-4093-9e49-b2e7c7fd3d3f]: Skipped sending repair_flush_hints_batchlog to nodes={10.4.11.104, 10.4.8.155, 10.4.8.141, 10.4.10.40}
2023-08-12T03:25:21+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | scylla[5498]:  [shard  0] repair - repair[9e45e542-2f21-4093-9e49-b2e7c7fd3d3f]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-12T03:25:21+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | scylla[5498]:  [shard  3] repair - repair[9e45e542-2f21-4093-9e49-b2e7c7fd3d3f]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-12T03:25:21+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | scylla[5498]:  [shard  2] repair - repair[9e45e542-2f21-4093-9e49-b2e7c7fd3d3f]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-12T03:25:21+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | scylla[5498]:  [shard  6] repair - repair[9e45e542-2f21-4093-9e49-b2e7c7fd3d3f]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-12T03:25:21+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | scylla[5498]:  [shard 13] repair - repair[9e45e542-2f21-4093-9e49-b2e7c7fd3d3f]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-12T03:25:21+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | scylla[5498]:  [shard 12] repair - repair[9e45e542-2f21-4093-9e49-b2e7c7fd3d3f]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-12T03:25:21+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | scylla[5498]:  [shard 11] repair - repair[9e45e542-2f21-4093-9e49-b2e7c7fd3d3f]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-12T03:25:21+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | scylla[5498]:  [shard  8] repair - repair[9e45e542-2f21-4093-9e49-b2e7c7fd3d3f]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-12T03:25:21+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | scylla[5498]:  [shard  7] repair - repair[9e45e542-2f21-4093-9e49-b2e7c7fd3d3f]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-12T03:25:21+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | scylla[5498]:  [shard 10] repair - repair[9e45e542-2f21-4093-9e49-b2e7c7fd3d3f]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-12T03:25:21+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | scylla[5498]:  [shard  5] repair - repair[9e45e542-2f21-4093-9e49-b2e7c7fd3d3f]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-12T03:25:21+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | scylla[5498]:  [shard  9] repair - repair[9e45e542-2f21-4093-9e49-b2e7c7fd3d3f]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-12T03:25:21+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | scylla[5498]:  [shard  4] repair - repair[9e45e542-2f21-4093-9e49-b2e7c7fd3d3f]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-12T03:25:21+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | scylla[5498]:  [shard  1] repair - repair[9e45e542-2f21-4093-9e49-b2e7c7fd3d3f]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-12T03:25:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7688]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:25:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7688]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:25:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7688]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:26:08+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7693]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:26:08+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7693]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:26:08+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7693]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:26:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7698]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:26:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7698]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:26:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7698]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:27:08+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7706]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:27:08+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7706]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:27:08+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7706]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:27:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7711]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:27:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7711]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:27:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7711]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:28:08+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7716]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:28:08+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7716]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:28:08+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7716]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:28:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7721]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:28:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7721]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:28:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7721]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:29:08+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7730]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:29:08+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7730]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:29:08+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7730]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:29:24+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | amazon-ssm-agent[512]: 2023-08-12 03:29:24 WARN EC2RoleProvider Failed to connect to Systems Manager with instance profile role credentials. Err: retrieved credentials failed to report to ssm. RequestId: 8ed3a0d2-55b1-4234-ac1f-9601fd07cfa5 Error: AccessDeniedException: User: arn:aws:sts::797456418907:assumed-role/qa-scylla-manager-backup-role/i-043d9d1e169e73fb9 is not authorized to perform: ssm:UpdateInstanceInformation on resource: arn:aws:ec2:eu-west-1:797456418907:instance/i-043d9d1e169e73fb9 because no identity-based policy allows the ssm:UpdateInstanceInformation action
2023-08-12T03:29:24+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | amazon-ssm-agent[512]:       status code: 400, request id: 8ed3a0d2-55b1-4234-ac1f-9601fd07cfa5
2023-08-12T03:29:24+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | amazon-ssm-agent[512]: 2023-08-12 03:29:24 ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role credentials. error calling RequestManagedInstanceRoleToken: AccessDeniedException: Systems Manager's instance management role is not configured for account: 797456418907
2023-08-12T03:29:24+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | amazon-ssm-agent[512]:       status code: 400, request id: cabc9567-2498-426c-80e5-07ae53f6dc8b
2023-08-12T03:29:24+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | amazon-ssm-agent[512]: 2023-08-12 03:29:24 ERROR [CredentialRefresher] Retrieve credentials produced error: no valid credentials could be retrieved for ec2 identity
2023-08-12T03:29:24+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | amazon-ssm-agent[512]: 2023-08-12 03:29:24 INFO [CredentialRefresher] Sleeping for 17m35s before retrying retrieve credentials
2023-08-12T03:29:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7737]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:29:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7737]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:29:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7737]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:30:08+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7744]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:30:08+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7744]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:30:08+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7744]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:30:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7749]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:30:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7749]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:30:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7749]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:31:08+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7754]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:31:08+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7754]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:31:08+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7754]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:31:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7759]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:31:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7759]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:31:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7759]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:32:08+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7764]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:32:08+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7764]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:32:08+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7764]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:32:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7769]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:32:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7769]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:32:38+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7769]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:33:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7774]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:33:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7774]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:33:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7774]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:33:39+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7779]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:33:39+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7779]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:33:39+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7779]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:34:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7784]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:34:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7784]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:34:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7784]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:34:39+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7789]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:34:39+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7789]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:34:39+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7789]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:35:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7795]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:35:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7795]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:35:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7795]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:35:39+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7801]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:35:39+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7801]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:35:39+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7801]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:36:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7806]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:36:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7806]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:36:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7806]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:36:20+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sshd[4725]: Timeout, client not responding from user scyllaadm 10.4.11.37 port 35718
2023-08-12T03:36:20+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sshd[4702]: pam_unix(sshd:session): session closed for user scyllaadm
2023-08-12T03:36:20+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | systemd-logind[536]: Session 1 logged out. Waiting for processes to exit.
2023-08-12T03:36:39+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7813]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:36:39+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7813]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:36:39+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7813]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:37:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7818]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:37:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7818]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:37:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7818]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:37:39+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7823]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:37:39+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7823]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:37:39+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7823]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:38:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7828]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:38:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7828]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:38:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7828]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:38:39+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7833]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2023-08-12T03:38:39+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7833]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2023-08-12T03:38:39+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | sudo[7833]: pam_unix(sudo:session): session closed for user root
2023-08-12T03:39:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1   !NOTICE | sudo[7839]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
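
In the excerpt above, every shard only ever logs "Started to repair 1 out of 4 tables" (table=role_attributes) and never progresses, whereas the healthy run below reaches 4/4 and completes. A rough sketch for summarizing how far each shard got from a node's log (the log file name is an assumption):

import re
from collections import defaultdict

# Track the furthest "N out of 4 tables" each shard reported for this repair.
progress = defaultdict(int)
pattern = re.compile(r"\[shard\s+(\d+)\] repair .*Started to repair (\d+) out of 4 tables")

with open("node1-messages.log") as log:  # log file name is an assumption
    for line in log:
        match = pattern.search(line)
        if match:
            shard, table_no = int(match.group(1)), int(match.group(2))
            progress[shard] = max(progress[shard], table_no)

for shard in sorted(progress):
    print(f"shard {shard:2d}: reached table {progress[shard]}/4")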

The successful run 13 had the following log output:

2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 0] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: starting user-requested repair for keyspace system_auth, repair id 1, options {{trace -> false}, {jobThreads -> 1}, {incremental -> false}, {parallelism -> parallel}, {primaryRange -> true}}
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 0] repair - primary-range repair
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 0] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Skipped sending repair_flush_hints_batchlog to nodes={10.4.8.56, 10.4.8.43, 10.4.8.128, 10.4.11.62}
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 0] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=roles, table_id=5bc52802-de25-35ed-aeab-188eecebb090, repair_reason=repair
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 1] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=roles, table_id=5bc52802-de25-35ed-aeab-188eecebb090, repair_reason=repair
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 2] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=roles, table_id=5bc52802-de25-35ed-aeab-188eecebb090, repair_reason=repair
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 5] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=roles, table_id=5bc52802-de25-35ed-aeab-188eecebb090, repair_reason=repair
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 9] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=roles, table_id=5bc52802-de25-35ed-aeab-188eecebb090, repair_reason=repair
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 6] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=roles, table_id=5bc52802-de25-35ed-aeab-188eecebb090, repair_reason=repair
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 7] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=roles, table_id=5bc52802-de25-35ed-aeab-188eecebb090, repair_reason=repair
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 8] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=roles, table_id=5bc52802-de25-35ed-aeab-188eecebb090, repair_reason=repair
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 3] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=roles, table_id=5bc52802-de25-35ed-aeab-188eecebb090, repair_reason=repair
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 4] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=roles, table_id=5bc52802-de25-35ed-aeab-188eecebb090, repair_reason=repair
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 5] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 2 out of 4 tables in keyspace=system_auth, table=role_permissions, table_id=3afbe79f-2194-31a7-add7-f5ab90d8ec9c, repair_reason=repair
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 2] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 2 out of 4 tables in keyspace=system_auth, table=role_permissions, table_id=3afbe79f-2194-31a7-add7-f5ab90d8ec9c, repair_reason=repair
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 9] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 2 out of 4 tables in keyspace=system_auth, table=role_permissions, table_id=3afbe79f-2194-31a7-add7-f5ab90d8ec9c, repair_reason=repair
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 4] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 2 out of 4 tables in keyspace=system_auth, table=role_permissions, table_id=3afbe79f-2194-31a7-add7-f5ab90d8ec9c, repair_reason=repair
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 8] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 2 out of 4 tables in keyspace=system_auth, table=role_permissions, table_id=3afbe79f-2194-31a7-add7-f5ab90d8ec9c, repair_reason=repair
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 7] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 2 out of 4 tables in keyspace=system_auth, table=role_permissions, table_id=3afbe79f-2194-31a7-add7-f5ab90d8ec9c, repair_reason=repair
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 6] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 2 out of 4 tables in keyspace=system_auth, table=role_permissions, table_id=3afbe79f-2194-31a7-add7-f5ab90d8ec9c, repair_reason=repair
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 3] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 2 out of 4 tables in keyspace=system_auth, table=role_permissions, table_id=3afbe79f-2194-31a7-add7-f5ab90d8ec9c, repair_reason=repair
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 1] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 2 out of 4 tables in keyspace=system_auth, table=role_permissions, table_id=3afbe79f-2194-31a7-add7-f5ab90d8ec9c, repair_reason=repair
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 0] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 2 out of 4 tables in keyspace=system_auth, table=role_permissions, table_id=3afbe79f-2194-31a7-add7-f5ab90d8ec9c, repair_reason=repair
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 5] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 3 out of 4 tables in keyspace=system_auth, table=role_members, table_id=0ecdaa87-f8fb-3e60-88d1-74fb36fe5c0d, repair_reason=repair
2023-08-17T12:03:45+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 2] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 3 out of 4 tables in keyspace=system_auth, table=role_members, table_id=0ecdaa87-f8fb-3e60-88d1-74fb36fe5c0d, repair_reason=repair
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 9] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 3 out of 4 tables in keyspace=system_auth, table=role_members, table_id=0ecdaa87-f8fb-3e60-88d1-74fb36fe5c0d, repair_reason=repair
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 8] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 3 out of 4 tables in keyspace=system_auth, table=role_members, table_id=0ecdaa87-f8fb-3e60-88d1-74fb36fe5c0d, repair_reason=repair
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 4] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 3 out of 4 tables in keyspace=system_auth, table=role_members, table_id=0ecdaa87-f8fb-3e60-88d1-74fb36fe5c0d, repair_reason=repair
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 7] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 3 out of 4 tables in keyspace=system_auth, table=role_members, table_id=0ecdaa87-f8fb-3e60-88d1-74fb36fe5c0d, repair_reason=repair
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 6] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 3 out of 4 tables in keyspace=system_auth, table=role_members, table_id=0ecdaa87-f8fb-3e60-88d1-74fb36fe5c0d, repair_reason=repair
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 3] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 3 out of 4 tables in keyspace=system_auth, table=role_members, table_id=0ecdaa87-f8fb-3e60-88d1-74fb36fe5c0d, repair_reason=repair
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 1] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 3 out of 4 tables in keyspace=system_auth, table=role_members, table_id=0ecdaa87-f8fb-3e60-88d1-74fb36fe5c0d, repair_reason=repair
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 5] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 4 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 0] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 3 out of 4 tables in keyspace=system_auth, table=role_members, table_id=0ecdaa87-f8fb-3e60-88d1-74fb36fe5c0d, repair_reason=repair
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 2] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 4 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 9] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 4 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 8] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 4 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 4] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 4 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 7] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 4 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 6] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 4 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 3] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 4 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 1] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 4 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 5] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: stats: repair_reason=repair, keyspace=system_auth, tables={roles, role_permissions, role_members, role_attributes}, ranges_nr=256, round_nr=1024, round_nr_fast_path_already_synced=1024, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=15360, tx_hashes_nr=0, rx_hashes_nr=0, duration=1.0556295 seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}}, row_from_disk_nr={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}}, row_from_disk_bytes_per_sec={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}} MiB/s, row_from_disk_rows_per_sec={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 5] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: completed successfully, keyspace=system_auth
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 2] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: stats: repair_reason=repair, keyspace=system_auth, tables={roles, role_permissions, role_members, role_attributes}, ranges_nr=256, round_nr=1024, round_nr_fast_path_already_synced=1024, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=15360, tx_hashes_nr=0, rx_hashes_nr=0, duration=1.0950493 seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}}, row_from_disk_nr={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}}, row_from_disk_bytes_per_sec={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}} MiB/s, row_from_disk_rows_per_sec={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 2] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: completed successfully, keyspace=system_auth
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 9] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: stats: repair_reason=repair, keyspace=system_auth, tables={roles, role_permissions, role_members, role_attributes}, ranges_nr=256, round_nr=1024, round_nr_fast_path_already_synced=1024, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=15360, tx_hashes_nr=0, rx_hashes_nr=0, duration=1.1270905 seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}}, row_from_disk_nr={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}}, row_from_disk_bytes_per_sec={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}} MiB/s, row_from_disk_rows_per_sec={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 9] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: completed successfully, keyspace=system_auth
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 8] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: stats: repair_reason=repair, keyspace=system_auth, tables={roles, role_permissions, role_members, role_attributes}, ranges_nr=256, round_nr=1024, round_nr_fast_path_already_synced=1024, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=15360, tx_hashes_nr=0, rx_hashes_nr=0, duration=1.14056 seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}}, row_from_disk_nr={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}}, row_from_disk_bytes_per_sec={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}} MiB/s, row_from_disk_rows_per_sec={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 8] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: completed successfully, keyspace=system_auth
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 4] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: stats: repair_reason=repair, keyspace=system_auth, tables={roles, role_permissions, role_members, role_attributes}, ranges_nr=256, round_nr=1024, round_nr_fast_path_already_synced=1024, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=15360, tx_hashes_nr=0, rx_hashes_nr=0, duration=1.1680366 seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}}, row_from_disk_nr={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}}, row_from_disk_bytes_per_sec={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}} MiB/s, row_from_disk_rows_per_sec={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 4] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: completed successfully, keyspace=system_auth
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 7] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: stats: repair_reason=repair, keyspace=system_auth, tables={roles, role_permissions, role_members, role_attributes}, ranges_nr=256, round_nr=1024, round_nr_fast_path_already_synced=1024, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=15360, tx_hashes_nr=0, rx_hashes_nr=0, duration=1.1875298 seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}}, row_from_disk_nr={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}}, row_from_disk_bytes_per_sec={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}} MiB/s, row_from_disk_rows_per_sec={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 7] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: completed successfully, keyspace=system_auth
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 6] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: stats: repair_reason=repair, keyspace=system_auth, tables={roles, role_permissions, role_members, role_attributes}, ranges_nr=256, round_nr=1024, round_nr_fast_path_already_synced=1024, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=15360, tx_hashes_nr=0, rx_hashes_nr=0, duration=1.1947986 seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}}, row_from_disk_nr={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}}, row_from_disk_bytes_per_sec={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}} MiB/s, row_from_disk_rows_per_sec={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 6] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: completed successfully, keyspace=system_auth
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 0] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: Started to repair 4 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f, repair_reason=repair
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 3] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: stats: repair_reason=repair, keyspace=system_auth, tables={roles, role_permissions, role_members, role_attributes}, ranges_nr=256, round_nr=1024, round_nr_fast_path_already_synced=1024, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=15360, tx_hashes_nr=0, rx_hashes_nr=0, duration=1.2394117 seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}}, row_from_disk_nr={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}}, row_from_disk_bytes_per_sec={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}} MiB/s, row_from_disk_rows_per_sec={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 3] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: completed successfully, keyspace=system_auth
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 1] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: stats: repair_reason=repair, keyspace=system_auth, tables={roles, role_permissions, role_members, role_attributes}, ranges_nr=256, round_nr=1024, round_nr_fast_path_already_synced=1024, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=15360, tx_hashes_nr=0, rx_hashes_nr=0, duration=1.3180208 seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}}, row_from_disk_nr={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}}, row_from_disk_bytes_per_sec={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}} MiB/s, row_from_disk_rows_per_sec={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
2023-08-17T12:03:46+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 1] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: completed successfully, keyspace=system_auth
2023-08-17T12:03:47+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 0] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: stats: repair_reason=repair, keyspace=system_auth, tables={roles, role_permissions, role_members, role_attributes}, ranges_nr=256, round_nr=1024, round_nr_fast_path_already_synced=1024, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=15360, tx_hashes_nr=0, rx_hashes_nr=0, duration=1.5279031 seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}}, row_from_disk_nr={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}}, row_from_disk_bytes_per_sec={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}} MiB/s, row_from_disk_rows_per_sec={{10.4.8.128, 0}, {10.4.8.43, 0}, {10.4.8.56, 0}, {10.4.11.62, 0}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
2023-08-17T12:03:47+00:00 longevity-50gb-12h-master-db-node-b9965f1b-1     !INFO | scylla[5517]:  [shard 0] repair - repair[c0593a4d-2555-46f8-a948-f4b04ba8d083]: completed successfully, keyspace=system_auth

@mykaul
Contributor

mykaul commented Aug 27, 2023

@yarongilor - what version is that?

@yarongilor
Author

yarongilor commented Aug 27, 2023

@yarongilor - what version is that?

thanks @mykaul , it's:

ScyllaDB version with details: 5.4.0~dev-0.20230809.108e510a2391 with build-id 1c11038a8e78bc420070c7676c409e5fbe27d901

P.S. - SCT/Argus might have experienced an issue retrieving the version because of this "too early" failure.

Run Details

Release: scylla-master
Group: Long Weekly Longevities - Tier1
Test: longevity-150gb-asymmetric-cluster-12h-test
Id: 18cdaa5c-d2d0-49cf-aebf-273830fac739
Start time: 2023-08-12 03:06:43
End time: 2023-08-12 03:52:50
Duration: 46 minutes
Started by: timer
Build job: [scylla-master/longevity/longevity-150gb-asymmetric-cluster-12h-test](https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-150gb-asymmetric-cluster-12h-test/12/)
System Information
Backend: aws
Region: eu-west-1
Image id: ami-037a9a5f6d0cf7444
Instance type: i4i.4xlarge
Node amount: 4

@ptrsmrn
Contributor

ptrsmrn commented Aug 28, 2023

I see a recurring error in messages.log (and other log files as well). It appears in the failed run 12 but not in the successful run 13 (based on the snippets provided in this issue), but it probably doesn't matter:

2023-08-12T03:47:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-4     !INFO | amazon-ssm-agent[514]: 2023-08-12 03:47:09 WARN EC2RoleProvider Failed to connect to Systems Manager with instance profile role credentials. Err: retrieved credentials failed to report to ssm. RequestId: 0b4411b6-0128-419a-b63c-b3ee133b5c3d Error: AccessDeniedException: User: arn:aws:sts::797456418907:assumed-role/qa-scylla-manager-backup-role/i-0abe28a20013c1b5b is not authorized to perform: ssm:UpdateInstanceInformation on resource: arn:aws:ec2:eu-west-1:797456418907:instance/i-0abe28a20013c1b5b because no identity-based policy allows the ssm:UpdateInstanceInformation action
2023-08-12T03:47:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-4     !INFO | amazon-ssm-agent[514]: 	status code: 400, request id: 0b4411b6-0128-419a-b63c-b3ee133b5c3d
2023-08-12T03:47:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-4     !INFO | amazon-ssm-agent[514]: 2023-08-12 03:47:09 ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role credentials. error calling RequestManagedInstanceRoleToken: AccessDeniedException: Systems Manager's instance management role is not configured for account: 797456418907
2023-08-12T03:47:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-4     !INFO | amazon-ssm-agent[514]: 	status code: 400, request id: c50aaae3-5527-4c9d-9e8a-fbeb1c864d86
2023-08-12T03:47:09+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-4     !INFO | amazon-ssm-agent[514]: 2023-08-12 03:47:09 ERROR [CredentialRefresher] Retrieve credentials produced error: no valid credentials could be retrieved for ec2 identity

@ptrsmrn
Contributor

ptrsmrn commented Aug 28, 2023

There's also an exception on node 1, which doesn't look good, but maybe it's OK after all:

2023-08-12T03:25:18+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | scylla[5498]:  [shard  0] cql_server - exception while processing connection: std::_Nested_exception<std::system_error> (error GnuTLS:-53, Error in the push function.): std::system_error (error system:32, sendmsg: Broken pipe)

@ptrsmrn
Contributor

ptrsmrn commented Aug 28, 2023

Interesting: at both 2023-08-12T03:25:21 and 2023-08-12T03:45:26 a repair on node 1 started on all 14 shards, but it seems it only started and then got stuck (?). I only see the following logs and nothing more:

2023-08-12T03:45:26+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | scylla[5498]:  [shard  0] repair - repair[e2c95e1d-2446-4621-84d1-5adb55aa2b88]: starting user-requested repair for keyspace system_auth, repair id 2, options {{trace -> false}, {jobThreads -> 1}, {incremental -> false}, {parallelism -> parallel}, {primaryRange -> true}}
2023-08-12T03:45:26+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | scylla[5498]:  [shard  0] repair - primary-range repair
2023-08-12T03:45:26+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | scylla[5498]:  [shard  0] repair - repair[e2c95e1d-2446-4621-84d1-5adb55aa2b88]: Skipped sending repair_flush_hints_batchlog to nodes={10.4.11.104, 10.4.8.155, 10.4.8.141, 10.4.10.40}
2023-08-12T03:45:26+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-1     !INFO | scylla[5498]:  [shard  0] repair - repair[e2c95e1d-2446-4621-84d1-5adb55aa2b88]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=role_attributes, table_id=6b8c7359-a843-33f2-a1d8-5dc6a187436f,
...
(last message is repeated for all shards, i.e. 14 times in total)

That being said, I see that repair attempts before 2023-08-12T03:25:21 were successful.
The last successful repair happened at:

2023-08-12T03:24:02+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-4     !INFO | scylla[5596]:  [shard 0] repair - bootstrap_with_repair: finished with keyspace=system_auth, nr_ranges=256

The next one got only as far as starting 2 out of 4 tables:

2023-08-12T03:25:21+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-4     !INFO | scylla[5596]:  [shard 0] repair - repair[da02aa4f-92b6-4937-b53a-71eae835dc9f]: Started to repair 2 out of 4 tables in keyspace=system_auth, table=roles, table_id=5bc52802-de25-35ed-aeab-188eecebb090, repair_reason=repair

And the last one got only as far as starting 1 out of 4 tables:

2023-08-12T03:45:26+00:00 longevity-50gb-12h-master-db-node-18cdaa5c-4     !INFO | scylla[5596]:  [shard 7] repair - repair[80e5b71b-49b4-41a2-8f1d-d04ef717cf38]: Started to repair 1 out of 4 tables in keyspace=system_auth, table=role_permissions, table_id=3afbe79f-2194-31a7-add7-f5ab90d8ec9c, repair_reason=repair
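For anyone triaging similar runs, a quick way to spot repairs that started but never finished is to pair the "starting user-requested repair" lines with the "completed successfully" lines by repair id. Below is a minimal sketch based only on the log format quoted in this issue; the script and the default file name are illustrative and not part of SCT or Scylla:

```
#!/usr/bin/env python3
# Illustrative helper: list repair ids that logged a "starting user-requested repair"
# line but never a "completed successfully" line, based on the messages.log
# format quoted in this issue.
import re
import sys

START_RE = re.compile(r"repair\[([0-9a-f-]+)\]: starting user-requested repair for keyspace (\S+),")
DONE_RE = re.compile(r"repair\[([0-9a-f-]+)\]: completed successfully")

def find_stuck_repairs(log_path):
    started = {}        # repair id -> keyspace
    completed = set()   # repair ids that logged at least one per-shard completion
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            if m := START_RE.search(line):
                started[m.group(1)] = m.group(2)
            elif m := DONE_RE.search(line):
                completed.add(m.group(1))
    return {rid: ks for rid, ks in started.items() if rid not in completed}

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "messages.log"
    for rid, ks in find_stuck_repairs(path).items():
        print(f"repair {rid} on keyspace {ks} never logged 'completed successfully'")
```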

@ptrsmrn
Contributor

ptrsmrn commented Aug 28, 2023

I don't see anything else in the log files. More details would help us understand why the repair gets stuck.
@asias, could you please advise?

@ptrsmrn
Contributor

ptrsmrn commented Aug 30, 2023

@asias ping

@asias
Contributor

asias commented Aug 31, 2023

This repair is expected to take ~3 seconds, as seen in the previous run:

Was the previous run using the exact same SCT version and VM types (number of shards)?

If the SCT version was updated, check whether it contains the SCT change that modifies how we repair the system_auth table and its RF (e.g., switching from SimpleStrategy to NetworkTopologyStrategy, or changing the RF of system_auth). I remember there was a bug in SCT in this area and it was fixed.

Note that the number of shards differed between the nodes. This makes the mutation reader eviction issue more visible.

Another thing I noticed: at 3:45 a second system_auth repair was started before the previous one, started at 3:25, had finished. This looks like an SCT issue. There is no point starting the second repair before the first repair of system_auth finishes. Please fix it.

Aug 12 03:25:20 longevity-50gb-12h-master-db-node-18cdaa5c-2 scylla[5518]:  [shard  0] repair - repair[ab1ccf55-3a3e-4adf-90cc-b5cce8aace7e]: starting user-requested repair for keyspace system_auth, repair id 5, options {{trace -> false}, {jobThreads -> 1}, {incremental -> false}, {parallelism -> parallel}, {primaryRange -> true}}

Aug 12 03:45:25 longevity-50gb-12h-master-db-node-18cdaa5c-2 scylla[5518]:  [shard  0] repair - repair[2ab42a79-40d0-4e3c-a064-cc71c1fe2ad9]: starting user-requested repair for keyspace system_auth, repair id 6, options {{trace -> false}, {jobThreads -> 1}, {incremental -> false}, {parallelism -> parallel}, {primaryRange -> true}} 

Also, please increase the timeout in SCT; it is important to know whether the system_auth repair eventually finishes or gets stuck.
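In other words, the SCT-side fix is to run the system_auth repair strictly one node at a time and to give it a much longer timeout. A minimal sketch of that flow follows; `node.remoter.run`, its signature, and the timeout value stand in for SCT's real command runner and are assumptions, not actual SCT code:

```
# Minimal sketch, not actual SCT code: run the system_auth repair one node at a
# time with a generous timeout, so a second repair is never started while the
# first one is still running.
REPAIR_CMD = "/usr/bin/nodetool -u cassandra -pw 'cassandra' repair -pr system_auth"
REPAIR_TIMEOUT = 3600  # seconds; long enough to learn whether the repair ever finishes

def repair_system_auth_sequentially(db_nodes):
    for node in db_nodes:
        # Blocks until this node's repair completes (or the long timeout expires)
        # before moving on to the next node.
        node.remoter.run(REPAIR_CMD, timeout=REPAIR_TIMEOUT)
```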

@mykaul
Contributor

mykaul commented Aug 31, 2023

If the SCT version was updated, check whether it contains the SCT change that modifies how we repair the system_auth table and its RF (e.g., switching from SimpleStrategy to NetworkTopologyStrategy, or changing the RF of system_auth). I remember there was a bug in SCT in this area and it was fixed.

Does it matter - what can justify 22m to repair it?

@asias
Contributor

asias commented Aug 31, 2023

If the SCT version was updated, check whether it contains the SCT change that modifies how we repair the system_auth table and its RF (e.g., switching from SimpleStrategy to NetworkTopologyStrategy, or changing the RF of system_auth). I remember there was a bug in SCT in this area and it was fixed.

Does it matter - what can justify 22m to repair it?

It matters in the sense that we want to know what changed and what contributed to the longer repair time. If the repair indeed finished eventually, the main factor is presumably the mutation reader eviction problem caused by the different shard counts. Another factor is the network latency between nodes combined with a high RF (not in this test, which only has 4 nodes).

@DoronArazii DoronArazii removed the triage/master Looking for assignee label Aug 31, 2023
@roydahan

Is anyone looking at this issue?
@ptrsmrn / @asias ?

@betsch85

betsch85 commented Sep 27, 2023

I found this thread by googling "Systems Manager's instance management role is not configured for account". Weirdly, every time this message appears in the syslog, everything starts to take forever, so be aware that it might be caused by amazon-ssm-agent, not ScyllaDB.

ignore the above

@mykaul
Contributor

mykaul commented Sep 27, 2023

@asias - can you please take a look at this?

@asias
Contributor

asias commented Sep 27, 2023

Did you mean the Systems Manager config thing? I am not sure what it is about.

@mykaul
Contributor

mykaul commented Sep 27, 2023

I found this thread by googling "Systems Manager's instance management role is not configured for account". Weirdly, every time this message appears in the syslog, everything starts to take forever, so be aware that it might be caused by amazon-ssm-agent, not ScyllaDB.

I don't see a direct correlation here. If there were one, it should have impacted all repairs.

@ptrsmrn
Contributor

ptrsmrn commented Oct 5, 2023

@asias @yarongilor do you think we could try to reproduce this issue and collect more info/logs?

@mykaul
Contributor

mykaul commented Oct 12, 2023

@ptrsmrn - not sure if it's materially different than https://github.com/scylladb/scylla-enterprise/issues/3103#issuecomment-1758913004

@ptrsmrn
Contributor

ptrsmrn commented Oct 14, 2023

@ptrsmrn - not sure if it's materially different than scylladb/scylla-enterprise#3103 (comment)

The root cause could be the same, but @asias pointed out a couple of things in his comment (#15159 (comment)), which have not been resolved and are something extra on top of https://github.com/scylladb/scylla-enterprise/issues/3103, so these issues don't seem to be 1:1.
@yarongilor could you please answer questions/resolve issues @asias mentioned in #15159 (comment) ?

asias added a commit to asias/scylla that referenced this issue Nov 7, 2023
*) Problem:

We have seen in the field it takes longer than expected to repair system tables
like system_auth which has a tiny amount of data but is replicated to all nodes
in the cluster. The cluster has multiple DCs. Each DC has multiple nodes. The
main reason for the slowness is that even if the amount of data is small,
repair has to walk through all the token ranges, that is num_tokens *
number_of_nodes_in_the_cluster. The overhead of the repair protocol for each
token range dominates due to the small amount of data per token range. Another
reason is the high network latency between DCs makes the RPC calls used to
repair consume more time.

*) Solution:

To solve this problem, a small table optimization for repair is introduced in
this patch. A new repair option is added to turn on this optimization.

- No token range to repair is needed by the user. It will repair all token
ranges automatically.

- Users only need to send the repair REST API request to one of the nodes in the
cluster. It can be any of the nodes in the cluster.

- It does not require the RF to be configured to replicate to all nodes in the
cluster. This means it can work with any tables as long as the amount of data
is low, e.g., less than 100MiB per node.

*) Performance:

3 DCs, each DC has 2 nodes, 6 nodes in the cluster. RF = {dc1: 2, dc2: 2, dc3: 3}

Before:
```
repair - repair[744cd573-2621-45e4-9b27-00634963d0bd]: stats:
repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes,
role_members}, ranges_nr=1537, round_nr=4612,
round_nr_fast_path_already_synced=4611,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1,
rpc_call_nr=115289, tx_hashes_nr=0, rx_hashes_nr=5, duration=1.5648403 seconds,
tx_row_nr=2, rx_row_nr=0, tx_row_bytes=356, rx_row_bytes=0,
row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 0},
{127.0.14.4, 0}, {127.0.14.5, 178}, {127.0.14.6, 178}},
row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 0},
{127.0.14.4, 0}, {127.0.14.5, 1}, {127.0.14.6, 1}},
row_from_disk_bytes_per_sec={{127.0.14.1, 0.00010848}, {127.0.14.2,
0.00010848}, {127.0.14.3, 0}, {127.0.14.4, 0}, {127.0.14.5, 0.00010848},
{127.0.14.6, 0.00010848}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1,
0.639043}, {127.0.14.2, 0.639043}, {127.0.14.3, 0}, {127.0.14.4, 0},
{127.0.14.5, 0.639043}, {127.0.14.6, 0.639043}} Rows/s,
tx_row_nr_peer={{127.0.14.3, 1}, {127.0.14.4, 1}}, rx_row_nr_peer={}
```

After:
```
repair - repair[d6e544ba-cb68-4465-ab91-6980bcbb46a9]: stats:
repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes,
role_members}, ranges_nr=1, round_nr=4, round_nr_fast_path_already_synced=4,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=80, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.001459798 seconds,
tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 178},
{127.0.14.4, 178}, {127.0.14.5, 178}, {127.0.14.6, 178}},
row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 1},
{127.0.14.4, 1}, {127.0.14.5, 1}, {127.0.14.6, 1}},
row_from_disk_bytes_per_sec={{127.0.14.1, 0.116286}, {127.0.14.2, 0.116286},
{127.0.14.3, 0.116286}, {127.0.14.4, 0.116286}, {127.0.14.5, 0.116286},
{127.0.14.6, 0.116286}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1,
685.026}, {127.0.14.2, 685.026}, {127.0.14.3, 685.026}, {127.0.14.4, 685.026},
{127.0.14.5, 685.026}, {127.0.14.6, 685.026}} Rows/s, tx_row_nr_peer={},
rx_row_nr_peer={}
```

The time to finish repair difference = 1.5648403 seconds / 0.001459798 seconds = 1072X

So, a more than 1000X difference is observed for this common usage of
system table repair procedure.

Fixes scylladb#15159
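For reference, once this optimization lands it would be enabled through a new option on the existing repair REST endpoint. A rough sketch of such a call is below; POST /storage_service/repair_async/{keyspace} on port 10000 is Scylla's standard asynchronous repair entry point, but the `small_table_optimization` parameter name is an assumption taken from this patch and may differ in the released API:

```
# Illustrative only: triggering the small-table repair path over Scylla's REST API.
# The endpoint and port are Scylla's standard REST API; the parameter name is an
# assumption based on this patch.
import requests

def repair_small_table(node_ip: str, keyspace: str = "system_auth") -> int:
    resp = requests.post(
        f"http://{node_ip}:10000/storage_service/repair_async/{keyspace}",
        params={"small_table_optimization": "true"},
    )
    resp.raise_for_status()
    return int(resp.text)  # repair command id, whose status can then be polled

# e.g. repair_small_table("10.4.8.128") sent to any single node in the cluster
```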
@mykaul mykaul added this to the 6.0 milestone Nov 7, 2023
@mykaul mykaul added P1 Urgent Field-Tier1 Urgent issues as per FieldEngineering request labels Nov 7, 2023
@mykaul mykaul added Backport candidate backport/5.2 Issues that should be backported to 5.2 branch once they'll be fixed backport/5.4 Issues that should be backported to 5.4 branch once they'll be fixed labels Nov 8, 2023
asias added a commit to asias/scylla that referenced this issue Nov 14, 2023
*) Problem:

We have seen in the field it takes longer than expected to repair system tables
like system_auth which has a tiny amount of data but is replicated to all nodes
in the cluster. The cluster has multiple DCs. Each DC has multiple nodes. The
main reason for the slowness is that even if the amount of data is small,
repair has to walk through all the token ranges, that is num_tokens *
number_of_nodes_in_the_cluster. The overhead of the repair protocol for each
token range dominates due to the small amount of data per token range. Another
reason is the high network latency between DCs makes the RPC calls used to
repair consume more time.

*) Solution:

To solve this problem, a small table optimization for repair is introduced in
this patch. A new repair option is added to turn on this optimization.

- No token range to repair is needed by the user. It will repair all token
ranges automatically.

- Users only need to send the repair REST API request to one of the nodes in the
cluster. It can be any of the nodes in the cluster.

- It does not require the RF to be configured to replicate to all nodes in the
cluster. This means it can work with any tables as long as the amount of data
is low, e.g., less than 100MiB per node.

*) Performance:

1)
3 DCs, each DC has 2 nodes, 6 nodes in the cluster. RF = {dc1: 2, dc2: 2, dc3: 2}

Before:
```
repair - repair[744cd573-2621-45e4-9b27-00634963d0bd]: stats:
repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes,
role_members}, ranges_nr=1537, round_nr=4612,
round_nr_fast_path_already_synced=4611,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1,
rpc_call_nr=115289, tx_hashes_nr=0, rx_hashes_nr=5, duration=1.5648403 seconds,
tx_row_nr=2, rx_row_nr=0, tx_row_bytes=356, rx_row_bytes=0,
row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 0},
{127.0.14.4, 0}, {127.0.14.5, 178}, {127.0.14.6, 178}},
row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 0},
{127.0.14.4, 0}, {127.0.14.5, 1}, {127.0.14.6, 1}},
row_from_disk_bytes_per_sec={{127.0.14.1, 0.00010848}, {127.0.14.2,
0.00010848}, {127.0.14.3, 0}, {127.0.14.4, 0}, {127.0.14.5, 0.00010848},
{127.0.14.6, 0.00010848}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1,
0.639043}, {127.0.14.2, 0.639043}, {127.0.14.3, 0}, {127.0.14.4, 0},
{127.0.14.5, 0.639043}, {127.0.14.6, 0.639043}} Rows/s,
tx_row_nr_peer={{127.0.14.3, 1}, {127.0.14.4, 1}}, rx_row_nr_peer={}
```

After:
```
repair - repair[d6e544ba-cb68-4465-ab91-6980bcbb46a9]: stats:
repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes,
role_members}, ranges_nr=1, round_nr=4, round_nr_fast_path_already_synced=4,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=80, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.001459798 seconds,
tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 178},
{127.0.14.4, 178}, {127.0.14.5, 178}, {127.0.14.6, 178}},
row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 1},
{127.0.14.4, 1}, {127.0.14.5, 1}, {127.0.14.6, 1}},
row_from_disk_bytes_per_sec={{127.0.14.1, 0.116286}, {127.0.14.2, 0.116286},
{127.0.14.3, 0.116286}, {127.0.14.4, 0.116286}, {127.0.14.5, 0.116286},
{127.0.14.6, 0.116286}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1,
685.026}, {127.0.14.2, 685.026}, {127.0.14.3, 685.026}, {127.0.14.4, 685.026},
{127.0.14.5, 685.026}, {127.0.14.6, 685.026}} Rows/s, tx_row_nr_peer={},
rx_row_nr_peer={}
```

The time to finish repair difference = 1.5648403 seconds / 0.001459798 seconds = 1072X

2)
3 DCs, each DC has 2 nodes, 6 nodes in the cluster. RF = {dc1: 2, dc2: 2, dc3: 2}

Same test as above except 5ms delay is added to simulate multiple dc
network latency:

The time to repair is reduced from 333s to 0.2s.

333.26758 s / 0.22625381s = 1472.98

3)

3 DCs, each DC has 3 nodes, 9 nodes in the cluster. RF = {dc1: 3, dc2: 3, dc3: 3}
, 10 ms network latency

Before:

```
repair - repair[86124a4a-fd26-42ea-a078-437ca9e372df]: stats:
repair_reason=repair, keyspace=system_auth, tables={role_attributes,
role_members, roles}, ranges_nr=2305, round_nr=6916,
round_nr_fast_path_already_synced=6915,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1,
rpc_call_nr=276630, tx_hashes_nr=0, rx_hashes_nr=8, duration=986.34015
seconds, tx_row_nr=7, rx_row_nr=0, tx_row_bytes=1246, rx_row_bytes=0,
row_from_disk_bytes={{127.0.57.1, 178}, {127.0.57.2, 178}, {127.0.57.3,
0}, {127.0.57.4, 0}, {127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0},
{127.0.57.8, 0}, {127.0.57.9, 0}}, row_from_disk_nr={{127.0.57.1, 1},
{127.0.57.2, 1}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0},
{127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}},
row_from_disk_bytes_per_sec={{127.0.57.1, 1.72105e-07}, {127.0.57.2,
1.72105e-07}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0},
{127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}}
MiB/s, row_from_disk_rows_per_sec={{127.0.57.1, 0.00101385},
{127.0.57.2, 0.00101385}, {127.0.57.3, 0}, {127.0.57.4, 0},
{127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0},
{127.0.57.9, 0}} Rows/s, tx_row_nr_peer={{127.0.57.3, 1},
{127.0.57.4, 1}, {127.0.57.5, 1}, {127.0.57.6, 1}, {127.0.57.7, 1},
{127.0.57.8, 1}, {127.0.57.9, 1}}, rx_row_nr_peer={}
```

After:

```
repair - repair[07ebd571-63cb-4ef6-9465-6e5f1e98f04f]: stats:
repair_reason=repair, keyspace=system_auth, tables={role_attributes,
role_members, roles}, ranges_nr=1, round_nr=4,
round_nr_fast_path_already_synced=4,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=128, tx_hashes_nr=0, rx_hashes_nr=0, duration=1.6052915
seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.57.1, 178}, {127.0.57.2, 178}, {127.0.57.3,
178}, {127.0.57.4, 178}, {127.0.57.5, 178}, {127.0.57.6, 178},
{127.0.57.7, 178}, {127.0.57.8, 178}, {127.0.57.9, 178}},
row_from_disk_nr={{127.0.57.1, 1}, {127.0.57.2, 1}, {127.0.57.3, 1},
{127.0.57.4, 1}, {127.0.57.5, 1}, {127.0.57.6, 1}, {127.0.57.7, 1},
{127.0.57.8, 1}, {127.0.57.9, 1}},
row_from_disk_bytes_per_sec={{127.0.57.1, 0.00037793}, {127.0.57.2,
0.00037793}, {127.0.57.3, 0.00037793}, {127.0.57.4, 0.00037793},
{127.0.57.5, 0.00037793}, {127.0.57.6, 0.00037793}, {127.0.57.7,
0.00037793}, {127.0.57.8, 0.00037793}, {127.0.57.9, 0.00037793}}
MiB/s, row_from_disk_rows_per_sec={{127.0.57.1, 2.22634},
{127.0.57.2, 2.22634}, {127.0.57.3, 2.22634}, {127.0.57.4,
2.22634}, {127.0.57.5, 2.22634}, {127.0.57.6, 2.22634},
{127.0.57.7, 2.22634}, {127.0.57.8, 2.22634}, {127.0.57.9,
2.22634}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
```

The time to repair is reduced from 986s (16 minutes) to 1.6s

*) Summary

So, a more than 1000X difference is observed for this common usage of
system table repair procedure.

Fixes scylladb#16011
Refs  scylladb#15159
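A back-of-envelope check on the numbers above shows why the per-range protocol overhead dominates: ~120 RPC calls per range and ~0.43 s per range over a 10 ms link correspond to roughly 40 serialized round trips for every one of the ~2300 ranges, which accounts for essentially all of the 986 s. A small calculation, using only the figures quoted in the commit message:

```
# Back-of-envelope arithmetic on the stats quoted above (9 nodes, 10 ms latency).
# These are not new measurements, just the commit-message numbers re-derived.
ranges_before, rpcs_before, secs_before = 2305, 276630, 986.34015   # before the optimization
ranges_after, rpcs_after, secs_after = 1, 128, 1.6052915            # with the optimization
rtt = 0.010  # simulated inter-DC round-trip time in seconds

print(f"RPCs per range before:     {rpcs_before / ranges_before:.0f}")            # ~120
print(f"time per range before:     {secs_before / ranges_before * 1000:.0f} ms")  # ~428 ms
print(f"serialized RTTs per range: {secs_before / ranges_before / rtt:.0f}")      # ~43
print(f"overall speedup:           {secs_before / secs_after:.0f}x")              # ~614x
```

Collapsing the job into a single range removes that per-range multiplier, which is where the 1000X-class improvements come from.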
asias added a commit to asias/scylla that referenced this issue Nov 20, 2023
*) Problem:

We have seen in the field it takes longer than expected to repair system tables
like system_auth which has a tiny amount of data but is replicated to all nodes
in the cluster. The cluster has multiple DCs. Each DC has multiple nodes. The
main reason for the slowness is that even if the amount of data is small,
repair has to walk though all the token ranges, that is num_tokens *
number_of_nodes_in_the_cluster. The overhead of the repair protocol for each
token range dominates due to the small amount of data per token range. Another
reason is the high network latency between DCs makes the RPC calls used to
repair consume more time.

*) Solution:

To solve this problem, a small table optimization for repair is introduced in
this patch. A new repair option is added to turn on this optimization.

- No token range to repair is needed by the user. It  will repair all token
ranges automatically.

- Users only need to send the repair rest api to one of the nodes in the
cluster. It can be any of the nodes in the cluster.

- It does not require the RF to be configured to replicate to all nodes in the
cluster. This means it can work with any tables as long as the amount of data
is low, e.g., less than 100MiB per node.

*) Performance:

1)
3 DCs, each DC has 2 nodes, 6 nodes in the cluster. RF = {dc1: 2, dc2: 2, dc3: 2}

Before:
```
repair - repair[744cd573-2621-45e4-9b27-00634963d0bd]: stats:
repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes,
role_members}, ranges_nr=1537, round_nr=4612,
round_nr_fast_path_already_synced=4611,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1,
rpc_call_nr=115289, tx_hashes_nr=0, rx_hashes_nr=5, duration=1.5648403 seconds,
tx_row_nr=2, rx_row_nr=0, tx_row_bytes=356, rx_row_bytes=0,
row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 0},
{127.0.14.4, 0}, {127.0.14.5, 178}, {127.0.14.6, 178}},
row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 0},
{127.0.14.4, 0}, {127.0.14.5, 1}, {127.0.14.6, 1}},
row_from_disk_bytes_per_sec={{127.0.14.1, 0.00010848}, {127.0.14.2,
0.00010848}, {127.0.14.3, 0}, {127.0.14.4, 0}, {127.0.14.5, 0.00010848},
{127.0.14.6, 0.00010848}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1,
0.639043}, {127.0.14.2, 0.639043}, {127.0.14.3, 0}, {127.0.14.4, 0},
{127.0.14.5, 0.639043}, {127.0.14.6, 0.639043}} Rows/s,
tx_row_nr_peer={{127.0.14.3, 1}, {127.0.14.4, 1}}, rx_row_nr_peer={}
```

After:
```
repair - repair[d6e544ba-cb68-4465-ab91-6980bcbb46a9]: stats:
repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes,
role_members}, ranges_nr=1, round_nr=4, round_nr_fast_path_already_synced=4,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=80, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.001459798 seconds,
tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 178},
{127.0.14.4, 178}, {127.0.14.5, 178}, {127.0.14.6, 178}},
row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 1},
{127.0.14.4, 1}, {127.0.14.5, 1}, {127.0.14.6, 1}},
row_from_disk_bytes_per_sec={{127.0.14.1, 0.116286}, {127.0.14.2, 0.116286},
{127.0.14.3, 0.116286}, {127.0.14.4, 0.116286}, {127.0.14.5, 0.116286},
{127.0.14.6, 0.116286}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1,
685.026}, {127.0.14.2, 685.026}, {127.0.14.3, 685.026}, {127.0.14.4, 685.026},
{127.0.14.5, 685.026}, {127.0.14.6, 685.026}} Rows/s, tx_row_nr_peer={},
rx_row_nr_peer={}
```

The time to finish repair difference = 1.5648403 seconds / 0.001459798 seconds = 1072X

2)
3 DCs, each DC has 2 nodes, 6 nodes in the cluster. RF = {dc1: 2, dc2: 2, dc3: 2}

Same test as above except 5ms delay is added to simulate multiple dc
network latency:

The time to repair is reduced from 333s to 0.2s.

333.26758 s / 0.22625381s = 1472.98

3)

3 DCs, each DC has 3 nodes, 9 nodes in the cluster. RF = {dc1: 3, dc2: 3, dc3: 3}
, 10 ms network latency

Before:

```
repair - repair[86124a4a-fd26-42ea-a078-437ca9e372df]: stats:
repair_reason=repair, keyspace=system_auth, tables={role_attributes,
role_members, roles}, ranges_nr=2305, round_nr=6916,
round_nr_fast_path_already_synced=6915,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1,
rpc_call_nr=276630, tx_hashes_nr=0, rx_hashes_nr=8, duration=986.34015
seconds, tx_row_nr=7, rx_row_nr=0, tx_row_bytes=1246, rx_row_bytes=0,
row_from_disk_bytes={{127.0.57.1, 178}, {127.0.57.2, 178}, {127.0.57.3,
0}, {127.0.57.4, 0}, {127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0},
{127.0.57.8, 0}, {127.0.57.9, 0}}, row_from_disk_nr={{127.0.57.1, 1},
{127.0.57.2, 1}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0},
{127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}},
row_from_disk_bytes_per_sec={{127.0.57.1, 1.72105e-07}, {127.0.57.2,
1.72105e-07}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0},
{127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}}
MiB/s, row_from_disk_rows_per_sec={{127.0.57.1, 0.00101385},
{127.0.57.2, 0.00101385}, {127.0.57.3, 0}, {127.0.57.4, 0},
{127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0},
{127.0.57.9, 0}} Rows/s, tx_row_nr_peer={{127.0.57.3, 1},
{127.0.57.4, 1}, {127.0.57.5, 1}, {127.0.57.6, 1}, {127.0.57.7, 1},
{127.0.57.8, 1}, {127.0.57.9, 1}}, rx_row_nr_peer={}
```

After:

```
repair - repair[07ebd571-63cb-4ef6-9465-6e5f1e98f04f]: stats:
repair_reason=repair, keyspace=system_auth, tables={role_attributes,
role_members, roles}, ranges_nr=1, round_nr=4,
round_nr_fast_path_already_synced=4,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=128, tx_hashes_nr=0, rx_hashes_nr=0, duration=1.6052915
seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.57.1, 178}, {127.0.57.2, 178}, {127.0.57.3,
178}, {127.0.57.4, 178}, {127.0.57.5, 178}, {127.0.57.6, 178},
{127.0.57.7, 178}, {127.0.57.8, 178}, {127.0.57.9, 178}},
row_from_disk_nr={{127.0.57.1, 1}, {127.0.57.2, 1}, {127.0.57.3, 1},
{127.0.57.4, 1}, {127.0.57.5, 1}, {127.0.57.6, 1}, {127.0.57.7, 1},
{127.0.57.8, 1}, {127.0.57.9, 1}},
row_from_disk_bytes_per_sec={{127.0.57.1, 0.00037793}, {127.0.57.2,
0.00037793}, {127.0.57.3, 0.00037793}, {127.0.57.4, 0.00037793},
{127.0.57.5, 0.00037793}, {127.0.57.6, 0.00037793}, {127.0.57.7,
0.00037793}, {127.0.57.8, 0.00037793}, {127.0.57.9, 0.00037793}}
MiB/s, row_from_disk_rows_per_sec={{127.0.57.1, 2.22634},
{127.0.57.2, 2.22634}, {127.0.57.3, 2.22634}, {127.0.57.4,
2.22634}, {127.0.57.5, 2.22634}, {127.0.57.6, 2.22634},
{127.0.57.7, 2.22634}, {127.0.57.8, 2.22634}, {127.0.57.9,
2.22634}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
```

The time to repair is reduced from 986s (16 minutes) to 1.6s

*) Summary

So, a more than 1000X difference is observed for this common usage of
system table repair procedure.

Fixes scylladb#16011
Refs  scylladb#15159
asias added a commit to asias/scylla that referenced this issue Nov 20, 2023
*) Problem:

We have seen in the field it takes longer than expected to repair system tables
like system_auth which has a tiny amount of data but is replicated to all nodes
in the cluster. The cluster has multiple DCs. Each DC has multiple nodes. The
main reason for the slowness is that even if the amount of data is small,
repair has to walk though all the token ranges, that is num_tokens *
number_of_nodes_in_the_cluster. The overhead of the repair protocol for each
token range dominates due to the small amount of data per token range. Another
reason is the high network latency between DCs makes the RPC calls used to
repair consume more time.

*) Solution:

To solve this problem, a small table optimization for repair is introduced in
this patch. A new repair option is added to turn on this optimization.

- No token range to repair is needed by the user. It  will repair all token
ranges automatically.

- Users only need to send the repair rest api to one of the nodes in the
cluster. It can be any of the nodes in the cluster.

- It does not require the RF to be configured to replicate to all nodes in the
cluster. This means it can work with any tables as long as the amount of data
is low, e.g., less than 100MiB per node.

*) Performance:

1)
3 DCs, each DC has 2 nodes, 6 nodes in the cluster. RF = {dc1: 2, dc2: 2, dc3: 2}

Before:
```
repair - repair[744cd573-2621-45e4-9b27-00634963d0bd]: stats:
repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes,
role_members}, ranges_nr=1537, round_nr=4612,
round_nr_fast_path_already_synced=4611,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1,
rpc_call_nr=115289, tx_hashes_nr=0, rx_hashes_nr=5, duration=1.5648403 seconds,
tx_row_nr=2, rx_row_nr=0, tx_row_bytes=356, rx_row_bytes=0,
row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 0},
{127.0.14.4, 0}, {127.0.14.5, 178}, {127.0.14.6, 178}},
row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 0},
{127.0.14.4, 0}, {127.0.14.5, 1}, {127.0.14.6, 1}},
row_from_disk_bytes_per_sec={{127.0.14.1, 0.00010848}, {127.0.14.2,
0.00010848}, {127.0.14.3, 0}, {127.0.14.4, 0}, {127.0.14.5, 0.00010848},
{127.0.14.6, 0.00010848}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1,
0.639043}, {127.0.14.2, 0.639043}, {127.0.14.3, 0}, {127.0.14.4, 0},
{127.0.14.5, 0.639043}, {127.0.14.6, 0.639043}} Rows/s,
tx_row_nr_peer={{127.0.14.3, 1}, {127.0.14.4, 1}}, rx_row_nr_peer={}
```

After:
```
repair - repair[d6e544ba-cb68-4465-ab91-6980bcbb46a9]: stats:
repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes,
role_members}, ranges_nr=1, round_nr=4, round_nr_fast_path_already_synced=4,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=80, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.001459798 seconds,
tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 178},
{127.0.14.4, 178}, {127.0.14.5, 178}, {127.0.14.6, 178}},
row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 1},
{127.0.14.4, 1}, {127.0.14.5, 1}, {127.0.14.6, 1}},
row_from_disk_bytes_per_sec={{127.0.14.1, 0.116286}, {127.0.14.2, 0.116286},
{127.0.14.3, 0.116286}, {127.0.14.4, 0.116286}, {127.0.14.5, 0.116286},
{127.0.14.6, 0.116286}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1,
685.026}, {127.0.14.2, 685.026}, {127.0.14.3, 685.026}, {127.0.14.4, 685.026},
{127.0.14.5, 685.026}, {127.0.14.6, 685.026}} Rows/s, tx_row_nr_peer={},
rx_row_nr_peer={}
```

The time to finish repair difference = 1.5648403 seconds / 0.001459798 seconds = 1072X

2)
3 DCs, each DC has 2 nodes, 6 nodes in the cluster. RF = {dc1: 2, dc2: 2, dc3: 2}

Same test as above except 5ms delay is added to simulate multiple dc
network latency:

The time to repair is reduced from 333s to 0.2s.

333.26758 s / 0.22625381s = 1472.98

3)

3 DCs, each DC has 3 nodes, 9 nodes in the cluster. RF = {dc1: 3, dc2: 3, dc3: 3}
, 10 ms network latency

Before:

```
repair - repair[86124a4a-fd26-42ea-a078-437ca9e372df]: stats:
repair_reason=repair, keyspace=system_auth, tables={role_attributes,
role_members, roles}, ranges_nr=2305, round_nr=6916,
round_nr_fast_path_already_synced=6915,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1,
rpc_call_nr=276630, tx_hashes_nr=0, rx_hashes_nr=8, duration=986.34015
seconds, tx_row_nr=7, rx_row_nr=0, tx_row_bytes=1246, rx_row_bytes=0,
row_from_disk_bytes={{127.0.57.1, 178}, {127.0.57.2, 178}, {127.0.57.3,
0}, {127.0.57.4, 0}, {127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0},
{127.0.57.8, 0}, {127.0.57.9, 0}}, row_from_disk_nr={{127.0.57.1, 1},
{127.0.57.2, 1}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0},
{127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}},
row_from_disk_bytes_per_sec={{127.0.57.1, 1.72105e-07}, {127.0.57.2,
1.72105e-07}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0},
{127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}}
MiB/s, row_from_disk_rows_per_sec={{127.0.57.1, 0.00101385},
{127.0.57.2, 0.00101385}, {127.0.57.3, 0}, {127.0.57.4, 0},
{127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0},
{127.0.57.9, 0}} Rows/s, tx_row_nr_peer={{127.0.57.3, 1},
{127.0.57.4, 1}, {127.0.57.5, 1}, {127.0.57.6, 1}, {127.0.57.7, 1},
{127.0.57.8, 1}, {127.0.57.9, 1}}, rx_row_nr_peer={}
```

After:

```
repair - repair[07ebd571-63cb-4ef6-9465-6e5f1e98f04f]: stats:
repair_reason=repair, keyspace=system_auth, tables={role_attributes,
role_members, roles}, ranges_nr=1, round_nr=4,
round_nr_fast_path_already_synced=4,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=128, tx_hashes_nr=0, rx_hashes_nr=0, duration=1.6052915
seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.57.1, 178}, {127.0.57.2, 178}, {127.0.57.3,
178}, {127.0.57.4, 178}, {127.0.57.5, 178}, {127.0.57.6, 178},
{127.0.57.7, 178}, {127.0.57.8, 178}, {127.0.57.9, 178}},
row_from_disk_nr={{127.0.57.1, 1}, {127.0.57.2, 1}, {127.0.57.3, 1},
{127.0.57.4, 1}, {127.0.57.5, 1}, {127.0.57.6, 1}, {127.0.57.7, 1},
{127.0.57.8, 1}, {127.0.57.9, 1}},
row_from_disk_bytes_per_sec={{127.0.57.1, 0.00037793}, {127.0.57.2,
0.00037793}, {127.0.57.3, 0.00037793}, {127.0.57.4, 0.00037793},
{127.0.57.5, 0.00037793}, {127.0.57.6, 0.00037793}, {127.0.57.7,
0.00037793}, {127.0.57.8, 0.00037793}, {127.0.57.9, 0.00037793}}
MiB/s, row_from_disk_rows_per_sec={{127.0.57.1, 2.22634},
{127.0.57.2, 2.22634}, {127.0.57.3, 2.22634}, {127.0.57.4,
2.22634}, {127.0.57.5, 2.22634}, {127.0.57.6, 2.22634},
{127.0.57.7, 2.22634}, {127.0.57.8, 2.22634}, {127.0.57.9,
2.22634}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
```

The time to repair is reduced from 986 s (about 16 minutes) to 1.6 s.

*) Summary

So, a speedup of more than 1000X is observed for this common system-table
repair procedure.

Fixes scylladb#16011
Refs scylladb#15159
denesb added a commit that referenced this issue Nov 24, 2023
repair: Introduce small table optimization

*) Problem:

We have seen in the field that it takes longer than expected to repair system
tables like system_auth, which hold a tiny amount of data but are replicated to
all nodes in a cluster with multiple DCs and multiple nodes per DC. The main
reason for the slowness is that, even though the amount of data is small,
repair has to walk through all the token ranges, i.e., num_tokens *
number_of_nodes_in_the_cluster ranges. The overhead of the repair protocol for
each token range dominates because there is so little data per range. Another
reason is that the high network latency between DCs makes the RPC calls used by
repair take more time.
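
As a rough sanity check, assuming the default of 256 vnodes per node, the number
of token ranges a full repair has to walk is roughly num_tokens *
number_of_nodes, which lines up with the ranges_nr values reported in the stats
above:

```
# Back-of-the-envelope range counts (assumption: 256 vnodes per node).
echo $((256 * 6))   # 1536 -- close to the observed ranges_nr=1537 (6 nodes)
echo $((256 * 9))   # 2304 -- close to the observed ranges_nr=2305 (9 nodes)
```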


Fixes #16011
Refs #15159

Closes #15974

* github.com:scylladb/scylladb:
  repair: Introduce small table optimization
  repair: Convert put_row_diff_with_rpc_stream to use coroutine
@mykaul mykaul removed the backport/5.2 (Issues that should be backported to 5.2 branch once they'll be fixed), Backport candidate, and backport/5.4 (Issues that should be backported to 5.4 branch once they'll be fixed) labels Jan 29, 2024
@mykaul
Contributor

mykaul commented Feb 1, 2024

Closing for the time being - the small table optimization is in 6.0.

@mykaul mykaul closed this as completed Feb 1, 2024