
Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy #11017

Closed
fruch opened this issue Jul 11, 2022 · 60 comments

@fruch
Contributor

fruch commented Jul 11, 2022

Installation details

Kernel Version: 5.13.0-1031-aws
Scylla version (or git commit hash): 5.1.dev-20220706.a0ffbf3291b7 with build-id 3490fa9f14da510e97a1d0f53f693cac13a70494
Cluster size: 6 nodes (i3.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-9 (63.35.163.81 | 10.0.3.187) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-8 (52.31.41.157 | 10.0.0.95) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-7 (34.244.145.19 | 10.0.0.195) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-6 (34.241.120.110 | 10.0.0.111) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-5 (34.244.16.3 | 10.0.1.141) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-4 (34.253.116.115 | 10.0.0.12) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-3 (3.251.71.251 | 10.0.1.19) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-2 (52.16.243.180 | 10.0.2.34) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-11 (3.251.65.93 | 10.0.2.8) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-10 (34.243.184.173 | 10.0.0.66) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-1 (63.33.208.165 | 10.0.1.209) (shards: 14)

OS / Image: ami-07d73e5ea1fc772eb (aws: eu-west-1)

Test: longevity-50gb-3days
Test id: fb8b36a7-c818-4b0b-8ae3-f3ee2cafa53a
Test name: scylla-master/longevity/longevity-50gb-3days
Test config file(s):

Issue description

During disrupt_rolling_config_change_internode_compression we got this error, which we have never encountered before:

(DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=9a182d91-14f4-47e3-ae39-724e47badd47: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=604687 node=longevity-tls-50gb-3d-master-db-node-fb8b36a7-5
2022-07-11T10:09:07+00:00 longevity-tls-50gb-3d-master-db-node-fb8b36a7-5 !ERR | scylla[69983]: [shard 13] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
  • Restore Monitor Stack command: $ hydra investigate show-monitor fb8b36a7-c818-4b0b-8ae3-f3ee2cafa53a
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs fb8b36a7-c818-4b0b-8ae3-f3ee2cafa53a

Logs:

No logs captured during this run.

Jenkins job URL

@fruch fruch added the triage/master Looking for assignee label Jul 11, 2022
@avikivity
Member

/cc @elcallio, looks like listener restart forgets to unregister metrics

@fruch
Contributor Author

fruch commented Jul 12, 2022

The test case is still running.

It happened again on different nodes, during disrupt_rolling_config_change_internode_compression and disrupt_rebuild_streaming_err:

(DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=9a182d91-14f4-47e3-ae39-724e47badd47: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=1134465 node=longevity-tls-50gb-3d-master-db-node-fb8b36a7-6
2022-07-12T00:05:30+00:00 longevity-tls-50gb-3d-master-db-node-fb8b36a7-6 !ERR | scylla[67120]: [shard 8] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
(DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=9a182d91-14f4-47e3-ae39-724e47badd47: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=1612960 node=longevity-tls-50gb-3d-master-db-node-fb8b36a7-11
2022-07-12T03:06:37+00:00 longevity-tls-50gb-3d-master-db-node-fb8b36a7-11 !ERR | scylla: [shard 13] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy

@fruch
Contributor Author

fruch commented Jul 12, 2022

It happened one more time during disrupt_rolling_restart_cluster:

(DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=9a182d91-14f4-47e3-ae39-724e47badd47: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=3136603 node=longevity-tls-50gb-3d-master-db-node-fb8b36a7-11
2022-07-12T09:17:19+00:00 longevity-tls-50gb-3d-master-db-node-fb8b36a7-11 !ERR | scylla: [shard 13] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy

@DoronArazii DoronArazii added type/bug and removed triage/master Looking for assignee labels Jul 13, 2022
@elcallio
Contributor

So, while I did not write nor actually read any of this code before this hour, it seems that the storage_proxy stats in question are never unregistered (until the objects are destroyed), i.e. we don't "forget" a DC or anything.
The stats guard creates counters for a given DC using a map, so assuming there is only a single storage_proxy per shard, we should never create duplicate counters here.

Assuming we don't create two storage_proxies, I can only guess that the problem is that we somehow switch scheduling groups between loading the stats object (stored as scheduling-group-specific storage) and deriving the actual counter labels (which look up the current scheduling group).

Again, I am not familiar with this code, but either someone who knows where we potentially switch groups should try to keep us in the original one, or we could consider keeping track of the group in the split_stats object holding the counter(s) and use that, not the current one.
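
To make the suspected race concrete, here is a minimal standalone C++ model (hypothetical names, not the actual Scylla/Seastar code): a per-scheduling-group stats object that derives its counter labels from whatever group happens to be current at registration time can collide with another group's registration if the group switched in between.

```cpp
// Toy model of the suspected race (hypothetical names, not Scylla/Seastar code).
#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Stand-in for the metrics registry: throws on a duplicate label set, the way
// seastar::metrics reports double_registration.
struct registry {
    std::set<std::string> keys;
    void add(const std::string& key) {
        if (!keys.insert(key).second) {
            throw std::runtime_error("registering metrics twice for metrics: " + key);
        }
    }
};

// Stand-in for whichever scheduling group is current when labels are derived.
std::string current_scheduling_group = "statement";

// Stand-in for a split_stats-like object: lazily registers one counter per DC,
// but derives the group label from the *current* group rather than the group
// the object was created for.
struct per_group_stats {
    registry& reg;
    std::string created_for_group;
    std::map<std::string, int> per_dc_counters;

    void bump(const std::string& dc) {
        if (per_dc_counters.emplace(dc, 0).second) {
            // Modeled bug: the label uses current_scheduling_group, not created_for_group.
            reg.add("writes_failed_remote_node{group=" + current_scheduling_group + ",dc=" + dc + "}");
        }
        ++per_dc_counters[dc];
    }
};

int main() {
    registry reg;
    per_group_stats statement_stats{reg, "statement"};
    per_group_stats background_stats{reg, "background"};

    statement_stats.bump("dc1");   // registers {group=statement, dc=dc1}

    // A continuation/timer touches background_stats while "statement" is still
    // the current group: the derived label collides with the one above.
    background_stats.bump("dc1");  // throws: duplicate registration
}
```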

@DoronArazii DoronArazii added this to the 5.1 milestone Jul 14, 2022
@fruch
Contributor Author

fruch commented Jul 17, 2022

It also happened in this week's run, while doing a config change with rolling restart:

(DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=755d4ba9-d490-4856-b201-d11a25b0c932: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=2271904 node=longevity-tls-50gb-3d-master-db-node-8dce4a7e-8
2022-07-17T11:07:36+00:00 longevity-tls-50gb-3d-master-db-node-8dce4a7e-8 !ERR | scylla[38553]: [shard 13] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
(DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=755d4ba9-d490-4856-b201-d11a25b0c932: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=402718 node=longevity-tls-50gb-3d-master-db-node-8dce4a7e-1
2022-07-15T16:18:47+00:00 longevity-tls-50gb-3d-master-db-node-8dce4a7e-1 !ERR | scylla[46538]: [shard 12] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy

Installation details

Kernel Version: 5.15.0-1015-aws
Scylla version (or git commit hash): 5.1.dev-20220714.98aa3ec99b9b with build-id e27b38549f89da2f016567fc243c8404a6460ebc
Cluster size: 6 nodes (i3.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-master-db-node-8dce4a7e-9 (34.244.142.219 | 10.0.1.13) (shards: 14)
    ...
  • longevity-tls-50gb-3d-master-db-node-8dce4a7e-1 (34.246.186.17 | 10.0.1.31) (shards: 14)

OS / Image: ami-08cbc2649e53039f8 (aws: eu-west-1)

Test: longevity-50gb-3days
Test id: 8dce4a7e-e326-4373-bd48-e772fbfbe457
Test name: scylla-master/longevity/longevity-50gb-3days
Test config file(s):

Logs:

No logs captured during this run.

Jenkins job URL

@juliayakovlev

Reproduced with 5.1.dev-20220719.274691f45e3d during longevity-tls-50gb, RebuildStreamingErr nemesis.
Test is still running.

2022-07-23 14:55:42.540 <2022-07-23 14:55:37.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=37023e13-0be0-479e-9378-4389051838db: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=1316581 node=longevity-tls-50gb-3d-master-db-node-18f65c02-8
2022-07-23T14:55:37+00:00 longevity-tls-50gb-3d-master-db-node-18f65c02-8      !ERR | scylla[67910]:  [shard 5] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
?? ??:0 (unresolved backtrace frames, repeated)

Installation details

Kernel Version: 5.15.0-1015-aws
Scylla version (or git commit hash): 5.1.dev-20220719.274691f45e3d with build-id 8bedc6db4fe1f245a70c67d2d2135b056a361a78
Cluster size: 6 nodes (i3.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-master-db-node-18f65c02-9 (52.50.223.37 | 10.0.2.20) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-18f65c02-8 (34.250.158.91 | 10.0.3.100) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-18f65c02-7 (3.250.24.106 | 10.0.1.161) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-18f65c02-6 (54.170.138.198 | 10.0.2.190) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-18f65c02-5 (34.245.58.121 | 10.0.3.23) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-18f65c02-4 (52.17.191.16 | 10.0.0.236) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-18f65c02-3 (34.244.86.218 | 10.0.3.238) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-18f65c02-2 (54.171.127.68 | 10.0.3.106) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-18f65c02-15 (34.247.183.150 | 10.0.0.35) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-18f65c02-14 (176.34.148.131 | 10.0.2.137) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-18f65c02-13 (54.75.58.7 | 10.0.1.26) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-18f65c02-12 (52.17.235.246 | 10.0.3.173) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-18f65c02-11 (54.75.98.54 | 10.0.1.206) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-18f65c02-10 (3.250.42.138 | 10.0.3.105) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-18f65c02-1 (54.154.115.110 | 10.0.1.82) (shards: 14)

OS / Image: ami-02742dcbeb0063751 (aws: eu-west-1)

Test: longevity-50gb-3days
Test id: 18f65c02-9d8e-4183-a79f-659995a96759
Test name: scylla-master/longevity/longevity-50gb-3days
Test config file(s):

Logs:

No logs captured during this run.

Jenkins job URL

@soyacz
Contributor

soyacz commented Jul 27, 2022

I spotted this issue in the upcoming 2022.1.rc9-0.20220721.9c95c3a8c (upgrade test):

2022-07-27 09:17:13.366 <2022-07-27 09:17:13.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=859a03f3-bd6c-4c57-9185-f4ebcd7c0240: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=7342 node=rolling-upgrade--centos-db-node-980f74a6-3
2022-07-27T09:17:13+00:00 rolling-upgrade--centos-db-node-980f74a6-3      !ERR | scylla[9285]:  [shard 0] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-27 09:17:13.868 <2022-07-27 09:17:13.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=859a03f3-bd6c-4c57-9185-f4ebcd7c0240: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=8200 node=rolling-upgrade--centos-db-node-980f74a6-2
2022-07-27T09:17:13+00:00 rolling-upgrade--centos-db-node-980f74a6-2      !ERR | scylla[9216]:  [shard 0] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy

Test details

Test: upgrade_test.UpgradeTest.test_rolling_upgrade
Build number: 40
Backend: aws: eu-west-1
Kernel version: 5.15.0-1015-aws
Test-id: 980f74a6-b3fe-4563-a817-358127d37089
Start time: 2022-07-27 08:20:48
End time: 2022-07-27 12:37:17
Started by user: lukasz.sojka
Cassandra-stress uses shard-aware driver

Test result

FAILED

System under test

ScyllaDB version: 5.0.1-0.20220719.b177dacd3 with build-id 217f31634f8c8722cadcfe57ade8da58af05d415 (ami-03c187d4a5a22faa0)
Target ScyllaDB repo: http://downloads.scylladb.com/unstable/scylla-enterprise/enterprise-2022.1/deb/unified/2022-07-21T03:19:04Z/scylladb-2022.1/scylla.list
Target ScyllaDB version: 2022.1.rc9-0.20220721.9c95c3a8c
Instance type: i3.2xlarge
Number of ScyllaDB nodes: 4

Restore commands:

Restore Monitor Stack command: $ hydra investigate show-monitor 980f74a6-b3fe-4563-a817-358127d37089
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 980f74a6-b3fe-4563-a817-358127d37089

Logs:

grafana - https://cloudius-jenkins-test.s3.amazonaws.com/980f74a6-b3fe-4563-a817-358127d37089/20220727_123717/grafana-screenshot-overview-20220727_123717-rolling-upgrade--centos-monitor-node-980f74a6-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/980f74a6-b3fe-4563-a817-358127d37089/20220727_123717/grafana-screenshot-rolling-multi-dc-upgrade-ami-test-staging2-scylla-per-server-metrics-nemesis-20220727_123826-rolling-upgrade--centos-monitor-node-980f74a6-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/980f74a6-b3fe-4563-a817-358127d37089/20220727_125046/db-cluster-980f74a6.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/980f74a6-b3fe-4563-a817-358127d37089/20220727_125046/loader-set-980f74a6.tar.gz
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/980f74a6-b3fe-4563-a817-358127d37089/20220727_125046/monitor-set-980f74a6.tar.gz
sct - https://cloudius-jenkins-test.s3.amazonaws.com/980f74a6-b3fe-4563-a817-358127d37089/20220727_125046/sct-runner-980f74a6.tar.gz

Links:

Build URL
Download "Overview metrics" Grafana Screenshot
Download "Per server metrics nemesis" Grafana Screenshot

@ShlomiBalalis

ShlomiBalalis commented Aug 1, 2022

Installation details

Kernel Version: 5.15.0-1015-aws
Scylla version (or git commit hash): 5.1.0~dev-20220726.29c28dcb0c33 with build-id f0f40bc93cc45be63928bbe9eaf674885347ba58
Cluster size: 6 nodes (i3.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-master-db-node-7c261c1d-9 (54.247.51.66 | 10.4.3.6) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-8 (34.245.108.207 | 10.4.3.144) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-7 (54.194.154.246 | 10.4.3.101) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-6 (34.244.16.3 | 10.4.3.91) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-5 (34.243.214.29 | 10.4.3.68) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-4 (54.75.11.23 | 10.4.2.5) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-3 (3.248.210.110 | 10.4.3.71) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-25 (52.30.122.141 | 10.4.1.173) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-24 (52.209.250.184 | 10.4.0.81) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-23 (34.250.247.185 | 10.4.2.171) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-22 (54.155.70.161 | 10.4.3.205) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-21 (54.170.174.90 | 10.4.0.71) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-20 (3.251.80.50 | 10.4.1.70) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-2 (34.243.140.201 | 10.4.3.13) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-19 (54.217.22.249 | 10.4.2.37) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-18 (54.246.27.153 | 10.4.3.9) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-17 (52.51.126.229 | 10.4.1.45) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-16 (34.243.41.137 | 10.4.1.56) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-15 (3.250.40.143 | 10.4.1.180) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-14 (34.245.130.150 | 10.4.1.20) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-13 (34.242.227.38 | 10.4.0.128) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-12 (34.247.184.140 | 10.4.1.117) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-11 (34.243.251.238 | 10.4.3.193) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-10 (54.154.141.220 | 10.4.0.199) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-7c261c1d-1 (3.250.78.58 | 10.4.0.249) (shards: 14)

OS / Image: ami-0c90fd2f4dcfdd273 (aws: eu-west-1)

Test: longevity-50gb-3days
Test id: 7c261c1d-0266-41ae-a139-e74e2338e99b
Test name: scylla-master/longevity/longevity-50gb-3days
Test config file(s):

Issue description

The issue was reproduced several times in this run:
It occurred a few times during the disrupt_rolling_restart_cluster nemesis, in both of the nemesis's runs:
(This time, while node 8 was starting up)

2022-07-30 15:55:27.269 <2022-07-30 15:55:27.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=1525989 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-8
2022-07-30T15:55:27+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-8      !ERR | scylla[88139]:  [shard 0] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-30 15:55:27.323 <2022-07-30 15:55:27.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=1526034 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-8
2022-07-30T15:55:27+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-8      !ERR | scylla[88139]:  [shard 13] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-30 15:55:27.355 <2022-07-30 15:55:27.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=1526055 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-8
2022-07-30T15:55:27+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-8      !ERR | scylla[88139]:  [shard 1] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-30 15:55:27.357 <2022-07-30 15:55:27.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=1526061 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-8
2022-07-30T15:55:27+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-8      !ERR | scylla[88139]:  [shard 4] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-30 15:55:27.453 <2022-07-30 15:55:27.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=1526121 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-8
2022-07-30T15:55:27+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-8      !ERR | scylla[88139]:  [shard 5] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-30 15:55:27.518 <2022-07-30 15:55:27.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=1526160 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-8
2022-07-30T15:55:27+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-8      !ERR | scylla[88139]:  [shard 11] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy

(After node 6 started)

2022-07-31 03:56:22.058 <2022-07-31 03:56:20.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=2775250 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-6
2022-07-31T03:56:20+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-6      !ERR | scylla[74369]:  [shard 0] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy

(After node 8 was restarted)

2022-07-31 03:59:24.592 <2022-07-31 03:59:24.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=2533645 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-8
2022-07-31T03:59:24+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-8      !ERR | scylla[36165]:  [shard 1] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-31 03:59:24.639 <2022-07-31 03:59:24.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=2533663 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-8
2022-07-31T03:59:24+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-8      !ERR | scylla[36165]:  [shard 0] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-31 03:59:25.672 <2022-07-31 03:59:24.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=2534144 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-8
2022-07-31T03:59:24+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-8      !ERR | scylla[36165]:  [shard 4] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy

(After node 12 was restarted)

2022-07-31 04:02:25.388 <2022-07-31 04:02:25.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=1265976 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-12
2022-07-31T04:02:25+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-12      !ERR | scylla[65981]:  [shard 12] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-31 04:02:25.413 <2022-07-31 04:02:25.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=1265993 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-12
2022-07-31T04:02:25+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-12      !ERR | scylla[65981]:  [shard 0] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-31 04:02:25.444 <2022-07-31 04:02:25.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=1266003 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-12
2022-07-31T04:02:25+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-12      !ERR | scylla[65981]:  [shard 2] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-31 04:02:25.457 <2022-07-31 04:02:25.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=1266024 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-12
2022-07-31T04:02:25+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-12      !ERR | scylla[65981]:  [shard 3] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy

(After node 17 was restarted)

2022-07-31 04:05:32.266 <2022-07-31 04:05:32.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=317553 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-17
2022-07-31T04:05:32+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-17      !ERR | scylla[29538]:  [shard 1] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-31 04:05:32.296 <2022-07-31 04:05:32.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=317563 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-17
2022-07-31T04:05:32+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-17      !ERR | scylla[29538]:  [shard 3] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-31 04:05:32.299 <2022-07-31 04:05:32.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=317565 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-17
2022-07-31T04:05:32+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-17      !ERR | scylla[29538]:  [shard 4] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-31 04:05:32.373 <2022-07-31 04:05:32.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=317596 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-17
2022-07-31T04:05:32+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-17      !ERR | scylla[29538]:  [shard 8] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-31 04:05:32.400 <2022-07-31 04:05:32.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=317612 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-17
2022-07-31T04:05:32+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-17      !ERR | scylla[29538]:  [shard 0] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-31 04:05:32.518 <2022-07-31 04:05:32.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=317659 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-17
2022-07-31T04:05:32+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-17      !ERR | scylla[29538]:  [shard 5] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy

Another occurrence was during the MultipleHardRebootNode nemesis, with node 18 as the target node; it started at 2022-07-31 08:15:49.935 and ended at 2022-07-31 08:56:02.254:

2022-07-31 08:24:19.317 <2022-07-31 08:24:13.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=604172 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-18
2022-07-31T08:24:13+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-18      !ERR | scylla[1005]:  [shard 1] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-31 08:24:19.317 <2022-07-31 08:24:13.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=604173 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-18
2022-07-31T08:24:13+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-18      !ERR | scylla[1005]:  [shard 6] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-31 08:24:19.318 <2022-07-31 08:24:13.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=604174 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-18
2022-07-31T08:24:13+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-18      !ERR | scylla[1005]:  [shard 3] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-31 08:24:19.318 <2022-07-31 08:24:13.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=604176 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-18
2022-07-31T08:24:13+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-18      !ERR | scylla[1005]:  [shard 5] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-31 08:24:19.318 <2022-07-31 08:24:13.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=604177 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-18
2022-07-31T08:24:13+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-18      !ERR | scylla[1005]:  [shard 8] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
...
2022-07-31 08:30:09.668 <2022-07-31 08:30:03.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=650764 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-18
2022-07-31T08:30:03+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-18      !ERR | scylla[957]:  [shard 10] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-31 08:30:09.668 <2022-07-31 08:30:03.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=650767 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-18
2022-07-31T08:30:03+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-18      !ERR | scylla[957]:  [shard 4] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-31 08:30:09.669 <2022-07-31 08:30:03.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=650770 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-18
2022-07-31T08:30:03+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-18      !ERR | scylla[957]:  [shard 3] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
...
2022-07-31 08:54:45.581 <2022-07-31 08:54:22.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=880392 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-18
2022-07-31T08:54:22+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-18      !ERR | scylla[1008]:  [shard 11] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-31 08:54:45.581 <2022-07-31 08:54:22.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=880393 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-18
2022-07-31T08:54:22+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-18      !ERR | scylla[1008]:  [shard 4] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-07-31 08:53:52.882 <2022-07-31 08:53:46.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=665fdf51-b4ea-4a39-8e41-6eeef9aac68f: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=875620 node=longevity-tls-50gb-3d-master-db-node-7c261c1d-18
2022-07-31T08:53:46+00:00 longevity-tls-50gb-3d-master-db-node-7c261c1d-18      !ERR | scylla[968]:  [shard 10] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy

EDIT: Removed mention of the coredump, it was #11118

  • Restore Monitor Stack command: $ hydra investigate show-monitor 7c261c1d-0266-41ae-a139-e74e2338e99b
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 7c261c1d-0266-41ae-a139-e74e2338e99b

Logs:

Jenkins job URL

@fruch
Contributor Author

fruch commented Aug 2, 2022

@ShlomiBalalis the coredump doesn't sound related to this issue, please raise it as its own issue

@elcallio
Contributor

Again, I would like to point out that while I can guess the reason for the error (above), I can't really verify it as it stands, since I can't reproduce it. Is there any way to provoke the issue that doesn't involve nemesis? Other than that, I don't know how to find and verify potential scheduling group switching...

@elcallio
Contributor

But if I were to speculate, I would suspect abstract_write_response_handler::timeout_cb, which is called from a timer yet uses the "stats" object acquired from the creating task queue, which might be different.
This would suggest that the correct fix is to keep track of the scheduling group for these counters and use the one we were created with. Or eagerly create the counters so we never risk hitting them first in a timer, if that is what happens.
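
For illustration, here is the same toy model as above, adjusted in the direction suggested here (again hypothetical names, not the actual code): the stats object remembers the scheduling group it was created with and derives its labels from that, so a timer firing under a different current group can no longer produce a colliding registration.

```cpp
// Toy model of the suggested fix (hypothetical names, not Scylla/Seastar code).
#include <map>
#include <set>
#include <stdexcept>
#include <string>

struct registry {
    std::set<std::string> keys;
    void add(const std::string& key) {
        if (!keys.insert(key).second) {
            throw std::runtime_error("registering metrics twice for metrics: " + key);
        }
    }
};

std::string current_scheduling_group = "statement";

struct per_group_stats {
    registry& reg;
    std::string owning_group;                    // captured at construction
    std::map<std::string, int> per_dc_counters;

    void bump(const std::string& dc) {
        if (per_dc_counters.emplace(dc, 0).second) {
            // Labels come from the owning group, never from whatever group is
            // current (e.g. inside a timer callback).
            reg.add("writes_failed_remote_node{group=" + owning_group + ",dc=" + dc + "}");
        }
        ++per_dc_counters[dc];
    }
};

int main() {
    registry reg;
    per_group_stats statement_stats{reg, "statement"};
    per_group_stats background_stats{reg, "background"};

    statement_stats.bump("dc1");   // {group=statement, dc=dc1}
    background_stats.bump("dc1");  // {group=background, dc=dc1} -- no collision
}
```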

@ShlomiBalalis

The issue was reproduced in a 2022.2 job:

Installation details

Kernel Version: 5.15.0-1017-aws
Scylla version (or git commit hash): 2022.2.0~rc0-20220814.eb67e6127d6b with build-id 5fc03e7326d7fd7e3eb1228a48382a35fcd8c08f
Cluster size: 5 nodes (i3en.3xlarge)

Scylla Nodes used in this run:

  • longevity-tls-2tb-48h-1dis-2nondis--db-node-106870ad-9 (54.217.53.187 | 10.4.1.74) (shards: 10)
  • longevity-tls-2tb-48h-1dis-2nondis--db-node-106870ad-8 (54.76.191.138 | 10.4.0.61) (shards: 10)
  • longevity-tls-2tb-48h-1dis-2nondis--db-node-106870ad-7 (34.242.186.118 | 10.4.2.134) (shards: 10)
  • longevity-tls-2tb-48h-1dis-2nondis--db-node-106870ad-6 (34.252.218.62 | 10.4.0.53) (shards: 10)
  • longevity-tls-2tb-48h-1dis-2nondis--db-node-106870ad-5 (18.202.219.108 | 10.4.3.222) (shards: 10)
  • longevity-tls-2tb-48h-1dis-2nondis--db-node-106870ad-4 (34.240.218.65 | 10.4.0.105) (shards: 10)
  • longevity-tls-2tb-48h-1dis-2nondis--db-node-106870ad-3 (63.33.202.250 | 10.4.3.103) (shards: 10)
  • longevity-tls-2tb-48h-1dis-2nondis--db-node-106870ad-2 (3.250.32.71 | 10.4.0.135) (shards: 10)
  • longevity-tls-2tb-48h-1dis-2nondis--db-node-106870ad-10 (54.216.121.16 | 10.4.3.48) (shards: 10)
  • longevity-tls-2tb-48h-1dis-2nondis--db-node-106870ad-1 (54.171.229.22 | 10.4.3.156) (shards: 10)

OS / Image: ami-044020f115ee292eb (aws: eu-west-1)

Test: longevity-2TB-48h-authorization-and-tls-ssl-1dis-2nondis-nemesis-test
Test id: 106870ad-dc22-4572-8fd1-87455e1af652
Test name: enterprise-2022.2/longevity/longevity-2TB-48h-authorization-and-tls-ssl-1dis-2nondis-nemesis-test
Test config file(s):

Issue description

At 2022-08-23 17:42:34.285, a kill_scylla nemesis started on node 4. After the Scylla service was killed and while it was starting up, the following occurred:

2022-08-23T17:47:32+00:00 longevity-tls-2tb-48h-1dis-2nondis--db-node-106870ad-4     !INFO | scylla[21464]:  [shard 0] init - Waiting for gossip to settle before accepting client requests...
2022-08-23T17:47:33+00:00 longevity-tls-2tb-48h-1dis-2nondis--db-node-106870ad-4     !INFO | scylla[21464]:  [shard 7] compaction - [Compact system.peers aa11d060-230b-11ed-aa29-108e18ecc7a2] Compacted 4 sstables to [/var/lib/scylla/data/system/peers-37f71aca7dc2383ba70672528af04d4f/me-367-big-Data.db:level=0]. 200kB to 89kB (~44% of original) in 438ms = 203kB/s. ~512 total partitions merged to 2.
2022-08-23T17:47:33+00:00 longevity-tls-2tb-48h-1dis-2nondis--db-node-106870ad-4     !INFO | scylla[21464]:  [shard 8] compaction - [Compact mview.users a900fca0-230b-11ed-b543-108718ecc7a2] Compacted 2 sstables to [/var/lib/scylla/data/mview/users-2d6c9c00230311ed989951bb33648802/me-88-big-Data.db:level=0]. 21MB to 12MB (~57% of original) in 2500ms = 4MB/s. ~87424 total partitions merged to 50060.
...
2022-08-23T17:47:41+00:00 longevity-tls-2tb-48h-1dis-2nondis--db-node-106870ad-4      !ERR | scylla[21464]:  [shard 9] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-08-23 17:47:41.468 <2022-08-23 17:47:41.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=64d7de25-7af5-4d01-8451-f76d8801d6c9: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=76644 node=longevity-tls-2tb-48h-1dis-2nondis--db-node-106870ad-4
2022-08-23T17:47:41+00:00 longevity-tls-2tb-48h-1dis-2nondis--db-node-106870ad-4      !ERR | scylla[21464]:  [shard 7] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-08-23T17:47:44+00:00 longevity-tls-2tb-48h-1dis-2nondis--db-node-106870ad-4     !INFO | scylla[21464]:  [shard 0] gossip - No gossip backlog; proceeding
2022-08-23T17:47:44+00:00 longevity-tls-2tb-48h-1dis-2nondis--db-node-106870ad-4     !INFO | scylla[21464]:  [shard 0] init - allow replaying hints
2022-08-23T17:47:44+00:00 longevity-tls-2tb-48h-1dis-2nondis--db-node-106870ad-4     !INFO | scylla[21464]:  [shard 0] init - Launching generate_mv_updates for non system tables

It's important to note that a few minutes before, #11252 occurred on the same node.

  • Restore Monitor Stack command: $ hydra investigate show-monitor 106870ad-dc22-4572-8fd1-87455e1af652
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 106870ad-dc22-4572-8fd1-87455e1af652

Logs:

Jenkins job URL

@slivne
Contributor

slivne commented Sep 1, 2022

@elcallio ping - this reoccurs

@soyacz
Contributor

soyacz commented Sep 6, 2022

This also happened on a recent master branch run, where we saw multiple errors related to reader_concurrency_semaphore on the affected shard ([shard 12]).

system.log|Sep 02 11:04:07 longevity-tls-50gb-3d-master-db-node-46f53c04-3 scylla[971]:  [shard 12] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy

Installation details

Kernel Version: 5.15.0-1019-aws
Scylla version (or git commit hash): 5.2.0~dev-20220831.4e413787d2dd with build-id a94ac836ea302c1afb37a74b6941797733026747
Cluster size: 6 nodes (i3.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-master-db-node-46f53c04-9 (13.51.161.104 | 10.0.0.221) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-46f53c04-8 (13.48.24.204 | 10.0.2.78) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-46f53c04-7 (13.53.115.221 | 10.0.3.144) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-46f53c04-6 (13.49.225.98 | 10.0.1.52) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-46f53c04-5 (13.51.234.218 | 10.0.2.52) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-46f53c04-4 (13.51.238.151 | 10.0.1.9) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-46f53c04-3 (13.49.65.127 | 10.0.1.137) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-46f53c04-2 (13.53.182.112 | 10.0.1.205) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-46f53c04-1 (16.16.67.188 | 10.0.3.113) (shards: 14)

OS / Image: ami-0b2d0b0f4850fb338 (aws: eu-north-1)

Test: longevity-50gb-3days
Test id: 46f53c04-5f2b-4e22-b342-67984f6607fc
Test name: scylla-master/longevity/longevity-50gb-3days
Test config file(s):

Issue description


  • Restore Monitor Stack command: $ hydra investigate show-monitor 46f53c04-5f2b-4e22-b342-67984f6607fc
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 46f53c04-5f2b-4e22-b342-67984f6607fc

Logs:

Jenkins job URL

@fruch
Contributor Author

fruch commented Oct 6, 2022

still happens on the same job:

Installation details

Kernel Version: 5.15.0-1020-aws
Scylla version (or git commit hash): 5.2.0~dev-20220929.c194c811df01 with build-id 3e922fa2f3dd7546827b4bb02ec4cebd184e062b

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-master-db-node-dac566bd-9 (35.168.111.10 | 10.12.23.164) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-dac566bd-8 (34.204.176.22 | 10.12.21.82) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-dac566bd-7 (44.200.47.152 | 10.12.23.62) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-dac566bd-6 (44.210.108.65 | 10.12.22.45) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-dac566bd-5 (3.234.239.222 | 10.12.23.39) (shards: 14)
    ...

OS / Image: ami-0b2aef7db77d5711a (aws: us-east-1)

Test: longevity-50gb-3days
Test id: dac566bd-2a9e-4be6-986c-90e3b181a5b7
Test name: scylla-master/longevity/longevity-50gb-3days
Test config file(s):

Logs:

Jenkins job URL

@DoronArazii

@eliransin please have a look and determine the priority

@DoronArazii DoronArazii modified the milestones: 5.1, 5.2 Oct 12, 2022
@yarongilor

Reproduced in a terminate-and-replace nemesis as well (in an Alternator-TTL test):

(DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=4d4f44a9-e39e-4f8e-9b58-e93df2718728: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=101059 node=alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-5
2022-11-04T06:28:08+00:00 alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-5 !ERR | scylla[58893]: [shard 6] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy

Installation details

Kernel Version: 5.15.0-1021-aws
Scylla version (or git commit hash): 2022.2.0~rc3-20221009.994a5f0fbb4c with build-id 756ea8d62c25ed4acdf087054e11b3d07596a117
Relocatable Package: http://downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2022.2/scylla-enterprise-x86_64-package-2022.2.0-rc3.0.20221009.994a5f0fbb4c.tar.gz
Cluster size: 4 nodes (i3.4xlarge)

Scylla Nodes used in this run:

  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-9 (34.248.19.247 | 10.4.0.151) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-8 (52.209.108.120 | 10.4.0.160) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-7 (34.245.201.183 | 10.4.3.188) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-6 (34.250.32.253 | 10.4.2.151) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-5 (34.245.124.139 | 10.4.2.213) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-4 (54.195.144.153 | 10.4.0.41) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-3 (34.254.89.227 | 10.4.0.77) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-2 (54.155.84.133 | 10.4.3.211) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-16 (3.250.192.234 | 10.4.0.39) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-15 (3.251.81.51 | 10.4.0.27) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-14 (3.250.105.87 | 10.4.2.126) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-13 (54.194.207.154 | 10.4.0.92) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-12 (34.244.29.173 | 10.4.1.91) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-11 (54.194.213.46 | 10.4.1.151) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-10 (52.212.227.101 | 10.4.0.71) (shards: 14)
  • alternator-ttl-4-loaders-no-lwt-sis-db-node-7da36ba4-1 (63.35.163.138 | 10.4.0.55) (shards: 14)

OS / Image: ami-0b9c9dd9d3af4cec6 (aws: eu-west-1)

Test: longevity-alternator-1h-scan-12h-ttl-no-lwt-2h-grace-4loaders-nemesis
Test id: 7da36ba4-479e-42fd-bc55-641409ff1c77
Test name: scylla-staging/yarongilor/longevity-alternator-1h-scan-12h-ttl-no-lwt-2h-grace-4loaders-nemesis
Test config file(s):

Issue description


  • Restore Monitor Stack command: $ hydra investigate show-monitor 7da36ba4-479e-42fd-bc55-641409ff1c77
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 7da36ba4-479e-42fd-bc55-641409ff1c77

Logs:

Jenkins job URL

@avikivity
Member

@eliransin may be related to per-scheduling-group stats

@KnifeyMoloko
Contributor

Reproduced a few times during the RebuildStreamingErr and MultipleHardReboot nemeses:

During MultipleHardRebootNode:

2022-11-05 11:54:34.675 <2022-11-05 11:54:30.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=90edb6d2-1cd7-4180-a46c-bfef872978de: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=806149 node=longevity-tls-50gb-3d-2022-2-db-node-7f685b49-10
2022-11-05T11:54:30+00:00 longevity-tls-50gb-3d-2022-2-db-node-7f685b49-10      !ERR | scylla[934]:  [shard 11] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy

and RebuildStreamingErr:

2022-11-07 11:02:31.660 <2022-11-07 11:02:27.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=90edb6d2-1cd7-4180-a46c-bfef872978de: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=250446 node=longevity-tls-50gb-3d-2022-2-db-node-7f685b49-24
2022-11-07T11:02:27+00:00 longevity-tls-50gb-3d-2022-2-db-node-7f685b49-24      !ERR | scylla[53463]:  [shard  2] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy

Installation details

Kernel Version: 5.15.0-1021-aws
Scylla version (or git commit hash): 2022.2.0~rc3-20221009.994a5f0fbb4c with build-id ad2ce7ef5214afc9f921ccfa2ce00706376f77dc
Relocatable Package: http://downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2022.2/scylla-enterprise-aarch64-package-2022.2.0-rc3.0.20221009.994a5f0fbb4c.tar.gz
Cluster size: 6 nodes (im4gn.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-4 (54.74.223.126 | 10.4.8.186) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-2 (34.244.91.86 | 10.4.11.8) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-8 (63.33.59.83 | 10.4.9.36) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-3 (34.244.34.161 | 10.4.11.184) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-5 (34.244.198.144 | 10.4.8.222) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-11 (34.246.185.229 | 10.4.8.69) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-7 (3.252.148.218 | 10.4.11.0) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-14 (54.78.166.116 | 10.4.11.241) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-13 (54.170.39.69 | 10.4.10.62) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-10 (54.74.187.127 | 10.4.9.87) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-17 (54.217.160.162 | 10.4.9.181) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-6 (54.194.155.233 | 10.4.8.125) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-16 (52.18.244.120 | 10.4.10.251) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-9 (54.74.252.233 | 10.4.10.212) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-18 (52.18.223.205 | 10.4.9.228) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-22 (52.17.24.151 | 10.4.9.33) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-12 (34.242.35.121 | 10.4.8.206) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-19 (34.244.170.40 | 10.4.9.23) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-20 (54.170.213.42 | 10.4.11.112) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-25 (54.216.162.126 | 10.4.9.20) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-15 (54.194.158.55 | 10.4.11.5) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-26 (3.249.36.220 | 10.4.11.208) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-1 (34.245.60.47 | 10.4.10.193) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-21 (63.35.178.122 | 10.4.10.136) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-23 (34.245.48.250 | 10.4.8.143) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-24 (34.255.122.210 | 10.4.9.61) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-27 (54.74.231.159 | 10.4.8.21) (shards: 15)
  • longevity-tls-50gb-3d-2022-2-db-node-7f685b49-28 (54.77.180.124 | 10.4.11.209) (shards: 15)

OS / Image: ami-0dcd7f54f9f910597 (aws: eu-west-1)

Test: longevity-50gb-3days-arm-test
Test id: 7f685b49-4c05-4d4d-987e-8287694d3a4e
Test name: enterprise-2022.2/longevity/longevity-50gb-3days-arm-test
Test config file(s):

Issue description


  • Restore Monitor Stack command: $ hydra investigate show-monitor 7f685b49-4c05-4d4d-987e-8287694d3a4e
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 7f685b49-4c05-4d4d-987e-8287694d3a4e

Logs:

Jenkins job URL

@ilya-rarov

Got it in a 2022.2.rc4 job.

Installation details

Kernel Version: 5.15.0-1022-aws
Scylla version (or git commit hash): 2022.2.0~rc4-20221106.f5714e0db12f with build-id f4a927b8a00fbcd8d48640835192aeaa7968b1f2
Relocatable Package: http://downloads.scylladb.com/unstable/scylla-enterprise/enterprise-2022.2/relocatable/2022-11-06T15:44:05Z/scylla-enterprise-x86_64-package.tar.gz
Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-9 (35.178.200.195 | 10.3.1.68) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-8 (13.40.154.111 | 10.3.3.247) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-7 (18.130.108.114 | 10.3.0.41) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-6 (3.9.16.111 | 10.3.1.133) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-5 (13.40.151.248 | 10.3.1.112) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-4 (52.56.200.142 | 10.3.0.176) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-33 (13.40.115.39 | 10.3.0.101) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-32 (13.40.123.99 | 10.3.2.159) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-31 (3.9.115.191 | 10.3.0.118) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-30 (3.8.170.243 | 10.3.3.89) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-3 (18.135.104.106 | 10.3.0.170) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-29 (13.40.101.236 | 10.3.0.29) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-28 (18.130.232.235 | 10.3.3.117) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-27 (18.170.61.6 | 10.3.3.209) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-26 (18.134.198.186 | 10.3.1.5) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-25 (18.169.157.184 | 10.3.3.5) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-24 (35.176.158.13 | 10.3.1.214) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-23 (3.8.131.38 | 10.3.1.112) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-22 (3.9.177.44 | 10.3.0.93) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-21 (35.178.161.42 | 10.3.3.223) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-20 (35.176.103.236 | 10.3.0.99) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-2 (3.8.139.104 | 10.3.2.112) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-19 (18.133.175.26 | 10.3.2.156) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-18 (52.56.125.9 | 10.3.2.89) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-17 (35.178.199.29 | 10.3.1.188) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-16 (52.56.195.17 | 10.3.1.19) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-15 (18.135.105.217 | 10.3.3.172) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-14 (13.40.73.39 | 10.3.1.213) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-13 (3.8.143.101 | 10.3.0.34) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-12 (18.133.238.4 | 10.3.0.217) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-11 (18.132.9.223 | 10.3.0.175) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-10 (18.133.244.123 | 10.3.1.199) (shards: 14)
  • longevity-tls-50gb-3d-2022-2-db-node-a94220e9-1 (52.56.171.91 | 10.3.1.49) (shards: 14)

OS / Image: ami-06afe72a2cd370dbb (aws: eu-west-2)

Test: longevity-50gb-3days-aws-i4i-test
Test id: a94220e9-cdb8-4806-9590-185b346bbad1
Test name: enterprise-2022.2/longevity/longevity-50gb-3days-aws-i4i-test
Test config file(s):

Issue description

  • During disrupt_multiple_hard_reboot_node nemesis
2022-11-11 02:30:19.519 <2022-11-11 02:30:16.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=a7c58898-f063-4860-a493-bb656d421201: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=783006 node=longevity-tls-50gb-3d-2022-2-db-node-a94220e9-3
2022-11-11T02:30:16+00:00 longevity-tls-50gb-3d-2022-2-db-node-a94220e9-3      !ERR | scylla[917]:  [shard 10] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-11-11 02:30:19.519 <2022-11-11 02:30:16.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=a7c58898-f063-4860-a493-bb656d421201: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=783007 node=longevity-tls-50gb-3d-2022-2-db-node-a94220e9-3
2022-11-11T02:30:16+00:00 longevity-tls-50gb-3d-2022-2-db-node-a94220e9-3      !ERR | scylla[917]:  [shard  5] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
[... unresolved backtrace frames (?? ??:0) omitted ...]
  • During disrupt_memory_stress nemesis
2022-11-12 22:53:49.724 <2022-11-12 22:53:49.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=a7c58898-f063-4860-a493-bb656d421201: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=354844 node=longevity-tls-50gb-3d-2022-2-db-node-a94220e9-27
2022-11-12T22:53:49+00:00 longevity-tls-50gb-3d-2022-2-db-node-a94220e9-27      !ERR | scylla[899]:  [shard  4] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
2022-11-12 22:53:49.725 <2022-11-12 22:53:49.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=a7c58898-f063-4860-a493-bb656d421201: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=354845 node=longevity-tls-50gb-3d-2022-2-db-node-a94220e9-27
2022-11-12T22:53:49+00:00 longevity-tls-50gb-3d-2022-2-db-node-a94220e9-27      !ERR | scylla[899]:  [shard  5] storage_proxy - Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy
  • Restore Monitor Stack command: $ hydra investigate show-monitor a94220e9-cdb8-4806-9590-185b346bbad1
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs a94220e9-cdb8-4806-9590-185b346bbad1

Logs:

Jenkins job URL

@DoronArazii DoronArazii modified the milestones: 5.2, 5.x Nov 27, 2022
@DoronArazii DoronArazii added Requires-Backport-to-5.1 backport/5.2 Issues that should be backported to 5.2 branch once they'll be fixed labels Jun 22, 2023
@DoronArazii DoronArazii modified the milestones: 5.3, 5.4 Jul 4, 2023
@roydahan
Copy link

roydahan commented Jul 6, 2023

@avikivity - This issue bothers QA in all patch releases because we hit it in almost every rolling upgrade.

@mykaul
Copy link
Contributor

mykaul commented Jul 6, 2023

@avikivity - This issue bothers QA in all patch releases because we hit it in almost every rolling upgrade.

Any issue with ignoring it?

@aleksbykov
Copy link
Contributor

@mykaul, no problem with ignoring it, but it looks like it should be backported to 2022.2, so we don't need to wait?

@mykaul
Copy link
Contributor

mykaul commented Jul 9, 2023

@mykaul, no problem with ignoring it, but it looks like it should be backported to 2022.2, so we don't need to wait?

The issue has all the right flags to be backported; it just takes longer than QA ignoring it.

@DoronArazii
Copy link

@scylladb/scylla-maint please consider backporting to all the active 5.x branches.

denesb pushed a commit that referenced this issue Jul 11, 2023
…ent scheduling group

Fixes #11017

When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics are executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.

Fixed here by keeping a reference to creating scheduling group and using this, not
active one, when/if creating new metrics.

Closes #14294

(cherry picked from commit f18e967)
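To make the mechanism in the commit message above easier to follow, here is a minimal, self-contained C++ sketch. It is not the Scylla code: scheduling_group, metric_registry, registered, stats_for_active_group and handler are hypothetical stand-ins; only the metric name and the "throws on duplicate registration" behaviour (as seastar::metrics does) come from this issue. The sketch shows how keying the "already registered" guard on the active scheduling group can double-register a metric when a gossip callback runs in a different group, and how keying it on the creating group avoids that.

// Simplified illustration of the double-registration scenario; all names are stand-ins.
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <iostream>

using scheduling_group = int;          // stand-in for a scheduling group id
scheduling_group active_group = 0;     // the "currently running" group

struct metric_registry {               // throws on duplicate names, like seastar::metrics
    std::set<std::string> names;
    void add(const std::string& name) {
        if (!names.insert(name).second) {
            throw std::runtime_error("double_registration: " + name);
        }
    }
};

metric_registry registry;
std::map<scheduling_group, bool> registered;   // per-group "already registered" guard

// Buggy variant: the guard is looked up by whichever group is active at call time.
void stats_for_active_group() {
    if (!registered[active_group]) {   // guard misses when first called from another group
        registry.add("storage_proxy_coordinator_background_replica_writes_failed_remote_node");
        registered[active_group] = true;
    }
}

// Fixed variant: remember the group the handler was created in and always key off it.
struct handler {
    scheduling_group created_in = active_group;
    void stats() {
        if (!registered[created_in]) {
            registry.add("storage_proxy_coordinator_background_replica_writes_failed_remote_node");
            registered[created_in] = true;
        }
    }
};

int main() {
    active_group = 1;                  // write path runs in the statement group
    stats_for_active_group();          // first registration succeeds
    handler h;                         // fixed handler is also created in group 1
    active_group = 2;                  // gossip "node down" callback runs in a different group
    try {
        stats_for_active_group();      // group-2 guard is empty -> same name registered again
    } catch (const std::exception& e) {
        std::cout << "buggy path: " << e.what() << '\n';
    }
    h.stats();                         // fixed path keys off the creating group -> no re-registration
    std::cout << "fixed path: no double registration\n";
}

Under these assumptions, the buggy path reproduces the double_registration error seen in the logs, while the fixed path is a no-op because the creating group's metrics already exist.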
@denesb
Copy link
Contributor

denesb commented Jul 11, 2023

Backported to 5.3 and 5.2. Backport to 5.1 is not clean, @elcallio please provide a backport PR for 5.1.

@denesb denesb removed backport/5.2 Issues that should be backported to 5.2 branch once they'll be fixed Requires-Backport-to-5.3 labels Jul 11, 2023
@elcallio
Copy link
Contributor

There is no longer a branch-5.1 in scylla-enterprise. What is the base branch you want me to backport to? (Confused)...

@elcallio
Copy link
Contributor

Oh, you meant scylla-scylla. How quaint...

elcallio pushed a commit to elcallio/scylla that referenced this issue Jul 11, 2023
…ent scheduling group

Fixes scylladb#11017

When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics are executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.

Fixed here by keeping a reference to creating scheduling group and using this, not
active one, when/if creating new metrics.
@denesb
Copy link
Contributor

denesb commented Jul 11, 2023

Had to dequeue the 5.2 backport; it breaks the build. @elcallio please also create a backport PR for 5.2.

elcallio pushed a commit to elcallio/scylla that referenced this issue Jul 11, 2023
…ent scheduling group

Fixes scylladb#11017

When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics are executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.

Fixed here by keeping a reference to creating scheduling group and using this, not
active one, when/if creating new metrics.
denesb pushed a commit that referenced this issue Jul 11, 2023
…ent scheduling group

Fixes #11017

When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics are executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.

Fixed here by keeping a reference to creating scheduling group and using this, not
active one, when/if creating new metrics.

Closes #14631
denesb pushed a commit that referenced this issue Jul 12, 2023
…ent scheduling group

Fixes #11017

When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics are executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.

Fixed here by keeping a reference to creating scheduling group and using this, not
active one, when/if creating new metrics.

Closes #14636