Failed to obtain stats (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)), fall-back to dummy #11017
Comments
/cc @elcallio, looks like listener restart forgets to unregister metrics |
test case is still running: it happened again on different nodes, during
|
happened once more time during
|
So, while I did not write nor actually read any of this code before this hour: assuming we don't create two storage_proxies, I can only guess the problem is that we somehow switch scheduling groups between loading the stats object (stored as scheduling-group storage) and deriving the actual counter labels (which look up the current scheduling group). Again, I am not familiar with this code, but either someone who knows where we potentially switch groups should maybe try to keep us in the original one, or we could consider keeping track of the group in the actual |
Happens also in this week run, while doing config change with rolling restart
Installation details: Kernel Version: 5.15.0-1015-aws. Scylla Nodes used in this run:
OS / Image: Test:
Logs: No logs captured during this run. |
Reproduced with
Installation details: Kernel Version: 5.15.0-1015-aws. Scylla Nodes used in this run:
OS / Image: Test:
Logs: No logs captured during this run. |
I spotted this issue in upcoming 2022.1.rc9-0.20220721.9c95c3a8c (upgrade test):
Test details: Test: upgrade_test.UpgradeTest.test_rolling_upgrade. Test result: FAILED. System under test: ScyllaDB version 5.0.1-0.20220719.b177dacd3 with build-id 217f31634f8c8722cadcfe57ade8da58af05d415 (ami-03c187d4a5a22faa0). Restore commands: Restore Monitor Stack command: $ hydra investigate show-monitor 980f74a6-b3fe-4563-a817-358127d37089. Logs: grafana - https://cloudius-jenkins-test.s3.amazonaws.com/980f74a6-b3fe-4563-a817-358127d37089/20220727_123717/grafana-screenshot-overview-20220727_123717-rolling-upgrade--centos-monitor-node-980f74a6-1.png. Links: Build URL |
Installation details: Kernel Version: 5.15.0-1015-aws. Scylla Nodes used in this run:
OS / Image: Test: Issue description: The issue was reproduced several times in this run:
(After node 6 has started)
(After node 8 was restarted)
(After node 12 was restarted)
(After node 17 was restarted)
Another time was during
EDIT: Removed mention of the coredump, it was #11118
Logs:
|
@ShlomiBalalis the coredump doesn't sound related to this issue, please raise it as its own issue |
Again, I would like to point out that while I can guess the reason for the error (above), I can't really verify it as it stands, since I can't really repro. Is there any way to provoke the issue not involving nemesis? Other than that, I don't know how to find and verify potential scheduling group switching... |
But if I were to speculate, I would think it is |
The issue was replicated in a 2022.2 job: Installation details: Kernel Version: 5.15.0-1017-aws. Scylla Nodes used in this run:
OS / Image: Test: Issue description: At
It's important to note that a few minutes before, #11252 occurred on the same node.
Logs:
|
@elcallio ping - this reoccurs |
Happened also on recent master branch - where we saw multiple errors related to
Installation details: Kernel Version: 5.15.0-1019-aws. Scylla Nodes used in this run:
OS / Image: Test: Issue description:
Logs:
|
@eliransin please have a look and determine the priority |
Reproduced in a terminate-and-replace nemesis as well, in an Alternator-TTL test:
Installation details: Kernel Version: 5.15.0-1021-aws. Scylla Nodes used in this run:
OS / Image: Test: Issue description:
Logs:
|
@eliransin may be related to per-scheduling-group stats |
Reproduced a few times during RebuildStreamingErr and MultipleHardReboot nemeses: During
and
Installation details: Kernel Version: 5.15.0-1021-aws. Scylla Nodes used in this run:
OS / Image: Test: Issue description:
Logs:
|
Got it in a 2022.2.rc4 job. Installation details: Kernel Version: 5.15.0-1022-aws. Scylla Nodes used in this run:
OS / Image: Test: Issue description
Logs:
|
@avikivity - This issue bothers QA in all patch releases because we hit it in almost every rolling upgrade. |
Any issue with ignoring it? |
@mykaul, no problem with that, but it looks like it should be backported to 2022.2, so we don't need to wait? |
The issue has all the right flags to be backported, it just takes longer than QA ignoring it. |
@scylladb/scylla-maint please consider backport to all the active 5.x branches. |
…ent scheduling group Fixes #11017 When doing writes, storage proxy creates types deriving from abstract_write_response_handler. These are created in the various scheduling groups executing the write inducing code. They pick up a group-local reference to the various metrics used by SP. Normally all code using (and esp. modifying) these metrics are executed in the same scheduling group. However, if gossip sees a node go down, it will notify listeners, which eventually calls get_ep_stat and register_metrics. This code (before this patch) uses _active_ scheduling group to eventually add metrics, using a local dict as guard against double regs. If, as described above, we're called in a different sched group than the original one however, this can cause double registrations. Fixed here by keeping a reference to creating scheduling group and using this, not active one, when/if creating new metrics. Closes #14294 (cherry picked from commit f18e967)
Backported to 5.3 and 5.2. Backport to 5.1 is not clean, @elcallio please provide a backport PR for 5.1. |
There no longer is a branch-5.1 in scylla-enterprise. What is the base branch you wish me to backport to? (Confused)... |
Oh, you meant scylla-scylla. How quaint... |
…ent scheduling group Fixes scylladb#11017 When doing writes, storage proxy creates types deriving from abstract_write_response_handler. These are created in the various scheduling groups executing the write inducing code. They pick up a group-local reference to the various metrics used by SP. Normally all code using (and esp. modifying) these metrics are executed in the same scheduling group. However, if gossip sees a node go down, it will notify listeners, which eventually calls get_ep_stat and register_metrics. This code (before this patch) uses _active_ scheduling group to eventually add metrics, using a local dict as guard against double regs. If, as described above, we're called in a different sched group than the original one however, this can cause double registrations. Fixed here by keeping a reference to creating scheduling group and using this, not active one, when/if creating new metrics.
Had to dequeue 5.2 backport, it breaks the build. @elcallio please also create a backport PR for 5.2. |
…ent scheduling group Fixes #11017 When doing writes, storage proxy creates types deriving from abstract_write_response_handler. These are created in the various scheduling groups executing the write inducing code. They pick up a group-local reference to the various metrics used by SP. Normally all code using (and esp. modifying) these metrics are executed in the same scheduling group. However, if gossip sees a node go down, it will notify listeners, which eventually calls get_ep_stat and register_metrics. This code (before this patch) uses _active_ scheduling group to eventually add metrics, using a local dict as guard against double regs. If, as described above, we're called in a different sched group than the original one however, this can cause double registrations. Fixed here by keeping a reference to creating scheduling group and using this, not active one, when/if creating new metrics. Closes #14631
…ent scheduling group Fixes #11017 When doing writes, storage proxy creates types deriving from abstract_write_response_handler. These are created in the various scheduling groups executing the write inducing code. They pick up a group-local reference to the various metrics used by SP. Normally all code using (and esp. modifying) these metrics are executed in the same scheduling group. However, if gossip sees a node go down, it will notify listeners, which eventually calls get_ep_stat and register_metrics. This code (before this patch) uses _active_ scheduling group to eventually add metrics, using a local dict as guard against double regs. If, as described above, we're called in a different sched group than the original one however, this can cause double registrations. Fixed here by keeping a reference to creating scheduling group and using this, not active one, when/if creating new metrics. Closes #14636
Installation details
Kernel Version: 5.13.0-1031-aws
Scylla version (or git commit hash):
5.1.dev-20220706.a0ffbf3291b7
with build-id 3490fa9f14da510e97a1d0f53f693cac13a70494
Cluster size: 6 nodes (i3.4xlarge)
Scylla Nodes used in this run:
OS / Image:
ami-07d73e5ea1fc772eb
(aws: eu-west-1)
Test:
longevity-50gb-3days
Test id:
fb8b36a7-c818-4b0b-8ae3-f3ee2cafa53a
Test name:
scylla-master/longevity/longevity-50gb-3days
Test config file(s):
Issue description
during
disrupt_rolling_config_change_internode_compression
we got this error that we never encountered before:
$ hydra investigate show-monitor fb8b36a7-c818-4b0b-8ae3-f3ee2cafa53a
$ hydra investigate show-logs fb8b36a7-c818-4b0b-8ae3-f3ee2cafa53a
Logs:
No logs captured during this run.
Jenkins job URL