Repair task from manager failed due to coredump on one of the node #8059
Comments
@bhalevy
@denesb apparently the artifacts path has changed. /cc @hagitsegev
Thanks @bhalevy.
One immediately strange observation is that the end bound of
On second look
Looking at c3b4c3f, just to be sure, so the issue is a false-positive validation not a real issue.
Yes, the validation is using the wrong range to validate the emitted partition, so it triggers a false-positive validation failure.
@denesb do we need to backport this to 4.4 / 4.3 ?
Yes, 4.3 already has the position self validation.
@avikivity please backport. In any case that would be a prerequisite for backporting the fix for #8923, #8893
@avikivity ping
`_range_override` is used to store the modified range the reader reads after it has to be recreated (when recreating a reader, its read range is reduced to account for partitions it already read). When engaged, this field overrides the `_pr` field as the definitive range the reader is supposed to be currently reading. Fast forwarding conceptually overrides the range the reader is currently reading, however currently it doesn't reset the `_range_override` field. This resulted in `_range_override` (containing the modified pre-fast-forward range) incorrectly overriding the fast-forwarded-to range in `_pr` when validating the first partition produced by the just recreated reader, resulting in a false-positive validation failure.

Fixes: #8059
Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>
(cherry picked from commit c3b4c3f)
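The mechanism described in the commit message can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Scylla's actual `evictable_reader`: the names `token_range`, `toy_evictable_reader`, `recreate_after`, `validate` and `first_partition_validates` are hypothetical, ranges are half-open intervals over plain integers, and a constructor flag switches between the behaviour before and after the fix.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for a partition range: half-open [start, end) over integer tokens.
struct token_range {
    int start;
    int end;
    bool contains(int token) const { return token >= start && token < end; }
};

// Toy model of the reader state discussed above (not the real class).
class toy_evictable_reader {
    token_range _pr;                             // range the reader was asked to read
    std::optional<token_range> _range_override;  // narrowed range after a recreate
    bool _reset_override_on_ff;                  // true models the fixed behaviour
public:
    toy_evictable_reader(token_range pr, bool reset_override_on_ff)
        : _pr(pr), _reset_override_on_ff(reset_override_on_ff) {}

    // When engaged, the override wins over _pr as the definitive range.
    token_range current_range() const {
        return _range_override ? *_range_override : _pr;
    }

    // Recreating the reader narrows the range past what was already read.
    void recreate_after(int last_read_token) {
        const token_range cur = current_range();
        _range_override = token_range{last_read_token + 1, cur.end};
    }

    // Fast forwarding replaces the range being read. Without the reset,
    // the stale pre-fast-forward override keeps shadowing the new _pr.
    void fast_forward_to(token_range pr) {
        _pr = pr;
        if (_reset_override_on_ff) {
            _range_override.reset();
        }
    }

    // Position validation: is the emitted partition inside the current range?
    bool validate(int token) const { return current_range().contains(token); }
};

// Replays the sequence: recreate, fast forward, validate the first partition.
bool first_partition_validates(bool reset_override_on_ff) {
    toy_evictable_reader reader(token_range{0, 100}, reset_override_on_ff);
    reader.recreate_after(10);                      // override becomes [11, 100)
    reader.fast_forward_to(token_range{100, 200});  // new range to read
    return reader.validate(150);                    // partition from the new range
}
```

With the flag off, the partition at token 150 is checked against the stale `[11, 100)` override and validation fails even though the partition is legitimate, which is the false positive. With the flag on, the override is cleared and the partition is checked against `[100, 200)`.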
Backported to 4.4 (was already in 4.5).
`_range_override` is used to store the modified range the reader reads after it has to be recreated (when recreating a reader, its read range is reduced to account for partitions it already read). When engaged, this field overrides the `_pr` field as the definitive range the reader is supposed to be currently reading. Fast forwarding conceptually overrides the range the reader is currently reading, however currently it doesn't reset the `_range_override` field. This resulted in `_range_override` (containing the modified pre-fast-forward range) incorrectly overriding the fast-forwarded-to range in `_pr` when validating the first partition produced by the just recreated reader, resulting in a false-positive validation failure.

Fixes: scylladb#8059
Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>
(cherry picked from commit c3b4c3f)

Conflicts:
- Decoroutinize evictable_reader::fast_forward_to
- test_reader::next_partition: return void.
`_range_override` is used to store the modified range the reader reads after it has to be recreated (when recreating a reader, its read range is reduced to account for partitions it already read). When engaged, this field overrides the `_pr` field as the definitive range the reader is supposed to be currently reading. Fast forwarding conceptually overrides the range the reader is currently reading, however currently it doesn't reset the `_range_override` field. This resulted in `_range_override` (containing the modified pre-fast-forward range) incorrectly overriding the fast-forwarded-to range in `_pr` when validating the first partition produced by the just recreated reader, resulting in a false-positive validation failure.

Fixes: #8059
Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>
[avi: add #include]
(cherry picked from commit c3b4c3f)
Installation details
Scylla version (or git commit hash):
4.5.dev-0.20210204.7f3083739 with build-id 4abdd1a158a7b6e39afe5f03c27cd50a3cd9d46d
Cluster size: 6 nodes (i3.4xlarge)
OS (RHEL/CentOS/Ubuntu/AWS AMI):
ami-0edd71dfeca9e0df2
(aws: eu-north-1)
Test id:
2b6831c3-23d5-45bd-a7d3-10544ec41d9a
Test:
longevity-50gb-3days
Test name:
longevity_test.LongevityTest.test_custom_time
Test config file(s):
Issue description
====================================
The job failed after 22 hours. List of nemesis actions that ran:
After that, the next nemesis, ManagementRepair, started. A repair task was created and the repair was started on the cluster.
< t:2021-02-06 00:39:32,711 f:cli.py l:495 c:sdcm.mgmt.cli p:DEBUG > Created task id is: repair/0d092ce1-dd22-4ca9-863f-09aef4130160
During the repair, the following error and coredump occurred on node2:
Decoded backtrace
Coredump
This caused the repair task to fail:
2021-02-06 01:24:10.746: (DisruptionEvent Severity.ERROR): type=ManagementRepair subtype=end node=Node longevity-tls-50gb-3d-master-db-node-2b6831c3-7 [13.48.13.190 | 10.0.3.254] (seed: False) duration=2704 error=Task: repair/0d092ce1-dd22-4ca9-863f-09aef4130160 final status is: ERROR.
====================================
Restore Monitor Stack command:
$ hydra investigate show-monitor 2b6831c3-23d5-45bd-a7d3-10544ec41d9a
Show all stored logs command:
$ hydra investigate show-logs 2b6831c3-23d5-45bd-a7d3-10544ec41d9a
Logs:
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/2b6831c3-23d5-45bd-a7d3-10544ec41d9a/20210206_021033/grafana-screenshot-longevity-50gb-3days-scylla-per-server-metrics-nemesis-20210206_021420-longevity-tls-50gb-3d-master-monitor-node-2b6831c3-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/2b6831c3-23d5-45bd-a7d3-10544ec41d9a/20210206_021033/grafana-screenshot-overview-20210206_021034-longevity-tls-50gb-3d-master-monitor-node-2b6831c3-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/2b6831c3-23d5-45bd-a7d3-10544ec41d9a/20210206_022121/grafana-screenshot-longevity-50gb-3days-scylla-per-server-metrics-nemesis-20210206_022439-longevity-tls-50gb-3d-master-monitor-node-2b6831c3-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/2b6831c3-23d5-45bd-a7d3-10544ec41d9a/20210206_022121/grafana-screenshot-overview-20210206_022121-longevity-tls-50gb-3d-master-monitor-node-2b6831c3-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/2b6831c3-23d5-45bd-a7d3-10544ec41d9a/20210206_022932/db-cluster-2b6831c3.zip
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/2b6831c3-23d5-45bd-a7d3-10544ec41d9a/20210206_022932/loader-set-2b6831c3.zip
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/2b6831c3-23d5-45bd-a7d3-10544ec41d9a/20210206_022932/monitor-set-2b6831c3.zip
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/2b6831c3-23d5-45bd-a7d3-10544ec41d9a/20210206_022932/sct-runner-2b6831c3.zip
Jenkins job URL