Validator crashes node with "Unexpected mutation fragment" on range tombstone change of clustering row #10553
Comments
We need a reproducer with a system left up after this occurs; it's a corruption. https://jenkins.scylladb.com/job/scylla-staging/job/fruch/job/longevity-harry-2h-test/7/
Here's the job (set to keep all the instances); it takes ~1.5 hours to reach the failure.
The crashes have started, for whoever wants to take a look: http://13.49.225.42:3000/d/alternator-master/longevity-harry-2h-test-scylla-per-server-metrics-nemesis-master?orgId=1
I've killed the cluster in staging (since it seems no one is looking at it). Here's a link to start the reproducer when needed (it takes ~1.5h to get to the coredump point):
@mikolajsieluzycki please look into this.
@bhalevy Seems like exactly the same error.
It would help to have an AMI or RPMs with this fix. @benipeled, does the PR build create RPMs? Are they uploaded to S3? Also, can you point @mikolajsieluzycki to the jobs that would build RPMs or AMIs from a fork for him?
The CI job doesn't archive RPMs, only logs (build & tests). BYO can be used for building an RPM and AMI from a fork: https://jenkins.scylladb.com/view/master/job/scylla-master/job/byo/job/byo_build_tests_dtest/
@mikolajsieluzycki / @fruch can we close this issue with #10643?
Waiting for https://jenkins.scylladb.com/view/master/job/scylla-master/job/reproducers/job/longevity-harry-2h-test/lastBuild/console to finish (hopefully I kicked it off correctly). According to the description the error should show up after 1.5h; it's been over 2h since the start, so I'm cautiously optimistic.
The test finished successfully on master; I think this can be closed.
@fruch please consider closing this issue as per the above.
If that test passed with master, then yes, closing this one.
Installation details
Kernel version: 5.13.0-1022-aws
Scylla version (or git commit hash): 5.1.dev-0.20220504.b26a3da584cc with build-id ab2a33a30756c1513f4c516cd272291e75acec0e
Cluster size: 6 nodes (i3.large)
Scylla running with shards number (live nodes):
longevity-harry-2h-fix-cass-db-node-eddd82cc-1 (16.171.62.87 | 10.0.3.241): 2 shards
longevity-harry-2h-fix-cass-db-node-eddd82cc-2 (13.53.37.177 | 10.0.1.86): 2 shards
longevity-harry-2h-fix-cass-db-node-eddd82cc-3 (13.48.26.58 | 10.0.3.223): 2 shards
longevity-harry-2h-fix-cass-db-node-eddd82cc-4 (13.48.71.161 | 10.0.3.236): 2 shards
longevity-harry-2h-fix-cass-db-node-eddd82cc-5 (16.16.27.153 | 10.0.1.98): 2 shards
longevity-harry-2h-fix-cass-db-node-eddd82cc-6 (13.48.1.47 | 10.0.3.109): 2 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0f0e4c1a732cd9815 (aws: eu-north-1)
Test: longevity-harry-2h-test
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Issue description
While running cassandra-harry (a new test for SCT), after ~1 hour of running it fails with the failure/abort in the title ("Unexpected mutation fragment"). It's 100% reproducible: it failed 3 times in a row with exactly the same failure.
Ops made by cassandra-harry
Example of queries used to insert/update data:
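The actual harry-generated statements weren't preserved here (the full operation log is linked below), but as a rough sketch, with a hypothetical schema and values, the workload mixes timestamped inserts/updates with clustering-range deletes; the range delete is what writes the range tombstone implicated in the crash:

```cql
-- Hypothetical schema and values for illustration only; the real
-- statements are in operation.log.tar.gz linked below.
INSERT INTO harry.table_1 (pk, ck1, ck2, v1)
VALUES (1, 10, 100, 'x') USING TIMESTAMP 1652290000000000;

UPDATE harry.table_1 USING TIMESTAMP 1652290000000001
SET v1 = 'y' WHERE pk = 1 AND ck1 = 10 AND ck2 = 100;

-- A clustering-range delete like this produces a range tombstone,
-- the mutation fragment type named in the abort message.
DELETE FROM harry.table_1 USING TIMESTAMP 1652290000000002
WHERE pk = 1 AND ck1 = 10 AND ck2 > 50 AND ck2 < 150;
```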
Full log of all the operations cassandra-harry was doing (~22 GB deflated):
https://cloudius-jenkins-test.s3.amazonaws.com/77d9f946-9eff-455e-ba63-e4211ff9d8e0/20220511_183754/operation.log.tar.gz
Coredump:
Restore Monitor Stack command:
$ hydra investigate show-monitor eddd82cc-d745-4a4f-afc2-d8ab979c84aa
Restore monitor on AWS instance using Jenkins job
Show all stored logs command:
$ hydra investigate show-logs eddd82cc-d745-4a4f-afc2-d8ab979c84aa
Test id: eddd82cc-d745-4a4f-afc2-d8ab979c84aa
Logs
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/eddd82cc-d745-4a4f-afc2-d8ab979c84aa/20220511_235515/grafana-screenshot-longevity-harry-2h-test-scylla-per-server-metrics-nemesis-20220511_235639-longevity-harry-2h-fix-cass-monitor-node-eddd82cc-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/eddd82cc-d745-4a4f-afc2-d8ab979c84aa/20220511_235515/grafana-screenshot-overview-20220511_235515-longevity-harry-2h-fix-cass-monitor-node-eddd82cc-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/eddd82cc-d745-4a4f-afc2-d8ab979c84aa/20220512_000756/db-cluster-eddd82cc.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/eddd82cc-d745-4a4f-afc2-d8ab979c84aa/20220512_000756/loader-set-eddd82cc.tar.gz
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/eddd82cc-d745-4a4f-afc2-d8ab979c84aa/20220512_000756/monitor-set-eddd82cc.tar.gz
sct - https://cloudius-jenkins-test.s3.amazonaws.com/eddd82cc-d745-4a4f-afc2-d8ab979c84aa/20220512_000756/sct-runner-eddd82cc.tar.gz
Jenkins job URL