Failure in EndToEndTopicRecovery.test_restore_with_aborted_tx / test_shadow_indexing_aborted_txs #7043
Comments
Can you please provide a link to the failed build?
Sorry, missed that somehow. Updated the issue.
Looks like the recovery downloads the segment correctly, but the segment then gets truncated right after raft bootstrap.
There is also a different failure in the same test: https://buildkite.com/redpanda/redpanda/builds/18401#01846341-7814-4a89-ac1b-a5a67ce4b863/6-532
Seen a couple of times on a ci-repeat in #7242: https://buildkite.com/redpanda/redpanda/builds/18462#01846631-7346-4aac-97ab-b2b489c2cf85
Again, this failure: https://buildkite.com/redpanda/redpanda/builds/18524#01846f6f-36d1-4107-b9e0-02cf1ab05e0a
In the release-clang-amd64 build: https://buildkite.com/redpanda/redpanda/builds/18587#018479bc-e5b7-4777-a76d-322555029e92
Seen another here: https://buildkite.com/redpanda/redpanda/builds/18618#01847be9-e191-447b-80a7-45df27069737. Same failure mode as in the original issue.
I'll work on it today. Looks like the test consumed more data than it produced, which may indicate that the consumer was able to see batches from aborted transactions.
Another one: https://buildkite.com/redpanda/redpanda/builds/18587#018479bc-e5b9-4a0e-b59b-e01d2e099b3e FAIL test: EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True (1/30 runs)
https://buildkite.com/redpanda/redpanda/builds/18803#01848939-1855-4c6e-9bf6-fe9911044239
Another instance related to the "produced and consumed messages differ" failure reported above by Nyalia: https://buildkite.com/redpanda/redpanda/builds/19555#0184f32e-7e57-43f3-aeb4-4ce620d01df5
This is still live today, after the merge of #7366.
Another: https://buildkite.com/redpanda/redpanda/builds/19621#018504dc-4135-4db7-968d-778a14c6c14f with
Use KgoVerifierProducer/SeqConsumer, for much higher volume of messages, giving much higher chance of hitting issues. Increasing the message count on the python producer made the test very slow. Also use the validation in the consumer to robustly and explicitly flag when a bad read happens. This reliably reproduces the tx_fence consumer hang from redpanda-data#7043, as well as the bad transactional read from redpanda-data#7043. Related: redpanda-data#7043
aborted_transactions and make_reader were using different rules on when to use remote vs. local partition. In the bug case seen in redpanda-data#7043, we were reading from the remote partition, but then abort_transactions was hitting the local partition. Fix this by modifying the condition in aborted_transactions to use the same kafka offset comparison that is used in make_reader. Fixes: redpanda-data#7043
Use KgoVerifierProducer/SeqConsumer, for much higher volume of messages, giving much higher chance of hitting issues. Increasing the message count on the python producer made the test very slow. Also use the validation in the consumer to robustly and explicitly flag when a bad read happens. This reliably reproduces the tx_fence consumer hang from redpanda-data#7043, as well as the bad transactional read from redpanda-data#7043. Related: redpanda-data#7043 (cherry picked from commit a2046bf)
aborted_transactions and make_reader were using different rules on when to use remote vs. local partition. In the bug case seen in redpanda-data#7043, we were reading from the remote partition, but then abort_transactions was hitting the local partition. Fix this by modifying the condition in aborted_transactions to use the same kafka offset comparison that is used in make_reader. Fixes: redpanda-data#7043 (cherry picked from commit 2cb2987)
This surfaced again after including the fixes from #7819. Version: v23.1.0-dev-639-g85cc3500b - 85cc350-dirty. https://buildkite.com/redpanda/vtools/builds/4801#018533b7-87a2-4c64-b23b-db696e5251de
The clustered ducktape environment didn't have the right kgo-verifier; it was fixed today: https://github.com/redpanda-data/vtools/pull/1230
Redpanda version: dev
The failure essentially means that we've consumed fewer messages than were produced.
https://buildkite.com/redpanda/redpanda/builds/17716#018433cb-f177-4b27-9fe4-e071ecf19f49
https://buildkite.com/redpanda/redpanda/builds/17663#01843078-297e-4672-b260-c73ad6969674