Failure in `EndToEndTopicRecovery`.`test_restore_with_aborted_tx` / `test_shadow_indexing_aborted_txs` #7043

VladLazar · 2022-11-01T13:43:15Z

Redpanda version: dev

  File "/root/tests/rptest/tests/e2e_topic_recovery_test.py", line 288, in test_restore_with_aborted_tx
    for p_key, (c_key, c_offset) in zip_longest(producer.keys, consumed):
TypeError: cannot unpack non-iterable NoneType object

The failure essentially means that we've consumed fewer messages than were produced.
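
To make the failure mode concrete, here is a minimal sketch of why a short read surfaces as a `TypeError` rather than as an assertion. It is not the test's actual code: the sample `produced_keys` and `consumed` values and both helper functions are made up for illustration. `zip_longest` pads the shorter input with `None`, and the nested tuple unpacking in the loop header then fails on that padding.

```python
from itertools import zip_longest

# Hypothetical sample data: three keys produced, only two (key, offset) pairs consumed.
produced_keys = [b"0", b"1", b"2"]
consumed = [(b"0", 0), (b"1", 1)]


def unpack_naively(produced, consumed):
    """Mirrors the loop shape from the traceback; raises TypeError on a short read."""
    for p_key, (c_key, c_offset) in zip_longest(produced, consumed):
        if p_key != c_key:
            return (p_key, c_key, c_offset)
    return None


def first_mismatch(produced, consumed):
    """Return (produced_key, consumed_key, offset) for the first difference, or None."""
    for p_key, c_item in zip_longest(produced, consumed):
        # zip_longest pads the shorter input with None, so unpack only real items.
        c_key, c_offset = c_item if c_item is not None else (None, -1)
        if p_key != c_key:
            return (p_key, c_key, c_offset)
    return None


try:
    unpack_naively(produced_keys, consumed)
    naive_error = None
except TypeError as e:
    naive_error = str(e)

mismatch = first_mismatch(produced_keys, consumed)
print(naive_error)
print(mismatch)
```

Unpacking only after checking for the padding value is one possible way to turn the crash into a readable "first mismatch" report; other designs (for example, comparing lengths up front) would work as well.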

https://buildkite.com/redpanda/redpanda/builds/17716#018433cb-f177-4b27-9fe4-e071ecf19f49
https://buildkite.com/redpanda/redpanda/builds/17663#01843078-297e-4672-b260-c73ad6969674

mmaslankaprv · 2022-11-02T11:15:41Z

Can you please provide a link to the failed build?

VladLazar · 2022-11-02T11:19:24Z

> Can you please provide a link to the failed build?

Sorry, missed that somehow. Updated the issue.

dlex · 2022-11-02T16:39:43Z

https://buildkite.com/redpanda/redpanda/builds/17716#018433cb-f177-4b27-9fe4-e071ecf19f49

Lazin · 2022-11-03T18:59:09Z

Looks like the recovery downloads the segment correctly but then it gets truncated right after raft bootstrap.

dlex · 2022-11-04T15:32:14Z

https://buildkite.com/redpanda/redpanda/builds/17898#01844128-535b-4789-9063-333aa64e6d4c
https://buildkite.com/redpanda/redpanda/builds/17906#01844252-a6c3-404f-bd53-f187581fbca0
https://buildkite.com/redpanda/redpanda/builds/17895#0184410d-c880-49d2-ae2f-a9b90a8b3aa8

NyaliaLui · 2022-11-07T19:54:03Z

https://buildkite.com/redpanda/redpanda/builds/18018#018452c7-44cf-4178-981e-ca6b739b4a3a/6-1158

NyaliaLui · 2022-11-09T14:52:46Z

Another in https://buildkite.com/redpanda/redpanda/builds/18143#0184586f-918e-4f87-ac6e-c179c53438ac/6-680

NyaliaLui · 2022-11-10T14:55:28Z

https://buildkite.com/redpanda/redpanda/builds/18290#01845eaf-1a8e-455b-81fa-294d69c8301c/6-732

NyaliaLui · 2022-11-11T15:14:51Z

https://buildkite.com/redpanda/redpanda/builds/18463#01846631-9be1-45a6-b5fe-028eb8f3d59d/6-451

NyaliaLui · 2022-11-11T15:32:46Z

There is also a different failure but same test in

https://buildkite.com/redpanda/redpanda/builds/18401#01846341-7814-4a89-ac1b-a5a67ce4b863/6-532

   AssertionError("produced and consumed messages differ, produced length: 13254, consumed length: 0, first mismatch: produced: b'0', consumed: None (offset: -1)")
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/shadow_indexing_tx_test.py", line 132, in test_shadow_indexing_aborted_txs
    assert (not first_mismatch), (
AssertionError: produced and consumed messages differ, produced length: 13254, consumed length: 0, first mismatch: produced: b'0', consumed: None (offset: -1)

abhijat · 2022-11-14T07:40:44Z

seen a couple of times on a ci-repeat in #7242

https://buildkite.com/redpanda/redpanda/builds/18462#01846631-7346-4aac-97ab-b2b489c2cf85
https://buildkite.com/redpanda/redpanda/builds/18462#01846631-7348-4f66-8c63-92500bfc3a34

test_id:    rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True
status:     FAIL
run time:   1 minute 11.075 seconds
 
    TypeError('cannot unpack non-iterable NoneType object')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/e2e_topic_recovery_test.py", line 313, in test_restore_with_aborted_tx
    for p_key, (c_key, c_offset) in zip_longest(producer.keys, consumed):
TypeError: cannot unpack non-iterable NoneType object

andijcr · 2022-11-14T15:13:35Z

again this failure

AssertionError("produced and consumed messages differ, produced length: 13765, consumed length: 13774, first mismatch: produced: None, consumed: b'14866' (offset: 15265)")

https://buildkite.com/redpanda/redpanda/builds/18524#01846f6f-36d1-4107-b9e0-02cf1ab05e0a

andijcr · 2022-11-15T15:17:46Z

In release-clang-amd64: https://buildkite.com/redpanda/redpanda/builds/18587#018479bc-e5b7-4777-a76d-322555029e92
This time it looks like a problem with the test itself: TypeError('cannot unpack non-iterable NoneType object')

VladLazar · 2022-11-16T10:23:09Z

Seen another here: https://buildkite.com/redpanda/redpanda/builds/18618#01847be9-e191-447b-80a7-45df27069737.
Same failure mode as in the original issue.

Lazin · 2022-11-16T10:45:36Z

I'll work on it today. Looks like the test consumed more data than it produced, which may indicate that the consumer was able to see batches from aborted transactions.
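
As a rough illustration of that hypothesis, the toy model below shows how a reader that is handed an incomplete list of aborted ranges returns more records than were committed. It is not Redpanda or Kafka client code: `Batch`, `AbortedRange`, `read_committed`, and the sample log are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Batch:
    producer_id: int
    offset: int
    key: bytes


@dataclass(frozen=True)
class AbortedRange:
    producer_id: int
    first: int  # first offset of the aborted transaction (inclusive)
    last: int   # last offset of the aborted transaction (inclusive)


def read_committed(log, aborted_ranges):
    """Return keys of batches that do not fall inside any aborted range."""
    def is_aborted(b):
        return any(r.producer_id == b.producer_id and r.first <= b.offset <= r.last
                   for r in aborted_ranges)
    return [b.key for b in log if not is_aborted(b)]


# Hypothetical log: producer 1 commits offsets 0-1, then aborts offsets 2-3.
log = [Batch(1, 0, b"0"), Batch(1, 1, b"1"), Batch(1, 2, b"x"), Batch(1, 3, b"y")]
complete = [AbortedRange(1, 2, 3)]

expected = read_committed(log, complete)  # reader given the full aborted list
leaky = read_committed(log, [])           # reader given an empty aborted list
print(len(expected), len(leaky))
```

In this model the second read returns the two aborted records in addition to the committed ones, which is the shape of a "consumed length greater than produced length" report.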

andijcr · 2022-11-16T16:07:32Z

another one https://buildkite.com/redpanda/redpanda/builds/18587#018479bc-e5b9-4a0e-b59b-e01d2e099b3e

FAIL test: EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True (1/30 runs)
failure at 2022-11-15T08:06:49.706Z: TypeError('cannot unpack non-iterable NoneType object')
in job https://buildkite.com/redpanda/redpanda/builds/18587#018479bc-e5b9-4a0e-b59b-e01d2e099b3e

andijcr · 2022-11-18T15:12:55Z

FAIL test: EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True (1/18 runs)
failure at 2022-11-18T08:14:27.356Z: TypeError('cannot unpack non-iterable NoneType object')
in job https://buildkite.com/redpanda/redpanda/builds/18803#01848939-1855-4c6e-9bf6-fe9911044239

@ https://buildkite.com/redpanda/redpanda/builds/18803#01848939-1855-4c6e-9bf6-fe9911044239

bharathv · 2022-12-08T20:55:20Z

Another instance: https://ci-artifacts.dev.vectorized.cloud/redpanda/0184f328-8dd6-44ca-b4c1-688c8a356b8d/vbuild/ducktape/results/2022-12-08--001/report.html

dotnwat · 2022-12-09T19:33:10Z

Another instance related to "produced and consumed messages differ" reported above by Nyalia

https://buildkite.com/redpanda/redpanda/builds/19555#0184f32e-7e57-43f3-aeb4-4ce620d01df5

jcsp · 2022-12-12T14:21:46Z

This is still live today, after the merge of #7366
https://buildkite.com/redpanda/redpanda/builds/19621#018504dc-4135-4db7-968d-778a14c6c14f

VadimPlh · 2022-12-12T15:12:24Z

Another https://buildkite.com/redpanda/redpanda/builds/19621#018504dc-4135-4db7-968d-778a14c6c14f

with TypeError('cannot unpack non-iterable NoneType object')

VadimPlh · 2022-12-14T15:33:43Z

https://buildkite.com/redpanda/redpanda/builds/19686#01850c1a-44b5-4130-8fd6-fa3e72657613

Use KgoVerifierProducer/SeqConsumer, for much higher volume of messages, giving much higher chance of hitting issues. Increasing the message count on the python producer made the test very slow. Also use the validation in the consumer to robustly and explicitly flag when a bad read happens. This reliably reproduces the tx_fence consumer hang from redpanda-data#7043, as well as the bad transactional read from redpanda-data#7043. Related: redpanda-data#7043

aborted_transactions and make_reader were using different rules on when to use remote vs. local partition. In the bug case seen in redpanda-data#7043, we were reading from the remote partition, but then aborted_transactions was hitting the local partition. Fix this by modifying the condition in aborted_transactions to use the same kafka offset comparison that is used in make_reader. Fixes: redpanda-data#7043
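
The fix described in this commit message can be pictured with a small sketch. This is a hypothetical Python model, not the actual C++ change: `Partition`, `local_start_offset`, `_use_remote`, and the dictionary-backed stores are illustrative names, and the real condition in Redpanda may differ in detail. The point is only that both code paths consult one predicate, so they cannot disagree about which store serves a given offset.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    # Hypothetical: first kafka offset still available in the local log.
    local_start_offset: int
    # Hypothetical stores: offset -> key, plus the aborted offsets known to each store.
    remote_data: dict = field(default_factory=dict)
    local_data: dict = field(default_factory=dict)
    remote_aborted: set = field(default_factory=set)
    local_aborted: set = field(default_factory=set)

    def _use_remote(self, kafka_offset: int) -> bool:
        """Single rule shared by both paths: offsets below the local start are remote."""
        return kafka_offset < self.local_start_offset

    def make_reader(self, kafka_offset: int):
        data = self.remote_data if self._use_remote(kafka_offset) else self.local_data
        return data.get(kafka_offset)

    def aborted_transactions(self, kafka_offset: int) -> bool:
        aborted = self.remote_aborted if self._use_remote(kafka_offset) else self.local_aborted
        return kafka_offset in aborted


p = Partition(local_start_offset=10,
              remote_data={3: b"k3"}, remote_aborted={3},
              local_data={12: b"k12"})
# Offset 3 is read from the remote store, and its abort status comes from there too.
print(p.make_reader(3), p.aborted_transactions(3))
```

If `aborted_transactions` instead looked only at the local store, offset 3 in this model would be reported as not aborted even though the data was served from the remote store, which matches the mismatch the commit message describes.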

Use KgoVerifierProducer/SeqConsumer, for much higher volume of messages, giving much higher chance of hitting issues. Increasing the message count on the python producer made the test very slow. Also use the validation in the consumer to robustly and explicitly flag when a bad read happens. This reliably reproduces the tx_fence consumer hang from redpanda-data#7043, as well as the bad transactional read from redpanda-data#7043. Related: redpanda-data#7043 (cherry picked from commit a2046bf)

aborted_transactions and make_reader were using different rules on when to use remote vs. local partition. In the bug case seen in redpanda-data#7043, we were reading from the remote partition, but then aborted_transactions was hitting the local partition. Fix this by modifying the condition in aborted_transactions to use the same kafka offset comparison that is used in make_reader. Fixes: redpanda-data#7043 (cherry picked from commit 2cb2987)

bharathv · 2022-12-21T17:28:42Z

This surfaced again after including the fixes from #7819

version: v23.1.0-dev-639-g85cc3500b - 85cc350-dirty

https://buildkite.com/redpanda/vtools/builds/4801#018533b7-87a2-4c64-b23b-db696e5251de

====================================================================================================
test_id:    rptest.tests.shadow_indexing_tx_test.ShadowIndexingTxTest.test_shadow_indexing_aborted_txs
status:     FAIL
run time:   46.286 seconds


    AssertionError()
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/utils/mode_checks.py", line 63, in f
    return func(*args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/tests/shadow_indexing_tx_test.py", line 75, in test_shadow_indexing_aborted_txs
    assert 0 < committed_messages < msg_count
AssertionError

jcsp · 2022-12-21T17:41:34Z

the clustered ducktape environment didn't have the right kgo-verifier, it's fixed today https://github.com/redpanda-data/vtools/pull/1230

VladLazar added kind/bug (Something isn't working) and ci-failure labels Nov 1, 2022

mmedenjak added area/tests and area/cloud-storage (Shadow indexing subsystem) labels Nov 2, 2022

mmedenjak assigned Lazin Nov 3, 2022

BenPope mentioned this issue Nov 4, 2022

Schema Registry && REST Proxy: Auto-auth tests & fixes #7048

Merged

7 tasks

bharathv mentioned this issue Nov 4, 2022

tx/compaction: Adjust aborted tx ranges for compacted segments. #7081

Merged

6 tasks

dotnwat mentioned this issue Nov 7, 2022

De-coroutinize capturing lambdas #7113

Merged

6 tasks

This was referenced Nov 7, 2022

archival_policy: fix off-by-1 error for timeboxed uploads #7096

Merged

tests: amend retention in test_create_or_delete_topics_while_busy #7097

Merged

jcsp mentioned this issue Nov 8, 2022

CI failure in ShadowIndexingTxTest.test_shadow_indexing_aborted_txs #7147

Closed

BenPope mentioned this issue Nov 10, 2022

kafka/client/consumer: Shutdown and fetch improvements #7210

Merged

6 tasks

VladLazar mentioned this issue Nov 16, 2022

tree_wide: add metrics for ops dashboard #7287

Merged

6 tasks

NyaliaLui mentioned this issue Nov 29, 2022

tests: add basic auth mtls tests #7362

Merged

6 tasks

r-vasquez mentioned this issue Dec 2, 2022

[v22.3.x] rpk cloud patch #7605

Merged

bharathv mentioned this issue Dec 8, 2022

tx/observability: utilities and metrics for memory usage #7585

Merged

6 tasks

jcsp added sev/medium (bugs that do not meet criteria for high or critical, but are more severe than low) and removed sev/high (loss of availability, pathological performance degradation, recoverable corruption) labels Dec 12, 2022

jcsp assigned jcsp and unassigned Lazin Dec 13, 2022

jcsp mentioned this issue Dec 16, 2022

kafka: fix transactional reads on tiered storage recovered partitions #7819

Merged

6 tasks

jcsp closed this as completed in #7819 Dec 19, 2022

vbotbuildovich mentioned this issue Dec 19, 2022

[v22.3.x] Failure in EndToEndTopicRecovery.test_restore_with_aborted_tx / test_shadow_indexing_aborted_txs #7842

Closed

bharathv reopened this Dec 21, 2022

jcsp closed this as completed Dec 21, 2022

jcsp mentioned this issue Dec 23, 2022

Failure in EndToEndTopicRecovery.test_restore_with_aborted_tx #7955

Closed
