Failure in EndToEndTopicRecovery.test_restore_with_aborted_tx / test_shadow_indexing_aborted_txs #7043

Closed
VladLazar opened this issue Nov 1, 2022 · 31 comments · Fixed by #7819
Assignees
Labels
area/cloud-storage Shadow indexing subsystem area/tests ci-failure kind/bug Something isn't working sev/medium Bugs that do not meet criteria for high or critical, but are more severe than low.

Comments

@VladLazar
Contributor

VladLazar commented Nov 1, 2022

Redpanda version: dev

  File "/root/tests/rptest/tests/e2e_topic_recovery_test.py", line 288, in test_restore_with_aborted_tx
    for p_key, (c_key, c_offset) in zip_longest(producer.keys, consumed):
TypeError: cannot unpack non-iterable NoneType object

The failure essentially means that we've consumed fewer messages than were produced.
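
For context, the `TypeError` comes from `itertools.zip_longest` padding the shorter input with `None`, which the loop header then tries to unpack. A minimal sketch, assuming made-up keys and offsets in place of `producer.keys` and `consumed` (the function names here are hypothetical and not part of the test suite):

```python
from itertools import zip_longest

# Hypothetical stand-ins for producer.keys and the consumed (key, offset) pairs.
produced_keys = [b"0", b"1", b"2"]
consumed = [(b"0", 0), (b"1", 1)]  # one message short


def compare_like_the_test(keys, records):
    # zip_longest pads the shorter input with None, so the tuple unpacking
    # in the loop header raises TypeError once `records` runs out.
    for p_key, (c_key, c_offset) in zip_longest(keys, records):
        assert p_key == c_key, f"key mismatch at offset {c_offset}"


def compare_with_length_check(keys, records):
    # Report a short (or long) read explicitly before comparing pairwise.
    assert len(keys) == len(records), (
        f"produced {len(keys)} messages but consumed {len(records)}")
    for p_key, (c_key, c_offset) in zip(keys, records):
        assert p_key == c_key, f"key mismatch at offset {c_offset}"
```

With the length check in front, a short read would surface as an assertion naming both counts instead of an unpacking error.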

https://buildkite.com/redpanda/redpanda/builds/17716#018433cb-f177-4b27-9fe4-e071ecf19f49
https://buildkite.com/redpanda/redpanda/builds/17663#01843078-297e-4672-b260-c73ad6969674

@VladLazar VladLazar added kind/bug Something isn't working ci-failure labels Nov 1, 2022
@mmaslankaprv
Member

Can you please provide a link to the failed build?

@VladLazar
Contributor Author

Can you please provide a link to the failed build?

Sorry, missed that somehow. Updated the issue.

@mmedenjak mmedenjak added area/tests area/cloud-storage Shadow indexing subsystem labels Nov 2, 2022
@dlex
Contributor

dlex commented Nov 2, 2022

@Lazin
Contributor

Lazin commented Nov 3, 2022

Looks like the recovery downloads the segment correctly but then it gets truncated right after raft bootstrap.

@NyaliaLui
Contributor

There is also a different failure but same test in

https://buildkite.com/redpanda/redpanda/builds/18401#01846341-7814-4a89-ac1b-a5a67ce4b863/6-532

   AssertionError("produced and consumed messages differ, produced length: 13254, consumed length: 0, first mismatch: produced: b'0', consumed: None (offset: -1)")
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/shadow_indexing_tx_test.py", line 132, in test_shadow_indexing_aborted_txs
    assert (not first_mismatch), (
AssertionError: produced and consumed messages differ, produced length: 13254, consumed length: 0, first mismatch: produced: b'0', consumed: None (offset: -1)
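
A "first mismatch" report of this shape can be produced without risking the unpacking error seen in the other test. The sketch below is an assumption about how such a helper might look, not the code in `shadow_indexing_tx_test.py`; `first_mismatch` and `describe` are hypothetical names:

```python
from itertools import zip_longest

_MISSING = object()  # sentinel marking padding added by zip_longest


def first_mismatch(produced_keys, consumed):
    """Return (produced_key, consumed_key, offset) for the first difference, or None."""
    for p_key, record in zip_longest(produced_keys, consumed, fillvalue=_MISSING):
        if record is _MISSING:
            # Consumer ran out first: nothing was read for this produced key.
            return p_key, None, -1
        c_key, c_offset = record
        if p_key is _MISSING:
            # Consumer returned more records than were produced.
            return None, c_key, c_offset
        if p_key != c_key:
            return p_key, c_key, c_offset
    return None


def describe(produced_keys, consumed):
    mismatch = first_mismatch(produced_keys, consumed)
    if mismatch is None:
        return "produced and consumed messages match"
    p_key, c_key, c_offset = mismatch
    return (f"produced and consumed messages differ, "
            f"produced length: {len(produced_keys)}, "
            f"consumed length: {len(consumed)}, "
            f"first mismatch: produced: {p_key!r}, "
            f"consumed: {c_key!r} (offset: {c_offset})")
```

The sentinel distinguishes "one side ran out" from "keys differ", so both a short read (consumed: None) and an over-read (produced: None) get reported.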

@abhijat
Contributor

abhijat commented Nov 14, 2022

seen a couple of times on a ci-repeat in #7242

https://buildkite.com/redpanda/redpanda/builds/18462#01846631-7346-4aac-97ab-b2b489c2cf85
https://buildkite.com/redpanda/redpanda/builds/18462#01846631-7348-4f66-8c63-92500bfc3a34

test_id:    rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True
status:     FAIL
run time:   1 minute 11.075 seconds
 
    TypeError('cannot unpack non-iterable NoneType object')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/e2e_topic_recovery_test.py", line 313, in test_restore_with_aborted_tx
    for p_key, (c_key, c_offset) in zip_longest(producer.keys, consumed):
TypeError: cannot unpack non-iterable NoneType object

@andijcr
Contributor

andijcr commented Nov 14, 2022

again this failure

AssertionError("produced and consumed messages differ, produced length: 13765, consumed length: 13774, first mismatch: produced: None, consumed: b'14866' (offset: 15265)")

https://buildkite.com/redpanda/redpanda/builds/18524#01846f6f-36d1-4107-b9e0-02cf1ab05e0a

@andijcr
Contributor

andijcr commented Nov 15, 2022

Seen in release-clang-amd64: https://buildkite.com/redpanda/redpanda/builds/18587#018479bc-e5b7-4777-a76d-322555029e92
but this time it seems to be a problem with the test itself: TypeError('cannot unpack non-iterable NoneType object')

@VladLazar
Contributor Author

Seen another here: https://buildkite.com/redpanda/redpanda/builds/18618#01847be9-e191-447b-80a7-45df27069737.
Same failure mode as in the original issue.

@Lazin
Contributor

Lazin commented Nov 16, 2022 via email

@andijcr
Contributor

andijcr commented Nov 16, 2022

another one https://buildkite.com/redpanda/redpanda/builds/18587#018479bc-e5b9-4a0e-b59b-e01d2e099b3e

FAIL test: EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True (1/30 runs)
failure at 2022-11-15T08:06:49.706Z: TypeError('cannot unpack non-iterable NoneType object')
in job https://buildkite.com/redpanda/redpanda/builds/18587#018479bc-e5b9-4a0e-b59b-e01d2e099b3e

@andijcr
Contributor

andijcr commented Nov 18, 2022

FAIL test: EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True (1/18 runs)
  failure at 2022-11-18T08:14:27.356Z: TypeError('cannot unpack non-iterable NoneType object')
      in job https://buildkite.com/redpanda/redpanda/builds/18803#01848939-1855-4c6e-9bf6-fe9911044239

@ https://buildkite.com/redpanda/redpanda/builds/18803#01848939-1855-4c6e-9bf6-fe9911044239

@dotnwat
Member

dotnwat commented Dec 9, 2022

Another instance of the "produced and consumed messages differ" failure reported above by Nyalia:

https://buildkite.com/redpanda/redpanda/builds/19555#0184f32e-7e57-43f3-aeb4-4ce620d01df5

@jcsp
Contributor

jcsp commented Dec 12, 2022

@VadimPlh
Contributor

Another https://buildkite.com/redpanda/redpanda/builds/19621#018504dc-4135-4db7-968d-778a14c6c14f

with TypeError('cannot unpack non-iterable NoneType object')

@jcsp jcsp added sev/medium Bugs that do not meet criteria for high or critical, but are more severe than low. and removed sev/high loss of availability, pathological performance degradation, recoverable corruption labels Dec 12, 2022
@jcsp jcsp assigned jcsp and unassigned Lazin Dec 13, 2022
@VadimPlh
Contributor

jcsp added a commit to jcsp/redpanda that referenced this issue Dec 15, 2022
Use KgoVerifierProducer/SeqConsumer, for much higher volume
of messages, giving much higher chance of hitting issues.  Increasing
the message count on the python producer made the test very slow.

Also use the validation in the consumer to robustly and explicitly
flag when a bad read happens.

This reliably reproduces the tx_fence consumer hang from redpanda-data#7043, as
well as the bad transactional read from redpanda-data#7043.

Related: redpanda-data#7043
jcsp added a commit to jcsp/redpanda that referenced this issue Dec 16, 2022
aborted_transactions and make_reader were using different rules
on when to use remote vs. local partition.

In the bug case seen in redpanda-data#7043, we were reading from the remote
partition, but then aborted_transactions was hitting the local
partition.

Fix this by modifying the condition in aborted_transactions
to use the same kafka offset comparison that is used in make_reader.

Fixes: redpanda-data#7043
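
The idea behind this fix can be illustrated with a toy model in which both code paths consult one shared predicate. This is only a sketch: the actual change is in Redpanda's C++ code, and the `Partition` class, its fields, and the exact offset comparison below are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """Toy model: kafka offsets below local_start_offset exist only in cloud storage."""
    local_start_offset: int  # hypothetical: first kafka offset still on local disk
    remote_enabled: bool = True

    def _use_remote(self, kafka_offset: int) -> bool:
        # One rule, shared by both paths, so they cannot disagree.
        return self.remote_enabled and kafka_offset < self.local_start_offset

    def make_reader(self, start_offset: int) -> str:
        # Returns which partition (remote or local) would serve the read.
        return "remote" if self._use_remote(start_offset) else "local"

    def aborted_transactions(self, start_offset: int) -> str:
        # Returns which partition would serve the aborted-transaction lookup.
        return "remote" if self._use_remote(start_offset) else "local"
```

If the two methods each carried their own condition, a read served from the remote partition could be paired with aborted ranges taken from the local one, which is the mismatch the commit message describes.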
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Dec 19, 2022
Use KgoVerifierProducer/SeqConsumer, for much higher volume
of messages, giving much higher chance of hitting issues.  Increasing
the message count on the python producer made the test very slow.

Also use the validation in the consumer to robustly and explicitly
flag when a bad read happens.

This reliably reproduces the tx_fence consumer hang from redpanda-data#7043, as
well as the bad transactional read from redpanda-data#7043.

Related: redpanda-data#7043
(cherry picked from commit a2046bf)
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Dec 19, 2022
aborted_transactions and make_reader were using different rules
on when to use remote vs. local partition.

In the bug case seen in redpanda-data#7043, we were reading from the remote
partition, but then aborted_transactions was hitting the local
partition.

Fix this by modifying the condition in aborted_transactions
to use the same kafka offset comparison that is used in make_reader.

Fixes: redpanda-data#7043
(cherry picked from commit 2cb2987)
@bharathv
Contributor

This surfaced again after including the fixes from #7819

version: v23.1.0-dev-639-g85cc3500b - 85cc350-dirty

https://buildkite.com/redpanda/vtools/builds/4801#018533b7-87a2-4c64-b23b-db696e5251de

====================================================================================================
test_id:    rptest.tests.shadow_indexing_tx_test.ShadowIndexingTxTest.test_shadow_indexing_aborted_txs
status:     FAIL
run time:   46.286 seconds


    AssertionError()
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/utils/mode_checks.py", line 63, in f
    return func(*args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/tests/shadow_indexing_tx_test.py", line 75, in test_shadow_indexing_aborted_txs
    assert 0 < committed_messages < msg_count
AssertionError

@bharathv bharathv reopened this Dec 21, 2022
@jcsp
Contributor

jcsp commented Dec 21, 2022

The clustered ducktape environment didn't have the right kgo-verifier; that was fixed today in https://github.com/redpanda-data/vtools/pull/1230

@jcsp jcsp closed this as completed Dec 21, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Feb 28, 2023
Use KgoVerifierProducer/SeqConsumer, for much higher volume
of messages, giving much higher chance of hitting issues.  Increasing
the message count on the python producer made the test very slow.

Also use the validation in the consumer to robustly and explicitly
flag when a bad read happens.

This reliably reproduces the tx_fence consumer hang from redpanda-data#7043, as
well as the bad transactional read from redpanda-data#7043.

Related: redpanda-data#7043
(cherry picked from commit a2046bf)
jcsp added a commit to jcsp/redpanda that referenced this issue Feb 28, 2023
aborted_transactions and make_reader were using different rules
on when to use remote vs. local partition.

In the bug case seen in redpanda-data#7043, we were reading from the remote
partition, but then aborted_transactions was hitting the local
partition.

Fix this by modifying the condition in aborted_transactions
to use the same kafka offset comparison that is used in make_reader.

Fixes: redpanda-data#7043
(cherry picked from commit 2cb2987)