Failure in test_availability_when_one_node_failed #2568

Closed
jcsp opened this issue Oct 7, 2021 · 12 comments
Labels: area/raft, ci-failure, kind/bug

Comments

jcsp (Contributor) commented Oct 7, 2021

Seen this exactly once while testing a PR:

https://buildkite.com/vectorized/redpanda/builds/3034#43a317bb-dd99-408e-9866-967c3eeefb68

[INFO  - 2021-10-07 12:03:59,631 - runner_client - log - lineno:266]: RunnerClient: rptest.tests.availability_test.AvailabilityTests.test_availability_when_one_node_failed: FAIL: TimeoutError('Producer failed to produce messages for 180s.')
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.8/dist-packages/ducktape/tests/runner_client.py", line 215, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/tests/availability_test.py", line 66, in test_availability_when_one_node_failed
    self.validate_records()
  File "/root/tests/rptest/tests/availability_test.py", line 35, in validate_records
    self.run_validation(min_records=min_records,
  File "/root/tests/rptest/tests/end_to_end.py", line 142, in run_validation
    wait_until(lambda: self.producer.num_acked > min_records,
  File "/usr/local/lib/python3.8/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Producer failed to produce messages for 180s.

@mmaslankaprv looks like a test you added recently?

jcsp added the area/raft, ci-failure, kind/bug labels on Oct 7, 2021
jcsp changed the title from "Rare failure in test_availability_when_one_node_failed" to "Failure in test_availability_when_one_node_failed" on Oct 8, 2021
jcsp (Contributor, Author) commented Oct 8, 2021

Just seen this on the tip of dev too; it seems to be not so rare.

https://buildkite.com/vectorized/redpanda/builds/3105#80b8371a-5715-44f4-8741-028a9e3d34d5

twmb (Contributor) commented Oct 11, 2021

twmb (Contributor) commented Oct 12, 2021

jcsp added a commit to jcsp/redpanda that referenced this issue Oct 12, 2021
Related: redpanda-data#2568

Signed-off-by: John Spray <jcs@vectorized.io>
jcsp (Contributor, Author) commented Oct 12, 2021

I looked into this one in detail (https://buildkite.com/vectorized/redpanda/builds/3222#2c2c95b7-c7e1-4164-ab90-509cbce36409)

It's a hang on shutdown (node docker_n_14).

Another shutdown hang was fixed recently (#2553); it looks like it wasn't the only one.

mmaslankaprv (Member) commented:

I've analyzed those failures in detail; the failure is always caused by the producer not being able to produce at the requested throughput.
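
For reference, run_validation in end_to_end.py (see the traceback above) passes only if producer.num_acked exceeds min_records before the 180s timeout, so a producer running below the required rate is enough to trip it. Below is a minimal, self-contained sketch of that polling pattern; the wait_until reimplementation and SlowProducer are hypothetical stand-ins, not the ducktape/rptest code:

import time

def wait_until(condition, timeout_sec, backoff_sec=1, err_msg=""):
    # Hypothetical stand-in for ducktape.utils.util.wait_until: poll the
    # condition until it returns True or the timeout expires.
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        if condition():
            return
        time.sleep(backoff_sec)
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg)

class SlowProducer:
    # Hypothetical producer whose acknowledged-record count grows more
    # slowly than the test requires.
    def __init__(self, acks_per_second):
        self.start = time.time()
        self.acks_per_second = acks_per_second

    @property
    def num_acked(self):
        return int((time.time() - self.start) * self.acks_per_second)

producer = SlowProducer(acks_per_second=10)
min_records = 1000

try:
    wait_until(lambda: producer.num_acked > min_records,
               timeout_sec=5,
               err_msg="Producer failed to produce messages for 5s.")
except TimeoutError as e:
    print(e)  # same shape of failure as in the log above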

jcsp (Contributor, Author) commented Oct 13, 2021

Merged #2632 to address the failures. If the shutdown hang reappears we'll open a fresh issue.

jcsp closed this as completed on Oct 13, 2021
jcsp (Contributor, Author) commented Oct 14, 2021

Seen this happening again (https://buildkite.com/vectorized/redpanda/builds/3291#74cf9182-59d6-4cb6-bc2e-d78d7d82fe10). It looks like the same issue of the consumer not seeing enough messages:

[INFO  - 2021-10-14 09:46:12,796 - runner_client - log - lineno:266]: RunnerClient: rptest.tests.availability_test.AvailabilityTests.test_availability_when_one_node_failed: Summary: TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='test-topic', partition=3): 43862, TopicPartition(topic='test-topic', partition=0): 36547, TopicPartition(topic='test-topic', partition=1): 9840, TopicPartition(topic='test-topic', partition=2): 12695, TopicPartition(topic='test-topic', partition=4): 10298, TopicPartition(topic='test-topic', partition=5): 25010} after waiting 180s.")
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.8/dist-packages/ducktape/tests/runner_client.py", line 215, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/tests/availability_test.py", line 66, in test_availability_when_one_node_failed
    self.validate_records()
  File "/root/tests/rptest/tests/availability_test.py", line 35, in validate_records
    self.run_validation(min_records=min_records,
  File "/root/tests/rptest/tests/end_to_end.py", line 151, in run_validation
    self.await_consumed_offsets(self.producer.last_acked_offsets,
  File "/root/tests/rptest/tests/end_to_end.py", line 117, in await_consumed_offsets
    wait_until(has_finished_consuming,
  File "/usr/local/lib/python3.8/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Consumer failed to consume up to offsets {TopicPartition(topic='test-topic', partition=3): 43862, TopicPartition(topic='test-topic', partition=0): 36547, TopicPartition(topic='test-topic', partition=1): 9840, TopicPartition(topic='test-topic', partition=2): 12695, TopicPartition(topic='test-topic', partition=4): 10298, TopicPartition(topic='test-topic', partition=5): 25010} after waiting 180s.
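
For context, the failing check here is await_consumed_offsets: for every partition the producer got acks on, the consumer must have reached at least that acked offset. A rough sketch of that per-partition comparison follows; the helper below is hypothetical, and only the idea of comparing against last_acked_offsets comes from the traceback:

def has_finished_consuming(last_acked_offsets, consumed_offsets):
    # Hypothetical sketch: consumption is complete only when every
    # partition has been consumed at least up to the producer's last
    # acked offset for that partition.
    return all(consumed_offsets.get(tp, -1) >= acked
               for tp, acked in last_acked_offsets.items())

# Using two of the partitions from the failure above:
last_acked = {("test-topic", 3): 43862, ("test-topic", 0): 36547}
consumed = {("test-topic", 3): 43862, ("test-topic", 0): 30000}

print(has_finished_consuming(last_acked, consumed))  # False: partition 0 is behind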

jcsp (Contributor, Author) commented Oct 29, 2021

Here's another failure https://buildkite.com/vectorized/redpanda/builds/3769#34186e9e-908d-47b7-bbe9-b2a98696b8a2

Before the eventual failure, there are these worrying exceptions that suggest the consumers think they are being rewound:

File "/usr/local/lib/python3.8/dist-packages/ducktape/services/background_thread.py", line 38, in _protected_worker
    self._worker(idx, node)
  File "/root/tests/rptest/services/verifiable_consumer.py", line 254, in _worker
    self._update_global_position(event, node)
  File "/root/tests/rptest/services/verifiable_consumer.py", line 275, in _update_global_position
    raise AssertionError(msg)
AssertionError: Consumed position 4500 is behind the current committed offset 71799 for partition TopicPartition(topic='test-topic', partition=2)

Then it eventually fails with a "failed to consume up to offsets" error.
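
For context, the assertion enforces an invariant in verifiable_consumer.py: a consumer's reported position for a partition should never fall behind the offset it has already committed there; if it does, the consumer has effectively been rewound. A simplified, hypothetical sketch of that invariant (not the real _update_global_position):

def check_consumed_position(partition, position, committed_offsets):
    # Hypothetical, simplified version of the invariant behind the
    # AssertionError above: once an offset has been committed, the
    # consumer should only report positions at or beyond it.
    committed = committed_offsets.get(partition, 0)
    if position < committed:
        raise AssertionError(
            f"Consumed position {position} is behind the current "
            f"committed offset {committed} for partition {partition}")

committed = {("test-topic", 2): 71799}
check_consumed_position(("test-topic", 2), 4500, committed)  # raises, as in the log above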

jcsp (Contributor, Author) commented Nov 17, 2021

jcsp (Contributor, Author) commented Dec 2, 2021

dotnwat (Member) commented Dec 16, 2021

@mmaslankaprv did one of these PRs that reference this ticket end up fixing this problem?

mmaslankaprv (Member) commented:

> @mmaslankaprv did one of these PRs that reference this ticket end up fixing this problem?

Yes, the last instance of this problem was caused by the consumer receiving an incorrect error code when doing an offset commit operation.
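
To illustrate why the error code on an offset commit response matters: the client branches on that code to decide whether the commit succeeded, should simply be retried, or requires rejoining the group and resuming from the last successfully committed offsets, which is exactly the rewind seen earlier. The following is a hypothetical sketch of that client-side decision, not any real Kafka client's code, and the error-code constants are illustrative only:

# Hypothetical error codes and handler, for illustration only.
NONE = 0
COORDINATOR_LOAD_IN_PROGRESS = 14   # normally retriable
UNKNOWN_MEMBER_ID = 25              # normally fatal for the group member

def handle_offset_commit_response(error_code, committed, pending):
    if error_code == NONE:
        # Commit succeeded: the pending offsets become the committed ones.
        committed.update(pending)
        return "ok"
    if error_code == COORDINATOR_LOAD_IN_PROGRESS:
        # Retriable: keep the current state and retry the commit later.
        return "retry"
    # Any other code forces the member to rejoin the group and resume from
    # the last successfully committed offsets, so a broker that returns a
    # wrong, non-retriable code here makes the consumer rewind even though
    # no data was actually lost.
    return "rejoin"

committed, pending = {("test-topic", 2): 4500}, {("test-topic", 2): 71799}
print(handle_offset_commit_response(UNKNOWN_MEMBER_ID, committed, pending))  # "rejoin"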
