Failure in test_availability_when_one_node_failed #2568

Closed
jcsp opened this issue Oct 7, 2021 · 12 comments
Labels: area/raft, ci-failure, kind/bug

Comments

jcsp (Contributor) commented Oct 7, 2021

Seen this exactly once while testing a PR:

https://buildkite.com/vectorized/redpanda/builds/3034#43a317bb-dd99-408e-9866-967c3eeefb68

[INFO  - 2021-10-07 12:03:59,631 - runner_client - log - lineno:266]: RunnerClient: rptest.tests.availability_test.AvailabilityTests.test_availability_when_one_node_failed: FAIL: TimeoutError('Producer failed to produce messages for 180s.')
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.8/dist-packages/ducktape/tests/runner_client.py", line 215, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/tests/availability_test.py", line 66, in test_availability_when_one_node_failed
    self.validate_records()
  File "/root/tests/rptest/tests/availability_test.py", line 35, in validate_records
    self.run_validation(min_records=min_records,
  File "/root/tests/rptest/tests/end_to_end.py", line 142, in run_validation
    wait_until(lambda: self.producer.num_acked > min_records,
  File "/usr/local/lib/python3.8/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Producer failed to produce messages for 180s.

@mmaslankaprv looks like a test you added recently?

jcsp added the area/raft, ci-failure, kind/bug labels on Oct 7, 2021
jcsp changed the title from "Rare failure in test_availability_when_one_node_failed" to "Failure in test_availability_when_one_node_failed" on Oct 8, 2021
jcsp (Contributor, Author) commented Oct 8, 2021

Just seen this on the tip of dev too; it seems to be not so rare.

https://buildkite.com/vectorized/redpanda/builds/3105#80b8371a-5715-44f4-8741-028a9e3d34d5

twmb (Contributor) commented Oct 11, 2021

twmb (Contributor) commented Oct 12, 2021

jcsp added a commit to jcsp/redpanda that referenced this issue Oct 12, 2021
Related: redpanda-data#2568

Signed-off-by: John Spray <jcs@vectorized.io>
jcsp (Contributor, Author) commented Oct 12, 2021

I looked into this one in detail (https://buildkite.com/vectorized/redpanda/builds/3222#2c2c95b7-c7e1-4164-ab90-509cbce36409)

It's a hang on shutdown (node docker_n_14).

Another shutdown hang was fixed recently (#2553); it looks like it wasn't the only one.

mmaslankaprv (Member) commented:

I've analyzed those failures in detail; the failure is always caused by the producer not being able to produce at the requested throughput.
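
For reference, run_validation in end_to_end.py (see the traceback above) passes only if producer.num_acked exceeds min_records before the 180s timeout, so a producer running below the required rate is enough to trip it. Below is a minimal, self-contained sketch of that polling pattern; the wait_until reimplementation and SlowProducer are hypothetical stand-ins, not the ducktape/rptest code:

import time

def wait_until(condition, timeout_sec, backoff_sec=1, err_msg=""):
    # Hypothetical stand-in for ducktape.utils.util.wait_until: poll the
    # condition until it returns True or the timeout expires.
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        if condition():
            return
        time.sleep(backoff_sec)
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg)

class SlowProducer:
    # Hypothetical producer whose acknowledged-record count grows more
    # slowly than the test requires.
    def __init__(self, acks_per_second):
        self.start = time.time()
        self.acks_per_second = acks_per_second

    @property
    def num_acked(self):
        return int((time.time() - self.start) * self.acks_per_second)

producer = SlowProducer(acks_per_second=10)
min_records = 1000

try:
    wait_until(lambda: producer.num_acked > min_records,
               timeout_sec=5,
               err_msg="Producer failed to produce messages for 5s.")
except TimeoutError as e:
    print(e)  # same shape of failure as in the log above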

jcsp (Contributor, Author) commented Oct 13, 2021

Merged #2632 to address the failures. If the shutdown hang reappears we'll open a fresh issue.

jcsp closed this as completed on Oct 13, 2021
jcsp (Contributor, Author) commented Oct 14, 2021

Seen this happening again (https://buildkite.com/vectorized/redpanda/builds/3291#74cf9182-59d6-4cb6-bc2e-d78d7d82fe10). It looks like the same issue of the consumer not seeing enough messages:

[INFO  - 2021-10-14 09:46:12,796 - runner_client - log - lineno:266]: RunnerClient: rptest.tests.availability_test.AvailabilityTests.test_availability_when_one_node_failed: Summary: TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='test-topic', partition=3): 43862, TopicPartition(topic='test-topic', partition=0): 36547, TopicPartition(topic='test-topic', partition=1): 9840, TopicPartition(topic='test-topic', partition=2): 12695, TopicPartition(topic='test-topic', partition=4): 10298, TopicPartition(topic='test-topic', partition=5): 25010} after waiting 180s.")
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.8/dist-packages/ducktape/tests/runner_client.py", line 215, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/tests/availability_test.py", line 66, in test_availability_when_one_node_failed
    self.validate_records()
  File "/root/tests/rptest/tests/availability_test.py", line 35, in validate_records
    self.run_validation(min_records=min_records,
  File "/root/tests/rptest/tests/end_to_end.py", line 151, in run_validation
    self.await_consumed_offsets(self.producer.last_acked_offsets,
  File "/root/tests/rptest/tests/end_to_end.py", line 117, in await_consumed_offsets
    wait_until(has_finished_consuming,
  File "/usr/local/lib/python3.8/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Consumer failed to consume up to offsets {TopicPartition(topic='test-topic', partition=3): 43862, TopicPartition(topic='test-topic', partition=0): 36547, TopicPartition(topic='test-topic', partition=1): 9840, TopicPartition(topic='test-topic', partition=2): 12695, TopicPartition(topic='test-topic', partition=4): 10298, TopicPartition(topic='test-topic', partition=5): 25010} after waiting 180s.
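
For context, the failing check here is await_consumed_offsets: for every partition the producer got acks on, the consumer must have reached at least that acked offset. A rough sketch of that per-partition comparison follows; the helper below is hypothetical, and only the idea of comparing against last_acked_offsets comes from the traceback:

def has_finished_consuming(last_acked_offsets, consumed_offsets):
    # Hypothetical sketch: consumption is complete only when every
    # partition has been consumed at least up to the producer's last
    # acked offset for that partition.
    return all(consumed_offsets.get(tp, -1) >= acked
               for tp, acked in last_acked_offsets.items())

# Using two of the partitions from the failure above:
last_acked = {("test-topic", 3): 43862, ("test-topic", 0): 36547}
consumed = {("test-topic", 3): 43862, ("test-topic", 0): 30000}

print(has_finished_consuming(last_acked, consumed))  # False: partition 0 is behind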

jcsp (Contributor, Author) commented Oct 29, 2021

Here's another failure https://buildkite.com/vectorized/redpanda/builds/3769#34186e9e-908d-47b7-bbe9-b2a98696b8a2

Before the eventual failure, there are these worrying exceptions that suggest the consumers think they are being rewound:

File "/usr/local/lib/python3.8/dist-packages/ducktape/services/background_thread.py", line 38, in _protected_worker
    self._worker(idx, node)
  File "/root/tests/rptest/services/verifiable_consumer.py", line 254, in _worker
    self._update_global_position(event, node)
  File "/root/tests/rptest/services/verifiable_consumer.py", line 275, in _update_global_position
    raise AssertionError(msg)
AssertionError: Consumed position 4500 is behind the current committed offset 71799 for partition TopicPartition(topic='test-topic', partition=2)

Then it eventually fails with a "failed to consume up to offsets" error.
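
For context, the assertion enforces an invariant in verifiable_consumer.py: a consumer's reported position for a partition should never fall behind the offset it has already committed there; if it does, the consumer has effectively been rewound. A simplified, hypothetical sketch of that invariant (not the real _update_global_position):

def check_consumed_position(partition, position, committed_offsets):
    # Hypothetical, simplified version of the invariant behind the
    # AssertionError above: once an offset has been committed, the
    # consumer should only report positions at or beyond it.
    committed = committed_offsets.get(partition, 0)
    if position < committed:
        raise AssertionError(
            f"Consumed position {position} is behind the current "
            f"committed offset {committed} for partition {partition}")

committed = {("test-topic", 2): 71799}
check_consumed_position(("test-topic", 2), 4500, committed)  # raises, as in the log above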

jcsp (Contributor, Author) commented Nov 17, 2021

jcsp (Contributor, Author) commented Dec 2, 2021

dotnwat (Member) commented Dec 16, 2021

@mmaslankaprv did one of these PRs that reference this ticket end up fixing this problem?

mmaslankaprv (Member) commented:

> @mmaslankaprv did one of these PRs that reference this ticket end up fixing this problem?

Yes, the last instance of this problem was caused by the consumer receiving an incorrect error code when doing an offset commit operation.
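
To illustrate why the error code on an offset commit response matters: the client branches on that code to decide whether the commit succeeded, should simply be retried, or requires rejoining the group and resuming from the last successfully committed offsets, which is exactly the rewind seen earlier. The following is a hypothetical sketch of that client-side decision, not any real Kafka client's code, and the error-code constants are illustrative only:

# Hypothetical error codes and handler, for illustration only.
NONE = 0
COORDINATOR_LOAD_IN_PROGRESS = 14   # normally retriable
UNKNOWN_MEMBER_ID = 25              # normally fatal for the group member

def handle_offset_commit_response(error_code, committed, pending):
    if error_code == NONE:
        # Commit succeeded: the pending offsets become the committed ones.
        committed.update(pending)
        return "ok"
    if error_code == COORDINATOR_LOAD_IN_PROGRESS:
        # Retriable: keep the current state and retry the commit later.
        return "retry"
    # Any other code forces the member to rejoin the group and resume from
    # the last successfully committed offsets, so a broker that returns a
    # wrong, non-retriable code here makes the consumer rewind even though
    # no data was actually lost.
    return "rejoin"

committed, pending = {("test-topic", 2): 4500}, {("test-topic", 2): 71799}
print(handle_offset_commit_response(UNKNOWN_MEMBER_ID, committed, pending))  # "rejoin"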
