CI Failure (Failed to get metadata: Local: Timed out) in ThroughputLimitsSnc.test_configuration #8809

Closed
NyaliaLui opened this issue Feb 10, 2023 · 7 comments · Fixed by #8886, #9078 or #10462
Labels: area/kafka, ci-failure, kind/bug (Something isn't working), sev/low (Bugs which are non-functional paper cuts, e.g. typos, issues in log messages)

Comments


NyaliaLui commented Feb 10, 2023

Version & Environment

Redpanda version: dev

This happened during cluster startup in CI

https://buildkite.com/redpanda/redpanda/builds/22997#01863c84-7997-4df4-a53b-3be8d31d445d/6-2309

Module: rptest.tests.throughput_limits_snc_test
Class:  ThroughputLimitsSnc
Method: test_configuration
test_id:    rptest.tests.throughput_limits_snc_test.ThroughputLimitsSnc.test_configuration
status:     FAIL
run time:   15.756 seconds


    KafkaException(KafkaError{code=_TIMED_OUT,val=-185,str="Failed to get metadata: Local: Timed out"})
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 133, in run
    self.setup_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 218, in setup_test
    self.test.setup()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/test.py", line 91, in setup
    self.setUp()
  File "/root/tests/rptest/tests/redpanda_test.py", line 99, in setUp
    self.redpanda.start()
  File "/root/tests/rptest/services/redpanda.py", line 1022, in start
    self.wait_for_membership(first_start=first_start)
  File "/root/tests/rptest/services/redpanda.py", line 937, in wait_for_membership
    wait_until(lambda: {n
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 53, in wait_until
    raise e
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 44, in wait_until
    if condition():
  File "/root/tests/rptest/services/redpanda.py", line 937, in <lambda>
    wait_until(lambda: {n
  File "/root/tests/rptest/services/redpanda.py", line 939, in <setcomp>
    if self.registered(n)} == expected,
  File "/root/tests/rptest/services/redpanda.py", line 2209, in registered
    brokers = client.brokers()
  File "/root/tests/rptest/clients/python_librdkafka.py", line 33, in brokers
    return client.list_topics(timeout=10).brokers
cimpl.KafkaException: KafkaError{code=_TIMED_OUT,val=-185,str="Failed to get metadata: Local: Timed out"}

dlex commented Feb 14, 2023

The test sets the egress throughput (TP) limit to 64 B/s. The controller log leader receives this sequence of Kafka API (KAPI) requests:

  • api_versions
  • metadata
  • metadata (ok so far)
  • (another connection established)
  • api_versions (throttle requested: 3.6s)
  • metadata (throttle enforced: 3.6s, requested: 12.6s)
  • metadata (throttle enforced: 12.6s)

A 12.6s throttle exceeds the client's 10s metadata timeout (`list_topics(timeout=10)` in the traceback above), so the client times out the metadata request.

As a mitigation, I will double the minimum egress TP limit.
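
To make the arithmetic behind this mitigation concrete, here is a minimal sketch. It assumes a simple linear model in which the throttle delay equals the outstanding byte deficit divided by the configured rate; that model, and the helper names `implied_deficit_bytes`, `throttle_seconds`, and `exceeds_timeout`, are illustrative assumptions and not Redpanda's actual throttling code. The 10s client timeout comes from `list_topics(timeout=10)` in the traceback.

```python
# Assumption: throttle delay ~= byte deficit / rate limit.
# This is a simplified linear model for illustration only; it is not
# taken from Redpanda's implementation.

CLIENT_TIMEOUT_S = 10.0  # from list_topics(timeout=10) in the traceback


def implied_deficit_bytes(throttle_s: float, rate_bps: float) -> float:
    """Byte deficit that would produce the given throttle at the given rate."""
    return throttle_s * rate_bps


def throttle_seconds(deficit_bytes: float, rate_bps: float) -> float:
    """Throttle delay needed to pay off the deficit at the given rate."""
    return deficit_bytes / rate_bps


def exceeds_timeout(deficit_bytes: float, rate_bps: float,
                    timeout_s: float = CLIENT_TIMEOUT_S) -> bool:
    """True if the modelled throttle is longer than the client timeout."""
    return throttle_seconds(deficit_bytes, rate_bps) > timeout_s


# Throttles observed at the 64 B/s egress limit, as listed above.
observed_throttles_s = [3.6, 12.6]
deficits = [implied_deficit_bytes(t, 64) for t in observed_throttles_s]

if __name__ == "__main__":
    for rate in (64, 128, 256):
        delays = [round(throttle_seconds(d, rate), 2) for d in deficits]
        worst_times_out = exceeds_timeout(max(deficits), rate)
        print(f"{rate} B/s: throttles={delays}s, worst exceeds timeout: {worst_times_out}")
```

Under this model, the same deficit that produced the 12.6s throttle at 64 B/s would produce 6.3s at 128 B/s and 3.15s at 256 B/s. Real deficits depend on response sizes and on how many connections share the limit, so the margin at a given rate may be smaller than this estimate suggests.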

dlex added a commit to dlex/redpanda that referenced this issue Feb 14, 2023
Fixes redpanda-data#8809

Double the minimum tested value for TP limit (both ingress and egress)
because there is evidence that 64 B/s on the egress side causes timeouts
while a client connects to the cluster.

dlex commented Apr 28, 2023

Happened again in CDT: https://buildkite.com/redpanda/vtools/builds/7321#0187c490-ccb8-4b5c-89bc-a3d4ebd3ee93

128 B/s still seems to be on the edge; it needs a bump to 256 B/s.

dlex reopened this Apr 28, 2023
dlex added a commit to dlex/redpanda that referenced this issue Apr 28, 2023
Double the minimum tested value for TP limit (both ingress and egress)
because there is evidence that 128 B/s is still close to the edge
and may fail the test when used for both ingress and egress.

Fixes redpanda-data#8809
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue May 5, 2023
Double the minimum tested value for TP limit (both ingress and egress)
because there is evidence that 128 B/s is still close to the edge
and may fail the test when used for both ingress and egress.

Fixes redpanda-data#8809

(cherry picked from commit 6b439c4)
travisdowns pushed a commit to travisdowns/redpanda that referenced this issue May 5, 2023
Double the minimum tested value for TP limit (both ingress and egress)
because there is evidence that 128 B/s is still close to the edge
and may fail the test when used for both ingress and egress.

Fixes redpanda-data#8809
travisdowns reopened this Feb 8, 2024
travisdowns assigned BenPope and unassigned dlex Feb 8, 2024

travisdowns commented Feb 8, 2024

@BenPope - this is still happening and as you see there has been a history of just bumping the limit which hasn't panned out (yet?).

As you are planning big changes in this area, I figure this failure may be obsoleted by them, so looking into it is probably fruitless in light of that. I guess after your changes go in and this hasn't happened for a while, we can simply close it.

WDYT?

michael-redpanda commented

I'm going to mark this as sev/low and point at the epic https://github.com/redpanda-data/core-internal/issues/917

michael-redpanda added the sev/low label Feb 9, 2024

BenPope commented Feb 21, 2024

> @BenPope - this is still happening and as you see there has been a history of just bumping the limit which hasn't panned out (yet?).
>
> As you are planning big changes in this area, I figure this failure may be obsoleted by them, so looking into it is probably fruitless in light of that. I guess after your changes go in and this hasn't happened for a while, we can simply close it.
>
> WDYT?

This makes sense to me.


travisdowns commented Mar 13, 2024

Closing this as discussed above, cc @BenPope .

This issue is less useful than more recent ones because the build link in the OP no longer exists (it is over a year old, so it has been cleaned up on the Buildkite side). I'm going to unduplicate #16508 so that one can be re-opened if this is still failing.
