
CI Failure (assertion error for metric aggregatedEndToEndLatencyAvg) ManyPartitionsTest.test_omb #6334

Closed
abhijat opened this issue Sep 8, 2022 · 5 comments

@abhijat (Contributor) commented Sep 8, 2022

FAIL test: ManyPartitionsTest.test_omb (2/2 runs)
failure at 2022-09-08T07:19:27.327Z: AssertionError("['Metric aggregatedEndToEndLatencyAvg, value 137.5648158112895, Expected to be <= 50, check failed.']")
in job https://buildkite.com/redpanda/vtools/builds/3487#01831b0e-22b1-4569-998d-ba2249309e87



test_id:    rptest.scale_tests.many_partitions_test.ManyPartitionsTest.test_omb
status:     FAIL
run time:   6 minutes 23.748 seconds

AssertionError("['Metric aggregatedEndToEndLatencyAvg, value 137.5648158112895, Expected to be <= 50, check failed.']")
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 696, in test_omb
    self._run_omb(scale)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 677, in _run_omb
    benchmark.check_succeed()
  File "/home/ubuntu/redpanda/tests/rptest/services/openmessaging_benchmark.py", line 284, in check_succeed
    OMBSampleConfigurations.validate_metrics(metrics, self.validator)
  File "/home/ubuntu/redpanda/tests/rptest/services/openmessaging_benchmark_configs.py", line 75, in validate_metrics
    assert len(results) == 0, str(results)
AssertionError: ['Metric aggregatedEndToEndLatencyAvg, value 137.5648158112895, Expected to be <= 50, check failed.']
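For context, the failure comes from a generic metric-validation step: each OMB metric is checked against a list of bounds, any failure messages are collected, and the test asserts that the list is empty. Below is a minimal sketch of that pattern as implied by the traceback; the helper `lte` and the exact validator structure are assumptions, not the actual rptest code.

```python
# Sketch only: hypothetical helpers matching the shape of the traceback,
# not the real implementation in openmessaging_benchmark_configs.py.
def lte(limit):
    """Return a check that fails when the observed value exceeds `limit`."""
    return lambda name, value: (
        None if value <= limit
        else f"Metric {name}, value {value}, Expected to be <= {limit}, check failed.")

def validate_metrics(metrics, validator):
    # Collect one failure string per violated bound.
    results = []
    for name, checks in validator.items():
        for check in checks:
            failure = check(name, metrics[name])
            if failure is not None:
                results.append(failure)
    # The AssertionError seen above is this assert firing with one failure string.
    assert len(results) == 0, str(results)
```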

@abhijat abhijat added kind/bug Something isn't working ci-failure labels Sep 8, 2022
@abhijat abhijat changed the title CI Failure (assertion error) ManyPartitionsTest.test_omb CI Failure (assertion error for metric aggregatedEndToEndLatencyAvg) ManyPartitionsTest.test_omb Sep 8, 2022
@jcsp (Contributor) commented Sep 12, 2022

This test mainly functions as a quick check that OMB doesn't fall over when run against a system with a high partition count -- the actual latency pass/fail threshold is inherited from the pre-existing UNIT_TEST_LATENCY_VALIDATOR, which is meant to be quite liberal (although apparently not liberal enough).

The actual performance is expected to be somewhat below par on a system running at maximum partitions-per-core density, although it is still interesting that it's this inconsistent.
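A validator like UNIT_TEST_LATENCY_VALIDATOR presumably maps OMB metric names to upper-bound checks. A hypothetical sketch of its shape follows; only the 50 ms cap on the average end-to-end latency is implied by the failure message, everything else about the real definition may differ.

```python
# Hypothetical shape of a liberal latency validator, using the lte() helper
# sketched earlier. The 50 ms average bound is taken from the failure
# message; any other metric bounds would sit alongside it in the same dict.
UNIT_TEST_LATENCY_VALIDATOR = {
    "aggregatedEndToEndLatencyAvg": [lte(50)],  # average end-to-end latency (ms)
}
```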

@jcsp (Contributor) commented Sep 12, 2022

The latency is fine initially but goes bad ~100s into the test, around the same time that a bunch of RPC request timeouts show up.

These tests only run with INFO-level logging, so that is the extent of the information available; we can't see, for example, whether there was leadership instability.

@piyushredpanda piyushredpanda assigned ballard26 and unassigned jcsp Sep 23, 2022
@piyushredpanda (Contributor) commented

@ballard26 : Would love your help on this one. Thanks!

@ballard26 (Contributor) commented

Looking into this now

@ballard26 (Contributor) commented

An immediate, but temporary, solution is to increase the upper limit for average latency. We are seeing similar latency spikes in our longer-term benchmarking efforts with OMB, so my plan is to raise the latency limit for now, look at this issue more closely in those benchmarking runs, and hopefully find an actionable cause for the spikes. Once that is done we should be able to return the latency limit to what it was.

To that end I'll be opening a PR later today with the latency limit for this test increased.
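One way the temporary relaxation could look is to give this test its own validator with a higher average-latency cap instead of reusing the unit-test one. This is purely illustrative; the actual PR may take a different approach and a different value.

```python
# Hypothetical relaxed validator for the many-partitions OMB run, reusing
# the lte() helper sketched earlier. The 200 ms figure is an illustrative
# placeholder, not the number from the eventual PR.
MANY_PARTITIONS_LATENCY_VALIDATOR = {
    "aggregatedEndToEndLatencyAvg": [lte(200)],  # raised from 50 ms for now
}
```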
