
CI Failure (assertion error for metric aggregatedEndToEndLatencyAvg) ManyPartitionsTest.test_omb #6334

Closed
abhijat opened this issue Sep 8, 2022 · 5 comments

@abhijat (Contributor) commented Sep 8, 2022

FAIL test: ManyPartitionsTest.test_omb (2/2 runs)
failure at 2022-09-08T07:19:27.327Z: AssertionError("['Metric aggregatedEndToEndLatencyAvg, value 137.5648158112895, Expected to be <= 50, check failed.']")
in job https://buildkite.com/redpanda/vtools/builds/3487#01831b0e-22b1-4569-998d-ba2249309e87



test_id:    rptest.scale_tests.many_partitions_test.ManyPartitionsTest.test_omb
status:     FAIL
run time:   6 minutes 23.748 seconds

AssertionError("['Metric aggregatedEndToEndLatencyAvg, value 137.5648158112895, Expected to be <= 50, check failed.']")
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 696, in test_omb
    self._run_omb(scale)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_partitions_test.py", line 677, in _run_omb
    benchmark.check_succeed()
  File "/home/ubuntu/redpanda/tests/rptest/services/openmessaging_benchmark.py", line 284, in check_succeed
    OMBSampleConfigurations.validate_metrics(metrics, self.validator)
  File "/home/ubuntu/redpanda/tests/rptest/services/openmessaging_benchmark_configs.py", line 75, in validate_metrics
    assert len(results) == 0, str(results)
AssertionError: ['Metric aggregatedEndToEndLatencyAvg, value 137.5648158112895, Expected to be <= 50, check failed.']
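For context, the failure comes from a generic metric-validation step: each OMB metric is checked against a list of bounds, any failure messages are collected, and the test asserts that the list is empty. Below is a minimal sketch of that pattern as implied by the traceback; the helper `lte` and the exact validator structure are assumptions, not the actual rptest code.

```python
# Sketch only: hypothetical helpers matching the shape of the traceback,
# not the real implementation in openmessaging_benchmark_configs.py.
def lte(limit):
    """Return a check that fails when the observed value exceeds `limit`."""
    return lambda name, value: (
        None if value <= limit
        else f"Metric {name}, value {value}, Expected to be <= {limit}, check failed.")

def validate_metrics(metrics, validator):
    # Collect one failure string per violated bound.
    results = []
    for name, checks in validator.items():
        for check in checks:
            failure = check(name, metrics[name])
            if failure is not None:
                results.append(failure)
    # The AssertionError seen above is this assert firing with one failure string.
    assert len(results) == 0, str(results)
```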

@abhijat abhijat added kind/bug Something isn't working ci-failure labels Sep 8, 2022
@abhijat abhijat changed the title CI Failure (assertion error) ManyPartitionsTest.test_omb CI Failure (assertion error for metric aggregatedEndToEndLatencyAvg) ManyPartitionsTest.test_omb Sep 8, 2022
@jcsp (Contributor) commented Sep 12, 2022

This test mainly functions as a quick check that OMB doesn't fall over when run against a system with a high partition count -- the actual latency pass/fail threshold is inherited from the pre-existing UNIT_TEST_LATENCY_VALIDATOR, which is meant to be quite liberal (although apparently not liberal enough).

The actual performance is expected to be somewhat below par on a system running at maximum partitions-per-core density, although it is still interesting that it's this inconsistent.
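A validator like UNIT_TEST_LATENCY_VALIDATOR presumably maps OMB metric names to upper-bound checks. A hypothetical sketch of its shape follows; only the 50 ms cap on the average end-to-end latency is implied by the failure message, everything else about the real definition may differ.

```python
# Hypothetical shape of a liberal latency validator, using the lte() helper
# sketched earlier. The 50 ms average bound is taken from the failure
# message; any other metric bounds would sit alongside it in the same dict.
UNIT_TEST_LATENCY_VALIDATOR = {
    "aggregatedEndToEndLatencyAvg": [lte(50)],  # average end-to-end latency (ms)
}
```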

@jcsp (Contributor) commented Sep 12, 2022

The latency is fine initially but goes bad ~100s into the test, around the same time that a bunch of RPC request timeouts show up.

These tests only run with INFO-level logging, so that is the extent of the information available; we can't see, for example, whether there was leadership instability.

@piyushredpanda piyushredpanda assigned ballard26 and unassigned jcsp Sep 23, 2022
@piyushredpanda (Contributor) commented

@ballard26 : Would love your help on this one. Thanks!

@ballard26 (Contributor) commented

Looking into this now

@ballard26 (Contributor) commented

An immediate, but temporary, solution is to increase the upper limit for average latency. We are seeing similar latency spikes in our longer-term benchmarking efforts with OMB, so my plan is to raise the latency limit for now, look at this issue more closely in those benchmarking runs, and hopefully find an actionable cause for the spikes. Once that is done we should be able to return the latency limit to what it was.

To that end I'll be opening a PR later today with the latency limit for this test increased.
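One way the temporary relaxation could look is to give this test its own validator with a higher average-latency cap instead of reusing the unit-test one. This is purely illustrative; the actual PR may take a different approach and a different value.

```python
# Hypothetical relaxed validator for the many-partitions OMB run, reusing
# the lte() helper sketched earlier. The 200 ms figure is an illustrative
# placeholder, not the number from the eventual PR.
MANY_PARTITIONS_LATENCY_VALIDATOR = {
    "aggregatedEndToEndLatencyAvg": [lte(200)],  # raised from 50 ms for now
}
```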
