rptest: fix test_exceed_broker_limit flake #17932

Merged

Conversation

travisdowns
Member

ConnectionLimitsTest.test_exceed_broker_limit had a spurious failure in CI. This test starts 2 consumers which (should) consume all 6 available connections, then checks that a producer started after that fails to produce (due to connection limit being hit).

However, the consumer and producer starts are all async, so the producer can race ahead of one of the consumers and grab the connections for itself, failing the test.

Change the test to wait for the consumers to connect by waiting until the connection metric hits 6, and only then start the producer.

Fixes #17897.
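
The ordering described above can be sketched in isolation. This is only an illustration under stated assumptions, not the test's actual code: `FakeBroker`, `start_consumer_async`, and this standalone `wait_until` are hypothetical stand-ins for the real broker, the async consumers, and the test framework's polling helper.

```python
import threading
import time


def wait_until(condition, timeout_sec, backoff_sec, err_msg="condition not met"):
    """Poll `condition` until it returns True or `timeout_sec` elapses."""
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(err_msg)
        time.sleep(backoff_sec)


class FakeBroker:
    """Hypothetical broker that rejects connections past a fixed limit."""

    def __init__(self, limit):
        self.limit = limit
        self._count = 0
        self._lock = threading.Lock()

    def try_connect(self):
        with self._lock:
            if self._count >= self.limit:
                return False
            self._count += 1
            return True

    def connection_count(self):
        with self._lock:
            return self._count


def start_consumer_async(broker, n_conns, delay_sec):
    """Start a 'consumer' that opens n_conns connections after a delay."""

    def run():
        time.sleep(delay_sec)
        for _ in range(n_conns):
            broker.try_connect()

    thread = threading.Thread(target=run)
    thread.start()
    return thread


broker = FakeBroker(limit=6)
# Two consumers, three connections each, both starting asynchronously.
consumers = [start_consumer_async(broker, 3, d) for d in (0.05, 0.1)]

# Without this wait, the producer below could connect before the consumers.
wait_until(lambda: broker.connection_count() == 6, timeout_sec=5, backoff_sec=0.01)

# Only now start the producer; the limit is already exhausted.
producer_connected = broker.try_connect()

for thread in consumers:
    thread.join()
```

In this sketch the producer's connection attempt is made only after the count reaches the limit, so it is the producer, not a consumer, that gets rejected.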

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

  • none

oleiman
oleiman previously approved these changes Apr 17, 2024
Member

@oleiman oleiman left a comment

makes sense. lgtm

# producer, since otherwise the consumers and the producer race and the producer
# may win in which case it would be one of the consumers that fail to connect
self.redpanda.wait_until(
lambda: connection_count() == 6, 60, 1,
Member

any chance this becomes flaky? for example a once-in-a-while connection gets dropped and re-opened by franz-go/rpk but the metric doesn't update so quickly?

Member Author

I did consider the possible flakiness here.

For one thing, this just wouldn't work on a system where there are any unaccounted-for connections. E.g., in a cloud test with other stuff connected (say, kminion), this condition might simply fail: connections may never hit exactly 6, they may be above 6 even from the start, or they may jump from 1 to 4 to 7 or something like that. However, this test is already written in a way that expects a clean system, since it counts connections "exactly".

The metric itself updates instantly, but it's based on RP's view of the connections, so if a connection was silently dropped then it could continue to reflect a non-existent connection for a while, as you suggest. This doesn't seem that likely, as we are making fresh connections and waiting for the number to hit the expected count, which generally happens almost instantly.

However, I think it would reduce the chance of future flakiness if I made this >= 6, rather than == 6, at the cost of not being informed about unexpected changes in the connection behavior. I think that's probably closer to the original intent of this test. I'll make that change unless anyone disagrees.
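
To illustrate the trade-off, here is a small sketch (not taken from the test itself) that replays a hypothetical sequence of polled connection counts against both predicates. `first_match` and the sample values are assumptions chosen to mirror the "1 to 4 to 7" case mentioned earlier.

```python
def first_match(samples, predicate):
    """Return the index of the first sample satisfying predicate, else None."""
    for index, value in enumerate(samples):
        if predicate(value):
            return index
    return None


# Hypothetical polled values: the count jumps past 6 between two polls.
samples = [1, 4, 7, 7]

exact = first_match(samples, lambda count: count == 6)     # never satisfied
at_least = first_match(samples, lambda count: count >= 6)  # satisfied at index 2
```

With `== 6` the wait would time out on this sequence, while `>= 6` succeeds but no longer flags that the count overshot the expected value.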

Member Author

Changed to >= 6 in cb22f0a

@dotnwat
Member

dotnwat commented Apr 19, 2024

gtest_raft_rpunit failed. this pr touches no .cc file or files associated with unit testing.

@dotnwat dotnwat merged commit ce44d56 into redpanda-data:dev Apr 19, 2024
13 of 16 checks passed
@vbotbuildovich
Collaborator

/backport v23.3.x

Development

Successfully merging this pull request may close these issues.

[v23.3.x] CI Failure (key symptom) in WriteCachingFailureInjectionTest.test_unavoidable_data_loss
4 participants