Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve stream reader metrics cleanup #3340

Merged
merged 8 commits into from Aug 31, 2021

Conversation

kjnilsson
Copy link
Contributor

@kjnilsson kjnilsson commented Aug 26, 2021

When stream readers crash, we want to cleanup metrics, otherwise stream consumer and publisher metrics will diverge from reality, and nodes will report a higher number of these than there are in reality.

We cleaned up some logs in the process, and the replica_recovery test in rabbit_stream_queue SUITE was made more reliable.

@gerhard gerhard force-pushed the improve-stream-reader-metrics-cleanup branch from b288ffc to 97704ec Compare August 27, 2021 10:55
gerhard and others added 7 commits August 31, 2021 15:29
Otherwise metrics will not get cleaned up correctly when processes crash.

It's also tidier to do this in a single place, in terminate/3

Pair: @kjnilsson

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
Pair: @kjnilsson

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
Pair: @kjnilsson

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
This ensures that only nodes that are ready to host stream members
are included in the election. This avoids continuous restart attempts
when the rabbit application is stopped.
Some logs used ~p to format a full stack trace. Given these warnings are emitted during
any nodedown this unnecessarily pollutes the logs. Trimmed using ~W instead.
Also increased the tick timeout to avoid checking for new rabbit nodes
to auto add too often.

Also increased sleep times for nodedowns to retry less often.
When the server initiate connection close.
@gerhard gerhard force-pushed the improve-stream-reader-metrics-cleanup branch from 97704ec to c240ec2 Compare August 31, 2021 14:29
Rather than sleeping for 6 seconds, we want to check that replica
recovered multiple times within 30 seconds, and either eventually
succeed, or fail if this does not recover within 30 seconds, the default
await_condition time interval.

Pair: @kjnilsson

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
@gerhard gerhard marked this pull request as ready for review August 31, 2021 17:36
@gerhard gerhard merged commit 2a35b1c into master Aug 31, 2021
@gerhard gerhard deleted the improve-stream-reader-metrics-cleanup branch August 31, 2021 17:36
@gerhard gerhard added this to the 3.9.5 milestone Aug 31, 2021
@michaelklishin michaelklishin modified the milestones: 3.9.5, 3.9.6 Sep 1, 2021
@michaelklishin
Copy link
Member

@Mergifyio backport v3.9.x

@mergify
Copy link

mergify bot commented Sep 1, 2021

Command backport v3.9.x: success

Backports have been created

michaelklishin added a commit that referenced this pull request Sep 1, 2021
Improve stream reader metrics cleanup (backport #3340)
gerhard added a commit that referenced this pull request Sep 1, 2021
I've missed adding notes to the release part of
#3340

I am going to add a new item to the PULL_REQUEST_TEMPLATE.md checklist
so that we get reminded about it next time.

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
gerhard added a commit that referenced this pull request Sep 1, 2021
I've missed adding notes to the release part of
#3340

I am going to add a new item to the PULL_REQUEST_TEMPLATE.md checklist
so that we get reminded about it next time.

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
(cherry picked from commit f6c8356)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants