@the-mikedavis (Collaborator) commented Oct 29, 2025

This introduces a metric calculated at every batch which records the difference between the timestamps of the last chunks in the logs of the writer and its replicas. This can be used to watch replicas catch up on replication, or to diagnose situations when replication is being starved out (network-wise) by high-throughput publishing. The replication diff should usually be low but, for streams seeing traffic, non-zero.

This calculation is also used in the stream coordinator when adding a member, via `osiris_writer:query_replication_state/1`. With this change the stream coordinator could be updated to use the counter instead of calling the writer.
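
To make the calculation concrete, here is a minimal sketch in Python rather than the Erlang used in this PR. It is not the actual implementation: the function name, the node names, and the assumption that timestamps are milliseconds since the epoch are all hypothetical and chosen only for illustration.

```python
from typing import Dict


def replication_diffs(writer_last_ts_ms: int,
                      replica_last_ts_ms: Dict[str, int]) -> Dict[str, int]:
    """Return, per replica, how far its last chunk timestamp trails the writer's.

    Hypothetical helper: inputs are assumed to be millisecond timestamps of
    the last chunk in each member's log.
    """
    return {
        node: writer_last_ts_ms - replica_ts
        for node, replica_ts in replica_last_ts_ms.items()
    }


# Example with made-up values: replica-a is caught up, replica-b trails by 1.5 s.
diffs = replication_diffs(
    1_700_000_005_000,
    {"replica-a": 1_700_000_005_000, "replica-b": 1_700_000_003_500},
)
print(diffs)  # {'replica-a': 0, 'replica-b': 1500}
```

A diff that stays large or keeps growing for one replica would be the signal to look for when diagnosing starved replication.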

@the-mikedavis self-assigned this on Oct 29, 2025
@the-mikedavis force-pushed the replication-diff-metric branch from 8272b66 to 0fa0df1 on October 30, 2025 14:40
@the-mikedavis (Collaborator, Author) commented:

In addition to this (or maybe instead of this) we could track replica freshness. The stream coordinator calculates "freshness" as a requirement in add_replica/3: https://github.com/rabbitmq/rabbitmq-server/blob/0e38285330c4e1e77726300c0f6dbd0aa28e524f/deps/rabbit/src/rabbit_stream_coordinator.erl#L214-L219

@the-mikedavis force-pushed the replication-diff-metric branch from 0fa0df1 to 680b7a9 on December 17, 2025 21:18
@the-mikedavis (Collaborator, Author) commented:

I updated this to perform the same calculation as the stream coordinator does: https://github.com/rabbitmq/rabbitmq-server/blob/765d2c5d748f1a3227b97e966a31a73f4b561867/deps/rabbit/src/rabbit_stream_coordinator.erl#L209-L221

(Also see discussion in rabbitmq/rabbitmq-server#15098)

So we could use this metric instead of querying the replication state with a call.
