Fix `object_count` metric when grouping is on #4398

etiennedi · 2024-03-05T12:00:02Z

Please see #4396 ("object_count metrics reports nonsense values when prom grouping is active") for a description of the issue being addressed here.
This PR fixes it by making the object count async: a loop iterates over all collections and shards every 30s.
With the recently introduced .ObjectCountAsync, getting the object count is very cheap, even with tens of thousands of shards. This is because it only ever hits disk segments, which have a pre-computed object count. In the past we would also hit memtables, which relied on a very expensive calculation to find out whether a write is an update or a new insertion.
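To illustrate why this is cheap, here is a minimal sketch, not the actual implementation. The `segment` and `shard` types and the `objectCountAsync` method are hypothetical stand-ins. It also assumes per-segment counts can simply be added up, which may not match how the real store handles updates across segments.

```go
package main

import "fmt"

// segment is a hypothetical stand-in for a flushed disk segment that
// stored its object count when it was written.
type segment struct {
	objectCount int
}

// shard is a hypothetical stand-in holding flushed segments plus a
// memtable with writes that have not been flushed yet.
type shard struct {
	segments        []segment
	memtablePending int // unflushed writes; deliberately ignored below
}

// objectCountAsync adds up the pre-computed counts of flushed segments only.
// It never inspects the memtable, so it cannot tell (and does not need to
// tell) whether a pending write is an update or a new insertion.
func (s shard) objectCountAsync() int {
	total := 0
	for _, seg := range s.segments {
		total += seg.objectCount
	}
	return total
}

func main() {
	s := shard{
		segments:        []segment{{objectCount: 100}, {objectCount: 25}},
		memtablePending: 3,
	}
	fmt.Println(s.objectCountAsync()) // prints 125; the 3 pending writes are not counted yet
}
```

The trade-off is that the count lags behind until the memtable is flushed, which is the first half of the async behavior described in this PR.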
This PR makes the object_count metric async (same as the /v1/nodes API). In a sense it's "double async": first, the object count itself only reflects already flushed segments; second, a loop iterates over all collections and all shards at a regular interval.
The eventually consistent (async) behavior should be acceptable for monitoring purposes where everything is async (metric collection, scraping, etc.)
Note that no behavior is altered if PROMETHEUS_MONITORING_GROUP is not active. In this case, we can just keep relying on the flush event to update the metric per shard.
This PR also respects lazy shard loading. It adds an allShardsReady flag to each collection. The async monitoring cycle is only started when all shards of all collections are ready.
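A rough sketch of the readiness gate, assuming a plain boolean per collection. The `collection` type and `allCollectionsReady` helper are hypothetical, and real code would need synchronization around the flag because shards load concurrently.

```go
package main

import "fmt"

// collection is a hypothetical stand-in carrying the per-collection flag.
type collection struct {
	name           string
	allShardsReady bool
}

// allCollectionsReady reports whether every collection has finished loading
// all of its shards. The monitoring cycle would only start once this is true.
func allCollectionsReady(cols []collection) bool {
	for _, c := range cols {
		if !c.allShardsReady {
			return false
		}
	}
	return true
}

func main() {
	cols := []collection{
		{name: "Article", allShardsReady: true},
		{name: "Author", allShardsReady: false}, // still lazy-loading shards
	}
	fmt.Println(allCollectionsReady(cols)) // prints false

	cols[1].allShardsReady = true
	fmt.Println(allCollectionsReady(cols)) // prints true
}
```

Gating on readiness avoids forcing lazily loaded shards to load just so the metric can be computed.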
fixes object_count metrics reports nonsense values when prom grouping is active #4396

aliszka

LGTM

sonarcloud · 2024-03-07T13:15:54Z

Quality Gate passed

etiennedi force-pushed the fix-object-count-metric branch from 34cbfac to fdddf54 March 5, 2024 12:03

etiennedi added 2 commits March 5, 2024 19:20

gh-4396 fix object count metric when grouping is on

882031a

gh-4396 handle shutdown correctly

06c498c

etiennedi force-pushed the fix-object-count-metric branch from aac617c to 06c498c March 5, 2024 18:20

etiennedi marked this pull request as ready for review March 5, 2024 18:20

aliszka previously approved these changes Mar 6, 2024

resolve merge conflicts

ebea11f

parkerduckworth dismissed aliszka’s stale review via ebea11f March 7, 2024 13:13

parkerduckworth approved these changes Mar 7, 2024

parkerduckworth merged commit d188c38 into stable/v1.24 Mar 7, 2024
34 of 35 checks passed

parkerduckworth deleted the fix-object-count-metric branch March 7, 2024 13:33