Fix object_count metric when grouping is on #4398

Merged
merged 3 commits into from
Mar 7, 2024

@etiennedi etiennedi commented Mar 5, 2024

What's being changed:

  • Please see #4396 ("object_count metrics reports nonsense values when prom grouping is active") for a description of the issue addressed here.
  • This PR fixes the issue by computing the object count asynchronously: a background cycle iterates over all collections and shards every 30s.
  • With the recently introduced .ObjectCountAsync, getting the object count is very cheap, even with tens of thousands of shards. This is because it only ever hits disk segments, which carry a pre-computed object count. In the past we would also hit memtables, which relied on a very expensive calculation to find out whether a write is an update or a new insertion.
  • The object_count metric is now async (same as the /v1/nodes API). In a sense it is "double async": first, the object count itself only reflects already-flushed segments; second, a loop walks all collections and all shards at a regular interval.
  • The eventually consistent (async) behavior should be acceptable for monitoring purposes, where everything is async (metric collection, scraping, etc.).
  • Note that no behavior is altered if PROMETHEUS_MONITORING_GROUP is not active. In that case we keep relying on the flush event to update the metric per shard.
  • This PR also respects lazy shard loading: it adds an allShardsReady flag to each collection, and the async monitoring cycle is only started once all shards of all collections are ready.
  • Fixes #4396 (object_count metrics reports nonsense values when prom grouping is active).
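
For readers who want a mental model of the cycle described above, here is a minimal sketch. It is not the code in this PR: the `shard` and `collection` types, `sumObjectCounts`, `runObjectCountCycle`, and the plain atomic integer standing in for the Prometheus gauge are all hypothetical, and only the names `ObjectCountAsync` and `allShardsReady` come from the description (their real signatures may differ). The description says the cycle is only started once everything is ready; the sketch approximates that by skipping ticks until every collection reports ready.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// shard is a hypothetical stand-in for a real shard.
type shard struct {
	flushed int64 // objects in already-flushed disk segments (assumed pre-computed)
}

// ObjectCountAsync borrows its name from the PR description; the signature is assumed.
// It only reports flushed objects and never inspects memtables.
func (s *shard) ObjectCountAsync() int64 { return s.flushed }

// collection is a hypothetical stand-in for a collection (index).
type collection struct {
	allShardsReady atomic.Bool // set once lazy shard loading has finished
	shards         []*shard
}

// sumObjectCounts adds up the async counts of all shards of all collections.
// It returns ok=false if any collection still has shards that are not ready.
func sumObjectCounts(cols []*collection) (total int64, ok bool) {
	for _, c := range cols {
		if !c.allShardsReady.Load() {
			return 0, false
		}
	}
	for _, c := range cols {
		for _, s := range c.shards {
			total += s.ObjectCountAsync()
		}
	}
	return total, true
}

// runObjectCountCycle recomputes the grouped total on every tick and stores it
// in gauge (a stand-in for the real metric) until ctx is cancelled.
func runObjectCountCycle(ctx context.Context, interval time.Duration,
	cols []*collection, gauge *atomic.Int64) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if total, ok := sumObjectCounts(cols); ok {
				gauge.Store(total)
			}
		}
	}
}

func main() {
	a := &collection{shards: []*shard{{flushed: 3}, {flushed: 4}}}
	a.allShardsReady.Store(true)

	var gauge atomic.Int64
	// A short interval keeps the demo quick; the PR description uses 30s.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	runObjectCountCycle(ctx, 10*time.Millisecond, []*collection{a}, &gauge)
	fmt.Println("object_count:", gauge.Load())
}
```

Because the total is recomputed from scratch on each tick rather than adjusted per flush event, a grouped metric shared by many shards cannot drift; the trade-off is that the value lags by up to one interval plus whatever has not been flushed yet.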

Review checklist

  • Documentation has been updated, if necessary. Link to changed documentation:
  • Chaos pipeline run or not necessary. Link to pipeline:
  • All new code is covered by tests where it is reasonable.
  • Performance tests have been run or not necessary.

@etiennedi etiennedi marked this pull request as ready for review March 5, 2024 18:20
aliszka
aliszka previously approved these changes Mar 6, 2024
@aliszka aliszka left a comment

LGTM

sonarcloud bot commented Mar 7, 2024

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@parkerduckworth parkerduckworth merged commit d188c38 into stable/v1.24 Mar 7, 2024
34 of 35 checks passed
@parkerduckworth parkerduckworth deleted the fix-object-count-metric branch March 7, 2024 13:33