Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Instrumented servers #6074

Merged
merged 12 commits into from Feb 13, 2023
Merged

Instrumented servers #6074

merged 12 commits into from Feb 13, 2023

Conversation

fpetkovski
Copy link
Contributor

@fpetkovski fpetkovski commented Jan 26, 2023

This PR adds an InstrumentedStoreServer that exposes metrics
about Series requests and uses it as a decorator around all Store APIs.
The instrumented store currently exposes two histogram metrics, series requested
and chunks requested. Additional metrics can be added as needed.

This PR also implements a RateLimitedStoreServer which can be used
to apply various limits to Series calls in components that implement
the Store API.

Rate limits are disabled by default but can be enabled selectively
for each individual Thanos component.

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

  • Expose histogram metrics from stores about series and chunks per Series request.
  • Allow configuring limits for series and chunks per Series request.

Verification

Manual verification and tests.

@fpetkovski fpetkovski force-pushed the instrumented-servers branch 2 times, most recently from 4822670 to 0dbde4f Compare January 26, 2023 11:25
docs/components/receive.md Outdated Show resolved Hide resolved
pkg/store/ratelimit.go Outdated Show resolved Hide resolved
seriesRequested: promauto.With(reg).NewHistogram(prometheus.HistogramOpts{
Name: "thanos_store_server_series_requested",
Help: "Number of requested series for Series calls",
Buckets: []float64{1, 10, 100, 1000, 10000, 100000},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can bump this even more? I have some servers sending millions of series in certain calls.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote for allowing this the bucket to be configured via CLI args.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With native histograms we won't have to configure buckets anymore. Let's add a couple more buckets for chunks temporarily until we have complete native histograms support.

Copy link
Member

@GiedriusS GiedriusS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😄 perhaps it would be worth including a paragraph or two in the docs about all the different kinds of limits that we have? It's becoming a lot 😄

@douglascamata
Copy link
Contributor

Great work, @fpetkovski! This will be so useful.

I was wondering though about the overall limiting situation that we will end having.

Particularly thinking about Thanos Store, which already implements limits through the CLI args --store.grpc.touched-series-limit and --store.grpc.series-sample-limit.

Maybe we can deprecate them (and clean up the code later on) and point people to the new options introduced by this PR? I think it will be better to avoid confusion.

@fpetkovski
Copy link
Contributor Author

Hm good point, I didn't see those in the bucket store. Should we remove those and rename the newly added ones so that they can be applied to any store? I would like to avoid deprecating arguments for the sake of renaming.

@douglascamata
Copy link
Contributor

douglascamata commented Feb 7, 2023

@fpetkovski personally I like more the new names, as they are component-agnostic and thus great candidates to standardize on. I don't think touched-series-limit for series limit and series-sample-limit for samples are good names to standardize on... they are confusingly different.

@fpetkovski fpetkovski force-pushed the instrumented-servers branch 3 times, most recently from 1639b53 to 157889e Compare February 8, 2023 09:05
This commit adds an instrumented store server that exposes metrics
about Series requests and uses it as a decorator around all Store APIs.

The instrumented store currenly exposes only two metrics, series requested
and chunks requested. Additional metrics can be added as needed.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
This commit implements a RateLimited store server which can be used
to apply various limits to Series calls in components that implement
the Store API.

Rate limits are disabled by default but can be enabled selectively
for each individual Thanos component.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
@fpetkovski
Copy link
Contributor Author

fpetkovski commented Feb 8, 2023

So I didn't realize we had series and chunk limiters already, so now the PR reuses what's already available for the bucket store. I've also consolidated the names of all flags so that they're the same across stores.

The only exception is the bucket  store which does not have a rate limited server wrapped around it. It seems to apply limits deeper in the stack before requesting data from object store. For other stores limits are still applied through a decorator.

PTAL again :)

Copy link
Contributor

@douglascamata douglascamata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some very small remarks, but overall I'm pretty happy with the work here. Great initiative, @fpetkovski!

CHANGELOG.md Outdated
### Changed

- [#6035](https://github.com/thanos-io/thanos/pull/6035) Replicate: Support all types of matchers to match blocks for replication. Change matcher parameter from string slice to a single string.
>>>>>>> 0dbde4f1 (Add CHANGELOG entry)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small conflict leftover here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, removed.

Comment on lines 103 to 106
type RateLimits struct {
SeriesPerRequest uint64
SamplesPerRequest uint64
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to explain in a comment here what's the purpose of this type and/or how the limit works.

For example, I wouldn't even call this type of limiting "rate limiting" because it's not happening over a period of time, like 30 series per second. this limiting is, as the name of the struct fields show, per request limits.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment on this type to explain how limits are applied. I also renamed the server to limitedServer and dropped the rate from it.

}

// rateLimitedStoreServer is a storepb.StoreServer that can apply rate limits against Series requests.
type rateLimitedStoreServer struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add the classic assignment to interface type to ensure this type keeps implementing the storepb.StoreServer interface at compile time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing, done.

}

// rateLimitedServer is a storepb.Store_SeriesServer that tracks statistics about sent series.
type rateLimitedServer struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add the classic assignment to interface type to ensure this type keeps implementing the storepb.Store_SeriesServer interface at compile time?

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
douglascamata
douglascamata previously approved these changes Feb 8, 2023
Copy link
Contributor

@douglascamata douglascamata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks a lot! 🙇

Comment on lines +130 to +133
failedRequestsCounter: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
Name: "thanos_store_selects_dropped_total",
Help: "Number of select queries that were dropped due to configured limits.",
}, []string{"reason"}),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI we already have this inside inside the limiters created on the lines above. Again, I like more the one introduced here for its standard potential and I think we can deprecate the old one later.

Copy link
Collaborator

@matej-g matej-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thank you @fpetkovski 🙇 . I have small nits, especially regarding the flags, but I'm happy to get this in current form.

Comment on lines 432 to 442
--store.grpc.samples-limit=0
The maximum samples allowed for a single
Series request, The Series call fails if
this limit is exceeded. 0 means no limit.
NOTE: For efficiency the limit is internally
implemented as 'chunks limit' considering each
chunk contains a maximum of 120 samples.
--store.grpc.series-limit=0
The maximum series allowed for a single Series
request. The Series call fails if this limit is
exceeded. 0 means no limit.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong, but it seems like we have two distinct store flag categories:

  • One is to configure the stores to which query connects
  • The second one added in this PR refers to the store server of the query

I wonder if we could distinguish them more clearly. Maybe store-limit.*?

Copy link
Contributor Author

@fpetkovski fpetkovski Feb 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, should we use something like store.limits.request-series and store.limits.request-samples?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would still be mixing both in one bag, since both flag groups are still under flag. But on the second thought, I think we shot ourself in the foot already since even in existing code we have dual use in query vs store...

And yet now that I'm thinking, we deprecated store in favor of endpoint, so eventually any store-connection related flags will be actually under endpoint.

I guess we can reclaim store as config for the actual store server, then store.limits.* is fine 👍.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah in this case store would be the store server, whereas endpoint would be something we configure on the client. I updated the names of the flags.

}

if err := i.seriesLimiter.Reserve(1); err != nil {
return errors.Wrapf(err, "failed to send series")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder once this error bubbles up, if we should use a specific gRPC code. I think currently we use Unknown for all errors, but perhaps in this case we could use ResourceExhausted or something similar. Could be useful when debugging issues to see this directly in response or server metrics / span errors.

But can be done additionally, not a blocker.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a good idea, but I'm concerned it might grow the scope of the PR even further. Right now we have no way to distinguish error types on the client side, and we mostly rely on surfacing the actual message of the user. In this case they would see something like failed to send series: limit X violated (got Y)

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
matej-g
matej-g previously approved these changes Feb 10, 2023
Copy link
Collaborator

@matej-g matej-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's fix the conflicts and get this in 🚀

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
@fpetkovski
Copy link
Contributor Author

Should be ready to merge @matej-g

Copy link
Collaborator

@matej-g matej-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @fpetkovski 🙇 , let's get this in!

@matej-g matej-g merged commit c4218c7 into thanos-io:main Feb 13, 2023
ngraham20 pushed a commit to ngraham20/thanos that referenced this pull request Apr 17, 2023
* Add instrumentation to Store servers

This commit adds an instrumented store server that exposes metrics
about Series requests and uses it as a decorator around all Store APIs.

The instrumented store currenly exposes only two metrics, series requested
and chunks requested. Additional metrics can be added as needed.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Add rate limits to Store servers

This commit implements a RateLimited store server which can be used
to apply various limits to Series calls in components that implement
the Store API.

Rate limits are disabled by default but can be enabled selectively
for each individual Thanos component.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Add CHANGELOG entry

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Reuse existing limiters

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Fix chunks limit binding

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Unify flag names

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Run make docs

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Add another series bucket

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Code review comments

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Fix changelog

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Rename flags

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

---------

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
ngraham20 pushed a commit to ngraham20/thanos that referenced this pull request Apr 17, 2023
* Add instrumentation to Store servers

This commit adds an instrumented store server that exposes metrics
about Series requests and uses it as a decorator around all Store APIs.

The instrumented store currenly exposes only two metrics, series requested
and chunks requested. Additional metrics can be added as needed.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Add rate limits to Store servers

This commit implements a RateLimited store server which can be used
to apply various limits to Series calls in components that implement
the Store API.

Rate limits are disabled by default but can be enabled selectively
for each individual Thanos component.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Add CHANGELOG entry

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Reuse existing limiters

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Fix chunks limit binding

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Unify flag names

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Run make docs

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Add another series bucket

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Code review comments

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Fix changelog

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Rename flags

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

---------

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants