Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Receive: Allow remote write request limits to be defined per file and tenant #5565

Merged
merged 24 commits into from Sep 6, 2022

Conversation

douglascamata
Copy link
Contributor

@douglascamata douglascamata commented Aug 2, 2022

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

This is part of the broader work outlined by #5404.

  • Allow request limits & gates to be configured using a file.
  • A user can save the file to disk and pass its path to receive.limits-config-file or pass the file content inline to receive.limits-config.
  • Documentation was updated with an example and explanation about how default and tenant limits interact.

Follow ups

  • Add a mechanism to reload the tenant limit configuration to avoid having to restart the Receive and reprocess the WAL.
    • I am already preparing a solution for this. It'll come in a separate PR soon after this is merged.
  • Eventually move the configuration for the active series limit added in Meta-monitoring based active series limiting #5520 to the same file.

Verification

  • Unit tests.
  • Ran it locally.

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
…s-per-tenant-config-file

Signed-off-by: GitHub <noreply@github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
@squat
Copy link
Member

squat commented Aug 2, 2022

xref: #5527 (comment)

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
@douglascamata douglascamata changed the title Limits per tenant config file Allow remote write request limits to be defined per file and tenant Aug 2, 2022
@douglascamata douglascamata changed the title Allow remote write request limits to be defined per file and tenant [receive] Allow remote write request limits to be defined per file and tenant Aug 2, 2022
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
@douglascamata
Copy link
Contributor Author

douglascamata commented Aug 3, 2022

@yeya24 based on the conversation we had at #5527 (comment), after I tried to add CLI args to configure the default limits in this PR while keeping also the file my opinion on this has changed for the following reasons:

  1. It adds complexity for users of Thanos to understand the configuration
    1. The logic becomes convoluted. There would be CLI args that override missing or present default values that override missing tenant values.
    2. It even allows for partial configuration of defaults via CLI and file at the same time, creating potential confusing setups for users (i.e. CLI defines default limit of series while config file defines default limit of request size).
  2. It adds complexity in the code without clear benefits
    1. Each one of these levels have different default values (zero or nil) that have to be handled.
    2. The type we use as a standard for passing configuration files (extflag.PathOrContent) also supports inlining of the configuration as a CLI argument. It's not difficult to use it to inline the configuration proposed in this PR in case you only want to set defaults.

I don't think adding extra CLI args is worth the price in extra complexity given the features and simplicity that we get from extflag.PathOrContent. What do you think?

@douglascamata douglascamata changed the title [receive] Allow remote write request limits to be defined per file and tenant Receive: Allow remote write request limits to be defined per file and tenant Aug 3, 2022
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
The next sections explain what each configuration value means.

```yaml mdox-exec="cat pkg/receive/testdata/limits_config/good_limits.yaml"
write:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idea from Community Hours: generalize on any label - not only scope to tenant labels.

This allows us to keep tenancy topic separate.

Perhaps this can help with other use cases 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make it clear for potential reviewers, from what I understood in the community hours, this PR and be reviewed as merged as is. The feature is documented as experimental and thus we can introduce changes to the configuration file and behavior in case we see fit, especially after the planned discussion regarding tenancy in Thanos.

@douglascamata
Copy link
Contributor Author

douglascamata commented Aug 4, 2022

I found out during verification tests that the default values for limits are not being exported in the Receive's metrics. Will fix this.

edit: Fixed!

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
@douglascamata douglascamata marked this pull request as ready for review August 5, 2022 17:26
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
@fpetkovski
Copy link
Contributor

Would changing the limits require receivers to be redeployed?

@douglascamata
Copy link
Contributor Author

@fpetkovski yes. We might want to improve this with some mechanism to reload this configuration in a follow up PR, like the config reload endpoint other components have.

Probably during reload time the limits would be shortly disabled to avoid synchronization issues under high load. WDYT?

@fpetkovski
Copy link
Contributor

Yeah I think config reloading will be more than a nice to have for this feature. Large receiver deployments can take hours to rollout, so we should try to speed up configuration changes if we can.

saswatamcode
saswatamcode previously approved these changes Aug 10, 2022
Copy link
Member

@saswatamcode saswatamcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for awesome work! 💪🏻
LGTM minus some comments!

@@ -194,6 +194,18 @@ func runReceive(
return errors.Wrap(err, "parse relabel configuration")
}

var limitsConfig *receive.RootLimitsConfig
if conf.limitsConfig != nil {
limitsContentYaml, err := conf.limitsConfig.Content()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @fpetkovski mentioned, having some /~/reload endpoint like we have in Thanos Ruler, to reload such configuration would be amazing, as adding some tenant limits now, would mean redeploying.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, leaving this for another PR to avoid this one getting too big.

type limiter struct {
requestLimiter requestLimiter
writeGate gate.Gate
// activeSeriesLimiter *activeSeriesLimiter `yaml:"active_series"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need these commented struct fields?
I can add an active series limit to this config in a follow-up PR, maybe a TODO would be better here? 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e5c171d. 👍

)

// RootLimitsConfig is the root configuration for limits.
type RootLimitsConfig struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just LimitsConfig? 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too many things end up being called by almost the same name and it gets confusing over time. This is already a result of many iterations on the naming. 😅

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Copyright (c) The Thanos Authors.
// Licensed under the Apache License 2.0.

package receive
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason for having separate limiter and request_limiter files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Limiter is mean to hold a bunch of different "limiters" (i.e. limits per request, the concurrency gate for writing data, active series limiter, and possibly a rate limiter in the future), parse the configuration of limits, initialize every limiter to avoid too much clutter all over around that code that does handler creation and setup (which also helps with reusability/dont-repeat-yourself principle plus encapsulation).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also have an eye in the future, where we would be able to partially reuse a bunch of the code that is in these files in the Query component. It already has a similar gate and I plan to add similar per-request limits to it.

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
bwplotka
bwplotka previously approved these changes Aug 11, 2022
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is clean, however, it still is tenant aware, even though it could be just "per label". I assume the conclusion is to use per tenant limits at the end?

LGTM otherwise, great job! (:

@douglascamata
Copy link
Contributor Author

it still is tenant aware, even though it could be just "per label". I assume the conclusion is to use per tenant limits at the end?

I can refactor this to be purely label-aware and take a few more days (or weeks) to get it ready, tested, and documented while the feature doesn't land.

Or land the feature as experimental (very likely to change) and then work on extending and changing it to be just label-aware afterwards. I thought this was our conclusion from the community meeting. Wasn't it, @bwplotka?

@douglascamata
Copy link
Contributor Author

I can even already fill up the code, example configuration file, and logs with warnings that the configuration structure will be changing soon to work on a per-label basis.

@douglascamata
Copy link
Contributor Author

douglascamata commented Aug 11, 2022

A point of the discussion: being straightforward that this feature supports tenancy is a plus for people operating Thanos clusters. Changing it to be label aware is friendly to maintainers/contributors (keep tenancy logic away and "hidden") but not to users (tenancy logic is hidden even though Receive is tenant-aware).

@douglascamata
Copy link
Contributor Author

There might be other points to consider before committing to make this purely label aware solution too:

  • Does it even make sense to apply a "request body size limit" to a subset of the request?
    • Only applicable limits are maximum amount of timeseries and samples.
  • Does it make sense performance-wise?
    • Regex matching or simple exact match?
    • What would the request response look like if only part of it hit a limit?
  • Complex logic vs simple per tenant limits
    • It becomes possible that two limit rules apply to given set of timeseries, which one would have priority?
    • What if there are multiple labels per group? i.e. labelA=valueA && labelB=valueB.
    • What if there are multiple values to group by, will regex performance be acceptable? i.e. labelA=valueA || labelA=valueB.
    • How to enable configuration if each individual value should be its own group? i.e. apply the limits for each unique value of labelA.

So I prefer to not rush directly into it.

@saswatamcode
Copy link
Member

Maybe this can be merged now (once conflicts are resolved) while discussion is under heavy progress, as this feature is experimental and would likely go through some iterations to get prod-ready? 🙂

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
…nant-config-file

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not much progress on long term goals for tenancy specific, so merging, since those are hidden flags anyway.

Thanks!

@bwplotka bwplotka merged commit b8eb3d8 into thanos-io:main Sep 6, 2022
prajain12 pushed a commit to prajain12/thanos that referenced this pull request Sep 6, 2022
… tenant (thanos-io#5565)

* Allow per-tenant limits to be configured via file

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Refactor Receive's limiting logic

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Fix some methods that were in plural

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Improve metric description

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Add a TODO for later

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Do some cleanup after moving limits to config file

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Isolate rest of limiting logic from the handler

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Small refactor to the request limiter

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Rename MergeWith -> OverlayWith

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Update changelog

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Update documentation

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Add missing copyright notice to few files

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Fix test after change in config file tenants

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Retrigger CI because of bundled-Cortex failing test

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Expose default limits as metrics

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Retrigger CI

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Replace comment with a TODOs

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Fix changelog after bad merge

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: GitHub <noreply@github.com>
Signed-off-by: Prakul Jain <prakul.jain@udaan.com>
prajain12 pushed a commit to prajain12/thanos that referenced this pull request Sep 6, 2022
… tenant (thanos-io#5565)

* Allow per-tenant limits to be configured via file

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Refactor Receive's limiting logic

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Fix some methods that were in plural

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Improve metric description

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Add a TODO for later

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Do some cleanup after moving limits to config file

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Isolate rest of limiting logic from the handler

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Small refactor to the request limiter

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Rename MergeWith -> OverlayWith

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Update changelog

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Update documentation

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Add missing copyright notice to few files

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Fix test after change in config file tenants

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Retrigger CI because of bundled-Cortex failing test

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Expose default limits as metrics

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Retrigger CI

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Replace comment with a TODOs

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Fix changelog after bad merge

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: GitHub <noreply@github.com>
Signed-off-by: Prakul Jain <prakul.jain@udaan.com>
prajain12 pushed a commit to prajain12/thanos that referenced this pull request Sep 6, 2022
… tenant (thanos-io#5565)

* Allow per-tenant limits to be configured via file

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Refactor Receive's limiting logic

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Fix some methods that were in plural

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Improve metric description

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Add a TODO for later

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Do some cleanup after moving limits to config file

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Isolate rest of limiting logic from the handler

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Small refactor to the request limiter

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Rename MergeWith -> OverlayWith

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Update changelog

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Update documentation

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Add missing copyright notice to few files

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Fix test after change in config file tenants

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Retrigger CI because of bundled-Cortex failing test

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Expose default limits as metrics

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Retrigger CI

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Replace comment with a TODOs

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Fix changelog after bad merge

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: GitHub <noreply@github.com>
Signed-off-by: Prakul Jain <prakul.jain@udaan.com>
douglascamata added a commit to douglascamata/thanos that referenced this pull request Sep 8, 2022
… tenant (thanos-io#5565)

* Allow per-tenant limits to be configured via file

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Refactor Receive's limiting logic

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Fix some methods that were in plural

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Improve metric description

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Add a TODO for later

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Do some cleanup after moving limits to config file

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Isolate rest of limiting logic from the handler

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Small refactor to the request limiter

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Rename MergeWith -> OverlayWith

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Update changelog

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Update documentation

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Add missing copyright notice to few files

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Fix test after change in config file tenants

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Retrigger CI because of bundled-Cortex failing test

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Expose default limits as metrics

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Retrigger CI

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Replace comment with a TODOs

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Fix changelog after bad merge

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: GitHub <noreply@github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants