split query by time range #489

Open
BertHartm opened this issue Sep 18, 2020 · 19 comments
Labels
2.0 release (Feature/Fix will be available in 2.0 Release) · delta-proxy (affects the delta proxy+cache) · enhancement (New feature or request)

Comments

@BertHartm

It would be helpful if Trickster could take a query for a long time range of data and split it into several queries for shorter time ranges.

The use case is for instances where querying, say, a month's worth of data may time out, but querying for one day a month ago would succeed. By having Trickster combine datapoints from several one-day queries, it can increase the likelihood of overall success and potentially lower the load on the backend.

@jranson
Member

jranson commented Sep 18, 2020

@BertHartm that sounds really interesting and worth looking at. I was wondering what your experience is with scale. Do you think the timeouts you experience are bottlenecked by the # of timestamps, or by the age of the time range? Which provider are you using (Prometheus, InfluxDB), and can it be improved/configured to do this kind of partitioning rather than moving it to a downstream component? Having those kinds of answers would help us make decisions such as whether to slice up the requested ranges into buckets based on a configurable time window (regardless of granularity) versus size in # of timestamps, and whether we should configure any providers to enable it by default.

One thing to note is that Trickster is meant to cache recent data. So, in your example, even though the query would be broken apart upstream with this proposed feature, it would likely not be cached in Trickster due to the age of the range, unless you have set a really long TTL. You'd avoid timeouts, but likely not see a lot in the way of future acceleration. Would that still meet your use case?

@BertHartm
Author

Thanks for the quick reply, @jranson. We use Prometheus and M3DB, and the main problem we're trying to solve is folks loading dashboards for long time windows, so we'll probably be bumping up the default limit on how long a window Trickster caches to account for that.

@jranson added the 2.0 release, delta-proxy, and enhancement labels Mar 9, 2021
@jranson
Member

jranson commented May 19, 2021

Hi @BertHartm and the rest of the Community. We are going to start implementing this feature in Trickster 2.0 and will use this issue to discuss and share progress.

The main area where we'd like feedback is on the configurability of this feature - specifically, how does a team running Trickster tune the shard size to their liking via configuration. Since it is impossible for Trickster to know exactly how many series might be returned for a given query, I believe we need to focus on the number of timestamps that would be returned - which can be derived mathematically at request time from its step, start and end parameters. To that end, I believe a single configuration (e.g., shard_size) that defines the maximum number of timestamps per shard should suffice. We're open to other ideas and opinions from anyone who would like to provide input. My goal is to provide this feature in the next 2.0 Beta, so we'll move to implement it as described here, absent any feedback.
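For illustration only (not the actual Trickster code), the number of timestamps a query would return, and the resulting shard count under a hypothetical shard_size setting, could be derived roughly like this:

```go
package main

import "fmt"

// timestampCount derives the number of unique timestamps a range query would
// return from its start, end, and step (all in seconds here). Illustrative
// sketch only; the real implementation may differ.
func timestampCount(start, end, step int64) int64 {
	if step <= 0 || end < start {
		return 0
	}
	return (end-start)/step + 1
}

// shardCount shows how a hypothetical shard_size (max timestamps per shard)
// would translate into a number of upstream sub-queries.
func shardCount(points, shardSize int64) int64 {
	if shardSize <= 0 || points <= 0 {
		return 1 // sharding disabled or nothing to split
	}
	return (points + shardSize - 1) / shardSize // ceiling division
}

func main() {
	// e.g. a 30-day query at a 60s step
	points := timestampCount(0, 30*24*3600, 60)
	fmt.Println(points, "timestamps =>", shardCount(points, 11000), "shards at shard_size=11000")
}
```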

@BertHartm
Author

Hey @jranson, we'd love to see this in 2.0.

I agree on controlling this by time instead of by series or datapoints. I'd like to note that my primary concern is aligning this to the internal model of the database to optimize the fetches, so if it's possible to align based on actual UTC time, that might be the best way of configuring it for us.
E.g., the Prometheus TSDB writes blocks every 2h, so if I could configure the proxy to split a query so that it only hits one 2h block at a time, that would be the ideal access pattern for Prometheus. This might cause a small query to get split somewhat unnecessarily if it crosses a block boundary, though I think that becomes less of an issue with larger block sizes.

I'm less familiar with things outside Prometheus and M3DB, but I believe the time range / block configuration aligns for all the Prometheus-based systems, and I would guess it aligns with most TSDBs.

I think it also makes reasoning easier as an end user if I can think about the configuration as multiples of the database block size (e.g., I'm okay hitting four 2h blocks in a single request, so set the proxy to split at 8h), as opposed to thinking in numbers of timestamps and having to factor in the step size of queries when trying to optimize the system (the above example would be something like 8h at a 30s step being 960 timestamps in most cases, though it could correspond to more than 8h if the step size is larger, which is probable when this feature is used to split up multi-day queries).

Thoughts on that as an option?
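As a rough sketch of the block-aligned splitting described above (assuming UTC-aligned 2h storage blocks; none of this is Trickster's actual code), the range would be cut at each block boundary:

```go
package main

import (
	"fmt"
	"time"
)

// splitOnBlocks cuts [start, end) into sub-ranges that each stay within one
// UTC-aligned storage block, e.g. the 2h blocks written by the Prometheus TSDB.
func splitOnBlocks(start, end time.Time, block time.Duration) [][2]time.Time {
	var out [][2]time.Time
	for start.Before(end) {
		// next UTC-aligned block boundary after start
		boundary := start.Truncate(block).Add(block)
		if boundary.After(end) {
			boundary = end
		}
		out = append(out, [2]time.Time{start, boundary})
		start = boundary
	}
	return out
}

func main() {
	start := time.Date(2021, 5, 20, 1, 0, 0, 0, time.UTC)
	end := time.Date(2021, 5, 20, 9, 0, 0, 0, time.UTC)
	// yields 01:00-02:00, 02:00-04:00, 04:00-06:00, 06:00-08:00, 08:00-09:00
	for _, r := range splitOnBlocks(start, end, 2*time.Hour) {
		fmt.Println(r[0].Format("15:04"), "->", r[1].Format("15:04"))
	}
}
```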

@jranson
Member

jranson commented May 20, 2021

Thanks for the feedback, it's really helpful. So for your use case, I am gathering that 1) the number of unique timestamps per shard is not important, and 2) when optimally configured, no two shards should request data from the same storage block.

My thinking on the timestamp-based use case was that it can help with problems like Prometheus returning errors for queries that are too large, which I think there is also demand for. The good news is we should be able to implement all of these use cases, since they only determine how a list of time ranges might be further subdivided.

Instead of shard_size, I'd like to propose we expose shard_size_ms and shard_size_points configurations, which can work in tandem to clamp the maximum size of a shard. Then, a third option, shard_step_ms (we use "step" to specifically mean a cadence aligning to the epoch), could be used to control the alignment of the shards. A value of 0 for any of these would exclude it from the equation.

So in that scenario, you could set shard_step_ms = 7200000, shard_size_ms to anything >= 7200000, and shard_size_points to any value >= 0 (probably 0 for you).

Let me know if that makes sense, would meet your needs, and is an agreeable approach.
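A minimal sketch of how the two size knobs could work in tandem to clamp a shard's width (names mirror the proposed config keys and are not a published API; shard_step_ms alignment is handled separately):

```go
package main

import (
	"fmt"
	"math"
)

// maxShardSpanMS clamps the maximum width of a shard in milliseconds using
// the two proposed size knobs. A value of 0 excludes that knob.
func maxShardSpanMS(shardSizeMS, shardSizePoints, queryStepMS int64) int64 {
	span := int64(math.MaxInt64) // effectively unbounded
	if shardSizeMS > 0 {
		span = shardSizeMS
	}
	if shardSizePoints > 0 && queryStepMS > 0 {
		// approximate width of shardSizePoints timestamps at the query's step
		if byPoints := shardSizePoints * queryStepMS; byPoints < span {
			span = byPoints
		}
	}
	return span
}

func main() {
	// e.g. 2h-aligned shards capped at 4h, no point cap, 30s query step
	fmt.Println(maxShardSpanMS(14400000, 0, 30000)) // 14400000
}
```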

@BertHartm
Author

Yeah, that actually would work well.

Would it make sense to include max somewhere in the names for shard_size_ms and shard_size_points to make it clear that they're limits and not guarantees? Playing around a bit with different configurations in my head, that made them read more easily.

@jranson
Member

jranson commented May 20, 2021

Great, yes, I will definitely include max in that. I have a fully-functioning version of the originally-proposed points implementation ready to PR. That means we can already take a list of time ranges, shard the requests concurrently, and reconstitute the results, based on configuration. So adding these other two knobs to the time-range sharding function should be straightforward, and I anticipate having this available in the codebase in the next 24 hours.

@jranson
Member

jranson commented May 20, 2021

The only real (but minor) concern I'm seeing with this approach, in practice, is when max_shard_size_ms is not exactly divisible by max_shard_step_ms. I propose that, at configuration load time, max_shard_size_ms be rounded down to the nearest multiple of max_shard_step_ms, and that a WARN log entry be issued noting the adjustment, if applicable. Let me know if that sounds alright.
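A sketch of that load-time adjustment, assuming the proposed config names (not a confirmed API):

```go
package main

import (
	"fmt"
	"log"
)

// reconcileShardSize rounds max_shard_size_ms down to the nearest multiple of
// max_shard_step_ms and logs a WARN when an adjustment was made.
func reconcileShardSize(maxShardSizeMS, maxShardStepMS int64) int64 {
	if maxShardStepMS <= 0 || maxShardSizeMS <= 0 {
		return maxShardSizeMS // nothing to reconcile
	}
	rounded := (maxShardSizeMS / maxShardStepMS) * maxShardStepMS
	if rounded == 0 {
		rounded = maxShardStepMS // sketch assumption: never round below one full step
	}
	if rounded != maxShardSizeMS {
		log.Printf("WARN: max_shard_size_ms %d is not a multiple of max_shard_step_ms %d; adjusted to %d",
			maxShardSizeMS, maxShardStepMS, rounded)
	}
	return rounded
}

func main() {
	fmt.Println(reconcileShardSize(10000000, 7200000)) // prints 7200000 after a WARN
}
```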

@BertHartm
Author

Seems reasonable. I was wondering if max_shard_size_ms could be defined as a multiple of max_shard_step_ms, which would avoid the possibility, but I guess there are valid use cases where max_shard_size_ms is set without the step, so the WARN approach is probably better.

@wesleyk

wesleyk commented May 21, 2021

+1 to this approach, and max_shard_step_ms satisfies our use cases as well.

I wonder if we could detail the use cases for max_shard_size_ms / max_shard_size_points, or else if it's worth only offering max_shard_step_ms to begin with.

@jranson
Member

jranson commented May 21, 2021

@wesleyk the use case for points is the max 11,000 points-per-query limitation in the base Prometheus product. ML and other applications that need to get large amounts of Prometheus data at a high resolution (e.g., 6 months @ 15s) must manually break apart those queries and reconstitute the data to avoid that limitation. In that case, an operator could set max_shard_size_points = 11000 and Trickster would handle it.
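To illustrate the scale involved (hypothetical helper, not Trickster code), point-capped sharding of a ~6-month query at a 15s step against an 11,000-point cap fans out to roughly 95 sub-queries:

```go
package main

import "fmt"

// shardByPoints splits [start, end] (Unix seconds) into sub-ranges containing
// at most maxPoints timestamps at the given step. Illustrative only.
func shardByPoints(start, end, step, maxPoints int64) [][2]int64 {
	if maxPoints <= 0 || step <= 0 {
		return [][2]int64{{start, end}} // cap disabled
	}
	var out [][2]int64
	span := step * (maxPoints - 1) // width covered by maxPoints timestamps
	for s := start; s <= end; s += span + step {
		e := s + span
		if e > end {
			e = end
		}
		out = append(out, [2]int64{s, e})
	}
	return out
}

func main() {
	// ~6 months (180 days) at a 15s step against an 11,000-point cap
	shards := shardByPoints(0, 180*24*3600, 15, 11000)
	fmt.Println(len(shards), "shards") // 95
}
```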

@wesleyk

wesleyk commented May 21, 2021

@jranson gotcha, I may be confusing myself, but how would max_shard_size_points help in this case, given that the query could load a variable number of timeseries?

Also one thought re: naming. I wonder if "Range" makes more sense to use in place of "Step", given this has more to do with the time range that's being queried and not the step size of the results?

@wesleyk

wesleyk commented May 21, 2021

One more comment: if we do split queries up, I imagine #291 may become a more frequent issue?

We noticed in the missing-range handling that errors seem to be ignored. Not sure if they're handled elsewhere, but I imagine that if we utilize similar logic with split queries, we'd be more likely to see this soft-failure case occur.

@jranson
Member

jranson commented May 26, 2021

I have submitted PR #573 to implement the sharding capability. A few notes about how things turned out and responses to some of the thoughts above:

I named the proposed max_shard_* configs as shard_max_*, so that all sharding configs start with shard_*. Otherwise, I have kept the naming as proposed.

@wesleyk as far as I am aware from my research and testing, the Prometheus 11K limitation pertains to unique timestamps (not unique timestamps * # of individual series), so this configuration should solve for that issue. To simplify the functionality, we made shard_max_size_points mutually exclusive with shard_max_size_ms and shard_step_ms, so each backend config can only use one or the other. To simplify configurations, you can use shard_step_ms by itself, or with shard_max_size_ms (so long as shard_max_size_ms is a multiple of shard_step_ms). If you use shard_step_ms by itself, shard_max_size_ms is assumed to be shard_step_ms.

Regarding the value names, we use "Range" to generally mean a timespan with absolute start and end times. In this case, shard_max_size_ms and shard_step_ms indicate a generic duration rather than a range. So I think the naming makes sense for shard_max_size_ms, as it literally controls the maximum size in milliseconds of a shard. As I described in a comment above, we use "Step" any time we are talking about a duration that is used in ($timestamp / $step) * $step integer operations so as to align a timestamp to the epoch on the given step cadence. If you look at the implementation in #573, that is actually exactly what we are doing with shard_step_ms to properly align the shards, so I believe that naming is appropriate and consistent with other config names.
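For reference, that epoch alignment is just the integer operation described above; a tiny sketch (not the actual #573 code):

```go
package main

import "fmt"

// alignToStep rounds a millisecond timestamp down to the epoch on the given
// step cadence via the ($timestamp / $step) * $step integer operation.
func alignToStep(tsMS, stepMS int64) int64 {
	if stepMS <= 0 {
		return tsMS
	}
	return (tsMS / stepMS) * stepMS
}

func main() {
	// with shard_step_ms = 7200000 (2h), a shard starting at 01:30 UTC on
	// 2021-05-20 aligns back to the 00:00 block boundary
	fmt.Println(alignToStep(1621474200000, 7200000)) // 1621468800000
}
```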

I did improve on the Missing Ranges issue just a bit in this patch: every failed shard should be reliably logged and counted in the metrics. My intention is to fully address #291 more directly in a separate patch, which will insert Warnings detailing the data integrity issues in the result set that is delivered to the client, whenever a dependent upstream request fails. This should be included in the next beta release alongside the sharding enhancement.

The PR includes a new example config file for sharding that sets shard_step_ms = 7200000, which should be the required setting for M3 backends partitioned into 2h blocks.

@robskillington

> My intention is to fully address #291 more directly in a separate patch, which will insert Warnings detailing the data integrity issues in the result set that is delivered to the client, whenever a dependent upstream request fails. This should be included in the next beta release alongside the sharding enhancement.

@jranson can we make this configurable?

Pretty much all users I work closely with prefer a failed request that they can retry, rather than having to guess whether the missing range is due to a partial failure or some other side effect (a lot of users assume the data is missing in storage). Most UIs don't properly show warnings the way you would like either, so I don't think a warning is sufficient; or rather, I think it should at least be configurable to make it a hard failure to the end user in a partial-failure case.

@jranson
Member

jranson commented May 26, 2021

Thanks, @robskillington. Just to be sure I understand fully: in a hypothetical situation where Trickster already has 99% of the data needed to serve a request in its cache, if the delta proxy request to retrieve the remaining 1% fails, the originating client request should be returned a full failure? If that is the preferred route, it's great for us, since that is the easiest implementation to support.

Separately, we do have configurable backend retries (in the event of an origin request failure) on the list of features to provide as part of 2.0's new ALB. However, it does make sense to permit that capability for any backend provider and not just the ALB, so we'll make sure it is properly scoped.

Let us know if those thoughts sound good, and if the implementation in the PR will meet your team's needs. Thanks again for the great feedback!

@wesleyk

wesleyk commented May 26, 2021

> if the delta proxy request to retrieve the remaining 1% fails, the originating client request should be returned a full failure

This is the desired behavior, yeah.

The natural next step there, IMO, would be to still populate the cache in the background with the proxy requests that did succeed. So if a request results in four sharded queries, with three succeeding and one failing, then the client will see a failure. On retry, the three successful queries would already have cached results, and the one previously failed query would be retried.
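A rough sketch of that behavior (queryFn/cacheFn are hypothetical hooks, not Trickster APIs): fan out the shards, cache every success, but surface a hard failure if any shard failed, so a retry only has to re-fetch the missing shard.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// fetchShards runs all shard queries concurrently, caches every successful
// result, and returns an error if any shard failed.
func fetchShards(shards []string, queryFn func(string) (string, error),
	cacheFn func(shard, result string)) error {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	for _, s := range shards {
		wg.Add(1)
		go func(shard string) {
			defer wg.Done()
			result, err := queryFn(shard)
			if err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
				return
			}
			cacheFn(shard, result) // successful shards still warm the cache
		}(s)
	}
	wg.Wait()
	return firstErr // non-nil => client sees a hard failure and can retry
}

func main() {
	err := fetchShards([]string{"a", "b", "c", "d"},
		func(shard string) (string, error) {
			if shard == "d" {
				return "", errors.New("upstream timeout")
			}
			return "data-" + shard, nil
		},
		func(shard, result string) { fmt.Println("cached shard", shard) },
	)
	fmt.Println("request error:", err)
}
```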

@jranson
Member

jranson commented May 26, 2021

Thanks @wesleyk. We will make sure that is part of the next beta release (and will be a separate PR from the sharding feature PR).

As for the sharding issue, we are going to go ahead and merge #573 - not because we are finished with this issue, but because we renamed the org today from tricksterproxy to trickstercache and that is breaking build/imports/gomod, etc. So we want to get that cleaned up as soon as possible, and changing 400+ files is not something we want to then backmerge into this PR.

If there are any issues or concerns with the merged sharding implementation, report them here and we will modify it further to meet the community's needs. I will keep this issue open until we have consensus that the needed use cases are met and function as expected.

@wesleyk

wesleyk commented May 26, 2021

Will do! Will ask the team to take a pass on the PR as well 🥳
