Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deduplication in StoreAPI #2700

Closed
dgl opened this issue Jun 1, 2020 · 6 comments
Closed

Deduplication in StoreAPI #2700

dgl opened this issue Jun 1, 2020 · 6 comments

Comments

@dgl
Copy link

dgl commented Jun 1, 2020

Currently deduplication is done in the querier, at the Select stage. This means Store API gRPC queries, even going via query do not get deduplication. For thanos-remote-read this means it does not get deduplication via querier.

Add a store API request option to ask for deduplication.

This could be useful for other purposes, e.g. it would be possible to push deduplication down to a child query node, to avoid sending extra data over the network (likely would need to be optional, even if #2699 made it possible to do this in a streaming way this would mean decoding chunks twice).

@kakkoyun
Copy link
Member

@bwplotka Is this fixed by #2711?

@bwplotka
Copy link
Member

bwplotka commented Jun 15, 2020

No, we dedup the duplicated, identical chunks yes, but only if the series is duplicated without "replica label removing" phase. This ticket is about doing the replica rem phase and some kind of initial dedup on Store API.

I think it would be a solid request, except that it might be highly expensive. This is because for true deduplication you have to decode chunks and encode them again (because StoreAPI exposes chunks). On Querier, however, PromQL operates on samples so we decode + duplicate in one go, there is little overhead. This is something to be aware of here.

There is option C, where we could do partial deduplication, so we could remove labels and merge chunks without encoding/decoding in a way that: For 1:1 dups we kill dups, for time overlapping chunks that are not identical we just pass them through. Querier sorts chunks and deduplicates them anyway, so that could work. Still, for Prometheus replicas, it's not very helpful - chunks are rather never identical due to scrape interval randomness.

So.. at then end, I don't really see much value. I think we should rather port deduplication to thanos-remote-read for now, WDYT? @dgl

@dgl
Copy link
Author

dgl commented Jun 15, 2020

I agree doing deduplication in thanos-remote-read makes sense if there's not really a use case elsewhere. The attraction for me is the code for deduplication isn't currently a public API, but I'm sure I can work something out.

(I still think there are cases where being able to push some of the querier work to other places has value but I suspect it's not a priority right now as there's plenty of more important things to work on. e.g. It might be useful for a custom store API server to know that deduplication is happening or not to optimise its responses.)

@stale
Copy link

stale bot commented Jul 15, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jul 15, 2020
@kakkoyun kakkoyun removed the stale label Jul 15, 2020
@stale
Copy link

stale bot commented Aug 14, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Aug 14, 2020
@stale
Copy link

stale bot commented Aug 21, 2020

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Aug 21, 2020
@kakkoyun kakkoyun removed the stale label Aug 31, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants