Resort store response set on internal label dedup #6317

Merged
merged 14 commits into from Aug 10, 2023

Conversation

fpetkovski
Contributor

@fpetkovski fpetkovski commented Apr 25, 2023

When deduplicating on labels which are stored internally in TSDB, the store response set needs to be resorted after replica labels are removed.

In order to detect when deduplication by internal labels happens, this PR adds a cuckoo filter with all label names to all store implementations. When a replica label is present in this filter, the store will resort the Series response set before returning it to the querier.

Fixes #6257.
Closes #6296.

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Verification

@stale

stale bot commented Jun 18, 2023

Hello 👋 Looks like there was no activity on this amazing PR for the last 30 days.
Do you mind updating us on the status? Is there anything we can help with? If you plan to still work on it, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next week, this issue will be closed (we can always reopen a PR if you get back to this!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jun 18, 2023
@fpetkovski fpetkovski force-pushed the resort-dataset-on-internal-dedup branch from e0a0529 to 4f3c8ad Compare July 13, 2023 07:46
@stale stale bot removed the stale label Jul 13, 2023
Contributor

@douglascamata douglascamata left a comment

There's a lot of reasoning kind of hidden behind the "bloom filter" name.

We are adding it in many different places, including the gRPC definition of the Store API, and its purpose is still not clear. On top of this, it also couples the Store API definition to one specific implementation (bloom filter) of "a way to do this thing".

I think this happens because it is named after what it is and not what it does. In Go, we kind of don't need to name things after what they are -- their type will tell us this.

But what is this bloom filter used for? It seems to be doing a "quick check" for whether a series request includes internal replica labels. So potentially we can name this better? Maybe "internal label checker" would be a good proposal. One day we might switch from a bloom filter to something else, or add another mechanism (e.g. what if I wanted to store this information within a Redis or Memcache instance?) and things would be more extensible.

Now on a more technical question: why do we need to transfer the bloom filter over gRPC? Storing it on Redis with the client-side cache features we get from the rueidis lib seems like a nice approach too.
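The naming suggestion above could be sketched roughly as follows. All names here (`InternalLabelChecker`, `MayContain`, `exactChecker`, `needsResort`) are hypothetical and do not come from the PR; the point is only that callers depend on the behaviour, so the backing structure can change later.

```go
package main

import "fmt"

// InternalLabelChecker is a hypothetical interface named after what it does.
type InternalLabelChecker interface {
	// MayContain reports whether the label name might be stored in the TSDB.
	// False positives are acceptable; false negatives are not.
	MayContain(name string) bool
}

// exactChecker is one possible backing implementation: a plain set.
// A bloom filter or a remote cache could satisfy the same interface.
type exactChecker map[string]struct{}

func (c exactChecker) MayContain(name string) bool {
	_, ok := c[name]
	return ok
}

// needsResort reports whether any requested replica label is an internal label.
func needsResort(c InternalLabelChecker, replicaLabels []string) bool {
	for _, l := range replicaLabels {
		if c.MayContain(l) {
			return true
		}
	}
	return false
}

func main() {
	checker := exactChecker{"pod": {}, "replica": {}}
	fmt.Println(needsResort(checker, []string{"replica"}))
	fmt.Println(needsResort(checker, []string{"prometheus_replica"}))
}
```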

var labelsToRemove map[string]struct{}
if !st.SupportsWithoutReplicaLabels() && len(req.WithoutReplicaLabels) > 0 {
labelsToRemove := make(map[string]struct{})
dedupByInternalLabel := hasInternalReplicaLabels(st, req)
Contributor

@yeya24 yeya24 Jul 14, 2023

Can we put this behind a feature flag? For Cortex, this code path seems like unnecessary overhead.

Contributor Author

Does Cortex use the proxy_heap? But yes, we should add a feature flag in the bucket store.

Contributor

Yeah proxy heap will be used by store gateway for lazy series? So it is always used.

Contributor Author

Ah yes, that's correct.

Member

Added a feature flag in fpetkovski#5

@douglascamata
Contributor

Was talking with @saswatamcode earlier today and I was wondering whether we could add some metrics to the global sorting and bloom filter to try to quantify how many times a global sort is being executed or skipped. Potentially could also measure how long the bloom filter update is taking.

What do you think?

@fpetkovski
Contributor Author

Sounds like a good idea 👍

Contributor

@moadz moadz left a comment

Some small nits for cleanliness but otherwise thanks for doing this :) I'm excited about what we can do with this bloom filter now that it's there.

// This test is expected to fail until the bug outlined in https://github.com/thanos-io/thanos/issues/6257
// is fixed. This means that it will return double the expected series until then.
// This is a regression test for the bug outlined in https://github.com/thanos-io/thanos/issues/6257.
// Until the bug was fixed, this testcase would return double the expected series.
expectedDedupBug: true,
Contributor

Should we be cleaning up this bool?

Contributor

Great find. We should get rid of this bool.

// This should've returned only 2 series, but is returning 4 until the problem reported in
// https://github.com/thanos-io/thanos/issues/6257 is fixed
// This is a regression test for the bug outlined in https://github.com/thanos-io/thanos/issues/6257.
// Until the bug was fixed, this testcase would return 4 series instead of 2.
Contributor

// Until the bug was fixed, this testcase would return 4 series instead of 2.

nit: Unnecessary clarification.

return &infopb.StoreInfo{
MinTime: mint,
MaxTime: maxt,
SupportsSharding: true,
SupportsWithoutReplicaLabels: true,
TsdbInfos: proxy.TSDBInfos(),
LabelNamesBloom: infopb.NewBloomFilter(labelNamesBloom),
Contributor

nit: LabelNamesBloom does not allude to what it's actually used for. The type can be inferred, so the name should instead allude to what it's used for. In this case 'indexedLabels', for example.

Contributor Author

So in this case, the store info is a data transfer struct, so I think it's fine to keep the Bloom suffix. I think it's important to know what those bytes actually are. The same way we have MinTime and MaxTime

Comment on lines 830 to 861
// Start bloom name filter updater.
{
	ctx, cancel := context.WithCancel(context.Background())
	level.Debug(logger).Log("msg", "setting up periodic label names bloom filter update")
	g.Add(func() error {
		return runutil.Repeat(10*time.Second, ctx.Done(), func() error {
			level.Debug(logger).Log("msg", "Starting label names bloom filter update")

			if err := proxy.UpdateLabelNamesBloom(ctx); err != nil {
				return err
			}

			level.Debug(logger).Log("msg", "Finished label names bloom filter update")
			return nil
		})
	}, func(err error) {
		cancel()
	})
}
Contributor

nit: Might be cleaner to have goroutine errors/invocation handled in a different object (e.g. StoreLabelIndexer) that takes a store Client and invokes UpdateLabelNamesBloom at some refresh interval. We need to do this for every store, and doing it inline is a bit of a refactoring nightmare.

"github.com/bits-and-blooms/bloom"
)

const FilterErrorRate = 0.01
Contributor

nit: might be good to add some clarification on what this error rate represents (the probability that the bloom filter reports a value as present when it was never added; a bloom filter never returns false for a value that it does contain)
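For readers unfamiliar with the error rate, the sketch below shows the property in question with a tiny hand-rolled filter. `tinyBloom` is hypothetical and is not the `bits-and-blooms/bloom` API; the size of 1000 bits with 7 probes is an assumption, roughly what a 1% target needs for about 100 items.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// tinyBloom is a hypothetical, minimal bloom filter for illustration only.
type tinyBloom struct {
	bits []bool
	k    uint64 // number of probes per item
}

func newTinyBloom(m int, k uint64) *tinyBloom {
	return &tinyBloom{bits: make([]bool, m), k: k}
}

// hashes returns two FNV hashes that are combined via double hashing.
func hashes(s string) (uint64, uint64) {
	a := fnv.New64a()
	a.Write([]byte(s))
	b := fnv.New64()
	b.Write([]byte(s))
	return a.Sum64(), b.Sum64()
}

// Add sets the k bits derived from s.
func (f *tinyBloom) Add(s string) {
	h1, h2 := hashes(s)
	for i := uint64(0); i < f.k; i++ {
		f.bits[(h1+i*h2)%uint64(len(f.bits))] = true
	}
}

// Test returns false only if s was definitely never added.
func (f *tinyBloom) Test(s string) bool {
	h1, h2 := hashes(s)
	for i := uint64(0); i < f.k; i++ {
		if !f.bits[(h1+i*h2)%uint64(len(f.bits))] {
			return false
		}
	}
	return true
}

func main() {
	f := newTinyBloom(1000, 7)
	for i := 0; i < 100; i++ {
		f.Add(fmt.Sprintf("label_%d", i))
	}

	missed := 0
	for i := 0; i < 100; i++ {
		if !f.Test(fmt.Sprintf("label_%d", i)) {
			missed++
		}
	}
	falsePositives := 0
	for i := 0; i < 1000; i++ {
		if f.Test(fmt.Sprintf("other_%d", i)) {
			falsePositives++
		}
	}
	fmt.Println("false negatives:", missed) // always 0 by construction
	fmt.Println("false positives out of 1000:", falsePositives)
}
```

In the context of this PR a false positive would only cause an unnecessary re-sort, not a wrong result.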

labelSetFunc func() []labelpb.ZLabelSet
timeRangeFunc func() (int64, int64)
tsdbOpts *tsdb.Options
store *store.TSDBStore
Contributor

Nice cleanup <3

Comment on lines 361 to 364
bmtx sync.Mutex
labelNamesBloom bloom.Filter
Contributor

superNit: This doesn't seem like the right place for the mutex to be managed; perhaps it should be moved into LabelNamesBloom for cleaner concurrency safety.
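A minimal sketch of that idea, assuming a hypothetical `labelNames` type that owns both the data and its lock (a plain map stands in for the filter). Callers never touch the mutex directly.

```go
package main

import (
	"fmt"
	"sync"
)

// labelNames is a hypothetical wrapper that owns both the set and its lock.
type labelNames struct {
	mtx   sync.RWMutex
	names map[string]struct{}
}

// Update builds the new set outside the lock and swaps it in atomically.
func (l *labelNames) Update(names []string) {
	next := make(map[string]struct{}, len(names))
	for _, n := range names {
		next[n] = struct{}{}
	}
	l.mtx.Lock()
	l.names = next
	l.mtx.Unlock()
}

// Has is safe to call concurrently with Update.
func (l *labelNames) Has(name string) bool {
	l.mtx.RLock()
	defer l.mtx.RUnlock()
	_, ok := l.names[name]
	return ok
}

func main() {
	var names labelNames
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			names.Update([]string{"job", "replica"})
			_ = names.Has("replica")
		}()
	}
	wg.Wait()
	fmt.Println(names.Has("replica"), names.Has("pod")) // prints "true false"
}
```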

Comment on lines 1668 to 1669
mtx.Lock()
for _, n := range result {
Contributor

superNit: Another argument for moving bmtx is that we're juggling more than one mutex in this func; it would reduce the mental load when reading if we didn't have to reason about concurrency in multiple dimensions :) (supernit for a reason)

g, _ := errgroup.WithContext(ctx)

var mtx sync.Mutex
names := make(map[string]struct{})
Contributor

Is there a reason why we're using a map here if the struct values are never populated and are always empty?
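For context, `map[string]struct{}` is the usual Go idiom for a set: the keys deduplicate the names and the empty struct values take no space. A small sketch, with a hypothetical `collectNames` helper and `sync.WaitGroup` standing in for the errgroup used in the PR:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// collectNames merges label names from several sources concurrently and
// returns each distinct name once, in sorted order.
func collectNames(sources [][]string) []string {
	var (
		wg    sync.WaitGroup
		mtx   sync.Mutex
		names = make(map[string]struct{}) // set: only the keys matter
	)
	for _, src := range sources {
		wg.Add(1)
		go func(src []string) {
			defer wg.Done()
			mtx.Lock()
			defer mtx.Unlock()
			for _, n := range src {
				names[n] = struct{}{}
			}
		}(src)
	}
	wg.Wait()

	out := make([]string, 0, len(names))
	for n := range names {
		out = append(out, n)
	}
	sort.Strings(out)
	return out
}

func main() {
	fmt.Println(collectNames([][]string{{"job", "pod"}, {"pod", "replica"}})) // prints [job pod replica]
}
```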

@fpetkovski fpetkovski force-pushed the resort-dataset-on-internal-dedup branch from 032e64b to a29cc58 Compare July 21, 2023 05:52
@fpetkovski fpetkovski force-pushed the resort-dataset-on-internal-dedup branch from 61a00d7 to 341e874 Compare July 27, 2023 13:21
When deduplicating on labels which are stored internally in TSDB,
the store response set needs to be resorted after replica labels are removed.

In order to detect when deduplication by internal labels happens, this PR adds a
bloom filter with all label names to the Info response. When a replica label is present
in this bloom filter for an individual store, the proxy heap would resort a response set from
that store before merging in the result with the rest of the set.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
@fpetkovski fpetkovski force-pushed the resort-dataset-on-internal-dedup branch from 4fb3558 to 0ea795e Compare July 29, 2023 07:12
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
@fpetkovski fpetkovski marked this pull request as ready for review July 31, 2023 12:57
@fpetkovski
Contributor Author

@GiedriusS @saswatamcode @douglascamata @moadz I have modified this PR to use a cuckoo filter and re-sort the series response in the store itself. Please take another look; the implementation is now simpler since we don't have to send additional data to the querier.
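The store-side approach described here might look roughly like the sketch below. It is a simplification under stated assumptions, not the code in this PR: series are plain strings, a map stands in for the cuckoo filter, and `sender`, `flushableSender`, `passthrough`, `resorting`, `newFlushable` and `recorder` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// sender is a hypothetical stand-in for the gRPC Series stream.
type sender interface{ Send(series string) error }

// flushableSender adds a Flush step that runs once all series were sent.
type flushableSender interface {
	sender
	Flush() error
}

// passthrough forwards series immediately; Flush is a no-op.
type passthrough struct{ sender }

func (passthrough) Flush() error { return nil }

// resorting buffers every series and emits them in sorted order on Flush.
type resorting struct {
	upstream sender
	buf      []string
}

func (r *resorting) Send(s string) error {
	r.buf = append(r.buf, s)
	return nil
}

func (r *resorting) Flush() error {
	sort.Strings(r.buf)
	for _, s := range r.buf {
		if err := r.upstream.Send(s); err != nil {
			return err
		}
	}
	return nil
}

// newFlushable buffers only when a replica label is a known label name.
func newFlushable(up sender, labelNames map[string]struct{}, replicaLabels []string) flushableSender {
	for _, l := range replicaLabels {
		if _, ok := labelNames[l]; ok {
			return &resorting{upstream: up}
		}
	}
	return passthrough{up}
}

// recorder collects what would reach the client.
type recorder struct{ got []string }

func (r *recorder) Send(s string) error {
	r.got = append(r.got, s)
	return nil
}

func main() {
	rec := &recorder{}
	srv := newFlushable(rec, map[string]struct{}{"replica": {}}, []string{"replica"})
	_ = srv.Send("b")
	_ = srv.Send("a")
	_ = srv.Flush()
	fmt.Println(rec.got) // prints [a b]
}
```

The trade-off is memory: the pass-through path streams series as they arrive, while the re-sorting path has to hold the whole response until `Flush`.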

saswatamcode
saswatamcode previously approved these changes Jul 31, 2023
Member

@saswatamcode saswatamcode left a comment

Thanks! This looks good to me! Let's get this merged and release v0.32! 🙂

pkg/stringset/set.go
}

func NewFromStrings(items ...string) Set {
f := cuckoo.NewFilter(uint(len(items)))
Member

Maybe we could estimate the size of the underlying slices as per the comments here https://github.com/seiflotfy/cuckoofilter/blob/master/cuckoofilter.go#L21-L26 and add some way of adding an upper limit for this, something like maybe 5MB by default?

Contributor Author

Hm, what action do we take if the limit is exceeded?

Member

@GiedriusS GiedriusS Aug 1, 2023

I think a good choice would be to print a warning message and then always force sorting? 🤔
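One way to picture this fallback, as a sketch only: the `boundedSet` type is hypothetical, a plain map stands in for the cuckoo filter, and the limit is expressed as an item count rather than bytes to keep the example short.

```go
package main

import "fmt"

// boundedSet is a hypothetical set that gives up once a limit is exceeded.
type boundedSet struct {
	names     map[string]struct{}
	overLimit bool
}

// newBoundedSet builds the set, or warns and switches to "always sort" mode
// when more names are offered than the limit allows.
func newBoundedSet(limit int, names ...string) boundedSet {
	if len(names) > limit {
		fmt.Printf("warning: %d label names exceed limit %d, forcing re-sort\n", len(names), limit)
		return boundedSet{overLimit: true}
	}
	set := make(map[string]struct{}, len(names))
	for _, n := range names {
		set[n] = struct{}{}
	}
	return boundedSet{names: set}
}

// Has reports true for every name once the limit was exceeded, which errs on
// the side of sorting rather than risking unsorted output.
func (s boundedSet) Has(name string) bool {
	if s.overLimit {
		return true
	}
	_, ok := s.names[name]
	return ok
}

func main() {
	small := newBoundedSet(10, "job", "replica")
	fmt.Println(small.Has("replica"), small.Has("pod"))

	big := newBoundedSet(1, "job", "replica")
	fmt.Println(big.Has("pod"))
}
```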

Contributor Author

I wonder if that is a good tradeoff though, because 5MB is a fairly low price to pay compared to the increase in memory required for buffering and resorting series. If 1000000 is ~1MB, I think it will be very hard to have so many label names that the memory of the filter becomes a problem. Maybe we should check behavior in production before we add limits?

Member

👍 makes sense

Member

@bwplotka bwplotka left a comment

Very elegant implementation, thanks for this!

I think it's worth a try. I'm a little bit concerned about the overhead for the whole system (generally it should be insignificant, but we have to try to know). I also wonder how the system handles eventual consistency (I assume we give inaccurate query results?). Perhaps there is a benefit to being able to turn this filtering system on/off on demand?

cmd/thanos/store.go
// Start bloom name filter updater.
{
ctx, cancel := context.WithCancel(context.Background())
level.Debug(logger).Log("msg", "setting up periodic update for label names")
Member

Suggested change
level.Debug(logger).Log("msg", "setting up periodic update for label names")
level.Info(logger).Log("msg", "setting up periodic update for label names")

I guess info would make sense here

@GiedriusS
Member

GiedriusS commented Jul 31, 2023

As for consistency, it's the same as with a sharded Thanos Store - blocks are not loaded at the same time on all nodes.

#6317 (comment) perhaps the limit flag could serve as a way to disable this i.e. setting to 0 would disable this functionality on a node and show that all label names are available.

Not in all setups reading data from remote object storage costs. And also I would be against removing the optimizations since they cut down query duration by 30%-40%.

@douglascamata
Contributor

I think in case of lack of consistency, or even a false positive, what happens is a global resort. Am I right?

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
@bwplotka
Member

bwplotka commented Aug 1, 2023

Not in all setups reading data from remote object storage costs. And also I would be against removing the optimizations since they cut down query duration by 30%-40%.

@GiedriusS happy to hear I didn't break Thanos for nothing =DDDDD

Coincidentally, with this implementation we move even closer to the Monarch design (public info: https://www.vldb.org/pvldb/vol13/p3181-adams.pdf, and yes, I'm biased 🙃). Essentially Monarch has Field Hints (so kind of our labels) that it updates and consults on every query 🙈

So... let's double check consistency issue, otherwise LGTM (:

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
GiedriusS
GiedriusS previously approved these changes Aug 9, 2023
Member

@GiedriusS GiedriusS left a comment

👍 let's see how much RAM the filters will need

@@ -1240,7 +1253,9 @@ func debugFoundBlockSetOverview(logger log.Logger, mint, maxt, maxResolutionMill
}

// Series implements the storepb.StoreServer interface.
func (s *BucketStore) Series(req *storepb.SeriesRequest, srv storepb.Store_SeriesServer) (err error) {
func (s *BucketStore) Series(req *storepb.SeriesRequest, seriesSrv storepb.Store_SeriesServer) (err error) {
srv := newFlushableServer(seriesSrv, s.LabelNamesSet(), req.WithoutReplicaLabels)
Member

This will be yet another way of performing a race inside of the storegateway but let's work on fixing this now before the release 👍 😄

Member

I actually might be wrong because the Close() functions will still run at the same time 🤔 let's merge this and test it

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
@fpetkovski
Contributor Author

Ok, time to try it out :)

@fpetkovski fpetkovski merged commit 84567ec into thanos-io:main Aug 10, 2023
16 checks passed
@douglascamata
Contributor

Awesome! So happy to see this merged. Thanks a lot, folks! 🙇

@douglascamata
Contributor

Btw, small nit: next time we need to remember to squash and merge. 😅

Successfully merging this pull request may close these issues.

Deduplication returning deduped and non-deduped results in 0.31.0+
7 participants