Long-Range Query Weirdness #396

Closed
treed opened this issue Jun 29, 2018 · 11 comments

@treed

treed commented Jun 29, 2018

Thanos, Prometheus and Golang version used

thanos, version v0.1.0-rc.1 (branch: master, revision: 466371fca24319f0bf07b579685fe13c5f28d19d)
  build user:       ir@irp250-v23g04
  build date:       20180629-22:42:58
  go version:       go1.10.2

prometheus, version 2.2.1 (branch: HEAD, revision: bc6058c81272a8d938c05e75607371284236aadc)
  build user:       root@149e5b3f0829
  build date:       20180314-14:15:45
  go version:       go1.10

What happened

Queried for weeks of data, and found a bunch of it missing:

screen shot 2018-06-29 at 16 04 45

With a tighter view, Thanos can return the data from the lines on the right, just not with a wide view.

What you expected to happen

Seeing weeks of data (here's one of the underlying Prometheus instances):

screen shot 2018-06-29 at 16 04 10

How to reproduce it (as minimally and precisely as possible):

Unsure. I've done some experimentation to see exactly where the threshold between "full data" and "partial data" lies.

At a query width (unsure what the term for this is) of:

  • 979498s, Thanos shows full data for both sides
  • 979499s, replica A exhibits the problem, but B still shows full data
  • 979512s, Still showing B but not A
  • 979513s, I get a banner saying "no data found" (but the data from the 979512s attempt sticks around)
  • 979529s, the banner goes away and the graph is blank, but the legend says it has data for replica A (but nothing from B)

(I have screenshots for all of the above)

The same problem also happens if you remove the irate() part, and just query for raw data.

Full logs to relevant components

Logs

Querier:

level=info ts=2018-06-29T22:44:14.513851044Z caller=flags.go:53 msg="StoreAPI address that will be propagated through gossip" address=10.62.203.43:10909
level=info ts=2018-06-29T22:44:14.528827684Z caller=flags.go:68 msg="QueryAPI address that will be propagated through gossip" address=10.62.203.43:9090
level=info ts=2018-06-29T22:44:14.543604525Z caller=query.go:251 msg="starting query node"
level=info ts=2018-06-29T22:44:14.549198932Z caller=query.go:243 component=query msg="Listening for StoreAPI gRPC" address=0.0.0.0:10909
level=info ts=2018-06-29T22:44:14.549569349Z caller=query.go:225 msg="Listening for query and metrics" address=0.0.0.0:9090
level=info ts=2018-06-29T22:44:19.551188012Z caller=storeset.go:223 component=storeset msg="adding new store to query storeset" address=10.62.203.43:10905
level=info ts=2018-06-29T22:44:19.551287768Z caller=storeset.go:223 component=storeset msg="adding new store to query storeset" address=10.62.203.43:10901

Sidecar for Replica A:

level=info ts=2018-06-29T22:44:09.700629869Z caller=flags.go:53 msg="StoreAPI address that will be propagated through gossip" address=10.62.203.43:10901
level=info ts=2018-06-29T22:44:09.714410305Z caller=sidecar.go:233 msg="No GCS or S3 bucket was configured, uploads will be disabled"
level=info ts=2018-06-29T22:44:09.714455003Z caller=sidecar.go:270 msg="starting sidecar" peer=
level=info ts=2018-06-29T22:44:09.714541843Z caller=main.go:243 msg="Listening for metrics" address=0.0.0.0:10902
level=info ts=2018-06-29T22:44:09.714571609Z caller=sidecar.go:215 component=sidecar msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901

The other replica's sidecar has identical logs except for the timestamps and port numbers. (All components are running on the same VM)

Anything else we need to know

Object storage isn't in play here. This is just a pair of Prometheus instances with the default 15d retention.

I first saw the problem on an older build and updated to latest master today before gathering the data for this bug report.

@bwplotka
Member

bwplotka commented Jul 2, 2018

Thanks for reporting. Can you try without the auto downsampling option (choose raw data)? Can you see all data then?

This might be a duplicate of #376, but let us know.

It seems downsampling is not working 100%; we need to investigate it more closely.

@pcallewaert

I have the same problem with the same setup (Prometheus v2.2.1 and Thanos v0.1.0-rc1).
In the Thanos web UI I have tried the "raw data" downsampling option, but it did not have any effect.

bwplotka added the bug label Jul 2, 2018
@bwplotka
Member

bwplotka commented Jul 2, 2018

Oh, I misunderstood. There is no thanos store, so this is definitely not a downsampling issue. Interesting.

  • Are the query results reproducible all the time?
  • What if you ask for a different time range, e.g. just 1h? Does it show anything?
  • Have you changed the sidecar external labels lately? What are they now, and what were they before (if there was a change)?

I suspect some bug or configuration error in matching the time range and labels, but let's see. As discussed with @pcallewaert, he has a different issue (the downsampling one).

@treed
Author

treed commented Jul 5, 2018

The results I filed the bug about were consistent throughout an afternoon. I just came back from a (very) long weekend and found that the queries that were failing then (including the original timeframe, ending on June 29th) now load fully, but if I move the window to look at more recent data, the same problem occurs. It seems to be about crossing some boundary measured back from the current time.

If I ask for a shorter range, as mentioned in the bug report, things are fine until the range crosses some threshold, and then the problem hits.

No external labels have been changed since this was all set up a while back.

And, yeah, changing the downsampling setting has no effect. In fact, I'm not even sure it was available on the version I was using when I first noticed this bug.

@bwplotka
Member

Interesting characteristics of the downsampling bug found by @asbjxrn:

A peculiar case of missing data for long-term queries (the cluster is one query node and one store node only; the blobstore contains two 162h blocks compacted by Prometheus and uploaded, and thanos downsample has since been run on them):

I get a result (graph in ui) when querying the time series directly. The query is a 1 week range query for: node_time_seconds{instance="hostname"}

If I wrap it in rate(...[5m]) I get "no datapoints found" when using downsampling (Auto, 5m and 1h all fail; I believe there are only 5m downsamples in the store).

If I choose only raw data, the rate query works again.

Using a shorter range (2d) I can scroll across the whole week using the rate function and downsampled data and get a result back each time.

I know there is a known issue with downsampling and long time range queries, but I was surprised that wrapping the query in a function changed the outcome when the plain query worked.

It will be useful to track this down.

@trevorwhitney

Seeing this same behavior on 0.1.0-rc2. Graph for one week of data:
screen shot 2018-11-28 at 11 02 57 am

Graph for two weeks of data:
screen shot 2018-11-28 at 11 03 11 am

Changing the deduplication, resolution, and downsampling/raw data options has no effect.

Are there any updates on this issue? We have a large environment set up with 20 HA Prometheus pairs, each with a Thanos sidecar, and a similar set of metrics. We're happy to provide access if it can help debug the problem.

@robertjsullivan
Contributor

I spent some time debugging this today. It appears that the Prometheus XOR encoding scheme used in the sidecar cannot encode/decode more than 2^16 samples per chunk. I think what is happening is that the sidecar queries the remote read endpoint without a step size, gets more than 2^16 points back, and that overflows the uint16 sample counter used by the compression. Those points are then sent back to the querier via gRPC with a sample count that reflects only the remainder left after the counter overflowed.

Any thoughts on what the proper fix is @bwplotka ?

Here is a quick test that proves it. Apologies for copying the encodeChunk method out of store/prometheus.go.

package encoding_test

import (
    "testing"
    "time"

    "github.com/improbable-eng/thanos/pkg/store/prompb"
    "github.com/improbable-eng/thanos/pkg/store/storepb"
    "github.com/improbable-eng/thanos/pkg/testutil"
    "github.com/prometheus/tsdb/chunkenc"
)

// 65535 samples (the largest value a uint16 can hold) round-trip correctly.
func TestEncoding_EncodesAndDecodes_Correctly(t *testing.T) {
    largestNumberOfSamples := 65535
    samples := createSamples(largestNumberOfSamples)
    testutil.Equals(t, len(samples), largestNumberOfSamples)

    _, cb, err := encodeChunk(samples)
    testutil.Ok(t, err)

    chk, err := chunkenc.FromData(chunkenc.EncXOR, cb)
    testutil.Ok(t, err)

    testutil.Equals(t, chk.NumSamples(), largestNumberOfSamples)
}

// 65536 samples overflow the uint16 sample counter, so the decoded chunk
// reports the wrong number of samples and this test fails.
func TestEncoding_EncodesAndDecodes2ToThe16th_Fails(t *testing.T) {
    tooManySamples := 65536 // 2^16
    samples := createSamples(tooManySamples)
    testutil.Equals(t, len(samples), tooManySamples)

    _, cb, err := encodeChunk(samples)
    testutil.Ok(t, err)

    chk, err := chunkenc.FromData(chunkenc.EncXOR, cb)
    testutil.Ok(t, err)

    testutil.Equals(t, chk.NumSamples(), tooManySamples)
}

// createSamples builds `size` samples with strictly increasing timestamps.
func createSamples(size int) []prompb.Sample {
    samples := make([]prompb.Sample, 0, size)
    start := time.Now().UnixNano()
    for i := 0; i < size; i++ {
        samples = append(samples, prompb.Sample{
            Timestamp: start + int64(i),
            Value:     float64(i),
        })
    }

    return samples
}

// encodeChunk is copied from store/prometheus.go: it appends all samples
// into a single XOR chunk regardless of size.
func encodeChunk(ss []prompb.Sample) (storepb.Chunk_Encoding, []byte, error) {
    c := chunkenc.NewXORChunk()

    a, err := c.Appender()
    if err != nil {
        return 0, nil, err
    }

    for _, s := range ss {
        a.Append(s.Timestamp, s.Value)
    }
    return storepb.Chunk_XOR, c.Bytes(), nil
}

@robertjsullivan
Contributor

I think this is the uint16-enforcing line in prometheus/tsdb: https://github.com/prometheus/tsdb/blob/8d991bdc1e494074ca2d559e5a75f49f13a789ac/chunkenc/xor.go#L75
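
For illustration (this is not the tsdb code itself, just a standalone sketch of how a two-byte counter behaves once it passes 65535):

package main

import (
    "encoding/binary"
    "fmt"
)

// The XOR chunk keeps its sample count in a two-byte header, so any count
// written through it is effectively taken modulo 2^16.
func main() {
    var header [2]byte
    for _, n := range []int{65535, 65536, 65537} {
        binary.BigEndian.PutUint16(header[:], uint16(n)) // truncates to 16 bits
        fmt.Printf("appended %d samples, header reports %d\n",
            n, binary.BigEndian.Uint16(header[:]))
    }
}

With 65536 appends the two-byte count wraps back to 0, which would explain why the second test above fails.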

@kitemongerer

It seems like this comment might explain what's going on: https://github.com/improbable-eng/thanos/blob/270d81feec83ba5defa19d64e022d84e3b2e1a2a/pkg/store/prometheus.go#L158-L160

If there are more than 2^16 samples, they might need to be split into multiple chunks

@bwplotka
Member

bwplotka commented Jan 8, 2019

Wow, huge thanks @robertjsullivan for digging into it so deeply. Once again, I am positively surprised by the community! ❤️ 💪

So overall: the uint16 limit is totally fine from the Prometheus and TSDB perspective, because there should be no more than 120 samples within a single chunk, and this is done on purpose: https://github.com/prometheus/tsdb/blob/master/head.go#L1514

It appears that the Prometheus XOR encoding scheme used in the sidecar cannot encode/decode more than 2^16 samples per chunk. I think what is happening is that the sidecar queries the remote read endpoint without a step size, gets more than 2^16 points back, and that overflows the uint16 sample counter used by the compression. Those points are then sent back to the querier via gRPC with a sample count that reflects only the remainder left after the counter overflowed.

You and @kitemongerer are totally right here; I think you spotted a valid bug. We missed it because we were focused on progressing on this proposal (https://docs.google.com/document/d/1JqrU3NjM9HoGLSTPYOvR217f5HBKBiJTqikEB9UiJL0/edit?ts=5c2cdaf0#), which would solve this issue entirely.

Since streamed remote read is still in progress and we care about backward compatibility, the bug is valid and we should fix the current implementation.

The fix seems extremely straightforward to me: we should split each series into chunks of at most 2^16 samples and produce the biggest chunks possible, because the comment there is still valid:

// We generally expect all samples of the requested range to be traversed
// so we just encode all samples into one big chunk regardless of size.

I will leave the honour of fixing this to you, @robertjsullivan. (: If you'd prefer me or anyone else to do the fix, that's fine too, just let us know.
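
For illustration, a minimal sketch of that split (the splitSamples helper and local Sample type here are hypothetical, not the actual store/prometheus.go change):

package main

import (
    "fmt"
    "math"
)

// Sample mirrors the shape of prompb.Sample for this sketch.
type Sample struct {
    Timestamp int64
    Value     float64
}

// maxSamplesPerChunk caps each chunk at the largest count a uint16 header can hold.
const maxSamplesPerChunk = math.MaxUint16 // 65535

// splitSamples cuts the input into consecutive windows of at most
// maxSamplesPerChunk samples each, sharing the underlying array (no copying).
func splitSamples(samples []Sample) [][]Sample {
    var out [][]Sample
    for len(samples) > maxSamplesPerChunk {
        out = append(out, samples[:maxSamplesPerChunk])
        samples = samples[maxSamplesPerChunk:]
    }
    if len(samples) > 0 {
        out = append(out, samples)
    }
    return out
}

func main() {
    samples := make([]Sample, 150000)
    for _, part := range splitSamples(samples) {
        fmt.Println(len(part)) // 65535, 65535, 18930
    }
}

Each window would then be encoded into its own XOR chunk, so every chunk stays within the uint16 counter while remaining as large as possible.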

trevorwhitney added a commit to robertjsullivan/thanos that referenced this issue Jan 8, 2019
…2^16 to avoid overflowing XOR compression

Signed-off-by: Robert Sullivan <robert.j.sullivan@gmail.com>
@robertjsullivan
Contributor

Created the PR. Let me know if anything needs to change. Thanks!

bwplotka pushed a commit that referenced this issue Jan 9, 2019
…16 samples (#718)

* fixes issue #396, split response into chunks no bigger than 2^16 to avoid overflowing XOR compression

Signed-off-by: Robert Sullivan <robert.j.sullivan@gmail.com>

* use single underlying array for chunkSamples

* make SplitSamplesIntoChunksWithMaxSizeOfUint16_e2e use 3 chunks