Description
Thanos, Prometheus and Golang version used:
v0.10.1
Object Storage Provider:
AWS S3
What happened:
We use thanos-receive to ingest data from multiple locations.
Due to #1624 we have to use a short tsdb.block-duration=15m. So if any issue brings thanos-receive down for more than 15m, the remote Prometheuses accumulate data locally. When connectivity to thanos-receive is reestablished, the Prometheuses start sending the delayed data - which is great. But thanos-receive won't accept it, since the timestamps fall into a past chunk. So we're getting holes in the graphs (the setup is sketched below).
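For context, a minimal sketch of the remote-write stanza on each remote Prometheus, assuming the standard thanos-receive remote-write endpoint; the hostname is illustrative:

```yaml
# prometheus.yml on each remote Prometheus (hostname is hypothetical).
# After an outage, Prometheus replays its buffered WAL samples through
# this endpoint, and thanos-receive rejects the old timestamps.
remote_write:
  - url: http://thanos-receive.example.com:19291/api/v1/receive
```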
What you expected to happen:
Thanos-receive is able to upload old data to S3.
Note that we cannot do both - upload to S3 directly from the remote locations and also send via remote-write - because thanos-compact cannot deduplicate such blocks, which leads to sum() returning 2x the real value, etc.
These remote Prometheuses run inside k8s on emptyDir (see the sketch after this paragraph). Switching to a plain thanos-sidecar (S3 upload) model would again lead to holes in the graphs whenever a Prometheus pod is restarted, because the sidecar only uploads completed blocks and we lose the WAL on the emptyDir on restart. Another option is to use a PVC, but that would bind Prometheus to a single AZ, since EBS volumes are per-AZ entities, and a single AZ could be down.
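A minimal sketch of the volume setup described above, assuming a plain pod spec; the names and mount path are illustrative:

```yaml
# Fragment of the Prometheus pod spec: the TSDB, including the WAL,
# lives on an emptyDir volume, so data the sidecar has not yet
# uploaded is lost when the pod is recreated.
containers:
  - name: prometheus
    volumeMounts:
      - name: prometheus-data
        mountPath: /prometheus
volumes:
  - name: prometheus-data
    emptyDir: {}
```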
That's why the remote-write model looks so attractive for us.
How to reproduce it (as minimally and precisely as possible):
Shut down a working thanos-receive for >2h, then turn it back on. Observe that the Prometheuses send the delayed data, but thanos-receive rejects it.