
receive: Support backfilling data through remote write (in case of origin got disconnected) #2114

Closed
sepich opened this issue Feb 10, 2020 · 21 comments

Comments

sepich (Contributor) commented Feb 10, 2020

Thanos, Prometheus and Golang version used:
v0.10.1

Object Storage Provider:
AWS S3

What happened:
We use thanos-receive to get data from multiple locations.
Due to #1624 we have to use a short tsdb.block-duration=15m. So if any issue happens and thanos-receive is down for >15m, the remote Prometheuses accumulate data. When connectivity to thanos-receive is re-established, the Prometheuses start to send the delayed data, which is great. But thanos-receive won't accept it, because the timestamps fall into an already-past chunk, so we get holes in our graphs.
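
For illustration, the rejection boils down to a bounds check of roughly this shape (a minimal sketch with made-up names and values, not the actual TSDB code):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errOutOfBounds = errors.New("out of bounds")

// acceptSample mimics, in spirit, the check that produces the "Out of bounds"
// errors: samples older than the head's minimum valid time are rejected.
// minValidTime is a hypothetical value here, not a real Thanos/Prometheus API.
func acceptSample(sampleT, minValidTime int64) error {
	if sampleT < minValidTime {
		return errOutOfBounds
	}
	return nil
}

func main() {
	now := time.Now().UnixMilli()
	minValidTime := now - time.Hour.Milliseconds() // e.g. with 2h blocks receive accepts ~1h of history (see discussion below)

	delayed := now - 3*time.Hour.Milliseconds()      // a sample buffered while receive was down
	fmt.Println(acceptSample(delayed, minValidTime)) // out of bounds -> dropped, hole in graphs
	fmt.Println(acceptSample(now, minValidTime))     // <nil>, fresh samples are fine
}
```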

What you expected to happen:
Thanos-receive is able to upload old data to S3.

Note that we cannot do both, i.e. upload to S3 from the remote locations and also remote-write, because thanos-compact cannot deduplicate such blocks, which leads to sum() returning 2x the real value, etc.

These remote Prometheuses run inside k8s on emptyDir. Switching to a plain thanos-sidecar (S3 upload) model would again lead to holes in graphs whenever a Prometheus pod is restarted, because the sidecar only uploads completed blocks and we lose the WAL in emptyDir on restart. Another option is to use a PVC, but that would bind Prometheus to a single AZ, as EBS volumes are a per-AZ entity, and a single AZ could go down.

That's why the remote-write model looks so attractive to us.

How to reproduce it (as minimally and precisely as possible):
Shut down a working thanos-receive for >2h, then turn it back on. Observe that the Prometheuses send the stale data, but thanos-receive rejects it.

squat (Member) commented Feb 10, 2020

What do you mean exactly by:

What you expected to happen:
Thanos-receive is able to upload old data to S3.

Thanos receive is already able to upload old data to S3. Each instance of Thanos receive in the hashring should use a different replica label. Thanos querier will correctly deduplicate this data when querying, even when using aggregations like sum, so no queries should be incorrect.
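
To illustrate the deduplication idea, here is a toy sketch of grouping series that differ only by a replica label (hypothetical types and the label name "replica" are just examples, not the actual querier code):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// dedupKey builds a grouping key from all labels except the replica label,
// so series that differ only by their replica collapse into one group.
func dedupKey(labels map[string]string, replicaLabel string) string {
	parts := make([]string, 0, len(labels))
	for k, v := range labels {
		if k == replicaLabel {
			continue
		}
		parts = append(parts, k+"="+v)
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

func main() {
	series := []map[string]string{
		{"__name__": "up", "job": "node", "replica": "receive-0"},
		{"__name__": "up", "job": "node", "replica": "receive-1"},
	}
	groups := map[string]int{}
	for _, s := range series {
		groups[dedupKey(s, "replica")]++
	}
	// Both series fall into one group, so sum() over the deduplicated
	// result counts the data only once instead of 2x.
	fmt.Println("distinct series after dedup:", len(groups))
}
```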

This sounds like we are trying to solve a symptom of the actual issue in #1624. I would prefer to focus on fixing that, so that everyone can use block durations of 2h and avoid anything like what you are seeing. What do you think?

sepich (Contributor, Author) commented Feb 10, 2020

What do you mean exactly by:
Thanos receive is already able to upload old data to S3

Sorry, I meant errors like these after thanos-receive has been down for some time and then comes back up:

level=debug ts=2020-02-06T14:51:27.634126008Z caller=writer.go:69 component=receive component=receive-writer msg="Out of bounds metric" lset="{__name__=\"container_processes\", instance=...}" sample="timestamp:1581000525667 "
level=warn ts=2020-02-06T14:51:27.635264235Z caller=writer.go:83 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2

squat (Member) commented Feb 10, 2020

aha, ok so you mean something like:

What you expected to happen:
Thanos-receive is able to continue ingesting data even after closing and uploading a block without seeing out of bounds errors.

I still think that this is really addressing a symptom rather than the root of the issue. I am not against investigating this avenue but I think it will be more valuable and solve more people's issues if we go to the root and prevent receive from OOMing while replaying the WAL. This way you can have longer block durations and as a result see fewer out of bounds errors.

brancz (Member) commented Feb 10, 2020

I think there is space for something like this, in some shape or form, in the Prometheus remote-write protocol. Essentially, the protocol itself should be able to backfill data that has already been flushed from the head block/WAL into a block, even if the backend was down for >2h. I probed some other Prometheus team members, and it did seem like there were other parties interested in such a thing.

sepich (Contributor, Author) commented Feb 14, 2020

I still think that this is really addressing a symptom rather than the root of the issue.

Unfortunately, even with the default tsdb.block-duration=2h on the thanos-receive side, the remote-write scheme is unable to recover after receive has been down for >2h.
Logs on the Prometheus v2.15.2 side:

ts=2020-02-14T08:38:06.999Z caller=dedupe.go:111 component=remote level=error remote_name=f9a55a url=https://xxx/api/v1/receive msg="non-recoverable error" count=100 err="server returned HTTP status 409 Conflict: 3 errors: failed to non-fast add 28 samples: out of bounds: conflict; 409; 409"
ts=2020-02-14T08:38:11.921Z caller=dedupe.go:111 component=remote level=warn remote_name=f9a55a url=https://xxx/api/v1/receive msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1581657554 minSendTimestamp=1581669481

And lastSendTimestamp stays the same in each event; it is not growing.
On the thanos-receive side (3 hashring endpoints with replica=1):

level=warn ts=2020-02-14T08:42:27.873428299Z caller=writer.go:83 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=35
level=error ts=2020-02-14T08:42:27.874889268Z caller=handler.go:347 component=receive component=receive-handler msg="storing locally" err="failed to non-fast add 35 samples: out of bounds: conflict" endpoint=http://thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:19291/api/v1/receive
level=error ts=2020-02-14T08:42:27.884528922Z caller=handler.go:395 component=receive component=receive-handler msg="forwarding returned non-200 status" err=409 endpoint=http://thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:19291/api/v1/receive
level=error ts=2020-02-14T08:42:27.887348733Z caller=handler.go:395 component=receive component=receive-handler msg="forwarding returned non-200 status" err=409 endpoint=http://thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:19291/api/v1/receive

The metric thanos_receive_forward_requests_total dropped after receive was down:
[screenshot: graph of thanos_receive_forward_requests_total]
It seems the Prometheuses keep sending the same old data over and over again and are unable to move on to new data.
I have to manually restart all Prometheuses and delete the /wal/* folder contents to recover from this.
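
The "Skipping resharding, last successful send was beyond threshold" warning above comes from a guard in the Prometheus remote-write queue manager; a paraphrased sketch of the idea (illustrative names and threshold, not the real source) looks like this:

```go
package main

import (
	"fmt"
	"time"
)

// shouldReshard sketches the guard: if the last successful send is older than
// a recent threshold, the queue refuses to reshard, because it has no fresh
// throughput information to base the decision on.
func shouldReshard(lastSendTimestamp, minSendTimestamp int64) bool {
	if lastSendTimestamp < minSendTimestamp {
		fmt.Println("Skipping resharding, last successful send was beyond threshold")
		return false
	}
	return true
}

func main() {
	now := time.Now().Unix()
	minSend := now - 120 // "recent enough" = within the last 2 minutes (made-up threshold)

	// While receive keeps answering 409, lastSendTimestamp never advances,
	// so the guard fires on every evaluation, exactly as in the logs above.
	stuckLastSend := now - 3*60*60
	fmt.Println(shouldReshard(stuckLastSend, minSend)) // false
}
```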

brancz (Member) commented Feb 14, 2020

There are some caveats:

  1. The receive component will only ingest data up to 1 hour old.
  2. Prometheus stops retrying to send data once it is no longer available in the WAL, and from then on only attempts to send newly available WAL data, so data older than 2h cannot be ingested/backfilled; in practice, only data from within the last hour can.

sepich (Contributor, Author) commented Feb 14, 2020

Thank you for such important information!
I believe this should be mentioned in the docs. I can confirm that setting --storage.tsdb.min-block-duration=15m --storage.tsdb.max-block-duration=15m on the Prometheus side makes the data acceptable to thanos-receive.

  1. Is there a reason why it is 1h, and not 2h or more for the default 2h Prometheus WAL?
  2. Maybe I misunderstand your sentence, but once this issue starts to happen it does not fix itself. I do see in the Prometheus logs every 2h that the WAL is rotated and a block is written, but lastSendTimestamp stays the same and the data has not been accepted by thanos-receive for 6h.

brancz (Member) commented Feb 14, 2020

  1. The heuristic is 0.5x of the smallest block produced, and since the default smallest block is 2h, the heuristic yields 1h. Further explained: the WAL accumulates 1.5x the smallest block's worth of data, and by refusing appends for the oldest 0.5x it can safely flush 1.0x of the head block to disk without having to worry about appends arriving from earlier than that (see the sketch after this list).
  2. Your logs say the receive component returned status code 409. On 4xx errors the Prometheus remote-write client drops the request it is trying to make and moves on to the next one. Could you enable debug logs on Prometheus?
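
A minimal sketch of that arithmetic, just the 0.5x heuristic from point 1 expressed as code:

```go
package main

import (
	"fmt"
	"time"
)

// acceptanceWindow applies the 0.5x heuristic described above: the head only
// keeps accepting appends for the most recent half of the smallest block
// range, so a full block's worth of data can be flushed to disk safely.
func acceptanceWindow(smallestBlock time.Duration) time.Duration {
	return smallestBlock / 2
}

func main() {
	fmt.Println(acceptanceWindow(2 * time.Hour))    // 1h0m0s -> the default receive setup
	fmt.Println(acceptanceWindow(15 * time.Minute)) // 7m30s  -> with the 15m block workaround above
}
```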

stale bot commented Mar 15, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

stale bot added the stale label Mar 15, 2020
stale bot closed this as completed Mar 22, 2020
Kampe commented Oct 15, 2020

Wish this wasn't stale; this would be useful to have. If Prometheus attempts to write the samples, shouldn't Receive accept them?

bwplotka (Member) commented Jun 4, 2021

Still valid, and a big chunk of work to make it happen.

One idea already presented in the past to the Prometheus and Cortex teams: start a second, separate WAL just for older series.
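
To make the shape of that idea a bit more concrete, here is a toy sketch of routing appends by sample age into two WALs (all names hypothetical; this is not an existing Prometheus or Thanos API):

```go
package main

import "fmt"

// appender is a minimal stand-in for a WAL appender, only to sketch the idea.
type appender interface {
	Append(ts int64, value float64) error
}

type printAppender struct{ name string }

func (p printAppender) Append(ts int64, value float64) error {
	fmt.Printf("%s <- sample @ %d\n", p.name, ts)
	return nil
}

// routingAppender sends recent samples to the primary WAL and anything older
// than cutoff to a second, separate WAL, whose contents could later be turned
// into backfilled blocks.
type routingAppender struct {
	primary, historical appender
	cutoff              int64
}

func (r routingAppender) Append(ts int64, v float64) error {
	if ts < r.cutoff {
		return r.historical.Append(ts, v)
	}
	return r.primary.Append(ts, v)
}

func main() {
	r := routingAppender{
		primary:    printAppender{name: "wal"},
		historical: printAppender{name: "wal-historical"},
		cutoff:     1000, // boundary between "current" and "old" samples (made up)
	}
	r.Append(500, 1)  // old sample -> separate WAL
	r.Append(2000, 1) // fresh sample -> normal WAL
}
```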

bwplotka changed the title from "receive: to support backfill" to "receive: Support backfilling data through remote write (in case of origin got disconnected)" on Jun 4, 2021
stale bot commented Aug 4, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale bot added the stale label Aug 4, 2021
GiedriusS removed the stale label Aug 11, 2021
Kampe commented Aug 24, 2021

Love to see the stale tag removed <3

bwplotka (Member) commented Sep 1, 2021

Still valid!

stale bot commented Jan 9, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale bot added the stale label Jan 9, 2022
yeya24 (Contributor) commented Jan 10, 2022

Valid

stale bot removed the stale label Jan 10, 2022
stale bot commented Apr 17, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale bot added the stale label Apr 17, 2022
Kampe commented Apr 18, 2022

It seems there's work being done upstream in Prometheus to support this: https://github.com/grafana/mimir-prometheus/compare/out-of-order

stale bot removed the stale label Apr 18, 2022
stale bot commented Sep 21, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale bot added the stale label Sep 21, 2022
yeya24 (Contributor) commented Sep 21, 2022

prometheus/prometheus#11075 has finally been merged into Prometheus, so we could probably give it a try.
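
Conceptually, out-of-order ingestion relaxes the strict bound sketched earlier in this thread: samples within a configurable window behind the newest ingested timestamp are still accepted. A toy illustration (made-up values, not the real TSDB implementation; see the linked PR for the actual design):

```go
package main

import (
	"errors"
	"fmt"
)

var errTooOld = errors.New("too old")

// acceptWithOOOWindow accepts samples that are at most oooWindow behind the
// newest timestamp seen so far, instead of rejecting everything older than a
// fixed minimum valid time.
func acceptWithOOOWindow(sampleT, newestT, oooWindow int64) error {
	if sampleT < newestT-oooWindow {
		return errTooOld
	}
	return nil
}

func main() {
	newest := int64(10 * 60 * 60 * 1000) // newest ingested timestamp in ms (made up)
	window := int64(2 * 60 * 60 * 1000)  // a 2h out-of-order window (made up)
	delayed := newest - 90*60*1000       // a sample arriving 90 minutes late
	fmt.Println(acceptWithOOOWindow(delayed, newest, window)) // <nil>: now accepted
}
```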

yeya24 (Contributor) commented Nov 13, 2022

Since the OOO feature has been merged into main, I will close this issue. Feel free to reopen if you think it is not addressed.

yeya24 closed this as completed Nov 13, 2022