
[RECEIVE] Make out of bounds a non-recoverable error #5131

Closed

Conversation

wiardvanrij
Member

Signed-off-by: Wiard van Rij wvanrij@roku.com

Relates to: #4831

Prometheus' remote write receiver treats ErrOutOfOrder as non-recoverable.
prometheus/prometheus#5649 (comment)

  • I added a CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Out of bounds would result in an HTTP 409, which causes Prometheus to retry, eventually leading to an infinite loop that can only be resolved by flushing the entire Prometheus data dir.

This change splits the OutOfBounds error out of ConflictErr into its own entity, which allows Prometheus to treat it as non-recoverable (i.e. dropping the samples instead of retrying).

I also made a small adjustment to the Debug.Log call, as it could not print the sample, which resulted in a confusing debug message like sample="unsupported value type". It now prints the sample value and timestamp.
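
For illustration, here is a minimal sketch of the kind of error-to-status mapping this change aims at. The sentinel errors and the helper below are assumed stand-ins, not the actual Thanos receiver code; only the out-of-bounds → 400 case mirrors the diff discussed further down.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Hypothetical sentinel errors standing in for the receiver's storage
// errors; the real identifiers may differ.
var (
	errOutOfBounds = errors.New("out of bounds")
	errConflict    = errors.New("conflict")
)

// writeErrorToStatus maps a storage error to an HTTP status code.
// Out of bounds now gets its own 400 (Bad Request) response instead of
// being folded into the 409 (Conflict) case, so Prometheus drops the
// samples rather than retrying them indefinitely.
func writeErrorToStatus(w http.ResponseWriter, err error) {
	switch {
	case errors.Is(err, errOutOfBounds):
		http.Error(w, err.Error(), http.StatusBadRequest)
	case errors.Is(err, errConflict):
		http.Error(w, err.Error(), http.StatusConflict)
	default:
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

func main() {
	rec := httptest.NewRecorder()
	writeErrorToStatus(rec, errOutOfBounds)
	fmt.Println(rec.Code) // 400
}
```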

Verification

I'll be honest, this was a pain to test and debug. What I have verified is that Prometheus now handles this as non-recoverable and nothing builds up as pending anymore.

I have adjusted the test cases and added an OutOfBounds case.

@Kampe

Kampe commented Feb 7, 2022

I appreciate the time you took to find this case!

Comment on lines +363 to +364:

    case errOutOfBounds:
        http.Error(w, err.Error(), http.StatusBadRequest)
Collaborator


Maybe I'm missing something here, but my understanding was that Prometheus treats only network errors, 5xx responses and, if configured to, 429 responses as recoverable (see this line and below).

It therefore should not make a difference whether we respond with bad request (400) or conflict (409), since conflict is already unrecoverable, right?

Member Author


For Prometheus, 400 = unrecoverable, if that's your question. Also, if that's untrue, then I'm happy to be pointed to some resource. However, I've tested this and the response now becomes unrecoverable for Prometheus.

Collaborator


As far as I understood, only 5xx codes are recoverable for Prometheus (see the link in my comment). Meaning if until now we were returning 409 for the out of bounds error, it should have already been working as expected, i.e. there should not be a real difference between the codes. As long as we return any 4xx code, it should be treated as unrecoverable (with the possible exception of 429).
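
To make that concrete, here is a small sketch of the retry classification as described above. This is my reading of the behaviour, not the actual Prometheus remote-write client code, and the helper name is made up for illustration.

```go
package main

import (
	"fmt"
	"net/http"
)

// isRecoverable reflects the classification described above: 5xx
// responses are retried, 429 only if retrying on 429 is enabled, and
// every other 4xx is treated as non-recoverable (samples are dropped).
func isRecoverable(statusCode int, retryOn429 bool) bool {
	switch {
	case statusCode >= 500 && statusCode < 600:
		return true
	case statusCode == http.StatusTooManyRequests:
		return retryOn429
	default:
		return false
	}
}

func main() {
	fmt.Println(isRecoverable(http.StatusConflict, false))   // false: 409 is dropped
	fmt.Println(isRecoverable(http.StatusBadRequest, false)) // false: 400 is dropped too
	fmt.Println(isRecoverable(http.StatusBadGateway, false)) // true: 5xx is retried
}
```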

Member Author


Right, I understand you and the code (I missed the link) now. Funny thing is, though, that I did not experience this behaviour. Only after my change did I actually see the non-recoverable errors in the Prometheus logs. 🤔


Looking forward to seeing this fixed. It's causing us pain in production.

Collaborator


@wiardvanrij I did a small test today: I spun up a VM where I skewed the clock on purpose (made it run in the 'past' compared to my host machine), brought up an instance of Prometheus, and configured it to remote write to a Thanos instance running on my host.

I first ran it against a Thanos build from the latest main. The result: Prometheus informs me this is a non-recoverable error:

ts=2022-02-12T12:26:39.089Z caller=dedupe.go:112 component=remote level=error remote_name=6c3063 url=http://192.168.56.1:19291/api/v1/receive msg="non-recoverable error" count=9 exemplarCount=0 err="server returned HTTP status 409 Conflict: store locally for endpoint 127.0.0.1:10901: conflict"

Built from your branch:

ts=2022-02-12T12:34:32.266Z caller=dedupe.go:112 component=remote level=error remote_name=6c3063 url=http://192.168.56.1:19291/api/v1/receive msg="non-recoverable error" count=63 exemplarCount=0 err="server returned HTTP status 400 Bad Request: store locally for endpoint 127.0.0.1:10901: out of bounds"

I haven't noticed any unexpected behavior on either side (Thanos or Prometheus). Prometheus treated it (in both cases) simply as an unrecoverable error.

Member Author


Thanks @matej-g, I really appreciate the testing. Then I need more user input. @Kampe, did you have a chance to test it, or could you provide more info? Because if we now look at the code, it already should be correct. I would agree that we sometimes see odd behaviour, but then we really need to know why that is happening.

If I have some more time, I'll try to figure things out as well.


Kampe commented Feb 14, 2022


I responded in the issue ticket #4831, but there definitely seem to be two issues happening, with 500 level errors and 409s. Unsure if they are related, but it seems we can reproduce them on at least a monthly cadence (sometimes weekly).

We have not had the chance to roll the changes out to any of our environments. We don't have much tooling set up to roll non-registry-hosted third party tools even to developer environments, so testing without a prebuilt container is pretty intensive for us; however, once we have a container we can reference, I can get it rolled out in a matter of minutes.

@phillebaba
Contributor

@wiardvanrij do you still need some testing for this feature? I think I am experiencing this issue with a specific tenant, which causes the receiver to consume a lot more memory than usual in certain scenarios, to the point where it eventually gets OOM killed. Looking at the logs, the issue seems consistent with the description, as the Prometheus agents continue to retry the same failed out of order requests over and over again.

@phillebaba
Contributor

Just wanted to add that I migrated to the new router->ingestor model and it solved most of my issues. I am still seeing a few out of bounds errors, but they are nowhere near where they were before.

@wiardvanrij
Member Author

@wiardvanrij do you still need some testing for this feature? I think I am experiencing this issue with a specific tenant, which causes the receiver to consume a lot more memory than usual in certain scenarios, to the point where it eventually gets OOM killed. Looking at the logs, the issue seems consistent with the description, as the Prometheus agents continue to retry the same failed out of order requests over and over again.

Sorry for the lag.
The problem with this issue is that it's very hard to replicate, and for some reason I thought I had fixed it (as my tests were successful), but that does not make any sense given the input I've got here.
I think we might just have to close this PR and follow up in the issue?

@gabrielSoudry

Can you merge this please? The message "unsupported value type" is very confusing and we spent a lot of time figuring out what was going on... (And we still have a lot of 409s.)

Our solution: send fewer metrics...
Thank you!

@wiardvanrij
Member Author

wiardvanrij commented May 27, 2022

No, this isn't the solution. Please follow up on the issue with input.
