Consider expanding the cases where Vector retries requests #10870

jszwedko · 2022-01-14T20:57:21Z

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Current behavior

Vector's retry behavior varies by sink, but typically we retry requests whose failures are considered temporary and likely to recover. This includes things like HTTP 503s and 429s. It does not include failures that are considered non-temporary, like HTTP 403s and 404s. Events in these requests are dropped and Vector continues processing.
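As a rough illustration of the classification described above, here is a minimal sketch. The `Classification` enum and `classify_status` function are hypothetical names, not Vector's actual types, and treating every 5xx as retriable is an assumption (the text above only names 503 and 429 explicitly).

```rust
/// Hypothetical classification of an HTTP response status.
/// These names are illustrative and are not Vector's real API.
#[derive(Debug, PartialEq, Eq)]
pub enum Classification {
    /// The request succeeded; nothing to retry.
    Successful,
    /// Considered temporary: the request is retried.
    Retry,
    /// Considered non-temporary: the events are dropped.
    Drop,
}

/// Classify a status code roughly as the "Current behavior" text describes.
/// Assumption: all 5xx codes are treated like 503.
pub fn classify_status(status: u16) -> Classification {
    match status {
        200..=299 => Classification::Successful,
        429 | 500..=599 => Classification::Retry,
        // Includes 403 and 404, which is where events get dropped today.
        _ => Classification::Drop,
    }
}
```

Under this sketch, a 403 from an invalid API key and a 404 from a misbehaving load balancer both land in the `Drop` arm, which is the case this issue is about.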

Possible issue

The above means that Vector drops events under circumstances like:

Misconfiguration in Vector like using an invalid Datadog API key
Misconfiguration in upstreams, such as a deploy of an internal Elasticsearch instance causing it to start returning 403s where the credentials were previously valid, or provider outages, like when a GCP outage caused customer load balancers to start returning 404s

Under these circumstances, I think it is reasonable for users to not want Vector to drop events, but just retry and block processing until a human can intervene and fix the issue.

Idea

One idea would be to retry all failures up until a maximum number of retries after which failed batches would be routed to a dead-letter queue (unimplemented).
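A minimal sketch of this idea, under stated assumptions: `Delivery` and `deliver_with_retries` are hypothetical names, the `send` closure stands in for a sink request that returns the failing HTTP status on error, and backoff between attempts is left out for brevity. Since the dead-letter queue is unimplemented, the sketch just hands the failed batch back to the caller.

```rust
/// Hypothetical outcome of delivering one batch.
#[derive(Debug, PartialEq, Eq)]
pub enum Delivery<B> {
    /// The batch was accepted after `attempts` tries.
    Sent { attempts: u32 },
    /// Retries were exhausted; the batch is returned so the caller
    /// could route it to a dead-letter destination.
    DeadLettered { batch: B, attempts: u32 },
}

/// Retry every failure, whatever its status, up to `max_retries` times
/// after the initial attempt. No backoff is applied in this sketch.
pub fn deliver_with_retries<B, F>(batch: B, max_retries: u32, mut send: F) -> Delivery<B>
where
    F: FnMut(&B) -> Result<(), u16>,
{
    let mut attempts = 0;
    loop {
        attempts += 1;
        match send(&batch) {
            Ok(()) => return Delivery::Sent { attempts },
            // Still within the retry budget: try again.
            Err(_status) if attempts <= max_retries => continue,
            // Budget exhausted: give the batch back instead of dropping it.
            Err(_status) => return Delivery::DeadLettered { batch, attempts },
        }
    }
}
```

With `max_retries = 2`, a batch that always fails is attempted three times in total before being returned as `DeadLettered`.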

I'm curious to hear other thoughts here though. This is a nuanced issue.

Refs

jerome-kleinen-kbc-be · 2022-01-17T08:21:43Z

Next to the retry behavior (number of attempts etc.), perhaps the HTTP status codes to retry could be configurable, of course with sensible defaults. I am struggling to picture how this could look; I guess the configuration would quickly become difficult to read and understand.
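One way this could look, as a sketch only: a small settings struct that keeps built-in defaults and lets the user add codes on top. `RetrySettings` and `extra_retry_statuses` are hypothetical names and do not correspond to an existing Vector option; the defaults (429 and 5xx) are an assumption.

```rust
use std::collections::BTreeSet;

/// Hypothetical per-sink retry settings; not an existing Vector option.
#[derive(Debug, Clone, Default)]
pub struct RetrySettings {
    /// Status codes to retry in addition to the built-in defaults.
    pub extra_retry_statuses: BTreeSet<u16>,
}

impl RetrySettings {
    /// Defaults assumed here: 429 and any 5xx are always retried.
    /// Anything listed in `extra_retry_statuses` is retried as well.
    pub fn is_retriable(&self, status: u16) -> bool {
        status == 429
            || (500..=599).contains(&status)
            || self.extra_retry_statuses.contains(&status)
    }
}
```

Keeping the defaults in code and exposing only the additions is one way to keep the configuration short: a user who needs 403s and 404s retried lists just those two codes.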

I do like your suggestion to route events to a new queue after having failed x retry attempts.

akutta · 2022-02-10T15:37:56Z

I could see scenarios where dead-letters are useful; however, I'd like to see it configurable.

When using the Elastic sink ( #10839 ) we recently ran into an issue where the response was:

ERROR sink{component_id=out_aws component_kind="sink" component_type=elasticsearch component_name=out_aws}:request{request_id=196897}: vector::sinks::elasticsearch::retry: Response contained errors. response=Response { status: 200, version: HTTP/1.1, headers: {"date": "Mon, 07 Feb 2022 03:11:13 GMT", "content-type": "application/json; charset=UTF-8", "content-length": "1245855", "connection": "keep-alive", "access-control-allow-origin": "*"}, body: b"{\"errors\":true,\"items\":[{\"index\":{\"_id\":null,\"status\":403,\"error\":{\"type\":\"index_create_block_exception\",\"reason\":\"blocked by: [FORBIDDEN/10/cluster create-index blocked (api)];\"}}}

In this case, rebalancing shards across our Elastic Domain resolved the underlying issue. We are indexing at a rate of ~150M/min so a dead-letter queue becomes less useful. If instead we applied back-pressure and stopped acking entirely, our messages would have been queued upstream.

Note: This was the internal status. The Request status was a 200.
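To illustrate why the outer 200 is not enough, here is a rough sketch that looks for per-item failures inside a bulk-style response body. `item_statuses` and `has_item_errors` are hypothetical helpers, not the Elasticsearch sink's code, and the substring scan is a deliberate simplification; real code would use a JSON parser.

```rust
/// Collect the per-item `"status"` values from a bulk-style response body.
/// NOTE: naive substring scan for illustration only; it does not parse JSON.
pub fn item_statuses(body: &str) -> Vec<u16> {
    const KEY: &str = "\"status\":";
    let mut statuses = Vec::new();
    let mut rest = body;
    while let Some(pos) = rest.find(KEY) {
        // Skip past the key and any whitespace before the number.
        rest = rest[pos + KEY.len()..].trim_start();
        let digits: String = rest.chars().take_while(|c| c.is_ascii_digit()).collect();
        if let Ok(code) = digits.parse::<u16>() {
            statuses.push(code);
        }
    }
    statuses
}

/// True when the request failed outright, or when a 2xx response
/// still reports errors for individual items.
pub fn has_item_errors(http_status: u16, body: &str) -> bool {
    if !(200..=299).contains(&http_status) {
        return true;
    }
    body.contains("\"errors\":true") || item_statuses(body).iter().any(|s| *s >= 400)
}
```

Applied to a body shaped like the one in the log above, the outer status is 200 but the item status is 403, so a sink using a check like this could retry or apply back-pressure instead of acking.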

* enhancement(datadog provider): Retry forbidden requests We intend to expand out retried requests to cover a much broader swatch (likely all requests) as part of #10870 but #12220 is blocking a user from trying out Vector so adding these ahead of time. Fixes: #12220 Signed-off-by: Jesse Szwedko <jesse@szwedko.me>

jszwedko · 2022-06-10T20:16:30Z

More context in #655

jszwedko · 2022-06-16T21:34:27Z

Per #13130 (comment) we should also see if we can encapsulate this logic to share it for HTTP clients.

jszwedko · 2022-07-01T18:18:29Z

Related: #13414

kevinpark1217 · 2022-08-19T18:36:42Z

@jszwedko On #13414, you mentioned this work might be picked up in Q3. Is there any plan for this to be implemented?

jszwedko · 2022-08-23T16:36:02Z

Hi @kevinpark1217! This is still on our near-term roadmap, but may not make it in Q3.

unkempthenry · 2023-02-07T23:00:11Z

@jszwedko is this still on the roadmap? Also interested in this for the Splunk sink like in #13414. Splunk cloud will return 404s very intermittently from their load balancers.

jszwedko · 2023-02-13T13:24:51Z

> @jszwedko is this still on the roadmap? Also interested in this for the Splunk sink like in #13414. Splunk cloud will return 404s very intermittently from their load balancers.

Hey! We plan to write an RFC around this during this quarter.

yoelk · 2023-03-16T09:48:46Z

@jszwedko our configuration is Kafka->Filter->S3 with acknowledgements, and like others have mentioned, events are dropped in certain cases. The one case we checked (which was the easiest) is setting the wrong role name in AWS.

I've seen in the description that events would be dropped depending on the HTTP response the sink gets, and whether it's seen as temporary or not.
Until this issue is resolved, I'd like to evaluate the risk for our clients' data, depending on each scenario (wrong password being one of them).

Is there a place where I can see the different cases, and which ones will result in dropped events?
Thanks a lot!

jszwedko · 2023-03-24T21:53:18Z

I responded in Discord, but unfortunately doing this sort of survey would mean spelunking through the source code.

cameronbraid · 2023-10-10T14:31:42Z

Is this still on the roadmap?

jszwedko · 2023-10-10T18:04:25Z

> Is this still on the roadmap?

It's definitely still on our radar as a high-impact area to improve, but there is probably nothing happening on this before the end of the year given competing priorities. We are open to seeing PRs addressing this for individual sinks though. We've seen some already 🙂

yoelk · 2023-10-11T06:15:21Z

@jszwedko Can you please point out the PRs you mentioned?
If the effort is not big, we might be interested in creating such PRs for aws_s3 and azure_blob sinks.
Would be great to have a reference for such an effort 🙏

jszwedko · 2023-10-11T14:31:40Z

Hi @yoelk ,

That would be great! Thanks for the interest! Here's a couple of examples:

For the two sinks you mentioned it would mean updating:

vector/src/aws/mod.rs

Lines 42 to 53 in 67c4beb

    
    pub fn is_retriable_error<T>(error: &SdkError<T>) -> bool {
        match error {
            SdkError::TimeoutError(_) | SdkError::DispatchFailure(_) => true,
            SdkError::ConstructionFailure(_) => false,
            SdkError::ResponseError(err) => check_response(err.raw()),
            SdkError::ServiceError(err) => check_response(err.raw()),
            _ => {
                warn!("AWS returned unknown error, retrying request.");
                true
            }
        }
    }

or

vector/src/sinks/s3_common/config.rs

Lines 307 to 314 in 67c4beb

    
    impl RetryLogic for S3RetryLogic {
        type Error = SdkError<PutObjectError>;
        type Response = S3Response;

        fn is_retriable_error(&self, error: &Self::Error) -> bool {
            is_retriable_error(error)
        }
    }

for the aws_s3 sink

vector/src/sinks/azure_common/config.rs

Lines 58 to 66 in 67c4beb

    
    impl RetryLogic for AzureBlobRetryLogic {
        type Error = HttpError;
        type Response = AzureBlobResponse;

        fn is_retriable_error(&self, error: &Self::Error) -> bool {
            error.status().is_server_error()
                || StatusCode::TOO_MANY_REQUESTS.as_u16() == Into::<u16>::into(error.status())
        }
    }

for the azure_blob_storage sink

Hopefully this helps get you going in the right direction. Just let us know if you need some more pointers!

jiaozi07 · 2024-01-25T08:39:02Z

For elasticsearch: is this still on the roadmap?

suikast42 · 2024-01-26T07:49:00Z

The same happens if you use the journald source and the nats sink. Logs that fail to be sent to NATS are still acked by Vector.

danielcollishaw · 2024-05-09T15:26:57Z

Hello @jszwedko! I hope I am not bothering or reviving a dead thread, but is this still on the road map?

I am using Vector to push to S3 in a storage intensive environment with an in memory buffer. We are rotating AWS STS credentials with a 2 hour buffer period to avoid writing with invalid credentials and losing data within the buffer. We would like to bring the credential duration down for security concerns and do away with the buffer period.

We were wondering if retrying a 403 until new credentials are provided, or a retry limit is hit, would help us work towards consistency without committing to a disk buffer? I appreciate any insight and would be happy to help in any way.

jszwedko · 2024-05-09T15:30:30Z

> Hello @jszwedko! I hope I am not bothering or reviving a dead thread, but is this still on the road map?
>
> I am using Vector to push to S3 in a storage intensive environment with an in memory buffer. We are rotating AWS STS credentials with a 2 hour buffer period to avoid writing with invalid credentials and losing data within the buffer. We would like to bring the credential duration down for security concerns and do away with the buffer period.
>
> We were wondering if retrying a 403 until new credentials are provided, or a retry limit is hit, would help us work towards consistency without committing to a disk buffer? I appreciate any insight and would be happy to help in any way.

No worries! We are still chipping away at this as we go, but we haven't been able to make a concerted effort.

For the AWS S3 sink, specifically, when using STS the AWS SDK should refresh the credentials prior to expiration so you shouldn't see any 403s.

danielcollishaw · 2024-05-09T15:34:22Z

> For the AWS S3 sink, specifically, when using STS the AWS SDK should refresh the credentials prior to expiration so you shouldn't see any 403s.

Sorry, I should have been a bit more descriptive about why they can expire: we assume the STS role elsewhere and push the returned credentials to the constrictive hosts. Sometimes this push can lag out of sync, which is why we are using a buffer period on the duration of the role.

Thank you for the ongoing progress and the update on the roadmap, appreciate the speedy response!

jszwedko · 2024-05-09T15:37:12Z

Aha I see. It looks like the AWS components don't currently retry failed credentials requests, but I think it'd be a relatively straightforward change to https://github.com/vectordotdev/vector/blob/master/src/aws/mod.rs if you (or others) are interested.

danielcollishaw · 2024-05-09T15:38:11Z

I actually have a period of time next week that I may be able to allocate to this. Will look into the contribution docs!

coutug · 2024-06-14T14:49:52Z

Hey @jszwedko, just hit the issue with 404 response for the http sink. Is there any update with that specific sink?

We currently run a k8s cluster and other nginx servers outside the cluster. I have a Vector instance on each nginx server to send its access logs to the k8s cluster's Vector instance via the cluster's ingress. That way I get a centralized view of all the logs (nginx and k8s). While testing for points of failure, I realized that everything works except when the k8s Vector crashes, since the ingress returns a 404 response to the nginx Vector. Any advice for that situation?

jszwedko · 2024-06-14T20:44:16Z

> Hey @jszwedko, just hit the issue with 404 response for the http sink. Is there any update with that specific sink?
>
> We currently run a k8s cluster and other nginx servers outside the cluster. I have a Vector instance on each nginx server to send its access logs to the k8s cluster's Vector instance via the cluster's ingress. That way I get a centralized view of all the logs (nginx and k8s). While testing for points of failure, I realized that everything works except when the k8s Vector crashes, since the ingress returns a 404 response to the nginx Vector. Any advice for that situation?

Would it be possible to have the ingress return a 503 instead (no upstream available)? Otherwise, I think we'd accept a contribution to the http sink to retry 404s.

coutug · 2024-06-17T22:16:00Z

Ah yes, found a way to return the proper status code from the ingress controller. Thanks a lot for the help!

noble-varghese · 2024-10-08T21:02:07Z

Hey @jszwedko, I have added an enhancement to the http sink to retry on some of the 4XXs :)

jszwedko added the meta: idea and domain: sinks labels Jan 14, 2022

jszwedko mentioned this issue Jan 14, 2022

retry behavior in elasticsearch sink #10839

Closed

jszwedko mentioned this issue Feb 14, 2022

Elasticsearch sink drops messages if ES rejects execution #11359

Closed

jszwedko mentioned this issue Apr 13, 2022

Vector sometimes doesn't fetch AWS credentials from IRSA #12197

Closed

zsherman mentioned this issue Apr 14, 2022

Retry failed authentication requests in Datadog sinks #12220

Closed

jszwedko mentioned this issue Apr 19, 2022

enhancement(datadog provider): Retry forbidden requests #12291

Merged

jszwedko mentioned this issue Apr 29, 2022

chore(observability): OP-282 Ensure config reporting resilience #12442

Merged

jszwedko mentioned this issue May 25, 2022

The datadog_logs does not retry requests due to aborted connections #12859

Closed

jszwedko mentioned this issue Jun 10, 2022

Rethink retry strategies #655

Closed

neuronull mentioned this issue Jun 16, 2022

fix(datadog_logs sink): retry HTTP requests and improve datadog sink error handling consistency #13130

Merged

jszwedko mentioned this issue Jul 1, 2022

Override HTTP Status Codes that are Retried #13414

Closed

jszwedko mentioned this issue Jul 6, 2022

Zombie disk buffer data when Non-retriable error on S3 sink #13455

Closed

jszwedko mentioned this issue Oct 26, 2022

kafka log-end-offset changed when failing on sinking data to elasticsearch #14963

Closed

This was referenced Dec 5, 2022

Retry failed sends for nats sink #6345

Open

request.retry_4xx_responses parameter for http sink #15498

Closed

This was referenced Dec 28, 2022

Socket sink should only ack when the event could be sent #4602

Open

File sink should not ack the buffer unless it could successfully write the event out #4601

Open

Enhancement Request - HTTP sink - retryable codes #6420

Closed

jszwedko mentioned this issue Jan 9, 2023

Sink Nats event buffer settings not works #15857

Closed

jszwedko mentioned this issue Aug 17, 2023

Missing error metrics on datadog_logs sink #18296

Closed

infor7 mentioned this issue Aug 30, 2023

Vector start getting 401 after hour of continous log sending to gcp bucket #18432

Closed

StephenWakely mentioned this issue Sep 18, 2023

fix(gcp service): retry on unauthorized #18586

Merged

jszwedko mentioned this issue Jan 12, 2024

gcp_stackdriver_logs: 401 Unauthorised each hour #19614

Closed

tanushri-sundar mentioned this issue Jan 25, 2024

Vector acks messages sent by S3 source even when delivery failed #19711

Closed

This was referenced Jan 30, 2024

azure_blob failed to upload logs even when acknowledgements are enabled #19741

Open

Data is discarded when elasticsearch index is in read-only state, which is not expected #19773

Open

joseluisjimenez1 mentioned this issue Apr 9, 2024

Sink(Elasticsearch): Dropping events with AWS auth strategy #20266

Open

frankh mentioned this issue Aug 8, 2024

Kafka sink silently discards events on connection errors #21031

Closed

noble-varghese mentioned this issue Oct 8, 2024

enhancement(http sink): Retrying the HTTP sink in case of 404s and request timeouts #21457

Open
