
Sink(Elasticsearch): Dropping events with AWS auth strategy #20266

Open
joseluisjimenez1 opened this issue Apr 9, 2024 · 2 comments
Labels
type: bug A code related bug.

Comments


joseluisjimenez1 commented Apr 9, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

TL;DR: The Elasticsearch sink drops events unintentionally, even when acknowledgments are enabled, due to failures loading AWS credentials.

  • This also happens when the basic auth user does not have permissions (403 status code).

I tried to be really concise; please let me know if I can provide any extra information I may have missed. Thanks in advance.

Context:

  • Vector running on AWS ECS Fargate as a service (1 to 3 tasks, with autoscaling enabled).
  • Kafka source configured.
  • Elasticsearch and AWS S3 sinks configured using the aws auth strategy.
  • End-to-end acknowledgments are enabled (a minimal config sketch follows below).
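
For reference, a minimal sketch of this setup; the source name, broker address, consumer group, and topic below are placeholders, not our real values:

acknowledgements:
  enabled: true

sources:
  kafka_logs:
    type: kafka
    bootstrap_servers: "${KAFKA_BROKERS}" # placeholder
    group_id: "vector-logs"               # placeholder
    topics:
      - "istio-logs"                      # placeholder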

Description:

Vector runs smoothly until an increase of load arrives at Kafka (for us, Kubernetes Velero backups every hour). Sometimes those spikes drop nothing, sometimes they drop a few events, and sometimes they drop a lot.

But the errors in the logs are always the same:
[Screenshot of the error logs, 2024-04-09]

Things that have been tried:

  • Increasing CPU resources: improves things a little, but the issue still occurs.
  • Scaling horizontally: also improves things, but the issue is still reproducible.
  • Configuring IMDS timeouts: no change.

Workaround:

Switching to basic authentication is the only way found so far to avoid dropping events when those spikes arrive (a sketch follows below).
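
Roughly, the workaround config looks like this (the user/password environment variable names are placeholders; the other sink settings stay as in the Configuration section below, only the auth block changes):

sinks:
  opensearch:
    type: elasticsearch
    # ... same endpoints, mode, data_stream, batch and request settings as below ...
    auth:
      strategy: "basic"
      user: "${OPENSEARCH_USER}"         # placeholder
      password: "${OPENSEARCH_PASSWORD}" # placeholder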

Proposal:

Vector should be able to handle credential errors and apply backpressure instead of dropping events when:

  • The AWS credential provider fails to load credentials (EcsContainer in this case).
  • The HTTP response status is 403 Forbidden due to a lack of user permissions.

Troubleshooting:

It seems Vector uses the AWS Rust SDK to sign the request to OpenSearch, but apparently it loads the credentials on every single request instead of using the cache that is defined?

[Screenshot, 2024-03-28]

Configuration

sinks:
  opensearch:
    type: elasticsearch
    inputs: 
      - "parse_istio_log"
    endpoints:
      - "${OPENSEARCH_ENDPOINT}"
    auth:
      strategy: "aws"
      load_timeout_secs: 120 # 4 retries of 30 seconds
      imds:
        connect_timeout_seconds: 30
        read_timeout_seconds: 10
    aws:
      region: "eu-central-1"
    mode: data_stream
    bulk:
      action: "create"
    data_stream:
      type: "kubernetes"
      dataset: "v1"
      namespace: "prod"
    batch:
      max_bytes: 10000000
    request:
      timeout_secs: 60

Version

0.36.0 -> 0.37.0

Debug Output

@timestamp	message	vector.component_type	metadata.level	metadata.module_path	error	provider
Mar 28, 2024 @ 10:32:18.357	pooling idle connection for ("http", 169.254.170.2)	elasticsearch	DEBUG	hyper::client::pool	 
Mar 28, 2024 @ 10:32:18.358	entering 'before deserialization' phase	elasticsearch	DEBUG	aws_smithy_runtime_api::client::interceptors::context	 
Mar 28, 2024 @ 10:32:18.358	a retry is either unnecessary or not possible, exiting attempt loop	elasticsearch	DEBUG	aws_smithy_runtime::client::orchestrator	 
Mar 28, 2024 @ 10:32:18.358	loaded credentials	elasticsearch	DEBUG	aws_config::meta::credentials::chain	 
Mar 28, 2024 @ 10:32:18.358	entering 'after deserialization' phase	elasticsearch	DEBUG	aws_smithy_runtime_api::client::interceptors::context	 
Mar 28, 2024 @ 10:32:18.358	entering 'deserialization' phase	elasticsearch	DEBUG	aws_smithy_runtime_api::client::interceptors::context	 
Mar 28, 2024 @ 10:32:18.361	reuse idle connection for ("https", opensearch-internal-dev-fra.dev-fra)	elasticsearch	DEBUG	hyper::client::pool	 
Mar 28, 2024 @ 10:32:18.361	Sending HTTP request.	elasticsearch	DEBUG	vector::internal_events::http_client	 
Mar 28, 2024 @ 10:32:18.365	encountered orchestrator error; halting	elasticsearch	DEBUG	aws_smithy_runtime::client::orchestrator	 
Mar 28, 2024 @ 10:32:18.366	a retry is either unnecessary or not possible, exiting attempt loop	elasticsearch	DEBUG	aws_smithy_runtime::client::orchestrator	 
Mar 28, 2024 @ 10:32:18.366	Unhandled error response.	elasticsearch	WARN	vector::sinks::util::adaptive_concurrency::controller	unexpected credentials error
Mar 28, 2024 @ 10:32:18.366	Failed to build request.	elasticsearch	ERROR	vector::internal_events::common	unexpected credentials error
Mar 28, 2024 @ 10:32:18.366	Unexpected error type; dropping the request.	elasticsearch	ERROR	vector::sinks::util::retries	unexpected credentials error
Mar 28, 2024 @ 10:32:18.366	provider failed to provide credentials	elasticsearch	WARN	aws_config::meta::credentials::chain	unexpected credentials error: dispatch failure: other: connection closed before message completed (Unhandled(Unhandled { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Other(Some(TransientError)), source: hyper::Error(IncompleteMessage), connection: Unknown } }) }))
Mar 28, 2024 @ 10:32:18.367	Events dropped	elasticsearch	ERROR	vector_common::internal_event::component_events_dropped	 
Mar 28, 2024 @ 10:32:18.367	Service call failed. No retries or retries exhausted.	elasticsearch	ERROR	vector_common::internal_event::service	Some(Unhandled(Unhandled { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Other(Some(TransientError)), source: hyper::Error(IncompleteMessage), connection: Unknown } }) }))

Example Data

No response

Additional Context

Vector is running on AWS ECS Fargate.

References

joseluisjimenez1 added the type: bug label Apr 9, 2024
jszwedko (Member) commented Apr 9, 2024

Thanks @joseluisjimenez1. I think this is a specific case of #10870. We'd like to improve retries in Vector, generally, over time.

joseluisjimenez1 (Author) commented

> Thanks @joseluisjimenez1. I think this is a specific case of #10870. We'd like to improve retries in Vector, generally, over time.

Hey @jszwedko,

it seems this could fit under #10870, but maybe the error loading AWS credentials from the aws-sdk is a somewhat different thing here? Not 100% sure, to be honest...
