Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enhancement(AWS config): Make SQS client timeout settings configurable #20120

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

andjhop
Copy link

@andjhop andjhop commented Mar 18, 2024

Issue #20017 highlights a problem with the AWS S3 source, whereby it may stall, requiring a restart of Vector. This happens when Vector is unable to detect broken connections, for example when a connection is lost at the network level and will never be properly terminated. SQS connections seem to be particularly prone to this because of their typical long polling interval. This change is an attempt to resolve #20017 by exposing the timeout configuration of the underlying AWS client under a new optional field.

@bits-bot
Copy link

bits-bot commented Mar 18, 2024

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added domain: sources Anything related to the Vector's sources domain: sinks Anything related to the Vector's sinks domain: external docs Anything related to Vector's external, public documentation labels Mar 18, 2024
@andjhop andjhop marked this pull request as ready for review March 18, 2024 16:30
@andjhop andjhop requested review from a team as code owners March 18, 2024 16:30
@andjhop
Copy link
Author

andjhop commented Mar 18, 2024

I've left out precise details in the description of the naming and structure of the new settings awaiting further guidance. SQS is the primary point of concern but this configuration could potentially be exported to all AWS sources and sinks.

@StephenWakely
Copy link
Contributor

@andjhop Do you have a reliable way to replicate the issue that this PR is fixing? I'm not quite sure how I could get SQS to timeout...

@andjhop
Copy link
Author

andjhop commented Apr 3, 2024

Hi @StephenWakely, yes. I've been testing by "swapping" out docker networks. First create two docker networks network1 and network2 and start the development environment:

docker network create network1
docker network create network2
ENVIRONMENT_NETWORK=network1 make environment

With Vector running, disconnect from network1 and connect to network2:

docker network disconnect network1 vector-environment
docker network connect network2 vector-environment

@StephenWakely
Copy link
Contributor

I'm not able to see this fixed.

I have this config:

sources:
  in:
    type: aws_s3
    decoding:
      codec: bytes
    region: us-east-1
    endpoint: http://zork:4566
    sqs:
      queue_url: http://sqs.us-east-1.localhost.localstack.cloud:4566/000000000000/zork
      delete_message: true
    timeout:
      connect_timeout_seconds: 1
      operation_timeout_seconds: 1
      read_timeout_seconds: 1

Running against openstack using the above repro steps, things work fine until I disconnect the network. Then Vector no longer picks up any object pushed to the bucket.

Any ideas what I am doing wrong?

@andjhop
Copy link
Author

andjhop commented May 15, 2024

Hi @StephenWakely, yes. The timeout block should be nested beneath sqs. Also, it should be better to use timeout values greater than sqs.poll_secs which is 15 by default.

@StephenWakely
Copy link
Contributor

Ok, yes that works.

I wonder though, is there any reason why you would want to have different timeout settings for s3 and sqs? From a ux perspective this could get confusing. Unless there is a pressing reason for it, I would be tempted to propagate any timeout settings from s3 down to the sqs config.

I also wonder if it would make sense to just have a single timeout option that would apply to connect, operation and read. Is there a reason why people would absolutely need different values for each of these?

Perhaps someone from @vectordotdev/ux-team could chime in?

Also, it should be better to use timeout values greater than sqs.poll_secs which is 15 by default.

This should be emphasised in the docs. If I have a timeout of 15 seconds or less my logs get flooded with:

2024-05-21T11:17:34.560516Z ERROR source{component_kind="source" component_id=in component_type=aws_s3}: vector::internal_events::aws_sqs: Failed to fetch SQS events. error=dispatch failure error_code="failed_fetching_sqs_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
2024-05-21T11:17:34.560581Z ERROR source{component_kind="source" component_id=in component_type=aws_s3}: vector::internal_events::aws_sqs: Internal log [Failed to fetch SQS events.] is being suppressed to avoid flooding.
2024-05-21T11:17:57.002973Z ERROR source{component_kind="source" component_id=in component_type=aws_s3}: vector::internal_events::aws_sqs: Internal log [Failed to fetch SQS events.] has been suppressed 29730 times.

Which could easily lead a user to think that something catastrophic has occurred.

Is there a way we could validate this at run time?

@neuronull
Copy link
Contributor

I also wonder if it would make sense to just have a single timeout option that would apply to connect, operation and read. Is there a reason why people would absolutely need different values for each of these?

I would tend to agree with this. But am curious to hear if there is empirical need that we're not aware of.

@jszwedko
Copy link
Member

I also wonder if it would make sense to just have a single timeout option that would apply to connect, operation and read. Is there a reason why people would absolutely need different values for each of these?

Do you mean one timeout that wraps all 3 of these? That is: a timeout that is on the sum of all three? Or that the same value would be used for each? I could see doing the former, but I don't think the latter would make sense since I'd expect each of those activities (connect, operation, and read) to typically take different amounts of time (operation timeout, in particular, would be the longest; where I might want to abort early if I can't connect).

I personally think it is fine to expose all three though. I anticipate the SDK exposes all three because there are valid use-cases for configuring them separately (granted I could be wrong in that assumption!).

@andjhop
Copy link
Author

andjhop commented May 22, 2024

Regarding different timeout settings for S3 and SQS, the decision to split them was made following the advice here: #20017 (comment) with concern over the difficulty of choosing a single value which makes sense across different services.

I think that @jszwedko is right in that valid use cases for separately exposing the connect, operation and read settings in the SDK could reasonably translate to valid use case for exposing them in the source configuration. Also, for consistency, we already expose the separate connect and read settings for IMDS requests under auth.imds.

@StephenWakely
Copy link
Contributor

Regarding different timeout settings for S3 and SQS, the decision to split them was made following the advice here: #20017 (comment) with concern over the difficulty of choosing a single value which makes sense across different services.

I think that @jszwedko is right in that valid use cases for separately exposing the connect, operation and read settings in the SDK could reasonably translate to valid use case for exposing them in the source configuration. Also, for consistency, we already expose the separate connect and read settings for IMDS requests under auth.imds.

Ah right, I completely missed that this has already been discussed in depth.

@andjhop
Copy link
Author

andjhop commented May 30, 2024

To the goal of resolving issue #20017, I've made a few small changes. It's the long polling interval of SQS that's problematic here, so for now, I've only exposed the timeout settings for the SQS client. To be consistent with the structure we already have for IMDS requests, I've flattened the timeout settings under sqs, so how we already have auth.imds.{read,connect}_timeout_seconds we now also have sqs.{read,connect,operation}_timeout_seconds, each optional with no default value. I've also tried to emphasise in the documentation that consideration needs to be made for the value of sqs.poll_secs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain: external docs Anything related to Vector's external, public documentation domain: sinks Anything related to the Vector's sinks domain: sources Anything related to the Vector's sources
Projects
None yet
Development

Successfully merging this pull request may close these issues.

aws_s3 source: async receive_messages does not return after poll_secs
6 participants