enhancement(AWS config): Make SQS client timeout settings configurable #20120

andjhop · 2024-03-18T08:41:46Z

Issue #20017 highlights a problem with the AWS S3 source, whereby it may stall, requiring a restart of Vector. This happens when Vector is unable to detect broken connections, for example when a connection is lost at the network level and will never be properly terminated. SQS connections seem to be particularly prone to this because of their typical long polling interval. This change is an attempt to resolve #20017 by exposing the timeout configuration of the underlying AWS client under a new optional field.

bits-bot · 2024-03-18T08:41:51Z

All committers have signed the CLA.

andjhop · 2024-03-18T16:35:49Z

I've left out precise details in the description of the naming and structure of the new settings awaiting further guidance. SQS is the primary point of concern but this configuration could potentially be exported to all AWS sources and sinks.

StephenWakely · 2024-03-25T16:32:19Z

@andjhop Do you have a reliable way to replicate the issue that this PR is fixing? I'm not quite sure how I could get SQS to timeout...

andjhop · 2024-04-03T08:26:35Z

Hi @StephenWakely, yes. I've been testing by "swapping" out docker networks. First create two docker networks network1 and network2 and start the development environment:

docker network create network1
docker network create network2
ENVIRONMENT_NETWORK=network1 make environment

With Vector running, disconnect from network1 and connect to network2:

docker network disconnect network1 vector-environment
docker network connect network2 vector-environment

StephenWakely · 2024-05-10T15:42:22Z

I'm not able to see this fixed.

I have this config:

sources:
  in:
    type: aws_s3
    decoding:
      codec: bytes
    region: us-east-1
    endpoint: http://zork:4566
    sqs:
      queue_url: http://sqs.us-east-1.localhost.localstack.cloud:4566/000000000000/zork
      delete_message: true
    timeout:
      connect_timeout_seconds: 1
      operation_timeout_seconds: 1
      read_timeout_seconds: 1

Running against openstack using the above repro steps, things work fine until I disconnect the network. Then Vector no longer picks up any object pushed to the bucket.

Any ideas what I am doing wrong?

andjhop · 2024-05-15T10:04:23Z

Hi @StephenWakely, yes. The timeout block should be nested beneath sqs. Also, it should be better to use timeout values greater than sqs.poll_secs which is 15 by default.

StephenWakely · 2024-05-21T11:33:14Z

Ok, yes that works.

I wonder though, is there any reason why you would want to have different timeout settings for s3 and sqs? From a ux perspective this could get confusing. Unless there is a pressing reason for it, I would be tempted to propagate any timeout settings from s3 down to the sqs config.

I also wonder if it would make sense to just have a single timeout option that would apply to connect, operation and read. Is there a reason why people would absolutely need different values for each of these?

Perhaps someone from @vectordotdev/ux-team could chime in?

Also, it should be better to use timeout values greater than sqs.poll_secs which is 15 by default.

This should be emphasised in the docs. If I have a timeout of 15 seconds or less my logs get flooded with:

2024-05-21T11:17:34.560516Z ERROR source{component_kind="source" component_id=in component_type=aws_s3}: vector::internal_events::aws_sqs: Failed to fetch SQS events. error=dispatch failure error_code="failed_fetching_sqs_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
2024-05-21T11:17:34.560581Z ERROR source{component_kind="source" component_id=in component_type=aws_s3}: vector::internal_events::aws_sqs: Internal log [Failed to fetch SQS events.] is being suppressed to avoid flooding.
2024-05-21T11:17:57.002973Z ERROR source{component_kind="source" component_id=in component_type=aws_s3}: vector::internal_events::aws_sqs: Internal log [Failed to fetch SQS events.] has been suppressed 29730 times.

Which could easily lead a user to think that something catastrophic has occurred.

Is there a way we could validate this at run time?

neuronull · 2024-05-21T15:24:54Z

I also wonder if it would make sense to just have a single timeout option that would apply to connect, operation and read. Is there a reason why people would absolutely need different values for each of these?

I would tend to agree with this. But am curious to hear if there is empirical need that we're not aware of.

jszwedko · 2024-05-21T20:38:27Z

I also wonder if it would make sense to just have a single timeout option that would apply to connect, operation and read. Is there a reason why people would absolutely need different values for each of these?

Do you mean one timeout that wraps all 3 of these? That is: a timeout that is on the sum of all three? Or that the same value would be used for each? I could see doing the former, but I don't think the latter would make sense since I'd expect each of those activities (connect, operation, and read) to typically take different amounts of time (operation timeout, in particular, would be the longest; where I might want to abort early if I can't connect).

I personally think it is fine to expose all three though. I anticipate the SDK exposes all three because there are valid use-cases for configuring them separately (granted I could be wrong in that assumption!).

andjhop · 2024-05-22T08:48:11Z

Regarding different timeout settings for S3 and SQS, the decision to split them was made following the advice here: #20017 (comment) with concern over the difficulty of choosing a single value which makes sense across different services.

I think that @jszwedko is right in that valid use cases for separately exposing the connect, operation and read settings in the SDK could reasonably translate to valid use case for exposing them in the source configuration. Also, for consistency, we already expose the separate connect and read settings for IMDS requests under auth.imds.

StephenWakely · 2024-05-24T14:14:31Z

Regarding different timeout settings for S3 and SQS, the decision to split them was made following the advice here: #20017 (comment) with concern over the difficulty of choosing a single value which makes sense across different services.

I think that @jszwedko is right in that valid use cases for separately exposing the connect, operation and read settings in the SDK could reasonably translate to valid use case for exposing them in the source configuration. Also, for consistency, we already expose the separate connect and read settings for IMDS requests under auth.imds.

Ah right, I completely missed that this has already been discussed in depth.

andjhop · 2024-05-30T09:49:02Z

To the goal of resolving issue #20017, I've made a few small changes. It's the long polling interval of SQS that's problematic here, so for now, I've only exposed the timeout settings for the SQS client. To be consistent with the structure we already have for IMDS requests, I've flattened the timeout settings under sqs, so how we already have auth.imds.{read,connect}_timeout_seconds we now also have sqs.{read,connect,operation}_timeout_seconds, each optional with no default value. I've also tried to emphasise in the documentation that consideration needs to be made for the value of sqs.poll_secs.

andjhop added 2 commits March 18, 2024 08:35

enhancement(AWS config): Make SQS client timeout settings configurable

0b34556

docs: Generate SQS client timeout docs

1c239ac

github-actions bot added domain: sources Anything related to the Vector's sources domain: sinks Anything related to the Vector's sinks domain: external docs Anything related to Vector's external, public documentation labels Mar 18, 2024

docs: Fix typos in SQS client timeout documentation

72ebe1a

andjhop marked this pull request as ready for review March 18, 2024 16:30

andjhop requested review from a team as code owners March 18, 2024 16:30

urseberry approved these changes Mar 18, 2024

View reviewed changes

jszwedko assigned StephenWakely Mar 25, 2024

hanley-patrick mentioned this pull request Apr 29, 2024

Vector Pod is not recovering from brief network disruptions after restarting CNI Pod #20337

Open

andjhop added 3 commits May 30, 2024 09:14

Remove timeout configuration for S3 client

a9d29bf

Flatten SQS timeout configuration and make fields optional

a850f43

docs: Regenerate SQS client timeout docs

2c1118d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

enhancement(AWS config): Make SQS client timeout settings configurable #20120

enhancement(AWS config): Make SQS client timeout settings configurable #20120

andjhop commented Mar 18, 2024 •

edited

bits-bot commented Mar 18, 2024 •

edited

andjhop commented Mar 18, 2024

StephenWakely commented Mar 25, 2024

andjhop commented Apr 3, 2024 •

edited

StephenWakely commented May 10, 2024

andjhop commented May 15, 2024 •

edited

StephenWakely commented May 21, 2024

neuronull commented May 21, 2024

jszwedko commented May 21, 2024

andjhop commented May 22, 2024 •

edited

StephenWakely commented May 24, 2024

andjhop commented May 30, 2024

enhancement(AWS config): Make SQS client timeout settings configurable #20120

Are you sure you want to change the base?

enhancement(AWS config): Make SQS client timeout settings configurable #20120

Conversation

andjhop commented Mar 18, 2024 • edited

bits-bot commented Mar 18, 2024 • edited

andjhop commented Mar 18, 2024

StephenWakely commented Mar 25, 2024

andjhop commented Apr 3, 2024 • edited

StephenWakely commented May 10, 2024

andjhop commented May 15, 2024 • edited

StephenWakely commented May 21, 2024

neuronull commented May 21, 2024

jszwedko commented May 21, 2024

andjhop commented May 22, 2024 • edited

StephenWakely commented May 24, 2024

andjhop commented May 30, 2024

andjhop commented Mar 18, 2024 •

edited

bits-bot commented Mar 18, 2024 •

edited

andjhop commented Apr 3, 2024 •

edited

andjhop commented May 15, 2024 •

edited

andjhop commented May 22, 2024 •

edited