enhancement(AWS config): Make SQS client timeout settings configurable #20120
base: master
Conversation
I've left the precise naming and structure of the new settings out of the description, awaiting further guidance. SQS is the primary point of concern, but this configuration could potentially be extended to all AWS sources and sinks.
@andjhop Do you have a reliable way to replicate the issue that this PR is fixing? I'm not quite sure how I could get SQS to time out...
Hi @StephenWakely, yes. I've been testing by "swapping out" docker networks. First create two docker networks:

```shell
docker network create network1
docker network create network2
```

Then start the environment on the first network:

```shell
ENVIRONMENT_NETWORK=network1 make environment
```

With Vector running, disconnect from the first network and connect to the second:

```shell
docker network disconnect network1 vector-environment
docker network connect network2 vector-environment
```
I'm not able to see this fixed. I have this config:

```yaml
sources:
  in:
    type: aws_s3
    decoding:
      codec: bytes
    region: us-east-1
    endpoint: http://zork:4566
    sqs:
      queue_url: http://sqs.us-east-1.localhost.localstack.cloud:4566/000000000000/zork
      delete_message: true
    timeout:
      connect_timeout_seconds: 1
      operation_timeout_seconds: 1
      read_timeout_seconds: 1
```

Running against localstack using the above repro steps, things work fine until I disconnect the network. Then Vector no longer picks up any object pushed to the bucket. Any ideas what I am doing wrong?
Hi @StephenWakely, yes. The
Ok, yes that works. I wonder though, is there any reason why you would want to have different timeout settings for S3 and SQS? From a UX perspective this could get confusing. Unless there is a pressing reason for it, I would be tempted to propagate any timeout settings from

I also wonder if it would make sense to just have a single

Perhaps someone from @vectordotdev/ux-team could chime in?
This should be emphasised in the docs. If I have a timeout of 15 seconds or less, my logs get flooded with:

which could easily lead a user to think that something catastrophic has occurred. Is there a way we could validate this at run time?
I would tend to agree with this, but am curious to hear if there is an empirical need that we're not aware of.
Do you mean one timeout that wraps all three of these, i.e. a timeout on the sum of all three? Or that the same value would be used for each? I could see doing the former, but I don't think the latter would make sense, since I'd expect each of those activities (connect, operation, and read) to typically take different amounts of time (the operation timeout, in particular, would be the longest, whereas I might want to abort early if I can't connect).

I personally think it is fine to expose all three, though. I anticipate the SDK exposes all three because there are valid use cases for configuring them separately (granted, I could be wrong in that assumption!).
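To make the distinction between the three settings concrete, here is a sketch using the field names exercised earlier in this thread; the values and comments are illustrative assumptions, not defaults from this PR:

```yaml
timeout:
  # Assumed semantics, per the usual AWS SDK timeout phases.
  # Time allowed to establish the connection to the service.
  connect_timeout_seconds: 5
  # Upper bound on the whole operation, including retries; typically the largest.
  operation_timeout_seconds: 30
  # Time allowed waiting to read the response over an established connection.
  read_timeout_seconds: 10
```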
Regarding different timeout settings for S3 and SQS, the decision to split them was made following the advice here: #20017 (comment), out of concern over the difficulty of choosing a single value which makes sense across different services. I think @jszwedko is right that valid use cases for separately exposing the connect, operation, and read settings in the SDK could reasonably translate to valid use cases for exposing them in the source configuration. Also, for consistency, we already expose the separate connect and read settings for IMDS requests under
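For reference, a sketch of what those existing IMDS settings look like; the exact field path (`auth.imds` here) is my assumption and should be checked against the Vector docs:

```yaml
# Assumed field path and illustrative values only.
auth:
  imds:
    connect_timeout_seconds: 1
    read_timeout_seconds: 1
```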
Ah right, I completely missed that this has already been discussed in depth. |
Toward resolving issue #20017, I've made a few small changes. It's the long polling interval of SQS that's problematic here, so for now I've only exposed the timeout settings for the SQS client. To be consistent with the structure we already have for IMDS requests, I've flattened the timeout settings under
Issue #20017 highlights a problem with the AWS S3 source, whereby it may stall, requiring a restart of Vector. This happens when Vector is unable to detect broken connections, for example when a connection is lost at the network level and will never be properly terminated. SQS connections seem to be particularly prone to this because of their typical long polling interval. This change is an attempt to resolve #20017 by exposing the timeout configuration of the underlying AWS client under a new optional field.
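Based on the field names seen earlier in this thread, a sketch of how the new optional settings might look on an `aws_s3` source; the placement under `sqs`, the queue URL, and the values are all assumptions for illustration, not the final shape of the PR:

```yaml
sources:
  my_s3_source:
    type: aws_s3
    region: us-east-1
    sqs:
      # Placeholder queue URL.
      queue_url: https://sqs.us-east-1.amazonaws.com/123456789012/my-queue
      # Assumed placement of the new SQS client timeout settings.
      timeout:
        connect_timeout_seconds: 5
        operation_timeout_seconds: 30
        read_timeout_seconds: 10
```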