
[aws_s3 sink] output can't get files over 256Mb #22866

Open
cduverne opened this issue Apr 14, 2025 · 3 comments
Labels
sink: aws_s3 (Anything `aws_s3` sink related)
type: bug (A code related bug)

Comments

@cduverne

cduverne commented Apr 14, 2025

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Hello,

We're setting up a new Vector sink towards AWS S3, reading data from Kafka.
Unfortunately, even though gigabytes of data are read from Kafka, the output files are always capped at approximately 256 MB.

We've set up a disk buffer with a large max buffer size, without success.

Any help would be much appreciated.

Configuration

[sinks.aws_s3]
  type = "aws_s3"
  inputs = ["kafka_topic"]
  bucket = "{{ outputBucketName }}"
  endpoint = "{{ s3Endpoint }}"
  key_prefix = "file_{{ dateFormat }}"
  filename_extension = "{{ outputFileFormat }}"
  filename_time_format = ""
  filename_append_uuid = false
  compression = "none"

  # Encoding
  encoding.codec = "{{ outputFileFormat }}"
  encoding.csv.delimiter = "{{ delimiter }}"
  encoding.csv.quote_style = "{{ quoteStyle }}"
  encoding.csv.fields = ["md_version", "created_on", "topic", "id", "payload"]
  # Maximum size of internal buffer for writing CSV - 1GB in bytes
  encoding.csv.capacity = 1073741824

  # Healthcheck
  healthcheck.enabled = true
  # Bufferring events on the disk
  buffer.type = "disk"
  # Maximum 5Gb in the buffer
  buffer.max_size = 5368709120
  # If buffer is full wait for free space
  buffer.when_full = "block"

  # Every hour (in seconds)
  batch.timeout_secs = 3600
  # Maximum size of single batch - 1GB in bytes
  batch.max_bytes = 1073741824

Version

0.46.0

Debug Output


Example Data

No response

Additional Context

No response

References

#22839

@cduverne added the `type: bug` label Apr 14, 2025
@pront added the `sink: aws_s3` label Apr 14, 2025
@scMarkus
Contributor

@cduverne
Let me try to clarify the different size settings you are using.

First, `encoding.csv.capacity` is only an internal buffer of the CSV writer. It should be set to roughly the number of bytes a single CSV line takes; with the value above you are allocating about a gigabyte of which you will only ever fill kilobytes.

`buffer.max_size` is used when the downstream service does not accept the data you want to send. In the case of S3 I assume you are shipping data as quickly as possible, so this buffer should be empty most of the time; you might want to double-check that by looking at some metrics.

`batch.max_bytes` is probably the setting you want, but as the documentation states, it is measured on the events before they are serialized/compressed. I assume your events are large enough that they fill up the batch and shrink considerably once serialized, which makes the resulting file comparably small.

So on the one hand you may be able to trim down or compress your events with a remap transform before they hit the sink; on the other hand we need to wait for progress on #10281 and its sub-tasks.
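
To make the distinction concrete, here is a minimal sketch of how those settings might be sized following the explanation above; the numbers are illustrative assumptions, not recommendations:

[sinks.aws_s3]
  type = "aws_s3"
  inputs = ["kafka_topic"]
  # bucket, endpoint, key_prefix and encoding options as in the original config

  # Internal write buffer of the CSV writer: roughly one CSV line,
  # i.e. a few kilobytes, not gigabytes (illustrative value)
  encoding.csv.capacity = 8192

  # The disk buffer only fills up when S3 is slow or unavailable;
  # its size does not influence the size of the uploaded objects
  buffer.type = "disk"
  buffer.max_size = 1073741824   # 1 GiB, illustrative
  buffer.when_full = "block"

  # Batch size is measured on the events *before* serialization,
  # so the object written to S3 is usually noticeably smaller than this
  batch.max_bytes = 1073741824
  batch.timeout_secs = 3600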

@cduverne
Author

cduverne commented Apr 22, 2025

Hi @scMarkus! Thanks a lot for answering!

Let's break down your comment so I'm sure I fully get it. (Quite new to Vector, sorry.)
encoding.csv.capacity = 1073741824
-> If this should be set close to the size of one event, what is the downside of making it way bigger? The intent was to keep this parameter from trimming anything.
-> Could this trim events and prevent some of them from being pushed to the sink?

buffer.max_size = 5368709120
-> We push a CSV file to the S3 sink every 15 minutes, so we intended to keep a reasonable amount of data in the buffer. 5 GB might be too big, agreed. Could this oversized value prevent some events from ending up in the output CSV file?

batch.max_bytes = 1073741824
-> This is already set to 1 GB, but we never get more than 256-260 MB in a single file, even though we can see in the source that there are more events.

Regarding your comment about using a remap transform, here is what we implemented:

[transforms.topic]
  type = "remap"
  inputs = [{{ server_keys | join(', ') }}]
  source = '''
# Convert the full JSON event to a string and store it in .payload
    .payload = encode_json(.)

    # ddt_metadata_version
    .ddt_metadata_version = "1.1"

    # created_on
    .created_on = format_timestamp!(now(), "%Y%m%d%H%M%S")

    # tm_topic
    .tm_topic = "topic"

    # tm_id - unique identifier of each event
    .tm_id = uuid_v7(now())
  '''

I'm a bit lost though. We should be getting 20M events a day, but we only get 11M.
Thanks a lot!

@scMarkus
Contributor

Hi @cduverne,

On `encoding.csv.capacity`: it so happens that I implemented this part in Vector (the ideas are just borrowed from the Rust csv crate), and in fact it never trims anything. It is just the size of a write buffer, meaning that if it is chosen smaller than the length of a line it will simply do multiple flushes within a single line (which may be slower). A large value does not hurt if you have memory to spare.

On `buffer.max_size`: it would only drop data if Vector could not move your data downstream (to S3) and you had additionally configured Vector to drop new incoming data when the buffer is full. The size setting itself does not drop or trim data.

On `batch.max_bytes`: as I understand your case, this is the one you want to make as large as possible, at the cost of more memory.
And again, it measures the deserialized form of the events, not the serialized output in S3.

Your transform seems to only add attributes to your event. If possible, you could parse your incoming event and extract only the data you need; if all the information from the event is needed, then it can't be helped.
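
As a rough sketch of that idea, assuming the incoming Kafka message already contains the fields the CSV needs (the field names below are hypothetical), the remap could serialize only those fields into `.payload` instead of the whole event:

[transforms.topic]
  type = "remap"
  inputs = [{{ server_keys | join(', ') }}]
  source = '''
    # Hypothetical example: keep only the fields the CSV actually needs
    # (field names are made up, adjust to your schema) instead of
    # embedding the entire event as JSON in .payload
    trimmed = {
      "order_id": .order_id,
      "status": .status
    }
    .payload = encode_json(trimmed)

    # Metadata fields as in the original transform
    .ddt_metadata_version = "1.1"
    .created_on = format_timestamp!(now(), "%Y%m%d%H%M%S")
    .tm_topic = "topic"
    .tm_id = uuid_v7(now())
  '''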

I do not spot anything that would be dropping data, as far as my understanding goes. If you are concerned, I recommend having a look at `vector top`, which shows live counters for all the components in your topology.
