
[aws_s3 sink] output can't get files over 256Mb #22866

Open
cduverne opened this issue Apr 14, 2025 · 3 comments
Labels
sink: aws_s3 (Anything `aws_s3` sink related)
type: bug (A code related bug)

Comments

@cduverne

cduverne commented Apr 14, 2025

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Hello,

We're setting up a new Vector sink towards AWS S3, reading data from Kafka.
Unfortunately, even though gigabytes of data are read from Kafka, the output files are always capped at approximately 256 MB.

We've set up a disk buffer with a large max buffer size, without success.

Any help would be much appreciated.

Configuration

[sinks.aws_s3]
  type = "aws_s3"
  inputs = ["kafka_topic"]
  bucket = "{{ outputBucketName }}"
  endpoint = "{{ s3Endpoint }}"
  key_prefix = "file_{{ dateFormat }}"
  filename_extension = "{{ outputFileFormat }}"
  filename_time_format = ""
  filename_append_uuid = false
  compression = "none"

  # Encoding
  encoding.codec = "{{ outputFileFormat }}"
  encoding.csv.delimiter = "{{ delimiter }}"
  encoding.csv.quote_style = "{{ quoteStyle }}"
  encoding.csv.fields = ["md_version", "created_on", "topic", "id", "payload"]
  # Maximum size of internal buffer for writing CSV - 1GB in bytes
  encoding.csv.capacity = 1073741824

  # Healthcheck
  healthcheck.enabled = true
  # Bufferring events on the disk
  buffer.type = "disk"
  # Maximum 5Gb in the buffer
  buffer.max_size = 5368709120
  # If buffer is full wait for free space
  buffer.when_full = "block"

  # Every hour (in seconds)
  batch.timeout_secs = 3600
  # Maximum size of single batch - 1GB in bytes
  batch.max_bytes = 1073741824

Version

0.46.0

Debug Output


Example Data

No response

Additional Context

No response

References

#22839

@cduverne added the `type: bug` label Apr 14, 2025
@pront added the `sink: aws_s3` label Apr 14, 2025
@scMarkus
Contributor

@cduverne
Let me try to clarify the different size settings you are using.

First, `encoding.csv.capacity` is only an internal buffer of the CSV writer. It should be set to roughly the number of bytes a single CSV line takes; with the value above you are allocating about a gigabyte of which you will only ever fill kilobytes.

`buffer.max_size` is used when the downstream service does not accept the data you want to send. In the case of S3 I assume you are shipping data as quickly as possible, so this buffer should be empty most of the time; you might want to double-check that by looking at some metrics.

`batch.max_bytes` is probably the setting you want, but as the documentation states, it is measured on the events before they are serialized/compressed. I assume your events are large enough that they fill up the batch and shrink considerably once serialized, which makes the resulting file comparably small.

So on the one hand you may be able to trim down or compress your events with a remap transform before they hit the sink; on the other hand we need to wait for progress on #10281 and its sub-tasks.
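
To make the distinction concrete, here is a minimal sketch of how those settings might be sized following the explanation above; the numbers are illustrative assumptions, not recommendations:

[sinks.aws_s3]
  type = "aws_s3"
  inputs = ["kafka_topic"]
  # bucket, endpoint, key_prefix and encoding options as in the original config

  # Internal write buffer of the CSV writer: roughly one CSV line,
  # i.e. a few kilobytes, not gigabytes (illustrative value)
  encoding.csv.capacity = 8192

  # The disk buffer only fills up when S3 is slow or unavailable;
  # its size does not influence the size of the uploaded objects
  buffer.type = "disk"
  buffer.max_size = 1073741824   # 1 GiB, illustrative
  buffer.when_full = "block"

  # Batch size is measured on the events *before* serialization,
  # so the object written to S3 is usually noticeably smaller than this
  batch.max_bytes = 1073741824
  batch.timeout_secs = 3600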

@cduverne
Author

cduverne commented Apr 22, 2025

Hi @scMarkus! Thanks a lot for answering!

Let's break down your comment so I'm sure I fully get it. (Quite new to Vector, sorry.)
encoding.csv.capacity = 1073741824
-> If this should be set close to the size of one event, what is the downside of making it way bigger? The intent was to keep this parameter from trimming anything.
-> Could this trim events and prevent some of them from being pushed to the sink?

buffer.max_size = 5368709120
-> We push a CSV file to the S3 sink every 15 minutes, so we intended to keep a reasonable amount of data in the buffer. 5 GB might be too big, agreed. Could this oversized value prevent some events from ending up in the output CSV file?

batch.max_bytes = 1073741824
-> This is already set to 1 GB, but we never get more than 256-260 MB in a single file, even though we can see in the source that there are more events.

Regarding your comment about using a remap transform, here is what we implemented:

[transforms.topic]
  type = "remap"
  inputs = [{{ server_keys | join(', ') }}]
  source = '''
# Convert the full JSON event to a string and store it in .payload
    .payload = encode_json(.)

    # ddt_metadata_version
    .ddt_metadata_version = "1.1"

    # created_on
    .created_on = format_timestamp!(now(), "%Y%m%d%H%M%S")

    # tm_topic
    .tm_topic = "topic"

    # tm_id - unique identifier of each event
    .tm_id = uuid_v7(now())
  '''

I'm a bit lost though. We should be getting 20M events a day, but we only get 11M.
Thanks a lot!

@scMarkus
Contributor

Hi @cduverne,

On `encoding.csv.capacity`: it so happens that I implemented this part in Vector (the ideas are just borrowed from the Rust csv crate), and in fact it never trims anything. It is just the size of a write buffer, meaning that if it is chosen smaller than the length of a line it will simply do multiple flushes within a single line (which may be slower). A large value does not hurt if you have memory to spare.

On `buffer.max_size`: it would only drop data if Vector could not move your data downstream (to S3) and you had additionally configured Vector to drop new incoming data when the buffer is full. The size setting itself does not drop or trim data.

On `batch.max_bytes`: as I understand your case, this is the one you want to make as large as possible, at the cost of more memory.
And again, it measures the deserialized form of the events, not the serialized output in S3.

Your transform seems to only add attributes to your event. If possible, you could parse your incoming event and extract only the data you need; if all the information from the event is needed, then it can't be helped.
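
As a rough sketch of that idea, assuming the incoming Kafka message already contains the fields the CSV needs (the field names below are hypothetical), the remap could serialize only those fields into `.payload` instead of the whole event:

[transforms.topic]
  type = "remap"
  inputs = [{{ server_keys | join(', ') }}]
  source = '''
    # Hypothetical example: keep only the fields the CSV actually needs
    # (field names are made up, adjust to your schema) instead of
    # embedding the entire event as JSON in .payload
    trimmed = {
      "order_id": .order_id,
      "status": .status
    }
    .payload = encode_json(trimmed)

    # Metadata fields as in the original transform
    .ddt_metadata_version = "1.1"
    .created_on = format_timestamp!(now(), "%Y%m%d%H%M%S")
    .tm_topic = "topic"
    .tm_id = uuid_v7(now())
  '''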

I do not spot anything that would be dropping data, as far as my understanding goes. If you are concerned, I recommend having a look at `vector top`, which shows live counters for all the components in your topology.
