[data] Allow users to specify how many rows/bytes to write per file for write ops #41219
Labels: data (Ray Data-related issues), data-ux, P0 (Issues that should be fixed in short order), ray 2.10, size:small
When we have a massive amount of data to write as Parquet files, we usually want to save many rows per file instead of ending up with a huge number of small files.
Today the write API doesn't have a parameter to control how many rows/bytes to write per file. One workaround is to insert an identity `map_batches` op: `ds.map_batches(lambda x: x, batch_size=N).write_parquet(...)`. But the output of `map_batches` will still be re-chunked based on `target_max_block_size`, so the number of rows per output file isn't guaranteed.
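To make the requested behavior concrete, here is a minimal stdlib-only sketch (not Ray's implementation; `plan_files` and the `rows_per_file` parameter name are hypothetical) of how a writer could partition a dataset's rows into output files capped at a given row count:

```python
# Hypothetical sketch: split `num_rows` rows into per-file chunks of at
# most `rows_per_file` rows each -- the guarantee a rows-per-file write
# parameter would provide, independent of internal block sizes.
def plan_files(num_rows: int, rows_per_file: int) -> list[range]:
    """Return one range of row indices per output file."""
    return [
        range(start, min(start + rows_per_file, num_rows))
        for start in range(0, num_rows, rows_per_file)
    ]

# 10 rows with at most 4 rows per file -> 3 files of 4, 4, and 2 rows.
print([len(r) for r in plan_files(10, 4)])  # [4, 4, 2]
```

A real implementation would also need to coalesce rows across block boundaries, since today's blocks are sized by `target_max_block_size` rather than by the user's requested file size.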