[Data] - write_parquet enable both partition by & min_rows_per_file, max_rows_per_file #53930
base: master
Conversation
Signed-off-by: Goutam V <goutam@anyscale.com>
max_rows_per_file: [Experimental] The target maximum number of rows to write
    to each file. If ``None``, Ray Data writes a system-chosen number of
    rows to each file. If the number of rows per block is smaller than the
    specified value, Ray Data writes the number of rows per block to each file.
    The specified value is a hint, not a strict limit. Ray Data
    might write more or fewer rows to each file. If both ``min_rows_per_file``
    and ``max_rows_per_file`` are specified, ``max_rows_per_file`` takes
    precedence when they cannot both be satisfied.
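A usage sketch of what this PR enables. Parameter names such as partition_cols, min_rows_per_file, and max_rows_per_file are assumed from this PR's docstring and the current Ray Data API; check the released signature before relying on them.

import ray

# Hypothetical usage sketch; parameter names are assumptions based on this PR.
ds = ray.data.from_items(
    [{"group": i % 4, "value": i} for i in range(100_000)]
)
ds.write_parquet(
    "/tmp/output",
    partition_cols=["group"],   # write one directory per distinct group value
    min_rows_per_file=5_000,    # best-effort lower bound per file
    max_rows_per_file=20_000,   # takes precedence if both can't be satisfied
)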
Are we adding this parameter for all the other APIs that support min_rows_per_file, or just for write_parquet?
Good question. We probably should add this paging mechanism to other APIs, but for now I believe it's just write_parquet.
Maybe a better design for this approach is to leverage tagged unions so that we can more easily swap between options
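For illustration, the tagged-union idea could look something like this sketch. The names here are entirely hypothetical and not part of this PR; the point is that each variant spells out one row-count policy instead of two nullable integers.

from dataclasses import dataclass
from typing import Optional, Tuple, Union

# Hypothetical variants for the row-count policy (not code from this PR).
@dataclass
class ExactRowsPerFile:
    rows: int

@dataclass
class RowRangePerFile:
    min_rows: int
    max_rows: int

    def __post_init__(self):
        if self.min_rows > self.max_rows:
            raise ValueError("min_rows must be <= max_rows")

RowsPerFilePolicy = Union[ExactRowsPerFile, RowRangePerFile, None]

def resolve_policy(policy: RowsPerFilePolicy) -> Tuple[Optional[int], Optional[int]]:
    # Normalize the policy into (min_rows, max_rows) for the writer.
    if policy is None:
        return (None, None)
    if isinstance(policy, ExactRowsPerFile):
        return (policy.rows, policy.rows)
    return (policy.min_rows, policy.max_rows)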
I think it would apply to FileDatasinks, because I don't think SQL or any table-centric sink makes sense.
Signed-off-by: Goutam V <goutam@anyscale.com>
- effective_min_rows = _validate_rows_per_file_args(
-     num_rows_per_file=num_rows_per_file, min_rows_per_file=min_rows_per_file
+ effective_min_rows, effective_max_rows = _validate_rows_per_file_args(
Out of curiosity, should we try to keep it simple and only allow users to specify either min or max? It looks like in write_parquet, max takes precedence anyway.
Signed-off-by: Goutam V <goutam@anyscale.com>
self.min_rows_per_file = min_rows_per_file
self.max_rows_per_file = max_rows_per_file
Let's assert min <= max
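A validation sketch for this suggestion. The helper name _validate_rows_per_file_args comes from the diff above, but this body is illustrative only, and treating num_rows_per_file as a deprecated alias for the minimum is an assumption.

from typing import Optional, Tuple

def _validate_rows_per_file_args(
    num_rows_per_file: Optional[int] = None,
    min_rows_per_file: Optional[int] = None,
    max_rows_per_file: Optional[int] = None,
) -> Tuple[Optional[int], Optional[int]]:
    # Assumption: `num_rows_per_file` is the deprecated alias for the minimum.
    effective_min = (
        min_rows_per_file if min_rows_per_file is not None else num_rows_per_file
    )
    effective_max = max_rows_per_file
    if effective_min is not None and effective_max is not None:
        assert effective_min <= effective_max, (
            f"min_rows_per_file ({effective_min}) must be <= "
            f"max_rows_per_file ({effective_max})"
        )
    return effective_min, effective_max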
# Determine the effective row limit based on priority: max takes precedence
if self.max_rows_per_file is not None:
    # Split based on max_rows_per_file
    if total_rows <= self.max_rows_per_file:
        # Single file is sufficient
        self._write_single_file(
            path, [table], filename, output_schema, write_kwargs
        )
    else:
        # Need to split into multiple files
        self._split_and_write_table(
            table,
            path,
            filename,
            output_schema,
            write_kwargs,
        )
elif self.min_rows_per_file is not None:
    # Only min_rows_per_file is set
    if total_rows >= self.min_rows_per_file:
        # Single file meets minimum requirement
        self._write_single_file(
            path, [table], filename, output_schema, write_kwargs
        )
    else:
        # This case should be handled at a higher level by combining blocks
        # For now, write as single file
        self._write_single_file(
            path, [table], filename, output_schema, write_kwargs
        )
Not following this logic: there should be no precedence -- we enforce both min and max if both are specified (right now you're enforcing either one or the other).
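To illustrate the point, enforcing both bounds at once could look like the hypothetical helper below (not code from this PR): split into the fewest files that respect the max, then check whether the resulting per-file size still respects the min.

import math
from typing import Optional

def plan_rows_per_file(
    total_rows: int, min_rows: Optional[int], max_rows: Optional[int]
) -> int:
    # Fewest files that keep every file at or under max_rows.
    num_files = 1
    if max_rows is not None:
        num_files = max(1, math.ceil(total_rows / max_rows))
    rows_per_file = math.ceil(total_rows / num_files)
    if min_rows is not None and rows_per_file < min_rows:
        # Both bounds can't be met for this block alone; the caller must
        # decide which wins (this PR lets max_rows_per_file take precedence).
        pass
    return rows_per_file

# Example: 100 rows with min=30, max=40 -> 3 files of roughly 34 rows each,
# which satisfies both bounds.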
def _split_and_write_table(
    self,
    table: "pyarrow.Table",
    path: str,
    base_filename: str,
    output_schema: "pyarrow.Schema",
    write_kwargs: Dict[str, Any],
) -> None:
nit: Let's make this method more explicit relative to all of its deps (i.e. min/max).
estimated_max_rows = _get_max_chunk_size(
    table, self._data_context.target_max_block_size
)
Why do we need this?
We need to know the number of rows that corresponds to the target block size.
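For context, a rows-from-bytes estimate along these lines is presumably what _get_max_chunk_size does; this sketch is an assumption, not the actual implementation.

import pyarrow

def estimate_max_chunk_rows(
    table: "pyarrow.Table", target_max_block_size_bytes: int
) -> int:
    # Derive a row count from a byte-based block-size target using the
    # table's average bytes per row.
    if table.num_rows == 0 or table.nbytes == 0:
        return table.num_rows
    bytes_per_row = table.nbytes / table.num_rows
    return max(1, int(target_max_block_size_bytes / bytes_per_row))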
# Generate filename with index suffix
name_parts = base_filename.rsplit(".", 1)
if len(name_parts) == 2:
    chunk_filename = f"{name_parts[0]}_{file_idx:06d}.{name_parts[1]}"
else:
    chunk_filename = f"{base_filename}_{file_idx:06d}"
This should be deferred to PyArrow.
[I'm assuming you're referring to this API](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html)
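If so, a sketch of letting pyarrow.dataset.write_dataset handle the file splitting and filename indexing might look like this; the specific values are placeholders, and max_rows_per_file support depends on the PyArrow version.

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"group": [1, 1, 2, 2], "value": [10, 20, 30, 40]})

ds.write_dataset(
    table,
    base_dir="/tmp/output",
    format="parquet",
    partitioning=["group"],                 # directory partitioning by column
    basename_template="part-{i}.parquet",   # PyArrow fills in the file index
    max_rows_per_file=1_000_000,
    max_rows_per_group=100_000,             # must be <= max_rows_per_file
)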
Signed-off-by: Goutam V <goutam@anyscale.com>
Why are these changes needed?
Allow users to pass both partition by & min_rows_per_file, max_rows_per_file into write_parquet. max_rows_per_file is guaranteed, but not min_rows_per_file (it's deemed best effort).
Related issue number
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.