[data] fix: forward try_create_dir to pyarrow.dataset.write_dataset #58302

Merged
goutamvenkat-anyscale merged 3 commits into
ray-project:master from
ljstrnadiii:try_create_dir_write_dataset
May 8, 2026

Conversation

@ljstrnadiii
Contributor

@ljstrnadiii ljstrnadiii commented Oct 30, 2025

Description

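Consider the case where my AWS role has permissions for only a prefix in a bucket, not the entire bucket, and we use [write_parquet](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.write_parquet.html) to write Parquet files. Creating the directories then fails with a permissions error, which would not happen if the role had permissions for the entire bucket.

I noticed that `try_create_dir` is not passed to the underlying `pyarrow.dataset.write_dataset` call [here](https://github.com/ray-project/ray/blob/eec5c69db0d7625326f7f5430d4204e5ec31037a/python/ray/data/_internal/datasource/parquet_datasink.py#L276). I assume `write_dataset` recursively checks all "subdirs" in the S3 path and assumes we can call [get_file_info](https://github.com/apache/arrow/blob/430ad81c2b563bc2d57e81bef76da0c4bddc95e8/python/pyarrow/fs.py#L315) on them. I stopped at [python/pyarrow/_dataset.pyx](https://github.com/apache/arrow/blob/430ad81c2b563bc2d57e81bef76da0c4bddc95e8/python/pyarrow/_dataset.pyx#L4132) while investigating.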

Related issues

Additional information

Fixes:

Trace:
      ...
        ds.write_dataset(
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/pyarrow/dataset.py", line 1035, in write_dataset
        _filesystemdataset_write(
      File "pyarrow/_dataset.pyx", line 4177, in pyarrow._dataset._filesystemdataset_write
      File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
      File "pyarrow/_fs.pyx", line 1551, in pyarrow._fs._cb_create_dir
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1278, in create_dir
        return self._retry_operation(
               ^^^^^^^^^^^^^^^^^^^^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1259, in _retry_operation
        return call_with_retry(
               ^^^^^^^^^^^^^^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1436, in call_with_retry
        raise e from None
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1422, in call_with_retry
        return f()
               ^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1279, in <lambda>
        lambda: self._fs.create_dir(path, recursive=recursive),
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "pyarrow/_fs.pyx", line 638, in pyarrow._fs.FileSystem.create_dir
      File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
    OSError: When testing for existence of bucket 'my-bucket': AWS Error ACCESS_DENIED during HeadBucket operation: No response body.

Message:

    RayTaskError(OSError): ray::Write() (pid=1654, ip=10.121.31.224)
        for b_out in map_transformer.apply_transform(iter(blocks), ctx):
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 377, in __call__
        yield from self._block_fn(input, ctx)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/planner/plan_write_op.py", line 68, in fn
        block_accessors = [BlockAccessor.for_block(block) for block in blocks]
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/planner/plan_write_op.py", line 68, in <listcomp>
        block_accessors = [BlockAccessor.for_block(block) for block in blocks]
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 377, in __call__
        yield from self._block_fn(input, ctx)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/planner/plan_write_op.py", line 48, in fn
        ctx.kwargs["_datasink_write_return"] = datasink_or_legacy_datasource.write(
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/datasource/parquet_datasink.py", line 204, in write
        call_with_retry(
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1436, in call_with_retry
        raise e from None
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1422, in call_with_retry
        return f()
               ^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/datasource/parquet_datasink.py", line 194, in write_blocks_to_path
        self._write_parquet_files(
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/datasource/parquet_datasink.py", line 281, in _write_parquet_files
        ds.write_dataset(
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/pyarrow/dataset.py", line 1035, in write_dataset
        _filesystemdataset_write(
      File "pyarrow/_dataset.pyx", line 4177, in pyarrow._dataset._filesystemdataset_write
      File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
      File "pyarrow/_fs.pyx", line 1551, in pyarrow._fs._cb_create_dir
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1278, in create_dir
        return self._retry_operation(
               ^^^^^^^^^^^^^^^^^^^^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1259, in _retry_operation
        return call_with_retry(
               ^^^^^^^^^^^^^^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1436, in call_with_retry
        raise e from None
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1422, in call_with_retry
        return f()
               ^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1279, in <lambda>
        lambda: self._fs.create_dir(path, recursive=recursive),
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "pyarrow/_fs.pyx", line 638, in pyarrow._fs.FileSystem.create_dir
      File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
    OSError: When testing for existence of bucket 'my-bucket': AWS Error ACCESS_DENIED during HeadBucket operation: No response body.
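
The failure in the traceback above can be illustrated with a toy model that needs neither AWS nor pyarrow. Everything here is an assumption made for illustration: `PrefixScopedStore` and `write_file` are hypothetical names, and the store only mimics the one behavior visible in the error message, namely that directory creation first tests for the bucket (a bucket-level call) before any object is written. It is a sketch of the permission mismatch, not of pyarrow's S3 implementation.

```python
class PrefixScopedStore:
    """Toy object store whose caller may only touch keys under one prefix.

    Hypothetical stand-in for S3 plus a prefix-scoped role; not pyarrow's
    S3FileSystem.
    """

    def __init__(self, bucket: str, allowed_prefix: str):
        self.bucket = bucket
        self.allowed_prefix = allowed_prefix.strip("/") + "/"
        self.objects = {}

    def head_bucket(self) -> None:
        # Bucket-level call: denied because the role is scoped to a prefix.
        raise PermissionError(
            f"ACCESS_DENIED during HeadBucket on bucket {self.bucket!r}"
        )

    def put_object(self, key: str, body: bytes) -> None:
        # Object-level call: allowed only under the permitted prefix.
        if not key.startswith(self.allowed_prefix):
            raise PermissionError(f"ACCESS_DENIED during PutObject on {key!r}")
        self.objects[key] = body

    def create_dir(self, path: str) -> None:
        # Mirrors the traceback: creating a directory first tests that the
        # bucket exists, which requires bucket-level access.
        self.head_bucket()


def write_file(store, key, body, create_dir=True):
    """Write one object, optionally 'creating' its parent directory first."""
    if create_dir:
        store.create_dir(key.rsplit("/", 1)[0])
    store.put_object(key, body)


store = PrefixScopedStore("my-bucket", "team-a/output")

# Default behavior: the directory step hits the bucket-level check and fails.
try:
    write_file(store, "team-a/output/part-0.parquet", b"data", create_dir=True)
    default_outcome = "ok"
except PermissionError as exc:
    default_outcome = f"failed: {exc}"

# Skipping directory creation only needs object-level access under the prefix.
write_file(store, "team-a/output/part-0.parquet", b"data", create_dir=False)

print(default_outcome)
print(sorted(store.objects))
```

In this model the write succeeds only when the directory step is skipped, which is the behavior `try_create_dir=False` is presumably meant to give once the flag reaches `write_dataset`.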

Signed-off-by: ljstrnadiii <ljstrnadiii@gmail.com>
@ljstrnadiii ljstrnadiii marked this pull request as ready for review October 30, 2025 03:13
@ljstrnadiii ljstrnadiii requested a review from a team as a code owner October 30, 2025 03:13
@ray-gardener ray-gardener Bot added data Ray Data-related issues community-contribution Contributed by the community labels Oct 30, 2025
@ljstrnadiii
Contributor Author

@goutamvenkat-anyscale was there any particular reason we kept the try_create_dir arg out of ds.write_dataset?

@goutamvenkat-anyscale
Contributor

> @goutamvenkat-anyscale was there any particular reason we kept the try_create_dir arg out of ds.write_dataset?

Thanks for fixing this. It may have been left out by accident. Could you please add a simple test for it?
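
One possible shape for such a test, sketched with only the standard library. The function `write_parquet_files` below is a hypothetical stand-in for the datasink's write helper, not Ray's actual code, and the writer is injected so a mock can replace `pyarrow.dataset.write_dataset`. The one detail taken from pyarrow is that `write_dataset` accepts a `create_dir` keyword; how Ray wires `try_create_dir` into it is assumed from the PR title.

```python
from unittest import mock


def write_parquet_files(write_dataset, tables, path, try_create_dir):
    """Hypothetical stand-in for the datasink's write helper.

    `write_dataset` is injected so a test can pass a mock in place of
    pyarrow.dataset.write_dataset.
    """
    write_dataset(
        tables,
        base_dir=path,
        format="parquet",
        create_dir=try_create_dir,  # assumed forwarding, per the PR title
    )


def forwarded_create_dir(flag):
    """Call the helper with a mock writer and return the create_dir it saw."""
    fake_write_dataset = mock.MagicMock(name="write_dataset")
    write_parquet_files(fake_write_dataset, ["table"], "bucket/prefix", flag)
    fake_write_dataset.assert_called_once()
    return fake_write_dataset.call_args.kwargs["create_dir"]


# The flag should reach the writer unchanged for both values.
forwarded = [forwarded_create_dir(flag) for flag in (True, False)]
print(forwarded)
```

A test against the real datasink would likely patch `write_dataset` where the datasink module looks it up instead of injecting it, but the assertion on `call_args.kwargs` would stay the same.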

@bveeramani bveeramani added the @external-author-action-required Alternate tag for PRs where the author doesn't have labeling permission. label Nov 13, 2025
@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions Bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Nov 27, 2025
@goutamvenkat-anyscale
Contributor

@ljstrnadiii Gentle Ping. Just wanted to follow up on this PR. Thanks.

@github-actions github-actions Bot added unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Dec 1, 2025
@omatthew98 omatthew98 changed the title fix: forward try_create_dir to pyarrow.dataset.write_dataset [data] fix: forward try_create_dir to pyarrow.dataset.write_dataset Dec 4, 2025
@ljstrnadiii
Contributor Author

@goutamvenkat-anyscale I'll try to add a test early next week!

@iamjustinhsu
Contributor

Hi @ljstrnadiii, are you still working on this?

@goutamvenkat-anyscale goutamvenkat-anyscale enabled auto-merge (squash) May 8, 2026 20:21
@github-actions github-actions Bot added the go add ONLY when ready to merge, run all tests label May 8, 2026
@goutamvenkat-anyscale goutamvenkat-anyscale removed the unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. label May 8, 2026
@goutamvenkat-anyscale goutamvenkat-anyscale merged commit e30975c into ray-project:master May 8, 2026
7 of 8 checks passed
chillCode404 pushed a commit to chillCode404/ray-contrib that referenced this pull request May 9, 2026
…t` (ray-project#58302)

dancingactor pushed a commit to dancingactor/ray that referenced this pull request May 13, 2026
…t` (ray-project#58302)

am-kinetica pushed a commit to kineticadb/ray that referenced this pull request May 14, 2026
…t` (ray-project#58302)

Signed-off-by: anindyam1969 <amukherjee@kinetica.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…t` (ray-project#58302)

## Description
Consider the case where my AWS role has permissions for only a prefix in
a bucket, not the entire bucket, and we use
[write_parquet](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.write_parquet.html)
to write Parquet files. Creating the directories then fails with a
permissions error, which would not happen if the role had permissions for
the entire bucket.

I noticed that `try_create_dir` is not passed to the underlying
`pyarrow.dataset.write_dataset` call
[here](https://github.com/ray-project/ray/blob/eec5c69db0d7625326f7f5430d4204e5ec31037a/python/ray/data/_internal/datasource/parquet_datasink.py#L276).
I assume `write_dataset` recursively checks all "subdirs" in the S3
path and assumes we can call
[get_file_info](https://github.com/apache/arrow/blob/430ad81c2b563bc2d57e81bef76da0c4bddc95e8/python/pyarrow/fs.py#L315)
on them. I stopped at
[python/pyarrow/_dataset.pyx](https://github.com/apache/arrow/blob/430ad81c2b563bc2d57e81bef76da0c4bddc95e8/python/pyarrow/_dataset.pyx#L4132)
while investigating.

## Related issues

## Additional information
Fixes:
```python
Trace:
      ...
        ds.write_dataset(
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/pyarrow/dataset.py", line 1035, in write_dataset
        _filesystemdataset_write(
      File "pyarrow/_dataset.pyx", line 4177, in pyarrow._dataset._filesystemdataset_write
      File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
      File "pyarrow/_fs.pyx", line 1551, in pyarrow._fs._cb_create_dir
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1278, in create_dir
        return self._retry_operation(
               ^^^^^^^^^^^^^^^^^^^^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1259, in _retry_operation
        return call_with_retry(
               ^^^^^^^^^^^^^^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1436, in call_with_retry
        raise e from None
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1422, in call_with_retry
        return f()
               ^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1279, in <lambda>
        lambda: self._fs.create_dir(path, recursive=recursive),
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "pyarrow/_fs.pyx", line 638, in pyarrow._fs.FileSystem.create_dir
      File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
    OSError: When testing for existence of bucket 'my-bucket': AWS Error ACCESS_DENIED during HeadBucket operation: No response body.

Message:

    RayTaskError(OSError): ray::Write() (pid=1654, ip=10.121.31.224)
        for b_out in map_transformer.apply_transform(iter(blocks), ctx):
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 377, in __call__
        yield from self._block_fn(input, ctx)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/planner/plan_write_op.py", line 68, in fn
        block_accessors = [BlockAccessor.for_block(block) for block in blocks]
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/planner/plan_write_op.py", line 68, in <listcomp>
        block_accessors = [BlockAccessor.for_block(block) for block in blocks]
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 377, in __call__
        yield from self._block_fn(input, ctx)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/planner/plan_write_op.py", line 48, in fn
        ctx.kwargs["_datasink_write_return"] = datasink_or_legacy_datasource.write(
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/datasource/parquet_datasink.py", line 204, in write
        call_with_retry(
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1436, in call_with_retry
        raise e from None
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1422, in call_with_retry
        return f()
               ^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/datasource/parquet_datasink.py", line 194, in write_blocks_to_path
        self._write_parquet_files(
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/datasource/parquet_datasink.py", line 281, in _write_parquet_files
        ds.write_dataset(
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/pyarrow/dataset.py", line 1035, in write_dataset
        _filesystemdataset_write(
      File "pyarrow/_dataset.pyx", line 4177, in pyarrow._dataset._filesystemdataset_write
      File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
      File "pyarrow/_fs.pyx", line 1551, in pyarrow._fs._cb_create_dir
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1278, in create_dir
        return self._retry_operation(
               ^^^^^^^^^^^^^^^^^^^^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1259, in _retry_operation
        return call_with_retry(
               ^^^^^^^^^^^^^^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1436, in call_with_retry
        raise e from None
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1422, in call_with_retry
        return f()
               ^^^
      File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1279, in <lambda>
        lambda: self._fs.create_dir(path, recursive=recursive),
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "pyarrow/_fs.pyx", line 638, in pyarrow._fs.FileSystem.create_dir
      File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
    OSError: When testing for existence of bucket 'my-bucket': AWS Error ACCESS_DENIED during HeadBucket operation: No response body.
```

Signed-off-by: ljstrnadiii <ljstrnadiii@gmail.com>
Co-authored-by: Goutam <goutam@anyscale.com>

Labels

- community-contribution: Contributed by the community
- data: Ray Data-related issues
- @external-author-action-required: Alternate tag for PRs where the author doesn't have labeling permission.
- go: add ONLY when ready to merge, run all tests
