[data] fix: forward try_create_dir to pyarrow.dataset.write_dataset #58302
Signed-off-by: ljstrnadiii <ljstrnadiii@gmail.com>
@goutamvenkat-anyscale was there any particular reason we kept the
Thanks for fixing. It might have been left out by accident. Can you please add a simple test for this?
@ljstrnadiii Gentle Ping. Just wanted to follow up on this PR. Thanks.
@goutamvenkat-anyscale I'll try to add a test early next week!
Hi @ljstrnadiii, are you still working on this?
Merged commit e30975c into ray-project:master
…t` (ray-project#58302)

## Description

Consider the case where my aws role has permissions to only a prefix in a bucket and not the entire bucket and we use [write_parquet](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.write_parquet.html) to write parquet. This causes permissions issues when we attempt to create the dirs, which would normally be ok if a role has permissions to the entire bucket.

I noticed `try_create_dir` does not get passed to the underlying parquet dataset write_dataset function [here](https://github.com/ray-project/ray/blob/eec5c69db0d7625326f7f5430d4204e5ec31037a/python/ray/data/_internal/datasource/parquet_datasink.py#L276), which I am assuming likely recursively checks all "subdirs" in the S3 path and assumes we can [get_file_info](https://github.com/apache/arrow/blob/430ad81c2b563bc2d57e81bef76da0c4bddc95e8/python/pyarrow/fs.py#L315). I stopped at [python/pyarrow/_dataset.pyx](https://github.com/apache/arrow/blob/430ad81c2b563bc2d57e81bef76da0c4bddc95e8/python/pyarrow/_dataset.pyx#L4132) while investigating.

## Related issues

## Additional information

Fixes:

```python
Trace: ...
    ds.write_dataset(
  File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/pyarrow/dataset.py", line 1035, in write_dataset
    _filesystemdataset_write(
  File "pyarrow/_dataset.pyx", line 4177, in pyarrow._dataset._filesystemdataset_write
  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
  File "pyarrow/_fs.pyx", line 1551, in pyarrow._fs._cb_create_dir
  File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1278, in create_dir
    return self._retry_operation(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1259, in _retry_operation
    return call_with_retry(
           ^^^^^^^^^^^^^^^^
  File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1436, in call_with_retry
    raise e from None
  File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1422, in call_with_retry
    return f()
           ^^^
  File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1279, in <lambda>
    lambda: self._fs.create_dir(path, recursive=recursive),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_fs.pyx", line 638, in pyarrow._fs.FileSystem.create_dir
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: When testing for existence of bucket 'my-bucket': AWS Error ACCESS_DENIED during HeadBucket operation: No response body.

Message: RayTaskError(OSError): ray::Write() (pid=1654, ip=10.121.31.224)
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 377, in __call__
    yield from self._block_fn(input, ctx)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/planner/plan_write_op.py", line 68, in fn
    block_accessors = [BlockAccessor.for_block(block) for block in blocks]
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/planner/plan_write_op.py", line 68, in <listcomp>
    block_accessors = [BlockAccessor.for_block(block) for block in blocks]
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 377, in __call__
    yield from self._block_fn(input, ctx)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/planner/plan_write_op.py", line 48, in fn
    ctx.kwargs["_datasink_write_return"] = datasink_or_legacy_datasource.write(
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/datasource/parquet_datasink.py", line 204, in write
    call_with_retry(
  File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1436, in call_with_retry
    raise e from None
  File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1422, in call_with_retry
    return f()
           ^^^
  File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/datasource/parquet_datasink.py", line 194, in write_blocks_to_path
    self._write_parquet_files(
  File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/datasource/parquet_datasink.py", line 281, in _write_parquet_files
    ds.write_dataset(
  File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/pyarrow/dataset.py", line 1035, in write_dataset
    _filesystemdataset_write(
  File "pyarrow/_dataset.pyx", line 4177, in pyarrow._dataset._filesystemdataset_write
  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
  File "pyarrow/_fs.pyx", line 1551, in pyarrow._fs._cb_create_dir
  File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1278, in create_dir
    return self._retry_operation(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1259, in _retry_operation
    return call_with_retry(
           ^^^^^^^^^^^^^^^^
  File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1436, in call_with_retry
    raise e from None
  File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1422, in call_with_retry
    return f()
           ^^^
  File "/wherobots-rasterflow/.venv/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1279, in <lambda>
    lambda: self._fs.create_dir(path, recursive=recursive),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_fs.pyx", line 638, in pyarrow._fs.FileSystem.create_dir
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: When testing for existence of bucket 'my-bucket': AWS Error ACCESS_DENIED during HeadBucket operation: No response body.
```

Signed-off-by: ljstrnadiii <ljstrnadiii@gmail.com>
Co-authored-by: Goutam <goutam@anyscale.com>
Description
Consider the case where my aws role has permissions to only a prefix in a bucket and not the entire bucket and we use write_parquet to write parquet. This causes permissions issues when we attempt to create the dirs, which would normally be ok if a role has permissions to the entire bucket.
I noticed try_create_dir does not get passed to the underlying parquet dataset write_dataset function here, which I am assuming likely recursively checks all "subdirs" in the S3 path and assumes we can get_file_info. I stopped at python/pyarrow/_dataset.pyx while investigating.
Related issues
Additional information
Fixes: OSError: When testing for existence of bucket 'my-bucket': AWS Error ACCESS_DENIED during HeadBucket operation: No response body.
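The forwarding described in the description could look roughly like the sketch below, which also doubles as the kind of simple test requested in the review. This is not Ray's actual code: `write_parquet_files`, `RecordingWriter`, and `forwarded_create_dir` are hypothetical names, and the writer is injected so the example runs without pyarrow or S3. In the real datasink the callee would be `pyarrow.dataset.write_dataset`, which accepts a `create_dir` keyword that defaults to `True`.

```python
from typing import Any, Callable, Dict, List


def write_parquet_files(
    tables: List[Any],
    path: str,
    *,
    try_create_dir: bool,
    write_dataset: Callable[..., None],
) -> None:
    """Hypothetical stand-in for the datasink's write helper.

    `write_dataset` is injected for illustration; the real code would call
    `pyarrow.dataset.write_dataset` directly.
    """
    write_dataset(
        tables,
        base_dir=path,
        format="parquet",
        # The fix: forward the caller's choice instead of silently relying
        # on the callee's default of create_dir=True.
        create_dir=try_create_dir,
    )


class RecordingWriter:
    """Fake writer that records the keyword arguments it receives."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def __call__(self, data: Any, **kwargs: Any) -> None:
        self.calls.append(kwargs)


def forwarded_create_dir(try_create_dir: bool) -> bool:
    """Return the `create_dir` value that reached the (fake) writer."""
    writer = RecordingWriter()
    write_parquet_files(
        ["block-0"],  # placeholder for real Arrow tables
        "my-bucket/allowed/prefix",  # illustrative path only
        try_create_dir=try_create_dir,
        write_dataset=writer,
    )
    return writer.calls[0]["create_dir"]


if __name__ == "__main__":
    print(forwarded_create_dir(False))
    print(forwarded_create_dir(True))
```

With the flag forwarded, a caller whose role is limited to a prefix can opt out of directory creation on the user side with `ds.write_parquet(path, try_create_dir=False)`.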