Avoid pyarrow.fs import for local storage #14321

Merged
merged 5 commits into rapidsai:branch-23.12 from avoid-pyarrow-fs-import
Oct 24, 2023

Conversation

rjzamora
Member

@rjzamora rjzamora commented Oct 23, 2023

Description

This is not a resolution, but it may help mitigate problems from aws/aws-sdk-cpp#2681.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora requested a review from a team as a code owner October 23, 2023 23:58
@rjzamora rjzamora requested review from vyasr and shwina October 23, 2023 23:58
@github-actions github-actions bot added the Python Affects Python cuDF API. label Oct 23, 2023
@rjzamora rjzamora added 2 - In Progress Currently a work in progress non-breaking Non-breaking change helps: Python improvement Improvement / enhancement to an existing function and removed Python Affects Python cuDF API. labels Oct 23, 2023
@rjzamora rjzamora self-assigned this Oct 23, 2023
Member

@pentschev pentschev left a comment

I'm +1000 on this change. Thanks @rjzamora !

Contributor

@wence- wence- left a comment

Minor suggestions, but LGTM

python/cudf/cudf/utils/ioutils.py (outdated)
@wence-
Contributor

wence- commented Oct 24, 2023

Is this sufficient? The import of pyarrow.dataset (here https://github.com/rapidsai/cudf/blob/branch-23.12/python/cudf/cudf/io/parquet.py#L18) also seems to provoke importing pyarrow._fs.

Can we add a test (maybe run in a subprocess) that checks that import cudf; "pyarrow._fs" in sys.modules returns False?
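
A minimal sketch of what such a subprocess check could look like, using only the standard library. The helper name `module_loaded_after` is hypothetical, and the test actually added in this PR may be written differently. Running the import in a fresh interpreter keeps modules already loaded by the test session from masking the result.

```python
import subprocess
import sys


def module_loaded_after(import_stmt: str, module_name: str) -> bool:
    """Run `import_stmt` in a fresh interpreter and report whether
    `module_name` is present in sys.modules afterwards.

    Hypothetical helper for illustration; not part of cudf.
    """
    code = (
        "import sys\n"
        f"{import_stmt}\n"
        f"print({module_name!r} in sys.modules)"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,  # raise if the import itself fails
    )
    # The membership check is the last thing printed by the child process.
    return result.stdout.strip().endswith("True")


# Intended use (assumes cudf and pyarrow are installed):
#     assert not module_loaded_after("import cudf", "pyarrow._s3fs")
```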

@github-actions github-actions bot added the Python Affects Python cuDF API. label Oct 24, 2023
Contributor

@bdice bdice left a comment

I support @wence-'s request for a test but otherwise LGTM.

@rjzamora
Member Author

> Is this sufficient? The import of pyarrow.dataset (here https://github.com/rapidsai/cudf/blob/branch-23.12/python/cudf/cudf/io/parquet.py#L18) also seems to provoke importing pyarrow._fs.

This is a good point. However, do we care about pyarrow._fs, or just pyarrow._s3fs? As far as I can tell, importing pyarrow.dataset does bring in pyarrow._fs, but pyarrow._s3fs isn't loaded unless pyarrow.fs is imported.

import sys
import pyarrow
pa_mods = set(sys.modules)
_fs = "pyarrow._fs"
_s3fs = "pyarrow._s3fs"

print("import pyarrow.dataset")
import pyarrow.dataset
ds_mods = set(sys.modules)
print(f"{_fs} imported: {_fs in (ds_mods - pa_mods)}")
print(f"{_s3fs} imported: {_s3fs in (ds_mods - pa_mods)}")

print("import pyarrow.fs")
import pyarrow.fs
fs_mods = set(sys.modules)
print(f"{_s3fs} imported: {_s3fs in (fs_mods - ds_mods)}")

Output:

import pyarrow.dataset
pyarrow._fs imported: True
pyarrow._s3fs imported: False
import pyarrow.fs
pyarrow._s3fs imported: True

> Can we add a test (maybe run in a subprocess) that checks that import cudf; "pyarrow._fs" in sys.modules returns False?

A test is a good idea. Let me know if you think that we can just check that pyarrow._s3fs isn't imported.
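
The before/after comparison of `sys.modules` used in the script above can be wrapped in a small helper. This is only a sketch; `newly_imported` is a hypothetical name, and the result depends on what the current process has already imported (a module that is already loaded contributes nothing new).

```python
import importlib
import sys


def newly_imported(module_name: str) -> set[str]:
    """Import `module_name` and return the names of the modules that
    the import added to sys.modules. (Hypothetical helper.)"""
    before = set(sys.modules)
    importlib.import_module(module_name)
    return set(sys.modules) - before


# Intended use (assumes pyarrow is installed and pyarrow.fs is not yet loaded):
#     "pyarrow._s3fs" in newly_imported("pyarrow.fs")
```

Because the result is relative to the interpreter's current state, a fresh process gives the most reliable answer.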

@rjzamora
Member Author

I have some bad news. I started adding the test suggested by @wence- (thanks Lawrence!). However, I then discovered that pyarrow indeed imports pyarrow.fs/pyarrow._s3fs in other problematic places. For example:

In [1]: import sys
In [2]: mods = set(sys.modules)
In [3]: import pyarrow.parquet as pq
In [4]: after_mods = set(sys.modules)
In [5]: "pyarrow._s3fs" in after_mods
Out[5]: True

@@ -533,3 +533,17 @@ def test_write_chunked_parquet(s3_base, s3so):
actual.sort_values(["b"]).reset_index(drop=True),
cudf.concat([df1, df2]).sort_values(["b"]).reset_index(drop=True),
)


def test_no_s3fs_on_cudf_import():
Member Author

@rjzamora rjzamora Oct 24, 2023

This test is useful, but adding it has rained a bit on our mini parade. We now see that the pyarrow._s3fs problem pops up whenever we import pyarrow.parquet or pyarrow.orc, so it is much less constrained than the "remote storage only" case we originally expected.
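
For illustration, the general pattern for keeping a module out of `sys.modules` until it is needed is a function-level (deferred) import. The sketch below uses the standard-library `colorsys` module as a stand-in for `pyarrow.parquet`; the function name is hypothetical and the code is not taken from this PR's diff.

```python
import sys


def to_hls(r: float, g: float, b: float) -> tuple[float, float, float]:
    """Convert RGB to HLS, importing the helper module only on first use."""
    # Deferred import: `colorsys` (standing in for pyarrow.parquet) is loaded
    # when this function is first called, not when this file is imported.
    import colorsys

    return colorsys.rgb_to_hls(r, g, b)


# Until to_hls() is called, importing this file alone does not load colorsys.
```

The trade-off is a small import cost on the first call in exchange for not loading the dependency at package import time.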

Member

Thanks for doing that investigation Rick. I see you've already adjusted the remainder of the code to take that into consideration and this test is still passing, so I'm good with things as they are now.


@pentschev
Member

Given that sentiment regarding this change was positive, and to reduce the probability of segfaults in Dask et al., I'm merging this. Thanks a lot @rjzamora for working on this change!

@pentschev
Member

/merge

@rapids-bot rapids-bot bot merged commit 19d791c into rapidsai:branch-23.12 Oct 24, 2023
57 checks passed
@rjzamora rjzamora deleted the avoid-pyarrow-fs-import branch October 24, 2023 20:51
This pull request was closed.
Labels
2 - In Progress Currently a work in progress improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
4 participants