Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[air][tune] Inconsistency when parsing hdfs path for syncing #31673

Closed
mizhazha opened this issue Jan 13, 2023 · 1 comment · Fixed by #31940
Closed

[air][tune] Inconsistency when parsing hdfs path for syncing #31673

mizhazha opened this issue Jan 13, 2023 · 1 comment · Fixed by #31940
Assignees
Labels
bug Something that is supposed to be working; but isn't P1 Issue that should be fixed within a few weeks tune Tune-related issues

Comments

@mizhazha
Copy link

mizhazha commented Jan 13, 2023

What happened + What you expected to happen

In get_fs_and_path, the returned path could be different for cached key. If the cached_key exists in _cached_fs, the returned path will be from path = parsed.netloc + parsed.path; however, if the cached_key not exists in _cached_fs, the return path is from
fs, path = pyarrow.fs.FileSystem.from_uri(uri). In my case, the two paths are different.

Versions / Dependencies

Ray 2.1

Reproduction script

>>> from ray.air._internal.remote_storage import get_fs_and_path
>>> uri = "hdfs://grid.com:9000/jobs/test/run-2023-01-13-10-18-09/run-2023-01-13-10-18-09/basic-variant-state-2023-01-13_18-18-19.json"
>>> fs, path = get_fs_and_path(uri)
>>> fs
<pyarrow._hdfs.HadoopFileSystem object at 0x7f883a8ecef0>
>>> path
'jobs/test/run-2023-01-13-10-18-09/run-2023-01-13-10-18-09/basic-variant-state-2023-01-13_18-18-19.json'

# Do it another time to trigger cache
>>> fs, path = get_fs_and_path(uri)
>>> path
'grid.com:9000/jobs/test/run-2023-01-13-10-18-09/run-2023-01-13-10-18-09/basic-variant-state-2023-01-13_18-18-19.json'

# The path returned for cached key is from urllib instead of pyArrrow 
>>> import urllib.parse
>>> parsed = urllib.parse.urlparse(uri)
>>> path = parsed.netloc + parsed.path
>>> path
'grid.com:9000/jobs/test/run-2023-01-13-10-18-09/run-2023-01-13-10-18-09/basic-variant-state-2023-01-13_18-18-19.json'

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@mizhazha mizhazha added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 13, 2023
@matthewdeng
Copy link
Contributor

Thank you so much for the concise repro! @krfricke could you take a look at this?

I took a quick look at pyarrow's logic for resolve_filesystem_and_path and it seems like we would want to return just parsed.path here since the netloc is already part of the filesystem, but I'm not sure if there are additional implications for the fsspec route.

Would the following sudo-logic handle all cases here?

  1. Retrieve parsed.scheme, parsed.netloc, parsed.path.
  2. Check cache for (parsed.scheme, parsed.netloc).
  3. Check cache for parsed.scheme.
  4. Try using pyarrow to resolve, and if valid set cache[(parsed.scheme, parsed.netloc)] = parsed.path.
  5. Try using fsspec to resolve, and if valid set cache[parsed.scheme] = parsed.netloc+parsed.path.

@matthewdeng matthewdeng added P1 Issue that should be fixed within a few weeks tune Tune-related issues air and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 17, 2023
@matthewdeng matthewdeng changed the title [RayTune: Syncing to hdfs path parsing issue] [air][tune] Inconsistency when parsing hdfs path for syncing Jan 17, 2023
@xwjiang2010 xwjiang2010 self-assigned this Jan 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that is supposed to be working; but isn't P1 Issue that should be fixed within a few weeks tune Tune-related issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants