What happened + What you expected to happen
In get_fs_and_path, the returned path can differ depending on whether the cache key is already present. If cached_key exists in _cached_fs, the returned path comes from path = parsed.netloc + parsed.path; if cached_key does not exist in _cached_fs, the returned path comes from fs, path = pyarrow.fs.FileSystem.from_uri(uri). In my case, the two paths are different.
Versions / Dependencies
Ray 2.1
Reproduction script
>>> from ray.air._internal.remote_storage import get_fs_and_path
>>> uri = "hdfs://grid.com:9000/jobs/test/run-2023-01-13-10-18-09/run-2023-01-13-10-18-09/basic-variant-state-2023-01-13_18-18-19.json"
>>> fs, path = get_fs_and_path(uri)
>>> fs
<pyarrow._hdfs.HadoopFileSystem object at 0x7f883a8ecef0>
>>> path
'jobs/test/run-2023-01-13-10-18-09/run-2023-01-13-10-18-09/basic-variant-state-2023-01-13_18-18-19.json'
# Do it another time to trigger cache
>>> fs, path = get_fs_and_path(uri)
>>> path
'grid.com:9000/jobs/test/run-2023-01-13-10-18-09/run-2023-01-13-10-18-09/basic-variant-state-2023-01-13_18-18-19.json'
# The path returned for a cached key comes from urllib instead of pyarrow
>>> import urllib.parse
>>> parsed = urllib.parse.urlparse(uri)
>>> path = parsed.netloc + parsed.path
>>> path
'grid.com:9000/jobs/test/run-2023-01-13-10-18-09/run-2023-01-13-10-18-09/basic-variant-state-2023-01-13_18-18-19.json'
Issue Severity
Medium: It is a significant difficulty but I can work around it.
mizhazha added the bug and triage labels on Jan 13, 2023
Thank you so much for the concise repro! @krfricke could you take a look at this?
I took a quick look at pyarrow's logic for resolve_filesystem_and_path, and it seems like we would want to return just parsed.path here, since the netloc is already part of the filesystem. I'm not sure whether there are additional implications for the fsspec route, though.
Would the following pseudo-logic handle all cases here?
1. Try using pyarrow to resolve, and if valid set cache[(parsed.scheme, parsed.netloc)] = parsed.path.
2. Try using fsspec to resolve, and if valid set cache[parsed.scheme] = parsed.netloc + parsed.path.
matthewdeng added the P1, tune, and air labels and removed the triage label on Jan 17, 2023
matthewdeng changed the title from "[RayTune: Syncing to hdfs path parsing issue]" to "[air][tune] Inconsistency when parsing hdfs path for syncing" on Jan 17, 2023