[air] pyarrow.fs persistence: Introduce StorageContext and use it for driver syncing (1/n) #37690
Conversation
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
self.storage_filesystem = storage_filesystem
if is_uri(self.storage_path):
    raise ValueError("TODO")
self.storage_fs_path = self.storage_path
If you use a custom pyarrow filesystem, then you need to pass in an already-stripped path rather than a URI:

    (storage_filesystem=CustomS3FileSystem(), storage_path="bucket/a/b/c")
    # Cannot use "s3://bucket/a/b/c", since pyarrow.fs.copy_files will complain about a URI being present

The default case uses pyarrow.fs.from_uri to strip the prefix from the path:

    (storage_filesystem=None, storage_path="s3://bucket/a/b/c")
    # from_uri(storage_path) -> (DefaultS3FileSystem(), "bucket/a/b/c")
Question: Is this the desired way to specify a storage path, in the case of a custom filesystem?
Does normalize_path() fix this? https://arrow.apache.org/docs/python/generated/pyarrow.fs.FileSystem.html#pyarrow.fs.FileSystem.normalize_path
Not sure I see any other APIs in arrow that would resolve the URI relative to a filesystem.
Another option is we can parse URIs and strip the prefix in the case of a custom storage filesystem being passed in.
Yeah, I tried normalize_path and it doesn't strip the prefix.
I think stripping the prefix for the user with some urllib parsing probably works in most cases, but there are some edge cases (I am worried that this logic is not robust enough). For example, for HDFS:
ray/python/ray/air/_internal/remote_storage.py, lines 367 to 378 in 6ac4521:

    # for uri="hdfs://48bb8ca83706:8020/test":
    #   netloc="48bb8ca83706:8020/"
    # netloc's information is taken into account from the pyarrow client,
    # so path should not include netloc.
    # On the other hand, for uri="s3://my_bucket/test":
    #   netloc="my_bucket/" and path="test/"
    # netloc's information is not part of the pyarrow client,
    # so path should include netloc information, hence the concatenation.
    if uri.startswith("hdfs://"):
        path = parsed.path
    else:
        path = parsed.netloc + parsed.path
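The HDFS vs. S3 asymmetry above follows directly from how the standard library parses these URIs; a minimal stdlib demonstration:

```python
from urllib.parse import urlparse

# HDFS: host:port lands in netloc and is already configured on the
# pyarrow HDFS client, so only parsed.path is the filesystem path.
hdfs = urlparse("hdfs://48bb8ca83706:8020/test")
print(hdfs.netloc, hdfs.path)  # 48bb8ca83706:8020 /test

# S3: the bucket lands in netloc, but the pyarrow S3 client expects the
# bucket as part of the path, hence the netloc + path concatenation.
s3 = urlparse("s3://my_bucket/test")
print(s3.netloc + s3.path)  # my_bucket/test
```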
I see, probably safest not to do any special parsing then. We can document that users should not specify the URI prefix when using a custom storage filesystem.
(
    self.storage_filesystem,
    self.storage_fs_path,
) = get_fs_and_path(self.storage_path)
I stuck with get_fs_and_path and the utils in ray.air._internal.remote_storage, since there are many fixes in there that are worthwhile to keep (e.g. HDFS URI parsing, a pyarrow bug causing uploads of many files to hang forever, as well as bad performance from pyarrow defaults).
There may be a bug in Windows path parsing that needs to be fixed.
Let's discuss. I think we need to take a strong stand here, and simplify how this works.
At a high level, Ray Data is working fine with pyarrow filesystem URI resolution, and I don't see why Train needs to diverge here. I doubt the assumptions made before are necessarily valid now.
Oh, now that I'm looking at this HDFS URI parsing again, the only reason it's needed is that we cache the pyarrow filesystem and then try to perform the prefix stripping ourselves on the next filesystem operation (using the cached one). We can simplify / remove this.
The one thing that I think we should keep from get_fs_and_path is replacing the default s3/gs syncers with the corresponding fsspec implementations. See #34663.
In particular, the default pyarrow S3FileSystem has many issues:
- It has a bug that causes uploads of many files to hang forever when using multiple threads to upload. So, we need to manually limit this to use a single thread.
- It's slow due to (1) (see the linked PR for performance comparisons).
So, my new imagined get_fs_and_path:

    if fsspec installed:
        return fsspec s3fs/gs filesystem from URI
    return pyarrow.fs.from_uri(uri)
Yes, I'm aware of the benchmarks. We're talking about what looks like a ~40% performance improvement, however, for an operation that is typically infrequent (<1% of training overhead, given reasonably configured checkpoint intervals).
I'd like us to drop the fsspec dependency altogether here and standardize, given that this is also likely to be fixed in the future. Particularly performance-sensitive users can pass in a custom filesystem.
def init_shared_storage_context(storage_context: StorageContext):
    global _storage_context
    _storage_context = storage_context
A global context is used in a follow-up PR, and it was the easiest way for me to pipe the same storage context from the Tune Trainable to the Trainer.
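The piping pattern can be sketched as a module-level singleton with an accessor. StorageContext below is a minimal stand-in stub, and get_storage_context is a hypothetical accessor name, not necessarily the PR's API:

```python
from typing import Optional

class StorageContext:
    """Minimal stand-in for the real Ray StorageContext."""
    def __init__(self, storage_path: str):
        self.storage_path = storage_path

_storage_context: Optional[StorageContext] = None

def init_shared_storage_context(storage_context: StorageContext) -> None:
    # Mirror of the snippet above: stash one context per process.
    global _storage_context
    _storage_context = storage_context

def get_storage_context() -> StorageContext:
    # Hypothetical accessor for downstream components (e.g. the Trainer).
    assert _storage_context is not None, "call init_shared_storage_context first"
    return _storage_context

init_shared_storage_context(StorageContext("s3://bucket/exp"))
```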
self._local_experiment_path = storage.experiment_cache_path
self._remote_experiment_path = storage.experiment_fs_path
Note: These concepts of "local" and "remote" should be rebranded as "cache_path" and "storage_fs_path".
In the storage case, shall we avoid assigning local attributes like this and keep them in the storage context only?
E.g., set local_experiment_path=None, etc.
I realize this may require some more code changes, but this is probably a necessary refactoring.
# local vs. remote storage paths and should be reworked in a follow-up.
_get_defaults_results_dir()
if _use_storage_context()
else run_config.storage_path,
Note to self: Even though the old path takes in a storage_path, the get_experiment_checkpoint_dir method resolves the storage path and only takes the "local" version, which is the cache path in the case of cloud storage.
self,
storage_path: str,
sync_config: SyncConfig,
experiment_dir_name: str,
Do we always know the experiment_dir_name when this context object is first created?
At the moment, I create this storage context after the experiment dir name has been resolved.
The whole experiment directory name generation code is pretty complex, so I just left that stuff alone for now.
python/ray/tune/syncer.py (outdated)
return delete_at_uri, dict(uri=uri)


class _FilesystemSyncer(_BackgroundSyncer):
Here's the new syncer that doesn't depend on remote_storage.
Let's move this class to the storage.py file?
@ericl Thanks for the suggestions! Took another pass, let me know what you think.
Ok close enough. Let's merge and iterate. Please revert the changes to remote_storage.py and copy paste any necessary utils to storage.py as commented before merging.
I really don't want to entangle old and new code paths.
@ericl Should be good to merge now.
Merged, thanks!
Why are these changes needed?

This PR introduces a StorageContext class that holds onto syncing-related parameters and serves as the source of truth for paths such as the storage, experiment, and trial paths, as well as their local counterparts. The PR mostly consists of piping this StorageContext throughout Tune, and using it for driver syncing.

Notes

One major change that the prototype introduces is making it so that syncing always happens whenever a storage_path is set (for example, when specifying an NFS storage_path). So, in many places in the code, the concept of syncing from local_path -> remote_path now becomes syncing from local_cache_path -> storage_path. Whenever a storage path is set, this is now handled with the "cloud checkpointing" codepath from before.

One consequence of this is that specifying a local directory as the storage_path will now create 2 copies of the experiment directory. The way to get around this is to set the cache path instead using the environment variable. I don't think this way of configuring a local / NFS dir is very nice, so maybe it makes sense to re-introduce a "local_dir" and "sync_dir" concept to make this behavior more explicit.

Reference
This PR is mostly based off of the prototype here: #36969
~~Still use our ray.air._internal.remote_storage utils, including get_fs_and_path instead of pyarrow.fs.FileSystem.from_uri.~~
Allow ray.air._internal.remote_storage utils to take in a custom storage filesystem.
utils to take in a custom storage filesystem.Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.