[train] New persistence mode: Move SyncConfig to train and deprecate Syncer #38855

Merged
30 commits
5111c23
mv tune.syncer -> train._internal.syncer
justinvyu Aug 24, 2023
36bfe7b
remove train dependencies on tune
justinvyu Aug 24, 2023
ce9150a
remove reenable head node sync flag + some leftover cleanup
justinvyu Aug 24, 2023
6b792ca
remove syncer callback pt 1
justinvyu Aug 24, 2023
f10108e
Remove outdated files
justinvyu Aug 24, 2023
a51212a
Add train.SyncConfig alias
justinvyu Aug 24, 2023
5e88387
remove head node sync deprecation warning deps
justinvyu Aug 24, 2023
1c94795
fix imports to use train.SyncConfig
justinvyu Aug 24, 2023
5979546
fix tests depending on syncercallback
justinvyu Aug 24, 2023
55bb02c
remove test_syncer_callback
justinvyu Aug 24, 2023
bc0db8e
deprecate args of SyncConfig
justinvyu Aug 24, 2023
dfab96a
remove validate_storage_path usage
justinvyu Aug 24, 2023
d8f9d2a
more targeted warnings
justinvyu Aug 24, 2023
d152150
temp fix for old codepath
justinvyu Aug 24, 2023
9605205
softer deprecation for tune.SyncConfig
justinvyu Aug 24, 2023
c3f1165
tune.SyncConfig -> train.SyncConfig
justinvyu Aug 24, 2023
7ace577
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu Aug 25, 2023
eb605df
fix test_syncer
justinvyu Aug 25, 2023
bcaa15c
fix test_trainable
justinvyu Aug 25, 2023
7f7e5a1
fix test_trainer_restore
justinvyu Aug 25, 2023
fb698da
fix lint
justinvyu Aug 25, 2023
348b448
address sync config comments
justinvyu Aug 25, 2023
9bfc4fc
fix doc
justinvyu Aug 25, 2023
70e0431
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu Aug 25, 2023
4ce65a4
Use sync_artifacts_on_checkpoint
justinvyu Aug 25, 2023
d4be2b9
fix validate_save_restore (failed due to artifacts pr)
justinvyu Aug 25, 2023
7e3d515
fix lint
justinvyu Aug 25, 2023
e38b056
Revert "Remove outdated files"
justinvyu Aug 25, 2023
0cef782
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu Aug 25, 2023
00c8095
enable sync artifacts for test
justinvyu Aug 25, 2023
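
Taken together, the commits above move ``SyncConfig`` from Tune into Train and deprecate the ``Syncer`` interface. A minimal sketch of the user-facing migration, based only on the commit messages and diffs below (the exact deprecation behavior of the old import path is not shown here):

.. code-block:: python

    # Old location, deprecated by this PR (kept with a softer deprecation per the commits above):
    from ray.tune.syncer import SyncConfig

    # New location added by this PR:
    from ray.train import SyncConfig

    # sync_artifacts is one of the remaining, non-deprecated options.
    sync_config = SyncConfig(sync_artifacts=True)
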
@@ -948,13 +948,11 @@
}
],
"source": [
"from ray.tune.syncer import SyncConfig\n",
"# Save AIR checkpoints according to the performance on validation set\n",
"run_config = RunConfig(\n",
" storage_path=storage_path,\n",
" name=\"finetune_dolly-v2-7b\",\n",
" checkpoint_config=CheckpointConfig(),\n",
" sync_config=SyncConfig(sync_artifacts=False),\n",
")\n",
"\n",
"# Scale the DDP training workload across 16 GPUs\n",
2 changes: 1 addition & 1 deletion doc/source/train/key-concepts.rst
@@ -66,7 +66,7 @@ Train Configuration
Trainers are configured with configuration objects. There are two main configuration classes,
the :class:`ScalingConfig <ray.air.config.ScalingConfig>` and the :class:`RunConfig <ray.air.config.RunConfig>`.
The latter contains subconfigurations, such as the :class:`FailureConfig <ray.air.config.FailureConfig>`,
:class:`SyncConfig <ray.tune.syncer.SyncConfig>` and :class:`CheckpointConfig <ray.air.config.CheckpointConfig>`.
:class:`SyncConfig <ray.train.SyncConfig>` and :class:`CheckpointConfig <ray.air.config.CheckpointConfig>`.

.. _train-key-concepts-results:

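As a hedged illustration of the configuration objects named in the hunk above (the values and the ``s3://`` bucket are placeholders, not taken from this diff):

.. code-block:: python

    from ray.train import (
        RunConfig,
        SyncConfig,
        CheckpointConfig,
        FailureConfig,
        ScalingConfig,
    )

    # RunConfig bundles the sub-configurations; SyncConfig now comes from ray.train.
    run_config = RunConfig(
        name="my_experiment",
        storage_path="s3://my-bucket/experiments",          # placeholder bucket
        failure_config=FailureConfig(max_failures=2),
        checkpoint_config=CheckpointConfig(num_to_keep=2),
        sync_config=SyncConfig(sync_artifacts=False),
    )
    scaling_config = ScalingConfig(num_workers=4)
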
46 changes: 3 additions & 43 deletions doc/source/tune/api/syncing.rst
@@ -1,13 +1,11 @@
Syncing in Tune (tune.SyncConfig, tune.Syncer)
==============================================
Syncing in Tune (train.SyncConfig)
==================================

.. seealso::

See :doc:`this user guide </tune/tutorials/tune-storage>` for more details and examples.


.. currentmodule:: ray.tune.syncer

.. _tune-sync-config:

Tune Syncing Configuration
@@ -16,42 +14,4 @@ Tune Syncing Configuration
.. autosummary::
:toctree: doc/

SyncConfig

.. _tune-syncer:

Remote Storage Syncer Interface (tune.Syncer)
---------------------------------------------

Constructor
~~~~~~~~~~~

.. autosummary::
:toctree: doc/

Syncer


Syncer Methods to Implement
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autosummary::
:toctree: doc/

Syncer.sync_up
Syncer.sync_down
Syncer.delete
Syncer.wait
Syncer.wait_or_retry


Tune Built-in Syncers
---------------------

.. autosummary::
:toctree: doc/

SyncerCallback
_DefaultSyncer
_BackgroundSyncer

ray.train.SyncConfig
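
The syncing API surface that remains is small. A rough sketch of how it is used (parameter names are taken from the commits and diffs in this PR; the values shown are assumptions, not documented defaults):

.. code-block:: python

    from ray.train import SyncConfig

    sync_config = SyncConfig(
        sync_period=300,                    # seconds between periodic syncs to the storage path
        sync_artifacts=True,                # also upload trial artifacts
        sync_artifacts_on_checkpoint=True,  # sync artifacts whenever a checkpoint is reported
    )
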
96 changes: 0 additions & 96 deletions doc/source/tune/doc_code/faq.py
@@ -249,102 +249,6 @@ def f(config, data=None):
tuner.fit()
# __log_1_end__

# __log_2_start__
from ray.tune.syncer import Syncer

class CustomSyncer(Syncer):
def sync_up(
self, local_dir: str, remote_dir: str, exclude: list = None
) -> bool:
pass # sync up

def sync_down(
self, remote_dir: str, local_dir: str, exclude: list = None
) -> bool:
pass # sync down

def delete(self, remote_dir: str) -> bool:
pass # delete

tuner = tune.Tuner(
MyTrainableClass,
run_config=train.RunConfig(storage_path="s3://my-log-dir"),
)
tuner.fit()
# __log_2_end__

# __custom_command_syncer_start__
import subprocess
from ray.tune.syncer import Syncer

class CustomCommandSyncer(Syncer):
def __init__(
self,
sync_up_template: str,
sync_down_template: str,
delete_template: str,
sync_period: float = 300.0,
):
self.sync_up_template = sync_up_template
self.sync_down_template = sync_down_template
self.delete_template = delete_template

super().__init__(sync_period=sync_period)

def sync_up(
self, local_dir: str, remote_dir: str, exclude: list = None
) -> bool:
cmd_str = self.sync_up_template.format(
source=local_dir,
target=remote_dir,
)
try:
subprocess.check_call(cmd_str, shell=True)
except Exception as e:
print(f"Exception when syncing up {local_dir} to {remote_dir}: {e}")
return False
return True

def sync_down(
self, remote_dir: str, local_dir: str, exclude: list = None
) -> bool:
cmd_str = self.sync_down_template.format(
source=remote_dir,
target=local_dir,
)
try:
subprocess.check_call(cmd_str, shell=True)
except Exception as e:
print(f"Exception when syncing down {remote_dir} to {local_dir}: {e}")
return False
return True

def delete(self, remote_dir: str) -> bool:
cmd_str = self.delete_template.format(
target=remote_dir,
)
try:
subprocess.check_call(cmd_str, shell=True)
except Exception as e:
print(f"Exception when deleting {remote_dir}: {e}")
return False
return True

def retry(self):
raise NotImplementedError

def wait(self):
pass

sync_config = tune.SyncConfig(
syncer=CustomCommandSyncer(
sync_up_template="aws s3 sync {source} {target}",
sync_down_template="aws s3 sync {source} {target}",
delete_template="aws s3 rm {target} --recursive",
),
)
# __custom_command_syncer_end__


if not MOCK:
# __s3_start__
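With the custom ``Syncer`` examples above removed, the closest equivalent under the new persistence mode is to pass a custom ``pyarrow`` filesystem instead of overriding sync logic. A hedged sketch (the ``storage_filesystem`` argument and the S3 details are assumptions about the new persistence mode, not something this diff shows):

.. code-block:: python

    import pyarrow.fs
    from ray import train, tune

    # A pyarrow-compatible filesystem stands in for custom sync commands.
    fs = pyarrow.fs.S3FileSystem(region="us-west-2")  # placeholder region

    tuner = tune.Tuner(
        MyTrainableClass,  # defined elsewhere, as in the surrounding FAQ code
        run_config=train.RunConfig(
            storage_path="my-log-dir",  # bucket/prefix on the filesystem above
            storage_filesystem=fs,
        ),
    )
    tuner.fit()
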
77 changes: 4 additions & 73 deletions doc/source/tune/faq.rst
@@ -462,24 +462,10 @@ to do that depending on whether you are using class or functional Trainable API.

**You are training a large number of trials on a cluster, or you are saving huge checkpoints**

Checkpoints and logs are synced between nodes
- usually at least to the driver on the head node, but sometimes between worker nodes if needed (e.g. when
using :ref:`Population Based Training <tune-scheduler-pbt>`). If these checkpoints are very large (e.g. for
NLP models), or if you are training a large number of trials, this syncing can take a long time.

If nothing else is specified, syncing happens via SSH, which can lead to network overhead as connections are
not kept open by Ray Tune.

**Solution**: There are multiple solutions, depending on your needs:

1. You can disable syncing to the driver in the :class:`tune.SyncConfig <ray.tune.SyncConfig>`. In this case,
logs and checkpoints will not be synced to the driver, so if you need to access them later, you will have to
transfer them where you need them manually.

2. You can use :ref:`cloud checkpointing <tune-cloud-checkpointing>` to save logs and checkpoints to a specified `storage_path`.
This is the preferred way to deal with this. All syncing will be taken care of automatically, as all nodes
are able to access the cloud storage. Additionally, your results will be safe, so even when you're working on
pre-emptible instances, you won't lose any of your data.
**Solution**: You can use :ref:`cloud checkpointing <tune-cloud-checkpointing>` to save logs and checkpoints to a specified `storage_path`.
This is the preferred way to deal with this. All syncing will be taken care of automatically, as all nodes
are able to access the cloud storage. Additionally, your results will be safe, so even when you're working on
pre-emptible instances, you won't lose any of your data.

**You are reporting results too often**

@@ -604,14 +590,6 @@ Here is an example of uploading to S3, using a bucket called ``my-log-dir``:
:start-after: __log_1_start__
:end-before: __log_1_end__

You can customize synchronization behavior by implementing your own Syncer:

.. literalinclude:: doc_code/faq.py
:dedent:
:language: python
:start-after: __log_2_start__
:end-before: __log_2_end__

By default, syncing occurs whenever one of the following conditions is met:

* if you have used a :py:class:`~ray.train.CheckpointConfig` with ``num_to_keep`` and a trial has checkpointed more than ``num_to_keep`` times since last sync,
@@ -628,53 +606,6 @@ For AWS set up, this involves adding an IamInstanceProfile configuration for wor
Please :ref:`see here for more tips <aws-cluster-s3>`.


.. _tune-cloud-syncing-command-line-example:

How can I use the awscli or gsutil command line commands for syncing?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some users have reported running into problems with the default pyarrow-based syncing.
In this case, you can implement a custom syncer that invokes the respective command line
tools for transferring files between nodes and cloud storage.

Here is an example for a syncer that uses string templates that will be run
as a command:

.. literalinclude:: doc_code/faq.py
:dedent:
:language: python
:start-after: __custom_command_syncer_start__
:end-before: __custom_command_syncer_end__

For different cloud services, these are example templates you can use with this syncer:

AWS S3
''''''

.. code-block::

sync_up_template="aws s3 sync {source} {target} --exact-timestamps --only-show-errors"
sync_down_template="aws s3 sync {source} {target} --exact-timestamps --only-show-errors"
delete_template="aws s3 rm {target} --recursive --only-show-errors"

Google cloud storage
''''''''''''''''''''

.. code-block::

sync_up_template="gsutil rsync -r {source} {target}"
sync_down_template="down": "gsutil rsync -r {source} {target}"
delete_template="delete": "gsutil rm -r {target}"

HDFS
''''

.. code-block::

sync_up_template="hdfs dfs -put -f {source} {target}"
sync_down_template="down": "hdfs dfs -get -f {source} {target}"
delete_template="delete": "hdfs dfs -rm -r {target}"


.. _tune-docker:

How can I use Tune with Docker?
44 changes: 8 additions & 36 deletions doc/source/tune/tutorials/tune-output.rst
@@ -186,6 +186,14 @@ or use a custom logging library that requires multi-process logging.
For example, you may want to do this if you're trying to log images to TensorBoard.
We refer to these saved files as **trial artifacts**.

.. note::

If :class:`SyncConfig(sync_artifacts=True) <ray.train.SyncConfig>`, trial artifacts
are uploaded periodically from each trial (or from each remote training worker for Ray Train)
to the :class:`RunConfig(storage_path) <ray.train.RunConfig>`.

See the :class:`~ray.train.SyncConfig` API reference for artifact syncing configuration options.
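
A small sketch of what enabling this looks like (the storage path is a placeholder):

.. code-block:: python

    from ray.train import RunConfig, SyncConfig

    run_config = RunConfig(
        storage_path="s3://my-bucket/results",        # placeholder bucket
        sync_config=SyncConfig(sync_artifacts=True),  # periodically upload trial artifacts
    )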

You can save trial artifacts directly in the trainable, as shown below:

.. tip:: Make sure that any logging calls or objects stay within scope of the Trainable.
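
For reference, a rough sketch of saving artifacts from within a function trainable (illustrative only; the snippet the document itself includes falls outside this hunk):

.. code-block:: python

    from ray import train, tune

    def train_fn(config):
        for i in range(config["num_epochs"]):
            # Files written to the trial's working directory count as trial artifacts.
            with open(f"./artifact_{i}.txt", "w") as f:
                f.write(f"Some data about iteration {i}\n")
            train.report({"epoch": i})

    tuner = tune.Tuner(train_fn, param_space={"num_epochs": 2})
    tuner.fit()
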
@@ -269,42 +277,6 @@ should be configured to log to the Trainable's *working directory.* By default,
the current working directory of both functional and class trainables is set to the
corresponding trial directory once it's been launched as a remote Ray actor.

.. warning::

When running in a multi-node cluster using the *deprecated* :ref:`head node storage option <tune-default-syncing>`,
trial artifacts are synchronized to the driver node under the specified path.
This will allow you to visualize and analyze logs of all distributed training workers on a single machine.

When :ref:`specifying a cloud upload directory <tune-cloud-checkpointing>`, trial artifacts are uploaded to that cloud bucket
for later analysis. Note that the driver node does not necessarily contain
artifacts from *all* trials -- only the ones that were running on that node.
To disable artifacts from being uploaded to the cloud, set ``SyncConfig(sync_artifacts=False)`` in :class:`~ray.tune.syncer.SyncConfig`.

.. warning::

Appending to trial artifacts upon restoration is not supported.
As a workaround, save trial artifacts to separate files with unique filenames.

For example, instead of doing this:

.. code-block:: python

def appending_train_fn(config):
for i in range(config["num_epochs"]):
with open("./artifact.txt", "a") as f:
f.write(f"Some data about iteration {i}\n")

Log artifacts as independent files with unique filenames:

.. code-block:: python

def separate_files_train_fn(config):
for i in range(config["num_epochs"]):
with open(f"./artifact_{i}.txt", "w") as f:
f.write(f"Some data about iteration {i}\n")

If you are running into issues, `file an issue <https://github.com/ray-project/ray/issues>`_


How to Build Custom Tune Loggers?
---------------------------------