[train] Logs from frameworks (lightning_logs, wandb, transformers output_dir) in the working directory can be synced unintentionally #40634

Closed
justinvyu opened this issue Oct 24, 2023 · 1 comment · Fixed by #43403
Labels: P1 (issue that should be fixed within a few weeks), train (Ray Train related issue), UX (the issue is not only about technical bugs)

Comments

@justinvyu
Contributor

Many frameworks set their default logging directory to the current working directory.

Train/Tune changes the working directory to the trial directory, so anything a framework writes there can get synced to cloud storage unintentionally. This can cause checkpoints to be uploaded twice (once as the Train checkpoint, and once as an artifact in the directory).

The uploading happens from either:

  1. Driver syncing, if the trial happens to live on the head node. This can be fixed by converting the sync exclude-list into an explicit include-list.
  2. Trial artifact syncing, enabled by SyncConfig(sync_artifacts=True). We should either recommend in the docs that users point the logging directory of these frameworks at an external directory, or add a configurable artifact exclude-list.
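
To illustrate why an include-list is safer than an exclude-list here, the sketch below filters a made-up trial directory listing both ways. This is not Ray's sync implementation: the helper names, file names, and patterns are all hypothetical and only show the general idea.

```python
from fnmatch import fnmatch


def select_with_exclude(paths, exclude):
    """Keep everything that does not match an exclude pattern."""
    return [p for p in paths if not any(fnmatch(p, pat) for pat in exclude)]


def select_with_include(paths, include):
    """Keep only what matches an include pattern."""
    return [p for p in paths if any(fnmatch(p, pat) for pat in include)]


# Hypothetical contents of a trial directory after a framework has logged into it.
trial_dir = [
    "result.json",
    "checkpoint_000000/model.pt",
    "lightning_logs/version_0/checkpoints/epoch=0.ckpt",
    "wandb/run-1/files/output.log",
]

# Exclude-list: framework directories nobody anticipated still get through.
synced_by_exclude = select_with_exclude(trial_dir, exclude=["checkpoint_*"])

# Include-list: only explicitly named files are synced.
synced_by_include = select_with_include(trial_dir, include=["result.json"])

print(synced_by_exclude)
print(synced_by_include)
```

With the exclude-list, the `lightning_logs` and `wandb` entries are still selected for upload, because the list only knows about the directories someone thought to exclude.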
@woshiyyya woshiyyya self-assigned this Oct 25, 2023
@justinvyu justinvyu changed the title [train] Logs from frameworks (lightning_logs, wandb) in the working directory can be synced unintentionally [train] Logs from frameworks (lightning_logs, wandb, transformers output_dir) in the working directory can be synced unintentionally Feb 16, 2024
@justinvyu
Contributor Author

justinvyu commented Feb 16, 2024

Workaround 1: Configure framework logging directories

One workaround is to point each framework's log directory at a path outside the Ray Train experiment directory. (Many of these frameworks default to the current working directory of the training worker, which is inside the experiment directory.)

Hugging Face Transformers Trainer:

TrainingArguments(output_dir="/tmp/path")

Lightning Trainer:

pl.Trainer(default_root_dir="/tmp/path")

wandb:

wandb.init(dir="/tmp/path")
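
If the same path should be reused across frameworks, a small helper can create it and check that it really is outside the working directory. This is a minimal sketch: `external_log_dir` and the `framework_logs` folder name are hypothetical, not part of Ray or of any of the frameworks above, and the directory is local to whichever node the worker runs on.

```python
import tempfile
from pathlib import Path
from typing import Optional


def external_log_dir(framework: str, base: Optional[str] = None) -> str:
    """Create and return a per-framework log directory outside the working directory."""
    root = Path(base) if base is not None else Path(tempfile.gettempdir()) / "framework_logs"
    path = (root / framework).resolve()
    cwd = Path.cwd().resolve()
    # Refuse any location at or below the working directory,
    # since that is the directory that may get synced.
    if path == cwd or cwd in path.parents:
        raise ValueError(f"{path} is inside the working directory {cwd}")
    path.mkdir(parents=True, exist_ok=True)
    return str(path)


# Usage inside the training function, with the calls shown above:
#   TrainingArguments(output_dir=external_log_dir("transformers"))
#   pl.Trainer(default_root_dir=external_log_dir("lightning"))
#   wandb.init(dir=external_log_dir("wandb"))
```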

Workaround 2: Disable CWD change behavior

Another workaround is to run with the environment variable RAY_CHDIR_TO_TRIAL_DIR=0.

See https://docs.ray.io/en/master/train/user-guides/persistent-storage.html#keep-the-original-current-working-directory.
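
A minimal sketch of setting the variable from Python, assuming it is done at the top of the driver script before any trainer is created (exporting it in the shell before launching the script should work as well):

```python
import os

# Ask Ray Train not to change the working directory to the trial directory.
# Set this before constructing or fitting any trainer.
os.environ["RAY_CHDIR_TO_TRIAL_DIR"] = "0"

# ... build and run the trainer after this point ...
```

With the working directory left unchanged, relative paths resolve against the directory the script was launched from, so framework defaults such as `lightning_logs` end up there instead of in the trial directory.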
