Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The ScheduleImplMixin.save_schedule method appears to save a copy of the same schedule file on each machine #11

Closed
Davidham3 opened this issue Dec 8, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@Davidham3
Copy link

🐛 Bug

After setting the trainer.ckpt_path variable to resume the fine-tuning process, the fine-tuning scheduler appears to save an Encoder_ft_schedule.yaml file on each machine.

However, as all machines are utilizing the same shared remote storage during my training process, this concurrent saving action results in an Input/Output error. It may be necessary to decorate the ScheduleImplMixin.save_schedule static method with @rank_zero_only to ensure that the file is saved only once by the primary node.

To address this issue, I decorated the ScheduleImplMixin.save_schedule static method with @rank_zero_only, ensuring that the file is only saved once by the primary node. After making this change, the model training proceeded without further I/O errors.

Environment

  • Fine-Tuning Scheduler Version (e.g., 0.1.0): 2.1.1
  • Lightning Version (e.g., 1.5.0): 2.1.1
  • PyTorch Version (e.g., 2.0): 2.1.1
  • Python version (e.g., 3.11): 3.9
  • OS (e.g., Linux): Linux
  • CUDA/cuDNN version: 11.8
  • GPU models and configuration: A100 80g
  • How you installed PyTorch (conda, pip, source): pip
@Davidham3 Davidham3 added the bug Something isn't working label Dec 8, 2023
speediedan added a commit that referenced this issue Dec 20, 2023
…ve_schedule` from non-`rank_zero_only` guarded contexts. Explicitly guarding `save_schedule` as well as `gen_ft_schedule` themselves to ensure similar bugs surface during development if those functions are directly accessed in future non-`rank_zero_only` contexts. Includes associated test enhancements.
@speediedan
Copy link
Owner

Thanks for finding and submitting this issue @Davidham3!

As you observed, some codepaths were incorrectly invoking ScheduleImplMixin.save_schedule (and ScheduleImplMixin.gen_ft_schedule actually) from non-rank_zero_only guarded contexts. I've explicitly guarded both those methods themselves rather than relying upon the rank-zero guarding of upstream codepaths. The fix for this is in today's patch release (2.1.2).

Thanks again for your contribution. Feel free to reach out anytime if you have other issues or want to share more about your use case.

Best of luck with your work!

speediedan added a commit that referenced this issue Dec 20, 2023
…ve_schedule` from non-`rank_zero_only` guarded contexts. Explicitly guarding `save_schedule` as well as `gen_ft_schedule` themselves to ensure similar bugs surface during development if those functions are directly accessed in future non-`rank_zero_only` contexts. Includes associated test enhancements.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants