
AttributeError: 'DistributedDataParallel' object has no attribute 'save_checkpoint' #21

Closed
fwangut opened this issue Oct 5, 2022 · 1 comment

Comments


fwangut commented Oct 5, 2022

@HaokunLiu @dptam Thank you for your great work, and congrats on the NeurIPS acceptance!

I ran into the following issue when using DDP:
AttributeError: 'DistributedDataParallel' object has no attribute 'save_checkpoint'

It's raised by the following line:

self.trainer.model.save_checkpoint(distributed_save_path)

Any suggestion would be appreciated!
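
For reference, here is a minimal single-process reproduction (my own sketch, not code from this repo) showing that the DDP wrapper simply does not define save_checkpoint, unlike a DeepSpeed engine:

import os
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

# Single-process "distributed" setup, just to construct a DDP wrapper.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

ddp_model = DistributedDataParallel(nn.Linear(4, 4))
# save_checkpoint is a DeepSpeed engine method; DDP does not define it,
# so calling it raises the AttributeError above.
print(hasattr(ddp_model, "save_checkpoint"))  # False

dist.destroy_process_group()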

Another related question: why does the DDP checkpoint also need to be processed by zero_to_fp32.get_fp32_state_dict_from_zero_checkpoint(distributed_save_path)? I thought that function applied only to DeepSpeed ZeRO checkpoints. This is done in:

trainable_states = zero_to_fp32.get_fp32_state_dict_from_zero_checkpoint(distributed_save_path)


HaokunLiu (Collaborator) commented Oct 6, 2022

I guess this could be a version issue? But I don't have enough experience with all the parallelism techniques to diagnose why. I guess the easiest (and ugliest) fix is to just turn off save_model.
If you dig around PyTorch's distributed data parallelism and its recommended way to save a model, you might find a more elegant solution.
Also, although we included DeepSpeed, we never actually used it in any experiments because it was too slow on our machine. I can't even guarantee it is correct in the main branch, so try to avoid using DeepSpeed. (Never set compute_strategy to deepspeed_blahblah.)
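
If it helps, here is a minimal sketch of the usual DDP save/load pattern (the helper names are hypothetical, not from our code): rank 0 writes the wrapped module's state dict, and loading is a plain torch.load with no zero_to_fp32 conversion.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def save_ddp_checkpoint(ddp_model: DistributedDataParallel, path: str) -> None:
    # Only rank 0 writes the checkpoint; the other ranks wait at the barrier.
    if dist.get_rank() == 0:
        torch.save(ddp_model.module.state_dict(), path)
    dist.barrier()

def load_checkpoint(path: str) -> dict:
    # A checkpoint written with plain torch.save needs no ZeRO conversion.
    return torch.load(path, map_location="cpu")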

fwangut closed this as completed Oct 18, 2022