
AttributeError: 'DistributedDataParallel' object has no attribute 'save_checkpoint' #21

Closed
fwangut opened this issue Oct 5, 2022 · 1 comment

Comments


fwangut commented Oct 5, 2022

@HaokunLiu @dptam Thank you for your great work, and congrats on the NeurIPS acceptance!

I ran into the following issue when using DDP:
AttributeError: 'DistributedDataParallel' object has no attribute 'save_checkpoint'

It's raised by the following line:

self.trainer.model.save_checkpoint(distributed_save_path)

Any suggestion would be appreciated!
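
For reference, here is a minimal single-process reproduction (my own sketch, not code from this repo) showing that the DDP wrapper simply does not define save_checkpoint, unlike a DeepSpeed engine:

import os
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

# Single-process "distributed" setup, just to construct a DDP wrapper.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

ddp_model = DistributedDataParallel(nn.Linear(4, 4))
# save_checkpoint is a DeepSpeed engine method; DDP does not define it,
# so calling it raises the AttributeError above.
print(hasattr(ddp_model, "save_checkpoint"))  # False

dist.destroy_process_group()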

Another related question: why does the DDP checkpoint also need to be processed by zero_to_fp32.get_fp32_state_dict_from_zero_checkpoint(distributed_save_path)? I thought that function applied only to DeepSpeed ZeRO checkpoints. This is done in:

trainable_states = zero_to_fp32.get_fp32_state_dict_from_zero_checkpoint(distributed_save_path)


HaokunLiu (Collaborator) commented Oct 6, 2022

I guess this could be a version issue? But I don't have enough experience with all the parallelism techniques to diagnose why. I guess the easiest (and ugliest) fix is to just turn off save_model.
If you dig around PyTorch's distributed data parallelism and its recommended way to save a model, you might find a more elegant solution.
Also, although we included DeepSpeed, we never actually used it in any experiments because it was too slow on our machine. I can't even guarantee it is correct in the main branch, so try to avoid using DeepSpeed. (Never set compute_strategy to deepspeed_blahblah.)
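
If it helps, here is a minimal sketch of the usual DDP save/load pattern (the helper names are hypothetical, not from our code): rank 0 writes the wrapped module's state dict, and loading is a plain torch.load with no zero_to_fp32 conversion.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def save_ddp_checkpoint(ddp_model: DistributedDataParallel, path: str) -> None:
    # Only rank 0 writes the checkpoint; the other ranks wait at the barrier.
    if dist.get_rank() == 0:
        torch.save(ddp_model.module.state_dict(), path)
    dist.barrier()

def load_checkpoint(path: str) -> dict:
    # A checkpoint written with plain torch.save needs no ZeRO conversion.
    return torch.load(path, map_location="cpu")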

fwangut closed this as completed Oct 18, 2022