Another related question is why the DDP checkpoint also needs to be processed by zero_to_fp32.get_fp32_state_dict_from_zero_checkpoint(distributed_save_path). I thought that should be applied to DeepSpeed ZeRO checkpoints only.
I guess this could be a version issue, but I don't have that much experience with all the parallelism techniques to diagnose why. The easiest (and ugliest) fix is to just turn off save_model.
If you dig around PyTorch's distributed data parallel documentation and its recommended way to save a model, you might find a more elegant solution.
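In case it helps, here is a minimal sketch of that standard DDP save pattern; the helper name and call site below are hypothetical, not part of t-few. The idea is that under plain DDP every rank holds an identical full copy of the weights, so instead of DeepSpeed's save_checkpoint you unwrap the .module attribute and save an ordinary state_dict from rank 0 only.

```python
import torch
import torch.distributed as dist


def save_ddp_checkpoint(ddp_model, path):
    # Hypothetical helper (not t-few code): under plain DDP every rank holds
    # an identical full copy of the weights, so it is enough to save a regular
    # state_dict from rank 0 only.
    if dist.is_available() and dist.is_initialized() and dist.get_rank() != 0:
        return
    # Unwrap the DistributedDataParallel wrapper if present.
    module = ddp_model.module if hasattr(ddp_model, "module") else ddp_model
    torch.save(module.state_dict(), path)
```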
Also, although we included DeepSpeed, we never actually used it in any experiments because it was too slow on our machine. I can't even guarantee it is correct in the main branch, so try to avoid using DeepSpeed (never set compute_strategy to deepspeed_blahblah).
@HaokunLiu @dptam Thank you for your great work and congrats on the NeurIPS acceptance!
I ran into an issue when using ddp, as follows:
AttributeError: 'DistributedDataParallel' object has no attribute 'save_checkpoint'
It's raised by the following line: t-few/src/models/EncoderDecoder.py, line 305 (commit 4e581fa).
Any suggestions would be appreciated!
Another related question is why the DDP checkpoint also needs to be processed by zero_to_fp32.get_fp32_state_dict_from_zero_checkpoint(distributed_save_path). I thought it should be applied to DeepSpeed ZeRO checkpoints only. This is done in t-few/src/models/EncoderDecoder.py, line 308 (commit 4e581fa).
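For what it's worth, here is a minimal sketch of how the two cases could be kept apart; the branching helper below is an assumption for illustration, not the actual t-few code. The ZeRO consolidation step is only meaningful for checkpoints written by DeepSpeed's sharded save_checkpoint, while a DDP checkpoint can be loaded as an ordinary state_dict.

```python
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint


def load_fp32_state_dict(compute_strategy, distributed_save_path):
    # Hypothetical helper (the branching is an assumption, not t-few's code).
    # ZeRO shards parameter/optimizer state across ranks, so a DeepSpeed
    # checkpoint directory must be consolidated back into a single fp32
    # state_dict; a plain DDP checkpoint is already a full state_dict.
    if compute_strategy.startswith("deepspeed"):
        return get_fp32_state_dict_from_zero_checkpoint(distributed_save_path)
    return torch.load(distributed_save_path, map_location="cpu")
```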