Cannot update part of the parameters in DistributedDataParallel. #22049
Comments
Did you try the instructions in the error message?
@pietern The instructions there just give a way to find which variables are not included. However, since I want to train the first part of the net first, I intend not to update the parameters of the second net.
How do you freeze the parameters you don't want to train? If you set requires_grad = False on them before wrapping the model with DDP, this should work.
@pietern Hi, Thanks for your answer. I want to train two network alternately, so, It is set dynamically after DDP. I think DDP should have some functions to dynamically freeze some part of the network, I feel this is a commonly used function. |
Just to make sure I understand correctly: when you alternate, do you freeze ALL parameters of a model or only a subset? I believe that you can freeze the whole model today, and it should work out of the box. Only if you freeze a subset of the model will you get the error message you posted.
I have 2 models, but I wrapped them in one DDP. This may be the problem. Is it possible to wrap all models in one DDP and dynamically freeze the parameters? I think this may make things much easier.
It is not possible today to partially freeze a DDP wrapped model. Either you freeze the whole thing (and no model parameter receives gradients), or none at all. If you want to alternate between two models, it is best to wrap them separately and freeze them entirely, separately, as well.
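A minimal sketch of what that suggestion could look like; the two models, the loop, and the rank handling are hypothetical, and it assumes the process group is already initialized. Each model is frozen or trained as a whole, which is what the comment above says works:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def set_requires_grad(module, flag):
    # Freeze (flag=False) or unfreeze (flag=True) every parameter.
    for p in module.parameters():
        p.requires_grad_(flag)

# Hypothetical two-model setup; assumes dist.init_process_group()
# was already called and `rank` is this process's GPU index.
net_a = DDP(nn.Linear(10, 10).to(rank), device_ids=[rank])
net_b = DDP(nn.Linear(10, 10).to(rank), device_ids=[rank])
opt = torch.optim.SGD(
    list(net_a.parameters()) + list(net_b.parameters()), lr=0.1)

for step in range(100):
    train_a = step % 2 == 0
    # Alternate: freeze one model entirely, train the other.
    set_requires_grad(net_a.module, train_a)
    set_requires_grad(net_b.module, not train_a)

    x = torch.randn(4, 10, device=rank)
    # Only the trained model's forward runs this iteration, so the
    # frozen model's DDP reducer has no pending reduction at all.
    out = net_a(x) if train_a else net_b(x)
    opt.zero_grad()
    out.sum().backward()
    opt.step()
```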
I was recently encountering the same problem. I guess PyTorch or the backend library is implemented in such a manner because of the synchronization issue. My workaround is to set the gradients of all 'frozen' parameters to zero right after calling backward().
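A minimal sketch of that workaround, assuming `ddp_model`, `optimizer`, `inputs`, and `frozen_params` (the parameters that should stay fixed) are defined elsewhere:

```python
# The "frozen" parameters keep requires_grad=True, so DDP's reducer
# still receives a gradient for them during backward().
optimizer.zero_grad()
loss = ddp_model(inputs).sum()   # hypothetical loss
loss.backward()

# Zero their gradients right after backward(), so the optimizer
# step leaves them unchanged.
for p in frozen_params:
    if p.grad is not None:
        p.grad.zero_()

optimizer.step()
```

One caveat: optimizers with weight decay or momentum can still move a parameter whose gradient is exactly zero, so this is only a drop-in substitute for plain SGD-style updates.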
@xf3227 Did you try the fix that the error message suggests (passing find_unused_parameters=True to the DistributedDataParallel constructor)?

If you freeze a subset of parameters, there is currently no way for DDP to know whether the same set is frozen across all processes. Therefore, the parameters that don't receive gradients will be made to contribute zeros, and the reduction is executed as expected. If the parameters are frozen on all processes, the reduced gradient should be all zeros on all processes (assuming you have called zero_grad() beforehand).
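For reference, the fix mentioned in the error message amounts to a single constructor argument (a sketch; `model` and `rank` are placeholders):

```python
from torch.nn.parallel import DistributedDataParallel as DDP

# DDP traverses the autograd graph from the forward outputs and marks
# parameters that will not receive a gradient as ready, so the
# reduction can complete instead of erroring.
ddp_model = DDP(model.to(rank), device_ids=[rank],
                find_unused_parameters=True)
```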
@pietern Thank you for pointing that out. I guess I somehow misstated my idea.

Based on my knowledge, if I'm right, gradient reduction (or synchronization) happens during backward(), so zeroing the frozen parameters' gradients right after backward() should not disturb the reduction. Please let me know if I misunderstand any point. I would really appreciate that.
It's safe, but it's better to not synchronize at all if you don't have to. |
I met the same issue.
Has this issue been solved? Specifically, partially freezing a DDP-wrapped model?
As of March 2023, this is still an unsolved problem.
I was having issues while fine-tuning some layers of a pre-trained model with DDP. As @pietern suggested, setting param.requires_grad = False before wrapping the model with DDP solved the issue. Thanks!
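A sketch of that approach with hypothetical model and layer names; the key point is that requires_grad is set before the DDP constructor runs, so DDP never registers the frozen parameters for reduction:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

model = build_pretrained_model()        # hypothetical factory
for p in model.backbone.parameters():   # hypothetical frozen subset
    p.requires_grad = False             # freeze BEFORE wrapping

ddp_model = DDP(model.to(rank), device_ids=[rank])

# Hand the optimizer only the parameters that still require grad.
optimizer = torch.optim.AdamW(
    (p for p in ddp_model.parameters() if p.requires_grad), lr=1e-4)
```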
🐛 Bug
When I use multiple GPUs and the loss is calculated from only part of the parameters, I get an error. Using only one GPU works fine.
To Reproduce
Steps to reproduce the behavior:
Define a network in which the loss depends on only part of the parameters, and train it with DistributedDataParallel on more than one GPU; the error is raised. A sketch follows below.
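The original snippet is not included in this thread, but a minimal sketch that reproduces the symptom might look like this (the launch command, backend, and module names are assumptions):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class PartialNet(nn.Module):
    """The loss only depends on `used`; `unused` never receives a gradient."""
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(10, 10)
        self.unused = nn.Linear(10, 10)

    def forward(self, x):
        return self.used(x)

def main():
    # e.g. launched with: torchrun --nproc_per_node=2 repro.py
    dist.init_process_group("nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)
    model = DDP(PartialNet().cuda(rank), device_ids=[rank])
    loss = model(torch.randn(4, 10, device=rank)).sum()
    loss.backward()  # errors: `unused` produced no gradient

if __name__ == "__main__":
    main()
```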
Expected behavior
Training on multiple GPUs should work just as it does on a single GPU, updating only the parameters the loss depends on.
Environment
PyTorch version: 1.2.0.dev20190620
CUDA used to build PyTorch: 9.0.176
OS: CentOS Linux release 7.5.1804 (Core)
GCC version: (crosstool-NG 1.23.0.449-a04d0) 7.3.0
CMake version: version 2.8.12.2
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce GTX 1080
GPU 1: GeForce GTX 1080
GPU 2: GeForce GTX 1080
GPU 3: GeForce GTX 1080
Nvidia driver version: 396.26
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] msgpack-numpy==0.4.3.2
[pip3] numpy==1.15.4
[pip3] pytorch-pretrained-bert==0.4.0
[pip3] torch==1.0.1.post2
[pip3] torchfile==0.1.0
[pip3] torchtext==0.4.0
[pip3] torchvision-nightly==0.2.1
[conda] pytorch-pretrained-bert 0.6.2 pypi_0 pypi
[conda] torch-nightly 1.2.0.dev20190620 pypi_0 pypi
[conda] torchfile 0.1.0 pypi_0 pypi
[conda] torchtext 0.4.0 pypi_0 pypi