
Fix RPC Param server example for multiple trainers #877

Open · wants to merge 3 commits into base: main

Conversation

@rohan-varma (Member) commented Jan 28, 2021

When running with multiple trainers, we were running into the following issue with the parameter server example:

Process Process-1:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "rpc_parameter_server.py", line 223, in run_worker
    run_training_loop(rank, num_gpus, train_loader, test_loader)
  File "rpc_parameter_server.py", line 182, in run_training_loop
    dist_autograd.backward(cid, [loss])
RuntimeError: Error on Node 0: one of the variables needed for gradient computation has been modified by an inplace operation: [CPUFloatType [32, 1, 3, 3]] is at version 28; expected version 27 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

At a high level, this resulted from one trainer running a backwards pass while another was updating params with the optimizer. Still coordinating with folks internally to understand if this is expected behavior or not.

In the meantime, we have changed the example as follows:

  1. Each trainer now has its own model on the parameter server, eliminating the issue of trainers stepping on each other
  2. These model copies are synced at a fixed interval by averaging their parameters on the PS (a rough sketch follows below).

I think the changes still let the example fulfill its main purpose, which is to demonstrate RPC and distributed autograd.
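A minimal sketch of the per-trainer-model idea, assuming a ParameterServer class that holds one model copy per trainer (the class, method, and toy model here are illustrative, not the exact code in this PR):

```python
import torch
import torch.nn as nn


class ParameterServer:
    def __init__(self, num_trainers):
        # One independent model copy per trainer rank (rank 0 is the PS), so a
        # backward pass from one trainer never races with another trainer's
        # optimizer step on the same parameters.
        self.models = {
            rank: nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
            for rank in range(1, num_trainers + 1)
        }

    def get_model(self, trainer_rank):
        return self.models[trainer_rank]

    @torch.no_grad()
    def average_models(self):
        # Periodic sync: overwrite every copy with the element-wise mean of the
        # corresponding parameters across all trainer copies.
        for params in zip(*(m.parameters() for m in self.models.values())):
            mean = torch.stack(list(params)).mean(dim=0)
            for p in params:
                p.copy_(mean)
```

Since each trainer only ever touches its own copy, backward passes and optimizer steps from different trainers can no longer interleave on the same tensors; staleness between copies is bounded by the sync interval.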

Tested by repeatedly running 4 trainers (world size 5) with no crashes, e.g.:
python3 rpc_parameter_server.py --world_size=5 --rank=2

If this looks good, we will update the corresponding tutorial in pytorch/tutorials accordingly.

@mrshenli (Contributor)

> At a high level, this resulted from one trainer running a backwards pass while another was updating params with the optimizer. Still coordinating with folks internally to understand if this is expected behavior or not.

This makes sense. If the param is modified between the forward and backward passes, the autograd algorithm is no longer correct. Thanks for digging into this!

Regarding the fix, would it also work to force a barrier before every optimizer.step(), which guarantees no unintended param changes? That way, we wouldn't need multiple model copies. If the model is on CUDA, since all updates use the same default stream, there shouldn't be race/contention issues either. Not sure about CPU models though.
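For reference, a rough sketch of how that barrier could sit in the trainer's training loop (illustrative only; it assumes a gloo process group has been initialized alongside RPC so trainers can synchronize, net stands in for the RPC forward to the PS, and opt is the DistributedOptimizer from the example):

```python
import torch.distributed as dist
import torch.distributed.autograd as dist_autograd


def train_step(net, opt, loss_fn, data, target):
    with dist_autograd.context() as cid:
        loss = loss_fn(net(data), target)
        dist_autograd.backward(cid, [loss])
        # All trainers finish their backward pass before any of them is
        # allowed to mutate the shared parameters.
        dist.barrier()
        opt.step(cid)
```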

@lucasleesw

Hi, are there any examples for multi-GPU training, e.g. where each GPU runs one trainer?

@lucasleesw

Could you help me understand where the forward computation happens?
In my understanding, whenever a Trainer runs model_output = self.param_server_rref.rpc_sync().forward(self.rank, x), the forward computation is done by the "parameter server", because self.param_server_rref.owner() is the "parameter server", and the RPC docs say rref.rpc_sync() runs on the worker rref.owner(). Please correct me if I am misunderstanding, thank you!
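For reference, a minimal sketch of the RRef proxy semantics being discussed (the worker names and the nn.Linear model are illustrative; it assumes rpc.init_rpc has already been called and a worker named "parameter_server" exists):

```python
import torch
import torch.nn as nn
from torch.distributed import rpc


def trainer_forward():
    # This RRef is owned by the worker named "parameter_server".
    param_server_rref = rpc.remote("parameter_server", nn.Linear, args=(4, 2))
    x = torch.randn(1, 4)
    # rpc_sync() returns a proxy that issues an RPC to param_server_rref.owner(),
    # so the forward pass itself executes on the parameter server; only the
    # output tensor is sent back to this trainer.
    return param_server_rref.rpc_sync().forward(x)
```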

@msaroufim (Member)

Hi @rohan-varma @mrshenli, is this an example you'd still like to see merged in?
