RuntimeError: CUDA error: initialization error when calling torch.distributed.init_process_group using torch multiprocessing #68256
Comments
@ParamsRaman Can you try with PyTorch 1.9? A lot of the init_process_group logic was changed so that we no longer use a barrier as part of init_process_group.
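For context, a minimal sketch of what that change implies for callers (the function name, address, and port below are placeholders, not from this issue): on 1.9+ init_process_group no longer synchronizes ranks with a barrier internally, so call dist.barrier() explicitly after initialization if your test logic depends on all ranks having finished setup.

import os

import torch.distributed as dist


def init_worker(rank, world_size):
    # Placeholder single-node rendezvous settings.
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
    os.environ.setdefault('MASTER_PORT', '29500')

    dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)

    # PyTorch 1.9+ does not issue a barrier inside init_process_group,
    # so synchronize explicitly if that guarantee is needed.
    dist.barrier()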
I see.
@ParamsRaman if you are still looking for a solution that works with the latest PyTorch, here is what we came up with for DeepSpeed unit testing. We replace the @distributed_test decorator with a DistributedTest base class:

import inspect
import os
import time
from abc import ABC

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.multiprocessing import Process

import pytest

DEEPSPEED_UNIT_WORKER_TIMEOUT = 120


class DistributedTest(ABC):
    is_dist_test = True
    world_size = 2
    backend = "nccl"

    def _run_test(self, request):
        self.current_test = self._get_current_test_func(request)
        self.test_kwargs = self._get_test_kwargs(request)
        if isinstance(self.world_size, int):
            self.world_size = [self.world_size]
        for procs in self.world_size:
            self._launch_procs(procs)
            time.sleep(0.5)

    def _get_current_test_func(self, request):
        # DistributedTest subclasses may have multiple test methods
        func_name = request.function.__name__
        return getattr(self, func_name)

    def _get_test_kwargs(self, request):
        # Grab fixture / parametrize kwargs from pytest request object
        test_kwargs = {}
        params = inspect.getfullargspec(self.current_test).args
        params.remove("self")
        for p in params:
            test_kwargs[p] = request.getfixturevalue(p)
        return test_kwargs
    def _launch_procs(self, num_procs):
        mp.set_start_method('forkserver', force=True)
        processes = []
        for local_rank in range(num_procs):
            p = Process(target=self._dist_init, args=(local_rank, num_procs))
            p.start()
            processes.append(p)

        # Now loop and wait for a test to complete. The spin-wait here isn't a big
        # deal because the number of processes will be O(#GPUs) << O(#CPUs).
        any_done = False
        while not any_done:
            for p in processes:
                if not p.is_alive():
                    any_done = True
                    break

        # Wait for all other processes to complete
        for p in processes:
            p.join(DEEPSPEED_UNIT_WORKER_TIMEOUT)

        failed = [(rank, p) for rank, p in enumerate(processes) if p.exitcode != 0]
        for rank, p in failed:
            # If it still hasn't terminated, kill it because it hung.
            if p.exitcode is None:
                p.terminate()
                pytest.fail(f'Worker {rank} hung.', pytrace=False)
            if p.exitcode < 0:
                pytest.fail(f'Worker {rank} killed by signal {-p.exitcode}',
                            pytrace=False)
            if p.exitcode > 0:
                pytest.fail(f'Worker {rank} exited with code {p.exitcode}',
                            pytrace=False)
    def _dist_init(self, local_rank, num_procs):
        """Initialize torch.distributed and execute the user test function."""
        os.environ['MASTER_ADDR'] = '127.0.0.1'
        os.environ['MASTER_PORT'] = get_master_port()  # helper defined elsewhere in DeepSpeed's test suite
        os.environ['LOCAL_RANK'] = str(local_rank)
        # NOTE: unit tests don't support multi-node so local_rank == global rank
        os.environ['RANK'] = str(local_rank)
        os.environ['WORLD_SIZE'] = str(num_procs)

        # turn off NCCL logging if set
        os.environ.pop('NCCL_DEBUG', None)

        set_cuda_visibile()  # helper defined elsewhere in DeepSpeed's test suite

        dist.init_process_group(backend=self.backend, rank=local_rank, world_size=num_procs)
        dist.barrier()

        if torch.cuda.is_available():
            torch.cuda.set_device(local_rank)

        self.current_test(**self.test_kwargs)

        # make sure all ranks finish at the same time
        dist.barrier()
        # tear down after test completes
        dist.destroy_process_group()

You will need to add the following to your conftest.py:

# Override of pytest "runtest" for DistributedTest class
# This hook is run before the default pytest_runtest_call
@pytest.hookimpl(tryfirst=True)
def pytest_runtest_call(item):
    # We want to use our own launching function for distributed tests
    if getattr(item.cls, "is_dist_test", False):
        dist_test_class = item.cls()
        dist_test_class._run_test(item._request)
        item.runtest = lambda: True  # Dummy function so test is not run twice

and then your distributed tests will need to be refactored:

# OLD TEST
@pytest.mark.parametrize('foo,bar', [(1,2),(3,4)])
def test_example(foo, bar):
    @distributed_test(world_size=[1,4])
    def _go():
        assert foo < bar
    _go()

# NEW TEST
@pytest.mark.parametrize('foo,bar', [(1,2),(3,4)])
class TestExample(DistributedTest):
    world_size = [1, 4]

    def test(self, foo, bar):
        assert foo < bar
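The important detail in _launch_procs above is mp.set_start_method('forkserver', force=True): once a parent process has created a CUDA context, CUDA cannot be re-initialized in a child created with fork, which is a common cause of the "CUDA error: initialization error" seen here, so worker processes have to be started with forkserver or spawn. As a rough sketch of the same idea outside of pytest, assuming two GPUs and an arbitrary free port (the worker function and port number are illustrative, not part of the DeepSpeed snippet):

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank, world_size):
    # Each worker does its own rendezvous; the parent never touches CUDA.
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29511'  # illustrative fixed port
    dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # ... run the actual test body here ...
    dist.barrier()
    dist.destroy_process_group()


if __name__ == '__main__':
    # mp.spawn uses the 'spawn' start method, so children get a fresh CUDA context.
    mp.spawn(_worker, args=(2,), nprocs=2, join=True)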
@mrwyattii Thanks for the snippet. Will try this out!
❓ Questions and Help
I created a pytest fixture using a decorator to create multiple processes (via torch multiprocessing) for running model-parallel distributed unit tests with pytorch distributed. I suddenly started encountering the CUDA initialization error below (while I was fixing some unit test logic). Since then, all of my unit tests have been failing, and I traced the failure back to my pytest fixture, which calls torch.distributed.init_process_group(..).
Error traceback:
Below is the pytest fixture I created:
Below is how I call the pytest fixture:
I have seen some issues raised in the past about torch multiprocessing and CUDA not working well together, but I am not sure whether this is related. Is there a different way I should be creating my processes to avoid this problem? Any help is appreciated.
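For reference, a hypothetical minimal version of the kind of interaction I mean (this is not my actual fixture): if the parent process has already initialized CUDA and the workers are created with the default fork start method, CUDA cannot be re-initialized in the children and the first CUDA call there fails with an initialization error.

import os

import torch
import torch.distributed as dist
from torch.multiprocessing import Process


def _worker(rank, world_size):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    # On 1.8 this can fail right here, since init_process_group still issued a
    # barrier; on newer versions the error surfaces at the first CUDA call.
    dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    dist.barrier()  # "CUDA error: initialization error" in a forked child


if __name__ == '__main__':
    torch.cuda.init()  # parent touches CUDA before forking the workers
    procs = [Process(target=_worker, args=(rank, 2)) for rank in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()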
I am using pytorch version: 1.8.0a0+ae5c2fe
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang