Multiprocessing with GPUs setup for workers has errors #2758

Description

@aingo03304

🐛 Bug

In xla_multiprocessing.py, _parse_workers_config returns an OrderedDict:

def _parse_workers_config(config):
  # XRT_WORKERS='worker:0;ismz9:25822'
  workers = collections.OrderedDict()
  for worker in config.split('|'):
    m = re.match(r'(\w+):(\d+);((grpc://)?[a-zA-Z0-9_\-\.]+:\d+)', worker)
    if not m:
      raise ValueError('Bad worker syntax: {}'.format(worker))
    workers['{}:{}'.format(m.group(1), m.group(2))] = WorkerConfigEntry(
        worker_name=m.group(1), ordinal=int(m.group(2)), host_port=m.group(3))
  return workers
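
For illustration, here is a runnable sketch of what this returns (when combined with the function above; WorkerConfigEntry is stubbed here as a namedtuple with the field names used above, which may differ from torch_xla's actual definition), using the CI configuration quoted further down:

import collections
import re

# Stub with the field names used above; torch_xla's real definition may differ.
WorkerConfigEntry = collections.namedtuple(
    'WorkerConfigEntry', ['worker_name', 'ordinal', 'host_port'])

wcfg = _parse_workers_config('localservice:0;grpc://localhost:40934')
print(wcfg['localservice:0'].host_port)  # 'grpc://localhost:40934'

Note that host_port keeps the full 'grpc://...' URL, which matters for both problems below.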

However, the code that consumes this dict iterates over it directly:

for h, worker in enumerate(wcfg):
  m = re.match(r'(.*):(\d+)$', worker.host_port)

this code tries to access worker.host_port, but iterating over an OrderedDict yields its string keys rather than the WorkerConfigEntry values, so an AttributeError is raised. It can be fixed by replacing worker.host_port with wcfg[worker].host_port (or by iterating over wcfg.values()).
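
Continuing the sketch above, the failure and the fix in isolation:

for h, worker in enumerate(wcfg):
  # Iterating an OrderedDict yields its string keys ('localservice:0'),
  # so `worker.host_port` raises AttributeError; look the entry up instead.
  m = re.match(r'(.*):(\d+)$', wcfg[worker].host_port)
  print(h, m.group(1), m.group(2))  # 0 grpc://localhost 40934

Note that the greedy (.*) captures everything up to the final colon, so m.group(1) is 'grpc://localhost', scheme included.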

Additionally,

workers.append('{}:{};grpc://{}:{}'.format(worker.worker_name, gindex,
                                           m.group(1),
                                           int(m.group(2)) + i))

this code appends entries in the format '{}:{};grpc://{}:{}', but '{}:{};{}:{}' is correct, because m.group(1) already includes 'grpc://' given the following configuration used in CI (see the sketch after it):

export XRT_WORKERS="localservice:0;grpc://localhost:40934"
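
Continuing the sketch once more (gindex is a stand-in here for the GPU ordinal computed by the surrounding code): with the original format string the entry would come out as 'localservice:0;grpc://grpc://localhost:40934', while the corrected format round-trips the input:

workers = []
gindex = 0  # stand-in for the GPU ordinal used by the real code
for i, entry in enumerate(wcfg.values()):
  m = re.match(r'(.*):(\d+)$', entry.host_port)
  # m.group(1) is 'grpc://localhost', so no extra 'grpc://' prefix is needed.
  workers.append('{}:{};{}:{}'.format(entry.worker_name, gindex,
                                      m.group(1),
                                      int(m.group(2)) + i))
print(workers)  # ['localservice:0;grpc://localhost:40934']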

To Reproduce

Steps to reproduce the behavior:

  1. On a machine with GPUs, run test_train_mp_mnist.py

Environment

  • Reproducible on XLA backend [CPU/TPU]: GPUs
  • torch_xla version: master
