Description
🐛 Bug
In `xla_multiprocessing.py`, `_parse_workers_config` returns an `OrderedDict`.
xla/torch_xla/distributed/xla_multiprocessing.py
Lines 54 to 63 in 9917e41
    def _parse_workers_config(config):
      # XRT_WORKERS='worker:0;ismz9:25822'
      workers = collections.OrderedDict()
      for worker in config.split('|'):
        m = re.match(r'(\w+):(\d+);((grpc://)?[a-zA-Z0-9_\-\.]+:\d+)', worker)
        if not m:
          raise ValueError('Bad worker syntax: {}'.format(worker))
        workers['{}:{}'.format(m.group(1), m.group(2))] = WorkerConfigEntry(
            worker_name=m.group(1), ordinal=int(m.group(2)), host_port=m.group(3))
      return workers
However
xla/torch_xla/distributed/xla_multiprocessing.py
Lines 156 to 157 in 9917e41
    for h, worker in enumerate(wcfg):
      m = re.match(r'(.*):(\d+)$', worker.host_port)
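This loop iterates over `wcfg`, which is an `OrderedDict`, so `worker` is a string key (for example `'localservice:0'`) rather than a `WorkerConfigEntry`. Accessing `worker.host_port` therefore raises an `AttributeError`. Replacing `worker.host_port` with `wcfg[worker].host_port` fixes it.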
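The failure can be reproduced without `torch_xla` installed. The sketch below copies the parsing logic from the snippet above into a standalone function; `WorkerConfigEntry` here is a stand-in namedtuple whose field names are assumed from the keyword arguments in that snippet, and `parse_workers_config` is a local name, not the library's function.

```python
import collections
import re

# Stand-in for torch_xla's WorkerConfigEntry (assumption: a record with
# these three fields, as suggested by the keyword arguments in the issue).
WorkerConfigEntry = collections.namedtuple(
    'WorkerConfigEntry', 'worker_name ordinal host_port')


def parse_workers_config(config):
    # Same logic as _parse_workers_config quoted in this issue.
    workers = collections.OrderedDict()
    for worker in config.split('|'):
        m = re.match(r'(\w+):(\d+);((grpc://)?[a-zA-Z0-9_\-\.]+:\d+)', worker)
        if not m:
            raise ValueError('Bad worker syntax: {}'.format(worker))
        workers['{}:{}'.format(m.group(1), m.group(2))] = WorkerConfigEntry(
            worker_name=m.group(1), ordinal=int(m.group(2)),
            host_port=m.group(3))
    return workers


wcfg = parse_workers_config('localservice:0;grpc://localhost:40934')

# Iterating the dict (with or without enumerate) yields its string keys.
keys = [worker for _, worker in enumerate(wcfg)]

# Attribute access on a key fails, which is the error described above.
try:
    keys[0].host_port
    failed = False
except AttributeError:
    failed = True

# Looking the entry up by key gives the WorkerConfigEntry.
host_ports = [wcfg[worker].host_port for worker in wcfg]
```

Iterating over `wcfg.values()` or `wcfg.items()` would work equally well; the lookup form is shown because it is the one-token change proposed here.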
And,
xla/torch_xla/distributed/xla_multiprocessing.py
Lines 163 to 165 in 9917e41
    workers.append('{}:{};grpc://{}:{}'.format(worker.worker_name, gindex,
                                               m.group(1),
                                               int(m.group(2)) + i))
This code formats each entry as `'{}:{};grpc://{}:{}'`, but `'{}:{};{}:{}'` is correct for the following CI configuration: `m.group(1)` already includes `'grpc://'`, so the current format produces a doubled `grpc://grpc://` prefix.
Line 11 in 3eaee46
    export XRT_WORKERS="localservice:0;grpc://localhost:40934"
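To make the doubled prefix concrete, here is a small sketch that applies the same regex and both format strings to the `host_port` from the CI value. The values of `worker_name`, `gindex`, and `i` are illustrative placeholders, not taken from a real run.

```python
import re

# host_port as parsed from the CI value of XRT_WORKERS.
host_port = 'grpc://localhost:40934'
m = re.match(r'(.*):(\d+)$', host_port)

# Illustrative placeholders for the loop variables in the original code.
worker_name, gindex, i = 'localservice', 0, 0

# Current format string: prepends grpc:// even though group(1) has it.
current = '{}:{};grpc://{}:{}'.format(
    worker_name, gindex, m.group(1), int(m.group(2)) + i)

# Proposed format string: relies on the scheme already in group(1).
proposed = '{}:{};{}:{}'.format(
    worker_name, gindex, m.group(1), int(m.group(2)) + i)
```

Here `current` contains `grpc://grpc://localhost`, while `proposed` round-trips to the original entry. One caveat: the parser's regex treats `grpc://` as optional, so a `host_port` written without the scheme would come out of the proposed format with no scheme at all. Whether that case occurs in practice is not something this report establishes.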
To Reproduce
Steps to reproduce the behavior:
- With GPUs, run `test_train_mp_mnist.py`.
Expected behavior
Environment
- Reproducible on XLA backend [CPU/TPU]: GPUs
- torch_xla version: master
Additional context