
Does the test accuracy need to be synchronized in distributed.py? #1

Closed
yifanjiang19 opened this issue Dec 25, 2019 · 7 comments

@yifanjiang19

If I directly output the test accuracy, will the code automatically synchronize the accuracy across the GPUs?

tczhangzhi added the question label Dec 25, 2019
tczhangzhi (Owner) commented Dec 25, 2019

Nope. If you really need it, you can use .share_memory_() to share a tensor's memory.
In general, most distributed libraries only help you handle the synchronization of data, parameters, and gradients.

yifanjiang19 (Author) commented Dec 26, 2019

Could you give a specific example?
Thanks!

@tczhangzhi (Owner)

Hmm, I'm afraid that's not quite right.
Here are two ways to communicate between torch.multiprocessing processes:

  1. If you don't care about the exact running results, you can use share_memory_() like this, which is faster:
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def evaluate(rank):
    # Stand-in for a real evaluation loop: just produce a per-GPU "accuracy".
    torch.manual_seed(rank)
    local_acc = torch.randn(1)[0].cuda(rank)

    print("local_acc:", local_acc)

    return local_acc

def main_worker(gpu, ngpus_per_node, args):
    dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=4, rank=gpu)

    local_acc = evaluate(gpu)

    # These tensors live in shared memory, so every process sees the same storage.
    global_acc, global_count = args['global_acc'], args['global_count']

    # Note: these in-place updates are not synchronized across processes,
    # so the result is only approximate.
    global_acc += local_acc.cpu()
    global_count += 1

    print("global_acc:", global_acc / global_count)

if __name__ == '__main__':
    global_acc = torch.tensor(.0)
    global_count = torch.tensor(.0)

    # Move the accumulators into shared memory before spawning the workers.
    global_acc.share_memory_()
    global_count.share_memory_()

    args = {
        'global_acc': global_acc,
        'global_count': global_count
    }

    # mp.spawn passes the process index as the first argument (gpu),
    # followed by the extra args: here ngpus_per_node=4 and the shared dict.
    mp.spawn(main_worker, nprocs=4, args=(4, args))

  2. But if you really need to synchronize the accuracy, I suggest an implementation like the following, or something else based on all_reduce:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def evaluate(rank):
    # Stand-in for a real evaluation loop: just produce a per-GPU "accuracy".
    torch.manual_seed(rank)
    local_acc = torch.randn(1)[0].cuda(rank)

    print("local_acc:", local_acc)

    return local_acc

def main_worker(gpu, ngpus_per_node, args):
    dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=4, rank=gpu)

    local_acc = evaluate(gpu)

    # Sum the per-GPU accuracies across all processes (in place, on every rank),
    # then divide by the number of GPUs to get the average.
    dist.all_reduce(local_acc, op=dist.ReduceOp.SUM)
    global_acc = local_acc / ngpus_per_node

    print("global:", global_acc)

if __name__ == '__main__':
    args = {}
    mp.spawn(main_worker, nprocs=4, args=(4, args))

@tczhangzhi (Owner)

I'm not sure if that was clear; if not, you can directly use this code:

acc1, acc5 = accuracy(output, target, topk=(1, 5))
...
# Sum this rank's top-1 accuracy across all processes, then divide by the
# world size (4 GPUs here) before feeding it to the meter.
dist.all_reduce(acc1, op=dist.ReduceOp.SUM)
...
top1.update(acc1[0] / 4, images.size(0))

By the way, I don't think we really need to calculate the averaged accuracy during training; it is a waste of time.
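
For context, here is a minimal self-contained sketch of the same idea inside an evaluation routine. The names model, loader, and gpu are assumed inputs (not from this repo), top-1 accuracy is computed with a plain argmax instead of the accuracy helper above, and world_size=4 matches the spawn calls in the earlier examples:

import torch
import torch.distributed as dist


def evaluate(model, loader, gpu, world_size=4):
    # Assumes dist.init_process_group(...) has already been called, as above,
    # and that each process gets its own shard of the validation data.
    model.eval()
    correct = torch.tensor(0.0).cuda(gpu)
    total = torch.tensor(0.0).cuda(gpu)
    with torch.no_grad():
        for images, target in loader:
            images, target = images.cuda(gpu), target.cuda(gpu)
            pred = model(images).argmax(dim=1)
            correct += (pred == target).float().sum()
            total += target.numel()
    local_acc = 100.0 * correct / total               # this GPU's top-1 accuracy
    dist.all_reduce(local_acc, op=dist.ReduceOp.SUM)  # sum over all processes
    return local_acc / world_size                     # average across GPUs

Note that dividing the summed accuracies by the world size gives the exact average only when every GPU evaluates the same number of samples.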

@yifanjiang19 (Author)

Thanks!
Should the code synchronize the loss between the GPUs before loss.backward(), or will the backward pass synchronize it automatically?

@tczhangzhi (Owner)

No. That's DistributedDataParallel's job.
Wrap your model with DistributedDataParallel and just call backward() as usual. During the backward pass, gradients from each process are averaged (which is what you mean by "synchronizing the loss"), so the parameters stay synchronized automatically.
Check it here: https://github.com/pytorch/pytorch/blob/46539eee0363e25ce5eb408c85cefd808cd6f878/torch/nn/parallel/distributed.py#L378-L382
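
For reference, a minimal sketch of that pattern (illustrative only, not code from this repo, assuming four GPUs as in the examples above): wrap the model in DistributedDataParallel and call backward() as usual, and the gradients are all-reduced for you.

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn


def main_worker(gpu, ngpus_per_node, args):
    dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=ngpus_per_node, rank=gpu)
    torch.cuda.set_device(gpu)

    # Wrapping the model registers hooks that all-reduce (average) the gradients
    # across processes during backward(); no manual loss synchronization is needed.
    model = nn.Linear(10, 1).cuda(gpu)
    model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])

    criterion = nn.MSELoss().cuda(gpu)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    images = torch.randn(8, 10).cuda(gpu)   # each rank trains on its own local batch
    target = torch.randn(8, 1).cuda(gpu)

    loss = criterion(model(images), target)  # local, unsynchronized loss
    optimizer.zero_grad()
    loss.backward()                          # gradients are averaged across ranks here
    optimizer.step()                         # every rank applies the same update

    print("rank", gpu, "loss:", loss.item())

if __name__ == '__main__':
    mp.spawn(main_worker, nprocs=4, args=(4, {}))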

tczhangzhi self-assigned this Dec 27, 2019
@yifanjiang19 (Author)

thanks
