cfg.DIST_ENABLE false fail #36
Open
bhack opened this issue Mar 3, 2023 · 21 comments

Comments

bhack commented Mar 3, 2023

With cfg.DIST_ENABLE set to false, not all of the distributed-specific parts are wrapped in a condition that checks whether distributed training is enabled.

z-x-yang (Collaborator) commented Mar 3, 2023

self.DIST_ENABLE is necessary for multi-GPU training.

bhack (Author) commented Mar 3, 2023

I meant that I am trying to test single-GPU training.
In many places cfg.DIST_ENABLE is checked to safely take the non-distributed code path.

But in many other places it is not:

self.train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset)
self.train_loader = DataLoader(train_dataset,
                               batch_size=int(cfg.TRAIN_BATCH_SIZE /
                                              cfg.TRAIN_GPUS),
                               shuffle=False,
                               num_workers=cfg.DATA_WORKERS,
                               pin_memory=True,
                               sampler=self.train_sampler,
                               drop_last=True,
                               prefetch_factor=4)
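
Something like the following would cover the single-GPU case (just a sketch on my side, reusing the names from the snippet above; the exact trainer structure is an assumption):

# Sketch: only build a DistributedSampler when distributed training is
# enabled; otherwise let the DataLoader shuffle on its own.
if cfg.DIST_ENABLE:
    self.train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset)
else:
    self.train_sampler = None
self.train_loader = DataLoader(train_dataset,
                               batch_size=int(cfg.TRAIN_BATCH_SIZE /
                                              cfg.TRAIN_GPUS),
                               shuffle=(self.train_sampler is None),
                               num_workers=cfg.DATA_WORKERS,
                               pin_memory=True,
                               sampler=self.train_sampler,
                               drop_last=True,
                               prefetch_factor=4)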

bhack (Author) commented Mar 7, 2023

E.g. here, by contrast, the code does check cfg.DIST_ENABLE:

if cfg.DIST_ENABLE:
    dist.init_process_group(backend=cfg.DIST_BACKEND,
                            init_method=cfg.DIST_URL,
                            world_size=cfg.TRAIN_GPUS,
                            rank=rank,
                            timeout=datetime.timedelta(seconds=300))
    self.model.encoder = nn.SyncBatchNorm.convert_sync_batchnorm(
        self.model.encoder).cuda(self.gpu)
    self.dist_engine = torch.nn.parallel.DistributedDataParallel(
        self.engine,
        device_ids=[self.gpu],
        output_device=self.gpu,
        find_unused_parameters=True,
        broadcast_buffers=False)
else:
    self.dist_engine = self.engine

self.use_frozen_bn = False
if 'swin' in cfg.MODEL_ENCODER:
    self.print_log('Use LN in Encoder!')
elif not cfg.MODEL_FREEZE_BN:
    if cfg.DIST_ENABLE:
        self.print_log('Use Sync BN in Encoder!')

bhack changed the title from "self.DIST_ENABLE false fail" to "cfg.DIST_ENABLE false fail" on Mar 7, 2023
z-x-yang (Collaborator) commented Mar 7, 2023

The distributed sampler is useless and meaningless for evaluation, where the GPUs run asynchronously rather than synchronously: the video lengths always differ across GPUs.

bhack (Author) commented Mar 7, 2023

Isn't this in the trainer?
I am trying to train a single-GPU job with cfg.DIST_ENABLE=False.

z-x-yang (Collaborator) commented Mar 7, 2023 via email

bhack (Author) commented Mar 7, 2023

Yes, that is what I meant. Aren't we going to have issues if we don't conditionally wrap torch.nn.parallel.DistributedDataParallel in the trainer?

z-x-yang (Collaborator) commented Mar 7, 2023 via email

bhack (Author) commented Mar 7, 2023

Isn't that going to require init_process_group? But that call is conditionally wrapped in the trainer:

if cfg.DIST_ENABLE:
    dist.init_process_group(backend=cfg.DIST_BACKEND,
                            init_method=cfg.DIST_URL,
                            world_size=cfg.TRAIN_GPUS,
                            rank=rank,
                            timeout=datetime.timedelta(seconds=300))

z-x-yang (Collaborator) commented Mar 7, 2023 via email

bhack (Author) commented Mar 7, 2023

E.g. with self.DIST_ENABLE = False in configs/default.py we fail right away at:

self.train_sampler = torch.utils.data.distributed.DistributedSampler(

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Cause:
#36 (comment)

This is because the cfg.DIST_ENABLE safeguards and the corresponding non-distributed code path do not yet cover everything.
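
For reference, the error is easy to reproduce outside the trainer, since DistributedSampler (like DistributedDataParallel) needs the default process group to be initialized. A minimal standalone sketch (not code from the repo, just an illustration):

import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.zeros(4, 1))

# Without dist.init_process_group(...) there is no default process group,
# so constructing the sampler raises the same RuntimeError as above.
try:
    sampler = DistributedSampler(dataset)
except RuntimeError as e:
    print(e)  # Default process group has not been initialized ...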

bhack (Author) commented Mar 8, 2023

> Sorry, I don't understand what you mean.

Is it clear now?

z-x-yang (Collaborator) commented Mar 8, 2023

I have updated trainer.py, and it should be ok to set self.DIST_ENABLE = False and train with a single GPU.

bhack (Author) commented Mar 8, 2023

Thanks, I've checked your changes and they are the same ones I had made locally on my side over the past few days.

What do you think now about this (pytorch/pytorch#37444)?

# Use torch.multiprocessing.spawn to launch distributed processes
mp.spawn(main_worker, nprocs=cfg.TRAIN_GPUS, args=(cfg, args.amp))
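
One option for the single-GPU case (just a sketch, assuming main_worker keeps the signature main_worker(rank, cfg, enable_amp) implied by the spawn call above) would be to skip the spawn entirely:

if cfg.DIST_ENABLE:
    # Use torch.multiprocessing.spawn to launch distributed processes
    mp.spawn(main_worker, nprocs=cfg.TRAIN_GPUS, args=(cfg, args.amp))
else:
    # Single GPU: run the worker directly in the current process and
    # avoid the spawn/join machinery entirely.
    main_worker(0, cfg, args.amp)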

bhack (Author) commented Mar 8, 2023

Besides my previous comment, I think we still have an issue with the board keys when DIST_ENABLE=False:

for key in boards['image'].keys():
    tmp = boards['image'][key].cpu().numpy()
    self.tblogger.add_image('S{}/' + key, tmp, step)
for key in boards['scalar'].keys():
    tmp = boards['scalar'][key].cpu().numpy()
    self.tblogger.add_scalar('S{}/' + key, tmp, step)

    for key in boards['image'].keys():
AttributeError: 'list' object has no attribute 'keys'
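
A possible defensive tweak would be to tolerate both shapes (just a sketch, under the assumption that in the non-distributed path boards['image'] is simply a list of the per-step tensors):

# Sketch: accept boards['image'] both as a dict (key -> tensor) and as a
# plain list of tensors, which seems to be what the single-GPU path yields.
images = boards['image']
items = images.items() if isinstance(images, dict) else enumerate(images)
for key, value in items:
    self.tblogger.add_image('S{}/' + str(key), value.cpu().numpy(), step)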

z-x-yang (Collaborator) commented Mar 8, 2023

It should be ok now.

bhack (Author) commented Mar 8, 2023

It seems that we have two issues:

  • The first is that the trainer seems to get "randomly" deadlocked on different runs of the same code.
  • The images in img_logs no longer look correct: they are only the binary masks, whereas they used to be "composited", if I remember correctly.

bhack (Author) commented Mar 8, 2023

For the first issue, this is the stack trace of one of the deadlocks, and it could be related to #36 (comment):

  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 109, in join
    ready = multiprocessing.connection.wait(
  File "/opt/conda/lib/python3.10/multiprocessing/connection.py", line 936, in wait
    ready = selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)

z-x-yang (Collaborator) commented Mar 8, 2023

I guess the first problem is due to torch.spawn, and I have modified the code related to it. Please give it a try; I hope this will work for you.

As for the second issue, img_logs should be images where target objects are marked with colorful masks, referring to here.

bhack (Author) commented Mar 8, 2023

> I guess the first problem is due to torch.spawn, and I have modified the code related to it. Please give it a try; I hope this will work for you.

Yes, it seems ok now.

> As for the second issue, img_logs should be images where target objects are marked with colorful masks, referring to here.

Yes, sorry, this was a false positive related to the image range in the compositing phase. Thanks for double-checking.
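
For anyone hitting the same thing, the compositing itself is simple once the range is right (a generic sketch, assuming float images in [0, 1] and a binary mask; not the repo's implementation):

import numpy as np

def composite(image, mask, color=(255, 0, 0), alpha=0.5):
    # image: HxWx3 float array in [0, 1]; mask: HxW boolean/0-1 array.
    # Convert to uint8 [0, 255] before blending: a range mismatch here is
    # what can make the logged images look like bare masks.
    img = (np.clip(image, 0.0, 1.0) * 255).astype(np.uint8)
    overlay = img.copy()
    overlay[mask.astype(bool)] = color
    return (alpha * overlay + (1 - alpha) * img).astype(np.uint8)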

bhack closed this as completed on Mar 8, 2023
bhack (Author) commented Dec 7, 2023

The same needs to be fixed in the PAOT branch.

bhack reopened this on Dec 7, 2023