The DDP hung up at torch.nn.parallel.DistributedDataParallel(model) #68

shubham83183 · 2022-10-21T03:05:40Z

Hi,
I really enjoyed reading your paper and code. Great work.
I am trying to reproduce the results by running your code on an HPC cluster (one node with 2 GPUs). As described in the training section of the README, I ran the following command in an interactive SLURM session:
"
python -m torch.distributed.launch --nproc_per_node=2 --use_env src/train.py with \
    crowdhuman \
    deformable \
    multi_frame \
    tracking \
    output_dir=models/crowdhuman_deformable_multi_frame
"

However, the code hangs at the line
"model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True)"

Could you please help me? The output is as follows:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

WARNING - root - Changed type of config entry "train_split" from str to NoneType
WARNING - train - No observers have been added to this run
WARNING - root - Changed type of config entry "train_split" from str to NoneType
WARNING - train - No observers have been added to this run
INFO - train - Running command 'load_config'
INFO - train - Started
INFO - train - Running command 'load_config'
INFO - train - Started
Configuration (modified, added, typechanged, doc):
aux_loss = True
backbone = 'resnet50'
batch_size = 1
bbox_loss_coef = 5.0
clip_max_norm = 0.1
cls_loss_coef = 2.0
coco_and_crowdhuman_prev_frame_rnd_augs = 0.2
coco_min_num_objects = 0
coco_panoptic_path = None
coco_path = 'data/coco_2017'
coco_person_train_split = None
crowdhuman_path = 'data/CrowdHuman'
crowdhuman_train_split = 'train_val'
dataset = 'mot_crowdhuman'
debug = False
dec_layers = 6
dec_n_points = 4
deformable = True
device = 'cuda'
dice_loss_coef = 1.0
dilation = False
dim_feedforward = 1024
dist_url = 'env://'
dropout = 0.1
enc_layers = 6
enc_n_points = 4
eos_coef = 0.1
epochs = 80
eval_only = False
eval_train = False
focal_alpha = 0.25
focal_gamma = 2
focal_loss = True
freeze_detr = False
giou_loss_coef = 2
hidden_dim = 288
load_mask_head_from_model = None
lr = 0.0002
lr_backbone = 2e-05
lr_backbone_names = ['backbone.0']
lr_drop = 50
lr_linear_proj_mult = 0.1
lr_linear_proj_names = ['reference_points', 'sampling_offsets']
lr_track = 0.0001
mask_loss_coef = 1.0
masks = False
merge_frame_features = False
mot_path_train = 'data/MOT17'
mot_path_val = 'data/MOT17'
multi_frame_attention = True
multi_frame_attention_separate_encoder = True
multi_frame_encoding = True
nheads = 8
no_vis = False
num_feature_levels = 4
num_queries = 500
num_workers = 2
output_dir = 'models/crowdhuman_deformable_multi_frame'
overflow_boxes = True
overwrite_lr_scheduler = False
overwrite_lrs = False
position_embedding = 'sine'
pre_norm = False
resume = ''
resume_optim = False
resume_shift_neuron = False
resume_vis = False
save_model_interval = 5
seed = 42
set_cost_bbox = 5.0
set_cost_class = 2.0
set_cost_giou = 2.0
start_epoch = 1
track_attention = False
track_backprop_prev_frame = False
track_prev_frame_range = 5
track_prev_frame_rnd_augs = 0.01
track_prev_prev_frame = False
track_query_false_negative_prob = 0.4
track_query_false_positive_eos_weight = True
track_query_false_positive_prob = 0.1
tracking = True
tracking_eval = True
train_split = None
two_stage = False
val_interval = 5
val_split = 'mot17_train_cross_val_frame_0_5_to_1_0_coco'
vis_and_log_interval = 50
vis_port = 8097
vis_server = ''
weight_decay = 0.0001
with_box_refine = True
world_size = 2
img_transform:
max_size = 1333
val_width = 800
INFO - train - Completed after 0:00:00
Namespace(aux_loss=True, backbone='resnet50', batch_size=1, bbox_loss_coef=5.0, clip_max_norm=0.1, cls_loss_coef=2.0, coco_and_crowdhuman_prev_frame_rnd_augs=0.2, coco_min_num_objects=0, coco_panoptic_path=None, coco_path='data/coco_2017', coco_person_train_split=None, crowdhuman_path='data/CrowdHuman', crowdhuman_train_split='train_val', dataset='mot_crowdhuman', debug=False, dec_layers=6, dec_n_points=4, deformable=True, device='cuda', dice_loss_coef=1.0, dilation=False, dim_feedforward=1024, dist_url='env://', dropout=0.1, enc_layers=6, enc_n_points=4, eos_coef=0.1, epochs=80, eval_only=False, eval_train=False, focal_alpha=0.25, focal_gamma=2, focal_loss=True, freeze_detr=False, giou_loss_coef=2, hidden_dim=288, img_transform=Namespace(max_size=1333, val_width=800), load_mask_head_from_model=None, lr=0.0002, lr_backbone=2e-05, lr_backbone_names=['backbone.0'], lr_drop=50, lr_linear_proj_mult=0.1, lr_linear_proj_names=['reference_points', 'sampling_offsets'], lr_track=0.0001, mask_loss_coef=1.0, masks=False, merge_frame_features=False, mot_path_train='data/MOT17', mot_path_val='data/MOT17', multi_frame_attention=True, multi_frame_attention_separate_encoder=True, multi_frame_encoding=True, nheads=8, no_vis=False, num_feature_levels=4, num_queries=500, num_workers=2, output_dir='models/crowdhuman_deformable_multi_frame', overflow_boxes=True, overwrite_lr_scheduler=False, overwrite_lrs=False, position_embedding='sine', pre_norm=False, resume='', resume_optim=False, resume_shift_neuron=False, resume_vis=False, save_model_interval=5, seed=42, set_cost_bbox=5.0, set_cost_class=2.0, set_cost_giou=2.0, start_epoch=1, track_attention=False, track_backprop_prev_frame=False, track_prev_frame_range=5, track_prev_frame_rnd_augs=0.01, track_prev_prev_frame=False, track_query_false_negative_prob=0.4, track_query_false_positive_eos_weight=True, track_query_false_positive_prob=0.1, tracking=True, tracking_eval=True, train_split=None, two_stage=False, val_interval=5, 
val_split='mot17_train_cross_val_frame_0_5_to_1_0_coco', vis_and_log_interval=50, vis_port=8097, vis_server='', weight_decay=0.0001, with_box_refine=True, world_size=2)
using distributed mode
| distributed init (rank 1): env://
Configuration (modified, added, typechanged, doc):
aux_loss = True
backbone = 'resnet50'
batch_size = 1
bbox_loss_coef = 5.0
clip_max_norm = 0.1
cls_loss_coef = 2.0
coco_and_crowdhuman_prev_frame_rnd_augs = 0.2
coco_min_num_objects = 0
coco_panoptic_path = None
coco_path = 'data/coco_2017'
coco_person_train_split = None
crowdhuman_path = 'data/CrowdHuman'
crowdhuman_train_split = 'train_val'
dataset = 'mot_crowdhuman'
debug = False
dec_layers = 6
dec_n_points = 4
deformable = True
device = 'cuda'
dice_loss_coef = 1.0
dilation = False
dim_feedforward = 1024
dist_url = 'env://'
dropout = 0.1
enc_layers = 6
enc_n_points = 4
eos_coef = 0.1
epochs = 80
eval_only = False
eval_train = False
focal_alpha = 0.25
focal_gamma = 2
focal_loss = True
freeze_detr = False
giou_loss_coef = 2
hidden_dim = 288
load_mask_head_from_model = None
lr = 0.0002
lr_backbone = 2e-05
lr_backbone_names = ['backbone.0']
lr_drop = 50
lr_linear_proj_mult = 0.1
lr_linear_proj_names = ['reference_points', 'sampling_offsets']
lr_track = 0.0001
mask_loss_coef = 1.0
masks = False
merge_frame_features = False
mot_path_train = 'data/MOT17'
mot_path_val = 'data/MOT17'
multi_frame_attention = True
multi_frame_attention_separate_encoder = True
multi_frame_encoding = True
nheads = 8
no_vis = False
num_feature_levels = 4
num_queries = 500
num_workers = 2
output_dir = 'models/crowdhuman_deformable_multi_frame'
overflow_boxes = True
overwrite_lr_scheduler = False
overwrite_lrs = False
position_embedding = 'sine'
pre_norm = False
resume = ''
resume_optim = False
resume_shift_neuron = False
resume_vis = False
save_model_interval = 5
seed = 42
set_cost_bbox = 5.0
set_cost_class = 2.0
set_cost_giou = 2.0
start_epoch = 1
track_attention = False
track_backprop_prev_frame = False
track_prev_frame_range = 5
track_prev_frame_rnd_augs = 0.01
track_prev_prev_frame = False
track_query_false_negative_prob = 0.4
track_query_false_positive_eos_weight = True
track_query_false_positive_prob = 0.1
tracking = True
tracking_eval = True
train_split = None
two_stage = False
val_interval = 5
val_split = 'mot17_train_cross_val_frame_0_5_to_1_0_coco'
vis_and_log_interval = 50
vis_port = 8097
vis_server = ''
weight_decay = 0.0001
with_box_refine = True
world_size = 2
img_transform:
max_size = 1333
val_width = 800
INFO - train - Completed after 0:00:00
Namespace(aux_loss=True, backbone='resnet50', batch_size=1, bbox_loss_coef=5.0, clip_max_norm=0.1, cls_loss_coef=2.0, coco_and_crowdhuman_prev_frame_rnd_augs=0.2, coco_min_num_objects=0, coco_panoptic_path=None, coco_path='data/coco_2017', coco_person_train_split=None, crowdhuman_path='data/CrowdHuman', crowdhuman_train_split='train_val', dataset='mot_crowdhuman', debug=False, dec_layers=6, dec_n_points=4, deformable=True, device='cuda', dice_loss_coef=1.0, dilation=False, dim_feedforward=1024, dist_url='env://', dropout=0.1, enc_layers=6, enc_n_points=4, eos_coef=0.1, epochs=80, eval_only=False, eval_train=False, focal_alpha=0.25, focal_gamma=2, focal_loss=True, freeze_detr=False, giou_loss_coef=2, hidden_dim=288, img_transform=Namespace(max_size=1333, val_width=800), load_mask_head_from_model=None, lr=0.0002, lr_backbone=2e-05, lr_backbone_names=['backbone.0'], lr_drop=50, lr_linear_proj_mult=0.1, lr_linear_proj_names=['reference_points', 'sampling_offsets'], lr_track=0.0001, mask_loss_coef=1.0, masks=False, merge_frame_features=False, mot_path_train='data/MOT17', mot_path_val='data/MOT17', multi_frame_attention=True, multi_frame_attention_separate_encoder=True, multi_frame_encoding=True, nheads=8, no_vis=False, num_feature_levels=4, num_queries=500, num_workers=2, output_dir='models/crowdhuman_deformable_multi_frame', overflow_boxes=True, overwrite_lr_scheduler=False, overwrite_lrs=False, position_embedding='sine', pre_norm=False, resume='', resume_optim=False, resume_shift_neuron=False, resume_vis=False, save_model_interval=5, seed=42, set_cost_bbox=5.0, set_cost_class=2.0, set_cost_giou=2.0, start_epoch=1, track_attention=False, track_backprop_prev_frame=False, track_prev_frame_range=5, track_prev_frame_rnd_augs=0.01, track_prev_prev_frame=False, track_query_false_negative_prob=0.4, track_query_false_positive_eos_weight=True, track_query_false_positive_prob=0.1, tracking=True, tracking_eval=True, train_split=None, two_stage=False, val_interval=5, 
val_split='mot17_train_cross_val_frame_0_5_to_1_0_coco', vis_and_log_interval=50, vis_port=8097, vis_server='', weight_decay=0.0001, with_box_refine=True, world_size=2)
using distributed mode
| distributed init (rank 0): env://
git:
sha: d62d810, status: has uncommited changes, branch: main
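
The `| distributed init (rank N): env://` lines in the log above indicate that each process takes its rank from environment variables. With `--use_env`, `torch.distributed.launch` passes the local rank through `LOCAL_RANK` instead of a `--local_rank` argument, alongside `RANK` and `WORLD_SIZE`. The sketch below shows how a training script might read them. The helper name `read_dist_env` is hypothetical and is not taken from the repository.

```python
import os


def read_dist_env(environ=None):
    """Return rank information set by the launcher, or None if absent.

    Hypothetical helper for illustration; the repository's own
    initialization code may differ.
    """
    environ = os.environ if environ is None else environ
    if "RANK" not in environ or "WORLD_SIZE" not in environ:
        return None  # not started by a distributed launcher
    return {
        "rank": int(environ["RANK"]),              # global process index
        "world_size": int(environ["WORLD_SIZE"]),  # total number of processes
        "local_rank": int(environ.get("LOCAL_RANK", 0)),  # GPU index on this node
    }


# Simulated environment for the second of two processes on one node.
example_env = {"RANK": "1", "WORLD_SIZE": "2", "LOCAL_RANK": "1"}
print(read_dist_env(example_env))
```

Printing this information from every process at startup is a quick way to confirm that both ranks were launched and see which GPU each one was assigned.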


timmeinhardt · 2022-10-21T17:25:21Z

I just ran your command on my HPC machine with 2 GPUs and it works. I am not very familiar with multi-GPU machine setups myself, but maybe something is not configured properly on your side? Have you used the same setup for multi-GPU training before?

shubham83183 · 2022-10-28T12:14:26Z

Hi,
Thanks for your reply. I have solved the issue. Here is what I observed:
DDP was hanging because there was not enough CPU memory, and because the process hung, the "not enough CPU memory" error was never displayed. That made it hard to debug what went wrong. After increasing the CPU memory, it works fine.
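
A hang like this produces no traceback on its own. One general-purpose way to see where each process is stuck is Python's standard `faulthandler` module, which can dump all thread stacks after a timeout. The sketch below is only an illustration: the helper name `hang_watchdog` and the timeout values are assumptions, not part of the repository. It shows where a process is blocked, not why, so memory usage still has to be checked separately.

```python
import contextlib
import faulthandler
import sys
import tempfile
import time


@contextlib.contextmanager
def hang_watchdog(timeout_s, file=None):
    """Dump all thread stack traces to `file` if the wrapped block
    is still running after `timeout_s` seconds. The process is not killed."""
    target = sys.stderr if file is None else file
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=target, exit=False)
    try:
        yield
    finally:
        # Disarm the timer once the block has finished.
        faulthandler.cancel_dump_traceback_later()


# Demonstration with a short sleep standing in for a call that hangs.
with tempfile.TemporaryFile("w+") as log:
    with hang_watchdog(0.2, file=log):
        time.sleep(0.6)
    log.seek(0)
    report = log.read()
print(report)
```

In a training script, the same context manager could wrap the suspect call, for example `with hang_watchdog(300): model = torch.nn.parallel.DistributedDataParallel(...)`, so that each rank writes its stack to stderr if the call has not returned after five minutes.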

Could you please tell me how you set the number of workers and GPUs? Even with multiple GPUs I am not seeing any decrease in training time.

timmeinhardt · 2022-10-28T18:40:10Z

The number of workers is specified per GPU, so this should scale automatically. Even with very few workers, training should still make progress, maybe not as fast as it could, but still. This sounds like a different problem or a bug in your setup.
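
Because the worker count applies per GPU process, the total number of data-loading processes grows with the number of GPUs. The sketch below works through that arithmetic for the `num_workers = 2` and `world_size = 2` values from the configuration dump and compares the result with the CPUs the job may use. The function names are hypothetical. The affinity check is Linux-specific and only reflects CPU limits that the scheduler enforces through an affinity mask.

```python
import os


def available_cpus():
    """Number of CPUs this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: honors affinity masks, such as those a job scheduler may set.
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def loader_process_count(num_workers, world_size):
    """Total DataLoader worker processes across all ranks on one node."""
    return num_workers * world_size


num_workers, world_size = 2, 2  # values from the configuration dump
workers = loader_process_count(num_workers, world_size)
total = workers + world_size  # workers plus one training process per GPU
print(f"{workers} worker processes + {world_size} training processes = {total}")
print(f"CPUs available to this job: {available_cpus()}")
```

If the total exceeds the available CPUs, or each process needs more host memory than the allocation provides, requesting more resources from the scheduler is a reasonable first thing to try.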

shubham83183 · 2022-10-29T11:58:57Z

Hi,
Thank you for your reply and your help.
I can now run the code with reduced training time.

timmeinhardt · 2022-10-31T17:24:54Z

Can we close this issue?

shubham83183 · 2022-10-31T17:26:09Z

Yes, sure. Thank you.

timmeinhardt closed this as completed Oct 31, 2022
