The DDP hung up at torch.nn.parallel.DistributedDataParallel(model) #68

Closed
shubham83183 opened this issue Oct 21, 2022 · 6 comments

@shubham83183

Hi,
I really enjoyed reading your paper and code. Great work.
I am trying to reproduce the results by running your code on an HPC cluster (one node with 2 GPUs). As described in the training section of the README, I ran the following command in an interactive Slurm session:
"
python -m torch.distributed.launch --nproc_per_node=2 --use_env src/train.py with \
    crowdhuman \
    deformable \
    multi_frame \
    tracking \
    output_dir=models/crowdhuman_deformable_multi_frame
"
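
With `--use_env`, `torch.distributed.launch` passes the rendezvous information to each process through environment variables, and the `env://` init method reads them from there. A minimal sketch for checking that each process actually received them is shown here; the helper names `missing_dist_env` and `describe_dist_env` are hypothetical and not part of the repository.

```python
import os

# Variables the launcher is expected to export for each process when
# --use_env is given and the init method is env://.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "LOCAL_RANK")


def missing_dist_env(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def describe_dist_env(env=None):
    """Summarize the distributed setup seen by this process (hypothetical helper)."""
    env = os.environ if env is None else env
    missing = missing_dist_env(env)
    if missing:
        return "missing: " + ", ".join(missing)
    return (
        f"rank {env['RANK']} of {env['WORLD_SIZE']} "
        f"(local rank {env['LOCAL_RANK']}) "
        f"via {env['MASTER_ADDR']}:{env['MASTER_PORT']}"
    )


if __name__ == "__main__":
    # Printing this at the top of the training script shows one line per process.
    print(describe_dist_env(), flush=True)
```

If both processes print a complete line with distinct ranks, the launcher side is probably fine and the hang has to be looked for elsewhere.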

But my code hangs at the line
"model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True)"

Could you please help me? The output is as follows:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


WARNING - root - Changed type of config entry "train_split" from str to NoneType
WARNING - train - No observers have been added to this run
WARNING - root - Changed type of config entry "train_split" from str to NoneType
WARNING - train - No observers have been added to this run
INFO - train - Running command 'load_config'
INFO - train - Started
INFO - train - Running command 'load_config'
INFO - train - Started
Configuration (modified, added, typechanged, doc):
aux_loss = True
backbone = 'resnet50'
batch_size = 1
bbox_loss_coef = 5.0
clip_max_norm = 0.1
cls_loss_coef = 2.0
coco_and_crowdhuman_prev_frame_rnd_augs = 0.2
coco_min_num_objects = 0
coco_panoptic_path = None
coco_path = 'data/coco_2017'
coco_person_train_split = None
crowdhuman_path = 'data/CrowdHuman'
crowdhuman_train_split = 'train_val'
dataset = 'mot_crowdhuman'
debug = False
dec_layers = 6
dec_n_points = 4
deformable = True
device = 'cuda'
dice_loss_coef = 1.0
dilation = False
dim_feedforward = 1024
dist_url = 'env://'
dropout = 0.1
enc_layers = 6
enc_n_points = 4
eos_coef = 0.1
epochs = 80
eval_only = False
eval_train = False
focal_alpha = 0.25
focal_gamma = 2
focal_loss = True
freeze_detr = False
giou_loss_coef = 2
hidden_dim = 288
load_mask_head_from_model = None
lr = 0.0002
lr_backbone = 2e-05
lr_backbone_names = ['backbone.0']
lr_drop = 50
lr_linear_proj_mult = 0.1
lr_linear_proj_names = ['reference_points', 'sampling_offsets']
lr_track = 0.0001
mask_loss_coef = 1.0
masks = False
merge_frame_features = False
mot_path_train = 'data/MOT17'
mot_path_val = 'data/MOT17'
multi_frame_attention = True
multi_frame_attention_separate_encoder = True
multi_frame_encoding = True
nheads = 8
no_vis = False
num_feature_levels = 4
num_queries = 500
num_workers = 2
output_dir = 'models/crowdhuman_deformable_multi_frame'
overflow_boxes = True
overwrite_lr_scheduler = False
overwrite_lrs = False
position_embedding = 'sine'
pre_norm = False
resume = ''
resume_optim = False
resume_shift_neuron = False
resume_vis = False
save_model_interval = 5
seed = 42
set_cost_bbox = 5.0
set_cost_class = 2.0
set_cost_giou = 2.0
start_epoch = 1
track_attention = False
track_backprop_prev_frame = False
track_prev_frame_range = 5
track_prev_frame_rnd_augs = 0.01
track_prev_prev_frame = False
track_query_false_negative_prob = 0.4
track_query_false_positive_eos_weight = True
track_query_false_positive_prob = 0.1
tracking = True
tracking_eval = True
train_split = None
two_stage = False
val_interval = 5
val_split = 'mot17_train_cross_val_frame_0_5_to_1_0_coco'
vis_and_log_interval = 50
vis_port = 8097
vis_server = ''
weight_decay = 0.0001
with_box_refine = True
world_size = 2
img_transform:
max_size = 1333
val_width = 800
INFO - train - Completed after 0:00:00
Namespace(aux_loss=True, backbone='resnet50', batch_size=1, bbox_loss_coef=5.0, clip_max_norm=0.1, cls_loss_coef=2.0, coco_and_crowdhuman_prev_frame_rnd_augs=0.2, coco_min_num_objects=0, coco_panoptic_path=None, coco_path='data/coco_2017', coco_person_train_split=None, crowdhuman_path='data/CrowdHuman', crowdhuman_train_split='train_val', dataset='mot_crowdhuman', debug=False, dec_layers=6, dec_n_points=4, deformable=True, device='cuda', dice_loss_coef=1.0, dilation=False, dim_feedforward=1024, dist_url='env://', dropout=0.1, enc_layers=6, enc_n_points=4, eos_coef=0.1, epochs=80, eval_only=False, eval_train=False, focal_alpha=0.25, focal_gamma=2, focal_loss=True, freeze_detr=False, giou_loss_coef=2, hidden_dim=288, img_transform=Namespace(max_size=1333, val_width=800), load_mask_head_from_model=None, lr=0.0002, lr_backbone=2e-05, lr_backbone_names=['backbone.0'], lr_drop=50, lr_linear_proj_mult=0.1, lr_linear_proj_names=['reference_points', 'sampling_offsets'], lr_track=0.0001, mask_loss_coef=1.0, masks=False, merge_frame_features=False, mot_path_train='data/MOT17', mot_path_val='data/MOT17', multi_frame_attention=True, multi_frame_attention_separate_encoder=True, multi_frame_encoding=True, nheads=8, no_vis=False, num_feature_levels=4, num_queries=500, num_workers=2, output_dir='models/crowdhuman_deformable_multi_frame', overflow_boxes=True, overwrite_lr_scheduler=False, overwrite_lrs=False, position_embedding='sine', pre_norm=False, resume='', resume_optim=False, resume_shift_neuron=False, resume_vis=False, save_model_interval=5, seed=42, set_cost_bbox=5.0, set_cost_class=2.0, set_cost_giou=2.0, start_epoch=1, track_attention=False, track_backprop_prev_frame=False, track_prev_frame_range=5, track_prev_frame_rnd_augs=0.01, track_prev_prev_frame=False, track_query_false_negative_prob=0.4, track_query_false_positive_eos_weight=True, track_query_false_positive_prob=0.1, tracking=True, tracking_eval=True, train_split=None, two_stage=False, val_interval=5, 
val_split='mot17_train_cross_val_frame_0_5_to_1_0_coco', vis_and_log_interval=50, vis_port=8097, vis_server='', weight_decay=0.0001, with_box_refine=True, world_size=2)
using distributed mode
| distributed init (rank 1): env://
Configuration (modified, added, typechanged, doc):
aux_loss = True
backbone = 'resnet50'
batch_size = 1
bbox_loss_coef = 5.0
clip_max_norm = 0.1
cls_loss_coef = 2.0
coco_and_crowdhuman_prev_frame_rnd_augs = 0.2
coco_min_num_objects = 0
coco_panoptic_path = None
coco_path = 'data/coco_2017'
coco_person_train_split = None
crowdhuman_path = 'data/CrowdHuman'
crowdhuman_train_split = 'train_val'
dataset = 'mot_crowdhuman'
debug = False
dec_layers = 6
dec_n_points = 4
deformable = True
device = 'cuda'
dice_loss_coef = 1.0
dilation = False
dim_feedforward = 1024
dist_url = 'env://'
dropout = 0.1
enc_layers = 6
enc_n_points = 4
eos_coef = 0.1
epochs = 80
eval_only = False
eval_train = False
focal_alpha = 0.25
focal_gamma = 2
focal_loss = True
freeze_detr = False
giou_loss_coef = 2
hidden_dim = 288
load_mask_head_from_model = None
lr = 0.0002
lr_backbone = 2e-05
lr_backbone_names = ['backbone.0']
lr_drop = 50
lr_linear_proj_mult = 0.1
lr_linear_proj_names = ['reference_points', 'sampling_offsets']
lr_track = 0.0001
mask_loss_coef = 1.0
masks = False
merge_frame_features = False
mot_path_train = 'data/MOT17'
mot_path_val = 'data/MOT17'
multi_frame_attention = True
multi_frame_attention_separate_encoder = True
multi_frame_encoding = True
nheads = 8
no_vis = False
num_feature_levels = 4
num_queries = 500
num_workers = 2
output_dir = 'models/crowdhuman_deformable_multi_frame'
overflow_boxes = True
overwrite_lr_scheduler = False
overwrite_lrs = False
position_embedding = 'sine'
pre_norm = False
resume = ''
resume_optim = False
resume_shift_neuron = False
resume_vis = False
save_model_interval = 5
seed = 42
set_cost_bbox = 5.0
set_cost_class = 2.0
set_cost_giou = 2.0
start_epoch = 1
track_attention = False
track_backprop_prev_frame = False
track_prev_frame_range = 5
track_prev_frame_rnd_augs = 0.01
track_prev_prev_frame = False
track_query_false_negative_prob = 0.4
track_query_false_positive_eos_weight = True
track_query_false_positive_prob = 0.1
tracking = True
tracking_eval = True
train_split = None
two_stage = False
val_interval = 5
val_split = 'mot17_train_cross_val_frame_0_5_to_1_0_coco'
vis_and_log_interval = 50
vis_port = 8097
vis_server = ''
weight_decay = 0.0001
with_box_refine = True
world_size = 2
img_transform:
max_size = 1333
val_width = 800
INFO - train - Completed after 0:00:00
Namespace(aux_loss=True, backbone='resnet50', batch_size=1, bbox_loss_coef=5.0, clip_max_norm=0.1, cls_loss_coef=2.0, coco_and_crowdhuman_prev_frame_rnd_augs=0.2, coco_min_num_objects=0, coco_panoptic_path=None, coco_path='data/coco_2017', coco_person_train_split=None, crowdhuman_path='data/CrowdHuman', crowdhuman_train_split='train_val', dataset='mot_crowdhuman', debug=False, dec_layers=6, dec_n_points=4, deformable=True, device='cuda', dice_loss_coef=1.0, dilation=False, dim_feedforward=1024, dist_url='env://', dropout=0.1, enc_layers=6, enc_n_points=4, eos_coef=0.1, epochs=80, eval_only=False, eval_train=False, focal_alpha=0.25, focal_gamma=2, focal_loss=True, freeze_detr=False, giou_loss_coef=2, hidden_dim=288, img_transform=Namespace(max_size=1333, val_width=800), load_mask_head_from_model=None, lr=0.0002, lr_backbone=2e-05, lr_backbone_names=['backbone.0'], lr_drop=50, lr_linear_proj_mult=0.1, lr_linear_proj_names=['reference_points', 'sampling_offsets'], lr_track=0.0001, mask_loss_coef=1.0, masks=False, merge_frame_features=False, mot_path_train='data/MOT17', mot_path_val='data/MOT17', multi_frame_attention=True, multi_frame_attention_separate_encoder=True, multi_frame_encoding=True, nheads=8, no_vis=False, num_feature_levels=4, num_queries=500, num_workers=2, output_dir='models/crowdhuman_deformable_multi_frame', overflow_boxes=True, overwrite_lr_scheduler=False, overwrite_lrs=False, position_embedding='sine', pre_norm=False, resume='', resume_optim=False, resume_shift_neuron=False, resume_vis=False, save_model_interval=5, seed=42, set_cost_bbox=5.0, set_cost_class=2.0, set_cost_giou=2.0, start_epoch=1, track_attention=False, track_backprop_prev_frame=False, track_prev_frame_range=5, track_prev_frame_rnd_augs=0.01, track_prev_prev_frame=False, track_query_false_negative_prob=0.4, track_query_false_positive_eos_weight=True, track_query_false_positive_prob=0.1, tracking=True, tracking_eval=True, train_split=None, two_stage=False, val_interval=5, 
val_split='mot17_train_cross_val_frame_0_5_to_1_0_coco', vis_and_log_interval=50, vis_port=8097, vis_server='', weight_decay=0.0001, with_box_refine=True, world_size=2)
using distributed mode
| distributed init (rank 0): env://
git:
sha: d62d810, status: has uncommited changes, branch: main

@timmeinhardt
Owner

I just ran your command on my HPC machine with 2 GPUs and it works. I am not very familiar with multi-GPU machine setups myself, but maybe something is not properly configured on your side? Have you used the same setup for multi-GPU training before?

@shubham83183
Author

Hi,
Thank you for your reply. I have solved the issue. Here is what I observed:
DDP was hanging because there was not enough CPU memory. Since the process hung instead of failing, the "not enough CPU memory" error was never displayed, which made it hard to debug what went wrong. Now that I have increased the CPU memory, it works fine.
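
A hang that hides its cause, as described above, can sometimes be made visible with the Python standard library alone. The following is a minimal sketch, not code from the repository: `faulthandler` dumps the stack of every thread if a call has not returned after a timeout, and `resource` reports the peak host memory of the process so it can be compared with the memory granted to the job. The function names and the 300-second default are assumptions chosen for illustration.

```python
import faulthandler
import resource
import sys


def arm_hang_watchdog(timeout_s=300.0, file=None):
    """Dump all thread stacks to `file` if still running after `timeout_s` seconds."""
    target = file if file is not None else sys.stderr
    # exit=False leaves the process alive so it can still be inspected.
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=target, exit=False)


def disarm_hang_watchdog():
    """Cancel the pending dump once the suspect call has returned."""
    faulthandler.cancel_dump_traceback_later()


def peak_rss_mib():
    """Peak resident memory of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    return peak / (1024 ** 2 if sys.platform == "darwin" else 1024)


def log_peak_memory(tag, rank=0):
    """Print and return one memory line per rank (hypothetical helper)."""
    message = f"[rank {rank}] {tag}: peak host memory {peak_rss_mib():.0f} MiB"
    print(message, flush=True)
    return message


# Hypothetical placement around the line that hung:
#   log_peak_memory("before DDP", rank=args.rank)
#   arm_hang_watchdog(300)
#   model = torch.nn.parallel.DistributedDataParallel(model, ...)
#   disarm_hang_watchdog()
```

The watchdog only shows where each process is stuck; it does not say why. The memory line helps with the second part when the numbers approach the limit of the job allocation.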

Could you please tell me how you set the number of workers and GPUs? Even with multiple GPUs, I am not seeing any decrease in training time.

@timmeinhardt
Owner

The number of workers is specified per GPU so this should scale automatically. And even if there are very few workers the training should still make progress. Maybe not as fast as it could but still. This sounds like a different problem or bug in your setup.
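
Because the worker count is per GPU, the number of processes on the node grows with the number of GPUs. A rough sketch of that arithmetic, using `num_workers = 2` and `world_size = 2` from the logged configuration, is given here; the function `loader_process_budget` is hypothetical, and the comparison against the usable CPU count is an added sanity check rather than something the repository does.

```python
import os


def usable_cpu_count():
    """CPUs this process may run on; falls back to the node-wide count."""
    if hasattr(os, "sched_getaffinity"):
        # On Linux this reflects CPU pinning, which a scheduler may apply.
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def loader_process_budget(num_workers, world_size, cpu_count=None):
    """Count data-loading and training processes for one node."""
    cpus = usable_cpu_count() if cpu_count is None else cpu_count
    total_workers = num_workers * world_size  # workers are specified per GPU
    total_processes = total_workers + world_size  # plus one trainer per GPU
    return {
        "total_workers": total_workers,
        "total_processes": total_processes,
        "cpus": cpus,
        "oversubscribed": total_processes > cpus,
    }


if __name__ == "__main__":
    # Values taken from the configuration printed in the log above.
    print(loader_process_budget(num_workers=2, world_size=2))
```

With the logged values this gives 4 data-loading workers and 6 processes in total. If the job has fewer CPUs than that, the processes compete for cores, which is one possible (but unconfirmed) reason for training not getting faster with more GPUs.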

@shubham83183
Author

Hi,
Thank you for replying and helping me.
Now I can run the code with reduced training time.

@timmeinhardt
Owner

Can we close this issue?

@shubham83183
Author

Yes, sure. Thank you.
