The DDP hung up at torch.nn.parallel.DistributedDataParallel(model) #68
Comments
I just ran your command on my HPC machine with 2 GPUs and it works. I am not very familiar with multi-GPU machine setups myself, but maybe something is not properly configured on your side? Have you used the same setup for multi-GPU training before?
Hi, could you please tell me how you adjusted the number of workers and GPUs? Even with multiple GPUs I am not seeing any decrease in training time.
The number of workers is specified per GPU, so this should scale automatically. Even with very few workers, training should still make progress, just not as fast as it could. This sounds like a different problem or a bug in your setup.
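To make the per-GPU point concrete, here is a small arithmetic sketch. It assumes one process per GPU, as `torch.distributed.launch` starts them, and the helper name `loader_footprint` is made up for illustration. The numbers come from the configuration printed in the issue (`num_workers = 2`, `batch_size = 1`, `world_size = 2`).

```python
def loader_footprint(num_workers: int, batch_size: int, world_size: int) -> dict:
    """Totals across all processes when every setting applies per GPU (per process)."""
    return {
        # Each process starts its own data-loading workers.
        "total_loader_workers": num_workers * world_size,
        # Each process consumes its own batch in every step.
        "effective_batch_size": batch_size * world_size,
    }


# Values from the configuration dump in the issue.
print(loader_footprint(num_workers=2, batch_size=1, world_size=2))
# prints {'total_loader_workers': 4, 'effective_batch_size': 2}
```

Under that assumption, and if the dataset is split across processes (as a distributed sampler does), extra GPUs mainly reduce the number of steps per epoch rather than the duration of a single step, so time per epoch is probably the more useful number to compare.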
Hi,
Can we close this issue?
Yes, sure. Thank you.
Hi,
I really enjoyed reading your paper and code. Great work.
I am trying to reproduce the results by running your code on an HPC cluster (one node with 2 GPUs). As described in the training section of the README, I ran the following command in an interactive Slurm session.
"
python -m torch.distributed.launch --nproc_per_node=2 --use_env src/train.py with \
    crowdhuman \
    deformable \
    multi_frame \
    tracking \
    output_dir=models/crowdhuman_deformable_multi_frame
"
However, the run hangs at the line
" model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True)"
Could you please help me? The output is as follows:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
WARNING - root - Changed type of config entry "train_split" from str to NoneType
WARNING - train - No observers have been added to this run
WARNING - root - Changed type of config entry "train_split" from str to NoneType
WARNING - train - No observers have been added to this run
INFO - train - Running command 'load_config'
INFO - train - Started
INFO - train - Running command 'load_config'
INFO - train - Started
Configuration (modified, added, typechanged, doc):
aux_loss = True
backbone = 'resnet50'
batch_size = 1
bbox_loss_coef = 5.0
clip_max_norm = 0.1
cls_loss_coef = 2.0
coco_and_crowdhuman_prev_frame_rnd_augs = 0.2
coco_min_num_objects = 0
coco_panoptic_path = None
coco_path = 'data/coco_2017'
coco_person_train_split = None
crowdhuman_path = 'data/CrowdHuman'
crowdhuman_train_split = 'train_val'
dataset = 'mot_crowdhuman'
debug = False
dec_layers = 6
dec_n_points = 4
deformable = True
device = 'cuda'
dice_loss_coef = 1.0
dilation = False
dim_feedforward = 1024
dist_url = 'env://'
dropout = 0.1
enc_layers = 6
enc_n_points = 4
eos_coef = 0.1
epochs = 80
eval_only = False
eval_train = False
focal_alpha = 0.25
focal_gamma = 2
focal_loss = True
freeze_detr = False
giou_loss_coef = 2
hidden_dim = 288
load_mask_head_from_model = None
lr = 0.0002
lr_backbone = 2e-05
lr_backbone_names = ['backbone.0']
lr_drop = 50
lr_linear_proj_mult = 0.1
lr_linear_proj_names = ['reference_points', 'sampling_offsets']
lr_track = 0.0001
mask_loss_coef = 1.0
masks = False
merge_frame_features = False
mot_path_train = 'data/MOT17'
mot_path_val = 'data/MOT17'
multi_frame_attention = True
multi_frame_attention_separate_encoder = True
multi_frame_encoding = True
nheads = 8
no_vis = False
num_feature_levels = 4
num_queries = 500
num_workers = 2
output_dir = 'models/crowdhuman_deformable_multi_frame'
overflow_boxes = True
overwrite_lr_scheduler = False
overwrite_lrs = False
position_embedding = 'sine'
pre_norm = False
resume = ''
resume_optim = False
resume_shift_neuron = False
resume_vis = False
save_model_interval = 5
seed = 42
set_cost_bbox = 5.0
set_cost_class = 2.0
set_cost_giou = 2.0
start_epoch = 1
track_attention = False
track_backprop_prev_frame = False
track_prev_frame_range = 5
track_prev_frame_rnd_augs = 0.01
track_prev_prev_frame = False
track_query_false_negative_prob = 0.4
track_query_false_positive_eos_weight = True
track_query_false_positive_prob = 0.1
tracking = True
tracking_eval = True
train_split = None
two_stage = False
val_interval = 5
val_split = 'mot17_train_cross_val_frame_0_5_to_1_0_coco'
vis_and_log_interval = 50
vis_port = 8097
vis_server = ''
weight_decay = 0.0001
with_box_refine = True
world_size = 2
img_transform:
max_size = 1333
val_width = 800
INFO - train - Completed after 0:00:00
Namespace(aux_loss=True, backbone='resnet50', batch_size=1, bbox_loss_coef=5.0, clip_max_norm=0.1, cls_loss_coef=2.0, coco_and_crowdhuman_prev_frame_rnd_augs=0.2, coco_min_num_objects=0, coco_panoptic_path=None, coco_path='data/coco_2017', coco_person_train_split=None, crowdhuman_path='data/CrowdHuman', crowdhuman_train_split='train_val', dataset='mot_crowdhuman', debug=False, dec_layers=6, dec_n_points=4, deformable=True, device='cuda', dice_loss_coef=1.0, dilation=False, dim_feedforward=1024, dist_url='env://', dropout=0.1, enc_layers=6, enc_n_points=4, eos_coef=0.1, epochs=80, eval_only=False, eval_train=False, focal_alpha=0.25, focal_gamma=2, focal_loss=True, freeze_detr=False, giou_loss_coef=2, hidden_dim=288, img_transform=Namespace(max_size=1333, val_width=800), load_mask_head_from_model=None, lr=0.0002, lr_backbone=2e-05, lr_backbone_names=['backbone.0'], lr_drop=50, lr_linear_proj_mult=0.1, lr_linear_proj_names=['reference_points', 'sampling_offsets'], lr_track=0.0001, mask_loss_coef=1.0, masks=False, merge_frame_features=False, mot_path_train='data/MOT17', mot_path_val='data/MOT17', multi_frame_attention=True, multi_frame_attention_separate_encoder=True, multi_frame_encoding=True, nheads=8, no_vis=False, num_feature_levels=4, num_queries=500, num_workers=2, output_dir='models/crowdhuman_deformable_multi_frame', overflow_boxes=True, overwrite_lr_scheduler=False, overwrite_lrs=False, position_embedding='sine', pre_norm=False, resume='', resume_optim=False, resume_shift_neuron=False, resume_vis=False, save_model_interval=5, seed=42, set_cost_bbox=5.0, set_cost_class=2.0, set_cost_giou=2.0, start_epoch=1, track_attention=False, track_backprop_prev_frame=False, track_prev_frame_range=5, track_prev_frame_rnd_augs=0.01, track_prev_prev_frame=False, track_query_false_negative_prob=0.4, track_query_false_positive_eos_weight=True, track_query_false_positive_prob=0.1, tracking=True, tracking_eval=True, train_split=None, two_stage=False, val_interval=5, 
val_split='mot17_train_cross_val_frame_0_5_to_1_0_coco', vis_and_log_interval=50, vis_port=8097, vis_server='', weight_decay=0.0001, with_box_refine=True, world_size=2)
using distributed mode
| distributed init (rank 1): env://
Configuration (modified, added, typechanged, doc):
aux_loss = True
backbone = 'resnet50'
batch_size = 1
bbox_loss_coef = 5.0
clip_max_norm = 0.1
cls_loss_coef = 2.0
coco_and_crowdhuman_prev_frame_rnd_augs = 0.2
coco_min_num_objects = 0
coco_panoptic_path = None
coco_path = 'data/coco_2017'
coco_person_train_split = None
crowdhuman_path = 'data/CrowdHuman'
crowdhuman_train_split = 'train_val'
dataset = 'mot_crowdhuman'
debug = False
dec_layers = 6
dec_n_points = 4
deformable = True
device = 'cuda'
dice_loss_coef = 1.0
dilation = False
dim_feedforward = 1024
dist_url = 'env://'
dropout = 0.1
enc_layers = 6
enc_n_points = 4
eos_coef = 0.1
epochs = 80
eval_only = False
eval_train = False
focal_alpha = 0.25
focal_gamma = 2
focal_loss = True
freeze_detr = False
giou_loss_coef = 2
hidden_dim = 288
load_mask_head_from_model = None
lr = 0.0002
lr_backbone = 2e-05
lr_backbone_names = ['backbone.0']
lr_drop = 50
lr_linear_proj_mult = 0.1
lr_linear_proj_names = ['reference_points', 'sampling_offsets']
lr_track = 0.0001
mask_loss_coef = 1.0
masks = False
merge_frame_features = False
mot_path_train = 'data/MOT17'
mot_path_val = 'data/MOT17'
multi_frame_attention = True
multi_frame_attention_separate_encoder = True
multi_frame_encoding = True
nheads = 8
no_vis = False
num_feature_levels = 4
num_queries = 500
num_workers = 2
output_dir = 'models/crowdhuman_deformable_multi_frame'
overflow_boxes = True
overwrite_lr_scheduler = False
overwrite_lrs = False
position_embedding = 'sine'
pre_norm = False
resume = ''
resume_optim = False
resume_shift_neuron = False
resume_vis = False
save_model_interval = 5
seed = 42
set_cost_bbox = 5.0
set_cost_class = 2.0
set_cost_giou = 2.0
start_epoch = 1
track_attention = False
track_backprop_prev_frame = False
track_prev_frame_range = 5
track_prev_frame_rnd_augs = 0.01
track_prev_prev_frame = False
track_query_false_negative_prob = 0.4
track_query_false_positive_eos_weight = True
track_query_false_positive_prob = 0.1
tracking = True
tracking_eval = True
train_split = None
two_stage = False
val_interval = 5
val_split = 'mot17_train_cross_val_frame_0_5_to_1_0_coco'
vis_and_log_interval = 50
vis_port = 8097
vis_server = ''
weight_decay = 0.0001
with_box_refine = True
world_size = 2
img_transform:
max_size = 1333
val_width = 800
INFO - train - Completed after 0:00:00
Namespace(aux_loss=True, backbone='resnet50', batch_size=1, bbox_loss_coef=5.0, clip_max_norm=0.1, cls_loss_coef=2.0, coco_and_crowdhuman_prev_frame_rnd_augs=0.2, coco_min_num_objects=0, coco_panoptic_path=None, coco_path='data/coco_2017', coco_person_train_split=None, crowdhuman_path='data/CrowdHuman', crowdhuman_train_split='train_val', dataset='mot_crowdhuman', debug=False, dec_layers=6, dec_n_points=4, deformable=True, device='cuda', dice_loss_coef=1.0, dilation=False, dim_feedforward=1024, dist_url='env://', dropout=0.1, enc_layers=6, enc_n_points=4, eos_coef=0.1, epochs=80, eval_only=False, eval_train=False, focal_alpha=0.25, focal_gamma=2, focal_loss=True, freeze_detr=False, giou_loss_coef=2, hidden_dim=288, img_transform=Namespace(max_size=1333, val_width=800), load_mask_head_from_model=None, lr=0.0002, lr_backbone=2e-05, lr_backbone_names=['backbone.0'], lr_drop=50, lr_linear_proj_mult=0.1, lr_linear_proj_names=['reference_points', 'sampling_offsets'], lr_track=0.0001, mask_loss_coef=1.0, masks=False, merge_frame_features=False, mot_path_train='data/MOT17', mot_path_val='data/MOT17', multi_frame_attention=True, multi_frame_attention_separate_encoder=True, multi_frame_encoding=True, nheads=8, no_vis=False, num_feature_levels=4, num_queries=500, num_workers=2, output_dir='models/crowdhuman_deformable_multi_frame', overflow_boxes=True, overwrite_lr_scheduler=False, overwrite_lrs=False, position_embedding='sine', pre_norm=False, resume='', resume_optim=False, resume_shift_neuron=False, resume_vis=False, save_model_interval=5, seed=42, set_cost_bbox=5.0, set_cost_class=2.0, set_cost_giou=2.0, start_epoch=1, track_attention=False, track_backprop_prev_frame=False, track_prev_frame_range=5, track_prev_frame_rnd_augs=0.01, track_prev_prev_frame=False, track_query_false_negative_prob=0.4, track_query_false_positive_eos_weight=True, track_query_false_positive_prob=0.1, tracking=True, tracking_eval=True, train_split=None, two_stage=False, val_interval=5, 
val_split='mot17_train_cross_val_frame_0_5_to_1_0_coco', vis_and_log_interval=50, vis_port=8097, vis_server='', weight_decay=0.0001, with_box_refine=True, world_size=2)
using distributed mode
| distributed init (rank 0): env://
git:
sha: d62d810, status: has uncommited changes, branch: main
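The log above shows both ranks reaching `distributed init` with `env://`, so one possible first check is whether each process sees a complete set of rendezvous variables. The following is a minimal diagnostic sketch, not part of the repository: the helper `missing_dist_env` is hypothetical, and the variable list reflects what `env://` initialization reads (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`) plus the `LOCAL_RANK` that `--use_env` exports for the training script.

```python
import os

# MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are read by env:// initialization;
# LOCAL_RANK is what the launcher exports for the script when --use_env is given.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE", "LOCAL_RANK")


def missing_dist_env(env=None):
    """Return the rendezvous variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    # Run this under the same launcher command as the training script.
    print("missing variables:", missing_dist_env())
    # NCCL_DEBUG is an NCCL setting; INFO makes NCCL log its own setup steps.
    print("NCCL_DEBUG =", os.environ.get("NCCL_DEBUG", "<unset>"))
```

If nothing is reported missing, exporting `NCCL_DEBUG=INFO` in the shell before launching may show how far NCCL gets before the hang. Whether that points to the actual cause on this cluster is an open question.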