
Why the 'DDP' mode cannot work on this framework #3

Closed
kongbia opened this issue Jun 13, 2022 · 10 comments

kongbia commented Jun 13, 2022

Hi, I tried to train UDAT with four 2080 Ti GPUs. However, DP mode causes an uneven distribution of GPU memory: the main GPU occupies 11 GB, while the remaining three occupy only about 6 GB each. I changed it to DDP mode, but it doesn't work. Do you have any ideas?

Besides, I found several bugs:

  1. In train.py, when calculating the sum of the discriminator outputs, the numpy function cannot operate on tensors that carry grad (train.py lines 257-258, 285-286, 297-298):

D_out_z = np.sum([Disc(F.softmax(_zf_up_t, dim=1)) for _zf_up_t in zf_up_t])/3.0
D_out_x = np.sum([Disc(F.softmax(_xf_up_t, dim=1)) for _xf_up_t in xf_up_t])/3.0

I think it should be:

D_out_z = torch.stack([Disc(F.softmax(_zf_up_t, dim=1)) for _zf_up_t in zf_up_t]).sum(0) / 3.
D_out_x = torch.stack([Disc(F.softmax(_xf_up_t, dim=1)) for _xf_up_t in xf_up_t]).sum(0) / 3.

  2. In eval.py, line 67 elif 'NAT' in args.dataset: should be elif 'NAT' == args.dataset:. Otherwise, the NAT_L results would also go into this branch.
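A minimal, self-contained sketch of both fixes; the Disc outputs are replaced here by hypothetical random tensors that require grad, since the real discriminator is not shown:

```python
import torch

# Bug 1: np.sum would implicitly convert the tensors to NumPy arrays,
# which raises for tensors that require grad; torch.stack keeps the
# autograd graph intact. `outs` stands in for the Disc(...) outputs.
outs = [torch.randn(2, 1, requires_grad=True).sigmoid() for _ in range(3)]
d_out = torch.stack(outs).sum(0) / 3.0
assert d_out.requires_grad          # gradient flow is preserved
d_out.sum().backward()              # backprop through the average works

# Bug 2: substring matching also catches NAT_L; equality does not.
assert 'NAT' in 'NAT_L'             # the buggy branch matches NAT_L too
assert 'NAT' != 'NAT_L'             # equality routes only the NAT dataset
```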
@Jay-Ye (Collaborator) commented Jun 19, 2022

Thank you for your comment.
Yes, we also observed the uneven distribution of GPU memory during training. Though we haven't found the cause yet, it can be safely ignored since the training runs successfully.
And many thanks for reporting the bugs; we've fixed them accordingly.

@kongbia (Author) commented Jun 20, 2022

Hi, I tried to modify line 136 of distributed.py from

if param.requires_grad:

to

if param.requires_grad and param.grad is not None:

and it trained with DDP successfully. The training time dropped from 10 hours to 4 hours on my machine.

However, I found the reason is that all parameters of the ALIGN module have no grad (grad is None) even though their requires_grad is True. I am quite confused about this.
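For reference, a hedged sketch of the guard discussed above (the surrounding reduction loop in distributed.py is paraphrased, not quoted): parameters of a submodule that never takes part in the forward pass keep requires_grad=True but end up with param.grad is None after backward, so the extra check skips them during gradient reduction.

```python
import torch

def params_with_grads(model: torch.nn.Module):
    """Yield only parameters that both require grad and actually received one."""
    for param in model.parameters():
        # An unused submodule (like ALIGN here, apparently) leaves
        # param.grad as None even though requires_grad is True.
        if param.requires_grad and param.grad is not None:
            yield param

# A toy model where one branch is never used in forward().
class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(4, 2)
        self.unused = torch.nn.Linear(4, 2)  # stands in for ALIGN

    def forward(self, x):
        return self.used(x)

model = Toy()
model(torch.randn(3, 4)).sum().backward()
reduced = list(params_with_grads(model))
assert len(reduced) == 2                 # only `used`'s weight and bias
assert model.unused.weight.grad is None  # unused branch got no gradient
```

(As an aside, stock torch.nn.parallel.DistributedDataParallel handles this case via its find_unused_parameters=True option.)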

@cjyiiiing

Hi. Can you reproduce the results in the paper? For UDAT-CAR tested on NAT2021-test, I got 0.458(Success) and 0.655(Precision), but in the paper it's 0.483 and 0.687.

@Jay-Ye (Collaborator) commented Jul 7, 2022

Could you please post more details of your testing/training, so we can figure things out?

@cjyiiiing

The setting I used during preprocessing/training/testing is Python 3.7.11, PyTorch 1.6.0, cudatoolkit 10.1.243. All training/testing parameters remain unchanged. Could it be a problem with the environment, or do I need to post some other details?

@Jay-Ye (Collaborator) commented Jul 7, 2022

Have you checked the model you were running? Some people have reproduced the results in the paper; the environmental difference should not cause such a huge performance drop.

@cjyiiiing

I have checked every step, including preprocessing and training. Everything seems OK, except that when I ran gen_json.py for the NAT dataset I skipped the sequence "0175bike1_3" because there is no "0175bike1_3_gt.txt" in "pseudo_anno/". I retrained the model, but the performance is still low:

------------------------------------------------------------------------------------------------------
|                        Tracker name                         | Success | Norm Precision | Precision |
------------------------------------------------------------------------------------------------------
| UDATCAR_A6000_snapshot_wrandomcheckpoint_e19_0.39_0.04_0.37 |  0.457  |     0.000      |   0.652   |
------------------------------------------------------------------------------------------------------

@Jay-Ye (Collaborator) commented Jul 9, 2022

I suggest you test the models we released (both the original version and the UDAT version) on your platform and compare the results with the paper, to figure out how the environmental difference influences the results.

@Jay-Ye Jay-Ye closed this as completed Jul 14, 2022
@cjyiiiing commented Sep 26, 2022

I tested the original-version models.

SiamCAR:

  • when python=3.7.11, pytorch=1.6.0, cudatoolkit=10.1.243, the result is 0.422(Success) and 0.633(Precision);
  • when python=3.6.1, pytorch=1.2.0, cudatoolkit=10.0.130, the result is 0.450(Success) and 0.670(Precision);

SiamBAN:

  • when python=3.7.11, pytorch=1.6.0, cudatoolkit=10.1.243, the result is 0.271(Success) and 0.441(Precision);
  • when python=3.7.13, pytorch=1.3.1, cudatoolkit=10.1.243, the result is 0.327(Success) and 0.540(Precision);

It seems that the environment influences the results a lot. Can you tell me your environment setting?

@Jay-Ye (Collaborator) commented Sep 26, 2022

Hey, the environment used in the original paper is:

Python 3.9.12
PyTorch 1.11.0
CUDA 11.6 / cuDNN 8.4.0
