
Why the 'DDP' mode cannot work on this framework #3

Closed
kongbia opened this issue Jun 13, 2022 · 10 comments

kongbia commented Jun 13, 2022

Hi, I tried to train UDAT with four 2080 Ti GPUs. However, DP mode causes an uneven distribution of GPU memory: the main GPU occupies 11 GB, while the remaining three occupy only about 6 GB each. I changed it to DDP mode, but it doesn't work. Do you have any ideas?

Besides, I found several bugs:

  1. In train.py, when calculating the sum of the discriminator outputs, the numpy function cannot operate on tensors that carry grad (train.py lines 257-258, 285-286, 297-298):

D_out_z = np.sum([Disc(F.softmax(_zf_up_t, dim=1)) for _zf_up_t in zf_up_t])/3.0
D_out_x = np.sum([Disc(F.softmax(_xf_up_t, dim=1)) for _xf_up_t in xf_up_t])/3.0

I think it should be:

D_out_z = torch.stack([Disc(F.softmax(_zf_up_t, dim=1)) for _zf_up_t in zf_up_t]).sum(0) / 3.
D_out_x = torch.stack([Disc(F.softmax(_xf_up_t, dim=1)) for _xf_up_t in xf_up_t]).sum(0) / 3.

  2. In eval.py, line 67 elif 'NAT' in args.dataset: should be elif 'NAT' == args.dataset:. Otherwise, the NAT_L results would also go into this branch.
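A minimal, self-contained sketch of both fixes; the Disc outputs are replaced here by hypothetical random tensors that require grad, since the real discriminator is not shown:

```python
import torch

# Bug 1: np.sum would implicitly convert the tensors to NumPy arrays,
# which raises for tensors that require grad; torch.stack keeps the
# autograd graph intact. `outs` stands in for the Disc(...) outputs.
outs = [torch.randn(2, 1, requires_grad=True).sigmoid() for _ in range(3)]
d_out = torch.stack(outs).sum(0) / 3.0
assert d_out.requires_grad          # gradient flow is preserved
d_out.sum().backward()              # backprop through the average works

# Bug 2: substring matching also catches NAT_L; equality does not.
assert 'NAT' in 'NAT_L'             # the buggy branch matches NAT_L too
assert 'NAT' != 'NAT_L'             # equality routes only the NAT dataset
```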
@Jay-Ye (Collaborator) commented Jun 19, 2022

Thank you for your comment.
Yes, we also observed the uneven distribution of GPU memory during training. Though we haven't found the cause yet, it can be safely ignored since the training runs successfully.
And many thanks for reporting the bugs; we've fixed them accordingly.

@kongbia (Author) commented Jun 20, 2022

Hi, I tried to modify line 136 of distributed.py from

if param.requires_grad:

to

if param.requires_grad and param.grad is not None:

and it trained with DDP successfully. The training time dropped from 10 hours to 4 hours on my machine.

However, I found the reason is that all parameters of the ALIGN module have no grad (grad is None) even though their requires_grad is True. I am quite confused about this.
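For reference, a hedged sketch of the guard discussed above (the surrounding reduction loop in distributed.py is paraphrased, not quoted): parameters of a submodule that never takes part in the forward pass keep requires_grad=True but end up with param.grad is None after backward, so the extra check skips them during gradient reduction.

```python
import torch

def params_with_grads(model: torch.nn.Module):
    """Yield only parameters that both require grad and actually received one."""
    for param in model.parameters():
        # An unused submodule (like ALIGN here, apparently) leaves
        # param.grad as None even though requires_grad is True.
        if param.requires_grad and param.grad is not None:
            yield param

# A toy model where one branch is never used in forward().
class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(4, 2)
        self.unused = torch.nn.Linear(4, 2)  # stands in for ALIGN

    def forward(self, x):
        return self.used(x)

model = Toy()
model(torch.randn(3, 4)).sum().backward()
reduced = list(params_with_grads(model))
assert len(reduced) == 2                 # only `used`'s weight and bias
assert model.unused.weight.grad is None  # unused branch got no gradient
```

(As an aside, stock torch.nn.parallel.DistributedDataParallel handles this case via its find_unused_parameters=True option.)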

@cjyiiiing

Hi. Can you reproduce the results in the paper? For UDAT-CAR tested on NAT2021-test, I got 0.458(Success) and 0.655(Precision), but in the paper it's 0.483 and 0.687.

@Jay-Ye (Collaborator) commented Jul 7, 2022

Could you please post more details of your testing/training, so we can figure things out?

@cjyiiiing

The setting I used during preprocessing/training/testing is Python 3.7.11, PyTorch 1.6.0, cudatoolkit 10.1.243. All training/testing parameters remain unchanged. Could it be a problem with the environment, or do I need to post some other details?

@Jay-Ye (Collaborator) commented Jul 7, 2022

Have you checked the model you were running? Some people have reproduced the results in the paper; the environmental difference should not cause such a huge performance drop.

@cjyiiiing

I have checked every step, including preprocessing and training. Everything seems OK, except that when I ran gen_json.py for the NAT dataset I skipped the sequence "0175bike1_3" because there is no "0175bike1_3_gt.txt" in "pseudo_anno/". I retrained the model, but the performance is still low:

------------------------------------------------------------------------------------------------------
|                        Tracker name                         | Success | Norm Precision | Precision |
------------------------------------------------------------------------------------------------------
| UDATCAR_A6000_snapshot_wrandomcheckpoint_e19_0.39_0.04_0.37 |  0.457  |     0.000      |   0.652   |
------------------------------------------------------------------------------------------------------

@Jay-Ye (Collaborator) commented Jul 9, 2022

I suggest you test the models we released (both the original version and the UDAT version) on your platform and compare the results with the paper, to figure out how the environmental difference influences the results.

@Jay-Ye Jay-Ye closed this as completed Jul 14, 2022
@cjyiiiing commented Sep 26, 2022

I tested the original-version models.

SiamCAR:

  • when python=3.7.11, pytorch=1.6.0, cudatoolkit=10.1.243, the result is 0.422(Success) and 0.633(Precision);
  • when python=3.6.1, pytorch=1.2.0, cudatoolkit=10.0.130, the result is 0.450(Success) and 0.670(Precision);

SiamBAN:

  • when python=3.7.11, pytorch=1.6.0, cudatoolkit=10.1.243, the result is 0.271(Success) and 0.441(Precision);
  • when python=3.7.13, pytorch=1.3.1, cudatoolkit=10.1.243, the result is 0.327(Success) and 0.540(Precision);

It seems that the environment influences the results a lot. Can you tell me your environment setting?

@Jay-Ye (Collaborator) commented Sep 26, 2022

Hey, the environment used in the original paper is:

Python 3.9.12
PyTorch 1.11.0
CUDA 11.6 / cuDNN 8.4.0
