FPS #3
Moreover, can you provide the FLOPs of the MobileSal model? We think this result would make the comparison more comprehensive. Thank you very much!
Thank you for your interest in our paper and for the information about the issues you met.
Let's solve the No.1 issue first.
I have never encountered this issue before; even a ResNet-50-based method can support an inference batch size of 20. Could you please provide more details about your running environment? Then we may learn what causes this issue, and perhaps we can set up a similar environment to reproduce it.
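For reference, a back-of-the-envelope check (my own arithmetic, not from the thread) shows that the input tensor itself is tiny at batch size 20, so an out-of-memory error at that batch size would have to come from intermediate activations, cached allocations, or other processes sharing the GPU:

```python
# Rough memory estimate for the benchmark input alone (illustrative only):
# a batch of 20 RGB images at 320x320 stored as float32.
batch, channels, height, width = 20, 3, 320, 320
bytes_per_float32 = 4

input_mb = batch * channels * height * width * bytes_per_float32 / 2**20
print(round(input_mb, 1))  # -> 23.4  (megabytes; far below 11 GB of a 2080Ti)
```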
Best, Yu-Huan
… On Apr 30, 2022, at 11:09, gbliao ***@***.***> wrote:
Thanks for your great work! We are very interested in the FPS mentioned in the paper. However, after running the code in speed_test.py, we have the following confusion.
(1) The paper mentions using a single 2080Ti GPU for testing and reports a speed of 450 fps, but actually running the self-defined tensor with a batch of 20 in speed_test.py on a single 2080Ti runs out of memory. So, how can we get 450 fps on a single 2080Ti GPU?
(2) We found that a self-defined tensor is used for inference speed testing in speed_test.py, which is different from the process in test.py that uses an actual image. The comparison is perhaps unfair. So we are very curious what the exact FPS result would be in test.py with a batch of 1. And what about including `torch.cuda.synchronize()`?
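The batch-1 timing question can be sketched with a minimal, framework-agnostic harness. Everything here is an assumption for illustration: the function name `measure_fps` and its parameters are hypothetical, and the comments mark where `torch.cuda.synchronize()` would go when timing a CUDA model so that queued kernels are included in the measurement:

```python
import time

def measure_fps(model, inputs, batch_size=1, warmup=5, iters=20):
    """Return frames per second for `model` called repeatedly on `inputs`.

    `model` is any callable. When timing a CUDA model with PyTorch, call
    torch.cuda.synchronize() immediately before each perf_counter() read,
    otherwise the clock stops before the GPU has finished its work.
    """
    for _ in range(warmup):               # warm-up runs, excluded from timing
        model(inputs)
    start = time.perf_counter()           # torch.cuda.synchronize() goes here
    for _ in range(iters):
        model(inputs)
    elapsed = time.perf_counter() - start # ...and here, before stopping
    return iters * batch_size / elapsed

# Toy usage with a dummy "model" (a plain function):
fps = measure_fps(lambda x: [v * 2 for v in x], list(range(1000)), batch_size=20)
```

With a real network, throughput at batch 20 is typically far higher than 20x the batch-1 number divided by 20, because small batches leave the GPU underutilized.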
I tested the GFLOPs of MobileSal a long time ago. Given an input of size 224×224, it is about 0.4 GFLOPs.
Since I am currently on vacation, I will test MobileSal again and report the precise GFLOPs next week. Thank you for your patience!
…
On Apr 30, 2022, at 11:26, gbliao ***@***.***> wrote:
Moreover, can you provide the FLOPs of the MobileSal model? Thank you very much!
Thanks for your timely reply! Our running environment: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, Python 3.6.7.
Thanks! We are curious about the GFLOPs with the input of 320×320 as mentioned in the paper. We can discuss the above issues together after the holidays. Have a nice holiday!
The numbers of GFLOPs and FPS for different input sizes are as below:
I am also going to reproduce the speed issue in the specific environment you described (python=3.6, torch=1.5.1+cu101, torchvision=0.6.1+cu101). Please wait patiently :)
Thanks for your reply. We have tried testing the FLOPs with an input of 320×320 using thop.profile, and the results are as follows.
Moreover, we are very curious what the exact FPS result would be in test.py with a batch of 1. We set the code in speed_test.py as follows; is this the right setting?
However, we only achieve 54–60 FPS under this setting.
If you use the PyTorch backend with batch size 20, your GPU will show a high utilization rate (nvidia-smi), while utilization drops to ~20% with batch size 1. This behavior does not appear in methods with regular backbones. On the other hand, the log you showed me seems to reveal that your code is computing MAdd, which is about twice the FLOPs. Meanwhile, the computational cost of some ops seems not to be counted (see the warnings in the log), so you are not computing the real number of FLOPs. I recommend using
The number of GFLOPs is ~1/2 of the number of MAdd, so the computational cost of the network is 1.56 GFLOPs.
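That conversion can be written down explicitly. A sketch, under the assumptions stated in the comments: the helper `madd_to_gflops` is hypothetical, and the 3.12 GMAdd input is simply the value implied by the 1.56 GFLOPs figure above, used for illustration:

```python
# Some profilers report MAdd, counting a multiply and an add as two
# operations; the convention here treats one fused multiply-add as one FLOP,
# so FLOPs = MAdd / 2. The 3.12e9 value below is illustrative, back-derived
# from the 1.56 GFLOPs figure in the thread.
def madd_to_gflops(madd):
    """Convert a raw MAdd count to GFLOPs (1 GFLOP = 1e9 FLOPs)."""
    return madd / 2 / 1e9

print(madd_to_gflops(3.12e9))  # -> 1.56
```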
When we tested MobileSal with a batch of 20 again, the aforementioned out-of-memory issue disappeared. We are sorry that we may previously have set something up incorrectly. Thank you again for all your help!
Yeah, glad to see that all the issues are solved. This issue will be closed. If you would like to discuss other topics with me, you can add my WeChat.