Automatic mixed precision (AMP) training is now natively supported and a stable feature. #557
Comments
@Lornatang yes thank you, this is a worthy addition. I see your PR!
Just curious about this: is there any gain or difference between inferring/evaluating with AMP vs. FP16, since the models appear to be trained with AMP? Would the scores be better? Why were all the evaluation results run on FP16?
@rafale77 the models are saved as FP16, so any checkpoint that is saved and loaded won't have any FP32 values.
I see. There is therefore no point in inferring with AMP. Thanks for the quick answer.
@rafale77 sort of. test.py serves a dual purpose: standalone mAP (loading a checkpoint from the hard drive), and mAP during training (accepting a model as an argument, called by train.py). detect.py only ever loads models from the hard drive as FP16. An added complication is CPU inference in both, which requires FP32, currently and for the foreseeable future. If there is a simpler solution to handle these various cases, I'm open to ideas. We did run mAP comparison tests before adopting FP16 as the native checkpointing standard, though, and we observed no difference in either our own mAP computation or the pycocotools results. EDIT: In the CPU case, models are converted to .float() before inference.
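For readers following along, here is a minimal sketch of the device-dependent precision handling described above. The checkpoint path and the assumption that the checkpoint stores the model under a 'model' key are illustrative, not the repo's exact code:

```python
import torch

# Illustrative sketch: FP16 weights for GPU inference, FP32 fallback on CPU.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
half = device.type != 'cpu'  # half precision is only useful on CUDA

ckpt = torch.load('weights.pt', map_location=device)  # hypothetical checkpoint path
model = ckpt['model'].float()  # assumed layout: full model stored under the 'model' key
model.to(device).eval()
if half:
    model.half()  # keep FP16 weights for GPU inference

img = torch.zeros(1, 3, 640, 640, device=device)
img = img.half() if half else img.float()  # input dtype must match the model dtype
with torch.no_grad():
    pred = model(img)
```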
Yes, I understand how test.py is used. I was just looking to see if there is any benefit to running the model with AMP, since I saw that detect.py was running with FP16. If the models are saved in FP16, then any precision loss, if any, has already been incurred, and upscaling to FP32 would just be a waste of memory for no benefit, I suppose. Training could likely still benefit some before the pre-trained model is saved...
@rafale77 How does the model infer in FP16 mode on GPUs that don't have FP16 support?
I don't know of any GPUs which don't support FP16. If what you mean is GPUs without tensor cores, then indeed you can expect some very poor inference speed compared to FP32, so you are better off turning on AMP or artificially upscaling the model and inputs to FP32.
@jinfagang @rafale77 I think all GPUs will see memory savings with FP16 inference. GPUs without tensor cores will not see any speedup however. I'm not aware of any scenario where FP16 would hurt speed or memory, for any GPU.
There are, actually. For example, the 1080 Ti: https://www.techpowerup.com/gpu-specs/geforce-gtx-1080-ti.c2877
@rafale77 oh, so you're saying a 1080 Ti card would show slower pytorch inference at model.half() than at model.float()? Have you observed this in practice yourself (or if anyone else has seen this please let us know)?
I thought this was a well-known fact... see this other article:
@rafale77 oh, thanks for the link, I did not know that. Well, that's unfortunate. The slowdown doesn't seem to be too bad on the GTX cards though, maybe 10%.
It depends on what you run. You can see, for example, the shocking fact that a 1080 Ti with a doubled batch size in FP16 is slower than a single 1080 Ti in FP32.
Could you give me a better overview of this?
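For anyone wanting a concrete overview on their own card, a rough timing and peak-memory comparison of model.float() vs. model.half() inference can be done along these lines (a sketch only; the torchvision model and input shape are arbitrary stand-ins, not YOLOv5):

```python
import time
import torch
import torchvision

def bench(model, img, n=50):
    """Time n forward passes and report seconds/iter and peak CUDA memory (GB)."""
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        for _ in range(5):  # warm-up
            model(img)
        torch.cuda.synchronize()
        t0 = time.time()
        for _ in range(n):
            model(img)
        torch.cuda.synchronize()
    return (time.time() - t0) / n, torch.cuda.max_memory_allocated() / 1e9

device = torch.device('cuda:0')
model = torchvision.models.resnet50().to(device).eval()  # stand-in model
img = torch.randn(16, 3, 224, 224, device=device)

t32, m32 = bench(model.float(), img.float())
t16, m16 = bench(model.half(), img.half())
print(f'FP32: {t32 * 1e3:.1f} ms/iter, {m32:.2f} GB peak')
print(f'FP16: {t16 * 1e3:.1f} ms/iter, {m16:.2f} GB peak')
```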
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
🚀 Feature
AMP allows users to easily enable automatic mixed precision training, delivering higher performance and memory savings of up to 50% on Tensor Core GPUs. Using the natively supported torch.cuda.amp API, AMP provides convenience methods for mixed precision, where some operations use the torch.float32 (float) datatype and other operations use torch.float16 (half). Some ops, like linear layers and convolutions, are much faster in float16. Other ops, like reductions, often require the dynamic range of float32. Mixed precision tries to match each op to its appropriate datatype.
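A small illustration of this per-op dtype selection with the native autocast context (a sketch; it assumes a CUDA device and PyTorch >= 1.6):

```python
import torch
import torch.nn.functional as F

device = torch.device('cuda')
linear = torch.nn.Linear(64, 64).to(device)  # parameters stay in float32
x = torch.randn(8, 64, device=device)

with torch.cuda.amp.autocast():
    y = linear(x)             # matmul-based op: autocast runs it in float16
    p = F.softmax(y, dim=-1)  # numerically sensitive op: autocast promotes it to float32
    print(y.dtype, p.dtype)   # torch.float16 torch.float32

print(linear(x).dtype)        # outside autocast everything stays torch.float32
```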
Motivation
In PyTorch 1.6, mixed precision computation has been integrated natively, and there is no longer any need to install the NVIDIA/apex library.
Pitch
Update the training code to use the native torch.cuda.amp API and remove the apex dependency.
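A rough sketch of what the swap looks like in a training loop; the model, data, and hyperparameters below are placeholders, not the actual train.py code:

```python
import torch

# Before (apex, roughly):
#   model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
#   with amp.scale_loss(loss, optimizer) as scaled_loss:
#       scaled_loss.backward()

# After (native torch.cuda.amp):
device = torch.device('cuda')
model = torch.nn.Linear(10, 1).to(device)                # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()

for _ in range(100):                                     # placeholder training loop
    x = torch.randn(32, 10, device=device)
    y = torch.randn(32, 1, device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                      # forward pass in mixed precision
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()                        # scale loss to avoid FP16 gradient underflow
    scaler.step(optimizer)                               # unscales grads, skips step on inf/nan
    scaler.update()                                      # adjust the scale factor for the next iteration
```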
Alternatives
Make no changes and keep the existing apex-based implementation.
Additional context
Refer to my recently updated PR.