
Automatic mixed precision (AMP) training is now natively supported and a stable feature. #557

Closed · Lornatang opened this issue Jul 30, 2020 · 17 comments
Labels: enhancement, Stale

Comments

@Lornatang (Contributor)

🚀 Feature

AMP allows users to easily enable automatic mixed precision training, offering higher performance and memory savings of up to 50% on Tensor Core GPUs. Using the natively supported torch.cuda.amp API, AMP provides convenience methods for mixed precision, where some operations use the torch.float32 (float) datatype and other operations use torch.float16 (half). Some ops, like linear layers and convolutions, are much faster in float16. Other ops, like reductions, often require the dynamic range of float32. Mixed precision tries to match each op to its appropriate datatype.
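For reference, a minimal sketch of the torch.cuda.amp pattern described above (the tiny model, optimizer, and random data below are placeholders for illustration, not this repository's training code):

```python
import torch
import torch.nn as nn

# Minimal torch.cuda.amp training-step sketch (PyTorch >= 1.6).
device = torch.device('cuda')
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Flatten(),
                      nn.Linear(16 * 64 * 64, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):                                # stand-in for a real data loader
    imgs = torch.randn(8, 3, 64, 64, device=device)
    targets = torch.randint(0, 10, (8,), device=device)

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                # ops run in FP16/FP32 as appropriate
        loss = criterion(model(imgs), targets)
    scaler.scale(loss).backward()                  # scale loss to avoid FP16 gradient underflow
    scaler.step(optimizer)                         # unscale gradients, then optimizer.step()
    scaler.update()                                # adjust the scale factor for the next step
```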

Motivation

In PyTorch 1.6, mixed precision computation is integrated natively, so there is no need to install the NVIDIA/apex library.

Pitch

Update the training code and remove apex.

Alternatives

Leave the existing apex-based implementation unchanged.

Additional context

Refer to my recently updated PR.

Lornatang added the enhancement label on Jul 30, 2020
@glenn-jocher (Member)

@Lornatang yes thank you, this is a worthy addition. I see your PR!

@rafale77 (Contributor)

Just curious about this: is there any gain or difference between inferring/evaluating with AMP vs. FP16, since the models appear to be trained with AMP? Would the scores be better? Why are all the evaluation results run in FP16?

@glenn-jocher (Member)

@rafale77 the models are saved as FP16, so any checkpoint that is saved and loaded won't have any FP32 values.

@rafale77 (Contributor)

I see. There is therefore no point in inferring with AMP. Thanks for the quick answer.

@glenn-jocher (Member) commented Aug 27, 2020

@rafale77 sort of.

test.py serves a dual purpose: standalone mAP (loading a checkpoint from the hard drive), and mAP during training (accepting a model as an argument when called by train.py).

detect.py only ever loads models from the hard drive as FP16.

An added complication is CPU inference on both, which requires FP32, currently and for the foreseeable future.

If there is a simpler solution to handle these various cases, I'm open to ideas. We did run mAP comparison tests before adopting FP16 as the native checkpointing standard, though, and we observed no mAP difference in either our own evaluation or the pycocotools results.

EDIT: In the CPU case, models are converted to .float() before inference.
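As a rough illustration of that load-then-cast pattern (the file name 'weights.pt' and the {'model': ...} checkpoint layout below are assumptions for the sketch, not the repository's exact code):

```python
import torch

# Load an FP16 checkpoint and cast it for the target device.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
ckpt = torch.load('weights.pt', map_location=device)  # hypothetical checkpoint file
model = ckpt['model']                                  # assumed checkpoint layout

if device.type == 'cpu':
    model = model.float()   # CPU inference requires FP32
else:
    model = model.half()    # GPU inference runs in FP16
model.eval()
```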

@rafale77 (Contributor)

Yes, I understand how test.py is used. I was just looking to see if there is any benefit to running the model with AMP, since I saw that detect.py was running with FP16. If the models are saved in FP16, then any precision loss, if any, has already been incurred, and upcasting to FP32 would just waste memory for no benefit, I suppose. Training could likely still benefit somewhat before the pre-trained model is saved...
My setup is not well suited for testing, so I couldn't verify this; that's why I was asking. I could only see that AMP increases memory usage for inference.

@lucasjinreal

@rafale77 How does the model infer in FP16 mode on GPUs that don't have FP16 support?

@rafale77 (Contributor)

@rafale77 How does the model infer in FP16 mode on GPUs that don't have FP16 support?

I don't know of any GPU that doesn't support FP16. If you mean GPUs without tensor cores, then you can indeed expect very poor inference speed compared to FP32, so you are better off turning on AMP or artificially upcasting the model and inputs to FP32.

@glenn-jocher (Member)

@jinfagang @rafale77 I think all GPUs will see memory savings with FP16 inference. GPUs without tensor cores will not see any speedup however.

I'm not aware of any scenario where FP16 would hurt speed or memory, for any GPU.
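For anyone who wants to check this on their own card, a rough timing comparison along the following lines should show it (the ResNet-50 stand-in model and 640x640 input are placeholders for the sketch, not YOLOv5 code):

```python
import time
import torch
import torchvision

def bench(model, x, iters=50):
    """Return average forward-pass time in milliseconds."""
    with torch.no_grad():
        for _ in range(5):               # warm-up iterations
            model(x)
        torch.cuda.synchronize()
        t0 = time.time()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    return 1000 * (time.time() - t0) / iters

model = torchvision.models.resnet50().cuda().eval()
x = torch.randn(1, 3, 640, 640, device='cuda')
print('FP32: %.1f ms/img' % bench(model, x))
print('FP16: %.1f ms/img' % bench(model.half(), x.half()))
```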

@rafale77 (Contributor)

There are, actually. For example, the 1080 Ti (https://www.techpowerup.com/gpu-specs/geforce-gtx-1080-ti.c2877) has an FP16 compute rate that is 1/64th of its FP32 rate. This was intentionally designed so these cards would not compete with Titan/Tesla units.

@glenn-jocher (Member)

@rafale77 oh, so you're saying a 1080 Ti card would show slower PyTorch inference at model.half() than at model.float()? Have you observed this in practice yourself (or if anyone else has seen this, please let us know)?

@rafale77 (Contributor)

@glenn-jocher (Member)

@rafale77 oh, thanks for the link, I did not know that. Well, that's unfortunate. The slowdown doesn't seem to be too bad on the GTX cards though, maybe 10%.

@rafale77 (Contributor) commented Aug 29, 2020

It depends on what you run. You can see, for example, the shocking fact that a dual-batch-size 1080 Ti in FP16 is slower than a single 1080 Ti in FP32.

@gowna-m commented Sep 21, 2020

Just curious about this: is there any gain or difference between inferring/evaluating with AMP vs. FP16, since the models appear to be trained with AMP? Would the scores be better? Why are all the evaluation results run in FP16?

The models are saved as FP16, so any checkpoint that is saved and loaded won't have any FP32 values.

Could you give me a better overview of this?
@glenn-jocher @rafale77 Does training with AMP reduce only the training time and memory usage, or does it also impact latency? When not using AMP, will the models be saved as FP32? I have trained my model with AMP, saved the model, and am inferring without AMP (does this mean I'm inferring in FP16?). Will there be any reduction in inference time if pruning is added on top of it when using test.py for inference?

@glenn-jocher (Member) commented Sep 21, 2020

@gowna-m AMP is enabled by default for all model training on GPU. All YOLOv5 checkpoints are saved in FP16. All GPU inference is performed in FP16. See d4c6674

@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the Stale label on Oct 22, 2020