[compiled autograd] move inputs to cuda with non_blocking=True #129181
Conversation
[ghstack-poisoned]
looks good, one question
```diff
  in_compiled_autograd_region = True
  for i in runtime_inputs_to_move:
-     inputs[i] = inputs[i].cuda()
+     inputs[i] = inputs[i].pin_memory().cuda(non_blocking=True)
```
Should we put a numel() limit on this? The general guidance for pin memory is to avoid over-allocating.
i guess it doesn't matter rn since every input in runtime_inputs_to_move would have numel=1
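For reference, a size guard along the lines of the question above might look like the sketch below; `PIN_NUMEL_LIMIT` and `move_input` are illustrative names only, not code in this PR.

```python
import torch

# Hypothetical threshold: only pin small host tensors so we don't over-allocate
# page-locked memory (the concern raised above).
PIN_NUMEL_LIMIT = 128

def move_input(t: torch.Tensor) -> torch.Tensor:
    # Small tensors: pin first so the host-to-device copy can be asynchronous.
    if t.numel() <= PIN_NUMEL_LIMIT:
        return t.pin_memory().cuda(non_blocking=True)
    # Larger tensors: fall back to a plain blocking copy.
    return t.cuda()
```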
Stack from ghstack (oldest at bottom):
non_blocking=True requires the source tensor to be pinned first, which shouldn't be a problem given that the inputs being moved are CPU scalars.
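As a minimal sketch (not code from this PR) of why the pinning matters: `non_blocking=True` only lets the host-to-device copy overlap with other work when the source CPU tensor lives in pinned (page-locked) memory; from pageable memory it effectively behaves like a blocking copy.

```python
import torch

if torch.cuda.is_available():
    # A CPU scalar, like the runtime inputs compiled autograd moves to CUDA.
    x = torch.tensor(3.0)

    # From pageable memory, non_blocking=True gives little benefit.
    y_pageable = x.cuda(non_blocking=True)

    # Pin first; the copy can then be issued asynchronously and overlap host work.
    y_pinned = x.pin_memory().cuda(non_blocking=True)

    # Synchronize before comparing results on the host.
    torch.cuda.synchronize()
    assert torch.equal(y_pageable.cpu(), y_pinned.cpu())
```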
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang