[DataLoader] Move loop content into a function to ensure we don't preserve anything #83595
Conversation
Dr. CI: ✅ No failures (0 pending) as of commit 93cd751. 💚 Looks good so far! This comment was automatically generated by Dr. CI.
I think this is faster (not sure by how much) and should not cause any issues. I want to benchmark this to see if there is any noticeable difference, but haven't gotten to it yet.
@VitalyFedyunin @ejguan, can you confirm this doesn't cause any problems?
```python
            break
        except queue.Full:
            continue
    del r  # save memory
```
I feel like it would be simpler if we just do `r = None` at the end if we want to move the Tensor out of scope. WDYT?
But the original problem, that the GIL is blocked when gc tries to clean up the Tensor, still persists.
> I feel like it would be simpler if we just do `r = None` at the end if we want to move the Tensor out of scope. WDYT?

The problem is that doing `r = None` here only clears the tuple itself; `data` is still a local variable that remains alive. So this line does pretty much nothing today. We could do `del r, data` to solve this, but moving it into a function makes it more future-proof, as all local state will be properly removed.

> But the original problem, that the GIL is blocked when gc tries to clean up the Tensor, still persists.

Not sure how this is linked? If you're talking about the expensive unmap that happens on these objects, #83623 should fix that.
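The point that `del r` alone leaves the payload alive through `data` can be demonstrated with `weakref` (an illustrative sketch on CPython, where refcounting frees objects as soon as the last reference is gone; `Payload` stands in for a Tensor):

```python
import weakref

class Payload:
    """Stands in for a large Tensor."""

def del_tuple_only():
    data = Payload()
    ref = weakref.ref(data)
    r = (0, data)
    del r                     # drops the tuple, but `data` still holds the payload
    return ref() is not None  # True: payload survives

def via_helper_function():
    holder = {}
    def body():
        data = Payload()
        holder["ref"] = weakref.ref(data)
        r = (0, data)         # no explicit cleanup needed
    body()                    # all locals dropped when the function returns
    return holder["ref"]() is None  # True: payload was collected
```

Deleting only the tuple keeps the object reachable via `data`, while returning from a helper function releases every local at once.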
> If you're talking about the expensive unmap that happens on these objects, #83623 should fix that.

Thanks for pointing it out. It looks good.

> The problem is that doing `r = None` here only clears the tuple itself; `data` is still a local variable that remains alive. So this line does pretty much nothing today. We could do `del r, data` to solve this, but moving it into a function makes it more future-proof, as all local state will be properly removed.

Oh, yeah. The Tensor is also referenced by `data`. I agree on the idea of future-proofing. And, do you mind moving this function out of `while not done_event.is_set():`? Otherwise, I believe each iteration would create a new function object.
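The recreation cost being pointed out comes from executing the `def` statement on every pass through the loop, which allocates a fresh function object each time (illustrative sketch; the function names are made up):

```python
def loop_with_inner_def(n):
    total = 0
    for i in range(n):
        def body(x):   # a new function object is built on every iteration
            return x + 1
        total += body(i)
    return total

def _body(x):          # built once at module load, reused by the loop
    return x + 1

def loop_with_outer_def(n):
    total = 0
    for i in range(n):
        total += _body(i)
    return total
```

Both compute the same result; `timeit` would show the inner-def version paying a small per-iteration allocation cost on top of the call itself.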
Yes, I left it there on purpose, but without any strong reason: re-creating it each iteration means there are no captured variables kept alive by the function. But that shouldn't happen here indeed. Moving it out!
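The captured-variable concern mentioned above: a function keeps whatever it closes over alive for as long as the function object itself exists (illustrative sketch, not DataLoader code):

```python
import weakref

class Big:
    """Stands in for a large buffer."""

def make_closure():
    big = Big()
    ref = weakref.ref(big)
    def fn():
        # `big` lives in a closure cell, kept alive as long as `fn` exists
        return big
    return fn, ref
```

On CPython, dropping the function releases the closure cell and, with it, the captured object; re-creating the function each iteration would release the previous iteration's captures the same way.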
Thank you! I think this PR should be good to go after moving the function out of the while loop.
Force-pushed from 0f0dde1 to 93cd751.
Thanks albanD and Erjia 🚀
@pytorchbot merge -g
@pytorchbot successfully started a merge job. Check the current status here.
[DataLoader] Move loop content into a function to ensure we don't preserve anything (#83595)

Summary: Can lead to CPU memory saving as we don't hold onto the pin memory buffer as long as we used to.

Pull Request resolved: #83595
Approved by: https://github.com/ejguan, https://github.com/NivekT
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/38348362608a47371c65d7fd52db138b4c6a5d65
Reviewed By: atalman
Differential Revision: D38852489
Pulled By: albanD
fbshipit-source-id: 13021b949a34b64ffb9835992b77ef82fcbb0e85