
Use storage.cpu() for moving storage to CPU in serialization. #46028

Closed
t-vi wants to merge 2 commits

Conversation

t-vi
Collaborator

@t-vi t-vi commented Oct 8, 2020

As reported in #46020, something seems to go wrong when the storage._write_file method is used with a BytesIO and a GPU buffer.
Given that we were going to create the intermediate buffer (currently via BytesIO) anyway, we might as well use storage.cpu() to move the storage to the CPU. This appears to work better.

This is a hotfix; further investigation is highly desirable. In particular, I don't have a reproducing test to show.

Fixes #46020
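The idea can be sketched as follows. This is a minimal, self-contained illustration of the copy-count argument only, using hypothetical stand-in classes rather than the actual torch storage types or the real torch/serialization.py code:

```python
import io


class FakeCpuStorage:
    """Hypothetical stand-in for a CPU storage holding raw bytes."""
    def __init__(self, data: bytes):
        self._data = data

    def data(self) -> bytes:
        return self._data


class FakeCudaStorage:
    """Hypothetical stand-in for a CUDA storage; .cpu() models the
    single device-to-host copy (cudaMemcpy)."""
    def __init__(self, data: bytes):
        self._data = data

    def cpu(self) -> FakeCpuStorage:
        return FakeCpuStorage(self._data)


def save_via_bytesio(storage) -> bytes:
    # Old-style path: copy device -> CPU, then a second copy
    # via BytesIO.write (and getvalue() copies yet again).
    cpu_bytes = storage.cpu().data()
    buf = io.BytesIO()
    buf.write(cpu_bytes)
    return buf.getvalue()


def save_via_cpu(storage) -> bytes:
    # New-style path: one copy to CPU, written out directly.
    return storage.cpu().data()


s = FakeCudaStorage(b"\x01\x02\x03")
assert save_via_bytesio(s) == save_via_cpu(s)
```

Both paths produce the same bytes; the difference is purely in how many intermediate buffers get filled along the way.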

@t-vi
Collaborator Author

t-vi commented Oct 8, 2020

@jamesr66a

@t-vi
Collaborator Author

t-vi commented Oct 8, 2020

I could be wrong, but on further inspection it seems the new method might even be a bit of an optimization: going via _write_file incurs an additional copy compared to the method used here. In writeFileRaw there is a cudaMemcpy to a new CPU buffer, and then another copy via BytesIO.write, while now we just copy to the CPU once.

@t-vi
Collaborator Author

t-vi commented Oct 8, 2020

I added commentary on the root problem to the bug report. It indicates that this does indeed fix the bug here, and as discussed above, I think it saves a memory copy too, so I'd suggest using this fix.

@codecov

codecov bot commented Oct 8, 2020

Codecov Report

Merging #46028 into master will increase coverage by 0.00%.
The diff coverage is 75.00%.


@@           Coverage Diff           @@
##           master   #46028   +/-   ##
=======================================
  Coverage   68.28%   68.29%           
=======================================
  Files         410      410           
  Lines       53609    53606    -3     
=======================================
  Hits        36608    36608           
+ Misses      17001    16998    -3     
Impacted Files Coverage Δ
torch/serialization.py 86.87% <75.00%> (+0.56%) ⬆️

Last update a3caa71...cc9e756.

@facebook-github-bot (Contributor) left a comment

@seemethere has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@t-vi t-vi mentioned this pull request Oct 8, 2020
@gchanan
Contributor

gchanan commented Oct 8, 2020

really nice digging!

I don't really see how this avoids the copy -- isn't the writeFile buffer copy just moved to the cpu() copy? Also, given we are about to do a release, changing serialization seems a little risky -- it seems reasonable to me to do a stack where you fix the bug, we cherry-pick that to 1.7, and we then make your change here to master.

@gchanan
Contributor

gchanan commented Oct 8, 2020

For a reproduction -- maybe we should just fix the error checking of _writeFile a bit (your suggestion of moving to pybind is better), so we can at least get an error on regression.

@gchanan
Contributor

gchanan commented Oct 8, 2020

I haven't verified this yet, but I'm guessing #46036 will blow up the serialization tests, so that and this together should convince us that the issue is fixed.

@t-vi
Collaborator Author

t-vi commented Oct 8, 2020 via email

@gchanan
Contributor

gchanan commented Oct 8, 2020

The thing I was wondering about is the change of assumption, i.e. now we assume that cpu() is implemented for all storages (it's clearly true for cuda). It's almost certainly true for things like QuantizedCPU, etc. but I'm not 100% sure off the top of my head.

@t-vi
Collaborator Author

t-vi commented Oct 8, 2020

Good catch, but based on the test results, anything that has serialization tests would have .cpu().
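The assumption discussed here (that every storage type implements .cpu()) could be probed defensively at runtime. A hypothetical sketch, where the class and function names are illustrative and not torch internals:

```python
class CudaLikeStorage:
    """Hypothetical storage with a .cpu() method, as CUDA storages have."""
    def cpu(self):
        return "cpu-copy"


class CpuOnlyStorage:
    """Hypothetical storage type that does not implement .cpu()."""
    pass


def ensure_cpu(storage):
    # Use .cpu() when available; otherwise assume the storage
    # already lives in host memory and return it unchanged.
    return storage.cpu() if hasattr(storage, "cpu") else storage


assert ensure_cpu(CudaLikeStorage()) == "cpu-copy"
s = CpuOnlyStorage()
assert ensure_cpu(s) is s
```

As the comment above notes, the serialization test suite exercising each storage type is what actually establishes that .cpu() exists where it is needed.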

@mruberry mruberry added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Oct 8, 2020
@mruberry mruberry requested a review from gchanan October 8, 2020 21:29
@t-vi
Collaborator Author

t-vi commented Oct 9, 2020

@gchanan keep or trash this?

I still think it is superior to eliminate the additional BytesIO copy, but your fix is obviously more conservative.

@t-vi
Collaborator Author

t-vi commented Oct 12, 2020

Well, looks like we keep the copying.

@t-vi t-vi closed this Oct 12, 2020
@gchanan gchanan reopened this Oct 13, 2020
@gchanan
Contributor

gchanan commented Oct 13, 2020

@t-vi I think we should merge this (I wanted to get the more conservative one in first so we could cherry-pick it to the release branch).

@gchanan
Contributor

gchanan commented Oct 13, 2020

do you want to fix the conflict or should I?

@t-vi
Collaborator Author

t-vi commented Oct 13, 2020

OK, thanks. I'll update the PR.

@facebook-github-bot (Contributor) left a comment

@gchanan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@gchanan merged this pull request in 7b7f251.

@gchanan
Contributor

gchanan commented Oct 14, 2020

Nice, thanks for this!

Labels
Merged, open source, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

torch.save w/ _use_new_zipfile_serialization=True corrupts state_dict
5 participants