Use storage.cpu() for moving storage to CPU in serialization. #46028

t-vi · 2020-10-08T09:39:40Z

As reported in #46020, something seems to go wrong with the storage._write_file method used with a BytesIO and a GPU buffer.
Given that we were going to create the intermediate buffer (currently via BytesIO) anyway, we might as well use storage.cpu() to move the storage to the CPU. This appears to work better.

This is a hot fix; further investigation is highly desirable. In particular, I don't have a reproducing test to show.

Fixes #46020

t-vi · 2020-10-08T10:06:24Z

@jamesr66a

t-vi · 2020-10-08T10:23:14Z

I could be wrong, but on further inspection the new method might even be a small optimization: going via write_file incurs an additional copy compared to the method used here. writeFileRaw does a cudaMemcpy to a new CPU buffer, and BytesIO.write then makes another copy, while now we just copy to the CPU once.
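
To make the copy counting in this comment concrete, here is a toy model in plain Python. It does not call PyTorch: `save_via_write_file`, `save_via_cpu`, and `device_data` are hypothetical names, and the `bytearray` copies only stand in for the cudaMemcpy and BytesIO.write steps described above.

```python
import io


def save_via_write_file(device_data: bytes):
    """Toy model of the old path: device -> temporary host buffer -> BytesIO."""
    copies = 0
    host_tmp = bytearray(device_data)  # stands in for the cudaMemcpy to a new CPU buffer
    copies += 1
    buf = io.BytesIO()
    buf.write(host_tmp)  # BytesIO.write copies the bytes into its own buffer
    copies += 1
    # getvalue() is only used to compare results; it is not counted in the model
    return buf.getvalue(), copies


def save_via_cpu(device_data: bytes):
    """Toy model of the new path: a single device -> host copy."""
    copies = 0
    host = bytearray(device_data)  # stands in for storage.cpu()
    copies += 1
    # bytes() is only used to compare results; it is not counted in the model
    return bytes(host), copies


payload = bytes(range(16))
old_bytes, old_copies = save_via_write_file(payload)
new_bytes, new_copies = save_via_cpu(payload)
print(old_copies, new_copies)
```

In this model both paths produce the same bytes, and the old path counts one more full-buffer copy than the new one. Whether the real code paths behave this way depends on the PyTorch internals discussed in the thread.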

t-vi · 2020-10-08T12:27:26Z

I added commentary on the root problem to the bug report. It indicates that this PR does indeed fix the bug and, as discussed above, I think it saves a memory copy too, so I'd suggest using this fix.

codecov · 2020-10-08T14:21:22Z

Codecov Report

Merging #46028 into master will increase coverage by 0.00%.
The diff coverage is 75.00%.

@@           Coverage Diff           @@
##           master   #46028   +/-   ##
=======================================
  Coverage   68.28%   68.29%           
=======================================
  Files         410      410           
  Lines       53609    53606    -3     
=======================================
  Hits        36608    36608           
+ Misses      17001    16998    -3

| Impacted Files | Coverage Δ |
|---|---|
| torch/serialization.py | `86.87% <75.00%> (+0.56%)` ⬆️ |

facebook-github-bot

@seemethere has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

gchanan · 2020-10-08T16:20:05Z

really nice digging!

I don't really see how this avoids the copy -- isn't the writeFile buffer copy just moved to the cpu() copy? Also, given we are about to do a release, changing serialization seems a little risky -- it seems reasonable to me to do a stack where you fix the bug, we cherry-pick that to 1.7, and we then make your change here to master.

gchanan · 2020-10-08T16:23:30Z

For a reproduction -- maybe we should just fix the error checking of _writeFile a bit (your suggestion of moving to pybind is better), so we can at least get an error on regression.

gchanan · 2020-10-08T16:32:25Z

I haven't verified this yet, but I'm guessing #46036 will blow up the serialization tests, so that and this together should convince us that the issue is fixed.

t-vi · 2020-10-08T16:54:27Z

Yeah, but we skip the other copy, from the CPU buffer into BytesIO. I'm reasonably certain this is the exact right thing, but you could also just add the False. I had this fix before doing the analysis posted to the issue.

gchanan · 2020-10-08T17:19:19Z

The thing I was wondering about is the change of assumption, i.e. now we assume that cpu() is implemented for all storages (it's clearly true for cuda). It's almost certainly true for things like QuantizedCPU, etc. but I'm not 100% sure off the top of my head.
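
The assumption raised here can be sketched as an explicit check. This is a minimal illustration with made-up classes, not PyTorch code: `FakeCudaStorage`, `FakeCpuStorage`, `NoCpuStorage`, `tobytes`, and `storage_to_cpu` are all hypothetical names. It only shows how a serializer that relies on `cpu()` could fail loudly for a storage type that lacks it.

```python
class FakeCpuStorage:
    """Hypothetical stand-in for a CPU storage."""

    def __init__(self, data: bytes):
        self._data = bytes(data)

    def cpu(self):
        return self  # already on the CPU

    def tobytes(self) -> bytes:  # hypothetical accessor, for the example only
        return self._data


class FakeCudaStorage:
    """Hypothetical stand-in for a GPU storage."""

    def __init__(self, data: bytes):
        self._data = bytes(data)

    def cpu(self):
        return FakeCpuStorage(self._data)  # models the device -> host copy


class NoCpuStorage:
    """Hypothetical storage type that does not implement cpu()."""


def storage_to_cpu(storage):
    """Return a CPU copy, or raise a clear error if cpu() is missing."""
    cpu = getattr(storage, "cpu", None)
    if not callable(cpu):
        raise TypeError(
            f"{type(storage).__name__} has no cpu() method; cannot serialize it"
        )
    return cpu()


host = storage_to_cpu(FakeCudaStorage(b"\x01\x02\x03"))
```

With a guard like this, a storage type without `cpu()` would surface as an immediate `TypeError` rather than a silent failure during save.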

t-vi · 2020-10-08T17:41:29Z

Good catch, but based on the test results, anything that has serialization tests would have .cpu().

t-vi · 2020-10-09T12:33:13Z

@gchanan keep or trash this?

I still think it is better to eliminate the additional BytesIO copy, but obviously your fix might be more conservative.

t-vi · 2020-10-12T18:23:11Z

Well, looks like we keep the copying.

gchanan · 2020-10-13T14:50:06Z

@t-vi I think we should merge this (I wanted to get the more conservative one in first so we could cherry-pick it to the release branch).

gchanan · 2020-10-13T14:50:31Z

do you want to fix the conflict or should I?

t-vi · 2020-10-13T14:58:09Z

OK, thanks. I'll update the PR.

facebook-github-bot

@gchanan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2020-10-13T22:15:02Z

@gchanan merged this pull request in 7b7f251.

gchanan · 2020-10-14T14:32:05Z

Nice, thanks for this!

pytorchbot added the open source label Oct 8, 2020

seemethere requested a review from jamesr66a October 8, 2020 15:53

facebook-github-bot reviewed Oct 8, 2020

t-vi mentioned this pull request Oct 8, 2020

[v1.7.0] Release Tracker #45592

Closed

mruberry added the triaged label Oct 8, 2020

mruberry requested a review from gchanan October 8, 2020 21:29

t-vi closed this Oct 12, 2020

gchanan reopened this Oct 13, 2020

Merge branch 'master' into we_❤️_stas_and_🤗

cc9e756

gchanan approved these changes Oct 13, 2020

facebook-github-bot reviewed Oct 13, 2020

facebook-github-bot closed this in 7b7f251 Oct 13, 2020

facebook-github-bot added the Merged label Oct 13, 2020
