Fix some bugs with zipfile serialization #32244

driazati · 2020-01-15T21:39:22Z

Stacked PRs

- #32958 - Make zip serialization the default
- **#32244 - Fix some bugs with zipfile serialization**

It includes the following changes:

- Split up tests so that we can test both serialization methods
  - Loading something within a buffer doesn't work anymore, so those tests are only on the old serialization method (it's possible but introduces a big slowdown since it requires a linear scan of the entire zipfile to find the magic number at the end)
- Call `readinto` on a buffer if possible instead of `read` + a copy
- Disable CRC-32 checks on read (there was some issue where miniz said the CRC was wrong but `zipinfo` and `unzip` said the zip file was fine)
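
To illustrate why loading an archive from the middle of a buffer is costly: a zip reader locates the archive's contents through an end-of-central-directory (EOCD) record, which sits at the end of the archive. If the archive is embedded in a larger stream with unknown trailing bytes, the reader has to search for that record. The following is a minimal sketch using only Python's standard library; the helper name `find_eocd` is hypothetical, and this is not the code PyTorch or miniz uses.

```python
import io
import zipfile

# Signature that starts the zip end-of-central-directory (EOCD) record.
EOCD_SIGNATURE = b"PK\x05\x06"


def find_eocd(buffer: bytes) -> int:
    """Return the offset of the last EOCD signature in `buffer`, or -1.

    Hypothetical helper: when the archive is not known to end at the end of
    the buffer, the search may have to cover the whole buffer, which is
    linear in its size.
    """
    return buffer.rfind(EOCD_SIGNATURE)


# Build a small archive in memory.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("data.pkl", b"payload")
zip_bytes = archive.getvalue()

# Embed the archive in a larger buffer with unrelated bytes on both sides.
prefix = b"header-bytes"
suffix = b"x" * 1024
combined = prefix + zip_bytes + suffix

eocd_offset = find_eocd(combined)
```

When the archive is the whole file, a reader can instead look only at the last few bytes, which is why the standalone case stays cheap.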

Differential Revision: D19418935
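
The `readinto` item in the list above can be sketched in plain Python as follows. This is only an illustration of the idea (fill a preallocated buffer directly instead of creating a temporary `bytes` object and copying it); the function names are hypothetical and the actual change lives in the C++ serialization code.

```python
import io


def read_with_copy(f, out: bytearray) -> int:
    """Read via `read`, then copy into `out` (allocates a temporary bytes object)."""
    data = f.read(len(out))
    out[: len(data)] = data
    return len(data)


def read_in_place(f, out: bytearray) -> int:
    """Use `readinto` when the file-like object provides it, else fall back."""
    if hasattr(f, "readinto"):
        n = f.readinto(out)
        return n or 0  # raw non-blocking streams may return None
    return read_with_copy(f, out)


payload = bytes(range(16))
dest_a = bytearray(16)
dest_b = bytearray(16)
n_a = read_with_copy(io.BytesIO(payload), dest_a)
n_b = read_in_place(io.BytesIO(payload), dest_b)
```

Both paths end with the same bytes in the destination; the second avoids the intermediate allocation when the stream supports it.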

driazati · 2020-01-22T22:56:59Z

third_party/miniz-2.0.8/miniz.c

@@ -4522,7 +4522,9 @@ void *mz_zip_reader_extract_file_to_heap(mz_zip_archive *pZip, const char *pFile
 mz_bool mz_zip_reader_extract_to_callback(mz_zip_archive *pZip, mz_uint file_index, mz_file_write_func pCallback, void *pOpaque, mz_uint flags)
 {
    int status = TINFL_STATUS_DONE;
+#ifndef MINIZ_DISABLE_ZIP_READER_CRC32_CHECKS


This is a fix that was upstreamed in this PR

Are we going to update miniz at some point?

We could but I think the update would just be the same as this change (plus whatever has been added since)
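
For context, the check that this macro makes optional compares the CRC-32 stored in the zip metadata with one computed over the extracted bytes. A rough sketch of that comparison using Python's standard `zipfile` and `zlib` modules (not miniz itself; the entry name is made up for illustration):

```python
import io
import zipfile
import zlib

# Write a one-entry archive into memory.
archive = io.BytesIO()
payload = b"tensor bytes would go here"
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("archive/data/0", payload)

# Read the entry back along with its recorded metadata.
with zipfile.ZipFile(archive) as zf:
    info = zf.getinfo("archive/data/0")
    extracted = zf.read(info)

stored_crc = info.CRC  # CRC-32 recorded in the archive
computed_crc = zlib.crc32(extracted) & 0xFFFFFFFF  # CRC-32 of the extracted bytes
crc_matches = stored_crc == computed_crc
```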

Summary: Stacked PRs
* #32244 - Make zip serialization the default
* **#32241 - Split serialization tests to their own file**

This makes them all easier to run as a batch. This PR is just a code move / fixing up imports. There are still some serialization tests in `test_torch.py` as part of `TestDeviceType`.

Pull Request resolved: #32241
Pulled By: driazati
Differential Revision: D19415826
fbshipit-source-id: a3f6cfe1626ff2f9b9631c409bf525bd32e4639b

jamesr66a

Not really all that familiar with this code but nothing stands out to me as awful

jamesr66a · 2020-02-05T00:53:59Z

torch/csrc/generic/serialization.cpp

@@ -6,8 +6,11 @@
 #include <c10/cuda/CUDAGuard.h>
 #endif

+// save_save is necessary since the old eager format saved storages as


save_size

jamesr66a · 2020-02-05T00:57:51Z

third_party/miniz-2.0.8/miniz.c

@@ -4522,7 +4522,9 @@ void *mz_zip_reader_extract_file_to_heap(mz_zip_archive *pZip, const char *pFile
 mz_bool mz_zip_reader_extract_to_callback(mz_zip_archive *pZip, mz_uint file_index, mz_file_write_func pCallback, void *pOpaque, mz_uint flags)
 {
    int status = TINFL_STATUS_DONE;
+#ifndef MINIZ_DISABLE_ZIP_READER_CRC32_CHECKS


Are we going to update miniz at some point?

facebook-github-bot · 2020-02-06T11:16:26Z

@driazati merged this pull request in 74ce3a0.

Evpok · 2020-02-07T13:17:44Z

The added warning is somewhat worrying, since what is suggested would break the possibility of pickling tensors as other Python objects (for instance for now I can pickle a tuple of tensors with a simple call to pickle.dumps). Was there a discussion for this choice?

driazati · 2020-02-07T19:13:17Z

Not really (feel free to open one and tag whoever), but the thinking was that PyTorch has its own serialization format that uses pickle but is not exactly pickle since it has some other stuff. torch.save will already let you serialize arbitrary Python objects so long as they are pickle-able, and provides the mechanism to make Tensors pickle-able. Is there some use case you have that isn't served by torch.save?

Evpok · 2020-02-08T14:18:15Z

@driazati Oh, right, I didn't remember that torch.save worked for arbitrary Python objects too. Thanks!

Summary: Stacked PRs
* pytorch#32958 - Make zip serialization the default
* **pytorch#32244 - Fix some bugs with zipfile serialization**

It includes the following changes:
* Split up tests so that we can test both serialization methods
  * Loading something within a buffer doesn't work anymore, so those tests are only on the old serialization method (it's possible but introduces a big slowdown since it requires a linear scan of the entire zipfile to find the magic number at the end)
* Call `readinto` on a buffer if possible instead of `read` + a copy
* Disable CRC-32 checks on read (there was some issue where miniz said the CRC was wrong but `zipinfo` and `unzip` said the zip file was fine)

Pull Request resolved: pytorch#32244
Pulled By: driazati
Reviewed By: eellison
Differential Revision: D19418935
fbshipit-source-id: df140854f52ecd04236225417d625374fd99f573

ppwwyyxx · 2020-05-14T01:25:21Z

Yes, torch has its own serialization function that may work better, but that doesn't justify removing pickle support. There are a lot of reasons to keep pickle working, such as:

- It's what everyone in the Python world uses by default. Third-party libraries that we can't change may still use pickle under the hood.
- It has simpler APIs (there is no equivalent of `pkl.dumps` in torch).

andyljones · 2020-05-15T21:00:59Z

Echoing @ppwwyyxx; I frequently pass around torch storage as part of larger datastructures, and insisting on pickle means any generic code that serializes those data structures will now need to import torch, even if they have no use for torch's functionality other than save.

Phrased another way, the fact every other part of the Python ecosystem uses pickle for pickling means you can currently treat Python datastructures as black boxes. This change would break that nice opacity.

The pickle protocol is pretty powerful - there's a whole VM down there in fact. What functionality does torch.save provide that's tricky to hook into that?

driazati · 2020-05-16T00:09:33Z

torch.save uses pickle under the hood, it just has some special handling for saving tensor data. The pickle support here is just calling torch.save and storing the tensor as that series of bytes. It'd probably be best to define some real contract around this (i.e. what parts of torch are pickle.dumps-able, since only Storage has it right now).
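
The mechanism described here, making an object picklable by storing the bytes produced by a separate save routine, can be sketched with `__reduce__`. Everything below (`FakeStorage`, `fake_save`, `fake_load`, `_load_from_bytes`) is a hypothetical stand-in built on the standard library only; it is not PyTorch's code.

```python
import io
import pickle


def fake_save(obj: "FakeStorage", f) -> None:
    # Stand-in for a custom serialization format: a magic tag plus raw bytes.
    f.write(b"FAKE" + bytes(obj.data))


def fake_load(f) -> "FakeStorage":
    raw = f.read()
    if raw[:4] != b"FAKE":
        raise ValueError("not a FAKE-format payload")
    return FakeStorage(raw[4:])


def _load_from_bytes(b: bytes) -> "FakeStorage":
    # Called by pickle at load time with the opaque byte string.
    return fake_load(io.BytesIO(b))


class FakeStorage:
    def __init__(self, data: bytes):
        self.data = bytearray(data)

    def __reduce__(self):
        # Serialize with the custom format and hand pickle an opaque byte string.
        buf = io.BytesIO()
        fake_save(self, buf)
        return (_load_from_bytes, (buf.getvalue(),))


# Generic code can now pickle a container holding the object
# without knowing anything about the custom format.
original = (FakeStorage(b"\x01\x02\x03"), "metadata")
restored = pickle.loads(pickle.dumps(original))
```

The caller only touches `pickle`, which is the black-box property discussed in this thread; the custom format stays an implementation detail of the class.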

Being able to treat PyTorch objects as a black box is a pretty strong use case. Can one of you file a follow-up issue with the points made here so that the warning can be removed?

driazati requested a review from apaszke as a code owner January 15, 2020 21:39

driazati mentioned this pull request Jan 15, 2020

Split serialization tests to their own file #32241

Closed

facebook-github-bot added the oncall: jit label Jan 15, 2020

driazati force-pushed the driazati/seri/2 branch from 2ac6851 to 53924a1 Compare January 15, 2020 22:30

driazati commented Jan 22, 2020

driazati mentioned this pull request Jan 27, 2020

torch.{save,load} data corruption when serializing a Module with __{get,set}state__ #24045

Open

update

6c32931

driazati force-pushed the driazati/seri/2 branch from b393b07 to 6c32931 Compare January 28, 2020 23:37

driazati requested review from ebetica, goldsborough, mrshenli, pietern, pritamdamania87, yf225 and zhaojuanmao as code owners January 28, 2020 23:37

driazati changed the base branch from driazati/seri/1 to master January 28, 2020 23:37

driazati removed request for pietern, ebetica, yf225, apaszke, goldsborough, pritamdamania87, mrshenli and zhaojuanmao January 28, 2020 23:37

[wip] address some comments

849f0d4

driazati changed the title ~~Make zip serialization the default~~ Fix some bugs with zipfile serialization Feb 4, 2020

driazati mentioned this pull request Feb 4, 2020

Make zip serialization the default #32958

Closed

driazati requested a review from jamesr66a February 4, 2020 19:23

driazati requested a review from dzhulgakov February 4, 2020 19:23

jamesr66a approved these changes Feb 5, 2020

facebook-github-bot closed this in 74ce3a0 Feb 5, 2020

facebook-github-bot added the merged label Feb 6, 2020

andyljones mentioned this pull request May 16, 2020

Remove the pickle deprecation warning #38597

Closed

facebook-github-bot deleted the driazati/seri/2 branch July 13, 2020 17:55

mruberry added the Merged label Oct 28, 2020
