Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix some bugs with zipfile serialization #32244

Closed
wants to merge 2 commits into from
Closed

Conversation

driazati
Copy link
Contributor

@driazati driazati commented Jan 15, 2020

Stacked PRs

It includes the following changes:

  • Split up tests so that we can test both serialization methods
    • Loading something within a buffer doesn't work anymore, so those tests are only on the old serialization method (it's possible but introduces a big slowdown since it requires a linear scan of the entire zipfile to find the magic number at the end)
  • Call readinto on a buffer if possible instead of read + a copy
  • Disable CRC-32 checks on read (there was some issue where miniz said the CRC was wrong but zipinfo and unzip said the zip file was fine)

Differential Revision: D19418935

@facebook-github-bot facebook-github-bot added the oncall: jit Add this issue/PR to JIT oncall triage queue label Jan 15, 2020
@@ -4522,7 +4522,9 @@ void *mz_zip_reader_extract_file_to_heap(mz_zip_archive *pZip, const char *pFile
mz_bool mz_zip_reader_extract_to_callback(mz_zip_archive *pZip, mz_uint file_index, mz_file_write_func pCallback, void *pOpaque, mz_uint flags)
{
int status = TINFL_STATUS_DONE;
#ifndef MINIZ_DISABLE_ZIP_READER_CRC32_CHECKS
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fix that was upstreamed in this PR

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we going to update miniz at some point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could but I think the update would just be the same as this change (plus whatever has been added since)

facebook-github-bot pushed a commit that referenced this pull request Jan 28, 2020
Summary:
Stacked PRs
 * #32244 - Make zip serialization the default
 * **#32241 - Split serialization tests to their own file**

This makes them all easier to run as a batch. This PR is just a code move / fixing up imports. There are still some serialization tests in `test_torch.py` as part of `TestDeviceType`.
](https://our.intern.facebook.com/intern/diff/19415826/)
Pull Request resolved: #32241

Pulled By: driazati

Differential Revision: D19415826

fbshipit-source-id: a3f6cfe1626ff2f9b9631c409bf525bd32e4639b
@driazati driazati changed the base branch from driazati/seri/1 to master January 28, 2020 23:37
wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020
Summary:
Stacked PRs
 * pytorch#32244 - Make zip serialization the default
 * **pytorch#32241 - Split serialization tests to their own file**

This makes them all easier to run as a batch. This PR is just a code move / fixing up imports. There are still some serialization tests in `test_torch.py` as part of `TestDeviceType`.
](https://our.intern.facebook.com/intern/diff/19415826/)
Pull Request resolved: pytorch#32241

Pulled By: driazati

Differential Revision: D19415826

fbshipit-source-id: a3f6cfe1626ff2f9b9631c409bf525bd32e4639b
@driazati driazati changed the title Make zip serialization the default Fix some bugs with zipfile serialization Feb 4, 2020
Copy link
Collaborator

@jamesr66a jamesr66a left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really all that familiar with this code but nothing stands out to me as awful

@@ -6,8 +6,11 @@
#include <c10/cuda/CUDAGuard.h>
#endif

// save_save is necessary since the old eager format saved storages as
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

save_size

@@ -4522,7 +4522,9 @@ void *mz_zip_reader_extract_file_to_heap(mz_zip_archive *pZip, const char *pFile
mz_bool mz_zip_reader_extract_to_callback(mz_zip_archive *pZip, mz_uint file_index, mz_file_write_func pCallback, void *pOpaque, mz_uint flags)
{
int status = TINFL_STATUS_DONE;
#ifndef MINIZ_DISABLE_ZIP_READER_CRC32_CHECKS
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we going to update miniz at some point?

@facebook-github-bot
Copy link
Contributor

@driazati merged this pull request in 74ce3a0.

@Evpok
Copy link
Contributor

Evpok commented Feb 7, 2020

The added warning is somewhat worrying, since what is suggested would break the possibility of pickling tensors as other Python objects (for instance for now I can pickle a tuple of tensors with a simple call to pickle.dumps). Was there a discussion for this choice?

@driazati
Copy link
Contributor Author

driazati commented Feb 7, 2020

Not really (feel free to open one and tag whoever), but the thinking was that PyTorch has its own serialization format that uses pickle but is not exactly pickle since it has some other stuff. torch.save will already let you serialize arbitrary Python objects so long as they are pickle-able, and provides the mechanism to make Tensors pickle-able. Is there some use case you have that isn't served by torch.save?

@Evpok
Copy link
Contributor

Evpok commented Feb 8, 2020

@driazati Oh, right, I didn't remember that torch.save worked for arbitrary Python objects too. Thanks!

BowenBao pushed a commit to BowenBao/pytorch that referenced this pull request Feb 12, 2020
Summary:
Stacked PRs
 * pytorch#32958 - Make zip serialization the default
 * **pytorch#32244 - Fix some bugs with zipfile serialization**

It includes the following changes:
* Split up tests so that we can test both serialization methods
    * Loading something within a buffer doesn't work anymore, so those tests are only on the old serialization method (it's possible but introduces a big slowdown since it requires a linear scan of the entire zipfile to find the magic number at the end)
* Call `readinto` on a buffer if possible instead of `read` + a copy
* Disable CRC-32 checks on read (there was some issue where miniz said the CRC was wrong but `zipinfo` and `unzip` said the zip file was fine)
](https://our.intern.facebook.com/intern/diff/19418935/)
Pull Request resolved: pytorch#32244

Pulled By: driazati

Reviewed By: eellison

Differential Revision: D19418935

fbshipit-source-id: df140854f52ecd04236225417d625374fd99f573
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
Summary:
Stacked PRs
 * pytorch#32244 - Make zip serialization the default
 * **pytorch#32241 - Split serialization tests to their own file**

This makes them all easier to run as a batch. This PR is just a code move / fixing up imports. There are still some serialization tests in `test_torch.py` as part of `TestDeviceType`.
](https://our.intern.facebook.com/intern/diff/19415826/)
Pull Request resolved: pytorch#32241

Pulled By: driazati

Differential Revision: D19415826

fbshipit-source-id: a3f6cfe1626ff2f9b9631c409bf525bd32e4639b
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
Summary:
Stacked PRs
 * pytorch#32958 - Make zip serialization the default
 * **pytorch#32244 - Fix some bugs with zipfile serialization**

It includes the following changes:
* Split up tests so that we can test both serialization methods
    * Loading something within a buffer doesn't work anymore, so those tests are only on the old serialization method (it's possible but introduces a big slowdown since it requires a linear scan of the entire zipfile to find the magic number at the end)
* Call `readinto` on a buffer if possible instead of `read` + a copy
* Disable CRC-32 checks on read (there was some issue where miniz said the CRC was wrong but `zipinfo` and `unzip` said the zip file was fine)
](https://our.intern.facebook.com/intern/diff/19418935/)
Pull Request resolved: pytorch#32244

Pulled By: driazati

Reviewed By: eellison

Differential Revision: D19418935

fbshipit-source-id: df140854f52ecd04236225417d625374fd99f573
@ppwwyyxx
Copy link
Contributor

Yes, torch has its own serialization function that may work better, but that doesn't justify removing pickle support. There are a lot of reasons to still enable pickle, like

  • it's what everyone in the python world use by default. third-party libraries that we can't change may still use pickle under the hood
  • simpler APIs (no equivalent of pkl.dumps in torch)

@andyljones
Copy link
Contributor

andyljones commented May 15, 2020

Echoing @ppwwyyxx; I frequently pass around torch storage as part of larger datastructures, and insisting on pickle means any generic code that serializes those data structures will now need to import torch, even if they have no use for torch's functionality other than save.

Phrased another way, the fact every other part of the Python ecosystem uses pickle for pickling means you can currently treat Python datastructures as black boxes. This change would break that nice opacity.

The pickle protocol is pretty powerful - there's a whole VM down there in fact. What functionality does torch.save provide that's tricky to hook into that?

@driazati
Copy link
Contributor Author

torch.save uses pickle under the hood, it just has some special handling for saving tensor data. The pickle support here is just calling torch.save and storing the tensor as that series of bytes. It'd probably be best to define some real contract around this (i.e. what parts of torch are pickle.dumps-able, since only Storage has it right now).

Being able to treat PyTorch objects as a black box is a pretty strong use case, can one of you file a follow up issue with the points made here to remove that warning?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Merged oncall: jit Add this issue/PR to JIT oncall triage queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants