
Handle deflate64-compressed ZIP files in the "Start S3 transfer" Lambda #4614

Closed
alexwlchan opened this issue Jul 2, 2020 · 5 comments · Fixed by wellcomecollection/archivematica-infrastructure#62


@alexwlchan

Trudy has reported a fascinating bug – she’s uploaded a ZIP that’s ~5.6GB. The Lambda should be triggered and start a transfer in Archivematica, or at least produce a log explaining why it couldn’t. It’s doing neither.

Looking in CloudWatch, I see the following error:

zipfile.BadZipFile: zipfiles that span multiple disks are not supported

A bit of googling suggests this is a bug in the Python standard library: https://bugs.python.org/issue22102

I’m hoping that updating the Lambda runtime to Python 3.8 will fix this for us. I can’t find the fix called out in a published changelog, but there’s an entry in the Python 3.8.0b1 changelog that matches the changelog entry included in the patch:

Added support for ZIP files with disks set to 0. Such files are commonly created by builtin tools on Windows when use ZIP64 extension. Patch by Francisco Facioni.
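
For reference, a minimal sketch of the kind of guard that would at least surface this in the logs – the function name and path are assumptions for illustration, not the actual Lambda code:

import logging
import zipfile

logger = logging.getLogger(__name__)

def read_zip_names(path):
    # zipfile raises BadZipFile for archives it can't parse, including
    # "zipfiles that span multiple disks are not supported" on Python 3.7.
    try:
        with zipfile.ZipFile(path) as zf:
            return zf.namelist()
    except zipfile.BadZipFile:
        logger.exception("Could not read %s as a ZIP", path)
        raise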

@alexwlchan

Bumping the environment to python3.8 gets a different error:

NotImplementedError: That compression method is not supported

@alexwlchan

Cracking the zipfile open in an EC2 instance running Python 3.7 works fine, so I guess we're missing something like zlib in the Lambda?

@alexwlchan

Ah, so I can see the namelist but not actually extract files. It does give me a slightly more useful error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib64/python3.7/zipfile.py", line 1560, in open
    return ZipExtFile(zef_file, mode, zinfo, pwd, True)
  File "/usr/lib64/python3.7/zipfile.py", line 809, in __init__
    self._decompressor = _get_decompressor(self._compress_type)
  File "/usr/lib64/python3.7/zipfile.py", line 722, in _get_decompressor
    raise NotImplementedError("compression type %d (%s)" % (compress_type, descr))
NotImplementedError: compression type 9 (deflate64)
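
Here’s a quick sketch of how to spot which entries use the unsupported method once the archive is on local disk; the stdlib has no named constant for deflate64, so the raw method number 9 is checked directly:

import zipfile

with zipfile.ZipFile("2589.zip") as zf:
    for info in zf.infolist():
        # compress_type 8 is ordinary deflate; 9 is deflate64, which
        # zipfile can list but not decompress.
        if info.compress_type == 9:
            print("deflate64 entry:", info.filename)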

@alexwlchan

Some thorough discussion of compression methods over here explains that deflate64 is a proprietary compression method not supported by Python: thejoshwolfe/yauzl#58

I'm going to look at including p7zip in the Lambda container.

@alexwlchan changed the title on Jul 8, 2020: Fix the "start S3 transfer" Lambda for Archivematica → Handle deflate64-compressed ZIP files in the "Start S3 transfer" Lambda
@alexwlchan

So Trudy is using 7-Zip to create the ZIP files, and that uses the deflate64 compression algorithm.

You can unpack these files with p7zip on the command line (7z e 2589.zip metadata/metadata.csv), but only if you have the file downloaded. There's a Lambda layer that includes p7zip – https://github.com/Securezapp/p7zip-aws-lambda-layer – but the lack of disk space on Lambda nixes that plan.
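
Roughly what that looks like from Python, assuming the archive has already been downloaded somewhere like /tmp (the paths here are illustrative):

import subprocess

# Extract a single member with p7zip.  "e" flattens the directory
# structure and -o sets the output directory; the whole ZIP has to be
# on local disk first, which is the problem on Lambda.
subprocess.run(
    ["7z", "e", "/tmp/2589.zip", "metadata/metadata.csv", "-o/tmp"],
    check=True,
)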

We only need to unpack the ZIP in the Lambda to do validation; for accessions at least we can infer the accession number from the filename. I'm going to try that.
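
A sketch of what inferring the accession number might look like – the actual filename convention isn't described in this issue, so the pattern below is purely a hypothetical placeholder:

import os
import re

# Hypothetical pattern: replace with whatever the real accession
# filenames actually look like.
ACCESSION_RE = re.compile(r"^(\d+)\.zip$")

def infer_accession_number(key):
    filename = os.path.basename(key)
    match = ACCESSION_RE.match(filename)
    if match is None:
        raise ValueError(f"Can't infer an accession number from {key!r}")
    return match.group(1)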
