
Handle deflate64-compressed ZIP files in the "Start S3 transfer" Lambda #4614

Closed
alexwlchan opened this issue Jul 2, 2020 · 5 comments · Fixed by wellcomecollection/archivematica-infrastructure#62


@alexwlchan

Trudy has reported a fascinating bug – she’s uploaded a ZIP that’s ~5.6GB. The Lambda should be triggered and start a transfer in Archivematica, or at least produce a log explaining why it couldn’t. It’s doing neither.

Looking in CloudWatch, I see the following error:

zipfile.BadZipFile: zipfiles that span multiple disks are not supported

A bit of googling suggests this is a bug in the Python standard library: https://bugs.python.org/issue22102

I’m hoping that updating the Lambda runtime to Python 3.8 will fix this for us. I can’t find the fix called out in a published changelog, but there’s an entry in the Python 3.8.0b1 changelog that matches the changelog entry included in the patch:

Added support for ZIP files with disks set to 0. Such files are commonly created by builtin tools on Windows when use ZIP64 extension. Patch by Francisco Facioni.
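
For reference, a minimal sketch of the kind of guard that would at least surface this in the logs – the function name and path are assumptions for illustration, not the actual Lambda code:

import logging
import zipfile

logger = logging.getLogger(__name__)

def read_zip_names(path):
    # zipfile raises BadZipFile for archives it can't parse, including
    # "zipfiles that span multiple disks are not supported" on Python 3.7.
    try:
        with zipfile.ZipFile(path) as zf:
            return zf.namelist()
    except zipfile.BadZipFile:
        logger.exception("Could not read %s as a ZIP", path)
        raise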

@alexwlchan

Bumping the environment to python3.8 gets a different error:

NotImplementedError: That compression method is not supported

@alexwlchan

Cracking the zipfile open in an EC2 instance running Python 3.7 works fine, so I guess we're missing something like zlib in the Lambda?

@alexwlchan

Ah, so I can see the namelist but not actually extract files. It does give me a slightly more useful error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib64/python3.7/zipfile.py", line 1560, in open
    return ZipExtFile(zef_file, mode, zinfo, pwd, True)
  File "/usr/lib64/python3.7/zipfile.py", line 809, in __init__
    self._decompressor = _get_decompressor(self._compress_type)
  File "/usr/lib64/python3.7/zipfile.py", line 722, in _get_decompressor
    raise NotImplementedError("compression type %d (%s)" % (compress_type, descr))
NotImplementedError: compression type 9 (deflate64)
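
Here’s a quick sketch of how to spot which entries use the unsupported method once the archive is on local disk; the stdlib has no named constant for deflate64, so the raw method number 9 is checked directly:

import zipfile

with zipfile.ZipFile("2589.zip") as zf:
    for info in zf.infolist():
        # compress_type 8 is ordinary deflate; 9 is deflate64, which
        # zipfile can list but not decompress.
        if info.compress_type == 9:
            print("deflate64 entry:", info.filename)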

@alexwlchan

Some thorough discussion of compression methods over here explains that deflate64 is a proprietary compression method not supported by Python: thejoshwolfe/yauzl#58

I'm going to look at including p7zip in the Lambda container.

@alexwlchan changed the title on Jul 8, 2020: Fix the "start S3 transfer" Lambda for Archivematica → Handle deflate64-compressed ZIP files in the "Start S3 transfer" Lambda
@alexwlchan

So Trudy is using 7-Zip to create the ZIP files, and that uses the deflate64 compression algorithm.

You can unpack these files with p7zip on the command line (7z e 2589.zip metadata/metadata.csv), but only if you have the file downloaded. There's a Lambda layer that includes p7zip – https://github.com/Securezapp/p7zip-aws-lambda-layer – but the lack of disk space on Lambda nixes that plan.
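
Roughly what that looks like from Python, assuming the archive has already been downloaded somewhere like /tmp (the paths here are illustrative):

import subprocess

# Extract a single member with p7zip.  "e" flattens the directory
# structure and -o sets the output directory; the whole ZIP has to be
# on local disk first, which is the problem on Lambda.
subprocess.run(
    ["7z", "e", "/tmp/2589.zip", "metadata/metadata.csv", "-o/tmp"],
    check=True,
)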

We only need to unpack the ZIP in the Lambda to do validation; for accessions at least we can infer the accession number from the filename. I'm going to try that.
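
A sketch of what inferring the accession number might look like – the actual filename convention isn't described in this issue, so the pattern below is purely a hypothetical placeholder:

import os
import re

# Hypothetical pattern: replace with whatever the real accession
# filenames actually look like.
ACCESSION_RE = re.compile(r"^(\d+)\.zip$")

def infer_accession_number(key):
    filename = os.path.basename(key)
    match = ACCESSION_RE.match(filename)
    if match is None:
        raise ValueError(f"Can't infer an accession number from {key!r}")
    return match.group(1)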
