problems with zip support #586

0-wiz-0 · 2023-02-16T11:15:19Z

I tried fq on zip archives, and I had two problems so far:

fq seems to have zip64 support, but it's not complete - at least on one file I tried it on - the eocd64 marker is shown as 'gap0' (example file: https://github.com/nih-at/libzip/blob/main/regress/zip64.zip):

# fq . zip64.zip
    |00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18 19 1a 1b|0123456789abcdef0123456789ab|.{}: zip64.zip (zip)
0x00|50 4b 03 04 2d 00 00 00 00 00 4f 72 5b 40 07 a1 ea dd ff ff ff ff ff ff ff ff 01 00|PK..-.....Or[@..............|  local_files[0:1]:
0x1c|14 00 2d 01 00 10 00 02 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00 61 0a         |..-....................a.   |
0x1c|                                                                           50 4b 01|                         PK.|  central_directories[0:1]:
0x38|02 1e 03 2d 00 00 00 00 00 4f 72 5b 40 07 a1 ea dd 02 00 00 00 02 00 00 00 01 00 00|...-.....Or[@...............|
0x54|00 00 00 00 00 01 00 00 00 80 11 00 00 00 00 2d                                    |...............-            |
0x54|                                                50 4b 06 06 2c 00 00 00 00 00 00 00|                PK..,.......|  gap0: raw bits
0x70|1e 03 2d 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00|..-.........................|
*   |until 0xaf.7 (76)                                                                  |                            |
0xa8|                        50 4b 05 06 00 00 00 00 01 00 01 00 2f 00 00 00 35 00 00 00|        PK........../...5...|  end_of_central_directory_record{}:
0xc4|00 00|                                                                             |..|                         |

I wanted to try the interactive mode to look inside a zip inside a zip, and fq just hung completely, eating ~33GB RAM and 100% CPU:

fq -i . bigzero-zip.zip

using https://github.com/nih-at/libzip/blob/main/regress/bigzero-zip.zip

wader · 2023-02-20T17:52:10Z

Hey, sorry i somehow missed the notification email about this. Will have a look.

wader · 2023-02-20T18:19:40Z

Had a quick look at zip64.zip. gap0 seems to be a "Zip64 end of central directory record" structure (signature 0x06064b50; confusingly the bytes appear reversed in the dump, as 50 4b 06 06, because zip uses little endian). If I've understood the spec correctly, the way to detect zip64 is that the "zip32" version of "end of central directory record" has an "offset of start of central directory ..." field that is 0xffffffff, see:

 4.4.24 offset of start of central directory with respect to
        the starting disk number:  (4 bytes)

  Offset of the start of the central directory on the
  disk on which the central directory starts. If an 
  archive is in ZIP64 format and the value in this 
  field is 0xFFFFFFFF, the size will be in the 
  corresponding 8 byte zip64 end of central 
  directory field.

In this file it seems to be 0x35 🤔 do you know if there is some other way to detect it? or could this be some left over data etc because of how the zip writer is implemented?

$ fq -o line_bytes=8 '(.end_of_central_directory_record | d), (.gap0 | dd)' zip64.zip
    │00 01 02 03 04 05 06 07│01234567│.end_of_central_directory_record{}:
0xb0│50 4b 05 06            │PK..    │  signature: raw bits (valid)
0xb0│            00 00      │    ..  │  disk_nr: 0
0xb0│                  00 00│      ..│  central_directory_start_disk_nr: 0
0xb8│01 00                  │..      │  nr_of_central_directory_records_on_disk: 1
0xb8│      01 00            │  ..    │  nr_of_central_directory_records: 1
0xb8│            2f 00 00 00│    /...│  size_of_central_directory: 47
0xc0│35 00 00 00            │5...    │  offset_of_start_of_central_directory: 53
0xc0│            00 00│     │    ..│ │  comment_length: 0
    │                       │        │  comment: ""
    │00 01 02 03 04 05 06 07│01234567│
0x60│            50 4b 06 06│    PK..│.gap0: raw bits
0x68│2c 00 00 00 00 00 00 00│,.......│
0x70│1e 03 2d 00 00 00 00 00│..-.....│
0x78│00 00 00 00 01 00 00 00│........│
0x80│00 00 00 00 01 00 00 00│........│
0x88│00 00 00 00 2f 00 00 00│..../...│
0x90│00 00 00 00 35 00 00 00│....5...│
0x98│00 00 00 00 50 4b 06 07│....PK..│
0xa0│00 00 00 00 64 00 00 00│....d...│
0xa8│00 00 00 00 01 00 00 00│........│
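
For reference, the 76 bytes reported as gap0 can be decoded by hand. The sketch below is only an illustration in Python, not fq's decoder: the bytes are copied from the dump above, the layout assumed is the zip64 EOCD record followed by the zip64 EOCD locator as described in the appnote, and the dictionary keys are names made up for this example.

```python
import struct

# The 76 gap0 bytes from the dump above (file offsets 0x64..0xaf).
GAP0 = bytes.fromhex(
    "504b0606" "2c00000000000000" "1e03" "2d00"
    "00000000" "00000000"
    "0100000000000000" "0100000000000000"
    "2f00000000000000" "3500000000000000"
    "504b0607" "00000000" "6400000000000000" "01000000"
)

# Both structures are little endian, which is why the signature
# 0x06064b50 shows up in the dump as 50 4b 06 06 ("PK\x06\x06").
ZIP64_EOCD = struct.Struct("<4sQHHIIQQQQ")  # 56 bytes
ZIP64_LOCATOR = struct.Struct("<4sIQI")     # 20 bytes


def decode_gap0(buf: bytes) -> dict:
    """Decode a zip64 EOCD record immediately followed by its locator."""
    (sig, record_size, made_by, needed, disk, cd_disk,
     entries_on_disk, entries_total, cd_size, cd_offset) = ZIP64_EOCD.unpack_from(buf, 0)
    (loc_sig, eocd64_disk, eocd64_offset,
     total_disks) = ZIP64_LOCATOR.unpack_from(buf, ZIP64_EOCD.size)
    return {
        "eocd64_signature": sig,
        "record_size": record_size,      # size of the record after this field
        "version_made_by": made_by,
        "version_needed": needed,
        "disk_nr": disk,
        "cd_start_disk_nr": cd_disk,
        "entries_on_disk": entries_on_disk,
        "entries_total": entries_total,
        "cd_size": cd_size,
        "cd_offset": cd_offset,
        "locator_signature": loc_sig,
        "eocd64_disk_nr": eocd64_disk,
        "eocd64_offset": eocd64_offset,  # where the zip64 EOCD record starts
        "total_disks": total_disks,
    }


if __name__ == "__main__":
    for key, value in decode_gap0(GAP0).items():
        print(f"{key}: {value!r}")
```

Decoded this way, the zip64 record carries the same central directory size (47) and offset (53) as the zip32 EOCD shown above, and the locator's offset field is 0x64, which is where gap0 starts.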

0-wiz-0 · 2023-02-20T18:23:18Z

You can identify a zip64 by the existence of the zip64 EOCD, which you can recognize by the 'PK\06\06'.
If any of the files is too large (even just when uncompressed), then the whole file needs to be zip64 even if none of the fields in the EOCD marker are too big. AFAIK there is no requirement to set the values in the EOCD to 0xffffffff if they fit.
Does this help?

0-wiz-0 · 2023-02-20T18:42:56Z

Sorry, that was slightly wrong: You look for the Zip64 end of central directory locator (0x07064b50) directly before the standard ZIP EOCD that points you to the Zip64 EOCD and then you can parse from there.

wader · 2023-02-20T18:51:00Z

Yes, that helps, and that is actually how the decoder works now: it heuristically seeks backwards from the end for the "zip32" EOCD signature. So should it look for both and decode both, but prefer the zip64 one? Otherwise there might be a zip32 gap, I guess?

0-wiz-0 · 2023-02-20T18:53:26Z

No, in a zip64 there are both. From the appnote:

      [central directory header n]
      [zip64 end of central directory record]
      [zip64 end of central directory locator] 
      [end of central directory record]

So at the end you always have a zip(32) EOCD. If you go backwards from that and find a zip64 EOCD locator, it's a zip64 and you need to parse the zip64 EOCD (which it tells you how to find). If you find a central directory header instead, you know it's not a zip64.
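
The backwards walk described above could be sketched roughly like this in Python. It is a simplified illustration, not fq's or libzip's code: it assumes the whole archive is in memory, a single-disk archive, and offsets relative to the start of the data (so no prepended data such as a self-extractor stub). The function and constant names are made up for this example.

```python
import struct

EOCD_SIG = b"PK\x05\x06"      # zip(32) end of central directory record
LOCATOR_SIG = b"PK\x06\x07"   # zip64 end of central directory locator
EOCD64_SIG = b"PK\x06\x06"    # zip64 end of central directory record
EOCD_MIN_SIZE = 22            # zip(32) EOCD without comment
LOCATOR_SIZE = 20


def find_eocd(data: bytes) -> int:
    """Return the offset of the zip(32) EOCD, or -1 if none is found.

    Searches backwards and only accepts a candidate whose comment
    length makes the record end exactly at the end of the data.
    """
    pos = data.rfind(EOCD_SIG)
    while pos != -1:
        if pos + EOCD_MIN_SIZE <= len(data):
            (comment_len,) = struct.unpack_from("<H", data, pos + 20)
            if pos + EOCD_MIN_SIZE + comment_len == len(data):
                return pos
        pos = data.rfind(EOCD_SIG, 0, pos)
    return -1


def zip64_eocd_offset(data: bytes):
    """Return the zip64 EOCD record offset, or None for a plain zip(32)."""
    eocd = find_eocd(data)
    if eocd == -1:
        raise ValueError("no end of central directory record found")
    loc = eocd - LOCATOR_SIZE
    # No locator directly before the EOCD: not a zip64 archive.
    if loc < 0 or data[loc:loc + 4] != LOCATOR_SIG:
        return None
    _sig, _disk, offset, _total_disks = struct.unpack_from("<4sIQI", data, loc)
    if data[offset:offset + 4] != EOCD64_SIG:
        raise ValueError("locator does not point at a zip64 EOCD record")
    return offset
```

For example, `zip64_eocd_offset(open("zip64.zip", "rb").read())` would be expected to return the position of the `PK\x06\x06` record for a zip64 archive and `None` for a regular one. A real decoder would also need to handle very large files without reading them fully into memory.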

wader · 2023-02-20T20:27:08Z

Aha I see, thanks. Will look into a fix soon, let me know if you want to have a go at it. The code looks a bit messy now, so it could probably be cleaned up and refactored a bit also.

wader · 2023-02-20T20:36:23Z

Had a look at bigzero-zip.zip now also. Looks like the toml decoder is not very happy about the 4GB zeroes files. If i build without it in the probe group it succeeds decoding after some time (but is quite slow).

I did fq -d bytes '_registry.groups.probe[] as $f | $f, try decode($f) catch .' bigzero to figure out which probe format got stuck.

Will have a look at the toml code (uses github.com/BurntSushi/toml) and try to figure out some way to fail toml decoding faster.

0-wiz-0 · 2023-02-20T21:27:01Z

No hurry. Since I'm already maintaining a zip library, I'll leave this one up to you ;)

Interesting analysis about the toml decoder. I had thought files would be decoded more on-demand, not all in advance. Good to know!

wader · 2023-02-20T22:03:18Z

I was a bit fast to judge: toml is slow but finishes, it seems to actually be the xml decoder that eats a lot of memory for some reason, hmm.

Currently I'm not doing any on-demand decoding. I have thought about it and it would be interesting to look into, but fq's decode and jq code is quite complex as it is :) so will see. Focus has mostly been on making things possible over speed and efficiency. But some formats do have options to disable sub-decoding, the zip format should probably have a decode_uncompressed option etc.

encoding/xml and github.com/BurntSushi/toml both read a lot before detecting that they can't decode. Now we instead read one UTF-8 character and make sure it's valid xml or toml. Should speed up probing. Related to #586 bigzero-zip.zip

wader · 2023-02-22T15:30:33Z

#594 makes decoding bigzero-zip.zip quite a lot faster, but it will still use some CPU and memory as it uncompresses to memory. This will also speed up probing in many other cases. If you're curious, fq does not have any special probe code, instead it's up to the decoders that are in the "probe" group to fail fast.
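
To illustrate the fail-fast idea, here is a rough sketch in Python. It is not the check implemented in #594 (fq is written in Go and the details there may differ); it only shows how looking at the first UTF-8 character can reject a buffer of zero bytes before a full XML or TOML parse is attempted. The function names and the exact rule (reject invalid UTF-8 and control characters other than tab, LF and CR) are assumptions made for this example.

```python
def first_char(buf: bytes):
    """Decode the first UTF-8 character of buf, or return None."""
    if not buf:
        return None
    # A UTF-8 character is 1 to 4 bytes long; try each prefix length.
    for n in range(1, 5):
        try:
            return buf[:n].decode("utf-8")
        except UnicodeDecodeError:
            continue
    return None


def may_be_text_format(buf: bytes) -> bool:
    """Cheap pre-check before running a full XML or TOML parser.

    Returns False when the first character is invalid UTF-8 or a
    control character that neither format allows, for example NUL.
    """
    ch = first_char(buf)
    if ch is None:
        return False
    return ch in "\t\n\r" or ord(ch) >= 0x20


if __name__ == "__main__":
    print(may_be_text_format(b"\x00" * 1024))   # zero-filled input
    print(may_be_text_format(b"<?xml version='1.0'?>"))
```

A pre-check like this only looks at a few bytes, so a multi-gigabyte file of zeroes is rejected immediately instead of being fed to the parser.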

There will always be a zip(32) EOCD but optionally a zip64 EOCD. Related to #586

wader · 2023-02-22T18:45:52Z

Both issues should be resolved now. Give it a try if you can.

I noticed that there were some more (non-broken) files from the libzip regression tests that fq doesn't like, maybe I will have a look at those also.

0-wiz-0 · 2023-02-22T19:17:01Z

Thanks, the zip64 eocd parsing looks good now.
I've tried the bigzero-zip.zip too but it doesn't look too good?

# ./fq -i . /tmp/bigzero-zip.zip                                                                                                                                                                                                                                 
error: /tmp/bigzero-zip.zip: probe: failed to decode: try fq -d FORMAT to force format, see fq -h formats for list
empty> ^D
# ./fq -d zip -i . /tmp/bigzero-zip.zip 
zip!> .
      |00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f|0123456789abcdef0123456789abcdef|.{}: /tmp/bigzero-zip.zip (zip)
      |                                                                                               |                                |  error: zip: RawLen(signature): failed at position 10489.7 (read size 0 seek pos 0): outside buffer
0x0000|50 4b 03 04 14 00 02 00 08 00 c8 78 84 45 54 81 15 ae 4e 28 00 00 b9 9a 3f 00 0b 00 1c 00 62 69|PK.........x.ET...N(....?.....bi|  gap0: raw bits
*     |until 0x28e3.7 (10468)                                                                         |                                |
0x28e0|            50 4b 05 06 00 00 00 00 01 00 01 00 51 00 00 00 93 28 00 00 00 00|                 |    PK..........Q....(....|     |  end_of_central_directory_record{}:
      |                                                                                               |                                |  end_of_central_directory_locator{}:

wader · 2023-02-22T21:24:13Z

Oh yes i messed up, this should fix it #596

0-wiz-0 · 2023-02-22T22:10:00Z

All good now, thank you!

wader · 2023-02-22T22:35:11Z

🥳 thanks for reporting and nice bug report

wader · 2023-02-22T22:40:19Z

You added fq to NetBSD ports? Thanks for that, were there no issues? I have no idea how the golang support is on *BSDs.

Also I can add you to the list of ppl i notify when doing a new release if you want?

0-wiz-0 · 2023-02-22T23:53:28Z

Yes, I did; and pkgsrc is portable and also used e.g. on Illumos, macOS, Linux and other operating systems :)
If you have pkgsrc set up, just use pkgin install fq to get the program installed.

Golang support in pkgsrc and NetBSD in particular is quite good.

Sure, let me know about new releases. Thank you

wader · 2023-02-23T09:59:04Z

> Yes, I did; and pkgsrc is portable and also used e.g. on Illumos, macOS, Linux and other operating systems :) If you have pkgsrc set up, just use pkgin install fq to get the program installed.

Aha didn't know 👍 could possibly add it to the fq README

> Golang support in pkgsrc and NetBSD in particular is quite good.

Good to know. The only thing I'm a bit worried about fq-wise is the readline module which has some os-specific code, but the REPL seems to work fine?

> Sure, let me know about new releases. Thank you

Will do

0-wiz-0 · 2023-02-23T10:33:57Z

Yes, the REPL works fine.

(NetBSD is the upstream for editline (a BSD licensed readline).)

wader · 2023-02-23T10:45:23Z

Good, fq uses a fork of https://github.com/chzyer/readline which is similar to libreadline etc but implemented in go so is very convenient to use and build. Trying to stay away from cgo to not have to deal with c build issues... have enough of those in my life anyway :)

wader mentioned this issue Feb 22, 2023

toml,xml: Fail fast on invalid content #594

Merged

wader added a commit that referenced this issue Feb 22, 2023

zip: Correctly look for and decode both zip32/64 EOCD record

5228fdd

There will always be a zip(32) EOCD but optionally a zip64 EOCD. Related to #586

wader mentioned this issue Feb 22, 2023

zip: Correctly look for and decode both zip32/64 EOCD record #595

Merged

0-wiz-0 closed this as completed Feb 22, 2023
