Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

problems with zip support #586

Closed
0-wiz-0 opened this issue Feb 16, 2023 · 21 comments
Closed

problems with zip support #586

0-wiz-0 opened this issue Feb 16, 2023 · 21 comments

Comments

@0-wiz-0
Copy link

0-wiz-0 commented Feb 16, 2023

I tried fq on zip archives, and I had two problems so far:

# fq . zip64.zip
    |00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18 19 1a 1b|0123456789abcdef0123456789ab|.{}: zip64.zip (zip)
0x00|50 4b 03 04 2d 00 00 00 00 00 4f 72 5b 40 07 a1 ea dd ff ff ff ff ff ff ff ff 01 00|PK..-.....Or[@..............|  local_files[0:1]:
0x1c|14 00 2d 01 00 10 00 02 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00 61 0a         |..-....................a.   |
0x1c|                                                                           50 4b 01|                         PK.|  central_directories[0:1]:
0x38|02 1e 03 2d 00 00 00 00 00 4f 72 5b 40 07 a1 ea dd 02 00 00 00 02 00 00 00 01 00 00|...-.....Or[@...............|
0x54|00 00 00 00 00 01 00 00 00 80 11 00 00 00 00 2d                                    |...............-            |
0x54|                                                50 4b 06 06 2c 00 00 00 00 00 00 00|                PK..,.......|  gap0: raw bits
0x70|1e 03 2d 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00|..-.........................|
*   |until 0xaf.7 (76)                                                                  |                            |
0xa8|                        50 4b 05 06 00 00 00 00 01 00 01 00 2f 00 00 00 35 00 00 00|        PK........../...5...|  end_of_central_directory_record{}:
0xc4|00 00|                                                                             |..|                         |
  • I wanted to try the interactive mode to look inside a zip inside a zip, and fq just hung completely, eating ~33GB RAM and 100% CPU:
fq -i . bigzero-zip.zip

using https://github.com/nih-at/libzip/blob/main/regress/bigzero-zip.zip

@wader
Copy link
Owner

wader commented Feb 20, 2023

Hey, sorry i somehow missed the notification email about this. Will have a look.

@wader
Copy link
Owner

wader commented Feb 20, 2023

Had quick look at zip64.zip. gap0 seems to be a "Zip64 end of central directory record" structure (signature 0x06064b50, confusengly is reverse, zip uses little endian). If i've understood the spec correctly the way to detect zip64 is that the "zip32" version of "end of central directory record" has a "offset of start of central directory ...." field that is 0xffffffff, see:

 4.4.24 offset of start of central directory with respect to
        the starting disk number:  (4 bytes)

  Offset of the start of the central directory on the
  disk on which the central directory starts. If an 
  archive is in ZIP64 format and the value in this 
  field is 0xFFFFFFFF, the size will be in the 
  corresponding 8 byte zip64 end of central 
  directory field.

In this file it seems to be 0x35 🤔 do you know if there is some other way to detect it? or could this be some left over data etc because of how the zip writer is implemented?

$ fq -o line_bytes=8 '(.end_of_central_directory_record | d), (.gap0 | dd)' zip64.zip
    │00 01 02 03 04 05 06 07│01234567│.end_of_central_directory_record{}:
0xb0│50 4b 05 06            │PK..    │  signature: raw bits (valid)
0xb0│            00 00      │    ..  │  disk_nr: 0
0xb0│                  00 00│      ..│  central_directory_start_disk_nr: 0
0xb8│01 00                  │..      │  nr_of_central_directory_records_on_disk: 1
0xb8│      01 00            │  ..    │  nr_of_central_directory_records: 1
0xb8│            2f 00 00 00│    /...│  size_of_central_directory: 47
0xc0│35 00 00 00            │5...    │  offset_of_start_of_central_directory: 53
0xc0│            00 00│     │    ..│ │  comment_length: 0
    │                       │        │  comment: ""
    │00 01 02 03 04 05 06 07│01234567│
0x60│            50 4b 06 06│    PK..│.gap0: raw bits
0x68│2c 00 00 00 00 00 00 00│,.......│
0x70│1e 03 2d 00 00 00 00 00│..-.....│
0x78│00 00 00 00 01 00 00 00│........│
0x80│00 00 00 00 01 00 00 00│........│
0x88│00 00 00 00 2f 00 00 00│..../...│
0x90│00 00 00 00 35 00 00 00│....5...│
0x98│00 00 00 00 50 4b 06 07│....PK..│
0xa0│00 00 00 00 64 00 00 00│....d...│
0xa8│00 00 00 00 01 00 00 00│........│

@0-wiz-0
Copy link
Author

0-wiz-0 commented Feb 20, 2023

You can identify a zip64 by the existence of the zip64 EOCD, which you can recognize by the 'PK\06\06'.
If any of the files is too large (even just when uncompressed), then the whole file needs to be zip64 even if none of the fields in the EOCD marker are too big. AFAIK there is no requirement to set the values in the EOCD to 0xffffffff if they fit.
Does this help?

@0-wiz-0
Copy link
Author

0-wiz-0 commented Feb 20, 2023

Sorry, that was slightly wrong: You look for the Zip64 end of central directory locator (0x07064b50) directly before the standard ZIP EOCD that points you to the Zip64 EOCD and then you can parse from there.

@wader
Copy link
Owner

wader commented Feb 20, 2023

Yes helps, and that is actually you how the decoder works now, it heuristically seeks from the end backwards after the "zip32" EOCD signature. So should it look for both and decode but prefer the zip64 one? otherwise there might be a zip32 gap i guess?

@0-wiz-0
Copy link
Author

0-wiz-0 commented Feb 20, 2023

No, in a zip64 there are both. From the appnote:

      [central directory header n]
      [zip64 end of central directory record]
      [zip64 end of central directory locator] 
      [end of central directory record]

So at the end you always have a zip(32) EOCD. If you go backwards from that and find a zip64 EOCD locator, it's a zip64 and you need to parse the zip64 EOCD (which it tells you how to find). If you find a central directory header instead, you know it's not a zip64.

@wader
Copy link
Owner

wader commented Feb 20, 2023

Aha i see, thanks. Will look into a fix soon, let me know if you want to have a go at it. The code looks a messy now, so could probably be cleanup up and refactored a bit also.

@wader
Copy link
Owner

wader commented Feb 20, 2023

Had a look at bigzero-zip.zip now also. Looks like the toml decoder is not very happy about the 4GB zeroes files. If i build without it in the probe group it succeeds decoding after some time (but is quite slow).

I did fq -d bytes '_registry.groups.probe[] as $f | $f, try decode($f) catch .' bigzero to figure which probe format that got stuck.

Will have a look at the toml code (uses github.com/BurntSushi/toml) and try figure out some way to fail toml decoding faster.

@0-wiz-0
Copy link
Author

0-wiz-0 commented Feb 20, 2023

No hurry. Since I'm already maintaining a zip library, I'll leave this one up to you ;)

Interesting analysis about the toml decoder. I had thought files would be decoded more on-demand, not all in advance. Good to know!

@wader
Copy link
Owner

wader commented Feb 20, 2023

I was a bit fast to judge, toml is slow but finishes, it seems to actually be the xml decoder that eat a lot of memory for some reason hmm.

Currently i'm not doing any on-demand decoding, have thought about it and would be interesting to look into, but fq's decode and jq code is quite complex as it is :) so will see. Focus have mostly been on making things possible over speed and efficiency. But some format do have options to disable sub-decoding, the zip format should probably have decode_uncompressed option etc.

wader added a commit that referenced this issue Feb 22, 2023
encoding/xml and github.com/BurntSushi/toml both reads a lot before detecting
that it can't decode. Now we instead read one UTF-8 and make sure it's valid
xml or toml.

Should speed up probing

Relatd to #586 bigzero-zip.zip
wader added a commit that referenced this issue Feb 22, 2023
encoding/xml and github.com/BurntSushi/toml both reads a lot before detecting
that it can't decode. Now we instead read one UTF-8 and make sure it's valid
xml or toml.

Should speed up probing

Related to #586 bigzero-zip.zip
wader added a commit that referenced this issue Feb 22, 2023
encoding/xml and github.com/BurntSushi/toml both reads a lot before detecting
that it can't decode. Now we instead read one UTF-8 and make sure it's valid
xml or toml.

Should speed up probing

Related to #586 bigzero-zip.zip
@wader
Copy link
Owner

wader commented Feb 22, 2023

#594 makes decoding bigzero-zip.zip quite a lot faster but will still use some cpu and memory as is uncompresses to memory. This will also speed up probing in many other cases. If your curious, fq does not have any special probe code instead it's up to decoder that are in the "probe" group to fail fast.

wader added a commit that referenced this issue Feb 22, 2023
There will always be zip(32) EOCD but optinally a zip64 EOCD

Related to #586
@wader
Copy link
Owner

wader commented Feb 22, 2023

Both issues should be resolved now. Give it a try if you can.

I noticed that there were some more (non-broken) files from the libzip regression tests that fq dont like, maybe i will have a look at those also.

@0-wiz-0
Copy link
Author

0-wiz-0 commented Feb 22, 2023

Thanks, the zip64 eocd parsing looks good now.
I've tried the bigzero-zip.zip too but it doesn't look too good?

# ./fq -i . /tmp/bigzero-zip.zip                                                                                                                                                                                                                                 
error: /tmp/bigzero-zip.zip: probe: failed to decode: try fq -d FORMAT to force format, see fq -h formats for list
empty> ^D
# ./fq -d zip -i . /tmp/bigzero-zip.zip 
zip!> .
      |00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f|0123456789abcdef0123456789abcdef|.{}: /tmp/bigzero-zip.zip (zip)
      |                                                                                               |                                |  error: zip: RawLen(signature): failed at position 10489.7 (read size 0 seek pos 0): outside buffer
0x0000|50 4b 03 04 14 00 02 00 08 00 c8 78 84 45 54 81 15 ae 4e 28 00 00 b9 9a 3f 00 0b 00 1c 00 62 69|PK.........x.ET...N(....?.....bi|  gap0: raw bits
*     |until 0x28e3.7 (10468)                                                                         |                                |
0x28e0|            50 4b 05 06 00 00 00 00 01 00 01 00 51 00 00 00 93 28 00 00 00 00|                 |    PK..........Q....(....|     |  end_of_central_directory_record{}:
      |                                                                                               |                                |  end_of_central_directory_locator{}:

@wader
Copy link
Owner

wader commented Feb 22, 2023

Oh yes i messed up, this should fix it #596

@0-wiz-0
Copy link
Author

0-wiz-0 commented Feb 22, 2023

All good now, thank you!

@0-wiz-0 0-wiz-0 closed this as completed Feb 22, 2023
@wader
Copy link
Owner

wader commented Feb 22, 2023

🥳 thanks for reporting and nice bug report

@wader
Copy link
Owner

wader commented Feb 22, 2023

You added fq to NetBSD ports? thanks for that, were no issues? have no idea how the golang support is on *BSDs.

Also I can add you to the list of ppl i notify when doing a new release if you want?

@0-wiz-0
Copy link
Author

0-wiz-0 commented Feb 22, 2023

Yes, I did; and pkgsrc is portable and also used e.g. on Illumos, macOS, Linux and other operating systems :)
If you have pkgsrc set up, just use pkgin install fq to get the program installed.

Golang support in pkgsrc and NetBSD in special is quite good.

Sure, let me know about new releases. Thank you

@wader
Copy link
Owner

wader commented Feb 23, 2023

Yes, I did; and pkgsrc is portable and also used e.g. on Illumos, macOS, Linux and other operating systems :) If you have pkgsrc set up, just use pkgin install fq to get the program installed.

Aha didn't know 👍 could possibly add it to the fq README

Golang support in pkgsrc and NetBSD in special is quite good.

Good to know. The only thing i'm a bit worried about fq-wise is the readline module which has some os-specific code, but the REPL seems to work fine?

Sure, let me know about new releases. Thank you

Will do

@0-wiz-0
Copy link
Author

0-wiz-0 commented Feb 23, 2023

Yes, the REPL works fine.

(NetBSD is the upstream for editline (a BSD licensed readline).)

@wader
Copy link
Owner

wader commented Feb 23, 2023

Good, fq uses a fork of https://github.com/chzyer/readline which is similar to libreadline etc but implemented in go so is very convenient to use and build. Trying to stay away from cgo to not have to deal with c build issues... have enough of those in my life anyway :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants