
poor documentation #15

Open
freakeinstein opened this issue May 2, 2019 · 13 comments

Comments

@freakeinstein

Hi, I see a poorly documented README without any link to documentation or related papers. I'm totally out of context. Can you guide me? Fixing the README would help newbies like me who explore GitHub and find this repo.

@jdieter
Member

jdieter commented May 11, 2019

Apologies for the delay in responding. I've been on holidays and then started a new job.

I've added links in the README to the file format and some blog posts that give some background in commit f1a3824. Is this what you're looking for? If not, would you be willing to do a PR that does contain what you're looking for?

@bmarwell

I think the graphics belong in the repository, in a README-FORMAT.md or just FORMAT.md, or in the wiki, with some examples. That would be really helpful.

@bmarwell

bmarwell commented May 16, 2019

So here are some more things that should be added:

  • exhaustive compressed integer documentation (see the decoder sketch after this list)

    • especially that this is not LEB128, but rather the opposite
    • why it exists and why LEB128 wasn't used
  • more info about data streams. The current format specs throw these terms at the user without explaining them.

  • the dict stream is not mandatory, but the dict length etc. is. That is confusing.

  • What is even more confusing: The dict is just the first chunk. That's not explicitly mentioned.

    • it is implicitly mentioned by the chunk count (which says: includes the dict).
  • More info about the download algorithm.

    • It seems that for zckdl, the minimum download size is calculated as: 5 + MAX_COMP_SIZE*2 + get_max_hash_size(). But MAX_COMP_SIZE is defined as #define MAX_COMP_SIZE (((sizeof(size_t) * 8) / 7) + 1). I wonder if this could lead to problems, as size_t has different sizes on different architectures. Would a 32-bit system be able to run any download?
    • What happens to old clients if you start using a bigger hash algorithm in newer versions? The header download would be too small, although smart implementations might exit before they reach the hash.
  • Fast downloading heavily relies on HTTP keep-alive (persistent connections), or better, HTTP/2.

    • Should go into the description.
    • HTTP/1.1 supports pipelining.
    • HTTP/2 supports out-of-order pipelining (multiplexing).
    • Maybe, once a specific number of chunks remains, it would be useful to open more connections. Java defaults to 5 connections per destination.
  • merge algorithm. How does the default client merge the downloaded chunks?

    • Are they stored in a temp file and then concatenated?
    • Other strategies?

I know, lots of questions. But this will help with implementing zchunk in other languages.
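
For anyone trying to implement the compressed integers in another language, here is a minimal C decoder sketch. It is not taken from the zchunk sources; it only follows the Kaitai grammar posted later in this thread (with mrdis's little-endian correction): each byte carries 7 value bits, the least-significant group comes first, and the high bit is set on the last byte, i.e. the opposite continuation convention to LEB128.

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch of a compressed-integer decoder (assumed format as described
     * above: 7 value bits per byte, least-significant group first, high bit
     * SET on the final byte -- the opposite of LEB128, where a set high bit
     * means "more bytes follow").
     *
     * This also makes MAX_COMP_SIZE = ((sizeof(size_t) * 8) / 7) + 1 easier
     * to read: it is the largest number of such bytes a size_t can need,
     * i.e. (64 / 7) + 1 = 10 on a 64-bit system and (32 / 7) + 1 = 5 on a
     * 32-bit one. */
    static int decode_comp_int(const uint8_t *buf, size_t buf_len,
                               uint64_t *out, size_t *consumed)
    {
        uint64_t value = 0;

        for (size_t i = 0; i < buf_len && i < 10; i++) {
            value |= (uint64_t)(buf[i] & 0x7f) << (7 * i);
            if (buf[i] & 0x80) {        /* high bit set: this is the last byte */
                *out = value;
                *consumed = i + 1;
                return 0;
            }
        }
        return -1;                      /* truncated input (or more than 10 bytes) */
    }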

@F-i-f
Contributor

F-i-f commented May 19, 2019

Also, how to use zck_gen_zdict would be handy.
Using zck_gen_zdict in combination with createrepo_c would also help.

@jdieter
Member

jdieter commented May 19, 2019

Also, how to use zck_gen_zdict would be handy.

Good point.

Using zck_gen_zdict in combination with createrepo_c would also help.

This is actually already supposed to be working (there are zdicts designed for Fedora repodata available in the package fedora-repo-zdicts), but it seems that the updates* metadata isn't currently using the zdicts. I've submitted a patch to Fedora infra that will hopefully fix this.

@F-i-f
Contributor

F-i-f commented May 19, 2019

This is actually already supposed to be working (there are zdicts designed for Fedora repodata available in the package fedora-repo-zdicts), but it seems that the updates* metadata isn't currently using the zdicts. I've submitted a patch to Fedora infra that will hopefully fix this.

I'd be more interested in how to generate zdicts for repos other than fedora and fedora-update (e.g. rpmfusion, private repos). Or are the fedora-repo-zdicts generic enough to be used on any repo?

@jdieter
Member

jdieter commented May 20, 2019

That's a good question. They'd definitely be better than nothing, but, especially for a private repo, you'd probably get much better compression using a zdict generated specifically for the repo.

@F-i-f
Contributor

F-i-f commented May 20, 2019

Thanks for the information.
Anyway, if I'm reading you correctly, adding a zdict trades (de)compression speed (lots faster without one) for file size (a minor increase). I'd rather save on bandwidth than CPU, as my target hardware has plenty of spare cycles, so I should prefer zdict-less zchunk.

@jdieter
Member

jdieter commented May 20, 2019

No, zdicts give you significantly better compression at the cost of a slight speed decrease, mainly during the compression process. Zchunk splits a file into completely independent compressed chunks, which means that identical data in one chunk can't be referenced by another.

Zdicts help us get around this problem by providing a compression dictionary of strings common to more than one chunk. This dictionary is stored in the first chunk, and, while it takes up space (the default is around 100KB), it generally makes a huge difference in compression size.

The trick is that, for a file to take advantage of zchunk's benefits, the zdict has to stay the same from one version of the file to the next. If you change the zdict, identical chunks will no longer match because their compressed data is different.
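
A hedged illustration of what that means in practice, using plain libzstd rather than zchunk's own API (the dictionary and chunk strings below are made up): chunks compressed independently against the same dictionary come out byte-identical whenever the uncompressed data is identical, which is what lets unchanged chunks be skipped between versions; swap the dictionary and the compressed bytes change even when the input didn't.

    #include <stdio.h>
    #include <string.h>
    #include <zstd.h>                     /* link with -lzstd */

    /* Illustration only: zchunk wraps this kind of logic internally. Each
     * chunk is compressed on its own, so a shared dictionary is the only way
     * strings common to several chunks can be referenced. */
    static size_t compress_chunk(ZSTD_CCtx *cctx,
                                 const void *chunk, size_t chunk_len,
                                 const void *dict, size_t dict_len,
                                 void *dst, size_t dst_cap)
    {
        size_t n = ZSTD_compress_usingDict(cctx, dst, dst_cap,
                                           chunk, chunk_len,
                                           dict, dict_len, 19);
        return ZSTD_isError(n) ? 0 : n;
    }

    int main(void)
    {
        /* Hypothetical dictionary and chunk content, purely for the demo. */
        const char dict[]  = "<package type=\"rpm\" arch=\"x86_64\">";
        const char chunk[] = "<package type=\"rpm\" arch=\"x86_64\">zchunk</package>";

        ZSTD_CCtx *cctx = ZSTD_createCCtx();
        char a[512], b[512];

        size_t na = compress_chunk(cctx, chunk, sizeof(chunk),
                                   dict, sizeof(dict), a, sizeof(a));
        size_t nb = compress_chunk(cctx, chunk, sizeof(chunk),
                                   dict, sizeof(dict), b, sizeof(b));

        /* Same input + same dictionary => byte-identical compressed chunk. */
        printf("identical: %s\n",
               (na && na == nb && memcmp(a, b, na) == 0) ? "yes" : "no");

        ZSTD_freeCCtx(cctx);
        return 0;
    }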

@tomberek

In case this helps anyone: I built a rough parser in Kaitai (https://ide.kaitai.io/devel/#) for zchunk. I've found it helps to visualize the format and what is happening, and Kaitai can generate parsers in other languages.

meta:
  id: zchunk
  file-extension: zck
  
seq:
  - id: lead
    type: lead
  - id: preface
    type: preface
  - id: index
    type: index
  - id: signatures
    type: signatures
instances:
  datachunk:
    type: datum(_index)
    repeat: expr
    repeat-expr: index.count.value
types:
  datum:
    seq:
      - id: bytes
        #type: zstd
        size: _root.index.chunks[i].length.value
    params:
      - id: i
        type: u8
  signatures:
    seq:
      - id: count
        type: ci
      - id: signature
        type: signature
        repeat: expr
        repeat-expr: count.value
  signature:
    seq:
      - id: type
        type: ci
      - id: size
        type: ci
      - id: signature
        size: size.value
  index:
    seq:
      - id: size
        type: ci
      - id: checksum_type
        type: ci
      - id: count
        type: ci
      - id: chunks
        type: chunk(_index)
        repeat: expr
        repeat-expr: count.value
  chunk:
    params:
      - id: i
        type: s4
    seq:
      - id: stream
        type: ci
        if: _root.preface.flags.value & 0b1 == 1
      - id: checksum
        size: 16
      - id: length
        type: ci
      - id: uncompressed_length
        type: ci

      
  preface:
    seq:
      - id: data_checksum
        size: 32
      - id: flags
        type: ci
      - id: compression_type
        type: ci
      - id: optional
        type: optional
        if: flags.value == 1
  optional:
    seq:
      - id: count
        type: ci
  lead:
    seq:
      - id: id
        contents: "\0ZCK1"
      - id: checksum_type
        type: ci
      - id: header_size
        type: ci
      - id: header_checksum
        size: 32
  ci:
    seq:
    - id: groups
      type: group
      repeat: until
      repeat-until: not _.has_next
    types:
      group:
        -webide-representation: '{value}'
        doc: |
          One byte group, clearly divided into 7-bit "value" chunk and 1-bit "continuation" flag.
        seq:
          - id: b
            type: u1
        instances:
          has_next:
            value: (b & 0b1000_0000) == 0
            doc: If true, then we have more bytes to read
          value:
            value: b & 0b0111_1111
            doc: The 7-bit (base128) numeric value chunk of this group
    instances:
      last:
        value: groups.size - 1
      value:
        value: >-
          groups[last].value
          + (last >= 1 ? (groups[last - 1].value << 7) : 0)
          + (last >= 2 ? (groups[last - 2].value << 14) : 0)
          + (last >= 3 ? (groups[last - 3].value << 21) : 0)
          + (last >= 4 ? (groups[last - 4].value << 28) : 0)
          + (last >= 5 ? (groups[last - 5].value << 35) : 0)
          + (last >= 6 ? (groups[last - 6].value << 42) : 0)
          + (last >= 7 ? (groups[last - 7].value << 49) : 0)
        doc: Resulting value as normal integer

@mrdis

mrdis commented Jul 20, 2023

@tomberek, I think there is an issue with the endianness of the ci values; in my tests the ci value must be calculated as follows:

          groups[0].value
          + (last >= 1 ? (groups[1].value << 7) : 0)
          + (last >= 2 ? (groups[2].value << 14) : 0)
          + (last >= 3 ? (groups[3].value << 21) : 0)
          + (last >= 4 ? (groups[4].value << 28) : 0)
          + (last >= 5 ? (groups[5].value << 35) : 0)
          + (last >= 6 ? (groups[6].value << 42) : 0)
          + (last >= 7 ? (groups[7].value << 49) : 0)

@dralley

dralley commented Oct 15, 2023

Does Zchunk have a "magic number" identifying the file type?

@jdieter
Member

jdieter commented Oct 15, 2023

Does Zchunk have a "magic number" identifying the file type?

Hey Daniel, yes it does. As per https://github.com/zchunk/zchunk/blob/main/zchunk_format.txt the first bytes of the file are:
'\0ZCK1', identifies file as zchunk version 1 file
OR
'\0ZHR1', identifies file as zchunk detached header version 1 file

To clarify, almost everything you would see in Fedora would be the first. Detached headers were added for someone wanting to use zchunk to download immutable full disk images for the automotive industry.
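
For anyone sniffing files programmatically, a minimal check along those lines (assuming nothing beyond the two magic strings quoted above; both are 5 bytes including the leading NUL, so memcmp rather than strcmp is needed):

    #include <stdio.h>
    #include <string.h>

    /* Returns a description of the file based on the 5-byte zchunk magic
     * numbers quoted above from zchunk_format.txt. */
    static const char *zck_file_kind(FILE *f)
    {
        unsigned char lead[5];

        if (fread(lead, 1, sizeof(lead), f) != sizeof(lead))
            return "too short to be a zchunk file";
        if (memcmp(lead, "\0ZCK1", 5) == 0)
            return "zchunk version 1 file";
        if (memcmp(lead, "\0ZHR1", 5) == 0)
            return "zchunk detached header version 1 file";
        return "not a zchunk file";
    }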
