
poor documentation #15

Open
freakeinstein opened this issue May 2, 2019 · 13 comments

Comments

@freakeinstein

Hi, I see a poorly documented README without any link to documentation or related papers. I'm totally out of context. Can you guide me? Fixing the README would help newbies like me who explore GitHub and find this repo.

@jdieter
Member

jdieter commented May 11, 2019

Apologies for the delay in responding. I've been on holidays and then started a new job.

I've added links in the README to the file format and some blog posts that give some background in commit f1a3824. Is this what you're looking for? If not, would you be willing to do a PR that does contain what you're looking for?

@bmarwell

I think the graphics belong in the repository, in a README-FORMAT.md or just FORMAT.md, or in the wiki, with some examples. That would be really helpful.

@bmarwell

bmarwell commented May 16, 2019

So here are some more things that should be added:

  • exhaustive compressed integer documentation (see the decoder sketch after this list)

    • especially that this is not LEB128, but rather the opposite
    • why it exists and why LEB128 wasn't used
  • more info about data streams. The current format specs throw these terms at the user without explaining them.

  • the dict stream is not mandatory, but the dict length etc. is. That is confusing.

  • What is even more confusing: The dict is just the first chunk. That's not explicitly mentioned.

    • it is implicitly mentioned by the chunk count (which says: includes the dict).
  • More info about the download algorithm.

    • It seems that for zckdl, the minimum download size is calculated as: 5 + MAX_COMP_SIZE*2 + get_max_hash_size(). But MAX_COMP_SIZE is defined as #define MAX_COMP_SIZE (((sizeof(size_t) * 8) / 7) + 1). I wonder if this could lead to problems, as size_t has different sizes on different architectures. Would a 32-bit system be able to run any download?
    • What happens to old clients if you start using a bigger hash algorithm in newer versions? The header download would be too small, although smart implementations might exit before they reach the hash.
  • Fast downloading heavily relies on HTTP keep-alive (persistent connections), or better, HTTP/2.

    • Should go into the description.
    • HTTP/1.1 supports pipelining.
    • HTTP/2 supports out-of-order pipelining (multiplexing).
    • Maybe, once a specific number of chunks remains, it would be useful to open more connections. Java defaults to 5 connections per destination.
  • merge algorithm. How does the default client merge the downloaded chunks?

    • Are they stored in a temp file and then concatenated?
    • Other strategies?

I know, lots of questions. But this will help with implementing zchunk in other languages.
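
For anyone trying to implement the compressed integers in another language, here is a minimal C decoder sketch. It is not taken from the zchunk sources; it only follows the Kaitai grammar posted later in this thread (with mrdis's little-endian correction): each byte carries 7 value bits, the least-significant group comes first, and the high bit is set on the last byte, i.e. the opposite continuation convention to LEB128.

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch of a compressed-integer decoder (assumed format as described
     * above: 7 value bits per byte, least-significant group first, high bit
     * SET on the final byte -- the opposite of LEB128, where a set high bit
     * means "more bytes follow").
     *
     * This also makes MAX_COMP_SIZE = ((sizeof(size_t) * 8) / 7) + 1 easier
     * to read: it is the largest number of such bytes a size_t can need,
     * i.e. (64 / 7) + 1 = 10 on a 64-bit system and (32 / 7) + 1 = 5 on a
     * 32-bit one. */
    static int decode_comp_int(const uint8_t *buf, size_t buf_len,
                               uint64_t *out, size_t *consumed)
    {
        uint64_t value = 0;

        for (size_t i = 0; i < buf_len && i < 10; i++) {
            value |= (uint64_t)(buf[i] & 0x7f) << (7 * i);
            if (buf[i] & 0x80) {        /* high bit set: this is the last byte */
                *out = value;
                *consumed = i + 1;
                return 0;
            }
        }
        return -1;                      /* truncated input (or more than 10 bytes) */
    }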

@F-i-f
Contributor

F-i-f commented May 19, 2019

Also, how to use zck_gen_zdict would be handy.
Using zck_gen_zdict in combination with createrepo_c would also help.

@jdieter
Member

jdieter commented May 19, 2019

Also, how to use zck_gen_zdict would be handy.

Good point.

Using zck_gen_zdict in combination with createrepo_c would also help.

This is actually already supposed to be working (there are zdicts designed for Fedora repodata available in the package fedora-repo-zdicts), but it seems that the updates* metadata isn't currently using the zdicts. I've submitted a patch to Fedora infra that will hopefully fix this.

@F-i-f
Contributor

F-i-f commented May 19, 2019

This is actually already supposed to be working (there are zdicts designed for Fedora repodata available in the package fedora-repo-zdicts), but it seems that the updates* metadata isn't currently using the zdicts. I've submitted a patch to Fedora infra that will hopefully fix this.

I'd be more interested in how to generate zdicts for repos other than fedora and fedora-update (e.g. rpmfusion, private repos). Or are the fedora-repo-zdicts generic enough to be used on any repo?

@jdieter
Member

jdieter commented May 20, 2019

That's a good question. They'd definitely be better than nothing, but, especially for a private repo, you'd probably get much better compression using a zdict generated specifically for the repo.

@F-i-f
Contributor

F-i-f commented May 20, 2019

Thanks for the information.
Anyway, if I'm reading you correctly, adding a zdict trades (de)compression speed (lots faster without one) for file size (a minor increase). I'd rather save on bandwidth than CPU, as my target hardware has plenty of spare cycles, so I should prefer zdict-less zchunk.

@jdieter
Member

jdieter commented May 20, 2019

No, zdicts give you significantly better compression at the cost of a slight speed decrease, mainly during the compression process. Zchunk splits a file into completely independent compressed chunks, which means that identical data in one chunk can't be referenced by another.

Zdicts help us get around this problem by providing a compression dictionary of strings common to more than one chunk. This dictionary is stored in the first chunk, and, while it takes up space (the default is around 100KB), it generally makes a huge difference in compression size.

The trick is that, for a file to take advantage of zchunk's benefits, the zdict has to stay the same from one version of the file to the next. If you change the zdict, identical chunks will no longer match because their compressed data is different.
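
A hedged illustration of what that means in practice, using plain libzstd rather than zchunk's own API (the dictionary and chunk strings below are made up): chunks compressed independently against the same dictionary come out byte-identical whenever the uncompressed data is identical, which is what lets unchanged chunks be skipped between versions; swap the dictionary and the compressed bytes change even when the input didn't.

    #include <stdio.h>
    #include <string.h>
    #include <zstd.h>                     /* link with -lzstd */

    /* Illustration only: zchunk wraps this kind of logic internally. Each
     * chunk is compressed on its own, so a shared dictionary is the only way
     * strings common to several chunks can be referenced. */
    static size_t compress_chunk(ZSTD_CCtx *cctx,
                                 const void *chunk, size_t chunk_len,
                                 const void *dict, size_t dict_len,
                                 void *dst, size_t dst_cap)
    {
        size_t n = ZSTD_compress_usingDict(cctx, dst, dst_cap,
                                           chunk, chunk_len,
                                           dict, dict_len, 19);
        return ZSTD_isError(n) ? 0 : n;
    }

    int main(void)
    {
        /* Hypothetical dictionary and chunk content, purely for the demo. */
        const char dict[]  = "<package type=\"rpm\" arch=\"x86_64\">";
        const char chunk[] = "<package type=\"rpm\" arch=\"x86_64\">zchunk</package>";

        ZSTD_CCtx *cctx = ZSTD_createCCtx();
        char a[512], b[512];

        size_t na = compress_chunk(cctx, chunk, sizeof(chunk),
                                   dict, sizeof(dict), a, sizeof(a));
        size_t nb = compress_chunk(cctx, chunk, sizeof(chunk),
                                   dict, sizeof(dict), b, sizeof(b));

        /* Same input + same dictionary => byte-identical compressed chunk. */
        printf("identical: %s\n",
               (na && na == nb && memcmp(a, b, na) == 0) ? "yes" : "no");

        ZSTD_freeCCtx(cctx);
        return 0;
    }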

@tomberek

In case this helps anyone: I built a rough parser in Kaitai (https://ide.kaitai.io/devel/#) for zchunk. I've found it helps to visualize the format and what is happening, and Kaitai can generate parsers in other languages.

meta:
  id: zchunk
  file-extension: zck
  
seq:
  - id: lead
    type: lead
  - id: preface
    type: preface
  - id: index
    type: index
  - id: signatures
    type: signatures
instances:
  datachunk:
    type: datum(_index)
    repeat: expr
    repeat-expr: index.count.value
types:
  datum:
    seq:
      - id: bytes
        #type: zstd
        size: _root.index.chunks[i].length.value
    params:
      - id: i
        type: u8
  signatures:
    seq:
      - id: count
        type: ci
      - id: signature
        type: signature
        repeat: expr
        repeat-expr: count.value
  signature:
    seq:
      - id: type
        type: ci
      - id: size
        type: ci
      - id: signature
        size: size.value
  index:
    seq:
      - id: size
        type: ci
      - id: checksum_type
        type: ci
      - id: count
        type: ci
      - id: chunks
        type: chunk(_index)
        repeat: expr
        repeat-expr: count.value
  chunk:
    params:
      - id: i
        type: s4
    seq:
      - id: stream
        type: ci
        if: _root.preface.flags.value & 0b1 == 1
      - id: checksum
        size: 16
      - id: length
        type: ci
      - id: uncompressed_length
        type: ci

      
  preface:
    seq:
      - id: data_checksum
        size: 32
      - id: flags
        type: ci
      - id: compression_type
        type: ci
      - id: optional
        type: optional
        if: flags.value == 1
  optional:
    seq:
      - id: count
        type: ci
  lead:
    seq:
      - id: id
        contents: "\0ZCK1"
      - id: checksum_type
        type: ci
      - id: header_size
        type: ci
      - id: header_checksum
        size: 32
  ci:
    seq:
    - id: groups
      type: group
      repeat: until
      repeat-until: not _.has_next
    types:
      group:
        -webide-representation: '{value}'
        doc: |
          One byte group, clearly divided into 7-bit "value" chunk and 1-bit "continuation" flag.
        seq:
          - id: b
            type: u1
        instances:
          has_next:
            value: (b & 0b1000_0000) == 0
            doc: If true, then we have more bytes to read
          value:
            value: b & 0b0111_1111
            doc: The 7-bit (base128) numeric value chunk of this group
    instances:
      last:
        value: groups.size - 1
      value:
        value: >-
          groups[last].value
          + (last >= 1 ? (groups[last - 1].value << 7) : 0)
          + (last >= 2 ? (groups[last - 2].value << 14) : 0)
          + (last >= 3 ? (groups[last - 3].value << 21) : 0)
          + (last >= 4 ? (groups[last - 4].value << 28) : 0)
          + (last >= 5 ? (groups[last - 5].value << 35) : 0)
          + (last >= 6 ? (groups[last - 6].value << 42) : 0)
          + (last >= 7 ? (groups[last - 7].value << 49) : 0)
        doc: Resulting value as normal integer

@mrdis

mrdis commented Jul 20, 2023

@tomberek, I think there is an issue with the endianness of the ci values; in my tests the ci value must be calculated as follows:

          groups[0].value
          + (last >= 1 ? (groups[1].value << 7) : 0)
          + (last >= 2 ? (groups[2].value << 14) : 0)
          + (last >= 3 ? (groups[3].value << 21) : 0)
          + (last >= 4 ? (groups[4].value << 28) : 0)
          + (last >= 5 ? (groups[5].value << 35) : 0)
          + (last >= 6 ? (groups[6].value << 42) : 0)
          + (last >= 7 ? (groups[7].value << 49) : 0)

@dralley

dralley commented Oct 15, 2023

Does Zchunk have a "magic number" identifying the file type?

@jdieter
Member

jdieter commented Oct 15, 2023

Does Zchunk have a "magic number" identifying the file type?

Hey Daniel, yes it does. As per https://github.com/zchunk/zchunk/blob/main/zchunk_format.txt the first bytes of the file are:
'\0ZCK1', identifies file as zchunk version 1 file
OR
'\0ZHR1', identifies file as zchunk detached header version 1 file

To clarify, almost everything you would see in Fedora would be the first. Detached headers were added for someone wanting to use zchunk to download immutable full disk images for the automotive industry.
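
For anyone sniffing files programmatically, a minimal check along those lines (assuming nothing beyond the two magic strings quoted above; both are 5 bytes including the leading NUL, so memcmp rather than strcmp is needed):

    #include <stdio.h>
    #include <string.h>

    /* Returns a description of the file based on the 5-byte zchunk magic
     * numbers quoted above from zchunk_format.txt. */
    static const char *zck_file_kind(FILE *f)
    {
        unsigned char lead[5];

        if (fread(lead, 1, sizeof(lead), f) != sizeof(lead))
            return "too short to be a zchunk file";
        if (memcmp(lead, "\0ZCK1", 5) == 0)
            return "zchunk version 1 file";
        if (memcmp(lead, "\0ZHR1", 5) == 0)
            return "zchunk detached header version 1 file";
        return "not a zchunk file";
    }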
