zchunk_format.txt

A zchunk file contains two parts, the header and the body.  The header consists
of four parts:
 * The lead: Everything necessary to validate the header
 * The preface: Metadata about the zchunk file
 * The index: Details about each chunk
 * The signatures: Signatures used to sign the zchunk file

Definitions:
(ci)
 Compressed (unsigned) integer - A variable-length little-endian integer where
 the first seven bits of the number are stored in the first byte, followed by
 the next seven bits in the next byte, and so on.  The top bit of all bytes
 except the final byte must be zero, and the top bit of the final byte must be
 one, indicating the end of the number.
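
 As an illustration only, the encoding above can be sketched in Python.  The
 names encode_ci and decode_ci are made up for this example and are not part of
 the format.  Note that the end marker is the opposite of LEB128, where the top
 bit marks a continuation byte.

```python
def encode_ci(value):
    """Encode a non-negative integer as a zchunk compressed integer."""
    if value < 0:
        raise ValueError("compressed integers are unsigned")
    out = bytearray()
    while True:
        low = value & 0x7F          # next seven bits, least significant first
        value >>= 7
        if value == 0:
            out.append(low | 0x80)  # top bit set: this is the final byte
            return bytes(out)
        out.append(low)             # top bit clear: more bytes follow


def decode_ci(buf, pos=0):
    """Decode one compressed integer at buf[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated compressed integer")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            return value, pos
        shift += 7
```

 For example, encode_ci(300).hex() gives '2c82': the low seven bits (0x2c) come
 first, and the final byte carries the remaining bits (0x02) with the top bit
 set.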

The lead:
+-+-+-+-+-+====================+==================+=================+
|   ID    | Checksum type (ci) | Header size (ci) | Header checksum |
+-+-+-+-+-+====================+==================+=================+

ID
 '\0ZCK1', identifies the file as a zchunk version 1 file

Checksum type
 This is an integer containing the type of checksum used to generate the header
 checksum and the total data checksum, but *not* the chunk checksums.

 Current values:
   0 = SHA-1
   1 = SHA-256

Header size
 This is an integer containing the size of the header, not including the lead.

Header checksum
 This is the checksum of everything from the beginning of the file until the end
 of the signatures, ignoring the header checksum.
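
 A minimal Python sketch of reading the lead and checking the header checksum
 follows.  It rests on two assumptions that the text above does not spell out:
 the header checksum field is as long as the digest of the selected algorithm,
 and "ignoring the header checksum" means those bytes are skipped when hashing.
 The names parse_lead, header_is_valid and read_ci are hypothetical.

```python
import hashlib

# Checksum type values from the lead, mapped to hashlib algorithm names.
LEAD_DIGESTS = {0: "sha1", 1: "sha256"}


def read_ci(buf, pos):
    """Decode one compressed integer at buf[pos]; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80:  # top bit set marks the final byte
            return value, pos
        shift += 7


def parse_lead(data):
    """Split the lead into its fields (assumes a digest-sized checksum)."""
    if data[:5] != b"\0ZCK1":
        raise ValueError("not a zchunk version 1 file")
    checksum_type, pos = read_ci(data, 5)
    header_size, pos = read_ci(data, pos)
    algo = LEAD_DIGESTS[checksum_type]
    digest_len = hashlib.new(algo).digest_size
    return {
        "algo": algo,
        "header_size": header_size,
        "checksum_start": pos,
        "checksum": data[pos:pos + digest_len],
        "lead_len": pos + digest_len,
    }


def header_is_valid(data):
    """Hash the lead (minus the checksum field) plus the rest of the header."""
    lead = parse_lead(data)
    start, end = lead["checksum_start"], lead["lead_len"]
    digest = hashlib.new(lead["algo"])
    digest.update(data[:start])                         # ID, type, header size
    digest.update(data[end:end + lead["header_size"]])  # preface to signatures
    return digest.digest() == lead["checksum"]
```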


The preface:
+===============+============+=======================+
| Data checksum | Flags (ci) | Compression type (ci) |
+===============+============+=======================+

(Optional elements will only be set if flag 1 is set to 1)
+=============================+
| Optional element count (ci) |
+=============================+

[+==========================+=================================+
[| Optional element id (ci) | Optional element data size (ci) |
[+==========================+=================================+

+=======================+]
| Optional element data |] ...
+=======================+]

Data checksum
 This is the checksum of everything after the header, including the compressed
 dict and all the compressed chunks.  This checksum is generated using the
 overall checksum type, *not* the chunk checksum type.
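
 As a rough sketch, the data checksum could be verified as shown below.  It
 assumes the body starts at the lead length plus the header size, and that the
 caller has already read the algorithm and the expected digest from the header.
 The name data_checksum_matches and its parameters are hypothetical.

```python
import hashlib


def data_checksum_matches(data, body_start, algo, expected):
    """Hash everything after the header and compare with the data checksum."""
    digest = hashlib.new(algo)
    view = memoryview(data)
    # Feed the body in blocks, as one would when reading a large file.
    for offset in range(body_start, len(data), 65536):
        digest.update(view[offset:offset + 65536])
    return digest.digest() == expected
```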

Flags
 This is a compressed integer containing a bitmask of the flags.  All unused
 flags MUST be set to 0.  If a decoder sees a flag set that it doesn't
 recognize, it MUST exit with an error.

 Current flags are:
  bit 0: File has data streams
  bit 1: File has optional elements
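
 The requirement to reject unrecognized flags can be illustrated with a short
 Python sketch.  The constant and function names are invented for this example.

```python
FLAG_HAS_STREAMS = 1 << 0            # bit 0: file has data streams
FLAG_HAS_OPTIONAL_ELEMENTS = 1 << 1  # bit 1: file has optional elements
KNOWN_FLAGS = FLAG_HAS_STREAMS | FLAG_HAS_OPTIONAL_ELEMENTS


def check_flags(flags):
    """Return (has_streams, has_optional_elements) or fail on unknown bits."""
    unknown = flags & ~KNOWN_FLAGS
    if unknown:
        raise ValueError("unrecognized flag bits: %#x" % unknown)
    return (bool(flags & FLAG_HAS_STREAMS),
            bool(flags & FLAG_HAS_OPTIONAL_ELEMENTS))
```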

Compression type
 This is an integer containing the type of compression used to compress dict and
 chunks.

 Current values:
   0 = Uncompressed
   2 = zstd

NOTE: The following optional element fields will only be set if flag 1 is set
      to 1

Optional element count
 This is a compressed integer containing the count of optional elements.  If
 there are no optional elements, then flag 1 MUST be set to 0, and this field
 and the following fields MUST NOT exist.

Optional element id
 This is a compressed integer containing the ID of the current optional element.

 Available optional elements are:
  - none

Optional element data size
 This is a compressed integer containing the current optional element's data
 size.

Optional element data
 This contains the current optional element's data.  This data MUST NOT be vital
 for decompressing the zchunk file.  Zchunk versions that don't recognize the
 current optional element ID will ignore this data.
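
 A possible way to walk the optional elements is sketched below in Python.  It
 should only be called when the optional elements flag is set.  Because each
 element carries its own size, a reader can step over IDs it does not know.
 The function names are hypothetical.

```python
def read_ci(buf, pos):
    """Decode one compressed integer at buf[pos]; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80:  # top bit set marks the final byte
            return value, pos
        shift += 7


def parse_optional_elements(buf, pos):
    """Return ([(element_id, data), ...], next_pos)."""
    count, pos = read_ci(buf, pos)
    elements = []
    for _ in range(count):
        element_id, pos = read_ci(buf, pos)
        size, pos = read_ci(buf, pos)
        if pos + size > len(buf):
            raise ValueError("truncated optional element")
        # The caller decides what to do with each ID; unknown IDs can simply
        # be dropped, since the size already told us how far to skip.
        elements.append((element_id, buf[pos:pos + size]))
        pos += size
    return elements, pos
```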


The index:
+=================+==========================+==================+
| Index size (ci) | Chunk checksum type (ci) | Chunk count (ci) |
+=================+==========================+==================+

(Dict stream will only exist if flag 0 is set to 1)
+==================+===============+==================+
| Dict stream (ci) | Dict checksum | Dict length (ci) |
+==================+===============+==================+

+===============================+
| Uncompressed dict length (ci) |
+===============================+

(Chunk stream will only exist if flag 0 is set to 1)
[+===================+================+===================+
[| Chunk stream (ci) | Chunk checksum | Chunk length (ci) |
[+===================+================+===================+

+==========================+]
| Uncompressed length (ci) |] ...
+==========================+]

Index size
 This is an integer containing the size of the index.

Chunk checksum type
 This is an integer containing the type of checksum used to generate the chunk
 checksums.

 Current values:
   0 = SHA-1
   1 = SHA-256
   2 = SHA-512
   3 = SHA-512/128 (first 128 bits of SHA-512 checksum)

Chunk count
 This is a count of the number of chunks in the zchunk file including the
 dictionary chunk.

NOTE: Dict stream will only be set if flag 0 is set to 1
Dict stream
 If the data streams flag is set, this must always be 0; otherwise, don't
 include this integer.

Dict checksum
 This is the checksum of the compressed dict, used to detect whether two dicts
 are identical.  If there is no dict, the checksum must be all zeros.

Dict length
 This is an integer containing the length of the dict.  If there is no dict,
 this must be a zero.

Uncompressed dict length
 This is an integer containing the length of the dict after it has been
 decompressed.  If there is no dict, this must be a zero.

NOTE: Chunk stream will only be set if flag 0 is set to 1
Chunk stream
 If the data streams flag is set, this indicates which stream this chunk belongs
 to.  1 is the default, so decoders SHOULD decode stream 1 by default.  If the
 data streams flag isn't set, don't include this integer.

Chunk checksum
 This is the checksum of the compressed chunk, used to detect whether any two
 chunks are identical.

Chunk length
 This is an integer containing the length of the chunk.

Uncompressed length
 This is an integer containing the length of the chunk after it has been
 decompressed.
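
 The index layout above might be read with something like the following Python
 sketch.  It assumes each checksum is as long as the digest of the chunk
 checksum type (16 bytes for SHA-512/128), and it treats the dict as the first
 of the counted entries, since the dict and chunk entries share one layout.
 The index size is returned but not used for bounds checking, because the text
 does not say whether it counts its own bytes.  All names are hypothetical.

```python
# Digest sizes in bytes for chunk checksum types 0 to 3.
CHUNK_DIGEST_SIZES = {0: 20, 1: 32, 2: 64, 3: 16}


def read_ci(buf, pos):
    """Decode one compressed integer at buf[pos]; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80:  # top bit set marks the final byte
            return value, pos
        shift += 7


def parse_index(buf, pos, has_streams):
    """Return (index_size, entries, next_pos); entries[0] is the dict."""
    index_size, pos = read_ci(buf, pos)
    checksum_type, pos = read_ci(buf, pos)
    chunk_count, pos = read_ci(buf, pos)  # includes the dict entry
    digest_len = CHUNK_DIGEST_SIZES[checksum_type]

    entries = []
    for _ in range(chunk_count):
        stream = None
        if has_streams:  # stream field exists only with the streams flag
            stream, pos = read_ci(buf, pos)
        checksum = buf[pos:pos + digest_len]
        pos += digest_len
        length, pos = read_ci(buf, pos)
        uncompressed_length, pos = read_ci(buf, pos)
        entries.append({
            "stream": stream,
            "checksum": checksum,
            "length": length,
            "uncompressed_length": uncompressed_length,
        })
    return index_size, entries, pos
```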

The index is designed so that it can be extracted from the file on the server
and downloaded separately, which makes it possible to download only the parts of
the file that are needed.  It must then be re-embedded when assembling the file
so the user only needs to keep one file.

Streams can be used to separate file metadata and data.  An example might be a
package format with the files stored in a tarball in stream 1, but the metadata
stored in stream 2.


The signatures:
+======================+
| Signature count (ci) |
+======================+

[+=====================+=====================+===========+]
[| Signature type (ci) | Signature size (ci) | Signature |] ...
[+=====================+=====================+===========+]

Signature count
 This is an integer containing the number of signatures.

Signature type
 This is an integer containing the type of signature.  Currently there are no
 recognized signature types.

Signature size
 This is an integer containing the size of the signature.

Signature
 The actual signature.  The signature MUST only apply to the header, excluding
 the header size, the header checksum, the signature count and the signatures.
 The excluded data MUST be omitted when calculating the signature.
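
 Since no signature types are recognized yet, a reader can do little more than
 split the signatures section into its entries.  The Python sketch below does
 just that; it does not verify anything, and the function names are
 hypothetical.

```python
def read_ci(buf, pos):
    """Decode one compressed integer at buf[pos]; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80:  # top bit set marks the final byte
            return value, pos
        shift += 7


def parse_signatures(buf, pos):
    """Return ([(signature_type, signature_bytes), ...], next_pos)."""
    count, pos = read_ci(buf, pos)
    signatures = []
    for _ in range(count):
        signature_type, pos = read_ci(buf, pos)
        size, pos = read_ci(buf, pos)
        if pos + size > len(buf):
            raise ValueError("truncated signature")
        signatures.append((signature_type, buf[pos:pos + size]))
        pos += size
    return signatures, pos
```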

Signatures are designed so that anyone can add a new signature to a file
without changing the validity of other signatures, but the header size and
checksum must be recalculated.

We sign only the header so the signature can be validated independently of the
data, though the data can then be validated through both the chunk checksums
and the full data checksum, both of which are embedded in the signed header.


After the header, we have the body, which has the following:
+=================+
| Compressed Dict |
+=================+

[+===========================+]
[|           Chunk           |] ...
[+===========================+]

Compressed Dict (optional)
 This is a custom dictionary used when compressing each chunk. Because each
 chunk is compressed completely separately from the others, the custom
 dictionary gives us much better overall compression.  The custom dictionary is
 compressed without a custom dictionary (for obvious reasons).

Chunk
 This is a chunk of data, compressed with the custom dictionary provided above.
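
 The layout above suggests, though it does not state outright, that the dict
 and the chunks are stored back to back in index order with no padding.  Under
 that assumption, the byte range of every chunk follows from the compressed
 lengths in the index, which is what a client needs in order to request only
 the chunks it is missing.  The name chunk_ranges is hypothetical.

```python
def chunk_ranges(body_start, lengths):
    """Map compressed lengths (dict first) to (start, end) byte offsets.

    body_start is the lead length plus the header size; each end offset is
    exclusive.
    """
    ranges = []
    pos = body_start
    for length in lengths:
        ranges.append((pos, pos + length))
        pos += length
    return ranges
```

 For a 100-byte header with no dict and two chunks of 10 and 5 bytes,
 chunk_ranges(100, [0, 10, 5]) gives [(100, 100), (100, 110), (110, 115)].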