/
zchunk_format.txt
238 lines (180 loc) · 8.02 KB
/
zchunk_format.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
A zchunk file contains two parts, the header and the body. The header consists
of four parts:
* The lead: Everything necessary to validate the header
* The preface: Metadata about the zchunk file
* The index: Details about each chunk
* The signatures: Signatures used to sign the zchunk file
Definitions:
(ci)
Compressed (unsigned) integer - An variable length little endian integer where
the first seven bits of the number are stored in the first byte, followed by
the next seven bits in the next byte, and so on. The top bit of all bytes
except the final byte must be zero, and the top bit of the final byte must be
one, indicating the end of the number.
The lead:
+-+-+-+-+-+====================+==================+=================+
| ID | Checksum type (ci) | Header size (ci) | Header checksum |
+-+-+-+-+-+====================+==================+=================+
ID
'\0ZCK1', identifies file as zchunk version 1 file
Checksum type
This is an integer containing the type of checksum used to generate the header
checksum and the total data checksum, but *not* the chunk checksums.
Current values:
0 = SHA-1
1 = SHA-256
Header size:
This is an integer containing the size of the header, not including the lead
Header checksum
This is the checksum of everything from the beginning of the file until the end
of the signatures, ignoring the header checksum.
The preface:
+===============+============+========================+
| Data checksum | Flags (ci) | Compression type (ci ) |
+===============+============+========================+
(Optional elements will only be set if flag 1 is set to 1)
+=============================+
| Optional element count (ci) |
+=============================+
[+==========================+=================================+
[| Optional element id (ci) | Optional element data size (ci) |
[+==========================+=================================+
+=======================+]
| Optional element data |] ...
+=======================+]
Data checksum
This is the checksum of everything after the header, including the compressed
dict and all the compressed chunks. This checksum is generated using the
overall checksum type, *not* the chunk checksum type.
Flags
This is a compressed integer containing a bitmask of the flags. All unused
flags MUST be set to 0. If a decoder sees a flag set that it doesn't
recognize, it MUST exit with an error.
Current flags are:
bit 0: File has data streams
bit 1: File has optional elements
Compression type
This is an integer containing the type of compression used to compress dict and
chunks.
Current values:
0 - Uncompressed
2 - zstd
NOTE: The following optional element fields will only be set if flag 1 is set
to 1
Optional element count
This is a compressed integer containing the count of optional elements. If
there are no optional elements, then flag 1 MUST be set to 0, and this field
and the following fields MUST NOT exist.
Optional element id
This is a compressed integer containing the ID of the current optional element.
Available optional elements are:
- none
Optional element data size
This is a compressed integer containing the current optional element's data
size.
Optional element data
This contains the current optional element's data. This data MUST NOT be vital
for decompressing the zchunk file. Zchunk versions that don't recognize the
current optional element ID, will ignore this data.
The index:
+=================+==========================+==================+
| Index size (ci) | Chunk checksum type (ci) | Chunk count (ci) |
+=================+==========================+==================+
(Dict stream will only exist if flag 0 is set to 1)
+==================+===============+==================+
| Dict stream (ci) | Dict checksum | Dict length (ci) |
+==================+===============+==================+
+===============================+
| Uncompressed dict length (ci) |
+===============================+
(Chunk stream will only exist if flag 0 is set to 1)
[+===================+================+===================+
[| Chunk stream (ci) | Chunk checksum | Chunk length (ci) |
[+===================+================+===================+
+==========================+]
| Uncompressed length (ci) |] ...
+==========================+]
Index size
This is an integer containing the size of the index.
Chunk checksum type
This is an integer containing the type of checksum used to generate the chunk
checksums.
Current values:
0 = SHA-1
1 = SHA-256
2 = SHA-512
3 = SHA-512/128 (first 128 bits of SHA-512 checksum)
Chunk count
This is a count of the number of chunks in the zchunk file including the
dictionary chunk.
NOTE: Dict stream will only be set if flag 0 is set to 1
Dict stream
If the data streams flag is set, this must always be 0, otherwise don't include
this integer
Dict checksum
This is the checksum of the compressed dict, used to detect whether two dicts
are identical. If there is no dict, the checksum must be all zeros.
Dict length
This is an integer containing the length of the dict. If there is no dict,
this must be a zero.
Uncompressed dict length
This is an integer containing the length of the dict after it has been
decompressed. If there is no dict, this must be a zero.
NOTE: Chunk stream will only be set if flag 0 is set to 1
Chunk stream
If the data streams flag is set, this indicates which stream this chunk belongs
to. 1 is the default, so decoders SHOULD decode stream 1 by default. If the
data streams flag isn't set, don't include this integer.
Chunk checksum
This is the checksum of the compressed chunk, used to detect whether any two
chunks are identical.
Chunk length
This is an integer containing the length of the chunk.
Uncompressed dict length
This is an integer containing the length of the chunk after it has been
decompressed.
The index is designed to be able to be extracted from the file on the server and
downloaded separately, to facilitate downloading only the parts of the file that
are needed, but must then be re-embedded when assembling the file so the user
only needs to keep one file.
Streams can be used to separate file metadata and data. An example might be a
package format with the files stored in a tarball in stream 1, but the metadata
stored in stream 2.
The signatures:
+======================+
| Signature count (ci) |
+======================+
[+=====================+=====================+===========+]
[| Signature type (ci) | Signature size (ci) | Signature |] ...
[+=====================+=====================+===========+]
Signature count
This is an integer countaining the number of signatures.
Signature type
This is an integer containing the type of signature. Currently there are no
recognized signature types.
Signature size
This is an integer containing the size of the signature.
Signature
The actual signature. The signature MUST only apply to the header, excluding
the header size, the header checksum, the signature count and the signatures.
The excluded data MUST be omitted when calculating the signature.
Signatures are designed so that anyone can add a new signature to a file
without changing the validity of other signatures, but the header size and
checksum must be recalculated.
We sign only the header so the signature can be validated independently of the
data, though the data can then be validated through both the chunk checksums
and the full data checksum, both of which are embedded in the signed header.
After the header, we have the body, which has the following:
+=================+
| Compressed Dict |
+=================+
[+===========================+]
[| Chunk |] ...
[+===========================+]
Compressed Dict (optional)
This is a custom dictionary used when compressing each chunk. Because each
chunk is compressed completely separately from the others, the custom
dictionary gives us much better overall compression. The custom dictionary is
compressed without a custom dictionary (for obvious reasons).
Chunk
This is a chunk of data, compressed with the custom dictionary provided above.