DeltaZip

DeltaZip is a compact file format for storing a sequence of versions of a file.

By taking advantage of the redundancy in the version history, the sequence can be stored far more compactly than if each version were stored separately.

File format overview

A DeltaZip file (or DeltaZip archive) consists of a sequence of chapters, each containing one version of the file. The last chapter contains the latest version, stored using normal compression - it is a self-contained chapter. Each of the earlier chapters stores its version expressed as the difference between that version and the following one - these are delta-chapters. (In other words, a delta-chapter is a description of how to construct the version corresponding to that chapter, given the next version in the sequence.)

Ch. 1       Ch. 2       …       Ch. (N-1)       Ch. N
V1 − V2     V2 − V3     …       V(N-1) − VN     VN

Properties of the DeltaZip format

Using the DeltaZip format, the following can be achieved:

  • compact representation
  • easy access to the latest version
  • easy addition of a newer version
  • relatively quick access to older versions
  • easy deletion of the oldest versions.

Compact representation

The size of a delta representation may be only a few per mille of the size of the full version, depending on the nature of the changes. A version history of a few hundred versions can thus be stored in a file of the same order of size as the latest version by itself.

Easy access to the latest version

The latest chapter is easy to locate, and it contains the latest version compressed with an ordinary compression algorithm, so that version can be retrieved directly.

Relatively quick access to older versions

Retrieval of a version which is not the newest is done by going backwards from the most recent version, reconstructing each of the intermediate versions in turn, until the desired version is reached.

The Erlang implementation of DeltaZip has demonstrated an unpacking speed of ~250MB/s (on 2011 laptop hardware).
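
The backward walk described above can be sketched as follows (in Python). The helpers read_chapters_backwards(), decode_snapshot() and apply_delta() are hypothetical names standing in for the chapter and delta decoding defined under "File format details" below:

def get_older_version(archive, versions_back):
    chapters = read_chapters_backwards(archive)   # hypothetical: yields chapters, newest first
    version = decode_snapshot(next(chapters))     # the last chapter is self-contained
    for _ in range(versions_back):
        version = apply_delta(next(chapters), version)  # reconstruct the previous version
    return version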

Easy deletion of oldest versions

Truncating the history by deleting a number of the oldest versions is done by simply removing the corresponding number of chapters from the beginning of the archive. Because a DeltaZip archive is always accessed backwards, from the latest chapters towards the earliest, the remaining versions can be accessed just as before the deletion.

Technical limits

The file format in the present form only supports version sizes up to 2^28−1 bytes, or 256 MB.

File format details

Gross structure

archive ::= header chapter*
header  ::= deltazip-magic-number
deltazip-magic-number ::= 0xCE 0xB4 0x7A 0x10
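
A minimal sketch (in Python) of checking the header; the magic number is the four bytes given above:

DELTAZIP_MAGIC = bytes([0xCE, 0xB4, 0x7A, 0x10])

def check_header(archive: bytes) -> bytes:
    if not archive.startswith(DELTAZIP_MAGIC):
        raise ValueError("not a DeltaZip archive")
    return archive[len(DELTAZIP_MAGIC):]   # the rest of the file is the chapter sequence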

Chapters

chapter     ::= chapter-tag adler data chapter-tag
chapter-tag ::= {Method:4, Size:28}
adler       ::= uint32
data        ::= byte* // Length specified by Size

The leading and trailing chapter-tag must be identical.

'Adler' is the Adler32 checksum of the (raw, uncompressed) version contained in the chapter.
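
A sketch of reading one chapter, assuming (this is not stated explicitly above) that the chapter-tag and adler fields are stored big-endian. The method-specific decoding of the data is described below:

import struct, zlib

def read_chapter(buf: bytes, pos: int):
    (lead_tag,) = struct.unpack_from(">I", buf, pos)           # {Method:4, Size:28}
    method, size = lead_tag >> 28, lead_tag & 0x0FFFFFFF
    (adler,) = struct.unpack_from(">I", buf, pos + 4)
    data = buf[pos + 8 : pos + 8 + size]
    (trail_tag,) = struct.unpack_from(">I", buf, pos + 8 + size)
    if trail_tag != lead_tag:
        raise ValueError("leading and trailing chapter-tags differ")
    return method, adler, data, pos + size + 12                # position of the next chapter

def verify_version(adler: int, version: bytes) -> None:
    # 'version' is the raw, uncompressed version reconstructed from the chapter
    if zlib.adler32(version) & 0xFFFFFFFF != adler:
        raise ValueError("Adler32 checksum mismatch")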

The Method is a number signifying how the chapter's file version is represented.

The values 0-3 are used for self-contained chapters:

Method value   Method name
0              Uncompressed (raw).
1              Deflated (using window size 32K, no zlib header).
2–3            Unassigned.
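
Decoding a self-contained chapter is straightforward; a sketch in Python, where Method 1 data is a raw deflate stream (32K window, no zlib header) that zlib can inflate with wbits = -15:

import zlib

def decode_snapshot(method: int, data: bytes) -> bytes:
    if method == 0:                              # Uncompressed (raw)
        return data
    if method == 1:                              # Deflated, no zlib header
        return zlib.decompress(data, wbits=-15)  # negative wbits = raw deflate, 32K window
    raise ValueError("unsupported self-contained method %d" % method)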

The values 4-15 are used for delta-chapters:

Method value   Method name
4              Chunked.
5              Chunked-Middle.
6              Reserved for Dittoflate.
7              Chunked-Middle2.
8–15           Unassigned.

The delta compression methods are described below.

Delta compression methods

The notation "array[a;b]" means the sub-array which is obtained by taking the "b" first elements of "array", then removing the first "a" elements.

The "Chunked" compression method:

chunked-chapter-data ::= chunk*
chunk        ::= chunk-header chunk-size chunk-data
chunk-header ::= {ChunkMethod:5, ChunkParams:3}
chunk-size   ::= uint16
chunk-data   ::= byte* // Length specified by chunk-size

Chunk methods:

Chunk method value   Method name
0                    Deflated.
1                    Prefix-copy.
2                    Offset-copy.
3–31                 Unassigned.

Chunk methods representations:

chunk-data ::= deflate-chunk-data     // when ChunkMethod=0
chunk-data ::= prefix-copy-chunk-data // when ChunkMethod=1
chunk-data ::= offset-copy-chunk-data // when ChunkMethod=2

deflate-chunk-data ::= byte*
prefix-copy-chunk-data ::= copy-lengthM1
offset-copy-chunk-data ::= offset-lengthM1 copy-lengthM1

// Lengths minus one:
copy-lengthM1   ::= uint16
offset-lengthM1 ::= uint16
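
A sketch of splitting chunked-chapter-data into its chunks; the chunk-size field is assumed to be big-endian here (the grammar above does not state the byte order):

import struct

def iter_chunks(data: bytes):
    pos = 0
    while pos < len(data):
        header = data[pos]
        chunk_method, chunk_params = header >> 3, header & 0x07   # {ChunkMethod:5, ChunkParams:3}
        (chunk_size,) = struct.unpack_from(">H", data, pos + 1)
        yield chunk_method, chunk_params, data[pos + 3 : pos + 3 + chunk_size]
        pos += 3 + chunk_size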

Algorithm for "Chunked":

WINDOW_SIZE := 0x7E00;
CHUNK_SIZE  := WINDOW_SIZE / 2;
RSKIP_GRANULARITY := CHUNK_SIZE / 2;

chunked() {
    org_pos := 0;
    output := empty;
    for each chunk c : apply chunk c
}

apply deflate-chunk-data(params, chunk_data) {
    rskip := params * RSKIP_GRANULARITY;
    org_pos := org_pos + rskip;
    dict := org[org_pos; min(org_pos + WINDOW_SIZE, |org|)]
    output := output ++ inflate_with_dict(chunk_data, dict, window_size:32K, zlib_headers:off)
}

apply prefix-copy-chunk-data(copy_lengthM1) {
    copy_length := copy_lengthM1 + 1;
    output := output ++ org[org_pos; org_pos + copy_length]
    org_pos := org_pos + copy_length;
}

apply offset-copy-chunk-data(offset_lengthM1, copy_lengthM1) {
    offset_length := offset_lengthM1 + 1;
    org_pos := org_pos + offset_length;
    apply prefix-copy-chunk-data(copy_lengthM1);
}
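
The same algorithm sketched in Python, consuming the chunks from iter_chunks above. inflate_with_dict is approximated by zlib's raw-deflate mode with a preset dictionary, and the uint16 length fields are assumed to be big-endian; both points are assumptions rather than statements of the format description:

import zlib

WINDOW_SIZE = 0x7E00
CHUNK_SIZE = WINDOW_SIZE // 2
RSKIP_GRANULARITY = CHUNK_SIZE // 2

def apply_chunks(chunks, org: bytes) -> bytes:
    org_pos = 0
    output = bytearray()
    for method, params, data in chunks:
        if method == 0:                                        # Deflated
            org_pos += params * RSKIP_GRANULARITY              # rskip
            dictionary = org[org_pos : min(org_pos + WINDOW_SIZE, len(org))]
            d = zlib.decompressobj(wbits=-15, zdict=dictionary)
            output += d.decompress(data) + d.flush()
        elif method == 1:                                      # Prefix-copy
            copy_length = int.from_bytes(data[0:2], "big") + 1
            output += org[org_pos : org_pos + copy_length]
            org_pos += copy_length
        elif method == 2:                                      # Offset-copy
            offset_length = int.from_bytes(data[0:2], "big") + 1
            copy_length = int.from_bytes(data[2:4], "big") + 1
            org_pos += offset_length
            output += org[org_pos : org_pos + copy_length]
            org_pos += copy_length
        else:
            raise ValueError("unassigned chunk method %d" % method)
    return bytes(output)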

The "Chunked Middle" compression method:

The old and new versions have a common prefix and a common suffix. The middle of the old version is chunk-encoded relative to the middle of the new version.

Representation:

chunked-middle-chapter-data ::= prefix-length suffix-length chunked-chapter-data
prefix-length ::= varlen_int
suffix-length ::= varlen_int

varlen_int ::= {0:1, Bits:7}            // value is Bits
             | {1:1, Bits:7} varlen_int // Bits are MSB of the value.
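
A sketch of decoding a varlen_int as defined above (7 value bits per byte, high bit set on all but the last byte, earlier bytes carrying the more significant bits):

def read_varlen_int(buf: bytes, pos: int):
    value = 0
    while True:
        b = buf[pos]
        pos += 1
        value = (value << 7) | (b & 0x7F)
        if not (b & 0x80):                 # high bit clear: last byte of the number
            return value, pos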

Algorithm for "Chunked Middle":

chunked_middle(prefix_length, suffix_length) {
    org := org[prefix_length; |org| - suffix_length];
    chunked()
}
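
A sketch of the whole reconstruction, reusing the helper sketches above. That the common prefix and suffix are taken from the new version and re-attached around the decoded middle is an assumption here; the pseudocode above only shows how the middle is decoded:

def decode_chunked_middle(data: bytes, org: bytes) -> bytes:
    prefix_length, pos = read_varlen_int(data, 0)
    suffix_length, pos = read_varlen_int(data, pos)
    middle_org = org[prefix_length : len(org) - suffix_length]      # middle of the new version
    middle = apply_chunks(iter_chunks(data[pos:]), middle_org)      # middle of the old version
    return org[:prefix_length] + middle + org[len(org) - suffix_length:]   # assumed re-attachment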

The "Chunked Middle 2" compression method:

The old and new versions have a common prefix and a common suffix. The middle of the old version is chunk-encoded relative to a part of the new version that starts up to WINDOW_SIZE/2 bytes before the end of the common prefix.

Representation:

chunked-middle2-chapter-data ::= prefix-length suffix-length chunked-chapter-data
prefix-length ::= varlen_int
suffix-length ::= varlen_int

varlen_int ::= {0:1, Bits:7}            // value is Bits
             | {1:1, Bits:7} varlen_int // Bits are MSB of the value.

Algorithm for "Chunked Middle 2":

chunked_middle2(prefix_length, suffix_length) {
    cut_length := max(0, prefix_length - WINDOW_SIZE/2);
    org := org[cut_length; |org|];
    chunked()
}
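
An analogous sketch for Chunked-Middle2, under the same re-attachment assumption as above; the only difference is which part of the new version the chunks are decoded against:

def decode_chunked_middle2(data: bytes, org: bytes) -> bytes:
    prefix_length, pos = read_varlen_int(data, 0)
    suffix_length, pos = read_varlen_int(data, pos)
    cut_length = max(0, prefix_length - WINDOW_SIZE // 2)
    middle = apply_chunks(iter_chunks(data[pos:]), org[cut_length:])
    return org[:prefix_length] + middle + org[len(org) - suffix_length:]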