-
Notifications
You must be signed in to change notification settings - Fork 5
/
spec.txt
149 lines (106 loc) · 5.69 KB
/
spec.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
============================================================================
File Compressor
============================================================================
Your team is creating a program called "huff" that compresses and
decompresses files on UNIX machines. It operates in three main modes:
Compress a file:
huff -c file
The compressed file is called "file.huff". This file is silently
overwritten if it already exists.
NOTE: It is perfectly permissible (if useless) to compress a .huff
file. If x.huff is compressed, the resulting file would be called
x.huff.huff.
NOTE: Unlike gzip and bzip2, huff does not remove the original file
after it creates a compressed file.
Decompress a file:
huff -d file
The name of the uncompressed file is recovered by removing the ".huff"
extension. If a file of that name exists it is silently overwritten.
NOTE: Unlike gzip and bzip2, huff does not remove the original file
after it creates a decompressed file.
Dump the compression table for a file:
huff -t file
If file is a huff file, its compression table is recovered from the
file and dumped to stdout. Otherwise, its compression table is
computed and dumped to stdout.
============================================================================
The "expected case" is defined to be the situation where the input
file is a regular file that can be opened for reading, the output file
is a regular file that can be opened for writing, and library/system
calls do not fail unexpectedly (i.e. malloc calls succeed, the hard
disk does not fill up, etc.). In the expected case, neither file is
modified by another program while huff is executing.
In compression mode, huff is not permitted to fail in the expected
case.
In decompression mode, huff must fail if the file is not a valid huff
file (see below). If the compressed file is valid, then huff must not
fail in the expected case.
In table dump mode, when the argument is a huff file, huff is required
to fail if the file is not valid; otherwise, it is not permitted to fail
in the expected case.
If huff succeeds, it must return a 0 exit code
If huff fails, it must return a 255 exit code and a reasonable error
message should be printed.
============================================================================
huff file format
A file is a valid huff file only if it meets all of the following
requirements:
1. Its filename ends with ".huff"
2. Bytes 0-3 are the magic number: the ASCII characters HUFF
3. Bytes 4-11 are the length field: a 64-bit unsigned integer in
little endian format indicating the size of the decompressed file,
which is also the number of Huffman codes in the compressed data
section
4. A valid compression table (see format below) begins at byte 12
5. The remainder of the file contains compressed data; the number of
items must match the length field; the only exception is that the
final byte of the file, if it is not filled, is padded with
1-7 bits whose value is unspecified. Bits are placed into bytes
in big-endian order.
A valid compression table is comprised of a sequence of 256 strings,
each of which matches this Perl regex:
^[01]+\n$
In other words, each of the 256 strings is comprised of a sequence of
0 and 1 characters of at least length 1, followed by a newline. No
string of 0s and 1s can exceed length 255.
NOTE: The strings in the compression table are not C strings. That is,
they are not terminated by nul bytes. Each entry in the table is
terminated by a single newline character only.
============================================================================
The Huffman coding has the property that no code is the prefix of any
other code. This enables unambiguous decoding.
The algorithm to generate the compression table is as follows:
- Determine the frequency of occurrence of each byte in the input
file.
- Make a list that contains one element for each byte value; this list
is sorted by increasing frequency (i.e., least frequently occurring
bytes are first). For bytes with the same frequency, the
lower-valued byte goes earlier in the list.
- Remove the first two elements of the list and combine them into a
single list element whose frequency is the sum of the frequencies of
the removed elements. The new element should contain the removed
list elements as subtrees. The new element should be inserted into
its proper place in the sorted list. Ties are broken by putting the
element whose subtree contains the lowest-valued byte earlier in the
list.
- If the list contains more than one element, go back to the previous
step.
- Create the compression table as follows. Observe that for every
non-leaf node in the tree, the subtree containing the minimum-valued
element can be called the 0-subtree and the other subtree can be
called the 1-subtree. To generate the table entry for the
zero-valued byte, walk the tree from root to the leaf corresponding
to the 0 byte, generating a 0 in the Huffman code every time the
0-subtree is followed and a 1 in the Huffman code every time the
1-subtree is followed.
============================================================================
Additional requirements:
huff is implemented in C and must not execute any undefined behaviors
on either 32-bit or 64-bit systems. Furthermore, Valgrind should
declare it to be free of memory leaks.
Running "make" in your team's directory should cause an executable
called "huff" to be compiled. This must work on CADE but please make
an effort to also support random Linux and MacOS X machines.
huff must successfully compress and decompress files larger than 2^32
bytes, even on a 32-bit machine (so don't mmap() the file!)
============================================================================