@author Ulf Wiger <ulf.wiger@erlang-solutions.com>
@doc A sortable serialization library
This library offers a serialization format (a la term_to_binary()) that
preserves the Erlang term order.
<pre>
Copyright 2010 Erlang Solutions Ltd.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
</pre>
<h1>1. Introduction</h1>
The idea for this library came out of the need for disk-based storage
with ordered_set semantics in Erlang. One solution exists - Tokyo Cabinet -
in which a C routine is used to hook into the sorting logic of TC.
I thought a more generic solution would be a version of term_to_binary()
that respects the ordering semantics of Erlang terms.
A new addition is support for 'sb32' encoding. This is my own version of
Base32 encoding, with a slightly different alphabet, in order to preserve
sorting properties while generating octet strings that are perfectly safe
to use in file names.
<h1>2. Specification</h1>
<h2>2.1 Type tags</h2>
Each data type is encoded using a type tag (1 byte) that represents its order
in the global Erlang term ordering. The number type is divided into several
subtypes, to facilitate a reasonably efficient representation:
<table border="1">
<tr align="left">
<th>Type</th>
<th>Description</th>
<th>Tag</th>
</tr>
<tr>
<td>negbig</td>
<td>Negative bignum</td>
<td>8</td>
</tr>
<tr>
<td>neg4</td>
<td>Negative 31-bit integer</td>
<td>9</td>
</tr>
<tr>
<td>pos4</td>
<td>Positive 31-bit integer</td>
<td>10</td>
</tr>
<tr>
<td>posbig</td>
<td>Positive bignum</td>
<td>11</td>
</tr>
<tr>
<td>atom</td>
<td>Obj of type atom()</td>
<td>12</td>
</tr>
<tr>
<td>reference</td>
<td>Obj of type reference()</td>
<td>13</td>
</tr>
<tr>
<td>port</td>
<td>Obj of type port()</td>
<td>14</td>
</tr>
<tr>
<td>pid</td>
<td>Obj of type pid()</td>
<td>15</td>
</tr>
<tr>
<td>tuple</td>
<td>Obj of type tuple()</td>
<td>16</td>
</tr>
<tr>
<td>list</td>
<td>Obj of type list()</td>
<td>17</td>
</tr>
<tr>
<td>binary</td>
<td>Obj of type binary()</td>
<td>18</td>
</tr>
</table>
<h2>2.2 Tuples</h2>
Tuples are encoded as the tuple tag, followed by a 32-bit size element,
denoting the number of elements in the tuple, followed by each element
in the tuple individually encoded.
<h2>2.3 Lists</h2>
Lists are encoded as the list tag, followed by each element in the list
individually encoded, followed by a zero (1 byte).
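The tuple and list layouts described above can be sketched as follows,
restricted to elements that are positive integers in the pos4 range. The
module name, the function names and the literal tag values are assumptions
made for illustration (16 and 17 are taken to be the tuple and list tags,
10 the pos4 tag); this is not the library's own code.
<pre>
-module(sext_seq_sketch).
-export([encode_tuple/1, encode_list/1]).

%% Hypothetical helper: pos4 tag (10), 31-bit value, fraction flag 0.
encode_elem(I) when is_integer(I), I > 0, I < 16#80000000 ->
    <<10, I:31, 0:1>>.

%% Tuple: tag, 32-bit element count, then each element encoded in order.
encode_tuple(T) when is_tuple(T) ->
    Elems = << <<(encode_elem(E))/binary>> || E <- tuple_to_list(T) >>,
    <<16, (tuple_size(T)):32, Elems/binary>>.

%% List: tag, each element encoded in order, then a single zero byte.
encode_list(L) when is_list(L) ->
    Elems = << <<(encode_elem(E))/binary>> || E <- L >>,
    <<17, Elems/binary, 0>>.
</pre>
With these definitions, encode_tuple({1,2}) evaluates to
<<16,0,0,0,2,10,0,0,0,2,10,0,0,0,4>> and encode_list([1,2]) to
<<17,10,0,0,0,2,10,0,0,0,4,0>>.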
<h2>2.4 Binaries and bitstrings</h2>
A binary is basically a bitstring whose size is a multiple of 8. From a sorting
perspective, binaries and bitstrings are both sorted as left-aligned bit
arrays.
<pre>
1> bitstring_to_list(<<11111111111:11>>).
[56,<<7:3>>]
</pre>
Binaries and bitstrings are encoded as the binary tag, followed by each whole
byte, each prefixed with a 1-bit, followed by enough 0-bits to make the
size a multiple of 8 bits again, followed by a byte whose value is Bits,
where Bits is the number of "remainder bits" (8 if the original binary is
8-bit aligned).
Example:
<pre>
2> sext:encode(<<1,2,3>>).
<<18,128,192,160,96,8>>
3> <<18, 1:1,1, 1:1,2, 1:1,3, 0:5, 8>>.
<<18,128,192,160,96,8>>
</pre>
In the example above, we inserted three 1-bits, and therefore had to insert
five pad bits (zeroes) at the end. The last byte is 8, signifying that the
original binary was 8-bit aligned.
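A minimal sketch of the byte-aligned case, written with a bitstring
comprehension. The module and function names are hypothetical. The text
above does not say how many pad bits are used when the number of bytes is
itself a multiple of 8; this sketch then emits 8 zero bits, which is an
assumption.
<pre>
-module(sext_bin_sketch).
-export([encode_binary/1]).

%% Prefix each byte with a 1-bit, pad with 0-bits to a byte boundary,
%% then append the remainder-size byte (8 for a byte-aligned binary).
encode_binary(B) when is_binary(B) ->
    Padded = << <<1:1, Byte:8>> || <<Byte:8>> <= B >>,
    %% Assumption: between 1 and 8 pad bits are always written.
    Pad = 8 - (byte_size(B) rem 8),
    <<18, Padded/bitstring, 0:Pad, 8>>.
</pre>
For the input <<1,2,3>> this produces <<18,128,192,160,96,8>>, the same
bytes as in the example above.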
If the size is not a multiple of 8 bits, the remainder bits are prefixed with
a 1-bit, just like the whole bytes, then left-aligned and padded with 0-bits
up to a whole byte (not counting the 1-bit added in front).
The value of the last byte is the bit size of the remainder.
Example:
<pre>
2> sext:encode(<<1,2,3,4:3>>).
<<18,128,192,160,120,0,3>>
3> <<18, 1:1,1, 1:1,2, 1:1,3, 1:1,4:3,0:5, 0:4, 3>>.
<<18,128,192,160,120,0,3>>
</pre>
The first part of the bitstring is encoded exactly like above. The number 4:3
is first prefixed with a 1-bit, then padded with 0-bits at the end to become a
whole byte. Then an additional pad, 0:4, is inserted to compensate for the
fact that we have inserted four 1-bits. Finally, the last byte is 3, to
signify the size of the remainder.
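The remainder handling can be sketched as below for bitstrings whose size
is not a multiple of 8. The module and function names are hypothetical,
and the amount of padding used when the number of inserted 1-bits is
already a multiple of 8 (8 zero bits here) is an assumption.
<pre>
-module(sext_bits_sketch).
-export([encode_bits/1]).

encode_bits(B) when is_bitstring(B), bit_size(B) rem 8 =/= 0 ->
    RemSz = bit_size(B) rem 8,
    N = bit_size(B) div 8,
    <<Whole:N/binary, Rem:RemSz/bitstring>> = B,
    %% Whole bytes: each one prefixed with a 1-bit.
    Padded = << <<1:1, Byte:8>> || <<Byte:8>> <= Whole >>,
    %% Remainder: leading 1-bit, the bits themselves, zeros up to a byte.
    RemEnc = <<1:1, Rem/bitstring, 0:(8 - RemSz)>>,
    %% N + 1 one-bits were inserted; pad back to a byte boundary.
    Pad = 8 - ((N + 1) rem 8),
    <<18, Padded/bitstring, RemEnc/bitstring, 0:Pad, RemSz>>.
</pre>
Calling encode_bits(<<1,2,3,4:3>>) builds the same bytes as the hand-written
expression <<18, 1:1,1, 1:1,2, 1:1,3, 1:1,4:3,0:5, 0:4, 3>> in the example
above.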
<h2>2.5 Positive Numbers</h2>
Numbers are encoded as the corresponding type tag, followed by the integer
part, a marker indicating the presence of a fraction part, and the fraction
part, if any. The integer part is encoded differently depending on the size
of the value. The fraction part is encoded as a binary (without the 'binary'
type tag).
<h3>2.5.1 Positive small integers, pos4</h3>
Integers up to 31 bits are encoded as << ?pos4, I:31, F:1 >>
where I is the integer value, and F is 1 if a fraction part follows;
0 otherwise.
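As an illustration of the pos4 layout, the following sketch encodes a small
positive integer with no fraction part. The module name, the function name
and the literal tag value 10 (taken from the type tag table) are used here
for illustration only.
<pre>
-module(sext_pos4_sketch).
-export([encode_pos4/1]).

%% Tag 10 (pos4), 31-bit integer value, fraction flag 0 (no fraction part).
encode_pos4(I) when is_integer(I), I > 0, I < 16#80000000 ->
    <<10, I:31, 0:1>>.
</pre>
Here encode_pos4(1) gives <<10,0,0,0,2>> and encode_pos4(2) gives
<<10,0,0,0,4>>, so comparing the encoded binaries gives the same result as
comparing the integers.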
<h3>2.5.2 Positive large integers</h3>
Larger integers are converted to a byte string and then encoded like
binaries (without the 'binary' type tag), followed by a byte signifying
whether a fraction part follows (1 if yes; 0 otherwise).
<pre>
Bytes = encode_big(I),
<< ?posbig, Bytes/binary, F:8 >>
</pre>
<h3>2.5.3 Fraction part of positive numbers</h3>
The representation of floating point numbers is based on the <a href="http://en.wikipedia.org/wiki/Double_precision_floating-point_format">IEEE 754 Binary 64 standard representation</a>. This is also the representation used by Erlang:
<pre>
<<Sign:1, Exp:11, Frac:52>> = <<F/float>>
</pre>
The encoding extracts the integer part and encodes it as a positive integer
(either pos4 or posbig), flags the presence of a fraction part, and encodes
the fraction part as a binary (without the binary tag).
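To make the binary64 field layout concrete, this small sketch splits a float
into its sign, exponent and fraction fields using the same bit pattern as
above. The module and function names are hypothetical, and the sketch only
inspects the float; it does not reproduce the library's encoding.
<pre>
-module(float_fields_sketch).
-export([fields/1]).

%% Split a float into its IEEE 754 binary64 fields.
fields(F) when is_float(F) ->
    <<Sign:1, Exp:11, Frac:52>> = <<F/float>>,
    {Sign, Exp, Frac}.
</pre>
For example, fields(1.5) returns {0,1023,2251799813685248}: the exponent is
the bias 1023 (that is, 2 to the power 0), and the fraction field has only
its top bit set, which stands for the .5 after the implicit leading 1.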
<h2>2.6 Negative Numbers</h2>
<h3>2.6.1 Small negative numbers</h3>
<pre>
<< ?neg4:8, IRep:31, F:1 >>
</pre>
A negative number I is encoded as IRep = Max + I, where Max is the largest
possible number that can be represented with the number of bits present for
the given subtype. For example, Max for neg4 is 0x7FFF FFFF (31 bits).
Keep in mind that I < 0.
The fraction flag is inverted, compared to the pos4 representation, so it will
be 1 if there is no fraction part; 0 otherwise.
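A minimal sketch of the neg4 layout for negative integers without a fraction
part. The module name, function name and macro name are hypothetical, and
the lower bound of -16#7FFFFFFF is an assumption derived from IRep having to
fit in 31 bits.
<pre>
-module(sext_neg4_sketch).
-export([encode_neg4/1]).

-define(MAX31, 16#7FFFFFFF).

%% Tag 9 (neg4), IRep = Max + I in 31 bits, inverted fraction flag:
%% 1 because an integer has no fraction part.
encode_neg4(I) when is_integer(I), I < 0, I >= -(?MAX31) ->
    IRep = ?MAX31 + I,
    <<9, IRep:31, 1:1>>.
</pre>
Here encode_neg4(-2) gives <<9,255,255,255,251>> and encode_neg4(-1) gives
<<9,255,255,255,253>>, so the more negative number sorts first.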
<h3>2.6.2 Large negative numbers</h3>
Larger negative numbers are encoded as
<pre>
{Words, Max} = get_max(-I),
Bin = encode_bin_elems(list_to_binary(encode_big(Max + I))),
WordsRep = 16#FFFFffff - Words,
<< ?negbig:8, WordsRep:32, Bin/binary, F:8 >>
</pre>
That is, get_max() figures out how many 64-bit words are needed to represent
-I (the positive number), and also gives the maximum value that can be
represented in that many words. WordsRep in essence becomes a sub-subtag of
the negative bignum.
<h3>2.6.3 Fraction of negative numbers</h3>
The fraction is encoded almost like the inverse of the positive fraction
(as a "negative binary", if such a thing existed). Each byte is padded with
a 0-bit rather than a 1-bit, and the byte itself is replaced by 16#ff - Byte.
The sequence is then padded with 1s to become a multiple of 8 bits.
The last byte, denoting the number of significant bits in the last byte,
is similarly inverted.
<h2>2.7 Atoms</h2>
Atoms are encoded as the atom tag, followed by the string representation of
the atom using the binary encoding described above (but without the binary
tag).
<h2>2.8 References</h2>
The encoding of references is perhaps best described by the code:
<pre>
encode_ref(R) ->
    RBin = term_to_binary(R),
    <<131,114,_Len:16,100,NLen:16,Name:NLen/binary,Rest/binary>> = RBin,
    NameEnc = encode_bin_elems(Name),
    RestEnc = encode_bin_elems(Rest),
    <<?reference, NameEnc/binary, RestEnc/binary>>.
</pre>
where encode_bin_elems(B) encodes the argument B the same way as a binary
(excluding the 'binary' type tag).
<h2>2.9 Ports</h2>
The encoding of ports is perhaps best described by the code:
<pre>
encode_port(P) ->
    PBin = term_to_binary(P),
    <<131,102,100,ALen:16,Name:ALen/binary,Rest:5/binary>> = PBin,
    NameEnc = encode_bin_elems(Name),
    <<?port, NameEnc/binary, Rest/binary>>.
</pre>
<h2>2.10 Pids</h2>
The encoding of pids is perhaps best described by the code:
<pre>
encode_pid(P) ->
    PBin = term_to_binary(P),
    <<131,103,100,ALen:16,Name:ALen/binary,Rest:9/binary>> = PBin,
    NameEnc = encode_bin_elems(Name),
    <<?pid, NameEnc/binary, Rest/binary>>.
</pre>
@end