This repository has been archived by the owner on Jan 30, 2023. It is now read-only.
-
-
Notifications
You must be signed in to change notification settings - Fork 7
/
shannon_fano_elias.py
409 lines (321 loc) · 14.2 KB
/
shannon_fano_elias.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
r"""
Shannon-Fano-Elias Coding
This module implements functionalities relating to Shannon-Fano-Elias
encoding and decoding.
AUTHORS:
- Jan Wabbersen (2015-04-27): initial version based on concepts and
ideas of the Huffman module by Nathann Cohen
"""
#*****************************************************************************
# Copyright (C) 2015 Jan Wabbersen <jan.wabbersen@googlemail.com>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 2 of the License, or
# (at your option) any later version.
# http://www.gnu.org/licenses/
#*****************************************************************************
from __future__ import print_function, absolute_import
from math import ceil, log
from six import iteritems
from .prefix_coding import PrefixCoding
from .misc import SimpleTable, first_binary_dec_places
class ShannonFanoElias(PrefixCoding):
r"""
This class implements basic functionalities of Shannon-Fano-Elias
coding.
It can build a Shannon-Fano-Elias code from a given string, or from
the information of a dictionary associating to each key (the
elements of the alphabet) a weight (most of the time, a probability
value or a number of occurrences).
INPUT:
- ``source`` -- can be either
- A string from which the Shannon-Fano-Elias encoding should
be created.
- A dictionary that associates to each symbol of an
alphabet a numeric value. If we consider the frequency
of each alphabetic symbol, then ``source`` is considered
as the frequency table of the alphabet with each numeric
(non-negative integer) value being the number of
occurrences of a symbol. The numeric values can also
represent weights of the symbols. In that case, the
numeric values are not necessarily integers, but can be
real numbers.
- ``verbose`` -- (default: False) if True, print intermediate
data of building the code.
- ``char_per_symbol`` -- (default: 1) the number of characters
that define one symbol.
- ``decoding_table_key_len`` -- (default: 8) the length of each
key in the resulting decoding table. Longer keys may result in
faster decoding. Decoding tables with shorter keys consume
less memory, but decoding may be slower.
In order to construct a Shannon-Fano-Elias encoding for an alphabet,
we use exactly one of the following methods:
#. Let ``source`` be a string of symbols over an alphabet and feed
``source`` to the constructor of this class. Based on the input
string and ``char_per_code``, a frequency table is constructed
that contains the frequency of each unique symbol (defined by
``char_per_symbol``) in ``source``. The alphabet in question is
then all the unique symbols in ``source``. A significant
implication of this is that any subsequent string that we want to
encode must contain only symbols that can be found in ``source``.
#. Let ``source`` be the frequency table of an alphabet. We can
feed this table to the constructor of this class. The table
``source`` can be a table of frequencies or a table of weights.
EXAMPLES::
sage: from sage.coding.source_coding.shannon_fano_elias import ShannonFanoElias
sage: sfe1 = ShannonFanoElias("Encode me!")
sage: for symbol, code in sfe1.encoding_table().iteritems():
....: print("'" + symbol + "' : " + code)
....:
'!' : 0000
' ' : 0010
'c' : 0100
'e' : 011
'd' : 1000
'm' : 1010
'o' : 1100
'n' : 1101
'E' : 1111
Now, we have a Shannon-Fano-Elias code for encoding strings
consisting of symbols that are in the encoding table::
sage: encoded = sfe1.encode("Encode me!"); encoded
'11111101010011001000011001010100110000'
And we can decode the encoded string in the following way::
sage: sfe1.decode(encoded)
'Encode me!'
If we try to encode a string with ``sfe1`` that consists of symbols
which are not part of the encoding table, we of course get an
error::
sage: sfe1.encode("This string contains other symbols!")
Traceback (most recent call last):
...
KeyError: 'T'
If we want to encode multiple strings over an alphabet, it might be
a good idea to generate a Shannon-Fano-Elias code for this alphabet
and a given probability distribution for the symbols::
sage: d = {'a': 0.5, 'b': 0.25, 'c': 0.2, 'd': 0.05}
sage: sfe2 = ShannonFanoElias(d)
sage: for symbol, code in sfe2.encoding_table().iteritems():
....: print("'" + symbol + "' : " + code)
....:
'a' : 0
'c' : 100
'b' : 11
'd' : 11111
We can also use frequencies or in general weights for the symbols::
sage: d2 = {'a': 2.5, 'b': 1.25, 'c': 1, 'd': 0.25}
sage: sfe3 = ShannonFanoElias(d2)
sage: for symbol, code in sfe3.encoding_table().iteritems():
....: print("'" + symbol + "' : " + code)
....:
'a' : 0
'c' : 100
'b' : 11
'd' : 11111
If you are interested in Shannon-Fano-Elias coding, you can output
intermediate data of the generation of the Shannon-Fano-Elias code::
sage: sfe4 = ShannonFanoElias("code", verbose=True)
| symbol | p | midpoint | bin(midpoint) | bits | codeword |
| c | 0.25 | 0.125 | 0.00... | 2 | 00 |
| e | 0.25 | 0.375 | 0.01... | 2 | 01 |
| d | 0.25 | 0.625 | 0.10... | 2 | 10 |
| o | 0.25 | 0.875 | 0.11... | 2 | 11 |
With ``char_per_symbol`` we are able to define symbols consisting
of multiple characters. If we choose a dictionary for ``source``,
we can just use keys/symbols consisting of multiple characters. But
we have to make sure they all have the same length. If we choose a
string for ``source``, we have to specify how many characters build
one symbol::
sage: str = "Split me into symbols consisting of 3 characters"
sage: sfe5 = ShannonFanoElias(str, char_per_symbol=3)
sage: for symbol, code in sfe5.encoding_table().iteritems():
....: print("'" + symbol + "' : " + code)
....:
'nsi' : 0000
'me ' : 0011
'int' : 0110
'3 c' : 0111
'it ' : 1001
'ols' : 0100
'of ' : 0001
'ng ' : 0010
' co' : 1111
'ers' : 0101
'sti' : 1000
'act' : 1010
'ymb' : 1011
'Spl' : 1100
'o s' : 1101
'har' : 1110
Obviously, we have to make sure that the length of the string is a
multiple of ``char_per_symbol`` or else we get an error::
sage: sfe6 = ShannonFanoElias("four", char_per_symbol=3)
Traceback (most recent call last):
...
ValueError: The passed string does not match with the passed value for char_per_symbol.
When decoding long strings with a code consisting of long codewords,
you can try to adapt the parameter ``decoding_table_key_len`` for
a faster decoding. Or if you want to save memory when dealing with
long codewords, you can set it to a smaller value::
sage: sfe7 = ShannonFanoElias("Let's call it a day!", decoding_table_key_len=3)
"""
def __init__(self, source, verbose=False, char_per_symbol=1,
decoding_table_key_len=8):
r"""
Constructor for ShannonFanoElias.
See the docstring of this class for full documentation.
EXAMPLES::
sage: from sage.coding.source_coding.shannon_fano_elias import ShannonFanoElias
sage: str = "Give me an example!"
sage: sfe = ShannonFanoElias(str, decoding_table_key_len=6)
TESTS:
Feeding anything else than a string or a dictionary::
sage: ShannonFanoElias(Graph())
Traceback (most recent call last):
...
ValueError: Input must be either a string or a dictionary.
"""
PrefixCoding.__init__(self, source, verbose, 2, char_per_symbol,
decoding_table_key_len)
def _build_code(self, dic, verbose=False):
r"""
Construct a Shannon-Fano-Elias code corresponding to an alphabet
with the given weight table.
INPUT:
- ``dic`` -- a dictionary that associates to each symbol of an
alphabet a numeric value. If we consider the frequency of
each alphabetic symbol, then ``dic`` is considered as the
frequency table of the alphabet with each numeric
(non-negative integer) value being the number of occurrences
of a symbol. The numeric values can also represent weights of
the symbols. In that case, the numeric values are not
necessarily integers, but can be real numbers. In general,
we refer to ``dic`` as a weight table.
- ``verbose`` -- (default: False) if True, print intermediate
data of building the code.
EXAMPLES::
sage: from sage.coding.source_coding.shannon_fano_elias import ShannonFanoElias
sage: d = {'c': 4, 'o': 3, 'd': 2, 'e':1}
sage: sfe = ShannonFanoElias(d)
sage: sfe._build_code(d)
"""
# weights to probabilities
total = sum(dic.values())
dic.update((s, w * 1.0 / total) for (s, w) in dic.items())
low = 0
all_values = []
for s, p in iteritems(dic):
mid = low + p/2.0
bits = int(ceil(-log(p, 2)))
c = first_binary_dec_places(mid, bits)
all_values.append((s, p, mid, bits, c))
low += p
if verbose:
head = [
'symbol', 'p', 'midpoint', 'bin(midpoint)',
'bits', 'codeword',
]
t = SimpleTable(head)
for s, p, mid, bits, c in all_values:
t.add_row([s, p, mid, '0.' + c + '...', bits, c])
t.print_me()
self._character_to_code = {s: c for (s, _, _, _, c) in all_values}
def encode(self, string):
r"""
Encode the given string based on the current encoding table.
INPUT:
- ``string`` -- a string of symbols over an alphabet.
OUTPUT:
- A Shannon-Fano-Elias encoding of ``string``.
EXAMPLES:
This example illustrates how to encode a string::
sage: from sage.coding.source_coding.shannon_fano_elias import ShannonFanoElias
sage: str = "Encode me!"
sage: sfe = ShannonFanoElias(str)
sage: encoded = sfe.encode(str); encoded
'11111101010011001000011001010100110000'
"""
return PrefixCoding.encode(self, string)
def decode(self, string):
r"""
Decode the given string based on the current decoding table.
INPUT:
- ``string`` -- a string of encoded symbols.
OUTPUT:
- The Shannon-Fano-Elias decoding of ``string``.
EXAMPLES:
This example illustrates how to encode and then decode a
string::
sage: from sage.coding.source_coding.shannon_fano_elias import ShannonFanoElias
sage: str = "Encode me!"
sage: sfe = ShannonFanoElias(str)
sage: encoded = sfe.encode(str); encoded
'11111101010011001000011001010100110000'
sage: sfe.decode(encoded)
'Encode me!'
TESTS:
It is an error to try to decode an encoding from another
instance. This may raise an exception or result in
a wrong decoding::
sage: str2 = "I'm another string."
sage: sfe2 = ShannonFanoElias(str2)
sage: encoded_sfe2 = sfe2.encode(str2); encoded_sfe2
'010100100101110000100000100110000111001100001011100000110110111011001010110010011111111'
sage: sfe.decode(encoded_sfe2)
Traceback (most recent call last):
...
ValueError: Input must be a string encoded by this instance.
"""
return PrefixCoding.decode(self, string)
def encoding_table(self):
r"""
Return the current encoding table.
INPUT:
- None.
OUTPUT:
- A dictionary associating an alphabetic symbol to an encoding.
EXAMPLES::
sage: from sage.coding.source_coding.shannon_fano_elias import ShannonFanoElias
sage: str = "Show my encoding table!"
sage: sfe = ShannonFanoElias(str)
sage: t = sorted(sfe.encoding_table().items())
sage: for symbol, code in t:
....: print(symbol, code)
....:
000
! 11011
S 11001
a 00000
b 00111
c 00110
d 01011
e 0100
g 01101
h 01111
i 01110
l 10010
m 10001
n 1011
o 1010
t 11100
w 11101
y 11111
"""
return PrefixCoding.encoding_table(self)
def tree(self):
r"""
Return the tree corresponding to the current encoding.
INPUT:
- None.
OUTPUT:
- The binary tree representing a Shannon-Fano code.
EXAMPLES::
sage: from sage.coding.source_coding.shannon_fano_elias import ShannonFanoElias
sage: str = "Binary tree"
sage: sfe = ShannonFanoElias(str)
sage: t = sfe.tree(); t
Digraph on 21 vertices
sage: t.show(layout='tree', vertex_size=1800, figsize=[8,8])
"""
return PrefixCoding.tree(self)