Seminar 3. Implementing the Shannon-Fano algorithm
-----------------------------------------------

***********************************************

*Materials:*

[Wikipedia](https://ru.wikipedia.org/wiki/LZ77)  
[Wikiconspects](http://neerc.ifmo.ru/wiki/index.php?title=%D0%90%D0%BB%D0%B3%D0%BE%D1%80%D0%B8%D1%82%D0%BC%D1%8B_LZ77_%D0%B8_LZ78)  
[Habr](https://habrahabr.ru/post/132683/)

***********************************************

1. Generating random text
-----------------------------------------------------------

In [1]:
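from faker import Factory

# Faker generator for the en_US locale
fake = Factory.create('en_US')

# Up to 1024 characters of filler text, lowercased to shrink the alphabet
fake_text = fake.text(1024).lower()
print(fake_text)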

ipsam praesentium repudiandae natus at unde rem eius. dolores corporis consectetur repellat saepe. explicabo qui voluptates ducimus reiciendis numquam. fugit adipisci suscipit numquam pariatur quae labore voluptate.
dolores sunt expedita veritatis repellendus quis vero corrupti. sint aliquam autem non maiores nulla ipsum hic. ut saepe iure ducimus impedit fugiat.
maxime recusandae totam earum nulla officia optio. consequuntur aliquid quod veniam fugit sed. dicta cumque similique sit quia nobis. vel labore explicabo optio aliquam.
deserunt nihil ratione earum sunt accusamus itaque. occaecati ut dolorum dolor minus. necessitatibus repellendus voluptatibus nobis reiciendis explicabo vel illo. illo laboriosam ducimus neque dolore illo vel nihil.
recusandae pariatur asperiores incidunt rem dolores sunt at molestiae. nesciunt autem laudantium eligendi quam modi ipsam. tenetur ex explicabo sunt assumenda corporis porro.


2. Computing the frequency of each character
--------------------------------------------------------

In [2]:
from collections import OrderedDict
from itertools import groupby

# groupby only merges adjacent equal items, so sort the characters first
sorted_text = sorted(fake_text)
groups = groupby(sorted_text)

# character -> relative frequency
words = {}

for key, val in groups:
    words[key] = len(list(val))/float(len(fake_text))

# Order the characters from most to least frequent
sorted_words = OrderedDict(sorted(words.items(), key=lambda x: x[1], reverse=True))

for key, val in sorted_words.items():
    print('{} - {}'.format(key, val))

  - 0.128509719222
i - 0.0917926565875
e - 0.0896328293737
u - 0.0799136069114
a - 0.0745140388769
s - 0.060475161987
o - 0.0561555075594
t - 0.0561555075594
r - 0.048596112311
l - 0.0475161987041
n - 0.0453563714903
m - 0.0388768898488
c - 0.0334773218143
d - 0.0334773218143
p - 0.0323974082073
. - 0.0194384449244
q - 0.0172786177106
b - 0.011879049676
v - 0.0097192224622
x - 0.00755939524838
f - 0.00539956803456

 - 0.00431965442765
g - 0.00431965442765
h - 0.00323974082073
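
The same frequency table can also be built with `collections.Counter`, which skips the explicit sort-and-group step. The sketch below is only an illustration: the helper name `char_frequencies` and the sample string are assumptions, not part of the seminar code. Printing each character with `{!r}` makes whitespace visible, whereas in the output above the newline character shows up as a line break before its frequency.

```python
from collections import Counter


def char_frequencies(text):
    """Return (char, relative frequency) pairs, most frequent first."""
    counts = Counter(text)
    total = float(len(text))
    # most_common() yields (char, count) pairs in descending order of count
    return [(ch, n / total) for ch, n in counts.most_common()]


sample = 'abracadabra'  # hypothetical sample, not the seminar text
freqs = char_frequencies(sample)

for ch, p in freqs:
    # {!r} prints the repr, so ' ' and '\n' stay readable
    print('{!r} - {:.3f}'.format(ch, p))
```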


3. Assigning each character a code based on its frequency
------------------------------------------------------------------------

In [3]:
# Create the character -> code dictionary; every code starts out empty

words_code = {}
for key in words.keys():
    words_code[key] = ''

def divide(words):
    # A group with a single character needs no further bits
    if len(words) < 2:
        return

    words = OrderedDict(sorted(words.items(), key=lambda x: x[1], reverse=True))
    s = 0
    middle = sum(words.values())*0.5
    letters_0 = set()
    # Move characters into the '0' group while its total frequency
    # gets closer to half of the whole group's frequency
    for key, val in words.items():
        if abs(s + val - middle) < abs(s - middle):
            s += val
            letters_0.add(key)
        else:
            break

    # Append one more bit to the code of every character in the group
    for key in words.keys():
        if key in letters_0:
            words_code[key] += '0'
        else:
            words_code[key] += '1'

    words_0 = dict((key, val) for key, val in words.items() if key in letters_0)
    words_1 = dict((key, val) for key, val in words.items() if key not in letters_0)

    # Split both halves recursively
    divide(words_0)
    divide(words_1)
    
divide(sorted_words)

for k, v in OrderedDict(sorted(words_code.items(), key=lambda x: x[1], reverse=True)).items():
    print('{} - {}'.format(k, v))

# Inverse mapping code -> character, used later for decoding
code_words = dict((value, key) for key, value in words_code.items())

h - 11111111
g - 11111110

 - 11111101
f - 11111100
x - 1111101
v - 1111100
b - 111101
q - 111100
. - 11101
p - 11100
d - 11011
c - 11010
m - 1100
n - 10111
l - 10110
r - 1010
o - 1001
t - 1000
s - 0111
a - 0110
u - 0101
e - 0100
i - 001
  - 000
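
The splitting step above writes into the global `words_code`. As a sketch only, the same greedy split can be written as a function that returns the code table, together with a check that no code is a prefix of another, which is the property bit-by-bit decoding relies on. The names `shannon_fano`, `split` and `is_prefix_free`, and the sample weights, are assumptions made for illustration.

```python
def shannon_fano(freqs):
    """Map each symbol to a bit string; freqs is {symbol: weight}."""
    codes = {symbol: '' for symbol in freqs}

    def split(items):
        # items: list of (symbol, weight) sorted by descending weight
        if len(items) < 2:
            return
        half = sum(w for _, w in items) / 2.0
        running, cut = 0.0, 0
        # Grow the left part while that brings its weight closer to half
        for _, w in items:
            if abs(running + w - half) < abs(running - half):
                running += w
                cut += 1
            else:
                break
        for symbol, _ in items[:cut]:
            codes[symbol] += '0'
        for symbol, _ in items[cut:]:
            codes[symbol] += '1'
        split(items[:cut])
        split(items[cut:])

    split(sorted(freqs.items(), key=lambda kv: kv[1], reverse=True))
    return codes


def is_prefix_free(codes):
    """True if no code word is a prefix of (or equal to) another one."""
    words = sorted(codes.values())
    # After sorting, a prefix is immediately followed by a word that starts with it
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))


# Hypothetical weights (counts work as well as probabilities)
codes = shannon_fano({'a': 4, 'b': 3, 'c': 2, 'd': 1})
print(codes)
print(is_prefix_free(codes))
```

Like the seminar code, this sketch leaves a one-symbol alphabet with an empty code, so that case would need separate handling.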


4. Encoding the source text with the resulting codes
------------------------------------------------------

In [4]:
# Concatenate the code of every character of the source text
encoded_text = ''.join(words_code[letter] for letter in fake_text)

print(encoded_text)

0011110001110110110000011100101001100100011101001011110000010101110000010100100111000101110110010110101111101101100100000101110110100001010111000011010000000101101111101101000001010010011000000100001010101111110100011011100110110100110100100011100011010100110101110010011010001011100011010100110111011101001101010000100100001011010000101001001110001001011010110011010000000111011001001110001001110100001001111101111001011000111010011011110110010001111000101001000111110010011011001011110010000110100001000111000110110101110100011100010101110001010010000111010001010010111110110010111000101110101110011110001010110110011101000111111000101111111100011000000011011011001111000010111110100010000111010101111101000111100001100000010111010111001111000101011011000001110001101010001011010000101101000011110001010110010000010110011011110110011010010000011111001001101100101111001000011010000100111011111110111011100110110100110100100011100001110101101111000000010011111011110001001101100110000110000111110001
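
The printed result is a string of `'0'` and `'1'` characters, so stored as text it still takes one byte per bit; the saving shows up only when the bits are counted (or packed). A minimal sketch of that comparison, assuming a small hypothetical code table and sample string rather than the ones produced above:

```python
# Hypothetical code table for a four-letter alphabet (not the table from step 3)
codes = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
text = 'abacabad'

# str.join builds the bit string in one pass instead of repeated +=
encoded = ''.join(codes[ch] for ch in text)

fixed_bits = 8 * len(text)   # one byte per character
coded_bits = len(encoded)    # each '0'/'1' character stands for one bit

print(encoded)
print('{} bits instead of {} ({:.0%})'.format(
    coded_bits, fixed_bits, coded_bits / float(fixed_bits)))
```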

In [5]:
current_code = ''
decoded_text = ''

# The codes are prefix-free, so the first match while reading bit by bit is the right one
for letter in encoded_text:
    current_code += letter
    if current_code in code_words:
        decoded_text += code_words[current_code]
        current_code = ''

print(decoded_text)

ipsam praesentium repudiandae natus at unde rem eius. dolores corporis consectetur repellat saepe. explicabo qui voluptates ducimus reiciendis numquam. fugit adipisci suscipit numquam pariatur quae labore voluptate.
dolores sunt expedita veritatis repellendus quis vero corrupti. sint aliquam autem non maiores nulla ipsum hic. ut saepe iure ducimus impedit fugiat.
maxime recusandae totam earum nulla officia optio. consequuntur aliquid quod veniam fugit sed. dicta cumque similique sit quia nobis. vel labore explicabo optio aliquam.
deserunt nihil ratione earum sunt accusamus itaque. occaecati ut dolorum dolor minus. necessitatibus repellendus voluptatibus nobis reiciendis explicabo vel illo. illo laboriosam ducimus neque dolore illo vel nihil.
recusandae pariatur asperiores incidunt rem dolores sunt at molestiae. nesciunt autem laudantium eligendi quam modi ipsam. tenetur ex explicabo sunt assumenda corporis porro.
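
Comparing the two printouts by eye is error-prone, so a round-trip check is a natural last step. The sketch below wraps the decoding loop in a hypothetical `decode` helper and tests it on an assumed four-letter code table; it also reports leftover bits, which the loop above would silently drop.

```python
def decode(bits, codes):
    """Decode a '0'/'1' string with a prefix-free {symbol: code} table."""
    by_code = {code: symbol for symbol, code in codes.items()}
    out, current = [], ''
    for bit in bits:
        current += bit
        if current in by_code:
            out.append(by_code[current])
            current = ''
    if current:
        # Bits left over at the end do not match any code word
        raise ValueError('trailing bits do not form a code word: ' + current)
    return ''.join(out)


# Hypothetical code table and text, used only for this check
codes = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
text = 'abacabad'

bits = ''.join(codes[ch] for ch in text)
restored = decode(bits, codes)
print(restored == text)
```

In the notebook itself the equivalent check would be `decoded_text == fake_text`.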
