Seminar 6. Implementing the LZW algorithm
-----------------------------------------------

***********************************************

*Materials:*

[Wikipedia](https://ru.wikipedia.org/wiki/LZ77)  
[Wikiconspects](http://neerc.ifmo.ru/wiki/index.php?title=%D0%90%D0%BB%D0%B3%D0%BE%D1%80%D0%B8%D1%82%D0%BC%D1%8B_LZ77_%D0%B8_LZ78)  
[Habr](https://habrahabr.ru/post/132683/)

***********************************************

1. Define the helper functions
-----------------------------------------------------------------------------

`find_pattern` - returns the longest word from the list that the text starts with  
`int_to_byte` - converts a number to a zero-padded binary string of the given width
*****************************************************************************

In [1]:
def find_pattern(dictionary, text):
    # Try the words from longest to shortest, so the first match is the longest one
    for element in sorted(dictionary, key=len, reverse=True):
        # zip stops at the shorter sequence, so skip words longer than the text
        if len(element) <= len(text):
            # If every pair matches, the text starts with this word
            if all(a == b for a, b in zip(element, text)):
                return element

def int_to_byte(element, base):
    # Zero-padded binary representation, `base` characters wide
    return ('{:0' + str(base) + 'b}').format(element)
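
To see what the two helpers do on a small input, here is a minimal sketch. The names `longest_prefix` and `to_bits` are illustrative stand-ins written for this example (they are not the seminar functions), and they use `str.startswith` and the built-in `format` instead of `zip` and string concatenation.

```python
def longest_prefix(words, text):
    # Hypothetical equivalent of find_pattern: keep the words that the
    # text starts with and return the longest one (None if there is none).
    matches = [w for w in words if text.startswith(w)]
    return max(matches, key=len) if matches else None

def to_bits(value, width):
    # Same idea as int_to_byte: zero-padded binary string of the given width.
    return format(value, '0{}b'.format(width))

prefix = longest_prefix(['a', 'ab', 'abc', 'b'], 'abd')
bits = to_bits(5, 4)
print(prefix)  # ab
print(bits)    # 0101
```

Note that the width is a minimum, not a limit: a value that needs more bits than `width` is printed in full rather than truncated.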

2. Generate a random text
-----------------------------------------------------------

In [2]:
from faker import Factory

fake = Factory.create('en_US')

text = fake.text(256)
print(text)

Pariatur enim fugiat delectus occaecati temporibus corporis. Molestiae fugiat est at quasi soluta accusamus aspernatur aperiam. Sapiente iste dolores a amet veniam architecto. Accusantium iure animi commodi temporibus.


3. Build the list of unique characters that occur in the text
--------------------------------------------------------------------------

In [3]:
letters = sorted(set(text))
print(letters)

[u' ', u'.', u'A', u'M', u'P', u'S', u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'l', u'm', u'n', u'o', u'p', u'q', u'r', u's', u't', u'u', u'v']
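
The list above has 26 distinct characters, so the first codes need 5 bits each, since 2^4 = 16 < 26 <= 32 = 2^5. A minimal sketch of that calculation follows; the variable names are illustrative, and the integer-only variant is offered as an alternative that sidesteps floating-point logarithms.

```python
import math

alphabet_size = 26  # number of distinct characters in the sample alphabet

# Smallest number of bits that can represent the codes 0 .. alphabet_size - 1
width_log = int(math.ceil(math.log(alphabet_size, 2)))

# Integer-only alternative: bit length of the largest code
width_int = (alphabet_size - 1).bit_length()

print(width_log)  # 5
print(width_int)  # 5
```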


4. Encoding function
----------------------------------------------------------

In [4]:
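import math

def encode_lzw(text, dictionary):
    # Map every phrase to its integer code; the alphabet gets codes 0 .. N-1
    temp_dictionary = dict((element, index) for index, element in enumerate(dictionary))
    codes = []
    encoded_text = ''

    while len(text):
        # Longest dictionary phrase that the remaining text starts with
        element = find_pattern(temp_dictionary.keys(), text)

        # The code width depends on the dictionary size before the new phrase
        # is added: only phrases that already exist can be emitted at this step
        current_base = max(1, int(math.ceil(math.log(len(temp_dictionary), 2))))
        code = temp_dictionary[element]
        codes.append(code)
        encoded_text += int_to_byte(code, current_base)

        # Register the matched phrase extended by the next character
        if len(text) > len(element):
            temp_dictionary[element + text[len(element)]] = len(temp_dictionary)

        text = text[len(element):]

    return codes, encoded_text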
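
To follow the dictionary growth by hand, it helps to look at the integer codes before they are packed into bits. The sketch below is a compact, self-contained LZW encoder written for this illustration (the name `lzw_codes` is hypothetical, and it scans character by character instead of calling `find_pattern`). It assumes every character of the text is present in the alphabet.

```python
def lzw_codes(text, alphabet):
    # Phrase -> code table, seeded with the single characters.
    table = dict((ch, i) for i, ch in enumerate(alphabet))
    codes = []
    phrase = ''
    for ch in text:
        if phrase + ch in table:
            # Keep extending the current phrase while it is still known.
            phrase += ch
        else:
            # Emit the longest known phrase and register its extension.
            codes.append(table[phrase])
            table[phrase + ch] = len(table)
            phrase = ch
    if phrase:
        # Flush the last phrase.
        codes.append(table[phrase])
    return codes, table

codes, table = lzw_codes('abababa', ['a', 'b'])
print(codes)                                  # [0, 1, 2, 4]
print(sorted(table, key=table.get))           # ['a', 'b', 'ab', 'ba', 'aba']
```

The last code, 4, refers to the phrase `aba`, which was added to the table only one step earlier.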
       

5. Encoding result
------------------------------------------------------------------

In [5]:
codes, encoded_text = encode_lzw(text, letters)
print(encoded_text)

0010000110101010111000110101111100001010100000000101001000100111001000000000000101101100000110001110101011100000000100100101000111100101000100001111101011000000001001000100000100000011011000101111000111000000001011100101001000000100110010010001110000001110011000011010000010001000010100000100111000010110000000100000000000011001001001100000010110001011100111010001010010011101010010101011010001010100010000000001111000000000010100001100000001100010110011110000101101001111001100000101110000110101101001101111000101000011000100001000101101101000101100010011000101000101010010001001111001000001011010110111100111001101010100110000001010000110001001100011100100011011111000000000000111001010001010101000000100101001111010000100000101000110100011001100110101000001010001011000001100100100011000111010010011000011011000010000000110100001110001111100011001000010010010011000000001001101000000101100000011000010001010100100001100000100110000011100010000001010100100111000010010100111100010001110001000000010

6. Decoding function
-----------------------------------------------------------

In [6]:
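import math

def decode_lzw(text, dictionary):
    # Map every integer code back to its phrase
    temp_dictionary = dict((index, element) for index, element in enumerate(dictionary))
    decoded_text = ''
    prev_letter = ''

    while len(text):
        # After the first code the encoder's dictionary is one phrase ahead of
        # ours, so the width is computed from our size plus one
        size = len(temp_dictionary) + (1 if prev_letter else 0)
        current_base = max(1, int(math.ceil(math.log(size, 2))))
        element = int(text[:current_base], 2)

        if element in temp_dictionary:
            phrase = temp_dictionary[element]
        else:
            # The code refers to the phrase that is about to be added:
            # it is the previous phrase followed by its own first character
            phrase = prev_letter + prev_letter[0]

        if prev_letter:
            temp_dictionary[len(temp_dictionary)] = prev_letter + phrase[0]
        prev_letter = phrase

        decoded_text += phrase
        text = text[current_base:]

    return decoded_text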
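
One case worth checking in any LZW decoder is a code that points at a phrase the decoder has not registered yet. This happens when the encoder uses a phrase in the very next step after creating it, as in `abababa`. The sketch below works on integer codes rather than on the bit string, and `lzw_decode_codes` is a hypothetical name chosen for this example.

```python
def lzw_decode_codes(codes, alphabet):
    if not codes:
        return ''
    # Code -> phrase table, seeded with the single characters.
    table = dict((i, ch) for i, ch in enumerate(alphabet))
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:
            # Unknown code: it must be the phrase being created right now,
            # i.e. the previous phrase plus its own first character.
            entry = prev + prev[0]
        # Reconstruct the phrase the encoder added one step earlier.
        table[len(table)] = prev + entry[0]
        out.append(entry)
        prev = entry
    return ''.join(out)

decoded = lzw_decode_codes([0, 1, 2, 4], ['a', 'b'])
print(decoded)  # abababa
```

When code 4 arrives, the table only holds codes 0 to 3, so the `else` branch is what produces `aba`.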

7. Decoding result
-------------------------------------------------

In [7]:
print(decode_lzw(encoded_text, letters) + '\n')
print(text)

Pariatur enim fugiat delectus occaecati temporibus corporis. Molestiae fugiat est at quasi soluta accusamus aspernatur aperiam. Sapiente iste dolores a amet veniam architecto. Accusantium iure animi commodi temporibus.

Pariatur enim fugiat delectus occaecati temporibus corporis. Molestiae fugiat est at quasi soluta accusamus aspernatur aperiam. Sapiente iste dolores a amet veniam architecto. Accusantium iure animi commodi temporibus.
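
Besides comparing the two texts, it can be useful to check how much space the bit string takes relative to the input. The sketch below assumes a baseline of 8 bits per character, which is an assumption of this example and not something the seminar code fixes. The name `compression_ratio` and the sample values are illustrative, and the text is assumed to be non-empty.

```python
def compression_ratio(text, encoded_bits, bits_per_char=8):
    # Encoded size divided by original size; below 1.0 means the output is smaller.
    original_bits = len(text) * bits_per_char
    return float(len(encoded_bits)) / original_bits

sample_text = 'abababa'                    # 7 characters, 56 bits at 8 bits each
# Codes 0, 1, 2, 4 written with widths 1, 2, 2 and 3 bits
sample_bits = '0' + '01' + '10' + '100'
ratio = compression_ratio(sample_text, sample_bits)
print(ratio)
```

On the notebook variables the same call would be `compression_ratio(text, encoded_text)`.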
