# Fun with language modelling

* [Unreasonable effectiveness of RNN](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) (Andrej Karpathy)
* [Официальный гайд от TensorFlow](https://www.tensorflow.org/tutorials/sequences/recurrent)

---

In [1]:
import numpy as np
import tensorflow as tf
import re
import os
from tensorflow.contrib import rnn
import tensorflow.contrib.layers as layers

  from ._conv import register_converters as _register_converters


## Препроцессинг (2 балл)

Возьмите какие-нибудь сырые данные. Википедия, «Гарри Поттер», «Игра Престолов», твиты Тинькова — что угодно.

Давайте для простоты делать char-level модель. Сопоставьте всем различным символам свой номер. Удобно это хранить просто в питоновском словаре (`char2idx`). Для генерации вам потребуется ещё и обратный словарь (`idx2char`).

Клёво будет ещё написать класс, который делает токенизацию и детокенизацию.

In [2]:
tf.reset_default_graph()

In [3]:
os.environ["CUDA_VISIBLE_DEVICES"] = '3'

In [4]:
dataFile = '/home/pavel/MyDocs/MachineLearning/Tinkoff/lecture6/dataset/got1.txt'
embeddingsFile = '/home/pavel/MyDocs/MachineLearning/Tinkoff/lecture6/char-embeddings.txt'

In [5]:
class Vocab:
    def __init__(self, data):
        self.char2idx = {}
        self.idx2char = {}
        counter = 0
        for line in data:
            sym = line[0]
            if sym not in self.char2idx:
                self.char2idx[sym] = counter
                self.idx2char[counter] = sym
                counter += 1
    
    def update_sequence(self, sequence):
        newSequence = ''
        for char in sequence:
            if char in self.char2idx:
                newSequence += char
            else:
                newSequence += ' '
        newSequence = re.sub(' +',' ', newSequence)
        return newSequence

    def tokenize(self, sequence):
        sequence = self.update_sequence(sequence)
        return [self.char2idx[sym] for sym in sequence]
    
    def detokenize(self, sequence):
        return ''.join([self.idx2char[num] for num in sequence])

In [6]:
with open(embeddingsFile, 'r') as file:
    embDict = file.readlines()
    vocab = Vocab(embDict)

with open(dataFile, 'r') as file:
    data = file.readlines()

### Text example:

In [7]:
print(''.join(data[240:300]))


PROLOGUE

The comet’s tail spread across the dawn, a red slash that bled above the crags of Dragonstone like a wound in the pink and purple sky.

The maester stood on the windswept balcony outside his chambers. It was here the ravens came, after long flight. Their droppings speckled the gargoyles that rose twelve feet tall on either side of him, a hellhound and a wyvern, two of the thousand that brooded over the walls of the ancient fortress. When first he came to Dragonstone, the army of stone grotesques had made him uneasy, but as the years passed he had grown used to them. Now he thought of them as old friends. The three of them watched the sky together with foreboding.

The maester did not believe in omens. And yet . . . old as he was, Cressen had never seen a comet half so bright, nor yet that color, that terrible color, the color of blood and flame and sunsets. He wondered if his gargoyles had ever seen its like. They had been here so much longer than he had, and would still be 

## Модель (1 балл)

Примерно такое должно зайти:

* Эмбеддинг
* LSTM / GRU
* Дропаут
* Линейный слой
* Softmax

In [8]:
map_fn = tf.map_fn

In [9]:
def get_idx_vec(embDict):
    idxVec = {}
    linesNum = len(embDict)
    for i in range(linesNum):
        if embDict[i]:
            char = embDict[i][0]
            if char in vocab.char2idx:
                numsStr = embDict[i][1:]
                numsList = re.split(r'\s+', numsStr)
                numsList = [float(el) for el in numsList if el]
                vec = np.array(numsList)
                idxVec[vocab.char2idx[char]] = vec
    
    return idxVec

In [10]:
idxVec = get_idx_vec(embDict)

    
# Parameters
learning_rate = 0.01
display_step = 1000
seq_len = 30
rnn_hidden = 100
keep_prob = 0.6
vocab_size = len(vocab.char2idx)
emb_size = len(idxVec[0])

# tf Graph input
inputs  = tf.placeholder(tf.int32, (seq_len, None, 1))  # (time, batch, in), here seq_len is 
outputs = tf.placeholder(tf.float32, (seq_len, None, vocab_size)) # (time, batch, out)

In [11]:
print(outputs)
print(len(idxVec[1]))

Tensor("Placeholder_1:0", shape=(30, ?, 91), dtype=float32)
300


In [12]:
def build_model(inputs):
    #Turning sequence of id into embeddings
    with tf.variable_scope('enc', reuse = tf.AUTO_REUSE):
        emb_matrix = tf.get_variable('embedding_matrix',
                                 shape=[vocab_size, emb_size],
                                 dtype=tf.float32)
    
        embs = tf.nn.embedding_lookup(emb_matrix, inputs)  # (seq_len, batch_size, 1, emb_size)
    
        new_input = tf.reduce_sum(embs, axis=3)  
    
    
        batch_size = tf.shape(inputs)[1]
        input_size = tf.shape(inputs)[2]
    
        #LSTM 
        cell = tf.nn.rnn_cell.BasicLSTMCell(rnn_hidden, state_is_tuple=True)
        initial_state = cell.zero_state(batch_size, tf.float32)
        rnn_outputs, rnn_states = tf.nn.dynamic_rnn(cell, new_input, initial_state=initial_state, time_major=True)
    
        
        dropout_outputs = tf.nn.dropout(rnn_outputs, keep_prob)
    
        
        final_projection = lambda x: layers.linear(x, num_outputs=vocab_size, activation_fn=tf.nn.softmax)
        logits = map_fn(final_projection, dropout_outputs)
    
    return logits
    

## Обучение (3 балла)

* Делайте сэмплирование предложений фиксированной длины из вашего корпуса. Можете как нарезать их изначально, так и написать генератор.
* Используйте teacher forcing.
* Выход модели — это one-hot вход, смещенный на одну позиию.
* Функция потерь: кроссэнтропия.
* Не забудьте мониторить и валидацию, и train.

In [13]:
def get_one_hot(sym):
    one_hot = np.zeros(vocab_size)
    one_hot[vocab.char2idx[sym]] = 1
    return one_hot

In [14]:
def de_one_hot(one_hot):
    return vocab.idx2char[np.argmax(one_hot)]

In [15]:
def generate_example():
    my_str = data[0]
    while len(my_str) < (seq_len + 1):
        str_num = np.random.randint(len(data))
        my_str = vocab.update_sequence(data[str_num])
    
    start = len(my_str)
    while(start + seq_len + 1 >= len(my_str)):
        start = np.random.randint(len(my_str))
    example = my_str[start:start+seq_len]
    example = vocab.tokenize(example)
    
    result = np.zeros([seq_len, vocab_size])
    
    for i in range(start+1, start+1+seq_len):
        char = my_str[i]
        result[i-start-1] = get_one_hot(char)
    return example, result
    
    
def generate_batch(batch_size):
    x = np.empty((seq_len, batch_size, 1), dtype=np.int32)
    y = np.empty((seq_len, batch_size, vocab_size), dtype=np.float32)

    for i in range(batch_size):
        ex, res = generate_example()
        x[:, i, 0] = ex
        y[:, i, :] = res
    return x, y

In [16]:
logits = build_model(inputs)
true_pred = tf.argmax(outputs, axis = 2)
pred = tf.argmax(logits, axis = 2)
accuracy = tf.reduce_sum(tf.cast(tf.math.equal(true_pred, pred), tf.int32))/tf.size(pred)

loss = tf.losses.softmax_cross_entropy(onehot_labels=outputs, logits=logits)
train_fn = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

Instructions for updating:
This class is deprecated, please use tf.nn.rnn_cell.LSTMCell, which supports all the feature this cell currently has. Please replace the existing code with tf.nn.rnn_cell.LSTMCell(name='basic_lstm_cell').


In [17]:
iterations_per_epoch = 1000
batch_size = 50

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    valid_x, valid_y = generate_batch(batch_size=50) 
    print(sess.run(tf.shape(pred), feed_dict = {inputs:valid_x, outputs:valid_y}))

Instructions for updating:
Use `tf.global_variables_initializer` instead.
[30 50]


In [22]:
batch_size = 50

texts = set()
x, y = generate_batch(batch_size)
for i in range(batch_size):
    tmp = x[:, i, 0]
    tmp = np.reshape(tmp, -1)
    text = vocab.detokenize(tmp)
    texts.add(text)
print(len(texts))

32


In [24]:
sess = tf.Session()
sess.run(tf.global_variables_initializer())
iterations_per_epoch = 300

valid_x, valid_y = generate_batch(batch_size=50) 
for epoch in range(15):
    epoch_error = 0
    for _ in range(iterations_per_epoch):
        x, y = generate_batch(batch_size)
        epoch_error += sess.run([loss, train_fn], {
            inputs: x,
            outputs: y,
        })[0]
       
        valid_accuracy = sess.run(accuracy, {
            inputs:  valid_x,
            outputs: valid_y,
        })
    epoch_error /= iterations_per_epoch
    print ("Epoch %d, train error: %.2f, valid accuracy: %.1f %%" % (epoch, epoch_error, valid_accuracy * 100.0))

Epoch 0, train error: 4.18, valid accuracy: 40.5 %
Epoch 1, train error: 4.16, valid accuracy: 41.4 %
Epoch 2, train error: 4.08, valid accuracy: 58.1 %
Epoch 3, train error: 3.94, valid accuracy: 65.9 %
Epoch 4, train error: 3.83, valid accuracy: 77.9 %
Epoch 5, train error: 3.76, valid accuracy: 78.7 %
Epoch 6, train error: 3.76, valid accuracy: 79.3 %
Epoch 7, train error: 3.75, valid accuracy: 79.7 %
Epoch 8, train error: 3.75, valid accuracy: 80.6 %
Epoch 9, train error: 3.73, valid accuracy: 83.0 %
Epoch 10, train error: 3.72, valid accuracy: 83.0 %
Epoch 11, train error: 3.71, valid accuracy: 85.6 %
Epoch 12, train error: 3.70, valid accuracy: 85.7 %
Epoch 13, train error: 3.70, valid accuracy: 85.9 %
Epoch 14, train error: 3.70, valid accuracy: 85.7 %


In [31]:
sx = np.empty((seq_len, 1, 1), dtype=np.int32)
sy = np.empty((seq_len, 1, vocab_size), dtype=np.float32)

start = 9
my_str = data[259][start:seq_len+start+1]
xInp = vocab.tokenize(my_str[:seq_len])


yInp = np.zeros([seq_len, vocab_size])
    
for i in range(1, 1+seq_len):
    char = my_str[i]
    yInp[i-1] = get_one_hot(char)

    
sx[:,0,0] = xInp
sy[:, 0, :] = yInp
sx_nums = np.reshape(sx, [-1])


test = sess.run(pred, feed_dict={inputs:sx, outputs:sy})
test = np.reshape(test, -1)
print(vocab.detokenize(sx_nums))
print(vocab.detokenize(test))

e younger man settle him behin
 tdoitidaiomt oieee ceo tdieos


## Спеллчекер (1 балла)

Из языковой модели можно сделать простенький спеллчекер: можно визуализировать лоссы на каждом символе.

Бонус: можете усреднить перплексии по словам и выделять их, а не отдельные символы.

In [None]:
from IPython.core.display import display, HTML

def print_colored(sequence, intensities, delimeter=''):
    html = delimeter.join([
        f'<span style="background: rgb({255}, {255-x}, {255-x})">{c}</span>'
        for c, x in zip(sequence, intensities) 
    ])
    display(HTML(html))

print_colored('Налейте мне эспрессо'.split(), [0, 0, 100], ' ')

sequence = 'Эту домашку нужно сдать втечении двух недель'
intensities = [0]*len(sequence)
intensities[25] = 50
intensities[26] = 60
intensities[27] = 70
intensities[31] = 150
print_colored(sequence, intensities)

In [None]:
# ...

## Генерация предложений (3 балла)

* Поддерживайте hidden state при генерации. Не пересчитывайте ничего больше одного раза.
* Прикрутите температуру: это когда при сэмплировании все логиты (то, что перед софтмаксом) делятся на какое-то число (по умолчанию 1, тогда ничего не меняется). Температура позволяет делать trade-off между разнообразием и правдоподобием (подробнее — см. блог Карпатого).
* Ваша реализация должна уметь принимать строку seed — то, с чего должно начинаться сгенерированная строка.

Если сделаете все вышеперечисленное, то получите 2 балла. Если сделаете хоть какую-то генерацию, то 1 балл.

In [None]:
def sample(length, temperature=1, seed=''):
    # ...