<h1 align="center">Organization Info</h1> 

* Deadline: **DD MM 2018 23:59** for all groups.
* Submit a notebook with detailed comments as your solution (<span style='color:red'>the contest result will not be counted unless the solution is submitted</span>).
* <span style='color:red'>The team name in the contest must follow the template GroupNumber_FirstName_LastName, for example, 594_Ivan_Ivanov</span>.

**Homework submission**: 
- Send the completed assignment to ``ml.course.mipt@gmail.com``
- Use the following subject line format: ``ML2018_fall_<group_number>_<last_name>``, for example ``ML2018_fall_495_ivanov``
- Save the completed homework to a file named ``<last_name>_<group>_task<number>.ipynb``, for example ``ivanov_401_task7.ipynb``

**Questions**:
- Send questions to ``ml.course.mipt@gmail.com``
- Use the following subject line format: ``ML2018_fall Question <question summary>``


--------
- **PS1:** We use automatic filters and simply will not find your homework if you label it carelessly.
- **PS2:** A missed deadline reduces the maximum weight of the assignment according to the formula given at the first seminar.
- **PS3:** You may modify the code provided below if you consider it necessary.

<h1 align="center">Checking Questions</h1> 

**Question 1**: In what ways is an LSTM better or worse than a plain RNN?

<Answer>

**Question 2**: Write out the derivative $\frac{d c_{n+1}}{d c_{k}}$ for an LSTM (see http://colah.github.io/posts/2015-08-Understanding-LSTMs/) and explain the formula: when does the derivative vanish, and when does it explode?

<Answer>

**Question 3**: Why is TBPTT needed, and what is wrong with plain BPTT?

<Answer>


**Question 4**: How can recurrent and convolutional networks be combined, and, more importantly, why? Give several examples of real-world tasks.

<Answer>

**Question 5**: Can convolutional networks be used for text classification? If not, justify it :D; if so, how? How do you handle inputs of arbitrary length?

<Answer>

**Question 6**: Attention: what is it, where is it applied, and how? Give an example of its use on some task.

<Answer>

## Grading
* starting at zero points
* +2 for describing your iteration path in a report below (compare models).
* +2 for correct answers to the checking questions
* +3 (7 total) for 99% accuracy with simple NMT model on __TEST__ dataset
* +3 (10 total) for 99% accuracy with attention NMT model on __TEST__ dataset
----
* tatoeba bonus for accuracy on __TEST__ dataset:
    * +2 for report
    * 60% (14 total)
    * 65% (16 total)
    * 70% (18 total)
    * 75% (20 total)
    
## Bonus points

Common ways to get bonus points are:
* Get a higher score, obviously.
* Anything special about your NN. For example "A super-small/fast NN that gets 99%" gets a bonus.
* Any detailed analysis of the results. (attention maps, whatever)

---

In [1]:
# additional packages for this notebook

In [2]:
# ! pip install faker tqdm babel

## Task - translation

Machine translation is an old and well-known field in natural language processing. Since the 1950s, scientists have tried to build models that automatically translate from, say, French to English. Today this is possible, and the attention mechanism plays a large part in it. Here is an example image with an attention map for the neural machine translation of a sample phrase:
<p align="center">
  <img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/12/Screen-Shot-2015-12-30-at-1.23.48-PM.png" width="400">
</p>

In our lab we will concentrate on a much simpler task: translating a human-readable date into a machine-readable one.

To do this we need one more concept: sequence-to-sequence language modeling.
The idea of this architecture is shown here:
<p align="center">
<img src="./img/simple_nmt.jpg" width="400">
</p>

There is an Embedding layer at the bottom, an RNN in the middle, and a softmax at the output.
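To make the last stage concrete, here is a minimal pure-Python sketch of what a softmax output does at each time step: it turns a vector of scores into a probability distribution over output characters, and the predicted character is the most probable one. The toy vocabulary and the scores are made up for illustration and do not come from any trained model.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output vocabulary and per-step scores (one row per time step).
toy_vocab = ['0', '1', '-', '<pad>']
toy_scores = [
    [0.1, 2.5, 0.3, -1.0],   # step 0: highest score at '1'
    [3.0, 0.2, 0.1, -2.0],   # step 1: highest score at '0'
    [0.0, 0.1, 4.0, 0.5],    # step 2: highest score at '-'
]

probs = [softmax(row) for row in toy_scores]
# Greedy decoding: take the most probable character at every step.
decoded = [toy_vocab[max(range(len(p)), key=p.__getitem__)] for p in probs]
print(decoded)
```

The same argmax-per-step decoding is what `run_example` later applies to the model's predictions.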

In [249]:
from keras.layers import Embedding, Bidirectional, Dot
from keras.layers.core import *
from keras.layers.recurrent import LSTM
from keras.models import *
from keras.layers.merge import Multiply, Concatenate
from keras.utils import to_categorical
from keras.layers import TimeDistributed

import keras.backend as K
import numpy as np

---

### Data

Now we need to generate data: dates written in various text formats as input, and the same dates in one fixed format as output.
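As a quick illustration of the task, the sketch below renders one date in several human-readable ways and shows the single machine-readable target they all share. It uses only the standard library and numeric `strftime` formats chosen for the example; the notebook itself builds its inputs with `babel` patterns.

```python
from datetime import date

d = date(1971, 9, 14)

# A few human-readable renderings of the same date (illustrative formats only).
human_variants = [
    d.strftime("%d.%m.%y"),
    d.strftime("%d %m %Y"),
    d.strftime("%Y/%m/%d"),
]

# The single machine-readable target shared by all variants.
machine_readable = d.isoformat()

for h in human_variants:
    print(repr(h), "->", machine_readable)
```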

In [4]:
from faker import Faker
import random
from tqdm import tqdm
from babel.dates import format_date
import numpy as np

In [5]:
fake = Faker()

FORMATS = ['short',
           'medium',
           'long',
           'full',
           'd MMM YYY', 
           'd MMMM YYY',
           'dd MMM YYY',
           'd MMM, YYY',
           'd MMMM, YYY',
           'dd, MMM YYY',
           'd MM YY',
           'd MMMM YYY',
           'MMMM d YYY',
           'MMMM d, YYY',
           'dd.MM.YY']

# change this if you want it to work with another language
LOCALES = ['en_US']

In [6]:
def create_date():
    """
        Creates some fake dates 
        :returns: tuple containing human readable string, machine readable string, and date object
    """
    dt = fake.date_object()

    try:
        human_readable = format_date(dt, format=random.choice(FORMATS), locale=random.choice(LOCALES))

        case_change = random.choice([0,1,2])
        if case_change == 1:
            human_readable = human_readable.upper()
        elif case_change == 2:
            human_readable = human_readable.lower()
        # if case_change == 0, do nothing

        machine_readable = dt.isoformat()
    except AttributeError as e:
        return None, None, None

    return human_readable, machine_readable, dt

In [7]:
def create_dataset(n_examples):
    """
        Creates a dataset with n_examples and vocabularies
        :n_examples: the number of examples to generate
    """
    human_vocab = set()
    machine_vocab = set()
    dataset = []

    for i in tqdm(range(n_examples)):
        h, m, _ = create_date()
        if h is not None:
            dataset.append((h, m))
            human_vocab.update(tuple(h))
            machine_vocab.update(tuple(m))

    human = dict(zip(list(human_vocab) + ['<unk>', '<pad>'], 
                     list(range(len(human_vocab) + 2))))
    inv_machine = dict(enumerate(list(machine_vocab) + ['<unk>', '<pad>']))
    machine = {v:k for k,v in inv_machine.items()}
 
    return dataset, human, machine, inv_machine

In [8]:
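def string_to_int(string, length, vocab):
    """Encode `string` as a list of `length` vocabulary indices (truncate or pad)."""
    if len(string) > length:
        string = string[:length]

    # Characters missing from the vocabulary map to the index of '<unk>'.
    rep = list(map(lambda x: vocab.get(x, vocab['<unk>']), string))

    if len(string) < length:
        rep += [vocab['<pad>']] * (length - len(string))

    return rep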

In [9]:
def int_to_string(ints, inv_vocab):
    return [inv_vocab[i] for i in ints]
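A small round trip shows what the two helpers are meant to do: truncate or pad to a fixed length, map characters to indices, and map the indices back. The sketch re-implements the idea under hypothetical names (`encode`, `decode`, `toy_vocab`) so that it runs on its own. Mapping unknown characters to the index of `<unk>` is an assumption about the intended behavior.

```python
def encode(text, length, vocab):
    """Map characters to indices, then truncate or pad to `length`."""
    ids = [vocab.get(ch, vocab['<unk>']) for ch in text[:length]]
    return ids + [vocab['<pad>']] * (length - len(ids))

def decode(ids, inv_vocab):
    """Map indices back to characters."""
    return [inv_vocab[i] for i in ids]

# Toy vocabulary: digits and '-' plus the two special tokens.
toy_vocab = {ch: i for i, ch in enumerate(list('0123456789-') + ['<unk>', '<pad>'])}
inv_toy_vocab = {i: ch for ch, i in toy_vocab.items()}

ids = encode('1971-09-14', 12, toy_vocab)
print(ids)
print(''.join(decode(ids, inv_toy_vocab)))
print(encode('19x', 4, toy_vocab))  # 'x' is not in the toy vocabulary
```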

Actually generating data:

In [10]:
fake.seed(42)
random.seed(42)
N = int(3e5)
dataset, human_vocab, machine_vocab, inv_machine_vocab = create_dataset(N)

100%|██████████| 300000/300000 [00:11<00:00, 25886.16it/s]


In [11]:
dataset[2]

('tuesday, september 14, 1971', '1971-09-14')

In [14]:
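# TIME_STEPS is used here before the Part 1 cell defines it, so set it explicitly.
TIME_STEPS = 20

inputs, targets = zip(*dataset)
# Encode each string as a fixed-length sequence of vocabulary indices.
inputs = np.array([string_to_int(i, TIME_STEPS, human_vocab) for i in inputs])
targets = [string_to_int(t, TIME_STEPS, machine_vocab) for t in targets]
# One-hot encode the targets: shape (n_examples, TIME_STEPS, len(machine_vocab)).
targets = np.array(list(map(lambda x: to_categorical(x, num_classes=len(machine_vocab)), targets)))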

In [15]:
X_train, y_train, X_valid, y_valid, X_test, y_test = (
    inputs[:int(2e5)], targets[:int(2e5)], 
    inputs[int(2e5):-int(5e4)], targets[int(2e5):-int(5e4)],  
    inputs[-int(5e4):], targets[-int(5e4):], )

---

### Part 1: Simple NMT

In [98]:
# :good-enough:
ENCODER_UNITS = 32 # change me if u want
DECODER_UNITS = 32 # change me if u want
TIME_STEPS = 20 # change me if u want

In [119]:
# input - [bs; in_time_len]
# output - [bs; out_time_len; out_chars]; the 10-character date is padded to out_time_len=TIME_STEPS

def model_simple_nmt(in_chars, out_chars):
    # RNN encoder -> hidden representation -> RNN decoder
    
    inputs = Input(shape=(TIME_STEPS,))
    
    # your code
    e = Embedding(in_chars, TIME_STEPS, input_length=TIME_STEPS)(inputs)
    encoder_outputs, state_h, state_c = LSTM(ENCODER_UNITS, return_sequences=True, return_state=True)(e)
    encoder_states = [state_h, state_c]
    
    decoder_outputs,  _,  _ = LSTM(DECODER_UNITS, return_sequences=True, return_state=True)(
        e, initial_state=encoder_states
    )
    d2 = Dense(out_chars)(decoder_outputs)
    output = Activation('softmax')(d2)
    # your code

    model = Model(inputs=[inputs], outputs=output)
    return model

In [120]:
TIME_STEPS

20

In [121]:
m = model_simple_nmt(len(human_vocab), len(machine_vocab))

m.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
print(m.summary())

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_24 (InputLayer)           (None, 20)           0                                            
__________________________________________________________________________________________________
embedding_16 (Embedding)        (None, 20, 20)       1200        input_24[0][0]                   
__________________________________________________________________________________________________
lstm_28 (LSTM)                  [(None, 20, 32), (No 6784        embedding_16[0][0]               
__________________________________________________________________________________________________
lstm_29 (LSTM)                  [(None, 20, 32), (No 6784        embedding_16[0][0]               
                                                                 lstm_28[0][1]                    
          



In [122]:
y_valid[0]

array([[ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  0., 

In [123]:
# validation_data takes precedence over validation_split, so only one is passed.
m.fit(
    [X_train], y_train, 
    validation_data=(X_valid, y_valid),
    epochs=10, batch_size=64)

Train on 200000 samples, validate on 50000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f7604ee2240>

In [124]:
m.evaluate([X_test], y_test)



[0.016672639101073146, 0.99297300113677978]
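Keep in mind that the `accuracy` metric here is averaged over output positions, and half of the 20 positions in every target are `<pad>`, so the number can look better than the share of fully correct dates. A stricter check is exact-match accuracy over whole strings. The sketch below is pure Python with hypothetical names (`exact_match_accuracy` and the toy lists). It assumes that predictions and references have already been decoded into strings with the padding removed.

```python
def exact_match_accuracy(predicted, expected):
    """Share of examples whose predicted string equals the reference exactly."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)

# Toy data: three of the four dates are fully correct.
toy_predicted = ['1979-05-03', '2009-04-05', '2016-02-20', '2007-06-00']
toy_expected = ['1979-05-03', '2009-04-05', '2016-02-20', '2007-07-10']

print(exact_match_accuracy(toy_predicted, toy_expected))  # 0.75
```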

Let's check our model:

In [125]:
EXAMPLES = ['3 May 1979', '5 Apr 09', '20th February 2016', 'Wed 10 Jul 2007']

def run_example(model, input_vocabulary, inv_output_vocabulary, text):
    encoded = string_to_int(text, TIME_STEPS, input_vocabulary)
    prediction = model.predict(np.array([encoded]))
    prediction = np.argmax(prediction[0], axis=-1)
    return int_to_string(prediction, inv_output_vocabulary)

def run_examples(model, input_vocabulary, inv_output_vocabulary, examples=EXAMPLES):
    predicted = []
    for example in examples:
        predicted.append(''.join(run_example(model, input_vocabulary, inv_output_vocabulary, example)))
        print('input:', example)
        print('output:', predicted[-1])
    return predicted

In [126]:
run_examples(m, human_vocab, inv_machine_vocab)

input: 3 May 1979
output: 1979-05-03<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
input: 5 Apr 09
output: 2009-04-05<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
input: 20th February 2016
output: 2016-02-20<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
input: Wed 10 Jul 2007
output: 2007-06-00<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>


['1979-05-03<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>',
 '2009-04-05<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>',
 '2016-02-20<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>',
 '2007-06-00<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>']
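The decoded strings above still carry the `<pad>` tokens that fill the output up to `TIME_STEPS`. A tiny post-processing helper can cut them off before the results are shown or compared. `strip_padding` is a hypothetical name and not part of the notebook. It assumes a token list shaped like the one returned by `int_to_string`.

```python
def strip_padding(tokens, pad_token='<pad>'):
    """Join the tokens that come before the first padding token."""
    result = []
    for tok in tokens:
        if tok == pad_token:
            break
        result.append(tok)
    return ''.join(result)

# Token list shaped like a decoded prediction: 10 characters followed by 10 pads.
toy_tokens = list('1979-05-03') + ['<pad>'] * 10
print(strip_padding(toy_tokens))  # 1979-05-03
```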

---

### Part 2: All u need is attention

Here we use a more complex idea than simple seq2seq: we add two explicit parts to our network, an encoder and a decoder, with attention applied to the decoder. An explanatory picture of this idea is below:
<p align="center"><img src="https://i.stack.imgur.com/Zwsmz.png"></p>

The lower part of the network encodes the input into a hidden intermediate representation, and the upper part decodes the hidden representation into readable output.
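For intuition, a single attention step can be sketched in a few lines of pure Python. The example uses dot-product scoring, which is one common variant and not necessarily the one in the picture: the decoder state is compared with every encoder output, the scores are normalized with a softmax, and the context vector is the weighted sum of the encoder outputs. All names and numbers are illustrative assumptions.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_attention(decoder_state, encoder_outputs):
    """Return (weights, context) for one decoder step.

    weights[t] says how much the decoder attends to encoder step t;
    context is the weighted sum of the encoder outputs.
    """
    scores = [sum(d * h for d, h in zip(decoder_state, enc)) for enc in encoder_outputs]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [sum(w * enc[j] for w, enc in zip(weights, encoder_outputs)) for j in range(dim)]
    return weights, context

# Toy numbers: 3 encoder steps, hidden size 2.
toy_encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_decoder_state = [0.0, 5.0]

weights, context = dot_attention(toy_decoder_state, toy_encoder_outputs)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

Plotting `weights` for every decoder step against the input characters gives an attention map like the ones mentioned under bonus points.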

In [368]:
# :good-enough:
ENCODER_UNITS = 32 # change me if u want
DECODER_UNITS = 32 # change me if u want
TIME_STEPS = 20 # change me if u want

In [369]:
def model_attention_nmt(in_chars, out_chars):
    # RNN encoder -> hidden representation -> RNN decoder
    inputs = Input(shape=(TIME_STEPS,))
    
    # your code
    e = Embedding(in_chars, TIME_STEPS, input_length=TIME_STEPS)(inputs)
    encoder_outputs, en_state_h, en_state_c = LSTM(ENCODER_UNITS, return_sequences=True, return_state=True)(e)
    encoder_states = [en_state_h, en_state_c]
    
    f = Flatten()(encoder_outputs)
    d = Dense(DECODER_UNITS)(f)

    f2 = Flatten()(encoder_outputs)
    d2 = Dense(DECODER_UNITS)(f2)
    
    decoder_outputs,  de_state_h, de_state_c = LSTM(DECODER_UNITS, return_sequences=True, return_state=True)(
        e, initial_state=[d, d2]
    )


    d2 = Dense(out_chars)(decoder_outputs)
    output = Activation('softmax')(d2)
    # your code
    
    model = Model(inputs=[inputs], outputs=output)
    return model

In [370]:
m = model_attention_nmt(len(human_vocab), len(machine_vocab))

m.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
print(m.summary())

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_132 (InputLayer)          (None, 20)           0                                            
__________________________________________________________________________________________________
embedding_124 (Embedding)       (None, 20, 20)       1200        input_132[0][0]                  
__________________________________________________________________________________________________
lstm_233 (LSTM)                 [(None, 20, 32), (No 6784        embedding_124[0][0]              
__________________________________________________________________________________________________
flatten_37 (Flatten)            (None, 640)          0           lstm_233[0][0]                   
__________________________________________________________________________________________________
flatten_38



In [371]:
X_train.shape

(200000, 20)

In [372]:
# validation_data takes precedence over validation_split, so only one is passed.
m.fit(
    [X_train], y_train, 
    validation_data=(X_valid, y_valid),
    epochs=15, batch_size=64)

Train on 200000 samples, validate on 50000 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7f75d9b90400>

In [367]:
m.evaluate([X_test], y_test)



[0.018978905791323632, 0.9928760002136231]

### Report

* final architectures
* comparison
* as well as training method and tricks


---

## Part 3*: tatoeba - real NMT

### Data

In [373]:
# dataset from http://www.manythings.org/anki/

In [374]:
! wget http://www.manythings.org/anki/rus-eng.zip

--2018-05-05 16:15:31--  http://www.manythings.org/anki/rus-eng.zip
Resolving www.manythings.org (www.manythings.org)... 104.24.109.196, 104.24.108.196, 2400:cb00:2048:1::6818:6cc4, ...
Connecting to www.manythings.org (www.manythings.org)|104.24.109.196|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6366669 (6.1M) [application/zip]
Saving to: 'rus-eng.zip'


2018-05-05 16:15:33 (2.80 MB/s) - 'rus-eng.zip' saved [6366669/6366669]



In [375]:
! unzip ./rus-eng.zip

Archive:  ./rus-eng.zip
  inflating: rus.txt                 
  inflating: _about.txt              


In [376]:
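# The file contains Cyrillic text, so read it explicitly as UTF-8.
with open("./rus.txt", encoding="utf-8") as fin:
    data = fin.readlines()
# Strip the trailing newline and lowercase every line.
data = list(map(lambda x: x.replace("\n", "").lower(), data))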

In [377]:
len(data)

300108

In [378]:
data = data[:int(1e5)]

In [379]:
len(data)

100000

In [385]:
for i in range(1, 5):
    print(data[-i])

tom is here to see you.	к тебе том пришёл.
tom is here to help us.	том здесь, чтобы помочь нам.
tom is here to help us.	том пришёл нам помочь.
tom is having a crisis.	у тома кризис.


----

In [386]:
source = list(map(lambda x: x.split("\t")[0], data))
target = list(map(lambda x: x.split("\t")[1], data))

In [387]:
source_vocab = set("".join(source).strip())
target_vocab = set("".join(target).strip())

In [388]:
source_vocab = dict(zip(
    list(source_vocab) + ['<unk>', '<pad>'], 
    list(range(len(source_vocab) + 2))))
target_vocab = dict(zip(
    list(target_vocab) + ['<unk>', '<pad>'], 
    list(range(len(target_vocab) + 2))))
# target_vocab is already a char -> index dict here, so invert it directly.
inv_target_vocab = {i: ch for ch, i in target_vocab.items()}

In [393]:
TIME_STEPS = 32
ENCODER_UNITS = 32 # 256
DECODER_UNITS = 32 # 256

In [394]:
def model_simple_nmt_tatoeba(in_chars, out_chars):
    inputs = Input(shape=(TIME_STEPS,))
    
    # your code
    e = Embedding(in_chars, TIME_STEPS, input_length=TIME_STEPS)(inputs)
    encoder_outputs, en_state_h, en_state_c = LSTM(ENCODER_UNITS, return_sequences=True, return_state=True)(e)
    encoder_states = [en_state_h, en_state_c]
    
    f = Flatten()(encoder_outputs)
    d = Dense(DECODER_UNITS)(f)

    f2 = Flatten()(encoder_outputs)
    d2 = Dense(DECODER_UNITS)(f2)
    
    decoder_outputs,  de_state_h, de_state_c = LSTM(DECODER_UNITS, return_sequences=True, return_state=True)(
        e, initial_state=[d, d2]
    )


    d2 = Dense(out_chars)(decoder_outputs)
    output = Activation('softmax')(d2)
    # your code

    model = Model(inputs=[inputs], outputs=output)
    return model

In [395]:
# Build the Tatoeba model with the Tatoeba vocabularies, not the date vocabularies.
m = model_simple_nmt_tatoeba(len(source_vocab), len(target_vocab))

m.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
print(m.summary())

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_134 (InputLayer)          (None, 32)           0                                            
__________________________________________________________________________________________________
embedding_126 (Embedding)       (None, 32, 32)       1920        input_134[0][0]                  
__________________________________________________________________________________________________
lstm_237 (LSTM)                 [(None, 32, 32), (No 8320        embedding_126[0][0]              
__________________________________________________________________________________________________
flatten_41 (Flatten)            (None, 1024)         0           lstm_237[0][0]                   
__________________________________________________________________________________________________
flatten_42



In [396]:
inputs = np.array([string_to_int(i, TIME_STEPS, source_vocab) for i in source])
targets = [string_to_int(t, TIME_STEPS, target_vocab) for t in target]
targets = np.array(list(map(lambda x: to_categorical(x, num_classes=len(target_vocab)), targets)))

MemoryError: 
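The `MemoryError` most likely comes from materializing the one-hot targets, whose size is examples × time steps × classes. A back-of-the-envelope estimate shows the scale. The number of classes (100) and the element sizes are assumptions for illustration: the real values depend on `len(target_vocab)` and on the dtype that `to_categorical` uses in the installed Keras version. One possible way around the problem is to keep integer targets and use Keras's `sparse_categorical_crossentropy` loss, which does not need one-hot labels.

```python
def one_hot_bytes(n_examples, time_steps, n_classes, bytes_per_value):
    """Memory needed for a dense one-hot target tensor."""
    return n_examples * time_steps * n_classes * bytes_per_value

def index_bytes(n_examples, time_steps, bytes_per_value):
    """Memory needed when targets stay as integer class indices."""
    return n_examples * time_steps * bytes_per_value

N_EXAMPLES = 100_000   # from data[:int(1e5)]
TIME_STEPS = 32        # from the notebook
N_CLASSES = 100        # assumption: stand-in for len(target_vocab)

dense = one_hot_bytes(N_EXAMPLES, TIME_STEPS, N_CLASSES, bytes_per_value=4)  # assuming float32
sparse = index_bytes(N_EXAMPLES, TIME_STEPS, bytes_per_value=4)              # assuming int32

print(f"one-hot: {dense / 2**30:.2f} GiB")
print(f"indices: {sparse / 2**20:.2f} MiB")
```

The intermediate Python list of per-example arrays built before the final `np.array` call adds to the peak usage, so the real footprint is higher than this estimate.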

In [None]:
m.fit(
    [inputs], targets, 
    epochs=10, batch_size=64, 
    validation_split=0.1)

In [None]:
run_example(m, source_vocab, inv_target_vocab, 'hello')

### Tatoeba Report

* final architectures
* comparison
* as well as training method and tricks
