# Neural Machine Translation

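Goal: train a machine translation model with a simple job: convert human-readable dates into machine-readable dates in the fixed format 'yyyy-mm-dd'.

1. Prepare the training data
2. Build the model; the focus is the attention mechanism.
3. Train the model, check the translation quality, and inspect the attention matrix (visualization)


The overall workflow is roughly clear, but open problems remain: the final training run behaves strangely and does not reach the expected quality. Something is wrong somewhere and needs to be fixed.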

In [50]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## 1 Prepare the training data

- faker: generates fake date data
- FORMATS: defines the date formats to sample from
- babel: mainly provides internationalization (i18n) support



In [14]:
import random
from tqdm import tqdm
from faker import Faker
from babel.dates import format_date    # babel: the Python Internationalization Library


fake = Faker()
fake.seed(12345)
random.seed(12345)

FORMATS = ['short',
           'medium',
           'long',
           'full',
           'full',
           'full',
           'full',
           'full',
           'full',
           'full',
           'full',
           'full',
           'full',
           'd MMM YYY', 
           'd MMMM YYY',
           'dd MMM YYY',
           'd MMM, YYY',
           'd MMMM, YYY',
           'dd, MMM YYY',
           'd MM YY',
           'd MMMM YYY',
           'MMMM d YYY',
           'MMMM d, YYY',
           'dd.MM.YY']
LOCALES = ['en_US', 'zh_CN']

def load_date():
    dt = fake.date_object()

    try:
        human_readable = format_date(dt, format=random.choice(FORMATS), 
                                     # locale=random.choice(LOCALES))
                                     locale='en_US')
                                     # locale='zh_CN')
        human_readable = human_readable.lower().replace(',','')
        machine_readable = dt.isoformat()
        
    except AttributeError as e:
        return None, None, None

    return human_readable, machine_readable, dt


def load_dataset(m):
    human_vocab = set()
    machine_vocab = set()
    dataset = []

    for i in tqdm(range(m)):
        # Distinct names so the sample count m is not shadowed inside the loop
        h, mr, _ = load_date()
        if h is not None:
            dataset.append((h, mr))
            human_vocab.update(tuple(h))
            machine_vocab.update(tuple(mr))
    
    # Build the character-to-index mappings
    human = dict(zip(sorted(human_vocab) + ['<unk>', '<pad>'], 
                     list(range(len(human_vocab) + 2))))
    inv_machine = dict(enumerate(sorted(machine_vocab)))
    machine = {v:k for k,v in inv_machine.items()}
 
    return dataset, human, machine, inv_machine


m = 10000
dataset, human_vocab, machine_vocab, inv_machine_vocab = load_dataset(m)

100%|██████████| 10000/10000 [00:00<00:00, 11728.40it/s]


In [15]:
dataset[:10]

[('9 may 1998', '1998-05-09'),
 ('10.09.70', '1970-09-10'),
 ('4/28/90', '1990-04-28'),
 ('thursday january 26 1995', '1995-01-26'),
 ('monday march 7 1983', '1983-03-07'),
 ('sunday may 22 1988', '1988-05-22'),
 ('tuesday july 8 2008', '2008-07-08'),
 ('08 sep 1999', '1999-09-08'),
 ('1 jan 1981', '1981-01-01'),
 ('monday may 22 1995', '1995-05-22')]

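Convert the data to numpy arrays.

X has length Tx=30. The raw strings vary in length, so shorter ones are padded with `<pad>` and longer ones are truncated. Each character is one-hot encoded.

Y has a fixed length Ty=10. Each character is one-hot encoded.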
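
As a rough illustration of the padding, truncation, and one-hot steps, here is a minimal pure-Python sketch. The `toy_vocab` mapping and the helper names `encode` and `one_hot` are made up for this example and are much smaller than the real `human_vocab`.

```python
def encode(text, length, vocab):
    """Map characters to indices, truncating or padding to `length`."""
    ids = [vocab.get(ch, vocab["<unk>"]) for ch in text[:length]]
    ids += [vocab["<pad>"]] * (length - len(ids))
    return ids


def one_hot(ids, num_classes):
    """Turn each index into a row with a single 1.0."""
    return [[1.0 if j == i else 0.0 for j in range(num_classes)] for i in ids]


# Hypothetical toy vocabulary (not the notebook's human_vocab)
toy_vocab = {" ": 0, "1": 1, "9": 2, "a": 3, "m": 4, "y": 5, "<unk>": 6, "<pad>": 7}

ids = encode("9 may", 8, toy_vocab)
print(ids)                        # [2, 0, 4, 3, 5, 7, 7, 7]

rows = one_hot(ids, len(toy_vocab))
print(len(rows), len(rows[0]))    # 8 8
```

The last three positions carry the `<pad>` index, which is the same pattern as the run of 36s in the preprocessed source shown further down.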

In [17]:
import numpy as np
from keras.utils import to_categorical


# Convert characters to indices: truncate long strings, pad short ones with <pad>
def string_to_int(string, length, vocab):
    # Lowercase and strip commas to standardize the input
    string = string.lower().replace(',','')
    
    if len(string) > length:
        string = string[:length]
        
    # Unknown characters fall back to the index of '<unk>', not the string itself
    rep = list(map(lambda x: vocab.get(x, vocab.get('<unk>')), string))
    
    if len(string) < length:
        rep += [vocab['<pad>']] * (length - len(string))
    return rep

# Convert characters to indices first, then one-hot encode
def preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty):
    
    X, Y = zip(*dataset)
    
    X = np.array([string_to_int(i, Tx, human_vocab) for i in X])
    Y = [string_to_int(t, Ty, machine_vocab) for t in Y]
    
    # One-hot encode
    Xoh = np.array(list(map(lambda x: to_categorical(x, num_classes=len(human_vocab)), X)))
    Yoh = np.array(list(map(lambda x: to_categorical(x, num_classes=len(machine_vocab)), Y)))

    return X, np.array(Y), Xoh, Yoh

Tx = 30
Ty = 10
X, Y, Xoh, Yoh = preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty)

print("X.shape:", X.shape)
print("Y.shape:", Y.shape)
print("Xoh.shape:", Xoh.shape)
print("Yoh.shape:", Yoh.shape)

X.shape: (10000, 30)
Y.shape: (10000, 10)
Xoh.shape: (10000, 30, 37)
Yoh.shape: (10000, 10, 11)


In [18]:
index = 0
print("Source date:", dataset[index][0])
print("Target date:", dataset[index][1])
print()
print("Source after preprocessing (indices):", X[index])
print("Target after preprocessing (indices):", Y[index])
print()
print("Source after preprocessing (one-hot):", Xoh[index])
print("Target after preprocessing (one-hot):", Yoh[index])

Source date: 9 may 1998
Target date: 1998-05-09

Source after preprocessing (indices): [12  0 24 13 34  0  4 12 12 11 36 36 36 36 36 36 36 36 36 36 36 36 36 36 36
 36 36 36 36 36]
Target after preprocessing (indices): [ 2 10 10  9  0  1  6  0  1 10]

Source after preprocessing (one-hot): [[ 0.  0.  0. ...,  0.  0.  0.]
 [ 1.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  1.]
 [ 0.  0.  0. ...,  0.  0.  1.]
 [ 0.  0.  0. ...,  0.  0.  1.]]
Target after preprocessing (one-hot): [[ 0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.]]


## 2 Build the model


### 2.1 - Attention mechanism


<table>
<td> 
<img src="images/attn_model.png" style="width:500;height:500px;"> <br>
</td> 
<td> 
<img src="images/attn_mechanism.png" style="width:500;height:500px;"> <br>
</td> 
</table>
<caption><center> **Figure 1**: Neural machine translation with attention</center></caption>




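Points to note:
1. There are two LSTM layers. The first is a bidirectional LSTM (Bi-LSTM) that reads the whole input sentence. Its input is $x^{<t>}$ and its output is $a^{\langle t \rangle} = [\overrightarrow{a}^{\langle t \rangle}; \overleftarrow{a}^{\langle t \rangle}]$
2. Attention is a small neural network whose weights are reused at every output step. Its input is $(s^{<t-1>}, a^{<t'>}), t' = 1 \cdots T_x$, and its output is $\alpha^{<t, t'>}$, the "attention" that the LSTM's next output $s^{<t>}$ pays to $a^{<t'>}$. These weights then give $context^{<t>}$:
$$context^{<t>} = \sum_{t' = 1}^{T_x} \alpha^{<t,t'>}a^{<t'>}\tag{1}$$ 

3. In the last LSTM layer, note that $\hat{y}^{<t>}$ is not fed in as input to the next LSTM step, because, unlike in natural language, adjacent characters of the output date have little dependency on each other.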
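
To make equation (1) concrete, here is a small worked example in numpy for a single output step. The activations `a` and the scores `e` are made-up numbers with $T_x = 3$; only the softmax and the weighted sum follow the text above.

```python
import numpy as np

# Hypothetical encoder activations for T_x = 3 input steps, each of size 2
a = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Hypothetical unnormalized attention scores for one output step t
e = np.array([2.0, 0.0, 0.0])

# Softmax over the input axis, so the weights alpha sum to 1
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# Equation (1): context is the alpha-weighted sum of the activations
context = (alpha[:, None] * a).sum(axis=0)

print(alpha.round(3))
print(context.round(3))
```

Because the first score is the largest, the first activation dominates the context vector, which is the sense in which the output step "attends" to that input position.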

Left figure: Keras layers used by the model:
- [Bidirectional](https://keras.io/layers/wrappers/#bidirectional)
- [LSTM](https://keras.io/layers/recurrent/#lstm)

Right figure: Keras layers used by the attention mechanism:

- [RepeatVector](https://keras.io/layers/core/#repeatvector): repeats $s^{<t-1>}$ once per input position
- [Concatenate](https://keras.io/layers/merge/#concatenate): concatenates $(s^{<t-1>}, a^{<t'>})$
- [Dense](https://keras.io/layers/core/#dense): fully connected layer, equivalent to $ a = activation(Wx + b) $
- [Activation](https://keras.io/layers/core/#activation): applies the activation function (here, a softmax)
- [Dot](https://keras.io/layers/merge/#dot): dot product, used to compute $context^{<t>}$

In [36]:
import keras.backend as K
from keras.models import Model
from keras.layers import Input, Bidirectional, LSTM, RepeatVector,  Concatenate, Dense, Activation, Dot

def softmax(x, axis=1):
    ndim = K.ndim(x)
    if ndim == 2:
        return K.softmax(x)
    elif ndim > 2:
        e = K.exp(x - K.max(x, axis=axis, keepdims=True))
        s = K.sum(e, axis=axis, keepdims=True)
        return e / s
    else:
        raise ValueError('Cannot apply softmax to a tensor that is 1D')

        
# attention layers
repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor = Dense(1, activation = "relu")
activator = Activation(softmax, name='attention_weights') # We are using a custom softmax(axis = 1) loaded in this notebook
dotor = Dot(axes = 1)

# post LSTM layer
n_a = 64
n_s = 128
post_activation_LSTM_cell = LSTM(n_s, return_state = True)
output_layer = Dense(len(machine_vocab), activation=softmax)

In [37]:
def one_step_attention(a, s_prev):
    # a: (m, Tx, n_a*2)
    # s_prev: (m, n_s)
    s_prev = repeator(s_prev)    # (m, Tx, n_s)
    concat = concatenator([a, s_prev])   # (m, Tx, n_a*2+n_s)
    e = densor(concat)      # dense layer with 'relu' maps each position to one score, (m, Tx, 1)
    alphas = activator(e)   # softmax over axis=1, (m, Tx, 1)
    context = dotor([alphas, a])   # multiply and sum over axis=1, giving (m, 1, n_a*2)
    return context


def build_model(Tx, Ty, n_a, n_s, human_vocab_size, machine_vocab_size):
    
    # Define the model input with shape (Tx, human_vocab_size)
    # Define s0 and c0, the initial hidden and cell states of the decoder LSTM, each of shape (n_s,)
    X = Input(shape=(Tx, human_vocab_size))
    s0 = Input(shape=(n_s,), name='s0')
    c0 = Input(shape=(n_s,), name='c0')
    s = s0
    c = c0
    
    # Initialize empty list of outputs
    outputs = []
    
    # Step 1: Define the pre-attention Bi-LSTM with return_sequences=True
    a = Bidirectional(LSTM(n_a, return_sequences=True))(X)    # (m, Tx, n_a*2)
    
    # Step 2: Iterate for Ty steps
    for t in range(Ty):
        context = one_step_attention(a, s)  # (m, 1, n_a*2)
        s, _, c = post_activation_LSTM_cell(context, initial_state=[s, c])   # (m, n_s)
        out = output_layer(s) # (m, machine_vocab_size)
        outputs.append(out)
    
    # Step 3: Create the model instance taking three inputs and returning the list of outputs
    model = Model(inputs=[X, s0, c0], outputs=outputs)
    
    return model

In [38]:
# Build the model instance
model = build_model(Tx, Ty, n_a, n_s, len(human_vocab), len(machine_vocab))

In [39]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            (None, 30, 37)       0                                            
__________________________________________________________________________________________________
s0 (InputLayer)                 (None, 128)          0                                            
__________________________________________________________________________________________________
bidirectional_2 (Bidirectional) (None, 30, 128)      52224       input_2[0][0]                    
__________________________________________________________________________________________________
repeat_vector_8 (RepeatVector)  (None, 30, 128)      0           s0[0][0]                         
                                                                 lstm_6[0][0]                     
          

In [41]:
from keras.optimizers import Adam

opt = Adam(lr=0.005, beta_1=0.9, beta_2=0.999, decay=0.1)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
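
The `decay` argument deserves a closer look. The sketch below assumes the time-based schedule used by the legacy Keras optimizers, where the step size at a given update is `lr / (1 + decay * iteration)`. The helper `effective_lr` is hypothetical and only evaluates that formula; it does not call Keras.

```python
def effective_lr(lr, decay, iteration):
    """Time-based decay, assumed to match the legacy Keras optimizers."""
    return lr / (1.0 + decay * iteration)


lr, decay = 0.005, 0.1
updates_per_epoch = 10000 // 100    # m / batch_size used in this notebook

for it in (0, 10, updates_per_epoch):
    print(it, effective_lr(lr, decay, it))
```

If that formula applies, the step size has halved after 10 updates and is about 11 times smaller by the end of one epoch. That may be one reason training stalls, so a smaller `decay` is worth trying.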


In [45]:
s0 = np.zeros((m, n_s))
c0 = np.zeros((m, n_s))
outputs = list(Yoh.swapaxes(0,1))
model.fit([Xoh, s0, c0], outputs, epochs=1, batch_size=100)

Epoch 1/1


<keras.callbacks.History at 0x138c69a90>
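
The model has one output per target position, so `fit` expects a list of `Ty` arrays rather than one `(m, Ty, vocab)` array. The `swapaxes` call performs that regrouping. A small numpy sketch with dummy targets (the batch size of 4 is arbitrary):

```python
import numpy as np

m, Ty, vocab = 4, 10, 11                  # small hypothetical batch
Yoh = np.zeros((m, Ty, vocab))
Yoh[:, :, 0] = 1.0                        # dummy one-hot targets

outputs = list(Yoh.swapaxes(0, 1))        # Ty arrays, each of shape (m, vocab)
print(len(outputs), outputs[0].shape)     # 10 (4, 11)
```

Entry `t` of the list holds the targets for output character `t` across the whole batch.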

While training you can see the loss as well as the accuracy on each of the 10 positions of the output. The table below gives you an example of what the accuracies could be if the batch had 2 examples: 

<img src="images/table.png" style="width:700;height:200px;"> <br>
<caption><center>Thus, `dense_2_acc_8: 0.89` means that you are predicting the 7th character of the output correctly 89% of the time in the current batch of data. </center></caption>


We have run this model for longer, and saved the weights. Run the next cell to load our weights. (By training a model for several minutes, you should be able to obtain a model of similar accuracy, but loading our model will save you time.) 

In [48]:
model.load_weights('models/model.h5')

In [49]:
EXAMPLES = ['3 May 1979', '5 April 09', '21th of August 2016', 'Tue 10 Jul 2007', 'Saturday May 9 2018', 'March 3 2001', 'March 3rd 2001', '1 March 2001']
# Zero initial states for a batch of one example
s0_pred = np.zeros((1, n_s))
c0_pred = np.zeros((1, n_s))
for example in EXAMPLES:
    
    source = string_to_int(example, Tx, human_vocab)
    # One-hot encode the whole index list to (Tx, vocab), then add the batch axis.
    # Swapping axes before the reshape can scramble the rows when the array is (Tx, vocab).
    source = to_categorical(source, num_classes=len(human_vocab))
    source = source.reshape(1, Tx, -1)
    prediction = model.predict([source, s0_pred, c0_pred])
    prediction = np.argmax(prediction, axis = -1)
    output = [inv_machine_vocab[int(i)] for i in prediction]
    
    print("source:", example)
    print("output:", ''.join(output))

source: 3 May 1979
output: 19872-1222
source: 5 April 09
output: 1987-00322
source: 21th of August 2016
output: 1977-10-14
source: Tue 10 Jul 2007
output: 1974-03-23
source: Saturday May 9 2018
output: 1971-10-11
source: March 3 2001
output: 19874-2444
source: March 3rd 2001
output: 1977-04424
source: 1 March 2001
output: 19874-4421
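
When predictions look this scrambled, one thing worth checking is how the one-hot matrix is given its batch axis. The sketch below uses tiny made-up sizes to compare reshaping a `(Tx, vocab)` matrix directly with transposing it first. Whether the transposed case occurs in practice depends on the shape `to_categorical` returns in the installed Keras version, so treat this as a diagnostic idea rather than a confirmed cause.

```python
import numpy as np

Tx, vocab = 4, 3                                   # tiny hypothetical sizes
ids = [0, 1, 2, 1]
one_hot = np.eye(vocab)[ids]                       # (Tx, vocab), one row per character

good = one_hot.reshape(1, Tx, -1)                  # add the batch axis directly
bad = one_hot.swapaxes(0, 1).reshape(1, Tx, -1)    # transpose first, then reshape

print(good[0].argmax(axis=-1))                     # recovers ids
print(bad[0].argmax(axis=-1))                      # a different character sequence
```

Both arrays have the shape the model expects, so the second one raises no error even though it no longer encodes the input string.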


## 3 - Visualizing Attention

Since the problem has a fixed output length of 10, it is also possible to carry out this task using 10 different softmax units to generate the 10 characters of the output. But one advantage of the attention model is that each part of the output (say the month) knows it needs to depend only on a small part of the input (the characters in the input giving the month). We can visualize what part of the output is looking at what part of the input.

Consider the task of translating "Saturday 9 May 2018" to "2018-05-09". If we visualize the computed $\alpha^{\langle t, t' \rangle}$ we get this: 

<img src="images/date_attention.png" style="width:600;height:300px;"> <br>
<caption><center> **Figure 8**: Full Attention Map</center></caption>

Notice how the output ignores the "Saturday" portion of the input. None of the output timesteps are paying much attention to that portion of the input. We also see that 9 has been translated as 09 and May has been correctly translated into 05, with the output paying attention to the parts of the input it needs to make the translation. The year mostly requires it to pay attention to the input's "18" in order to generate "2018".



### 3.1 - Getting the activations from the network

Let's now visualize the attention values in your network. We'll propagate an example through the network, then visualize the values of $\alpha^{\langle t, t' \rangle}$. 

To figure out where the attention values are located, let's start by printing a summary of the model.

Navigate through the output of `model.summary()` above. You can see that the layer named `attention_weights` outputs the `alphas` of shape (m, 30, 1) before `dot_2` computes the context vector for every time step $t = 0, \ldots, T_y-1$. Let's get the activations from this layer.

The function `plot_attention_map()`, a helper that is not defined in this notebook, pulls out the attention values from your model and plots them.
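
Once the per-step activations of `attention_weights` are available, assembling the map itself is a small numpy operation. The sketch below does not touch the model: `step_alphas` is a random stand-in for the `Ty` arrays of shape `(1, Tx, 1)` that the layer would produce for a single example.

```python
import numpy as np

Tx, Ty = 30, 10
rng = np.random.default_rng(0)

# Stand-in for the Ty activations of the attention layer, one (1, Tx, 1) array per step
step_alphas = []
for _ in range(Ty):
    e = rng.normal(size=(1, Tx, 1))
    w = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)
    step_alphas.append(w)

# Stack into a (Ty, Tx) map: row t shows where output character t looks
attention = np.stack([w[0, :, 0] for w in step_alphas])

# Input position with the largest weight for each output step
most_attended = attention.argmax(axis=1)

print(attention.shape, most_attended.shape)
```

Passing `attention` to a heatmap plot, with the input characters on one axis and the output characters on the other, gives a picture like the full attention map shown earlier.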

In [None]:
attention_map = plot_attention_map(model, human_vocab, inv_machine_vocab, "Tuesday April 08 1993", num = 6, n_s = 128)