# Adding Numbers in String Sequences with a Sequence-to-Sequence Model

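This article shows how to build a simple seq2seq model with Keras that adds the numbers contained in a string sequence.  
We generate string pairs such as `("876+920", "0001796")` and `("001+012", "0000013")`, then train a seq2seq model that takes the first string of a pair, e.g. `"876+920"`, as input and outputs the second string, `"0001796"`.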

In [1]:
import numpy as np
import string

np.random.seed(1024)

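Define the character set and the string length. The string length used here is 7, which fits both an input such as `"876+920"` and a zero-padded output such as `"0001796"`.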

In [2]:
CHAR_SET = string.digits + '+'
CHAR_SET_LEN = len(CHAR_SET)
STRING_LEN = 7

In [3]:
CHAR_TO_INDEX = {c: i for i, c in enumerate(CHAR_SET)}
INDEX_TO_CHAR = {i: c for c, i in CHAR_TO_INDEX.items()}

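Define functions that convert a sequence into its one-hot vector representation and convert a vector representation back into a sequence.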

In [4]:
def seq_to_vector(seq):
    vector = np.zeros((STRING_LEN, CHAR_SET_LEN), dtype=int)
    for i, char in enumerate(seq):
        vector[i, CHAR_TO_INDEX[char]] = 1
    return vector

def vector_to_seq(vector):
    seq = ''
    for v in vector:
        index = np.argmax(v)
        seq = seq + INDEX_TO_CHAR[index]
    return seq
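As a quick sanity check of the two conversion functions, here is a minimal round-trip sketch. It repeats the definitions above so that it runs on its own; the names `encoded` and `decoded` are introduced here only for illustration.

```python
import string
import numpy as np

CHAR_SET = string.digits + '+'
CHAR_SET_LEN = len(CHAR_SET)
STRING_LEN = 7
CHAR_TO_INDEX = {c: i for i, c in enumerate(CHAR_SET)}
INDEX_TO_CHAR = {i: c for c, i in CHAR_TO_INDEX.items()}

def seq_to_vector(seq):
    # One row per character position, one column per character in CHAR_SET
    vector = np.zeros((STRING_LEN, CHAR_SET_LEN), dtype=int)
    for i, char in enumerate(seq):
        vector[i, CHAR_TO_INDEX[char]] = 1
    return vector

def vector_to_seq(vector):
    # Pick the highest-scoring character at each position
    return ''.join(INDEX_TO_CHAR[int(np.argmax(v))] for v in vector)

encoded = seq_to_vector('876+920')   # illustrative name
decoded = vector_to_seq(encoded)     # illustrative name

print(encoded.shape)  # (7, 11): 7 positions, 11 possible characters
print(decoded)        # 876+920
```

Because `vector_to_seq` uses `argmax`, it works equally well on the softmax probabilities that the model produces later, not only on exact one-hot vectors.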

Define a function that generates the data automatically.

In [5]:
def build_data(size=10000):
    num1 = np.random.randint(1000, size=size)
    num2 = np.random.randint(1000, size=size)
    add_result = num1 + num2
    
    seq_X = ['%03d+%03d' % (x, y) for x, y in zip(num1, num2)]
    seq_Y = ['%07d' % r for r in add_result]
    
    X = np.zeros((size, STRING_LEN, CHAR_SET_LEN), dtype=int)
    Y = np.zeros((size, STRING_LEN, CHAR_SET_LEN), dtype=int)
    
    for i, seq in enumerate(zip(seq_X, seq_Y)):
        X[i] = seq_to_vector(seq[0])
        Y[i] = seq_to_vector(seq[1])
        
    return X, Y
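To check that the generated pairs really encode correct sums, the sketch below rebuilds a small batch and verifies each pair arithmetically. The helpers `build_string_pairs` and `one_hot` are hypothetical names introduced for this illustration, and the `np.eye` indexing is just one alternative way to one-hot encode; neither is part of the original notebook.

```python
import string
import numpy as np

CHAR_SET = string.digits + '+'
CHAR_TO_INDEX = {c: i for i, c in enumerate(CHAR_SET)}

def build_string_pairs(size, rng):
    # Hypothetical helper: same string formats as build_data, no encoding
    num1 = rng.randint(1000, size=size)
    num2 = rng.randint(1000, size=size)
    seq_X = ['%03d+%03d' % (a, b) for a, b in zip(num1, num2)]
    seq_Y = ['%07d' % (a + b) for a, b in zip(num1, num2)]
    return seq_X, seq_Y

def one_hot(seqs):
    # Hypothetical helper: map characters to indices, then pick identity rows
    indices = np.array([[CHAR_TO_INDEX[c] for c in s] for s in seqs])
    return np.eye(len(CHAR_SET), dtype=int)[indices]

rng = np.random.RandomState(0)  # local generator, leaves the global seed alone
seq_X, seq_Y = build_string_pairs(5, rng)
X, Y = one_hot(seq_X), one_hot(seq_Y)

# Every target string must equal the sum of the two operands in its input
for x, y in zip(seq_X, seq_Y):
    left, right = x.split('+')
    assert int(left) + int(right) == int(y)

print(X.shape, Y.shape)  # (5, 7, 11) (5, 7, 11)
```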

In [6]:
from keras.models import Sequential
from keras import layers

HIDDEN_SIZE = 128

Using TensorFlow backend.


Build a simple seq2seq model.

In [7]:
model = Sequential()

# Encoder
model.add(layers.LSTM(HIDDEN_SIZE, input_shape=(STRING_LEN, CHAR_SET_LEN)))
# Use RepeatVector to copy the encoder output N times, one copy per decoder input step
model.add(layers.RepeatVector(STRING_LEN))
# Decoder
model.add(layers.LSTM(HIDDEN_SIZE, return_sequences=True))
# Use TimeDistributed to apply the same Dense layer to every time step of the decoder output
model.add(layers.TimeDistributed(layers.Dense(CHAR_SET_LEN)))
model.add(layers.Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
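The tensor shapes flowing through this model can be traced with plain NumPy. The sketch below only mirrors the shape bookkeeping of `RepeatVector` and `TimeDistributed(Dense)` followed by softmax. It is not how Keras implements these layers, the LSTM itself is skipped, and the random arrays (`encoder_output`, `W`, `b`) are placeholders standing in for learned values.

```python
import numpy as np

BATCH, HIDDEN_SIZE, STRING_LEN, CHAR_SET_LEN = 2, 128, 7, 11
rng = np.random.RandomState(0)

# Encoder LSTM without return_sequences: one vector per sample
encoder_output = rng.rand(BATCH, HIDDEN_SIZE)            # (2, 128)

# RepeatVector(STRING_LEN): the same vector is fed to every decoder step
decoder_input = np.repeat(encoder_output[:, np.newaxis, :],
                          STRING_LEN, axis=1)            # (2, 7, 128)

# Stand-in for the decoder LSTM output (return_sequences=True keeps all steps)
decoder_output = rng.rand(BATCH, STRING_LEN, HIDDEN_SIZE)  # (2, 7, 128)

# TimeDistributed(Dense): one shared weight matrix applied at each time step
W = rng.rand(HIDDEN_SIZE, CHAR_SET_LEN)
b = np.zeros(CHAR_SET_LEN)
logits = decoder_output @ W + b                          # (2, 7, 11)

# Softmax over the character axis gives one distribution per position
exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = exp / exp.sum(axis=-1, keepdims=True)

print(decoder_input.shape, probs.shape)  # (2, 7, 128) (2, 7, 11)
```

The final shape `(batch, 7, 11)` matches the one-hot targets, which is why `categorical_crossentropy` can be applied position by position.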

Generate a training set of 10,000 samples and a validation set of 2,000 samples.

In [8]:
train_X, train_Y = build_data(10000)
test_X, test_Y = build_data(2000)

Train the model.

In [9]:
model.fit(train_X, train_Y, validation_data=(test_X, test_Y),
          batch_size=128, epochs=200, 
          verbose=0)

<keras.callbacks.History at 0x1198a35c0>

In [10]:
model.history.history['acc'][-1], model.history.history['val_acc'][-1]

(0.9988428497314453, 0.9952142930030823)

Training accuracy reaches 99.88% and validation accuracy reaches 99.52%, so the model is quite accurate. Next, we generate 10 samples to see how the model performs.

In [11]:
watch_X, watch_Y = build_data(10)
watch_pred = model.predict(watch_X)

for x, y, pred in zip(watch_X, watch_Y, watch_pred):
    seq_x = vector_to_seq(x)
    seq_y = vector_to_seq(y)
    seq_pred = vector_to_seq(pred)
    print(f'x@{seq_x}\ty@{seq_y}\tpred@{seq_pred}')

x@997+677	y@0001674	pred@0001674
x@073+178	y@0000251	pred@0000251
x@354+074	y@0000428	pred@0000428
x@898+682	y@0001580	pred@0001580
x@704+116	y@0000820	pred@0000820
x@686+199	y@0000885	pred@0000885
x@803+106	y@0000909	pred@0000909
x@130+231	y@0000361	pred@0000361
x@426+846	y@0001272	pred@0001272
x@662+193	y@0000855	pred@0000855
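One caveat about the accuracy figures: for an output of shape `(samples, 7, 11)`, the `acc` metric is, as far as I understand Keras's categorical accuracy, averaged over every character position rather than over whole strings. Since the sums never exceed 1998, the first three output characters are always `0`, which makes a per-character score easier to achieve. The sketch below compares the two measures on toy data; `char_accuracy` and `sequence_accuracy` are hypothetical helper names, not Keras functions.

```python
import numpy as np

def char_accuracy(y_true, y_pred):
    # Fraction of individual character positions predicted correctly
    matches = np.argmax(y_true, axis=-1) == np.argmax(y_pred, axis=-1)
    return float(np.mean(matches))

def sequence_accuracy(y_true, y_pred):
    # Fraction of strings in which every position is predicted correctly
    matches = np.argmax(y_true, axis=-1) == np.argmax(y_pred, axis=-1)
    return float(np.mean(np.all(matches, axis=1)))

# Toy data: two 7-character targets, the second prediction has one wrong digit
true_idx = np.array([[0, 0, 0, 1, 7, 9, 6],
                     [0, 0, 0, 0, 0, 1, 3]])
pred_idx = true_idx.copy()
pred_idx[1, 6] = 4

eye = np.eye(11)
y_true, y_pred = eye[true_idx], eye[pred_idx]

char_acc = char_accuracy(y_true, y_pred)      # 13 of 14 characters correct
seq_acc = sequence_accuracy(y_true, y_pred)   # 1 of 2 strings fully correct
print(char_acc, seq_acc)
```

With the trained model, the same helper could be called as `sequence_accuracy(test_Y, model.predict(test_X))` to see how many sums are correct in full.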
