# MV-LSTM in Practice

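## Data preparation
The data comes from https://github.com/terrifyzhao/text_matching, which provides Chinese sentence-matching data with the columns sentence1, sentence2 and label, where 0 means the pair does not match and 1 means it matches. The repository can be cloned directly from GitHub.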

In [1]:
!git clone https://github.com/terrifyzhao/text_matching.git
!cp -r /content/text_matching/input ./

Cloning into 'text_matching'...
remote: Enumerating objects: 334, done.
remote: Total 334 (delta 0), reused 0 (delta 0), pack-reused 334
Receiving objects: 100% (334/334), 24.77 MiB | 31.59 MiB/s, done.
Resolving deltas: 100% (212/212), done.


In [2]:
!ls ./input/

dev.csv  test.csv  train.csv  vocab.txt


In [3]:
!head -5 ./input/train.csv

sentence1,sentence2,label
用微信都6年，微信没有微粒贷功能,4。  号码来微粒贷,0
微信消费算吗,还有多少钱没还,0
交易密码忘记了找回密码绑定的手机卡也掉了,怎么最近安全老是要改密码呢好麻烦,0
你好 我昨天晚上申请的没有打电话给我 今天之内一定会打吗？,什么时候可以到账,0


In [4]:
# Set the parameters used by the model
# Basic parameters
base_params = {
    'num_classes':2,               # number of classes
    'max_features':1700,             # vocabulary size of the embedding layer
    'embed_size':200,              # embedding dimension
    'filters':300,
    'kernel_size':3,
    'strides':1,
    'padding':'same',
    'conv_activation_func':'relu',
    'embedding_matrix':[],
    'w_initializer':'random_uniform',
    'b_initializer':'zeros',
    'dropout_rate':0.2,
    'mlp_activation_func':'relu',
    'mlp_num_layers':1,
    'mlp_num_units':128,
    'mlp_num_fan_out':128,
    'input_shapes':[(64,),(64,)],         # maximum length of each sentence after padding
    'task':'Classification',
}

mvlstm_params = base_params
mvlstm_params['lstm_units'] = 64
mvlstm_params['top_k'] = 50
mvlstm_params['mlp_num_units'] = 128
mvlstm_params['mlp_num_fan_out'] = 128
mvlstm_params['dropout_rate'] = 0.3
mvlstm_params['embed_size'] = 100
base_params = mvlstm_params
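
Note that `mvlstm_params = base_params` binds a second name to the same dictionary rather than copying it, so the overrides that follow also change `base_params`. The sketch below uses a hypothetical two-key dictionary (`defaults`, `alias` and `copied` are illustrative names, not part of the notebook) to show the difference between aliasing and a shallow copy.

```python
# Hypothetical toy configuration, used only to illustrate aliasing vs. copying.
defaults = {'dropout_rate': 0.2, 'embed_size': 200}

alias = defaults                  # same dict object under a second name
alias['dropout_rate'] = 0.3
print(defaults['dropout_rate'])   # 0.3: the original dict was modified too

copied = dict(defaults)           # shallow copy: a new dict with the same items
copied['embed_size'] = 100
print(defaults['embed_size'])     # 200: the original keeps its value
```

In the notebook this aliasing is harmless, because the same dictionary is passed to the model afterwards; a copy would only matter if the unmodified defaults were needed again.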

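# Data input and conversion
For Chinese text, this MV-LSTM setup takes character-level input, because the number of distinct characters is far smaller than the number of distinct words.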
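
As a small illustration of character-level input, the sketch below splits two sentences from the training file into single characters and collects the distinct characters. The variable names are assumptions made for this example; the notebook's own version follows in the next cells.

```python
# Two sentences taken from the head of train.csv shown earlier.
sentences = ["微信消费算吗", "还有多少钱没还"]

# Character-level tokenization: drop spaces, then split into single characters.
tokens = [list(s.replace(" ", "")) for s in sentences]
print(tokens[0])   # ['微', '信', '消', '费', '算', '吗']

# The vocabulary is the set of distinct characters across all sentences.
vocab = sorted({ch for sent in tokens for ch in sent})
print(len(vocab))  # 12
```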

In [5]:
# First collect every character in the text and build the character vocabulary
import pandas as pd

df = pd.read_csv('input/train.csv')
p = df['sentence1'].values
h = df['sentence2'].values
p_seg = list(map(lambda x: list(x.replace(" ","")), p))
h_seg = list(map(lambda x: list(x.replace(" ","")), h))
common_texts = []
common_texts.extend(p_seg)
common_texts.extend(h_seg)



df = pd.read_csv('input/dev.csv')
p = df['sentence1'].values
h = df['sentence2'].values
p_seg = list(map(lambda x: list(x.replace(" ","")), p))
h_seg = list(map(lambda x: list(x.replace(" ","")), h))
common_texts.extend(p_seg)
common_texts.extend(h_seg)

df = pd.read_csv('input/test.csv')
p = df['sentence1'].values
h = df['sentence2'].values
p_seg = list(map(lambda x: list(x.replace(" ","")), p))
h_seg = list(map(lambda x: list(x.replace(" ","")), h))
common_texts.extend(p_seg)
common_texts.extend(h_seg)

In [6]:
common_texts[0]

['用',
 '微',
 '信',
 '都',
 '6',
 '年',
 '，',
 '微',
 '信',
 '没',
 '有',
 '微',
 '粒',
 '贷',
 '功',
 '能']

In [7]:
# Use a set to collect the distinct characters
char_set = set()
for sample in common_texts:
    for char in sample:
        char_set.add(char)
with open('input/char_vocab.txt','w',encoding='utf8') as f:
    f.write("\n".join(sorted(list(char_set),reverse=True)))

In [8]:
# First read the training data
df = pd.read_csv('input/train.csv')
p = df['sentence1'].values
h = df['sentence2'].values
label = df['label'].values

In [9]:
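# A shuffle function that randomly reorders the data
import numpy as np

def shuffle(*arrs):
    """ shuffle

    Shuffle several arrays with the same random permutation.

    Arguments:
        *arrs: arrays of equal length

    Returns:
        tuple of the shuffled arrays

    """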
    arrs = list(arrs)
    for i, arr in enumerate(arrs):
        assert len(arrs[0]) == len(arrs[i])
        arrs[i] = np.array(arr)
    p = np.random.permutation(len(arrs[0]))
    return tuple(arr[p] for arr in arrs)

p, h, label = shuffle(p, h, label)
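
The important property of `shuffle` is that one permutation is applied to all arrays, so each sentence pair stays attached to its label. A minimal standard-library sketch of that idea, with hypothetical toy data:

```python
import random

sentences = ['a', 'b', 'c', 'd']
labels = [0, 1, 0, 1]

# Draw one permutation of the indices and reuse it for every list,
# so that sentence i and label i move together.
rng = random.Random(0)            # fixed seed only to make the sketch repeatable
perm = list(range(len(sentences)))
rng.shuffle(perm)

shuffled_sentences = [sentences[i] for i in perm]
shuffled_labels = [labels[i] for i in perm]

pairs_before = set(zip(sentences, labels))
pairs_after = set(zip(shuffled_sentences, shuffled_labels))
print(pairs_before == pairs_after)   # True: the pairing is preserved
```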

In [10]:
# Next, characters are converted to integers, so each character in the vocabulary needs an index
# Load the vocabulary
def load_char_vocab():
    vocab = [line.strip() for line in open('./input/char_vocab.txt', encoding='utf-8').readlines()]
    word2idx = {word: index for index, word in enumerate(vocab,start=1)}
    idx2word = {index: word for index, word in enumerate(vocab,start=1)}
    return word2idx, idx2word

word2idx, idx2word = load_char_vocab()

In [11]:
# Convert characters to indices with the vocabulary
p_list, h_list = [], []
for p_sentence, h_sentence in zip(p, h):
    # skip whitespace and characters missing from the vocabulary
    p_ids = [word2idx[word.lower()] for word in p_sentence if len(word.strip()) > 0 and word.lower() in word2idx]
    h_ids = [word2idx[word.lower()] for word in h_sentence if len(word.strip()) > 0 and word.lower() in word2idx]

    p_list.append(p_ids)
    h_list.append(h_ids)

In [12]:
# Inspect an encoded sentence
p_list[0]

[1246, 1552, 918, 388, 1464, 578, 197, 1552, 1255]
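
The encoding step can be reproduced on a toy vocabulary. In the sketch below the three-character vocabulary and the helper `encode` are hypothetical; the point is that indices start at 1, which leaves 0 free for padding, and that spaces and unknown characters are skipped.

```python
# Hypothetical three-character vocabulary; the real one is read from char_vocab.txt.
vocab = ['微', '信', '贷']

# Indices start at 1 so that 0 stays free for padding.
char2idx = {ch: i for i, ch in enumerate(vocab, start=1)}

def encode(sentence):
    """Map each known, non-space character to its index and skip the rest."""
    return [char2idx[ch] for ch in sentence if ch.strip() and ch in char2idx]

print(encode('微信 微粒贷'))   # [1, 2, 1, 3]
```

The space and the out-of-vocabulary character '粒' produce no index, which mirrors the filter used in the notebook's list comprehension.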

In [13]:
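# Next, a pad_sequences function pads every sentence to the same length, filling short ones with zeros
def pad_sequences(sequences, maxlen=None, dtype='int32', padding='post',
                  truncating='post', value=0.):
    """ pad_sequences

    Make all sequences the same length. If maxlen is given, every sequence is
    padded or truncated to maxlen; otherwise the longest sequence length is used.
    Padding and truncation each support 'post' (operate on the tail) and 'pre'
    (operate on the head); both default to 'post'.

    Arguments:
        sequences: list of sequences
        maxlen: int, target length
        dtype: data type of the output array
        padding: padding method, 'pre' or 'post'
        truncating: truncation method, 'pre' or 'post'
        value: float, padding value

    Returns:
        x: numpy array of shape (number_of_sequences, maxlen)

    """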
    lengths = [len(s) for s in sequences]

    nb_samples = len(sequences)
    if maxlen is None:
        maxlen = np.max(lengths)

    x = (np.ones((nb_samples, maxlen)) * value).astype(dtype)
    for idx, s in enumerate(sequences):
        if len(s) == 0:
            continue  # empty list was found
        if truncating == 'pre':
            trunc = s[-maxlen:]
        elif truncating == 'post':
            trunc = s[:maxlen]
        else:
            raise ValueError("Truncating type '%s' not understood" % truncating)

        if padding == 'post':
            x[idx, :len(trunc)] = trunc
        elif padding == 'pre':
            x[idx, -len(trunc):] = trunc
        else:
            raise ValueError("Padding type '%s' not understood" % padding)
    return x

p_list = pad_sequences(p_list, maxlen=base_params['input_shapes'][0][0])
h_list = pad_sequences(h_list, maxlen=base_params['input_shapes'][0][0])

In [14]:
# Inspect the padded sentence
p_list[0]

array([1246, 1552,  918,  388, 1464,  578,  197, 1552, 1255,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0], dtype=int32)
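
With the default `padding='post'` and `truncating='post'`, short sequences get zeros appended and long sequences lose their tail. A pure-Python sketch of that behaviour for a single sequence (the helper `pad_post` is a simplified stand-in, not the notebook's function):

```python
def pad_post(seq, maxlen, value=0):
    """Truncate from the tail, then right-pad with `value` up to maxlen."""
    trunc = seq[:maxlen]
    return trunc + [value] * (maxlen - len(trunc))

print(pad_post([5, 3, 9], maxlen=5))            # [5, 3, 9, 0, 0]
print(pad_post([5, 3, 9, 4, 1, 7], maxlen=5))   # [5, 3, 9, 4, 1]
```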

In [15]:
# Turn the training data into features and labels
import keras
p = p_list
h = h_list
y = label
x = [p,h]
y = keras.utils.to_categorical(y,num_classes=2)
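
`to_categorical` turns integer labels into one-hot rows, the label format that `categorical_crossentropy` works with. The sketch below shows the same transformation in plain Python; `one_hot` is an illustrative helper, not Keras's implementation.

```python
def one_hot(labels, num_classes):
    """Return one row per label with a 1.0 in the label's column."""
    return [[1.0 if c == y else 0.0 for c in range(num_classes)] for y in labels]

print(one_hot([0, 1, 1], num_classes=2))
# [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```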

In [16]:
# Do the same for the validation set
df = pd.read_csv('input/dev.csv')
p = df['sentence1'].values
h = df['sentence2'].values
label = df['label'].values

p, h, label = shuffle(p, h, label)
word2idx, idx2word = load_char_vocab()

# Convert characters to indices with the vocabulary
p_list, h_list = [], []
for p_sentence, h_sentence in zip(p, h):
    # skip whitespace and characters missing from the vocabulary
    p_ids = [word2idx[word.lower()] for word in p_sentence if len(word.strip()) > 0 and word.lower() in word2idx]
    h_ids = [word2idx[word.lower()] for word in h_sentence if len(word.strip()) > 0 and word.lower() in word2idx]

    p_list.append(p_ids)
    h_list.append(h_ids)

p_list = pad_sequences(p_list, maxlen=base_params['input_shapes'][0][0])
h_list = pad_sequences(h_list, maxlen=base_params['input_shapes'][0][0])

p_eval = p_list
h_eval = h_list
y_eval = label
x_eval = [p_eval, h_eval]
y_eval = keras.utils.to_categorical(y_eval,num_classes=2)

In [17]:
# Do the same for the test set. Note that the test set is labeled so that model quality can be measured; it takes no part in training
df = pd.read_csv('input/test.csv')
p = df['sentence1'].values
h = df['sentence2'].values
label = df['label'].values

p, h, label = shuffle(p, h, label)
word2idx, idx2word = load_char_vocab()

# Convert characters to indices with the vocabulary
p_list, h_list = [], []
for p_sentence, h_sentence in zip(p, h):
    # skip whitespace and characters missing from the vocabulary
    p_ids = [word2idx[word.lower()] for word in p_sentence if len(word.strip()) > 0 and word.lower() in word2idx]
    h_ids = [word2idx[word.lower()] for word in h_sentence if len(word.strip()) > 0 and word.lower() in word2idx]

    p_list.append(p_ids)
    h_list.append(h_ids)

p_list = pad_sequences(p_list, maxlen=base_params['input_shapes'][0][0])
h_list = pad_sequences(h_list, maxlen=base_params['input_shapes'][0][0])

p_test = p_list
h_test = h_list
y_test = label
x_test = [p_test, h_test]
y_test = keras.utils.to_categorical(y_test, num_classes=2)

# Building the model

The model code draws on https://github.com/wangle1218/deep_text_matching and MatchZoo (https://github.com/NTMC-Community/MatchZoo).  
The model structure is defined in the `build` method.

In [39]:
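# Why define this as a Layer subclass? Calling tf.nn.top_k directly raised an error earlier: a custom operation has to be wrapped in a layer class. See https://blog.csdn.net/qq_44930937/article/details/104509254
import tensorflow as tf   # used inside KMaxPooling.call
from keras.layers import Layer, InputSpec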
class KMaxPooling(Layer):
    """
    K-max pooling layer that extracts the k-highest activations from a sequence (2nd dimension).
    TensorFlow backend.
    """
    def __init__(self, k=1, **kwargs):
        super().__init__(**kwargs)
        self.input_spec = InputSpec(ndim=2)
        self.k = k

    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.k)

    def call(self, inputs):
        
        # swap last two dimensions since top_k will be applied along the last dimension
        # shifted_input = tf.transpose(inputs, [0, 2, 1])
        
        # extract top_k, returns two tensors [values, indices]
        top_k = tf.nn.top_k(inputs, k=self.k, sorted=True, name=None)[0]
        
        # return flattened output
        # return Flatten()(top_k)
        return top_k

    def get_config(self):
        config = {'k':self.k}
        base_config = super(KMaxPooling,self).get_config()
        return dict(list(base_config.items()) + list(config.items()))


In [40]:
from keras.models import Model
import tensorflow as tf 

np.random.seed(1)
tf.random.set_seed(1)

class MVLSTM(object):

    def __init__( self, params):
        """Init."""
        self._params = params

    
    def make_embedding_layer(self,name='embedding',embed_type='char',**kwargs):   # build the embedding layer, optionally from pretrained vectors

        def init_embedding(weights=None):
            if embed_type == "char":
                input_dim = self._params['max_features']
                output_dim = self._params['embed_size']
            else:
                input_dim = self._params['word_max_features']
                output_dim = self._params['word_embed_size']

            return keras.layers.Embedding(
                input_dim = input_dim,
                output_dim = output_dim,
                trainable = False,   # weights stay frozen, including when no pretrained matrix is supplied
                name = name,
                weights = weights,
                **kwargs)

        if embed_type == "char":
            embed_weights = self._params['embedding_matrix']
        else:
            embed_weights = self._params['word_embedding_matrix']

        if embed_weights is None or len(embed_weights) == 0:   # safe for lists and NumPy arrays
            embedding = init_embedding()
        else:
            embedding = init_embedding(weights = [embed_weights])

        return embedding

    def _make_multi_layer_perceptron_layer(self) -> keras.layers.Layer:   # multilayer perceptron (fully connected layers)
        # TODO: do not create new layers for a second call
        def _wrapper(x):
            activation = self._params['mlp_activation_func']   # activation of the dense layers
            for _ in range(self._params['mlp_num_layers']):  # number of hidden dense layers
                x = keras.layers.Dense(self._params['mlp_num_units'],   # units per hidden dense layer
                                       activation=activation)(x)
            return keras.layers.Dense(self._params['mlp_num_fan_out'],  # units of the last dense layer
                                      activation=activation)(x)

        return _wrapper

    def _make_inputs(self) -> list:        # input layers: the two texts to be matched
        input_left = keras.layers.Input(
            name='text_left',
            shape=self._params['input_shapes'][0]
        )
        input_right = keras.layers.Input(
            name='text_right',
            shape=self._params['input_shapes'][1]
        )
        return [input_left, input_right]

    def _make_output_layer(self) -> keras.layers.Layer:   # output layer
        """:return: a correctly shaped keras dense layer for model output."""
        task = self._params['task']
        if task == "Classification":
            return keras.layers.Dense(self._params['num_classes'], activation='softmax')  # classification (a pair of texts) uses softmax
        elif task == "Ranking":
            return keras.layers.Dense(1, activation='linear')          # ranking (several candidate documents) uses a linear score
        else:
            raise ValueError(f"{task} is not a valid task type. "
                             f"Must be in `Ranking` and `Classification`.")


    def build(self):
        """Build model structure."""
        query, doc = self._make_inputs()

        # Embedding layer
        embedding = self.make_embedding_layer(mask_zero=True)
        embed_query = embedding(query)
        embed_doc = embedding(doc)

        # Bi-directional LSTM layer
        rep_query = keras.layers.Bidirectional(keras.layers.LSTM(
            self._params['lstm_units'],
            return_sequences=True,
            dropout=self._params['dropout_rate']
        ))(embed_query)
        rep_doc = keras.layers.Bidirectional(keras.layers.LSTM(
            self._params['lstm_units'],
            return_sequences=True,
            dropout=self._params['dropout_rate']
        ))(embed_doc)

        # Top-k matching layer
        matching_matrix = keras.layers.Dot(
            axes=[2, 2], normalize=False)([rep_query, rep_doc])
        matching_signals = keras.layers.Reshape((-1,))(matching_matrix)
        # matching_topk = keras.layers.Lambda(
        #     lambda x: tf.nn.top_k(x, k=self._params['top_k'], sorted=True)[0]
        # )(matching_signals)
        matching_topk = KMaxPooling(k=self._params['top_k'])(matching_signals)

        # Multilayer perceptron layer.
        mlp = self._make_multi_layer_perceptron_layer()(matching_topk)
        mlp = keras.layers.Dropout(
            rate=self._params['dropout_rate'])(mlp)

        x_out = self._make_output_layer()(mlp)
        model = keras.Model(inputs=[query, doc], outputs=x_out)
        return model
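
The core of `build` is the interaction step: every position of one sentence representation is dotted with every position of the other, the resulting matrix is flattened, and only the `top_k` largest scores are passed on to the MLP. With two inputs of length 64 this gives 64 × 64 = 4096 scores, of which the 50 largest are kept. The plain-Python sketch below illustrates that step on tiny hypothetical vectors that stand in for the Bi-LSTM outputs; `top_k_matching` is an illustrative helper, not the Keras code.

```python
import heapq

def top_k_matching(rep_query, rep_doc, k):
    """Dot every query position with every doc position, keep the k largest scores."""
    scores = [
        sum(q * d for q, d in zip(q_vec, d_vec))
        for q_vec in rep_query
        for d_vec in rep_doc
    ]
    return heapq.nlargest(k, scores)   # sorted from largest to smallest

# Hypothetical 2-dimensional position representations.
rep_query = [[1.0, 0.0], [0.0, 2.0]]
rep_doc = [[1.0, 1.0], [3.0, 0.0], [0.0, -1.0]]

print(top_k_matching(rep_query, rep_doc, k=3))   # [3.0, 2.0, 1.0]
```

Because only the strongest interactions survive, the MLP input has a fixed size of `top_k` regardless of how the scores are distributed over the matrix.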

# Training the model

In [41]:
params = base_params
backend = MVLSTM(params)

In [42]:
model = backend.build()

In [43]:
model.compile(
        loss='categorical_crossentropy', 
        optimizer='adam', 
        metrics=['accuracy']
        )

In [44]:
print(model.summary())

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
text_left (InputLayer)          [(None, 64)]         0                                            
__________________________________________________________________________________________________
text_right (InputLayer)         [(None, 64)]         0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 64, 100)      170000      text_left[0][0]                  
                                                                 text_right[0][0]                 
__________________________________________________________________________________________________
bidirectional_6 (Bidirectional) (None, 64, 128)      84480       embedding[0][0]            

In [45]:
!mkdir output

mkdir: cannot create directory ‘output’: File exists


In [46]:
earlystop = keras.callbacks.EarlyStopping(
        monitor='val_accuracy', 
        patience=4, 
        verbose=2, 
        mode='max'
        )
model_name = 'mvlstm'
bast_model_filepath = './output/best_%s_model.h5' % model_name
checkpoint = keras.callbacks.ModelCheckpoint(
    bast_model_filepath, 
    monitor='val_accuracy', 
    verbose=1, 
    save_best_only=True,
    mode='max'
    )
model.fit(
    x=x, 
    y=y, 
    batch_size=64, 
    epochs=15, 
    validation_data=(x_eval, y_eval), 
    shuffle=True, 
    callbacks=[earlystop,checkpoint]
    )

Epoch 1/15

Epoch 00001: val_accuracy improved from -inf to 0.70750, saving model to ./output/best_mvlstm_model.h5
Epoch 2/15

Epoch 00002: val_accuracy improved from 0.70750 to 0.72060, saving model to ./output/best_mvlstm_model.h5
Epoch 3/15

Epoch 00003: val_accuracy improved from 0.72060 to 0.74310, saving model to ./output/best_mvlstm_model.h5
Epoch 4/15

Epoch 00004: val_accuracy improved from 0.74310 to 0.74880, saving model to ./output/best_mvlstm_model.h5
Epoch 5/15

Epoch 00005: val_accuracy improved from 0.74880 to 0.75380, saving model to ./output/best_mvlstm_model.h5
Epoch 6/15

Epoch 00006: val_accuracy improved from 0.75380 to 0.75800, saving model to ./output/best_mvlstm_model.h5
Epoch 7/15

Epoch 00007: val_accuracy did not improve from 0.75800
Epoch 8/15

Epoch 00008: val_accuracy improved from 0.75800 to 0.76990, saving model to ./output/best_mvlstm_model.h5
Epoch 9/15

Epoch 00009: val_accuracy did not improve from 0.76990
Epoch 10/15

Epoch 00010: val_accuracy impr

<tensorflow.python.keras.callbacks.History at 0x7f3924d300f0>
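
The `EarlyStopping` callback above stops training once `val_accuracy` has failed to improve for `patience=4` consecutive epochs. The sketch below is a simplified model of that rule (it ignores options such as `min_delta`), run on hypothetical accuracies rather than the values from this training run.

```python
def early_stop_epoch(val_scores, patience):
    """Return the 1-based epoch at which training stops, or None if it runs to the end."""
    best = float('-inf')
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best, wait = score, 0      # improvement: reset the counter
        else:
            wait += 1                  # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation accuracies, not the values from the run above.
history = [0.70, 0.72, 0.74, 0.73, 0.74, 0.73, 0.72, 0.71]
print(early_stop_epoch(history, patience=4))   # 7
```

Note that matching the best score (epoch 5 in the toy data) does not count as an improvement, so the counter keeps running.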

In [47]:
# Save the model architecture as JSON
model_frame_path = "./output/%s_model.json" % model_name
model_json = model.to_json()
with open(model_frame_path, "w") as json_file:
    json_file.write(model_json)

# Testing the model

In [48]:
model.load_weights(bast_model_filepath)
model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy']
    )

loss, acc = model.evaluate(
    x=x_test, 
    y=y_test, 
    batch_size=128, 
    verbose=1
    )
print("Test loss:",loss, "Test accuracy:",acc)

Test loss: 0.5093920230865479 Test accuracy: 0.7590000033378601
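
For a softmax classifier with one-hot labels, the reported accuracy is the fraction of examples whose highest-probability class equals the labeled class. A minimal sketch on hypothetical predictions (the helper `accuracy` and the numbers are assumptions made for illustration):

```python
def accuracy(probabilities, one_hot_labels):
    """Fraction of rows whose arg-max class matches the one-hot label."""
    correct = 0
    for probs, target in zip(probabilities, one_hot_labels):
        predicted = probs.index(max(probs))
        actual = target.index(max(target))
        correct += predicted == actual
    return correct / len(one_hot_labels)

# Hypothetical softmax outputs and labels for four sentence pairs.
probabilities = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3], [0.2, 0.8]]
labels = [[1, 0], [0, 1], [0, 1], [0, 1]]
print(accuracy(probabilities, labels))   # 0.75
```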
