# ARC-II in Practice

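## Data preparation
The data comes from https://github.com/terrifyzhao/text_matching. It is a Chinese text-matching dataset with the columns sentence1, sentence2, and label, where 0 means the two sentences do not match and 1 means they match. The repository can be cloned straight from GitHub.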

In [None]:
!git clone https://github.com/terrifyzhao/text_matching.git
!cp -r /content/text_matching/input ./

Cloning into 'text_matching'...
remote: Enumerating objects: 334, done.
remote: Total 334 (delta 0), reused 0 (delta 0), pack-reused 334
Receiving objects: 100% (334/334), 24.77 MiB | 20.65 MiB/s, done.
Resolving deltas: 100% (212/212), done.


In [None]:
!ls ./input/

dev.csv  test.csv  train.csv  vocab.txt


In [None]:
!head -5 ./input/train.csv

﻿sentence1,sentence2,label
用微信都6年，微信没有微粒贷功能,4。  号码来微粒贷,0
微信消费算吗,还有多少钱没还,0
交易密码忘记了找回密码绑定的手机卡也掉了,怎么最近安全老是要改密码呢好麻烦,0
你好 我昨天晚上申请的没有打电话给我 今天之内一定会打吗？,什么时候可以到账,0
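
The header line of train.csv begins with a UTF-8 byte order mark (an invisible character in front of `sentence1`). The following is a minimal sketch, using only the standard library and two rows copied from the output above, of how decoding with `utf-8-sig` keeps that mark out of the first column name. The helper `read_rows` is a hypothetical name introduced for this illustration.

```python
import csv
import io

# Header plus two data rows copied from the `head` output above,
# with a UTF-8 byte order mark (BOM) in front of the header.
raw = (
    "\ufeffsentence1,sentence2,label\n"
    "微信消费算吗,还有多少钱没还,0\n"
    "你好 我昨天晚上申请的没有打电话给我 今天之内一定会打吗？,什么时候可以到账,0\n"
).encode("utf-8")

def read_rows(data: bytes, encoding: str) -> list:
    """Decode CSV bytes and return the rows as a list of dicts."""
    text = data.decode(encoding)
    return list(csv.DictReader(io.StringIO(text)))

rows_plain = read_rows(raw, "utf-8")      # the BOM stays glued to the first header
rows_clean = read_rows(raw, "utf-8-sig")  # the BOM is stripped while decoding

print(list(rows_plain[0].keys()))
print(list(rows_clean[0].keys()))
```

If a column lookup such as `df['sentence1']` ever fails on this file, passing `encoding='utf-8-sig'` to `pd.read_csv` is one thing to try.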


In [None]:
# Parameters used by the model in this notebook
base_params = {
    'num_classes':2,               # number of classes
    'max_features':1700,             # vocabulary size of the embedding layer
    'embed_size':200,              # embedding dimension
    'filters':300,
    'kernel_size':3,
    'strides':1,
    'padding':'same',
    'conv_activation_func':'relu',
    'embedding_matrix':[],
    'w_initializer':'random_uniform',
    'b_initializer':'zeros',
    'dropout_rate':0.2,
    'mlp_activation_func':'relu',
    'mlp_num_layers':1,
    'mlp_num_units':128,
    'mlp_num_fan_out':128,
    'input_shapes':[(64,),(64,)],         # length of each sentence after padding/truncation
    'task':'Classification',
}

# arcii_params is an alias of base_params (the same dict), so the keys set below are also visible through base_params
arcii_params = base_params
arcii_params['matching_type'] = 'dot'
arcii_params['num_blocks'] = 3
arcii_params['kernel_1d_count'] = 32
arcii_params['kernel_1d_size'] = 3
arcii_params['kernel_2d_count'] = [16, 32, 32]
arcii_params['kernel_2d_size'] = [[3, 3], [3, 3], [3, 3]]
arcii_params['pool_2d_size'] = [[2, 2], [2, 2], [2, 2]]
arcii_params['dropout_rate'] = 0.5
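
To get a feeling for what `num_blocks`, `kernel_2d_count` and `pool_2d_size` imply, the sketch below tracks the size of the interaction feature map through the three convolution and pooling blocks. It assumes `'same'` padding for the convolutions (as set in the dictionary) and Keras' default max-pooling behaviour, where the stride equals the pool size and incomplete windows are dropped. The function `feature_map_shapes` is a hypothetical helper, not part of the notebook.

```python
# Values copied from the parameter dictionary above.
seq_len = 64                       # input_shapes
kernel_2d_count = [16, 32, 32]
pool_2d_size = [[2, 2], [2, 2], [2, 2]]

def feature_map_shapes(height, width, channels, pools):
    """Track (height, width, channels) through conv + max-pool blocks.

    Assumes 'same' padding for the convolutions (spatial size unchanged) and
    non-overlapping pooling windows that floor-divide each spatial dimension.
    """
    shapes = []
    for out_channels, (pool_h, pool_w) in zip(channels, pools):
        height, width = height // pool_h, width // pool_w
        shapes.append((height, width, out_channels))
    return shapes

shapes = feature_map_shapes(seq_len, seq_len, kernel_2d_count, pool_2d_size)
flat_size = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes)     # [(32, 32, 16), (16, 16, 32), (8, 8, 32)]
print(flat_size)  # 2048
```

Under these assumptions the flattened vector fed to the output layer has 2048 entries.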

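# Data input and conversion
For Chinese text, this ARC-II implementation takes character-level input: the number of distinct characters is far smaller than the number of distinct words, which keeps the vocabulary small.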

In [None]:
# First collect every character that occurs in the text to build the character vocabulary
import pandas as pd

common_texts = []
for split in ('train', 'dev', 'test'):
    df = pd.read_csv(f'input/{split}.csv')
    for column in ('sentence1', 'sentence2'):
        # Drop spaces, then split each sentence into a list of characters
        common_texts.extend(list(s.replace(" ", "")) for s in df[column].values)

In [None]:
common_texts[0]

['用',
 '微',
 '信',
 '都',
 '6',
 '年',
 '，',
 '微',
 '信',
 '没',
 '有',
 '微',
 '粒',
 '贷',
 '功',
 '能']

In [None]:
# Use a set to collect the distinct characters
char_set = set()
for sample in common_texts:
    for char in sample:
        char_set.add(char)
with open('input/char_vocab.txt','w',encoding='utf8') as f:
    f.write("\n".join(sorted(list(char_set),reverse=True)))

In [None]:
# First read the training set
df = pd.read_csv('input/train.csv')
p = df['sentence1'].values
h = df['sentence2'].values
label = df['label'].values

In [None]:
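# Write a shuffle function that randomly permutes the data
import numpy as np

def shuffle(*arrs):
    """ shuffle

    Shuffle several arrays with the same random permutation.

    Arguments:
        *arrs: arrays of equal length

    Returns:
        tuple of the shuffled arrays

    """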
    arrs = list(arrs)
    for i, arr in enumerate(arrs):
        assert len(arrs[0]) == len(arrs[i])
        arrs[i] = np.array(arr)
    p = np.random.permutation(len(arrs[0]))
    return tuple(arr[p] for arr in arrs)

p, h, label = shuffle(p, h, label)
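
The important property of this shuffle is that one permutation is shared by all arrays, so every sentence pair keeps its label. Below is a pure-Python analogue of the NumPy function above, written only to make that property easy to check; `shuffle_together` and its `seed` argument are hypothetical names used for this sketch.

```python
import random

def shuffle_together(*seqs, seed=None):
    """Shuffle several equal-length sequences with one shared permutation."""
    length = len(seqs[0])
    assert all(len(s) == length for s in seqs), "sequences must have equal length"
    order = list(range(length))
    random.Random(seed).shuffle(order)
    return tuple([s[i] for i in order] for s in seqs)

left = ["a1", "b1", "c1", "d1"]
right = ["a2", "b2", "c2", "d2"]
labels = [0, 1, 0, 1]

s_left, s_right, s_labels = shuffle_together(left, right, labels, seed=0)

# Every shuffled triple still matches one of the original rows.
original_rows = set(zip(left, right, labels))
print(all(row in original_rows for row in zip(s_left, s_right, s_labels)))  # True
```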

In [None]:
# Next, characters must be turned into integer ids, so index the vocabulary first
# Load the character vocabulary
def load_char_vocab():
    with open('./input/char_vocab.txt', encoding='utf-8') as f:
        vocab = [line.strip() for line in f]
    # Ids start at 1 so that 0 stays reserved for padding
    word2idx = {word: index for index, word in enumerate(vocab, start=1)}
    idx2word = {index: word for index, word in enumerate(vocab, start=1)}
    return word2idx, idx2word

word2idx, idx2word = load_char_vocab()
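
As a small illustration of the mapping, the sketch below uses a toy four-character vocabulary in place of input/char_vocab.txt and encodes one sentence the same way the notebook does: characters are lower-cased, whitespace is skipped, and characters missing from the vocabulary are dropped. The names `toy_vocab`, `char2idx` and `encode` are hypothetical.

```python
# Toy vocabulary standing in for input/char_vocab.txt.
toy_vocab = ["微", "信", "贷", "a"]

# Ids start at 1 so that 0 stays free for padding.
char2idx = {ch: i for i, ch in enumerate(toy_vocab, start=1)}

def encode(sentence, mapping):
    """Map each character to its id, skipping whitespace and unknown characters."""
    ids = []
    for ch in sentence:
        ch = ch.lower()
        if ch.strip() and ch in mapping:
            ids.append(mapping[ch])
    return ids

print(encode("微信 A贷款", char2idx))  # [1, 2, 4, 3]
```

Note that unknown characters vanish silently, so an encoded sentence can be shorter than the original text.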

In [None]:
# Convert each sentence to a list of character ids using the vocabulary
p_list, h_list = [], []
for p_sentence, h_sentence in zip(p, h):
    p_ids = [word2idx[word.lower()] for word in p_sentence if len(word.strip()) > 0 and word.lower() in word2idx]
    h_ids = [word2idx[word.lower()] for word in h_sentence if len(word.strip()) > 0 and word.lower() in word2idx]

    p_list.append(p_ids)
    h_list.append(h_ids)

In [None]:
# Look at an encoded sentence
p_list[0]

[1290, 1563, 1290, 1496, 1301, 1569, 1552, 433, 645, 291, 792]

In [None]:
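# Next, write a pad_sequences function that brings every sentence to the same length, padding short ones with zeros
def pad_sequences(sequences, maxlen=None, dtype='int32', padding='post',
                  truncating='post', value=0.):
    """ pad_sequences

    Make all sequences the same length. If maxlen is set, every sequence is
    padded or truncated to maxlen; otherwise the longest sequence length is used.
    Padding and truncation each support two modes, post and pre: post works at
    the end of the sequence, pre at the beginning. Both default to post.

    Arguments:
        sequences: list of sequences
        maxlen: int, target length
        dtype: data type of the output array
        padding: padding mode, 'pre' or 'post'
        truncating: truncation mode, 'pre' or 'post'
        value: float, padding value

    Returns:
        x: numpy array of padded sequences with shape (number_of_sequences, maxlen)

    """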
    lengths = [len(s) for s in sequences]

    nb_samples = len(sequences)
    if maxlen is None:
        maxlen = np.max(lengths)

    x = (np.ones((nb_samples, maxlen)) * value).astype(dtype)
    for idx, s in enumerate(sequences):
        if len(s) == 0:
            continue  # empty list was found
        if truncating == 'pre':
            trunc = s[-maxlen:]
        elif truncating == 'post':
            trunc = s[:maxlen]
        else:
            raise ValueError("Truncating type '%s' not understood" % truncating)

        if padding == 'post':
            x[idx, :len(trunc)] = trunc
        elif padding == 'pre':
            x[idx, -len(trunc):] = trunc
        else:
            raise ValueError("Padding type '%s' not understood" % padding)
    return x

p_list = pad_sequences(p_list, maxlen=base_params['input_shapes'][0][0])
h_list = pad_sequences(h_list, maxlen=base_params['input_shapes'][1][0])

In [None]:
# Look at the same encoded sentence after padding
p_list[0]

array([1290, 1563, 1290, 1496, 1301, 1569, 1552,  433,  645,  291,  792,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0], dtype=int32)
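
The four combinations of padding and truncation are easiest to see on a single short list. The sketch below is a list-based simplification of the function above for one sequence; `pad_or_truncate` is a hypothetical helper, not the notebook's `pad_sequences`.

```python
def pad_or_truncate(seq, maxlen, padding="post", truncating="post", value=0):
    """Return a list of exactly `maxlen` items built from `seq`."""
    if truncating == "post":
        kept = seq[:maxlen]      # drop items from the end
    elif truncating == "pre":
        kept = seq[-maxlen:]     # drop items from the beginning
    else:
        raise ValueError(f"Truncating type '{truncating}' not understood")
    fill = [value] * (maxlen - len(kept))
    if padding == "post":
        return kept + fill       # pad at the end
    if padding == "pre":
        return fill + kept       # pad at the beginning
    raise ValueError(f"Padding type '{padding}' not understood")

print(pad_or_truncate([5, 6, 7], 5))                             # [5, 6, 7, 0, 0]
print(pad_or_truncate([5, 6, 7], 5, padding="pre"))              # [0, 0, 5, 6, 7]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))                    # [1, 2, 3, 4]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4, truncating="pre"))  # [3, 4, 5, 6]
```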

In [None]:
# Build the features and labels for the training set
import keras
p = p_list
h = h_list
y = label
x = [p,h]
y = keras.utils.to_categorical(y,num_classes=2)
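
`to_categorical` turns each integer label into a one-hot row, which is the target format expected by the `categorical_crossentropy` loss used later. A pure-Python sketch of the same idea, with `one_hot` as a hypothetical stand-in for the Keras utility:

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[int(label)] = 1.0
        rows.append(row)
    return rows

print(one_hot([0, 1, 1], num_classes=2))  # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```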

In [None]:
# Apply the same steps to the validation set
df = pd.read_csv('input/dev.csv')
p = df['sentence1'].values
h = df['sentence2'].values
label = df['label'].values

p, h, label = shuffle(p, h, label)
word2idx, idx2word = load_char_vocab()

# Convert each sentence to a list of character ids using the vocabulary
p_list, h_list = [], []
for p_sentence, h_sentence in zip(p, h):
    p_ids = [word2idx[word.lower()] for word in p_sentence if len(word.strip()) > 0 and word.lower() in word2idx]
    h_ids = [word2idx[word.lower()] for word in h_sentence if len(word.strip()) > 0 and word.lower() in word2idx]

    p_list.append(p_ids)
    h_list.append(h_ids)

p_list = pad_sequences(p_list, maxlen=base_params['input_shapes'][0][0])
h_list = pad_sequences(h_list, maxlen=base_params['input_shapes'][1][0])

p_eval = p_list
h_eval = h_list
y_eval = label
x_eval = [p_eval, h_eval]
y_eval = keras.utils.to_categorical(y_eval,num_classes=2)

In [None]:
# Apply the same steps to the test set. The test set is labeled so that the model can be evaluated; it takes no part in training
df = pd.read_csv('input/test.csv')
p = df['sentence1'].values
h = df['sentence2'].values
label = df['label'].values

p, h, label = shuffle(p, h, label)
word2idx, idx2word = load_char_vocab()

# Convert each sentence to a list of character ids using the vocabulary
p_list, h_list = [], []
for p_sentence, h_sentence in zip(p, h):
    p_ids = [word2idx[word.lower()] for word in p_sentence if len(word.strip()) > 0 and word.lower() in word2idx]
    h_ids = [word2idx[word.lower()] for word in h_sentence if len(word.strip()) > 0 and word.lower() in word2idx]

    p_list.append(p_ids)
    h_list.append(h_ids)

p_list = pad_sequences(p_list, maxlen=base_params['input_shapes'][0][0])
h_list = pad_sequences(h_list, maxlen=base_params['input_shapes'][1][0])

p_test = p_list
h_test = h_list
y_test = label
x_test = [p_test, h_test]
y_test = keras.utils.to_categorical(y_test, num_classes=2)

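# Building the model

The model follows https://github.com/wangle1218/deep_text_matching and MatchZoo: https://github.com/NTMC-Community/MatchZoo  
The model structure is defined in the build function: an input layer, an embedding layer, a matching layer, a stack of convolution and pooling blocks, and an output layer. ARC-II is an interaction-based method: the two texts interact first to produce an interaction matrix, and deeper features are then extracted from it for classification. MatchingLayer implements the interaction and takes the outputs of the 1D convolution over both sentences. For the element-wise types (mul, plus, minus, concat), the features of sentence x are treated as a 1\*M row and those of sentence y are transposed into an M\*1 column; both are tiled to M\*M and then combined (added, in the case of plus), which gives the cross-convolution effect. With the dot type configured in this notebook, each entry of the M\*M matrix is instead the dot product between one position of x and one position of y.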

In [None]:
from keras.layers import Layer, Flatten
import tensorflow as tf
import keras.backend as K
import keras


class MatchingLayer(Layer):
    """
    Layer that computes a matching matrix between samples in two tensors.
    :param normalize: Whether to L2-normalize samples along the
        dot product axis before taking the dot product.
        If set to True, then the output of the dot product
        is the cosine proximity between the two samples.
    :param matching_type: the similarity function for matching
    :param kwargs: Standard layer keyword arguments.
    """

    def __init__(self, normalize: bool = False,
                 matching_type: str = 'dot', **kwargs):
        """:class:`MatchingLayer` constructor."""
        super().__init__(**kwargs)
        self._normalize = normalize
        self._validate_matching_type(matching_type)
        self._matching_type = matching_type
        self._shape1 = None
        self._shape2 = None

    @classmethod
    def _validate_matching_type(cls, matching_type: str = 'dot'):
        valid_matching_type = ['dot', 'mul', 'plus', 'minus', 'concat']
        if matching_type not in valid_matching_type:
            raise ValueError(f"{matching_type} is not a valid matching type, "
                             f"{valid_matching_type} expected.")

    def build(self, input_shape: list):
        """
        Build the layer.
        :param input_shape: the shapes of the input tensors,
            for MatchingLayer we need two input tensors.
        """
        # Used purely for shape validation.
        if not isinstance(input_shape, list) or len(input_shape) != 2:
            raise ValueError('A `MatchingLayer` layer should be called '
                             'on a list of 2 inputs.')
        self._shape1 = input_shape[0]
        self._shape2 = input_shape[1]
        for idx in 0, 2:
            if self._shape1[idx] != self._shape2[idx]:
                raise ValueError(
                    'Incompatible dimensions: '
                    f'{self._shape1[idx]} != {self._shape2[idx]}. '
                    f'Layer shapes: {self._shape1}, {self._shape2}.'
                )

    def call(self, inputs: list, **kwargs):
        """
        The computation logic of MatchingLayer.
        :param inputs: two input tensors.
        """
        x1 = inputs[0]
        x2 = inputs[1]
        if self._matching_type == 'dot':
            if self._normalize:
                x1 = tf.math.l2_normalize(x1, axis=2)
                x2 = tf.math.l2_normalize(x2, axis=2)
            return tf.expand_dims(tf.einsum('abd,acd->abc', x1, x2), 3)
        else:
            if self._matching_type == 'mul':
                def func(x, y):
                    return x * y
            elif self._matching_type == 'plus':
                def func(x, y):
                    return x + y
            elif self._matching_type == 'minus':
                def func(x, y):
                    return x - y
            elif self._matching_type == 'concat':
                def func(x, y):
                    return tf.concat([x, y], axis=3)
            else:
                raise ValueError(f"Invalid matching type. "
                                 f"{self._matching_type} received. "
                                 f"Must be in `dot`, `mul`, `plus`, "
                                 f"`minus` and `concat`.")
            x1_exp = tf.stack([x1] * self._shape2[1], 2)
            x2_exp = tf.stack([x2] * self._shape1[1], 1)
            return func(x1_exp, x2_exp)

    def compute_output_shape(self, input_shape: list) -> tuple:
        """
        Calculate the layer output shape.
        :param input_shape: the shapes of the input tensors,
            for MatchingLayer we need two input tensors.
        """
        if not isinstance(input_shape, list) or len(input_shape) != 2:
            raise ValueError('A `MatchingLayer` layer should be called '
                             'on a list of 2 inputs.')
        shape1 = list(input_shape[0])
        shape2 = list(input_shape[1])
        if len(shape1) != 3 or len(shape2) != 3:
            raise ValueError('A `MatchingLayer` layer should be called '
                             'on 2 inputs with 3 dimensions.')
        if shape1[0] != shape2[0] or shape1[2] != shape2[2]:
            raise ValueError('A `MatchingLayer` layer should be called '
                             'on 2 inputs with same 0,2 dimensions.')

        if self._matching_type in ['mul', 'plus', 'minus']:
            return shape1[0], shape1[1], shape2[1], shape1[2]
        elif self._matching_type == 'dot':
            return shape1[0], shape1[1], shape2[1], 1
        elif self._matching_type == 'concat':
            return shape1[0], shape1[1], shape2[1], shape1[2] + shape2[2]
        else:
            raise ValueError(f"Invalid `matching_type`. "
                             f"{self._matching_type} received. "
                             f"Must be in `mul`, `plus`, `minus`, "
                             f"`dot` and `concat`.")

    def get_config(self) -> dict:
        """Get the config dict of MatchingLayer."""
        config = {
            'normalize': self._normalize,
            'matching_type': self._matching_type,
        }
        base_config = super(MatchingLayer, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
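
To make the two families of matching types concrete, here is a pure-Python sketch for a single sentence pair (no batch dimension). It mirrors the `dot` branch with normalization, which yields cosine similarities, and the `plus` branch, which keeps the channel dimension. The function names and the toy feature values are hypothetical and only serve as an illustration of the layer above.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def dot_matching(left, right, normalize=False):
    """Matrix M[i][j] = <left[i], right[j]> for two lists of feature vectors."""
    if normalize:
        left = [l2_normalize(v) for v in left]
        right = [l2_normalize(v) for v in right]
    return [[sum(a * b for a, b in zip(u, v)) for v in right] for u in left]

def plus_matching(left, right):
    """Tensor T[i][j] = left[i] + right[j], keeping the feature dimension."""
    return [[[a + b for a, b in zip(u, v)] for v in right] for u in left]

# Hypothetical toy features: 2 positions on the left, 3 on the right, 2 channels.
left = [[1.0, 0.0], [0.0, 2.0]]
right = [[3.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cosine = dot_matching(left, right, normalize=True)
print([[round(v, 3) for v in row] for row in cosine])
# [[1.0, 0.0, 0.707], [0.0, 1.0, 0.707]]
print(plus_matching(left, right)[0][2])  # [2.0, 1.0]
```

The `dot` result has one value per position pair, which is why the layer reports a single output channel for it, while `plus` keeps as many channels as the inputs have.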

In [None]:
from keras.models import Model
import tensorflow as tf 

np.random.seed(1)
tf.random.set_seed(1)

class ArcII(object):

    def __init__( self, params):
        """Init."""
        self._params = params

    
    def make_embedding_layer(self,name='embedding',embed_type='char',**kwargs):   # builds the embedding layer, optionally from pretrained vectors

        def init_embedding(weights=None):
            if embed_type == "char":
                input_dim = self._params['max_features']
                output_dim = self._params['embed_size']
            else:
                input_dim = self._params['word_max_features']
                output_dim = self._params['word_embed_size']

            return keras.layers.Embedding(
                input_dim = input_dim,
                output_dim = output_dim,
                trainable = False,   # embedding weights stay frozen, even when they are randomly initialized
                name = name,
                weights = weights,
                **kwargs)

        if embed_type == "char":
            embed_weights = self._params['embedding_matrix']
        else:
            embed_weights = self._params['word_embedding_matrix']

        if len(embed_weights) == 0:   # also safe when a NumPy matrix is passed in
            embedding = init_embedding()
        else:
            embedding = init_embedding(weights = [embed_weights])

        return embedding

    def _make_multi_layer_perceptron_layer(self) -> keras.layers.Layer:   # multi-layer perceptron (fully connected layers)
        # TODO: do not create new layers for a second call
        def _wrapper(x):
            activation = self._params['mlp_activation_func']   # activation of the dense layers
            for _ in range(self._params['mlp_num_layers']):  # number of hidden dense layers
                x = keras.layers.Dense(self._params['mlp_num_units'],   # units per hidden dense layer
                                       activation=activation)(x)
            return keras.layers.Dense(self._params['mlp_num_fan_out'],  # units of the final dense layer
                                      activation=activation)(x)

        return _wrapper

    def _make_inputs(self) -> list:        # define the input layers: the two texts whose match is to be judged
        input_left = keras.layers.Input(
            name='text_left',
            shape=self._params['input_shapes'][0]
        )
        input_right = keras.layers.Input(
            name='text_right',
            shape=self._params['input_shapes'][1]
        )
        return [input_left, input_right]

    def _make_output_layer(self) -> keras.layers.Layer:   # define the output layer
        """:return: a correctly shaped keras dense layer for model output."""
        task = self._params['task']
        if task == "Classification":
            return keras.layers.Dense(self._params['num_classes'], activation='softmax')  # classification uses softmax: classify a pair of texts
        elif task == "Ranking":
            return keras.layers.Dense(1, activation='linear')          # ranking uses linear: score and rank several candidate texts
        else:
            raise ValueError(f"{task} is not a valid task type."
                             f"Must be in `Ranking` and `Classification`.")


    def build(self):
        """
        Build model structure.
        ArcII has the desirable property of letting two sentences meet before
        their own high-level representations mature.
        """
        input_left, input_right = self._make_inputs()

        embedding = self.make_embedding_layer()
        embed_left = embedding(input_left)
        embed_right = embedding(input_right)

        # Phrase level representations
        # conv_1d_left = keras.layers.Conv1D(
        #     self._params['kernel_1d_count'],
        #     self._params['kernel_1d_size'],
        #     padding=self._params['padding']
        # )(embed_left)
        # conv_1d_right = keras.layers.Conv1D(
        #     self._params['kernel_1d_count'],
        #     self._params['kernel_1d_size'],
        #     padding=self._params['padding']
        # )(embed_right)
        conv_1d = keras.layers.Conv1D(
            self._params['kernel_1d_count'],
            self._params['kernel_1d_size'],
            padding=self._params['padding']
        )
        conv_1d_left = conv_1d(embed_left)                    # the convolution layer is shared by both sides
        conv_1d_right = conv_1d(embed_right)

        # Interaction
        embed_cross = MatchingLayer(                       # interaction layer
            normalize=True,
            matching_type=self._params['matching_type']
            )([conv_1d_left, conv_1d_right])

        for i in range(self._params['num_blocks']):       # the interaction gives a 2D feature map; apply convolution and pooling to it
            embed_cross = self._conv_pool_block(
                embed_cross,
                self._params['kernel_2d_count'][i],
                self._params['kernel_2d_size'][i],
                self._params['padding'],
                self._params['conv_activation_func'],
                self._params['pool_2d_size'][i]
            )

        embed_flat = keras.layers.Flatten()(embed_cross)    # flatten the 2D feature maps into a vector
        x = keras.layers.Dropout(rate=self._params['dropout_rate'])(embed_flat)

        inputs = [input_left, input_right]
        x_out = self._make_output_layer()(x)               # the output layer performs the classification
        model = keras.Model(inputs=inputs, outputs=x_out)

        return model

    @classmethod
    def _conv_pool_block(cls, x,kernel_count, kernel_size,padding,activation,pool_size):
        output = keras.layers.Conv2D(kernel_count,
                      kernel_size,
                      padding=padding,
                      activation=activation)(x)
        output = keras.layers.MaxPooling2D(pool_size=pool_size)(output)
        # output = keras.layers.normalization.BatchNormalization()(output)
        return output

# Training the model

In [None]:
params = base_params
backend = ArcII(params)

In [None]:
model = backend.build()

In [None]:
model.compile(
        loss='categorical_crossentropy', 
        optimizer='adam', 
        metrics=['accuracy']
        )

In [None]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
text_left (InputLayer)          [(None, 64)]         0                                            
__________________________________________________________________________________________________
text_right (InputLayer)         [(None, 64)]         0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 64, 200)      340000      text_left[0][0]                  
                                                                 text_right[0][0]                 
__________________________________________________________________________________________________
conv1d (Conv1D)                 (None, 64, 32)       19232       embedding[0][0]              
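
The two parameter counts visible in the summary can be reproduced by hand, which is a useful sanity check on `max_features`, `embed_size` and the 1D kernel settings. The sketch below does that, and then applies the same formulas to the 2D blocks and the output layer. Those later numbers are not shown in the summary above; they are computed here from the parameter dictionary, assuming a single-channel `dot` interaction map and a flattened size of 8\*8\*32 (64 halved by three 2x2 poolings). The helper functions are hypothetical.

```python
def embedding_params(vocab_size, embed_size):
    """One vector of length embed_size per vocabulary entry."""
    return vocab_size * embed_size

def conv_params(kernel_elems, in_channels, out_channels):
    """Kernel weights plus one bias per output channel."""
    return (kernel_elems * in_channels + 1) * out_channels

def dense_params(in_features, out_features):
    """Weight matrix plus one bias per output unit."""
    return (in_features + 1) * out_features

print(embedding_params(1700, 200))   # 340000, matches the summary
print(conv_params(3, 200, 32))       # 19232, matches the summary

# Not shown in the summary above; computed from the parameter dictionary.
conv2d_counts = []
in_ch = 1  # the 'dot' matching layer outputs a single channel
for out_ch in [16, 32, 32]:
    conv2d_counts.append(conv_params(3 * 3, in_ch, out_ch))
    in_ch = out_ch
print(conv2d_counts)                 # [160, 4640, 9248]
print(dense_params(8 * 8 * 32, 2))   # 4098
```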

In [None]:
!mkdir output

In [None]:
earlystop = keras.callbacks.EarlyStopping(
        monitor='val_accuracy', 
        patience=4, 
        verbose=2, 
        mode='max'
        )
model_name = 'ARC_II'
bast_model_filepath = './output/best_%s_model.h5' % model_name
checkpoint = keras.callbacks.ModelCheckpoint(
    bast_model_filepath, 
    monitor='val_accuracy', 
    verbose=1, 
    save_best_only=True,
    mode='max'
    )
model.fit(
    x=x, 
    y=y, 
    batch_size=64, 
    epochs=15, 
    validation_data=(x_eval, y_eval), 
    shuffle=True, 
    callbacks=[earlystop,checkpoint]
    )

Epoch 1/15

Epoch 00001: val_accuracy improved from -inf to 0.77070, saving model to ./output/best_ARC_II_model.h5
Epoch 2/15

Epoch 00002: val_accuracy improved from 0.77070 to 0.77990, saving model to ./output/best_ARC_II_model.h5
Epoch 3/15

Epoch 00003: val_accuracy improved from 0.77990 to 0.78830, saving model to ./output/best_ARC_II_model.h5
Epoch 4/15

Epoch 00004: val_accuracy did not improve from 0.78830
Epoch 5/15

Epoch 00005: val_accuracy did not improve from 0.78830
Epoch 6/15

Epoch 00006: val_accuracy did not improve from 0.78830
Epoch 7/15

Epoch 00007: val_accuracy did not improve from 0.78830
Epoch 00007: early stopping


<tensorflow.python.keras.callbacks.History at 0x7f157919d780>

In [None]:
# Save the model architecture as JSON (the best weights are saved separately by the checkpoint callback)
model_frame_path = "./output/%s_model.json" % model_name
model_json = model.to_json()
with open(model_frame_path, "w") as json_file:
    json_file.write(model_json)
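
The training log above stops at epoch 7 because the validation accuracy peaked at epoch 3 and then failed to improve for `patience=4` consecutive epochs. The sketch below replays that rule in plain Python. It is a simplified model of patience-based stopping for a metric that should increase, not the Keras callback itself; the first three scores come from the log, while the later ones are hypothetical placeholders, since the log only reports that they did not improve.

```python
def early_stopping_epoch(val_scores, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for a 'max' metric.

    Training stops once `patience` consecutive epochs fail to beat the best
    score. If that never happens, stop_epoch is the last epoch.
    """
    best_score, best_epoch, waited = float("-inf"), 0, 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_scores), best_epoch

# Epochs 1-3 are taken from the log; the remaining values are hypothetical.
scores = [0.77070, 0.77990, 0.78830, 0.78, 0.78, 0.78, 0.78, 0.79, 0.79]
print(early_stopping_epoch(scores, patience=4))  # (7, 3)
```

Because the checkpoint callback uses `save_best_only=True`, the weights on disk are those of the best epoch, not of the epoch at which training stopped.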

# Testing the model

In [None]:
model.load_weights(bast_model_filepath)
model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy']
    )

loss, acc = model.evaluate(
    x=x_test, 
    y=y_test, 
    batch_size=128, 
    verbose=1
    )
print("Test loss:",loss, "Test accuracy:",acc)

Test loss: 0.47998741269111633 Test accuracy: 0.7724000215530396
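
Accuracy alone does not show which kind of mistake the model makes. As a possible follow-up, the sketch below derives accuracy and a 2x2 confusion matrix from one-hot labels and softmax outputs. The probabilities are hypothetical values invented for the example, not predictions of the trained model, and the helper names are assumptions as well.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def evaluate(y_true_onehot, y_prob, num_classes=2):
    """Return (accuracy, confusion) where confusion[true][pred] counts samples."""
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for true_row, prob_row in zip(y_true_onehot, y_prob):
        confusion[argmax(true_row)][argmax(prob_row)] += 1
    correct = sum(confusion[c][c] for c in range(num_classes))
    return correct / len(y_true_onehot), confusion

# Hypothetical one-hot labels and softmax outputs for four sentence pairs.
y_true_demo = [[1, 0], [0, 1], [0, 1], [1, 0]]
y_prob_demo = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3], [0.8, 0.2]]

acc_demo, confusion_demo = evaluate(y_true_demo, y_prob_demo)
print(acc_demo)        # 0.75
print(confusion_demo)  # [[2, 0], [1, 1]]
```

With the real model, `model.predict(x_test)` would supply the probability rows.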
