# Quora Similar Question Detection

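As a high-quality knowledge platform, Quora needs to make sure the same question does not appear more than once. Answerers should not have to copy and paste the same answer under similar questions, and readers should find exactly one question on the site that matches what they are looking for. For example, "What is the best way to lose weight?", "How can I lose weight?", and "What is the most effective weight-loss plan?" are usually regarded as duplicate questions because they all share the same intent.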

__Data source__: Data @ Quora https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs

__Goal__: Given any sentence, generate a similar sentence

In [None]:
!wget -q -nc https://raw.githubusercontent.com/skyu0221/online-dropbox/master/ml/capstone2/quora_duplicate_questions.tsv

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from keras.utils.vis_utils import model_to_dot
from keras.models import Model, Sequential
from keras.layers import *

### Load and inspect the data

In [None]:
data = pd.read_csv("quora_duplicate_questions.tsv", sep ='\t')
data.head()

In [None]:
data.describe()

I recently came across a package for inspecting data, pandas-profiling. It is worth exploring on your own if you are interested.

In [None]:
# Keep only the question pairs labelled as duplicates
data = data[data['is_duplicate'] == 1]
data.head()

In [None]:
data.describe()

### Pre-processing

In [None]:
# Number of data points in the training data
num_samples = 10000
# Minimum number of occurrences for a word to be kept
min_count = 5

In [None]:
q1 = data['question1'][:num_samples].to_list()
q2 = data['question2'][:num_samples].to_list()

In [None]:
input_words = []
target_words = []
for sentence in q1:
    input_words.extend(sentence.split())
for sentence in q2:
    target_words.extend(sentence.split())

In [None]:
input_words = np.array(input_words, dtype=str)
unique_input_words, input_count = np.unique(input_words, return_counts=True)
target_words = np.array(target_words, dtype=str)
unique_target_words, target_count = np.unique(target_words, return_counts=True)
len(unique_input_words)

In [None]:
unique_input_words = unique_input_words[input_count >= min_count]
unique_target_words = unique_target_words[target_count >= min_count]
len(unique_input_words)
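
The frequency filter above relies on `np.unique(..., return_counts=True)` returning the sorted vocabulary together with each word's count. A minimal sketch on a made-up toy corpus (the sentences and the names `toy_sentences` and `toy_min_count` are assumptions for illustration, not part of the dataset):

```python
import numpy as np

# Toy corpus, invented for illustration only.
toy_sentences = ["how do I lose weight",
                 "how can I lose weight fast",
                 "how do I learn Python"]

# Flatten into one array of whitespace-separated tokens.
toy_words = np.array([w for s in toy_sentences for w in s.split()], dtype=str)

# Sorted unique words and how often each one occurs.
toy_vocab, toy_counts = np.unique(toy_words, return_counts=True)

# Keep only words that occur at least toy_min_count times.
toy_min_count = 2  # hypothetical threshold, smaller than the notebook's min_count
kept = toy_vocab[toy_counts >= toy_min_count]
print(kept.tolist())
```

Words that appear only once ("can", "fast", "learn", "Python") are dropped, which is how rare tokens are removed from the vocabulary before one-hot encoding.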

So that the model knows where a sentence starts and where it ends, we add an @ at the start of each target and a # at the end

In [None]:
q1 = [q.split() for q in q1]
q2 = [['@'] + q.split() + ['#'] for q in q2]

In [None]:
unique_target_words = np.hstack((unique_target_words, np.array(['@', '#'])))

In [None]:
for sentence in q1:
    i = 0
    while i < len(sentence):
        if sentence[i] not in unique_input_words:
            sentence.pop(i)
        else:
            i += 1
for sentence in q2:
    i = 0
    while i < len(sentence):
        if sentence[i] not in unique_target_words:
            sentence.pop(i)
        else:
            i += 1
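
The two loops above test membership against NumPy arrays, which scans the whole vocabulary for every token. A possible alternative, sketched here on made-up data, is to convert the vocabulary to a Python `set` and build filtered lists with a comprehension (the names `toy_sentences` and `toy_vocab` are hypothetical; unlike the loops above, this creates new lists instead of modifying them in place):

```python
# Toy tokenized sentences and vocabulary, invented for illustration.
toy_sentences = [['What', 'is', 'zzz', 'Python?'],
                 ['is', 'qqq', 'What']]
toy_vocab = {'What', 'is', 'Python?'}  # a set gives constant-time membership tests

# Keep only the tokens that are in the vocabulary, preserving their order.
filtered = [[tok for tok in sentence if tok in toy_vocab]
            for sentence in toy_sentences]
print(filtered)
```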

In [None]:
print(q1[0])
print(q2[0])
print(q1[5])
print(q2[5])

In [None]:
del input_words
del target_words

In [None]:
num_encoder_tokens = len(unique_input_words)
num_decoder_tokens = len(unique_target_words)

In [None]:
max_encoder_seq_len = max([len(q) for q in q1])
max_decoder_seq_len = max([len(q) for q in q2])

In [None]:
print('Total Number of samples: ', len(q1))
print('Number of unique input tokens (words): ', num_encoder_tokens)
print('Number of unique output tokens (words): ', num_decoder_tokens)
print('Max seq length for inputs: ', max_encoder_seq_len)
print('Max seq length for outputs: ', max_decoder_seq_len)

Assign an index to every token and create the training data

In [None]:
input_token_idx = dict([(token, i) for i, token in enumerate(unique_input_words)])
target_token_idx = dict([(token, i) for i, token in enumerate(unique_target_words)])

In [None]:
encoder_input = np.zeros((len(q1),
                          max_encoder_seq_len,
                          num_encoder_tokens), dtype = 'float32')
decoder_input = np.zeros((len(q1),
                          max_decoder_seq_len,
                          num_decoder_tokens), dtype = 'float32')
decoder_target = np.zeros((len(q1),
                           max_decoder_seq_len,
                           num_decoder_tokens), dtype = 'float32')

decoder_input holds the same tokens as decoder_target, but the target is shifted by one element

- Example: **"How can I see all my Youtube comments?"** (the `@` and `#` markers are left out here for readability)
- decoder_input is **"How / can / I / see / all / my / Youtube / comments?"**
- decoder_target is **"can / I / see / all / my / Youtube / comments?"**
- The seq2seq model sees **Input** and predicts **Target**

| Input |    |Target    |
|---------|-------|-----|
|How  | ========>|can  |
|can    | ========>      | I    |
|I    | ========>      | see    |
|see    | ========>      | all   |
|all    | ========>      | my    |
|my    | ========>      | Youtube    |
|Youtube    | ========>      | comments?    |
|comments?    | ========>      |     |
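
The one-step shift in the table can be sketched with a tiny one-hot example. This is an illustrative sketch on a made-up sentence, using the same `@` and `#` markers as the notebook; the names `toy_tokens`, `dec_in`, and `dec_tgt` are assumptions:

```python
import numpy as np

# Toy target sentence wrapped in the start ('@') and end ('#') markers.
toy_tokens = ['@', 'How', 'can', 'I', '#']
toy_vocab = sorted(set(toy_tokens))
toy_idx = {tok: i for i, tok in enumerate(toy_vocab)}

# One row per time step, one column per vocabulary entry.
dec_in = np.zeros((len(toy_tokens), len(toy_vocab)), dtype='float32')
dec_tgt = np.zeros((len(toy_tokens), len(toy_vocab)), dtype='float32')
for t, tok in enumerate(toy_tokens):
    dec_in[t, toy_idx[tok]] = 1.0
    if t > 0:
        # The target at step t-1 is the input token at step t.
        dec_tgt[t - 1, toy_idx[tok]] = 1.0

# Decode the one-hot rows back to tokens; the last target row stays all zeros.
in_tokens = [toy_vocab[i] for i in dec_in.argmax(axis=1)]
tgt_tokens = [toy_vocab[i] for i in dec_tgt[:-1].argmax(axis=1)]
print(in_tokens)
print(tgt_tokens)
```

At every step the target is simply the next input token, so the final target row is empty, matching the last row of the table.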

In [None]:
for i, (x, y) in enumerate(zip(q1, q2)):
    for t, token in enumerate(x):
        encoder_input[i, t, input_token_idx[token]] = 1.
    for t, token in enumerate(y):
        decoder_input[i, t, target_token_idx[token]] = 1.
        if t > 0:
            decoder_target[i, t-1, target_token_idx[token]] = 1.

### Build the model (seq2seq model)

Create two LSTM models (an encoder and a decoder)
<img src="https://blog.keras.io/img/seq2seq/seq2seq-inference.png" style="width: 500px"/>

In [None]:
encoder_inputs = Input(shape = (None, num_encoder_tokens))
encoder = LSTM(300, return_state = True)
_, state_h, state_c = encoder(encoder_inputs)
encoder_states = [state_h, state_c]

In [None]:
decoder_inputs = Input(shape = (None, num_decoder_tokens))
lstm = LSTM(300, return_sequences = True, return_state = True)
decoder_outputs, _, _ = lstm(decoder_inputs, initial_state = encoder_states)
dense = Dense(num_decoder_tokens, activation = 'softmax')
decoder_outputs = dense(decoder_outputs)

In [None]:
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [None]:
from IPython.display import SVG
SVG(model_to_dot(model).create(prog='dot', format='svg'))

In [None]:
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy')

In [None]:
model.summary()

In [None]:
history = model.fit([encoder_input, decoder_input],
                    decoder_target,
                    batch_size=100,
                    epochs = 10)

In [None]:
loss = history.history['loss']
plt.plot(loss, label='Training loss')
plt.legend()
plt.show()

### Split the model

In [None]:
encoder_model = Model(encoder_inputs, encoder_states)

In [None]:
decoder_state_input_h = Input(shape = (300, ))
decoder_state_input_c = Input(shape = (300, ))
decoder_state_inputs = [decoder_state_input_h, decoder_state_input_c]

In [None]:
decoder_outputs, state_h, state_c = lstm(decoder_inputs,
                                         initial_state = decoder_state_inputs)
decoder_states = [state_h, state_c]

In [None]:
decoder_outputs = dense(decoder_outputs)
decoder_model = Model([decoder_inputs] + decoder_state_inputs,
                      [decoder_outputs] + decoder_states)

In [None]:
reverse_input_token_idx = dict((i, token) for token, i in input_token_idx.items())
reverse_target_token_idx = dict((i, token) for token, i in target_token_idx.items())

In [None]:
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_idx['@']] = 1.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_token_idx[sampled_token_index]
        decoded_sentence += ' ' + sampled_char

        # Exit condition: either hit the max length (counted in tokens)
        # or find the stop token.
        if (sampled_char == '#' or
           len(decoded_sentence.split()) > max_decoder_seq_len):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        states_value = [h, c]

    return decoded_sentence
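
The control flow of the decoding loop (feed the previous token, take the argmax, stop at `#` or at a length limit) can be tried without a trained network. The sketch below replaces `decoder_model.predict` with a hypothetical `fake_step` function that always favours the next vocabulary entry; the toy vocabulary and all names are assumptions made for illustration:

```python
import numpy as np

# Toy vocabulary with the same start ('@') and end ('#') markers.
toy_vocab = ['@', 'how', 'are', 'you', '#']
toy_idx = {tok: i for i, tok in enumerate(toy_vocab)}

def fake_step(prev_index):
    """Hypothetical stand-in for the decoder: favours the next vocabulary entry."""
    probs = np.full(len(toy_vocab), 0.05)
    probs[min(prev_index + 1, len(toy_vocab) - 1)] = 0.8
    return probs

def greedy_decode(max_len=10):
    """Greedy decoding: always pick the most probable next token."""
    prev = toy_idx['@']
    decoded = []
    while len(decoded) < max_len:
        prev = int(np.argmax(fake_step(prev)))
        if toy_vocab[prev] == '#':
            break  # the stop token ends the sentence
        decoded.append(toy_vocab[prev])
    return decoded

print(greedy_decode())
print(greedy_decode(max_len=2))
```

Counting decoded tokens rather than characters keeps the length limit comparable to a limit such as `max_decoder_seq_len`, which is measured in tokens.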

In [None]:
for idx in range(100):
    input_seq = encoder_input[idx: idx+1]
    decoded_sent = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', q1[idx])
    print('Decoded sentence:', decoded_sent)