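# A chatbot application built with seq2seq
In this notebook we use a seq2seq framework to build a chatbot. I have prepared some dialogue data, and we will train the chatbot on it. Before that, we look at the corpus format the original translation system expects, so that we can process the Chinese data into the same shape.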

We start by pulling a sample dataset.

In [22]:
%cd nmt
!bash nmt/scripts/download_iwslt15.sh /tmp/nmt_data

/content/nmt
Download training dataset train.en and train.vi.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12.9M  100 12.9M    0     0  4415k      0  0:00:03  0:00:03 --:--:-- 4415k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 17.2M  100 17.2M    0     0  5386k      0  0:00:03  0:00:03 --:--:-- 5386k
Download dev dataset tst2012.en and tst2012.vi.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  136k  100  136k    0     0   188k      0 --:--:-- --:--:-- --:--:--  188k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  183k  100  183k 

**List the downloaded files**

In [23]:
!ls /tmp/nmt_data

train.en  tst2012.en  tst2013.en  vocab.en
train.vi  tst2012.vi  tst2013.vi  vocab.vi


**Check the format of the source and target languages, and the amount of data**

As the sample shows, the data has already been tokenized.

In [24]:
!head -10 /tmp/nmt_data/train.en

Rachel Pike : The science behind a climate headline
In 4 minutes , atmospheric chemist Rachel Pike provides a glimpse of the massive scientific effort behind the bold headlines on climate change , with her team -- one of thousands who contributed -- taking a risky flight over the rainforest in pursuit of data on a key molecule .
I &apos;d like to talk to you today about the scale of the scientific effort that goes into making the headlines you see in the paper .
Headlines that look like this when they have to do with climate change , and headlines that look like this when they have to do with air quality or smog .
They are both two branches of the same field of atmospheric science .
Recently the headlines looked like this when the Intergovernmental Panel on Climate Change , or IPCC , put out their report on the state of understanding of the atmospheric system .
That report was written by 620 scientists from 40 countries .
They wrote almost a thousand pages on the topic .
And all of tho

In [25]:
!wc -l /tmp/nmt_data/train.en

133317 /tmp/nmt_data/train.en


**A vocabulary file is also required**

In [26]:
!head -10 /tmp/nmt_data/vocab.en

<unk>
<s>
</s>
Rachel
:
The
science
behind
a
climate


In [27]:
!wc -l /tmp/nmt_data/vocab.en

17191 /tmp/nmt_data/vocab.en


In [28]:
!wc -l /tmp/nmt_data/vocab.vi

7709 /tmp/nmt_data/vocab.vi
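
The sample `vocab.en` above starts with `<unk>`, `<s>` and `</s>`, and the vocabulary we build later in this notebook writes the same three tokens first. A minimal sketch of a check for that layout is shown below. The helper name `vocab_has_special_tokens` and the demo file are illustrative assumptions, not part of the nmt toolkit.

```python
import os
import tempfile

# The three special tokens, in the order seen at the top of vocab.en
SPECIAL_TOKENS = ["<unk>", "<s>", "</s>"]

def vocab_has_special_tokens(vocab_path):
    """Return True if the first three lines are <unk>, <s>, </s>, in that order."""
    with open(vocab_path, encoding="utf-8") as f:
        head = [next(f, "").rstrip("\n") for _ in range(len(SPECIAL_TOKENS))]
    return head == SPECIAL_TOKENS

# Tiny demonstration on a throwaway file (hypothetical content)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.demo")
    with open(path, "w", encoding="utf-8") as f:
        f.write("<unk>\n<s>\n</s>\nRachel\n:\n")
    demo_ok = vocab_has_special_tokens(path)

print(demo_ok)
```

The same function could be pointed at `/tmp/nmt_data/vocab.en` or at a vocabulary file you generate yourself.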


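### Chatbot corpora
Below are some dialogue corpora found online that can be used to train Chinese (or English) chatbots.

1. [dgk_shooter_min.conv.zip](https://github.com/rustch3n/dgk_lost_conv)
<br>Chinese movie dialogue corpus. It is fairly noisy, and many lines are not properly paired as question and answer.

2. [The NUS SMS Corpus](https://github.com/kite1988/nus-sms-corpus)
<br>Chinese and English SMS messages, reportedly the largest publicly available SMS corpus in the world.

3. [ChatterBot Chinese basic chat corpus](https://github.com/gunthercox/chatterbot-corpus/tree/master/chatterbot_corpus/data)
<br>A small set of basic Chinese chat data shipped with the ChatterBot engine. The volume is small, but the quality is relatively high.

4. [Datasets for Natural Language Processing](https://github.com/karthikncode/nlp-datasets)
<br>A third-party collection of NLP datasets covering three areas: Question Answering, Dialogue Systems, and Goal-Oriented Dialogue Systems. All of the text is English; it can be machine-translated into Chinese for use in Chinese dialogue systems.

5. [Xiaohuangji (小黄鸡)](https://github.com/rustch3n/dgk_lost_conv/tree/master/results)
<br>Reportedly the corpus behind the Xiaohuangji chatbot: xiaohuangji50w_fenciA.conv.zip (word-segmented) and xiaohuangji50w_nofenci.conv.zip (unsegmented).

6. [Egret Technology Chinese Q&A corpus (白鹭时代中文问答语料)](https://github.com/Samurais/egret-wenda-corpus)
<br>Compiled from the 10,000+ questions in the Q&A section of the official Egret Technology forum, keeping the records marked with a "best answer". The raw data was reviewed manually so that each question has one acceptable answer. The corpus currently contains only 2,907 question-answer pairs. ([backup](./egret-wenda-corpus.zip))

7. [Chat corpus repository](https://github.com/Marsan-Ma/chat_corpus)
<br>Chat corpus collection from various open sources.
<br>Includes open subtitles, English movie subtitles, Chinese lyrics, and English tweets.

8. [Insurance industry QA corpus (保险行业QA语料库)](https://github.com/Samurais/insuranceqa-corpus-zh)
<br>A dataset produced by translating [insuranceQA](https://github.com/shuzi/insuranceQA). train_data has 12,889 questions and 141,779 records, with positive:negative = 1:10; test_data has 2,000 questions and 22,000 records, with positive:negative = 1:10; valid_data has 2,000 questions and 22,000 records, with positive:negative = 1:10.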

**We download the Xiaohuangji corpus and process it into the input format the seq2seq model expects**

In [29]:
!wget https://github.com/candlewill/Dialog_Corpus/raw/master/xiaohuangji50w_nofenci.conv.zip
!unzip xiaohuangji50w_nofenci.conv.zip

--2019-01-27 07:52:17--  https://github.com/candlewill/Dialog_Corpus/raw/master/xiaohuangji50w_nofenci.conv.zip
Resolving github.com (github.com)... 192.30.253.112, 192.30.253.113
Connecting to github.com (github.com)|192.30.253.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/candlewill/Dialog_Corpus/master/xiaohuangji50w_nofenci.conv.zip [following]
--2019-01-27 07:52:17--  https://raw.githubusercontent.com/candlewill/Dialog_Corpus/master/xiaohuangji50w_nofenci.conv.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10232473 (9.8M) [application/zip]
Saving to: ‘xiaohuangji50w_nofenci.conv.zip’


2019-01-27 07:52:18 (19.1 MB/s) - ‘xiaohuangji50w_nofenci.conv.zip’ saved [10232473/10232473]

Arch

In [30]:
!perl -pi.bak -e 's/(E\s)/\1Q /g' xiaohuangji50w_nofenci.conv
!perl -pi.bak -e 's/(Q M)/Q/g' xiaohuangji50w_nofenci.conv
!perl -pi.bak -e 's/(M )/A /g' xiaohuangji50w_nofenci.conv
!head -30 xiaohuangji50w_nofenci.conv

E
Q 呵呵
A 是王若猫的。
E
Q 不是
A 那是什么？
E
Q 怎么了
A 我很难过，安慰我~
E
Q 开心点哈,一切都会好起来
A 嗯 会的
E
Q 我还喜欢她,怎么办
A 我帮你告诉她？发短信还是打电话？
E
Q 短信
A 嗯嗯。我也相信
E
Q 你知道谁么
A 肯定不是我，是阮德培
E
Q 许兵是谁
A 吴院四班小帅哥
E
Q 这么假
A 三鹿奶粉也假，不一样的卖啊
E
Q 许兵是傻逼
A 被你发现了。
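
The three `perl` substitutions above can also be written in Python. The sketch below assumes the raw file consists of blocks of the form `E`, `M question`, `M answer`, which is inferred from the substitutions rather than stated in the corpus documentation. The function name `relabel_conv` is illustrative.

```python
def relabel_conv(raw_text):
    """Turn 'E / M question / M answer' blocks into 'E / Q ... / A ...' blocks."""
    out_lines = []
    turn = 0  # index of the current M line within its E block
    for line in raw_text.splitlines():
        if line.strip() == "E":
            # A new dialogue block starts here
            out_lines.append("E")
            turn = 0
        elif line.startswith("M "):
            # First M line of a block is the question, later ones are answers
            label = "Q" if turn == 0 else "A"
            out_lines.append(label + " " + line[2:])
            turn += 1
        else:
            # Leave anything unexpected untouched
            out_lines.append(line)
    return "\n".join(out_lines) + "\n"

sample = "E\nM 呵呵\nM 是王若猫的。\nE\nM 不是\nM 那是什么？\n"
relabeled = relabel_conv(sample)
print(relabeled)
```

Unlike the regular expressions, this version only rewrites markers at the start of a line, so an `M ` or `E ` that appears inside a sentence is left alone.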


In [0]:
text = open("xiaohuangji50w_nofenci.conv").read().split("E\n")

In [32]:
text[1]

'Q 呵呵\nA 是王若猫的。\n'

In [0]:
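import jieba

def split_conv(in_f, out_q, out_a):
  """Write tokenized questions and answers to two line-aligned files."""
  with open(in_f, encoding='utf-8') as f:
    text = f.read().split("E\n")
  with open(out_q, 'w', encoding='utf-8') as out_question, \
       open(out_a, 'w', encoding='utf-8') as out_answer:
    for pair in text:
      # Skip blocks that are too short to hold a question and an answer
      if len(pair) <= 4:
        continue
      # The first line is the question, the second line is the answer
      contents = pair.split("\n")
      # Slice off the "Q " / "A " marker; str.strip("Q ") would also remove
      # leading or trailing 'Q' characters that belong to the sentence
      question = contents[0][2:].strip()
      answer = contents[1][2:].strip()
      out_question.write(" ".join(jieba.lcut(question)) + "\n")
      out_answer.write(" ".join(jieba.lcut(answer)) + "\n")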
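
When removing the `Q ` and `A ` markers, keep in mind that `str.strip` with an argument treats it as a set of characters to remove from both ends, not as a prefix. The small sketch below contrasts the two approaches on a made-up line; `remove_marker` is a hypothetical helper.

```python
def remove_marker(line, marker):
    """Drop `marker` only if it is a literal prefix of `line`."""
    return line[len(marker):] if line.startswith(marker) else line

# Made-up example: the sentence itself begins and ends with 'Q'
line = "Q QQ号是多少 Q"

stripped = line.strip("Q ")         # removes every leading/trailing 'Q' or ' '
sliced = remove_marker(line, "Q ")  # removes just the leading marker

print(repr(stripped))
print(repr(sliced))
```

For most lines in this corpus the two give the same result; they differ only when a sentence starts or ends with one of the marker characters.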

In [0]:
in_f = "xiaohuangji50w_nofenci.conv"
out_q = 'question.file'
out_a = 'answer.file'
split_conv(in_f, out_q, out_a)

In [35]:
!head -10 question.file

呵呵
不是
怎么 了
开心 点哈 , 一切 都 会 好 起来
我 还 喜欢 她 , 怎么办
短信
你 知道 谁 么
许兵 是 谁
这么 假
许兵 是 傻 逼


In [36]:
!head -10 answer.file

是 王若 猫 的 。
那 是 什么 ？
我 很 难过 ， 安慰 我 ~
嗯   会 的
我 帮 你 告诉 她 ？ 发短信 还是 打电话 ？
嗯 嗯 。 我 也 相信
肯定 不是 我 ， 是 阮德培
吴院 四班 小帅哥
三鹿 奶粉 也 假 ， 不 一样 的 卖 啊
被 你 发现 了 。


In [37]:
!wc -l question.file

454131 question.file


In [38]:
!wc -l answer.file

454131 answer.file
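
Both files report 454131 lines, which is what a parallel corpus needs: line *i* of the question file must pair with line *i* of the answer file. A minimal sketch of the same check in Python follows. The helper names and the demo files are assumptions for illustration.

```python
import os
import tempfile

def count_lines(path):
    """Count the lines of a UTF-8 text file without loading it into memory."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

def is_aligned(src_path, tgt_path):
    """Return True if both files have the same number of lines."""
    return count_lines(src_path) == count_lines(tgt_path)

# Demonstration on two throwaway files with three lines each
with tempfile.TemporaryDirectory() as tmp:
    q_path = os.path.join(tmp, "question.demo")
    a_path = os.path.join(tmp, "answer.demo")
    with open(q_path, "w", encoding="utf-8") as f:
        f.write("呵呵\n不是\n怎么 了\n")
    with open(a_path, "w", encoding="utf-8") as f:
        f.write("是 王若 猫 的 。\n那 是 什么 ？\n我 很 难过\n")
    aligned = is_aligned(q_path, a_path)

print(aligned)
```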


In [0]:
import re
from collections import Counter

def get_vocab(in_f, out_f, max_size=80000):
    """Write the special tokens followed by the most frequent Chinese tokens."""
    vocab_counter = Counter()
    with open(in_f, encoding='utf-8') as f:
        for line in f:
            for word in line.strip().split(" "):
                # Keep only tokens that start with a Chinese character
                if re.match(r"[\u4e00-\u9fa5]+", word):
                    vocab_counter[word] += 1
    with open(out_f, 'w', encoding='utf-8') as out:
        # The special tokens go first, as in the sample vocab.en
        out.write("<unk>\n<s>\n</s>\n")
        for word, _ in vocab_counter.most_common(max_size):
            out.write(word + "\n")
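
Because `get_vocab` keeps only tokens that start with a Chinese character, punctuation and Latin-script tokens never enter the vocabulary. Presumably they are then handled as `<unk>` during training, though that depends on the toolkit. The sketch below estimates how large that share is; `unk_rate` and the demo data are illustrative assumptions.

```python
def unk_rate(sentences, vocab):
    """Fraction of whitespace-separated tokens that are not present in `vocab`."""
    total = 0
    unknown = 0
    for sentence in sentences:
        for token in sentence.split():
            total += 1
            if token not in vocab:
                unknown += 1
    return unknown / total if total else 0.0

# Hypothetical vocabulary and one tokenized answer from the corpus sample
demo_vocab = {"<unk>", "<s>", "</s>", "我", "很", "难过", "安慰"}
demo_sentences = ["我 很 难过 ， 安慰 我 ~"]

rate = unk_rate(demo_sentences, demo_vocab)
print(rate)  # 2 of the 7 tokens ("，" and "~") are out of vocabulary
```

A high rate on the answer side would suggest relaxing the filter, for example to keep punctuation.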

#### Split into training, validation, and test sets

In [0]:
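!mkdir -p data
!head -300000 question.file > data/train.input
!head -300000 answer.file > data/train.output
!head -380000 question.file | tail -80000 > data/val.input
!head -380000 answer.file | tail -80000 > data/val.output
# Start the test set at line 380001 so that it does not overlap the validation set
# (the files have 454131 lines, so `tail -75000` would reuse 869 validation lines)
!tail -n +380001 question.file > data/test.input
!tail -n +380001 answer.file > data/test.output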
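
The arithmetic behind a non-overlapping split can be sketched as follows. With 454131 lines, 300000 for training and 80000 for validation leave 74131 lines for testing, so a test set of 75000 lines taken from the end of the file would share 869 lines with the validation set. The function name `split_bounds` is an assumption for illustration.

```python
def split_bounds(n_total, n_train, n_val):
    """Return half-open (start, end) line ranges for train, val and test."""
    if n_train + n_val > n_total:
        raise ValueError("train + val exceeds the corpus size")
    train = (0, n_train)
    val = (n_train, n_train + n_val)
    test = (n_train + n_val, n_total)  # everything that is left
    return train, val, test

train, val, test = split_bounds(454131, 300000, 80000)
test_size = test[1] - test[0]
overlap_if_75000 = 75000 - test_size  # lines a 75000-line tail would share with val

print(train, val, test)
print(test_size, overlap_if_75000)
```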

**Build the vocabularies**

In [0]:
in_file = "question.file"
out_file = "./data/vocab.input"
get_vocab(in_file, out_file)

In [0]:
in_file = "answer.file"
out_file = "./data/vocab.output"
get_vocab(in_file, out_file)

**Create the output directory**

In [0]:
!mkdir /tmp/nmt_attention_model

**Train the chatbot model**

In [0]:
!python3 -m nmt.nmt \
    --attention=scaled_luong \
    --src=input --tgt=output \
    --vocab_prefix=./data/vocab  \
    --train_prefix=./data/train \
    --dev_prefix=./data/val  \
    --test_prefix=./data/test \
    --out_dir=/tmp/nmt_attention_model \
    --num_train_steps=12000 \
    --steps_per_stats=1 \
    --num_layers=2 \
    --num_units=128 \
    --dropout=0.2 \
    --metrics=bleu

# Job id 0
# Devices visible to TensorFlow: [_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 9021483762835584285), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 656894764299535912)]
# Loading hparams from /tmp/nmt_attention_model/hparams
# Vocab file ./data/vocab.input exists
# Vocab file ./data/vocab.output exists
  saving hparams to /tmp/nmt_attention_model/hparams
  saving hparams to /tmp/nmt_attention_model/best_bleu/hparams
  attention=scaled_luong
  attention_architecture=standard
  avg_ckpts=False
  batch_size=128
  beam_width=0
  best_bleu=0
  best_bleu_dir=/tmp/nmt_attention_model/best_bleu
  check_special_token=True
  colocate_gradients_with_ops=True
  decay_scheme=
  dev_prefix=./data/val
  dropout=0.2
  embed_prefix=None
  encoder_type=uni
  eos=</s>
  epoch_step=0
  forget_bias=1.0
  infer_batch_size=32
  infer_mode=greedy
  init_op=uniform
  init_weight=0.1
  language_model=False
  learning_rate=1