## 实验演示：使用fasttext做意图分类xiam
### 意图（intent）是服务类聊天机器人，搜索引擎领域的重要手段，例如下图


![alt text](figure/Dialogue-Structure-NLP.png)

图片来自[网络](https://stanfy.com/blog/advanced-natural-language-processing-tools-for-bot-makers/)


下面的数据样本来自github的rasa_nlu项目： `data/examples/wit/demo-flights.json`

几个订机票的用户命令，看名字是facebook的wit.ai的数据格式


```json
{
  "data" : [
    {
      "text" : "i want to go from berlin to tokyo tomorrow",
      "entities" : [
        {
          "entity" : "location",
          "value" : "\"berlin\"",
          "role" : "from",
          "start" : 18,
          "end" : 24
        },
        {
          "entity" : "intent",
          "value" : "\"flight_booking\"",
          "start" : 0,
          "end" : 42
        },
        {
          "entity" : "location",
          "value" : "\"tokyo\"",
          "role" : "to",
          "start" : 28,
          "end" : 33
        },
        {
          "entity" : "datetime",
          "value" : "\"2016-05-29T00:00:00.000-07:00\"",
          "start" : 34,
          "end" : 42
        }
      ]
    },
    {
      "text" : "i'm looking for a flight from london to amsterdam next monday",
      "entities" : [
        {
          "entity" : "location",
          "value" : "\"london\"",
          "role" : "from",
          "start" : 30,
          "end" : 36
        },
        {
          "entity" : "location",
          "value" : "\"amsterdam\"",
          "role" : "to",
          "start" : 40,
          "end" : 49
        },
        {
          "entity" : "datetime",
          "value" : "\"2016-05-30T00:00:00.000-07:00\"",
          "start" : 50,
          "end" : 61
        }
      ]
    },
    {
      "text" : "i want to fly to berlin",
      "entities" : [
        {
          "entity" : "location",
          "value" : "\"berlin\"",
          "role" : "from",
          "start" : 17,
          "end" : 23
        }
      ]
    },
    {
      "text" : "i want to fly from london",
      "entities" : [
        {
          "entity" : "location",
          "value" : "\"london\"",
          "role" : "from",
          "start" : 19,
          "end" : 25
        }
      ]
    }
  ]
}
```

### 大规模 intent 分类数据是各个公司的重要资产，我没有大的intent分类数据，所以使用rasa_nlu的一个小的demo样本来做示例

In [1]:
import json
import io

# 数据来源：
# 从github下载rasa_nlu项目的repo， 使用`rasa_nlu/test_models/test_model_mitie/training_data.json`

# 1. 从json文件读入数据
name = 'data/training_data.json'
with io.open(name, encoding="utf-8-sig") as f:
    data = json.loads(f.read())

In [2]:
labels, texts = [], []

# 2. 从json格式的数据提取intent和text
for eg in data['rasa_nlu_data']['common_examples']:
    texts.append(eg['text'])
    labels.append('__label__'  + eg['intent'])

# 3. 将数据分割成 training数据 和 heldout(又名validation)数据
with open('data/intent_small_train.txt', 'w') as f_tr:
    with open('data/intent_small_valid.txt', 'w') as f_val:
        for i in range(len(labels)):
            if i==0 or labels[i]!=labels[i-1]:
                f_val.write(labels[i] + ' ' + texts[i]+'\n')
            else:
                f_tr.write(labels[i] + ' ' + texts[i]+'\n')

# 4. 打印数据，直观了解

print('所有的 intent:')
print(set([x[9:] for x in labels]))
print('\n')

print('所有的 (intent, text) 样本:')
xs = sorted([(labels[i], texts[i]) for i in range(len(labels))])

for i in range(len(labels)):
    print('\t%s : %s' % (xs[i][0][9:], xs[i][1]))

所有的 intent:
{'affirm', 'greet', 'restaurant_search', 'goodbye'}


所有的 (intent, text) 样本:
	affirm : great
	affirm : great
	affirm : indeed
	affirm : indeed
	affirm : ok
	affirm : ok
	affirm : that's right
	affirm : that's right
	affirm : yeah
	affirm : yeah
	affirm : yep
	affirm : yep
	affirm : yes
	affirm : yes
	goodbye : bye
	goodbye : bye
	goodbye : end
	goodbye : end
	goodbye : good bye
	goodbye : good bye
	goodbye : goodbye
	goodbye : goodbye
	goodbye : stop
	goodbye : stop
	greet : hello
	greet : hello
	greet : hey
	greet : hey
	greet : hey there
	greet : hey there
	greet : hi
	greet : hi
	greet : howdy
	greet : howdy
	restaurant_search : anywhere in the west
	restaurant_search : anywhere in the west
	restaurant_search : central indian restaurant
	restaurant_search : central indian restaurant
	restaurant_search : i am looking for an indian spot
	restaurant_search : i am looking for an indian spot
	restaurant_search : i'm looking for a place in the north of town
	restaurant_search 

### 上面好多样本是重复的，我们假装不是重复样本，强行练习一波cross-validation

In [9]:
# 使用fasttext训练分类器

import fasttext

In [10]:
# fasttext用法的参考文献：https://pypi.python.org/pypi/fasttext

# 我们尝试不同的learning_rate和feature_dimension
lrs = [0.01, 0.05, 0.002]
dims = [5, 10, 25, 50, 75, 100]

best_tr, best_val = 0, 0
for lr in lrs:
    for dim in dims:
        classifier = fasttext.supervised(input_file = 'data/intent_small_train.txt',
                                         output = 'data/intent_model',
                                         label_prefix = '__label__',
                                         dim = dim,
                                         lr = lr,
                                         epoch = 50)
        result_tr = classifier.test('data/intent_small_train.txt')
        result_val = classifier.test('data/intent_small_test.txt')
        
        if result_tr.precision > best_tr:
            best_tr = result_tr.precision
            params_tr = (lr, dim, result_tr)
            
        if result_val.precision > best_val:
            best_val = result_val.precision
            params_val = (lr, dim, result_val)

In [5]:
print(best_tr)
print(params_tr)

0.9459459459459459
(0.05, 5, <fasttext.model.ClassifierTestResult object at 0x7f90c455ffd0>)


In [6]:
print(best_val)
print(params_val)

0.8571428571428571
(0.05, 5, <fasttext.model.ClassifierTestResult object at 0x7f90bc1649e8>)


In [7]:
classifier = fasttext.supervised(input_file = 'data/intent_small_train.txt',
                                         output = 'data/intent_model',
                                         label_prefix = '__label__',
                                         dim = params_val[1],
                                         lr = params_val[0],
                                         epoch = 50)

In [8]:
print(classifier.predict(['ok ', 'hello', 'bye bye', 'show me chinese restaurants'], k=1))

[['affirm'], ['greet'], ['goodbye'], ['restaurant_search']]
