Skip to content

turn natural language into structured data(支持中文,自定义了N种模型,支持不同的场景和任务)

License

Notifications You must be signed in to change notification settings

tx-anin/rasa_nlu_gq

 
 

Repository files navigation

Rasa NLU GQ

Rasa NLU (Natural Language Understanding) 是一个自然语义理解的工具,举个官网的例子如下:

"I'm looking for a Mexican restaurant in the center of town"

And returning structured data like:

  intent: search_restaurant
  entities: 
    - cuisine : Mexican
    - location : center

Intent of this project

这个项目的目的和初衷,是由于官方的rasa nlu里面提供的components和models并不能满足实际需求。所以我自定义了一些components,并发布到Pypi上。可以通过pip install rasa-nlu-gao下载。后续会不断往里面填充和优化组件,也欢迎大家贡献。

New features

目前新增的特性如下(请下载最新的rasa-nlu-gao版本):

  • 新增了实体识别的模型,一个是bilstm+crf,一个是idcnn+crf膨胀卷积模型,对应的yml文件配置如下:
  language: "zh"

  pipeline:
  - name: "tokenizer_jieba"
  - name: "intent_featurizer_count_vectors"
    token_pattern: "(?u)\b\w+\b"
  - name: "intent_classifier_tensorflow_embedding"
  - name: "ner_bilstm_crf"
    lr: 0.001
    char_dim: 100
    lstm_dim: 100
    batches_per_epoch: 10
    seg_dim: 20
    num_segs: 4
    batch_size: 200
    tag_schema: "iobes"
    model_type: "bilstm" # 模型支持两种idcnn膨胀卷积模型或bilstm双向lstm模型
    clip: 5
    optimizer: "adam"
    dropout_keep: 0.5
    steps_check: 100
  • 新增了jieba词性标注的模块,可以方便识别名字,地名,机构名等等jieba能够支持的词性,对应的yml文件配置如下:
  language: "zh"

  pipeline:
  - name: "tokenizer_jieba"
  - name: "ner_crf"
  - name: "jieba_pseg_extractor"
    part_of_speech: ["nr", "ns", "nt"]
  - name: "intent_featurizer_count_vectors"
    OOV_token: oov
    token_pattern: "(?u)\b\w+\b"
  - name: "intent_classifier_tensorflow_embedding"
  • 新增了根据实体反向修改意图,对应的文件配置如下:
  language: "zh"

  pipeline:
  - name: "tokenizer_jieba"
  - name: "ner_crf"
  - name: "jieba_pseg_extractor"
  - name: "intent_featurizer_count_vectors"
    OOV_token: oov
    token_pattern: '(?u)\b\w+\b'
  - name: "intent_classifier_tensorflow_embedding"
  - name: "entity_edit_intent"
    entity: ["nr"]
    intent: ["enter_data"]
    min_confidence: 0
  • 新增了word2vec提取词向量特征,对应的配置文件如下:
  language: "zh"

  pipeline:
  - name: "tokenizer_jieba"
  - name: "intent_featurizer_wordvector"
    vector: "data/vectors.txt"
  - name: "intent_classifier_tensorflow_embedding"
  - name: "ner_crf"
  - name: "jieba_pseg_extractor"
  • 新增了bert模型提取词向量特征,对应的配置文件如下:
  language: "zh"

  pipeline:
  - name: "tokenizer_jieba"
  - name: "bert_vectors_featurizer"
    ip: '172.16.10.46'
    port: 5555
    port_out: 5556
    show_server_config: True
    timeout: 10000
  - name: "intent_classifier_tensorflow_embedding"
  - name: "ner_crf"
  - name: "jieba_pseg_extractor"
  • 新增了对CPU和GPU的利用率的配置,主要是intent_classifier_tensorflow_embeddingner_bilstm_crf这两个使用到tensorflow的组件,配置如下(当然config_proto可以不配置,默认值会将资源全部利用):
  language: "zh"

  pipeline:
  - name: "tokenizer_jieba"
  - name: "intent_featurizer_count_vectors"
    token_pattern: '(?u)\b\w+\b'
  - name: "intent_classifier_tensorflow_embedding"
    config_proto: {
      "device_count": 4,
      "inter_op_parallelism_threads": 0,
      "intra_op_parallelism_threads": 0,
      "allow_growth": True
    }
  - name: "ner_bilstm_crf"
    config_proto: {
      "device_count": 4,
      "inter_op_parallelism_threads": 0,
      "intra_op_parallelism_threads": 0,
      "allow_growth": True
    }
  • 新增了embedding_bert_intent_classifier分类器,对应的配置文件如下:
  language: "zh"

  pipeline:
  - name: "tokenizer_jieba"
  - name: "bert_vectors_featurizer"
    ip: '172.16.10.46'
    port: 5555
    port_out: 5556
    show_server_config: True
    timeout: 10000
  - name: "intent_classifier_tensorflow_embedding_bert"
  - name: "ner_crf"
  - name: "jieba_pseg_extractor"

Quick Install

pip install rasa-nlu-gao

🤖 Running of the bot

To train the NLU model:

python -m rasa_nlu_gao.train -c sample_configs/config_embedding_bilstm.yml --data data/examples/rasa/rasa_dataset_training.json --path models

To run the NLU model:

python -m rasa_nlu_gao.server -c sample_configs/config_embedding_bilstm.yml --path models

Some Examples

具体的例子请看rasa_chatbot_cn

About

turn natural language into structured data(支持中文,自定义了N种模型,支持不同的场景和任务)

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Python 99.6%
  • Other 0.4%