
elasticsearch-jieba-plugin

Jieba analysis plugin for Elasticsearch: 6.4.0, 6.0.0, 5.4.0, 5.3.0, 5.2.2, 5.2.1, 5.2.0, 5.1.2, 5.1.1

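New tokenizer support

If you are on ES 6.4.0, use the latest code from the 6.4.0 branch or from the master branch, or download the 6.4.1 release. Upgrading is strongly recommended!

The 6.4.1 release fixes the PositionIncrement problem. For a detailed explanation, see "ES分词PositionIncrement解析" (an analysis of PositionIncrement in ES tokenization).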

Version mapping

| Branch | Tag | Elasticsearch version | Release link |
| --- | --- | --- | --- |
| 6.4.0 | tag v6.4.1 | v6.4.0 | Download: v6.4.1 |
| 6.4.0 | tag v6.4.0 | v6.4.0 | Download: v6.4.0 |
| 6.0.0 | tag v6.0.0 | v6.0.0 | Download: v6.0.1 |
| 5.4.0 | tag v5.4.0 | v5.4.0 | Download: v5.4.0 |
| 5.3.0 | tag v5.3.0 | v5.3.0 | Download: v5.3.0 |
| 5.2.2 | tag v5.2.2 | v5.2.2 | Download: v5.2.2 |
| 5.2.1 | tag v5.2.1 | v5.2.1 | Download: v5.2.1 |
| 5.2 | tag v5.2.0 | v5.2.0 | Download: v5.2.0 |
| 5.1.2 | tag v5.1.2 | v5.1.2 | Download: v5.1.2 |
| 5.1.1 | tag v5.1.1 | v5.1.1 | Download: v5.1.1 |

More details

  • Check out the source code for the branch that matches your Elasticsearch version.
  • Build the plugin:
gradle pz
  • Copy the zip file to the Elasticsearch plugins directory:
cp build/distributions/elasticsearch-jieba-plugin-5.1.2.zip ${path.home}/plugins
  • In the plugins directory, unzip the archive and remove the zip file:
unzip elasticsearch-jieba-plugin-5.1.2.zip
rm elasticsearch-jieba-plugin-5.1.2.zip
  • Start Elasticsearch:
./bin/elasticsearch

Custom User Dict

Put your dict file, with the suffix .dict, into ${path.home}/plugins/jieba/dic. Each line holds a word followed by its frequency, like this:

小清新 3
百搭 3
显瘦 3
隨身碟 100
your_word word_freq
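
Before dropping a file into the dic folder, it can help to check that every line follows the format above. The following is a minimal sketch, not part of the plugin: `parse_user_dict` is a hypothetical helper, and it assumes each non-empty line is a word and a non-negative integer frequency separated by whitespace, as in the sample.

```python
from typing import Dict


def parse_user_dict(text: str) -> Dict[str, int]:
    """Parse 'word freq' lines into a {word: freq} mapping.

    Assumption (not confirmed by the plugin source): one entry per line,
    word and frequency separated by whitespace.
    """
    entries: Dict[str, int] = {}
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected 'word freq', got {raw!r}")
        word, freq_text = parts
        try:
            freq = int(freq_text)
        except ValueError:
            raise ValueError(f"line {line_no}: frequency is not an integer: {raw!r}")
        if freq < 0:
            raise ValueError(f"line {line_no}: frequency must not be negative: {raw!r}")
        entries[word] = freq
    return entries


sample = "小清新 3\n百搭 3\n显瘦 3\n隨身碟 100\n"
user_dict = parse_user_dict(sample)
print(user_dict)
```

A line with a missing or non-numeric frequency raises `ValueError` with its line number, which makes a malformed entry easy to locate.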

Using stopwords

  • Find stopwords.txt in ${path.home}/plugins/jieba/dic.
  • Create a folder named stopwords under ${path.home}/config:
mkdir -p ${path.home}/config/stopwords
  • Copy stopwords.txt into the folder you just created:
cp ${path.home}/plugins/jieba/dic/stopwords.txt ${path.home}/config/stopwords
  • Create the index:
PUT http://localhost:9200/jieba_index
{
  "settings": {
    "analysis": {
      "filter": {
        "jieba_stop": {
          "type":        "stop",
          "stopwords_path": "stopwords/stopwords.txt"
        },
        "jieba_synonym": {
          "type":        "synonym",
          "synonyms_path": "synonyms/synonyms.txt"
        }
      },
      "analyzer": {
        "my_ana": {
          "tokenizer": "jieba_index",
          "filter": [
            "lowercase",
            "jieba_stop",
            "jieba_synonym"
          ]
        }
      }
    }
  }
}
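
For readers who script the setup, the same request can be prepared with the Python standard library. This is only a sketch under assumptions: `build_create_index_request` is a hypothetical helper, the node is assumed to listen on http://localhost:9200, and the request is built but not sent. The `Content-Type: application/json` header is included because Elasticsearch 6.x rejects request bodies without it.

```python
import json
import urllib.request

# Same settings as the request body above.
INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "jieba_stop": {
                    "type": "stop",
                    "stopwords_path": "stopwords/stopwords.txt",
                },
                "jieba_synonym": {
                    "type": "synonym",
                    "synonyms_path": "synonyms/synonyms.txt",
                },
            },
            "analyzer": {
                "my_ana": {
                    "tokenizer": "jieba_index",
                    "filter": ["lowercase", "jieba_stop", "jieba_synonym"],
                }
            },
        }
    }
}


def build_create_index_request(base_url: str, index: str) -> urllib.request.Request:
    """Build (but do not send) the PUT request that creates the index."""
    body = json.dumps(INDEX_SETTINGS).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_create_index_request("http://localhost:9200", "jieba_index")
print(request.get_method(), request.full_url)

# To send it against a running node (assumed, not executed here):
# with urllib.request.urlopen(request) as resp:
#     print(resp.read().decode("utf-8"))
```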
  • Test the analyzer:
GET http://localhost:9200/jieba_index/_analyze?analyzer=my_ana&text=中国的伟大时代来临了,欢迎参观北京大学PKU

The response looks like this:

{
  "tokens": [
    {
      "token": "中国",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "伟大",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 2
    },
    {
      "token": "时代",
      "start_offset": 5,
      "end_offset": 7,
      "type": "word",
      "position": 3
    },
    {
      "token": "来临",
      "start_offset": 7,
      "end_offset": 9,
      "type": "word",
      "position": 4
    },
    {
      "token": "欢迎",
      "start_offset": 11,
      "end_offset": 13,
      "type": "word",
      "position": 7
    },
    {
      "token": "参观",
      "start_offset": 13,
      "end_offset": 15,
      "type": "word",
      "position": 8
    },
    {
      "token": "北京",
      "start_offset": 15,
      "end_offset": 17,
      "type": "word",
      "position": 9
    },
    {
      "token": "大学",
      "start_offset": 17,
      "end_offset": 19,
      "type": "word",
      "position": 10
    },
    {
      "token": "北京大",
      "start_offset": 15,
      "end_offset": 18,
      "type": "word",
      "position": 11
    },
    {
      "token": "北京大学",
      "start_offset": 15,
      "end_offset": 19,
      "type": "word",
      "position": 12
    },
    {
      "token": "北大",
      "start_offset": 15,
      "end_offset": 19,
      "type": "SYNONYM",
      "position": 12
    },
    {
      "token": "pku",
      "start_offset": 15,
      "end_offset": 19,
      "type": "SYNONYM",
      "position": 12
    },
    {
      "token": "pku",
      "start_offset": 19,
      "end_offset": 22,
      "type": "word",
      "position": 13
    },
    {
      "token": "北大",
      "start_offset": 19,
      "end_offset": 22,
      "type": "SYNONYM",
      "position": 13
    },
    {
      "token": "北京大学",
      "start_offset": 19,
      "end_offset": 22,
      "type": "SYNONYM",
      "position": 13
    }
  ]
}
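
One way to read this response is to group tokens by `position`: tokens that share a position are alternatives for the same slot (a word plus its synonyms), and a jump in position presumably marks tokens dropped by the stop filter. The sketch below illustrates that reading with the token and position values copied from the response above; `group_by_position` and `position_gaps` are hypothetical helper names, not plugin or Elasticsearch APIs.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_position(tokens: List[dict]) -> Dict[int, List[str]]:
    """Collect the token texts emitted at each position."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for tok in tokens:
        groups[tok["position"]].append(tok["token"])
    return dict(groups)


def position_gaps(tokens: List[dict]) -> List[Tuple[int, int]]:
    """Return (before, after) pairs where one or more positions are skipped."""
    positions = sorted({tok["position"] for tok in tokens})
    return [(a, b) for a, b in zip(positions, positions[1:]) if b - a > 1]


# (token, position) pairs from the response above; offsets and types omitted.
pairs = [
    ("中国", 0), ("伟大", 2), ("时代", 3), ("来临", 4), ("欢迎", 7),
    ("参观", 8), ("北京", 9), ("大学", 10), ("北京大", 11),
    ("北京大学", 12), ("北大", 12), ("pku", 12),
    ("pku", 13), ("北大", 13), ("北京大学", 13),
]
tokens = [{"token": t, "position": p} for t, p in pairs]

groups = group_by_position(tokens)
gaps = position_gaps(tokens)
print(groups[12])
print(gaps)
```

In this data, positions 12 and 13 each carry three tokens, and the skipped positions fall after 中国 and after 来临, where the input contains 的 and 了 plus punctuation.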
  • Note the jieba_synonym filter: like jieba_stop, it reads its file from a path relative to ${path.home}/config (here synonyms/synonyms.txt). The format of synonyms.txt is:
北京大学,北大,pku
清华大学,清华,Tsinghua University
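
Each line lists terms that should be treated as equivalent, separated by commas. A small sketch of reading that format follows; `parse_synonym_groups` is a hypothetical helper, it handles only the comma-separated form shown here, and it assumes that lines starting with `#` are comments.

```python
from typing import List


def parse_synonym_groups(text: str) -> List[List[str]]:
    """Split each non-empty line into a group of equivalent terms.

    Assumptions: comma-separated terms only; '#' starts a comment line.
    """
    groups: List[List[str]] = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        terms = [term.strip() for term in line.split(",") if term.strip()]
        if len(terms) < 2:
            raise ValueError(f"line {line_no}: a synonym group needs at least two terms")
        groups.append(terms)
    return groups


sample = "北京大学,北大,pku\n清华大学,清华,Tsinghua University\n"
synonym_groups = parse_synonym_groups(sample)
print(synonym_groups)
```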
  • Create a document:
POST http://localhost:9200/jieba_index/fulltext/1
{"content":"中国的伟大时代来临了,欢迎参观北京大学PKU"}
  • Search:
POST http://localhost:9200/jieba_index/fulltext/_search

Request body:

{
    "query" : { "match" : { "content" : "pku" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}

Response body:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.52305835,
    "hits": [
      {
        "_index": "jieba_index",
        "_type": "fulltext",
        "_id": "1",
        "_score": 0.52305835,
        "_source": {
          "content": "中国的伟大时代来临了,欢迎参观北京大学PKU"
        },
        "highlight": {
          "content": [
            "中国的伟大时代来临了,欢迎参观<tag1>北京大学</tag1><tag1>PKU</tag1>"
          ]
        }
      }
    ]
  }
}
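
In the response above, the query "pku" highlights both 北京大学 and PKU, which is the synonym filter at work. When processing such a response in client code, the highlighted fragments can be pulled out per document. The following is a sketch only: `collect_highlights` is a hypothetical helper, and the sample dictionary is a trimmed copy of the response above.

```python
from typing import Dict, List


def collect_highlights(response: dict, field: str) -> Dict[str, List[str]]:
    """Map each hit's _id to its highlighted fragments for the given field."""
    result: Dict[str, List[str]] = {}
    for hit in response.get("hits", {}).get("hits", []):
        fragments = hit.get("highlight", {}).get(field, [])
        if fragments:
            result[hit["_id"]] = fragments
    return result


# Trimmed copy of the search response shown above.
response = {
    "hits": {
        "total": 1,
        "hits": [
            {
                "_index": "jieba_index",
                "_id": "1",
                "highlight": {
                    "content": [
                        "中国的伟大时代来临了,欢迎参观<tag1>北京大学</tag1><tag1>PKU</tag1>"
                    ]
                },
            }
        ],
    }
}

highlights = collect_highlights(response, "content")
print(highlights)
```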
  • Aggregation example:

Query:

{
  "query": {
    "match": {
      "name": "lala"
    }
  },
  "_source": [
    "name"
  ],
  "aggs": {
    "dedup": {
      "terms": {
        "field": "your_agg_field"
      },
      "aggs": {
        "dedup_docs": {
          "top_hits": {
            "sort": [
              {
                "updatedAt": {
                  "order": "desc"
                }
              }
            ],
            "_source": {
              "includes": [
                "name"
              ]
            },
            "size": 2
          }
        }
      }
    },
    "facets": {
      "terms": {
        "field": "your_facet_field"
      },
      "aggs": {
        "facets_docs": {
          "top_hits": {
            "sort": [
              {
                "updatedAt": {
                  "order": "desc"
                }
              }
            ],
            "_source": {
              "includes": [
                "name"
              ]
            },
            "size": 1
          }
        }
      }
    }
  }
}
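
The two aggregations in this query share one shape: a `terms` bucket per field value, with a `top_hits` sub-aggregation that keeps the most recently updated documents in each bucket. A sketch that builds the same body programmatically makes that shape explicit; `terms_with_top_hits` is a hypothetical helper, and the field names are the placeholders from the query above.

```python
import json
from typing import Sequence


def terms_with_top_hits(
    field: str,
    hits_name: str,
    size: int,
    sort_field: str = "updatedAt",
    source_fields: Sequence[str] = ("name",),
) -> dict:
    """Build a terms aggregation whose buckets keep their newest `size` docs."""
    return {
        "terms": {"field": field},
        "aggs": {
            hits_name: {
                "top_hits": {
                    "sort": [{sort_field: {"order": "desc"}}],
                    "_source": {"includes": list(source_fields)},
                    "size": size,
                }
            }
        },
    }


query = {
    "query": {"match": {"name": "lala"}},
    "_source": ["name"],
    "aggs": {
        "dedup": terms_with_top_hits("your_agg_field", "dedup_docs", size=2),
        "facets": terms_with_top_hits("your_facet_field", "facets_docs", size=1),
    },
}
print(json.dumps(query, indent=2))
```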

NOTE

Migrated from jieba-solr.

Roadmap

I plan to add support for more analyzers:

  • Stanford Chinese analyzer
  • Fudan NLP analyzer
  • ...

If you have ideas, please open an issue and we can work on them together.