This Pinyin Analysis plugin facilitates the conversion between Chinese characters and Pinyin. It supports major versions of Elasticsearch and OpenSearch. Maintained and supported with ❤️ by INFINI Labs.
The plugin comprises an analyzer named pinyin, a tokenizer named pinyin, and a token filter named pinyin.
- keep_first_letter: When enabled, retains only the first letter of each Chinese character. For example, 刘德华 becomes ldh. Default: true.
- keep_separate_first_letter: When enabled, keeps the first letters of each Chinese character separately. For example, 刘德华 becomes l, d, h. Default: false. Note: This may increase query fuzziness due to term frequency.
- limit_first_letter_length: Sets the maximum length of the first letter result. Default: 16.
- keep_full_pinyin: When enabled, preserves the full Pinyin of each Chinese character. For example, 刘德华 becomes [liu, de, hua]. Default: true.
- keep_joined_full_pinyin: When enabled, joins the full Pinyin of each Chinese character. For example, 刘德华 becomes [liudehua]. Default: false.
- keep_none_chinese: Keeps non-Chinese letters or numbers in the result. Default: true.
- keep_none_chinese_together: Keeps non-Chinese letters together. Default: true. For example, DJ音乐家 becomes DJ, yin, yue, jia. When set to false, DJ音乐家 becomes D, J, yin, yue, jia. Note: keep_none_chinese should be enabled first.
- keep_none_chinese_in_first_letter: Keeps non-Chinese letters in the first letter. For example, 刘德华AT2016 becomes ldhat2016. Default: true.
- keep_none_chinese_in_joined_full_pinyin: Keeps non-Chinese letters in joined full Pinyin. For example, 刘德华2016 becomes liudehua2016. Default: false.
- none_chinese_pinyin_tokenize: Breaks non-Chinese letters into separate Pinyin terms if they are Pinyin. Default: true. For example, liudehuaalibaba13zhuanghan becomes liu, de, hua, a, li, ba, ba, 13, zhuang, han. Note: keep_none_chinese and keep_none_chinese_together should be enabled first.
- keep_original: When enabled, keeps the original input as well. Default: false.
- lowercase: Lowercases non-Chinese letters. Default: true.
- trim_whitespace: Default: true.
- remove_duplicated_term: When enabled, removes duplicated terms to save index space. For example, de的 becomes de. Default: false. Note: Position-related queries may be influenced.
- ignore_pinyin_offset: After version 6.0, offsets are strictly constrained, and overlapped tokens are not allowed. With this parameter, overlapped tokens will be allowed by ignoring the offset. Please note, all position-related queries or highlights will become incorrect. You should use multi-fields and specify different settings for different query purposes. If you need offsets, please set it to false. Default: true.
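To make the interaction between a few of these options easier to picture, here is a toy Python model. It is only a sketch and not the plugin's implementation: the three-entry PINYIN table, the function name toy_pinyin_terms, the order in which terms are emitted, and the reading of limit_first_letter_length as plain truncation are all assumptions made for illustration. The keyword defaults mirror the defaults listed above.

```python
# Toy model only: a hard-coded lookup table stands in for the plugin's dictionary.
PINYIN = {"刘": "liu", "德": "de", "华": "hua"}  # hypothetical three-entry table


def toy_pinyin_terms(text, keep_first_letter=True, keep_full_pinyin=True,
                     keep_joined_full_pinyin=False, limit_first_letter_length=16):
    """Return the terms this toy model emits for text made of known characters."""
    syllables = [PINYIN[ch] for ch in text]
    terms = []
    if keep_full_pinyin:
        # One term per character, e.g. liu, de, hua.
        terms.extend(syllables)
    if keep_joined_full_pinyin:
        # A single term with all syllables joined, e.g. liudehua.
        terms.append("".join(syllables))
    if keep_first_letter:
        # First letters joined into one term, e.g. ldh.
        # Assumption: the length limit simply truncates this term.
        initials = "".join(s[0] for s in syllables)
        terms.append(initials[:limit_first_letter_length])
    return terms


print(toy_pinyin_terms("刘德华"))
print(toy_pinyin_terms("刘德华", keep_full_pinyin=False, keep_joined_full_pinyin=True))
```

With the defaults the model emits the per-character syllables followed by the first-letter term; switching keep_full_pinyin off and keep_joined_full_pinyin on replaces the separate syllables with one joined term.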
You can download the packaged plugins from https://release.infinilabs.com/, or use the plugin CLI to install the plugin like this:
For Elasticsearch
bin/elasticsearch-plugin install https://get.infini.cloud/elasticsearch/analysis-pinyin/8.4.1
For OpenSearch
bin/opensearch-plugin install https://get.infini.cloud/opensearch/analysis-pinyin/2.12.0
Tip: replace the version number with the one that matches your Elasticsearch or OpenSearch version.
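If you script installations for several versions, the two commands above can be generated from the distribution name and version. The following Python sketch assumes the URL pattern shown in the two examples also holds for other versions, which this document does not confirm; plugin_install_command is a hypothetical helper name.

```python
BASE_URL = "https://get.infini.cloud"  # host taken from the install examples above


def plugin_install_command(distribution, version):
    """Build the install command for 'elasticsearch' or 'opensearch'.

    Assumption: the path layout /<distribution>/analysis-pinyin/<version>
    seen in the examples applies to the requested version as well.
    """
    if distribution not in ("elasticsearch", "opensearch"):
        raise ValueError(f"unsupported distribution: {distribution}")
    url = f"{BASE_URL}/{distribution}/analysis-pinyin/{version}"
    return f"bin/{distribution}-plugin install {url}"


print(plugin_install_command("elasticsearch", "8.4.1"))
print(plugin_install_command("opensearch", "2.12.0"))
```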
1. Create an index with a custom pinyin analyzer
PUT /medcl/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "pinyin_analyzer": {
          "tokenizer": "my_pinyin"
        }
      },
      "tokenizer": {
        "my_pinyin": {
          "type": "pinyin",
          "keep_separate_first_letter": false,
          "keep_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "remove_duplicated_term": true
        }
      }
    }
  }
}
2. Test the analyzer by analyzing a Chinese name, such as 刘德华
GET /medcl/_analyze
{
  "text": ["刘德华"],
  "analyzer": "pinyin_analyzer"
}
{
  "tokens": [
    { "token": "liu", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 },
    { "token": "de", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1 },
    { "token": "hua", "start_offset": 2, "end_offset": 3, "type": "word", "position": 2 },
    { "token": "刘德华", "start_offset": 0, "end_offset": 3, "type": "word", "position": 3 },
    { "token": "ldh", "start_offset": 0, "end_offset": 3, "type": "word", "position": 4 }
  ]
}
3. Create mapping
POST /medcl/_mapping
{
  "properties": {
    "name": {
      "type": "keyword",
      "fields": {
        "pinyin": {
          "type": "text",
          "store": false,
          "term_vector": "with_offsets",
          "analyzer": "pinyin_analyzer",
          "boost": 10
        }
      }
    }
  }
}
4. Indexing
POST /medcl/_create/andy
{"name":"刘德华"}
5. Let's search
curl "http://localhost:9200/medcl/_search?q=name:%E5%88%98%E5%BE%B7%E5%8D%8E"
curl "http://localhost:9200/medcl/_search?q=name.pinyin:%e5%88%98%e5%be%b7"
curl "http://localhost:9200/medcl/_search?q=name.pinyin:liu"
curl "http://localhost:9200/medcl/_search?q=name.pinyin:ldh"
curl "http://localhost:9200/medcl/_search?q=name.pinyin:de+hua"
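The percent-encoded parts of these URLs are the UTF-8 bytes of the Chinese query text. As a small sketch, the same query strings can be produced with Python's standard library; search_url is a hypothetical helper, and the host and index defaults are copied from the curl examples. It only builds the URL and sends no request.

```python
from urllib.parse import quote_plus


def search_url(field, text, host="http://localhost:9200", index="medcl"):
    """Build a URI search URL like the curl examples above.

    quote_plus percent-encodes non-ASCII text as UTF-8 and turns spaces into '+'.
    """
    return f"{host}/{index}/_search?q={field}:{quote_plus(text)}"


print(search_url("name", "刘德华"))
print(search_url("name.pinyin", "de hua"))
```

The first call reproduces the encoded form of 刘德华 used in the first curl command, and the second reproduces the de+hua query.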
6. Using Pinyin-TokenFilter
PUT /medcl1/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "user_name_analyzer": {
          "tokenizer": "whitespace",
          "filter": "pinyin_first_letter_and_full_pinyin_filter"
        }
      },
      "filter": {
        "pinyin_first_letter_and_full_pinyin_filter": {
          "type": "pinyin",
          "keep_first_letter": true,
          "keep_full_pinyin": false,
          "keep_none_chinese": true,
          "keep_original": false,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "trim_whitespace": true,
          "keep_none_chinese_in_first_letter": true
        }
      }
    }
  }
}
Token Test: 刘德华 张学友 郭富城 黎明 四大天王
GET /medcl1/_analyze
{
  "text": ["刘德华 张学友 郭富城 黎明 四大天王"],
  "analyzer": "user_name_analyzer"
}
{
  "tokens": [
    { "token": "ldh", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "zxy", "start_offset": 4, "end_offset": 7, "type": "word", "position": 1 },
    { "token": "gfc", "start_offset": 8, "end_offset": 11, "type": "word", "position": 2 },
    { "token": "lm", "start_offset": 12, "end_offset": 14, "type": "word", "position": 3 },
    { "token": "sdtw", "start_offset": 15, "end_offset": 19, "type": "word", "position": 4 }
  ]
}
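The offsets in this response refer to the original whitespace-separated names, not to the shorter first-letter tokens that replace them. The Python sketch below recomputes those character offsets for the same input; it imitates only the splitting on whitespace, and whitespace_offsets is a hypothetical helper, not part of the plugin or of Elasticsearch.

```python
import re


def whitespace_offsets(text):
    """Return (token, start_offset, end_offset) for each whitespace-separated token."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


for token, start, end in whitespace_offsets("刘德华 张学友 郭富城 黎明 四大天王"):
    print(token, start, end)
```

Each name keeps its own offset range, which is why a three-letter token such as ldh spans offsets 0 to 3 and the four-letter sdtw spans 15 to 19.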
7. Used in phrase query
- option 1
PUT /medcl2/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "pinyin_analyzer": {
          "tokenizer": "my_pinyin"
        }
      },
      "tokenizer": {
        "my_pinyin": {
          "type": "pinyin",
          "keep_first_letter": false,
          "keep_separate_first_letter": false,
          "keep_full_pinyin": true,
          "keep_original": false,
          "limit_first_letter_length": 16,
          "lowercase": true
        }
      }
    }
  }
}
GET /medcl2/_search
{ "query": {"match_phrase": { "name.pinyin": "刘德华" }} }
- option 2
PUT /medcl3/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "pinyin_analyzer": {
          "tokenizer": "my_pinyin"
        }
      },
      "tokenizer": {
        "my_pinyin": {
          "type": "pinyin",
          "keep_first_letter": true,
          "keep_separate_first_letter": true,
          "keep_full_pinyin": true,
          "keep_original": false,
          "limit_first_letter_length": 16,
          "lowercase": true
        }
      }
    }
  }
}
POST /medcl3/_mapping
{
  "properties": {
    "name": {
      "type": "keyword",
      "fields": {
        "pinyin": {
          "type": "text",
          "store": false,
          "term_vector": "with_offsets",
          "analyzer": "pinyin_analyzer",
          "boost": 10
        }
      }
    }
  }
}
GET /medcl3/_analyze
{
  "text": ["刘德华"],
  "analyzer": "pinyin_analyzer"
}
POST /medcl3/_create/andy
{"name":"刘德华"}
GET /medcl3/_search
{ "query": {"match_phrase": { "name.pinyin": "刘德h" }} }
GET /medcl3/_search
{ "query": {"match_phrase": { "name.pinyin": "刘dh" }} }
GET /medcl3/_search
{ "query": {"match_phrase": { "name.pinyin": "liudh" }} }
GET /medcl3/_search
{ "query": {"match_phrase": { "name.pinyin": "liudeh" }} }
GET /medcl3/_search
{ "query": {"match_phrase": { "name.pinyin": "liude华" }} }
8. That's all, have fun.
Feel free to join the Discord server to discuss anything around this project:
Copyright ©️ INFINI Labs.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.