Pinyin Analysis for ElasticSearch

The Pinyin Analysis plugin integrates the Pinyin4j (http://pinyin4j.sourceforge.net/) module into ElasticSearch.

Pinyin4j is a popular Java library that supports conversion between Chinese characters and the most popular Pinyin systems. The Pinyin output format can be customized.

You can download this plugin from the RTF project (https://github.com/medcl/elasticsearch-rtf).

--------------------------------------------------
| Pinyin4j   Analysis Plugin    | ElasticSearch  |
--------------------------------------------------
| master                        | 1.6.0 -> master|
--------------------------------------------------
| 1.3.0                         | 1.6.0          |
--------------------------------------------------
| 1.2.2                         | 1.0.0          |
--------------------------------------------------
| 1.2.0                         | 0.90.0         |
--------------------------------------------------
| 1.1.2                         | 0.20.2         |
--------------------------------------------------
| 1.1.1                         | 0.19.x         |
--------------------------------------------------
| 1.1.0                         | 0.19.0         |
--------------------------------------------------

The plugin includes a pinyin analyzer, two tokenizers (pinyin and pinyin_first_letter), and a token filter (pinyin).

1. Create an index for testing

curl -XPUT http://localhost:9200/medcl/ -d'
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin",
                    "filter" : ["word_delimiter"]
                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "first_letter" : "none",
                    "padding_char" : " "
                }
            }
        }
    }
}'

2. Analyze a Chinese name, such as 刘德华

http://localhost:9200/medcl/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e&analyzer=pinyin_analyzer
{"tokens":[{"token":"liu de hua ","start_offset":0,"end_offset":3,"type":"word","position":1}]}
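
The text parameter in the URL above is 刘德华 percent-encoded as UTF-8. As a minimal sketch, assuming Python 3 is available on the client side (the plugin does not require it), the same URL can be built as follows. The helper name analyze_url is hypothetical; the host, index and analyzer names are the ones from step 1.

```python
from urllib.parse import quote, unquote

def analyze_url(text, analyzer, index="medcl", host="http://localhost:9200"):
    """Build an _analyze URL like the one in step 2 (illustrative helper, not part of the plugin)."""
    # quote() percent-encodes the UTF-8 bytes of the text; it emits upper-case hex digits
    return f"{host}/{index}/_analyze?text={quote(text)}&analyzer={analyzer}"

url = analyze_url("刘德华", "pinyin_analyzer")
print(url)

# Decoding the encoded text used in step 2 gives back the original name
print(unquote("%e5%88%98%e5%be%b7%e5%8d%8e"))
```

Percent-encoding is case-insensitive in the hex digits, so the upper-case form printed here is equivalent to the lower-case form shown in step 2.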

3. That's all, have fun.

Optional config: the first_letter parameter can be set to prefix, append, only, or none. The default value is none.

Examples: with first_letter set to prefix and padding_char set to "", the analysis result will be:

{"tokens":[{"token":"ldhliudehua","start_offset":0,"end_offset":3,"type":"word","position":1}]}

If we set first_letter to only, the result will be:

{"tokens":[{"token":"ldh","start_offset":0,"end_offset":3,"type":"word","position":1}]}

With first_letter set to append (and padding_char set to " ", as the spaces in the output show), the result will be:

{"tokens":[{"token":"liu de hua ldh","start_offset":0,"end_offset":3,"type":"word","position":1}]}
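
The four first_letter modes can be summarized with a toy model. This is only a sketch that reproduces the outputs documented above, not the plugin's actual code; the function name combine is hypothetical, and the behaviour for combinations not shown above (for example prefix with a non-empty padding_char) is an assumption.

```python
def combine(syllables, first_letter="none", padding_char=" "):
    """Toy model of first_letter / padding_char; NOT the plugin's implementation."""
    # Full pinyin: every syllable is followed by the padding character
    full = "".join(s + padding_char for s in syllables)
    # Initials: the first letter of every syllable, with no padding
    initials = "".join(s[0] for s in syllables)
    if first_letter == "none":
        return full
    if first_letter == "only":
        return initials
    if first_letter == "prefix":
        return initials + full
    if first_letter == "append":
        return full + initials
    raise ValueError(f"unknown first_letter: {first_letter!r}")

name = ["liu", "de", "hua"]  # pinyin of 刘德华, taken from the outputs above

print(repr(combine(name, "none", " ")))    # 'liu de hua '
print(repr(combine(name, "prefix", "")))   # 'ldhliudehua'
print(repr(combine(name, "only", "")))     # 'ldh'
print(repr(combine(name, "append", " ")))  # 'liu de hua ldh'
```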

---------- Additional example ----------

If you want to build auto-complete for people's names using pinyin, it is now very easy. Here are the detailed instructions:

1. Index settings

curl -XPOST http://localhost:9200/medcl/_close
curl -XPUT http://localhost:9200/medcl/_settings -d'
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin",
                    "filter" : ["word_delimiter","nGram"]
                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "first_letter" : "prefix",
                    "padding_char" : " "
                }
            }
        }
    }
}'
curl -XPOST http://localhost:9200/medcl/_open

2. Create the mapping

curl -XPOST http://localhost:9200/medcl/folks/_mapping -d'
{
    "folks": {
        "properties": {
            "name": {
                "type": "multi_field",
                "fields": {
                    "name": {
                        "type": "string",
                        "store": "no",
                        "term_vector": "with_positions_offsets",
                        "analyzer": "pinyin_analyzer",
                        "boost": 10
                    },
                    "primitive": {
                        "type": "string",
                        "store": "yes",
                        "analyzer": "keyword"
                    }
                }
            }
        }
    }
}'

3. Index a document

curl -XPOST http://localhost:9200/medcl/folks/andy -d'{"name":"刘德华"}'

4. Try it out

curl http://localhost:9200/medcl/folks/_search?q=name:%e5%88%98
curl http://localhost:9200/medcl/folks/_search?q=name:%e5%88%98%e5%be%b7
curl http://localhost:9200/medcl/folks/_search?q=name:liu
curl http://localhost:9200/medcl/folks/_search?q=name:ldh
curl http://localhost:9200/medcl/folks/_search?q=name:dehua
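
One reason partial queries such as these can match is that the nGram filter from step 1 indexes short substrings of each token. The sketch below shows plain character n-gram generation. The values min_gram=1 and max_gram=2 are assumptions (believed to be the filter's defaults, which the settings above do not override, so check your version), and the function char_ngrams is illustrative rather than the Lucene implementation; the order in which the real filter emits grams may differ.

```python
def char_ngrams(token, min_gram=1, max_gram=2):
    """Return all substrings of token whose length is between min_gram and max_gram."""
    grams = []
    for size in range(min_gram, max_gram + 1):
        # Slide a window of the current size across the token
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams

print(char_ngrams("liu"))    # ['l', 'i', 'u', 'li', 'iu']
print(char_ngrams("dehua"))
```

Under these assumptions, a query for dehua produces the gram de, which is also one of the grams indexed for the syllable de of 刘德华.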

5. Use the Pinyin token filter (contributed by @wangweiwei)

curl -XPUT http://localhost:9200/medcl1/ -d'
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "user_name_analyzer" : {
                    "tokenizer" : "whitespace",
                    "filter" : "pinyin_filter"
                }
            },
            "filter" : {
                "pinyin_filter" : {
                    "type" : "pinyin",
                    "first_letter" : "only",
                    "padding_char" : ""
                }
            }
        }
    }
}'

Token test: 刘德华 张学友 郭富城 黎明 四大天王

curl -XGET 'http://localhost:9200/medcl1/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e+%e5%bc%a0%e5%ad%a6%e5%8f%8b+%e9%83%ad%e5%af%8c%e5%9f%8e+%e9%bb%8e%e6%98%8e+%e5%9b%9b%e5%a4%a7%e5%a4%a9%e7%8e%8b&analyzer=user_name_analyzer'
{"tokens":[{"token":"ldh","start_offset":0,"end_offset":3,"type":"word","position":1},{"token":"zxy","start_offset":4,"end_offset":7,"type":"word","position":2},{"token":"gfc","start_offset":8,"end_offset":11,"type":"word","position":3},{"token":"lm","start_offset":12,"end_offset":14,"type":"word","position":4},{"token":"sdtw","start_offset":15,"end_offset":19,"type":"word","position":5}]}
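
The token filter result above can be mimicked offline to see where the tokens and offsets come from. This is a minimal sketch, not the plugin's code: the PINYIN table is a hand-written stand-in for Pinyin4j that covers only the characters in this test, and first_letter_tokens is a hypothetical helper that imitates a whitespace tokenizer followed by a first_letter "only" filter.

```python
import re

# Hand-written lookup standing in for Pinyin4j; covers only the characters in this test
PINYIN = {
    "刘": "liu", "德": "de", "华": "hua",
    "张": "zhang", "学": "xue", "友": "you",
    "郭": "guo", "富": "fu", "城": "cheng",
    "黎": "li", "明": "ming",
    "四": "si", "大": "da", "天": "tian", "王": "wang",
}

def first_letter_tokens(text):
    """Split on whitespace, then reduce each token to its pinyin initials."""
    tokens = []
    for match in re.finditer(r"\S+", text):
        initials = "".join(PINYIN[ch][0] for ch in match.group())
        # Keep the character offsets of the original token, as the analyzer output does
        tokens.append((initials, match.start(), match.end()))
    return tokens

for token in first_letter_tokens("刘德华 张学友 郭富城 黎明 四大天王"):
    print(token)
# ('ldh', 0, 3)
# ('zxy', 4, 7)
# ('gfc', 8, 11)
# ('lm', 12, 14)
# ('sdtw', 15, 19)
```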
