Apache Lucene Japanese User-Defined Dictionary with NEologd

Our Python script lets you use NEologd, a neologism dictionary for the MeCab morphological analyzer, to improve Japanese tokenization in Apache Lucene.

Lucene's user-defined dictionary format does not let us specify word scores, which are needed for more accurate tokenization. The script therefore aggressively filters less important entries out of the original NEologd, for example non-noun words and short words that the standard tokenizer can probably already handle.
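The filtering idea can be sketched roughly as follows. This is a minimal illustration, not the actual logic of build.py: it assumes the NEologd seed rows follow the MeCab IPADIC-style column layout (surface form in column 0, part of speech in column 4), and both the function name to_userdict_row and the length threshold MIN_SURFACE_LENGTH are hypothetical. Replacing spaces with hyphens in the second column only mimics what the sample output in the Usage section shows.

```python
MIN_SURFACE_LENGTH = 3  # hypothetical threshold, not taken from build.py


def to_userdict_row(mecab_row, min_length=MIN_SURFACE_LENGTH):
    """Convert one MeCab-style row into a 4-column user-dictionary row.

    Returns None when the entry should be filtered out.
    Assumes column 0 is the surface form and column 4 the part of speech.
    """
    surface, pos = mecab_row[0], mecab_row[4]

    # Keep nouns only.
    if pos != "名詞":
        return None

    # Drop short words; the standard tokenizer may already handle them.
    if len(surface) < min_length:
        return None

    # Mimic the sample output: spaces in the surface form become hyphens
    # in the segmentation column, and the reading column is left empty.
    segmentation = surface.replace(" ", "-")
    return [surface, segmentation, "", "名詞"]


if __name__ == "__main__":
    row = "ハインリヒ・ベル,1288,1288,5000,名詞,固有名詞,人名,一般,*,*,x,y,z".split(",")
    print(to_userdict_row(row))
```

Because no word score can be attached, every surviving entry is always preferred over the built-in segmentation, which is why a conservative filter like this one is useful.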

Note that the tokenizer implemented in Lucene is based on Kuromoji, and the latest version of Kuromoji already supports a custom dictionary format with word scores: atilika/kuromoji#91. Unfortunately, Lucene does not include that update at the moment.

Usage

The following command downloads the latest version of NEologd and converts it into the Lucene user-defined dictionary format:

$ python build.py

When the script finishes, lucene-ja-userdict-neologd.csv.gz has been created. The CSV file contains custom entries for better Japanese tokenization, such as:

ten create,ten-create,,名詞
ハインリヒ・ベル,ハインリヒ・ベル,,名詞
佐竹笙悟,佐竹笙悟,,名詞
小貝澤,小貝澤,,名詞
神田達成,神田達成,,名詞
西村ツチカ,西村ツチカ,,名詞
愛と涙の蔵出し物語,愛と涙の蔵出し物語,,名詞
ヨンマンキュウセンヒャクエン,ヨンマンキュウセンヒャクエン,,名詞
mfブックス,mfブックス,,名詞
ハイイロガン,ハイイロガン,,名詞
...
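To inspect the generated file, a short sketch like the one below may help. It assumes the file is UTF-8 encoded and that every row has the four columns shown above; the function name read_userdict is hypothetical and not part of build.py.

```python
import csv
import gzip


def read_userdict(path):
    """Yield each row of a gzip-compressed user-dictionary CSV as a list."""
    # "rt" opens the compressed file in text mode; newline="" is what the
    # csv module expects from its input stream.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        for row in csv.reader(f):
            if row:  # skip blank lines
                yield row


if __name__ == "__main__":
    # Print the first ten entries of the file produced by build.py.
    for i, row in enumerate(read_userdict("lucene-ja-userdict-neologd.csv.gz")):
        if i >= 10:
            break
        print(row)
```

Each yielded row is a list of four strings; in the sample above the third column is empty.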
