yahoojapan/vespa-kuromoji-linguistics


Vespa Linguistics with Kuromoji Tokenizer

Overview

This package provides a Japanese tokenizer for Vespa using Kuromoji. Kuromoji is a well-known Japanese tokenizer implemented in Java and used by services such as Solr and Elasticsearch. For more details, see the official Kuromoji website.

Create Package

Requirement

JDK (>= 11) and Maven are required to build the package.

Build

Run the mvn command below. The package is built as target/kuromoji-linguistics-${VERSION}-deploy.jar.

$ mvn package -Dvespa.version='7.594.36'     # You can specify 7.594.36 or later.

Use Package

Deploy

Put the built package in the components directory of your application. If there is no components directory, create it. For example, with sampleapps the structure looks like this:

  • sampleapps/search/music/
    • services.xml
    • components/
      • kuromoji-linguistics-${VERSION}-deploy.jar

Configuration

Because the package is used by both searchers and indexers, it is recommended to define <component> in all <container> sections (named <jdisc> in older Vespa versions) of services.xml.

<container id="container" version="1.0">
    <component id="kuromoji" class="jp.co.yahoo.vespa.language.lib.kuromoji.KuromojiLinguistics" bundle="kuromoji-linguistics">
        <config name="language.lib.kuromoji.kuromoji">
            <mode>search</mode>
            <ignore_case>true</ignore_case>
        </config>
    </component>
</container>

You can optionally configure the package with <config name="language.lib.kuromoji.kuromoji">. The parameters and their default settings are listed below.

| parameter | type | default | description |
| --- | --- | --- | --- |
| mode | string | search | Kuromoji mode (normal, search, or extended) |
| kanji.length_threshold | int | 2 | length above which kanji tokens are penalized during the Viterbi search (expert feature) |
| kanji.penalty | int | 3000 | additional cost for kanji tokens longer than kanji.length_threshold (expert feature) |
| other.length_threshold | int | 7 | length above which non-kanji tokens are penalized during the Viterbi search (expert feature) |
| other.penalty | int | 1700 | additional cost for non-kanji tokens longer than other.length_threshold (expert feature) |
| nakaguro_split | bool | false | whether to split unknown words on the middle dot character (U+30FB KATAKANA MIDDLE DOT) |
| user_dict | string | - | path to the user dictionary |
| tokenlist_name | string | default | name of the target specialtokens list |
| all_language | bool | false | whether to apply the Kuromoji tokenizer to all languages (true) or only to Japanese (false) |
| ignore_case | bool | true | whether to ignore differences between upper and lower case |
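
To give a feel for what the four expert parameters control, here is a minimal sketch of a length-based penalty like the one Kuromoji's search mode is generally understood to apply during the Viterbi search. It is not taken from this package or from Kuromoji's source: the class name, the method names, and the way kanji are detected (Unicode script HAN) are assumptions made for illustration, and the real implementation may differ in detail. Only the default values come from the table above.

```java
public class SearchModePenaltySketch {
    // Defaults copied from the parameter table above.
    static final int KANJI_LENGTH_THRESHOLD = 2;
    static final int KANJI_PENALTY = 3000;
    static final int OTHER_LENGTH_THRESHOLD = 7;
    static final int OTHER_PENALTY = 1700;

    // Assumption: a token counts as "kanji" when every code point is in the Han script.
    static boolean isAllKanji(String token) {
        return !token.isEmpty() && token.codePoints()
                .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    // Extra cost added to a candidate token; grows with each character over the threshold.
    static int penalty(String token) {
        int length = token.codePointCount(0, token.length());
        if (isAllKanji(token)) {
            return length > KANJI_LENGTH_THRESHOLD
                    ? (length - KANJI_LENGTH_THRESHOLD) * KANJI_PENALTY
                    : 0;
        }
        return length > OTHER_LENGTH_THRESHOLD
                ? (length - OTHER_LENGTH_THRESHOLD) * OTHER_PENALTY
                : 0;
    }

    public static void main(String[] args) {
        String[] candidates = {"東京", "関西国際空港", "コンピュータ", "アプリケーションサーバー"};
        for (String token : candidates) {
            System.out.println(token + " -> " + penalty(token));
        }
    }
}
```

Under these assumptions, a six-character all-kanji candidate such as 関西国際空港 gets (6 - 2) × 3000 = 12000 of extra cost, which makes the search more likely to prefer a path that splits it into shorter tokens. Raising a threshold or lowering a penalty makes long tokens more likely to survive.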

Activate

Use the deploy command to activate the package. For example, with sampleapps the commands look like this:

$ vespa-deploy prepare sampleapps/search/music/
$ vespa-deploy activate

Now you can use the tokenizer with the "language=ja" option!
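
As a rough illustration of passing that option, the sketch below builds a query URL with language=ja and, optionally, sends it with the JDK's built-in HTTP client. The endpoint http://localhost:8080, the /search/ path, the query text, and the class and method names are assumptions for this example; adjust them to your own deployment.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class JapaneseQuerySketch {
    // Builds a search URL that asks Vespa to treat the query text as Japanese.
    static URI buildSearchUri(String endpoint, String userQuery) {
        String encoded = URLEncoder.encode(userQuery, StandardCharsets.UTF_8);
        return URI.create(endpoint + "/search/?query=" + encoded + "&language=ja");
    }

    // Sends the request and returns the raw response body (needs a running container).
    static String search(String endpoint, String userQuery) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(buildSearchUri(endpoint, userQuery)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        // Only prints the URL; call search(...) instead once your application is up.
        System.out.println(buildSearchUri("http://localhost:8080", "東京"));
    }
}
```

The query text is percent-encoded as UTF-8 so that Japanese characters survive the trip through the URL.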

License

Code licensed under the Apache 2.0 license. See LICENSE for terms.

Contributor License Agreement

This project requires contributors to agree to a Contributor License Agreement (CLA).

Note that, only for contributions to the vespa-kuromoji-linguistics repository on GitHub (https://github.com/yahoojapan/vespa-kuromoji-linguistics), contributors shall be deemed to have agreed to the CLA without individual written agreements.