Kuromoji filter plugin for Embulk, with Neologd support.
- Plugin type: filter
- tokenizer: tokenizer to use, kuromoji or neologd. (string, default: kuromoji)
- mode: tokenization mode, normal, search, or extended. (string, default: normal)
- use_stop_tag: neologd only. (bool, default: false)
- key_names: names of the input columns to tokenize. (list, required)
- keep_input: keep input columns. (bool, default: true)
- ok_parts_of_speech: ok parts of speech. (list, default: null)
- dictionary_path: user dictionary file path. (string, default: null)
- settings: output settings, one entry per output column. (list, required)
  - suffix: output column name suffix. If null, the input column is overwritten. (string, default: null)
  - method: token attribute to extract, surface_form, base_form, or reading. (string, required)
  - delimiter: delimiter placed between tokens. (string, default: ",")
  - type: output data type, string or array. array is emitted as json type. (string, default: "string")
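The interaction between `method` and `ok_parts_of_speech` can be illustrated with a minimal Python sketch. This is not the plugin's code: the `Token` class and the `extract` function are hypothetical, the token values are written by hand rather than produced by Kuromoji, and the sketch assumes that `ok_parts_of_speech` is matched against the top-level part-of-speech tag.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Hypothetical stand-in for one tokenizer result (not the plugin's class)."""
    surface_form: str
    base_form: str
    reading: str
    part_of_speech: str  # assumption: only the top-level tag, e.g. 名詞


def extract(tokens: List[Token], method: str,
            ok_parts_of_speech: Optional[List[str]] = None) -> List[str]:
    """Pick one attribute per token, optionally keeping only the allowed tags."""
    if method not in ("surface_form", "base_form", "reading"):
        raise ValueError(f"unknown method: {method}")
    kept = [
        t for t in tokens
        if ok_parts_of_speech is None or t.part_of_speech in ok_parts_of_speech
    ]
    return [getattr(t, method) for t in kept]


# Hand-written tokens for the fragment 追及した (illustrative values only).
tokens = [
    Token("追及", "追及", "ツイキュウ", "名詞"),
    Token("し", "する", "シ", "動詞"),
    Token("た", "た", "タ", "助動詞"),
]

print(extract(tokens, "base_form"))
print(extract(tokens, "reading", ok_parts_of_speech=["名詞"]))
```

With `ok_parts_of_speech` left as null every token is kept; with `["名詞"]` only the noun survives in this model.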
filters:
  - type: kuromoji
    tokenizer: neologd
    use_stop_tag: true
    key_names:
      - catchcopy
    settings:
      - { method: 'reading', delimiter: '' }
      - { suffix: _surface_form_no_delim, method: 'surface_form', delimiter: '' }
      - { suffix: _base_form, method: 'base_form', delimiter: '###' }
      - { suffix: _surface_form, method: 'surface_form', delimiter: '###' }
      - { suffix: _array, method: 'surface_form', type: 'array' }
filters:
  - type: kuromoji
    keep_input: false
    mode: search
    ok_parts_of_speech:
      - 名詞
    key_names:
      - catchcopy
    settings:
      - { method: 'reading', delimiter: '' }
      - { suffix: _surface_form_no_delim, method: 'surface_form', delimiter: '' }
      - { suffix: _base_form, method: 'base_form', delimiter: '###' }
      - { suffix: _surface_form, method: 'surface_form', delimiter: '###' }
      - { suffix: _array, method: 'surface_form', type: 'array' }
{
"catchcopy" : "安全・安心を追及した曲面ボディにデザインを一新しました。"
}
The input record above is converted as follows:
{
"catchcopy" : "アンゼン・アンシンヲツイキュウシタキョクメンボディニデザインヲイッシン。",
"catchcopy_surface_form_no_delim" : "安全・安心を追及した曲面ボディにデザインを一新。",
"catchcopy_base_form" : "安全###・###安心###を###追及###する###た###曲面###ボディ###に###デザイン###を###一新###。",
"catchcopy_surface_form" : "安全###・###安心###を###追及###し###た###曲面###ボディ###に###デザイン###を###一新###。",
"catchcopy_array" : ["安全","・","安心","を","追及","し","た","曲面","ボディ","に","デザイン","を","一新","。"]
}
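The three surface_form columns in the output above are consistent with a simple model: the same token list is either joined with `delimiter` or kept as a list when `type` is array, and the column name is the input key plus `suffix`. The sketch below checks that model against the sample data. The function names `render` and `column_name` are hypothetical, and the plugin's real implementation may differ.

```python
import json

# Token surfaces copied from the catchcopy_array column in the output above.
tokens = ["安全", "・", "安心", "を", "追及", "し", "た", "曲面",
          "ボディ", "に", "デザイン", "を", "一新", "。"]


def render(tokens, delimiter=",", data_type="string"):
    """Assumed behaviour of one settings entry: join tokens, or keep a list."""
    if data_type == "array":
        return list(tokens)
    if data_type != "string":
        raise ValueError(f"unknown type: {data_type}")
    return delimiter.join(tokens)


def column_name(key, suffix=None):
    """A null suffix means the input column itself is overwritten."""
    return key if suffix is None else key + suffix


record = {
    column_name("catchcopy", "_surface_form_no_delim"): render(tokens, delimiter=""),
    column_name("catchcopy", "_surface_form"): render(tokens, delimiter="###"),
    column_name("catchcopy", "_array"): render(tokens, data_type="array"),
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```

The setting without a suffix, `{ method: 'reading', delimiter: '' }`, maps to the plain `catchcopy` key in this model, which matches the overwritten `catchcopy` column in the sample output.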
- type: kuromoji
  keep_input: false
  dictionary_path: /tmp/kuromoji.txt
  ok_parts_of_speech:
    - 名詞
  key_names:
    - catchcopy
  settings:
    - { method: 'reading', delimiter: '#' }
    - { suffix: _surface_form_no_delim, method: 'surface_form', delimiter: '' }
    - { suffix: _base_form, method: 'base_form', delimiter: '###' }
    - { suffix: _surface_form, method: 'surface_form', delimiter: '###' }
西国分寺,西国分寺,ニシコクブンジ,駅名
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
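The two comma-separated lines above appear to be the contents of the file referenced by `dictionary_path`, in Kuromoji's user dictionary layout: surface text, space-separated segmentation, space-separated readings, and a part-of-speech tag. The following Python sketch only illustrates how the four fields line up. The function `parse_user_dictionary` is hypothetical and is not how Kuromoji itself loads the file.

```python
import csv
import io

# The two sample entries shown above.
USER_DICT = """西国分寺,西国分寺,ニシコクブンジ,駅名
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
"""


def parse_user_dictionary(text):
    """Split each entry into surface, segments, readings, and part of speech."""
    entries = []
    for row in csv.reader(io.StringIO(text)):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        surface, segments, readings, pos = row
        entries.append({
            "surface": surface,
            "segments": segments.split(),
            "readings": readings.split(),
            "part_of_speech": pos,
        })
    return entries


for entry in parse_user_dictionary(USER_DICT):
    print(entry)
```

Each segment is expected to have exactly one reading, so the two lists should have the same length for every entry.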
$ ./gradlew gem # -t to watch change of files and rebuild continuously