Morphological analysis plugin for Embulk.

Kuromoji filter plugin for Embulk

Kuromoji filter plugin for Embulk, with Neologd support.

Reference

Overview

  • Plugin type: filter

Configuration

  • tokenizer: tokenizer to use, kuromoji or neologd. (string, default: kuromoji)
  • mode: tokenization mode, normal, search, or extended. (string, default: normal)
  • use_stop_tag: neologd tokenizer only. (bool, default: false)
  • key_names: names of the columns to tokenize. (list, required)
  • keep_input: keep the input columns in the output. (bool, default: true)
  • ok_parts_of_speech: parts of speech to keep. (list, default: null)
  • dictionary_path: path to a user dictionary file. (string, default: null)
  • settings: output settings, one entry per output column. (list, required)
    • suffix: suffix appended to the input column name to form the output column name. If null, the input column is overwritten. (string, default: null)
    • method: token attribute to output: surface_form, base_form, or reading. (string, required)
    • delimiter: string placed between tokens in the output. (string, default: ",")
    • type: output data type, string or array. An array is emitted as a JSON value. (string, default: "string")
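
The effect of delimiter and type on a single output value can be sketched in Python. This is only an illustration of the options described above, not the plugin's actual implementation; the function name render_tokens is hypothetical, and the token list is assumed to have been produced by the tokenizer already.

```python
def render_tokens(tokens, delimiter=",", type="string"):
    """Hypothetical helper: shape a token list the way the settings describe."""
    if type == "array":
        # type: array keeps the tokens as a list (a JSON array in the output).
        return list(tokens)
    # type: string joins the tokens with the configured delimiter.
    return delimiter.join(tokens)


tokens = ["安全", "・", "安心"]
print(render_tokens(tokens, delimiter="###"))  # 安全###・###安心
print(render_tokens(tokens, delimiter=""))     # 安全・安心
print(render_tokens(tokens, type="array"))     # ['安全', '・', '安心']
```

An empty delimiter concatenates the tokens back into one string, which is how the _surface_form_no_delim column in the examples below is configured.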

Neologd Example

filters:
  - type: kuromoji
    tokenizer: neologd
    use_stop_tag: true
    key_names:
      - catchcopy
    settings:
      - { method: 'reading', delimiter: '' }
      - { suffix: _surface_form_no_delim, method: 'surface_form', delimiter: '' }
      - { suffix: _base_form, method: 'base_form', delimiter: '###' }
      - { suffix: _surface_form, method: 'surface_form', delimiter: '###' }
      - { suffix: _array, method: 'surface_form', type: 'array' }

Pure kuromoji Example

filters:
  - type: kuromoji
    keep_input: false
    mode: search
    ok_parts_of_speech:
      - 名詞
    key_names:
      - catchcopy
    settings:
      - { method: 'reading', delimiter: '' }
      - { suffix: _surface_form_no_delim, method: 'surface_form', delimiter: '' }
      - { suffix: _base_form, method: 'base_form', delimiter: '###' }
      - { suffix: _surface_form, method: 'surface_form', delimiter: '###' }
      - { suffix: _array, method: 'surface_form', type: 'array' }

Input

{
    "catchcopy" : "安全・安心を追及した曲面ボディにデザインを一新しました。"
}

Output

{
    "catchcopy" : "アンゼン・アンシンヲツイキュウシタキョクメンボディニデザインヲイッシン。",
    "catchcopy_surface_form_no_delim" : "安全・安心を追及した曲面ボディにデザインを一新。",
    "catchcopy_base_form" : "安全###・###安心###を###追及###する###た###曲面###ボディ###に###デザイン###を###一新###。",
    "catchcopy_surface_form" : "安全###・###安心###を###追及###し###た###曲面###ボディ###に###デザイン###を###一新###。",
    "catchcopy_array" : ["安全","","安心","","追及","","","曲面","ボディ","","デザイン","","一新",""]
}
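
The column names in this output follow from key_names and each suffix in settings. A minimal Python sketch of that naming rule, assuming the suffix is simply appended to the key name (the function name output_column_names is hypothetical and is not part of the plugin):

```python
def output_column_names(key_name, settings):
    """Hypothetical helper: derive one output column name per settings entry."""
    names = []
    for setting in settings:
        suffix = setting.get("suffix")
        # A missing (null) suffix means the input column itself is overwritten.
        names.append(key_name if suffix is None else key_name + suffix)
    return names


settings = [
    {"method": "reading", "delimiter": ""},
    {"suffix": "_surface_form_no_delim", "method": "surface_form", "delimiter": ""},
    {"suffix": "_base_form", "method": "base_form", "delimiter": "###"},
    {"suffix": "_surface_form", "method": "surface_form", "delimiter": "###"},
    {"suffix": "_array", "method": "surface_form", "type": "array"},
]
for name in output_column_names("catchcopy", settings):
    print(name)
```

The first entry has no suffix, so its reading replaces the original catchcopy value.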

Example 2 (using a user dictionary)

filters:
  - type: kuromoji
    keep_input: false
    dictionary_path: /tmp/kuromoji.txt
    ok_parts_of_speech:
      - 名詞
    key_names:
      - catchcopy
    settings:
      - { method: 'reading', delimiter: '#' }
      - { suffix: _surface_form_no_delim, method: 'surface_form', delimiter: '' }
      - { suffix: _base_form, method: 'base_form', delimiter: '###' }
      - { suffix: _surface_form, method: 'surface_form', delimiter: '###' }

User dictionary example

西国分寺,西国分寺,ニシコクブンジ,駅名
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
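
Each line follows the Kuromoji user dictionary layout: surface form, space-separated segmentation, space-separated readings (one per segment), and a part-of-speech label. The Python sketch below checks a dictionary file against that layout before handing it to the plugin. It is an assumption-labeled illustration: parse_user_dictionary is a hypothetical helper, and skipping lines that start with # is an assumption about comment handling.

```python
import csv
import io


def parse_user_dictionary(text):
    """Hypothetical validator for Kuromoji-style user dictionary lines."""
    entries = []
    for row in csv.reader(io.StringIO(text)):
        # Skip blank lines and (assumed) comment lines.
        if not row or row[0].startswith("#"):
            continue
        if len(row) != 4:
            raise ValueError(f"expected 4 fields, got {len(row)}: {row}")
        surface, segmentation, readings, part_of_speech = row
        segments = segmentation.split(" ")
        reading_list = readings.split(" ")
        # Each segment needs exactly one reading.
        if len(segments) != len(reading_list):
            raise ValueError(f"segment/reading count mismatch: {row}")
        entries.append({
            "surface": surface,
            "segments": segments,
            "readings": reading_list,
            "part_of_speech": part_of_speech,
        })
    return entries


sample = (
    "西国分寺,西国分寺,ニシコクブンジ,駅名\n"
    "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞\n"
)
for entry in parse_user_dictionary(sample):
    print(entry["surface"], entry["segments"], entry["readings"])
```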

Build

$ ./gradlew gem  # add -t to watch for file changes and rebuild continuously