Normalize String for Embulk
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
config/checkstyle
gradle/wrapper
lib/embulk/filter
src
.gitignore
ICU_LICENSE.txt
LICENSE.txt initial commit. Oct 1, 2015
README.md
build.gradle
gradlew
gradlew.bat

README.md

Icu4j filter plugin for Embulk

Unicode normalize string value.

Icu4j filter plugin for Embulk. see. http://site.icu-project.org/

Overview

  • Plugin type: filter

Configuration

  • key_names: target key names. (list, required)
  • keep_input: keep input columns. (bool, default: true)
  • settings: settings. (list, required)

Example normalize NFKC

filters:
  - type: icu4j
    key_names:
      - title
    settings:
      - { transliterators: 'Any-NFKC', case: upper }

Example

filters:
  - type: icu4j
    keep_input: false
    key_names:
      - catchcopy
    settings:
      - { suffix: _katakana, transliterators: 'Katakana-Hiragana,Fullwidth-Halfwidth', case: upper }
      - { transliterators: 'Katakana-Hiragana', case: lower }
      - { suffix: _romaji_lower, transliterators: 'Katakana-Hiragana,Hiragana-Latin', case: lower }

input

{
    "catchcopy" : "ホゲホゲ"
}

As below

{
    "catchcopy" : "ほげほげ",
    "catchcopy_katakana" : "ホゲホゲ",
    "catchcopy_romaji_lower" : "hogehoge"
}

transliterator rules

see. http://hondou.homedns.org/pukiwiki/pukiwiki.php?Java%20ICU4J

Build

$ ./gradlew gem  # -t to watch change of files and rebuild continuously