Hive Japanese NLP UDFs with NEologd


This package extends Hivemall's Japanese NLP capability by utilizing NEologd.

Before getting started, build the latest version of hivemall-all-{HIVEMALL_VERSION}.jar as documented in the Hivemall installation guide.

Usage

Run the build script:

./build.sh

The build script is a modified version of kazuhira-r/kuromoji-with-mecab-neologd-buildscript.

Use the UDFs on Hive:

add jar hivemall-all-{HIVEMALL_VERSION}.jar; -- e.g., hivemall-all-0.5.1-incubating-SNAPSHOT.jar
add jar hive-udf-neologd-{VERSION}-{NEOLOGD_VERSION_DATE}.jar; -- e.g., hive-udf-neologd-0.1.0-20180524.jar
create temporary function tokenize_ja_neologd as 'hivemall.nlp.tokenizer.KuromojiNEologdUDF';
select tokenize_ja_neologd();
-- ["{VERSION}-{NEOLOGD_VERSION_DATE}"]
select tokenize_ja_neologd('10日放送の「中居正広のミになる図書館」(テレビ朝日系)で、SMAPの中居正広が、篠原信一の過去の勘違いを明かす一幕があった。');
-- ["10日","放送","中居正広の身になる図書館","テレビ朝日","系","smap","中居正広","篠原信一","過去","勘違い","明かす","一幕"]
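
Since the UDF returns an array of tokens, it can be combined with Hive's built-in `explode` and `lateral view` to work on a whole column. The following is a minimal sketch, not part of this package: the table `articles` and its columns `id` and `body` are hypothetical names used only for illustration, and the function is assumed to be registered as shown above.

```sql
-- Hypothetical table: articles(id INT, body STRING)
-- Count the ten most frequent tokens across all article bodies.
select
  word,
  count(1) as cnt
from
  articles
  lateral view explode(tokenize_ja_neologd(body)) t as word
group by
  word
order by
  cnt desc
limit 10;
```

Each row of `articles` is expanded into one row per token, so the usual `group by` aggregation applies without any additional UDFs.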