This repository has been archived by the owner on Dec 8, 2023. It is now read-only.

treasure-data/hive-udf-neologd

Hive Japanese NLP UDFs with NEologd

This package extends Hivemall's Japanese NLP capability by utilizing NEologd.

Before getting started, build the latest version of hivemall-all-{HIVEMALL_VERSION}.jar as documented in the Hivemall installation guide.

Usage

Run the build script:

./build.sh

The build script is a modified version of kazuhira-r/kuromoji-with-mecab-neologd-buildscript.

Use the UDFs on Hive:

add jar hivemall-all-{HIVEMALL_VERSION}.jar; -- e.g., hivemall-all-0.5.1-incubating-SNAPSHOT.jar
add jar hive-udf-neologd-{VERSION}-{NEOLOGD_VERSION_DATE}.jar; -- e.g., hive-udf-neologd-0.1.0-20180524.jar
create temporary function tokenize_ja_neologd as 'hivemall.nlp.tokenizer.KuromojiNEologdUDF';
select tokenize_ja_neologd();
-- ["{VERSION}-{NEOLOGD_VERSION_DATE}"]
select tokenize_ja_neologd('10日放送の「中居正広のミになる図書館」(テレビ朝日系)で、SMAPの中居正広が、篠原信一の過去の勘違いを明かす一幕があった。');
-- ["10日","放送","中居正広の身になる図書館","テレビ朝日","系","smap","中居正広","篠原信一","過去","勘違い","明かす","一幕"]
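
Because the UDF returns an array of tokens, it can be combined with Hive's built-in explode to process one token per row. The following is a minimal sketch of a token frequency count. The table docs and its columns docid and text are hypothetical and not part of this package. It assumes tokenize_ja_neologd has already been registered as shown above.

```sql
-- Hypothetical input table (an assumption for illustration): docs(docid int, text string)
-- Requires: the two "add jar" statements and "create temporary function" from above.
select
  t.word,
  count(1) as cnt            -- number of occurrences of each token across all rows
from
  docs d
  lateral view explode(tokenize_ja_neologd(d.text)) t as word  -- one row per token
group by
  t.word
order by
  cnt desc
limit 20;
```

Tokenizing inside a lateral view keeps the source row's other columns (for example d.docid) available, so the same pattern can be adapted to per-document counts by adding them to the select list and the group by clause.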