sumire

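A scikit-learn-compatible Japanese word tokenizer and text vectorization toolkit for CPU-based Japanese natural language processing, usable without pre-installing a morphological analyzer or similar tools.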

Installation

Prerequisites

Tested OS: Ubuntu 22.04.
Python >= 3.9
make
cmake
git

# Jumanpp dependencies.
sudo apt update -y;
sudo apt install -y cmake libeigen3-dev libprotobuf-dev protobuf-c-compiler;

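pip install sumire is all you need: neither MeCab nor Jumanpp has to be installed beforehand. If the MeCab or Jumanpp executables or their dictionaries are missing, they are installed into $HOME/.local/sumire/ when a Tokenizer is instantiated.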
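
To see what has been installed so far, the directory mentioned above can be inspected directly. This is a minimal sketch: only the $HOME/.local/sumire/ path comes from the text above; the helper name `sumire_home` is hypothetical, and nothing is assumed about the layout inside the directory.

```python
from pathlib import Path


def sumire_home() -> Path:
    """Return the install directory named above ($HOME/.local/sumire)."""
    return Path.home() / ".local" / "sumire"


home = sumire_home()
if home.is_dir():
    # List whatever is there; the internal layout is not documented here.
    for entry in sorted(home.iterdir()):
        print(entry.name)
else:
    print(f"{home} does not exist yet; nothing has been installed so far.")
```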

Usage

Tokenizer usage

from sumire.tokenizer import MecabTokenizer, JumanppTokenizer


text = "これはテスト文です。" 
texts = ["これはテスト文です。", "別のテキストもトークン化します。"]

mecab = MecabTokenizer("unidic-lite")
text_mecab_tokenized = mecab.tokenize(text)
texts_mecab_tokenized = mecab.tokenize(texts)

jumanpp = JumanppTokenizer()
text_jumanpp_tokenized = jumanpp.tokenize(text)
texts_jumanpp_tokenized = jumanpp.tokenize(texts)
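
As the example shows, `tokenize` takes either a single string or a list of strings. The sketch below illustrates that calling convention with a hypothetical stand-in, `ToyTokenizer`, which splits text wherever the script (hiragana, katakana, kanji, other) changes. It is not how MeCab or Jumanpp segment text, and the return types (a list of tokens for a string, a list of token lists for a list) are an assumption, not taken from sumire's documentation.

```python
import unicodedata
from itertools import groupby
from typing import List, Union


def char_class(ch: str) -> str:
    """Classify a character by script, using its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("CJK UNIFIED"):
        return "kanji"
    return "other"


class ToyTokenizer:
    """Hypothetical stand-in for illustration; not part of sumire."""

    def _tokenize_one(self, text: str) -> List[str]:
        # Start a new token each time the script changes.
        return ["".join(group) for _, group in groupby(text, key=char_class)]

    def tokenize(self, inputs: Union[str, List[str]]):
        # A single string yields one token list; a list yields one per text.
        if isinstance(inputs, str):
            return self._tokenize_one(inputs)
        return [self._tokenize_one(text) for text in inputs]


toy = ToyTokenizer()
single = toy.tokenize("これはテスト文です。")
batch = toy.tokenize(["これはテスト文です。", "別のテキストもトークン化します。"])
```

Dispatching on the input type inside `tokenize` keeps a single entry point for both the one-off and the batch case, which is the shape the usage example above relies on.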

Vectorizer usage

from sumire.tokenizer.mecab import MecabTokenizer
from sumire.vectorizer.count import CountVectorizer
from sumire.vectorizer.swem import W2VSWEMVectorizer
from sumire.vectorizer.transformer_emb import TransformerEmbeddingVectorizer

texts = ["これはテスト文です。", "別のテキストもトークン化します。"]

count_vectorizer = CountVectorizer()  # uses MecabTokenizer() automatically
swem_vectorizer = W2VSWEMVectorizer()
bert_cls_vectorizer = TransformerEmbeddingVectorizer()

# Fit and transform in one step (you can also call .fit() and .transform() separately).
count_vectorized = count_vectorizer.fit_transform(texts)
swem_vectorized = swem_vectorizer.fit_transform(texts)
bert_cls_vectorized = bert_cls_vectorizer.fit_transform(texts)

# save and load vectorizer.
count_vectorizer.save_pretrained("path/to/count_vectorizer")
count_vectorizer = CountVectorizer.from_pretrained("path/to/count_vectorizer")
swem_vectorizer.save_pretrained("path/to/swem_vectorizer")
swem_vectorizer = W2VSWEMVectorizer.from_pretrained("path/to/swem_vectorizer")
bert_cls_vectorizer.save_pretrained("path/to/bert_cls_vectorizer")
bert_cls_vectorizer = TransformerEmbeddingVectorizer.from_pretrained("path/to/bert_cls_vectorizer")
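
The comment in the example notes that `.fit()` and `.transform()` can be called separately. The sketch below shows what that split means for a bag-of-words vectorizer: `fit` learns a vocabulary, and `transform` counts tokens against it. `ToyCountVectorizer` is a hypothetical stand-in written for this illustration, with character-level tokenization as a placeholder; it makes no claim about how sumire's CountVectorizer is implemented or what array type it returns.

```python
from collections import Counter
from typing import Callable, Dict, List


class ToyCountVectorizer:
    """Hypothetical stand-in showing the fit/transform contract; not sumire's code."""

    def __init__(self, tokenize: Callable[[str], List[str]] = list):
        # Default tokenizer splits into characters (placeholder assumption).
        self.tokenize = tokenize
        self.vocab: Dict[str, int] = {}

    def fit(self, texts: List[str]) -> "ToyCountVectorizer":
        # Learn the vocabulary: one column index per distinct token.
        tokens = sorted({tok for text in texts for tok in self.tokenize(text)})
        self.vocab = {tok: i for i, tok in enumerate(tokens)}
        return self

    def transform(self, texts: List[str]) -> List[List[int]]:
        # Count known tokens; tokens unseen during fit are ignored.
        rows = []
        for text in texts:
            counts = Counter(self.tokenize(text))
            rows.append([counts.get(tok, 0) for tok in self.vocab])
        return rows

    def fit_transform(self, texts: List[str]) -> List[List[int]]:
        return self.fit(texts).transform(texts)


train = ["これはテスト文です。", "別のテキストもトークン化します。"]
vectorizer = ToyCountVectorizer().fit(train)          # learn vocabulary once
train_vectors = vectorizer.transform(train)           # vectorize training texts
new_vectors = vectorizer.transform(["テストです。"])  # reuse on unseen text
```

Fitting once and transforming later is what lets the same vocabulary be applied to new texts, and it is also the state a save and load round trip has to preserve.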

For detailed documentation of each tokenizer and sentence-embedding module, see the documentation page. For information on Transformers and gensim models that have been verified to work, see /sumire/resources/model_card.

Development background

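With the rise of LLMs, practical Japanese NLP tasks such as search, sentiment analysis, and other text classification and regression are attracting growing attention. In these basic tasks, tokenizing Japanese text and obtaining distributed representations of words and sentences are among the most fundamental processing steps. In the LLM era, pretrained Transformer models such as BERT and embeddings from the OpenAI API are, needless to say, the most important technologies for text embeddings, and they are reasonably easy to implement.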

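In practice, however, state-of-the-art methods such as BERT, which needs an expensive GPU, or the OpenAI API, which charges per query, are not light in either computational or operational cost. Moreover, during proof-of-concept (PoC) work in the early stages of a project, such as dataset construction, the ~~tedious~~ installation of dictionary data and morphological analyzers, and looking up the methods and properties of APIs that each differ slightly, takes some effort.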

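With these points in mind, this library was developed so that, on a local machine that may not have a GPU, you can quickly focus on the modeling and analysis part of a PoC. Like scikit-learn, it offers a unified API for each kind of functionality, and it returns a variety of sentence embeddings as soon as you pass in text.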

Unmotivated development tasks (at this moment.)

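Using the OpenAI embedding models. (Expensive.)
Implementing tuning features that require a GPU for embeddings from pretrained Transformer models. (No GPU at hand.)
Greatly reducing the readability of the library internals for the sake of execution speed.
- In a small-scale PoC, implementation speed is considered more important than code execution speed.
- If speed or disk usage becomes a problem in large-scale operation after the PoC, readability should be preserved so that developers can easily remove unneeded processing from this library or customize it.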

Roadmap (motivated development tasks)

decode() for vectorizer inputs.
Verifying the runtime environment on Google Colab.

Coding rule

https://pep8-ja.readthedocs.io/ja/latest/

License

sumire is distributed under the terms of the MIT License.

Acknowledgements (Dependent libraries, data, and models.)

See dependent_licences.csv.

