An Open Source Japanese NLP Library, based on Universal Dependencies
Please read the Important changes before you upgrade GiNZA.
GiNZA NLP Library and the GiNZA Japanese Universal Dependencies Models are distributed under The MIT License. You must agree to and follow The MIT License to use the GiNZA NLP Library and the GiNZA Japanese Universal Dependencies Models.
spaCy is the key framework of GiNZA. spaCy LICENSE PAGE
SudachiPy provides high accuracy for tokenization and POS tagging. Sudachi LICENSE PAGE, SudachiPy LICENSE PAGE
This project is developed with Python>=3.6 and pip>=18. We do not recommend using an Anaconda environment because the pip install step may not work properly. (We'd like to support Anaconda in the near future.)
Please also see the Development Environment section below.
Run the following line:
$ pip install "https://github.com/megagonlabs/ginza/releases/download/latest/ginza-latest.tar.gz"
or download a pip install archive from the release page and run `pip install` with it.
$ pip install ginza-2.2.1.tar.gz
If you encounter an install error with a message like the one below, please upgrade pip.
Could not find a version that satisfies the requirement ja_ginza@ http://github.com/ ...
$ pip install --upgrade pip
For Google Colab, you need to reload the package info.
import pkg_resources, imp
imp.reload(pkg_resources)
If you encounter install problems related to Cython, please try setting CFLAGS as below.
$ CFLAGS='-stdlib=libc++' pip install "https://github.com/megagonlabs/ginza/releases/download/latest/ginza-latest.tar.gz"
Run the `ginza` command from the console, then type some Japanese text. After pressing the Enter key, you will get the parsed results in CoNLL-U Syntactic Annotation format.
$ ginza
銀座七丁目はお洒落だ。
# text = 銀座七丁目はお洒落だ。
1 銀座 銀座 PROPN 名詞-固有名詞-地名-一般 _ 3 compound _ BunsetuBILabel=B|BunsetuPositionType=CONT|SpaceAfter=No|NP_B|NE=LOC_B
2 七 7 NUM 名詞-数詞 NumType=Card 3 nummod _ BunsetuBILabel=I|BunsetuPositionType=CONT|SpaceAfter=No|NE=LOC_I
3 丁目 丁目 NOUN 名詞-普通名詞-助数詞可能 _ 5 nsubj _ BunsetuBILabel=I|BunsetuPositionType=SEM_HEAD|SpaceAfter=No|NP_B|NE=LOC_I
4 は は ADP 助詞-係助詞 _ 3 case _ BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No
5 お洒落 御洒落 ADJ 名詞-普通名詞-サ変形状詞可能 _ 0 root _ BunsetuBILabel=B|BunsetuPositionType=ROOT|SpaceAfter=No
6 だ だ AUX 助動詞 _ 5 cop _ BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No
7 。 。 PUNCT 補助記号-句点 _ 5 punct _ BunsetuBILabel=I|BunsetuPositionType=CONT|SpaceAfter=No
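The CoNLL-U lines above are plain column-oriented text, so they can be post-processed without spaCy. Below is a minimal sketch (standard library only; the sample lines are copied from the output above) that parses the columns and groups tokens into bunsetsu using the `BunsetuBILabel` entry in the MISC field. The helper names `parse_conllu` and `bunsetu_spans` are illustrative, not part of GiNZA's API.

```python
# Sample lines copied from the `ginza` output above. Real CoNLL-U is
# tab-separated; no field in this sample contains a space, so split() works.
SAMPLE = """\
1 銀座 銀座 PROPN 名詞-固有名詞-地名-一般 _ 3 compound _ BunsetuBILabel=B|BunsetuPositionType=CONT|SpaceAfter=No|NP_B|NE=LOC_B
2 七 7 NUM 名詞-数詞 NumType=Card 3 nummod _ BunsetuBILabel=I|BunsetuPositionType=CONT|SpaceAfter=No|NE=LOC_I
3 丁目 丁目 NOUN 名詞-普通名詞-助数詞可能 _ 5 nsubj _ BunsetuBILabel=I|BunsetuPositionType=SEM_HEAD|SpaceAfter=No|NP_B|NE=LOC_I
4 は は ADP 助詞-係助詞 _ 3 case _ BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No
5 お洒落 御洒落 ADJ 名詞-普通名詞-サ変形状詞可能 _ 0 root _ BunsetuBILabel=B|BunsetuPositionType=ROOT|SpaceAfter=No
6 だ だ AUX 助動詞 _ 5 cop _ BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|SpaceAfter=No
7 。 。 PUNCT 補助記号-句点 _ 5 punct _ BunsetuBILabel=I|BunsetuPositionType=CONT|SpaceAfter=No
"""

def parse_conllu(text):
    # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith('#'):
            continue
        cols = line.split()
        # MISC holds key=value pairs joined by '|' (flags without '=' are skipped)
        misc = dict(kv.split('=', 1) for kv in cols[9].split('|') if '=' in kv)
        tokens.append({'id': int(cols[0]), 'form': cols[1], 'lemma': cols[2],
                       'upos': cols[3], 'head': int(cols[6]),
                       'deprel': cols[7], 'misc': misc})
    return tokens

def bunsetu_spans(tokens):
    # BunsetuBILabel=B opens a new bunsetsu; I extends the current one.
    chunks = []
    for t in tokens:
        if t['misc']['BunsetuBILabel'] == 'B':
            chunks.append([])
        chunks[-1].append(t['form'])
    return [''.join(c) for c in chunks]

tokens = parse_conllu(SAMPLE)
print(bunsetu_spans(tokens))  # ['銀座七丁目は', 'お洒落だ。']
```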
If you want `cabocha -f1` (lattice style) output, add the `-f 1` or `-f cabocha` option to the `ginza` command. This option's format is almost the same as `cabocha -f1`, but the `func_index` field (after the slash) is slightly different: our `func_index` field indicates the boundary where the content words (自立語) end in each bunsetsu (文節), and where the function words (機能語) may start. The functional token filter also differs slightly between `cabocha -f1` and `ginza -f cabocha`.
$ ginza -f 1
銀座七丁目はお洒落だ。
* 0 1D 2/3 0.000000
銀座 名詞,固有名詞,地名,一般,*,*,銀座,ギンザ, B-LOC
七 名詞,数詞,*,*,*,*,7,ナナ, I-LOC
丁目 名詞,普通名詞,助数詞可能,*,*,*,丁目,チョウメ, I-LOC
は 助詞,係助詞,*,*,*,*,は,ハ, O
* 1 -1D 0/1 0.000000
お洒落 名詞,普通名詞,サ変形状詞可能,*,*,*,御洒落,オシャレ, O
だ 助動詞,*,*,*,助動詞-ダ,終止形-一般,だ,ダ, O
。 補助記号,句点,*,*,*,*,。,。, O
EOS
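The lattice output above can likewise be post-processed with a few lines of plain Python. Below is a minimal sketch (sample lines copied from the output above; real output separates token fields with tabs) that recovers the chunks and their head links from the `* <id> <head>D <func_index> <score>` header lines. The helper name `parse_lattice` is illustrative, not part of GiNZA's API.

```python
# Sample copied from the `ginza -f 1` output above.
SAMPLE = """\
* 0 1D 2/3 0.000000
銀座 名詞,固有名詞,地名,一般,*,*,銀座,ギンザ, B-LOC
七 名詞,数詞,*,*,*,*,7,ナナ, I-LOC
丁目 名詞,普通名詞,助数詞可能,*,*,*,丁目,チョウメ, I-LOC
は 助詞,係助詞,*,*,*,*,は,ハ, O
* 1 -1D 0/1 0.000000
お洒落 名詞,普通名詞,サ変形状詞可能,*,*,*,御洒落,オシャレ, O
だ 助動詞,*,*,*,助動詞-ダ,終止形-一般,だ,ダ, O
。 補助記号,句点,*,*,*,*,。,。, O
EOS
"""

def parse_lattice(text):
    chunks = []
    for line in text.splitlines():
        if line == 'EOS':
            break
        if line.startswith('* '):
            # header fields: '*', chunk id, head id + 'D', func_index, score
            _, idx, head, func_index, _score = line.split()
            chunks.append({'id': int(idx), 'head': int(head.rstrip('D')),
                           'func_index': func_index, 'tokens': []})
        else:
            chunks[-1]['tokens'].append(line.split()[0])  # surface form
    return chunks

chunks = parse_lattice(SAMPLE)
print([(c['id'], c['head'], ''.join(c['tokens'])) for c in chunks])
# [(0, 1, '銀座七丁目は'), (1, -1, 'お洒落だ。')]
```

A head of `-1` marks the final (root) chunk, following the cabocha convention.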
If you need only the tokenization results, consider using the `sudachipy` command for faster processing. Please see SudachiPy for details. SudachiPy is the tokenizer used in GiNZA.
$ sudachipy
The following steps show dependency parsing results with the sentence boundary marker 'EOS'.
import spacy
nlp = spacy.load('ja_ginza')
doc = nlp('依存構造解析の実験を行っています。')
for sent in doc.sents:
    for token in sent:
        print(token.i, token.orth_, token.lemma_, token.pos_, token.tag_, token.dep_, token.head.i)
    print('EOS')
Please see the spaCy API documentation.
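The loop above prints one (token index, head index) pair per token, and those pairs fully determine the dependency tree (in spaCy, the root token's head is the token itself). Below is a minimal sketch that rebuilds child lists from such pairs; the sample indices are hypothetical values matching the earlier 銀座七丁目はお洒落だ。 example, not output of the loop itself.

```python
from collections import defaultdict

# (token.i, token.head.i) pairs as printed by the loop above;
# sample values correspond to 銀座/七/丁目/は/お洒落/だ/。 (0-based indices).
pairs = [(0, 2), (1, 2), (2, 4), (3, 2), (4, 4), (5, 4), (6, 4)]

children = defaultdict(list)
root = None
for i, head in pairs:
    if i == head:       # spaCy convention: the root token points to itself
        root = i
    else:
        children[head].append(i)

print(root, dict(children))  # 4 {2: [0, 1, 3], 4: [2, 5, 6]}
```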
- 2019-10-28
  - Improvements
    - JapaneseCorrector can now merge the `as_*` type dependencies completely
  - Bug fixes
    - The command line tool failed in specific situations
- 2019-10-04, Ametrine
  - Important changes
    - `split_mode` has been set incorrectly to sudachipy.tokenizer from v2.0.0 (#43)
      - This bug caused a `split_mode` incompatibility between the training phase and the `ginza` command: `split_mode` was set to 'B' for the training phase and the Python APIs, but to 'C' for the `ginza` command.
      - We fixed this bug by setting the default `split_mode` to 'C' entirely.
      - This fix may cause word segmentation incompatibilities when upgrading GiNZA from v2.0.0 to v2.2.0.
  - New features
    - Add `-f` and `--output-format` options to the `ginza` command:
      - `-f 0` or `-f conllu`: CoNLL-U Syntactic Annotation format
      - `-f 1` or `-f cabocha`: cabocha -f1 compatible format
    - Add custom token fields:
      - `bunsetu_index`: bunsetsu index starting from 0
      - `reading`: reading of the token (not a pronunciation)
      - `sudachi`: SudachiPy's morpheme instance (or its list when the tokens are gathered by JapaneseCorrector)
  - Performance improvements
    - Tokenizer
      - Use the latest SudachiDict (SudachiDict_core-20190927.tar.gz)
      - Use Cythonized SudachiPy (v0.4.0)
    - Dependency parser
      - Apply the `spacy pretrain` command to capture the language model from UD-Japanese BCCWJ, UD_Japanese-PUD and KWDLC
      - Apply multitask objectives by using the `-pt 'tag,dep'` option of `spacy train`
  - New model file
    - ja_ginza-2.2.0.tar.gz
- 2019-07-08
  - Add the `ginza` command
    - Run `ginza` from the console
  - Change package structure
    - Module package as `ginza`
    - Language model package as `ja_ginza`
    - `spacy.lang.ja` is overridden by `ginza`
  - Remove `sudachipy` related directories
    - SudachiPy and its dictionary are installed via `pip` during `ginza` installation
  - User dictionary available
  - Token extension fields
    - Added: `token._.bunsetu_bi_label`, `token._.bunsetu_position_type`
    - Remained: `token._.inf`
    - Removed: `pos_detail` (the same value is set to `token.tag_`)
- 2019-04-07
  - Set the head index of the root token to 0 to meet the CoNLL-U format definitions
- 2019-04-02
  - Add the new Japanese era 'reiwa' to system_core.dic
- 2019-04-01
  - First release version
$ git clone 'https://github.com/megagonlabs/ginza.git'
For a normal environment:
$ python setup.py develop
To be described