script to evaluate pre-trained Japanese word2vec model on Japanese similarity dataset
evaluate_japanese_w2v

A script for applying Japanese word-similarity evaluation datasets to word2vec models.

Supports Japanese tokenization (word segmentation) with mecab-python3 and SudachiPy.

Requirements

  • chardet
  • numpy
  • scipy
  • gensim
  • mecab-python3
  • sudachipy
  • sudachidict-core

Usage

$ python eval.py model data [option]
  • model: a word2vec model file that can be loaded by gensim.

  • data: a CSV file, or a directory containing CSV files. Each file has three columns: word1, word2, and a numeric score (such as similarity).

    • The three columns can be specified with --col (default: [0,1,2]).
optional arguments:
  -h, --help            show this help message and exit
  --col COL COL COL     indexes of word1, word2, similarity
  --verbose, -v         verbose
  --mecab, -m           use mecab
  --mecab_dict MECAB_DICT, -d MECAB_DICT
                        mecab dictionary path
  --sudachi, -s         use sudachi
  --sudachi_mode SUDACHI_MODE
                        select sudachi tokenizer mode: A or B or C
  --output OUTPUT, -o OUTPUT
                        output csv path or directory path
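
As a rough illustration of what the --col indexes mean, the sketch below picks word1, word2, and the score out of each CSV row by column index. It is not taken from eval.py: the function name `read_pairs`, the sample column layout, and the rule of skipping rows with a non-numeric score are all assumptions made for this example.

```python
import csv
import io


def read_pairs(stream, cols=(0, 1, 2), delimiter=","):
    """Yield (word1, word2, score) tuples, picking columns by index.

    Hypothetical helper: `cols` mirrors the three values passed to --col.
    """
    for row in csv.reader(stream, delimiter=delimiter):
        if len(row) <= max(cols):
            continue  # skip blank or short rows
        try:
            score = float(row[cols[2]])
        except ValueError:
            continue  # skip header rows and non-numeric scores (assumed behavior)
        yield row[cols[0]], row[cols[1]], score


# Made-up sample with an id column and two score columns.
sample = io.StringIO(
    "id,word1,word2,assoc,sim\n"
    "1,犬,猫,5.1,4.2\n"
    "2,車,道,3.0,1.5\n"
)
pairs = list(read_pairs(sample, cols=(1, 2, 4)))
print(pairs)
```

With `cols=(1, 2, 4)` the id column and the first score column are ignored, which is the same idea as passing `--col 1 2 4` on the command line.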

Example

Example for MeCab

$ python eval.py /path/to/latest-ja-word2vec-gensim-model/word2vec.gensim.model \
    /path/to/JWSAN/jwsan-1400.txt \
    -v --col 1 2 4 -m --mecab_dict /usr/local/lib/mecab/dic/mecab-ipadic-neologd 

Output:

[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] set logger
[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] Word vector 50 dim, Vocab size 335476
[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] Use mecab : dict setting is /usr/local/lib/mecab/dic/mecab-ipadic-neologd
[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] load filepath : /path/to/JWSAN/jwsan-1400.csv, 1400 data
[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] Evaluate 1359 data
[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] spearmanr SpearmanrResult(correlation=0.4155930561711437, pvalue=6.97399627506598e-58)
Data    1400
OOV     41
Corr    0.416
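
The Data, OOV, and Corr figures above can be reproduced in spirit with the minimal sketch below: word pairs with an out-of-vocabulary word are counted and skipped, the rest are scored by cosine similarity, and the scores are compared with the gold values by Spearman rank correlation. This is an assumption about the procedure, not the code in eval.py. The pure-Python `spearman` function stands in for `scipy.stats.spearmanr`, a plain dict of toy vectors stands in for the gensim model, and the MeCab/SudachiPy tokenization step is left out.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def average_ranks(values):
    """Rank values starting from 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate(vectors, pairs):
    """Return (number of pairs, number of OOV pairs, Spearman correlation)."""
    gold, predicted, oov = [], [], 0
    for w1, w2, score in pairs:
        if w1 not in vectors or w2 not in vectors:
            oov += 1  # assumed rule: a pair is OOV if either word is missing
            continue
        gold.append(score)
        predicted.append(cosine(vectors[w1], vectors[w2]))
    return len(pairs), oov, spearman(gold, predicted)


# Toy stand-ins for the model vocabulary and the dataset rows.
toy_vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
toy_pairs = [("a", "b", 9.0), ("a", "d", 5.0), ("a", "c", 1.0), ("a", "zzz", 3.0)]

total, oov, corr = evaluate(toy_vectors, toy_pairs)
print(f"Data\t{total}\nOOV\t{oov}\nCorr\t{corr:.3f}")
```

In the toy data one of four pairs is out of vocabulary, so three pairs are evaluated, mirroring how 1400 rows with 41 OOV pairs leave 1359 evaluated pairs in the log above.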

More results are available in 学習済み日本語word2vecとその評価について (written in Japanese).
