# Accuracy evaluation of MeCab and SentencePiece

- Evaluation dataset: ldcc
- Evaluation method: pipeline
    - ../model/
        - pipe-jptokenizermecab.gz
        - pipe-jptokenizersentencepiece.gz

In [1]:
import numpy
import pandas
import scipy.stats

In [2]:
import sys
sys.path.append('../')

from ldccset import DatasetLdcc
from aozoraset import DatasetAozora
from classify import TagDocMaker, Doc2Vectorizer
from classify import JpTokenizerMeCab, JpTokenizerSentencePiece

'pattern' package not found; tag filters are not available for English


## Checking the pipelines

In [3]:
%%time
import os
import joblib
from classify import ident_tokener, SparsetoDense, Transer
# Load the models relative to the project root, then always return to notebook/.
try:
    os.chdir("../")
    pipe_mecab = joblib.load("model/pipe-jptokenizermecab.gz")
    pipe_sentencepiece = joblib.load("model/pipe-jptokenizersentencepiece.gz")
finally:
    os.chdir("notebook/")

CPU times: user 1.8 s, sys: 19.8 ms, total: 1.82 s
Wall time: 1.56 s


In [4]:
pipe_mecab

Pipeline(memory=None,
         steps=[('tokenizer',
                 <classify.JpTokenizerMeCab object at 0x7efd4c76f3c8>),
                ('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=False, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True...
                 LGBMClassifier(boosting_type='gbdt', class_weight=None,
                                colsample_bytree=1.0, importance_type='gain',
                                learning_rate=0.1, max_depth=-1,
                                min_child_samples=20, min_child_weight=0.001,
                                min_split_gain=0.0, n_estimators=100, n_j

In [5]:
pipe_sentencepiece

Pipeline(memory=None,
         steps=[('tokenizer',
                 <classify.JpTokenizerSentencePiece object at 0x7efd39453748>),
                ('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=False, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_...
                 LGBMClassifier(boosting_type='gbdt', class_weight=None,
                                colsample_bytree=1.0, importance_type='gain',
                                learning_rate=0.1, max_depth=-1,
                                min_child_samples=20, min_child_weight=0.001,
                                min_split_gain=0.0, n_estimators=100, n_j

In [6]:
result_csv = "../data/result.csv"
columns = ["tokenizer", "train_acc", "valid_acc", "elapsed_time", "cpu_time"]
df = pandas.read_csv(result_csv, header=None, names=columns)
df.head()

|  | tokenizer | train_acc | valid_acc | elapsed_time | cpu_time |
| --- | --- | --- | --- | --- | --- |
| 0 | JpTokenizerMeCab | 1.0 | 0.944369 | 69.335296 | 332.607681 |
| 1 | JpTokenizerSentencePiece | 1.0 | 0.956128 | 110.3677 | 631.398006 |
| 2 | JpTokenizerMeCab | 1.0 | 0.952962 | 72.925654 | 350.740181 |
| 3 | JpTokenizerSentencePiece | 1.0 | 0.954319 | 115.859467 | 659.67091 |
| 4 | JpTokenizerMeCab | 1.0 | 0.945274 | 69.753902 | 334.857192 |


## Adding the run number

In [7]:
tokenizers = df["tokenizer"].drop_duplicates()
# Rows alternate between the tokenizers, so each run number repeats once per tokenizer.
n = len(df) // len(tokenizers)
times = numpy.array([list(range(1, n+1)) for tkr in tokenizers]).T.ravel()
df["times"] = times[:len(df)]
df.head()

|  | tokenizer | train_acc | valid_acc | elapsed_time | cpu_time | times |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | JpTokenizerMeCab | 1.0 | 0.944369 | 69.335296 | 332.607681 | 1 |
| 1 | JpTokenizerSentencePiece | 1.0 | 0.956128 | 110.3677 | 631.398006 | 1 |
| 2 | JpTokenizerMeCab | 1.0 | 0.952962 | 72.925654 | 350.740181 | 2 |
| 3 | JpTokenizerSentencePiece | 1.0 | 0.954319 | 115.859467 | 659.67091 | 2 |
| 4 | JpTokenizerMeCab | 1.0 | 0.945274 | 69.753902 | 334.857192 | 3 |
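The run number in In [7] relies on the rows alternating strictly between the two tokenizers. As a minimal sketch of an alternative that does not depend on row order, the rows can be numbered within each tokenizer group. The frame `toy` below is a hypothetical miniature of `result.csv`, filled with the first validation accuracies shown above.

```python
import pandas

# Hypothetical miniature of result.csv (assumption: same column names as above).
toy = pandas.DataFrame({
    "tokenizer": ["JpTokenizerMeCab", "JpTokenizerSentencePiece"] * 3,
    "valid_acc": [0.944369, 0.956128, 0.952962, 0.954319, 0.945274, 0.954319],
})

# cumcount() numbers the rows inside each group starting from 0,
# so adding 1 gives a 1-based run number per tokenizer.
toy["times"] = toy.groupby("tokenizer").cumcount() + 1
print(toy)
```

This variant keeps working if a run is missing for one tokenizer or if the rows are not interleaved.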


## Evaluating execution time

In [8]:
_acc_df = df.pivot(index="tokenizer", columns="times", values=["valid_acc", "train_acc", "elapsed_time", "cpu_time"]).T
#_acc_df["mean"] = pvdf.mean(axis=1)
#_acc_df["std"] = pvdf.std(axis=1)
_acc_df.head(10)

|  | times | JpTokenizerMeCab | JpTokenizerSentencePiece |
| --- | --- | --- | --- |
| valid_acc | 1 | 0.944369 | 0.956128 |
| valid_acc | 2 | 0.952962 | 0.954319 |
| valid_acc | 3 | 0.945274 | 0.954319 |
| valid_acc | 4 | 0.955676 | 0.961104 |
| valid_acc | 5 | 0.956581 | 0.955224 |
| valid_acc | 6 | 0.950249 | 0.957938 |
| valid_acc | 7 | 0.953415 | 0.957485 |
| valid_acc | 8 | 0.947987 | 0.961556 |
| valid_acc | 9 | 0.947987 | 0.957033 |
| valid_acc | 10 | 0.944821 | 0.960199 |


### Elapsed time

In [9]:
edf = _acc_df.loc["elapsed_time"].dropna().T
edf["mean"] = edf.mean(axis=1)
# Note: "mean" is already a column at this point, so this std also counts it as a data point.
edf["std"] = edf.std(axis=1)
edf

| tokenizer | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | mean | std |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JpTokenizerMeCab | 69.335296 | 72.925654 | 69.753902 | 67.479392 | 67.771489 | 69.552905 | 75.482269 | 69.395602 | 72.83761 | 70.585096 | ... | 62.292603 | 61.729877 | 62.531481 | 62.099945 | 62.240652 | 61.816332 | 62.112534 | 62.092931 | 65.023183 | 4.115273 |
| JpTokenizerSentencePiece | 110.3677 | 115.859467 | 118.826758 | 104.146838 | 105.39665 | 107.318686 | 108.428783 | 115.120286 | 108.636047 | 111.911376 | ... | 94.794646 | 94.912333 | 94.788782 | 95.611353 | 95.053526 | 95.421243 | 94.589187 | 95.824682 | 100.34744 | 7.686137 |


In [10]:
for tkr, m, s in edf[["mean", "std"]].reset_index().values:
    print(f"{tkr}: {m/60:.1f} min ({s:.1f} sec)")

JpTokenizerMeCab: 1.1 min (4.1 sec)
JpTokenizerSentencePiece: 1.7 min (7.7 sec)
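In In [9] (and likewise In [11]), `std` is computed after the `mean` column has been attached, so the mean itself is counted as one more data point and the reported standard deviation comes out slightly too small. A minimal sketch of the order-independent form, as used later in In [14], is shown below. The frame `toy` and its three rounded values per tokenizer are hypothetical stand-ins for the full 100 runs.

```python
import pandas

# Hypothetical elapsed times in seconds: one row per tokenizer, one column per run.
toy = pandas.DataFrame(
    [[69.3, 72.9, 69.8], [110.4, 115.9, 118.8]],
    index=["JpTokenizerMeCab", "JpTokenizerSentencePiece"],
    columns=[1, 2, 3],
)

# Compute both statistics from the raw run columns first ...
mean = toy.mean(axis=1)
std = toy.std(axis=1)
# ... and attach them afterwards, so "std" never sees the "mean" column.
summary = toy.assign(mean=mean, std=std)

# For comparison: the order used in In [9], where std includes the mean column.
biased = toy.copy()
biased["mean"] = biased.mean(axis=1)
biased["std"] = biased.std(axis=1)

print(summary["std"] - biased["std"])  # positive: the biased std is smaller
```

With 100 runs the effect is small (a factor of about $\sqrt{99/100}$), so the rounded figures printed above are essentially unaffected.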


### CPU time

In [11]:
cdf = _acc_df.loc["cpu_time"].dropna().T
cdf["mean"] = cdf.mean(axis=1)
# Note: "mean" is already a column at this point, so this std also counts it as a data point.
cdf["std"] = cdf.std(axis=1)
cdf

| tokenizer | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | mean | std |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JpTokenizerMeCab | 332.607681 | 350.740181 | 334.857192 | 324.195809 | 325.98567 | 334.925326 | 368.034112 | 332.41661 | 351.524826 | 341.340222 | ... | 297.976241 | 295.077203 | 300.346705 | 297.323128 | 299.26722 | 295.326602 | 296.702732 | 297.005095 | 312.067686 | 21.099491 |
| JpTokenizerSentencePiece | 631.398006 | 659.67091 | 672.109113 | 593.418019 | 597.599394 | 608.212748 | 614.086151 | 650.768025 | 618.19334 | 638.298616 | ... | 547.34109 | 546.093395 | 546.962231 | 549.048228 | 548.50603 | 547.79948 | 544.533046 | 551.169484 | 575.390356 | 40.692919 |


In [12]:
for tkr, m, s in cdf[["mean", "std"]].reset_index().values:
    print(f"{tkr}: {m/60:.1f} min ({s:.1f} sec)")

JpTokenizerMeCab: 5.2 min (21.1 sec)
JpTokenizerSentencePiece: 9.6 min (40.7 sec)


## Accuracy evaluation

In [13]:
acc_df = _acc_df.loc["valid_acc"].dropna()
acc_df

| times | JpTokenizerMeCab | JpTokenizerSentencePiece |
| --- | --- | --- |
| 1 | 0.944369 | 0.956128 |
| 2 | 0.952962 | 0.954319 |
| 3 | 0.945274 | 0.954319 |
| 4 | 0.955676 | 0.961104 |
| 5 | 0.956581 | 0.955224 |
| 6 | 0.950249 | 0.957938 |
| 7 | 0.953415 | 0.957485 |
| 8 | 0.947987 | 0.961556 |
| 9 | 0.947987 | 0.957033 |
| 10 | 0.944821 | 0.960199 |


In [14]:
acc = acc_df.dropna().T.copy()
m = acc.mean(axis=1)
s = acc.std(axis=1)
acc["mean"] = m
acc["std"] = s
acc["mean"] *= 100
acc["std"] *= 100
acc.sort_values("mean", ascending=False)

| tokenizer | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | mean | std |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JpTokenizerSentencePiece | 0.956128 | 0.954319 | 0.954319 | 0.961104 | 0.955224 | 0.957938 | 0.957485 | 0.961556 | 0.957033 | 0.960199 | ... | 0.956128 | 0.95251 | 0.948892 | 0.951153 | 0.954319 | 0.954772 | 0.959294 | 0.964722 | 95.59611 | 0.391064 |
| JpTokenizerMeCab | 0.944369 | 0.952962 | 0.945274 | 0.955676 | 0.956581 | 0.950249 | 0.953415 | 0.947987 | 0.947987 | 0.944821 | ... | 0.945274 | 0.943917 | 0.94663 | 0.935323 | 0.946178 | 0.947083 | 0.951606 | 0.95251 | 94.869742 | 0.480086 |


In [15]:
for tkr, m, s in acc[["mean", "std"]].reset_index().values:
    print(f"{tkr}: {m:.1f} % ({s:.1f} %)")

JpTokenizerMeCab: 94.9 % (0.5 %)
JpTokenizerSentencePiece: 95.6 % (0.4 %)


## Statistical tests

### Normality test

In [16]:
for tkr in acc_df.columns:
    W, pvalue = scipy.stats.shapiro(acc_df[tkr].dropna())
    print(tkr, W, pvalue, pvalue < 0.05, "rejected" if pvalue < 0.05 else "not rejected")

JpTokenizerMeCab 0.9869959950447083 0.43735480308532715 False not rejected
JpTokenizerSentencePiece 0.9907791614532471 0.727403998374939 False not rejected


### Estimating the sample size the test needs, using random numbers

In [17]:
# Normal random numbers, sample size = 10
x = numpy.random.normal(0, 1, 10)
scipy.stats.shapiro(x)

(0.8561763763427734, 0.06876995414495468)

In [18]:
# Normal random numbers, sample size = 100
x = numpy.random.normal(0, 1, 100)
scipy.stats.shapiro(x)

(0.9917338490486145, 0.8016242980957031)

In [19]:
# Uniform random numbers, sample size = 10
x = numpy.random.uniform(0, 1, 10)
scipy.stats.shapiro(x)  # <- normality could not be rejected

(0.8837308287620544, 0.14397722482681274)

In [20]:
# Uniform random numbers, sample size = 50
x = numpy.random.uniform(0, 1, 50)
scipy.stats.shapiro(x)

(0.9426774382591248, 0.017216842621564865)

In [21]:
# Uniform random numbers, sample size = 100
x = numpy.random.uniform(0, 1, 100)
scipy.stats.shapiro(x)

(0.9250912666320801, 2.6490444724913687e-05)

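- With a sample size of 10, it appears difficult to reject normality even when the data are not normal (the uniform sample above was not rejected)
    - The experiment was therefore rerun with a sample size of 100
    - In the rerun, normality of the accuracy data was not rejected
        - i.e. assuming normality does not contradict the measured data
- 50 samples look borderline (the uniform sample was rejected, but only at p ≈ 0.017)
- In conclusion, 50 to 100 samples should be collected so that the test is able to reject normality when it does not hold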
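The single draws above can vary a lot from run to run. As a rough sketch of how the same question could be examined more systematically, the rejection rate of the Shapiro-Wilk test can be estimated by repeating each experiment many times. The helper `rejection_rate`, the seed, and the number of trials are assumptions made for this illustration and are not part of the original notebook.

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(0)  # fixed seed so the sketch is repeatable


def rejection_rate(sampler, n, trials=200, alpha=0.05):
    """Fraction of trials in which Shapiro-Wilk rejects normality at level alpha."""
    rejected = 0
    for _ in range(trials):
        _, pvalue = scipy.stats.shapiro(sampler(n))
        rejected += pvalue < alpha
    return rejected / trials


rates = {}
for n in (10, 50, 100):
    # Normal data: the rate estimates the false-positive rate.
    normal_rate = rejection_rate(lambda size: rng.normal(0, 1, size), n)
    # Uniform data: the rate estimates the power against this alternative.
    uniform_rate = rejection_rate(lambda size: rng.uniform(0, 1, size), n)
    rates[n] = (normal_rate, uniform_rate)
    print(f"n={n}: normal {normal_rate:.2f}, uniform {uniform_rate:.2f}")
```

The rate for normal data should stay near the significance level, while the rate for uniform data shows how often non-normality would actually be detected at each sample size.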

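### Paired t-test
- Only two groups (MeCab and SentencePiece) are compared, so a t-test is sufficient
- The t-test is robust to departures from normality, so it is run here as a reference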

In [22]:
cols = acc_df.columns
for base in cols:
    for target in [trg for trg in cols if trg != base]:
        t, pvalue = scipy.stats.ttest_rel(acc_df[base], acc_df[target])
        if pvalue < 0.05:
            print(base, target, t, pvalue, (pvalue < 0.05))

JpTokenizerMeCab JpTokenizerSentencePiece -16.263633862651595 1.0211749301055299e-29 True
JpTokenizerSentencePiece JpTokenizerMeCab 16.263633862651595 1.0211749301055299e-29 True
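The paired t-test treats the two accuracies of the same run number as a pair and tests whether the mean of the per-run differences $d_i$ is zero, with $t = \bar{d} / (s_d / \sqrt{n})$. Swapping the arguments only flips the sign of $t$, which is why the two output lines above mirror each other. A minimal sketch using the first five validation accuracies from the table above (a small subset chosen for illustration, not the full 100 runs):

```python
import numpy
import scipy.stats

# First five validation accuracies per tokenizer, copied from the table above.
mecab = numpy.array([0.944369, 0.952962, 0.945274, 0.955676, 0.956581])
sentencepiece = numpy.array([0.956128, 0.954319, 0.954319, 0.961104, 0.955224])

# Paired t-test as used in In [22].
paired = scipy.stats.ttest_rel(mecab, sentencepiece)

# Equivalent one-sample t-test on the per-run differences.
diff = mecab - sentencepiece
one_sample = scipy.stats.ttest_1samp(diff, 0.0)

# The same statistic computed by hand from the formula.
t_manual = diff.mean() / (diff.std(ddof=1) / numpy.sqrt(len(diff)))

print(paired.statistic, one_sample.statistic, t_manual)
```

The pairing is only meaningful if run $i$ of both tokenizers shares the same conditions (for example the same train/validation split), which the notebook does not state explicitly.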


### Wilcoxon signed-rank test
- Two-sided test
- No continuity correction (accuracy is not a discrete distribution)

In [23]:
cols = acc_df.columns
for base in cols:
    for target in [trg for trg in cols if trg != base]:
        w, pvalue = scipy.stats.wilcoxon(acc_df[base], acc_df[target], correction=False)
        if pvalue < 0.05:
            print(base, target, w, pvalue, (pvalue < 0.05))

JpTokenizerMeCab JpTokenizerSentencePiece 37.5 1.1971604369766303e-17 True
JpTokenizerSentencePiece JpTokenizerMeCab 37.5 1.1971604369766303e-17 True


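### Test results

- Both the t-test and the Wilcoxon signed-rank test indicate a significant difference

| tokenizer name | accuracy mean (std), % |
| --------------- | --- |
| JpTokenizerMeCab | 94.9 (0.5) |
| JpTokenizerSentencePiece | 95.6 (0.4) |

- The mean accuracy is $94.9 \% (\pm 0.5 \%)$ for MeCab and $95.6 \% (\pm 0.4 \%)$ for SentencePiece
    - Accuracy: MeCab < SentencePiece
    - The accuracy difference (about 0.7 percentage points) is very unlikely to arise by chance, so it can be said to have some meaning or cause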

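## Summary

- Comparing the accuracy of MeCab and SentencePiece, SentencePiece is significantly better (by about 0.7 percentage points)
- The relationship between accuracy and execution time is as follows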
    
| tokenizer name | accuracy mean (std) | elapsed time mean (std) | cpu time mean (std) |
| -------------- | --- | ----------------------- | ------------------- |
| JpTokenizerMeCab | 94.9 % (0.5 %) | 1.1 min (4.1 sec) | 5.2 min (21.1 sec) |
| JpTokenizerSentencePiece | 95.6 % (0.4 %) | 1.7 min (7.7 sec) | 9.6 min (40.7 sec) |


- The elapsed times differ by about 0.6 min (roughly 35 sec)
- The CPU times are about 5.2 min and 9.6 min, a difference of nearly a factor of two
    - This is probably because SentencePiece runs on multiple CPUs
        - Hence its CPU time is nearly doubled
    - MeCab itself runs on 1 cpu, whereas SentencePiece training (fit()) runs on 8 cpus, which is thought to affect the CPU time
    - The pipeline after tokenization (MeCab or SentencePiece) is identical (both use 8 cpus from that point on)
- To summarize
    - When computing resources are sufficient (2 cpus or more), the difference in elapsed time is not large (both are practical)
    - When computing resources are sufficient, SentencePiece can be used for its slightly higher accuracy
    - When only 1 cpu is available and elapsed time takes priority, MeCab appears to be the better choice
        - e.g. with 1 cpu, CPU time ≈ elapsed time, so the time difference becomes roughly a factor of two
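The CPU-time argument can be checked with simple arithmetic on the `mean` columns computed earlier. Dividing CPU time by elapsed time gives a rough indication of how many cores were busy on average. This reading is an assumption for illustration, since the notebook does not state how the two times were measured.

```python
# Mean times in seconds, copied from the `mean` columns of the tables above.
elapsed = {"JpTokenizerMeCab": 65.023183, "JpTokenizerSentencePiece": 100.34744}
cpu = {"JpTokenizerMeCab": 312.067686, "JpTokenizerSentencePiece": 575.390356}

# CPU time / elapsed time: rough average number of busy cores per pipeline.
parallelism = {name: cpu[name] / elapsed[name] for name in elapsed}
for name, ratio in parallelism.items():
    print(f"{name}: CPU/elapsed = {ratio:.1f}")

# SentencePiece relative to MeCab, for each kind of time.
elapsed_ratio = elapsed["JpTokenizerSentencePiece"] / elapsed["JpTokenizerMeCab"]
cpu_ratio = cpu["JpTokenizerSentencePiece"] / cpu["JpTokenizerMeCab"]
print(f"elapsed ratio: {elapsed_ratio:.2f}, CPU ratio: {cpu_ratio:.2f}")
```

The CPU ratio is larger than the elapsed ratio, which is consistent with the statement that the gap widens when only a single cpu is available.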