# Accuracy Evaluation of MeCab and SentencePiece

- Evaluation dataset: ldcc
- Evaluation method: pipeline
    - ../model/
        - pipe-jptokenizermecab.gz
        - pipe-jptokenizersentencepiece.gz

In [1]:
import numpy
import pandas
import scipy.stats

In [2]:
import sys
sys.path.append('../')

from classify_ldcc import DocRecord, DatasetLdcc
from classify_ldcc import JpTokenizerMeCab, JpTokenizerSentencePiece

'pattern' package not found; tag filters are not available for English


## Checking the pipelines

In [3]:
import os
import joblib
from classify_ldcc import ident_tokener, SparsetoDense, Transer
os.chdir("../")
pipe_mecab = joblib.load("model/pipe-jptokenizermecab.gz")
pipe_sentencepiece = joblib.load("model/pipe-jptokenizersentencepiece.gz")
os.chdir("notebook/")

In [4]:
pipe_mecab

Pipeline(memory=None,
         steps=[('tokenizer',
                 <classify_ldcc.JpTokenizerMeCab object at 0x7ff1b4edb978>),
                ('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=False, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf...
                 LGBMClassifier(boosting_type='gbdt', class_weight=None,
                                colsample_bytree=1.0, importance_type='gain',
                                learning_rate=0.1, max_depth=-1,
                                min_child_samples=20, min_child_weight=0.001,
                                min_split_gain=0.0, n_estimators=100, n_j

In [5]:
pipe_sentencepiece

Pipeline(memory=None,
         steps=[('tokenizer',
                 <classify_ldcc.JpTokenizerSentencePiece object at 0x7ff131b16128>),
                ('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=False, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, sm...
                 LGBMClassifier(boosting_type='gbdt', class_weight=None,
                                colsample_bytree=1.0, importance_type='gain',
                                learning_rate=0.1, max_depth=-1,
                                min_child_samples=20, min_child_weight=0.001,
                                min_split_gain=0.0, n_estimators=100, n_j

In [6]:
result_csv = "../data/result.csv"
columns = ["tokenizer", "train_acc", "valid_acc", "elapsed_time", "cpu_time"]
df = pandas.read_csv(result_csv, header=None, names=columns)
df.head()

| | tokenizer | train_acc | valid_acc | elapsed_time | cpu_time |
| --- | --- | --- | --- | --- | --- |
| 0 | JpTokenizerMeCab | 1.0 | 0.940299 | 61.854007 | 296.743731 |
| 1 | JpTokenizerSentencePiece | 1.0 | 0.954772 | 93.701337 | 543.958018 |
| 2 | JpTokenizerMeCab | 1.0 | 0.952058 | 60.94181 | 289.572545 |
| 3 | JpTokenizerSentencePiece | 1.0 | 0.95251 | 94.718693 | 547.378444 |
| 4 | JpTokenizerMeCab | 1.0 | 0.943917 | 61.125902 | 292.175544 |


## Adding the run index

In [7]:
# Number each tokenizer's runs 1, 2, 3, ... in the order they appear in the CSV.
# cumcount() works for any number of tokenizers and does not fail on an incomplete final round.
df["times"] = df.groupby("tokenizer").cumcount() + 1
df.head()

| | tokenizer | train_acc | valid_acc | elapsed_time | cpu_time | times |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | JpTokenizerMeCab | 1.0 | 0.940299 | 61.854007 | 296.743731 | 1 |
| 1 | JpTokenizerSentencePiece | 1.0 | 0.954772 | 93.701337 | 543.958018 | 1 |
| 2 | JpTokenizerMeCab | 1.0 | 0.952058 | 60.94181 | 289.572545 | 2 |
| 3 | JpTokenizerSentencePiece | 1.0 | 0.95251 | 94.718693 | 547.378444 | 2 |
| 4 | JpTokenizerMeCab | 1.0 | 0.943917 | 61.125902 | 292.175544 | 3 |


## Evaluating execution time

In [8]:
# Reshape to one row per (metric, run) and one column per tokenizer.
metrics = ["valid_acc", "train_acc", "elapsed_time", "cpu_time"]
_acc_df = df.pivot(index="tokenizer", columns="times", values=metrics).T
_acc_df.head(10)

| | times | JpTokenizerMeCab | JpTokenizerSentencePiece |
| --- | --- | --- | --- |
| valid_acc | 1 | 0.940299 | 0.954772 |
| valid_acc | 2 | 0.952058 | 0.95251 |
| valid_acc | 3 | 0.943917 | 0.952962 |
| valid_acc | 4 | 0.954772 | 0.962913 |
| valid_acc | 5 | 0.956128 | 0.960199 |
| valid_acc | 6 | 0.952058 | 0.958842 |
| valid_acc | 7 | 0.949344 | 0.958842 |
| valid_acc | 8 | 0.947083 | 0.96246 |
| valid_acc | 9 | 0.945726 | 0.958842 |
| valid_acc | 10 | 0.948892 | 0.954319 |


### Elapsed time

In [9]:
edf = _acc_df.loc["elapsed_time"].dropna().T
edf["mean"] = edf.mean(axis=1)
edf["std"] = edf.std(axis=1)
edf

| tokenizer | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | mean | std |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JpTokenizerMeCab | 61.854007 | 60.94181 | 61.125902 | 61.077122 | 61.930245 | 60.87357 | 61.346454 | 60.805689 | 60.799273 | 60.492267 | ... | 61.262253 | 61.133369 | 61.754891 | 61.204561 | 61.28028 | 60.941494 | 61.124489 | 61.42484 | 61.337754 | 0.599082 |
| JpTokenizerSentencePiece | 93.701337 | 94.718693 | 92.480894 | 93.132043 | 92.901012 | 93.051905 | 92.675434 | 93.246276 | 92.686435 | 94.048845 | ... | 93.473539 | 93.650603 | 93.477266 | 94.049311 | 93.301715 | 94.01532 | 93.246925 | 94.315198 | 93.57495 | 0.74462 |


In [10]:
for tkr, m, s in edf[["mean", "std"]].reset_index().values:
    print(f"{tkr}: {m/60:.1f} min ({s:.1f} sec)")

JpTokenizerMeCab: 1.0 min (0.6 sec)
JpTokenizerSentencePiece: 1.6 min (0.7 sec)
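
In the cell above, `std` is computed after the `mean` column has been appended, so the mean value itself takes part in the standard deviation. With 100 runs the effect is very small, but a minimal sketch that restricts both statistics to the run columns could look like the following. The `demo` frame holds only the first three elapsed-time runs shown above, and the names `demo` and `runs` are illustrative assumptions, not part of the original notebook.

```python
import pandas

# Illustrative frame: the first three elapsed-time runs from the table above.
demo = pandas.DataFrame(
    {1: [61.854007, 93.701337], 2: [60.94181, 94.718693], 3: [61.125902, 92.480894]},
    index=["JpTokenizerMeCab", "JpTokenizerSentencePiece"],
)

# Take a copy of the run columns first, then derive both statistics from that copy,
# so that the "mean" column never leaks into the calculation of "std".
runs = demo[list(demo.columns)]
demo["mean"] = runs.mean(axis=1)
demo["std"] = runs.std(axis=1)
print(demo[["mean", "std"]])
```

The same pattern applies unchanged to the CPU-time and accuracy summaries.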


### CPU time

In [11]:
cdf = _acc_df.loc["cpu_time"].dropna().T
cdf["mean"] = cdf.mean(axis=1)
cdf["std"] = cdf.std(axis=1)
cdf

| tokenizer | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | mean | std |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JpTokenizerMeCab | 296.743731 | 289.572545 | 292.175544 | 290.710029 | 298.243288 | 288.990884 | 292.800738 | 290.656328 | 289.569884 | 288.375333 | ... | 293.00305 | 292.3537 | 296.096029 | 291.818412 | 293.606473 | 290.806997 | 291.328865 | 294.110273 | 293.35075 | 3.965812 |
| JpTokenizerSentencePiece | 543.958018 | 547.378444 | 533.848559 | 535.987757 | 537.832567 | 536.450868 | 537.144222 | 537.024407 | 536.028281 | 543.482934 | ... | 540.933809 | 540.183844 | 540.746575 | 541.80144 | 540.050591 | 542.45719 | 537.523537 | 543.682308 | 540.473576 | 4.804031 |


In [12]:
for tkr, m, s in cdf[["mean", "std"]].reset_index().values:
    print(f"{tkr}: {m/60:.1f} min ({s:.1f} sec)")

JpTokenizerMeCab: 4.9 min (4.0 sec)
JpTokenizerSentencePiece: 9.0 min (4.8 sec)


## Accuracy evaluation

In [13]:
acc_df = _acc_df.loc["valid_acc"].dropna()
acc_df

| times | JpTokenizerMeCab | JpTokenizerSentencePiece |
| --- | --- | --- |
| 1 | 0.940299 | 0.954772 |
| 2 | 0.952058 | 0.952510 |
| 3 | 0.943917 | 0.952962 |
| 4 | 0.954772 | 0.962913 |
| 5 | 0.956128 | 0.960199 |
| 6 | 0.952058 | 0.958842 |
| 7 | 0.949344 | 0.958842 |
| 8 | 0.947083 | 0.962460 |
| 9 | 0.945726 | 0.958842 |
| 10 | 0.948892 | 0.954319 |


In [14]:
acc = acc_df.dropna().T.copy()
acc["mean"] = acc.mean(axis=1)
acc["std"] = acc.std(axis=1)
acc["mean"] *= 100
acc["std"] *= 100
acc.sort_values("mean", ascending=False)

| tokenizer | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | mean | std |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JpTokenizerSentencePiece | 0.954772 | 0.95251 | 0.952962 | 0.962913 | 0.960199 | 0.958842 | 0.958842 | 0.96246 | 0.958842 | 0.954319 | ... | 0.956581 | 0.955224 | 0.95251 | 0.949344 | 0.95251 | 0.955676 | 0.956128 | 0.962913 | 95.578924 | 0.374516 |
| JpTokenizerMeCab | 0.940299 | 0.952058 | 0.943917 | 0.954772 | 0.956128 | 0.952058 | 0.949344 | 0.947083 | 0.945726 | 0.948892 | ... | 0.945274 | 0.944369 | 0.945274 | 0.938489 | 0.945274 | 0.947083 | 0.950701 | 0.950701 | 94.865219 | 0.462855 |


In [15]:
for tkr, m, s in acc[["mean", "std"]].reset_index().values:
    print(f"{tkr}: {m:.1f} % ({s:.1f} %)")

JpTokenizerMeCab: 94.9 % (0.5 %)
JpTokenizerSentencePiece: 95.6 % (0.4 %)


## Statistical tests

### Normality test

In [16]:
# Shapiro-Wilk test; the null hypothesis is that the sample is drawn from a normal distribution.
for tkr in acc_df.columns:
    W, pvalue = scipy.stats.shapiro(acc_df[tkr].dropna())
    print(tkr, W, pvalue, pvalue < 0.05, "rejected" if pvalue < 0.05 else "not rejected")

JpTokenizerMeCab 0.9839680790901184 0.2669599950313568 False not rejected
JpTokenizerSentencePiece 0.9881190061569214 0.5170230269432068 False not rejected


### Using random numbers to gauge the sample size the test needs

In [17]:
# Normal random numbers, sample size = 10
x = numpy.random.normal(0, 1, 10)
scipy.stats.shapiro(x)

(0.9098219275474548, 0.27977025508880615)

In [18]:
# Normal random numbers, sample size = 100
x = numpy.random.normal(0, 1, 100)
scipy.stats.shapiro(x)

(0.9824507832527161, 0.20535781979560852)

In [19]:
# Uniform random numbers, sample size = 10
x = numpy.random.uniform(0, 1, 10)
scipy.stats.shapiro(x)  # <- normality could not be rejected

(0.8532965183258057, 0.06357227265834808)

In [20]:
# Uniform random numbers, sample size = 50
x = numpy.random.uniform(0, 1, 50)
scipy.stats.shapiro(x)

(0.9346017241477966, 0.008286652155220509)

In [21]:
# Uniform random numbers, sample size = 100
x = numpy.random.uniform(0, 1, 100)
scipy.stats.shapiro(x)

(0.9585599899291992, 0.003178815357387066)

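- With a sample size of 10, it looks hard to reject normality even when the data is not normal (the uniform sample above gave p ≈ 0.064)
    - The evaluation was therefore rerun with a sample size of 100
    - In the rerun, normality was not rejected
        - i.e. assuming normality does not contradict the measured data
- With 50 samples, the test seems only just able to reject normality
- In conclusion, 50 to 100 samples should be collected so that the test is able to reject normality when it does not hold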
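
Each cell above draws a single random sample, so the outcome varies from run to run. A rough sketch of how the same question could be checked over many trials is shown below: it repeats the Shapiro-Wilk test on uniform samples and reports how often normality is rejected at each sample size. The function name `rejection_rate`, the seed, and the number of trials are arbitrary assumptions for illustration.

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(0)  # fixed seed (arbitrary) for reproducibility


def rejection_rate(sample_size, n_trials=500, alpha=0.05):
    """Fraction of uniform samples for which Shapiro-Wilk rejects normality."""
    rejected = 0
    for _ in range(n_trials):
        x = rng.uniform(0, 1, sample_size)
        _, pvalue = scipy.stats.shapiro(x)
        rejected += int(pvalue < alpha)
    return rejected / n_trials


# Compare the three sample sizes tried in the cells above.
rates = {n: rejection_rate(n) for n in (10, 50, 100)}
for n, rate in rates.items():
    print(f"sample size={n}: normality rejected in {rate:.0%} of trials")
```

A rejection rate close to 100 % means that a sample of that size is usually enough to detect that uniform data is not normal.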

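### t-test (paired)
- Only two groups, MeCab and SentencePiece, are compared, so a t-test is sufficient
- The t-test is robust to departures from normality, so it is run here for reference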

In [22]:
cols = acc_df.columns
for base in cols:
    for target in [trg for trg in cols if trg != base]:
        t, pvalue = scipy.stats.ttest_rel(acc_df[base], acc_df[target])
        if pvalue < 0.05:
            print(base, target, t, pvalue, (pvalue < 0.05))

JpTokenizerMeCab JpTokenizerSentencePiece -18.456124412176518 8.098020258517182e-34 True
JpTokenizerSentencePiece JpTokenizerMeCab 18.456124412176518 8.098020258517182e-34 True
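
As an illustration of what `ttest_rel` computes, the paired t statistic is the mean of the per-run differences divided by its standard error, $t = \bar{d} / (s_d / \sqrt{n})$. The sketch below checks this by hand on runs 1 to 10 only (the values listed in the accuracy table above), so its numbers differ from the 100-run result; the variable names are illustrative.

```python
import numpy
import scipy.stats

# valid_acc of runs 1-10, copied from the accuracy table (a subset of the full data).
mecab = numpy.array([0.940299, 0.952058, 0.943917, 0.954772, 0.956128,
                     0.952058, 0.949344, 0.947083, 0.945726, 0.948892])
sentencepiece = numpy.array([0.954772, 0.952510, 0.952962, 0.962913, 0.960199,
                             0.958842, 0.958842, 0.962460, 0.958842, 0.954319])

# Paired t-test by hand: a one-sample t-test on the per-run differences.
d = mecab - sentencepiece
t_manual = d.mean() / (d.std(ddof=1) / numpy.sqrt(len(d)))

# The same statistic from SciPy.
t_scipy, pvalue = scipy.stats.ttest_rel(mecab, sentencepiece)
print(t_manual, t_scipy, pvalue)
```

Swapping the two arguments only flips the sign of the statistic, which is why the two output lines above carry the same p-value.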


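### Wilcoxon signed-rank test
- Two-sided test
- No continuity correction (accuracy is treated here as a continuous quantity rather than a discrete one)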

In [23]:
cols = acc_df.columns
for base in cols:
    for target in [trg for trg in cols if trg != base]:
        w, pvalue = scipy.stats.wilcoxon(acc_df[base], acc_df[target], correction=False)
        if pvalue < 0.05:
            print(base, target, w, pvalue, (pvalue < 0.05))

JpTokenizerMeCab JpTokenizerSentencePiece 2.0 8.834559079893054e-18 True
JpTokenizerSentencePiece JpTokenizerMeCab 2.0 8.834559079893054e-18 True
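
For the two-sided test, the statistic returned by `scipy.stats.wilcoxon` is the smaller of the two signed rank sums. The sketch below reproduces it by hand on runs 1 to 10 only (the values listed in the accuracy table above), so the numbers differ from the 100-run result; the variable names are illustrative.

```python
import numpy
import scipy.stats

# valid_acc of runs 1-10, copied from the accuracy table (a subset of the full data).
mecab = numpy.array([0.940299, 0.952058, 0.943917, 0.954772, 0.956128,
                     0.952058, 0.949344, 0.947083, 0.945726, 0.948892])
sentencepiece = numpy.array([0.954772, 0.952510, 0.952962, 0.962913, 0.960199,
                             0.958842, 0.958842, 0.962460, 0.958842, 0.954319])

d = mecab - sentencepiece
ranks = scipy.stats.rankdata(numpy.abs(d))  # rank the absolute differences
w_plus = ranks[d > 0].sum()                 # rank sum of the positive differences
w_minus = ranks[d < 0].sum()                # rank sum of the negative differences
w_manual = min(w_plus, w_minus)             # two-sided test statistic

w_scipy, pvalue = scipy.stats.wilcoxon(mecab, sentencepiece, correction=False)
print(w_plus, w_minus, w_manual, w_scipy, pvalue)
```

A statistic near zero, as in the 100-run output above, means that almost every run favoured the same tokenizer.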


### Test results

- Both the t-test and the Wilcoxon signed-rank test found a significant difference

| tokenizer name | accuracy mean (std) [%] |
| --------------- | --- |
| JpTokenizerMeCab | 94.9 (0.5) |
| JpTokenizerSentencePiece | 95.6 (0.4) |

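- The mean accuracy is $94.9 \% (\pm 0.5 \%)$ for MeCab and $95.6 \% (\pm 0.4 \%)$ for SentencePiece
    - Accuracy: MeCab < SentencePiece
    - A difference of this size is very unlikely to arise by chance ($p \approx 8.1 \times 10^{-34}$ for the paired t-test and $p \approx 8.8 \times 10^{-18}$ for the Wilcoxon signed-rank test), so it can be said to have some underlying cause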

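## Summary

- Comparing the accuracy of MeCab and SentencePiece, SentencePiece is significantly better (by about 0.7 percentage points)
- The relationship between accuracy and execution time is as follows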
    
| tokenizer name | accuracy mean (std) | elapsed time mean (std) | cpu time mean (std) |
| -------------- | --- | ----------------------- | ------------------- |
| JpTokenizerMeCab | 94.9 % (0.5 %) | 1.0 min (0.6 sec) | 4.9 min (4.0 sec) |
| JpTokenizerSentencePiece | 95.6 % (0.4 %) | 1.6 min (0.7 sec) | 9.0 min (4.8 sec) |


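- The elapsed times differ by about 0.5 min (roughly 32 sec: 61.3 sec vs. 93.6 sec on average)
- The CPU times are about 4.9 min and 9.0 min, a nearly twofold difference
    - This is presumably because SentencePiece runs on multiple CPUs
        - which is why its CPU time is nearly double
    - MeCab by itself runs on 1 CPU, whereas SentencePiece training (fit()) runs on 8 CPUs, which likely affects the CPU time
    - The pipeline after tokenization (MeCab, SentencePiece) is identical (both use 8 CPUs from that point on)
- To summarize
    - When computing resources are sufficient (2 or more CPUs), the difference in elapsed time is not large (both are fast enough for practical use)
    - When computing resources are sufficient, SentencePiece is a reasonable choice because its accuracy is slightly higher
    - When resources are limited to 1 CPU and elapsed time takes priority, MeCab appears to be the better choice
        - Example: on 1 CPU, CPU time ≈ elapsed time, so the time difference would be roughly twofold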
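
The timing comparison can be reproduced from the unrounded means in the elapsed-time and CPU-time tables. The sketch below is plain arithmetic on those values; reading the ratio of CPU time to elapsed time as the average number of busy CPUs is an assumption made for illustration, not something measured in the notebook.

```python
# Mean values in seconds, copied from the elapsed-time and CPU-time tables.
elapsed = {"JpTokenizerMeCab": 61.337754, "JpTokenizerSentencePiece": 93.57495}
cpu = {"JpTokenizerMeCab": 293.35075, "JpTokenizerSentencePiece": 540.473576}

elapsed_diff = elapsed["JpTokenizerSentencePiece"] - elapsed["JpTokenizerMeCab"]
cpu_ratio = cpu["JpTokenizerSentencePiece"] / cpu["JpTokenizerMeCab"]
print(f"elapsed time difference: {elapsed_diff:.1f} sec")
print(f"CPU time ratio (SentencePiece / MeCab): {cpu_ratio:.2f}")

# CPU time divided by elapsed time: a rough indicator (assumption) of how many
# CPUs were busy on average during one run.
busy = {name: cpu[name] / elapsed[name] for name in elapsed}
for name, value in busy.items():
    print(f"{name}: about {value:.1f} CPUs busy on average")
```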