## Report 5: Additional Assignment

I ran the program that evaluates a count-based method on the PTB dataset.
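
As a reminder of what a count-based method starts from, the sketch below builds a co-occurrence matrix for a toy sentence. The function name `co_matrix_sketch`, the `toy_*` variables, and the toy sentence are assumptions made for illustration. The program in this report uses `create_co_matrix` from `common.util`, whose implementation may differ in detail.

```python
import numpy as np


def co_matrix_sketch(corpus, vocab_size, window_size=1):
    """Count how often each pair of word IDs occurs within window_size positions."""
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)
    for idx, word_id in enumerate(corpus):
        lo = max(0, idx - window_size)
        hi = min(len(corpus), idx + window_size + 1)
        for ctx_idx in range(lo, hi):
            if ctx_idx != idx:  # skip the center word itself
                co_matrix[word_id, corpus[ctx_idx]] += 1
    return co_matrix


# Toy sentence (hypothetical, not taken from PTB)
toy_words = 'you say goodbye and i say hello .'.split()
toy_word_to_id = {}
for w in toy_words:
    toy_word_to_id.setdefault(w, len(toy_word_to_id))
toy_corpus = [toy_word_to_id[w] for w in toy_words]

toy_C = co_matrix_sketch(toy_corpus, len(toy_word_to_id), window_size=1)
print(toy_C[toy_word_to_id['say']])  # -> [1 0 1 0 1 1 0]
```

Each row is the raw context-count vector of one word. With the PTB vocabulary this matrix is large and mostly zero, which is why the later steps reweight it (PPMI) and reduce its dimensionality (SVD).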

In [2]:
# coding: utf-8
import sys
sys.path.append('..')
import numpy as np
from common.util import most_similar, create_co_matrix, ppmi
from dataset import ptb


window_size = 2
wordvec_size = 100

corpus, word_to_id, id_to_word = ptb.load_data('train')
vocab_size = len(word_to_id)
print('counting  co-occurrence ...')
C = create_co_matrix(corpus, vocab_size, window_size)
print('calculating PPMI ...')
W = ppmi(C, verbose=True)

print('calculating SVD ...')
try:
    # truncated SVD (fast!)
    from sklearn.utils.extmath import randomized_svd
    U, S, V = randomized_svd(W, n_components=wordvec_size, n_iter=5,
                             random_state=None)
except ImportError:
    # SVD (slow)
    U, S, V = np.linalg.svd(W)

word_vecs = U[:, :wordvec_size]

querys = ['you', 'year', 'car', 'toyota']
for query in querys:
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)


counting  co-occurrence ...
calculating PPMI ...
1.0% done
2.0% done
...
87.0% done
88.
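
The PPMI step whose progress is logged above replaces raw counts with positive pointwise mutual information, PPMI(x, y) = max(0, log2(C(x, y) * N / (C(x) * C(y)))). The following is a vectorized sketch under stated assumptions: N is taken as the sum of all entries of the co-occurrence matrix, C(x) as the column sum for word x, and a small `eps` guards against log2(0) and division by zero. The name `ppmi_sketch` is hypothetical; this is not the `ppmi` function from `common.util`.

```python
import numpy as np


def ppmi_sketch(C, eps=1e-8):
    """Positive PMI of a co-occurrence matrix C (assumed conventions, see text)."""
    C = np.asarray(C, dtype=np.float64)
    total = C.sum()                       # N: total number of co-occurrences
    word_counts = C.sum(axis=0)           # C(x): how often each word co-occurs at all
    expected = np.outer(word_counts, word_counts)
    pmi = np.log2(C * total / (expected + eps) + eps)
    return np.maximum(0.0, pmi)           # clip negative PMI to zero


# Tiny hypothetical example: two words that only ever co-occur with each other
toy_counts = np.array([[0, 1],
                       [1, 0]])
print(np.round(ppmi_sketch(toy_counts), 3))
```

Because this version works on whole arrays instead of looping over every cell, it has no per-cell progress to report, at the price of holding a second `vocab_size` x `vocab_size` array in memory.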

In the same way, I displayed the words closest to several other queries.

In [3]:
querys = ['morning', 'month', 'watch', 'report']
for query in querys:
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)


[query] morning
 evening: 0.6000343561172485
 session: 0.5879909992218018
 afternoon: 0.5740914940834045
 midnight: 0.5687413811683655
 p.m.: 0.555871307849884

[query] month
 year: 0.7092810273170471
 last: 0.6997873783111572
 week: 0.689609169960022
 earlier: 0.5959507822990417
 february: 0.5755984783172607

[query] watch
 bunny: 0.46495112776756287
 die: 0.39813417196273804
 nsc: 0.38462623953819275
 reporter: 0.38444390892982483
 send: 0.3709089457988739

[query] report
 reports: 0.570824384689331
 provided: 0.45376211404800415
 figures: 0.45343369245529175
 numbers: 0.434186726808548
 purchasing: 0.4120323956012726
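
The scores above are similarity values between word vectors. As far as I understand, `most_similar` ranks words by cosine similarity. A minimal sketch of that kind of ranking follows. The function `rank_by_cosine` and the three toy vectors are made up for illustration and are not results from PTB.

```python
import numpy as np


def rank_by_cosine(query, word_to_id, id_to_word, word_vecs, top=5, eps=1e-8):
    """Return the `top` words most similar to `query` by cosine similarity."""
    if query not in word_to_id:
        raise KeyError(f'{query} is not in the vocabulary')
    query_id = word_to_id[query]
    # Normalize each row so that a dot product equals the cosine similarity
    norms = np.linalg.norm(word_vecs, axis=1, keepdims=True)
    unit_vecs = word_vecs / (norms + eps)
    sims = unit_vecs @ unit_vecs[query_id]
    order = np.argsort(-sims)  # indices sorted by descending similarity
    ranked = [(id_to_word[int(i)], float(sims[i])) for i in order if i != query_id]
    return ranked[:top]


# Toy vocabulary and vectors (hypothetical values)
toy_id_to_word = {0: 'w0', 1: 'w1', 2: 'w2'}
toy_word_to_id = {w: i for i, w in toy_id_to_word.items()}
toy_vecs = np.array([[1.0, 0.0],
                     [0.9, 0.1],
                     [0.0, 1.0]])

for word, score in rank_by_cosine('w0', toy_word_to_id, toy_id_to_word, toy_vecs, top=2):
    print(f' {word}: {score}')
```

Cosine similarity only compares directions, so a word whose vector is built from few or noisy co-occurrence counts can still end up close to an unrelated word.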


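### Impressions
For the example queries (you, year, car, toyota), the returned words seemed quite similar to each query. Among the queries I tried myself, however, the words returned for watch struck me as largely unrelated to it. In addition, the SVD took a considerable amount of time, which I felt is a weakness of the count-based approach.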
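
On the cost of the SVD: a full SVD of a dense n x n matrix takes on the order of n^3 operations, which is why the program tries a truncated, randomized SVD before falling back to `np.linalg.svd`. The sketch below illustrates the basic randomized idea: project onto a small random subspace, refine it with a few power iterations, then take an exact SVD of the small projected matrix. The function `randomized_svd_sketch` and its parameters are hypothetical. It is not scikit-learn's implementation, and the low-rank test matrix is synthetic.

```python
import numpy as np


def randomized_svd_sketch(A, k, n_oversamples=10, n_iter=5, seed=0):
    """Approximate the top-k singular triplets of A with a randomized range finder."""
    rng = np.random.default_rng(seed)
    size = min(k + n_oversamples, min(A.shape))
    omega = rng.standard_normal((A.shape[1], size))
    Q, _ = np.linalg.qr(A @ omega)        # orthonormal basis for a sample of range(A)
    for _ in range(n_iter):               # power iterations sharpen the basis
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    B = Q.T @ A                           # small (size x n_cols) projected matrix
    U_small, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small                       # map back to the original row space
    return U[:, :k], S[:k], Vt[:k]


# Synthetic rank-5 matrix (hypothetical stand-in for a PPMI matrix)
toy_rng = np.random.default_rng(1)
A_toy = toy_rng.standard_normal((60, 5)) @ toy_rng.standard_normal((5, 60))

U_k, S_k, Vt_k = randomized_svd_sketch(A_toy, k=5)
S_full = np.linalg.svd(A_toy, compute_uv=False)
print(np.allclose(S_k, S_full[:5]))
```

The expensive exact SVD is applied only to the small matrix `B`, so the cost is dominated by a handful of matrix products with `A` rather than by a decomposition of the full matrix.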

### References
Koki Saitoh (斎藤 康毅), 『ゼロから作るDeep Learning②自然言語処理編』 (Deep Learning from Scratch 2: Natural Language Processing), O'Reilly Japan, 2018, pp. 57-92.