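## The PTB (Penn Treebank) dataset

PTB is a corpus that is often used as a benchmark for measuring the quality of a method.

The version used here was prepared by Tomas Mikolov, the author of word2vec. This PTB corpus applies some preprocessing to the original PTB (for example, concrete numbers are replaced with N).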
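To illustrate the kind of preprocessing mentioned above, here is a minimal sketch that replaces numeric tokens with `N`. The helper `replace_numbers` and its regular expression are assumptions made for illustration, not the actual script used to prepare the corpus.

```python
import re

# Hypothetical pattern: digits, optionally followed by groups such as ",645" or ".90".
_NUMBER = re.compile(r"^\d+([.,]\d+)*$")

def replace_numbers(tokens):
    """Return a copy of tokens in which every numeric token is replaced by 'N'."""
    return ["N" if _NUMBER.match(t) else t for t in tokens]

tokens = "the index rose 3.5 points to 2,645.90 in 1989".split()
normalized = replace_numbers(tokens)
print(normalized)
```

Collapsing all numbers into one token keeps the vocabulary small, since each distinct number would otherwise get its own word ID.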

In [1]:
import numpy as np
import sys
sys.path.append("../../deep-learning-from-scratch-2")
from dataset import ptb

In [2]:
corpus, word_to_id, id_to_word = ptb.load_data("train")

Downloading ptb.train.txt ... 
Done


In [3]:
len(corpus)

929589

In [5]:
corpus[:30]

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])

In [7]:
id_to_word[1]

'banknote'
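The first 30 entries of `corpus` being 0 to 29 is consistent with IDs being handed out in order of first appearance. A minimal sketch of that idea on a toy sentence follows. `build_vocab` is a hypothetical helper written for this note, not `ptb.load_data` itself.

```python
import numpy as np

def build_vocab(words):
    """Assign each new word the next free ID, in order of first appearance."""
    word_to_id, id_to_word = {}, {}
    for w in words:
        if w not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[w] = new_id
            id_to_word[new_id] = w
    # The corpus is the word sequence expressed as an array of IDs.
    corpus = np.array([word_to_id[w] for w in words])
    return corpus, word_to_id, id_to_word

toy_corpus, toy_w2i, toy_i2w = build_vocab("you say goodbye and i say hello .".split())
print(toy_corpus)   # [0 1 2 3 4 1 5 6]
print(toy_i2w[1])   # say
```

A repeated word ("say") reuses its existing ID, so the IDs only count up while new words keep appearing.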

## Evaluation on PTB

Now let's evaluate the count-based method using the PTB corpus.
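Before running on the full corpus, the two counting steps can be sketched on a toy ID sequence: build a co-occurrence matrix, then convert it to PPMI. The functions `cooccurrence` and `ppmi_matrix` below are simplified stand-ins written for this note, not the `common.util` implementations, and the `eps` placement is an assumption chosen to avoid division by zero.

```python
import numpy as np

def cooccurrence(corpus, vocab_size, window_size=1):
    """Count how often each pair of word IDs appears within window_size of each other."""
    C = np.zeros((vocab_size, vocab_size), dtype=np.int64)
    n = len(corpus)
    for idx, word_id in enumerate(corpus):
        for offset in range(1, window_size + 1):
            left, right = idx - offset, idx + offset
            if left >= 0:
                C[word_id, corpus[left]] += 1
            if right < n:
                C[word_id, corpus[right]] += 1
    return C

def ppmi_matrix(C, eps=1e-8):
    """Positive PMI: max(0, log2(C_ij * N / (S_i * S_j))), computed for all pairs at once."""
    total = C.sum()
    row = C.sum(axis=1, keepdims=True)
    col = C.sum(axis=0, keepdims=True)
    pmi = np.log2(C * total / (row * col + eps) + eps)
    return np.maximum(0.0, pmi)

toy_ids = np.array([0, 1, 2, 3, 4, 1, 5, 6])  # "you say goodbye and i say hello ."
C_toy = cooccurrence(toy_ids, vocab_size=7, window_size=1)
M_toy = ppmi_matrix(C_toy)
print(C_toy[1])               # neighbours of word 1 ("say")
print(np.round(M_toy[0], 3))  # PPMI row for word 0 ("you")
```

PPMI discounts pairs that co-occur only because one of the words is frequent, which raw counts cannot do.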

In [8]:
import numpy as np
import sys
sys.path.append("../../deep-learning-from-scratch-2")
from dataset import ptb
from common.util import most_similar, create_co_matrix, ppmi

In [10]:
window_size = 2
wordvec_size = 100

corpus, word_to_id, id_to_word = ptb.load_data("train")
vocab_size = len(word_to_id)
print("counting co-occurrence")
C = create_co_matrix(corpus, vocab_size, window_size)
print("calculating PPMI...")
W = ppmi(C, verbose=True)

counting co-occurrence
calculating PPMI...


1.0% done
2.0% done
3.0% done
...

In [11]:
print("calculating SVD...")
try:
    # Truncated SVD (fast): computes only the top wordvec_size components
    from sklearn.utils.extmath import randomized_svd
    U, S, V = randomized_svd(W, n_components=wordvec_size, n_iter=5, random_state=None)
except ImportError:
    # Fallback: full SVD with NumPy (slow for a matrix of this size)
    U, S, V = np.linalg.svd(W)

word_vecs = U[:, :wordvec_size]
querys = ['you', 'year', 'car', 'toyota']
for query in querys:
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)

calculating SVD...

[query] you
 i: 0.659272404453
 we: 0.643764752712
 do: 0.558835135859
 've: 0.525614995864
 else: 0.520987551911

[query] year
 month: 0.719843874716
 last: 0.656193725542
 earlier: 0.617504343103
 next: 0.609180893681
 quarter: 0.57230778268

[query] car
 auto: 0.643679285739
 luxury: 0.591547609951
 cars: 0.547585550667
 corsica: 0.494852574098
 domestic: 0.488622881854

[query] toyota
 motor: 0.704159513726
 nissan: 0.667169334528
 motors: 0.644274502467
 honda: 0.598282060488
 lexus: 0.568803599714
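The rankings above come from comparing reduced word vectors by cosine similarity. A minimal sketch of that step on a small random matrix is shown below. `cos_similarity` and `rank_similar` are hypothetical names for this note and are not the `common.util.most_similar` implementation; the matrix is random stand-in data, not PPMI values from PTB.

```python
import numpy as np

def cos_similarity(x, y, eps=1e-8):
    """Cosine of the angle between x and y (eps guards against zero vectors)."""
    nx = x / (np.sqrt(np.sum(x ** 2)) + eps)
    ny = y / (np.sqrt(np.sum(y ** 2)) + eps)
    return float(np.dot(nx, ny))

def rank_similar(query_id, word_vecs, top=3):
    """Return (id, similarity) pairs for the rows most similar to the query row."""
    sims = np.array([cos_similarity(word_vecs[query_id], v) for v in word_vecs])
    sims[query_id] = -np.inf  # exclude the query itself
    return [(int(i), float(sims[i])) for i in np.argsort(-sims)[:top]]

rng = np.random.default_rng(0)
M = rng.random((6, 6))
M = (M + M.T) / 2                     # symmetric, like a co-occurrence-based matrix
U_toy, S_toy, Vt_toy = np.linalg.svd(M)
vecs = U_toy[:, :2]                   # keep only the first 2 dimensions
ranking = rank_similar(0, vecs, top=3)
print(ranking)
```

Keeping fewer columns of `U` changes the similarities, which is why the scores differ when the vectors are cut from 100 to 50 dimensions.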


In [12]:
U.shape

(10000, 100)

In [14]:
U[:, :100].shape

(10000, 100)

In [16]:
word_vecs = U[:, :50]
querys = ['you', 'year', 'car', 'toyota']
for query in querys:
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)


[query] you
 i: 0.870114229959
 we: 0.786287576884
 someone: 0.727762149555
 'd: 0.714735312514
 somebody: 0.70648361836

[query] year
 earlier: 0.764775907173
 month: 0.764628556481
 next: 0.713874340124
 third: 0.706989101134
 quarter: 0.694524582574

[query] car
 luxury: 0.751224408098
 auto: 0.749123094778
 cars: 0.622722510069
 vehicle: 0.610167445636
 domestic: 0.609535835182

[query] toyota
 motor: 0.786378712118
 nissan: 0.773280435139
 motors: 0.70806340548
 brown-forman: 0.699834371766
 mazda: 0.696984723071


In [17]:
W.shape

(10000, 10000)