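## 60. Loading and displaying word vectors
Download the word vectors pretrained on the Google News dataset (about 100 billion words; 3 million words and phrases, 300 dimensions) and display the word vector for "United States". Note that "United States" is represented internally as "United_States".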

In [1]:
import gensim
wv = gensim.models.KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin.gz', binary=True)

In [2]:
print(wv["United_States"].shape)
wv["United_States"]

(300,)


array([-3.61328125e-02, -4.83398438e-02,  2.35351562e-01,  1.74804688e-01,
       -1.46484375e-01, -7.42187500e-02, -1.01562500e-01, -7.71484375e-02,
        1.09375000e-01, -5.71289062e-02, -1.48437500e-01, -6.00585938e-02,
        1.74804688e-01, -7.71484375e-02,  2.58789062e-02, -7.66601562e-02,
       -3.80859375e-02,  1.35742188e-01,  3.75976562e-02, -4.19921875e-02,
       -3.56445312e-02,  5.34667969e-02,  3.68118286e-04, -1.66992188e-01,
       -1.17187500e-01,  1.41601562e-01, -1.69921875e-01, -6.49414062e-02,
       -1.66992188e-01,  1.00585938e-01,  1.15722656e-01, -2.18750000e-01,
       -9.86328125e-02, -2.56347656e-02,  1.23046875e-01, -3.54003906e-02,
       -1.58203125e-01, -1.60156250e-01,  2.94189453e-02,  8.15429688e-02,
        6.88476562e-02,  1.87500000e-01,  6.49414062e-02,  1.15234375e-01,
       -2.27050781e-02,  3.32031250e-01, -3.27148438e-02,  1.77734375e-01,
       -2.08007812e-01,  4.54101562e-02, -1.23901367e-02,  1.19628906e-01,
        7.44628906e-03, -
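
The underscore convention mentioned in the task can be wrapped in a small helper. The sketch below is illustrative only: `phrase_to_key` and `toy_vectors` are hypothetical names that do not appear in the notebook, and the dict merely stands in for the real `wv` object (gensim's `KeyedVectors` supports the same `key in wv` membership check).

```python
def phrase_to_key(phrase: str) -> str:
    """Join the words of a phrase with underscores ('United States' -> 'United_States')."""
    return "_".join(phrase.split())


# Hypothetical stand-in for the loaded KeyedVectors: a dict with one made-up entry.
toy_vectors = {"United_States": [0.1, -0.2, 0.3]}

key = phrase_to_key("United States")
if key in toy_vectors:
    print(key, len(toy_vectors[key]))
else:
    print(f"{key} is not in the vocabulary")
```

Checking membership before indexing avoids a `KeyError` for phrases that the pretrained vocabulary does not contain.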

## 61. Word similarity
Compute the cosine similarity between "United States" and "U.S.".

In [3]:
wv.similarity("United_States", "U.S.")

0.73107743

In [4]:
# Compute the same value by hand
import numpy as np

def cos_sim(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))


united_states_v = wv["United_States"]
us_v = wv["U.S."]

cos_sim(united_states_v, us_v)

0.7310775
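
For reference, `cos_sim` implements the usual definition of cosine similarity for two vectors $u$ and $v$ (here with $d = 300$ components):

$$
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
           = \frac{\sum_{i=1}^{d} u_i v_i}{\sqrt{\sum_{i=1}^{d} u_i^2}\,\sqrt{\sum_{i=1}^{d} v_i^2}}
$$

The two printed values agree to about seven significant digits; the difference in the last digit is most likely float32 rounding caused by a different order of operations.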

## 62. Top 10 most similar words
Output the 10 words with the highest cosine similarity to "United States", together with their similarity scores.

In [5]:
wv.most_similar("United_States", topn=10)

[('Unites_States', 0.7877248525619507),
 ('Untied_States', 0.7541370391845703),
 ('United_Sates', 0.74007248878479),
 ('U.S.', 0.7310774326324463),
 ('theUnited_States', 0.6404393911361694),
 ('America', 0.6178410053253174),
 ('UnitedStates', 0.6167312264442444),
 ('Europe', 0.6132988929748535),
 ('countries', 0.6044804453849792),
 ('Canada', 0.6019070148468018)]
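
Conceptually, a top-n similarity query ranks every vocabulary entry by its cosine similarity to the query vector. The following is a minimal brute-force sketch of that idea on made-up 2-dimensional vectors; `cosine`, `top_n_similar`, and `toy_vectors` are hypothetical names, and this is not a description of how gensim implements `most_similar` internally.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_n_similar(vectors, query_key, n=10):
    """Return the n (key, similarity) pairs closest to query_key, excluding query_key."""
    query = vectors[query_key]
    scored = ((key, cosine(query, vec)) for key, vec in vectors.items() if key != query_key)
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Made-up vectors, for illustration only.
toy_vectors = {
    "United_States": [1.0, 0.0],
    "U.S.": [0.9, 0.1],
    "Canada": [0.6, 0.8],
    "banana": [-1.0, 0.2],
}

for key, score in top_n_similar(toy_vectors, "United_States", n=2):
    print(key, round(score, 4))
```

The real result list above is dominated by misspellings such as `Unites_States`, which is what one would expect if typos occur in the same contexts as the correct phrase.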

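## 63. Analogy via additive compositionality
Subtract the vector for "Madrid" from the word vector for "Spain", add the vector for "Athens", and output the 10 words most similar to the resulting vector, together with their similarity scores.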

In [6]:
wv.most_similar(positive=["Spain", "Athens"], negative=["Madrid"], topn=10)

[('Greece', 0.6898481249809265),
 ('Aristeidis_Grigoriadis', 0.5606848001480103),
 ('Ioannis_Drymonakos', 0.5552908778190613),
 ('Greeks', 0.545068621635437),
 ('Ioannis_Christou', 0.5400862693786621),
 ('Hrysopiyi_Devetzi', 0.5248444676399231),
 ('Heraklio', 0.5207759737968445),
 ('Athens_Greece', 0.516880989074707),
 ('Lithuania', 0.5166866183280945),
 ('Iraklion', 0.5146791934967041)]

In [7]:
new_v = wv["Spain"] - wv["Madrid"] + wv["Athens"]
wv.similar_by_vector(new_v, topn=10)

# With this approach, a word used in the computation ("Athens") also appears in the results

[('Athens', 0.7528455853462219),
 ('Greece', 0.6685472726821899),
 ('Aristeidis_Grigoriadis', 0.5495778322219849),
 ('Ioannis_Drymonakos', 0.5361456871032715),
 ('Greeks', 0.5351786613464355),
 ('Ioannis_Christou', 0.5330226421356201),
 ('Hrysopiyi_Devetzi', 0.5088489055633545),
 ('Iraklion', 0.5059264898300171),
 ('Greek', 0.5040615797042847),
 ('Athens_Greece', 0.5034109354019165)]
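
The two cells give different rankings and scores. As far as the gensim source and documentation suggest, `most_similar` length-normalises each input vector before combining them and drops the input words from the result, whereas `similar_by_vector` ranks against the raw sum and keeps every word. The sketch below imitates both behaviours on made-up 4-dimensional vectors; `unit`, `cosine`, `analogy`, and `toy_vectors` are hypothetical names, and the code is an approximation of the idea rather than gensim's implementation.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity, computed as the dot product of the unit-length vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def analogy(vectors, positive, negative, topn=3, exclude_inputs=True):
    """Rank keys by similarity to sum(unit(positive)) - sum(unit(negative))."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for sign, keys in ((1.0, positive), (-1.0, negative)):
        for key in keys:
            target = [t + sign * x for t, x in zip(target, unit(vectors[key]))]
    skip = (set(positive) | set(negative)) if exclude_inputs else set()
    scored = [(k, cosine(target, v)) for k, v in vectors.items() if k not in skip]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors: axes are roughly (country-ness, Spain, Greece, noise).
toy_vectors = {
    "Spain": [0.3, 1.0, 0.0, 0.0],
    "Madrid": [0.0, 1.0, 0.0, 0.0],
    "Athens": [0.0, 0.0, 1.0, 0.0],
    "Greece": [0.3, 0.0, 1.0, 0.6],
    "banana": [0.0, 0.0, 0.0, 1.0],
}

print(analogy(toy_vectors, ["Spain", "Athens"], ["Madrid"], topn=1))
print(analogy(toy_vectors, ["Spain", "Athens"], ["Madrid"], topn=1, exclude_inputs=False))
```

With the inputs excluded the top answer is `Greece`; without the exclusion `Athens` itself ranks first, mirroring the behaviour seen in the cell above.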

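## 64. Experiments on the analogy data
Download the word analogy evaluation data, compute vec(word in column 2) - vec(word in column 1) + vec(word in column 3), and find the word most similar to that vector along with its similarity.  
Append the word and its similarity to the end of each example.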

In [8]:
from tqdm import tqdm

# Load the word analogy evaluation data (downloaded to data/ beforehand)
with open("data/questions-words.txt", "r") as f:
    lines = f.readlines()

new_list = []
    
for line in tqdm(lines):
    if ":" in line:
        new_list.append(line)
        continue
    line = line.strip()  # strip the trailing newline
    word_list = line.split()  # split on whitespace
    similar_word, similarity = wv.most_similar(positive=[word_list[1], word_list[2]], negative=[word_list[0]], topn=1)[0]
    new_list.append(line+" "+similar_word+" "+str(similarity)+"\n")
    
with open("data/analogy.txt", "w") as wf:
    wf.writelines(new_list)

100%|██████████| 19558/19558 [44:04<00:00,  7.40it/s]


## 65. Accuracy on the analogy task
Using the output of 64, measure the accuracy on semantic analogies and syntactic analogies.

Semantic analogies  
capital-common-countries, capital-world, currency, city-in-state, family  
Syntactic analogies  
gram1-adjective-to-adverb, gram2-opposite, gram3-comparative, gram4-superlative, gram5-present-participle, gram6-nationality-adjective, gram7-past-tense, gram8-plural, gram9-plural-verbs

In [9]:
with open("data/analogy.txt", "r") as f:
    lines = f.readlines()

sem = True  # True while reading the semantic analogy sections
sem_acc = 0
syn_acc = 0
sem_sum = 0
syn_sum = 0
for line in lines:
    if ":" in line:
        if "gram" in line:
            sem = False
        continue
    line_list = line.split()
    if sem:
        sem_sum += 1
        if line_list[3] == line_list[4]:
            sem_acc += 1
    else:
        syn_sum += 1
        if line_list[3] == line_list[4]:
            syn_acc += 1

print("Semantic analogy accuracy: {}%".format(round(sem_acc/sem_sum*100, 2)))
print("Syntactic analogy accuracy: {}%".format(round(syn_acc/syn_sum*100, 2)))

Semantic analogy accuracy: 73.09%
Syntactic analogy accuracy: 74.0%
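
The same file can also be broken down by the individual categories listed in the task. The sketch below is one possible way to do that, assuming each data line has the layout `a b c expected predicted similarity` and each section starts with a `: name` header; `accuracy_by_category` and the `sample` lines (including their similarity values) are made up for illustration and are not results from the actual run.

```python
from collections import defaultdict


def accuracy_by_category(lines):
    """Return {category: (correct, total)} for lines in the analogy.txt layout."""
    counts = defaultdict(lambda: [0, 0])
    category = None
    for line in lines:
        if line.startswith(":"):
            category = line[1:].strip()  # section header, e.g. ": family"
            continue
        fields = line.split()
        if category is None or len(fields) < 5:
            continue  # skip blank or malformed lines
        counts[category][1] += 1
        if fields[3] == fields[4]:  # expected word == predicted word
            counts[category][0] += 1
    return {name: tuple(pair) for name, pair in counts.items()}


# Made-up sample lines, for illustration only.
sample = [
    ": capital-common-countries\n",
    "Athens Greece Baghdad Iraq Iraq 0.63\n",
    "Athens Greece Bangkok Thailand Thailand 0.71\n",
    ": gram3-comparative\n",
    "bad worse big bigger bigger 0.75\n",
    "bad worse bright brighter brightest 0.60\n",
]

for name, (correct, total) in accuracy_by_category(sample).items():
    kind = "syntactic" if name.startswith("gram") else "semantic"
    print(f"{name} ({kind}): {correct}/{total} = {correct / total:.1%}")
```

Matching headers with `startswith(":")` rather than `":" in line` avoids misclassifying a data line that happens to contain a colon.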
