# Word2vec word embeddings

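Word2vec learns word vectors from which words appear near each other in text. The input layer is a one-hot vector over the whole vocabulary of the corpus: if the vocabulary has 1,000 words, the input is a 1,000-dimensional vector with a 1 at the position of the current word and 0 everywhere else. The input passes through a single hidden layer whose number of units equals the dimensionality of the word vectors we want to learn. The output layer has as many neurons as the input layer, one per vocabulary word, and each output value is the predicted probability that the corresponding word appears inside the context window of the current word. For words that really are neighbors the value should be as close to 1 as possible; for all other words it should be as close to 0 as possible.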
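The forward pass described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the five-word vocabulary, the 3-dimensional hidden layer, the random weights and the names `W_in`, `W_out` and `forward` are all made up here, and the output uses a full softmax for simplicity, which is one common formulation rather than necessarily what a given library trains with.

```python
import math
import random

random.seed(0)

vocab = ["king", "queen", "man", "woman", "tree"]  # toy vocabulary (assumption)
V, D = len(vocab), 3  # vocabulary size and embedding dimension

# Input-to-hidden weights (V x D): row i is the embedding of word i.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]
# Hidden-to-output weights (D x V).
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(D)]


def one_hot(index, size):
    """Vector of zeros with a single 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def forward(word):
    """Return (hidden vector, probability distribution over the vocabulary)."""
    x = one_hot(vocab.index(word), V)
    # Multiplying a one-hot vector by W_in simply selects one row of W_in.
    hidden = [sum(x[i] * W_in[i][d] for i in range(V)) for d in range(D)]
    scores = [sum(hidden[d] * W_out[d][j] for d in range(D)) for j in range(V)]
    # Softmax turns the raw scores into probabilities that sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return hidden, [e / total for e in exps]


hidden, probs = forward("king")
print(hidden)  # the (untrained) embedding of "king"
print(probs)   # one probability per vocabulary word
```

Training would adjust `W_in` and `W_out` so that the probabilities of true context words go up; after training, the rows of `W_in` are the word vectors.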

## Using the gensim package for Word2vec word embeddings

Word2vec is one of the main algorithms provided by the gensim package.

In [1]:
import gensim
import logging

In [2]:
# Configure the logger so the training process can be followed in detail
logging.basicConfig(format='%(asctime)s: %(levelname)s : %(message)s', level=logging.INFO)

In [3]:
from gensim.models import word2vec, Word2Vec

In [4]:
sentence = word2vec.Text8Corpus('./data/text8.txt')

In [5]:
# Text8Corpus is a lazy iterable over the corpus, so printing it only shows the object's repr
print(sentence)

<gensim.models.word2vec.Text8Corpus object at 0x000001C0551619D0>


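> gensim needs a restartable iterable (a list or tuple, or an object whose `__iter__` streams the corpus) that yields tokenized sentences, each one a list of tokens. A one-shot generator is not enough, because the corpus is scanned more than once: first to build the vocabulary, then once per training epoch. With this variable in place, gensim can start training.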
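A minimal sketch of such a restartable iterable, assuming the sentences are already available as a list of strings. The class name `TokenizedSentences` and the lowercase-and-split tokenization are illustrative choices, not part of gensim.

```python
class TokenizedSentences:
    """Restartable iterable that yields one list of tokens per sentence."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        # A fresh generator is created on every call to iter(), so the
        # corpus can be scanned as many times as needed.
        for line in self.lines:
            tokens = line.lower().split()
            if tokens:  # skip empty lines
                yield tokens


corpus = TokenizedSentences(["The king rules", "", "The queen rules too"])
first_pass = list(corpus)
second_pass = list(corpus)  # a second scan yields the same sentences again
print(first_pass)
```

An object like `corpus` could be passed where the notebook passes `sentence`; in practice `__iter__` would usually read lines from a file instead of a list held in memory.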

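__gensim.models.Word2Vec parameters__
- min_count: ignore words whose total count in the corpus is below this value
- size: dimensionality of the word vectors to learn (renamed `vector_size` in gensim 4.0)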
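To make the effect of `min_count` concrete, the sketch below counts tokens in a toy corpus and separates the words that would be kept from those that would be dropped. It only illustrates the idea and is not gensim's internal code; the corpus and the threshold of 2 are assumptions.

```python
from collections import Counter

sentences = [
    ["the", "king", "rules"],
    ["the", "queen", "rules"],
    ["the", "tree"],
]
min_count = 2  # hypothetical threshold for this toy corpus

# Count every token across all sentences.
counts = Counter(token for sent in sentences for token in sent)

# Words at or above the threshold stay in the vocabulary; the rest are ignored.
kept = sorted(w for w, c in counts.items() if c >= min_count)
dropped = sorted(w for w, c in counts.items() if c < min_count)

print(kept)
print(dropped)
```

With `min_count=1`, as in the cell below, nothing is dropped, which is why the training log reports that 100% of the word types are retained.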

In [6]:
# Instantiate and train the Word2Vec model
model = gensim.models.Word2Vec(sentence, min_count=1, size=20)

2020-09-07 21:59:22,625: INFO : collecting all words and their counts
2020-09-07 21:59:22,679: INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-09-07 21:59:31,814: INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2020-09-07 21:59:31,814: INFO : Loading a fresh vocabulary
2020-09-07 21:59:32,861: INFO : effective_min_count=1 retains 253854 unique words (100% of original 253854, drops 0)
2020-09-07 21:59:32,865: INFO : effective_min_count=1 leaves 17005207 word corpus (100% of original 17005207, drops 0)
2020-09-07 21:59:34,426: INFO : deleting the raw counts dictionary of 253854 items
2020-09-07 21:59:34,438: INFO : sample=0.001 downsamples 36 most-common words
2020-09-07 21:59:34,440: INFO : downsampling leaves estimated 12819131 word corpus (75.4% of prior 17005207)
2020-09-07 21:59:35,613: INFO : estimated required memory for 253854 words and 20 dimensions: 167543640 bytes
2020-09-07 21:59:35,613: INFO : resetting 

2020-09-07 22:02:18,380: INFO : EPOCH 3 - PROGRESS: at 76.48% examples, 607840 words/s, in_qsize 5, out_qsize 0
2020-09-07 22:02:19,402: INFO : EPOCH 3 - PROGRESS: at 81.60% examples, 609502 words/s, in_qsize 6, out_qsize 1
2020-09-07 22:02:20,410: INFO : EPOCH 3 - PROGRESS: at 86.83% examples, 612508 words/s, in_qsize 6, out_qsize 0
2020-09-07 22:02:21,413: INFO : EPOCH 3 - PROGRESS: at 91.95% examples, 614731 words/s, in_qsize 5, out_qsize 0
2020-09-07 22:02:22,422: INFO : EPOCH 3 - PROGRESS: at 97.06% examples, 616313 words/s, in_qsize 6, out_qsize 0
2020-09-07 22:02:22,969: INFO : worker thread finished; awaiting finish of 2 more threads
2020-09-07 22:02:22,971: INFO : worker thread finished; awaiting finish of 1 more threads
2020-09-07 22:02:22,978: INFO : worker thread finished; awaiting finish of 0 more threads
2020-09-07 22:02:22,978: INFO : EPOCH - 3 : training on 17005207 raw words (12820503 effective words) took 20.8s, 617818 effective words/s
2020-09-07 22:02:24,006: INFO :

> We have now built a word2vec model from our own corpus and can use the word vectors it contains. Each word is represented by a 20-dimensional vector.

In [7]:
# Get the vector for the word "king"
model.wv['king']

array([-2.6589146 , -2.059375  , -0.8230269 ,  2.2753518 , -4.2195506 ,
        2.453612  ,  0.90548056,  6.1119876 ,  2.3973694 ,  0.6910609 ,
        1.8435425 ,  3.5594847 , -6.884309  ,  2.8693833 ,  0.28955057,
        5.24688   ,  0.10235706, -0.23963721, -2.4963558 ,  2.7566025 ],
      dtype=float32)

In [8]:
# Do vector arithmetic and check whether the result matches our expectation
# woman + king - man = queen
model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=10)

2020-09-07 22:03:03,398: INFO : precomputing L2-norms of word weight vectors


[('emperor', 0.9057022333145142),
 ('son', 0.8896150588989258),
 ('charlemagne', 0.8806945085525513),
 ('tsar', 0.8794891834259033),
 ('empress', 0.8761159181594849),
 ('pope', 0.8669358491897583),
 ('consul', 0.8635085821151733),
 ('prince', 0.8602508902549744),
 ('elector', 0.850213885307312),
 ('ruler', 0.8485316634178162)]
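The analogy query can be pictured as vector arithmetic followed by a nearest-neighbor search. The sketch below roughly mirrors that idea on hand-made 2-dimensional vectors: it averages the unit-normalized positive vectors and the negated negative vectors, then ranks the remaining words by cosine similarity. The toy vectors and the helper names `unit`, `cosine` and `analogy` are assumptions for illustration, not gensim's implementation.

```python
import math

# Hand-made toy vectors (assumption): axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "tree": [-0.8, 0.0],
}


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def analogy(positive, negative, topn=1):
    # Positive words count with weight +1, negative words with weight -1.
    terms = [(unit(vectors[w]), 1.0) for w in positive]
    terms += [(unit(vectors[w]), -1.0) for w in negative]
    dim = len(terms[0][0])
    query = [sum(wt * v[d] for v, wt in terms) / len(terms) for d in range(dim)]
    # The query words themselves are excluded from the candidates.
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


result = analogy(positive=["woman", "king"], negative=["man"])
print(result)
```

On these idealized vectors the top answer is `queen`. With real embeddings the answer depends entirely on what the corpus taught the model, as the output above shows.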

In [9]:
# London is to England as Paris is to ?
model.wv.most_similar(positive=['Paris', 'England'], negative=['London'], topn=1)

KeyError: "word 'Paris' not in vocabulary"

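> The king/queen query shows that this corpus did not fully teach the model the semantics we wanted. The London/Paris query fails because the token `Paris` is not in the vocabulary: lookups are case-sensitive, and text8 contains only lowercase text. This shows that __word embeddings are limited by the chosen corpus and by the computing resources used to train them__.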
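Because vocabulary lookups are case-sensitive, it can help to check membership before querying and to fall back to a lowercase form. The sketch below shows the idea on a plain Python set standing in for a model's vocabulary; `lookup_token` and `toy_vocab` are hypothetical names, and the lowercase fallback assumes a lowercased training corpus.

```python
def lookup_token(word, vocab):
    """Return the form of `word` that exists in `vocab`, or None if absent."""
    for candidate in (word, word.lower()):
        if candidate in vocab:
            return candidate
    return None


# Stand-in for the vocabulary of a model trained on lowercased text.
toy_vocab = {"paris", "london", "england"}

print(lookup_token("Paris", toy_vocab))  # falls back to the lowercase form
print(lookup_token("Tokyo", toy_vocab))  # not present in either form
```

A query word that comes back as `None` is genuinely missing from the corpus, and no amount of normalization will recover it.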

In [10]:
# Load the pretrained Google News vectors with gensim: 3 million words and phrases, each represented by a 300-dimensional vector
model = gensim.models.KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin',binary=True)

2020-09-07 22:04:21,924: INFO : loading projection weights from ./data/GoogleNews-vectors-negative300.bin
2020-09-07 22:05:01,986: INFO : loaded (3000000, 300) matrix from ./data/GoogleNews-vectors-negative300.bin


In [11]:
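# The vocabulary holds 3 million entries in total
# (`model` is already a KeyedVectors object, so the deprecated `.wv` indirection is dropped)
len(model.vocab)
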


3000000

In [12]:
# woman + king - man = queen
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

2020-09-07 22:05:02,013: INFO : precomputing L2-norms of word weight vectors


[('queen', 0.7118192911148071)]

In [13]:
# London is to England as Paris is to ?
model.most_similar(positive=['Paris', 'England'], negative=['London'], topn=1)



[('France', 0.667637825012207)]

In [14]:
# Pick the word that does not belong with the others
model.doesnt_match("duck bear cat tree".split())



'tree'
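One way to picture how the odd word is picked: normalize each word vector, average them, and return the word least similar to that average. The sketch below does this on hand-made 2-dimensional vectors; the values in `toy` and the helper `odd_one_out` are assumptions for illustration only.

```python
import math

# Hand-made toy vectors (assumption): the animals point one way, "tree" another.
toy = {
    "duck": [0.9, 0.1],
    "bear": [0.8, 0.3],
    "cat": [0.85, 0.2],
    "tree": [0.1, 0.9],
}


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(words):
    units = {w: unit(toy[w]) for w in words}
    dim = len(units[words[0]])
    # Direction of the group as a whole: the normalized mean of the unit vectors.
    mean = unit([sum(u[d] for u in units.values()) / len(units) for d in range(dim)])
    # The odd word is the one whose vector is least aligned with that mean.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


print(odd_one_out("duck bear cat tree".split()))
```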

In [15]:
# Similarity between "woman" and "man"
model.similarity('woman', 'man')



0.76640123

In [16]:
# Similarity between "tree" and "man"
model.similarity('tree', 'man')



0.22937459
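The similarity scores above are cosine similarities between the two word vectors. For vectors $u$ and $v$ (symbols introduced here for illustration):

$$
\operatorname{sim}(u, v) = \cos\theta = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}}
$$

As a small worked example, for $u = (1, 0)$ and $v = (1, 1)$ the dot product is $1$ and the norms are $1$ and $\sqrt{2}$, so $\operatorname{sim}(u, v) = 1/\sqrt{2} \approx 0.707$. The value ranges from $-1$ to $1$ and depends only on the angle between the vectors, not on their lengths.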

## Applying word embeddings: information retrieval

In [17]:
import numpy as np

In [18]:
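# Helper that returns a word's embedding, or None if the word is not in the vocabulary
def get_embedding(string):
    try:
        # `model` is a KeyedVectors object, so it can be indexed directly
        return model[string]
    except KeyError:
        return None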

In [19]:
# Article titles (the targets we want to search over)
sentence = [
    "this is about a dog",
    "this is about a cat",
    "this is about a nothing"
]

In [24]:
# Convert each sentence to a vector by averaging the embeddings of its words
# Start from a 3 x 300 zero matrix
vectorized_sentence = np.zeros((len(sentence), 300))
for i, text in enumerate(sentence):
    # Split the sentence into words
    words = text.split(' ')
    print(words)

    # Look up the embedding of every word and drop the words that have none
    embedded_words = [get_embedding(w) for w in words]
    embedded_words = list(filter(lambda x: x is not None, embedded_words))
    print(embedded_words)

> The loop variable is named `text` rather than `sentence` so that the list of titles is not overwritten. Reusing the name leaves `sentence` bound to the last title, and a rerun then iterates over its characters and prints `['g']`. The `None` entries also have to go through `filter(...)`: writing `lambda x: x is not None, embedded_words` on its own only builds a tuple of the function and the unfiltered list.
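The retrieval idea of this section, averaging word embeddings into a sentence vector and then ranking titles by cosine similarity to a query, can be sketched end to end with toy data. Everything below is an assumption for illustration: the 2-dimensional `toy_embeddings`, the helper names `sentence_vector`, `cosine` and `search`, and the choice to skip words that have no embedding.

```python
import math

# Toy 2-d embeddings (assumption). "this", "is", "about" and "a" are
# deliberately missing to stand in for out-of-vocabulary words.
toy_embeddings = {
    "dog": [0.9, 0.1],
    "puppy": [0.8, 0.2],
    "cat": [0.1, 0.9],
    "nothing": [-0.5, -0.5],
}


def sentence_vector(text):
    """Average the embeddings of the known words; None if no word is known."""
    known = [toy_embeddings[w] for w in text.split() if w in toy_embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


titles = ["this is about a dog", "this is about a cat", "this is about a nothing"]
title_vectors = [sentence_vector(t) for t in titles]


def search(query):
    """Return the title whose vector is most similar to the query vector."""
    q = sentence_vector(query)
    scored = [(t, cosine(q, v)) for t, v in zip(titles, title_vectors)]
    return max(scored, key=lambda pair: pair[1])[0]


best = search("puppy")
print(best)
```

The query `puppy` never appears in any title, yet the dog title ranks first because the two word vectors point in nearly the same direction. That is the advantage of embedding-based retrieval over exact keyword matching.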
