# word2vec skip-gram model

A tiny corpus:

In [2]:
corpus = [
    'he is a king',
    'she is a queen',
    'he is a man',
    'she is a woman',
    'warsaw is poland capital',
    'berlin is germany capital',
    'paris is france capital',
]

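Building the vocabulary

This step normally covers case normalization, removal of punctuation, and tokenization. The corpus here is already lowercase and free of punctuation, so splitting on whitespace is enough.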

In [3]:
def tokenize_corpus(corpus):
    tokens = [x.split() for x in corpus]
    return tokens

tokenized_corpus = tokenize_corpus(corpus)

In [4]:
tokenized_corpus 

[['he', 'is', 'a', 'king'],
 ['she', 'is', 'a', 'queen'],
 ['he', 'is', 'a', 'man'],
 ['she', 'is', 'a', 'woman'],
 ['warsaw', 'is', 'poland', 'capital'],
 ['berlin', 'is', 'germany', 'capital'],
 ['paris', 'is', 'france', 'capital']]
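For text that is not already clean, the normalization steps mentioned above could look roughly like the sketch below. The function name `tokenize_with_normalization` and the rule of keeping only letters, digits, and whitespace are assumptions made for illustration; they are not part of the original notebook.

```python
import re

def tokenize_with_normalization(corpus):
    """Lowercase each sentence, drop punctuation, then split on whitespace."""
    tokens = []
    for sentence in corpus:
        # Illustrative rule (an assumption): keep only a-z, 0-9, and whitespace.
        cleaned = re.sub(r"[^a-z0-9\s]", "", sentence.lower())
        tokens.append(cleaned.split())
    return tokens

print(tokenize_with_normalization(["He is a King!", "Berlin, is Germany capital."]))
```

On the corpus used here this gives the same result as `tokenize_corpus`, since there is nothing to normalize.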

Next, build the mapping between words and indices.

In [5]:
vocabulary = []
for sentence in tokenized_corpus:
    for token in sentence:
        if token not in vocabulary:
            vocabulary.append(token)

word2idx = {w: idx for (idx, w) in enumerate(vocabulary)}
idx2word = {idx: w for (idx, w) in enumerate(vocabulary)}

vocabulary_size = len(vocabulary)

In [6]:
word2idx

{'he': 0,
 'is': 1,
 'a': 2,
 'king': 3,
 'she': 4,
 'queen': 5,
 'man': 6,
 'woman': 7,
 'warsaw': 8,
 'poland': 9,
 'capital': 10,
 'berlin': 11,
 'germany': 12,
 'paris': 13,
 'france': 14}

In [6]:
vocabulary

['he',
 'is',
 'a',
 'king',
 'she',
 'queen',
 'man',
 'woman',
 'warsaw',
 'poland',
 'capital',
 'berlin',
 'germany',
 'paris',
 'france']

In [8]:
import numpy as np

window_size = 2
idx_pairs = []
# for each sentence
for sentence in tokenized_corpus:
    indices = [word2idx[word] for word in sentence]
    # for each word, treated as the center word
    for center_word_pos in range(len(indices)):
        # for each window position
        for w in range(-window_size, window_size + 1):
            context_word_pos = center_word_pos + w
            # make sure the context position stays inside the sentence and is not the center itself
            if context_word_pos < 0 or context_word_pos >= len(indices) or center_word_pos == context_word_pos:
                continue
            context_word_idx = indices[context_word_pos]
            idx_pairs.append((indices[center_word_pos], context_word_idx))

idx_pairs = np.array(idx_pairs) # it will be useful to have this as numpy array

In [9]:
idx_pairs

array([[ 0,  1],
       [ 0,  2],
       [ 1,  0],
       [ 1,  2],
       [ 1,  3],
       [ 2,  0],
       [ 2,  1],
       [ 2,  3],
       [ 3,  1],
       [ 3,  2],
       [ 4,  1],
       [ 4,  2],
       [ 1,  4],
       [ 1,  2],
       [ 1,  5],
       [ 2,  4],
       [ 2,  1],
       [ 2,  5],
       [ 5,  1],
       [ 5,  2],
       [ 0,  1],
       [ 0,  2],
       [ 1,  0],
       [ 1,  2],
       [ 1,  6],
       [ 2,  0],
       [ 2,  1],
       [ 2,  6],
       [ 6,  1],
       [ 6,  2],
       [ 4,  1],
       [ 4,  2],
       [ 1,  4],
       [ 1,  2],
       [ 1,  7],
       [ 2,  4],
       [ 2,  1],
       [ 2,  7],
       [ 7,  1],
       [ 7,  2],
       [ 8,  1],
       [ 8,  9],
       [ 1,  8],
       [ 1,  9],
       [ 1, 10],
       [ 9,  8],
       [ 9,  1],
       [ 9, 10],
       [10,  1],
       [10,  9],
       [11,  1],
       [11, 12],
       [ 1, 11],
       [ 1, 12],
       [ 1, 10],
       [12, 11],
       [12,  1],
       [12, 10],
       [10,  1

Read as (center word, context word), the first few pairs are:

“
he is

he a

is he

is a

is king

a he

a is

a king
”
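The word-level pairs for the first sentence can also be generated directly from the tokens. This is a minimal sketch that repeats the windowing logic on words instead of indices; the variable names are illustrative.

```python
sentence = ['he', 'is', 'a', 'king']
window_size = 2

pairs = []
for center_pos, center in enumerate(sentence):
    for offset in range(-window_size, window_size + 1):
        context_pos = center_pos + offset
        # skip the center word itself and positions outside the sentence
        if offset == 0 or context_pos < 0 or context_pos >= len(sentence):
            continue
        pairs.append((center, sentence[context_pos]))

for center, context in pairs:
    print(center, context)
```

With a window of 2, this four-word sentence yields 10 pairs; the first eight are the ones listed above, followed by `king is` and `king a`.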

![](1_uYiqfNrUIzkdMrmkBWGMPw.webp)

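Defining the objective function is the next key step.

Skip-gram models the probability of a surrounding (context) word given the center word: $P(context \mid center; \theta)$.

We want to maximize this probability over all center/context pairs: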

$\max_\theta \prod_{center}\prod_{context} P(context \mid center; \theta)$

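A product of many probabilities is awkward to compute, so we take the logarithm, which turns the product into a sum, and negate it so that the problem becomes a minimization.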

$\min_\theta\ -\log \prod_{center}\prod_{context} P(context \mid center; \theta)$

$loss = -\frac{1}{T} \sum_{center}\sum_{context} \log P(context \mid center; \theta)$
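A small numerical sketch shows why the sum of logarithms is easier to work with than the raw product. The probabilities are made-up values chosen only for illustration, and taking $T$ to be the number of pairs is an assumption of this example.

```python
import numpy as np

# Hypothetical per-pair probabilities (illustrative values, not from a trained model).
probs = np.full(2000, 0.01)

# The direct product is far below the smallest representable float64 and underflows.
product = np.prod(probs)

# The sum of logs stays finite and carries the same information.
log_likelihood = np.sum(np.log(probs))

# Average negative log-likelihood, assuming T = number of pairs.
loss = -log_likelihood / len(probs)

print(product, log_likelihood, loss)
```

Because the logarithm is monotonic, maximizing the product and minimizing the negative log sum select the same $\theta$.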

The remaining question is how to define $P(context \mid center)$.

$P(context \mid center) = \frac{\exp(u_{context}^T v_{center})}{\sum_{\omega \in vocab} \exp(u_{\omega}^T v_{center})}$

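This looks complicated, so let's break it down step by step.

First, the fraction as a whole is a softmax function, which keeps every probability between 0 and 1.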

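The argument of exp in the numerator is a dot product of two vectors. It acts as a similarity score: it equals the cosine similarity scaled by the lengths of the two vectors, so the more closely the vectors are aligned, the larger the product.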
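The dot product and cosine similarity are related but not identical: the dot product also grows with vector length. A short sketch with arbitrary example vectors (the helper `cosine` is defined here for illustration):

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([2.0, 4.0])   # same direction as u, twice the length

def cosine(a, b):
    """Cosine similarity: the dot product divided by both vector norms."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

dot_uv = np.dot(u, v)
dot_u3v = np.dot(u, 3 * v)   # scaling v scales the dot product...
cos_uv = cosine(u, v)
cos_u3v = cosine(u, 3 * v)   # ...but leaves the cosine unchanged (up to rounding)

print(dot_uv, dot_u3v, cos_uv, cos_u3v)
```

In the skip-gram formula the unnormalized dot product is used, so both direction and length of the learned vectors influence the score.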

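In the denominator, the sum runs over every word in the vocabulary: for the given center word, it adds up the exponentiated similarity between that center word and each vocabulary word.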

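To summarize: “For each center-context pair that exists in the corpus, we compute their "similarity score". Then we divide it by the sum over every theoretically possible context, to see whether the score is relatively high or low. Since softmax is guaranteed to take values between 0 and 1, it defines a valid probability distribution.”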
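Putting the pieces together, the probability and the loss can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the notebook's training code: the matrices `U` and `V` are randomly initialized rather than learned, the embedding size of 5 is arbitrary, and the loss is averaged over the number of pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
vocabulary_size, embedding_dim = 15, 5   # 15 words as in the vocabulary above; 5 is arbitrary

V = rng.normal(size=(vocabulary_size, embedding_dim))  # center-word vectors v (random, untrained)
U = rng.normal(size=(vocabulary_size, embedding_dim))  # context-word vectors u (random, untrained)

def context_probabilities(center_idx):
    """Softmax over the scores u_w^T v_center for every word w in the vocabulary."""
    scores = U @ V[center_idx]
    scores = scores - scores.max()   # shift for numerical stability; softmax is unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# First (center, context) index pairs from the sentence 'he is a king'.
pairs = np.array([[0, 1], [0, 2], [1, 0], [1, 2], [1, 3]])

probs = context_probabilities(0)
# Average negative log-probability of the observed context words.
loss = -np.mean([np.log(context_probabilities(c)[o]) for c, o in pairs])

print(probs.sum(), loss)
```

Training would adjust `U` and `V` to lower this loss, which raises the probability assigned to the context words that actually occur.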