# Word2Vec

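Word2Vec is not a single algorithm but a family of model architectures and optimizations that can be used to learn word embeddings from large datasets. Embeddings learned with Word2Vec have proven successful on a variety of downstream natural language processing tasks.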

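The Word2Vec presented here is based on the following two papers:

- Efficient Estimation of Word Representations in Vector Space
    - https://arxiv.org/pdf/1301.3781.pdf
- Distributed Representations of Words and Phrases and their Compositionality
    - https://papers.nips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf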

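These two papers propose two methods for learning word representations:

- Continuous Bag-of-Words Model (CBOW)
    - Predicts the middle word from the surrounding context words. The context consists of a few words before and after the current (middle) word. The architecture is called a "bag-of-words" model because the order of the words in the context does not matter
- Continuous Skip-gram Model (Skip-gram)
    - Given the current word, predicts the words within a certain range before and after it in the same sentence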
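
To make the contrast concrete, the sketch below builds the training examples each model would see for a single center word. It is a plain-Python illustration, not code from either paper; the helper names `context_window`, `cbow_example`, and `skip_gram_examples` and the window size of 2 are assumptions chosen for this example.

```python
def context_window(tokens, i, window_size):
    """Words within `window_size` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window_size):i]
    right = tokens[i + 1:i + 1 + window_size]
    return left + right


def cbow_example(tokens, i, window_size):
    """CBOW: one example per position, (context words) -> center word."""
    return context_window(tokens, i, window_size), tokens[i]


def skip_gram_examples(tokens, i, window_size):
    """Skip-gram: one example per context word, center word -> context word."""
    return [(tokens[i], ctx) for ctx in context_window(tokens, i, window_size)]


tokens = "the wide rode shimmered in the hot sun".split()
center = 3  # position of "shimmered"

cbow_inputs, cbow_label = cbow_example(tokens, center, window_size=2)
skip_pairs = skip_gram_examples(tokens, center, window_size=2)

print(cbow_inputs, "->", cbow_label)
print(skip_pairs)
```

CBOW produces one example whose input is the whole context, while Skip-gram produces one (center, context) pair per context word.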

In [5]:
import io
import itertools
import numpy as np
import os
import re
import string
import tqdm
import tensorflow as tf
from tensorflow.keras import Model, Sequential, layers
from IPython.core.interactiveshell import InteractiveShell
# Show the value of every top-level expression in a cell, not only the last one.
InteractiveShell.ast_node_interactivity = "all"

In [6]:
SEED = 42
AUTOTUNE = tf.data.AUTOTUNE

# 1. Simple text vectorization

## 1.1 Tokenization

In [8]:
sentence = "The wide rode shimmered in the hot sun"
# Lowercase and split on whitespace; str.split() already returns a list.
tokens = sentence.lower().split()
tokens

['the', 'wide', 'rode', 'shimmered', 'in', 'the', 'hot', 'sun']

## 1.2 Building the vocabulary

### 1.2.1 Building the token-to-index vocabulary

In [16]:
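# Reserve index 0 for the padding token.
vocab = {
    "<pad>": 0
}
index = 1
# Assign the next free index to each token the first time it appears.
for token in tokens:
    if token not in vocab:
        vocab[token] = index
        index += 1
vocab
vocab_size = len(vocab)
vocab_size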

{'<pad>': 0,
 'the': 1,
 'wide': 2,
 'rode': 3,
 'shimmered': 4,
 'in': 5,
 'hot': 6,
 'sun': 7}

8
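
The loop above assigns indices in order of first appearance. The log-uniform candidate sampler used later in this notebook is described in the TensorFlow documentation as suited to classes sorted in decreasing order of frequency, so for a larger corpus a frequency-sorted vocabulary may be preferable. The sketch below shows one way to build it; the helper name `build_vocab_by_frequency` is hypothetical and not part of the notebook.

```python
from collections import Counter


def build_vocab_by_frequency(tokens, pad_token="<pad>"):
    """Map tokens to indices, most frequent first; index 0 is reserved for padding."""
    counts = Counter(tokens)
    vocab = {pad_token: 0}
    # most_common() sorts by descending count; ties keep first-seen order.
    for index, (token, _) in enumerate(counts.most_common(), start=1):
        vocab[token] = index
    return vocab


tokens = "the wide rode shimmered in the hot sun".split()
freq_vocab = build_vocab_by_frequency(tokens)
print(freq_vocab)
```

For this one-sentence example the result coincides with the first-appearance vocabulary, because "the" is both the first and the most frequent token.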

### 1.2.2 Building the inverse vocabulary

In [14]:
inverse_vocab = {
    index: token for token, index in vocab.items()
}
inverse_vocab

{0: '<pad>',
 1: 'the',
 2: 'wide',
 3: 'rode',
 4: 'shimmered',
 5: 'in',
 6: 'hot',
 7: 'sun'}

## 1.3 Vectorizing the text

In [15]:
example_sequence = [vocab[word] for word in tokens]
example_sequence

[1, 2, 3, 4, 5, 1, 6, 7]
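
As a quick sanity check, the sketch below maps the sequence back to tokens with the inverse vocabulary and pads it to a fixed length using index 0, the `<pad>` entry. The helper `pad_sequence` and the target length of 10 are assumptions for illustration; the notebook itself does not pad this sentence.

```python
vocab = {"<pad>": 0, "the": 1, "wide": 2, "rode": 3,
         "shimmered": 4, "in": 5, "hot": 6, "sun": 7}
inverse_vocab = {index: token for token, index in vocab.items()}

tokens = "the wide rode shimmered in the hot sun".split()
example_sequence = [vocab[word] for word in tokens]

# Decoding with the inverse vocabulary should recover the original tokens.
decoded = [inverse_vocab[index] for index in example_sequence]


def pad_sequence(sequence, length, pad_index=0):
    """Right-pad (or truncate) a list of indices to exactly `length` items."""
    return (sequence + [pad_index] * length)[:length]


padded = pad_sequence(example_sequence, length=10)
print(decoded == tokens)
print(padded)
```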

# 2. Generating skip-grams from text

In [17]:
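window_size = 2
# negative_samples = 0 returns positive (target, context) pairs only;
# the second return value holds the labels and is ignored here.
positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(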
    example_sequence,
    vocabulary_size = vocab_size,
    window_size = window_size,
    negative_samples = 0
)
positive_skip_grams

[[5, 1],
 [1, 3],
 [6, 7],
 [4, 1],
 [5, 4],
 [3, 4],
 [6, 1],
 [3, 5],
 [1, 7],
 [1, 4],
 [3, 2],
 [6, 5],
 [7, 6],
 [1, 2],
 [1, 6],
 [2, 4],
 [3, 1],
 [2, 1],
 [7, 1],
 [4, 2],
 [5, 3],
 [2, 3],
 [1, 5],
 [4, 3],
 [5, 6],
 [4, 5]]

In [19]:
for target, context in positive_skip_grams:
    print(f"({target}, {context}): ({inverse_vocab[target]}, {inverse_vocab[context]})")

(5, 1): (in, the)
(1, 3): (the, rode)
(6, 7): (hot, sun)
(4, 1): (shimmered, the)
(5, 4): (in, shimmered)
(3, 4): (rode, shimmered)
(6, 1): (hot, the)
(3, 5): (rode, in)
(1, 7): (the, sun)
(1, 4): (the, shimmered)
(3, 2): (rode, wide)
(6, 5): (hot, in)
(7, 6): (sun, hot)
(1, 2): (the, wide)
(1, 6): (the, hot)
(2, 4): (wide, shimmered)
(3, 1): (rode, the)
(2, 1): (wide, the)
(7, 1): (sun, the)
(4, 2): (shimmered, wide)
(5, 3): (in, rode)
(2, 3): (wide, rode)
(1, 5): (the, in)
(4, 3): (shimmered, rode)
(5, 6): (in, hot)
(4, 5): (shimmered, in)
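
The pairs above appear in shuffled order because `skipgrams` shuffles its output by default. The following plain-Python sketch enumerates the same sliding window in sequence order, which can make the pattern easier to see. The function name `sliding_window_pairs` is hypothetical, and the sketch covers only positive pairs; it does not reproduce the sampling-table or negative-sampling options of the Keras function.

```python
def sliding_window_pairs(sequence, window_size):
    """Return (target, context) index pairs within `window_size` of each position."""
    pairs = []
    for i, target in enumerate(sequence):
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, sequence[j]))
    return pairs


example_sequence = [1, 2, 3, 4, 5, 1, 6, 7]
pairs = sliding_window_pairs(example_sequence, window_size=2)
print(len(pairs))
print(pairs[:5])
```

For this sequence the sketch yields 26 pairs, the same set as the shuffled output above.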


In [20]:
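# Take the first positive pair as the example.
target_word, context_word = positive_skip_grams[0]
# Number of negative samples to draw per positive context word.
num_ns = 4
# The sampler expects true_classes with shape (batch_size, num_true).
context_class = tf.reshape(tf.constant(context_word, dtype = "int64"), (1, 1))
negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes = context_class,  # the positive context word
    num_true = 1,                  # one true class per example
    num_sampled = num_ns,          # how many negatives to draw
    unique = True,                 # sample without replacement
    range_max = vocab_size,        # sample indices from [0, vocab_size)
    seed = SEED,
    name = "negative_sampling",
)
negative_sampling_candidates
[inverse_vocab[index.numpy()] for index in negative_sampling_candidates]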

<tf.Tensor: shape=(4,), dtype=int64, numpy=array([2, 1, 4, 3])>

['wide', 'the', 'shimmered', 'rode']
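
Two details are worth spelling out. First, per the TensorFlow documentation the sampler draws class `c` with probability `(log(c + 2) - log(c + 1)) / log(range_max + 1)`, so low indices are favored. The sampler also does not exclude the true class, which is consistent with `the` (index 1, the positive context word here) appearing among the negatives above. Second, a common next step, assumed here rather than shown in this section, is to concatenate the positive context word with the negatives and attach a 1/0 label vector. The sketch uses plain Python and hard-codes the values printed above.

```python
import math

range_max = 8  # vocab_size in the cells above


def log_uniform_prob(c, range_max):
    """Probability of class c under the log-uniform (Zipfian) base distribution."""
    return (math.log(c + 2) - math.log(c + 1)) / math.log(range_max + 1)


probs = [log_uniform_prob(c, range_max) for c in range(range_max)]
print([round(p, 3) for p in probs])

# Values copied from the outputs above: the pair (in, the) and its negatives.
target_word, context_word = 5, 1
negative_samples = [2, 1, 4, 3]

# Positive context first, then the negatives; label 1 marks the true context.
context = [context_word] + negative_samples
label = [1] + [0] * len(negative_samples)
print(target_word, context, label)
```

Because the true class can reappear as a negative, a real pipeline may want to filter or tolerate such accidental hits; which is appropriate depends on the training setup.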