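# Word Vectors

So far we have represented a single word with **one-hot encoding**, a group of related words with a **bag of words**, and each word more efficiently with an **embedding**.

The core idea of an embedding is to represent a word by the weights that a network has learned for it during training.

As parameters of the network, the weights can be seen as the "internal rules" that the model extracts from large amounts of data. Because a one-hot vector has exactly one element set to 1, each word only ever touches one slice of the weight matrix during training, so whatever the model learns about that word is stored in that slice. We therefore treat the slice as carrying information specific to the word and use it to represent the word.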
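The claim that a one-hot input selects a single weight slice can be checked directly. The sketch below is only an illustration: the sizes and the random matrix are hypothetical, and the matrix is laid out as (vocabulary, embedding), so the slice is a row.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
vocabulary_size, embedding_size = 5, 3
rng = np.random.default_rng(0)
weight = rng.normal(size=(vocabulary_size, embedding_size))

token = 2                           # index of some word in the vocabulary
onehot = np.zeros(vocabulary_size)
onehot[token] = 1

# Multiplying by a one-hot vector keeps exactly one row of the matrix ...
by_matmul = onehot @ weight
# ... which is the same as reading that row directly.
by_lookup = weight[token]

print(np.allclose(by_matmul, by_lookup))
```

This is why an embedding layer can skip the matrix product and index the weights by token instead.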

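To use an analogy, one-hot encoding is like giving every flower a name. Peony or rose, each name is unique and lets us tell the flowers apart. Embedding weights are like a short profile written for each flower after studying it: the profile not only distinguishes the flowers but also tells us what each one is like.

A weight slice that carries a word's information in this way is called a **word vector**.

In the previous chapter we already trained word vectors one way. The labels were the viewer's attitude toward a bag of words (a movie review): like or dislike. What that training could learn was therefore the positive or negative sentiment each word expresses, and from the resulting vectors we could pull out synonyms and antonyms.

Now we will try a different way of training word vectors, one in which the labels are themselves words.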

---

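In 2013, a team at Google led by Tomas Mikolov released Word2Vec. It builds on the **distributional hypothesis**: words that appear in similar contexts tend to have similar meanings.

The core idea of **Word2Vec** is to train on text with text, with the goal of predicting words from words:

* **Continuous bag-of-words** (CBOW): predict the center word from its context;
* **Skip-gram**: predict the context from the center word.

Concretely, suppose one sample in the dataset is the sentence “老鼠爱大米” ("mice love rice"), processed one character at a time. Then:

* **CBOW** uses the bag of the four characters “老”, “鼠”, “大”, “米” (as one-hot encodings) as the input and the character “爱” (as a one-hot encoding) as the label;
* **Skip-gram** uses the character “爱” (as a one-hot encoding) as the input and each of the four characters “老”, “鼠”, “大”, “米” (as one-hot encodings) in turn as the label.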

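Figuratively speaking:

* **CBOW** trains the network to do fill-in-the-blank exercises;
* **Skip-gram** trains the model to branch out from a single word, like brainstorming or drawing a mind map.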
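The two strategies differ only in how (input, label) pairs are cut from the same window. The sketch below builds both kinds of pairs for the “老鼠爱大米” example; `cbow_pairs`, `skipgram_pairs`, and the `window` parameter are hypothetical helper names used for illustration, not part of the code in this chapter.

```python
def cbow_pairs(tokens, window=2):
    """One (context, center) pair per position that has a full window."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs


def skipgram_pairs(tokens, window=2):
    """One (center, context word) pair per word in each context."""
    pairs = []
    for context, center in cbow_pairs(tokens, window):
        pairs.extend((center, word) for word in context)
    return pairs


tokens = list("老鼠爱大米")      # split the sentence into characters
print(cbow_pairs(tokens))       # one sample: four context characters -> center
print(skipgram_pairs(tokens))   # four samples: center -> one context character
```

With a five-character sentence and two characters of context on each side, CBOW yields a single sample while skip-gram yields four.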

In [1]:
import csv
import math
import re
from abc import abstractmethod, ABC

import numpy as np

np.random.seed(99)

## Infrastructure

### Tensor

In [2]:
class Tensor:

    def __init__(self, data):
        self.data = np.array(data)
        self.grad = np.zeros_like(self.data)
        self.gradient_fn = lambda: None
        self.parents = set()

    def backward(self):
        if self.gradient_fn:
            self.gradient_fn()

        for p in self.parents:
            p.backward()

    @property
    def size(self):
        return np.prod(self.data.shape[1:])

    def __repr__(self):
        return f'Tensor({self.data})'

### Base Dataset

In [3]:
class Dataset(ABC):

    def __init__(self, batch_size=1):
        self.batch_size = batch_size

        self.test_labels = self.test_features = None
        self.train_labels = self.train_features = None

        self.load()
        self.train()

    @abstractmethod
    def load(self):
        pass

    def train(self):
        self.features = self.train_features
        self.labels = self.train_labels

    def eval(self):
        self.features = self.test_features
        self.labels = self.test_labels

    def shape(self):
        return Tensor(self.features).size, Tensor(self.labels).size

    def items(self):
        return Tensor(self.features), Tensor(self.labels)

    def __len__(self):
        return len(self.features) // self.batch_size

    def __getitem__(self, index):
        start = index * self.batch_size
        end = start + self.batch_size

        feature = Tensor(self.features[start: end])
        label = Tensor(self.labels[start: end])
        return feature, label

    def estimate(self, predictions):
        pass

### Base Layer

In [4]:
class Layer(ABC):

    def __init__(self):
        self.training = True

    def __call__(self, x: Tensor):
        return self.forward(x)

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

    @abstractmethod
    def forward(self, x: Tensor):
        pass

    @property
    def parameters(self):
        return []

    def __repr__(self):
        return ''

### Base Loss Function

In [5]:
class Loss(ABC):

    def __call__(self, p: Tensor, y: Tensor):
        return self.loss(p, y)

    @abstractmethod
    def loss(self, p: Tensor, y: Tensor):
        pass

### Base Optimizer

In [6]:
class Optimizer(ABC):

    def __init__(self, parameters, lr):
        self.parameters = parameters
        self.lr = lr

    def reset(self):
        for p in self.parameters:
            p.grad = np.zeros_like(p.data)

    @abstractmethod
    def step(self):
        pass

### Base Model

In [7]:
class Model(ABC):

    def __init__(self, layer, loss_fn, optimizer):
        self.layer = layer
        self.loss_fn = loss_fn
        self.optimizer = optimizer

    @abstractmethod
    def train(self, dataset, epochs):
        pass

    @abstractmethod
    def test(self, dataset):
        pass

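## Data

### IMDB Dataset

We keep using the same IMDB dataset to train Word2Vec word vectors, but this time we only need the review text.

We will implement a network that follows the **continuous bag-of-words** (CBOW) strategy.

The training set is built as follows: over each review we slide a window of 5 words (`sequence_length`). The one-hot encoding of the middle word is the label, and the words on either side form a bag of words that serves as the feature.

The model's prediction is therefore no longer a single value but $n$ values, as many as a one-hot encoding has, where $n$ is the vocabulary size.

We have changed the **evaluation function** (`estimate`) accordingly: it compares the index of the largest value in the prediction with the index of the 1 in the one-hot label.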
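The windowing described above can be sketched on plain words before any encoding is involved. `windows` is a hypothetical helper written for this illustration; it mirrors the slicing logic of `create_sequence` but keeps the words as strings.

```python
def windows(words, sequence_length=5):
    """Slide a window over the words; the middle word is the label."""
    center = sequence_length // 2
    samples = []
    for start in range(len(words) - sequence_length + 1):
        window = words[start:start + sequence_length]
        context = window[:center] + window[center + 1:]
        samples.append((context, window[center]))
    return samples


words = "this movie was excellent i loved".split()
for context, label in windows(words):
    print(context, '->', label)
```

A line of `len(words)` words produces `len(words) - sequence_length + 1` samples, so lines shorter than the window contribute nothing.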

In [8]:
class IMDBDataset(Dataset):

    def __init__(self, filename, sequence_length=5):
        self.filename = filename
        self.sequence_length = sequence_length
        super().__init__()

    def load(self):
        self.reviews = []
        self.sentiments = []
        with open(self.filename, 'r', encoding='utf-8') as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row
            for row in reader:
                self.reviews.append(row[0])
                self.sentiments.append(row[1])

        split_reviews = []
        for line in self.reviews:
            split_reviews.append(self.clean_text(line.lower()).split())

        self.vocabulary = set(word for line in split_reviews for word in line)
        self.word2index = {word: index for index, word in enumerate(self.vocabulary)}
        self.index2word = {index: word for index, word in enumerate(self.vocabulary)}
        self.tokens = [[self.word2index[word] for word in line if word in self.word2index] for line in split_reviews]

        self.train_features, self.train_labels = self.create_sequence(self.tokens[:-20])
        self.test_features, self.test_labels = self.create_sequence(self.tokens[-20:])

    @staticmethod
    def clean_text(text):
        text = re.sub(r'<[^>]+>', '', text)
        text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
        return text

    def create_sequence(self, tokens):
        predict_position = self.sequence_length // 2
        features, labels = [], []
        for line in tokens:
            for index in range(len(line) - self.sequence_length + 1):
                feature = []
                for i in range(self.sequence_length):
                    if i != predict_position:
                        feature.append(line[index + i])
                features.append(feature)
                labels.append(self.onehot(line[index + predict_position]))
        return features, labels


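    def encode(self, text):
        words = self.clean_text(text.lower()).split()
        # Skip words that are not in the vocabulary instead of raising KeyError.
        return [self.word2index[word] for word in words if word in self.word2index]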

    def decode(self, tokens):
        return " ".join([self.index2word[index] for index in tokens])

    def onehot(self, token):
        ebd = np.zeros(len(self.vocabulary))
        ebd[token] = 1
        return ebd

    @staticmethod
    def argmax(vector):
        return np.argmax(vector)

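    def estimate(self, predictions):
        # Accuracy: share of samples whose highest-scoring index equals the label's index.
        # Assumes batch_size == 1, so prediction i corresponds to label i.
        count = 0
        for i in range(len(predictions)):
            if self.argmax(predictions[i].data[0]) == self.argmax(self.labels[i]):
                count += 1
        return count / len(predictions)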

## Model

### Linear Layer

In [9]:
class Linear(Layer):

    def __init__(self, in_size, out_size):
        super().__init__()
        self.weight = Tensor(np.random.randn(out_size, in_size) * np.sqrt(2 / in_size))
        self.bias = Tensor(np.zeros(out_size))

    def forward(self, x: Tensor):
        p = Tensor(x.data @ self.weight.data.T + self.bias.data)

        def gradient_fn():
            self.weight.grad += p.grad.T @ x.data
            self.bias.grad += np.sum(p.grad, axis=0)
            x.grad += p.grad @ self.weight.data

        p.gradient_fn = gradient_fn
        p.parents = {x}
        return p

    @property
    def parameters(self):
        return [self.weight, self.bias]

    def __repr__(self):
        return f'Linear[weight{self.weight.data.shape}; bias{self.bias.data.shape}]'

### Sequential Layer

In [10]:
class Sequential(Layer):

    def __init__(self, layers):
        super().__init__()
        self.layers = layers

    def train(self):
        for l in self.layers:
            l.train()

    def eval(self):
        for l in self.layers:
            l.eval()

    def forward(self, x: Tensor):
        for l in self.layers:
            x = l(x)
        return x

    @property
    def parameters(self):
        return [p for l in self.layers for p in l.parameters]

    def __repr__(self):
        return '\n'.join(str(l) for l in self.layers if str(l))

### Embedding Layer

In [11]:
class Embedding(Layer):

    def __init__(self, vocabulary_size, embedding_size):
        super().__init__()
        self.vocabulary_size = vocabulary_size
        self.embedding_size = embedding_size

        self.weight = Tensor(np.random.randn(embedding_size, vocabulary_size) * np.sqrt(2 / vocabulary_size))

    def forward(self, x: Tensor):
        p = Tensor(np.mean(self.weight.data.T[x.data], axis=1))

        def gradient_fn():
            if type(self.weight.grad) is not np.ndarray:
                self.weight.grad = np.zeros_like(self.weight.data)
            self.weight.grad.T[x.data] += p.grad / len(x.data[-1])

        p.gradient_fn = gradient_fn
        p.parents = {self.weight}
        return p

    @property
    def parameters(self):
        return [self.weight]

    def __repr__(self):
        return f'Embedding[weight{self.weight.data.shape}; vocabulary={self.vocabulary_size}; embedding={self.embedding_size}]'
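One possible caveat about the gradient step in this layer: with NumPy fancy indexing, `+=` applies only one update per distinct index, so a word that occurs twice in the same context window (for example "the story effect the") receives a single contribution. The sketch below uses hypothetical toy arrays to show the difference between `+=` and `np.add.at`; whether the layer should switch is a design choice, and this is not a drop-in replacement for the code above.

```python
import numpy as np

grad = np.zeros((4, 2))          # hypothetical (vocabulary, embedding) gradient
index = np.array([1, 3, 1])      # token 1 appears twice in the context
update = np.ones((3, 2))         # one update row per context position

buffered = grad.copy()
buffered[index] += update        # the repeated index 1 is updated only once

accumulated = grad.copy()
np.add.at(accumulated, index, update)   # the repeated index 1 is updated twice

print(buffered[1], accumulated[1])
```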

### Sigmoid Activation

In [12]:
class Sigmoid(Layer):

    def __init__(self, clip_range=(-100, 100)):
        super().__init__()
        self.clip_range = clip_range

    def forward(self, x: Tensor):
        z = np.clip(x.data, self.clip_range[0], self.clip_range[1])
        p = Tensor(1 / (1 + np.exp(-z)))

        def gradient_fn():
            x.grad += p.grad * p.data * (1 - p.data)

        p.gradient_fn = gradient_fn
        p.parents = {x}
        return p

    def __repr__(self):
        return f'Sigmoid[]'

### Softmax Activation

In [13]:
class Softmax(Layer):

    def __init__(self, axis=-1):
        super().__init__()
        self.axis = axis

    def forward(self, x: Tensor):
        exp = np.exp(x.data - np.max(x.data, axis=self.axis, keepdims=True))
        p = Tensor(exp / np.sum(exp, axis=self.axis, keepdims=True))

        def gradient_fn():
            grad = np.sum(p.data * p.grad, axis=self.axis, keepdims=True)
            x.grad += p.data * (p.grad - grad)

        p.gradient_fn = gradient_fn
        p.parents = {x}
        return p

    def __repr__(self):
        return f'Softmax[]'

### Loss Function (Cross-Entropy)

In [14]:
class CELoss(Loss):

    def loss(self, p: Tensor, y: Tensor):
        exp = np.exp(p.data - np.max(p.data, axis=-1, keepdims=True))
        softmax = exp / np.sum(exp, axis=-1, keepdims=True)

        log = np.log(np.clip(softmax, 1e-10, 1))
        ce = Tensor(0 - np.sum(y.data * log) / len(y.data))

        def gradient_fn():
            p.grad += (softmax - y.data) / len(y.data)

        ce.gradient_fn = gradient_fn
        ce.parents = {p}
        return ce

### Loss Function (Binary Cross-Entropy)

In [15]:
class BCELoss(Loss):

    def loss(self, p: Tensor, y: Tensor):
        clipped = np.clip(p.data, 1e-7, 1 - 1e-7)
        bce = Tensor(-np.mean(y.data * np.log(clipped) + (1 - y.data) * np.log(1 - clipped)))

        def gradient_fn():
            p.grad += (clipped - y.data) / (clipped * (1 - clipped)) / len(y.data)

        bce.gradient_fn = gradient_fn
        bce.parents = {p}
        return bce

### Optimizer (Stochastic Gradient Descent)

In [16]:
class SGDOptimizer(Optimizer):

    def step(self):
        for p in self.parameters:
            p.data -= p.grad * self.lr

### Neural Network Model

In [17]:
class WBModel(Model):

    def train(self, dataset, epochs):
        self.layer.train()
        dataset.train()

        for epoch in range(epochs):
            for i in range(len(dataset)):
                features, labels = dataset[i]

                predictions = self.layer(features)
                loss = self.loss_fn(predictions, labels)
                self.optimizer.reset()
                loss.backward()
                self.optimizer.step()

    def test(self, dataset):
        self.layer.eval()
        dataset.eval()

        predictions = []
        for i in range(len(dataset)):
            features, labels = dataset[i]
            predictions.append(self.layer(features))
        return predictions

## Settings

### Learning Rate

In [18]:
LEARNING_RATE = 0.01

### Epochs

In [19]:
EPOCHS = 100

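### Sequence Length

We add a new hyperparameter, the **sequence length**. It sets how many consecutive words are cut from the dataset for each sample: the center word plus its surrounding context.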

In [20]:
SEQUENCE_LENGTH = 5

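## Training

### Iteration

When training Word2Vec word vectors, the output layer no longer predicts a single value; its output has the same length as a one-hot encoding (that is, the vocabulary size). This makes it a multi-class classification problem, so we use the **cross-entropy loss** (CELoss). We also do not need an explicit activation function on the output layer, because CELoss already applies softmax internally.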
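Folding softmax into the loss also keeps the gradient simple: with respect to the raw outputs (logits) it is the softmax output minus the one-hot label. The sketch below checks this numerically on hypothetical logits with a central finite difference; the helper functions are standalone illustrations, not the `CELoss` class itself.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)


def cross_entropy(z, y):
    # Mean over the batch of -sum(y * log(softmax(z))).
    return -np.sum(y * np.log(softmax(z))) / len(y)


logits = np.array([[2.0, 0.5, -1.0]])   # hypothetical raw outputs, batch of 1
label = np.array([[0.0, 1.0, 0.0]])     # one-hot label

# Analytic gradient, in the same form that CELoss uses.
analytic = (softmax(logits) - label) / len(label)

# Numerical gradient by central differences.
numeric = np.zeros_like(logits)
eps = 1e-6
for j in range(logits.shape[1]):
    step = np.zeros_like(logits)
    step[0, j] = eps
    numeric[0, j] = (cross_entropy(logits + step, label)
                     - cross_entropy(logits - step, label)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```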

In [21]:
dataset = IMDBDataset('tinyimdb.csv', sequence_length=SEQUENCE_LENGTH)
layer = Sequential([Embedding(len(dataset.vocabulary), 32),
                    Linear(32, len(dataset.vocabulary))])
loss = CELoss()
optimizer = SGDOptimizer(layer.parameters, lr=LEARNING_RATE)

model = WBModel(layer, loss, optimizer)
model.train(dataset, EPOCHS)
print(layer)

Embedding[weight(32, 86); vocabulary=86; embedding=32]
Linear[weight(86, 32); bias(86,)]


## Validation

### Testing

In [22]:
predictions = model.test(dataset)
print(f'Accuracy: {dataset.estimate(predictions)}')

Accuracy: 0.5032051282051282


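Even with a very small dataset and a quick 100-epoch run, the model predicts just over half of the center words in the test set correctly. This suggests that the idea behind Word2Vec, predicting a word from its context, is workable.

### Comparison

Let's see how the model's predictions look in practice.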

In [23]:
features, labels = dataset.items()
for i in range(len(predictions)):
    predicted = dataset.argmax(predictions[i].data[0])  # index of the highest-scoring word
    print(f'Feature: {dataset.decode(features.data[i])} | '
          f'Label: {dataset.decode([dataset.argmax(labels.data[i])])} | '
          f'Prediction: {dataset.decode([predicted])}')

Feature: this movie excellent i | Label: was | Prediction: was
Feature: movie was i loved | Label: excellent | Prediction: wonderful
Feature: was excellent loved the | Label: i | Prediction: i
Feature: excellent i the story | Label: loved | Prediction: enjoyed
Feature: i loved story and | Label: the | Prediction: the
Feature: loved the and effect | Label: story | Prediction: screenplay
Feature: the story effect the | Label: and | Prediction: and
Feature: story and the actress | Label: effect | Prediction: actor
Feature: and effect actress was | Label: the | Prediction: the
Feature: effect the was wonderful | Label: actress | Prediction: and
Feature: the actress wonderful recommend | Label: was | Prediction: was
Feature: actress was recommend actor | Label: wonderful | Prediction: excellent
Feature: was wonderful actor character | Label: recommend | Prediction: actress
Feature: wonderful recommend character scene | Label: actor | Prediction: actor
Feature: recommend actor scene plot | L

### Similar Words

Let's also look at which words the Word2Vec word vectors consider similar.

In [24]:
def similar(dataset, layer, target='excellent'):
    target_index = dataset.word2index[target]
    scores = {}

    for word, index in dataset.word2index.items():
        raw_diff = layer.layers[0].weight.data.T[index] - layer.layers[0].weight.data.T[target_index]
        squared_diff = raw_diff ** 2
        scores[word] = math.sqrt(sum(squared_diff))

    return dict(sorted(scores.items(), key=lambda i: i[1])[:10])

print(similar(dataset, layer, target='excellent'))
print(similar(dataset, layer, target='terrible'))

{'excellent': 0.0, 'fantastic': 2.2691989455385126, 'best': 2.301648361367808, 'perfect': 2.344969171918984, 'great': 2.428808355618313, 'us': 2.7038015123537265, 'bad': 2.744339868666713, 'good': 2.755688618136458, 'see': 2.776683673895416, 'horrible': 2.8117908508372906}
{'terrible': 0.0, 'poor': 2.45307870660518, 'fantastic': 2.7639079511575475, 'best': 2.995568590213762, 'good': 3.0647706533607884, 'amazing': 3.0818405628645307, 'great': 3.102849434635568, 'boring': 3.2651024666148754, 'bad': 3.2744305081455254, 'for': 3.280588261142521}


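As the results show, the words that Word2Vec considers similar are not all synonyms or near-synonyms; some are even antonyms.

This is because the training target (the label) is no longer like or dislike, but a word that is likely to appear together with the given context. What we train for is no longer the attitude a word expresses, but how words relate to one another through shared contexts.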
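The `similar` function above ranks words by Euclidean distance. Cosine similarity is another common way to compare word vectors; it looks only at direction and ignores vector length. The sketch below contrasts the two on hypothetical 2-D vectors that are not taken from the trained model, so it says nothing about how the rankings above would change.

```python
import numpy as np


def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))


def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical vectors, chosen only to make the contrast visible.
short = np.array([1.0, 0.0])
long_ = np.array([3.0, 0.0])    # same direction as `short`, larger norm
other = np.array([0.0, 1.0])    # orthogonal to `short`

print(euclidean_distance(short, long_), cosine_similarity(short, long_))
print(euclidean_distance(short, other), cosine_similarity(short, other))
```

By Euclidean distance `other` is closer to `short` than `long_` is, while by cosine similarity `long_` is the better match.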

## Exercise

Modify how the dataset is built and try training **skip-gram** Word2Vec word vectors.