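Before discussing `word embedding`, we first need to understand **language modeling**.

Language modeling learns, without supervision and from existing human-written text corpora, how a sentence is put together; as a by-product it can also yield semantic representations of words.

The main approaches to language modeling in `NLP` are:

1. The statistical `N-gram` model: the earliest approach counts how often words occur in a corpus and then combines those counts with Bayes' rule, for example to classify a text.

2. Later, neural networks were brought in, which produced the unsupervised `NNLM`.

3. After that came large-scale unsupervised learning: `word2vec` and `BERT`.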

## N-gram model

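The `N-gram` model is a statistics-based model. It is simple, but it generalizes poorly and cannot capture the semantics of words. An `N-gram` is a sequence of `n` adjacent tokens (words or characters). When `n` equals `1` it is called a `unigram`; a sequence of two words is a `bigram`, and a sequence of three words is a `trigram`.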
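As a minimal sketch of this definition, the snippet below slides a window of size `n` over a token list. The helper name `extract_ngrams` and the toy sentence are illustrative assumptions, not part of any library.

```python
def extract_ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat sat on the mat".split()

unigrams = extract_ngrams(tokens, 1)
bigrams = extract_ngrams(tokens, 2)
trigrams = extract_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence of 6 tokens yields 5 bigrams and 4 trigrams; asking for an `n` longer than the sentence simply returns an empty list.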

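`N-gram` models can be word-based or character-based. For a word-based `N-gram` model, the feature dimension grows with the vocabulary size of the corpus and exponentially with `n` (the `curse of dimensionality`). For a character-based `N-gram` model, the alphabet is fixed, so the feature dimension grows only with `n`.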
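To make the growth concrete, here is a rough back-of-the-envelope sketch. The number of distinct n-grams that could occur is the symbol inventory size raised to the power `n`; the 10,000-word vocabulary and 26-letter alphabet are assumed figures chosen only for illustration.

```python
def num_possible_ngrams(inventory_size, n):
    """Upper bound on distinct n-grams over an inventory of symbols."""
    return inventory_size ** n


WORD_VOCAB = 10_000  # assumed vocabulary size, for illustration only
ALPHABET = 26        # assumed alphabet size, for illustration only

for n in (1, 2, 3):
    print(n, num_possible_ngrams(WORD_VOCAB, n), num_possible_ngrams(ALPHABET, n))
```

In practice only a small fraction of these combinations is ever observed, which is exactly why count-based word-level models suffer from sparsity.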

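What we actually want, however, is a semantic representation of each word. A sparse one-hot embedding only encodes a word's position in the vocabulary; it says nothing about how any two words relate to each other. A one-hot vector is also as long as the vocabulary is large, and vectors that long are costly to compute with.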
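The limitation of one-hot vectors can be seen in a small sketch. The three-word vocabulary and the helper names `one_hot` and `dot` are assumptions made for this example: every pair of distinct words has a dot product of zero, so "cat" is no closer to "dog" than to "car".

```python
def one_hot(index, size):
    """Vector of zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


toy_vocab = ["cat", "dog", "car"]
vectors = {word: one_hot(i, len(toy_vocab)) for i, word in enumerate(toy_vocab)}

print(vectors["cat"])
print(dot(vectors["cat"], vectors["dog"]))
print(dot(vectors["cat"], vectors["car"]))
```

Each vector also has exactly as many entries as there are words in the vocabulary, which is the length problem mentioned above.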

## Neural network language model (NNLM)



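## Word2vec model

Improvement 1: drop the hidden layer, and introduce `CBOW` and `Skip-gram`.

Unlike `NNLM`, `Continuous Bag-of-Words` (CBOW) takes the context on both sides into account and uses the surrounding words to predict the center word.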

$$
J_{\theta} = \frac{1}{T} \sum_{t=1}^{T} \log p(w_{t} \mid w_{t-n}, \cdots, w_{t-1}, w_{t+1}, \cdots, w_{t+n})
$$
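A minimal sketch of how CBOW training examples can be formed from a token list, with `n` words taken on each side of the target. The function name `cbow_pairs` is hypothetical, and skipping positions too close to the sentence edges is an assumption of this sketch (padding is another common choice).

```python
def cbow_pairs(tokens, n):
    """Build (context, target) pairs with n words on each side of the target."""
    pairs = []
    for t in range(n, len(tokens) - n):
        # n words before position t, followed by n words after it
        context = tokens[t - n:t] + tokens[t + 1:t + n + 1]
        pairs.append((context, tokens[t]))
    return pairs


tokens = "we are about to study the idea".split()
pairs = cbow_pairs(tokens, 2)

for context, target in pairs:
    print(context, "->", target)
```

Each pair corresponds to one term of the sum in the objective: the model scores the target word given its `2n` context words.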

`Skip-gram` is the reverse of `Continuous Bag-of-Words`: it uses the center word to predict the surrounding words:

$$
J_{\theta} = \frac{1}{T} \sum_{t=1}^{T} \sum_{-n \leq j \leq n,\, j \neq 0} \log p(w_{t+j} \mid w_{t})
$$
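The double sum in the Skip-gram objective can be sketched as a list of (center, context) pairs, one per inner term. The function name `skipgram_pairs` is hypothetical, and offsets that fall outside the sentence are simply dropped in this sketch.

```python
def skipgram_pairs(tokens, n):
    """Build (center, context) pairs for every offset j in [-n, n], j != 0."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-n, n + 1):
            if j == 0 or not 0 <= t + j < len(tokens):
                continue  # skip the center itself and out-of-range positions
            pairs.append((center, tokens[t + j]))
    return pairs


tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens, 1)
print(pairs)
```

With a window of 1, each interior word contributes two pairs and each edge word contributes one.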

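Improvement 2: optimize Softmax

The cost of computing Softmax grows linearly with $K$, the number of output classes (here, the vocabulary size):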

$$
\sigma(\vec{z})_{i}=\frac{e^{z_{i}}}{\sum_{j=1}^{K} e^{z_{j}}}
$$
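A minimal sketch of this formula in plain Python. Subtracting the maximum score before exponentiating is a standard numerical-stability trick that does not change the result; it is an addition of this sketch and not part of the formula above. The denominator sums over all $K$ entries, which is where the linear cost comes from.

```python
import math


def softmax(z):
    """Softmax over a list of K scores; one pass to exponentiate, one to normalize."""
    m = max(z)                            # shift for numerical stability
    exps = [math.exp(v - m) for v in z]   # K exponentials
    total = sum(exps)                     # sum over all K classes
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```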

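To address this, Hierarchical Softmax was proposed: it arranges the vocabulary as the leaves of a binary tree, so that a word's probability is computed from roughly $\log_2 K$ binary decisions instead of a normalization over all $K$ words.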
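The idea behind Hierarchical Softmax can be sketched as a walk down a binary tree whose leaves are the words: each inner node makes a left/right decision with a sigmoid, and a word's probability is the product of the decisions on its path. This sketch makes two simplifying assumptions that are not how word2vec does it in detail: the tree is a complete balanced tree (word2vec uses a Huffman tree), and each inner node has a fixed scalar score (in a real model the score is a dot product between the hidden vector and a learned node vector). The function and variable names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def leaf_probability(word, num_words, node_scores):
    """Probability of reaching leaf `word` in a complete binary tree.

    Inner nodes use heap numbering 1..num_words-1 (children of k are 2k and 2k+1);
    the leaf for `word` is node num_words + word. node_scores[k] is the score of
    inner node k, and index 0 is unused. num_words must be a power of two.
    """
    node = num_words + word
    prob = 1.0
    while node > 1:
        parent = node // 2
        p_left = sigmoid(node_scores[parent])
        # even-numbered nodes are left children, odd-numbered are right children
        prob *= p_left if node % 2 == 0 else 1.0 - p_left
        node = parent
    return prob


K = 8
scores = [0.0, 0.3, -1.2, 0.8, 0.5, -0.4, 1.1, -0.7]  # index 0 unused
probs = [leaf_probability(w, K, scores) for w in range(K)]
print(probs)
print(sum(probs))
```

Each word touches only 3 inner nodes here ($\log_2 8$), yet the leaf probabilities still sum to 1, so no explicit normalization over all $K$ words is needed.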

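Improvement 3: introduce negative sampling

Continuous Bag of Words (CBOW): the input is the preceding n words and the following n words. The goal is to maximize the predicted probability of the center word while minimizing the probability of the negative-sample words.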

$$
g(w)=\prod_{u \in\{w\} \cup N E G(w)} p(u \mid \operatorname{Context}(w))
$$

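For Skip-gram, the input is the center word; the goal is to maximize the probability of the context words and minimize the probability of the negative-sample words.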
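The product $g(w)$ can be sketched numerically. This sketch assumes the usual word2vec formulation, which the text above does not spell out: the probability for the true word is $\sigma(x^{\top}\theta_{u})$ and the probability for each negative sample is $1 - \sigma(x^{\top}\theta_{u})$, where $x$ is the context vector. All vectors below are made-up numbers, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_objective(context_vec, target_vec, negative_vecs):
    """g(w): probability of accepting the true word and rejecting every negative."""
    g = sigmoid(dot(context_vec, target_vec))
    for neg in negative_vecs:
        g *= 1.0 - sigmoid(dot(context_vec, neg))
    return g


context = [0.5, -0.2, 0.1]                           # made-up context vector
target = [1.0, -1.0, 0.5]                            # made-up vector of the true word
negatives = [[-0.8, 0.3, 0.0], [-0.1, 0.9, -0.4]]    # made-up negative samples

g = negative_sampling_objective(context, target, negatives)
print(g, math.log(g))
```

Training maximizes $\log g(w)$, which only involves the true word and a handful of sampled negatives rather than the whole vocabulary.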

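## Word Embedding in PyTorch

- PyTorch's own introduction to `Word Embedding` is here: [Word Embedding](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html)

Below is a summary of the key points from the official PyTorch tutorial:

An embedding maps each word to a vector of a chosen dimension; in essence it is a table lookup.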
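To illustrate why an embedding is a table lookup, the sketch below uses plain Python lists rather than PyTorch. It shows that picking row `i` of the table gives the same result as multiplying a one-hot vector by the table. The table values and the `toy_` names are assumptions for this example only.

```python
toy_table = [
    [0.1, 0.2, 0.3],  # row 0: embedding of "hello"
    [0.4, 0.5, 0.6],  # row 1: embedding of "world"
]
toy_word_to_ix = {"hello": 0, "world": 1}


def lookup(word):
    """Embedding as a direct row lookup."""
    return toy_table[toy_word_to_ix[word]]


def one_hot_matmul(word):
    """Embedding as (one-hot vector) x (embedding table)."""
    rows, dim = len(toy_table), len(toy_table[0])
    one_hot = [1.0 if r == toy_word_to_ix[word] else 0.0 for r in range(rows)]
    return [sum(one_hot[r] * toy_table[r][c] for r in range(rows)) for c in range(dim)]


print(lookup("world"))
print(one_hot_matmul("world"))
```

The lookup avoids building the one-hot vector and doing the multiplication at all, which is why embedding layers index into the weight matrix directly.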

### Embedding basics

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x10e4bce70>

In [2]:
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
print(embeds.weight)

Parameter containing:
tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519],
        [-0.1661, -1.5228,  0.3817, -1.0276, -0.5631]], requires_grad=True)


We look up the index of the word "hello" and then use it to read the embedding from the table:

In [3]:
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)

tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]],
       grad_fn=<EmbeddingBackward0>)


### N-Gram language model

Recall the `N-Gram` language model: given a sequence of words $w$, we want to compute:

$$
P\left(w_{i} \mid w_{i-1}, w_{i-2}, \ldots, w_{i-n+1}\right)
$$

where $w_{i}$ is the $i$-th word of the sequence.
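Before turning to the neural version, here is a minimal count-based sketch of the same conditional probability for the bigram case ($n = 2$). It uses the maximum-likelihood estimate: the count of the pair divided by the count of the preceding word. The function name `bigram_probability` and the toy sentence are assumptions, and no smoothing is applied.

```python
from collections import Counter


def bigram_probability(tokens, prev_word, word):
    """Maximum-likelihood estimate of P(word | prev_word) from raw counts."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    prev_counts = Counter(tokens[:-1])  # words that have a successor
    if prev_counts[prev_word] == 0:
        return 0.0  # unseen history; a real model would smooth here
    return pair_counts[(prev_word, word)] / prev_counts[prev_word]


tokens = "the cat sat on the mat".split()

print(bigram_probability(tokens, "the", "cat"))
print(bigram_probability(tokens, "cat", "sat"))
```

Any pair that never appeared in the corpus gets probability 0, which is the poor generalization that the neural model with learned embeddings is meant to fix.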

Let's walk through an example:

Assume `CONTEXT_SIZE = 2` and an embedding size of `EMBEDDING_DIM = 10`.

In [4]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.
# Each tuple is ([ word_i-CONTEXT_SIZE, ..., word_i-1 ], target word)
ngrams = [
    (
        [test_sentence[i - j - 1] for j in range(CONTEXT_SIZE-1, -1, -1)],
        test_sentence[i]
    )
    for i in range(CONTEXT_SIZE, len(test_sentence))
]
# print the first 3 ngrams
print(ngrams[:3])


[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]


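Next we build a word-to-index dictionary for later lookups. We first need the vocabulary, that is, the set of all distinct words; applying `set` to the token list gives exactly that.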

In [5]:
vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}
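One caveat: iterating over a `set` of strings has no guaranteed order, so the indices assigned this way can differ from one run to the next. A minimal sketch, assuming reproducible indices are wanted, sorts the vocabulary first. The `demo_` names are used only to avoid clashing with the tutorial's own variables.

```python
demo_tokens = "to be or not to be".split()

# Sorting fixes the iteration order, so every run assigns the same indices.
demo_vocab = sorted(set(demo_tokens))
demo_word_to_ix = {word: i for i, word in enumerate(demo_vocab)}

# Reverse mapping, handy for turning predicted indices back into words.
demo_ix_to_word = {i: word for word, i in demo_word_to_ix.items()}

print(demo_word_to_ix)
```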

In [6]:
class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

In [7]:
losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

In [8]:
for epoch in range(10):
    total_loss = 0
    for context, target in ngrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)


        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)  # The loss decreased every iteration over the training data!

# To get the embedding of a particular word, e.g. "beauty"
print(model.embeddings.weight[word_to_ix["beauty"]])

[521.7259202003479, 519.2380111217499, 516.765364408493, 514.3083226680756, 511.8656668663025, 509.43723130226135, 507.0212330818176, 504.61711144447327, 502.22495698928833, 499.84271144866943]
tensor([-0.0365,  0.1829, -1.2690, -0.5939,  0.4525,  0.3140, -0.6911, -0.2820,
         0.0993,  0.4963], grad_fn=<SelectBackward0>)


The official tutorial then goes on to provide some data preprocessing code for a bag-of-words model.