# **Word2Vec in PyTorch**

Here we implement Word2Vec using the skip-gram model with negative sampling.

In [1]:
import collections
import math
import random
import sys
import time
import os
import numpy as np
import torch
from torch import nn
import torch.utils.data as Data

sys.path.append('../utils/')
import d2lzh as d2l
print(torch.__version__)

1.2.0+cu92


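## **Processing the dataset**

The PTB (Penn Treebank) dataset:    
- is sampled from Wall Street Journal articles
- has one sentence per line, with words separated by spaces    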

In [2]:
with open('../datasets/ptb/ptb.train.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
    raw_datasets = [st.split() for st in lines] # list of lists; each inner list is one tokenized sentence
print(f'{len(raw_datasets)}')

42068


In [3]:
# Print the first 5 tokens of the first 3 sentences
for st in raw_datasets[:3]:
    print(st[:5])

['aer', 'banknote', 'berlitz', 'calloway', 'centrust']
['pierre', '<unk>', 'N', 'years', 'old']
['mr.', '<unk>', 'is', 'chairman', 'of']


## **Building the token index**

We keep only the words that appear at least 5 times in the dataset.

In [4]:
counter = collections.Counter([tk for st in raw_datasets for tk in st])
counter = dict(filter(lambda x : x[1]>=5, counter.items()))

`counter` is now a bag-of-words style dictionary that maps each token to its frequency (token -> count).

In [5]:
idx_to_token = [tk for tk, _ in counter.items()]
token_to_idx = {tk: idx for idx, tk in enumerate(idx_to_token)}
dataset = [[token_to_idx[tk] for tk in st if tk in token_to_idx] for st in raw_datasets]
num_tokens = sum(len(st) for st in dataset)

In [6]:
num_tokens

887100

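## **Subsampling**

**What is subsampling?**     
Many words occur very frequently in text data, such as "the" and "a" in English or "的" and "是" in Chinese. Generally, within a context window, a word's co-occurrence with a lower-frequency word is more useful for training than its co-occurrence with a higher-frequency word. For this reason, words can be subsampled when training a word embedding model.

Specifically, each indexed word $w_i$ in the dataset is discarded with probability:     
$\large P(w_i) = \max\left(1 - \sqrt{\frac{t}{f(w_i)}}, 0\right)$

Here $f(w_i)$ is the ratio of the number of occurrences of $w_i$ to the total number of tokens in the dataset, and $t$ is a hyperparameter ($10^{-4}$ in this experiment). A word $w_i$ can therefore only be discarded when $f(w_i) > t$, and the higher its frequency, the more likely it is to be discarded.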
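
As a rough check of the formula above, the following sketch evaluates the discard probability for the two words inspected later in this notebook, using the counts printed by `compare_counts` (50770 for `the`, 45 for `join`) and the total of 887100 tokens. The helper name `discard_prob` is hypothetical and not part of the notebook.

```python
import math

def discard_prob(count, total, t=1e-4):
    """Probability of dropping a word that occurs `count` times among `total` tokens."""
    f = count / total                      # relative frequency f(w_i)
    return max(1 - math.sqrt(t / f), 0.0)  # P(w_i) from the formula above

total_tokens = 887100                      # num_tokens reported in this notebook
p_the = discard_prob(50770, total_tokens)  # very frequent word
p_join = discard_prob(45, total_tokens)    # rare word, f(w_i) < t

print(f'the:  P(discard) = {p_the:.3f}')
print(f'join: P(discard) = {p_join:.3f}')
```

With these counts, roughly 96% of the occurrences of `the` are expected to be dropped, while `join` is never dropped because its relative frequency is below $t$.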

In [7]:
def discard(idx):
    return random.uniform(0, 1) < 1 - math.sqrt(1e-4 / counter[idx_to_token[idx]] * num_tokens) # True means the word is discarded

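Subsampling is applied to every word of every sentence in the dataset: each word is sampled independently to decide whether it is discarded.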

In [8]:
subsampled_dataset = [[tk for tk in st if not discard(tk)] for st in dataset]

In [9]:
def compare_counts(token):
    return '# %s: before=%d, after=%d' % (token, sum(
        [st.count(token_to_idx[token]) for st in dataset]), sum(
        [st.count(token_to_idx[token]) for st in subsampled_dataset]))

compare_counts('the') # the "after" count changes between runs because subsampling is random

'# the: before=50770, after=2114'

In [10]:
compare_counts('join') # '# join: before=45, after=45'

'# join: before=45, after=45'

## **Extracting center words and context words**

In [11]:
def get_centers_and_contexts(dataset, max_window_size):
    centers, contexts = [], []
    for st in dataset:
        if len(st) < 2: # a sentence with fewer than 2 words cannot form a center-context pair
            continue
        centers += st
        for center_i in range(len(st)):
            window_size = random.randint(1, max_window_size)
            indices = list(range(max(0, center_i - window_size),
                                 min(len(st), center_i + 1 + window_size))) # keep the window inside the sentence
            indices.remove(center_i) # exclude the center word itself
            contexts.append([st[idx] for idx in indices])
    return centers, contexts

In [12]:
tiny_dataset = [list(range(7)), list(range(7, 10))]
print('dataset', tiny_dataset)
for center, context in zip(*get_centers_and_contexts(tiny_dataset, 2)):
    print('center', center, 'has contexts', context)

dataset [[0, 1, 2, 3, 4, 5, 6], [7, 8, 9]]
center 0 has contexts [1]
center 1 has contexts [0, 2]
center 2 has contexts [1, 3]
center 3 has contexts [1, 2, 4, 5]
center 4 has contexts [3, 5]
center 5 has contexts [3, 4, 6]
center 6 has contexts [5]
center 7 has contexts [8]
center 8 has contexts [7, 9]
center 9 has contexts [7, 8]


In the experiment we set the maximum context window size to 5.

In [13]:
all_centers, all_contexts = get_centers_and_contexts(subsampled_dataset, 5)

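## **Negative sampling**

There are two main ways to reduce the cost of the softmax computation: **hierarchical softmax** and **negative sampling**. Here we use negative sampling: for each context word we randomly draw K noise words (K=5 in this experiment), where the sampling probability of a noise word is proportional to its relative frequency (word count divided by total count) raised to the power 0.75, as recommended in the paper.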
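
To illustrate what the 0.75 exponent does, here is a small sketch on made-up counts. The words and numbers are hypothetical and chosen only to show the effect: frequent words lose some probability mass and rare words gain some, compared with sampling by raw frequency.

```python
counts = {'the': 10000, 'bank': 100, 'zebra': 1}   # toy counts, not from PTB

def normalize(weights):
    """Turn non-negative weights into a probability distribution."""
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}

raw = normalize(counts)                                           # plain relative frequency
smoothed = normalize({w: c ** 0.75 for w, c in counts.items()})  # frequency ** 0.75

for w in counts:
    print(f'{w:6s} raw={raw[w]:.4f} smoothed={smoothed[w]:.4f}')
```

Note that `random.choices` accepts unnormalized weights, so the notebook does not need the explicit normalization step used here for display.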

In [14]:
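def get_negatives(all_contexts, sampling_weights, K):
    '''
    all_contexts: context words of each center word
    sampling_weights: unnormalized sampling weight of each token index
    K: number of noise words drawn per context word
    '''
    all_negatives, neg_candidates, i = [], [], 0
    population = list(range(len(sampling_weights))) # token indices
    for contexts in all_contexts:
        context_set = set(contexts) # built once per sample for fast membership tests
        negatives = []
        while len(negatives) < len(contexts) * K:
            # K noise words for every context word
            if i == len(neg_candidates):
                # Draw 1e5 candidate indices at once according to the weights
                i, neg_candidates = 0, random.choices(population, sampling_weights, k=int(1e5))
            neg, i = neg_candidates[i], i + 1 # consume cached candidates instead of calling choices repeatedly
            if neg not in context_set:
                # a noise word must not be one of the context words
                negatives.append(neg)
        all_negatives.append(negatives) # len(contexts) * K noise words for this sample
    return all_negatives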

In [15]:
sampling_weights = [counter[w]**0.75 for w in idx_to_token]
all_negatives = get_negatives(all_contexts, sampling_weights, 5)

## **Reading the data**

In [25]:
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, centers, contexts, negatives):
        assert len(centers) == len(contexts) == len(negatives)
        
        self.centers = centers
        self.contexts = contexts
        self.negatives = negatives
        
    def __getitem__(self, index):
        return (self.centers[index], self.contexts[index], self.negatives[index])
    
    def __len__(self):
        return len(self.centers)

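We read the data in random minibatches. In a minibatch, the $i$-th sample consists of one center word together with its $n_i$ context words and $m_i$ noise words. Because the context window size can differ between samples, the sum $n_i+m_i$ also differs. When building a minibatch, we concatenate the context words and noise words of each sample and pad with 0 until all concatenated sequences have the same length, namely $\max_i (n_i+m_i)$ (the `max_len` variable). To keep the padding from affecting the loss, we build a mask variable `masks` whose elements correspond one-to-one to the elements of the concatenated context and noise words `contexts_negatives`. Where an element of `contexts_negatives` is padding, the element of `masks` at the same position is 0; otherwise it is 1. To tell positive from negative examples, we also need to distinguish context words from noise words within `contexts_negatives`. Following the same idea as the mask, we create a label variable `labels` with the same shape as `contexts_negatives`, set the elements corresponding to context words (positive examples) to 1, and set the rest to 0.

Below we implement this minibatch function, `batchify`. Its input `data` is a list whose length equals the batch size; each element contains a center word `center`, its context words `context`, and its noise words `negative`. The minibatch it returns has the format we need; for example, it includes the mask variable.

The `collate_fn` argument of `DataLoader` specifies how a list of samples is combined into a batch; it can also be seen as a processing step applied to the samples before they are batched.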
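
Before the tensor version, the padding scheme can be traced by hand on two toy samples. The sketch below uses plain Python lists only, and the helper name `pad_batch` is hypothetical; it mirrors the description above rather than replacing `batchify`.

```python
def pad_batch(samples):
    """samples: list of (center, contexts, negatives); contexts and negatives are lists."""
    max_len = max(len(c) + len(n) for _, c, n in samples)
    rows = []
    for center, contexts, negatives in samples:
        cur_len = len(contexts) + len(negatives)
        pad = max_len - cur_len
        rows.append({
            'center': center,
            'contexts_negatives': contexts + negatives + [0] * pad,          # pad with 0
            'mask': [1] * cur_len + [0] * pad,                               # 0 marks padding
            'label': [1] * len(contexts) + [0] * (max_len - len(contexts)),  # 1 marks context words
        })
    return rows

toy = [(1, [2, 2], [3, 3, 3, 3]),   # 2 context words, 4 noise words
       (1, [2, 2, 2], [3, 3])]      # 3 context words, 2 noise words
rows = pad_batch(toy)
for row in rows:
    print(row)
```

Because index 0 is also a valid token index in this notebook, only the mask tells a padded position apart from a real token.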

In [26]:
def batchify(data):
    max_len = max(len(c) + len(n) for _, c, n in data) # longest context + noise sequence in the batch
    centers, contexts_negatives, masks, labels = [], [], [], []
    for center, context, negative in data:
        cur_len = len(context) + len(negative)
        centers += [center]
        contexts_negatives += [context + negative + [0] * (max_len - cur_len)] # pad with 0
        masks += [[1] * cur_len + [0] * (max_len - cur_len)] # 1 = real token, 0 = padding
        labels += [[1] * len(context) + [0] * (max_len - len(context))] # 1 = context word, 0 = noise word or padding
    return (torch.tensor(centers).view(-1, 1), torch.tensor(contexts_negatives), torch.tensor(masks), torch.tensor(labels))

In [27]:
batch_size = 512
num_workers = 0 if sys.platform.startswith('win32') else 4

dataset = MyDataset(all_centers, all_contexts, all_negatives)
data_iter = Data.DataLoader(dataset, batch_size, shuffle=True, 
                            collate_fn=batchify, num_workers=num_workers)
for batch in data_iter:
    for name, data in zip(['centers', 'contexts_negatives', 'masks',
                           'labels'], batch):
        print(name, 'shape:', data.shape)
    break

centers shape: torch.Size([512, 1])
contexts_negatives shape: torch.Size([512, 60])
masks shape: torch.Size([512, 60])
labels shape: torch.Size([512, 60])


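## **The skip-gram model**

### **Embedding layer**

https://pytorch.org/docs/stable/nn.html?highlight=embedding#torch.nn.Embedding

The weight of an embedding layer is a matrix whose number of rows is the vocabulary size and whose number of columns is the dimension of each word vector.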

In [31]:
emded = nn.Embedding(num_embeddings=20, embedding_dim=4)
emded.weight

Parameter containing:
tensor([[ 9.8456e-02,  5.2874e-01,  1.4461e-02,  2.4089e-01],
        [ 7.7678e-02,  7.4174e-01, -3.9631e-01, -3.2123e-01],
        [ 1.2502e+00,  1.5653e+00, -1.3283e+00,  1.2080e-02],
        [ 5.3090e-01,  2.0589e-01, -1.2895e+00, -5.5388e-01],
        [-3.5950e-01, -2.2590e-01,  1.4277e-03, -7.9689e-01],
        [-1.1261e+00, -3.4923e-01,  3.1314e-01,  9.1321e-02],
        [-1.3020e+00, -6.8355e-01,  1.9837e-01,  2.0580e-01],
        [ 1.4635e-01,  2.0343e+00,  1.5899e-01, -3.6710e-02],
        [ 4.6829e-01, -1.5326e+00, -2.6721e-01, -3.9027e-01],
        [ 2.2660e-01, -7.8798e-02,  2.9451e-01,  7.1559e-01],
        [ 7.2752e-01, -1.3321e+00, -8.6455e-01, -5.8383e-01],
        [ 3.0608e-01, -8.6557e-01, -1.6215e-01,  1.5413e+00],
        [ 1.6460e-01, -8.5500e-01,  2.1569e+00, -3.2907e-01],
        [ 2.0840e-01, -7.0282e-02,  2.4446e-01,  3.0358e-01],
        [-1.7751e-01, -2.9913e-01,  9.8362e-01,  1.5230e-01],
        [ 1.8791e-01,  1.8806e+00,  2.2711e-01, 

In [33]:
x = torch.tensor([[1, 2, 3], [1, 5, 6]], dtype=torch.long)
emded(x)

tensor([[[ 0.0777,  0.7417, -0.3963, -0.3212],
         [ 1.2502,  1.5653, -1.3283,  0.0121],
         [ 0.5309,  0.2059, -1.2895, -0.5539]],

        [[ 0.0777,  0.7417, -0.3963, -0.3212],
         [-1.1261, -0.3492,  0.3131,  0.0913],
         [-1.3020, -0.6835,  0.1984,  0.2058]]], grad_fn=<EmbeddingBackward>)

### **Batched matrix multiplication**

Given two tensors of shape $batch\_size \times a \times b$ and $batch\_size \times b \times c$, batched matrix multiplication produces a result of shape $batch\_size \times a \times c$.

In [34]:
X = torch.ones((2, 1, 4))
Y = torch.ones((2, 4, 6))
torch.bmm(X, Y).shape

torch.Size([2, 1, 6])

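### **Forward computation of skip-gram**

In the forward computation, the inputs of the skip-gram model are the center word indices `center` and the concatenated context and noise word indices `contexts_and_negatives`. `center` has shape (batch size, 1) and `contexts_and_negatives` has shape (batch size, `max_len`). Both are first mapped from word indices to word vectors by embedding layers, and a batched matrix multiplication then produces an output of shape (batch size, 1, `max_len`). Each element of the output is the inner product of a center word vector with a context word vector or a noise word vector.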
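
To make the "inner product per position" reading concrete, here is a plain-Python sketch of what the batched multiplication computes in this setting. It uses nested lists instead of tensors, and the function name `batched_inner_products` and the toy vectors are hypothetical.

```python
def batched_inner_products(v, u):
    """v: [batch][1][d] center vectors; u: [batch][max_len][d] context/noise vectors.
    Returns a nested list of shape [batch][1][max_len]."""
    out = []
    for v_b, u_b in zip(v, u):
        center = v_b[0]  # the single center vector of this sample
        out.append([[sum(a * b for a, b in zip(center, vec)) for vec in u_b]])
    return out

v = [[[1.0, 0.0]], [[0.5, 0.5]]]                 # batch of 2, vector dimension 2
u = [[[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],
     [[2.0, 2.0], [1.0, -1.0], [4.0, 0.0]]]      # max_len = 3
pred = batched_inner_products(v, u)
print(pred)
```

Each of the `max_len` numbers per sample is later treated as a logit: it should be large for context words and small for noise words.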

In [43]:
def skip_gram(center, contexts_and_negatives, embed_v, embed_u):
    v = embed_v(center) # (batch_size, 1, embed_size)
    u = embed_u(contexts_and_negatives) # (batch_size, max_len, embed_size)
    pred = torch.bmm(v, u.permute(0, 2, 1)) # (batch_size, 1, max_len)
    return pred

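## **Training function**

### **Binary cross-entropy loss**

The loss function has two properties:
- it is a binary classification loss
- it must take the `masks` matrix into account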

In [36]:
class SigmoidBinaryCrossEntropyLoss(nn.Module):
    def __init__(self):
        super(SigmoidBinaryCrossEntropyLoss, self).__init__()
    
    def forward(self, inputs, targets, mask=None):
        '''
        inputs: (batch_size, len), raw logits
        targets: same shape as inputs
        mask: optional, same shape as inputs; 0 marks padded positions
        '''
        inputs, targets = inputs.float(), targets.float()
        if mask is not None: # avoid calling .float() on None when no mask is given
            mask = mask.float()
        res = nn.functional.binary_cross_entropy_with_logits(inputs, targets, weight=mask, reduction='none')
        return res.mean(dim=1)

loss = SigmoidBinaryCrossEntropyLoss()

In [37]:
pred = torch.tensor([[1.5, 0.3, -1, 2], [1.1, -0.6, 2.2, 0.4]])
# In label, 1 marks a context word and 0 marks a noise word
label = torch.tensor([[1, 0, 0, 0], [1, 1, 0, 0]])
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 1, 0]])  # mask variable: 0 marks padding
loss(pred, label, mask) * mask.shape[1] / mask.float().sum(dim=1)

tensor([0.8740, 1.2100])
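
The two values above can be checked by hand. The sketch below recomputes them with the standard library only, averaging the sigmoid cross-entropy over the unmasked positions of each row; the helper name `masked_bce_row` is hypothetical.

```python
import math

def masked_bce_row(logits, labels, mask):
    """Mean sigmoid cross-entropy over the unmasked positions of one row."""
    total = 0.0
    for x, y, m in zip(logits, labels, mask):
        p = 1 / (1 + math.exp(-x))  # sigmoid of the logit
        total += m * -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / sum(mask)        # divide by the number of valid positions

row1 = masked_bce_row([1.5, 0.3, -1, 2], [1, 0, 0, 0], [1, 1, 1, 1])
row2 = masked_bce_row([1.1, -0.6, 2.2, 0.4], [1, 1, 0, 0], [1, 1, 1, 0])
print(f'{row1:.4f} {row2:.4f}')
```

This is why the notebook multiplies by `mask.shape[1] / mask.float().sum(dim=1)`: the loss module averages over the padded length, and the factor converts that into an average over the valid positions only.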

### **Initializing model parameters**

The word vector dimension is set to 100.

In [38]:
embed_size = 100
net = nn.Sequential(
    nn.Embedding(num_embeddings=len(idx_to_token), embedding_dim=embed_size),
    nn.Embedding(num_embeddings=len(idx_to_token), embedding_dim=embed_size)
)

In [44]:
def train(net, lr, num_epochs):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print("train on", device)
    net = net.to(device)
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for epoch in range(num_epochs):
        l_sum, n = 0.0, 0
        for batch in data_iter:
            center, context_negative, mask, label = [d.to(device) for d in batch]
            pred = skip_gram(center, context_negative, net[0], net[1])
            
            l = (loss(pred.view(label.shape), label, mask) * 
                 mask.shape[1] / mask.float().sum(dim=1)).mean() # mean loss of the batch; a batch holds up to max_len * batch_size word pairs, and only the unmasked ones count
            optimizer.zero_grad()
            l.backward()
            optimizer.step()
            l_sum += l.cpu().item()
            n += 1
        print(f'epoch{epoch+1}, loss:{l_sum/n:.4f}')

In [48]:
train(net, 0.01, 100)

train on cuda
epoch1, loss:0.3171
epoch2, loss:0.3061
epoch3, loss:0.3015
epoch4, loss:0.2980
epoch5, loss:0.2949
epoch6, loss:0.2923
epoch7, loss:0.2899
epoch8, loss:0.2878
epoch9, loss:0.2860
epoch10, loss:0.2844
epoch11, loss:0.2830
epoch12, loss:0.2815
epoch13, loss:0.2804
epoch14, loss:0.2792
epoch15, loss:0.2781
epoch16, loss:0.2774
epoch17, loss:0.2765
epoch18, loss:0.2756
epoch19, loss:0.2749
epoch20, loss:0.2743
epoch21, loss:0.2736
epoch22, loss:0.2729
epoch23, loss:0.2726
epoch24, loss:0.2719
epoch25, loss:0.2714
epoch26, loss:0.2711
epoch27, loss:0.2704
epoch28, loss:0.2699
epoch29, loss:0.2698
epoch30, loss:0.2694
epoch31, loss:0.2690
epoch32, loss:0.2685
epoch33, loss:0.2684
epoch34, loss:0.2681
epoch35, loss:0.2677
epoch36, loss:0.2676
epoch37, loss:0.2672
epoch38, loss:0.2668
epoch39, loss:0.2667
epoch40, loss:0.2665
epoch41, loss:0.2663
epoch42, loss:0.2660
epoch43, loss:0.2659
epoch44, loss:0.2656
epoch45, loss:0.2654
epoch46, loss:0.2653
epoch47, loss:0.2651
epoch48,

At this point `net[0]` holds the word-vector matrix (the center-word embeddings).

## **Using the embeddings**

In [50]:
def get_similar_tokens(query_token, k, embed):
    W = embed.weight.data
    x = W[token_to_idx[query_token]]
    # 1e-9 is added for numerical stability
    cos = torch.matmul(W, x) / (torch.sum(W * W, dim=1) * torch.sum(x * x) + 1e-9).sqrt()
    _, topk = torch.topk(cos, k=k+1)
    topk = topk.cpu().numpy()
    for i in topk[1:]:  # skip the query word itself
        print('cosine sim=%.3f: %s' % (cos[i], (idx_to_token[i])))
        
get_similar_tokens('parent', 3, net[0])

cosine sim=0.447: unit
cosine sim=0.420: revco
cosine sim=0.408: core
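
For reference, the ranking performed by `get_similar_tokens` can be sketched without tensors. The vectors and words below are made up for illustration and are not taken from the trained model; `cosine` is a hypothetical helper that uses the same stabilizing constant.

```python
import math

def cosine(a, b, eps=1e-9):
    """Cosine similarity of two equal-length vectors, with eps for numerical stability."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b) + eps)

toy_vectors = {'king': [1.0, 1.0], 'queen': [0.9, 1.1], 'apple': [-1.0, 0.5]}  # hypothetical
query = 'king'
ranked = sorted((w for w in toy_vectors if w != query),   # exclude the query word
                key=lambda w: cosine(toy_vectors[query], toy_vectors[w]),
                reverse=True)
print(ranked)
```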
