## Word Vectors
#### Learning Objectives

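- Understand the concept of word vectors
- Train word vectors with the Skip-gram model
- Learn to use the PyTorch Dataset and DataLoader
- Learn to define a PyTorch model
- Learn common Modules in torch.nn
    - Embedding
- Learn common PyTorch operations
    - bmm
    - logsigmoid
- Save and load PyTorch models

- The training data used in Lesson 2 can be downloaded from the following link.
    - Link: https://pan.baidu.com/s/1tFeK3mXuVXEy3EMarfeWvg  Password: v2z5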

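#### Contents

- In this notebook we try, as far as possible, to reproduce the method for training word vectors from the paper [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). We implement the Skip-gram model and use the paper's negative sampling objective (a simplified form of noise contrastive estimation).

- The paper contains many implementation details that matter a great deal for the quality of the word vectors. We cannot fully reproduce its experimental results, mainly because of limited compute and a number of other details, but we can still show roughly how word vectors are trained.

- Details we did not implement:
    - subsampling: see section 2.3 of the paper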
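For reference, the subsampling step that this notebook leaves out can be sketched as follows. Section 2.3 of the paper discards each occurrence of a word `w` with probability `1 - sqrt(t / f(w))`, where `f(w)` is the word's relative frequency and `t` is a threshold (around `1e-5` in the paper). The function name `subsample`, the toy corpus, and the large threshold `t=0.05` are assumptions chosen only so the effect is visible on a tiny example.

```python
import math
import random
from collections import Counter

def subsample(tokens, t=1e-5, seed=0):
    """Randomly drop frequent tokens (sketch of section 2.3 of the paper)."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        f = counts[w] / total                          # relative frequency f(w)
        p_discard = max(0.0, 1.0 - math.sqrt(t / f))   # rare words get probability 0
        if rng.random() >= p_discard:
            kept.append(w)
    return kept

# Hypothetical toy corpus: one very frequent word, one medium, one rare
tokens = ["the"] * 900 + ["cat"] * 90 + ["zymurgy"] * 10
kept = subsample(tokens, t=0.05)   # t is exaggerated for this tiny corpus
kept_counts = Counter(kept)
print(kept_counts)
```

Frequent words lose most of their occurrences, while words whose frequency is below `t` are always kept.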

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as tud
from torch.nn.parameter import Parameter

from collections import Counter
import numpy as np
import random
import math

import pandas as pd
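import scipy
import scipy.stats             # used by evaluate() for spearmanr
import scipy.spatial.distance  # used by find_nearest() for cosine distance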
import sklearn
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

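# To make results reproducible, we usually fix every random seed to a constant
SEED = 2019
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed(SEED)

# Hyperparameters
K = 100         # number of negative samples per context word
C = 3           # word2vec window size (radius)
EPOCHS = 2      # one epoch is one full pass over the training data
MAX_VOCAB_SIZE = 30000    # vocabulary size, i.e. 30,000 words
BATCH_SIZE = 128          # number of examples per batch
LEARNING_RATE = 0.2       # initial learning rate
EMBEDDING_SIZE = 100      # dimensionality of the word vectors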


TRAIN_FILE = "data/text8/text8.train.txt"
EVAL_FILE = "data/text8/text8.dev.txt"
TEST_FILE = "data/text8/text8.test.txt"

In [3]:
# Tokenizer: split a text into individual words on whitespace
def word_tockenize(text):
    return text.split()

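## Data Preprocessing (Text Processing)
- Read all the text from the file and build a vocabulary from it
- Because the number of distinct words can be very large, we keep only the most frequent ones, for MAX_VOCAB_SIZE entries in total
- We add an UNK token to represent all the remaining rare words
- We need to record the word-to-index mapping, the index-to-word mapping, the count of each word, the (normalized) frequency of each word, and the total number of words in the vocabulary.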

In [4]:
def text_preprocess(txt_file):
    with open(txt_file, "r") as f:
        text = f.read()
    text = [w for w in word_tockenize(text.lower())]
    # Keep the MAX_VOCAB_SIZE-1 most frequent words; the last slot is reserved for "<unk>"
    vocab = dict(Counter(text).most_common(MAX_VOCAB_SIZE-1))
    # Every word outside vocab is mapped to <unk>; its count is the number of remaining tokens
    vocab["<unk>"] = len(text) - np.sum(list(vocab.values()))
    
    idx2word = [word for word in vocab.keys()]  
    word2idx = {word: i for i, word in enumerate(idx2word)}
    
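    # Negative-sampling distribution
    # Array with the number of occurrences of each word
    word_counts = np.array([count for count in vocab.values()], 
                           dtype=np.float32)
    word_freqs = word_counts / np.sum(word_counts)
    # Raise the unigram distribution to the 3/4 power, as in the paper, then renormalize
    word_freqs = word_freqs ** (3./4.)
    word_freqs = word_freqs / np.sum(word_freqs)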
    vocab_size = len(idx2word)
    print(vocab_size)
    return text, idx2word, word2idx, word_freqs, word_counts, vocab_size

In [5]:
text, idx2word, word2idx, word_freqs, word_counts, VOCAB_SIZE = text_preprocess(TRAIN_FILE)

30000


In [6]:
idx2word[:10]

['the', 'of', 'and', 'one', 'in', 'a', 'to', 'zero', 'nine', 'two']
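To see what the 3/4 power in `text_preprocess` does to the sampling distribution, here is a small sketch on made-up counts (the three numbers are hypothetical, not taken from text8). Raising the unigram distribution to the 3/4 power and renormalizing moves probability mass from very frequent words to rare ones, so rare words are picked as negative samples somewhat more often.

```python
import numpy as np

counts = np.array([900.0, 90.0, 10.0])   # hypothetical toy counts
unigram = counts / counts.sum()          # plain unigram distribution
smoothed = unigram ** 0.75               # same transformation as word_freqs
smoothed = smoothed / smoothed.sum()     # renormalize so it sums to 1

print("unigram :", unigram)
print("smoothed:", smoothed)
```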

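## Implementing the DataLoader
A dataloader needs to do the following:

- Encode all of the text as integer indices. (The paper additionally subsamples frequent words at this stage; that step is not implemented in this notebook.)
- Store the vocabulary, the word counts, and the normalized word frequencies
- Sample one center word per iteration
- Return the context words for the current center word
- Sample some negative words for the center word
- Keep the word counts available (the implementation below stores them but does not return them with each item)

Here is a good tutorial on how to use the [PyTorch dataloader](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html).
To use a dataloader, we need to define the following two functions on the dataset:

- ```__len__``` returns the number of items in the whole dataset
- ```__getitem__``` returns one item for a given index

With a dataloader in place, we can easily shuffle the whole dataset, fetch one batch at a time, and so on.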
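The two required methods can be illustrated with a minimal sketch before the real dataset is defined. `SquaresDataset` and `toy_loader` are hypothetical names used only for this example; the item format (a pair of scalar tensors) is likewise an assumption.

```python
import torch
import torch.utils.data as tud

class SquaresDataset(tud.Dataset):
    """Hypothetical toy dataset: item i is the pair (i, i*i)."""
    def __init__(self, n):
        self.n = n

    def __len__(self):
        # Number of items in the whole dataset
        return self.n

    def __getitem__(self, idx):
        # One item for the given index
        return torch.tensor(idx), torch.tensor(idx * idx)

# The DataLoader shuffles the indices and stacks items into batches
toy_loader = tud.DataLoader(SquaresDataset(10), batch_size=4, shuffle=True)
batches = [(x, y) for x, y in toy_loader]
for x, y in batches:
    print(x.tolist(), y.tolist())
```

With 10 items and a batch size of 4, the loader yields three batches, the last of which holds the remaining 2 items.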

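## Understanding the torch.multinomial Sampling Function
--------------
- torch.multinomial(input, num_samples, replacement=False, out=None) → LongTensor
    - It draws num_samples samples from each row of input; the output tensor holds the **indices** (within the corresponding row of input) of the elements drawn.
    - Its arguments are an input tensor, an integer number of samples, and a boolean replacement.
    - The input tensor acts as a weight tensor: each element is its weight within its row. An element with weight 0 is not drawn while elements with non-zero weight remain.
    - num_samples is the number of draws per row. When sampling without replacement it cannot exceed the number of elements in each row (the official documentation requires it to stay within the number of non-zero elements), otherwise an error is raised.
    - replacement controls whether sampling is with replacement: True means with replacement, False means without.
----
- Official example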
```python
weights = torch.tensor([0, 0.8, 2, 0], dtype=torch.float) # create a tensor of weights
print(torch.multinomial(weights, 2))
# tensor([2, 1])
print(torch.multinomial(weights, 4, replacement=True))
# tensor([2, 2, 1, 2])
```
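As a further sketch of the points above, the following example samples from a 2-D weight tensor (one independent draw sequence per row) and shows that requesting more samples than there are elements fails without replacement. The weights are made up for illustration.

```python
import torch

# Two rows of hypothetical weights; row 0 has two zero-weight entries
weights = torch.tensor([[0.0, 10.0, 3.0, 0.0],
                        [1.0, 1.0, 1.0, 1.0]])

# With replacement, num_samples may exceed the row length
with_repl = torch.multinomial(weights, 6, replacement=True)
print(with_repl.shape)        # one row of sampled indices per input row
print(with_repl[0].tolist())  # only indices 1 and 2 can appear here

# Without replacement, asking for 6 samples from 4 elements raises an error
raised = False
try:
    torch.multinomial(weights, 6, replacement=False)
except RuntimeError as err:
    raised = True
    print("sampling without replacement failed:", err)
```

This is why the dataset below samples negatives with `replacement=True`: it draws far fewer samples than the vocabulary size, but repeated negatives are harmless and sampling with replacement avoids the constraint entirely.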

In [7]:
class WordEmbeddingDataset(tud.Dataset):
    def __init__(self, text, word2idx, idx2word, word_freqs, word_counts):
        super(WordEmbeddingDataset, self).__init__()
        # Encode the text as indices; out-of-vocabulary words map to <unk>
        unk_idx = word2idx["<unk>"]
        self.text_encoded = [word2idx.get(t, unk_idx) for t in text]
        self.text_encoded = torch.tensor(self.text_encoded, dtype=torch.long)

        self.word2idx = word2idx
        self.idx2word = idx2word
        self.word_freqs = torch.Tensor(word_freqs)
        self.word_counts = torch.Tensor(word_counts)
        
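    def __len__(self):
        ''' 
        Return the length of the whole dataset (the total number of tokens)
        '''
        return len(self.text_encoded)
    
    def __getitem__(self, idx):
        ''' 
        Return the following data for training:
            - the center word
            - the (positive) words around the center word
            - K randomly sampled words per positive word, used as negative samples
        '''
        center_word = self.text_encoded[idx]
        # Positions of the surrounding words, excluding the center word itself
        pos_indices = list(range(idx-C, idx))+list(range(idx+1, idx+C+1))
        # Wrap around at both ends of the text so that every position is valid
        pos_indices = [i % len(self.text_encoded) for i in pos_indices]
        # Encoded context words
        pos_words = self.text_encoded[pos_indices]
        # Draw negatives from the smoothed unigram distribution, with replacement.
        # They are not filtered, so a negative may coincide with a true context word.
        neg_words = torch.multinomial(self.word_freqs, K * pos_words.shape[0], True)
        
        return center_word, pos_words, neg_words 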
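The window arithmetic in `__getitem__` is easy to check in isolation. The helper below, `context_indices`, is a hypothetical standalone copy of those two lines; it shows that positions near the start or end of the text wrap around to the other end instead of going out of range.

```python
def context_indices(idx, C, n):
    """Positions of the 2*C neighbours of idx in a text of length n, with wrap-around."""
    window = list(range(idx - C, idx)) + list(range(idx + 1, idx + C + 1))
    return [i % n for i in window]

print(context_indices(5, 3, 10))   # [2, 3, 4, 6, 7, 8]
print(context_indices(0, 3, 10))   # [7, 8, 9, 1, 2, 3]
print(context_indices(9, 3, 10))   # [6, 7, 8, 0, 1, 2]
```

The wrap-around means the first and last few tokens get context words from the opposite end of the corpus, which is a small amount of noise accepted in exchange for every item having exactly `2*C` context words.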

## Creating the Dataset and DataLoader

In [8]:
dataset = WordEmbeddingDataset(text, word2idx, idx2word, word_freqs, word_counts)
dataloader = tud.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=4)

In [9]:
i = 0
for center_word, pos_words, neg_words in dataloader:
    print(center_word[0], center_word.size())
    print(pos_words[0], pos_words.size())
    print(neg_words[0], neg_words.size())
    print("-"*20)
    i += 1
    if i == 1:
        break

tensor(428) torch.Size([128])
tensor([13053,   267,   314,     1,     0,   113]) torch.Size([128, 6])
tensor([ 9576,  2082,   279, 21829,  2247,   953,  3666, 23787,  1562,   465,
          408, 11302,  1286,     7,   546,     1,    39,  2556,    16,  6145,
         6253,    25,   476,   271,    16,  1830,    28,    12,  1607,    23,
         7888,     0,  7863,   331,   521,   406,  6392, 12881,  5695,  5322,
         2289,   697, 26196,  1396, 13726,  7781,    19, 21110,    60,    90,
           15,   776, 18314,  2433,    19, 12410,  9995, 16336, 29999,  1439,
         2480, 17925,    14,     2,  3978,  1225,   349,  1537,  2450,  1053,
        26771,  2577,     0,  4777,  1503,  6077,   601,   241,  1621,  8592,
           11, 24580,   103,   557,  9434,  9990,    15,     6, 27198,    32,
          926,  2860,   251,  2720,   196,    19, 10197,  3980,  5902,   117,
            1, 11078,  1063,   912,    26,    25,  7395,    43,  1201,   368,
          645,   761,    59, 26620, 25032, 

## Defining the PyTorch Model
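For orientation, the negative sampling objective from the paper, for one (input word, context word) pair, is reproduced here. In the notation of the paper, $v_{w_I}$ is the input vector of the center word, $v'_{w_O}$ is the output vector of a context word, and $P_n(w)$ is the noise distribution (the unigram distribution raised to the 3/4 power):

$$
\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]
$$

As a reading aid (the mapping is our interpretation of the code, not a statement from the paper): $v$ corresponds to `in_embed`, $v'$ to `out_embed`, and $k$ to `K`. The model sums this quantity over the `2*C` context words of each center word and returns its negative as the loss.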

In [10]:
class EmbeddingModel(nn.Module):
    def __init__(self, vocab_size, embed_size):
        ''' 
        Initialize the input and output embeddings
        '''
        super(EmbeddingModel, self).__init__()
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        
        # Open question: what is the benefit of initializing both embeddings this way?
        initrange = 0.5 / self.embed_size
        self.in_embed = nn.Embedding(self.vocab_size, self.embed_size, sparse=False)
        self.in_embed.weight.data.uniform_(-initrange, initrange)
        
        self.out_embed = nn.Embedding(self.vocab_size, self.embed_size, sparse=False)
        self.out_embed.weight.data.uniform_(-initrange, initrange)
        
    def forward(self, input_labels, pos_labels, neg_labels):
        '''
        input_labels: center words, [batch_size]
        pos_labels: words that appear in the context window around the center word, [batch_size, (window_size * 2)]
        neg_labels: words drawn by negative sampling (not taken from the context window),
                    [batch_size, (window_size * 2 * K)]
        
        return: loss, [batch_size]
        '''
        
        batch_size = input_labels.size(0)
        
        input_embedding = self.in_embed(input_labels) # [B, embed_size]
        pos_embedding = self.out_embed(pos_labels) # [B, 2*C, embed_size]
        neg_embedding = self.out_embed(neg_labels) # [B, 2*C*K, embed_size]
      
        # Batched dot products; squeeze only the last dim so a batch of size 1 keeps its batch dim
        log_pos = torch.bmm(pos_embedding, input_embedding.unsqueeze(2)).squeeze(2) # [B, 2*C]
        log_neg = torch.bmm(neg_embedding, -input_embedding.unsqueeze(2)).squeeze(2) # [B, 2*C*K]

        log_pos = F.logsigmoid(log_pos).sum(1)
        log_neg = F.logsigmoid(log_neg).sum(1) # [B]
       
        loss = log_pos + log_neg
        
        return -loss
    
    def input_embeddings(self):
        return self.in_embed.weight.data.cpu().numpy()
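The two operations named in the learning objectives, `bmm` and `logsigmoid`, can be checked on small random tensors. In this sketch the sizes `B`, `W`, and `D` are arbitrary illustrative values; it verifies that the `bmm` call used in `forward` is just a batch of dot products between each context vector and its center vector.

```python
import torch
import torch.nn.functional as F

B, W, D = 4, 6, 5                  # hypothetical batch size, context words, embedding dim
center = torch.randn(B, D)         # plays the role of input_embedding
context = torch.randn(B, W, D)     # plays the role of pos_embedding

# [B, W, D] x [B, D, 1] -> [B, W, 1] -> [B, W]
scores = torch.bmm(context, center.unsqueeze(2)).squeeze(2)

# The same dot products written with broadcasting
manual = (context * center.unsqueeze(1)).sum(dim=2)
print(scores.shape, torch.allclose(scores, manual, atol=1e-5))

# logsigmoid(x) is a numerically stable log(sigmoid(x)); its values are never positive
log_prob = F.logsigmoid(scores)
print(log_prob.sum(1).shape)       # one value per example, as in forward()
```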

In [13]:
model = EmbeddingModel(VOCAB_SIZE, EMBEDDING_SIZE)
model = model.to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

In [14]:
def evaluate(filename, embedding_weights): 
    # Word-similarity benchmark: the first three columns are read as word1, word2, human score
    if filename.endswith(".csv"):
        data = pd.read_csv(filename, sep=",")
    else:
        data = pd.read_csv(filename, sep="\t")
    human_similarity = []
    model_similarity = []
    for i in data.iloc[:, 0:2].index:
        word1, word2 = data.iloc[i, 0], data.iloc[i, 1]
        # Skip pairs that contain an out-of-vocabulary word
        if word1 not in word2idx or word2 not in word2idx:
            continue
        else:
            word1_idx, word2_idx = word2idx[word1], word2idx[word2]
            word1_embed, word2_embed = embedding_weights[[word1_idx]], embedding_weights[[word2_idx]]
            model_similarity.append(float(cosine_similarity(word1_embed, word2_embed)[0, 0]))
            human_similarity.append(float(data.iloc[i, 2]))

    # Spearman rank correlation between human and model similarities
    return scipy.stats.spearmanr(human_similarity, model_similarity)# , model_similarity

def find_nearest(word):
    # Relies on the global embedding_weights array (set in the training loop below)
    index = word2idx[word]
    embedding = embedding_weights[index]
    cos_dis = np.array([scipy.spatial.distance.cosine(e, embedding) for e in embedding_weights])
    return [idx2word[i] for i in cos_dis.argsort()[:10]]
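`find_nearest` computes one cosine distance per vocabulary word in a Python loop. A vectorized alternative, sketched below under the assumption that the embeddings are a 2-D NumPy array with no all-zero rows, normalizes the rows once and uses a single matrix-vector product. The function name `nearest_by_cosine` and the toy vectors are hypothetical.

```python
import numpy as np

def nearest_by_cosine(query_index, weights, top_k=3):
    """Indices of the top_k rows most similar (by cosine) to row query_index."""
    unit = weights / np.linalg.norm(weights, axis=1, keepdims=True)  # unit-length rows
    sims = unit @ unit[query_index]                                  # cosine similarities
    return np.argsort(-sims)[:top_k]                                 # most similar first

# Hypothetical 2-D "embeddings"
toy = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [0.0, 1.0],
                [-1.0, 0.0]])
nearest = nearest_by_cosine(0, toy)
print(nearest)   # [0 1 2]
```

As with `find_nearest`, the query word itself comes first because its similarity to itself is 1.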

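## Training the Model
-----
- A model is usually trained for several epochs
- In each epoch, we split all the data into batches
- Move the tensors of each batch to the target device (CUDA tensors when a GPU is available)
- Forward pass: from the center words, context words, and negative samples, compute the negative sampling loss
- Average the per-example loss over the batch
- Clear the model's current gradients
- Backward pass
- Update the model parameters
- Every fixed number of iterations, print the loss at the current iteration and evaluate the model on the validation data (the evaluation step is commented out in the code below)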

In [15]:

for e in range(EPOCHS):
    for i, (input_labels, pos_labels, neg_labels) in enumerate(dataloader):
        # TODO
        input_labels = input_labels.long()
        pos_labels = pos_labels.long()
        neg_labels = neg_labels.long()

        input_labels = input_labels.to(device)
        pos_labels = pos_labels.to(device)
        neg_labels = neg_labels.to(device)
            
        optimizer.zero_grad()
        loss = model(input_labels, pos_labels, neg_labels).mean()
        loss.backward()
        optimizer.step()

        if i % 1000 == 0:
            print("epoch: {}, iter: {}, loss: {}".format(e, i, loss.item()))
            
        
#         if i % 2000 == 0:
#             embedding_weights = model.input_embeddings()
#             sim_simlex = evaluate("simlex-999.txt", embedding_weights)
#             sim_men = evaluate("men.txt", embedding_weights)
#             sim_353 = evaluate("wordsim353.csv", embedding_weights)
#             with open(LOG_FILE, "a") as fout:
#                 print("epoch: {}, iteration: {}, simlex-999: {}, men: {}, sim353: {}, nearest to monster: {}\n".format(
#                     e, i, sim_simlex, sim_men, sim_353, find_nearest("monster")))
#                 fout.write("epoch: {}, iteration: {}, simlex-999: {}, men: {}, sim353: {}, nearest to monster: {}\n".format(
#                     e, i, sim_simlex, sim_men, sim_353, find_nearest("monster")))
                
    embedding_weights = model.input_embeddings()
    np.save("embedding-{}".format(EMBEDDING_SIZE), embedding_weights)
    torch.save(model.state_dict(), "embedding-{}.th".format(EMBEDDING_SIZE))

epoch: 0, iter: 0, loss: 420.0472106933594
epoch: 0, iter: 1000, loss: 110.64915466308594
epoch: 0, iter: 2000, loss: 63.69944763183594
epoch: 0, iter: 3000, loss: 55.475433349609375
epoch: 0, iter: 4000, loss: 45.3078498840332
epoch: 0, iter: 5000, loss: 41.215545654296875
epoch: 0, iter: 6000, loss: 37.93198776245117
epoch: 0, iter: 7000, loss: 35.823787689208984
epoch: 0, iter: 8000, loss: 35.2523193359375
epoch: 0, iter: 9000, loss: 35.85188674926758
epoch: 0, iter: 10000, loss: 35.528446197509766
epoch: 0, iter: 11000, loss: 34.1132698059082
epoch: 0, iter: 12000, loss: 34.4886474609375
epoch: 0, iter: 13000, loss: 33.55486297607422
epoch: 0, iter: 14000, loss: 33.06645202636719
epoch: 0, iter: 15000, loss: 35.073768615722656
epoch: 0, iter: 16000, loss: 33.03211212158203
epoch: 0, iter: 17000, loss: 32.87288284301758
epoch: 0, iter: 18000, loss: 32.25053024291992
epoch: 0, iter: 19000, loss: 33.15672302246094
epoch: 0, iter: 20000, loss: 33.965084075927734
epoch: 0, iter: 21000, 

epoch: 1, iter: 53000, loss: 30.422870635986328
epoch: 1, iter: 54000, loss: 30.511688232421875
epoch: 1, iter: 55000, loss: 30.220746994018555
epoch: 1, iter: 56000, loss: 30.517898559570312
epoch: 1, iter: 57000, loss: 30.228534698486328
epoch: 1, iter: 58000, loss: 29.6726016998291
epoch: 1, iter: 59000, loss: 30.818809509277344
epoch: 1, iter: 60000, loss: 30.15318489074707
epoch: 1, iter: 61000, loss: 30.516082763671875
epoch: 1, iter: 62000, loss: 30.354167938232422
epoch: 1, iter: 63000, loss: 30.65067481994629
epoch: 1, iter: 64000, loss: 30.16307830810547
epoch: 1, iter: 65000, loss: 30.29519271850586
epoch: 1, iter: 66000, loss: 30.450511932373047
epoch: 1, iter: 67000, loss: 30.45915985107422
epoch: 1, iter: 68000, loss: 30.23889923095703
epoch: 1, iter: 69000, loss: 30.60623550415039
epoch: 1, iter: 70000, loss: 30.522491455078125
epoch: 1, iter: 71000, loss: 30.621078491210938
epoch: 1, iter: 72000, loss: 30.627422332763672
epoch: 1, iter: 73000, loss: 30.85080337524414
ep
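The training loop saves the weights with `torch.save(model.state_dict(), ...)`, but the notebook never reads them back. A minimal save-and-load round trip looks like the sketch below. It uses a small stand-in `nn.Embedding` and a temporary file so that it runs on its own; for the notebook's model one would presumably construct `EmbeddingModel(VOCAB_SIZE, EMBEDDING_SIZE)` and load `"embedding-100.th"` in the same way.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in model (assumption: any nn.Module is saved and restored the same way)
toy_model = nn.Embedding(10, 4)
path = os.path.join(tempfile.mkdtemp(), "toy-embedding.th")

# Save only the parameters, as the training loop does
torch.save(toy_model.state_dict(), path)

# Rebuild a model with the same architecture, then load the parameters into it
restored = nn.Embedding(10, 4)
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()

same_weights = torch.equal(toy_model.weight, restored.weight)
print(same_weights)
```

Passing `map_location="cpu"` lets weights that were saved on a GPU be loaded on a machine without one.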

In [None]:
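## To be continued
 Points to understand:
    Sampling
    Why are the input and output embeddings kept separate?
    What exactly does the evaluation at the end do?
    How should the embeddings be evaluated?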
 