# Lesson 2: Word Vectors

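Learning objectives for Lesson 2
- Learn the concept of word vectors
- Train word vectors with the Skip-gram model
- Learn to use the PyTorch dataset and dataloader
- Learn to define a PyTorch model
- Learn common Modules in torch.nn
    - Embedding
- Learn common PyTorch operations
    - bmm
    - logsigmoid
- Save and load PyTorch models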
    

The training data for this lesson can be downloaded from the link below.

Link: https://pan.baidu.com/s/1tFeK3mXuVXEy3EMarfeWvg  Password: v2z5

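In this notebook we try to reproduce, as closely as we can, the method for training word vectors from the paper [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). We implement the Skip-gram model and use the paper's negative sampling objective, which the paper presents as a simplified form of noise contrastive estimation.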
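
For reference, the negative-sampling objective for one (center word, context word) pair, as given in the paper, can be written as below. Here $v_{w_I}$ is the input (center-word) vector, $v'_{w_O}$ is the output (context-word) vector, $k$ is the number of negative samples, and $P_n(w)$ is the noise distribution from which negatives are drawn.

$$
\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]
$$

Training maximizes this quantity, which is the same as minimizing its negative. In the code of this notebook, `in_embed` appears to play the role of $v$ and `out_embed` the role of $v'$.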

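The paper contains many implementation details that are critical to the quality of the word vectors. We cannot fully reproduce its experimental results, mainly because of limited computing resources and other practical constraints, but we can still show roughly how word vectors are trained.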

The following details are not implemented:
- subsampling: see Section 2.3 of the paper
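
Although subsampling is not part of this notebook, the idea from Section 2.3 can be sketched in a few lines: each occurrence of a word $w$ is discarded with probability $1 - \sqrt{t / f(w)}$, where $f(w)$ is the word's relative frequency and $t$ is a threshold. The function name `subsample`, the toy corpus, and the threshold value are assumptions made for illustration; the paper uses a much smaller threshold (around $10^{-5}$) on a far larger corpus.

```python
import math
import random
from collections import Counter

def subsample(tokens, t=1e-5, seed=0):
    """Drop each occurrence of word w with probability 1 - sqrt(t / f(w)),
    where f(w) is the relative frequency of w in `tokens`."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        freq = counts[w] / total
        # Words with f(w) <= t get a discard probability of 0 and are always kept.
        p_discard = max(0.0, 1.0 - math.sqrt(t / freq))
        if rng.random() >= p_discard:
            kept.append(w)
    return kept

# Toy corpus (hypothetical): "the" is very frequent, the other words are rare.
toy_tokens = ["the"] * 9000 + ["cat", "sat", "mat", "hat"] * 250
kept_tokens = subsample(toy_tokens, t=0.03)
kept_counts = Counter(kept_tokens)
print(kept_counts)
```

With this toy threshold the rare words are always kept, while most occurrences of "the" are dropped, which is the intended effect of subsampling.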

In [1]:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as tud
from torch.nn.parameter import Parameter
   
from collections import Counter
import numpy as np
import random
import math

import pandas as pd
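import scipy
import scipy.stats # spearmanr, used in evaluate()
import scipy.spatial.distance # cosine, used in find_nearest()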
import sklearn
from sklearn.metrics.pairwise import cosine_similarity # compute cosine similarity
import os, sys
pwd = os.getcwd()
father_path = os.path.abspath(os.path.dirname(pwd)+os.path.sep+".")
grader_father = os.path.abspath(os.path.dirname(pwd)+os.path.sep+"..")
sys.path.append(father_path)
sys.path.append(grader_father)
print(pwd,"\n",father_path,"\n",grader_father)

USE_CUDA = torch.cuda.is_available()

# To make the results reproducible, we fix the various random seeds to a constant value
random.seed(53113)
np.random.seed(53113)
torch.manual_seed(53113)
if USE_CUDA:
    torch.cuda.manual_seed(53113)
    
# Hyperparameters
BATCH_SIZE = 128 # the batch size
LEARNING_RATE = 0.2 # the initial learning rate

K = 100 # number of negative samples
C = 3 # nearby words threshold (context window radius), usually set to 3~5
NUM_EPOCHS = 2 # The number of epochs of training
MAX_VOCAB_SIZE = 30000 # the vocabulary size
EMBEDDING_SIZE = 100
     
LOG_FILE = "word-embedding.log"

# tokenize function: split a text into individual words
def word_tokenize(text):
    return text.split()



D:\00 PERSONAL-LEARNING\python_learn\第二课资料 
 D:\00 PERSONAL-LEARNING\python_learn 
 D:\00 PERSONAL-LEARNING


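### 1. Build a vocabulary from the corpus
### 2. Create an index for every word in the vocabulary ({word: index})
### 3. Compute each word's frequency and normalize it for negative sampling

- Read all the text from the file and build a vocabulary from it.
- Because the number of distinct words can be very large, we keep only the most common ones, MAX_VOCAB_SIZE entries in total including the UNK token.
- We add an UNK token that stands for all the uncommon words.
- We need to record the word-to-index mapping, the index-to-word mapping, the word counts, the (normalized) word frequencies, and the total number of words.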

In [2]:
# Dataset paths
TRAIN_PATH = os.path.join(pwd,"data/text8/text8.train.txt")
SIMLEX_PATH = os.path.join(pwd,"data/simlex-999.txt")
MEN_PATH = os.path.join(pwd,"data/men.txt")
SIM353_PATH = os.path.join(pwd,"data/wordsim353.csv")
print(TRAIN_PATH)

with open(TRAIN_PATH, "r") as fin:
    text = fin.read()
    
text = [w for w in word_tokenize(text.lower())]
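# Count how often each word occurs in the corpus and keep the MAX_VOCAB_SIZE-1 most frequent words
vocab = dict(Counter(text).most_common(MAX_VOCAB_SIZE-1))
vocab["<unk>"] = len(text) - np.sum(list(vocab.values())) # the remaining slot goes to "<unk>", which counts all other words
# Every word fed to the model must be converted to a number (its index)
# idx_to_word: list, index -> word
idx_to_word = [word for word in vocab.keys()] 
# word_to_idx: dict {word: index}, the word's position in the vocabulary; the model works on numbers, not on strings
word_to_idx = {word:i for i, word in enumerate(idx_to_word)}
# Word frequencies, needed for negative sampling as described in the paper
word_counts = np.array([count for count in vocab.values()], dtype=np.float32)
word_freqs = word_counts / np.sum(word_counts) # normalize
# Raising to the 3/4 power raises the relative weight of rare words and lowers that of frequent words,
# so rare (often more informative) words are sampled as negatives more often than their raw frequency suggests
word_freqs = word_freqs ** (3./4.)   
word_freqs = word_freqs / np.sum(word_freqs) # used for negative sampling
# Reset VOCAB_SIZE in case the corpus has fewer distinct words than MAX_VOCAB_SIZE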
VOCAB_SIZE = len(idx_to_word)

D:\00 PERSONAL-LEARNING\python_learn\第二课资料\data/text8/text8.train.txt
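
To see what the 3/4 power does to the sampling distribution, here is a minimal sketch on a made-up three-word vocabulary. The counts are hypothetical and chosen only to make the effect visible.

```python
import numpy as np

# Hypothetical counts for a frequent, a medium and a rare word.
toy_counts = np.array([900.0, 90.0, 10.0])

unigram = toy_counts / toy_counts.sum()   # raw relative frequencies
smoothed = unigram ** 0.75                # flatten the distribution
smoothed = smoothed / smoothed.sum()      # renormalize so it sums to 1

print("unigram :", unigram)
print("smoothed:", smoothed)
```

The most frequent word loses probability mass and the rarest word gains some, while the ordering of the words stays the same.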


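### Implementing the Dataloader

A dataloader needs to do the following:

- Encode all the text as numbers, then preprocess it with subsampling (subsampling itself is not implemented in this notebook).
- Store the vocabulary, the word counts, and the normalized word frequencies.
- Sample one center word per iteration.
- Return the context words for the current center word.
- Sample some negative words for the center word.
- Return the word counts.

There is a good tutorial on how to use the [PyTorch dataloader](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html).
To work with a dataloader, the dataset class must define the following two methods:

- ```__len__``` returns the number of items in the whole dataset
- ```__getitem__``` returns one item for a given index

With a dataloader in place, we can easily shuffle the whole dataset, fetch a batch of data, and so on.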
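
The two methods can be illustrated with a deliberately tiny dataset before looking at the real one. `SquaresDataset` is a hypothetical class invented for this sketch; only `Dataset` and `DataLoader` come from PyTorch.

```python
import torch
import torch.utils.data as tud

class SquaresDataset(tud.Dataset):
    """Hypothetical toy dataset: item i is the pair (i, i * i)."""
    def __init__(self, n):
        self.n = n

    def __len__(self):
        # Number of items in the whole dataset.
        return self.n

    def __getitem__(self, idx):
        # One item for the given index.
        return torch.tensor(idx), torch.tensor(idx * idx)

# The DataLoader stacks individual items into batches of up to 4.
toy_loader = tud.DataLoader(SquaresDataset(10), batch_size=4, shuffle=False)
batches = [(x.tolist(), y.tolist()) for x, y in toy_loader]
for xs, ys in batches:
    print(xs, ys)
```

With 10 items and a batch size of 4, the loader yields two full batches and a final batch of two items. Setting `shuffle=True` would randomize the order on every pass.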

In [22]:
class WordEmbeddingDataset(tud.Dataset):
    def __init__(self, text, VOCAB_SIZE, word_to_idx, idx_to_word, word_freqs):
        ''' text: a list of words, all text from the training dataset
            VOCAB_SIZE: the vocabulary size; VOCAB_SIZE-1 is used as the index of unknown words
            word_to_idx: the dictionary from word to idx
            idx_to_word: idx to word mapping
            word_freqs: the normalized frequency of each word, used for negative sampling
        '''
        super(WordEmbeddingDataset, self).__init__()
        self.text_encoded = [word_to_idx.get(t, VOCAB_SIZE-1) for t in text]
        self.text_encoded = torch.Tensor(self.text_encoded).long()
        self.word_to_idx = word_to_idx
        self.idx_to_word = idx_to_word
        self.word_freqs = torch.Tensor(word_freqs)
        
    def __len__(self):
        return len(self.text_encoded)
        
    def __getitem__(self, idx):
        ''' Return the following data for training:
            - the center word
            - the (positive) words near the center word
            - K randomly sampled words per positive word, used as negative samples
        '''
        center_word = self.text_encoded[idx]
        pos_indices = list(range(idx-C, idx)) + list(range(idx+1, idx+C+1))
        pos_indices = [i%len(self.text_encoded) for i in pos_indices]
        pos_words = self.text_encoded[pos_indices] 
        neg_words = torch.multinomial(self.word_freqs, K * pos_words.shape[0])
        return center_word, pos_words, neg_words 
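
The index arithmetic in `__getitem__` is easier to follow in isolation. The helper below, `context_indices`, is a hypothetical stand-alone version of those two lines, written only to show how the modulo wraps positions around the ends of the corpus.

```python
def context_indices(idx, window, length):
    """Positions of the `window` words on each side of `idx`, wrapping around
    the ends of a sequence of `length` tokens, as the modulo in __getitem__ does."""
    raw = list(range(idx - window, idx)) + list(range(idx + 1, idx + window + 1))
    return [i % length for i in raw]

middle = context_indices(5, 3, 10)   # [2, 3, 4, 6, 7, 8]
start = context_indices(0, 3, 10)    # [7, 8, 9, 1, 2, 3]
end = context_indices(9, 3, 10)      # [6, 7, 8, 0, 1, 2]
print(middle, start, end)
```

One consequence of this design is that the first and last few words of the corpus borrow "context" from the opposite end. That keeps every item the same size, which batching requires, at the cost of a small amount of noise.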

The dataset and dataloader are created after the model definition below.

### Defining the PyTorch model

In [15]:
class EmbeddingModel(nn.Module):
    def __init__(self, vocab_size, embed_size):
        ''' Initialize the input and output embeddings
        '''
        super(EmbeddingModel, self).__init__()
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.in_embed = nn.Embedding(self.vocab_size, self.embed_size, sparse=False)
        self.out_embed = nn.Embedding(self.vocab_size, self.embed_size, sparse=False)
        
        initrange = 0.5 / self.embed_size
        self.out_embed.weight.data.uniform_(-initrange, initrange)
        self.in_embed.weight.data.uniform_(-initrange, initrange)
        
    def forward(self, input_labels, pos_labels, neg_labels):
        '''
        input_labels: center words, [batch_size]
        pos_labels: words that appear in the context window around each center word, [batch_size, (window_size * 2)]
        neg_labels: words drawn by negative sampling (not guaranteed to be absent from the context window), [batch_size, (window_size * 2 * K)]
        return: loss, [batch_size]
        '''
        input_embedding = self.in_embed(input_labels) # B * embed_size
        pos_embedding = self.out_embed(pos_labels) # B * (2*C) * embed_size
        neg_embedding = self.out_embed(neg_labels) # B * (2*C * K) * embed_size
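        # squeeze(2) removes only the trailing singleton dimension, so a batch of size 1 keeps its batch dimension
        log_pos = torch.bmm(pos_embedding, input_embedding.unsqueeze(2)).squeeze(2) # B * (2*C)
        log_neg = torch.bmm(neg_embedding, -input_embedding.unsqueeze(2)).squeeze(2) # B * (2*C*K)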
        log_pos = F.logsigmoid(log_pos).sum(1)
        log_neg = F.logsigmoid(log_neg).sum(1) # batch_size
        loss = log_pos + log_neg
        return -loss
    
    def input_embeddings(self):
        return self.in_embed.weight.data.cpu().numpy()
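
The tensor shapes in `forward` can be checked on random data without any embeddings. The sizes `B`, `C_toy`, `K_toy` and `D` below are small hypothetical values, and the `einsum` line is included only as a cross-check of what `bmm` computes.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, C_toy, K_toy, D = 4, 3, 5, 8   # hypothetical toy sizes

center = torch.randn(B, D)                    # [B, D]
pos = torch.randn(B, 2 * C_toy, D)            # [B, 2C, D]
neg = torch.randn(B, 2 * C_toy * K_toy, D)    # [B, 2C*K, D]

# bmm multiplies [B, N, D] by [B, D, 1], giving one dot product per context word.
pos_scores = torch.bmm(pos, center.unsqueeze(2)).squeeze(2)    # [B, 2C]
neg_scores = torch.bmm(neg, -center.unsqueeze(2)).squeeze(2)   # [B, 2C*K]

# logsigmoid is always negative, so the negated sum is a positive loss per example.
loss = -(F.logsigmoid(pos_scores).sum(1) + F.logsigmoid(neg_scores).sum(1))  # [B]

# The same positive scores written as an einsum, as a sanity check.
pos_scores_einsum = torch.einsum("bnd,bd->bn", pos, center)
print(pos_scores.shape, neg_scores.shape, loss.shape)
```

Negating the center vector before the second `bmm` is what turns $\sigma(x)$ into $\sigma(-x)$ for the negative samples.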
        

Create the dataset and dataloader. The model is instantiated and moved to the GPU later, in the training cell.

In [26]:
dataset = WordEmbeddingDataset(text, VOCAB_SIZE, word_to_idx, idx_to_word, word_freqs)
dataloader = tud.DataLoader(dataset, 
                            batch_size = BATCH_SIZE, 
                            shuffle = True, 
                            num_workers = 0)     


In [28]:
next(iter(dataloader))

[tensor([   61,     5,    22,     6,   786, 29999,   126,  7321,    37,  6701,
         10888,    13,     7,    41,     5,    10,     0,     1,   548,    24,
          2878,  4897,   812, 29999,     4, 29999,    12,  2258,     0, 10101,
            34,   190,     1,   135,     1,    89,   150,  2384,  1053,   614,
             6,   407,    10,    39,     1,   325,     5,     7,  5407, 11807,
           750,    31,    16,  1389,   552,  2579,    68,  5214,     5,  1559,
            85,     8,   431,  3730,   104,  2366,    15,     6,     0,  6480,
            51,     1,  7494,    13,     3,     3,   181,  1873,   338,     2,
            21, 21431,   301,    14,    28,   324,  1569,   136,   731,   625,
             0,   703,    15,     8,  1734,    30,  1110,  4920,    25,   349,
           742,  2322,     4,     7,   958, 29999,  7555,     3,     4,   345,
             6,    67,     5,    52,     9,     0, 15525,   154,  4935,     0,
          7185,     0,   115,     2,   566,     2,  

Below is the code for evaluating the model, followed by the code for training it.

In [7]:
def evaluate(filename, embedding_weights): 
    if filename.endswith(".csv"):
        data = pd.read_csv(filename, sep=",")
    else:
        data = pd.read_csv(filename, sep="\t")
    human_similarity = []
    model_similarity = []
    for i in data.iloc[:, 0:2].index:
        word1, word2 = data.iloc[i, 0], data.iloc[i, 1]
        if word1 not in word_to_idx or word2 not in word_to_idx:
            continue
        else:
            word1_idx, word2_idx = word_to_idx[word1], word_to_idx[word2]
            word1_embed, word2_embed = embedding_weights[[word1_idx]], embedding_weights[[word2_idx]]
            # cosine_similarity returns a 1x1 array here; index it before converting to float
            model_similarity.append(float(cosine_similarity(word1_embed, word2_embed)[0, 0]))
            human_similarity.append(float(data.iloc[i, 2]))

    return scipy.stats.spearmanr(human_similarity, model_similarity)# , model_similarity

def find_nearest(word):
    index = word_to_idx[word]
    embedding = embedding_weights[index]
    cos_dis = np.array([scipy.spatial.distance.cosine(e, embedding) for e in embedding_weights])
    return [idx_to_word[i] for i in cos_dis.argsort()[:10]]
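
The evaluation above compares two rankings: human similarity judgments and the model's cosine similarities. A minimal sketch with made-up scores for four word pairs shows what the Spearman correlation measures. The numbers are hypothetical and are not taken from any of the datasets.

```python
from scipy.stats import spearmanr

# Hypothetical scores for four word pairs.
human_scores = [9.0, 7.5, 3.0, 1.0]      # human similarity ratings
model_scores = [0.80, 0.95, 0.10, 0.20]  # cosine similarities from a model

# Spearman correlation depends only on the rank order of the scores,
# so the different scales of the two lists do not matter.
rho, pvalue = spearmanr(human_scores, model_scores)
print(rho)
```

Here the model swaps the order of the top two pairs and of the bottom two pairs, so the correlation is positive but below 1.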

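Training the model:
- The model is usually trained for several epochs
- In each epoch, all the data is split into batches
- The tensors of each batch are moved to CUDA when a GPU is available
- Forward pass: compute the negative-sampling loss from the center words, their context (positive) words, and the sampled negative words
- Average the per-example loss over the batch
- Clear the model's current gradients
- Backward pass
- Update the model parameters
- Every fixed number of iterations, log the current loss and evaluate the embeddings on the word-similarity datasets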

In [8]:
model = EmbeddingModel(VOCAB_SIZE, EMBEDDING_SIZE)
if USE_CUDA:
    model = model.cuda()

optimizer = torch.optim.SGD(model.parameters(), lr = LEARNING_RATE)
for e in range(NUM_EPOCHS):
    for i, (input_labels, pos_labels, neg_labels) in enumerate(dataloader):
        if USE_CUDA:
            input_labels = input_labels.cuda()
            pos_labels = pos_labels.cuda()
            neg_labels = neg_labels.cuda()
            
        optimizer.zero_grad()
        loss = model(input_labels, pos_labels, neg_labels).mean()
        loss.backward()
        optimizer.step()

        if i % 1000 == 0:
            with open(LOG_FILE, "a") as fout:
                fout.write("epoch: {}, iter: {}, loss: {}\n".format(e, i, loss.item()))
                print("epoch: {}, iter: {}, loss: {}".format(e, i, loss.item()))
            
        if i % 5000 == 0:
            embedding_weights = model.input_embeddings()
            sim_simlex = evaluate(SIMLEX_PATH, embedding_weights)
            sim_men = evaluate(MEN_PATH, embedding_weights)
            sim_353 = evaluate(SIM353_PATH, embedding_weights)
            with open(LOG_FILE, "a") as fout:
                print("epoch: {}, iteration: {}, simlex-999: {}, men: {}, sim353: {}, nearest to monster: {}\n".format(
                    e, i, sim_simlex, sim_men, sim_353, find_nearest("monster")))
                fout.write("epoch: {}, iteration: {}, simlex-999: {}, men: {}, sim353: {}, nearest to monster: {}\n".format(
                    e, i, sim_simlex, sim_men, sim_353, find_nearest("monster")))
                
#     embedding_weights = model.input_embeddings()
    torch.save(model.state_dict(), "embedding-{}.th".format(EMBEDDING_SIZE))

BrokenPipeError: [Errno 32] Broken pipe

In [31]:
model.load_state_dict(torch.load("embedding-{}.th".format(EMBEDDING_SIZE)))

<All keys matched successfully>
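
Saving and loading through `state_dict` can be tried on a small module first. The sketch below assumes a throwaway `nn.Embedding` and a hypothetical file name inside a temporary directory, so it does not touch the checkpoint written by the training cell.

```python
import os
import tempfile

import torch
import torch.nn as nn

source = nn.Embedding(10, 4)
target = nn.Embedding(10, 4)   # same architecture, different random weights

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-embedding.th")   # hypothetical file name
    torch.save(source.state_dict(), path)
    # map_location="cpu" lets a checkpoint saved on a GPU be loaded on a CPU-only machine.
    state = torch.load(path, map_location="cpu")
    result = target.load_state_dict(state)

weights_match = torch.equal(source.weight, target.weight)
print(result, weights_match)
```

Only the parameters are stored, so the receiving model has to be constructed with the same architecture before `load_state_dict` is called.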

## Evaluation on the MEN, SimLex-999 and WordSim353 datasets

In [32]:
embedding_weights = model.input_embeddings()
print("simlex-999", evaluate(SIMLEX_PATH, embedding_weights))
print("men", evaluate(MEN_PATH, embedding_weights))
print("wordsim353", evaluate(os.path.join(pwd, "data/wordsim353.csv"), embedding_weights))

simlex-999 SpearmanrResult(correlation=0.23862807852014967, pvalue=8.082212007686697e-14)
men SpearmanrResult(correlation=0.3879450295711117, pvalue=1.6836442672726613e-93)
wordsim353 SpearmanrResult(correlation=0.4607564965641069, pvalue=3.598974422141974e-18)


## Finding nearest neighbors

In [33]:
for word in ["good", "fresh", "monster", "green", "like", "america", "chicago", "work", "computer", "language"]:
    print(word, find_nearest(word))

good ['good', 'bad', 'luck', 'quick', 'safe', 'loving', 'happy', 'nice', 'fun', 'pretty']
fresh ['fresh', 'grazing', 'liquor', 'brackish', 'cloth', 'grain', 'rotting', 'smoke', 'clean', 'frozen']
monster ['monster', 'ness', 'giant', 'loch', 'creature', 'beast', 'hammer', 'serpent', 'snake', 'killer']
green ['green', 'yellow', 'blue', 'gray', 'purple', 'colored', 'pink', 'orange', 'white', 'grey']
like ['like', 'resemble', 'resembling', 'including', 'similarly', 'exotic', 'such', 'unlike', 'eg', 'etc']
america ['america', 'europe', 'carolina', 'americas', 'asia', 'africa', 'dakota', 'oceania', 'cherokee', 'australia']
chicago ['chicago', 'illinois', 'boston', 'detroit', 'atlanta', 'cincinnati', 'milwaukee', 'cleveland', 'baltimore', 'philadelphia']
work ['work', 'composing', 'filmmaking', 'works', 'compose', 'inventions', 'inventing', 'experimentation', 'pioneering', 'journalistic']
computer ['computer', 'computers', 'computing', 'hardware', 'graphics', 'programmable', 'electronics', 'm

## Relationships between words

In [34]:
man_idx = word_to_idx["man"] 
king_idx = word_to_idx["king"] 
woman_idx = word_to_idx["woman"]
embedding = embedding_weights[woman_idx] - embedding_weights[man_idx] + embedding_weights[king_idx]
cos_dis = np.array([scipy.spatial.distance.cosine(e, embedding) for e in embedding_weights])
for i in cos_dis.argsort()[:20]:
    print(idx_to_word[i])

king
throne
princess
son
emperor
constantine
vii
claudius
daughter
charlemagne
heir
prince
eldest
ruler
isabella
tudor
empress
kings
augustus
baldwin
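
In the output above the query word "king" itself ranks first, which is common when the input words are left in the candidate set. A minimal sketch of the same analogy search that excludes the three input words is shown below. The function `analogy` and the hand-made three-dimensional vectors are hypothetical; they are built so that the arithmetic works out exactly and are not trained embeddings.

```python
import numpy as np

def analogy(a, b, c, vocab, vectors, topn=3):
    """Return the words closest (by cosine similarity) to vec(b) - vec(a) + vec(c),
    excluding the three input words themselves."""
    index = {w: i for i, w in enumerate(vocab)}
    query = vectors[index[b]] - vectors[index[a]] + vectors[index[c]]
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ (query / np.linalg.norm(query))
    for w in (a, b, c):
        sims[index[w]] = -np.inf   # never return an input word
    return [vocab[i] for i in np.argsort(-sims)[:topn]]

# Hand-made toy vectors (hypothetical): axis 1 encodes "female", axis 2 encodes "royal".
toy_vocab = ["man", "woman", "king", "queen", "apple"]
toy_vectors = np.array([
    [1.0, 0.0, 0.0],    # man
    [1.0, 1.0, 0.0],    # woman
    [1.0, 0.0, 1.0],    # king
    [1.0, 1.0, 1.0],    # queen
    [0.0, 0.2, -1.0],   # apple
])

result = analogy("man", "woman", "king", toy_vocab, toy_vectors, topn=1)
print(result)
```

Normalizing all vectors once and using a single matrix product also avoids the Python-level loop over the vocabulary.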
