### CBOW

context --> target

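CBOW is essentially the same model as skip-gram with the inputs and outputs swapped: the surrounding context words predict the center (target) word. As a result, the way the word embeddings are combined and the way training examples are generated from sentences both differ from skip-gram.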
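To make the difference concrete, here is a minimal sketch (not the notebook's own code) of how the two models pair up words for one toy sentence. The helper names `skipgram_pairs` and `cbow_pairs` and the sample tokens are illustrative assumptions. Like the cells further down, the CBOW helper keeps only positions that have a full window on both sides.

```python
def skipgram_pairs(tokens, window):
    """(center, context) pairs: one training example per neighbour."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(i - window, 0), min(i + window + 1, len(tokens))
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def cbow_pairs(tokens, window):
    """([context words], center) pairs: one example per full window."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs


tokens = ["the", "film", "is", "quite", "good"]
print(skipgram_pairs(tokens, 1)[:3])
print(cbow_pairs(tokens, 1))
```

With a window of 1, skip-gram produces 8 single-word pairs for this sentence, while CBOW produces 3 examples that each carry 2 context words.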

In [51]:
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import random
import os
import string
from nltk.corpus import stopwords
from tensorflow.python.framework import ops
import collections
import pickle

In [2]:
vocabulary_size = 10000
embedding_size = 200
batch_size = 50
windows_size = 2
num_sampled = int(batch_size/2) # number of negative (noise) classes sampled per batch for the NCE loss
epochs = 1000

valid_words = ['cliche','love','hate']

sess = tf.Session()

### 1. Load the data

In [6]:
data_file_path = os.path.join('dataset','rt-polaritydata')

def load_data(data_file_path):
    pos_file = os.path.join(data_file_path,'rt-polarity.pos')# positive reviews
    neg_file = os.path.join(data_file_path,'rt-polarity.neg')# negative reviews

    #read data
    pos_data = []
    with open(pos_file,'r',encoding='gb18030',errors='ignore') as temp_pos_file:
        for row in temp_pos_file:
            pos_data.append(row)

    neg_data = []
    with open(neg_file,'r',encoding='gb18030',errors='ignore') as temp_neg_file:
        for row in temp_neg_file:
            neg_data.append(row)

    texts = pos_data+neg_data
    target = [1]*len(pos_data)+[0]*len(neg_data)
    
    return texts,target

texts,target = load_data(data_file_path)

### 2. Preprocess the data
#### 2.1 Lowercase, remove punctuation, digits and extra whitespace, remove stopwords

In [9]:
def normalize_text(texts):
    stops = stopwords.words('english')# the NLTK stopwords corpus must be downloaded beforehand and placed where NLTK can find it

    # Lower case
    texts = [x.lower() for x in texts]

    # Remove punctuation
    texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]

    # Remove numbers
    texts = [''.join(c for c in x if c not in '0123456789') for x in texts]

    # Remove stopwords
    texts = [' '.join([word for word in x.split() if word not in (stops)]) for x in texts]

    # Trim extra whitespace
    texts = [' '.join(x.split()) for x in texts]
    
    return(texts)

texts = normalize_text(texts)
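As a quick illustration of what these steps do to a single review, the sketch below applies the same sequence to one string. `TOY_STOPS` is a hypothetical stand-in for `stopwords.words('english')` so the example runs without NLTK, and the sample sentence is made up.

```python
import string

# Hypothetical, tiny stand-in for the NLTK English stopword list.
TOY_STOPS = {"the", "is", "a", "of"}


def normalize_one(text, stops=TOY_STOPS):
    text = text.lower()
    # Drop punctuation, then digits, character by character.
    text = "".join(c for c in text if c not in string.punctuation)
    text = "".join(c for c in text if c not in "0123456789")
    # split() also collapses runs of whitespace.
    return " ".join(w for w in text.split() if w not in stops)


print(normalize_one("The film is a 10/10,  truly one-of-a-kind!"))
# film truly oneofakind
```

Because hyphens are deleted rather than replaced by spaces, hyphenated words are fused into one token, which would explain vocabulary entries such as `selfimportant` and `oftenfunny` later in the notebook.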

#### 2.2 Keep only reviews with more than 2 words (at least 3), so that each review is informative

In [10]:
target = [target[ix] for ix, x in enumerate(texts) if len(x.split()) > 2]
texts = [x for x in texts if len(x.split()) > 2]

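### 3. Build the vocabulary

1. Count word frequencies and keep the `vocabulary_size - 1` most frequent words with their counts (index 0 is reserved for `RARE`);
2. Assign each word an integer ID. Each new word receives the current length of the dictionary, which yields unique, increasing IDs.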

In [12]:
def build_dictionary(sentences,vocabulary_size):

    # Turn sentences (list of strings) into lists of words
    split_sentences = [s.split() for s in sentences]
    words = [x for sublist in split_sentences for x in sublist]

    # Initialize list of [word, word_count] for each word, starting with unknown
    count = [('RARE', -1)]

    # Now add most frequent words, limited to the N-most frequent (N=vocabulary size) 
    # most_common(n) returns the n most frequent words
    count.extend(collections.Counter(words).most_common(vocabulary_size-1))
    
    # Now create the dictionary
    word_dictionary = {}
    # For each word, that we want in the dictionary, add it, then make it
    # the value of the prior dictionary length
    for i,(word, word_count) in enumerate(count):
        word_dictionary[word] = len(word_dictionary)
    
    return word_dictionary

word_dictionary = build_dictionary(texts,vocabulary_size)
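A toy run of the same scheme shows how the IDs come out. This is a compact sketch that mirrors `build_dictionary`; the function name `toy_dictionary` and the two sample sentences are assumptions made for illustration.

```python
import collections


def toy_dictionary(sentences, vocabulary_size):
    words = [w for s in sentences for w in s.split()]
    # Index 0 is reserved for out-of-vocabulary words.
    count = [("RARE", -1)]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    word_dictionary = {}
    for word, _ in count:
        # The current dictionary length is the next free ID.
        word_dictionary[word] = len(word_dictionary)
    return word_dictionary


toy = toy_dictionary(["good film good cast", "bad film"], vocabulary_size=3)
print(toy)
# {'RARE': 0, 'good': 1, 'film': 2}
```

With `vocabulary_size=3` only the two most frequent words survive; `cast` and `bad` are left out of the dictionary and will later be mapped to `RARE`.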

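### 4. Map the words in the dataset to integer IDs
A word that is among the top N by frequency (that is, present in `word_dictionary`) is mapped to its ID; any other word is mapped to 0, the ID of `RARE`.
- input: `sentences`, whose elements are the English words themselves
- output: `text_data`, whose elements are the integer IDs looked up in `word_dictionary`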

In [13]:
def text_to_number(sentences):
    text_data = []
    for sentence in sentences:
        sentence_data = []
        for word in sentence.split():
            if word in word_dictionary:
                sentence_data.append(word_dictionary[word])
            else:
                sentence_data.append(0)
        text_data.append(sentence_data)
    return text_data

text_data = text_to_number(texts)
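The mapping rule can be checked on a toy vocabulary. The sketch below uses `dict.get` with a default of 0, which behaves the same as the `if`/`else` above; the dictionary contents and the `encode` helper are illustrative assumptions.

```python
# Hypothetical three-entry vocabulary; ID 0 stands for RARE.
toy_dictionary = {"RARE": 0, "good": 1, "film": 2}


def encode(sentence, dictionary):
    # Unknown words fall back to 0, the RARE ID.
    return [dictionary.get(word, 0) for word in sentence.split()]


print(encode("good film bad cast", toy_dictionary))
# [1, 2, 0, 0]
```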

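Map the validation words to IDs as well. The validation words must be among the top-N most frequent words, otherwise the dictionary lookup below raises a `KeyError`.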

In [14]:
# choose from the top-N vocabulary words
valid_example = [word_dictionary[x] for x in valid_words]
valid_example

[1434, 28, 980]

In [42]:
rand_sentence = random.choice(text_data)  # random.choice also works on ragged lists of lists
# Generate consecutive windows to look at
window_sequences = [rand_sentence[max((ix - windows_size), 0):(ix + windows_size + 1)] for ix, x in
                    enumerate(rand_sentence)]
# Select the center word:
# denote which element of each window is the center word of interest
label_indices = [ix if ix < windows_size else windows_size for ix, x in enumerate(window_sequences)]

# Unlike skip-gram, which builds (center word, context word) pairs, CBOW builds ([context words...], center word)
# Pull out center word of interest for each window and create a tuple for each window
batch_and_labels = [(x[:y] + x[(y + 1):], x[y]) for x, y in zip(window_sequences, label_indices)]
# This is now a list of tuples (surrounding words, target word)

# x holds the context words and y the center word, so keep only the examples with exactly
# 2*windows_size context words; windows truncated at either end of the sentence are dropped
batch_and_labels = [(x,y) for x,y in batch_and_labels if len(x)==2*windows_size]

# extract batch and labels
batch, labels = [list(x) for x in zip(*batch_and_labels)]
batch, labels

([[184, 2556, 2336, 0],
  [2556, 2611, 0, 8989],
  [2611, 2336, 8989, 96],
  [2336, 0, 96, 546]],
 [2611, 2336, 0, 8989])

### 5. Generate batches for the CBOW model

In [49]:
def generate_batch_data(sentences, batch_size, windows_size):
    # Fill up data batch
    batch_data = []
    label_data = []
    while len(batch_data) < batch_size:
        batch, labels = [], []
        # select random sentence to start
        rand_sentence = random.choice(sentences)  # random.choice also works on ragged lists of lists
        # Generate consecutive windows to look at
        window_sequences = [rand_sentence[max((ix - windows_size), 0):(ix + windows_size + 1)] for ix, x in
                            enumerate(rand_sentence)]
        # Select the center word:
        # denote which element of each window is the center word of interest
        label_indices = [ix if ix < windows_size else windows_size for ix, x in enumerate(window_sequences)]

        # Unlike skip-gram, which builds (center word, context word) pairs, CBOW builds ([context words...], center word)
        # Pull out center word of interest for each window and create a tuple for each window
        batch_and_labels = [(x[:y] + x[(y + 1):], x[y]) for x, y in zip(window_sequences, label_indices)]
        # This is now a list of tuples (surrounding words, target word)

        # x holds the context words and y the center word, so keep only the examples with exactly
        # 2*windows_size context words; windows truncated at either end of the sentence are dropped
        batch_and_labels = [(x,y) for x,y in batch_and_labels if len(x)==2*windows_size]
        
        # extract batch and labels
        # One check added compared with the book's example: sentences were only filtered to more than
        # 2 words, while each input x needs 2*windows_size context words. A sentence shorter than
        # 2*windows_size + 1 words leaves batch_and_labels empty and the unpacking below would fail.
        if batch_and_labels:
            batch, labels = [list(x) for x in zip(*batch_and_labels)]
        
        batch_data.extend(batch[:batch_size])
        label_data.extend(labels[:batch_size])
        
    # Trim batch and label at the end
    batch_data = batch_data[:batch_size]
    label_data = label_data[:batch_size]

    # Convert to numpy array
    batch_data = np.array(batch_data)
    label_data = np.transpose(np.array([label_data]))

    return (batch_data, label_data)
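The guard for empty `batch_and_labels` matters because short sentences contribute no examples at all. The sketch below repeats the window logic of `generate_batch_data` on plain integer lists and counts the surviving examples per sentence length; the helper name `cbow_examples` is an assumption made for illustration.

```python
def cbow_examples(sentence, windows_size):
    # Same window, center-index and filter logic as generate_batch_data.
    windows = [sentence[max(ix - windows_size, 0):ix + windows_size + 1]
               for ix in range(len(sentence))]
    centers = [ix if ix < windows_size else windows_size
               for ix in range(len(sentence))]
    pairs = [(w[:c] + w[c + 1:], w[c]) for w, c in zip(windows, centers)]
    return [(x, y) for x, y in pairs if len(x) == 2 * windows_size]


for n in (3, 4, 5, 8):
    print(n, len(cbow_examples(list(range(n)), 2)))
# 3 0
# 4 0
# 5 1
# 8 4
```

With `windows_size = 2`, a sentence of `n` words yields `max(0, n - 4)` examples, so the 3-word and 4-word reviews that pass the length filter never add to a batch.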

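### 6. Initialize the model parameters
#### 6.1 Initialize variables and functions
Declare the embedding variable, the placeholders, and the weights and biases for the NCE loss.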

In [27]:
embeddings = tf.Variable(tf.random_uniform([vocabulary_size,embedding_size],-1.0,1.0))

# x holds the windows_size words on each side of the center word, so the input width is 2*windows_size
x_inputs = tf.placeholder(tf.int32,shape=[batch_size, 2*windows_size])
y_target = tf.placeholder(tf.int32,shape=[batch_size,1])
valid_dataset = tf.constant(valid_example,dtype=tf.int32)

nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size,embedding_size],stddev = 1.0/np.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

x_inputs

<tf.Tensor 'Placeholder_2:0' shape=(50, 4) dtype=int32>

#### 6.2 Combine the word embeddings
Loop over the `2*windows_size` context positions and add their embeddings together into `embed`.

In [31]:
embed = tf.zeros([batch_size,embedding_size])
for element in range(2*windows_size):
    embed += tf.nn.embedding_lookup(embeddings,x_inputs[:,element])
embed

<tf.Tensor 'add_8:0' shape=(50, 200) dtype=float32>
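For a single training example, the loop above amounts to looking up each context ID and adding the vectors element by element. Below is a plain-Python sketch of that idea; the 3-dimensional `toy_embeddings` table and the helper name are made up for illustration and stand in for the `[vocabulary_size, embedding_size]` variable.

```python
def sum_context_embeddings(embeddings, context_ids):
    # Start from a zero vector, like tf.zeros in the cell above.
    total = [0.0] * len(embeddings[0])
    for idx in context_ids:
        # Look up one row and add it element-wise.
        total = [t + e for t, e in zip(total, embeddings[idx])]
    return total


toy_embeddings = [
    [0.0, 0.0, 0.0],  # ID 0
    [1.0, 0.0, 2.0],  # ID 1
    [0.5, 1.0, 0.0],  # ID 2
    [0.0, 3.0, 1.0],  # ID 3
]
print(sum_context_embeddings(toy_embeddings, [1, 2, 3, 1]))
# [2.5, 4.0, 5.0]
```

The result has the embedding dimension regardless of how many context words are summed, which is why `embed` has shape `(50, 200)` rather than `(50, 4, 200)`.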

#### 6.3 Define the NCE loss and declare the optimizer

In [33]:
loss = tf.reduce_mean(tf.nn.nce_loss(nce_weights,nce_biases,y_target,embed,num_sampled,vocabulary_size))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)

### 7. Build the model

#### 7.1 Initialize the model variables and set up saving of the word vectors

In [34]:
init = tf.global_variables_initializer()
sess.run(init)

saver = tf.train.Saver({"embeddings":embeddings})


#### 7.2 Train the model

In [53]:
loss_vec,loss_x_vec = [],[]
for i in range(epochs):   
    
    batch_input,batch_labels = generate_batch_data(text_data,batch_size,windows_size)

    feed_dict = {x_inputs:batch_input,y_target:batch_labels}
    sess.run(optimizer,feed_dict)
    
    if (i+1)%100 ==0:
        loss_val = sess.run(loss,feed_dict=feed_dict)
        loss_vec.append(loss_val)
        loss_x_vec.append(i+1)
        print("loss at epoch{}:{}".format(i+1,loss_val))
        
    # save the vocabulary and the model
    if (i+1)%500 == 0:
        
        with open("vocab\\movie_vocab.pkl","wb") as f:
            pickle.dump(word_dictionary,f)
        
        # Saving directly to cbow_movie_embeddings.ckpt failed with:
        #   Parent directory of cbow_movie_embeddings.ckpt doesn't exist, can't save.
        # so the checkpoint is written into a "models" folder instead
        save_path = saver.save(sess,os.path.join("models","cbow_movie_embeddings.ckpt"))
        
        print("Dictionary and embeddings saved.")
    

loss at epoch100:12.670909881591797
loss at epoch200:11.690191268920898
loss at epoch300:21.26810073852539
loss at epoch400:19.435401916503906
loss at epoch500:27.821025848388672
Dictionary and embeddings saved.
loss at epoch600:2.5985732078552246
loss at epoch700:22.63739776611328
loss at epoch800:19.126953125
loss at epoch900:10.38302230834961
loss at epoch1000:14.826190948486328
Dictionary and embeddings saved.


Let's see what the .pkl file contains: it is simply the `word_dictionary` word-to-ID mapping.

In [55]:
with open("vocab\\movie_vocab.pkl","rb") as f:
    data = pickle.load(f)
data

{'form': 515,
 'del': 1241,
 'brush': 8498,
 'seemed': 2033,
 'subjects': 998,
 'planet': 1940,
 'threshold': 8499,
 'italian': 1289,
 'coldhearted': 9296,
 'speak': 2906,
 'tract': 6935,
 'appreciated': 3969,
 'lacks': 237,
 'generated': 5294,
 'selfimportant': 4327,
 'range': 3970,
 'jerking': 5295,
 'maybe': 574,
 'ideological': 8501,
 'grownups': 3093,
 'bleak': 2289,
 'boisterous': 3629,
 'succumbing': 6953,
 'idealistic': 5296,
 'exploits': 4834,
 'punk': 4835,
 'enthusiastically': 6936,
 'conquer': 5445,
 'curmudgeon': 4836,
 'performer': 6095,
 'depicting': 6937,
 'years': 93,
 'murky': 3360,
 'necessity': 6938,
 'construction': 3094,
 'formulaic': 742,
 'turn': 397,
 'whine': 5944,
 'extremely': 755,
 'spielberg': 999,
 'dustin': 7275,
 'issue': 1698,
 'jagged': 6939,
 'dynamite': 8505,
 'mildmannered': 8506,
 'fleming': 6940,
 'deliciously': 4758,
 'moments': 110,
 'atmospheric': 4328,
 'unfocused': 1941,
 'enriched': 6096,
 'oftenfunny': 8509,
 'cia': 3630,
 'payback': 9083,

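### 8. Validation

#### 8.1 Find the words nearest to each validation word
Compute the cosine similarity between each validation word and every word vector: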
$$\cos(\theta) = \frac{a\cdot b}{\|a\|\times\|b\|} = \frac{\sum_{i=1}^{N} a_i b_i}{\sqrt{\sum_{i=1}^{N} a_i^2}\times\sqrt{\sum_{i=1}^{N} b_i^2}}$$

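Cosine similarity of two vectors, for example:
- Sentence A: (1,2,1,1,1)
- Sentence B: (1,1,0,1,1)

The calculation is:

$$\cos(\theta) = \frac{1\times1+2\times1+1\times0+1\times1+1\times1}{\sqrt{1^2+2^2+1^2+1^2+1^2}\times\sqrt{1^2+1^2+0^2+1^2+1^2}} = \frac{5}{\sqrt{8}\times\sqrt{4}} \approx 0.884$$

In the code, the denominator is handled up front by dividing every embedding by its norm; `tf.matmul` then only has to compute the dot products between the normalized vectors.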
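The worked example can be checked in a few lines of plain Python. The sketch also confirms that normalizing both vectors first and then taking a dot product gives the same value, which is the order of operations the TensorFlow cell uses. The function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize(v):
    # Scale a vector to unit length.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


a = [1, 2, 1, 1, 1]
b = [1, 1, 0, 1, 1]

direct = cosine_similarity(a, b)
dot_of_units = sum(x * y for x, y in zip(normalize(a), normalize(b)))

print(round(direct, 4))        # 0.8839
print(round(dot_of_units, 4))  # 0.8839
```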

In [57]:
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings),1,keep_dims = True))
normalized_embeddings = embeddings/norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings,valid_dataset)
similarity = tf.matmul(valid_embeddings,normalized_embeddings,transpose_b=True)

In [58]:
# similarity depends only on the embedding variable and the validation constant,
# so it can be evaluated without a feed_dict
sim = sess.run(similarity)

word_dictionary_rev = dict(zip(word_dictionary.values(),word_dictionary.keys()))
# word_dictionary_rev = {0:'rare',1:'film',2:'movie'...}

for j in range(len(valid_words)):
    
    valid_word = word_dictionary_rev[valid_example[j]]
    print("-"*30)
    print("Query word:",valid_word)
    log_str = "Nearest to {}:".format(valid_word)
    topk = 5
    # position 0 of the sorted order is the query word itself, so skip it
    nearest = (-sim[j,:]).argsort()[1:topk+1]
    for k in range(topk):
        print("Nearest word:",word_dictionary_rev[nearest[k]])
        
        # append each neighbor to the running string
        log_str = "%s %s ,"%(log_str,word_dictionary_rev[nearest[k]])
        
    print(log_str[:-1])

------------------------------
Query word: cliche
Nearest word: kudos
Nearest word: downer
Nearest word: pool
Nearest word: heartrending
Nearest word: fame
Nearest to cliche: kudos , downer , pool , heartrending , fame 
------------------------------
Query word: love
Nearest word: payne
Nearest word: mcdormand
Nearest word: loquacious
Nearest word: ranges
Nearest word: devoid
Nearest to love: payne , mcdormand , loquacious , ranges , devoid 
------------------------------
Query word: hate
Nearest word: end
Nearest word: substantial
Nearest word: miscast
Nearest word: tediously
Nearest word: eve
Nearest to hate: end , substantial , miscast , tediously , eve 
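The neighbor selection is a descending sort of one similarity row followed by a slice that drops the first entry. A plain-Python sketch of that step is shown below; the helper name and the toy similarity values are assumptions, and the slice relies on the query word having the highest similarity with itself (1.0 after normalization).

```python
def top_k_neighbours(similarity_row, k):
    # Indices sorted by descending similarity, like (-row).argsort().
    order = sorted(range(len(similarity_row)), key=lambda i: -similarity_row[i])
    # Position 0 is the query word itself, so take positions 1..k.
    return order[1:k + 1]


# Toy row: the query word is index 1 (similarity 1.0 with itself).
row = [0.10, 1.00, 0.75, -0.20, 0.40]
print(top_k_neighbours(row, 2))
# [2, 4]
```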
