## Skip-gram

- Dataset: movie review dataset, available at http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz

In [27]:
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import random
import os
import string
from nltk.corpus import stopwords
from tensorflow.python.framework import ops
import collections

In [28]:
vocabulary_size = 10000
embedding_size = 200
batch_size = 50
windows_size = 2
num_sampled = int(batch_size/2) # number of noise (negative) words sampled per batch for the NCE loss
epochs = 1000

valid_words = ['cliche','love','hate']

sess = tf.Session()

### 1. Load the data

In [29]:
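data_file_path = 'dataset\\rt-polaritydata'
pos_file = os.path.join(data_file_path,'rt-polarity.pos')# positive reviews
neg_file = os.path.join(data_file_path,'rt-polarity.neg')# negative reviews

# Read the data.
# Note: decoding with gb18030 garbles the byte-order mark at the start of each file,
# which shows up as '锘縯he' and '锘縮implistic' in the output of this cell.
# Assuming the files are UTF-8 with a BOM, encoding='utf-8-sig' would likely avoid this.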
pos_data = []
with open(pos_file,'r',encoding='gb18030',errors='ignore') as temp_pos_file:
    for row in temp_pos_file:
        pos_data.append(row)

neg_data = []
with open(neg_file,'r',encoding='gb18030',errors='ignore') as temp_neg_file:
    for row in temp_neg_file:
        neg_data.append(row)
pos_data[:3],neg_data[:3]

(['锘縯he rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . \n',
  'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth . \n',
  'effective but too-tepid biopic\n'],
 ['锘縮implistic , silly and tedious . \n',
  "it's so laddish and juvenile , only teenage boys could possibly find it funny . \n",
  'exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable . \n'])

In [30]:
texts = pos_data+neg_data
target = [1]*len(pos_data)+[0]*len(neg_data)
len(texts)

10661

### 2. Preprocess the data
#### 2.1 Lowercase, strip punctuation, digits and extra whitespace, and remove stop words

In [49]:
def normalize_text(texts):
    stops = stopwords.words('english')# requires nltk_data to be downloaded beforehand and placed where NLTK can find it

    # Lower case
    texts = [x.lower() for x in texts]

    # Remove punctuation
    texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]

    # Remove numbers
    texts = [''.join(c for c in x if c not in '0123456789') for x in texts]

    # Remove stopwords
    texts = [' '.join([word for word in x.split() if word not in (stops)]) for x in texts]

    # Trim extra whitespace
    texts = [' '.join(x.split()) for x in texts]
    
    return(texts)

texts = normalize_text(texts)
texts[:5]

['锘縯he rock destined st centurys new conan hes going make splash even greater arnold schwarzenegger jeanclaud van damme steven segal',
 'gorgeously elaborate continuation lord rings trilogy huge column words cannot adequately describe cowriterdirector peter jacksons expanded vision j r r tolkiens middleearth',
 'effective tootepid biopic',
 'sometimes like go movies fun wasabi good place start',
 'emerges something rare issue movie thats honest keenly observed doesnt feel like one']

#### 2.2 Keep only reviews with at least 3 words, so that each review carries enough context to be useful

In [32]:
target = [target[ix] for ix, x in enumerate(texts) if len(x.split()) > 2]
texts = [x for x in texts if len(x.split()) > 2]

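### 3. Build the vocabulary

Count word frequencies and keep the vocabulary_size - 1 most frequent words together with their counts; the first entry is reserved for rare words.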

In [33]:
sentences = texts
# vocabulary_size = 10000

# Turn sentences (list of strings) into lists of words
split_sentences = [s.split() for s in sentences]
words = [x for sublist in split_sentences for x in sublist]

# Initialize list of [word, word_count] for each word, starting with unknown
count = [('RARE', -1)]

# Now add most frequent words, limited to the N-most frequent (N=vocabulary size) 
# most_common returns the N most frequent items
count.extend(collections.Counter(words).most_common(vocabulary_size-1))
count[:5]

[('RARE', -1), ('film', 1445), ('movie', 1263), ('one', 726), ('like', 721)]

Create the word dictionary, assigning an integer index to each word.

In [34]:
# Now create the dictionary
word_dictionary = {}
# For each word, that we want in the dictionary, add it, then make it
# the value of the prior dictionary length
for i,(word, word_count) in enumerate(count):
    word_dictionary[word] = len(word_dictionary)
    # print the dictionary after the first 10 entries, just to see what it looks like
    if i==9:
        print(word_dictionary)

{'story': 5, 'even': 7, 'comedy': 9, 'like': 4, 'movie': 2, 'one': 3, 'film': 1, 'much': 6, 'good': 8, 'RARE': 0}


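Text to numbers.
Map each word in the dataset to its index: words among the N most frequent (that is, present in word_dictionary) get their own index, and every other word maps to 0, the RARE index.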

In [35]:
# text to number
text_data = []
for sentence in texts:
    sentence_data = []
    for word in sentence.split():
        if word in word_dictionary:
            sentence_data.append(word_dictionary[word])
        else:
            sentence_data.append(0)
    text_data.append(sentence_data)
len(text_data)

10405
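
The vocabulary and encoding steps can be seen end to end on a tiny corpus. The sketch below uses made-up sentences and a made-up vocabulary size; it follows the same steps as the cells above but is not the notebook's own code.

```python
import collections

# Hypothetical toy corpus, not taken from the dataset
toy_texts = ["good film good story", "bad film", "good movie"]
toy_vocab_size = 3  # 'RARE' plus the 2 most frequent words

toy_words = [w for s in toy_texts for w in s.split()]
toy_count = [("RARE", -1)]
toy_count.extend(collections.Counter(toy_words).most_common(toy_vocab_size - 1))

# Index each kept word by its position in the frequency-ordered list
toy_dict = {word: idx for idx, (word, _) in enumerate(toy_count)}

# Words outside the dictionary fall back to index 0 ('RARE')
toy_encoded = [[toy_dict.get(w, 0) for w in s.split()] for s in toy_texts]

print(toy_dict)     # {'RARE': 0, 'good': 1, 'film': 2}
print(toy_encoded)  # [[1, 2, 1, 0], [0, 2], [1, 0]]
```

Every word that did not make the cut shares index 0, so the number of sentences is unchanged by the encoding.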

Choose the validation words and look up their indices.

In [36]:
# the validation words must be among the N most frequent words
valid_example = [word_dictionary[x] for x in valid_words]
valid_example

[1484, 28, 968]

### 4. Generate batches for the skip-gram model

#### 4.1 Generate the input word pairs (word_pair)

In [37]:
# Build the window sequences
# windows_size = 2
# pick one sentence at random (random.choice handles a ragged list of lists)
rand_sentence = random.choice(text_data)
# rand_sentence_spilt = rand_sentence.split()
rand_sentence_spilt = rand_sentence
window_sequences=[rand_sentence_spilt[max((ix-windows_size),0):min((ix+windows_size+1),len(rand_sentence))] 
                  for ix,x in enumerate(rand_sentence_spilt)]
rand_sentence,window_sequences

([3108, 2514, 111, 9, 39, 146, 165, 111, 807, 15],
 [[3108, 2514, 111],
  [3108, 2514, 111, 9],
  [3108, 2514, 111, 9, 39],
  [2514, 111, 9, 39, 146],
  [111, 9, 39, 146, 165],
  [9, 39, 146, 165, 111],
  [39, 146, 165, 111, 807],
  [146, 165, 111, 807, 15],
  [165, 111, 807, 15],
  [111, 807, 15]])

In [38]:
# Find the center word of each window, then build the (center word, context word) pairs
label_indices = [ix if ix<windows_size else windows_size for ix,x in enumerate(window_sequences)]
batch_and_labels = [(x[y],x[:y]+x[(y+1):]) for x,y in zip(window_sequences,label_indices)]
tuple_data = [(x,y_) for x,y in batch_and_labels for y_ in y]
batch_and_labels[:2],tuple_data[:5]

([(3108, [2514, 111]), (2514, [3108, 111, 9])],
 [(3108, 2514), (3108, 111), (2514, 3108), (2514, 111), (2514, 9)])
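
The pairing logic can also be written as an explicit loop, which may be easier to follow than the slicing above. The sketch below is an equivalent formulation, not the notebook's own code; the function name skipgram_pairs and the toy sentence are made up for illustration.

```python
def skipgram_pairs(tokens, window_size):
    """Return (center, context) pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the window at the sentence boundaries
        lo = max(i - window_size, 0)
        hi = min(i + window_size + 1, len(tokens))
        for j in range(lo, hi):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical toy sentence, window of 1 word on each side
toy_pairs = skipgram_pairs(["the", "movie", "was", "great"], window_size=1)
print(toy_pairs)
```

Applied to the index sequence shown above with a window of 2, the first five pairs it produces match the tuple_data output.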

#### 4.2 Generate a batch

In [40]:
batch_data,label_data = [],[]
# batch_size = 4
word_input,labels_output = [list(x) for x in zip(*tuple_data)]
batch_data.extend(word_input[:batch_size])
label_data.extend(labels_output[:batch_size])

batch_data = np.array(batch_data)
label_data = np.transpose(np.array([label_data]))
label_data.shape

(34, 1)
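
The shape is (34, 1) rather than (50, 1) because a single sentence only yields a limited number of pairs: presumably the 10-word sentence sampled earlier was used here, and with a window of 2 it produces 34 pairs, fewer than batch_size. The helper below (count_pairs is a made-up name) is a small sketch that counts the pairs for a given sentence length.

```python
def count_pairs(sentence_length, window_size):
    """Number of (center, context) pairs a sentence yields."""
    total = 0
    for i in range(sentence_length):
        left = min(i, window_size)                         # context words to the left
        right = min(sentence_length - 1 - i, window_size)  # context words to the right
        total += left + right
    return total

print(count_pairs(10, 2))  # 34
```

This is why a batch generator has to draw several sentences before it can fill a batch of 50.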

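#### 4.3 Wrap batch generation in a function
So that it can be called during training, the separate steps above are wrapped into a single function that generates a batch.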

In [41]:
def generate_batch_data(sentences,batch_size,windows_size):
    batch_data,label_data = [],[]
    
    # This loop is not redundant:
    # a single sentence may yield fewer than batch_size pairs,
    # so keep adding pairs from new sentences until there are at least batch_size
    while len(batch_data) < batch_size:
    
        # Build the window sequences; the sentences are already lists of word indices here.
        # Pick one sentence at random (random.choice handles a ragged list of lists)
        rand_sentence = random.choice(sentences)

        window_sequences=[rand_sentence[max((ix-windows_size),0):min((ix+windows_size+1),len(rand_sentence))] 
                          for ix,x in enumerate(rand_sentence)]

        # Find the center word of each window, then build the (center word, context word) pairs
        label_indices = [ix if ix<windows_size else windows_size for ix,x in enumerate(window_sequences)]
        batch_and_labels = [(x[y],x[:y]+x[(y+1):]) for x,y in zip(window_sequences,label_indices)]
        tuple_data = [(x,y_) for x,y in batch_and_labels for y_ in y]

        # Split the pairs into inputs and labels
        word_input,labels_output = [list(x) for x in zip(*tuple_data)]
        
        batch_data.extend(word_input[:batch_size])
        label_data.extend(labels_output[:batch_size])
        
    # Several passes through the loop may overshoot, so keep only the first batch_size pairs
    batch_data = batch_data[:batch_size]
    label_data = label_data[:batch_size]
    
    # Convert to arrays so they can be fed to the model
    batch_data = np.array(batch_data)
    # Transpose to get shape (batch_size, 1)
    label_data = np.transpose(np.array([label_data]))
    
    return batch_data,label_data

### 5. Initialization
Initialize the embedding matrix, declare the placeholders, and set up the embedding lookup.

In [42]:
embeddings = tf.Variable(tf.random_uniform([vocabulary_size,embedding_size],-1.0,1.0))

x_inputs = tf.placeholder(tf.int32,shape=[batch_size])
y_target = tf.placeholder(tf.int32,shape=[batch_size,1])
valid_dataset = tf.constant(valid_example,dtype=tf.int32)

embed = tf.nn.embedding_lookup(embeddings,x_inputs)
embeddings

<tf.Variable 'Variable_3:0' shape=(10000, 200) dtype=float32_ref>
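
Conceptually, an embedding lookup just selects rows of the embedding matrix by word index. The NumPy sketch below illustrates the idea with a toy matrix; the names and sizes are made up, and it is an analogy to tf.nn.embedding_lookup rather than the TensorFlow call itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embedding matrix: 6 words, 4 dimensions, uniform in [-1, 1)
toy_embeddings = rng.uniform(-1.0, 1.0, size=(6, 4))

# A batch of word indices; index 3 appears twice
toy_ids = np.array([3, 0, 3])

# Row indexing returns one embedding vector per input index
toy_embed = toy_embeddings[toy_ids]
print(toy_embed.shape)  # (3, 4)
```

With the notebook's settings, the same operation turns a batch of 50 indices into a (50, 200) matrix.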

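### 6. Define the NCE loss
Given an input word, the skip-gram model estimates the probability that each candidate word is the expected output word, so it is effectively a classification problem. With vocabulary_size set to 10000, that means a very sparse 10,000-way classification.
For this reason we use the noise-contrastive estimation (NCE) loss, which scores the true word against a small number of sampled noise words instead of the whole vocabulary.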

https://www.cnblogs.com/linhao-0204/p/9126037.html
https://blog.csdn.net/wizardforcel/article/details/84075703

In [43]:
import math
nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size,embedding_size],stddev = 1.0/math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# Note: the book has y_target and embed in the wrong order here
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,biases=nce_biases,labels=y_target,inputs=embed,num_sampled=num_sampled,num_classes=vocabulary_size))
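
To make the idea behind this loss more concrete, here is a simplified NumPy sketch of a sampled, noise-contrastive-style loss. It is an illustration under stated assumptions, not a reimplementation of tf.nn.nce_loss: noise words are drawn uniformly and shared across the batch, and the real function differs in details such as how noise words are sampled and corrected for. The function name and all toy values are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(embed, weights, biases, labels, num_sampled, rng):
    """Simplified sampled loss.
    embed: (batch, dim), weights: (vocab, dim), biases: (vocab,), labels: (batch,)
    """
    vocab_size = weights.shape[0]
    # Assumption: one set of noise words for the whole batch, drawn uniformly
    noise_ids = rng.integers(0, vocab_size, size=num_sampled)

    # Score of the true context word for each example: (batch,)
    true_logits = np.sum(embed * weights[labels], axis=1) + biases[labels]
    # Scores of the noise words for each example: (batch, num_sampled)
    noise_logits = embed @ weights[noise_ids].T + biases[noise_ids]

    # Binary cross-entropy: true word labeled 1, noise words labeled 0
    true_loss = -np.log(sigmoid(true_logits))
    noise_loss = -np.log(1.0 - sigmoid(noise_logits)).sum(axis=1)
    return np.mean(true_loss + noise_loss)

rng = np.random.default_rng(0)
toy_embed = rng.normal(scale=0.1, size=(4, 8))     # 4 examples, 8 dimensions
toy_weights = rng.normal(scale=0.1, size=(20, 8))  # toy vocabulary of 20 words
toy_biases = np.zeros(20)
toy_labels = np.array([3, 7, 7, 12])
toy_loss = negative_sampling_loss(toy_embed, toy_weights, toy_biases,
                                  toy_labels, num_sampled=5, rng=rng)
print(toy_loss)
```

The cost per example depends on num_sampled rather than on the vocabulary size, which is the point of using a sampled loss for a 10,000-word vocabulary.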

### 7. Build the model
#### 7.1 Declare the optimizer and initialize the model variables

In [44]:
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)
init = tf.global_variables_initializer()
sess.run(init)

#### 7.2 Train the model

In [45]:
loss_vec,loss_x_vec = [],[]
for i in range(epochs):   
    
    batch_input,batch_labels = generate_batch_data(text_data,batch_size,windows_size)

    feed_dict = {x_inputs:batch_input,y_target:batch_labels}
    sess.run(optimizer,feed_dict)
    
    if (i+1)%100 ==0:
        loss_val = sess.run(loss,feed_dict=feed_dict)
        loss_vec.append(loss_val)
        loss_x_vec.append(i+1)
        print("loss at epoch{}:{}".format(i+1,loss_val))
    

loss at epoch100:77.96843719482422
loss at epoch200:69.6238784790039
loss at epoch300:53.15748977661133
loss at epoch400:61.729766845703125
loss at epoch500:25.412565231323242
loss at epoch600:43.033470153808594
loss at epoch700:42.40412139892578
loss at epoch800:42.47169494628906
loss at epoch900:49.41889953613281
loss at epoch1000:20.44341278076172


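### 8. Validation

#### 8.1 Find the words nearest to each validation word
Compute the cosine similarity between each validation word and every word vector:
$$\cos(\theta) = \frac{a\cdot b}{\|a\|\times \|b\|}= \frac{\sum_{i=1}^N a_i b_i}{\sqrt{\sum_{i=1}^N a_i^2}\times\sqrt{\sum_{i=1}^N b_i^2}}$$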

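Example: the cosine similarity of two vectors:
- Sentence A: (1,2,1,1,1)
- Sentence B: (1,1,0,1,1)

The calculation is:

$$\cos(\theta) = \frac{1\times1+2\times1+1\times0+1\times1+1\times1}{\sqrt{1^2+2^2+1^2+1^2+1^2}\times\sqrt{1^2+1^2+0^2+1^2+1^2}} = \frac{5}{\sqrt{8}\times\sqrt{4}} \approx 0.884$$

In the code, the denominator is handled first by dividing each embedding by its norm; a matmul over the normalized vectors then gives the dot products, which are exactly the cosine similarities.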
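
The worked example can be checked with a few lines of NumPy. The sketch below computes the similarity both directly and by normalizing first, the second route being the one the TensorFlow code takes; the variable names are chosen for illustration.

```python
import numpy as np

a = np.array([1, 2, 1, 1, 1], dtype=float)  # sentence A
b = np.array([1, 1, 0, 1, 1], dtype=float)  # sentence B

# Direct formula: dot product divided by the product of the norms
cos_direct = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize each vector to unit length first, then take the dot product
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
cos_normalized = a_unit @ b_unit

print(round(cos_direct, 4))  # 0.8839
```

Both routes give the same value, so normalizing the whole embedding matrix once lets a single matrix product compute all similarities.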

In [46]:
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings),1,keep_dims = True))
normalized_embeddings = embeddings/norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings,valid_dataset)
similarity = tf.matmul(valid_embeddings,normalized_embeddings,transpose_b=True)

Run the similarity computation in the session

In [47]:
# similarity depends only on the embeddings and the validation indices, so no feed_dict is needed
sim = sess.run(similarity)
sim[1,:]

array([ 0.30178624,  0.23389468,  0.28382635, ..., -0.03705493,
       -0.025969  ,  0.04007516], dtype=float32)

In [48]:
word_dictionary_rev = dict(zip(word_dictionary.values(),word_dictionary.keys()))
# word_dictionary_rev = {0:'RARE',1:'film',2:'movie'...}
for j in range(len(valid_words)):
    
    valid_word = word_dictionary_rev[valid_example[j]]
    print("-"*30)
    print("Query word:",valid_word)
    log_str = "Nearest to {}:".format(valid_word)
    topk = 5
    # sort by descending similarity; position 0 is normally the query word itself, so skip it
    nearst = (-sim[j,:]).argsort()[1:topk+1]
    for k in range(topk):
        print("Predicted nearby word:",word_dictionary_rev[nearst[k]])
        
        # append each neighbour to the running log string
        log_str = "%s %s ,"%(log_str,word_dictionary_rev[nearst[k]])
    print(log_str[:-1])

------------------------------
Query word: cliche
Predicted nearby word: along
Predicted nearby word: indeed
Predicted nearby word: scooter
Predicted nearby word: fanciful
Predicted nearby word: versus
Nearest to cliche: along , indeed , scooter , fanciful , versus 
------------------------------
Query word: love
Predicted nearby word: la
Predicted nearby word: life
Predicted nearby word: stories
Predicted nearby word: american
Predicted nearby word: hours
Nearest to love: la , life , stories , american , hours 
------------------------------
Query word: hate
Predicted nearby word: involving
Predicted nearby word: fairly
Predicted nearby word: embarrassment
Predicted nearby word: worthy
Predicted nearby word: boobs
Nearest to hate: involving , fairly , embarrassment , worthy , boobs 
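
The nearest-word lookup relies on sorting one row of the similarity matrix. The NumPy sketch below shows the same argsort trick on four made-up 2-dimensional vectors; the vocabulary and values are purely illustrative.

```python
import numpy as np

toy_vocab = ["film", "movie", "love", "hate"]
toy_vectors = np.array([[1.0, 0.1],
                        [0.9, 0.2],
                        [-0.2, 1.0],
                        [-1.0, -0.1]])

# Normalize rows to unit length so the matrix product gives cosine similarities
unit = toy_vectors / np.linalg.norm(toy_vectors, axis=1, keepdims=True)
sim_toy = unit @ unit.T

query = toy_vocab.index("film")
# Negate to sort in descending order of similarity
order = (-sim_toy[query]).argsort()
# Position 0 is the query word itself, so take the next two entries
nearest_toy = [toy_vocab[i] for i in order[1:3]]
print(nearest_toy)  # ['movie', 'love']
```

Skipping the first position is what the slice [1:topk+1] does in the cell above.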
