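## Training word vectors
All of the training data, the test data, and the car reviews scraped from the web are used to train the word vectors. The text is first segmented with jieba.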

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import jieba 
import gensim
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import logging
import pickle
import multiprocessing
from tqdm import tqdm 
import time 
import sys
import os 

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

### Read all the data and convert it to the form [[w1, w2, ...], [sentence2], ...].

In [2]:
ss = u'使用结巴分词进行分词处理。'
for each in jieba.cut(ss):
    print each

Building prefix dict from the default dictionary ...
2017-06-21 16:22:34,401 : DEBUG : Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
2017-06-21 16:22:34,407 : DEBUG : Loading model from cache /tmp/jieba.cache
Loading model cost 0.320 seconds.
2017-06-21 16:22:34,724 : DEBUG : Loading model cost 0.320 seconds.
Prefix dict has been built succesfully.
2017-06-21 16:22:34,726 : DEBUG : Prefix dict has been built succesfully.


使用
结巴
分词
进行
分词
处理
。


In [3]:
from joblib import Parallel, delayed

# Segment the text using multiple processes
def cut_comment(df):
    df['words'] = df['Content'].apply(lambda ss: list(jieba.cut(ss)))
    return df

def apply_parallel(df_grouped, func):
    """Run func on every group in parallel using joblib's Parallel and delayed."""
    results = Parallel(n_jobs=-1)(delayed(func)(group) for name, group in df_grouped)
    return pd.concat(results)
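
Grouping by `df.index` makes every row its own group, so `apply_parallel` schedules one task per comment, which may add scheduling overhead. A chunked variant of the same split/apply/combine idea is sketched below. It is only an illustration under stated assumptions: `tokenize` is a whitespace stand-in for `jieba.cut`, `chunked` and `parallel_tokenize` are hypothetical helpers, and a standard-library thread pool replaces joblib so that the snippet has no third-party dependencies.

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize(text):
    """Stand-in tokenizer (assumption); the notebook uses list(jieba.cut(text))."""
    return text.split()


def chunked(items, n_chunks):
    """Split a list into at most n_chunks contiguous, order-preserving chunks."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def tokenize_chunk(chunk):
    """Tokenize every text in one chunk."""
    return [tokenize(text) for text in chunk]


def parallel_tokenize(texts, n_workers=4):
    """Tokenize texts chunk by chunk and concatenate results in the original order."""
    chunks = chunked(texts, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(tokenize_chunk, chunks)  # map() preserves chunk order
        return [words for chunk_result in results for words in chunk_result]


toy_comments = ["good car", "poor fuel economy", "quiet cabin", "nice seats", "fast"]
toy_sentences = parallel_tokenize(toy_comments, n_workers=2)
```

With pandas, the same idea would amount to splitting the frame into a handful of row blocks instead of grouping by index.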

In [4]:
df_comments = list()  # Collect all the comments.
time0 = time.time()
# ** Training data and test data
raw_data_path = '../raw_data/'
raw_files = [ 'Train.csv','TrainSecond.csv', 'Test.csv','TestSecond.csv']
for raw_file in raw_files:
    file_path = raw_data_path + raw_file
    with open(file_path, 'rb') as inp:
        df = pd.read_csv(inp, sep='\t')
        df_grouped = df.groupby(df.index)
        df = apply_parallel(df_grouped, cut_comment).loc[:, ['Content', 'words']]
        df_comments.append(df)
print 'time costed %g seconds' % (time.time() - time0)  
time0 =time.time()

# ** Scraped review data
comment_path = '../comments/'
comment_files = os.listdir(comment_path)
for comment_file in comment_files:
    file_path = comment_path + comment_file
    with open(file_path, 'rb') as inp:
        df =  pd.read_table(inp, names=['Content'])
        df_grouped = df.groupby(df.index)
        df = apply_parallel(df_grouped, cut_comment)
        df_comments.append(df)
        
df_comment = pd.concat(df_comments, ignore_index=True)
print 'time costed %g seconds' % (time.time() - time0)

sentences = df_comment['words'].values
print 'There are %d sentences' % len(sentences)

time costed 120.502 seconds
time costed 13.4129 seconds
There are 73860 sentences


### Train the word vectors

In [5]:
%time model = Word2Vec(sentences=sentences, size=200, window=5, min_count=1, workers=multiprocessing.cpu_count())
# model_outp = 'data/pretrained_embedding.model'  # Save the trained word vectors
# %time model.save(model_outp)

2017-06-21 16:25:59,633 : INFO : collecting all words and their counts
2017-06-21 16:25:59,635 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-06-21 16:25:59,781 : INFO : PROGRESS: at sentence #10000, processed 291104 words, keeping 24897 word types
2017-06-21 16:25:59,865 : INFO : PROGRESS: at sentence #20000, processed 554243 words, keeping 38148 word types
2017-06-21 16:25:59,967 : INFO : PROGRESS: at sentence #30000, processed 847647 words, keeping 47355 word types
2017-06-21 16:26:00,069 : INFO : PROGRESS: at sentence #40000, processed 1122790 words, keeping 55449 word types
2017-06-21 16:26:00,164 : INFO : PROGRESS: at sentence #50000, processed 1384733 words, keeping 63274 word types
2017-06-21 16:26:00,245 : INFO : PROGRESS: at sentence #60000, processed 1646737 words, keeping 69963 word types
2017-06-21 16:26:00,462 : INFO : PROGRESS: at sentence #70000, processed 2469154 words, keeping 82698 word types
2017-06-21 16:26:00,826 : INFO : collected

CPU times: user 1min 22s, sys: 844 ms, total: 1min 23s
Wall time: 21.7 s


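### Exporting the word vectors
The trained word vectors are exported as an np.array of shape [vocab_size, embedding_size].

index2word = model.wv.index2word is a list of length vocab_size; each element is one word of the vocabulary.

index2vec = model.wv.syn0 is an np.ndarray of shape [vocab_size, embedding_size]; row i holds the word vector of the i-th word.

So for id = 1, the corresponding word is index2word[id] and the corresponding word vector is index2vec[id].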
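
The relation between the two index-aligned structures can be shown with a toy vocabulary. The 3-dimensional vectors below are made up for illustration, and `toy_word2id` is a helper name that does not come from gensim:

```python
# Toy stand-ins (assumptions) for model.wv.index2word and model.wv.syn0.
toy_index2word = ["，", "的", "。"]
toy_index2vec = [
    [0.1, 0.2, 0.3],  # vector of toy_index2word[0]
    [0.4, 0.5, 0.6],  # vector of toy_index2word[1]
    [0.7, 0.8, 0.9],  # vector of toy_index2word[2]
]

# Invert the list to get the word -> id mapping.
toy_word2id = {word: idx for idx, word in enumerate(toy_index2word)}

word_id = 1
word = toy_index2word[word_id]    # id -> word
vector = toy_index2vec[word_id]   # id -> vector
same_vector = toy_index2vec[toy_word2id[word]]  # word -> id -> vector
```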

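Since the trained word vectors do not need to be trained any further and are only saved so that they can be used from TensorFlow, it is enough to save model.wv rather than the whole model, which takes less space. model.wv is read-only and smaller than model.

In this example the data set is fairly small: the saved model takes 45.27MB, while the saved model.wv takes 23.42MB, roughly half as much.

The model is saved and loaded as follows: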

``` python
model.save(file_name)
model = Word2Vec.load(file_name)
```

model.wv is saved and loaded as follows:

``` python
word_vectors = model.wv
word_vectors.save(fname)
word_vectors = KeyedVectors.load(fname)
```

In [6]:
word_vectors = model.wv
word_vectors.save('../data/pretrained_wv.model')
del model

2017-06-21 16:26:40,577 : INFO : saving KeyedVectors object under ../data/pretrained_wv.model, separately None
2017-06-21 16:26:40,579 : INFO : not storing attribute syn0norm
2017-06-21 16:26:40,580 : INFO : storing np array 'syn0' to ../data/pretrained_wv.model.syn0.npy
2017-06-21 16:26:41,953 : INFO : saved ../data/pretrained_wv.model


In [7]:
index2word = word_vectors.index2word
index2vec = word_vectors.syn0
id = 1
print index2word[id]
print index2vec[id][:20]

的
[-0.78331262 -1.02543843  0.29934859 -0.3891288  -0.18953632  0.49492931
  1.70470905 -0.83419657 -0.94910449  0.46337521 -0.23899332 -0.57442439
 -0.07045357 -1.57719004  0.67850661 -0.02768125  1.06980443  0.64851797
  0.11539994  0.59483492]


In [8]:
print word_vectors[u'的'][:20]

[-0.78331262 -1.02543843  0.29934859 -0.3891288  -0.18953632  0.49492931
  1.70470905 -0.83419657 -0.94910449  0.46337521 -0.23899332 -0.57442439
 -0.07045357 -1.57719004  0.67850661 -0.02768125  1.06980443  0.64851797
  0.11539994  0.59483492]


In [9]:
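# word_vectors.vocab is a dict that records information about each word in the corpus,
# including its frequency and its id. The meaning of sample_int is not clear to me yet;
# it appears to be related to gensim's down-sampling of frequent words.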
print word_vectors.vocab[u'的']

Vocab(count:225320, index:1, sample_int:681566758)


In [10]:
result = word_vectors.most_similar_cosmul(positive=[u'国产', u'本田'], negative=[u'日本'], topn=5)
for w,v in result:
    print w,v

2017-06-21 16:26:46,199 : INFO : precomputing L2-norms of word weight vectors


迈锐宝 0.984410703182
雅阁 0.982565581799
欧蓝德 0.973016619682
途观 0.937127709389
睿 0.935886979103
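
For reference, `most_similar_cosmul` ranks candidates with the multiplicative combination objective (3CosMul) of Levy and Goldberg: cosine similarities are shifted into [0, 1], those to the positive words are multiplied together, and the product is divided by the product of those to the negative words plus a small constant. The pure-Python sketch below re-implements that score on made-up 2-dimensional vectors. The words, the vectors and the helper names are illustrative assumptions, not gensim code and not data from this notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def shifted_cosine(u, v):
    """Cosine similarity mapped from [-1, 1] to [0, 1]."""
    return (1.0 + cosine(u, v)) / 2.0


def cosmul_score(candidate, positive, negative, eps=1e-6):
    """Product of shifted cosines to positives over that to negatives (plus eps)."""
    numerator = 1.0
    for vec in positive:
        numerator *= shifted_cosine(candidate, vec)
    denominator = 1.0
    for vec in negative:
        denominator *= shifted_cosine(candidate, vec)
    return numerator / (denominator + eps)


def toy_most_similar_cosmul(vectors, positive, negative, topn=5):
    """Rank all words except the query words by their 3CosMul score."""
    query = set(positive) | set(negative)
    pos_vecs = [vectors[w] for w in positive]
    neg_vecs = [vectors[w] for w in negative]
    scores = [(w, cosmul_score(vec, pos_vecs, neg_vecs))
              for w, vec in vectors.items() if w not in query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors: first axis ~ "is a car model", second axis ~ "is Japanese".
toy_vectors = {
    "domestic": [0.2, -1.0],
    "japan": [0.1, 1.0],
    "brand_a": [1.0, 0.9],    # hypothetical Japanese model
    "brand_b": [1.0, -0.8],   # hypothetical domestic model
    "brand_c": [1.0, 0.7],    # hypothetical Japanese model
}
ranking = toy_most_similar_cosmul(toy_vectors, positive=["domestic", "brand_a"],
                                  negative=["japan"], topn=2)
```

In this toy space the hypothetical domestic model `brand_b` ranks first, because it is close to both positive vectors and far from the negative one.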


### Building the TensorFlow embedding
See the following example:
[Using a pre-trained word embedding (word2vec or Glove) in TensorFlow](https://stackoverflow.com/questions/35687678/using-a-pre-trained-word-embedding-word2vec-or-glove-in-tensorflow)

```python
W = tf.get_variable(name="W", shape=embedding.shape, initializer=tf.constant_initializer(embedding), trainable=False)
inputs = tf.nn.embedding_lookup(W, X_inputs) 
```
Here, embedding is the word-vector matrix trained in this notebook:
``` python 
embedding = np.asarray(word_vectors.syn0)
```
Each element of X_inputs is the id of a word, and the ids correspond one-to-one to the positions in index2word.
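
What the lookup computes can be illustrated without TensorFlow: for a single embedding matrix, `tf.nn.embedding_lookup(W, X_inputs)` gathers rows of `W`, which corresponds to NumPy integer-array indexing. The matrix and the ids below are toy values chosen for illustration, not the trained vectors.

```python
import numpy as np

# Toy embedding matrix (assumption): 4 words, embedding_size = 3.
toy_embedding = np.arange(12, dtype=np.float32).reshape(4, 3)

# A batch of 2 sentences, each a sequence of 3 word ids.
toy_X_inputs = np.array([[0, 2, 1],
                         [3, 3, 0]])

# Row gathering: the result has shape [batch_size, n_steps, embedding_size].
toy_inputs = toy_embedding[toy_X_inputs]
```

Every id in `toy_X_inputs` must be smaller than the number of rows in the matrix, which is why the vocabulary and the embedding matrix have to be kept index-aligned.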


In [11]:
embedding = np.asarray(word_vectors.syn0)
print embedding.shape
sr_id2word = pd.Series(index2word, index=range(len(index2word)))
sr_word2id = pd.Series(range(len(index2word)), index=index2word)

print sr_id2word[:5]
print sr_word2id[:5]

(105166, 200)
0    ，
1    的
2    。
3    了
4    在
dtype: object
，    0
的    1
。    2
了    3
在    4
dtype: int64


## Adding the UNKNOWN token
It is represented by the word vector of the full stop (。).

In [12]:
result = word_vectors.most_similar_cosmul(u'。', topn=5)
for w,v in result:
    print w,v

， 0.77500975132
； 0.74831867218
8.37 0.710772633553
一定 0.697985589504
同样 0.696451961994


Add the 'UNKNOWN' token to the word vectors.

In [13]:
pad_word = 'UNKNOWN'
vocab_size = index2vec.shape[0]
pad_vec = word_vectors[u'。']
sr_id2word[vocab_size] = pad_word
sr_word2id[pad_word] = vocab_size
index2vec = np.vstack([index2vec, pad_vec])
W_embedding = index2vec
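
With the extra row in place, any token that is missing from the vocabulary can be mapped to the id of 'UNKNOWN'. The sketch below uses a plain dict as a stand-in for `sr_word2id`; the toy vocabulary and the helper `words_to_ids` are assumptions for illustration.

```python
UNKNOWN = "UNKNOWN"

# Toy vocabulary (assumption); in the notebook it comes from index2word.
toy_vocab = ["，", "的", "。"]
toy_vocab_word2id = {word: idx for idx, word in enumerate(toy_vocab)}

# Append UNKNOWN at the end, so its id equals the original vocabulary size.
toy_vocab_word2id[UNKNOWN] = len(toy_vocab)


def words_to_ids(words, word2id, unknown=UNKNOWN):
    """Map each word to its id, falling back to the id of the UNKNOWN token."""
    unknown_id = word2id[unknown]
    return [word2id.get(word, unknown_id) for word in words]


toy_ids = words_to_ids(["的", "汽车", "。"], toy_vocab_word2id)
```

Here "汽车" is not in the toy vocabulary, so it receives the UNKNOWN id.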

In [14]:
"""
Save the data.
W_embedding: ndarray of shape [vocab_size, embedding_size]; row i is the word vector of the word with id=i.
sr_id2word: pd.Series mapping id -> word.
sr_word2id: pd.Series mapping word -> id.
"""

with open('../data/embedding_data.pkl', 'wb') as outp:
    pickle.dump(W_embedding, outp)
    pickle.dump(sr_id2word, outp)
    pickle.dump(sr_word2id, outp)
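
Because the three objects are written with three consecutive `pickle.dump` calls, they have to be read back with three `pickle.load` calls in the same order. The round trip below uses an in-memory buffer and small stand-in objects (assumptions) instead of the real file and arrays:

```python
import io
import pickle

# Small stand-ins (assumptions) for W_embedding, sr_id2word and sr_word2id.
W_embedding_toy = [[0.1, 0.2], [0.3, 0.4]]
id2word_toy = {0: "的", 1: "UNKNOWN"}
word2id_toy = {"的": 0, "UNKNOWN": 1}

buffer = io.BytesIO()  # plays the role of the file opened with 'wb'
pickle.dump(W_embedding_toy, buffer)
pickle.dump(id2word_toy, buffer)
pickle.dump(word2id_toy, buffer)

buffer.seek(0)  # plays the role of reopening the file with 'rb'
W_loaded = pickle.load(buffer)          # first object written
id2word_loaded = pickle.load(buffer)    # second object written
word2id_loaded = pickle.load(buffer)    # third object written
```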