# 词向量与词嵌入

本章介绍词嵌入（word embedding）方法。依次有如下内容：
1. 概述
2. 机器学习对于词的表示方法
3. 神经网络语言模型（Neural Network Language Model）
4. word2vec构造词向量
5. 使用keras完成词向量的训练与可视化

### 机器学习对于词的表示方法

在NLP建模中，最重要和最根本的任务就是在模型中对输入的单词进行表示，方便模型识别单词之间的相似性和差异性。词的表达主要有两大类：早期的NLP工作中，词的表示大多是基于基本符号（atomic symbol）；而在现代NLP中，更多的是以词向量（word vector）的方法来表示，从而使得词间相似性计算更加便捷。下面先简要介绍早期的词表示方法，再着重解释现代的词向量方法。

#### 词向量方法

在真实世界中，单词的数量是巨大的。比如根据现代汉语词典，中文里常用的基本词和词组大约有5万6千多个，加上现在网络上新兴的各种单词及其组合，则无人有过精确统计；至于英语，韦伯斯特词典则大约收录了1百万个英语单词。当然，汉语的词数量不适宜跟英语相比，中文的词主要是一个词根作用，加上其各种组合应该数量也是巨大的。

这些单词之间并不是完全独立的，也不是每一个单词都代表一个完全独立的概念。我们可以认为我们在谈话时所表达的实际语义实际更少，是在一个数量更低的维度上。比如性别（男 vs 女）；交通工具（汽车 vs 飞机），食品（宫保鸡丁 vs 火锅）等等。

我们下面介绍几种常见的词向量表示方法：
1. 独热编码（One Hot Encoding）
2. 基于SVD的编码
3. 迭代编码方法

#### 独热编码 （One Hot Encoding）

独热编码是对于当前单词表中的单词使用一个向量进行表达的简单方式。假如当前的单词表有10个不同的单词，那么独热编码使用10位的0、1数字来对该单词表进行编码，出现每个单词在对应的序号位标为1，否则为0。下面举例说明：

假设我们的文档分词后产生如下的单词表：[“中国”，“国家”，“主席”，“习近平”，“北京”，“钓鱼台”，“会见”，“到”，“访”，“日本”，“首相”]，共11个单词，并且单词序号也按照上面的次序，那么我们的独热编码$w$就使用一个11个0、1数字的向量来表示这些单词：

w(“中国”)：1000000000； w(“国家”)：01000000000，......，w(“首相”)：00000000001

一般说来，对于一个具有N个元素的单词表，独热编码将每一个单词映射为$R^{1\times|N|}$的向量，向量中对应单词序号的位置数值为1，其余为0。

当然，独热编码虽然简单，但是其有几个问题：
1. 存储效率极低
2. 每个单词是完全独立的存在，之间并无联系。比如$w （“中国”）^T w（“日本”）=0$



#### 基于SVD的编码
我们看到，独热编码虽然简单，但是对于建模来说并不是较好的选择。研究者探索了可以解决存储效率和单词之间关系问题的方法，其中基于SVD的编码便是较常见的一种。这种方法依赖于通过word-document矩阵生成的共生矩阵（cooccurance matrix），下面分别介绍：
1. 词-文矩阵（word-document matrix）。这是以单词为行，以文本为列的矩阵，反映一个文本中以0、1编码表示的所有单词；
2. 基于滑动窗的共生矩阵（window based co-occurance matrix）。该矩阵是一个“词-词矩阵”（word-word matrix），反映的是在一个给定的窗口中两个单词同时出现的频次统计；

直接按照逻辑来实现对以上矩阵的生产并不难，但是sklearn里面已经有现成的方法可供调用，直接对以列表形式出现的文本集合进行操作，非常方便。下面先举一个简单示例说明逻辑，再展示实践中可用的生产性代码。在英文文档中，可以直接使用sklearn的文字处理模块进行操作，但是中文需要先进行分词再使用sklearn的模块进行处理。

下面，我们：
- 首先展示如何生成**词-文矩阵**；
- 再展示如何生成基于滑动窗的**共生矩阵**；


In [1]:
# 对于英文文档列表，可以直接使用sklearn工具生成 WORD-DOCUMENT Matrix 
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
docs = ['this is a good book',
        'this cat is this good',
        'cat dog fight']
count_model = CountVectorizer(ngram_range=(1,1)) # default unigram model
X = count_model.fit_transform(docs)
print(count_model.get_feature_names())
print(X.todense())
print(count_model.vocabulary_)

['book', 'cat', 'dog', 'fight', 'good', 'is', 'this']
[[1 0 0 0 1 1 1]
 [0 1 0 0 1 1 2]
 [0 1 1 1 0 0 0]]
{'this': 6, 'is': 5, 'good': 4, 'book': 0, 'cat': 1, 'dog': 2, 'fight': 3}


**对中文进行“词-文矩阵”生成**

In [18]:
import jieba
import re
jieba.initialize()  
# 简单示例
documents = [u'独热编码是对于当前单词表中的单词使用1个向量进行表达的简单方式，独热编码有自身的缺点和有点。', 
             u'假如当前的单词表有10个不同的单词，并且每个单词都不一样', 
             u'研究者探索了可以解决存储效率和单词之间关系问题的方法', 
             u'独热编码虽然简单']

vocabulary = {}
documents_after = []


# 先构造使用jieba分词后的文本列表，我们构造一个函数来进行操作

def cn_list_seg(documents, removeDigits=True):
    vocabulary = {}
    documents_after = []
    documents = list(documents)
    for doc in documents:
        # 每一个文本的分句要单独拎出来进行分词处理，
        sentences = doc.split(',.，。；')
        seg_doc = ''
        for sentence in sentences:
            result=jieba.tokenize(sentence)
            # 如果分词出来的结果包含数字，就扔掉
            for word in result:
                if (removeDigits) & bool(re.search('\d+', word[0])):
                    pass
                else:
                    seg_doc = seg_doc + ' ' + word[0]
                    if word in vocabulary.keys():
                        vocabulary[word[0]] += 1
                    else:
                        vocabulary[word[0]] = 1
        documents_after.append(seg_doc)    
    return documents_after, vocabulary
    
documents_after, vocabulary = cn_list_seg(documents[:])

下面进行矩阵构建操作

In [3]:
print(u'中文分词后的文档：')
print(documents_after)
print('\n')

# 再使用sklearn模块生成word-document矩阵
cn_count_model = CountVectorizer(ngram_range=(2,2)) # default unigram model
cnX = cn_count_model.fit_transform(documents_after)

print(u'标注化（Tokenized）后的特征列表：')
print(cn_count_model.get_feature_names())
print('\n')

print(u'词-文矩阵：')
print(cnX.todense())
print('\n')

print(u'词-文矩阵对应的特征索引号（矩阵列的序号）：')
print(cn_count_model.vocabulary_)
print('\n')

# 将词-文矩阵的列打上标签：
voc_df=pd.DataFrame.from_dict(cn_count_model.vocabulary_, columns=['idx'], orient='index').sort_values(by=['idx'])
cols = list(voc_df.index)
print('带标签的词-文矩阵:')
pd.DataFrame(cnX.todense(), columns=cols)

中文分词后的文档：
[' 独热 编码 是 对于 当前 单词表 中 的 单词 使用 个 向量 进行 表达 的 简单 方式 ， 独热 编码 有 自身 的 缺点 和 有点 。', ' 假如 当前 的 单词表 有 个 不同 的 单词 ， 并且 每个 单词 都 不 一样', ' 研究者 探索 了 可以 解决 存储 效率 和 单词 之间 关系 问题 的 方法', ' 独热 编码 虽然 简单']


标注化（Tokenized）后的特征列表：
['不同 单词', '之间 关系', '使用 向量', '假如 当前', '关系 问题', '单词 一样', '单词 之间', '单词 使用', '单词 并且', '单词表 不同', '单词表 单词', '可以 解决', '向量 进行', '存储 效率', '对于 当前', '并且 每个', '当前 单词表', '探索 可以', '效率 单词', '方式 独热', '每个 单词', '独热 编码', '研究者 探索', '简单 方式', '编码 对于', '编码 自身', '编码 虽然', '缺点 有点', '自身 缺点', '虽然 简单', '表达 简单', '解决 存储', '进行 表达', '问题 方法']


词-文矩阵：
[[0 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0 1 0 0 1 0 2 0 1 1 1 0 1 1 0 1 0 1 0]
 [1 0 0 1 0 1 0 0 1 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0]]


词-文矩阵对应的特征索引号（矩阵列的序号）：
{'独热 编码': 21, '编码 对于': 24, '对于 当前': 14, '当前 单词表': 16, '单词表 单词': 10, '单词 使用': 7, '使用 向量': 2, '向量 进行': 12, '进行 表达': 32, '表达 简单': 30, '简单 方式': 23, '方式 独热': 

Unnamed: 0,不同 单词,之间 关系,使用 向量,假如 当前,关系 问题,单词 一样,单词 之间,单词 使用,单词 并且,单词表 不同,...,编码 对于,编码 自身,编码 虽然,缺点 有点,自身 缺点,虽然 简单,表达 简单,解决 存储,进行 表达,问题 方法
0,0,0,1,0,0,0,0,1,0,0,...,1,1,0,1,1,0,1,0,1,0
1,1,0,0,1,0,1,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
2,0,1,0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,1,0,0,0,0


但是在实际工作中存在大量文本，我们会发现不同词的出现频率非常不同。这时候单纯使用词-文矩阵里面的频率计数到模型中会造成有偏差的结果。事实上，如果一个不常见的词在某一个文本中出现，那么其携带的信息量反而是非常高的。一种常见的变换方法就是TF-IDF（term-frequency inverse document frequency）。这个方法将高频词按照在所有文本中出现的频次进行降低权重的操作，从而来“突出”低频词的作用。tfidf使得文本相对更可比，计算文本之间的相似性的时候更有意义。

TFIDF在sklearn里面有现成的工具进行操作。

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

cn_tf_model = TfidfVectorizer()
X = cn_tf_model.fit_transform(documents_after)
print(cn_tf_model.get_feature_names())
print(X.shape)

voc_df=pd.DataFrame.from_dict(cn_tf_model.vocabulary_, columns=['idx'], orient='index').sort_values(by=['idx'])
cols = list(voc_df.index)
print('带标签的词-文矩阵:')
Xdf=pd.DataFrame(X.todense(), columns=cols)
Xdf.T


['一样', '不同', '之间', '使用', '假如', '关系', '单词', '单词表', '可以', '向量', '存储', '对于', '并且', '当前', '探索', '效率', '方式', '方法', '有点', '每个', '独热', '研究者', '简单', '编码', '缺点', '自身', '虽然', '表达', '解决', '进行', '问题']
(4, 31)
带标签的词-文矩阵:


Unnamed: 0,0,1,2,3
一样,0.0,0.356398,0.0,0.0
不同,0.0,0.356398,0.0,0.0
之间,0.0,0.0,0.309976,0.0
使用,0.248108,0.0,0.0,0.0
假如,0.0,0.356398,0.0,0.0
关系,0.0,0.0,0.309976,0.0
单词,0.158364,0.454968,0.197854,0.0
单词表,0.195611,0.280988,0.0,0.0
可以,0.0,0.0,0.309976,0.0
向量,0.248108,0.0,0.0,0.0


#### 滑动窗口的共生矩阵生成
滑动窗口的共生矩阵因为包含了两个单词在一定相邻距离上同时出现的频次，因此反映了单词之间的相关程度。在上面的例子中因为信息量较小，两个词同时出现多次的情况没有，因此这里使用较大的外部数据来进行展示。我们使用的是1000条酒店评论数据。

In [5]:
#XTX=np.dot(cnX.todense().T, cnX.todense())
#XTX.shape
documents = []
stopword = []
datafile = './nlp_data/hotel_reviews_data/1000_pos.txt'
stopwordfile = './nlp_data/hotel_reviews_data/stopWord.txt'

# 先读入stopword
#fo=open(stopwordfile, encoding='UTF-8')
with open(stopwordfile, encoding='UTF-8') as fo:
    for line in fo:
       stopword.append(line.strip('\n'))

print(stopword[70:87])

# 再读入原始评论文档
#fo=open(datafile, encoding='UTF-8')
with open(datafile, encoding='UTF-8') as fo:
    for line in fo:
       documents.append(line.strip('\n'))
    
print(documents[:4])

['除非', '除了', '此', '此间', '此外', '从', '从而', '打', '待', '但', '但是', '当', '当着', '到', '得', '的', '的话']
['距离川沙公路较近,但是公交指示不对,如果是"蔡陆线"的话,会非常麻烦.建议用别的路线.房间较为简单.', '商务大床房，房间很大，床有2M宽，整体感觉经济实惠不错!', '早餐太差，无论去多少人，那边也不加食品的。酒店应该重视一下这个问题了。', '宾馆在小街道上，不大好找，但还好北京热心同胞很多~']


如果滑动窗口大小设为1，则考虑全局的共生矩阵

In [6]:
documents_after, vocabulary = cn_list_seg(documents)

# 再使用sklearn模块生成word-document矩阵
min_n = 1
max_n = 1
cn_count_model = CountVectorizer(ngram_range=(min_n, max_n), stop_words = stopword) # default unigram model
cnX = cn_count_model.fit_transform(documents_after)

print(u'标注化（Tokenized）后的特征列表：')
#print(cn_count_model.get_feature_names())
print('\n')

print(u'词-文矩阵：')
#print(cnX.todense())
#cnXdense = cnX.todense()
#XTX = np.dot(cnXdense.T, cnXdense)
%time XTX = cnX.T * cnX
print('\n')

'''
print(u'词-文矩阵对应的特征索引号（矩阵列的序号）：')
print(cn_count_model.vocabulary_)
print('\n')

# 将共生矩阵的列打上标签：
voc_df=pd.DataFrame.from_dict(cn_count_model.vocabulary_, columns=['idx'], orient='index').sort_values(by=['idx'])
cols = list(voc_df.index)
print('带标签的词-文矩阵:')
pd.DataFrame(cnX.todense(), columns=cols)
'''

标注化（Tokenized）后的特征列表：


词-文矩阵：
Wall time: 15.6 ms




"\nprint(u'词-文矩阵对应的特征索引号（矩阵列的序号）：')\nprint(cn_count_model.vocabulary_)\nprint('\n')\n\n# 将共生矩阵的列打上标签：\nvoc_df=pd.DataFrame.from_dict(cn_count_model.vocabulary_, columns=['idx'], orient='index').sort_values(by=['idx'])\ncols = list(voc_df.index)\nprint('带标签的词-文矩阵:')\npd.DataFrame(cnX.todense(), columns=cols)\n"

In [7]:
# 再使用sklearn模块生成word-document矩阵
min_n = 1
max_n = 1
cn_count_model = CountVectorizer(ngram_range=(min_n, max_n), stop_words=stopword) # default unigram model
cnX = cn_count_model.fit_transform(documents_after)
features = cn_count_model.get_feature_names()

print(features[:20])

['accor', 'always', 'amberleyhotel', 'and', 'angel', 'anyone', 'ask', 'bay', 'bed', 'body', 'bus', 'can', 'cheak', 'check', 'checkin', 'cnn', 'copy', 'ctrip', 'dfs', 'did']


In [8]:
def _word_ngrams(tokens, stop_words=None):
        """Turn tokens into a sequence of n-grams after stop words filtering"""
        # handle stop words
        if stop_words is not None:
            tokens = [w for w in tokens if w not in stop_words]

        # handle token n-grams
        min_n, max_n = 3, 3
        if max_n != 1:
            original_tokens = tokens
            if min_n == 1:
                # no need to do any slicing for unigrams
                # just iterate through the original tokens
                tokens = list(original_tokens)
                min_n += 1
            else:
                tokens = []

            n_original_tokens = len(original_tokens)

            # bind method outside of loop to reduce overhead
            tokens_append = tokens.append
            space_join = " ".join

            for n in range(min_n,
                            min(max_n + 1, n_original_tokens + 1)):
                for i in range(n_original_tokens - n + 1):
                    tokens_append(space_join(original_tokens[i: i + n]))

        return tokens

In [9]:
ngrams = _word_ngrams(features)
print(features)
print(ngrams)

['accor', 'always', 'amberleyhotel', 'and', 'angel', 'anyone', 'ask', 'bay', 'bed', 'body', 'bus', 'can', 'cheak', 'check', 'checkin', 'cnn', 'copy', 'ctrip', 'dfs', 'did', 'else', 'even', 'excellent', 'floor', 'floors', 'for', 'hk', 'hour', 'house', 'housekeeping', 'iia', 'in', 'it', 'keeping', 'kfc', 'ktv', 'ld', 'lg', 'match', 'mini', 'my', 'nice', 'no', 'not', 'novotel', 'ok', 'on', 'other', 'out', 'panda', 'quarry', 'ramada', 'recommend', 'rmb', 'room', 'sasa', 'schedule', 'see', 'shop', 'shopping', 'shuttle', 'soho', 'soup', 'stay', 'suggest', 'sweet', 'taxi', 'the', 'there', 'though', 'to', 'top', 'tt', 'twin', 'upgrade', 'very', 'was', 'xx', 'ymca', '一一', '一下', '一下床', '一丝', '一个', '一些', '一件', '一份', '一伙', '一会', '一伸', '一位', '一住', '一侧', '一共', '一再', '一出', '一分钟', '一副', '一半', '一句', '一台', '一向', '一周', '一圈', '一块', '一城', '一夜', '一大', '一天', '一如既往', '一定', '一家', '一家人', '一对', '一小', '一小块', '一层', '一床', '一店', '一张', '一律', '一把', '一指', '一排', '一支', '一日', '一早', '一是', '一晚', '一望无际', '一期', '一本正经', '一朵', 

** 这里结束传统的基于词频的表达方法，开始介绍基于预测的表达方法。**

### 迭代嵌入方法
我们在上面介绍了基于SVD的词嵌入方法。通过共生矩阵以及SVD算法，独热法表示的单词之间的关系得到抽象，并映射到较低维度的致密空间中。但是这种方法基于全局信息，对存储量需求大。我们现在介绍一种迭代学习的方法来将单词映射到一个新的包含了上下文关系的空间中。这种方法的典型代表叫word2vec。在word2vec这个方法中需要引入两个概念“中心词”（center word）和“上下文”（context）

以“资产富裕的人爱投资股票”这句话为例，其经过处理后得到如下的单词：\[ '资产','富裕'，'人', '爱', '投资', '股票' \]。如果选定*‘富裕’*这个单词，在模型中其被称为“中心词”，而其上下文就是前后的\[ '资产','人', '爱', '投资', '股票' \]。一个中心词的上下文被word2vec这个算法用来衡量其含义。如果我们在大量的文本中都发现类似于“资产”，“财富”，“投资”这样的词经常性地出现在*“富裕”*这个词的周边，就能推断*“富裕”*的含义。有相似上下文的词就在word2vec这个模型里具备相似的含义，可以被视作是同义词，其对应的词向量则应该距离上接近。

当然，在实践中，上下文通常被定义为中心词左右对称给定长度的窗口覆盖的词，如下图所示：

<img src="./pics/Chapter1-1.png" width="400">
"中心词与上下文的独热表示"

在word2vec中，词向量本身就是模型的参数，通过对数据的建模可以获得参数的值。word2vec包含两种不同的子模型：
1. 连续型模型（Continuous Bag Of Words Model, CBOW Model）
2. 跳跃型模型（Skip-Gram Model）

下面分别介绍。

** 连续型模型 （CBOW） **

在CBOW中，模型是根据上下文单词来预测中心词。其基本步骤如下：


其架构如下图所示（图源自于："Deep learning for sentence classification"）。图中标识分别为：
1. $x_{ik}$为独热表示的上下文单词
2. $W_{V \times N}$为输入层的权重矩阵，大小为$V \times N$，其中V是词典大小，而N是设定的词向量维度
3. $h_i$是由$x_{ik}$与权重矩阵$W$相乘的到的隐含层结果，大小为$1 \times N$的向量
4. $W^{'}_{N \times V}$为输出层的权重矩阵。注意，这里的$W^{'}$不是$W$矩阵的转置，而是另外一个全新的矩阵。
5. $y_j$为待预测的中心词的独热表示


<img src="./pics/Chapter1-CBOW.png" width="400">


CBOW的计算过程比较直接：

1. 首先，对于给定窗口长度m，上下文$x^{(c)}$用相应的独热法表示，对于待预测的中心词$y^{(c)}$也用相应的独热编码表示，那么我们的输入数据为：$(x^{(c-m)},...,x^{(c-1)}, x^{(c+1)},..., x^{(c+m)} )$
2. 其次，对于上述上下文单词，与输入层权重矩阵相乘，得到嵌入向量$v_{c-m} = Wx^{(c-m)}, v_{c-m+1} = Wx^{(c-m+1)}, ....$
3. 将上述嵌入向量取平均：$\bar(v) = \frac{\sum_{-m}^{m}v_{c+j}}{2m}$
4. 将平均的嵌入向量与输出层权重矩阵相乘，得到输出打分向量$z = W{'}\bar{v}$。相似的单词该得分应该更高；
5. 应用softmax函数将上述打分向量变为概率输出$y=\textrm{softmax}(z)$。如果预测准确的话，那么概率向量$y$会在独热编码为1的地方具备最高的概率值。

word2vec模型就是要在迭代的过程中，通过不断优化$W$和$W'$两个权重矩阵使得我们的语言模型尽可能地接近实际的数据表现。

** 跳跃型模型 （Skip-Gram） **

Skip-Gram模型与CBOW正好相反，模型是根据中心词来预测上下文。其基本步骤如下：


其架构如下图所示（图源自于："Deep learning for sentence classification"）。图中标识分别为：
1. $x_{ik}$为独热表示的中心词
2. $W_{V \times N}$为输入层的权重矩阵，大小为$V \times N$，其中V是词典大小，而N是设定的词向量维度
3. $h_i$是由$x_{ik}$与权重矩阵$W$相乘的到的隐含层结果，大小为$1 \times N$的向量
4. $W^{'}_{N \times V}$为输出层的权重矩阵。注意，这里的$W^{'}$不是$W$矩阵的转置，而是另外一个全新的矩阵。
5. $y_{Cj}$为待预测的上下文词的独热表示

<img src="./pics/Chapter1-SkipGram.png" width="400">

Initialization
Step 1

    Fix the hyperparameters such as window size (2 * n + 1), the size of feature vectors (N), learning rate, number of epochs etc.
    Initialize other parameters such as the connection weights between the input layer and hidden layer, and hidden layer and the softmax layers (can be random initialization).

Step 2

From each sentence in the input file, form a multi-set which will be a collection of every possible contiguous sub-sequence of size (2 * n + 1). Union of all such multi-sets (corresponding to each sentence) will be our training corpus (which is also a multi-set of windows).
Step 3

Make a vocabulary by collecting all possible distinct words. Let the Vocabulary be V.
Step 4 [redundant step added for clarity in understanding]

To each word in dictionary assign a random feature vector. If x is the word then Vec[x] represents the feature vector. The ith row of weight matrix between the input layer and hidden layer stores the feature vector for ith word in the vocabulary, hence in step 1 we have already initialized the word vectors for each word in the dictionary.
One Training Step

Input: Given a single window  w-n  …  w-1  x  w1  ….  wn  from the training corpus.
Forward Step

 

Input Layer

    Activation Function: Identity
    Input: One hot representation of word x
    Weight Matrix Size: [V * N]
    Objective: Select vector Vec[x] corresponding to x

By feeding one hot representation of word x, the input vector to the hidden layer will be the feature vector corresponding to the word x.

Note: The ith row of the hidden weight matrix is the word vector of the ith word in the vocabulary.

 
Hidden Layer

    Activation Function: Identity
    Input: Word vector corresponding to x
    Weight Matrix Size: [N * V] (for each of 2 * n softmax unit)

There are 2 * n softmax units (each having |V| nodes) in the output layer. Let the softmax units be S-n … S-1 ,S1  …. Sn. Feed Vec[x] to each of the softmax unit (each with its own weight matrix). zth node of softmax unit i tries to predict the probability of zth word in dictionary appearing in context position i with respect to the current context. Let P(i, z) be the output of zth node of the ith softmax unit.

 
Output Layer

The following bullet points are with respect to a single softmax unit (say ith) unit in the output layer:

    Input : Output of hidden layer.
    Objective (Redundant): zth node of softmax unit i tries to predict the probability of zth word in dictionary appearing in context position i with respect to the current context.

Crucial Step: For each context position in set P, add the log of the output of each corresponding context word’s node. So let S = $latex \sum_{i = -n, i \neq 0}^{i = n} \log P(i, wi) $ where  wi is the  ith context word and i ∈ P (Objective here is to maximize this log likelihood of the observation).
Backward Step (Back-Propagation Step):

Objective: Try to maximize S (log likelihood of observation as defined above).

S is a function of weights of the model, all words in vocabulary and the word vector corresponding to x. Using back propagation we tune weights of the model (specifically all weight between hidden layer and output layer) and improve the feature vectors word x (which is also embodied in model in the form of weight between input layer and hidden layer), so as to maximize S. This is same as minimizing the negative log likelihood of the observation.

Redundant Note: During one training step, in a window where x forms the middle word, only the weights between hidden layer and output layer, and feature vector of x will be improved. Feature vectors of all other words will not change (Do some math yourself :p). From the next training instance, the modified feature vector of x will be used as its feature vector. In this way, a feature vector gets incrementally.
Training

Run the training step on the training corpus for several epochs.
Output VS Outcome

Output (of the model): Given a window (a training instance), we get the output as the values which we get at the output nodes. The output of the nodes (of output layer) tries to estimate probabilities. These probabilities can be useful for language modeling, though it is generally not used.

Outcome: After the model is trained completely, the  ith rows of the weight matrix between input layer and hidden layer is the feature vector for the ith word in the dictionary. These feature vectors for each word in dictionary are the outcome of the model. And Word2vec is famous good feature vectors.

下面我们在keras中来实现基于CBOW算法的word2vec模型。我们需要进行如下的设计：
1. 首先我们要将原始输入的文字进行标注化（tokenize），并用下标代表每一个单词；
2. 

In [128]:
from __future__ import absolute_import

from keras import backend as K
import numpy as np
from keras.utils.np_utils import accuracy
from keras.models import Sequential, Model
from keras.layers import Input, Lambda, Dense, merge
from keras.layers.embeddings import Embedding
from keras.optimizers import SGD
from keras.objectives import mse

import global_settings as G
from sentences_generator import Sentences
import vocab_generator as V_gen
import save_embeddings as S

k = G.window_size # context windows size
context_size = 2*k

# Creating a sentence generator from demo file
sentences = Sentences("test_file.txt")
vocabulary = dict()
V_gen.build_vocabulary(vocabulary, sentences)
V_gen.filter_vocabulary_based_on(vocabulary, G.min_count)
reverse_vocabulary = V_gen.generate_inverse_vocabulary_lookup(vocabulary, "vocab.txt")

# generate embedding matrix with all values between -1/2d, 1/2d
embedding = np.random.uniform(-1.0/2.0/G.embedding_dimension, 1.0/2.0/G.embedding_dimension, (G.vocab_size+3, G.embedding_dimension))

# Creating CBOW model
# Model has 3 inputs
# Current word index, context words indexes and negative sampled word indexes
word_index = Input(shape=(1,))
context = Input(shape=(context_size,))
negative_samples = Input(shape=(G.negative,))
# All the inputs are processed through a common embedding layer
shared_embedding_layer = Embedding(input_dim=(G.vocab_size+3), output_dim=G.embedding_dimension, weights=[embedding])
word_embedding = shared_embedding_layer(word_index)
context_embeddings = shared_embedding_layer(context)
negative_words_embedding = shared_embedding_layer(negative_samples)
# Now the context words are averaged to get the CBOW vector
cbow = Lambda(lambda x: K.mean(x, axis=1), output_shape=(G.embedding_dimension,))(context_embeddings)
# The context is multiplied (dot product) with current word and negative sampled words
word_context_product = merge([word_embedding, cbow], mode='dot')
negative_context_product = merge([negative_words_embedding, cbow], mode='dot', concat_axis=-1)
# The dot products are outputted
model = Model(input=[word_index, context, negative_samples], output=[word_context_product, negative_context_product])
# binary crossentropy is applied on the output
model.compile(optimizer='rmsprop', loss='binary_crossentropy')
print(model.summary())

# model.fit_generator(V_gen.pretraining_batch_generator(sentences, vocabulary, reverse_vocabulary), samples_per_epoch=G.train_words, nb_epoch=1)
model.fit_generator(V_gen.pretraining_batch_generator(sentences, vocabulary, reverse_vocabulary), samples_per_epoch=10, nb_epoch=1)
# Save the trained embedding
S.save_embeddings("embedding.txt", shared_embedding_layer.get_weights()[0], vocabulary)

# input_context = np.random.randint(10, size=(1, context_size))
# input_word = np.random.randint(10, size=(1,))
# input_negative = np.random.randint(10, size=(1, G.negative))

# print "word, context, negative samples"
# print input_word.shape, input_word
# print input_context.shape, input_context
# print input_negative.shape, input_negative

# output_dot_product, output_negative_product = model.predict([input_word, input_context, input_negative])
# print "word cbow dot product"
# print output_dot_product.shape, output_dot_product
# print "cbow negative dot product"
# print output_negative_product.shape, output_negative_product

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


ImportError: cannot import name 'accuracy'

In [10]:
import zipfile, os, urllib

def maybe_download(filename, url, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    if not os.path.exists(filename):
        filename, _ = urllib.request.urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified', filename)
    else:
        print(statinfo.st_size)
        raise Exception(
            'Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename

url = 'http://mattmahoney.net/dc/'
filename = maybe_download('text8.zip', url, 31344016)

Found and verified text8.zip


In [11]:
import tensorflow as tf
# Read the data into a list of strings.
def read_data(filename):
    """Extract the first file enclosed in a zip file as a list of words."""
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0])).split()
    return data

vocabulary = read_data(filename)
print(vocabulary[:7])
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse']

  from ._conv import register_converters as _register_converters


['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse']


['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse']

In [12]:
documents = []
stopword = []
datafile = './nlp_data/hotel_reviews_data/1000_pos.txt'
stopwordfile = './nlp_data/hotel_reviews_data/stopWord.txt'

# 先读入stopword
#fo=open(stopwordfile, encoding='UTF-8')
with open(stopwordfile, encoding='UTF-8') as fo:
    for line in fo:
       stopword.append(line.strip('\n'))

#print(stopword[70:87])

# 再读入原始评论文档
#fo=open(datafile, encoding='UTF-8')
with open(datafile, encoding='UTF-8') as fo:
    for line in fo:
       documents.append(line.strip('\n'))
    
print(documents[:4])

['距离川沙公路较近,但是公交指示不对,如果是"蔡陆线"的话,会非常麻烦.建议用别的路线.房间较为简单.', '商务大床房，房间很大，床有2M宽，整体感觉经济实惠不错!', '早餐太差，无论去多少人，那边也不加食品的。酒店应该重视一下这个问题了。', '宾馆在小街道上，不大好找，但还好北京热心同胞很多~']


In [13]:
def build_dataset(words, n_words):
    """Process raw inputs into a dataset."""
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(n_words - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary

In [133]:
import collections
data, count, dictionary, reversed_dictionary = build_dataset(vocabulary, 10000)

In [20]:
def cn_list_seg(documents, removeDigits=True):
    vocabulary = {}
    documents_after = []
    documents = list(documents)
    for doc in documents:
        # 每一个文本的分句要单独拎出来进行分词处理，
        doc_elements = doc.split(',.，。；')
        seg_doc = ''
        for element in doc_elements:
            result=jieba.tokenize(element)
            # 如果分词出来的结果包含数字，就扔掉
            for word in result:
                if (removeDigits) & bool(re.search('\d+', word[0])):
                    pass
                else:
                    seg_doc = seg_doc + ' ' + word[0]
                    if word in vocabulary.keys():
                        vocabulary[word[0]] += 1
                    else:
                        vocabulary[word[0]] = 1
        documents_after.append(seg_doc)    
    return documents_after, vocabulary
    
def cn_text_fit(texts):
    word_counts = {}
    for text in texts:
        seq = text.split(' ')
        for w in seq:
            if w in word_counts:
                word_counts[w] += 1
            else:
                word_counts[w] = 1           

        wcounts = list(word_counts.items())
        wcounts.sort(key=lambda x: x[1], reverse=True)
        # forcing the oov_token to index 1 if it exists
        sorted_voc = []
        sorted_voc.extend(wc[0] for wc in wcounts)

        # note that index 0 is reserved, never assigned to an existing word
        word_index = dict(list(zip(sorted_voc, list(range(1, len(sorted_voc) + 1)))))

        index_word = dict((c, w) for w, c in word_index.items())

        for w, c in list(word_docs.items()):
            index_docs[word_index[w]] = c
    return word_index, index_word, index_docs
        

word_index, index_word, indecs_doc = cn_text_fit(documents_after)

NameError: name 'word_docs' is not defined

## 中文字符预处理：
1. 标注化（tokenizer）
2. 序列化
3. 建立字典
4. 建立正向、反向查阅表

In [34]:
documents = [u'独热编码是对于当前单词表中的单词使用1个向量进行表达的简单方式，独热编码有自身的缺点和有点。', 
             u'研究者探索了可以解决存储效率和单词之间关系问题的方法', 
             u'独热编码虽然简单', 
             u'假如当前的单词表有10个不同的单词，并且每个单词都不一样']

from keras.preprocessing.text import Tokenizer

In [164]:
from collections import defaultdict
from collections import OrderedDict
def cntext_to_word_sequence(text, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',):
    translate_dict = dict((c, ' ') for c in filters)
    translate_map = str.maketrans(translate_dict)
    text = text.translate(translate_map)
    words = jieba.tokenize(text)
    seq = [w[0] for w in words if w[0]]
    return seq

def fit_on_cntexts(texts, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'):
    '''
    假设输入是一组中文列表
    '''
    index_docs = defaultdict(int)
    word_docs = defaultdict(int)
    word_counts = OrderedDict()
    word_index = dict()
    index_word = dict()
    document_count = 0
    for text in texts:
        document_count += 1
        print(text)
        
        if isinstance(text, list):
            longtext = ' '.join(text)           
            text = longtext
        seq = cntext_to_word_sequence(text, filters)
        for w in seq:
            if w in word_counts:
                word_counts[w] += 1
            else:
                word_counts[w] = 1
        for w in set(seq):
        # In how many documents each word occurs
           word_docs[w] += 1

        wcounts = list(word_counts.items())
        wcounts.sort(key=lambda x: x[1], reverse=True)
        # forcing the oov_token to index 1 if it exists        
        sorted_voc = []
        sorted_voc.extend(wc[0] for wc in wcounts)
        print(sorted_voc)

        # note that index 0 is reserved, never assigned to an existing word
        word_index = dict(
            list(zip(sorted_voc, list(range(1, len(sorted_voc) + 1))))
        )

        index_word = dict((c, w) for w, c in word_index.items())

        for w, c in list(word_docs.items()):
            index_docs[word_index[w]] = c

    return seq, word_index, index_word, word_docs, index_docs, word_counts, document_count
    
def cntexts_to_sequences(texts, num_words, oov_token, document_count, filters, word_index ):
    """Transforms each text in texts to a sequence of integers.
    Only top `num_words-1` most frequent words will be taken into account.
    Only words known by the tokenizer will be taken into account.
    # Arguments
        texts: A list of texts (strings).
    # Returns
        A list of sequences.
    """
    return list(cntexts_to_sequences_generator(texts, num_words, oov_token, document_count, filters, word_index))    


def cntexts_to_sequences_generator(texts, num_words, oov_token, document_count, filters, word_index ):
    """Transforms each text in `texts` to a sequence of integers.
    Each item in texts can also be a list,
    in which case we assume each item of that list to be a token.
    Only top `num_words-1` most frequent words will be taken into account.
    Only words known by the tokenizer will be taken into account.
    # Arguments
        texts: A list of texts (strings).
    # Yields
        Yields individual sequences.
    """
    num_words = num_words
    oov_token_index = word_index.get(oov_token)
    for text in texts:
        document_count += 1
        print(text)

        if isinstance(text, list):
            longtext = ' '.join(text)           
            text = longtext
        seq = cntext_to_word_sequence(text, filters)
        print(seq, '\n')
                
        vect = []
        temp = []
        for w in seq:
            i = word_index.get(w)
            temp.append(i)
            if i is not None:
                if num_words and i >= num_words:
                    if oov_token_index is not None:
                        vect.append(oov_token_index)
                else:
                    vect.append(i)
            elif oov_token is not None:
                vect.append(oov_token_index)
        print(temp)
        print('-------------')
        yield vect       

        
class Tokenizer(object):
    """Text tokenization utility class.
    This class allows to vectorize a text corpus, by turning each
    text into either a sequence of integers (each integer being the index
    of a token in a dictionary) or into a vector where the coefficient
    for each token could be binary, based on word count, based on tf-idf...
    # Arguments
        num_words: the maximum number of words to keep, based
            on word frequency. Only the most common `num_words-1` words will
            be kept.
        filters: a string where each element is a character that will be
            filtered from the texts. The default is all punctuation, plus
            tabs and line breaks, minus the `'` character.
        lower: boolean. Whether to convert the texts to lowercase.
        split: str. Separator for word splitting.
        char_level: if True, every character will be treated as a token.
        oov_token: if given, it will be added to word_index and used to
            replace out-of-vocabulary words during text_to_sequence calls
    By default, all punctuation is removed, turning the texts into
    space-separated sequences of words
    (words maybe include the `'` character). These sequences are then
    split into lists of tokens. They will then be indexed or vectorized.
    `0` is a reserved index that won't be assigned to any word.
    """

    def __init__(self, num_words=None,
                 filters='!"#$%&()*+，。；,-./:;<=>?@[\\]^_`{|}~\t\n',
                 lower=True,
                 split=' ',
                 char_level=False,
                 oov_token=None,
                 document_count=0,
                 **kwargs):
        # Legacy support
        if 'nb_words' in kwargs:
            warnings.warn('The `nb_words` argument in `Tokenizer` '
                          'has been renamed `num_words`.')
            num_words = kwargs.pop('nb_words')
        if kwargs:
            raise TypeError('Unrecognized keyword arguments: ' + str(kwargs))

        self.word_counts = OrderedDict()
        self.word_docs = defaultdict(int)
        self.filters = filters
        self.split = split
        self.lower = lower
        self.num_words = num_words
        self.document_count = document_count
        self.char_level = char_level
        self.oov_token = oov_token
        self.index_docs = defaultdict(int)
        self.word_index = dict()
        self.index_word = dict()

    def fit_on_texts(self, texts):
        """Updates internal vocabulary based on a list of texts.
        In the case where texts contains lists,
        we assume each entry of the lists to be a token.
        Required before using `texts_to_sequences` or `texts_to_matrix`.
        # Arguments
            texts: can be a list of strings,
                a generator of strings (for memory-efficiency),
                or a list of list of strings.
        """
        for text in texts:
            self.document_count += 1
            if self.char_level or isinstance(text, list):
                if self.lower:
                    if isinstance(text, list):
                        text = [text_elem.lower() for text_elem in text]
                    else:
                        text = text.lower()
                seq = text
            else:
                seq = text_to_word_sequence(text,
                                            self.filters,
                                            self.lower,
                                            self.split)
            for w in seq:
                if w in self.word_counts:
                    self.word_counts[w] += 1
                else:
                    self.word_counts[w] = 1
            for w in set(seq):
                # In how many documents each word occurs
                self.word_docs[w] += 1

        wcounts = list(self.word_counts.items())
        wcounts.sort(key=lambda x: x[1], reverse=True)
        # forcing the oov_token to index 1 if it exists
        if self.oov_token is None:
            sorted_voc = []
        else:
            sorted_voc = [self.oov_token]
        sorted_voc.extend(wc[0] for wc in wcounts)

        # note that index 0 is reserved, never assigned to an existing word
        self.word_index = dict(
            list(zip(sorted_voc, list(range(1, len(sorted_voc) + 1)))))

        self.index_word = dict((c, w) for w, c in self.word_index.items())

        for w, c in list(self.word_docs.items()):
            self.index_docs[self.word_index[w]] = c
            
    def fit_on_cntexts(self, texts):
        '''
        假设输入是一组中文列表
        '''
        for text in texts:
            self.document_count += 1

            if isinstance(text, list):
                longtext = ' '.join(text)           
                text = longtext
            seq = cntext_to_word_sequence(text, self.filters)
            for w in seq:
                if w in self.word_counts:
                    self.word_counts[w] += 1
                else:
                    self.word_counts[w] = 1
            for w in set(seq):
            # In how many documents each word occurs
               self.word_docs[w] += 1

        wcounts = list(word_counts.items())
        wcounts.sort(key=lambda x: x[1], reverse=True)
        # forcing the oov_token to index 1 if it exists        
        if self.oov_token is None:
            sorted_voc = []
        else:
            sorted_voc = [self.oov_token]
        sorted_voc.extend(wc[0] for wc in wcounts)

        # note that index 0 is reserved, never assigned to an existing word
        self.word_index = dict(
                list(zip(sorted_voc, list(range(1, len(sorted_voc) + 1))))
        )

        self.index_word = dict((c, w) for w, c in self.word_index.items())

        for w, c in list(word_docs.items()):
            self.index_docs[self.word_index[w]] = c

    def fit_on_sequences(self, sequences):
        """Updates internal vocabulary based on a list of sequences.
        Required before using `sequences_to_matrix`
        (if `fit_on_texts` was never called).
        # Arguments
            sequences: A list of sequence.
                A "sequence" is a list of integer word indices.
        """
        self.document_count += len(sequences)
        for seq in sequences:
            seq = set(seq)
            for i in seq:
                self.index_docs[i] += 1

    def texts_to_sequences(self, texts):
        """Transforms each text in texts to a sequence of integers.
        Only top `num_words-1` most frequent words will be taken into account.
        Only words known by the tokenizer will be taken into account.
        # Arguments
            texts: A list of texts (strings).
        # Returns
            A list of sequences.
        """
        return list(self.texts_to_sequences_generator(texts))
    
    def cntexts_to_sequences(self, texts):
        """Transforms each text in texts to a sequence of integers.
        Only top `num_words-1` most frequent words will be taken into account.
        Only words known by the tokenizer will be taken into account.
        # Arguments
            texts: A list of texts (strings).
        # Returns
            A list of sequences.
        """
        return list(self.cntexts_to_sequences_generator(texts))    

    def texts_to_sequences_generator(self, texts):
        """Transforms each text in `texts` to a sequence of integers.
        Each item in texts can also be a list,
        in which case we assume each item of that list to be a token.
        Only top `num_words-1` most frequent words will be taken into account.
        Only words known by the tokenizer will be taken into account.
        # Arguments
            texts: A list of texts (strings).
        # Yields
            Yields individual sequences.
        """
        num_words = self.num_words
        oov_token_index = self.word_index.get(self.oov_token)
        
        for text in texts:
            if self.char_level or isinstance(text, list):
                if self.lower:
                    if isinstance(text, list):
                        text = [text_elem.lower() for text_elem in text]
                    else:
                        text = text.lower()
                seq = text
            else:
                seq = text_to_word_sequence(text,
                                            self.filters,
                                            self.lower,
                                            self.split)
            vect = []
            for w in seq:
                i = self.word_index.get(w)
                if i is not None:
                    if num_words and i >= num_words:
                        if oov_token_index is not None:
                            vect.append(oov_token_index)
                    else:
                        vect.append(i)
                elif self.oov_token is not None:
                    vect.append(oov_token_index)
            yield vect
            
    def cntexts_to_sequences_generator(self, texts):
        """Transforms each text in `texts` to a sequence of integers.
        Each item in texts can also be a list,
        in which case we assume each item of that list to be a token.
        Only top `num_words-1` most frequent words will be taken into account.
        Only words known by the tokenizer will be taken into account.
        # Arguments
            texts: A list of texts (strings).
        # Yields
            Yields individual sequences.
        """
        num_words = self.num_words
        oov_token_index = self.word_index.get(self.oov_token)
        for text in texts:
            self.document_count += 1

            if isinstance(text, list):
                longtext = ' '.join(text)           
                text = longtext
            seq = cntext_to_word_sequence(text, self.filters)
                    
            vect = []
            for w in seq:
                i = self.word_index.get(w)
                if i is not None:
                    if num_words and i >= num_words:
                        if oov_token_index is not None:
                            vect.append(oov_token_index)
                    else:
                        vect.append(i)
                elif self.oov_token is not None:
                    vect.append(oov_token_index)
            yield vect            

    def sequences_to_texts(self, sequences):
        """Transforms each sequence into a list of text.
        Only top `num_words-1` most frequent words will be taken into account.
        Only words known by the tokenizer will be taken into account.
        # Arguments
            texts: A list of sequences (list of integers).
        # Returns
            A list of texts (strings)
        """
        return list(self.sequences_to_texts_generator(sequences))

    def sequences_to_texts_generator(self, sequences):
        """Transforms each sequence in `sequences` to a list of texts(strings).
        Each sequence has to a list of integers.
        In other words, sequences should be a list of sequences
        Only top `num_words-1` most frequent words will be taken into account.
        Only words known by the tokenizer will be taken into account.
        # Arguments
            texts: A list of sequences.
        # Yields
            Yields individual texts.
        """
        num_words = self.num_words
        oov_token_index = self.word_index.get(self.oov_token)
        for seq in sequences:
            vect = []
            for num in seq:
                word = self.index_word.get(num)
                if word is not None:
                    if num_words and num >= num_words:
                        if oov_token_index is not None:
                            vect.append(self.index_word[oov_token_index])
                    else:
                        vect.append(word)
                elif self.oov_token is not None:
                    vect.append(self.index_word[oov_token_index])
            vect = ' '.join(vect)
            yield vect

    def texts_to_matrix(self, texts, mode='binary'):
        """Convert a list of texts to a Numpy matrix.
        # Arguments
            texts: list of strings.
            mode: one of "binary", "count", "tfidf", "freq".
        # Returns
            A Numpy matrix.
        """
        sequences = self.texts_to_sequences(texts)
        return self.sequences_to_matrix(sequences, mode=mode)

    def sequences_to_matrix(self, sequences, mode='binary'):
        """Converts a list of sequences into a Numpy matrix.
        # Arguments
            sequences: list of sequences
                (a sequence is a list of integer word indices).
            mode: one of "binary", "count", "tfidf", "freq"
        # Returns
            A Numpy matrix.
        # Raises
            ValueError: In case of invalid `mode` argument,
                or if the Tokenizer requires to be fit to sample data.
        """
        if not self.num_words:
            if self.word_index:
                num_words = len(self.word_index) + 1
            else:
                raise ValueError('Specify a dimension (`num_words` argument), '
                                 'or fit on some text data first.')
        else:
            num_words = self.num_words

        if mode == 'tfidf' and not self.document_count:
            raise ValueError('Fit the Tokenizer on some data '
                             'before using tfidf mode.')

        x = np.zeros((len(sequences), num_words))
        for i, seq in enumerate(sequences):
            if not seq:
                continue
            counts = defaultdict(int)
            for j in seq:
                if j >= num_words:
                    continue
                counts[j] += 1
            for j, c in list(counts.items()):
                if mode == 'count':
                    x[i][j] = c
                elif mode == 'freq':
                    x[i][j] = c / len(seq)
                elif mode == 'binary':
                    x[i][j] = 1
                elif mode == 'tfidf':
                    # Use weighting scheme 2 in
                    # https://en.wikipedia.org/wiki/Tf%E2%80%93idf
                    tf = 1 + np.log(c)
                    idf = np.log(1 + self.document_count /
                                 (1 + self.index_docs.get(j, 0)))
                    x[i][j] = tf * idf
                else:
                    raise ValueError('Unknown vectorization mode:', mode)
        return x

    def get_config(self):
        '''Returns the tokenizer configuration as Python dictionary.
        The word count dictionaries used by the tokenizer get serialized
        into plain JSON, so that the configuration can be read by other
        projects.
        # Returns
            A Python dictionary with the tokenizer configuration.
        '''
        json_word_counts = json.dumps(self.word_counts)
        json_word_docs = json.dumps(self.word_docs)
        json_index_docs = json.dumps(self.index_docs)
        json_word_index = json.dumps(self.word_index)
        json_index_word = json.dumps(self.index_word)

        return {
            'num_words': self.num_words,
            'filters': self.filters,
            'lower': self.lower,
            'split': self.split,
            'char_level': self.char_level,
            'oov_token': self.oov_token,
            'document_count': self.document_count,
            'word_counts': json_word_counts,
            'word_docs': json_word_docs,
            'index_docs': json_index_docs,
            'index_word': json_index_word,
            'word_index': json_word_index
        }

    def to_json(self, **kwargs):
        """Returns a JSON string containing the tokenizer configuration.
        To load a tokenizer from a JSON string, use
        `keras.preprocessing.text.tokenizer_from_json(json_string)`.
        # Arguments
            **kwargs: Additional keyword arguments
                to be passed to `json.dumps()`.
        # Returns
            A JSON string containing the tokenizer configuration.
        """
        config = self.get_config()
        tokenizer_config = {
            'class_name': self.__class__.__name__,
            'config': config
        }
        return json.dumps(tokenizer_config, **kwargs)


def tokenizer_from_json(json_string):
    """Parses a JSON tokenizer configuration file and returns a
    tokenizer instance.
    # Arguments
        json_string: JSON string encoding a tokenizer configuration.
    # Returns
        A Keras Tokenizer instance
    """
    tokenizer_config = json.loads(json_string)
    config = tokenizer_config.get('config')

    word_counts = json.loads(config.pop('word_counts'))
    word_docs = json.loads(config.pop('word_docs'))
    index_docs = json.loads(config.pop('index_docs'))
    # Integer indexing gets converted to strings with json.dumps()
    index_docs = {int(k): v for k, v in index_docs.items()}
    index_word = json.loads(config.pop('index_word'))
    index_word = {int(k): v for k, v in index_word.items()}
    word_index = json.loads(config.pop('word_index'))

    tokenizer = Tokenizer(**config)
    tokenizer.word_counts = word_counts
    tokenizer.word_docs = word_docs
    tokenizer.index_docs = index_docs
    tokenizer.word_index = word_index
    tokenizer.index_word = index_word

    return tokenizer

In [165]:
#corpus = [sentence for sentence in corpus if sentence.count(' ') >= 2]
corpus = documents
tokenizer = Tokenizer()
#tokenizer.fit_on_cntexts(corpus)
#V = len(tokenizer.word_index) + 1
seq, word_index, index_word, word_docs, index_docs, word_counts, document_count = fit_on_cntexts(corpus)
print(seq,'\n')
print('---------------')
print(word_index, '\n')
print(index_word)
print(word_docs)
print(index_docs)
print(word_counts)
print(document_count)

独热编码是对于当前单词表中的单词使用1个向量进行表达的简单方式，独热编码有自身的缺点和有点。
['的', '独热', '编码', '是', '对于', '当前', '单词表', '中', '单词', '使用', '1', '个', '向量', '进行', '表达', '简单', '方式', '，', '有', '自身', '缺点', '和', '有点', '。']
研究者探索了可以解决存储效率和单词之间关系问题的方法
['的', '独热', '编码', '单词', '和', '是', '对于', '当前', '单词表', '中', '使用', '1', '个', '向量', '进行', '表达', '简单', '方式', '，', '有', '自身', '缺点', '有点', '。', '研究者', '探索', '了', '可以', '解决', '存储', '效率', '之间', '关系', '问题', '方法']
独热编码虽然简单
['的', '独热', '编码', '单词', '简单', '和', '是', '对于', '当前', '单词表', '中', '使用', '1', '个', '向量', '进行', '表达', '方式', '，', '有', '自身', '缺点', '有点', '。', '研究者', '探索', '了', '可以', '解决', '存储', '效率', '之间', '关系', '问题', '方法', '虽然']
假如当前的单词表有10个不同的单词，并且每个单词都不一样
['的', '单词', '独热', '编码', '当前', '单词表', '个', '简单', '，', '有', '和', '是', '对于', '中', '使用', '1', '向量', '进行', '表达', '方式', '自身', '缺点', '有点', '。', '研究者', '探索', '了', '可以', '解决', '存储', '效率', '之间', '关系', '问题', '方法', '虽然', '假如', '10', '不同', '并且', '每个', '都', '不', '一样']
['假如', '当前', '的', '单词表', '有', '10', '个', '不同', '的', '单词', '，', '并且', '每个', '单词', '都'

In [166]:
num_words = len(word_index) + 1
oov_token=None
filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
cntexts_to_sequences(corpus, num_words, oov_token, document_count, filters, word_index )
print(tokens)

独热编码是对于当前单词表中的单词使用1个向量进行表达的简单方式，独热编码有自身的缺点和有点。
['独热', '编码', '是', '对于', '当前', '单词表', '中', '的', '单词', '使用', '1', '个', '向量', '进行', '表达', '的', '简单', '方式', '，', '独热', '编码', '有', '自身', '的', '缺点', '和', '有点', '。'] 

[3, 4, 12, 13, 5, 6, 14, 1, 2, 15, 16, 7, 17, 18, 19, 1, 8, 20, 9, 3, 4, 10, 21, 1, 22, 11, 23, 24]
-------------
研究者探索了可以解决存储效率和单词之间关系问题的方法
['研究者', '探索', '了', '可以', '解决', '存储', '效率', '和', '单词', '之间', '关系', '问题', '的', '方法'] 

[25, 26, 27, 28, 29, 30, 31, 11, 2, 32, 33, 34, 1, 35]
-------------
独热编码虽然简单
['独热', '编码', '虽然', '简单'] 

[3, 4, 36, 8]
-------------
假如当前的单词表有10个不同的单词，并且每个单词都不一样
['假如', '当前', '的', '单词表', '有', '10', '个', '不同', '的', '单词', '，', '并且', '每个', '单词', '都', '不', '一样'] 

[37, 5, 1, 6, 10, 38, 7, 39, 1, 2, 9, 40, 41, 2, 42, 43, 44]
-------------
[[2, 3, 5, 6, 7, 8, 9, 1, 10, 11, 12, 13, 14, 15, 16, 1, 17, 18, 4, 2, 3, 19, 20, 1, 21, 22, 23, 4], [22, 10, 1], [2, 3, 17], [7, 1, 8, 19, 13, 1, 10, 4, 10]]


In [129]:
print(documents)
print(tokenizer.index_word)

['独热编码是对于当前单词表中的单词使用1个向量进行表达的简单方式，独热编码有自身的缺点和有点。', '研究者探索了可以解决存储效率和单词之间关系问题的方法', '独热编码虽然简单', '假如当前的单词表有10个不同的单词，并且每个单词都不一样']
{1: '的', 2: '独热', 3: '编码', 4: ' ', 5: '是', 6: '对于', 7: '当前', 8: '单词表', 9: '中', 10: '单词', 11: '使用', 12: '1', 13: '个', 14: '向量', 15: '进行', 16: '表达', 17: '简单', 18: '方式', 19: '有', 20: '自身', 21: '缺点', 22: '和', 23: '有点'}


In [161]:
temp=jieba.tokenize(documents[1])
for w in temp:
    print(w[0])

研究者
探索
了
可以
解决
存储
效率
和
单词
之间
关系
问题
的
方法


In [106]:
from collections import defaultdict
seq, word_index, index_word, word_docs, index_docs, word_counts = fit_on_texts(documents, filters='!"#$%&()*+,-，。./:;<=>?@[\\]^_`{|}~\t\n')
print(index_word)

{1: '的', 2: '独热', 3: '编码', 4: ' ', 5: '是', 6: '对于', 7: '当前', 8: '单词表', 9: '中', 10: '单词', 11: '使用', 12: '1', 13: '个', 14: '向量', 15: '进行', 16: '表达', 17: '简单', 18: '方式', 19: '有', 20: '自身', 21: '缺点', 22: '和', 23: '有点'}


In [None]:

from keras.preprocessing.sequence import skipgrams
V = len(count)
data, labels = skipgrams(sequence=data[:10000], vocabulary_size=1000, window_size=5, negative_samples=5.)

1. 首先获取单词的索引表和反向索引表
2. 其次将原有文字信息用索引值代替
3. 再将中心词，上下文作为样本放入模型进行训练

In [105]:
file = './nlp_data/news_data/finance_news.csv'
dmsc=[]
with open(file, encoding='UTF-8') as fo:
    for i, line in enumerate(fo): 
        #print(line)
        #line = line[1]
        if i<=2000 & i>=500:
            #print(line)
            element = line.split(',')
            dmsc.append(element[1])
            
dmsc_after, vocabulary = cn_list_seg(dmsc)

In [106]:
min_n = 1
max_n = 1
cn_count_model = CountVectorizer(ngram_range=(min_n, max_n), stop_words = stopword) # default unigram model
cnX = cn_count_model.fit_transform(dmsc_after)

In [107]:
vocabulary_df=pd.DataFrame.from_dict(cn_count_model.vocabulary_, orient='index', columns=['subscription'])
vocabulary_df.reset_index(inplace=True)
vocabulary_df.sort_values('subscription')
vocabulary_df.columns=['values', 'subscription']
dictionary = cn_count_model.vocabulary_
reverse_dictionary = vocabulary_df.to_dict(orient='record')

In [108]:
rdf = vocabulary_df.set_index('subscription')
reverse_dictionary = rdf.to_dict(orient='dict')
reverse_dictionary['values']

{69: '主持人',
 736: '规定',
 733: '要求',
 709: '航空公司',
 798: '运输',
 559: '条件',
 65: '中需',
 533: '明确',
 535: '是否',
 488: '提供',
 708: '航班',
 404: '延误',
 730: '补偿',
 800: '还要',
 248: '取消',
 523: '旅客',
 548: '服务',
 171: '内容',
 771: '购票',
 624: '环节',
 268: '告知',
 597: '消费者',
 243: '发生',
 188: '分钟',
 149: '信息',
 553: '机上',
 825: '通报',
 207: '动态',
 781: '超过',
 371: '小时',
 349: '安全',
 348: '安保',
 155: '允许',
 438: '情况',
 350: '安排',
 867: '飞机',
 672: '等待',
 880: '高景',
 6: '一号',
 443: '成功',
 237: '发射',
 360: '实现',
 445: '我国',
 289: '国产',
 827: '遥感',
 223: '卫星',
 796: '运营',
 565: '模式',
 277: '商业化',
 663: '突破',
 59: '中国',
 473: '拥有',
 704: '自主',
 879: '高分辨率',
 511: '数据',
 61: '中国航天科技集团',
 157: '党群',
 383: '工作部',
 831: '部长',
 776: '贾可',
 732: '表示',
 762: '谈及',
 550: '未来',
 428: '徐文',
 353: '完成',
 591: '测试',
 91: '交付',
 283: '四维',
 590: '测绘',
 459: '技术',
 547: '有限公司',
 35: '下属',
 213: '北京航天',
 46: '世景',
 811: '进行',
 486: '推广',
 291: '国土资源',
 759: '调查',
 622: '环境监测',
 522: '方面',
 120: '优质服务',
 527: '日后',
 

In [116]:
dictionary

{'主持人': 69,
 '规定': 736,
 '要求': 733,
 '航空公司': 709,
 '运输': 798,
 '条件': 559,
 '中需': 65,
 '明确': 533,
 '是否': 535,
 '提供': 488,
 '航班': 708,
 '延误': 404,
 '补偿': 730,
 '还要': 800,
 '取消': 248,
 '旅客': 523,
 '服务': 548,
 '内容': 171,
 '购票': 771,
 '环节': 624,
 '告知': 268,
 '消费者': 597,
 '发生': 243,
 '分钟': 188,
 '信息': 149,
 '机上': 553,
 '通报': 825,
 '动态': 207,
 '超过': 781,
 '小时': 371,
 '安全': 349,
 '安保': 348,
 '允许': 155,
 '情况': 438,
 '安排': 350,
 '飞机': 867,
 '等待': 672,
 '高景': 880,
 '一号': 6,
 '成功': 443,
 '发射': 237,
 '实现': 360,
 '我国': 445,
 '国产': 289,
 '遥感': 827,
 '卫星': 223,
 '运营': 796,
 '模式': 565,
 '商业化': 277,
 '突破': 663,
 '中国': 59,
 '拥有': 473,
 '自主': 704,
 '高分辨率': 879,
 '数据': 511,
 '中国航天科技集团': 61,
 '党群': 157,
 '工作部': 383,
 '部长': 831,
 '贾可': 776,
 '表示': 732,
 '谈及': 762,
 '未来': 550,
 '徐文': 428,
 '完成': 353,
 '测试': 591,
 '交付': 91,
 '四维': 283,
 '测绘': 590,
 '技术': 459,
 '有限公司': 547,
 '下属': 35,
 '北京航天': 213,
 '世景': 46,
 '进行': 811,
 '推广': 486,
 '国土资源': 291,
 '调查': 759,
 '环境监测': 622,
 '方面': 522,
 '优质服务': 120,
 '日后': 527,
 