corpora is a basic concept in gensim: it is the representation of a document collection and the foundation for all further processing.

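In practice, the vocabulary is very large (tens of thousands of words, or even on the order of 100,000), while any single document contains only a limited number of words. Representing the documents as a dense matrix would therefore waste a great deal of memory, so gensim represents them internally in sparse form.
So how is a document in string form converted into numeric matrix form? This is where the dictionary comes in. The dictionary is the set of all words across all documents, and it also records information such as how often each word occurs.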
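
The dictionary and the sparse representation described above can be sketched in plain Python. This is only an illustration on hypothetical toy data: the names toy_texts, token2id, to_sparse_bow and to_dense are made up for this sketch, and the id assignment (order of first appearance) is an assumption that need not match the ids gensim assigns.

```python
from collections import Counter

# Hypothetical toy data, not the corpus used in this notebook.
toy_texts = [["graph", "trees"], ["graph", "minors", "trees", "graph"]]

# Assign an integer id to each distinct token, in order of first appearance.
token2id = {}
for text in toy_texts:
    for token in text:
        token2id.setdefault(token, len(token2id))


def to_sparse_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; unknown tokens are ignored."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


def to_dense(sparse_bow, vocab_size):
    """Expand (token_id, count) pairs into a full-length count vector."""
    dense = [0] * vocab_size
    for token_id, count in sparse_bow:
        dense[token_id] = count
    return dense


sparse = to_sparse_bow(toy_texts[1], token2id)
dense = to_dense(sparse, len(token2id))
print(token2id)
print(sparse)
print(dense)
```

With three words the two forms are the same size, but with a vocabulary of 100,000 words the dense vector would hold 100,000 entries per document, while the sparse list holds only one pair per distinct word in that document.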

In [1]:
from gensim import corpora
from collections import defaultdict



In [2]:
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

<h1>Preprocessing</h1>

Step 1: Tokenization

In [3]:
# Stop words
stoplist = set("for a of and to the in".split())

In [4]:
# Lowercase, split on whitespace, and drop stop words
# The result is a list of lists (one token list per document)
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

In [5]:
# Remove tokens that appear only once in the whole collection
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
         for text in texts]

In [6]:
print(texts)

[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]


Step 2: Build the dictionary

In [7]:
dictionary = corpora.Dictionary(texts)   # build the dictionary

In [8]:
print(dictionary)

Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)


Step 3: Build the corpus

In [9]:
# Save to disk (the ./temp directory must already exist)
dictionary.save('./temp/deerwester.dict')  # store the dictionary, for future reference
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors


In [10]:
print(corpus)

[[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (5, 1), (7, 1), (8, 1)], [(1, 1), (5, 2), (8, 1)], [(3, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(4, 1), (10, 1), (11, 1)]]


In [11]:
corpora.MmCorpus.serialize('./temp/deerwester.mm', corpus)  # store to disk, for later use

In [12]:
# Load the corresponding corpus back from disk
corpus = corpora.MmCorpus('./temp/deerwester.mm')

In [13]:
print(corpus)

MmCorpus(9 documents, 12 features, 28 non-zero entries)


<h1>Processing with models</h1>

In [14]:
from gensim import models

In [15]:
tfidf_model=models.TfidfModel(corpus)

In [22]:
print(tfidf_model)

TfidfModel(num_docs=9, num_nnz=28)


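Note that so far we have only created a model. It behaves somewhat like a generator: it is not the result of transforming the corpus. A tf-idf model stores corpus statistics such as the document frequency of each word. To convert documents into vectors in tf-idf form, the following command is also needed: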

In [19]:
corpus_tfidf=tfidf_model[corpus]

In [20]:
print(corpus_tfidf)

<gensim.interfaces.TransformedCorpus object at 0x000001C67E428940>
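
Printing the transformed corpus shows only a wrapper object, so it may help to see what the weights inside look like. The sketch below recomputes tf-idf by hand in plain Python, assuming the weighting is the raw term count multiplied by log2(N / df) followed by L2 normalization, which I believe is gensim's default; the helper name tfidf_weights is hypothetical, and the bag-of-words corpus is copied from the output of In [10].

```python
import math

# Bag-of-words corpus copied from the printed output of In [10].
bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]


def tfidf_weights(bow_corpus):
    """Hypothetical helper: weight = count * log2(N / df), then L2-normalize.

    This assumes the weighting scheme stated in the lead-in; it is a sketch,
    not gensim's implementation.
    """
    num_docs = len(bow_corpus)
    # Document frequency: in how many documents each term id occurs.
    doc_freq = {}
    for bow in bow_corpus:
        for term_id, _ in bow:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1

    weighted_corpus = []
    for bow in bow_corpus:
        weighted = [(t, c * math.log2(num_docs / doc_freq[t])) for t, c in bow]
        norm = math.sqrt(sum(w * w for _, w in weighted))
        # A document whose weights are all zero is returned as an empty vector.
        weighted_corpus.append(
            [(t, w / norm) for t, w in weighted] if norm > 0 else []
        )
    return weighted_corpus


tfidf_corpus = tfidf_weights(bow_corpus)
for doc in tfidf_corpus[:2]:
    print([(t, round(w, 4)) for t, w in doc])
```

In the first document all three words occur in exactly two documents, so they receive equal weight and each normalized weight is 1/sqrt(3).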


In [21]:
lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi_model[corpus_tfidf]

In [23]:
print(corpus_lsi)

<gensim.interfaces.TransformedCorpus object at 0x000001C67E428CF8>


<h1>Similarity computation</h1>

In [25]:
from gensim import similarities

In [26]:
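# First, build a similarity index over the LSI vectors
corpus_simi_matrix = similarities.MatrixSimilarity(corpus_lsi)
# Compute the similarity between a new text and the existing documents

# Note: the query is not lowercased, so "Human" does not match the dictionary
# token "human"; doc2bow finds only "computer".
test_text = "Human computer interaction".split()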
test_bow = dictionary.doc2bow(test_text)
test_tfidf = tfidf_model[test_bow]
test_lsi = lsi_model[test_tfidf]
test_simi = corpus_simi_matrix[test_lsi]


print(list(enumerate(test_simi)))

[(0, 0.99916452), (1, 0.99632162), (2, 0.9990505), (3, 0.99886364), (4, 0.99996823), (5, -0.058117405), (6, -0.021589279), (7, 0.013524055), (8, 0.25163394)]


In [27]:
corpus_simi=similarities.MatrixSimilarity(corpus_tfidf)

test_text = "Human computer interaction".split()
test_bow = dictionary.doc2bow(test_text)
test_tfidf = tfidf_model[test_bow]
test_simi=corpus_simi[test_tfidf]

print(list(enumerate(test_simi)))

[(0, 0.57735026), (1, 0.44424552), (2, 0.0), (3, 0.0), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]
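
The scores above are cosine similarities between the query vector and each document vector. As a minimal sketch of that measure on sparse (id, weight) vectors, the following plain-Python function may be useful; the name cosine_similarity and the three example vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors of (id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    # Only ids present in both vectors contribute to the dot product.
    dot = sum(w * b[i] for i, w in a.items() if i in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical vectors: the query has one term, doc_a contains that term
# alongside two equally weighted others, and doc_b shares no term with it.
query = [(0, 1.0)]
doc_a = [(0, 1.0), (1, 1.0), (2, 1.0)]
doc_b = [(3, 1.0), (4, 2.0)]

sim_a = cosine_similarity(query, doc_a)
sim_b = cosine_similarity(query, doc_b)
print(round(sim_a, 5), sim_b)
```

This mirrors the pattern in the tf-idf result: the query appears to match only the token "computer" (the capitalized "Human" is not in the dictionary), so a document holding that word among three equally weighted words scores 1/sqrt(3), about 0.577, and documents without it score 0.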
