# [Introduction](https://radimrehurek.com/gensim/intro.html)

Gensim is a free Python library designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.

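Gensim's algorithms include Word2Vec, FastText, Latent Semantic Analysis (LSI/LSA, see `LsiModel`), Latent Dirichlet Allocation (LDA, see `LdaModel`), and others. By analyzing statistical co-occurrence patterns, these algorithms automatically discover the semantic structure of documents. They are unsupervised, which greatly reduces the manual work required.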
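
As a rough illustration of what "statistical co-occurrence patterns" means, the sketch below counts how often pairs of words appear in the same document. It is plain Python, not Gensim's implementation, and the three toy documents and the `cooccurrence_counts` helper are made up for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy documents, already tokenized (not taken from the Gensim docs).
documents = [
    ["human", "computer", "interface"],
    ["computer", "user", "interface"],
    ["graph", "trees", "minors"],
]


def cooccurrence_counts(docs):
    """Count, for each unordered word pair, how many documents contain both words."""
    counts = Counter()
    for doc in docs:
        # sorted(set(...)) gives every pair a canonical order and ignores repeated words.
        for pair in combinations(sorted(set(doc)), 2):
            counts[pair] += 1
    return counts


counts = cooccurrence_counts(documents)
print(counts[("computer", "interface")])  # 2: the pair shares two documents
print(counts[("computer", "graph")])      # 0: the pair never shares a document
```

Words that often share documents ("computer", "interface") end up with high counts, and this is the kind of signal an unsupervised model can exploit without any labels.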

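## Core concepts

- Corpus: a collection of digital documents, held in a numeric representation (vectors, matrices, and so on).
- Vector space model: the representation of each document as a vector of features.
- Gensim sparse vector, bag-of-words vector: a sparse vector stores only its non-zero entries, which greatly reduces memory use.
- Gensim streamed corpus: besides the usual list, NumPy array, and Pandas DataFrame, Gensim also accepts streamed objects that yield one document at a time.
- Model, transformation: documents can be represented as vectors, and a model can be viewed as a transformation from one vector space to another.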
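
The sparse bag-of-words and streamed-corpus ideas above can be sketched in plain Python. This is only an illustration under assumptions: the `build_vocabulary` and `to_bow` helpers, the `StreamedCorpus` class, and the toy documents are hypothetical names made up here, not Gensim APIs (Gensim's own tooling for this lives in `gensim.corpora`).

```python
from collections import Counter

# Hypothetical toy documents, one per entry, standing in for lines of a file on disk.
RAW_DOCUMENTS = [
    "human computer interaction",
    "computer graph computer",
    "graph minors survey",
]


def build_vocabulary(lines):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for line in lines:
        for token in line.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Return a sparse vector: sorted (token_id, count) pairs, zero counts omitted."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


class StreamedCorpus:
    """Iterable that yields one bag-of-words vector at a time instead of a full list."""

    def __init__(self, lines, vocab):
        self.lines = lines
        self.vocab = vocab

    def __iter__(self):
        for line in self.lines:
            yield to_bow(line.split(), self.vocab)


vocab = build_vocabulary(RAW_DOCUMENTS)
corpus = StreamedCorpus(RAW_DOCUMENTS, vocab)
for vector in corpus:
    print(vector)
```

Because `__iter__` creates a fresh generator on every call, the corpus can be iterated more than once, which matters for algorithms that make several passes over the data.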

# [Tutorials](https://radimrehurek.com/gensim/tutorial.html)

The tutorials cover the following topics:

- Corpora and Vector Spaces
  - [From Strings to Vectors](https://radimrehurek.com/gensim/tut1.html#from-strings-to-vectors)
  - [Corpus Streaming – One Document at a Time](https://radimrehurek.com/gensim/tut1.html#corpus-streaming-one-document-at-a-time)
  - [Corpus Formats](https://radimrehurek.com/gensim/tut1.html#corpus-formats)
  - [Compatibility with NumPy and SciPy](https://radimrehurek.com/gensim/tut1.html#compatibility-with-numpy-and-scipy)
- Topics and Transformations
  - [Transformation interface](https://radimrehurek.com/gensim/tut2.html#transformation-interface)
  - [Available transformations](https://radimrehurek.com/gensim/tut2.html#available-transformations)
- Similarity Queries
  - [Similarity interface](https://radimrehurek.com/gensim/tut3.html#similarity-interface)
  - [Where next?](https://radimrehurek.com/gensim/tut3.html#where-next)
- Experiments on the English Wikipedia
  - [Preparing the corpus](https://radimrehurek.com/gensim/wiki.html#preparing-the-corpus)
  - [Latent Semantic Analysis](https://radimrehurek.com/gensim/wiki.html#latent-semantic-analysis)
  - [Latent Dirichlet Allocation](https://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation)
- Distributed Computing
  - [Why distributed computing?](https://radimrehurek.com/gensim/distributed.html#why-distributed-computing)
  - [Prerequisites](https://radimrehurek.com/gensim/distributed.html#prerequisites)
  - [Core concepts](https://radimrehurek.com/gensim/distributed.html#core-concepts)
  - [Available distributed algorithms](https://radimrehurek.com/gensim/distributed.html#available-distributed-algorithms)

**Preliminaries**

    # Show Gensim's progress messages (INFO level and above) with timestamps.
    import logging
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

The following example demonstrates the TF-IDF model and a simple similarity query.

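    from gensim import models
    from gensim import similarities

    # Toy corpus: nine documents as sparse bag-of-words vectors of (term_id, count) pairs.
    corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
              [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
              [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
              [(0, 1.0), (4, 2.0), (7, 1.0)],
              [(3, 1.0), (5, 1.0), (6, 1.0)],
              [(9, 1.0)],
              [(9, 1.0), (10, 1.0)],
              [(9, 1.0), (10, 1.0), (11, 1.0)],
              [(8, 1.0), (10, 1.0), (11, 1.0)]]

    # Fit TF-IDF weights on the corpus, then transform a new bag-of-words vector.
    tfidf = models.TfidfModel(corpus)
    vec = [(0, 1), (4, 1)]
    print(tfidf[vec])

    # Index the TF-IDF-transformed corpus; num_features covers term ids 0 to 11.
    index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=12)
    # Similarity of the query vector to each of the nine indexed documents.
    sims = index[tfidf[vec]]
    print(list(enumerate(sims)))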

Output:

    2019-10-12 08:11:35,483 : INFO : collecting document frequencies
    2019-10-12 08:11:35,484 : INFO : PROGRESS: processing document #0
    2019-10-12 08:11:35,485 : INFO : calculating IDF weights for 9 documents and 12 features (28 matrix non-zeros)
    2019-10-12 08:11:35,487 : INFO : creating sparse index
    2019-10-12 08:11:35,488 : INFO : creating sparse matrix from corpus
    2019-10-12 08:11:35,488 : INFO : PROGRESS: at document #0
    2019-10-12 08:11:35,490 : INFO : created <9x12 sparse matrix of type '<class 'numpy.float32'>'
        with 28 stored elements in Compressed Sparse Row format>

    [(0, 0.8075244024440723), (4, 0.5898341626740045)]
    [(0, 0.4662244), (1, 0.19139354), (2, 0.2460055), (3, 0.82094586), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]
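
To see where these numbers come from, the following sketch recomputes them in plain Python. It rests on an assumption about `TfidfModel`'s default settings: raw term frequency multiplied by log2(N / document frequency), followed by normalization to unit length, with cosine similarity used for the query. The helper names `document_frequencies`, `tfidf_transform`, and `cosine` are made up for this illustration and are not part of Gensim.

```python
import math

# Same toy corpus as in the example above.
corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]


def document_frequencies(corpus):
    """Map each term id to the number of documents that contain it."""
    df = {}
    for doc in corpus:
        for term_id in {t for t, _ in doc}:
            df[term_id] = df.get(term_id, 0) + 1
    return df


def tfidf_transform(vec, df, n_docs):
    """Weight each term by tf * log2(N / df), then scale the vector to unit length."""
    weighted = [(t, tf * math.log2(n_docs / df[t])) for t, tf in vec if t in df]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(t, w / norm) for t, w in weighted] if norm else weighted


def cosine(a, b):
    """Dot product of two sparse vectors; this is the cosine when both have unit length."""
    b_map = dict(b)
    return sum(w * b_map.get(t, 0.0) for t, w in a)


df = document_frequencies(corpus)
query = tfidf_transform([(0, 1), (4, 1)], df, len(corpus))
sims = [cosine(query, tfidf_transform(doc, df, len(corpus))) for doc in corpus]
print(query)
print(list(enumerate(sims)))
```

For example, term 0 occurs in 2 of the 9 documents and term 4 in 3 of them, so their raw weights are log2(9/2) ≈ 2.170 and log2(9/3) ≈ 1.585; dividing by the vector's length (≈ 2.687) gives the 0.8075 and 0.5898 shown above. The similarities agree with the Gensim output up to floating-point precision, since the index stores its values as 32-bit floats.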

