In [None]:
%matplotlib inline


# Topics and Transformations

In [2]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In this tutorial, I will show how to transform documents from one vector representation
into another. This process serves two goals:

1. To bring out hidden structure in the corpus, discover relationships between
   words and use them to describe the documents in a new and
   (hopefully) more semantic way.
2. To make the document representation more compact. This improves both efficiency
   (the new representation consumes fewer resources) and efficacy (marginal data
   trends are ignored, which reduces noise).

## Creating the Corpus



In [1]:
from collections import defaultdict
from gensim import corpora

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

### Creating a transformation

In [3]:
# step 1 -- initialize a model

from gensim import models

tfidf = models.TfidfModel(corpus)  

2023-12-29 18:23:44,070 : INFO : collecting document frequencies
2023-12-29 18:23:44,071 : INFO : PROGRESS: processing document #0
2023-12-29 18:23:44,072 : INFO : TfidfModel lifecycle event {'msg': 'calculated IDF weights for 9 documents and 12 features (28 matrix non-zeros)', 'datetime': '2023-12-29T18:23:44.072376', 'gensim': '4.3.2', 'python': '3.10.11 | packaged by conda-forge | (main, May 10 2023, 19:07:22) [Clang 14.0.6 ]', 'platform': 'macOS-14.1.2-x86_64-i386-64bit', 'event': 'initialize'}


### Transforming vectors

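- `tfidf` is treated as a read-only object that can convert any vector from the old representation (bag-of-words integer counts) to the new representation (TfIdf real-valued weights).
- Calling `model[corpus]` only creates a wrapper around the old `corpus` document stream; the actual conversions are done on the fly, during document iteration.
- We cannot convert the entire corpus at the time of calling `corpus_transformed = model[corpus]`, because that would mean storing the result in main memory, which contradicts gensim's objective of memory independence.
- If you will be iterating over the transformed `corpus_transformed` multiple times, and the transformation is costly, serialize the resulting corpus to disk first and continue using that.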
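
The wrapper behaviour described in the points above can be sketched in plain Python. This is only an illustration of the idea, not gensim's implementation: the class `LazyTransformedCorpus` and the toy `double_weights` transformation are hypothetical names introduced here.

```python
class LazyTransformedCorpus:
    """Hypothetical stand-in for the wrapper returned by model[corpus]."""

    def __init__(self, transform, corpus):
        self.transform = transform  # callable applied to one document at a time
        self.corpus = corpus        # any re-iterable stream of documents
        self.calls = 0              # how many documents have been transformed so far

    def __iter__(self):
        for doc in self.corpus:
            self.calls += 1
            yield self.transform(doc)


def double_weights(doc):
    """Toy transformation: double every weight in a sparse (id, weight) vector."""
    return [(term_id, 2 * weight) for term_id, weight in doc]


toy_corpus = [[(0, 1), (1, 1)], [(1, 2)]]
wrapped = LazyTransformedCorpus(double_weights, toy_corpus)
calls_after_wrapping = wrapped.calls    # creating the wrapper transforms nothing

first_pass = list(wrapped)     # the transformation runs here, during iteration
second_pass = list(wrapped)    # and it runs again in full on every further pass
calls_after_two_passes = wrapped.calls
```

Because every pass repeats the work, a costly transformation that is iterated many times is a good candidate for serializing to disk once and reading back afterwards.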

In [4]:
# step 2 -- use the model to transform vectors
doc_bow = [(0, 1), (1, 1)]
print(tfidf[doc_bow])

[(0, 0.7071067811865476), (1, 0.7071067811865476)]


In [9]:
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)

[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.3244870206138555), (6, 0.44424552527467476), (7, 0.3244870206138555)]
[(2, 0.5710059809418182), (5, 0.4170757362022777), (7, 0.4170757362022777), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]
[(3, 0.6282580468670046), (6, 0.6282580468670046), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(4, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]
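
To see where these numbers come from, here is a minimal pure-Python sketch of the weighting. It assumes the default scheme, that is, raw term count times `log2(N / df)`, followed by scaling to unit Euclidean length. The helper `tfidf` and the variable names are introduced here for illustration, and weights are keyed by token rather than by integer id.

```python
import math
from collections import Counter

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# Same preprocessing as above: drop stop words, then drop words seen only once.
stoplist = set("for a of the and to in".split())
texts = [[w for w in d.lower().split() if w not in stoplist] for d in documents]
frequency = Counter(w for text in texts for w in text)
texts = [[w for w in text if frequency[w] > 1] for text in texts]

n_docs = len(texts)
# Document frequency: in how many documents does each token occur?
doc_freq = Counter(w for text in texts for w in set(text))


def tfidf(tokens):
    """Term count * log2(N / df), scaled to unit Euclidean length."""
    counts = Counter(tokens)
    weights = {w: tf * math.log2(n_docs / doc_freq[w]) for w, tf in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()} if norm else weights


pair = tfidf(["computer", "human"])   # two equally rare terms, one occurrence each
second_doc = tfidf(texts[1])          # "A survey of user opinion of computer ..."
```

Both terms in `pair` occur in two documents, so they receive equal weight and normalization gives 1/sqrt(2) each. In `second_doc`, "system" and "user" occur in three documents rather than two, which is why their weights are lower than those of the other four terms.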


In [10]:
# Transformations can also be applied in sequence, one on top of another, in a sort of chain:
# transform the Tf-Idf corpus via Latent Semantic Indexing into a latent 2-D space (2-D because we set num_topics=2)
lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)  # initialize an LSI transformation
corpus_lsi = lsi_model[corpus_tfidf]  # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

2023-12-29 19:50:13,912 : INFO : using serial LSI version on this node
2023-12-29 19:50:13,913 : INFO : updating model with new documents
2023-12-29 19:50:13,915 : INFO : preparing a new chunk of documents
2023-12-29 19:50:13,917 : INFO : using 100 extra samples and 2 power iterations
2023-12-29 19:50:13,918 : INFO : 1st phase: constructing (12, 102) action matrix
2023-12-29 19:50:13,922 : INFO : orthonormalizing (12, 102) action matrix
2023-12-29 19:50:13,930 : INFO : 2nd phase: running dense svd on (12, 9) matrix
2023-12-29 19:50:13,934 : INFO : computing the final decomposition
2023-12-29 19:50:13,936 : INFO : keeping 2 factors (discarding 47.565% of energy spectrum)
2023-12-29 19:50:13,937 : INFO : processed documents up to #9
2023-12-29 19:50:13,939 : INFO : topic #0(1.594): -0.703*"trees" + -0.538*"graph" + -0.402*"minors" + -0.187*"survey" + -0.061*"system" + -0.060*"time" + -0.060*"response" + -0.058*"user" + -0.049*"computer" + -0.035*"interface"
2023-12-29 19:50:13,940 : INFO

In [11]:
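# "trees", "graph" and "minors" are all related words (and contribute the most to the direction of the first topic),
# while the second topic practically concerns itself with all the other words.
# As expected, the first five documents are more strongly related to the second topic, while the remaining four are more strongly related to the first: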
lsi_model.print_topics(2)

2023-12-29 19:50:51,020 : INFO : topic #0(1.594): -0.703*"trees" + -0.538*"graph" + -0.402*"minors" + -0.187*"survey" + -0.061*"system" + -0.060*"time" + -0.060*"response" + -0.058*"user" + -0.049*"computer" + -0.035*"interface"
2023-12-29 19:50:51,022 : INFO : topic #1(1.476): -0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"time" + -0.320*"response" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"


[(0,
  '-0.703*"trees" + -0.538*"graph" + -0.402*"minors" + -0.187*"survey" + -0.061*"system" + -0.060*"time" + -0.060*"response" + -0.058*"user" + -0.049*"computer" + -0.035*"interface"'),
 (1,
  '-0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"time" + -0.320*"response" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"')]

In [12]:
# both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
for doc, as_text in zip(corpus_lsi, documents):
    print(doc, as_text)

[(0, -0.06600783396090276), (1, -0.5200703306361842)] Human machine interface for lab abc computer applications
[(0, -0.19667592859142433), (1, -0.7609563167700051)] A survey of user opinion of computer system response time
[(0, -0.08992639972446302), (1, -0.7241860626752507)] The EPS user interface management system
[(0, -0.07585847652178028), (1, -0.6320551586003427)] System and human system engineering testing of EPS
[(0, -0.10150299184980069), (1, -0.573730848300296)] Relation of user perceived response time to error measurement
[(0, -0.7032108939378308), (1, 0.1611518021402568)] The generation of random binary unordered trees
[(0, -0.8774787673119832), (1, 0.16758906864659276)] The intersection graph of paths in trees
[(0, -0.9098624686818582), (1, 0.14086553628718887)] Graph minors IV Widths of trees and well quasi ordering
[(0, -0.6165825350569287), (1, -0.05392907566389443)] Graph minors A survey
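
The grouping can be checked directly from the printed coordinates. The sketch below copies the values above (rounded to three decimals) and picks, for each document, the topic whose coordinate has the largest absolute value. The helper name `dominant_topic` is introduced here for illustration. Absolute values are used because the sign of an LSI dimension is arbitrary.

```python
# LSI coordinates copied (rounded) from the output above, one row per document.
lsi_vectors = [
    (-0.066, -0.520),
    (-0.197, -0.761),
    (-0.090, -0.724),
    (-0.076, -0.632),
    (-0.102, -0.574),
    (-0.703, 0.161),
    (-0.877, 0.168),
    (-0.910, 0.141),
    (-0.617, -0.054),
]


def dominant_topic(vector):
    """Index of the coordinate with the largest absolute value."""
    return max(range(len(vector)), key=lambda i: abs(vector[i]))


dominant = [dominant_topic(v) for v in lsi_vectors]
print(dominant)
```

The first five documents (the human-computer interaction ones) load mainly on topic 1, and the last four (the graph-theory ones) on topic 0.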


Model persistence is achieved with the `save()` and `load()` functions:



In [None]:
# Model persistence is achieved with the save() and load() functions
import os
import tempfile

with tempfile.NamedTemporaryFile(prefix='model-', suffix='.lsi', delete=False) as tmp:
    lsi_model.save(tmp.name)  # same for tfidf, lda, ...

loaded_lsi_model = models.LsiModel.load(tmp.name)

os.unlink(tmp.name)

The next question might be: just how exactly similar are those documents to each other?
Is there a way to formalize the similarity, so that for a given input document, we can
order some other set of documents according to their similarity? Similarity queries
are covered in the next tutorial (`sphx_glr_auto_examples_core_run_similarity_queries.py`).


## Available transformations

Gensim implements several popular Vector Space Model algorithms:

* `Term Frequency * Inverse Document Frequency, Tf-Idf <http://en.wikipedia.org/wiki/Tf%E2%80%93idf>`_
  expects a bag-of-words (integer values) training corpus during initialization.
  During transformation, it will take a vector and return another vector of the
  same dimensionality, except that features which were rare in the training corpus
  will have their value increased.
  It therefore converts integer-valued vectors into real-valued ones, while leaving
  the number of dimensions intact. It can also optionally normalize the resulting
  vectors to (Euclidean) unit length.

  .. sourcecode:: pycon

    model = models.TfidfModel(corpus, normalize=True)

* `Okapi Best Matching, Okapi BM25 <https://en.wikipedia.org/wiki/Okapi_BM25>`_
  expects a bag-of-words (integer values) training corpus during initialization.
  During transformation, it will take a vector and return another vector of the
  same dimensionality, except that features which were rare in the training corpus
  will have their value increased. It therefore converts integer-valued
  vectors into real-valued ones, while leaving the number of dimensions intact.

  Okapi BM25 is the standard ranking function used by search engines to estimate
  the relevance of documents to a given search query.

  .. sourcecode:: pycon

    model = models.OkapiBM25Model(corpus)

* `Latent Semantic Indexing, LSI (or sometimes LSA) <http://en.wikipedia.org/wiki/Latent_semantic_indexing>`_
  transforms documents from either bag-of-words or (preferably) TfIdf-weighted space into
  a latent space of a lower dimensionality. For the toy corpus above we used only
  2 latent dimensions, but on real corpora, a target dimensionality of 200--500 is recommended
  as a "gold standard" [1]_.

  .. sourcecode:: pycon

    model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)

  LSI training is unique in that we can continue "training" at any point, simply
  by providing more training documents. This is done by incremental updates to
  the underlying model, in a process called `online training`. Because of this feature, the
  input document stream may even be infinite -- just keep feeding LSI new documents
  as they arrive, while using the computed transformation model as read-only in the meanwhile!

  .. sourcecode:: pycon

    model.add_documents(another_tfidf_corpus)  # now LSI has been trained on tfidf_corpus + another_tfidf_corpus
    lsi_vec = model[tfidf_vec]  # convert some new document into the LSI space, without affecting the model

    model.add_documents(more_documents)  # tfidf_corpus + another_tfidf_corpus + more_documents
    lsi_vec = model[tfidf_vec]

  See the :mod:`gensim.models.lsimodel` documentation for details on how to make
  LSI gradually "forget" old observations in infinite streams. If you want to get dirty,
  there are also parameters you can tweak that affect speed vs. memory footprint vs. numerical
  precision of the LSI algorithm.

  `gensim` uses a novel online incremental streamed distributed training algorithm (quite a mouthful!),
  which I published in [5]_. `gensim` also executes a stochastic multi-pass algorithm
  from Halko et al. [4]_ internally, to accelerate in-core part
  of the computations.
  See also `wiki` for further speed-ups by distributing the computation across
  a cluster of computers.

* `Random Projections, RP <http://www.cis.hut.fi/ella/publications/randproj_kdd.pdf>`_ aim to
  reduce vector space dimensionality. This is a very efficient (both memory- and
  CPU-friendly) approach to approximating TfIdf distances between documents, by throwing in a little randomness.
  Recommended target dimensionality is again in the hundreds/thousands, depending on your dataset.

  .. sourcecode:: pycon

    model = models.RpModel(tfidf_corpus, num_topics=500)

* `Latent Dirichlet Allocation, LDA <http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation>`_
  is yet another transformation from bag-of-words counts into a topic space of lower
  dimensionality. LDA is a probabilistic extension of LSA (also called multinomial PCA),
  so LDA's topics can be interpreted as probability distributions over words. These distributions are,
  just like with LSA, inferred automatically from a training corpus. Documents
  are in turn interpreted as a (soft) mixture of these topics (again, just like with LSA).

  .. sourcecode:: pycon

    model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)

  `gensim` uses a fast implementation of online LDA parameter estimation based on [2]_,
  modified to run in `distributed mode <distributed>` on a cluster of computers.

* `Hierarchical Dirichlet Process, HDP <http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf>`_
  is a non-parametric Bayesian method (note the missing number of requested topics):

  .. sourcecode:: pycon

    model = models.HdpModel(corpus, id2word=dictionary)

  `gensim` uses a fast, online implementation based on [3]_.
  The HDP model is a new addition to `gensim`, and still rough around its academic edges -- use with care.
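
To make the Okapi BM25 entry in the list above more concrete, here is a minimal sketch of one common textbook form of the BM25 term weight. It is not taken from gensim's source, and `OkapiBM25Model` may use a different IDF variant or different defaults. The function name and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math


def bm25_term_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """Weight of one term in one document under a common BM25 variant.

    tf: term count in the document; df: number of documents containing the term.
    """
    # Rare terms (small df) get a larger inverse document frequency.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    # Documents longer than average are penalized, controlled by b.
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    # The term-count contribution saturates as tf grows, controlled by k1.
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)


rare = bm25_term_weight(tf=1, df=2, n_docs=9, doc_len=5, avg_doc_len=5)
common = bm25_term_weight(tf=1, df=6, n_docs=9, doc_len=5, avg_doc_len=5)
repeated = bm25_term_weight(tf=10, df=2, n_docs=9, doc_len=5, avg_doc_len=5)
long_doc = bm25_term_weight(tf=1, df=2, n_docs=9, doc_len=20, avg_doc_len=5)
```

As with Tf-Idf, the rarer term scores higher than the common one. Unlike plain Tf-Idf, ten occurrences are worth less than ten times one occurrence, and the same term counts for less in a longer document.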

Adding new :abbr:`VSM (Vector Space Model)` transformations (such as different weighting schemes) is rather trivial;
see the `apiref` or directly the `Python code <https://github.com/piskvorky/gensim/blob/develop/gensim/models/tfidfmodel.py>`_
for more info and examples.
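
As a rough illustration of what a different weighting scheme amounts to, the sketch below separates a local (per-document) weight from a global (per-corpus) weight and lets either be swapped out. It is plain Python with hypothetical names (`weight_vector`, `raw_count`, `sublinear`, `idf_log2`) and made-up document frequencies. gensim's `TfidfModel` exposes comparable hooks through its `wlocal` and `wglobal` arguments; consult the API reference for their exact calling conventions.

```python
import math


def weight_vector(bow, doc_freq, n_docs, wlocal, wglobal):
    """Apply a local and a global weighting function to each (term_id, count) pair."""
    return [
        (term_id, wlocal(tf) * wglobal(doc_freq[term_id], n_docs))
        for term_id, tf in bow
    ]


def raw_count(tf):
    """Local weight: the term count itself."""
    return tf


def sublinear(tf):
    """Local weight: dampened count, so repeats matter less."""
    return 1.0 + math.log(tf)


def idf_log2(df, n_docs):
    """Global weight: base-2 inverse document frequency."""
    return math.log2(n_docs / df)


doc_freq = {0: 2, 1: 3}    # hypothetical document frequencies
bow = [(0, 1), (1, 4)]     # term 0 once, term 1 four times

plain = weight_vector(bow, doc_freq, 9, raw_count, idf_log2)
damped = weight_vector(bow, doc_freq, 9, sublinear, idf_log2)
```

Swapping `raw_count` for `sublinear` leaves the single-occurrence term unchanged and lowers the weight of the term that occurs four times.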

It is worth repeating that these are all unique, **incremental** implementations,
which do not require the whole training corpus to be present in main memory all at once.
With memory taken care of, I am now improving `distributed` computing,
to raise CPU efficiency, too.
If you feel you could contribute by testing, providing use-cases or code, see the `Gensim Developer guide <https://github.com/RaRe-Technologies/gensim/wiki/Developer-page>`__.
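
The incremental idea can be illustrated with a small sketch that keeps only running statistics and never holds the corpus in memory. It is a simplified analogy, not how any gensim model is implemented. The class `StreamingDocFreq` and the generator `stream` are hypothetical names, and the `add_documents` method is named after the `LsiModel` call shown earlier only to make the parallel visible.

```python
from collections import Counter


class StreamingDocFreq:
    """Hypothetical helper: document frequencies updated one chunk at a time."""

    def __init__(self):
        self.n_docs = 0
        self.doc_freq = Counter()

    def add_documents(self, docs):
        """Consume an iterable (possibly a generator) of tokenized documents."""
        for tokens in docs:
            self.n_docs += 1
            self.doc_freq.update(set(tokens))  # count each token once per document


def stream(lines):
    """Yield one tokenized document at a time instead of building a list."""
    for line in lines:
        yield line.lower().split()


stats = StreamingDocFreq()
stats.add_documents(stream(["graph minors survey", "graph trees"]))
stats.add_documents(stream(["trees trees paths"]))  # later chunk, same model
```

After the second call the statistics cover all three documents, even though no more than one document was held in memory at a time.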

## What Next?

Continue on to the next tutorial on `sphx_glr_auto_examples_core_run_similarity_queries.py`.

## References

.. [1] Bradford. 2008. An empirical study of required dimensionality for large-scale latent semantic indexing applications.

.. [2] Hoffman, Blei, Bach. 2010. Online learning for Latent Dirichlet Allocation.

.. [3] Wang, Paisley, Blei. 2011. Online variational inference for the hierarchical Dirichlet process.

.. [4] Halko, Martinsson, Tropp. 2009. Finding structure with randomness.

.. [5] Řehůřek. 2011. Subspace tracking for Latent Semantic Analysis.



In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
img = mpimg.imread('run_topics_and_transformations.png')
imgplot = plt.imshow(img)
_ = plt.axis('off')