## Transformation interface

In the previous tutorial on Corpora and Vector Spaces, we created a corpus of documents represented as a stream of vectors. To continue, let’s fire up gensim and use that corpus:

In [2]:
import os

In [4]:
>>> from gensim import corpora, models, similarities
>>> if os.path.exists("/tmp/deerwester.dict"):
...     dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
...     corpus = corpora.MmCorpus('/tmp/deerwester.mm')
...     print("Used files generated from first tutorial")
... else:
...     print("Please run first tutorial to generate data set")

Used files generated from first tutorial


In this tutorial, I will show how to transform documents from one vector representation into another. This process serves two goals:

1. To bring out hidden structure in the corpus, discover relationships between words and use them to describe the documents in a new and (hopefully) more semantic way.
2. To make the document representation more compact. This improves both efficiency (the new representation consumes fewer resources) and efficacy (marginal data trends are ignored, which reduces noise).

## Creating a transformation

The transformations are standard Python objects, typically initialized by means of a training corpus:



In [5]:
>>> tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model

In [6]:
tfidf

<gensim.models.tfidfmodel.TfidfModel at 0x10a711cd0>

## Transforming vectors

From now on, *tfidf* is treated as a read-only object that can be used to convert any vector from the old representation (bag-of-words integer counts) to the new representation (Tf-Idf real-valued weights):

In [9]:
>>> doc_bow = [(0, 1), (9, 1)]
>>> print(tfidf[doc_bow]) # step 2 -- use the model to transform vectors

[(0, 0.8075244024440723), (9, 0.5898341626740045)]
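
To make the numbers above less opaque, here is a minimal pure-Python sketch of the weighting, assuming gensim's default Tf-Idf settings (raw term count times log2(N/df), then scaling the vector to unit length). The `bow_corpus` literal is an assumption: it is reconstructed from the Tf-Idf output printed in this tutorial, not loaded from `/tmp/deerwester.mm`, and the helper names `document_frequencies` and `tfidf_transform` are hypothetical, not part of gensim.

```python
import math

# Bag-of-words corpus inferred from the Tf-Idf output in this tutorial
# (assumption: term ids and counts are reconstructed, not read from disk).
bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(1, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(0, 1), (6, 1), (7, 1), (8, 1)],
    [(2, 1), (6, 2), (8, 1)],
    [(3, 1), (4, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(5, 1), (10, 1), (11, 1)],
]


def document_frequencies(corpus):
    """Count, for each term id, the number of documents that contain it."""
    dfs = {}
    for doc in corpus:
        for term_id, _count in doc:
            dfs[term_id] = dfs.get(term_id, 0) + 1
    return dfs


def tfidf_transform(doc, dfs, num_docs):
    """Weight each term by count * log2(N / df), then L2-normalize the vector."""
    weights = [
        (term_id, count * math.log2(num_docs / dfs[term_id]))
        for term_id, count in doc
        if term_id in dfs  # terms unseen during training are skipped
    ]
    norm = math.sqrt(sum(w * w for _term_id, w in weights))
    if norm == 0:
        return []
    return [(term_id, w / norm) for term_id, w in weights]


dfs = document_frequencies(bow_corpus)
vec = tfidf_transform([(0, 1), (9, 1)], dfs, len(bow_corpus))
print(vec)
```

Under these assumptions the sketch should agree with the `tfidf[doc_bow]` output shown above up to floating-point rounding: term 0 occurs in 2 of 9 documents and term 9 in 3 of 9, so term 0 receives the larger weight.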


The same bracket syntax applies the transformation to a whole corpus:

In [8]:
>>> corpus_tfidf = tfidf[corpus]
>>> for doc in corpus_tfidf:
...     print(doc)

[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(1, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.44424552527467476), (6, 0.3244870206138555), (7, 0.3244870206138555)]
[(0, 0.5710059809418182), (6, 0.4170757362022777), (7, 0.4170757362022777), (8, 0.5710059809418182)]
[(2, 0.49182558987264147), (6, 0.7184811607083769), (8, 0.49182558987264147)]
[(3, 0.6282580468670046), (4, 0.6282580468670046), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(5, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]


Transformations can also be chained, one on top of another:

In [10]:
>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation
>>> corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

Here we transformed our Tf-Idf corpus via Latent Semantic Indexing into a latent 2-D space (2-D because we set num_topics=2). Now you’re probably wondering: what do these two latent dimensions stand for? Let’s inspect with *models.LsiModel.print_topics()*:

In [12]:
>>> lsi.print_topics(2)

[(0,
  u'0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"response" + 0.060*"time" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"'),
 (1,
  u'-0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"time" + -0.320*"response" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"')]

In [13]:
>>> for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
...     print(doc)

[(0, 0.066007833960904289), (1, -0.52007033063618446)]
[(0, 0.19667592859142607), (1, -0.76095631677000519)]
[(0, 0.089926399724464146), (1, -0.72418606267525121)]
[(0, 0.07585847652178139), (1, -0.63205515860034267)]
[(0, 0.10150299184980212), (1, -0.57373084830029653)]
[(0, 0.70321089393783076), (1, 0.16115180214025809)]
[(0, 0.87747876731198304), (1, 0.16758906864659434)]
[(0, 0.90986246868185816), (1, 0.14086553628719065)]
[(0, 0.61658253505692873), (1, -0.053929075663893461)]
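
The comment in the cell above says that both transformations are executed on the fly, during iteration. The following pure-Python sketch illustrates that lazy-wrapper idea in isolation. It is not gensim's actual implementation: the class `LazyTransformed` and the two toy transformations are hypothetical names chosen for this example.

```python
class LazyTransformed:
    """Hypothetical wrapper: applies `transform` to each document only when iterated."""

    def __init__(self, transform, corpus):
        self.transform = transform
        self.corpus = corpus

    def __iter__(self):
        for doc in self.corpus:
            yield self.transform(doc)


calls = []  # records the order in which the transformations actually run


def double_weights(doc):
    calls.append("double")
    return [(term_id, 2 * weight) for term_id, weight in doc]


def keep_even_ids(doc):
    calls.append("filter")
    return [(term_id, weight) for term_id, weight in doc if term_id % 2 == 0]


toy_corpus = [[(0, 1), (1, 1)], [(2, 3)]]

# Wrapping twice builds the chain but performs no work yet.
chained = LazyTransformed(keep_even_ids, LazyTransformed(double_weights, toy_corpus))
calls_before_iteration = list(calls)

# Iterating pulls each document through both steps, one document at a time.
result = list(chained)
print(calls_before_iteration, result, calls)
```

Because nothing is computed until iteration, the transformed corpus never has to fit in memory at once. The flip side is that every pass over the wrapper repeats the computation, so a result that will be iterated many times is usually worth writing to disk first.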


Models are persisted with the *save()* and *load()* functions:

In [15]:
>>> lsi.save('/tmp/model.lsi') # same for tfidf, lda, ...
>>> lsi = models.LsiModel.load('/tmp/model.lsi')

## Available transformations

In [17]:
model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=300) # Latent Semantic Indexing, trained on the Tf-Idf corpus

In [19]:
rpmodel = models.RpModel(corpus_tfidf, num_topics=500) # Random Projections, trained on the Tf-Idf corpus

In [20]:
lda_model = models.LdaModel(corpus, id2word=dictionary, num_topics=100) # Latent Dirichlet Allocation, trained on the bag-of-words corpus

In [21]:
lda_model

<gensim.models.ldamodel.LdaModel at 0x10a774610>