# Learning Gensim

In [86]:
from gensim import corpora, models, similarities
from pprint import pprint

In [87]:
documents = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]

This is a tiny corpus of nine documents, each consisting of only a single sentence.

First, let’s tokenize the documents, remove common words (using a toy stoplist) as well as words that only appear once in the corpus:

In [88]:
# Remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
          for document in documents]

In [89]:
texts

[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]

In [90]:
# Remove words that appear only once
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
          for text in texts]

In [91]:
# This is the word count
pprint(frequency)

defaultdict(<type 'int'>, {'minors': 2, 'generation': 1, 'testing': 1, 'iv': 1, 'engineering': 1, 'computer': 2, 'relation': 1, 'human': 2, 'measurement': 1, 'unordered': 1, 'binary': 1, 'abc': 1, 'ordering': 1, 'graph': 3, 'system': 4, 'machine': 1, 'quasi': 1, 'random': 1, 'paths': 1, 'error': 1, 'trees': 3, 'lab': 1, 'applications': 1, 'management': 1, 'user': 3, 'interface': 2, 'intersection': 1, 'response': 2, 'perceived': 1, 'widths': 1, 'well': 1, 'eps': 2, 'survey': 2, 'time': 2, 'opinion': 1})


In [92]:
pprint(texts)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


## Converting documents to vectors

To convert documents to vectors we'll use a document representation called bag-of-words. In this representation, each document is represented by one vector where each vector element represents a question-answer pair, in the style of:

“How many times does the word system appear in the document? Once.”
It is advantageous to represent the questions only by their (integer) ids. The mapping between the questions and ids is called a dictionary:

In [93]:
dictionary = corpora.Dictionary(texts)
dictionary.save('/tmp/deerwester.dict')
print(dictionary)

Dictionary(12 unique tokens: [u'minors', u'graph', u'system', u'trees', u'eps']...)


Here we assigned a unique integer id to all words appearing in the corpus with the gensim.corpora.dictionary.Dictionary class. This sweeps across the texts, collecting word counts and relevant statistics. In the end, we see there are twelve distinct words in the processed corpus, which means each document will be represented by twelve numbers (i.e., by a 12-D vector). To see the mapping between words and their ids:

In [94]:
pprint(dictionary.token2id)

{u'computer': 1,
 u'eps': 8,
 u'graph': 10,
 u'human': 2,
 u'interface': 0,
 u'minors': 11,
 u'response': 3,
 u'survey': 5,
 u'system': 6,
 u'time': 4,
 u'trees': 9,
 u'user': 7}
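What such a dictionary records can be sketched in a few lines of pure Python: a token-to-id mapping plus, for each id, the number of documents that contain the token. The name `build_vocabulary` is hypothetical, and assigning ids in order of first appearance is an assumption made for illustration; gensim's own id assignment differs, as the mapping printed above shows.

```python
from collections import defaultdict

# The processed corpus from In [92].
texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]


def build_vocabulary(texts):
    """Return (token2id, dfs): token -> id, and id -> number of documents containing it."""
    token2id = {}
    dfs = defaultdict(int)
    for text in texts:
        for token in text:
            if token not in token2id:
                # Assumption: ids follow order of first appearance.
                token2id[token] = len(token2id)
        # set() so a token repeated within one document is counted once
        for token in set(text):
            dfs[token2id[token]] += 1
    return token2id, dict(dfs)


token2id, dfs = build_vocabulary(texts)
print(len(token2id))            # number of distinct tokens
print(dfs[token2id['system']])  # number of documents containing 'system'
```

The document frequencies collected here are the kind of statistic that later steps, such as removing rare words or computing TfIdf weights, rely on.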


To actually convert tokenized documents to vectors:

In [95]:
# Suppose we take the first document that has been converted to a list
new_doc = texts[0]
print(new_doc)
new_vec = dictionary.doc2bow(new_doc)
print(new_vec)

['human', 'interface', 'computer']
[(0, 1), (1, 1), (2, 1)]


This means that the three words, indexed by the first number in each pair (interface has id 0, computer id 1 and human id 2), each appear once, while all other words in the dictionary appear (implicitly) zero times.

The function `doc2bow()` simply counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a sparse vector.
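The counting step can be mimicked without gensim. The sketch below is a hypothetical helper, `to_bow`, that uses the token ids printed in In [94]; it assumes that words missing from the mapping are silently dropped.

```python
from collections import Counter

# Token ids as printed in In [94].
token2id = {'interface': 0, 'computer': 1, 'human': 2, 'response': 3,
            'time': 4, 'survey': 5, 'system': 6, 'user': 7,
            'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}


def to_bow(tokens, token2id):
    """Count known tokens and return (id, count) pairs sorted by id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


print(to_bow(['human', 'interface', 'computer'], token2id))
print(to_bow(['system', 'human', 'system', 'eps'], token2id))
# 'interaction' is not in the mapping, so it does not appear in the result
print(to_bow(['human', 'computer', 'interaction'], token2id))
```

Only non-zero counts are stored, which is what makes the vector sparse.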

In [96]:
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus)
pprint(corpus)
# Note that each vector is sorted by token id, not by word position in the text.
# For example, id 1 is 'computer': it is the 2nd entry in the first row but the 1st entry in the second row.

[[(0, 1), (1, 1), (2, 1)],
 [(1, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(0, 1), (6, 1), (7, 1), (8, 1)],
 [(2, 1), (6, 2), (8, 1)],
 [(3, 1), (4, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(5, 1), (10, 1), (11, 1)]]


In [97]:
# For comparison here we have the texts
pprint(texts)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


By now it should be clear that the vector feature with id=10 stands for the question “How many times does the word graph appear in the document?” and that the answer is “zero” for the first six documents and “one” for the remaining three.
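To check this claim by hand, the sparse vectors can be expanded into full 12-element rows with explicit zeros. This is a pure-Python sketch; `to_dense` is a hypothetical helper written for illustration.

```python
# Sparse corpus as printed in In [96].
corpus = [[(0, 1), (1, 1), (2, 1)],
          [(1, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
          [(0, 1), (6, 1), (7, 1), (8, 1)],
          [(2, 1), (6, 2), (8, 1)],
          [(3, 1), (4, 1), (7, 1)],
          [(9, 1)],
          [(9, 1), (10, 1)],
          [(9, 1), (10, 1), (11, 1)],
          [(5, 1), (10, 1), (11, 1)]]


def to_dense(sparse_vec, num_features):
    """Expand (id, count) pairs into a full-length list with explicit zeros."""
    dense = [0] * num_features
    for feature_id, count in sparse_vec:
        dense[feature_id] = count
    return dense


dense_corpus = [to_dense(vec, 12) for vec in corpus]
# Answers to "how many times does 'graph' (id 10) appear?" for each document
graph_column = [row[10] for row in dense_corpus]
print(graph_column)
```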

## Corpus Streaming - One Document at a Time

Note that the corpus above resides fully in memory, as a plain Python list. In this simple example, it doesn’t matter much, but just to make things clear, let’s assume there are millions of documents in the corpus. Storing all of them in RAM won’t do. Instead, let’s assume the documents are stored in a file on disk, one document per line. Gensim only requires that a corpus be able to return one document vector at a time:

In [98]:
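class MyCorpus(object):
    def __iter__(self):
        # Open the file inside __iter__ so every pass over the corpus starts from the first line
        with open('mycorpus.txt') as infile:
            for line in infile:
                # assume there is one document per line, tokens separated by whitespace
                yield dictionary.doc2bow(line.lower().split())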

In [99]:
corpus_memory_friendly = MyCorpus()
print(corpus_memory_friendly)

<__main__.MyCorpus object at 0x10a5f0950>


The corpus is now an object. We didn’t define any way to print it, so print just outputs the address of the object in memory, which is not very useful. To see the constituent vectors, let’s iterate over the corpus and print each document vector (one at a time):

In [100]:
for vector in corpus_memory_friendly:
    pprint(vector)

[(0, 1), (1, 1), (2, 1)]
[(1, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(0, 1), (6, 1), (7, 1), (8, 1)]
[(2, 1), (6, 2), (8, 1)]
[(3, 1), (4, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(5, 1), (10, 1), (11, 1)]


Although the output is the same as for the plain Python list, the corpus is now much more memory friendly, because at most one vector resides in RAM at a time. Your corpus can now be as large as you want.
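One detail worth seeing in isolation is that the corpus is an object with an `__iter__` method rather than a one-shot generator, so it can be traversed more than once, which matters if a consumer needs several passes. The following self-contained sketch uses a hypothetical `LineCorpus` class and a throwaway temporary file; it yields token lists instead of bag-of-words vectors so that it runs without gensim.

```python
import os
import tempfile


class LineCorpus(object):
    """Stream one tokenized document per line of a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every call so each pass starts from the first line.
        with open(self.path) as infile:
            for line in infile:
                yield line.lower().split()


# Write a small throwaway file so the sketch runs on its own.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("Graph minors A survey\nThe intersection graph of paths in trees\n")
    path = tmp.name

streamed = LineCorpus(path)
first_pass = list(streamed)
second_pass = list(streamed)   # a bare generator would already be exhausted here
os.remove(path)

print(first_pass == second_pass)
print(first_pass[0])
```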

Similarly, to construct the dictionary without loading all texts into memory:

In [101]:
# Collect statistics about all tokens
dictionary = corpora.Dictionary(line.lower().split() for line in open("mycorpus.txt"))
# Remove stop words and words that appear in only one document
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
           if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)
# Reassign ids to close the gaps left by the removed tokens
dictionary.compactify()
print(dictionary)

Dictionary(12 unique tokens: [u'minors', u'graph', u'system', u'trees', u'eps']...)


And that is all there is to it! At least as far as the bag-of-words representation is concerned. Of course, what we do with such a corpus is another question; it is not at all clear how counting the frequency of distinct words could be useful. As it turns out, it isn’t, and we will need to apply a transformation on this simple representation first, before we can use it to compute any meaningful document vs. document similarities.

# Topics and Transformations

In this tutorial, I will show how to transform documents from one vector representation into another. This process serves two goals:

1. To bring out hidden structure in the corpus, discover relationships between words and use them to describe the documents in a new and (hopefully) more semantic way.
2. To make the document representation more compact. This both improves efficiency (new representation consumes less resources) and efficacy (marginal data trends are ignored, noise-reduction).

## Creating a Transformation

The transformations are standard Python objects, typically initialized by means of a training corpus:

In [102]:
# Initialize a model
tfidf = models.TfidfModel(corpus)

We used our old corpus from tutorial 1 to initialize (train) the transformation model. Different transformations may require different initialization parameters; in case of TfIdf, the “training” consists simply of going through the supplied corpus once and computing document frequencies of all its features. Training other models, such as Latent Semantic Analysis or Latent Dirichlet Allocation, is much more involved and, consequently, takes much more time.
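To illustrate what this "training" amounts to, here is a pure-Python sketch. It assumes the weighting is raw term frequency times log2(N / document frequency), followed by scaling each vector to unit length; that assumption is consistent with the values printed further down in this notebook, but the helper names `train_document_frequencies` and `tfidf_transform` are hypothetical and not part of gensim.

```python
import math

# Bag-of-words corpus as printed in In [96].
corpus = [[(0, 1), (1, 1), (2, 1)],
          [(1, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
          [(0, 1), (6, 1), (7, 1), (8, 1)],
          [(2, 1), (6, 2), (8, 1)],
          [(3, 1), (4, 1), (7, 1)],
          [(9, 1)],
          [(9, 1), (10, 1)],
          [(9, 1), (10, 1), (11, 1)],
          [(5, 1), (10, 1), (11, 1)]]


def train_document_frequencies(corpus):
    """One pass over the corpus: count in how many documents each id occurs."""
    dfs = {}
    for doc in corpus:
        for feature_id, _ in doc:
            dfs[feature_id] = dfs.get(feature_id, 0) + 1
    return dfs


def tfidf_transform(doc, dfs, num_docs):
    """Weight each count by log2(N / df), then scale the vector to unit length."""
    weighted = [(fid, tf * math.log(float(num_docs) / dfs[fid], 2))
                for fid, tf in doc if fid in dfs]
    # Drop features that occur in every document (weight zero)
    weighted = [(fid, w) for fid, w in weighted if w > 0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0:
        return []
    return [(fid, w / norm) for fid, w in weighted]


dfs = train_document_frequencies(corpus)
fourth_doc = tfidf_transform(corpus[3], dfs, len(corpus))
print(fourth_doc)
```

In the fourth document, "system" (id 6) occurs twice but is also the most widespread word in the corpus, so its weight ends up higher than the others by less than its raw count alone would suggest.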

### Transforming Vectors

In [103]:
doc_bow = [(0,1), (1,1)]
print(tfidf[doc_bow])

[(0, 0.7071067811865476), (1, 0.7071067811865476)]


We can apply a transformation to a whole corpus:

In [104]:
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    pprint(doc)

[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(1, 0.44424552527467476),
 (3, 0.44424552527467476),
 (4, 0.44424552527467476),
 (5, 0.44424552527467476),
 (6, 0.3244870206138555),
 (7, 0.3244870206138555)]
[(0, 0.5710059809418182),
 (6, 0.4170757362022777),
 (7, 0.4170757362022777),
 (8, 0.5710059809418182)]
[(2, 0.49182558987264147), (6, 0.7184811607083769), (8, 0.49182558987264147)]
[(3, 0.6282580468670046), (4, 0.6282580468670046), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(5, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]


In [105]:
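# For comparison, here are the token ids. This dictionary was rebuilt in In [101],
# so the ids differ from those in In [94], which the corpus vectors above still use.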
pprint(dictionary.token2id)

{u'computer': 5,
 u'eps': 4,
 u'graph': 1,
 u'human': 8,
 u'interface': 10,
 u'minors': 0,
 u'response': 11,
 u'survey': 6,
 u'system': 2,
 u'time': 9,
 u'trees': 3,
 u'user': 7}


In this particular case, we are transforming the same corpus that we used for training, but this is only incidental. Once the transformation model has been initialized, it can be used on any vectors (provided they come from the same vector space, of course), even if they were not used in the training corpus at all. This is achieved by a process called folding-in for LSA, by topic inference for LDA etc.

Transformations can also be serialized, one on top of another, in a sort of chain:

In [106]:
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus_tfidf]

Here we transformed our Tf-Idf corpus via Latent Semantic Indexing into a latent 2-D space (2-D because we set num_topics=2). Now you’re probably wondering: what do these two latent dimensions stand for? Let’s inspect with models.LsiModel.print_topics():

In [107]:
lsi.print_topics(2)

[u'0.703*"time" + 0.538*"interface" + 0.402*"response" + 0.187*"computer" + 0.061*"survey" + 0.060*"eps" + 0.060*"trees" + 0.058*"user" + 0.049*"graph" + 0.035*"minors"',
 u'-0.460*"survey" + -0.373*"user" + -0.332*"human" + -0.328*"minors" + -0.320*"trees" + -0.320*"eps" + -0.293*"graph" + -0.280*"system" + -0.171*"computer" + 0.161*"time"']

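These labels need care. The `dictionary` passed as id2word was rebuilt in In [101], so its ids no longer match the ids that `corpus` was built with (compare the mappings printed in In [94] and In [105]). Under the original mapping, ids 9, 10 and 11 stand for trees, graph and minors, so the words labeled time, interface and response in the first topic are really trees, graph and minors. Read that way, LSI groups trees, graph and minors as related words that contribute the most to the direction of the first topic, while the second topic concerns itself with practically all the other words. As expected, the first five documents are more strongly related to the second topic, while the remaining four documents are related to the first topic: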

In [108]:
for doc in corpus_lsi:
    pprint(doc)

[(0, 0.066007833960904039), (1, -0.52007033063618557)]
[(0, 0.19667592859142519), (1, -0.7609563167700053)]
[(0, 0.089926399724464701), (1, -0.7241860626752511)]
[(0, 0.075858476521782264), (1, -0.63205515860034311)]
[(0, 0.10150299184980152), (1, -0.57373084830029586)]
[(0, 0.70321089393783098), (1, 0.16115180214025826)]
[(0, 0.87747876731198304), (1, 0.16758906864659445)]
[(0, 0.90986246868185794), (1, 0.14086553628719067)]
[(0, 0.61658253505692828), (1, -0.053929075663893329)]


## Saving and loading models

Model persistence is achieved with the save() and load() functions:

In [109]:
#lsi.save('/tmp/model.lsi') # same for tfidf, lda
#lsi = models.LsiModel.load('/tmp/model.lsi')

The next question might be: just how exactly similar are those documents to each other? Is there a way to formalize the similarity, so that for a given input document, we can order some other set of documents according to their similarity? 

## Similarity Interface

In the previous tutorials on Corpora and Vector Spaces and Topics and Transformations, we covered what it means to create a corpus in the Vector Space Model and how to transform it between different vector spaces. A common reason for such a charade is that we want to determine similarity between pairs of documents, or the similarity between a specific document and a set of other documents (such as a user query vs. indexed documents).

To show how this can be done in gensim, let us consider the same corpus as in the previous examples (which originally comes from the seminal 1990 article “Indexing by Latent Semantic Analysis” by Deerwester et al.):

In [110]:
from gensim import corpora, models, similarities
dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
corpus = corpora.MmCorpus('/tmp/deerwester.mm')
print(corpus)

MmCorpus(9 documents, 12 features, 28 non-zero entries)


To follow Deerwester’s example, we first use this tiny corpus to define a 2-dimensional LSI space:

In [111]:
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

Now suppose a user typed in the query “Human computer interaction”. We would like to sort our nine corpus documents in decreasing order of relevance to this query. Unlike modern search engines, here we only concentrate on a single aspect of possible similarities—on apparent semantic relatedness of their texts (words). No hyperlinks, no random-walk static ranks, just a semantic extension over the boolean keyword match:

In [122]:
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
print(vec_lsi)

[(0, 0.46182100453271585), (1, -0.070027665278999451)]


## Initializing Query Structures

To prepare for similarity queries, we need to enter all documents which we want to compare against subsequent queries. In our case, they are the same nine documents used for training LSI, converted to 2-D LSA space. But that is only incidental; we might just as well be indexing a different corpus altogether.

In [123]:
index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it



Index persistence is handled via the standard save() and load() functions:

In [124]:
index.save('/tmp/deerwester.index')
index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')

This is true for all similarity indexing classes (similarities.Similarity, similarities.MatrixSimilarity and similarities.SparseMatrixSimilarity). Also in the following, index can be an object of any of these. When in doubt, use similarities.Similarity, as it is the most scalable version, and it also supports adding more documents to the index later.

### Performing Queries

To obtain similarities of our query document against the nine indexed documents:

In [125]:
sims = index[vec_lsi]
print(list(enumerate(sims)))

[(0, 0.99809301), (1, 0.93748635), (2, 0.99844527), (3, 0.9865886), (4, 0.90755945), (5, -0.12416792), (6, -0.10639259), (7, -0.098794639), (8, 0.050041765)]


The cosine measure returns similarities in the range [-1, 1] (the greater, the more similar), so the first document has a score of 0.99809301, the second 0.93748635, and so on.
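The cosine measure itself is easy to sketch in pure Python. The vectors below are made up for illustration and are not the LSI vectors of the nine documents; `cosine` is a hypothetical helper, not part of gensim's API.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen for this sketch: an all-zero vector is similar to nothing
        return 0.0
    return dot / (norm_u * norm_v)


query = [1.0, 0.0]
# Hypothetical 2-D document vectors covering the whole range of the measure
candidates = [[2.0, 0.0],    # same direction as the query
              [0.0, 3.0],    # orthogonal to the query
              [-1.0, 0.0],   # opposite direction
              [1.0, 1.0]]    # 45 degrees away

scores = [cosine(query, vec) for vec in candidates]
print(scores)
```

Because only the angle matters, the first candidate scores as high as possible even though it is twice as long as the query.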

With some standard Python magic we sort these similarities into descending order, and obtain the final answer to the query “Human computer interaction”:

In [126]:
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)

[(2, 0.99844527), (0, 0.99809301), (3, 0.9865886), (1, 0.93748635), (4, 0.90755945), (8, 0.050041765), (7, -0.098794639), (6, -0.10639259), (5, -0.12416792)]


Annotated with the original documents, the ranking reads:

[(2, 0.99844527), # The EPS user interface management system
 (0, 0.99809301), # Human machine interface for lab abc computer applications
 (3, 0.9865886), # System and human system engineering testing of EPS
 (1, 0.93748635), # A survey of user opinion of computer system response time
 (4, 0.90755945), # Relation of user perceived response time to error measurement
 (8, 0.050041765), # Graph minors A survey
 (7, -0.098794639), # Graph minors IV Widths of trees and well quasi ordering
 (6, -0.10639259), # The intersection graph of paths in trees
 (5, -0.12416792)] # The generation of random binary unordered trees