# Similarity interface

In the previous tutorials, Corpora and Vector Spaces and Topics and Transformations, we covered how to create a corpus in the Vector Space Model and how to transform it between different vector spaces. A common reason for going to this trouble is that we want to determine the **similarity between pairs of documents**, or **the similarity between a specific document and a set of other documents (such as a user query vs. indexed documents)**.

In [1]:
>>> from gensim import corpora, models, similarities
>>> dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
>>> corpus = corpora.MmCorpus('/tmp/deerwester.mm') # comes from the first tutorial, "From strings to vectors"
>>> print(corpus)

MmCorpus(9 documents, 12 features, 28 non-zero entries)


To follow Deerwester’s example, we first use this tiny corpus to define a 2-dimensional LSI space:

In [2]:
>>> lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

Now suppose a user typed in the query “Human computer interaction”. We would like to sort our nine corpus documents in decreasing order of relevance to this query. Unlike modern search engines, here we concentrate on a single aspect of possible similarity: the apparent semantic relatedness of the texts (their words). No hyperlinks, no random-walk static ranks, just a semantic extension over the boolean keyword match:

In [3]:
>>> doc = "Human computer interaction"
>>> vec_bow = dictionary.doc2bow(doc.lower().split())
>>> vec_lsi = lsi[vec_bow] # convert the query to LSI space
>>> print(vec_lsi)

[(0, 0.46182100453271618), (1, -0.070027665278998882)]
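
The `doc2bow` step above turns the query into sparse `(token_id, count)` pairs before LSI is applied. As a rough illustration of that idea only (not gensim's implementation), the sketch below uses a hypothetical `token2id` mapping and a hypothetical helper `to_bow`; the real ids come from the saved dictionary. It assumes, as `doc2bow` does by default, that tokens missing from the mapping are dropped.

```python
from collections import Counter

# Hypothetical token -> id mapping, for illustration only.
# The real mapping is stored in the loaded dictionary.
token2id = {"computer": 0, "human": 1, "interface": 2}


def to_bow(text, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(
        token2id[token]
        for token in text.lower().split()
        if token in token2id  # "interaction" is not in the mapping, so it is ignored
    )
    return sorted(counts.items())


print(to_bow("Human computer interaction", token2id))
# [(0, 1), (1, 1)]
```

Only the words known to the mapping contribute to the query vector, which is why a three-word query can end up with two non-zero entries.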


## Initializing query structures

To prepare for similarity queries, we first index all documents that we want to compare against subsequent queries. Here these are the same nine documents used to train LSI, converted to the 2-dimensional LSI space:

In [4]:
>>> index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it

The index can be saved to disk and loaded back later with `save()` and `load()`:

In [5]:
>>> index.save('/tmp/deerwester.index')
>>> index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')


## Performing queries

To obtain similarities of our query document against the nine indexed documents:

In [10]:
>>> sims = index[vec_lsi] # perform a similarity query against the corpus
>>> print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples

[(0, 0.99809301), (1, 0.93748635), (2, 0.99844527), (3, 0.9865886), (4, 0.90755945), (5, -0.12416792), (6, -0.10639259), (7, -0.098794639), (8, 0.050041765)]
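
`MatrixSimilarity` scores documents with the cosine measure, so each value lies in the range -1 to 1 and a greater value means a more similar document. The following is a minimal pure-Python sketch of that measure, not the library's code; the function name `cosine_similarity` and the three document vectors are made up for illustration, and the query is a rounded copy of `vec_lsi`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: empty vectors match nothing
    return dot / (norm_a * norm_b)


query = [0.4618, -0.0700]  # rounded from the LSI query vector printed earlier

# Hypothetical 2-D document vectors, for illustration only.
documents = [[0.9, -0.1], [0.1, 0.8], [-0.5, 0.1]]

scores = [cosine_similarity(query, doc) for doc in documents]
print(list(enumerate(scores)))
```

Because the measure depends only on direction, a document pointing the same way as the query scores close to 1 regardless of its length, and one pointing the opposite way scores close to -1.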


In [11]:
>>> from pprint import pprint  # pretty-printer
>>> sims = sorted(enumerate(sims), key=lambda item: -item[1])
>>> pprint(sims) # print sorted (document number, similarity score) 2-tuples

[(2, 0.99844527),
 (0, 0.99809301),
 (3, 0.9865886),
 (1, 0.93748635),
 (4, 0.90755945),
 (8, 0.050041765),
 (7, -0.098794639),
 (6, -0.10639259),
 (5, -0.12416792)]
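
When only the best few matches are needed, a full sort is not required. The sketch below is one possible approach using the standard library's `heapq.nlargest`; the helper name `top_n` is hypothetical, and the scores are copied from the unsorted output above rather than recomputed.

```python
import heapq

# Similarity scores copied from the query output above, indexed by document number.
sims = [0.99809301, 0.93748635, 0.99844527, 0.9865886, 0.90755945,
        -0.12416792, -0.10639259, -0.098794639, 0.050041765]


def top_n(similarities, n):
    """Return the n highest-scoring (document_number, similarity) pairs, best first."""
    return heapq.nlargest(n, enumerate(similarities), key=lambda item: item[1])


print(top_n(sims, 3))
# [(2, 0.99844527), (0, 0.99809301), (3, 0.9865886)]
```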
