# GENSIM

## Features



- **Memory independence** – there is no need for the whole training corpus to reside fully in RAM at any one time (can process large, web-scale corpora).
- Efficient implementations of several popular vector space algorithms, including **Tf-Idf**, distributed incremental **Latent Semantic Analysis**, distributed incremental **Latent Dirichlet Allocation (LDA)** and **Random Projection**; adding new ones is easy (really!).
- I/O wrappers and converters around **several popular data formats**.
- **Similarity queries** for documents in their semantic representation.

## Core concepts
The whole gensim package revolves around the concepts of **corpus**, **vector** and **model**.

A **corpus** is a collection of digital documents, represented as sparse vectors.

A **vector** is an array of features; in the Vector Space Model (VSM), each document is represented by one such array.
A **sparse vector** is an array in which most of the elements are zero. To save space, the zero elements are omitted from the document’s representation.
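
To make the sparse representation concrete, here is a minimal sketch in plain Python. The helper names `to_sparse` and `to_dense` are made up for this illustration and are not part of gensim; the `(feature_id, value)` pair layout is the same one the corpus outputs in this notebook use.

```python
# Illustrative helpers (hypothetical names, not gensim API): convert between
# a dense list and a sparse list of (feature_id, value) pairs.

def to_sparse(dense):
    """Keep only the non-zero entries as (index, value) pairs."""
    return [(i, v) for i, v in enumerate(dense) if v != 0]


def to_dense(sparse, length):
    """Rebuild the full-length list; omitted ids are implicitly zero."""
    dense = [0] * length
    for i, v in sparse:
        dense[i] = v
    return dense


# A document over 12 features in which only three features are present.
dense_doc = [0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0]
sparse_doc = to_sparse(dense_doc)
print(sparse_doc)  # [(2, 2), (4, 1), (8, 1)]
```

The sparse form stores three pairs instead of twelve numbers, and the saving grows with the vocabulary size.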

A **model**, for our purposes, is a transformation from one document representation to another (or, in other words, from one vector space to another).

In [1]:
from gensim import corpora, models, similarities

In [2]:
# corpus of nine documents, each consisting of only a single sentence

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

In [6]:
stoplist = set('for a of the and to in'.split())

## From Strings to Vectors

In [15]:
# tokenize the documents:
# - remove common words (using a toy stoplist)
# - remove words that appear only once in the corpus

dictionary = corpora.Dictionary(line.lower().split() for line in documents)

# remove stop words and words that appear only once
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
            if stopword in dictionary.token2id]

once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)
dictionary.compactify() # remove gaps in id sequence after words that were removed

print(dictionary)

Dictionary(12 unique tokens: [u'minors', u'graph', u'system', u'trees', u'eps']...)


To convert documents to vectors, we’ll use a document representation called **bag-of-words**. In this representation, each document is represented by one vector where each vector element represents a question-answer pair, in the style of:

>*“How many times does the word **system** appear in the document? Once.”*

It is advantageous to represent the questions only by their (integer) ids. The mapping between the questions and ids is called a dictionary.
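
The counting idea behind bag-of-words can be sketched without any library. This is only an illustration of the concept: `build_token2id` and `to_bow` are hypothetical helpers, and assigning ids in sorted order is a choice made here for determinism, not necessarily how gensim's `Dictionary` numbers its tokens.

```python
from collections import Counter


def build_token2id(tokenized_docs):
    """Assign an integer id to every distinct token (sorted for determinism)."""
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: i for i, tok in enumerate(vocabulary)}


def to_bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


docs = [["system", "human", "system", "eps"],
        ["graph", "trees"]]
token2id = build_token2id(docs)
print(token2id)  # {'eps': 0, 'graph': 1, 'human': 2, 'system': 3, 'trees': 4}
print(to_bow(docs[0], token2id))                # [(0, 1), (2, 1), (3, 2)]
print(to_bow(["system", "unknown"], token2id))  # [(3, 1)]
```

Tokens that are missing from the mapping are silently skipped in this sketch, which is why the last call returns a single pair.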

In [28]:
# There are twelve distinct words in the processed corpus, which means 
# each document will be represented by twelve numbers (i.e., by a 12-D vector).

print(dictionary.token2id)

{u'minors': 0, u'graph': 1, u'system': 2, u'trees': 3, u'eps': 4, u'computer': 5, u'survey': 6, u'user': 7, u'human': 8, u'time': 9, u'interface': 10, u'response': 11}


In [25]:
# The function doc2bow() simply counts the number of occurrences of each distinct word,
# converts the word to its integer word id and returns the result as a sparse vector.

corpus = [dictionary.doc2bow(line.lower().split()) for line in documents]
corpus

[[(5, 1), (8, 1), (10, 1)],
 [(2, 1), (5, 1), (6, 1), (7, 1), (9, 1), (11, 1)],
 [(2, 1), (4, 1), (7, 1), (10, 1)],
 [(2, 2), (4, 1), (8, 1)],
 [(7, 1), (9, 1), (11, 1)],
 [(3, 1)],
 [(1, 1), (3, 1)],
 [(0, 1), (1, 1), (3, 1)],
 [(0, 1), (1, 1), (6, 1)]]

## Transformation interface

Transforming documents from one vector representation into another serves two goals:

- To bring out hidden structure in the corpus, discover relationships between words and use them to describe the documents in a new and (hopefully) more semantic way.
- To make the document representation more compact. This improves both efficiency (the new representation consumes fewer resources) and efficacy (marginal data trends are ignored, which reduces noise).

### TF-IDF model

**Tf-Idf** is a simple transformation which takes documents represented as bag-of-words counts and applies a weighting which discounts common terms (or, equivalently, promotes rare terms). It therefore converts integer-valued vectors into real-valued ones. It also scales the resulting vector to unit length (in the Euclidean norm).
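
The weighting can be worked through by hand on the bag-of-words corpus printed earlier. The sketch below assumes the weight of a term is `count * log2(N / df)` followed by scaling to unit Euclidean length; that formula is an assumption of this example, chosen because it reproduces the numbers in this notebook. The helper names are hypothetical.

```python
import math

# Bag-of-words corpus from this notebook: lists of (token_id, count) pairs.
corpus = [
    [(5, 1), (8, 1), (10, 1)],
    [(2, 1), (5, 1), (6, 1), (7, 1), (9, 1), (11, 1)],
    [(2, 1), (4, 1), (7, 1), (10, 1)],
    [(2, 2), (4, 1), (8, 1)],
    [(7, 1), (9, 1), (11, 1)],
    [(3, 1)],
    [(1, 1), (3, 1)],
    [(0, 1), (1, 1), (3, 1)],
    [(0, 1), (1, 1), (6, 1)],
]


def document_frequencies(corpus):
    """Number of documents in which each token id occurs."""
    dfs = {}
    for doc in corpus:
        for token_id, _ in doc:
            dfs[token_id] = dfs.get(token_id, 0) + 1
    return dfs


def tfidf_vector(doc, dfs, num_docs):
    """Weight counts by log2(N / df), then scale to unit Euclidean length."""
    weighted = [(tid, count * math.log2(num_docs / dfs[tid]))
                for tid, count in doc]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(tid, w / norm) for tid, w in weighted] if norm else []


dfs = document_frequencies(corpus)
vec = tfidf_vector(corpus[3], dfs, len(corpus))
print([(tid, round(w, 4)) for tid, w in vec])
# [(2, 0.7185), (4, 0.4918), (8, 0.4918)]
```

Token 2 ("system") occurs twice in this document but appears in three of the nine documents, so its raw count is discounted relative to the rarer tokens 4 and 8.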

In [55]:
tfidf = models.TfidfModel(corpus, normalize=True) # step 1 -- initialize a model

In [56]:
corpus_tfidf = tfidf[corpus] # step 2 -- use the model to transform vectors
list(corpus_tfidf)

[[(5, 0.5773502691896257), (8, 0.5773502691896257), (10, 0.5773502691896257)],
 [(2, 0.3244870206138555),
  (5, 0.44424552527467476),
  (6, 0.44424552527467476),
  (7, 0.3244870206138555),
  (9, 0.44424552527467476),
  (11, 0.44424552527467476)],
 [(2, 0.4170757362022777),
  (4, 0.5710059809418182),
  (7, 0.4170757362022777),
  (10, 0.5710059809418182)],
 [(2, 0.7184811607083769), (4, 0.49182558987264147), (8, 0.49182558987264147)],
 [(7, 0.45889394536615247), (9, 0.6282580468670046), (11, 0.6282580468670046)],
 [(3, 1.0)],
 [(1, 0.7071067811865475), (3, 0.7071067811865475)],
 [(0, 0.695546419520037), (1, 0.5080429008916749), (3, 0.5080429008916749)],
 [(0, 0.6282580468670046), (1, 0.45889394536615247), (6, 0.6282580468670046)]]

Calling ***model[corpus]*** only creates a wrapper around the old corpus document stream; the actual conversions are done on the fly, during document iteration. The entire corpus cannot be converted at the time of calling ***corpus_transformed = model[corpus]***, because that would mean storing the result in main memory, which contradicts gensim’s objective of memory independence. If you will be iterating over ***corpus_transformed*** multiple times and the transformation is costly, serialize the resulting corpus to disk first and continue using that.
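
The lazy-wrapper behaviour can be imitated in a few lines of plain Python. `LazyTransformedCorpus` and `double_counts` are hypothetical names used only for this sketch; it shows the general pattern, not gensim's actual wrapper class.

```python
class LazyTransformedCorpus:
    """Hypothetical wrapper: applies `transform` to each document on iteration."""

    def __init__(self, transform, corpus):
        self.transform = transform
        self.corpus = corpus

    def __iter__(self):
        # Nothing is stored: each document is converted when it is requested.
        for doc in self.corpus:
            yield self.transform(doc)


calls = []


def double_counts(doc):
    """Toy transformation that also records how often it runs."""
    calls.append(1)
    return [(token_id, 2 * count) for token_id, count in doc]


bow_corpus = [[(0, 1), (1, 2)], [(1, 1)]]
wrapped = LazyTransformedCorpus(double_counts, bow_corpus)
print(len(calls))  # 0: creating the wrapper converts nothing

first_pass = list(wrapped)
second_pass = list(wrapped)
print(len(calls))  # 4: each pass over two documents repeats the work
print(first_pass)  # [[(0, 2), (1, 4)], [(1, 2)]]
```

The second pass costs as much as the first, which is why storing the transformed corpus pays off when the transformation is expensive and the corpus is iterated repeatedly.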


Once the transformation model has been initialized, it can be used on any vectors (provided they come from the same vector space, of course), even if they were not used in the training corpus at all.

In [36]:
new_doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(new_doc.lower().split())
vec_tfidf = tfidf[vec_bow] # convert the query to tfidf space 
vec_tfidf

[(5, 0.7071067811865476), (8, 0.7071067811865476)]

### Latent Semantic Indexing, LSI (or sometimes LSA) 

LSI is implemented as a fast truncated SVD (Singular Value Decomposition). The decomposition can be updated with new observations at any time, which allows online, incremental, memory-efficient training.

It transforms documents from either bag-of-words or (preferably) TfIdf-weighted space into a latent space of lower dimensionality. For the toy corpus here we use only 2 latent dimensions, but on real corpora a target dimensionality of 200–500 is recommended as a “golden standard”.

LSI training is unique in that we can continue “training” at any point, simply by providing more training documents. This is done by incremental updates to the underlying model, in a process called online training. 

In [46]:
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

In [47]:
# the first 5 documents are more strongly related to the 1st topic 
# while the remaining 4 documents to the 2nd topic

corpus_lsi = lsi[corpus]
list(corpus_lsi)

[[(0, -0.65946640597973938), (1, -0.14211544403729859)],
 [(0, -2.0245430433828746), (1, 0.42088758246302371)],
 [(0, -1.5465535813286542), (1, -0.32358919425711979)],
 [(0, -1.8111412473028827), (1, -0.58905249699324758)],
 [(0, -0.93367380356343443), (1, 0.27138940499375308)],
 [(0, -0.012746183038294626), (1, 0.49016179245310371)],
 [(0, -0.048882032060470634), (1, 1.1129470269929547)],
 [(0, -0.080638360994106414), (1, 1.563455946344265)],
 [(0, -0.27381003921275676), (1, 1.34694158495377)]]

In [49]:
# new_doc related to 1st topic
vec_lsi = lsi[vec_bow] # convert the query to LSI space 
vec_lsi

[(0, -0.4618210045327158), (1, -0.07002766527899984)]

In [71]:
lsi.print_topics(num_topics=2)

[(0,
  u'-0.644*"system" + -0.404*"user" + -0.301*"eps" + -0.265*"response" + -0.265*"time" + -0.240*"computer" + -0.221*"human" + -0.206*"survey" + -0.198*"interface" + -0.036*"graph"'),
 (1,
  u'0.623*"graph" + 0.490*"trees" + 0.451*"minors" + 0.274*"survey" + -0.167*"system" + -0.141*"eps" + -0.113*"human" + 0.107*"time" + 0.107*"response" + -0.072*"interface"')]

It appears that according to LSI, “graph”, “trees” and “minors” are all related words (and contribute the most to the direction of the 2nd topic), while the 1st topic practically concerns itself with all the other words. 

### Random Projections, RP

Random Projections aim to reduce vector space dimensionality. This is a very efficient (both memory- and CPU-friendly) approach to approximating **TfIdf** distances between documents, by throwing in a little randomness. The recommended target dimensionality is again in the hundreds or thousands, depending on your dataset.
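
The general idea of a random projection can be sketched in plain Python. The scheme below (random signs scaled by `1 / sqrt(num_topics)`) is one common choice and is an assumption of this example; `make_projection` and `project` are hypothetical helpers, and gensim's `RpModel` may differ in detail.

```python
import math
import random


def make_projection(num_features, num_topics, seed=0):
    """Build a num_topics x num_features matrix of random signs, scaled."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(num_topics)
    return [[rng.choice((-1.0, 1.0)) * scale for _ in range(num_features)]
            for _ in range(num_topics)]


def project(sparse_vec, projection):
    """Map a sparse (feature_id, value) vector to a dense low-dimensional one."""
    return [sum(row[fid] * value for fid, value in sparse_vec)
            for row in projection]


projection = make_projection(num_features=12, num_topics=2, seed=42)

# TfIdf vector of the first document in this notebook.
doc = [(5, 0.5773502691896257), (8, 0.5773502691896257), (10, 0.5773502691896257)]
projected = project(doc, projection)
print(projected)  # two numbers; the exact values depend on the seed
```

Because the matrix is random, a different seed gives different coordinates; what the technique relies on is that distances between documents are approximately preserved once the target dimensionality is large enough.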

In [65]:
# create a double wrapper over the original corpus: bow->tfidf->rp
rp = models.rpmodel.RpModel(corpus_tfidf, id2word=dictionary, num_topics=2)

In [66]:
corpus_rp = rp[corpus_tfidf]
list(corpus_rp)

[[(0, 0.40824827551841736), (1, 0.40824827551841736)],
 [(0, 1.0871520042419434), (1, -0.1693640947341919)],
 [(0, 1.3973586559295654), (1, 0.5898341536521912)],
 [(0, 0.5080429315567017), (1, 0.5080429315567017)],
 [(0, 0.32448703050613403), (1, 0.32448700070381165)],
 [(0, 0.7071067690849304), (1, 0.7071067690849304)],
 [(1, 1.0)],
 [(0, 0.4918256103992462), (1, 0.226655513048172)],
 [(0, 0.5640040636062622), (1, -0.5640040636062622)]]

### Latent Dirichlet Allocation, LDA

LDA is yet another transformation from bag-of-words counts into a topic space of lower dimensionality. LDA is a probabilistic extension of LSA (also called multinomial PCA), so LDA’s topics can be interpreted as probability distributions over words. These distributions are, just like with LSA, inferred automatically from a training corpus. Documents are in turn interpreted as a (soft) mixture of these topics (again, just like with LSA).
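
What "a soft mixture of topics" means can be shown with a small calculation. All numbers below are made up for illustration and are not the output of a trained model; `word_probabilities` is a hypothetical helper that combines the two kinds of distribution as p(word | doc) = Σ p(topic | doc) · p(word | topic).

```python
# Made-up distributions for illustration only; not output of a trained model.
topic_word = {
    0: {"graph": 0.5, "trees": 0.3, "system": 0.2},  # p(word | topic 0)
    1: {"graph": 0.1, "trees": 0.1, "system": 0.8},  # p(word | topic 1)
}
doc_topics = {0: 0.25, 1: 0.75}                      # p(topic | doc)


def word_probabilities(doc_topics, topic_word):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    probs = {}
    for topic, weight in doc_topics.items():
        for word, p in topic_word[topic].items():
            probs[word] = probs.get(word, 0.0) + weight * p
    return probs


probs = word_probabilities(doc_topics, topic_word)
print({word: round(p, 3) for word, p in probs.items()})
# {'graph': 0.2, 'trees': 0.15, 'system': 0.65}
```

The result is again a probability distribution over words, weighted towards whichever topic dominates the document.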

In [68]:
lda = models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2)

In [67]:
corpus_lda = lda[corpus_tfidf]
list(corpus_lda)

[[(0, 0.77106515881296822), (1, 0.22893484118703167)],
 [(0, 0.74849867304679052), (1, 0.25150132695320948)],
 [(0, 0.73472279302826116), (1, 0.26527720697173873)],
 [(0, 0.71032195545375709), (1, 0.28967804454624291)],
 [(0, 0.53307688047053547), (1, 0.46692311952946447)],
 [(0, 0.46485337784009173), (1, 0.53514662215990827)],
 [(0, 0.27635491841703491), (1, 0.72364508158296503)],
 [(0, 0.24001017633815902), (1, 0.75998982366184087)],
 [(0, 0.25321418906769572), (1, 0.74678581093230423)]]

In [70]:
lda.show_topics(num_topics=2)

[(0,
  u'0.154*graph + 0.143*system + 0.128*trees + 0.114*minors + 0.098*eps + 0.075*human + 0.069*survey + 0.060*user + 0.050*interface + 0.038*response'),
 (1,
  u'0.134*user + 0.111*time + 0.110*computer + 0.108*response + 0.102*system + 0.096*interface + 0.077*survey + 0.072*human + 0.068*trees + 0.049*eps')]

## Similarity interface

A common reason for semantic analysis is that we want to determine the similarity between pairs of documents, or between a specific document and a set of other documents (such as a user query vs. indexed documents).

We will be considering cosine similarity to determine the similarity of two vectors. Cosine similarity is a standard measure in Vector Space Modeling, but wherever the vectors represent probability distributions, different similarity measures may be more appropriate.
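
Cosine similarity on sparse vectors is short enough to write out by hand. The function `cosine` below is a plain-Python sketch for illustration, not the routine gensim uses internally; the input vectors are copied from the TfIdf outputs in this notebook.

```python
import math


def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse (feature_id, value) vectors."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(value * b.get(fid, 0.0) for fid, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


# TfIdf vectors taken from this notebook.
query = [(5, 0.7071067811865476), (8, 0.7071067811865476)]
doc0 = [(5, 0.5773502691896257), (8, 0.5773502691896257), (10, 0.5773502691896257)]
doc5 = [(3, 1.0)]

print(round(cosine(query, doc0), 4))  # 0.8165
print(cosine(query, doc5))            # 0.0
```

Documents that share no features with the query score exactly zero, which is why a purely lexical space such as TfIdf cannot relate documents that use different words for the same topic.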

In [75]:
# pairwise similarity matrix
index_tfidf = similarities.MatrixSimilarity(corpus_tfidf) # transform corpus to tfidf space and index it

In [77]:
import pandas as pd
pd.DataFrame([i for i in index_tfidf])

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.256485 | 0.32967 | 0.283956 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.256485 | 1.0 | 0.270671 | 0.233138 | 0.707107 | 0.0 | 0.0 | 0.0 | 0.279101 |
| 2 | 0.32967 | 0.270671 | 1.0 | 0.580496 | 0.191394 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.283956 | 0.233138 | 0.580496 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.707107 | 0.191394 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.707107 | 0.508043 | 0.0 |
| 6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.707107 | 1.0 | 0.718481 | 0.324487 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.508043 | 0.718481 | 1.0 | 0.67012 |
| 8 | 0.0 | 0.279101 | 0.0 | 0.0 | 0.0 | 0.0 | 0.324487 | 0.67012 | 1.0 |


In [76]:
sims_tfidf = index_tfidf[vec_tfidf] # perform a similarity query against the corpus
print(list(enumerate(sims_tfidf))) # print (document_number, document_similarity) 2-tuples

[(0, 0.81649655), (1, 0.31412902), (2, 0.0), (3, 0.34777319), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]


The cosine measure returns similarities in the range [-1, 1] (the greater, the more similar). Here ***vec_tfidf*** has a similarity of 0.816 (81.6%) with document number 0 of the corpus, 0.314 (31.4%) with document number 1, and so on.

In [83]:
index_lsi = similarities.MatrixSimilarity(corpus_lsi)
sims_lsi = index_lsi[vec_lsi]
sorted(enumerate(sims_lsi), key=lambda item: -item[1])

[(2, 0.99844527),
 (0, 0.99809301),
 (3, 0.9865886),
 (1, 0.93748635),
 (4, 0.90755945),
 (8, 0.050041765),
 (7, -0.098794639),
 (6, -0.10639259),
 (5, -0.12416792)]

In [84]:
index_rp = similarities.MatrixSimilarity(corpus_rp)
sims_rp = index_rp[rp[vec_bow]]
sorted(enumerate(sims_rp), key=lambda item: -item[1])

[(0, 0.0),
 (1, 0.0),
 (2, 0.0),
 (3, 0.0),
 (4, 0.0),
 (5, 0.0),
 (6, 0.0),
 (7, 0.0),
 (8, 0.0)]

In [85]:
index_lda = similarities.MatrixSimilarity(corpus_lda)
sims_lda = index_lda[lda[vec_bow]]
sorted(enumerate(sims_lda), key=lambda item: -item[1])

[(0, 0.9962514),
 (4, 0.99348116),
 (1, 0.98535186),
 (2, 0.96887434),
 (5, 0.61379099),
 (6, 0.47172993),
 (8, 0.43059778),
 (3, 0.41381741),
 (7, 0.41209233)]