# Using Gensim with `svd2vec` output

[Gensim](https://pypi.org/project/gensim/) is a Python library for topic modelling, document indexing and similarity retrieval with large corpora.

Gensim can use `word2vec` to compute similarity (and more!) between words. `svd2vec` can save its vectors in the `word2vec` format, which Gensim can load.

This notebook shows how to use Gensim with vectors learnt by `svd2vec`. We also compare our results with the pure word2vec model.
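
For reference, the word2vec text format is simple enough to sketch by hand: a header line with the vocabulary size and the vector dimension, then one line per word followed by its components. The snippet below is only an illustration of that layout; `write_word2vec_text`, `read_word2vec_text` and the toy vectors are hypothetical names and values, not part of `svd2vec` or Gensim.

```python
import io

def write_word2vec_text(vectors, stream):
    """Write a {word: [floats]} mapping in the word2vec text layout."""
    dim = len(next(iter(vectors.values())))
    # header: vocabulary size, then vector dimension
    stream.write(f"{len(vectors)} {dim}\n")
    for word, vec in vectors.items():
        stream.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")

def read_word2vec_text(stream):
    """Read the layout written above back into a {word: [floats]} mapping."""
    count, dim = map(int, stream.readline().split())
    vectors = {}
    for _ in range(count):
        parts = stream.readline().rstrip("\n").split(" ")
        vec = [float(x) for x in parts[1:]]
        if len(vec) != dim:
            raise ValueError(f"expected {dim} components for {parts[0]!r}")
        vectors[parts[0]] = vec
    return vectors

# toy vectors, made up for the example
toy = {"king": [0.5, 0.25], "queen": [0.75, -0.125]}
buffer = io.StringIO()
write_word2vec_text(toy, buffer)
buffer.seek(0)
restored = read_word2vec_text(buffer)
```

Because components are written with a fixed number of decimals, values read back from such a file may differ slightly from the in-memory originals.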

---
## I - Preparation

In [1]:
from svd2vec import svd2vec
from gensim.models import Word2Vec
from gensim.models.keyedvectors import Word2VecKeyedVectors

In [2]:
# Gensim does not provide a dedicated analogy method, so we add one here (3CosAdd):
# "a is to b as c is to ?" is answered with most_similar(positive=[b, c], negative=[a])
def analogy_keyed(self, a, b, c, topn=10):
    return self.most_similar(positive=[b, c], negative=[a], topn=topn)
Word2VecKeyedVectors.analogy = analogy_keyed

# same method for full Word2Vec models, whose vectors live in the .wv attribute
def analogy_w2v(self, a, b, c, topn=10):
    return self.wv.most_similar(positive=[b, c], negative=[a], topn=topn)
Word2Vec.analogy = analogy_w2v
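
To make the 3CosAdd idea concrete, here is a minimal pure-Python sketch on toy vectors. For "a is to b as c is to ?", it returns the words whose vectors have the highest cosine similarity with `b - a + c` (computed on unit-length vectors), ignoring the three query words. The function `analogy_3cosadd` and the toy vectors are invented for illustration; Gensim's `most_similar` is assumed to follow the same principle, but its implementation is not reproduced here.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy_3cosadd(vectors, a, b, c, topn=1):
    """Rank candidate words by cosine similarity with (b - a + c)."""
    ua, ub, uc = (unit(vectors[w]) for w in (a, b, c))
    target = unit([y - x + z for x, y, z in zip(ua, ub, uc)])
    scores = []
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # the query words themselves are never returned
        scores.append((word, sum(p * q for p, q in zip(target, unit(vec)))))
    return sorted(scores, key=lambda s: s[1], reverse=True)[:topn]

# toy 3-dimensional vectors, made up for the example
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [-1.0, 0.0, 0.0],
}
best = analogy_3cosadd(toy, "man", "king", "woman")
```

With these toy vectors, `best[0][0]` is `"queen"`, the same argument order as `model.analogy("man", "king", "woman")` used later in this notebook.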

In [3]:
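# we load our previously made text8 corpus as a single document (a list of tokens);
# [1:] drops the first token, which is empty when the file starts with a space
with open("text8", "r") as f:
    documents = [f.read().split(" ")[1:]]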

In [50]:
from svd2vec import Utils
documents = Utils.split(documents[0], 1701)

---
## II - Model construction

### SVD with svd2vec

In [4]:
# the model can be trained from scratch (uncomment the next line) or loaded from a previous save
#svd2vec_svd = svd2vec(documents, size=100, window=5, min_count=100, verbose=False)
svd2vec_svd = svd2vec.load("svd.svd2vec")

### SVD with Gensim from svd2vec

In [5]:
# we first need to export svd2vec_svd to the word2vec format
svd2vec_svd.save_word2vec_format("svd.word2vec")

# we then load the model using Gensim
gensim_svd = Word2VecKeyedVectors.load_word2vec_format("svd.word2vec")

### word2vec

In [6]:
word2vec_w2v = Word2VecKeyedVectors.load_word2vec_format("w2v.word2vec")

### word2vec with Gensim

In [51]:
import gensim
gensim_w2v = gensim.models.Word2Vec(documents, size=100, window=5, min_count=100, workers=16)

In [43]:
len(gensim_w2v.wv.vocab)

11815

---
## III - Cosine similarity comparison

In [52]:
def compare_similarity(w1, w2):
    print("cosine similarity between", w1, "and", w2, ":")
    print("\tsvd2vec_svd ", svd2vec_svd.similarity(w1, w2))
    print("\tgensim_svd  ", gensim_svd.similarity(w1, w2))
    print("\tgensim_w2v  ", gensim_w2v.wv.similarity(w1, w2))
    print("\tword2vec_w2v", word2vec_w2v.similarity(w1, w2))

def compare_analogy(w1, w2, w3, topn=3):
    
    def analogy_str(model):
        a = model.analogy(w1, w2, w3, topn=topn)
        s = "\n\t\t".join(["{: <20}".format(w) + str(c) for w, c in a])
        return "\n\t\t" + s
    
    print("analogy similarities :", w1, "is to", w2, "as", w3, "is to?")
    print("\tsvd2vec_svd", analogy_str(svd2vec_svd))
    print("\tgensim_svd", analogy_str(gensim_svd))
    print("\tgensim_w2v", analogy_str(gensim_w2v))
    print("\tword2vec_w2v", analogy_str(word2vec_w2v))
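
As a reminder of what the `similarity` calls above measure, here is a minimal sketch of cosine similarity in plain Python: the dot product of two vectors divided by the product of their norms. The helper `cosine_similarity` and the two-dimensional vectors are made up for the example and are unrelated to the actual embeddings of "truck" and "car".

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# toy vectors, not real embeddings
truck = [3.0, 4.0]
car = [4.0, 3.0]
sim = cosine_similarity(truck, car)               # 24 / (5 * 5)
sim_scaled = cosine_similarity([6.0, 8.0], car)   # same direction, twice the length
```

Both values are equal: cosine similarity depends only on the direction of the vectors, not on their length.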

In [53]:
compare_similarity("good", "bad")

cosine similarity between good and bad :
	svd2vec_svd  0.4951483093832256
	gensim_svd   0.4951475
	gensim_w2v   0.7723463
	word2vec_w2v 0.728928


In [54]:
compare_similarity("truck", "car")

cosine similarity between truck and car :
	svd2vec_svd  0.8725645794464922
	gensim_svd   0.8725649
	gensim_w2v   0.71462846
	word2vec_w2v 0.6936528


In [55]:
compare_analogy("january", "month", "monday")

analogy similarities : january is to month as monday is to?
	svd2vec_svd 
		friday              0.7990049263196153
		holiday             0.7774813849657727
		day                 0.7696653269345999
	gensim_svd 
		friday              0.7990041971206665
		holiday             0.7774807810783386
		day                 0.7696648836135864
	gensim_w2v 
		week                0.7143122553825378
		evening             0.6310715675354004
		weekend             0.6066169142723083
	word2vec_w2v 
		week                0.7236202359199524
		evening             0.5867935419082642
		weekend             0.5843297839164734


In [56]:
compare_analogy("paris", "france", "berlin")

analogy similarities : paris is to france as berlin is to?
	svd2vec_svd 
		germany             0.7687125088187668
		reich               0.7243489014216623
		sch                 0.7123675101373064
	gensim_svd 
		germany             0.7687125205993652
		reich               0.7243496179580688
		sch                 0.712367594242096
	gensim_w2v 
		germany             0.8262317180633545
		finland             0.7536041140556335
		austria             0.7173164486885071
	word2vec_w2v 
		germany             0.840154767036438
		austria             0.6982203722000122
		poland              0.6571524143218994


In [57]:
compare_analogy("man", "king", "woman")

analogy similarities : man is to king as woman is to?
	svd2vec_svd 
		crowned             0.623713716342001
		isabella            0.6024687219275104
		consort             0.6019050828977524
	gensim_svd 
		crowned             0.6237134337425232
		isabella            0.6024693846702576
		consort             0.601904571056366
	gensim_w2v 
		queen               0.7210809588432312
		elizabeth           0.6706132888793945
		isabella            0.6488653421401978
	word2vec_w2v 
		queen               0.6623748540878296
		regent              0.6608081459999084
		consort             0.6403408050537109


In [58]:
compare_analogy("road", "cars", "rail")

analogy similarities : road is to cars as rail is to?
	svd2vec_svd 
		locomotives         0.7105197854472618
		diesel              0.6920861316045748
		locomotive          0.6578811562326874
	gensim_svd 
		locomotives         0.7105196714401245
		diesel              0.6920859813690186
		locomotive          0.6578816175460815
	gensim_w2v 
		vehicles            0.7365255355834961
		locomotives         0.7124711275100708
		automobiles         0.7065150737762451
	word2vec_w2v 
		locomotives         0.6976078152656555
		vehicles            0.6787285804748535
		diesel              0.6171871423721313


---
## IV - Evaluations

In [59]:
# evaluation helper; note that it replaces the compare_similarity defined in section III
def compare_similarity(datafile):
    from gensim.test.utils import datapath
    contents = datapath(datafile)
    print("pearson correlation of", datafile)
    print("\tsvd2vec_svd   ", svd2vec_svd.evaluate_word_pairs(contents)[0])
    print("\tgensim_svd    ", gensim_svd.evaluate_word_pairs(contents)[0][0])
    print("\tgensim_w2v    ", gensim_w2v.wv.evaluate_word_pairs(contents)[0][0])
    print("\tword2vec_w2v  ", word2vec_w2v.evaluate_word_pairs(contents)[0][0])

In [60]:
compare_similarity('wordsim353.tsv')

pearson correlation of wordsim353.tsv
	svd2vec_svd    0.6701752412518817
	gensim_svd     0.6805493828205335
	gensim_w2v     0.6570723922031956
	word2vec_w2v   0.6848196247009626
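
The scores above are Pearson correlations between the similarities given by each model and human similarity ratings for the same word pairs. A minimal sketch of that correlation, assuming the two score lists are already aligned pair by pair, could look like this; the function `pearson` and both score lists are hypothetical and are not taken from `wordsim353.tsv`.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# made-up ratings for four word pairs
human_scores = [9.0, 7.5, 3.0, 1.0]      # e.g. ratings on a 0-10 scale
model_scores = [0.85, 0.70, 0.40, 0.05]  # e.g. cosine similarities
r = pearson(human_scores, model_scores)
```

A value close to 1 means the model ranks and spaces the word pairs much like the human annotators do; a value close to 0 means there is no linear relationship.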


In [61]:
# evaluation helper; note that it replaces the compare_analogy defined in section III
def compare_analogy(datafile):
    from gensim.test.utils import datapath
    contents = datapath(datafile)
    print("analogies success rate of", datafile)
    print("\tsvd2vec_svd   ", svd2vec_svd.evaluate_word_analogies(contents))
    print("\tgensim_svd    ", gensim_svd.evaluate_word_analogies(contents)[0])
    print("\tgensim_w2v    ", gensim_w2v.wv.evaluate_word_analogies(contents)[0])
    print("\tword2vec_w2v  ", word2vec_w2v.evaluate_word_analogies(contents)[0])

In [62]:
compare_analogy('questions-words.txt')

analogies success rate of questions-words.txt
	svd2vec_svd    0.31634891175974356
	gensim_svd     0.31634891175974356
	gensim_w2v     0.4552049940948203
	word2vec_w2v   0.5129070355997976
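
The success rate is the fraction of analogy questions for which the top prediction is the expected word. The sketch below illustrates that computation under one assumption: questions containing a word outside the vocabulary are skipped rather than counted as failures (which, as far as we know, is Gensim's default behaviour). `analogy_accuracy`, the canned predictions and the questions are all hypothetical.

```python
def analogy_accuracy(questions, predict, vocabulary):
    """Fraction of in-vocabulary questions whose top prediction is correct."""
    correct = total = 0
    for a, b, c, expected in questions:
        if not {a, b, c, expected} <= vocabulary:
            continue  # out-of-vocabulary question: skipped, not counted as wrong
        total += 1
        if predict(a, b, c) == expected:
            correct += 1
    return correct / total if total else 0.0

# toy setup: canned predictions stand in for a real model
vocabulary = {"paris", "france", "berlin", "germany", "rome", "italy"}
canned = {
    ("paris", "france", "berlin"): "germany",  # correct
    ("paris", "france", "rome"): "germany",    # wrong, expected "italy"
}
questions = [
    ("paris", "france", "berlin", "germany"),
    ("paris", "france", "rome", "italy"),
    ("paris", "france", "oslo", "norway"),     # skipped: not in vocabulary
]
rate = analogy_accuracy(questions, lambda a, b, c: canned[(a, b, c)], vocabulary)
```

Here one of the two in-vocabulary questions is answered correctly, so `rate` is 0.5.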
