# Using Gensim with `svd2vec` output

[Gensim](https://pypi.org/project/gensim/) is a Python library for topic modelling, document indexing and similarity retrieval with large corpora.

Gensim can use `word2vec` vectors to compute similarity (and more!) between words. `svd2vec` can save its vectors in the `word2vec` format, which Gensim can load.

This notebook shows how to use Gensim with vectors learnt by `svd2vec`, and compares the results with a pure word2vec model.

---
## I - Preparation

In [1]:
from svd2vec import svd2vec, FilesIO
from gensim.models import Word2Vec
# Word2VecKeyedVectors is the Gensim 3.x class name; Gensim 4 replaced it with KeyedVectors
from gensim.models.keyedvectors import Word2VecKeyedVectors

In [2]:
# Gensim has no dedicated analogy method, so we add one here (3CosAdd):
# "a is to b as c is to ?" is answered by the words closest to b - a + c
def analogy_keyed(self, a, b, c, topn=10):
    return self.most_similar(positive=[b, c], negative=[a], topn=topn)
Word2VecKeyedVectors.analogy = analogy_keyed

# same method for a full Word2Vec model, whose vectors live in model.wv
def analogy_w2v(self, a, b, c, topn=10):
    return self.wv.most_similar(positive=[b, c], negative=[a], topn=topn)
Word2Vec.analogy = analogy_w2v
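
As a rough illustration of what 3CosAdd computes, the sketch below ranks candidate words by their cosine similarity to `b - a + c` on a hand-made toy vocabulary. The vectors and the `analogy_3cosadd` helper are hypothetical and only stand in for a real model. Like Gensim's `most_similar`, the sketch normalises the input vectors to unit length before combining them, but it is not Gensim's implementation.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = unit(u), unit(v)
    return sum(x * y for x, y in zip(u, v))

def analogy_3cosadd(vectors, a, b, c, topn=1):
    """a is to b as c is to ? (hypothetical helper on a dict of word -> vector)."""
    ua, ub, uc = unit(vectors[a]), unit(vectors[b]), unit(vectors[c])
    # 3CosAdd target: b - a + c
    target = [y - x + z for x, y, z in zip(ua, ub, uc)]
    # the three query words are excluded from the candidates
    scores = [(w, cosine(v, target)) for w, v in vectors.items() if w not in (a, b, c)]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:topn]

# Toy vectors, invented for this example only
toy = {
    "man":   [1.0, 0.0, 0.1],
    "woman": [0.0, 1.0, 0.1],
    "king":  [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [0.5, 0.5, -1.0],
}
best = analogy_3cosadd(toy, "man", "king", "woman")
print(best)
```

On this toy vocabulary the top-ranked word is `queen`; with real embeddings the ranking depends on the training corpus, as the comparisons in section III show.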

In [8]:
documents = FilesIO.load_corpus("text8")

---
## II - Models construction

### SVD with svd2vec

In [9]:
svd2vec_svd = svd2vec(documents, size=300, window=5, min_count=100, verbose=False)

### SVD with Gensim from svd2vec

In [10]:
# we first need to export svd2vec_svd to the word2vec format
svd2vec_svd.save_word2vec_format("svd.word2vec")

# we then load the model using Gensim
gensim_svd = Word2VecKeyedVectors.load_word2vec_format("svd.word2vec")
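
For readers unfamiliar with the exchange format, here is a minimal sketch of the plain-text word2vec layout, which is what `load_word2vec_format` expects by default: a header line with the vocabulary size and the vector dimension, then one line per word. The helpers `write_word2vec_text` and `read_word2vec_text` and the toy vectors are hypothetical; they are not part of `svd2vec` or Gensim and skip all the edge cases those libraries handle.

```python
import os
import tempfile

def write_word2vec_text(path, vectors):
    """Write a dict of word -> list of floats in the word2vec text layout."""
    dim = len(next(iter(vectors.values())))
    with open(path, "w", encoding="utf-8") as f:
        f.write("{} {}\n".format(len(vectors), dim))  # header: vocab size, dimension
        for word, vec in vectors.items():
            f.write(word + " " + " ".join(repr(x) for x in vec) + "\n")

def read_word2vec_text(path):
    """Read the file back into a dict, checking it against the header."""
    with open(path, encoding="utf-8") as f:
        vocab_size, dim = map(int, f.readline().split())
        vectors = {}
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size or any(len(v) != dim for v in vectors.values()):
        raise ValueError("file content does not match its header")
    return vectors

# Toy vectors, invented for this example only
toy = {"good": [0.1, 0.2, 0.3], "bad": [0.3, 0.2, 0.1]}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.word2vec")
    write_word2vec_text(path, toy)
    loaded = read_word2vec_text(path)
print(loaded)
```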

### word2vec

In [16]:
import os
# train only if the vectors file does not exist yet
if not os.path.isfile("w2v.word2vec"):
    # we train the model using the word2vec command-line tool (needs to be installed)
    !word2vec -min-count 100 -size 300 -window 5 -train text8 -output w2v.word2vec

# we load it
word2vec_w2v = Word2VecKeyedVectors.load_word2vec_format("w2v.word2vec")

Starting training using file text8
Vocab size: 11816
Words in train file: 15471434
Alpha: 0.000005  Progress: 100.04%  Words/thread/sec: 208.82k  

### word2vec with Gensim

In [7]:
# note: this call uses the Gensim 3.x API; Gensim 4 renamed `size` to `vector_size`
gensim_w2v = Word2Vec(documents, size=300, window=5, min_count=100, workers=16)

---
## III - Cosine similarity comparison

In [8]:
def compare_similarity(w1, w2):
    print("cosine similarity between", w1, "and", w2, ":")
    print("\tsvd2vec_svd ", svd2vec_svd.similarity(w1, w2))
    print("\tgensim_svd  ", gensim_svd.similarity(w1, w2))
    print("\tgensim_w2v  ", gensim_w2v.wv.similarity(w1, w2))
    print("\tword2vec_w2v", word2vec_w2v.similarity(w1, w2))

def compare_analogy(w1, w2, w3, topn=3):
    
    def analogy_str(model):
        a = model.analogy(w1, w2, w3, topn=topn)
        s = "\n\t\t".join(["{: <20}".format(w) + str(c) for w, c in a])
        return "\n\t\t" + s
    
    print("analogy similarities :", w1, "is to", w2, "as", w3, "is to?")
    print("\tsvd2vec_svd", analogy_str(svd2vec_svd))
    print("\tgensim_svd", analogy_str(gensim_svd))
    print("\tgensim_w2v", analogy_str(gensim_w2v))
    print("\tword2vec_w2v", analogy_str(word2vec_w2v))

In [9]:
compare_similarity("good", "bad")

cosine similarity between good and bad :
	svd2vec_svd  0.5542564783462338
	gensim_svd   0.55425656


NameError: name 'gensim_w2v' is not defined
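
The two SVD values printed above agree to about seven significant digits. One plausible cause, stated here as an assumption rather than a verified diagnosis, is precision: Gensim's `load_word2vec_format` stores vectors as 32-bit floats by default, and the exported file only keeps a finite number of digits. The sketch below uses made-up vectors and a hypothetical `to_float32` helper to show how rounding to float32 shifts a cosine similarity by a similarly small amount.

```python
import math
import struct

def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

# Made-up vectors, for illustration only
u = [0.123456789, -0.987654321, 0.555555555]
v = [0.333333333, -0.777777777, 0.111111111]

full = cosine(u, v)                                                       # float64 inputs
reduced = cosine([to_float32(x) for x in u], [to_float32(x) for x in v])  # float32 inputs
gap = abs(full - reduced)
print(full, reduced, gap)
```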

In [None]:
compare_similarity("truck", "car")

In [None]:
compare_analogy("january", "month", "monday")

In [None]:
compare_analogy("paris", "france", "berlin")

In [27]:
compare_analogy("man", "king", "woman")

analogy similarities : man is to king as woman is to?
	svd2vec_svd 
		princess            0.5237731172162106
		isabella            0.5202350726282744
		vii                 0.49219104719485585
	gensim_svd 
		princess            0.5237736701965332
		isabella            0.5202344655990601
		vii                 0.49219122529029846
	gensim_w2v 
		queen               0.6163082718849182
		isabella            0.5582364797592163
		princess            0.5404483675956726
	word2vec_w2v 
		queen               0.4885832667350769
		consort             0.46668681502342224
		isabella            0.45786744356155396


In [62]:
compare_analogy("road", "cars", "rail")

analogy similarities : road is to cars as rail is to?
	svd2vec_svd 
		locomotives         0.7007217961709339
		locomotive          0.6949958902552571
		trucks              0.6416710731236377
	gensim_svd 
		locomotives         0.7007222175598145
		locomotive          0.6949959993362427
		trucks              0.6416715383529663
	gensim_w2v 
		locomotives         0.7414308190345764
		diesel              0.7162787914276123
		vehicles            0.6914362907409668
	word2vec_w2v 
		trucks              0.56121426820755
		locomotives         0.5561363697052002
		buses               0.5301402807235718


---
## IV - Evaluations

In [10]:
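# note: this redefines compare_similarity from section III; it now scores a whole dataset file
# for the Gensim models, [0][0] selects the Pearson coefficient from the returned tuple
def compare_similarity(path, d='\t'):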
    print("pearson correlation of", os.path.basename(path))
    print("\tsvd2vec_svd   ", svd2vec_svd.evaluate_word_pairs(path,   delimiter=d)[0])
    print("\tgensim_svd    ", gensim_svd.evaluate_word_pairs(path,    delimiter=d)[0][0])
    print("\tgensim_w2v    ", gensim_w2v.wv.evaluate_word_pairs(path, delimiter=d)[0][0])
    print("\tword2vec_w2v  ", word2vec_w2v.evaluate_word_pairs(path,  delimiter=d)[0][0])
    print("")
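
To make the reported number concrete: the score is the Pearson correlation between human similarity judgements and the model's cosine similarities for the same word pairs. The sketch below computes it by hand on invented numbers; the `pearson` helper and both lists are hypothetical and do not come from any of the datasets used here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented data: one entry per word pair
human_scores = [9.0, 7.5, 3.0, 1.0]    # human similarity ratings
model_sims = [0.82, 0.64, 0.35, 0.05]  # model cosine similarities
r = pearson(human_scores, model_sims)
print(r)
```

A value close to 1 means the model ranks and spaces the pairs much like the human annotators do; a value near 0 means no linear relationship.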

In [12]:
compare_similarity(FilesIO.path('similarities/wordsim353.txt'))
compare_similarity(FilesIO.path('similarities/men_dataset.txt'))
compare_similarity(FilesIO.path('similarities/mturk.txt'))
compare_similarity(FilesIO.path('similarities/simlex999.txt'))
compare_similarity(FilesIO.path('similarities/rarewords.txt'))

pearson correlation of wordsim353.txt
	svd2vec_svd    0.6706350388667012
	gensim_svd     0.6802918878032329
	gensim_w2v     0.6591957994570873
	word2vec_w2v   0.6727890494950836

pearson correlation of men_dataset.txt
	svd2vec_svd    0.7028159505005802
	gensim_svd     0.7028159480404237
	gensim_w2v     0.6188095527056579
	word2vec_w2v   0.655941425061282

pearson correlation of mturk.txt
	svd2vec_svd    0.6439060836063971
	gensim_svd     0.6439061439515523
	gensim_w2v     0.656630121730164
	word2vec_w2v   0.670345306022554

pearson correlation of simlex999.txt
	svd2vec_svd    0.21355778053833865
	gensim_svd     0.2135578352603581
	gensim_w2v     0.27147404992228164
	word2vec_w2v   0.2983498673828365

pearson correlation of rarewords.txt
	svd2vec_svd    0.4401173357760652
	gensim_svd     0.4401172340986287
	gensim_w2v     0.3990529625651065
	word2vec_w2v   0.45896769712164637



In [11]:
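# note: this redefines compare_analogy from section III; it now scores a whole analogy file
# for the Gensim models, [0] selects the overall accuracy from the returned tuple
def compare_analogy(path):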
    print("analogies success rate of", os.path.basename(path))
    print("\tsvd2vec_svd   ", svd2vec_svd.evaluate_word_analogies(path))
    print("\tgensim_svd    ", gensim_svd.evaluate_word_analogies(path)[0])
    print("\tgensim_w2v    ", gensim_w2v.wv.evaluate_word_analogies(path)[0])
    print("\tword2vec_w2v  ", word2vec_w2v.evaluate_word_analogies(path)[0])
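
The success rate can be read as plain accuracy: the share of "a is to b as c is to ?" questions for which the model's top answer equals the expected word. The sketch below shows that computation under simplifying assumptions; `analogy_accuracy`, `toy_predict` and the questions are hypothetical, and the real evaluators also handle details such as out-of-vocabulary words that are ignored here.

```python
def analogy_accuracy(questions, predict):
    """Fraction of (a, b, c, expected) questions whose top prediction matches."""
    correct = sum(1 for a, b, c, expected in questions if predict(a, b, c) == expected)
    return correct / len(questions)

# Hypothetical stand-in for a model: a fixed lookup table of top-1 answers
toy_answers = {
    ("man", "king", "woman"): "queen",
    ("paris", "france", "berlin"): "germany",
    ("road", "cars", "rail"): "locomotives",
    ("january", "month", "monday"): "week",
}

def toy_predict(a, b, c):
    return toy_answers.get((a, b, c))

# Invented questions: (a, b, c, expected answer)
questions = [
    ("man", "king", "woman", "queen"),
    ("paris", "france", "berlin", "germany"),
    ("road", "cars", "rail", "trains"),
    ("january", "month", "monday", "day"),
]
accuracy = analogy_accuracy(questions, toy_predict)
print(accuracy)
```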

In [None]:
compare_analogy(FilesIO.path('analogies/questions-words.txt'))
compare_analogy(FilesIO.path('analogies/msr.txt'))