# Using Gensim with `svd2vec` output

[Gensim](https://pypi.org/project/gensim/) is a Python library for topic modelling, document indexing and similarity retrieval with large corpora.

Gensim can use `word2vec` vectors to compute similarity (and more!) between words, and `svd2vec` can save its vectors in a `word2vec` format that Gensim can process.

This notebook shows how to use Gensim with vectors learnt by `svd2vec`. We also compare our results with those of a pure word2vec model.

## I - Preparation

In [1]:
from svd2vec import svd2vec
from gensim.models import Word2Vec
from gensim.models.keyedvectors import Word2VecKeyedVectors

In [2]:
# Gensim does not have any implementation of an analogy method, so we add one here (3CosAdd)
def analogy_keyed(self, a, b, c, topn=10):
    return self.most_similar(positive=[b, c], negative=[a], topn=topn)
Word2VecKeyedVectors.analogy = analogy_keyed
def analogy_w2v(self, a, b, c, topn=10):
    return self.wv.most_similar(positive=[b, c], negative=[a], topn=topn)
Word2Vec.analogy = analogy_w2v
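
To make the 3CosAdd rule concrete, here is a minimal pure-Python sketch on made-up 2-D vectors. The names `cosine`, `three_cos_add` and `toy_vectors` are hypothetical and not part of Gensim or `svd2vec`; the sketch follows the usual 3CosAdd definition (score = cos(d, b) - cos(d, a) + cos(d, c)) rather than Gensim's internal implementation.

```python
import math

def cosine(u, v):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def three_cos_add(vectors, a, b, c, topn=1):
    # "a is to b as c is to ?": rank every other word d by
    # cos(d, b) - cos(d, a) + cos(d, c), excluding the query words
    scores = {}
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        scores[word] = (cosine(vec, vectors[b])
                        - cosine(vec, vectors[a])
                        + cosine(vec, vectors[c]))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topn]

# made-up vectors for illustration only: (royalty, gender)
toy_vectors = {
    "man": (0.1, 1.0),
    "woman": (0.1, -1.0),
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "apple": (-1.0, 0.1),
}

result = three_cos_add(toy_vectors, "man", "king", "woman", topn=2)
print(result)  # "queen" is ranked first on these toy vectors
```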

In [3]:
# we load our previously made text8 corpus as a single document (one long list of tokens)
with open("text8", "r") as f:
    documents = [f.read().split(" ")[1:]]

## II - Models construction

### SVD with svd2vec

In [4]:
svd2vec_svd = svd2vec(documents, size=100, window=5, min_count=100, verbose=True)

### SVD with Gensim from svd2vec

In [5]:
# we first need to export svd2vec_svd to the word2vec format
svd2vec_svd.save_word2vec_format("svd.word2vec")

gensim_svd = Word2VecKeyedVectors.load_word2vec_format("svd.word2vec")
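
For readers unfamiliar with the exchange format, the sketch below illustrates the plain-text `word2vec` layout as commonly described: a header line with the vocabulary size and the vector dimension, then one line per word with its components. The helpers `write_word2vec_text` and `read_word2vec_text` are hypothetical illustrations, not the functions used by `svd2vec` or Gensim, and the exact number formatting written by either library may differ.

```python
import io

def write_word2vec_text(vectors, stream):
    # header: "<vocabulary size> <vector dimension>"
    dim = len(next(iter(vectors.values())))
    stream.write("{} {}\n".format(len(vectors), dim))
    # one line per word: the word followed by its space-separated components
    for word, vec in vectors.items():
        stream.write(word + " " + " ".join(repr(float(x)) for x in vec) + "\n")

def read_word2vec_text(stream):
    vocab_size, dim = (int(n) for n in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        parts = stream.readline().rstrip("\n").split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if len(vectors[parts[0]]) != dim:
            raise ValueError("unexpected vector length for " + parts[0])
    return vectors

# made-up vectors for illustration only
toy = {"good": [0.1, 0.2, 0.3], "bad": [0.3, 0.2, 0.1]}
buffer = io.StringIO()
write_word2vec_text(toy, buffer)
buffer.seek(0)
restored = read_word2vec_text(buffer)
print(buffer.getvalue())
```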

### word2vec

In [6]:
word2vec_w2v = Word2VecKeyedVectors.load_word2vec_format("w2v.word2vec")

### word2vec with Gensim

In [7]:
gensim_w2v = Word2Vec(documents, size=100, window=5, min_count=100, workers=16)
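
Note that `documents` holds the whole corpus as a single, very long token list. As far as we know, Gensim's `Word2Vec` only uses the first 10,000 tokens of any one sentence, which may explain the weak `gensim_w2v` scores in the comparisons in this notebook; this explanation is an assumption and was not verified here. A minimal sketch of splitting a token list into shorter pseudo-sentences (`chunk_tokens` is a hypothetical helper, and the 10,000 limit is the assumed value):

```python
def chunk_tokens(tokens, max_len=10000):
    # split one long token list into consecutive chunks of at most max_len tokens
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

# small made-up example so the behaviour is easy to check
toy_tokens = ["w{}".format(i) for i in range(25)]
chunks = chunk_tokens(toy_tokens, max_len=10)
print([len(chunk) for chunk in chunks])  # [10, 10, 5]
```

Under that assumption, passing `chunk_tokens(documents[0])` instead of `documents` to `Word2Vec` would let the model see the whole corpus.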

## III - Cosine similarity comparison

In [8]:
def compare_similarity(w1, w2):
    print("cosine similarity between", w1, "and", w2, ":")
    print("\tsvd2vec_svd ", svd2vec_svd.similarity(w1, w2))
    print("\tgensim_svd  ", gensim_svd.similarity(w1, w2))
    print("\tgensim_w2v  ", gensim_w2v.wv.similarity(w1, w2))
    print("\tword2vec_w2v", word2vec_w2v.similarity(w1, w2))

def compare_analogy(w1, w2, w3, topn=3):
    
    def analogy_str(model):
        a = model.analogy(w1, w2, w3, topn=topn)
        s = "\n\t\t".join(["{: <20}".format(w) + str(c) for w, c in a])
        return "\n\t\t" + s
    
    print("analogy similaties :", w1, "is to", w2, "as", w3, "is?")
    print("\tsvd2vec_svd", analogy_str(svd2vec_svd))
    print("\tgensim_svd", analogy_str(gensim_svd))
    print("\tgensim_w2v", analogy_str(gensim_w2v))
    print("\tword2vec_w2v", analogy_str(word2vec_w2v))

In [9]:
compare_similarity("good", "bad")

cosine similarity between good and bad :
	svd2vec_svd  0.5504379599034731
	gensim_svd   0.5876901
	gensim_w2v   0.7860587
	word2vec_w2v 0.728928


In [10]:
compare_similarity("truck", "car")

cosine similarity between truck and car :
	svd2vec_svd  0.8939523214976283
	gensim_svd   0.89361215
	gensim_w2v   0.0072198664
	word2vec_w2v 0.6936528


In [11]:
compare_analogy("january", "month", "monday")

analogy similaties : january is to month as monday is?
	svd2vec_svd 
		friday              0.8682570767447222
		sunday              0.8353025497528591
		day                 0.8256563355542673
	gensim_svd 
		friday              0.8007766008377075
		sunday              0.7773827910423279
		holiday             0.7657022476196289
	gensim_w2v 
		dalek               0.36530444025993347
		tiles               0.361540287733078
		belfast             0.35483139753341675
	word2vec_w2v 
		week                0.7236202359199524
		evening             0.5867935419082642
		weekend             0.5843297839164734


In [12]:
compare_analogy("paris", "france", "berlin")

analogy similaties : paris is to france as berlin is?
	svd2vec_svd 
		germany             0.7825258417507117
		bavaria             0.7510495678301725
		bohemia             0.7377872864742052
	gensim_svd 
		germany             0.7823131680488586
		bavaria             0.7529295086860657
		bohemia             0.7433863878250122
	gensim_w2v 
		ill                 0.506828784942627
		broad               0.4927492141723633
		thinking            0.48729899525642395
	word2vec_w2v 
		germany             0.840154767036438
		austria             0.6982203722000122
		poland              0.6571524143218994


In [13]:
compare_analogy("man", "king", "woman")

analogy similaties : man is to king as woman is?
	svd2vec_svd 
		crowned             0.7047109153331894
		isabella            0.6854431847920825
		princess            0.6776985971421638
	gensim_svd 
		crowned             0.6701698303222656
		isabella            0.6435418725013733
		aragon              0.6117964386940002
	gensim_w2v 
		vertically          0.3507192134857178
		tribute             0.35013657808303833
		bodies              0.34026482701301575
	word2vec_w2v 
		queen               0.6623748540878296
		regent              0.6608081459999084
		consort             0.6403408050537109


In [14]:
compare_analogy("road", "cars", "rail")

analogy similaties : road is to cars as rail is?
	svd2vec_svd 
		trucks              0.7964549207299332
		locomotives         0.7963730030443246
		locomotive          0.7864856354090771
	gensim_svd 
		trucks              0.7512880563735962
		locomotives         0.7415914535522461
		locomotive          0.7235795259475708
	gensim_w2v 
		arms                0.6313320398330688
		response            0.6201116442680359
		verbal              0.5980525016784668
	word2vec_w2v 
		locomotives         0.6976078152656555
		vehicles            0.6787285804748535
		diesel              0.6171871423721313



In [15]:
svd2vec_svd.analogy("cow", "cows", "pig")

In [None]:
gensim_svd.analogy("cow", "cows", "pig")

In [None]:
svd2vec_svd.analogy("road", "cars", "rail")

In [None]:
gensim_svd.analogy("road", "cars", "rail")

In [None]:
svd2vec_svd.analogy("tokyo", "japan", "berlin")

In [None]:
gensim_svd.analogy("tokyo", "japan", "berlin")

In [None]:
gensim_w2v.analogy("tokyo", "japan", "berlin")

## IV - Analogy evaluation

In [None]:
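import requests
url = "http://download.tensorflow.org/data/questions-words.txt"
contents  = requests.get(url).content
# decode each line, lowercase it and split it into words; [1:] skips the first section header
analogies = [e.decode("utf-8").lower().split(" ") for e in contents.splitlines()][1:]
# keep only the four-word analogies whose words are all in the svd2vec vocabulary
kept_analogies = [a for a in analogies if len(a) == 4 and all(w in svd2vec_svd.vocabulary for w in a)]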
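
The filtering step can be tried offline on a few lines written in the style of `questions-words.txt` (section headers start with `:`, analogy lines hold four words). The sample lines, `toy_vocabulary` and the helper `parse_analogies` below are made up for illustration and are not taken from the real file or from `svd2vec`.

```python
def parse_analogies(lines, vocabulary):
    # keep lowercase 4-word analogies whose words are all in the vocabulary
    kept = []
    for line in lines:
        if line.startswith(":"):
            continue  # section header, e.g. ": family"
        words = line.lower().split()
        if len(words) == 4 and all(w in vocabulary for w in words):
            kept.append(tuple(words))
    return kept

# made-up sample in the style of the analogy file
sample_lines = [
    ": capital-common-countries",
    "Athens Greece Berlin Germany",
    "Athens Greece Oslo Norway",
    ": family",
    "boy girl king queen",
]
toy_vocabulary = {"athens", "greece", "berlin", "germany", "boy", "girl", "king", "queen"}

kept = parse_analogies(sample_lines, toy_vocabulary)
print(kept)  # the Oslo/Norway line is dropped because those words are not in the vocabulary
```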

In [None]:
from tqdm import tqdm_notebook

svd_errors = 0
gensim_errors = 0
total = len(kept_analogies)

for a, b, c, d in tqdm_notebook(kept_analogies):
    word, _ = svd2vec_svd.analogy(a, b, c, topn=1)[0]
    if word != d:
        svd_errors += 1
    word, _ = gensim_svd.analogy(a, b, c, topn=1)[0]
    if word != d:
        gensim_errors += 1

In [None]:
print("svd2vec error rate", 100.0 * svd_errors / total, "%")
print("gensim  error rate", 100.0 * gensim_errors / total, "%")

In [None]:
print(svd_errors)
print(gensim_errors)

In [19]:
def compare_similarity(datafile):
    from gensim.test.utils import datapath
    contents = datapath(datafile)
    print("pearson correlation of", datafile)
    print("\tsvd2vec_svd   ", svd2vec_svd.evaluate_word_pairs(contents)[0])
    print("\tgensim_svd    ", gensim_svd.evaluate_word_pairs(contents)[0][0])
    print("\tgensim_w2v    ", gensim_w2v.wv.evaluate_word_pairs(contents)[0][0])
    print("\tword2vec_w2v  ", word2vec_w2v.evaluate_word_pairs(contents)[0][0])
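
The score printed by this cell is a Pearson correlation between human similarity ratings and the model's similarities for the same word pairs. As a rough illustration of what that number measures, here is a pure-Python sketch; the `pearson` helper and both score lists are made up, and this is not the code used by Gensim or `svd2vec`.

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# made-up numbers: human ratings (0-10) and model cosine similarities for 4 word pairs
human_scores = [9.0, 7.5, 3.0, 1.0]
model_scores = [0.85, 0.70, 0.40, 0.05]

r = pearson(human_scores, model_scores)
print(r)  # close to 1 when the model ranks and spaces pairs like the human raters
```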

In [20]:
compare_similarity('wordsim353.tsv')

pearson correlation of wordsim353.tsv
	svd2vec_svd    0.6572054770519338
	gensim_svd     0.6755807968890952
	gensim_w2v     0.026901246050755875
	word2vec_w2v   0.6848196247009626


In [29]:
datapath('wordsim353.tsv')

'/home/s150789/.local/lib/python3.5/site-packages/gensim/test/test_data/wordsim353.tsv'

In [23]:
def compare_analogy(datafile):
    from gensim.test.utils import datapath
    contents = datapath(datafile)
    print("analogies success rate of", datafile)
    print("\tword2vec_w2v  ", word2vec_w2v.evaluate_word_analogies(contents)[0])

In [24]:
compare_analogy('questions-words.txt')

analogies success rate of questions-words.txt
	word2vec_w2v   0.5129070355997976
