Compute Similarity Matrices 
Cosine Similarity: A measure of similarity between two non-zero vectors of an inner product space, defined as the cosine of the angle between them.
Soft Cosine Similarity: Like cosine similarity, except that cosine similarity treats the features of the vector space model (VSM) as independent, whereas soft cosine also takes the similarity between features into account.
Computing soft cosines requires a word embedding model to supply the similarity between terms.
Here we use the pre-trained word2vec model (word2vec-google-news-300).
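
To make the difference between the two measures concrete, here is a minimal plain-Python sketch. It assumes the usual soft cosine formula, a·S·b / (sqrt(a·S·a) · sqrt(b·S·b)), where S holds term-to-term similarities. The three-word vocabulary and the 0.5 similarity between "kabul" and "delhi" are made-up illustrative values, not taken from the word2vec model.

```python
import math

def bilinear(a, S, b):
    """Compute a^T S b for plain Python lists."""
    return sum(a[i] * S[i][j] * b[j]
               for i in range(len(a)) for j in range(len(b)))

def soft_cosine(a, b, S):
    """Soft cosine similarity of a and b under term-similarity matrix S."""
    return bilinear(a, S, b) / (
        math.sqrt(bilinear(a, S, a)) * math.sqrt(bilinear(b, S, b)))

# Toy vocabulary (illustrative only): ["kabul", "delhi", "country"]
doc_a = [1, 0, 1]   # "kabul country"
doc_b = [0, 1, 1]   # "delhi country"

# With the identity matrix every term is independent,
# so soft cosine reduces to ordinary cosine similarity.
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

# Assumed term similarities: "kabul" and "delhi" are 0.5 similar.
S = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

plain = soft_cosine(doc_a, doc_b, identity)
soft = soft_cosine(doc_a, doc_b, S)
print(f"cosine:      {plain:.2f}")  # 0.50
print(f"soft cosine: {soft:.2f}")   # 0.75
```

The two toy documents share only one term, yet the soft cosine score is higher because the differing terms are themselves treated as related.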

In [1]:
import gensim.downloader as api
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex, SoftCosineSimilarity
from gensim import corpora

s1 = 'Afghanistan is an Asian country and capital is Kabul'.split()
s2 = 'India is an Asian country and capital is Delhi'.split()
s3 = 'Greece is a European country and capital is Athens'.split()

# Load pre-trained word vectors
word2vec_model = api.load('word2vec-google-news-300')

documents = [s1, s2, s3]
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Create the term similarity matrix
similarity_index = WordEmbeddingSimilarityIndex(word2vec_model)
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary)

# Build the SoftCosineSimilarity index
soft_cosine_index = SoftCosineSimilarity(corpus, similarity_matrix, num_best=3)
query = dictionary.doc2bow(s1)
similarities = soft_cosine_index[query]
# With num_best set, each result is a (document index, similarity) pair
for doc_id, score in similarities:
    print(f"Similarity between s1 and document {doc_id + 1}: {score:.4f}")

100%|██████████| 14/14 [02:11<00:00,  9.37s/it]

Similarity between s1 and document 1: 1.0000
Similarity between s1 and document 2: 0.8372
Similarity between s1 and document 3: 0.7568





In [2]:
import gensim.downloader as api
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex, SoftCosineSimilarity
from gensim import corpora

s1 = 'Afghanistan is an Asian country and capital is Kabul'.split()
s2 = 'India is an Asian country and capital is Delhi'.split()
s3 = 'Greece is a European country and capital is Athens'.split()
s4 = 'peacock has feathers too'.split()

# Load pre-trained word vectors
word2vec_model = api.load('word2vec-google-news-300')

documents = [s1, s2, s3]
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Create the term similarity matrix
similarity_index = WordEmbeddingSimilarityIndex(word2vec_model)
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary)

# Build the SoftCosineSimilarity index
soft_cosine_index = SoftCosineSimilarity(corpus, similarity_matrix, num_best=3)
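# Query with a new document; none of s4's tokens are in the dictionary,
# so the bag-of-words is empty and no similarities are printed
query = dictionary.doc2bow(s4)
similarities = soft_cosine_index[query]
for doc_id, score in similarities:
    print(f"Similarity between s4 and document {doc_id + 1}: {score:.4f}")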

100%|██████████| 14/14 [02:07<00:00,  9.12s/it]
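
The second cell prints no similarity lines, most likely because every token of s4 is missing from the dictionary built from s1, s2 and s3. The sketch below checks this with a plain Python set standing in for the gensim dictionary; the names `vocabulary`, `known` and `missing` are illustrative and not part of the gensim API.

```python
s1 = 'Afghanistan is an Asian country and capital is Kabul'.split()
s2 = 'India is an Asian country and capital is Delhi'.split()
s3 = 'Greece is a European country and capital is Athens'.split()
s4 = 'peacock has feathers too'.split()

# Stand-in for the dictionary: the set of unique tokens in the indexed documents
vocabulary = {token for doc in (s1, s2, s3) for token in doc}

# Split the query tokens into those the vocabulary knows and those it does not
known = [token for token in s4 if token in vocabulary]
missing = [token for token in s4 if token not in vocabulary]

print("Vocabulary size:", len(vocabulary))  # 14, matching the 14/14 progress bar
print("Known tokens:", known)               # []
print("Missing tokens:", missing)           # ['peacock', 'has', 'feathers', 'too']
```

To score a document like s4, its tokens would have to be part of the dictionary passed to SparseTermSimilarityMatrix, for example by including s4 in `documents`.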


Some of the similarity and distance metrics that this word embedding model supports are shown below:






In [9]:
print("Odd one out:", word2vec_model.doesnt_match(['india', 'bhutan', 'china', 'mango']))  # mango

print("Distance between man and woman:", word2vec_model.distance('man', 'woman'))

print("Distances from 'king':", word2vec_model.distances('king', ['queen', 'man', 'woman']))

# Cosine similarity between 'queen' and 'king' + 'woman'
import numpy as np
combo_vector = word2vec_model['king'] + word2vec_model['woman']
similarity = word2vec_model.cosine_similarities(word2vec_model['queen'], np.array([combo_vector]))
print("Cosine similarity of queen with (king + woman):", similarity[0])

# Words closer to queen than kingdom
print("Words closer to queen than kingdom:", word2vec_model.closer_than('queen', 'kingdom'))

# Most similar to 'king'
print("Top 5 words similar to 'king':", word2vec_model.most_similar(positive=['king'], topn=5))

# Using cosmul
print("Top 5 words similar to 'queen' using cosmul:", word2vec_model.most_similar_cosmul(positive=['queen'], topn=5))


Odd one out: mango
Distance between man and woman: 0.2335987687110901
Distances from 'king': [0.34890437 0.77057326 0.8715203 ]
Cosine similarity of queen with (king + woman): 0.6532526
Words closer to queen than kingdom: ['her', 'she', 'woman', 'star', 'girl', 'King', 'herself', 'prime_minister', 'singer', 'actress', 'Miss', 'Queen', 'lady', 'king', 'crown', 'Kate', 'grandmother', 'ladies', 'lovely', 'royal', 'Diana', 'lover', 'reigning', 'bride', 'champ', 'flower', 'granddaughter', 'palace', 'Princess', 'Oprah', 'crowned', 'Prince_William', 'niece', 'glamorous', 'glamor', 'pageant', 'colony', 'contestant', 'prince', 'throne', 'hers', 'Jenna', 'Lady_Gaga', 'Actress', 'princess', 'castle', 'gown', 'ruler', 'Beyonce', 'monarchy', 'vampire', 'maid', 'Prince_Charles', 'heir', 'kings', 'idol', 'blonde', 'darling', 'heroine', 'Cinderella', 'monarch', 'mistress', 'grandma', 'prime_ministers', 'Kylie', 'diva', 'Queen_Elizabeth_II', 'Buckingham_Palace', 'Grandma', 'Obamas', 'nun', 'Queen_Eliza

In [13]:
print("Cosine similarity between 'alice' " +
      "and 'machines' - CBOW : ",
       word2vec_model.similarity('alice', 'machines'))

Cosine similarity between 'alice' and 'machines' - CBOW :  0.06818943
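
The `similarity` and `distance` calls used in the cells above are two views of the same quantity: the distance is one minus the cosine similarity. A minimal sketch with made-up two-dimensional vectors (the real model uses 300-dimensional vectors) illustrates the relationship.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

# Illustrative vectors only, 45 degrees apart
u = [1.0, 0.0]
v = [1.0, 1.0]

sim = cosine_similarity(u, v)
dist = cosine_distance(u, v)
print(f"similarity: {sim:.4f}")  # 0.7071
print(f"distance:   {dist:.4f}")  # 0.2929
```

Read this way, the distance of about 0.2336 reported between 'man' and 'woman' corresponds to a cosine similarity of about 0.7664.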
