# Finding Word Analogies using Word2Vec

In the word analogy task, we complete the sentence **"a is to b as c is to ___"**, for example **'man is to woman as king is to ___'**. A human reader recognises at once that the blank should be 'queen', but a machine needs a model trained on a large amount of text to pick up the same pattern. Here we use a pre-trained **Word2Vec model**: the GoogleNews vectors, which were trained on roughly 100 billion words.

In detail, we look for a word d such that the word vectors **ea, eb, ec, ed** of a, b, c and d satisfy **'eb - ea ≈ ed - ec'**. Exact equality is rarely attainable, so we measure how close the two sides are using the **cosine similarity between eb - ea and ed - ec** and pick the d that maximises it.
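
To make the criterion concrete, here is a minimal sketch on hand-made 2-D vectors. The `toy` dictionary and the `cosine` helper are invented for illustration only (they are not part of the GoogleNews model); the point is just that a good candidate d gives a difference vector pointing the same way as eb - ea.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 2-D "embeddings", made up for this example.
# Axis 0 loosely encodes royalty, axis 1 loosely encodes gender.
toy = {
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.5]),
}

target = toy["woman"] - toy["man"]                  # eb - ea
good = cosine(target, toy["queen"] - toy["king"])   # ed - ec for d = queen
bad = cosine(target, toy["apple"] - toy["king"])    # ed - ec for d = apple

print(good, bad)  # the score for "queen" is the larger of the two
```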

In [1]:
import numpy as np
import gensim
from gensim.models import word2vec,KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
word_vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',binary=True)

In [3]:
def predict_word(a,b,c,word_vectors):
    ''' The function accepts a triad of words, a,b,c and returns d such that a:b::c:d '''
    
    # converting each word to its lowercase
    a,b,c = a.lower(),b.lower(),c.lower()
    
    # We want the cosine similarity between (b - a) and (d - c) to be as large as possible
    max_similarity = -99999
    
    d = None
    
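    # The vocabulary lives in .key_to_index in gensim >= 4.0 and in .vocab in gensim 3.x
    vocab = word_vectors.key_to_index if hasattr(word_vectors, 'key_to_index') else word_vectors.vocab
    words = vocab.keys()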
    
    wa,wb,wc = word_vectors[a],word_vectors[b],word_vectors[c]
    
    # search the vocabulary for the d that maximises similarity(b - a, d - c)
    
    for w in words:
        if w in [a,b,c]:
            continue
        
        wv = word_vectors[w]
        # cosine_similarity returns a 1x1 array here; take the scalar out of it
        similar = cosine_similarity([wb-wa],[wv-wc])[0][0]
        
        if similar > max_similarity:
            max_similarity = similar
            d = w
    # This code is contributed by Saurabh Gupta
    return d

In [4]:
triad_1 = ("Man","Woman","King")
output = predict_word(*triad_1,word_vectors)
print(output)

queen
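
The loop above calls `cosine_similarity` once per vocabulary word, which is slow on a vocabulary of this size. Below is a sketch of the same search done with a few NumPy array operations. The function name `predict_word_vectorized` and the arguments `words` and `matrix` are our own, and the demo uses a tiny invented vocabulary rather than the real model.

```python
import numpy as np

def predict_word_vectorized(a, b, c, words, matrix):
    """Return d maximising cosine(b - a, d - c), scoring all candidates at once.

    `words` is a list of vocabulary strings and `matrix` is an (n, dim) array
    whose i-th row is the vector for words[i]. Both names are illustrative.
    """
    index = {w: i for i, w in enumerate(words)}
    wa, wb, wc = matrix[index[a]], matrix[index[b]], matrix[index[c]]

    target = wb - wa       # eb - ea
    diffs = matrix - wc    # row i holds ed - ec for d = words[i]

    norms = np.linalg.norm(diffs, axis=1) * np.linalg.norm(target)
    norms[norms == 0] = np.inf   # avoid division by zero (e.g. when d == c)
    sims = diffs @ target / norms

    # Never return one of the input words.
    for w in (a, b, c):
        sims[index[w]] = -np.inf

    return words[int(np.argmax(sims))]

# Tiny invented vocabulary, for demonstration only.
toy_words = ["man", "woman", "king", "queen", "apple"]
toy_matrix = np.array([
    [0.0, 1.0],    # man
    [0.0, -1.0],   # woman
    [1.0, 1.0],    # king
    [1.0, -1.0],   # queen
    [-1.0, 0.5],   # apple
])

result = predict_word_vectorized("man", "woman", "king", toy_words, toy_matrix)
print(result)
```

With gensim 4.x, `word_vectors.index_to_key` and `word_vectors.vectors` should be able to play the roles of `words` and `matrix`. Note that `matrix - wc` creates a full copy of the embedding matrix, which for the GoogleNews vectors takes several gigabytes, so processing the rows in chunks may be necessary.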


### Using the most_similar method of word_vectors to predict analogies

We are looking for **wb - wa = wd - wc**, which can be rearranged to **wd = wb + wc - wa**. So wb and wc are the positive terms and wa is the negative term.

In [3]:
word_vectors.most_similar(positive = ["woman","king"],negative=["man"],topn=1 )

[('queen', 0.711819589138031)]
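
As far as we understand gensim's behaviour, `most_similar` unit-normalises the positive and negative vectors, combines them, and ranks every other word by cosine similarity to the combined vector. This is a slightly different objective from `predict_word`, which compares the difference vectors wb - wa and wd - wc, so the two scores are not directly comparable. The sketch below imitates that procedure on a tiny invented vocabulary. `most_similar_sketch`, `unit`, `toy_words` and `toy_matrix` are hypothetical names, and this is not gensim's actual code.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def most_similar_sketch(positive, negative, words, matrix, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    index = {w: i for i, w in enumerate(words)}

    # Combine unit-length input vectors: + for positive, - for negative.
    query = np.zeros(matrix.shape[1])
    for w in positive:
        query += unit(matrix[index[w]])
    for w in negative:
        query -= unit(matrix[index[w]])
    query = unit(query)

    # Cosine similarity of every vocabulary word with the query vector.
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit_rows @ query

    # Input words are left out of the ranking.
    excluded = set(positive) | set(negative)
    ranked = [(words[i], float(sims[i]))
              for i in np.argsort(-sims) if words[i] not in excluded]
    return ranked[:topn]

# Tiny invented vocabulary, for demonstration only.
toy_words = ["man", "woman", "king", "queen", "apple"]
toy_matrix = np.array([
    [0.0, 1.0],    # man
    [0.0, -1.0],   # woman
    [1.0, 1.0],    # king
    [1.0, -1.0],   # queen
    [-1.0, 0.5],   # apple
])

top = most_similar_sketch(["woman", "king"], ["man"], toy_words, toy_matrix)
print(top)
```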