## Word2Vec Model

- Google's pretrained Word2Vec model
- Contains vector representations of 3 million words and phrases, trained on roughly 100 billion words of Google News text
- Words that appear in similar contexts have similar vectors
- The distance/similarity between two words can be measured using cosine distance/similarity
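
As a minimal sketch of the last point: cosine similarity is the dot product of two vectors divided by the product of their norms, and cosine distance is one minus that value. The three-dimensional vectors below are made up for illustration (real Word2Vec vectors have 300 dimensions), and `cosine_sim` is a hypothetical helper name.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors (hypothetical values, not taken from the pretrained model)
v_a = [1.0, 2.0, 3.0]
v_b = [2.0, 4.0, 6.0]    # same direction as v_a
v_c = [-3.0, 0.0, 1.0]   # orthogonal to v_a

print(cosine_sim(v_a, v_b))        # similarity of two parallel vectors
print(1.0 - cosine_sim(v_a, v_c))  # cosine distance of two orthogonal vectors
```

Because only the angle matters, scaling a vector (as with `v_b`) does not change its similarity to other vectors.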

### Applications
- Text Similarity
- Language Translation
- Finding Odd Words
- Word Analogies

### Word Embeddings
- Word embeddings are numerical representations of words, in the form of vectors.
- The Word2Vec model represents each word as a 300-dimensional vector.
- In this tutorial we are going to see how to use the pre-trained Word2Vec model.
- The compressed model file is around 1.5 GB.
- We will work with Gensim, which is a popular NLP package.

Gensim's Word2Vec module provides optimized implementations of two architectures:

**1) CBOW Model**

**2) SkipGram Model**

Paper 1: *Efficient Estimation of Word Representations in Vector Space*

Paper 2: *Distributed Representations of Words and Phrases and their Compositionality*
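
The two architectures differ mainly in which side of a context window is predicted: CBOW predicts the center word from its surrounding words, while skip-gram predicts each surrounding word from the center word. The sketch below only builds the training pairs each model would see for a toy sentence. The function names and the window size of 2 are illustrative assumptions, and no network is trained.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_pairs(tokens, window=2):
    """(context words, center word): CBOW predicts the center from the context."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_pairs(tokens, window=2):
    """(center word, context word): skip-gram predicts each context word from the center."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]

sentence = "the quick brown fox jumps".split()
print(cbow_pairs(sentence)[2])       # training example centered on "brown"
print(skipgram_pairs(sentence)[:2])  # first two pairs, centered on "the"
```

When training with Gensim's `Word2Vec` class, the `sg` parameter selects the architecture (`sg=0` for CBOW, `sg=1` for skip-gram).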

#### Word2Vec using Gensim

Link: https://radimrehurek.com/gensim/models/word2vec.html

### CODE

**Load Word2Vec Model**

**KeyedVectors** - This object essentially contains the mapping between words and their embeddings. After training, it can be used directly to query those embeddings in various ways.

In [1]:
import gensim
from gensim.models import word2vec
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
word_vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',binary=True)

If the pretrained file has not been downloaded into the working directory, this cell fails with:

FileNotFoundError: [Errno 2] No such file or directory: 'GoogleNews-vectors-negative300.bin'
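
A minimal sketch of a guarded loader that checks for the file before handing it to Gensim. `load_google_vectors` is a hypothetical helper name. The `limit` argument asks `load_word2vec_format` to read only the first N vectors, which is included on the assumption that a reduced vocabulary is acceptable when memory is tight.

```python
from pathlib import Path

def load_google_vectors(path="GoogleNews-vectors-negative300.bin", limit=None):
    """Load pretrained vectors, failing early with a clear message if the file is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found: download the pretrained model into this directory first"
        )
    # Imported lazily so the path check runs even if gensim is not installed yet
    from gensim.models import KeyedVectors
    # limit=N reads only the first N vectors from the file; None reads all of them
    return KeyedVectors.load_word2vec_format(str(path), binary=True, limit=limit)
```

With the file in place, `word_vectors = load_google_vectors(limit=500000)` would load only the first 500,000 entries.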

In [None]:
v_apple = word_vectors["apple"]
v_mango = word_vectors["mango"]

In [5]:
print(v_apple.shape)
print(v_mango.shape)


In [None]:
cosine_similarity([v_mango],[v_apple]) # cosine_similarity expects 2-D inputs, hence the extra brackets

In [None]:
import numpy as np

## 1. Find the Odd One Out

In [None]:
def odd_one_out(words):
    '''Accepts a list of words and returns the odd word'''

    # Generate word embeddings for every word in the list
    all_word_vectors = [word_vectors[w] for w in words]
    # axis=0 averages across words, giving one 300-dimensional mean vector
    avg_vector = np.mean(all_word_vectors, axis=0)

    # Find the word that is least similar to the average vector
    odd_word = None
    min_similarity = float("inf")  # any real similarity will be smaller

    for w in words:
        # cosine_similarity returns a 1x1 array; take the scalar
        sim = cosine_similarity([word_vectors[w]], [avg_vector])[0][0]
        if sim < min_similarity:
            min_similarity = sim
            odd_word = w

        print("Similarity btw %s and avg vector is %.2f" % (w, sim))

    return odd_word
    

In [None]:
input_1 = ["apple","mango","juice","party","orange"]
input_2 = ["music","dance","sleep","dancer","food"]
input_3 = ["match","player","football","cricket","dancer"]
input_4 = ["india","paris","russia","france","germany"]
input_5 = ["physics","chemistry","mathematics","biology","computers"]

In [None]:
odd_one_out(input_1)

In [None]:
odd_one_out(input_2)

In [None]:
odd_one_out(input_3)

In [None]:
odd_one_out(input_4)

In [None]:
odd_one_out(input_5)
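
To see the idea without loading the pretrained model, here is a minimal sketch of the same procedure on made-up 2-D vectors: average all vectors along axis 0, then pick the word least similar to that mean. The `toy_vectors` dictionary, its values, and the helper names are hypothetical, and cosine similarity is computed directly with NumPy rather than scikit-learn.

```python
import numpy as np

# Hypothetical 2-D embeddings, chosen so that "party" points away from the fruits
toy_vectors = {
    "apple":  np.array([1.0, 0.1]),
    "mango":  np.array([0.9, 0.2]),
    "orange": np.array([1.0, 0.0]),
    "party":  np.array([0.0, 1.0]),
}

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def odd_one_out_toy(words, vectors):
    """Return the word whose vector is least similar to the mean vector."""
    # axis=0 averages across words and keeps one value per dimension
    avg_vector = np.mean([vectors[w] for w in words], axis=0)
    return min(words, key=lambda w: cosine(vectors[w], avg_vector))

print(odd_one_out_toy(["apple", "mango", "party", "orange"], toy_vectors))
```

Without `axis=0`, `np.mean` would collapse the whole list into a single number instead of a vector, and the similarity computation would no longer make sense.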

## 2. Word Analogies Task
In the word analogy task, we complete the sentence "a is to b as c is to __". An example is "man is to woman as king is to queen". In detail, we are trying to find a word $d$ such that the associated word vectors $e_a, e_b, e_c, e_d$ are related in the following manner: $e_b - e_a \approx e_d - e_c$. We will measure the similarity between $e_b - e_a$ and $e_d - e_c$ using cosine similarity.
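
Written as an optimization problem, this search picks the vocabulary word whose offset from the vector of c is best aligned with the offset from the vector of a to the vector of b. Here $V$ denotes the vocabulary, a symbol introduced only for this sketch, and the three input words are excluded from the search:

$$
d = \arg\max_{w \in V \setminus \{a, b, c\}} \cos\left(e_b - e_a,\; e_w - e_c\right),
\qquad
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
$$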

<img src="./word2vec.png" />

Word2Vec

*man -> woman ::     prince -> princess <br>
italy -> italian ::     spain -> spanish <br>
india -> delhi ::     japan -> tokyo<br>
man -> woman ::     boy -> girl <br>
small -> smaller ::     large -> larger <br>*

Try it out:

man -> coder :: woman -> ______?

In [None]:
type(word_vectors.vocab)  # gensim < 4.0; in gensim >= 4.0 use word_vectors.key_to_index

In [None]:
word_vectors["man"].shape

In [None]:
def predict_word(a,b,c,word_vectors):
    """ Accepts a triad of words a, b, c and returns d such that a is to b as c is to d """
    a,b,c = a.lower(),b.lower(),c.lower()

    # Goal: find d that maximizes cosine_similarity(b - a, d - c)

    max_similarity = -100
    d = None

    # gensim < 4.0 API; in gensim >= 4.0 use word_vectors.key_to_index.keys()
    words = word_vectors.vocab.keys()

    wa,wb,wc = word_vectors[a], word_vectors[b], word_vectors[c]

    for w in words:
        # Skip the input words themselves
        if w in [a,b,c]:
            continue

        wv = word_vectors[w]
        # cosine_similarity returns a 1x1 array; take the scalar
        sim = cosine_similarity([wb - wa],[wv - wc])[0][0]

        if sim > max_similarity:
            max_similarity = sim
            d = w

    return d

In [None]:
triad_2 = ("man","woman","prince")
predict_word(*triad_2,word_vectors)
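
Looping over every word in a multi-million-word vocabulary and calling `cosine_similarity` once per word is slow. A minimal sketch of the same search done with one matrix operation is shown below on a hypothetical five-word vocabulary. `toy_words`, `toy_matrix`, and `predict_word_fast` are illustrative names, and the vector values are made up.

```python
import numpy as np

# Hypothetical embeddings: dimension 0 ~ "royalty", dimension 1 ~ "gender"
toy_words = ["man", "woman", "prince", "princess", "apple"]
toy_matrix = np.array([
    [0.0, 1.0],    # man
    [0.0, -1.0],   # woman
    [1.0, 1.0],    # prince
    [1.0, -1.0],   # princess
    [-1.0, 0.5],   # apple
])

def predict_word_fast(a, b, c, words, matrix):
    """Return d maximizing cosine(e_b - e_a, e_d - e_c), with a, b, c excluded."""
    index = {w: i for i, w in enumerate(words)}
    target = matrix[index[b]] - matrix[index[a]]   # offset e_b - e_a
    offsets = matrix - matrix[index[c]]            # e_w - e_c for every word at once
    norms = np.linalg.norm(offsets, axis=1) * np.linalg.norm(target)
    # Rows with zero norm (w == c) keep similarity -inf instead of dividing by zero
    sims = np.full(len(words), -np.inf)
    valid = norms > 0
    sims[valid] = offsets[valid] @ target / norms[valid]
    for w in (a, b, c):
        sims[index[w]] = -np.inf                   # never return an input word
    return words[int(np.argmax(sims))]

print(predict_word_fast("man", "woman", "prince", toy_words, toy_matrix))
```

With the real model, the matrix would be the full array of word vectors, so the search becomes a single matrix-vector product instead of millions of separate function calls.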

## Built-In Method: `most_similar`

In [None]:
word_vectors.most_similar(positive=["woman","king"], negative=["man"], topn=1)
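
`most_similar` does not search over offsets the way `predict_word` does. Roughly speaking, it combines the unit-normalized input vectors (positives added, negatives subtracted) into one query vector and ranks the remaining words by cosine similarity to it. The sketch below imitates that idea on hypothetical toy vectors. It is an approximation for illustration under that assumption, not Gensim's actual code, and `most_similar_toy` is a made-up name.

```python
import numpy as np

# Hypothetical 2-D embeddings (values made up for illustration)
toy = {
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.5]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def most_similar_toy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative) of unit vectors."""
    query = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    query = unit(query)
    exclude = set(positive) | set(negative)        # input words are never returned
    scores = [(w, float(unit(v) @ query))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar_toy(toy, positive=["woman", "king"], negative=["man"]))
```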