## Word2Vec Model

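 - We Use Google's Pre-trained Word2Vec Model (Google News Vectors).
 - Contains Vector Representations Of 3 Million Words And Phrases, Trained On Roughly 100 Billion Words Of Google News Text.
 - Words That Appear In Similar Contexts Have Similar Vectors.
 - Similarity Between Two Words Can Be Measured Using Cosine Similarity (Cosine Distance Is 1 Minus Cosine Similarity).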
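
A Minimal Sketch Of The Cosine Measure: The Dot Product Of Two Vectors Divided By The Product Of Their Lengths. The Helper Name `cosine_sim` And The Toy 3-Dimensional Vectors Are Assumptions For Illustration; Only The Standard Library Is Used So It Runs Without The Model File.

```python
import math

def cosine_sim(u, v):
    """Cosine Similarity Of Two Equal-Length Vectors (Hypothetical Helper)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Toy Vectors, Made Up For Illustration (Real Word2Vec Vectors Have 300 Dimensions).
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # Same Direction As vec_a, Different Length
vec_c = [0.0, 0.0, 3.0]   # Orthogonal To vec_a

sim_ab = cosine_sim(vec_a, vec_b)
sim_ac = cosine_sim(vec_a, vec_c)
print(round(sim_ab, 6), round(sim_ac, 6))
```

Because Only The Direction Matters, Vectors That Point The Same Way Score Close To 1 Regardless Of Length, And Orthogonal Vectors Score 0.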

### Applications
 
 - Text Similarity
 - Language Translation
 - Finding Odd Words
 - Word Analogies

### Word Embeddings

 - Word Embeddings Are Numerical Representations Of Words, In The Form Of Vectors.
 - The Word2Vec Model Represents Each Word As A 300-Dimensional Vector.
 - We Use The Pre-trained Model For Several Applications.
 - Model Size Is Around 3.64 GB (3 Million Words And Phrases).
 - We Work With Gensim, A Popular NLP Package.

**Gensim's Word2Vec Model Provides Optimized Implementations Of**
 
 1. **CBOW** Model (Continuous Bag Of Words Model)
 2. **SkipGram** Model
 
- Paper 1 : Efficient Estimation Of Word Representations In Vector Space
- Paper 2 : Distributed Representations Of Words And Phrases And Their Compositionality
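
The Two Architectures Differ In What They Predict: CBOW Predicts A Center Word From Its Surrounding Context Words, While Skip-Gram Predicts Each Context Word From The Center Word. A Minimal Sketch Of The Training Examples Each One Would See, Assuming A Symmetric Window Of Size 1. The Function Names And The Toy Sentence Are Hypothetical, And This Is Not Gensim's Internal Code.

```python
def context_window(tokens, i, window):
    """Return The Tokens Within `window` Positions Of Index i, Excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=1):
    """One (Context Words, Center Word) Example Per Position."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_examples(tokens, window=1):
    """One (Center Word, Context Word) Example Per Neighbouring Pair."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]

sentence = ["the", "cat", "sat"]   # Toy Sentence, Made Up For Illustration
cbow_pairs = cbow_examples(sentence)
skipgram_pairs = skipgram_examples(sentence)
print(cbow_pairs)
print(skipgram_pairs)
```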

#### Word2Vec Using Gensim

**Link** : https://radimrehurek.com/gensim/models/word2vec.html

#### Load Word2Vec Model

**Keyed Vectors** - *This Object Essentially Contains The Mapping Between Words And Embeddings. After Training, It Can Be Used Directly To Query Those Embeddings In Various Ways*

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import gensim
from gensim.models import word2vec
from gensim.models import KeyedVectors
from gensim.models import Word2Vec
import nltk
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import TSNE
import warnings
warnings.filterwarnings(action='ignore',category=DeprecationWarning)

In [2]:
word_vectors = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',binary=True)

In [3]:
vec_peach = word_vectors["peach"]
vec_mango = word_vectors["mango"]

print(vec_peach.shape, vec_mango.shape)

(300,) (300,)


In [4]:
print(cosine_similarity([vec_peach],[vec_mango]))

[[0.56328726]]


### Application 1 : Find The Odd One Out

In [5]:
def odd_one_out(words):
    """ Accepts A List Of Words And Returns The Odd Word"""
    
    # 1. Average Vector Of All The Vectors In The List.
    # 2. Cosine_Similarity(Word, Avg) For All Words.
    # 3. Minimum(Cosine_Similarity)
    
    # Generate All Word Embeddings For The Given List
    all_word_vectors = [word_vectors[w] for w in words]
    avg_vector = np.mean(all_word_vectors, axis=0) # Axis = 0 Averages Over The Words, Giving One 300-Dimensional Vector

    # Iterate Over Every Word And Find The Similarity
    odd_word = None  # Separate Name So The Function Name Is Not Shadowed
    min_similarity = float("inf")
    for w in words:
        # cosine_similarity Returns A 1x1 Array, So Extract The Scalar
        sim = cosine_similarity([word_vectors[w]], [avg_vector])[0][0]
        if sim < min_similarity:
            min_similarity = sim
            odd_word = w
        print("Similarity Between %s And Average Vector Is %.2f !"%(w,sim))
    
    return odd_word

In [6]:
input_1 = ["apple", "mango", "juice", "party", "orange"]
input_2 = ["music", "dance", "sleep", "dancer", "food"]
input_3 = ["match", "player", "football", "cricket", "dancer"]
input_4 = ["india", "paris", "russia", "france", "germany"]

In [7]:
print("\nOdd One Out : ", odd_one_out(input_1))

Similarity Between apple And Average Vector Is 0.78 !
Similarity Between mango And Average Vector Is 0.76 !
Similarity Between juice And Average Vector Is 0.71 !
Similarity Between party And Average Vector Is 0.36 !
Similarity Between orange And Average Vector Is 0.65 !

Odd One Out :  party


In [8]:
print("\nOdd One Out : ", odd_one_out(input_2))

Similarity Between music And Average Vector Is 0.66 !
Similarity Between dance And Average Vector Is 0.81 !
Similarity Between sleep And Average Vector Is 0.51 !
Similarity Between dancer And Average Vector Is 0.72 !
Similarity Between food And Average Vector Is 0.52 !

Odd One Out :  sleep


In [9]:
print("\nOdd One Out : ", odd_one_out(input_3))

Similarity Between match And Average Vector Is 0.58 !
Similarity Between player And Average Vector Is 0.68 !
Similarity Between football And Average Vector Is 0.72 !
Similarity Between cricket And Average Vector Is 0.70 !
Similarity Between dancer And Average Vector Is 0.53 !

Odd One Out :  dancer


In [10]:
print("\nOdd One Out : ", odd_one_out(input_4))

Similarity Between india And Average Vector Is 0.81 !
Similarity Between paris And Average Vector Is 0.75 !
Similarity Between russia And Average Vector Is 0.79 !
Similarity Between france And Average Vector Is 0.81 !
Similarity Between germany And Average Vector Is 0.84 !

Odd One Out :  paris


### Application 2 : Word Analogies Task

 - In The Word Analogy Task, We Complete The Sentence "a is to b as c is to __".
 - An Example Is, "man is to woman as king is to queen".
 - In Detail, We Are Trying To Find A Word 'd', Such That The Associated Word Vectors ea, eb, ec, ed Are Related In The Following Manner : (eb - ea) ≈ (ed - ec).
 - We Will Measure The Similarity Between (eb - ea) And (ed - ec) Using Cosine Similarity.
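
A Minimal Sketch Of This Relation On Hand-Made 3-Dimensional Vectors. The Values, The `toy_vectors` Name And The `toy_analogy` Helper Are Assumptions For Illustration, Not Entries From The Real Model. The Candidate d Whose Offset (ed - ec) Points In The Same Direction As (eb - ea) Gets The Highest Cosine Similarity.

```python
import math

def cosine_sim(u, v):
    """Cosine Similarity Of Two Equal-Length Vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def subtract(u, v):
    """Element-Wise Difference u - v."""
    return [x - y for x, y in zip(u, v)]

# Axes Loosely Stand For (Royalty, Gender, Person). Values Are Made Up.
toy_vectors = {
    "man":   [0.0, 0.0, 1.0],
    "woman": [0.0, 1.0, 1.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [1.0, -1.0, 0.0],
}

def toy_analogy(a, b, c):
    """Return d That Maximises Cosine Similarity Between (eb - ea) And (ed - ec)."""
    target = subtract(toy_vectors[b], toy_vectors[a])        # eb - ea
    best_word, best_sim = None, -2.0                         # Cosine Similarity Is Never Below -1
    for word, vec in toy_vectors.items():
        if word in (a, b, c):
            continue
        sim = cosine_sim(target, subtract(vec, toy_vectors[c]))  # Compare With ed - ec
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

answer = toy_analogy("man", "woman", "king")
print(answer)
```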

In [11]:
def predict_word(a, b, c, word_vectors):
    """ Accepts A Triad Of Words, a, b, c And Returns d Such That 'a is to b : c is to d'."""
    
    a, b, c = a.lower(), b.lower(), c.lower()
    
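    # Cosine Similarity Between (b - a) And (d - c) Should Be Maximal
    max_similarity = -100
    d = None
    
    # Gensim >= 4.0 Exposes The Vocabulary As key_to_index; On Gensim 3.x Use word_vectors.vocab Instead
    words = word_vectors.key_to_index.keys()
    wa, wb, wc = word_vectors[a], word_vectors[b], word_vectors[c]
    
    # Find 'd' That Maximises Similarity((b - a), (d - c)). This Scans The Whole Vocabulary, So It Is Slow.
    for w in words:
        if w in [a,b,c]:
            continue
        
        wv = word_vectors[w]
        # cosine_similarity Returns A 1x1 Array, So Extract The Scalar
        sim = cosine_similarity([wb-wa], [wv-wc])[0][0]
        
        if sim > max_similarity:
            max_similarity = sim
            d = w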
    
    return d

In [12]:
triad_1 = ("man", "woman", "prince")
print(predict_word(*triad_1, word_vectors))

princess


#### Using The `most_similar` Method

In [13]:
print(word_vectors.most_similar(positive=['woman', 'king'], negative=['man'], topn=1))

[('queen', 0.7118192911148071)]


In [14]:
print(word_vectors.most_similar(positive=['woman', 'king'], negative=['man'], topn=5))

[('queen', 0.7118192911148071), ('monarch', 0.6189674139022827), ('princess', 0.5902431011199951), ('crown_prince', 0.5499460697174072), ('prince', 0.5377321243286133)]


In [15]:
print(word_vectors.most_similar(positive=['boy','man'], negative=['girl'], topn=5))

[('teenager', 0.5787034034729004), ('woman', 0.5065137147903442), ('youngster', 0.49353092908859253), ('guy', 0.49308693408966064), ('teen_ager', 0.48808714747428894)]
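
`most_similar` Does Not Compare Offsets The Way `predict_word` Does. As Far As The Gensim Documentation Describes It, It Combines The Unit-Normalised Positive Vectors With The Negated Negative Vectors And Ranks Every Other Word By Cosine Similarity To That Combined Vector. A Rough Sketch Of That Scoring Rule On Hand-Made Vectors; The Names And Values Are Hypothetical, And This Is Not Gensim's Actual Code.

```python
import math

def unit(v):
    """Scale A Vector To Length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# Axes Loosely Stand For (Royalty, Gender, Person). Values Are Made Up.
toy_vectors = {
    "man":      [0.0, 0.0, 1.0],
    "woman":    [0.0, 1.0, 1.0],
    "king":     [1.0, 0.0, 1.0],
    "queen":    [1.0, 1.0, 1.0],
    "princess": [0.5, 1.0, 1.0],
    "apple":    [1.0, -1.0, 0.0],
}

def toy_most_similar(positive, negative, topn=1):
    """Rank Words By Cosine Similarity To (Sum Of Positives - Sum Of Negatives)."""
    dims = len(next(iter(toy_vectors.values())))
    combined = [0.0] * dims
    for word in positive:
        combined = [c + x for c, x in zip(combined, unit(toy_vectors[word]))]
    for word in negative:
        combined = [c - x for c, x in zip(combined, unit(toy_vectors[word]))]
    # Sum And Mean Give The Same Ranking Once The Result Is Re-Normalised
    combined = unit(combined)
    excluded = set(positive) | set(negative)   # Query Words Are Not Returned
    scores = [(word, dot(combined, unit(vec)))
              for word, vec in toy_vectors.items() if word not in excluded]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = toy_most_similar(positive=["woman", "king"], negative=["man"], topn=2)
print(ranking)
```

Because The Scoring Reduces To Dot Products Against One Combined Vector, The Real Method Can Rank The Whole Vocabulary With A Single Matrix Multiplication, Which Is Why It Is Far Faster Than The Python Loop In `predict_word`.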
