## Word2Vec Model
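- Google's pretrained Word2Vec model (GoogleNews-vectors-negative300)
- Contains vector representations of about 3 million words and phrases, trained on roughly 100 billion words of Google News text
- Words that appear in similar contexts have similar vectors
- The similarity (or distance) between two words can be measured using cosine similarity (or cosine distance)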
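
To make the last point concrete, here is a minimal, dependency-free sketch of cosine similarity and cosine distance. The helper names `cosine_sim` and `cosine_dist` and the 3-dimensional toy vectors are made up for illustration; they are not taken from the pretrained model.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_dist(u, v):
    """Cosine distance is one minus cosine similarity."""
    return 1.0 - cosine_sim(u, v)

# Toy vectors (hypothetical, for illustration only)
v_cat = [1.0, 2.0, 0.0]
v_kitten = [2.0, 4.0, 0.0]  # points in the same direction as v_cat
v_car = [0.0, 0.0, 3.0]     # orthogonal to v_cat

print(cosine_sim(v_cat, v_kitten))  # close to 1: same direction
print(cosine_sim(v_cat, v_car))     # 0: nothing in common
print(cosine_dist(v_cat, v_car))    # distance grows as similarity drops
```

Only the direction of the vectors matters here, not their length, which is why `v_cat` and `v_kitten` score as almost identical.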

## Applications
- Text Similarity
- Language Translation
- Finding Odd Words
- Word Analogies

## Word Embeddings
- Word embeddings are numerical representations of words in the form of vectors.
- The Word2Vec model represents each word as a 300-dimensional vector.
- In this tutorial we are going to see how to use the pre-trained Word2Vec model.
- The model size is around 1.5 GB.
- We will work with Gensim, which is a popular NLP package.

###### Gensim's Word2Vec model provides optimized implementations of:
- 1) CBOW (Continuous Bag of Words) model
- 2) Skip-gram model
- Paper 1 -> Efficient Estimation of Word Representations in Vector Space
- Paper 2 -> Distributed Representations of Words and Phrases and their Compositionality
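
The difference between the two models lies in what is predicted from what. The sketch below only shows how training examples could be formed from a sentence: skip-gram predicts each context word from the center word, while CBOW predicts the center word from its surrounding words. The function names and the fixed `window` are assumptions for illustration; the real implementation adds further details (such as negative sampling and subsampling of frequent words) that are not shown here.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs, one per context word."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

def cbow_examples(tokens, window=2):
    """Yield (context_words, center) examples, one per position."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:
            yield context, center

sentence = ["the", "quick", "brown", "fox"]
print(list(skipgram_pairs(sentence, window=1)))
print(list(cbow_examples(sentence, window=1)))
```

With `window=1`, the word "quick" produces two skip-gram pairs, `("quick", "the")` and `("quick", "brown")`, but a single CBOW example, `(["the", "brown"], "quick")`.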

## Word2Vec using Gensim
- Link https://radimrehurek.com/gensim/models/word2vec.html

## CODE
### Load Word2Vec Model
**KeyedVectors** -> This object essentially contains the mapping between words and embeddings. After training, it can be used directly to query those embeddings in various ways.


In [1]:
import gensim
import numpy as np
from gensim.models import word2vec
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Raw string so that the backslashes in the Windows path are not treated as escape sequences
word_vectors = KeyedVectors.load_word2vec_format(r'.\dataset\GoogleNews-vectors-negative300.bin',binary=True)

In [3]:
v_apple = word_vectors["apple"]
v_mango = word_vectors["mango"]

print(v_apple.shape)
print(v_mango.shape)

(300,)
(300,)


In [4]:
cosine_similarity([v_apple],[v_mango]) 

array([[0.57518554]], dtype=float32)

## Question - Answering - Find the Odd One Out

In [23]:
def odd_one_out(words):
    """Accepts a list of words and returns the odd one out"""
    # Look up the embedding of every word in the given list
    all_word_vectors = [word_vectors[w] for w in words]

    # The average vector represents the common "theme" of the list
    avg_vector = np.mean(all_word_vectors,axis=0)

    # The odd one out is the word least similar to the average vector
    odd_one = None
    min_similarity = float("inf")  # larger than any cosine similarity

    for w in words:
        # cosine_similarity returns a 1x1 array, so extract the scalar
        sim = cosine_similarity([word_vectors[w]],[avg_vector])[0][0]
        if sim < min_similarity:
            min_similarity = sim
            odd_one = w

        print("similarity between %s and avg vector is %.2f"%(w,sim))
    return odd_one

In [24]:
odd_one_out(input_1)

similarity between apple and avg vector is 0.78
similarity between mango and avg vector is 0.76
similarity between juice and avg vector is 0.71
similarity between party and avg vector is 0.36
similarity between orange and avg vector is 0.65


'party'

In [25]:
odd_one_out(input_2)

similarity between music and avg vector is 0.66
similarity between dance and avg vector is 0.81
similarity between sleep and avg vector is 0.51
similarity between dancer and avg vector is 0.72
similarity between food and avg vector is 0.52


'sleep'

In [26]:
odd_one_out(input_3)

similarity between match and avg vector is 0.58
similarity between player and avg vector is 0.68
similarity between football and avg vector is 0.72
similarity between cricket and avg vector is 0.70
similarity between dancer and avg vector is 0.53


'dancer'

In [27]:
odd_one_out(input_4)

similarity between india and avg vector is 0.81
similarity between paris and avg vector is 0.75
similarity between russia and avg vector is 0.79
similarity between france and avg vector is 0.81
similarity between germany and avg vector is 0.84


'paris'

In [7]:
input_1 = ["apple","mango","juice","party","orange"]
input_2 = ["music","dance","sleep","dancer","food"]
input_3 = ["match","player","football","cricket","dancer"]
input_4 = ["india","paris","russia","france","germany"]

## Word Analogies Task
- In the word analogy task, we complete the sentence "a is to b as c is to ____". An example is "man is to woman as king is to queen". In detail, we are trying to find a word d such that the associated word vectors e_a, e_b, e_c, e_d are related in the following manner: e_b - e_a ≈ e_d - e_c. We measure the similarity between e_b - e_a and e_d - e_c using cosine similarity and pick the word d that maximizes it.
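
Before running this on the full model, it may help to see the rule on a tiny example. The sketch below uses hypothetical 2-dimensional embeddings (the first axis loosely stands for "royalty", the second for "gender") and a hypothetical helper `complete_analogy`; none of these values come from the pretrained model.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

# Made-up embeddings, for illustration only
toy_vectors = {
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.2],
}

def complete_analogy(a, b, c, vectors):
    """Return the word d that maximizes cos(e_b - e_a, e_d - e_c)."""
    target = subtract(vectors[b], vectors[a])  # e_b - e_a
    best_word, best_sim = None, -2.0           # cosine similarity is never below -1
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        sim = cosine_sim(target, subtract(vec, vectors[c]))  # compare with e_d - e_c
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

print(complete_analogy("man", "woman", "king", toy_vectors))  # prints "queen"
```

Here e_woman - e_man and e_queen - e_king are the same vector, so their cosine similarity is 1 and "queen" wins over "apple".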

In [9]:
word_vectors["man"]

array([ 0.32617188,  0.13085938,  0.03466797, -0.08300781,  0.08984375,
       -0.04125977, -0.19824219,  0.00689697,  0.14355469,  0.0019455 ,
        0.02880859, -0.25      , -0.08398438, -0.15136719, -0.10205078,
        0.04077148, -0.09765625,  0.05932617,  0.02978516, -0.10058594,
       -0.13085938,  0.001297  ,  0.02612305, -0.27148438,  0.06396484,
       -0.19140625, -0.078125  ,  0.25976562,  0.375     , -0.04541016,
        0.16210938,  0.13671875, -0.06396484, -0.02062988, -0.09667969,
        0.25390625,  0.24804688, -0.12695312,  0.07177734,  0.3203125 ,
        0.03149414, -0.03857422,  0.21191406, -0.00811768,  0.22265625,
       -0.13476562, -0.07617188,  0.01049805, -0.05175781,  0.03808594,
       -0.13378906,  0.125     ,  0.0559082 , -0.18261719,  0.08154297,
       -0.08447266, -0.07763672, -0.04345703,  0.08105469, -0.01092529,
        0.17480469,  0.30664062, -0.04321289, -0.01416016,  0.09082031,
       -0.00927734, -0.03442383, -0.11523438,  0.12451172, -0.02

In [30]:
def predict_word(a,b,c,word_vectors):
    """Accepts a triad of words a,b,c and returns d such that a is to b as c is to d"""
    a,b,c = a.lower(),b.lower(),c.lower()

    # Find d such that the similarity between (b-a) and (d-c) is maximal
    max_similarity = -100

    d = None

    # Gensim 3.x exposes the vocabulary as .vocab; in Gensim 4.x use word_vectors.key_to_index
    words = word_vectors.vocab.keys()

    wa,wb,wc = word_vectors[a],word_vectors[b],word_vectors[c]

    # Brute-force scan over the whole vocabulary (slow for a large model)
    for w in words:
        if w in [a,b,c]:
            continue

        wv = word_vectors[w]
        # cosine_similarity returns a 1x1 array, so extract the scalar
        sim = cosine_similarity([wb-wa],[wv-wc])[0][0]

        if sim > max_similarity:
            max_similarity = sim
            d = w
    return d

In [None]:
triad_2 = ("man","woman","prince")
predict_word(*triad_2,word_vectors)

#### Using the Most Similar Method:

In [None]:
word_vectors.most_similar(positive=['woman','king'],negative=['man'],topn=1)
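
The call above expresses the analogy in an additive form: the answer should be close to woman + king - man. The sketch below illustrates that idea on made-up 2-dimensional vectors. It is only an approximation of the concept, under the assumption that the input vectors are length-normalized before being combined; it is not Gensim's actual implementation, and `most_similar_sketch` is a hypothetical name.

```python
import math

def unit(v):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def most_similar_sketch(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for w in positive:
        query = [q + x for q, x in zip(query, unit(vectors[w]))]
    for w in negative:
        query = [q - x for q, x in zip(query, unit(vectors[w]))]
    query = unit(query)

    exclude = set(positive) | set(negative)  # never return the input words
    scored = [
        (w, sum(q * x for q, x in zip(query, unit(v))))
        for w, v in vectors.items()
        if w not in exclude
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Made-up embeddings, for illustration only
toy_vectors = {
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.2],
}

result = most_similar_sketch(toy_vectors, positive=["woman", "king"], negative=["man"])
print(result[0][0])  # prints "queen"
```

Unlike the loop in `predict_word`, this formulation builds a single query vector and compares it against every candidate, which is what makes a fast, vectorized lookup possible on the full vocabulary.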

### Fun Project (Bollywood Matching pair)

##### Step 1: Data Preparation
- Each sentence must be tokenized into a list of words.
- The text can be loaded into memory all at once, or we can build a data pipeline that feeds sentences to the model iteratively.
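
As a rough sketch of the second option, the class below streams a corpus from disk instead of holding it in memory. It assumes one sentence per line and simple whitespace tokenization, which is a simplification of the NLTK tokenizers used in this tutorial; the class name `SentenceStream` and the demo file are hypothetical. A plain generator would not be enough here, because training makes several passes over the data, so the object has to be iterable more than once.

```python
import os
import tempfile

class SentenceStream:
    """Re-iterable corpus: yields one tokenized sentence per non-empty line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so the corpus can be read repeatedly
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                tokens = line.lower().split()
                if tokens:
                    yield tokens

# Small demo file (made-up text, for illustration only)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("First sentence of the corpus\n\nSecond sentence here\n")
    demo_path = tmp.name

corpus = SentenceStream(demo_path)
first_pass = list(corpus)
second_pass = list(corpus)  # a second pass works because __iter__ reopens the file
print(first_pass)
os.remove(demo_path)
```

An object like `corpus` could then be passed to `Word2Vec` in place of the in-memory list of sentences.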

In [6]:
import gensim
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

import nltk
from nltk.corpus import stopwords

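# Requires the NLTK data packages, e.g. nltk.download('punkt') and nltk.download('stopwords')
# (newer NLTK versions may ask for 'punkt_tab' instead of 'punkt')
stopw = set(stopwords.words('english'))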

def readFile(file):
    # Read the whole file; the with-block closes it automatically
    with open(file,'r',encoding='utf-8') as f:
        text = f.read()

    # Tokenization - sentences and words
    sentences = nltk.sent_tokenize(text)
    print(len(sentences))

    data = []
    for sent in sentences:
        words = nltk.word_tokenize(sent)
        # Lowercase first so that capitalized stopwords such as "The" are removed too
        words = [w.lower() for w in words]
        words = [w for w in words if len(w)>2 and w not in stopw]
        data.append(words)

    return data

text = readFile("bollywood.txt")
print(text)

FileNotFoundError: [Errno 2] No such file or directory: 'bollywood.txt'

#### Create Model

In [8]:
from gensim.models import Word2Vec

# Gensim 3.x calls this parameter `size`; Gensim 4.x renamed it to `vector_size`
model = Word2Vec(text,size=300,window=10,min_count=1)

print(model)

NameError: name 'text' is not defined

In [9]:
# Gensim 3.x API; in Gensim 4.x the vocabulary is available as model.wv.key_to_index
words = list(model.wv.vocab)
print(words)
print(model.wv["deepika"].shape)

NameError: name 'model' is not defined

### Create Analogies

In [10]:
def predict_actor(a,b,c,word_vectors):
    """Accepts a triad of words and return d such that a is to b : c is to d"""
    
    a,b,c = a.lower(),b.lower(),c.lower()
    max_similarity = -100
    
    d = None
    
    wa,wb,wc = word_vectors[a],word_vectors[b],word_vectors[c]
    options = ["ranveer","deepika","padukone","singh","nick","jonas","chopra","priyanka","virat","anushka","ginni"]
    
    for w in options:
        if w in [a,b,c]:
            continue
            
        wv = word_vectors[w]
        # cosine_similarity returns a 1x1 array, so extract the scalar
        sim = cosine_similarity([wb-wa],[wv-wc])[0][0]
        
        if sim>max_similarity:
            max_similarity = sim
            d = w
    
    return d
        
        
        
        

#### Step 4: Test Your Model

In [11]:
triad = ("nick","priyanka","virat")
predict_actor(*triad,model.wv)

NameError: name 'model' is not defined