# Word2Vec Model
This tutorial uses Google's pretrained Word2Vec model.
It contains vectors for about 3 million words and phrases, trained on roughly 100 billion words of Google News text.

Words which are similar in context have similar vectors

Distance/Similarity between two words can be measured using Cosine Distance
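
As a minimal sketch of this measure: cosine similarity is the dot product of two vectors divided by the product of their lengths, and cosine distance is one minus that value. The three-dimensional vectors below are invented for illustration and are not real Word2Vec embeddings; the helper name `cosine_sim` is also just a placeholder.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only
v_cat = [1.0, 2.0, 0.0]
v_kitten = [2.0, 4.0, 0.0]   # points in the same direction as v_cat
v_car = [0.0, 0.0, 3.0]      # orthogonal to v_cat

print(cosine_sim(v_cat, v_kitten))   # same direction: similarity of 1, up to rounding
print(cosine_sim(v_cat, v_car))      # orthogonal: 0.0
print(1 - cosine_sim(v_cat, v_car))  # cosine distance: 1.0
```

Because the measure depends only on direction, scaling a vector (as with `v_kitten`) does not change the similarity.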

## Applications

1) Text Similarity
2) Language Translation
3) Finding Odd Words
4) Word Analogies

## Word Embeddings
Word embeddings are numerical representations of words, in the form of vectors.

The Word2Vec model represents each word as a 300-dimensional vector.

In this tutorial we are going to see how to use the pre-trained word2vec model.

The model download is around 1.5 GB (compressed).
We will work with Gensim, a popular NLP package.
Gensim's Word2Vec module provides optimized implementations of

1) CBOW Model

2) SkipGram Model

Paper 1: Efficient Estimation of Word Representations in Vector Space

Paper 2: Distributed Representations of Words and Phrases and their Compositionality

### Word2Vec using Gensim
Link https://radimrehurek.com/gensim/models/word2vec.html

CODE
Load Word2Vec Model
KeyedVectors - This object essentially contains the mapping between words and embeddings. After training, it can be used directly to query those embeddings in various ways

In [2]:
import gensim
from gensim.models import word2vec
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

In [4]:
word_vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',binary=True)

In [5]:
v_apple = word_vectors["apple"]
v_india = word_vectors["india"]

In [6]:
print(v_apple.shape)
print(v_india.shape)

(300,)
(300,)


In [7]:
cosine_similarity([v_india],[v_apple])

array([[0.17158598]], dtype=float32)

In [8]:
import numpy as np

## Find the Odd One Out

In [9]:
def odd_one_out(words):
    """Accepts a list of words and returns the odd word"""
    
    # Look up the embedding of every word in the list
    all_word_vectors = [word_vectors[w] for w in words]
    avg_vector = np.mean(all_word_vectors,axis=0)
    print(avg_vector.shape)
    
    # Compare every word with the average vector and keep the least similar one
    odd_word = None
    min_similarity = float("inf") # Any real similarity is smaller than this
    
    for w in words:
        # cosine_similarity returns a 1x1 matrix, so take the scalar out of it
        sim = cosine_similarity([word_vectors[w]],[avg_vector])[0][0]
        if sim < min_similarity:
            min_similarity = sim
            odd_word = w
    
        print("Similarity btw %s and avg vector is %.2f"%(w,sim))
            
    return odd_word

In [10]:
input_1 = ["apple","mango","juice","party","orange"]
input_2 = ["music","dance","sleep","dancer","food"]
input_3 = ["match","player","football","cricket","dancer"]
input_4 = ["india","paris","russia","france","germany"]

In [11]:
odd_one_out(input_1)

(300,)
Similarity btw apple and avg vector is 0.78
Similarity btw mango and avg vector is 0.76
Similarity btw juice and avg vector is 0.71
Similarity btw party and avg vector is 0.36
Similarity btw orange and avg vector is 0.65


'party'

In [12]:
odd_one_out(input_2)

(300,)
Similarity btw music and avg vector is 0.66
Similarity btw dance and avg vector is 0.81
Similarity btw sleep and avg vector is 0.51
Similarity btw dancer and avg vector is 0.72
Similarity btw food and avg vector is 0.52


'sleep'

In [13]:
odd_one_out(input_3)

(300,)
Similarity btw match and avg vector is 0.58
Similarity btw player and avg vector is 0.68
Similarity btw football and avg vector is 0.72
Similarity btw cricket and avg vector is 0.70
Similarity btw dancer and avg vector is 0.53


'dancer'

In [14]:
odd_one_out(input_4)

(300,)
Similarity btw india and avg vector is 0.81
Similarity btw paris and avg vector is 0.75
Similarity btw russia and avg vector is 0.79
Similarity btw france and avg vector is 0.81
Similarity btw germany and avg vector is 0.84


'paris'
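
The same idea can be tried without downloading the pretrained model. The following is a minimal sketch that assumes hand-made 2-D vectors in place of real embeddings; `toy_vectors` and `odd_one_out_toy` are hypothetical names, and the numbers are invented. It computes all similarities to the mean vector in one NumPy expression instead of a loop.

```python
import numpy as np

# Hand-made 2-D vectors, invented for illustration; real embeddings have 300 dimensions
toy_vectors = {
    "apple":  np.array([0.9, 0.1]),
    "mango":  np.array([0.8, 0.2]),
    "orange": np.array([0.85, 0.15]),
    "party":  np.array([0.1, 0.9]),
}

def odd_one_out_toy(words, vectors):
    """Return the word whose vector is least similar to the mean vector."""
    matrix = np.stack([vectors[w] for w in words])   # shape (n_words, dim)
    mean = matrix.mean(axis=0)                        # shape (dim,)
    # Cosine similarity of every row with the mean, computed in one step
    sims = matrix @ mean / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(mean))
    return words[int(np.argmin(sims))]

print(odd_one_out_toy(["apple", "mango", "party", "orange"], toy_vectors))  # party
```

Gensim's `KeyedVectors` also offers a `doesnt_match` method built on a similar idea, which may be worth trying on the pretrained vectors.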

## Word Analogies Task

man -> woman :: prince -> princess
italy -> italian :: spain -> spanish
india -> delhi :: japan -> tokyo
man -> woman :: boy -> girl
small -> smaller :: large -> larger

Try it out
man -> coder :: woman -> ______?

In [16]:
word_vectors["man"].shape

(300,)

In [17]:
def predict_word(a,b,c,word_vectors):
    """Accepts a triad of words, a,b,c and returns d such that a is to b : c is to d"""
    a,b,c = a.lower(),b.lower(),c.lower()
    
    # Cosine similarity between (b - a) and (d - c) should be maximal
    max_similarity = -100 
    
    d = None
    
    # Gensim 4 exposes the vocabulary as key_to_index (older releases used .vocab)
    words = word_vectors.key_to_index
    
    wa,wb,wc = word_vectors[a],word_vectors[b],word_vectors[c]
    
    # Find d such that similarity(b - a, d - c) is maximal
    
    for w in words:
        if w in [a,b,c]:
            continue
        
        wv = word_vectors[w]
        # cosine_similarity returns a 1x1 matrix, so take the scalar out of it
        sim = cosine_similarity([wb-wa],[wv-wc])[0][0]
        
        if sim > max_similarity:
            max_similarity = sim
            d = w
            
    return d    

In [21]:
#triad_2 = ("man","woman","prince")
#predict_word(*triad_2,word_vectors)
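
The loop in `predict_word` calls `cosine_similarity` once per vocabulary word, which is very slow for a vocabulary of millions. A minimal sketch of the same search as a single matrix operation follows. It assumes the embeddings are available as a NumPy matrix with one row per word. In Gensim 4 these would presumably come from `word_vectors.vectors` and `word_vectors.index_to_key`; here a toy matrix with invented values stands in for them, and `predict_word_vectorized` is a hypothetical name.

```python
import numpy as np

def predict_word_vectorized(a, b, c, vocab, matrix):
    """Return d maximizing cosine(b - a, d - c) over all words except a, b, c."""
    index = {w: i for i, w in enumerate(vocab)}
    target = matrix[index[b]] - matrix[index[a]]   # direction from a to b
    diffs = matrix - matrix[index[c]]              # d - c for every candidate d
    norms = np.linalg.norm(diffs, axis=1) * np.linalg.norm(target)
    norms[norms == 0] = np.inf                     # the row for c itself has zero length
    sims = diffs @ target / norms
    for w in (a, b, c):                            # never return an input word
        sims[index[w]] = -np.inf
    return vocab[int(np.argmax(sims))]

# Toy 2-D embeddings, invented for illustration
vocab = ["man", "woman", "king", "queen", "apple"]
matrix = np.array([
    [1.0, 0.0],    # man
    [1.0, 1.0],    # woman
    [3.0, 0.0],    # king
    [3.0, 1.0],    # queen
    [0.0, -2.0],   # apple
])

print(predict_word_vectorized("man", "woman", "king", vocab, matrix))  # queen
```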

## Using the Most Similar Method

In [22]:
word_vectors.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

[('queen', 0.7118193507194519)]
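
Roughly speaking, `most_similar` adds the unit-length `positive` vectors, subtracts the `negative` ones, and ranks the remaining vocabulary by cosine similarity to the result. The sketch below illustrates that arithmetic on toy vectors. It is not Gensim's actual implementation, and the names `unit`, `most_similar_toy`, and `toy` as well as all numbers are invented for the example.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def most_similar_toy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    query = unit(query)
    skip = set(positive) | set(negative)   # input words are never returned
    scored = [(w, float(unit(v) @ query)) for w, v in vectors.items() if w not in skip]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy 2-D embeddings, invented for illustration
toy = {
    "man":    np.array([1.0, 0.0]),
    "woman":  np.array([1.0, 1.0]),
    "king":   np.array([3.0, 0.0]),
    "queen":  np.array([3.0, 1.0]),
    "prince": np.array([3.0, -0.5]),
    "apple":  np.array([0.0, -2.0]),
}

result = most_similar_toy(toy, positive=["woman", "king"], negative=["man"])
print(result[0][0])  # queen
```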

## Training Your Own Word2Vec Model

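A Word2Vec model can learn embeddings from any text corpus!

1) Continuous Bag of Words Model
2) Skip Gram Model

The algorithm slides a window over the text. For each target word (Y), the words inside the window are its context words (X). CBOW predicts the target from its context, while Skip Gram predicts the context words from the target. In both cases the model is trained on (X, Y) pairs in a supervised manner, with the labels coming from the text itself. The algorithm was developed by Tomas Mikolov and colleagues at Google.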
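
To make the windowing concrete, here is a simplified sketch of how training pairs could be generated from one tokenized sentence. It ignores details of real implementations such as subsampling and negative sampling. The sentence and the helper name `context_target_pairs` are made up for illustration.

```python
def context_target_pairs(tokens, window=2):
    """Yield (context_words, target_word) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target

# Invented example sentence
sentence = ["deepika", "and", "ranveer", "got", "married"]

# CBOW view: all context words together are used to predict the target
cbow_pairs = list(context_target_pairs(sentence, window=1))

# Skip-gram view: the target is used to predict each context word separately
skipgram_pairs = [(target, ctx) for context, target in cbow_pairs for ctx in context]

print(cbow_pairs[1])       # (['deepika', 'ranveer'], 'and')
print(skipgram_pairs[:3])  # [('deepika', 'and'), ('and', 'deepika'), ('and', 'ranveer')]
```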

### Data Preparation
Each sentence must be tokenized into a list of words.

The sentences can be loaded into memory all at once, or we can build a data pipeline that feeds them to the model iteratively.
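
A minimal sketch of such a pipeline follows. It assumes a corpus file with one sentence per line and simple whitespace tokenization; both are assumptions made for this example, and `SentenceStream` is a hypothetical class name. The file is re-opened on every pass because a training run needs to iterate over the corpus more than once.

```python
import os
import tempfile

class SentenceStream:
    """Iterate over a text file, yielding one tokenized line at a time."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass, so the corpus can be iterated repeatedly
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                tokens = line.lower().split()
                if tokens:          # skip empty lines
                    yield tokens

# Small demo file, created only for this example
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Deepika married Ranveer\n\nPriyanka married Nick\n")

stream = SentenceStream(tmp.name)
first_pass = list(stream)
second_pass = list(stream)
os.remove(tmp.name)

print(first_pass)  # [['deepika', 'married', 'ranveer'], ['priyanka', 'married', 'nick']]
```

An object like this could be passed to Gensim's `Word2Vec` in place of an in-memory list, since only one line is held in memory at a time.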

In [24]:
import nltk
from nltk.corpus import stopwords

In [25]:
stopw  = set(stopwords.words('english'))

In [29]:
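## Read the file 
def readFile(file): 
    # The with-block closes the file even if reading fails
    with open(file,'r',encoding='utf-8') as f:
        text = f.read()
    sentences = nltk.sent_tokenize(text)
    
    data = []
    for sent in sentences:
        words =  nltk.word_tokenize(sent)
        # Keep tokens longer than 2 characters that are not stopwords.
        # The stopword check runs before lowercasing, so capitalized
        # stopwords such as "The" or "From" are kept.
        words = [w.lower() for w in words if len(w)>2 and w not in stopw]
        data.append(words)
        
    return data

text = readFile('bollywood.txt')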

In [30]:
print(text)

[['deepika', 'padukone', 'ranveer', 'singh', 'wedding', 'one', 'biggest', 'bollywood', 'events', 'happened', '2018'], ['the', 'deepika', 'ranveer', 'celebrations', 'hooked', 'phones', 'waiting', 'come', 'also', 'gave', 'enough', 'reason', 'believe', 'stylish', 'two', 'couple'], ['from', 'airport', 'looks', 'reception', 'parties', 'everything', 'entire', 'timeline', 'deepika', 'ranveer', 'wedding', 'style', 'file'], ['not', 'ambanis', 'deepika', 'ranveer', 'priyanka', 'nick'], ['man', 'proves', 'wedding', 'the', 'year', 'this', 'year', 'year', 'big', 'fat', 'lavish', 'extravagant', 'weddings'], ['from', 'isha', 'ambani', 'anand', 'piramal', 'deepika', 'padukone', 'ranveer', 'singh', 'priyanka', 'chopra', 'nick', 'jonas', 'kapil', 'sharma', 'ginni', 'chatrath', '2018', 'saw', 'many', 'grand', 'weddings'], ['but', 'nothing', 'beats', 'man', 'wedding', 'the', 'year', 'award', 'social', 'media'], ['priyanka', 'also', 'shared', 'video', 'featuring', 'nick', 'jonaswas', 'also', 'celebrating',

In [31]:
from gensim.models import Word2Vec

In [33]:
model = Word2Vec(text,window=10,min_count=1)

In [34]:
print(model)

Word2Vec(vocab=116, vector_size=100, alpha=0.025)


In [39]:
words = list(model.wv.key_to_index)

In [40]:
print(words)

['year', 'priyanka', 'nick', 'deepika', 'ranveer', 'wedding', 'the', 'chopra', 'sharma', 'ginni', 'weddings', 'jonas', 'kapil', 'chatrath', 'anand', '2018', 'isha', 'ambani', 'piramal', 'saw', 'from', 'new', 'also', 'man', 'singh', 'padukone', 'virat', 'many', 'grand', 'but', 'nothing', 'beats', 'award', 'media', 'shared', 'anushka', 'style', 'couple', 'two', 'social', 'big', 'fat', 'celebrations', 'this', 'december', 'married', 'friends', 'lavish', 'extravagant', 'one', 'entire', 'parties', 'everything', 'timeline', 'file', 'not', 'ambanis', 'pink', 'events', 'happened', 'reception', 'bollywood', 'squad', 'hooked', 'phones', 'waiting', 'come', 'biggest', 'gave', 'enough', 'reason', 'believe', 'stylish', 'attire', 'dress', 'looks', 'airport', 'side', 'morning', 'jaggo', 'celebration', 'verbier', 'switzerland', 'three', 'receptions', 'delhi', 'mumbai', 'night', 'proves', 'made', 'even', 'special', 'industry', 'long', 'time', 'there', 'glimpses', 'outstanding', 'pictures', 'london', 'ran

In [46]:
#print(model.wv["deepika"].shape)

In [47]:
#all_vectors = model.wv.get_normed_vectors()
#print(all_vectors.shape)

In [49]:
#from sklearn.manifold import TSNE
#tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300,random_state=1)
#tsne = TSNE(n_components=2, verbose=1,n_iter=1000,random_state=1)

#tsne_results = tsne.fit_transform(all_vectors)

In [50]:
actors = ["ranveer","deepika","padukone","singh","nick","jonas","chopra","priyanka","virat","anushka","ginni"]


def predict_actor(a,b,c,word_vectors):
    """Accepts a triad of words, a,b,c and returns d such that a is to b : c is to d"""
    a,b,c = a.lower(),b.lower(),c.lower()
    max_similarity = -100 
    
    d = None
    words = actors
    
    wa,wb,wc = word_vectors[a],word_vectors[b],word_vectors[c]
    
    # Find d such that similarity(b - a, d - c) is maximal
    
    for w in words:
        if w in [a,b,c]:
            continue
        
        wv = word_vectors[w]
        # cosine_similarity returns a 1x1 matrix, so take the scalar out of it
        sim = cosine_similarity([wb-wa],[wv-wc])[0][0]
        
        if sim > max_similarity:
            max_similarity = sim
            d = w
    return d   

In [51]:
triad = ("nick","priyanka","virat")
predict_actor(*triad,model.wv)

'deepika'

In [52]:
triad = ("ranveer","deepika","priyanka")
predict_actor(*triad,model.wv)

'singh'

In [53]:
triad = ("ranveer","singh","deepika")
predict_actor(*triad,model.wv)

'ginni'