
## Word2Vec Model
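- Google's pretrained Word2Vec model
- Contains 300-dimensional vectors for about 3 million words and phrases, trained on roughly 100 billion words of Google News text

- Words that appear in similar contexts have similar vectors
- The distance or similarity between two words can be measured using cosine distance (equivalently, cosine similarity)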

### Applications
- Text Similarity
- Odd One Out
- Fill in the Blanks
- Word Analogies


### Word Embeddings
- Word embeddings are numerical representations of words in the form of vectors.

![Word2Vec](word2vec.png)

Word embeddings are one of the most popular representations of a document's vocabulary. They can capture the context of a word in a document, its relations with other words, and so on.
![](WordEmbeddings.PNG)

**Word2Vec Algorithms**

**CBOW Model** and **SkipGram Model**

CBOW and skip-gram are the two Word2Vec models. Both use a shallow neural network, but they swap inputs and outputs: skip-gram takes a target word and tries to predict each of its context words, whereas CBOW takes the context words and tries to predict the target word. The CBOW model is as follows.
![](cbow.PNG)

The SkipGram model is as follows.
![](skip.PNG)


CBOW learns to predict a word from its context, that is, to maximize the probability of the target word given the context. This is a problem for rare words. For example, given the context "yesterday was a really [...] day", the CBOW model will say that the missing word is most probably "beautiful" or "nice". A word like "delightful" gets much less of the model's attention, because the model is designed to predict the most probable word; the rare word is smoothed over by the many examples with more frequent words.

The skip-gram model, on the other hand, is designed to predict the context. Given the word "delightful", it must understand it and tell us that there is a high probability that the context is "yesterday was a really [...] day", or some other relevant context. With skip-gram, "delightful" does not compete with "beautiful"; instead, each delightful+context pair is treated as a new observation.

**CBOW and SkipGram Architecture**

Given a set of sentences (also called a corpus), the model loops over the words of each sentence and either uses the current word to predict its neighbors (its context), in which case the method is called “Skip-Gram”, or uses each of these contexts to predict the current word, in which case the method is called “Continuous Bag Of Words” (CBOW). The number of words in each context is limited by a parameter called “window size”.

The skip-gram neural network model is surprisingly simple in its most basic form.
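
As a rough illustration of how the window size turns a sentence into training examples, the sketch below builds CBOW-style and skip-gram-style examples from one tokenized sentence. The helper name `training_pairs` and the sample sentence are assumptions made for this example; real implementations add further details (such as subsampling frequent words) that are left out here.

```python
def training_pairs(sentence, window_size):
    """Yield (context_words, target_word) for every position in a tokenized sentence."""
    for i, target in enumerate(sentence):
        left = sentence[max(0, i - window_size):i]      # up to window_size words before
        right = sentence[i + 1:i + 1 + window_size]     # up to window_size words after
        yield left + right, target

sentence = ["yesterday", "was", "a", "really", "delightful", "day"]

# CBOW: the whole context is one input, the target word is the label.
cbow_examples = list(training_pairs(sentence, window_size=2))

# Skip-gram: every (target, context word) pair is a separate example.
skipgram_examples = [
    (target, context_word)
    for context, target in cbow_examples
    for context_word in context
]

print(cbow_examples[4])
print([pair for pair in skipgram_examples if pair[0] == "delightful"])
```

With a window size of 2, the word "delightful" yields one CBOW example (its three neighbors as input) but three separate skip-gram examples, one per neighbor.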

**Working**

We train a simple neural network with a single hidden layer to perform a certain task, but we are not actually going to use that network for the task we trained it on. Instead, the goal is just to learn the weights of the hidden layer; these weights form the embedding matrix that we are trying to learn.

We represent an input word like “ants” as a one-hot vector. This vector has n components (one for every word in our vocabulary), with a “1” in the position corresponding to the word “ants” and 0s in all other positions.

There is no activation function on the hidden layer neurons, but the output neurons use softmax.

CBOW Architecture
![](arch1.PNG)

For our example, say we are learning word vectors with 300 features and a vocabulary of 10,000 words. The hidden layer is then represented by a weight matrix with 10,000 rows (one for every word in our vocabulary) and 300 columns (one for every hidden neuron).
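
To see why the hidden-layer weights act as the embedding matrix, the sketch below multiplies a one-hot vector by a weight matrix and checks that the result is simply the matching row. It uses a toy vocabulary of 5 words and 3 features instead of 10,000 and 300; the words, the weight values, and the helper names `one_hot` and `hidden_layer` are all made up for illustration.

```python
vocab = ["ants", "bees", "car", "road", "tree"]  # toy vocabulary (assumed)

# Toy hidden-layer weights: one row per word, one column per feature.
W = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]

def one_hot(word):
    """Return a vector with 1.0 at the word's index and 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

def hidden_layer(x):
    """Vector-matrix product x @ W, with no activation function."""
    n_features = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(n_features)]

h = hidden_layer(one_hot("ants"))
print(h)                               # the hidden-layer output for "ants"
print(h == W[vocab.index("ants")])     # it equals the row of W for "ants"
```

Because the one-hot input selects exactly one row, the multiplication is in effect a table lookup, which is why the rows of this matrix can be read off as word vectors after training.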

SkipGram Architecture
![](arch2.PNG)

The end goal of all of this is just to learn this hidden-layer weight matrix; the output layer is discarded when we are done. During training, the 1 x 300 word vector for “ants” is fed to the output layer, which is a softmax regression classifier.
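
The output layer can be sketched in a few lines: each vocabulary word has its own weight vector, the hidden vector is scored against each of them, and softmax turns the scores into probabilities. The sizes (3 features, 4 words), the weight values, and the names `softmax` and `output_layer` are assumptions chosen to keep the example small.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def output_layer(h, W_out):
    """Score hidden vector h against one weight vector per vocabulary word, then normalize."""
    scores = [sum(hi * wi for hi, wi in zip(h, weights)) for weights in W_out]
    return softmax(scores)

h = [0.1, 0.2, 0.3]          # toy hidden vector (the word vector of the input word)
W_out = [                    # toy output weights: one row per vocabulary word
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]

probs = output_layer(h, W_out)
print(probs)
print(sum(probs))
```

The word whose weight vector has the largest dot product with `h` receives the highest probability; here that is the last word.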

**Getting Embeddings Matrix from Neural Network**
![](gwe1.PNG)
![](gwe2.PNG)

**Cosine Similarity**

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space: it is the cosine of the angle between them. Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is expressed using their dot product and their magnitudes.
![](cosine.png)
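
Written out, cos(θ) is the dot product of A and B divided by the product of their lengths. A minimal pure-Python sketch follows; the function name `cosine_sim` is chosen here to avoid clashing with scikit-learn's `cosine_similarity`, and the vectors are toy values.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two non-zero vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_sim([1.0, 0.0], [2.0, 0.0]))    # same direction  -> 1.0
print(cosine_sim([1.0, 0.0], [0.0, 3.0]))    # orthogonal      -> 0.0
print(cosine_sim([1.0, 0.0], [-1.0, 0.0]))   # opposite        -> -1.0
```

The measure depends only on direction, not on length: scaling either vector by a positive factor leaves the result unchanged.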

**Word Analogies**
At its most basic, an analogy is a comparison of two things to show their similarities. Sometimes the things being compared are quite similar, but other times they could be very different. 
![](eqn.PNG)
Nevertheless, an analogy explains one thing in terms of another to highlight the ways in which they are alike.
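
With word vectors, one way to read an analogy a : b :: c : d is that the offset from a to b should point in roughly the same direction as the offset from c to d. The sketch below applies that idea to hand-made 2-D vectors; the vectors, the words, and the helper names are invented for illustration and do not come from a trained model.

```python
import math

# Hand-made toy vectors: the second coordinate separates man/woman and king/queen.
toy_vectors = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king":  [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, 3.0],
}

def cosine_sim(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def complete_analogy(a, b, c, vectors):
    """Return the word d whose offset (d - c) is most parallel to (b - a)."""
    offset = [y - x for x, y in zip(vectors[a], vectors[b])]
    best_word, best_sim = None, float("-inf")
    for word, vec in vectors.items():
        if word in (a, b, c):            # never answer with a query word
            continue
        candidate = [v - u for u, v in zip(vectors[c], vec)]
        sim = cosine_sim(offset, candidate)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

print(complete_analogy("man", "woman", "king", toy_vectors))
```

On these toy vectors the offset from "man" to "woman" matches the offset from "king" to "queen" exactly, so "queen" is returned.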

In [2]:
import gensim
from gensim.models import word2vec
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

## Training Our Own Word2Vec Model

A Word2Vec model can learn embeddings from any text corpus, using either of two architectures:
- Continuous Bag of Words Model
- Skip Gram Model

The algorithm looks at a window around each target word (Y) to collect its context words (X), and the model is trained on the resulting (X, Y) pairs in a supervised manner. Word2Vec was developed by Tomas Mikolov and colleagues at Google.

#### Data Preparation


- Each sentence must be tokenized into a list of words.

In [4]:
import nltk
from nltk.corpus import stopwords

In [8]:
# Requires the NLTK stop-word list: run nltk.download('stopwords') once
stopw = set(stopwords.words('english'))

In [14]:
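def readFile(file):
    """Read a text file and return a list of tokenized sentences."""
    # Decode explicitly as UTF-8; a platform default such as cp1252
    # yields mojibake tokens like 'whatâ€™s'.
    with open(file, 'r', encoding='utf-8') as f:
        text = f.read()
    sentences = nltk.sent_tokenize(text)

    data = []
    for sent in sentences:
        words = nltk.word_tokenize(sent)
        # Keep tokens longer than 2 characters that are not stop words.
        # The stop-word check runs before lowercasing, so capitalized
        # stop words such as 'The' or 'From' pass the filter.
        words = [w.lower() for w in words if len(w) > 2 and w not in stopw]
        data.append(words)

    return data

text = readFile('bollywood.txt')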

In [15]:
print(text)

[['deepika', 'padukone', 'ranveer', 'singh', 'wedding', 'one', 'biggest', 'bollywood', 'events', 'happened', '2018'], ['the', 'deepika', 'ranveer', 'celebrations', 'hooked', 'phones', 'waiting', 'whatâ€™s', 'come', 'also', 'gave', 'enough', 'reason', 'believe', 'stylish', 'two', 'couple'], ['from', 'airport', 'looks', 'reception', 'parties', 'everything', 'hereâ€™s', 'entire', 'timeline', 'deepika', 'ranveer', 'wedding', 'style', 'file'], ['not', 'ambanis', 'deepika', 'ranveer', 'priyanka', 'nick'], ['man', 'proves', 'wedding', 'the', 'year', 'this', 'year', 'year', 'big', 'fat', 'lavish', 'extravagant', 'weddings'], ['from', 'isha', 'ambani', 'anand', 'piramal', 'deepika', 'padukone', 'ranveer', 'singh', 'priyanka', 'chopra', 'nick', 'jonas', 'kapil', 'sharma', 'ginni', 'chatrath', '2018', 'saw', 'many', 'grand', 'weddings'], ['but', 'nothing', 'beats', 'man', 'wedding', 'the', 'year', 'award', 'social', 'media'], ['priyanka', 'also', 'shared', 'video', 'featuring', 'nick', 'jonaswas'

**Model Creation**

In [13]:
from gensim.models import Word2Vec

In [16]:
# gensim 3.x signature; gensim 4.0+ renames size to vector_size
model = Word2Vec(text,size=300,window=10,min_count=1)
# words with a frequency lower than min_count are discarded
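
The effect of `min_count` can be pictured with a small stdlib sketch. This is not gensim's implementation, only an illustration of the rule that words seen fewer than `min_count` times are dropped; the helper name `build_vocab` and the toy sentences are assumptions.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Count every word and keep only those seen at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

toy_sentences = [
    ["deepika", "ranveer", "wedding"],
    ["ranveer", "wedding", "style"],
    ["wedding", "year"],
]

print(build_vocab(toy_sentences, min_count=1))   # every word is kept
print(build_vocab(toy_sentences, min_count=2))   # only words seen at least twice
```

With `min_count=1`, as in the cell above, no word is discarded, which suits a very small corpus where most words occur only once.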

In [17]:
print(model)

Word2Vec(vocab=118, size=300, alpha=0.025)


In [18]:
# gensim 3.x attribute; in gensim 4.0+ use list(model.wv.key_to_index)
words=list(model.wv.vocab)
print(words)

['dress', 'morning', 'isha', 'celebrating', 'suit', 'one', 'celebrations', 'chatrath', 'singh', 'entire', 'sharma', 'believe', 'three', 'nothing', 'jonaswas', 'hooked', 'file', 'even', 'reception', 'made', 'pictures', 'anand', 'happened', 'mumbai', 'stylish', 'delhi', 'hereâ€™s', 'timeline', 'ambani', 'grand', 'year', 'bollywood', 'media', 'celebration', 'looks', 'couple', 'wife', 'biggest', 'actress', 'ambanis', 'anushka', 'saw', 'time', 'padukone', 'big', 'everything', 'attire', 'lavish', 'deepika', 'verbier', 'glimpses', 'there', 'special', 'extravagant', 'featuring', 'parties', 'but', 'ranveer', 'fat', 'receptions', 'social', 'december', 'proves', 'from', 'piramal', 'switzerland', 'pleasure', 'industry', 'kapil', 'waiting', 'weddings', 'also', 'celebrated', 'long', 'squad', 'first', 'events', 'ginni', 'chopra', 'many', 'award', 'family', 'side', 'airport', 'new', 'shared', 'christmas', 'man', 'whatâ€™s', 'enough', 'not', 'outstanding', 'night', 'video', 'pink', 'come', 'london', 'a

In [20]:
# index the KeyedVectors (model.wv); indexing the model directly is deprecated
print(model.wv["deepika"])

[  1.19208707e-03  -4.57057467e-04   5.65540045e-04   5.10682177e-04
  -1.33018405e-03  -6.26506808e-04   5.48020413e-04  -1.38523779e-03
  -1.32115511e-03   7.94710824e-04   1.35266501e-03   1.96988520e-04
   1.73914348e-04   1.22015446e-03   2.50179583e-04   7.21760967e-04
   1.13505905e-03  -1.01390167e-03   9.77296964e-04  -4.13630303e-04
  -1.19735755e-03  -1.22274400e-03  -1.36089500e-03  -1.14474073e-03
   9.45234264e-04   1.50892453e-03  -5.93634904e-04   1.27291062e-03
   1.00172311e-03  -1.66610291e-03  -7.99938454e-04   9.94241913e-04
   1.40463386e-03   1.11631374e-03  -1.00570184e-03  -1.46877384e-04
  -1.94483830e-04   2.63748254e-04   9.85280843e-04  -1.35255733e-03
   5.02862793e-04   1.47668412e-03  -6.45603985e-04  -1.22383994e-03
   1.62548013e-03  -1.77793991e-04  -5.02889976e-04   5.86938346e-04
  -4.21069330e-04  -8.95149773e-04  -3.67473316e-04   4.21759440e-04
   1.10720597e-04  -5.16295113e-05   1.09494978e-03  -1.06434175e-03
   2.07629364e-05  -1.49300671e-03



In [21]:
print(model.wv["deepika"].shape)

(300,)




**Test the Model**

In [22]:
actors = ["ranveer","deepika","padukone","singh","nick","jonas","chopra","priyanka","virat","anushka","ginni"]

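def predict_actor(a,b,c,word_vectors):
    """Complete the analogy a : b :: c : ? using only the words in actors."""
    a,b,c = a.lower(),b.lower(),c.lower()
    max_similarity = -100

    d = None
    words = actors

    wa,wb,wc = word_vectors[a],word_vectors[b],word_vectors[c]

    for w in words:
        # skip the three query words themselves
        if w in [a,b,c]:
            continue

        wv = word_vectors[w]
        # cosine_similarity returns a 1x1 matrix; take the scalar
        sim = cosine_similarity([wb-wa],[wv-wc])[0][0]

        if sim > max_similarity:
            max_similarity = sim
            d = w
    return d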

In [25]:
word_vectors=model.wv

In [26]:
triad = ("nick","priyanka","virat")
predict_actor(*triad,model.wv)

'anushka'

In [27]:
triad = ("ranveer","deepika","priyanka")
predict_actor(*triad,word_vectors)

'ginni'

In [28]:
triad = ("ranveer","singh","deepika")
predict_actor(*triad,word_vectors)

'chopra'

In [29]:
triad = ("deepika","padukone","priyanka")
predict_actor(*triad,word_vectors)

'virat'

In [31]:
triad = ("priyanka","jonas","nick")
predict_actor(*triad,word_vectors)

'chopra'

In [35]:
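# binary=True writes the binary format that the .bin extension implies (the default is text)
model.wv.save_word2vec_format("bollywood.bin", binary=True)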



## Find the Odd One Out

In [39]:
import numpy as np

In [54]:
def odd_one_out(words,word_vectors):
    """Accepts a list of words and returns the odd word"""

    # Generate all word embeddings for the given list
    all_word_vectors = [word_vectors[w] for w in words]
    avg_vector = np.mean(all_word_vectors,axis=0)

    # Iterate over every word and find its similarity to the average vector
    odd_word = None
    min_similarity = float("inf")  # larger than any cosine similarity

    for w in words:
        # cosine_similarity returns a 1x1 matrix; take the scalar
        sim = cosine_similarity([word_vectors[w]],[avg_vector])[0][0]
        if sim < min_similarity:
            min_similarity = sim
            odd_word = w

        print("Similarity between %s and average vector is %.2f"%(w,sim))

    return odd_word

In [57]:
words =["ranveer","deepika","anushka","style"]

In [58]:
odd_one_out(words,word_vectors)

Similarity between ranveer and average vector is 0.51
Similarity between deepika and average vector is 0.55
Similarity between anushka and average vector is 0.54
Similarity between style and average vector is 0.46


'style'