
# Word2Vec Model
- Google's pretrained Word2Vec model
- Contains vector representations of 3 million words and phrases, trained on roughly 100 billion words of Google News text

- Words that appear in similar contexts have similar vectors
- The distance/similarity between two words can be measured using cosine similarity (cosine distance is 1 minus the cosine similarity)
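
As a minimal sketch of what cosine similarity computes, the snippet below implements it in plain Python. The helper name `cosine_sim` and the 3-dimensional toy vectors are made up for illustration; they are not real embeddings, and later cells use scikit-learn's `cosine_similarity` instead.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical values, chosen so the results are easy to check by hand)
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # points in the same direction as a
c = [-3.0, 0.0, 1.0]   # orthogonal to a (dot product is 0)

sim_ab = cosine_sim(a, b)
sim_ac = cosine_sim(a, c)
print(sim_ab, sim_ac)
print(1.0 - sim_ab)    # cosine distance between a and b
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.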


### Applications
- Text Similarity
- Language Translation
- Finding Odd Words
- Word Analogies


### Word Embeddings
- Word embeddings are numerical representations of words, in the form of vectors.

- The Word2Vec model represents each word as a 300-dimensional vector.

- In this tutorial we are going to see how to use a pre-trained Word2Vec model.
- The model download is around 1.5 GB.
- We will work with Gensim, a popular NLP package.


Gensim's Word2Vec module provides optimized implementations of:

1) **CBOW Model**

2) **SkipGram Model**


Paper 1: [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)


Paper 2: [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)

### Word2Vec using Gensim
Link: https://radimrehurek.com/gensim/models/word2vec.html

In [None]:
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

--2020-06-23 07:08:56--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.38.110
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.38.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2020-06-23 07:09:18 (75.4 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



# CODE ##

##### Load Word2Vec Model


**KeyedVectors** - This object essentially contains the mapping between words and embeddings. After training, it can be used directly to query those embeddings in various ways.

In [None]:
# Libraries
import numpy as np
import gensim
from gensim.models import Word2Vec, KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
word_vector = KeyedVectors.load_word2vec_format('/content/GoogleNews-vectors-negative300.bin.gz', binary= True)



In [None]:
type(word_vector)

gensim.models.keyedvectors.Word2VecKeyedVectors

In [None]:
Apple = word_vector["Apple"]
Apple

array([-1.74804688e-01,  3.00292969e-02, -2.16796875e-01,  1.56250000e-01,
       -3.57421875e-01, -6.05468750e-02,  1.36718750e-01,  9.57031250e-02,
        3.17382812e-03, -4.29687500e-02, -3.30078125e-01,  2.57812500e-01,
        2.51953125e-01, -2.77343750e-01, -6.98242188e-02, -2.95410156e-02,
        3.22265625e-01, -7.76367188e-02, -3.06396484e-02, -1.67968750e-01,
       -5.76171875e-02,  3.05175781e-02,  5.52368164e-03, -1.26953125e-01,
       -1.44042969e-02,  1.75781250e-01,  9.47265625e-02,  3.16406250e-01,
       -7.81250000e-03, -3.40270996e-03,  3.63769531e-02,  1.11816406e-01,
       -1.24023438e-01,  1.29882812e-01, -3.22265625e-02, -1.60156250e-01,
        7.56835938e-02,  6.73828125e-02,  4.08203125e-01,  2.23632812e-01,
        1.60156250e-01,  3.63769531e-02, -1.64062500e-01, -3.51562500e-01,
        4.49218750e-02,  6.34765625e-02, -1.15234375e-01,  3.12500000e-01,
       -2.80761719e-02, -9.22851562e-02,  5.98144531e-02,  1.57470703e-02,
       -1.15234375e-01,  

In [None]:
Apple.shape

(300,)

In [None]:
apple = word_vector['apple']
mango = word_vector['mango']
Google = word_vector['Google']

In [None]:
cosine_similarity([apple], [mango])

array([[0.57518554]], dtype=float32)

In [None]:
cosine_similarity([Apple], [mango])

array([[0.11593594]], dtype=float32)

In [None]:
cosine_similarity([Apple], [Google])

array([[0.56835705]], dtype=float32)

## 1. Find the Odd One Out

In [None]:
def odd_one_out(words):
    """Accepts a list of words and returns the odd word"""
    all_words_vector = [word_vector[word] for word in words]

    # Mean across the words (axis 0) gives one 300-dimensional average vector
    avg_vector = np.mean(all_words_vector, axis=0)

    odd_word = None
    min_sim = float("inf")  # lowest similarity to the average seen so far

    for word in words:
        # cosine_similarity returns a 1x1 array; take the scalar
        temp_sim = cosine_similarity([avg_vector], [word_vector[word]])[0][0]
        # print(f"{word} and avg_vector sim is ---> {temp_sim}")

        # The odd word is the one least similar to the average
        if temp_sim < min_sim:
            min_sim = temp_sim
            odd_word = word

    return odd_word

In [None]:
odd_one_out(["apple","mango","juice","party","orange"])

apple and avg_vector sim is ---> [[0.7806554]]
mango and avg_vector sim is ---> [[0.7606032]]
juice and avg_vector sim is ---> [[0.7106042]]
party and avg_vector sim is ---> [[0.357093]]
orange and avg_vector sim is ---> [[0.649024]]


In [None]:
input_1 = ["apple","mango","juice","party","orange"] 
input_2 = ["music","dance","sleep","dancer","food"]        
input_3  = ["match","player","football","cricket","dancer"]
input_4 = ["india","paris","russia","france","germany"]

In [None]:
print(f"The odd word is ===> {odd_one_out(input_1)}")

The odd word is ===> party


In [None]:
print(f"The odd word is ===> {odd_one_out(input_2)}")

The odd word is ===> sleep


In [None]:
print(f"The odd word is ===> {odd_one_out(input_4)}")

The odd word is ===> paris


## 2. Word Analogies Task

In the word analogy task, we complete the sentence "a is to b as c is to __". An example is "man is to woman as king is to queen". In detail, we are trying to find a word d such that the associated word vectors $e_a, e_b, e_c, e_d$ are related in the following manner: $e_b - e_a \approx e_d - e_c$. We measure the similarity between $e_b - e_a$ and $e_d - e_c$ using cosine similarity.

![Word2Vec](http://jalammar.github.io/images/word2vec/word2vec.png)

- **man -> woman :: prince -> princess**
- **italy -> italian :: spain -> spanish**
- **india -> delhi :: japan -> tokyo**
- **man -> woman :: boy -> girl**
- **small -> smaller :: large -> larger**
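
To make the offset relation concrete, here is a small sketch on hand-made 2-dimensional vectors. The dictionary `toy`, its values, and the helper names `cosine`, `subtract`, and `complete_analogy` are all hypothetical and chosen for illustration; real Word2Vec vectors have 300 dimensions and are learned, not assigned.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def subtract(u, v):
    """Element-wise difference u - v."""
    return [x - y for x, y in zip(u, v)]

# Hypothetical embeddings: axis 0 loosely encodes "royalty", axis 1 "gender"
toy = {
    "man":      [0.2, 0.1],
    "woman":    [0.2, 0.9],
    "prince":   [0.9, 0.1],
    "princess": [0.9, 0.9],
    "apple":    [-0.5, -0.4],
}

def complete_analogy(a, b, c, vectors):
    """Return the word d whose offset (d - c) is most similar to (b - a)."""
    target = subtract(vectors[b], vectors[a])
    best_word, best_sim = None, -math.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # never answer with one of the query words
        sim = cosine(target, subtract(vec, vectors[c]))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

print(complete_analogy("man", "woman", "prince", toy))
```

Here the offset from "man" to "woman" and the offset from "prince" to "princess" point in the same direction, so "princess" wins over the unrelated "apple".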

## Try it out
**man -> coder :: woman -> ______?**

In [None]:
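# Gensim 3.x API (used throughout this notebook); in Gensim 4.x use word_vector.key_to_index.keys()
word_vector.vocab.keys()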

In [None]:
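def predict_word(a, b, c, model):
    """Accepts a triad of words a, b, c and returns d such that a is to b as c is to d"""
    wv_a, wv_b, wv_c = model[a], model[b], model[c]

    max_sim = float("-inf")  # highest similarity seen so far
    pred_word = None
    words = model.vocab.keys()  # Gensim 3.x; in Gensim 4.x use model.key_to_index.keys()

    # Note: this scans the whole vocabulary one word at a time, so it is slow
    for w in words:
        if w in [a, b, c]:
            continue

        wv_w = model[w]
        # Compare the offset (b - a) with the candidate offset (w - c)
        temp_sim = cosine_similarity([wv_b - wv_a], [wv_w - wv_c])[0][0]

        if temp_sim > max_sim:
            max_sim = temp_sim
            pred_word = w

    return pred_word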

In [None]:
a, b, c = "man", "woman", "prince"
predict_word(a, b, c, word_vector)

'princess'

## 3. Training Your Own Word2Vec Model

A Word2Vec model can learn embeddings from any text corpus!
- Continuous Bag of Words Model **(CBOW)**
- Skip Gram Model

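The algorithm slides a window over the text: for each target word (Y), the words inside the window are its context words (X), and the model is trained on the resulting (X, Y) pairs in a supervised manner. CBOW predicts the target word from its context; Skip-Gram predicts the context words from the target. The algorithm was developed by Tomas Mikolov and colleagues at Google.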
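
The windowing idea can be sketched in a few lines of plain Python. The function names `skipgram_pairs` and `cbow_pairs` are hypothetical, and this is a simplification: it only shows how training pairs are formed from a fixed window, not how Gensim samples or trains on them.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-Gram style: one (target, context_word) pair per word inside the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """CBOW style: one (context_words, target) pair per position."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        pairs.append((context, target))
    return pairs

sentence = ["the", "quick", "brown", "fox"]
print(skipgram_pairs(sentence, window=1))
print(cbow_pairs(sentence, window=1))
```

With `window=1`, "quick" contributes the Skip-Gram pairs `("quick", "the")` and `("quick", "brown")`, while CBOW produces the single pair `(["the", "brown"], "quick")`.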

#### Data Preparation



- Each sentence must be tokenized into a list of words.

- The sentences can be loaded into memory all at once, or we can build a data pipeline that feeds them to the model iteratively.
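
For the pipeline option, a minimal sketch is a re-iterable class that reads the file lazily. The class name `SentenceStream`, the one-sentence-per-line layout, and the whitespace tokenization are assumptions made for illustration; the cells below use NLTK tokenizers and load everything into memory instead. The object is a class rather than a one-shot generator because Gensim iterates over the corpus more than once (vocabulary building, then training epochs).

```python
import os
import tempfile

class SentenceStream:
    """Re-iterable corpus: yields one tokenized sentence (list of words) per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so the stream can be iterated repeatedly
        with open(self.path, "r", encoding="utf8") as f:
            for line in f:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens

# Small demo file standing in for a real corpus (hypothetical content)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf8") as tmp:
    tmp.write("Deepika and Ranveer got married\n\nThe wedding was in Italy\n")
    demo_path = tmp.name

stream = SentenceStream(demo_path)
first_pass = list(stream)
second_pass = list(stream)  # a second pass works because __iter__ reopens the file
os.remove(demo_path)

print(first_pass)
```

An object like this can be passed wherever a list of tokenized sentences is expected, and only one line is held in memory at a time.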


In [None]:
#libs
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [None]:
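## Read the file
def readFile(file):
    # The context manager closes the file even if reading fails
    with open(file, 'r', encoding='utf8') as f:
        text = f.read()

    # Build the stopword set once instead of reloading the list for every token
    stop_words = set(stopwords.words("english"))

    data = []
    for s in nltk.sent_tokenize(text):
        words = nltk.word_tokenize(s)
        # Lowercase, drop tokens shorter than 2 characters, and drop stopwords
        words = [w.lower() for w in words if len(w) >= 2 and w.lower() not in stop_words]
        data.append(words)

    return data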

In [None]:
file = '/content/bollywood_news.txt'
text = readFile(file)
print(text)

[['deepika', 'padukone', 'ranveer', 'singh', 'wedding', 'one', 'biggest', 'bollywood', 'events', 'happened', '2018'], ['deepveer', 'celebrations', 'us', 'hooked', 'phones', 'waiting', 'come', 'also', 'gave', 'enough', 'reason', 'believe', 'stylish', 'two', 'couple'], ['airport', 'looks', 'reception', 'parties', 'everything', 'entire', 'timeline', 'dp', 'ranveer', 'wedding', 'style', 'file'], ['ambanis', 'deepika', 'ranveer', 'priyanka', 'nick'], ['man', 'proves', 'wedding', 'year', 'year', 'year', 'big', 'fat', 'lavish', 'extravagant', 'weddings'], ['isha', 'ambani', 'anand', 'piramal', 'deepika', 'padukone', 'ranveer', 'singh', 'priyanka', 'chopra', 'nick', 'jonas', 'kapil', 'sharma', 'ginni', 'chatrath', '2018', 'saw', 'many', 'grand', 'weddings'], ['nothing', 'beats', 'man', 'wedding', 'year', 'award', 'social', 'media'], ['wedding', 'season', 'year', 'kicked', 'deepika', 'padukone', 'ranveer', 'singh', 'flew', 'lake', 'como', 'tie', 'knot', 'two', 'days', 'november', '14', '15'], [

In [None]:
# lib
from gensim.models import Word2Vec

In [None]:
# Gensim 3.x signature; in Gensim 4.x the `size` argument is named `vector_size`
model = Word2Vec(text, size=300, window=5, min_count=4)

In [None]:
words = model.wv.vocab
words

{"''": <gensim.models.keyedvectors.Vocab at 0x7f7916a386a0>,
 "'s": <gensim.models.keyedvectors.Vocab at 0x7f7916a1aac8>,
 '14': <gensim.models.keyedvectors.Vocab at 0x7f7916a1a7f0>,
 '15': <gensim.models.keyedvectors.Vocab at 0x7f7916a1a7b8>,
 '2018': <gensim.models.keyedvectors.Vocab at 0x7f7916a1a550>,
 '``': <gensim.models.keyedvectors.Vocab at 0x7f7916a38668>,
 'actor': <gensim.models.keyedvectors.Vocab at 0x7f7916a38780>,
 'actress': <gensim.models.keyedvectors.Vocab at 0x7f7916a38710>,
 'added': <gensim.models.keyedvectors.Vocab at 0x7f7916a1afd0>,
 'adorable': <gensim.models.keyedvectors.Vocab at 0x7f7916a38400>,
 'also': <gensim.models.keyedvectors.Vocab at 0x7f7916a1a278>,
 'ambani': <gensim.models.keyedvectors.Vocab at 0x7f7916a1a908>,
 'anand': <gensim.models.keyedvectors.Vocab at 0x7f7916a1a780>,
 'announced': <gensim.models.keyedvectors.Vocab at 0x7f7916a38438>,
 'another': <gensim.models.keyedvectors.Vocab at 0x7f7916a1abe0>,
 'anushka': <gensim.models.keyedvectors.Vocab

In [None]:
len(words)

121

In [None]:
model.wv["deepika"]


array([-3.7348998e-04, -5.5874285e-04, -8.1451074e-04,  1.5597136e-03,
        1.5422044e-03, -4.6107372e-05, -1.0270571e-03,  4.1657564e-04,
       -6.0811429e-04,  3.2852369e-04,  1.8269591e-04, -4.0312187e-04,
        9.9296437e-04,  4.5072340e-04,  5.5370515e-04, -7.6067174e-04,
       -7.3857472e-04,  8.2679508e-05, -1.2128314e-03,  2.6621201e-04,
       -1.3482868e-03, -3.8509391e-04, -4.3008931e-04, -2.2064711e-04,
        2.3773588e-04,  1.4485372e-03, -3.0974697e-04, -6.9580099e-04,
       -1.1980457e-03,  6.5361208e-04, -1.7807847e-03, -7.6328957e-04,
        6.0992717e-04,  3.1309671e-04, -4.1628638e-04, -1.3454816e-04,
       -8.4232545e-04, -1.5386868e-03, -1.6935607e-03,  1.1022186e-03,
        1.3122032e-04,  1.6259874e-03, -5.7433755e-04,  2.7768622e-04,
       -1.3069123e-03,  1.3961298e-03,  1.4801977e-04, -5.2391464e-04,
        4.5990240e-04, -1.0248454e-03,  1.3777952e-03,  1.8291152e-03,
       -2.7853053e-04, -3.1704997e-04,  1.6839803e-03, -7.2193309e-04,
      

In [None]:
def predict_actor(a, b, c, model):
    """Accepts a triad of words a, b, c and returns d such that a is to b as c is to d"""
    # Index through model.wv; indexing the Word2Vec model directly is deprecated
    wv = model.wv
    wv_a, wv_b, wv_c = wv[a], wv[b], wv[c]

    actors = ["ranveer","deepika","padukone","singh","nick","jonas","chopra","priyanka","virat","anushka"]
    max_sim = float("-inf")  # highest similarity seen so far
    pred_word = None

    for w in actors:
        if w in [a, b, c]:
            continue

        # Compare the offset (b - a) with the candidate offset (w - c)
        temp_sim = cosine_similarity([wv_b - wv_a], [wv[w] - wv_c])[0][0]

        if temp_sim > max_sim:
            max_sim = temp_sim
            pred_word = w

    return pred_word

## 4. Test your Model

In [None]:
a, b, c = "nick", "priyanka", "virat"
print(predict_actor(a, b, c, model))

chopra




In [None]:
a, b, c = "ranveer", "deepika", "priyanka"
print(predict_actor(a, b, c, model))

anushka




In [None]:
a,b,c = "ranveer", "singh", "deepika"
predict_actor(a, b, c, model)



'priyanka'

In [None]:
triad = ("deepika","padukone","priyanka")
predict_actor(*triad, model)



'singh'

In [None]:
triad = ("priyanka","jonas","nick")
predict_actor(*triad, model)



'virat'