
# Word2Vec Model
- Google's pretrained Word2Vec model
- Contains vector representations of about 3 million words and phrases, trained on roughly 100 billion words of Google News text

- Words that appear in similar contexts have similar vectors
- The similarity between two words can be measured using cosine similarity (cosine distance is 1 minus this value)
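
As a minimal sketch of the measure mentioned above, cosine similarity is the dot product of two vectors divided by the product of their lengths. The helper name `cosine_sim` and the toy 3-dimensional vectors are made up for illustration; real Word2Vec vectors have 300 dimensions.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors (hypothetical values, not real embeddings)
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]    # points in the same direction as a
c = [-2.0, 1.0, 0.0]   # perpendicular to a

print("a vs b:", cosine_sim(a, b))          # close to 1: same direction
print("a vs c:", cosine_sim(a, c))          # 0: orthogonal
print("distance a vs b:", 1.0 - cosine_sim(a, b))
```

Because the vectors are normalized by their lengths, only their direction matters, not their magnitude.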


### Applications
- Text Similarity
- Language Translation
- Finding Odd Words
- Word Analogies


### Word Embeddings
- Word embeddings are numerical representations of words, in the form of vectors.

- The Word2Vec model represents each word as a 300-dimensional vector.

- In this tutorial we will see how to use a pre-trained Word2Vec model.
- The model download is around 1.5 GB.
- We will work with Gensim, a popular NLP package.


Gensim's Word2Vec model provides optimized implementations of

1) **CBOW** Model 

2) **SkipGram Model**


Paper 1: [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)


Paper 2: [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)

### Word2Vec using Gensim
Link: https://radimrehurek.com/gensim/models/word2vec.html

In [0]:
!git clone https://github.com/mmihaltz/word2vec-GoogleNews-vectors

Cloning into 'word2vec-GoogleNews-vectors'...
remote: Enumerating objects: 20, done.
remote: Total 20 (delta 0), reused 0 (delta 0), pack-reused 20
Unpacking objects: 100% (20/20), done.


In [0]:
!mv '/content/word2vec-GoogleNews-vectors/GoogleNews-vectors-negative300.bin.gz' '/content/'

In [0]:
!gunzip '/content/GoogleNews-vectors-negative300.bin.gz'


gzip: /content/GoogleNews-vectors-negative300.bin.gz: not in gzip format


In [0]:
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

--2020-06-03 09:12:59--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.229.3
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.229.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2020-06-03 09:13:50 (31.4 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



# Code

##### Load Word2Vec Model


**KeyedVectors** - This object holds the mapping between words and their embeddings. Once the vectors are trained (or loaded from a pre-trained file), it can be used directly to query those embeddings in various ways.

In [0]:
# Libraries
import numpy as np
import gensim
from gensim.models import word2vec, KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

In [0]:
word_vector = KeyedVectors.load_word2vec_format('/content/GoogleNews-vectors-negative300.bin.gz', binary=True)

In [0]:
type(word_vector)

gensim.models.keyedvectors.Word2VecKeyedVectors

In [0]:
word_vector["Apple"].shape

(300,)

In [0]:
v_apple = word_vector["apple"]
v_banana = word_vector["banana"]

In [0]:
cosine_similarity([v_apple], [v_banana])

array([[0.5318406]], dtype=float32)

## 1. Find the Odd One Out

In [0]:
def odd_one_out(words):
    """Accepts a list of words and returns the odd word"""
    all_word_vectors = [word_vector[w] for w in words]
    # print(len(all_word_vectors), all_word_vectors[0].shape)

    avg_vector = np.mean(all_word_vectors, axis = 0)
    # print(avg_vector)
    odd_word = None
    sim = 1.0

    for w in words:
        temp_sim = cosine_similarity([avg_vector], [word_vector[w]])
        print(f"{w} and avg_vector --->  {temp_sim}")

        if temp_sim < sim :
            sim = temp_sim
            odd_word = w

    return odd_word
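
The averaging idea used in `odd_one_out` can be sketched on toy data without loading the 1.5 GB model. The function name `odd_one_out_toy` and the 2-dimensional vectors below are hypothetical and chosen only for illustration; the sketch also computes all similarities at once and returns plain floats rather than 1x1 arrays. Gensim's `KeyedVectors` also offers a built-in `doesnt_match(words)` method for the same kind of task, though it may not score words exactly as this sketch does.

```python
import numpy as np

def odd_one_out_toy(words, vectors):
    """Return the word whose vector is least similar to the mean of all vectors."""
    mat = np.array([vectors[w] for w in words], dtype=float)
    avg = mat.mean(axis=0)
    # Cosine similarity of every row with the average vector
    sims = mat @ avg / (np.linalg.norm(mat, axis=1) * np.linalg.norm(avg))
    return words[int(np.argmin(sims))]

# Hypothetical 2-d "embeddings": three fruit-like vectors and one outlier
toy_vectors = {
    "apple": [0.9, 0.1],
    "mango": [0.8, 0.2],
    "orange": [0.85, 0.15],
    "party": [0.1, 0.9],
}

print(odd_one_out_toy(["apple", "mango", "party", "orange"], toy_vectors))  # party
```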

In [0]:
input_1 = ["apple","mango","juice","party","orange"]
input_2 = ["music","dance","sleep","dancer","food"]
input_3 = ["match","player","football","cricket","dancer"]
input_4 = ["india","paris","russia","france","germany"]

In [0]:
print(odd_one_out(input_1))
print(odd_one_out(input_2))
print(odd_one_out(input_3))
print(odd_one_out(input_4))

party
sleep
dancer
paris


In [0]:
print(odd_one_out(input_1))

apple and avg_vector --->  [[0.7806554]]
mango and avg_vector --->  [[0.7606032]]
juice and avg_vector --->  [[0.7106042]]
party and avg_vector --->  [[0.357093]]
orange and avg_vector --->  [[0.649024]]
party


In [0]:
print(odd_one_out(input_2))

music and avg_vector --->  [[0.66403615]]
dance and avg_vector --->  [[0.80607384]]
sleep and avg_vector --->  [[0.5149707]]
dancer and avg_vector --->  [[0.7154054]]
food and avg_vector --->  [[0.51771235]]
sleep


In [0]:
print(odd_one_out(input_3))

match and avg_vector --->  [[0.5837205]]
player and avg_vector --->  [[0.6805351]]
football and avg_vector --->  [[0.72256005]]
cricket and avg_vector --->  [[0.69646657]]
dancer and avg_vector --->  [[0.52681357]]
dancer


In [0]:
print(odd_one_out(input_4))

india and avg_vector --->  [[0.80707854]]
paris and avg_vector --->  [[0.74804693]]
russia and avg_vector --->  [[0.79275256]]
france and avg_vector --->  [[0.8136487]]
germany and avg_vector --->  [[0.841681]]
paris


## 2. Word Analogies Task

In the word analogy task, we complete the sentence "a is to b as c is to __". An example is "man is to woman as king is to queen". More precisely, we look for a word d such that the associated word vectors `e_a, e_b, e_c, e_d` are related in the following manner: `e_b − e_a ≈ e_d − e_c`. We measure the similarity between `e_b − e_a` and `e_d − e_c` using cosine similarity.
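
Written as a formula, the search described above picks the vocabulary word whose offset from c is most aligned with the offset from a to b. The symbol $V$ for the vocabulary is introduced here only for illustration:

$$
d = \arg\max_{w \in V \setminus \{a, b, c\}} \cos\left(e_b - e_a,\; e_w - e_c\right),
\qquad
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
$$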

![Word2Vec](http://jalammar.github.io/images/word2vec/word2vec.png)

**man -> woman ::     prince -> princess**
**italy -> italian ::     spain -> spanish**
**india -> delhi ::     japan -> tokyo**
**man -> woman ::     boy -> girl**
**small -> smaller ::     large -> larger**

### Try it out
**man -> coder :: woman -> ______?**

In [0]:
word_vector["man"].shape

(300,)

In [0]:
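def predict_word(a, b, c, word_vector):
    """Accepts a triad of words a, b, c and returns d such that a is to b as c is to d"""
    a, b, c = a.lower(), b.lower(), c.lower()
    wv_a, wv_b, wv_c = word_vector[a], word_vector[b], word_vector[c]
    # We want vb - va ≈ vd - vc
    max_sim = -10
    # Gensim 3.x API; in Gensim 4.x use word_vector.key_to_index instead of .vocab
    words = word_vector.vocab.keys()
    temp_word = None

    # This loop visits every word in the vocabulary one at a time, so it is very slow
    for w in words:
        if w in [a, b, c]:
            continue

        wv = word_vector[w]
        # cosine_similarity returns a 1x1 array; take the scalar
        temp_sim = cosine_similarity([wv_b - wv_a], [wv - wv_c])[0][0]

        if temp_sim > max_sim:
            max_sim = temp_sim
            temp_word = w

    return temp_word
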


In [0]:
a, b, c = "man", "woman", "prince"
predict_word(a, b, c, word_vector)

KeyboardInterrupt: ignored
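
The call above was interrupted because the loop runs one `cosine_similarity` call per vocabulary word. A vectorized sketch of the same search is shown below on a toy vocabulary. The name `predict_word_fast` and the 3-dimensional vectors are hypothetical. With the real model, the `words` list and `matrix` would come from the loaded `KeyedVectors` (for example its `vectors` array), and the subtraction creates a full copy of that matrix, so memory use is an assumption to check. Gensim also provides `most_similar(positive=[b, c], negative=[a])`, which answers analogies with a related but not identical scoring.

```python
import numpy as np

def predict_word_fast(a, b, c, words, matrix):
    """Return d maximizing cos(e_b - e_a, e_d - e_c), scoring all candidates at once.

    words  : list of vocabulary words
    matrix : array of shape (len(words), dim); row i is the vector of words[i]
    """
    index = {w: i for i, w in enumerate(words)}
    target = matrix[index[b]] - matrix[index[a]]    # e_b - e_a
    diffs = matrix - matrix[index[c]]               # e_w - e_c for every word w
    norms = np.linalg.norm(diffs, axis=1) * np.linalg.norm(target)
    norms[norms == 0] = np.inf                      # the row for c itself is all zeros
    sims = diffs @ target / norms
    for w in (a, b, c):
        sims[index[w]] = -np.inf                    # never return an input word
    return words[int(np.argmax(sims))]

# Hypothetical toy embeddings: axis 1 ~ "female", axis 2 ~ "royal"
toy_words = ["man", "woman", "king", "queen", "apple"]
toy_matrix = np.array([
    [1.0, 0.0, 0.0],   # man
    [1.0, 1.0, 0.0],   # woman
    [1.0, 0.0, 1.0],   # king
    [1.0, 1.0, 1.0],   # queen
    [0.0, 0.0, -1.0],  # apple
])

print(predict_word_fast("man", "woman", "king", toy_words, toy_matrix))  # queen
```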

## 3. Training Your Own Word2Vec Model

Word2Vec model can learn embeddings from any text corpus!
- Continuous Bag of Words Model **(CBOW)**
- Skip Gram Model

The algorithm slides a window over the text: the words surrounding a target word (Y) form its context (X), and the model is trained on (X, Y) pairs in a supervised manner. The algorithm was developed by Tomas Mikolov and colleagues at Google.
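
To make the windowing concrete, here is a minimal sketch that lists the (context, target) pairs a window would produce for one sentence. The function name `context_target_pairs` is hypothetical, and real implementations add steps such as negative sampling and subsampling of frequent words that are left out here. Roughly, CBOW predicts the target from its combined context words, while skip-gram predicts each context word from the target; in Gensim the `sg` parameter of `Word2Vec` selects between them (`sg=0` for CBOW, `sg=1` for skip-gram).

```python
def context_target_pairs(tokens, window=2):
    """Yield (context_word, target_word) pairs for every position in the sentence."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield tokens[j], target

sentence = ["deepika", "ranveer", "wedding", "bollywood"]
pairs = list(context_target_pairs(sentence, window=1))
for context, target in pairs:
    print(f"context={context!r:12} target={target!r}")
```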

#### Data Preparation



- Each sentence must be tokenized into a list of words.

- The sentences can either be loaded into memory all at once,
or streamed to the model by a data pipeline that feeds them iteratively.
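
The streaming option can be sketched as a small iterable class. The name `SentenceStream` is hypothetical, and the sketch assumes one sentence per line with plain whitespace tokenization rather than NLTK. Gensim's `Word2Vec` accepts any iterable of token lists that can be iterated more than once, and Gensim ships a similar helper, `gensim.models.word2vec.LineSentence`.

```python
import os
import tempfile

class SentenceStream:
    """Iterate over a text file, yielding one tokenized sentence (list of words) per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated several times
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                words = line.lower().split()
                if words:  # skip blank lines
                    yield words

# Tiny demo corpus written to a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Deepika and Ranveer got married\n\nIt was a grand wedding\n")
    demo_path = tmp.name

stream = SentenceStream(demo_path)
first_pass = list(stream)
second_pass = list(stream)   # a second pass works because __iter__ re-opens the file
print(first_pass)
os.remove(demo_path)
```

Only one line is held in memory at a time, which matters when the corpus is too large to load at once.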


In [0]:
#libs
import nltk
from nltk.corpus import stopwords

In [0]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [0]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [0]:
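## Read the file
def readFile(file):
    # Read the whole file; the context manager closes it afterwards
    with open(file, 'r', encoding='utf-8') as f:
        text = f.read()

    # Build the stop-word set once instead of once per word
    stop_words = set(stopwords.words("english"))

    sent = nltk.sent_tokenize(text)
    data = []
    for s in sent:
        words = nltk.word_tokenize(s)
        # Keep tokens longer than 2 characters that are not stop words.
        # The stop-word check runs before lowercasing, so capitalized
        # stop words such as "The" pass the filter and are kept as "the".
        words = [w.lower() for w in words if len(w) > 2 and w not in stop_words]
        data.append(words)
    return data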

In [0]:
path = '/content/bollywood_news.txt'
text = readFile(path)
text

[['deepika',
  'padukone',
  'ranveer',
  'singh',
  'wedding',
  'one',
  'biggest',
  'bollywood',
  'events',
  'happened',
  '2018'],
 ['the',
  'deepveer',
  'celebrations',
  'hooked',
  'phones',
  'waiting',
  'come',
  'also',
  'gave',
  'enough',
  'reason',
  'believe',
  'stylish',
  'two',
  'couple'],
 ['from',
  'airport',
  'looks',
  'reception',
  'parties',
  'everything',
  'entire',
  'timeline',
  'ranveer',
  'wedding',
  'style',
  'file'],
 ['not', 'ambanis', 'deepika', 'ranveer', 'priyanka', 'nick'],
 ['man',
  'proves',
  'wedding',
  'the',
  'year',
  'this',
  'year',
  'year',
  'big',
  'fat',
  'lavish',
  'extravagant',
  'weddings'],
 ['from',
  'isha',
  'ambani',
  'anand',
  'piramal',
  'deepika',
  'padukone',
  'ranveer',
  'singh',
  'priyanka',
  'chopra',
  'nick',
  'jonas',
  'kapil',
  'sharma',
  'ginni',
  'chatrath',
  '2018',
  'saw',
  'many',
  'grand',
  'weddings'],
 ['but',
  'nothing',
  'beats',
  'man',
  'wedding',
  'the',
 

In [0]:
from gensim.models import Word2Vec

In [0]:
# Gensim 3.x signature; in Gensim 4.x the `size` parameter is named `vector_size`
model = Word2Vec(text, size = 300, window = 10, min_count = 1)

In [0]:
words = model.wv.vocab
words

{"'gully": <gensim.models.keyedvectors.Vocab at 0x7fea9b29b2e8>,
 "'simmba": <gensim.models.keyedvectors.Vocab at 0x7fea9b29b128>,
 "'takht": <gensim.models.keyedvectors.Vocab at 0x7fea9b29b320>,
 '...': <gensim.models.keyedvectors.Vocab at 0x7fea9b28fa90>,
 '//twitter.com/sailee_rk/status/1079964902268141568': <gensim.models.keyedvectors.Vocab at 0x7fea9b28f978>,
 '100': <gensim.models.keyedvectors.Vocab at 0x7fea9b28fc18>,
 '14th': <gensim.models.keyedvectors.Vocab at 0x7fea9b2a1d68>,
 '15th': <gensim.models.keyedvectors.Vocab at 0x7fea9b2a1e48>,
 '2-1': <gensim.models.keyedvectors.Vocab at 0x7fea9b2f70b8>,
 '2018': <gensim.models.keyedvectors.Vocab at 0x7fea9b2e3358>,
 '2019': <gensim.models.keyedvectors.Vocab at 0x7fea9b2f7940>,
 '400': <gensim.models.keyedvectors.Vocab at 0x7fea9b296320>,
 'able': <gensim.models.keyedvectors.Vocab at 0x7fea9b29bcc0>,
 'abu': <gensim.models.keyedvectors.Vocab at 0x7fea9b2e4588>,
 'according': <gensim.models.keyedvectors.Vocab at 0x7fea9b2e29b0>,
 '

In [0]:
model.wv["deepika"]

In [0]:
# from sklearn.manifold import TSNE

# tsne = TSNE(n_components=2, verbose=1,n_iter=1000,random_state=1)
# tsne_results = tsne.fit_transform(all_vectors)

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 116 samples in 0.000s...
[t-SNE] Computed neighbors for 116 samples in 0.004s...
[t-SNE] Computed conditional probabilities for sample 116 / 116
[t-SNE] Mean sigma: 0.004361
[t-SNE] KL divergence after 250 iterations with early exaggeration: 64.828041
[t-SNE] Error after 950 iterations: 0.944916


In [0]:
# print(tsne_results)

In [0]:
def predict_actor(a, b, c, word_Vector):
    """Accepts a triad of words, a,b,c and returns d such that a is to b : c is to d"""
    wv_a, wv_b, wv_c = word_Vector[a.lower()], word_Vector[b.lower()], word_Vector[c.lower()]
    # |vb-va| == |vd-vc|
    max_sim = -100
    
    actors = ["ranveer","deepika","padukone","singh","nick","jonas","chopra","priyanka","virat","anushka","ginni"]
    
    temp_word = None
    for w in actors:
        if w in [a,b,c] :
            continue
        
        wv = word_Vector[w]
        temp_sim = cosine_similarity([wv_b - wv_a], [wv - wv_c])

        if temp_sim > max_sim :
            max_sim = temp_sim
            temp_word = w
            print(temp_word)
        
    return temp_word


## 4. Test Your Model

In [0]:
model.wv["virat"]

In [0]:
a, b, c = "nick", "priyanka", "virat"
predict_actor(a, b, c,model.wv)

ranveer
deepika


'deepika'

In [0]:
a, b, c = "ranveer", "deepika", "priyanka"
predict_actor(a, b, c, model.wv)

In [0]:
a, b, c = "ranveer", "singh", "deepika"
predict_actor(a, b, c, model.wv)

In [0]:
triad = ("deepika","padukone","priyanka")
predict_actor(*triad,model.wv)

ranveer
nick


'nick'

In [0]:
triad = ("priyanka","jonas","nick")
predict_actor(*triad,model.wv)

ranveer
singh
ginni


'ginni'