# Word2Vec from Scratch
(by Tevfik Aytekin)

In [62]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize 
from nltk.corpus import gutenberg, brown
import gensim 
from gensim.models import Word2Vec 
from gensim.parsing.preprocessing import remove_stopwords
from gensim.utils import simple_preprocess
from nltk.tokenize import RegexpTokenizer
import numpy as np
from sklearn.utils import shuffle
from queue import PriorityQueue
from tqdm import tqdm



# You need to call nltk.download() to download all the nltk corpora

## Definition from Wikipedia:

“Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.”

In [2]:
nltk.download('brown')
num_sents = len(brown.sents())
print("number of sentences:", num_sents)

[nltk_data] Downloading package brown to
[nltk_data]     /Users/tevfikaytekin/nltk_data...
[nltk_data]   Package brown is already up-to-date!


number of sentences: 57340


An example sentence, represented as a list of words:

In [3]:
brown.sents()[0]

['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of',
 "Atlanta's",
 'recent',
 'primary',
 'election',
 'produced',
 '``',
 'no',
 'evidence',
 "''",
 'that',
 'any',
 'irregularities',
 'took',
 'place',
 '.']

The following is an example application of word2vec using the Gensim library. The call below sets a few of the parameters; all the details of the Gensim implementation are documented [here](https://radimrehurek.com/gensim/models/word2vec.html). The Gensim word2vec source code is [here](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py) and the original source code by Mikolov can be found [here](https://github.com/tmikolov/word2vec/blob/master/word2vec.c).



In [64]:
model = gensim.models.Word2Vec(brown.sents(),min_count = 5,
                              vector_size = 30, window = 5, negative=5) 

An example vector representation of the word "book". Since we set `vector_size = 30`, the representation is an array of 30 real numbers.

In [5]:
model.wv['book']

array([-1.113819  , -0.26239938,  0.42764342,  0.43501478,  0.3588065 ,
        0.29601172,  1.587967  ,  0.87482476, -0.7084403 ,  0.23097405,
        0.3308485 ,  0.21532671,  0.57466847,  0.20365873,  0.41659406,
       -0.03417895,  1.1338999 ,  0.25690162, -0.03994688, -0.02766844,
        0.58070046, -0.14295667,  0.78888535,  0.20186745,  0.2501006 ,
        0.47317168,  0.18389787,  0.7076744 , -0.12536114, -1.5227041 ],
      dtype=float32)

One way to test the performance of word2vec is to look at the words most similar to a given word. Below are the words most similar to "book" and "eight".

In [6]:
model.wv.most_similar(positive='book')

[('story', 0.9279091954231262),
 ('remarks', 0.9207910895347595),
 ('opinion', 0.9131760597229004),
 ('novel', 0.904676616191864),
 ('fellow', 0.9035377502441406),
 ('suggestion', 0.8998563289642334),
 ('poem', 0.8995265364646912),
 ('artist', 0.8982446789741516),
 ('hero', 0.8961074948310852),
 ('name', 0.894064724445343)]

In [7]:
model.wv.most_similar(positive='eight')

[('seven', 0.9647053480148315),
 ('caves', 0.9393208622932434),
 ('thirty', 0.9381996393203735),
 ('decades', 0.9377883672714233),
 ('fifteen', 0.9361171722412109),
 ('fifty', 0.9340434074401855),
 ('65', 0.9306427836418152),
 ('nine', 0.9291650652885437),
 ('eleven', 0.9286434650421143),
 ('40', 0.9269590377807617)]

As you can see, the results are quite impressive. But this is not the case for every word: for the word "angry", for example, the results are not very satisfying. If we had used a larger corpus, the results would likely be better.

In [8]:
model.wv.most_similar(positive='on')

[('through', 0.8318491578102112),
 ('into', 0.8263941407203674),
 ('against', 0.7991057634353638),
 ('from', 0.7952094078063965),
 ('toward', 0.7867401242256165),
 ('over', 0.7818900346755981),
 ('along', 0.7812550663948059),
 ('behind', 0.7792336344718933),
 ('across', 0.7586532831192017),
 ('near', 0.7423183917999268)]

In [9]:
model.wv.most_similar(positive='angry')

[('businessman', 0.9537981748580933),
 ('minister', 0.9531584978103638),
 ('ankle', 0.94906085729599),
 ('oysters', 0.9490442872047424),
 ('suffering', 0.9478992223739624),
 ('gift', 0.9465452432632446),
 ('nickname', 0.9451074600219727),
 ('master', 0.9426564574241638),
 ('thoroughly', 0.9404450058937073),
 ('occasional', 0.9390085935592651)]

You can also find (cosine) similarity between two words.

In [10]:
print("Cosine similarity between 'book' and 'story':", 
    model.wv.similarity('book', 'story')) 

Cosine similarity between 'book' and 'story': 0.92790914


In [11]:
print("Cosine similarity between 'book' and 'eight':", 
    model.wv.similarity('book', 'eight')) 

Cosine similarity between 'book' and 'eight': 0.48154956
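
As a rough sketch of what these similarity scores measure, cosine similarity is the dot product of two vectors divided by the product of their norms. The helper name `cosine_similarity` and the toy vectors below are illustrative and not part of Gensim.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors (made up for illustration, not taken from the trained model)
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
print(cosine_similarity(a, a))  # identical directions give 1.0
print(cosine_similarity(a, b))  # 45 degrees apart, roughly 0.707
```

With the Gensim model, `cosine_similarity(model.wv['book'], model.wv['story'])` should agree with `model.wv.similarity('book', 'story')` up to floating-point error.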


## word2vec from scratch

Now we will write word2vec from scratch. Note that the purpose of this implementation is to help understand the theory behind word2vec. The implementation is not meant to be efficient so the running time is quite slow compared to the Gensim implementation. However, the code is simpler and shows the main ingredients of word2vec.

Different objectives can be used for word2vec. The following is the objective for word2vec with negative sampling. The main idea behind this objective is to find parameter values that maximize the dot product of the representations of words that appear in the same context and minimize the dot product of the representations of words that do not.


$$
J(\Theta) = \sum_{c,t \in D_p}\log(\sigma(v_c \cdot v_t))+\sum_{c,t \in D_n}\log(\sigma(-v_c \cdot v_t)), \qquad \Theta^{*} = \underset{\Theta}{\operatorname{argmax}}\; J(\Theta)
$$

Here, $D_p$ is the set of (positive) word pairs whose distance is at most $m$, $D_n$ is the set of unrelated (negative) word pairs, i.e., word pairs whose distance is larger than $m$ (the implementation below approximates them with randomly sampled words), and $\sigma$ is the sigmoid function. Below we derive the gradient of this objective with respect to the center, positive, and negative word vectors. These gradients are used in the gradient updates; since the objective is maximized, the updates add the gradient rather than subtract it.
$$
\begin{aligned}
\frac{\partial J(\Theta)}{\partial v_{c}} &= \sum_{c,t \in D_p}\frac{1}{\sigma(v_c \cdot v_t)}\sigma(v_c \cdot v_t)(1-\sigma(v_c \cdot v_t))(v_t)\\
&\quad + \sum_{c,t \in D_n}\frac{1}{\sigma(-v_c \cdot v_t)}\sigma(-v_c \cdot v_t)(1-\sigma(-v_c \cdot v_t))(-v_t)\\
&= \sum_{c,t \in D_p}(1-\sigma(v_c \cdot v_t))v_t + \sum_{c,t \in D_n}-(1-\sigma(-v_c \cdot v_t))v_t
\end{aligned}
$$
$$
\frac{\partial J(\Theta)}{\partial v_{t \in D_p}}=\sum_{c,t \in D_p}(1-\sigma(v_c \cdot v_t))v_c 
$$
$$
\frac{\partial J(\Theta)}{\partial v_{t \in D_n}}=\sum_{c,t \in D_n}-(1-\sigma(-v_c \cdot v_t))v_c 
$$
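
One way to sanity-check the derivation is to compare the analytic gradient of a single term with a finite-difference estimate. The sketch below does this for one positive term, $\log(\sigma(v_c \cdot v_t))$, and one negative term, $\log(\sigma(-v_c \cdot v_t))$, using random toy vectors. All function names here are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pos_term(v_c, v_t):
    """log(sigma(v_c . v_t)) for one positive pair."""
    return np.log(sigmoid(np.dot(v_c, v_t)))

def neg_term(v_c, v_t):
    """log(sigma(-v_c . v_t)) for one negative pair."""
    return np.log(sigmoid(-np.dot(v_c, v_t)))

def analytic_grad_pos(v_c, v_t):
    # Gradient of pos_term with respect to v_c, as derived above
    return (1.0 - sigmoid(np.dot(v_c, v_t))) * v_t

def analytic_grad_neg(v_c, v_t):
    # Gradient of neg_term with respect to v_c, as derived above
    return -(1.0 - sigmoid(-np.dot(v_c, v_t))) * v_t

def numeric_grad(f, v_c, v_t, eps=1e-6):
    """Central finite differences with respect to v_c."""
    grad = np.zeros_like(v_c)
    for i in range(v_c.size):
        step = np.zeros_like(v_c)
        step[i] = eps
        grad[i] = (f(v_c + step, v_t) - f(v_c - step, v_t)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
v_c = rng.normal(size=5)
v_t = rng.normal(size=5)

pos_ok = np.allclose(analytic_grad_pos(v_c, v_t), numeric_grad(pos_term, v_c, v_t), atol=1e-6)
neg_ok = np.allclose(analytic_grad_neg(v_c, v_t), numeric_grad(neg_term, v_c, v_t), atol=1e-6)
print("positive term matches:", pos_ok)
print("negative term matches:", neg_ok)
```

By symmetry of the dot product, the same check applies to the gradients with respect to $v_t$, with the roles of the two vectors swapped.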

In [12]:
def sigmoid(x):
    return (1 / (1 + np.exp(-x)))  
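
For very negative inputs, `np.exp(-x)` overflows and NumPy emits a runtime warning, although the result still rounds to 0. A possible alternative, shown here only as a sketch, never exponentiates a positive number. The name `stable_sigmoid` is hypothetical and the rest of this notebook does not use it.

```python
import numpy as np

def stable_sigmoid(x):
    """Sigmoid that only evaluates exp on non-positive arguments."""
    x = np.asarray(x, dtype=float)
    e = np.exp(-np.abs(x))  # always in [0, 1], so it cannot overflow
    # For x >= 0: 1 / (1 + exp(-x)); for x < 0: exp(x) / (1 + exp(x))
    return np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))
```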

In [13]:
def build_indices(sents):
    """ 
  
    Parameters: 
    sents: A list of sentences, where each sentence is a list of words (i.e., a list of lists). 
  
    Returns: 
    word_freqs: frequency of each word
    word_to_index: a mapping from word names to integers.
    index_to_word: a mapping from integers to word names.
    
  
    """
    counter = 0
    word_freqs = {}
    word_to_index = {}
    index_to_word = {}
    for i in range(len(sents)): 
        for j in range(len(sents[i])):
            w = sents[i][j].lower()
            if w in word_freqs:
                word_freqs[w] += 1
            else:
                word_freqs[w] = 1
                word_to_index[w] = counter
                index_to_word[counter] = w
                counter += 1
            
    return word_freqs, word_to_index, index_to_word

In [14]:
def build_training_set(sents, word_freqs, window=5, sampling_freq = 0.001, neg_exp = 0.75, num_negs = 1, min_count=5):
    """ 
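    Builds a training set.
    
    Parameters: 
    sents: A list of sentences, where each sentence is a list of words (i.e., a list of lists).
    word_freqs: frequency of each word.
    window: size of the context window on each side of the center word.
    sampling_freq: words whose relative frequency is at least this value are discarded.
    neg_exp: exponent applied to word frequencies to shape the negative sampling distribution.
    num_negs: number of negative words sampled for each positive pair.
    min_count: words that occur min_count times or fewer are discarded.
  
    Returns: 
    training_set: list of [center word, positive word, list of negative words] entries.
    words_list: array of the unique words kept after filtering.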
    """
    words_list = []
    total_freq = sum(word_freqs.values())
    
    #total_freq = sum([freq**(neg_exp) for freq in word_freqs.values()])
    word_array = []
    for word, freq in word_freqs.items():
        if ((word_freqs[word]/total_freq) < sampling_freq) and (word_freqs[word] > min_count):
            words_list.append(word)
            for i in range(int(freq**neg_exp)):
                word_array.append(word)
    
    training_set = []
    
    sampled_sents = []
    for i in range(len(sents)): 
        sent = []
        for j in range(len(sents[i])):
            w = sents[i][j].lower()
            if ((word_freqs[w] / total_freq) < sampling_freq) and (word_freqs[w] > min_count):
                sent.append(w)
        sampled_sents.append(sent)
    
    
    for i in range(len(sampled_sents)): 
        for j, w in enumerate(sampled_sents[i]):
            context = []
            for k in range(max(j-window,0),min(j+window+1,len(sampled_sents[i]))):
                w_p = sampled_sents[i][k]
                if (w == w_p):
                    continue
                w_n = []
                for _ in range(num_negs):  # separate loop variable so the outer k is not shadowed
                    w_n.append(word_array[np.random.randint(0,len(word_array))] )
                training_set.append([w,w_p,w_n])

    return training_set, np.unique(words_list)
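
The effect of `neg_exp` can be seen on a toy frequency table. `build_training_set` approximates the negative sampling distribution by repeating each word `int(freq**neg_exp)` times in `word_array`; the sketch below computes the corresponding probabilities directly instead. The counts and the name `negative_sampling_probs` are made up for illustration.

```python
import numpy as np

def negative_sampling_probs(word_freqs, neg_exp=0.75):
    """Probability of drawing each word as a negative: freq**neg_exp, normalized."""
    words = list(word_freqs)
    weights = np.array([word_freqs[w] ** neg_exp for w in words], dtype=float)
    return dict(zip(words, weights / weights.sum()))

toy_freqs = {"the": 1000, "book": 100, "archaeology": 10}  # hypothetical counts
raw = negative_sampling_probs(toy_freqs, neg_exp=1.0)
smoothed = negative_sampling_probs(toy_freqs, neg_exp=0.75)
for w in toy_freqs:
    print(f"{w:12s} raw={raw[w]:.3f} smoothed={smoothed[w]:.3f}")
```

An exponent below 1 shifts probability mass from frequent words toward rare ones, so rare words are drawn as negatives more often than their raw frequency would suggest.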

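To understand the structure of the produced `training_set`, here is a very simple example with a single sentence of 6 words. Each entry has the form [center word, positive (context) word, list of negative words].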

In [15]:
sents = [["a","b","c","d","e","f"]]
word_freqs = {"a":1,"b":1,"c":1,"d":1,"e":1,"f":1}


In [16]:
training_set, words_list = build_training_set(sents,window=1, word_freqs= word_freqs , sampling_freq = 1, min_count= 0, num_negs = 2)

In [17]:
# print training set
training_set

[['a', 'b', ['e', 'c']],
 ['b', 'a', ['c', 'e']],
 ['b', 'c', ['c', 'a']],
 ['c', 'b', ['e', 'f']],
 ['c', 'd', ['f', 'b']],
 ['d', 'c', ['c', 'f']],
 ['d', 'e', ['a', 'd']],
 ['e', 'd', ['e', 'e']],
 ['e', 'f', ['a', 'f']],
 ['f', 'e', ['f', 'f']]]
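
Ignoring the randomly drawn negatives, the first two columns of this output are simply every (center, context) pair that falls within the window. A minimal sketch of that step follows. The helper `context_pairs` is illustrative and, like `build_training_set`, it skips context words identical to the center word.

```python
def context_pairs(sentence, window=1):
    """Return all (center, context) pairs within `window` positions of each other."""
    pairs = []
    for j, center in enumerate(sentence):
        lo = max(j - window, 0)
        hi = min(j + window + 1, len(sentence))
        for k in range(lo, hi):
            if sentence[k] != center:
                pairs.append((center, sentence[k]))
    return pairs

pairs = context_pairs(["a", "b", "c", "d", "e", "f"], window=1)
print(pairs)
```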

Let us now build the training set for the Brown corpus, which will take some time.

In [42]:
def preprocess_brown_corpus():
    processed_sentences = []
    for sentence in brown.sents():
        processed_sentence = simple_preprocess(' '.join(sentence), deacc=True)  
        processed_sentences.append(processed_sentence)
    return processed_sentences


In [46]:
sents = preprocess_brown_corpus()

In [48]:
word_freqs, word_to_index, index_to_word = build_indices(sents)

In [49]:
training_set, words_list = build_training_set(sents,word_freqs)

In [50]:
# print the first 10 examples in the training set
training_set[:10]

[['fulton', 'county', ['give']],
 ['fulton', 'grand', ['hands']],
 ['fulton', 'jury', ['seemed']],
 ['fulton', 'friday', ['cause']],
 ['fulton', 'investigation', ['democracy']],
 ['county', 'fulton', ['church']],
 ['county', 'grand', ['bother']],
 ['county', 'jury', ['radical']],
 ['county', 'friday', ['experimental']],
 ['county', 'investigation', ['copies']]]

In [51]:
len(training_set)

3194890

In [60]:
def build_model(training_set, initial_alpha = 0.025, min_alpha = 0.0001, n_iters = 5, my_lambda = 0, vector_size = 30):
    word_vectors = {}
    
    # initialize word vectors uniformly in [-0.5, 0.5) (uses the global words_list built above)
    for n in range(len(words_list)):
        word_vectors[words_list[n]] = np.random.rand(vector_size,1) - 0.5
    

    alpha = initial_alpha
    for t in range(n_iters):
        training_set = shuffle(training_set)
        objective = 0
        print("dot product of words 'friend' and 'fellow': ",np.dot(word_vectors['friend'].T, word_vectors['fellow']))
        for ex in tqdm(training_set):
            w = ex[0]
            w_p = ex[1]
            w_n_list = ex[2]
            w_v = word_vectors[w]
            w_p_v = word_vectors[w_p]
            word_vectors[w_p] = w_p_v + alpha*(((1-sigmoid(np.dot(w_v.T,w_p_v)))*w_v)-my_lambda*w_p_v)
            objective += np.log((sigmoid(np.dot(w_v.T,w_p_v))))

            for n in range(len(w_n_list)):
                w_n = w_n_list[n]
                w_n_v = word_vectors[w_n]
                word_vectors[w_n] = w_n_v + alpha*((-(1-sigmoid(-np.dot(w_v.T,w_n_v)))*w_v)-my_lambda*w_n_v)      
                objective += np.log((sigmoid(-np.dot(w_v.T,w_n_v))))
     
        alpha = initial_alpha - ((initial_alpha - min_alpha) * t / n_iters)
        print("alpha: ",alpha)
        print("Iteration: ", t)
        print("Objective: ", objective)
    print("dot product of words 'friend' and 'fellow': ",np.dot(word_vectors['friend'].T, word_vectors['fellow']))

    return word_vectors
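
To see what one pass through the inner loop does, the sketch below applies the same update rule to a single toy example with `my_lambda = 0`. It is a standalone illustration with made-up vectors, and `sgd_step` is not a function defined elsewhere in this notebook. After the step, the dot product of the positive pair should go up and that of the negative pair should go down.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(v_c, v_p, v_n, alpha=0.025):
    """One gradient-ascent step on log(sigma(v_c.v_p)) + log(sigma(-v_c.v_n)).

    Only the positive and negative vectors are moved, mirroring build_model.
    """
    new_v_p = v_p + alpha * (1.0 - sigmoid(np.dot(v_c, v_p))) * v_c
    new_v_n = v_n - alpha * (1.0 - sigmoid(-np.dot(v_c, v_n))) * v_c
    return new_v_p, new_v_n

rng = np.random.default_rng(1)
# Toy vectors drawn the same way build_model initializes them
v_c, v_p, v_n = (rng.random(30) - 0.5 for _ in range(3))
new_v_p, new_v_n = sgd_step(v_c, v_p, v_n)

print("positive pair:", np.dot(v_c, v_p), "->", np.dot(v_c, new_v_p))
print("negative pair:", np.dot(v_c, v_n), "->", np.dot(v_c, new_v_n))
```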

In [None]:
word_vectors = build_model(training_set, initial_alpha = 0.025, min_alpha = 0.0001, n_iters = 5, my_lambda = 0, vector_size = 30)
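
The learning rate in `build_model` decays linearly with the iteration index. As a quick illustration, the sketch below evaluates the same expression for the default arguments. The helper name `linear_alpha` is hypothetical.

```python
def linear_alpha(t, n_iters=5, initial_alpha=0.025, min_alpha=0.0001):
    """Same expression build_model evaluates at the end of iteration t."""
    return initial_alpha - (initial_alpha - min_alpha) * t / n_iters

for t in range(5):
    print(t, round(linear_alpha(t), 5))
```

Because `t` stops at `n_iters - 1`, the schedule ends above `min_alpha` (about 0.00508 with these defaults). In `build_model`, the value computed at the end of iteration `t` is the one used during iteration `t + 1`.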

In [54]:
print("dot product of words 'friend' and 'fellow': ",np.dot(word_vectors['friend'].T, word_vectors['fellow']))


dot product of words 'friend' and 'fellow':  [[1.37252457]]


In [55]:
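def most_similar(word, word_vectors):
    # Rank every word by its dot product with `word`. PriorityQueue returns the
    # smallest item first, so the score is negated to pop the best match first.
    pq = PriorityQueue()
    for w in word_vectors.keys():
        pq.put((-np.dot(word_vectors[word].T, word_vectors[w]), w))
    return pq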

In [56]:
pq = most_similar('book', word_vectors)

In [57]:
for i in range(10):
    print(pq.get())

(array([[-3.50534642]]), 'book')
(array([[-2.1377247]]), 'poems')
(array([[-2.08116735]]), 'history')
(array([[-2.04876227]]), 'jazz')
(array([[-2.04088275]]), 'fiction')
(array([[-1.98213863]]), 'clergymen')
(array([[-1.95920136]]), 'interesting')
(array([[-1.95125897]]), 'jr')
(array([[-1.93972289]]), 'republics')
(array([[-1.93425389]]), 'archaeology')


In [58]:
pq = most_similar('eight', word_vectors)
for i in range(10):
    print(pq.get())

(array([[-2.35632696]]), 'ten')
(array([[-2.12707964]]), 'eight')
(array([[-2.09725713]]), 'five')
(array([[-2.05264434]]), 'minutes')
(array([[-2.04031576]]), 'twenty')
(array([[-2.00953036]]), 'fifty')
(array([[-1.99019486]]), 'hundred')
(array([[-1.97587303]]), 'forty')
(array([[-1.93774473]]), 'mile')
(array([[-1.86622555]]), 'four')
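
The rankings above use the raw dot product, which also rewards vectors with a large norm. That is how 'ten' can score higher against 'eight' than 'eight' scores against itself. A possible variant, sketched here under the assumption that `word_vectors` maps words to NumPy arrays as returned by `build_model`, ranks by cosine similarity instead. The name `most_similar_cosine` and the toy vectors are illustrative.

```python
import heapq
import numpy as np

def most_similar_cosine(word, word_vectors, topn=10):
    """Return the topn (cosine, word) pairs most similar to `word`, excluding itself."""
    query = np.ravel(word_vectors[word])
    query = query / np.linalg.norm(query)
    scores = []
    for w, vec in word_vectors.items():
        if w == word:
            continue
        v = np.ravel(vec)
        scores.append((float(np.dot(query, v) / np.linalg.norm(v)), w))
    return heapq.nlargest(topn, scores)

# Toy column vectors with the same (n, 1) shape that build_model uses
toy_vectors = {
    "eight": np.array([[1.0], [1.0]]),
    "ten": np.array([[3.0], [3.5]]),
    "mile": np.array([[2.0], [-1.0]]),
}
print(most_similar_cosine("eight", toy_vectors, topn=2))
```

With normalized scores, no word can score above 1, and a word compared with itself would score exactly 1.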


### Question
- In what ways is the word2vec method better than building a co-occurrence matrix?
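
To make the comparison concrete, a count-based co-occurrence table can be sketched in a few lines. This is only an illustration of the alternative the question refers to; the helper `cooccurrence_counts` and the toy sentences are made up.

```python
from collections import Counter

def cooccurrence_counts(sents, window=1):
    """Count how often each ordered word pair occurs within `window` positions."""
    counts = Counter()
    for sent in sents:
        for j, w in enumerate(sent):
            for k in range(max(j - window, 0), min(j + window + 1, len(sent))):
                if k != j:
                    counts[(w, sent[k])] += 1
    return counts

toy_sents = [["the", "book", "is", "good"], ["the", "story", "is", "good"]]
counts = cooccurrence_counts(toy_sents, window=1)
vocab = sorted({w for s in toy_sents for w in s})

print(counts[("is", "good")])  # "is" and "good" are adjacent in both sentences
print(len(vocab) ** 2)         # number of cells in a dense vocabulary-by-vocabulary table
```

A dense version of this table has one cell per pair of vocabulary words, so its size grows quadratically with the vocabulary.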


### PyTorch Implementation