### Skip-Gram Word2Vec (with Negative Sampling)

The skip-gram word2vec algorithm is a simple self-supervised model for learning dense word embedding vectors from a corpus of text. It is trained on the task of predicting the `probability distribution` over `context words` given a `center word`, for all possible center words from the vocabulary. The parameters of this model consist of two separate $|V| \times D$ embedding matrices ($|V|$ is the vocabulary size and $D$ is the embedding dimension): matrix $W$, whose rows are the embeddings of context (outside) words, and matrix $C$, whose rows are the embeddings of center words. (We could instead use a single embedding for both center and context words, but keeping separate embeddings is more convenient.) The model for computing the probability distribution is then defined as follows:

$P(w|c) = \frac{\exp(\vec{w} \cdot \vec{c})}{\sum_{w' \in V} \exp(\vec{w}' \cdot \vec{c})}$

where $\vec{w}$ and $\vec{c}$ are the embedding vectors of the context and center words respectively. Note that this is just a softmax over the dot products of $\vec{c}$ with the embedding of every possible context word $w$, for a particular center word $c$.
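
As a rough illustration of the formula above, the following sketch computes $P(w|c)$ for one center word using small random matrices. The toy sizes and the names `W_toy`, `C_toy` and `context_distribution` are assumptions made for this example only and are not part of the model trained later.

```python
import numpy as np

rng = np.random.default_rng(0)
V_toy, D_toy = 6, 4                       # toy vocabulary size and embedding dim (assumed)
W_toy = rng.normal(size=(V_toy, D_toy))   # rows: context (outside) word embeddings
C_toy = rng.normal(size=(V_toy, D_toy))   # rows: center word embeddings

def context_distribution(c_idx, W, C):
    """P(w|c) for every w in the vocabulary: softmax of the dot products W @ c."""
    scores = W @ C[c_idx]              # shape: (V,)
    scores = scores - scores.max()     # shift for numerical stability; softmax is unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

p = context_distribution(2, W_toy, C_toy)
print(p.shape, p.sum())
```

Each call touches all $|V|$ rows of $W$, so the cost of one prediction grows linearly with the vocabulary size.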

Because the vocabulary size $|V|$ is large, computing this probability distribution is very inefficient (the sum in the denominator runs over the whole vocabulary). Instead of computing a probability distribution over all possible context words, we simplify the task into a `binary classification` problem. Given a pair of context word $w$ and center word $c$, our simplified task is to train a binary classifier to predict whether $w$ actually occurs in the context of $c$ or not (we use label $1$ for True and $0$ for False). We define this simple binary classification problem as follows:

$P(y=1|w_{pos},c) = \sigma (\vec{w}_{pos} \cdot \vec{c})$

$P(y=0|w_{neg},c) = 1-\sigma (\vec{w}_{neg} \cdot \vec{c}) = \sigma (-\vec{w}_{neg} \cdot \vec{c})$

where $\sigma()$ is the sigmoid function, $w_{pos}$ denotes a true context word and $w_{neg}$ denotes a `noise` word which is not a true context word. For training this classifier, we will use $k$ times as many noise words as context words (this reflects the fact that each center word has far fewer vocabulary words appearing in its context than words that don't). During training, we simply slide a context window of half-size $L$ over the training corpus, so each position gives us $2L$ positive pairs $\{(w^{i}_{pos},c) \mid i = 1, 2, \dots, 2L\}$. For each of these positive pairs, we generate $k$ negative samples by sampling from the unigram probability distribution over the vocabulary (making sure that none of these noise words matches the positive word). Then we compute the negative log-likelihood loss for each positive pair along with its $k$ negative pairs:

$L = -\log(P(y=1|w_{pos},c)) - \sum_{j=1}^k \log(P(y=0|w^{j}_{neg},c)) = -\log(\sigma (\vec{w}_{pos} \cdot \vec{c})) - \sum_{j=1}^k \log(\sigma (-\vec{w}^{j}_{neg} \cdot \vec{c}))$

We can think of each window position as providing a batch of $2L$ positive instances and $2kL$ negative instances (fewer near sentence boundaries). We then minimize this loss via gradient descent. The gradients for a single positive pair and its $k$ negative samples are:

$\frac{\partial L}{\partial \vec c} = (\sigma (\vec{w}_{pos} \cdot \vec{c})-1) \vec{w}_{pos} + \sum_{j=1}^k \sigma (\vec{w}^{j}_{neg} \cdot \vec{c}) \vec{w}^{j}_{neg}$

$\frac{\partial L}{\partial \vec{w}_{pos}} = (\sigma (\vec{w}_{pos} \cdot \vec{c})-1) \vec{c}$

$\frac{\partial L}{\partial \vec{w}^{j}_{neg}} = \sigma (\vec{w}^{j}_{neg} \cdot \vec{c}) \vec{c}$
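
One way to sanity-check the three gradient formulas above is to compare them with central finite differences on random vectors. The sketch below does this for a toy case. The helper names (`sgns_loss`, `sgns_grads`, `numerical_grad`) and the toy sizes are assumptions for illustration and are separate from the training code later in this notebook.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(c, w_pos, w_negs):
    """Negative-sampling loss for one positive pair and k noise words."""
    return -np.log(sigmoid(w_pos @ c)) - np.log(sigmoid(-(w_negs @ c))).sum()

def sgns_grads(c, w_pos, w_negs):
    """Analytic gradients, transcribed from the three formulas above."""
    s_pos = sigmoid(w_pos @ c)                        # scalar
    s_neg = sigmoid(w_negs @ c)                       # shape: (k,)
    grad_c = (s_pos - 1.0) * w_pos + s_neg @ w_negs   # shape: (D,)
    grad_w_pos = (s_pos - 1.0) * c                    # shape: (D,)
    grad_w_negs = np.outer(s_neg, c)                  # shape: (k, D)
    return grad_c, grad_w_pos, grad_w_negs

def numerical_grad(f, x, eps=1e-6):
    """Central finite differences of the scalar f() with respect to array x (perturbed in place)."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + eps
        f_plus = f()
        x[idx] = old - eps
        f_minus = f()
        x[idx] = old
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
D_toy, k_toy = 5, 3                      # toy sizes (assumed)
c = rng.normal(size=D_toy)
w_pos = rng.normal(size=D_toy)
w_negs = rng.normal(size=(k_toy, D_toy))

loss_fn = lambda: sgns_loss(c, w_pos, w_negs)
grad_c, grad_w_pos, grad_w_negs = sgns_grads(c, w_pos, w_negs)

err_c = np.abs(grad_c - numerical_grad(loss_fn, c)).max()
err_w_pos = np.abs(grad_w_pos - numerical_grad(loss_fn, w_pos)).max()
err_w_negs = np.abs(grad_w_negs - numerical_grad(loss_fn, w_negs)).max()
print(err_c, err_w_pos, err_w_negs)
```

If the formulas are transcribed correctly, all three maximum errors should be tiny compared with the gradient magnitudes.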


Note: When sampling the noise words, instead of using the unigram probability distribution $P(w)$ over the vocabulary words, it's better to use a weighted version of this distribution, $P_{\alpha}(w)$. The unigram distribution is

$P(w) = \frac{count(w)}{ \sum_{w'\in V} count(w')}$

and the weighted unigram distribution is defined as:

$P_{\alpha}(w) = \frac{count(w)^{\alpha}}{ \sum_{w'\in V} count(w')^{\alpha}}$

where $\alpha$ is an exponent between $0$ and $1$. This weighting slightly increases the probabilities of rarer words and slightly suppresses the probabilities of the most common words. Empirically, $\alpha = 0.75$ tends to work well.
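
To make the effect of the exponent concrete, here is a small sketch on four hypothetical word counts. The counts and the helper name `weighted_unigram` are made up for illustration.

```python
import numpy as np

# hypothetical counts, ordered from most to least frequent
counts = np.array([1000.0, 100.0, 10.0, 1.0])

def weighted_unigram(counts, alpha):
    """Normalize count**alpha into a probability distribution."""
    weights = counts ** alpha
    return weights / weights.sum()

p_unigram = weighted_unigram(counts, alpha=1.0)    # plain unigram distribution P(w)
p_weighted = weighted_unigram(counts, alpha=0.75)  # weighted distribution P_alpha(w)

print(np.round(p_unigram, 4))
print(np.round(p_weighted, 4))
```

With these counts, the most frequent word drops from a probability of about 0.90 to about 0.82, while the rarest word rises from roughly 0.0009 to roughly 0.0046.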



We will implement and train a skip-gram word2vec model using the Stanford Sentiment Treebank dataset (`datasetSentences.txt`).

In [30]:
from collections import defaultdict
import numpy as np

In [152]:
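# returns True if the token contains at least one letter;
# tokens without any letter (punctuation, bare numbers) return False
def check_punc(w):
    return any(c.isalpha() for c in w)

# apply lowercase folding, drop the leading sentence index, and remove tokens without letters
def preprocess(s):
    words = s.lower().strip().split()[1:]
    words = [w for w in words if check_punc(w)]
    return words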

# load dataset
word_count = 0
unigram_count = defaultdict(int)
with open('datasetSentences.txt','r') as file:
    lines = file.readlines()
    # lowercase folding and tokenize
    sentences = []
    for line in lines[1:]:  # skip the first line of the file
        words = preprocess(line)
        s = []
        for word in words:
            # slashes appear escaped as "\/" in the file; split such tokens into their parts
            # (tokens without "\/" pass through unchanged)
            for w in word.replace("\\/", " ").split():
                s.append(w)
                unigram_count[w] += 1
                word_count += 1
        sentences.append(s)

#### The skip-gram paper uses a subsampling strategy to get rid of the most frequent words, like stop words. Each word $w_i$ in the training corpus is discarded with a probability given by the following:

$P(w_i) = 1 - \sqrt{\frac{T}{count(w_i)}}$

where $T$ is a threshold value set to a small fraction of the corpus total token count $N$ (~$10^{-5} \times N$; the code below uses $10^{-4} \times N$) and $count(w_i)$ is the frequency of that word in the corpus. For more frequent words, the square root term is very close to zero, so the word gets discarded with high probability. For words with $count(w_i) \le T$ the expression is zero or negative, so it is clipped to zero and such words are always kept.

We also keep multiple copies of the same sentence to reduce the chances of entirely losing important words due to the subsampling.
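
As a quick worked example of the discard formula, the sketch below evaluates it for a few hypothetical word counts, using the total token count that this notebook reports for the corpus (201802) and a threshold fraction of $10^{-4}$. The function name `discard_prob` and the example counts are assumptions for illustration.

```python
import numpy as np

def discard_prob(count, total_tokens, t=1e-4):
    """Discard probability 1 - sqrt(T / count) with T = t * total_tokens, clipped at 0."""
    threshold = t * total_tokens
    return max(0.0, 1.0 - np.sqrt(threshold / count))

N = 201802  # total number of tokens reported for this corpus

# hypothetical word counts, from very frequent to rare
for count in (10000, 1000, 100, 20):
    print(count, round(discard_prob(count, N), 3))
```

Under these assumptions $T \approx 20.2$, so a word seen 10000 times is dropped roughly 95% of the time, while a word seen 20 times or fewer is never dropped.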

In [153]:
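# probability of discarding a word: 1 - sqrt(T / count), with T = t * word_count, clipped at 0
def subsample_prob(word, t=1e-4):
    p = max(0, 1 - np.sqrt(t*word_count/unigram_count[word]))
    return p

num_copies = 10
discard_probs = {w:subsample_prob(w) for w in unigram_count.keys()}
# keep each token with probability 1 - discard_probs[word], over num_copies copies of the corpus
sentences_subsampled = [[word for word in s if np.random.random() >= discard_probs[word]] for s in sentences*num_copies]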

In [154]:
# compare before and after subsampling
for i in range(7):
    print("Before subsampling: ", sentences[i])
    print("After subsampling: ", sentences_subsampled[i])

Before subsampling:  ['the', 'rock', 'is', 'destined', 'to', 'be', 'the', '21st', 'century', "'s", 'new', 'conan', 'and', 'that', 'he', "'s", 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'arnold', 'schwarzenegger', 'jean-claud', 'van', 'damme', 'or', 'steven', 'segal']
After subsampling:  ['destined', '21st', 'century', 'new', 'conan', 'splash', 'greater', 'arnold', 'schwarzenegger', 'jean-claud', 'van', 'damme', 'segal']
Before subsampling:  ['the', 'gorgeously', 'elaborate', 'continuation', 'of', 'the', 'lord', 'of', 'the', 'rings', 'trilogy', 'is', 'so', 'huge', 'that', 'a', 'column', 'of', 'words', 'can', 'not', 'adequately', 'describe', 'co-writer', 'director', 'peter', 'jackson', "'s", 'expanded', 'vision', 'of', 'j.r.r.', 'tolkien', "'s", 'middle-earth']
After subsampling:  ['gorgeously', 'elaborate', 'continuation', 'lord', 'rings', 'trilogy', 'is', 'huge', 'column', 'adequately', 'describe', 'co-writer', 'director', 'peter', 'expanded', 'vision', 'j.r.r.', 

Note that most of the stop words are gone after subsampling.

In [155]:
# create vocabulary
unk_token = "<UNK>"
vocab = [unk_token] + sorted(list(set([word for sentence in sentences for word in sentence])))
word2idx = {w:i for i,w in enumerate(vocab)}
vocab_size = len(vocab)

print(f"Vocab size: {len(vocab)}")
print(f"Num sentences: {len(sentences)}")
print(f"Total number of tokens: {word_count}")

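# tokenize the sentences
# NOTE: this maps the original `sentences`, so the training code never sees `sentences_subsampled`;
# iterate over `sentences_subsampled` here instead to train on the subsampled corpus
sentences_tokenized = []
for s in sentences:
    sentences_tokenized.append([word2idx[w] for w in s])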

Vocab size: 19333
Num sentences: 11855
Total number of tokens: 201802


In [156]:
# unigram weighted probability distribution
alpha = 0.75
P_alpha = np.zeros(shape=(vocab_size))
for i,w in enumerate(vocab):
    P_alpha[i] = unigram_count[w]**alpha
P_alpha = P_alpha / P_alpha.sum()    
unigram_idx = np.arange(0,vocab_size)

In [160]:
# hyperparameters of word2vec model
D = 32 # embedding dim
L = 8  # context window half-size

#### Instead of sliding the context window over every position from the start to the end of the corpus, we randomly select a batch of context windows on every epoch (here, one epoch is a single mini-batch update). We also add some randomness to the context window size by sampling the half-size uniformly from [1,L]. A context word at distance $d$ from the center is only included when the sampled half-size is at least $d$, so closer context words are sampled more often than distant ones. This is helpful because context words that are closer should be more strongly related to the center word than context words that are farther away.
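
The effect of the random window size can be checked with a short simulation. Assuming the half-size $R$ is drawn uniformly from $\{1, \dots, L\}$, a context word at distance $d$ is included with probability $P(R \ge d) = (L - d + 1)/L$. The names `L_max`, `n_draws` and the use of `np.random.default_rng` are choices made for this sketch only.

```python
import numpy as np

rng = np.random.default_rng(0)
L_max = 8          # maximum context half-size, same value as L above
n_draws = 100_000  # number of simulated windows (assumed)

# draw window half-sizes uniformly from {1, ..., L_max}; the upper bound is exclusive
R = rng.integers(1, L_max + 1, size=n_draws)

distances = np.arange(1, L_max + 1)
# fraction of simulated windows that reach each distance d
empirical = np.array([(R >= d).mean() for d in distances])
# P(R >= d) for a uniform half-size
expected = (L_max - distances + 1) / L_max

for d, e, x in zip(distances, empirical, expected):
    print(d, round(e, 3), round(x, 3))
```

Adjacent words ($d = 1$) are always included, while words at the maximum distance $d = L$ appear in only about one window out of $L$.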

In [161]:
def get_random_context(L):
    # first randomly select a sentence (randint's upper bound is exclusive)
    sent_idx = np.random.randint(0, len(sentences_tokenized))
    sent = sentences_tokenized[sent_idx]
    # a sentence with fewer than two tokens has no context words, so draw again
    if len(sent) < 2:
        return get_random_context(L)
    # pick random context window half-length between [1..L]
    R = np.random.randint(1, L + 1)
    # pick a random center word from anywhere in the sentence
    c_idx = np.random.randint(0, len(sent))

    center_word = sent[c_idx]
    context_words = sent[max(0,c_idx-R):c_idx] + sent[c_idx+1:c_idx+1+R]

    if len(context_words) == 0:
        return get_random_context(L)
    else:
        return center_word, context_words    


def get_negative_samples(wpos_idx, k=10):
    nsamples = 0
    wnegs = []
    # generate negative samples
    while nsamples < k:
        wneg_idx = np.random.choice(unigram_idx, size=1, p=P_alpha)[0]
        # make sure noise words don't match the positive word
        if wneg_idx != wpos_idx:
            wnegs.append(wneg_idx)
            nsamples += 1
    return wnegs    


def sigmoid(x):
    return 1/(1+np.exp(-x))

# skip-gram with negative sampling loss for a single (w,c) pair
def compute_loss_and_grads(wpos_idx, c_idx, W, C):
    # get negative samples
    wnegs_idx = get_negative_samples(wpos_idx)
    # get embedding vectors
    c = C[c_idx]   # shape: (D,)
    w_pos = W[wpos_idx]  # shape: (D,)
    w_negs = W[wnegs_idx]  # shape: (k,D)

    s_wpos_dot_c = sigmoid(np.dot(w_pos,c))  # scalar
    s_wneg_dot_c = sigmoid(np.dot(w_negs,c)).reshape(w_negs.shape[0],1)  # shape: (k,1)

    # compute loss
    loss =  -np.log(s_wpos_dot_c) - np.log(1-s_wneg_dot_c).sum()

    # compute gradients
    grad_c = (s_wpos_dot_c-1) * w_pos +  (s_wneg_dot_c * w_negs).sum(axis=0)  # shape: (D,)
    grad_wpos = (s_wpos_dot_c-1) * c  # shape: (D,)
    grad_wnegs = s_wneg_dot_c * c  # shape: (k,D)
    
    return loss, grad_c, grad_wpos, grad_wnegs, wnegs_idx


# compute total loss and accumulated gradients for a single context window
def skipgram(center_word_idx, context_words_idx, W, C):
    grad_C = np.zeros_like(C) 
    grad_W = np.zeros_like(W) 
    total_loss = 0.0

    # compute loss and accumulate gradients for each positive context word and negative samples
    for wpos_idx in context_words_idx:
        loss, grad_c, grad_wpos, grad_wnegs, wnegs_idx = compute_loss_and_grads(wpos_idx, center_word_idx, W, C) 
        total_loss += loss
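        grad_C[center_word_idx] += grad_c
        # np.add.at accumulates correctly when the same noise word is sampled more than once
        # (a plain fancy-indexed += applies only one update per repeated index)
        np.add.at(grad_W, wnegs_idx, grad_wnegs)
        grad_W[wpos_idx] += grad_wpos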

    return total_loss, grad_W, grad_C


# perform gradient descent update of parameters over a mini batch
# (the alpha argument here is the learning rate, not the unigram exponent defined earlier)
def train_step(W, C, L, batch_size, alpha):
    grad_C = np.zeros_like(C) 
    grad_W = np.zeros_like(W)
    total_loss = 0 
    for _ in range(batch_size):
        # get a random context window
        center_word_idx, context_words_idx = get_random_context(L)
        # compute loss and gradients for this window 
        loss, grad_W_window, grad_C_window = skipgram(center_word_idx, context_words_idx, W, C)
        # accumulate loss and grads
        total_loss += loss
        grad_W += grad_W_window
        grad_C += grad_C_window

    # average over mini-batch
    total_loss /= batch_size
    grad_W /= batch_size    
    grad_C /= batch_size    

    # perform sgd update of parameters
    W -= alpha * grad_W
    C -= alpha * grad_C

    return W, C, total_loss

# training loop
def train(W, C, L, num_epochs=10, batch_size=32, alpha=0.01, print_every=100):
    for epoch in range(num_epochs):
        W, C, loss = train_step(W, C, L, batch_size, alpha)
        if epoch%print_every==0:
            print(f"Epoch #{epoch}, Train Loss: {loss}")

    return W, C    

In [162]:
# parameters: embedding matrices
W = 0.001 * np.random.randn(vocab_size, D)
C = 0.001 * np.random.randn(vocab_size, D)

W, C = train(W, C, L, num_epochs=5000, batch_size=100, alpha=0.2, print_every=50)

Epoch #0, Train Loss: 46.66267452867085
