## Word2Vec - skipgram

### - Word2Vec 구현을 위해 아래의 링크를 참고

#####  - 논문

- __[Distributed Representations of Words and Phrases and their Compositionality, 2013](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)__

- __[Efficient Estimation of Word Representations in Vector Space, 2013](https://arxiv.org/abs/1301.3781)__

<img src="./images/skipgram.jpg" width="900px" />

### Word2Vec training
Word2vec represents each word $w$ in a vocabulary $V$ of size $T$ as a low-dimensional dense vector $v_w$ in an embedding space $\mathbb{R}^D$. It attempts to learn the continuous word vectors $v_w$, $\forall w \in V$ , from a training corpus such that the spatial distance between words then describes the similarity between words, e.g., the closer two words are in the embedding space, the more similar they are semantically and syntactically.  

The skipgram architecture tries to predict the context given a word. The problem of predicting context words is framed as a set of independent binary classification tasks. Then the goal is to independently predict the presence (or absence) of context words. For the word at position $t$ we consider all context words as positive examples and sample negatives at random from the dictionary. For a chosen context position $c$, using the binary logistic loss, we obtain the following negative log-likelihood:

$$ \log (1 + e^{-s(w_t, w_c)}) +  \sum_{n \in \mathcal{N}_{t,c}}^{}{\log (1 + e^{s(w_t, n)})}$$

where $w_t$ is a center word, $w_c$ is a context word, $\mathcal{N}_{t,c}$ is a set of negative examples sampled from the vocabulary. By denoting the logistic loss function $l : x \mapsto \log(1 + e^{-x})$, we can re-write the objective as:

$$ \sum_{t=1}^{T}{ \sum_{c \in C_t}^{}{ \big[ l(s(w_t, w_c))} + \sum_{n \in \mathcal{N}_{t,c}}^{}{l(-s(w_t, n))}   \big]} $$

where $s(w_t, w_c) = u_{w_t}^T v_{w_c}$


##### - git source

- __[Word2Vec with gluon](https://github.com/saurabh3949/Word2Vec-MXNet/blob/master/Word2vec%2Bwith%2BGluon.ipynb/)__
- __[Word2Vec with mxnet](https://github.com/apache/incubator-mxnet/tree/master/example/nce-loss)__
- __[Word2Vec with kears](http://adventuresinmachinelearning.com/word2vec-keras-tutorial/)__

The problem with using a full softmax output layer is that it is very computationally expensive. 
There’s another solution called negative sampling.  It is described in the original Word2Vec paper by Mikolov et al.  It works by reinforcing the strength of weights which link a target word to its context words, but rather than reducing the value of all those weights which aren’t in the context, it simply samples a small number of them – these are called the “negative samples”.

We’ll update the weights for the correct label, but only a small number of incorrect labels. This is called “negative sampling”.

To train the embedding layer using negative samples in Keras, we can re-imagine the way we train our network.  Instead of constructing our network so that the output layer is a multi-class softmax layer, we can change it into a simple binary classifier.  For words that are in the context of the target word, we want our network to output a 1, and for our negative samples, we want our network to output a 0. Therefore, the output layer of our Word2Vec Keras network is simply a single node with a sigmoid activation function.

<img src="./images/Negative-sampling-architecture_01.jpg" />

##### nce_loss VS sampled_softmax_loss

Sample softmax is all about selecting a sample of the given number and try to get the softmax loss. Here the main objective is to make the result of the sampled softmax equal to our true softmax. So algorithm basically concentrate lot on selecting the those samples from the given distribution. On other hand NCE loss is more of selecting noise samples and try to mimic the true softmax. It will take only one true class and a K noise classes.
###### https://stackoverflow.com/questions/42509878/what-is-the-difference-between-sampled-softmax-loss-and-nce-loss-in-tensorflow/43320139

## Setting

In [4]:
import mxnet as mx
from mxnet import nd
from mxnet import gluon, autograd, nd
from mxnet.gluon import Block, nn, utils

In [5]:
import os
import sys
from datetime import datetime

wrk_dir = '.'
data_dir = '{}/data/'.format(wrk_dir)
today = datetime.today()
save_dir = '{}/save/{}'.format(wrk_dir, today.strftime('%Y-%m-%d'))
conf_dir = '{}/conf'.format(wrk_dir)
os.makedirs(save_dir, exist_ok = True)
os.makedirs(conf_dir, exist_ok = True)

In [6]:
os.listdir(data_dir)

['Karabiner-Elements-App-Profiles',
 'test-neg.txt',
 'test-pos.txt',
 'text8.txt',
 'train-neg.txt',
 'train-pos.txt',
 'train-unsup.txt',
 'word2vec-sentiments-master']

In [7]:
max_sentence_length = 10000

In [8]:
from __future__ import print_function

import logging
from optparse import OptionParser

import mxnet as mx

In [9]:
max_sentence_length = 10000
WORD_DIM = 200
NEGATIVE_SAMPLES = 5
window_size = 5
ctx = mx.gpu()

In [10]:
BATCH_SIZE = 512

In [11]:
import math

In [16]:
def _load_data(name):
    buf = open(name).read()
    tks = buf.split(' ')
    vocab = {}
    freq = [0]
    data = []
    wid_to_word = ["NA"]
    for tk in tks:
        if len(tk) == 0:
            continue
        if tk not in vocab:
            vocab[tk] = len(vocab) + 1
            freq.append(0)
            wid_to_word.append(tk)
        wid = vocab[tk]
        data.append(wid)
        freq[wid] += 1
    negative = []
    for i, v in enumerate(freq):
        if i == 0 or v < 5:
            continue
        v = int(math.pow(v * 1.0, 0.75))
        negative += [i for _ in range(v)]
    return data, negative, vocab, freq, wid_to_word

In [17]:
class SimpleBatch(object):
    def __init__(self, data_names, data, label_names, label):
        self.data = data
        self.label = label
        self.data_names = data_names
        self.label_names = label_names

    @property
    def provide_data(self):
        return [(n, x.shape) for n, x in zip(self.data_names, self.data)]

    @property
    def provide_label(self):
        return [(n, x.shape) for n, x in zip(self.label_names, self.label)]

In [74]:
class DataIterWords_(mx.io.DataIter):
    def __init__(self, name, batch_size, num_label):
        super(DataIterWords_, self).__init__()
        self.batch_size = batch_size
        self.data, self.negative, self.vocab, self.freq, _ = _load_data(name)
        self.vocab_size = 1 + len(self.vocab)
        print("Vocabulary Size: {}".format(self.vocab_size))
        self.num_label = num_label
        self.provide_data = [('data', (batch_size, num_label - 1))]
        self.provide_label = [('label', (self.batch_size, num_label))]

    def sample_ne(self):
        return self.negative[random.randint(0, len(self.negative) - 1)]

    def __iter__(self):
        batch_data = []
        batch_label = []
        center_data = []
        start = random.randint(0, self.num_label - 1)
        for i in range(start, len(self.data) - self.num_label - start, self.num_label):
            context = self.data[i: i + self.num_label // 2] \
                      + self.data[i + 1 + self.num_label // 2: i + self.num_label]
            target_word = self.data[i + self.num_label // 2]
            if self.freq[target_word] < 5:
                continue
            ## w_t 
            target = [target_word]
            ## w_c + w_n
            context = context + [self.sample_ne() for _ in range(self.num_label - 2)]
            batch_data.append(context)
            batch_label.append(target)
            center_data.append([target_word])
            if len(batch_data) == self.batch_size:
                ## context
                data_all = [mx.nd.array(batch_data)]
                ## center
                label_all = [mx.nd.array(batch_label)]
                data_names = ['data']
                label_names = ['label']
                batch_data = []
                batch_label = []
                batch_label_weight = []
                yield SimpleBatch(data_names, data_all, label_names, label_all)

    def reset(self):
        pass

In [46]:
data, negative, vocab, freq, wid_to_word =  _load_data("{}/text8.txt".format(data_dir))

In [75]:
data_train = DataIterWords_("{}/text8.txt".format(data_dir), BATCH_SIZE, NEGATIVE_SAMPLES)

Vocabulary Size: 253855


In [76]:
VOCAB_SIZE = len(vocab) + 1

In [77]:
import random

In [78]:
import sys
for idx, i in enumerate(data_train):
    print ('center')
    print (i.label)
    print ('context')
    print (i.data)
    if idx == 0:
        sys.exit('')

4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4


SystemExit: 

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


In [64]:
all_batches = []
for idx, batch in enumerate(data_train):
    all_batches.append(batch)

In [65]:
def find_most_similar(word_to_index, index_to_word, all_vecs, word):
    ans = []
    if word not in word_to_index:
        print("Sorry word not found. Please try another one.")
    else:  
        i1 = word_to_index[word]
        prod = all_vecs.dot(all_vecs[i1])
        i2 = (-prod).argsort()[1:10]
        for i in i2:
            ans.extend([index_to_word[i]])
        return ans

In [66]:
def callback_(model, word_nm_list):
    from sklearn.preprocessing import normalize
    keys = model.collect_params().keys()
    all_vecs = model.collect_params()[list(keys)[0]].data().asnumpy()
    all_vecs = normalize(all_vecs, copy=False)
    
    # Keep only the top 50K most frequent embeddings
    top_50k = (-np.array(freq)).argsort()
    word_to_index = {}
    index_to_word = []
    for newid, word_id in enumerate(top_50k):
        index_to_word.append(wid_to_word[word_id])
        word_to_index[wid_to_word[word_id]] = newid
    
    for word_nm in word_nm_list:
        print (word_nm, ':', ', '.join(find_most_similar(word_to_index, index_to_word, all_vecs, word_nm)))

In [67]:
class Model(gluon.HybridBlock):
    def __init__(self, **kwargs):
        super(Model, self).__init__(**kwargs)
        with self.name_scope():
            
            # Embedding for input words with dimensions VOCAB_SIZE X WORD_DIM
            self.center = nn.Embedding(input_dim=VOCAB_SIZE,
                                       output_dim=WORD_DIM,
                                       weight_initializer=mx.initializer.Uniform(1.0/WORD_DIM))
            
            # Embedding for output words with dimensions VOCAB_SIZE X WORD_DIM
            self.target = nn.Embedding(input_dim=VOCAB_SIZE,
                                       output_dim=WORD_DIM,
                                       weight_initializer=mx.initializer.Zero())

    def hybrid_forward(self, F, center, targets, labels):
        """
        Returns the word2vec skipgram with negative sampling network.
        :param F: F is a function space that depends on the type of other inputs. If their type is NDArray, then F will be mxnet.nd otherwise it will be mxnet.sym
        :param center: A symbol/NDArray with dimensions (batch_size, 1). Contains the index of center word for each batch.
        :param targets: A symbol/NDArray with dimensions (batch_size, negative_samples + 1). Contains the indices of 1 target word and `n` negative samples (n=5 in this example)
        :param labels: A symbol/NDArray with dimensions (batch_size, negative_samples + 1). For 5 negative samples, the array for each batch is [1,0,0,0,0,0] i.e. label is 1 for target word and 0 for negative samples
        :return: Return a HybridBlock object
        """
        center_vector = self.center(center)
        target_vectors = self.target(targets)
        pred = F.broadcast_mul(center_vector, target_vectors)
        pred = F.sum(data = pred, axis = 2)
        sigmoid = F.sigmoid(pred)
        loss = F.sum(labels * F.log(sigmoid) + (1 - labels) * F.log(1 - sigmoid), axis=1)
        loss = loss * -1.0 / BATCH_SIZE
        loss_layer = F.MakeLoss(loss)
        return loss_layer

In [92]:
labels = nd.zeros((BATCH_SIZE, NEGATIVE_SAMPLES+2), ctx=ctx)
labels[:,0:4] = 1

In [93]:
np.shape(labels)

(512, 7)

In [94]:
model = Model()
model.collect_params().initialize(ctx=ctx)
model.hybridize() # Convert to a symbolic network for efficiency.
trainer = gluon.Trainer(model.collect_params(), 'SGD', {'learning_rate':1})

In [95]:
import logging
import os
import time
import numpy as np
logging.basicConfig(level=logging.INFO)

In [96]:
print(model)

Model(
  (center): Embedding(253855 -> 200, float32)
  (target): Embedding(253855 -> 200, float32)
)


In [None]:
start_time = time.time()
epochs = 10
for e in range(epochs):
    moving_loss = 0.
    for i, batch in enumerate(all_batches):
        #mx.nd.waitall()
        center_words = batch.data[0].as_in_context(ctx)
        target_words = batch.label[0].as_in_context(ctx)
        
        #if (i == 0):
        #    print (center_words)
        #    print (target_words)
        #    sys.exit('')

        with autograd.record():
            loss = model(center_words, target_words, labels)
        loss.backward()
        trainer.step(1)
        
        #  Keep a moving average of the losses
        if (i == 0) and (e == 0):
            moving_loss = loss.asnumpy().sum()
        else:
            moving_loss = .99 * moving_loss + .01 * loss.asnumpy().sum()
        if ((i + 1) % 5000 == 0):
            print("Epoch %s, batch %s. Moving avg of loss: %s" % (e, i, moving_loss))
            
        #if i > 15000:
        #    break
            
    callback_(model, ['january','woman'])
       
    print(e, " epochs took %s seconds" % (time.time() - start_time))

Epoch 0, batch 4999. Moving avg of loss: 4.2216905019
january : contains, nearly, angeles, this, speakers, yeltsin, determined, values, feminism
woman : prentice, removal, participate, monroe, iberian, relationships, atlanta, sphere, ultimately
0  epochs took 68.0850031375885 seconds
Epoch 1, batch 4999. Moving avg of loss: 4.11076397124
january : contains, nearly, determined, ending, angeles, this, dna, future, speakers
woman : concerned, interstellar, magnitude, acquire, integers, monroe, ess, kandahar, framed
1  epochs took 134.55370903015137 seconds
Epoch 2, batch 4999. Moving avg of loss: 4.03314016569
january : bill, small, nearly, case, future, contains, this, peter, d
woman : concerned, magnitude, interstellar, integers, sufficient, ess, airliner, recovery, kandahar
2  epochs took 200.85797762870789 seconds
Epoch 3, batch 4999. Moving avg of loss: 3.96486449085
january : case, bill, small, determined, d, peter, contains, economy, senate
woman : airliner, sufficient, concerned, 