# chapter 04. word2vec speed improvement

The CBOW model is a simple two-layer model, but it becomes slow to compute as the vocabulary grows.  
In this chapter we add two improvements.  
1. A new layer called "Embedding".
2. A new loss function based on "Negative Sampling".

## 4.1 word2vec improvement(1)
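The input is a one-hot vector. With a vocabulary of 1M words, every input vector has 1M elements, and multiplying it by the input weight matrix is wasteful.  
=> solved by "Embedding"  
The product of the hidden layer and the output-layer weights, followed by softmax, is too much calculation.  
=> solved by "Negative Sampling"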

### 4.1.1 Embedding layer
Multiplying a one-hot vector by a weight matrix just extracts one row of the matrix, so the multiplication can be replaced by a row lookup.  
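
The equivalence can be checked directly. The following is a small sketch, assuming a toy 7-word vocabulary and 3 hidden units; the names `one_hot`, `h_matmul`, and `h_lookup` are only illustrative.

```python
import numpy as np

# Weight matrix with 7 words (rows) and 3 hidden units (columns).
W = np.arange(21).reshape(7, 3)

# One-hot vector for word ID 2.
word_id = 2
one_hot = np.zeros(7, dtype=W.dtype)
one_hot[word_id] = 1

# Full matrix product: touches every row of W.
h_matmul = one_hot @ W

# Row lookup: reads a single row.
h_lookup = W[word_id]

print(h_matmul)  # [6 7 8]
print(h_lookup)  # [6 7 8]
```

Both give the same vector, but the lookup does not grow in cost with the vocabulary size.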

### 4.1.2 Embedding layer implementation

In [None]:
import numpy as np
W = np.arange(21).reshape(7,3)
W

In [None]:
W[2]

In [None]:
W[5]

In [None]:
idx = np.array([1, 0, 3, 0 ])
W[idx]
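
Note that `idx` above contains word ID 0 twice. This matters when gradients flow back through such a lookup: the gradient for row 0 has to collect both contributions. A minimal sketch of the difference between plain assignment and accumulation, assuming a made-up upstream gradient `dout` of all ones:

```python
import numpy as np

idx = np.array([1, 0, 3, 0])   # word ID 0 appears twice
dout = np.ones((4, 3))         # made-up upstream gradient, one row per lookup

# Plain fancy-index assignment: the second write to row 0 overwrites the first.
dW_assign = np.zeros((7, 3))
dW_assign[idx] = dout

# np.add.at accumulates, so row 0 receives both contributions.
dW_accum = np.zeros((7, 3))
np.add.at(dW_accum, idx, dout)

print(dW_assign[0])  # [1. 1. 1.]
print(dW_accum[0])   # [2. 2. 2.]
```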

In [None]:
# common/layers.py
class Embedding:
    def __init__(self, W):
        self.params = [W]
        self.grads = [np.zeros_like(W)]
        self.idx = None
        
    def forward(self, idx):
        W, = self.params
        self.idx = idx
        out = W[idx]
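        return out

    def backward(self, dout):
        # Needed by EmbeddingDot.backward below.
        dW, = self.grads
        dW[...] = 0
        # np.add.at accumulates when idx contains the same word ID more than once.
        np.add.at(dW, self.idx, dout)
        return None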

## 4.2 word2vec improvement(2)
The second improvement is "Negative Sampling".  

### 4.2.1 problem with calculation after hidden layer
With 1M words in the vocabulary, softmax needs a score and an exponential for every one of them, which is far too slow.

### 4.2.2 multi-class classification to binary classification
We need to turn the problem into a question that can be answered yes or no.  
Example: "If the context is 'you' and 'goodbye', is the target word 'say'?"  
This needs only one output.  
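
As a rough sketch of the idea, assuming random toy weights and a hypothetical word ID `say_id` for 'say': the multi-class version needs one score per vocabulary word, while the binary version needs a single dot product followed by sigmoid.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 7, 3

h = rng.standard_normal(hidden_size)                    # hidden vector from the context
W_out = rng.standard_normal((vocab_size, hidden_size))  # one output vector per word

say_id = 1  # hypothetical word ID for 'say'

# Multi-class: one score per vocabulary word (vocab_size dot products).
all_scores = W_out @ h

# Binary: a single dot product, squashed to a probability by sigmoid.
score = W_out[say_id] @ h
prob_yes = 1.0 / (1.0 + np.exp(-score))

print(all_scores.shape)  # (7,)
print(prob_yes)          # probability that the answer is "yes"
```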

### 4.2.3 Sigmoid, Cross Entropy Error.
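With output y and label t, the cross entropy error is L = -(t log y + (1 - t) log(1 - y)).  
If t = 1, L = -log y.  
If t = 0, L = -log(1 - y).  
The gradient backpropagated through Sigmoid and CEE together is exactly y - t! (p. 163)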
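
The y - t result can be checked numerically. This is only a sanity-check sketch with an arbitrary score `x = 0.3`; the helper names `sigmoid` and `loss` are defined here and are not the book's layer classes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(x, t):
    # Cross entropy error on top of sigmoid, for a single score x and label t.
    y = sigmoid(x)
    return -(t * np.log(y) + (1 - t) * np.log(1 - y))

x, eps = 0.3, 1e-5
for t in (0, 1):
    # Central difference approximation of dL/dx.
    numeric = (loss(x + eps, t) - loss(x - eps, t)) / (2 * eps)
    analytic = sigmoid(x) - t
    print(t, numeric, analytic)
```

For both labels the numerical gradient should agree with y - t up to rounding error.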

### 4.2.4 implementation

In [None]:
# ch04/negative_sampling_layer.py

class EmbeddingDot:
    def __init__(self, W):
        self.embed = Embedding(W)
        self.params = self.embed.params
        self.grads = self.embed.grads
        self.cache = None
        
    def forward(self, h, idx):
        target_W  = self.embed.forward(idx)
        out = np.sum(target_W * h, axis = 1)
        
        self.cache = (h, target_W)
        return out
    
    def backward(self, dout):
        h, target_W = self.cache
        dout = dout.reshape(dout.shape[0], 1)  # (N,) -> (N, 1) so it broadcasts over the hidden dimension
        dtarget_W = dout * h
        self.embed.backward(dtarget_W)
        dh = dout * target_W
        return dh

Here idx is a NumPy array of word IDs. It is an array so that a whole mini-batch can be processed at once.  
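
The reshape in `backward` is there for broadcasting: `dout` holds one value per sample, and it has to scale each row of `h`. A small sketch with illustrative sizes (N = 3 samples, H = 4 hidden units), assuming the intended call is `dout.reshape(dout.shape[0], 1)`:

```python
import numpy as np

N, H = 3, 4                       # batch size and hidden size (illustrative values)
h = np.arange(N * H, dtype=float).reshape(N, H)
dout = np.array([1.0, 2.0, 3.0])  # one gradient value per sample, shape (N,)

# Multiplying shape (N,) by shape (N, H) directly raises an error when N != H,
# and scales the wrong axis when N == H.
dout_col = dout.reshape(dout.shape[0], 1)   # shape (N, 1)
dtarget_W = dout_col * h                    # broadcasts across the H columns

print(dtarget_W.shape)  # (3, 4)
print(dtarget_W[1])     # row 1 of h scaled by 2: 8, 10, 12, 14
```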

In [None]:
import numpy as np
W = np.array([[0, 1, 2],
              [3, 4, 5],
              [6, 7, 8],
              [9, 10, 11],
              [12, 13, 14], 
              [15, 16, 17],
              [18, 19, 20]])

idx = np.array([0, 3, 1])

h = np.array([[0, 1, 2],
            [3, 4, 5],
            [6, 7, 8]])

embed = Embedding(W)
print(embed)

target_W = embed.forward(idx)
print(target_W)

out = np.sum(target_W * h, axis = 1)
print(out)

### 4.2.5 Negative Sampling
OK, so now we can train the model on the correct target.  
But what about wrong target words? The model also has to learn to answer "no" for them.  
We do not want to train binary classification for EVERY wrong word, because that takes too much time.  
Instead, we SELECT a few wrong words as negative examples. That is Negative Sampling.
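
For a single training example the resulting loss is the sum of one positive term and a few negative terms. A hand-computed sketch, assuming made-up scores and only 2 negative samples:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up scores (dot products of h with output vectors) for one training example.
positive_score = 2.0                     # score for the correct target
negative_scores = np.array([-1.5, 0.3])  # scores for 2 sampled wrong words

# The correct word is trained toward label 1, the sampled wrong words toward label 0.
positive_loss = -np.log(sigmoid(positive_score))
negative_loss = -np.log(1.0 - sigmoid(negative_scores)).sum()

loss = positive_loss + negative_loss
print(loss)
```

The wrong word with score 0.3 contributes most of the loss here, because the model still gives it a probability above 0.5.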

### 4.2.6 Sampling method 
There is a better method than uniform random sampling.  
We compute word frequencies from the corpus and sample according to them, so frequent words are picked more often.  
The np.random.choice() method allows this.

In [None]:
import numpy as np
np.random.choice(10)

In [None]:
np.random.choice(10)

In [None]:
words = ['you', 'say', 'goodbye', 'I', 'hello', '.']
np.random.choice(words)

In [None]:
np.random.choice(words)

In [None]:
np.random.choice(words, size = 5) # 5 samples, duplicates allowed

In [None]:
np.random.choice(words, size = 5, replace = False) # 5 samples, no duplicates

In [None]:
p = [0.5, 0.1, 0.05, 0.2, 0.05, 0.1]
np.random.choice(words, p = p) # we can set probabilities.

That works, but word2vec suggests raising each probability to the power 0.75 and renormalizing:  
P'(w_i) = P(w_i) ** 0.75 / sum_j (P(w_j) ** 0.75).  
This slightly raises the probability of rare words so that they are not ignored.  
This is implemented in UnigramSampler.  
See ch04/negative_sampling_layer.py
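
The effect of the 0.75 power can be seen on the toy distribution from the cell above. This is only a sketch of the probability adjustment, not the UnigramSampler class itself; `new_p` is an illustrative name.

```python
import numpy as np

words = ['you', 'say', 'goodbye', 'I', 'hello', '.']
p = np.array([0.5, 0.1, 0.05, 0.2, 0.05, 0.1])

# Raise to the power 0.75, then renormalize so the values sum to 1 again.
new_p = np.power(p, 0.75)
new_p /= new_p.sum()

for word, before, after in zip(words, p, new_p):
    print(f"{word:8s} {before:.3f} -> {after:.3f}")

# Sampling with the adjusted distribution.
rng = np.random.default_rng(0)
sample = rng.choice(words, size=5, p=new_p)
print(sample)
```

The most frequent word loses some probability and the rarest words gain some, while the ordering of the words stays the same.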

### 4.2.7 Negative Sampling implementation

In [None]:
from negative_sampling_layer import UnigramSampler
from common.layers import Embedding, SigmoidWithLoss

# ch04/negative_sampling_layer.py
class NegativeSamplingLoss:
    def __init__(self, W, corpus, power = 0.75, sample_size = 5):
        self.sample_size = sample_size
        self.sampler = UnigramSampler(corpus, power, sample_size)
        self.loss_layer = [SigmoidWithLoss() for _ in range(sample_size + 1)]
        self.embed_dot_layers = [EmbeddingDot(W) for _ in range(sample_size + 1)]
        
        self.params, self.grads = [], []
        for layer in self.embed_dot_layers:
            self.params += layer.params
            self.grads += layer.grads
            
    def forward(self, h, target):
        batch_size = target.shape[0]
        negative_sample = self.sampler.get_negative_sample(target)
        
        #positive case forward
        score = self.embed_dot_layers[0].forward(h, target)
        correct_label = np.ones(batch_size, dtype=np.int32)
        loss = self.loss_layer[0].forward(score, correct_label)
        
        #negative case forward
        negative_label = np.zeros(batch_size, dtype=np.int32)
        for i in range(self.sample_size):
            negative_target = negative_sample[:, i]
            score = self.embed_dot_layers[1 + i].forward(h, negative_target)
            loss += self.loss_layer[1 + i].forward(score, negative_label)
            
        return loss
    
    def backward(self, dout = 1):
        dh = 0
        for l0, l1 in zip(self.loss_layer, self.embed_dot_layers):
            dscore = l0.backward(dout)
            dh += l1.backward(dscore)
            
        return dh
    

## 4.3 improved word2vec training

### 4.3.1 CBOW model implementation

The code is in ch04/cbow.py

In [None]:
import sys
sys.path.append('..')
from common.np import *  # import numpy as np
from common.layers import Embedding
from ch04.negative_sampling_layer import NegativeSamplingLoss

class CBOW:
    def __init__(self, vocab_size, hidden_size, window_size, corpus):
        V, H = vocab_size, hidden_size

        # Initialize the weights
        W_in = 0.01 * np.random.randn(V, H).astype('f')
        W_out = 0.01 * np.random.randn(V, H).astype('f')

        # Create the layers
        self.in_layers = []
        for i in range(2 * window_size):
            layer = Embedding(W_in)  # use the Embedding layer
            self.in_layers.append(layer)
        self.ns_loss = NegativeSamplingLoss(W_out, corpus, power=0.75, sample_size=5)

        # Collect all weights and gradients into lists.
        layers = self.in_layers + [self.ns_loss]
        self.params, self.grads = [], []
        for layer in layers:
            self.params += layer.params
            self.grads += layer.grads

        # Store the distributed word representations in an instance variable.
        self.word_vecs = W_in

    def forward(self, contexts, target):
        h = 0
        for i, layer in enumerate(self.in_layers):
            h += layer.forward(contexts[:, i])
        h *= 1 / len(self.in_layers)
        loss = self.ns_loss.forward(h, target)
        return loss

    def backward(self, dout=1):
        dout = self.ns_loss.backward(dout)
        dout *= 1 / len(self.in_layers)
        for layer in self.in_layers:
            layer.backward(dout)
        return None


### 4.3.2 CBOW training code

In [None]:
from ch04 import cbow, train, eval
# all this training takes... half a day.