# Word embedding

## One-hot vector

- Simple, and implies no ordering among words.
- Huge (one entry per vocabulary word) and encodes no meaning.
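
Both points can be seen on a toy vocabulary. The sketch below is illustrative only: the vocabulary and the helper name `one_hot` are made up for this example.

```python
import numpy as np

# Toy vocabulary, assumed for illustration only.
vocab = ["am", "because", "happy", "i", "learning"]
word2idx = {word: i for i, word in enumerate(vocab)}

def one_hot(word, word2idx):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(word2idx))
    vec[word2idx[word]] = 1.0
    return vec

happy = one_hot("happy", word2idx)
learning = one_hot("learning", word2idx)

print(happy)                    # [0. 0. 1. 0. 0.]
print(np.dot(happy, learning))  # 0.0
```

Each vector is as long as the vocabulary, and the dot product of any two different words is 0, so the representation carries no notion of similarity.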

## Word embedding

- Low-dimensional (far fewer entries than the vocabulary size).
- Can encode meaning: related words can end up with similar vectors.
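
To make "encoding meaning" concrete, here is a minimal sketch that compares words by cosine similarity. The 2-dimensional vectors are hand-picked for illustration and are not learned embeddings.

```python
import numpy as np

# Hand-picked 2-dimensional vectors, NOT learned embeddings; illustration only.
embeddings = {
    "happy": np.array([0.9, 0.8]),
    "glad": np.array([0.8, 0.9]),
    "sad": np.array([-0.9, -0.7]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 means same direction)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

sim_close = cosine_similarity(embeddings["happy"], embeddings["glad"])
sim_far = cosine_similarity(embeddings["happy"], embeddings["sad"])
print(sim_close > sim_far)  # True
```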

## Create word embedding

- Need
    - Corpus.
    - Embedding method.
- Self-supervised
    - Unsupervised in the sense that the input data (corpus) is unlabelled.
    - Supervised in the sense that the data itself provides the context from which the labels are built.
    
## Continuous bag of words model

- Predict a missing (center) word from the surrounding (context) words.

![2-4-1](images/natural-language-processing/2-4-1.png)
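
As a rough sketch of how training examples can be read off a corpus, the snippet below slides a window over a toy sentence and pairs each center word with its context. The function name `get_windows`, the sentence, and the half-window size `C` are assumptions made for this example.

```python
def get_windows(words, C):
    """Yield (context_words, center_word) pairs with C words on each side."""
    for i in range(C, len(words) - C):
        center = words[i]
        context = words[i - C:i] + words[i + 1:i + C + 1]
        yield context, center

# Toy corpus, assumed for illustration only.
corpus = "i am happy because i am learning".split()
pairs = list(get_windows(corpus, C=2))
for context, center in pairs:
    print(context, "->", center)
```

The center word plays the role of the label and the context words are the input, which is why no manual labelling is needed.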


## Architecture of CBOW model

![2-4-2](images/natural-language-processing/2-4-2.png)

![2-4-3](images/natural-language-processing/2-4-3.png)

![2-4-4](images/natural-language-processing/2-4-4.png)

![2-4-5](images/natural-language-processing/2-4-5.png)

## CBOW cost function

- $J = -\displaystyle\sum_{k=1}^{V}y_{k}\log\hat{y}_{k}$
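
Because $y$ is one-hot, only the term for the true word survives, so the cost reduces to minus the log of the probability assigned to that word. A small worked example, with a made-up prediction vector:

```python
import numpy as np

y = np.array([0.0, 0.0, 1.0, 0.0])      # one-hot target
y_hat = np.array([0.1, 0.2, 0.6, 0.1])  # made-up prediction, sums to 1

J = -np.sum(y * np.log(y_hat))
print(round(J, 4))  # 0.5108, i.e. -log(0.6)
```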

## CBOW forward prop

![2-4-5](images/natural-language-processing/2-4-5.png)

- $Z_{1} = W_{1}X + b_{1}$
- $H = ReLU(Z_{1})$
- $Z_{2} = W_{2}H + b_{2}$
- $\hat{Y} = softmax(Z_{2})$
- $J_{batch} = -\dfrac{1}{m}\displaystyle\sum_{i=1}^{m}\displaystyle\sum_{j=1}^{V}y_{j}^{(i)}\log\hat{y}_{j}^{(i)}$
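
The shapes involved are easier to follow on a tiny batch. The sketch below assumes toy sizes, random weights, and made-up context word indices. Each column of `X` is the average of the one-hot vectors of one context window.

```python
import numpy as np

V, N, m = 5, 3, 2  # toy vocabulary size, hidden size, batch size (assumed)
rng = np.random.default_rng(0)

# Build X: one column per example, averaging the one-hot context vectors.
context_indices = [[3, 0, 1, 3], [0, 2, 3, 0]]  # made-up word indices
X = np.zeros((V, m))
for col, indices in enumerate(context_indices):
    for idx in indices:
        X[idx, col] += 1.0 / len(indices)

W1, b1 = rng.random((N, V)), rng.random((N, 1))
W2, b2 = rng.random((V, N)), rng.random((V, 1))

Z1 = W1 @ X + b1       # (N, m); the bias broadcasts across the batch
H = np.maximum(0, Z1)  # ReLU
Z2 = W2 @ H + b2       # (V, m)
exp_Z2 = np.exp(Z2 - Z2.max(axis=0, keepdims=True))
Y_hat = exp_Z2 / exp_Z2.sum(axis=0, keepdims=True)  # column-wise softmax

print(H.shape, Y_hat.shape)  # (3, 2) (5, 2)
```

Every column of `Y_hat` is a probability distribution over the vocabulary for one example in the batch.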

## CBOW backward prop

- $W_{1} = W_{1} - \alpha\dfrac{\partial J_{batch}}{\partial W_{1}} = W_{1} - \alpha\dfrac{1}{m}ReLU(W_{2}^{T}(\hat{Y}-Y))X^{T}$
- $W_{2} = W_{2} - \alpha\dfrac{\partial J_{batch}}{\partial W_{2}} = W_{2} - \alpha\dfrac{1}{m}(\hat{Y}-Y)H^{T}$
- $b_{1} = b_{1} - \alpha\dfrac{\partial J_{batch}}{\partial b_{1}} = b_{1} - \alpha\dfrac{1}{m}ReLU(W_{2}^{T}(\hat{Y}-Y))1_{m}^{T}$
- $b_{2} = b_{2} - \alpha\dfrac{\partial J_{batch}}{\partial b_{2}} = b_{2} - \alpha\dfrac{1}{m}(\hat{Y}-Y)1_{m}^{T}$
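
One way to sanity-check a gradient expression is to compare it with finite differences. The sketch below does this for the $W_{2}$ gradient only, on toy sizes with random data (all values are assumptions made for the check, not part of the original notes).

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, m = 4, 3, 5  # toy sizes, assumed for illustration

H = rng.random((N, m))                        # hidden activations
W2 = rng.random((V, N))
b2 = rng.random((V, 1))
Y = np.eye(V)[:, rng.integers(0, V, size=m)]  # one-hot targets, shape (V, m)

def softmax(Z):
    E = np.exp(Z - Z.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def cost(W2_candidate):
    Y_hat = softmax(W2_candidate @ H + b2)
    return -np.sum(Y * np.log(Y_hat)) / m

# Analytic gradient: (1/m) (Y_hat - Y) H^T
analytic = (softmax(W2 @ H + b2) - Y) @ H.T / m

# Central finite differences, one entry of W2 at a time.
eps = 1e-6
numeric = np.zeros_like(W2)
for i in range(V):
    for j in range(N):
        step = np.zeros_like(W2)
        step[i, j] = eps
        numeric[i, j] = (cost(W2 + step) - cost(W2 - step)) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)
```

A very small `max_diff` indicates that the analytic expression and the numerical estimate agree.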

## Extract word embedding vectors

![2-4-7](images/natural-language-processing/2-4-7.png)

![2-4-8](images/natural-language-processing/2-4-8.png)

![2-4-9](images/natural-language-processing/2-4-9.png)

![2-4-10](images/natural-language-processing/2-4-10.png)

```python
import numpy as np


def initialize_model(N, V, random_seed=1):
    '''
    Inputs:
        N: dimension of hidden vector
        V: dimension of vocabulary
        random_seed: random seed for consistent results in the unit tests
    Outputs:
        W1, W2, b1, b2: initialized weights and biases
    '''
    np.random.seed(random_seed)

    # W1 has shape (N, V)
    W1 = np.random.rand(N, V)
    # W2 has shape (V, N)
    W2 = np.random.rand(V, N)
    # b1 has shape (N, 1)
    b1 = np.random.rand(N, 1)
    # b2 has shape (V, 1)
    b2 = np.random.rand(V, 1)

    return W1, W2, b1, b2
```

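```python
def softmax(z):
    '''
    Inputs:
        z: output scores from the output layer, shape (V, batch_size)
    Outputs:
        yhat: prediction (estimate of y), one distribution per column
    '''
    # Subtract the column-wise max before exponentiating to avoid overflow;
    # this does not change the result of the softmax.
    e_z = np.exp(z - np.max(z, axis=0, keepdims=True))
    yhat = e_z / np.sum(e_z, axis=0, keepdims=True)

    return yhat
```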

```python
def forward_prop(x, W1, W2, b1, b2):
    '''
    Inputs:
        x: average one-hot vector for the context
        W1, W2, b1, b2: matrices and biases to be learned
    Outputs:
        z: output score vector
        h: hidden vector (after ReLU)
    '''
    # Calculate h
    h = np.dot(W1, x) + b1

    # Apply the ReLU on h (store result in h)
    h = np.maximum(0, h)

    # Calculate z
    z = np.dot(W2, h) + b2

    return z, h
```

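```python
def compute_cost(y, yhat, batch_size):
    '''
    Inputs:
        y: target vectors (one-hot), shape (V, batch_size)
        yhat: predictions, shape (V, batch_size)
        batch_size: batch size
    Outputs:
        cost: average cross-entropy over the batch
    '''
    logprobs = np.multiply(np.log(yhat), y)
    cost = -1 / batch_size * np.sum(logprobs)
    cost = np.squeeze(cost)
    return cost
```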

```python
def back_prop(x, yhat, y, h, W1, W2, b1, b2, batch_size):
    '''
    Inputs:
        x: average one-hot vector for the context
        yhat: prediction (estimate of y)
        y: target vector
        h: hidden vector (output of the ReLU layer)
        W1, W2, b1, b2: matrices and biases
        batch_size: batch size
    Outputs:
        grad_W1, grad_W2, grad_b1, grad_b2: gradients of matrices and biases
    '''
    # Error at the output layer, shape (V, batch_size)
    delta2 = yhat - y
    # Error propagated back to the hidden layer. This follows the update
    # equations above, which apply ReLU to the propagated error; the exact
    # ReLU derivative would instead mask it with (h > 0).
    l1 = np.maximum(0, np.dot(W2.T, delta2))

    # Compute the gradient of W1
    grad_W1 = (1 / batch_size) * np.dot(l1, x.T)
    # Compute the gradient of W2
    grad_W2 = (1 / batch_size) * np.dot(delta2, h.T)
    # Compute the gradient of b1
    grad_b1 = (1 / batch_size) * np.sum(l1, axis=1, keepdims=True)
    # Compute the gradient of b2
    grad_b2 = (1 / batch_size) * np.sum(delta2, axis=1, keepdims=True)

    return grad_W1, grad_W2, grad_b1, grad_b2
```
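
The training loop that follows calls `get_batches`, which is not defined in these notes. The version below is a hypothetical stand-in, not the original helper: it assumes `data` is a list of tokens, that each batch is a pair of arrays of shape `(V, batch_size)`, and that the generator cycles over the data indefinitely so the caller decides when to stop.

```python
import numpy as np

def get_batches(data, word2Ind, V, C, batch_size):
    """Yield (x, y) batches of shape (V, batch_size), cycling over data."""
    if len(data) <= 2 * C:
        return  # not enough words to form a single window
    xs, ys = [], []
    while True:
        for i in range(C, len(data) - C):
            # Target: one-hot vector of the center word.
            y = np.zeros(V)
            y[word2Ind[data[i]]] = 1.0
            # Input: average of the one-hot vectors of the 2*C context words.
            context = data[i - C:i] + data[i + 1:i + C + 1]
            x = np.zeros(V)
            for word in context:
                x[word2Ind[word]] += 1.0 / len(context)
            xs.append(x)
            ys.append(y)
            if len(xs) == batch_size:
                yield np.array(xs).T, np.array(ys).T
                xs, ys = [], []

# Toy usage; the sentence and the vocabulary are assumed for illustration.
data = "i am happy because i am learning".split()
word2Ind = {w: i for i, w in enumerate(sorted(set(data)))}
V = len(word2Ind)
x, y = next(get_batches(data, word2Ind, V, C=2, batch_size=4))
print(x.shape, y.shape)  # (5, 4) (5, 4)
```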

```python
def gradient_descent(data, word2Ind, N, V, num_iters, alpha=0.03):
    '''
    This is the gradient_descent function

    Inputs:
        data:      text
        word2Ind:  words to indices
        N:         dimension of hidden vector
        V:         dimension of vocabulary
        num_iters: number of iterations
        alpha:     initial learning rate
    Outputs:
        W1, W2, b1, b2: updated matrices and biases
    '''
    W1, W2, b1, b2 = initialize_model(N, V, random_seed=282)
    batch_size = 128
    iters = 0
    C = 2  # number of context words on each side of the center word
    # get_batches is a helper that is not defined in these notes; it is
    # expected to yield (x, y) batches built from the corpus.
    for x, y in get_batches(data, word2Ind, V, C, batch_size):
        # Get z and h
        z, h = forward_prop(x, W1, W2, b1, b2)
        # Get yhat
        yhat = softmax(z)
        # Get cost
        cost = compute_cost(y, yhat, batch_size)
        if (iters + 1) % 10 == 0:
            print(f"iters: {iters + 1} cost: {cost:.6f}")
        # Get gradients
        grad_W1, grad_W2, grad_b1, grad_b2 = back_prop(x, yhat, y, h, W1, W2, b1, b2, batch_size)

        # Update weights and biases
        W1 = W1 - alpha * grad_W1
        W2 = W2 - alpha * grad_W2
        b1 = b1 - alpha * grad_b1
        b2 = b2 - alpha * grad_b2

        iters += 1
        if iters == num_iters:
            break
        # Decay the learning rate every 100 iterations
        if iters % 100 == 0:
            alpha *= 0.66

    return W1, W2, b1, b2
```
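
Once training has finished, the embeddings can be read off the weight matrices. The notes only show this step in figures, so the sketch below rests on an assumption: it uses one common choice, the average of the columns of `W1` and the rows of `W2`. The weights here are random stand-ins rather than trained values, and the vocabulary is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
N, V = 3, 5
# Random stand-ins for trained weights; in practice these would be the
# values returned by gradient_descent.
W1 = rng.random((N, V))
W2 = rng.random((V, N))
word2Ind = {"am": 0, "because": 1, "happy": 2, "i": 3, "learning": 4}

# Alternatives: W1.T alone (columns of W1) or W2 alone (rows of W2).
embs = (W1.T + W2) / 2.0  # shape (V, N), one row per word

happy_vec = embs[word2Ind["happy"]]
print(embs.shape, happy_vec.shape)  # (5, 3) (3,)
```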