## Softmax cost and gradient

In the code below, $D$ is the dimension of the hidden layer, i.e. the length of each word vector. The prediction is computed as

$$ \hat{y}_i = P(\textrm{word}_i \mid r, w) = \textrm{softmax}(w^T r)_i = \frac{\exp(w_i^T r)}{\sum_{j=1}^{|V|} \exp(w_j^T r)}$$
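
The softmax prediction exponentiates each score $w_j^T r$ and normalizes the results so they sum to one. Below is a minimal sketch of that computation; the function name `softmax_prediction` and the toy values of `r` and `W` are made up for illustration and are not part of the assignment code.

```python
import numpy as np

def softmax_prediction(r, W):
    """Return y_hat with y_hat[i] = exp(w_i^T r) / sum_j exp(w_j^T r).

    r: vector of length D; W: V x D matrix whose rows are the vectors w_i.
    """
    scores = W @ r                   # w_j^T r for every word j, shape (V,)
    scores = scores - scores.max()   # shift for numerical stability; the ratio is unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Toy example (hypothetical values): V = 3 words, D = 3 dimensions
r = np.array([1.0, 0.0, -1.0])
W = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 0.0],
              [3.0, 2.0, 1.0]])
y_hat = softmax_prediction(r, W)
print(y_hat, y_hat.sum())
```

Subtracting the maximum score before exponentiating does not change the result, because the common factor cancels between numerator and denominator, but it avoids overflow for large scores.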

The cost function is the cross entropy function,

$$J(y, \hat{y}) = -\sum_i y_i \log(\hat{y}_i)$$

where $y$ is the one-hot label vector, and $\hat{y}$ is the predicted probability vector for all classes.
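
The gradients used in the code can be derived from this cost in a few steps. The index $t$ for the target word is notation introduced here for the sketch. Since $y$ is one-hot, only the target term survives:

$$J = -\log \hat{y}_t = -w_t^T r + \log \sum_{j=1}^{|V|} \exp(w_j^T r)$$

Differentiating with respect to the predicted vector $r$ gives

$$\frac{\partial J}{\partial r} = -w_t + \sum_{j=1}^{|V|} \hat{y}_j w_j$$

and differentiating with respect to each output vector $w_j$ gives

$$\frac{\partial J}{\partial w_j} = (\hat{y}_j - y_j)\, r$$

These correspond to `gradPred` and to row $j$ of `grad` in the function below.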


In [None]:
import numpy as np

# softmax is assumed to be the function implemented earlier in this assignment

def softmaxCostAndGradient(predicted, target, outputVectors):
    """ Softmax cost function for word2vec models """
    ###################################################################
    # Implement the cost and gradients for one predicted word vector  #
    # and one target word vector as a building block for word2vec     #
    # models, assuming the softmax prediction function and cross      #
    # entropy loss.                                                   #
    # Inputs:                                                         #
    #   - predicted: numpy ndarray, predicted word vector (\hat{r} in #
    #           the written component)                                #
    #   - target: integer, the index of the target word               #
    #   - outputVectors: "output" vectors for all tokens              #
    # Outputs:                                                        #
    #   - cost: cross entropy cost for the softmax word prediction    #
    #   - gradPred: the gradient with respect to the predicted word   #
    #           vector                                                #
    #   - grad: the gradient with respect to all the other word       # 
    #           vectors                                               #
    # We will not provide starter code for this function, but feel    #
    # free to reference the code you previously wrote for this        #
    # assignment!                                                     #
    ###################################################################

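    D = len(predicted)  # dimension of the word vectors (3 in the example)
    # target: integer index of the target word
    # predicted: vector of length D
    # outputVectors: V x D, where V = |V| is the number of words

    r_W = np.dot(predicted, outputVectors.T)  # scores w_j^T r, shape (V,)
    y_hat = softmax(r_W)  # predicted probabilities over all words

    # cross entropy loss: with a one-hot label it reduces to -log(y_hat[target])
    cost = -np.log(y_hat[target])

    # gradient w.r.t. the predicted vector: -w_target + sum_j y_hat[j] * w_j
    gradPred = -outputVectors[target, :] + np.dot(y_hat, outputVectors)

    # gradient w.r.t. the output vectors: row j is (y_hat[j] - y[j]) * predicted
    grad = np.outer(y_hat, predicted)  # shape (V, D)
    grad[target, :] -= predicted

    return cost, gradPred, grad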
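
One way to gain confidence in the analytic gradients is to compare them with central finite differences. The sketch below is self-contained, so it defines its own `softmax` and a stand-in `cost_and_grads` that uses the same formulas as the function above. Those two names, `numerical_grad`, and the random test data are assumptions made for this example, not part of the assignment.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax for a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cost_and_grads(predicted, target, outputVectors):
    """Stand-in using the same formulas as softmaxCostAndGradient."""
    y_hat = softmax(outputVectors @ predicted)
    cost = -np.log(y_hat[target])
    gradPred = -outputVectors[target] + y_hat @ outputVectors
    grad = np.outer(y_hat, predicted)
    grad[target] -= predicted
    return cost, gradPred, grad

def numerical_grad(f, x, h=1e-5):
    """Central-difference estimate of the gradient of a scalar function f at x."""
    num = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + h
        f_plus = f(x)
        x[idx] = old - h
        f_minus = f(x)
        x[idx] = old  # restore the original entry
        num[idx] = (f_plus - f_minus) / (2 * h)
    return num

# Random test data (hypothetical sizes): V = 5 words, D = 3 dimensions
rng = np.random.default_rng(0)
V, D = 5, 3
predicted = rng.standard_normal(D)
outputVectors = rng.standard_normal((V, D))
target = 2

cost, gradPred, grad = cost_and_grads(predicted, target, outputVectors)
num_gradPred = numerical_grad(
    lambda r: cost_and_grads(r, target, outputVectors)[0], predicted.copy())
num_grad = numerical_grad(
    lambda W: cost_and_grads(predicted, target, W)[0], outputVectors.copy())

max_err_pred = np.max(np.abs(gradPred - num_gradPred))
max_err_out = np.max(np.abs(grad - num_grad))
print(max_err_pred, max_err_out)
```

If both printed errors are tiny relative to the gradient magnitudes, the analytic expressions agree with the numerical estimate. A large discrepancy usually points to a sign or indexing error in one of the gradient lines.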