Creating sentence embeddings using transition matrices that take into account the ordering of the words in the sentence.

In the following simple example, we will use a vocabulary of nine words and implement the forward propagation part of a neural network that predicts the next word in a sentence (i.e., a sequence of words). A sentence embedding is built up one word at a time: the running embedding is multiplied by a transition matrix and the next word vector is added to it. The transition matrix is initialized as an identity matrix whose size equals the number of hidden neurons (which is the number of columns in the hidden layer weight matrix). The transition matrix is optimized via gradient descent while training the neural net.
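
To see why the transition matrix makes the embedding sensitive to word order, consider the following minimal sketch. The two-dimensional word vectors `a` and `b`, the matrix `T`, and the helper `embed` are made-up for illustration and are not part of the example below. With an identity matrix the embedding reduces to a plain sum, so the order is lost; with a non-identity matrix, reordering the words generally changes the result.

```python
import numpy as np

def embed(vectors, T):
    """Fold word vectors left to right: h = h @ T + v."""
    h = vectors[0]
    for v in vectors[1:]:
        h = h @ T + v
    return h

# hypothetical 2-d word vectors (illustrative values only)
a = np.array([[1.0, 0.0]])
b = np.array([[0.0, 1.0]])

identity = np.eye(2)
T = np.array([[0.0, 1.0],
              [-1.0, 0.0]])  # an arbitrary non-identity matrix

same_with_identity = np.allclose(embed([a, b], identity), embed([b, a], identity))
same_with_T = np.allclose(embed([a, b], T), embed([b, a], T))
print(same_with_identity, same_with_T)  # True False
```

This is also why the identity initialization starts out as a bag-of-words sum and only becomes order-aware once training moves the matrix away from the identity.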

In [3]:
import numpy as np

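def softmax(x):
    # subtract the row-wise max before exponentiating to avoid overflow;
    # this does not change the result
    temp = np.exp(x - np.max(x, axis=1, keepdims=True))
    s = np.sum(temp, axis=1, keepdims=True)
    return temp / s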

vocab_size = 9
hidden_neurons = 3

# hidden layer weights initialized to zero
W0 = np.zeros(shape=(vocab_size, hidden_neurons))

# output layer weights initialized to random values
W1 = np.random.randn(hidden_neurons, vocab_size) 

# word vectors for our vocabulary (i.e. rows of the hidden layer weights matrix corresponding to each word in the vocabulary)
word_vecs = {}
word_vecs['yankees'] = W0[0:1]
word_vecs['bears'] = W0[1:2]
word_vecs['braves'] = W0[2:3]
word_vecs['red'] = W0[3:4]
word_vecs['sox'] = W0[4:5]
word_vecs['lose'] = W0[5:6]
word_vecs['defeat'] = W0[6:7]
word_vecs['beat'] = W0[7:8]
word_vecs['tie'] = W0[8:9]

# transition matrix initialized to identity (one row and column per hidden neuron)
tmat = np.eye(hidden_neurons)
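
Each entry of `word_vecs` is a basic slice of `W0`, which NumPy returns as a view rather than a copy, so an in-place update to a word vector also updates the corresponding row of `W0`. A small sketch of that behavior, using a hypothetical 3x2 matrix `M` as a stand-in for `W0`:

```python
import numpy as np

M = np.zeros((3, 2))   # hypothetical stand-in for W0
row = M[1:2]           # basic slicing returns a view with shape (1, 2)

row -= 0.5             # in-place update writes through to M
print(np.shares_memory(row, M))  # True
print(M[1])                      # [-0.5 -0.5]

row = row - 0.5        # rebinding creates a new array; M is left unchanged
print(M[1])                      # [-0.5 -0.5]
```

The updates later in this example use `-=`, so they appear to rely on this write-through behavior.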


**Forward Propagation**

Given an input sequence of three words, the word_vector for the first word in the sequence is passed on by the first layer to the second layer, where it is multiplied by the transition matrix and added to the next word_vector in the sequence. The result is then passed on to the next layer, where we repeat this process of multiplying the layer input by the transition matrix and adding the next word_vector. After the last word_vector in the sequence is added, we pass the result on to the output layer, where it is multiplied by the output layer weights and passed through the softmax function to obtain the final prediction.

In [4]:
input_sequence = ['red', 'sox', 'defeat']
layer_0 = word_vecs[input_sequence[0]]
layer_1 = np.dot(layer_0, tmat) + word_vecs[input_sequence[1]]
layer_2 = np.dot(layer_1, tmat) + word_vecs[input_sequence[2]]
pred = softmax(np.dot(layer_2, W1))
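
The same recurrence extends to sequences of any length. The following is a minimal sketch under stated assumptions: the function name `forward`, the four-word toy vocabulary, and the random initial values are hypothetical and chosen only to keep the block self-contained.

```python
import numpy as np

def softmax(x):
    # subtract the row max for numerical stability; the result is unchanged
    z = np.exp(x - np.max(x, axis=1, keepdims=True))
    return z / np.sum(z, axis=1, keepdims=True)

def forward(sequence, word_vecs, tmat, W1):
    """Return (sentence embedding, next-word probabilities) for a word list."""
    h = word_vecs[sequence[0]]
    for word in sequence[1:]:
        # multiply the running embedding by tmat, then add the next word vector
        h = np.dot(h, tmat) + word_vecs[word]
    return h, softmax(np.dot(h, W1))

# hypothetical toy setup: 4-word vocabulary, 3 hidden neurons
rng = np.random.default_rng(0)
vocab = ['red', 'sox', 'defeat', 'yankees']
W0 = rng.normal(size=(len(vocab), 3))
word_vecs = {w: W0[i:i + 1] for i, w in enumerate(vocab)}
tmat = np.eye(3)
W1 = rng.normal(size=(3, len(vocab)))

embedding, pred = forward(['red', 'sox', 'defeat'], word_vecs, tmat, W1)
print(embedding.shape, pred.shape)
```

With `tmat` still equal to the identity, `embedding` is simply the sum of the three word vectors.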

**Backpropagation**

Now we will do the backpropagation and compute the gradients of the word_vectors and the transition matrix.

In [6]:
# the target word is 'yankees', which is the first word in our vocab
y = np.zeros(shape=(1,vocab_size))
y[0,0] = 1

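# squared error
E = (pred - y) * (pred - y)

# dE/dP: derivative of the squared error with respect to the softmax output
# (used directly as the output-layer gradient; the softmax Jacobian is not applied)
dP = 2 * (pred - y)
W1_grad = np.dot(layer_2.T, dP)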

# dE_dL2
dL2 = np.dot(dP, W1.T)

# dE_d(defeat)
d_defeat = dL2 * 1
W_defeat_grad = d_defeat

# dE_dL1
dL1 = np.dot(dL2, tmat.T) 
tmat_grad_2 = np.dot(layer_1.T, dL2)

# dE_d(sox)
d_sox = dL1 * 1
W_sox_grad = d_sox

# dE_dL0
dL0 = np.dot(dL1, tmat.T)
tmat_grad_1 = np.dot(layer_0.T, dL1)
W_red_grad = dL0 * 1
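
Note that `dP` is the derivative of the squared error with respect to the softmax output, and the code above uses it directly as the gradient at the output layer. A full chain-rule derivation would also pass through the softmax Jacobian. The following is a minimal sketch of that extra step, checked against a finite-difference estimate; the helper names `loss` and `grad_logits` and the random test values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x, axis=1, keepdims=True))
    return z / np.sum(z, axis=1, keepdims=True)

def loss(logits, y):
    """Sum of squared errors between softmax(logits) and the target y."""
    p = softmax(logits)
    return np.sum((p - y) ** 2)

def grad_logits(logits, y):
    """dE/dlogits for squared error on a softmax output."""
    p = softmax(logits)
    dP = 2 * (p - y)  # dE/dP
    # softmax Jacobian applied to dP: p_j * (dP_j - sum_i dP_i * p_i)
    return p * (dP - np.sum(dP * p, axis=1, keepdims=True))

rng = np.random.default_rng(0)
logits = rng.normal(size=(1, 9))  # hypothetical pre-activations
y = np.zeros((1, 9))
y[0, 0] = 1

analytic = grad_logits(logits, y)

# central finite differences as an independent check
eps = 1e-6
numeric = np.zeros_like(logits)
for j in range(logits.shape[1]):
    step = np.zeros_like(logits)
    step[0, j] = eps
    numeric[0, j] = (loss(logits + step, y) - loss(logits - step, y)) / (2 * eps)

gradients_match = np.allclose(analytic, numeric, atol=1e-6)
print(gradients_match)
```

In terms of the variables above, the result of `grad_logits` would take the place of `dP` when computing `W1_grad` and `dL2`.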




**Optimization**

Now update the weights and the transition matrix via gradient descent.

In [7]:
alpha = 0.01
W1 -= alpha * W1_grad
word_vecs['defeat'] -= alpha * W_defeat_grad
word_vecs['sox'] -= alpha * W_sox_grad
word_vecs['red'] -= alpha * W_red_grad
tmat -= alpha * (tmat_grad_2 + tmat_grad_1) 
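
The two transition-matrix gradients are summed because the same matrix is applied at two steps of the forward pass. As a sketch in the notation of the code, writing $T$ for `tmat`, $L_0$ and $L_1$ for `layer_0` and `layer_1`, and $\delta_2$, $\delta_1$ for `dL2` and `dL1` (symbols introduced here for illustration):

$$
\frac{\partial E}{\partial T} = L_1^{\top} \delta_2 + L_0^{\top} \delta_1,
\qquad
\delta_1 = \delta_2 \, T^{\top}
$$

The first term corresponds to `tmat_grad_2` and the second to `tmat_grad_1`.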


**Training Loop**

In [21]:
niters = 100

for iter in range(niters):
    
    # forward pass
    input_sequence = ['red', 'sox', 'defeat']
    layer_0 = word_vecs[input_sequence[0]]
    layer_1 = np.dot(layer_0, tmat) + word_vecs[input_sequence[1]]
    layer_2 = np.dot(layer_1, tmat) + word_vecs[input_sequence[2]]
    pred = softmax(np.dot(layer_2, W1))

    # backpropagation

    # the target word is 'yankees', which is the first word in our vocab
    y = np.zeros(shape=(1,vocab_size))
    y[0,0] = 1

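    # squared error
    error = (pred - y) * (pred - y)

    # dE/dP: derivative of the squared error with respect to the softmax output
    # (used directly as the output-layer gradient; the softmax Jacobian is not applied)
    dP = 2 * (pred - y)
    W1_grad = np.dot(layer_2.T, dP)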

    # dE_dL2
    dL2 = np.dot(dP, W1.T)

    # dE_d(defeat)
    d_defeat = dL2 * 1
    W_defeat_grad = d_defeat

    # dE_dL1
    dL1 = np.dot(dL2, tmat.T) 
    tmat_grad_2 = np.dot(layer_1.T, dL2)

    # dE_d(sox)
    d_sox = dL1 * 1
    W_sox_grad = d_sox

    # dE_dL0
    dL0 = np.dot(dL1, tmat.T)
    tmat_grad_1 = np.dot(layer_0.T, dL1)
    W_red_grad = dL0 * 1

    # optimization
    alpha = 0.01
    W1 -= alpha * W1_grad
    word_vecs['defeat'] -= alpha * W_defeat_grad
    word_vecs['sox'] -= alpha * W_sox_grad
    word_vecs['red'] -= alpha * W_red_grad
    tmat -= alpha * (tmat_grad_2 + tmat_grad_1) 

    print(f"Iteration# {iter}, Error: {np.sum(error)}, prediction: {pred[0,0]}")


Iteration# 0, Error: 4.2055422642825566e-05, prediction: 0.9951947824251023
Iteration# 1, Error: 4.1405383410469566e-05, prediction: 0.9952323056064151
Iteration# 2, Error: 4.076912216257302e-05, prediction: 0.9952693172827043
Iteration# 3, Error: 4.0146270125834764e-05, prediction: 0.9953058272856068
Iteration# 4, Error: 3.953647040234927e-05, prediction: 0.9953418452043689
Iteration# 5, Error: 3.893937752397905e-05, prediction: 0.9953773803931074
Iteration# 6, Error: 3.835465702575492e-05, prediction: 0.9954124419778183
Iteration# 7, Error: 3.778198503744368e-05, prediction: 0.9954470388631365
Iteration# 8, Error: 3.722104789235672e-05, prediction: 0.9954811797388643
Iteration# 9, Error: 3.667154175263774e-05, prediction: 0.995514873086272
Iteration# 10, Error: 3.613317225022991e-05, prediction: 0.9955481271841806
Iteration# 11, Error: 3.5605654142761783e-05, prediction: 0.9955809501148405
Iteration# 12, Error: 3.508871098373735e-05, prediction: 0.9956133497696024
Iteration# 13, Erro