**Tensor Neural Network Framework** 

In this framework, vectors and matrices are represened by a generalized `Tensor` class. A tensor object contains the data for the vector/matrix, a unique identifier and a list of Tensor operation methods. It also includes information pertaining to how the tensor was created, e.g. if it was created by a tensor operation from other tensors, then we would call it a `child` of those `parent tensors`. (So these Tensors can be considered to form the `nodes` of a `tree`-like hierarchical structure, with data being transmitted across the node edges during forward and backward propagation). Finally, the tensor object also contains a method for computing and backpropagating deriviatives, this feature can be turned on by setting the `autograd` property to `True`. The backpropagtion occurs recursively over all the ancestors of that Tensor and stops when a Tensor which does not have any parents is reached. Any given tensor will wait until it has recieved and accumulated the backpropagted derivatives from all it's children and then it will backpropagate it's gradient to it's parents.   

In addition to this Tensor class, we also create a base `Layer` class, and define a `Linear` Layer sub-class which represents a linear layer in a neural network, i.e. it has a matrix of weights and it takes a vector of input neurons and multiplies it to the weights matrix resulting in a vector of output neurons.

We also create sub-classes for `non-lineararity layers` which take a vector of input neurons and operates on this vector with a non-linear function such as `sigmoid` or `relu`. Similarly, we also have a `loss function layer` for computing error/loss for a given target and prediction.

In [None]:
import numpy as np

class Tensor(object):
    
    def __init__(self, data, creators=None, creation_op=None, autograd=False, id=None):
        self.data = np.array(data)
        self.creators = creators
        self.creation_op = creation_op
        self.grad = None
        self.autograd = autograd
        if(id == None):
            id = np.random.randint(0,100000)
        self.id = id
        self.children = {}
        if(creators is not None):
            for creator in creators:
                if self.id not in creator.children:
                    creator.children[self.id] = 1
                else:
                    creator.children[self.id] += 1    

    def backward(self, grad=None, grad_origin=None):
        if(self.autograd):
            if(grad_origin is not None):
                # if waiting to receive gradient, decrement counter
                if(self.children[grad_origin.id] != 0):
                    self.children[grad_origin.id] -= 1
                else:
                    raise Exception("Same child cannot backpropagate more than once!")

            # if this is the beginning of the backpropagtion chain
            if(grad is None):
                grad = Tensor(np.ones_like(self.data))

            # accumulate gradients from all the children 
            if(self.grad is None):
                self.grad = grad
            else:
                self.grad += grad    

            # backpropagate to creators if all gradients from children have been received or if gradients did not originate from another node
            if((self.creators is not None) and (self.received_grads_from_all_children() or (grad_origin is None))):
                if(self.creation_op == "add"):
                    new_grad = Tensor(self.grad.data)
                    self.creators[0].backward(new_grad, self)
                    self.creators[1].backward(new_grad, self)
                if(self.creation_op == "neg"):
                    new_grad = self.grad.__neg__()
                    self.creators[0].backward(new_grad, self)    
                if(self.creation_op == "sub"):
                    new_grad = Tensor(self.grad.data)
                    self.creators[0].backward(new_grad, self)
                    new_grad = self.grad.__neg__()
                    self.creators[1].backward(new_grad, self)    
                if(self.creation_op == "mul"):
                    new_grad = self.grad * self.creators[1]
                    self.creators[0].backward(new_grad, self)
                    new_grad = self.creators[0] * self.grad
                    self.creators[1].backward(new_grad, self)
                if(self.creation_op == "mm"):
                    new_grad = self.grad.mm(self.creators[1].transpose())
                    self.creators[0].backward(new_grad, self)
                    new_grad = (self.creators[0].transpose()).mm(self.grad)
                    self.creators[1].backward(new_grad, self)
                if(self.creation_op == "transpose"):
                    new_grad = self.grad.transpose()
                    self.creators[0].backward(new_grad, self)
                if(self.creation_op == "sigmoid"):
                    ones = Tensor(np.ones_like(self.grad.data))
                    # sigmoid derivative
                    new_grad = self.grad * (self * (ones - self))
                    self.creators[0].backward(new_grad, self)
                if(self.creation_op == "tanh"):
                    ones = Tensor(np.ones_like(self.grad.data))
                    # tanh derivative
                    new_grad = self.grad * (ones - self*self)
                    self.creators[0].backward(new_grad, self)
                if(self.creation_op == "relu"):
                    # relu derivative
                    new_grad = self.grad * (self.creators[0].data > 0)
                    self.creators[0].backward(new_grad, self)

                if("sum" in self.creation_op):
                    dim = int(self.creation_op.split("_")[1])
                    ds = self.creators[0].data.shape[dim]
                    self.creators[0].backward(self.grad.expand(dim,ds))
                if("expand" in self.creation_op):
                    dim = int(self.creation_op.split("_")[1])
                    self.creators[0].backward(self.grad.sum(dim))


    # check to see if this tensor has recieved gradients from all children, which is indicated by all children counts being zero
    def received_grads_from_all_children(self):
        for id,count in self.children.items():
            if (count != 0):
                return False
        return True     

    # Note: operations always return a new tensor object 

    # element-wise addition
    def __add__(self, other):
        # return a new tensor object containing the sum
        if(self.autograd and other.autograd):
            return Tensor(self.data + other.data, creators=[self,other], creation_op ="add", autograd=True)
        return Tensor(self.data + other.data)
    
    # element-wise negation
    def __neg__(self):
        # return a new tensor object containing the negation
        if(self.autograd):
            return Tensor(-1 * self.data, creators=[self], creation_op ="neg", autograd=True)
        return Tensor(-1 * self.data)

    # element-wise subtraction
    def __sub__(self, other):
        # return a new tensor object containing the subtraction
        if(self.autograd and other.autograd):
            return Tensor(self.data - other.data, creators=[self,other], creation_op ="sub", autograd=True)
        return Tensor(self.data - other.data)

    # element-wise multiplication
    def __mul__(self, other):
        # return a new tensor object containing the multiplication
        if(self.autograd and other.autograd):
            return Tensor(self.data * other.data, creators=[self,other], creation_op ="mul", autograd=True)
        return Tensor(self.data * other.data)
    
    # sum over all elements along given axis
    def sum(self, axis):
        # return a new tensor object containing the sum
        if(self.autograd):
            return Tensor(self.data.sum(axis), creators=[self], creation_op ="sum_"+str(axis), autograd=True)
        return Tensor(self.data.sum(axis))
    
    # expands the tensor along the given axis
    def expand(self, axis, copies):
        
        trans_cmd = list(range(0,len(self.data.shape)))
        trans_cmd.insert(axis, len(self.data.shape))
        
        new_shape = list(self.data.shape) + [copies]
        new_data = self.data.repeat(copies).reshape(new_shape)
        new_data = new_data.transpose(trans_cmd)
        
        if(self.autograd):
            return Tensor(new_data, autograd=True, creators=[self], creation_op="expand_"+str(axis))
        return Tensor(new_data)

    # transpose of matrix 
    def transpose(self):
        # return a new tensor object with the transposed tensor
        if(self.autograd):
            return Tensor(self.data.transpose(), creators=[self], creation_op ="transpose", autograd=True)
        return Tensor(self.data.transpose())

    # matrix multiplication
    def mm(self, other):
        # return a new tensor object containing the multiplication
        if(self.autograd and other.autograd):
            return Tensor(np.dot(self.data, other.data), creators=[self,other], creation_op ="mm", autograd=True)
        return Tensor(np.dot(self.data, other.data))

    def __str__(self):
        return str(self.data.__str__())
    
    def __repr__(self):
        return str(self.data.__repr__())

    # Non-linearity functions

    # sigmoid function
    def sigmoid(self):
        if(self.autograd):
            return Tensor(1.0 / (1.0 + np.exp(-self.data)), creators=[self], creation_op="sigmoid", autograd=True)
        return Tensor(1.0 / (1.0 + np.exp(self.data)))

    # tanh function
    def tanh(self):
        if(self.autograd):
            return Tensor(np.tanh(self.data), creators=[self], creation_op="sigmoid", autograd=True)
        return Tensor(np.tanh(self.data))
    
    # relu function
    def relu(self):
        if(self.autograd):
            return Tensor(self.data * (self.data > 0), creators=[self], creation_op="sigmoid", autograd=True)
        return Tensor(self.data * (self.data > 0))
    
# stochastic gradient descent optimizer    
class SGD_Optimizer(object):

    def __init__(self, parameters, alpha) -> None:
        self.parameters = parameters
        self.alpha = alpha    

    def zero(self):
        for p in self.parameters:
            p.grad.data *= 0

    def step(self, zero=True):
        for p in self.parameters:
            p.data -= self.alpha * p.grad.data

            if(zero):
                p.grad.data *= 0

# layer base class
class Layer(object):   
    def __init__(self) -> None:
        self.parameters = []

    def get_parameters(self):                     
        return self.parameters
    
# layer inherited classes
class Linear(Layer):
    def __init__(self, n_inputs, n_outputs) -> None:
        super().__init__()
        # initilize the weights
        W = np.random.randn(n_inputs, n_outputs) * np.sqrt(2.0/n_inputs)
        self.weight = Tensor(W, autograd=True)
        self.bias = Tensor(np.zeros(n_outputs), autograd=True)

        self.parameters.append(self.weight)
        self.parameters.append(self.bias)

    def forward(self, input):
        return input.mm(self.weight) + self.bias.expand(0,len(input.data))   

# a class for a senquence of layer, i.e. a neral network model
class Sequential(Layer):
    def __init__(self, layers = []) -> None:
        super().__init__()
        self.layers = layers

    def add(self, layer):
        self.layers.append(layer)

    def forward(self, input):
        for layer in self.layers:
            input = layer.forward(input)
        return input
    
    def get_parameters(self):
        params = []
        for layer in self.layers:
            params += layer.get_parameters()

        return params    
    
# means squared error loss function layer    
class MSELoss(Layer):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, pred, target):
        return ((pred-target) * (pred-target)).sum(0)

# nonlinearity layers
class Sigmoid(Layer):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, input):
        return input.sigmoid()

class Tanh(Layer):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, input):
        return input.tanh()

class Relu(Layer):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, input):
        return input.relu()



In [None]:
a = Tensor([1,2,3,4,5], autograd=True)
b = Tensor([2,2,2,2,2], autograd=True)
c = Tensor([3,3,3,3,3], autograd=True)
d = a + (-b)
e = (-b) + c
f = d + e

print(f"node(a), id: {a.id}, children: {a.children}, creators: {a.creators}")
print(f"node(b), id: {b.id}, children: {b.children}, creators: {b.creators}")
print(f"node(c), id: {c.id}, children: {c.children}, creators: {c.creators}")
print(f"node(d), id: {d.id}, children: {d.children}, creators: {d.creators}")
print(f"node(e), id: {e.id}, children: {e.children}, creators: {e.creators}")
print(f"node(f), id: {f.id}, children: {f.children}, creators: {f.creators}")

D = Tensor([1,1,1,1,1])
f.backward(grad = D)

print(f"f grad: {f.grad}")
print(f"e grad: {e.grad}")
print(f"d grad: {d.grad}")
print(f"c grad: {c.grad}")
print(f"b grad: {b.grad}")
print(f"a grad: {a.grad}")


Example 1: Using the tensor object and autograd to train a simple two layer linear network

In [None]:
np.random.seed(1)
input_data = Tensor(np.array([[0,0], [0,1], [1,0], [1,1]]), autograd=True)
target = Tensor(np.array([[0],[1],[0],[1]]), autograd=True) 

input_neurons = input_data.data.shape[1]
hidden_neurons = 3
output_neurons = target.data.shape[1]

# initialize neural net layers
model = Sequential(layers=[Linear(input_neurons, hidden_neurons), Linear(hidden_neurons, output_neurons)])
loss_layer = MSELoss()

# initialize optimizer
optim = SGD_Optimizer(parameters=model.get_parameters(), alpha = 0.05) 

# training iterations
niters = 10
for iter in range(niters):

    # forward pass
    pred = model.forward(input_data)

    # compute loss
    loss = loss_layer.forward(pred, target)

    # backpropagation
    loss.backward()

    # optimization of weights
    optim.step()

    print(f"Iteration# {iter+1}, Loss: {loss}")


Example 2: Using the tensor object and autograd to train a network with non-linear layers

In [None]:
np.random.seed(1)
input_data = Tensor(np.array([[0,0], [0,1], [1,0], [1,1]]), autograd=True)
target = Tensor(np.array([[0],[1],[0],[1]]), autograd=True) 

input_neurons = input_data.data.shape[1]
hidden_neurons = 3
output_neurons = target.data.shape[1]

# initialize neural net layers
model = Sequential(layers=[Linear(input_neurons, hidden_neurons), Tanh(),Linear(hidden_neurons, output_neurons), Sigmoid()])
loss_layer = MSELoss()

# initialize optimizer
optim = SGD_Optimizer(parameters=model.get_parameters(), alpha = 1) 

# training iterations
niters = 10
for iter in range(niters):

    # forward pass
    pred = model.forward(input_data)

    # compute loss
    loss = loss_layer.forward(pred, target)

    # backpropagation
    loss.backward()

    # optimization of weights
    optim.step()

    print(f"Iteration# {iter+1}, Loss: {loss}")


**Adding support for language processing:**

Previously we had a `linear layer` which had a matrix of weights and forward propagatopn involved computing the vector-matrix multiplication of the inputs with the weights. We will now create a similar `embedding layer` for natural language processing. The `embedding layer` will also have a weights matrix, in this case each row of the matrix will correspond to an embedding for a word from the vocabulary, and the number of rows should be set equal to the total number of words in the vocabulary. The number of columns on the other hand will be set equal to the desired number of hidden neurons.

During forward propagation, the input vector is going to be a list of word indices and the output will be specific rows (corresponding to the input word indices) selected from the weights matrix. To do this, we will add an `index_select` operation into our tensor object. During backpropagation, the gradients accociated with only those specific rows will be computed, and so a copy of the input word indices will be stored in the tensor containing the selected word rows and utilized during backpropagation.

In [85]:
class Tensor(object):
    
    def __init__(self, data, creators=None, creation_op=None, autograd=False, id=None):
        self.data = np.array(data)
        self.creators = creators
        self.creation_op = creation_op
        self.grad = None
        self.autograd = autograd
        if(id == None):
            id = np.random.randint(0,100000)
        self.id = id
        self.children = {}
        if(creators is not None):
            for creator in creators:
                if self.id not in creator.children:
                    creator.children[self.id] = 1
                else:
                    creator.children[self.id] += 1    

    def backward(self, grad=None, grad_origin=None):
        if(self.autograd):
            if(grad_origin is not None):
                # if waiting to receive gradient, decrement counter
                if(self.children[grad_origin.id] != 0):
                    self.children[grad_origin.id] -= 1
                else:
                    raise Exception("Same child cannot backpropagate more than once!")

            # if this is the beginning of the backpropagtion chain
            if(grad is None):
                grad = Tensor(np.ones_like(self.data))

            # accumulate gradients from all the children 
            if(self.grad is None):
                self.grad = grad
            else:
                self.grad += grad    

            # backpropagate to creators if all gradients from children have been received or if gradients did not originate from another node
            if((self.creators is not None) and (self.received_grads_from_all_children() or (grad_origin is None))):
                if(self.creation_op == "add"):
                    new_grad = Tensor(self.grad.data)
                    self.creators[0].backward(new_grad, self)
                    self.creators[1].backward(new_grad, self)
                if(self.creation_op == "neg"):
                    new_grad = self.grad.__neg__()
                    self.creators[0].backward(new_grad, self)    
                if(self.creation_op == "sub"):
                    new_grad = Tensor(self.grad.data)
                    self.creators[0].backward(new_grad, self)
                    new_grad = self.grad.__neg__()
                    self.creators[1].backward(new_grad, self)    
                if(self.creation_op == "mul"):
                    new_grad = self.grad * self.creators[1]
                    self.creators[0].backward(new_grad, self)
                    new_grad = self.creators[0] * self.grad
                    self.creators[1].backward(new_grad, self)
                if(self.creation_op == "mm"):
                    new_grad = self.grad.mm(self.creators[1].transpose())
                    self.creators[0].backward(new_grad, self)
                    new_grad = (self.creators[0].transpose()).mm(self.grad)
                    self.creators[1].backward(new_grad, self)
                if(self.creation_op == "transpose"):
                    new_grad = self.grad.transpose()
                    self.creators[0].backward(new_grad, self)
                if(self.creation_op == "sigmoid"):
                    ones = Tensor(np.ones_like(self.grad.data))
                    # sigmoid derivative
                    new_grad = self.grad * (self * (ones - self))
                    self.creators[0].backward(new_grad, self)
                if(self.creation_op == "tanh"):
                    ones = Tensor(np.ones_like(self.grad.data))
                    # tanh derivative
                    new_grad = self.grad * (ones - self*self)
                    self.creators[0].backward(new_grad, self)
                if(self.creation_op == "relu"):
                    # relu derivative
                    new_grad = self.grad * (self.creators[0].data > 0)
                    self.creators[0].backward(new_grad, self)
                if(self.creation_op == "cross_entropy"):
                    # cross entropy derivative
                    new_grad = Tensor(self.softmax_output - self.target_dist)
                    self.creators[0].backward(new_grad, self)
                if(self.creation_op == "index_select"):
                    # gradient of the weights matrix of word embeddings
                    new_grad = np.zeros_like(self.creators[0].data)
                    # we only add gradients to the specific rows corresponding to the selected words 
                    indices_ = self.index_select_indices.data.flatten() 
                    grad_ = self.grad.data.reshape(len(indices_), -1)
                    for i in range(len(indices_)):
                        new_grad[indices_[i]] += grad_[i]
                    self.creators[0].backward(Tensor(new_grad), self)       

                if("sum" in self.creation_op):
                    dim = int(self.creation_op.split("_")[1])
                    ds = self.creators[0].data.shape[dim]
                    self.creators[0].backward(self.grad.expand(dim,ds))
                if("expand" in self.creation_op):
                    dim = int(self.creation_op.split("_")[1])
                    self.creators[0].backward(self.grad.sum(dim))


    # check to see if this tensor has recieved gradients from all children, which is indicated by all children counts being zero
    def received_grads_from_all_children(self):
        for id,count in self.children.items():
            if (count != 0):
                return False
        return True     

    # Note: operations always return a new tensor object 

    # element-wise addition
    def __add__(self, other):
        # return a new tensor object containing the sum
        if(self.autograd and other.autograd):
            return Tensor(self.data + other.data, creators=[self,other], creation_op ="add", autograd=True)
        return Tensor(self.data + other.data)
    
    # element-wise negation
    def __neg__(self):
        # return a new tensor object containing the negation
        if(self.autograd):
            return Tensor(-1 * self.data, creators=[self], creation_op ="neg", autograd=True)
        return Tensor(-1 * self.data)

    # element-wise subtraction
    def __sub__(self, other):
        # return a new tensor object containing the subtraction
        if(self.autograd and other.autograd):
            return Tensor(self.data - other.data, creators=[self,other], creation_op ="sub", autograd=True)
        return Tensor(self.data - other.data)

    # element-wise multiplication
    def __mul__(self, other):
        # return a new tensor object containing the multiplication
        if(self.autograd and other.autograd):
            return Tensor(self.data * other.data, creators=[self,other], creation_op ="mul", autograd=True)
        return Tensor(self.data * other.data)
    
    # sum over all elements along given axis
    def sum(self, axis):
        # return a new tensor object containing the sum
        if(self.autograd):
            return Tensor(self.data.sum(axis), creators=[self], creation_op ="sum_"+str(axis), autograd=True)
        return Tensor(self.data.sum(axis))
    
    # expands the tensor along the given axis
    def expand(self, axis, copies):
        
        trans_cmd = list(range(0,len(self.data.shape)))
        trans_cmd.insert(axis, len(self.data.shape))
        
        new_shape = list(self.data.shape) + [copies]
        new_data = self.data.repeat(copies).reshape(new_shape)
        new_data = new_data.transpose(trans_cmd)
        
        if(self.autograd):
            return Tensor(new_data, autograd=True, creators=[self], creation_op="expand_"+str(axis))
        return Tensor(new_data)

    # transpose of matrix 
    def transpose(self):
        # return a new tensor object with the transposed tensor
        if(self.autograd):
            return Tensor(self.data.transpose(), creators=[self], creation_op ="transpose", autograd=True)
        return Tensor(self.data.transpose())

    # matrix multiplication
    def mm(self, other):
        # return a new tensor object containing the multiplication
        if(self.autograd and other.autograd):
            return Tensor(np.dot(self.data, other.data), creators=[self,other], creation_op ="mm", autograd=True)
        return Tensor(np.dot(self.data, other.data))

    def __str__(self):
        return str(self.data.__str__())
    
    def __repr__(self):
        return str(self.data.__repr__())

    # Non-linearity functions

    # sigmoid function
    def sigmoid(self):
        if(self.autograd):
            return Tensor(1.0 / (1.0 + np.exp(-self.data)), creators=[self], creation_op="sigmoid", autograd=True)
        return Tensor(1.0 / (1.0 + np.exp(-self.data)))

    # tanh function
    def tanh(self):
        if(self.autograd):
            return Tensor(np.tanh(self.data), creators=[self], creation_op="tanh", autograd=True)
        return Tensor(np.tanh(self.data))
    
    # relu function
    def relu(self):
        if(self.autograd):
            return Tensor(self.data * (self.data > 0), creators=[self], creation_op="relu", autograd=True)
        return Tensor(self.data * (self.data > 0))
    
    def cross_entropy(self, target_indices):

        ex = np.exp(self.data)
        softmax_output = ex/np.sum(ex, axis = len(self.data.shape)-1, keepdims = True) 
        
        t = target_indices.data.flatten()
        p = softmax_output.reshape(len(t), -1)
        target_dist = np.eye(p.shape[1])[t]
        loss = -(np.log(p) * (target_dist)).sum(1).mean()

        if(self.autograd):
            out = Tensor(loss, creators = [self], creation_op = "cross_entropy", autograd=True)
            out.softmax_output = softmax_output
            out.target_dist = target_dist
            return out 
        return Tensor(loss) 


    # word embedding operations (the input 'indices' are just word a vector of indices, i.e. specifix row numbers that are to be selected and returned)
    def index_select(self, indices):
        if(self.autograd):
            selected_rows =  Tensor(self.data[indices.data], creators=[self], creation_op="index_select", autograd=True)
            selected_rows.index_select_indices = indices 
            return selected_rows 
        return Tensor(self.data[indices.data])

# stochastic gradient descent optimizer    
class SGD_Optimizer(object):

    def __init__(self, parameters, alpha) -> None:
        self.parameters = parameters
        self.alpha = alpha    

    def zero(self):
        for p in self.parameters:
            p.grad.data *= 0

    def step(self, zero=True):
        for p in self.parameters:
            p.data -= self.alpha * p.grad.data

            if(zero):
                p.grad.data *= 0

# layer base class
class Layer(object):   
    def __init__(self) -> None:
        self.parameters = []

    def get_parameters(self):                     
        return self.parameters
    
# layer inherited classes
class Linear(Layer):
    def __init__(self, n_inputs, n_outputs) -> None:
        super().__init__()
        # initilize the weights
        W = np.random.randn(n_inputs, n_outputs) * np.sqrt(2.0/n_inputs)
        self.weight = Tensor(W, autograd=True)
        self.bias = Tensor(np.zeros(n_outputs), autograd=True)

        self.parameters.append(self.weight)
        self.parameters.append(self.bias)


    def forward(self, input):
        return input.mm(self.weight) + self.bias.expand(0,len(input.data))   

# embedding layer inherited class
class Embedding(Layer):
    def __init__(self, vocab_size, hidden_neurons) -> None:
        super().__init__()
        self.vocab_size = vocab_size
        self.hidden_neurons = hidden_neurons

        # initialize the weights matrix of word embeddings 
        weight = (np.random.rand(vocab_size, hidden_neurons)-0.5)/hidden_neurons
        self.weight = Tensor(weight, autograd=True)
        self.parameters.append(self.weight)   

    def forward(self, input):
        return self.weight.index_select(input)    
        

# a class for a senquence of layer, i.e. a neral network model
class Sequential(Layer):
    def __init__(self, layers = []) -> None:
        super().__init__()
        self.layers = layers

    def add(self, layer):
        self.layers.append(layer)

    def forward(self, input):
        for layer in self.layers:
            input = layer.forward(input)
        return input
    
    def get_parameters(self):
        params = []
        for layer in self.layers:
            params += layer.get_parameters()

        return params    
    
# means squared error loss function layer    
class MSELoss(Layer):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, pred, target):
        return ((pred-target) * (pred-target)).sum(0)

# cross entropy loss function layer    
class CrossEntropyLoss(Layer):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, input, target):
        return input.cross_entropy(target)


# nonlinearity layers
class Sigmoid(Layer):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, input):
        return input.sigmoid()

class Tanh(Layer):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, input):
        return input.tanh()

class Relu(Layer):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, input):
        return input.relu()


Example of word embedding forward pass and backprop in action

In [None]:
# initialize a weights matrix for a vocabulary of 5 words and 5 hidden neurons
w = Tensor(np.eye(5), autograd=True)
print("weights matrix:")
print(w)

# forward pass for an input containing two sentence vectors with three words each
input_indices = Tensor(np.array([[1,2,3], [2,3,4]]))
selected_rows = w.index_select(input_indices)
print("Selected rows:")
print(selected_rows)

# compute gradient of weights for the given input
selected_rows.backward()
print("weights gradient:")
print(w.grad)


Example of a training network with embedding layer

In [None]:
np.random.seed(1)
input_data = Tensor(np.array([1,2,1,2]), autograd=True)
target = Tensor(np.array([[0],[1],[0],[1]]), autograd=True) 
vocab_size = 5
hidden_neurons = 3
output_neurons = target.data.shape[1]

# initialize neural net layers
model = Sequential(layers=[Embedding(vocab_size, hidden_neurons),Tanh(),Linear(hidden_neurons, output_neurons), Sigmoid()])
loss_layer = MSELoss()

# initialize optimizer
optim = SGD_Optimizer(parameters=model.get_parameters(), alpha = 0.5) 

# training iterations
niters = 10
for iter in range(niters):

    # forward pass
    pred = model.forward(input_data)

    # compute loss
    loss = loss_layer.forward(pred, target)

    # backpropagation
    loss.backward()

    # optimization of weights
    optim.step()

    print(f"Iteration# {iter+1}, Loss: {loss}")

In [None]:
np.random.seed(1)

# input data indices
input_data = Tensor(np.array([1,2,1,2]), autograd=True)
# target indices
target = Tensor(np.array([0,1,0,1]), autograd=True) 

vocab_size = 3
hidden_neurons = 3
output_neurons = len(target.data)

# initialize neural net layers
model = Sequential(layers=[Embedding(vocab_size, hidden_neurons),Tanh(),Linear(hidden_neurons, output_neurons)])
loss_layer = CrossEntropyLoss()

# initialize optimizer
optim = SGD_Optimizer(parameters=model.get_parameters(), alpha = 0.1) 

# training iterations
niters = 10
for iter in range(niters):

    # forward pass
    pred = model.forward(input_data)

    # compute loss
    loss = loss_layer.forward(pred, target)

    # backpropagation
    loss.backward()

    # optimization of weights
    optim.step()

    print(f"Iteration# {iter+1}, Loss: {loss}")

**Creating a `Recurrent Layer` to handle sequenced inputs**

A recurrent layer is made up of several linear and non-linear sublayers, each of these is called an `RNN cell` so that the `recurrent neural network is a chain of these RNN cells`. Then given a `sequence of input vectors`, the first vector in the sequence is fed into the first RNN cell, and a vector called the `hidden state` is computed. This hidden state is simply the input vector multiplied by a weight matrix (`W_ih`) added to the vector obtained from multiplying another weight matrix (`W_hh`) to the hidden state computed in the previous RNN cell, and the combined result is passed through a non-linearity layer (containing an activation function). Then this hiddent state is multiplied by a final weight matrix (`W_ho`) to compute a prediction. So each vector in the input sequence is fed into it's corresponding RNN cell and a hidden state and prediction are computed in its RNN cell. The `hidden state` is the key component here that contains information about the ordering of the items in the input sequence. The first RNN cell requires a hidden state to be initialized so that constitutes as an extra set of parameters.    

For natural language processing, the inputs and prediction vectors are word embeddings, so the size of these vectors (i.e. number of input and output neurons) will be the size of the vocabulary. And we're free to choose any number of hidden neurons we want.

For the following examples, we will constructs models that can only handle input sequences of fixed size (i.e. 6 words per sentence).

In [86]:
class RNNcell(Layer):
    def __init__(self, input_neurons, hidden_neurons, output_neurons, activation = "sigmoid") -> None:
        super().__init__()
        self.input_neurons = input_neurons
        self.hidden_neurons = hidden_neurons
        self.output_neurons = output_neurons
        
        # initialize the nonlinearity layer
        if(activation == "sigmoid"):
            self.activation = Sigmoid()
        elif(activation == "tanh"):
            self.activation = Tanh()
        elif(activation == "relu"):
            self.activation = Relu()
        else:
            raise Exception("ERROR: Non-linearity function not found!")

        # initialize the wieghts
        self.w_ih = Linear(input_neurons, hidden_neurons)
        self.w_hh = Linear(hidden_neurons, hidden_neurons)
        self.w_ho = Linear(hidden_neurons, output_neurons)

        self.parameters += self.w_ih.get_parameters()
        self.parameters += self.w_hh.get_parameters()
        self.parameters += self.w_ho.get_parameters()

    def forward(self, input, prev_hidden):

        # compute hidden state for this RNN cell
        input_times_weight = self.w_ih.forward(input) 
        combined = input_times_weight + self.w_hh.forward(prev_hidden)   
        hidden = self.activation.forward(combined)
        #compute prediction
        pred = self.w_ho.forward(hidden)
       
        return pred, hidden
     
    def init_hidden(self, batch_size = 1):
        # initialize the hidden state
        return Tensor(np.zeros(shape=(batch_size, self.hidden_neurons)), autograd=True) 

        


Training an RNN with th Babi text dataset

In [87]:
# read training data from file
f = open('tasksv11/en/qa1_single-supporting-fact_train.txt', 'r')
raw = f.readlines()
f.close()

In [88]:
# tokenize the first 1000 senteneces (remove numbers, newline characters and punctuations)
tokens= []
for i, sentence in enumerate(raw[0:1000]):
    tokenized_sent = sentence.lower().replace("\n","").replace("\t","").replace("?","").replace(".","").split(" ")[1:] 
    if((i+1)%3 == 0):
        # get rid of number from the last word
        last_word = tokenized_sent[-1]
        tokenized_sent[-1] = "".join([char for char in last_word if not char.isnumeric()])
    # pad the sentence at the beginning with '-' characters to make it 6 words long
    padded_sent = ['-'] * (6 - len(tokenized_sent)) + tokenized_sent  
    tokens.append(padded_sent)

# create a vocabulary from the data
vocab = set()
for sentence in tokens:
    for word in sentence:
        vocab.add(word)
vocab = list(vocab)

# create a dictionary of vocab word indices
word_index = {}
for i, word in enumerate(vocab):
    word_index[word] = i    

In [89]:
# function for converting a list of words into a list of word indices
def words_to_indices(words):
    indices = [word_index[word] for word in words]
    return indices


In [90]:
# prepare the input data, i.e. list of word indices
indices = []
for sentence in tokens:
    indices.append(words_to_indices(sentence))

data = np.array(indices)    

In [None]:
niters = 1000
batch_size = 100
hidden_neurons = 16

# initialize the RNN layers
embed = Embedding(len(vocab), hidden_neurons)
# Note: since we're going to feed in outputs from the embedding layer into the RNN cell, the input neurons size needs to be equal to the the length of the embedding vectors, which is the hidden neurons size
model = RNNcell(hidden_neurons, hidden_neurons, len(vocab))
loss_layer = CrossEntropyLoss()
params = embed.get_parameters() + model.get_parameters() 
optim = SGD_Optimizer(params, alpha=0.05)


In [None]:

# train the network to predict the last word (i.e the 6th word) in every sentence in the input set
for iter in range(niters):
    
    total_loss = 0.0
    correct = 0

    # train in batches
    for j in range(int(len(data)/batch_size)):
    
        batch_lo = j * batch_size 
        batch_hi = min((j+1) * batch_size, len(data)) 
        batch = data[batch_lo:batch_hi]

        sent = []
        for ix in range(6):
            sent.append(vocab[batch[0,ix]])
        #if(iter == 9):
        #    print(f"Sentence: {sent}")

        # initilaize hidden state
        hidden = model.init_hidden(batch_size) 

        # forward pass through RNN cells (5 word input sequence so 5 RNN cell passes)
        for k in range(5):
            input = Tensor(batch[:, k], autograd=True)
            # create the word embedding from the input word
            rnn_input = embed.forward(input)
            # feed the word embedding into the RNN cell
            prediction, hidden = model.forward(rnn_input, hidden)
 
        # compute loss (i.e. compare predicted word from the last RNN cell to last word in the sentence)
        target = Tensor(batch[:, 5], autograd=True)
        loss = loss_layer.forward(prediction, target)
        total_loss += loss.data

    
        # compute prediction accuracy
        for ix in range(batch_size):
            correct += int(np.argmax(prediction.data[ix]) == target.data[ix])
        #    print(f"Actual word: {vocab[target.data[ix]]}, Predicted word: {vocab[np.argmax(prediction.data[ix])]} ")
            
        # backward pass
        loss.backward()
       
        # weights optimization
        optim.step()

    if((iter+1) % 20 == 0):
        print(f"Iteration# {iter+1}, Loss: {total_loss}, Accuracy: {float(correct)/(float(len(data)))}")


Now train the network to predict every next word in the sentence starting from the furst word

In [98]:
niters = 1000
batch_size = 100
hidden_neurons = 16

np.random.seed(1)

# initialize the RNN layers
embed = Embedding(len(vocab), hidden_neurons)
# Note: since we're going to feed in outputs from the embedding layer into the RNN cell, the input neurons size needs to be equal to the the length of the embedding vectors, which is the hidden neurons size
model = RNNcell(hidden_neurons, hidden_neurons, len(vocab), activation="tanh")

# initialize loss layers for predictions at each RNN cell
loss_layers = [CrossEntropyLoss()]*5

params = embed.get_parameters() + model.get_parameters() 
optim = SGD_Optimizer(params, alpha=0.001)


In [99]:

# train the network to predict the next word in the given input sequence
for iter in range(niters):
    
    total_loss = 0.0
    correct = 0
    incorrect = 0

    # train in batches
    for j in range(int(len(data)/batch_size)):
    
        batch_lo = j * batch_size 
        batch_hi = min((j+1) * batch_size, len(data)) 
        batch = data[batch_lo:batch_hi]

        # initilaize hidden state
        hidden = model.init_hidden(batch_size) 

        # forward pass through RNN cells (5 word input sequence so 5 RNN cell passes)
        for k in range(5):
            
            input = Tensor(batch[:, k], autograd=True)
            # create the word embedding from the input word
            rnn_input = embed.forward(input)
            # feed the word embedding into the RNN cell to predict the next word
            prediction, hidden = model.forward(rnn_input, hidden)
    
            # compute loss (i.e. compare predicted word from the last RNN cell to last word in the sentence)
            target = Tensor(batch[:, k+1], autograd=True)
            loss = loss_layers[k].forward(prediction, target)
            total_loss += loss.data
        
            # compute prediction accuracy
            for ix in range(batch_size):
                if(np.argmax(prediction.data[ix]) == target.data[ix]):
                    correct += 1
                else:
                    incorrect += 1    

            # backpropagate the loss gadients
            loss.backward()
        
        # weights optimization
        optim.step()

    if(iter%5 == 0):
        print(f"Iteration# {iter+1}, Loss: {total_loss}, Accuracy: {float(correct)/(float(correct + incorrect))}")


Iteration# 1, Loss: 128.2137931573217, Accuracy: 0.3456
Iteration# 6, Loss: 52.26564782456881, Accuracy: 0.597
Iteration# 11, Loss: 46.44082954570831, Accuracy: 0.5944
Iteration# 16, Loss: 45.080847575597055, Accuracy: 0.596
Iteration# 21, Loss: 44.51627970974205, Accuracy: 0.5952
Iteration# 26, Loss: 44.213536699158794, Accuracy: 0.5974
Iteration# 31, Loss: 44.025950382718484, Accuracy: 0.5994
Iteration# 36, Loss: 43.89834974625383, Accuracy: 0.599
Iteration# 41, Loss: 43.805697250392846, Accuracy: 0.599
Iteration# 46, Loss: 43.73513468971332, Accuracy: 0.598
Iteration# 51, Loss: 43.679456698705565, Accuracy: 0.5978
Iteration# 56, Loss: 43.63433077644276, Accuracy: 0.5986
Iteration# 61, Loss: 43.59697171101288, Accuracy: 0.599
Iteration# 66, Loss: 43.565478841166914, Accuracy: 0.5984
Iteration# 71, Loss: 43.53850097398656, Accuracy: 0.5982
Iteration# 76, Loss: 43.51505599501536, Accuracy: 0.5986
Iteration# 81, Loss: 43.49442135484167, Accuracy: 0.5986
Iteration# 86, Loss: 43.476061578