In [4]:
#Some dependencies for making a neural network 
import numpy as np

# Section 0 

## Preliminaries, descriptions and questions

1. Goal of this particular neural network: Extract patterns from given observation inspired by human brain functioning. Note that inspiration does not equal representation, ie, neural network is not an exact replication of cognitive processing. Also, I like to view a neural network may as a possible strategy to form robust representations for AI. For me, it is a machine learning technique that can play a big role in AI. I have presented the below thoughts in context of AI. 

### Disclaimer: All thoughts below are my own interpretation and may not be an accurate representation of literature. 
2. Given observations are currently represented as vectors of information (matrices, vectors seem the most convenient and intuitive way). 
    1. In a 'supervised setting' each/some observation has a corresponding outcome/label. 
    2. Personally, I am intrigued by the process of feeding input data to an AI system. Is a streams of vectors (a matrix)the correct approach? What mechanism does the human brain use to feed in data through sensory agents? 
        1. Humans and living beings have two crucial mechanisms to obtain 'input data'- sensory agents and interactive agents. <b> Q: How does the environmental interaction and sensing play crucial roles in influencing cognitive processes?</b> 
        2. The environment surrounding the interaction/sense may also influence the cognitive process/output. The result too could be stored for recall later (memory). I am not aware of the exact mechanism of how memory is recalled or stored. <b> Again, this is an interesting area to explore. Should inputs be represented as a tuple of (input, environment, sensor_type, memory/recall, dependency on other inputs), or perhaps memory should be decoupled and part of processing unit as a current_state (hint of an FSM)? </b>
        3. A particular interesting example of how environment and memory influence cognitive processing is 'priming.' Semantic priming influences how one may process a stream of words differently depending on previous exposure (memory) or the current environment. Thus for text tasks, I feel representing and including information on the above would be beneficial. <b>Now, taking this example forward, how can we form a 'representation' given some data? This is what neural networks/ deep learning may be used to do. Different architectures can be postulated to form more accurate/useful representations. From what I've experienced, forming more robust architectures for particular domains/problems is the current trend and approach to creating intelligent and thoughtful machines. However, this may lead to domain specific architectures, but again perhaps all may be collaborated to form an AI system.<i>Personally, I find it to be a good practice to keep probing how the problem and model we are working on  contributes to intelligent machines. </i></b> Sometimes we can create great pattern learning models which have high accuracy in tasks but not understand why they work. This is a current research question in deep learning.
        
        
        
    3. Depending on task, we may represent input data (observations) differently:
        1. Example: Images: Represent them as pixel intensities
        2. Example 2: Documents: Global sense: Word vectors- represent each word by context features
        3. Example 3: ....

3. <b>A concise (not exhaustive) summary/listing of primary components of human brain processing</b> that I found useful (will add proper academic papers and verified theory later) : http://www.teach-nology.com/teachers/methods/info_processing/ Key points:
    1. Input through different stimuli and encoded storage in different modes- structural, phonemic, semantic 
    2. Transformation algorithms- bottom up processing and top down processing. THIS Is what we'll explore with neural networks. 
    3. Attention filter for signals and selection of relevant cognitive processes
    4. Short term memory: Electric signal loop through certain neurons; Long term memory: Protein structures
    5. Organization of knowledge in brain: Many postulated models
    6. Retrieval of memory/recall: retrieval cue, priming, distortions. <b>How are words formed for conveying a particular thought? </b>
    7. <b>How is emotion implemented without language, just senses?</b> 
    
   <b>Chollet: <i>The ability to convey intent and emotions through voice seems like a universal constant that predates language. Ever noticed how the tone you'd use to ask a question, or give an order, is the same in essentially every language?</i></b>
   

# Section 1.1: Vanilla Neural network construction

## The purpose of this notebook

This notebook explores part 2B of the above: How to create transformation algorithms that transform inputs into more useful representations for given objectives. 

Additionally, it provides a foundation/accustoms one to effectively adapt new biological/mathematical concepts into code. 

The notebook starts with a simple neural network model- Artificial Neural Network- and then explores variants of the same. It also covers necessary mathematical and comp sci concepts in the process.

## Input Data

In [5]:
#Input data- assume a matrix. Each row represents a particular vector which represents a set of observations.
#Assuming we process parallely and independently, we can process multiple input events together. This, we call a batch of inputs which will yield a batch of outputs. We will discuss non-independent batches later. 

x= np.array([[1,2,3],[4,5,6]])
y= np.array([[1,10],[2,4]]) # we want two ouputs for each set of observations
#Feel free to replace the above x_in and y with any dataset



## Our standard neural network aims to do the following (bold parts are crucial): 

1. Given set of inputs and outputs, find the best transformation of inputs (possibly and usually, multiple transformations) to obtain outputs. This it does through 'weighting each previous input then applying a non-linearity to the weighted input (other mechanisms- perhaps biological inspired activations, not necessarily 'non-linear', can be used)

2. In finding the best transformations, we are finding how to weight certain data. Given a lot of data, we are eventually extracting recurring patterns in the data. 

3. The best transformations are dependent on weights for each neuron (other variables can be included, but proper mechanisms need to be developed for how to modify the variables for best usage. Ex: backprop/gradient passing is used for finding best weights. However, it is disputed whether this is the best mechanism to adjust weights.)

4. In this particular ANN by using backprop/gradient passing we will try to find (not guaranteed to be optimum-mathematically or biologically) the best weights/variables to satisfy a given objective/cost function. There are sophisticated mathematical tweaks and improvements on passing of gradients and the optimization process.  Good overview: https://towardsdatascience.com/types-of-optimization-algorithms-used-in-neural-networks-and-ways-to-optimize-gradient-95ae5d39529f 

5. The cost function plays the role of a metric to see whether our model is outputting satisfactory values (DO 
HUMANS have cost functions?)

6. <b>New variabes can be added in this paradigm- and to adjust them, we need to derive and implement the gradient with respect to the cost function. </b>
> Ex: An example variable can be associated primes to a particular input.  

<b>Otherwise new activation/feedforward computation mechanisms, variable-adjusting mechanisms and objective functions- specific to the variable- need to be proposed and implemented. </b> 

>Ex: In an hebbian learner the variable is co-occuring neurons, objective is unknown, but mechanism is to increase weights/strength of connection whenever there is co-occurance.

7. <b>For every new variable, think of three things- 
    1. How does it impact forward value calculation.
    2. How should we adjust the variable (if needed) in light of correctness/objective function
    3. What objective function (if any) needs to be used</b>


## Making a standard neural network 

#### The below focuses on creating a neural network from a neuron level. The focus is not entirely on matrix transformations and layer wise transformations, but also on how particular neurons may act differently in the same computation step.
##### Hence, this is not meant to be a scalable/efficient network, but hopefully a playground to create slightly different neural network architectures.  

In [6]:
x = np.array([[4,3,2],[1,2,3]])

In [7]:
import numpy as np
import matplotlib.pyplot as plt 
np.random.seed(5)
##Auxiliary functions 

def visual_plot(x,y):
    plt.plot(x,y)
    plt.show()
    
def weighted_linear(x,W=1,der= False, visual=False):
    if(visual):
        visual_plot(np.arange(-10,10,0.1), weighted_linear(np.arange(-10,10,0.1),1))
    if(not der):
        return x #np.dot(x,np.transpose(W)) #actually it should just be x since we're assuming weights are computed in prior step
    else:
        return x/x #this should not be 1, think why

def linear(x, der = False): #incase of confusion
    return weighted_linear(x, W=1, der=der)

def sigmoid(z, der=False, visual = False):
    if(visual):
        visual_plot(np.arange(-10,10,0.1), sigmoid(np.arange(-10,10,0.1)))
    if(der):
        return der_sigmoid(z)
    return 1/(1+np.exp(-z))

def der_sigmoid(z, visual = False):
    if(visual):
        visual_plot(np.arange(-10,10,0.1), del_sigmoid(np.arange(-10,10,0.1)))
    return sigmoid(z)*(1-sigmoid(z))

def relu(z,visual = False):
    return np.array(map(lambda x: max(0,x)),z)
    
def tanh(z, der = False, visual = False):
    if(visual):
        visual_plot(np.arange(-10,10,0.1), tanh(np.arange(-10,10,0.1)))
    if(der):
        return der_tanh(z)
    return (np.exp(z)-np.exp(-z))/(np.exp(z)+np.exp(-z))

def der_tanh(z, visual = False):
    return 1- np.power(tanh(z),2)

'''
def mse(correct_y, predicted_y, der = False, visual = False):
    #calculates mse cost given predicted y vector and correct y vector for a single training example
    
    if(visual):
        visual_plot(np.arange(-5,10,0.1),np.power([5]-np.arange(-5,10,0.1),2)/2)
    if(not der): #mse is calculated as 1/2n* (squared difference)
        return np.mean(np.power(correct_y-predicted_y,2), axis=0)/(2) #average over rows
        return np.power(correct_y-predicted_y, 2)/2
    else:
        #derivative with respect to predicted_y
        return np.mean(predicted_y-correct_y, axis =0) # average
        return predicted_y - correct_y
    
'''

def l2_norm(x, vector_axis=1):
    """x is an input tensor
    output is l2 norm of input_shape_with_last_axis=1
    """
    return np.sqrt(np.sum(np.power(x,2), axis =vector_axis))
    
def mse(correct_y, predicted_y, der_and_avg= True, der = False, visual = False):
    '''
    mse(y,y') = (1/2m) * sum_across_examples(y-y')^2 where m is batch size/number of training examples 
    calculates mse cost given predicted y matrix and correct y' matrix for a given batch'''
    '''If inputs are of shape m*n then output is of tuple (m*n, 1*n, m*n)'''
    assert correct_y.shape == predicted_y.shape
    
    if(visual):
        visual_plot(np.arange(-5,10,0.1),np.power([5]-np.arange(-5,10,0.1),2)/2)
    
    diff = predicted_y - correct_y
    m = predicted_y.shape[0]
    
    if(der_and_avg):
        der_c_wrt_pred_y = diff/m ##m*n of derivatives 
        mse_individual = np.power(diff,2) 
        mse_avg_across_output_dim = np.average(mse_individual, axis =1)
        #mse_individual = l2_norm(diff) #m*1 mse l2 norm for each example
        mse_avg = np.mean(mse_avg_across_output_dim, axis =0)/2 #1*1 mse avg according to formula
        return mse_avg, der_c_wrt_pred_y 
    else:    
        if(not der): #just the avg cost
            mse_individual = np.power(diff,2)#l2_norm(diff)
            return np.mean(mse_individual)/2 # mse_individual  #THIS IS 1*n (averaged across batch) 
        #return np.sum(np.power(correct_y-predicted_y,2))/(correct_y.shape[0]*2)
        else:#just the cost derivate
            return diff/m  #np.mean(diff, axis =0) #THIS IS m*n, derivative of mse with respect to prediction


def softmax(z,der=False, visual=False):
    '''Return a more uniform (gibbs) distribution over inputs'''
    e_to_x = np.exp(z)
    e_to_x_sum = np.sum(e_to_x, axis = 1)
    softmaxed_out = e_to_x/e_to_x_sum
    if(not der):
        return softmaxed_out
    else:
        return softmaxed_out,  c           
    
def bitwise_cross_entropy(correct_y, predicted_y, der=False, visual= False):
    '''correct_y is of shape m*n where n = ceil(lg(num_outputs))
    Use this when neurons in output layer is bitwise (not one hot) encoded
    '''
    None
    
    
def cross_entropy(correct_y, predicted_y, der=False, visual = False): #assume correct_y is a scalar for each training example
    '''
    correct_y is a one hot encoded array of shape m*n 
    predicted_y is vector of output probabilies(softmax)
    if conditions are used in case a single dimensional vector/single point input is given
    '''
    if(visual):
        visual_plot(np.arange(0,1,0.01), -np.log(np.arange(0,1,0.01)))
        #visual_plot(np.arange(-10,10,0.1))
    
    if(correct_y.ndim>1):
        indices = np.argmax(correct_y,axis=1) #correct_y example [[[0,0,1,0],[0,0,0,1]]] given four classes and two training examples 
        predicted_prob_given_indices = map(lambda y,ind: y[ind], predicted_y, indices) #select prob values of predictions given the index
        num_inputs = correct_y.shape[0]
    else:
        #stochastic example
        indices = np.array(np.argmax(correct_y))
        predicted_prob_given_indices = predicted_y[indices]
        num_inputs = 1
    return np.sum(-np.log(predicted_prob_given_indices))/num_inputs

def hadamard_product(x,y): #element wise product
    return np.multiply(x,y)

activation_dict = {1:sigmoid, 2: tanh, 3: weighted_linear, 4: relu}

In [8]:
y = [[0,0.2,0.8,0],[0,0.1,0.2,0.7]]
ind = np.argmax([[0,0,1,0],[0,0,0,1]], axis=1)
#predicted_prob_given_indices = map(lambda y,ind: y[ind],list(zip(y, ind)))
map(lambda x,a: x[a], y,ind)

[0.8, 0.7]

In [9]:
#Remember the purpose of this neural network is to transform data into a better representation in a 'bottom up' approach. 
#A transformation is simply a function of x: f(x). While there exist many different forms/models of transformation functions- linear weights, polynomials, any function one thinks will help

#A neuron, however, is modelled based on its biological components. Here's a good summary: https://www.khanacademy.org/science/biology/human-biology/neuron-nervous-system/a/overview-of-neuron-structure-and-function. A typical neuron consists of the following:
#1) Neurons form a network- a top down processing network leads to later neurons forming more abstract representations
#2) Neurons have activation signals which influence other activation signals
#3) A neuron might weight each 'synaptic input' differently
#4) Each neuron represents a particular representation. 
#5) Remember in this architecture, each neuron represents a value between 0 and 1 (non-fuzzy) value. The first layer neurons correspond to the non-complex representations, which become more complex as they activate further neurons. 
#I have used an OOP approach since it would be interesting to model different type of neurons. However, for scalable implementations a bunch of layer wise matrices should be used.  
#6) We will explore adding a new variable as well
class neuron:
    
    def __init__(self, input_size,activation = sigmoid, lr= 0.1, belongs_to = None):
        self.shape = (input_size,1)
        self.weights = np.random.random(self.shape) - np.random.random(self.shape) #np.random.random(self.shape) #initialize random values  #np.zeros(self.shape)#
        self.bias = np.array([0])
        self.activation = activation
        self.learning_rate = lr
        self.surrounding_activations = None
        self.belongs_to = belongs_to
        self.neighbours = []
        #self.previous_layer_activations = []
        self.error_wrt_C = None
        self.historical_activations = None #Hebbian learner to mantain co-occuring high activations together-> if n1 and n2 co-occur a lot, then if n1 occurs add a bias term?
    def forward_pass(self,input_x, weighted_as_well = False): #weighted as well means just weighted, not activated
        '''Input x is of shape m*n where m is batch size and n is input dims'''
        weighted_input = np.dot(input_x, self.weights)+self.bias
        if(weighted_as_well): #this returns both activated and non activated output 
            return (self.activation(weighted_input),weighted_input, self.activation(weighted_input, der = True)) 
        else:
            return self.activation(weighted_input)
    def adjust_parameters(self):
        None
    def calc_error(self,correct_y, predicted_y):
        '''Calculates and returns error based on metric of neuron'''
        return None 
    def calc_gradient(self, y,x):
        '''Calculates gradient of y as function wrt x'''
        if(self.activation == sigmoid):
            return del_sigmoid()
        elif(y== calc_error):
            None
        return y(x+1e-10)-y(x)/1e-10 #taking limit to calculate gradient 
    def get_weights(self):
        return self.weights
    def get_bias(self):
        return self.bias
    def set_weights(self,weights):
        self.weights = weights
    def set_bias(self, bias):
        self.bias  = bias
    
    def set_error_wrt_C(self, error_wrt_C):
        sefl.error_wrt_C = error_wrt_C
    def give_relevant_info(self, info): #this function serves to receive relevant information, add any variables here
        None
    def add_neighbour(self,neigbhour):
        self.neighbours.append(neigbhour)
        
    #def set_previous_layer_activations(self, previous_layer_activations):
     #   self.previous_layer_activations = previous_layer_activations
        
    #def gradient_descent(self, )
    def update_parameters(self, previous_layer_activations):
        '''Previous layer activations  is of shape m*n
           Error_wrt_c is of shape m*1
        '''
        
        '''CURRENTLY a single error will be sent across input parameters (weights and bias)'''
        #1. BACKPROPOGATION TO UPDATE WEIGHTS AND BIAS WRT OVERALL COST 
        #assert(self.error_wrt_C)
       # print("del_C_wrt_del_z",self.error_wrt_C)
        avg_neuron_error = np.mean(self.error_wrt_C, axis =0)
        num_examples = self.error_wrt_C.shape[0]
        #weight_der = input_error*a_prev
        #bias_der = input_error*1
        bias_der = avg_neuron_error*1
        self.bias = self.bias - self.learning_rate*self.bias
        
        weight_ders = np.dot(np.transpose(previous_layer_activations), self.error_wrt_C)
        avg_weight_ders = weight_ders/num_examples
        self.weights = self.weights - self.learning_rate*avg_weight_ders
       # print(avg_weight_ders)
        #lambda w,  = avg_neuron_error* 
        #for w_ in self.weights:
         #   w_ = w_ 
            
        
        #2. AUXILLIARY UPDATES
        
    #def backward_pass(self,err): #improves weights of neuron
     #   '''Calculates weight update based on dC/dw'''
    #first calculate the gradient of error vs output
       # del_C_out = 
        #then obtain del (dC/dw) =dC/dout *dout/dWi


In [10]:
#Testing the neuron which takes five inputs
n1 = neuron(5)
n1.get_weights().shape
n1.forward_pass([[0.11,0.4,0.1,0.1,0.21]])
#n1.forward_pass([[0.11,0.4,0.1,0.1,0.21],[0.11,0.4,0.1,0.1,0.21]], weighted_as_well=True)


array([[ 0.52328645]])

In [11]:
#Making a layer of independent neurons
#Layer can have additional properties- deactivate neurons at a particular step, 
#A layer should work independently whether placed in a top down or bottom up processing 
class Layer:
    '''
    think of this as a GROUPING FOR MULTIPLE NEURONS--> 
    ANY CHANGES TO backprop, etc actually take place at neuron level. This only serves to clump together and pass activations in a GROUP
    Takes in input shape, max number of neurons in layer and feedforward function to output active neurons'''
    '''
    Weights are stored as num_neurons*input_dims
    Bias are stored as num_neurons*1
    activation value is stored as m*n where m is number 
    Potential Additions :
    1) Different number of activation layers
    2) Active neurons changes as per need'''
    def __init__(self, input_dims, max_num_neurons =10, input_layer= False, final_layer = False, neuron_activation_tuples = [(sigmoid,10)]):
        '''Input_shape: batch_size* input_dims'''
        self.dims = input_dims #input_shape[-1] #number of dims of input
        #self.batch_size = input_shape[0] 
        self.input_layer = input_layer
        self.final_layer = final_layer
        self.max_num_neurons = max_num_neurons #max_num_neurons is the total number of neurons to be grouped
        self.num_active_neurons = self.max_num_neurons
        self.neurons = []
        self.neuron_activations = []
        if(neuron_activation_tuples == [(sigmoid,10)]): #default case
            neuron_activation_tuples[0] = (sigmoid, self.max_num_neurons)
        
        n_index = 0     
        for activation_tuple in neuron_activation_tuples:#adding activations
            for i in range(n_index, n_index + activation_tuple[1]):
                self.neurons += [neuron(self.dims, activation_tuple[0])]
                self.neuron_activations.append(activation_tuple[0])
            n_index = activation_tuple[1]+1 
        self.active_neurons_index = range(len(self.neurons)) #this may be used later to switch number of active neurons    
         
        
        #if(not (len(self.neuron_activations)== self.max_num_neurons) and not (self.input_layer)):
            #print("Max number neurons must match number of activations specified in activation tuple argument")
            ##self.max_num_neurons = len(self.neuron_activations)
            
        #if(not final_layer):  
         #   ''' we will use half tanh and half sigmoid here'''
          #  self.neurons = [neuron(self.dims,sigmoid) for i in range(max_num_neurons//2)]
           # self.neurons += [neuron(self.dims, tanh) for i in range(max_num_neurons//2, max_num_neurons)]
            #self.active_neurons_index = range(len(self.neurons))
            
        #else:
         #   inp = int(input("Enter 1 for softmax, 2 for sigmoid/tanh and 3 for regression"))
          #  self.neurons = [neuron(self.dims, activation_dict[inp]) for i in range(max_num_neurons)]
           # self.active_neurons_index = range(len(self.neurons))
 
        """ Below are state variables """
        #self.neuron_activations = [neuron_.activation for neuron_ in self.neurons]
        if(not self.input_layer):
            assert self.max_num_neurons == len(self.neuron_activations)
            self.weight_matrix = self.get_weights()
            self.bias_vector = self.get_bias()
            
        
        self.inputs = []
        self.activation = []#set as per input is given, otherwise need to account for batch size
        self.non_activated = []
        '''self.errors is calculated from neuron view (can include other objectives)
           self.error_wrt_primary_C is calculated from final output view (overall cost/objective)
        '''
        self.errors = []
        self.errors_wrt_primary_C = [] 
        self.del_a_z = []
        self.previous_layer_activations = [] #We need this for standard backpropogation 
        self.next_layer = None #pointer to the next layer
        self.prev_layer = None
        
    def get_layer_errors(self):
        return self.errors
   

    def set_layer_errors_wrt_primary_C(nparray, self): 
        self.errors_wrt_primary_C = nparray
        
    def get_layer_errors_wrt_primary_C(self):
        return self.errors_wrt_primary_C
    
    def get_weights(self):
        weight_matrix = np.zeros((self.max_num_neurons, self.dims))
        for i, neuron_ in enumerate(self.neurons):
            weight_matrix[i, :] = np.array(neuron_.get_weights())[:,0] #reshape drop second axis
        return np.transpose(weight_matrix)
    
    def get_bias(self):
        bias_vector = np.zeros((self.max_num_neurons))
        for i, neuron_ in enumerate(self.neurons):
            bias_vector[i] = np.array(neuron_.get_bias())
        return np.transpose(bias_vector)
    
    #def set_batch_size(self, batch_size):
     #   self.batch_size = batch_size
        
    def simple_feedforward(self, inputs):
        '''USe this when multiple activations in same layer are not used and when neuron level propogation is not needed'''
        self.inputs = inputs
        #assert self.batch_size == self.inputs.shape[0]
        if(self.input_layer):
            self.activation = np.array(inputs)
        else:
            self.non_activated = np.dot(x, l1.get_weights())+l1.get_bias()
            #temp = 
            #self.activated = 
            #non_activated = np.dot(x, l1.get_weights())+l1.get_bias()
            #activated = sigmoid(non_activated[:,:3])
            #activated2 = tanh(non_activated[:,3:])
           # np.concatenate((activated,activated2),axis=1) # ONLY IF 1st axis corresponds to neuron 
    
    def set_layer_errors_wrt_primary_C(self, del_C_wrt_unactivated):
        self.errors_wrt_primary_C = del_C_wrt_unactivated
        
    def set_previous_layer_activations(self, prev_layer_acts):
        self.previous_layer_activations = prev_layer_acts
        
    def feedforward(self, inputs, weighted_as_well = True):
        drop_second_axis = lambda x: x.reshape(x.shape[0])
        self.inputs = inputs
        if(self.input_layer):
            self.activation = np.array(inputs) 
            return (np.array(inputs))
        
        
        self.activation, self.non_activated, self.del_a_z = zip(*[neuron_.forward_pass(self.inputs, True) for neuron_ in self.neurons])
        
        self.activation = np.array(self.activation).transpose()[0]
        self.non_activated = np.array(self.non_activated).transpose()[0]
        self.del_a_z = np.array(self.del_a_z).transpose()[0] #should this be calculated here or taken from neuron?? 
        #print(self.del_a_z.shape)    
            
                                   #append every neurons output with a list -> convert to np array once done-> store parallel weighted_only list
            #we can do this directly using the weight matrix
            #in this case, we will use neurons index
            #need to store activation, weighted input, 
            #return drop_second_axis()
        return self.activation

    def update_parameters_normal_layer(self):
        """PARRALELIZE THIS FUNCTION"""
        '''Pass respective error to each neuron--> PARALLELIZE this '''
        for i, n_ in enumerate(self.neurons):
            '''WE NEED A FUNCTION TO REPLACE THIS- obtain batchwise errors/axis'''
            n_.error_wrt_C = self.errors_wrt_primary_C[:,i] 
            n_.update_parameters(self.prev_layer.activation)
        #print("Updated parameters of layer: {}".format(self))
        #parallel code
        
        #after parallel updates
    
    
    def update_parameters_final_layer(self):
        self.update_parameters_normal_layer()
    
    
    def update_parameters(self):
        if(self.final_layer):
            self.update_parameters_final_layer()
        else:
            self.update_parameters_normal_layer()
            
        
                                   

In [12]:
x = [(1,2,3),(5,3,6)]
y,z,a = zip(*x)
np.array(a)

array([3, 6])

In [13]:
l1 = Layer(3, 6, False, True) # input_shape, max_num_neurons =10, input_layer= False, final_layer

In [14]:
l1.feedforward(x).shape
l1.del_a_z

array([[  2.48759683e-01,   2.43876500e-01,   1.40448632e-01,
          6.81743337e-02,   2.66663494e-02,   9.41953690e-02],
       [  2.47733650e-01,   2.38370060e-01,   5.16146411e-02,
          8.36912951e-02,   1.92893212e-04,   1.50231531e-01]])

In [15]:
non_activated = np.dot(x, l1.get_weights())+l1.get_bias()
activated = sigmoid(non_activated[:,:3])
activated2 = tanh(non_activated[:,3:])
#np.stack([activated,activated2], axis =1)
print(activated.shape)
np.concatenate((activated,activated2),axis=1) #1st axis/column needs to be joined


(2, 3)


array([[ 0.53521814,  0.5782528 ,  0.16901455, -0.98745916, -0.99841179,
         0.97268716],
       [ 0.5476062 ,  0.6078422 ,  0.05459529, -0.97958471, -0.99999993,
         0.90305873]])

In [16]:
l1.feedforward(x).shape

(2, 6)

In [17]:
l1.del_a_z

array([[  2.48759683e-01,   2.43876500e-01,   1.40448632e-01,
          6.81743337e-02,   2.66663494e-02,   9.41953690e-02],
       [  2.47733650e-01,   2.38370060e-01,   5.16146411e-02,
          8.36912951e-02,   1.92893212e-04,   1.50231531e-01]])

In [18]:
[neuron_.forward_pass(x, weighted_as_well= True) for neuron_ in l1.neurons]

[(array([[ 0.53521814],
         [ 0.5476062 ]]), array([[ 0.14110621],
         [ 0.19100337]]), array([[ 0.24875968],
         [ 0.24773365]])), (array([[ 0.5782528],
         [ 0.6078422]]), array([[ 0.31560506],
         [ 0.438251  ]]), array([[ 0.2438765 ],
         [ 0.23837006]])), (array([[ 0.16901455],
         [ 0.05459529]]), array([[-1.59262748],
         [-2.85166555]]), array([[ 0.14044863],
         [ 0.05161464]])), (array([[ 0.07358979],
         [ 0.09219036]]), array([[-2.53281082],
         [-2.28717917]]), array([[ 0.06817433],
         [ 0.0836913 ]])), (array([[ 0.0274181 ],
         [ 0.00019293]]), array([[-3.56875085],
         [-8.55298793]]), array([[ 0.02666635],
         [ 0.00019289]])), (array([[ 0.89472095],
         [ 0.81586147]]), array([[ 2.13989747],
         [ 1.48855622]]), array([[ 0.09419537],
         [ 0.15023153]]))]

In [19]:
activation_dict = {1:sigmoid, 2: tanh, 3: weighted_linear, 4: relu, 5: linear}
class Artificial_Neural_Network:
    def __init__(self, num_neurons_in_layer = [(5),(3,[(sigmoid, 2),(tanh, 1)]), (1,[(sigmoid, 1)])], cost_func = mse):
        self.num_layers = len(num_neurons_in_layer)
        self.num_neurons_in_each_layer = num_neurons_in_layer
        self.input_dims = num_neurons_in_layer[0]
        self.output_dims = num_neurons_in_layer[-1]
        self.layers = []

        self.cost_func = cost_func
        self.layers.append(Layer(num_neurons_in_layer[0], num_neurons_in_layer[0],True)) #input layer
        prv_layer_neurons = num_neurons_in_layer[0]
        for i, layer_tuple in enumerate(num_neurons_in_layer[1:]):
            if(not (i==len(num_neurons_in_layer)-2)): #since we started at two
                self.layers.append(Layer(prv_layer_neurons, layer_tuple[0], False, False, layer_tuple[1] ))
            else: #last layer/ output layer
                self.layers.append(Layer(prv_layer_neurons, layer_tuple[0], False, True, layer_tuple[1] ))
            self.layers[i].next_layer = self.layers[i+1] 
            self.layers[i+1].prev_layer = self.layers[i]
            prv_layer_neurons = layer_tuple[0]
            
#def __init__(self, input_dims, max_num_neurons =10, input_layer= False, final_layer = False,
#neuron_activation_tuples = [(sigmoid,10)] )
    
    def feedforward(self, inputs):
        '''
        Input of shape m*n_input_layer
        Output of shape m*n_final_layer
        '''
        forward_val = inputs
        for layer_ in self.layers:
            layer_.set_previous_layer_activations(forward_val)
            forward_val = layer_.feedforward(forward_val)
            
        return forward_val
    
    
    def backprop(self, der_C_wrt_prediction):
        '''Input of shape m*n final_layer
           No outputs, only updates layer and neuron wise
        '''
        for layer in self.layers[::-1]:
            if(layer.final_layer):
                #print(der_C_wrt_prediction.shape)
                #print(layer.del_a_z)
                assert der_C_wrt_prediction.shape == layer.del_a_z.shape
                del_C_wrt_unactivated = np.multiply(der_C_wrt_prediction, layer.del_a_z) #ELEMENT WISE
                layer.set_layer_errors_wrt_primary_C(del_C_wrt_unactivated)
                #print("DEL_C")
                #print(del_C_wrt_unactivated)
                layer.update_parameters()
                upper_layer_error = del_C_wrt_unactivated
            
            elif(layer.input_layer):
                continue #DO nothing here
            
            else:
                #del_C.... is of shape m*n and layer (the upper layer from which error is propogated) weights are of shape k*n 
                del_C_wrt_activated = np.dot(upper_layer_error, np.transpose(layer.next_layer.get_weights()))   #modify weights after backpropogating error to previous layer
                del_C_wrt_unactivated = np.multiply(del_C_wrt_activated, layer.del_a_z)
                #layer.del_a_z = del_C_wrt_unactivated #
                #print(del_C_wrt_unactivated.shape)
                layer.set_layer_errors_wrt_primary_C(del_C_wrt_unactivated) 
                layer.update_parameters() #update parameters by calculating del_C_wrt_parameter
                #update layer weights
                upper_layer_error = del_C_wrt_unactivated
                
            #del_C_wrt_unactivated =   #IMP for now a neuron's error is not equaivalent to layer's index error
            #layer.set_layer_error() 
            
        
    def cost_calculation(self, y_true, y_pred):
        return self.cost_func(y_true,y_pred)
    
    def training(self,x,y, print_metrics = True):
        '''This takes in x and y and updates weights after processing entire input'''
        '''For given x vector or matrix (M*N) where m is the number of tikraining examples, and y is the expected output (M*Y_dim):
        1) Calculate forward pass (obtain resultant activations)
        2) Obtain the associated cost with forward pass
        3a) Adjust weights and biases depending on cost (CURRENTLY using gradient descent)
        3b) Backpropogate error/cost through chain rule through each layer
        3c) Define error as delC/del(prev_activation)--> use this to calculate delC/delW --> delC/del_prev_A* del_prev_A/delW
        Update each weight and bias based on backpropogation
        
        Output of shape 
        
        '''
        if(x.ndim ==1): #only 1 training example provided (stochastic)
            x = x.reshape(1, x.shape[0])
            
        #get forward pass predictions 
        forward_pass_output = self.feedforward(x)
        #predictions = 
        predictions = forward_pass_output
        
        
        #perform error calculation of relevant terms
        assert y.shape == predictions.shape
        avg_error, der_C_wrt_prediction = self.cost_calculation(y, predictions)
        
        if(print_metrics):
            print("Predicted value: {} ".format(predictions))
        
        print("Average batch cost: {}".format(avg_error))
        #print(der_C_wrt_prediction.shape)
        #perform parameter updates by calling relevant functions
        self.backprop(der_C_wrt_prediction)

In [20]:
#ANN = Artificial_Neural_Network([(3),(10,[(sigmoid, 5), (tanh, 5)]), (2,[(sigmoid, 1), (tanh, 1)]), (2,[(sigmoid, 2)])], cost_func= mse)
ANN = Artificial_Neural_Network([(3),(50,[(sigmoid, 25),(tanh, 25)]), (2,[(linear, 2)])])

In [21]:
true_values = np.array([[0.5,0.8],[0.8,0.5],[0.5,0.8],[0.8,0.5]])
true_values2 = np.array([[5,8],[8,5],[5,8],[8,5]])
xor_true_values = np.array([[1,0],[1,0],[0,1],[0,1]])
input_values = np.array([[0.1, 0.1, 0.1],[0.9,0.9,0.9],[0.1, 0.1, 0.1],[0.9,0.9,0.9]])
xor_input_values = np.array([[0,0],[1,1],[0,1],[1,0]])
#input_values, predicted_values = 
#ANN.layers

In [22]:
ANN.training(np.array([[0.5,0.8,0.9]]), np.array([[5,6]]))

Predicted value: [[ 1.41626931  1.25735186]] 
Average batch cost: 8.83395925489


In [None]:
#Linear output 
for i in range(10000):
    ANN.training(input_values, true_values2)

In [24]:
# Non linear XOR problem and classification tasks 

''' To do: 
XOR classification example, how to classify for 2 examples, then extend to n examples'''

' To do: \nXOR classification example, how to classify for 2 examples, then extend to n examples'

# Section 1.2: Questioning the constructed NNet

## Relevant questions at this stage:

1. What could be the possible consequences of using a high learning rate (>=0.8) in an mse cost error for linear output? Ans: This is done by exploring the derivation of backpropogation. 
2. What is the performance change when the the same value between 0 and 1 is predicted using a sigmoid output vs a linear output? (Performance = Time taken or number of computations to learn mapping
3. If there are any differences in above question, how can we resolve this issue. 


## Answering Q2
## Exploring using sigmoidal output and linear output

In [None]:
#Linear output activation
ANN = Artificial_Neural_Network([(3),(50,[(sigmoid, 25),(tanh, 25)]), (2,[(linear, 2)])])
for i in range(10000):
    ANN.training(input_values, true_values)

In [None]:
#Sigmoid output activation
ANN = Artificial_Neural_Network([(3),(50,[(sigmoid, 25),(tanh, 25)]), (2,[(sigmoid, 2)])])
for i in range(10000):
    ANN.training(input_values, true_values)
    


## Answering Q3 

#### Find the  culprits for slow learning in an >> ANN architecture<< (remember, we can create different architectures for better and faster learning. Thus, the question here is wrt an ANN architecture)

#### Ans: How does learning occur--> Backprop--> When is learning 'slow' --> Which terms contribute to slow learning? ---> How can we elimintate the effect of these terms--> What are the changeable aspects in an ANN (remember -> forward calc, backward calc, and cost calculation) 

## Improving learning by changing cost function: Cross entropy derivation and implementation

In [None]:
##To do cross entropy example here

In [None]:
ANN = Artificial_Neural_Network([(3),(50,[(sigmoid, 25),(tanh, 25)]), (2,[(linear, 2)])])

# Section 2.1:  Applying the ANN for actual regression and classification tasks. Why are these tasks and datasets relevant for ML and AI? 


## Applying our ANN for a regression task

The regression dataset I have chosen is http://archive.ics.uci.edu/ml/datasets/yacht+hydrodynamics. How does deep learning help-> our network will take the provided inputs--> feedforward these to form representation out of the input. Each additional layer is stipulated to form a more complex/specific representation helpful in minimizing an objective- exs: classification accuracy, closer values to . (Bottom up processing) 

Traditionally, to make models for such tasks, researchers would extract useful features (through domain knowledge or some careful analysis/experimentation) and feed into a decision boundary model. 
In deep learning, the network ideally just requires the inputs and learns to extract the best features from the data. The final layer representation can then be used for classification/regression, etc. 

Q: Why is MNIST a useful dataset? Would a model performing 100% on MNIST be deemed a worthy number recongition system? Why or why not? (Remember to think about how we encounter numbers and how they look) 

In [25]:
##regression dataset
data_x = []
data_y = []
with open('./data/yacht_hydrodynamics.data') as f:
   #read in the data 
    while(True):
        line = f.readline()
        if(line == '\n'):
            break
        else:
            line_list = line.split()
            data_x.append(line_list[:-1])
            data_y.append(line_list[-1])


In [26]:
data_x = np.array(data_x, dtype="float")
data_y = np.array(data_y, dtype="float").reshape(len(data_y),1)
data_y.shape

(308, 1)

In [27]:
ANN = Artificial_Neural_Network([(6),(200,[(sigmoid, 120),(tanh, 80)]), (1,[(linear, 1)])])

In [None]:
for i in range(10000):
    ANN.training(data_x, data_y, False)

# Section 2.2: Mechanisms of tuning for specific tasks and further improvement of the ANN learning algo

As we see above, training a neural network for particular tasks seems pretty mystic. Further, while the architecture is mathematically sound, it does consume computational resources since there are so many variables to tune. 
Thus, there are some relevant questions to ask:
1. How can training be made computationally less expensive and of higher performance?: 
    1. Introduce mathematical tricks: This includes changing and adding new elements to our nnet components such as a better cost function, constraining weights and activations through neurons.

    2. Introduce different transformation architectures that still make mathematical sense. Ex: Convnets, RNN...

    3. Further to point 2 , design architectures after undersanding the task requirements
    
    4. For computation, introduce a computation graph to compute gradients and variables. (This is only to limit 

2. Role of neural nets with respect to the question of AI:
    1. Should we work on domain/problem-specific transformation algos/architectures and then see how to combine them for AI systems?  
    2. Or instead, should we first clearly define the problems in AI as Minsky does in Perceptrons and Society of mind, and build algorithms suited to those problems and then work on adapting them for applications such as NLP, etc
    3. Should we invest effort in introducing a theory of the mind- do psychological biases play a role in understanging, etc? 

3. How can training be made interpretable? 

# Section 3.1: Deriving new architectures

# RNN, addition of new 'variables' and how to optimize


# Section 3.2: What else can we use ANNs for

In [28]:
#Autoencoder and other unsupervised tasks

# Section 4: Neural network playground
### 4.1) Implementing an Alzhiemers Neuron



In [None]:
class neuron:
    
    def __init__(self, input_size,activation = sigmoid, lr= 0.1, belongs_to = None):
        self.shape = (input_size,1)
        self.weights = np.random.random(self.shape) - np.random.random(self.shape) #np.random.random(self.shape) #initialize random values  #np.zeros(self.shape)#
        self.bias = np.array([0])
        self.activation = activation
        self.learning_rate = lr
        self.surrounding_activations = None
        self.belongs_to = belongs_to
        self.neighbours = []
        #self.previous_layer_activations = []
        self.error_wrt_C = None
        self.historical_activations = None #Hebbian learner to mantain co-occuring high activations together-> if n1 and n2 co-occur a lot, then if n1 occurs add a bias term?
    def forward_pass(self,input_x, weighted_as_well = False): #weighted as well means just weighted, not activated
        '''Input x is of shape m*n where m is batch size and n is input dims'''
        weighted_input = np.dot(input_x, self.weights)+self.bias
        if(weighted_as_well): #this returns both activated and non activated output 
            return (self.activation(weighted_input),weighted_input, self.activation(weighted_input, der = True)) 
        else:
            return self.activation(weighted_input)
    def adjust_parameters(self):
        None
    def calc_error(self,correct_y, predicted_y):
        '''Calculates and returns error based on metric of neuron'''
        return None 
    def calc_gradient(self, y,x):
        '''Calculates gradient of y as function wrt x'''
        if(self.activation == sigmoid):
            return del_sigmoid()
        elif(y== calc_error):
            None
        return y(x+1e-10)-y(x)/1e-10 #taking limit to calculate gradient 
    def get_weights(self):
        return self.weights
    def get_bias(self):
        return self.bias
    def set_weights(self,weights):
        self.weights = weights
    def set_bias(self, bias):
        self.bias  = bias
    
    def set_error_wrt_C(self, error_wrt_C):
        sefl.error_wrt_C = error_wrt_C
    def give_relevant_info(self, info): #this function serves to receive relevant information, add any variables here
        None
    def add_neighbour(self,neigbhour):
        self.neighbours.append(neigbhour)
        
    #def set_previous_layer_activations(self, previous_layer_activations):
     #   self.previous_layer_activations = previous_layer_activations
        
    #def gradient_descent(self, )
    def update_parameters(self, previous_layer_activations):
        '''Previous layer activations  is of shape m*n
           Error_wrt_c is of shape m*1
        '''
        
        '''CURRENTLY a single error will be sent across input parameters (weights and bias)'''
        #1. BACKPROPOGATION TO UPDATE WEIGHTS AND BIAS WRT OVERALL COST 
        #assert(self.error_wrt_C)
       # print("del_C_wrt_del_z",self.error_wrt_C)
        avg_neuron_error = np.mean(self.error_wrt_C, axis =0)
        num_examples = self.error_wrt_C.shape[0]
        #weight_der = input_error*a_prev
        #bias_der = input_error*1
        bias_der = avg_neuron_error*1
        self.bias = self.bias - self.learning_rate*self.bias
        
        weight_ders = np.dot(np.transpose(previous_layer_activations), self.error_wrt_C)
        avg_weight_ders = weight_ders/num_examples
        self.weights = self.weights - self.learning_rate*avg_weight_ders

In [30]:
class alzheimer_neuron(neuron):
    '''This is just a fun example of a neuron
    It does not represent the full process of what happens during development of Alzheimers disease
    But a friend of mine gave the following gist: Continous high activations can fire the neuron off permanently
    Below neurons get deactivated for a given time if having too many high activations
    How can this be useful: Continously high activated neurons can lead to high gradient values, and learning might be forced through other patterns if neuron is being excited by everything'''
    def __init__(self, input_size, starting_state = 1, deactivation_threshold = 0.8, deactivation_time = 5, history = 10, num_continous_exceed = 5, max_high_deactivations = 50, activation = sigmoid, lr= 0.1, belongs_to = None):
        super(alzheimer_neuron, self).__init__(activation, lr, belongs_to)
        self.state = starting_state #1 means active, 0 means deactive
        self.previous_high_activations = []
        self.deactivation_threshold = deactivation_threshold #activation value above which 
        self.deactivation_time = deactivation_time #number of steps for deactivation
        self.max_high_deactivations = 50 #after 50 above threshold deactivations, neuron is deactivated for steps = deactivation time
        self.max_num_continously_exceeded = num_continous_exceed
        self.continously_exceeeded = 0 #how many in continous have exceeded
        self.history = history #NOT USed now if previous activations have to be limited
    

### 4.2) Implementing associtiavity in neurons: What is the new variable, how do we tune the relevant parameter, what's the cost func? 

# Section -1: To dos and some more Qs

In [None]:
'''IMP TO DO
1)  def update_parameters_normal_layer(self): Line 140 of layer class 
        #Parallelize and obtain error axis through a getter function
        
        for i, n_ in enumerate(self.neurons):
            #WE NEED A FUNCTION TO REPLACE THIS- obtain batchwise errors/axis
            n_.error_wrt_C = self.errors_wrt_primary_C[:,i] 
            n_.update_parameters(self.prev_layer.activation)
        #print("Updated parameters of layer: {}".format(self))
        
        '''

'''Fun stuff to do
1) Alzheimer neuron-> as mentioned by Bobby
2) Genetic neurons --> Each neuron is represented by some gene which makes it behave a particular way. Question is: Should genes be precoded or genes will be learned as training progresses. Similar gene neurons behave in simlar ways 
4) What is a possible optimum/cost for judging how well a concept is learned by a machine? 
3) Is a cost function the best way to judge a nnets performance? What is the best way to determine the best configuration of weights and errors. Is it wrong to learn through gradients? 
'''

'''Stuff to do:
VI: <B> ADD CROSS entropy </B>
1) Parrallelize update parameters for layer class
2) Generalize backpropogate so that axis for each --> batch, neurons, time etc can be represented
3) Autoencoder
4) RNN
5) 
'''


In [None]:
'''To do:
1) Insert hebbian learner--> calculate associative strength of neurons, if two neurons co-occur then make a weight between the two
2) Insert a redundant learner/ Alzheimer's neuron --> mantain a self.last_10activations list, see rate of change of input-> last_10_inputs, EXTEND the base neuron class 
5) VVVVIP: Parallelize the update parameters in Layer class for each neuron  
3) ADD a variable in code to accept axis of error and axis of other things 
4) Generalize the backpropogate function in layer to specify axis of error
LINE 140 Parallelize it
''' 