import numpy as np

![](helper/1.JPG)

Input shape (m,28,28)


Layer 1: conv with 8 filters of size 3*3, stride = 1, pad = 0.

        out_height = (28-3)/1 + 1 = 26             [(image_size - filter_size + 2*pad)/stride + 1]
        out_dimension = (m,26,26,8)
        Weight shape = (3,3,8)                     If there were a preceding conv layer: Weight shape = (3,3, n_prev_filters, n_next_filters)

Layer 2: max pool with 2*2 filter, stride = 2

        out_height = (26-2)/2 + 1 = 13             [(image_size - filter_size)/stride + 1]
        out_dimension = (m,13,13,8)                Max pool only downsamples; the number of filters stays the same.

Layer 3: softmax (with flatten)

        intermediate_shape = (m,13*13*8)           pooled output flattened into a 1D array per image
        out_dimension = (m,10)
        Weight shape = (13*13*8, 10)               weight shape is always (n_previous_layer, n_next_layer)
        bias shape = (1,10)                        bias always has the shape of the next layer
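
The shape arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper names `conv_out_size` and `pool_out_size` are made up for illustration and are not part of the notebook.

```python
def conv_out_size(size, f, stride=1, pad=0):
    """Output height/width of a conv layer: (size - f + 2*pad) / stride + 1."""
    return (size - f + 2 * pad) // stride + 1


def pool_out_size(size, f, stride):
    """Output height/width of a pooling layer: (size - f) / stride + 1."""
    return (size - f) // stride + 1


n_filters = 8
conv_hw = conv_out_size(28, 3, stride=1, pad=0)   # layer 1 output height/width
pool_hw = pool_out_size(conv_hw, 2, stride=2)     # layer 2 output height/width
flat_len = pool_hw * pool_hw * n_filters          # length of the flattened softmax input

print(conv_hw, pool_hw, flat_len)
```

The three printed values correspond to the 26, 13 and 13*13*8 that appear in the layer summary.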


## Convolution 

![Figure 1](helper/2.gif)
Figure 1 


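Here we convolve across the full depth: each filter covers every feature map of the previous layer. (This is a standard convolution over a volume, not the per-channel "depthwise convolution" used in depthwise-separable networks.) The operation is repeated for each of the n_filters filters, and the results are stacked to form the output.

    for i in range(no_filters):

        out[i] = np.sum(image_slice * weights[..., i])   # image_slice and weights[..., i] both have shape (filter_size, filter_size, no_of_prev_filters)

or simply, computing the output for all filters together:

    np.sum(image_slice[..., np.newaxis] * weights, axis=(0, 1, 2))

where image_slice has shape (filter_size, filter_size, no_of_prev_filters) and weights has shape (filter_size, filter_size, no_of_prev_filters, no_of_next_filters).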
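
A minimal sketch of the forward convolution for a single image, assuming no padding and a channels-last layout. The function name `conv_forward` is hypothetical, and `np.tensordot` is used here as one way to contract the slice against all filters at once.

```python
import numpy as np


def conv_forward(image, weights, stride=1):
    """Valid (no padding) convolution of one image.

    image:   (h, w, n_prev_filters)
    weights: (f, f, n_prev_filters, n_next_filters)
    returns: (out_h, out_w, n_next_filters)
    """
    h, w, _ = image.shape
    f, _, _, n_next = weights.shape
    out_h = (h - f) // stride + 1
    out_w = (w - f) // stride + 1
    out = np.zeros((out_h, out_w, n_next))
    for i in range(out_h):
        for j in range(out_w):
            image_slice = image[stride * i:stride * i + f,
                                stride * j:stride * j + f, :]
            # Multiply the slice with every filter and sum over (f, f, n_prev_filters),
            # leaving one value per output filter.
            out[i, j] = np.tensordot(image_slice, weights, axes=3)
    return out


rng = np.random.default_rng(0)
img = rng.standard_normal((28, 28, 1))
w = rng.standard_normal((3, 3, 1, 8))
out = conv_forward(img, w)
print(out.shape)
```

Each entry `out[i, j, k]` equals `np.sum(image_slice * weights[..., k])`, which is the per-filter loop written out explicitly.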

## Maxpooling

* Pooling has no learnable parameters (filter size and stride are fixed hyperparameters).
* The number of filters after pooling is the same as in the previous layer. Only the h, w of the output shrink, to (h_prev-f)/stride+1.
* Average pooling blurs the image by about 1 pixel (for a 2*2 filter window).
* Max pooling works better for a white object on a black background, min pooling for a black object on a white background.


    for im_region, i, j in generate_slices(image):
        output[i, j] = np.amax(im_region, axis=(0, 1))   # axis=(0, 1) takes the max over h and w, leaving one value per feature map
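
To make the loop above runnable, here is one possible sketch. The signature of `generate_slices` (yielding the region plus its output indices) is an assumption, as is the wrapper name `maxpool_forward`.

```python
import numpy as np


def generate_slices(image, f=2, stride=2):
    """Yield (im_region, i, j) for every pooling window of an (h, w, n_filters) image."""
    h, w, _ = image.shape
    for i in range((h - f) // stride + 1):
        for j in range((w - f) // stride + 1):
            yield image[stride * i:stride * i + f, stride * j:stride * j + f], i, j


def maxpool_forward(image, f=2, stride=2):
    h, w, n_filters = image.shape
    output = np.zeros(((h - f) // stride + 1, (w - f) // stride + 1, n_filters))
    for im_region, i, j in generate_slices(image, f, stride):
        # Max over the window's height and width, separately for each feature map.
        output[i, j] = np.amax(im_region, axis=(0, 1))
    return output


x = np.arange(16).reshape(4, 4, 1)
pooled = maxpool_forward(x)
print(pooled[:, :, 0])
```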

## Softmax 

The input needs to be flattened if it comes from a conv (or pooling) layer.

Weight shape: (input_flatten, n_of_output_class)

Bias shape: (n_of_output_class,)

totals = np.dot(flattened_input, weights) + bias

$$ out[k] = \frac{e^{t_k}}{S}, \quad S = \sum_j e^{t_j} $$
In NumPy: np.exp(totals)/np.sum(np.exp(totals), axis = 0). The numerator is a vector of shape n_of_output_class, the denominator is a scalar.
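
A small sketch of the softmax forward pass for one flattened sample. The function name `softmax_forward` is hypothetical, and subtracting the maximum before exponentiating is an extra numerical-stability step that the notes do not include (it does not change the result).

```python
import numpy as np


def softmax_forward(flattened_input, weights, bias):
    """flattened_input: (n_in,), weights: (n_in, n_classes), bias: (n_classes,)."""
    totals = np.dot(flattened_input, weights) + bias
    # Assumption: shift by the max for numerical stability; softmax is shift-invariant.
    exp = np.exp(totals - np.max(totals))
    return exp / np.sum(exp, axis=0)


rng = np.random.default_rng(0)
n_in = 13 * 13 * 8
x = rng.standard_normal(n_in)
w = rng.standard_normal((n_in, 10)) / n_in
b = np.zeros(10)
probs = softmax_forward(x, w, b)
print(probs.shape, probs.sum())
```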

# BackPropagation

Categorical cross-entropy 

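$$ L = -\sum_i y_i \ln(out[i]) = -\ln(out[c]) $$
where c is the index of the correct class; the terms with i != c vanish because the one-hot label y is 0 there.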

label = np.argmax(y)

so, loss = -np.log(out[label])

The gradient d_L_d_out is non-zero only at the index corresponding to the label: d_L_d_out[label] = -1/out[label]
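
As a quick illustration, the loss and its gradient with respect to the softmax output can be computed as follows. The values of `out` and `y` are made up for the example.

```python
import numpy as np

out = np.array([0.1, 0.7, 0.2])   # softmax output (illustrative values)
y = np.array([0, 1, 0])           # one-hot label

label = np.argmax(y)
loss = -np.log(out[label])

# d_L_d_out is zero everywhere except at the label index.
d_L_d_out = np.zeros_like(out)
d_L_d_out[label] = -1 / out[label]

print(loss, d_L_d_out)
```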

Since only the gradient entry for the true class is non-zero, the backward pass can recover the label index by looking for the non-zero entry of d_L_d_out, so the label does not have to be passed around separately.
 Now, picture the softmax output array.


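 With $S = \sum_j e^{t_j}$ and $out[c] = e^{t_c}/S$, where c is the label index:

 $$ \frac{\delta out[c]}{\delta t_k} = -e^{t_c}e^{t_k}/S^2   $$ if k!=c
 $$ \frac{\delta out[c]}{\delta t_k} = e^{t_c}(S-e^{t_c})/S^2   $$ if k==c

 So let's make an array of $-e^{t_c}e^{t_k}/S^2$ over all k, then overwrite the entry at the label index with $e^{t_c}(S-e^{t_c})/S^2$.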

 $$ \frac{\delta t}{\delta w} = input $$  (this will be the flattened input)
 $$ \frac{\delta t}{\delta b} = 1 $$
 $$ \frac{\delta t}{\delta input} = w $$

 $$ \frac{\delta L}{\delta input} = \frac{\delta L}{\delta out}*\frac{\delta out}{\delta t}*\frac{\delta t}{\delta input} $$ (chain rule; the same pattern gives the gradients for w and b)

        d_L_d_t = gradient * d_out_d_t     # gradient is a scalar, d_out_d_t has shape (10,)

        # chain rule: d_L_d_w = d_L_d_out * d_out_d_t * d_t_d_w
        # d_t_d_w is last_input with shape (13*13*8,); we need (13*13*8, 10)
        d_L_d_w = d_t_d_w[np.newaxis].T @ d_L_d_t[np.newaxis]   # newaxis makes both arrays 2D: (13*13*8, 1) @ (1, 10)
        d_L_d_b = d_L_d_t                                       # because d_t_d_b = 1
        d_L_d_inputs = d_t_d_inputs @ d_L_d_t   # d_t_d_inputs = weights (13*13*8, 10), d_L_d_t is (10,); @ returns (13*13*8,), so no newaxis is needed
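
Putting the pieces together, the softmax backward pass might look like the sketch below. The function name `softmax_backward` and its argument list are assumptions; the label index is recovered from the single non-zero entry of `d_L_d_out`, as described above.

```python
import numpy as np


def softmax_backward(d_L_d_out, last_input, totals, weights):
    """last_input: (n_in,), totals: (n_out,), weights: (n_in, n_out)."""
    c = np.flatnonzero(d_L_d_out)[0]        # label index = the only non-zero gradient
    gradient = d_L_d_out[c]

    t_exp = np.exp(totals)
    S = np.sum(t_exp)

    # d out[c] / d t_k for every k, then fix up the k == c entry.
    d_out_d_t = -t_exp[c] * t_exp / S ** 2
    d_out_d_t[c] = t_exp[c] * (S - t_exp[c]) / S ** 2

    d_L_d_t = gradient * d_out_d_t                              # (n_out,)
    d_L_d_w = last_input[np.newaxis].T @ d_L_d_t[np.newaxis]    # (n_in, 1) @ (1, n_out)
    d_L_d_b = d_L_d_t                                           # d_t_d_b = 1
    d_L_d_inputs = weights @ d_L_d_t                            # (n_in, n_out) @ (n_out,)
    return d_L_d_w, d_L_d_b, d_L_d_inputs


rng = np.random.default_rng(0)
x = rng.standard_normal(6)
w = rng.standard_normal((6, 3))
t = x @ w
out = np.exp(t) / np.sum(np.exp(t))
d_L_d_out = np.zeros(3)
d_L_d_out[1] = -1 / out[1]
d_w, d_b, d_x = softmax_backward(d_L_d_out, x, t, w)
print(d_w.shape, d_b.shape, d_x.shape)
```

A handy sanity check is that `d_b` should equal the softmax output minus the one-hot label, which is the well-known combined gradient of softmax and cross-entropy.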
 

## Max Pool 
Going back to Figure 1, every pixel in the output feature map has a corresponding image_slice.
So every d_L_d_out value has a corresponding image_slice.
Find the argmax within each of these image slices.
Set d_L_d_input = d_L_d_out only at the index holding the max value; every other position in the slice gets zero gradient.
Repeat for all feature maps.


    d_L_d_input[stride*ind_out_h + h_slice_max, stride*ind_out_w + w_slice_max, ind_f] = d_L_d_out[ind_out_h, ind_out_w, ind_f]
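
A possible sketch of the max pool backward pass, assuming non-overlapping windows (stride equal to the filter size), so that plain assignment is enough. The function name `maxpool_backward` and the shorter loop variable names are illustrative choices.

```python
import numpy as np


def maxpool_backward(d_L_d_out, last_input, f=2, stride=2):
    """d_L_d_out: (out_h, out_w, n_filters), last_input: (h, w, n_filters)."""
    d_L_d_input = np.zeros(last_input.shape)
    out_h, out_w, n_filters = d_L_d_out.shape
    for i in range(out_h):
        for j in range(out_w):
            im_region = last_input[stride * i:stride * i + f,
                                   stride * j:stride * j + f]
            for k in range(n_filters):
                # Position of the max inside this window, for feature map k.
                h_max, w_max = np.unravel_index(np.argmax(im_region[:, :, k]), (f, f))
                # Only the max position receives the gradient (windows assumed not to overlap).
                d_L_d_input[stride * i + h_max, stride * j + w_max, k] = d_L_d_out[i, j, k]
    return d_L_d_input


x = np.arange(16, dtype=float).reshape(4, 4, 1)
grad = maxpool_backward(np.ones((2, 2, 1)), x)
print(grad[:, :, 0])
```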

## Average Pooling 

$$ avg = \frac{\sum x}{n} $$

Thus d_avg_d_x = 1/n, so each of the n inputs in a pooling window receives d_L_d_out/n.      (read on Quora)
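
Under the same conventions as the max pool layer, the average pool backward pass could be sketched like this. The function name `avgpool_backward` is hypothetical, and `+=` is used so that the sketch also behaves sensibly if windows overlap.

```python
import numpy as np


def avgpool_backward(d_L_d_out, input_shape, f=2, stride=2):
    """Spread each output gradient evenly over its f*f window."""
    d_L_d_input = np.zeros(input_shape)
    out_h, out_w, _ = d_L_d_out.shape
    n = f * f
    for i in range(out_h):
        for j in range(out_w):
            # d_avg_d_x = 1/n for every element of the window; broadcasts over feature maps.
            d_L_d_input[stride * i:stride * i + f,
                        stride * j:stride * j + f] += d_L_d_out[i, j] / n
    return d_L_d_input


grad = avgpool_backward(4 * np.ones((2, 2, 3)), (4, 4, 3))
print(grad[:, :, 0])
```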

## Convolution

Changing any weight changes the output, because every output pixel of a feature map uses every weight of that filter.
In the forward pass, out[i,j,f] = sum(image_slice@(stride*i,stride*j) * weights[f])
So for d_out_d_w:
 every out[i,j,f] is affected by its corresponding image_slice, i.e. the derivative of out[i,j,f] with respect to weights[f] is image_slice@(stride*i,stride*j).

image_slice@(stride*i,stride*j) * d_L_d_out[i,j,f] gives the gradient trickling down from one out pixel. We need to sum these over all out pixels.

d_L_d_out[i,j,f] is a scalar, image_slice@(stride*i,stride*j) has shape (filter_size, filter_size, n_prev_filter).
We need a weight gradient of shape (filter_size, filter_size, n_prev_filter, n_filter_out).

So we get the last dimension by stacking the values for each filter, as below:

    for image_slice, i, j in generate_slices(image):
        for f in range(out_filters):
            d_L_d_f[f] += image_slice * d_L_d_out[i, j, f]
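
A runnable sketch of the weight-gradient accumulation, assuming no padding and weights laid out as (filter_size, filter_size, n_prev_filter, n_filter_out). The function name `conv_backward_weights` is made up, and `k` is used for the filter index so that `f` can stand for the filter size.

```python
import numpy as np


def conv_backward_weights(d_L_d_out, image, f=3, stride=1):
    """image: (h, w, n_prev), d_L_d_out: (out_h, out_w, n_out) -> (f, f, n_prev, n_out)."""
    out_h, out_w, n_out = d_L_d_out.shape
    n_prev = image.shape[2]
    d_L_d_w = np.zeros((f, f, n_prev, n_out))
    for i in range(out_h):
        for j in range(out_w):
            image_slice = image[stride * i:stride * i + f,
                                stride * j:stride * j + f, :]
            for k in range(n_out):
                # Contribution of output pixel (i, j) of filter k, summed over all pixels.
                d_L_d_w[..., k] += image_slice * d_L_d_out[i, j, k]
    return d_L_d_w


rng = np.random.default_rng(0)
img = rng.standard_normal((5, 5, 2))
d_w = conv_backward_weights(np.ones((3, 3, 4)), img)
print(d_w.shape)
```

With an upstream gradient of all ones, each weight's gradient is simply the sum of the input pixels that weight touched during the forward pass, which makes the result easy to verify by hand.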

