# CNN - Training - II

In this notebook, we present the backward calculation in the Conv layer.

Previously we have calculated the $\frac{\partial \mathcal{L}}{\partial in}$ for both the Softmax and the Max Pool layers. The Max Pool layer passes $\frac{\partial \mathcal{L}}{\partial out}$ to the Conv layer.

<img src="http://engineering.unl.edu/images/uploads/CNN_Backward_Propagation.png" width=700, height=500>



For the backward calculation of the Conv layer, our goal is two-fold.

First, we need to compute the variation in loss due to the input signal, i.e., 
- $\frac{\partial \mathcal{L}}{\partial in}$

This calculation is based on the variation in loss due to the output signal that it receives from the Max Pool layer:

$ \frac{\partial \mathcal{L}}{\partial in} \leftarrow \frac{\partial \mathcal{L}}{\partial out}$


<img src="http://engineering.unl.edu/images/uploads/CNN_Training_AnyLayer.png" width=500, height=400>

The second goal during backpropagation is to compute the variation in loss due to the layer weights and bias terms:
- $\frac{\partial \mathcal{L}}{\partial \vec{w}}$
- $\frac{\partial \mathcal{L}}{\partial b}$

Then, update the weights and the bias terms:

- $\vec{w} \leftarrow \vec{w} - \eta *  \frac{\partial \mathcal{L}}{\partial \vec{w}}$
- $b \leftarrow b - \eta *  \frac{\partial \mathcal{L}}{\partial b}$
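The update rule above can be sketched in a few lines of NumPy. This is only an illustration: the function name `sgd_step`, the learning rate value, and the example gradients are assumptions made for the sketch, not values from this notebook.

```python
import numpy as np

def sgd_step(w, b, dL_dw, dL_db, eta=0.1):
    """Return the updated (w, b) after one gradient-descent step."""
    return w - eta * dL_dw, b - eta * dL_db

# Hypothetical weights, bias, and gradients, chosen only for illustration.
w = np.array([1.0, -2.0])
b = 0.5
w_new, b_new = sgd_step(w, b, dL_dw=np.array([0.5, -1.0]), dL_db=2.0, eta=0.1)
# w_new is approximately [0.95, -1.9] and b_new is approximately 0.3
```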



# Backward Calculation: Conv Layer

The backward calculation for the Conv layer is done in two steps.
- Activation Layer: Given the derivative w.r.t. the activation $Y$ of layer $l$, compute the derivatives w.r.t. the output maps $Z$ of the same layer.

- Conv Layer: From derivative w.r.t. the output maps $Z$ of layer $l$, compute the derivative w.r.t. the input map $Y$ at layer $l - 1$ and $W$ at layer $l$.





## Backward Calculation: Activation Layer


During the forward computation, the activation feature maps $Y(i, j, k)$ are obtained by element-wise application of the activation function to the convolved feature maps $Z(i, j, k)$. 

- $Y(i, j, k) = f(Z(i, j, k))$


<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Activation_Backward.png" width=800, height=400>


During the backward computation, for every position of the feature maps $Y(i, j, k)$, we already have the derivative of the loss w.r.t. $Y(i, j, k)$. We obtained this from backward propagation (from the next Pooling layer).

We compute the derivatives of the loss w.r.t. $Z(i, j, k)$ using the chain rule of calculus:

$\frac{\partial \mathcal{L}}{\partial Z(i, j, k)} = \frac{\partial \mathcal{L}}{\partial Y(i, j, k)} f^{'}(Z(i, j, k))$
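As a minimal sketch of this element-wise chain rule, assume the activation $f$ is ReLU (the notebook does not fix a particular activation here, so this choice and the function names are assumptions). Then $f^{'}(Z)$ is 1 where $Z > 0$ and 0 elsewhere.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_backward(dL_dY, Z):
    """dL/dZ = dL/dY * f'(Z), applied element-wise."""
    # For ReLU, f'(Z) is 1 where Z > 0 and 0 elsewhere
    # (the value used at exactly Z == 0 is a convention).
    return dL_dY * (Z > 0)

Z = np.array([[-1.0, 2.0], [0.5, -3.0]])
dL_dY = np.array([[10.0, 20.0], [30.0, 40.0]])
dL_dZ = relu_backward(dL_dY, Z)
# Derivatives pass through only where Z was positive:
# [[ 0., 20.],
#  [30.,  0.]]
```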

## Backward Computation: Conv Layer


As mentioned at the beginning, during the backward computation, we have two tasks.

Given $\frac{\partial \mathcal{L}}{\partial Z(i^{'}, j^{'}, k)}$ of layer $l$:

- Task 1: Compute $\frac{\partial \mathcal{L}}{\partial Y(i, j, k_{in})}$ of layer $l-1$
- Task 2: Compute $\frac{\partial \mathcal{L}}{\partial W(i^{''}, j^{''}, k_{in}, k)}$ of layer $l$





### Task 1: Compute $\frac{\partial \mathcal{L}}{\partial Y(i, j, k_{in})}$ of layer $l-1$


First, let's illustrate the big picture. Consider the following figure.

- Input Feature Maps $Y$ in layer $l - 1$: We have $n_k$ channels in the input feature map $Y$ of layer $l - 1$. An arbitrary input channel, i.e., an input feature map, is denoted by $Y(k_{in})$.

- Filters $W$ in layer $l$: We have $f_k$ 3D filters, each with $n_k$ channels. In the filter box in the figure, each column represents a single filter with $n_k$ channels. A filter number is denoted by $k$. Thus, an arbitrary filter is denoted by its filter number $k$ and the channel number $k_{in}$.

- Output Feature Maps $Z$ in layer $l$: There are $f_k$ output feature maps. I.e., there are $f_k$ output channels. An arbitrary output map is denoted by $Z(k)$.

- Activated Output Feature Maps $Y$ in layer $l$: There are $f_k$ activated output feature maps. An arbitrary activated output map is denoted by $Y(k)$.


<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_1.png" width=800, height=400>

From the backward computation of layer $l$ we already have $\frac{\partial \mathcal{L}}{\partial Z(i^{'}, j^{'}, k)}$. Observe that the loss $\mathcal{L}$ propagates backwards to layer $l - 1$ maps via the maps of layer $l$. Using the loss derivative $\frac{\partial \mathcal{L}}{\partial Z(i^{'}, j^{'}, k)}$, we want to compute $\frac{\partial \mathcal{L}}{\partial Y(i, j, k_{in})}$ for layer $l-1$.

<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_2.png" width=800, height=400>

We will do this computation progressively. Consider the above figure. We focus on an arbitrary input map $Y(k_{in})$ of layer $l - 1$, which after convolving with the filter weights $W(k_{in}, k)$ of layer $l$ produces the output map $Z(k)$ in layer $l$. We will compute how a small change in an input map value $Y(i, j, k_{in})$ in layer $l - 1$ influences:
- Step 1: A single output map value $Z(i^{'}, j^{'}, k)$ of an arbitrary map in layer $l$
- Step 2: All output map values $Z(*, *, k)$ of an arbitrary map in layer $l$
- Step 3: All output map values $Z(*, *, *)$ of all maps in layer $l$

These three steps are summarized below.


<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_2a.png" width=600, height=300>

<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_2b.png" width=500, height=250>




### Step 1: $Y(i, j, k_{in})$ in layer $l−1$ influences a single output map value $Z(i^{'}, j^{'}, k)$ of an arbitrary map in layer $l$



Using the chain rule of calculus, we write the following influence equation:

$\frac{\partial \mathcal{L}}{\partial Y(i, j, k_{in})} = \frac{\partial \mathcal{L}}{\partial Z(i^{'}, j^{'}, k)}\times \frac{\partial Z(i^{'}, j^{'}, k)}{\partial Y(i, j, k_{in})}$ 


Here $\frac{\partial \mathcal{L}}{\partial Z(i^{'}, j^{'}, k)}$ of layer $l$ is given. Thus, we need to compute $\frac{\partial Z(i^{'}, j^{'}, k)}{\partial Y(i, j, k_{in})}$. 

We focus on a single input map value $Y(i, j, k_{in})$ in layer $l−1$ and determine how it influences a single output map value $Z(i^{'}, j^{'}, k)$ of an arbitrary map in layer $l$.


<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_3.png" width=800, height=400>

Observe that each $Y(i, j, k_{in})$ in layer $l−1$ influences several $Z(i^{'}, j^{'}, k)$ of an arbitrary map in layer $l$. Let's compute how each $Y(i, j, k_{in})$ in layer $l−1$ influences various locations of $Z(k)$ in layer $l$. 

<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_4.png" width=500, height=300>

Assume that the indexing on the map $Y(i, j, k_{in})$ begins from $0, 0$ on the top left corner. We focus on a particular cell on $Y$ at location $(2, 2)$. We will compute how the map value in this location influences several output map locations in $Z$.

<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_5.png" width=600, height=400>

Consider the following figure. On the top-left figure we put the 3 x 3 filter on the 5 x 5 input map $Y$ in layer $l - 1$. After convolving, it produces a single output on the $0, 0$ location of the map $Z$ in layer $l$.

Carefully observe that the dark green weight value of the filter (bottom-right cell on the filter W) is convolved with the input map cell at $(2, 2)$ of $Y$ in layer $l - 1$. Thus, the input $Y$ map cell at $(2, 2)$ influences the output $Z$ map cell at $(0, 0)$ via the weight value at $(2, 2)$ location in the filter weight matrix. This relationship can be captured via the following equation.

$Z(0, 0, k) = Y(2, 2, k_{in})W(2, 2, k_{in}, k)$

Now move the filter one stride to the right (top-right figure). Observe that the input $Y$ map value at $(2, 2)$ is now multiplied by a different weight value, i.e., the one at location $(1, 2)$ of the weight matrix. It contributes to the output map value on $Z$ at location $(1, 0)$, which is captured in the following equation. 


$Z(1, 0, k) = Y(2, 2, k_{in})W(1, 2, k_{in}, k)$

Notice that we keep our focus fixed on the $Y$ input map value at $(2, 2)$. We simply show how a single input value changes multiple locations on the output as we slide the convolutional filter.

<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_6.png" width=1000, height=600>

<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_7.png" width=1000, height=600>




<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_8.png" width=500, height=300>


The last figure above shows 9 locations on the output feature map $Z$ in layer $l$ that are influenced by a single location $(2, 2)$ of the input feature map $Y$ in layer $l - 1$ via the 9 weight values of the layer $l$ filter matrix $W$.

Finally, based on the above illustration, we can derive a general formula to compute how a single input feature map value $Y(i, j, k_{in})$ in layer $l−1$ influences a single output map value $Z(i^{'}, j^{'}, k)$ of an arbitrary map in layer $l$.

In the following figure see that the coordinates of $Z$ and $W$ sum to the coordinates of $Y$. Thus, we can write:

- $Z(i^{'}, j^{'}, k) = Y(i, j, k_{in}) W(i - i^{'}, j - j^{'}, k_{in}, k)$
 
 
Based on this equation we can trivially compute how much influence an incremental change in an input feature map $Y(i, j, k_{in})$ in layer $l−1$ will have on a single output map value $Z(i^{'}, j^{'}, k)$ of an arbitrary map in layer $l$.
 
- $\frac{\partial Z(i^{'}, j^{'}, k)}{\partial Y(i, j, k_{in})} = W(i - i^{'}, j - j^{'}, k_{in}, k)$ 

<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_9.png" width=600, height=400>
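The Step 1 relation can be checked numerically. The sketch below assumes a single input channel, a single filter, stride 1, and no padding; `conv_forward` is a hypothetical helper written for this check. It bumps the input value at $(2, 2)$ by 1 and compares the change in every $Z$ value with the weight predicted by $W(i - i^{'}, j - j^{'})$.

```python
import numpy as np

def conv_forward(Y, W):
    """Valid, stride-1 convolution (cross-correlation) of one map with one filter plane."""
    fh, fw = W.shape
    out_h, out_w = Y.shape[0] - fh + 1, Y.shape[1] - fw + 1
    Z = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            Z[i, j] = np.sum(Y[i:i + fh, j:j + fw] * W)
    return Z

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 5))
W = rng.normal(size=(3, 3))

# Bump the input value at (2, 2) by 1 and record how every Z value changes.
Y_bumped = Y.copy()
Y_bumped[2, 2] += 1.0
delta_Z = conv_forward(Y_bumped, W) - conv_forward(Y, W)

# Step 1 predicts delta_Z[i', j'] = W[2 - i', 2 - j'].
predicted = np.array([[W[2 - ip, 2 - jp] for jp in range(3)] for ip in range(3)])
print(np.allclose(delta_Z, predicted))
```

Note that `predicted` is just the filter flipped along both axes, which anticipates the flipped-filter view of the backward pass.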





### Step 2: $Y(i, j, k_{in})$ in layer $l−1$ influences all output map values $Z(*, *, k)$ of an arbitrary map in layer $l$


<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_2c.png" width=600, height=300>

For step 2, we simply sum over all locations of the output $Z$ of an arbitrary map $k$ in layer $l$.

- $\frac{\partial Z(i^{'}, j^{'}, k)}{\partial Y(i, j, k_{in})} = \sum_{i^{'}}\sum_{j^{'}}W(i - i^{'}, j - j^{'}, k_{in}, k)$ 


### Step 3: $Y(i, j, k_{in})$ in layer $l−1$ influences all output map values $Z(*, *, *)$ of all maps in layer $l$


<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_2b.png" width=600, height=300>

For step 3, we sum over all $Z$ maps in layer $l$.

- $\frac{\partial Z(i^{'}, j^{'}, k)}{\partial Y(i, j, k_{in})} = \sum_{k}\sum_{i^{'}}\sum_{j^{'}}W(i - i^{'}, j - j^{'}, k_{in}, k)$ 




### Task 1: Mathematical Formula to Compute $\frac{\partial \mathcal{L}}{\partial Y(i, j, k_{in})}$ of layer $l-1$

Given $\frac{\partial \mathcal{L}}{\partial Z(i^{'}, j^{'}, k)}$ of layer $l$, now we are ready to compute $\frac{\partial \mathcal{L}}{\partial Y(i, j, k_{in})}$ of layer $l-1$:

$\frac{\partial \mathcal{L}}{\partial Y(i, j, k_{in})} = \frac{\partial \mathcal{L}}{\partial Z(i^{'}, j^{'}, k)} \times \frac{\partial Z(i^{'}, j^{'}, k)}{\partial Y(i, j, k_{in})}$

=> $\frac{\partial \mathcal{L}}{\partial Y(i, j, k_{in})} = \sum_{k}\sum_{i^{'}}\sum_{j^{'}}\frac{\partial \mathcal{L}}{\partial Z(i^{'}, j^{'}, k)} \times W(i - i^{'}, j - j^{'}, k_{in}, k)$
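The triple sum can be implemented literally with nested loops. The following is a minimal sketch, assuming stride 1, no padding, and the array layouts `Y: (height, width, n_k)` and `W: (f_h, f_w, n_k, f_k)`; the function names and the toy loss $\mathcal{L} = \sum Z \cdot G$ (so that $G$ plays the role of $\frac{\partial \mathcal{L}}{\partial Z}$) are assumptions made for the check. The bias is omitted because it does not affect this derivative.

```python
import numpy as np

def conv_forward(Y, W):
    """Stride-1, unpadded forward pass. Y: (H, Wd, n_k); W: (f_h, f_w, n_k, f_k)."""
    fh, fw, _, fk = W.shape
    out_h, out_w = Y.shape[0] - fh + 1, Y.shape[1] - fw + 1
    Z = np.zeros((out_h, out_w, fk))
    for i in range(out_h):
        for j in range(out_w):
            patch = Y[i:i + fh, j:j + fw, :]             # (f_h, f_w, n_k)
            Z[i, j, :] = np.tensordot(patch, W, axes=3)  # sum over f_h, f_w, n_k
    return Z

def conv_backward_input_direct(dL_dZ, W, in_shape):
    """dL/dY(i,j,k_in) = sum_k sum_i' sum_j' dL/dZ(i',j',k) * W(i-i', j-j', k_in, k)."""
    fh, fw, _, _ = W.shape
    dL_dY = np.zeros(in_shape)
    for i in range(in_shape[0]):
        for j in range(in_shape[1]):
            for ip in range(dL_dZ.shape[0]):
                for jp in range(dL_dZ.shape[1]):
                    a, b = i - ip, j - jp
                    if 0 <= a < fh and 0 <= b < fw:      # skip indices outside the filter
                        # (n_k, f_k) @ (f_k,) sums over the output maps k
                        dL_dY[i, j, :] += W[a, b, :, :] @ dL_dZ[ip, jp, :]
    return dL_dY

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 5, 2))
W = rng.normal(size=(3, 3, 2, 4))
G = rng.normal(size=(3, 3, 4))        # stands in for dL/dZ, with L = sum(Z * G)
dL_dY = conv_backward_input_direct(G, W, Y.shape)

# Finite-difference check at one input location.
eps = 1e-6
Y_plus, Y_minus = Y.copy(), Y.copy()
Y_plus[2, 3, 1] += eps
Y_minus[2, 3, 1] -= eps
numeric = (np.sum(conv_forward(Y_plus, W) * G)
           - np.sum(conv_forward(Y_minus, W) * G)) / (2 * eps)
print(abs(numeric - dL_dY[2, 3, 1]) < 1e-5)
```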


### Task 1: Implementation of $\frac{\partial \mathcal{L}}{\partial Y(i, j, k_{in})}$ at layer $l - 1$

The derivatives for every element of every $Y$ map at layer $l - 1$ can be computed by the direct implementation of the above formula.

We will derive the pseudocode for implementing this formula. We will show that each derivative term $\frac{\partial \mathcal{L}}{\partial Y(i, j, k_{in})}$ can be obtained by convolving the loss derivatives w.r.t. all output maps $Z(i^{'}, j^{'}, k)$ with the flipped filter $W$. This is called the **backward convolution**.

Let's discuss why we need to flip the filter to perform convolution for the backward loss computation at $Y$.

### Convolution for Computing $\frac{\partial \mathcal{L}}{\partial Y(i, j, k_{in})}$  by Flipping the Filter 


Consider the following figure. Recall what we did during the forward convolution of the input map $Y$ with the filter $W$. As an example, on the top-left figure, we put the 3 x 3 $W$ filter of layer $l$ on the 5 x 5 input map $Y$ in layer $l - 1$. After convolving it produces a single output on the $0, 0$ location of the map $Z$ in layer $l$. 

During backward propagation, we convolve the matrix of loss derivative w.r.t. the output map $Z(k)$ with the filter $W$. But this time each $Z$ location needs to be convolved with the flipped positions of the filter. 

To understand the backward convolution, let's take a look at the following figure that shows forward convolution of $Y$ with $\pmb{W}$. The input $Y$ map cell at (2, 2) influences the output $Z$ map cell at (0,0) via the weight value at (2, 2) location in the filter weight matrix $\pmb{W}$. 

<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_Task1_Forward.png" width=600, height=400>


The above illustration can also be used to explain backward convolution, as shown below. Assume that now the $Z$ matrix represents the loss derivatives w.r.t. $Z$, and the $Y$ matrix represents the loss derivatives w.r.t. $Y$. The $Y$ derivatives are produced by the convolution of the $Z$ derivative matrix with the flipped filter $W$. In the figure below, the $(0, 0)$ location of the $Z$ derivative matrix is not multiplied by the $(0, 0)$ location of the filter. Instead, the $(0, 0)$ position of $Z$ is multiplied by the $(2, 2)$ location of the filter. In other words, we need to flip the filter (first vertically, then horizontally) to perform this convolution. This backward convolution can be captured via the following equation.

$\frac{\partial \mathcal{L}}{\partial Y(2, 2, k_{in})} = \frac{\partial \mathcal{L}}{\partial Z(0, 0, k)} \times W(2, 2, k_{in}, k)$

<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_Task1_Backward.png" width=600, height=400>

Now consider the next cell on the $Z$ derivative matrix in the top-right figure below. The contribution to the loss derivative w.r.t. the input map $Y(k_{in})$ at location $(2, 2)$ is computed by multiplying the $(1, 0)$ location of the $Z(k)$ derivative matrix by the weight value at location $(1, 2)$. Again, this backward convolution requires the flipped weight matrix. 

$\frac{\partial \mathcal{L}}{\partial Y(2, 2, k_{in})} = \frac{\partial \mathcal{L}}{\partial Z(1, 0, k)} \times W(1, 2, k_{in}, k)$

This way we can compute a single $Y$ derivative value at the location $(2, 2)$ by convolving the $Z$ derivative matrix with the flipped filter. We will use this illustration to define an implementation technique for the backward convolution.



<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_10.png" width=1000, height=600>

<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_11.png" width=1000, height=600>

<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_12.png" width=500, height=250>


Thus, to compute the loss derivative at $(2, 2)$ location of $Y(k_{in})$ in layer $l - 1$, we need to take the sum of the element-wise product of the elements of various locations on the filter and various locations on the derivative matrix of $Z(k)$.

- How are these two locations related?

Observing the following 9 equations for individual element-wise products, we see that for each $Z$ location, we multiply each $Z$ derivative with the weight values on the flipped locations of the filter.


<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_14.png" width=400, height=200>


#### How exactly is the filter flipped and the backward convolution performed?

The backward convolution in the above example can be implemented by flipping the filter as shown below. The filter is flipped first vertically, then horizontally. 


<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_13a.png" width=600, height=300>


Then, we put the flipped filter on the $Z$ derivative map with its bottom right square at the location $(2, 2)$. It indicates that we need to perform convolution on the $Z$ derivative matrix by using the flipped filter.

<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_13.png" width=600, height=300>



## Backward Convolution using Flipped Filter: Single $Z$ Map

The loss derivative w.r.t. $Y(i, j, k_{in})$ in layer $l - 1$ is the sum of the sums obtained from all output $Z$ maps in layer $l$. 

First, we show this backward convolution using a single output map $Z(k)$. We need to do two things before performing convolution.
- Flip the filter $f_h \times f_w$ at layer $l$ (first vertically, then horizontally)
- Zero-pad the matrix of loss derivative w.r.t $Z(k)$ at layer $l$ by adding $f_h - 1$ rows on top and bottom, and $f_w - 1$ columns on the left and right.

<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_15.png" width=600, height=300>

Then, perform convolution using the flipped filter on the zero-padded $Z(k)$ derivative matrix. Here we used stride 1. As the following figure illustrates, the $Y(k_{in})$ derivative values are computed via convolution for each location on the input map $Y(k_{in})$ matrix in layer $l - 1$.

<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_16.png" width=1000, height=600>
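The two preparation steps and the stride-1 convolution can be sketched as follows for a single $Z(k)$ map and a single filter plane. The function name and the toy values are assumptions made for illustration.

```python
import numpy as np

def backward_conv_single_map(dL_dZ, W):
    """Full convolution of one dL/dZ map with one flipped filter plane (stride 1)."""
    fh, fw = W.shape
    W_flipped = W[::-1, ::-1]                    # flip vertically, then horizontally
    # Zero-pad f_h - 1 rows on top and bottom, f_w - 1 columns on left and right.
    padded = np.pad(dL_dZ, ((fh - 1, fh - 1), (fw - 1, fw - 1)))
    out_h = dL_dZ.shape[0] + fh - 1
    out_w = dL_dZ.shape[1] + fw - 1
    dL_dY = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            dL_dY[i, j] = np.sum(padded[i:i + fh, j:j + fw] * W_flipped)
    return dL_dY

W = np.arange(1.0, 10.0).reshape(3, 3)   # weights 1..9, so each location is identifiable
dL_dZ = np.ones((3, 3))                  # every Z derivative set to 1 for readability
dL_dY = backward_conv_single_map(dL_dZ, W)

print(dL_dY.shape)   # (5, 5): the size of the original input map
print(dL_dY[2, 2])   # 45.0: the centre input cell is touched by all nine weights
print(dL_dY[0, 0])   # 1.0: the top-left input cell only ever meets W(0, 0)
```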


### Convolution using Flipped Filter: All $Z$ Maps

Now we will add the contribution from all $Z$ maps in layer $l$ to compute the loss derivative w.r.t. $Y(i, j, k_{in})$ in layer $l - 1$.

Consider the following figure. The $Y$ derivative for input channel $k_{in}$ at the location $(i, j)$ is computed by the convolution of the $f_k$ filters of layer $l$ with the $f_k$ number of $Z$ derivative maps of layer $l$. During the forward convolution, the $k_{in}$  channel of the input $Y$ always convolves with the $k_{in}$ plane of the filter. Thus, the $k_{in}$  channel of the $Y$ derivative will invoke the $k_{in}$ plane of all filters.


<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_17.png" width=800, height=600>

In the following figure we show how a single $Y$ derivative value is computed by convolving the $k_{in}$ plane of all the flipped filters over all $Z$ derivative maps.

<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_18.png" width=800, height=600>

The following figure shows the backward convolution process. Observe that we convolve the zero-padded $\frac{\partial \mathcal{L}}{\partial Z(i, j, k)}$ maps with the flipped filters. We keep computing elements of the $Y$ derivative map as long as the flipped filter has at least one element inside the unpadded $Z$ derivative map.

The size of the $Y$ derivative map will be: $(n_h + f_h - 1) \times (n_w + f_w - 1)$, where $n_h$ and $n_w$ are the height and width of the $Z$ map, respectively. Notice that the size of the $Y$ derivative map recovers the original size of the $Y$ map.

<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_19.png" width=1000, height=800>
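Extending the flipped-filter convolution to all $f_k$ derivative maps only adds a sum over $k$. A minimal sketch, assuming stride 1 and the array layouts `dL_dZ: (n_h, n_w, f_k)` and `W: (f_h, f_w, n_k, f_k)` (the layouts and the function name are assumptions):

```python
import numpy as np

def backward_conv_all_maps(dL_dZ, W):
    """Returns dL/dY with shape (n_h + f_h - 1, n_w + f_w - 1, n_k)."""
    fh, fw, nk, _ = W.shape
    nh, nw, _ = dL_dZ.shape
    W_flipped = W[::-1, ::-1, :, :]               # flip only the two spatial axes
    padded = np.pad(dL_dZ, ((fh - 1, fh - 1), (fw - 1, fw - 1), (0, 0)))
    dL_dY = np.zeros((nh + fh - 1, nw + fw - 1, nk))
    for i in range(dL_dY.shape[0]):
        for j in range(dL_dY.shape[1]):
            window = padded[i:i + fh, j:j + fw, :]   # (f_h, f_w, f_k)
            # Sum over the window and over all f_k maps; k_in (index c) is kept.
            dL_dY[i, j, :] = np.einsum('abk,abck->c', window, W_flipped)
    return dL_dY

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3, 2, 4))      # f_h = f_w = 3, n_k = 2, f_k = 4
dL_dZ = rng.normal(size=(3, 3, 4))     # n_h = n_w = 3
dL_dY = backward_conv_all_maps(dL_dZ, W)

# Spot check against the summation formula at (i, j, k_in) = (2, 2, 1):
# all nine Z locations contribute, via W(2 - i', 2 - j', 1, k).
expected = sum(dL_dZ[ip, jp, k] * W[2 - ip, 2 - jp, 1, k]
               for ip in range(3) for jp in range(3) for k in range(4))
print(dL_dY.shape, np.isclose(dL_dY[2, 2, 1], expected))
```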

## Backward Convolution: Input Map with Zero-Padding

If the $Y$ map was zero-padded in the forward convolution, the derivative map will be the size of the zero-padded $Y$. However, the zero padding regions need to be removed before further backward computation.
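Removing the padded border is a simple slice. The sketch below assumes symmetric padding (the same number of rows on top and bottom, and of columns on left and right); the helper name `remove_padding` is hypothetical.

```python
import numpy as np

def remove_padding(dL_dY_padded, pad_h, pad_w):
    """Drop pad_h rows (top and bottom) and pad_w columns (left and right)."""
    h, w = dL_dY_padded.shape[:2]
    return dL_dY_padded[pad_h:h - pad_h, pad_w:w - pad_w]

# Derivative w.r.t. a 5 x 5 map that was zero-padded by 1 on every side (7 x 7).
dL_dY_padded = np.arange(49.0).reshape(7, 7)
dL_dY = remove_padding(dL_dY_padded, pad_h=1, pad_w=1)
print(dL_dY.shape)   # (5, 5)
```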

## Backward Convolution: Stride > 1


When the stride is greater than 1, some positions of $Y(k_{in})$ contribute to more locations on the $Z(k)$ maps than others. Thus we must adjust our formula for the loss derivative w.r.t. $Y$ to account for stride > 1.




<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_20.png" width=600, height=400>

When we convolve with stride > 1, the map gets downsampled. In other words, every side of the $Z$ map becomes smaller by a factor of the stride. This is because we don't compute all the terms that we would compute with stride 1. To understand this, consider the following figure.

We see that convolution with stride > 1 is equivalent to performing convolution with stride 1 and then keeping only every stride-th row and column, i.e., dropping stride - 1 out of every stride rows and stride - 1 out of every stride columns. For example, in the following figure the convolved values with stride 2 are equivalent to the convolved values with stride 1, except that we drop every second entry. This insight will be useful for computing the $Z$ derivative map values with stride 2. 


<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_21.png" width=600, height=400>
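This equivalence is easy to confirm numerically. The sketch below assumes a single channel and no padding; the 7 x 7 input size and the helper `conv_forward` are choices made for the illustration.

```python
import numpy as np

def conv_forward(Y, W, stride=1):
    """Unpadded convolution (cross-correlation) of one map with one filter plane."""
    fh, fw = W.shape
    out_h = (Y.shape[0] - fh) // stride + 1
    out_w = (Y.shape[1] - fw) // stride + 1
    Z = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            Z[i, j] = np.sum(Y[r:r + fh, c:c + fw] * W)
    return Z

rng = np.random.default_rng(0)
Y = rng.normal(size=(7, 7))
W = rng.normal(size=(3, 3))
Z_stride1 = conv_forward(Y, W, stride=1)    # 5 x 5
Z_stride2 = conv_forward(Y, W, stride=2)    # 3 x 3

# Keeping every second row and column of the stride-1 result gives the stride-2 result.
print(np.allclose(Z_stride1[::2, ::2], Z_stride2))
```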

During backpropagation, we get the loss derivatives w.r.t. the elements of the downsampled (stride > 1) $Z$ map. We create a full-sized $Z$ derivative map (i.e., equivalent to the size when stride 1 is used). Then, we place the derivative values into their original locations of the full-sized map, as shown below.

Notice that the remaining entries of the full-sized $Z$ map don't affect the loss because they were dropped during the forward computation. Thus, the loss derivatives w.r.t. these entries are zero.

<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_22.png" width=600, height=400>

Thus, when stride > 1 is used during forward computation, we need to adjust the loss derivative calculation w.r.t. the $Y$ map as follows:
- Upsample the downsampled $Z$ derivatives
- Insert zeros in the empty slots

Then, we get the loss derivatives w.r.t. all the entries of a full-sized $Z$ map. These values backpropagate for computing the loss derivative w.r.t. the $Y$ map in the previous layer.
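The upsampling step can be sketched with strided slicing. The helper name `upsample_with_zeros` and the example values are assumptions; `full_shape` is the size the $Z$ map would have had with stride 1.

```python
import numpy as np

def upsample_with_zeros(dL_dZ, stride, full_shape):
    """Place strided derivatives back at their stride-1 locations; the rest stay zero."""
    full = np.zeros(full_shape)
    full[::stride, ::stride] = dL_dZ
    return full

dL_dZ = np.array([[1.0, 2.0],
                  [3.0, 4.0]])           # derivatives from a stride-2 forward pass
full = upsample_with_zeros(dL_dZ, stride=2, full_shape=(3, 3))
# full is:
# [[1., 0., 2.],
#  [0., 0., 0.],
#  [3., 0., 4.]]
```

The resulting full-sized map can then be used exactly like a stride-1 derivative map in the flipped-filter convolution.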


## Backward Convolution: Task 1 Pseudocode for Computing $\frac{\partial \mathcal{L}}{\partial Y(i^{''}, j^{''}, k_{in})}$


Given $\frac{\partial \mathcal{L}}{\partial Z(i, j, k)}$ of layer $l$ we compute $\frac{\partial \mathcal{L}}{\partial Y(i^{''}, j^{''}, k_{in})}$ of layer $l-1$ using the following pseudocode.

<img src="http://engineering.unl.edu/images/uploads/CNN_Backprop_Conv_dY.png" width=800, height=600>
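For comparison, here is a compact NumPy sketch of Task 1 that also handles stride > 1. It is not a transcription of the pseudocode in the figure: it uses an equivalent "scatter" formulation, in which each $\frac{\partial \mathcal{L}}{\partial Z(i, j, k)}$ is distributed back onto the input window that produced $Z(i, j, k)$. No padding is assumed, and the function names, array layouts, and toy loss $\mathcal{L} = \sum Z \cdot G$ are assumptions made for the check.

```python
import numpy as np

def conv_forward(Y, W, stride=1):
    """Y: (H, Wd, n_k); W: (f_h, f_w, n_k, f_k). No padding, bias omitted."""
    fh, fw, _, fk = W.shape
    out_h = (Y.shape[0] - fh) // stride + 1
    out_w = (Y.shape[1] - fw) // stride + 1
    Z = np.zeros((out_h, out_w, fk))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            Z[i, j, :] = np.tensordot(Y[r:r + fh, c:c + fw, :], W, axes=3)
    return Z

def conv_backward_input(dL_dZ, W, in_shape, stride=1):
    """Scatter each dL/dZ(i, j, :) back onto the input window that produced Z(i, j, :)."""
    fh, fw, _, _ = W.shape
    dL_dY = np.zeros(in_shape)
    for i in range(dL_dZ.shape[0]):
        for j in range(dL_dZ.shape[1]):
            r, c = i * stride, j * stride
            # W @ dL_dZ[i, j] sums over the f_k output maps -> (f_h, f_w, n_k)
            dL_dY[r:r + fh, c:c + fw, :] += W @ dL_dZ[i, j, :]
    return dL_dY

rng = np.random.default_rng(1)
Y = rng.normal(size=(7, 7, 2))
W = rng.normal(size=(3, 3, 2, 3))
G = rng.normal(size=(3, 3, 3))          # plays the role of dL/dZ, with L = sum(Z * G)
analytic = conv_backward_input(G, W, Y.shape, stride=2)

# Central finite differences over every input entry.
eps = 1e-6
numeric = np.zeros_like(Y)
for idx in np.ndindex(*Y.shape):
    Yp, Ym = Y.copy(), Y.copy()
    Yp[idx] += eps
    Ym[idx] -= eps
    numeric[idx] = (np.sum(conv_forward(Yp, W, 2) * G)
                    - np.sum(conv_forward(Ym, W, 2) * G)) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-5))
```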

## Backward Computation for Conv Layer

Let's go back to the backward computation for the Conv layer. We wanted to accomplish two tasks.

Given $\frac{\partial \mathcal{L}}{\partial Z(i, j, k)}$ of layer $l$:

- Task 1: Compute $\frac{\partial \mathcal{L}}{\partial Y(i^{''}, j^{''}, k_{in})}$ of layer $l-1$
- Task 2: Compute $\frac{\partial \mathcal{L}}{\partial W(i^{'}, j^{'}, k_{in}, k)}$ of layer $l$

We have completed Task 1. Let's turn to Task 2 now.


### Task 2: Compute $\frac{\partial \mathcal{L}}{\partial W(i^{'}, j^{'}, k_{in}, k)}$ of layer $l$

To compute the loss derivative for the filters, first, we apply the chain rule of calculus to write the following influence equation.


$\frac{\partial \mathcal{L}}{\partial W(i^{'}, j^{'}, k_{in}, k)} = \frac{\partial \mathcal{L}}{\partial Z(i, j, k)}\times \frac{\partial Z(i, j, k)}{\partial W(i^{'}, j^{'}, k_{in}, k)}$ 


We already have $\frac{\partial \mathcal{L}}{\partial Z(i, j, k)}$ for the Conv layer. Thus, we only need to calculate $\frac{\partial Z(i, j, k)}{\partial W(i^{'}, j^{'}, k_{in}, k)}$. In other words, we need to compute how changing a filter's weight influences an output map value. 


We will do this computation progressively. We focus on an arbitrary input map $Y(k_{in})$ of layer $l - 1$, which after convolving with the filter weights $W(k_{in}, k)$ of layer $l$ produces the output map $Z(k)$ in layer $l$. We will compute how a small change in a filter's weight $W(i^{'}, j^{'}, k_{in}, k)$ in layer $l$ influences:
- Step 1: A single output map value $Z(i, j, k)$ of an arbitrary map in layer $l$
- Step 2: Several output map values $Z(k)$ of an arbitrary map in layer $l$



### Step 1: Small change in $W(i^{'}, j^{'}, k_{in}, k)$ influences a single output map value $Z(i, j, k)$ of an arbitrary map in layer $l$


We want to know how modifying a filter weight changes one specific output map value. Consider the following figure. All weights of the 3 x 3 filter are zero. After convolving the 3 x 3 input region, the filter produces a zero value at the output map. 

<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_23.png" width=600, height=400>

In the top figure below, as we increase the weight at the center of the filter by 1, the $Z$ map value increases by the value of the corresponding input map value. Then, in the bottom figure, we increase the bottom-right weight value by 1, which increments the output map value by the amount of the corresponding input map value.


<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_24.png" width=600, height=800>



From this illustration, we see that the derivative of a specific output map value $Z(i, j, k)$ w.r.t. a specific filter weight $W(i^{'}, j^{'}, k_{in}, k)$ is just the corresponding input map value.

We corroborate this fact using another observation.


A single weight affects several output map values. This is illustrated below. We convolve the 3 x 3 filter $W$ over a 5 x 5 input map $Y$ using stride 1. We keep our focus fixed on a specific weight value at the center of the filter at coordinate $(1, 1)$. Observe how this single weight value influences the 9 map values at the output $Z$.


<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_25.png" width=1000, height=600>


The above illustration suggests that the derivative of a specific output map value $Z(i, j, k)$ w.r.t. a specific filter weight $W(i^{'}, j^{'}, k_{in}, k)$ is just the corresponding input map value. Let's write the equation for convolution and see how this works.


$Z(i, j, k) = b_k + \sum_{i^{'}=0}^{f_h-1}\sum_{j^{'}=0}^{f_w-1}\sum_{k_{in}=0}^{n_k-1}Y(i\times s_h + i^{'}, j\times s_w + j^{'}, k_{in}) * W(i^{'}, j^{'}, k_{in}, k)$

We focus on just one input map value:


$Z(i, j, k) = b_k + Y(i\times s_h + i^{'},j\times s_w + j^{'}, k_{in}) * W(i^{'}, j^{'}, k_{in}, k)$

Then, taking its derivative w.r.t. $W(i^{'}, j^{'}, k_{in}, k)$, we get:

$\frac{\partial Z(i, j, k)}{\partial W(i^{'}, j^{'}, k_{in}, k)} = Y(i\times s_h + i^{'},j\times s_w + j^{'}, k_{in})$
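This derivative can be checked numerically in the same spirit as the figures: bump one weight by 1 and see how each output value changes. The sketch assumes a single channel, stride 1, and no padding; `conv_forward` is a hypothetical helper.

```python
import numpy as np

def conv_forward(Y, W):
    """Valid, stride-1 convolution (cross-correlation) of one map with one filter plane."""
    fh, fw = W.shape
    out_h, out_w = Y.shape[0] - fh + 1, Y.shape[1] - fw + 1
    Z = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            Z[i, j] = np.sum(Y[i:i + fh, j:j + fw] * W)
    return Z

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 5))
W = rng.normal(size=(3, 3))

ip, jp = 1, 1                       # the centre weight of the filter
W_bumped = W.copy()
W_bumped[ip, jp] += 1.0
delta_Z = conv_forward(Y, W_bumped) - conv_forward(Y, W)

# With stride 1, the formula predicts delta_Z[i, j] = Y[i + i', j + j'].
predicted = Y[ip:ip + 3, jp:jp + 3]
print(np.allclose(delta_Z, predicted))
```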




### Step 2: Small change in $W(i^{'}, j^{'}, k_{in}, k)$ influences several output map values $Z(k)$ of an arbitrary map in layer $l$

From the above illustration, we see that a single weight value influences several map values at the output. Also, note that every value of the output map contributes to the loss. To understand this, consider the following figure. The single weight value at the center of the filter influences several $Z$ values, and consequently all $Z$ values influence the loss.


<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_26.png" width=600, height=600>

Thus, to determine the derivative of loss w.r.t. a single weight value $W(i^{'}, j^{'}, k_{in}, k)$, we must sum over all $Z(i, j, k)$ terms it influences.



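$\frac{\partial Z(i, j, k)}{\partial W(i^{'}, j^{'}, k_{in}, k)} = \sum_{i=0}^{n_h - 1} \sum_{j=0}^{n_w - 1}Y(i\times s_h + i^{'},j\times s_w + j^{'}, k_{in})$

Here the sums run over the locations of the output map, so $n_h$ and $n_w$ are the height and width of the $Z$ map. In the running example (5 x 5 input, 3 x 3 filter, stride 1) the $Z$ map is 3 x 3, which happens to equal the filter size.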




### Task 2: Mathematical Formula for Computing $\frac{\partial \mathcal{L}}{\partial W(i^{'}, j^{'}, k_{in}, k)}$ of layer $l$

Previously we have shown that the loss derivative for the filters is given by:

$\frac{\partial \mathcal{L}}{\partial W(i^{'}, j^{'}, k_{in}, k)} = \frac{\partial \mathcal{L}}{\partial Z(i, j, k)}\times \frac{\partial Z(i, j, k)}{\partial W(i^{'}, j^{'}, k_{in}, k)}$ 


=> $\frac{\partial \mathcal{L}}{\partial W(i^{'}, j^{'}, k_{in}, k)} = \sum_{i=0}^{n_h - 1} \sum_{j=0}^{n_w - 1}\frac{\partial \mathcal{L}}{\partial Z(i, j, k)}\times Y(i\times s_h + i^{'},j\times s_w + j^{'}, k_{in})$, where $n_h$ and $n_w$ are the height and width of the $Z$ map.

Notice that in the above formula we are computing element-wise product between the $Z$ derivative map  and the input $Y$ map values, followed by sum of all products. Therefore, the above formula is nothing but a convolution!



$\frac{\partial \mathcal{L}}{\partial W(i^{'}, j^{'}, k_{in}, k)} = \frac{\partial \mathcal{L}}{\partial Z(k)}* Y(k_{in})$
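A minimal sketch of this convolution, assuming stride 1, no padding, and the array layouts `Y: (height, width, n_k)`, `dL_dZ: (n_h, n_w, f_k)` (function names, layouts, and the toy loss $\mathcal{L} = \sum Z \cdot G$ are assumptions made for the check):

```python
import numpy as np

def conv_forward(Y, W):
    """Stride-1, unpadded forward pass. Y: (H, Wd, n_k); W: (f_h, f_w, n_k, f_k)."""
    fh, fw, _, fk = W.shape
    out_h, out_w = Y.shape[0] - fh + 1, Y.shape[1] - fw + 1
    Z = np.zeros((out_h, out_w, fk))
    for i in range(out_h):
        for j in range(out_w):
            Z[i, j, :] = np.tensordot(Y[i:i + fh, j:j + fw, :], W, axes=3)
    return Z

def conv_backward_weights(dL_dZ, Y, fh, fw):
    """Slide each dL/dZ(k) map over each input map Y(k_in) (stride 1, no padding)."""
    nh, nw, fk = dL_dZ.shape
    nk = Y.shape[2]
    dL_dW = np.zeros((fh, fw, nk, fk))
    for a in range(fh):
        for b in range(fw):
            window = Y[a:a + nh, b:b + nw, :]            # (n_h, n_w, n_k)
            # Element-wise product summed over all Z locations (i, j).
            dL_dW[a, b, :, :] = np.einsum('ijc,ijk->ck', window, dL_dZ)
    return dL_dW

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 5, 2))
W = rng.normal(size=(3, 3, 2, 4))
G = rng.normal(size=(3, 3, 4))          # plays the role of dL/dZ, with L = sum(Z * G)
dL_dW = conv_backward_weights(G, Y, 3, 3)

# Finite-difference check for one weight.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2, 0, 3] += eps
W_minus[1, 2, 0, 3] -= eps
numeric = (np.sum(conv_forward(Y, W_plus) * G)
           - np.sum(conv_forward(Y, W_minus) * G)) / (2 * eps)
print(abs(numeric - dL_dW[1, 2, 0, 3]) < 1e-5)
```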


Below we illustrate this backward convolution.

## Backward Convolution for Computing $\frac{\partial \mathcal{L}}{\partial W(i^{'}, j^{'}, k_{in}, k)}$

Consider the following figure. 

- Forward Propagation: The 3 x 3 filter $W$ convolves with the 5 x 5 input map $Y$ with stride 1 to produce the 3 x 3 output map $Z$. The $Z$ map values influence the loss.

- Backward Propagation: The loss derivatives $\frac{\partial \mathcal{L}}{\partial Z(k)}$ flow backward, and at each $Z$ map location we get the corresponding loss derivative value. Thus, during backward propagation of the loss, we get a 3 x 3 $Z$ derivative matrix.

<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_27.png" width=600, height=600>

For backward convolution to create the 3 x 3 filter derivative matrix $\frac{\partial \mathcal{L}}{\partial W(k)}$, we convolve the 3 x 3 $Z$ derivative matrix at layer $l$ over the 5 x 5 input map $Y$ at layer $l - 1$.

<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_28.png" width=600, height=600>


The following figure illustrates the backward convolution process for computing $\frac{\partial \mathcal{L}}{\partial W(k)}$. 

<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_29.png" width=800, height=800>

Note that during the backward propagation, the $k$-th $Z$ derivative map of layer $l$ convolves with each input map $Y(k_{in})$ of layer $l - 1$, which gives the loss derivative w.r.t. the weights $W(k_{in}, k)$ of layer $l$. These weights are associated with the $k$-th filter for the $k_{in}$ input channel.

<img src="http://engineering.unl.edu/images/uploads/CNN_Training_Conv_30.png" width=800, height=800>


### Task 2: Compute $\frac{\partial \mathcal{L}}{\partial W(i^{'}, j^{'}, k_{in}, k)}$ with stride > 1

So far, we assumed that the stride is 1. For stride > 1, we need to make the following adjustments.

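- The $\frac{\partial \mathcal{L}}{\partial Z(k)}$ map must be upsampled to its stride-1 size, with zeros inserted in the empty slots, before it is convolved with $Y(k_{in})$ to obtain $\frac{\partial \mathcal{L}}{\partial W(i^{'}, j^{'}, k_{in}, k)}$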
- If $Y(k_{in})$ was zero-padded during the forward propagation, it must be zero-padded for the backward calculation.
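For stride > 1, the weight derivative can be computed in two equivalent ways, sketched below for a single channel without padding. The first applies the strided summation formula directly; the second inserts zeros into the $Z$ derivative map to restore its stride-1 size and then slides it over $Y$ with stride 1. Function names and sizes are assumptions made for the illustration.

```python
import numpy as np

def dW_strided(dL_dZ, Y, fh, fw, stride):
    """Direct formula: dL/dW(i', j') = sum_{i,j} dL/dZ(i, j) * Y(i*s + i', j*s + j')."""
    nh, nw = dL_dZ.shape
    dL_dW = np.zeros((fh, fw))
    for a in range(fh):
        for b in range(fw):
            # Rows a, a+s, a+2s, ... and columns b, b+s, b+2s, ... of Y.
            window = Y[a:a + stride * nh:stride, b:b + stride * nw:stride]
            dL_dW[a, b] = np.sum(window * dL_dZ)
    return dL_dW

def dW_upsampled(dL_dZ, Y, fh, fw, stride):
    """Same result: insert zeros into dL/dZ, then slide it over Y with stride 1."""
    full_h, full_w = Y.shape[0] - fh + 1, Y.shape[1] - fw + 1
    full = np.zeros((full_h, full_w))
    full[::stride, ::stride] = dL_dZ
    dL_dW = np.zeros((fh, fw))
    for a in range(fh):
        for b in range(fw):
            dL_dW[a, b] = np.sum(Y[a:a + full_h, b:b + full_w] * full)
    return dL_dW

rng = np.random.default_rng(0)
Y = rng.normal(size=(7, 7))
dL_dZ = rng.normal(size=(3, 3))          # a 3 x 3 filter with stride 2 on 7 x 7 gives 3 x 3
direct = dW_strided(dL_dZ, Y, 3, 3, stride=2)
via_upsampling = dW_upsampled(dL_dZ, Y, 3, 3, stride=2)
print(np.allclose(direct, via_upsampling))
```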


### Backward Convolution: Task 2 Pseudocode for Computing $\frac{\partial \mathcal{L}}{\partial W(i^{'}, j^{'}, k_{in}, k)}$


<img src="http://engineering.unl.edu/images/uploads/CNN_Backprop_Conv_dW.png" width=800, height=600>