[Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks](https://arxiv.org/abs/1312.6082)


---

> Like most neural networks, they contain several filtering layers
with each layer applying an affine transformation to the vector input followed by an elementwise
non-linearity.

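An affine map is the composition of two functions: a linear map followed by a translation.
$$
\vec{y} = f(\vec{x}) = A\vec{x} + \vec{b}
$$

Equivalently, in homogeneous coordinates the translation is absorbed into a single matrix multiplication:
$$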
\begin{bmatrix}
y \\
1
\end{bmatrix}
=
\begin{bmatrix}
A & b \\
0 & 1
\end{bmatrix}
\begin{bmatrix}
x \\
1
\end{bmatrix}
$$

In a convolutional layer, this is what happens when the input (e.g., an image patch) is multiplied by the weights (the filter/kernel) and a bias is then added.
This affine transformation preserves lines and parallelism but not necessarily angles or distances.
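
To make the two forms above concrete, here is a minimal NumPy sketch that evaluates $A\vec{x} + \vec{b}$ directly and via the augmented matrix. The values of `A`, `b`, and `x` are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np

# Illustrative values (assumptions, not from the paper).
A = np.array([[2.0, 0.0],
              [1.0, 3.0]])   # linear part
b = np.array([1.0, -1.0])    # translation part
x = np.array([0.5, 2.0])

# Direct form: y = A x + b
y = A @ x + b

# Homogeneous form: [y; 1] = [[A, b], [0, 1]] [x; 1]
M = np.block([[A, b[:, None]],
              [np.zeros((1, 2)), np.ones((1, 1))]])
y_h = M @ np.append(x, 1.0)

print(y)    # [2.  5.5]
print(y_h)  # [2.  5.5 1. ]

# An affine map is not linear unless b = 0: it does not send 0 to 0.
print(A @ np.zeros(2) + b)  # [ 1. -1.]
```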


After the affine transformation, each resulting value goes through a non-linear function
(like ReLU, sigmoid, or tanh) independently.
"Elementwise" means it's applied to each value separately, not to the whole vector at once.
This output becomes the input for the next layer, and the process repeats.
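
The "affine, then elementwise non-linearity, then repeat" pattern can be sketched with two small fully connected layers. The helper name `dense_layer` and all weights here are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
import numpy as np

def relu(z):
    """Elementwise ReLU: each entry is processed independently."""
    return np.maximum(0.0, z)

def dense_layer(x, W, b, activation=relu):
    """Affine transformation W x + b followed by an elementwise non-linearity."""
    return activation(W @ x + b)

# Hypothetical input and weights.
x = np.array([1.0, -2.0, 0.5])
W1 = np.array([[0.5, -1.0, 0.0],
               [1.0,  1.0, 2.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([[1.0, -1.0]])
b2 = np.array([0.5])

h = dense_layer(x, W1, b1)   # first layer: W1 x + b1 = [2.5, -1.0] -> ReLU -> [2.5, 0.0]
y = dense_layer(h, W2, b2)   # second layer consumes the first layer's output

print(h)  # [2.5 0. ]
print(y)  # [3.]
```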

In [18]:
'''
A convolution is NOT a standard tensor multiplication. Instead, it's a specialized operation where:

1. The kernel slides over the input
2. At each position, we compute element-wise multiplications followed by summation

Kernel Shape Rules!
For an input with shape [batch_size, height, width, channels]:

- The kernel shape is [kernel_height, kernel_width, in_channels, out_channels]
- in_channels MUST match the input's channels dimension
- out_channels is a hyperparameter you choose
'''

import numpy as np

# Example input - a small feature map/activation (e.g., from a previous layer).
# Shape: [batch_size=1, height=3, width=3, channels=2].
input_activation = np.array(
    [
        [
            [[0.2, 0.5], [0.1, 0.3], [0.7, 0.2]],
            [[0.5, 0.8], [0.4, 0.1], [0.3, 0.9]],
            [[0.6, 0.4], [0.8, 0.3], [0.2, 0.5]],
        ],
    ]
)

# 1. AFFINE TRANSFORMATION
# For CNNs, this is typically done with convolution kernels (filters).
# Example kernel with shape [height=2, width=2, in_channels=2, out_channels=1].
kernel = np.array(
    [
        [
            [[0.1], [0.3]],
            [[0.5], [0.7]]
        ],
        [
            [[0.2], [0.4]],
            [[0.6], [0.8]],
        ]
    ]
)

# Bias term for the affine transformation.
bias = 0.1

input_activation.shape, kernel.shape

((1, 3, 3, 2), (2, 2, 2, 1))
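
The kernel shape rules can also be exercised with more than one output channel. The following is a rough vectorized sketch, assuming stride 1 and no padding; the function name `conv2d_valid` and the random test data are illustrative, not part of the original notebook.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d_valid(x, kernel, bias):
    """CNN-style convolution (no kernel flip), stride 1, no padding.

    x:      [batch, height, width, in_channels]
    kernel: [k_h, k_w, in_channels, out_channels]
    bias:   [out_channels]
    """
    k_h, k_w, in_ch, out_ch = kernel.shape
    assert x.shape[-1] == in_ch, "in_channels must match the input's channels"
    # windows has shape [batch, out_h, out_w, in_channels, k_h, k_w].
    windows = sliding_window_view(x, (k_h, k_w), axis=(1, 2))
    # Sum over the window offsets (k, l) and the input channels (c).
    return np.einsum("bhwckl,klco->bhwo", windows, kernel) + bias


rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3, 3, 2))        # [batch, height, width, channels]
kernel = rng.normal(size=(2, 2, 2, 4))   # out_channels=4 is a free choice
bias = np.zeros(4)

out = conv2d_valid(x, kernel, bias)
print(out.shape)  # (1, 2, 2, 4)

# Spot check one value against an explicit patch-times-kernel sum.
expected = np.sum(x[0, 0:2, 0:2, :] * kernel[:, :, :, 1]) + bias[1]
print(np.isclose(out[0, 0, 0, 1], expected))  # True
```

Each output channel has its own `[k_h, k_w, in_channels]` slice of the kernel, which is why `out_channels` can be chosen freely while `in_channels` cannot.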

In [19]:
# Manually perform convolution (simplified for demonstration).
# In a real implementation, you'd use functions like torch.nn.Conv2d.
# output_height = input_height - kernel_height + 1
# output_width = input_width - kernel_width + 1
output_height = input_activation.shape[1] - kernel.shape[0] + 1
output_width = input_activation.shape[2] - kernel.shape[1] + 1
output = np.zeros((1, output_height, output_width, 1))

# Perform the convolution (the affine transformation).
k_h = kernel.shape[0]
k_w = kernel.shape[1]
print(f'{k_h=}, {k_w=}')
for h in range(output_height):
    for w in range(output_width):
        # Extract the patch from the input.
        patch = input_activation[0, h:h+k_h, w:w+k_w, :]

        # Element-wise multiplication and sum (dot product).
        # This is the affine part: Wx + b where W is the kernel, x is the input, b is bias.
        # kernel[..., 0] selects the single output channel so that patch and kernel both
        # have shape (k_h, k_w, in_channels); multiplying by the full 4-D kernel would
        # broadcast to shape (2, 2, 2, 2) and sum unrelated products.
        output[0, h, w, 0] = np.sum(patch * kernel[..., 0]) + bias

print("After affine transformation (convolution):")
output, output.shape

k_h=2, k_w=2
After affine transformation (convolution):


(array([[[[1.27],
          [1.71]],
 
         [[1.66],
          [1.75]]]]),
 (1, 2, 2, 1))

In [24]:
# 2. ELEMENTWISE NON-LINEARITY
# Apply ReLU (Rectified Linear Unit) non-linearity: f(x) = max(0, x)
relu_output = np.maximum(0, output)

print("After elementwise non-linearity (ReLU):")
relu_output, relu_output.shape

After elementwise non-linearity (ReLU):


(array([[[[1.27],
          [1.71]],
 
         [[1.66],
          [1.75]]]]),
 (1, 2, 2, 1))

In [25]:
# Sigmoid: f(x) = 1 / (1 + exp(-x)), squashes each value into (0, 1).
sigmoid_output = 1 / (1 + np.exp(-output))
sigmoid_output, sigmoid_output.shape

(array([[[[0.78074275],
          [0.84683628]],
 
         [[0.840238  ],
          [0.8519528 ]]]]),
 (1, 2, 2, 1))

In [26]:
# Tanh: squashes each value into (-1, 1).
tanh_output = np.tanh(output)
tanh_output, tanh_output.shape

(array([[[[0.85379765],
          [0.93664754]],
 
         [[0.93021718],
          [0.94137554]]]]),
 (1, 2, 2, 1))

In [27]:
# Leaky ReLU: f(x) = max(0.01 * x, x), keeps a small slope for negative inputs.
leaky_relu_output = np.maximum(0.01 * output, output)
leaky_relu_output, leaky_relu_output.shape

(array([[[[1.27],
          [1.71]],
 
         [[1.66],
          [1.75]]]]),
 (1, 2, 2, 1))
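
Because every pre-activation in the example above is positive, ReLU and leaky ReLU return their input unchanged, which hides what they actually do. A small sketch on a hypothetical vector `z` with mixed signs makes the differences between the four non-linearities visible.

```python
import numpy as np

# Hypothetical pre-activations with both signs.
z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

relu = np.maximum(0.0, z)                  # negatives are clamped to 0
leaky_relu = np.where(z > 0, z, 0.01 * z)  # negatives are scaled by 0.01
sigmoid = 1.0 / (1.0 + np.exp(-z))         # values in (0, 1), sigmoid(0) = 0.5
tanh = np.tanh(z)                          # values in (-1, 1), tanh(0) = 0

print(relu)        # [0.  0.  0.  0.5 2. ]
print(leaky_relu)  # [-0.02  -0.005  0.     0.5    2.   ]
print(sigmoid[2], tanh[2])  # 0.5 0.0
```

Each function is applied to every entry independently, so the output always has the same shape as the input.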

---

* Input image: $f(x, y)$
* Convolution filter: $w(x, y)$

$$
\left( f * w \right) \left( x, y \right)
=
\sum_{i, j} f\left( x-i, y-j \right) \cdot w(i, j)
$$

The ranges of $i$ and $j$ depend on the filter size.
For a filter of size $(2k+1) \times (2k+1)$, the ranges would typically be:

* $i \in [-k, k]$
* $j \in [-k, k]$

For example, with a 3x3 filter (where $k=1$), the indices would range from -1 to 1.
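
The sum above can be transcribed almost literally into NumPy. This is a slow, loop-based sketch meant only to mirror the formula; treating pixels outside the image as zero is an assumption, since the formula itself does not say how borders are handled. The name `conv2d_same` is illustrative.

```python
import numpy as np

def conv2d_same(f, w):
    """(f * w)(x, y) = sum_{i,j} f(x - i, y - j) w(i, j), with i, j in [-k, k].

    Assumption: pixels outside the image are zero.
    """
    k = w.shape[0] // 2
    assert w.shape == (2 * k + 1, 2 * k + 1), "filter must be (2k+1) x (2k+1)"
    padded = np.pad(f, k)  # zero border of width k
    out = np.zeros(f.shape, dtype=float)
    for x in range(f.shape[0]):
        for y in range(f.shape[1]):
            for i in range(-k, k + 1):
                for j in range(-k, k + 1):
                    # w(i, j) is stored at w[i + k, j + k];
                    # f(x - i, y - j) is stored at padded[x - i + k, y - j + k].
                    out[x, y] += padded[x - i + k, y - j + k] * w[i + k, j + k]
    return out

# Convolving a unit impulse with w reproduces w around the impulse.
f = np.zeros((5, 5))
f[2, 2] = 1.0
w = np.arange(1.0, 10.0).reshape(3, 3)

out = conv2d_same(f, w)
print(out[1:4, 1:4])
# [[1. 2. 3.]
#  [4. 5. 6.]
#  [7. 8. 9.]]
```

The impulse response comes back unflipped, which is the signature of a true convolution.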

Translation operator: $\left( T_{\Delta x, \Delta y} f \right)(x, y) = f(x-\Delta x, y-\Delta y)$

To prove equivariance we need to show that
convolving and then translating gives the same result as translating and then convolving.
Mathematically, 
$T_{\Delta x, \Delta y} \left[ f * w \right] = T_{\Delta x, \Delta y} \left[ f \right] * w$.


$$
\begin{align}
T_{\Delta x, \Delta y} \left[ f * w \right](x, y)  &=  (f * w)(x-\Delta x, y-\Delta y) \\
&= \sum_{i, j} f\left( \left(x - \Delta x\right) -i, \left(y - \Delta y\right) -j \right) \cdot w(i, j) \\
&= \sum_{i, j} f\left( x - \Delta x -i, y - \Delta y -j \right) \cdot w(i, j)
\end{align}
$$

On the other hand,
$$
\begin{align}
\left( T_{\Delta x, \Delta y} \left[ f \right] * w \right)(x, y)  &= \sum_{i, j} T_{\Delta x, \Delta y}[f]\left( x - i, y - j \right) \cdot w(i, j) \\
&= \sum_{i, j} f\left( \left(x - i\right) -\Delta x, \left(y - j\right) -\Delta y \right) \cdot w(i, j) \\
&= \sum_{i, j} f\left( x - \Delta x -i, y - \Delta y -j \right) \cdot w(i, j)
\end{align}
$$


Both derivations arrive at exactly the same expression, which shows that
> \[convolutions\] also builds equivariance to translation into the model (in other words, if the image is shifted by one pixel to the right,
then the output of the convolution is also shifted one pixel to the right; the two representations vary
equally with translation).
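
The identity can also be checked numerically. The sketch below assumes periodic (wrap-around) boundaries so that a shift never moves content out of the image; with zero padding the identity only holds away from the borders. The helper names `circular_conv2d` and `translate` are illustrative.

```python
import numpy as np

def circular_conv2d(f, w):
    """sum_{i,j} f(x - i, y - j) w(i, j) with periodic boundaries."""
    k = w.shape[0] // 2
    out = np.zeros(f.shape, dtype=float)
    for i in range(-k, k + 1):
        for j in range(-k, k + 1):
            # np.roll(f, (i, j)) evaluated at (x, y) equals f(x - i, y - j).
            out += np.roll(f, shift=(i, j), axis=(0, 1)) * w[i + k, j + k]
    return out

def translate(f, dx, dy):
    """(T_{dx, dy} f)(x, y) = f(x - dx, y - dy), with wrap-around."""
    return np.roll(f, shift=(dx, dy), axis=(0, 1))

rng = np.random.default_rng(0)
f = rng.normal(size=(6, 6))
w = rng.normal(size=(3, 3))

lhs = translate(circular_conv2d(f, w), 1, 2)   # convolve, then translate
rhs = circular_conv2d(translate(f, 1, 2), w)   # translate, then convolve
print(np.allclose(lhs, rhs))  # True
```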


Note, though, that CNN layers actually compute a cross-correlation over each window, i.e., the filter is not flipped:
$$
\left( f \star w \right) \left( x, y \right)
=
\sum_{i, j} f\left( x+i, y+j \right) \cdot w(i, j)
$$

In [8]:
'''
Input Matrix (f):
[1, 2, 3]
[4, 5, 6]
[7, 8, 9]

Filter (w):
[1, 2]
[3, 4]
'''
# Cross-Correlation (what CNNs actually do).
'''
> Position (0,0): place filter @ this position
[1, 2, 3]    [1, 2]
[4, 5, 6]    [3, 4]
[7, 8, 9]

Calculation:
1*1 + 2*2 + 4*3 + 5*4 = 1 + 4 + 12 + 20 = 37
'''
# Cross-correlation at (0, 0): sum_{i, j} f(0+i, 0+j) w(i, j)
# f(0,0)w(0,0) + f(0,1)w(0,1) + f(1,0)w(1,0) + f(1,1)w(1,1)

'''
> Position (0, 1):
[1, 2, 3]    [1, 2]→
[4, 5, 6]    [3, 4]→
[7, 8, 9]

2*1 + 3*2 + 5*3 + 6*4 = 47
'''
# Cross-correlation at (0, 1): sum_{i, j} f(0+i, 1+j) w(i, j)
# f(0,1)w(0,0) + f(0,2)w(0,1) + f(1,1)w(1,0) + f(1,2)w(1,1)
'''
> Position (1,0):
[1, 2, 3]    
[4, 5, 6]    [1, 2]↓
[7, 8, 9]    [3, 4]↓

4*1 + 5*2 + 7*3 + 8*4 = 67

> Position (1,1):
[1, 2, 3]    
[4, 5, 6]    [1, 2]↘
[7, 8, 9]    [3, 4]↘

5*1 + 6*2 + 8*3 + 9*4 = 77

Result of cross-correlation:
[37, 47]
[67, 77]
'''
print()




In [7]:
'''
Input Matrix (f):
[1, 2, 3]
[4, 5, 6]
[7, 8, 9]

Filter (w):
[1, 2]
[3, 4]
'''
# Convolution (mathematical definition)
# First, flip the filter horizontally and vertically:
'''
Flipped filter:
[4, 3]
[2, 1]
'''

# Now apply correlation with the flipped filter:

'''
> Position (0,0): place flipped filter @ this position
[1, 2, 3]    [4, 3]
[4, 5, 6]    [2, 1]
[7, 8, 9]

Calculation:
1*4 + 2*3 + 4*2 + 5*1 = 4 + 6 + 8 + 5 = 23
'''
# Output position (0,0) of this "valid" result is the convolution sum evaluated at (1,1):
# (f * w)(1, 1) = sum_{i, j} f(1-i, 1-j) w(i, j)
# = f(1,1)w(0,0) + f(1,0)w(0,1) + f(0,1)w(1,0) + f(0,0)w(1,1)
# With the flipped filter w'(i, j) = w(1-i, 1-j), this becomes:
# f(0,0)w'(0,0) + f(0,1)w'(0,1) + f(1,0)w'(1,0) + f(1,1)w'(1,1)
# i.e., the cross-correlation of f with the flipped filter w'.

'''
> Position (0,1):
[1, 2, 3]    [4, 3]→
[4, 5, 6]    [2, 1]→
[7, 8, 9]

Calculation:
2*4 + 3*3 + 5*2 + 6*1 = 8 + 9 + 10 + 6 = 33
'''
# (f * w)(1, 2) = sum_{i, j} f(1-i, 2-j) w(i, j)
# With the flipped filter, this is the cross-correlation at output position (0,1).

'''
> Position (1,0):
[1, 2, 3]    
[4, 5, 6]    [4, 3]↓
[7, 8, 9]    [2, 1]↓

Calculation:
4*4 + 5*3 + 7*2 + 8*1 = 16 + 15 + 14 + 8 = 53

> Position (1,1):
[1, 2, 3]    
[4, 5, 6]    [4, 3]↘
[7, 8, 9]    [2, 1]↘

Calculation:
5*4 + 6*3 + 8*2 + 9*1 = 20 + 18 + 16 + 9 = 63

Result of convolution:
[23, 33]
[53, 63]
'''
print()
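
The two hand calculations above can be checked with a few lines of NumPy. This is a minimal sketch for single-channel inputs with stride 1 and no padding; the function names `cross_correlate_valid` and `convolve_valid` are illustrative.

```python
import numpy as np

def cross_correlate_valid(f, w):
    """Slide w over f without flipping it (what CNN layers compute)."""
    out_h = f.shape[0] - w.shape[0] + 1
    out_w = f.shape[1] - w.shape[1] + 1
    out = np.zeros((out_h, out_w), dtype=f.dtype)
    for x in range(out_h):
        for y in range(out_w):
            out[x, y] = np.sum(f[x:x + w.shape[0], y:y + w.shape[1]] * w)
    return out

def convolve_valid(f, w):
    """Convolution = cross-correlation with the filter flipped along both axes."""
    return cross_correlate_valid(f, np.flip(w))

f = np.arange(1, 10).reshape(3, 3)
w = np.array([[1, 2],
              [3, 4]])

print(cross_correlate_valid(f, w))
# [[37 47]
#  [67 77]]
print(convolve_valid(f, w))
# [[23 33]
#  [53 63]]
```

Since the kernel weights are learned, a network can absorb the flip into the weights, so the distinction does not change what a CNN can represent.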


