---
title: "Post With Code"
author: "Harlow Malloc"
date: "2025-02-15"
categories: [news, code, analysis]
image: "image.jpg"
---


# How convolutional neural networks work and why they are so efficient

I was building an MNIST classification model in PyTorch, once with a plain neural network (NN) and once with a convolutional neural network (CNN), and was surprised by the difference in the number of parameters between two models with similar accuracy.

This multi-layered neural network had an accuracy of around 97%.


In [None]:
from torch import nn
model = nn.Sequential(nn.Linear(784,50), nn.ReLU(), nn.Linear(50,10))

It uses **39,760** parameters (weights and biases) to do that.

This CNN also had an accuracy of around 97%. 


In [None]:
def conv(ni, nf, ks=3, stride=2, act=True):
    # padding=ks//2 with stride=2 roughly halves the height and width at each layer
    res = nn.Conv2d(ni, nf, stride=stride, kernel_size=ks, padding=ks//2)
    if act: res = nn.Sequential(res, nn.ReLU())  # skip the ReLU when act=False
    return res

simple_cnn = nn.Sequential(
    conv(1 ,4),             # 14x14
    conv(4 ,8),             # 7x7
    conv(8 ,16),            # 4x4
    conv(16,16),            # 2x2
    conv(16,10, act=False), # 1x1
    nn.Flatten(),           # (batch, 10, 1, 1) -> (batch, 10)
)

It uses only **5,274** parameters (weights and biases) to do that.
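Both totals can be checked by hand. The sketch below is plain Python arithmetic and does not need PyTorch. The helper names `linear_params` and `conv_params` are made up for this post, and the sketch assumes every layer keeps its bias term, which is the default for `nn.Linear` and `nn.Conv2d`.

```python
# Count learnable parameters by hand.
# Helper names are illustrative, not PyTorch APIs.

def linear_params(n_in, n_out):
    # one weight per (input, output) pair, plus one bias per output
    return n_in * n_out + n_out

def conv_params(ni, nf, ks=3):
    # each of the nf kernels has ni * ks * ks weights, plus one bias per kernel
    return ni * nf * ks * ks + nf

# The MLP: Linear(784, 50) followed by Linear(50, 10)
mlp_total = linear_params(784, 50) + linear_params(50, 10)

# The CNN: (input channels, output channels) of the five conv layers
cnn_layers = [(1, 4), (4, 8), (8, 16), (16, 16), (16, 10)]
cnn_total = sum(conv_params(ni, nf) for ni, nf in cnn_layers)

print(mlp_total)  # 39760
print(cnn_total)  # 5274
```

The CNN reaches similar accuracy with roughly one seventh or one eighth of the parameters.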

What explains this stark difference in the number of parameters? The question led me on an investigation that deepened my understanding of how CNNs work. Before reading on, I am assuming that you already understand how ordinary neural networks work.

## How Convolutions Work

A convolution is like a window sliding over the data. It works on any data with a grid-like structure: time-series data, which is a 1D grid, or image data as in our case, which can be viewed as a 2D grid.

<img alt="A 3×3 kernel with 5×5 input, stride-2 convolution, and 1 pixel of padding" width="774" caption="A 3×3 kernel with 5×5 input, stride-2 convolution, and 1 pixel of padding (courtesy of Vincent Dumoulin and Francesco Visin)" id="three_by_five_conv" src="att_00030.png">


Here we have a 3x3 kernel (black box) sliding over a 5x5 image with a padding of 1 and a stride of 2 (it moves 2 pixels at a time), which creates a 3x3 activation map. This is how convolutions happen in our CNN.
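To make the sliding window concrete, here is a minimal sketch of the same setup in pure Python. It is not how PyTorch implements `nn.Conv2d`; the function name `conv2d_naive` is made up, and it assumes a single channel, a square image, a square kernel, and no bias.

```python
# A naive 2D convolution (strictly a cross-correlation, as in deep learning libraries).
# `conv2d_naive` is an illustrative name; single channel, square inputs, no bias.

def conv2d_naive(image, kernel, stride=1, padding=0):
    n, k = len(image), len(kernel)
    # surround the image with `padding` rings of zeros
    size = n + 2 * padding
    padded = [[0.0] * size for _ in range(size)]
    for r in range(n):
        for c in range(n):
            padded[r + padding][c + padding] = image[r][c]
    # output size: floor((n + 2*padding - k) / stride) + 1
    out = (size - k) // stride + 1
    result = []
    for i in range(out):
        row = []
        for j in range(out):
            top, left = i * stride, j * stride
            # multiply the window by the kernel element-wise and sum
            row.append(sum(padded[top + a][left + b] * kernel[a][b]
                           for a in range(k) for b in range(k)))
        result.append(row)
    return result

image = [[float(r * 5 + c) for c in range(5)] for r in range(5)]  # 5x5 input
kernel = [[1.0] * 3 for _ in range(3)]                            # 3x3 kernel of ones
activation_map = conv2d_naive(image, kernel, stride=2, padding=1)

print(len(activation_map), len(activation_map[0]))  # 3 3
```

With a 5x5 input, padding 1, a 3x3 kernel, and stride 2, the output size formula gives (5 + 2 - 3) // 2 + 1 = 3, matching the 3x3 activation map in the figure.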

## How CNNs Work

For a simple neural net, we matrix-multiply the input with the parameters (weights). Every input unit is connected to every output unit, and each weight is used exactly once per input example when calculating the output of a layer. This is what makes a traditional neural net different from a convolutional neural network.

$$
W =
\begin{bmatrix}
w_1 & w_2 & w_3 \\
w_4 & w_5 & w_6 \\
w_7 & w_8 & w_9
\end{bmatrix}
$$


$$
X =
\begin{bmatrix}
x_1 & x_2 & x_3 \\
x_4 & x_5 & x_6 \\
x_7 & x_8 & x_9
\end{bmatrix}
$$


$$
X@W =
\begin{bmatrix}
(x_1 w_1 + x_2 w_4 + x_3 w_7) & (x_1 w_2 + x_2 w_5 + x_3 w_8) & (x_1 w_3 + x_2 w_6 + x_3 w_9) \\
(x_4 w_1 + x_5 w_4 + x_6 w_7) & (x_4 w_2 + x_5 w_5 + x_6 w_8) & (x_4 w_3 + x_5 w_6 + x_6 w_9) \\
(x_7 w_1 + x_8 w_4 + x_9 w_7) & (x_7 w_2 + x_8 w_5 + x_9 w_8) & (x_7 w_3 + x_8 w_6 + x_9 w_9)
\end{bmatrix}
$$

If we represent a 2x2 kernel sliding over a 3x3 input in matrix form, it looks like the matrices below, and it creates a 2x2 activation map (4 values, written here as a column). If you compare the two operations, you can see two main differences.

1. Weights are repeating
2. The weight matrix for CNNs is filled with zeros

$$
W = 
\begin{bmatrix}
k_1 & k_2 & 0 & k_3 & k_4 & 0 & 0 & 0 & 0 \\
0 & k_1 & k_2 & 0 & k_3 & k_4 & 0 & 0 & 0 \\
0 & 0 & 0 & k_1 & k_2 & 0 & k_3 & k_4 & 0 \\
0 & 0 & 0 & 0 & k_1 & k_2 & 0 & k_3 & k_4
\end{bmatrix}
$$
$$
X=
\begin{bmatrix}
x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \\ x_8 \\ x_9
\end{bmatrix}
$$

$$
W@X =
\begin{bmatrix}
k_1 x_1 + k_2 x_2 + k_3 x_4 + k_4 x_5 \\
k_1 x_2 + k_2 x_3 + k_3 x_5 + k_4 x_6 \\
k_1 x_4 + k_2 x_5 + k_3 x_7 + k_4 x_8 \\
k_1 x_5 + k_2 x_6 + k_3 x_8 + k_4 x_9
\end{bmatrix}
$$
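The matrix form above can be checked numerically. The sketch below builds the 4x9 matrix from a 2x2 kernel and compares the matrix-vector product with a direct sliding-window computation. The kernel and input values are arbitrary, chosen only for illustration, and the convolution uses stride 1 with no padding, as in the equations.

```python
# Build the 4x9 matrix for a 2x2 kernel over a 3x3 input (stride 1, no padding)
# and check that W @ x equals the direct sliding-window result.

kernel = [[1.0, 2.0],
          [3.0, 4.0]]                      # k_1..k_4, arbitrary values
x = [float(v) for v in range(1, 10)]       # x_1..x_9, the flattened 3x3 input

W = []
for i in range(2):                         # output row
    for j in range(2):                     # output column
        row = [0.0] * 9
        for a in range(2):
            for b in range(2):
                # kernel[a][b] touches input pixel (i + a, j + b)
                row[(i + a) * 3 + (j + b)] = kernel[a][b]
        W.append(row)

# Matrix-vector product W @ x
matvec = [sum(w * v for w, v in zip(row, x)) for row in W]

# Direct sliding window over the 3x3 grid
X = [x[0:3], x[3:6], x[6:9]]
direct = [sum(kernel[a][b] * X[i + a][j + b] for a in range(2) for b in range(2))
          for i in range(2) for j in range(2)]

print(W[0])              # [1.0, 2.0, 0.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
print(matvec == direct)  # True
```

Only 4 distinct values appear in the 36 entries of `W`, and 20 of those entries are zero.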

Let's explore these differences further.

**Repeating weights (parameter sharing):** This is one of the reasons behind the efficiency of CNNs. In a dense matrix multiplication, each weight is multiplied with exactly one input unit to produce an output. That is not the case in a convolutional neural network: the kernel slides over the input, so each parameter of the kernel is used at every position of the input. Therefore, rather than learning different parameters for every position, we learn only one set of weights.
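As a rough illustration of how much sharing saves, the arithmetic below compares the first convolutional layer of `simple_cnn` with a hypothetical dense layer that produces an output of the same size (4 activation maps of 14x14). The dense layer is not part of either model above; it is only a point of comparison, and biases are left out on both sides.

```python
# Weights needed to map a 28x28 image to 4 activation maps of 14x14.

in_units = 28 * 28            # 784 input pixels
out_units = 4 * 14 * 14       # 784 output values

# Hypothetical dense layer: one weight per (input, output) connection
dense_weights = in_units * out_units

# conv(1, 4): 4 kernels, each 3x3 over 1 input channel
conv_weights = 1 * 4 * 3 * 3

# Each kernel weight is reused at every output position of its activation map
uses_per_weight = 14 * 14

print(dense_weights)    # 614656
print(conv_weights)     # 36
print(uses_per_weight)  # 196
```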

**Weight matrix filled with zeros (sparse representation):** The kernel is smaller than the input. Therefore, when we represent the convolution as a matrix multiplication, the resulting matrix is mostly zeros. One might think that all these zeros make us lose some features of the input, especially in strided convolutions, which would not be optimal. The diagram below shows that this is not the case. With x as our input, h as a shallow layer, and g as a deeper layer, you can see that the deeper layer is indirectly connected to almost all of the input's features.

<img alt="diagram for sparse connection taken from Deep Learning book by ian goodfellow" width="774" caption="diagram for sparse connection taken from Deep Learning book by ian goodfellow" id="sparceconnection" src="sparceconectivity.png">
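The same effect can be estimated for the `simple_cnn` defined earlier. The sketch below uses the standard receptive-field recurrence for stacked convolutions: the field grows by `(ks - 1) * jump` at each layer, where `jump` is the distance in input pixels between neighbouring units and multiplies by the stride. The function name `receptive_fields` is made up, and padding at the image borders is ignored.

```python
# Receptive field (in input pixels, along one axis) of a single unit
# after each of five 3x3, stride-2 convolutions.

def receptive_fields(n_layers, ks=3, stride=2):
    rf, jump = 1, 1          # an input pixel sees only itself; neighbours are 1 pixel apart
    sizes = []
    for _ in range(n_layers):
        rf += (ks - 1) * jump    # the kernel reaches (ks - 1) units further, each `jump` pixels apart
        jump *= stride           # striding spreads neighbouring units further apart
        sizes.append(rf)
    return sizes

print(receptive_fields(5))  # [3, 7, 15, 31, 63]
```

Under these assumptions, a unit in the fourth layer already has a receptive field of 31 pixels, wider than the 28-pixel MNIST image, even though each individual layer only looks at a 3x3 neighbourhood.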

## Understanding the Code 

So what does it mean when we write this code?


In [None]:
conv(1 ,4)

The layer created by this call receives a 28x28 image (ignoring the batch size) with one channel (since it is grayscale). After the ReLU, its output is 4 different activation maps created by 4 different kernels, each of size 14x14 (because of stride=2). The same happens at every layer up to the last one, which outputs 10 raw scores (logits), one for each digit from 0 to 9. We do not use ReLU in the last layer because we are using the cross_entropy loss function, which applies its own softmax and expects raw logits.
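The shapes at every layer can be traced with the usual output-size formula, `(n + 2*padding - ks) // stride + 1`. The sketch below applies it to the five layers of `simple_cnn` in plain Python; the helper name `conv_out` is made up, and it assumes the same defaults as the `conv` function (`ks=3`, `stride=2`, `padding=ks//2`).

```python
# Trace (channels, height, width) through the five conv layers for a 28x28 input.

def conv_out(n, ks=3, stride=2):
    padding = ks // 2
    return (n + 2 * padding - ks) // stride + 1

size, shapes = 28, []
for ni, nf in [(1, 4), (4, 8), (8, 16), (16, 16), (16, 10)]:
    size = conv_out(size)           # height and width shrink together
    shapes.append((nf, size, size))

print(shapes)
# [(4, 14, 14), (8, 7, 7), (16, 4, 4), (16, 2, 2), (10, 1, 1)]
```

The final `(10, 1, 1)` output is what `nn.Flatten()` turns into a vector of 10 logits per image.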

The first layer, `conv(1, 4)`, has 1 input channel, 4 output channels, and a 3×3 kernel. Not counting its 4 biases (one per output channel), the number of weights in this layer is therefore:


```{math}
1 \times 4 \times 3 \times 3 = 36 \text{ weights}
```


Confirming our calculation by fetching the weights of the first layer:


```{code}
# simple_cnn[0] is Sequential(Conv2d, ReLU); the second [0] picks the Conv2d
conv1 = simple_cnn[0][0]
conv1.weight
```

```{code}
Parameter containing:
tensor([[[[ 0.2922,  0.3600,  0.2967],
          [-0.0044,  0.0414, -0.0608],
          [ 0.1634, -0.0885,  0.2995]]],


        [[[-0.2102,  0.3089, -0.1890],
          [-0.1660,  0.1155,  0.3302],
          [-0.0576,  0.0286, -0.2662]]],


        [[[-0.3527, -0.0673,  0.2557],
          [-0.1725, -0.3262, -0.3382],
          [-0.1993, -0.3218, -0.5433]]],


        [[[ 0.3354,  0.4143,  0.6307],
          [ 0.8166,  1.2680,  0.7831],
          [ 0.5499,  1.0570,  1.0479]]]], device='cuda:0', requires_grad=True)
```


As you can see, the first layer has 36 weights, just as we calculated. Each 3x3 matrix in the output is a kernel containing different parameters, and each kernel slides over our input image.


### References

Fast.ai Course
Howard, J. (n.d.). Lesson 15: Deep learning for coders. Fast.ai. Retrieved March 15, 2025, from https://course.fast.ai/Lessons/lesson15.html

Deep Learning Book
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Convolutional networks. In Deep learning (pp. 326-366). MIT Press. https://www.deeplearningbook.org/contents/convnets.html

Medium Article
Basart, J. (2018, July 9). CNNs from different viewpoints. Medium - Impact AI. https://medium.com/impactai/cnns-from-different-viewpoints-fab7f52d159c