# Introduction to Convolutional Neural Networks (CNNs / ConvNets)

![Basic Structure Convolutional Neural Networks](../images/cnn-cover.png)

### Image Classification

* Image as input
* Class label or class probabilities (cat, dog, etc.) as output

### ConvNet Architecture
* Convolutional Layer
* Pooling Layer
* Fully-Connected Layer

Simple architecture: [INPUT - CONV - RELU - POOL - FC]
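
To make the layer sequence concrete, the sketch below traces how the shape of the data might change through this simple architecture. The 32x32x3 input, the 12 filters of size 5x5, stride 1 with no padding, the 2x2 pooling, and the 10 classes are all illustrative assumptions, and `trace_shapes` is a hypothetical helper, not part of any library.

```python
def trace_shapes(input_shape, num_filters=12, filter_size=5,
                 pool_size=2, num_classes=10):
    """Return (layer name, output shape) pairs for INPUT-CONV-RELU-POOL-FC.

    Assumes a stride-1 convolution without padding and non-overlapping pooling.
    """
    height, width, depth = input_shape
    shapes = [("INPUT", (height, width, depth))]

    # CONV: each filter yields one activation map; maps are stacked in depth.
    height, width = height - filter_size + 1, width - filter_size + 1
    depth = num_filters
    shapes.append(("CONV", (height, width, depth)))

    # RELU: elementwise max(0, x), so the shape is unchanged.
    shapes.append(("RELU", (height, width, depth)))

    # POOL: downsample width and height, keep depth.
    height, width = height // pool_size, width // pool_size
    shapes.append(("POOL", (height, width, depth)))

    # FC: one score per class.
    shapes.append(("FC", (num_classes,)))
    return shapes


for layer_name, shape in trace_shapes((32, 32, 3)):
    print(layer_name, shape)
```

Only the CONV and FC layers carry learnable weights in this sketch; RELU and POOL are fixed functions.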

## Convolutional Layer

* The CONV layers consist of a set of learnable filters
* Every filter is small spatially (along width and height), but extends through the full depth of the input volume

For example, a typical filter in the first layer of a ConvNet might have size 5x5x3:
* 5 pixels in width and height
* 3 because images have depth 3 (the color channels)

During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and, at each position, compute the dot product between the entries of the filter and the patch of the input beneath it. Sliding the filter over the whole width and height produces a 2-dimensional activation map that gives the response of that filter at every spatial position.
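
A minimal sketch of this sliding dot product for a single filter, in plain Python with nested lists. It assumes stride 1 and no zero-padding, neither of which is specified above; the function name `convolve_single_filter` and the toy data are illustrative only. Like most deep-learning code, it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def convolve_single_filter(volume, kernel, bias=0.0):
    """Slide one filter over an input volume (stride 1, no padding).

    volume: nested list indexed [row][col][channel]
    kernel: nested list indexed [row][col][channel], same depth as volume
    Returns a 2-D activation map as a list of rows.
    """
    in_h, in_w = len(volume), len(volume[0])
    k_h, k_w, depth = len(kernel), len(kernel[0]), len(kernel[0][0])
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1

    activation_map = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            # Dot product between the filter and the patch under it,
            # summed over width, height, and the full depth.
            total = bias
            for i in range(k_h):
                for j in range(k_w):
                    for c in range(depth):
                        total += kernel[i][j][c] * volume[top + i][left + j][c]
            row.append(total)
        activation_map.append(row)
    return activation_map


# Toy 4x4 input with a single channel, and a 2x2 filter that sums its patch.
toy_volume = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
sum_kernel = [[[1.0], [1.0]], [[1.0], [1.0]]]
toy_map = convolve_single_filter(toy_volume, sum_kernel)
print(toy_map)
```

Here a 4x4x1 input and a 2x2x1 filter give a 3x3 activation map, one response per filter position.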

Intuitively, the network will learn filters that activate when they see some type of visual feature such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network. 

Now, we will have an entire set of filters in each CONV layer (e.g. 12 filters), and each of them will produce a separate 2-dimensional activation map. We will stack these activation maps along the depth dimension and produce the output volume.
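
The size of that output volume follows from simple arithmetic. The sketch below uses the standard relation `(W - F + 2P) / S + 1` for each spatial dimension, where stride `S` and zero-padding `P` are hyperparameters not discussed above, and also counts the learnable parameters (one bias per filter is assumed). The helper name `conv_output_shape` and the 32x32x3 input are illustrative assumptions.

```python
def conv_output_shape(in_h, in_w, in_depth, num_filters, filter_size,
                      stride=1, padding=0):
    """Return ((out_h, out_w, out_depth), parameter_count) for a CONV layer."""
    span_h = in_h - filter_size + 2 * padding
    span_w = in_w - filter_size + 2 * padding
    if span_h < 0 or span_w < 0:
        raise ValueError("filter is larger than the padded input")
    if span_h % stride or span_w % stride:
        raise ValueError("filter does not tile the input with this stride")

    out_h = span_h // stride + 1
    out_w = span_w // stride + 1
    # Each filter spans the full input depth and has one bias term.
    params = num_filters * (filter_size * filter_size * in_depth + 1)
    # Output depth equals the number of filters (one activation map each).
    return (out_h, out_w, num_filters), params


shape, params = conv_output_shape(32, 32, 3, num_filters=12, filter_size=5)
print(shape, params)
```

With 12 filters of size 5x5x3 on a 32x32x3 input, this gives 12 stacked 28x28 activation maps and 912 parameters.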

## Pooling Layer

The Pooling Layer operates independently on every depth slice of the input and downsamples it spatially, most commonly using the MAX operation. The depth of the volume is unchanged.

In addition to max pooling, the pooling units can also perform other functions, such as average pooling or even L2-norm pooling. Average pooling was often used historically but has recently fallen out of favor compared to the max pooling operation, which has been shown to work better in practice.

![Pooling layer downsampling](../images/cnn-description-001.jpeg)

![Most common downsampling operation: max pooling](../images/cnn-description-002.jpeg)
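
As a rough illustration of max pooling on one depth slice, the sketch below uses a 2x2 window with stride 2, a common choice but an assumption here. The function name `max_pool_2d` and the sample values are made up for the example.

```python
def max_pool_2d(depth_slice, pool_size=2, stride=2):
    """Downsample one 2-D depth slice by taking the max of each window."""
    in_h, in_w = len(depth_slice), len(depth_slice[0])
    out_h = (in_h - pool_size) // stride + 1
    out_w = (in_w - pool_size) // stride + 1

    pooled = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            top, left = r * stride, c * stride
            # Collect the values under the current window.
            window = [
                depth_slice[top + i][left + j]
                for i in range(pool_size)
                for j in range(pool_size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


example_slice = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled_slice = max_pool_2d(example_slice)
print(pooled_slice)
```

In a full layer the same function would be applied to every depth slice separately, so a 4x4xD volume becomes 2x2xD.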

## Fully-Connected Layer

This layer takes an input volume (the output of whichever conv, ReLU, or pool layer precedes it) and outputs an N-dimensional vector, where N is the number of classes the program has to choose from. For example, in a digit classification program N would be 10, since there are 10 digits. Each number in this N-dimensional vector represents the probability of a certain class, typically after the raw scores are normalized with a softmax.

For example, if the resulting vector for a digit classification program is [0 .1 .1 .75 0 0 0 0 0 .05], then this represents a 10% probability that the image is a 1, a 10% probability that the image is a 2, a 75% probability that the image is a 3, and a 5% probability that the image is a 9.
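
A small sketch of how such a vector might be produced and read. It assumes the input volume has already been flattened into a list, that the layer is a plain weighted sum plus bias, and that a softmax turns the scores into probabilities; the weights, inputs, and function names are invented for illustration.

```python
import math


def fully_connected(inputs, weights, biases):
    """One score per class: dot product of each weight row with the inputs."""
    return [
        sum(w * x for w, x in zip(weight_row, inputs)) + b
        for weight_row, b in zip(weights, biases)
    ]


def softmax(scores):
    """Normalize scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical flattened features and a 3-class layer.
features = [1.0, 2.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
biases = [0.0, 0.0, 0.0]
probabilities = softmax(fully_connected(features, weights, biases))

# Reading the digit example from the text: the prediction is the
# index with the highest probability.
digit_probs = [0, .1, .1, .75, 0, 0, 0, 0, 0, .05]
predicted_digit = digit_probs.index(max(digit_probs))
print(probabilities, predicted_digit)
```

The predicted class is simply the position of the largest entry, so the example vector above corresponds to the digit 3.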

CIFAR-10 example: [https://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html](https://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html)

Sources: [CS231n: Convolutional Neural Networks for Visual Recognition](http://cs231n.github.io/convolutional-networks/) and [A Beginner's Guide To Understanding Convolutional Neural Networks](https://adeshpande3.github.io/adeshpande3.github.io/A-Beginner's-Guide-To-Understanding-Convolutional-Neural-Networks/)