# Convolutional Neural Network

In this module, you will become familiar with convolutional neural networks (CNNs), also known as space-invariant artificial neural networks. They are a type of deep neural network frequently used in image-based AI applications. There are several CNN architectures, and you will learn some of the most common ones to add to your toolkit of deep learning techniques.

## Learning Objectives

- Explain how a convolutional neural network works
- Become familiar with the most common architectures for convolutional neural networks
- Gain practice using CNNs for classification and image applications

# Categorical Cross Entropy

Multiclass Classification with Neural Networks

For binary classification problems, we have a final layer
with a single node and a sigmoid activation.

This has many desirable properties:
- Gives an output strictly between 0 and 1
- Can be interpreted as a probability
- Derivative is "nice"
- Analogous to logistic regression
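
As a rough illustration of the first and third properties, here is a minimal sketch in plain Python. The function names are chosen for this example, and the derivative uses the standard identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """The 'nice' derivative: it is expressed through the sigmoid itself."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-10.0, 0.0, 10.0):
    print(z, sigmoid(z), sigmoid_derivative(z))
```

The output stays strictly between 0 and 1 for these inputs, and the derivative peaks at $z = 0$.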

Is there a natural extension of this to a multiclass setting?

Reminder: one-hot encoding for categories.
1. Take a vector with length equal to the number of categories.
2. Represent each category with one (1) at a particular position and zero (0) everywhere else. For
example, we can represent types of bank accounts:

![](./images/21_BankExampleOneHotEncoding.png)
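
The two steps can be sketched in a few lines of plain Python. The account type names below are hypothetical placeholders chosen for this example; they are not taken from the figure.

```python
def one_hot(category, categories):
    """Return a list with 1 at the category's position and 0 everywhere else."""
    vector = [0] * len(categories)
    vector[categories.index(category)] = 1
    return vector

# Hypothetical account types, for illustration only.
account_types = ["checking", "savings", "business"]

encoded = one_hot("savings", account_types)
print(encoded)  # [0, 1, 0]
```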


For multiclass classification problems, let the final layer be
a vector with length equal to the number of possible classes.

The extension of the sigmoid to the multiclass setting is the **softmax** function.

$$
\text{softmax}(z_i)= \dfrac{e^{z_i}}{\sum_{k=1}^K e^{z_k}}
$$

Here $K$ is the number of classes. Softmax yields a vector with entries that are between 0 and 1 and sum to 1.
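
A minimal sketch of softmax in plain Python. Subtracting the largest score before exponentiating is a common numerical-stability step that is assumed here; it does not change the result.

```python
import math

def softmax(z):
    """Map a list of raw scores to probabilities that sum to 1."""
    shift = max(z)  # assumption: shift by the max to avoid overflow in exp
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The largest score receives the largest probability, and the order of the scores is preserved.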


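For the loss function, use categorical cross entropy.

This is just the log-loss function in disguise:

$$
C.E. = - \sum_{i=1}^K y_i \log(\hat{y}_i)
$$

Here $y$ is the one-hot target, $\hat{y}$ is the predicted probability vector, and $K$ is the number of classes.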
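
As a sketch, the loss for a single example can be computed as follows. The `eps` clipping value is an assumption added to guard against `log(0)`; it is not part of the formula.

```python
import math

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross entropy between a one-hot target and predicted probabilities."""
    # eps is an illustrative safeguard so that log(0) is never evaluated.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

y_true = [0, 1, 0]          # the true class is the second one
y_pred = [0.1, 0.7, 0.2]    # e.g. the output of a softmax layer

loss = categorical_cross_entropy(y_true, y_pred)
print(loss)
```

Because the target is one-hot, only the term for the true class survives, so the loss reduces to $-\log(0.7)$ in this example.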

The derivative has a nice property when used with softmax:

$$
\frac{\partial C.E.}{\partial \text{softmax}} \cdot \frac{\partial \text{softmax}}{\partial z_i} = \hat{y}_i - y_i
$$
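
One way to convince yourself of this result is a numerical check: compare $\hat{y}_i - y_i$ against a finite-difference estimate of the gradient. The sketch below assumes a one-hot target, and the step size `h` is an arbitrary choice for illustration.

```python
import math

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, y):
    """Cross entropy of softmax(z) against the one-hot target y."""
    return -sum(t * math.log(p) for t, p in zip(y, softmax(z)))

z = [2.0, 1.0, 0.1]
y = [0, 1, 0]

# Analytic gradient with respect to z: y_hat - y.
analytic = [p - t for p, t in zip(softmax(z), y)]

# Central finite differences as an independent estimate.
h = 1e-6
numeric = []
for i in range(len(z)):
    z_plus, z_minus = list(z), list(z)
    z_plus[i] += h
    z_minus[i] -= h
    numeric.append((loss(z_plus, y) - loss(z_minus, y)) / (2 * h))

print(analytic)
print(numeric)
```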


# Introduction to Convolutional Neural Networks (CNN)

## Motivation - Image Data

So far, the structure of our neural network treats all inputs interchangeably.

No relationships between the individual inputs.

Just an ordered set of variables.

We want to incorporate domain knowledge into the architecture of a Neural Network.

The convolutional networks we discuss here were developed to deal with image data.

Increasingly, these approaches are being applied in more common analytic problems of
regression and classification.

## Motivation

Important structures in image data:
- "Topology" of pixels
- Translation invariance
- Issues of lighting and contrast
- Knowledge of human visual system
- Nearby pixels tend to have similar values
- Edges and shapes
- Scale invariance (a big cat has the same features as a small cat)



# Images Dataset

Fully connected image networks would require a vast number
of parameters.

MNIST images are small (28 x 28 pixels) and in grayscale.

Color images are typically much larger:
[(200 x 200) pixels] x [3 color channels (RGB)] =
120,000 values (features).

A single fully connected layer with as many nodes as inputs would require:
$(200 \times 200 \times 3)^2 = 14,400,000,000$ weights!
- Variance (in terms of bias-variance) would be too high.
- So we introduce "bias" by structuring the network to look for certain kinds of patterns.
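
The arithmetic can be checked directly. This sketch assumes the layer has as many output nodes as input values and ignores bias terms.

```python
# Color image: 200 x 200 pixels with 3 channels (RGB).
height, width, channels = 200, 200, 3
n_inputs = height * width * channels        # 120,000 input values

# Assumption: one fully connected layer with as many outputs as inputs.
dense_weights = n_inputs ** 2               # one weight per (input, output) pair

# For comparison, a 28 x 28 grayscale MNIST image.
mnist_inputs = 28 * 28
mnist_dense_weights = mnist_inputs ** 2

print(f"{n_inputs:,} inputs -> {dense_weights:,} weights")
print(f"{mnist_inputs:,} inputs -> {mnist_dense_weights:,} weights")
```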

Features need to be "built up".
- Edges → shapes → relations between shapes
- Textures

Example: Cat = [two eyes in certain relation to one another] + [cat fur texture].
- Eyes = dark circle (pupil) inside another circle.
- Circle = particular combination of edge detectors.
- Fur = edges in certain pattern.


# Kernels

A kernel is a grid of weights "overlaid" on an image, centered on one pixel.
- Each weight is multiplied by the pixel underneath it.
- The output over the centered pixel is: $\sum_{p=1}^{P} W_p \cdot \text{pixel}_p$

Used for traditional image processing techniques:
- Blur, Sharpen, Edge detection, Emboss, etc.

![](./images/22_Kernel.png)

![](./images/23_KernelAsFeatureDetectors.png)
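
The weighted sum above can be sketched in plain Python. The toy image and the vertical edge kernel are assumptions made for this example. The kernel is applied as the text describes (overlay and multiply, no flipping) and only at positions where it fits entirely inside the image.

```python
def apply_kernel(image, kernel):
    """Slide the kernel over every position where it fits inside the image."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Multiply each weight by the pixel underneath it and sum.
            total = sum(
                kernel[i][j] * image[r + i][c + j]
                for i in range(k_h)
                for j in range(k_w)
            )
            row.append(total)
        output.append(row)
    return output

# Toy grayscale image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

# Illustrative vertical edge detector: responds to left-to-right brightening.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

response = apply_kernel(image, vertical_edge)
print(response)  # [[3, 3, 0], [3, 3, 0]]
```

The response is large where the window straddles the dark-to-bright boundary and zero over the uniform bright region.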

### Convolutional Neural Nets

Primary Ideas behind Convolutional Neural Networks:
- Let the Neural Network learn which kernels are most useful.
- Use same set of kernels across entire image (translation invariance).
- Reduces number of parameters and "variance" (from bias-variance point of view).


# Convolution for Color Images


# Convolutional Settings - Padding and Stride

Convolution Settings - Grid Size

Grid Size (Height and Width):
- The number of pixels a kernel "sees" at once.
- Typically use odd numbers so that there is a "center" pixel.
- Kernel does not need to be square.

![](./images/24_ConvolutionalGridSize.png)

Convolution Settings - Padding

- When kernels are applied directly, there is an "edge effect".
- Pixels near the edge cannot be used as "center pixels"
since there are not enough surrounding pixels.
- Padding adds extra pixels around the frame, so that pixels near the edge of the original image
can become center pixels as the kernel moves across the image.
- Added pixels typically have a value of zero (zero-padding).
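
A minimal sketch of zero-padding in plain Python. The helper name `zero_pad` and the toy image are assumptions for this example.

```python
def zero_pad(image, pad):
    """Surround a 2D image with `pad` rows and columns of zeros."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]                 # top border
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)       # left/right border
    padded.extend([0] * width for _ in range(pad))             # bottom border
    return padded

image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

padded = zero_pad(image, 1)
for row in padded:
    print(row)
```

With a 3 x 3 kernel, the padded 5 x 5 grid produces a 3 x 3 output, so every pixel of the original image serves as a center pixel.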

Convolution Settings - Stride
- The "step size" as the kernel moves across the image.
- Can be different for vertical and horizontal steps (but usually is the same value).
- When the stride is greater than 1, it scales down the output dimension.
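
The combined effect of grid size, padding, and stride on the output dimension follows the commonly used formula $\lfloor (n + 2p - k) / s \rfloor + 1$ along each axis. The sketch below applies it to a 28-pixel-wide input; the function name and example values are chosen for illustration.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Output length along one axis for a given kernel size, padding, and stride."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(28, 3))                       # no padding: edge pixels are lost
print(conv_output_size(28, 3, padding=1))            # padding of 1 keeps the size
print(conv_output_size(28, 3, padding=1, stride=2))  # stride 2 roughly halves it
```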

# Convolutional Settings - Depth and Pooling

Convolutional Settings - Depth

In images, we often have multiple numbers associated with each pixel location.
These numbers are referred to as "channels".
- RGB image: 3 channels.
- CMYK: 4 channels.

The number of channels is referred to as the "depth".

So, the kernel itself will have a "depth" the same size as the number of input channels.

Example: a 5 × 5 kernel on an RGB image.
- There will be 5 x 5 x 3 = 75 weights.

The output from the layer will also have a depth.
- The networks typically train many different kernels.
- Each kernel outputs a single number at each pixel location.
- So, if there are 10 kernels in a layer, the output of that layer will have depth = 10.
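
The counts in this example can be reproduced with a few lines of arithmetic. Bias terms are left out, and the 200 x 200 input size with no padding and a stride of 1 is an assumption borrowed from the earlier color image example.

```python
kernel_height, kernel_width = 5, 5
input_channels = 3       # RGB input, so each kernel has depth 3
n_kernels = 10           # number of kernels trained in the layer

weights_per_kernel = kernel_height * kernel_width * input_channels
total_weights = weights_per_kernel * n_kernels   # biases not counted
output_depth = n_kernels                         # one output channel per kernel

# Assumed input: 200 x 200 pixels, no padding, stride 1.
output_height = 200 - kernel_height + 1
output_width = 200 - kernel_width + 1

print(weights_per_kernel, total_weights)
print((output_height, output_width, output_depth))
```

Compared with the billions of weights of a fully connected layer on the same image, the layer here has only a few hundred weights.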

![](./images/25_FeatureMap.png)

Pooling

Idea: Reduce the image size by mapping a patch of pixels to a single value.
- Shrinks the dimensions of the image.
- Does not have parameters, though there are different types of pooling operations.

![](./images/26_MaxPool.png)

![](./images/27_AveragePool.png)
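
Max pooling and average pooling can be sketched with one helper that maps each patch to a single value. Non-overlapping patches (stride equal to the patch size) are assumed, and the toy image is made up for this example.

```python
def pool2d(image, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size patch."""
    output = []
    for r in range(0, len(image) - size + 1, size):
        row = []
        for c in range(0, len(image[0]) - size + 1, size):
            patch = [image[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(reduce_fn(patch))
        output.append(row)
    return output

def mean(values):
    return sum(values) / len(values)

image = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [3, 4, 5, 6],
]

max_pooled = pool2d(image, 2, max)    # [[7, 8], [9, 6]]
avg_pooled = pool2d(image, 2, mean)   # [[4.0, 5.0], [4.5, 3.0]]
print(max_pooled)
print(avg_pooled)
```

Both operations shrink the 4 x 4 image to 2 x 2 without any learned parameters.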

# Learning Recap

In this section, we discussed:
- Convolutional Neural Networks
- Original motivation (image data)
- Convolution settings: grid size, padding, stride, depth, pooling