# Image Classification

![](assets/cat.png)

![](assets/cat_angles.png)

## Enter Convolutional Neural Networks

- A CNN takes only the image's raw pixel data as input and "learns" how to extract the features it needs for classification
- To start, the CNN receives an input feature map: a three-dimensional matrix
    - The size of the first two dimensions corresponds to the height and width of the image in pixels.
    - The size of the third dimension is 3 (corresponding to the 3 channels of a color image: red, green, and blue).


### Convolution

![](assets/convolution_overview.gif)

Convolution layers take several hyperparameters:

  - **Input Size**: How many pixels in height and width is the input image? 
  - **Padding**: How many blank pixels do we want to place around the border of our image?
  - **Kernel Size**: What should the size of our sliding window be?
  - **Stride**: How many pixels do we want to move for each new window?
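Together, these four values determine how many window positions fit along each dimension, and therefore the size of the output feature map. A minimal sketch of that arithmetic, assuming the same padding on both sides of the image and that partial windows are dropped (the function name `conv_output_size` is ours, for illustration only):

```python
def conv_output_size(input_size: int, kernel_size: int,
                     padding: int = 0, stride: int = 1) -> int:
    """Number of window positions along one dimension (height or width)."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    padded = input_size + 2 * padding  # padding is added on both sides
    if kernel_size > padded:
        raise ValueError("kernel does not fit inside the padded input")
    # Integer division drops any partial window at the far edge.
    return (padded - kernel_size) // stride + 1


print(conv_output_size(5, 3))             # 3
print(conv_output_size(5, 3, padding=1))  # 5: padding of 1 preserves the size
print(conv_output_size(7, 3, stride=2))   # 3
```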


For each filter-tile pair (a tile being the patch of the input currently under the sliding window), the CNN performs element-wise multiplication of the filter matrix and the tile matrix, and then sums all the elements of the resulting matrix to get a single value.

<img src="assets/filter_matrix.png" width="60%" />
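The multiply-and-sum step can be sketched in plain Python. This is an illustration rather than how a real framework implements it: it assumes a single-channel image stored as a list of rows, a square filter, and zero-valued padding, and the name `convolve2d` is ours.

```python
def convolve2d(image, kernel, stride=1, padding=0):
    """Slide a square `kernel` over a 2-D `image` and return the output feature map."""
    k = len(kernel)
    width = len(image[0])

    # Surround the image with `padding` rows and columns of zeros.
    blank_row = [0] * (width + 2 * padding)
    padded = (
        [list(blank_row) for _ in range(padding)]
        + [[0] * padding + list(row) + [0] * padding for row in image]
        + [list(blank_row) for _ in range(padding)]
    )

    out_h = (len(padded) - k) // stride + 1
    out_w = (len(padded[0]) - k) // stride + 1

    output = []
    for i in range(out_h):
        out_row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Element-wise multiply the filter with the tile, then sum.
            out_row.append(sum(
                padded[top + a][left + b] * kernel[a][b]
                for a in range(k)
                for b in range(k)
            ))
        output.append(out_row)
    return output


image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [2, 0, 1],
    [0, 1, 2],
    [1, 0, 2],
]
print(convolve2d(image, kernel))  # [[15, 16], [6, 15]]
```

With `padding=1` the same call returns a 4x4 map, the same size as the input.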

```python
from IPython.display import IFrame

# Embed the interactive convolution widget in the notebook.
IFrame(
    'https://developers-dot-devsite-v2-prod.appspot.com/machine-learning/practica/image-classification/conv_widget',
    width=900,
    height=350)
```

### ReLU

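After each convolution layer, we apply a ReLU (rectified linear unit) activation function, which replaces every negative value with 0 and leaves positive values unchanged: `max(0, x)`. The short version is that, in practice, it empirically works well for this type of task.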
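As a rough sketch, ReLU applied to a convolved feature map stored as a list of rows looks like this (the helper name `relu` is ours):

```python
def relu(feature_map):
    """Apply max(0, x) to every value of a 2-D feature map."""
    return [[max(0, value) for value in row] for row in feature_map]


print(relu([[-3, 2], [0, -1]]))  # [[0, 2], [0, 0]]
```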


### Pooling

Pooling is a way of downsampling our image (or rather, our convolved features) to reduce processing time. There are a few different strategies here, but **Max Pooling** is a common one:

![](assets/maxpool_animation.gif)

As the name suggests, it just takes the maximum value of all elements in each window. Max pooling takes two parameters:

  - **Size**: How many pixels should be sampled in each step?
  - **Stride**: How many pixels should we move in the x- or y-direction with each new step? Note that we can set stride equal to the size in order to have non-overlapping samples.
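A minimal sketch of max pooling with these two parameters, assuming a square window, a 2-D feature map stored as a list of rows, and no padding (the name `max_pool2d` is ours):

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the maximum of each `size` x `size` window, moving `stride` pixels per step."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + a][j * stride + b]
                for a in range(size)
                for b in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


features = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
# Stride equal to size: non-overlapping windows, 4x4 shrinks to 2x2.
print(max_pool2d(features, size=2, stride=2))  # [[6, 4], [8, 9]]
# Stride smaller than size: overlapping windows, 4x4 shrinks to 3x3.
print(max_pool2d(features, size=2, stride=1))  # [[6, 6, 4], [7, 9, 9], [8, 9, 9]]
```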

### Flattening

This layer converts a three-dimensional layer in the network into a one-dimensional vector to fit the input of a fully-connected layer for classification. For example, a 5x5x2 tensor would be converted into a vector of size 50.
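The 5x5x2 example can be checked with a short sketch. The tensor here is a nested list indexed as `[row][column][channel]`, which is an assumption about layout made only for this illustration:

```python
def flatten(tensor):
    """Unroll a 3-D nested list into a flat 1-D list."""
    return [value for plane in tensor for row in plane for value in row]


# Build a 5x5x2 tensor whose values count up from 0 so the ordering is visible.
tensor = [
    [[(r * 5 + c) * 2 + ch for ch in range(2)] for c in range(5)]
    for r in range(5)
]
vector = flatten(tensor)
print(len(vector))  # 50
```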

Typically the final fully-connected layer of the model uses a **softmax** activation function, which outputs a probability value from 0 to 1 for each of the classification labels the model is trying to predict. These probabilities sum to 1 across all labels.
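A minimal sketch of softmax over the raw scores of the last layer follows. The input scores are made up for illustration, and subtracting the maximum score first is a common numerical-stability trick that does not change the result:

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)  # guards against overflow in exp()
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for three labels
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

The label with the largest score gets the largest probability, and the predicted class is simply the index of that maximum.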

![](assets/full_model.png)

### A final note: Data Augmentation

![](assets/data_augmentation.png)