# ConvNets 

!['Convolutions'](p1.png)

!['Convolution operator'](convolution.gif)


### Filters 

- The first step for a CNN is to break up the image into smaller pieces, also known as patches.
- A CNN uses filters to split an image into these patches.
- The size of each patch matches the filter size.

Slide the filter horizontally or vertically to focus on a different piece of the image.

- The amount by which the filter slides is referred to as the **'stride'**.
  - The stride is a hyperparameter which we can tune. 
- Increasing the stride reduces the size of the model by reducing the number of total patches each layer observes.
  - However, this usually comes with a reduction in accuracy.
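
To make the effect of the stride concrete, here is a minimal pure-Python sketch that cuts a small image into patches. The helper name `extract_patches` and the 6x6 toy image are illustrative assumptions, not part of any library.

```python
def extract_patches(image, filter_size, stride):
    """Return every filter_size x filter_size patch of a 2D image (a list of rows)."""
    height, width = len(image), len(image[0])
    patches = []
    # Move the filter's top-left corner by `stride` pixels at a time (no padding).
    for top in range(0, height - filter_size + 1, stride):
        for left in range(0, width - filter_size + 1, stride):
            patch = [row[left:left + filter_size]
                     for row in image[top:top + filter_size]]
            patches.append(patch)
    return patches

# A 6x6 toy "image" whose pixel values are simply 0..35.
image = [[r * 6 + c for c in range(6)] for r in range(6)]

patches_stride_1 = extract_patches(image, filter_size=2, stride=1)
patches_stride_2 = extract_patches(image, filter_size=2, stride=2)
print(len(patches_stride_1), len(patches_stride_2))  # 25 9
```

With a 2x2 filter, moving from a stride of 1 to a stride of 2 cuts the number of patches from 25 to 9, which is the reduction described above.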

- **Important idea:** *Grouping together adjacent pixels* and treating them as a collective.

    - In a non-convolutional neural network, we would have ignored this adjacency.
    - In a normal network, we would have connected every pixel in the input image to a neuron in the next layer, i.e. we would not have **taken advantage of the fact that pixels in an image are close together for a reason and have special meaning**.

    - By taking advantage of local structure, a CNN learns to classify local patterns, like shapes and objects, in an image.

##### Filter Depth
- It's common to have more than one filter. 
    - Different filters pick up different qualities of a patch. For example, one filter might look for a particular color, while another might look for an object of a specific shape.
    - The number of filters in a convolutional layer is called the **filter depth**.
    
How many neurons does each patch connect to?

- If we have a depth of `k`, we connect each patch of pixels to `k` neurons in the next layer.
    - This gives the next layer a depth of **k**.
    - In practice, **k** is a hyperparameter we tune, and most CNNs tend to pick the same starting values.
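
As a rough illustration of a patch feeding `k` neurons, the sketch below applies `k = 3` hand-written filters to one 2x2 patch. The function name `apply_filters` and the filter values are made up for the example; in a real CNN the weights are learned.

```python
def apply_filters(patch, filters):
    """Return one value per filter: the sum of element-wise products with the patch."""
    outputs = []
    for filt in filters:
        total = 0
        for patch_row, filt_row in zip(patch, filt):
            for pixel, weight in zip(patch_row, filt_row):
                total += pixel * weight
        outputs.append(total)
    return outputs

patch = [[1, 2],
         [3, 4]]

# k = 3 illustrative filters, each the same size as the patch.
filters = [
    [[1, 0], [0, 0]],    # picks out the top-left pixel
    [[1, 1], [1, 1]],    # sums the whole patch
    [[1, -1], [1, -1]],  # responds to left/right differences
]

outputs = apply_filters(patch, filters)
print(outputs)  # [1, 10, -2]
```

One patch produces three numbers here, one per filter, so the output at that position has a depth of 3.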

Having multiple neurons for a given patch lets the CNN learn to capture whichever characteristics of the data turn out to be useful.
- The CNN isn't "programmed" to look for certain characteristics. 
- Rather, it learns on its own which characteristics to notice.

### TensorFlow Strides, Depth and Padding

- **SAME Padding**, the output height and width are computed as:
    - $ out\_height =  ceil( \frac{in\_height} {strides[1]} ) $
    - $ out\_width  = ceil( \frac{in\_width} {strides[2]} ) $
    
- **VALID Padding**, No padding. Output height and width are computed as:
    - $ out\_height =  ceil(\frac{in\_height - filter\_height + 1} {strides[1]}) $
    - $ out\_width  =  ceil(\frac{in\_width  - filter\_width  + 1} {strides[2]}) $

- **Non-TensorFlow (general formula)**: $ output\_height = \frac{(n + 2p -f)}{s} +1  $
    - n: input height 
    - p: padding   
    - f: filter height
    - s: stride 
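
The three formulas above can be checked with a few lines of Python. The function names are illustrative, and the general formula is assumed to round down when the division is not exact (a common convention that the notes do not state).

```python
import math

def same_output_size(in_size, stride):
    """Output height or width with SAME padding."""
    return math.ceil(in_size / stride)

def valid_output_size(in_size, filter_size, stride):
    """Output height or width with VALID padding (no padding)."""
    return math.ceil((in_size - filter_size + 1) / stride)

def general_output_size(n, p, f, s):
    """General formula (n + 2p - f) / s + 1, assuming floor division."""
    return (n + 2 * p - f) // s + 1

print(same_output_size(32, 2))            # 16
print(valid_output_size(32, 8, 2))        # 13
print(general_output_size(32, 1, 8, 2))   # 14
print(general_output_size(32, 0, 8, 2))   # 13
```

For a 32-pixel input, an 8-pixel filter and a stride of 2, the general formula with `p = 0` gives the same result as the VALID formula.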


**Given**
```python
input = tf.placeholder(tf.float32, (None, 32, 32, 3))

# height, width, input_depth, output_depth = 8, 8, 3, 20
filter_weights = tf.Variable(tf.truncated_normal((8, 8, 3, 20))) 
filter_bias = tf.Variable(tf.zeros(20))

# batch, height, width, depth
strides = [1, 2, 2, 1] 
padding = 'SAME'

conv = tf.nn.conv2d(input, filter_weights, strides, padding) + filter_bias
```

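- The output **shape of conv is [batch_size, 16, 16, 20]**. It is 4D to account for the batch size, which the placeholder leaves unspecified (`None`).
- If we switch padding from `SAME` to `VALID`, the output shape becomes [batch_size, 13, 13, 20], because ceil((32 - 8 + 1) / 2) = ceil(12.5) = 13.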


### Number of parameters 

**Given**
- Input of shape 32x32x3 (HxWxD)
- 20 filters of shape 8x8x3 (HxWxD)
- A stride of 2 for both the height and width (S)
- Zero padding of size 1 (P)

**Output Layer**
- $ output\_shape = \frac{(n + 2p -f)}{s} +1  $  = 14x14x20 (HxWxD)

**How many parameters does the convolutional layer have (without parameter sharing)?**

- Without parameter sharing, each neuron in the output layer must connect to each neuron in the filter. 
    - Each neuron in the output layer must also connect to a single bias neuron.
- parameters = (8 * 8 * 3 + 1) * (14 * 14 * 20) = 756,560
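    - 8 * 8 * 3 is the number of weights per output neuron, plus 1 for the bias.
    - Without sharing, every neuron in the output (14 * 14 * 20 of them) gets its own copy of these weights.
    - There is no extra factor of 20: the number of filters is already counted in the 14 * 14 * 20 output neurons.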


### Parameter Sharing

The weights, `w`, are shared across patches for a given layer in a CNN to detect the **object or feature** regardless of where in the image the **object** is located.

- This is known as *statistical invariance* or *translation invariance*

The classification of a given patch in an image is determined by the weights and biases corresponding to that patch.
- If we want a **cat** that’s in the top left patch to be classified in the same way as a **cat** in the bottom right patch, we need the weights and biases corresponding to those patches to be the same, so that they are classified the same way.
- This is exactly what we do in CNNs. The weights and biases we learn for a given output layer are shared across all patches in a given input layer. 
    - Note that as we increase the depth of our filter, the number of weights and biases we have to learn still increases, as the weights aren't shared across the output channels.
- There’s an additional benefit to sharing parameters. 
    - If we did not reuse the same weights across all patches, we would have to learn new parameters for every single patch and hidden layer neuron pair. 
    - This does not scale well, especially for higher fidelity images. 
    - Thus, sharing parameters not only helps us with translation invariance, but also gives us a smaller, more scalable model.
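
The effect of sharing weights across patches can be sketched in pure Python: the same small filter is applied at every position, so a pattern produces the same response wherever it appears. The helpers `slide_filter` and `place`, the 5x5 image size and the diagonal pattern are all assumptions made for this illustration.

```python
def slide_filter(image, filt):
    """Apply one shared 2D filter at every position (stride 1, no padding)."""
    f = len(filt)
    out = []
    for top in range(len(image) - f + 1):
        row = []
        for left in range(len(image[0]) - f + 1):
            row.append(sum(image[top + i][left + j] * filt[i][j]
                           for i in range(f) for j in range(f)))
        out.append(row)
    return out

def place(pattern, size, top, left):
    """Return a size x size image of zeros with `pattern` pasted at (top, left)."""
    image = [[0] * size for _ in range(size)]
    for i, row in enumerate(pattern):
        for j, value in enumerate(row):
            image[top + i][left + j] = value
    return image

pattern = [[1, 0],
           [0, 1]]            # a tiny diagonal "feature"
detector = [[1, 0],
            [0, 1]]           # the same weights are reused for every patch

top_left = place(pattern, 5, 0, 0)
bottom_right = place(pattern, 5, 3, 3)

response_a = slide_filter(top_left, detector)
response_b = slide_filter(bottom_right, detector)
print(response_a[0][0], response_b[3][3])  # 2 2
```

The peak response is identical in both images; only its location in the output moves with the pattern.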
    
**Given**
- Input of shape 32x32x3 (HxWxD)
- 20 filters of shape 8x8x3 (HxWxD)
- A stride of 2 for both the height and width (S)
- Zero padding of size 1 (P)

**Output Layer**
- $ output\_shape = \frac{(n + 2p -f)}{s} +1  $  = 14x14x20 (HxWxD)

**How many parameters does the convolutional layer have (with parameter sharing)?**
- This is the number of parameters actually used in a convolution layer **tf.nn.conv2d()**
- With parameter sharing, each neuron in an output channel shares its weights with every other neuron in that channel.
- So the number of parameters is equal to the number of neurons in the filter, plus a bias neuron, all multiplied by the number of channels in the output layer:
```python
n_params = (8 * 8 * 3 + 1) * 20  # 3840 weights + 20 biases = 3860
```
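
For comparison, both parameter counts from this section can be reproduced with a small helper. The function name `conv_param_counts` is hypothetical; the numbers come from the 8x8x3 filters and the 14x14x20 output given above.

```python
def conv_param_counts(filter_h, filter_w, in_depth, out_h, out_w, out_depth):
    """Return (parameters without sharing, parameters with sharing)."""
    # Weights of one filter plus one bias.
    params_per_neuron = filter_h * filter_w * in_depth + 1
    # Without sharing, every output neuron owns its own set.
    without_sharing = params_per_neuron * (out_h * out_w * out_depth)
    # With sharing, each output channel owns one set.
    with_sharing = params_per_neuron * out_depth
    return without_sharing, with_sharing

without_sharing, with_sharing = conv_param_counts(8, 8, 3, 14, 14, 20)
print(without_sharing, with_sharing)    # 756560 3860
print(without_sharing // with_sharing)  # 196
```

The ratio of 196 is exactly 14 * 14, the number of spatial positions that reuse each filter.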


### Layers

Each layer in the network detects increasingly complex ideas.

#### Layer 1: Picks out very simple shapes and patterns like lines and blobs.

Example patterns that cause activations in the first layer of the network:
- These range from simple diagonal lines (top left) to green blobs (bottom middle).

!['Example patterns that cause activations in the first layer of the network'](layer1.png)

- Each image in the above grid represents a pattern that causes the neurons in the first layer to activate
    - They are patterns that the first layer recognizes. 
    - The top left image shows a -45 degree line, while the middle top square shows a +45 degree line.

- Let's now see some example images that cause such activations. The images in the grid below all activated the -45 degree line detector. Notice how they are all selected despite having different colors, gradients, and patterns.

!['Example patches that activate the -45 degree line detector in the first layer'](layer2.png)

#### Layer 2: Picks up more complex ideas like circles and stripes

The second layer picks up more complex ideas like circles and stripes.
- The gray grid on the left represents how this layer of the CNN activates (or "what it sees") based on the corresponding images from the grid on the right.

!['visualization of the second layer in the CNN'](layer_2.png)

- The second layer captures more complex ideas than the first.
- It recognizes circles (second row, second column), stripes (first row, second column), and rectangles (bottom right).
- The CNN learns to do this on its own. 
    - There is no special instruction for the CNN to focus on more complex objects in deeper layers.
    - That's just how it normally works out when you feed training data into a CNN.

#### Layer 3: Picks out complex combinations of features from the second layer

!['visualization of the third layer in the CNN'](layer3.png)

#### Layer 5

The last layer picks out the highest order ideas that we care about for classification, like dog faces, bird faces, and bicycles.


# TensorFlow Convolution Layer

```python
import tensorflow as tf

K = 64
iwidth = 10
iheight = 10
channels = 3

# Convolution filter
fw = 5
fh = 5

input = tf.placeholder(tf.float32, shape=[None, iheight, iwidth, channels])
weight = tf.Variable(tf.truncated_normal([fh, fw, channels, K]))
bias   = tf.Variable(tf.zeros(K))

# Strides [batch, input_height, input_width, input_channels] = [1, 2, 2, 1]
conv_layer = tf.nn.conv2d(input, weight, strides=[1, 2, 2, 1], padding='SAME')
conv_layer = tf.nn.bias_add(conv_layer, bias)
conv_layer = tf.nn.relu(conv_layer)
```

### TensorFlow Max Pooling

Examples of how max pooling works. 

!['max pooling with a 2x2 filter'](pool2.png)

!['max pooling with a 2x2 filter and stride of 2'](pool1.png)

- In this case, the max pooling filter has a shape of 2x2 and a stride of 2.
- The four colored 2x2 squares show each position where the filter was applied to find the maximum value.
    - As the max pooling filter slides across the input layer, it outputs the maximum value of each 2x2 square.

Conceptually, the benefit of the max pooling operation is that it reduces the size of the input and allows the neural network to focus on only the most important elements.
- Max pooling does this by retaining only the maximum value for each filtered area and discarding the remaining values.
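
The operation described above can be sketched without TensorFlow. This is a minimal pure-Python version for a single 2D slice with no padding; the name `max_pool_2d` and the 4x4 input values are made up for the example.

```python
def max_pool_2d(grid, size=2, stride=2):
    """Max pooling over a 2D grid (list of rows) with no padding."""
    out = []
    for top in range(0, len(grid) - size + 1, stride):
        row = []
        for left in range(0, len(grid[0]) - size + 1, stride):
            # Collect the values under the filter and keep only the largest.
            window = [grid[top + i][left + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window))
        out.append(row)
    return out

grid = [[1, 0, 2, 3],
        [4, 6, 6, 8],
        [3, 1, 1, 0],
        [1, 2, 2, 4]]

pooled = max_pool_2d(grid)
print(pooled)  # [[6, 8], [3, 4]]
```

The 4x4 input shrinks to 2x2, and each output value is the maximum of one 2x2 square.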

```python
# Apply Max Pooling
conv_layer = tf.nn.max_pool(conv_layer, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
```

The `tf.nn.max_pool()` function performs max pooling with the ksize parameter as the size of the filter and the strides parameter as the length of the stride. 2x2 filters with a stride of 2x2 are common in practice.

**Pooling decreases the size of the output and helps prevent overfitting.** Preventing overfitting is a consequence of reducing the output size, which in turn reduces the number of parameters in later layers.

For a pooling layer the output depth is the same as the input depth. Additionally, the pooling operation is applied individually for each depth slice.

Recently, pooling layers have fallen out of favor. Some reasons are:

- Recent datasets are so big and complex we're more concerned about underfitting.
- Dropout is a much better regularizer.
- Pooling results in a loss of information. Think about the max pooling operation as an example. We only keep the largest of n numbers, thereby disregarding n-1 numbers completely.

**Example:**
- Given an input of shape 4x4x5 (HxWxD)
- Filter of shape 2x2 (HxW)
- A stride of 2 for both the height and width (S)
```python
new_height = (input_height - filter_height)/S + 1
new_width = (input_width - filter_width)/S + 1
```

**What's the shape of the output?**
- new_height = (4 - 2)/2 + 1 = 2
- new_width = (4 - 2)/2 + 1 = 2


```python
input = tf.placeholder(tf.float32, (None, 4, 4, 5))
filter_shape = [1, 2, 2, 1]
strides = [1, 2, 2, 1]
padding = 'VALID'
pool = tf.nn.max_pool(input, filter_shape, strides, padding)
```

The output shape of pool will be [batch_size, 2, 2, 5], even if padding is changed to 'SAME', because ceil(4 / 2) = 2 as well.

# Convolutional Network in TensorFlow

- One structure for a convolutional network: a mix of convolutional layers and **max pooling**, followed by fully-connected layers.

### Dataset
- Import the MNIST dataset using a convenient TensorFlow function that batches, scales, and one-hot encodes the data.

```python
import tensorflow as tf

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('.', one_hot=True, reshape=False)
```

```
Extracting ./train-images-idx3-ubyte.gz
Extracting ./train-labels-idx1-ubyte.gz
Extracting ./t10k-images-idx3-ubyte.gz
Extracting ./t10k-labels-idx1-ubyte.gz
```


### Weights and biases 

```python
learning_rate = 0.00001
epochs = 10
batch_size = 128

# Number of samples to calculate validation and accuracy
test_valid_size = 256

# Network Parameters
n_classes = 10  # MNIST total classes (0-9 digits)
dropout = 0.75  # Dropout, probability to keep units

weights = {
    'wc1': tf.Variable(tf.random_normal([5, 5, 1, 32])),
    'wc2': tf.Variable(tf.random_normal([5, 5, 32, 64])),
    'wd1': tf.Variable(tf.random_normal([7*7*64, 1024])),
    'out': tf.Variable(tf.random_normal([1024, n_classes]))}

biases = {
    'bc1': tf.Variable(tf.random_normal([32])),
    'bc2': tf.Variable(tf.random_normal([64])),
    'bd1': tf.Variable(tf.random_normal([1024])),
    'out': tf.Variable(tf.random_normal([n_classes]))}
```
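
The `7*7*64` in `wd1` can be explained by tracing the shapes through the network. The sketch below assumes each convolution uses a stride of 1 with SAME padding and is followed by 2x2 max pooling with a stride of 2 and SAME padding. That layout is an assumption consistent with the weight shapes, not something these notes spell out.

```python
import math

def same_size(in_size, stride):
    """Output height or width under SAME padding."""
    return math.ceil(in_size / stride)

height = width = 28   # MNIST images are 28x28 with a single channel
depth = 1

for name, out_depth in [('wc1', 32), ('wc2', 64)]:
    # Assumed: convolution with stride 1 and SAME padding keeps height and width.
    height, width, depth = same_size(height, 1), same_size(width, 1), out_depth
    # Assumed: 2x2 max pooling with stride 2 and SAME padding halves them.
    height, width = same_size(height, 2), same_size(width, 2)
    print(name, (height, width, depth))

flattened = height * width * depth
print(flattened)  # 3136, i.e. 7*7*64
```

Under these assumptions the feature map shrinks from 28x28 to 14x14 to 7x7, so the first fully-connected layer receives 7 * 7 * 64 = 3136 inputs.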

### Convolutions

- In TensorFlow, this is all done using **tf.nn.conv2d()** and **tf.nn.bias_add()**.

```python
def conv2d(x, W, b, strides=1, padding='SAME'):
    # Convolve, add the bias to the convolution output, then apply ReLU.
    y = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding=padding)
    y = tf.nn.bias_add(y, b)
    return tf.nn.relu(y)
```

In TensorFlow, `strides` is an array of 4 elements:
- The first element indicates the stride for the batch and the last element indicates the stride for features.
- It's good practice to remove the batches or features you want to skip from the data set rather than use a stride to skip them. 
    - Hence set the first and last element to 1 in strides in order to use all batches and features.

- The middle two elements are the strides for height and width respectively. When someone says they are using a stride of 3, they usually mean `tf.nn.conv2d(x, W, strides=[1, 3, 3, 1])`

- To make life easier, the code uses `tf.nn.bias_add()` to add the bias.
