# Deep Learning with Python (Chollet)
## Chapter 5: Deep learning for computer vision

- Convolutional neural networks (convnets)
- Apply convnets to image-classification problems (in particular those involving small training datasets).
- Convnets are also a building block of more advanced models, e.g. combined with LSTMs for time series analysis

### Introduction to convnets

- Practical example before theory

In [1]:
from keras import layers 
from keras import models

model = models.Sequential()
# add layers
model.add(layers.Conv2D(32, (3,3), activation="relu", input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2,2)))
model.add(layers.Conv2D(64, (3,3), activation="relu"))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3,3), activation="relu"))

model.summary()

Using TensorFlow backend.


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 3, 3, 64)          36928     
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0
_________________________________________________________________


#### Notes

- Alternating `Conv2D` and `MaxPooling2D` layers
 - [Convolutional layers](https://keras.io/layers/convolutional/): Layer creates a convolution kernel that is convolved with the layer input to produce a tensor of outputs.
 - [Pooling layers](https://keras.io/layers/pooling/): Max pooling operation for spatial data 
- Inputs are tensors of shape `(image_height, image_width, image_channels)`, here `(28, 28, 1)`
- Feed the last output tensor `(3, 3, 64)` into a densely connected classifier network after flattening the 3D outputs to 1D.
- Adding a classifier using `Dense`:
 - [Dense layer](https://keras.io/layers/core/): A regular densely-connected NN layer


![convnet](../images/convnet.jpeg)

In [2]:
# adding a classifier
model.add(layers.Flatten())
model.add(layers.Dense(64, activation="relu"))
model.add(layers.Dense(10, activation="softmax"))

#### Notes

- 10-way classification and softmax activation
- The `(3, 3, 64)` output is flattened to a vector of 576 $(3*3*64)$ values.
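
The parameter counts in the summary can be checked by hand. The sketch below assumes each `Conv2D` filter has one weight per kernel cell and input channel plus one bias, and each `Dense` unit has one weight per input plus one bias. The helper names `conv2d_params` and `dense_params` are made up for this note and are not Keras functions.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # one weight per kernel cell and input channel, plus one bias per filter
    return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(inputs, units):
    # one weight per input, plus one bias per unit
    return (inputs + 1) * units


conv_counts = [
    conv2d_params(3, 3, 1, 32),   # first Conv2D on the (28, 28, 1) input
    conv2d_params(3, 3, 32, 64),  # second Conv2D
    conv2d_params(3, 3, 64, 64),  # third Conv2D
]
flattened = 3 * 3 * 64
dense_counts = [dense_params(flattened, 64), dense_params(64, 10)]

print(conv_counts, sum(conv_counts))  # [320, 18496, 36928] 55744
print(flattened, dense_counts)        # 576 [36928, 650]
```

The three convolution counts and their sum match the `Param #` column and the `Total params` line of the first summary.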

In [8]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 3, 3, 64)          36928     
_________________________________________________________________
flatten_1 (Flatten)          (None, 576)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                36928     
__________

#### Notes

- Train this simple convnet on the MNIST digits.
 - `datasets`: [link](https://keras.io/datasets/)
 - MNIST: 60,000 28x28 grayscale training images of the 10 digits, along with a test set of 10,000 images.

```python
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
```

 - `to_categorical` ([link](https://keras.io/utils/)) converts a class vector (integers) to a binary class matrix, i.e. a one-hot representation of the labels.
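
To make the one-hot idea concrete, here is a minimal NumPy sketch of the same transformation. `one_hot` is a hypothetical helper written for this note; it only mirrors the behaviour described above and is not the Keras implementation.

```python
import numpy as np


def one_hot(labels, num_classes):
    # hypothetical helper: one row per label, with a single 1 in the label's column
    labels = np.asarray(labels, dtype="int64")
    encoded = np.zeros((labels.shape[0], num_classes), dtype="float32")
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded


example = one_hot([5, 0, 4], num_classes=10)
print(example.shape)  # (3, 10)
print(example[0])     # 1.0 at index 5, zeros elsewhere
```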

In [19]:
# imports
from keras.datasets import mnist
from keras.utils import to_categorical 

# load data
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

print("Shape train images:", train_images.shape)
print("Shape test_images:", test_images.shape)

# reshape train and normalize
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype("float32") / 255

# reshape test and normalize
test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype("float32") / 255

print("Shape train images:", train_images.shape)
print("Shape test_images:", test_images.shape)

# one-hot encoding labels
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

print("Shape train labels:", train_labels.shape)
print("Shape test labels:", test_labels.shape)

model.compile(optimizer="rmsprop", 
             loss="categorical_crossentropy", 
             metrics=["accuracy"])

fit1 = model.fit(train_images, train_labels, epochs=5, batch_size=64)

# evaluation
test_loss, test_acc = model.evaluate(test_images, test_labels)

Shape train images: (60000, 28, 28)
Shape test_images: (10000, 28, 28)
Shape train images: (60000, 28, 28, 1)
Shape test_images: (10000, 28, 28, 1)
Shape train labels: (60000, 10)
Shape test labels: (10000, 10)
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


#### Accuracy

In [36]:
print("Test accuracy: {:.2f}%".format(test_acc * 100))

Test accuracy: 99.17%


#### Notes

- Even this simple convnet achieves a test accuracy of more than 99%


### The convolution operation

- Main difference from a densely connected layer: `Dense` layers learn **global patterns** in their input space, whereas convolution layers learn **local patterns**.

This key characteristic gives two interesting properties:

- **The patterns they learn are translation invariant:** After learning a pattern in one place (e.g. the lower-right corner of a picture), a convnet can recognize it anywhere else in the picture. This makes convnets very data efficient.
 - Fewer training samples are needed to learn representations that have generalization power.
 - *The visual world is fundamentally translation invariant*
- **They learn spatial hierarchies of patterns:** The first convolution layer learns small local patterns such as edges, the second convolution layer learns larger patterns made of the features of the first layer, and so on.
 - This allows convnets to efficiently learn increasingly complex and abstract visual concepts
 - *The visual world is fundamentally spatially hierarchical*
 
Convolutions operate over 3D tensors (feature maps):

- 2 spatial axes (*height* and *width*)
- 1 depth axis (also called the *channels* axis)

An RGB image has three color channels and therefore a *depth* of 3, whereas a grayscale image like the MNIST images has a *depth* of 1.

The convolution operation extracts patches from the input feature map (the receptive field, a region of the input with the same size as the filter) and applies the filter/kernel to each patch, which yields the output feature map.

<img src="../images/input_filter.png" width=70%>

<img src="../images/feature_map.png" width=70%>

<img src="../images/conv.gif" width="85%">

Here, the first convolution layer takes a feature map of size `(28, 28, 1)` and outputs a feature map of size `(26, 26, 32)`. It computes 32 filters over its input; each of the 32 output channels contains a 26x26 grid of values, the *response map* of that filter over the input.

- Feature map: Every dimension in the depth axis is a feature (filter), and the 2D tensor `output[:, :, n]` is the 2D spatial map of the response of this filter over the input. 

<img src="../images/concept_response_map.png">

Convolutions have two key parameters:

- **Size of the patches extracted from the inputs:** Typically, these are 3x3 or 5x5 (for MNIST, we used 3x3, which can be seen in the model definition).
- **Depth of the output feature map:** Number of filters computed by the convolution (32 and 64).

These are the first arguments passed to the `Conv2D` layers: 

```python
Conv2D(output_depth, (window_height, window_width))
```

Convolution steps: 

- Slide these small 3x3 (or 5x5) windows over the 3D input feature map,
- stopping at every possible location and
- extracting the 3D patch of surrounding features with shape `(window_height, window_width, input_depth)`.
 - Each patch is then transformed (via a tensor product with the same learned weight matrix, called the *convolution kernel*) into a 1D vector of shape `(output_depth,)`.
- All vectors are then spatially reassembled into a 3D output map of shape `(height, width, output_depth)`.
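
The steps above can be sketched in plain NumPy. This is an illustration only, assuming no padding, a stride of 1, no bias and no activation; `naive_conv2d` is a hypothetical name, and a real `Conv2D` layer is implemented far more efficiently. As in deep-learning frameworks generally, the kernel is applied without flipping.

```python
import numpy as np


def naive_conv2d(feature_map, kernel):
    """Slide a window over the input and transform every patch with the same kernel.

    feature_map: (height, width, input_depth)
    kernel:      (window_height, window_width, input_depth, output_depth)
    """
    height, width, _ = feature_map.shape
    win_h, win_w, _, output_depth = kernel.shape
    out_h = height - win_h + 1  # number of valid vertical window positions
    out_w = width - win_w + 1   # number of valid horizontal window positions
    output = np.zeros((out_h, out_w, output_depth))
    for i in range(out_h):
        for j in range(out_w):
            # 3D patch of shape (window_height, window_width, input_depth)
            patch = feature_map[i:i + win_h, j:j + win_w, :]
            # contract patch and kernel over window and depth axes -> (output_depth,)
            output[i, j, :] = np.tensordot(patch, kernel, axes=([0, 1, 2], [0, 1, 2]))
    return output


rng = np.random.default_rng(0)
image = rng.random((28, 28, 1))      # stand-in for one MNIST image
kernel = rng.random((3, 3, 1, 32))   # 32 random 3x3 filters (untrained)
response = naive_conv2d(image, kernel)
print(response.shape)  # (26, 26, 32)
```

`response[:, :, n]` is then the 2D response map of filter `n` over the input.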

<img src="../images/fig_5_4.png" width="60%">

- Output width and height may differ from the input width and height due to
 - **Border effects**, which can be countered by **padding** the input feature map
 - The use of **strides**
 
(see next section)

### Border effects and padding

Consider a 5x5 feature map and a 3x3 window (the filter/kernel):

<img src="../images/fig_5_5.png" width="60%">

There are 9 positions around which we can center a 3x3 window, forming a 3x3 grid. This leads to an output feature map of 3x3. The output map shrinks by two tiles along each dimension:

- 28x28 input features become 26x26 after the first convolution layer, ...

This is where **padding** comes in. Using padding, we can get an output with the same spatial dimensions as the input.

> Padding adds an appropriate number of rows and columns on each side of the input feature map so that it is possible to fit center convolution windows around every input tile.

Using a 3x3 window (filter/kernel), we add one column on the right and left as well as one row at the top and the bottom. In Keras' [`Conv2D`](https://keras.io/layers/convolutional/) layers padding is configurable via the `padding` argument, which takes two values:

- `padding="valid"`: No padding (only valid window locations will be used). This is the default.
- `padding="same"`: Pad in such a way that the output has the same width and height as the input.
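
The effect of the two padding modes on the output size can be sketched with a little arithmetic. `output_size` is a hypothetical helper for this note (not a Keras function); it assumes a stride of 1 and, for `"same"`, an odd window size so that the padding is symmetric.

```python
def output_size(input_size, window_size, padding="valid"):
    # hypothetical helper: spatial output size along one dimension, stride 1
    if padding == "valid":
        pad_per_side = 0
    elif padding == "same":
        # assumes an odd window size, e.g. 1 row/column per side for a 3x3 window
        pad_per_side = (window_size - 1) // 2
    else:
        raise ValueError("padding must be 'valid' or 'same'")
    return input_size + 2 * pad_per_side - window_size + 1


print(output_size(5, 3, "valid"))   # 3
print(output_size(5, 3, "same"))    # 5
print(output_size(28, 3, "valid"))  # 26
```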


### Convolution strides

So far, we have assumed that the center tiles of the convolution windows are all contiguous.

> Distance between two successive windows is a parameter of the convolution, called stride

Patches extracted by a 3x3 convolution with stride 2 over a 5x5 input (no padding). 

<img src="../images/fig_5_7.png" width="70%">

- A stride of 2 means the width and height of the feature map are downsampled by a factor of 2 (in addition to any changes induced by border effects).
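
As a small worked example of the figure above, the snippet below lists where a window can start along one dimension. `window_starts` is a hypothetical helper and assumes no padding.

```python
def window_starts(input_size, window_size, stride):
    # offsets at which a window still fits fully inside the input (no padding)
    return list(range(0, input_size - window_size + 1, stride))


starts = window_starts(5, 3, stride=2)
print(starts)                         # [0, 2]
print(len(starts))                    # 2, so the output feature map is 2x2
print(window_starts(5, 3, stride=1))  # [0, 1, 2]
```

With stride 1 the 5x5 input gives a 3x3 output, with stride 2 only a 2x2 output.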

Strided convolutions are rarely used in practice. To downsample feature maps, instead of strides, we use the **max-pooling** operation. 

### The max-pooling operation

From the model summary, we see that the size of the feature maps is halved after every `MaxPooling2D` layer, e.g. from 26x26 to 13x13.
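
The spatial sizes in the summary can be traced layer by layer. The sketch below assumes valid padding and stride 1 for the convolutions and non-overlapping pooling windows; `trace_spatial_size` and the `(kind, window)` layer description are made up for this note.

```python
def trace_spatial_size(size, layers):
    # follow the height (or width) of the feature map through the network
    sizes = [size]
    for kind, window in layers:
        if kind == "conv":
            size = size - window + 1  # valid padding, stride 1
        elif kind == "pool":
            size = size // window     # a leftover row/column is dropped
        else:
            raise ValueError("unknown layer kind: " + kind)
        sizes.append(size)
    return sizes


layers = [("conv", 3), ("pool", 2), ("conv", 3), ("pool", 2), ("conv", 3)]
print(trace_spatial_size(28, layers))  # [28, 26, 13, 11, 5, 3]
```

Note the second pooling step: 11 is not divisible by 2, so the output is 5x5 rather than an exact half.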

> Max-pooling aggressively downsamples feature maps

Max-pooling extracts windows from the input feature maps and outputs the max value of each channel. It is conceptually similar to convolution, except that instead of transforming the local patches via a learned linear transformation (the convolution kernel), they're transformed via a hardcoded max tensor operation.

- Max pooling is usually done with 2x2 windows and a stride of 2
 - this downsamples the feature maps by a factor of 2
- Convolution is typically done with 3x3 windows and no stride (i.e. a stride of 1)
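
A minimal NumPy sketch of the max-pooling operation described above, using the usual 2x2 window and stride of 2 as defaults. `max_pool2d` is a hypothetical name for illustration and not the Keras implementation.

```python
import numpy as np


def max_pool2d(feature_map, window=2, stride=2):
    # feature_map has shape (height, width, depth); every channel is pooled separately
    height, width, depth = feature_map.shape
    out_h = (height - window) // stride + 1
    out_w = (width - window) // stride + 1
    output = np.zeros((out_h, out_w, depth), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = feature_map[top:top + window, left:left + window, :]
            # hardcoded max over the window, nothing is learned here
            output[i, j, :] = patch.max(axis=(0, 1))
    return output


fmap = np.arange(16, dtype="float32").reshape(4, 4, 1)
pooled = max_pool2d(fmap)
print(pooled[:, :, 0])
# [[ 5.  7.]
#  [13. 15.]]
```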

> The reason to use downsampling is to reduce the number of feature-map coefficients to process, as well as to induce spatial-filter hierarchies by making successive convolution layers look at increasingly large windows (in terms of the fraction of the original input they cover).

Basics of convnet so far: 

- Feature maps
- convolution
- max pooling 

and their relation to each other.

---

#### Key terms
- *Convolution:*
- *filter/kernel:*
- *Feature map:* In general, a feature map is a function that maps a data vector to a feature space. Intuitively, this is done to present the learning algorithm with data that are easier to classify.
 - i.e. a function that takes feature vectors in one space and transforms them into feature vectors in another space.
 - In a convnet, the input is passed through the convolution layer (convolution filter/kernel) to produce an output feature map.
- *Response map*:
- *Border effect* and *padding*:
- *Convolution strides*: 
- *Max-pooling*:
 - *downsampling*

## Training a convnet from scratch on a small dataset



#### Key Terms