<a href="https://colab.research.google.com/github/squeeko/DL_TF20_KerasCNNGANSRNNNLP/blob/in_progress/DL_TF2_Ch4_CNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ConvNets in TensorFlow 2.x

In TensorFlow 2.x, if we want to add a convolutional layer with 32 parallel feature maps (filters) and a filter size of 3×3, we write:

In [None]:
import tensorflow as tf
from tensorflow.keras import datasets, layers, models

In [None]:
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))

This means that we are applying a 3×3 convolution on 28×28 images with one input
channel (or input filters) resulting in 32 output channels (or output filters).
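
To make the shapes concrete, here is a minimal sketch in plain Python (no TensorFlow needed) of the arithmetic behind this layer. The helper names `conv2d_output_shape` and `conv2d_param_count` are hypothetical and not part of Keras. The sketch assumes the Keras defaults of stride 1 and no padding.

```python
def conv2d_output_shape(height, width, kernel_size, filters):
    """Output (height, width, channels) of a stride-1 convolution without padding."""
    kh, kw = kernel_size
    return (height - kh + 1, width - kw + 1, filters)


def conv2d_param_count(kernel_size, in_channels, filters):
    """Trainable weights: one kh x kw x in_channels kernel plus one bias per filter."""
    kh, kw = kernel_size
    return (kh * kw * in_channels + 1) * filters


shape = conv2d_output_shape(28, 28, (3, 3), 32)
params = conv2d_param_count((3, 3), 1, 32)
print(shape)   # (26, 26, 32)
print(params)  # 320
```

So each 28×28×1 image becomes a 26×26×32 volume, and the layer has only 320 trainable parameters because every filter is shared across all spatial positions.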

## Pooling layers

Let's suppose that we want to summarize the output of a feature map. Again, we
can use the spatial contiguity of the output produced from a single feature map
and aggregate the values of a sub-matrix into one single output value synthetically
describing the "meaning" associated with that physical region.

## Max pooling
One easy and common choice is the so-called max-pooling operator, which simply
outputs the maximum activation as observed in the region. In Keras, if we want
to define a max pooling layer of size 2×2, we write:

    model.add(layers.MaxPooling2D((2, 2)))
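
As an illustration of what the operator computes, the following plain-Python sketch applies non-overlapping 2×2 max pooling to a small feature map. The function `max_pool_2d` and the sample values are made up for this example. The sketch assumes the stride equals the pool size, which is the Keras default when no strides are given.

```python
def max_pool_2d(feature_map, size=2):
    """Non-overlapping max pooling over size x size regions of a 2D list."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r + dr][c + dc]
                for dr in range(size) for dc in range(size))
            for c in range(0, cols - size + 1, size)
        ]
        for r in range(0, rows - size + 1, size)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 4, 6],
]
pooled = max_pool_2d(feature_map)
print(pooled)  # [[4, 5], [3, 7]]
```

Each 2×2 block of the 4×4 input is reduced to its largest activation, halving both spatial dimensions.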

## Average pooling
Another choice is average pooling, which simply aggregates a region into the average
value of the activations observed in that region.

Note that Keras implements a large number of pooling layers and a complete list is
available online (https://keras.io/layers/pooling/). In short, all the pooling
operations are nothing more than a summary operation on a given region.
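
A minimal plain-Python sketch of average pooling over 2×2 regions is shown below. In Keras the corresponding layer is `layers.AveragePooling2D`. The helper `average_pool_2d` and the sample feature map are assumptions made for illustration.

```python
def average_pool_2d(feature_map, size=2):
    """Non-overlapping average pooling over size x size regions of a 2D list."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            # Collect the activations of one region and summarize them by their mean.
            region = [feature_map[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            pooled_row.append(sum(region) / len(region))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 4, 6],
]
averaged = average_pool_2d(feature_map)
print(averaged)  # [[2.5, 2.0], [1.5, 4.75]]
```

Replacing the mean with another reduction (for example `max`) gives a different pooling operator over exactly the same regions.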

## ConvNets summary
So far, we have described the basic concepts of ConvNets. CNNs apply convolution
and pooling operations in one dimension for audio and text data along the time
dimension, in two dimensions for images along the (height × width) dimensions
and in three dimensions for videos along the (height × width × time) dimensions.
For images, sliding the filter over an input volume produces a map that provides
the responses of the filter for each spatial position.
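
The sliding operation can be sketched in plain Python as follows. The function name `correlate2d_valid`, the toy image, and the hand-written vertical-edge filter are all illustrative assumptions. In a real CNN the filter values are learned rather than chosen by hand.

```python
def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and
    return the map of filter responses, one value per position."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(iw - kw + 1)
        ]
        for r in range(ih - kh + 1)
    ]


# A 4x5 image that is dark on the left and bright on the right.
image = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
# A 3x3 filter that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
response = correlate2d_valid(image, kernel)
print(response)  # [[0, 3, 3], [0, 3, 3]]
```

The response is zero over the uniform region and large wherever the window covers the edge, regardless of the row in which the edge appears.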

In other words, a CNN has multiple filters stacked together that learn to recognize
specific visual features independently from the location in the image itself. Those
visual features are simple in the initial layers of the network and become more
and more sophisticated deeper in the network. Training a CNN means finding
the right values for each filter so that an input, when passed through
multiple layers, activates the neurons of the last layer that correspond to the
correct prediction.

## An example of DCNN ‒ LeNet

**"layers.Convolution2D(20, (5, 5), activation='relu', input_shape=input_shape)"**

Yann LeCun, who won the 2018 Turing Award, proposed [1] a family
of convnets named LeNet trained for recognizing MNIST handwritten characters
with robustness to simple geometric transformations and distortion. The core idea
of LeNets is to have lower layers alternating convolution operations with max-pooling
operations. The convolution operations are based on carefully chosen local
receptive fields with shared weights for multiple feature maps. Then, higher levels
are fully connected based on a traditional MLP with hidden layers and a softmax
output layer.

In the Convolution2D call shown at the start of this section, the first parameter is
the number of output filters in the convolution, and the tuple that follows is the
size of each filter. An interesting optional parameter is padding. There are two
options: padding='valid' means that the convolution is only computed where the input
and the filter fully overlap, so the output is smaller than the input, while
padding='same' pads the area around the input with zeros so that, for a stride of 1,
the output has the same size as the input.
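
The effect of the two padding modes on the output size can be sketched with a few lines of plain Python. The helpers `conv_output_size` and `same_padding_total` are hypothetical names. The formulas follow the convention TensorFlow documents for `'valid'` and `'same'` padding, applied here to one spatial dimension.

```python
import math


def conv_output_size(n, kernel, padding="valid", stride=1):
    """Output length along one spatial dimension of a convolution."""
    if padding == "valid":
        return (n - kernel) // stride + 1
    if padding == "same":
        return math.ceil(n / stride)
    raise ValueError("padding must be 'valid' or 'same'")


def same_padding_total(n, kernel, stride=1):
    """Total number of zeros added along one dimension for padding='same'."""
    out = math.ceil(n / stride)
    return max((out - 1) * stride + kernel - n, 0)


valid_size = conv_output_size(28, 5, "valid")
same_size = conv_output_size(28, 5, "same")
pad_total = same_padding_total(28, 5)
print(valid_size, same_size, pad_total)  # 24 28 4
```

For a 28-pixel dimension and a 5×5 filter, `'valid'` shrinks the output to 24, while `'same'` keeps it at 28 by adding 4 zeros in total (2 on each side).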

In [1]:
import tensorflow as tf
from tensorflow.keras import datasets, layers, models, optimizers

# network and training

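EPOCHS = 5 # number of passes over the training data
BATCH_SIZE = 128 # samples per gradient update
VERBOSE = 1 # Keras logging verbosity
OPTIMIZER = tf.keras.optimizers.Adam()
VALIDATION_SPLIT = 0.95 # fraction of the training data held out for validation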

IMG_ROWS, IMG_COLS = 28, 28 #  input image dimensions
INPUT_SHAPE = (IMG_ROWS, IMG_COLS, 1)
NB_CLASSES = 10 # number of outputs = number of digits

We have a first convolutional stage with rectified linear unit (ReLU) activations
followed by a max pooling layer. Our net will learn 20 convolutional filters, each
with a size of 5×5. Because the layer uses the default padding='valid', each output
feature map is 24×24 (28 − 5 + 1). Note that since Convolution2D is the first stage
of our pipeline, we are also required to define its input_shape. The max pooling
operation implements a sliding window that slides over the layer and takes the
maximum of each 2×2 region with a step of 2 pixels both vertically and horizontally.

Then there is a second convolutional stage with ReLU activations, followed again
by a max pooling layer. In this case we increase the number of convolutional filters
learned from 20 to 50. Increasing the number of filters in deeper layers
is a common technique used in deep learning.

Then we have a pretty standard flattening and a dense network of 500 neurons,
followed by a softmax classifier with 10 classes:

In [5]:
# define the LeNet ConvNet

def build(input_shape, classes):
  model = models.Sequential()
  
  # CONV => RELU => POOL
  model.add(layers.Convolution2D(20, (5,5), activation='relu', input_shape=input_shape))
  model.add(layers.MaxPool2D(pool_size=(2,2), strides=(2,2)))

  # CONV => RELU => POOL
  model.add(layers.Convolution2D(50, (5,5), activation='relu'))
  model.add(layers.MaxPool2D(pool_size=(2,2), strides=(2,2)))

  # Flatten => RELU layers
  model.add(layers.Flatten())
  model.add(layers.Dense(500, activation='relu'))

  # Add a SoftMax classifier
  model.add(layers.Dense(classes, activation="softmax"))
  return model
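
As a sanity check on the architecture, the sketch below traces the tensor shapes and parameter counts through the layers in plain Python. The helpers `conv_valid`, `max_pool`, and `dense` are hypothetical stand-ins for the Keras layers. They assume the defaults used in `build`: 5×5 filters, no padding, and 2×2 pooling with stride 2. The totals should agree with what `model.summary()` reports for the built model.

```python
def conv_valid(shape, kernel, filters):
    """Shape and parameter count of a stride-1, unpadded convolution."""
    h, w, c = shape
    params = (kernel * kernel * c + 1) * filters  # weights + one bias per filter
    return (h - kernel + 1, w - kernel + 1, filters), params


def max_pool(shape, size=2):
    """Shape after non-overlapping size x size max pooling (no parameters)."""
    h, w, c = shape
    return (h // size, w // size, c)


def dense(units_in, units_out):
    """Output width and parameter count of a fully connected layer."""
    return units_out, (units_in + 1) * units_out


shape = (28, 28, 1)
shape, p1 = conv_valid(shape, 5, 20)   # (24, 24, 20)
shape = max_pool(shape)                # (12, 12, 20)
shape, p2 = conv_valid(shape, 5, 50)   # (8, 8, 50)
shape = max_pool(shape)                # (4, 4, 50)
flat = shape[0] * shape[1] * shape[2]  # 800 values after Flatten
hidden, p3 = dense(flat, 500)
outputs, p4 = dense(hidden, 10)

total_params = p1 + p2 + p3 + p4
print(flat, total_params)  # 800 431080
```

Most of the parameters sit in the first dense layer (400,500 of them), while the two convolutional stages together account for about 25,600.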
