# [Link to my Youtube Video Explaining this whole Notebook](https://youtu.be/cFld9T5xZqo)

[![Imgur](https://imgur.com/4zHY6co.png)](https://youtu.be/cFld9T5xZqo)


#### Some Fundamental concepts before fully understanding each step of the Generator() Function implementing the Reverse Convolution (or `Conv2DTranspose`) for the DCGAN.

## Hyper-parameters: CNN

This section covers the additional hyper-parameters found in CNNs. Please refer to part I (link at the start) for the hyper-parameters of dense layers, which are also part of the CNN architecture.

- **Kernel/Filter Size**: A filter is a matrix of weights that we convolve with the input. The convolution measures how closely a patch of the input resembles a feature, which may be a vertical edge, an arch, or any other shape. The weights in the filter matrix are learned during training. Smaller filters capture local information, while bigger filters represent more global, high-level information. If you think a large number of pixels is necessary for the network to recognize the object, use large filters (such as 11x11 or 9x9). If you think objects are differentiated by small, local features, use small filters (3x3 or 5x5). Note that in general we use filters with odd sizes.


- **Padding:** Padding adds rows and columns of zeros around the input so that the spatial size can be preserved after convolution; this can improve performance because information at the borders is retained. Keras accepts two values for `padding`:
  - `same`: the output size is ceil(n/s), which equals the input size when the stride is 1. Zeros are added evenly on the left and right; if an odd number of columns is needed, the extra column goes on the right.
  - `valid`: no padding is applied, and the output size shrinks to ceil((n-f+1)/s), where n is the input dimension, f is the filter size and s is the stride length. ceil rounds up to the nearest integer.

- **Stride:** The number of pixels by which the filter moves across the input, horizontally and vertically, between successive positions during convolution. A stride greater than 1 reduces the output size considerably: without padding, the size shrinks to ceil((n-f+1)/s), where n is the input dimension, f is the filter size and s is the stride length. ceil rounds up to the nearest integer.


- **Number of Channels:** For the input, this equals the number of color channels; in later stages it equals the number of filters used in the preceding convolution. More channels mean more filters and more learned features, but also a greater chance of over-fitting, and vice versa.
- **Pooling-layer Parameters:** Pooling layers take the same kind of parameters as a convolution layer (window size, stride and padding). Max-pooling is the most commonly used pooling option. The objective is to down-sample an input representation (image, hidden-layer output matrix, etc.), reducing its dimensionality by keeping the max value (the most activated feature) in each sub-region.
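
The size arithmetic for padding and stride can be checked with a small helper. This is only a sketch: the function name `conv_output_size` is made up for illustration, and it assumes the TensorFlow/Keras conventions of ceil((n-f+1)/s) for `valid` padding and ceil(n/s) for `same` padding.

```py
import math


def conv_output_size(n, f, s=1, padding="valid"):
    """Output size along one axis of a regular convolution.

    n: input size, f: filter size, s: stride (hypothetical helper).
    """
    if padding == "valid":
        # No padding: only positions where the filter fits entirely.
        return math.ceil((n - f + 1) / s)
    if padding == "same":
        # Zero padding is added so that only the stride shrinks the output.
        return math.ceil(n / s)
    raise ValueError("padding must be 'valid' or 'same'")


print(conv_output_size(12, 5))                       # 8
print(conv_output_size(32, 3, s=2, padding="same"))  # 16
```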


#### Adding Batch Normalization

**Batch Normalization:** In deep neural network architectures, the normalized input passes through many adjustments in the intermediate layers and can become too big or too small by the time it reaches far-away layers. This causes a problem known as internal covariate shift, which hampers learning. To solve this, we add a batch normalization layer that standardizes (mean centering and variance scaling) the input given to the later layers. A common recommendation is to place this layer after the layer containing the activation function and before the Dropout layer (if any), although the original batch normalization paper applies it before the activation and both orderings are used in practice. For the sigmoid activation function, place the batch normalization layer before the activation, so that the values lie in the linear region of the sigmoid before the function is applied.
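
To make "mean centering and variance scaling" concrete, here is a minimal pure-Python sketch of the standardization step for one feature across a batch. The name `batch_normalize` and the sample values are illustrative assumptions, and the learned scale and shift parameters that a real `BatchNormalization` layer also applies are left out.

```py
def batch_normalize(values, eps=1e-5):
    """Standardize one feature across a batch (hypothetical helper).

    Omits the learned scale (gamma) and shift (beta) of a real layer.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    # eps avoids division by zero when the batch has no variance.
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


# Activations that have drifted to very different scales.
activations = [0.5, 120.0, -40.0, 3.0]
normalized = batch_normalize(activations)
print(normalized)
```

After this step the values have a mean of roughly 0 and a variance of roughly 1, whatever their original scale was.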

---

Now, check what I did inside the Generator Function.

#### In my case for the DCGAN on CELEB-A Dataset in this Notebook, I wanted the final output images to be `64*64*3`

#### So I start with a 4 * 4 base and apply `Conv2DTranspose` 4 times, each doubling the size: 4 * 2 * 2 * 2 * 2 = 64

Recall again that regular convolution is typically used to reduce input width and height while increasing its depth. Transposed convolution goes in the reverse direction: it is used to increase the width and height while reducing depth, as you can see in the Generator network diagram from the original paper.

So here I start with a vector of 100 and end up with 64 * 64 * 3 Image Dimension. So I am progressively increasing the image size from [nz,1,1] to [nc,ngf,ngf] i.e. from [100,1,1] to [3,64,64].

The Generator starts with a noise vector z. Using a fully connected layer followed by a reshape, we turn the vector into a three-dimensional hidden layer with a small base (width × height) and large depth (512). Using transposed convolutions, the input is progressively reshaped such that its base grows while its depth decreases until we reach the final layer with the shape of the image we are seeking to synthesize, 64 × 64 × 3. After each transposed convolution layer, we apply batch normalization and the Leaky ReLU activation function. At the final layer, we do not apply batch normalization and, instead of Leaky ReLU, we use the Sigmoid activation function.
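
The progression of shapes described above can be traced with a few lines of plain Python. This is a sketch rather than the model itself: `generator_shapes` is a hypothetical helper, and it assumes every transposed convolution uses `padding="same"`, so each layer multiplies width and height by the stride.

```py
def generator_shapes(base=4, depth=512, filters=(256, 128, 64, 3), stride=2):
    """List the (height, width, channels) shape after each generator stage."""
    shapes = [(base, base, depth)]  # output of Dense + Reshape
    size = base
    for f in filters:
        size *= stride              # padding="same": output = input * stride
        shapes.append((size, size, f))
    return shapes


for shape in generator_shapes():
    print(shape)
# (4, 4, 512)
# (8, 8, 256)
# (16, 16, 128)
# (32, 32, 64)
# (64, 64, 3)
```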

---


### My Generator Function - For the DCGAN on CELEB-A Dataset in this Notebook, I wanted the final output images to be `64*64*3`

```py

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Dense, Reshape, Conv2DTranspose,
                                     BatchNormalization, LeakyReLU)

noise_shape = 100  # length of the input noise vector

def generator_model():
  model=Sequential()

  # Random noise to 4x4x512 image
  model.add(Dense(4*4*512, input_shape=[noise_shape]))

  #  Next, add a reshape layer to the network to reshape the tensor from the
  # last layer to a tensor of a shape of (4, 4, 512):
  model.add(Reshape([4,4,512]))

  # 4x4x512 -> 8x8x256
  model.add(Conv2DTranspose(256, kernel_size=4, strides=2, padding="same"))
  # In this block BatchNormalization comes after the hidden layer, but before the activation (LeakyReLU).
  model.add(BatchNormalization())
  model.add(LeakyReLU(alpha=0.2))

  # 8x8x256 -> 16x16x128 (here the activation comes before BatchNormalization)
  model.add(Conv2DTranspose(128, kernel_size=4, strides=2, padding="same"))
  model.add(LeakyReLU(alpha=0.2))
  model.add(BatchNormalization())

  # 16x16x128 -> 32x32x64
  model.add(Conv2DTranspose(64, kernel_size=4, strides=2, padding="same"))
  model.add(LeakyReLU(alpha=0.2))
  model.add(BatchNormalization())

  # 32x32x64 -> 64x64x3, sigmoid keeps pixel values in [0, 1]
  model.add(Conv2DTranspose(3, kernel_size=4, strides=2, padding="same",
                                  activation='sigmoid'))
  return model

model = generator_model()
model.summary()

```


## Explanations of Parameter Calculation to my Input Dense Layer

The meaning of (4*4*512)


The Dense layer has 4*4*512 = 8192 neurons, and each neuron is connected to every element of the 100-element noise vector, giving me

= (512*4*4*100) = 819200 weights.

To the above I also have to add the bias terms,

which will be (512*4*4) = 8192.

Totaling = 827392 parameters.


This is because of this equation:

input * weights + bias


- input would be the 100-element noise vector (so each neuron has 100 weights)

- weights would be 100 * (512 * 4 * 4), i.e. 100 weights for each of the 512 * 4 * 4 neurons

- bias would be 512 * 4 * 4 (one bias per neuron)
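
The same count can be reproduced in code. The helper name `dense_param_count` is an assumption for illustration; it simply applies the input * weights + bias reasoning above.

```py
def dense_param_count(input_dim, units, use_bias=True):
    """Trainable parameters of a fully connected layer (hypothetical helper)."""
    weights = input_dim * units         # one weight per (input, neuron) pair
    biases = units if use_bias else 0   # one bias per neuron
    return weights + biases


print(dense_param_count(100, 4 * 4 * 512))  # 827392
```

This should match the parameter count that `model.summary()` reports for the first Dense layer.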


**Dense** is pretty much Keras's way to say matrix multiplication.

As a separate example, whenever we say `Dense(512, activation='relu', input_shape=(32, 32, 3))`, what we are really saying is: perform a matrix multiplication that results in an output whose last dimension is 512, here an output of shape (batch_size, 32, 32, 512).

----------------------------------

Consider a stride of 2. In a normal convolution, this would mean applying the filter only every two steps (skipping one step every time), which would result in an output half the size of the input. However, in a transposed convolution things are essentially reversed, and a stride of 2 gives you double the output size. It does this by basically inserting holes (zeros) into the input before applying the convolution. So a 2×2 stride upsamples the input in a transposed convolution, so that a 2×2 input image is upsampled to 4×4.
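
The "inserting holes" step can be pictured with a tiny example. The sketch below shows only the zero-insertion on a single-channel grid; the convolution and padding that follow it are left out, and `insert_holes` is a name made up for illustration.

```py
def insert_holes(grid, stride):
    """Spread the input values apart by placing stride-1 zeros between them."""
    rows, cols = len(grid), len(grid[0])
    out_rows = (rows - 1) * stride + 1
    out_cols = (cols - 1) * stride + 1
    out = [[0] * out_cols for _ in range(out_rows)]
    for r in range(rows):
        for c in range(cols):
            out[r * stride][c * stride] = grid[r][c]
    return out


for row in insert_holes([[1, 2], [3, 4]], stride=2):
    print(row)
# [1, 0, 2]
# [0, 0, 0]
# [3, 0, 4]
```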


Also note that the input_shape here is [noise_shape].

In a regular convolution model, if you have 30 images of 50x50 pixels in RGB (3 channels), the shape of your input data is (30,50,50,3), and your input layer tensor must have this shape.

But here the Generator model starts from just a 100-element vector, which is the noise_shape.

Given the input shape, all other shapes result from the layers' calculations.



---


### `tf.keras.layers.Dense` Method Understanding

I use `tf.keras.layers.Dense` as below in the Generator() function.

```py

generator=Sequential()
generator.add(Dense(4*4*512, input_shape=[noise_shape]))

```

The [Official Doc](https://keras.io/api/layers/core_layers/dense/) gives the signature below


```py

tf.keras.layers.Dense(
    units,
    activation=None,
    use_bias=True,
    kernel_initializer="glorot_uniform",
    bias_initializer="zeros",
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    **kwargs
)

```

The first argument, `units`, is the dimensionality of the output space. It must be a positive integer since it represents the dimensionality of the output vector.

The "units" of each layer will define the output shape (the shape of the tensor that is produced by the layer and that will be the input of the next layer).

**Each type of layer works in a particular way. Dense layers have output shape based on "units", convolutional layers have output shape based on "filters". But it's always based on some layer property.**

A dense layer has an output shape of (batch_size,units). So, yes, units, the property of the layer, also defines the output shape.
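
As a small sketch of how `units` determines the output shape: a Dense layer acts on the last axis of its input and replaces it with `units`. The helper `dense_output_shape` below is hypothetical and only mimics that shape rule; `None` stands for the batch size.

```py
def dense_output_shape(input_shape, units):
    """Shape produced by a Dense layer: the last axis becomes `units`."""
    return tuple(input_shape[:-1]) + (units,)


# The generator's first layer: a batch of 100-element noise vectors.
print(dense_output_shape((None, 100), 4 * 4 * 512))  # (None, 8192)

# A Dense layer applied to image-shaped input only changes the last axis.
print(dense_output_shape((None, 32, 32, 3), 512))    # (None, 32, 32, 512)
```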

---


### Keras `Conv2DTranspose` class - Understanding Number of Filters - The very first Parameter

The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution.

The first required parameter of `Conv2DTranspose`, as with `Conv2D`, is the number of filters that the convolutional layer will learn.


`Conv2DTranspose` class constructor has the following signature:


```py

tf.keras.layers.Conv2DTranspose(
    filters,
    kernel_size,
    strides=(1, 1),
    padding="valid",
    output_padding=None,
    data_format=None,
    dilation_rate=(1, 1),
    activation=None,
    use_bias=True,
    kernel_initializer="glorot_uniform",
    bias_initializer="zeros",
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    **kwargs
)


```

**And filters is an Integer, which is the dimensionality of the output space (i.e. the number of output filters in the convolution).**


The number of filters is the number of neurons, and each neuron performs a different convolution on the input to the layer (more precisely, the neurons' input weights form convolution kernels).

![Imgur](https://imgur.com/p6RwUDf.png)


So when I am doing something like

`model.add(Conv2DTranspose(32, (3, 3), padding="same", activation="relu"))`

It means the model will learn a total of 32 filters, in this layer.

### Note the signature / the way you have to mention kernel_size in **Conv2DTranspose**

`kernel_size` must be an integer or a tuple such as `(3, 3)`. Writing `(3*3)` is just the Python integer 9, so Keras treats it as a 9 by 9 kernel. The counts below are for one input channel and one filter:

- `(3*3)` kernel-size is equal to (9,9). It has 82 trainable parameters: 81 for the kernel (9*9) and 1 bias.

- `(2*2)` kernel-size is equal to (4,4). It has 17 trainable parameters: 16 for the kernel (4*4) and 1 bias.

So if you set kernel_size to `(3*3)`, you get a kernel of size (9,9).

So, if you want a 2 by 2 kernel, for example, you should set it as `(2,2)`. Then you will have 5 trainable parameters, of which 4 are weights and 1 is the bias.
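
The pitfall is easy to reproduce in plain Python. The two helpers below are hypothetical stand-ins that mimic how an integer `kernel_size` is expanded to a square kernel; the parameter count assumes one input channel and one filter unless stated otherwise.

```py
def normalize_kernel_size(kernel_size):
    """Expand an int to a square (k, k) kernel; pass tuples through."""
    if isinstance(kernel_size, int):
        return (kernel_size, kernel_size)
    return tuple(kernel_size)


def kernel_param_count(kernel_size, in_channels=1, filters=1):
    """Weights plus one bias per filter (hypothetical helper)."""
    kh, kw = normalize_kernel_size(kernel_size)
    return kh * kw * in_channels * filters + filters


print(normalize_kernel_size((3*3)))   # (9, 9)  because (3*3) is the int 9
print(kernel_param_count((3*3)))      # 82
print(normalize_kernel_size((3, 3)))  # (3, 3)
print(kernel_param_count((2, 2)))     # 5
```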



### VERY IMPORTANT - What should be number of filters and kernel size in Conv2DTranspose?

[Source](https://stackoverflow.com/questions/66671579/what-should-be-filters-and-kernel-size-in-conv2dtranspose)


Let's say I want to generate images of dimension (200, 200).

That means the final shape of the image generated by the Generator model also needs to be 200*200.

And let's say, for the Dense layer, I am starting with `(25*25*432)`.

So to get to the number 200, I have to do 25 x 2 x 2 x 2 (i.e. 200),

which means I have to implement a network that deconvolves 3 times, like below.


```py
cnn.add(Dense(25*25*432, input_dim=latent_size, activation='relu'))
cnn.add(Reshape((25, 25, 432)))

# And then deconvolve 3 times to 25x2x2x2 = 200

cnn.add(Conv2DTranspose(192, 2, strides=2, padding='valid',
                    activation='relu',
                    kernel_initializer='glorot_normal'))
cnn.add(BatchNormalization())

cnn.add(Conv2DTranspose(96, 2, strides=2, padding='valid',
                    activation='relu',
                    kernel_initializer='glorot_normal'))
cnn.add(BatchNormalization())

cnn.add(Conv2DTranspose(3, 2, strides=2, padding='valid',
                    activation='relu',
                    kernel_initializer='glorot_normal'))
cnn.add(BatchNormalization())


```
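
The number of upsampling layers can also be worked out programmatically. The sketch below uses a hypothetical helper, `num_upsampling_layers`, and assumes every layer multiplies the size by exactly the stride, as a stride-2 `Conv2DTranspose` does with a kernel size of 2 and `valid` padding, or with `same` padding.

```py
def num_upsampling_layers(start, target, stride=2):
    """Count how many stride-`stride` upsampling layers turn start into target."""
    layers = 0
    size = start
    while size < target:
        size *= stride
        layers += 1
    if size != target:
        raise ValueError("target is not start multiplied by a power of the stride")
    return layers


print(num_upsampling_layers(25, 200))  # 3  (25 -> 50 -> 100 -> 200)
print(num_upsampling_layers(4, 64))    # 4  (4 -> 8 -> 16 -> 32 -> 64)
```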

#### In my case for the DCGAN on CELEB-A Dataset in this Notebook, I wanted the final output images to be `64*64*3`

#### And I started with 4 * 4 and then apply `Conv2DTranspose` 4 Times i.e 4 and then 2 * 2 * 2 * 2 Giving me 64


**Also keep in mind that, when using strided transposed convolutions, the kernel size should be divisible by the stride in order to avoid checkerboard artifacts. These artifacts are caused by uneven overlap of the kernel in some parts of the output image. They can be fixed or reduced by using a kernel size divisible by the stride, e.g. a kernel size of 2x2 or 4x4 with a stride of 2.**

### Super Important - filters vs kernel_size parameter in Keras Conv2D() or Conv2DTranspose()

The `filters` argument sets the `number` of convolutional filters in that layer. These filters are initialized to small, random values, using the method specified by the `kernel_initializer` argument. During network training, the filters are updated in a way that minimizes the loss. So over the course of training, the filters will learn to detect certain features, like edges and textures, and they might become something like the image below (from [here](https://cs231n.github.io/convolutional-networks/#conv)).

[![A set of CNN filters][1]][1]

It is very important to realize that one does not hand-craft filters. These are learned automatically during training -- that's the beauty of deep learning.

I would highly recommend going through some deep learning resources, particularly https://cs231n.github.io/convolutional-networks/


  [1]: https://i.stack.imgur.com/ljMwB.jpg


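Common filter counts include 32, 64, 96, 128, 256, 384 and 512. Note that the dataset names sometimes quoted next to such numbers (32 for CIFAR-10, 96 for STL-10, 224 for common ImageNet ConvNets) refer to input image sizes, not to filter counts.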

### What is filter

In the context of CNN, a filter is a set of learnable weights, represented by a vector of weights, which are learned using the backpropagation algorithm. You can think of each filter as storing a single template/pattern. When you convolve this filter across the corresponding input, you are basically trying to find out the similarity between the stored template and different locations in the input. The filter, similar to a filter encountered in signal processing, provides a measure for how close a patch of input resembles a feature. A feature may be vertical edge or an arch.
The feature that the filter helps identify is not engineered manually but derived from the data through the learning algorithm.

You can consider each filter to be responsible for extracting some type of feature from a raw image. The CNNs try to learn such filters i.e. the filters parametrized in CNNs are learned during training of CNNs. You apply each filter in a Conv2D to each input channel and combine these to get output channels. So, the number of filters and the number of output channels are the same.

#### Now lets talk about the difference between filters vs kernel_size

[Source](https://stackoverflow.com/a/51180353/1902852)

Each convolution layer consists of several convolution channels (aka depth or filters). In practice, their number is something like `64, 128, 256, 512`, etc. This is equal to the number of channels in the output of the convolutional layer. `kernel_size`, on the other hand, is the size of these convolution filters. In practice, it takes values such as `3x3`, `1x1` or `5x5`. In abbreviated form, kernel_size is written as `1`, `3` or `5`, since kernels are mostly square in practice.

Let's check an example.

In the image below I present a normal convolution. After going through a 5x5x3 kernel, the 12x12x3 image becomes an 8x8x1 image. So here the number of filters is 1 and the filter size (i.e. kernel_size) is 5 * 5 * 3.

![Imgur](https://imgur.com/yc04Rg5.png)

What if we want to increase the number of channels in our output image? What if we want an output of size 8x8x256?
Well, we can create 256 kernels to create 256 8x8x1 images, then stack them up together to create a 8x8x256 image output.

![Imgur](https://imgur.com/WYPO87j.png)

This is how a normal convolution works.


I like to think of it like a function: 12x12x3 -> (5x5x3x256) -> 8x8x256

Here the filter size `5x5x3x256` represents the height, width, number of input channels, and number of output channels of the kernel.
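
The shapes and the weight count of this example can be verified with a short calculation. `conv2d_summary` is a hypothetical helper written for this note; it assumes no padding and one bias per filter.

```py
def conv2d_summary(in_shape, kernel, filters, stride=1):
    """Return (output shape, weight tensor shape, trainable parameter count)."""
    h, w, c_in = in_shape
    kh, kw = kernel
    out_h = (h - kh) // stride + 1   # no padding ("valid")
    out_w = (w - kw) // stride + 1
    weights_shape = (kh, kw, c_in, filters)
    params = kh * kw * c_in * filters + filters
    return (out_h, out_w, filters), weights_shape, params


print(conv2d_summary((12, 12, 3), (5, 5), filters=1))
# ((8, 8, 1), (5, 5, 3, 1), 76)
print(conv2d_summary((12, 12, 3), (5, 5), filters=256))
# ((8, 8, 256), (5, 5, 3, 256), 19456)
```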


Note that this is not matrix multiplication; we’re not multiplying the whole image by the kernel, but moving the kernel through every part of the image and multiplying small parts of it separately.



### IMPORTANT - Now the obvious question you may ask is, if there's any ideal number for the number of filters (i.e. the first parameter of Conv2DTranspose ) to use

Short answer: there is no direct method to determine the number of filters to use for your model. However, you can test values such as 16, 32, 64, 128, 256...

The number of filters is a hyper-parameter that can be tuned. The number of neurons in a convolutional layer equals the size of the layer's output. In the case of images, it's the size of the feature map.

Now for the details. First note that the first parameter, `filters`, is the dimensionality of the output space (i.e. the number of output filters in the convolution).
The number of filters can be thought of as the number of distinct neurons, since each one performs a different convolution on the input to the layer (more precisely, the neurons' input weights form convolution kernels).

In general, people tend to use powers of 2 for different hyperparameters in neural nets. It is not definitively proven to be more effective, but there are schools of thought that point to it being the most effective approach. In terms of the number of filters, filters are meant to detect features. If you add more filters, it should be able to capture more complex features, whether they be visual or physical. The drawback to increasing the number of filters in each layer is the added parameters that are associated with it. This makes your model take up more memory, and it will take longer to train as there are more parameters to update.
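
To see how the filter count drives the parameter count, the sketch below tallies the trainable parameters of the four transposed convolutions in this notebook's generator (kernel size 4, channels 512 -> 256 -> 128 -> 64 -> 3). The helper name is an assumption, and the BatchNormalization parameters are not included.

```py
def conv_transpose_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square kernel (hypothetical helper)."""
    return kernel * kernel * in_channels * filters + filters


channels = [512, 256, 128, 64, 3]
per_layer = [
    conv_transpose_params(4, c_in, c_out)
    for c_in, c_out in zip(channels, channels[1:])
]
print(per_layer)  # [2097408, 524416, 131136, 3075]
```

The first layer, with the most filters and the deepest input, holds far more parameters than the rest, which is where the extra memory and training time come from.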

In my view, a filter is a single neuron that sweeps through the image, providing different activations for each position. An entire feature map is produced by a single neuron/filter at multiple positions in my view.

In general, the more features you want to capture (and that are potentially available) in an image, the higher the number of filters required in a CNN.

So the more filters, the more the network can learn (which is not necessarily good all the time; saturation and convergence matter the most).

In other words, the number of filters you set in a layer should give the network ENOUGH containers to learn relevant features (or their combinations). What number is sufficient depends on the dataset. Say a CNN at layer X needs at least 24 feature maps to learn the important features; you then provide, say, 32, on the idea that you give the network some breathing space and let it decide on its own. Some of the 32 may end up redundant or only slightly varied.


 Great explanations about transposed convolutions: https://datascience.stackexchange.com/questions/6107/what-are-deconvolutional-layers

---



#### As an example, what this line `Conv2DTranspose(64, (2, 2), strides=(2, 2))` is doing

What does this layer do exactly?

* First of all the default padding in this case is valid. This means we have no padding.

* The size of the output will be 2 times bigger: if input (m, n), output will be (2m, 2n). Why is that? See the next point.

* Take the first element from the input and multiply by the filter weights with shape (2,2). Put it into the output. Take the next element, multiply and put in the output next to the first result without overlapping. Why is that? We have strides (2, 2).
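
The three points above can be acted out with a minimal single-channel transposed convolution in plain Python. This is a sketch of the technique (multiply each input element by the kernel and add the result into the output at a stride-spaced position), not the Keras implementation; it ignores channels, bias and padding, and `conv_transpose_2d` is a made-up name.

```py
def conv_transpose_2d(inp, kernel, stride):
    """Single-channel transposed convolution without padding."""
    in_h, in_w = len(inp), len(inp[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (in_h - 1) * stride + k_h
    out_w = (in_w - 1) * stride + k_w
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(in_h):
        for j in range(in_w):
            # Stamp a scaled copy of the kernel into the output.
            for a in range(k_h):
                for b in range(k_w):
                    out[i * stride + a][j * stride + b] += inp[i][j] * kernel[a][b]
    return out


result = conv_transpose_2d([[1, 2], [3, 4]], [[1, 1], [1, 1]], stride=2)
for row in result:
    print(row)
# [1, 1, 2, 2]
# [1, 1, 2, 2]
# [3, 3, 4, 4]
# [3, 3, 4, 4]
```

With a 2x2 kernel and a stride of 2 the stamped copies sit side by side without overlapping, so a 2x2 input becomes a 4x4 output.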

-------------------------------------------------------------------------------------------

### Formulae of output size for deconvolution

output_size = strides * (input_size-1) + kernel_size - 2*padding

[source](https://stackoverflow.com/a/49347788/1902852)

Just another representation from [SO](https://datascience.stackexchange.com/a/36847/101199)


The correct formula for computing the size of the output with `tf.layers.conv2d_transpose()`:

- Padding == Same: `H = H1 * stride`
- Padding == Valid: `H = (H1-1) * stride + HF`

where H = output size, H1 = input size, HF = height of filter.

e.g., if `H1` = 7, Stride = 3, and Kernel size = 4,

```
With padding=="same", output size = 21,
with padding=="valid", output size = 22

```
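
The two formulas translate directly into code. The helper below is a sketch with a made-up name, `deconv_output_size`; the `valid` branch assumes the kernel is at least as large as the stride.

```py
def deconv_output_size(h1, stride, kernel, padding):
    """Output size along one axis of a transposed convolution."""
    if padding == "same":
        return h1 * stride
    if padding == "valid":
        # Assumes kernel >= stride.
        return (h1 - 1) * stride + kernel
    raise ValueError("padding must be 'same' or 'valid'")


print(deconv_output_size(7, stride=3, kernel=4, padding="same"))   # 21
print(deconv_output_size(7, stride=3, kernel=4, padding="valid"))  # 22
print(deconv_output_size(4, stride=2, kernel=4, padding="same"))   # 8
```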


-------------------------------------------------------------------------------------------

## SUPER IMPORTANT -  Calculate Output size from filter_size (i.e. kernel_size) and stride when upsampling image using `Conv2DTranspose`

Have a look at the [Keras docs](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2DTranspose). You can find the formula to calculate the output shape there:



```py
new_rows = ((rows - 1) * strides[0] + kernel_size[0] - 2 * padding[0] + output_padding[0])
new_cols = ((cols - 1) * strides[1] + kernel_size[1] - 2 * padding[1] + output_padding[1])

```

Rearranging this formula gives the kernel size and padding to use when upsampling an image with `Conv2DTranspose`.

A. Just insert your input image size (e.g. 2, 2 if your input image is of shape 2-by-2 pixel for Height and Width) for rows and cols.

B. Then insert the upscaled size e.g. (18,18) for new_rows and new_cols.

C. To upscale your image by factor x you normally use a stride of x. Rearranging the formula gives you the required kernel size and paddings.
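
Steps A to C amount to solving the output-shape formula for the kernel size. The sketch below does that along one axis; `required_kernel_size` is a hypothetical helper, and the padding values in the second call are assumptions chosen for illustration.

```py
def required_kernel_size(rows, new_rows, stride, padding=0, output_padding=0):
    """Solve new_rows = (rows-1)*stride + kernel - 2*padding + output_padding."""
    return new_rows - (rows - 1) * stride + 2 * padding - output_padding


# Upscale a 2x2 input to 18x18 (factor 9) with a stride of 9 and no padding.
print(required_kernel_size(rows=2, new_rows=18, stride=9))            # 9

# Double a 4x4 input to 8x8 with a stride of 2 and one pixel of padding.
print(required_kernel_size(rows=4, new_rows=8, stride=2, padding=1))  # 4
```

In both cases the resulting kernel size is divisible by the stride.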


For example, to double the image size you would use a stride of 2: whereas a stride of 2 halves the output in a normal convolution, in a transposed convolution it doubles it.

Also, you should use the minimum possible padding and a kernel size divisible by the stride.


-------------------------------------------------------------------------------------------

## VERY IMPORTANT - understanding Transposed-Convolutions in General

[Source](https://towardsdatascience.com/understand-transposed-convolutions-and-build-your-own-transposed-convolution-layer-from-scratch-4f5d97b2967)



### Kernel Size (or Filter Size) in Transposed-Convolutions - For my Notebook in here, I am using 4 * 4 Kernel Size

In transposed convolutions, when the kernel size gets larger, we “disperse” (i.e. spread over a large area) every single number from the input layer to a broader area. Therefore, the larger the kernel size, the larger the output matrix (if no padding is added):

![Imgur](https://imgur.com/VRaCTnH.png)


#### So how should you choose your kernel_size ( i.e. filter_size) ?

Is there a specific way to choose these dimensions? The answer is no. In practice, the most popular choice among deep learning practitioners for regular convolutions is a 3x3 kernel.

Treat the filter size as one of the hyper-parameters to choose from:

If your input image is larger than 128×128, consider using a 5×5 or 7×7 kernel to learn larger features.

If your images are smaller than 128×128, you may want to consider starting with strictly 3×3 or 4×4 filters.

#### But still, why not 1x1, 2x2 or 4x4 as smaller sized kernel?

- **1x1 kernel size** is only used for dimensionality reduction, i.e. reducing the number of channels. It captures the interaction of input channels at just one pixel of the feature map. Therefore, 1x1 is ruled out for feature extraction, because the extracted features would be extremely fine-grained and local, with no information from the neighboring pixels.

- **2x2 and 4x4 are generally NOT preferred in regular convolutions** because only odd-sized filters place the previous-layer pixels symmetrically around the output pixel. Without this symmetry, distortions appear across the layers, which is what happens with even-sized kernels such as 2x2 and 4x4. Strided transposed convolutions are the exception: there, a kernel size divisible by the stride (such as the 4x4 kernel with stride 2 used in this notebook) is preferred because it reduces checkerboard artifacts.

![Imgur](https://imgur.com/WBcciW8.png)


### Strides in Transposed-Convolutions - For my Notebook in here, I am specifying strides=2


In transposed convolutions, the strides parameter indicates how fast the kernel moves on the output layer, as explained by the picture below. Notice that the kernel always moves only one number at a time on the input layer. Thus, the larger the strides, the larger the output matrix (if no padding is used).

![Imgur](https://imgur.com/QLA2Yy0.png)


`Conv2DTranspose` with a stride of 2 and `padding="same"` doubles the two spatial dimensions, so with 32 filters an input of shape (None, 32, 2, 32) will produce an output of shape (None, 64, 4, 32).


This is the reverse of a normal convolution, where a stride of 2 halves the output size.

-------------------------------------------------------------------------------------------

### VERY IMPORTANT - How to decide how many convolutions or deconvolutions apply to a GAN?

[Source](https://stackoverflow.com/questions/56392367/how-to-decide-how-many-convolutions-e-deconvolutions-apply-to-a-gan)


Let's take this simple example of a plain-vanilla GAN's Generator() function. Here, training happens on the CIFAR10 image dataset, a dataset of 50,000 32x32 RGB images belonging to 10 classes (5,000 images per class).

```py

import keras
from keras import layers
import numpy as np

latent_dim = 32
height = 32
width = 32
channels = 3

generator_input = keras.Input(shape=(latent_dim,))

# First, transform the input into a 16x16 128-channels feature map
x = layers.Dense(128 * 16 * 16)(generator_input)
x = layers.LeakyReLU()(x)
x = layers.Reshape((16, 16, 128))(x)

# Then, add a convolution layer
x = layers.Conv2D(256, 5, padding='same')(x)
x = layers.LeakyReLU()(x)

# Upsample to 32x32
x = layers.Conv2DTranspose(256, 4, strides=2, padding='same')(x)
x = layers.LeakyReLU()(x)

# Few more conv layers
x = layers.Conv2D(256, 5, padding='same')(x)
x = layers.LeakyReLU()(x)
x = layers.Conv2D(256, 5, padding='same')(x)
x = layers.LeakyReLU()(x)

# Produce a 32x32 3-channel (RGB) image
x = layers.Conv2D(channels, 7, activation='tanh', padding='same')(x)
generator = keras.models.Model(generator_input, x)
generator.summary()

```

Now lets look at following 2 Questions

### 1. Why the input is transformed into 16 × 16 128-channel

It's an arbitrary choice, you could have chosen any number of channels for the Dense layer. The 16x16 is set because you only have 1 layer to up sample 16x16 to 32x32.

And also, that upsample to 32*32 is done with the stride of 2.

**Strides** are used to influence output size of convolution layers. In normal convolutions, outputs are downsampled by the same factor as strides, whereas in transposed convolutions they are upsampled by the same factor as strides.

For instance, you could change your first layer output to `8x8x128` and then use a stride of 4 in your `Conv2DTranspose`, this way you would get the same result in terms of dimensionality.



#### 2. When a convolution is performed, with which filter ?

The first argument you set in `Conv2D` or `Conv2DTranspose` is the number of filters learned by that layer. The filter weights themselves are not chosen by hand; they are learned during training.

As said before, the strided `Conv2DTranspose` is used precisely to upsample width and height by a factor equal to the stride.

The filter counts and kernel sizes of the `Conv2D` layers are also arbitrary; you should determine them by experimentation and fine-tuning your model.


-------------------------------------------------


#### What is checkerboard pattern of artifacts created in GAN Training if Kernel size is not divisible by strides

[source](https://distill.pub/2016/deconv-checkerboard/)


When we look very closely at images generated by neural networks, we often see a strange checkerboard pattern of artifacts. It’s more obvious in some cases than others, but a large fraction of recent models exhibit this behavior.

A. One approach is to make sure you use a kernel size that is divisible by your stride, avoiding the overlap issue, e.g. a kernel size of 2x2 or 4x4 with a stride of 2.

This is equivalent to “sub-pixel convolution,” a technique which has recently had success in image super-resolution. However, while this approach helps, it is still easy for deconvolution to fall into creating artifacts.

B. Another approach is to separate out upsampling to a higher resolution from convolution to compute features. For example, you might resize the image (using nearest-neighbor interpolation or bilinear interpolation) and then do a convolutional layer. This seems like a natural approach, and roughly similar methods have worked well in image super-resolution
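
The uneven overlap behind approach A can be counted directly. The sketch below works in one dimension and counts, for each output position, how many kernel placements contribute to it; `overlap_counts` is a hypothetical helper and no padding is assumed.

```py
def overlap_counts(input_len, kernel, stride):
    """How many input elements write into each output position (1D, no padding)."""
    out_len = (input_len - 1) * stride + kernel
    counts = [0] * out_len
    for i in range(input_len):
        for k in range(kernel):
            counts[i * stride + k] += 1
    return counts


print(overlap_counts(4, kernel=3, stride=2))  # [1, 1, 2, 1, 2, 1, 2, 1, 1]
print(overlap_counts(4, kernel=4, stride=2))  # [1, 1, 2, 2, 2, 2, 2, 2, 1, 1]
print(overlap_counts(4, kernel=2, stride=2))  # [1, 1, 1, 1, 1, 1, 1, 1]
```

With a kernel size of 3 and a stride of 2 the counts alternate between 1 and 2, which is the checkerboard pattern in one dimension. With kernel sizes of 4 or 2 the interior is covered evenly.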

---


------------------------------------------------------------------------------------------------------------

# Reason for using 4 * 4 * 512 Shape for the Input Dense Layer in the Generator Function.

The generator takes a noise vector (latent_dim) as input and, after an initial Dense layer, uses only transposed convolutions to output an image of 64 x 64 x 3.

#### What I am doing here is this: I start with 512 channels and halve the number of output channels in each of the first 3 blocks (the first 3 `Conv2DTranspose` layers: 256, 128, 64). Then in the final block, the number of output channels equals 3 (RGB image).

To understand the above in detail, I have to explain the following 4 concepts from first principles.

### 1. What exactly is this first parameter to the Conv2DTranspose() function called 'filters' (filters are sometimes called windows or kernels)

The convolutional layer computes the convolutional operation of the input images using filters to extract features and scans the entire image looking through this filter. The filter is slid across the width and height of the input and the dot products between the input and filter are computed at every position. The output of a convolution is referred to as a feature map.

Each filter is convolved with the inputs to compute an activation map. The output volume of the convolutional layer is obtained by stacking the activation maps of all filters along the depth dimension.

We can scan the image using multiple filters to generate multiple feature mappings of the image. Each feature mapping will reveal the parts of the image which express the given feature defined by the parameters of our filter.

Each convolution layer consists of several filters. In practice, the number of filters is a value such as 32, 64, 128, 256, 512, etc. This is equal to the number of channels in the output of the convolutional layer.

**Filter Size (Kernel Size)**

Each filter has a defined width and height, and these are smaller than the width and height of the input volume.

The filters have the same depth as the input but a smaller width and height. As an example, for a 3D input image of shape [32, 32, 3], an acceptable filter size is f × f × 3, where f = 3, 5, 7, and so on.

kernel_size: is the size of these convolution filters. In practice, they take values such as 1×1, 3×3, or 5×5. To abbreviate, they can be written as 1 or 3 or 5 as they are mostly square in practice.
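The relationship between `filters`, `kernel_size` and the output can be checked with a little arithmetic. The sketch below is my own helper (the name `conv2d_summary` is hypothetical) and assumes a regular convolution with no padding and one bias per filter.

```python
def conv2d_summary(height, width, in_channels, filters, kernel_size, stride=1):
    """Output shape and parameter count of a Conv2D layer with no padding."""
    out_h = (height - kernel_size) // stride + 1
    out_w = (width - kernel_size) // stride + 1
    # each filter spans the full input depth: kernel_size x kernel_size x in_channels
    weights = kernel_size * kernel_size * in_channels * filters
    biases = filters  # one bias per filter
    return (out_h, out_w, filters), weights + biases


shape, params = conv2d_summary(32, 32, 3, filters=64, kernel_size=3)
print(shape)   # (30, 30, 64)
print(params)  # 1792
```

Note that the number of output channels (64) comes only from the number of filters, while the kernel size affects the output width and height and the number of weights.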


**Input Shape** - The input shape is the only one you must define because your model cannot know it. It is based on your training data. All the other shapes are calculated automatically based on the units and particularities of each layer.

For example, if you have 100 images of 32x32x3 pixels, the shape of your input data is (100,32,32,3). Then your input layer tensor must have this shape.

#### You always have to give a 4D array as input to the CNN. So the input data has a shape of (batch_size, height, width, channels), where the first dimension is the batch size (the number of images) and the other three dimensions are the height, width, and channels of each image.


### 2. Understanding the official Keras Implementation of Conv2DTranspose() filter parameter - for DCGAN with MNIST Dataset

The TensorFlow DCGAN tutorial code for the [generator](https://www.tensorflow.org/tutorials/generative/dcgan#the_generator) and discriminator models is intended for 28x28 pixel black-and-white images (MNIST dataset).


```py

import tensorflow as tf
from tensorflow.keras import layers


def make_generator_model():
    model = tf.keras.Sequential()
    model.add(layers.Dense(7*7*256, use_bias=False, input_shape=(100,)))
    model.add(layers.BatchNormalization())
    model.add(layers.LeakyReLU())

    model.add(layers.Reshape((7, 7, 256)))
    assert model.output_shape == (None, 7, 7, 256)  # Note: None is the batch size

    model.add(layers.Conv2DTranspose(128, (5, 5), strides=(1, 1), padding='same', use_bias=False))
    assert model.output_shape == (None, 7, 7, 128)
    model.add(layers.BatchNormalization())
    model.add(layers.LeakyReLU())

    model.add(layers.Conv2DTranspose(64, (5, 5), strides=(2, 2), padding='same', use_bias=False))
    assert model.output_shape == (None, 14, 14, 64)
    model.add(layers.BatchNormalization())
    model.add(layers.LeakyReLU())

    model.add(layers.Conv2DTranspose(1, (5, 5), strides=(2, 2), padding='same', use_bias=False, activation='tanh'))
    assert model.output_shape == (None, 28, 28, 1)

    return model

```

#### Explanation of the above code

Recall that regular convolution is typically used to reduce input width and height while increasing its depth. Transposed convolution goes in the reverse direction: it is used to increase the width and height while reducing depth, as you can see in the Generator network diagram from the original paper.

* The generator takes a vector of 100 values sampled from the prior distribution (noise), as you can see from the input_shape.

* Then, it projects the data into a bigger dimension of 7 * 7 * 256 and reshapes it to have feature maps of shape (7, 7, 256).

- **Now the idea is that, by the end of the model, we want to decrease the channels to 1 and increase the width and height to reach the original image size.**

* The channels are controlled by the number of filters, and that is why it decreases with each consecutive Conv2DTranspose layer. It goes from 256 to 128, 64 and 1.

* The width and height are controlled by the strides parameter. The first Conv2DTranspose doesn't change the width and height because it has strides of 1. The second multiplies them by 2, which yields (14, 14), and the last Conv2DTranspose does so again, which yields (28, 28).
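The bullet points above can be sketched as a small shape tracer. This is only an illustration in plain Python, not Keras code: `trace_shapes` is a hypothetical helper, and it assumes every layer uses `padding='same'`, where the output size is the input size times the stride.

```python
def trace_shapes(start_hw, start_channels, layers):
    """Follow (height, width, channels) through Conv2DTranspose layers
    that use padding='same', where output size = input size * stride."""
    h = w = start_hw
    shapes = [(h, w, start_channels)]
    for filters, stride in layers:
        h, w = h * stride, w * stride   # strides control width and height
        shapes.append((h, w, filters))  # filters control the channels
    return shapes


# (filters, stride) for the three Conv2DTranspose layers in the tutorial code
mnist_layers = [(128, 1), (64, 2), (1, 2)]
for shape in trace_shapes(7, 256, mnist_layers):
    print(shape)
# (7, 7, 256)
# (7, 7, 128)
# (14, 14, 64)
# (28, 28, 1)
```

These are the same shapes that the `assert model.output_shape == ...` lines check in the tutorial code.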


### 3. The way the Original Paper Suggested

First, note that in the original paper the input Dense layer starts with (4 * 4 * 1024) and the last layer outputs (64 * 64 * 3).

Essentially, this network takes in a 100x1 noise vector, and maps it into the G(Z) output which is 64x64x3.

### 100x1 → 1024x4x4 → 512x8x8 → 256x16x16 → 128x32x32 → 3x64x64

The first layer expands the random noise by projecting and reshaping it, and each following layer upsamples the result. So the generator's job is to take this random vector and generate a 3x64x64 image that is indistinguishable from real images. The input is a random 100-dimensional vector sampled from a standard normal distribution.

![Imgur](https://imgur.com/Xjbq4fH.png)


### 4. So finally in my Implementation of DCGAN on CELEB-A Dataset - Why I am using 4*4*512 for the first Dense Layer

Note again that regular convolution is typically used to reduce input width and height while increasing its depth. Transposed convolution goes in the reverse direction: it is used to increase the width and height while reducing depth, as you can see in the Generator network diagram from the original paper.

So here I start with a vector of 100 and end up with 64 * 64 * 3 Image Dimension.

The Generator starts with a noise vector z. Using a fully connected layer, we project and reshape the vector into a three-dimensional hidden layer with a small base (width × height) and large depth (512). Using transposed convolutions, the input is progressively reshaped such that its base grows while its depth decreases until we reach the final layer with the shape of the image we are seeking to synthesize, 64 × 64 × 3. After each transposed convolution layer, we apply batch normalization and the Leaky ReLU activation function. At the final layer, we do not apply batch normalization and, instead of Leaky ReLU, we use the Sigmoid activation function.

The idea of the Generator of DCGAN for the CELEB-A Dataset is that, by the end of the model, I want to decrease the channels to 3 and increase the width and height to reach the original image size (64 * 64).

Hence, I am starting with a Dense layer that takes this seed as input, then upsample several times until I reach the desired image size of **64x64x3**.

So what I am doing here is sizing the first hidden layer (Dense Layer) to project the data to (4 * 4 * 512), then adjusting the number of filters in each subsequent layer, and in the final Conv2DTranspose setting the filters to 3. In this way the final output is (64 * 64 * 3).

That is, in other words, the generator, G, needs to be designed to map the latent space vector (z of shape 100) to the required data-space (64x64x3 in this case).

Since our data are images, converting z to data-space means ultimately creating an RGB image with the same size as the training images (i.e. 3x64x64). This is accomplished through a series of strided two-dimensional convolutional transpose layers, each paired with a 2D batch norm layer and a ReLU activation.

If I were to apply DCGAN to the MNIST Dataset (which is shaped as 28 * 28 * 1), I would choose a Dense Layer that takes the seed as input, then upsample several times until I reach the desired image size of 28x28x1.
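The choice of the starting size can be worked backwards from the target image size. The sketch below is a hypothetical helper (`base_size` is my own name), and it assumes four stride-2 upsampling layers for CELEB-A and two for MNIST, each doubling the width and height.

```python
def base_size(target_size, num_upsampling_layers, stride=2):
    """Spatial size the Dense layer must reshape to, so that
    `num_upsampling_layers` layers with the given stride reach `target_size`."""
    factor = stride ** num_upsampling_layers
    if target_size % factor != 0:
        raise ValueError(f"{target_size} is not divisible by {factor}")
    return target_size // factor


celeba_base = base_size(64, num_upsampling_layers=4)   # 64 / 2**4
mnist_base = base_size(28, num_upsampling_layers=2)    # 28 / 2**2
dense_units = celeba_base * celeba_base * 512          # units in the first Dense layer

print(celeba_base, mnist_base, dense_units)  # 4 7 8192
```

So a 64x64 target with four doublings leads to a 4x4 base, and with a depth of 512 the Dense layer needs 4 * 4 * 512 = 8192 units. The same reasoning gives the 7x7 base in the MNIST tutorial code.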

---

### Bonus Sections

### How Filter Values are learned during backpropagation

First, note what the result of CNN training is. Typically, CNNs are trained in a supervised setting. That is, the input is an image, and the output is a label for that image. For instance, in the MNIST handwritten digit recognition dataset, you feed an image and the output is one of the ten digits. In the ImageNet dataset, you feed an image and the output is one of the 1000 objects.


The filter or kernel values, which are the conv layer weights, are learned through the usual backpropagation process. The loss calculated at the end is propagated back through the network. In each layer, the gradients are computed and the weights are adjusted by a step scaled by the learning rate.

The main purpose of training a deep convolutional neural network is to learn the values each filter will take, i.e. the kernel values in each filter, such that it extracts the right information from the image.

To explain in simple terms, let's say we have a 3*3 filter and all 9 of its values are initially set to 0.5. The filter runs over the image and produces the next level of feature map by doing the math.

The network then applies the final activation layer and predicts the probability of each class. If the model doesn't predict well for that batch/sample, it propagates the loss backwards and changes the values of this kernel (9 values).

In practice, it's not just these 9 values: this process is carried out for every convolution filter at each level.
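As a toy illustration of this loop, the sketch below learns a 2-value 1-D filter with plain gradient descent. It is a deliberate simplification and rests on my own assumptions: the target is the feature map of a known "true" filter rather than a class label, the loss is mean squared error, and there is a single layer, so no chain rule through other layers is needed.

```python
def correlate_1d(x, w):
    """Slide filter w over x (no padding, stride 1) and return the feature map."""
    n = len(x) - len(w) + 1
    return [sum(x[i + k] * w[k] for k in range(len(w))) for i in range(n)]


x = [1.0, 3.0, 2.0, 5.0, 4.0]           # toy input signal
target = correlate_1d(x, [1.0, -1.0])   # feature map produced by the "true" filter
w = [0.5, 0.5]                          # every filter value starts at 0.5
learning_rate = 0.02

for _ in range(500):
    y = correlate_1d(x, w)
    errors = [yi - ti for yi, ti in zip(y, target)]
    for k in range(len(w)):
        # each filter value touches every output position, so its
        # gradient sums the error over the whole feature map
        grad = 2.0 / len(errors) * sum(e * x[i + k] for i, e in enumerate(errors))
        w[k] -= learning_rate * grad

print([round(v, 3) for v in w])
```

After training, the filter values end up close to the "true" filter `[1.0, -1.0]`, even though they both started at 0.5.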

To make the training process more efficient, you can initialize these values in various ways.

* Assign the same value to all weights

* Assign random values with constant mean and standard deviation

* Assign random values within a range and many more.
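The three options above can be sketched with Python's standard `random` module. The function names and the default values (0.5, a standard deviation of 0.02, a range of ±0.05) are placeholders I chose for illustration, not values taken from this notebook.

```python
import random


def constant_init(n, value=0.5):
    """Assign the same value to every weight."""
    return [value] * n


def normal_init(n, mean=0.0, std=0.02):
    """Draw weights from a normal distribution with fixed mean and std."""
    return [random.gauss(mean, std) for _ in range(n)]


def uniform_init(n, low=-0.05, high=0.05):
    """Draw weights uniformly from the range [low, high]."""
    return [random.uniform(low, high) for _ in range(n)]


random.seed(0)       # fixed seed so the run is repeatable
n_weights = 3 * 3    # one 3x3 filter
print(constant_init(n_weights))
print(normal_init(n_weights))
print(uniform_init(n_weights))
```

In Keras, the same choice is normally made through the `kernel_initializer` argument of a layer rather than by hand.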

---

## What is a 'neuron' in the context of Convolutional Neural Network?

First, note that densely connected neural networks are poor at working with images, due to the huge number of parameters that would need to be learned. That's where CNNs come in: note from the image below how a CNN lets us reduce the calculations.

The figure below shows a 2D convolution: neurons marked with numbers 1-9 form the input layer, while units A-D denote the calculated feature map elements, and I-IV are the values of the filter (kernel).

![Imgur](https://imgur.com/d6c8IkY.gif)

First of all, you can see that not all neurons in the two consecutive layers are connected to each other. For example, unit 1 only affects the value of A. Secondly, we see that some neurons share the same weights. Both of these properties mean that we have far fewer parameters to learn. By the way, it is worth noting that a single value from the filter affects every element of the feature map; this will be crucial in the context of backpropagation.

[Source](https://cs231n.github.io/convolutional-networks/)
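The connectivity described above can be listed explicitly. The sketch below assumes the layout I read from the figure: a 3x3 input numbered 1-9 row by row, a 2x2 filter, stride 1, no padding, and feature map units labelled A-D row by row. The helper name `receptive_fields` is my own.

```python
def receptive_fields(input_size=3, kernel_size=2):
    """Map each feature-map unit to the input units it reads,
    for a stride-1 convolution without padding."""
    out_size = input_size - kernel_size + 1
    labels = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    fields = {}
    for r in range(out_size):
        for c in range(out_size):
            inputs = [
                (r + dr) * input_size + (c + dc) + 1  # inputs are numbered from 1, row by row
                for dr in range(kernel_size)
                for dc in range(kernel_size)
            ]
            fields[labels[r * out_size + c]] = inputs
    return fields


fields = receptive_fields()
for unit, inputs in fields.items():
    print(unit, inputs)
# A [1, 2, 4, 5]
# B [2, 3, 5, 6]
# C [4, 5, 7, 8]
# D [5, 6, 8, 9]
```

Input 1 appears only in the list for A, while input 5 appears in all four. A dense layer between the same 9 inputs and 4 outputs would need 9 * 4 = 36 weights, whereas here the same 4 filter values are reused at every position.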

By contrast, in an MLP the width of a layer is its number of neurons: if the first MLP layer has a width of 512, it has 512 neurons, all of which are connected to each input value.

One can think of a neuron in a convolutional layer as follows (according to a CS231n article[1]):

Every entry in the 3D output volume can also be interpreted as an output of a neuron that looks at only a small region in the input (i.e. its receptive field) and shares parameters with all neurons to the left and right spatially (since these numbers all result from applying the same filter).

Here, “an output of a neuron that looks at a small region in the input” refers to the dot product of a filter with a region in the input that produces the response of that filter at that spatial location, i.e. it results in one entry of the 3D output volume. Now that entry corresponds to the activation of that neuron with respect to that particular spatial location in the input. The neuron here represents the dot product of that filter with the input region (omitting bias and activation function here for simplicity).

Further, “and shares parameters with all neurons to the left and right spatially” refers to the fact that the filter slides over the width and height of the input volume so when the filter is slid one step (cf. stride) to the right and the dot product is taken again to produce the next entry in the 3D output volume, this then represents the “neuron to the right” that uses the same filter (i.e. parameters) as the neuron on its left.

Behind each entry (1x1x1) in the 3D output volume there is a neuron whose weights (i.e. the filter) were multiplied with a part of the input defined by the receptive field of that neuron to produce that entry.


