# Industry Demo: Using CNNs with Flower Images

## Objective 

Here is a quick run-through of the broad learning objectives of this session:

1. You will get a hands-on experience in building an end-to-end pipeline for training CNNs. This is almost exactly how you would do this in a production environment. You are provided code throughout this session. We urge you to try experimenting with this code - it's the best way to make yourselves proficient in image processing techniques.

2. We will spend a good amount of time on data preprocessing techniques commonly used with image processing. This is because preprocessing takes about 50-80% of your time in most deep learning projects, and knowing some useful tricks will help you a lot in your projects. We will first use the flowers dataset from Kaggle to demonstrate the key concepts. Later, we will apply the same techniques on Chest X-ray images. The purpose of starting with the flowers dataset is to understand the process using images that you understand before getting into medical images.

3. Eventually, we will build a classifier - for both the Flowers and X-ray datasets. We'll take you through the important steps and hyperparameters involved in this process.

## Neural networks in industry applications
Neural Networks have changed the face of image processing in the industry. Through this demonstration, we'll see how they are used in the medical imaging industry.

Some of the notable types of medical images are:

1. X-rays
2. CT Scans
3. MRI images

Today, jobs related to image processing are specialised enough that deep learning experts need to also understand the domain where machine learning is being applied. Rohit has worked extensively with medical images and will demonstrate how that knowledge can be applied to deep learning tasks (using chest X-Rays as examples).

## Structure of this session
This session is divided into two parts:
<ul><li dir="ltr"><p dir="ltr"><strong>Flower classification</strong>: First, we will see how to classify flowers into “roses” and “daisies”. This is a toy dataset and its purpose is to introduce you to the key concepts and methodologies. In this session, you will learn:</p><ul><li><p dir="ltr">How to set up an end-to-end pipeline for training deep learning models</p></li><li><p dir="ltr">Preprocessing techniques: Morphological transformations etc.</p></li><li><p dir="ltr">Data augmentation using data generators</p></li><li><p dir="ltr">Building a network: Ablation experiments, hyperparameter tuning, storing the best model on disk etc.</p></li></ul></li><li dir="ltr"><p dir="ltr"><strong>X-ray classification</strong>: We will apply the concepts learnt in the first half to Chest X-ray images. Here, you will learn how to <strong>identify and debug problems</strong> often encountered during training.</p></li></ul>

## Datasets

In this session, we will use the <a target="_blank" href="https://www.kaggle.com/alxmamaev/flowers-recognition">Kaggle flowers dataset</a> - this is the same one you had used in the session on transfer learning (and hence you can use the same notebook on Google Colab in this session). In case you haven't uploaded the flowers dataset on Google Colab, you can download the following instructions.

You can download the notebook used in this session <a target="_blank" href="https://github.com/ContentUpgrad/Convolutional-Neural-Network-Industry-Applications/tree/main/Using-CNNs-with-Flowers-Images">here</a>. You can download the dataset from this <a target="_blank" href="https://woolfaws-prod.s3.ap-south-1.amazonaws.com/flowers.rar">link</a>
        
<strong>Important note:</strong> For most of the code in the notebook, you can <strong>use a CPU</strong> on Google Colab and switch to the GPU later while training the final model (towards the end of the notebook). You will go through the script 'resnet.py' used in the architecture in the next few segments.

Let's get started. First, we'll look at some of the tools in our kit: Python libraries.


<div class="MuiBox-root css-1bi8ut6"><div class="text_component" data-testid="text-component"><p>By now, you should have the following ready:</p><ol><li>You have your notebook accessible on Google Colab.</li><li>You also have your flowers data accessible on the Google Colab environment divided into 'daisy' and 'rose'.</li></ol><p>&nbsp;</p><p>Please <strong>use a&nbsp;CPU</strong> (not a GPU) for running the initial parts of the code (everything before training). Let's move on to examining the shape and size of the images.</p><p>&nbsp;</p><h2>Images - Channels and sizes</h2><p>Images come in&nbsp;different shapes and sizes<strong>.</strong> They also&nbsp;come through <strong>different sources</strong>. For example, some images are what we call “natural images”, which means they are taken in <strong>colour</strong>, in the<strong> real world</strong>. For example:</p><ul><li>A picture of a flower is a natural image.</li><li>An X-ray image is <em>not</em> a natural image.&nbsp;</li></ul><p>Natural images also have a specific statistical meaning&nbsp;- if you're interested, you can read more about it at the end of this page.</p><p><br>Taking all these variations into consideration, we need to perform some pre-processing on any image data. Let’s watch Rohit introduce us to the first few steps in pre-processing.</p></div></div>

<div class="MuiBox-root css-1bi8ut6"><div class="text_component" data-testid="text-component"><p>To recap:</p><ul><li>RGB is the most popular encoding format, and most "natural images" we encounter are in RGB.</li><li>One of the first steps of data pre-processing is <b>to make all the images the same size.</b></li></ul><p>Let's move on to how we can <strong>change the shape and form of images.</strong></p><h2>Morphological transformations</h2><p>The term <em>morphological transformation</em> refers to any modification involving the <strong>shape and form</strong> of the images. These are very often used in image analysis tasks. Although they are used with all types of images, they are especially powerful for images that are not natural (i.e. that come from a source other than a picture of the real world).</p></div></div>

### Thresholding

Thresholding is one of the simpler operations: all pixels whose intensities are above a certain threshold are converted to ones, and pixels whose values are below the threshold are converted to zeros. This results in a *binary image*.
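
As a minimal sketch of thresholding, assuming a grayscale image stored as a NumPy array with intensities in the range 0 to 1 (the helper name `threshold_image`, the sample patch and the threshold of 0.5 are illustrative choices, not taken from the notebook):

```python
import numpy as np

def threshold_image(image, threshold):
    """Return a binary image: 1 where intensity > threshold, else 0."""
    # Assumption: pixels exactly equal to the threshold are sent to 0.
    return (image > threshold).astype(np.uint8)

# Hypothetical 3x3 grayscale patch
patch = np.array([[0.1, 0.6, 0.9],
                  [0.4, 0.5, 0.7],
                  [0.0, 0.2, 0.8]])

binary = threshold_image(patch, threshold=0.5)
print(binary)
```

The result contains only zeros and ones, which is the form that the morphological operations are usually applied to.
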
### Erosion, Dilation, Opening & Closing

**Erosion** shrinks bright regions and enlarges dark regions. **Dilation**, on the other hand, does the exact opposite: it shrinks dark regions and enlarges bright regions.

**Opening** is erosion followed by dilation. Opening can remove small bright spots (i.e. “salt”) and connect small dark cracks. This tends to “open” up (dark) gaps between (bright) features.

**Closing** is dilation followed by erosion. Closing can remove small dark spots (i.e. “pepper”) and connect small bright cracks. This tends to “close” up (dark) gaps between (bright) features.

All these can be done using the `skimage.morphology` module. The basic idea is to have a **circular disk** of a certain size (3 below) move around the image and apply these transformations using it.
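
The four operations can be tried out on a small synthetic image. The sketch below assumes `scikit-image` is installed. The test image (a bright square with one dark "pepper" pixel inside it and one bright "salt" pixel outside it) is made up for illustration and is not part of the flowers dataset.

```python
import numpy as np
from skimage.morphology import closing, dilation, disk, erosion, opening

# Hypothetical binary test image
image = np.zeros((31, 31), dtype=np.uint8)
image[8:23, 8:23] = 1   # bright 15x15 square
image[15, 15] = 0       # "pepper": a dark pixel inside the bright square
image[3, 3] = 1         # "salt": an isolated bright pixel

footprint = disk(3)     # circular structuring element of radius 3

eroded = erosion(image, footprint)    # bright regions shrink
dilated = dilation(image, footprint)  # bright regions grow
opened = opening(image, footprint)    # erosion followed by dilation
closed = closing(image, footprint)    # dilation followed by erosion

print("bright pixels:", image.sum(), eroded.sum(), dilated.sum())
print("salt after opening:", opened[3, 3])
print("pepper after closing:", closed[15, 15])
```

Opening removes the salt pixel but leaves the pepper pixel dark, while closing fills the pepper pixel but keeps the salt. This is why the choice between the two depends on the kind of noise in the image.
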


## Normalisation
Normalisation is the most crucial step in the pre-processing part. There are multiple ways to normalise images, and we will discuss two of them here.

<div class="MuiBox-root css-1bi8ut6"><div class="text_component" data-testid="text-component"><p>Note: The formulas for normalisation are: <img alt="Equation" data-latex="(image-np.min(image))/(np.max(image)-np.min(image))" src="https://latex.upgrad.com/render?formula=%28image-np.min%28image%29%29/%28np.max%28image%29-np.min%28image%29%29" style="vertical-align: middle;display: inline;"> and <img alt="Equation" data-latex="(image-np.percentile(image,5))/(np.percentile(image,95)-np.percentile(image,5))" src="https://latex.upgrad.com/render?formula=%28image-np.percentile%28image%2C5%29%29/%28np.percentile%28image%2C95%29-np.percentile%28image%2C5%29%29" style="vertical-align: middle;display: inline;">. Some of the brackets are missing in the video.</p><h2>Why do we normalise?</h2><p>Normalisation makes the training process much <strong>smoother</strong>. This is an important preprocessing step, so let's discuss it briefly.</p><p>For example, let's say you have some data points <img alt="Equation" data-latex="x_1, x_2, x_3, ..., x_n" src="https://latex.upgrad.com/render?formula=x_1%2C%20x_2%2C%20x_3%2C%20...%2C%20x_n" style="vertical-align: middle;display: inline;">. 
The&nbsp;range of values of most data points is between (say) <strong>-10 to 10</strong>, but a few data points (say <img alt="Equation" data-latex="x_{11}" src="https://latex.upgrad.com/render?formula=x_%7B11%7D" style="vertical-align: middle;display: inline;"> and <img alt="Equation" data-latex="x_{18}" src="https://latex.upgrad.com/render?formula=x_%7B18%7D" style="vertical-align: middle;display: inline;">) have values ranging from <strong>-900 to 1000</strong>.</p><p>Now, in backpropagation, the gradients are (directly or indirectly) related to the derivatives <img alt="Equation" data-latex="f’(x)" src="https://latex.upgrad.com/render?formula=f%E2%80%99%28x%29" style="vertical-align: middle;display: inline;">&nbsp;where <img alt="Equation" data-latex="f" src="https://latex.upgrad.com/render?formula=f" style="vertical-align: middle;display: inline;">&nbsp;is the activation function. Say you are using a <strong>sigmoid activation</strong>. In sigmoid, the value of <img alt="Equation" data-latex="f’(x)" src="https://latex.upgrad.com/render?formula=f%E2%80%99%28x%29" style="vertical-align: middle;display: inline;"> at x=-800 and x=900 is almost zero, but it is a small positive number between x=-1 and +1.&nbsp;</p><p><img alt="Shape of Sigmoid Function" src="https://cdn.upgrad.com/UpGrad/temp/e6551df0-32d4-4cbe-855b-50cf49169072/sigmoid.png"></p><p>This makes the gradient with respect to &nbsp;<img alt="Equation" data-latex="x_{11}" src="https://latex.upgrad.com/render?formula=x_%7B11%7D" style="vertical-align: middle;display: inline;">and <img alt="Equation" data-latex="x_{18}" src="https://latex.upgrad.com/render?formula=x_%7B18%7D" style="vertical-align: middle;display: inline;"> drop to almost zero, and so the weight updates cannot happen in the right direction. 
Although sigmoid is rarely used in modern deep learning architectures, this problem arises in other activation functions as well and can be reduced using normalisation.</p><h2>Outliers</h2><p>If you have outliers in the data and you normalise using the equation <img alt="Equation" data-latex="(x-x_{min})/(x_{max}-x_{min})" src="https://latex.upgrad.com/render?formula=%28x-x_%7Bmin%7D%29/%28x_%7Bmax%7D-x_%7Bmin%7D%29" style="vertical-align: middle;display: inline;">, then your "normal" data will be squeezed into a very small range, something like 0 to 0.1, which you do not want. The data should be spread across the range 0 to 1. Therefore, it is more suitable to normalise using percentiles, as explained by Rohit.</p><p>We have now transformed the original data to make it more suitable as input for training later.</p><h2>Coming up</h2><p>On the next page, we will explore another important function of data pre-processing - creating extra data out of existing data. This falls under the umbrella term of <strong>data augmentation.</strong></p><h2>Additional reading</h2><ol><li>Working with Neural Networks always involves using tricks to speed up computation. <a href="https://arxiv.org/abs/1705.01809" target="_blank">This excellent paper uses a technique called 'pixel normalisation' to modify text data into an image form,</a> in order to enable fast processing.</li></ol></div></div>
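
To make the sigmoid argument concrete, the sketch below evaluates the derivative of the sigmoid at a few inputs. The function names and the chosen inputs are illustrative assumptions; the derivative formula f'(x) = f(x)(1 - f(x)) is the standard one for the sigmoid.

```python
import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid for arrays with large |x|."""
    x = np.asarray(x, dtype=float)
    e = np.exp(-np.abs(x))  # always in (0, 1], so exp never overflows
    return np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def sigmoid_grad(x):
    """Derivative of the sigmoid: f'(x) = f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Inputs taken from the ranges discussed in the text
inputs = np.array([-800.0, -1.0, 0.0, 1.0, 900.0])
grads = sigmoid_grad(inputs)

for x, g in zip(inputs, grads):
    print(f"x = {x:7.1f}  f'(x) = {g:.6f}")
```

For the extreme inputs the derivative is numerically zero, so almost no gradient flows back through those points. For inputs near zero, it stays close to its maximum of 0.25.
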

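The effect of outliers on the two normalisation formulas can be checked with a small experiment. This is a sketch on made-up one-dimensional data (98 "normal" values plus two outliers) rather than on a real image. The function names are illustrative; the percentile defaults of 5 and 95 follow the formula given in the note.

```python
import numpy as np

def minmax_normalise(image):
    """(image - min) / (max - min)"""
    return (image - np.min(image)) / (np.max(image) - np.min(image))

def percentile_normalise(image, low=5, high=95):
    """(image - p_low) / (p_high - p_low)"""
    p_low, p_high = np.percentile(image, low), np.percentile(image, high)
    return (image - p_low) / (p_high - p_low)

# Hypothetical data: 98 "normal" values in [0, 10] and two outliers
rng = np.random.default_rng(0)
values = np.concatenate([rng.uniform(0, 10, size=98), [-900.0, 1000.0]])

mm = minmax_normalise(values)
pc = percentile_normalise(values)

# Range occupied by the 98 normal values after each normalisation
mm_spread = mm[:98].max() - mm[:98].min()
pc_spread = pc[:98].max() - pc[:98].min()
print(f"min-max spread of normal data:    {mm_spread:.4f}")
print(f"percentile spread of normal data: {pc_spread:.4f}")
```

With min-max scaling the normal values are squeezed into a tiny interval, whereas percentile scaling spreads them over roughly the whole unit range. Note that values beyond the chosen percentiles fall outside 0 to 1 after percentile scaling; clipping them (for example with `np.clip`) is one possible follow-up step.
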

## Augmentation

There are multiple types of augmentations possible. The basic ones transform the original image using one of the following types of transformations:

1. Linear transformations: flip, rotate, etc.
2. Affine transformations

There are two main reasons to augment data:

1. **Insufficient or small training data.** Many times, the quantity of data that we have is not sufficient to perform the task of classification well enough. In such cases, we perform data augmentation. As an example, if we are classifying gemstones into their different types, we may not have enough images (since high-quality images are difficult to obtain). In this case, we can perform augmentation to increase the size of our dataset.

2. **Tackling overfitting.** Showing the network several varied copies of each image makes it harder for it to simply memorise the training set.
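
One simple way to apply augmentation during training is a Python generator that yields randomly transformed copies of the images. The sketch below uses only random flips; the generator name, the flip probability of 0.5 and the tiny test image are illustrative assumptions and not the notebook's own data generator.

```python
import numpy as np

def random_flip_generator(images, rng):
    """Yield randomly flipped copies of the input images, forever."""
    while True:
        for image in images:
            if rng.random() < 0.5:
                image = np.fliplr(image)  # left-right flip
            if rng.random() < 0.5:
                image = np.flipud(image)  # up-down flip
            yield image

# Hypothetical 2x2 RGB "image" so that the flips are easy to inspect
rng = np.random.default_rng(42)
images = [np.arange(12).reshape(2, 2, 3)]

generator = random_flip_generator(images, rng)
samples = [next(generator) for _ in range(8)]
print(len(samples), samples[0].shape)
```

Each pass over the data can therefore present a different variant of the same image, without storing the augmented copies on disk.
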

<div class="MuiBox-root css-1bi8ut6"><div class="text_component" data-testid="text-component"><p>On this page, you learnt about augmentations. Augmentation should be used whenever our training data is small and we need to give the classifier more instances as training examples. There are mainly two types of augmentations:</p><ol><li>Linear Transformations</li><li>Affine Transformations</li></ol><h2>Practice questions</h2><p>Attempt the following questions. The solutions to these are available on the next page. We strongly urge you to try out these problems before viewing the solutions:</p><ol><li>Write code to perform a left-right flip, followed by an up-down flip to the same image.</li><li>Normalise the image using the 25th and 75th percentiles.</li><li>Perform a 90-degree rotation, and follow it up with a 4x zoom-in.</li></ol><h2>Coming up</h2><p>We are finished with pre-processing techniques. On the next page, we've given you the solutions to the practice questions. After this, we'll move to network building, starting from an understanding of ResNet.</p></div></div>

<div class="text_component" data-testid="online-editor-content"><p>Following are the solutions to the practice problems mentioned in the last segment:</p><p>1. Write code to perform a left-right flip, followed by an up-down flip to the same image.</p><pre>import numpy as np

# `image` is the array loaded earlier in the notebook;
# `plot_image` is the notebook's plotting helper.
image_fliplr = np.fliplr(image)        # left-right flip
image_final = np.flipud(image_fliplr)  # then up-down flip

plot_image([image_final])
</pre><p>2. Normalise the image between the 25th and 75th percentiles.</p><pre>p25, p75 = np.percentile(image, 25), np.percentile(image, 75)
norm3_image = (image - p25) / (p75 - p25)
</pre><p>3. Perform a 90-degree rotation, and follow it up with a 4x zoom-in.</p><div id="code-snippet-cke_52910" data-lang="python" class="code-snippet-container"><div contenteditable="false" class="code"><pre style="margin: 0; line-height: 125%;">from skimage import transform as tf

# First, define the shifting transformations.
# skimage transforms use (x, y) = (column, row) coordinates.
shift_y, shift_x = image.shape[0] / 2, image.shape[1] / 2
matrix_to_topleft = tf.SimilarityTransform(translation=[-shift_x, -shift_y])
matrix_to_center = tf.SimilarityTransform(translation=[shift_x, shift_y])

# Then, define the rotation transform
rot_transforms = tf.AffineTransform(rotation=np.deg2rad(90))

# Then, define the scaling transform with 4x zoom-in
# (warp uses this as the inverse map, so a scale of 0.25 magnifies by 4)
scale_transforms = tf.AffineTransform(scale=(0.25, 0.25))

# Compose the transforms
rot_plus_scale_matrix = matrix_to_topleft + rot_transforms + scale_transforms + matrix_to_center

# Finally, apply the composed transformation
final_image = tf.warp(image, rot_plus_scale_matrix)

# Plot the image
plot_image([final_image])
</pre></div></div></div>


# ResNet architecture

<div class="text_component" data-testid="online-editor-content"><p>In the next few segments, you will build the network using the <strong>ResNet architecture. </strong>On this page, we will recap the architecture of ResNets and then discuss some improvements that were proposed to it later. This is a <strong>text-only, optional page</strong> intended to give you a high-level overview of the architecture - you can skip this if you want to learn the Python implementation directly.</p><h2><strong>ResNets - Original Architecture and Proposed Improvements</strong></h2><p>Since ResNets have become quite prevalent in the industry, it is worth spending some time to understand the important elements of their architecture. You may quickly revisit the <a href="https://learn.upgrad.com/course/1610/segment/18644/115005/349108/1816905" target="_blank">ResNet segment here</a>, though the broad ideas are discussed below again.</p><p>Let's start with the original architecture <a href="https://arxiv.org/pdf/1512.03385.pdf" target="_blank">proposed here</a>. The basic problem ResNet solved was that <em>very deep networks</em> were <strong>hard to optimise</strong> - e.g. a 56-layer net had a <em>lower training accuracy</em> than a 20-layer net. 
By the way, before ResNets, anything having more than 20 layers was called <em>very deep</em>.</p><p style="text-align: center;"><img alt="ResNet: Experiments with Network Depth" src="https://cdn.upgrad.com/UpGrad/temp/05914e52-21f9-4013-84c3-089d7db8f090/Screen+Shot+2018-09-23+at+9.14.42+AM.png"></p><p>The ResNet team argued that a net with <img alt="Equation" data-latex="n+1" src="https://latex.upgrad.com/render?formula=n%2B1" style="vertical-align: middle;display: inline;"> layers should perform <em>at least as well as</em> the one with <img alt="Equation" data-latex="n" src="https://latex.upgrad.com/render?formula=n" style="vertical-align: middle;display: inline;"> layers. This is because even if the additional layer simply lets the input pass through it (i.e. acts as an <strong>identity function <img alt="Equation" data-latex="f(x)=x" src="https://latex.upgrad.com/render?formula=f%28x%29%3Dx" style="vertical-align: middle;display: inline;"></strong>), it will perform identically to the <img alt="Equation" data-latex="n" src="https://latex.upgrad.com/render?formula=n" style="vertical-align: middle;display: inline;">-layered network.</p><p>Now let's see how ResNets solved this problem. Consider figure 2 from the paper, which shows a ResNet unit. Let's say the input to some 'unit' of a network is <img alt="Equation" data-latex="x" src="https://latex.upgrad.com/render?formula=x" style="vertical-align: middle;display: inline;"> (the unit has two weight layers). Let's say that, ideally, this unit should have learnt some function <img alt="Equation" data-latex="H(x)" src="https://latex.upgrad.com/render?formula=H%28x%29" style="vertical-align: middle;display: inline;">, i.e. 
given the input&nbsp;<img alt="Equation" data-latex="x" src="https://latex.upgrad.com/render?formula=x" style="vertical-align: middle;display: inline;">&nbsp;this unit should have learnt to produce the<strong> desired output&nbsp;<strong><img alt="Equation" data-latex="H(x)" src="https://latex.upgrad.com/render?formula=H%28x%29" style="vertical-align: middle;display: inline;"></strong></strong>.&nbsp;</p><p>In a normal neural net, these two&nbsp;layers (i.e.&nbsp;this unit) would try to learn the function&nbsp;<img alt="Equation" data-latex="H(x)" src="https://latex.upgrad.com/render?formula=H%28x%29" style="vertical-align: middle;display: inline;">. But&nbsp;ResNets tried a different trick. They argued:&nbsp;let&nbsp;<img alt="Equation" data-latex="F(x)" src="https://latex.upgrad.com/render?formula=F%28x%29" style="vertical-align: middle;display: inline;">&nbsp;denote the&nbsp;<strong>residual</strong> between&nbsp;<img alt="Equation" data-latex="H(x)" src="https://latex.upgrad.com/render?formula=H%28x%29" style="vertical-align: middle;display: inline;">&nbsp;and&nbsp;<img alt="Equation" data-latex="x" src="https://latex.upgrad.com/render?formula=x" style="vertical-align: middle;display: inline;">, i.e.&nbsp;<img alt="Equation" data-latex="F(x) =H(x)-x" src="https://latex.upgrad.com/render?formula=F%28x%29%20%3DH%28x%29-x" style="vertical-align: middle;display: inline;">. They hypothesised that it will be <strong>easier&nbsp;to learn the residual function</strong>&nbsp;<img alt="Equation" data-latex="F(x)" src="https://latex.upgrad.com/render?formula=F%28x%29" style="vertical-align: middle;display: inline;">&nbsp;than to learn&nbsp;<img alt="Equation" data-latex="H(x)" src="https://latex.upgrad.com/render?formula=H%28x%29" style="vertical-align: middle;display: inline;">. 
In the extreme case that the unit should simply let the signal&nbsp;pass-through it (i.e.&nbsp;<img alt="Equation" data-latex="H(x)=x" src="https://latex.upgrad.com/render?formula=H%28x%29%3Dx" style="vertical-align: middle;display: inline;">&nbsp;is the optimal thing to learn), it would be easier&nbsp;to push&nbsp;the residual&nbsp;<img alt="Equation" data-latex="F(x)" src="https://latex.upgrad.com/render?formula=F%28x%29" style="vertical-align: middle;display: inline;">&nbsp;to zero than to learn <img alt="Equation" data-latex="H(x)" src="https://latex.upgrad.com/render?formula=H%28x%29" style="vertical-align: middle;display: inline;">.</p><p>Experiments on<em>&nbsp;deep</em> nets&nbsp;proved that&nbsp;this hypothesis was indeed true - if&nbsp;<em>learning to let the signal pass-through&nbsp;</em>was the optimal thing to do (i.e. reduced the loss), the units learnt <img alt="Equation" data-latex="F(x) =0" src="https://latex.upgrad.com/render?formula=F%28x%29%20%3D0" style="vertical-align: middle;display: inline;">; but if something useful was to be learnt, the units learnt that. These units are called <strong>residual units</strong>.</p><p style="text-align: center;"><img alt="ResNet" src="https://cdn.upgrad.com/UpGrad/temp/41942b84-46c1-4aab-8502-c4b94679889b/Screen+Shot+2018-09-22+at+11.36.25+PM.png"></p><p>&nbsp;After the network has learnt the residual&nbsp;<img alt="Equation" data-latex="F(x)" src="https://latex.upgrad.com/render?formula=F%28x%29" style="vertical-align: middle;display: inline;">, feedforward&nbsp;goes on as usual with the output&nbsp;<img alt="Equation" data-latex="H(x)=F(x)+x" src="https://latex.upgrad.com/render?formula=H%28x%29%3DF%28x%29%2Bx" style="vertical-align: middle;display: inline;">&nbsp;(since <img alt="Equation" data-latex="F(x) =H(x)-x" src="https://latex.upgrad.com/render?formula=F%28x%29%20%3DH%28x%29-x" style="vertical-align: middle;display: inline;">). 
This addition is facilitated by the <strong>shortcut (or skip) connections</strong>&nbsp;shown in the figure. The connections <strong>do not add any extra parameter</strong>&nbsp;(and thus complexity) to the network - they simply add the input to the residual.</p><p><strong>Bottleneck Residual Blocks&nbsp;</strong></p><p>The figure above shows a <strong>residual block </strong>(or unit)&nbsp;of two layers.&nbsp;The ResNet team had experimented with other&nbsp;types of blocks as well. One particularly successful one was the <strong>bottleneck architecture</strong>&nbsp;designed especially for deeper nets&nbsp;(we'll be using this in the upcoming sections). The&nbsp;bottleneck block&nbsp;has three layers&nbsp;in this sequence: (1, 1), (3, 3) and (1, 1)&nbsp;filters (right side in the figure below).</p><p style="text-align: center;"><img alt="Bottleneck Block" src="https://cdn.upgrad.com/UpGrad/temp/1769c680-b8f3-4da3-97de-b87b2e350594/Screen+Shot+2018-09-23+at+12.18.41+AM.png"></p><p>The reason why the bottleneck architecture works better than the vanilla one is beyond the scope of this discussion -&nbsp;you can <a href="https://stats.stackexchange.com/questions/347280/regarding-the-understanding-of-bottleneck-unit-of-resnet" target="_blank">read the intuition here</a> and the details in the original paper.&nbsp;For practical purposes, it will suffice to remember that they <strong>facilitate&nbsp;training </strong>of deeper nets.</p><p><strong>Improved&nbsp;ResNet Architecture</strong></p><p>In 2016, the ResNet team had proposed some improvements in the original architecture <a href="https://arxiv.org/pdf/1603.05027.pdf" target="_blank">here</a>.&nbsp;Using these modifications, they had trained nets of <strong>more than&nbsp;1000 layers </strong>(e.g.&nbsp;<a href="https://github.com/KaimingHe/resnet-1k-layers" target="_blank">ResNet-1001</a>)&nbsp;which had shown improved performance on the CIFAR-10 and 100 datasets. 
The basic ideas of the proposed design are explained below.</p><p>You know that skip connections act as a 'direct path' of information propagation within a residual block. The new architecture basically stretched the idea of skip connections from residual blocks <em>to the entire network</em>, i.e. if <em>multiple units</em> should ideally learn identity functions, the signal could be <em>directly propagated across units</em>.</p><p style="text-align: center;"><img alt="Proposed Residual Architecture" src="https://cdn.upgrad.com/UpGrad/temp/1bc0196d-28a9-4469-a3c9-3a310a9446d3/Screen+Shot+2018-09-23+at+12.36.46+AM.png"></p><p>Another unconventional change proposed here was to apply batch normalisation and the activation function (ReLU) <em>before</em> each weight layer (called <em>pre-activation</em>) rather than after it (<em>post-activation</em>). On the right side of the figure above, the grey arrows show the 'direct path' whereas the other layers (BN, ReLU etc.) are on the usual path. This modification boosted training efficiency (i.e. gradient propagation) and was thus used to train nets deeper than 1000 layers.</p><p>You can read more about the original and the proposed architectures in the papers (provided below). In the next few segments, you will use some variants of ResNet (ResNet-18, ResNet-34 etc.).</p><h2>Additional reading</h2><ol><li><a href="https://arxiv.org/abs/1512.03385" target="_blank">The ResNet paper, He et al.</a></li><li><a href="https://arxiv.org/pdf/1603.05027.pdf" target="_blank">Proposed ResNet architecture, He et al.</a></li></ol></div>
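
The idea of a residual unit, H(x) = F(x) + x, can be illustrated with a few lines of NumPy. This is only a toy sketch: it uses two dense (fully connected) layers in place of convolutions, omits batch normalisation and the final activation, and all names and sizes are illustrative assumptions rather than the `resnet.py` implementation used in the notebook.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_unit(x, w1, w2):
    """Toy two-layer residual unit (dense layers stand in for convolutions)."""
    residual = relu(x @ w1) @ w2  # F(x): what the two weight layers learn
    return residual + x           # H(x) = F(x) + x via the skip connection

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))       # batch of 4 inputs with 8 features each

# With all weights at zero, F(x) = 0 and the unit is an identity mapping.
zeros = np.zeros((8, 8))
identity_out = residual_unit(x, zeros, zeros)

# With non-zero weights, the unit adds a learnt correction to x.
w1 = rng.normal(scale=0.1, size=(8, 8))
w2 = rng.normal(scale=0.1, size=(8, 8))
learnt_out = residual_unit(x, w1, w2)

print(np.array_equal(identity_out, x))
print(learnt_out.shape)
```

The skip connection itself has no weights: the only parameters are `w1` and `w2`, so pushing them towards zero is enough for the unit to pass its input through unchanged.
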
