# Problem Statement
The task is to return similar images to the given query image.

# What Is Image Similarity?
- Image similarity is the measure of how similar 2 images are.
- It quantifies the degree of similarity between intensity patterns in 2 images.
- Finding images that are similar to a query image is also called reverse image search.

# Applications Of Image Similarity (What Can Be Done Using Image Similarity?)
- Image retrieval: Image similarity is the backbone of image search engines like Google Images and Pinterest. By understanding the visual content of a query image, these platforms efficiently retrieve similar images from their vast databases.
- E-commerce: Product recommendations play a crucial role in online shopping experiences. Image similarity empowers e-commerce platforms to suggest visually similar products to customers, enhancing their browsing and purchase decisions.
- Duplicate detection: Maintaining clean and organized datasets is essential for various tasks. Image similarity algorithms can efficiently identify duplicate or near-identical images within large datasets, ensuring data integrity and reducing storage requirements.
- Security and surveillance: Facial recognition systems leverage image similarity to match faces against databases of known individuals. This technology is used in security applications for identification, surveillance and access control.
- Brand protection: Image similarity safeguards brands by detecting unauthorized use of logos, trademarks or copyrighted images online. By comparing new content with a database of brand assets, businesses can proactively identify and address potential infringements.
- Content moderation: Image similarity can be used to streamline content moderation on social media platforms and online communities. By detecting visually similar content that violates community guidelines, such as violent or offensive imagery, image similarity algorithms can aid in maintaining a safe and appropriate online environment.

# How To Compute The Similarity Of 2 Images?
### Challenges of direct pixel comparison
Directly comparing the pixel values of 2 images often fails to capture meaningful similarities. This is because,
- Lighting variations: Images of the same object can have drastically different pixel values due to varying lighting conditions.
- Perspective shifts: Changes in the viewpoint can significantly alter pixel values, even if the object remains the same.
- Occlusions: Partial occlusions can drastically change pixel values, making direct comparison unreliable.
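
The fragility of raw pixel comparison can be seen in a small sketch. It assumes NumPy and uses a synthetic random array as a stand-in for a real photo; `pixel_mse` is a hypothetical helper name chosen for illustration, not a library function.

```python
import numpy as np

def pixel_mse(image_a, image_b):
    """Mean squared error between two same-sized images (0 means identical pixels)."""
    a = image_a.astype(np.float64)
    b = image_b.astype(np.float64)
    return float(np.mean((a - b) ** 2))

# Synthetic stand-in for a photo: 64x64 RGB with values in [0, 200]
rng = np.random.default_rng(seed=0)
image = rng.integers(0, 201, size=(64, 64, 3), dtype=np.uint8)

# The same "scene" under brighter lighting: every pixel shifted up by 50
brighter = (image.astype(np.int16) + 50).astype(np.uint8)

print(pixel_mse(image, image))     # 0.0
print(pixel_mse(image, brighter))  # 2500.0, although the content is unchanged
```

A uniform brightness change leaves the content untouched, yet the pixel-wise error is large. This is the kind of mismatch a learned representation is meant to absorb.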

To overcome the above challenges, a representation is needed that captures the essence of an image beyond its raw pixel values. This is where Embeddings come into play.

### What are Embeddings?
- Numeric representations: Embeddings are essentially dense vectors (lists of numbers) that encode the semantic meaning of an image.
- Learned representations: These vectors are not manually engineered but rather learned by powerful Machine Learning models, typically deep Neural Networks.

### How to obtain image Embeddings?
1. Traditional methods (Hand-crafted features):
    - SIFT (Scale-Invariant Feature Transform): Detects key points and their local orientations, invariant to scale and rotation.
    - HOG (Histogram of Oriented Gradients): Represents the distribution of edge orientations within local cells.
    - LBP (Local Binary Patterns): Captures local texture information by comparing pixel intensities to their neighbors.
    - Gabor filters: Extract features at different frequencies and orientations.
    - The limitations of the above are,
        - Hand-engineered: These methods require careful design and may not capture complex relationships in images.
        - Performance: Often less accurate than Deep Learning based methods, especially for complex tasks.
2. Deep Learning methods (Learned features):
    - CNNs: CNNs are the most popular approach. They learn hierarchical representations of images, capturing increasingly complex features as the network deepens.
    - Popular architectures:
        - AlexNet, VGG, ResNet, Inception: These models, pre-trained on massive datasets (like ImageNet), have proven highly effective for image feature extraction.

### Computing similarity with Embeddings
Once the image embeddings have been obtained, similarity can be computed using various distance metrics,
- Euclidean Distance: Measures the straight-line distance between 2 points in the embedding space.
- Cosine Similarity: Measures the cosine of the angle between 2 vectors, capturing the direction rather than the magnitude of the vectors.
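
As a reference, the two metrics can be written out for two embeddings $\mathbf{a}, \mathbf{b} \in \mathbb{R}^{n}$. The symbols $\mathbf{a}$, $\mathbf{b}$, $n$ and $\theta$ are notation introduced here for illustration.

$$
d(\mathbf{a}, \mathbf{b}) = \lVert \mathbf{a} - \mathbf{b} \rVert_2 = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
$$

$$
\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert_2 \, \lVert \mathbf{b} \rVert_2}
$$

A smaller Euclidean distance means more similar images, while a larger Cosine similarity (range $[-1, 1]$) means more similar images. If both embeddings are scaled to unit length, then $d(\mathbf{a}, \mathbf{b})^2 = 2 - 2\cos\theta$, so the two metrics produce the same ranking.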

### Example: Finding similar images
1. Extract the Embeddings: Use a pre-trained CNN to extract embeddings for the query image and all the images in the database.
2. Compute similarities: Calculate the Cosine similarity (or another suitable metric) between the query image embedding and all other image embeddings.
3. Rank images: Sort the images based on their similarity scores to the query image.
4. Retrieve similar images: Present the top-ranked images as the most similar results.
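
Steps 2 to 4 above can be sketched with NumPy as follows. This is a minimal illustration that assumes the embeddings have already been extracted and are non-zero vectors; `top_k_similar` and the toy numbers are hypothetical and only serve the example.

```python
import numpy as np

def top_k_similar(query_embedding, database_embeddings, k=5):
    """Return (indices, scores) of the k database rows most similar to the query."""
    # L2-normalize so that a dot product equals the Cosine similarity
    query = query_embedding / np.linalg.norm(query_embedding)
    database = database_embeddings / np.linalg.norm(
        database_embeddings, axis=1, keepdims=True
    )

    scores = database @ query              # one Cosine score per database image
    ranked = np.argsort(scores)[::-1][:k]  # highest similarity first
    return ranked, scores[ranked]

# Toy database of four 3-dimensional "embeddings" (made-up numbers)
database = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
])
query = np.array([1.0, 0.05, 0.0])

indices, scores = top_k_similar(query, database, k=2)
print(indices)  # rows 0 and 1 point in nearly the same direction as the query
```

A full sort is fine for a few thousand images. For much larger databases an approximate nearest-neighbour index is the usual alternative, which is outside the scope of this sketch.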

### Key advantages of Deep Learning Embeddings:
- Data-driven: Learn directly from the data, capturing complex patterns and relationships.
- High Accuracy: Embeddings often outperform hand-crafted features, especially for challenging tasks.
- Versatility: Can be used for various tasks beyond image similarity, such as image classification, object detection and image generation.

![computer_vision_69.png](attachment:computer_vision_69.png)

# How Is Image Similarity Different From Image Classification In CNN?
While both image similarity and image classification leverage CNNs, they have distinct goals and approaches,

### Image classification:
- Goal: Assigns an image to one or more pre-defined categories (e.g., "cat", "dog", "car").
- Focus: Primarily concerned with the final classification layer, which predicts the probability of the image belonging to each class.
- Output: A discrete label or a set of probabilities for each class.
- Discards information: The final classification layer often discards subtle visual nuances that might be important for similarity comparisons.

### Image similarity
- Goal: Determines how similar 2 images are based on their visual content.
- Focus: Utilizes the learned feature representations (embeddings) extracted from the intermediate layers of a CNN.
- Output: A continuous similarity score (e.g., Euclidean distance, Cosine similarity) that quantifies the degree of visual resemblance.
- Preserves richer information: By using intermediate layer Embeddings, image similarity captures more detailed visual information than just final classification.

### In essence
- Image classification focuses on assigning a discrete label to an image.
- Image similarity focuses on finding continuous relationships between images based on their visual content.

### Example
Imagine that a CNN has been trained to classify images into "cat", "dog" and "bird",
- Classification: Given an image, the model would predict "cat" with a high probability.
- Similarity: The model could be used to find other images that are visually similar to the input image, regardless of their class labels. This might include images of different cat breeds or even images of other animals with similar textures or patterns.

### Key points
- Image similarity leverages the power of CNNs to learn rich visual representations.
- By focusing on intermediate layer Embeddings, image similarity captures more nuanced visual information than traditional classification.
- This enables applications like image search, recommendation systems and duplicate detection.

# Feature Extraction Using Pre-Trained CNN Models
Leveraging pre-trained CNN models for feature extraction is a powerful technique in Deep Learning. The following is a breakdown of the key aspects,

### Why use pre-trained models?
- Rich feature representations: Pre-trained models, like ResNet-50, InceptionV3 and others, have been trained on massive datasets (e.g., ImageNet) with millions of images. This extensive training allows them to learn intricate and hierarchical feature representations that capture essential visual information like edges, textures, shapes and objects.
- Transfer Learning: By utilizing these pre-trained models, the knowledge gained from training on a large dataset can be transferred to the specific task at hand. This significantly reduces the need for large amounts of labeled data for the target problem.
- Efficiency: Training a deep CNN from scratch can be computationally expensive and time-consuming. Using pre-trained models makes it possible to leverage existing knowledge and accelerate the training process.

### How it works?
1. Load the pre-trained model: Load the chosen pre-trained model.
2. Extract features:
    - Remove the final classification layer: Typically, the last layer of a pre-trained model is a Classifier (e.g., Softmax). Remove this layer.
    - Feed the images: Pass the images through the remaining layers of the pretrained model.
    - Extract feature vectors: Obtain the output of the penultimate layer (or another suitable intermediate layer). This output will be a high-dimensional vector representing the extracted features of the image.
3. Train a Classifier:
    - Train a new classifier (e.g., SVM, Logistic Regression or another small Neural Network) on top of the extracted features. This Classifier will learn to map the extracted features to the correct class labels for the specified dataset.
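
Steps 1 and 2 above might look like the following with Keras. This is a sketch under stated assumptions, not the only way to do it: it assumes ResNet-50 from `tf.keras.applications`, `224x224` RGB inputs with values in `[0, 255]`, and global average pooling as the "penultimate layer" output. `build_feature_extractor` and `extract_features` are hypothetical helper names.

```python
import numpy as np
import tensorflow as tf

def build_feature_extractor(weights="imagenet"):
    """ResNet-50 without its classification head; pooling yields one vector per image."""
    return tf.keras.applications.ResNet50(
        include_top=False,   # drop the final classification layer
        weights=weights,     # "imagenet" downloads pre-trained weights; None means random weights
        pooling="avg",       # average the last feature maps into a 2048-dimensional vector
        input_shape=(224, 224, 3),
    )

def extract_features(model, images):
    """images: array of shape (N, 224, 224, 3) with RGB values in [0, 255]."""
    # Copy to float32 first, then apply the preprocessing ResNet-50 was trained with
    batch = np.array(images, dtype=np.float32)
    batch = tf.keras.applications.resnet50.preprocess_input(batch)
    return model.predict(batch, verbose=0)  # shape (N, 2048)
```

The resulting `(N, 2048)` array can then be used as the input for step 3, for example a Logistic Regression or SVM trained on the class labels.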

### Benefits
- Improved performance: Often leads to significantly better performance compared to training a model from scratch, especially when dealing with limited labeled data.
- Faster training: Training a classifier on top of extracted features is much faster than training a deep CNN from scratch.
- Reduced overfitting: Pre-trained models have already learned generalizable features, which can help to reduce overfitting on smaller datasets.

### Caltech101 dataset
Caltech101 dataset is a common benchmark for image classification. By extracting features using pre-trained models like ResNet-50 and InceptionV3 and then training Classifiers on these features, a high Accuracy can be achieved on the Caltech101 dataset with relatively little training data.

### Key considerations:
- Choosing the right pre-trained model: Select a model that aligns with the complexity and nature of the images.
- Fine-tuning (optional): For further improvement, the later layers of the pre-trained model can be fine-tuned on the specified dataset. This allows the model to adapt slightly to the nuances of the data.
- Data augmentation: Augmenting the training data (e.g., rotations, flips, crops) can further improve generalization and robustness.

![computer_vision_70.png](attachment:computer_vision_70.png)

# Caltech101 Dataset
The Caltech101 dataset is a collection of images used for object recognition and classification tasks in Computer Vision.

### Key characteristics:
- 101 classes: It contains images of objects belonging to 101 different categories, plus a background clutter class.
- Image count: Each class has roughly 40 to 800 images, resulting in a total of around 9,000 images.
- Image size: Images typically have edge lengths of 200 to 300 pixels.
- Image-level labels: Each image is associated with a single object label.
- Purpose: The dataset is widely used to train and evaluate object recognition algorithms.

### Significance
The Caltech101 dataset has played a crucial role in advancing Computer Vision research. It provides a benchmark for evaluating the performance of object recognition algorithms and has been used in numerous research papers and publications.

### Applications
- Object recognition: Training and evaluating models for identifying objects in images.
- Image classification: Classifying images into their respective categories.
- Fine-grained image analysis: Distinguishing between subtle variations within object categories.
- One-shot learning: Training models to recognize objects with limited training data.

```python
# importing dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as skl
import tensorflow as tf
import os
import random
import math
import time
import warnings
```

# Which Pre-Trained CNN Architecture To Use?
AlexNet and VGGNet, despite their groundbreaking contributions, have limitations,
- Computational cost: They are computationally expensive to train due to their depth and large number of parameters, particularly in the Fully Connected layers.
- Vanishing or exploding gradient: As the network deepens, training becomes challenging due to the vanishing or exploding gradient problem, hindering the flow of information during backpropagation.

### Enter ResNet and Inception
These architectures emerged as significant advancements, addressing the limitations of their predecessors.

1. ResNet (Residual Networks):
    - Key idea: ResNets introduce "skip connections" or "residual connections" that allow information to bypass layers directly. This helps overcome the vanishing or exploding gradient problem and enables training of extremely deep networks.
    - Benefits:
        - Improved training stability: Skip connections facilitate the flow of gradients, enabling deeper networks to be trained effectively.
        - Enhanced performance: ResNets have achieved state-of-the-art results on various image recognition benchmarks, including ImageNet.
2. Inception Networks (GoogLeNet):
    - Key idea: Inception modules efficiently extract features at multiple scales simultaneously using a combination of convolutional filters with different sizes (`1x1`, `3x3`, `5x5`) and pooling operations.
    - Benefits:
        - Efficient feature extraction: Inception modules capture information at different scales and resolutions, improving the model's ability to recognize objects of varying sizes and shapes.
        - Reduced computational cost: The use of `1x1` `Convolutional` filters helps to reduce the number of parameters and computational costs.

### Summary
ResNet and Inception architectures represent significant advancements in CNN design. They address the limitations of earlier models like AlexNet and VGG by,
- Improving training stability: ResNets mitigate the vanishing or exploding gradient problem.
- Increasing efficiency: Inception modules extract features efficiently and reduce computational cost.
- Achieving state-of-the-art performance: Both architectures have demonstrated exceptional performance on various image recognition tasks.

These advancements have paved the way for even more sophisticated CNN architectures and continue to drive progress in the field of Computer Vision.

# Inception Networks (GoogLeNet)

### Origins
The name "Inception" is inspired by the Christopher Nolan movie of the same name. A line of dialogue from the movie, "We need to go deeper", was the motivation for building deeper Neural Networks.

![computer_vision_71.png](attachment:computer_vision_71.png)

It was introduced in 2014 by Christian Szegedy et al. in their paper "Going Deeper with Convolutions". Inception (GoogLeNet) won the ImageNet competition (ILSVRC) in 2014.

The original paper link: https://arxiv.org/pdf/1409.4842

### Importance of GoogLeNet
- Reduced network size: Compared to AlexNet and VGGNet, Inception achieves a dramatic reduction in network size. This is primarily due to,
    - Removal of Fully Connected layers: Inception replaces Fully Connected layers with Global Average Pooling (GAP), significantly reducing the number of parameters. Most parameters reside in Fully Connected layers, so this change leads to substantial memory savings.
    - Efficient feature extraction: The architecture utilizes inception modules (explained below) to extract features efficiently.
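
The saving from replacing a flattened Fully Connected layer with Global Average Pooling can be checked with a little arithmetic. The sizes below (a `7x7x1024` final feature map and 1000 classes) are assumptions chosen for illustration, and `dense_weights` is a hypothetical helper.

```python
def dense_weights(inputs, outputs):
    """Number of weights in a fully connected layer (biases ignored)."""
    return inputs * outputs

# Assumed final feature map: 7 x 7 spatial positions, 1024 channels, 1000 classes
height, width, channels, classes = 7, 7, 1024, 1000

flatten_then_dense = dense_weights(height * width * channels, classes)
gap_then_dense = dense_weights(channels, classes)  # GAP leaves one value per channel

print(flatten_then_dense)                    # 50176000
print(gap_then_dense)                        # 1024000
print(flatten_then_dense // gap_then_dense)  # 49
```

Global Average Pooling itself has no trainable parameters. In this sketch a single linear layer still maps the pooled vector to class scores, but it is 49 times smaller than the flattened alternative.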

### Comparison of network sizes and accuracies

| model_name | number_of_parameters | top_1_acc | top_5_acc |
| :-: | :-: | :-: | :-: |
| AlexNet | 60M | 63.3% | 84.6% |
| VGG-16 | 138M | 74.4% | 91.9% |
| VGG-19 | 144M | 74.5% | 92.0% |
| Inception | 11.2M | 74.8% | 92.2% |

Observe that Inception achieves comparable or better accuracy than AlexNet and VGGNet while using significantly fewer parameters.

### Inception Modules: Network in Network
- Concept: Inception introduces a novel concept called a "network in network" or "micro-architecture". This refers to small building blocks used within the larger network.
- Functionality: Unlike traditional sequential CNNs, Inception allows the output from a layer to split into multiple paths with different operations (e.g., various filter sizes) and then rejoin later.
- Inception module as multi-level feature extractor: This approach enables the network to learn `Convolutional` layers with multiple filter sizes within the inception module, effectively extracting features at different scales and resolutions.

In essence, Inception represents a major advancement in CNN design by achieving high accuracy with a smaller network size and introducing the concept of inception modules for efficient multi-level feature extraction.

# Why Is There A Need For Multi-Level Feature Extractor?
The Inception module's brilliance lies in its ability to simultaneously extract features at multiple scales, addressing the critical question of "which filter size is the best?".

### The dilemma of filter size
- Trade-off: Often in traditional CNNs, there is a dilemma when choosing filter sizes,
    - Large filters (e.g., `5x5`): Capture broader spatial information, but are computationally expensive.
    - Small filters (e.g., `3x3`): More efficient and good at capturing fine details, but might miss broader spatial context.
    - `1x1` filters: Extract per-pixel features across channels, and can also be used for dimensionality reduction.

### The Inception solution
The Inception module elegantly overcomes this dilemma by, 
1. Parallel execution: Instead of choosing a single filter size, it employs parallel `Convolutional` paths with different filter sizes (`1x1`, `3x3`, `5x5`) on the same input.
2. Concatenation: The outputs from these parallel `Convolutional` layers are then concatenated along the channel dimension. This creates a richer feature map that incorporates information extracted at various scales.
3. `1x1` convolutions as dimensionality reduction: Before applying larger filters (`3x3` and `5x5`), `1x1` `Convolutional` layers are used to reduce the dimensionality of the input. This significantly reduces the number of parameters and computational cost, making the process more efficient.
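
The effect of the `1x1` reduction in step 3 can be made concrete by counting weights. The channel counts below are assumed values for illustration only, and `conv_weights` is a hypothetical helper.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in a square convolution (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels

in_channels = 192   # assumed depth of the incoming feature map
reduced = 16        # assumed number of 1x1 filters used for the reduction
out_channels = 32   # assumed number of 5x5 filters

# 5x5 convolution applied directly to the input
direct = conv_weights(5, in_channels, out_channels)

# 1x1 reduction first, then the 5x5 convolution on the thinner feature map
with_reduction = conv_weights(1, in_channels, reduced) + conv_weights(5, reduced, out_channels)

print(direct)          # 153600
print(with_reduction)  # 15872
```

With these assumed sizes the reduced path needs roughly a tenth of the weights. The number of multiply-adds per pixel shrinks by the same factor.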

### Why multi-scale feature extraction is crucial:
- Capturing diverse information: Images contain features at various scales. Small filters capture fine details, while larger filters capture broader spatial patterns. By extracting features at multiple scales, the Inception module ensures that the network can effectively learn and represent a wide range of visual information.
- Improved representation: The combined output of the parallel paths creates a more comprehensive and robust feature representation compared to using a single filter size.
- Adaptability: The network can learn the optimal combination of features at different scales, adapting to the specific characteristics of the input data.

### Example
In the images below,

![computer_vision_72.png](attachment:computer_vision_72.png)

There are 3 main entities,
- Wheat.
- Humans.
- Sky.

For effective image similarity assessment, a Neural Network benefits from the ability to utilize Kernels of varying dimensions.
- Recognizing large-scale entities: Larger Kernels are generally more suitable for capturing features of objects that occupy a significant portion of the image.
- Detecting fine-grained details: Conversely, smaller Kernels are more effective at identifying fine-grained details and local patterns within the image.

The Inception architecture, with its parallel paths employing Kernels of different sizes (`1x1`, `3x3`, `5x5`) provides the network with the flexibility to adapt to these varying spatial scales. This allows the model to efficiently extract and compare features at multiple levels of detail, leading to more accurate and robust image similarity assessment.

# Architecture
Each Inception module consists of 4 operations in parallel,
1. `1x1` `Convolutional` layer.
2. `3x3` `Convolutional` layer.
3. `5x5` `Convolutional` layer.
4. Max Pooling.

![computer_vision_73.png](attachment:computer_vision_73.png)

An activation function (ReLU) is implicitly applied after every `Convolutional` layer. To save space, this activation function is not included in the above network diagram.

Here,
- `5x5` `Convolutional` layer is used to capture global features.
- `3x3` `Convolutional` layer is used to capture small and distributed features.
- `1x1` `Convolutional` layer is used for depth reduction.
- Max Pooling layer is mainly used to provide spatial invariance and also to capture very small entities.

```python
# tf is imported in the first code cell (import tensorflow as tf)
def inception_module(layer_in, f1, f2_in, f2_out, f3_in, f3_out, f4_out):
    # 1x1 Convolutional path
    conv_1 = tf.keras.layers.Conv2D(f1, (1, 1), padding="same", activation="relu")(layer_in)

    # 3x3 Convolutional path: 1x1 reduction followed by the 3x3 convolution
    conv_3 = tf.keras.layers.Conv2D(f2_in, (1, 1), padding="same", activation="relu")(layer_in)
    conv_3 = tf.keras.layers.Conv2D(f2_out, (3, 3), padding="same", activation="relu")(conv_3)

    # 5x5 Convolutional path: 1x1 reduction followed by the 5x5 convolution
    conv_5 = tf.keras.layers.Conv2D(f3_in, (1, 1), padding="same", activation="relu")(layer_in)
    conv_5 = tf.keras.layers.Conv2D(f3_out, (5, 5), padding="same", activation="relu")(conv_5)

    # Pooling path: 3x3 Max Pooling (stride 1) followed by a 1x1 convolution
    pool = tf.keras.layers.MaxPooling2D((3, 3), strides=(1, 1), padding="same")(layer_in)
    pool = tf.keras.layers.Conv2D(f4_out, (1, 1), padding="same", activation="relu")(pool)

    # concatenate the 4 paths along the channel axis
    layer_out = tf.keras.layers.concatenate([conv_1, conv_3, conv_5, pool], axis=-1)

    return layer_out
```

The network is 22 layers deep when counting only layers with parameters (or 27 layers if pooling is counted as well).

To see the code of the full model architecture, refer to this link: https://www.analyticsvidhya.com/blog/2018/10/understanding-inception-network-from-scratch/

# Description Of Architecture
The core idea behind the Inception module is to create a network within the network (a "micro-architecture") that explores different feature extraction paths in parallel. This allows the model to learn a diverse set of features at different scales and with varying levels of complexity.

### Distinct paths
1. `1x1` `Convolutional` path:
    - This path simply applies a `1x1` `Convolutional` layer to the input.
    - While seemingly trivial, `1x1` `Convolutional` layers can act as,
        - Dimensionality reduction: Reduce the number of input channels, which is crucial for computational efficiency, especially before applying more expensive operations (like `3x3`, `5x5` convolutional layers).
        - Feature transformation: Introduce non-linearity and learn complex combinations of the input features.
2. `3x3` `Convolutional` path:
    - This path first applies a `1x1` `Convolutional` layer to reduce the dimensionality of the input.
    - This dimensionality reduction step is crucial for computational efficiency.
    - Subsequently, a `3x3` `Convolutional` layer is applied to extract features at a moderate spatial scale.
3. `5x5` `Convolutional` path:
    - Similar to the `3x3` path, this path begins with a `1x1` `Convolutional` layer for dimensionality reduction.
    - Then, a `5x5` `Convolutional` layer is applied to capture broader spatial information.
4. Pooling path:
    - This path involves,
        - Max Pooling: Performs `3x3` max pooling with a stride of 1 and "same" padding, so the spatial dimensions are unchanged (no downsampling takes place).
        - `1x1` `Convolutional` layer: A `1x1` `Convolutional` layer is applied to the output of the pooling operation, allowing for further feature extraction and dimensionality reduction.

### Concatenation and output
- Convergence: The outputs from all 4 paths are concatenated along the channel dimension. This creates a richer feature map that incorporates information extracted at different scales and with varying levels of complexity.
- Zero padding: To ensure that all 4 paths produce outputs with the same spatial dimensions, "same" (zero) padding is used in the convolutional and pooling layers. This allows for seamless concatenation.
- Output: The resulting concatenated feature map is then fed as input to the next layer in the network, which can be another Inception module or, at the end of the network, the pooling and classification layers.
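
The shape bookkeeping of the concatenation can be illustrated with plain NumPy arrays. The spatial size (`28x28`) and the channel counts per path are assumed values for illustration.

```python
import numpy as np

# Hypothetical outputs of the 4 paths for one 28x28 feature map (channels last)
path_1x1 = np.zeros((28, 28, 64))
path_3x3 = np.zeros((28, 28, 128))
path_5x5 = np.zeros((28, 28, 32))
path_pool = np.zeros((28, 28, 32))

# Concatenation only works because all paths share the same height and width
module_output = np.concatenate([path_1x1, path_3x3, path_5x5, path_pool], axis=-1)
print(module_output.shape)  # (28, 28, 256)
```

The output depth is simply the sum of the per-path channel counts (64 + 128 + 32 + 32 = 256), while height and width are unchanged.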

### Stacking the Inception modules
In practice, multiple Inception modules are often stacked sequentially within the network. This allows the network to learn increasingly complex and abstract representations of the input image.

### Key advantages
- Multi-scale feature extraction: The Inception module effectively captures features at different scales, enhancing the network's ability to recognize objects of varying sizes and shapes.
- Computational efficiency: The use of `1x1` `Convolutional` layer for dimensionality reduction significantly improves computational efficiency, especially for larger filter sizes.
- Flexibility: The Inception module provides a flexible framework for exploring different architectural variations and optimizing network performance.

# What Are `1x1` Convolutions And How Are They Used To Reduce Dimensions?

### What are `1x1` convolutions?
- Definition:
    - A `1x1` convolution is a special case of `Convolutional` layer in a Neural Network where the filter size is `1x1`.
    - Essentially, it is a single-pixel filter that slides across the input feature maps.
- "Silly" sliding:
    - At first glance, it might seem pointless to slide a `1x1` filter. It is true that in a single-channel input, a `1x1` convolution would simply multiply each pixel by a constant.

### Multi-channel inputs and dimension reduction
- The power of multi-channel inputs:
    - The real power of `1x1` convolutions lies in their application to multi-channel inputs, like the RGB channels in an image.
    - In this scenario, the `1x1` filter has a depth equal to the number of input channels.
- Channel-wise linear combination:
    - Each `1x1` filter element acts as a weight for a specific input channel.
    - The convolution operation effectively performs a linear combination of the values from all input channels at each pixel location.
- Example - RGB image:
    - For an RGB image, a `1x1` convolution with 3 filter elements would,
        1. Multiply the Red channel by the first filter element.
        2. Multiply the Green channel by the second filter element.
        3. Multiply the Blue channel by the third filter element.
        4. Sum the result of these multiplications.
    - This produces a single output value at each pixel location, which is a linear combination of the original RGB values.
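
The channel-wise linear combination can be sketched in NumPy. This is a minimal illustration without bias or activation; `conv_1x1` is a hypothetical helper and the pixel values and weights are made up.

```python
import numpy as np

def conv_1x1(feature_map, filters):
    """feature_map: (H, W, C_in); filters: (C_in, C_out). Returns (H, W, C_out)."""
    # Each output channel is a weighted sum of the input channels at the same pixel
    return np.einsum("hwc,ck->hwk", feature_map, filters)

# A 2x2 RGB image where every pixel is (R, G, B) = (10, 20, 30)
rgb = np.tile(np.array([10.0, 20.0, 30.0]), (2, 2, 1))

# One 1x1 filter with 3 weights, one per colour channel
single_filter = np.array([[0.5], [0.25], [0.25]])
print(conv_1x1(rgb, single_filter)[0, 0])  # 0.5*10 + 0.25*20 + 0.25*30 = 17.5

# Sixteen 1x1 filters applied to a 64-channel map leave 16 channels
reduced = conv_1x1(np.ones((8, 8, 64)), np.ones((64, 16)))
print(reduced.shape)  # (8, 8, 16)
```

The second call shows the dimensionality reduction: the spatial size stays `8x8`, while the depth drops from 64 to 16 because only 16 filters are used.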

### Why use `1x1` convolutions?
- Dimensionality reduction:
    - The most common use of `1x1` convolutions is to reduce the number of feature maps (channels).
    - By using fewer `1x1` filters than the number of input channels, the feature information is effectively compressed.
- Computational efficiency:
    - `1x1` convolutions are computationally inexpensive compared to larger filters.
    - Since they operate on a single pixel at a time, they involve fewer multiplications and additions.
- Bottleneck layers:
    - In some architectures, `1x1` convolutions are used to create bottleneck layers.
    - These layers have a significantly smaller number of channels, which reduces the number of parameters and computational cost while still allowing the network to capture essential information.

![computer_vision_75.png](attachment:computer_vision_75.png)

# Why Is There A Need To Increase The Depth Of The Network?
The following is a breakdown of why increasing the depth of the neural network is crucial,

### Representing complex functions
- Increased expressiveness: Deeper networks have a greater capacity to represent intricate and non-linear functions. Each layer adds another level of transformation allowing the network to capture increasingly sophisticated patterns in the data.
- Beyond linearity: Shallow networks are limited in their ability to model complex relationships that go beyond simple linear combinations. Deeper architectures can effectively capture highly non-linear dependencies.

### Hierarchical feature learning
- Feature extraction:
    - Lower layers: Focus on basic features like edges, colors and textures.
    - Intermediate layers: Combine these low-level features to form more complex shapes and objects.
    - Higher layers: Detect abstract concepts like faces, objects and scenes. This hierarchical representation allows the network to gradually build an understanding of the data.

### Improved performance
- State-of-the-art results: Deeper networks have consistently demonstrated superior performance across various tasks, including image classification, object detection and natural language processing.
- Pushing boundaries: The pursuit of deeper architectures has driven significant advancements in the field of Deep Learning, leading to breakthroughs in areas like Computer Vision and AI.

# Can Better Performance Be Obtained Just By Adding More Layers?
Simply adding more layers to a Neural Network does not guarantee better performance. In fact, it can lead to significant training challenges.

### Vanishing and exploding gradients
- The core issue:
    - During backpropagation, gradients (the derivatives of the loss function with respect to the weights) are calculated and propagated backwards through the network's layers.
    - In deep networks, these gradients are computed through a chain of matrix multiplications, one for each layer.
- Vanishing gradients:
    - If the values in these multiplications are consistently less than 1, the gradients can become extremely small as they propagate back through many layers.
    - This is like repeatedly multiplying a number by a fraction - it quickly approaches zero.
    - As a result, the weights of the earlier layers receive very small updates, hindering the learning process.
- Exploding gradients:
    - Conversely, if the values in the multiplications are consistently greater than 1, the gradients can grow exponentially large.
    - This leads to unstable training, where the weights can fluctuate wildly and the network fails to converge.
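
The repeated-multiplication effect is easy to reproduce numerically. The sketch below is a deliberate simplification: it replaces the per-layer matrix products with a single scalar factor, and `gradient_scale` is a hypothetical helper.

```python
def gradient_scale(per_layer_factor, num_layers):
    """Size of a gradient after being multiplied by the same factor at every layer."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= per_layer_factor
    return scale

# Factors below 1 shrink the gradient, factors above 1 blow it up
for factor in (0.5, 1.0, 1.5):
    print(factor, gradient_scale(factor, num_layers=50))
```

After 50 layers a factor of 0.5 leaves a gradient smaller than one part in a trillion of its original size, while a factor of 1.5 grows it by a factor of several hundred million.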

### Why these problems occur?
- Activation functions: Some activation functions, like Sigmoid and TanH, saturate for inputs of large magnitude, where their gradients approach 0. This can contribute to vanishing gradients.
- Weight initialization: Poorly initialized weights can also exacerbate these problems. If weights are too small, gradients may vanish and if they are too large, gradients may explode.

### Mitigating the problems
- Weight initialization strategies: Techniques like Xavier (Glorot) initialization and He initialization aim to keep the variance of activations and gradients stable across layers.
- Activation functions: Using activation functions like ReLU, whose gradient is exactly 1 for positive inputs, can help alleviate vanishing gradients.
- Gradient clipping: This technique limits the magnitude of gradients during back propagation, preventing them from exploding.
- Batch normalization: This technique helps stabilize the learning process by normalizing the activations of each layer.
- Residual connections (skip connections): These connections allow gradients to flow more directly through the network, bypassing some layers and mitigating vanishing gradients.
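
Of the techniques above, gradient clipping is the simplest to show in isolation. The following NumPy sketch clips by the joint (global) L2 norm, which is one common variant. `clip_by_global_norm` is written here for illustration and the gradient values are made up.

```python
import numpy as np

def clip_by_global_norm(gradients, max_norm):
    """Scale all gradient arrays by one common factor so their joint L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in gradients))
    if global_norm <= max_norm:
        return gradients, global_norm   # already small enough, leave untouched
    scale = max_norm / global_norm
    return [g * scale for g in gradients], global_norm

# Two made-up gradient arrays; joint norm = sqrt(9 + 16 + 144) = 13
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)

print(norm_before)  # 13.0
print(clipped)      # same direction as before, joint norm scaled down to 1
```

Because every array is scaled by the same factor, the direction of the update is preserved and only its length is limited.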

# Problem Of Very Deep Neural Networks
The problem of vanishing gradients is a core hurdle in Deep Learning.

### The essence
- In deep networks, the gradient signal, which guides the weight updates during training, can diminish exponentially as it backpropagates from the output layer to the initial layers.
- This "vanishing" effect makes it incredibly difficult for the network to learn meaningful representations in the earlier layers.

### Why does it happen?
- Back propagation and chain rule: Back propagation relies on the chain rule to compute gradients. In deep networks, this involves a long chain of multiplications, each involving the weight matrix of a layer and the derivative of the activation function.
- Small gradients: If the magnitudes of these derivatives or weights are consistently less than 1, the product across many layers can become vanishingly small.
- Activation function: Sigmoid and TanH functions, with their saturating derivatives, are particularly prone to this issue.
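
The chain of multiplications can be written out explicitly. The following is a sketch under simplifying assumptions: a plain fully connected network with $h_k = \sigma(z_k)$ and $z_k = W_k h_{k-1}$ for $k = 1, \dots, n$, and a loss $\mathcal{L}$. One backward step through layer $k$ is,

$$
\frac{\partial \mathcal{L}}{\partial h_{k-1}} = W_k^{\top} \, \mathrm{diag}\big(\sigma'(z_k)\big) \, \frac{\partial \mathcal{L}}{\partial h_k}
$$

Applying this step $n$ times and taking norms gives,

$$
\left\lVert \frac{\partial \mathcal{L}}{\partial h_0} \right\rVert \le \left( \prod_{k=1}^{n} \lVert W_k \rVert \, \max_i \big| \sigma'(z_{k,i}) \big| \right) \left\lVert \frac{\partial \mathcal{L}}{\partial h_n} \right\rVert
$$

If every factor $\lVert W_k \rVert \, \max_i |\sigma'(z_{k,i})|$ is at most some constant $c < 1$, the gradient reaching the first layer is bounded by $c^n$ times the gradient at the output, which shrinks exponentially with depth.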

### Consequences
- Slow learning: Early layers receive minimal updates, leading to extremely slow training or even preventing the network from learning effectively.
- Difficulty in capturing complex features: Early layers are crucial for extracting basic features. If they cannot learn properly, the entire network's ability to represent complex patterns is severely hindered.

![computer_vision_76.png](attachment:computer_vision_76.png)

### How Will This Issue Be Solved?
ResNet (Residual Network).

# ResNet
ResNet is a groundbreaking deep learning architecture that revolutionized the field of computer vision. It was introduced in 2015 by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun of Microsoft Research in their seminal paper "Deep Residual Learning for Image Recognition". It won the 2015 ImageNet competition (ILSVRC).

The key innovation introduced in ResNet is "Skip Connections", which allow the network to learn residual functions.

The impact of ResNet is that it revolutionized Deep Learning in Computer Vision by making it practical to train extremely deep networks (over 150 layers) without being blocked by the vanishing or exploding gradient problem.

| model_name | number_of_parameters | top_1_acc | top_5_acc |
| :-: | :-: | :-: | :-: |
| AlexNet | 60M | 63.3% | 84.6% |
| VGG-16 | 138M | 74.4% | 91.9% |
| VGG-19 | 155M | 74.5% | 92.0% |
| Inception | 11.2M | 74.8% | 92.2% |
| ResNet-50 | 25.6M | 76.13% | approx. 93% |

![computer_vision_77.png](attachment:computer_vision_77.png)

# What Are Skip Connections?

![computer_vision_78.png](attachment:computer_vision_78.png)

### The essence
- Core idea: In a ResNet block, the output of the block, $f(x)$, is calculated as,
    - $f(x) = C(x) + x$.
    - Where,
        - $C(x)$ = Output of the convolutional path within the block.
        - $x$ = Input to the block (the skip connection).
- Learning residuals: This formulation means the convolutional path, $C(x)$, is effectively trained to learn the residual - the difference between the desired output and the input. In other words,
    - $C(x) = f(x) - x$.

### Why learning residuals is easier?
- Identity mapping: The authors of ResNet argue that in many cases, the optimal function for a layer in a very deep network might simply be the identity function (i.e., output = input).
- Learning difficulty: Directly learning the identity function in very deep networks can be challenging due to the vanishing or exploding gradient problem.
- Residual learning advantage: By learning the residual, the network can easily represent the identity function by simply driving the weights of the convolutional path toward zero, so that $C(x) = 0$ and $f(x) = x$. This makes it easier for the network to learn and optimize.
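
The identity argument can be illustrated with a toy residual block. In this sketch the convolutional path $C(x)$ is replaced by a hypothetical per-element scaling (`conv_path`), which is an assumption made only to keep the example small; it is not a real convolution.

```python
def conv_path(x, weights):
    """Stand-in for C(x): scale every element of x by its own weight."""
    return [w * xi for w, xi in zip(weights, x)]


def residual_block(x, weights):
    """f(x) = C(x) + x, the skip connection adds the input back."""
    return [c + xi for c, xi in zip(conv_path(x, weights), x)]


x = [1.0, -2.0, 3.5]

# with all weights at zero, C(x) = 0 and the block is exactly the identity
print(residual_block(x, weights=[0.0, 0.0, 0.0]))

# with small weights, the block is a small correction on top of the identity
print(residual_block(x, weights=[0.1, 0.1, 0.1]))
```

Without the skip connection, zero weights would produce an all-zero output instead of passing the input through.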

### How do skip connections solve the vanishing and exploding gradients problem?
- Direct gradient flow:
    - Skip connections provide a direct path for gradients to flow from later layers back to earlier layers.
    - Instead of solely relying on the chain of multiplications through the `Convolutional` layers, gradients can now flow directly through the skip connection.
- Mitigating vanishing gradients:
    - If the gradients become very small as they propagate through the `Convolutional` layers, the skip connection ensures that a portion of the gradient signal still reaches earlier layers. This helps prevent the gradient from vanishing completely.
- Improved stability:
    - By allowing gradients to flow more directly, skip connections stabilize the training process. This makes it possible to train significantly deeper networks without encountering the vanishing or exploding gradient problem.
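
The direct gradient path follows from differentiating the block equation $f(x) = C(x) + x$. As a sketch that ignores the ReLU applied after the addition, the gradient of a loss $\mathcal{L}$ with respect to the block input is,

$$
\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial f} \left( \frac{\partial C}{\partial x} + I \right) = \frac{\partial \mathcal{L}}{\partial f} \, \frac{\partial C}{\partial x} + \frac{\partial \mathcal{L}}{\partial f}
$$

The first term passes through the `Convolutional` layers and may become very small. The second term is the gradient from the later layer copied unchanged by the skip connection, so it does not depend on the weights of the convolutional path.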

The following image shows the landscape of the loss function with and without skip connections,

![computer_vision_79.png](attachment:computer_vision_79.png)

### Different versions of ResNet
ResNets are composed of numerous residual blocks. Each block typically contains a few `Convolutional` layers and a skip connection. The primary way ResNet models are differentiated is by the number of layers. Common versions include,
- ResNet-18: A relatively smaller model with 18 layers.
- ResNet-34: Slightly deeper than ResNet-18 with 34 layers.
- ResNet-50: A significant jump in depth with 50 layers.
- ResNet-101: A very deep model with 101 layers.
- ResNet-152: An extremely deep model with 152 layers.
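
The layer counts in these names can be reproduced with simple arithmetic. The block counts per stage below are taken from the ResNet paper rather than from this document, and the counting convention (one initial convolution, the convolutions inside the residual blocks, one final fully connected layer, shortcut convolutions not counted) is an assumption of this sketch.

```python
# (residual blocks in stages 2 to 5, weighted layers per block)
RESNET_CONFIGS = {
    "ResNet-18": ([2, 2, 2, 2], 2),
    "ResNet-34": ([3, 4, 6, 3], 2),
    "ResNet-50": ([3, 4, 6, 3], 3),
    "ResNet-101": ([3, 4, 23, 3], 3),
    "ResNet-152": ([3, 8, 36, 3], 3),
}


def count_layers(blocks_per_stage, layers_per_block):
    # 1 initial convolution + layers inside residual blocks + 1 fully connected layer
    return 1 + sum(blocks_per_stage) * layers_per_block + 1


for name, (blocks, layers_per_block) in RESNET_CONFIGS.items():
    print(name, count_layers(blocks, layers_per_block))
```

ResNet-18 and ResNet-34 use blocks with two convolutions, while ResNet-50 and deeper use blocks with three convolutions, like the identity and convolution blocks shown in this notebook.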

![computer_vision_80.png](attachment:computer_vision_80.png)

### Types of residual blocks
- Identity block.
- Convolution block.

### Identity block

![computer_vision_78.png](attachment:computer_vision_78.png)

### Code for identity block

In [3]:
def identity_block(x, filters):

    # unpack the filter counts; f3 must equal the channel count of x for the addition
    f1, f2, f3 = filters

    # keep the input tensor for the skip connection
    x_skip = x

    # layer 1
    x = tf.keras.layers.Conv2D(f1, (1, 1), padding = "valid")(x)
    x = tf.keras.layers.BatchNormalization(axis = 3)(x)
    x = tf.keras.layers.Activation("relu")(x)

    # layer 2
    x = tf.keras.layers.Conv2D(f2, (3, 3), padding = "same")(x)
    x = tf.keras.layers.BatchNormalization(axis = 3)(x)
    x = tf.keras.layers.Activation("relu")(x)

    # layer 3 (no activation here, ReLU is applied after the addition)
    x = tf.keras.layers.Conv2D(f3, (1, 1), padding = "valid")(x)
    x = tf.keras.layers.BatchNormalization(axis = 3)(x)

    # add residue
    x = tf.keras.layers.Add()([x, x_skip])
    x = tf.keras.layers.Activation("relu")(x)

    return x

### Convolution block

![computer_vision_81.png](attachment:computer_vision_81.png)

### Code for convolution block

In [4]:
def convolution_block(x, s, filters):

    # unpack the filter counts for the three convolutions
    f1, f2, f3 = filters

    # keep the input tensor for the skip connection
    x_skip = x

    # layer 1 (stride s reduces the spatial size)
    x = tf.keras.layers.Conv2D(f1, (1, 1), padding = "valid", strides = (s, s))(x)
    x = tf.keras.layers.BatchNormalization(axis = 3)(x)
    x = tf.keras.layers.Activation("relu")(x)

    # layer 2
    x = tf.keras.layers.Conv2D(f2, (3, 3), padding = "same")(x)
    x = tf.keras.layers.BatchNormalization(axis = 3)(x)
    x = tf.keras.layers.Activation("relu")(x)

    # layer 3
    x = tf.keras.layers.Conv2D(f3, (1, 1), padding = "valid")(x)
    x = tf.keras.layers.BatchNormalization(axis = 3)(x)

    # processing residue with conv(1, 1) so that its shape matches the main path
    x_skip = tf.keras.layers.Conv2D(f3, (1, 1), padding = "valid", strides = (s, s))(x_skip)
    x_skip = tf.keras.layers.BatchNormalization(axis = 3)(x_skip)

    # add residue
    x = tf.keras.layers.Add()([x, x_skip])
    x = tf.keras.layers.Activation("relu")(x)

    return x

# ResNet-50 Architecture

![computer_vision_82.png](attachment:computer_vision_82.png)

In [5]:
def resnet50(input_shape = (224, 224, 3)):

    x_input = tf.keras.Input(input_shape)

    x = tf.keras.layers.ZeroPadding2D((3, 3))(x_input)

    # stage 1
    x = tf.keras.layers.Conv2D(64, (7, 7), strides = (2, 2), kernel_initializer = tf.keras.initializers.GlorotUniform(seed = 0))(x)
    x = tf.keras.layers.BatchNormalization(axis = 3)(x)
    x = tf.keras.layers.Activation("relu")(x)
    x = tf.keras.layers.MaxPooling2D((3, 3), strides = (2, 2))(x)

    # stage 2 (convolution_block arguments: tensor, stride, filters)
    x = convolution_block(x, 1, [64, 64, 256])
    x = identity_block(x, [64, 64, 256])
    x = identity_block(x, [64, 64, 256])

    # stage 3
    x = convolution_block(x, 2, [128, 128, 512])
    x = identity_block(x, [128, 128, 512])
    x = identity_block(x, [128, 128, 512])
    x = identity_block(x, [128, 128, 512])

    # stage 4
    x = convolution_block(x, 2, [256, 256, 1024])
    x = identity_block(x, [256, 256, 1024])
    x = identity_block(x, [256, 256, 1024])
    x = identity_block(x, [256, 256, 1024])
    x = identity_block(x, [256, 256, 1024])
    x = identity_block(x, [256, 256, 1024])

    # stage 5
    x = convolution_block(x, 2, [512, 512, 2048])
    x = identity_block(x, [512, 512, 2048])
    x = identity_block(x, [512, 512, 2048])

    x = tf.keras.layers.AveragePooling2D(pool_size = (2, 2), padding = "same")(x)

    model = tf.keras.Model(inputs = x_input, outputs = x, name = "ResNet50")

    return model
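
The spatial size of the feature map at each stage can be traced by hand with the standard output-size formulas for `"valid"` and `"same"` padding. The sketch below assumes one convolution block per stage with strides 1, 2, 2, 2 for stages 2 to 5 and a $224 \times 224$ input; the helper names are hypothetical.

```python
import math


def valid_out(size, kernel, stride):
    # output size of a convolution or pooling layer with padding = "valid"
    return (size - kernel) // stride + 1


def same_out(size, stride):
    # output size of a convolution or pooling layer with padding = "same"
    return math.ceil(size / stride)


def trace_spatial_size(size=224):
    trace = {}
    size += 2 * 3                    # ZeroPadding2D((3, 3)) adds 3 pixels per side
    size = valid_out(size, 7, 2)     # 7x7 convolution, stride 2
    size = valid_out(size, 3, 2)     # 3x3 max pooling, stride 2
    trace["stage 1"] = size
    for stage, stride in (("stage 2", 1), ("stage 3", 2), ("stage 4", 2), ("stage 5", 2)):
        # only the strided 1x1 convolution of the convolution block changes the size
        size = valid_out(size, 1, stride)
        trace[stage] = size
    # AveragePooling2D with pool_size (2, 2) uses a stride equal to the pool size by default
    trace["average pooling"] = same_out(size, 2)
    return trace


print(trace_spatial_size())
```

Under these assumptions the feature map shrinks from 55 after stage 1 to 7 after stage 5, and the final pooling layer leaves a $4 \times 4$ map with 2048 channels.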

### Why ResNet-50?
- Accuracy: ResNet-50 achieves higher ImageNet accuracy than GoogLeNet (Inception) in the comparison table above (76.13% vs. 74.8% top-1). This improved accuracy tends to translate to more robust and informative image representations.
- Parameter efficiency: ResNet-50 has more parameters than GoogLeNet (25.6M vs. 11.2M) but far fewer than VGG-16 (138M), while achieving higher accuracy than both.
- Widely used and well-researched: ResNet-50 is a widely adopted architecture with extensive research and readily available implementations. This provides a strong foundation and a wealth of resources for a project.

### Recap of Transfer Learning
- The challenge: Training a deep CNN from scratch is computationally expensive and time-consuming. It requires vast amounts of data and significant processing power.
- Transfer Learning solution:
    - Leverage pre-trained models: Utilize a pre-trained model (like ResNet-50) that has been trained on a massive dataset like ImageNet. These models have learned general visual features (edges, textures, shapes) that are applicable to a wide range of image-related tasks.
    - Fine-tuning:
        1. Feature extraction: Use a pre-trained model to extract high-level feature representations from the input images.
        2. Freeze initial layers: Freeze the weights of the initial layers of the pre-trained model, as these layers have learned general visual features.
        3. Train new layers: Add a new classifier (e.g., a Fully Connected layer) on top of the pre-trained model. Train only this new classifier on the specific image similarity dataset.
        4. Fine-tune deeper layers (optional): For further improvement, some of the deeper layers of the pre-trained model can be gradually unfrozen and fine-tuned along with the new classifier.
- Key benefits of Transfer Learning:
    - Faster training: Significantly reduces training time compared to training a model from scratch.
    - Improved performance: Often leads to better performance, especially when the amount of training data is limited.
    - Reduced data requirements: Requires less training data compared to training from scratch.
- Next steps:
    1. Implement ResNet-50: Load the pre-trained ResNet-50 model in Keras.
    2. Feature Extraction: Extract feature vectors from the images using the pre-trained model.
    3. Train a similarity model: Train a separate model (e.g., a simple feedforward network or a distance-based metric) on the extracted features to learn the similarity between image pairs.
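
For the distance-based option in the last step, one common choice is cosine similarity between feature vectors. The sketch below is a minimal illustration, not the method used later in this notebook: the 4-dimensional vectors and the image names are made up, and real feature vectors would be much longer.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, gallery):
    """Return the gallery names sorted from most to least similar to the query."""
    scores = {name: cosine_similarity(query, vector) for name, vector in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)


# hypothetical feature vectors standing in for extracted image features
gallery = {
    "image_a": [0.9, 0.1, 0.0, 0.2],
    "image_b": [0.0, 1.0, 0.8, 0.0],
    "image_c": [0.8, 0.2, 0.1, 0.3],
}
query = [1.0, 0.0, 0.0, 0.2]

print(rank_by_similarity(query, gallery))
```

Returning the first few names of this ranking is the reverse image search described in the problem statement: the images whose features point in nearly the same direction as the query come first.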

# Loading The ResNet-50 Model

In [7]:
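import ssl

# workaround for SSL certificate errors during the weight download
# it disables certificate verification, so use it only on a trusted network
ssl._create_default_https_context = ssl._create_unverified_context

# include_top = False drops the ImageNet classifier
# pooling = "avg" returns one 2048-dimensional feature vector per image
model = tf.keras.applications.resnet50.ResNet50(weights = "imagenet", include_top = False, input_shape = (224, 224, 3), pooling = "avg")

model.summary()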

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5
94765736/94765736 ━━━━━━━━━━━━━━━━━━━━ 9s 0us/step


# Feature Extraction