# Problem Statement
The task is to return similar images to the given query image.

# What Is Image Similarity?
- Image similarity is the measure of how similar 2 images are.
- It quantifies the degree of similarity between intensity patterns in 2 images.
- Finding images similar to a query image is also called reverse image search.

# Applications Of Image Similarity (What Can Be Done Using Image Similarity?)
- Image retrieval: Image similarity is the backbone of image search engines like Google Images and Pinterest. By understanding the visual content of a query image, these platforms efficiently retrieve similar images from their vast databases.
- E-commerce: Product recommendations play a crucial role in online shopping experiences. Image similarity empowers e-commerce platforms to suggest visually similar products to customers, enhancing their browsing and purchase decisions.
- Duplicate detection: Maintaining clean and organized datasets is essential for various tasks. Image similarity algorithms can efficiently identify duplicate or near-identical images within large datasets, ensuring data integrity and reducing storage requirements.
- Security and surveillance: Facial recognition systems leverage image similarity to match faces against databases of known individuals. This technology is used in security applications for identification, surveillance and access control.
- Brand protection: Image similarity safeguards brands by detecting unauthorized use of logos, trademarks or copyrighted images online. By comparing new content with a database of brand assets, businesses can proactively identify and address potential infringements.
- Content moderation: Image similarity can be used to streamline content moderation on social media platforms and online communities. By detecting visually similar content that violates community guidelines, such as violent or offensive imagery, image similarity algorithms can aid in maintaining a safe and appropriate online environment.

# How To Compute The Similarity Of 2 Images?
### Challenges of direct pixel comparison
Directly comparing the pixel values of 2 images often fails to capture meaningful similarities. This is because,
- Lighting variations: Images of the same object can have drastically different pixel values due to varying lighting conditions.
- Perspective shifts: Changes in the viewpoint can significantly alter pixel values, even if the object remains the same.
- Occlusions: Partial occlusions can drastically change pixel values, making direct comparison unreliable.
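
The lighting problem above can be illustrated with a small sketch. It uses synthetic random arrays in place of real photographs (an assumption made purely for illustration), and `mean_abs_pixel_diff` is a hypothetical helper, not part of any library. A brightened copy of an image ends up farther from the original, in raw pixel terms, than a completely unrelated image.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 64x64 grayscale "image" with pixel values in [0, 150].
image = rng.integers(0, 151, size=(64, 64)).astype(np.float64)

# The same scene under brighter lighting: every pixel is shifted up by 60.
brighter = image + 60.0

# A completely unrelated random image in the same value range.
unrelated = rng.integers(0, 151, size=(64, 64)).astype(np.float64)


def mean_abs_pixel_diff(a, b):
    """Mean absolute difference between two equally sized images."""
    return float(np.mean(np.abs(a - b)))


diff_same_scene = mean_abs_pixel_diff(image, brighter)
diff_unrelated = mean_abs_pixel_diff(image, unrelated)

print("same scene, brighter:", diff_same_scene)
print("unrelated image:     ", diff_unrelated)
```

In this toy setup the pixel-level difference ranks the unrelated image as "closer" than the brightened copy, which is the opposite of what a useful similarity measure should do.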

To overcome the above challenges, a representation that captures the essence of an image and goes beyond raw pixel values is needed. This is where Embeddings come into play.

### What are Embeddings?
- Numeric representations: Embeddings are essentially dense vectors (lists of numbers) that encode the semantic meaning of an image.
- Learned representations: These vectors are not manually engineered but rather learned by powerful Machine Learning models, typically deep Neural Networks.

### How to obtain image Embeddings?
1. Traditional methods (Hand-crafted features):
    - SIFT (Scale-Invariant Feature Transform): Detects key points and their local orientations, invariant to scale and rotation.
    - HOG (Histogram of Oriented Gradients): Represents the distribution of edge orientations within local cells.
    - LBP (Local Binary Patterns): Captures local texture information by comparing pixel intensities to their neighbors.
    - Gabor filters: Extract features at different frequencies and orientations.
    - The limitations of the above are,
        - Hand-engineered: These methods require careful design and may not capture complex relationships in images.
        - Performance: Often less accurate than Deep Learning based methods, especially for complex tasks.
2. Deep Learning methods (Learned features):
    - CNNs: CNNs are the most popular approach. They learn hierarchical representations of images, capturing increasingly complex features as the network deepens.
    - Popular architectures:
        - AlexNet, VGG, ResNet, Inception: These models, pre-trained on massive datasets (like ImageNet), have proven highly effective for image feature extraction.

### Computing similarity with Embeddings
Once the image embeddings have been obtained, similarity can be computed using various distance metrics,
- Euclidean Distance: Measures the straight-line distance between 2 points in the embedding space.
- Cosine Similarity: Measures the cosine of the angle between 2 vectors, capturing the direction rather than the magnitude of the vectors.
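
As a minimal sketch of the two metrics above, the following NumPy functions compute both on small hand-picked vectors. The vectors are illustrative stand-ins for real image embeddings, and the function names are chosen here for readability.

```python
import numpy as np


def euclidean_distance(a, b):
    """Straight-line distance between two vectors (0 means identical)."""
    return float(np.linalg.norm(a - b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0, 3.0])
b = 2.0 * a                      # same direction as a, twice the magnitude
c = np.array([-3.0, 0.0, 1.0])   # orthogonal to a: 1*(-3) + 2*0 + 3*1 = 0

print(euclidean_distance(a, b))  # non-zero, because the magnitudes differ
print(cosine_similarity(a, b))   # approximately 1, because the directions match
print(cosine_similarity(a, c))   # 0.0, because the vectors are orthogonal
```

The pair `a` and `b` shows the practical difference: Euclidean distance is sensitive to vector magnitude, while Cosine similarity only looks at direction.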

### Example: Finding similar images
1. Extract the Embeddings: Use a pre-trained CNN to extract embeddings for the query image and all the images in the database.
2. Compute similarities: Calculate the Cosine similarity (or another suitable metric) between the query image embedding and all other image embeddings.
3. Rank images: Sort the images based on their similarity scores to the query image.
4. Retrieve similar images: Present the top-ranked images as the most similar results.
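
Steps 2 to 4 above can be sketched with NumPy as follows. The embeddings are small hypothetical vectors standing in for real CNN outputs, and `top_k_similar` is an illustrative helper name, not a library function.

```python
import numpy as np


def top_k_similar(query, database, k=3):
    """Return the indices and cosine similarities of the k database rows closest to query."""
    # L2-normalize so that a plain dot product equals the cosine similarity.
    query_unit = query / np.linalg.norm(query)
    database_unit = database / np.linalg.norm(database, axis=1, keepdims=True)
    scores = database_unit @ query_unit
    # Rank from most to least similar and keep the k best.
    order = np.argsort(-scores)[:k]
    return order, scores[order]


# Hypothetical 4-dimensional embeddings, one row per database image.
database = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0, 0.0])

indices, scores = top_k_similar(query, database, k=2)
print(indices)  # [0 1]
print(scores)
```

Comparing the query against every row is fine for small collections. For very large databases an approximate nearest-neighbor index is a common alternative, which is outside the scope of this sketch.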

### Key advantages of Deep Learning Embedding:
- Data-driven: Learn directly from the data, capturing complex patterns and relationships.
- High Accuracy: Embeddings often outperform hand-crafted features, especially for challenging tasks.
- Versatility: Can be used for various tasks beyond image similarity, such as image classification, object detection and image generation.

![computer_vision_69.png](attachment:computer_vision_69.png)

# How Is Image Similarity Different From Image Classification In CNN?
While both image similarity and image classification leverage CNNs, they have distinct goals and approaches,

### Image classification:
- Goal: Assigns an image to one or more pre-defined categories (e.g., "cat", "dog", "car").
- Focus: Primarily concerned with the final classification layer, which predicts the probability of the image belonging to each class.
- Output: A discrete label or a set of probabilities for each class.
- Discards information: The final classification layer often discards subtle visual nuances that might be important for similarity comparisons.

### Image similarity
- Goal: Determines how similar 2 images are based on their visual content.
- Focus: Utilizes the learned feature representations (embeddings) extracted from the intermediate layers of a CNN.
- Output: A continuous similarity score (e.g., Euclidean distance, Cosine similarity) that quantifies the degree of visual resemblance.
- Preserves richer information: By using intermediate layer Embeddings, image similarity captures more detailed visual information than just final classification.

### In essence
- Image classification focuses on assigning a discrete label to an image.
- Image similarity focuses on finding continuous relationships between images based on their visual content.

### Example
Imagine that a CNN has been trained to classify images into "cat", "dog" and "bird",
- Classification: Given an image, the model would predict "cat" with a high probability.
- Similarity: The model could be used to find other images that are visually similar to the input image, regardless of their class labels. This might include images of different cat breeds or even images of other animals with similar textures or patterns.

### Key points
- Image similarity leverages the power of CNNs to learn rich visual representations.
- By focusing on intermediate layer Embeddings, image similarity captures more nuanced visual information than traditional classification.
- This enables applications like image search, recommendation systems and duplicate detection.

# Feature Extraction Using Pre-Trained CNN Models
Leveraging pre-trained CNN models for feature extraction is a powerful technique in Deep Learning. The following is a breakdown of the key aspects,

### Why use pre-trained models?
- Rich feature representations: Pretrained models, like ResNet-50, InceptionV3 and others have been trained on massive datasets (e.g., ImageNet) with millions of images. This extensive training allows them to learn intricate and hierarchical feature representations that capture essential visual information like edges, textures, shapes and objects.
- Transfer Learning: By utilizing these pre-trained models, the knowledge gained from training on a large dataset can be transferred to the specific task at hand. This significantly reduces the need for large amounts of labeled data for the target problem.
- Efficiency: Training a deep CNN from scratch can be computationally expensive and time-consuming. Using pre-trained models makes it possible to leverage existing knowledge and accelerate the training process.

### How it works?
1. Load the pre-trained model: Load the chosen pre-trained model.
2. Extract features:
    - Remove the final classification layer: Typically, the last layer of a pre-trained model is a Classifier (e.g., Softmax). Remove this layer.
    - Feed the images: Pass the images through the remaining layers of the pretrained model.
    - Extract feature vectors: Obtain the output of the penultimate layer (or another suitable intermediate layer). This output will be a high-dimensional vector representing the extracted features of the image.
3. Train a Classifier:
    - Train a new classifier (e.g., SVM, Logistic Regression or another small Neural Network) on top of the extracted features. This Classifier will learn to map the extracted features to the correct class labels for the specified dataset.
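
Steps 1 and 2 above could look roughly like the following sketch, which uses the Keras `ResNet50` application. The input size of `224x224`, the use of global average pooling as the feature layer, and the helper names `build_feature_extractor` and `extract_features` are assumptions made for illustration, not a prescribed setup.

```python
import numpy as np
import tensorflow as tf


def build_feature_extractor(weights="imagenet"):
    """ResNet-50 without its classification head; outputs one 2048-d vector per image."""
    return tf.keras.applications.ResNet50(
        weights=weights,            # "imagenet" downloads pre-trained weights, None gives random weights
        include_top=False,          # drop the final classification layer
        pooling="avg",              # global average pooling produces a flat feature vector
        input_shape=(224, 224, 3),
    )


def extract_features(model, images):
    """images: array of shape (n, 224, 224, 3) holding RGB values in [0, 255]."""
    # Copy to float32 first, since preprocess_input may modify its argument in place.
    batch = np.array(images, dtype=np.float32)
    batch = tf.keras.applications.resnet50.preprocess_input(batch)
    return model.predict(batch, verbose=0)


# Example usage (the first call with weights="imagenet" downloads the weights):
# model = build_feature_extractor()
# features = extract_features(model, images)   # shape: (n, 2048)
```

The returned array can then serve as the input for step 3, for example by fitting a scikit-learn `LogisticRegression` or a small dense network on `features` and the corresponding labels.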

### Benefits
- Improved performance: Often leads to significantly better performance compared to training a model from scratch, especially when dealing with limited labeled data.
- Faster training: Training a classifier on top of extracted features is much faster than training a deep CNN from scratch.
- Reduced overfitting: Pre-trained models have already learned generalizable features, which can help to reduce overfitting on smaller datasets.

### Caltech101 dataset
Caltech101 dataset is a common benchmark for image classification. By extracting features using pre-trained models like ResNet-50 and InceptionV3 and then training Classifiers on these features, a high Accuracy can be achieved on the Caltech101 dataset with relatively little training data.

### Key considerations:
- Choosing the right pre-trained model: Select a model that aligns with the complexity and nature of the images.
- Fine-tuning (optional): For further improvement, the later layers of the pre-trained model can be fine-tuned on the specified dataset. This allows the model to adapt slightly to the nuances of the data.
- Data augmentation: Augmenting the training data (e.g., rotations, flips, crops) can further improve generalization and robustness.

![computer_vision_70.png](attachment:computer_vision_70.png)

# Caltech101 Dataset
The Caltech101 dataset is a collection of images used for object recognition and classification tasks in Computer Vision.

### Key characteristics:
- 101 classes: It contains images of objects belonging to 101 different categories, plus a background clutter class.
- Image count: Each class has roughly 40 to 800 images, resulting in a total of around 9000 images.
- Image size: Images typically have edge lengths of 200 to 300 pixels.
- Image-level labels: Each image is associated with a single object label.
- Purpose: The dataset is widely used to train and evaluate object recognition algorithms.

### Significance
The Caltech101 dataset has played a crucial role in advancing Computer Vision research. It provides a benchmark for evaluating the performance of object recognition algorithms and has been used in numerous research papers and publications.

### Applications
- Object recognition: Training and evaluating models for identifying objects in images.
- Image classification: Classifying images into their respective categories.
- Fine-grained image analysis: Distinguishing between subtle variations within object categories.
- One-shot learning: Training models to recognize objects with limited training data.

In [1]:
# importing dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as skl
import tensorflow as tf
import os
import random
import math
import time
import warnings

# Which Pre-Trained CNN Architecture To Use?
AlexNet and VGGNet, despite their groundbreaking contributions, have limitations,
- Computational cost: They are computationally expensive to train due to their depth and large number of parameters, particularly in the Fully Connected layers.
- Vanishing or exploding gradient: As the network deepens, training becomes challenging due to the vanishing or exploding gradient problem, hindering the flow of information during backpropagation.

### Enter ResNet and Inception
These architectures emerged as significant advancements, addressing the limitations of their predecessors.

1. ResNet (Residual Networks):
    - Key idea: ResNets introduce "skip connections" or "residual connections" that allow information to bypass layers directly. This helps overcome the vanishing or exploding gradient problem and enables training of extremely deep networks.
    - Benefits:
        - Improved training stability: Skip connections facilitate the flow of gradients, enabling deeper networks to be trained effectively.
        - Enhanced performance: ResNets have achieved state-of-the-art results on various image recognition benchmarks, including ImageNet.
2. Inception Networks (GoogLeNet):
    - Key idea: Inception modules efficiently extract features at multiple scales simultaneously using a combination of convolutional filters with different sizes (`1x1`, `3x3`, `5x5`) and pooling operations.
    - Benefits:
        - Efficient feature extraction: Inception modules capture information at different scales and resolutions, improving the model's ability to recognize objects of varying sizes and shapes.
        - Reduced computational cost: The use of `1x1` `Convolutional` filters helps to reduce the number of parameters and computational costs.
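
The skip connection idea from the ResNet item above can be sketched in a few lines of NumPy. This is a deliberately simplified version that uses small dense layers instead of convolutions and omits batch normalization. The function names and weight shapes are chosen for illustration only.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is two linear layers with a ReLU in between."""
    fx = relu(x @ w1) @ w2   # the residual branch F(x)
    return relu(fx + x)      # the skip connection adds the input back


x = np.array([[1.0, 2.0, 3.0]])
zeros = np.zeros((3, 3))

# With an all-zero residual branch the block passes a non-negative input through unchanged.
out = residual_block(x, zeros, zeros)
print(out)  # [[1. 2. 3.]]
```

Because the input is added back unchanged, the block only has to learn a correction to the identity mapping, and gradients have a direct path around the weighted layers during backpropagation.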

### Summary
ResNet and Inception architectures represent significant advancements in CNN design. They address the limitations of earlier models like AlexNet and VGG by,
- Improving training stability: ResNets mitigate the vanishing or exploding gradient problem.
- Increasing efficiency: Inception modules extract features efficiently and reduce computational cost.
- Achieving state-of-the-art performance: Both architectures have demonstrated exceptional performance on various image recognition tasks.

These advancements have paved the way for even more sophisticated CNN architectures and continue to drive progress in the field of Computer Vision.

# Inception Networks (GoogLeNet)

### Origins
The name "Inception" is inspired by the Christopher Nolan movie of the same name. The line "We need to go deeper" from the movie was the motivation to build deeper Neural Networks.

![computer_vision_71.png](attachment:computer_vision_71.png)

It was introduced in 2014 by Christian Szegedy et al. in the paper "Going Deeper with Convolutions". Inception (GoogLeNet) won the ImageNet competition in 2014.

The original paper link: https://arxiv.org/pdf/1409.4842

### Importance of GoogLeNet
- Reduced network size: Compared to AlexNet and VGGNet, Inception achieves a dramatic reduction in network size. This is primarily due to,
    - Removal of Fully Connected layers: Inception replaces the large Fully Connected layers with Global Average Pooling (GAP), keeping only a single Fully Connected layer for the final classification. In AlexNet and VGGNet most parameters reside in the Fully Connected layers, so this change leads to substantial memory savings.
    - Efficient feature extraction: The architecture utilizes inception modules (explained below) to extract features efficiently.
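
The parameter savings from Global Average Pooling can be checked with simple arithmetic. The sketch below assumes a final feature map of shape `7x7x1024` and a 1000-class classifier. These numbers are illustrative assumptions, and `dense_params` is a hypothetical helper.

```python
def dense_params(inputs, outputs):
    """Weights plus biases of one Fully Connected layer."""
    return inputs * outputs + outputs


# Assumed shape of the last convolutional feature map and number of classes.
height, width, channels = 7, 7, 1024
num_classes = 1000

# Option A: flatten the whole feature map and feed it to the classifier.
flatten_params = dense_params(height * width * channels, num_classes)

# Option B: GAP averages each channel to one number (no trainable parameters),
# so the classifier only sees `channels` inputs.
gap_params = dense_params(channels, num_classes)

print(flatten_params)  # 50177000
print(gap_params)      # 1025000
```

Under these assumptions the classifier after GAP needs roughly 49 times fewer parameters, which equals the number of spatial positions (`7 * 7`) that were averaged away.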

### Comparison of network sizes and accuracies

| model_name | number_of_parameters | top_1_acc | top_5_acc |
| :-: | :-: | :-: | :-: |
| AlexNet | 60M | 63.3% | 84.6% |
| VGG-16 | 138M | 74.4% | 91.9% |
| VGG-19 | 144M | 74.5% | 92.0% |
| Inception | 11.2M | 74.8% | 92.2% |

Observe that Inception achieves accuracy comparable to or better than AlexNet and VGGNet while using significantly fewer parameters.

### Inception Modules: Network in Network
- Concept: Inception introduces a novel concept called a "network in network" or "micro-architecture". This refers to small building blocks used within the larger network.
- Functionality: Unlike traditional sequential CNNs, Inception allows the output from a layer to split into multiple paths with different operations (e.g., various filter sizes) and then rejoin later.
- Inception module as multi-level feature extractor: This approach enables the network to learn `Convolutional` layers with multiple filter sizes within the inception module, effectively extracting features at different scales and resolutions.

In essence, Inception represents a major advancement in CNN design by achieving high accuracy with a smaller network size and introducing the concept of inception modules for efficient multi-level feature extraction.
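
The split-and-rejoin structure described above could be sketched in Keras as follows. The branch layout follows the dimension-reduced module from the GoogLeNet paper, and the filter counts and the `28x28x192` input shape are example values. The function name `inception_module` and its parameter names are chosen here for illustration.

```python
import tensorflow as tf


def inception_module(x, f1, f3_reduce, f3, f5_reduce, f5, pool_proj):
    """Four parallel branches on the same input, concatenated along the channel axis."""
    conv = tf.keras.layers.Conv2D

    # Branch 1: plain 1x1 convolution.
    b1 = conv(f1, 1, padding="same", activation="relu")(x)

    # Branch 2: 1x1 reduction followed by a 3x3 convolution.
    b2 = conv(f3_reduce, 1, padding="same", activation="relu")(x)
    b2 = conv(f3, 3, padding="same", activation="relu")(b2)

    # Branch 3: 1x1 reduction followed by a 5x5 convolution.
    b3 = conv(f5_reduce, 1, padding="same", activation="relu")(x)
    b3 = conv(f5, 5, padding="same", activation="relu")(b3)

    # Branch 4: 3x3 max pooling followed by a 1x1 projection.
    b4 = tf.keras.layers.MaxPooling2D(pool_size=3, strides=1, padding="same")(x)
    b4 = conv(pool_proj, 1, padding="same", activation="relu")(b4)

    # "same" padding keeps height and width equal, so the branches can be stacked.
    return tf.keras.layers.Concatenate(axis=-1)([b1, b2, b3, b4])


inputs = tf.keras.Input(shape=(28, 28, 192))
outputs = inception_module(inputs, f1=64, f3_reduce=96, f3=128, f5_reduce=16, f5=32, pool_proj=32)
model = tf.keras.Model(inputs, outputs)

print(model.output_shape)  # (None, 28, 28, 256)
```

The output has `64 + 128 + 32 + 32 = 256` channels, one contiguous group per branch, while the spatial size stays at `28x28`.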

# Why Is There A Need For Multi-Level Feature Extractor?
The Inception module's brilliance lies in its ability to simultaneously extract features at multiple scales, addressing the critical question of "which filter size is the best?".

### The dilemma of filter size
- Trade-off: Often in traditional CNNs, there is a dilemma when choosing filter sizes,
    - Large filters (e.g., `5x5`): Capture broader spatial information, but are computationally expensive.
    - Small filters (e.g., `3x3`): More efficient and good at fine details, but might miss broader spatial context.
    - `1x1` filters: Look at a single pixel position across all channels, so they capture no spatial context, but they are useful for dimensionality reduction.

### The Inception solution
The Inception module elegantly overcomes this dilemma by, 
1. Parallel execution: Instead of choosing a single filter size, it employs parallel `Convolutional` paths with different filter sizes (`1x1`, `3x3`, `5x5`) on the same input.
2. Concatenation: The outputs from these parallel `Convolutional` layers are then concatenated along the channel dimension. This creates a richer feature map that incorporates information extracted at various scales.
3. `1x1` convolutions as dimensionality reduction: Before applying larger filters (`3x3` and `5x5`), `1x1` `Convolutional` layers are used to reduce the dimensionality of the input. This significantly reduces the number of parameters and computational cost, making the process more efficient.
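
The effect of the `1x1` reduction in point 3 can be verified with a short parameter count. The channel numbers below (192 input channels, reduced to 16, then 32 output filters) are assumed example values, and `conv_params` is a hypothetical helper.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of one 2D convolution layer with a square kernel."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


in_channels = 192       # assumed channels entering the module
reduced_channels = 16   # assumed channels after the 1x1 reduction
out_channels = 32       # assumed number of 5x5 filters

# 5x5 convolution applied directly to the full-depth input.
direct = conv_params(5, in_channels, out_channels)

# 1x1 reduction first, then the 5x5 convolution on the thinner tensor.
reduced = conv_params(1, in_channels, reduced_channels) + conv_params(5, reduced_channels, out_channels)

print(direct)   # 153632
print(reduced)  # 15920
```

With these example values the reduced path needs about a tenth of the parameters of the direct `5x5` convolution, even though it contains one more layer.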

### Why multi-scale feature extraction is crucial:
- Capturing diverse information: Images contain features at various scales. Small filters capture fine details, while larger filters capture broader spatial patterns. By extracting features at multiple scales, the Inception module ensures that the network can effectively learn and represent a wide range of visual information.
- Improved representation: The combined output of the parallel paths creates a more comprehensive and robust feature representation compared to using a single filter size.
- Adaptability: The network can learn the optimal combination of features at different scales, adapting to the specific characteristics of the input data.

### Example
In the images below,

![computer_vision_72.png](attachment:computer_vision_72.png)

There are 3 main entities,
- Wheat.
- Humans.
- Sky.

For effective image similarity assessment, a Neural Network benefits from the ability to utilize Kernels of varying dimensions.
- Recognizing large-scale entities: Larger Kernels are generally more suitable for capturing features of objects that occupy a significant portion of the image.
- Detecting fine-grained details: Conversely, smaller Kernels are more effective at identifying fine-grained details and local patterns within the image.

The Inception architecture, with its parallel paths employing Kernels of different sizes (`1x1`, `3x3`, `5x5`) provides the network with the flexibility to adapt to these varying spatial scales. This allows the model to efficiently extract and compare features at multiple levels of detail, leading to more accurate and robust image similarity assessment.

# Architecture
