## TOC:
* [Setup](#first-bullet)
* [Image Retrieval using Multimodal Models](#second-bullet)
* [Multimodal Visual QA](#third-bullet)
* [Datasets and Metrics](#eigth-bullet)
* [TL;DR](#last-bullet)

## Setup <a class="anchor" id="first-bullet"></a>

In [1]:
from transformers import ViTForImageClassification, ViTImageProcessor
from PIL import Image
import requests
import matplotlib.pyplot as plt
import numpy as np
from sentence_transformers import SentenceTransformer, util




## Image Retrieval using Multimodal Models <a class="anchor" id="second-bullet"></a>

Multimodal models combine multiple data modalities (e.g., images and text) to perform tasks such as image retrieval and question answering.

### Why Multimodal
- Combines strengths of different data types.
- Enables applications like:
  - Image search using text.
  - Answering questions about visual content.

### Overview
Image retrieval involves finding the most relevant image(s) from a dataset based on a query. Multimodal retrieval uses text or other images as queries.

___

### Key Idea
Learn a shared embedding space for text and images where semantically related inputs are close.

___

#### Key Concepts
- **Feature Extraction:** Represent images and text in a numerical form.
- **Similarity Matching:** Use a similarity or distance measure (e.g., cosine similarity) to measure relevance.

___

### **Key Steps in Image Retrieval**

#### **1. Feature Extraction**
- Extract numerical representations (embeddings) for both the query and the database images.
- Common techniques:
  - **For images:** Use Convolutional Neural Networks (CNNs) or Vision Transformers (ViT).
  - **For text queries:** Use transformers such as BERT or GPT.

#### **2. Embedding Space**
- Represent both images and text queries in a **shared embedding space**.
- Similar images or matching image-text pairs should be close together in this space.
- Measures such as **cosine similarity** or **Euclidean distance** are used to compare embeddings.

#### **3. Similarity Computation**
- Compare the query embedding with embeddings from the database using a similarity metric.
- Rank the database images based on their similarity to the query.
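
The three steps above can be sketched with NumPy. This is a minimal illustration, not a production retriever: the 3-dimensional vectors are made-up stand-ins for real encoder outputs, and `rank_by_cosine` is a hypothetical helper name.

```python
import numpy as np

def rank_by_cosine(query_emb, db_embs):
    """Return database indices sorted from most to least similar, plus the scores."""
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    scores = db @ q                 # cosine similarity of each database item to the query
    order = np.argsort(-scores)     # descending similarity
    return order, scores

# Toy embeddings (hypothetical values standing in for encoder outputs).
db_embs = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.7, 0.7, 0.0]])
query_emb = np.array([0.9, 0.1, 0.0])

order, scores = rank_by_cosine(query_emb, db_embs)
print(order, scores)
```

In a real system the query and database embeddings would come from the same multimodal encoder, so that text and image vectors are comparable.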

___

### Mathematical Formulation
Given:
- $ I $: Image embeddings.
- $ T $: Text embeddings.

Objective: Minimize the loss function $ \mathcal{L} $:
$$
\mathcal{L} = - \frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\text{sim}(I_i, T_i))}{\sum_{j=1}^{N} \exp(\text{sim}(I_i, T_j))}
$$
where $ \text{sim}(I, T) $ is a similarity measure.
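
A small NumPy sketch of this loss may help make the formula concrete. It assumes that $\text{sim}$ is cosine similarity and that row $i$ of `I` matches row $i$ of `T`; the random embeddings and the function name `image_to_text_loss` are illustrative only.

```python
import numpy as np

def image_to_text_loss(I, T):
    """Loss from the formula above; row i of I and row i of T form a matching pair."""
    I = I / np.linalg.norm(I, axis=1, keepdims=True)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    sim = I @ T.T                                   # sim[i, j] = sim(I_i, T_j)
    sim = sim - sim.max(axis=1, keepdims=True)      # numerical stability; the ratio is unchanged
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))      # average over the N matching pairs

rng = np.random.default_rng(0)
I = rng.normal(size=(4, 8))                         # stand-in embeddings
aligned = image_to_text_loss(I, I)                  # every pair matches perfectly
shuffled = image_to_text_loss(I, I[::-1])           # every pair is mismatched
print(aligned, shuffled)
```

The aligned batch gives the lower loss, which is the behaviour training tries to reach for real image-text pairs.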

### History of Image Retrieval
- Early 2000s: Retrieval systems relied on hand-crafted features such as SIFT combined with traditional machine learning.
- 2014: CNNs revolutionized image feature extraction.
- 2021: OpenAI's **CLIP** introduced a robust multimodal framework using contrastive learning.

### CLIP Architecture
- **Image Encoder:** ResNet or Vision Transformer (ViT).
- **Text Encoder:** Transformer with a GPT-2-style architecture.
- **Shared Space:** Project image and text embeddings into a common latent space.

# Contrastive Learning: Concepts and Mathematical Foundation

## Overview

**Contrastive Learning** is a representation learning technique that aims to map data points into a feature space where similar samples are close to each other and dissimilar samples are far apart. It is widely used in self-supervised learning and other tasks where labeled data is limited.

---

## Key Concepts

1. **Representation Learning**:
   - Learn a function $f_\theta$ (parameterized by $\theta$) that maps input data $x$ into a feature embedding $z = f_\theta(x)$.
   - The goal is to ensure embeddings preserve meaningful semantic relationships.

2. **Positive and Negative Pairs**:
   - **Positive pairs**: Represent semantically similar examples (e.g., two augmented views of the same image).
   - **Negative pairs**: Represent semantically dissimilar examples (e.g., different images or text).

3. **Contrastive Loss**:
   - A loss function designed to minimize the distance between positive pairs while maximizing the distance between negative pairs.

---

## Mathematical Foundation

### Feature Space

Let:
- $x_i$ and $x_j$: Two data points (e.g., augmented versions of an image).
- $z_i = f_\theta(x_i)$ and $z_j = f_\theta(x_j)$: Their embeddings in the feature space.

The embeddings are normalized to lie on a unit hypersphere, i.e., $\|z_i\| = 1$.

### Similarity Metric

The similarity between two embeddings is measured using the **cosine similarity**:
$$
\text{sim}(z_i, z_j) = \frac{z_i^\top z_j}{\|z_i\| \|z_j\|}.
$$
Since embeddings are normalized, this simplifies to:
$$
\text{sim}(z_i, z_j) = z_i^\top z_j.
$$
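
As a quick numerical check of this simplification, the sketch below normalizes two example vectors (arbitrary values chosen for easy arithmetic) and compares the full cosine formula with the plain dot product.

```python
import numpy as np

def l2_normalize(z):
    """Scale a vector to unit length so it lies on the unit hypersphere."""
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def cosine_similarity(a, b):
    """Full cosine formula, valid for vectors of any length."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

z_i = l2_normalize(np.array([3.0, 4.0]))   # becomes [0.6, 0.8]
z_j = l2_normalize(np.array([4.0, 3.0]))   # becomes [0.8, 0.6]

full = cosine_similarity(z_i, z_j)
dot = float(z_i @ z_j)                     # same value, no division needed
print(full, dot)
```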

### Contrastive Loss

#### 1. **Triplet Loss**:
Given an anchor $z_a$, a positive sample $z_p$, and a negative sample $z_n$, the triplet loss is:
$$
\mathcal{L}_{\text{triplet}} = \max(0, \|z_a - z_p\|^2 - \|z_a - z_n\|^2 + m),
$$
where $m > 0$ is a margin that separates positive and negative pairs.
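
The triplet loss can be written in a few lines of NumPy. The margin `m=0.2` and the 2-dimensional vectors are arbitrary values picked for illustration.

```python
import numpy as np

def triplet_loss(z_a, z_p, z_n, m=0.2):
    """max(0, ||z_a - z_p||^2 - ||z_a - z_n||^2 + m) for a single triplet."""
    d_pos = np.sum((z_a - z_p) ** 2)
    d_neg = np.sum((z_a - z_n) ** 2)
    return max(0.0, float(d_pos - d_neg + m))

anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negative = np.array([0.0, 1.0])

easy = triplet_loss(anchor, positive, negative)   # negative already far away
hard = triplet_loss(anchor, negative, positive)   # roles swapped: large penalty
print(easy, hard)
```

When the negative is already farther from the anchor than the positive by more than the margin, the loss is zero and the triplet contributes no gradient.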

#### 2. **NT-Xent Loss** (Normalized Temperature-Scaled Cross-Entropy Loss):
The NT-Xent loss, used in SimCLR, is defined as:
$$
\mathcal{L}_{\text{NT-Xent}} = -\log \frac{\exp(\text{sim}(z_i, z_j) / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k) / \tau)},
$$
where:
- $\tau > 0$ is a temperature scaling parameter.
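- $N$ is the batch size, so each batch contains $2N$ augmented views.
- The numerator represents the similarity between the positive pair $(z_i, z_j)$.
- The denominator sums over the other $2N - 1$ views in the batch, i.e., the positive $z_j$ and every negative; the indicator $\mathbb{1}_{[k \neq i]}$ excludes $z_i$ itself.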
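
The following sketch evaluates NT-Xent for one anchor. It is a simplified illustration rather than SimCLR's actual training code: the temperature `tau=0.5` and the four toy vectors (two views each of two samples) are assumptions.

```python
import numpy as np

def nt_xent(z, i, j, tau=0.5):
    """NT-Xent for the positive pair (i, j) among the 2N rows of z."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sims = z @ z[i] / tau                      # sim(z_i, z_k) / tau for every k
    mask = np.arange(len(z)) != i              # indicator 1[k != i]
    return float(-sims[j] + np.log(np.exp(sims[mask]).sum()))

# N = 2 samples, so 2N = 4 views: rows 0 and 1 are views of sample A, rows 2 and 3 of sample B.
z = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0],
              [0.1, 0.9]])

loss_true_pair = nt_xent(z, i=0, j=1)    # views of the same sample
loss_wrong_pair = nt_xent(z, i=0, j=2)   # views of different samples
print(loss_true_pair, loss_wrong_pair)
```

The full batch loss averages this quantity over every positive pair in both directions.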

#### 3. **Contrastive Loss** (pairwise; not to be confused with the triplet loss, although the two are similar)
$$
\mathcal{L}_{\text{contrastive}} = \frac{1}{N} \sum_{i=1}^N \bigg( y_{i} \cdot \|z_i - z_j\|^2 + (1 - y_{i}) \cdot \max(0, m - \|z_i - z_j\|)^2 \bigg),
$$

where:
- $N$: Number of pairs; $z_i$ and $z_j$ denote the two embeddings of the $i$-th pair.
- $y_i \in \{0, 1\}$: Binary label indicating whether the pair is positive ($y_i = 1$) or negative ($y_i = 0$).
- $\|z_i - z_j\|$: The distance (e.g., Euclidean) between the embeddings of the two samples.
- $m$: Margin that defines the minimum allowable distance between negative pairs.
- The first term ($y_i \cdot \|z_i - z_j\|^2$) minimizes the distance for positive pairs.
- The second term ($(1 - y_i) \cdot \max(0, m - \|z_i - z_j\|)^2$) ensures negative pairs are separated by at least the margin $m$.

##### Intuition
- For **positive pairs** ($y_i = 1$): Minimize the squared distance $\|z_i - z_j\|^2$ so that similar samples are closer in the embedding space.
- For **negative pairs** ($y_i = 0$): Ensure that their distance is greater than the margin $m$. If the distance is already greater than $m$, the loss contributes $0$ for that pair.
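
A worked example of the pairwise contrastive loss, using three hand-picked pairs so the arithmetic is easy to follow. The margin `m=1.0` and all vectors are illustrative assumptions.

```python
import numpy as np

def contrastive_loss(z_a, z_b, y, m=1.0):
    """Pairwise contrastive loss; row i of z_a and row i of z_b form the i-th pair."""
    d = np.linalg.norm(z_a - z_b, axis=1)              # Euclidean distance per pair
    pos_term = y * d ** 2                              # pull positive pairs together
    neg_term = (1 - y) * np.maximum(0.0, m - d) ** 2   # push negatives beyond the margin
    return float(np.mean(pos_term + neg_term))

z_a = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
z_b = np.array([[0.3, 0.4], [0.3, 0.4], [3.0, 4.0]])   # distances 0.5, 0.5, 5.0
y = np.array([1, 0, 0])                                # positive, close negative, far negative

loss = contrastive_loss(z_a, z_b, y)
print(loss)
```

The positive pair contributes $0.5^2 = 0.25$, the close negative contributes $(1 - 0.5)^2 = 0.25$, and the far negative contributes $0$ because its distance already exceeds the margin.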

### Optimization Goal

The goal of contrastive learning is to minimize the loss function so that positive pairs are closer together and negative pairs are farther apart.

---

## Applications

1. **Self-Supervised Learning**:
   - SimCLR and MoCo use contrastive loss to pretrain models without labeled data.
2. **Image Retrieval**:
   - Representations learned can be used to retrieve semantically similar images.
3. **Multimodal Learning**:
   - CLIP aligns text and image embeddings using contrastive learning.

---

## Advantages

- Reduces reliance on labeled data.
- Produces robust and generalized embeddings.
- Can be applied across domains (e.g., images, text, multimodal tasks).

---

## Challenges

- **Negative Sampling**: Effective negative sampling is computationally intensive.
- **Batch Size**: Large batch sizes are often necessary to include diverse negative samples.
- **Representation Collapse**: In some setups, all inputs map to nearly the same embedding, which makes the representation useless.


<img src="CLIP_arhitecture.png" alt="CLIP Architecture" class="hover-effect" width=900>

Source: <a href="https://arxiv.org/abs/2103.00020v1">Learning Transferable Visual Models From Natural Language Supervision</a>

In [None]:
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["A photo of a cat", "A photo of a dog"]
image = Image.open("cat.jpg")

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # Image-to-text similarity scores
print(logits_per_image.softmax(dim=1))  # Probabilities

tensor([[0.9921, 0.0079]], grad_fn=<SoftmaxBackward0>)


In [None]:
texts2 = ["A photo of a cat", "A photo of a dog"]
image2 = Image.open("cat_dog.jpg")

inputs = processor(text=texts2, images=image2, return_tensors="pt", padding=True)

outputs2 = model(**inputs)
logits_per_image2 = outputs2.logits_per_image  # Image-to-text similarity scores
print(logits_per_image2.softmax(dim=1))  # Probabilities

tensor([[0.5380, 0.4620]], grad_fn=<SoftmaxBackward0>)
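
To see how the printed probabilities relate to the underlying scores, recall that in CLIP `logits_per_image` holds scaled cosine similarities and that softmax depends only on the differences between logits. The sketch below recovers the logit gap implied by each printed output; the actual logit values are not shown in the outputs above, so the second logit is set to 0 as an arbitrary reference.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

# Log-odds implied by the two printed probability vectors.
gap_cat_only = np.log(0.9921 / 0.0079)   # image containing only a cat
gap_cat_dog = np.log(0.5380 / 0.4620)    # image containing a cat and a dog

probs_cat_only = softmax(np.array([gap_cat_only, 0.0]))
probs_cat_dog = softmax(np.array([gap_cat_dog, 0.0]))
print(gap_cat_only, probs_cat_only)
print(gap_cat_dog, probs_cat_dog)
```

A large gap means the model strongly prefers one caption, while a gap near zero, as with the cat-and-dog image, means both captions fit about equally well.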


# **Augmenting RAG with Image Retrieval**

Incorporating **image retrieval** into a Retrieval-Augmented Generation (RAG) system allows handling:
- **Image-based queries.**
- **Text-based queries.**
- **Multimodal queries** (text + image).

This enhancement enables RAG systems to handle richer queries and provide more relevant, context-aware responses.

---

## **Expanded RAG Architecture with Image Retrieval**

### **1. Embedding Creation**
- **Text Queries:** Use a **text encoder** (e.g., BERT, GPT) to generate embeddings for user queries.
- **Image Queries:** Use an **image encoder** (e.g., CLIP, CNNs, Vision Transformers) to generate embeddings for image queries.
- **Shared Embedding Space:** For multimodal systems, encode text and images into a shared embedding space.

### **2. Vector Indexing**
- Store embeddings for images and associated metadata in a **vector search engine** such as FAISS, Pinecone, or Google Vertex AI Matching Engine.
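
To illustrate what a vector index stores, here is a brute-force, in-memory stand-in written with NumPy. It is not the API of FAISS, Pinecone, or any other engine; the class name, file names, and captions are hypothetical.

```python
import numpy as np

class InMemoryVectorIndex:
    """Toy index: unit-normalized vectors plus one metadata dict per vector."""

    def __init__(self, dim):
        self.vectors = np.empty((0, dim))
        self.metadata = []

    def add(self, vector, meta):
        v = np.asarray(vector, dtype=float)
        self.vectors = np.vstack([self.vectors, v / np.linalg.norm(v)])
        self.metadata.append(meta)

    def search(self, query, k=3):
        q = np.asarray(query, dtype=float)
        scores = self.vectors @ (q / np.linalg.norm(q))   # cosine similarity
        top = np.argsort(-scores)[:k]
        return [(self.metadata[i], float(scores[i])) for i in top]

index = InMemoryVectorIndex(dim=3)
index.add([1.0, 0.0, 0.0], {"file": "cat.jpg", "caption": "a cat on a sofa"})
index.add([0.0, 1.0, 0.0], {"file": "dog.jpg", "caption": "a dog in a park"})

results = index.search([0.9, 0.2, 0.0], k=1)
print(results)
```

Dedicated engines replace the exhaustive scan with approximate nearest-neighbor structures so that search stays fast for millions of vectors.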

### **3. Query Execution**
- Convert the query (text or image) into embeddings and retrieve the top-k most relevant results.
- For multimodal queries, combine the embeddings of both text and image inputs.
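
One simple way to combine the two embeddings is a weighted average of the normalized vectors, sketched below. This is an assumption rather than a prescribed method, and it only makes sense when both embeddings live in the same shared space; `text_weight` and the toy vectors are illustrative.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def combine_query(text_emb, image_emb, text_weight=0.5):
    """Weighted average of unit-length text and image embeddings, re-normalized."""
    mixed = text_weight * normalize(text_emb) + (1.0 - text_weight) * normalize(image_emb)
    return normalize(mixed)

def top_k(query, db, k):
    """Indices of the k database rows most similar to the query (cosine)."""
    scores = (db / np.linalg.norm(db, axis=1, keepdims=True)) @ normalize(query)
    return np.argsort(-scores)[:k]

db = np.array([[1.0, 0.0],    # matches only the text part
               [0.0, 1.0],    # matches only the image part
               [1.0, 1.0]])   # matches both
query = combine_query(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
best = top_k(query, db, k=1)
print(best)
```

The item that agrees with both parts of the query ranks first, which is the intended effect of combining modalities.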

### **4. Contextual Augmentation**
- Retrieve relevant image captions, metadata, or associated textual information.
- Inject the retrieved context into the input for the generative model.
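
Context injection usually amounts to formatting the retrieved items into the prompt. The sketch below shows one possible layout; the prompt wording, field names (`caption`, `file`), and sample records are hypothetical.

```python
def build_augmented_prompt(question, retrieved, max_items=3):
    """Format retrieved image captions as numbered context ahead of the question."""
    lines = [
        f"[{n}] {item['caption']} (source: {item['file']})"
        for n, item in enumerate(retrieved[:max_items], start=1)
    ]
    context = "\n".join(lines)
    return (
        "Use the retrieved image descriptions to answer the question.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )

retrieved = [
    {"file": "cat.jpg", "caption": "a cat on a sofa"},
    {"file": "dog.jpg", "caption": "a dog in a park"},
]
prompt = build_augmented_prompt("What animal is on the sofa?", retrieved)
print(prompt)
```

The resulting string is what would be passed to the generative model in the answer generation step.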

### **5. Answer Generation**
- Use a generative model (e.g., GPT, T5) to process the query and retrieved context, producing the final response.

---

## **How Image Retrieval Enhances RAG**

1. **Rich Context:** Retrieved images provide additional context to text-based queries.
2. **Multimodal Queries:** Users can submit queries like "Find similar products to this shoe."
3. **Cross-Modal Retrieval:** Because text and images share an embedding space, a text modifier can steer an image query, as in "Show me something similar to this painting but in blue."

---

## **Technologies for Image Retrieval in RAG**

### **1. Image Encoders**
- **CLIP:** Maps images and text into a shared embedding space.
- **ResNet:** CNN-based architecture for image feature extraction.
- **ViT (Vision Transformers):** Processes images as sequences of patches and tends to be most accurate when pretrained on large datasets.

### **2. Vector Search Engines**
- **FAISS:** Open-source library for nearest neighbor search.
- **Pinecone:** Managed vector search service with real-time updates and metadata filtering.
- **Google Vertex AI Matching Engine:** A cloud-based vector search system.
- **Weaviate:** Open-source vector search platform with native multimodal capabilities.
- **Oracle Autonomous Database:** Oracle's managed cloud database with vector search support.

### **3. Multimodal Generative Models**
- **BLIP:** Combines image understanding with text generation for image-based queries.
- **OpenAI GPT Models:** Extended to process image captions or descriptions.
- **LLaMA or T5:** Fine-tuned for text generation in multimodal RAG systems.

---

## Multimodal Visual Question Answering (VQA) <a class="anchor" id="third-bullet"></a>

### Overview
VQA models answer natural language questions about images by reasoning over both modalities.

#### Key Concepts
- **Image Understanding:** Extract visual features using CNNs or ViTs.
- **Text Understanding:** Encode the question using transformers.
- **Fusion Mechanism:** Combine image and text embeddings to predict the answer.

### Mathematical Formulation
Let:
- $ Q $: Textual representation of the question.
- $ V $: Visual representation of the image.

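The model outputs a probability distribution over candidate answers, and the predicted answer $ A $ is the most probable one:
$$
A = \arg\max \, \text{softmax}(W \cdot \text{concat}(f_Q(Q), f_V(V)))
$$
where $ f_Q $ and $ f_V $ are embedding functions for text and images, and $ W $ is a learned classification matrix.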
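
The fusion formula can be traced step by step in NumPy. Everything here is a placeholder: the answer vocabulary is invented, the embeddings are random stand-ins for $f_Q(Q)$ and $f_V(V)$, and `W` is untrained, so the predicted answer carries no meaning.

```python
import numpy as np

rng = np.random.default_rng(0)
answers = ["yes", "no", "orange"]          # hypothetical answer vocabulary
f_q = rng.normal(size=4)                   # stand-in for the question embedding f_Q(Q)
f_v = rng.normal(size=6)                   # stand-in for the image embedding f_V(V)
W = rng.normal(size=(len(answers), 10))    # untrained classifier weights

def answer_distribution(f_q, f_v, W):
    """softmax(W . concat(f_Q(Q), f_V(V))) as a probability vector over answers."""
    fused = np.concatenate([f_q, f_v])
    logits = W @ fused
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

probs = answer_distribution(f_q, f_v, W)
predicted = answers[int(np.argmax(probs))]
print(probs, predicted)
```

Treating VQA as classification over a fixed answer set keeps the model simple, at the cost of not being able to produce answers outside that set.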

### History of VQA
- 2015: Early VQA models used LSTMs for questions and CNNs for images.
- 2019-2021: Models like **LXMERT** (2019) and **ViLT** (2021) used transformer-based cross-modal attention for better fusion.

### LXMERT Architecture
- **Image Encoder:** Bottom-up attention mechanism for extracting region-based features.
- **Text Encoder:** Transformer architecture for processing questions.
- **Cross-Modal Encoder:** Layers of attention to combine image and text representations.

- LXMERT Demo Setup and Example: https://github.com/huggingface/transformers/blob/main/examples/research_projects/lxmert/demo.ipynb

### Steps for VQA
1. Extract visual features from the image.
2. Encode the question into a textual embedding.
3. Combine embeddings using attention mechanisms.
4. Classify into possible answers.
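
Step 3 can be illustrated with a single cross-attention operation in which each question token attends over image regions. This is a bare sketch that omits the learned query, key, and value projections used in real models such as LXMERT; the shapes (5 tokens, 36 regions, dimension 8) are arbitrary.

```python
import numpy as np

def cross_attention(question_tokens, image_regions):
    """Scaled dot-product attention from question tokens (queries) to image regions."""
    d = question_tokens.shape[1]
    scores = question_tokens @ image_regions.T / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # each row sums to 1
    attended = weights @ image_regions                      # visual summary per token
    return attended, weights

rng = np.random.default_rng(0)
question_tokens = rng.normal(size=(5, 8))    # stand-in question token embeddings
image_regions = rng.normal(size=(36, 8))     # stand-in region features

attended, weights = cross_attention(question_tokens, image_regions)
print(attended.shape, weights.shape)
```

Each row of `weights` shows which image regions a given question token focuses on, and the attended vectors feed the answer classifier in step 4.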


In [None]:
import torch
from transformers import BlipProcessor, BlipForQuestionAnswering
from PIL import Image

def perform_blip_qa(image_path, question):
    processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
    model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

    # Load and preprocess the image
    image = Image.open(image_path).convert("RGB")

    inputs = processor(image, question, return_tensors="pt")
    
    # Perform inference using generate()
    with torch.no_grad():
        generated_ids = model.generate(**inputs)
    
    # Decode the generated answer
    answer = processor.decode(generated_ids[0], skip_special_tokens=True)
    return answer

# Test the BLIP QA pipeline
if __name__ == "__main__":
    image_path = "cat.jpg"
    question = "What color is the cat?"

    # Perform BLIP QA
    answer = perform_blip_qa(image_path, question)
    print(f"Question: {question}")
    print(f"Answer: {answer}")


Question: What color is the cat?
Answer: orange and white


## Applications of Multimodal Models
1. **E-commerce:** Search for products using text queries.
2. **Education:** Interactive learning tools for children.
3. **Healthcare:** Analyze and describe medical images.

## TL;DR <a class="anchor" id="last-bullet"></a>

- CLIP: https://arxiv.org/pdf/2103.00020v1
- OmniVision, a multimodal model for edge devices: https://levelup.gitconnected.com/omnivision-968m-the-worlds-most-compact-and-smallest-multimodal-vision-language-model-for-edge-ai-4ccd66082bfb
- SimCLR: https://arxiv.org/abs/2002.05709
- MoCo: https://arxiv.org/abs/1911.05722
- LXMERT QA Demo Setup and Example: https://github.com/huggingface/transformers/blob/main/examples/research_projects/lxmert/demo.ipynb