<h1><strong>Image-based Joint-Embedding Predictive Architecture</strong></h1>

<h3><strong>Objectives</strong></h3>
<ol start="1">
  <li>To understand self-supervised learning (SSL).</li>
  <li>To understand common approaches to self-supervised learning.</li>
  <li>To understand the I-JEPA architecture and its non-generative approach to self-supervised learning.</li>
  <li>To learn how context and target blocks are used to predict representations in images.</li>
</ol>

<h3><strong>Introduction to Self Supervised Learning (SSL)</strong></h3>

<p style="line-height:1.5;"><strong><a href="https://arxiv.org/html/2301.05712v4#S2">Self-supervised learning (SSL)</a></strong> is a subset of unsupervised learning that enables machine learning models to learn meaningful representations from unlabeled data by generating supervisory signals, or pseudo-labels, directly from the data itself.</p>

<p>The primary goal of SSL is to learn <strong><a href="https://arxiv.org/pdf/1206.5538#:~:text=This%20paper%20is%20about%20representation,objectives%20for%20learning%20good%20representations?" target="_blank">generalizable representations</a></strong> that capture essential patterns in the data, which can then be transferred to downstream tasks such as classification, object detection, or segmentation with minimal labeled data.</p>


<p style="line-height:1.5;" >Unlike supervised learning, which relies on human-annotated labels, SSL designs <strong><a href="https://arxiv.org/html/2301.05712v4#S2.SS2.1.1" target="_blank">pretext tasks</a></strong>  that exploit the inherent structure, relationships, or attributes of the data to create these pseudo-labels.</p>
<p>For example, in the rotation prediction task, an image is rotated by a known angle (e.g., 0°, 90°, 180°, or 270°), and the model is trained to predict the angle, with the angle serving as the pseudo-label. Other pretext tasks include predicting the correct order of shuffled image patches (jigsaw puzzle task), reconstructing missing parts of an image (inpainting task), or predicting the colors of a grayscale image (colorization task).</p>
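
<p>To make the idea of a pseudo-label concrete, here is a minimal sketch of how rotation-prediction training pairs could be generated. The tiny 2×2 "image" and the helper names <code>rotate90</code> and <code>make_rotation_examples</code> are illustrative assumptions, not part of any particular library.</p>

```python
def rotate90(image):
    """Rotate a 2D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def make_rotation_examples(image):
    """Return (rotated_image, pseudo_label) pairs for 0, 90, 180 and 270 degrees."""
    examples = []
    current = [list(row) for row in image]
    for label in range(4):  # label k means "rotated by k * 90 degrees"
        examples.append((current, label))
        current = rotate90(current)
    return examples


# Hypothetical 2x2 image; a real pipeline would use full image tensors.
image = [[1, 2],
         [3, 4]]
examples = make_rotation_examples(image)
for rotated, label in examples:
    print(f"{label * 90:3d} degrees -> {rotated}")
```

<p>No human annotation is involved: the label is known simply because the rotation was applied by the data pipeline itself.</p>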



<h3>Most used architectures of SSL</h3>

<strong>Invariance-based / Joint-Embedding Architecture:</strong>
During training, the model is shown similar views of the same image (such as the rotated cat in the figure), and the encoder is optimized to yield similar embeddings for both views because they share the same semantic content. Embeddings of incompatible images are trained to be dissimilar. The views are usually created with hand-crafted data augmentations such as geometric transformations and color changes.

<strong>Advantage:</strong>
<ul>
  <li>This approach has been shown to produce representations of a high semantic level.</li>
</ul>

<strong>Disadvantages:</strong>
<ul>
  <li>The augmentations introduce image-specific biases, so the approach struggles to generalize when trained on other kinds of data.</li>
  <li>It relies heavily on prior knowledge to design the augmentations.</li>
</ul>

<div style="display: flex; gap: 10px;">
  <img src="https://i.postimg.cc/L8qpZRZQ/invariance1.png" alt="invariance architecture example" width="400">
  <img src="https://i.postimg.cc/bvkf3TgQ/Invariance.png" alt="invariance architecture example" width="400">
</div>



<strong>Generative Architecture:</strong><br>
During training, random parts of the input image are masked or corrupted. An encoder embeds the corrupted image, and a second model (a decoder) uses those embeddings to reconstruct the original image.

<strong>Advantages:</strong>
<ul>
  <li>The approach generalizes to other types of data. For example, language models are trained to predict masked words or the next word.</li>
  <li>Less prior knowledge is needed, since there is no requirement to construct pairs of similar images in advance.</li>
</ul>

<strong>Disadvantages:</strong>
<ul>
  <li>The learned representations are at a lower semantic level.</li>
  <li>Off-the-shelf performance (for example, with a linear probe) tends to lag behind invariance-based methods.</li>
</ul>
<div>
<img src ="https://i.postimg.cc/Pf4hgSjy/generative.png" alt ="generative architecture example" width="400">
<img src ="https://i.postimg.cc/5N21b89V/generative12.png" alt ="generative architecture example" width="400">
</div>

## Representation Collapse Problem in SSL

### What is Representation Collapse?

In self-supervised learning, particularly in joint-embedding architectures like contrastive methods (e.g., **[SimCLR](https://sh-tsang.medium.com/review-simclr-a-simple-framework-for-contrastive-learning-of-visual-representations-5de42ba0bc66)**, **[MoCo](https://medium.com/@nour_badr/easily-explained-momentum-contrast-for-unsupervised-visual-representation-learning-moco-c6f00a95c4b2)**), the **representation collapse** problem occurs when the model learns trivial or degenerate representations. This means that the encoder maps all input data, regardless of their semantic differences, to the same or very similar points in the embedding space. As a result, the learned representations fail to capture meaningful distinctions between different inputs, rendering them useless for downstream tasks like classification or object detection.

Representation collapse typically manifests in two forms:

- **Complete Collapse:** All inputs are mapped to a single point or a very small region in the embedding space, effectively losing all discriminatory power.
- **Dimensional Collapse:** The embeddings occupy only a low-dimensional subspace of the full embedding space, reducing the richness and expressiveness of the representations.

The risk is inherent to joint-embedding objectives that pull representations of positive pairs (augmented views of the same image) closer together: mapping every input to the same vector satisfies that part of the objective perfectly. Contrastive losses (e.g., **InfoNCE**) counteract this by also pushing negative pairs (different images) apart, but without enough diverse negatives, proper regularization, or careful design, the model can still drift toward degenerate solutions in which representations are nearly identical or confined to a few dimensions, and so fail to capture useful features.
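
To make the role of negative pairs concrete, the following minimal sketch evaluates an InfoNCE-style loss on two hand-made sets of embeddings: well-separated ones and fully collapsed ones. The function names, the temperature of 0.1, and the toy vectors are assumptions chosen for illustration; real implementations operate on batched tensors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(views_a, views_b, temperature=0.1):
    """Mean InfoNCE loss; views_a[i] and views_b[i] form the positive pair."""
    losses = []
    for i, anchor in enumerate(views_a):
        logits = [cosine(anchor, other) / temperature for other in views_b]
        log_denominator = math.log(sum(math.exp(value) for value in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Distinct embeddings: positives are aligned, negatives are orthogonal.
distinct = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Collapsed embeddings: every input maps to the same vector.
collapsed = [[1.0, 1.0, 1.0]] * 3

loss_distinct = info_nce(distinct, distinct)
loss_collapsed = info_nce(collapsed, collapsed)
print(f"distinct: {loss_distinct:.5f}  collapsed: {loss_collapsed:.5f}")
```

With three samples, the collapsed embeddings score a loss of log 3, far above the loss of the well-separated embeddings. The penalty comes entirely from the negatives, which is why too few or uninformative negatives weaken the protection against collapse.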

---

### Causes of Representation Collapse

Several factors contribute to representation collapse in SSL:

- **Lack of Sufficient Negative Samples:** Contrastive methods rely heavily on negative samples to prevent the model from mapping all inputs to the same point. If the number of negative samples is insufficient (e.g., due to small batch sizes) or if negative samples are not diverse enough, the model may collapse representations to minimize the loss.

- **Improper Loss Function Design:** Loss functions like **InfoNCE** require careful tuning of hyperparameters (e.g., temperature) to balance the attraction of positive pairs and repulsion of negative pairs. Poorly tuned parameters can lead to trivial solutions.

- **Overly Strong Augmentations:** Heavy data augmentations can make positive pairs too dissimilar, causing the model to struggle to align them, potentially leading to collapse.

- **Lack of Regularization:** Without explicit mechanisms to encourage diversity in the embedding space (e.g., normalization, projection heads, or stop-gradient techniques), the model may converge to a degenerate solution.

---

### Impact of Representation Collapse

When representation collapse occurs, the model’s embeddings lose their ability to distinguish between different objects, scenes, or classes in the data. This severely degrades performance on downstream tasks, as the representations lack the semantic richness needed for tasks like image classification, object detection, or semantic segmentation. For example, if all images of cats and dogs map to the same embedding, the model cannot differentiate between these classes in a downstream classifier.


<h3>What is a Joint-Embedding Predictive Architecture (JEPA)?</h3>
<p style="line-height:1.5;" >JEPA is a type of self-supervised learning architecture that learns meaningful representations by predicting the embeddings (representations) of one part of the input data from another part, using a predictor network. Unlike traditional generative methods that reconstruct raw input data (e.g., pixels) or invariance-based methods that enforce similarity between augmented views, JEPA focuses on predicting abstract representations in an embedding space. This makes it both efficient and capable of learning high-level, semantic features.</p>

# Why I-JEPA

## 1. Improving Semantic Representation
Semantic representations are critical for downstream tasks like image classification, object detection, and semantic segmentation, as they enable models to generalize to new tasks with minimal fine-tuning. I-JEPA achieves this through its Joint-Embedding Predictive Architecture (JEPA) framework, specifically tailored for images.

### **How I-JEPA Improves Semantic Representation**

**i. Prediction in Representation Space:**

I-JEPA predicts the representations of target blocks $s_y(i) = \{s_{y_j}\}_{j \in B_i}$ from a context block’s representation $s_x = \{s_{x_j}\}_{j \in B_x}$, using a predictor $g_\phi$.
The loss function minimizes the squared $L_2$ distance in the embedding space:
$$\mathcal{L} = \frac{1}{M} \sum_{i=1}^M \sum_{j \in B_i} \|\hat{s}_{y_j} - s_{y_j}\|_2^2$$
where $\hat{s}_y(i) = \{\hat{s}_{y_j}\}_{j \in B_i}$ is the predicted representation of the $i$-th target block, and $M$ is the number of target blocks (typically 4).

Unlike pixel-space reconstruction (e.g., Masked Autoencoders, MAE), predicting in representation space discards low-level details, focusing on high-level features like object parts or structures (Table 7 of the I-JEPA paper reports 66.9% Top-1 accuracy with representation-space targets vs. 40.7% with pixel-space targets).
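
As a worked instance of the loss above, the sketch below evaluates it on a tiny hand-made example with $M = 2$ target blocks, two patches per block, and 2-dimensional embeddings. The function names and the numbers are illustrative assumptions only.

```python
def squared_l2(u, v):
    """Squared L2 distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def ijepa_loss(predicted_blocks, target_blocks):
    """Average over the M target blocks of the summed squared patch errors."""
    num_blocks = len(target_blocks)  # M in the formula
    total = 0.0
    for pred_block, tgt_block in zip(predicted_blocks, target_blocks):
        total += sum(squared_l2(p, t) for p, t in zip(pred_block, tgt_block))
    return total / num_blocks


# Each block is a list of patch embeddings (hypothetical values).
predicted = [
    [[1.0, 0.0], [0.0, 1.0]],  # block 1: second patch is off by 1 in one dimension
    [[0.5, 0.5], [1.0, 1.0]],  # block 2: second patch is off by 1 in one dimension
]
target = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.5, 0.5], [0.0, 1.0]],
]

loss = ijepa_loss(predicted, target)
print(loss)  # (1 + 1) / 2 = 1.0
```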


**ii. Multi-Block Masking Strategy:**

- The context block (scale 0.85–1.0) provides a large, informative view of the image, capturing global context.
- The target blocks (scale 0.15–0.2, $M=4$) are large enough to contain semantic content (e.g., a dog’s head, a car’s wheel), encouraging the model to learn meaningful features (Table 8: Top-1 54.2% for scale [0.15, 0.2] vs. 19.2% for smaller blocks).
- The non-overlap constraint ($B_x \cap B_i = \emptyset$) ensures the model infers missing regions, promoting semantic understanding (Table 6: multi-block masking outperforms random masking’s 17.6% Top-1).
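
The sampling procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions: a 14×14 patch grid, target aspect ratios drawn from 0.75 to 1.5, a square context block, and the hypothetical helper names `sample_block` and `sample_masks`. Only the scale ranges, $M = 4$, and the non-overlap constraint come from the text.

```python
import math
import random


def sample_block(grid, scale_range, aspect_ratio, rng):
    """Sample a rectangular block of patch indices covering a fraction of the grid."""
    scale = rng.uniform(*scale_range)
    area = scale * grid * grid
    height = max(1, min(grid, round(math.sqrt(area * aspect_ratio))))
    width = max(1, min(grid, round(math.sqrt(area / aspect_ratio))))
    top = rng.randint(0, grid - height)
    left = rng.randint(0, grid - width)
    return {r * grid + c
            for r in range(top, top + height)
            for c in range(left, left + width)}


def sample_masks(grid=14, num_targets=4, seed=0):
    """Return (context_indices, list_of_target_blocks) for one image."""
    rng = random.Random(seed)
    targets = [sample_block(grid, (0.15, 0.2), rng.uniform(0.75, 1.5), rng)
               for _ in range(num_targets)]
    context = sample_block(grid, (0.85, 1.0), 1.0, rng)
    for block in targets:  # enforce that the context shares no patch with any target
        context -= block
    return context, targets


context, targets = sample_masks()
print("context patches:", len(context))
print("target block sizes:", [len(block) for block in targets])
```

Target blocks are allowed to overlap each other in this sketch; only the context is kept disjoint from them, so the predictor never sees the patches it is asked to predict.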


**iii. Vision Transformers (ViTs):**

I-JEPA uses ViTs for the context and target encoders, leveraging multi-head self-attention to capture global and local relationships:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
This allows the model to focus on relevant patches, enhancing semantic feature extraction.
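
For readers who want to see the attention formula executed, here is a minimal single-head sketch in plain Python. It is an illustration of the equation only, with made-up inputs; a real ViT adds learned projections, multiple heads, and batched tensor operations.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    largest = max(row)
    exps = [math.exp(value - largest) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    output = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output


# One query attending over two keys/values (hypothetical numbers).
Q = [[0.0, 0.0]]              # a zero query scores every key equally
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
print(result)  # equal weights, so the output is the mean of the value rows
```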

## 2. Learning Without Prior Knowledge

**I-JEPA** overcomes a common limitation of traditional self-supervised learning (SSL) methods by learning effectively **without relying on prior knowledge** such as manual augmentations or labeled data.

---

### **How I-JEPA Avoids Prior Knowledge**

#### i. No Hand-Crafted Augmentations  
Many SSL methods, like SimCLR, DINO, and iBOT, depend heavily on image augmentations (such as cropping, flipping, and color jittering) to generate multiple views of the same image. These augmentations enforce invariance to certain transformations, but they can also introduce biases.  
For instance, invariances that help image classification can discard low-level information that tasks such as object counting or depth prediction depend on.  

In contrast, I-JEPA uses a **single image view** combined with its **multi-block masking strategy**. It predicts representations of masked target blocks directly from the visible context block without any augmentations (see Section 1).  
This design reduces inductive biases and improves generalization, making I-JEPA more versatile across diverse tasks (refer to Table 4, where I-JEPA outperforms DINO and iBOT on low-level vision tasks).

#### ii. No Labeled Data Needed  
As a self-supervised learning method, I-JEPA leverages the intrinsic structure within images to generate supervisory signals. Specifically, it predicts the representations of masked regions based on visible context, requiring **no labeled data**.  


## 3. Avoids Representation Collapse

I-JEPA’s design mitigates the **representation collapse** problem — a common issue in contrastive self-supervised learning methods where the model maps all inputs to similar or identical embeddings, losing discriminatory power. Unlike contrastive methods, I-JEPA employs a **non-contrastive, predictive approach** that inherently promotes diverse and meaningful representations.

---

### Non-Contrastive Objective

Contrastive methods like **SimCLR** and **MoCo** rely on negative samples to prevent collapse, which can be computationally expensive and sensitive to batch size or hyperparameter tuning. I-JEPA avoids this by using a **predictive objective**, minimizing the **L2 distance** between predicted and target representations of masked regions. This eliminates the need for negative samples, reducing the risk of collapse due to insufficient or poorly chosen negatives. The model is incentivized to learn distinct representations to accurately predict embeddings for different masked regions.

---

### Momentum-Updated Target Encoder

I-JEPA uses a **target encoder** updated via an **Exponential Moving Average (EMA)** of the context encoder’s parameters:

$$
\theta_{\text{target}} \leftarrow m \cdot \theta_{\text{target}} + (1 - m) \cdot \theta_{\text{context}}
$$

This **EMA-based update** ensures stable and consistent target representations and helps keep the model from converging to trivial solutions in which the encoders produce the same output regardless of the input. The slow-moving target encoder, which receives no gradients, acts as a **regularizing mechanism**, encouraging the predictor to learn meaningful mappings without collapsing.

---

By leveraging a **non-contrastive objective** and a **momentum-updated target encoder**, **I-JEPA** ensures robust, diverse, and semantically rich embeddings, avoiding the pitfalls of representation collapse seen in contrastive methods.





<h3>I-JEPA Architecture</h3>
<p>The I-JEPA architecture consists of three main components: the context encoder, the target encoder, and the predictor. These components work together to learn image representations by predicting the representations of masked image regions (the target blocks) from a visible region (the context block).</p>
<ol>
  <li><strong>Context Encoder</strong>
  <p>The context encoder is typically a Vision Transformer (ViT) or a similar backbone network. Its purpose is to process the visible part of the input image, the context block. It takes the unmasked image patches as input and encodes them into a set of context representations. This encoder learns to extract meaningful features and understand the content within the available, unmasked regions of the image.</p></li>
  <li><strong>Target Encoder</strong>
  <p>The target encoder has the same architecture as the context encoder, and its weights are a momentum-updated copy of the context encoder's weights. Unlike the context encoder, which sees only the visible patches, the target encoder processes the full image; the outputs at the positions of the target blocks are then used as the target representations for the masked regions. Using a momentum-updated encoder helps provide stable and consistent target signals during training.</p></li>
  <li><strong>Predictor</strong>
  <p>The predictor is a separate network, usually a shallow Transformer. It takes the context representations generated by the context encoder as input, together with positional mask tokens that indicate which target block to predict. Its task is to predict the target representations of the masked regions that were generated by the target encoder. The predictor is trained to bridge the gap between the representations of the visible and masked parts of the image, effectively learning to infer the content of the missing regions based on the available context.</p></li>
</ol>
<div style="display: flex; gap: 10px;">
  <img src="https://i.postimg.cc/6pDNrSrW/jepa.png" style="width:30%;">
  <img src="https://i.postimg.cc/Y97KVWDT/ijepa.png" style="width:30%;">
</div>




---



I-JEPA (**Image-based Joint Embedding Predictive Architecture**) is a **non-contrastive self-supervised learning** method.
It learns by **predicting hidden parts of an image** (masked regions) using **visible context**, *without* reconstructing raw pixels.
Instead, it predicts **representations** (embeddings) of the masked patches — which forces the network to learn **semantic structure**.

---

##  **Dataflow in Architecture**

### **Step A: Split the image**

Given an image $x$, divide it into $N$ non-overlapping patches, as in a standard Vision Transformer:

$$
x = \{ p_1, p_2, \ldots, p_N \}
$$
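
A minimal sketch of this split for the non-overlapping case is shown below. The 4×4 single-channel "image", the patch size of 2, and the helper name `patchify` are assumptions for illustration; real inputs are multi-channel tensors with larger patches.

```python
def patchify(image, patch_size):
    """Split an H x W grid into non-overlapping, flattened patches in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches


# Hypothetical 4x4 image whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, patch_size=2)
for index, patch in enumerate(patches, start=1):
    print(f"p_{index} = {patch}")
```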

---

### **Step B: Masking**

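Select the patches to **mask**. In I-JEPA these form a few contiguous target blocks rather than isolated patches, and the context is what remains visible once the target regions are removed.
Define:

* $V$ = visible (context) patch indices.
* $M$ = masked (target) patch indices. Here $M$ is a set of indices, not the number of target blocks used in the loss earlier.

So:

$$
V \cap M = \emptyset, \quad V \cup M \subseteq \{1, \ldots, N\}
$$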

---

### **Step C: Context encoder**

**Input:** Only visible patches $\{ p_i \}_{i \in V}$

**Operation:**
The context encoder $f_{\text{context}}$ maps the visible patches to a latent representation:

$$
z_{\text{context}} = f_{\text{context}}(\{ p_i \}_{i \in V})
$$

In practice, this is often a **Vision Transformer** (ViT) that:

1. Linearly embeds each patch.
2. Adds positional encodings.
3. Runs them through transformer blocks.
4. Outputs one embedding per visible patch (the I-JEPA encoders do not rely on a global `[cls]` token).

---

### **Step D: Target encoder**

**Input:** The *full image* (both visible + masked patches).
The target encoder $f_{\text{target}}$ is the same architecture as the context encoder, but its parameters are updated by **EMA**:

$$
\theta_{\text{target}} \leftarrow m \cdot \theta_{\text{target}} + (1 - m) \cdot \theta_{\text{context}}
$$

The target encoder processes all $N$ patches, and its outputs at the masked positions are kept as the **target representations**:

$$
z_{\text{target}} = \{ [f_{\text{target}}(\{ p_j \}_{j=1}^{N})]_i \}_{i \in M}
$$

---

### **Step E: Predictor**

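The predictor $g$ takes the context embedding $z_{\text{context}}$ and tries to **predict** the target representations $z_{\text{target}}$. In practice it is also given positional mask tokens that indicate which masked locations to predict; the formula omits them for brevity: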

$$
\hat{z}_{\text{target}} = g(z_{\text{context}})
$$

---

## **The Training Objective**

I-JEPA’s goal is to **minimize the L2 distance** between the predicted and target representations for the masked regions:

$$
L = \sum_{i \in M} \| \hat{z}_i - z_i \|_2^2
$$

where:

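* $z_i = [f_{\text{target}}(\{ p_j \}_{j=1}^{N})]_i$ is the target encoder's output at masked position $i$, computed from the full image.
* $\hat{z}_i = [g(f_{\text{context}}(\{ p_j \}_{j \in V}))]_i$ is the predictor's output for position $i$, computed from the visible patches only.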
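
To show how the patches flow through the three networks, here is a toy end-to-end sketch of Steps C to E and the loss. Everything in it is a stand-in chosen for brevity: the "encoders" are a single linear map followed by mean mixing instead of a ViT, the predictor is a linear map over pooled context plus a positional vector instead of a Transformer, and all sizes, names, and random weights are hypothetical.

```python
import random

rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def linear(weights, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def encode(weights, patches):
    """Toy encoder: embed each patch, then add the mean embedding so that
    every output depends on all patches the encoder was given."""
    embedded = [linear(weights, patch) for patch in patches]
    mean = [sum(column) / len(embedded) for column in zip(*embedded)]
    return [[e + m for e, m in zip(embedding, mean)] for embedding in embedded]


def predict(weights, context_embeddings, position):
    """Toy predictor: pooled context plus a positional vector for the target patch."""
    pooled = [sum(column) / len(context_embeddings) for column in zip(*context_embeddings)]
    return linear(weights, [p + q for p, q in zip(pooled, position)])


num_patches, dim_in, dim_emb = 6, 4, 3
patches = random_matrix(num_patches, dim_in)      # N flattened patches
visible, masked = [0, 1, 2, 3], [4, 5]            # index sets V and M

context_weights = random_matrix(dim_emb, dim_in)
target_weights = [row[:] for row in context_weights]  # target starts as a copy
predictor_weights = random_matrix(dim_emb, dim_emb)
positions = random_matrix(num_patches, dim_emb)       # one positional vector per patch

z_context = encode(context_weights, [patches[i] for i in visible])  # visible patches only
z_full = encode(target_weights, patches)                            # full image
z_target = {i: z_full[i] for i in masked}                           # keep masked positions
z_hat = {i: predict(predictor_weights, z_context, positions[i]) for i in masked}

loss = sum(sum((a - b) ** 2 for a, b in zip(z_hat[i], z_target[i])) for i in masked)
print("loss:", loss)
```

In actual training, gradients of this loss would update only the context encoder and the predictor, while the target encoder is refreshed through the EMA rule.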

---

## **Why does this work?**

* The **context encoder** *must* learn to encode useful information about visible regions.
* The **predictor** learns to infer missing parts’ semantic representations — not pixels.
* The **target encoder**, updated by EMA, provides stable targets that are not easily copied.
* There’s **no pixel-level reconstruction** → the model focuses on **meaningful abstract features**.
* Because it’s **non-contrastive**, there’s no need for negative pairs; collapse is instead discouraged by the asymmetric design (a predictor on one branch and an EMA target encoder that receives no gradients on the other).

---

## **Visual summary**

| Component           | Input             | Output                                  | Role                                     |
| ------------------- | ----------------- | --------------------------------------- | ---------------------------------------- |
| **Context Encoder** | Visible patches   | Context representation                  | Encodes what is seen                     |
| **Target Encoder**  | Full image        | True representations for masked patches | Provides ground truth in embedding space |
| **Predictor**       | Context embedding | Predicted masked representations        | Bridges gap from visible to missing      |

---


**So:**

$$
\hat{z}_{\text{target}} = g(f_{\text{context}}(\{ p_i \}_{i \in V})), \quad z_{\text{target}} = \{ [f_{\text{target}}(\{ p_j \}_{j=1}^{N})]_i \}_{i \in M}
$$

and

$$
\min_{\theta} \sum_{i \in M} \| \hat{z}_i - z_i \|_2^2
$$

---

<img src ="https://i.postimg.cc/x8knKbs7/overall1.png" alt ="I-JEPA overall architecture">

## Masking Strategy in I-JEPA

### Multi-Block Masking Strategy and Semantic Segmentation

**I-JEPA** employs a **multi-block masking strategy**.  
Instead of masking random individual patches scattered across the entire image, it masks out **multiple large blocks** of the image.  
These blocks can vary in **size** and **aspect ratio** and are strategically chosen to cover significant portions of the image.

---

This multi-block approach is crucial because it forces the model to predict the content of substantial, contiguous regions based on the surrounding visible context.  
In contrast to masking small, scattered patches (which might cause the model to over-rely on local pixel correlations), masking larger regions encourages the model to understand **higher-level structures** within the image.

---

While I-JEPA does not explicitly perform traditional **semantic segmentation** as a separate task, the choice of masking large, coherent blocks inherently relates to **capturing semantic information**.  
By masking out large regions, the model must predict the representations of potentially complex objects or parts of objects within those blocks.  
This forces the model to learn relationships between different parts of an image and to infer the **semantic content** of the masked areas using the visible context.

---

The design of this masking strategy — especially the use of large, potentially object-sized blocks — encourages the model to go **beyond low-level pixel prediction**.  
Instead, it must learn to predict the **semantic representations** of the masked regions, pushing it to develop a deeper understanding of the image’s **content and structure**.

This focus on high-level, meaningful features that are **invariant to small pixel variations** aligns with I-JEPA’s non-generative objective:  
learning robust, **semantic representations** instead of reconstructing raw pixel data.

---


## Training Objective

The primary training objective in **I-JEPA** is to minimize the distance between the **predicted representations** of the masked blocks (output of the predictor) and the **target representations** of those same masked blocks (output of the target encoder).

I-JEPA specifically uses the **squared L2 distance** (squared Euclidean distance) as the metric for this minimization; averaged over the embedding dimensions, this is the **Mean Squared Error (MSE)**. The loss function quantifies how different the predicted vector is from the target vector for each masked patch.

Minimizing this L2 distance encourages the predictor to generate representations that are as close as possible to the true representations provided by the target encoder for the masked regions.

---

## **Mathematical Definition**

The L2 distance measures the straight-line distance between two vectors. For vectors $y$ (true value) and $\hat{y}$ (predicted value) it is defined as follows; the I-JEPA loss uses its square, which drops the square root:


$$
L_2(y, \hat{y}) = \sqrt{ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 }
$$
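
The relationship between the L2 distance, its square, and the MSE can be checked on a small numeric example. The vectors and function names below are illustrative assumptions.

```python
import math


def squared_l2(y, y_hat):
    """Sum of squared differences (the quantity used in the I-JEPA loss)."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat))


def l2_distance(y, y_hat):
    """Euclidean distance: the square root of the squared L2 distance."""
    return math.sqrt(squared_l2(y, y_hat))


def mse(y, y_hat):
    """Mean squared error: the squared L2 distance divided by the dimension n."""
    return squared_l2(y, y_hat) / len(y)


y = [3.0, 0.0]       # hypothetical target embedding
y_hat = [0.0, 4.0]   # hypothetical predicted embedding

print(l2_distance(y, y_hat))  # 5.0
print(squared_l2(y, y_hat))   # 25.0
print(mse(y, y_hat))          # 12.5
```

All three quantities are minimized by the same prediction, so they lead to the same optimum; they differ only in scale and in how gradients are weighted.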

---

Minimizing the L2 distance in the representation space encourages the predictor to learn meaningful, semantically rich features for the masked regions based on the context from the visible parts.

Unlike pixel-wise reconstruction (which can be computationally expensive and prone to local artifacts), representation-space prediction focuses on capturing the **higher-level meaning** of the masked content.

---

## Target Encoder Update: Exponential Moving Average (EMA)

In I-JEPA, the **target encoder** is *not* updated directly via gradients. Instead, it is updated using an **Exponential Moving Average (EMA)** of the context encoder’s parameters.

At each training step, the update rule is:

$$
\theta_{\text{target}} \leftarrow m \cdot \theta_{\text{target}} + (1 - m) \cdot \theta_{\text{context}}
$$

**Where:**

- $\theta_{\text{target}}$ are the parameters of the target encoder.
- $\theta_{\text{context}}$ are the parameters of the context encoder.
- $m$ is the **momentum coefficient** (e.g., 0.99 or 0.996).

This makes the target encoder a **slow-moving average** of the context encoder, stabilizing the prediction targets and improving training dynamics.
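
A minimal sketch of the update rule on a flat list of parameters is shown below. The function name `ema_update`, the two-parameter "encoders", and the momentum of 0.9 (chosen so the change is visible after a few steps) are assumptions for illustration.

```python
def ema_update(target_params, context_params, momentum=0.996):
    """Return the new target parameters: m * target + (1 - m) * context."""
    return [momentum * t + (1.0 - momentum) * c
            for t, c in zip(target_params, context_params)]


# Hypothetical parameters; the context encoder is held fixed to isolate the EMA effect.
target = [0.0, 1.0]
context = [1.0, 1.0]

for step in range(1, 4):
    target = ema_update(target, context, momentum=0.9)
    print(step, [round(value, 3) for value in target])
```

The first parameter moves toward the context value only gradually (0.1, 0.19, 0.271), which is the slow-moving behavior described above. With a momentum closer to 1, such as 0.996, the drift is slower still.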

---

## How This Differs From Other Methods

- **Generative methods:** These aim to reconstruct the actual pixel values of the masked regions (e.g., autoencoders, Masked Autoencoders/MAE). They require learning fine low-level details and can be computationally heavy.

- **Contrastive methods:** These (e.g., SimCLR, MoCo) aim to maximize agreement between different augmented views of the same image instance while minimizing agreement with negatives. They rely on constructing **positive** and **negative** pairs.

In contrast, I-JEPA learns to directly predict **semantic representations**, not pixels, and avoids the need for negative samples.

---



## Comparison with Other Self-Supervised Learning Methods

**I-JEPA** distinguishes itself from other prominent self-supervised learning paradigms — namely **generative** and **contrastive** methods — through its **non-generative, non-contrastive prediction objective**.

---

### I-JEPA vs. Generative Methods

Generative self-supervised learning methods, such as **autoencoders** or **Masked Autoencoders (MAE)**, aim to reconstruct the original input data, often at the **pixel level**.  
Their primary goal is to predict or generate the missing or corrupted parts of the input based on the visible parts.  
This typically involves minimizing a reconstruction loss (e.g., MSE or cross-entropy) on the raw pixel values.

**Key points:**

- **Objective:**  
  Predict and reconstruct raw input data (e.g., pixels).

- **Computational Implications:**  
  Reconstructing high-dimensional data like images pixel-by-pixel is computationally intensive, especially during decoding.  
  This can cause models to over-focus on low-level details and lead to possible reconstruction artifacts.

- **I-JEPA’s Difference:**  
  Instead of reconstructing pixels, I-JEPA predicts **representations** of masked regions in a learned feature space.  
  This is computationally lighter (no pixel decoder) and shifts the learning focus toward higher-level, semantic features.

---

### I-JEPA vs. Contrastive Methods

Contrastive self-supervised learning methods — such as **SimCLR** or **MoCo** — learn representations by contrasting different views of the data.  
The core idea is to **pull together** representations of different augmented views of the same image (**positive pairs**) and **push apart** representations of different images (**negative pairs**).

**Key points:**

- **Objective:**  
  Learn representations by maximizing agreement between positive pairs and minimizing agreement between negative pairs.

- **Computational Implications:**  
  Contrastive methods often require many negative samples in each batch, which can be computationally demanding and require large memory or special techniques (like momentum encoders) to maintain a negative queue.  
  The reliance on negatives also makes these methods sensitive to hyperparameters and batch size.

- **I-JEPA’s Difference:**  
  I-JEPA is **non-contrastive** — it does **not rely on negative pairs**.  
  Its learning signal comes directly from minimizing the distance between the **predicted** and **target representations** of masked regions, simplifying training and removing the need for explicit negatives.

---

By avoiding pixel-wise generation and explicit contrasting of negatives, **I-JEPA** focuses on **predictive representation learning** — encouraging the model to infer meaningful semantic information purely through representation-space prediction.

---


## Summary



- **I-JEPA** is an image-based self-supervised learning method that uses a **non-generative approach**, focusing on learning **abstract representations** rather than reconstructing raw pixels.

- Its architecture includes:
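  - A **context encoder** that encodes the visible parts of the image (the context block).
  - A **target encoder**, a momentum-updated copy of the context encoder that provides target representations for the masked parts (the target blocks).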
  - A **predictor** that infers target representations from the context representations.

- I-JEPA employs a **multi-block masking strategy**, masking large, contiguous regions.  
  This encourages the model to learn **semantic relationships** and capture **higher-level features** beyond local pixel information.

- The training objective is to **minimize the L2 distance** (or MSE) between the predicted representations of the masked blocks and their corresponding target representations, operating entirely in a **learned feature space**.

- I-JEPA fundamentally differs from:
  - **Generative methods**, which reconstruct raw pixels.
  - **Contrastive methods**, which rely on negative pairs to learn representations.

---

This makes I-JEPA a unique approach for learning robust, semantically meaningful visual representations without pixel-level generation or contrastive negatives.


## Suggestion Points

* Explain the representation collapse problem briefly.
* There are two sections on "What is I-JEPA?" (remove one).
* Explain how image patches are processed in the target and context networks and how they connect to the predictor network.