# **BYOL: Bootstrap Your Own Latent**

### Pre-requisites

* Familiarity with contrastive learning frameworks (SimCLR, MoCo)
* Understanding of representation learning and CNNs
* Knowledge of momentum encoders and projection heads

Most contrastive learning methods (like SimCLR and MoCo) rely on **negative samples** to avoid collapsing representations. However, **BYOL (Bootstrap Your Own Latent)** proposes a novel approach: **learning good representations without negative samples at all**.

> **Analogy**: Imagine two friends learning dance by copying each other’s movements. One leads (online network), and the other follows with stable moves (target network). Over time, the leading friend learns to mimic better, even without a critic.

The analogy captures the core difference: BYOL learns representations through self-supervision in which one network provides the targets for another, so it never has to compare "positive" and "negative" examples explicitly.


## **The Collapsing Problem**
In self-supervised learning, a model that outputs the same representation for every input can trivially minimize a loss that only pulls positive pairs together (e.g., by making all embeddings zero), without learning anything useful. This is known as **representation collapse**.

If the model always outputs the same feature vector regardless of the input image, it's not actually learning to differentiate anything, rendering the learned representations useless for downstream tasks. This is the central challenge BYOL aims to solve without negative samples.
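The collapse described above can be illustrated with a small, dependency-free sketch. The `collapsed_encoder`, the toy "images", and the brightness-shift augmentation are hypothetical stand-ins chosen for illustration, not part of BYOL itself.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def positive_pair_loss(a, b):
    """Squared distance between two L2-normalized embeddings."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum((x - y) ** 2 for x, y in zip(a, b))

def collapsed_encoder(image):
    # Hypothetical degenerate encoder: ignores its input entirely.
    return [1.0, 0.0, 0.0]

images = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
# Stand-in "augmentation": a small brightness shift of each image.
pairs = [(img, [x + 0.05 for x in img]) for img in images]

losses = [positive_pair_loss(collapsed_encoder(a), collapsed_encoder(b))
          for a, b in pairs]
distinct_embeddings = {tuple(collapsed_encoder(img)) for img in images}

print(losses)                    # every positive-pair loss is zero
print(len(distinct_embeddings))  # yet all images share a single embedding
```

The loss is perfectly minimized even though the encoder cannot tell any two images apart, which is why a positives-only objective needs some additional mechanism to rule this solution out.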

To avoid collapse, most contrastive methods rely on:

**Negative samples:**

* Explicitly push apart representations of different instances.
* This is like saying, "If these two images are different, their representations in the latent space must be far apart."

**Large batch sizes:**

* Needed to get enough diverse negative samples in a mini-batch.
* The larger the batch, the more distinct negative examples there are for comparison, which helps prevent collapse but demands significant computational resources.
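To make the role of negatives concrete, here is a minimal sketch of an InfoNCE-style loss for a single anchor. The function name `info_nce`, the temperature value, and the toy 2-D vectors are assumptions for illustration; real implementations (such as SimCLR's NT-Xent) operate on whole batches.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor; inputs are assumed unit-normalized."""
    pos = math.exp(dot(anchor, positive) / temperature)
    neg = sum(math.exp(dot(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

anchor = [1.0, 0.0]
positive = [1.0, 0.0]
far_negatives = [[-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
collapsed_negatives = [[1.0, 0.0]] * 3  # every embedding is identical

loss_separated = info_nce(anchor, positive, far_negatives)
loss_collapsed = info_nce(anchor, positive, collapsed_negatives)

print(loss_separated < loss_collapsed)  # collapse is penalized by the negatives
```

When every embedding is identical, the loss equals $\log(1 + K)$ for $K$ negatives, so the collapsed solution is no longer a minimum. More negatives make this penalty sharper, which is one reason these methods favor large batches.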



## **Core Idea**

BYOL trains a network to **predict the representation of one augmented view** of an image from **another augmented view**, using two networks:

* An **online network** that learns
* A **target network** that provides stable targets (not directly trained)

> No need for negative pairs. Instead, BYOL prevents representation collapse through a **momentum-updated target network** and a **prediction head**.


## **Architecture Overview**

<center>
  <img src="https://www.researchgate.net/publication/355737346/figure/fig1/AS:1084211217870854@1635507506342/BYOLs-architecture-BYOL-minimizes-a-similarity-loss-between-qthzth-and-sgzx-where-th-and.jpg" width=70%>
</center>

### **1. Data Augmentation**

Given an image, generate two random augmented views:

* $v$: first augmented view
* $v'$: second augmented view

Typical augmentations include:

* Random cropping and resizing
* Color jittering
* Horizontal flipping
* Gaussian blur

These transformations help learn invariance by forcing the model to represent the same content under varying conditions.
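As a rough sketch of how the two views might be produced, the snippet below applies a random crop and a random horizontal flip to a tiny image stored as a nested list. It is deliberately simplified: resizing, color jittering, and blur are omitted, and the helper names (`random_crop`, `two_views`) are hypothetical.

```python
import random

def random_crop(image, size, rng):
    """Cut a size x size window at a random position."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]

def horizontal_flip(image):
    return [row[::-1] for row in image]

def augment(image, size, rng):
    view = random_crop(image, size, rng)
    if rng.random() < 0.5:  # flip half of the time
        view = horizontal_flip(view)
    return view

def two_views(image, size=3, seed=None):
    """Return two independently augmented views (v, v') of the same image."""
    rng = random.Random(seed)
    return augment(image, size, rng), augment(image, size, rng)

# Toy 4x4 "image" whose pixel values are just their flat indices.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
v, v_prime = two_views(image, size=3, seed=0)
print(v)
print(v_prime)
```

Because each view draws its own random parameters, the two outputs generally differ while still showing the same underlying content.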

### **2. Online and Target Networks**

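The two views are processed by two networks that share the same encoder and projection head architecture but play different roles. Only the online network has a predictor, and only the online network is trained by gradient descent: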

#### **2.1 Online Network** $(f_\theta, g_\theta, q_\theta)$

* **Encoder** $f_{\theta}$: Backbone network like ResNet.
* **Projection Head** $g_{\theta}$: MLP that maps encoder output to latent representation space.
* **Predictor** $q_{\theta}$: MLP that transforms the projection into a prediction vector.

#### **2.2 Target Network** $(f_\xi, g_\xi)$

* Has the same encoder and projection head architecture as the online network, **but no predictor**.
* Not updated via backpropagation: no gradients flow through it.
* Updated as an exponential moving average of the online parameters, often called a **momentum** update:

$$
\xi \leftarrow m \cdot \xi + (1 - m) \cdot \theta
$$

Where $m$ is a momentum coefficient (e.g., 0.996). This slow update ensures smoother and more stable targets across training iterations.
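The update rule above can be sketched in a few lines. Parameters are represented as flat lists of floats for simplicity; a real implementation would loop over the tensors of both networks. The `momentum_schedule` helper is an optional extra based on the cosine schedule described in the BYOL paper, which raises $m$ from its base value toward 1 during training; the function names are illustrative.

```python
import math

def ema_update(target_params, online_params, m=0.996):
    """Return xi <- m * xi + (1 - m) * theta, element-wise."""
    return [m * t + (1.0 - m) * o for t, o in zip(target_params, online_params)]

def momentum_schedule(step, total_steps, base_m=0.996):
    """Cosine schedule that raises m from base_m to 1.0 over training."""
    return 1.0 - (1.0 - base_m) * (math.cos(math.pi * step / total_steps) + 1.0) / 2.0

online = [1.0, -2.0]  # toy online parameters, held fixed for illustration
target = [0.0, 0.0]   # toy target parameters

one_step = ema_update(target, online)
for _ in range(1000):
    target = ema_update(target, online)

print(one_step)  # the target barely moves after a single update
print(target)    # after many updates it has drifted close to the online values
```

With $m = 0.996$, a single update moves the target only 0.4% of the way toward the online parameters, which is what makes the targets change slowly from one iteration to the next.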


### **3. Forward Pass Flow**

1. **Online branch** processes view $v$:

$$
y = q_{\theta}(g_{\theta}(f_{\theta}(v)))
$$

2. **Target branch** processes view $v'$:

$$
y' = g_{\xi}(f_{\xi}(v'))
$$

> Both $y$ and $y'$ are **L2-normalized** before loss computation.
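The two branches can be sketched with toy linear layers standing in for $f$, $g$, and $q$. Everything here is an assumption made for illustration: real encoders are deep nonlinear networks such as ResNets, the layer sizes are arbitrary, and initializing the target as a copy of the online weights is a common implementation choice rather than something stated above.

```python
import math
import random

def linear(weights, x):
    """Matrix-vector product: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def l2_normalize(vec, eps=1e-12):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]

rng = random.Random(0)
online = {
    "f": random_matrix(8, 16, rng),  # encoder
    "g": random_matrix(4, 8, rng),   # projection head
    "q": random_matrix(4, 4, rng),   # predictor (online only)
}
# Target: copies of the encoder and projection head, and no predictor.
target = {name: [row[:] for row in online[name]] for name in ("f", "g")}

def online_forward(net, view):
    return linear(net["q"], linear(net["g"], linear(net["f"], view)))

def target_forward(net, view):
    return linear(net["g"], linear(net["f"], view))

v = [rng.random() for _ in range(16)]            # first view
v_prime = [x + rng.gauss(0.0, 0.05) for x in v]  # second view (noisy copy)

y = l2_normalize(online_forward(online, v))
y_prime = l2_normalize(target_forward(target, v_prime))
print(len(y), len(y_prime))  # both live in the same 4-D projection space
```

The predictor maps the projection space to itself, so the online prediction and the target projection have matching dimensions and can be compared directly.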


### **4. Loss Function**

BYOL uses **Mean Squared Error (MSE)** between the L2-normalized online prediction $y$ and the L2-normalized target projection $y'$:

$$
\mathcal{L} = \left\| \text{Normalize}(y) - \text{Normalize}(y') \right\|_2^2
$$

The total loss is symmetrized: the roles of the views are swapped, so that $v'$ is fed to the online network and $v$ to the target network, and the two loss terms are summed.

This helps the online network gradually align its predictions to the stable representations of the target network.
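A minimal sketch of the loss follows. For unit vectors, the squared distance equals $2 - 2\cos(y, y')$, so minimizing it is the same as maximizing cosine similarity; the code checks this identity numerically. The `symmetric_loss` helper takes the two branches as plain callables, which is an illustrative simplification.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def byol_loss(prediction, target):
    """Squared L2 distance between the normalized prediction and target."""
    p, t = l2_normalize(prediction), l2_normalize(target)
    return sum((x - y) ** 2 for x, y in zip(p, t))

def symmetric_loss(online_fn, target_fn, v, v_prime):
    """Sum of both directions: predict v' from v, and v from v'."""
    return (byol_loss(online_fn(v), target_fn(v_prime))
            + byol_loss(online_fn(v_prime), target_fn(v)))

y, y_prime = [3.0, 4.0], [4.0, 3.0]
loss = byol_loss(y, y_prime)
print(loss, 2 - 2 * cosine(y, y_prime))  # the two values agree

def identity(x):
    return x

# Orthogonal views with identity "networks" as stand-ins for both branches.
total = symmetric_loss(identity, identity, [1.0, 0.0], [0.0, 1.0])
print(total)
```

The loss ranges from 0 (identical directions) to 4 (opposite directions) per term. In a real training loop the target outputs are treated as constants, so gradients reach only the online parameters.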





## **Why Doesn’t BYOL Collapse?**

Despite lacking negative samples, BYOL avoids collapse in practice. This is commonly attributed to two design choices:

* The **momentum-updated target network**, which changes slowly and therefore provides stable regression targets from one iteration to the next
* The **predictor network**, which makes the two branches asymmetric and encourages non-trivial solutions

Together, these implicitly regularize training and steer it away from trivial constant outputs.

> Empirical results show that BYOL performs competitively with, or better than, SimCLR and MoCo, even without using contrastive negatives.




## **Downstream Applications**

After training, only the **encoder from the online network** is retained. It can be fine-tuned or used as a feature extractor for:

* **Image classification** (linear probe or fine-tuning)
* **Object detection**
* **Semantic segmentation**
* **Medical imaging** or domain-specific transfer tasks
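As an illustration of the linear-probe setting, the sketch below keeps an encoder frozen and trains only a logistic-regression layer on its features. The `frozen_encoder` is a hypothetical two-line stand-in for a pretrained BYOL encoder, and the data is a toy, linearly separable set; the point is only to show which parameters are updated.

```python
import math

def frozen_encoder(x):
    # Hypothetical stand-in for the pretrained online encoder; never updated.
    return [x[0] + x[1], x[0] - x[1]]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_probe(features, labels, lr=0.1, epochs=50):
    """Fit logistic regression on fixed features with plain SGD."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for feat, label in zip(features, labels):
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, feat)) + b)
            err = label - p
            w = [wi + lr * err * fi for wi, fi in zip(w, feat)]
            b += lr * err
    return w, b

def predict(w, b, feat):
    return 1 if sum(wi * fi for wi, fi in zip(w, feat)) + b > 0 else 0

inputs = [[2, 2], [3, 2], [2, 3], [-2, -2], [-3, -2], [-2, -3]]
labels = [1, 1, 1, 0, 0, 0]

# Features are computed once; only the probe's w and b are trained.
features = [frozen_encoder(x) for x in inputs]
w, b = train_linear_probe(features, labels)
accuracy = sum(predict(w, b, f) == y for f, y in zip(features, labels)) / len(labels)
print(accuracy)
```

Fine-tuning differs only in that the encoder's weights would also receive gradient updates, usually with a smaller learning rate.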



## **Advantages of BYOL**

| Feature                | BYOL                                 |
| ---------------------- | ------------------------------------ |
| Negative Samples       | Not required                         |
| Network Design         | Online + Momentum Target             |
| Risk of Collapse       | Mitigated via momentum and predictor |
| Representation Quality | High (often rivals SimCLR/MoCo)      |



## **Challenges and Considerations**

* **Momentum tuning** is critical for stable learning
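* **Asymmetry via predictor** is essential: removing it leads to collapse
* **Training cost**: each step needs an extra forward pass through the target network, so training may take longer, but BYOL tolerates smaller batch sizes than methods that depend on in-batch negatives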



## **Conclusion**

**BYOL** challenges the assumption that negative samples are essential for self-supervised learning. With its **momentum-updated target encoder** and **predictor MLP**, BYOL learns rich and transferable representations **purely from positive pairs**.

Its architecture has inspired many follow-up models, including:

* **SimSiam**: Removes momentum encoder but relies on stop-gradient
* **DINO**: Applies a momentum-teacher self-distillation scheme to vision transformers
* **VICReg**: Avoids collapse with explicit variance, invariance, and covariance regularization terms

BYOL remains a cornerstone in the evolution of contrastive and non-contrastive self-supervised learning.



## **References**

* Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., ... & Valko, M. (2020).
  [Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning](https://arxiv.org/abs/2006.07733). *NeurIPS 2020*.

* Sik-Ho Tsang. (Feb 13, 2022).
  [Review — BYOL: Bootstrap Your Own Latent A New Approach to Self-Supervised Learning](https://sh-tsang.medium.com/review-byol-bootstrap-your-own-latent-a-new-approach-to-self-supervised-learning-6f770a624441).

## **Code Tutorial Reference**

* **Odom, F.** (2020, November 5). [Easy Self-Supervised Learning with BYOL](https://medium.com/the-dl/easy-self-supervised-learning-with-byol-53b8ad8185d). *The DL – Medium*.

