# SimCLR: A Simple Framework for Contrastive Learning of Visual Representations

### Prerequisites

- Understanding of representation learning
- Familiarity with Computer Vision Basics
- Knowledge of ResNets or CNN architectures

In supervised learning, labeled datasets guide the model to learn meaningful representations, but labeling is often costly and time-consuming. **SimCLR (A Simple Framework for Contrastive Learning of Visual Representations)** aims to bypass this by learning powerful visual representations **without labels**, purely from **data augmentations**. The key idea is that two augmented views of the same image should have similar representations, while views from different images should differ.

> **Analogy**: Imagine you're identifying your friend in a crowd. Whether your friend wears a hat or glasses, or even if they change clothes, you still recognize them. These variations are like data augmentations. SimCLR learns a representation that is invariant to such changes.


## **Core Idea**

<center>

<img src="https://i.postimg.cc/3NPT6pPS/SimCLR.gif" width=50%>
</center>

SimCLR learns by **maximizing the similarity between augmented views of the same image** (positive pairs) while **minimizing similarity between different images** (negative pairs). This process creates a feature space where similar images cluster together, and dissimilar ones are far apart.

### **Process Overview:**

* For each image in the batch, generate **two different augmented views** (for example, one cropped and one color-distorted version).
* Pass all augmented views through a **shared encoder network** (for example, ResNet) to obtain their representations.
* Apply a **contrastive loss (NT-Xent)** to **pull positive pairs** (augmented views of the same image) closer together and **push negative pairs** (views of different images) apart.



## **Major Components of SimCLR**

<center>

<img src="https://i.postimg.cc/4xwSb8D2/simclr-general-architecture.png" width=75%>
</center>

### **1. Data Augmentation**

The data augmentation module generates two different augmented views of the same image. These views are meant to be semantically similar, so the model learns to map them close together in the representation space.

For a given image, two stochastic augmentations are applied to produce a pair of correlated views:

* $\hat{x}_i$: first augmented view
* $\hat{x}_j$: second augmented view

The following augmentations are applied sequentially:

* **Random cropping and resizing** to the original size, simulating different perspectives.
* **Random color distortion** (adjusting brightness, contrast, saturation, and hue) to mimic lighting changes.
* **Random Gaussian blur** to introduce slight blurriness, encouraging robustness.

> The combination of random cropping and color distortion has been shown to be particularly effective for learning useful representations.
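
As a rough illustration of the first two augmentations, the sketch below applies a random crop-and-resize followed by a brightness jitter to a tiny grayscale image stored as nested lists. Gaussian blur and the remaining color distortions are omitted. The function names (`random_resized_crop`, `brightness_jitter`, `two_views`) and the parameter values (`min_scale`, `strength`) are illustrative assumptions, not the paper's settings; a real pipeline would typically rely on an image library.

```python
import random


def random_resized_crop(img, rng, min_scale=0.5):
    """Crop a random window and resize it back to the original size (nearest neighbour)."""
    h, w = len(img), len(img[0])
    crop_h = rng.randint(max(1, int(h * min_scale)), h)
    crop_w = rng.randint(max(1, int(w * min_scale)), w)
    top = rng.randint(0, h - crop_h)
    left = rng.randint(0, w - crop_w)
    return [
        [img[top + (r * crop_h) // h][left + (c * crop_w) // w] for c in range(w)]
        for r in range(h)
    ]


def brightness_jitter(img, rng, strength=0.4):
    """Scale all pixels by one random factor and clamp the result to [0, 1]."""
    factor = 1.0 + rng.uniform(-strength, strength)
    return [[min(1.0, max(0.0, p * factor)) for p in row] for row in img]


def augment(img, rng):
    """Apply the augmentations sequentially: crop and resize, then color distortion."""
    return brightness_jitter(random_resized_crop(img, rng), rng)


def two_views(img, seed=0):
    """Draw two independent augmentations of the same image (a positive pair)."""
    rng = random.Random(seed)
    return augment(img, rng), augment(img, rng)


# Toy 8x8 grayscale gradient with pixel values in [0, 1].
image = [[(r * 8 + c) / 63 for c in range(8)] for r in range(8)]
view_i, view_j = two_views(image)
print(len(view_i), len(view_i[0]))  # both views keep the original 8x8 size
```

Both views come from the same source image, so they form one positive pair; every view of another image in the batch would act as a negative.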

### **2. Base Encoder**

A deep neural network encoder $f(\cdot)$ is used to extract high-level feature representations from the augmented images.

A commonly used encoder is **ResNet**, applied as follows:

$$
\mathbf{h}_i = f(\hat{x}_i) = \text{ResNet}(\hat{x}_i)
$$

Here:

* $\hat{x}_i$ is the augmented input image.
* $\mathbf{h}_i$ is the resulting **$d$-dimensional feature vector**, typically taken from the output of the average pooling layer of ResNet.
* $\mathbf{h}_i$ and $\mathbf{h}_j$ (from $\hat{x}_j$) are expected to be close in feature space for positive pairs.
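
To make the "output of the average pooling layer" concrete, here is a minimal sketch of global average pooling, the step that turns a stack of feature maps into a single feature vector. The convolutional layers that produce the feature maps are not shown, and the three 2x2 channels are toy values chosen for illustration (a ResNet-50 would produce 2048 channels).

```python
def global_average_pool(feature_maps):
    """Collapse each channel's H x W map to its mean, giving one value per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]


# Toy output of the last convolutional stage: 3 channels, each 2x2.
feature_maps = [
    [[1.0, 3.0], [5.0, 7.0]],  # channel 0
    [[0.0, 0.0], [2.0, 2.0]],  # channel 1
    [[4.0, 4.0], [4.0, 4.0]],  # channel 2
]

h = global_average_pool(feature_maps)
print(h)  # [4.0, 1.0, 4.0]
```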

### **3. Projection Head**

The projection head $g(\cdot)$ is a small neural network that maps the encoder’s output $\mathbf{h}_i$ to a new space where the contrastive loss is applied. The SimCLR paper found that applying the loss in this projected space, rather than directly on $\mathbf{h}_i$, improves the quality of the encoder's representations.

It is typically a **Multi-Layer Perceptron (MLP)** with one hidden layer:

$$
\mathbf{z}_i = g(\mathbf{h}_i) = W^{(2)} \sigma(W^{(1)} \mathbf{h}_i)
$$

Where:

* $\sigma$ is a ReLU activation function.
* $\mathbf{z}_i$ is the final vector used in the contrastive loss computation.
* This step improves learning, but $\mathbf{h}_i$ (not $\mathbf{z}_i$) is used as the final representation for downstream tasks.
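
The formula above can be traced by hand with a tiny example. The sketch below implements $W^{(2)} \sigma(W^{(1)} \mathbf{h}_i)$ with plain Python lists and no bias terms, matching the equation as written. The weight values and dimensions are made up for illustration; in practice they are learned and much larger.

```python
def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in v]


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def projection_head(h, W1, W2):
    """z = W2 * relu(W1 * h): an MLP with one hidden layer."""
    return matvec(W2, relu(matvec(W1, h)))


h = [2.0, 1.0]                                  # toy encoder output (d = 2)
W1 = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]     # hidden layer: 2 -> 3
W2 = [[1.0, 1.0, 1.0], [0.0, 2.0, -1.0]]        # output layer: 3 -> 2

z = projection_head(h, W1, W2)
print(z)  # [2.5, -1.5]
```

Here `W1 @ h` is `[1.0, -1.0, 1.5]`, the ReLU zeroes the negative entry, and the second layer maps the result to `z`.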

### **4. Contrastive Loss**

SimCLR uses a **contrastive loss function** to learn meaningful representations by bringing embeddings of similar (positive) pairs closer together, while pushing embeddings of dissimilar (negative) pairs farther apart in the projected space.

> #### **Setup**
>
> * Let the mini-batch contain **$N$ original images**.
> * After augmentation (two views per image), the batch contains **$2N$ samples**.
> * For each positive pair $(i, j)$ (two augmented views of the same image), the remaining $2(N-1)$ augmented examples serve as **negative samples**.
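
The bookkeeping in this setup can be sketched in a few lines. The code assumes one particular layout, chosen only for illustration: the two views of image `k` sit at positions `2k` and `2k + 1` of the augmented batch (0-indexed). The helper names are hypothetical.

```python
def positive_index(m):
    """Position of the other view of the same image, for views stored in adjacent pairs."""
    return m + 1 if m % 2 == 0 else m - 1


def negative_indices(m, n_images):
    """All positions in the 2N-sample batch except the anchor and its positive."""
    pos = positive_index(m)
    return [k for k in range(2 * n_images) if k not in (m, pos)]


N = 4  # original images in the mini-batch, so 2N = 8 augmented views
print(positive_index(2))            # 3
print(negative_indices(2, N))       # [0, 1, 4, 5, 6, 7]
print(len(negative_indices(2, N)))  # 6, which equals 2 * (N - 1)
```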


#### **4.1 Cosine Similarity**

For two embeddings $\mathbf{z}_i$ and $\mathbf{z}_j$, the **cosine similarity** is calculated as:

$$
\text{sim}(\mathbf{z}_i, \mathbf{z}_j) = \frac{\mathbf{z}_i \cdot \mathbf{z}_j}{\|\mathbf{z}_i\| \|\mathbf{z}_j\|}
$$

* Here, $\cdot$ denotes the dot product.
* $\|\mathbf{z}_i\|$ and $\|\mathbf{z}_j\|$ are the vector norms (magnitudes).
* The similarity ranges from -1 (opposite) to 1 (identical direction).

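A **temperature parameter** $\tau$ is used to scale these similarities before applying the loss, controlling how "soft" or "hard" the penalty for mismatches is: a smaller $\tau$ sharpens the resulting softmax distribution, so the most similar candidates dominate the loss.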
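
The sketch below computes cosine similarity for small vectors and then shows the effect of the temperature by passing the same three similarity scores through a softmax at two values of $\tau$. The scores and the two temperatures are arbitrary illustrative choices.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


print(cosine_similarity([3.0, 4.0], [3.0, 4.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction

# The same similarity scores, scaled by two different temperatures.
sims = [0.9, 0.5, 0.1]
for tau in (1.0, 0.1):
    probs = softmax([s / tau for s in sims])
    print(tau, [round(p, 3) for p in probs])
```

With the smaller temperature, almost all of the probability mass lands on the highest-scoring entry, which is what "harder" means in this context.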


#### **4.2 NT-Xent Loss**

The **Normalized Temperature-Scaled Cross-Entropy (NT-Xent) loss** for a positive pair $(i, j)$ is defined as:

$$
\ell_{i,j} = -\log \frac{
\exp\left(\text{sim}(\mathbf{z}_i, \mathbf{z}_j)/\tau\right)
}{
\sum_{k=1}^{2N} \mathbb{I}_{[k \neq i]} \exp\left(\text{sim}(\mathbf{z}_i, \mathbf{z}_k)/\tau\right)
}
$$

Where:

* The numerator is the exponentiated, temperature-scaled similarity of the positive pair $(i, j)$.
* The denominator sums the same quantity over all $2N - 1$ other embeddings in the batch: the positive $\mathbf{z}_j$ and the $2(N-1)$ negatives.
* $\mathbb{I}_{[k \neq i]}$ is an indicator function that equals 1 when $k \neq i$ and 0 otherwise, so the anchor itself is excluded from the denominator.
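
A minimal, framework-free sketch of $\ell_{i,j}$ follows. It computes the log of the denominator with the usual log-sum-exp trick for numerical stability, which is algebraically the same as the formula above. The function name `nt_xent_pair`, the toy embeddings, and the default `tau=0.5` are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def nt_xent_pair(z, i, j, tau=0.5):
    """Loss l_{i,j}: anchor z[i], positive z[j], all rows except z[i] in the denominator."""
    logits = [cosine_similarity(z[i], z[k]) / tau for k in range(len(z)) if k != i]
    positive_logit = cosine_similarity(z[i], z[j]) / tau
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    # -log(exp(positive) / denominator) = log(denominator) - positive
    return log_denominator - positive_logit


# Toy batch with N = 2 images, so 2N = 4 projected embeddings.
z = [
    [1.0, 0.0],   # view 1 of image A
    [0.9, 0.1],   # view 2 of image A
    [0.0, 1.0],   # view 1 of image B
    [-0.1, 1.0],  # view 2 of image B
]

loss_true_pair = nt_xent_pair(z, 0, 1)   # anchor paired with its real positive
loss_wrong_pair = nt_xent_pair(z, 0, 2)  # anchor paired with a view of another image
print(loss_true_pair < loss_wrong_pair)  # True
```

The loss is small when the positive is the most similar embedding to the anchor and grows when some other embedding is closer.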

#### **4.3 Symmetric Loss**

We calculate this loss twice for each positive pair:

* Once treating $\mathbf{z}_i$ as the anchor and $\mathbf{z}_j$ as the positive.
* Once with roles reversed (anchor $\mathbf{z}_j$ and positive $\mathbf{z}_i$).


#### **4.4 Final Loss**

The total loss is the average of $\ell_{i,j}$ over **all positive pairs** in the batch, taken in both directions, which gives $2N$ terms for a batch of $N$ original images.
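
Written out, and assuming the indexing convention of the SimCLR paper in which the two views of the $k$-th image are numbered $2k-1$ and $2k$, this average is:

$$
\mathcal{L} = \frac{1}{2N} \sum_{k=1}^{N} \left[ \ell_{2k-1,\,2k} + \ell_{2k,\,2k-1} \right]
$$

Each of the $N$ images contributes two terms, one per direction, so the sum has $2N$ terms and the factor $\frac{1}{2N}$ turns it into a mean.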

> This loss encourages the model to:
>
> * **Attract** embeddings of augmented views of the *same* image (positive pairs).
> * **Repel** embeddings of views from *different* images (negative pairs).

Thus, the model learns a representation space where similar images are close and different images are far apart.



### **Key Insights:**

* **Large batch size** helps the model see more negative samples, which improves performance.
* **Longer training** also leads to better representations.


## **Downstream Tasks**

After training, the SimCLR model can be used for **transfer learning** on downstream tasks like classification, object detection, or segmentation:

* The **projection head** is discarded.
* The **representations from the base encoder** (i.e., the output before the projection head) are used as learned features.
* These features serve as input to a simple classifier or other task-specific models, often resulting in improved performance thanks to the rich representations learned during contrastive training.

Example: A SimCLR model trained on ImageNet can be fine-tuned for a medical imaging task, using the learned features to classify X-ray images with minimal labeled data.

## **Evaluating SimCLR Models**

To assess the quality of SimCLR’s representations, common evaluation protocols include:

* **Linear Evaluation**: Train a linear classifier on the frozen encoder outputs ($\mathbf{h}_i$) and evaluate on a labeled dataset (for example, ImageNet). High accuracy indicates strong representations.

* **K-Nearest Neighbors (k-NN)**: Use the feature space to find the k nearest neighbors of a test image among labeled training images and predict its class by majority vote.

* **Downstream Task Performance**: Fine-tune the model on tasks like object detection or semantic segmentation and compare performance to supervised baselines.
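
The k-NN protocol is simple enough to sketch directly. The example below classifies a query feature vector by majority vote among its `k` most similar training features. Cosine similarity is used as the metric, which is an assumption made here; the toy 2-dimensional features and labels are invented for illustration and stand in for frozen encoder outputs.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def knn_predict(query, train_features, train_labels, k=3):
    """Predict the majority label among the k training features most similar to the query."""
    ranked = sorted(
        range(len(train_features)),
        key=lambda idx: cosine_similarity(query, train_features[idx]),
        reverse=True,
    )
    votes = Counter(train_labels[idx] for idx in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy stand-ins for frozen encoder outputs of labeled training images.
train_features = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.1, 1.0], [0.2, 0.9]]
train_labels = ["cat", "cat", "cat", "dog", "dog"]

print(knn_predict([0.8, 0.3], train_features, train_labels))  # cat
print(knn_predict([0.1, 0.7], train_features, train_labels))  # dog
```

No parameters are trained in this protocol, so the accuracy reflects only how well the encoder's feature space separates the classes.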

## **Challenges and Limitations**

SimCLR is powerful but has several challenges:

* **Large Batch Sizes**: SimCLR relies on large batches (the paper's default is 4096 images) to supply enough negative samples, which requires significant computational resources.
* **Augmentation Sensitivity**: The choice and strength of augmentations heavily impact performance. Poor augmentations may lead to trivial representations.
* **False Negatives**: Because negatives are simply the other images in the batch, some of them may be semantically similar to the anchor (for example, another image of the same class), and pushing them apart confuses the model.
* **Computational Cost**: Training with large batches and deep encoders like ResNet is resource-intensive.

**Solutions:** **MoCo** reduces the dependence on batch size by drawing negatives from a queue of embeddings produced by a momentum-updated encoder, while **BYOL** avoids negative pairs entirely.


## **References**

* Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020, February). [*A simple framework for contrastive learning of visual representations*. arXiv](https://arxiv.org/abs/2002.05709)

* Chaudhary, A. (2020, March 4). [*The illustrated SimCLR framework*. Amitness.](https://amitness.com/posts/simclr)

* Tsang, S.-H. (2020, August 20). [*Review — SimCLR: A simple framework for contrastive learning of visual representations*. Medium.](https://sh-tsang.medium.com/review-simclr-a-simple-framework-for-contrastive-learning-of-visual-representations-5de42ba0bc66)

## **Code Tutorial Reference**

* Lippe, P. (2025, May 1). [*Tutorial 13: Self-Supervised Contrastive Learning with SimCLR*. PyTorch Lightning (Lightning AI)](https://lightning.ai/docs/pytorch/stable/notebooks/course_UvA-DL/13-contrastive-learning.html) *(CC BY-SA License)*