<div style="display: flex; align-items: center; margin-bottom: 20px;">
  <img src="https://tse4.mm.bing.net/th?id=OIP.EhV6r5gkOCL2LvxAkFAnigAAAA&rs=1&pid=ImgDetMain" alt="SimCLR Logo" style="height: 60px; margin-right: 15px;">
  <div>
    <h2 style="margin: 0;">Look Twice, Learn Better : How SimCLR transformed Computer Vision</h2>
    <p style="margin: 0; color: #0066cc; font-size: 22px;"><em>Rishita Agarwal</em></p>
  </div>
</div>


<iframe width="560" height="315" src="https://youtu.be/muJMTto75qE " title="SimCLR Explanation" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>



## Motivation behind this project topic?

The rapid growth of digital images has caused havoc in machine learning: we have a huge amount of visual data, yet supervised learning is bottlenecked by the lack of high-quality labeled data. This is one of the major challenges in computer vision today.

One possible solution is manual annotation, but manual labelling is expensive, time-consuming, and often impractical at large scale. Sometimes it also requires expert knowledge of a particular domain.

**Unsupervised learning** offers a more promising path by leveraging the large amount of unlabeled data available. However, it is not easy to match the results of supervised learning. Some of the key requirements are -
1. The learned representation should be **generalizable** to diverse downstream tasks as well.
2. The approach should **scale computationally** with larger datasets and model sizes.
3. Features should capture **meaningful semantic** information.

<div style="display: flex; justify-content: center;">
    <img src="https://www.edushots.com/upload/articles-images/b35a6ab4259fcd2fa572cc62333ac5ec15371617.jpg" alt="SimCLR Framework" width="45%" style="margin-right: 10px;"/>
    <img src="https://media.geeksforgeeks.org/wp-content/uploads/20231213175718/Self-660.png" alt="SimCLR Results" width="45%"/>
</div>
<p style="text-align: center;"><em>Figure: Unsupervised Learning (left) and Self Supervised Learning (right)</em></p>

SimCLR addresses these requirements in a simple manner while still achieving state-of-the-art results, which drew my attention to this topic. The idea is simple yet innovative, and is in fact derived from the essential components of existing methods.


## History and current works?

Early methods for learning **self-supervised visual representations** led to this line of work. Let me talk more about them.

It all started with **handcrafted pretext** tasks like predicting image rotation or colorizing grayscale images. These methods were a good start towards finding a good representation, but the learnt representations turned out to be specific to the pretext task rather than general purpose.

Then, approaches like **InstDisc and CPC** introduced contrastive objectives (see the figure below) but still relied on complex architectures or memory banks to store representations.

<div style="display: flex; justify-content: center;">
    <img src="https://insights.willogy.io/assets/static/contrastive_learning_intuition.42db587.208b1cf168018c6226966d0407c62134.jpg" alt="Contrastive Learning Concept" width="50%"/>
</div>
<p style="text-align: center;"><em>Figure: Visualization of contrastive learning approach</em></p>

Finally, **SimCLR** simplified contrastive learning dramatically, showing that with the right data augmentation, loss function and projection head, a straightforward framework can achieve superior results.
After SimCLR, more methods built on its insights, such as MoCo v2 and SimSiam, further improved self-supervision, with some of them reducing the need for negative examples. CLIP and CLAP are two multimodal approaches that apply contrastive learning to image-text and audio-text pairs. I come back to these methods briefly towards the end of this blog.

## Diving deep into Contrastive Learning and SimCLR's approach

**Contrastive learning** is an approach that learns representations by comparing similar and dissimilar samples. The fundamental idea behind this concept is: Similar items ("Positive") should be closer together and dissimilar items ("Negative") should be further apart.

It is a self-supervised technique which learns meaningful representations without any explicitly labeled data.

**SimCLR (Simple Framework for Contrastive Learning of Visual Representations) Framework**

This framework consists of 4 major components: the three modules listed below, plus the contrastive loss discussed in the next section -
1. **Data Augmentation Module** - In this module, each input image goes through stochastic augmentation twice, to generate 2 different views.

``` 
X -> [augmentation] -> x̃ᵢ
X -> [augmentation] -> x̃ⱼ
```

The augmentation sequence is as follows:
- **Random cropping** - The image is cropped randomly and then resized back to the original size.
- **Random color distortion** - Includes color dropping, brightness, contrast etc.
- **Random Gaussian blur** - A Gaussian function is used to smooth the image. The effect is similar to viewing the image through a translucent screen, creating a hazy appearance by reducing image noise and detail.

The most important combination of transformations is cropping (spatial transformation) with color distortion (appearance transformation). A good reason for this is that, without color distortion, the network can exploit the shortcut of matching color histograms rather than actually learning semantic features.

<div style="display: flex; justify-content: center;">
    <img src="https://tse1.mm.bing.net/th?id=OIP.l9m-_lWHc2iopae_sHtdUwHaDM&rs=1&pid=ImgDetMain" alt="Data Augmentation" width="45%" style="margin-right: 10px;"/>
    <img src="https://img-blog.csdnimg.cn/20201013134203801.png?,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3h6MTMwODU3OTM0MA==,size_16,color_FFFFFF,t_70#pic_center" alt="Data Augmentation" width="45%" style="margin-right: 10px;"/>
</div>
<p style="text-align: center;"><em>Figure: Data Augmentation Example (left) and Composition of Augmentation Techniques (right)</em></p>
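To make the color-histogram shortcut concrete, here is a minimal sketch in plain Python. It is a toy illustration, not the augmentation pipeline used later in this post: the "image" is a grid of nearly uniform RGB pixels, `scale_channels` is a crude stand-in for color distortion with hand-picked factors (chosen so the outcome is reproducible), and all function names are assumptions made for this example.

```python
import random


def make_image(height, width, base_color, noise, rng):
    """Rows of (r, g, b) tuples, each channel within `noise` of `base_color`."""
    return [
        [tuple(c + rng.uniform(-noise, noise) for c in base_color) for _ in range(width)]
        for _ in range(height)
    ]


def random_crop(image, crop_height, crop_width, rng):
    """Cut a random window out of the image (no resizing in this toy version)."""
    top = rng.randrange(len(image) - crop_height + 1)
    left = rng.randrange(len(image[0]) - crop_width + 1)
    return [row[left:left + crop_width] for row in image[top:top + crop_height]]


def scale_channels(image, factors):
    """Crude stand-in for color distortion: multiply each channel by a fixed factor."""
    return [[tuple(c * f for c, f in zip(pixel, factors)) for pixel in row] for row in image]


def mean_color(image):
    pixels = [pixel for row in image for pixel in row]
    return [sum(pixel[ch] for pixel in pixels) / len(pixels) for ch in range(3)]


def color_gap(view_a, view_b):
    """L1 distance between the mean colors of two views."""
    return sum(abs(a - b) for a, b in zip(mean_color(view_a), mean_color(view_b)))


rng = random.Random(0)
image = make_image(32, 32, base_color=(0.8, 0.4, 0.2), noise=0.02, rng=rng)
view_a = random_crop(image, 16, 16, rng)
view_b = random_crop(image, 16, 16, rng)

# Crops alone keep the color statistics almost identical ...
gap_crop_only = color_gap(view_a, view_b)
# ... while distorting the colors of each view differently breaks that cue
gap_with_distortion = color_gap(
    scale_channels(view_a, (1.0, 0.5, 0.2)),
    scale_channels(view_b, (0.3, 1.0, 0.8)),
)
print(f"crop only: {gap_crop_only:.3f}, crop + color distortion: {gap_with_distortion:.3f}")
```

With crops only, the two views can be matched from their color statistics alone. Once each view gets its own color distortion, that cue stops being reliable, so the network has to rely on other features.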


2. **Base Encoder Network**

SimCLR employs a standard CNN (ResNet architecture) as its base encoder. Given an augmented image x̃, the encoder generates a representation vector -

``` h = f(x̃) = ResNet(x̃) ```

where h ∈ ℝᵈ is the output after the average pooling layer.

3. **Projection Head**
This head maps the representation h to a space z, where contrastive loss is applied -

``` z = g(h) = W₂σ(W₁h) ```

where σ is a ReLU nonlinearity, and W₁ and W₂ are learnable weights.
The paper uses a relatively simple MLP with one hidden layer as the projection head.
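As a worked instance of `z = W₂σ(W₁h)`, the sketch below evaluates the projection head by hand in plain Python. The 3-dimensional `h` and the weight values are made up for illustration (real SimCLR heads operate on much larger vectors and learn their weights), and the helper names are assumptions for this example.

```python
def relu(vector):
    """Element-wise ReLU nonlinearity."""
    return [max(0.0, v) for v in vector]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def projection_head(h, w1, w2):
    """Compute z = W2 * ReLU(W1 * h) for a single representation h."""
    return matvec(w2, relu(matvec(w1, h)))


h = [1.0, -2.0, 0.5]            # toy representation from the encoder
w1 = [[1.0, 0.0, 0.0],          # hidden layer: 3 -> 4
      [0.0, 1.0, 0.0],
      [0.0, 0.0, 1.0],
      [1.0, 1.0, 1.0]]
w2 = [[1.0, 1.0, 0.0, 0.0],     # output layer: 4 -> 2
      [0.0, 0.0, 1.0, 1.0]]

z = projection_head(h, w1, w2)
print(z)  # [1.0, 0.5]
```

Here `W₁h = [1, -2, 0.5, -0.5]`, the ReLU zeroes the two negative entries, and `W₂` then produces `z = [1.0, 0.5]`. The negative components of the hidden layer are exactly the kind of information that does not survive into `z`.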

## Contrastive Loss and Where should it be applied?

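It was noticed that applying the contrastive loss directly on the representations h hurts their quality for downstream tasks. The contrastive objective pushes the layer it is applied to towards invariance to the augmentations, which can remove information (such as color or orientation) that is useful for the downstream task. With a projection head, this loss of information happens in z, so h can retain it.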

Crucially, after training is complete, the projection head is discarded, and the **representation h is used for downstream tasks.**

#### Contrastive Loss function 
A special form of contrastive loss called NT-Xent (Normalized Temperature scaled Cross Entropy Loss) is used. For a positive pair (i, j) :

``` ℓᵢ,ⱼ = -log(exp(sim(zᵢ,zⱼ)/τ) / Σₖ₌₁²ᴺ 1[k≠i]exp(sim(zᵢ,zₖ)/τ)) ```

Where:

- sim(u,v) = u·v/‖u‖‖v‖ is the cosine similarity
- τ is a temperature parameter that controls the concentration level of the distribution
- The sum runs over the 2N augmented examples in the batch except i itself (the indicator 1[k≠i] drops the self-similarity term), so it includes the positive j

For each augmented image in the batch, this loss effectively treats the other augmented view of the same image as the single positive example and the remaining 2N-2 augmented images as negative examples.
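The formula can be checked on a tiny batch with a minimal plain-Python sketch. It assumes that views `2k` and `2k+1` in the list form a positive pair (a convention chosen for this example only), and the 2-dimensional projections are made up.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_pair(z, i, j, temperature):
    """Loss for anchor i with positive j among the 2N projections in z."""
    numerator = math.exp(cosine_similarity(z[i], z[j]) / temperature)
    # The denominator covers every k != i, i.e. the positive plus 2N-2 negatives
    denominator = sum(
        math.exp(cosine_similarity(z[i], z[k]) / temperature)
        for k in range(len(z))
        if k != i
    )
    return -math.log(numerator / denominator)


def nt_xent_batch(z, temperature):
    """Average the pair loss over both orderings of every positive pair."""
    n_pairs = len(z) // 2
    total = 0.0
    for k in range(n_pairs):
        total += nt_xent_pair(z, 2 * k, 2 * k + 1, temperature)
        total += nt_xent_pair(z, 2 * k + 1, 2 * k, temperature)
    return total / (2 * n_pairs)


# N = 2 images -> 4 projections
aligned = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]   # positives point the same way
mixed_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]]  # positives are orthogonal

print("aligned :", round(nt_xent_batch(aligned, temperature=0.5), 3))
print("mixed up:", round(nt_xent_batch(mixed_up, temperature=0.5), 3))
```

The loss is low when each view is closest to its own partner and high when the partner looks like a negative. If all projections are identical, every term in the denominator equals the numerator and the loss reduces to `log(2N-1)`.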

#### *What's the temperature parameter?* 
Hmmm... Good question. Since we have already L2 normalized our embeddings zᵢ and zⱼ, why do we need this temperature scaling parameter? The temperature parameter (τ) essentially controls how "peaky" or "smooth" the distribution becomes.

1. **Lower temperature** - Amplifies the differences between the similarity scores and makes the model more certain about its choices. E.g. τ = 0.1 tends to give sharper distinctions between positive and negative pairs, and the model focuses more on the hardest negatives.
2. **Higher temperature** - Smooths out the differences between similarity scores and creates a more uniform distribution. It allows the model to consider a broader range of negative examples.

So, normalization ensures that all embeddings lie on the unit hypersphere (so cosine similarity ranges from -1 to 1), and the temperature controls how strongly attention is concentrated on the positive and negative pairs.
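The effect of τ is easy to see on a handful of made-up similarity scores. The sketch below is plain Python; the scores (one positive, one hard negative, two easy negatives) and the function name are assumptions for illustration.

```python
import math


def softmax_with_temperature(similarities, temperature):
    """Softmax over similarity scores divided by the temperature."""
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# positive, hard negative, easy negative, very easy negative
similarities = [0.9, 0.7, 0.1, -0.3]

for temperature in (0.1, 0.5, 1.0):
    probs = softmax_with_temperature(similarities, temperature)
    print(f"tau={temperature}: {[round(p, 3) for p in probs]}")
```

At τ = 0.1 almost all of the probability mass goes to the positive and the hard negative, so the easy negatives contribute very little to the loss. At τ = 1.0 the distribution is much flatter and every negative carries noticeable weight.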


<div style="display: flex; justify-content: center;">
    <img src="https://sthalles.github.io/assets/contrastive-self-supervised/cover.png" alt="SimCLR Framework" width="70%"/>
</div>
<p style="text-align: center;"><em>Figure: SimCLR Framework</em></p>

 

## Key Insights

The major takeaways from this paper are described in this section.
Based on the extensive ablation studies which were conducted -
1. **Data augmentation composition is critical** - It was observed that a specific sequence and combination of augmentations enhanced the results. Random cropping + color distortion provides a significant benefit.

2. **Normalized embeddings and temperature scaling matter** - Temperature effectively controls the importance of difficult negative examples.
3. **Contrastive learning benefits from larger batch sizes** - Larger batches (up to 8192) continue to improve performance as they provide more negative examples.
4. **Longer training improves results** - SimCLR benefits more from training for longer periods than supervised counterparts.
5. **Non-linear projection head is crucial** - A linear projection is better than no projection, but a non-linear projection works even better, improving accuracy by about 3% over the linear one.
6. **Unsupervised learning benefits more from model scaling** - As the model size increases, the gap between supervised and self-supervised models shrinks.


## Algorithm in a nutshell

1. Sample a minibatch of N images
2. For each image x, generate two augmented versions x̃ᵢ and x̃ⱼ
3. Compute representations ```h = f(x̃)``` and projections ```z = g(h)``` for all 2N augmented images
4. For each positive pair, compute the contrastive loss using all other 2N-2 augmented examples as negatives
5. Update the networks f and g to minimize the loss
6. After training, discard the projection head g and use the encoder f and representation h for downstream tasks

<div style="display: flex; justify-content: center;">
    <img src="./blog_images/SimCLR.png" alt="SimCLR Framework" width="30%"/>
</div>
<p style="text-align: center;"><em>Figure: SimCLR Simple Example</em></p>

 

## Results

The simplicity and effectiveness of SimCLR made it a pivotal development in self-supervised learning. Some notable achievements:

- Linear classifiers trained on SimCLR representations achieved **76.5% top-1** accuracy on ImageNet, a *7% relative improvement* over previous state-of-the-art methods
- With just 1% of labeled ImageNet data, fine-tuned SimCLR achieved **85.8% top-5** accuracy
- SimCLR outperformed supervised pre-training on multiple transfer learning tasks

Perhaps most importantly, SimCLR showed that with the right components, a simple approach to contrastive learning could outperform more complex methods, setting a new direction for self-supervised learning research.

<div style="display: flex; justify-content: center;">
    <img src="./blog_images/results.png" alt="SimCLR Framework" width="50%"/>
</div>
<p style="text-align: center;"><em>Figure: SimCLR results on transfer learning tasks</em></p>


## What surprised me!?

After going through the results and concept in depth, I found several aspects which surprised me -
1. **The power of simple data augmentation:** Carefully composed data augmentations (random cropping + color distortion in this case) provided such an effective contrastive task without any complex architecture.

A question one might have is - **Why does random cropping work?**
<div style="display: flex; justify-content: center;">
    <img src="https://github.com/phlippe/uvadlc_notebooks/blob/master/docs/tutorial_notebooks/tutorial17/crop_views.svg?raw=1" alt="SimCLR Framework" width="50%"/>
</div>
<p style="text-align: center;"><em>Figure: Views produced by random cropping: local and global views (A, B) and adjacent views (C, D)</em></p>

When performing random cropping and resizing, we can distinguish between two situations: (a) cropped image A provides a local view of cropped image B, or (b) cropped images C and D show neighboring views of the same image. While situation (a) requires the model to learn some sort of scale invariance to make crops A and B similar in latent space, situation (b) is more challenging since the model needs to recognize an object beyond its limited view.



2. **Projection head paradox:** Experiments showed that the projection head ends up removing information that may be useful for downstream tasks. Fascinatingly, this is exactly why it helps: the output z of the head absorbs that loss of information for the contrastive loss, while the representation h before the head stays rich.

3. **Transfer learning performance:** This aspect highlights the main aim of the representations produced by SimCLR. That SimCLR representations outperformed supervised ImageNet pretraining on several transfer tasks was completely unexpected and important. It addressed the concern that self-supervised representations are task-specific.


<div style="display: flex; justify-content: center;">
    <img src="https://th.bing.com/th/id/OIP.tnTNLLEZNfrvlr_kkoBezwHaFO?rs=1&pid=ImgDetMain" alt="Results" width="50%"/>
</div>
<p style="text-align: center;"><em>Figure: SimCLR Results</em></p>
 



## Is there a scope for improvement??

Even with all these fascinating aspects of the architecture, there still remain several areas for improvement:

1. **Computational efficiency:** SimCLR requires large batch sizes and long training times for better performance, which makes it computationally expensive. However, follow-up works like *MoCo* address this issue using momentum encoders and memory banks.

2. **Negative sample dependency:** The need for many negative examples requires large batch sizes. Recent approaches like *BYOL and SimSiam* show that it's possible to learn without explicit negatives.

3. **Multimodality:** While SimCLR focuses on visual representations, *CLIP and CLAP* extend contrastive learning to multiple modalities, which yields even richer representations.

4. **Exploration:** Exploring architectures other than ResNet, and understanding why certain data augmentation compositions work better than others, would help develop self-supervised learning further.

## Want to see some results?
#### *Let's take a look at a PyTorch implementation which I did for SimCLR*

I implemented the SimCLR framework on a sample of *CIFAR10* (10000 images) for training the model, with *ResNet18* as the base encoder and a simple *MLP* as the projection head.

Each image is a 32x32 color image, and there are 10 different classes of images. The classes are -
[Airplane, Automobile, Bird, Cat, Deer, Dog, Frog, Horse, Ship, Truck]

The code below applies data augmentation (Random Crop + Color distortion + Gaussian Blur) twice to the image -

```python
from torchvision import transforms


# Define data augmentation pipeline
class TwoCropTransform:
    """Create two independently augmented views of the same image"""
    def __init__(self, transform):
        self.transform = transform

    def __call__(self, x):
        # Each call re-samples the random augmentations, so the two views differ
        return [self.transform(x), self.transform(x)]


# Define data augmentations as per SimCLR paper
def get_simclr_pipeline_transform(size):
    """Return a set of data augmentation transformations as described in the SimCLR paper."""
    # Jitter strengths for brightness, contrast, saturation and hue
    color_jitter = transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)
    # GaussianBlur needs an odd kernel size; `| 1` bumps an even value up to the next odd one
    kernel_size = int(0.1 * size) | 1

    data_transforms = transforms.Compose([
        transforms.RandomResizedCrop(size=size),
        transforms.RandomHorizontalFlip(),
        transforms.RandomApply([color_jitter], p=0.8),
        transforms.RandomGrayscale(p=0.2),
        transforms.GaussianBlur(kernel_size=kernel_size),
        transforms.ToTensor(),
        # ImageNet channel statistics
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
    return data_transforms
```

Applying the above augmentation gives 2N images for the N images in my dataset. **Positive pairs** are the pairs of images obtained from the same original image.

The following are the various data augmentations on an image - 

<div style="display: flex; justify-content: center;">
    <img src="./blog_images/imp1.png" alt="SimCLR Framework" width="50%"/>
</div>
<p style="text-align: center;"><em>Figure: CIFAR10 image with augmentations</em></p>

Then I defined the NT-Xent Loss (aiming to bring the positive pairs closer and the negative pairs further apart) -

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NTXentLoss(nn.Module):
    # Assumption: the class name and constructor are not shown in this post;
    # this is a minimal version that matches the forward pass below
    def __init__(self, temperature=0.5):
        super().__init__()
        self.temperature = temperature
        # Summed cross-entropy; forward() divides by 2N to get the mean
        self.criterion = nn.CrossEntropyLoss(reduction="sum")

    def forward(self, z_i, z_j):
        # Get actual batch size (might be smaller for the last batch)
        current_batch_size = z_i.size(0)
        device = z_i.device

        # Concatenate the two views and L2-normalize them,
        # so that dot products are cosine similarities
        representations = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)
        similarity_matrix = torch.mm(representations, representations.t())

        # Positive pairs sit on the diagonals offset by +N and -N
        sim_i_j = torch.diag(similarity_matrix, current_batch_size)
        sim_j_i = torch.diag(similarity_matrix, -current_batch_size)
        positives = torch.cat([sim_i_j, sim_j_i], dim=0)

        # Negatives: everything except self-similarity and the positive pair
        mask = ~torch.eye(2 * current_batch_size, dtype=torch.bool, device=device)
        idx = torch.arange(current_batch_size, device=device)
        mask[idx, idx + current_batch_size] = False
        mask[idx + current_batch_size, idx] = False
        negatives = similarity_matrix[mask].view(2 * current_batch_size, -1)

        # Scale by temperature; column 0 of the logits holds the positive
        logits = torch.cat([positives.view(-1, 1), negatives], dim=1) / self.temperature
        labels = torch.zeros(2 * current_batch_size, dtype=torch.long, device=device)

        loss = self.criterion(logits, labels)
        return loss / (2 * current_batch_size)

```

The loss is computed from the pairwise similarities between all projected vectors in the batch, which is how the negative samples come into play.
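The indexing in the loss code above can be confusing, so here is a small plain-Python sketch of where the positives and negatives sit in the `2N x 2N` similarity matrix when the two views are concatenated as `[z_i, z_j]`. The helper names are hypothetical and exist only for this illustration.

```python
def positive_positions(batch_size):
    """(row, column) entries of the similarity matrix that hold positive pairs."""
    upper = [(i, i + batch_size) for i in range(batch_size)]  # diagonal at offset +N
    lower = [(i + batch_size, i) for i in range(batch_size)]  # diagonal at offset -N
    return upper + lower


def negative_columns(row, batch_size):
    """Columns that act as negatives for `row`: not itself, not its positive partner."""
    partner = (row + batch_size) % (2 * batch_size)
    return [col for col in range(2 * batch_size) if col != row and col != partner]


print(positive_positions(2))   # [(0, 2), (1, 3), (2, 0), (3, 1)]
print(negative_columns(0, 3))  # [1, 2, 4, 5]
```

For a batch of N images, every row therefore has exactly one positive and 2N-2 negatives, matching the NT-Xent definition.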

The code below is the final code for training SimCLR -

```python

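import matplotlib.pyplot as plt
import torch

# Assumption: the same global `device` used throughout the notebook
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def train_simclr(model, train_loader, optimizer, criterion, epochs):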
    model.train()
    losses = []
    
    # Add debugging info
    print(f"Starting training with {len(train_loader)} batches per epoch")
    print(f"Actual batch size: {next(iter(train_loader))[0][0].shape[0]}")
    
    for epoch in range(epochs):
        running_loss = 0.0
        batch_count = 0
        
        for i, images in enumerate(train_loader):
            try:
                # Get the two augmented views of the same batch
                (x_i, x_j) = images[0]
                
                # Add progress info
                if i == 0 or (i+1) % 5 == 0:
                    print(f"Processing batch {i+1}/{len(train_loader)}")
                
                # Debug info for first batch
                if epoch == 0 and i == 0:
                    print(f"Input shape: {x_i.shape}")
                
                x_i, x_j = x_i.to(device), x_j.to(device)

                # Zero the gradients
                optimizer.zero_grad()
                
                # Forward pass for both augmentations with added memory efficiency
                with torch.cuda.amp.autocast(enabled=torch.cuda.is_available()):
                    z_i = model(x_i)
                    z_j = model(x_j)
                    
                    # Compute loss
                    loss = criterion(z_i, z_j)
                
                # Backward pass
                loss.backward()
                optimizer.step()
                
                running_loss += loss.item()
                batch_count += 1
                
                # Free up cached GPU memory (only meaningful on CUDA)
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
                
            except Exception as e:
                print(f"Error in batch {i}: {e}")
                continue
        
        # Record epoch loss
        epoch_loss = running_loss / batch_count if batch_count > 0 else float('inf')
        losses.append(epoch_loss)
        print(f"Epoch {epoch+1}/{epochs}, Loss: {epoch_loss:.4f}, Processed {batch_count} batches")
        
        # Save model checkpoint once at the end
        if epoch == epochs - 1:
            torch.save(model.state_dict(), "simclr_final.pt")
    
    # Plot loss curve
    try:
        plt.figure(figsize=(10, 5))
        plt.plot(losses)
        plt.title('SimCLR Training Loss')
        plt.xlabel('Epoch')
        plt.ylabel('Loss')
        plt.savefig('simclr_loss.png')
        plt.show()
    except Exception as e:
        print(f"Could not plot loss curve: {e}")
    
    return model, losses

```
Due to limited time and computational resources, I used a subset of the data, a batch size of 256, and 100 epochs. I used a temperature of 0.5 to balance the weighting of positive and negative samples.

Finally, the training loss decreased steadily over the epochs:

<div style="display: flex; justify-content: center;">
    <img src="./final_results/simclr_loss.png" alt="SimCLR Framework" width="50%"/>
</div>
<p style="text-align: center;"><em>Figure: Training Loss of SimCLR over the 100 epochs</em></p>

### Results on downstream task 

#### 1. Linear Evaluation

For this downstream task, I performed linear evaluation. Let's take a look at the steps involved:
- Load your pre-trained SimCLR model
- Freeze the encoder (so the weights don't change)
- Extract features from all your training and test images
- Train a simple linear classifier on top of these features
- Evaluate performance on the test set

The implementation extracts features from both training and test datasets in a single pass through the frozen encoder, significantly accelerating the evaluation process compared to end-to-end fine-tuning. The linear classifier itself is intentionally simple—just a single fully-connected layer mapping the 512-dimensional feature vectors to the 10 CIFAR-10 classes—to ensure that classification performance genuinely reflects the quality of the learned representations rather than the power of the classifier.
This is the most common way to evaluate self-supervised representations because it tests if your features are linearly separable.
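The idea of a linear probe on frozen features can be sketched in plain Python. This is a toy illustration under stated assumptions, not the notebook's implementation: the 2-dimensional "features" stand in for the 512-dimensional encoder outputs, there are 2 classes instead of 10, and the classifier is trained with simple full-batch gradient descent. All names are made up for this example.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_scores(weights, biases, feature):
    """One linear layer: a score per class for a single feature vector."""
    return [sum(w * x for w, x in zip(row, feature)) + b for row, b in zip(weights, biases)]


def train_linear_probe(features, labels, num_classes, epochs=100, lr=0.5):
    """Fit a single linear layer on fixed features with softmax cross-entropy."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for feature, label in zip(features, labels):
            probs = softmax(predict_scores(weights, biases, feature))
            for c in range(num_classes):
                error = probs[c] - (1.0 if c == label else 0.0)
                grad_b[c] += error
                for d in range(dim):
                    grad_w[c][d] += error * feature[d]
        scale = lr / len(features)
        for c in range(num_classes):
            biases[c] -= scale * grad_b[c]
            for d in range(dim):
                weights[c][d] -= scale * grad_w[c][d]
    return weights, biases


def accuracy(weights, biases, features, labels):
    correct = 0
    for feature, label in zip(features, labels):
        scores = predict_scores(weights, biases, feature)
        correct += int(scores.index(max(scores)) == label)
    return correct / len(labels)


# Toy stand-ins for features extracted once from a frozen encoder
train_features = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
train_labels = [0, 0, 0, 1, 1, 1]
test_features = [[0.7, 0.3], [0.3, 0.7]]
test_labels = [0, 1]

weights, biases = train_linear_probe(train_features, train_labels, num_classes=2)
print("test accuracy:", accuracy(weights, biases, test_features, test_labels))
```

Only `weights` and `biases` are updated here. The features never change, which is what makes the resulting accuracy a measure of the representation rather than of the classifier.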

<div style="display: flex; justify-content: center;">
    <img src="./final_results/training_curves.png" alt="Linear Evaluation" width="45%" style="margin-right: 10px;"/>
    <img src="./final_results/confusion_matrix.png" alt="Confusion Matrix" width="45%" style="margin-right: 10px;"/>
</div>
<p style="text-align: center;"><em>Figure: Linear Evaluation Training Curves (left) and Confusion Matrix (right)</em></p>

The above plots show that there is confusion between classes 0 and 8 (airplane and ship) and classes 1 and 9 (automobile and truck). This is likely because these are visually similar classes (similar shapes and features).

The confusion patterns suggest that SimCLR has learned representations that capture some semantic similarities but still struggles with fine-grained distinctions between visually similar categories. This might be due to the limited training data, as self-supervised learning typically needs more data than a supervised approach. Training for more epochs or increasing the batch size might also help, since more negative samples can then be compared.

#### 2. Label Efficiency evaluation 

One of the big advantages of self-supervised learning is that it can work well with less labeled data. 
I implemented an ```evaluate_with_less_data()``` function that tests how well the model performs when trained with only a small percentage of labeled data (1%, 10%, 25%, etc.).
The following is a plot showing how accuracy changes as you add more labeled data-
<div style="display: flex; justify-content: center;">
    <img src="./final_results/label_efficiency.png" alt="Label Efficiency" width="55%" style="margin-right: 10px;"/>
</div>
<p style="text-align: center;"><em>Figure: Test accuracy on label efficiency task</em></p>
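A minimal sketch of how such labeled subsets could be drawn is shown below. It assumes class-balanced sampling, which may or may not match the notebook; `sample_label_fraction` is a hypothetical helper and not the `evaluate_with_less_data()` function itself, and the synthetic `labels` list only mimics a 10-class dataset of 10000 images.

```python
import random
from collections import defaultdict


def sample_label_fraction(labels, fraction, seed=0):
    """Return sorted indices of a subset that keeps `fraction` of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    chosen = []
    for indices in by_class.values():
        keep = max(1, round(len(indices) * fraction))  # keep at least one per class
        chosen.extend(rng.sample(indices, keep))
    return sorted(chosen)


# Synthetic labels: 10 classes with 1000 examples each
labels = [i % 10 for i in range(10_000)]

for fraction in (0.01, 0.10, 0.25, 0.50, 1.00):
    subset = sample_label_fraction(labels, fraction)
    print(f"{fraction:.0%} of labels -> {len(subset)} labeled images")
```

The linear classifier would then be trained only on the features at these indices, while the test set stays the same for every fraction so that the accuracies are comparable.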

The above plot shows a steep performance jump from **1% to 10% labeled data**. The accuracy then plateaus after 50% labelled data.

The accuracy does not exceed 50% right now, due to several factors including the small amount of training data (only 10000 images), the small batch size, and the limited number of training epochs.

Still, the shape of the curve shows that only 10% of the labelled data was enough to reach this roughly 50% accuracy.
This confirms one of the core value propositions of contrastive learning methods like SimCLR - *they can learn useful representations that transfer well even with limited labeled data.*

The relatively high performance **(43%)** even with just **1% labeled data** demonstrates the effectiveness of the knowledge transfer from self-supervised pre-training to downstream tasks.





## Visualizing the representations

The visualization shows how, after training, semantically similar images cluster together in the representation space, even though the model never saw class labels. This organization emerges purely from the contrastive learning objective.

Given below is the t-SNE visualization of the embeddings -
<div style="display: flex; justify-content: center;">
    <img src="./final_results/tsne_embeddings.png" alt="TSNE Visualization" width="55%" style="margin-right: 10px;"/>
</div>
<p style="text-align: center;"><em>t-SNE Visualization of the embeddings</em></p>

The entire implementation is available at https://github.com/rishita3003/MMDP-Research-Paper/blob/main/training_evaluation.ipynb

## Tools

The following libraries are used in my implementation -
1. PyTorch
2. Torchvision
3. NumPy
4. Matplotlib
5. Seaborn
6. Scikit-learn (including `sklearn.manifold`)



## References

### Papers

1. Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A Simple Framework for Contrastive Learning of Visual Representations. In International Conference on Machine Learning (ICML).

2. He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.


### Videos and Talks

1. Ting Chen: "A Simple Framework for Contrastive Learning of Visual Representations" - ICML 2020 Talk
2. Yann LeCun: "Self-Supervised Learning: The Dark Matter of Intelligence" - Facebook AI Blog

### LLM Tools

1. Claude for managing the images html code in the blog.