<div style="display: flex; align-items: center; margin-bottom: 20px;">
  <img src="https://tse4.mm.bing.net/th?id=OIP.EhV6r5gkOCL2LvxAkFAnigAAAA&rs=1&pid=ImgDetMain" alt="SimCLR Logo" style="height: 60px; margin-right: 15px;">
  <div>
    <h2 style="margin: 0;">Look Twice, Learn Better: How SimCLR Transformed Computer Vision</h2>
    <p style="margin: 0; color: #0066cc; font-size: 22px;"><em>Rishita Agarwal</em></p>
  </div>
</div>


## Motivation behind this project topic?

The growth of digital images has created a real problem for machine learning: we have an enormous amount of visual data, yet supervised learning is bottlenecked by the lack of high-quality labeled data. This is one of the major challenges in computer vision today.

One possible solution is to annotate the data manually, but manual labeling is expensive, time-consuming, and often impractical at large scale. It may also require expert knowledge in a particular domain.

**Unsupervised learning** offers a more promising path by leveraging the large amount of unlabeled data available. However, matching the results of supervised learning is not easy. Some of the key requirements are -
1. The learned representation should be **generalizable** to diverse downstream tasks.
2. The approach should **scale computationally** with larger datasets and model sizes.
3. Features should capture **meaningful semantic** information.

<div style="display: flex; justify-content: center;">
    <img src="https://www.edushots.com/upload/articles-images/b35a6ab4259fcd2fa572cc62333ac5ec15371617.jpg" alt="Unsupervised Learning" width="45%" style="margin-right: 10px;"/>
    <img src="https://media.geeksforgeeks.org/wp-content/uploads/20231213175718/Self-660.png" alt="Self Supervised Learning" width="45%"/>
</div>
<p style="text-align: center;"><em>Figure: Unsupervised Learning (left) and Self Supervised Learning (right)</em></p>

SimCLR addressed these requirements in a simple manner while still achieving state-of-the-art results, which is what drew my attention to this topic. The idea was simple yet innovative, and it was distilled from the essential components of existing methods.


## History and current works?

Earlier work on **self-supervised visual representations** paved the way for SimCLR. Let me talk more about these methods.

It all started with **handcrafted pretext** tasks like predicting image rotation or colorizing grayscale images. These methods were a good first step towards finding a good representation, but the learned representations turned out to be specific to the pretext task rather than general-purpose.

Then, approaches like **InstDisc and CPC** introduced contrastive objectives (see the figure below), but they still relied on complex architectures or on memory banks to store representations.

<div style="display: flex; justify-content: center;">
    <img src="https://insights.willogy.io/assets/static/contrastive_learning_intuition.42db587.208b1cf168018c6226966d0407c62134.jpg" alt="Contrastive Learning Concept" width="50%"/>
</div>
<p style="text-align: center;"><em>Figure: Visualization of contrastive learning approach</em></p>

Finally, **SimCLR** simplified contrastive learning dramatically, showing that with the right data augmentation, loss function, and projection head, a straightforward framework could achieve superior results.
After SimCLR, further methods such as MoCo v2 and SimSiam built on its insights to improve self-supervision, with some of them trying to reduce the need for negative examples. CLIP and CLAP are two multimodal approaches to contrastive learning, applying it to image-text and audio-text pairs. I will give a basic introduction to these methods towards the end of this blog.

## Diving deep into Contrastive Learning and SimCLR's approach

**Contrastive learning** is an approach that learns representations by comparing similar and dissimilar samples. The fundamental idea behind this concept is: Similar items ("Positive") should be closer together and dissimilar items ("Negative") should be further apart.

It is a self-supervised technique that learns meaningful representations without any explicitly labeled data.

**SimCLR (A Simple Framework for Contrastive Learning of Visual Representations) Framework**

This framework consists of 4 major components - 
1. Data Augmentation Module - In this module, each input image goes through stochastic augmentation twice, to generate 2 different views.

``` 
X -> [augmentation] -> x̃ᵢ
X -> [augmentation] -> x̃ⱼ
```

The augmentation sequence is as follows :
- **Random cropping** - A random region of the image is cropped and then resized back to the original size.
- **Random color distortion** - Includes color dropping (conversion to grayscale) and changes to brightness, contrast, etc.
- **Random Gaussian blur** - Smooths the image with a Gaussian function. The effect is similar to viewing the image through a translucent screen, creating a hazy appearance by reducing image noise and detail.

The most important combination of transformations is cropping (a spatial transformation) together with color distortion (an appearance transformation). The reason is that, without color distortion, the network can exploit the shortcut of matching color histograms between crops of the same image rather than actually learning semantic features.
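
To make the cropping and color distortion steps concrete, here is a minimal pure-Python sketch on a toy image stored as nested lists of RGB tuples in [0, 1]. The function names, the `min_scale`, `strength`, and `drop_prob` parameters, and the nearest-neighbour resize are my own assumptions for illustration and not the paper's implementation; Gaussian blur is left out for brevity.

```python
import random

def random_crop_resize(img, rng, min_scale=0.5):
    """Crop a random square window, then resize it back with nearest-neighbour sampling."""
    size = len(img)  # assumes a square image of size x size pixels
    crop = rng.randint(max(1, int(size * min_scale)), size)
    top = rng.randint(0, size - crop)
    left = rng.randint(0, size - crop)
    return [
        [img[top + (r * crop) // size][left + (c * crop) // size] for c in range(size)]
        for r in range(size)
    ]

def color_distort(img, rng, strength=0.4, drop_prob=0.2):
    """Scale brightness by a random factor and occasionally drop color (grayscale)."""
    factor = 1.0 + rng.uniform(-strength, strength)
    to_gray = rng.random() < drop_prob
    out = []
    for row in img:
        new_row = []
        for (r, g, b) in row:
            if to_gray:
                r = g = b = (r + g + b) / 3.0
            # Clamp every channel back into [0, 1] after the brightness change
            new_row.append(tuple(min(1.0, max(0.0, ch * factor)) for ch in (r, g, b)))
        out.append(new_row)
    return out

def two_views(img, rng):
    """Run the same stochastic pipeline twice to obtain a positive pair."""
    def augment(x):
        return color_distort(random_crop_resize(x, rng), rng)
    return augment(img), augment(img)

rng = random.Random(0)
image = [[(r / 7, c / 7, 0.5) for c in range(8)] for r in range(8)]  # toy 8x8 RGB image
view_i, view_j = two_views(image, rng)
```

Both views keep the original 8x8 shape, so they can be fed to the same encoder. In a real pipeline an image library would do the cropping, jitter, and blur on actual image tensors.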

<div style="display: flex; justify-content: center;">
    <img src="https://tse1.mm.bing.net/th?id=OIP.l9m-_lWHc2iopae_sHtdUwHaDM&rs=1&pid=ImgDetMain" alt="Data Augmentation" width="45%" style="margin-right: 10px;"/>
    <img src="https://img-blog.csdnimg.cn/20201013134203801.png?,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3h6MTMwODU3OTM0MA==,size_16,color_FFFFFF,t_70#pic_center" alt="Data Augmentation" width="45%" style="margin-right: 10px;"/>
</div>
<p style="text-align: center;"><em>Figure: Data Augmentation Example (left) and Composition of Augmentation Techniques (right)</em></p>


2. Base Encoder Network

SimCLR employs a standard CNN (ResNet architecture) as its base encoder. Given an augmented image x̃, the encoder generates a representation vector -

``` h = f(x̃) = ResNet(x̃) ```

where h ∈ ℝᵈ is the output after the average pooling layer.

3. Projection Head
This head maps the representation h to a space z, where contrastive loss is applied -

``` z = g(h) = W₂σ(W₁h) ```

where σ is a ReLU nonlinearity, and W₁ and W₂ are learnable weights.
In the paper, the projection head is a relatively simple MLP with one hidden layer.
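
As a rough illustration of the formula z = W₂σ(W₁h), the forward pass of such a head can be sketched in plain Python. The layer sizes, the random initialization, and the helper names (`matvec`, `init_weights`, `projection_head`) are assumptions for this toy example; a real head would be built and trained in a deep learning framework.

```python
import random

def relu(v):
    """Element-wise ReLU nonlinearity (the sigma in the formula)."""
    return [max(0.0, x) for x in v]

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def init_weights(rows, cols, rng):
    """Random Gaussian weights; the scaling is an illustrative choice."""
    scale = (2.0 / cols) ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def projection_head(h, W1, W2):
    """z = W2 * relu(W1 * h): one hidden layer, no bias terms, as in the formula."""
    return matvec(W2, relu(matvec(W1, h)))

rng = random.Random(0)
d, hidden, out_dim = 8, 8, 4  # toy sizes, far smaller than a real ResNet output
W1 = init_weights(hidden, d, rng)
W2 = init_weights(out_dim, hidden, rng)
h = [rng.gauss(0.0, 1.0) for _ in range(d)]  # stand-in for an encoder output
z = projection_head(h, W1, W2)
```

Here `h` plays the role of the encoder output and `z` is the vector that the contrastive loss would see.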

## Contrastive Loss and Where should it be applied?

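The paper observes that applying the contrastive loss directly on the representations h hurts their quality for downstream tasks. The suggested explanation is that the contrastive objective pushes its input to be invariant to the augmentations, so information such as color or orientation gets discarded even though it may be useful downstream. With a projection head, this information can be dropped in g while h still retains it.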

Crucially, after training is complete, the projection head is discarded, and the **representation h is used for downstream tasks.**

#### Contrastive Loss function 
A special form of contrastive loss called NT-Xent (Normalized Temperature-scaled Cross Entropy Loss) is used. For a positive pair (i, j):

``` ℓᵢ,ⱼ = -log(exp(sim(zᵢ,zⱼ)/τ) / Σₖ₌₁²ᴺ 1[k≠i]exp(sim(zᵢ,zₖ)/τ)) ```

Where:

- sim(u,v) = u·v/‖u‖‖v‖ is the cosine similarity
- τ is a temperature parameter that controls the concentration level of the distribution
- The sum in the denominator runs over all 2N examples in the batch except i itself, so it includes the positive j

For each augmented image in the batch, this loss treats one example as the positive (the other augmented view of the same image) and the remaining 2N-2 augmented images as negatives.
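
The loss above can be sketched in a few lines of plain Python. This is a minimal, unoptimized version for illustration: the default `tau=0.5` and the storage layout (views interleaved as `[a1, b1, a2, b2, ...]`, so the partner of index `i` is `i ^ 1`) are assumptions of this sketch, and a practical version would be vectorized in a tensor library.

```python
import math

def cosine_similarity(u, v):
    """sim(u, v) = u.v / (|u| |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pair_loss(z, i, j, tau):
    """l(i, j): negative log-softmax of the positive j among all k != i."""
    numerator = math.exp(cosine_similarity(z[i], z[j]) / tau)
    denominator = sum(
        math.exp(cosine_similarity(z[i], z[k]) / tau)
        for k in range(len(z)) if k != i
    )
    return -math.log(numerator / denominator)

def nt_xent(z, tau=0.5):
    """Average l(i, j) over both orderings of every positive pair.

    Assumes views are interleaved, so the positive partner of index i is i ^ 1.
    """
    return sum(pair_loss(z, i, i ^ 1, tau) for i in range(len(z))) / len(z)

# Two toy batches of 2N = 4 projections in 2-D
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]   # positives point the same way
mixed_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # positives are orthogonal
print(nt_xent(aligned), nt_xent(mixed_up))
```

The `aligned` batch, where each positive pair points in the same direction, gets a lower loss than the `mixed_up` batch, which is the behaviour the objective is meant to reward.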

<div style="display: flex; justify-content: center;">
    <img src="https://sthalles.github.io/assets/contrastive-self-supervised/cover.png" alt="SimCLR Framework" width="70%"/>
</div>
<p style="text-align: center;"><em>Figure: SimCLR Framework</em></p>

 

## Key Insights

The main takeaways from the paper, based on its extensive ablation studies, are -
1. **Data augmentation composition is critical** - It was observed that composing several augmentations works much better than relying on any single one. Random cropping + color distortion provides a significant benefit.

2. **Normalized embeddings and temperature scaling matter** - The temperature effectively controls how much weight difficult (hard) negative examples receive.
3. **Contrastive Learning benefits from larger batch sizes** - Larger batches (up to 8192) continue to improve performance as they provide more negative examples.
4. **Longer training improves results** - SimCLR benefits more from training for longer periods than supervised counterparts.
5. **Non-linear projection head is crucial** - A linear projection is better than no projection, but a non-linear projection works even better, improving accuracy by roughly 3% over the linear one.
6. **Unsupervised learning benefits more from model scaling** - As the model size increases, the gap between supervised and self supervised models shrinks.
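
To get a feel for the temperature point, here is a small sketch that looks only at the negative terms of the loss denominator and shows how their relative shares change with τ. The three similarity values are hypothetical numbers chosen for illustration, and `negative_weights` is a helper name introduced here.

```python
import math

def negative_weights(similarities, tau):
    """Softmax over negative similarities: each negative's relative share
    among the negative terms exp(sim / tau) of the loss denominator."""
    exps = [math.exp(s / tau) for s in similarities]
    total = sum(exps)
    return [e / total for e in exps]

sims = [0.9, 0.5, 0.1]  # hard, medium, easy negative (hypothetical cosine similarities)
for tau in (1.0, 0.5, 0.1):
    weights = negative_weights(sims, tau)
    print(tau, [round(w, 3) for w in weights])
```

As τ shrinks, the hard negative (similarity 0.9) takes a larger and larger share, which is one way to read the claim that temperature controls the importance of difficult negatives.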


## Algorithm in a nutshell

1. Sample a minibatch of N images
2. For each image x, generate two augmented versions x̃ᵢ and x̃ⱼ
3. Compute representations ```h = f(x̃)``` and projections ```z = g(h)``` for all 2N augmented images
4. For each positive pair, compute the contrastive loss using all other 2N-2 augmented examples as negatives
5. Update the networks f and g to minimize the loss
6. After training, discard the projection head g and use the encoder f and representation h for downstream tasks
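
Steps 2 to 4 can be sketched as a single forward function. This is only a control-flow skeleton under my own assumptions: `simclr_forward` and the toy stand-ins below are hypothetical, the views are interleaved so positives sit at indices (2k, 2k+1), and the parameter update of step 5 is left to whatever autograd framework is used.

```python
import random

def simclr_forward(batch, augment, encoder, head, loss_fn):
    """Build 2N views, encode and project them, then score the projections."""
    views = []
    for x in batch:
        views.append(augment(x))  # first augmented view of x
        views.append(augment(x))  # second augmented view of x
    h = [encoder(v) for v in views]  # representations, kept for downstream tasks
    z = [head(rep) for rep in h]     # projections, used only by the loss
    return loss_fn(z), h

# Toy stand-ins that only exercise the control flow
rng = random.Random(0)

def toy_augment(x):
    return [v + rng.uniform(-0.1, 0.1) for v in x]  # small random perturbation

def toy_encoder(x):
    return [2.0 * v for v in x]

def toy_head(h):
    return h[:2]  # keep the first two dimensions

def count_views(z):
    return len(z)  # placeholder "loss" that just counts the projections

batch = [[1.0, 2.0, 3.0] for _ in range(4)]  # N = 4 toy "images"
result, reps = simclr_forward(batch, toy_augment, toy_encoder, toy_head, count_views)
```

With N = 4 inputs the function produces 2N = 8 views. Swapping `count_views` for an actual contrastive loss and the stand-ins for a real augmentation pipeline, encoder, and projection head gives the shape of one training step.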

<div style="display: flex; justify-content: center;">
    <img src="SimCLR.png" alt="SimCLR Framework" width="50%"/>
</div>
<p style="text-align: center;"><em>Figure: SimCLR Simple Example</em></p>

 

## What surprised me!?

After going through the results and concept in depth, I found several aspects which surprised me -
1. **The power of simple data augmentation:** Carefully composed data augmentations (random cropping + color distortion in this case) were enough to define an effective contrastive task without any complex architecture.

2. **Projection head paradox:** The experiments show that the projection head removes information that may be useful for downstream tasks. Fascinatingly, this is exactly what helps: the output of the projection head is used for the contrastive loss, while the richer representation before it is kept for downstream use.

3. **Transfer learning performance:** This aspect highlights the main goal of the representations produced by SimCLR. The paper reports that SimCLR representations are competitive with supervised ImageNet pretraining on transfer tasks and outperform it on several datasets, which was completely unexpected. This addresses the concern that self-supervised representations are tied to a specific pretext task.


<div style="display: flex; justify-content: center;">
    <img src="https://th.bing.com/th/id/OIP.tnTNLLEZNfrvlr_kkoBezwHaFO?rs=1&pid=ImgDetMain" alt="Results" width="70%"/>
</div>
<p style="text-align: center;"><em>Figure: SimCLR Results</em></p>
 



## Is there a scope for improvement??

Even with all these fascinating aspects, several areas for improvement remain:

1. **Computational efficiency:** SimCLR requires large batch sizes and long training times for better performance, which makes it computationally expensive. Related work such as *MoCo* addresses this with a momentum encoder and a queue of negatives, so that the number of negatives no longer depends on the batch size.

2. **Negative sample dependency:** The need for many negative examples requires large batch sizes. Recent approaches like *BYOL and SimSiam* show that it's possible to learn without explicit negatives.

3. **Multimodality:** While SimCLR focuses on visual representations, *CLIP and CLAP* extend contrastive learning to multiple modalities, yielding even richer representations.

4. **Exploration:** Exploring architectures other than ResNet, and understanding why certain data augmentation compositions work better than others, would help advance self-supervised learning.

## References

### Papers

1. Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A Simple Framework for Contrastive Learning of Visual Representations. In International Conference on Machine Learning (ICML).

2. He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.


### Videos and Talks

1. Ting Chen: "A Simple Framework for Contrastive Learning of Visual Representations" - ICML 2020 Talk
2. Yann LeCun: "Self-Supervised Learning: The Dark Matter of Intelligence" - Facebook AI Blog

## Tools

## Visualizing the representations

## Implementation in PyTorch