## Look Twice, Learn Better: How SimCLR Transformed Computer Vision

``` Rishita Agarwal ```

## Motivation behind this project topic?

The explosive growth of digital images has created a real problem for machine learning. We have a lot of visual data, yet supervised learning approaches remain bottlenecked by the lack of high-quality labeled data. This is one of the major challenges in computer vision today.

One possible solution is to annotate the data manually, but manual labeling is expensive, time-consuming and often impractical at large scale. Sometimes it also requires expert knowledge in a particular domain.

**Unsupervised learning** offers a more promising path by leveraging the large amount of unlabeled data available. However, it is not easy to get results as good as those of supervised learning. Some of the key requirements are:
1. The learned representation should be **generalizable** to diverse downstream tasks.
2. The approach should **scale computationally** with larger datasets and model sizes.
3. Features should capture **meaningful semantic** information.

<div style="display: flex; justify-content: center;">
    <img src="https://www.edushots.com/upload/articles-images/b35a6ab4259fcd2fa572cc62333ac5ec15371617.jpg" alt="SimCLR Framework" width="45%" style="margin-right: 10px;"/>
    <img src="https://media.geeksforgeeks.org/wp-content/uploads/20231213175718/Self-660.png" alt="SimCLR Results" width="45%"/>
</div>
<p style="text-align: center;"><em>Figure: Unsupervised Learning (left) and Self-Supervised Learning (right)</em></p>

SimCLR addressed these requirements in a simple manner while still achieving state-of-the-art results, which is what grabbed my attention. The idea was simple yet innovative, and it was actually derived from the essential components of existing methods.


## History and current works?

The early methods for learning **self-supervised visual representations** paved the way for SimCLR. Let me talk more about these methods.

It all started with **handcrafted pretext** tasks like predicting image rotation or colorizing grayscale images. These methods were a good first step towards finding a good representation, but it was noticed that the learned representations were specific to the pretext task rather than general-purpose.

Then, approaches like **InstDisc and CPC** introduced the concept of contrastive objectives (see the figure below), but they still relied on complex architectures or memory banks to store representations.

<div style="display: flex; justify-content: center;">
    <img src="https://insights.willogy.io/assets/static/contrastive_learning_intuition.42db587.208b1cf168018c6226966d0407c62134.jpg" alt="Contrastive Learning Concept" width="50%"/>
</div>
<p style="text-align: center;"><em>Figure: Visualization of contrastive learning approach</em></p>
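To make the idea of a contrastive objective a bit more concrete, here is a minimal sketch in plain Python. It is a generic InfoNCE-style loss for a single anchor, not the exact loss used by InstDisc, CPC or SimCLR. The function names, the toy 2-D vectors and the temperature of 0.5 are my own assumptions for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.5):
    """InfoNCE-style loss for one anchor (illustrative, hypothetical helper).

    The loss is low when the anchor is more similar to its positive
    than to any of the negatives, and high otherwise.
    """
    pos_logit = cosine_similarity(anchor, positive) / temperature
    neg_logits = [cosine_similarity(anchor, n) / temperature for n in negatives]
    logits = [pos_logit] + neg_logits

    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))

    # Negative log-probability of picking the positive among all candidates.
    return log_sum_exp - pos_logit


# Toy example: the positive points the same way as the anchor,
# the negative points the opposite way, so the loss is small.
good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])

# Swapping the roles makes the "positive" the least similar candidate,
# so the loss is large.
bad = contrastive_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])

print(good, bad)
```

Minimizing this quantity pulls the anchor towards its positive and pushes it away from the negatives, which is the intuition shown in the figure.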

Finally, **SimCLR** simplified contrastive learning dramatically, showing that with the right data augmentation, loss function and projection head, superior results could be achieved with a straightforward framework.
After SimCLR, more methods were built on its insights, such as MoCo v2 and SimSiam, to further improve self-supervision, some of them trying to reduce the need for negative examples. CLIP and CLAP are two multimodal approaches to contrastive learning (CL), applying it to image-text and audio-text pairs respectively. I will give a basic introduction to these methods towards the end of this blog.

## Diving deep into Contrastive Learning and SimCLR's approach

## CLIP and CLAP