# <img src="https://img.icons8.com/bubbles/50/000000/mind-map.png" style="height:50px;display:inline"> EE 046746 - Technion - Computer Vision
---

#### <a href="https://taldatech.github.io">Tal Daniel</a> (adapted from Tal's Tutorials in ECE046211 - Deep Learning)

## Tutorial 12 - Self-Supervised Learning
---

### <img src="https://img.icons8.com/bubbles/50/000000/checklist.png" style="height:50px;display:inline"> Agenda
---
* [Representation and Self-Supervised Learning](#-Representation-and-Self-Supervised-Learning)
* [Autoencoders](#-Deep-Unsupervised-Learning---Deep-Autoencoders)
* [Self-Supervised Learning](#-Self-Supervised-Learning)
  * [Corrupted Version Reconstruction & Visual Common Sense Tasks](#-Corrupted-Version-Reconstruction-&-Visual-Common-Sense-Tasks)
  * [Contrastive Methods](#-Contrastive-Learning)
    * [Simple Framework for Contrastive Learning of Visual Representations (SimCLR)](#-Simple-Framework-for-Contrastive-Learning-of-Visual-Representations-(SimCLR))
    * [Using the Learned Representation for Downstream Tasks](#-Using-the-Learned-Representation-for-Downstream-Tasks)
    * [Momentum Contrast (MoCo)](#-Momentum-Contrast-(MoCo))
    * [Contrastive Predictive Coding (CPC)](#-Contrastive-Predictive-Coding-(CPC))
    * [Performance Comparison](#-Performance-Comparison)
* [Recommended Videos](#-Recommended-Videos)
* [Credits](#-Credits)

## <img src="https://img.icons8.com/color/96/000000/self-esteem.png" style="height:50px;display:inline"> Representation and Self-Supervised Learning
---
* Data is usually abundant and cheap; it is the labels that are expensive.
  * Can we use the unlabeled data to gain knowledge that is usable for downstream supervised tasks?


* Maybe we can learn rich and useful features from raw unlabeled data
  * We can learn a representation of the data!
<img src="./assets/rep_data.png" style="height:250px">
* **The way we represent the data has a great impact on the performance and complexity**.

* What are the various general tasks that can be used to learn representations from unlabelled data?
  * **Deep Unsupervised Learning** - learn representations without labels. It is a subset of deep learning, which is a subset of representation learning, which in turn is a subset of machine learning.
  * **Self-supervised Learning** - often used interchangeably with unsupervised learning. Self-supervised: **create your own supervision through pretext tasks**.

### <img src="https://img.icons8.com/color/96/000000/code.png" style="height:50px;display:inline"> Deep Unsupervised Learning - Deep Autoencoders
---
* **Motivation:** Most natural data is high-dimensional, such as images. Consider the MNIST (hand-written digits) dataset, where each image has $28 \times 28=784$ pixels, which means it can be represented by a vector of length 784.
    * But do we really need 784 values to represent a digit? The answer is probably no. We believe that the data lies on a low-dimensional manifold which is enough to describe the observations. In the case of MNIST, we know that there are 10 digits - so we can represent the digits as one-hot vectors, which means we only need 10 dimensions.
    * So we can **encode** high-dimensional observations in a low-dimensional space.
    * But how can we learn meaningful low-dimensional representations? 
    * The general idea is to reconstruct or, **decode** the low-dimensional representation to the high-dimensional representation, and use the reconstruction error to learn the best representations. This is the core idea behind **autoencoders**.



<img src="./assets/MnistExamples.png" style="height:250px">

* Image from <a href="https://en.wikipedia.org/wiki/MNIST_database">Wikipedia</a>

* **Autoencoders** - models which take data as input and discover some latent state representation of that data. The input data is converted into an encoding vector where each dimension represents some learned attribute of the data. The most important detail to grasp here is that the encoder network outputs a single value for each encoding dimension. The decoder network then takes these values and attempts to recreate the original input. Autoencoders have **three parts**: an encoder, a decoder, and a 'loss' function that compares the reconstruction with the original input. For the simplest autoencoders - the sort that compress and then reconstruct the original inputs from the compressed representation - we can think of the 'loss' as describing the amount of information lost in the process of reconstruction.
        
<img src="./assets/autoencoder_1.jpeg" style="height:250px">
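
To make the encode/decode/loss loop concrete, here is a minimal sketch of a *linear* autoencoder trained with plain gradient descent in NumPy. The toy data, the dimensions (8 inputs, 2 latent units), the learning rate and the variable names are all assumptions chosen for illustration; practical autoencoders use non-linear deep networks and a framework such as PyTorch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in 8-D that actually live on a 2-D subspace
# (a stand-in for "high-dimensional data on a low-dimensional manifold").
latent_true = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
x = latent_true @ mixing

# Linear encoder (8 -> 2) and decoder (2 -> 8) with a small random init.
w_enc = rng.normal(scale=0.1, size=(8, 2))
w_dec = rng.normal(scale=0.1, size=(2, 8))


def reconstruction_loss(x, w_enc, w_dec):
    """Mean squared error between the input and its reconstruction."""
    z = x @ w_enc        # encode: low-dimensional code
    x_hat = z @ w_dec    # decode: back to the input space
    return np.mean((x - x_hat) ** 2)


loss_before = reconstruction_loss(x, w_enc, w_dec)

lr = 0.01
for _ in range(2000):
    z = x @ w_enc
    x_hat = z @ w_dec
    # Gradient of 0.5 * (sum of squared errors) / N with respect to x_hat.
    err = (x_hat - x) / x.shape[0]
    grad_dec = z.T @ err
    grad_enc = x.T @ (err @ w_dec.T)
    w_dec -= lr * grad_dec
    w_enc -= lr * grad_enc

loss_after = reconstruction_loss(x, w_enc, w_dec)
print(f"reconstruction MSE: {loss_before:.4f} -> {loss_after:.6f}")
```

Because the toy data is exactly 2-dimensional, a 2-unit bottleneck is enough to drive the reconstruction error close to zero; the learned code `z` is the representation we care about.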
    

### <img src="https://img.icons8.com/cute-clipart/64/000000/task.png" style="height:50px;display:inline"> Self-Supervised Learning
---
* A version of unsupervised learning where **data provides the supervision**.
* **Idea**: withhold some part of the data and then task a neural network to predict it from the remaining parts.
* The details of the setup determine which proxy loss or pretext task the network tries to solve; depending on the quality of the task, good semantic features can be obtained without actual labels.
* Motivation, compared with supervised learning:
    * Supervised learning carries a large cost of producing a new dataset for each task (preparing labeling manuals and categories, hiring annotators, creating GUIs, storage pipelines, etc.).
    * Good supervision may not be cheap (e.g., medicine, legal).
    * Self-supervision takes advantage of the vast amount of unlabeled data on the Internet (images, videos, language).

### <img src="https://img.icons8.com/cute-clipart/64/000000/console.png" style="height:50px;display:inline"> Self-Supervised Learning Methods
---
* *Reconstruct from a corrupted (or partial) version*
    * Denoising Autoencoders - "Withheld" data: Clean image
    * In-painting - "Withheld" data: a masked region of the image
    * Colorization, Split-Brain Autoencoder
* *Visual common sense tasks*
    * Relative patch prediction
    * Jigsaw puzzles
    * Rotation prediction
* **Contrastive Learning** (our focus)
    * word2vec
    * Contrastive Predictive Coding (CPC)
    * Instance Discrimination
    * Simple Framework for Contrastive Learning of Visual Representations (SimCLR), Momentum Contrast (MoCo), Bootstrap Your Own Latent (BYOL)
    

### <img src="https://img.icons8.com/clouds/64/000000/white-noise.png" style="height:50px;display:inline"> Corrupted Version Reconstruction & Visual Common Sense Tasks
---
**Code Demos** - <a href="https://colab.research.google.com/github/rll/deepul/blob/master/demos/lecture7_selfsupervised_demos.ipynb">Self-Supervised Learning Demos</a>

* **Context Encoder** - Try to predict a hidden mask in the image
  * <img src="./assets/context_encoder.PNG" style="height:250px">
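
As a rough illustration of how the training pair for such an in-painting task can be built, the sketch below hides a square region at the center of an image and keeps the hidden pixels as the prediction target. The function name `mask_center`, the mask size and the zero fill value are assumptions for this example, not details taken from the Context Encoder paper.

```python
import numpy as np


def mask_center(image, mask_size):
    """Return (corrupted image, withheld patch) for a square central mask."""
    h, w = image.shape[:2]
    top = (h - mask_size) // 2
    left = (w - mask_size) // 2
    # The withheld pixels are the "label" the network must predict.
    target = image[top:top + mask_size, left:left + mask_size].copy()
    corrupted = image.copy()
    corrupted[top:top + mask_size, left:left + mask_size] = 0.0
    return corrupted, target


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))          # stand-in for a CIFAR-10 sized image
corrupted, target = mask_center(image, mask_size=16)
print(corrupted.shape, target.shape)
```

The network receives `corrupted` as input and is trained to reconstruct `target`, so the supervision signal comes from the image itself.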
  

* The reconstruction is mediocre

<img src="./assets/context_encoder_res.png" style="height:250px">

* We care about the learned representation!
  * Specifically - is it useful for a downstream task?


| Top 1 Accuracy on CIFAR-10| Top 5 Accuracy on CIFAR-10|
|--|--|
|45.77|90.29|

* Did it learn anything?
  * We can look at the nearest neighbors in the latent space to see if that's the case.

<table><tr>
<td> <img src="./assets/context_encoder_NN1.png" style="height:250px"> </td>
<td> <img src="./assets/context_encoder_NN2.png" style="height:250px"> </td>
</tr></table>
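
A nearest-neighbor check like the one above can be sketched in a few lines. This example assumes the features have already been extracted into a matrix with one row per image, and it uses cosine similarity as the distance measure, which is one common choice rather than the only one; the tiny 2-D feature matrix is made up for illustration.

```python
import numpy as np


def nearest_neighbors(features, query_index, k=3):
    """Indices of the k rows most similar (cosine) to the query row."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = normed @ normed[query_index]
    sims[query_index] = -np.inf          # exclude the query itself
    return np.argsort(-sims)[:k]


# Hypothetical latent features for five images.
features = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
    [-1.0, 0.0],
])
neighbors = nearest_neighbors(features, query_index=0, k=2)
print(neighbors)
```

If the representation is meaningful, the retrieved neighbors should be semantically similar to the query image, not just similar in raw pixel values.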


* **Rotation Prediction** - Try to predict the rotation "class" of a given image.
<img src="./assets/rotation_prediction.png" style="height:400px">
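
The pretext labels for rotation prediction come for free: each image is rotated by 0, 90, 180 and 270 degrees and the index of the rotation is the class. Below is a minimal sketch with NumPy; the helper name `make_rotation_batch` and the toy image sizes are assumptions, and square images are assumed so that all rotated copies share one shape.

```python
import numpy as np


def make_rotation_batch(images):
    """Rotate each image by k * 90 degrees (k = 0..3); the label is k."""
    rotated, labels = [], []
    for img in images:
        for k in range(4):
            rotated.append(np.rot90(img, k=k, axes=(0, 1)))
            labels.append(k)
    return np.stack(rotated), np.array(labels)


# Two toy 4x4 RGB "images".
images = np.arange(2 * 4 * 4 * 3, dtype=float).reshape(2, 4, 4, 3)
x_rot, y_rot = make_rotation_batch(images)
print(x_rot.shape, y_rot)
```

A standard 4-way classifier trained on `(x_rot, y_rot)` then has to recognize object orientation, which tends to require some understanding of the image content.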

* What about the learned representations?

| | Top 1 Accuracy on CIFAR-10| Top 5 Accuracy on CIFAR-10|
|--|--|--|
| Context Encoder |45.77|90.29|
| **Rotation Prediction** |79.91|99.12|


<table><tr>
<td> <img src="./assets/rotation_prediction_NN1.png" style="height:250px"> </td>
<td> <img src="./assets/rotation_prediction_NN2.png" style="height:250px"> </td>
</tr></table>

### <img src="https://img.icons8.com/plasticine/100/000000/protect-from-magnetic-field.png" style="height:50px;display:inline"> Contrastive Learning
---
* Contrastive learning formulates the task of **finding similar and dissimilar things for an ML model (basically what classification does when given labels)**.
* Contrastive methods, as the name implies, learn representations by contrasting **positive and negative** examples. 
* Using this approach, one can train a machine learning model to classify between similar and dissimilar images.
<img src="./assets/contrastive_1.png" style="height:150px">
<img src="./assets/contrastive_puzzle.gif" style="height:200px">

* <a href="https://analyticsindiamag.com/contrastive-learning-self-supervised-ml">Image Source</a>



* More formally, for any data point $x$, contrastive methods aim to learn an encoder $f$ such that: 
    * $x^+$ is a data point similar to $x$, referred to as a *positive* sample.
    * $x^-$ is a data point dissimilar to $x$, referred to as a *negative* sample.
    * The **score function** is a metric that measures the similarity between two features: $$\text{score}(f(x), f(x^+))  \gg  \text{score}(f(x), f(x^-))$$
    
    
    
* How can we sample "similar" or "different" images?

* The most common loss function to implement the score paradigm is **InfoNCE** loss, which looks similar to softmax.

<img src="./assets/infonce_loss.png" style="height:100px">

* The denominator terms consist of one positive sample, and N−1 negative samples. 

* Compare this to softmax: $$Softmax(y_i)=\frac{e^{y_i}}{\sum_{j=1}^{M}e^{y_j}},\quad i\in[1,...,M],y\in\mathbb{R}^M$$
  * In InfoNCE, the logit $y_i$ is the dot product between the representations of two images, $u^{T}w=||u||||w||\cos{\angle(u,w)}$, which for normalized representations is the cosine of the angle between them.
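
The softmax analogy can be made explicit in code: InfoNCE is a cross-entropy loss in which the "correct class" is the positive sample. The sketch below handles a single query and uses a plain dot product as the score; the function name `info_nce` and the toy vectors are assumptions for illustration.

```python
import numpy as np


def info_nce(query, positive, negatives):
    """InfoNCE for one query: -log softmax of the positive's score.

    query: (d,), positive: (d,), negatives: (N - 1, d).
    """
    candidates = np.vstack([positive[None, :], negatives])  # positive is row 0
    logits = candidates @ query            # dot-product scores
    logits = logits - logits.max()         # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum())
    return -log_prob[0]


query = np.array([1.0, 0.0])
positive = np.array([1.0, 0.0])
negatives = np.array([[0.0, 1.0], [-1.0, 0.0]])
loss = info_nce(query, positive, negatives)
print(round(float(loss), 4))
```

The loss shrinks when the positive's score grows relative to the negatives' scores, which is exactly the ordering the score function is meant to enforce.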

### <img src="https://img.icons8.com/nolan/64/collapse-arrow.png" style="height:50px;display:inline"> Simple Framework for Contrastive Learning of Visual Representations (SimCLR)
---
* <a href="https://arxiv.org/abs/2002.05709">**Simple Framework for Contrastive Learning of Visual Representations (SimCLR)**</a> is a framework for contrastive learning of *visual* representations. 
* It learns representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss in the latent space.

<img src="./assets/simclr.png" style="height:300px">

* A **stochastic data augmentation module** that transforms any given data example randomly resulting in two correlated views of the same example, denoted $\tilde{x}_i$ and $\tilde{x}_j$, which is considered a **positive pair**.
* SimCLR sequentially applies three simple augmentations: random cropping followed by resize back to the original size, random color distortions, and random Gaussian blur. The authors find that the combination of **random crop and color distortion** is crucial to achieve good performance.
* A neural network base encoder $f(\cdot)$ that extracts **representation vectors** from augmented data examples. The framework allows various choices of the network architecture without any constraints. 
    * For simplicity ResNet is used to obtain $h_i = f(\tilde{x}_i)\in \mathbb{R}^d$ where $h_i$ is the output after the average pooling layer.
* A small neural network projection head $g(\cdot)$ that maps representations to the space where contrastive loss is applied. 
    * An MLP with one hidden layer is used to obtain $z_i=g(h_i)$.
* **The authors find it beneficial to define the contrastive loss on $z_i$’s rather than $h_i$’s**.
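
The stochastic augmentation module can be sketched as a function that is called twice on the same image to produce a positive pair. The version below is deliberately simplified and is an assumption-laden illustration, not the SimCLR implementation: it uses a nearest-neighbor resize, models color distortion as a random brightness and per-channel gain, and omits Gaussian blur. All function names and parameter values are hypothetical.

```python
import numpy as np


def random_crop_resize(image, rng, min_scale=0.5):
    """Random crop, then nearest-neighbor resize back to the original size."""
    h, w = image.shape[:2]
    scale = rng.uniform(min_scale, 1.0)
    ch, cw = max(1, int(h * scale)), max(1, int(w * scale))
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    crop = image[top:top + ch, left:left + cw]
    rows = np.arange(h) * ch // h        # source row for each output row
    cols = np.arange(w) * cw // w        # source column for each output column
    return crop[rows][:, cols]


def color_distort(image, rng, strength=0.4):
    """Simplified color distortion: random brightness and per-channel gain."""
    brightness = 1.0 + rng.uniform(-strength, strength)
    gain = 1.0 + rng.uniform(-strength, strength, size=(1, 1, image.shape[2]))
    return np.clip(image * brightness * gain, 0.0, 1.0)


def two_views(image, rng):
    """Two independently augmented views of one image (a positive pair)."""
    return tuple(color_distort(random_crop_resize(image, rng), rng)
                 for _ in range(2))


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
view_i, view_j = two_views(image, rng)
print(view_i.shape, view_j.shape)
```

Both views are then passed through the same encoder $f(\cdot)$ and projection head $g(\cdot)$ to obtain $z_i$ and $z_j$.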

* A minibatch of $N$ examples is randomly sampled and the contrastive prediction task is defined on pairs of augmented examples derived from the minibatch, resulting in $2N$ data points. 
* Negative examples are not sampled explicitly. Instead, given a positive pair, the other $2(N-1)$ augmented examples within a minibatch are treated as negative examples. 
* The NT-Xent loss (normalized temperature-scaled cross-entropy) is used: $$\ell_{i,j} = -\log{\frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N}\mathbb{1}_{[k\neq i]}\exp(\text{sim}(z_i, z_k)/\tau)}}$$
where $\text{sim}(z_i, z_j) = \frac{z_i^Tz_j}{\left\Vert z_i \right\Vert \left\Vert z_j \right\Vert}$ is the cosine similarity and $\tau$ is a temperature parameter.
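
A minimal NumPy sketch of the NT-Xent computation follows. It assumes that the $2N$ projections are stacked so that rows $2k$ and $2k+1$ are the two views of example $k$, and that the final loss is the average of $\ell_{i,j}$ over all $2N$ anchors; the function name, the pairing convention and the default temperature are assumptions for illustration.

```python
import numpy as np


def nt_xent(z, temperature=0.5):
    """NT-Xent loss for z of shape (2N, d); rows 2k and 2k+1 form a positive pair."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # indicator 1[k != i]
    n = z.shape[0]
    pos = np.arange(n) ^ 1                             # partner index: 0<->1, 2<->3, ...
    sim_max = sim.max(axis=1, keepdims=True)           # for a stable log-sum-exp
    log_denominator = sim_max[:, 0] + np.log(np.exp(sim - sim_max).sum(axis=1))
    loss = -(sim[np.arange(n), pos] - log_denominator)
    return loss.mean()


# Positive pairs aligned vs. positive pairs orthogonal.
z_good = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
z_bad = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(nt_xent(z_good), nt_xent(z_bad))
```

The loss is lower when each view is closest to its own partner, and higher when a view is closer to another example in the batch than to its partner.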

* <a href="https://github.com/sthalles/SimCLR">PyTorch Code</a>

<img src="./assets/simclr_anim.gif" style="height:350px">

* <a href="https://ai.googleblog.com/2020/04/advancing-self-supervised-and-semi.html">Image Source</a>

* Why can't we just use the positive samples?




* How likely is our negative sample to actually be negative?

<img src="./assets/SSCL_probelm.png" style="height:350px"> 


* <a href="https://www.v7labs.com/blog/contrastive-learning-guide">Image Source</a>


* What about the representations?

| | Top 1 Accuracy on CIFAR-10| Top 5 Accuracy on CIFAR-10| Top 1 Accuracy on ImageNet| Top 5 Accuracy on ImageNet|
|--|--|--|--|--|
| Context Encoder |45.77|90.29|||
| Rotation Prediction |79.91|99.12|||
| **SimCLR** |92.84|99.86|69.3|89.0|





<table><tr>
<td> <img src="./assets/simclr_NN1.png" style="height:250px"> </td>
<td> <img src="./assets/simclr_NN2.png" style="height:250px"> </td>
</tr></table>


### <img src="https://img.icons8.com/bubbles/64/000000/knowledge-transfer.png" style="height:50px;display:inline"> Using the Learned Representation for Downstream Tasks
---

* Is classification the only downstream task our learned representation can help?

* Example: **Segmentation** on PASCAL VOC2012 

<img src="./assets/self_sup_seg.png" style="height:400px">


* Images from <a href="https://colab.research.google.com/github/rll/deepul/blob/master/demos/lecture7_selfsupervised_demos.ipynb">Berkeley's Deep Unsupervised Learning Course</a>


### <img src="https://img.icons8.com/officel/80/000000/gyroscope.png" style="height:50px;display:inline"> Momentum Contrast (MoCo)
---
* <a href="https://arxiv.org/abs/1911.05722">**Momentum Contrast (MoCo)**</a> is a self-supervised learning algorithm with a contrastive loss.
* Contrastive loss methods can be thought of as **building dynamic dictionaries**. 
* The **"keys" (tokens)** in the dictionary are sampled from data (e.g., images or patches) and are represented by an encoder network. 
* Unsupervised learning trains encoders (by minimizing a contrastive loss) to perform dictionary look-up: an encoded “query” should be similar to its matching key and dissimilar to others.
* In MoCo, we maintain the dictionary as a queue of data samples: the encoded representations of the current mini-batch are enqueued, and the oldest are dequeued.
*  The queue decouples the dictionary size from the mini-batch size, allowing it to be large.
* Moreover, as the dictionary keys come from the preceding several mini-batches, a slowly progressing key encoder, implemented as a momentum-based moving average of the query encoder, is proposed to maintain consistency.
* <a href="https://github.com/facebookresearch/moco">PyTorch Code</a>
    * <a href="https://colab.research.google.com/github/facebookresearch/moco/blob/colab-notebook/colab/moco_cifar10_demo.ipynb">Colab Demo</a>

* The positive samples part remains the same as in SimCLR.
* For the negative samples - We build a **momentum encoder**
  * This encoder's architecture is the same as the normal encoder, but its weights are an **exponential moving average**: 
  $$W_{\text{momentum encoder}}^{k} =\beta \cdot W_{\text{momentum encoder}}^{k-1} + (1-\beta)\cdot W_{\text{encoder}}$$
  * No gradients are backpropagated through the momentum encoder, so we can save memory and apply it to a large number of negative samples.
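
The two ingredients, the exponential moving average and the queue of keys, can be sketched as follows. This is a toy illustration under stated assumptions: the "encoders" are single weight matrices, the gradient step on the query encoder is replaced by random noise, and the queue length, batch size and $\beta$ are arbitrary small values chosen so the example runs instantly.

```python
from collections import deque

import numpy as np


def momentum_update(w_momentum, w_encoder, beta=0.999):
    """Exponential moving average of the query-encoder weights."""
    return beta * w_momentum + (1.0 - beta) * w_encoder


rng = np.random.default_rng(0)
w_encoder = rng.normal(size=(4, 3))      # toy query encoder (4-D input, 3-D output)
w_momentum = w_encoder.copy()            # key encoder starts as a copy

queue = deque(maxlen=6)                  # dictionary of keys; oldest are dropped

for step in range(5):
    # Stand-in for one gradient step on the query encoder.
    w_encoder = w_encoder + 0.1 * rng.normal(size=w_encoder.shape)
    # The key encoder follows slowly; it is never updated by gradients.
    w_momentum = momentum_update(w_momentum, w_encoder, beta=0.9)
    batch = rng.normal(size=(2, 4))
    keys = batch @ w_momentum            # encode the current mini-batch as keys
    queue.extend(keys)                   # enqueue new keys, dequeue the oldest

negatives = np.stack(queue)
print(negatives.shape)
```

Because the queue holds keys from several past mini-batches, the number of negatives is no longer tied to the batch size, and the slow update keeps old and new keys reasonably consistent with each other.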

<img src="./assets/moco.png" style="height:350px">

| Short training| Top 1 Accuracy on ImageNet| Batch size|
|--|--|--|
| SimCLR |61.9| 256|
| SimCLR |66.6| 8192|
| **MoCo** |**60.6**|**256**|
| MoCo v2|67.5|256|

| Long training| Top 1 Accuracy on ImageNet| Batch size|
|--|--|--|
| SimCLR |69.3| 4096|
| MoCo v2 |71.1|256|

* Both MoCo and SimCLR can be categorized (along with other newer methods such as SwAV, BYOL, etc.) as Instance Based (Discrimination) Contrastive Learning.

### <img src="https://img.icons8.com/pastel-glyph/64/000000/qr-code--v2.png" style="height:50px;display:inline"> Contrastive Predictive Coding (CPC)
---
* <a href="https://arxiv.org/abs/1807.03748">**Contrastive Predictive Coding (CPC)**</a> learns self-supervised **representations** (Coding) by **predicting the future** (Predictive) in a learned *latent space* by using powerful autoregressive models that **Contrast** (Contrastive) "right" and "wrong" sequences. 
* The model uses a probabilistic contrastive loss which induces the latent space to capture information that is maximally useful to predict future samples.



<img src="./assets/cpc.png"  style="height:400px">

* <a href="https://github.com/davidtellez/contrastive-predictive-coding">Image Source</a>

<img src="./assets/cpc2.png"  style="height:400px">

1. A non-linear encoder $g_{enc}$ maps the input sequence of observations $x_t$ to a sequence of latent representations $z_t = g_{enc}(x_t)$, potentially with a lower resolution.
2. An autoregressive model $g_{ar}$ summarizes all $z_{\leq t}$ in the latent space and produces a context latent representation $c_t=g_{ar}(z_{\leq t})$.
  * In the original paper they used a single sample window $k=1$ (Predict the future based on the last sample)
3. Compute the InfoNCE loss between the future code $z_{t+k}$ and the predicted future code $\hat{z}_{t+k}$ based on the context $c_t$
  * Remember in InfoNCE we have $f(x)^{T}f(x_j)$, so here $f_k(x_{t+k}, c_t) = \exp(z^T_{t+k}W_{k}c_{t})$
    * $f$ is modeled to preserve the mutual information between $x_{t+k}$ and $c_t$ ($f_k(x_{t+k}, c_t) \propto \frac{\mathbb{P}(x_{t+k}|c_t)}{\mathbb{P}(x_{t+k})}$)
    * $W_k$ are learned weights, $f$ can be unnormalized (does not have to integrate to 1)
  * Negative samples are an incorrect "future prediction"
* Any type of encoder and autoregressive model can be used. 
    * For example: strided convolutional layers for the encoder, and an RNN such as a GRU for the autoregressive model.
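
Step 3 can be illustrated with the log-bilinear score $f_k = \exp(z^T_{t+k}W_{k}c_{t})$ plugged into InfoNCE. In this sketch the latent codes and the context vector are given directly as small arrays (no encoder or autoregressive model is trained), and the function name and toy values are assumptions for illustration.

```python
import numpy as np


def cpc_info_nce(z_future, z_negatives, c_t, w_k):
    """InfoNCE with the score f_k(x, c_t) = exp(z^T W_k c_t)."""
    prediction = w_k @ c_t                         # W_k c_t: predicted future code
    candidates = np.vstack([z_future[None, :], z_negatives])  # true future is row 0
    log_scores = candidates @ prediction           # log f_k for every candidate
    log_scores = log_scores - log_scores.max()     # numerical stability
    return -(log_scores[0] - np.log(np.exp(log_scores).sum()))


c_t = np.array([1.0, 0.0, 0.0])                    # context from g_ar (3-D here)
w_k = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0]])                  # maps context to latent space (2-D)
z_future = np.array([1.0, 0.0])                    # true code z_{t+k}
z_negatives = np.array([[0.0, 1.0], [-1.0, 0.0]])  # codes of "wrong futures"
cpc_loss = cpc_info_nce(z_future, z_negatives, c_t, w_k)
print(round(float(cpc_loss), 4))
```

A separate matrix $W_k$ is learned for each prediction offset $k$, so the same context $c_t$ can be scored against codes at different future steps.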
    
* <a href="https://github.com/jefflai108/Contrastive-Predictive-Coding-PyTorch">PyTorch Code</a>

* How can we "predict the future" in images?

<img src="./assets/cpc_future.gif"  style="height:300px">

* <a href="https://towardsdatascience.com/a-framework-for-contrastive-self-supervised-learning-and-designing-a-new-approach-3caab5d29619">Image Source</a>

1. Create a sequence from the image: Split the image into overlapping patches, and model rows of patches from top to bottom as a sequence.
2. Use an encoder $g_{enc}$ that's fit for images (e.g. ResNet-50 architecture)
3. Use an autoregressive model fit for images (e.g. PixelCNN) to create a context vector $c_t$ from the first $k$ rows.
4. Compute the InfoNCE loss between the context $c_t$ and the predicted rows $z_{t+k}$
  * Negative samples will be incorrect rows
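
Step 1 can be sketched as a sliding-window patch extraction that returns a grid of overlapping patches; the top rows then serve as context and the rows below as the "future". The image size, patch size, stride and the split point below are toy values chosen for illustration, not the settings used in the CPC paper.

```python
import numpy as np


def extract_patch_grid(image, patch_size, stride):
    """Grid of overlapping patches, shape (n_rows, n_cols, patch, patch, channels)."""
    h, w = image.shape[:2]
    rows = range(0, h - patch_size + 1, stride)
    cols = range(0, w - patch_size + 1, stride)
    grid = [[image[r:r + patch_size, c:c + patch_size] for c in cols]
            for r in rows]
    return np.array(grid)


rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
patches = extract_patch_grid(image, patch_size=16, stride=8)   # 50% overlap

# Treat rows of patches, top to bottom, as a sequence.
context_rows, future_rows = patches[:4], patches[4:]
print(patches.shape, context_rows.shape, future_rows.shape)
```

Each patch is encoded separately by $g_{enc}$, so the autoregressive model only ever sees the codes of the context rows when predicting the codes of the rows below.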

<img src="./assets/cpc_images.png"  style="height:350px">

### <img src="https://img.icons8.com/clouds/80/000000/performance-2.png" style="height:50px;display:inline"> Performance Comparison
---
Performance on ImageNet (linear evaluation) using ResNet-50 and ResNet-200 (2×), compared to other unsupervised and supervised (Sup.) baselines:


<table><tr>
<td> <img src="./assets/self_supervised_perf1.png" style="height:350px"> </td>
<td> <img src="./assets/self_supervised_perf2.png" style="height:350px"> </td>
</tr></table>

### <img src="https://img.icons8.com/bubbles/50/000000/video-playlist.png" style="height:50px;display:inline"> Recommended Videos
---
#### <img src="https://img.icons8.com/cute-clipart/64/000000/warning-shield.png" style="height:30px;display:inline"> Warning!
* These videos do not replace the lectures and tutorials.
* Please use these to get a better understanding of the material, and not as an alternative to the written material.

#### Video By Subject

* General Self-Supervised Learning - <a href="https://www.youtube.com/watch?v=dMUes74-nYY">Lecture 7 Self-Supervised Learning - UC Berkeley Spring 2020 - CS294-158 Deep Unsupervised Learning</a>
* SimCLR - <a href="https://www.youtube.com/watch?v=APki8LmdJwY">SimCLR Explained!</a>
* MoCo - <a href="https://www.youtube.com/watch?v=LvHwBQF14zs">Momentum Contrastive Learning</a>


## <img src="https://img.icons8.com/dusk/64/000000/prize.png" style="height:50px;display:inline"> Credits
---
* EE 046211 Winter 22 - Original Tutorial - <a href="https://taldatech.github.io/">Tal Daniel</a> 
* EE 046746 Spring 22 - <a href="https://github.com/HilaManor">Hila Manor</a>
* Icons made by <a href="https://www.flaticon.com/authors/becris" title="Becris">Becris</a> from <a href="https://www.flaticon.com/" title="Flaticon">www.flaticon.com</a>
* Icons from <a href="https://icons8.com/">Icons8.com</a> - https://icons8.com
* <a href="https://sites.google.com/view/berkeley-cs294-158-sp20/home">Berkeley's CS294-158-SP20-Deep Unsupervised Learning</a>
* <a href="http://cs231n.stanford.edu/2021/">Stanford's CS231n (Spring 2021) - Convolutional Neural Networks for Visual Recognition</a>
* <a href="https://paperswithcode.com/method/contrastive-predictive-coding">Contrastive Predictive Coding</a>
* <a href="https://paperswithcode.com/method/simclr"> Simple Framework for Contrastive Learning of Visual Representations (SimCLR)</a>
* <a href="https://paperswithcode.com/method/moco">Momentum Contrast</a>
* <a href="https://arxiv.org/pdf/2009.00104.pdf">A Framework For Contrastive Self-Supervised Learning And Designing A New Approach</a>
