<a href="https://colab.research.google.com/github/spaceml-org/Curator-Unlabeled-Image-Search-Guide/blob/main/notebooks/SSL%2BImage_Similarity_Search%2BActive_Labeler.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This demo shows how to:
1. train a model with Self-Supervised Learner (SSL)
2. find similar images with Image Similarity Search
3. improve our model with Active Labeler

[Notice]
*   Image Similarity Search and Swipe Labeler run on your local computer, not in the Colab notebook.
*   This demo uses the [UC Merced Land Use dataset](https://weegee.vision.ucmerced.edu/datasets/landuse.html). Although the dataset has labels, we set it up as if it were unlabeled to demonstrate how to use an unlabeled dataset in this pipeline.


# 1. Self-Supervised Learner

## 1-1. Install packages & SSL

In [1]:
#installs
!pip install -q split-folders
!pip install -q torch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 torchtext==0.6.0
!pip install -q pytorch-lightning==1.1.8
!pip install -q pytorch-lightning-bolts
!pip install -q --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda100
!pip install -q wandb
!pip install -q annoy


In [2]:
import os
import torch

#not logging on wandb in this demo
os.environ['WANDB_MODE']='disabled'

In [3]:
#additional imports for Active Labeler
import pathlib
from imutils import paths
import shutil

In [4]:
!rm -rf SSL
!git clone --branch simsiam https://github.com/spaceml-org/Self-Supervised-Learner.git
!mv Self-Supervised-Learner SSL

Cloning into 'Self-Supervised-Learner'...
remote: Enumerating objects: 2817, done.
remote: Counting objects: 100% (290/290), done.
remote: Compressing objects: 100% (269/269), done.
remote: Total 2817 (delta 175), reused 38 (delta 21), pack-reused 2527
Receiving objects: 100% (2817/2817), 11.95 MiB | 27.08 MiB/s, done.
Resolving deltas: 100% (1765/1765), done.


## 1-2. Preparing dataset

Before training with Self-Supervised Learner, make sure the data follows the folder structure below:
```
/Dataset
    /Class 1
        Image1.png
        Image2.png
    /Class 2
        Image3.png
        Image4.png
```
If the data has no labels, put all images in a single subfolder:
```
/Dataset
    /Unlabeled
        Image1.png
        Image2.png
        Image3.png
        Image4.png
```

The UC Merced Land Use dataset follows the first (labeled) structure. In this demo, we convert it to the second structure so that it is treated as an unlabeled dataset.
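
As a quick check before training, a sketch like the one below reports which of the two layouts a dataset folder follows. The helper names `describe_layout` and `looks_unlabeled` and the extension list are illustrative assumptions, not part of Self-Supervised Learner.

```python
import os

# Assumed set of image extensions; adjust to your data.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}

def describe_layout(dataset_dir):
    """Return {subfolder_name: image_count} for each immediate subfolder."""
    counts = {}
    for entry in sorted(os.listdir(dataset_dir)):
        sub = os.path.join(dataset_dir, entry)
        if not os.path.isdir(sub):
            continue  # ignore loose files at the top level
        counts[entry] = sum(
            1 for name in os.listdir(sub)
            if os.path.splitext(name)[1].lower() in IMAGE_EXTS
        )
    return counts

def looks_unlabeled(counts):
    """Treat a dataset with exactly one subfolder as the unlabeled layout."""
    return len(counts) == 1
```

For example, `looks_unlabeled(describe_layout('/content/Dataset'))` should be `True` once the images have been copied into a single subfolder.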

In [5]:
#download UC Merced Land Use dataset
!gdown http://weegee.vision.ucmerced.edu/datasets/UCMerced_LandUse.zip
!unzip -qq UCMerced_LandUse.zip

Downloading...
From: http://weegee.vision.ucmerced.edu/datasets/UCMerced_LandUse.zip
To: /content/UCMerced_LandUse.zip
100% 332M/332M [00:09<00:00, 34.5MB/s]


In [6]:
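#convert from tif to jpg (Swipe Labeler and Active Labeler do not support .tif files)
from PIL import Image
for img in list(paths.list_images('/content/UCMerced_LandUse/Images')):
  if not img.lower().endswith(('.tif', '.tiff')):
    continue  #skip files that are already converted, so re-running the cell does not delete them
  #os.path.splitext strips only the extension, even if the path contains other dots
  Image.open(img).convert('RGB').save(os.path.splitext(img)[0] + '.jpg', "JPEG", quality = 100)
  os.remove(img)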

In [7]:
#create an unlabeled image folder and copy all UC Merced dataset images into that folder
folder = '/content/Dataset/Unlabeled'
if os.path.exists(folder):
    shutil.rmtree(folder)
pathlib.Path(folder).mkdir(parents=True, exist_ok=True)
for i in paths.list_images('/content/UCMerced_LandUse/Images'):
  shutil.copy(i, os.path.join(folder, os.path.basename(i)))
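
Copying every image into one folder silently overwrites files that share a name. UC Merced file names embed the class name (for example `beach43.jpg`), so this should not happen here, but on other datasets it may be worth checking before flattening. The helper below is a hypothetical sketch using only the standard library.

```python
import os
from collections import defaultdict

def find_basename_collisions(root):
    """Map each file name that occurs more than once under root to its full paths."""
    seen = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            seen[name].append(os.path.join(dirpath, name))
    # Keep only names that appear in more than one location.
    return {name: sorted(found) for name, found in seen.items() if len(found) > 1}
```

An empty result means the class folders can be merged into one folder without losing images.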

## 1-3. Training self-supervised learning model



In [8]:
#run this cell to check information regarding arguments
!python /content/SSL/train.py --help

usage: train.py [-h] [--DATA_PATH DATA_PATH] [--VAL_PATH VAL_PATH]
                [--model MODEL] [--batch_size BATCH_SIZE] [--cpus CPUS]
                [--hidden_dim HIDDEN_DIM] [--epochs EPOCHS]
                [--learning_rate LEARNING_RATE] [--patience PATIENCE]
                [--val_split VAL_SPLIT] [--withhold_split WITHHOLD_SPLIT]
                [--gpus GPUS] [--log_name LOG_NAME] [--image_size IMAGE_SIZE]
                [--resize RESIZE] [--technique TECHNIQUE] [--seed SEED]

optional arguments:
  -h, --help            show this help message and exit
  --DATA_PATH DATA_PATH
                        path to folders with images to train on.
  --VAL_PATH VAL_PATH   path to validation folders with images
  --model MODEL         model to initialize. Can accept model checkpoint or
                        just encoder name from models.py
  --batch_size BATCH_SIZE
                        batch size for SSL
  --cpus CPUS           number of cpus to use to fetch data
  --hidden_dim H

In [9]:
#train an encoder
!python /content/SSL/train.py --technique SIMCLR --DATA_PATH /content/Dataset --model minicnn32 --batch_size 32 --learning_rate 1e-3 --log_name ssl --image_size 256 --epochs 50

Automatically splitting data into train and validation data...
Copying files: 2100 files [00:00, 3300.15 files/s]
warmup
Model architecture successfully loaded
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
In DALI 1.0 all readers were moved into a dedicated :mod:`~nvidia.dali.fn.readers`
submodule and renamed to follow a common pattern. This is a placeholder operator with identical
functionality to allow for backward compatibility.
  op_instances.append(_OperatorInstance(input_set, self, **kwargs))
In DALI 1.0 all decoders were moved into a dedicated :mod:`~nvidia.dali.fn.decoders`
submodule and renamed to follow a common pattern. This is a placeholder operator with identical
functionality to allow for backward compatibility.
  op_instances.append(_OperatorInstance(input_set, self, **kwargs))
read 1680 files from 1 directories
In DALI 1.0 all readers were moved into a dedicated :mod:`~nvidia.dali.fn.r

In [10]:
# Load model
%cd /content/SSL
from models import SIMCLR, SIMSIAM, CLASSIFIER
%cd /content/

model = SIMCLR.SIMCLR.load_from_checkpoint('/content/models/SIMCLR_ssl.ckpt')

/content/SSL
/content
warmup


# 2. Image Similarity Search (prep)

## 2-1. Download dataset

Downloading many files or folders from a Colab notebook to your computer can take a long time, so we recommend downloading a single zip archive and unzipping it locally.

If you only want to use Image Similarity Search, you can download 'UCMerced_LandUse.zip' as is. If you also want to use Swipe Labeler or Active Labeler, run the code cell below and download 'UCMerced_LandUse_jpg_ver.zip' instead, because the Labelers do not accept .tif image files.

In [11]:
!zip -r UCMerced_LandUse_jpg_ver.zip /content/Dataset/Unlabeled

  adding: content/Dataset/Unlabeled/ (stored 0%)
  adding: content/Dataset/Unlabeled/chaparral28.jpg (deflated 1%)
  adding: content/Dataset/Unlabeled/freeway06.jpg (deflated 0%)
  adding: content/Dataset/Unlabeled/beach43.jpg (deflated 1%)
  adding: content/Dataset/Unlabeled/tenniscourt57.jpg (deflated 0%)
  adding: content/Dataset/Unlabeled/forest68.jpg (deflated 1%)
  adding: content/Dataset/Unlabeled/chaparral72.jpg (deflated 1%)
  adding: content/Dataset/Unlabeled/denseresidential89.jpg (deflated 1%)
  adding: content/Dataset/Unlabeled/overpass63.jpg (deflated 1%)
  adding: content/Dataset/Unlabeled/forest33.jpg (deflated 1%)
  adding: content/Dataset/Unlabeled/sparseresidential89.jpg (deflated 1%)
  adding: content/Dataset/Unlabeled/runway67.jpg (deflated 1%)
  adding: content/Dataset/Unlabeled/chaparral84.jpg (deflated 1%)
  adding: content/Dataset/Unlabeled/intersection24.jpg (deflated 1%)
  adding: content/Dataset/Unlabeled/baseballdiamond21.jpg (deflated 1%)
  adding: content

## 2-2. Download model

The Image Similarity Search app needs a model file in .pt or .pth format. Because SSL saves the model as a .ckpt file by default, we convert it to a .pt file.

In [12]:
# check torch size
model.local_rank = 0
model.setup(stage = 'inference') #we set up inference with this call to instantiate the DALI data pipeline
model.eval()
model.cuda()

for batch in model.inference_dataloader:
    print(len(batch))
    print(batch[0].shape)
    break

In DALI 1.0 all readers were moved into a dedicated :mod:`~nvidia.dali.fn.readers`
submodule and renamed to follow a common pattern. This is a placeholder operator with identical
functionality to allow for backward compatibility.
  op_instances.append(_OperatorInstance(input_set, self, **kwargs))
In DALI 1.0 all decoders were moved into a dedicated :mod:`~nvidia.dali.fn.decoders`
submodule and renamed to follow a common pattern. This is a placeholder operator with identical
functionality to allow for backward compatibility.
  op_instances.append(_OperatorInstance(input_set, self, **kwargs))


1
torch.Size([32, 3, 256, 256])


In [13]:
# pass the batch shape printed above to torch.ones
# the file saved here requires a GPU on the computer that runs Image Similarity Search
# if you don't have a GPU, run the next cell to get a CPU version of the .pt file

with torch.no_grad():
    x = torch.ones((32, 3, 256, 256)).cuda()  #a single datapoint would have shape (1, 3, 256, 256)
    traced_cell = torch.jit.trace(model, x)
torch.jit.save(traced_cell, "UCMerced_simclr_minicnn32_50epochs.pt") #change the file name as you like

In [14]:
# generate cpu version .pt file
with torch.no_grad():
    x = torch.ones((32, 3, 256, 256)).cpu()
    traced_cell = torch.jit.trace(model.cpu(), (x))
torch.jit.save(traced_cell, "UCMerced_simclr_minicnn32_50epochs_cpu.pt")

Now download the .pt file from the Colab notebook file directory to your computer.

## 2-3. Check output embedding size of the model

The Image Similarity Search app requires the embedding size as an input, so we check the output embedding size of our SSL model.

In [15]:
# check layers
model = SIMCLR.SIMCLR.load_from_checkpoint('/content/models/SIMCLR_ssl.ckpt')
model.eval()
model.cuda()

warmup


SIMCLR(
  (projection): Projection(
    (model): Sequential(
      (0): Linear(in_features=32, out_features=128, bias=True)
      (1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
      (3): Linear(in_features=128, out_features=128, bias=False)
    )
  )
  (encoder): miniCNN(
    (conv1): Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (conv2): Conv2d(16, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (conv3): Conv2d(32, 48, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (adaptive_pool): AdaptiveAvgPool2d(output_size=(16, 16))
    (conv4): Conv2d(48, 64, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
    (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (flatten): Flatten(start_dim=1, end_dim=-1)
    (fc1): Linear(in_features=1024, out_features=32, bias=True)
  )
)

In this demo, the output embedding size is 32, which is the `out_features` value of the encoder's final layer (`fc1`).
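
The layer shapes in the printout can be cross-checked with a little arithmetic. The sketch below assumes the encoder applies its layers in the order printed (a module printout lists layers in definition order, which need not match the forward pass) and uses the standard output-size formula for convolution and pooling layers with dilation 1. The result matching `fc1`'s `in_features=1024` is consistent with that assumption.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a Conv2d/MaxPool2d layer (dilation 1, floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 256                          # input images are 256 x 256
size = conv_out(size, 3, 1, 1)      # conv1: 256
size = conv_out(size, 3, 2, 1)      # conv2: 128
size = conv_out(size, 3, 2, 1)      # conv3: 64
size = 16                           # adaptive_pool forces a 16 x 16 output
size = conv_out(size, 5, 2, 2)      # conv4: 8
size = conv_out(size, 2, 2, 0)      # pool: 4

flattened = 64 * size * size        # conv4 has 64 output channels
embedding_size = 32                 # fc1 out_features, taken from the printout
print(flattened, embedding_size)    # 1024 32
```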

## 2-4. Run Image Similarity Search app

Now follow [this guide](https://github.com/spaceml-org/Curator-Unlabeled-Image-Search-Guide/blob/main/Single_Usage_Guide/Image_Similarity_Search.md) to set up and run the Image Similarity Search app on your computer. 

<img width="854" alt="ISS_screenshot" src="https://user-images.githubusercontent.com/66165810/134059552-f64b23da-ecfe-40f7-aff5-5730dc9f2a78.PNG">
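
For intuition about what a similarity search does with the embeddings, here is a minimal sketch that ranks images by cosine similarity to a query embedding. It is an exact brute-force search written for illustration only; the app's own indexing method is not covered in this notebook, and the three-dimensional embedding values below are made up (the demo model produces 32-dimensional embeddings).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_similar(query, embeddings, k=3):
    """Return up to k (name, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Toy embeddings with made-up values.
embeddings = {
    "beach43.jpg": [0.9, 0.1, 0.0],
    "forest68.jpg": [0.0, 1.0, 0.1],
    "runway67.jpg": [0.1, 0.0, 1.0],
}
print(top_k_similar([1.0, 0.2, 0.0], embeddings, k=2))
```

Every vector in the index must have the same length as the query, which is one reason the embedding size has to be known up front.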

# 3. Active Labeler

## 3-1. Code setup 

In [16]:
%cd "/content"
import os
import shutil
if os.path.exists('/content/Active-Labeler'):
  shutil.rmtree('/content/Active-Labeler')

!git clone https://github.com/spaceml-org/Active-Labeler.git

/content
Cloning into 'Active-Labeler'...
remote: Enumerating objects: 2139, done.
remote: Counting objects: 100% (2139/2139), done.
remote: Compressing objects: 100% (1550/1550), done.
remote: Total 2139 (delta 679), reused 1955 (delta 565), pack-reused 0
Receiving objects: 100% (2139/2139), 24.19 MiB | 28.31 MiB/s, done.
Resolving deltas: 100% (679/679), done.


## 3-2. Run Active Labeler

Open Swipe Labeler through the link printed by the `proxyPort` cell below, and label the images that the model is most uncertain about.


**[Note]**
This link only works in Colab. If you run the Active Labeler tool on your local machine, it prints a different link (http://0.0.0.0:5000/); use that one instead.
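
"Most uncertain" is commonly taken to mean the images whose predicted probability lies closest to the decision boundary of a binary (positive/negative) classifier. The sketch below illustrates that generic idea, known as uncertainty sampling; it is not necessarily how Active Labeler selects images, and the probabilities are made up.

```python
def most_uncertain(probabilities, n):
    """Return the n image names whose positive-class probability is closest to 0.5."""
    ranked = sorted(probabilities, key=lambda name: abs(probabilities[name] - 0.5))
    return ranked[:n]

# Made-up positive-class probabilities, for illustration only.
probs = {
    "beach43.jpg": 0.97,
    "forest68.jpg": 0.52,
    "runway67.jpg": 0.08,
    "freeway06.jpg": 0.41,
}
print(most_uncertain(probs, 2))  # ['forest68.jpg', 'freeway06.jpg']
```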



In [17]:
#generate Swipe Labeler link
from google.colab.output import eval_js
print(eval_js("google.colab.kernel.proxyPort(5000)"))

https://18k5nkmv9vej-496ff2e9c6d22116-5000-colab.googleusercontent.com/


In [18]:
config_path = "/content/Active-Labeler/pipeline_config.yaml"
import sys
sys.path.insert(0, "/content/Active-Labeler")
from pipeline import Pipeline
pipeline = Pipeline(config_path)

Initialization
Load Config


In [22]:
pipeline.main()

warmup


  0%|          | 0/2100 [00:00<?, ?it/s]


Got embeddings. Embedding Shape: torch.Size([2100, 32])
Annoy file stored at  /content/runtime/annoy_file.ann

----- iteration: 1
Enter n closest, 0 to stop
4

 3 images to label.
 3 labeled: 2 Pos 1 Neg

----- iteration: 2
Enter n closest, 0 to stop
4

 0 images to label.
 0 labeled: 0 Pos 0 Neg

----- iteration: 3
Enter n closest, 0 to stop
0
warmup
iteration 1
Enter l for Linear, f for finetuning and q to quit
l
Epoch 0/49
----------
train Loss: 25.3412 Acc: 0.0000
val Loss: 100.0000 Acc: 0.0000

Epoch 1/49
----------
train Loss: 0.0000 Acc: 4.0000
val Loss: 100.0000 Acc: 0.0000

Epoch 2/49
----------
train Loss: 0.0000 Acc: 4.0000
val Loss: 100.0000 Acc: 0.0000

Epoch 3/49
----------
train Loss: 0.0000 Acc: 4.0000
val Loss: 100.0000 Acc: 0.0000

Epoch 4/49
----------
train Loss: 0.0000 Acc: 4.0000
val Loss: 100.0000 Acc: 0.0000

Epoch 5/49
----------
train Loss: 0.0000 Acc: 4.0000
val Loss: 100.0000 Acc: 0.0000

Epoch 6/49
----------
train Loss: 0.0000 Acc: 4.0000
val Loss: 100.00

100%|██████████| 17/17 [00:00<00:00, 659.40it/s]


 20 images to label.





 20 labeled: 0 Pos 20 Neg
Total Images: 2 + 0 = 2 positive || 1 + 20 = 21 negative
iteration 2
Enter l for Linear, f for finetuning and q to quit
f
Epoch 0/49
----------
train Loss: 8.9105 Acc: 0.0000
val Loss: 0.5222 Acc: 0.0000

Epoch 1/49
----------
train Loss: 13.2051 Acc: 0.0000
val Loss: 0.0000 Acc: 0.8000

Epoch 2/49
----------
train Loss: 200.0000 Acc: 13.0000
val Loss: 0.0000 Acc: 1.0000

Epoch 3/49
----------
train Loss: 200.0000 Acc: 16.0000
val Loss: 0.0000 Acc: 1.0000

Epoch 4/49
----------
train Loss: 200.0000 Acc: 16.0000
val Loss: 0.0000 Acc: 1.0000

Epoch 5/49
----------
train Loss: 200.0000 Acc: 16.0000
val Loss: 0.0000 Acc: 1.0000

Epoch 6/49
----------
train Loss: 200.0000 Acc: 16.0000
val Loss: 0.0000 Acc: 1.0000

Epoch 7/49
----------
train Loss: 200.0000 Acc: 16.0000
val Loss: 0.0000 Acc: 1.0000

Epoch 8/49
----------
train Loss: 200.0000 Acc: 16.0000
val Loss: 0.0000 Acc: 1.0000

Epoch 9/49
----------
train Loss: 200.0000 Acc: 16.0000
val Loss: 0.0000 Acc: 1.000

  0%|          | 0/2100 [00:00<?, ?it/s]


Got embeddings. Embedding Shape: torch.Size([2100, 32])
Annoy file stored at  /content/runtime/annoy_file.ann


100%|██████████| 17/17 [00:00<00:00, 488.26it/s]


 20 images to label.





 0 labeled: 0 Pos 0 Neg
Total Images: 2 + 0 = 2 positive || 21 + 0 = 21 negative
iteration 3
Enter l for Linear, f for finetuning and q to quit
q
