<a href="https://colab.research.google.com/github/spaceml-org/Curator-Unlabeled-Image-Search-Guide/blob/main/notebooks/SSL%2BImage_Similarity_Search%2BActive_Labeler.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This demo shows how to
1. train a model with Self-Supervised Learner (SSL)
2. find similar images with Image Similarity Search
3. improve our model with Active Labeler

[Notice]
*   Image Similarity Search and Swipe Labeler run on your local computer, not in the Colab notebook.
*   This demo uses the [UC Merced Land Use dataset](https://weegee.vision.ucmerced.edu/datasets/landuse.html). Although the dataset has labels, we set it up as if it were unlabeled to demonstrate how the pipeline handles an unlabeled dataset.


# 1. Self-Supervised Learner

## 1-1. Install packages & SSL

In [None]:
#installs
!pip install -q split-folders
!pip install -q torch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 torchtext==0.6.0
!pip install -q pytorch-lightning==1.1.8
!pip install -q pytorch-lightning-bolts
!pip install -q --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda100
!pip install -q wandb
!pip install -q annoy


In [None]:
import os
import itertools
import shutil
import PIL
import matplotlib.pyplot as plt
import torch
from torch import nn
from tqdm.notebook import tqdm
from torchvision import transforms
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader
import numpy as np
import PIL.Image as Image
import time

#not logging on wandb in this demo
os.environ['WANDB_MODE']='disabled'

In [None]:
#additional imports for Active Labeler
import glob
import pathlib
import random
import shutil
import sys
from pathlib import Path
from shutil import copyfile

import pandas as pd
from imutils import paths
from torchvision.datasets import ImageFolder

class_chosen = "island"
seed = 100

random.seed(seed)

In [None]:
!rm -rf SSL
!git clone --branch simsiam https://github.com/spaceml-org/Self-Supervised-Learner.git
!mv Self-Supervised-Learner SSL

Cloning into 'Self-Supervised-Learner'...
remote: Enumerating objects: 2817, done.
remote: Counting objects: 100% (290/290), done.
remote: Compressing objects: 100% (269/269), done.
remote: Total 2817 (delta 175), reused 38 (delta 21), pack-reused 2527
Receiving objects: 100% (2817/2817), 11.95 MiB | 26.04 MiB/s, done.
Resolving deltas: 100% (1765/1765), done.


## 1-2. Preparing dataset

Before training with Self-Supervised Learner, make sure the data is in the folder structure below:
```
/Dataset
    /Class 1
        Image1.png
        Image2.png
    /Class 2
        Image3.png
        Image4.png
```
In case there is no label, organize directories like this:
```
/Dataset
    /Unlabeled
        Image1.png
        Image2.png
        Image3.png
        Image4.png
```

The UC Merced Land Use dataset is organized like the former. In this demo, however, we'll change the folder structure to the latter so that the dataset is treated as unlabeled.
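Before training, it can help to confirm that a dataset folder really follows one of the two layouts above. The following is a minimal sketch using only the Python standard library; `describe_layout` and `IMAGE_EXTENSIONS` are hypothetical names introduced here for illustration and are not part of Self-Supervised Learner.

```python
import tempfile
from pathlib import Path

# Assumption: these are the image extensions we care about for this check.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp"}


def describe_layout(dataset_dir):
    """Return {subfolder name: image count} for a Dataset/<folder>/<image> layout.

    Raises ValueError if images sit directly in dataset_dir or if no
    subfolder contains any image.
    """
    root = Path(dataset_dir)
    loose = [p for p in root.iterdir()
             if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS]
    if loose:
        raise ValueError(f"{len(loose)} image(s) sit directly in {root}; "
                         "move them into a subfolder such as 'Unlabeled'")
    counts = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        n = sum(1 for f in sub.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS)
        if n:
            counts[sub.name] = n
    if not counts:
        raise ValueError(f"no image subfolders found in {root}")
    return counts


# Build a throwaway unlabeled dataset and inspect it.
with tempfile.TemporaryDirectory() as tmp:
    unlabeled = Path(tmp) / "Dataset" / "Unlabeled"
    unlabeled.mkdir(parents=True)
    for i in range(4):
        (unlabeled / f"Image{i + 1}.png").touch()
    demo_counts = describe_layout(Path(tmp) / "Dataset")
    print(demo_counts)
```

A single key in the result corresponds to the unlabeled layout; several keys correspond to the per-class layout.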

In [None]:
#download UC Merced Land Use dataset
!gdown http://weegee.vision.ucmerced.edu/datasets/UCMerced_LandUse.zip
!unzip -qq UCMerced_LandUse.zip

Downloading...
From: http://weegee.vision.ucmerced.edu/datasets/UCMerced_LandUse.zip
To: /content/UCMerced_LandUse.zip
100% 332M/332M [00:07<00:00, 45.6MB/s]


In [None]:
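#convert from tif to jpg (Swipe Labeler and Active Labeler do not accept .tif files)
for img in list(paths.list_images('/content/UCMerced_LandUse/Images')):
  if not img.lower().endswith(('.tif', '.tiff')):
    continue  #skip files that are already converted, so re-running the cell does not delete them
  #splitext strips only the final extension, so dots elsewhere in the path are preserved
  jpg_path = os.path.splitext(img)[0] + '.jpg'
  Image.open(img).convert('RGB').save(jpg_path, "JPEG", quality = 100)
  os.remove(img)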

In [None]:
#create an unlabeled image folder and copy all UC Merced dataset images into that folder
folder = '/content/Dataset/Unlabeled'
if os.path.exists(folder):
    shutil.rmtree(folder)
pathlib.Path(folder).mkdir(parents=True, exist_ok=True)
for i in paths.list_images('/content/UCMerced_LandUse/Images'):
    shutil.copy(i, os.path.join(folder, os.path.basename(i)))

## 1-3. Training self-supervised learning model



In [None]:
#run this cell to check information regarding arguments
!python /content/SSL/train.py --help

usage: train.py [-h] [--DATA_PATH DATA_PATH] [--VAL_PATH VAL_PATH]
                [--model MODEL] [--batch_size BATCH_SIZE] [--cpus CPUS]
                [--hidden_dim HIDDEN_DIM] [--epochs EPOCHS]
                [--learning_rate LEARNING_RATE] [--patience PATIENCE]
                [--val_split VAL_SPLIT] [--withhold_split WITHHOLD_SPLIT]
                [--gpus GPUS] [--log_name LOG_NAME] [--image_size IMAGE_SIZE]
                [--resize RESIZE] [--technique TECHNIQUE] [--seed SEED]

optional arguments:
  -h, --help            show this help message and exit
  --DATA_PATH DATA_PATH
                        path to folders with images to train on.
  --VAL_PATH VAL_PATH   path to validation folders with images
  --model MODEL         model to initialize. Can accept model checkpoint or
                        just encoder name from models.py
  --batch_size BATCH_SIZE
                        batch size for SSL
  --cpus CPUS           number of cpus to use to fetch data
  --hidden_dim H

In [None]:
#train an encoder
!python /content/SSL/train.py --technique SIMCLR --DATA_PATH /content/Dataset --model minicnn32 --batch_size 32 --learning_rate 1e-3 --log_name ssl --image_size 256 --epochs 50

Automatically splitting data into train and validation data...
Copying files: 2100 files [00:00, 3737.06 files/s]
warmup
Model architecture successfully loaded
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
In DALI 1.0 all readers were moved into a dedicated :mod:`~nvidia.dali.fn.readers`
submodule and renamed to follow a common pattern. This is a placeholder operator with identical
functionality to allow for backward compatibility.
  op_instances.append(_OperatorInstance(input_set, self, **kwargs))
In DALI 1.0 all decoders were moved into a dedicated :mod:`~nvidia.dali.fn.decoders`
submodule and renamed to follow a common pattern. This is a placeholder operator with identical
functionality to allow for backward compatibility.
  op_instances.append(_OperatorInstance(input_set, self, **kwargs))
read 1680 files from 1 directories
In DALI 1.0 all readers were moved into a dedicated :mod:`~nvidia.dali.fn.r

In [None]:
# Load model
%cd /content/SSL
from models import SIMCLR, SIMSIAM, CLASSIFIER
%cd /content/

model = SIMCLR.SIMCLR.load_from_checkpoint('/content/models/SIMCLR_ssl.ckpt')

/content/SSL
/content
warmup


# 2. Image Similarity Search (prep)

## 2-1. Download dataset

Downloading many files or folders from a Colab notebook to your computer can take a long time. We recommend downloading a single zip file to your computer and unzipping it there.

If you only want to use Image Similarity Search, you can download 'UCMerced_LandUse.zip'. If you also want to use Swipe Labeler or Active Labeler, run the code cell below and download 'UCMerced_LandUse_jpg_ver.zip' instead, because they do not accept .tif image files.

In [None]:
!zip -r UCMerced_LandUse_jpg_ver.zip /content/Dataset/Unlabeled

  adding: content/Dataset/Unlabeled/ (stored 0%)


## 2-2. Download model

To use the Image Similarity Search app, we need a model file in either .pt or .pth format. Because SSL saves the model in .ckpt format by default, we'll convert it to a .pt file.

In [None]:
# check torch size
model.local_rank = 0
model.setup(stage = 'inference') #we set up inference with this call to instantiate the DALI data pipeline
model.eval()
model.cuda()

for batch in model.inference_dataloader:
    print(len(batch))
    print(batch[0].shape)
    break

In DALI 1.0 all readers were moved into a dedicated :mod:`~nvidia.dali.fn.readers`
submodule and renamed to follow a common pattern. This is a placeholder operator with identical
functionality to allow for backward compatibility.
  op_instances.append(_OperatorInstance(input_set, self, **kwargs))
In DALI 1.0 all decoders were moved into a dedicated :mod:`~nvidia.dali.fn.decoders`
submodule and renamed to follow a common pattern. This is a placeholder operator with identical
functionality to allow for backward compatibility.
  op_instances.append(_OperatorInstance(input_set, self, **kwargs))


1
torch.Size([32, 3, 256, 256])


In [None]:
# pass the torch size you checked above as the shape given to torch.ones
# the file saved by this cell needs a GPU on the computer that runs Image Similarity Search
# if you don't have a GPU, run the next cell to get a CPU version of the .pt file

with torch.no_grad():
    x = torch.ones((32, 3, 256, 256)).cuda()  # a single datapoint would have shape (1, 3, 256, 256)
    traced_cell = torch.jit.trace(model, (x))
torch.jit.save(traced_cell, "UCMerced_simclr_minicnn32_50epochs.pt") # change the file name as you like

In [None]:
# generate cpu version .pt file
with torch.no_grad():
    x = torch.ones((32, 3, 256, 256)).cpu()
    traced_cell = torch.jit.trace(model.cpu(), (x))
torch.jit.save(traced_cell, "UCMerced_simclr_minicnn32_50epochs_cpu.pt")

Now download the .pt file from the Colab notebook file directory to your computer.

## 2-3. Check output embedding size of the model

Embedding size is a required input of the Image Similarity Search app, so we need to check the output embedding size of our SSL model.

In [None]:
# check layers
model = SIMCLR.SIMCLR.load_from_checkpoint('/content/models/SIMCLR_ssl.ckpt')
model.eval()
model.cuda()

warmup


SIMCLR(
  (projection): Projection(
    (model): Sequential(
      (0): Linear(in_features=32, out_features=128, bias=True)
      (1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
      (3): Linear(in_features=128, out_features=128, bias=False)
    )
  )
  (encoder): miniCNN(
    (conv1): Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (conv2): Conv2d(16, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (conv3): Conv2d(32, 48, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (adaptive_pool): AdaptiveAvgPool2d(output_size=(16, 16))
    (conv4): Conv2d(48, 64, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
    (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (flatten): Flatten(start_dim=1, end_dim=-1)
    (fc1): Linear(in_features=1024, out_features=32, bias=True)
  )
)

In this demo, the output embedding size is 32 (the `out_features` of the encoder's final `fc1` layer).
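The layer shapes in the printed architecture can be cross-checked with a little arithmetic. The sketch below assumes the encoder applies its layers in the order they are printed (PyTorch prints modules in definition order, which may differ from the forward pass) and uses the standard output-size formula for convolution and pooling layers with dilation 1. `conv_out` is a hypothetical helper written for this illustration.

```python
def conv_out(size, kernel, stride, padding):
    """Output side length of a conv/pool layer (dilation 1, floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


side = 256                       # input images are 256 x 256
side = conv_out(side, 3, 1, 1)   # conv1: 256
side = conv_out(side, 3, 2, 1)   # conv2: 128
side = conv_out(side, 3, 2, 1)   # conv3: 64
side = 16                        # adaptive_pool always outputs 16 x 16
side = conv_out(side, 5, 2, 2)   # conv4: 8
side = conv_out(side, 2, 2, 0)   # pool: 4

channels = 64                    # out_channels of conv4
flattened = channels * side * side
embedding_size = 32              # out_features of fc1

print(flattened, embedding_size)
```

Under that assumption the flattened size comes out to 1024, which matches the `in_features=1024` of `fc1` in the printout, and `fc1` then maps it to the 32-dimensional embedding.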

## 2-4. Run Image Similarity Search app

Now follow [this guide](https://github.com/spaceml-org/Curator-Unlabeled-Image-Search-Guide/blob/main/Single_Usage_Guide/Image_Similarity_Search.md) to set up and run the Image Similarity Search app on your computer. 

<img width="854" alt="ISS_screenshot" src="https://user-images.githubusercontent.com/66165810/134059552-f64b23da-ecfe-40f7-aff5-5730dc9f2a78.PNG">
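For intuition about what a similarity search does with the embeddings, here is a minimal brute-force sketch in plain Python. It ranks images by cosine similarity to a query embedding. This is a stand-in for illustration only: the app and the Active Labeler pipeline rely on an Annoy index for approximate nearest-neighbour search, and the metric they use is not stated in this notebook. The file names and 3-dimensional vectors are made up; real embeddings from this demo have 32 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def n_closest(query, embeddings, n):
    """Return the names of the n embeddings most similar to the query."""
    ranked = sorted(embeddings.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:n]]


# Hypothetical toy embeddings (illustrative values, not model output).
toy_embeddings = {
    "river01.jpg": [0.9, 0.1, 0.0],
    "river02.jpg": [0.8, 0.2, 0.1],
    "runway01.jpg": [0.0, 0.1, 0.9],
}

result = n_closest([1.0, 0.0, 0.0], toy_embeddings, 2)
print(result)
```

Brute force compares the query with every stored embedding, which is fine for 2,100 images; an approximate index trades a little accuracy for speed on much larger collections.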

# 3. Active Labeler

## 3-1. Code setup

In [None]:
%cd "/content"
import os
import shutil
if os.path.exists('/content/Active-Labeller'):
  shutil.rmtree('/content/Active-Labeller')

!git clone https://github.com/spaceml-org/Active-Labeller.git

/content
Cloning into 'Active-Labeller'...
remote: Enumerating objects: 2031, done.
remote: Counting objects: 100% (2031/2031), done.
remote: Compressing objects: 100% (1474/1474), done.
remote: Total 2031 (delta 615), reused 1889 (delta 534), pack-reused 0
Receiving objects: 100% (2031/2031), 24.02 MiB | 26.37 MiB/s, done.
Resolving deltas: 100% (615/615), done.


In [None]:
import logging
logging.basicConfig(level=logging.DEBUG,filename='/content/app.log', filemode='a', format='%(asctime)s - %(levelname)-8s - %(funcName)-15s - %(message)s', datefmt='%d-%b-%y %H:%M:%S')

In [None]:
import sys
sys.path.insert(0, "/content/Active-Labeller")
sys.path.insert(0, "/content/Active-Labeller/ActiveLabeler-main")
sys.path.insert(0, "/content/Active-Labeller/ActiveLabeler-main/Self-Supervised-Learner")
sys.path.insert(0, "/content/Active-Labeller/ActiveLabeler-main/ActiveLabelerModels")

In [None]:
import random 
random.seed(seed)
import numpy as np 
np.random.seed(seed) #TODO random in code uses seed in each line ? 

config_path = "/content/Active-Labeller/pipeline_config.yml"
from pipeline import Pipeline
pipeline = Pipeline(config_path,"airplane")
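On the seeding in the cell above: seeding Python's `random` module once fixes the entire sequence of values drawn afterwards, so it does not have to be repeated before every call. NumPy keeps its own generator, which is why the notebook seeds it separately. A minimal sketch of the behaviour (the range 0 to 2099 is only an illustration, chosen to match the 2,100 images in this demo):

```python
import random

# Seed once, then draw several values.
random.seed(100)
first_run = [random.randint(0, 2099) for _ in range(5)]

# Re-seeding with the same value restarts the same sequence.
random.seed(100)
second_run = [random.randint(0, 2099) for _ in range(5)]

print(first_run == second_run)
```

Anything that draws random numbers between the seed call and the code you care about shifts the sequence, which is one reason to re-seed right before a step that must be reproducible.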

Access Swipe Labeler at the following link:

In [None]:
from google.colab.output import eval_js
print(eval_js("google.colab.kernel.proxyPort(5000)"))

https://wgooh68b6oj-496ff2e9c6d22116-5000-colab.googleusercontent.com/


## 3-2. Run Active Labeler

Run the cell below, then open the link above to label the images.


In [None]:
import random 
random.seed(100)
import numpy as np 
np.random.seed(100)

pipeline.main()

  0%|          | 0/2100 [00:00<?, ?it/s]

  im = torch.Tensor(im).unsqueeze(0).cuda()



Got embeddings. Embedding Shape: torch.Size([2100, 32])
Annoy file stored at  /content/runtime/NN_local/annoy_file.ann

----- iteration: 1
Enter n closest

 10 images to label.


In [None]:
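model = torch.load("/content/final_model.ckpt")
model.eval()  #inference mode: disables dropout and batch-norm statistics updates

def to_tensor(pil):
    #convert a PIL image (H, W, C) into a float tensor (C, H, W)
    return torch.tensor(np.array(pil)).permute(2, 0, 1).float()

t = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.Lambda(to_tensor)
])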

dataset = ImageFolder("/content/Dataset", transform=t)
img_paths = [i[0] for i in dataset.imgs]

unlabeled_predictions = []

with torch.no_grad():
    bs = 128
    if len(dataset) < bs:
        bs = 1
    loader = DataLoader(dataset, batch_size=bs, shuffle=False)
    for batch in tqdm(loader):
        x = batch[0].cuda()
        feats = model.encoder(x)[-1]
        feats = feats.view(feats.size(0), -1)
        predictions = model.linear_model(feats)
        unlabeled_predictions.extend(predictions.detach().cpu().numpy())
unlabeled_predictions = [1 if x > 0.5 else 0 for x in unlabeled_predictions]
print(unlabeled_predictions)

In [None]:
#create the output folders (exist_ok avoids an error if they already exist)
os.makedirs("/content/AL_Dataset/Positive", exist_ok=True)
os.makedirs("/content/AL_Dataset/Negative", exist_ok=True)

#move each image into the folder that matches its predicted label
for img_path, prediction in zip(img_paths, unlabeled_predictions):
    label_folder = "Negative" if prediction == 0 else "Positive"
    target = os.path.join("/content/AL_Dataset", label_folder, os.path.basename(img_path))
    shutil.move(img_path, target)
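The thresholding and sorting steps above can be tried without a trained model. The sketch below is a self-contained illustration using only the standard library: `sort_by_prediction` is a hypothetical helper, the scores are made-up numbers standing in for classifier outputs, and it copies files instead of moving them so the source folder stays intact.

```python
import shutil
import tempfile
from pathlib import Path


def sort_by_prediction(image_paths, scores, out_dir, threshold=0.5):
    """Copy each image into out_dir/Positive or out_dir/Negative.

    A score strictly greater than the threshold counts as positive,
    mirroring the `x > 0.5` rule used in the notebook.
    """
    out = Path(out_dir)
    counts = {"Positive": 0, "Negative": 0}
    for label in counts:
        (out / label).mkdir(parents=True, exist_ok=True)
    for path, score in zip(image_paths, scores):
        label = "Positive" if score > threshold else "Negative"
        shutil.copy(path, out / label / Path(path).name)
        counts[label] += 1
    return counts


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source"
    source.mkdir()
    image_paths = []
    for i in range(4):
        path = source / f"img{i}.jpg"
        path.touch()  # empty placeholder files stand in for real images
        image_paths.append(path)

    # Made-up scores; 0.5 itself is not above the threshold, so it is negative.
    demo_counts = sort_by_prediction(image_paths, [0.9, 0.2, 0.5, 0.7],
                                     Path(tmp) / "AL_Dataset")
    positives = sorted(p.name for p in (Path(tmp) / "AL_Dataset" / "Positive").iterdir())
    print(demo_counts, positives)
```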

In [None]:
print(len(list(paths.list_images('/content/AL_Dataset/Positive'))))
print(len(list(paths.list_images('/content/AL_Dataset/Negative'))))