### OBJECTIVE
As a proof of concept (PoC), show that active learning can improve the efficacy of crowdsourced data annotation.

### POC TODO
- understand and experiment with the codebase (`AL-MDN` repo)
    - set up dataloader
    - run `train` (including test evaluation metrics)
    - run `active_learning_cycle`
    - (set up VOC 2007 + 2012 datasets)
    - (try training end to end on VOC to reproduce the results)
- build pipeline customised to the competition
    - determine number of classes
    - set up dataloader
    - visualise sample annotations
- determine the best learning rate, other hyperparameters, and batch size per cycle
- validate effectiveness on the annotated data we have (with vs. without active learning)
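
For the with vs. without comparison in the last item, both arms should start from the same initial labelled set so that only the acquisition step differs. The snippet below is a minimal sketch of that setup using only the standard library; `make_initial_split` and `acquire_random` are hypothetical helper names, not functions from the `AL-MDN` repo, and the sizes are illustrative.

```python
import random


def make_initial_split(num_total_images, num_initial_labeled, seed):
    """Shuffle indices with a private RNG so the split is reproducible."""
    rng = random.Random(seed)
    indices = list(range(num_total_images))
    rng.shuffle(indices)
    return indices[:num_initial_labeled], indices[num_initial_labeled:]


def acquire_random(labeled, unlabeled, budget, seed):
    """Baseline arm (hypothetical): label `budget` randomly chosen images."""
    rng = random.Random(seed)
    budget = min(budget, len(unlabeled))  # never ask for more than the pool holds
    picked = set(rng.sample(unlabeled, budget))
    new_labeled = labeled + [i for i in unlabeled if i in picked]
    new_unlabeled = [i for i in unlabeled if i not in picked]
    return new_labeled, new_unlabeled


# Both arms share this split; the active learning arm would replace
# `acquire_random` with an uncertainty-based selection.
labeled, unlabeled = make_initial_split(100, 50, seed=0)
baseline_labeled, baseline_unlabeled = acquire_random(labeled, unlabeled, budget=10, seed=1)
```

Training once on `baseline_labeled` and once on the actively selected set, with the same budget, gives the two results to compare.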

### References
1. [Active Learning for Deep Object Detection via Probabilistic Modeling](https://github.com/NVlabs/AL-MDN)
2. [Active Learning algorithms for classification, object detection, human pose estimation and semantic segmentation](https://github.com/superannotateai/active_learning)

#### Notes on the `train_ssd_gmm_active_learning.main` pipeline
1. load GMM model
2. load VOC test dataset
3. evaluate unbiased performance on test set
4. load VOC unlabelled dataset 
5. apply active learning to select top samples from (4)
6. load VOC labelled dataset
7. train on existing dataset plus the addition
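
The steps above can be sketched as a generic pool-based loop. This is a toy illustration under stated assumptions, not the repo's implementation: `run_active_learning` is a hypothetical name, and `train_fn`, `score_fn` and `eval_fn` are placeholder callables standing in for training, uncertainty scoring and test-set evaluation.

```python
def run_active_learning(train_fn, score_fn, eval_fn, labeled, unlabeled, budget, num_cycles):
    """Repeat: evaluate, score the unlabelled pool, label the top samples, retrain."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    model = train_fn(labeled)
    history = []
    for _ in range(num_cycles):
        if not unlabeled:
            break
        history.append(eval_fn(model))        # step 3: performance on the test set
        scores = score_fn(model, unlabeled)   # steps 4-5: one score per unlabelled index
        ranked = sorted(zip(scores, unlabeled), reverse=True)
        picked = [idx for _, idx in ranked[:budget]]
        picked_set = set(picked)
        labeled += picked                     # steps 6-7: extend the labelled set, retrain
        unlabeled = [i for i in unlabeled if i not in picked_set]
        model = train_fn(labeled)
    return model, labeled, unlabeled, history


# Toy stand-ins: the "model" is just the number of labelled samples.
model, labeled, unlabeled, history = run_active_learning(
    train_fn=lambda labeled: len(labeled),
    score_fn=lambda model, pool: [idx % 4 for idx in pool],
    eval_fn=lambda model: model,
    labeled=[0, 1],
    unlabeled=[2, 3, 4, 5, 6, 7],
    budget=2,
    num_cycles=2,
)
```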

In [None]:
%%capture
!pip install pycocotools==2.0.4
!git clone https://github.com/riven314/AL-MDN.git
%cd AL-MDN

In [None]:
import os
import random

import torch
import torch.utils.data as data
from torch.utils.data.sampler import SubsetRandomSampler, SequentialSampler
from PIL import Image

# utilities from the cloned AL-MDN fork
from data import MEANS, detection_collate, BaseTransform
from data.voc0712 import VOCDetection, VOCAnnotationTransform
from utils.augmentations import SSDAugmentation
from subset_sequential_sampler import SubsetSequentialSampler
from train_ssd_gmm_active_learning import *

%load_ext autoreload
%autoreload 2

### 0. Config

In [None]:
MIN_DIM = 300
CLASS_N = 21
NUM_TOTAL_IMAGES = 100
NUM_INITIAL_LABELED_SET = 50
BATCH_SIZE = 8
NUM_WORKERS = 2
VGG_BACKBONE_PATH = '/kaggle/input/vgg16-reduced-fc-backbone/vgg16_reducedfc.pth'
IS_CUDA = torch.cuda.is_available()
VOC_ROOT = '/kaggle/input/pascal-voc-2007/VOCtrainval_06-Nov-2007/VOCdevkit'


class args:
    resume = False
    save_folder = os.path.dirname(VGG_BACKBONE_PATH)
    basenet = os.path.basename(VGG_BACKBONE_PATH)
    lr = 0.01
    momentum = 0.9
    weight_decay = 5e-4
    gamma = 0.1
    id = 1
    cuda = IS_CUDA
    use_cuda = IS_CUDA
    start_iter = 0
    
cfg = {
    'min_dim': MIN_DIM,
    'num_classes': CLASS_N,
    'max_iter': 10,
    'lr_steps': (5, 8),
    'name': 'testing',
    "num_total_images": NUM_TOTAL_IMAGES,
    "acquisition_budget": 1000
}

### 1. Model

In [None]:
net, optimiser = load_net_optimizer_multi(cfg, args)

### 2. Set up Dataloader (VOC 2007 Dataset)
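
The unlabelled loader in this section uses the repo's `SubsetSequentialSampler`, whose source is not shown in this notebook. Presumably it yields the given indices in a fixed order, so that the i-th uncertainty score can be mapped back to the i-th entry of the unlabelled set. A minimal pure-Python sketch of that idea follows; `FixedOrderSubsetSampler` is a hypothetical name and the scores are made-up illustrative values.

```python
class FixedOrderSubsetSampler:
    """Yield a fixed list of dataset indices in the order given (no shuffling)."""

    def __init__(self, indices):
        self.indices = list(indices)

    def __iter__(self):
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)


unlabeled_set = [42, 7, 19]
sampler = FixedOrderSubsetSampler(unlabeled_set)
visited = list(sampler)  # same order as unlabeled_set

# Because the visiting order is fixed, position i in `scores` belongs to visited[i].
scores = [0.1, 0.9, 0.4]  # illustrative values only
best_position = max(range(len(scores)), key=scores.__getitem__)
most_uncertain = visited[best_position]
```

A shuffling sampler would break this position-to-index mapping, which is why the labelled loader can shuffle but the unlabelled one should not.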

In [None]:
# TODO: customise this to Happy Whale competition
def create_loaders(
    voc_root, batch_size, num_workers, min_dim,
    num_total_images, num_initial_labeled_set
):
    num_train_images = num_total_images
    indices = list(range(num_train_images))
    random.shuffle(indices)
    labeled_set = indices[:num_initial_labeled_set]
    unlabeled_set = indices[num_initial_labeled_set:]

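    # Training view: SSD augmentation on the labelled images
    supervised_dataset = VOCDetection(root=voc_root, transform=SSDAugmentation(min_dim, MEANS))
    # Scoring view: resize and mean subtraction only, using min_dim rather than a hard-coded 300
    unsupervised_dataset = VOCDetection(voc_root, [('2007', 'trainval')],
                                        BaseTransform(min_dim, MEANS),
                                        VOCAnnotationTransform())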

    supervised_data_loader = data.DataLoader(supervised_dataset, batch_size=batch_size,
                                             num_workers=num_workers,
                                             sampler=SubsetRandomSampler(labeled_set),
                                             collate_fn=detection_collate,
                                             pin_memory=True)
    unsupervised_data_loader = data.DataLoader(unsupervised_dataset, batch_size=1,
                                               num_workers=num_workers,
                                               sampler=SubsetSequentialSampler(unlabeled_set),
                                               collate_fn=detection_collate,
                                               pin_memory=True)
    return supervised_dataset, supervised_data_loader, unsupervised_dataset, unsupervised_data_loader, indices, labeled_set, unlabeled_set

In [None]:
# d = VOCDetection(root=VOC_ROOT,
#                  image_sets=[('2007', 'trainval')],
#                  dataset_name='VOC2007')
# supervised_dataset = VOCDetection(root=VOC_ROOT, transform=SSDAugmentation(MIN_DIM, MEANS))

sup_ds, sup_dl, unsup_ds, unsup_dl, idxs, label_set, unlabel_set = create_loaders(
    voc_root=VOC_ROOT, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, min_dim=MIN_DIM,
    num_total_images=NUM_TOTAL_IMAGES, num_initial_labeled_set=NUM_INITIAL_LABELED_SET
)

### 3. Train

In [None]:
criterion = MultiBoxLoss_GMM(cfg['num_classes'], 0.5, True, 0, True, 3, 0.5, False, args.cuda)
# Move the loss to the GPU only when CUDA is available
if args.cuda:
    criterion = criterion.cuda()
net = train(label_set, sup_dl, idxs, cfg, args, criterion)

### 4. Apply Active Learning to Rank Unlabelled Samples

In [None]:
try:
    net.eval()
    new_labeled_set, new_unlabeled_set = active_learning_cycle(
        iter(unsup_dl),
        label_set,
        unlabel_set,
        net,
        cfg["num_classes"],
        acquisition_budget=cfg['acquisition_budget'],
        num_total_images=cfg['num_total_images'],
    )
except Exception as e:
    # Report the actual error instead of hiding it behind a guess
    print(f"active_learning_cycle failed: {e!r}")
    print("Probably no prediction surpassed the threshold. Train for longer to resolve this.")