## Pool-based Active Learning - Evaluation Study

The main purpose of this tutorial is to show how a realistic comparison study can be carried out using 'scikit-activeml'. In this tutorial, we use the self-supervised learning model DINOv2 from [1] to create an embedding data set (to be continued).

In [2]:
import numpy as np
import matplotlib as mlp
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, KFold

from skactiveml.classifier import SklearnClassifier, ParzenWindowClassifier
from skactiveml.pool import UncertaintySampling, RandomSampling, DiscriminativeAL, CoreSet, TypiClust, Badge
from skactiveml.utils import call_func, MISSING_LABEL

import warnings
mlp.rcParams["figure.facecolor"] = "white"
warnings.filterwarnings("ignore")

## Data Set Generation

An introduction to DINOv2 and how it is used to obtain the embedding data set. (To be continued)

In [None]:
#!pip3 install torch torchvision torchaudio
#!pip install tqdm

In [7]:
import torch
import torchvision
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from tqdm import tqdm

In [6]:
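# Preprocessing for DINOv2. The result shadows the `transforms` module alias, so the
# module is referenced as `torchvision.transforms` to keep this cell re-runnable.
T = torchvision.transforms
transforms = T.Compose(
    [T.Resize(256),
     T.CenterCrop(224),
     T.ToTensor(),
     # ImageNet channel means and standard deviations
     T.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))]
)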

batch_size = 4
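
As a rough illustration of what this preprocessing does to a 32x32 CIFAR-10 image, the sketch below reproduces the arithmetic in plain Python. The helper names `resize_shorter_side` and `normalize` are hypothetical and only mimic the behavior of `Resize` with an integer size and of `Normalize` for a single channel value; the patch size of 14 is taken from the model name `dinov2_vits14`.

```python
PATCH_SIZE = 14  # patch size of the ViT-S/14 backbone ("dinov2_vits14")


def resize_shorter_side(height, width, size):
    """Mimic Resize(size) for an int size: scale the shorter side to `size`."""
    if height <= width:
        return size, int(size * width / height)
    return int(size * height / width), size


def normalize(value, mean, std):
    """Mimic Normalize for one channel value in [0, 1]."""
    return (value - mean) / std


resized = resize_shorter_side(32, 32, 256)   # CIFAR-10 images are 32x32
cropped = (224, 224)                         # result of CenterCrop(224)
n_patches = (cropped[0] // PATCH_SIZE) ** 2  # 16 x 16 patches per image
red_black = normalize(0.0, 0.485, 0.229)     # darkest red-channel value
red_white = normalize(1.0, 0.485, 0.229)     # brightest red-channel value

print(resized, n_patches, round(red_black, 3), round(red_white, 3))
# prints: (256, 256) 256 -2.118 2.249
```

The crop size of 224 is a multiple of the patch size, so the image divides evenly into patches.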

Download the corresponding data set (CIFAR-10).

In [10]:
cifar10_trainset = datasets.CIFAR10(root="./data", train=True, download=True, transform=transforms)
cifar10_trainloader = torch.utils.data.DataLoader(cifar10_trainset, batch_size=batch_size, shuffle=True, num_workers=2)

cifar10_testset = datasets.CIFAR10(root="./data", train=False, download=True, transform=transforms)
cifar10_testloader = torch.utils.data.DataLoader(cifar10_testset, batch_size=batch_size, shuffle=False, num_workers=2)

cifar10_classes = ('plane', 'car', 'bird', 'cat',
                   'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz



Compute the embeddings of the images with DINOv2

In [9]:
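# Load the pretrained DINOv2 ViT-S/14 backbone from torch.hub and switch it to
# evaluation mode, since it is only used for feature extraction.
dinov2_vits14 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
dinov2_vits14.eval()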

cifar10_train_embedding_list = []
cifar10_train_label_list = []
cifar10_test_embedding_list = []
cifar10_test_label_list = []

Using cache found in /Users/chengjiaying/.cache/torch/hub/facebookresearch_dinov2_main


In [None]:
# Embed all images batch by batch; gradients are not needed for feature extraction.
with torch.no_grad():
    for image, label in tqdm(cifar10_trainloader):
        embeddings = dinov2_vits14(image)
        cifar10_train_embedding_list.append(embeddings)
        cifar10_train_label_list.append(label)

    for image, label in tqdm(cifar10_testloader):
        embeddings = dinov2_vits14(image)
        cifar10_test_embedding_list.append(embeddings)
        cifar10_test_label_list.append(label)

    # Concatenate the per-batch tensors into NumPy arrays with one row per image.
    cifar10_X_Train = torch.cat(cifar10_train_embedding_list, dim=0).numpy()
    cifar10_y_Train = torch.cat(cifar10_train_label_list, dim=0).numpy()
    cifar10_X_Test = torch.cat(cifar10_test_embedding_list, dim=0).numpy()
    cifar10_y_Test = torch.cat(cifar10_test_label_list, dim=0).numpy()
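
The two loops above follow the same pattern, so they could be folded into one helper. The sketch below is one possible way to do this; `extract_embeddings` is a hypothetical name, and the tiny linear model and random tensors are stand-ins so that the example runs without downloading DINOv2 or CIFAR-10.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def extract_embeddings(model, loader):
    """Run `model` over all batches and return (features, labels) as NumPy arrays."""
    model.eval()
    features, labels = [], []
    with torch.no_grad():
        for images, targets in loader:
            features.append(model(images))
            labels.append(targets)
    return torch.cat(features, dim=0).numpy(), torch.cat(labels, dim=0).numpy()


# Assumption: a toy model and toy data replace the real backbone and images.
toy_model = torch.nn.Linear(8, 3)
toy_dataset = TensorDataset(torch.randn(10, 8), torch.arange(10))
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=False)

X_toy, y_toy = extract_embeddings(toy_model, toy_loader)
print(X_toy.shape, y_toy.shape)
```

With the objects of this tutorial, the call would presumably look like `extract_embeddings(dinov2_vits14, cifar10_trainloader)`.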
    

Save the embedding features in separate files

In [None]:
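import os

# np.save does not create missing directories, so make sure the target folder exists.
os.makedirs('./embedding_data', exist_ok=True)
np.save('./embedding_data/cifar10_dinov2_X_train.npy', cifar10_X_Train)
np.save('./embedding_data/cifar10_dinov2_y_train.npy', cifar10_y_Train)
np.save('./embedding_data/cifar10_dinov2_X_test.npy', cifar10_X_Test)
np.save('./embedding_data/cifar10_dinov2_y_test.npy', cifar10_y_Test)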
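
Once the arrays are stored, later runs can reload them with `np.load` instead of recomputing the embeddings. The sketch below shows the round trip on a small random array in a temporary directory; the file name `toy_X_train.npy` and the array contents are placeholders, not the files written above.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_X_train.npy")  # hypothetical file name

    # Stand-in for an embedding matrix with one row per image.
    X_toy = np.random.default_rng(0).normal(size=(5, 384)).astype(np.float32)
    np.save(path, X_toy)

    # Reload the array; shape and dtype are restored from the .npy header.
    X_loaded = np.load(path)

print(X_loaded.shape, X_loaded.dtype)
```

For the files of this tutorial, the same call would be applied to paths such as `./embedding_data/cifar10_dinov2_X_train.npy`.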