## **Tutorial Overview: Comparing Active Learning Strategies with scikit-activeml**

This tutorial demonstrates a practical comparison study with the `scikit-activeml` library. A self-supervised model, DINOv2 [1], generates embeddings for the CIFAR-10, CIFAR-100, and Flowers-102 datasets. Several active learning strategies then select which samples to label, and their results are compared.

**Key Steps:**
1. **Self-Supervised Learning Model:** Use the DINOv2 model to create embedding datasets for CIFAR-10, CIFAR-100, and Flowers-102.

2. **Active Learning Strategies:** Employ different active learning strategies provided by the scikit-activeml library, including:
    - Random Sampling
    - Uncertainty Sampling
    - Discriminative Active Learning (DiscriminativeAL)
    - CoreSet
    - TypiClust
    - Badge

3. **Labeling Selection:** Use each active learning strategy to select the samples to label, one query per cycle, and record the classifier's test accuracy after each query.

4. **Logging Results with mlflow:** Record and track the results obtained from each active learning strategy using mlflow [2], a platform for managing the complete machine learning lifecycle.
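
The key steps above can be sketched end to end on toy data. The following is a minimal illustration, not the scikit-activeml implementation: the nearest-centroid model, the helper `centroid_proba`, and the two Gaussian blobs standing in for image embeddings are all assumptions made for this example. Unlabeled samples are marked with `nan`, and each cycle queries the sample whose most likely class has the lowest probability (least-confidence uncertainty sampling).

```python
import numpy as np


def centroid_proba(X_lab, y_lab, X, n_classes):
    """Class probabilities from a softmax over negative squared centroid distances."""
    scores = np.full((len(X), n_classes), -np.inf)
    for c in range(n_classes):
        members = X_lab[y_lab == c]
        if len(members) > 0:
            scores[:, c] = -((X - members.mean(axis=0)) ** 2).sum(axis=1)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores -= scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)


rng = np.random.RandomState(0)
# Toy pool: two Gaussian blobs standing in for image embeddings.
X_pool = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y_true = np.repeat([0, 1], 50)

# nan marks "not labeled yet"; start with one labeled sample per class.
y_pool = np.full(len(X_pool), np.nan)
y_pool[[0, 50]] = y_true[[0, 50]]

n_cycles = 10
for cycle in range(n_cycles):
    labeled = ~np.isnan(y_pool)
    proba = centroid_proba(X_pool[labeled], y_pool[labeled], X_pool, n_classes=2)
    # Least-confidence uncertainty: 1 minus the probability of the top class.
    uncertainty = 1.0 - proba.max(axis=1)
    uncertainty[labeled] = -np.inf  # never query a sample twice
    query_idx = int(np.argmax(uncertainty))
    # The "oracle" reveals the true label of the queried sample.
    y_pool[query_idx] = y_true[query_idx]

n_labeled = int((~np.isnan(y_pool)).sum())
print(f"labeled samples after {n_cycles} cycles: {n_labeled}")
```

The experiment later in this tutorial follows the same pattern, with DINOv2 embeddings as the pool and the library's query strategies in place of the hand-written uncertainty score.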

**References:**

[1] M. Oquab et al., ‘DINOv2: Learning Robust Visual Features without Supervision’. arXiv, Apr. 14, 2023. Accessed: Jan. 13, 2024. [Online]. Available: http://arxiv.org/abs/2304.07193

[2] ‘MLflow - A platform for the machine learning lifecycle’, MLflow. Accessed: Jan. 18, 2024. [Online]. Available: https://mlflow.org/



**Step 1: Prepare Your Data with DINOv2**

In [1]:
#!pip install -U matplotlib
#!pip install -U scikit-learn
#!pip install iteration_utilities

In [2]:
import sys
sys.path.append("/mnt/stud/home/jcheng/scikit-activeml/")

In [3]:
import numpy as np
import matplotlib as mlp
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression

from skactiveml.classifier import SklearnClassifier
from skactiveml.pool import UncertaintySampling, RandomSampling, DiscriminativeAL, CoreSet, TypiClust, Badge
from skactiveml.utils import call_func, MISSING_LABEL

import warnings
mlp.rcParams["figure.facecolor"] = "white"
warnings.filterwarnings("ignore")

## Data Set Generation

Introduction to DINOv2 for generating the embedding datasets. (To be continued)

In [4]:
#!pip3 install torch torchvision torchaudio
#!pip install tqdm

In [5]:
import torch
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from tqdm import tqdm

In [6]:
transforms = transforms.Compose(
        [transforms.Resize(256),
         transforms.CenterCrop(224),
         transforms.ToTensor(),
         transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))]
    )

batch_size = 4
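
The preprocessing pipeline above ends with a per-channel normalization using the ImageNet mean and standard deviation. As a rough sketch of the arithmetic only, the NumPy snippet below mimics the scaling to `[0, 1]` and the `(x - mean) / std` step on an image that is already 224x224. The helper name `normalize_chw` and the flat gray test image are assumptions for illustration; resizing and cropping are not reproduced here.

```python
import numpy as np

# Per-channel ImageNet statistics, shaped to broadcast over (C, H, W).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)
IMAGENET_STD = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1)


def normalize_chw(image_uint8_hwc):
    """Scale a uint8 HWC image to [0, 1], move channels first, then normalize."""
    chw = image_uint8_hwc.astype(np.float32).transpose(2, 0, 1) / 255.0
    return (chw - IMAGENET_MEAN) / IMAGENET_STD


# Hypothetical input: a flat gray 224x224 RGB image.
gray = np.full((224, 224, 3), 128, dtype=np.uint8)
normalized = normalize_chw(gray)
print(normalized.shape)
```

Each channel ends up roughly zero-centered with unit scale, which is the input range the pretrained model was trained on.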

Load the pretrained model.

In [13]:
dinov2_vits14 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")

Using cache found in /mnt/stud/home/jcheng/.cache/torch/hub/facebookresearch_dinov2_main


Download the corresponding datasets (CIFAR-10, CIFAR-100, and Flowers-102) and compute their embeddings.

In [8]:
dataset_classes = {
    "CIFAR10": 10,
    "CIFAR100": 100,
    "Flowers102": 102
}

In [15]:
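def load_and_process_dataset(dataset_name, root_dir, num_classes, is_train):
    # `num_classes` is kept for the existing call sites; it is not used here.
    # Look up the torchvision dataset class by name.
    dataset_cls = getattr(datasets, dataset_name)
    if dataset_name == "Flowers102":
        # Flowers102 takes `split` ("train", "val", "test") instead of `train`.
        split_kwargs = {"split": "train" if is_train else "test"}
    else:
        split_kwargs = {"train": is_train}
    dataset = dataset_cls(root=root_dir, download=True, transform=transforms, **split_kwargs)

    # Create a DataLoader
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=is_train, num_workers=2)

    embedding_list = []
    label_list = []

    # Put the model in evaluation mode and disable gradient tracking for inference.
    dinov2_vits14.eval()
    with torch.no_grad():
        for image, label in tqdm(dataloader, total=len(dataloader), desc=f"{dataset_name.capitalize()}"):
            embeddings = dinov2_vits14(image)
            embedding_list.append(embeddings)
            label_list.append(label)

    # Concatenate the per-batch embeddings and labels into single arrays
    X = torch.cat(embedding_list, dim=0).numpy()
    y_true = torch.cat(label_list, dim=0).numpy()

    return X, y_true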
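
The function above embeds the data batch by batch and concatenates the pieces at the end. The sketch below shows that pattern in isolation with NumPy. The helper `embed_in_batches` and the stand-in embedding function (a per-channel pixel mean) are assumptions for illustration; in the tutorial the embedding function is the DINOv2 model.

```python
import numpy as np


def embed_in_batches(images, embed_fn, batch_size):
    """Apply embed_fn to consecutive slices and stack the results row-wise."""
    chunks = []
    for start in range(0, len(images), batch_size):
        chunks.append(embed_fn(images[start:start + batch_size]))
    return np.concatenate(chunks, axis=0)


def mean_pixel_embedding(batch):
    """Stand-in 'model': one feature per channel, the mean pixel value."""
    return batch.mean(axis=(2, 3))


rng = np.random.RandomState(0)
images = rng.rand(10, 3, 8, 8)  # 10 toy images in (N, C, H, W) layout

# Batches of 4, 4, and 2 images; the last batch is smaller than the rest.
embeddings = embed_in_batches(images, mean_pixel_embedding, batch_size=4)
print(embeddings.shape)
```

Because each row depends only on its own image, the batched result matches embedding all images in one call, regardless of the batch size.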

In [16]:
# CIFAR-10
cifar10_X_train, cifar10_y_train_true = load_and_process_dataset("CIFAR10", "./data", 10, True)
cifar10_X_test, cifar10_y_test_true = load_and_process_dataset("CIFAR10", "./data", 10, False)

Files already downloaded and verified


Cifar10: 100%|██████████| 12500/12500 [53:09<00:00,  3.92it/s]


Files already downloaded and verified


Cifar10: 100%|██████████| 2500/2500 [11:34<00:00,  3.60it/s]


In [19]:
np.save('./embedding_data/cifar10_dinov2_X_train.npy', cifar10_X_train)
np.save('./embedding_data/cifar10_dinov2_y_train.npy', cifar10_y_train_true)
np.save('./embedding_data/cifar10_dinov2_X_test.npy', cifar10_X_test)
np.save('./embedding_data/cifar10_dinov2_y_test.npy', cifar10_y_test_true)
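
Note that `np.save` does not create missing directories, so `./embedding_data` has to exist before the cell above runs. A minimal sketch of a save helper that creates the folder first and verifies the round trip is shown below. The helper name `save_split`, the file naming, and the temporary directory are assumptions for this example only.

```python
import os
import tempfile

import numpy as np


def save_split(out_dir, prefix, X, y):
    """Create out_dir if needed, then save features and labels as .npy files."""
    os.makedirs(out_dir, exist_ok=True)
    x_path = os.path.join(out_dir, f"{prefix}_X.npy")
    y_path = os.path.join(out_dir, f"{prefix}_y.npy")
    np.save(x_path, X)
    np.save(y_path, y)
    return x_path, y_path


with tempfile.TemporaryDirectory() as tmp:
    X = np.arange(12, dtype=np.float32).reshape(4, 3)
    y = np.array([0, 1, 1, 0])
    x_path, y_path = save_split(os.path.join(tmp, "embedding_data"), "demo", X, y)
    # Load the files back and check that nothing changed.
    X_back = np.load(x_path)
    y_back = np.load(y_path)
    round_trip_ok = np.array_equal(X, X_back) and np.array_equal(y, y_back)

print(round_trip_ok)
```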

In [None]:
# CIFAR-100
cifar100_X_train, cifar100_y_train_true = load_and_process_dataset("CIFAR100", "./data", 100, True)
cifar100_X_test, cifar100_y_test_true = load_and_process_dataset("CIFAR100", "./data", 100, False)

Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to ./data/cifar-100-python.tar.gz


100%|██████████| 169001437/169001437 [06:28<00:00, 435350.83it/s]


Extracting ./data/cifar-100-python.tar.gz to ./data


Cifar100:  15%|█▍        | 1841/12500 [08:14<46:01,  3.86it/s] 

In [None]:
np.save('./embedding_data/cifar100_dinov2_X_train.npy', cifar100_X_train)
np.save('./embedding_data/cifar100_dinov2_y_train.npy', cifar100_y_train_true)
np.save('./embedding_data/cifar100_dinov2_X_test.npy', cifar100_X_test)
np.save('./embedding_data/cifar100_dinov2_y_test.npy', cifar100_y_test_true)

In [None]:
# Flowers-102
flowers102_X_train, flowers102_y_train_true = load_and_process_dataset("Flowers102", "./data/flowers102", 102, True)
flowers102_X_test, flowers102_y_test_true = load_and_process_dataset("Flowers102", "./data/flowers102", 102, False)

In [None]:
np.save('./embedding_data/flowers102_dinov2_X_train.npy', flowers102_X_train)
np.save('./embedding_data/flowers102_dinov2_y_train.npy', flowers102_y_train_true)
np.save('./embedding_data/flowers102_dinov2_X_test.npy', flowers102_X_test)
np.save('./embedding_data/flowers102_dinov2_y_test.npy', flowers102_y_test_true)

## Load Your Preprocessed Data

If you have already completed the steps above, load your saved embeddings here.

In [11]:
cifar10_X_train = np.load('./embedding_data/cifar10_dinov2_X_train.npy')
cifar10_y_train_true = np.load('./embedding_data/cifar10_dinov2_y_train.npy')
cifar10_X_test = np.load('./embedding_data/cifar10_dinov2_X_test.npy')
cifar10_y_test = np.load('./embedding_data/cifar10_dinov2_y_test.npy')

In [None]:
cifar100_X_train = np.load('./embedding_data/cifar100_dinov2_X_train.npy')
cifar100_y_train_true = np.load('./embedding_data/cifar100_dinov2_y_train.npy')
cifar100_X_test = np.load('./embedding_data/cifar100_dinov2_X_test.npy')
cifar100_y_test = np.load('./embedding_data/cifar100_dinov2_y_test.npy')

In [None]:
flowers102_X_train = np.load('./embedding_data/flowers102_dinov2_X_train.npy')
flowers102_y_train_true = np.load('./embedding_data/flowers102_dinov2_y_train.npy')
flowers102_X_test = np.load('./embedding_data/flowers102_dinov2_X_test.npy')
flowers102_y_test = np.load('./embedding_data/flowers102_dinov2_y_test.npy')

## Random Seed Management

In [12]:
master_random_state = np.random.RandomState(0)

def gen_seed(random_state: np.random.RandomState):
    return random_state.randint(0, 2**31)

def gen_random_state(random_state: np.random.RandomState):
    return np.random.RandomState(gen_seed(random_state))
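
The helpers above derive every seed from a single master random state, so an entire experiment can be repeated exactly by resetting that one state. The short check below illustrates this under the same definitions. The two master states and the number of draws are chosen only for the demonstration.

```python
import numpy as np


def gen_seed(random_state):
    return random_state.randint(0, 2**31)


def gen_random_state(random_state):
    return np.random.RandomState(gen_seed(random_state))


# Two master states created with the same seed yield the same derived seeds.
master_a = np.random.RandomState(0)
master_b = np.random.RandomState(0)

seeds_a = [int(gen_seed(gen_random_state(master_a))) for _ in range(3)]
seeds_b = [int(gen_seed(gen_random_state(master_b))) for _ in range(3)]

print(seeds_a == seeds_b)
```

Each classifier and query strategy receives its own derived seed, so adding or reordering components changes which seed each one gets, while a fixed setup stays reproducible.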

## Classification Models and Query Strategies

In [13]:
classifier_factory_functions = {
    'LogisticRegression': lambda classes, random_state: SklearnClassifier(
        LogisticRegression(),
        classes=classes,
        random_state=gen_seed(random_state)
    )
}

In [14]:
query_strategy_factory_functions = {
    'RandomSampling': lambda random_state: RandomSampling(random_state=gen_seed(random_state)),
    'UncertaintySampling': lambda random_state: UncertaintySampling(random_state=gen_seed(random_state)),
    'DiscriminativeAL': lambda random_state: DiscriminativeAL(random_state=gen_seed(random_state)),
    'CoreSet': lambda random_state: CoreSet(random_state=gen_seed(random_state)),
    'TypiClust': lambda random_state: TypiClust(random_state=gen_seed(random_state)),
    'Badge': lambda random_state: Badge(random_state=gen_seed(random_state))
}

In [15]:
def create_classifier(name, classes, random_state):
    return classifier_factory_functions[name](classes, random_state)

def create_query_strategy(name, random_state):
    return query_strategy_factory_functions[name](random_state)

## Experiment Parameters

In [16]:
n_reps = 1
n_training_dataset = len(cifar10_X_train)
#n_cycles = int(0.5 * n_training_dataset)
n_cycles = 500
classifier_names = classifier_factory_functions.keys()
query_strategy_names = query_strategy_factory_functions.keys()

## Experiment Loop

In [None]:
results = {}

for clf_name in classifier_names:
    for qs_name in query_strategy_names:
        accuracies = np.full((n_reps, n_cycles), np.nan)
        for i_rep in range(n_reps):
            # Start each repetition with every training label marked as missing.
            cifar10_y_train = np.full(shape=cifar10_y_train_true.shape, fill_value=MISSING_LABEL)

            # The class labels are the integers 0..9, taken from `dataset_classes`.
            clf = create_classifier(clf_name, classes=np.arange(dataset_classes["CIFAR10"]), random_state=gen_random_state(master_random_state))
            qs = create_query_strategy(qs_name, random_state=gen_random_state(master_random_state))
            clf.fit(cifar10_X_train, cifar10_y_train)

            for c in tqdm(range(n_cycles), desc=f'Repeat {i_rep + 1} in {clf_name} with {qs_name}'):
                # call_func passes only the keyword arguments that the strategy's query method accepts.
                query_idx = call_func(qs.query, X=cifar10_X_train, y=cifar10_y_train, batch_size=1, clf=clf, discriminator=clf)
                # Reveal the true label of the queried sample and refit.
                cifar10_y_train[query_idx] = cifar10_y_train_true[query_idx]
                clf.fit(cifar10_X_train, cifar10_y_train)
                score = clf.score(cifar10_X_test, cifar10_y_test)
                accuracies[i_rep, c] = score

        results[(clf_name, qs_name)] = accuracies

Repeat 1 in LogisticRegression with RandomSampling: 100%|██████████| 500/500 [01:46<00:00,  4.71it/s]
Repeat 1 in LogisticRegression with UncertaintySampling:  48%|████▊     | 239/500 [01:46<02:12,  1.98it/s]

## Plotting the Results

In [17]:
#!pip install mlflow

In [18]:
import mlflow

mlflow.set_tracking_uri(uri="file:///mnt/stud/home/jcheng/scikit-activeml/tutorials/tracking")
mlflow.set_experiment("Pool Evaluation with DINOv2")

In [None]:
with mlflow.start_run():
    for clf_name in classifier_names:
        for qs_name in query_strategy_names:
            key = (clf_name, qs_name)
            result = results[key]
            reshaped_result = result.reshape((-1, n_cycles))
            # Mean and standard deviation over repetitions, one value per cycle.
            errorbar_mean = np.mean(reshaped_result, axis=0)
            errorbar_std = np.std(reshaped_result, axis=0)
            # log_metric records one scalar per call, so log each cycle as a step.
            for cycle in range(n_cycles):
                mlflow.log_metric(f'errorbar_mean for {qs_name} with {clf_name}', float(errorbar_mean[cycle]), step=cycle)
                mlflow.log_metric(f'errorbar_std for {qs_name} with {clf_name}', float(errorbar_std[cycle]), step=cycle)
            plt.errorbar(np.arange(n_cycles), errorbar_mean, errorbar_std, label=f"({np.mean(errorbar_mean):.4f}) {qs_name}", alpha=0.5)
        plt.title(clf_name)
        plt.legend(loc='lower right')
        plt.xlabel('cycle')
        plt.ylabel('accuracy')
        plt.show()
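
The number shown in each legend entry is the mean of the averaged learning curve, which summarizes a strategy's accuracy over all cycles in a single value. The small worked example below reproduces that arithmetic on made-up accuracies for two repetitions and three cycles. The values are invented for illustration and are not results from this experiment.

```python
import numpy as np

# Hypothetical accuracies: rows are repetitions, columns are cycles.
accuracies = np.array([
    [0.50, 0.60, 0.70],
    [0.54, 0.64, 0.74],
])

# Average over repetitions to get one learning curve with error bars.
mean_curve = accuracies.mean(axis=0)
std_curve = accuracies.std(axis=0)

# Averaging the curve over cycles gives the single number used in the legend.
legend_score = mean_curve.mean()

print(mean_curve, std_curve, round(float(legend_score), 4))
```

With `n_reps = 1`, as in the parameters above, the standard deviation is zero for every cycle, so the error bars only become informative once several repetitions are run.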