# Pool-based Active Learning - Simple Comparison Experiment

This tutorial helps new users get started with our library `scikit-activeml`, which implements the most important query strategies for active learning. It is built upon the well-known machine learning framework `scikit-learn`, so its interface will feel familiar to `scikit-learn` users. For better understanding, we walk through an exemplary active learning cycle here. Let's start by importing the relevant packages from both `scikit-learn` and `scikit-activeml`.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, KFold

from skactiveml.classifier import SklearnClassifier, ParzenWindowClassifier
from skactiveml.pool import UncertaintySampling, ProbabilisticAL
from skactiveml.utils import unlabeled_indices, labeled_indices, MISSING_LABEL
from skactiveml.visualization import plot_decision_boundary, plot_utilities

import warnings
warnings.filterwarnings("ignore")

## Random Seed Management
To make the experiment reproducible, we have to set the random state of every component that might use one. To simplify this, we derive all random seeds from a single fixed random state and use helper functions to generate new seeds and random states. Keep in mind that `master_random_state` should only be used to create new random states or random seeds.

In [None]:
master_random_state = np.random.RandomState(0)

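def gen_seed(random_state: np.random.RandomState):
    """Draw a new integer seed from the given random state."""
    return random_state.randint(0, 2**31)

def gen_random_state(random_state: np.random.RandomState):
    """Create a new random state seeded from the given random state."""
    return np.random.RandomState(gen_seed(random_state))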
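
As a quick sanity check of this scheme, the following sketch builds the master random state twice and compares the derived seeds. The helper `derive_seeds` and the number of seeds are illustrative assumptions and not part of the tutorial itself.

```python
import numpy as np


def gen_seed(random_state):
    # Same idea as the helper above: draw one integer seed from a random state.
    return random_state.randint(0, 2**31)


def derive_seeds(master_seed, n_seeds):
    # Hypothetical helper for this check only: create a fresh master state
    # and derive `n_seeds` child seeds from it.
    master = np.random.RandomState(master_seed)
    return [gen_seed(master) for _ in range(n_seeds)]


first_run = derive_seeds(0, 5)
second_run = derive_seeds(0, 5)

# The same master seed reproduces the same sequence of child seeds.
print(first_run == second_run)
```

Because every component receives a seed drawn from the master state, fixing the master seed is enough to fix all of them, provided the seeds are always drawn in the same order.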

## Data Set Generation
We generate a data set of 400 data points with two classes using the `make_classification` function of `scikit-learn`. This function also returns the true label of each data point. In practice, however, we do not know these labels unless we ask an oracle. Here, the labels are stored in `y_true`, which acts as the oracle.

In [None]:
n_features = 2
n_classes = 2
classes = np.arange(n_classes)
X, y_true = make_classification(
    n_features=n_features, n_redundant=0, n_samples=400,
    n_classes=n_classes, random_state=gen_seed(master_random_state))
bound = [[min(X[:, 0]), min(X[:, 1])], [max(X[:, 0]), max(X[:, 1])]]
plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='jet')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Data set');
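
The role of the oracle can be sketched with plain NumPy. The example assumes that missing labels are encoded as `np.nan`, which is what `MISSING_LABEL` is expected to stand for by default; the small label vector and the queried indices are made up for illustration.

```python
import numpy as np

# Assumption: missing labels are encoded as NaN.
MISSING = np.nan

rng = np.random.RandomState(0)
y_oracle = rng.randint(0, 2, size=10)  # stand-in for y_true

# The learner starts without any labels.
y = np.full(shape=y_oracle.shape, fill_value=MISSING)

# "Asking the oracle" means copying the true labels of the queried samples.
query_idx = np.array([3, 7])  # hypothetical result of a query strategy
y[query_idx] = y_oracle[query_idx]

labeled = np.flatnonzero(~np.isnan(y))
unlabeled = np.flatnonzero(np.isnan(y))
print(labeled, len(unlabeled))  # [3 7] 8
```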

## Classification Models and Query Strategies
We create classifiers and query strategies through factory functions, which makes it easy to give every repetition and fold its own, separate instances.

In [None]:
def create_classifier(name, classes, random_state):
    classifier_factory_functions = {
        'PWC': lambda: ParzenWindowClassifier(
            classes=classes,
            random_state=gen_seed(random_state)
        ),
        'LR': lambda: SklearnClassifier(
            LogisticRegression(random_state=gen_seed(random_state)),
            classes=classes,
            random_state=gen_seed(random_state)
        )
    }
    return classifier_factory_functions[name]()

def create_query_strategy(name, random_state):
    query_strategy_factory_functions = {
        'US': lambda: UncertaintySampling(random_state=gen_seed(random_state)),
        'PAL': lambda: ProbabilisticAL(random_state=gen_seed(random_state))
    }
    return query_strategy_factory_functions[name]()
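
One plausible reason for wrapping each constructor in a `lambda` is that only the requested object is built, so only the seeds it needs are drawn from the random state. The sketch below illustrates this with a hypothetical `DummyModel` class and a hypothetical `create_model` factory in place of the real classifiers and query strategies.

```python
import numpy as np


class DummyModel:
    # Hypothetical stand-in for a classifier or query strategy.
    def __init__(self, name, seed):
        self.name = name
        self.seed = seed


def create_model(name, random_state):
    # Each entry is a zero-argument callable; nothing is constructed
    # (and no seed is drawn) until the selected entry is called.
    factories = {
        'A': lambda: DummyModel('A', random_state.randint(0, 2**31)),
        'B': lambda: DummyModel('B', random_state.randint(0, 2**31)),
    }
    return factories[name]()


rs = np.random.RandomState(0)
m1 = create_model('A', rs)
m2 = create_model('A', rs)

# Every call returns a new, separate object.
print(m1 is m2)  # False

# Only one seed is consumed per call, so an identical random state yields
# the same seed no matter which entry is requested.
m3 = create_model('B', np.random.RandomState(0))
print(m3.seed == m1.seed)  # True
```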

## Experiment Parameters

In [None]:
n_reps = 10
n_folds = 5
n_cycles = 50
use_stratified = True
classifier_names = ['PWC', 'LR']
query_strategy_names = ['US', 'PAL']
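
To get a feeling for the size of this experiment, the following back-of-the-envelope sketch derives a few quantities from the parameters. It assumes the 400 samples generated above and one queried label per cycle (`batch_size=1`); the variable names are chosen for this illustration only.

```python
n_samples = 400
n_reps, n_folds, n_cycles = 10, 5, 50
n_classifiers, n_query_strategies = 2, 2

# With 5 folds, each split trains on 4/5 of the data.
n_train = n_samples - n_samples // n_folds

# One label is queried per cycle, so this is the labeled share
# of the training pool at the end of a run.
labeled_fraction = n_cycles / n_train

# Total number of active learning cycles over all combinations,
# repetitions, and folds.
n_total_cycles = (n_classifiers * n_query_strategies
                  * n_reps * n_folds * n_cycles)

print(n_train, labeled_fraction, n_total_cycles)  # 320 0.15625 10000
```

Since the classifier is refit in every cycle, reducing `n_reps` or `n_cycles` is the most direct way to shorten the runtime.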

In [None]:
kfold_class = StratifiedKFold if use_stratified else KFold

for clf_name in classifier_names:
    print(clf_name)
    for qs_name in query_strategy_names:
        print(qs_name)
        accuracies = np.full((n_reps, n_folds, n_cycles), np.nan)
        for i_rep in range(n_reps):
            print(i_rep)
            kf = kfold_class(n_splits=n_folds, shuffle=True, random_state=gen_seed(master_random_state))
            for i_fold, (train_idx, test_idx) in enumerate(kf.split(X, y_true)):
                X_train = X[train_idx]
                y_test = y_true[test_idx]
                X_test = X[test_idx]
                y_train = y_true[train_idx]

                clf = create_classifier(clf_name, classes, gen_random_state(master_random_state))
                qs = create_query_strategy(qs_name, gen_random_state(master_random_state))

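                # Start without labels; only training samples may be queried,
                # so the test fold stays unseen.
                y = np.full(shape=y_train.shape, fill_value=MISSING_LABEL)

                clf.fit(X_train, y)
                for c in range(n_cycles):
                    query_idx = qs.query(X=X_train, y=y, clf=clf, batch_size=1)
                    y[query_idx] = y_train[query_idx]
                    clf.fit(X_train, y)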

                    # plotting
                    unlbld_idx = unlabeled_indices(y)
                    lbld_idx = labeled_indices(y)