This software package provides a toolbox for testing pool-based active-learning algorithms in MATLAB.
Specifically, we consider the following scenario. There is a pool of datapoints . We may successively select a set of points to observe. Each observation reveals a discrete, integer-valued label for . This labeling process might be nondeterministic; we might choose the same point twice and observe different labels each time. In active learning, we typically assume we have a budget that limits the number of points we may observe.
Our goal is to iteratively build a set of observations
that achieves some goal in an efficient manner. One typical goal is that this training set allows us to accurately predict the labels on the unobserved points. Assume we have a probabilistic model
and let represent the set of unobserved points. We might with to minimize either the 0/1 loss on the unlabeled points
We could sample a random set of points, but by careful consideration of our observation locations, we hope we can do significantly better than this. One common active learning strategy, known as uncertainty sampling, iteratively chooses to make an observation at the point with the largest marginal entropy given the current data:
with the hope that these queries can better map out the boundaries between classes.
Of course, there are countless goals besides minimizing generalization error and numerous other strategies besides the highly myopic uncertainty sampling. Indeed, many active learning scenerios might not involve probability models at all. Providing a highly adaptable and extensible toolbox for conducting arbitrary pool-based active learning experiments is the goal of this project.
The most-important function is active_learning
, which simulates an
active learning experiment using the following procedure:
Given: initially labeled points X,
corresponding labels Y,
budget B
for i = 1:B
% find points available for labeling
eligible_points = selector(x, y)
% decide on point(s) to observe
x_star = query_strategy(x, y, eligible_points)
% observe point(s)
y_star = label_oracle(x_star)
% add observation(s) to training set
X = [X, x_star]
Y = [Y, y_star]
end
The implementation supports user-specified:
-
Selectors, which given the current training set, return a set of points currently eligible for labeling. See
selectors.m
for usage and available implementations. -
Query strategies, which given a training set and the selected eligible points, decides which point(s) to observe next. Note that a query strategy can return multiple points, allowing for batch observations. See
query_strategies.m
for usage and available implementations. -
Label oracles, which given a set of points, return a set of corresponding labels. Label oracles may optionally be nondeterministic (see, for example,
bernoulli_oracle
). Seelabel_oracles.m
for usage and available implementations.
Each of these are provided as function handles satisfying a desired API, described below.
This function also supports arbitrary user-specified callbacks called after each round of the experiment. This can be useful, for example, for plotting the progress of the algorithm and/or printing statistics such as test error online.
A selector considers the current labeled dataset and indicates which of the unlabeled points should be considered for observation at this time.
Selectors must satisfy the following interface:
test_ind = selector(problem, train_ind, observed_labels)
-
problem
: a struct describing the problem, containing fields:-
`points`: an ![(n x d)][13] data matrix for the available points
num_classes
: the number of classesnum_queries
: the number of queries to make
-
-
train_ind
: a list of indices intoproblem.points
indicating the thus-far observed points -
observed_labels
: a list of labels corresponding to the observations intrain_ind
test_ind
: a list of indices intoproblem.points
indicating the points to consider for labeling
The following general-purpose selectors are provided in this toolbox:
fixed_test_set_selector
: selects all points besides a given test setgraph_walk_selector
: confines an experiment to follow a path on a graphidentity_selector
: selects all pointsrandom_selector
: selects a random subset of pointsunlabeled_selector
: selects points not yet observed
In addition, the following "meta" selectors are provided, which combine or modify the outputs of other selectors:
complement_selector
: takes the complement of a selector's outputintersection_selector
: takes the intersection of the outputs of selectors-
`union_selector`: takes the union of the outputs of selectors
Query strategies select which of the points currently eligible for labeling (returned by a selector) should be observed next.
Query strategies must satisfy the following interface:
query_ind = query_strategy(problem, train_ind, observed_labels, test_ind)
-
problem
: a struct describing the problem, containing fields:-
`points`: an ![(n x d)][13] data matrix for the available points
num_classes
: the number of classesnum_queries
: the number of queries to make
-
-
train_ind
: a list of indices intoproblem.points
indicating the thus-far observed points -
observed_labels
: a list of labels corresponding to the observations intrain_ind
-
test_ind
: a list of indices intoproblem.points
indicating the points eligible for observation
query_ind
: an index intoproblem.points
indicating the point(s) to query next (every entry inquery_ind
will always be a member of the set of points intest_ind
)
The following query strategies are provided in this toolbox:
argmax
: samples the point(s) maximizing a given score functionargmin
: samples the point(s) minimizing a given score functionexpected_error_reduction
: samples the point giving lowest expected loss on unlabeled pointsmargin_sampling
: samples the point with the smallest marginquery_by_committee
: samples the point with the highest disagreement between modelsuncertainty_sampling
: samples the most uncertain point
Label oracles are functions that, given a set of points chosen to be queried, returns a set of corresponding labels. In general, they need not be deterministic, which is especially interesting when points can be queried multiple times.
Label oracles must satisfy the following interface:
label = label_oracle(problem, query_ind)
-
problem
: a struct describing the problem, containing fields:-
`points`: an ![(n x d)][13] data matrix for the available points
num_classes
: the number of classes
-
-
query_ind
: an index intoproblem.points
specifying the point(s) to be queried
label
: a list of integers between 1 andproblem.num_classes
indicating the observed label(s)
The following general-purpose label oracles are provided in this toolbox:
lookup_oracle
: a trivial lookup-table label oracle given a fixed list of ground-truth labelsbernoulli_oracle
: a label oracle that, conditioned on the queried point(s), samples labels independently from a Bernoulli distribution with given success probabilitymultinomial_oracle
: a label oracle that, conditioned on the queried point(s), samples labels independently from a multinomial distribution with given success probabilities