Let's say we have a dataset that we'd like to use as a benchmark for active learning.

In [1]:
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100_000, n_classes=4, n_informative=10)

We will use modAL to handle some of the heavy lifting. That also means we need to write a modAL-compatible sampler if we want a random-sampling baseline to compare against.

In [2]:
import numpy as np
from modAL.uncertainty import entropy_sampling
from sklearn.linear_model import LogisticRegression

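def randomly(classifier, X, n_instances=1):
    # Pick n_instances distinct row indices uniformly at random.
    # (np.random.randint excludes its upper bound, so `X.shape[0] - 1` would never pick the last row.)
    idx = np.random.choice(X.shape[0], size=n_instances, replace=False)
    # Return the chosen indices together with the matching rows.
    return idx, X[idx]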
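For contrast with the random sampler, here is a minimal sketch of the idea behind entropy-based selection: score every candidate row by the entropy of its predicted class probabilities and take the most uncertain ones. This is not modAL's implementation; `entropy_top_k` is a hypothetical helper that works directly on a probability matrix, such as the output of `predict_proba`.

```python
import numpy as np

def entropy_top_k(proba, n_instances=1):
    """Return indices of the rows of `proba` with the highest Shannon entropy."""
    proba = np.asarray(proba, dtype=float)
    # Clip before taking the log so that p = 0 does not produce -inf.
    logp = np.log(np.clip(proba, 1e-12, 1.0))
    entropy = -(proba * logp).sum(axis=1)
    # argsort is ascending, so take the last n_instances and reverse them.
    return np.argsort(entropy)[-n_instances:][::-1]

proba = np.array([
    [0.97, 0.01, 0.01, 0.01],  # confident prediction
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain prediction
    [0.60, 0.30, 0.05, 0.05],  # somewhere in between
])
picked = entropy_top_k(proba, n_instances=2)
print(picked)  # most uncertain rows first
```

The random sampler ignores the classifier entirely, while this one depends on how well calibrated the predicted probabilities are.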

We can now use the `run_experiment` function from the `skteach.py` script to automate the data collection.

In [3]:
from skteach import run_experiment

In [4]:
?run_experiment

Signature:
run_experiment(
    name,
    data,
    estimator,
    strategy,
    batch_size=100,
    n_batch=10,
    start_size=50,
    idx_start=None,
)
Docstring:
Runs an active learning experiment and returns a list of results for a single run. 

Arguments:
    - name: the name for the experiment
    - data: the (X, y) data pair
    - estimator: the ML model to use
    - strategy: a modAL compatible selection function
    - batch_size: size of each labelling batch
    - n_batch
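Since `skteach.py` itself is not shown here, the following is only a rough sketch of what a loop with this signature might do. Everything in it is an assumption: the name `run_experiment_sketch`, the extra `seed` argument, scoring by accuracy on the full dataset, and the `uuid`-based `run_id`. The result keys (`name`, `run_id`, `batch`, `score`) are chosen to match the fields plotted further down.

```python
import uuid

import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def run_experiment_sketch(name, data, estimator, strategy, batch_size=100,
                          n_batch=10, start_size=50, idx_start=None, seed=None):
    """Hypothetical stand-in for run_experiment: one result dict per batch."""
    X, y = data
    rng = np.random.default_rng(seed)
    if idx_start is None:
        idx_start = rng.choice(X.shape[0], size=start_size, replace=False)
    labelled = np.zeros(X.shape[0], dtype=bool)
    labelled[idx_start] = True
    run_id = uuid.uuid4().hex  # assumed way to tell runs apart
    results = []
    for batch in range(n_batch):
        model = clone(estimator).fit(X[labelled], y[labelled])
        # Assumption: the score is plain accuracy over the full dataset.
        results.append({"name": name, "run_id": run_id, "batch": batch,
                        "score": model.score(X, y)})
        pool_idx = np.flatnonzero(~labelled)
        picked = strategy(model, X[pool_idx], n_instances=batch_size)
        # Accept both `idx` and `(idx, instances)` return styles.
        picked = picked[0] if isinstance(picked, tuple) else picked
        # Map positions in the pool back to positions in the full dataset.
        labelled[pool_idx[np.asarray(picked)]] = True
    return results


def pick_random(classifier, X, n_instances=1):
    idx = np.random.choice(X.shape[0], size=n_instances, replace=False)
    return idx, X[idx]


X_demo, y_demo = make_classification(n_samples=500, n_classes=2, random_state=0)
rows = run_experiment_sketch("randomly", (X_demo, y_demo),
                             LogisticRegression(max_iter=1000), pick_random,
                             batch_size=20, n_batch=5, start_size=40, seed=0)
```

Returning a flat list of dicts is what makes `data += run_experiment(...)` and `pd.DataFrame(data)` work without any reshaping.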

In [5]:
data = []
for i in range(5): 
    data += run_experiment("entropy", data=(X, y), estimator=LogisticRegression(), strategy=entropy_sampling, n_batch=50)
for i in range(5):
    data += run_experiment("randomly", data=(X, y), estimator=LogisticRegression(), strategy=randomly, n_batch=50)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
100%|███████████████████████████████████| 50/50 [00:03<00:00, 15.33it/s]
100%|███████████████████████████████████| 50/50 [00:03<00:00, 13.43it/s]

We can compare the scores with an altair chart now.

In [6]:
import pandas as pd
import altair as alt

pltr = pd.DataFrame(data)

(alt.Chart(pltr)
  .mark_line()
  .encode(x='batch', y='score', color='name', detail='run_id')
  .properties(width=600, height=250)
  .interactive())

Lesson one in active learning: random sampling is a surprisingly strong benchmark.
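One way to check that impression numerically is to average the score per batch and strategy, then look at the gap. The sketch below assumes the result records carry the `name`, `run_id`, `batch` and `score` fields used in the chart. The `records` list is a made-up placeholder to show the shape, not output from the runs above; in the notebook you would pass `data` instead.

```python
import pandas as pd

# Placeholder records with invented scores, only to illustrate the layout.
records = [
    {"name": "entropy", "run_id": "a", "batch": 0, "score": 0.50},
    {"name": "entropy", "run_id": "b", "batch": 0, "score": 0.54},
    {"name": "randomly", "run_id": "c", "batch": 0, "score": 0.51},
    {"name": "randomly", "run_id": "d", "batch": 0, "score": 0.55},
    {"name": "entropy", "run_id": "a", "batch": 1, "score": 0.60},
    {"name": "entropy", "run_id": "b", "batch": 1, "score": 0.62},
    {"name": "randomly", "run_id": "c", "batch": 1, "score": 0.61},
    {"name": "randomly", "run_id": "d", "batch": 1, "score": 0.65},
]

# Mean score per batch, one column per strategy.
summary = (pd.DataFrame(records)
           .groupby(["batch", "name"])["score"]
           .mean()
           .unstack("name"))
# Positive values mean entropy sampling is ahead for that batch.
summary["gap"] = summary["entropy"] - summary["randomly"]
print(summary)
```

With only five runs per strategy the means are noisy, so the per-run lines in the chart remain worth looking at alongside this table.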

In [7]:
from sklearn.ensemble import RandomForestClassifier

In [8]:
data = []
for i in range(5): 
    data += run_experiment("entropy", data=(X, y), estimator=RandomForestClassifier(), strategy=entropy_sampling, n_batch=50)
for i in range(5):
    data += run_experiment("randomly", data=(X, y), estimator=RandomForestClassifier(), strategy=randomly, n_batch=50)

100%|███████████████████████████████████| 50/50 [01:18<00:00,  1.57s/it]
100%|███████████████████████████████████| 50/50 [01:19<00:00,  1.59s/it]
100%|███████████████████████████████████| 50/50 [01:19<00:00,  1.60s/it]
100%|███████████████████████████████████| 50/50 [01:19<00:00,  1.59s/it]
100%|███████████████████████████████████| 50/50 [01:19<00:00,  1.59s/it]
100%|███████████████████████████████████| 50/50 [00:41<00:00,  1.22it/s]
100%|███████████████████████████████████| 50/50 [01:04<00:00,  1.29s/it]
100%|███████████████████████████████████| 50/50 [00:41<00:00,  1.20it/s]
100%|███████████████████████████████████| 50/50 [00:41<00:00,  1.21it/s]
100%|███████████████████████████████████| 50/50 [00:41<00:00,  1.20it/s]


In [9]:
pltr = pd.DataFrame(data)

(alt.Chart(pltr)
  .mark_line()
  .encode(x='batch', y='score', color='name', detail='run_id')
  .properties(width=600, height=250)
  .interactive())

In [None]:


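# GaussianProcessClassifier is not imported anywhere above, so import it here.
from sklearn.gaussian_process import GaussianProcessClassifier

data = []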
for i in range(5): 
    data += run_experiment("entropy", data=(X, y), estimator=GaussianProcessClassifier(), strategy=entropy_sampling, n_batch=50)
for i in range(5):
    data += run_experiment("randomly", data=(X, y), estimator=GaussianProcessClassifier(), strategy=randomly, n_batch=50)