# Getting Started

This tutorial helps new users get started with our library `scikit-activeml`, which implements the most important query strategies for active learning. It is built upon the well-known machine learning framework `scikit-learn`, which makes it user-friendly. To illustrate its use, we walk through an exemplary active learning cycle. Let's start by importing the relevant packages from both `scikit-learn` and `scikit-activeml`.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from skactiveml.pool import UncertaintySampling
from skactiveml.utils import is_unlabeled, MISSING_LABEL, plot_2d_dataset
from skactiveml.classifier import SklearnClassifier
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")

## Data Set Generation
We generate a data set of 100 two-dimensional data points from two classes using the `make_classification` function of `scikit-learn`. This function also returns the true label of each data point. In practice, however, we do not know these labels unless we ask an oracle. Here, the labels are stored in `y_true`, which acts as the oracle.

In [None]:
X, y_true = make_classification(n_features=2, n_redundant=0, random_state=0)
plt.scatter(X[:, 0], X[:, 1], c=y_true)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Data set');

## Classification
Our goal is to classify the data points into two classes. To do so, we introduce a vector `y` to store the labels that we acquire from the oracle (`y_true`). As shown below, the vector `y` is unlabeled at the beginning.

In [None]:
y = np.full(shape=y_true.shape, fill_value=MISSING_LABEL)
print(y)
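
To make the idea of a partially labeled vector concrete, here is a minimal NumPy-only sketch. It assumes that missing labels are encoded as NaN, which is consistent with the `np.isnan` check used later in this tutorial; the names `missing_label`, `y_true_demo`, and `y_demo` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Assumption: missing labels are represented by NaN.
missing_label = np.nan

# Hypothetical oracle labels for five data points.
y_true_demo = np.array([0, 1, 1, 0, 1])

# Start with a fully unlabeled vector of the same shape.
y_demo = np.full(shape=y_true_demo.shape, fill_value=missing_label)

# Reveal the labels of two data points, as an oracle query would.
y_demo[[1, 3]] = y_true_demo[[1, 3]]

# Boolean mask of entries that are still unlabeled.
unlabeled_mask = np.isnan(y_demo)
n_labeled = int(np.sum(~unlabeled_mask))

print(y_demo)
print(n_labeled)
```

Because NaN is a floating-point value, the label vector has a float dtype even when the class labels themselves are integers.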

There are many easy-to-use classification algorithms in `scikit-learn`. In this example, we use the logistic regression classifier. Details of other classifiers can be accessed from here: https://scikit-activeml.readthedocs.io/en/latest/api/classifier.html. As `scikit-learn` classifiers cannot cope with missing labels, we need to wrap these with the `SklearnClassifier`.

In [None]:
clf = SklearnClassifier(LogisticRegression(), classes=np.unique(y_true))

## Query Strategy
The query strategies are the central part of our library. In this example, we use uncertainty sampling with entropy to determine the most uncertain data points. All implemented strategies can be accessed from here: https://scikit-activeml.readthedocs.io/en/latest/api/pool.html.

In [None]:
qs = UncertaintySampling(clf, method='entropy', random_state=42)
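
The idea behind entropy-based uncertainty sampling can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the library's implementation: the helper `entropy_uncertainty` and the probabilities in `proba_demo` are hypothetical. The data point whose predicted class distribution has the highest Shannon entropy is considered the most uncertain.

```python
import numpy as np

def entropy_uncertainty(proba):
    """Return the Shannon entropy of each row of class probabilities."""
    # Clip to avoid log(0) for probabilities that are exactly zero.
    proba = np.clip(proba, 1e-12, 1.0)
    return -np.sum(proba * np.log(proba), axis=1)

# Hypothetical predicted class probabilities for three candidates.
proba_demo = np.array([
    [0.9, 0.1],  # confident prediction
    [0.5, 0.5],  # maximally uncertain prediction
    [0.2, 0.8],  # moderately confident prediction
])

utilities = entropy_uncertainty(proba_demo)

# The candidate with the highest entropy would be queried next.
best_idx = int(np.argmax(utilities))
print(utilities, best_idx)
```

For two classes, the entropy is largest when both classes are equally likely, so the second candidate is selected in this sketch.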

## Active Learning Cycle
In this example, we perform 20 iterations of the active learning cycle (`n_cycles=20`). In each iteration, we acquire one label (`batch_size=1`), so the total number of acquired labels is `batch_size * n_cycles`.
Inside the loop, we first get the indices of the unlabeled instances. In the first iteration, `unlbld_idx` contains all indices because all data points are unlabeled. Next, we use the unlabeled indices to create the array `X_cand` of candidate instances that can still be queried and ask the query strategy for the most informative data points in `X_cand`. As used in the code below, the `query` method returns the indices of the `batch_size` selected instances relative to `X_cand`, which we map back to indices in `X` via `unlbld_idx`. In our case, the output is just a single index, as `batch_size=1`.

Finally, we ask the oracle for the true label of the selected data point and store it in `y` to train our classifier. We continue until we have acquired 20 labels.
Below, we see the implementation of an active learning cycle. The first figure shows the decision boundary after acquiring the labels of two data points. The second figure shows the decision boundary with 10 acquired labels, a significant improvement over the first figure. The last figure shows the decision boundary after acquiring labels for 20 data points. Finally, we use the accuracy score as a performance measure, which shows the accuracy of our classifier at the specified iterations.

In [None]:
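n_cycles = 20
# Start with a fully unlabeled label vector.
y = np.full(shape=y_true.shape, fill_value=MISSING_LABEL)
for c in range(n_cycles):
    # Indices of the instances that are still unlabeled.
    unlbld_idx = np.where(is_unlabeled(y))[0]
    X_cand = X[unlbld_idx]
    # Query one candidate and map its index back to an index in `X`.
    query_idx = unlbld_idx[qs.query(X_cand=X_cand, X=X, y=y, batch_size=1)]
    # Ask the oracle for the label and retrain the classifier.
    y[query_idx] = y_true[query_idx]
    clf.fit(X, y)
    # Plot and evaluate after 2, 10, and 20 acquired labels.
    if c in [1, 9, 19]:
        y_pred = clf.predict(X)
        plot_2d_dataset(X, y, y_true, clf, qs)
        n_labeled = np.sum(~np.isnan(y))
        print('The accuracy score is {} for {} iterations.'.format(accuracy_score(y_true, y_pred), n_labeled))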
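
The index mapping used inside the loop (`unlbld_idx[...]`) can be illustrated with a small NumPy-only sketch. The values below are hypothetical, and `local_idx` stands in for the result of a query over the candidate set; it assumes, as in the loop above, that the query result indexes into the candidates rather than into the full data set.

```python
import numpy as np

# Hypothetical label vector: NaN marks a missing label.
y_demo = np.array([0.0, np.nan, 1.0, np.nan, np.nan])

# Positions in the full data set that are still unlabeled.
unlbld_idx = np.where(np.isnan(y_demo))[0]

# Hypothetical query result: the third candidate (position 2 within
# the candidate set) was selected.
local_idx = np.array([2])

# Map the candidate position back to a position in the full data set.
query_idx = unlbld_idx[local_idx]
print(unlbld_idx, query_idx)
```

Without this mapping, the label would be written to the wrong position in `y` whenever some data points are already labeled.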