In [None]:
'''
 * Copyright (c) 2018 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

## Active Learning: Background

In many machine learning problems, the training data are treated as a fixed and given part of the problem definition. In practice, however, the training data are often not fixed beforehand. Rather, as an active participant in the training process, the learner has an opportunity to play a role in deciding what data will be acquired for training. 

In **active learning**, the learner must take actions to gain information and must decide which actions will yield the information that best minimizes future loss. In many applications, unlabeled data are abundant, but manual labeling is costly. When more sophisticated supervised (and semi-supervised) learning algorithms are used in these applications, labeled instances are often very difficult, time-consuming, or expensive to obtain.

A few examples are given below:

- **Information Extraction:** Good information extraction systems must be trained by using labeled documents with detailed annotations. Locating entities and relations can take half an hour for even simple newswire stories. Annotations for other knowledge domains may require additional expertise, e.g., annotating gene and disease mentions for biomedical information extraction often requires Ph.D-level biologists.

- **Classification and Filtering:** Learning to classify documents (e.g., articles or web pages) or any other kind of media (e.g., image, audio, and video files) requires users to label each document or media file with particular labels, like “relevant” or “not relevant” to the query. It can be tedious and even redundant for one to have to annotate thousands of these instances.

- **Clinical Named Entity Recognition:** Identification of clinical concepts or clinical named entity recognition (NER) is an important task for building clinical natural language processing (NLP) systems. Chen et al. reported that for the annotated NER corpus from the 2010 i2b2/VA NLP challenge that contained 349 clinical documents with 20,423 unique sentences, it is very difficult to simulate experiments using an existing clinical NER corpus with annotated medical problems, treatments, and lab tests in clinical notes.

Effective management of digital video resources for human action retrieval is a difficult task, as action retrieval is more challenging than action recognition. **Relevance feedback** is one common technique used for mitigating the effect of the lack of training in human action retrieval systems. To improve the accuracy of a retrieval system through relevance feedback, the user marks each returned result as either relevant or irrelevant to the query and feeds these marks back to the retrieval system.

The retrieval system first performs an initial query and returns the top $ T $ most relevant results, and the user labels as many or as few of them as desired as “relevant” or “irrelevant”. The user then runs the query again to obtain improved results, repeating until the intra-class variability is better represented in the query. Even with relevance feedback, however, retrieval results can still be poor: the amount of feedback is often very small, because users have limited patience for providing it, and with only a few training samples the retrieval results are often unstable and unreliable.

**Active learning** is a better choice than relevance feedback for retrieval systems. Like relevance feedback, active learning (also called query learning) requires the user to label several database items according to their relevance to the query. These labels are then used to update the query and improve the retrieval results. The key difference between relevance feedback and active learning is which database items get labeled: in relevance feedback, the user labels the top $ T $ most relevant results returned by the retrieval system, whereas in active learning, the learner itself (actively) chooses which items to submit to the teacher/expert for labeling in order to achieve higher accuracy.

With more informative feedback, active learning has a faster learning rate and typically performs better than relevance feedback. Active learning is a subfield of machine learning and, more generally, artificial intelligence. The key idea behind active learning is that if a machine learning algorithm is allowed to choose the data from which it learns, it will perform better with less training.

## Statistical Active Learning

When active learning is used for classification or regression, there are three main paradigms:

1. **Membership Query Learning**: It is also called constructive active learning. In membership query learning, the learner is allowed to create or select unlabeled instances for the human expert to label.

2. **Pool-based Active Learning**: It is popular for many applications such as text classification and speech recognition where unlabeled data are plentiful and cheap, but labels are expensive and slow to acquire. In pool-based active learning, the learner may not propose arbitrary points to label, but instead has access to a set of unlabeled examples, and is allowed to select which of them to request labels for.

3. **Stream-based Active Learning**: It is also called sequential active learning or selective sampling. Stream-based active learning resembles pool-based learning in many ways, except that the learner must decide whether to query or discard the unlabeled instances as a stream. The choice of examples to label can be seen as a dilemma between exploration and exploitation over the data space representation.

Active learning aims to find the best way to choose the data points for $ T_{C,i} $ (a subset chosen to be labeled) via some query strategy. Query strategies for determining which data points should be labeled can be organized into a number of different categories:

- **Uncertainty Sampling** is one of the most popular query criteria and can be used to select the most uncertain instances of the unlabeled data.

- **Query by Committee (QBC)** is an effective active learning method. A committee $ C = \{ \theta^{(1)}, \ldots, \theta^{(C)} \} $ of models is trained on the current labeled set $ L $, and each member votes on the labels of the query candidates. QBC directly measures the disagreement among the committee members' votes and selects the samples with inconsistent class-label votes as training data; the most informative query is the instance about which the members most disagree.

- **Expected Model Change** method queries the instance that would impart the greatest change to the current model if we knew its label.

- **Variance Reduction** method labels those points that would minimize output variance, which is one of the components of error.

- **Balance Exploration and Exploitation**: In this setting, the choice of examples to label is seen as a dilemma between exploration and exploitation over the data space representation. This strategy manages the compromise by modeling the active learning problem as a contextual bandit problem. Commonly used algorithms include the greedy algorithm, the Softmax algorithm, and the Bayes bandit algorithm.

- **Least Certainty** selects subset $ N_a $ whose elements are in the bottom of the rank queue $ Q $.

- **Middle Certainty** selects subset $ N_a $ whose elements are in the middle of the rank queue $ Q $.

- **Exponentiated Gradient Exploration** can improve any active learning algorithm by an optimal random exploration.
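As an illustration of uncertainty sampling, the following is a minimal sketch that scores unlabeled instances from a matrix of predicted class probabilities. The three measures used here (least confidence, margin, and entropy) and the helper name `uncertainty_scores` are assumptions chosen for illustration; the text above does not prescribe a particular uncertainty measure.

```python
import numpy as np

def uncertainty_scores(proba):
    """Return three uncertainty scores per row of class probabilities.

    proba: array of shape (n_samples, n_classes); each row sums to 1.
    For all three scores, a higher value means a more uncertain instance.
    """
    proba = np.asarray(proba, dtype=float)
    sorted_p = np.sort(proba, axis=1)[:, ::-1]           # descending within each row
    least_confidence = 1.0 - sorted_p[:, 0]              # 1 - P(most likely class)
    margin = 1.0 - (sorted_p[:, 0] - sorted_p[:, 1])     # small top-2 gap -> high score
    entropy = -np.sum(proba * np.log(np.clip(proba, 1e-12, 1.0)), axis=1)
    return least_confidence, margin, entropy

# Hypothetical predicted probabilities for three unlabeled instances
proba = np.array([[0.95, 0.05],
                  [0.55, 0.45],
                  [0.70, 0.30]])
lc, mg, ent = uncertainty_scores(proba)
query_index = int(np.argmax(ent))   # instance with the highest entropy
print(query_index)                  # prints 1
```

In the binary case the three scores produce the same ranking; they can differ when there are more than two classes.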

Relying on the learner’s ability to statistically model its own uncertainty, the objective of statistical active learning is usually to find model parameters that minimize some form of expected loss. 

The process of statistical active learning is usually as follows:

1. Begin by requesting labels for a small random sub-sample of the examples and fit a statistical model to the labeled data.
2. For any $ x $, use the statistical model to estimate both the conditional expectation $ \hat{y}(x) $ of its label and the variance of that estimate, $ \sigma^2_{\hat{y}}(x) $.
3. Consider a candidate point $ \tilde{x} $, and ask what reduction in loss we would obtain if we had labeled it $ \tilde{y} $.
4. Given the ability to estimate the expected effect of obtaining label $ \tilde{y} $ for candidate point $ \tilde{x} $, repeat this computation for a sample of $ M $ candidates, and then request a label for the candidate with the largest expected decrease in loss. Add the newly labeled example to the training set, retrain, and begin looking at candidate points to add on the next iteration.


# Active Learning

## 6.17.1 Active Learning: Background

In many machine learning problems, the training data are treated as a fixed part of the problem definition. However, in practice, the training data are often not fixed beforehand. Rather, as an active participant in the training process, the learner has the opportunity to play a role in deciding what data will be acquired for training. In active learning, the learner must take actions to gain information and decide what actions will give the information that will best minimize future loss.

In many applications, unlabeled data are usually abundant, but manual labeling is costly. If more sophisticated supervised (and semi-supervised) learning algorithms are used in these applications, labeled instances are often very difficult, time-consuming, or costly to obtain. A few examples are given below:

- **Information extraction**: Good information extraction systems must be trained using labeled documents with detailed annotations. Locating entities and relations can take half an hour for even simple newswire stories. Annotations for other knowledge domains may require additional expertise; e.g., annotating gene and disease mentions for biomedical information extraction often requires Ph.D.-level biologists.

- **Classification and filtering**: Learning to classify documents (e.g., articles or web pages) or any other kind of media (e.g., image, audio, and video files) requires users to label each document or media file with particular labels, such as "relevant" or "not relevant" to the query. It can be tedious and even redundant to annotate thousands of these instances.

- **Clinical named entity recognition**: Identification of clinical concepts or clinical named entity recognition (NER) is an important task for building clinical natural language processing (NLP) systems. Chen et al. reported that for the annotated NER corpus from the 2010 i2b2/VA NLP challenge that contained 349 clinical documents with 20,423 unique sentences, it is very difficult to simulate experiments using an existing clinical NER corpus with annotated medical problems, treatments, and lab tests in clinical notes.

Effective management of digital video resources for human action retrieval is a difficult task, as action retrieval is more challenging than action recognition. Relevance feedback is one common technique used for mitigating the effect of the lack of training in human action retrieval systems. In order to improve the accuracy of a retrieval system, relevance feedback allows the user to mark returned results as either relevant or irrelevant to their query.

Active learning is a better choice than relevance feedback for retrieval systems. The key difference is that in relevance feedback, the user labels the top results returned by the system, whereas in active learning, the learner itself chooses which items to submit to the teacher/expert for labeling in order to achieve higher accuracy.

Active learning is a subfield of machine learning and, more generally, artificial intelligence. The key idea behind active learning is that if a machine learning algorithm is allowed to choose the data from which it learns, it will perform better with less training.

## 6.17.2 Statistical Active Learning

When active learning is used for classification or regression, there are three main paradigms:

1. **Membership query learning**: Also called constructive active learning, where the learner is allowed to create or select unlabeled instances for the human expert to label.

2. **Pool-based active learning**: Popular for applications such as text classification and speech recognition, where unlabeled data are plentiful and cheap, but labels are expensive and slow to acquire. The learner has access to a set of unlabeled examples and can select which to request labels for.

3. **Stream-based active learning**: Also called sequential active learning or selective sampling, where the learner must decide whether to query or discard the unlabeled instances as a stream.

Active learning aims to choose the data points for $ T_{C,i} $ (a subset chosen to be labeled) via some query strategy. Query strategies for determining which data points should be labeled can be organized into several categories:

- **Uncertainty sampling**: Selects the most uncertain instances of the unlabeled data.
- **Query by committee (QBC)**: Measures voting differences of committee members and selects samples with inconsistent voting by class labels.
- **Expected model change**: Queries the instance that would impart the greatest change to the current model if its label were known.
- **Variance reduction**: Labels those points that would minimize output variance.
- **Balance exploration and exploitation**: Manages the choice of examples to label as a compromise between exploration and exploitation.
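
To make the query-by-committee idea concrete, the sketch below measures disagreement with vote entropy, one possible disagreement measure (an assumption; the text only says that voting differences are measured). The function name `vote_entropy` and the vote matrix are hypothetical.

```python
import numpy as np

def vote_entropy(votes, n_classes):
    """Vote entropy of a committee for each unlabeled instance.

    votes: integer array of shape (n_members, n_samples) holding the class
           index predicted by each committee member for each instance.
    Returns an array of shape (n_samples,); larger means more disagreement.
    """
    votes = np.asarray(votes)
    n_members, n_samples = votes.shape
    scores = np.zeros(n_samples)
    for c in range(n_classes):
        frac = np.sum(votes == c, axis=0) / n_members   # share of votes for class c
        nonzero = frac > 0                              # skip 0 * log(0) terms
        scores[nonzero] -= frac[nonzero] * np.log(frac[nonzero])
    return scores

# Hypothetical votes of a 3-member committee on 4 unlabeled instances
votes = np.array([[0, 0, 1, 1],
                  [0, 1, 1, 0],
                  [0, 1, 1, 2]])
scores = vote_entropy(votes, n_classes=3)
query_index = int(np.argmax(scores))   # instance the committee disagrees on most
print(query_index)                     # prints 3
```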
  
The process of statistical active learning usually follows these steps:

1. Begin by requesting labels for a small random sub-sample of examples and fit a statistical model to the labeled data.
2. For any $ x $, use the model to estimate both the conditional expectation $ \hat{y}(x) $ of its label $ y $ and the variance of that estimate, $ \sigma^2_{\hat{y}}(x) $.
3. For a candidate point $ \tilde{x} $, compute the reduction in loss if it were labeled $ \tilde{y} $.
4. Repeat this for a sample of $ M $ candidates and request a label for the candidate with the largest expected decrease in loss.
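
The candidate-scoring loop in steps 3 and 4 can be sketched for one special case. The sketch assumes a linear regression model with squared loss, where the variance part of the expected loss does not depend on the unknown label $ \tilde{y} $ and can therefore be evaluated before querying. The function name, the small `ridge` term (added only for numerical stability), and the random data are all illustrative assumptions.

```python
import numpy as np

def expected_variance_reduction(X_labeled, candidates, reference, ridge=1e-3):
    """Average drop in predictive variance (up to the noise factor) over the
    reference points if each candidate were added to the labeled set."""
    d = X_labeled.shape[1]
    A_inv = np.linalg.inv(X_labeled.T @ X_labeled + ridge * np.eye(d))
    reductions = []
    for x in candidates:
        v = A_inv @ x
        # Sherman-Morrison: adding x changes A_inv by -(v v^T) / (1 + x^T v),
        # so the variance r^T A_inv r at a point r drops by (r^T v)^2 / (1 + x^T v).
        reductions.append(np.mean((reference @ v) ** 2) / (1.0 + x @ v))
    return np.array(reductions)

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(5, 2))     # small initial labeled sample
candidates = rng.normal(size=(8, 2))    # M = 8 candidate points
reference = rng.normal(size=(200, 2))   # points over which the variance is averaged

reductions = expected_variance_reduction(X_labeled, candidates, reference)
best = int(np.argmax(reductions))       # candidate whose label would be requested
print(best, reductions[best])
```

For nonlinear models the reduction generally depends on the model's current estimate, and some approximation is usually needed.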

## Active Learning Algorithms

Given a large pool $ U $ of unlabeled samples and a small set of labeled samples $ L $, where $ |L| < |U| $ and $ X = L \cup U $, the pool-based active learning (PAL) framework can be summarized as follows:

1. **Initial model generation**: A small number of samples are queried for annotation to build the initial model, using either random sampling or application-oriented sampling.
2. **Querying**: Unannotated samples are ranked based on the querying algorithm.
3. **Training**: The selected unlabeled subset is annotated and added to the annotated set, and the model is retrained on the updated annotated set.
4. **Iteration**: Steps 2 and 3 are repeated until the stop criterion is met.

**Algorithm 1: Active Learning**

1. Input: labeled dataset $ L = \{x_i, y_i\}_{i=1}^{l} $ and unlabeled dataset $ U = \{x_i\}_{i=l+1}^{l+u} $.
2. Repeat:
   - Upsample $ L $ to obtain an even class distribution $ L_D $.
   - Use $ L $ or $ L_D $ to train a classifier $ h $, then classify $ U $.
   - Rank the data based on the prediction confidence value $ C $ and store them in queue $ Q $.
   - Select subset $ N_a $ whose elements are in the bottom of the rank queue $ Q $ (least certainty) or select $ N_a $ whose elements are in the middle of the rank queue $ Q $ (middle certainty).
   - Submit $ N_a $ to human annotation.
   - Remove $ N_a $ from $ U $: $ U = U \setminus N_a $.
   - Add $ N_a $ to $ L $: $ L = L \cup N_a $.
3. Until $ U $ is exhausted.
4. Output: $ \{x_i, y_i\}_{i=1}^{l+u} $.
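
The ranking and selection steps of Algorithm 1 can be sketched as follows. This is only an illustration of the "least certainty" and "middle certainty" choices; the helper name `select_batch`, the batch sizes, and the confidence values are hypothetical, and the classifier that produces the confidence values is left out.

```python
import numpy as np

def select_batch(confidence, batch_size, mode="least"):
    """Return indices into the unlabeled pool to submit for annotation.

    confidence: prediction confidence C of the classifier for each pool item.
    mode: "least"  -> bottom of the rank queue Q (least certainty)
          "middle" -> middle of the rank queue Q (middle certainty)
    """
    confidence = np.asarray(confidence, dtype=float)
    order = np.argsort(-confidence, kind="stable")   # rank queue Q, most confident first
    k = min(batch_size, len(order))
    if mode == "least":
        return order[len(order) - k:]
    if mode == "middle":
        start = (len(order) - k) // 2
        return order[start:start + k]
    raise ValueError("mode must be 'least' or 'middle'")

# Hypothetical confidence values for five unlabeled instances
confidence = np.array([0.99, 0.51, 0.80, 0.65, 0.93])
least = select_batch(confidence, 2, "least")     # indices 3 and 1
middle = select_batch(confidence, 1, "middle")   # index 2
print(least, middle)
```

The selected indices play the role of $ N_a $: they are removed from $ U $ and, once annotated, added to $ L $.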

**Algorithm 6.35: Co-active Learning**

1. Input: a learning domain with features $ V = \{x_i, y_i\}_{i=1}^{l} \cup \{x_i\}_{i=l+1}^{l+u} $.
2. Repeat:
   - Split $ V $ into two “views”: $ V_1 = \{x_i\}_{i=1}^{l} $ and $ V_2 = \{x_i\}_{i=1}^{l} $ where $ V_1 \cap V_2 = \emptyset $.
   - Upsample views $ V_1 $ and $ V_2 $ to obtain even class distributions $ V_{D1} $ and $ V_{D2} $.
   - Use $ V/V_{D1} $ and $ V/V_{D2} $ to train classifiers $ h_1 $ and $ h_2 $.
   - Rank the data based on the prediction confidence value $ C $ and store them in queue $ Q $.
   - Select subset $ N_a $ whose elements are in the middle of the rank queue $ Q $ (middle certainty).
   - Submit the selected subset $ N_{ca} = U_{a1} \cup U_{a2} $ to human annotation.
   - Remove $ N_{ca} $ from $ U $: $ U = U \setminus N_{ca} $.
   - Add $ N_{ca} $ to $ L $: $ L = L \cup N_{ca} $.
3. Until $ U $ is exhausted.
4. Output: $ \{x_i, y_i\}_{i=1}^{l+u} $.

## Active Learning Based Binary Linear Classifiers

Consider binary classification with two classes “+” and “−.” We are given a small labeled instance set $ L = \{x_i, y_i\}_{i=1}^{l} $ and a large unlabeled instance set $ U = \{x_i\}_{i=l+1}^{l+u} $, where $ x_i \in \mathbb{R} $ and $ y_i \in \{+1, -1\} $.



In this algorithm, $ d(a, b) $ is a distance function suited to the application at hand. Once Algorithm 1 has produced the enlarged labeled data set $ L $ and the remaining unlabeled data set $ U $, any semi-supervised binary classification or regression method can be applied to find the binary classifier or regressor $ h $. For any test instance $ x $, the classification decision is then given by:

$$ \text{sign}(h^T x) $$

or for regression:

$$ \hat{y} = h^T x $$

where:

- $ h^T x $ is the linear combination of the model parameters $ h $ and the feature vector $ x $.
- $ \text{sign}(h^T x) $ returns $ +1 $ or $ -1 $ for binary classification, depending on whether $ h^T x $ is positive or negative.
- $ \hat{y} $ is the predicted continuous value in the case of regression.

### Steps Overview:

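1. Use the distance function $ d(a, b) $ to find the instance $ x_q $ in the unlabeled set $ U $ that is nearest to a synthesized instance (in the code below, the midpoint of the two class centroids), and query its label.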
2. Add the queried instance $ x_q $ to the labeled dataset $ L $, and remove it from $ U $.
3. After the labeled dataset $ L $ is enhanced, apply a semi-supervised binary classification or regression method to build a model $ h $.
4. For any test instance $ x $, the binary classification decision is made by $ \text{sign}(h^T x) $, and the regression prediction is made by $ \hat{y} = h^T x $.

This approach allows for an active learning-based refinement of the model with minimal labeling efforts.


In [1]:
import random
import math

# Define a simple distance function (Euclidean distance)
def euclidean_distance(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

# Function to find the nearest neighbor from the unlabeled set U
def find_nearest_neighbor(x_s, U):
    nearest = None
    min_distance = float('inf')
    for x in U:
        dist = euclidean_distance(x_s, x)
        if dist < min_distance:
            min_distance = dist
            nearest = x
    return nearest

# Function to update the labeled and unlabeled datasets
def query_label(nearest, U, L):
    # Simulate querying the label (here we randomly assign a label for demonstration purposes)
    label = random.choice([1, -1])
    
    # Add the nearest neighbor to the labeled dataset
    L.append((nearest, label))
    
    # Remove it from the unlabeled dataset
    U.remove(nearest)
    
    return label

# Initialize labeled dataset L and unlabeled dataset U
# Here we simulate the datasets. In practice, you'd have real data.
L = [([1.0, 2.0], 1), ([3.0, 4.0], -1)]  # Labeled data: list of (feature_vector, label)
U = [[2.5, 3.5], [4.0, 5.0], [6.0, 7.0], [0.5, 1.5]]  # Unlabeled data

# Centroids for positive and negative classes in L
def compute_centroids(L):
    positive = [x for x, y in L if y == 1]
    negative = [x for x, y in L if y == -1]

    if positive:
        c_pos = [sum(x) / len(positive) for x in zip(*positive)]
    else:
        c_pos = None

    if negative:
        c_neg = [sum(x) / len(negative) for x in zip(*negative)]
    else:
        c_neg = None

    return c_pos, c_neg

# Active learning algorithm
def active_learning(L, U):
    # Step 1: Compute centroids for the initial labeled data
    c_pos, c_neg = compute_centroids(L)
    
    if c_pos is None or c_neg is None:
        print("Not enough data in both classes to compute centroids.")
        return L, U
    
    # Step 2: Synthesize a new instance
    x_s = [(p + n) / 2 for p, n in zip(c_pos, c_neg)]
    
    # Step 3: Find the nearest neighbor to the synthesized instance
    nearest = find_nearest_neighbor(x_s, U)
    
    # Step 4: Query its label and update datasets
    label = query_label(nearest, U, L)
    
    # The centroids are recomputed from the updated L on the next call,
    # so no explicit centroid update is needed here.
    return L, U

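# Run the active learning process until no unlabeled data remains
while U:
    remaining = len(U)
    L, U = active_learning(L, U)
    if len(U) == remaining:
        # No instance was queried (one class has no labeled examples),
        # so stop instead of looping forever.
        break
    print(f"Labeled set: {L}")
    print(f"Remaining unlabeled set: {U}")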

# Final labeled set
print("Final labeled set:", L)


Labeled set: [([1.0, 2.0], 1), ([3.0, 4.0], -1), ([2.5, 3.5], -1)]
Remaining unlabeled set: [[4.0, 5.0], [6.0, 7.0], [0.5, 1.5]]
Labeled set: [([1.0, 2.0], 1), ([3.0, 4.0], -1), ([2.5, 3.5], -1), ([0.5, 1.5], 1)]
Remaining unlabeled set: [[4.0, 5.0], [6.0, 7.0]]
Labeled set: [([1.0, 2.0], 1), ([3.0, 4.0], -1), ([2.5, 3.5], -1), ([0.5, 1.5], 1), ([4.0, 5.0], -1)]
Remaining unlabeled set: [[6.0, 7.0]]
Labeled set: [([1.0, 2.0], 1), ([3.0, 4.0], -1), ([2.5, 3.5], -1), ([0.5, 1.5], 1), ([4.0, 5.0], -1), ([6.0, 7.0], -1)]
Remaining unlabeled set: []
Final labeled set: [([1.0, 2.0], 1), ([3.0, 4.0], -1), ([2.5, 3.5], -1), ([0.5, 1.5], 1), ([4.0, 5.0], -1), ([6.0, 7.0], -1)]


## Active Learning Using Extreme Learning Machine

Suppose we are given \(N\) arbitrary distinct training instances \(\{x_i, y_i\}_{i=1}^{N}\), where \(x_i \in \mathbb{R}^n\) and \(y_i \in \mathbb{R}^m\). Consider a single-hidden layer feedforward network (SLFN) with \(L\) hidden nodes that approximates these \(N\) samples with zero error. There exist \(\beta_i\), \(a_i\), and \(b_i\) such that:

$$
f_L(x_j) = \sum_{i=1}^{L} \beta_i G(a_i, b_i, x_j) = y_j, \quad j = 1, \dots, N
$$

where \(a_i\) and \(b_i\) denote the \(i\)-th weight and bias of the hidden layer, respectively, and \(\beta_i\) is the weight vector connecting the \(i\)-th hidden node to the output nodes.

Let:

$$
H = \begin{bmatrix}
G(a_1, b_1, x_1) & \dots & G(a_L, b_L, x_1) \\
\vdots & \ddots & \vdots \\
G(a_1, b_1, x_N) & \dots & G(a_L, b_L, x_N)
\end{bmatrix} \in \mathbb{R}^{N \times L}
$$

$$
B = \begin{bmatrix}
\beta_1^T \\
\vdots \\
\beta_L^T
\end{bmatrix} \in \mathbb{R}^{L \times m}
$$

$$
Y = \begin{bmatrix}
y_1^T \\
\vdots \\
y_N^T
\end{bmatrix} \in \mathbb{R}^{N \times m}
$$

Then, the equation can be written in the following compact matrix form:

$$
H B = Y
$$

Here, $G(a_i, b_i, x_j)$ denotes the activation function used to calculate the output of the $i$-th hidden node on the $j$-th training instance. $H$ is called the hidden layer output matrix, where its $i$-th column denotes the $i$-th hidden node’s output vector with respect to inputs $x_1, \dots, x_N$, and its $j$-th row represents the output vector of the hidden layer with respect to the input $x_j$.

The solution of $H B = Y$ is given by:

$$
\hat{B} = H^{\dagger} Y
$$

where $H^{\dagger}$ is the Moore–Penrose inverse of the hidden layer output matrix $H$, and:

$$
H^{\dagger} = (H^T H)^{-1} H^T \quad \text{if } H \text{ is of full column rank}
$$

or

$$
H^{\dagger} = H^T (H H^T)^{-1} \quad \text{if } H \text{ is of full row rank}
$$

The solution $\hat{B}$ of the overdetermined matrix equation $H B = Y$ is the minimum norm least squares solution, i.e.,

$$
\|\hat{B}\| \leq \|Z\| \quad \text{and} \quad \text{var}(\hat{B}) \leq \text{var}(Z)
$$

for every other least squares solution $Z$.

Since the norm of the output weights $\|\hat{B}\|$ is closely related to the generalization ability of the network, this provides a robust solution for learning with extreme learning machines.
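
The pseudoinverse formulas and the minimum norm property can be checked numerically. In the sketch below, random matrices stand in for the hidden layer output matrix (an assumption made only for illustration), and `L_hidden` is a hypothetical name for the number of hidden nodes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L_hidden, m = 50, 8, 2

# Tall case (N > L_hidden): H has full column rank with probability 1
H = rng.normal(size=(N, L_hidden))
Y = rng.normal(size=(N, m))
H_pinv = np.linalg.pinv(H)
H_pinv_formula = np.linalg.inv(H.T @ H) @ H.T       # (H^T H)^{-1} H^T
B_hat = H_pinv @ Y
B_lstsq = np.linalg.lstsq(H, Y, rcond=None)[0]      # least squares solution
print(np.allclose(H_pinv, H_pinv_formula), np.allclose(B_hat, B_lstsq))

# Wide case (fewer rows than columns): H has full row rank with probability 1
H_wide = rng.normal(size=(5, L_hidden))
Y_wide = rng.normal(size=(5, m))
H_wide_pinv = np.linalg.pinv(H_wide)
H_wide_formula = H_wide.T @ np.linalg.inv(H_wide @ H_wide.T)   # H^T (H H^T)^{-1}
B_min = H_wide_pinv @ Y_wide

# Another exact solution Z: add a component from the null space of H_wide
null_proj = np.eye(L_hidden) - H_wide_pinv @ H_wide
Z = B_min + null_proj @ rng.normal(size=(L_hidden, m))
print(np.allclose(H_wide @ Z, Y_wide), np.linalg.norm(B_min) <= np.linalg.norm(Z))
```

In the wide case both `B_min` and `Z` solve the system exactly, but `B_min` has the smaller norm, which is the minimum norm property stated above.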


## Active Learning Using Extreme Learning Machine

Given $N$ distinct training instances $\{x_i, y_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}^m$, consider a single-hidden layer feedforward network (SLFN) with $L$ hidden nodes that approximates these $N$ samples with zero error. This is expressed as:

$$
f_L(x_j) = \sum_{i=1}^{L} \beta_i G(a_i, b_i, x_j) = y_j, \quad j = 1, \dots, N
$$

where \(a_i\) and \(b_i\) denote the weights and biases of the hidden layer, and \(\beta_i\) is the weight vector connecting the \(i\)-th hidden node to the output nodes.

The hidden layer output matrix \(H\) is defined as:

$$
H = \begin{bmatrix}
G(a_1, b_1, x_1) & \dots & G(a_L, b_L, x_1) \\
\vdots & \ddots & \vdots \\
G(a_1, b_1, x_N) & \dots & G(a_L, b_L, x_N)
\end{bmatrix} \in \mathbb{R}^{N \times L}
$$

The weight matrix \(B\) is:

$$
B = \begin{bmatrix}
\beta_1^T \\
\vdots \\
\beta_L^T
\end{bmatrix} \in \mathbb{R}^{L \times m}
$$

The output matrix \(Y\) is:

$$
Y = \begin{bmatrix}
y_1^T \\
\vdots \\
y_N^T
\end{bmatrix} \in \mathbb{R}^{N \times m}
$$

The matrix equation becomes:

$$
HB = Y
$$

The solution to this is given by:

$$
\hat{B} = H^{\dagger} Y
$$

where $H^{\dagger}$ is the Moore-Penrose inverse of \(H\). The optimization problem to minimize both the weights and the error is defined as:

$$
\min_{B} L_{ELM} = \frac{1}{2} \|B\|^2 + \frac{C}{2} \|HB - Y\|^2
$$

The first-order optimality condition leads to:

$$
B + C(H^T H B - H^T Y) = 0
$$

Letting $\lambda = \frac{1}{C}$, the solution is:

$$
B = \begin{cases}
(H^T H + \lambda I)^{-1} H^T Y, & \text{if } H^T H \text{ is nonsingular} \\
H^T (H H^T + \lambda I)^{-1} Y, & \text{if } H H^T \text{ is nonsingular}
\end{cases}
$$

This is equivalent to the Tikhonov regularization solution. The output function vector is defined as:

$$
f(x) = h(x)B = [G(a_1, b_1, x), \dots, G(a_L, b_L, x)] B
$$
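
The regularized solution can be checked numerically. The sketch below uses a random matrix in place of $H$ (an illustrative assumption), with hypothetical variable names `B_primal` and `B_dual` for the two closed forms. It verifies that the two forms agree for $\lambda > 0$ and that the result satisfies the first-order optimality condition.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L_hidden, m = 30, 12, 3
H = rng.normal(size=(N, L_hidden))   # stand-in for the hidden layer output matrix
Y = rng.normal(size=(N, m))
C = 10.0
lam = 1.0 / C

# (H^T H + lambda I)^{-1} H^T Y : solves an L x L system
B_primal = np.linalg.solve(H.T @ H + lam * np.eye(L_hidden), H.T @ Y)
# H^T (H H^T + lambda I)^{-1} Y : solves an N x N system
B_dual = H.T @ np.linalg.solve(H @ H.T + lam * np.eye(N), Y)

# First-order optimality condition: B + C (H^T H B - H^T Y) = 0
residual = B_primal + C * (H.T @ H @ B_primal - H.T @ Y)

print(np.allclose(B_primal, B_dual))
print(np.allclose(residual, 0.0, atol=1e-8))
```

Since both forms give the same $B$, a practical choice is to use whichever one requires solving the smaller linear system.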

In extreme learning machine, the instance is labeled based on the category with the maximum output:

$$
\text{label}(x) = \arg \max_{i} f_i(x), \quad i \in \{1, \dots, m\}
$$

For binary classification, the probability is calculated as:

$$
P(y = 1 | f_i(x)) = \frac{1}{1 + \exp(-f_i(x))}
$$

The normalized posterior probabilities are:

$$
\bar{P}(y = 1 | f_i(x)) = \frac{P(y = 1 | f_i(x))}{\sum_{j=1}^{m} P(y = 1 | f_j(x))}
$$
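
The two probability formulas above translate directly into a few lines of NumPy. The function name `normalized_posteriors` and the example outputs are hypothetical.

```python
import numpy as np

def normalized_posteriors(F):
    """Map ELM outputs to normalized posterior probabilities.

    F: array of shape (n_samples, m) holding the outputs f_i(x).
    Returns an array of the same shape whose rows sum to 1.
    """
    P = 1.0 / (1.0 + np.exp(-np.asarray(F, dtype=float)))   # P(y = 1 | f_i(x))
    return P / P.sum(axis=1, keepdims=True)                 # normalize over the m outputs

# Hypothetical outputs for two instances and m = 3 classes
F = np.array([[2.0, -1.0, 0.0],
              [0.1, 0.0, -0.1]])
P_bar = normalized_posteriors(F)
labels = np.argmax(P_bar, axis=1)   # same label as arg max of f_i(x)
print(P_bar.sum(axis=1), labels)
```

The second instance has nearly equal normalized posteriors, which is the kind of instance an uncertainty-based query strategy would prefer to label.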

### Algorithm: Active Learning-Extreme Learning Machine (AL-ELM)

1. **Input**: A small labeled set $L = \{x_i, y_i\}_{i=1}^{l}$, a large unlabeled set $U = \{x_i\}_{i=l+1}^{l+u}$, activation function $G(x)$, penalty factor $C$, and the number of hidden nodes \(L\).
2. **Initialization**: Assign random hidden nodes and calculate the initial hidden layer output matrix $H_0$.
3. **Repeat**:
   1. Use the current weights to calculate the outputs for instances in $U$.
   2. Calculate the posterior probabilities and the normalized posterior probabilities.
   3. Rank the uncertainty of instances in $U$ and select the most uncertain ones for labeling.
   4. Update the hidden layer output matrix and the weights.
4. **Stopping Criterion**: Stop when the pre-defined condition is met.

Once $B$ is found, the binary classification label or regression output can be determined for new instances.
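
The AL-ELM loop can be sketched as below. Several choices are assumptions rather than details from the algorithm above: uncertainty is measured as the margin between the two largest normalized posteriors, the output weights are refit from scratch each round instead of being updated incrementally, the stopping criterion is a fixed number of rounds, and the oracle is simulated with known labels. All function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_output(X, a, b):
    return sigmoid(X @ a.T + b)            # a: (n_hidden, n), b: (n_hidden,)

def fit_output_weights(H, Y, C):
    # Regularized solution (H^T H + I / C)^{-1} H^T Y
    return np.linalg.solve(H.T @ H + np.eye(H.shape[1]) / C, H.T @ Y)

def al_elm(X_lab, y_lab, X_pool, oracle, n_classes, n_hidden=20, C=1.0,
           batch_size=5, n_rounds=4, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n_hidden, X_lab.shape[1]))   # random hidden nodes, kept fixed
    b = rng.normal(size=n_hidden)
    pool_idx = np.arange(len(X_pool))
    for _ in range(n_rounds):
        if len(pool_idx) == 0:
            break
        targets = np.eye(n_classes)[y_lab]            # one-hot targets
        B = fit_output_weights(hidden_output(X_lab, a, b), targets, C)
        P = sigmoid(hidden_output(X_pool[pool_idx], a, b) @ B)
        P_bar = P / P.sum(axis=1, keepdims=True)      # normalized posteriors
        top2 = np.sort(P_bar, axis=1)[:, -2:]
        margin = top2[:, 1] - top2[:, 0]              # small margin = uncertain
        chosen = pool_idx[np.argsort(margin)[:batch_size]]
        X_lab = np.vstack([X_lab, X_pool[chosen]])
        y_lab = np.concatenate([y_lab, oracle(chosen)])
        pool_idx = np.setdiff1d(pool_idx, chosen)
    B = fit_output_weights(hidden_output(X_lab, a, b), np.eye(n_classes)[y_lab], C)
    return a, b, B, X_lab, y_lab

# Synthetic two-class data; the oracle simply looks up the known labels
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_pool, y_pool = X[10:], y[10:]
a, b, B, X_final, y_final = al_elm(X[:10], y[:10], X_pool,
                                   oracle=lambda idx: y_pool[idx], n_classes=2)
print(X_final.shape, B.shape)
```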


In [2]:
import numpy as np

# Activation function: Sigmoid
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# ELM training function
def train_elm(X, Y, L, C=1.0):
    """
    Train an Extreme Learning Machine (ELM).
    
    Parameters:
        X : Input data (N x n)
        Y : Target data (N x m)
        L : Number of hidden neurons
        C : Regularization factor (default=1.0)
        
    Returns:
        B : Output weights (L x m)
        a : Random weights of hidden neurons (L x n)
        b : Random biases of hidden neurons (L x 1)
    """
    
    # Step 1: Assign random hidden nodes parameters (weights and biases)
    N, n = X.shape
    a = np.random.randn(L, n)  # Random hidden layer weights
    b = np.random.randn(L, 1)  # Random biases for hidden neurons
    
    # Step 2: Calculate hidden layer output matrix H
    H = sigmoid(np.dot(X, a.T) + b.T)  # H is (N x L)
    
    # Step 3: Compute regularized output weights. Both formulas give the same B;
    # pick the one that solves the smaller linear system.
    lam = 1.0 / C
    if N >= L:  # at least as many samples as hidden nodes: solve an L x L system
        B = np.linalg.solve(H.T.dot(H) + lam * np.identity(L), H.T.dot(Y))
    else:  # fewer samples than hidden nodes: solve an N x N system
        B = H.T.dot(np.linalg.solve(H.dot(H.T) + lam * np.identity(N), Y))
    
    return B, a, b

# ELM prediction function
def predict_elm(X, B, a, b):
    """
    Predict using the trained ELM.
    
    Parameters:
        X : Input data (M x n)
        B : Output weights (L x m)
        a : Hidden layer weights (L x n)
        b : Hidden layer biases (L x 1)
        
    Returns:
        Predictions (M x m)
    """
    H = sigmoid(np.dot(X, a.T) + b.T)  # Hidden layer output
    return np.dot(H, B)  # Predictions

# Generating random training data
np.random.seed(0)
X_train = np.random.rand(100, 10)  # 100 samples, 10 features
Y_train = np.random.randint(0, 2, size=(100, 1))  # Binary labels (0 or 1)

# Training the ELM
L = 20  # Number of hidden nodes
C = 1.0  # Regularization parameter
B, a, b = train_elm(X_train, Y_train, L, C)

# Generating random test data
X_test = np.random.rand(20, 10)  # 20 test samples

# Predicting using the ELM
Y_pred = predict_elm(X_test, B, a, b)
Y_pred_binary = (Y_pred > 0.5).astype(int)  # Threshold the real-valued outputs at 0.5 to get binary labels

# Output predictions
print("Predicted labels:", Y_pred_binary.ravel())


Predicted labels: [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
