In [None]:
'''
 * Copyright (c) 2018 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

## Extreme learning machine (ELM) algorithm
The offline solution of (7.11.7) is given by

$$
B = H^\dagger Y,
$$

(7.11.11)

where \( H^\dagger \) is the Moore–Penrose generalized inverse of the hidden layer output matrix \( H \).
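
As a minimal sketch of what "minimum norm LS solution" means here, the snippet below builds a small underdetermined system and compares \( B = H^\dagger Y \) with another exact solution. The sizes `N`, `d`, `m` and the random stand-ins for \( H \) and \( Y \) are assumptions made purely for illustration; they do not come from a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N samples, d hidden nodes, m outputs.
# With d > N the system H B = Y has infinitely many exact solutions.
N, d, m = 5, 8, 2
H = rng.standard_normal((N, d))   # stand-in for the hidden layer output matrix
Y = rng.standard_normal((N, m))   # stand-in for the training output matrix

# Minimum norm least squares solution B = H^+ Y.
H_pinv = np.linalg.pinv(H)
B = H_pinv @ Y

# np.linalg.lstsq returns the same minimum norm solution.
B_lstsq = np.linalg.lstsq(H, Y, rcond=None)[0]

# Adding any component from the null space of H gives another exact
# solution, but one with a larger Frobenius norm.
null_projector = np.eye(d) - H_pinv @ H
B_other = B + null_projector @ rng.standard_normal((d, m))

print("both solve H B = Y:", np.allclose(H @ B, Y), np.allclose(H @ B_other, Y))
print("norms:", np.linalg.norm(B), np.linalg.norm(B_other))
```

Among all matrices that reproduce \( Y \) exactly, the pseudoinverse picks the one with the smallest norm, which is the property the ELM relies on.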

Based on the minimum norm LS solution \( B = H^\dagger Y \), Huang et al. [70] proposed an extreme learning machine (ELM) as a simple learning algorithm for SLFNs, as shown in Algorithm 7.5.

**Algorithm 7.5 Extreme learning machine (ELM) algorithm**

1. **Input**: A training set \( \{(x_i , y_i )|x_i \in \mathbb{R}^n , y_i \in \mathbb{R}^m , i = 1, \ldots , N \} \), activation function \( g(x) \), and hidden node number \( d \).

2. **Initialization**: Randomly assign input weight \( w_i \) and bias \( b_i \), \( i = 1, \ldots , d \).

3. **Learning step**:

   3.1. Use (7.11.8) to calculate the hidden layer output matrix \( H \).

   3.2. Use (7.11.10) to construct the training output matrix \( Y \).

   3.3. Calculate the minimum norm least squares weight matrix \( B = H^\dagger Y \), and get \( \beta_i \) from \( B^T = [\beta_1, \ldots , \beta_d] \).

4. **Testing step**: For a given testing sample \( x \in \mathbb{R}^n \), the output of the SLFN is given by

$$
y = \sum_{i=1}^d \beta_i g(w_i^T x + b_i).
$$

It is shown in [71] that the ELM can use a wide range of feature mappings (hidden layer output functions), including random hidden nodes and kernels. With this extension, a unified ELM solution can be obtained for feedforward neural networks, RBF networks, LS-SVM, and PSVM.

In ELM, the hidden layer need not be tuned. For the case of a single output node, the output function of ELM for generalized SLFNs is given by

$$
f_d (x) = \sum_{j=1}^d \beta_j h_j (x) = h^T (x) \beta,
$$

(7.11.12)

where \( h(x) = [h_1 (x), \ldots , h_d (x)]^T \) is the output vector of the hidden layer with respect to the input \( x \), and \( \beta = [\beta_1 , \ldots , \beta_d ]^T \) is the vector of output weights between the hidden layer of \( d \) nodes and the output node.

The output vector \( h(x) \) is a feature mapping: it actually maps the data from the \( n \)-dimensional input space to the \( d \)-dimensional hidden layer feature space (ELM feature space) \( H \). For the binary classification problem, the decision function of ELM is

$$
f_d (x) = \text{sign} (h^T (x) \beta).
$$

(7.11.13)
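
The decision function above can be sketched with random hidden nodes as follows. This is an illustrative example under stated assumptions, not the reference implementation: the two-blob toy data, the sigmoid activation, the hidden node count `d = 20`, and the helper names `hidden_features` and `decide` are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (an assumption for illustration): two Gaussian blobs labelled -1 and +1.
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(+2.0, 1.0, size=(50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])

d = 20                               # number of hidden nodes
W = rng.standard_normal((2, d))      # random input weights w_j (never tuned)
b = rng.standard_normal(d)           # random biases b_j (never tuned)

def hidden_features(X_in):
    # Row i is h(x_i)^T = [g(w_1^T x_i + b_1), ..., g(w_d^T x_i + b_d)].
    return sigmoid(X_in @ W + b)

# Output weights from the minimum norm least squares solution.
beta = np.linalg.pinv(hidden_features(X)) @ y

def decide(X_new):
    # Decision function f_d(x) = sign(h^T(x) beta).
    return np.sign(hidden_features(X_new) @ beta)

train_accuracy = np.mean(decide(X) == y)
print("training accuracy:", train_accuracy)
```

Only `beta` is learned; `W` and `b` stay at their random initial values, which is the sense in which the hidden layer is not tuned.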

An ELM with a single-output node is used to generate \( N \) output samples:

$$
\sum_{j=1}^d \beta_j h_j (x_i ) = y_i ,
$$

which can be rewritten in the form

$$
h^T (x_i ) \beta = y_i , \quad i = 1, \ldots , N
$$

or

$$
H \beta = y,
$$

where

$$
H = \begin{bmatrix}
h^T (x_1) \\
\vdots \\
h^T (x_N)
\end{bmatrix} = \begin{bmatrix}
h_1 (x_1 ) & \cdots & h_d (x_1) \\
\vdots & \ddots & \vdots \\
h_1 (x_N ) & \cdots & h_d (x_N )
\end{bmatrix},
$$

$$
\beta = \begin{bmatrix}
\beta_1 \\
\vdots \\
\beta_d
\end{bmatrix},
$$

$$
y = \begin{bmatrix}
y_1 \\
\vdots \\
y_N
\end{bmatrix}.
$$

The constrained optimization problem for ELM regression and binary classification with a single-output node can be formulated as [71]:

$$
\min_{\beta, \xi_i} L_{PELM} = \frac{1}{2} \|\beta\|_2^2 + \frac{C}{2} \sum_{i=1}^N \xi_i^2,
$$

subject to \( h^T (x_i ) \beta = y_i - \xi_i, \quad i = 1, \ldots , N. \)
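
Eliminating the slack variables through the constraint, \( \xi_i = y_i - h^T(x_i)\beta \), turns this problem into a ridge-type least squares problem. The sketch below uses the standard closed-form minimizer of that objective, \( \beta = (I/C + H^T H)^{-1} H^T y \), which is not stated in the text above and is included here as an assumption-labelled illustration. `H` and `y` are random stand-ins, and `elm_ridge` and `primal_objective` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)

N, d = 40, 10
H = rng.standard_normal((N, d))   # stand-in hidden layer output matrix, rows h^T(x_i)
y = rng.standard_normal(N)        # stand-in targets

def elm_ridge(H, y, C):
    # Minimizer of 0.5*||beta||^2 + 0.5*C*||y - H beta||^2,
    # i.e. the solution of (I/C + H^T H) beta = H^T y.
    n_hidden = H.shape[1]
    return np.linalg.solve(np.eye(n_hidden) / C + H.T @ H, H.T @ y)

def primal_objective(beta, C):
    xi = y - H @ beta             # slack variables implied by the constraints
    return 0.5 * beta @ beta + 0.5 * C * xi @ xi

beta_small_C = elm_ridge(H, y, C=1.0)
beta_large_C = elm_ridge(H, y, C=1e8)
beta_pinv = np.linalg.pinv(H) @ y

print("norm with C = 1:   ", np.linalg.norm(beta_small_C))
print("norm of pinv soln: ", np.linalg.norm(beta_pinv))
print("large C matches pinv:", np.allclose(beta_large_C, beta_pinv, atol=1e-6))
```

A small \( C \) shrinks \( \beta \) at the cost of larger training errors, while a very large \( C \) approaches the unregularized pseudoinverse solution.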

The dual unconstrained optimization problem is

$$
\min_{\beta, \xi_i, \alpha_i} L_{DELM} (\beta, \xi_i, \alpha_i) = \frac{1}{2} \|\beta\|_2^2 + \frac{C}{2} \sum_{i=1}^N \xi_i^2 - \sum_{i=1}^N \alpha_i (h^T (x_i) \beta - y_i + \xi_i)
$$

with the Lagrange multipliers \( \alpha_i \geq 0, \quad i = 1, \ldots , N. \)

From the optimality conditions one has

$$
\frac{\partial L_{DELM}}{\partial \beta} = 0 \Rightarrow \beta = \sum_{i=1}^N \alpha_i h(x_i) = H^T \alpha,
$$

where \( \alpha = [\alpha_1, \ldots , \alpha_N]^T. \)
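
The relation \( \beta = H^T \alpha \) says that the output weight vector is a linear combination of the training feature vectors \( h(x_i) \). A minimal numerical check, assuming the unregularized system \( H\beta = y \) and the convention that the rows of \( H \) are \( h^T(x_i) \): choosing \( \alpha = (H H^T)^\dagger y \) reproduces the minimum norm solution \( H^\dagger y \). The matrix sizes and random data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N, d = 6, 15
H = rng.standard_normal((N, d))   # rows are h^T(x_i)
y = rng.standard_normal(N)

# N x N Gram matrix: entry (i, j) is the inner product h^T(x_i) h(x_j).
gram = H @ H.T

alpha = np.linalg.pinv(gram) @ y
beta_from_alpha = H.T @ alpha     # beta = sum_i alpha_i h(x_i)

beta_pinv = np.linalg.pinv(H) @ y

print("beta = H^T alpha matches pinv(H) y:", np.allclose(beta_from_alpha, beta_pinv))
```

Because `alpha` depends on the data only through inner products of feature vectors, the same computation can be carried out with a kernel in place of an explicit feature map.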


In [1]:
import numpy as np
from numpy.linalg import pinv

class ExtremeLearningMachine:
    def __init__(self, n_hidden_units, activation_function):
        self.n_hidden_units = n_hidden_units
        self.activation_function = activation_function

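    def _add_bias(self, X):
        # Append a constant column of ones to X. Together with self.biases this
        # gives each hidden node two random offsets, which is redundant but harmless.
        bias = np.ones((X.shape[0], 1))
        return np.hstack([X, bias])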

    def _init_weights(self, n_features):
        # Initialize input weights and biases randomly
        self.input_weights = np.random.randn(n_features, self.n_hidden_units)
        self.biases = np.random.randn(1, self.n_hidden_units)

    def _compute_hidden_layer_output(self, X):
        # Compute hidden layer output
        Z = np.dot(X, self.input_weights) + self.biases
        H = self.activation_function(Z)
        return H

    def fit(self, X_train, y_train):
        # Add bias term to input data
        X_train = self._add_bias(X_train)
        
        # Initialize weights
        n_features = X_train.shape[1]
        self._init_weights(n_features)
        
        # Compute hidden layer output
        H = self._compute_hidden_layer_output(X_train)
        
        # Compute output weights using Moore-Penrose pseudoinverse
        H_pinv = pinv(H)
        self.output_weights = np.dot(H_pinv, y_train)

    def predict(self, X_test):
        # Add bias term to input data
        X_test = self._add_bias(X_test)
        
        # Compute hidden layer output
        H = self._compute_hidden_layer_output(X_test)
        
        # Compute output
        y_pred = np.dot(H, self.output_weights)
        return y_pred

# Activation function: Sigmoid
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Example usage
if __name__ == "__main__":
    # Generate some random data for demonstration
    np.random.seed(0)
    X_train = np.random.rand(100, 10)
    y_train = np.random.rand(100, 1)
    X_test = np.random.rand(10, 10)

    # Create ELM instance
    elm = ExtremeLearningMachine(n_hidden_units=50, activation_function=sigmoid)

    # Train ELM
    elm.fit(X_train, y_train)

    # Predict with ELM
    y_pred = elm.predict(X_test)
    print("Predictions:\n", y_pred)


Predictions:
 [[-0.18704703]
 [ 0.22153287]
 [ 0.27996201]
 [ 0.77352065]
 [ 0.4413394 ]
 [ 0.46376024]
 [ 0.0402454 ]
 [ 0.53954023]
 [ 0.29939952]
 [ 0.52385948]]


Substituting (7.11.22) into (7.11.15) and using (7.11.18), we get the equation
$$
H^T H \alpha = y,
$$
(7.11.23)
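where $H$ here denotes the $d \times N$ matrix whose columns are the feature vectors,
$$
H = \begin{bmatrix}
h^T(x_1) \\
\vdots \\
h^T(x_N)
\end{bmatrix}^T
= \begin{bmatrix}
h(x_1), \ldots, h(x_N)
\end{bmatrix},
$$
so that $H^T H$ is the $N \times N$ matrix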
$$
H^T H = \begin{bmatrix}
h^T(x_1)h(x_1) & \cdots & h^T(x_1)h(x_N) \\
\vdots & \ddots & \vdots \\
h^T(x_N)h(x_1) & \cdots & h^T(x_N)h(x_N)
\end{bmatrix}
$$
(7.11.24)
The LS solution of Eq. (7.11.23) is given by
$$
\alpha = (H^T H)^\dagger y.
$$
(7.11.25)
If a feature mapping $h(x)$ is unknown to users, one can apply Mercer's conditions
on ELM, and hence a kernel matrix for ELM can be defined as follows [71]:
$$
K(x, x_i) = h^T(x)h(x_i).
$$
(7.11.26)
Using (7.11.26), (7.11.24) can be rewritten in the following kernel form:
$$
H^T H = \begin{bmatrix}
K(x_1, x_1) & \cdots & K(x_1, x_N) \\
\vdots & \ddots & \vdots \\
K(x_N, x_1) & \cdots & K(x_N, x_N)
\end{bmatrix}
$$
(7.11.27)
whose $(i, j)$th entry is $(H^T H)_{ij} = K(x_i, x_j)$ for $i = 1, \ldots, N; j = 1, \ldots, N$.
This shows that the feature mapping $h(x)$ need not be used; instead, it is enough
to use the corresponding kernel $K(x, x_i)$, e.g., the Gaussian kernel $K(x, x_i) = \exp(-\gamma \|x - x_i\|^2)$.
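
As a small illustration of working with the kernel alone, the sketch below builds the kernel matrix for the squared-distance (Gaussian) form used by `rbf_kernel` in this notebook's code, without Python loops, and checks that it is symmetric and positive semidefinite, as a Mercer kernel requires. The function name `gaussian_kernel_matrix` and the random sample `X` are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel_matrix(A, B, gamma=1.0):
    # Entry (i, j) is exp(-gamma * ||A[i] - B[j]||^2), computed from
    # ||a||^2 + ||b||^2 - 2 a^T b instead of nested loops.
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

rng = np.random.default_rng(0)
X = rng.random((5, 3))
K = gaussian_kernel_matrix(X, X)

print("symmetric:", np.allclose(K, K.T))
print("unit diagonal:", np.allclose(np.diag(K), 1.0))
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())
```

The vectorized form gives the same entries as evaluating the kernel pair by pair, but avoids the $N^2$ Python-level function calls.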
Finally, from (7.11.22) it follows that the ELM regression function is
$$
\hat{y} = h^T(x) \beta = h^T(x) \sum_{i=1}^N \alpha_i h(x_i) = \sum_{i=1}^N \alpha_i K(x, x_i),
$$
(7.11.28)
while the decision function for ELM binary classification is
$$
\text{class of } x = \text{sign} \left( \sum_{i=1}^N \alpha_i K(x, x_i) \right).
$$
(7.11.29)
Algorithm 7.6 shows the ELM algorithm for regression and binary classification.

**Algorithm 7.6 ELM algorithm for regression and binary classification [71]**

1. **Input**: A training set $\{(x_i, y_i) | x_i \in \mathbb{R}^n, y_i \in \mathbb{R}, i = 1, \ldots, N\}$, hidden node number $d$ and the kernel function $K(u, v) = \exp(-\gamma \|u - v\|^2)$.
2. **Initialization**: $y = [y_1, \ldots, y_N]^T$.
3. **Learning step**:
    1. Use the kernel function to construct the matrix $(H^T H)_{ij} = K(x_i, x_j)$, $i, j = 1, \ldots, N$.
    2. Calculate the minimum norm least squares solution $\alpha = (H^T H)^\dagger y$.
4. **Testing step**: For a given testing sample $x \in \mathbb{R}^n$, the ELM regression output is $\sum_{i=1}^N \alpha_i K(x, x_i)$, while the ELM binary classification output is $\text{class of } x = \text{sign} \left( \sum_{i=1}^N \alpha_i K(x, x_i) \right)$.


In [2]:
import numpy as np
from numpy.linalg import pinv

class ELM:
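    def __init__(self, n_hidden_units, kernel_function):
        # n_hidden_units mirrors the input list of Algorithm 7.6 but is not used:
        # the kernel solution never forms the hidden layer explicitly.
        self.n_hidden_units = n_hidden_units
        self.kernel_function = kernel_function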

    def fit(self, X_train, y_train):
        N = X_train.shape[0]
        
        # Construct the kernel matrix
        H = np.zeros((N, N))
        for i in range(N):
            for j in range(N):
                H[i, j] = self.kernel_function(X_train[i], X_train[j])
        
        # Calculate the minimum norm least squares solution
        H_pinv = pinv(H)
        self.alpha = np.dot(H_pinv, y_train)
        
        self.X_train = X_train

    def predict(self, X_test):
        N = self.X_train.shape[0]
        M = X_test.shape[0]
        
        # Compute the kernel matrix between test data and training data
        K = np.zeros((M, N))
        for i in range(M):
            for j in range(N):
                K[i, j] = self.kernel_function(X_test[i], self.X_train[j])
        
        # Compute the predictions
        y_pred = np.dot(K, self.alpha)
        return y_pred

# Example kernel function: RBF (Gaussian)
def rbf_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.linalg.norm(x - y) ** 2)

# Example usage
if __name__ == "__main__":
    # Generate some random data for demonstration
    np.random.seed(0)
    X_train = np.random.rand(100, 10)
    y_train = np.random.rand(100, 1)
    X_test = np.random.rand(10, 10)

    # Create ELM instance
    elm = ELM(n_hidden_units=50, kernel_function=rbf_kernel)

    # Train ELM
    elm.fit(X_train, y_train)

    # Predict with ELM
    y_pred = elm.predict(X_test)
    print("Predictions:\n", y_pred)


Predictions:
 [[ 0.3175779 ]
 [ 0.40938498]
 [ 0.28007594]
 [ 0.84347293]
 [ 0.50490296]
 [ 0.51523179]
 [-0.0439662 ]
 [ 0.67854794]
 [ 0.24910511]
 [ 0.01155326]]
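
The cell above exercises the regression output of Algorithm 7.6. The binary classification variant only adds a sign to the same kernel expansion, and can be sketched as follows. The two-blob toy data, the value of `gamma`, and the helper names `gaussian_kernel_matrix`, `two_blobs` and `classify` are assumptions made for illustration only.

```python
import numpy as np

def gaussian_kernel_matrix(A, B, gamma=1.0):
    # Entry (i, j) is exp(-gamma * ||A[i] - B[j]||^2).
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

rng = np.random.default_rng(0)

def two_blobs(n_per_class):
    # Hypothetical toy data: class -1 centred at (-2, -2), class +1 at (2, 2).
    X = np.vstack([rng.normal(-2.0, 1.0, size=(n_per_class, 2)),
                   rng.normal(+2.0, 1.0, size=(n_per_class, 2))])
    y = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])
    return X, y

X_train, y_train = two_blobs(30)
X_test, y_test = two_blobs(10)

# Learning step: alpha is the minimum norm LS solution of (kernel matrix) alpha = y.
K_train = gaussian_kernel_matrix(X_train, X_train)
alpha = np.linalg.pinv(K_train) @ y_train

def classify(X_new):
    # class of x = sign(sum_i alpha_i K(x, x_i))
    return np.sign(gaussian_kernel_matrix(X_new, X_train) @ alpha)

train_accuracy = np.mean(classify(X_train) == y_train)
test_accuracy = np.mean(classify(X_test) == y_test)
print("train accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)
```

Since the unregularized pseudoinverse essentially interpolates the training labels, the accuracy on unseen points depends on the choice of `gamma` and on how well separated the classes are.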


In [None]:
## Extreme Learning Machine Algorithm for Multiclass Classification
