In [None]:
'''
 * Copyright (c) 2018 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

## Extreme learning machine (ELM) algorithm
The offline solution of (7.11.7) is given by

$$
B = H^\dagger Y,
$$

(7.11.11)

where \( H^\dagger \) is the Moore–Penrose generalized inverse of the hidden layer output matrix \( H \).

Based on the minimum norm LS solution \( B = H^\dagger Y \), Huang et al. [70] proposed an extreme learning machine (ELM) as a simple learning algorithm for SLFNs, as shown in Algorithm 7.5.
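
As a small numerical illustration of the minimum norm LS solution \( B = H^\dagger Y \), the sketch below builds random stand-ins for \( H \) and \( Y \) (the shapes, seed, and variable names are assumptions made for the example, not values from the text). It compares `numpy.linalg.pinv` with `numpy.linalg.lstsq`, which also returns the minimum norm solution when the system is underdetermined.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 20, 50, 3              # more hidden nodes than samples: H B = Y is underdetermined
H = rng.standard_normal((N, d))  # stand-in for the hidden layer output matrix
Y = rng.standard_normal((N, m))  # stand-in for the training output matrix

B = np.linalg.pinv(H) @ Y                          # B = H^dagger Y
B_lstsq = np.linalg.lstsq(H, Y, rcond=None)[0]     # minimum norm LS solution via SVD

residual = np.linalg.norm(H @ B - Y)               # close to zero for this full-row-rank H
same_solution = np.allclose(B, B_lstsq)
print(residual, same_solution)
```

Both routes rely on an SVD of \( H \); `pinv` is convenient when the same \( H^\dagger \) is reused for several output matrices.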

**Algorithm 7.5 Extreme learning machine (ELM) algorithm**

1. **Input**: A training set \( \{(x_i , y_i )|x_i \in \mathbb{R}^n , y_i \in \mathbb{R}^m , i = 1, \ldots , N \} \), activation function \( g(x) \), and hidden node number \( d \).

2. **Initialization**: Randomly assign input weight \( w_i \) and bias \( b_i \), \( i = 1, \ldots , d \).

3. **Learning step**:

   3.1. Use (7.11.8) to calculate the hidden layer output matrix \( H \).

   3.2. Use (7.11.10) to construct the training output matrix \( Y \).

   3.3. Calculate the minimum norm least squares weight matrix \( B = H^\dagger Y \), and get \( \beta_i \) from \( B^T = [\beta_1, \ldots , \beta_d] \).

4. **Testing step**: For given testing sample \( x \in \mathbb{R}^n \), the output of SLFNs is given by

$$
y = \sum_{i=1}^d \beta_i g(w_i^T x + b_i).
$$

It is shown in [71] that the ELM can use a wide range of feature mappings (hidden layer output functions), including random hidden nodes and kernels. With this extension, a unified ELM solution can be obtained for feedforward neural networks, RBF networks, LS-SVM, and PSVM.

In ELM, the hidden layer need not be tuned. For one output node case, the output function of ELM for generalized SLFNs is given by

$$
f_d (x) = \sum_{j=1}^d \beta_j h_j (x) = h^T (x) \beta,
$$

(7.11.12)

where \( h(x) = [h_1 (x), \ldots , h_d (x)]^T \) is the output vector of the hidden layer with respect to the input \( x \), and \( \beta = [\beta_1 , \ldots , \beta_d ]^T \) is the vector of the output weights between the hidden layer of \( d \) nodes and the output node.

The output vector \( h(x) \) is a feature mapping: it actually maps the data from the \( n \)-dimensional input space to the \( d \)-dimensional hidden layer feature space (ELM feature space) \( H \). For the binary classification problem, the decision function of ELM is

$$
f_d (x) = \text{sign} (h^T (x) \beta).
$$

(7.11.13)

An ELM with a single-output node is used to generate \( N \) output samples:

$$
\sum_{j=1}^d \beta_j h_j (x_i ) = y_i ,
$$

which can be rewritten in the form

$$
h^T (x_i ) \beta = y_i , \quad i = 1, \ldots , N
$$

or

$$
H \beta = y,
$$

where

$$
H = \begin{bmatrix}
h^T (x_1) \\
\vdots \\
h^T (x_N)
\end{bmatrix} = \begin{bmatrix}
h_1 (x_1 ) & \cdots & h_d (x_1) \\
\vdots & \ddots & \vdots \\
h_1 (x_N ) & \cdots & h_d (x_N )
\end{bmatrix},
$$

$$
\beta = \begin{bmatrix}
\beta_1 \\
\vdots \\
\beta_d
\end{bmatrix},
$$

$$
y = \begin{bmatrix}
y_1 \\
\vdots \\
y_N
\end{bmatrix}.
$$

The constrained optimization problem for ELM regression and binary classification with a single-output node can be formulated as [71]:

$$
\min_{\beta, \xi_i} L_{PELM} = \frac{1}{2} \|\beta\|_2^2 + \frac{C}{2} \sum_{i=1}^N \xi_i^2,
$$

subject to \( h^T (x_i ) \beta = y_i - \xi_i, \quad i = 1, \ldots , N. \)
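
Substituting the constraints \( \xi_i = y_i - h^T(x_i)\beta \) into the objective leaves a ridge-type problem, \( \min_\beta \frac{1}{2}\|\beta\|_2^2 + \frac{C}{2}\|y - H\beta\|_2^2 \), whose stationarity condition is \( (C^{-1} I + H^T H)\beta = H^T y \). The following sketch solves that system on random stand-in data; the helper name `elm_ridge_weights`, the shapes, and the value of \( C \) are illustrative assumptions.

```python
import numpy as np

def elm_ridge_weights(H, y, C):
    """Solve (I/C + H^T H) beta = H^T y for the output weights (hypothetical helper)."""
    d = H.shape[1]
    return np.linalg.solve(np.eye(d) / C + H.T @ H, H.T @ y)

rng = np.random.default_rng(1)
N, d, C = 40, 15, 10.0
H = rng.standard_normal((N, d))   # stand-in for the hidden layer output matrix
y = rng.standard_normal(N)        # stand-in for the training targets

beta = elm_ridge_weights(H, y, C)
xi = y - H @ beta                 # slack variables implied by the equality constraints

# Gradient of the objective after eliminating xi; it should vanish at the optimum.
grad = beta - C * H.T @ xi
print(np.max(np.abs(grad)))
```

A larger \( C \) penalizes the training errors \( \xi_i \) more heavily, while a smaller \( C \) favors a smaller norm of \( \beta \).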

The dual unconstrained optimization problem is

$$
\min_{\beta, \xi_i, \alpha_i} L_{DELM} (\beta, \xi_i, \alpha_i) = \frac{1}{2} \|\beta\|_2^2 + \frac{C}{2} \sum_{i=1}^N \xi_i^2 - \sum_{i=1}^N \alpha_i (h^T (x_i) \beta - y_i + \xi_i)
$$

with the Lagrange multipliers \( \alpha_i \geq 0, \quad i = 1, \ldots , N. \)

From the optimality conditions one has

$$
\frac{\partial L_{DELM}}{\partial \beta} = 0 \Rightarrow \beta = \sum_{i=1}^N \alpha_i h(x_i) = H^T \alpha,
$$

where \( \alpha = [\alpha_1, \ldots , \alpha_N]^T. \)
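
The relation \( \beta = H^T \alpha \) can be checked numerically. For the Lagrangian above, setting the derivative with respect to \( \xi_i \) to zero gives \( \xi_i = \alpha_i / C \), and inserting both relations into the constraints yields \( (C^{-1} I + H H^T)\alpha = y \). The sketch below (random stand-in data; sizes and \( C \) are assumptions) confirms that this dual route produces the same \( \beta \) as solving the \( d \times d \) system \( (C^{-1} I + H^T H)\beta = H^T y \) directly.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, C = 30, 12, 5.0
H = rng.standard_normal((N, d))   # rows are h^T(x_i)
y = rng.standard_normal(N)

# Dual route: solve the N x N system for alpha, then beta = H^T alpha.
alpha = np.linalg.solve(np.eye(N) / C + H @ H.T, y)
beta_dual = H.T @ alpha

# Primal route: solve the d x d system directly.
beta_primal = np.linalg.solve(np.eye(d) / C + H.T @ H, H.T @ y)

xi = alpha / C                    # slack variables from the stationarity condition
print(np.allclose(beta_dual, beta_primal))
```

Which route is cheaper depends on whether \( N \) or \( d \) is smaller, since one system is \( N \times N \) and the other is \( d \times d \).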



In [1]:
import numpy as np
from numpy.linalg import pinv

class ExtremeLearningMachine:
    def __init__(self, n_hidden_units, activation_function):
        self.n_hidden_units = n_hidden_units
        self.activation_function = activation_function

    def _add_bias(self, X):
        # Add bias term to input data
        bias = np.ones((X.shape[0], 1))
        return np.hstack([X, bias])

    def _init_weights(self, n_features):
        # Initialize input weights and biases randomly
        self.input_weights = np.random.randn(n_features, self.n_hidden_units)
        self.biases = np.random.randn(1, self.n_hidden_units)

    def _compute_hidden_layer_output(self, X):
        # Compute hidden layer output
        Z = np.dot(X, self.input_weights) + self.biases
        H = self.activation_function(Z)
        return H

    def fit(self, X_train, y_train):
        # Add bias term to input data
        X_train = self._add_bias(X_train)
        
        # Initialize weights
        n_features = X_train.shape[1]
        self._init_weights(n_features)
        
        # Compute hidden layer output
        H = self._compute_hidden_layer_output(X_train)
        
        # Compute output weights using Moore-Penrose pseudoinverse
        H_pinv = pinv(H)
        self.output_weights = np.dot(H_pinv, y_train)

    def predict(self, X_test):
        # Add bias term to input data
        X_test = self._add_bias(X_test)
        
        # Compute hidden layer output
        H = self._compute_hidden_layer_output(X_test)
        
        # Compute output
        y_pred = np.dot(H, self.output_weights)
        return y_pred

# Activation function: Sigmoid
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Example usage
if __name__ == "__main__":
    # Generate some random data for demonstration
    np.random.seed(0)
    X_train = np.random.rand(100, 10)
    y_train = np.random.rand(100, 1)
    X_test = np.random.rand(10, 10)

    # Create ELM instance
    elm = ExtremeLearningMachine(n_hidden_units=50, activation_function=sigmoid)

    # Train ELM
    elm.fit(X_train, y_train)

    # Predict with ELM
    y_pred = elm.predict(X_test)
    print("Predictions:\n", y_pred)


Predictions:
 [[-0.18704703]
 [ 0.22153287]
 [ 0.27996201]
 [ 0.77352065]
 [ 0.4413394 ]
 [ 0.46376024]
 [ 0.0402454 ]
 [ 0.53954023]
 [ 0.29939952]
 [ 0.52385948]]


Substituting (7.11.22) into (7.11.15) and using (7.11.18), we get the equation
$$
H^T H \alpha = y,
$$
(7.11.23)
where
$$
H = \begin{bmatrix}
h^T(x_1) \\
\vdots \\
h^T(x_N)
\end{bmatrix}^T
= \begin{bmatrix}
h(x_1), \ldots, h(x_N)
\end{bmatrix}
$$
and
$$
H^T H = \begin{bmatrix}
h^T(x_1)h(x_1) & \cdots & h^T(x_1)h(x_N) \\
\vdots & \ddots & \vdots \\
h^T(x_N)h(x_1) & \cdots & h^T(x_N)h(x_N)
\end{bmatrix}.
$$
(7.11.24)
The LS solution of Eq. (7.11.23) is given by
$$
\alpha = (H^T H)^\dagger y.
$$
(7.11.25)
If a feature mapping $h(x)$ is unknown to users, one can apply Mercer's conditions
on ELM, and hence a kernel matrix for ELM can be defined as follows [71]:
$$
K(x, x_i) = h^T(x)h(x_i).
$$
(7.11.26)
Using (7.11.26), Eq. (7.11.24) can be rewritten in the following kernel form:
$$
H^T H = \begin{bmatrix}
K(x_1, x_1) & \cdots & K(x_1, x_N) \\
\vdots & \ddots & \vdots \\
K(x_N, x_1) & \cdots & K(x_N, x_N)
\end{bmatrix},
$$
(7.11.27)
whose $(i, j)$th entry is $(H^T H)_{ij} = K(x_i, x_j)$ for $i, j = 1, \ldots, N$.
This shows that the feature mapping $h(x)$ need not be known explicitly; instead, it is enough
to use the corresponding kernel $K(x, x_i)$, e.g., the Gaussian kernel $K(x, x_i) = \exp(-\gamma \|x - x_i\|^2)$.
Finally, from (7.11.22) it follows that the ELM regression function is
$$
\hat{y} = h^T(x) \beta = h^T(x) \sum_{i=1}^N \alpha_i h(x_i) = \sum_{i=1}^N \alpha_i K(x, x_i),
$$
(7.11.28)
while the decision function for ELM binary classification is
$$
\text{class of } x = \text{sign} \left( \sum_{i=1}^N \alpha_i K(x, x_i) \right).
$$
(7.11.29)
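
To make the kernel form concrete, the sketch below uses an explicit random feature map $h(x)$ so that both sides of the regression formula can be evaluated. The weights `W`, biases `b`, the `tanh` activation, and all sizes are assumptions chosen for illustration. It checks that $h^T(x)\beta$ with $\beta = \sum_i \alpha_i h(x_i)$ agrees with $\sum_i \alpha_i K(x, x_i)$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, N = 4, 30, 10
W = rng.standard_normal((d, n))     # hypothetical random input weights
b = rng.standard_normal(d)          # hypothetical random biases

def h(x):
    """Explicit feature map of a d-node hidden layer (tanh is an arbitrary choice)."""
    return np.tanh(W @ x + b)

def K(u, v):
    """Kernel induced by the feature map: K(u, v) = h(u)^T h(v)."""
    return h(u) @ h(v)

X = rng.standard_normal((N, n))
y = rng.standard_normal(N)

gram = np.array([[K(xi, xj) for xj in X] for xi in X])   # entries K(x_i, x_j)
alpha = np.linalg.pinv(gram) @ y                          # alpha = (H^T H)^dagger y

x_new = rng.standard_normal(n)
beta = sum(a * h(xi) for a, xi in zip(alpha, X))          # beta = sum_i alpha_i h(x_i)
pred_feature = h(x_new) @ beta                            # h^T(x) beta
pred_kernel = sum(a * K(x_new, xi) for a, xi in zip(alpha, X))
print(np.isclose(pred_feature, pred_kernel))
```

In practice only the kernel route is needed; the feature map appears here solely to verify the identity.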
Algorithm 7.6 shows the ELM algorithm for regression and binary classification.

**Algorithm 7.6 ELM algorithm for regression and binary classification [71]**

1. **Input**: A training set $\{(x_i, y_i) | x_i \in \mathbb{R}^n, y_i \in \mathbb{R}, i = 1, \ldots, N\}$, hidden node number $d$, and the kernel function $K(u, v) = \exp(-\gamma \|u - v\|^2)$.
2. **Initialization**: $y = [y_1, \ldots, y_N]^T$.
3. **Learning step**:
    1. Use the kernel function to construct the matrix $(H^T H)_{ij} = K(x_i, x_j)$, $i, j = 1, \ldots, N$.
    2. Calculate the minimum norm least squares solution $\alpha = (H^T H)^\dagger y$.
4. **Testing step**: For a given testing sample $x \in \mathbb{R}^n$, the ELM regression output is $\sum_{i=1}^N \alpha_i K(x, x_i)$, while the ELM binary classification output is $\text{class of } x = \text{sign} \left( \sum_{i=1}^N \alpha_i K(x, x_i) \right)$.


In [2]:
import numpy as np
from numpy.linalg import pinv

class ELM:
    def __init__(self, n_hidden_units, kernel_function):
        self.n_hidden_units = n_hidden_units
        self.kernel_function = kernel_function

    def fit(self, X_train, y_train):
        N = X_train.shape[0]
        
        # Construct the kernel matrix
        H = np.zeros((N, N))
        for i in range(N):
            for j in range(N):
                H[i, j] = self.kernel_function(X_train[i], X_train[j])
        
        # Calculate the minimum norm least squares solution
        H_pinv = pinv(H)
        self.alpha = np.dot(H_pinv, y_train)
        
        self.X_train = X_train

    def predict(self, X_test):
        N = self.X_train.shape[0]
        M = X_test.shape[0]
        
        # Compute the kernel matrix between test data and training data
        K = np.zeros((M, N))
        for i in range(M):
            for j in range(N):
                K[i, j] = self.kernel_function(X_test[i], self.X_train[j])
        
        # Compute the predictions
        y_pred = np.dot(K, self.alpha)
        return y_pred

# Example kernel function: RBF (Gaussian)
def rbf_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.linalg.norm(x - y) ** 2)

# Example usage
if __name__ == "__main__":
    # Generate some random data for demonstration
    np.random.seed(0)
    X_train = np.random.rand(100, 10)
    y_train = np.random.rand(100, 1)
    X_test = np.random.rand(10, 10)

    # Create ELM instance
    elm = ELM(n_hidden_units=50, kernel_function=rbf_kernel)

    # Train ELM
    elm.fit(X_train, y_train)

    # Predict with ELM
    y_pred = elm.predict(X_test)
    print("Predictions:\n", y_pred)


Predictions:
 [[ 0.3175779 ]
 [ 0.40938498]
 [ 0.28007594]
 [ 0.84347293]
 [ 0.50490296]
 [ 0.51523179]
 [-0.0439662 ]
 [ 0.67854794]
 [ 0.24910511]
 [ 0.01155326]]
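
The double loops used above to fill the kernel matrices cost one Python call per pair of samples. A vectorized alternative is sketched below; the function name `rbf_kernel_matrix` is hypothetical, and the sketch assumes the same squared-norm Gaussian kernel as `rbf_kernel` in the cell above.

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma=1.0):
    """Return the matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Round-off can produce tiny negative squared distances; clip them to zero.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

rng = np.random.default_rng(0)
X = rng.random((100, 10))

K_fast = rbf_kernel_matrix(X, X, gamma=1.0)
# Reference computed pair by pair, as in the loop-based implementation.
K_loop = np.array([[np.exp(-np.linalg.norm(a - b) ** 2) for b in X] for a in X])
print(np.allclose(K_fast, K_loop))
```

The same function can build the test-versus-training kernel matrix by passing the test samples as `A` and the training samples as `B`.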


## Extreme Learning Machine Algorithm for Multiclass Classification
For multiclass applications, let ELM have multi-output nodes instead of a single-output node. For a \(k\)-class classifier, we are given a training set of \(N\) samples \(\{(x_i, y_i) | x_i \in \mathbb{R}^n, y_i \in \mathbb{R}^d\}, i = 1, \ldots, N\). If the original class label is \(p\), then the \(j\)-th entry of the expected output vector of the \(d\) output nodes, \(y_i = [y_{i1}, \ldots, y_{id}]^T \in \mathbb{R}^d\), is given by

$$
y_{ij} =
\begin{cases} 
1, & \text{if } j = p, \\
-1, & \text{if } j \neq p,
\end{cases}
$$

where \(p \in \{1, \ldots, k\}\). That is, only the \(p\)-th entry of \(y_i\) is +1, while the rest of the entries are set to -1.
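
The label encoding can be sketched as follows, assuming one output entry per class and integer labels in \(\{1, \ldots, k\}\); the helper name `encode_labels` is hypothetical.

```python
import numpy as np

def encode_labels(labels, k):
    """Map class labels in {1, ..., k} to N x k target vectors with entries +1 / -1."""
    labels = np.asarray(labels)
    Y = -np.ones((labels.shape[0], k))
    # Set the entry of the true class to +1 (labels are 1-based).
    Y[np.arange(labels.shape[0]), labels - 1] = 1.0
    return Y

Y = encode_labels([1, 3, 2, 3], k=3)
print(Y)
```

Each row of `Y` then contains exactly one +1, in the column of the sample's class.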

The primal classification problem of the \(m\)-th class for ELM with multi-output nodes can be formulated as [71]:

$$
\min_{\beta^m, b^m, \xi_{m,i}} \mathcal{L}_{\text{PELM}} =
\frac{1}{2} \|\beta^m\|_2^2 + \frac{1}{2} (b^m)^2 +
\frac{C}{2} \sum_{i=1}^{N} \xi_{m,i}^2
$$

subject to

$$
y_i^{(m)} \left( (h^m(x_i))^T \beta^m + b^m \right) = 1 - \xi_{m,i}, \quad i = 1, \ldots, N.
$$

The corresponding dual unconstrained optimization problem is given by

$$
\min_{\beta^m, b^m, \xi_{m,i}, \alpha_{m,i}} \mathcal{L}_{\text{DELM}} =
\frac{1}{2} \|\beta^m\|_2^2 + \frac{1}{2} (b^m)^2 +
\frac{C}{2} \sum_{i=1}^{N} \xi_{m,i}^2 -
\sum_{i=1}^{N} \alpha_{m,i} \left( y_i^{(m)} \left( (h^m(x_i))^T \beta^m + b^m \right) - 1 + \xi_{m,i} \right)
$$

From the optimality conditions one has

$$
\frac{\partial \mathcal{L}_{\text{DELM}}}{\partial \beta^m} = 0 \Rightarrow \beta^m = \sum_{i=1}^{N} \alpha_{m,i} y_i^{(m)} h^m(x_i),
$$

$$
\frac{\partial \mathcal{L}_{\text{DELM}}}{\partial b^m} = 0 \Rightarrow b^m = \sum_{i=1}^{N} \alpha_{m,i} y_i^{(m)},
$$

$$
\frac{\partial \mathcal{L}_{\text{DELM}}}{\partial \xi_{m,i}} = 0 \Rightarrow \xi_{m,i} = C^{-1} \alpha_{m,i},
$$

$$
\frac{\partial \mathcal{L}_{\text{DELM}}}{\partial \alpha_{m,i}} = 0 \Rightarrow y_i^{(m)} \left( (h^m(x_i))^T \beta^m + b^m \right) - 1 + \xi_{m,i} = 0,
$$

for \(i = 1, \ldots, N\).

Eliminating \(\beta^m\) and \(\xi_{m,i}\) in the above equations yields the KKT equations

$$
b^m = \sum_{i=1}^{N} \alpha_{m,i} y_i^{(m)} = (\alpha^m)^T y^m
$$

and

$$
\left( C^{-1} I + (H^m)^T H^m + y^m (y^m)^T \right) \alpha^m = 1,
$$

where \(I\) is an \(N \times N\) identity matrix, \(1\) is an \(N \times 1\) summing vector with all entries equal to 1, and

$$
H^m = \left[ y_1^{(m)} h^m(x_1), \ldots, y_N^{(m)} h^m(x_N) \right],
$$

$$
y^m = \left[ y_1^{(m)}, \ldots, y_N^{(m)} \right]^T,
$$

$$
\alpha^m = \left[ \alpha_{m,1}, \ldots, \alpha_{m,N} \right]^T.
$$

It easily follows that the \((i,j)\)-th entry of the \(N \times N\) matrix \((H^m)^T H^m\) can be represented as

$$
\left( (H^m)^T H^m \right)_{ij} = y_i^{(m)} y_j^{(m)} (h^m(x_i))^T h^m(x_j) = y_i^{(m)} y_j^{(m)} K^m(x_i, x_j),
$$

where \(K^m(x_i, x_j) = (h^m(x_i))^T h^m(x_j)\) is the kernel function for the \(m\)-th ELM classifier.
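
The entrywise kernel formula can be verified with an explicit feature matrix. In the sketch below, `F` holds random stand-ins for the feature vectors \(h^m(x_i)\) and `y_m` holds random \(\pm 1\) labels (both are assumptions for illustration). With \(H^m\) storing \(y_i^{(m)} h^m(x_i)\) as its columns, the \(N \times N\) Gram matrix of those columns equals the label-weighted kernel matrix.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 8, 5
F = rng.standard_normal((N, d))          # row i is a stand-in for h^m(x_i)^T
y_m = rng.choice([-1.0, 1.0], size=N)    # labels y_i^(m)

Hm = (F * y_m[:, None]).T                # d x N, column i is y_i^(m) h^m(x_i)
K = F @ F.T                              # K^m(x_i, x_j) = h^m(x_i)^T h^m(x_j)

gram_from_Hm = Hm.T @ Hm                 # N x N Gram matrix of the columns of H^m
gram_from_kernel = np.outer(y_m, y_m) * K
print(np.allclose(gram_from_Hm, gram_from_kernel))
```

This is why the learning step only needs the kernel values and the labels, never the feature vectors themselves.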

Algorithm 7.7 shows the ELM multiclass classification algorithm.

### Algorithm 7.7 ELM multiclass classification algorithm [71]

1. **Input:** A training set \(\{(x_i, y_i) | x_i \in \mathbb{R}^n, y_i \in \mathbb{R}^d, i = 1, \ldots, N\}\), hidden node number \(d\), the number \(k\) of classes and the kernel function \(K^m(u, v)\) for the \(m\)-th classifier \(m = 1, \ldots, k\), such as \(K^m(u, v) = \exp(-\gamma \|u - v\|^2)\).
2. **Initialization:** Reconstruct the \(m\)-th class’s output vector \(y^{(m)} = \left[ y_1^{(m)}, \ldots, y_N^{(m)} \right]^T\) with
   $$
   y_i^{(m)} =
   \begin{cases} 
   1, & \text{if } y_i = m; \\
   -1, & \text{if } y_i \neq m;
   \end{cases}
   $$
   for \(m = 1, \ldots, k\); \(i = 1, \ldots, N\).
3. **Learning step:**
   1. while \(m = 1, \ldots, k\)
   2. Use the kernel function \(K^m(x_i, x_j)\) and the \(m\)-th class’s output vector \(y^{(m)}\) to construct the matrix \((H^m)^T H^m\) where
      $$
      \left( (H^m)^T H^m \right)_{ij} = y_i^{(m)} y_j^{(m)} K^m(x_i, x_j), \quad i, j = 1, \ldots, N.
      $$
   3. Calculate the minimum norm least square solution
      $$
      \alpha^m = \left( (H^m)^T H^m + y^m (y^m)^T + C^{-1} I \right)^{-1} 1.
      $$
   4. Calculate
      $$
      b^m = (\alpha^m)^T y^m.
      $$
   5. endwhile
4. **Testing step:** For a given testing sample \(x \in \mathbb{R}^n\), the decision function of the ELM multiclass classifier is given by
   $$
   \text{class of } x = \arg \max_{m=1, \ldots, k} \left( \sum_{i=1}^{N} \alpha_{m,i} y_i^{(m)} K^m(x, x_i) + b^m \right).
   $$


In [5]:
import numpy as np
from scipy.linalg import solve
from sklearn.metrics.pairwise import rbf_kernel

class ELM_Multiclass:
    def __init__(self, hidden_nodes, C, gamma):
        # hidden_nodes is kept for interface compatibility; the kernel form
        # of Algorithm 7.7 never builds an explicit hidden layer.
        self.hidden_nodes = hidden_nodes
        self.C = C
        self.gamma = gamma
        self.k = None
        self.coef = None   # per-class vectors with entries alpha_{m,i} * y_i^(m)
        self.b = None      # per-class offsets b^m
        self.X = None

    def _construct_target_vector(self, y, m):
        # +1 for samples of class m, -1 for all other samples
        return np.where(y == m, 1, -1)

    def fit(self, X, y):
        # Class labels are assumed to be the integers 1, ..., k.
        N = X.shape[0]
        self.k = len(np.unique(y))
        self.X = X  # Store the training data for use in the predict method
        self.coef = []
        self.b = []

        # The kernel matrix does not depend on the class, so compute it once.
        K = rbf_kernel(X, X, gamma=self.gamma)
        identity_matrix = np.eye(N)
        one_vector = np.ones(N)

        for m in range(1, self.k + 1):
            y_m = self._construct_target_vector(y, m)
            yyT = np.outer(y_m, y_m)

            # ((H^m)^T H^m)_ij = y_i y_j K(x_i, x_j)
            A = yyT * K + yyT + (1 / self.C) * identity_matrix
            alpha_m = solve(A, one_vector)

            self.coef.append(alpha_m * y_m)
            self.b.append(np.dot(alpha_m, y_m))

    def predict(self, X):
        # Kernel values between the test samples and the training samples
        K = rbf_kernel(X, self.X, gamma=self.gamma)

        # Score of class m: sum_i alpha_{m,i} y_i^(m) K(x, x_i) + b^m
        scores = np.column_stack(
            [K @ coef_m + b_m for coef_m, b_m in zip(self.coef, self.b)]
        )
        return np.argmax(scores, axis=1) + 1  # Class labels start from 1

# Example usage
if __name__ == "__main__":
    # Generating random data for demonstration
    np.random.seed(0)
    X_train = np.random.rand(100, 20)
    y_train = np.random.randint(1, 4, size=100)  # 3 classes
    X_test = np.random.rand(10, 20)

    elm = ELM_Multiclass(hidden_nodes=50, C=1.0, gamma=0.5)
    elm.fit(X_train, y_train)
    y_pred = elm.predict(X_test)

    print("Predicted class labels:", y_pred)


We first introduce the definition of the basic concepts in graph embedding. Suppose we are given a graph \( G = (V, E) \), where \( v \in V \) is a vertex or node and \( e \in E \) is an edge. \( G \) is associated with a node type mapping function \( f_v: V \to T_v \) and an edge type mapping function \( f_e: E \to T_e \), where \( T_v \) and \( T_e \) denote the set of node types and edge types, respectively. Each node \( v_i \in V \) belongs to one particular type, i.e., \( f_v(v_i) \in T_v \). Similarly, for \( e_{ij} \in E \), \( f_e(e_{ij}) \in T_e \).
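
A typed graph of this kind can be represented with two dictionaries, one per mapping function. The sketch below is a minimal illustration; the class name `TypedGraph`, its methods, and the example node and edge types are hypothetical and not taken from the text.

```python
from dataclasses import dataclass, field

@dataclass
class TypedGraph:
    """A graph G = (V, E) with node and edge type mappings f_v and f_e."""
    node_type: dict = field(default_factory=dict)   # f_v: node -> type in T_v
    edge_type: dict = field(default_factory=dict)   # f_e: (u, v) -> type in T_e

    def add_node(self, v, vtype):
        self.node_type[v] = vtype

    def add_edge(self, u, v, etype):
        if u not in self.node_type or v not in self.node_type:
            raise KeyError("both endpoints must be added as nodes first")
        self.edge_type[(u, v)] = etype

    @property
    def T_v(self):
        """Set of node types that actually occur."""
        return set(self.node_type.values())

    @property
    def T_e(self):
        """Set of edge types that actually occur."""
        return set(self.edge_type.values())

g = TypedGraph()
g.add_node("alice", "author")
g.add_node("p1", "paper")
g.add_node("p2", "paper")
g.add_edge("alice", "p1", "writes")
g.add_edge("p1", "p2", "cites")
print(g.T_v, g.T_e)
```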

Graph learning is closely related to graph proximities and graph embedding. Graph learning tasks can be broadly abstracted into the following four categories:
- **Node classification** aims at determining the label of nodes (a.k.a. vertices) based on other labeled nodes and the topology of the network.
- **Link prediction** refers to the task of predicting missing links or links that are likely to occur in the future.
- **Clustering** is used to find subsets of similar nodes and group them together.
- **Visualization** helps in providing insights into the structure of the network.

The most basic measure for both dimension reduction and structure preservation of a graph is the graph proximity. Proximity measures are usually adopted to quantify the graph property to be preserved in the embedded space.

The microscopic structures of a graph can be described by its first-order proximity and second-order proximity. The first-order proximity is the local pairwise similarity between nodes that are directly connected by an edge.

**Definition 7.10 (First-Order Proximity [144])** The first-order proximity is the observed pairwise proximity between two nodes \( v_i \) and \( v_j \), denoted as \( S_{ij}^{(1)} = s_{ij} \), where \( s_{ij} \) is the edge weight between the two nodes. If no edge is observed between nodes \( i \) and \( j \), then their first-order proximity \( S_{ij}^{(1)} = 0 \).

The first-order proximity is the first and foremost measure of similarity between two nodes. It implies that two nodes in a real-world network are similar if they are linked by an observed edge. For example, if a paper cites another paper, they should share some common topic or keywords. However, capturing only the first-order proximity is not sufficient; the second-order proximity is also needed to capture the global network structure.

**Definition 7.11 (Second-Order Proximity [144])** Let \( \mathbf{s}_i^{(1)} = [S_{i,1}^{(1)}, \ldots, S_{i,n}^{(1)}]^T \) and \( \mathbf{s}_i^{(2)} = [S_{i,1}^{(2)}, \ldots, S_{i,n}^{(2)}]^T \) be the first-order and the second-order proximity vectors between node \( i \) and other nodes, respectively. Then the second-order proximity \( S_{ij}^{(2)} \) is determined by the similarity of \( \mathbf{s}_i^{(1)} \) and \( \mathbf{s}_j^{(1)} \). If no vertex is linked from/to both \( i \) and \( j \), then the second-order proximity between \( v_i \) and \( v_j \) is zero, i.e., \( S_{ij}^{(2)} = 0 \).

The second-order proximity \( S_{ij}^{(2)} \) is the similarity between \( v_i \)’s neighborhood \( \mathbf{s}_i^{(1)} \) and \( v_j \)’s neighborhood \( \mathbf{s}_j^{(1)} \).

- The first-order proximity compares the similarity between the nodes \( i \) and \( j \). The more similar two nodes are, the larger the first-order proximity value between them.
- The second-order proximity compares the similarity between the nodes’ neighborhood structures. The more similar two nodes’ neighborhoods are, the larger the second-order proximity value between them.

Similarly, we can define the higher-order proximity \( S_{ij}^{(k)} \) (where \( k \geq 3 \)) between a pair of vertices \( (i, j) \) in a graph.

**Definition 7.12 (k-Order Proximity)** Let \( \mathbf{s}_i^{(k)} = [S_{i,1}^{(k)}, \ldots, S_{i,n}^{(k)}]^T \) be the \( k \)-th order proximity vector between node \( i \) and other nodes. Then the \( k \)-th order proximity \( S_{ij}^{(k)} \) is determined by the similarity of \( \mathbf{s}_i^{(k-1)} \) and \( \mathbf{s}_j^{(k-1)} \).

In particular, when \( k \geq 3 \), the \( k \)-th order proximity is generally referred to as the higher-order proximity. The matrix \( S^{(k)} = [S_{ij}^{(k)}] \) is known as the \( k \)-order proximity matrix. When \( k \geq 3 \), \( S^{(k)} \) is called the higher-order proximity matrix. The higher-order proximity matrices are also defined using some other metrics, e.g., Katz Index, Rooted PageRank, Adamic-Adar, etc. that will be discussed in Sect. 7.13.3.

By Definitions 7.10, 7.11, and 7.12, the first-order, second-order, and third-order proximity matrices \( S^{(k)} = [S_{ij}^{(k)}] \in \mathbb{R}^{n \times n} \) (where \( k = 1, 2, 3 \)) are nonnegative matrices.

If considering the cosine similarity as the \( k \)-order proximity, then for nodes \( v_i \) and \( v_j \), we have the following results:
1. The first-order proximity
$$
S_{ij}^{(1)} = s_{ij}
$$
where \( s_{ij} \) is the edge weight between the two nodes.

2. The second-order proximity
$$
S_{ij}^{(2)} = \frac{\mathbf{s}_i^{(1)} \cdot \mathbf{s}_j^{(1)}}{\| \mathbf{s}_i^{(1)} \|_2 \| \mathbf{s}_j^{(1)} \|_2} = \frac{\sum_{l=1}^{n} S_{i,l}^{(1)} S_{j,l}^{(1)}}{\sqrt{\sum_{l=1}^{n} (S_{i,l}^{(1)})^2} \sqrt{\sum_{l=1}^{n} (S_{j,l}^{(1)})^2}}
$$
In this way, for nonnegative edge weights, the second-order proximity lies in the interval \([0, 1]\).

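3. The third-order proximity
$$
S_{ij}^{(3)} = \frac{\mathbf{s}_i^{(2)} \cdot \mathbf{s}_j^{(2)}}{\| \mathbf{s}_i^{(2)} \|_2 \| \mathbf{s}_j^{(2)} \|_2} = \frac{\sum_{l=1}^{n} S_{i,l}^{(2)} S_{j,l}^{(2)}}{\sqrt{\sum_{l=1}^{n} (S_{i,l}^{(2)})^2} \sqrt{\sum_{l=1}^{n} (S_{j,l}^{(2)})^2}}
$$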
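
The cosine-based proximities above can be computed directly from the weight matrix. The sketch below uses a small unweighted toy graph, which is an illustrative assumption, and the helper name `cosine_rows` is hypothetical. Each higher-order matrix is the row-wise cosine similarity of the previous one.

```python
import numpy as np

def cosine_rows(S):
    """Cosine similarity between every pair of rows of S (zero rows give 0)."""
    norms = np.linalg.norm(S, axis=1)
    safe = np.where(norms > 0, norms, 1.0)   # avoid division by zero for isolated nodes
    U = S / safe[:, None]
    return U @ U.T

# Toy graph on 5 nodes with edges 0-1, 0-2, 1-2, 2-3, 3-4 and unit weights.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
n = 5
S1 = np.zeros((n, n))
for i, j in edges:
    S1[i, j] = S1[j, i] = 1.0     # first-order proximity = edge weight

S2 = cosine_rows(S1)               # second-order proximity
S3 = cosine_rows(S2)               # third-order proximity

print(S2[0, 1])   # nodes 0 and 1 share the neighbor 2
print(S2[0, 4])   # nodes 0 and 4 share no neighbor
```

Note that the diagonal entries of `S2` and `S3` equal 1 for non-isolated nodes, since every node's neighborhood is identical to itself; only the off-diagonal entries compare distinct nodes.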

**Definition 7.13 (Graph Embedding [14, 41])** Given a graph \( G = (V, E) \) and a predefined embedding dimensionality \( d \) (\( d \ll |V| \)), graph embedding converts \( G \) into a \( d \)-dimensional space \( \mathbb{R}^d \). In this space, the graph properties (such as the first-order, second-order, and higher-order proximities) are preserved as much as possible. The graph is represented as either a \( d \)-dimensional vector (for a whole graph) or a set of \( d \)-dimensional vectors with each vector representing the embedding of part of the graph (e.g., node, edge, substructure). Therefore, a graph embedding maps each node of the graph \( G = (V, E) \) to a low-dimensional feature vector \( \mathbf{y}_i \) and tries to preserve the connection strengths between vertices. For example, a graph embedding preserving first-order proximity might be obtained by minimizing
$$
\sum_{i,j} s_{ij} \| \mathbf{y}_i - \mathbf{y}_j \|^2_2
$$
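
For a symmetric weight matrix \( S = [s_{ij}] \) with graph Laplacian \( L = D - S \), where \( D \) is the diagonal degree matrix, this objective equals \( 2\,\mathrm{tr}(Y^T L Y) \), with the rows of \( Y \) being the embeddings \( \mathbf{y}_i^T \). Without a constraint the minimum is the trivial solution in which all \( \mathbf{y}_i \) coincide, so the sketch below assumes the constraint \( Y^T Y = I \) and discards the constant eigenvector, in the style of Laplacian eigenmaps. This is one possible way to minimize the objective, not necessarily the method intended by the text; the function names and the toy graph are assumptions.

```python
import numpy as np

def spectral_embedding(S, d):
    """Embed nodes in R^d using the d smallest nontrivial eigenvectors of L = D - S."""
    L = np.diag(S.sum(axis=1)) - S
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    return eigvecs[:, 1:d + 1]               # skip the constant eigenvector (eigenvalue 0)

def objective(S, Y):
    """Evaluate sum_{i,j} s_ij * ||y_i - y_j||^2 directly."""
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sum(S * np.sum(diff**2, axis=2))

# Two triangles joined by one edge (a connected toy graph chosen for illustration).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
S = np.zeros((6, 6))
for i, j in edges:
    S[i, j] = S[j, i] = 1.0

Y = spectral_embedding(S, d=2)
L = np.diag(S.sum(axis=1)) - S
print(objective(S, Y), 2 * np.trace(Y.T @ L @ Y))
```

On this toy graph the first embedding coordinate takes one sign on nodes 0 to 2 and the opposite sign on nodes 3 to 5, so strongly connected nodes are placed close together.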

Graph embedding is an important method for learning low-dimensional representations of vertices in networks, aiming to capture and preserve the network structure. Learning network representations faces the following great challenges [154]:
1. **High nonlinearity:** The underlying structure of a graph or network is highly nonlinear. Therefore, designing a model to capture the highly nonlinear structure is rather difficult.
2. **Topology structure-preserving:** To support applications in analyzing networks, network embedding is required to preserve the network structure. However, the underlying structure of the network is very complex. The similarity of vertices is dependent on both the local and global network structures. Therefore, how to simultaneously preserve the local


In [6]:
# graph_embedding.py

import networkx as nx
from node2vec import Node2Vec
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import pandas as pd

# Create a graph
G = nx.Graph()

# Add nodes and edges
G.add_edges_from([
    (1, 2), (1, 3), (2, 4), (3, 4),
    (4, 5), (5, 6), (6, 7), (7, 5)
])

# Optionally, add node attributes or edge weights
nx.set_node_attributes(G, {1: 'A', 2: 'B', 3: 'C', 4: 'D', 5: 'E', 6: 'F', 7: 'G'}, 'label')

# Apply Node2Vec
node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200, workers=4)
model = node2vec.fit(window=10, min_count=1, sg=1)

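# Get node embeddings (node2vec stores each node under its string form,
# so the integer node ids must be converted before the lookup)
embeddings = {node: model.wv[str(node)] for node in G.nodes()}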

# Convert embeddings to a matrix
embedding_matrix = [embeddings[node] for node in G.nodes()]

# Perform PCA to reduce to 2 dimensions for visualization
pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(embedding_matrix)

# Create a DataFrame for plotting
df = pd.DataFrame(reduced_embeddings, index=G.nodes(), columns=['x', 'y'])

# Plotting
plt.figure(figsize=(10, 8))
plt.scatter(df['x'], df['y'])

for node, (x, y) in df.iterrows():
    plt.text(x, y, str(node), fontsize=12)

plt.title('Node Embeddings Visualization')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()

