In [None]:
'''
 * Copyright (c) 2018 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

## Structure and Property Preserving Network Embedding

The structural information that a network embedding draws from nodes and links includes neighborhood structure, higher-order proximities between nodes, and community structure.

### Definition 7.15 (Network Embedding)
Given a graph denoted as \( G = (V, E) \), network embedding aims to learn a mapping function \( f: v_i \rightarrow \mathbf{y}_i \in \mathbb{R}^d \), where \( d \ll |V| \). The objective of the function is to make the similarity between \( \mathbf{y}_i \) and \( \mathbf{y}_j \) explicitly preserve the first-order, second-order, and higher-order proximities of \( v_i \) and \( v_j \).

The microscopic structures of a network can be described by its first-order proximity and second-order proximity. Network embedding usually has the following two goals:

- **Network Reconstruction:** To learn low-dimensional vector representations for network nodes, the relationships among the nodes, which were originally represented by edges or other higher-order topological measures in graphs, are captured by the distances between nodes in the vector space. The topological and structural characteristics of a node are encoded into its embedding vector.
- **Network Inference:** The learned embedding space can effectively support network inference, such as predicting unseen links, identifying important nodes, and inferring node labels.

### Definition 7.16 (Information Network)
An information network is defined as \( G = (V, E) \), where \( V \) is the set of vertices, each representing a data object and \( E \) is the set of edges between the vertices, each representing a relationship between two data objects. Each edge \( e \in E \) is an ordered pair \( e = (i, j) \) and is associated with a weight \( w_{ij} > 0 \), which indicates the strength of the relation. If \( G(V, E) \) is undirected, then \( (i, j) \equiv (j, i) \) and \( w_{ij} \equiv w_{ji} \); and if \( G(V, E) \) is directed, then \( (i, j) \neq (j, i) \) and \( w_{ij} \neq w_{ji} \).
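As a minimal sketch of Definition 7.16, a weighted information network can be stored as a weight matrix. The helper name `build_weight_matrix` and the toy edge list are illustrative assumptions, not part of the original text.

```python
import numpy as np

def build_weight_matrix(num_vertices, weighted_edges, directed=False):
    """Return W with W[i, j] = w_ij for every edge (i, j); zero means no edge."""
    W = np.zeros((num_vertices, num_vertices))
    for i, j, w in weighted_edges:
        if w <= 0:
            raise ValueError("edge weights must be positive")
        W[i, j] = w
        if not directed:
            W[j, i] = w  # undirected: (i, j) and (j, i) share one weight
    return W

# Hypothetical toy network with three vertices
edges = [(0, 1, 2.0), (1, 2, 0.5), (2, 0, 1.0)]
W_undirected = build_weight_matrix(3, edges)
W_directed = build_weight_matrix(3, edges, directed=True)
```

In the undirected case the matrix is symmetric; in the directed case only the entry for the ordered pair is filled.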

### Definition 7.17 (Large-Scale Information Network Embedding)
Given a large network \( G = (V, E) \), the problem of large-scale information network embedding aims to represent each vertex \( v \in V \) in a low-dimensional space \( \mathbb{R}^d \), where \( d \ll |V| \). In the space \( \mathbb{R}^d \), both the first-order proximity and the second-order proximity between the vertices are preserved.

The first-order proximity can be measured by the joint probability distribution between two nodes \( v_i \) and \( v_j \) as
$$
p_1(v_i, v_j) = \frac{1}{1 + \exp(-\mathbf{u}_i^T \mathbf{u}_j)},
$$
where \( \mathbf{u}_i \in \mathbb{R}^d \) is the low-dimensional vector representation of vertex \( v_i \). A straightforward way to preserve the first-order proximity is to minimize the following objective function:
$$
O_1 = -\sum_{(i, j) \in E} \log p_1(v_i, v_j).
$$
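The first-order objective can be sketched in a few lines of NumPy. This is an illustration under assumed toy data (the embedding matrix `U` and the edge list are made up), not a training procedure.

```python
import numpy as np

def first_order_objective(U, edges):
    """O_1 = -sum over edges (i, j) of log sigmoid(u_i . u_j)."""
    i, j = np.array(edges).T
    scores = np.sum(U[i] * U[j], axis=1)
    # -log sigmoid(s) = log(1 + exp(-s)), evaluated stably with logaddexp
    return np.sum(np.logaddexp(0.0, -scores))

rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((4, 2))   # 4 vertices, d = 2 (assumed sizes)
edges = [(0, 1), (1, 2), (2, 3)]        # hypothetical edge set E
o1 = first_order_objective(U, edges)
```

With all-zero embeddings every edge has probability 1/2, so the objective equals \( |E| \log 2 \); training lowers it by increasing the inner products of connected vertices.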

The second-order proximity is modeled by the probability of the context node \( v_j \) being generated by node \( v_i \), that is,
$$
p_2(v_j \mid v_i) = \frac{\exp(\bar{\mathbf{u}}_j^T \mathbf{u}_i)}{\sum_{k=1}^{|V|} \exp(\bar{\mathbf{u}}_k^T \mathbf{u}_i)},
$$
where \( \mathbf{u}_i \) is the representation of \( v_i \) when it is treated as a vertex, while \( \bar{\mathbf{u}}_i \) is the representation of \( v_i \) when it is treated as a specific “context.” A straightforward way to preserve the second-order proximity is to minimize the following objective function:
$$
O_2 = -\sum_{(i, j) \in E} \log p_2(v_j \mid v_i).
$$

By learning \( \{\mathbf{u}_i\} \) for \( i = 1, \ldots, |V| \) and \( \{\bar{\mathbf{u}}_i\} \) for \( i = 1, \ldots, |V| \) that minimize these objectives, we are able to represent every vertex \( v_i \) with a \( d \)-dimensional vector \( \mathbf{u}_i \).
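The second-order objective can be sketched the same way. The function below evaluates the full softmax over all contexts; the matrices `U` and `U_ctx` and the edge list are assumed toy inputs.

```python
import numpy as np

def second_order_objective(U, U_ctx, edges):
    """O_2 = -sum over edges (i, j) of log p_2(v_j | v_i)."""
    logits = U @ U_ctx.T                          # logits[i, k] = ubar_k . u_i
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_p2 = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    i, j = np.array(edges).T
    return -np.sum(log_p2[i, j])

rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((4, 2))      # vertex representations u_i
U_ctx = 0.1 * rng.standard_normal((4, 2))  # context representations ubar_i
edges = [(0, 1), (1, 2), (2, 3)]           # hypothetical directed edges
o2 = second_order_objective(U, U_ctx, edges)
```

The normalizing sum runs over all \( |V| \) vertices for each edge, which is why large-scale implementations typically approximate it, for example with negative sampling.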


In [1]:
import numpy as np
import networkx as nx
from node2vec import Node2Vec
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Create a sample graph
G = nx.erdos_renyi_graph(n=100, p=0.1, seed=42)

# Node2Vec parameters
p = 1  # Return parameter
q = 1  # In-out parameter
dimensions = 64  # Dimension of the embedding space
walk_length = 30  # Length of each random walk
num_walks = 200  # Number of random walks per node

# Generate node embeddings using Node2Vec
node2vec = Node2Vec(G, dimensions=dimensions, walk_length=walk_length, num_walks=num_walks, p=p, q=q, workers=4)
# gensim >= 4.0 renamed the Word2Vec argument `iter` to `epochs`;
# passing `iter=1` there raises "TypeError: unexpected keyword argument 'iter'"
model = node2vec.fit(window=10, min_count=1, epochs=1)

# Extract embeddings
embeddings = np.array([model.wv.get_vector(str(node)) for node in G.nodes])

# Dimensionality reduction for visualization
pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(embeddings)

# Plot the embeddings
plt.figure(figsize=(12, 8))
plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], alpha=0.5, edgecolors='k')
plt.title('Node2Vec Embeddings Visualized Using PCA')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.show()

## Community Preserving Network Embedding

Given a network \( G(V, E) \) with \( |V| \) nodes and \( |E| \) edges, represented by an adjacency or similarity matrix \( S = [S_{ij}] \in \mathbb{R}^{|V| \times |V|} \), the goal is to find natural divisions of vertices into non-overlapping communities.

### Modularity

The **modularity** \( Q \) quantifies the strength of the division of a network into communities. It measures the difference between the number of edges within communities and the expected number of such edges in a random network.

For a particular division of the network into two groups, where the community membership \( h_i = 1 \) if vertex \( i \) belongs to group 1 and \( h_i = -1 \) if it belongs to group 2, the modularity can be expressed as:

$$
Q = \frac{1}{4m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) h_i h_j
$$

where:
- \( A_{ij} \) is the number of edges between vertices \( i \) and \( j \),
- \( k_i \) and \( k_j \) are the degrees of vertices \( i \) and \( j \),
- \( m = \frac{1}{2} \sum_{i} k_i \) is the total number of edges in the network.

In matrix-vector form:

$$
Q = \frac{1}{4m} h^T B h
$$

where:
- \( h \) is the community membership vector,
- \( B \) is the modularity matrix with elements:

$$
B_{ij} = A_{ij} - \frac{k_i k_j}{2m}
$$
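As a small worked example of the matrix-vector form, the sketch below computes \( Q \) for a two-group split. The graph (two triangles joined by one edge) and the function names are assumptions chosen for illustration.

```python
import numpy as np

def modularity_matrix(A):
    """B_ij = A_ij - k_i k_j / (2m)."""
    k = A.sum(axis=1)
    return A - np.outer(k, k) / k.sum()   # k.sum() equals 2m

def two_group_modularity(A, h):
    """Q = h^T B h / (4m) for a membership vector h with entries +1 / -1."""
    m = A.sum() / 2.0
    return h @ modularity_matrix(A) @ h / (4.0 * m)

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3)
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

h = np.array([1, 1, 1, -1, -1, -1], dtype=float)
Q = two_group_modularity(A, h)   # 5/14 for this split
```

Each row of \( B \) sums to zero, which is why the all-ones membership vector (no division at all) gives \( Q = 0 \).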

### Modularity Improvement

When a group \( g \) of size \( n_g \) is divided into two, the additional contribution to modularity can be expressed as:

$$
\Delta Q = \frac{1}{2m} \left( \frac{1}{2} \sum_{i,j \in g} B_{ij} (h_i h_j + 1) - \sum_{i,j \in g} B_{ij} \right)
$$

which simplifies to:

$$
\Delta Q = \frac{1}{4m} \left( \sum_{i,j \in g} B_{ij} h_i h_j - \sum_{i,j \in g} B_{ij} \right)
$$

which can be written as:

$$
\Delta Q = \frac{1}{4m} \left( h^T B(g) h \right)
$$

where:

$$
B(g)_{ij} = B_{ij} - \delta_{ij} \sum_{k \in g} B_{ik}
$$
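The equivalence between the summation form and the \( B(g) \) form can be checked numerically. The sketch below reuses an assumed toy graph (two triangles joined by an edge) and splits one vertex off a triangle; the function names are illustrative.

```python
import numpy as np

def modularity_matrix(A):
    k = A.sum(axis=1)
    return A - np.outer(k, k) / k.sum()

def generalized_modularity_matrix(B, group):
    """B(g)_ij = B_ij - delta_ij * sum_{k in g} B_ik, restricted to group g."""
    Bg = B[np.ix_(group, group)].copy()
    Bg[np.diag_indices_from(Bg)] -= Bg.sum(axis=1)
    return Bg

def modularity_gain(A, group, h):
    """Delta Q = h^T B(g) h / (4m) for splitting `group` according to h."""
    m = A.sum() / 2.0
    Bg = generalized_modularity_matrix(modularity_matrix(A), group)
    return h @ Bg @ h / (4.0 * m)

A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

group = [0, 1, 2]                      # the first triangle
h = np.array([1.0, 1.0, -1.0])         # split vertex 2 away from {0, 1}
delta_q = modularity_gain(A, group, h)

# Direct evaluation of the summation form for comparison
B_sub = modularity_matrix(A)[np.ix_(group, group)]
direct = (h @ B_sub @ h - B_sub.sum()) / (4.0 * A.sum() / 2.0)
```

Here the gain is negative, so splitting the triangle lowers modularity and the group should be left undivided.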

### Nonnegative Matrix Factorization (NMF)

To preserve both first-order and second-order proximities, the final similarity matrix combines the first-order proximity matrix \( S^{(1)} \) and the second-order proximity matrix \( S^{(2)} \):

$$
S = S^{(1)} + \eta S^{(2)}
$$

where \( \eta > 0 \) is the weight of the second-order proximity.

The community-preserving network embedding finds nonnegative matrices \( M \) and \( U \) via NMF:

$$
(M, U) = \arg \min_{M, U} \| S - MU \|_F^2
$$

subject to:

$$
M \geq 0, \quad U \geq 0
$$
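A minimal sketch of this factorization uses the classic multiplicative updates for NMF. The choice of \( S^{(1)} \) as the adjacency matrix and \( S^{(2)} \) as the cosine similarity between adjacency rows is an assumption made for the example, as are \( \eta = 5 \) and the toy graph.

```python
import numpy as np

def nmf(S, m, num_iter=200, eps=1e-9, seed=0):
    """Approximate S by M @ U with M >= 0 and U >= 0 (multiplicative updates)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = S.shape
    M = rng.random((n_rows, m))
    U = rng.random((m, n_cols))
    errors = []
    for _ in range(num_iter):
        M *= (S @ U.T) / (M @ U @ U.T + eps)
        U *= (M.T @ S) / (M.T @ M @ U + eps)
        errors.append(np.linalg.norm(S - M @ U, "fro") ** 2)
    return M, U, errors

# Toy graph: two triangles joined by one edge
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

S1 = A                                              # first-order proximity (assumed)
rows = A / np.linalg.norm(A, axis=1, keepdims=True)
S2 = rows @ rows.T                                  # second-order proximity (assumed)
eta = 5.0
S = S1 + eta * S2

M, U, errors = nmf(S, m=2)
```

Because the updates only multiply by nonnegative ratios, \( M \) and \( U \) stay nonnegative as long as they are initialized that way.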

### Community Structure Incorporation

For a network with \( k > 2 \) communities, the community membership indicator \( H \) is a matrix with one column for each community. The constraint is:

$$
\text{tr}(H^T H) = n
$$

Because modularity is to be maximized, the community detection problem is:

$$
H = \arg \max_H \text{tr}(H^T B H)
$$

subject to:

$$
\text{tr}(H^T H) = n
$$
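For a hard assignment, \( H \) is a one-hot indicator matrix, and the trace form can be evaluated directly. The sketch below uses an assumed toy graph of three triangles arranged in a ring, and normalizes by \( 2m \), the standard scaling for \( k \)-community modularity.

```python
import numpy as np

# Three triangles connected in a ring by the edges (2, 3), (5, 6), (8, 0)
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5),
         (6, 7), (6, 8), (7, 8), (2, 3), (5, 6), (8, 0)]
n = 9
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

k = A.sum(axis=1)
two_m = k.sum()
B = A - np.outer(k, k) / two_m

labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])   # assumed community labels
H = np.eye(3)[labels]                            # one-hot indicator, shape (n, 3)

Q = np.trace(H.T @ B @ H) / two_m                # modularity of the partition
constraint = np.trace(H.T @ H)                   # equals n for one-hot rows
```

Each row of a one-hot \( H \) has exactly one nonzero entry, so \( \text{tr}(H^T H) = n \) holds automatically.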

Additionally, introduce an auxiliary nonnegative matrix \( C \in \mathbb{R}^{k \times m} \), the community representation matrix, where \( m \) here denotes the embedding dimension rather than the number of edges. The matrix factorization problem is:

$$
(U, C) = \arg \min_{U, C} \| H - CU \|_F^2
$$

subject to:

$$
U \geq 0, \quad C \geq 0
$$


## Community Preserving Network Embedding Optimization

Combining Eqs. (7.13.10), (7.13.13), and (7.13.15), the objective of the community-preserving network embedding is given by [156]:

$$
\min_{M,U,H,C} \ (1 - \alpha) \| S - MU \|_F^2 + \alpha \| H - CU \|_F^2 - \beta \text{tr}(H^T B H)
\tag{7.13.17}
$$

subject to:

$$
M \geq 0, \ U \geq 0, \ H \geq 0, \ C \geq 0, \ \text{tr}(H^T H) = n,
\tag{7.13.18}
$$

where \( \alpha \) and \( \beta \) are positive parameters for adjusting the contribution of the corresponding terms.

### H-Subproblem

Updating \( H \) with the other parameters fixed leads to the following optimization subproblem [156]:

$$
\min_H \ \alpha \| H - UC^T \|_F^2 - \beta \text{tr}(H^T B H)
\tag{7.13.19}
$$

subject to:

$$
\text{tr}(H^T H) = n.
\tag{7.13.20}
$$

This constraint can be relaxed by adding a penalty term that pushes \( H^T H \) toward \( I \), leading to:

$$
H = \arg \min_H \ \alpha \| H - UC^T \|_F^2 - \beta \text{tr}(H^T B H) + \lambda \| H^T H - I \|_F^2,
\tag{7.13.21}
$$

where \( \lambda > 0 \) should be large enough to ensure that orthogonality is satisfied. Writing \( B = A - B_1 \) with \( (B_1)_{ij} = k_i k_j / (2m) \), where \( m \) is the number of edges, the multiplicative updating rule for \( H \) is:

$$
H \leftarrow H \odot \sqrt{\frac{-2\beta B_1 H + \sqrt{\Delta}}{8\lambda H H^T H}},
\qquad
\Delta = (2\beta B_1 H) \odot (2\beta B_1 H) + 16\lambda (H H^T H) \odot \left( 2\beta AH + 2\alpha UC^T + (4\lambda - 2\alpha)H \right),
\tag{7.13.22-7.13.24}
$$

where \( \odot \) denotes the Hadamard product, and the division and square roots are element-wise operations.

### Joint NMF Subproblem

Updating \( M \), \( U \), and \( C \) with \( H \) fixed leads to the joint NMF problems [3]:

$$
\min_{M,U,C} \ (1 - \alpha) \| S - MU \|_F^2 + \alpha \| H - CU \|_F^2
\tag{7.13.25}
$$

subject to:

$$
M \geq 0, \ U \geq 0, \ C \geq 0.
\tag{7.13.26}
$$

The update rules for matrices \( M \) and \( C \) are:

$$
M \leftarrow M \odot \frac{S U^T}{M U U^T},
\tag{7.13.27}
$$

$$
C \leftarrow C \odot \frac{H U^T}{C U U^T}.
\tag{7.13.28}
$$

For the coefficient matrix \( U \), the update rule carries the same weights \( (1 - \alpha) \) and \( \alpha \) as the objective (7.13.25):

$$
U \leftarrow U \odot \frac{(1 - \alpha) M^T S + \alpha C^T H}{((1 - \alpha) M^T M + \alpha C^T C) U}.
\tag{7.13.29}
$$


In [1]:
import numpy as np
from numpy.linalg import norm

EPS = 1e-9  # Keeps the element-wise divisions away from zero

# Shape conventions: S, A, B1 are (n, n); M is (n, m); U is (m, n);
# C is (k, m); H is (n, k), with n nodes, embedding dimension m and
# k communities, so that S ~ M @ U and H ~ U.T @ C.T.

def update_H(H, U, C, A, B1, alpha, beta, lambda_reg):
    # Multiplicative rule (7.13.22)-(7.13.24) with B = A - B1.
    # Needs lambda_reg >= alpha / 2 so that delta stays nonnegative.
    HHtH = H @ (H.T @ H)
    bB1H = 2 * beta * (B1 @ H)
    delta = bB1H ** 2 + 16 * lambda_reg * HHtH * (
        2 * beta * (A @ H) + 2 * alpha * (U.T @ C.T)
        + (4 * lambda_reg - 2 * alpha) * H
    )
    num = np.maximum(np.sqrt(delta) - bB1H, 0.0)
    return H * np.sqrt(num / np.maximum(8 * lambda_reg * HHtH, EPS))

def update_M(S, U, M):
    # (7.13.27): M <- M * (S U^T) / (M U U^T)
    return M * (S @ U.T) / np.maximum(M @ (U @ U.T), EPS)

def update_C(H, U, C):
    # (7.13.28) applied to H^T ~ C U: C <- C * (H^T U^T) / (C U U^T)
    return C * (H.T @ U.T) / np.maximum(C @ (U @ U.T), EPS)

def update_U(S, M, H, C, U, alpha):
    # (7.13.29) with weights (1 - alpha) and alpha
    num = (1 - alpha) * (M.T @ S) + alpha * (C.T @ H.T)
    denom = ((1 - alpha) * (M.T @ M) + alpha * (C.T @ C)) @ U
    return U * num / np.maximum(denom, EPS)

def objective_function(S, M, U, H, C, B, alpha, beta):
    term1 = (1 - alpha) * norm(S - M @ U, 'fro') ** 2
    term2 = alpha * norm(H - U.T @ C.T, 'fro') ** 2
    term3 = -beta * np.trace(H.T @ B @ H)
    return term1 + term2 + term3

def community_preserving_embedding(S, A, m, k, alpha, beta, lambda_reg,
                                   max_iter=100, tol=1e-5, seed=0):
    n = S.shape[0]
    degrees = A.sum(axis=1)
    B1 = np.outer(degrees, degrees) / degrees.sum()  # k_i k_j / (2m_edges)
    B = A - B1                                       # Modularity matrix

    # Initialize M, U, H, C with nonnegative random values
    rng = np.random.default_rng(seed)
    M = rng.random((n, m))
    U = rng.random((m, n))  # Column i is the embedding of node i
    H = rng.random((n, k))
    C = rng.random((k, m))

    # Iterative optimization
    for iteration in range(max_iter):
        H_prev = H.copy()

        # Update H, M, C, U
        H = update_H(H, U, C, A, B1, alpha, beta, lambda_reg)
        M = update_M(S, U, M)
        C = update_C(H, U, C)
        U = update_U(S, M, H, C, U, alpha)

        # Objective (7.13.17); it omits the orthogonality penalty,
        # so it is not guaranteed to decrease at every iteration
        obj_val = objective_function(S, M, U, H, C, B, alpha, beta)
        print(f"Iteration {iteration + 1}, Objective Function: {obj_val:.6f}")

        # Check for convergence
        if norm(H - H_prev) < tol:
            print("Converged")
            break

    return M, U, H, C


In [3]:
# Example input: a planted-partition graph with three communities
rng = np.random.default_rng(42)
n, m, k = 60, 16, 3  # nodes, embedding dimension, communities
labels = np.repeat(np.arange(k), n // k)
edge_prob = np.where(labels[:, None] == labels[None, :], 0.5, 0.05)
upper = np.triu(rng.random((n, n)) < edge_prob, 1)
A = (upper | upper.T).astype(float)  # Symmetric adjacency, no self-loops

# Similarity S = S1 + eta * S2, with S1 the adjacency matrix and S2 the
# cosine similarity between adjacency rows (an assumed choice for S2)
row_norms = np.maximum(norm(A, axis=1, keepdims=True), EPS)
S2 = (A / row_norms) @ (A / row_norms).T
eta = 5.0
S = A + eta * S2

# Hyperparameters (lambda_reg must be at least alpha / 2)
alpha = 0.5
beta = 0.1
lambda_reg = 1e3
max_iter = 100

# Run the optimization
M, U, H, C = community_preserving_embedding(S, A, m, k, alpha, beta, lambda_reg, max_iter)
print("Community assignment per node:", H.argmax(axis=1))



## Higher-Order Proximity Preserved Network Embedding

Many graph embedding algorithms embed a graph into a vector space that preserves the structure and inherent properties of the graph, but they do not consider how to preserve asymmetric transitivity, which is a critical property of directed graphs.

Transitivity is a common characteristic of undirected and directed graphs [111, 140], and plays a key role in graph inference and analysis tasks, such as calculating similarities between nodes and measuring the importance of nodes.

- **In undirected graphs**: If there is an edge between vertices \( u \) and \( w \), and another between \( w \) and \( v \), then \( u \) and \( v \) are likely connected by an edge. Transitivity is symmetric in undirected graphs.
- **In directed graphs**: If there is a directed path from \( u \) to \( v \), then a directed edge from \( u \) to \( v \) is likely, but an edge from \( v \) to \( u \) is not. That is, transitivity is asymmetric in directed graphs.

Consider a directed graph \( G = (V , E) \), where \( V = \{v_1, \dots, v_N\} \) is the vertex set, and \( N \) is the number of vertices. \( E \) is the directed edge set, i.e., \( e_{ij} = (v_i , v_j) \in E \) represents a directed edge from \( v_i \) to \( v_j \). The adjacency matrix is denoted by \( A \). If \( S_{ij} \) are the higher-order proximities between \( v_i \) and \( v_j \), then \( S = [S_{ij}] \) is known as a higher-order proximity matrix. Let \( U = [U^s , U^t] \) be the embedding matrix whose \( i \)-th row \( u_i \) is the embedding vector of \( v_i \), and let \( U^s \), \( U^t \in \mathbb{R}^{N \times K} \) be the source embedding vectors and target embedding vectors, respectively, where \( K \) is the embedding dimension.

The higher-order proximity preserved embedding (HOPE) in [111] can be stated as follows: Given a higher-order proximity matrix \( S \), find the source embedding matrix \( U^s \) and the target embedding matrix \( U^t \). The objective of this problem is defined as [111]:

$$
\min \| S - U^s (U^t)^\top \|_F^2 .
\tag{7.13.30}
$$

Let the singular value decomposition (SVD) of the higher-order proximity matrix \( S \) be given by

$$
S = \sum_{i=1}^{N} \sigma_i v^s_i (v^t_i)^\top ,
\tag{7.13.31}
$$

where \( \sigma_1 , \dots, \sigma_N \) are the singular values sorted in decreasing order, and \( v^s_i \) and \( v^t_i \) are the left- and right-singular vectors associated with \( \sigma_i \) of \( S \).

Comparing (7.13.31) with (7.13.30) shows that the source and target embedding matrices can be determined by

$$
U^s = [ \sqrt{\sigma_1} v^s_1, \dots, \sqrt{\sigma_K} v^s_K ] ,
\tag{7.13.32}
$$

$$
U^t = [ \sqrt{\sigma_1} v^t_1, \dots, \sqrt{\sigma_K} v^t_K ] ,
\tag{7.13.33}
$$

where \( K \) is the number of largest singular values of \( S \) that are retained, which equals the embedding dimension.

Many higher-order proximity measurements on graphs reflect asymmetric transitivity, and their proximity matrices share a general formulation:

$$
S = M_g^{-1} M_l ,
\tag{7.13.34}
$$

where \( M_g \) and \( M_l \) are both polynomials of matrices.

The following are a few examples of higher-order proximity matrices [111]:

1. **Katz Index [76]**:

$$
S_{\text{Katz}} = \beta A + \beta A S_{\text{Katz}} ,
\tag{7.13.35}
$$

from which it follows that

$$
S_{\text{Katz}} = (I - \beta A)^{-1} \beta A ,
\tag{7.13.36}
$$

where \( \beta \) is a decay parameter. For the underlying series \( \sum_{l \geq 1} \beta^l A^l \) to converge, \( \beta \) should be smaller than the reciprocal of the spectral radius of the adjacency matrix. Clearly, for the Katz index, one has

$$
M_g = (I - \beta A) \quad \text{and} \quad M_l = \beta A .
\tag{7.13.37}
$$

2. **Rooted PageRank (RPR)**:

$$
S_{\text{RPR}} = \alpha S_{\text{RPR}} P + (1 - \alpha) I \implies S_{\text{RPR}} = (I - \alpha P)^{-1} (1 - \alpha) I ,
\tag{7.13.38}
$$

where \( \alpha \in [0, 1) \) is the probability to randomly walk to a neighbor, and \( P \) is the probability transition matrix satisfying the condition \( \sum_{i=1}^{N} P_{ij} = 1 \). Clearly, for RPR,

$$
M_g = I - \alpha P \quad \text{and} \quad M_l = (1 - \alpha) I .
\tag{7.13.39}
$$

3. **Common Neighbors (CN)**: \( S_{ij}^{\text{CN}} \) counts the number of vertices connecting to both \( v_i \) and \( v_j \). For directed graphs, \( S_{ij}^{\text{CN}} \) is the number of vertices which are the target of an edge from \( v_i \) and the source of an edge to \( v_j \). Formally,

$$
S^{\text{CN}} = A^2
\tag{7.13.40}
$$

from which we get

$$
M_g = I \quad \text{and} \quad M_l = A^2 .
\tag{7.13.41}
$$

4. **Adamic-Adar (AA)**: Adamic-Adar is a variant of common neighbors:

$$
S^{\text{AA}} = A D A
\tag{7.13.42}
$$

which gives

$$
M_g = I \quad \text{and} \quad M_l = A D A ,
\tag{7.13.43}
$$

where

$$
D_{ii} = \left( \sum_j (A_{ij} + A_{ji}) \right)^{-1} .
\tag{7.13.44}
$$
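The four proximity matrices above can be sketched with NumPy as follows. The toy directed graph, the parameter values, and the column normalization used to build \( P \) from \( A \) are assumptions for illustration.

```python
import numpy as np

def katz(A, beta):
    """S = (I - beta A)^{-1} beta A, computed with a linear solve."""
    I = np.eye(len(A))
    return np.linalg.solve(I - beta * A, beta * A)

def rooted_pagerank(A, alpha):
    """S = (1 - alpha) (I - alpha P)^{-1}, with P column-normalized from A."""
    col_sums = A.sum(axis=0)
    P = A / np.where(col_sums > 0, col_sums, 1.0)   # columns of P sum to 1
    I = np.eye(len(A))
    return (1 - alpha) * np.linalg.inv(I - alpha * P), P

def common_neighbors(A):
    return A @ A

def adamic_adar(A):
    deg = (A + A.T).sum(axis=1)
    d = np.divide(1.0, deg, out=np.zeros_like(deg), where=deg > 0)
    return A @ np.diag(d) @ A

# Hypothetical directed graph: 0->1, 1->2, 2->3, 3->0, 0->2
A = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]:
    A[i, j] = 1.0

rho = np.max(np.abs(np.linalg.eigvals(A)))   # spectral radius of A
beta = 0.5 / rho                             # keeps the Katz series convergent
S_katz = katz(A, beta)
S_rpr, P = rooted_pagerank(A, alpha=0.5)
S_cn = common_neighbors(A)
S_aa = adamic_adar(A)
```

All four matrices are asymmetric for this graph, which is the property the source and target embeddings are meant to capture.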


Computing the higher-order proximity matrix \( S \) from \( M_g \) and \( M_l \) requires the matrix inverse \( M^{-1}_g \). To improve the numerical stability of the higher-order proximity preserved embedding, Ou et al. [111] suggested computing the generalized singular value decomposition (GSVD) of the matrix pair \( (M_g , M_l) \) instead of the SVD of \( S \):

$$
\mathbf{V}^\top_t M^\top_l \mathbf{X} = \text{Diag}(\sigma^l_1, \dots, \sigma^l_N) ,
\tag{7.13.45}
$$

$$
\mathbf{V}^\top_s M^\top_g \mathbf{X} = \text{Diag}(\sigma^g_1, \dots, \sigma^g_N) ,
\tag{7.13.46}
$$

where \( \mathbf{X} \) is a nonsingular matrix, and

$$
\sigma^l_1 \geq \sigma^l_2 \geq \dots \geq \sigma^l_N \geq 0,
\tag{7.13.47}
$$

$$
0 \leq \sigma^g_1 \leq \sigma^g_2 \leq \dots \leq \sigma^g_N,
\tag{7.13.48}
$$

$$
(\sigma^l_i)^2 + (\sigma^g_i)^2 = 1, \quad \forall i.
\tag{7.13.49}
$$

Most existing embedding methods focus on the static network while neglecting the evolving characteristic of real-world networks. Recently, Zhu et al. [176] proposed a higher-order proximity preserved embedding for dynamic networks.


In [6]:
import numpy as np
from scipy.linalg import solve, svd

def higher_order_proximity_embedding(Mg, Ml, K):
    """
    Compute the higher-order proximity preserved embedding from (Mg, Ml).

    scipy.linalg provides no GSVD routine (importing `gsvd` raises
    ImportError), so this version forms S = Mg^{-1} Ml with a linear solve
    and takes its ordinary SVD, following (7.13.31)-(7.13.33).

    Parameters:
    Mg (numpy.ndarray): Matrix Mg of shape (N, N).
    Ml (numpy.ndarray): Matrix Ml of shape (N, N).
    K (int): Embedding dimension.

    Returns:
    Us (numpy.ndarray): Source embedding matrix of shape (N, K).
    Ut (numpy.ndarray): Target embedding matrix of shape (N, K).
    """
    # Solve Mg S = Ml instead of inverting Mg explicitly
    S = solve(Mg, Ml)

    # Singular values are returned in decreasing order
    Vs, sigma, Vt_h = svd(S, full_matrices=False)

    # Keep the top K components and split sqrt(sigma) between both sides
    sqrt_sigma = np.sqrt(sigma[:K])
    Us = Vs[:, :K] * sqrt_sigma
    Ut = Vt_h[:K, :].T * sqrt_sigma

    return Us, Ut

# Example usage
if __name__ == "__main__":
    N = 100  # Number of nodes
    K = 10   # Embedding dimension

    # Random directed graph without self-loops
    rng = np.random.default_rng(0)
    A = (rng.random((N, N)) < 0.05).astype(float)
    np.fill_diagonal(A, 0.0)

    # Katz index: Mg = I - beta * A and Ml = beta * A,
    # with beta below the reciprocal of the spectral radius of A
    rho = np.max(np.abs(np.linalg.eigvals(A)))
    beta = 0.5 / rho
    Mg = np.eye(N) - beta * A
    Ml = beta * A

    # Compute the higher-order proximity preserved embedding
    Us, Ut = higher_order_proximity_embedding(Mg, Ml, K)

    S = solve(Mg, Ml)
    rel_err = np.linalg.norm(S - Us @ Ut.T) / np.linalg.norm(S)
    print("Source embedding shape:", Us.shape)
    print("Target embedding shape:", Ut.shape)
    print(f"Relative reconstruction error: {rel_err:.4f}")

## Graph Neural Networks

A graph neural network (GNN) is a neural network that directly processes data represented in a graph domain. A typical application of GNNs is node classification: each node in the graph is associated with a label, and the goal is to predict the labels of the nodes that have no ground-truth label.

Let the vertex or node \( x_i \) (\( i = 1, \dots, n \)) represent the \( i \)th data point (feature). The goal of GNNs is to learn a state embedding \( h_i \in \mathbb{R}^s \) that contains the neighborhood information of the \( i \)th node \( x_i \).

GNNs are a general neural network architecture defined according to a graph structure \( G = (V, E) \). Nodes \( j \in V \) take unique values from \(\{1, \dots, |V|\}\), and edges are pairs \( e = (i, j) \in V \times V \). In directed graphs, \( (i, j) \) represents a directed edge \( i \rightarrow j \). The node vector, also called node representation or node embedding, for node \( j \) is denoted by \( x_j \in \mathbb{R}^D \). Graphs may also contain node labels \( l_j \in \{1, \dots, L_{|V|}\} \) for each node \( j \) and edge labels or edge types \( l_e \in \{1, \dots, L_{|E|}\} \) for each edge. Let \( x_S = \{x_j \ |\ j \in S\} \) when \( S \) is a set of nodes, and \( l_E = \{l_e \ |\ e \in E\} \) when \( E \) is a set of edges.

Let \( f_i \) be a parametric function (for the node \( i \)), called the local transition function, that expresses the dependence of node \( i \) on its neighborhood, and let \( g_i \) be the local output function (for the node \( i \)) that describes how the output is produced.

The set \( \text{ne}[n] \) stands for the neighbors of the vertex \( n \), i.e., the nodes connected to \( n \) by an arc, while \( \text{co}[n] \) denotes the set of arcs having \( n \) as a vertex.

The state vector or state embedding \( h_i \) and the output \( o_i \) can be represented by:

$$
h_i = f_i\left(x_i, x_{\text{co}[i]}, h_{\text{ne}[i]}, x_{\text{ne}[i]}\right),
$$
$$
o_i = g_i\left(h_i, x_i\right),
$$

where:

- \( x_i \): the features of the node \( i \),
- \( x_{\text{co}[i]} \): the features of edges connecting with \( i \),
- \( h_{\text{ne}[i]} \): the embedding of the nodes in the neighborhood of \( i \),
- \( x_{\text{ne}[i]} \): the features of the nodes in the neighborhood of \( i \),
- \( f_i \): a transition function that maps the above four inputs to the \( s \)-dimensional state space,
- \( g_i \): an output function when the input is \( x_i \) and the transited state is \( h_i \).

**Example 7.2** Let \( x_1, \dots, x_8 \) be eight data points, where \( x_2, x_3, x_4, x_6 \) are the neighbors of \( x_1 \). Then, \( x_{\text{co}[1]} = \left(e(1,2), e(3,1), e(1,4), e(6,1)\right) \), where \( e(i,j) \) denotes the label of the edge connecting node \( i \) to node \( j \), \( x_{\text{ne}[1]} = \left(x_2, x_3, x_4, x_6\right) \), and \( h_{\text{ne}[1]} = \left(h_2, h_3, h_4, h_6\right) \). In other words,

$$
h_1 = f_1\left( x_1, x_{\text{co}[1]} = \left(e(1,2), e(3,1), e(1,4), e(6,1)\right), h_{\text{ne}[1]} = \left(h_2, h_3, h_4, h_6\right), x_{\text{ne}[1]} = \left(x_2, x_3, x_4, x_6\right)\right),
$$

which contains information on the neighborhood for the first node \( x_1 \).
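The inputs of Example 7.2 can be assembled in a few lines of NumPy. This is a minimal sketch, not the book's implementation: the feature sizes, the random values, and the particular form of `f_1` (sum the neighborhood inputs, then apply one tanh layer) are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes are illustrative assumptions, not values from the text
num_nodes, feat_dim, edge_dim, state_dim = 8, 3, 2, 4

# Node features x_1..x_8 and current states h_1..h_8, keyed by 1-based node index
x = {i: rng.normal(size=feat_dim) for i in range(1, num_nodes + 1)}
h = {i: rng.normal(size=state_dim) for i in range(1, num_nodes + 1)}

# co[1]: arcs incident to node 1, in the orientation used in Example 7.2
co_1 = [(1, 2), (3, 1), (1, 4), (6, 1)]
edge_feat = {e: rng.normal(size=edge_dim) for e in co_1}

# ne[1]: for each incident arc, the endpoint that is not node 1
ne_1 = [j if i == 1 else i for (i, j) in co_1]

x_co_1 = np.stack([edge_feat[e] for e in co_1])  # edge features, shape (4, edge_dim)
x_ne_1 = np.stack([x[j] for j in ne_1])          # neighbor features, shape (4, feat_dim)
h_ne_1 = np.stack([h[j] for j in ne_1])          # neighbor states, shape (4, state_dim)

def f_1(x_i, x_co, h_ne, x_ne, W):
    """Hypothetical transition function: sum each neighborhood input, then one tanh layer."""
    z = np.concatenate([x_i, x_co.sum(axis=0), h_ne.sum(axis=0), x_ne.sum(axis=0)])
    return np.tanh(W @ z)

W = rng.normal(size=(state_dim, feat_dim + edge_dim + state_dim + feat_dim))
h_1 = f_1(x[1], x_co_1, h_ne_1, x_ne_1, W)
print(ne_1, h_1.shape)
```

Summing over the neighborhood keeps the input size of `f_1` independent of the number of neighbors, which is one common way to let a single parametric function serve nodes of different degree.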

Let \( h \), \( o \), \( x \), and \( x_N \) be the vectors constructed by stacking all the states, all the outputs, all the features, and all the node features, respectively. Equations (7.14.1) and (7.14.2) can be rewritten in a compact form as [129]:

$$
h = f(h, x),
$$
$$
o = g(h, x_N),
$$

where \( f = [f_1, \dots, f_N]^T \) and \( g = [g_1, \dots, g_N]^T \) are stacked versions of local transition functions and local output functions corresponding to all \( N \) nodes, respectively; and are known as the global transition function and global output function for all nodes in a graph, respectively. The aim of GNNs is to learn the global transition function \( f \) and the global output function \( g \).

Let \( t_i = W h_i \) be the target information used to supervise node \( i \). The loss can then be written as:

$$
L(W) = \frac{1}{2} \sum_{i=1}^{p} \|t_i - o_i\|_2^2,
$$

where \( p \) is the number of supervised nodes.

By Banach’s fixed point theorem [78], GNN uses the following classic iterative scheme for updating the state:

$$
h^{(t+1)} = f\left(h^{(t)}, x\right),
$$

where \( h^{(t)} \) denotes the \( t \)th iteration of \( h \). Provided that \( f \) is a contraction map, as Banach's theorem requires, this update converges exponentially fast to the solution of Eq. (7.14.3) for any initial value \( h^{(0)} \).
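The iteration can be tried on a small example. The sketch below assumes a specific global transition function, \( f(h, x) = \tanh(A h W + x) \), on a ring graph; this form, the sizes, and the tolerance are illustrative assumptions. The weight matrix is rescaled so that \( f \) is a contraction with factor 0.5, which is the condition Banach's theorem needs.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 6, 4  # number of nodes and state size (illustrative)

# Ring graph: each node averages its two neighbors; this A has spectral norm 1
idx = np.arange(n)
A = np.zeros((n, n))
A[idx, (idx + 1) % n] = 0.5
A[idx, (idx - 1) % n] = 0.5

# Rescale W to spectral norm 0.5 so that f is a contraction in h
W = rng.normal(size=(s, s))
W *= 0.5 / np.linalg.norm(W, 2)
x = rng.normal(size=(n, s))

def f(h, x):
    """Assumed global transition function; tanh is 1-Lipschitz."""
    return np.tanh(A @ h @ W + x)

def iterate(h0, x, tol=1e-10, max_iter=200):
    """Apply h <- f(h, x) until successive iterates differ by less than tol."""
    h = h0
    for t in range(max_iter):
        h_next = f(h, x)
        if np.linalg.norm(h_next - h) < tol:
            return h_next, t + 1
        h = h_next
    return h, max_iter

# Two very different starting points
h_a, steps_a = iterate(np.zeros((n, s)), x)
h_b, steps_b = iterate(10.0 * rng.normal(size=(n, s)), x)
print(steps_a, steps_b, np.abs(h_a - h_b).max())
```

Both runs should stop well before `max_iter` and agree to within the tolerance, illustrating that the limit does not depend on \( h^{(0)} \).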

The dynamical systems based on the computations of \( f \) and \( g \) can be interpreted as feedforward neural networks. To learn the parameters of \( f \) and \( g \), given the target information \( t \) for the supervision, the loss in (7.14.5) can be rewritten as follows:

$$
L(W) = \frac{1}{2} \|t - o\|_2^2 = \frac{1}{2} \|Wh - o\|_2^2.
$$


The learning algorithm is based on a gradient descent strategy composed of the following steps [129]:

1. The states \( h_i^{(t)} \) are iteratively updated by Eq. (7.14.1) until a time \( T \). They approach the fixed point solution to Eq. (7.14.3): 
   $$
   h^{(T)} \approx h.
   $$

2. Compute the gradient of weights:
   $$
   \nabla L(W^t) = \frac{\partial L}{\partial W^t} = (W^t h - o^t) h^T.
   $$

3. The weights are updated as:
   $$
   W^{t+1} = W^t - \mu_t \nabla L(W^t).
   $$

4. Return to Step 1 and repeat the above steps until \( W \) converges.
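Steps 2 and 3 can be checked numerically. The following sketch treats the converged state \( h \) and the output \( o \) as fixed random vectors (an assumption made to isolate the weight update), verifies the gradient formula \( (Wh - o)h^T \) against central finite differences, and then runs the update with a constant step size \( \mu \) (also an assumption; the text allows \( \mu_t \) to vary).

```python
import numpy as np

rng = np.random.default_rng(0)
s, m = 4, 3                    # state and output sizes (illustrative assumptions)
h = rng.normal(size=s)         # stands in for the converged state h^(T) from Step 1
o = rng.normal(size=m)         # stands in for the output o^t, held fixed in this sketch
W0 = rng.normal(size=(m, s))   # initial weights

def loss(W):
    """L(W) = 1/2 * ||W h - o||^2."""
    r = W @ h - o
    return 0.5 * float(r @ r)

def grad(W):
    """Step 2: gradient (W h - o) h^T."""
    return np.outer(W @ h - o, h)

# Central finite differences as an independent check of the gradient formula
eps = 1e-6
numeric = np.zeros_like(W0)
for i in range(m):
    for j in range(s):
        E = np.zeros_like(W0)
        E[i, j] = eps
        numeric[i, j] = (loss(W0 + E) - loss(W0 - E)) / (2 * eps)
max_grad_error = np.abs(numeric - grad(W0)).max()

# Steps 3 and 4: repeated gradient updates with a constant step size
mu = 0.5 / float(h @ h)
W = W0.copy()
losses = []
for t in range(30):
    losses.append(loss(W))
    W = W - mu * grad(W)

print(max_grad_error, losses[0], losses[-1])
```

With \( \mu = 0.5 / \|h\|_2^2 \), each update halves the residual \( Wh - o \), so the loss shrinks by a factor of four per step in this simplified setting.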


In [7]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define the GNN model
class GNNModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GNNModel, self).__init__()
        self.fc1 = nn.Linear(input_dim * 2, hidden_dim)  # Transition function
        self.fc2 = nn.Linear(hidden_dim, output_dim)     # Output function

    def forward(self, x, adjacency_matrix):
        # x: node features
        # adjacency_matrix: adjacency matrix of the graph

        # Update states based on neighborhood and self features
        h = torch.relu(self.fc1(torch.cat([x, torch.matmul(adjacency_matrix, x)], dim=1)))

        # Compute outputs
        o = self.fc2(h)
        return h, o

# Define the loss function
def compute_loss(outputs, targets):
    return nn.MSELoss()(outputs, targets)

# Training function
def train_gnn(model, adjacency_matrix, node_features, targets, num_epochs, learning_rate):
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    
    for epoch in range(num_epochs):
        model.train()
        
        # Forward pass
        h, outputs = model(node_features, adjacency_matrix)
        
        # Compute loss
        loss = compute_loss(outputs, targets)
        
        # Zero gradients, backward pass, optimizer step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        print(f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}')

# Example usage
num_nodes = 8
input_dim = 5
hidden_dim = 16
output_dim = 1
num_epochs = 100
learning_rate = 0.01

# Dummy data
node_features = torch.rand(num_nodes, input_dim)  # Random node features
adjacency_matrix = torch.rand(num_nodes, num_nodes)  # Random adjacency matrix
targets = torch.rand(num_nodes, output_dim)  # Random targets

model = GNNModel(input_dim, hidden_dim, output_dim)
train_gnn(model, adjacency_matrix, node_features, targets, num_epochs, learning_rate)


Epoch 1/100, Loss: 0.29036661982536316
Epoch 2/100, Loss: 0.17837175726890564
Epoch 3/100, Loss: 0.11794856190681458
Epoch 4/100, Loss: 0.10147053748369217
Epoch 5/100, Loss: 0.11370628327131271
Epoch 6/100, Loss: 0.13271023333072662
Epoch 7/100, Loss: 0.14212076365947723
Epoch 8/100, Loss: 0.13877928256988525
Epoch 9/100, Loss: 0.12701861560344696
Epoch 10/100, Loss: 0.11275219917297363
Epoch 11/100, Loss: 0.10054752230644226
Epoch 12/100, Loss: 0.09286817163228989
Epoch 13/100, Loss: 0.09019728004932404
Epoch 14/100, Loss: 0.09137881547212601
Epoch 15/100, Loss: 0.09436710178852081
Epoch 16/100, Loss: 0.09707813709974289
Epoch 17/100, Loss: 0.09805217385292053
Epoch 18/100, Loss: 0.09674113988876343
Epoch 19/100, Loss: 0.09343274682760239
Epoch 20/100, Loss: 0.08895740658044815
Epoch 21/100, Loss: 0.0843329131603241
Epoch 22/100, Loss: 0.08045006543397903
Epoch 23/100, Loss: 0.07785101979970932
Epoch 24/100, Loss: 0.07662783563137054
Epoch 25/100, Loss: 0.0764523595571518
Epoch 26/10

In [8]:
import numpy as np

class GNNModel:
    def __init__(self, input_dim, hidden_dim, output_dim):
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim
        
        # Initialize weights
        self.W1 = np.random.randn(input_dim * 2, hidden_dim)
        self.W2 = np.random.randn(hidden_dim, output_dim)
        self.b1 = np.zeros(hidden_dim)
        self.b2 = np.zeros(output_dim)
    
    def forward(self, X, adjacency_matrix):
        # X: node features (num_nodes x input_dim)
        # adjacency_matrix: adjacency matrix (num_nodes x num_nodes)
        
        # Compute neighborhood aggregation
        aggregated_neigh = np.dot(adjacency_matrix, X)
        
        # Combine self features and neighborhood features
        combined = np.hstack([X, aggregated_neigh])
        
        # Transition function
        self.hidden = np.maximum(0, np.dot(combined, self.W1) + self.b1)  # ReLU activation
        
        # Output function
        self.outputs = np.dot(self.hidden, self.W2) + self.b2
        return self.outputs
    
    def compute_loss(self, outputs, targets):
        return np.mean((outputs - targets) ** 2)
    
    def backward(self, X, adjacency_matrix, targets, learning_rate):
        num_nodes = X.shape[0]
        
        # Compute gradients using the chain rule
        dL_doutputs = 2 * (self.outputs - targets) / num_nodes
        dL_dW2 = np.dot(self.hidden.T, dL_doutputs)
        dL_db2 = np.sum(dL_doutputs, axis=0)
        
        dL_dhidden = np.dot(dL_doutputs, self.W2.T)
        dL_dhidden[self.hidden <= 0] = 0  # ReLU derivative
        
        dL_dW1 = np.dot(np.hstack([X, np.dot(adjacency_matrix, X)]).T, dL_dhidden)
        dL_db1 = np.sum(dL_dhidden, axis=0)
        
        # Update weights and biases
        self.W2 -= learning_rate * dL_dW2
        self.b2 -= learning_rate * dL_db2
        self.W1 -= learning_rate * dL_dW1
        self.b1 -= learning_rate * dL_db1
    
    def train(self, X, adjacency_matrix, targets, num_epochs, learning_rate):
        for epoch in range(num_epochs):
            # Forward pass
            outputs = self.forward(X, adjacency_matrix)
            
            # Compute loss
            loss = self.compute_loss(outputs, targets)
            
            # Backward pass and weight update
            self.backward(X, adjacency_matrix, targets, learning_rate)
            
            print(f'Epoch {epoch + 1}/{num_epochs}, Loss: {loss}')

# Example usage
num_nodes = 8
input_dim = 5
hidden_dim = 16
output_dim = 1
num_epochs = 100
learning_rate = 0.01

# Dummy data
X = np.random.rand(num_nodes, input_dim)  # Node features
adjacency_matrix = np.random.rand(num_nodes, num_nodes)  # Adjacency matrix
targets = np.random.rand(num_nodes, output_dim)  # Target values

model = GNNModel(input_dim, hidden_dim, output_dim)
model.train(X, adjacency_matrix, targets, num_epochs, learning_rate)


Epoch 1/100, Loss: 963.8184548060966
Epoch 2/100, Loss: 72167.33222532854
Epoch 3/100, Loss: 26.00344496518261
Epoch 4/100, Loss: 24.977228963022203
Epoch 5/100, Loss: 23.991651114547345
Epoch 6/100, Loss: 23.0451021488721
Epoch 7/100, Loss: 22.136036522237585
Epoch 8/100, Loss: 21.2629698944178
Epoch 9/100, Loss: 20.424476705059686
Epoch 10/100, Loss: 19.619187846000145
Epoch 11/100, Loss: 18.845788425759366
Epoch 12/100, Loss: 18.10301562256012
Epoch 13/100, Loss: 17.389656622367564
Epoch 14/100, Loss: 16.704546638582634
Epoch 15/100, Loss: 16.046567010155588
Epoch 16/100, Loss: 15.41464337501425
Epoch 17/100, Loss: 14.80774391582451
Epoch 18/100, Loss: 14.224877675218686
Epoch 19/100, Loss: 13.665092937740852
Epoch 20/100, Loss: 13.127475675867139
Epoch 21/100, Loss: 12.611148057563625
Epoch 22/100, Loss: 12.115267012944933
Epoch 23/100, Loss: 11.639022857693138
Epoch 24/100, Loss: 11.181637970989314
Epoch 25/100, Loss: 10.742365525798963
Epoch 26/100, Loss: 10.320488269438147
Epoch

## DeepWalk and GraphSAGE

The original GNN has three main limitations:

1. **Fixed Point Hypothesis:**
   - Iterating the node states until they reach a fixed point is computationally inefficient. If the “fixed point” hypothesis is relaxed, a more stable representation can be learned using multilayer perceptrons, and the iterative update process can be dropped.

2. **Edge Information Handling:**
   - It does not handle edge information (for example, different edges in a knowledge graph may represent different relationships between nodes).

3. **Node Representation Diversity:**
   - Fixed points hinder the diversity of node distribution and are not suitable for learning good representations of nodes.
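One way to picture the relaxed fixed-point hypothesis is to replace the iteration to convergence with a small, fixed number of layers, each with its own weights. The sketch below is an assumption-laden illustration rather than any specific published variant: `mean_adjacency` and `layered_states` are hypothetical helpers, and the layer form (concatenate each state with the mean of its neighbors, then apply tanh) mirrors the aggregation used in the code cells above.

```python
import numpy as np

def mean_adjacency(edges, n):
    """Row-normalized adjacency matrix of an undirected graph (hypothetical helper)."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    deg = A.sum(axis=1, keepdims=True)
    return A / np.maximum(deg, 1.0)

def layered_states(x, A, weights):
    """Apply one layer per weight matrix instead of iterating to a fixed point."""
    h = x
    for W in weights:
        h = np.tanh(np.hstack([h, A @ h]) @ W)
    return h

rng = np.random.default_rng(0)
n, d = 4, 3
A = mean_adjacency([(0, 1), (1, 2), (2, 3)], n)  # path graph 0 - 1 - 2 - 3
x = rng.normal(size=(n, d))
weights = [rng.normal(scale=0.5, size=(2 * d, d)) for _ in range(3)]

# Perturb the features of node 3 and see how far the change travels
x_perturbed = x.copy()
x_perturbed[3] += 1.0

diff_two_layers = np.abs(
    layered_states(x, A, weights[:2])[0] - layered_states(x_perturbed, A, weights[:2])[0]
).max()
diff_three_layers = np.abs(
    layered_states(x, A, weights)[0] - layered_states(x_perturbed, A, weights)[0]
).max()
print(diff_two_layers, diff_three_layers)
```

Node 3 is three hops from node 0, so two layers leave the state of node 0 unaffected by the perturbation while three layers do not. The depth of the stack therefore plays the role that the number of iterations played in the original scheme.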

In view of the above limitations of the original GNN, many variants have been proposed. Two typical variants are **DeepWalk** and **GraphSAGE**.

### DeepWalk

Social representations are expected to have the following characteristics:

- **Adaptable:** Real social networks are constantly evolving; new social relations should not require repeating the learning process all over again.

- **Community Aware:** The distance between latent dimensions should represent a metric for evaluating social similarity between the corresponding members of the network. This allows generalization in networks with homophily.

- **Low Dimensional:** When labeled data is scarce, low-dimensional models generalize better and speed up convergence and inference.

- **Continuous:** Latent representations are required to model partial community membership in continuous space. In addition to providing a nuanced view of community membership, a continuous representation has smooth decision boundaries between communities, which allows more robust classification.



DeepWalk, introduced by Perozzi et al. [114], satisfies these requirements by learning representations for vertices from a stream of short random walks, using optimization techniques originally designed for language modeling.

#### 1. Random Walks

Random walks are executed on the nodes of a graph to generate sequences of nodes. A random walk rooted at vertex \( v_i \) is denoted as \( W_{v_i} \). It is a stochastic process with random variables \( W_{v_i}^1, W_{v_i}^2, \ldots, W_{v_i}^k \) such that \( W_{v_i}^{k+1} \) is a vertex chosen at random from the neighbors of the vertex visited at step \( k \). Because they capture local structure, streams of short random walks can be used as a basic tool for extracting information from a network. Using random walks also has two other desirable properties [114]:

- **Local Exploration:** Local exploration is easy to parallelize. Several random walkers (in different threads, processes, or machines) can simultaneously explore different parts of the same graph.

- **Adaptability:** Relying on information obtained from short random walks makes it possible to accommodate small changes in the graph structure without the need for global recomputation. The learned model can be iteratively updated with new random walks from the changed region in time sub-linear to the entire graph.

#### 2. Language Modeling

Run skip-gram to learn the embedding of each node according to the sequence of nodes generated in random walks. Language modeling can be generalized to explore the graph through a stream of short random walks. These random walks can be thought of as short sentences and phrases in a special language; the direct analog is to estimate the likelihood of observing vertex \( v_i \) given all the previous vertices visited so far in the random walk, i.e.,

$$
\text{Pr}(v_i | (v_1, v_2, \ldots, v_{i-1}))
$$
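Skip-gram, mentioned above, does not model the full history of a walk. It instead treats every vertex in a walk as a center and the vertices within a window of size \( w \) around it as its context. The helper below, `skipgram_pairs`, is a hypothetical name used only for this sketch; it lists the (center, context) training pairs that one walk produces.

```python
def skipgram_pairs(walk, window_size):
    """Return all (center, context) pairs within window_size positions of each other."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window_size)
        hi = min(len(walk), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs

walk = [0, 1, 2, 3]
print(skipgram_pairs(walk, window_size=1))
# [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
```

Each pair contributes one term to the skip-gram objective, so longer walks and wider windows increase the number of updates per walk.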

#### Algorithm 7.8: DeepWalk(G, w, d, γ, t) [114]

```text
Input: graph G(V, E); window size w; embedding size d; walks per vertex γ; walk length t.
Initialization: sample matrix Θ from U^{|V| × d}.
Build a binary tree T from V.
for i = 0 to γ do
    O = Shuffle(V)
    for each v_i ∈ O do
        W_{v_i} = RandomWalk(G, v_i, t)
        SkipGram(Θ, W_{v_i}, w)
    end for
end for
Output: matrix of vertex representations Θ ∈ R^{|V| × d}
```


In [11]:
import numpy as np
import random

# Function to generate a random walk
def random_walk(graph, start_node, walk_length):
    walk = [start_node]
    for _ in range(walk_length - 1):
        cur_node = walk[-1]
        neighbors = list(graph.get(cur_node, []))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return walk

# Skip-gram model class
class SkipGramModel:
    def __init__(self, vocab_size, embedding_dim, learning_rate=0.01):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.learning_rate = learning_rate
        
        # Initialize weights
        self.W = np.random.randn(vocab_size, embedding_dim)  # Embedding matrix for input words
        self.W_out = np.random.randn(embedding_dim, vocab_size)  # Embedding matrix for output words
    
    def train(self, walks, window_size, epochs):
        for epoch in range(epochs):
            total_loss = 0
            for walk in walks:
                for i, center in enumerate(walk):
                    context = [walk[j] for j in range(max(0, i - window_size), min(len(walk), i + window_size + 1)) if j != i]
                    for word in context:
                        loss = self._update(center, word)
                        total_loss += loss
            print(f"Epoch {epoch + 1}/{epochs}, Loss: {total_loss:.4f}")
    
    def _update(self, center, context):
        # Copy the center embedding so both gradients use the pre-update value
        center_vector = self.W[center].copy()

        # Score every vertex as a candidate context; a single column of W_out
        # would give only a scalar score, which cannot be indexed by `context`
        scores = np.dot(center_vector, self.W_out)  # shape: (vocab_size,)
        probs = self._softmax(scores)

        # Cross-entropy loss for the observed context vertex
        loss = -np.log(probs[context])

        # Gradient of the loss with respect to the scores: probs - one_hot(context)
        grad = probs.copy()
        grad[context] -= 1

        # Compute both gradients before updating either matrix
        grad_center = np.dot(self.W_out, grad)    # shape: (embedding_dim,)
        grad_out = np.outer(center_vector, grad)  # shape: (embedding_dim, vocab_size)
        self.W[center] -= self.learning_rate * grad_center
        self.W_out -= self.learning_rate * grad_out

        return loss
    
    def _softmax(self, x):
        e_x = np.exp(x - np.max(x))
        return e_x / e_x.sum(axis=0)

# DeepWalk algorithm
def deepwalk(graph, num_walks, walk_length, window_size, embedding_dim, epochs):
    # 1. Generate random walks
    walks = []
    nodes = list(graph.keys())
    for _ in range(num_walks):
        random.shuffle(nodes)
        for node in nodes:
            walks.append(random_walk(graph, node, walk_length))
    
    # 2. Train skip-gram model
    model = SkipGramModel(vocab_size=len(graph), embedding_dim=embedding_dim)
    model.train(walks, window_size, epochs)
    
    return model.W

# Example usage
if __name__ == "__main__":
    # Example graph as an adjacency list
    graph = {
        0: [1, 2],
        1: [0, 2, 3],
        2: [0, 1, 3],
        3: [1, 2]
    }

    # Parameters
    num_walks = 10
    walk_length = 5
    window_size = 2
    embedding_dim = 2
    epochs = 10

    # Run DeepWalk
    embeddings = deepwalk(graph, num_walks, walk_length, window_size, embedding_dim, epochs)
    print("Node embeddings:")
    print(embeddings)

In [12]:
import numpy as np

# Example function that may cause IndexError
def example_function():
    # Create a dictionary with nodes and their neighbors
    graph = {
        0: [1, 2],
        1: [0, 2, 3],
        2: [0, 1, 3],
        3: [1, 2]
    }
    
    # Initialize the embedding matrix
    num_nodes = len(graph)
    embedding_dim = 2
    embeddings = np.random.randn(num_nodes, embedding_dim)
    
    # Function to perform a simple operation
    def process_node(node):
        if node in graph:
            neighbors = graph[node]
            # Avoid IndexError: Ensure we index into arrays properly
            node_embedding = embeddings[node]  # This should be a valid row from the embeddings matrix
            print(f"Node {node} embedding: {node_embedding}")
            for neighbor in neighbors:
                if neighbor < num_nodes:  # Ensure neighbor index is valid
                    neighbor_embedding = embeddings[neighbor]
                    print(f"Neighbor {neighbor} embedding: {neighbor_embedding}")
                else:
                    print(f"Neighbor index {neighbor} is out of bounds.")
        else:
            print(f"Node {node} not found in the graph.")
    
    # Example usage
    process_node(0)
    process_node(5)  # Example of a node that is not in the graph

# Call the function
example_function()


Node 0 embedding: [-0.29050043  1.26720343]
Neighbor 1 embedding: [0.36383828 1.31667723]
Neighbor 2 embedding: [-2.29991282 -1.43031955]
Node 5 not found in the graph.
