A PyTorch-based library for embedding large graphs to low-dimensional space using force-directed layouts with GPU acceleration.
The aim of this project is to speed up the computation of low-dimensional layouts for large graphs, especially by exploiting GPU acceleration.
- Install PyTorch (follow the official instructions)
- Install `torch-graph-force`:

```bash
pip install torch-graph-force
```
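A quick sanity check after installing, assuming only that the package imports as `torch_graph_force` (as in the examples below); the CUDA check is plain PyTorch:

```python
import torch
import torch_graph_force

# If this prints True, spring_layout will run on the GPU by default.
print(torch.cuda.is_available())
```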
The dataset can be created from a Pandas DataFrame, an edge list, or a NetworkX graph using `from_pandas_dataframe`, `from_edgelist`, or `from_networkx` respectively. `from_pandas_dataframe` is the recommended method, as it is more efficient than the others.
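For reference, a sketch of the two alternatives. The exact signatures of `from_edgelist` and `from_networkx` are assumptions here, modeled on `from_pandas_dataframe` below; check the library's documentation for the authoritative ones:

```python
import networkx as nx
import torch_graph_force

# From a plain edge list (assumed to accept (source, target) pairs
# plus the number of nodes, mirroring from_pandas_dataframe).
edges = [(0, 1), (1, 2), (2, 3)]
ds = torch_graph_force.from_edgelist(edges, 4)

# From a NetworkX graph (assumed to take the graph object directly).
G = nx.path_graph(4)
ds = torch_graph_force.from_networkx(G)
```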
If the node IDs are consecutive integers starting from 0, the dataset can be constructed with a dataframe for edges and the number of nodes:
```python
import pandas as pd
import torch_graph_force

# The first argument is a dataframe of edges with at least two columns for source and target nodes.
# By default, the columns "source", "target" and "weight" are taken as source nodes, target nodes and edge weights.
# The "weight" column is optional; if it is absent, a weight of 1.0 is used for all edges.
df = pd.DataFrame([[0, 1], [1, 2], [2, 3]], columns=['source', 'target'])
# The second argument is the number of nodes, used when the node IDs are consecutive integers starting from 0.
n_nodes = 4
# Create a GraphDataset for the graph.
ds = torch_graph_force.from_pandas_dataframe(df, n_nodes)
```
If the node IDs are not consecutive integers, a list of node IDs must be provided:
```python
import pandas as pd
import torch_graph_force

df = pd.DataFrame([["A", "B"], ["B", "C"], ["C", "D"]], columns=['source', 'target'])
# The order of the nodes in "nodes" is used to map node IDs to node indices.
nodes = ["A", "B", "C", "D"]
ds = torch_graph_force.from_pandas_dataframe(df, nodes)
# The dataset's order follows the order of the provided node list: here, ds[0] returns
# the data for node "A" and ds[1] the data for node "B".
# The list of nodes can be accessed with ds.nodes.
print(ds.nodes)
```
Once the graph dataset is ready, we can feed it to `spring_layout` to compute the graph layout.
```python
pos = torch_graph_force.spring_layout(ds)
# pos is a numpy array of shape (n_nodes, n_dim);
# each row is the position of the node with the corresponding index.
print(pos)
# If the node IDs are not consecutive integers, a mapping from node ID
# to position can be built from the node list:
node_pos = {nid: pos[idx] for idx, nid in enumerate(ds.nodes)}
```
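For a quick visual check, the returned positions can be plotted directly. Note that matplotlib is not a dependency of this library; the snippet below is only an illustration:

```python
import matplotlib.pyplot as plt

# Each row of pos is the (x, y) coordinate of one node.
plt.scatter(pos[:, 0], pos[:, 1], s=5)
plt.show()
```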
Optional arguments for `spring_layout` (a combined usage sketch follows the list):

- `batch_size`: number of nodes to process in a batch. A larger batch size usually speeds up processing but consumes more memory. (default: 64)
- `iterations`: maximum number of iterations. (default: 50)
- `num_workers`: number of workers fetching data from the GraphDataset. If the device is "cuda", `num_workers` must be 0. (default: 0)
- `device`: the device that stores the graph and the layout model. If None, "cuda" is used when available, otherwise "cpu". (default: None)
- `iteration_progress`: monitor the progress of each iteration; useful for large graphs. (default: False)
- `layout_config`: additional configuration for the layout model. (default: {})
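Putting several of these together (the parameter names come from the list above; the values are arbitrary):

```python
pos = torch_graph_force.spring_layout(
    ds,
    batch_size=128,           # larger batches are faster but use more memory
    iterations=100,           # allow more iterations for the layout to settle
    device="cpu",             # force CPU even when CUDA is available
    num_workers=2,            # workers are only allowed when device is not "cuda"
    iteration_progress=True,  # report per-iteration progress for large graphs
)
```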
The layout model has some parameters with default values:
```python
default_layout_config = {
    # Tensor of shape (n_nodes, ndim) for the initial positions
    "pos": None,
    # Optimal distance between nodes
    "k": None,
    # Dimension of the layout
    "ndim": 2,
    # Threshold for the relative error in node position changes
    "threshold": 1e-4,
}
```
Use the `layout_config` argument to change these parameters if needed. The example below provides initial positions for the layout model:
```python
import numpy as np

n_nodes = len(ds)
n_dim = 2
# Generate random initial positions for the nodes.
init_pos = np.random.rand(n_nodes, n_dim)
pos = torch_graph_force.spring_layout(
    ds,
    layout_config={"pos": init_pos},
)
```
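Other keys can be overridden the same way. For instance, a 3-dimensional layout using the documented `ndim` parameter:

```python
pos_3d = torch_graph_force.spring_layout(
    ds,
    layout_config={"ndim": 3},
)
# pos_3d is a numpy array of shape (n_nodes, 3).
```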
Even without GPU acceleration, `torch-graph-force` is about 1.5x faster than NetworkX's implementation. With GPU acceleration, `torch-graph-force` can compute layouts for graphs with 100k nodes within minutes. The benchmark was conducted on a Tesla P100 GPU.
Code for the benchmarks can be found here.
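To get a rough timing on your own hardware, a generic sketch (the graph size and the timing code here are illustrative, not the linked benchmark code):

```python
import time

import numpy as np
import pandas as pd
import torch_graph_force

# Build a random graph with 100k nodes and 1M edges.
n_nodes = 100_000
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(0, n_nodes, size=(1_000_000, 2)),
    columns=["source", "target"],
)
ds = torch_graph_force.from_pandas_dataframe(df, n_nodes)

start = time.perf_counter()
pos = torch_graph_force.spring_layout(ds)
print(f"Layout computed in {time.perf_counter() - start:.1f}s")
```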