We propose a novel approach to iterated sparse-matrix dense-matrix multiplication (SpMM), a fundamental computational kernel in scientific computing and graph neural network training. When matrix sizes exceed the memory of a single compute node, data transfer becomes a bottleneck. Approaches based on dense matrix multiplication algorithms scale sub-optimally and fail to exploit the sparsity in the problem. To address these challenges, we propose decomposing the sparse matrix into a small number of highly structured matrices, called arrow matrices, which are connected by permutations. Our approach enables communication-avoiding multiplications, achieving a polynomial reduction in communication volume per iteration for matrices corresponding to planar graphs and other minor-excluded families of graphs. Our evaluation demonstrates that our approach outperforms a state-of-the-art method for sparse matrix multiplication on matrices with hundreds of millions of rows, offering near-linear strong and weak scaling.
This project contains the code for the paper "Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication", Gianinazzi et al., PPoPP 2024.
- Scalable and Distributed Computing: with support for mpi4py and Cray-MPICH, the module is designed for scalability, distributing the computation across multiple nodes and GPUs.
- Efficient SpMM Operations: by integrating CSRMM kernels and leveraging GPU acceleration, the module offers efficient SpMM operations suited to large-scale scientific computing tasks.
- Advanced Decomposition Techniques: linear arrangement frameworks and pruning, coupled with the decomposition algorithm, improve performance and resource utilization in SpMM operations.
- Compatibility and Versatility: the implementation relies on widely used, well-supported libraries and frameworks, ensuring broad compatibility across computing environments and use cases.
The package can be installed with pip from the repository root:
pip install -e .
To enable GPU support, you additionally need to install CuPy.
For example:
pip install cupy-cuda11x
or:
pip install cupy-cuda12x
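To quickly check that CuPy can detect a GPU (a generic CuPy sanity check, not specific to this package), you can run:
python -c "import cupy; print(cupy.cuda.runtime.getDeviceCount())"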
To verify the installation, you can run the tests:
cd scripts
chmod +x run_tests.sh
./run_tests.sh
Using the arrow matrix SpMM requires two steps:
1. Decompose the matrix.
2. Perform the SpMM.
We provide two implementations of step 1, one in Python and one in Julia.
The Python implementation can be invoked via the arrow_decompose command-line entry point.
Example usage with MATLAB (.mat) input:
arrow_decompose --dataset_dir ~/data --dataset_name graph1 graph2 --format 'matlab' --width 10000
Example usage with Matrix Market (.mtx) input:
arrow_decompose --dataset_dir ~/data --dataset_name graph1 graph2 --format 'mtx' --width 10000
Options:
- For a directed graph, pass --directed True.
- To visualize the arrow matrices, pass --visualize True.
- Pass save_input_graph True to save the input graph in order to speed up later invocations of the script.
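For example, a hypothetical invocation combining these options with the decomposition call shown above (the dataset name is a placeholder):
arrow_decompose --dataset_dir ~/data --dataset_name graph1 --format 'matlab' --width 10000 --directed True --visualize True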
The Julia implementation can be invoked via the ArrowDecompositionMain.jl script. Its output must be converted to the npy format using the convert_to_csr.jl script.
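A sketch of the resulting two-step Julia workflow (the command-line arguments these scripts expect are not documented here; consult the scripts themselves):
julia ArrowDecompositionMain.jl   # compute the arrow matrix decomposition
julia convert_to_csr.jl           # convert its output to the npy format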
To multiply 10 times with the decomposed matrix on random right-hand sides, you can use the spmm_arrow command-line entry point:
mpiexec -n 8 spmm_arrow --path ./data/graph1_B --width 10000 --features 16 --device cpu --iterations 10
To use custom right-hand sides, you need to use the ArrowDecompositionMPI class directly, as defined in arrow_dec_mpi.py. For an example of how to use that class, refer to arrow_bench.py.
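The sketch below only indicates where a custom right-hand side would be supplied; the ArrowDecompositionMPI call is a placeholder with hypothetical method names, not the actual API, so follow arrow_bench.py for the real interface.

```python
# Sketch only: builds a custom dense right-hand side per MPI rank.
# The ArrowDecompositionMPI usage at the end is a placeholder (hypothetical
# method names), not the actual API from arrow_dec_mpi.py; see arrow_bench.py.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

features = 16
local_rows = 10000  # placeholder: number of rows owned by this rank

# Any float32 array of shape (local_rows, features) can replace the random
# right-hand side used by the spmm_arrow benchmark.
rhs_local = np.full((local_rows, features), float(rank), dtype=np.float32)

# Hypothetical call, shown only to indicate where the custom right-hand side
# would be plugged in:
# result = ArrowDecompositionMPI(...).spmm(rhs_local, iterations=10)
```

Such a driver would be launched with mpiexec, like the benchmark commands above.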
The arrow matrix decomposition-based kernel can be invoked via the spmm_arrow entry point. It requires that the matrix has already been decomposed and that the resulting arrow matrices are available in the specified directory.
Example usage:
mpiexec -n 8 ./scripts/spmm_arrow_main.py --path ./data/graph1_B --width 10000 --features 16 --device gpu --iterations 10
The 1.5D A-Stationary-based kernel can be invoked via the spmm_15d entry point.
Example usage:
mpiexec -n 8 ./scripts/spmm_15d_main.py --dataset file --file /path/to/matrix.mat --iterations 10 --device gpu
To run a benchmark with a random float32 sparse matrix having 100,000 vertices and 1,000,000 edges on a GPU:
mpiexec -n 8 ./scripts/spmm_15d_main.py --vertices 100000 --edges 1000000 --device gpu
The hypergraph partitioning-based kernel can be invoked via the spmm_petsc entry point.
Example usage:
python ./scripts/spmm_petsc_main.py --type float64 --file matrix.part.1.slice.2.npz --gpu-tiling True