
Why is Metis reordering necessary for the sptrsv example? #7

Closed
learning-chip opened this issue Dec 19, 2022 · 10 comments

Comments

@learning-chip
Contributor

learning-chip commented Dec 19, 2022

The SpTRSV_runtime.cpp example reorders the input matrix A before extracting its lower half:

/// Re-ordering matrix A
// std::cout << "METIS IS NOT ACTIVATED" << std::endl;
#ifdef METIS
std::cout << "METIS IS ACTIVATED" << std::endl;
// We only reorder A since dependency matters more in l-solve.
A = make_full(Lower_A_CSC);
delete Lower_A_CSC;
metis_perm_general(A, perm);
Lower_A_CSC = make_half(A->n, A->p, A->i, A->x);
CSC *Lt = transpose_symmetric(Lower_A_CSC, perm);
CSC *L1_ord = transpose_symmetric(Lt, NULLPNTR);
delete Lower_A_CSC;
Lower_A_CSC = L1_ord;
Lower_A_CSR = csc_to_csr(Lower_A_CSC);
delete Lt;
delete A;
delete[] perm;
#else

I wonder how the reordering benefits sparse triangular solve. In the ParSy and HDagg papers, METIS is only mentioned as a fill-in-reducing ordering, but that is for Cholesky factorization and not relevant for sptrsv. Disabling the reordering doesn't seem to affect performance in my experiments. It might have positive or negative impacts on data locality, but I didn't find any arguments in the papers.

@cheshmi
Collaborator

cheshmi commented Dec 19, 2022

METIS reordering is used to reduce fill-in in direct solvers.
For iterative solvers, especially when the preconditioner does not introduce fill-in, METIS ordering helps to improve thread-level parallelism and sometimes helps with stability (this varies per application).

It is interesting that enabling METIS does not help. I have mostly tested with METIS enabled. I should be able to test without METIS in the coming weeks/months to see how it performs.

@learning-chip
Contributor Author

when the preconditioner does not introduce fill-in, METIS ordering helps to improve thread-parallelism and sometimes help with stability

This is exactly my question -- is there a reference/document on this aspect? I don't see related discussions in Sympiler/ParSy/HDagg papers, except for fill-in reduction.

@cheshmi
Collaborator

cheshmi commented Dec 21, 2022

You can check this thesis:
l'Excellent, Jean-Yves. Multifrontal methods: parallelism, memory usage and numerical aspects. Diss. École normale supérieure de Lyon, 2012.

On page 15, it shows how reordering changes the tree and thus thread-level parallelism. There are other works that mention the effect of permutation on parallelism. A recent one is the sparse fusion paper; you may check its first figure.

I am curious to know about your application if possible. I will be happy to know how Sympiler helps. You may email me if it is easier.

@learning-chip
Contributor Author

learning-chip commented Dec 22, 2022

Thanks, a few more clarification questions to check my understanding --

In page 15, it shows how reordering changes the tree and thus thread level parallelism.

I understand the different elimination tree structures for complete factorization. For sptrsv, however, the matrix is lower/upper triangular, so its corresponding graph is a directed acyclic graph (DAG). Reordering such a DAG only changes the node labels; it has no effect on the DAG structure and its intrinsic data dependencies (this is the main reason I didn't see how METIS helps). However, it is also true that the (coarsened) level-set construction algorithm contains sequential heuristics, so the node order can indeed affect the partitioning results.

The Sparso paper (start of Section 3) also comments on how reordering might improve the data locality of SpMV and SpTRSV. But the theoretical grounding doesn't seem very solid (say, why is METIS ND ordering preferred over AMD/RCM?), and the potential benefit is really case-by-case (I can also construct cases where METIS slows down the code). It is unlike the sparse complete factorization case, where nested-dissection reordering gives theoretically near-optimal fill-in for 3D mesh topologies and is always much better than no reordering.

A recent one is the sparse fusion paper, you may check the first figure.

I like this idea of cross-kernel optimization! Is the fusion code open-source? It would really help PCG solver.

I am curious to know about your application if possible. I will be happy to know how Sympiler helps

I use it for a parallel Cholesky solver in circuit simulation. The matrix is a bit like the G3_circuit dataset, but much bigger. Because the matrix is very sparse, the blocked/supernodal version (calling BLAS-3) doesn't help much, but the locality-aware partitioning (ParSy and HDagg) is very effective. This is why I raised more issues in this aggregation repo than in the Sympiler repo.

@cheshmi
Collaborator

cheshmi commented Dec 22, 2022

Reordering such DAG will only change the node labels, but has no effect on the DAG structure and its intrinsic data dependency

For sptrsv, we reorder the matrix and not the DAG. After reordering the matrix, we create a new DAG. The solution of the reordered matrix is different and that is why the solution vector should be reordered as well. I agree that reordering DAG does not make sense.

But the theoretical grounding doesn't seem very solid (say, why is Metis ND ordering preferred over AMD/RCM?), and the potential benefit is really case-by-case (I can also construct cases where Metis slows down the code).

Each ordering has a specific property, and I agree that each ordering should be discussed independently. The thesis that I sent shares some insights. But again, these are heuristics and there is no guarantee.
If you count the number of wavefronts before and after METIS reordering, you will see how it makes a difference theoretically (fewer wavefronts mean fewer synchronizations). But in practice it is implementation dependent.
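To make this concrete, here is a small sketch (plain Python, not the HDagg/ParSy implementation) that counts the wavefronts of a triangular system before and after a reordering. The red-black permutation of a 1D chain is only an illustrative stand-in for what a fill-reducing ordering like METIS may do to the level structure:

```python
import numpy as np
import scipy.sparse as sp

def n_wavefronts(L):
    """Number of level sets (wavefronts) in the dependency DAG of a
    lower-triangular matrix: level[i] is one more than the maximum
    level among the unknowns that row i depends on."""
    L = sp.csr_matrix(L)
    level = np.zeros(L.shape[0], dtype=int)
    for i in range(L.shape[0]):
        deps = L.indices[L.indptr[i]:L.indptr[i + 1]]
        deps = deps[deps != i]  # ignore the diagonal entry
        if len(deps) > 0:
            level[i] = level[deps].max() + 1
    return int(level.max()) + 1

# 1D chain (tridiagonal A): the natural order is fully sequential
n = 6
A = (sp.diags([2.0] * n) + sp.diags([-1.0] * (n - 1), -1)
     + sp.diags([-1.0] * (n - 1), 1)).tocsr()
print(n_wavefronts(sp.tril(A).tocsr()))  # 6: one unknown per wavefront

perm = np.r_[np.arange(0, n, 2), np.arange(1, n, 2)]  # red-black: evens, then odds
A_rb = A[perm, :][:, perm]  # symmetric reordering
print(n_wavefronts(sp.tril(A_rb).tocsr()))  # 2: far fewer synchronizations
```

Fewer wavefronts bound the number of barriers in a level-set schedule, which is the theoretical benefit; whether it pays off in wall-clock time is, as said, implementation dependent.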

I use it for parallel cholesky solver in circuit simulation. The matrix is a bit like the G3_circuit dataset,

makes sense. How many cores does your processor have?

@learning-chip
Contributor Author

learning-chip commented Dec 23, 2022

For sptrsv, we reorder the matrix and not the DAG. After reordering the matrix, we create a new DAG. The solution of the reordered matrix is different and that is why the solution vector should be reordered as well.

Indeed, my previous understanding (reordering the DAG) was wrong. But I am pretty sure that reordering the matrix also changes the mathematical behavior: the solution of the reordered triangular system is mathematically different from the original one, and cannot be recovered by just "reordering back".

For example, take a simple matrix from the discretization of the 1D Poisson equation:

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg  # makes sp.linalg available on all SciPy versions
import networkx as nx
import matplotlib.pyplot as plt

A = (
    sp.diags([2.0] * 4) + 
    sp.diags([-1.0] * 3, offsets=-1) + 
    sp.diags([-1.0] * 3, offsets=1) 
    ).tocsr()
print(A.todense()) 
# [[ 2. -1.  0.  0.]
# [-1.  2. -1.  0.]
# [ 0. -1.  2. -1.]
# [ 0.  0. -1.  2.]]

L = sp.tril(A).tocsr()  # extract lower half
print(L.todense())
# [[ 2.  0.  0.  0.]
# [-1.  2.  0.  0.]
# [ 0. -1.  2.  0.]
# [ 0.  0. -1.  2.]]

perm = [0, 2, 1, 3]
A_perm = A[perm,:][:,perm]  # symmetric reordering
L_perm = sp.tril(A_perm).tocsr()
print(L_perm.todense())
# [[ 2.  0.  0.  0.]
# [ 0.  2.  0.  0.]
# [-1. -1.  2.  0.]
# [ 0. -1.  0.  2.]]

b = np.ones(4)  # invariant to reordering
sptrsv = sp.linalg.spsolve_triangular
print(sptrsv(L, b))  # [0.5    0.75   0.875  0.9375]
print(sptrsv(L_perm, b))  # [0.5  0.5  1.   0.75], different from reordering the above

# plot dependency DAG
def plot_dag(L):
    g_L = nx.from_scipy_sparse_array(L, create_using=nx.DiGraph)
    g_L.remove_edges_from(nx.selfloop_edges(g_L))
    plt.figure(figsize=(3, 2))  # new figure per call, so the two DAGs don't overlap
    nx.draw(
        g_L, 
        pos=nx.layout.circular_layout(g_L),
        arrows=True,
        with_labels=True
    )
plot_dag(L)
plot_dag(L_perm)

The original DAG is strictly sequential, while the reordered one has more parallelism (x0 and x1 can be solved together):

[figures: dependency DAGs of the original L (a 4-node chain) and the reordered L_perm]

From the 1D grid perspective, the original sptrsv is the forward Gauss–Seidel relaxation, while the reordered sptrsv is the red-black Gauss–Seidel variant. Both converge for SPD matrices, but at different convergence rates; see Figure 2.3 and Table 4.2 of the Multigrid Tutorial. RBGS has better parallelism, but may converge faster or slower depending on the problem.

So I think users should really be aware of this different mathematical behavior and use METIS preprocessing with caution. When used in a preconditioner it is less of an issue (it only changes the convergence rate of ILU or GS preconditioners), but when used in a direct solve the result will simply be wrong (and can't be corrected by reordering back!). Anyway, in the direct-solve case the L factor will naturally have an elimination tree structure with sufficient parallelism (obtained by reordering matrix A before factorizing it), so it makes no sense to reorder L again.

How many cores does your processor have?

It's a dual-socket Intel CPU with 18x2=36 cores. There's a fairly linear speed-up from 1 to 18 cores, but little improvement from 18 to 36 cores, maybe due to data movement between NUMA domains.

@cheshmi
Collaborator

cheshmi commented Dec 23, 2022

Thanks for the illustrations.
I am not sure a triangular solver is ever used standalone; it is almost always within an iterative or a direct solver (I am curious to know if you are aware of any exception). The linear systems Ax=b and PAx=Pb are mathematically equivalent (P is the permutation matrix). The triangular factors that result from solving the reordered system are themselves reordered. So practically, if we want to evaluate the triangular solver fairly, we should work with the reordered one.
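A quick NumPy check of this equivalence (a dense sketch for illustration only; the random matrix and permutation here are mine, not from Sympiler):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.random((n, n)) + n * np.eye(n)  # diagonally dominant, hence nonsingular
b = rng.random(n)
x = np.linalg.solve(A, b)

perm = rng.permutation(n)
P = np.eye(n)[perm]  # permutation matrix: (P @ v)[i] = v[perm[i]]

# row permutation PAx = Pb: the same solution x, nothing to undo
x_rows = np.linalg.solve(P @ A, P @ b)
print(np.allclose(x, x_rows))  # True

# symmetric permutation (P A P^T) y = P b: solve, then reorder the solution back
y = np.linalg.solve(P @ A @ P.T, P @ b)
print(np.allclose(x, P.T @ y))  # True
```

The second case is the one relevant to Sympiler's pipeline: the system is permuted symmetrically before factorization, so the right-hand side and the computed solution must both be permuted accordingly.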

If you have a standalone triangular matrix, reordering it does not make sense, because reordering generally turns the lower triangular matrix into a general matrix, which is very expensive to solve.

Overall, any reordering changes the solution due to rounding errors, but I don't think you meant this. This rounding error certainly affects the convergence rate as well (a random effect).

@learning-chip
Contributor Author

learning-chip commented Dec 23, 2022

Thanks, I think we agree on most parts now; just a few clarifications:

Linear systems of Ax=b and PAx=Pb are mathematically equivalent (P is the permutation matrix).

This is totally right. What confused me about the original SpTRSV_runtime.cpp example is that it reads an input matrix A, optionally reorders it, and then extracts its lower triangular part. Such reordering does not change the equation to PAx=Pb; instead, it solves two different equations, tril(A)x=b and tril(PAP^T)x=Pb. This non-trivial difference exactly corresponds to the sequential vs. red-black Gauss-Seidel example I showed before.
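The mismatch is easy to check directly (a tiny sketch reusing the 1D Poisson matrix from my earlier comment):

```python
import numpy as np
import scipy.sparse as sp

# the same 1D Poisson matrix as in my earlier example
A = (sp.diags([2.0] * 4) + sp.diags([-1.0] * 3, -1)
     + sp.diags([-1.0] * 3, 1)).tocsr()
perm = np.array([0, 2, 1, 3])
P = sp.eye(4, format="csr")[perm]  # permutation matrix as sparse rows

# extracting the lower half after permuting ...
tril_of_perm = sp.tril(P @ A @ P.T)
# ... is not the same operator as permuting the lower half of A
perm_of_tril = P @ sp.tril(A) @ P.T

print((tril_of_perm - perm_of_tril).nnz)  # nonzero, so the two systems differ
```

In other words, tril() and symmetric permutation do not commute, which is exactly why tril(A)x=b and tril(PAP^T)x=Pb are different equations.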

So practically, if we want to evaluate the triangular solver fairly, we should work with the reordered one.

Right, this makes sense for ILU0/IC0 preconditioners that use the same pattern as the original A, and also for complete factorizations that add fill-in. The SpTRSV_runtime.cpp example code is correct in demonstrating the degree of parallelism.

But if you mathematically compare the convergence of ILU0(A) and ILU0(PA), the convergence rates can be very different -- this is a well-known theoretical result for ILU0; see the paper "Ordering strategies and related techniques to overcome the trade-off between parallelism and convergence in incomplete factorizations". Often the natural ordering has low parallelism but the best convergence, while nested-dissection reordering has high parallelism but worse convergence (hence the "trade-off" in the title). The paper "The effect of ordering on preconditioned conjugate gradients" also has some earlier results.

any reordering changes the solution due to rounding errors

Right, this is a quite minor effect. Non-associative floating-point operations will also lead to some difference between sequential and parallel runs, but not visibly enough to impact real applications.

@cheshmi
Collaborator

cheshmi commented Dec 24, 2022

Thanks for the discussion.

Such non-trivial difference exactly corresponds to the sequential vs red-black Gauss-Seidel example #7 (comment).

I agree that reordering has a non-trivial effect for GS and similar cases.

But if you mathematically compare the convergence of ILU0(A) and ILU0(PA), the convergence rate can be much different

This is also true. I have seen the effect of reordering on convergence for some matrices that I was playing with in a simulation application. This tells me that for a fair performance comparison, both the reordered and the original ordering should be reported. Nice papers!

@learning-chip
Contributor Author

👍 No further issues, closing.
