
Why is Metis reordering necessary for the sptrsv example? #7

Closed
learning-chip opened this issue Dec 19, 2022 · 10 comments

Comments

@learning-chip
Contributor

learning-chip commented Dec 19, 2022

The SpTRSV_runtime.cpp example reorders the input matrix A before extracting its lower half:

/// Re-ordering matrix A
// std::cout << "METIS IS NOT ACTIVATED" << std::endl;
#ifdef METIS
std::cout << "METIS IS ACTIVATED" << std::endl;
// We only reorder A since dependency matters more in l-solve.
A = make_full(Lower_A_CSC);
delete Lower_A_CSC;
metis_perm_general(A, perm);
Lower_A_CSC = make_half(A->n, A->p, A->i, A->x);
CSC *Lt = transpose_symmetric(Lower_A_CSC, perm);
CSC *L1_ord = transpose_symmetric(Lt, NULLPNTR);
delete Lower_A_CSC;
Lower_A_CSC = L1_ord;
Lower_A_CSR = csc_to_csr(Lower_A_CSC);
delete Lt;
delete A;
delete[] perm;
#else

I wonder how the reordering benefits sparse triangular solve. In the ParSy and HDagg papers, METIS is only mentioned as a fill-in-reducing ordering, but that is for Cholesky factorization and not relevant for sptrsv. Disabling the reordering doesn't seem to affect performance in my experiments. It might have positive or negative impacts on data locality, but I didn't find any arguments in the papers.

@cheshmi
Collaborator

cheshmi commented Dec 19, 2022

METIS reordering is used to reduce fill-in in direct solvers.
For iterative solvers, especially when the preconditioner does not introduce fill-in, METIS ordering helps to improve thread-level parallelism and sometimes helps with stability (this varies per application).

It is interesting that enabling METIS does not help. I have mostly tested with METIS enabled. I should be able to test without METIS in the coming weeks/months to see how it performs.

@learning-chip
Contributor Author

when the preconditioner does not introduce fill-in, METIS ordering helps to improve thread-parallelism and sometimes help with stability

This is exactly my question -- is there a reference/document on this aspect? I don't see related discussions in Sympiler/ParSy/HDagg papers, except for fill-in reduction.

@cheshmi
Collaborator

cheshmi commented Dec 21, 2022

You can check this thesis:
l'Excellent, Jean-Yves. Multifrontal methods: parallelism, memory usage and numerical aspects. Diss. École normale supérieure de Lyon, 2012.

On page 15, it shows how reordering changes the tree and thus thread-level parallelism. There are other works that mention the effect of permutation on parallelism. A recent one is the sparse fusion paper; you may check its first figure.

I am curious to know about your application if possible. I will be happy to know how Sympiler helps. You may email me if it is easier.

@learning-chip
Contributor Author

learning-chip commented Dec 22, 2022

Thanks, a few more clarification questions to check my understanding --

In page 15, it shows how reordering changes the tree and thus thread level parallelism.

I understand the different elimination tree structures for complete factorization. For sptrsv, however, the matrix is lower/upper triangular, so its corresponding graph is a directed acyclic graph (DAG). Reordering such a DAG only changes the node labels; it has no effect on the DAG structure and its intrinsic data dependencies (this is the main reason I didn't see how METIS helps). However, it is also true that the (coarsened) level-set construction algorithm contains sequential heuristics, so the node order can indeed affect the partitioning results.

The Sparso paper (start of Section 3) also comments on how reordering might improve the data locality of SpMV and SpTRSV. But the theoretical grounding doesn't seem very solid (say, why is METIS ND ordering preferred over AMD/RCM?), and the potential benefit is really case-by-case (I can also construct cases where METIS slows down the code). It is unlike the sparse complete factorization case, where nested-dissection reordering gives theoretically near-optimal fill-in for 3D mesh topologies and is always much better than no reordering.

A recent one is the sparse fusion paper, you may check the first figure.

I like this idea of cross-kernel optimization! Is the fusion code open-source? It would really help PCG solver.

I am curious to know about your application if possible. I will be happy to know how Sympiler helps

I use it for a parallel Cholesky solver in circuit simulation. The matrix is a bit like the G3_circuit dataset, but much bigger. Because the matrix is very sparse, the blocked/supernodal version (calling BLAS-3) doesn't help much, but the locality-aware partitioning (ParSy and HDagg) is very effective. This is why I raised more issues in this aggregation repo than in the Sympiler repo.

@cheshmi
Collaborator

cheshmi commented Dec 22, 2022

Reordering such DAG will only change the node labels, but has no effect on the DAG structure and its intrinsic data dependency

For sptrsv, we reorder the matrix and not the DAG. After reordering the matrix, we create a new DAG. The solution of the reordered matrix is different and that is why the solution vector should be reordered as well. I agree that reordering DAG does not make sense.

But the theoretical grounding doesn't seem very solid (say, why is Metis ND ordering preferred over AMD/RCM?), and the potential benefit is really case-by-case (I can also construct cases where Metis slows down the code).

Each ordering has a specific property, and I agree that each ordering should be discussed independently. The thesis that I sent shares some insights. But again, these are heuristics and there is no guarantee.
If you count the number of wavefronts before and after METIS reordering, you will see how it makes a difference theoretically (fewer wavefronts mean fewer synchronizations). But in practice it is implementation dependent.
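To make this concrete, here is a small sketch (plain Python, not the HDagg/ParSy implementation) that counts the wavefronts of a triangular system before and after a reordering. The red-black permutation of a 1D chain is only an illustrative stand-in for what a fill-reducing ordering like METIS may do to the level structure:

```python
import numpy as np
import scipy.sparse as sp

def n_wavefronts(L):
    """Number of level sets (wavefronts) in the dependency DAG of a
    lower-triangular matrix: level[i] is one more than the maximum
    level among the unknowns that row i depends on."""
    L = sp.csr_matrix(L)
    level = np.zeros(L.shape[0], dtype=int)
    for i in range(L.shape[0]):
        deps = L.indices[L.indptr[i]:L.indptr[i + 1]]
        deps = deps[deps != i]  # ignore the diagonal entry
        if len(deps) > 0:
            level[i] = level[deps].max() + 1
    return int(level.max()) + 1

# 1D chain (tridiagonal A): the natural order is fully sequential
n = 6
A = (sp.diags([2.0] * n) + sp.diags([-1.0] * (n - 1), -1)
     + sp.diags([-1.0] * (n - 1), 1)).tocsr()
print(n_wavefronts(sp.tril(A).tocsr()))  # 6: one unknown per wavefront

perm = np.r_[np.arange(0, n, 2), np.arange(1, n, 2)]  # red-black: evens, then odds
A_rb = A[perm, :][:, perm]  # symmetric reordering
print(n_wavefronts(sp.tril(A_rb).tocsr()))  # 2: far fewer synchronizations
```

Fewer wavefronts bound the number of barriers in a level-set schedule, which is the theoretical benefit; whether it pays off in wall-clock time is, as said, implementation dependent.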

I use it for parallel cholesky solver in circuit simulation. The matrix is a bit like the G3_circuit dataset,

makes sense. How many cores does your processor have?

@learning-chip
Contributor Author

learning-chip commented Dec 23, 2022

For sptrsv, we reorder the matrix and not the DAG. After reordering the matrix, we create a new DAG. The solution of the reordered matrix is different and that is why the solution vector should be reordered as well.

Indeed, my previous understanding (reordering the DAG) was wrong. But I am pretty sure that reordering the matrix also changes the mathematical behavior: the solution of the reordered triangular system is mathematically different from the original one, and cannot be recovered by just "reordering back".

For example, take a simple matrix from the discretization of the 1D Poisson equation:

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg  # makes sp.linalg available on all SciPy versions
import networkx as nx
import matplotlib.pyplot as plt

A = (
    sp.diags([2.0] * 4) + 
    sp.diags([-1.0] * 3, offsets=-1) + 
    sp.diags([-1.0] * 3, offsets=1) 
    ).tocsr()
print(A.todense()) 
# [[ 2. -1.  0.  0.]
# [-1.  2. -1.  0.]
# [ 0. -1.  2. -1.]
# [ 0.  0. -1.  2.]]

L = sp.tril(A).tocsr()  # extract lower half
print(L.todense())
# [[ 2.  0.  0.  0.]
# [-1.  2.  0.  0.]
# [ 0. -1.  2.  0.]
# [ 0.  0. -1.  2.]]

perm = [0, 2, 1, 3]
A_perm = A[perm,:][:,perm]  # symmetric reordering
L_perm = sp.tril(A_perm).tocsr()
print(L_perm.todense())
# [[ 2.  0.  0.  0.]
# [ 0.  2.  0.  0.]
# [-1. -1.  2.  0.]
# [ 0. -1.  0.  2.]]

b = np.ones(4)  # invariant to reordering
sptrsv = sp.linalg.spsolve_triangular
print(sptrsv(L, b))  # [0.5    0.75   0.875  0.9375]
print(sptrsv(L_perm, b))  # [0.5  0.5  1.   0.75], different from reordering the above

# plot dependency DAG
def plot_dag(L):
    g_L = nx.from_scipy_sparse_array(L, create_using=nx.DiGraph)
    g_L.remove_edges_from(nx.selfloop_edges(g_L))
    plt.figure(figsize=(3, 2))  # new figure per call, so the two DAGs don't overlap
    nx.draw(
        g_L, 
        pos=nx.layout.circular_layout(g_L),
        arrows=True,
        with_labels=True
    )
plot_dag(L)
plot_dag(L_perm)

The original DAG is strictly sequential, while the reordered one has more parallelism (x0 and x1 can be solved together):

[figures: dependency DAGs of the original L (a 4-node chain) and the reordered L_perm]

From the 1D grid perspective, the original sptrsv is the forward Gauss–Seidel relaxation, while the reordered sptrsv is the red-black Gauss–Seidel variant. Both converge for SPD matrices, but at different convergence rates; see Figure 2.3 and Table 4.2 of the Multigrid Tutorial. RBGS has better parallelism, but may converge faster or slower depending on the problem.

So I think users should really be aware of this different mathematical behavior and use METIS preprocessing with caution. When used in a preconditioner it is less of an issue (it only changes the convergence rate of ILU or GS preconditioners), but when used in a direct solve the result will simply be wrong (and can't be corrected by reordering back!). Anyway, in the direct-solve case the L factor will naturally have an elimination tree structure with sufficient parallelism (obtained by reordering matrix A before factorizing it), so it makes no sense to reorder L again.

How many cores does your processor have?

It's a dual-socket Intel CPU with 18x2=36 cores. There's a fairly linear speed-up from 1 to 18 cores, but little improvement from 18 to 36 cores, maybe due to data movement between NUMA domains.

@cheshmi
Collaborator

cheshmi commented Dec 23, 2022

Thanks for the illustrations.
I am not sure a triangular solver is ever used standalone; it is almost always within an iterative or a direct solver (I am curious to know if you are aware of any exception). The linear systems Ax=b and PAx=Pb are mathematically equivalent (P is the permutation matrix). The triangular factors that result from solving the reordered system are themselves reordered. So practically, if we want to evaluate the triangular solver fairly, we should work with the reordered one.
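A quick NumPy check of this equivalence (a dense sketch for illustration only; the random matrix and permutation here are mine, not from Sympiler):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.random((n, n)) + n * np.eye(n)  # diagonally dominant, hence nonsingular
b = rng.random(n)
x = np.linalg.solve(A, b)

perm = rng.permutation(n)
P = np.eye(n)[perm]  # permutation matrix: (P @ v)[i] = v[perm[i]]

# row permutation PAx = Pb: the same solution x, nothing to undo
x_rows = np.linalg.solve(P @ A, P @ b)
print(np.allclose(x, x_rows))  # True

# symmetric permutation (P A P^T) y = P b: solve, then reorder the solution back
y = np.linalg.solve(P @ A @ P.T, P @ b)
print(np.allclose(x, P.T @ y))  # True
```

The second case is the one relevant to Sympiler's pipeline: the system is permuted symmetrically before factorization, so the right-hand side and the computed solution must both be permuted accordingly.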

If you have a standalone triangular matrix, reordering it does not make sense, because reordering generally turns the lower triangular matrix into a general matrix, which is very expensive to solve.

Overall, any reordering changes the solution due to rounding errors, but I don't think you meant this. This rounding error certainly affects the convergence rate as well (a random effect).

@learning-chip
Contributor Author

learning-chip commented Dec 23, 2022

Thanks, I think we agree on most parts now; just a few clarifications:

Linear systems of Ax=b and PAx=Pb are mathematically equivalent (P is the permutation matrix).

This is totally right. What confused me about the original SpTRSV_runtime.cpp example is that it reads an input matrix A, optionally reorders it, and then extracts its lower triangular part. Such reordering does not change the equation to PAx=Pb; instead, it solves two different equations, tril(A)x=b and tril(PAP^T)x=Pb. This non-trivial difference exactly corresponds to the sequential vs. red-black Gauss-Seidel example I showed before.
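The mismatch is easy to check directly (a tiny sketch reusing the 1D Poisson matrix from my earlier comment):

```python
import numpy as np
import scipy.sparse as sp

# the same 1D Poisson matrix as in my earlier example
A = (sp.diags([2.0] * 4) + sp.diags([-1.0] * 3, -1)
     + sp.diags([-1.0] * 3, 1)).tocsr()
perm = np.array([0, 2, 1, 3])
P = sp.eye(4, format="csr")[perm]  # permutation matrix as sparse rows

# extracting the lower half after permuting ...
tril_of_perm = sp.tril(P @ A @ P.T)
# ... is not the same operator as permuting the lower half of A
perm_of_tril = P @ sp.tril(A) @ P.T

print((tril_of_perm - perm_of_tril).nnz)  # nonzero, so the two systems differ
```

In other words, tril() and symmetric permutation do not commute, which is exactly why tril(A)x=b and tril(PAP^T)x=Pb are different equations.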

So practically, if we want to evaluate the triangular solver fairly, we should work with the reordered one.

Right, this makes sense for ILU0/IC0 preconditioners that use the same pattern as the original A, and also for complete factorizations that add fill-in. The SpTRSV_runtime.cpp example code is correct in demonstrating the degree of parallelism.

But if you mathematically compare the convergence of ILU0(A) and ILU0(PA), the convergence rates can be very different -- this is a well-known theoretical result for ILU0; see the paper "Ordering strategies and related techniques to overcome the trade-off between parallelism and convergence in incomplete factorizations". Often the natural ordering has low parallelism but the best convergence, while nested-dissection reordering has high parallelism but worse convergence (hence the "trade-off" in the title). The paper "The effect of ordering on preconditioned conjugate gradients" also has some earlier results.

any reordering changes the solution due to rounding errors

Right, this is a quite minor effect. Non-associative floating-point operations will also lead to some difference between sequential and parallel runs, but not visibly enough to impact real applications.

@cheshmi
Collaborator

cheshmi commented Dec 24, 2022

Thanks for the discussion.

Such non-trivial difference exactly corresponds to the sequential vs red-black Gauss-Seidel example #7 (comment).

I agree that reordering has a non-trivial effect for GS and similar cases.

But if you mathematically compare the convergence of ILU0(A) and ILU0(PA), the convergence rate can be much different

This is also true. I have seen the effect of reordering on convergence for some matrices that I was playing with in a simulation application. This tells me that for a fair performance comparison, both the reordered and the original ordering should be reported. Nice papers!

@learning-chip
Contributor Author

👍 No further issues, closing.
