Why is Metis reordering necessary for the sptrsv example? #7
METIS reordering is used to reduce fill-in in direct solvers. It is interesting that enabling METIS does not help. I have mostly tested with METIS enabled. I should be able to test without METIS in the coming weeks/months to see how it performs.
This is exactly my question -- is there a reference/document on this aspect? I don't see related discussions in the Sympiler/ParSy/HDagg papers, except for fill-in reduction.
You can check this thesis: on page 15, it shows how reordering changes the tree and thus thread-level parallelism. There are other works that mention the effect of permutation on parallelism; a recent one is the sparse fusion paper -- you may check its first figure. I am curious to know about your application if possible. I will be happy to know how Sympiler helps. You may email me if that is easier.
Thanks, a few more clarification questions to check my understanding --
I understand the different elimination tree structures for complete factorization. For sptrsv, however, the matrix is lower/upper triangular, so its corresponding graph is a directed acyclic graph. Reordering such a DAG only changes the node labels, but has no effect on the DAG structure and its intrinsic data dependencies (this is the main reason why I didn't see how METIS helps). However, it is also true that the (coarsened) levelset construction algorithm contains sequential heuristics, so the node order can indeed affect the partitioning results.

The Sparso paper (start of section 3) also comments on how reordering might improve the data locality of SpMV and SpTRSV. But the theoretical grounding doesn't seem very solid (say, why is the METIS ND ordering preferred over AMD/RCM?), and the potential benefit is really case-by-case (I can also construct cases where METIS slows down the code). This is unlike the sparse complete factorization case, where nested-dissection reordering gives theoretically near-optimal fill-in for 3D mesh topologies, and is always much better than no reordering.
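The point that relabeling alone cannot create parallelism can be checked mechanically. A small sketch with networkx (the chain DAG here is a stand-in for the dependency graph of a bidiagonal L; the variable names are mine, not from any library):

```python
import networkx as nx

# Chain DAG: node i must finish before node i+1 (sptrsv on a bidiagonal L).
g = nx.DiGraph([(0, 1), (1, 2), (2, 3)])

# "Reordering the DAG" only permutes the node labels.
relabeled = nx.relabel_nodes(g, {0: 0, 1: 2, 2: 1, 3: 3})

# Structure is unchanged: the graphs are isomorphic and the critical path
# (longest dependency chain) still has length 3, so no extra parallelism.
assert nx.is_isomorphic(g, relabeled)
assert nx.dag_longest_path_length(g) == nx.dag_longest_path_length(relabeled) == 3
```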
I like this idea of cross-kernel optimization! Is the fusion code open-source? It would really help a PCG solver.
I use it for a parallel Cholesky solver in circuit simulation. The matrix is a bit like the G3_circuit dataset, but much bigger. Because the matrix is very sparse, the blocked/supernodal version (calling BLAS-3) doesn't help much, but the locality-aware partitioning (ParSy and HDagg) is very effective. This is why I raised more issues in this aggregation repo than in the Sympiler repo.
For sptrsv, we reorder the matrix, not the DAG. After reordering the matrix, we create a new DAG. The solution of the reordered matrix is different, and that is why the solution vector should be reordered as well. I agree that reordering the DAG does not make sense.
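For a full (non-triangular) solve, this recovery by reordering the solution vector is easy to verify. A minimal scipy sketch, using a small SPD matrix from the 1D Poisson discretization (matrix and permutation chosen for illustration):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# 4x4 SPD matrix from the 1D Poisson discretization.
A = sp.diags([[-1.0] * 3, [2.0] * 4, [-1.0] * 3], offsets=[-1, 0, 1]).tocsc()
b = np.array([1.0, 0.0, 0.0, 1.0])
perm = np.array([0, 2, 1, 3])

x = spla.spsolve(A, b)

# Solve the symmetrically reordered system (P A P^T) y = P b ...
A_perm = A[perm, :][:, perm]
y = spla.spsolve(A_perm, b[perm])

# ... and permute the solution back: x = P^T y.
x_back = np.empty_like(y)
x_back[perm] = y
assert np.allclose(x, x_back)  # for a full solve, reordering is fully recoverable
```

This is exactly why the reordered direct solve stays correct: the permutation is applied consistently to the matrix, the right-hand side, and the solution vector.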
Each ordering has a specific property, and I agree that each ordering should be discussed independently. The thesis that I sent shares some insights. But again, these are heuristics and there is no guarantee.
Makes sense. How many cores does your processor have?
Indeed, my previous understanding (reordering the DAG) was wrong. But I am pretty sure that reordering the matrix also changes the mathematical behavior. The solution of the reordered matrix is mathematically different from the original one, and cannot be recovered by just "reordering back". For example, take a simple matrix from the discretization of the 1D Poisson equation:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg  # needed for sp.linalg.spsolve_triangular
import networkx as nx
import matplotlib.pyplot as plt

A = (
    sp.diags([2.0] * 4) +
    sp.diags([-1.0] * 3, offsets=-1) +
    sp.diags([-1.0] * 3, offsets=1)
).tocsr()
print(A.todense())
# [[ 2. -1.  0.  0.]
#  [-1.  2. -1.  0.]
#  [ 0. -1.  2. -1.]
#  [ 0.  0. -1.  2.]]

L = sp.tril(A).tocsr()  # extract lower half
print(L.todense())
# [[ 2.  0.  0.  0.]
#  [-1.  2.  0.  0.]
#  [ 0. -1.  2.  0.]
#  [ 0.  0. -1.  2.]]

perm = [0, 2, 1, 3]
A_perm = A[perm, :][:, perm]  # symmetric reordering
L_perm = sp.tril(A_perm).tocsr()
print(L_perm.todense())
# [[ 2.  0.  0.  0.]
#  [ 0.  2.  0.  0.]
#  [-1. -1.  2.  0.]
#  [ 0. -1.  0.  2.]]

b = np.ones(4)  # invariant to reordering
sptrsv = sp.linalg.spsolve_triangular
print(sptrsv(L, b))       # [0.5 0.75 0.875 0.9375]
print(sptrsv(L_perm, b))  # [0.5 0.5 1. 0.75], different from reordering the above

# plot the dependency DAG
def plot_dag(L):
    g_L = nx.from_scipy_sparse_array(L, create_using=nx.DiGraph)
    g_L.remove_edges_from(nx.selfloop_edges(g_L))
    plt.figure(figsize=(3, 2))  # a new figure per DAG
    nx.draw(
        g_L,
        pos=nx.layout.circular_layout(g_L),
        arrows=True,
        with_labels=True
    )

plot_dag(L)
plot_dag(L_perm)
```

The original DAG is strictly sequential, while the reordered one has more parallelism (the first two rows of `L_perm` have no off-diagonal entries, so nodes 0 and 1 can be solved simultaneously). From the 1D grid perspective, the original sptrsv is the forward Gauss–Seidel relaxation, while the reordered sptrsv is the red-black Gauss–Seidel variant. They both converge for an SPD matrix, but at different convergence rates; see Figure 2.3 and Table 4.2 of the Multigrid Tutorial. The RBGS has better parallelism, but might converge faster or slower depending on the problem.

So I think users should really be aware of this different mathematical behavior, and use the Metis preprocessing with caution. When used as a preconditioner it is less of an issue (it only changes the convergence rate of ILU or GS preconditioners), but when used as a direct solve the result will simply be wrong (it can't be corrected by reordering back!). Anyways, in the case of a direct solve, the L factor will naturally have an elimination tree structure with sufficient parallelism (generated by reordering matrix A before factorizing it), so it makes no sense to reorder L again.
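The parallelism difference can be quantified with a level-set computation on the same two matrices (`level_sets` is a hypothetical helper written for this illustration, not the aggregation library's API; the block rebuilds the 4x4 matrices from the snippet above):

```python
import numpy as np
import scipy.sparse as sp

def level_sets(L):
    """Assign each row of a lower triangular CSR matrix a level:
    level(i) = 1 + max(level(j)) over off-diagonal entries L[i, j]."""
    L = sp.csr_matrix(L)
    level = np.zeros(L.shape[0], dtype=int)
    for i in range(L.shape[0]):
        cols = L.indices[L.indptr[i]:L.indptr[i + 1]]
        deps = cols[cols != i]  # drop the diagonal entry
        if deps.size:
            level[i] = level[deps].max() + 1
    return level

A = (sp.diags([2.0] * 4) + sp.diags([-1.0] * 3, -1) + sp.diags([-1.0] * 3, 1)).tocsr()
L = sp.tril(A).tocsr()
perm = [0, 2, 1, 3]
L_perm = sp.tril(A[perm, :][:, perm]).tocsr()

print(level_sets(L))       # [0 1 2 3] -> 4 levels, fully sequential
print(level_sets(L_perm))  # [0 0 1 1] -> 2 levels, 2-way parallelism per level
```

Rows within one level have no mutual dependencies, so each level can be solved in parallel; the number of levels is the length of the critical path.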
It's a dual-socket Intel CPU with 18x2=36 cores. There's a nearly linear speed-up from 1->18 cores, but little improvement from 18->36 cores, maybe due to data movement between the NUMA domains.
Thanks for the illustrations. If you have a standalone triangular matrix, reordering it does not make sense, because the reordering often turns the lower triangular matrix into a general matrix, which is very expensive to solve. Overall, any reordering changes the solution due to rounding errors, but I don't think you meant this. This rounding error certainly affects the convergence rate as well (a random effect).
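That point is easy to check directly: a symmetric permutation of a triangular matrix generally destroys triangularity. A minimal sketch using the bidiagonal factor of the same 1D chain:

```python
import numpy as np
import scipy.sparse as sp

# Lower bidiagonal factor of the 1D Poisson chain.
L = sp.diags([[2.0] * 4, [-1.0] * 3], offsets=[0, -1]).tocsr()
perm = [0, 2, 1, 3]

# Symmetrically permuting the triangular matrix...
L_reordered = L[perm, :][:, perm].toarray()
print(L_reordered)

# ...puts a nonzero above the diagonal (entry (1, 2)), so the result is a
# general matrix and a triangular solve no longer applies.
assert np.any(np.triu(L_reordered, k=1) != 0)
```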
Thanks, I think we agree on most parts now; just a few clarifications:
This is totally right. What confused me in the original SpTRSV_runtime.cpp example is that it reads an input matrix A, optionally reorders it, and then extracts the lower triangular part. Such a reordering does not merely relabel the original triangular system: the lower half of the reordered A is a different matrix, so a different equation is being solved (which is what my snippet above demonstrates).
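This distinction can be made concrete: extracting the lower half after reordering (tril(PAPᵀ)) differs from relabeling the original triangular system (P·tril(A)·Pᵀ). A sketch with the 4x4 Poisson matrix used earlier in the thread:

```python
import numpy as np
import scipy.sparse as sp

A = (sp.diags([2.0] * 4) + sp.diags([-1.0] * 3, -1) + sp.diags([-1.0] * 3, 1)).tocsr()
perm = [0, 2, 1, 3]

# What the example solves: lower half of the reordered matrix, tril(P A P^T).
tril_of_permuted = sp.tril(A[perm, :][:, perm]).toarray()

# The original triangular system merely relabeled: P tril(A) P^T.
permuted_tril = sp.tril(A).tocsr()[perm, :][:, perm].toarray()

# The two matrices differ, so the reordered solve is a different equation.
assert not np.allclose(tril_of_permuted, permuted_tril)
```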
Right, this makes sense for ILU0/IC0 preconditioners that use the same pattern as the original A, and also for complete factorizations that add fill-in. But if you mathematically compare the convergence of ILU0(A) and ILU0(PA), the convergence rates can be quite different -- this is a well-known theoretical result for ILU0; see the paper "Ordering strategies and related techniques to overcome the trade-off between parallelism and convergence in incomplete factorizations". Often the natural ordering has low parallelism but the best convergence, while a nested-dissection reordering has high parallelism but worse convergence (hence the "trade-off" in the title). The paper "The effect of ordering on preconditioned conjugate gradients" also has some earlier results.
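ILU0 is not directly exposed in scipy, so here is the analogous experiment with Gauss-Seidel sweeps (M = tril(A), the simplest member of the same family) on the 1D Poisson model, natural vs red-black ordering. `gs_iterations` is a hypothetical helper written for this sketch; no claim is made about which ordering wins on this tiny model, only that both converge to the same (back-permuted) solution while iteration counts can be compared:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve_triangular

def gs_iterations(A, b, tol=1e-8, maxit=10_000):
    """Forward Gauss-Seidel sweeps until ||b - A x|| < tol; returns (x, count)."""
    M = sp.tril(A).tocsr()  # GS preconditioner: lower triangle incl. diagonal
    x = np.zeros_like(b)
    for k in range(maxit):
        r = b - A @ x
        if np.linalg.norm(r) < tol:
            return x, k
        x = x + spsolve_triangular(M, r, lower=True)
    return x, maxit

n = 32
A = sp.diags([[-1.0] * (n - 1), [2.0] * n, [-1.0] * (n - 1)], [-1, 0, 1]).tocsr()
b = np.ones(n)
perm = np.r_[0:n:2, 1:n:2]  # red-black ordering of the 1D chain

x_nat, it_nat = gs_iterations(A, b)
y_rb, it_rb = gs_iterations(A[perm, :][:, perm], b[perm])

x_rb = np.empty(n)
x_rb[perm] = y_rb                           # permute the solution back
assert np.allclose(x_nat, x_rb, atol=1e-4)  # both orderings reach the same answer
print("natural GS iterations:", it_nat, " red-black GS iterations:", it_rb)
```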
Right, this is a quite minor effect. The non-associative floating-point operations will also lead to some differences between sequential and parallel runs, but not visible enough to impact real applications.
Thanks for the discussion.
I agree that reordering has a non-trivial effect for GS and similar cases.
This is also true. I have seen the effect of reordering on convergence for some matrices that I was playing with for a simulation application. This tells me that for a fair performance comparison, both the reordered and the original ordering should be reported. Nice papers!
👍 No further issues, closing.
The SpTRSV_runtime.cpp example orders the input matrix A before extracting its lower half (see aggregation/example/SpTRSV_runtime.cpp, lines 81 to 98 at commit da104fa).

I wonder how the reordering benefits the sparse triangular solve. In the ParSy & HDagg papers it is only mentioned that METIS is used as a fill-in-reducing reordering, but that is for the Cholesky factorization and not relevant for sptrsv. Disabling the reordering doesn't seem to affect performance in my experiments. It might have positive or negative impacts on data locality, but I didn't find any arguments in the papers.