Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

equality of graphs is 60 times slower than equality of their list of edges #30641

Open
videlec opened this issue Sep 23, 2020 · 24 comments
Open

Comments

@videlec
Copy link
Contributor

videlec commented Sep 23, 2020

On sage 9.1

sage: G = Graph(loops=False, multiedges=False)
sage: G.add_edges([(i, (i+85)%100) for i in range(100)])
sage: G.add_edges([(i, (i+37)%100) for i in range(100)])
sage: G.add_edges([(i, (i+85)%100) for i in range(100)])
sage: H = G.copy()
sage: %timeit G == H
196 µs ± 708 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
sage: E = list(G.edges())
sage: F = list(H.edges())
sage: %timeit E == F
3.3 µs ± 5.33 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Using immutable graphs is better by a factor 2 (and hence "only" 30x slower than list comparisons)

sage: E = [(i, (i+85)%100) for i in range(100)] + \
....:     [(i, (i+37)%100) for i in range(100)] + \
....:     [(i, (i+85)%100) for i in range(100)]
sage: G = Graph(E, loops=False, multiedges=False, immutable=True)
sage: H = G.copy()
sage: %timeit G == H
111 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Depends on #30645
Depends on #30665
Depends on #30776

CC: @kliem @dimpase @dcoudert

Component: graph theory

Keywords: thursdaysbdx

Issue created by migration from https://trac.sagemath.org/ticket/30641

@videlec videlec added this to the sage-9.2 milestone Sep 23, 2020
@videlec
Copy link
Contributor Author

videlec commented Sep 23, 2020

Changed keywords from none to thursdaysbdx

@dcoudert
Copy link
Contributor

comment:2

So far, G.edges() returns a sorted list of edges. So the test E == F is simply a lexicographic comparison of lists, omitting the time to extract and sort the list of edges. The reporting time comparison is therefore not completely fair.

Furthermore, since the switch to Python3, we can no longer rely on sorted list of edges (vertices and edge labels may be of different types, leading to an error when trying to sort the list). One objective is to deprecate this behavior.

I think that the only way to speed up the test G == H, is to speed up the test other.has_edge(...).

@videlec
Copy link
Contributor Author

videlec commented Sep 23, 2020

comment:3

The reason why I created this ticket is that I have code which creates a big list of graphs and where I want to remove duplicates. Currently, the program spends more than 50% of its time in graph equality which is ridiculous.

@dcoudert
Copy link
Contributor

comment:4

I agree. I recently made a #30510 to speed up method subgraph which was ridiculously slow. It's better now.

Here, I don't know how to speed up the method without going deep into the backends...

@videlec
Copy link
Contributor Author

videlec commented Sep 23, 2020

comment:5

I am fine with "going deep into the backends". My graphs have vertices {0, 1, ..., n-1} and there is no multiple edge. The weights are integers (but I don't think this is taken into account in the backend). The comparison of such graphs should be faster than the generic Python list comparison.

@kliem
Copy link
Contributor

kliem commented Sep 23, 2020

comment:6

Already the edge iterator is pretty slow. It takes 76 µs out 175 for me.

@dcoudert
Copy link
Contributor

comment:7

for your code, it could be better to work with immutable graphs.

sage: sage: E = [(i, (i+85)%100) for i in range(100)] + [(i, (i+37)%100) for i in range(100)] + [(i, (i+85)%100) for i in range(100)]     
sage: G = Graph([range(100), E], format='vertices_and_edges', loops=False, multiedges=False, immutable=True)                                  
sage: H = G.copy()                                                                                                                  
sage: %timeit G == H                                                                                                                
131 µs ± 1.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
sage: sage: G = Graph(loops=False, multiedges=False) 
....: sage: G.add_edges([(i, (i+85)%100) for i in range(100)]) 
....: sage: G.add_edges([(i, (i+37)%100) for i in range(100)]) 
....: sage: G.add_edges([(i, (i+85)%100) for i in range(100)]) 
....: sage: H = G.copy() 
....: sage: %timeit G == H 
....:                                                                                                                               
241 µs ± 8.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

But of course it would be better to improve the backends to get faster edge iterator, edge membership tests, equality test, etc.

@videlec

This comment has been minimized.

@kliem
Copy link
Contributor

kliem commented Sep 23, 2020

comment:9

I wrote a quick dirty solution that counts compares two vertices at a time and counts how much the out arcs differ. This takes about 34 µs instead of 175 using out_arc_unsafe.

@kliem
Copy link
Contributor

kliem commented Sep 23, 2020

comment:10

I will start with a tiny ticket that does some trivial optimizations for has_edge.

@kliem
Copy link
Contributor

kliem commented Sep 26, 2020

Dependencies: #30645, #30665

@kliem
Copy link
Contributor

kliem commented Sep 26, 2020

comment:12

#30645 and #30665 are a good start I would say.

Things are set up in a way, that we can implement an optimized containment check in CGraph.

@videlec
Copy link
Contributor Author

videlec commented Sep 26, 2020

comment:13

Nice. Thanks Jonathan for your efforts!

I also wonder if we could implement a comparison that would bypass the translation between "vertex labels" and "integers in {0, ..., n-1}". Namely have a comparison of the internal representations (assuming the backend is the same).

@dcoudert
Copy link
Contributor

comment:14

Does it make sense ? Currently, we check that the graphs have same settings and sets of vertices and edges. However, the internal representation might differ a lot. Suppose one graph G is the result of many additions/deletions of vertices of edges. Typically, at some point it has 100000 vertices, but at the end of the sequence of operations, it remains only 2. Internal bitsets/data structures might be very large. Now, if H is a copy of G, it has a very compact internal representation. The 2 graphs are equal, but have different internal representations.

@videlec
Copy link
Contributor Author

videlec commented Sep 26, 2020

comment:15

That is one use case. Here is another. I am considering all trivalent graphs up to isomorphism on n vertices. I want to construct the flip graph where I put an edge between the trivalent graph G1 and trivalent graph G2 if they differ by a Whitehad move

 A          C                      A          C
  \        /                        \        /
   \      /                          \      /
    u -- v          ----------->      u -- v
   /      \                          /      \
  /        \                        /        \
 B          D                      D          B

In order to generate this flip graph, one has to be able to identify the graph after a move. Since canonical labels are computed we need to compare graphs on {0, ..., n-1} with no surprise... and these are a lot of comparisons!

@dcoudert
Copy link
Contributor

comment:16

I see, not easy to do it efficiently.

@kliem
Copy link
Contributor

kliem commented Sep 26, 2020

comment:17

To check whether one CGraph is contained in another one can store a translation array at the beginning. If you can guarantee that the vertices match, you can of course skip this step.

After this initial step is pretty much the same as #30665. Just that you check has_arc_unsafe of other instead of yield. Things are a bit more complicated for multiple edges.

If you can guarantee the vertices to match, things should be really fast for dense graphs, but of course only, if you are somewhat dense (e.g. if n <= 64). Here you can just immediately compare to vertices.

@kliem
Copy link
Contributor

kliem commented Oct 16, 2020

Changed dependencies from #30645, #30665 to #30645, #30665, #30776

@kliem
Copy link
Contributor

kliem commented Oct 16, 2020

comment:18

Things are improving, I think.

IMO what would still improve things here:

  • better heuristics for vertex translation to int,
  • factor out the backends to standard backends and backends with vertices in range(n) (we could still have active vertices or not, but all vertices need to be in range(n), if you add a new vertex that is too large, you will get a memory problem, but that is your problem),
  • implement specialized versions of is_subgraph and subgraph_given_vertices for dense graphs over range(n) (we need a good name for this graph backend yet), this should be much faster than the current methods

So instead of just CGraphBackend we would have CGraphBackend, CGraphVertices and CGraphRangeVertices (not good names yet). DenseGraphBackend as we know it would have to inherit from CGraphBackend and CGraphVertices and DenseGraphBackendRangeVertices would have to inherit from CGraphBackend and CGraphRangeVertices. Likewise for sparse and static sparse.

Does this sound plausible?

Does anyone have good suggestions for a naming scheme?

@kliem
Copy link
Contributor

kliem commented Oct 16, 2020

comment:19

Btw, dense graphs should really use bitsets. I don't understand why it isn't. Eventually this might be sped up with intrinsics. E.g. an iterator over a bitset should be much faster with an improved leading zero count (_lzcnt_u64).

@dcoudert
Copy link
Contributor

comment:20

I don't know if it's better to use bitsets for dense graphs. In case, we can use src/sage/data_structures/binary_matrix.<pxi|pxd>.

@seblabbe
Copy link
Contributor

comment:21

Replying to @kliem:

So instead of just CGraphBackend we would have CGraphBackend, CGraphVertices and CGraphRangeVertices (not good names yet). DenseGraphBackend as we know it would have to inherit from CGraphBackend and CGraphVertices and DenseGraphBackendRangeVertices would have to inherit from CGraphBackend and CGraphRangeVertices. Likewise for sparse and static sparse.

Does this sound plausible?

Does anyone have good suggestions for a naming scheme?

This is how DisjointSet (union-find data structure) is implemented in Sage.

def DisjointSet(arg):
cdef class DisjointSet_class(SageObject):
cdef class DisjointSet_of_integers(DisjointSet_class):
cdef class DisjointSet_of_hashables(DisjointSet_class):

in the sense that DisjointSet_of_integers uses directly the internal representation and DisjointSet_of_hashables uses maps from integers in range(n) to the hashable objects and vice versa. See https://github.com/sagemath/sage-prod/blob/develop/src/sage/sets/disjoint_set.pyx

I don't know if the naming scheme _of_integers and _of_hashables I chosen at the time is good or not.

@mkoeppe mkoeppe modified the milestones: sage-9.2, sage-9.3 Oct 24, 2020
@dcoudert
Copy link
Contributor

comment:23

Timing with 9.3.beta5 on a macbook air. It seems we are much better now !

sage: G = Graph(loops=False, multiedges=False) 
sage: G.add_edges([(i, (i+85)%100) for i in range(100)]) 
sage: G.add_edges([(i, (i+37)%100) for i in range(100)]) 
sage: G.add_edges([(i, (i+85)%100) for i in range(100)]) 
sage: H = G.copy() 
sage: %timeit G == H 
43.1 µs ± 821 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
sage:
sage: E = list(G.edges()) 
sage: F = list(H.edges()) 
sage: %timeit E == F 
12.4 µs ± 357 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
sage:
sage: E = [(i, (i+85)%100) for i in range(100)] + \ 
....:     [(i, (i+37)%100) for i in range(100)] + \ 
....:     [(i, (i+85)%100) for i in range(100)] 
sage: G = Graph(E, loops=False, multiedges=False, immutable=True) 
sage: H = G.copy() 
sage: %timeit G == H                                                                                                                                               
31 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

@mkoeppe
Copy link
Member

mkoeppe commented Feb 13, 2021

comment:24

Setting new milestone based on a cursory review of ticket status, priority, and last modification date.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants