Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up some graph iterations #13730

Closed
dcoudert opened this issue Nov 20, 2012 · 31 comments
Closed

Speed up some graph iterations #13730

dcoudert opened this issue Nov 20, 2012 · 31 comments

Comments

@dcoudert
Copy link
Contributor

Several graph methods can be significantly speed up improving the data structure, or at least the way some basic operations are performed.
For instance, many functions are faster using networkx graphs (conversion time included) instead of sage graphs.

In fact, NetworkX uses dictionaries to store edges. If G is a networkx graph, it is also a dictionary indexed by the vertices, and G[u] is a dictionary indexed by neighbors and containing edge data. This way, iterations are fast.

sage: G = graphs.RandomBarabasiAlbert(100,2)
sage: %timeit E = [(u,v) for u in G for v in G.neighbor_iterator(u)]
125 loops, best of 3: 1.63 ms per loop
sage: ggnx = G.networkx_graph()
sage: %timeit EE = [(u,v) for u in ggnx for v in ggnx.neighbors_iter(u)] 
625 loops, best of 3: 141 µs per loop
sage: %timeit EE = [(u,v) for u in ggnx for v in ggnx[u]] 
625 loops, best of 3: 127 µs per loop

We should at least improve iteration over the neighbors.

CC: @sagetrac-azi @nathanncohen @lobiCode @sagetrac-brunellus @videlec

Component: graph theory

Issue created by migration from https://trac.sagemath.org/ticket/13730

@sagetrac-azi
Copy link
Mannequin

sagetrac-azi mannequin commented Nov 21, 2012

comment:1

Hello, dcoudert!!

First thing that I am wondering is why is sage constructing a set from the iterator and then returns the iterator of the set. Does anyone happen to know that? It does not seem to affect computational speed but I do not see why it is necessary. Anyone happens to know why is set() required in this case?

    def neighbor_iterator(self, vertex):
      
        if self._directed:
            return iter(set(self.neighbor_out_iterator(vertex)) \
                    | set(self.neighbor_in_iterator(vertex)))
        else: 
            return iter(set(self._backend.iterator_nbrs(vertex))))

@dcoudert
Copy link
Contributor Author

comment:2

I'm not expert enough in iterators to answer. Furthermore, the backend for graphs also constructs sets:

def iterator_nbrs(self, v):

    return iter(set(self.iterator_in_nbrs(v)) |
                set(self.iterator_out_nbrs(v)))

I don't know the reasons behind but this can certainly be improved/simplified.

David.

@sagetrac-azi
Copy link
Mannequin

sagetrac-azi mannequin commented Nov 24, 2012

comment:3

Agree!

It seems like something suspiciously complex is being done in these iterators (lists, to iterators, to sets, to iterators to sets..)

I've played around and removed the (intuitively) unneeded parts and observed that there is a minor performance gain.

But if we want to be as fast as networkx then we'd need to keep a hash table of neighbors for each vertex.

Before trying to implement this - does sage allow graphs with unhashable vertices?

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Nov 24, 2012

comment:4

NOnonono, vertices are hashable or everything would break. I mean, or instance the code you just printed would not work at all !

sage: set([set([1])])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

/home/ncohen/<ipython console> in <module>()

TypeError: unhashable type: 'set'

Nathann

@sagetrac-azi
Copy link
Mannequin

sagetrac-azi mannequin commented Dec 21, 2012

comment:5

Hello guys!

I have implemented a stupid "fix" by adding a hash table to the graph object storing neighbours for every vertex of the graph. I then modified the add_edge/delete_edge methods to update the hash table accordingly.

The speedup was drastic (as with networkX graphs) and it also speed up other graph theory functions (since many things rely on obtaining the neighbours of some vertices)

HOWEVER. Many of the graph theory tests made sage crash which makes uncomfortable in posting the patch here. I am not aware enough of the graph internals to be able to write a sane patch that is not just some naive fix.

Hence I call someone that is more experienced with the c_graph implementation to add this feature. It is really really worth doing it!

@dcoudert
Copy link
Contributor Author

comment:6

Hello,

you should post your patch so that others can try it and help identifying and fixing the problem. The status of this patch could be "needs work" or "needs info".

Best,
David.

@sagetrac-azi sagetrac-azi mannequin added the s: needs info label Dec 29, 2012
@lobiCode
Copy link
Mannequin

lobiCode mannequin commented Feb 21, 2013

comment:8

Attachment: trac_17710_needs_info.patch.gz

Hello.

I tried to implement what azi did. I added the hash tabel and update some methods. I didn't make other tests.
I'm new to SAGE and I would like to work on this problem. Please help me and tell me if this patch is going in the right direction.

Best regards,

slani

new
==========================================================
sage:  G = graphs.RandomBarabasiAlbert(100,2)
sage: %timeit E = [(u,v) for u in G for v in G.neighbor_iterator(u)]
625 loops, best of 3: 479 µs per loop
sage: ggnx = G.networkx_graph()
sage: %timeit EE = [(u,v) for u in ggnx for v in ggnx.neighbors_iter(u)] 
625 loops, best of 3: 171 µs per loopsage: %timeit EE = [(u,v) for u in ggnx for v in ggnx[u]] 
625 loops, best of 3: 159 µs per loop

sage: %timeit G.neighbors(3)                
625 loops, best of 3: 5.35 µs per loop
===========================================================

old
===========================================================
sage: G = graphs.RandomBarabasiAlbert(100,2)
sage:  %timeit E = [(u,v) for u in G for v in G.neighbor_iterator(u)]
125 loops, best of 3: 2.03 ms per loop
sage: ggnx = G.networkx_graph()
sage: %timeit EE = [(u,v) for u in ggnx for v in ggnx.neighbors_iter(u)] 
625 loops, best of 3: 177 µs per loop
sage: %timeit EE = [(u,v) for u in ggnx for v in ggnx[u]] 
625 loops, best of 3: 159 µs per loop

%timeit G.neighbors(3)
625 loops, best of 3: 91.1 µs per loop
============================================================

@lobiCode lobiCode mannequin added s: needs review and removed s: needs info labels Feb 21, 2013
@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Feb 22, 2013

comment:11

....

This patch does not even contain an instruction to remove an element from your cached list when an edge is deleted....

This problem has to be fixed, but it has to be fixed properly... If the backend is slow it has to be fixed there or the backend should be changed.

Nathann

@lobiCode
Copy link
Mannequin

lobiCode mannequin commented Feb 25, 2013

comment:12

Hello

I've added the function to handle vertex removals as well. I am aware that this solution is not good but I just wanted to write something to get us going. To make it work properly I would have to change the respective backends but I am clueless about the internal structure of the backends. Is there anyone here that can shed some light on it or point to the relevant documentation - I was not able to find any.

Where should this improvement be implemented?

@lobiCode lobiCode mannequin added s: needs review and removed s: needs info labels Feb 25, 2013
@lobiCode
Copy link
Mannequin

lobiCode mannequin commented Feb 25, 2013

Attachment: trac_17711_needs_info.patch.gz

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Feb 25, 2013

comment:13

Your updated patch would fail if there are two edges between two vertices, and the user removes one of the two edges.

You unnecessarily check whether the dictionaries contain a list associated to a vertex of the graph, and it would probably be better to associate an empty list to each vertex from the beginning. Anyway you have to make TESTS to decide what to do.

The functions you add are not doctested.

You do not document the features you add either.

I think that it would be better to add what you need in the code directly, without creating 2

Please, do not write this patch.

Is there anyone here that can shed some light on it or point to the relevant documentation - I was not able to find any.

There isn't, that's one of the problems. I write some comments in the code about some parts of it when I have to understand how it works to fix a bug.

Where should this improvement be implemented?

I do not know. What I would do if I were to write this patch, or what I will do if you force me to write it, is try to understand why the current implementation is so slow compared to dictionaries. I would try to locate the place which makes a difference in the timings. This has to come from the way edges are stored, and from what I remember this part of the code is not that clear.

Nathann

@lobiCode
Copy link
Mannequin

lobiCode mannequin commented Jun 4, 2013

Attachment: trac_17710_iter_set.patch.gz

@lobiCode
Copy link
Mannequin

lobiCode mannequin commented Jun 4, 2013

comment:16

Hello,

on picture you can see a diagram what happens when we call neighbors() on graph. You can see there is a lot conversion from set/iter and then back again.

Where this conversion is not necessary I remove it.

But most time reduction was made (for ours specific test), when function neighbor_iterator (F2) start calling iterator_out_nbrs (F4.2) for undirected graph instead iterator_nbrs (F3.1). If graph is undirected we only need out nbrs. or in nbrs.??????????

Before:
sage: G = graphs.RandomBarabasiAlbert(100,2)
sage:  %timeit E = [(u,v) for u in G for v in G.neighbor_iterator(u)]
125 loops, best of 3: 2.03 ms per loop
Now:
sage: %timeit E = [(u,v) for u in G for v in G.neighbor_iterator(u)]
625 loops, best of 3: 900 µs per loop

I also remove

#Sparse
if self._cg_rev is not None:
return iter([vertex_label(u_int,
self.vertex_ints,
self.vertex_labels,
self._cg)
for u_int in self._cg_rev.out_neighbors(v_int)])

in iterator_in_nbrs (F 4.1). I don't know what relation this has with sparse/dense.

But most time consuming problem is conversion from vertex int to vertex label in iterator_out_nbrs (F4.2) and iterator_in_nbrs (F4.1)

This conversion for our specific test takes almost 500 µs.

I tried to implement array of python object in cython, where we can save vertex labels, but I was unsuccessful, because I don't know cython well (yet). Maybe someone can help me?

If this is a good solution then we can return iter of vertex labels instead of list of vertex int from cpdef list out_neighbors (F5).

So what do you think?

Best,
Uros

@lobiCode lobiCode mannequin removed the s: needs work label Jun 4, 2013
@videlec
Copy link
Contributor

videlec commented Jun 7, 2013

Attachment: trac_13730-v2-vd.patch.gz

@videlec
Copy link
Contributor

videlec commented Jun 7, 2013

comment:21

Hi,

I appologize, I was a bit old school in my precedent message.

A long time ago, Cython did not support the yield statement and it was necessary either to write a class dedicated to iteration or to build the set on which we would like to iterate and return iter(my_set). This is the reason why it is implemented like that in sage.graphs.base.c_graph and sage.graphs.base.sparse_graph.

I upload a patch where all iterators does not build the underlying set. The gain is not extraordinary (only a x2) and the output order is completely messed up (this need to be fixed).

@lobiCode
Copy link
Mannequin

lobiCode mannequin commented Jun 8, 2013

comment:22

Hello,

I tried with yield in iterator_out_nbrs (l:1745, c_graph.pyx) but it does not improve anything.

sage: G = graphs.RandomBarabasiAlbert(100,2)
sage: %timeit E = [(u,v) for u in G for v in G.neighbor_iterator(u)]
625 loops, best of 3: 940 µs per loop

I did a lot of modification and testing (also tried with new function in cython which use yield for returning nbrs), but didn't improve anything.

I also made a function (in generic_graph.py) which directly call cpdef list out_neighbors (base/sparse_graph.pyx; l:798):

sage: %timeit E = [(u,v) for u in G for v in G.neighbors_three(u)]

625 loops, best of 3: 362 µs per loop 

It is still slower than NetworkX!

I don't know, but I came to the conclusion that we have tow problems:

  1. conversion from vertex ints to vertex lables
  2. how to faster return vertex ints from data structure we have for nbrs

Vincent, what do you think?

This is also interesting for me:

sage: G
Graph on 100 vertices
sage: ggnx
<networkx.classes.graph.Graph object at 0x65c8ed0>

sage: %timeit E = [(u,v) for u in G for v in G.neighbors_three(u)]
625 loops, best of 3: 358 µs per loop

sage: %timeit E = [(u,v) for u in ggnx for v in G.neighbors_three(u)]
625 loops, best of 3: 273 µs per loop


sage: %timeit EE = [(u,v) for u in ggnx for v in ggnx.neighbors(u)] 
625 loops, best of 3: 213 µs per loop

Best,
Uros

@videlec
Copy link
Contributor

videlec commented Jun 8, 2013

comment:23

Replying to @lobiCode:
Hello,

I don't know, but I came to the conclusion that we have tow problems:

  1. conversion from vertex ints to vertex lables

This should not slow down too much (not x4). It is a lookup in an array. Did you make some timings to see whether this is slow ? Perhaps we have to dig the way it is implemented.

  1. how to faster return vertex ints from data structure we have for nbrs

It might be a good idea to see whether iterator over int vertices are faster compared to the labeled one.

Other possibilities

  1. There are 3 levels in a graph: the GenericGraph class, the backend (stored at my_graph._backend) and then the datastructure (stored at my_graph._backend._cg for c graphs or my_graph._backend._nxg for networkx).

  2. Possibly Useless memory allocation in listing neighbors of a graph #14690 might be a good speedup (if it works one day ;-)

@lobiCode
Copy link
Mannequin

lobiCode mannequin commented Jun 8, 2013

comment:24

Replying to @videlec:

Replying to @lobiCode:
Hello,

I don't know, but I came to the conclusion that we have tow problems:

  1. conversion from vertex ints to vertex lables

This should not slow down too much (not x4). It is a lookup in an array. Did you make some timings to see whether this is slow ? Perhaps we have to dig the way it is implemented.

Yes.
This is a timing when I'm calling iterator_out_nbrs (base/c_graph.pyx; l: 1754)

sage: %timeit E = [(u,v) for u in G for v in G.neighbor_iterator(u)]
625 loops, best of 3: 872 µs per loop

This is a timing when I'm calling cpdef list out_neighbors (base/sparse_graph.pyx; l:798) directly (no conversions vertex int to vertex labels)

sage: %timeit E = [(u,v) for u in G for v in G.neighbors_three(u)]
625 loops, best of 3: 362 µs per loop 
  1. how to faster return vertex ints from data structure we have for nbrs

It might be a good idea to see whether iterator over int vertices are faster compared to the labeled one.

I also trying with yield in sparse_graph.pyx

+    def iter_out_nbrs_one(self, u):
+        
+        cdef int i, num_nbrs = 0, current_nbr = 0
+        cdef int size = self.out_degrees[u]
+        if self.out_degrees[u] == 0:
+            return 0
+        cdef SparseGraphBTNode **pointers = <SparseGraphBTNode **> \
+          sage_malloc(size * sizeof(SparseGraphBTNode *))
+        if not pointers:
+            raise RuntimeError("Failure allocating memory.")
+        for i from u * self.hash_length <= i < (u+1) * self.hash_length:
+            if self.vertices[i] == NULL:
+                continue
+            if num_nbrs == size:
+                sage_free(pointers)
+                return -1
+            pointers[num_nbrs] = self.vertices[i]
+            yield self.vertices[i].vertex
+            num_nbrs += 1
+
+            # While all the neighbors have not been added to the list, explore
+            # element pointers[current_nbr] and append its children to the end
+            # of pointers if necessary, the increment current_nbr.
+            while current_nbr < num_nbrs:
+                if pointers[current_nbr].left != NULL:
+                    if num_nbrs == size:
+                        sage_free(pointers)
+                        return -1
+                    pointers[num_nbrs] = pointers[current_nbr].left
+                    yield pointers[current_nbr].left.vertex
+                    num_nbrs += 1
+                if pointers[current_nbr].right != NULL:
+                    if num_nbrs == size:
+                        sage_free(pointers)
+                        return -1
+                    pointers[num_nbrs] = pointers[current_nbr].right
+                    yield  pointers[current_nbr].right.vertex
+                    num_nbrs += 1
+                current_nbr += 1
+        sage_free(pointers)
+     

This doesn’t improve anything. I suppose this is not a good implementation but it was just a test.

Other possibilities

  1. There are 3 levels in a graph: the GenericGraph class, the backend (stored at my_graph._backend) and then the datastructure (stored at my_graph._backend._cg for c graphs or my_graph._backend._nxg for networkx).

  2. Possibly Useless memory allocation in listing neighbors of a graph #14690 might be a good speedup (if it works one day ;-)

@lobiCode
Copy link
Mannequin

lobiCode mannequin commented Jun 10, 2013

comment:25

Replying to @videlec:

  1. Possibly Useless memory allocation in listing neighbors of a graph #14690 might be a good speedup (if it works one day ;-)

I modified data structure for saving nbrs. I implemented linked list and save all nbrs for vertex in one place in vertices (not on 16 different places). Obvious this is not good for adds and removes nbrs.

sage: %timeit E = [(u,v) for u in G for v in G.neighbor_iterator(u)]
625 loops, best of 3: 871 µs per loop

Very small improvements.

How can networkX data structure (python dict) return data so fast?

Best,

Uros

@sagetrac-azi
Copy link
Mannequin

sagetrac-azi mannequin commented Aug 13, 2013

comment:26

That may be interesting

sage: %timeit E = [(u,v) for u in G for v in G.neighbor_iterator(u)]
1000 loops, best of 3: 1.65 ms per loop
sage: H = Graph(G.graph6_string(), implementation='networkx')
sage: %timeit E = [(u,v) for u in H for v in H.neighbor_iterator(u)]
100 loops, best of 3: 1.35 ms per loop

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Aug 13, 2013

comment:27

What about #14806 ? :-P

Nathann

@sagetrac-azi
Copy link
Mannequin

sagetrac-azi mannequin commented Aug 13, 2013

comment:28

Okay this is even more interesting and it indicates the problem is not really in the backend per se.


sage: G = graphs.RandomBarabasiAlbert(1000,2)
sage: %timeit E = [(u,v) for u in G for v in G.neighbor_iterator(u)]
1000 loops, best of 3: 1.68 ms per loop

BUT:

sage: S = SparseGraph(1000)
sage: for i,j in G.edges(labels=False): S.add_arc(i,j)
sage: %timeit E = [(u,v) for u in S.verts() for v in S.out_neighbors(u)]
1000 loops, best of 3: 374 us per loop

What do you guys thin is going on here?

As for your static graphs thing If nobody wakes up then I promise to review it by the weekend! :-)

@sagetrac-azi
Copy link
Mannequin

sagetrac-azi mannequin commented Aug 13, 2013

comment:29

And also the relevant profilings.. If they are of any help..

sage: G = graphs.RandomBarabasiAlbert(1000,2)
sage: %prun E = [(u,v) for u in G for v in G.neighbor_iterator(u)]
         1003 function calls in 0.005 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    0.003    0.000    0.003    0.000 generic_graph.py:7848(neighbor_iterator)
        1    0.002    0.002    0.005    0.005 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 generic_graph.py:7794(vertex_iterator)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

sage: %prun E = [(u,v) for u in S.verts() for v in S.out_neighbors(u)]
         1003 function calls in 0.001 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.001    0.001 <string>:1(<module>)
     1000    0.000    0.000    0.000    0.000 {method 'out_neighbors' of 'sage.graphs.base.sparse_graph.SparseGraph' objects}
        1    0.000    0.000    0.000    0.000 {method 'verts' of 'sage.graphs.base.c_graph.CGraph' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Aug 13, 2013

comment:30

Okay this is even more interesting and it indicates the problem is not really in the backend per se.

Welll... Is it just forwarding the values returned by out_neighbors through neighbor_iterator ?

As for your static graphs thing If nobody wakes up then I promise to review it by the weekend! :-)

That'd be cool :-PPPPPPPP

Nathann

@sagetrac-azi
Copy link
Mannequin

sagetrac-azi mannequin commented Aug 13, 2013

comment:31

Replying to @nathanncohen:

Okay this is even more interesting and it indicates the problem is not really in the backend per se.

Welll... Is it just forwarding the values returned by out_neighbors through neighbor_iterator ?

YES. But why does it take so long? This changes the runtime by a huge order of magnitude.

As for your static graphs thing If nobody wakes up then I promise to review it by the weekend! :-)

That'd be cool :-PPPPPPPP

Nathann

@nathanncohen
Copy link
Mannequin

nathanncohen mannequin commented Aug 13, 2013

comment:32

YES. But why does it take so long? This changes the runtime by a huge order of magnitude.

HMmmmm... I'd say "just Python being Python" O_o

Nathann

@sagetrac-azi
Copy link
Mannequin

sagetrac-azi mannequin commented Aug 13, 2013

comment:33

Yes! The only confusing thing to me is why then is NetworkX (called without going through the frontend) efficient? (confusing because NetworkX is in Python)

sage: N = G.networkx_graph()
sage: %timeit E = [(u,v) for u in N for v in N.neighbors_iter(u)]
1000 loops, best of 3: 599 us per loop

So there is some Python being Python thing going on in the Graph frontend class.

@jdemeyer jdemeyer modified the milestones: sage-5.11, sage-5.12 Aug 13, 2013
@sagetrac-azi
Copy link
Mannequin

sagetrac-azi mannequin commented Aug 30, 2013

comment:35

Here is a wrap up of what's going on with this problem https://groups.google.com/forum/#!topic/sage-devel/CwKSNWXVJtU

@sagetrac-vbraun-spam sagetrac-vbraun-spam mannequin modified the milestones: sage-6.1, sage-6.2 Jan 30, 2014
@sagetrac-vbraun-spam sagetrac-vbraun-spam mannequin modified the milestones: sage-6.2, sage-6.3 May 6, 2014
@sagetrac-vbraun-spam sagetrac-vbraun-spam mannequin modified the milestones: sage-6.3, sage-6.4 Aug 10, 2014
@dcoudert
Copy link
Contributor Author

comment:39

in view of recent progresses (see #28895), may be we can move this ticket to wont fix and close it.

@mkoeppe mkoeppe removed this from the sage-6.4 milestone Dec 29, 2022
@mkoeppe mkoeppe closed this as not planned Won't fix, can't repro, duplicate, stale Apr 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants