Skip to content

Ticket 1861 #460

Merged
merged 9 commits into from Mar 22, 2013

4 participants

@timleslie

This pull request addresses ticket 1861

http://projects.scipy.org/scipy/ticket/1861

All tests pass and memory performance is orders of magnitude better.

@pv
SciPy member
pv commented Mar 10, 2013

@jakevdp: looks ok?

@jakevdp
SciPy member
jakevdp commented Mar 11, 2013

This looks really nice -- thanks for the submission. It's great to see this being improved! I haven't yet pulled and tested it, but if it passes the current unit tests I think it's fine.

@timleslie - I wonder if you'd be willing to implement the algorithm for an undirected graph as well? The current one is a bit of a place-holder, and is pretty slow. What it would take is to create another version of the connected_components function which accepts the transposed version of indices and indptr. The loops through the indptr array would also loop through the second indptr array, checking those nodes as well (keeping in mind there could be some repeats).

Is this something you'd be willing to tackle and add to this pull request? I don't think it would be much added effort, and the payoff would be huge.

Thanks for looking at this -- again, very nice work here!

@timleslie

I'll have a look into the undirected version and see what I can come up with. It appears that it doesn't suffer from the same memory limitations due to a recursive algorithm that the directed version did, but it could potentially be simplified/consolidated to achieve better speed and memory performance.

@jakevdp
SciPy member
jakevdp commented Mar 11, 2013

Right - it didn't use the same algorithm: currently it works by constructing a full depth-first tree from each node. This produces the correct result, but is definitely not optimal. I put the current algorithm in as an easy place-holder when I first wrote the module, but never ended up implementing a faster version. I think the algorithm you implemented would be significantly faster and more memory-efficient.

Again, thanks for looking into this!

@timleslie

@jakevdp I've just pushed code for the undirected graph. As expected it's faster than the previous version and has the nice property of not needing to allocate any new arrays (beyond the transposed matrix and labels). This removes the need for 4*N worth of integers that the previous algorithm used.

@jakevdp
SciPy member
jakevdp commented Mar 11, 2013

Nice! I'll try to do a detailed review of this soon... I'm traveling and preparing for PyCon this week, so it might take a few days before I get to it.

@jakevdp jakevdp commented on an outdated diff Mar 22, 2013
scipy/sparse/csgraph/_traversal.pyx
- if labels[i] < 0:
- _depth_first_undirected(i, indices1, indptr1,
- indices2, indptr2,
- node_list, predecessors, root_list, flag)
- labels[flag > 0] = label
- flag.fill(0)
- label += 1
-
- return label
-
+cdef int _connected_components_directed(np.ndarray[ITYPE_t, ndim=1, mode='c'] indices,
+ np.ndarray[ITYPE_t, ndim=1, mode='c'] indptr,
+ np.ndarray[ITYPE_t, ndim=1, mode='c'] labels):
+ """
+ Uses an iterative version of Tarjan's algorithm to find the strongly connected components
+ of a directed graph represented as a sparse matrix (scipy.sparse.csc_matrix or scipy.sparse.csr_matrix).
@jakevdp
SciPy member
jakevdp added a note Mar 22, 2013

PEP8 suggests limiting code lines to 79 characters, and documentation to 72 characters - these lines should be shortened

@jakevdp
SciPy member
jakevdp added a note Mar 22, 2013

Also several places below

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@jakevdp jakevdp commented on an outdated diff Mar 22, 2013
scipy/sparse/csgraph/_traversal.pyx
- labels.fill(-1)
+ cdef np.ndarray[ITYPE_t, ndim=1, mode="c"] SS = np.ndarray((N,), dtype=np.int32)
+ cdef np.ndarray[ITYPE_t, ndim=1, mode="c"] lowlinks = labels
+ cdef np.ndarray[ITYPE_t, ndim=1, mode="c"] stack_f = SS
+ cdef np.ndarray[ITYPE_t, ndim=1, mode="c"] stack_b = np.ndarray((N,), dtype=np.int32)
+
@jakevdp
SciPy member
jakevdp added a note Mar 22, 2013

These should be dtype=ITYPE to match the array declaration. ITYPE_t is int32_t, but it's cleaner if we're consistent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@jakevdp
SciPy member
jakevdp commented Mar 22, 2013

Thanks for the fast fixes! I was just trying out the code, and comparing the results to other implementations. It's producing the correct results, and it's fast! One quick question: why are there fewer layers of loops in the undirected version? I would have thought the two implementations would be identical, except for the need to loop over two child arrays for the undirected graph. Is there a detail that I'm not seeing?

@jakevdp
SciPy member
jakevdp commented Mar 22, 2013

By the way - sorry for letting this sit for so long!

@timleslie

There's the same number of loops, but there's less conditional statements. This is because we don't have to do any work on the back tracking step of the depth first search. We can assign each node a label as we're traversing downwards, since the fact that we hit the node means it's part of the weakly connected component.

@jakevdp
SciPy member
jakevdp commented Mar 22, 2013

Great! I think this looks ready to go. I'm +1 for merge. @pv, what do you think?

@pv pv merged commit cc48790 into scipy:master Mar 22, 2013
@pv
SciPy member
pv commented Mar 22, 2013

Looks good, merged.

@rgommers
SciPy member

backported to 0.12.x in 2cfd985

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.