
Sparse Graph submodule #119

Merged
merged 56 commits almost 2 years ago

7 participants

Jake Vanderplas Ralf Gommers Travis E. Oliphant Dan Schult Pauli Virtanen Gael Varoquaux Vlad Niculae
Jake Vanderplas
Collaborator

This is an initial pull request aimed at adding some graph routines based on the scipy sparse matrices (see mailing list discussion here: http://mail.scipy.org/pipermail/scipy-dev/2011-December/016773.html )

The initial commit includes some routines modified from the scikit-learn utility functions:

  • Dijkstra's algorithm w/ Fibonacci heaps
    used to compute shortest path, implemented in cython

  • Floyd-Warshall algorithm
    also used for shortest path, implemented in cython

  • graph Laplacian
    implemented in pure python, for both sparse and dense inputs

This is very much a work in progress: I'd love to have a discussion about what routines should be included here.

I'd like this package to be a compendium of very fast, very general graph algorithms based on the scipy.sparse data model (i.e., no custom graph/node classes). Though this won't allow for every graph algorithm to be included, I think it is the best fit to the scipy philosophy. I think any routines included here should be very general, and address use cases in derived libraries, e.g. scikit-learn, scikit-image, networkx, etc. I'd appreciate input from the developers of those libraries.

I have not benchmarked the below algorithms against those in NetworkX, but I think they will compare favorably.

Ralf Gommers
Owner

Looks good at first glance. Just commenting to CC myself.

Travis E. Oliphant
Owner

I have not reviewed in detail, but this looks like a very nice addition. Thank you! Register my +1

Jake Vanderplas
Collaborator

I'd be interested in comments from some NetworkX developers regarding other functionality that would be useful to include.

Jake Vanderplas
Collaborator

Just a quick benchmark against NetworkX: I think this is a fair comparison, but someone with more familiarity with NetworkX should double-check this:

In [1]: import networkx as nx

In [2]: import numpy as np

In [3]: from scipy.sparse.graph import graph_shortest_path

In [4]: G = nx.gnm_random_graph(100, 200)

In [5]: M = nx.to_numpy_matrix(G)

In [6]: timeit nx.shortest_path(G)
10 loops, best of 3: 23.1 ms per loop

In [7]: timeit graph_shortest_path(M)
100 loops, best of 3: 4.89 ms per loop
Jake Vanderplas
Collaborator

I moved the algorithms from the graph subdirectory to the csgraph subdirectory, and renamed them to unify with the convention set by cs_graph_components.

I haven't gotten any feedback on this from networkx developers. I'm curious if people have ideas for other algorithms that could fit in the compressed sparse graph framework.

What else needs to be done before merging?

Dan Schult

I took a look through the code and it seems like a good module.
I have two suggestions/questions based on a first read-thru.
They are both about graph_shortest_path which is what I spent
time looking at.

1) shouldn't the names be switched for dist_matrix and graph?
The input matrix describes the graph and the output describes the
shortest path length (distance) between pairs. So it seems like
the input should be named graph and the output named dist_matrix.
Minor I know--but maybe easier to read?

2) Some users will want the path length and others want the path itself.
We found in NetworkX that it worked well to have the same routine optionally
return both a distance_matrix and a predecessor_matrix. Row i of the
predecessor_matrix gives the predecessor of each node j in a shortest
path from node i to node j. You can build paths quickly from the
predecessor_matrix. Such a routine could be included in this module for speed,
or left up to the user to post-process.
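
(For illustration, a path can be rebuilt from such a predecessor matrix along these lines; the helper name and the sentinel value used for "no path" are assumptions, not part of the PR.)

def build_path(predecessors, i, j, null=-9999):
    # Walk backwards from j toward i using row i of the predecessor matrix.
    if i != j and predecessors[i, j] == null:
        return None  # no path exists between i and j
    path = [j]
    while path[-1] != i:
        path.append(predecessors[i, path[-1]])
    return path[::-1]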

As for other algorithms, how about unweighted path lengths, depth first order
and breadth first order? How much do you want to put here vs in a specialty
package like NetworkX?

Jake Vanderplas
Collaborator

Thanks for the comments.

Good point on the naming. I'll switch that to be clearer.

I'll think about how to modify the code to return the predecessor matrix as well if the user passes a return_paths flag.

An unweighted path length function is a good idea, and should be easy to implement.

Jake Vanderplas
Collaborator

How much functionality to include here is an interesting question. I started with these particular routines because they're used in scikit-learn, and folks on the scipy-dev mailing list thought that these algorithms were fundamental enough to warrant inclusion in scipy. I have to admit I have no idea how to draw a good line between what belongs here and what doesn't.

Ralf Gommers
Owner

It's probably not possible to draw a clear line. As rules of thumb perhaps:

  • should have potential uses on more than one project that depends on scipy
  • should appear in an introductory algorithm book such as "Introduction to Algorithms", Cormen et al.
Jake Vanderplas
Collaborator

Thanks for the input, Ralf. In light of that, I think I'll work to add the functionality suggested above by @dschult and call it good for now. Those algorithms are fairly basic and broadly applicable, and most are used downstream in scikit-learn. If more use-cases come up in the future, I'd be happy to work on adding them in.

Ralf Gommers
Owner

Sounds good.

Jake Vanderplas
Collaborator

I added functionality and tests to return the predecessor matrix in a shortest path search.
I'll do unweighted path lengths, breadth-first search, and depth-first search next.

Jake Vanderplas
Collaborator

@dschult - can you take a look at the depth-first and breadth-first search functions I added? Is this what you had in mind?

I still want to create a utility routine which will take the output of these and construct a sparse representation of the graph they represent.

Dan Schult
Jake Vanderplas
Collaborator

Dan,
I think you're right that the only attributes of the csr matrix that will be used by the routines are data, indices, indptr, and shape. We could certainly allow arbitrary objects to be passed which have these attributes, but we'd have to be careful. For example, a csc_matrix has all these attributes, but the code would have to decide whether the user means the csc matrix to represent an abstract graph object, or represent a regular matrix which should be converted to csr. It sounds kind of messy.

As far as providing user access to the cdef routines, I'd be hesitant at this point. Currently these routines don't perform any validation, and many have index-checking turned off for speed. This is fine because the input is pre-validated by the calling function, but allowing user-access to these could easily lead to segfaults, unless we turn on index checking, which will lead to a reduction in speed.

Interesting ideas, though!

Dan Schult
Jake Vanderplas
Collaborator

I added some more tests and a better validation routine. This is getting pretty close, I think.
I still need to add some submodule-level documentation, as well as some broader test coverage.

@rgommers, should scipy packages & tests use relative imports, or explicit import paths? I'm not sure what the style guidelines are on that.

Ralf Gommers
Owner

The tests should use full import paths, either from scipy.sparse import cs_graph or from scipy.sparse.cs_graph import func_a, func_b. The package (csgraph/__init__.py) can also use from graph_components import func_a, as you did. This is actually more common than the full path. Don't use relative imports with dots.

The way you structured it is very good I think. Just a few minor things to add to the docs:

  • You didn't add an import of csgraph in sparse/__init__.py, which I think is fine. But that means csgraph is a public module, you can document it as such in doc/API.rst.txt.
  • As you said, the package needs a slightly longer description in the docstring in __init__.py and a listing of functions.
  • The package should be added to the reference guide by adding a sparse.csgraph.rst file with an automodule statement in doc/source/, and an entry in index.rst.
  • A few docstrings can be completed, for example cs_graph_laplacian needs params/returns, and some examples would be nice. The breadth/depth first tests are easy to understand by looking at input/output arrays; they can be used directly as examples, I think.
Pauli Virtanen
Owner
pv commented February 09, 2012

API nits:

I'd recommend prepending _ to all modules except those considered public. Otherwise, people will do from scipy.sparse.csgraph.graph_laplacian import cs_graph_laplacian, which makes things break later on when you move stuff around (whereas scipy.sparse.csgraph._graph_laplacian has a warning sign).

In a similar vein, if the functions are imported also to scipy.sparse, then it IMO would be better to rename csgraph to _csgraph. If the csgraph is intended as a public module, I wouldn't import the functions into scipy.sparse (and in this case it might be possible to drop the cs_graph_ prefixes from the functions, and prefer from scipy.sparse import csgraph).

Jake Vanderplas
Collaborator

Thanks for the feedback - I'll address those concerns soon.
The reason I used the cs_graph_ prefix was the convention set by the existing cs_graph_components function. I think it would make more sense to drop the cs_graph_ prefix in favor of keeping all the routines in the scipy.sparse.csgraph namespace, as @pv mentioned. This would require keeping an alias to the components function for backward compatibility, but it is overall much cleaner in my opinion. Thoughts on that?

Ralf Gommers
Owner

Good one Pauli, forgot about the underscores. +2 for those.

scipy.sparse.csgraph looks good to me.

Pauli Virtanen
Owner
pv commented February 10, 2012

Yep, dropping the prefixes and leaving the alias sounds OK. The alias could also be deprecated in due course.

Jake Vanderplas
Collaborator

I did a large number of updates, including the renaming discussed above, additional tests, examples in doc strings, documentation, and adding information to the release notes.

Let me know if you see anything I've missed!

scipy/sparse/csgraph/__init__.py
((57 lines not shown))
+           'breadth_first_order',
+           'depth_first_order',
+           'breadth_first_tree',
+           'depth_first_tree',
+           'minimum_spanning_tree']
+
+from _components import connected_components
+from _laplacian import laplacian
+from _shortest_path import shortest_path, floyd_warshall, dijkstra
+from _traversal import breadth_first_order, depth_first_order, \
+    breadth_first_tree, depth_first_tree
+from _min_spanning_tree import minimum_spanning_tree
+from tools import construct_dist_matrix, reconstruct_path
+
+
+def cs_graph_components(*args, **kwargs):
Pauli Virtanen Owner
pv added a note February 11, 2012

Here you can use the deprecate decorator from Numpy to raise the warning.
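
(For reference, a minimal sketch of what that could look like; whether the old call signature is preserved exactly is left open here.)

import numpy as np
from scipy.sparse.csgraph import connected_components

# Expose the old name as a deprecated alias that warns on every call.
cs_graph_components = np.deprecate(
    connected_components,
    old_name='cs_graph_components',
    new_name='csgraph.connected_components')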

Dan Schult
dschult added a note February 11, 2012
Ralf Gommers
Owner

Can you add: rgommers@8228f76
It has fixes for numscons and bento builds. The numscons one I tested, I'm not 100% sure about the bento build. I couldn't find an example of usage of a .pxi file, perhaps that should be included in a builder (in which case a bscript file is needed).

Ralf Gommers
Owner

Why not also underscore validation.py and tools.pyx?

Jake Vanderplas
Collaborator

@rgommers - regarding the pxi file: I think that because I've cythonized the scripts and committed the resulting c code, users won't have to worry about the details of the cython implementation.

Ralf Gommers
Owner
In [6]: csgraph.<TAB>
csgraph.breadth_first_order    csgraph.depth_first_tree       csgraph.reconstruct_path
csgraph.breadth_first_tree     csgraph.dijkstra               csgraph.shortest_path
csgraph.connected_components   csgraph.floyd_warshall         csgraph.tools
csgraph.construct_dist_matrix  csgraph.laplacian              csgraph.validation
csgraph.cs_graph_components    csgraph.minimum_spanning_tree  
csgraph.depth_first_order      csgraph.numpy   

Shouldn't contain numpy, tools and validation. construct_dist_matrix and reconstruct_path are also not in __all__, should they be? If not, why are they imported in __init__.py?

scipy/sparse/csgraph/_laplacian.py
((98 lines not shown))
+           [ 0,  3,  6,  9, 12],
+           [ 0,  4,  8, 12, 16]])
+    >>> csgraph.laplacian(G, normed=False)
+    array([[  0,   0,   0,   0,   0],
+           [  0,   9,  -2,  -3,  -4],
+           [  0,  -2,  16,  -6,  -8],
+           [  0,  -3,  -6,  21, -12],
+           [  0,  -4,  -8, -12,  24]])
+
+    Notes
+    -----
+    The Laplacian matrix of a graph is sometimes referred to as the
+    "Kirchoff matrix" or the "admittance matrix", and is useful in many
+    parts of spectral graph theory.  In particular, the eigen-decomposition
+    of the laplacian matrix can give insight into many properties of the
+    graph.
Ralf Gommers Owner

Notes section should come before Examples.

Jake Vanderplas
Collaborator

I think construct_dist_matrix and reconstruct_path are tools which could be useful to users. Perhaps I should add some examples to show how they're used.
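
(A rough sketch of the kind of example that could go in those docstrings; the call signatures shown here are assumed, not quoted from the PR.)

from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import breadth_first_order, reconstruct_path

graph = csr_matrix([[0, 1, 2, 0],
                    [0, 0, 0, 1],
                    [0, 0, 0, 3],
                    [0, 0, 0, 0]])

# A single-source search returns a predecessor array, which reconstruct_path
# can turn back into the corresponding tree as a sparse matrix.
nodes, predecessors = breadth_first_order(graph, 0, return_predecessors=True)
tree = reconstruct_path(graph, predecessors, directed=True)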

scipy/sparse/csgraph/_shortest_path.pyx
((158 lines not shown))
+    dist_matrix : ndarray, shape=[N, N]
+        the matrix of shortest paths between points.
+        If no path exists, the path length is zero
+
+    predecessors : ndarray, shape=(N, N)
+        returned only if return_predecessors == True.
+        Matrix of predecessors, which can be used to reconstruct the shortest
+        paths.  Row i of the predecessor matrix contains information on the
+        shortest paths from point i: each entry predecessors[i, j]
+        gives the index of the previous node in the path from point i
+        to point j.  If no path exists between point i and j, then
+        P[i, j] = -9999
+
+    Notes
+    -----
+    Thes routine has been written for positive graphs only.
Ralf Gommers Owner

Thes --> This

scipy/sparse/csgraph/_laplacian.py
((76 lines not shown))
+    normed: boolean (optional)
+        if True, then compute normalized Laplacian
+    return_diag: boolean (optional)
+        if True, then return diagonal as well as laplacian
+
+    Returns
+    -------
+    lap: ndarray, shape=(N, N)
+        the laplacian matrix of graph
+
+    diag: ndarray, size=N [if return_diag == True]
+        the diagonal of the laplacian matrix
+
+    Examples
+    --------
+    >>> import numpy as np
Ralf Gommers Owner

Numpy import in examples is not needed.

Ralf Gommers
Owner

Some nitpicks on the docs (which look excellent overall):

  • All sentences should start with a capital letter and end with a dot.
  • array-like should be array_like.
  • Parameters/Returns type specifiers:
    • keywords get , optional appended
    • default value and shape are noted in the description, not the specifier.
    • integer should be int, boolean should be bool (only in the type specifier, not in a sentence).
    • if >1 type is possible, use "or". For example node_array: np.ndarray, int, shape=(N_nodes,) should be node_array: ndarray or int, optional.
  • "1d" should be "1-D"
  • the list in shortest_path (method parameter) won't render correctly I think. Probably best to indent it and add blank lines between the auto/FW/D items.
  • the examples in depth/breadth_first_tree look incorrect in a terminal (see below). In minimum_spanning_tree it looks fine.

::

     (0)                         (0)
    /   \                       /
   3     8                     3     8
  /       \                   /
(3)---5---(1)               (3)       (1)
  \       /                           /
   6     2                           2
    \   /                           /
     (2)                         (2)
Ralf Gommers
Owner

You're right about the .pxi file, so the Bento fix should also be correct.

Jake Vanderplas
Collaborator

I think I've addressed all your concerns listed above. Let me know if you see any other issues.

Ralf Gommers
Owner

The info.py files were all removed a while ago. Can you undo commit 3a5b909?

Is it still documented somewhere that info.py files should be added?

Jake Vanderplas
Collaborator

One more algorithm that would be nice to add is the Bellman-Ford algorithm (and perhaps Johnson's algorithm), to perform shortest-path searches on graphs with negative edges. But that brings up the slightly awkward question of how to deal with edges of zero weight. I think that zero-weight edges are probably a fundamental limitation of compressed sparse graph representations.

Ralf Gommers

dimesion --> dimension

The np. prefix can be left out of the type specifier everywhere

Ralf Gommers
Owner

I've got three more commits at https://github.com/rgommers/scipy/tree/sparse-graph to fix a compile error, add C files to .gitattributes, and add a test() command for the package. Could you pull those?

Jake Vanderplas
Collaborator

I moved things to info.py because that's how it's done in scipy.sparse and scipy.sparse.linalg. I didn't know that we had changed that convention. I'll undo that commit.

Ralf Gommers
Owner

There's no info.py in scipy.sparse, but I see I forgot to remove the one in sparse.linalg (and its subfolders). I'll go fix that now.

Ralf Gommers
Owner

Zero-weight edges will be difficult, but in principle there's nothing to stop you from explicitly storing 0's in csr format, right? Just getting them in there from a dense matrix doesn't work, so you'd have to create a matrix with csr_matrix((data, indices, indptr), [shape=(M, N)]). I don't know how much of a pain that would be.
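
(A small sketch of that, not code from the PR: an explicit zero survives in the CSR data array even though it disappears in the dense view.)

import numpy as np
from scipy.sparse import csr_matrix

# Three nodes; the 0.0 stored explicitly in `data` is a zero-weight edge 0 -> 2.
data = np.array([1.0, 0.0, 2.0])
indices = np.array([1, 2, 2])
indptr = np.array([0, 2, 3, 3])
graph = csr_matrix((data, indices, indptr), shape=(3, 3))

print(graph.nnz)        # 3: the explicit zero is stored
print(graph.toarray())  # in the dense view the zero-weight edge looks like "no edge"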

Bellman-Ford and Johnson's algorithm meet the rules of thumb I suggested above, so I certainly won't stop you from adding them.

Jake Vanderplas
Collaborator

Good point - I think I will work on Bellman-Ford and Johnson's algorithm. They can be included as alternate methods in csgraph.shortest_path.

One concern I have: some of these methods will give inconsistent results if passed a csr matrix in non-canonical form. e.g. if there are repeated indices in a row, then dijkstra's algorithm will choose the smallest value among them (it essentially treats them as multiple edges between the nodes) while the floyd-warshall algorithm would sum them in the conversion to dense format. A similar problem would occur with zero-weight edges: dijkstra's algorithm treats them as true zero-weight edges, while the floyd-warshall algorithm treats them as unconnected nodes.

Any thoughts on what to do with this?
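
(To make the duplicate-index case concrete - an illustrative snippet, not code from the PR; sum_duplicates() is the standard scipy.sparse way to put a matrix in canonical form.)

import numpy as np
from scipy.sparse import csr_matrix

# Two stored entries for the same edge 0 -> 1, with weights 1 and 3.
data = np.array([1.0, 3.0])
indices = np.array([1, 1])
indptr = np.array([0, 2, 2])
graph = csr_matrix((data, indices, indptr), shape=(2, 2))

print(graph.toarray())  # dense conversion sums the duplicates: edge weight 4
graph.sum_duplicates()  # canonical CSR form also sums them in place
print(graph.data)       # [ 4.]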

Dan Schult
Jake Vanderplas
Collaborator

For people following this - I haven't had time to do much work in the last week or so, but I hope to get to it in the near future.

These are my thoughts - CSC and CSR matrices can be set up to allow for zero-weight edges, but these get lost on conversion to dense arrays with the toarray() function. My most recent commit includes a couple of routines for sparse-dense conversion which are designed for compressed sparse graphs, and are more flexible in allowing the presence of meaningful zero entries.

I need to update the documentation, examples, and some of the code to reflect these changes. With these tools in place, I should be able to implement Johnson's algorithm and the Bellman-Ford algorithm, link them into the shortest_path wrapper function, add some test cases, and then I think the submodule will be ready for merge.

Ralf Gommers
Owner

The idea of the csgraph_from_dense and csgraph_to_dense should work well, but I'd choose the keywords such that they match and that a matrix roundtrips correctly. Something like

def csgraph_from_dense(graph, null_value=np.inf):
    ...

def csgraph_to_dense(csgraph, null_value=np.inf):
    ...
Jake Vanderplas
Collaborator

I added the beginnings of the Bellman-Ford algorithm. The basic implementation works, but I need to add more to make it full-featured. My list of remaining tasks is under TODO at the top of _shortest_path.pyx.

@rgommers - the reason I did slightly different call signatures for csgraph_from_dense and csgraph_to_dense is that there may be situations where one would want multiple values (say 0 and +/-infinity) to be non-edges when converting dense to sparse. If we follow your suggestion, the code would look cleaner, but that flexibility would be lost. Perhaps a compromise is to allow only one value to be passed in an argument null_value, but keep the infinity_null flag in csgraph_from_dense.

Ralf Gommers
Owner

That's a good reason - it would be annoying to have to zero infty's by hand if they are somehow present. I just liked being able to roundtrip from an aesthetic/principled POV, but if there's a good reason not to do that it's not a big deal.

scipy/sparse/csgraph/_shortest_path.pyx
((48 lines not shown))
+    Perform a shortest-path graph search on a positive directed or
+    undirected graph.
+
+    Parameters
+    ----------
+    csgraph : array, matrix, or sparse matrix, 2 dimensions
+        The N x N array of non-negative distances representing the input graph.
+    method : string ['auto'|'FW'|'D'], optional
+        Algorithm to use for shortest paths.  Options are
+
+           'auto' -- Attempt to choose the best method for the current problem.
+           'FW'   -- Floyd-Warshall algorithm.  Computational cost is
+                     approximately ``O[N^3]``.  The input csgraph will be
+                     converted to a dense representation.
+           'D'    -- Dijkstra's algorithm with Fibonacci heaps.  Computational
+                     cost is approximately ``O[N(Nk + Nlog(N))]``, where ``k``
Ralf Gommers Owner
rgommers added a note March 02, 2012

A space would help here, Nk -> N k (or N * k).

scipy/sparse/csgraph/_shortest_path.pyx
((43 lines not shown))
+                  directed=True,
+                  return_predecessors=False,
+                  unweighted=False,
+                  overwrite=False):
+    """
+    Perform a shortest-path graph search on a positive directed or
+    undirected graph.
+
+    Parameters
+    ----------
+    csgraph : array, matrix, or sparse matrix, 2 dimensions
+        The N x N array of non-negative distances representing the input graph.
+    method : string ['auto'|'FW'|'D'], optional
+        Algorithm to use for shortest paths.  Options are
+
+           'auto' -- Attempt to choose the best method for the current problem.
Ralf Gommers Owner
rgommers added a note March 02, 2012

Could add here something like "(default). Chooses between 'D' and 'FW' based on number of nodes and edges.".

Ralf Gommers
Owner

I just saw you added some more commits (Github doesn't send notifications for those). Looks good, and your TODO list is getting shorter.

Jake Vanderplas
Collaborator

Thanks! Once I'm finished with pydata and pycon, I hope to finish this off.

Jake Vanderplas
Collaborator

Pydata sprint tonight! I finished implementing Johnson's algorithm, and made all shortest path methods deal with negative cycles appropriately.
I'd still like to add some tests for negative cycles, and double-check that all the specialized conversion functions are consistent. Getting very close - if anyone feels like doing a detailed review of the submodule, now would be a great time to start!

Jake Vanderplas
Collaborator

A few notes on this commit: for consistency across the submodule, I decided to define a canonical dense graph representation, where null edges are represented by infinite weights. This alleviates some problems I was having with graphs containing zero-weight edges. There are several utility routines to efficiently convert graphs to this canonical form from other representations (e.g. where null edges are denoted by 0).
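
(A small illustration of the two dense conventions; the array values are made up.)

import numpy as np

# 0-as-null convention: zeros mean "no edge", so a zero-weight edge cannot be expressed.
dense_zero_null = np.array([[0., 2., 0.],
                            [0., 0., 3.],
                            [0., 0., 0.]])

# Canonical form: np.inf marks a missing edge, which frees 0 to be a legal edge weight.
dense_inf_null = np.where(dense_zero_null == 0, np.inf, dense_zero_null)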

I also cleaned up the conversion functions, so they're closer to the form that @rgommers suggested.

I need to take a look at the laplacian and connected_components functions and make sure they behave correctly with this new canonical form. I also want to add a piece to the narrative documentation with some examples. The to-do list is getting much shorter!

Jake Vanderplas
Collaborator

Another thing - there are several places where nearly pure Python functions live in .pyx files. It may be nicer for people using the code (I'm thinking of source inspection with ipython, for example) if I factor the pure Python code out into .py files. Thoughts on that?

Pauli Virtanen
Owner
pv commented March 08, 2012

Moving pure-Python code out of Cython files also has the advantage that it decreases the size of the generated code.

Jake Vanderplas
Collaborator

I've been looking closer at the function scipy.sparse.csgraph.cs_graph_components, which I've moved to scipy.sparse.csgraph.connected_components in this PR. I think it will need a complete rewrite, as there are several issues:

  • if a directed graph is passed, it does not return the correct component label on nodes with no children.
  • the docstring says that the function only uses the upper triangle of the matrix; this is not true.
  • the source is a C++ module which is not well documented and is difficult to debug.

For these reasons, I'd like to re-implement this function in cython, and fix these bugs along the way. Thoughts?

Ralf Gommers
Owner

Both fixing bugs and reimplementing in Cython sound like good ideas.

Ralf Gommers
Owner

I can review the parts I commented on before and the docs again. I'm not an expert on graph algorithms though, so it would be good if @dschult or someone else could review the latest additions to this PR.

Dan Schult

I can review this PR and the changes to the components routines.

It'd be good to have all the graph routines in Cython rather than some in C++.

Dan Schult

Here are some more comments:

  • dijkstra and FW docstring: specialized algorithms. -> specialized algorithms like Johnson or Bellman-Ford.

  • shortest_path.pyx docstrings for Parameter csgraph: remove "non-negative" where weights can be negative now. (e.g. shortest_path, bellman-ford, johnson)

  • shortest_path.pyx docstrings: sometimes phrase "negative distance" is used and sometimes "negative weights".

  • Johnson docstring: "without negative cycles" -> without negative weights

  • docstring at top of shortest_path.pyx could include the newer algorithms BF and J

  • validate_graph should probably not allow csr_output and dense_output to both be True (and maybe there should only be one input flag: e.g. csr_output==False used to indicate dense_output=True).

  • validate_graph: if graph is undirected and csc format, do we even need to transpose it? Can't we use the csc format as if it is csr format?

The code and tests all look good. And the more I look at the c++ version of components, the more I think it should be rewritten and included in this module.

Jake Vanderplas
Collaborator

Thanks! I appreciate the feedback. I'll have to think about the components rewrite... I have a few ideas.

Dan Schult

Yes.... thinking is usually good. :)

There are two types of routines because directed graphs are so different from undirected.
The c++ routine only worked for undirected graphs. We could stick with that. Then
we "just" need a single_source_dfsorder (or bfs order if that tends to be faster) along with
a way to store nodes that have been found so we only start at nodes we haven't yet seen.

We could also shoot for the directed versions too. Then we probably need two more routines:
weakly_connected_components and strongly_connected_components. Again, both will
base most of their work on a single_source_*order. Tarjan's algorithm for strongly connected
components is one option.

The NetworkX code might be a place to start--though there are a lot of extra routines in there
that you probably don't want here.

undirected:
http://networkx.lanl.gov/_modules/networkx/algorithms/components/connected.html
strongly connected:
http://networkx.lanl.gov/_modules/networkx/algorithms/components/strongly_connected.html
weakly connected:
http://networkx.lanl.gov/_modules/networkx/algorithms/components/weakly_connected.html

Jake Vanderplas
Collaborator

In the most recent commit I added a cython implementation of connected components. The undirected version is a bit slow (it re-uses the depth-first tree machinery, which allocates some extra arrays). I can optimize that later. The directed strong components function uses an improved version of Tarjan's algorithm (the paper is referenced in the code comments).

One issue I'd like some input on: the old cs_graph_components function does some things I don't understand. If there is a component consisting of a single node, it sets the label to -2 and doesn't count it in the component list. I think this behavior is confusing, so I designed these new functions to give results similar to those in NetworkX: single-node components are labeled and counted like any other component. This will break backward compatibility, but I think it's a more intuitive interface.
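
(For example, with the new behavior an isolated node simply becomes its own labeled component; the signature and output format below are assumed.)

from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Nodes 0 and 1 are connected; node 2 is isolated.
graph = csr_matrix([[0, 1, 0],
                    [0, 0, 0],
                    [0, 0, 0]])
n_components, labels = connected_components(graph, directed=False)
# n_components == 2, with labels along the lines of [0, 0, 1]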

Jake Vanderplas
Collaborator

By the way, @dschult, thanks for the detailed feedback and comments. It's been very helpful!

Ralf Gommers
Owner

If it breaks backwards compatibility, you should bring this up on scipy-dev. If the old behavior doesn't make much sense though, it should be OK to change it.

Jake Vanderplas
Collaborator

@rgommers - finally responding to your feedback on validation:

Regarding the two flags: at the beginning of shortest_path, we need to validate the graph, but we don't want to convert it, so both flags are set to True.

Regarding CSC matrices: all the sparse algorithms expect CSR matrices. Using the CSC as a CSR essentially flips the direction of each edge. That's fine in the case of undirected graphs (which is where we use the transpose for efficiency), but for directed graphs we need to convert to CSR.
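
(A quick way to see the edge flip - an illustrative sketch: reinterpreting a CSC matrix's data/indices/indptr as CSR yields the transpose of the original matrix.)

import numpy as np
from numpy.testing import assert_array_equal
from scipy.sparse import csr_matrix

A = csr_matrix(np.array([[0, 1, 0],
                         [0, 0, 2],
                         [3, 0, 0]]))
A_csc = A.tocsc()

# Treat the CSC internals as if they were CSR: every edge direction is reversed.
as_csr = csr_matrix((A_csc.data, A_csc.indices, A_csc.indptr), shape=A.shape)
assert_array_equal(as_csr.toarray(), A.T.toarray())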

Ralf Gommers
Owner

The tests contain some plain asserts, those shouldn't be used. numpy.testing provides a replacement named assert_. Most of the time there are better alternatives though, for example assert np.allclose should be assert_allclose. Can you change that in test_connected_components.py and test_spanning_tree.py?
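
(Concretely, something along these lines - hypothetical snippets, not the actual test code.)

import numpy as np
from numpy.testing import assert_, assert_allclose

computed = np.array([0.0, 1.0, 2.0])
expected = np.array([0.0, 1.0, 2.0])

# instead of:  assert np.allclose(computed, expected)
assert_allclose(computed, expected)

# instead of:  assert n_components == 2
n_components = 2
assert_(n_components == 2)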

Ralf Gommers
Owner

I get 3 test failures:

======================================================================
ERROR: test_graph_components.test_cs_graph_components
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/nose-1.1.2-py2.6.egg/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/rgommers/Code/scipy/scipy/sparse/csgraph/tests/test_graph_components.py", line 8, in test_cs_graph_components
    n_comp, flag = csgraph.cs_graph_components(csr_matrix(D))
  File "/Users/rgommers/Code/numpy/numpy/lib/utils.py", line 139, in newfunc
    warnings.warn(depdoc, DeprecationWarning)
DeprecationWarning: `cs_graph_components` is deprecated!
In the future, use csgraph.connected_components. Note that this new function has a slightly different interface: see the docstring for more information.

======================================================================
FAIL: test_shortest_path.test_johnson
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/nose-1.1.2-py2.6.egg/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/rgommers/Code/scipy/scipy/sparse/csgraph/tests/test_shortest_path.py", line 87, in test_johnson
    assert_array_almost_equal(graph_J, graph_FW)
  File "/Users/rgommers/Code/numpy/numpy/testing/utils.py", line 846, in assert_array_almost_equal
    header=('Arrays are not almost equal to %d decimals' % decimal))
  File "/Users/rgommers/Code/numpy/numpy/testing/utils.py", line 677, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not almost equal to 6 decimals

(mismatch 77.0%)
 x: array([[ 0.        ,  0.13337944,  0.20894104,  0.0871293 ,  0.12027134,
         0.0871293 ,  0.21611891,  0.14037888,  0.07904145,  0.11088038,
         0.16157334,  0.13845474,  0.1512768 ,  0.19360491,  0.11827443,...
 y: array([[ 0.        ,  0.13337944,  0.15881136,  0.09756325,  0.11557586,
         0.0871293 ,  0.21611891,  0.13568341,  0.07904145,  0.11088038,
         0.13622686,  0.12322597,  0.16171075,  0.19360491,  0.11827443,...

======================================================================
FAIL: test_shortest_path.test_unweighted_path
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/nose-1.1.2-py2.6.egg/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/rgommers/Code/scipy/scipy/sparse/csgraph/tests/test_shortest_path.py", line 169, in test_unweighted_path
    assert_array_almost_equal(D1, D2)
  File "/Users/rgommers/Code/numpy/numpy/testing/utils.py", line 846, in assert_array_almost_equal
    header=('Arrays are not almost equal to %d decimals' % decimal))
  File "/Users/rgommers/Code/numpy/numpy/testing/utils.py", line 677, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not almost equal to 6 decimals

(mismatch 8.75%)
 x: array([[ 0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  2.,  1.,  1.,  1.,  1.,
         1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [ 1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  2.,  1.,...
 y: array([[ 0.        ,  1.        ,  1.        ,  1.        ,  1.        ,
         1.        ,  1.        ,  1.        ,  1.        ,  1.        ,
         1.        ,  1.        ,  1.        ,  1.00000019,  1.        ,...

----------------------------------------------------------------------
Ralf Gommers
Owner

Very nice tutorial.

Jake Vanderplas
Collaborator

Thanks for the feedback - I've made the small changes you pointed out.
I'm confused by the test failures - all the tests pass for me: I'm not sure how to debug what I can't reproduce! The fact that it's platform-dependent makes me worry that it's something insidious with the numpy types in cython.
Are you on a 32 bit or 64 bit system?

Ralf Gommers
Owner

I'm on 32-bit Python 2.6 on OS X 10.6 (which is 64-bit), with gcc 4.0. I'll try to find some time this week to debug. The last two failures look like simple numerical accuracy differences between platforms. The first one is a bit strange, it looks like numpy.deprecate doesn't play well with compiled functions.

Ralf Gommers TST: fix cs_graph_components test by filtering deprecation warning.
Also seed random numbers in another test, and fix a typo.
5c5dd3b
Ralf Gommers
Owner

The test_johnson and test_unweighted_path have in common that they fail with method='J' and directed=False. It doesn't look like a numerical accuracy issue, I can set decimal=3 and still have it fail. Failures occur on ~50% of test runs.

It would be helpful to add a few tests in test_shortest_path with hardcoded results I think. Right now there are only tests which check one method against the other, making it hard to see which method is failing/incorrect.

In the Notes on johnson you could add in what cases dijkstra is used under the hood.

Ralf Gommers
Owner

The test_graph_components.test_cs_graph_components error is caused by the deprecation warning itself, numpy master throws an error when it sees one. Sent you a PR for that.

Ralf Gommers
Owner

A test with negative edge weights (but no negative cycles) would also be good to add.
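
(Such a test could look roughly like this; the hard-coded graph and the standalone bellman_ford/johnson entry points are assumptions on my part.)

import numpy as np
from numpy.testing import assert_array_almost_equal
from scipy.sparse.csgraph import floyd_warshall, bellman_ford, johnson

def test_negative_weights_no_cycle():
    # Directed graph with one negative edge (1 -> 2) and no negative cycle.
    graph = np.array([[0.,  3.,  5.,  0.],
                      [0.,  0., -1.,  0.],
                      [0.,  0.,  0.,  2.],
                      [0.,  0.,  0.,  0.]])
    expected = floyd_warshall(graph, directed=True)
    assert_array_almost_equal(bellman_ford(graph, directed=True), expected)
    assert_array_almost_equal(johnson(graph, directed=True), expected)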

Ralf Gommers
Owner

Never mind my comment about Notes in johnson, that's spelled out clearly in the first paragraph of the docstring.

Ralf Gommers
Owner

Hey Jake, do you have some time for this soon? I'd like to aim for an 0.11 release as soon as this is merged.

As for the last two failures, I find them hard to track down. Also I'm not sure which of the methods is incorrect when I do find a difference, because they're only compared against each other. If you could add the correct results in the test, that would help find the problem on my system.

Jake Vanderplas
Collaborator
jakevdp commented May 06, 2012

Hi Ralf, I've been working all my nights and weekends trying to meet some deadlines, so this has fallen by the wayside. I'll at least try to shore up the testing stuff today.

Ralf Gommers
Owner

Don't overdo it - if you have no time, you have no time. Is there anything besides tests to be finished? I saw a graph_laplacian discussion on the scikit-learn ML but I'm not sure if any action is needed there.

Jake Vanderplas
Collaborator
jakevdp commented May 06, 2012

@rgommers - take a look at the new tests. I think they're more complete, and should give a better idea of where the failure lies.
One thing I found: I was getting some test failures on the predecessors that were due to multiple shortest paths being valid. The different methods came up with different correct solutions. I think that my hard-coded test case has solved that issue.

Jake Vanderplas
Collaborator
jakevdp commented May 06, 2012

I think the graph_laplacian issue was resolved: we're going to keep the current behavior unless there is a compelling reason to switch to the alternate formulation.

Jake Vanderplas
Collaborator
jakevdp commented May 06, 2012

There are a few things that would be nice to do before merge. This is what I've been thinking of:

  • the new connected components routine is slow: it uses the depth_first search, which allocates some extra arrays. It would be better to re-implement it without those extra storage arrays
  • I'd still eventually like to refactor and move more of the pure python code out of pyx modules. This would allow easier code examination using ?? magic in ipython. We'd have to be careful about validation in the exposed cython routines; if the current routines were mis-used, it could easily lead to memory errors.
  • we need to double-check that the bento stuff is still good to go. I think I've modified some of the pyx files since you did that.

The third point certainly needs to be addressed before merge. The first and second could be put off - they only change internal stuff.

Ralf Gommers
Owner

Test changes look good, and all tests pass now.

bento.info files also still look good.

Ralf Gommers
Owner

I agree that the first and second points are nice to have but not essential. Why not merge this PR now, and open a ticket for those two points assigned to you? Merging now gives it some time to settle and discover possible issues on a wider range of machines.

Jake Vanderplas
Collaborator
jakevdp commented May 06, 2012

I realized I hadn't put in any tests for masked input. Added those (and caught an error! Tests are good).

I agree with your thoughts - let's merge ASAP and give people a while to play with it, and leave the other two tasks as open tickets. I'll get to them when I can.

Ralf Gommers (rgommers) merged commit 6a2d895 on May 07, 2012
Ralf Gommers (rgommers) closed this on May 07, 2012
Ralf Gommers
Owner

Merged!

Thanks again Jake and everyone who reviewed. This is looking pretty damn good.

Jake Vanderplas
Collaborator
jakevdp commented May 07, 2012

Great! Thanks Ralf. I hope people find a chance to play with these routines and find bugs before the release!

Gael Varoquaux

Awesome. Congratulations to everyone involved, Jake in particular.

Vlad Niculae

The directed argument doesn't really exist: docstring is misleading.

Fixed in fd68897. Thanks Vlad.


Showing 56 unique commits by 3 authors.

Dec 23, 2011
Jake Vanderplas initial commit e4ec823
Jan 28, 2012
Jake Vanderplas Merge commit 'upstream/master' into sparse-graph 01480ed
Jake Vanderplas move scipy.sparse.graph to scipy.sparse.csgraph in order to follow previous convention 2305de1
Jan 30, 2012
Jake Vanderplas rename graph->dist_matrix b9f76b2
Jake Vanderplas clean up FibonacciNode states 57eed49
Feb 04, 2012
Jake Vanderplas add ability to return predecessor matrix 541fe2c
Feb 05, 2012
Jake Vanderplas add breadth-first search 5936f7e
Feb 06, 2012
Jake Vanderplas add depth-first search fdc8c62
Feb 08, 2012
Jake Vanderplas add depth-first and breadth-first tree functions 74c352c
Jake Vanderplas add minimum spanning tree routine a9e25e0
Jake Vanderplas better code documentation c24a99d
Feb 09, 2012
Jake Vanderplas add unweighted shortest path b6cb644
Jake Vanderplas add validation tools 09e9680
Jake Vanderplas add graph traversal tests 197348e
Feb 11, 2012
Jake Vanderplas cleanup, tests, doc e2ac520
Jake Vanderplas move csgraph code to private submodules c55bfbb
Jake Vanderplas update documentation d8ce15b
Jake Vanderplas Merge commit 'upstream/master' into sparse-graph 96596c6
Jake Vanderplas update release notes d59ee20
Jake Vanderplas move parameters to a common file e29642e
Jake Vanderplas use numpy for deprecated function 4463429
Jake Vanderplas remove print statements in tests befe01a
Jake Vanderplas add csgraph to API doc 9171ea2
Jake Vanderplas add scipy.sparse.csgraph to auto modules 86396b5
Jake Vanderplas update sparse submodule doc string 0417ff0
Feb 12, 2012
Ralf Gommers BLD: sparse.csgraph: fix numscons and bento builds. 8228f76
Jake Vanderplas move csgraph doc string to info 3a5b909
Jake Vanderplas validation -> _validation; tools -> _tools 04b64ea
Jake Vanderplas move Notes before Examples in doc 37fb4e0
Jake Vanderplas Fix typos and address doc string standards 8650134
Jake Vanderplas increase test coverage 3a4c528
Ralf Gommers MAINT: mark Cython-generated C files in sparse.csgraph as binary. 040b8d9
Ralf Gommers BUG: sparse.csgraph: fix missing underscore. 51e3dcd
Ralf Gommers TST: sparse.csgraph: add test() command in csgraph subpackage. 217fdbb
Jake Vanderplas remove info.py 95eea0f
Jake Vanderplas fix typos; remove np in doc strings e999d4b
Feb 18, 2012
Jake Vanderplas Add csgraph-specific conversion functions 00c1ce9
Feb 23, 2012
Jake Vanderplas Beginnings of Bellman-Ford implementation 9bb2882
Jake Vanderplas finish bellman-ford 3f424d0
Jake Vanderplas add bellman-ford tests c6193c4
Jake Vanderplas clean up bellman-ford implementation d43e40e
Jake Vanderplas Cleanup dijkstra implementation fa95681
Feb 24, 2012
Jake Vanderplas fix reconstruct_distances function 6483136
Mar 02, 2012
Jake Vanderplas implement Johnson's Algorithm ec293ab
Mar 08, 2012
Jake Vanderplas use canonical form of dense matrices across module a379593
Mar 12, 2012
Jake Vanderplas implement connected components in cython 7d6d641
Mar 31, 2012
Jake Vanderplas change dense interface 5e24322
Jake Vanderplas add csgraph tutorial eafe976
Jake Vanderplas update shortest path docstrings fbb093c
Apr 01, 2012
Jake Vanderplas fix typos and other issues 733a0d3
Jake Vanderplas add to csgraph tutorial 7f6f9e1
Apr 08, 2012
Ralf Gommers TST: fix cs_graph_components test by filtering deprecation warning.
Also seed random numbers in another test, and fix a typo.
5c5dd3b
Jake Vanderplas Merge pull request #2 from rgommers/sparse-graph
Fix test error due to deprecation warning.
62856ce
Jake Vanderplas Merge commit 'origin/sparse-graph' into sparse-graph beac2bb
May 06, 2012
Jake Vanderplas add definite test cases fd849cd
Jake Vanderplas add test for masked input 0c111e9