Sparse Graph submodule #119

Merged: 56 commits from jakevdp/sparse-graph into scipy:master on May 7, 2012

Member

jakevdp commented Dec 23, 2011

This is an initial pull request aimed at adding some graph routines based on the scipy sparse matrices (see mailing list discussion here: http://mail.scipy.org/pipermail/scipy-dev/2011-December/016773.html )

The initial commit includes some routines modified from the scikit-learn utility functions:

  • Dijkstra's algorithm w/ Fibonacci heaps
    used to compute shortest path, implemented in cython
  • Floyd-Warshall algorithm
    also used for shortest path, implemented in cython
  • graph Laplacian
    implemented in pure python, for both sparse and dense inputs

This is very much a work in progress: I'd love to have a discussion about what routines should be included here.

I'd like this package to be a compendium of very fast, very general graph algorithms based on the scipy.sparse data model (i.e., no custom graph/node classes). Though this won't allow for every graph algorithm to be included, I think it is the best fit to the scipy philosophy. I think any routines included here should be very general, and address use cases in derived libraries, e.g. scikit-learn, scikit-image, networkx, etc. I'd appreciate input from the developers of those libraries.
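
To make the data model concrete, here is a minimal sketch (purely illustrative, not code from this PR) of how a small weighted graph looks as a scipy sparse matrix:

import numpy as np
from scipy.sparse import csr_matrix

# A small directed, weighted graph on 4 nodes: entry (i, j) is the weight
# of the edge i -> j, and an (implicitly stored) zero means "no edge".
adjacency = np.array([[0, 1, 2, 0],
                      [0, 0, 0, 1],
                      [2, 0, 0, 3],
                      [0, 0, 0, 0]])
graph = csr_matrix(adjacency)  # the only data structure the routines need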

I have not benchmarked these algorithms against those in NetworkX, but I think they will compare favorably.

Owner

rgommers commented Dec 27, 2011

Looks good at first glance. Just commenting to CC myself.

Owner

teoliphant commented Jan 10, 2012

I have not reviewed in detail, but this looks like a very nice addition. Thank you! Register my +1

Member

jakevdp commented Jan 11, 2012

I'd be interested in comments from some NetworkX developers regarding other functionality that would be useful to include.

Member

jakevdp commented Jan 12, 2012

Just a quick benchmark against NetworkX. I think this is a fair comparison, but someone more familiar with NetworkX should double-check it:

In [1]: import networkx as nx

In [2]: import numpy as np

In [3]: from scipy.sparse.graph import graph_shortest_path

In [4]: G = nx.gnm_random_graph(100, 200)

In [5]: M = nx.to_numpy_matrix(G)

In [6]: timeit nx.shortest_path(G)
10 loops, best of 3: 23.1 ms per loop

In [7]: timeit graph_shortest_path(M)
100 loops, best of 3: 4.89 ms per loop

Member

jakevdp commented Jan 28, 2012

I moved the algorithms from the graph subdirectory to the csgraph subdirectory, and renamed them to unify with the convention set by cs_graph_components.

I haven't gotten any feedback on this from networkx developers. I'm curious if people have ideas for other algorithms that could fit in the compressed sparse graph framework.

What else needs to be done before merging?

dschult commented Jan 29, 2012

I took a look through the code and it seems like a good module. I have two suggestions/questions based on a first read-through; they are both about graph_shortest_path, which is what I spent time looking at.

  1. shouldn't the names be switched for dist_matrix and graph?
    The input matrix describes the graph and the output describes the
    shortest path length (distance) between pairs. So it seems like
    the input should be named graph and the output named dist_matrix.
    Minor I know--but maybe easier to read?

  2. Some users will want the path length and others want the path itself.
    We found in NetworkX that it worked well to have the same routine optionally
    return both a distance_matrix and a predecessor_matrix. Row i of the
    predecessor_matrix gives the predecessor of each node j in a shortest
    path from node i to node j. You can build paths quickly from the
    predecessor_matrix. Such a routine could be included in this module for speed,
    or left up to the user to post-process.
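
For example, here is a rough (untested) sketch of the kind of post-processing I mean, assuming unreachable entries in the predecessor matrix are marked with some negative sentinel value:

def build_path(predecessors, i, j, sentinel=-9999):
    # predecessors[i, k] is the node preceding k on a shortest path from i;
    # the sentinel marking "no predecessor" is just a placeholder assumption.
    path = [j]
    k = j
    while k != i:
        k = predecessors[i, k]
        if k == sentinel:
            return None  # j is not reachable from i
        path.append(k)
    return path[::-1]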

As for other algorithms, how about unweighted path lengths, depth first order
and breadth first order? How much do you want to put here vs in a specialty
package like NetworkX?

Member

jakevdp commented Jan 30, 2012

Thanks for the comments.

Good point on the naming. I'll switch that to be more clear

I'll think about how to modify the code to return the predecessor matrix as well if the user passes a return_paths flag.

An unweighted path length function is a good idea, and should be easy to implement.

Member

jakevdp commented Jan 30, 2012

How much functionality to include here is an interesting question. I started with these particular routines because they're used in scikit-learn, and folks on the scipy-dev mailing list thought that these algorithms were fundamental enough to warrant inclusion in scipy. I have to admit I have no idea how to draw a good line between what belongs here and what doesn't.

Owner

rgommers commented Feb 1, 2012

It's probably not possible to draw a clear line. As rules of thumb perhaps:

  • should have potential uses on more than one project that depends on scipy
  • should appear in an introductory algorithm book such as "Introduction to Algorithms", Cormen et al.

Member

jakevdp commented Feb 1, 2012

Thanks for the input, Ralf. In light of that, I think I'll work to add the functionality suggested above by @dschult and call it good for now. Those algorithms are fairly basic and broadly applicable, and most are used downstream in scikit-learn. If more use-cases come up in the future, I'd be happy to work on adding them in.

Owner

rgommers commented Feb 2, 2012

Sounds good.

Member

jakevdp commented Feb 4, 2012

I added functionality and tests to return the predecessor matrix in a shortest path search.
I'll do unweighted path lengths, breadth-first search, and depth-first search next.

Member

jakevdp commented Feb 6, 2012

@dschult - can you take a look at the depth-first and breadth-first search functions I added? Is this what you had in mind?

I still want to create a utility routine which will take the output of these and construct a sparse representation of the graph they represent.

dschult commented Feb 9, 2012

Indeed... those functions are what I had in mind. The expansion of the original routines also looks great.

The next thing I wonder about is perhaps not in the spirit of the original intent, but I'm curious: how dangerous would it be to use these routines on objects that aren't csr sparse matrices but look similar, i.e. have .shape, .indptr, .indices, .data?

I'm thinking of a graph data structure with csr storage that replaces the slicing features of spmatrix with more graph-like operations. Right now the cdef code would work on these objects so long as those 4 attributes are defined. Is it unlikely that the code would be rewritten so that it requires other csr_matrix features? If so, it might be nice to enable access to the cdef functions directly.

I'm not sure this is possible, or even wise, and I'm sure it is hackish. :) But my question is really about whether these routines are abstractable to csr representations that are not quite spmatrix.

Thanks---this is great stuff!
Dan

Member

jakevdp commented Feb 9, 2012

Dan,
I think you're right that the only attributes of the csr matrix that will be used by the routines are data, indices, indptr, and shape. We could certainly allow arbitrary objects to be passed which have these attributes, but we'd have to be careful. For example, a csc_matrix has all these attributes, but the code would have to decide whether the user means the csc matrix to represent an abstract graph object, or represent a regular matrix which should be converted to csr. It sounds kind of messy.
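
For reference, here is what those four attributes look like for a tiny graph (the values in the comments are what csr_matrix produces):

from scipy.sparse import csr_matrix

graph = csr_matrix([[0, 1, 2],
                    [0, 0, 3],
                    [4, 0, 0]])

graph.shape    # (3, 3)
graph.data     # [1 2 3 4]   edge weights, stored row by row
graph.indices  # [1 2 2 0]   column (head node) of each stored edge
graph.indptr   # [0 2 3 4]   row i's edges live in data[indptr[i]:indptr[i+1]]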

As far as providing user access to the cdef routines, I'd be hesitant at this point. Currently these routines don't perform any validation, and many have index-checking turned off for speed. This is fine because the input is pre-validated by the calling function, but allowing user-access to these could easily lead to segfaults, unless we turn on index checking, which will lead to a reduction in speed.

Interesting ideas, though!

dschult commented Feb 9, 2012

That all makes good sense!
Thanks again for putting this together.
Dan


Member

jakevdp commented Feb 9, 2012

I added some more tests and a better validation routine. This is getting pretty close, I think.
I still need to add some submodule-level documentation, as well as some broader test coverage.

@rgommers, should scipy packages & tests use relative imports, or explicit import paths? I'm not sure what the style guidelines are on that.

Owner

rgommers commented Feb 9, 2012

The tests should use full import paths, either from scipy.sparse import csgraph or from scipy.sparse.csgraph import func_a, func_b. The package (csgraph/__init__.py) can also use from graph_components import func_a, as you did. This is actually more common than the full path. Don't use relative imports with dots.

The way you structured it is very good I think. Just a few minor things to add to the docs:

  • You didn't add an import of csgraph in sparse/__init__.py, which I think is fine. But that means csgraph is a public module; you can document it as such in doc/API.rst.txt.
  • As you said, the package needs a slightly longer description in the docstring in __init__.py and a listing of functions.
  • The package should be added to the reference guide by adding a sparse.csgraph.rst file with an automodule statement in doc/source/, and an entry in index.rst.
  • A few docstrings can be completed, for example cs_graph_laplacian needs params/returns, and some examples would be nice. The breadth/depth first tests are easy to understand by looking at the input/output arrays, so they can be used directly as examples I think.

Owner

pv commented Feb 9, 2012

API nits:

I'd recommend prepending _ to all modules except those considered public. Otherwise, people will do from scipy.sparse.csgraph.graph_laplacian import cs_graph_laplacian, which makes things break later on when you move stuff around (whereas scipy.sparse.csgraph._graph_laplacian has a warning sign).

In a similar vein, if the functions are imported also to scipy.sparse, then it IMO would be better to rename csgraph to _csgraph. If the csgraph is intended as a public module, I wouldn't import the functions into scipy.sparse (and in this case it might be possible to drop the cs_graph_ prefixes from the functions, and prefer from scipy.sparse import csgraph).
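
Roughly this kind of layout is what I have in mind (the module and function names below are just placeholders):

# scipy/sparse/csgraph/__init__.py -- csgraph stays the public module;
# implementation modules get a leading underscore so that only the
# package-level names are part of the public API.
from _shortest_path import shortest_path   # i.e. no cs_graph_ prefix
from _laplacian import laplacian

__all__ = ['shortest_path', 'laplacian']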

Member

jakevdp commented Feb 9, 2012

Thanks for the feedback - I'll address those concerns soon.
The reason I used the cs_graph_ prefix was the convention set by the existing cs_graph_components function. I think it would make more sense to drop the cs_graph_ prefix in favor of keeping all the routines in the scipy.sparse.csgraph namespace, as @pv mentioned. This would require a link to the components function for backward compatibility, but is much cleaner overall in my opinion. Thoughts on that?

Owner

rgommers commented Feb 10, 2012

Good one Pauli, forgot about the underscores. +2 for those.

scipy.sparse.csgraph looks good to me.

Owner

pv commented Feb 10, 2012

Yep, dropping the prefixes and leaving the alias sounds OK. The alias could also be deprecated in due course.

Member

jakevdp commented Mar 31, 2012

@rgommers - finally responding to your feedback on validation:

Regarding the two flags: at the beginning of shortest_path, we need to validate the graph but we don't want to convert it, so both flags are set to True.

Regarding CSC matrices: all the sparse algorithms expect CSR matrices. Using the CSC as a CSR essentially flips the direction of each edge. That's fine in the case of undirected graphs (which is where we use the transpose for efficiency), but for directed graphs we need to convert to CSR.
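
A quick illustration of what I mean (just a check, not code from the PR): reading a CSC matrix's internal arrays as if they were CSR gives you the transposed graph, i.e. every edge reversed:

import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix(np.array([[0, 1, 0],
                         [0, 0, 2],
                         [3, 0, 0]]))
A_csc = A.tocsc()

# Reinterpret the CSC arrays as CSR: the result is A transposed,
# so each directed edge i -> j becomes j -> i.
B = csr_matrix((A_csc.data, A_csc.indices, A_csc.indptr), shape=A.shape)
assert np.array_equal(B.toarray(), A.T.toarray())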

Owner

rgommers commented Apr 1, 2012

The tests contain some plain asserts; those shouldn't be used. numpy.testing provides a replacement named assert_. Most of the time there are better alternatives though; for example, assert np.allclose should be assert_allclose. Can you change that in test_connected_components.py and test_spanning_tree.py?
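
For example, just to show the substitution I mean:

import numpy as np
from numpy.testing import assert_, assert_allclose

result = np.array([1.0, 2.0, 3.0])
expected = np.array([1.0, 2.0, 3.0])

# instead of:  assert np.allclose(result, expected)
assert_allclose(result, expected)

# instead of a bare:  assert result.shape == expected.shape
assert_(result.shape == expected.shape)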

Owner

rgommers commented Apr 1, 2012

I get 3 test failures:

======================================================================
ERROR: test_graph_components.test_cs_graph_components
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/nose-1.1.2-py2.6.egg/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/rgommers/Code/scipy/scipy/sparse/csgraph/tests/test_graph_components.py", line 8, in test_cs_graph_components
    n_comp, flag = csgraph.cs_graph_components(csr_matrix(D))
  File "/Users/rgommers/Code/numpy/numpy/lib/utils.py", line 139, in newfunc
    warnings.warn(depdoc, DeprecationWarning)
DeprecationWarning: `cs_graph_components` is deprecated!
In the future, use csgraph.connected_components. Note that this new function has a slightly different interface: see the docstring for more information.

======================================================================
FAIL: test_shortest_path.test_johnson
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/nose-1.1.2-py2.6.egg/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/rgommers/Code/scipy/scipy/sparse/csgraph/tests/test_shortest_path.py", line 87, in test_johnson
    assert_array_almost_equal(graph_J, graph_FW)
  File "/Users/rgommers/Code/numpy/numpy/testing/utils.py", line 846, in assert_array_almost_equal
    header=('Arrays are not almost equal to %d decimals' % decimal))
  File "/Users/rgommers/Code/numpy/numpy/testing/utils.py", line 677, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not almost equal to 6 decimals

(mismatch 77.0%)
 x: array([[ 0.        ,  0.13337944,  0.20894104,  0.0871293 ,  0.12027134,
         0.0871293 ,  0.21611891,  0.14037888,  0.07904145,  0.11088038,
         0.16157334,  0.13845474,  0.1512768 ,  0.19360491,  0.11827443,...
 y: array([[ 0.        ,  0.13337944,  0.15881136,  0.09756325,  0.11557586,
         0.0871293 ,  0.21611891,  0.13568341,  0.07904145,  0.11088038,
         0.13622686,  0.12322597,  0.16171075,  0.19360491,  0.11827443,...

======================================================================
FAIL: test_shortest_path.test_unweighted_path
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/nose-1.1.2-py2.6.egg/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/rgommers/Code/scipy/scipy/sparse/csgraph/tests/test_shortest_path.py", line 169, in test_unweighted_path
    assert_array_almost_equal(D1, D2)
  File "/Users/rgommers/Code/numpy/numpy/testing/utils.py", line 846, in assert_array_almost_equal
    header=('Arrays are not almost equal to %d decimals' % decimal))
  File "/Users/rgommers/Code/numpy/numpy/testing/utils.py", line 677, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not almost equal to 6 decimals

(mismatch 8.75%)
 x: array([[ 0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  2.,  1.,  1.,  1.,  1.,
         1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [ 1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  2.,  1.,...
 y: array([[ 0.        ,  1.        ,  1.        ,  1.        ,  1.        ,
         1.        ,  1.        ,  1.        ,  1.        ,  1.        ,
         1.        ,  1.        ,  1.        ,  1.00000019,  1.        ,...

----------------------------------------------------------------------

Very nice tutorial.

Member

jakevdp commented Apr 2, 2012

Thanks for the feedback - I've made the small changes you pointed out.
I'm confused by the test failures - all the tests pass for me, and I'm not sure how to debug what I can't reproduce! The fact that it's platform dependent makes me worry that it's something insidious with the numpy types in cython.
Are you on a 32-bit or 64-bit system?

Owner

rgommers commented Apr 2, 2012

I'm on 32-bit Python 2.6 on OS X 10.6 (which is 64-bit), with gcc 4.0. I'll try to find some time this week to debug. The last two failures look like simple numerical accuracy differences between platforms. The first one is a bit strange, it looks like numpy.deprecate doesn't play well with compiled functions.

TST: fix cs_graph_components test by filtering deprecation warning.
Also seed random numbers in another test, and fix a typo.

Owner

rgommers commented Apr 8, 2012

The test_johnson and test_unweighted_path have in common that they fail with method='J' and directed=False. It doesn't look like a numerical accuracy issue, I can set decimal=3 and still have it fail. Failures occur on ~50% of test runs.

It would be helpful to add a few tests in test_shortest_path with hardcoded results I think. Right now there are only tests which check one method against the other, making it hard to see which method is failing/incorrect.

In the Notes on johnson you could add in what cases dijkstra is used under the hood.

Owner

rgommers commented Apr 8, 2012

The test_graph_components.test_cs_graph_components error is caused by the deprecation warning itself, numpy master throws an error when it sees one. Sent you a PR for that.

Owner

rgommers commented Apr 8, 2012

A test with negative edge weights (but no negative cycles) would also be good to add.
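
Something along these lines, say (sketched against the shortest_path name and method flags as they currently stand in the branch, so treat the exact signature as an assumption):

import numpy as np
from numpy.testing import assert_array_almost_equal
from scipy.sparse.csgraph import shortest_path

def test_negative_weights():
    # Directed graph with one negative edge but no negative cycle.
    graph = np.array([[0.,  3., -1.,  0.],
                      [0.,  0.,  0.,  2.],
                      [0.,  1.,  0.,  4.],
                      [0.,  0.,  0.,  0.]])
    # Hard-coded expected distances (inf where a node is unreachable).
    expected = np.array([[0.,     0.,     -1.,    2.],
                         [np.inf, 0.,     np.inf, 2.],
                         [np.inf, 1.,     0.,     3.],
                         [np.inf, np.inf, np.inf, 0.]])
    for method in ('FW', 'J'):
        dist = shortest_path(graph, method=method, directed=True)
        assert_array_almost_equal(dist, expected)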

Owner

rgommers commented Apr 8, 2012

Never mind my comment about the Notes in johnson; that's spelled out clearly in the first paragraph of the docstring.

Owner

rgommers commented May 5, 2012

Hey Jake, do you have some time for this soon? I'd like to aim for an 0.11 release as soon as this is merged.

As for the last two failures, I find them hard to track down. Also I'm not sure which of the methods is incorrect when I do find a difference, because they're only compared against each other. If you could add the correct results in the test, that would help find the problem on my system.

Member

jakevdp commented May 6, 2012

Hi Ralf, I've been working all my nights and weekends trying to meet some deadlines, so this has fallen by the wayside. I'll at least try to shore up the testing stuff today.

Owner

rgommers commented May 6, 2012

Don't overdo it - if you have no time, you have no time. Is there anything besides tests to be finished? I saw a graph_laplacian discussion on the scikit-learn ML, but I'm not sure if any action is needed there.

Member

jakevdp commented May 6, 2012

@rgommers - take a look at the new tests. I think they're more complete, and should give a better idea of where the failure lies.
One thing I found: I was getting some test failures on the predecessors that were due to multiple shortest paths being valid. The different methods came up with different correct solutions. I think that my hard-coded test case has solved that issue.

Member

jakevdp commented May 6, 2012

I think the graph_laplacian issue was resolved: we're going to keep the current behavior unless there is a compelling reason to switch to the alternate formulation.

Member

jakevdp commented May 6, 2012

There are a few things that would be nice to do before merge. This is what I've been thinking of:

  • the new connected components routine is slow: it uses the depth-first search, which allocates some extra arrays. It would be better to re-implement it without those extra storage arrays.
  • I'd still eventually like to refactor and move more of the pure python code out of the pyx modules. This would allow easier code examination using the ?? magic in ipython. We'd have to be careful about validation in the exposed cython routines; if the current routines were misused, they could easily lead to memory errors.
  • we need to double-check that the bento stuff is still good to go. I think I've modified some of the pyx files since you did that.

The third point certainly needs to be addressed before merge. The first and second could be put off - they only change internal stuff.

Owner

rgommers commented May 6, 2012

Test changes look good, and all tests pass now.

bento.info files also still look good.

Owner

rgommers commented May 6, 2012

I agree that the first and second points are nice to have but not essential. Why not merge this PR now, and open a ticket for those two points assigned to you? Merging now gives it some time to settle and discover possible issues on a wider range of machines.

Member

jakevdp commented May 7, 2012

I realized I hadn't put in any tests for masked input. Added those (and caught an error! Tests are good).

I agree with your thoughts - let's merge ASAP and give people a while to play with it, and leave the other two tasks as open tickets. I'll get to them when I can.

rgommers added a commit that referenced this pull request May 7, 2012

Merge pull request #119 from jakevdp/sparse-graph
Adds ``csgraph`` as a submodule under scipy.sparse.

A few of these functions were written for scikit-learn, then this grew
to a complete submodule of common graph algorithms, all using sparse
matrices as data structure.

rgommers merged commit 6a2d895 into scipy:master on May 7, 2012

Owner

rgommers commented May 7, 2012

Merged!

Thanks again Jake and everyone who reviewed. This is looking pretty damn good.

Member

jakevdp commented May 7, 2012

Great! Thanks Ralf. I hope people find a chance to play with these routines and find bugs before the release!

Member

GaelVaroquaux commented May 7, 2012

Awesome. Congratulations to everyone involved, Jake in particular.

The directed argument doesn't really exist; the docstring is misleading.

Owner

rgommers replied Sep 23, 2012

Fixed in fd68897. Thanks Vlad.

Would it be possible to embed a nice drawing of the graph instead of an ASCII drawing?

The sphinx.ext.graphviz extension was written for this very purpose.
(Here is an example of a Sphinx page that uses this extension.)

If we can be sure that adding the extension to scipy doc's conf.py will not break anything, I will write these two graphs in the dot format and propose a PR.

Owner

pv replied Oct 11, 2015

Better to just embed an SVG/PNG image, so that graphviz doesn't become a requirement for building the docs.
