Merge pull request #119 from jakevdp/sparse-graph

Adds ``csgraph`` as a submodule under scipy.sparse.

A few of these functions were originally written for scikit-learn; the work
then grew into a complete submodule of common graph algorithms, all using
sparse matrices as the underlying data structure.
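
As a rough editorial sketch of the data model described above (not part of this
commit), a graph is handed to the new routines as a SciPy sparse adjacency
matrix; the tiny directed graph and its weights below are made up purely for
illustration:

    # Editorial sketch: a small directed, weighted graph stored as a CSR
    # adjacency matrix; a nonzero entry (i, j) is the weight of edge i -> j.
    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import shortest_path

    weights = np.array([[0, 1, 2, 0],
                        [0, 0, 0, 1],
                        [0, 0, 0, 3],
                        [0, 0, 0, 0]])
    graph = csr_matrix(weights)

    # All-pairs shortest path lengths computed on the sparse graph.
    dist_matrix = shortest_path(graph, directed=True)
    print(dist_matrix[0, 3])   # 2.0, via the path 0 -> 1 -> 3
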
2 parents 92f62aa + 0c111e9, commit 6a2d89505a07ebd9461e12e6be7f2bf55fd3e772, committed by @rgommers on May 7, 2012
Showing with 45,862 additions and 41 deletions.
  1. +4 −0 .gitattributes
  2. +1 −0 doc/API.rst.txt
  3. +25 −0 doc/release/0.11.0-notes.rst
  4. +1 −0 doc/source/index.rst
  5. +1 −0 doc/source/sparse.csgraph.rst
  6. +212 −0 doc/source/tutorial/csgraph.rst
  7. +1 −0 doc/source/tutorial/index.rst
  8. +9 −5 scipy/sparse/__init__.py
  9. +4 −2 scipy/sparse/bento.info
  10. +11 −0 scipy/sparse/csgraph/SConscript
  11. +2 −0 scipy/sparse/csgraph/SConstruct
  12. +165 −0 scipy/sparse/csgraph/__init__.py
  13. +6 −12 scipy/sparse/{csgraph.py → csgraph/_components.py}
  14. +135 −0 scipy/sparse/csgraph/_laplacian.py
  15. +6,131 −0 scipy/sparse/csgraph/_min_spanning_tree.c
  16. +142 −0 scipy/sparse/csgraph/_min_spanning_tree.pyx
  17. +16,151 −0 scipy/sparse/csgraph/_shortest_path.c
  18. +1,239 −0 scipy/sparse/csgraph/_shortest_path.pyx
  19. +9,102 −0 scipy/sparse/csgraph/_tools.c
  20. +414 −0 scipy/sparse/csgraph/_tools.pyx
  21. +10,942 −0 scipy/sparse/csgraph/_traversal.c
  22. +681 −0 scipy/sparse/csgraph/_traversal.pyx
  23. +55 −0 scipy/sparse/csgraph/_validation.py
  24. +9 −0 scipy/sparse/csgraph/bento.info
  25. +11 −0 scipy/sparse/csgraph/parameters.pxi
  26. +28 −0 scipy/sparse/csgraph/setup.py
  27. +16 −0 scipy/sparse/csgraph/setupscons.py
  28. +47 −0 scipy/sparse/csgraph/tests/test_connected_components.py
  29. +54 −0 scipy/sparse/csgraph/tests/test_conversions.py
  30. +33 −0 scipy/sparse/csgraph/tests/test_graph_components.py
  31. +27 −0 scipy/sparse/csgraph/tests/test_graph_laplacian.py
  32. +140 −0 scipy/sparse/csgraph/tests/test_shortest_path.py
  33. +17 −0 scipy/sparse/csgraph/tests/test_spanning_tree.py
  34. +44 −0 scipy/sparse/csgraph/tests/test_traversal.py
  35. +1 −0 scipy/sparse/setup.py
  36. +1 −0 scipy/sparse/setupscons.py
  37. +0 −22 scipy/sparse/tests/test_spfuncs.py
@@ -11,6 +11,10 @@ scipy/io/matlab/mio_utils.c binary
scipy/io/matlab/mio5_utils.c binary
scipy/io/matlab/streams.c binary
scipy/signal/spectral.c binary
+scipy/sparse/csgraph/_min_spanning_tree.c binary
+scipy/sparse/csgraph/_shortest_path.c binary
+scipy/sparse/csgraph/_tools.c binary
+scipy/sparse/csgraph/_traversal.c binary
scipy/spatial/ckdtree.c binary
scipy/spatial/qhull.c binary
scipy/special/lambertw.c binary
@@ -117,6 +117,7 @@ change is made.
* scipy.sparse
- linalg
+ - csgraph
* scipy.spatial
@@ -86,9 +86,33 @@ A function for creating Pascal matrices, ``scipy.linalg.pascal``, was added.
``misc.logsumexp`` now takes an optional ``axis`` keyword argument.
+Sparse Graph Submodule
+----------------------
+The new submodule :mod:`scipy.sparse.csgraph` implements a number of efficient
+graph algorithms for graphs stored as sparse adjacency matrices. Available
+routines are:
+
+ - :func:`connected_components` - determine connected components of a graph
+ - :func:`laplacian` - compute the Laplacian of a graph
+ - :func:`shortest_path` - compute the shortest path between points on a
+ graph with positive edge weights
+ - :func:`dijkstra` - use Dijkstra's algorithm for shortest path
+ - :func:`floyd_warshall` - use the Floyd-Warshall algorithm for
+ shortest path
+ - :func:`breadth_first_order` - compute a breadth-first order of nodes
+ - :func:`depth_first_order` - compute a depth-first order of nodes
+ - :func:`breadth_first_tree` - construct the breadth-first tree from
+ a given node
+ - :func:`depth_first_tree` - construct a depth-first tree from a given node
+ - :func:`minimum_spanning_tree` - construct the minimum spanning
+ tree of a graph
+
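To make the list above concrete, here is a brief editorial sketch (not part of
the release notes) exercising two of the listed routines on a toy undirected
graph; the node count and edge weights are illustrative assumptions:

    # Editorial sketch: connected components and a minimum spanning tree
    # on a toy undirected graph (weights chosen only for illustration).
    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components, minimum_spanning_tree

    # Symmetric adjacency matrix for nodes 0-4; node 4 has no edges.
    edges = np.array([[0, 3, 1, 0, 0],
                      [3, 0, 2, 0, 0],
                      [1, 2, 0, 4, 0],
                      [0, 0, 4, 0, 0],
                      [0, 0, 0, 0, 0]])
    graph = csr_matrix(edges)

    n_components, labels = connected_components(graph, directed=False)
    print(n_components)    # 2: nodes 0-3 form one component, node 4 another

    mst = minimum_spanning_tree(graph)
    print(mst.toarray())   # keeps edges 0-2 (weight 1), 1-2 (2), and 2-3 (4)
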
Deprecated features
===================
+``scipy.sparse.cs_graph_components`` has been made a part of the sparse graph
+submodule, and renamed to ``scipy.sparse.csgraph.connected_components``.
+Calling the former routine will result in a deprecation warning.
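A short editorial migration sketch for the rename described above (the
two-node graph is only an illustration):

    # Editorial sketch of the deprecation/rename noted in the release notes.
    from scipy.sparse import csr_matrix

    graph = csr_matrix([[0, 1], [1, 0]])

    # Old spelling (deprecated in 0.11; calling it emits a DeprecationWarning):
    # from scipy.sparse import cs_graph_components
    # n_comp, labels = cs_graph_components(graph)

    # New spelling under the csgraph submodule:
    from scipy.sparse.csgraph import connected_components
    n_comp, labels = connected_components(graph)
    print(n_comp)   # 1: the two connected nodes form a single component
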
``scipy.misc.radon`` has been deprecated. A more full-featured radon transform
can be found in scikits-image.
@@ -122,3 +146,4 @@ added.
Authors
=======
+Jake Vanderplas <vanderplas@hail.astro.washington.edu>, sparse graph submodule
@@ -38,6 +38,7 @@ Reference
signal
sparse
sparse.linalg
+ sparse.csgraph
spatial
special
stats
@@ -0,0 +1 @@
+.. automodule:: scipy.sparse.csgraph
@@ -0,0 +1,212 @@
+Compressed Sparse Graph Routines `scipy.sparse.csgraph`
+=======================================================
+
+.. sectionauthor:: Jake Vanderplas <vanderplas@astro.washington.edu>
+
+.. currentmodule:: scipy.sparse.csgraph
+
+
+Example: Word Ladders
+---------------------
+
+A `Word Ladder <http://en.wikipedia.org/wiki/Word_ladder>`_ is a word game
+invented by Lewis Carroll in which players find paths between words by
+switching one letter at a time. For example, one can link "ape" and "man"
+in the following way:
+
+.. math::
+ {\rm ape \to apt \to ait \to bit \to big \to bag \to mag \to man}
+
+Note that each step involves changing just one letter of the word. This is
+just one possible path from "ape" to "man", but is it the shortest possible
+path? If we desire to find the shortest word ladder path between two given
+words, the sparse graph submodule can help.
+
+First we need a list of valid words. Many operating systems have such a list
+built-in. For example, on Linux, a word list can often be found at one of the
+following locations::
+
+ /usr/share/dict
+ /var/lib/dict
+
+Another easy source of words is the Scrabble word lists available at various
+sites around the internet (search with your favorite search engine). We'll
+first create this list. The system word lists consist of a file with one
+word per line. The following should be modified to use the particular word
+list you have available::
+
+ >>> word_list = open('/usr/share/dict/words').readlines()
+ >>> word_list = map(str.strip, word_list)
+
+We want to look at words of length 3, so let's select just those words of the
+correct length. We'll also eliminate words which start with an upper-case
+letter (proper nouns) or contain non-alphabetic characters such as apostrophes
+and hyphens. Finally, we'll make sure everything is lower-case for comparison
+later::
+
+ >>> word_list = [word for word in word_list if len(word) == 3]
+ >>> word_list = [word for word in word_list if word[0].islower()]
+ >>> word_list = [word for word in word_list if word.isalpha()]
+ >>> word_list = map(str.lower, word_list)
+ >>> len(word_list)
+ 586
+
+Now we have a list of 586 valid three-letter words (the exact number may
+change depending on the particular list used). Each of these words will
+become a node in our graph, and we will create edges connecting the nodes
+associated with each pair of words which differs by only one letter.
+
+There are efficient and inefficient ways to do this. To do it as efficiently
+as possible, we're going to use some sophisticated numpy array
+manipulation::
+
+ >>> import numpy as np
+ >>> word_list = np.asarray(word_list)
+ >>> word_list.dtype
+ dtype('|S3')
+ >>> word_list.sort() # sort for quick searching later
+
+We have an array where each entry is three bytes. We'd like to find all pairs
+where exactly one byte is different. We'll start by converting each word to
+a three-dimensional vector:
+
+ >>> word_bytes = np.ndarray((word_list.size, word_list.itemsize),
+ ... dtype='int8',
+ ... buffer=word_list.data)
+ >>> word_bytes.shape
+ (586, 3)
+
+Now we'll use the
+`Hamming distance <http://en.wikipedia.org/wiki/Hamming_distance>`_
+between each point to determine which pairs of words are connected.
+The Hamming distance measures the fraction of entries which differ between
+two vectors: any two words with a Hamming distance equal to :math:`1/N`,
+where :math:`N` is the number of letters, are connected in the word ladder::
+
+ >>> from scipy.spatial.distance import pdist, squareform
+ >>> from scipy.sparse import csr_matrix
+ >>> hamming_dist = pdist(word_bytes, metric='hamming')
+ >>> graph = csr_matrix(squareform(hamming_dist < 1.5 / word_list.itemsize))
+
+When comparing the distances, we don't use an equality because this can be
+unstable for floating point values. The inequality produces the desired
+result as long as no two entries of the word list are identical. Now that our
+graph is set up, we'll use a shortest path search to find the path between
+any two words in the graph::
+
+ >>> i1 = word_list.searchsorted('ape')
+ >>> i2 = word_list.searchsorted('man')
+ >>> word_list[i1]
+ 'ape'
+ >>> word_list[i2]
+ 'man'
+
+We need to check that these match, because if either word is not in the list,
+that will not be the case. Now all we need is to find the shortest path
+between these two indices in the graph. We'll use Dijkstra's algorithm,
+because it allows us to find the path for just one node::
+
+ >>> from scipy.sparse.csgraph import dijkstra
+ >>> distances, predecessors = dijkstra(graph, indices=i1,
+ ... return_predecessors=True)
+ >>> print distances[i2]
+ 5.0
+
+So we see that the shortest path between 'ape' and 'man' contains only
+five steps. We can use the predecessors returned by the algorithm to
+reconstruct this path::
+
+ >>> path = []
+ >>> i = i2
+ >>> while i != i1:
+ ...     path.append(word_list[i])
+ ...     i = predecessors[i]
+ >>> path.append(word_list[i1])
+ >>> print path[::-1]
+ ['ape', 'apt', 'opt', 'oat', 'mat', 'man']
+
+This is two fewer links than our initial example: the path from ape to man
+is only five steps.
+
+Using other tools in the module, we can answer other questions. For example,
+are there three-letter words which are not linked in a word ladder? This
+is a question of connected components in the graph::
+
+ >>> from scipy.sparse.csgraph import connected_components
+ >>> N_components, component_list = connected_components(graph)
+ >>> print N_components
+ 15
+
+In this particular sample of three-letter words, there are 15 connected
+components: that is, 15 distinct sets of words with no paths between the
+sets. How many words are in each of these sets? We can learn this from
+the list of components::
+
+ >>> [np.sum(component_list == i) for i in range(15)]
+ [571, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
+
+There is one large connected set, and 14 smaller ones. Let's look at the
+words in the smaller ones::
+
+ >>> [list(word_list[np.where(component_list == i)]) for i in range(1, 15)]
+ [['aha'],
+ ['chi'],
+ ['ebb'],
+ ['ems', 'emu'],
+ ['gnu'],
+ ['ism'],
+ ['khz'],
+ ['nth'],
+ ['ova'],
+ ['qua'],
+ ['ugh'],
+ ['ups'],
+ ['urn'],
+ ['use']]
+
+These are all the three-letter words which do not connect to others via a word
+ladder.
+
+We might also be curious about which words are maximally separated. Which
+two words take the most links to connect? We can determine this by computing
+the matrix of all shortest paths. Note that by convention, the
+distance between two non-connected points is reported to be infinity, so
+we'll need to remove these before finding the maximum::
+
+ >>> distances, predecessors = dijkstra(graph, return_predecessors=True)
+ >>> np.max(distances[~np.isinf(distances)])
+ 13.0
+
+So there is at least one pair of words which takes 13 steps to get from one
+to the other! Let's determine which they are::
+
+ >>> i1, i2 = np.where(distances == 13)
+ >>> zip(word_list[i1], word_list[i2])
+ [('imp', 'ohm'),
+ ('imp', 'ohs'),
+ ('ohm', 'imp'),
+ ('ohm', 'ump'),
+ ('ohs', 'imp'),
+ ('ohs', 'ump'),
+ ('ump', 'ohm'),
+ ('ump', 'ohs')]
+
+We see that the maximally separated words fall into two groups: 'imp' and
+'ump' on one side, and 'ohm' and 'ohs' on the other. We can find the
+connecting path in the same way as above::
+
+ >>> path = []
+ >>> i = i2[0]
+ >>> while i != i1[0]:
+ ...     path.append(word_list[i])
+ ...     i = predecessors[i1[0], i]
+ >>> path.append(word_list[i1[0]])
+ >>> print path[::-1]
+ ['imp', 'amp', 'asp', 'ask', 'ark', 'are', 'aye', 'rye', 'roe', 'woe', 'woo', 'who', 'oho', 'ohm']
+
+This gives us the path we desired to see.
+
+Word ladders are just one potential application of scipy's fast graph
+algorithms for sparse matrices. Graph theory makes appearances in many
+areas of mathematics, data analysis, and machine learning. The sparse graph
+tools are flexible enough to handle many of these situations.
@@ -17,6 +17,7 @@ SciPy Tutorial
signal
linalg
arpack
+ csgraph
stats
ndimage
io
@@ -61,12 +61,14 @@
isspmatrix_coo
isspmatrix_dia
-Graph algorithms:
+Submodules
+----------
.. autosummary::
:toctree: generated/
- cs_graph_components -- Determine connected components of a graph
+ csgraph - Compressed sparse graph routines
+ linalg - Sparse linear algebra routines
Exceptions
----------
@@ -171,7 +173,8 @@
"""
# Original code by Travis Oliphant.
-# Modified and extended by Ed Schofield, Robert Cimrman, and Nathan Bell.
+# Modified and extended by Ed Schofield, Robert Cimrman,
+# Nathan Bell, and Jake Vanderplas.
from base import *
from csr import *
@@ -181,11 +184,12 @@
from coo import *
from dia import *
from bsr import *
-from csgraph import *
-
from construct import *
from extract import *
+# for backward compatibility with v0.10. This function is marked as deprecated
+from csgraph import cs_graph_components
+
#from spfuncs import *
__all__ = filter(lambda s:not s.startswith('_'),dir())
@@ -1,8 +1,10 @@
Recurse:
linalg,
- sparsetools
+ sparsetools,
+ csgraph
Library:
Packages:
linalg,
- sparsetools
+ sparsetools,
+ csgraph
@@ -0,0 +1,11 @@
+# vim:syntax=python
+from os.path import join as pjoin
+
+from numscons import GetNumpyEnvironment, CheckF77Clib
+
+env = GetNumpyEnvironment(ARGUMENTS)
+
+env.NumpyPythonExtension('_shortest_path', source = ['_shortest_path.c'])
+env.NumpyPythonExtension('_traversal', source = ['_traversal.c'])
+env.NumpyPythonExtension('_min_spanning_tree', source = ['_min_spanning_tree.c'])
+env.NumpyPythonExtension('_tools', source = ['_tools.c'])
@@ -0,0 +1,2 @@
+from numscons import GetInitEnvironment
+GetInitEnvironment(ARGUMENTS).DistutilsSConscript('SConscript')