Merge branch 'master' into l1_logreg_minC

2 parents 27206e0 + cd5acaf · commit a426dbfbd381e83a1a3cd8f9a4dc100363091e02 · @paolo-losi committed Apr 25, 2011
@@ -1,8 +1,8 @@
.. _clustering:
-===================================================
+==========
Clustering
-===================================================
+==========
`Clustering <http://en.wikipedia.org/wiki/Cluster_analysis>`__ of
unlabeled data can be performed with the module :mod:`scikits.learn.cluster`.
@@ -15,7 +15,7 @@ data can be found in the `labels_` attribute.
.. currentmodule:: scikits.learn.cluster
-.. topic:: Input data
+.. topic:: Input data
One important thing to note is that the algorithms implemented in
this module take different kinds of matrices as input. On one hand,
@@ -41,7 +41,6 @@ be specified. It scales well to a large number of samples; however, its
results may depend on the initialisation.
-
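For instance, a minimal usage sketch (the ``k``, ``init``, ``n_init`` and
``max_iter`` names follow the ``k_means`` signature visible later in this
diff; the estimator interface itself is an assumption about this era of
the API)::

    import numpy as np
    from scikits.learn.cluster import KMeans

    # two well-separated blobs of points
    X = np.vstack([np.random.randn(50, 2),
                   np.random.randn(50, 2) + 5])

    km = KMeans(k=2, init='k-means++', n_init=10, max_iter=300)
    km.fit(X)
    print km.labels_    # cluster index of each sample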
Affinity propagation
====================
@@ -84,7 +83,7 @@ of clusters. It will have difficulty scaling to thousands of samples.
Spectral clustering
-====================
+===================
:class:`SpectralClustering` does a low-dimensional embedding of the
affinity matrix between samples, followed by a KMeans in the low
@@ -121,6 +120,24 @@ function of the gradient of the image.
* :ref:`example_cluster_plot_lena_segmentation.py`: Spectral clustering
to split the image of Lena into regions.
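A hedged usage sketch, building a Gaussian affinity matrix by hand (the
``k`` constructor argument and fitting directly on the affinity matrix
are assumptions about this era of the API)::

    import numpy as np
    from scikits.learn.cluster import SpectralClustering
    from scikits.learn.metrics.pairwise import euclidean_distances

    X = np.vstack([np.random.randn(30, 2),
                   np.random.randn(30, 2) + 4])

    # Gaussian (RBF) affinity: nearby samples get weights close to 1
    dists = euclidean_distances(X, X)
    affinity = np.exp(-dists ** 2 / (2. * dists.std() ** 2))

    sc = SpectralClustering(k=2)
    sc.fit(affinity)
    print sc.labels_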
+.. topic:: References:
+
+ * `"A Tutorial on Spectral Clustering"
+ <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.165.9323>`_
+ Ulrike von Luxburg, 2007
+
+ * `"Normalized cuts and image segmentation"
+ <http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.160.2324>`_
+ Jianbo Shi, Jitendra Malik, 2000
+
+ * `"A Random Walks View of Spectral Segmentation"
+ <http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.33.1501>`_
+ Marina Meila, Jianbo Shi, 2001
+
+ * `"On Spectral Clustering: Analysis and an algorithm"
+ <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.8100>`_
+ Andrew Y. Ng, Michael I. Jordan, Yair Weiss, 2001
+
.. _hierarchical_clustering:
@@ -132,27 +149,27 @@ build nested clusters by merging them successively. This hierarchy of
clusters is represented as a tree (or dendrogram). The root of the tree is
the unique cluster that gathers all the samples, the leaves being the
clusters with only one sample. See the `Wikipedia page
-<http://en.wikipedia.org/wiki/Hierarchical_clustering for more
-details>`_.
+<http://en.wikipedia.org/wiki/Hierarchical_clustering>`_ for more
+details.
-
-The :class:`Ward` object performs a hierarchical clustering based on Ward
-algorithm, that is a variance-minimizing approach. At each step, it
-minimizes the sum of squared differences within all clusters (inertia
-criterion).
+The :class:`Ward` object performs a hierarchical clustering based on
+the Ward algorithm, which is a variance-minimizing approach. At each
+step, it minimizes the sum of squared differences within all clusters
+(inertia criterion).
This algorithm can scale to a large number of samples when it is used jointly
with a connectivity matrix, but can be computationally expensive when no
connectivity constraints are added between samples: it considers at each step
all the possible merges.
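A minimal sketch of unstructured Ward clustering (the ``n_clusters``
constructor argument is an assumption about this era of the API)::

    import numpy as np
    from scikits.learn.cluster import Ward

    X = np.vstack([np.random.randn(20, 2),
                   np.random.randn(20, 2) + 3])

    # no connectivity matrix: every possible merge is considered
    ward = Ward(n_clusters=2)
    ward.fit(X)
    print ward.labels_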
-Adding connectivity constraints
-----------------------------------
+
+Adding connectivity constraints
+-------------------------------
An interesting aspect of the :class:`Ward` object is that connectivity
constraints can be added to this algorithm (only adjacent clusters can be
merged together), through a connectivity matrix that defines for each
-sample the neighboring samples following a given structure of the data. For
+sample the neighboring samples following a given structure of the data. For
instance, in the swiss-roll example below, the connectivity constraints
forbid the merging of points that are not adjacent on the swiss roll, and
thus avoid forming clusters that extend across overlapping folds of the
@@ -184,10 +201,10 @@ enable only merging of neighboring pixels on an image, as in the
.. topic:: Examples:
- * :ref:`example_cluster_plot_lena_ward_segmentation.py`: Ward clustering
+ * :ref:`example_cluster_plot_lena_ward_segmentation.py`: Ward clustering
to split the image of Lena into regions.
- * :ref:`example_cluster_plot_ward_structured_vs_unstructured.py`: Example of
+ * :ref:`example_cluster_plot_ward_structured_vs_unstructured.py`: Example of
Ward algorithm on a swiss-roll, comparison of structured approaches
versus unstructured approaches.
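A sketch of supplying such constraints as a sparse connectivity matrix
built from nearest neighbors (the ``connectivity`` keyword is an
assumption about this era of the API)::

    import numpy as np
    from scipy import sparse
    from scikits.learn.cluster import Ward
    from scikits.learn.metrics.pairwise import euclidean_distances

    X = np.random.randn(100, 3)

    # connect each sample to its 5 nearest neighbors (column 0 of the
    # argsort is the sample itself, so it is skipped)
    dists = euclidean_distances(X, X)
    neighbors = np.argsort(dists, axis=1)[:, 1:6]
    rows = np.repeat(np.arange(X.shape[0]), 5)
    connectivity = sparse.coo_matrix(
        (np.ones(rows.shape[0]), (rows, neighbors.ravel())))

    ward = Ward(n_clusters=4, connectivity=connectivity)
    ward.fit(X)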
@@ -12,6 +12,7 @@
from ..base import BaseEstimator
from ..metrics.pairwise import euclidean_distances
+from ..utils import make_rng
###############################################################################
@@ -52,8 +53,7 @@ def k_init(X, k, n_local_trials=None, rng=None, x_squared_norms=None):
which is the implementation used in the aforementioned paper.
"""
n_samples, n_features = X.shape
- if rng is None:
- rng = np.random
+ rng = make_rng(rng)
centers = np.empty((k, n_features))
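The body of ``make_rng`` is not shown in this diff; a plausible sketch,
by analogy with the validation helper that later became
``check_random_state`` (the exact behaviour here is an assumption)::

    import numpy as np

    def make_rng(rng):
        """Turn None, an int seed, or a RandomState into a RandomState."""
        if rng is None:
            return np.random.mtrand._rand    # numpy's global RandomState
        if isinstance(rng, (int, np.integer)):
            return np.random.RandomState(rng)
        if isinstance(rng, np.random.RandomState):
            return rng
        raise ValueError('%r cannot be used to seed a RandomState' % rng)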
@@ -80,8 +80,8 @@ def k_init(X, k, n_local_trials=None, rng=None, x_squared_norms=None):
for c in xrange(1, k):
# Choose center candidates by sampling with probability proportional
# to the squared distance to the closest existing center
- rand_vals = rng.random(n_local_trials) * current_pot
- candidate_ids = np.searchsorted(closest_dist_sq.cumsum(), rand_vals)
+ rand_vals = rng.random_sample(n_local_trials) * current_pot
+ candidate_ids = np.searchsorted(closest_dist_sq.cumsum(), rand_vals)
# Compute distances to center candidates
distance_to_candidates = euclidean_distances(
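The two rewritten lines above implement the k-means++ "D^2 weighting":
candidate centers are drawn with probability proportional to their
squared distance to the closest existing center. A self-contained numpy
illustration of that sampling step::

    import numpy as np

    rng = np.random.RandomState(0)
    closest_dist_sq = np.array([4.0, 1.0, 0.25, 9.0])  # D^2 per sample
    current_pot = closest_dist_sq.sum()                # total potential

    # uniform draws on [0, current_pot), mapped through the cumulative
    # D^2 -- indices come out with probability ~ closest_dist_sq
    rand_vals = rng.random_sample(3) * current_pot
    candidate_ids = np.searchsorted(closest_dist_sq.cumsum(), rand_vals)
    print candidate_ids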
@@ -181,8 +181,7 @@ def k_means(X, k, init='k-means++', n_init=10, max_iter=300, verbose=0,
The final value of the inertia criterion
"""
- if rng is None:
- rng = np.random
+ rng = make_rng(rng)
n_samples = X.shape[0]
vdata = np.mean(np.var(X, 0))