Merge branch 'master' into l1_logreg_minC
paolo-losi committed Apr 25, 2011
2 parents 27206e0 + cd5acaf commit a426dbf
Showing 8 changed files with 298 additions and 165 deletions.
51 changes: 34 additions & 17 deletions doc/modules/clustering.rst
@@ -1,8 +1,8 @@
.. _clustering:

-===================================================
+==========
Clustering
-===================================================
+==========

`Clustering <http://en.wikipedia.org/wiki/Cluster_analysis>`__ of
unlabeled data can be performed with the module :mod:`scikits.learn.cluster`.
@@ -15,7 +15,7 @@ data can be found in the `labels_` attribute.

.. currentmodule:: scikits.learn.cluster

-.. topic:: Input data
+.. topic:: Input data

One important thing to note is that the algorithms implemented in
this module take different kinds of matrices as input. On one hand,
@@ -41,7 +41,6 @@ be specified. It scales well to large number of samples, however its
results may be dependent on an initialisation.
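
A minimal usage sketch of the above. It is hedged: it assumes the era's
``KMeans`` estimator mirrors the ``k_means(X, k, init='k-means++',
n_init=10, ...)`` function signature visible in ``k_means_.py`` below,
and only the ``labels_`` attribute is documented in this excerpt::

    import numpy as np
    from scikits.learn.cluster import KMeans

    # Two well-separated blobs; the number of clusters k must be chosen
    # by the user up front.
    X = np.vstack([np.random.randn(50, 2),
                   np.random.randn(50, 2) + 5])

    # Several random restarts (n_init) mitigate the dependence of the
    # result on the initialisation noted above.
    km = KMeans(k=2, init='k-means++', n_init=10).fit(X)
    print km.labels_  # cluster index assigned to each sample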


-
Affinity propagation
====================

@@ -84,7 +83,7 @@ of cluster. It will have difficulties scaling to thousands of samples.


Spectral clustering
-====================
+===================

:class:`SpectralClustering` does a low-dimension embedding of the
affinity matrix between samples, followed by a KMeans in the low
@@ -121,6 +120,24 @@ function of the gradient of the image.
* :ref:`example_cluster_plot_lena_segmentation.py`: Spectral clustering
to split the image of lena in regions.

+.. topic:: References:
+
+ * `"A Tutorial on Spectral Clustering"
+   <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.165.9323>`_
+   Ulrike von Luxburg, 2007
+
+ * `"Normalized cuts and image segmentation"
+   <http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.160.2324>`_
+   Jianbo Shi, Jitendra Malik, 2000
+
+ * `"A Random Walks View of Spectral Segmentation"
+   <http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.33.1501>`_
+   Marina Meila, Jianbo Shi, 2001
+
+ * `"On Spectral Clustering: Analysis and an algorithm"
+   <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.8100>`_
+   Andrew Y. Ng, Michael I. Jordan, Yair Weiss, 2001
+
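A hedged sketch of the workflow this section describes (build an
affinity matrix, then cluster its low-dimensional embedding). It
assumes the era's ``SpectralClustering`` takes the number of clusters
as ``k`` and that ``fit`` accepts a precomputed (n_samples, n_samples)
affinity matrix; neither is confirmed by this excerpt::

    import numpy as np
    from scikits.learn.cluster import SpectralClustering

    X = np.vstack([np.random.randn(20, 2),
                   np.random.randn(20, 2) + 4])

    # Gaussian (RBF) affinity between all pairs of samples.
    sq_dists = ((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2).sum(-1)
    affinity = np.exp(-sq_dists / 2.0)

    model = SpectralClustering(k=2).fit(affinity)
    print model.labels_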

.. _hierarchical_clustering:

@@ -132,27 +149,27 @@ build nested clusters by merging them successively. This hierarchy of
clusters is represented as a tree (or dendrogram). The root of the tree is
the unique cluster that gathers all the samples, the leaves being the
clusters with only one sample. See the `Wikipedia page
-<http://en.wikipedia.org/wiki/Hierarchical_clustering for more
-details>`_.
+<http://en.wikipedia.org/wiki/Hierarchical_clustering>`_ for more
+details.


-The :class:`Ward` object performs a hierarchical clustering based on Ward
-algorithm, that is a variance-minimizing approach. At each step, it
-minimizes the sum of squared differences within all clusters (inertia
-criterion).
+The :class:`Ward` object performs a hierarchical clustering based on
+the Ward algorithm, that is a variance-minimizing approach. At each
+step, it minimizes the sum of squared differences within all clusters
+(inertia criterion).

This algorithm can scale to a large number of samples when it is used jointly
with a connectivity matrix, but can be computationally expensive when no
connectivity constraints are added between samples: it considers at each step
all the possible merges.
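
A hedged sketch of unstructured Ward clustering (it assumes the
``Ward`` estimator takes ``n_clusters`` and exposes ``labels_``, which
is consistent with the examples referenced below but not spelled out in
this excerpt)::

    import numpy as np
    from scikits.learn.cluster import Ward

    X = np.vstack([np.random.randn(30, 2),
                   np.random.randn(30, 2) + 3])

    # With no connectivity matrix, every possible merge is considered
    # at each step, which is the expensive regime described above.
    ward = Ward(n_clusters=2).fit(X)
    print ward.labels_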

-Adding connectivity constraints
-----------------------------------
+Adding connectivity constraints
+-------------------------------

An interesting aspect of the :class:`Ward` object is that connectivity
constraints can be added to this algorithm (only adjacent clusters can be
merged together), through a connectivity matrix that defines for each
-sample the neighboring samples following a given structure of the data. For
+sample the neighboring samples following a given structure of the data. For
instance, in the swiss-roll example below, the connectivity constraints
forbid the merging of points that are not adjacent on the swiss roll, and
thus avoid forming clusters that extend across overlapping folds of the
@@ -184,10 +201,10 @@ enable only merging of neighboring pixels on an image, as in the

.. topic:: Examples:

- * :ref:`example_cluster_plot_lena_ward_segmentation.py`: Ward clustering
+ * :ref:`example_cluster_plot_lena_ward_segmentation.py`: Ward clustering
to split the image of lena in regions.

- * :ref:`example_cluster_plot_ward_structured_vs_unstructured.py`: Example of
+ * :ref:`example_cluster_plot_ward_structured_vs_unstructured.py`: Example of
Ward algorithm on a swiss-roll, comparison of structured approaches
versus unstructured approaches.
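
To make the connectivity constraint concrete, here is a hedged sketch
of image segmentation in the spirit of the lena example above; it
assumes the ``grid_to_graph`` helper from
``scikits.learn.feature_extraction.image`` and a ``connectivity``
parameter on ``Ward``, as used by the referenced examples::

    import numpy as np
    from scikits.learn.cluster import Ward
    from scikits.learn.feature_extraction.image import grid_to_graph

    # A tiny synthetic grayscale image, one sample per pixel.
    img = np.random.rand(20, 20)
    X = img.reshape((-1, 1))

    # Sparse matrix linking each pixel to its grid neighbours: only
    # adjacent pixels may end up merged into the same cluster.
    connectivity = grid_to_graph(*img.shape)

    ward = Ward(n_clusters=5, connectivity=connectivity).fit(X)
    label_img = ward.labels_.reshape(img.shape)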

11 changes: 5 additions & 6 deletions scikits/learn/cluster/k_means_.py
@@ -12,6 +12,7 @@

from ..base import BaseEstimator
from ..metrics.pairwise import euclidean_distances
+from ..utils import make_rng


###############################################################################
@@ -52,8 +53,7 @@ def k_init(X, k, n_local_trials=None, rng=None, x_squared_norms=None):
    which is the implementation used in the aforementioned paper.
    """
    n_samples, n_features = X.shape
-    if rng is None:
-        rng = np.random
+    rng = make_rng(rng)

    centers = np.empty((k, n_features))
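
This hunk swaps the inline ``if rng is None: rng = np.random`` check
for the shared ``make_rng`` helper imported above. The helper's body is
not shown in this excerpt; a plausible reconstruction (an assumption,
following the conventional pattern for normalising a seed argument into
a ``numpy.random.RandomState``) is::

    import numpy as np

    def make_rng(rng):
        # Hypothetical reconstruction: not the helper's actual body.
        if rng is None:
            return np.random.mtrand._rand  # numpy's global RandomState
        if isinstance(rng, int):
            return np.random.RandomState(rng)
        if isinstance(rng, np.random.RandomState):
            return rng
        raise ValueError('%r cannot be used to seed a RandomState' % rng)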

@@ -80,8 +80,8 @@ def k_init(X, k, n_local_trials=None, rng=None, x_squared_norms=None):
    for c in xrange(1, k):
        # Choose center candidates by sampling with probability proportional
        # to the squared distance to the closest existing center
-        rand_vals = rng.random(n_local_trials) * current_pot
-        candidate_ids = np.searchsorted(closest_dist_sq.cumsum(), rand_vals)
+        rand_vals = rng.random_sample(n_local_trials) * current_pot
+        candidate_ids = np.searchsorted(closest_dist_sq.cumsum(), rand_vals)

        # Compute distances to center candidates
        distance_to_candidates = euclidean_distances(
@@ -181,8 +181,7 @@ def k_means(X, k, init='k-means++', n_init=10, max_iter=300, verbose=0,
        The final value of the inertia criterion
    """
-    if rng is None:
-        rng = np.random
+    rng = make_rng(rng)
    n_samples = X.shape[0]

    vdata = np.mean(np.var(X, 0))
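
The two-line change in ``k_init`` above is a rename to NumPy's actual
``RandomState.random_sample`` method, but those lines are the heart of
k-means++ seeding: candidates are drawn with probability proportional
to their squared distance to the closest existing center. A standalone,
illustrative-only sketch of that sampling step::

    import numpy as np

    rng = np.random.RandomState(0)
    closest_dist_sq = np.array([0.0, 4.0, 1.0, 9.0])  # D^2 per sample
    current_pot = closest_dist_sq.sum()
    n_local_trials = 2

    # Inverse-transform sampling on the cumulative D^2 weights: points
    # far from every existing center are the most likely candidates.
    rand_vals = rng.random_sample(n_local_trials) * current_pot
    candidate_ids = np.searchsorted(closest_dist_sq.cumsum(), rand_vals)
    print candidate_ids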
