[ENH] ward_tree: return distances #3158

Merged
merged 12 commits into scikit-learn:master from mvdoc:clustering on Dec 18, 2014

Conversation

Contributor

mvdoc commented May 18, 2014

Modified ward_tree to return distances for both the structured and unstructured versions,
which is useful when plotting the dendrogram with scipy.
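
For instance, a minimal sketch of that use case (assuming the return_distance keyword this PR eventually settles on, and scipy's linkage-matrix convention [child_1, child_2, distance, n_leaves_under_node]):

import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster.hierarchical import ward_tree

X = np.random.RandomState(0).rand(20, 3)
children, n_components, n_leaves, parents, distances = ward_tree(
    X, return_distance=True)

# count the leaves under each merged node, as scipy expects
counts = np.zeros(children.shape[0])
for i, (left, right) in enumerate(children):
    for child in (left, right):
        counts[i] += 1 if child < n_leaves else counts[child - n_leaves]

Z = np.column_stack([children, distances, counts]).astype(float)
dendrogram(Z)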

Contributor

mvdoc commented May 18, 2014

Please check: is the square root of the 'inertia' distance for the structured version comparable to the distance of the unstructured version?

Coverage Status

Coverage decreased (-0.01%) when pulling 9e2640b on mvdoc:clustering into 8523766 on scikit-learn:master.

Owner

agramfort commented May 18, 2014

you'll need to add a test

is there a performance decrease on medium-sized data even when the height is not returned?

Contributor

mvdoc commented May 18, 2014

For the structured version I simply return the inertia value that the algorithm computes at each merge, and for the unstructured version the actual height from scipy, so I'm not sure what I should test, assuming the original algorithm is correct.

Also, the unstructured algorithm and the structured algorithm given a complete graph do not yield identical results on the same data, in terms of merged clusters and distances.

I will check the performance.

Owner

GaelVaroquaux commented May 18, 2014

For the structured version I simply return the inertia value that the algorithm computes at each merge, and for the unstructured version the actual height from scipy, so I'm not sure what I should test, assuming the original algorithm is correct.

You could compare both implementations with a complete graph.
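
Concretely, the comparison could look like this (a sketch, using the return_distance keyword the PR eventually settles on rather than the return_height name used at this point in the thread):

import numpy as np
from sklearn.cluster.hierarchical import ward_tree

X = np.random.RandomState(0).rand(30, 2)
connectivity = np.ones((30, 30))  # complete graph

out_unstructured = ward_tree(X, return_distance=True)
out_structured = ward_tree(X, connectivity, return_distance=True)
np.testing.assert_array_almost_equal(out_unstructured[-1],
                                     out_structured[-1])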

Also, the unstructured algorithm and the structured algorithm given a complete graph do not yield identical results on the same data, in terms of merged clusters and distances.

I believe that if there are no draws in your distance matrix, the results
should be exactly the same. I believe that there is even a test comparing
both.

Contributor

mvdoc commented May 18, 2014

I believe that if there are no draws in your distance matrix, the results should be exactly the same. I believe that there is even a test comparing both.

You are right, my bad. I will write the test. Thanks for the suggestions.

Coverage Status

Coverage increased (+0.0%) when pulling 0f8f660 on mvdoc:clustering into 8523766 on scikit-learn:master.

Contributor

mvdoc commented May 19, 2014

I added a test to check that structured and unstructured versions return the same distances.

In order to make them equal, the inertia must be multiplied by 2 and the square root taken. This is because the unstructured version uses the euclidean distance, while the current implementation of the structured version accumulates half the squared euclidean distance.
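
A quick check of that scaling on two singleton clusters (a sketch; Ward's merge cost for two single points is half the squared euclidean distance):

import numpy as np

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
inertia = np.sum((a - b) ** 2) / 2   # what the structured version accumulates
print(np.sqrt(2 * inertia))          # 5.0
print(np.linalg.norm(a - b))         # 5.0, the unstructured (scipy) distance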

Tomorrow I will check the performance. @agramfort, could you define the dimensions of a medium-sized dataset?

Contributor

mvdoc commented May 19, 2014

Returning the height doesn't seem to affect performance much.

See the simulation results below (average timing over 5 repetitions for each n_samples). "No Height" means return_height=False; "Height" means return_height=True.

The structured version was run on a complete graph.

Type \ n_samples            10              100             1,000
Previous Unstructured       1.10006332e-04  5.39937019e-03  8.03353009e+00
New Unstructured No Height  3.68976593e-04  5.39817810e-03  8.02031279e+00
New Unstructured Height     9.91344452e-05  5.23843765e-03  7.94552531e+00
Previous Structured         3.06277275e-03  2.26485062e-01  4.26110800e+01
New Structured No Height    3.67021561e-03  2.42472601e-01  4.45632022e+01
New Structured Height       3.91120911e-03  2.65254784e-01  4.41243954e+01
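
A harness along these lines could reproduce the structured rows of the benchmark (a sketch; n_features, the RNG seed, and the use of the final return_distance name are assumptions — dropping connectivity gives the unstructured rows):

import numpy as np
from timeit import default_timer as timer
from sklearn.cluster.hierarchical import ward_tree

rng = np.random.RandomState(42)
for n_samples in (10, 100, 1000):
    X = rng.rand(n_samples, 10)
    connectivity = np.ones((n_samples, n_samples))  # complete graph
    for kwargs in ({}, {'return_distance': True}):
        times = []
        for _ in range(5):
            start = timer()
            ward_tree(X, connectivity, **kwargs)
            times.append(timer() - start)
        print('structured', n_samples, kwargs, np.mean(times))
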
Owner

agramfort commented May 20, 2014

Returning the height doesn't seem to affect performance much.

great.

Coverage Status

Coverage increased (+0.0%) when pulling b6e7977 on mvdoc:clustering into 8523766 on scikit-learn:master.

Owner

yarikoptic commented Dec 11, 2014

I wondered if there is anything else left to do to bless this PR for merge?

@agramfort agramfort commented on an outdated diff Dec 11, 2014

sklearn/cluster/hierarchical.py
@@ -121,6 +121,9 @@ def ward_tree(X, connectivity=None, n_components=None, copy=None,
limited use, and the 'parents' output should rather be used.
This option is valid only when specifying a connectivity matrix.
+ return_height: bool (optional)
+ If True, return the distance between the clusters.
@agramfort

agramfort Dec 11, 2014

Owner

calling it height but documenting it with the word "distance" is confusing, I think. Reading this I am not sure what you return.

@agramfort agramfort commented on an outdated diff Dec 11, 2014

sklearn/cluster/hierarchical.py
@@ -248,7 +263,11 @@ def ward_tree(X, connectivity=None, n_components=None, copy=None,
n_leaves = n_samples
children = np.array(children) # return numpy array for efficient caching
- return children, n_components, n_leaves, parent
+ if return_height:
+ heights = (np.array(heights) * 2) ** 0.5 # scaling factor to compare w/ unstructured version
@agramfort

agramfort Dec 11, 2014

Owner

pep8

lines must be <80c

@agramfort agramfort commented on an outdated diff Dec 11, 2014

sklearn/cluster/tests/test_hierarchical.py
+ X -= 4 * np.arange(n)[:, np.newaxis]
+ X -= X.mean(axis=1)[:, np.newaxis]
+
+ out_unstructured = ward_tree(X, return_height=True)
+ out_structured = ward_tree(X, connectivity, return_height=True)
+
+ # sort labels
+ merge_unstructured = np.asanyarray([np.sort(m) for m in out_unstructured[0]])
+ merge_structured = np.asanyarray([np.sort(m) for m in out_structured[0]])
+
+ # sort clusters according to the same criterion
+ idx_unstructured = argsort(merge_unstructured, key=max)
+ idx_structured = argsort(merge_structured, key=max)
+
+ # check if we got the same clusters
+ assert_array_equal(merge_unstructured[idx_unstructured], merge_structured[idx_structured])
@agramfort

agramfort Dec 11, 2014

Owner

pep8

lines must be <80c

please run pep8 tool on your code.

thanks

@agramfort agramfort commented on an outdated diff Dec 11, 2014

sklearn/cluster/tests/test_hierarchical.py
@@ -16,6 +16,7 @@
from sklearn.utils.testing import assert_equal
from sklearn.utils.testing import assert_almost_equal
from sklearn.utils.testing import assert_array_almost_equal
+from numpy.testing import assert_array_almost_equal_nulp
@agramfort

agramfort Dec 11, 2014

Owner

it would be more consistent to add an assert_array_almost_equal_nulp in sklearn.utils.testing

btw why do you need assert_array_almost_equal_nulp here? It's not used anywhere else in the code base.
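
(The re-export being suggested would presumably be a one-liner; a sketch, with hypothetical placement:)

# in sklearn/utils/testing.py:
from numpy.testing import assert_array_almost_equal_nulp

# then in sklearn/cluster/tests/test_hierarchical.py, like the other helpers:
from sklearn.utils.testing import assert_array_almost_equal_nulp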

@agramfort agramfort commented on an outdated diff Dec 11, 2014

sklearn/cluster/tests/test_hierarchical.py
+ out_structured = ward_tree(X, connectivity, return_height=True)
+
+ # sort labels
+ merge_unstructured = np.asanyarray([np.sort(m) for m in out_unstructured[0]])
+ merge_structured = np.asanyarray([np.sort(m) for m in out_structured[0]])
+
+ # sort clusters according to the same criterion
+ idx_unstructured = argsort(merge_unstructured, key=max)
+ idx_structured = argsort(merge_structured, key=max)
+
+ # check if we got the same clusters
+ assert_array_equal(merge_unstructured[idx_unstructured], merge_structured[idx_structured])
+
+ # check if the distances are the same
+ dist_unstructured = out_unstructured[-1]
+ dist_structured = out_structured[-1]
@agramfort

agramfort Dec 11, 2014

Owner

this is a good test but it's not really testing that the value is correct. I would run the algo on a toy dataset where I know the truth.
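
A toy check of that kind might look like this (a sketch, not the test that ended up in the PR; the points are chosen so the first two merge distances are known exactly):

import numpy as np
from sklearn.cluster.hierarchical import ward_tree

# two well-separated pairs: the first two merges join the pairs, and for
# singletons the ward distance is the plain euclidean distance
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 4.0]])
children, _, n_leaves, _, dist = ward_tree(X, return_distance=True)
np.testing.assert_allclose(sorted(dist[:2]), [1.0, 4.0])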

Contributor

mvdoc commented Dec 14, 2014

Thanks for the comments. See latest commit.

@agramfort agramfort and 1 other commented on an outdated diff Dec 14, 2014

sklearn/cluster/hierarchical.py
@@ -248,7 +264,15 @@ def ward_tree(X, connectivity=None, n_components=None, copy=None,
n_leaves = n_samples
children = np.array(children) # return numpy array for efficient caching
- return children, n_components, n_leaves, parent
+ # sort children to get consistent output with unstructured version
+ children = [np.sort(c) for c in children]
@agramfort

agramfort Dec 14, 2014

Owner

this is an API change.

why do you need this?

cc @GaelVaroquaux

@mvdoc

mvdoc Dec 14, 2014

Contributor

atm the outputs of the unstructured (scipy.cluster.hierarchy.ward) and structured versions are not consistent: the children are not sorted in the same way.

see below

import numpy as np
from sklearn.cluster.hierarchical import ward_tree

X = np.array([[1.43054825, -7.5693489],
              [6.95887839, 6.82293382],
              [2.87137846, -9.68248579],
              [7.87974764, -6.05485803],
              [8.24018364, -6.09495602],
              [7.39020262, 8.54004355]])
connectivity = np.ones((6, 6))
out_unstructured = ward_tree(X)
out_structured = ward_tree(X, connectivity)

print(out_unstructured[0])
[[3 4]
 [1 5]
 [0 2]
 [6 8]
 [7 9]]

print(out_structured[0])
[[4 3]
 [5 1]
 [2 0]
 [8 6]
 [9 7]]
@agramfort

agramfort Dec 14, 2014

Owner

ok, then it's a bug fix. Please create a dedicated test and document it in what's new as a bug fix.

@mvdoc

mvdoc Dec 14, 2014

Contributor

OK. I created another pull request for the bug fix. See here:

#3966

@agramfort agramfort commented on an outdated diff Dec 14, 2014

sklearn/cluster/hierarchical.py
@@ -137,6 +141,10 @@ def ward_tree(X, connectivity=None, n_components=None, copy=None,
parents : 1D array, shape (n_nodes, ) or None
The parent of each node. Only returned when a connectivity matrix
is specified, elsewhere 'None' is returned.
+
+ distances : 1D array, shape (n_nodes, )
+ The distance between the clusters. Only returned if return_distance is
+ set to True (for compatibility).
@agramfort

agramfort Dec 14, 2014

Owner

What do you mean by the distance between the clusters? Between their centers? I don't get it.

mvdoc added some commits May 18, 2014

- ward_tree returns heights (distances) between the clusters, both for
  the structured and unstructured versions. Useful for plotting the
  dendrogram with scipy.
- ENH: `ward_tree` returns distances. Returns distances for both the
  structured and unstructured versions, and they are equal for both
  versions. Added a test under test_hierarchical.
- TST: test ward_tree_distance on a known dataset. Solve PEP8 problems
  in test_ward_tree_distance and ward_tree; refactor return_height into
  return_distance for ward_tree; test that the structured and
  unstructured versions of ward_tree return the same distances on a
  known dataset; structured and unstructured versions return children
  in the same order.
- FIX: structured and unstructured ward_tree return children sorted in
  the same way. Add a test.
- Clean code for test_ward_tree_distance.
Contributor

mvdoc commented Dec 16, 2014

I rebased everything on top of current master. The latest change is a better description of the distances in the documentation. @agramfort, please let me know if it's clearer now.

@mvdoc mvdoc changed the title from ward_tree: return distances to [ENH] ward_tree: return distances Dec 16, 2014

@agramfort agramfort commented on an outdated diff Dec 16, 2014

sklearn/cluster/hierarchical.py
@@ -212,6 +248,8 @@ def ward_tree(X, connectivity=None, n_components=None, n_clusters=None):
parent[i], parent[j] = k, k
children.append((i, j))
used_node[i] = used_node[j] = False
+ # store inertia value
+ distances.append(inert)
@agramfort

agramfort Dec 16, 2014

Owner

this distances variable is only needed if return_distance=True.
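
(i.e., presumably guarding the accumulation in place of the unconditional append; a sketch in the style of the hunk above:)

+ if return_distance:
+     # store inertia value
+     distances.append(inert)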

Owner

agramfort commented Dec 16, 2014

LGTM besides this.

@GaelVaroquaux please have a look

Coverage Status

Coverage increased (+0.0%) when pulling 73c7453 on mvdoc:clustering into 15157f9 on scikit-learn:master.

@MechCoder MechCoder and 1 other commented on an outdated diff Dec 17, 2014

sklearn/cluster/hierarchical.py
@@ -245,7 +284,12 @@ def ward_tree(X, connectivity=None, n_components=None, n_clusters=None):
children = [c[::-1] for c in children]
children = np.array(children) # return numpy array for efficient caching
- return children, n_components, n_leaves, parent
+ if return_distance:
+ # 2 is scaling factor to compare w/ unstructured version
+ distances = (np.array(distances) * 2) ** 0.5
@MechCoder

MechCoder Dec 17, 2014

Owner

nitpick: np.asarray?

We probably would want to define a temporary array of size (n_nodes - n_samples) and then use a counter to store the inert value after every merge, but that might be overkill.
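
(The preallocation idea might look like this sketch, with made-up sizes and a stand-in merge cost in place of the real ones:)

import numpy as np

n_samples, n_nodes = 6, 11                 # hypothetical tree sizes
distances = np.empty(n_nodes - n_samples)  # one slot per merge
for counter, k in enumerate(range(n_samples, n_nodes)):
    inert = float(k)                       # stand-in for the real merge cost
    distances[counter] = inert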

@GaelVaroquaux

GaelVaroquaux Dec 17, 2014

Owner

Use np.sqrt, and not '** 0.5', please.

@MechCoder MechCoder commented on an outdated diff Dec 17, 2014

sklearn/cluster/tests/test_hierarchical.py
@@ -302,6 +302,79 @@ def test_ward_tree_children_order():
assert_array_equal(out_unstructured[0], out_structured[0])
+def test_ward_tree_distance():
+ """
+ Check that children are ordered in the same way for both structured and
+ unstructured versions of ward_tree.
+ """
+
+ def argsort(x, key, **kwargs):
+ return sorted(range(len(x)),
+ key=lambda e: key(x.__getitem__(e)), **kwargs)
@MechCoder

MechCoder Dec 17, 2014

Owner

This is the equivalent of doing np.argsort(np.max(children_unstructured, axis=1)) (I think). It took me some time to read this (especially since I haven't used lambdas much).
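
(The equivalence is easy to check on the merge list from earlier in the thread; a sketch:)

import numpy as np

merges = np.array([[3, 4], [1, 5], [0, 2], [6, 8], [7, 9]])

def argsort(x, key, **kwargs):
    return sorted(range(len(x)), key=lambda e: key(x[e]), **kwargs)

print(argsort(merges, key=max))                 # [2, 0, 1, 3, 4]
print(np.argsort(merges.max(axis=1)).tolist())  # [2, 0, 1, 3, 4]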

Owner

MechCoder commented Dec 17, 2014

lgtm. apart from these two nitpicks. i'm reading through the agglomerative code and this is helpful. thanks! :)

Contributor

mvdoc commented Dec 17, 2014

@MechCoder

nitpick: np.asarray ?

I thought that for lists np.asarray and np.array are equivalent. Is there any particular reason to favor np.asarray here, since we're passing a list?

This is the equivalent of doing np.argsort(np.max(children_unstructured, axis=1)) (I think). It took me some time to read this (especially since I haven't used lambdas much).

I believe you're right. However, you made me realize that it's not necessary anymore since now both structured and unstructured versions return the children ordered in the same way.

lgtm. apart from these two nitpicks. i'm reading through the agglomerative code and this is helpful. thanks! :)

you're welcome! glad it helped! :)

Owner

MechCoder commented Dec 17, 2014

Is there any particular reason to favor np.asarray here, since we're passing a list?

Oops you are right. Sorry
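
(For the record: given a list, both allocate a new ndarray; np.asarray only skips the copy when handed an array of matching dtype. A sketch:)

import numpy as np

lst = [1.0, 2.0]
a = np.array(lst)              # converts (copies) the list
b = np.asarray(lst)            # same here: a list must be converted

arr = np.array([1.0, 2.0])
print(np.asarray(arr) is arr)  # True  - no copy for an existing ndarray
print(np.array(arr) is arr)    # False - np.array copies by default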

Contributor

mvdoc commented Dec 17, 2014

I wonder if there are any more suggested changes?

Owner

MechCoder commented Dec 17, 2014

Looks good to me. @GaelVaroquaux should I merge?

@GaelVaroquaux GaelVaroquaux commented on the diff Dec 17, 2014

sklearn/cluster/hierarchical.py
+ corresponds to a weighted euclidean distance between
+ the nodes `children[i, 1]` and `children[i, 2]`. If the nodes refer to
+ leaves of the tree, then `distances[i]` is their unweighted euclidean
+ distance. Distances are updated in the following way
+ (from scipy.hierarchy.linkage):
+
+ The new entry :math:`d(u,v)` is computed as follows,

+ .. math::

+    d(u,v) = \\sqrt{\\frac{|v|+|s|}{T} d(v,s)^2
+                    + \\frac{|v|+|t|}{T} d(v,t)^2
+                    - \\frac{|v|}{T} d(s,t)^2}
@GaelVaroquaux

GaelVaroquaux Dec 17, 2014

Owner

Could you please check that this renders right in the html docs. There have been problems with sphinx, math, and docstrings.
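
(As an aside, the quoted update rule can be checked numerically against scipy on three points; a sketch with arbitrary coordinates:)

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
d_ab, d_ac, d_bc = pdist(X)  # pairwise euclidean distances

# a and b merge first; update for the new cluster u = {a, b} vs c,
# all clusters singletons so T = |v| + |s| + |t| = 3
d_uc = np.sqrt((2 * d_ac ** 2 + 2 * d_bc ** 2 - d_ab ** 2) / 3)

Z = linkage(X, method='ward')
print(np.isclose(Z[1, 2], d_uc))  # True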

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Dec 17, 2014

sklearn/cluster/hierarchical.py
@@ -245,7 +284,12 @@ def ward_tree(X, connectivity=None, n_components=None, n_clusters=None):
children = [c[::-1] for c in children]
children = np.array(children) # return numpy array for efficient caching
- return children, n_components, n_leaves, parent
+ if return_distance:
+ # 2 is scaling factor to compare w/ unstructured version
+ distances = np.sqrt(np.array(distances) * 2)
@GaelVaroquaux

GaelVaroquaux Dec 17, 2014

Owner

Complete nitpick (feel free to ignore): I usually prefer scalars in front of vectors when multiplying one by the other.

Owner

GaelVaroquaux commented Dec 17, 2014

LGTM. +1 for merge once the doc rendering has been checked. Thanks!

Contributor

mvdoc commented Dec 18, 2014

Could you please check that this renders right in the html docs. There have been problems with sphinx, math, and docstrings.

Looks good on my system:

[screenshot of the rendered formula in the html docs, 2014-12-18]

Owner

GaelVaroquaux commented Dec 18, 2014

OK. Merging. Thank you!

GaelVaroquaux added a commit that referenced this pull request Dec 18, 2014

Merge pull request #3158 from mvdoc/clustering
[ENH] ward_tree: return distances

@GaelVaroquaux GaelVaroquaux merged commit 4a02469 into scikit-learn:master Dec 18, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed

@mvdoc mvdoc deleted the mvdoc:clustering branch Dec 18, 2014
