[MRG+1] Use sparse cluster contingency matrix by default #7419

Status: closed. Wants to merge 16 commits into base: master.
@jnothman
Member

jnothman commented Sep 14, 2016

Supersedes #7203. Obviates the need for max_n_classes to some extent, and removes it as unreleased API (introduced in #5445 to fix #4976).

A sparse contingency matrix handles the common case where there is a large number of clusters and the overlap among them is sparse. Where there are few clusters we pay a penalty for using a sparse contingency matrix, but that penalty is small in absolute terms.

The change is being benchmarked on my sparse_cluster_contingency_option branch, performing multiple segmentations of an image into different numbers of clusters, then timing the evaluation of metrics for different contingency matrix shapes. Benchmarks will be posted shortly.
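For illustration, here is a minimal sketch of how such a sparse contingency matrix can be built with scipy (sparse_contingency is a hypothetical helper for this example, not the PR's code):

```python
import numpy as np
from scipy import sparse


def sparse_contingency(labels_true, labels_pred):
    # Map arbitrary label values to contiguous integer codes.
    classes, class_idx = np.unique(labels_true, return_inverse=True)
    clusters, cluster_idx = np.unique(labels_pred, return_inverse=True)
    # Each sample adds 1 to cell (true class, predicted cluster); duplicate
    # coordinates are summed when converting from COO to CSR.
    data = np.ones(len(class_idx), dtype=np.int64)
    return sparse.coo_matrix(
        (data, (class_idx, cluster_idx)),
        shape=(len(classes), len(clusters))).tocsr()


C = sparse_contingency([0, 0, 1, 1], [1, 1, 0, 2])
```

Only the nonzero class/cluster overlaps are stored, which is what keeps memory bounded when both labelings have many distinct values.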

@jnothman

Member

jnothman commented Sep 14, 2016

Benchmarks at https://github.com/jnothman/scikit-learn/blob/sparse_cluster_contingency_option/stuppie-metrics_sparse_comparison.ipynb (see the end) evaluate timings for a 31000-point dataset (an image), with contingency matrix shapes in {5, 10, 20, 50, 100, 250} x {5, 10, 20, 50, 100, 250}. We see:

  • small absolute differences (<4ms) for homogeneity_completeness_v_measure
  • negligible or negative differences (substantial in low density) for adjusted_rand_score
  • small absolute (<10ms) and relative (<0.3x, proportional to density) differences for adjusted_mutual_info_score
  • small absolute differences (<3ms) for normalized_mutual_info_score
  • small absolute differences (<3ms) for fowlkes_mallows_score
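A rough sketch of the kind of timing measurement described above (the dataset sizes mirror the benchmark; this is not the notebook's code):

```python
import time

import numpy as np
from sklearn.metrics import adjusted_rand_score

rng = np.random.RandomState(0)
n_samples, n_clusters = 31000, 250

# Two random labelings with many clusters, as in the image-segmentation
# benchmark's largest contingency matrix shape (250 x 250).
y_true = rng.randint(n_clusters, size=n_samples)
y_pred = rng.randint(n_clusters, size=n_samples)

start = time.perf_counter()
score = adjusted_rand_score(y_true, y_pred)
elapsed = time.perf_counter() - start
```

With unrelated random labelings the ARI should be close to 0; the interesting quantity in the benchmark is `elapsed` across contingency shapes.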
# Compute the ARI using the contingency data
sum_comb_c = sum(comb2(n_c) for n_c in contingency.sum(axis=1))
sum_comb_k = sum(comb2(n_k) for n_k in contingency.sum(axis=0))
if isinstance(contingency, np.ndarray):

@ogrisel

ogrisel Sep 14, 2016

Member

Why keep the code for the dense case if sparse=True in the previous line?

@jnothman

jnothman Sep 14, 2016

Member

forgot :)

@ogrisel

ogrisel Sep 14, 2016

Member

Given the previous line with sparse=True there is no need for supporting array-based contingency matrix here. We can delete this branch of the if statement as well as the final else clause.
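For context, the pair-counting over the contingency marginals in the snippet above can be sketched as follows (comb2 is not shown in the diff; it is assumed here to compute C(n, 2) via scipy.special.comb):

```python
import numpy as np
from scipy import sparse
from scipy.special import comb


def comb2(n):
    # Number of unordered pairs among n items: n * (n - 1) / 2.
    return comb(n, 2, exact=True)


C = sparse.csr_matrix(np.array([[3, 1], [0, 4]]))

# scipy sparse row/column sums come back 2-D, so flatten before iterating.
sum_comb_c = sum(comb2(n_c) for n_c in np.ravel(C.sum(axis=1)))
sum_comb_k = sum(comb2(n_k) for n_k in np.ravel(C.sum(axis=0)))
```

Since the marginal sums are the same whether the matrix is dense or sparse, only the sparse code path needs to survive once sparse=True is hard-coded.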

@ogrisel

Member

ogrisel commented Sep 14, 2016

Thanks very much. I am in favor of always using sparse=True in those metrics and removing max_n_classes.

jnothman added some commits Sep 14, 2016

@jnothman

Member

jnothman commented Sep 14, 2016

After giving myself a review and making numerous cosmetic tweaks, I'm going to bed. Would be a nice surprise to wake up to see this merged :p

@jnothman

Member

jnothman commented Sep 14, 2016

(Unfortunately, my numerous cosmetic tweaks have overwhelmed appveyor, and I do not have perms to cancel queued tasks there.)

@ogrisel

Member

ogrisel commented Sep 14, 2016

(Unfortunately, my numerous cosmetic tweaks have overwhelmed appveyor, and I do not have perms to cancel queued tasks there.)

I will do it and send you the credentials by private email.

@ogrisel

Member

ogrisel commented Sep 14, 2016

BTW I confirm that this PR solves the original issue (#4976) with large floating point valued vectors passed to clustering metrics:

>>> from sklearn.metrics import mutual_info_score
>>> import numpy as np
>>> y1 = np.random.randn(int(1e5))
>>> y2 = np.random.randn(int(1e5))
>>> %time mutual_info_score(y1, y2)
CPU times: user 96 ms, sys: 8 ms, total: 104 ms
Wall time: 99.1 ms
11.512925464970223
>>> %load_ext memory_profiler
>>> %memit mutual_info_score(y1, y2)
peak memory: 75.73 MiB, increment: 8.95 MiB

There is definitely no need for the ugly max_n_classes in our public API.

max_n_classes=max_n_classes)
contingency = np.array(contingency, dtype='float')
contingency = contingency_matrix(labels_true, labels_pred, sparse=True)
contingency = contingency.astype(float)

@ogrisel

ogrisel Sep 14, 2016

Member

.astype(float) is not guaranteed to use the same level of precision on all platforms. To get platform-agnostic results it's better to be specific:

contingency = contingency.astype(np.float64)
max_n_classes=max_n_classes)
contingency = np.array(contingency, dtype='float')
contingency = contingency_matrix(labels_true, labels_pred, sparse=True)
contingency = contingency.astype(float)

@ogrisel

ogrisel Sep 14, 2016

Member

.astype(np.float64)

assert_equal(normalized_mutual_info_score(labels_a, labels_b,
max_n_classes=1e4), 0.0)
labels_a, labels_b = np.ones(i, dtype=np.int), \
np.arange(i, dtype=np.int)

@ogrisel

ogrisel Sep 14, 2016

Member

This could be better written (no end-of-line escape, and a platform-independent precision level):

        labels_a = np.ones(i, dtype=np.int32)
        labels_b = np.arange(i, dtype=np.int32)
@@ -883,15 +874,13 @@ def fowlkes_mallows_score(labels_true, labels_pred, max_n_classes=5000):
.. [2] `Wikipedia entry for the Fowlkes-Mallows Index
<https://en.wikipedia.org/wiki/Fowlkes-Mallows_index>`_
"""
labels_true, labels_pred = check_clusterings(labels_true, labels_pred,)
labels_true, labels_pred = check_clusterings(labels_true, labels_pred, )

@ogrisel

ogrisel Sep 14, 2016

Member

why not just remove the comma after labels_pred?

@@ -586,16 +575,12 @@ def mutual_info_score(labels_true, labels_pred, contingency=None,
labels_pred : array, shape = [n_samples]
A clustering of the data into disjoint subsets.
contingency: None or array, shape = [n_classes_true, n_classes_pred]
contingency: {None, array, sparse matrix},
shape = [n_classes_true, n_classes_pred]

@ogrisel

ogrisel Sep 14, 2016

Member

@amueller Is this the right level of indentation for the numpydoc sphinx stuff? I never remember.

@amueller

amueller Sep 14, 2016

Member

no one knows ;)

@ogrisel

Member

ogrisel commented Sep 14, 2016

Hum it seems that I made a bunch of comments on an old version of the diff. Sorry for the noise.

Anyways, besides my comments on astype(float), +1 for merge.

@ogrisel ogrisel changed the title from [MRG] Use sparse cluster contingency matrix by default to [MRG+1] Use sparse cluster contingency matrix by default Sep 14, 2016

@@ -28,8 +28,8 @@ def expected_mutual_information(contingency, int n_samples):
#cdef np.ndarray[int, ndim=2] start, end
R, C = contingency.shape
N = <DOUBLE>n_samples
a = np.sum(contingency, axis=1).astype(np.int32)
b = np.sum(contingency, axis=0).astype(np.int32)
a = np.ravel(contingency.sum(axis=1).astype(np.int32))

@amueller

amueller Sep 14, 2016

Member

why is the ravel needed?

@ogrisel

ogrisel Sep 14, 2016

Member

Because it can now be a sparse matrix and the sum over the rows of a sparse matrix is a sparse matrix with a single column instead of a 1D array.
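A quick illustration of that behaviour (a sketch, not code from the PR): scipy's sparse sum along an axis returns a 2-D matrix, so np.ravel is needed to get the 1-D array the Cython code expects.

```python
import numpy as np
from scipy import sparse

C = sparse.csr_matrix(np.array([[2, 0], [1, 1]]))

row_sums = C.sum(axis=1)    # 2-D result of shape (2, 1), not a 1-D array
a = np.ravel(row_sums.astype(np.int32))  # flattened to shape (2,)
```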

@@ -330,6 +330,11 @@ Enhancements
(`#7248 <https://github.com/scikit-learn/scikit-learn/pull/7248>`_)
By `Andreas Müller`_.
- Support sparse contingency matrices in cluster evaluation
(:mod:`metrics.cluster.supervised`) and use these by default.

@amueller

amueller Sep 14, 2016

Member

"uses these by default" is not entirely true, right? Only in the internals, but no return value changed.

@ogrisel

ogrisel Sep 14, 2016

Member

I agree.

@ogrisel

Member

ogrisel commented Sep 14, 2016

@amueller if you agree I can address the comments, merge and backport while @jnothman is AFK :)

@amueller

Member

amueller commented Sep 14, 2016

@ogrisel sounds good :)

@ogrisel

Member

ogrisel commented Sep 14, 2016

polished, squashed, rebased and backported.

@ogrisel ogrisel closed this Sep 14, 2016

@jnothman

Member

jnothman commented Sep 14, 2016

ooh :) you two are so kindly obliging!
