AffinityPropagation creates 3d array of cluster centers on rare occasions #9612

Closed
jsamoocha opened this Issue Aug 23, 2017 · 16 comments


jsamoocha commented Aug 23, 2017

Description

Just stumbled upon a rare combination of training data and preference value that causes the model to save its cluster centers as a 3d ndarray instead of expected 2d.

Steps/Code to Reproduce

import numpy as np
from sklearn.cluster.affinity_propagation_ import AffinityPropagation

train_data = np.array([[-1.,  1.], [1., -1.]])
model = AffinityPropagation(preference=-10).fit(train_data)
model.cluster_centers_

yields

array([[[-1.,  1.], [ 1., -1.]]])  # 3d!!

and

model.predict(train_data)

leads to

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/Users/jsamoocha/.virtualenvs/coach/lib/python2.7/site-packages/sklearn/cluster/affinity_propagation_.py", line 324, in predict
    return pairwise_distances_argmin(X, self.cluster_centers_)
  File "/Users/jsamoocha/.virtualenvs/coach/lib/python2.7/site-packages/sklearn/metrics/pairwise.py", line 464, in pairwise_distances_argmin
    metric_kwargs)[0]
  File "/Users/jsamoocha/.virtualenvs/coach/lib/python2.7/site-packages/sklearn/metrics/pairwise.py", line 339, in pairwise_distances_argmin_min
    X, Y = check_pairwise_arrays(X, Y)
  File "/Users/jsamoocha/.virtualenvs/coach/lib/python2.7/site-packages/sklearn/metrics/pairwise.py", line 111, in check_pairwise_arrays
    warn_on_dtype=warn_on_dtype, estimator=estimator)
  File "/Users/jsamoocha/.virtualenvs/coach/lib/python2.7/site-packages/sklearn/utils/validation.py", line 405, in check_array
    % (array.ndim, estimator_name))
ValueError: Found array with dim 3. check_pairwise_arrays expected <= 2.

When using slightly different values for preference (e.g. 0 or -20), or slightly different training data (e.g. [[-1, 1], [1, -0.9]]), cluster centers are stored correctly as 2d ndarray.

Expected Results

Cluster centers to be stored as 2d ndarray, as in normal cases.

Versions

Darwin-15.6.0-x86_64-i386-64bit
('Python', '2.7.13 (default, Jul 18 2017, 09:16:53) \n[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)]')
('NumPy', '1.13.1')
('SciPy', '0.19.1')
('Scikit-Learn', '0.18.2')


jsamoocha commented Aug 24, 2017

Clean install of v0.19 in an empty conda env leads to the same result.

jnothman commented Aug 24, 2017

Okay. This is happening when cluster_centers_indices is None, which happens in turn when K = 0 exemplars are found. The possibility that cluster_centers_indices is None is not documented in affinity_propagation, and not handled in AffinityPropagation. Fix welcome.
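This diagnosis is easy to check directly in NumPy: indexing an ndarray with None is equivalent to np.newaxis, so indexing the data with a None cluster_centers_indices silently inserts a leading axis instead of selecting rows. A minimal sketch (variable names are illustrative, not the library's internals):

```python
import numpy as np

X = np.array([[-1., 1.], [1., -1.]])

# When K = 0 exemplars are found, cluster_centers_indices comes back as None.
cluster_centers_indices = None

# Indexing with None is equivalent to np.newaxis: it inserts a new leading
# axis rather than selecting rows, which is how the 3d array appears.
centers = X[cluster_centers_indices]
print(centers.shape)        # (1, 2, 2)
print(X[np.newaxis].shape)  # (1, 2, 2) -- identical
```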

jsamoocha commented Aug 24, 2017

I'm a bit confused by the results of affinity_propagation() for the example above with [[-1, 1], [1, -1]] as training data: running it with preference=0 results in K = 2, running it with preference=-10 results in K = 0, but running it with preference=-20 results in K = 1 (taking the first point in the training data as exemplar).

Does this make sense? Shouldn't we expect lower preference values to result in fewer clusters?


agramfort commented Aug 25, 2017

I am seeing this now. Did someone already investigate?

jsamoocha commented Aug 25, 2017

I took a more detailed look at what's happening inside affinity_propagation(). For the "symmetric" case above (i.e. n samples with equal mutual similarities and the same preference), subtle differences in the preference argument and floating-point rounding errors determine whether the algorithm converges. In the non-converging case this leads to K=0, a return of None as cluster_centers_indices, and eventually an extra dimension on the array of cluster centers: using None as an index on an ndarray inside AffinityPropagation.fit() (line 308) adds a dimension, see numpy.newaxis.

So it seems there are 2 issues:

  1. Non-convergence (or K=0) should not lead to the fit() method setting the model's cluster centers as a 3d array. In case of non-convergence, I suggest setting the model's cluster centers to an empty array, and letting predict() return a unique label for each sample in X in that case (besides logging some warnings).
  2. The edge case of training data consisting of samples with equal mutual similarities and equal preferences is not defined. I suggest defining it as (a) every sample becomes its own cluster center if preference >= similarity, or (b) there is a single cluster with an arbitrary center if preference < similarity.

Unless someone disagrees with this approach, I can start working on the fix.
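The fallback proposed in point 1 could look roughly like the pure-NumPy sketch below. The helper name is hypothetical; the nearest-center assignment stands in for sklearn's pairwise_distances_argmin, and the -1 sentinel follows the behavior the merged PR eventually settled on (a negative label for all samples rather than unique labels):

```python
import warnings
import numpy as np

def predict_with_fallback(cluster_centers, X):
    """Hypothetical helper: behave like predict(), but handle a fit that
    found no exemplars instead of crashing in pairwise_distances_argmin."""
    centers = np.asarray(cluster_centers)
    if centers.ndim != 2 or centers.shape[0] == 0:
        warnings.warn("Affinity propagation did not converge; "
                      "labeling every sample -1.")
        return np.full(len(X), -1, dtype=int)
    # Nearest-center assignment, equivalent to pairwise_distances_argmin
    # with euclidean distances.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    return d.argmin(axis=1)

X = np.array([[-1., 1.], [1., -1.]])
print(predict_with_fallback(np.empty((0, 2)), X))                 # [-1 -1]
print(predict_with_fallback(np.array([[-1., 1.], [1., -1.]]), X)) # [0 1]
```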


jsamoocha added a commit to jsamoocha/scikit-learn that referenced this issue Aug 26, 2017

Added test exposing non-convergence issues
As discussed in issue scikit-learn#9612, expecting cluster centers to be an empty array and labels to be
unique for every sample.

jsamoocha commented Aug 27, 2017

I saw the floating point issues only when dealing with the edge case above, i.e. [[-1, 1], [1, -1]] as training samples. Depending on the preference and damping params, the A and R diagonals would "converge" to e.g. [0.45, -0.45] and [-0.45, 0.45]. But then the code

# Check for convergence
E = (np.diag(A) + np.diag(R)) > 0
e[:, it % convergence_iter] = E
K = np.sum(E, axis=0)

would somehow lead to different values of E (and K) per (small sequence of) iteration(s). This would then lead to the incidental non-convergence for particular values of preference, as I mentioned before (i.e. convergence to K=2 when preference=0, convergence to K=1 when preference<-20, but intermittent convergence to K=1 or non-convergence for preference in (-20, -9]).

The solution in the PR immediately returns cluster centers for the edge case above without running the actual algorithm, and as such avoids the rounding issues.
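The instability described here is ordinary floating-point behavior: when the A and R diagonals are exact mirrors, each diagonal sum is mathematically zero, so the strict > 0 exemplar test is decided purely by rounding error. A tiny illustration of the mechanism (the operand values are made up; only the near-zero sums matter):

```python
import numpy as np

# Two ways of computing a mathematically-zero sum: rounding pushes one
# just above zero and the other just below it.
a = 0.1 + 0.2 - 0.3   # ~ +5.6e-17
b = 0.3 - 0.2 - 0.1   # ~ -2.8e-17

# The convergence check uses a strict > 0 comparison, so these two
# "equal" sums disagree about whether a point is an exemplar.
E = np.array([a, b]) > 0
K = int(E.sum())
print(E, K)   # [ True False] 1
```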

jnothman added a commit that referenced this issue Sep 5, 2017

[MRG+1] Affinity propagation edge cases (#9612) (#9635)
* Added test exposing non-convergence issues

As discussed in issue #9612, expecting cluster centers to be an empty array and labels to be
unique for every sample.

* Addresses non-convergence issues

Returns empty list as cluster center indices to prevent adding a dimension in fit() method,
returns unique labels for samples making this consistent with (TBD) predict() behavior for
non-convergence.

* Made predict() handle case of non-convergence while fitting

In this case, it will log a warning and return unique labels for every new sample.

* Added helper function for detecting mutually equal similarities and preferences

* Tidied imports

* Immediately returning trivial clusters and labels in case of equal similarities and preferences

* Simplified code for preference(s) equality test

* Corrected for failing unit tests covering case of n_samples=1

* Corrected for PEP8 line too long

* Rewriting imports to comply with max 80-column lines

* Simplified code

n_samples == 1 case does not need a separate return statement.

* Replaced logging warnings by warnings.warn()

Added assertions for warnings in tests.

* Marking function as non-public

* Using mask instead of modifying S

* Improvement suggested by review comment

* Avoided casting preference to array twice

* Readability improvements

* Improved returned labels in case of no cluster centers

Returning a unique label for every sample in X suggests that these were based on actual clusters.
Since there are no clusters, it makes more sense to return a negative label for all samples,
indicating there were no clusters.

* PEP8 line too long

* Avoided creating separate variable for preference as array

* Corrected warning message

* Making labels consistent with predict() behavior in case of non-convergence

* Minor readability improvement

* Added detail to test comment about expected result

* Added documentation about edge cases

* Added documentation to 'what's new'

massich added a commit to massich/scikit-learn that referenced this issue Sep 15, 2017

[MRG+1] Affinity propagation edge cases (scikit-learn#9612) (scikit-learn#9635)

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this issue Nov 15, 2017

[MRG+1] Affinity propagation edge cases (scikit-learn#9612) (scikit-learn#9635)

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this issue Dec 18, 2017

[MRG+1] Affinity propagation edge cases (scikit-learn#9612) (scikit-learn#9635)
MartinHjelm commented May 30, 2018

I am experiencing the same problem with scikit-learn 0.19.1 on Python 3. I am using AffinityPropagation on a subset of the Iris dataset that has been PCA-projected down to 2D. The following code reproduces the error,

import numpy as np
from sklearn.cluster import AffinityPropagation

X = np.array([[ 1.23906249, 0.49611541], [ 3.39649346, -0.40897545], [-3.12219045, 0.42151394],
              [ 0.80784673, 0.50505868], [-3.01872322, 0.07849451], [-2.51973134, -0.4776176 ],
              [ 2.18521652, 0.00668716], [ 0.81586318, -0.24716449], [ 3.29251742, -0.45548925],
              [-2.72639784, -0.69526994], [ 1.53713634, -0.09261543], [-2.84425659, -0.36466696],
              [ 1.80389803, -0.26573823], [-2.79804411, -0.37832532], [ 0.93171373, -0.23522147],
              [-2.38476659, -0.73831055], [ 1.83654306, 0.29829933], [ 2.28357399, -0.42918803],
              [-2.99943574, 0.79122123], [ 0.34610084, 1.32853088], [ 1.0425316 , 0.77427372],
              [ 2.5083168 , -0.46682954], [ 1.1105675 , -0.3676059 ], [-2.71730829, 0.12686326],
              [ 0.82102709, -0.45054543], [-2.3251329 , -0.41434859], [-2.67379819, -0.43171684],
              [-0.21329465, 0.67433196], [ 1.18787965, 0.9021088 ], [-2.74512103, -0.22317351],
              [ 2.04334762, -0.09038415], [-3.03008892, 0.50259852], [ 0.7935851 , 0.19528684],
              [ 2.29055389, -0.07254246], [ 3.6837104 , -0.18406905], [ 0.16080443, 0.38841395]])

AP = AffinityPropagation(affinity='euclidean', convergence_iter=15, copy=True, damping=0.9900000000000004,
                         max_iter=200, preference=None, verbose=False)
AP.fit(X)
pred_X = AP.predict(X)

and gives the following Traceback

   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/usr/local/lib/python3.6/site-packages/sklearn/cluster/affinity_propagation_.py", line 333, in predict
       return pairwise_distances_argmin(X, self.cluster_centers_)
     File "/usr/local/lib/python3.6/site-packages/sklearn/metrics/pairwise.py", line 467, in pairwise_distances_argmin
       metric_kwargs)[0]
     File "/usr/local/lib/python3.6/site-packages/sklearn/metrics/pairwise.py", line 342, in pairwise_distances_argmin_min
       X, Y = check_pairwise_arrays(X, Y)
     File "/usr/local/lib/python3.6/site-packages/sklearn/metrics/pairwise.py", line 112, in check_pairwise_arrays
       warn_on_dtype=warn_on_dtype, estimator=estimator)
     File "/usr/local/lib/python3.6/site-packages/sklearn/utils/validation.py", line 451, in check_array
       % (array.ndim, estimator_name))
   ValueError: Found array with dim 3. check_pairwise_arrays expected <= 2.

MartinHjelm commented May 30, 2018

Hmm. I am using Python 3.6.5 and NumPy 1.14.3; could that be the culprit? Maybe not.

I have experimented a bit with the code, and it seems that the fit function sets cluster_centers_ to shape (1, 36, 2) whenever damping >= 0.99. For damping=0.98 the shape is 2D. So a simple workaround seems to be to avoid values of damping >= 0.99.
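On affected releases, rather than relying on a damping threshold, a user-side guard can detect the bad state right after fit(). A sketch under the assumption reported in this thread (a failed fit leaves cluster_centers_ 3d, e.g. shape (1, 36, 2), while a normal fit stores a 2d (n_clusters, n_features) array); the helper name is illustrative:

```python
import numpy as np

def fit_converged(cluster_centers):
    """Heuristic check: True only if cluster_centers looks like a normal
    2d (n_clusters, n_features) array with at least one center."""
    c = np.asarray(cluster_centers)
    return c.ndim == 2 and c.shape[0] > 0

print(fit_converged(np.empty((1, 36, 2))))    # False -- the failure mode here
print(fit_converged(np.array([[0.5, 0.5]])))  # True
```

With a check like this one can retry the fit with a different damping or preference instead of letting predict() raise from deep inside check_pairwise_arrays.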


MartinHjelm commented May 31, 2018

I am using OSX, so I have Homebrew's Python 3 and installed scikit-learn and numpy via pip. I do not use Homebrew's numpy install.
