
Single linkage option in Agglomerative causes MemoryError for very large numbers #17960

Open
THaar50 opened this issue Jul 20, 2020 · 6 comments · May be fixed by #18155

Comments

@THaar50

THaar50 commented Jul 20, 2020

Describe the bug

Using the single linkage option in Agglomerative clustering results in a MemoryError when the input data contains very large values, likely due to numeric overflows.

Steps/Code to Reproduce

Example:
Running the code below with half of the sample data does not produce a MemoryError (although the same numpy RuntimeWarning is still thrown). However, when the data contains as many very large values as in the sample below, my machine with 16 GB of RAM runs out of memory and fit raises a MemoryError. This seems disproportionate, since the sample contains only 50 points.

from sklearn.cluster import AgglomerativeClustering

X = [[1.30830774e+307, 6.02217328e+307],
     [1.54166067e+308, 1.75812744e+308],
     [5.57938866e+307, 4.13840113e+307],
     [1.36302835e+308, 1.07968131e+308],
     [1.58772669e+308, 1.19380571e+307],
     [2.20362426e+307, 1.58814671e+308],
     [1.06216028e+308, 1.14258583e+308],
     [7.18031911e+307, 1.69661213e+308],
     [7.91182553e+307, 5.12892426e+307],
     [5.58470885e+307, 9.13566765e+306],
     [1.22366243e+308, 8.29427922e+307],
     [4.39205961e+306, 1.26048413e+308],
     [4.61599953e+306, 7.24075646e+307],
     [1.66596896e+308, 1.65498552e+308],
     [4.73958815e+307, 7.66412710e+307],
     [1.57013390e+308, 1.03527051e+308],
     [1.28464631e+308, 1.68216358e+308],
     [1.07121506e+308, 3.11489418e+307],
     [2.97524276e+307, 1.59238260e+306],
     [1.24529964e+308, 5.18478922e+306],
     [7.55088957e+307, 7.08726240e+307],
     [8.38161061e+307, 1.50363727e+307],
     [1.11502738e+308, 3.77564517e+307],
     [1.33977566e+308, 4.21606136e+307],
     [3.18410347e+306, 1.41182675e+308],
     [1.37949806e+307, 1.33091975e+308],
     [8.84125355e+307, 8.83334687e+307],
     [4.16683700e+307, 4.38042780e+307],
     [6.33989592e+307, 1.13127438e+308],
     [3.30270525e+307, 5.82271903e+307],
     [1.73464450e+308, 1.57922601e+308],
     [8.50840567e+307, 8.34750980e+307],
     [1.06389805e+308, 7.01361194e+307],
     [1.20971776e+308, 1.42173002e+308],
     [1.91271271e+307, 7.63302091e+307],
     [1.46375293e+308, 3.27352625e+307],
     [5.64255485e+307, 1.62562194e+308],
     [5.96180877e+307, 4.31806469e+307],
     [1.18773200e+308, 4.91146568e+307],
     [1.19298612e+307, 1.54110558e+308],
     [3.51302340e+307, 5.60375139e+307],
     [5.14281492e+307, 9.76302343e+307],
     [1.77408023e+308, 1.65760854e+308],
     [1.46544741e+308, 1.47445640e+308],
     [1.76109403e+308, 1.30923042e+308],
     [1.64984146e+308, 2.58303609e+307],
     [1.61558758e+307, 1.03985868e+308],
     [1.37977676e+308, 4.90921157e+307],
     [1.01105745e+308, 1.57678709e+307],
     [1.24672794e+308, 5.96657664e+307]]

clusterer = AgglomerativeClustering(linkage='single')
clusterer.fit(X)
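The overflow is easy to see in isolation: the coordinates above are close to the float64 maximum (about 1.8e308), so the squared Euclidean distances computed during clustering overflow. A minimal sketch using two of the sample points:

```python
import numpy as np

# Two of the sample points above: coordinates are near the float64
# maximum (~1.8e308), so squaring their difference overflows to inf.
x = np.array([1.54166067e308, 1.75812744e308])
y = np.array([1.30830774e307, 6.02217328e307])

with np.errstate(over="ignore"):
    d2 = ((x - y) ** 2).sum()  # squared Euclidean distance

print(np.isinf(d2))  # True: the pairwise distances degenerate to inf
```

With every pairwise distance equal to inf, the linkage tree built from them is degenerate, which is presumably what triggers the failure downstream.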

Expected Results

Either no error is thrown, or a ValueError is raised that specifies the range of allowed values. GaussianMixture handles the same data by raising:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
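That message comes from scikit-learn's input validation, which can reject even technically-finite inputs when the array's sum overflows to inf. A rough sketch of that style of check (`check_finite` is a hypothetical helper modeled on sklearn's `assert_all_finite`, not its actual implementation):

```python
import numpy as np

def check_finite(X):
    # Hypothetical helper modeled on sklearn's assert_all_finite: summing
    # the array overflows to inf for magnitudes near the float64 max, so
    # even finite inputs this large are rejected with a clear message.
    X = np.asarray(X, dtype=np.float64)
    with np.errstate(over="ignore"):
        if not np.isfinite(X.sum()):
            raise ValueError("Input contains NaN, infinity or a value too "
                             "large for dtype('float64').")
    return X

check_finite([[1.0, 2.0]])              # small values pass
try:
    check_finite([[1.5e308, 1.7e308]])  # sum overflows -> rejected
except ValueError as e:
    print(e)
```

Running AgglomerativeClustering's input through a check like this before building the linkage tree would produce the same friendly error that GaussianMixture gives.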

Actual Results

C:\Program Files\Python37\lib\site-packages\numpy\core\fromnumeric.py:90: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)

Traceback (most recent call last):
  File "C:/Users/thaar/PycharmProjects/sklearn-dev/agglomerative_test.py", line 58, in <module>
    clusterer.fit(X)
  File "C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_agglomerative.py", line 898, in fit
    self.n_leaves_)
  File "C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_agglomerative.py", line 674, in _hc_cut
    label[_hierarchical._hc_get_descendent(-node, children, n_leaves)] = i
  File "sklearn\cluster\_hierarchical_fast.pyx", line 96, in sklearn.cluster._hierarchical_fast._hc_get_descendent
MemoryError

Versions

System:
python: 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\thaar\PycharmProjects\sklearn-dev\venv\Scripts\python.exe
machine: Windows-10-10.0.18362-SP0

Python dependencies:
pip: 20.1.1
setuptools: 49.2.0
sklearn: 0.23.1
numpy: 1.18.4
scipy: 1.4.1
Cython: None
pandas: 1.0.3
matplotlib: 3.2.1
joblib: 0.14.1
threadpoolctl: 2.0.0

Built with OpenMP: True

@jnothman
Member

Thanks for the report. This needs investigation.

@juliusvering

take

@JiaHong-Lee

Hi,

I face a similar issue with the other agglomerative linkage methods. I can reproduce the behaviour using:

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, Y = make_blobs(n_samples=[25000, 50000], cluster_std=[1.0, 0.8], n_features=4, centers=None, random_state=0)
# clusterer = AgglomerativeClustering(n_clusters=5, linkage="ward") # out of memory
# clusterer = AgglomerativeClustering(n_clusters=5, linkage="complete") # out of memory
# clusterer = AgglomerativeClustering(n_clusters=5, linkage="average") # out of memory
clusterer = AgglomerativeClustering(n_clusters=5, linkage="single") # okay
clusterer.fit(X)

My error message: MemoryError: Unable to allocate 21.0 GiB for an array with shape (2812462500,) and data type float64.

Details:

  1. scikit-learn v0.23.1
  2. Python v3.7.5
  3. Ubuntu 18.04.3 LTS
  4. 16 GB RAM
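For what it's worth, the 21.0 GiB figure matches the condensed pairwise-distance matrix that the non-single linkages require for n = 75000 samples, which suggests this second case may simply be the expected O(n²) memory cost rather than a bug:

```python
# Back-of-the-envelope check of the 21.0 GiB allocation above.
n = 25000 + 50000            # total samples from make_blobs
pairs = n * (n - 1) // 2     # entries in the condensed distance matrix
gib = pairs * 8 / 2**30      # float64 entries -> GiB

print(pairs)          # 2812462500, the exact shape in the error message
print(round(gib, 1))  # 21.0
```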

@juliusvering

I have been looking into this issue. It seems that the tree generated for such high-valued inputs becomes highly connected. As a consequence, while exploring the hierarchical tree, we keep re-exploring the same nodes. I implemented a simple fix that keeps track of which nodes have already been visited during the exploration, and that fixes the issue above. Will follow up shortly with a PR.
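A rough pure-Python analogue of that fix (`get_descendants` here is a hypothetical stand-in for sklearn's Cython `_hc_get_descendent`; the `seen` set is the new part):

```python
def get_descendants(node, children, n_leaves):
    # Hypothetical pure-Python analogue of sklearn's _hc_get_descendent.
    # Without the `seen` set, a tree whose internal nodes share children
    # pushes the same subtrees onto the stack over and over, so the stack
    # grows without bound -- hence the MemoryError.
    stack = [node]
    seen = set()
    leaves = []
    while stack:
        i = stack.pop()
        if i in seen:
            continue
        seen.add(i)
        if i < n_leaves:
            leaves.append(i)                      # original sample
        else:
            stack.extend(children[i - n_leaves])  # internal node: descend
    return leaves

# Usage: 3 leaves; node 3 merges (0, 1), node 4 merges (3, 2).
children = [[0, 1], [3, 2]]
print(sorted(get_descendants(4, children, 3)))  # [0, 1, 2]
```

On a well-formed tree the `seen` check is a no-op; it only changes behaviour when the children array contains repeated references, which is what the degenerate inf-distance trees appear to produce.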

@juliusvering

However, this only fixes the issue described by THaar50, not the one described by JiaHong-Lee. While I'm not sure, I think the latter issue might not necessarily be a bug, but I can look into it as well.

@jnothman
Member

jnothman commented Aug 13, 2020 via email
