
Single linkage option in Agglomerative causes MemoryError for very large numbers #17960

Open
THaar50 opened this issue Jul 20, 2020 · 6 comments · May be fixed by #18155

Comments

@THaar50

THaar50 commented Jul 20, 2020

Describe the bug

Using the single linkage option in Agglomerative clustering results in a MemoryError when the input data contains very large values, likely due to numeric overflows.

Steps/Code to Reproduce

Example:
Running the code below with half of the sample data does not produce a MemoryError (although the same numpy RuntimeWarning is still thrown). However, when the data contains as many very large values as in the sample below, my machine with 16 GB of RAM runs out of memory and fit raises a MemoryError. This seems disproportionate, since the sample contains only 50 points.

from sklearn.cluster import AgglomerativeClustering

X = [[1.30830774e+307, 6.02217328e+307],
     [1.54166067e+308, 1.75812744e+308],
     [5.57938866e+307, 4.13840113e+307],
     [1.36302835e+308, 1.07968131e+308],
     [1.58772669e+308, 1.19380571e+307],
     [2.20362426e+307, 1.58814671e+308],
     [1.06216028e+308, 1.14258583e+308],
     [7.18031911e+307, 1.69661213e+308],
     [7.91182553e+307, 5.12892426e+307],
     [5.58470885e+307, 9.13566765e+306],
     [1.22366243e+308, 8.29427922e+307],
     [4.39205961e+306, 1.26048413e+308],
     [4.61599953e+306, 7.24075646e+307],
     [1.66596896e+308, 1.65498552e+308],
     [4.73958815e+307, 7.66412710e+307],
     [1.57013390e+308, 1.03527051e+308],
     [1.28464631e+308, 1.68216358e+308],
     [1.07121506e+308, 3.11489418e+307],
     [2.97524276e+307, 1.59238260e+306],
     [1.24529964e+308, 5.18478922e+306],
     [7.55088957e+307, 7.08726240e+307],
     [8.38161061e+307, 1.50363727e+307],
     [1.11502738e+308, 3.77564517e+307],
     [1.33977566e+308, 4.21606136e+307],
     [3.18410347e+306, 1.41182675e+308],
     [1.37949806e+307, 1.33091975e+308],
     [8.84125355e+307, 8.83334687e+307],
     [4.16683700e+307, 4.38042780e+307],
     [6.33989592e+307, 1.13127438e+308],
     [3.30270525e+307, 5.82271903e+307],
     [1.73464450e+308, 1.57922601e+308],
     [8.50840567e+307, 8.34750980e+307],
     [1.06389805e+308, 7.01361194e+307],
     [1.20971776e+308, 1.42173002e+308],
     [1.91271271e+307, 7.63302091e+307],
     [1.46375293e+308, 3.27352625e+307],
     [5.64255485e+307, 1.62562194e+308],
     [5.96180877e+307, 4.31806469e+307],
     [1.18773200e+308, 4.91146568e+307],
     [1.19298612e+307, 1.54110558e+308],
     [3.51302340e+307, 5.60375139e+307],
     [5.14281492e+307, 9.76302343e+307],
     [1.77408023e+308, 1.65760854e+308],
     [1.46544741e+308, 1.47445640e+308],
     [1.76109403e+308, 1.30923042e+308],
     [1.64984146e+308, 2.58303609e+307],
     [1.61558758e+307, 1.03985868e+308],
     [1.37977676e+308, 4.90921157e+307],
     [1.01105745e+308, 1.57678709e+307],
     [1.24672794e+308, 5.96657664e+307]]

clusterer = AgglomerativeClustering(linkage='single')
clusterer.fit(X)
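The overflow is easy to see in isolation: the coordinates above are close to the float64 maximum (about 1.8e308), so the squared Euclidean distances computed during clustering overflow. A minimal sketch using two of the sample points:

```python
import numpy as np

# Two of the sample points above: coordinates are near the float64
# maximum (~1.8e308), so squaring their difference overflows to inf.
x = np.array([1.54166067e308, 1.75812744e308])
y = np.array([1.30830774e307, 6.02217328e307])

with np.errstate(over="ignore"):
    d2 = ((x - y) ** 2).sum()  # squared Euclidean distance

print(np.isinf(d2))  # True: the pairwise distances degenerate to inf
```

With every pairwise distance equal to inf, the linkage tree built from them is degenerate, which is presumably what triggers the failure downstream.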

Expected Results

Either no error is thrown, or a ValueError is raised that specifies the range of allowed values. GaussianMixture handles the same data by raising:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
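That message comes from scikit-learn's input validation, which can reject even technically-finite inputs when the array's sum overflows to inf. A rough sketch of that style of check (`check_finite` is a hypothetical helper modeled on sklearn's `assert_all_finite`, not its actual implementation):

```python
import numpy as np

def check_finite(X):
    # Hypothetical helper modeled on sklearn's assert_all_finite: summing
    # the array overflows to inf for magnitudes near the float64 max, so
    # even finite inputs this large are rejected with a clear message.
    X = np.asarray(X, dtype=np.float64)
    with np.errstate(over="ignore"):
        if not np.isfinite(X.sum()):
            raise ValueError("Input contains NaN, infinity or a value too "
                             "large for dtype('float64').")
    return X

check_finite([[1.0, 2.0]])              # small values pass
try:
    check_finite([[1.5e308, 1.7e308]])  # sum overflows -> rejected
except ValueError as e:
    print(e)
```

Running AgglomerativeClustering's input through a check like this before building the linkage tree would produce the same friendly error that GaussianMixture gives.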

Actual Results

C:\Program Files\Python37\lib\site-packages\numpy\core\fromnumeric.py:90: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)

Traceback (most recent call last):
  File "C:/Users/thaar/PycharmProjects/sklearn-dev/agglomerative_test.py", line 58, in <module>
    clusterer.fit(X)
  File "C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_agglomerative.py", line 898, in fit
    self.n_leaves_)
  File "C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_agglomerative.py", line 674, in _hc_cut
    label[_hierarchical._hc_get_descendent(-node, children, n_leaves)] = i
  File "sklearn\cluster\_hierarchical_fast.pyx", line 96, in sklearn.cluster._hierarchical_fast._hc_get_descendent
MemoryError

Versions

System:
python: 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\thaar\PycharmProjects\sklearn-dev\venv\Scripts\python.exe
machine: Windows-10-10.0.18362-SP0

Python dependencies:
pip: 20.1.1
setuptools: 49.2.0
sklearn: 0.23.1
numpy: 1.18.4
scipy: 1.4.1
Cython: None
pandas: 1.0.3
matplotlib: 3.2.1
joblib: 0.14.1
threadpoolctl: 2.0.0

Built with OpenMP: True

@jnothman
Member

Thanks for the report. This needs investigation.

@juliusvering

take

@JiaHong-Lee

Hi,

I face a similar issue with the other agglomerative linkage methods. I can reproduce the behaviour using:

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, Y = make_blobs(n_samples=[25000, 50000], cluster_std=[1.0, 0.8], n_features=4, centers=None, random_state=0)
# clusterer = AgglomerativeClustering(n_clusters=5, linkage="ward") # out of memory
# clusterer = AgglomerativeClustering(n_clusters=5, linkage="complete") # out of memory
# clusterer = AgglomerativeClustering(n_clusters=5, linkage="average") # out of memory
clusterer = AgglomerativeClustering(n_clusters=5, linkage="single") # okay
clusterer.fit(X)

My error message: MemoryError: Unable to allocate 21.0 GiB for an array with shape (2812462500,) and data type float64.

Details:

  1. scikit-learn v0.23.1
  2. Python v3.7.5
  3. Ubuntu 18.04.3 LTS
  4. 16 GB RAM
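For what it's worth, the 21.0 GiB figure matches the condensed pairwise-distance matrix that the non-single linkages require for n = 75000 samples, which suggests this second case may simply be the expected O(n²) memory cost rather than a bug:

```python
# Back-of-the-envelope check of the 21.0 GiB allocation above.
n = 25000 + 50000            # total samples from make_blobs
pairs = n * (n - 1) // 2     # entries in the condensed distance matrix
gib = pairs * 8 / 2**30      # float64 entries -> GiB

print(pairs)          # 2812462500, the exact shape in the error message
print(round(gib, 1))  # 21.0
```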

@juliusvering

I have been looking into this issue. It seems that the tree generated for such high-valued inputs becomes highly connected. As a consequence, while exploring the hierarchical tree, we keep re-exploring the same nodes. I implemented a simple fix that keeps track of which nodes have already been visited during the exploration, and that fixes the issue above. Will follow up shortly with a PR.
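A rough pure-Python analogue of that fix (`get_descendants` here is a hypothetical stand-in for sklearn's Cython `_hc_get_descendent`; the `seen` set is the new part):

```python
def get_descendants(node, children, n_leaves):
    # Hypothetical pure-Python analogue of sklearn's _hc_get_descendent.
    # Without the `seen` set, a tree whose internal nodes share children
    # pushes the same subtrees onto the stack over and over, so the stack
    # grows without bound -- hence the MemoryError.
    stack = [node]
    seen = set()
    leaves = []
    while stack:
        i = stack.pop()
        if i in seen:
            continue
        seen.add(i)
        if i < n_leaves:
            leaves.append(i)                      # original sample
        else:
            stack.extend(children[i - n_leaves])  # internal node: descend
    return leaves

# Usage: 3 leaves; node 3 merges (0, 1), node 4 merges (3, 2).
children = [[0, 1], [3, 2]]
print(sorted(get_descendants(4, children, 3)))  # [0, 1, 2]
```

On a well-formed tree the `seen` check is a no-op; it only changes behaviour when the children array contains repeated references, which is what the degenerate inf-distance trees appear to produce.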

@juliusvering

However, this only fixes the issue described by THaar50, not the one described by JiaHong-Lee. While I'm not sure, I think the latter issue might not necessarily be a bug, but I can look into it as well.

@jnothman
Member

jnothman commented Aug 13, 2020 via email
