Single linkage option in Agglomerative causes MemoryError for very large numbers #17960
Comments
Thanks for the report. This needs investigation.

take
Hi, I face the same issue using agglomerative methods. I can reproduce the behaviour with:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, Y = make_blobs(n_samples=[25000, 50000], cluster_std=[1.0, 0.8],
                  n_features=4, centers=None, random_state=0)

# clusterer = AgglomerativeClustering(n_clusters=5, linkage="ward")      # out of memory
# clusterer = AgglomerativeClustering(n_clusters=5, linkage="complete")  # out of memory
# clusterer = AgglomerativeClustering(n_clusters=5, linkage="average")   # out of memory
clusterer = AgglomerativeClustering(n_clusters=5, linkage="single")      # okay
clusterer.fit(X)
```

My error message:
I have been looking into this issue. It seems that the tree generated for such high-valued nodes becomes highly connected. As a consequence, while exploring the hierarchical tree, we keep re-exploring the same nodes. I implemented a simple fix that keeps track of which nodes we have already visited during the exploration, and that fixes the issue above. Will follow up shortly with a PR.
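The idea of tracking already-visited nodes can be sketched as follows. This is a hedged illustration, not scikit-learn's actual implementation: the function name `collect_leaves` and the `children` mapping are hypothetical, but the visited-set guard is the kind of fix described above.

```python
def collect_leaves(children, root):
    """Collect leaf ids reachable from `root` in a cluster hierarchy.

    `children` maps an internal node id to its child ids; any id absent
    from the mapping is a leaf. Without the `visited` set, a highly
    connected structure is re-traversed over and over, blowing up the
    amount of work (and memory for the pending stack).
    """
    leaves = []
    visited = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in visited:
            continue  # the fix: skip nodes we have already explored
        visited.add(node)
        if node in children:
            stack.extend(children[node])
        else:
            leaves.append(node)
    return sorted(leaves)
```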
However, this only fixes the issue described by THaar50, not the one described by JiaHong-Lee. While I'm not sure, I think the latter issue might not necessarily be a bug, but I can look into it as well.
I think fixing the one and not the other is acceptable: JiaHong-Lee's issue isn't about single linkage and should be investigated separately. Pull request very welcome, thanks.
Describe the bug
Using the single linkage option in Agglomerative clustering results in a MemoryError when the input data contains very large values, likely due to numeric overflows.
Steps/Code to Reproduce
Example:
Running the code below with half of the sample data does not produce a MemoryError (although the same NumPy RuntimeWarning is thrown). However, when the data contains as many very large numbers as in my sample below, my machine with 16 GB of RAM runs out of memory and I get a MemoryError. This seems disproportionate, as the sample data is not very large.
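The suspected overflow mechanism can be demonstrated in isolation. This is a minimal sketch under an assumption (it does not use the reporter's elided data): squared Euclidean distances overflow float64 once coordinates exceed roughly 1e154, turning distances into `inf` with a NumPy RuntimeWarning.

```python
import numpy as np

# Assumed illustration of the overflow: squaring values of magnitude
# 2e200 exceeds the float64 maximum (~1.8e308), so the squared distance
# between two far-apart points becomes inf.
with np.errstate(over="ignore"):  # suppress the RuntimeWarning here
    a = np.array([1e200, -1e200])
    b = -a
    sq_dist = np.sum((a - b) ** 2)  # (2e200)**2 overflows to inf

print(np.isinf(sq_dist))  # True
```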
Expected Results
Either no error is thrown, or a ValueError is raised that specifies the allowed range of values. GaussianMixture clustering handles the same data like this:
Actual Results
Versions

```
System:
    python: 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\thaar\PycharmProjects\sklearn-dev\venv\Scripts\python.exe
   machine: Windows-10-10.0.18362-SP0

Python dependencies:
          pip: 20.1.1
   setuptools: 49.2.0
      sklearn: 0.23.1
        numpy: 1.18.4
        scipy: 1.4.1
       Cython: None
       pandas: 1.0.3
   matplotlib: 3.2.1
       joblib: 0.14.1
threadpoolctl: 2.0.0

Built with OpenMP: True
```