Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash when points are equal #623

Open
mdruiter opened this issue Feb 1, 2024 · 1 comment
Open

Crash when points are equal #623

mdruiter opened this issue Feb 1, 2024 · 1 comment

Comments

@mdruiter
Copy link

mdruiter commented Feb 1, 2024

I experienced a hard Python crash on some real-life data. It turns out it's a call to .minimum_spanning_tree_.plot() when many points are exactly equal. A minimal example to reproduce:

import numpy as np
import hdbscan
model = hdbscan.HDBSCAN(gen_min_span_tree=True)
data = np.zeros((91, 3))
clustering = model.fit(data)
clustering.minimum_spanning_tree_.plot()

Note that it also happens when only a relative small proportion of points are equal (but only sometimes?), this is just the easiest way to show it.

I looked into this and it appears to be a problem in sklearn.manifold._t_sne._barnes_hut_tsne.gradient(),not (always?) being able to handle nans. For example:

import numpy as np
from sklearn.manifold._t_sne import _barnes_hut_tsne
neighbors = np.array([1, 2, 0, 2, 0, 1], dtype='int64')
val_P = np.full_like(neighbors, 2 / 45, dtype='float32')
pos_output = np.full((3, 2), np.nan, dtype='float32')
forces = np.zeros_like(pos_output)
indptr = np.arange(7, step=2, dtype='int64')
_barnes_hut_tsne.gradient(val_P, pos_output, neighbors, indptr, forces, 0.5, 2, 11)

One layer deeper, the crash occurs inside sklearn.neighbors._quad_tree._QuadTree.build_tree():

import numpy as np
from sklearn.neighbors._quad_tree import _QuadTree
qt = _QuadTree(2, 11)
X = np.full((3, 2), np.nan, dtype='float32')
qt.build_tree(X)

This output of this (due to verbose=11) up to the crash is:

[QuadTree] bounding box axis 0 : [nan, nan]
[QuadTree] bounding box axis 1 : [nan, nan]
[QuadTree] Inserting depth 0
[QuadTree] inserted point 0 in cell 0
[QuadTree] Inserting depth 0
[QuadTree] inserted point 0 in new child 1
[QuadTree] Inserting depth 0
[QuadTree] Inserting depth 1
...
[QuadTree] Inserting depth 6271
[QuadTree] inserted point 0 in new child 6272
[QuadTree] Inserting depth 6271
[QuadTree] Inserting depth 6272

I didn't dig into the QuadTree code.
I'm unsure whether this is a bug in hdbscan, scikit-learn, Cython or CPython...?

@lmcinnes
Copy link
Collaborator

lmcinnes commented Feb 1, 2024

I think you might need to push this upstream to scikit-learn and the t-SNE implementation there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants