allow np.inf for missing values in case of precomputed metric #189
Hello @LGro, thank you for updating!
Comment last updated on March 22, 2018 at 14:55 Hours UTC
This looks good -- I'm glad it actually all goes through. Currently it is getting stuck on the case of a sparse precomputed distance matrix, because the np.isinf call in your check function doesn't apply to sparse matrices. I think you can probably just check for a sparse matrix and handle that case separately to keep things working.
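One way to handle the sparse case separately, as suggested above, is sketched here. This is only an illustration and not the code from this pull request: the helper name `contains_inf` is hypothetical, and the sketch assumes it is enough to inspect the explicitly stored entries of a SciPy sparse matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse


def contains_inf(X):
    """Return True if X (dense or SciPy sparse) holds any infinite entry.

    Hypothetical helper for illustration; not part of hdbscan.
    """
    if issparse(X):
        # Convert to CSR so that .data is a flat array of the stored values,
        # regardless of the sparse format that was passed in.
        return bool(np.isinf(X.tocsr().data).any())
    return bool(np.isinf(np.asarray(X, dtype=float)).any())


dense = np.array([[0.0, np.inf], [np.inf, 0.0]])
sparse_finite = csr_matrix(np.array([[0.0, 1.0], [1.0, 0.0]]))
sparse_inf = csr_matrix(dense)

print(contains_inf(dense), contains_inf(sparse_finite), contains_inf(sparse_inf))
```

Entries that are not stored in a sparse matrix are implicit zeros, so only the stored values need to be checked for infinities.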
The current state now just doesn't allow I further added a Would you accept both
Thanks, this looks like it covers the potential cases well. As for allowing np.nan -- that might make some sense, especially if we are getting data handed to us by pandas or similar, but I don't see that as a priority. The main thing is to provide some reasonably robust code that will allow this, as you've already done. Things can always be improved at a later date. Thanks again for taking the time to get this implemented, it is definitely greatly appreciated.
This is a first attempt to allow entries in a precomputed distance matrix to be undefined by setting them to `numpy.inf` (#187). The only necessary change seemed to be handling the input validation separately for the case `metric == 'precomputed'` and using `numpy.inf` instead of `numpy.nan`, so as not to interfere with the sorting, comparing and dividing in downstream computations.
I have attempted a workaround for the input validation by adding `check_precomputed_distance_matrix(X)`, which executes the same checks as `check_matrix(X)` but allows for `numpy.inf` entries. Please check whether that is clean enough for your taste and whether `hdbscan_.py` is the place you would leave that function in.
Something else that I couldn't reliably judge is whether copying the `distance_matrix` in line 78 is necessary or can be omitted.
At the moment no "error handling" is in place for when the minimum spanning tree can no longer be created without edge weights that are infinite. Depending on how much one trusts HDBSCAN users with this edge case, at least a warning message for that case might be wise. If you think that is a good idea, I'm happy to add one in a further revision.
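The warning proposed above could look roughly like the following sketch. It is not part of this pull request: the function name `count_finite_components` is hypothetical, and the sketch assumes a dense distance matrix in which two points count as linked whenever their distance is finite. If the graph of finite distances has more than one connected component, no spanning tree can avoid infinite edge weights.

```python
import warnings

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def count_finite_components(distance_matrix):
    """Count connected components of the graph of finite distances.

    Hypothetical helper for illustration; warns when a spanning tree
    would need at least one infinite edge.
    """
    D = np.asarray(distance_matrix, dtype=float)
    # Adjacency mask: True wherever a finite distance is defined.
    finite = np.isfinite(D)
    np.fill_diagonal(finite, False)  # ignore self-distances
    graph = csr_matrix(finite.astype(np.float64))
    n_components, _ = connected_components(graph, directed=False)
    if n_components > 1:
        warnings.warn(
            "The finite entries of the distance matrix form %d disconnected "
            "components; the minimum spanning tree will contain infinite "
            "edge weights." % n_components,
            UserWarning,
        )
    return n_components


D = np.array([
    [0.0, 1.0, np.inf],
    [1.0, 0.0, np.inf],
    [np.inf, np.inf, 0.0],
])
count_finite_components(D)  # emits the warning: point 2 is unreachable
```

Building the adjacency from the finite mask rather than from the distances themselves keeps zero distances between distinct points from being mistaken for missing edges.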
I'm looking forward to your comments and suggestions.
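For readers who want a concrete picture of the validation idea in the description, here is a minimal sketch. It is not the `check_precomputed_distance_matrix` from this pull request: the name `validate_precomputed` is hypothetical, it handles dense input only, and the exact set of checks (two-dimensional, square, no NaN) is an assumption about what such a validator would enforce.

```python
import numpy as np


def validate_precomputed(X):
    """Validate a dense precomputed distance matrix, allowing inf entries.

    Hypothetical helper for illustration; the checks are assumptions.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim != 2 or X.shape[0] != X.shape[1]:
        raise ValueError("Expected a square 2-D distance matrix, got shape %s"
                         % (X.shape,))
    # NaN breaks sorting and comparisons downstream, so it stays forbidden;
    # inf is accepted as the marker for an undefined distance.
    if np.isnan(X).any():
        raise ValueError("Distance matrix contains NaN; use numpy.inf for "
                         "undefined distances instead.")
    return X


ok = validate_precomputed([[0.0, np.inf], [np.inf, 0.0]])
print(ok.shape)
```

Rejecting NaN while accepting infinity follows the reasoning in the description: infinite values still order correctly against finite ones, whereas every comparison involving NaN evaluates to false.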