Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow np.inf for missing values in case of precomputed metric #189

Merged
merged 3 commits into from
Mar 22, 2018

Conversation

LGro
Copy link
Contributor

@LGro LGro commented Mar 19, 2018

This is a first attempt to allow entries in a precomputed distance matrix to be undefined by setting them to numpy.inf (#187). The only necessary change seemed to be handling the input validation separately for the case metric == 'precomputed'and using numpy.inf instead of numpy.nan to not mess with the sorting, comparing and dividing in down-stream computations.
I have attempted a workaround for the input validation by adding check_precomputed_distance_matrix(X) that executes the same checks as check_matrix(X) but allows for numpy.infentries. Please check if that is clean enough for your taste and if hdbscan_.py is the place you would leave that function in.
Something else that I couldn't reliably judge is if copying the distance_matrix in line 78 is necessary or can be omitted.
At the moment no "error handling" is in place for when the minimum spanning tree can't be created anymore without having edge weights that are infinite. Depending on how much one trusts HDBSCAN users using this edge case, at least a warning message for that case might be a wise. If you think that might be a good idea, I'm happy to add one in a further revision.

I'm looking forward to your comments and suggestions.

@pep8speaks
Copy link

pep8speaks commented Mar 19, 2018

Hello @LGro, Thank you for updating !

Line 46:1: E402 module level import not at top of file
Line 48:1: E302 expected 2 blank lines, found 1
Line 122:25: E128 continuation line under-indented for visual indent
Line 160:25: E128 continuation line under-indented for visual indent
Line 196:25: E128 continuation line under-indented for visual indent
Line 228:25: E128 continuation line under-indented for visual indent
Line 289:25: E128 continuation line under-indented for visual indent
Line 594:13: E127 continuation line over-indented for visual indent
Line 892:1: W293 blank line contains whitespace
Line 936:18: E128 continuation line under-indented for visual indent
Line 951:18: E128 continuation line under-indented for visual indent
Line 952:18: E128 continuation line under-indented for visual indent
Line 963:18: E128 continuation line under-indented for visual indent
Line 964:18: E128 continuation line under-indented for visual indent

Comment last updated on March 22, 2018 at 14:55 Hours UTC

@lmcinnes
Copy link
Collaborator

This looks good -- I'm glad it actually all goes through. Currently it is getting stuck on the case of a sparse precomputed distance matrix as np.isinf in your check function doesn't apply to sparse matrices. I think you can probably just check for a sparse matrix and handle that separately to keep things working.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.01%) to 91.053% when pulling 9cdc513 on LGro:missing_distances into f3769a5 on scikit-learn-contrib:master.

@LGro
Copy link
Contributor Author

LGro commented Mar 22, 2018

The current state now just doesn't allow np.inf for missing distances in sparse distance matrices, only in full distance matrices.
With a bit more time on my hands I might also add support for sparse matrices.

I further added a UserWarning in case at least one of the MST's edges has the value np.inf and hint at too many missing distances, potentially in conjunction with a large neighborhood size, to be the reason for that warning.

Would you accept both numpy.nan and numpy.inf for missing distances?
It might make sense to convert nans to infs and use force_all_finite=False with sklearn.utils.check_array and not have a separate check function at all.
This then also adds support for the kind of sparse matrices that can be converted to a 2D ndarray.

@lmcinnes
Copy link
Collaborator

Thanks, this looks like it covers the potential cases well. As for allowing np.nan -- that might make some sense, especially if we are getting data handed to us by pandas or similar, but I don't see that as a priority. The main thing is to provide some reasonably robust code that will allow this, as you've already done. Things can always be improved at a later date.

Thanks again for taking the time to get this implemented, it is definitely greatly appreciated.

@lmcinnes lmcinnes merged commit 51fd5ed into scikit-learn-contrib:master Mar 22, 2018
@LGro LGro deleted the missing_distances branch March 23, 2018 13:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants