allow np.inf for missing values in case of precomputed metric #189

LGro · 2018-03-19T09:28:24Z

This is a first attempt to allow entries in a precomputed distance matrix to be undefined by setting them to numpy.inf (#187). The only necessary change seemed to be handling the input validation separately for the case metric == 'precomputed'and using numpy.inf instead of numpy.nan to not mess with the sorting, comparing and dividing in down-stream computations.
I have attempted a workaround for the input validation by adding check_precomputed_distance_matrix(X) that executes the same checks as check_matrix(X) but allows for numpy.infentries. Please check if that is clean enough for your taste and if hdbscan_.py is the place you would leave that function in.
Something else that I couldn't reliably judge is if copying the distance_matrix in line 78 is necessary or can be omitted.
At the moment no "error handling" is in place for when the minimum spanning tree can't be created anymore without having edge weights that are infinite. Depending on how much one trusts HDBSCAN users using this edge case, at least a warning message for that case might be a wise. If you think that might be a good idea, I'm happy to add one in a further revision.

I'm looking forward to your comments and suggestions.

pep8speaks · 2018-03-19T09:28:26Z

Hello @LGro, Thank you for updating !

In the file hdbscan/hdbscan_.py, following are the PEP8 issues :

Line 46:1: E402 module level import not at top of file
Line 48:1: E302 expected 2 blank lines, found 1
Line 122:25: E128 continuation line under-indented for visual indent
Line 160:25: E128 continuation line under-indented for visual indent
Line 196:25: E128 continuation line under-indented for visual indent
Line 228:25: E128 continuation line under-indented for visual indent
Line 289:25: E128 continuation line under-indented for visual indent
Line 594:13: E127 continuation line over-indented for visual indent
Line 892:1: W293 blank line contains whitespace
Line 936:18: E128 continuation line under-indented for visual indent
Line 951:18: E128 continuation line under-indented for visual indent
Line 952:18: E128 continuation line under-indented for visual indent
Line 963:18: E128 continuation line under-indented for visual indent
Line 964:18: E128 continuation line under-indented for visual indent

Comment last updated on March 22, 2018 at 14:55 Hours UTC

lmcinnes · 2018-03-19T17:04:29Z

This looks good -- I'm glad it actually all goes through. Currently it is getting stuck on the case of a sparse precomputed distance matrix as np.isinf in your check function doesn't apply to sparse matrices. I think you can probably just check for a sparse matrix and handle that separately to keep things working.

coveralls · 2018-03-22T14:59:45Z

Coverage increased (+0.01%) to 91.053% when pulling 9cdc513 on LGro:missing_distances into f3769a5 on scikit-learn-contrib:master.

LGro · 2018-03-22T15:05:30Z

The current state now just doesn't allow np.inf for missing distances in sparse distance matrices, only in full distance matrices.
With a bit more time on my hands I might also add support for sparse matrices.

I further added a UserWarning in case at least one of the MST's edges has the value np.inf and hint at too many missing distances, potentially in conjunction with a large neighborhood size, to be the reason for that warning.

Would you accept both numpy.nan and numpy.inf for missing distances?
It might make sense to convert nans to infs and use force_all_finite=False with sklearn.utils.check_array and not have a separate check function at all.
This then also adds support for the kind of sparse matrices that can be converted to a 2D ndarray.

lmcinnes · 2018-03-22T17:25:46Z

Thanks, this looks like it covers the potential cases well. As for allowing np.nan -- that might make some sense, especially if we are getting data handed to us by pandas or similar, but I don't see that as a priority. The main thing is to provide some reasonably robust code that will allow this, as you've already done. Things can always be improved at a later date.

Thanks again for taking the time to get this implemented, it is definitely greatly appreciated.

add np.inf for missing values in precomputed metric

e92012d

LGro added 2 commits March 22, 2018 15:46

handle sparse matrices, add user warning for too many missing

1f04e0b

handle 2nd sparse matrix validation

9cdc513

lmcinnes merged commit 51fd5ed into scikit-learn-contrib:master Mar 22, 2018

LGro deleted the missing_distances branch March 23, 2018 13:12

LGro mentioned this pull request Apr 13, 2018

add doc for missing distances support #192

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

allow np.inf for missing values in case of precomputed metric #189

allow np.inf for missing values in case of precomputed metric #189

LGro commented Mar 19, 2018

pep8speaks commented Mar 19, 2018 •

edited

Loading

lmcinnes commented Mar 19, 2018

coveralls commented Mar 22, 2018

LGro commented Mar 22, 2018 •

edited

Loading

lmcinnes commented Mar 22, 2018

allow np.inf for missing values in case of precomputed metric #189

allow np.inf for missing values in case of precomputed metric #189

Conversation

LGro commented Mar 19, 2018

pep8speaks commented Mar 19, 2018 • edited Loading

Comment last updated on March 22, 2018 at 14:55 Hours UTC

lmcinnes commented Mar 19, 2018

coveralls commented Mar 22, 2018

LGro commented Mar 22, 2018 • edited Loading

lmcinnes commented Mar 22, 2018

pep8speaks commented Mar 19, 2018 •

edited

Loading

LGro commented Mar 22, 2018 •

edited

Loading