FEA Allow string input for pairwise distances#27456
FEA Allow string input for pairwise distances#27456adrinjalali merged 29 commits intoscikit-learn:mainfrom julibeg:allow-non-numeric-input-for-pairwise-distances
Conversation
|
added an extra test for pairwise dists on boolean arrays |
|
@julibeg I solved the conflict, I will do a review now. |
glemaitre
left a comment
There was a problem hiding this comment.
I just pushed a small change to make codecov happy.
adrinjalali
left a comment
There was a problem hiding this comment.
Other than the public API note, LGTM.
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
|
I changed the API to use |
adrinjalali
left a comment
There was a problem hiding this comment.
It's looking quite good to me now.
sklearn/metrics/pairwise.py
Outdated
| elif ensure_2d: | ||
| if X.shape[1] != Y.shape[1]: | ||
| raise ValueError( | ||
| "Incompatible dimension for X and Y matrices: " | ||
| "X.shape[1] == %d while Y.shape[1] == %d" % (X.shape[1], Y.shape[1]) | ||
| ) | ||
| else: | ||
| # the distances are neither pre-computed nor is the input expected to be 2D; we | ||
| # thus only check if the number of samples is the same in both arrays | ||
| if len(X) != len(Y): | ||
| raise ValueError( | ||
| f"Incompatible length for X and Y matrices: len(X) == {len(X)} while " | ||
| f" len(Y) == {len(Y)}" | ||
| ) |
There was a problem hiding this comment.
seems like this can be replaced with:
| elif ensure_2d: | |
| if X.shape[1] != Y.shape[1]: | |
| raise ValueError( | |
| "Incompatible dimension for X and Y matrices: " | |
| "X.shape[1] == %d while Y.shape[1] == %d" % (X.shape[1], Y.shape[1]) | |
| ) | |
| else: | |
| # the distances are neither pre-computed nor is the input expected to be 2D; we | |
| # thus only check if the number of samples is the same in both arrays | |
| if len(X) != len(Y): | |
| raise ValueError( | |
| f"Incompatible length for X and Y matrices: len(X) == {len(X)} while " | |
| f" len(Y) == {len(Y)}" | |
| ) | |
| else: | |
| check_consistent_length(X, Y) |
There was a problem hiding this comment.
check_consistent_length is not appropriate here since we want to ensure that both arrays have the same number of features. In fact, the added check is wrong because even when input is non-numeric, we don't want to force an equal number of samples for X and Y. I just removed it.
sklearn/metrics/pairwise.py
Outdated
| if Y is not None: | ||
| y_dtype = Y.dtype if hasattr(Y, "dtype") else type(Y[0]) | ||
| if dtype != y_dtype: | ||
| raise TypeError("X and Y have different dtypes.") |
There was a problem hiding this comment.
this seems to be a check that didn't exist before, do we really need it? The distance function is the one deciding whether it can handle mixed types or not I think.
There was a problem hiding this comment.
I agree. I guess it was added to prevent check_pairwise_arrays from converting to float. But instead I removed this check and made check_pairwise_arrays preserve the dtype when using a custom metric.
jeremiedbb
left a comment
There was a problem hiding this comment.
Thanks for the PR @julibeg. While trying it out, I noticed an inconsistency between the dtype parameter of check_pairwise_arrays and the one of check_array. I also found that one of the test ensuring that both arrays have the same length is not correct: Different length are allowed, this function computes distances between all pairs of elements from X and Y. I directly pushed fixes for these along with some clean-ups.
jeremiedbb
left a comment
There was a problem hiding this comment.
LGTM. Wanna have another look @glemaitre or @adrinjalali ?
| *, | ||
| precomputed=False, | ||
| dtype=None, | ||
| dtype="infer_float", |
There was a problem hiding this comment.
I changed the default dtype in check_pairwaise_arrays and also the behavior of dtype=None because it was inconsistent with the behavior of dtype=None in check_array (i.e. preserve dtype).
Now:
dtype="infer_float" has the same behavior as previously dtype=None, that is convert to appropriate float.
dtype=None preserves the input dtype.
check_pairwaise_arrays is private and I checked that scikit-learn doesn't use the dtype parameter anywhere else than in pairwise_distances. That's why I directly changed the default and the behavior. Let me know if you're okay with that.
|
Actually, I think that we don't need to clutter the public API with a new @glemaitre this is ready for a final review |
sklearn/metrics/pairwise.py
Outdated
| "for %d indexed." % (X.shape[0], X.shape[1], Y.shape[0]) | ||
| ) | ||
| elif X.shape[1] != Y.shape[1]: | ||
| elif X.ndim == 2 and X.shape[1] != Y.shape[1]: |
There was a problem hiding this comment.
I'm not sure why we wouldn't do this if X is one dimensional
| elif X.ndim == 2 and X.shape[1] != Y.shape[1]: | |
| elif _num_samples(X) != _num_samples(Y): |
There was a problem hiding this comment.
Because we're not comparing n_samples but n_features instead. To be able to compute pairwise distances both arrays need to have the same number of features.
We could use _num_features but it errors on 1d arrays so I wanted to keep it simple and just enable the check when we enforce 2d arrays. I can replace the if X.ndim == 2 by if ensure_2d if it makes the reason clearer but it's essentially the same.
There was a problem hiding this comment.
I think a comment would be enough there then.
Reference Issues/PRs
closes #15932
closes #17991
closes #24674
supersedes #17991
supersedes #24674
What does this implement/fix? Explain your changes.
This allows the user to compute pairwise distance using a custom metric for non-metric data types (e.g. string or boolean).
Any other comments?
This PR just implements the changes from #24674 (which has stalled), including the last round of suggestions, and changes the name of the new parameter from
check_length_onlytoonly_check_num_samples.