-
-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG + 1] Implement haversine metric in pairwise #4458
Conversation
3dba895
to
c9a02bd
Compare
You could add a test here :) |
sklearn/metrics/pairwise.py
Outdated
@@ -426,6 +426,57 @@ def pairwise_distances_argmin(X, Y, axis=1, metric="euclidean", | |||
metric_kwargs)[0] | |||
|
|||
|
|||
def haversine_distances(X, Y=None, ): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(X, Y=None, )
--> (X, y=None)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry keep the y uppercased...
(X, Y=None, ) --> (X, Y=None)
@ragv I looked into wrong test folders :( Besides, the testing in nbrs = cls(algorithm='ball_tree', metric='haversine')
assert_raises(ValueError,
nbrs.predict,
X)
assert_raises(ValueError,
ignore_warnings(nbrs.fit),
Xsparse, y) Is there any reason that haversine metric cannot be used with ball_tree? @jakevdp ? |
sklearn/metrics/pairwise.py
Outdated
with shape (n_samples_X, 2). | ||
|
||
Y : array_like | ||
with shape (n_samples_Y, 2). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shape information can go in the first line itself... More description, if available, can follow next :) Like here.
You haven't validated the data like done here for |
@ragv I don't think |
I suggested input (
I think since previously |
There is an unrelated test failure in Python 3.4 alone... :-/ Others seem to pass... Probably temporary server unavailability... |
Any review on this PR @amueller ? Should I need to check the data is in the domain of cos, sin? |
You meant scipy right? :p |
Also in your PR description change fix to fixes so that github can automatically close that issue when this PR gets merged :) |
Hey @xuewei4d. Sorry I was sick over the weekend and just catching up. It is fine that |
# Test with tuples as X and Y | ||
X_tuples = tuple([tuple([v for v in row]) for row in X]) | ||
Y_tuples = tuple([tuple([v for v in row]) for row in Y]) | ||
S2 = pairwise_distances(X_tuples, Y_tuples, metric="euclidean") | ||
assert_array_almost_equal(S, S2) | ||
|
||
# Test haversine distnace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
distance
You should add the metric to |
Does the tree also raise an exception if the data is not 2d? Maybe that should be tested in test_neighbors_badargs? |
restarted. |
Yes, haversine distances are defined only in two dimensions. See https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neighbors/dist_metrics.pyx#L955 |
Just out of curiosity, could you check the performance of this algorithm vs. using the In [1]: from sklearn.neighbors import DistanceMetric
In [2]: import numpy as np
In [3]: X = np.random.random((100, 2))
In [20]: metric = DistanceMetric.get_metric('haversine')
In [21]: dist = metric.pairwise(X, X) The difference is that this doesn't do any broadcasting, so it may be faster and/or more memory efficient. Note that the |
I wasn't really aware of the DistanceMetric interface. Shame on me :-/ could we also use that for the jaccard? |
Yes, it would work for Jaccard! Actually, it will work for any of the metrics supported by BallTree. For pairwise distances, it tends to be a factor of a few slower than the equivalent scipy routines, though (if I remember correctly; I did the benchmarks a long time ago). |
Ok we should check for jaccard. |
what was the status here? Just waiting for another review? |
I think I need to do some benchmark. |
@xuewei4d Would you have some time to finish this up? |
Hi all, How can I finish this? |
Could you please benchmark how fast this is compared to an implementation based on Distance metric as suggested by jakevdp above? |
Closing due to being superseded by #12568 |
Fixes #4453 #4452 , add haversine distrance for pairwise_distances and document it.
I did not find any test forpairwise.py
, so I did not add any test cases.