Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG + 1] Implement haversine metric in pairwise #4458

Closed
wants to merge 1 commit into from

Conversation

xuewei4d
Copy link
Contributor

Fixes #4453 #4452 , add haversine distrance for pairwise_distances and document it.

I did not find any test for pairwise.py, so I did not add any test cases.

@landscape-bot
Copy link

Code Health
Code quality remained the same when pulling 3dba895 on xuewei4d:haversine_pairwise into 0117fb5 on scikit-learn:master.

@landscape-bot
Copy link

Code Health
Code quality remained the same when pulling c9a02bd on xuewei4d:haversine_pairwise into 0117fb5 on scikit-learn:master.

@raghavrv
Copy link
Member

You could add a test here :)

@@ -426,6 +426,57 @@ def pairwise_distances_argmin(X, Y, axis=1, metric="euclidean",
metric_kwargs)[0]


def haversine_distances(X, Y=None, ):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(X, Y=None, ) --> (X, y=None)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry keep the y uppercased...

(X, Y=None, ) --> (X, Y=None)

@xuewei4d
Copy link
Contributor Author

@ragv I looked into wrong test folders :(

Besides, the testing in test_neighbors.py failed in line 815.

        nbrs = cls(algorithm='ball_tree', metric='haversine')
        assert_raises(ValueError,
                      nbrs.predict,
                      X)
        assert_raises(ValueError,
                      ignore_warnings(nbrs.fit),
                      Xsparse, y)

Is there any reason that haversine metric cannot be used with ball_tree? @jakevdp ?

with shape (n_samples_X, 2).

Y : array_like
with shape (n_samples_Y, 2).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shape information can go in the first line itself... More description, if available, can follow next :) Like here.

@xuewei4d xuewei4d changed the title [MRG] Implement haversine metric in pairwise [WIP] Implement haversine metric in pairwise Mar 28, 2015
@raghavrv
Copy link
Member

Besides, the testing in test_neighbors.py failed in line 815.

You haven't validated the data like done here for euclidean_distances. :)

@xuewei4d
Copy link
Contributor Author

@ragv I don't think check_pairwise_arrays will let it pass test_neighbors_badargs in neighbors/test_neighbors.py. It looks like nbrs = cls(algorithm='ball_tree', metric='haversine') is a bad argument.

@landscape-bot
Copy link

Code Health
Code quality remained the same when pulling f7670d4 on xuewei4d:haversine_pairwise into 0117fb5 on scikit-learn:master.

@raghavrv
Copy link
Member

I suggested input (X, Y) validation, simply upon seeing bad args in the test title... sorry :p

haversine metric was introduced in this PR only right? It says ValueError was not raised...

I think since previously haversine was not implemented, there was a test to check if ValueError was raised... Now that we have haversine implemented I think you can go ahead and remove L815-L821.

@landscape-bot
Copy link

Code Health
Code quality remained the same when pulling 2b329b4 on xuewei4d:haversine_pairwise into 0117fb5 on scikit-learn:master.

@xuewei4d
Copy link
Contributor Author

Thanks for your comments, @ragv. I think we need someone else to take a look at L815

@landscape-bot
Copy link

Code Health
Code quality remained the same when pulling 3746d7d on xuewei4d:haversine_pairwise into 0117fb5 on scikit-learn:master.

@xuewei4d xuewei4d changed the title [WIP] Implement haversine metric in pairwise [MRG] Implement haversine metric in pairwise Mar 28, 2015
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.0%) to 95.11% when pulling 3746d7d on xuewei4d:haversine_pairwise into 0117fb5 on scikit-learn:master.

@raghavrv
Copy link
Member

There is an unrelated test failure in Python 3.4 alone... :-/ Others seem to pass... Probably temporary server unavailability...

@xuewei4d
Copy link
Contributor Author

xuewei4d commented Apr 1, 2015

Any review on this PR @amueller ? Should I need to check the data is in the domain of cos, sin? Spicy Scipy will give a running time warning anyway.

@raghavrv
Copy link
Member

raghavrv commented Apr 1, 2015

You meant scipy right? :p

@raghavrv
Copy link
Member

raghavrv commented Apr 1, 2015

Also in your PR description change fix to fixes so that github can automatically close that issue when this PR gets merged :)

@amueller
Copy link
Member

amueller commented Apr 1, 2015

Hey @xuewei4d. Sorry I was sick over the weekend and just catching up. It is fine that BallTree doesn't work with this metric.

# Test with tuples as X and Y
X_tuples = tuple([tuple([v for v in row]) for row in X])
Y_tuples = tuple([tuple([v for v in row]) for row in Y])
S2 = pairwise_distances(X_tuples, Y_tuples, metric="euclidean")
assert_array_almost_equal(S, S2)

# Test haversine distnace
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

distance

@amueller
Copy link
Member

amueller commented Apr 1, 2015

You should add the metric to test_neighbors_metrics in test_neighbors.py

@amueller
Copy link
Member

amueller commented Apr 1, 2015

Does the tree also raise an exception if the data is not 2d? Maybe that should be tested in test_neighbors_badargs?

@amueller
Copy link
Member

amueller commented Apr 5, 2015

restarted.

@jakevdp
Copy link
Member

jakevdp commented May 5, 2015

Yes, haversine distances are defined only in two dimensions. See https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neighbors/dist_metrics.pyx#L955

@jakevdp
Copy link
Member

jakevdp commented May 5, 2015

Just out of curiosity, could you check the performance of this algorithm vs. using the DistanceMetric class directly? e.g.

In [1]: from sklearn.neighbors import DistanceMetric

In [2]: import numpy as np

In [3]: X = np.random.random((100, 2))

In [20]: metric = DistanceMetric.get_metric('haversine')

In [21]: dist = metric.pairwise(X, X)

The difference is that this doesn't do any broadcasting, so it may be faster and/or more memory efficient.

Note that the DistanceMetric stuff is not really documented in the package, as I considered it primarily a workhorse for the BallTree/KDTree. The only reason the Python interface exists is to facilitate unit tests, but there's no reason we can't take advantage of it elsewhere!

@amueller
Copy link
Member

amueller commented May 5, 2015

I wasn't really aware of the DistanceMetric interface. Shame on me :-/ could we also use that for the jaccard?

@jakevdp
Copy link
Member

jakevdp commented May 5, 2015

Yes, it would work for Jaccard! Actually, it will work for any of the metrics supported by BallTree. For pairwise distances, it tends to be a factor of a few slower than the equivalent scipy routines, though (if I remember correctly; I did the benchmarks a long time ago).

@amueller
Copy link
Member

amueller commented May 5, 2015

Ok we should check for jaccard.

@amueller
Copy link
Member

what was the status here? Just waiting for another review?

@xuewei4d
Copy link
Contributor Author

I think I need to do some benchmark.

@raghavrv
Copy link
Member

raghavrv commented Nov 27, 2016

@xuewei4d Would you have some time to finish this up?

@jnothman jnothman added this to the 0.20 milestone Jun 14, 2017
@jnothman jnothman added Easy Well-defined and straightforward way to resolve Need Contributor labels Jun 14, 2017
@ibrahimsharaf
Copy link

Hi all, How can I finish this?

@jnothman
Copy link
Member

jnothman commented Aug 23, 2017

Could you please benchmark how fast this is compared to an implementation based on Distance metric as suggested by jakevdp above?

@jnothman
Copy link
Member

Closing due to being superseded by #12568

@rth rth closed this in #12568 Apr 24, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Easy Well-defined and straightforward way to resolve help wanted Stalled
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement haversine metric in pairwise, document properly
10 participants