MRG: Support for distance matrices in NearestNeighbor #2532

Closed
wants to merge 15 commits into scikit-learn:master from robertlayton:knnd

@robertlayton
Member

Support for NearestNeighbors (the unsupervised one) to handle being given a distance matrix (i.e. a matrix D such that D[i][j] is the distance between samples i and j).
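
For illustration, a minimal sketch of the intended usage (names and parameters mirror the test added later in this PR; treat the exact API as an assumption):

import numpy as np
from sklearn import metrics, neighbors

rng = np.random.RandomState(0)
X = rng.random_sample((5, 3))
D = metrics.pairwise_distances(X, metric='euclidean')  # (n_samples, n_samples)

# Fit on the distance matrix itself rather than on a feature matrix.
nn = neighbors.NearestNeighbors(n_neighbors=2, algorithm='brute',
                                metric='precomputed')
nn.fit(D)
dist, ind = nn.kneighbors(D)  # neighbors read straight off the rows of D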

@robertlayton
Member

Note that, because of the nature of the algorithm, this cannot (without ugly hacking) support fitting on one matrix and giving the neighbors graph on another matrix. This is implicit in the code -- it will not fail, but will simply give the neighbors of the given matrix without taking the fitted matrix into consideration. I'm happy to take thoughts on how to handle this.
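
A hypothetical illustration of the pitfall (the random matrices below merely stand in for two different distance matrices):

import numpy as np
from sklearn import neighbors

rng = np.random.RandomState(0)
D1 = rng.random_sample((4, 4))  # matrix the model is fitted on
D2 = rng.random_sample((4, 4))  # a different matrix used at query time

nn = neighbors.NearestNeighbors(n_neighbors=2, algorithm='brute',
                                metric='precomputed').fit(D1)
# No error is raised, but the neighbors come from argsorting the rows
# of D2 alone; the fitted D1 never enters the computation.
dist, ind = nn.kneighbors(D2)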

@coveralls

Coverage Status

Coverage remained the same when pulling e060e81 on robertlayton:knnd into 02e0267 on scikit-learn:master.


@jakevdp jakevdp commented on an outdated diff Oct 18, 2013
sklearn/neighbors/base.py
**self.effective_metric_kwds_)
-
- # XXX: should be implemented with a partial sort
- neigh_ind = dist.argsort(axis=1)
- neigh_ind = neigh_ind[:, :n_neighbors]
+ if hasattr(dist, 'argsort'):
+ # XXX: should be implemented with a partial sort
+ neigh_ind = dist.argsort(axis=1)
+ neigh_ind = neigh_ind[:, :n_neighbors]
+ else:
+ # Sparse matrices don't have argsort, so do it per-row
+ neigh_ind = np.array([dist[i].indices[
+ dist[i].data.argsort()[:n_neighbors]]
+ for i in range(dist.shape[1])])
@jakevdp
jakevdp Oct 18, 2013 Member

Though it won't matter for the square matrix, we should use for i in range(dist.shape[0]) to make the code more clear.

@jakevdp
jakevdp Oct 18, 2013 Member

Also, this will be pretty slow for large matrices. This line is a possible candidate for cythonization: it would be a pretty easy utility routine to write for CSR matrices, especially if you take advantage of the cython sorting functions already implemented in the neighbors-related code.
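
For reference, a pure-NumPy sketch of the per-row routine under discussion (this is what a Cython version would accelerate; the helper name is made up):

import numpy as np
from scipy.sparse import csr_matrix

def sparse_rowwise_kneighbors(dist, n_neighbors):
    # For each row of a CSR distance matrix, take the column indices of
    # the n_neighbors smallest *stored* values; unstored entries never
    # become neighbors (they behave as infinite distances).
    dist = csr_matrix(dist)
    neigh_ind = []
    for i in range(dist.shape[0]):
        row = dist.getrow(i)
        order = np.argsort(row.data)[:n_neighbors]
        neigh_ind.append(row.indices[order])
    return np.array(neigh_ind)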

@larsmans
Member

I don't really see the use of this. When a distance matrix is sparse, doesn't that mean it contains a lot of duplicates?

@jakevdp jakevdp commented on an outdated diff Oct 18, 2013
sklearn/neighbors/base.py
return dist[j, neigh_ind], neigh_ind
+ else:
+ j = np.arange(neigh_ind.shape[0])
@jakevdp
jakevdp Oct 18, 2013 Member

I think j is not needed here

@jakevdp jakevdp commented on an outdated diff Oct 18, 2013
sklearn/neighbors/base.py
return dist[j, neigh_ind], neigh_ind
+ else:
+ j = np.arange(neigh_ind.shape[0])
+ return (np.array([dist[i, neigh_ind[i]]
+                 for i in range(dist.shape[0])]),
+ neigh_ind)
@jakevdp
jakevdp Oct 18, 2013 Member

Again, this is a candidate for cythonization.
There has been some work in scipy on fancy indexing of sparse matrices... it might be worth checking whether something in there would apply here.

@jakevdp
Member
jakevdp commented Oct 18, 2013

We should also add documentation here noting that a sparse distance matrix passed as 'precomputed' effectively says that the missing entries are infinity.

I wonder if there will be applications for this beyond the EAC algorithm in #1830. It might be more effective just to write a short custom kneighbor graph algorithm there.
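
To spell out that convention, a small illustrative snippet (not code from this PR):

import numpy as np
from scipy.sparse import csr_matrix

# Only stored entries are candidate neighbors; absent entries behave as
# though the distance were infinite, as in a graph adjacency matrix.
D = csr_matrix(np.array([[0.0, 0.5, 0.0],
                         [0.5, 0.0, 0.2],
                         [0.0, 0.2, 0.0]]))
# Row 0 stores a distance only to sample 1, so sample 2 could never be
# returned as a neighbor of sample 0, no matter how large n_neighbors is.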

@robertlayton
Member

I had considered just putting something directly in there but, especially given that the recent survey said sparse support was something people wanted, I decided that upgrading the existing code should be the preferred option.

You are correct on the "missing entries are infinity", which is how matrices that represent graphs are used. I'll update docs.

@larsmans Lots of duplicates, or some combination of a small feature space (i.e. categorical with few options and few features) or a non-discriminating distance matrix (like a locality-sensitive hash with very few bins).

Given all of that, it is possible that I should just remove the sparse option. I put it in because I have plans to make the co-association matrices in the EAC algorithm sparse, as they are n_samples by n_samples. The current iteration does not require it, though. Maybe that's the better option?

@amueller
Member

I think sparse distance matrices are sometimes useful. In vision they pop up all the time. If you have some sort of topology on your input data, you might only want to compute distances between points that you know are close, to speed up computations (in an image, it would be pretty bad to compute distances between all pairs of pixels).
I don't have a strong opinion on including it, though.

@GaelVaroquaux
Member

What remains to be done on this PR?

@larsmans larsmans commented on an outdated diff Dec 7, 2013
sklearn/neighbors/tests/test_neighbors.py
@@ -93,6 +94,24 @@ def test_unsupervised_inputs():
assert_array_almost_equal(ind1, ind2)
+def test_unsupervised_knn_distance():
@larsmans
larsmans Dec 7, 2013 Member

What happens when the matrix is not square?

@robertlayton
Member

I haven't come back to this for a while -- work has been hectic.

I believe that with sparse support removed (because sparse distance matrices should be rare), this works and is ready to go. While I agree that sparse support would be nice, I would argue for delaying it until someone needs it, rather than coding for the possibility.

@GaelVaroquaux
Member

I agree that sparse matrix support shouldn't delay merging this in. I'll
try to review the PR soon ( :$ )

@coveralls

Coverage Status

Coverage remained the same when pulling 5d7b75d on robertlayton:knnd into 877d892 on scikit-learn:master.

@perimosocordiae
Contributor

Are there any outstanding problems with this PR?

I just implemented an incomplete version of this same idea before realizing that this PR exists: perimosocordiae@scikit-learn:master...perimosocordiae:precomputed-dist

@robertlayton
Member

Thanks @perimosocordiae, I had forgotten about this -- I believe it is good to go.

@larsmans larsmans and 1 other commented on an outdated diff Apr 8, 2014
sklearn/neighbors/tests/test_neighbors.py
@@ -94,6 +95,24 @@ def test_unsupervised_inputs():
assert_array_almost_equal(ind1, ind2)
+def test_unsupervised_knn_distance():
+ """Tests unsupervised NearestNeighbors with a distance matrix."""
+ X = rng.random_sample((3, 3))
+ D = metrics.pairwise_distances(X, metric='euclidean')
+ # As a feature matrix (n_samples by n_features)
+ nbrs_X = neighbors.NearestNeighbors(n_neighbors=3)
+ nbrs_X.fit(X)
+ dist_X, ind_X = nbrs_X.kneighbors(X)
+ # As a dense distance matrix (n_samples by n_samples)
+ nbrs_D = neighbors.NearestNeighbors(n_neighbors=3, algorithm='brute',
+ metric='precomputed')
+ nbrs_D.fit(D)
@larsmans
larsmans Apr 8, 2014 Member

Can you try to pickle nbrs_D in the test? (Since you're using a lambda.)

@robertlayton
robertlayton Apr 9, 2014 Member

Sure. I'll try to get to this on the weekend.


@robertlayton
robertlayton Apr 20, 2014 Member

That "just worked". Pushing soon.

@coveralls

Coverage Status

Changes Unknown when pulling a63866d on robertlayton:knnd into * on scikit-learn:master.

@robertlayton
Member

OK. I've tested this with Python 3.4 on Ubuntu 14.04 and everything works. Assuming the automated checks come back clean, this is good to go.

@perimosocordiae
Contributor

Bump?

@robertlayton
Member

Yes please. Looks like there are conflicts. I don't have time to fix them this week, so could you please do a rebase against master and fix the conflicts? You should be able to PR into my branch and I'll accept that.

@perimosocordiae perimosocordiae Fix merge conflict with scikit-learn/master
Turns out the conflict wasn't anything significant.
1c786ab
@perimosocordiae perimosocordiae referenced this pull request in robertlayton/scikit-learn Sep 11, 2014
Merged

Fix merge conflict with scikit-learn/master #6

@robertlayton robertlayton Merge pull request #6 from perimosocordiae/patch-1
Fix merge conflict with scikit-learn/master
82bb6ca
@robertlayton
Member

OK, this looks good to go. @GaelVaroquaux ?

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 82bb6ca on robertlayton:knnd into 4b82379 on scikit-learn:master.

@jnothman jnothman commented on an outdated diff Jan 5, 2015
sklearn/metrics/pairwise.py
@@ -929,7 +929,8 @@ def chi2_kernel(X, Y=None, gamma=1.):
'euclidean': euclidean_distances,
'l2': euclidean_distances,
'l1': manhattan_distances,
- 'manhattan': manhattan_distances, }
+ 'manhattan': manhattan_distances,
+ 'precomputed': lambda x: x }
@jnothman
jnothman Jan 5, 2015 Member

Should probably test this in the context of sklearn.metrics.pairwise...
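
e.g. a hypothetical test along these lines:

import numpy as np
from numpy.testing import assert_array_almost_equal
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
D = pairwise_distances(rng.random_sample((4, 2)))
# With metric='precomputed', pairwise_distances should act as the identity.
assert_array_almost_equal(pairwise_distances(D, metric='precomputed'), D)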

@jnothman
Member
jnothman commented Jan 5, 2015

I think this feature would be useful in some circumstances (and obviously both @perimosocordiae and @robertlayton thought so too). It would certainly factor out a few lines of code in DBSCAN, for instance.

I think this needs to ensure that if algorithm='auto' and metric='precomputed', the brute-force algorithm is selected. It doesn't look like that's in the PR yet.
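
Something like the following dispatch rule (a simplified sketch, not the actual _fit code):

def resolve_algorithm(algorithm, metric):
    # Tree structures cannot index a precomputed distance matrix, so
    # 'auto' has to resolve to brute force when metric='precomputed'.
    if metric == 'precomputed':
        if algorithm not in ('auto', 'brute'):
            raise ValueError("precomputed metric requires algorithm='brute'")
        return 'brute'
    return algorithm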

jnothman and others added some commits Jan 10, 2015
@jnothman jnothman ENH precomputed is now a valid metric for 'auto' neighbors
Also, handle radius_neighbors case identically to kneighbors for precomputed
45a01fd
@jnothman jnothman pep8 e516960
@robertlayton robertlayton Merge pull request #7 from jnothman/knnd
some fixes to your PR 2532
6c298b3
@jnothman
Member

I now realise this implementation doesn't make sense. In the precomputed case, the query functions should accept X of shape (n_queries, n_indexed) rather than a square (n_samples, n_samples) matrix. The tests should use a query set different from the indexed data.
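
Concretely, the convention being asked for (a sketch under the assumption that the query matrix holds query-to-indexed distances):

import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X_index = rng.random_sample((6, 3))  # 6 indexed samples
X_query = rng.random_sample((2, 3))  # 2 distinct query samples

nn = NearestNeighbors(n_neighbors=3, algorithm='brute',
                      metric='precomputed')
nn.fit(pairwise_distances(X_index))  # square (6, 6) at fit time
# At query time the matrix is (n_queries, n_indexed) = (2, 6).
D_query = pairwise_distances(X_query, X_index)
dist, ind = nn.kneighbors(D_query)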

@jnothman
Member

The other thing this PR needs is changes to docstrings.

@perimosocordiae
Contributor

@jnothman I agree with the (n_queries, n_indexed) part. My original use-case was for building k-nearest neighbor graphs, where the query and target sets are the same, but this should be a general-purpose change.

@jnothman
Member

This PR is likely to be superseded by #4090

@raghavrv
Member
raghavrv commented Nov 4, 2016

With #4090 merged, should this be closed? @jnothman

@jnothman
Member
jnothman commented Nov 5, 2016

Thanks for the reminder.

@jnothman jnothman closed this Nov 5, 2016