MRG: Faster spatial trees with enhancements #1732

Closed
@jakevdp
scikit-learn member

New Features

  • The BallTree class is enhanced to be compatible with 20 different distance metrics. There are also depth-first, breadth-first, single-tree, and dual-tree algorithms for the queries. The interface is designed to be fully backward compatible with the old code: all test scripts from previous versions pass on this new version.

  • The same core binary tree code that implements the BallTree is also used to implement a KDTree, which can be faster in low dimensions for some datasets.

  • Both BallTree and KDTree now have a method for performing fast kernel density estimates in O[N log(N)] rather than the naive O[N^2]. Six common kernels are implemented, and each kernel works with any of the valid distance metrics.

  • Both BallTree and KDTree have a method for computing fast 2-point correlation functions, also in O[N log(N)] rather than the naive O[N^2]. This works with any of the valid distance metrics.

  • As a bonus, there is a set of class interfaces to the various distance metrics, with functions to generate a pairwise distance matrix under any of the metrics (see the brief usage sketch after this list).

  • All functions are documented and tested.
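
To make the scope concrete, here is a rough usage sketch (keyword spellings follow the descriptions above and should be treated as provisional until the API settles):

import numpy as np
from sklearn.neighbors import BallTree, KDTree, DistanceMetric

X = np.random.RandomState(0).random_sample((1000, 3))

tree = BallTree(X, leaf_size=30, metric='minkowski', p=3)
kdt = KDTree(X, leaf_size=30)                               # same interface, KD tree under the hood
dist, ind = tree.query(X[:5], k=3)                          # k-neighbors query
count = tree.query_radius(X[:5], r=0.2, count_only=True)    # r-neighbors query

density = tree.kernel_density(X[:5], h=0.1, kernel='gaussian')     # fast KDE
corr = tree.two_point_correlation(X, np.linspace(0.1, 0.5, 5))     # 2-point function

# stand-alone interface to the distance metrics
metric = DistanceMetric.get_metric('minkowski', p=3)
D = metric.pairwise(X[:5])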

Remaining Tasks

These tasks can be added to this PR, or saved for a later one. I wanted to get some feedback on the changes before going forward with these:

  • Update the NearestNeighbors classes to use the available metrics, as well as to use the new KDTree class

  • Rework the narrative documentation for nearest neighbors to reflect these changes.

  • Create estimator classes to wrap the kernel density estimator functionality, and perhaps also the 2-point correlation function

Speed

The new code is generally faster than the old ball tree implementation and the recent enhancements to scipy's cKDTree. Here are some runtimes in seconds for a query of 2000 points on my slightly old Linux laptop. The points are in 3 dimensions, with each dimension uniformly distributed between 0 and 1. These timings will change in hard-to-predict ways as the distribution of points changes; uniform point distributions in low dimensions tend to favor the KD-tree approach.

2000 points, metric = euclidean
             Build      KNN       RNN       KDE       2PT
cKDTree      0.00022    0.014     -         -         -
old BallTree 0.00072    0.023     0.022     -         -
BallTree     0.0008     0.013     0.011     0.043     0.26
KDTree       0.0011     0.008     0.02      0.031     0.31

2000 points, metric = minkowski (p=3)
             Build      KNN       RNN       KDE       2PT
cKDTree      0.00021    0.0088    -         -         -
old BallTree 0.00095    0.029     0.022     -         -
BallTree     0.00089    0.017     0.012     0.038     0.18
KDTree       0.0011     0.0095    0.013     0.023     0.2
@ogrisel
scikit-learn member

This looks amazing Jake. Now I wonder when I will be able to find the time to review all of this... :) How do you plan to expose the new kernel density estimation features of these data structures? New estimators with a fit + 1D predict_proba method? Reusing the predict_proba name from classifiers might be a confusing API. Or maybe a new (negative_log_)likelihood method on the existing neighbors models?

@ogrisel
scikit-learn member

BTW, are the KDE methods a faster alternative to scipy.stats.kde.gaussian_kde that could be used to smooth 1D or 2D distribution plots?

@ogrisel
scikit-learn member

Also, have you done benchmarks in higher dimensions (e.g. 30 to 100)?

@ogrisel
scikit-learn member

BTW: the travis failure seems to be a real test failure.

@jakevdp
scikit-learn member

Regarding the KDE API: I think it would benefit us to come up with a good general API for density estimation within scikit-learn. There are many algorithms which are based on estimating density (nonparametric/naive Bayes, generative classification, etc.), and there are several useful approaches (KDE, GMM, Bayesian k-neighbors methods, etc. -- take a look at the density estimation routines in astroML). A uniform density estimation interface for all of these would aid in using and comparing them, and would make it easy to build a set of nonparametric Bayes classifiers which use the various methods.

The speed of the KDE compared to the scipy gaussian_kde depends on several things. Scipy's function is O[N^2] every time. This version can be as bad as O[N^2] with a very large kernel and a very small tolerance: even in that worst case, however, it's about a factor of a few faster than gaussian_kde. As the kernel width is decreased and the tolerance is increased, the number of evaluations theoretically approaches O[N log(N)], or even nearly O[N] using the dual tree approach, so the speed gain should be even more significant. Another difference is that scipy's gaussian_kde seems to be written for statisticians: it includes built-in bandwidth estimation using several approaches common in the statistics community. I've done a more functional implementation common in the machine learning community, and I assume the user will implement their own bandwidth selection routine through, e.g., cross-validation.

I haven't yet done detailed benchmarks on data dimension, data size, etc. but I hope to do a blog post on that in the near future.

The test failure is related to canberra distance. The distances are tested by comparing the results to those of scipy.spatial.distance.cdist(). The canberra distance was implemented incorrectly before scipy version 0.10 (see scipy/scipy@32f9e3d). I believe the jenkins build uses scipy 0.9 currently, so that would lead to the errors. What's the best way to deal with that?

@ogrisel
scikit-learn member

Thanks for the explanation. Looking forward to your blog post.

I think we should add a CVed KDE utility function by default: most of the time users don't want to implement the CV by hand just to plot a smoothed distribution estimate.

The test failure is related to canberra distance. The distances are tested by comparing the results to those of scipy.spatial.distance.cdist(). The canberra distance was implemented incorrectly before scipy version 0.10 (see scipy/scipy@32f9e3d). I believe the jenkins build uses scipy 0.9 currently, so that would lead to the errors. What's the best way to deal with that?

Check the value of scipy.__version__ in the test and raise a nose.tools.SkipTest in case it's too old.
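
Something along these lines (just a sketch; adapt it to wherever the metric tests live):

import scipy
from nose import SkipTest

def test_canberra_metric():
    # canberra was computed incorrectly before scipy 0.10 (scipy/scipy@32f9e3d)
    scipy_version = tuple(int(v) for v in scipy.__version__.split('.')[:2])
    if scipy_version < (0, 10):
        raise SkipTest("scipy < 0.10 computes the canberra distance incorrectly")
    # ... compare against scipy.spatial.distance.cdist as before ...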

@jakevdp
scikit-learn member

I think we should add a CVed KDE utility function by default: most of the time users don't want to implement the CV by hand just to plot a smoothed distribution estimate.

Yes, I completely agree. When I said above that I wanted to leave cross-validation to the user, I had in mind that the "user" might be a wrapper function in scikit-learn! I view this tree code primarily as a low-level tool that will provide a good foundation for a lot of different estimators.

@ogrisel
scikit-learn member

Ok we agree then :)

@amueller
scikit-learn member

Wow this is awesome!
I don't fully understand the speed table, though. cKDTree is the scipy implementation, right? To me it looks like that is faster than KDTree. Is that not right?

@jakevdp
scikit-learn member

Andy - Yes, cKDTree is the scipy implementation. It's a bit faster for some things, much slower for others, and only implements a subset of the operations.

@jakevdp
scikit-learn member

I should mention the other difference: this KDTree is picklable, and uses no dynamic memory allocation: all tree data is stored in pre-allocated numpy arrays. In cKDTree all nodes are allocated dynamically, so it can't be pickled, and it comes with the other disadvantages of dynamic memory allocation. I'm not sure how much that affects the performance, but it seems to be a wash -- except for the building phase, which is only a small part of the total computation time. I just looked at the code, and realized there are some efficiencies that could be enabled in the KDTree node initialization. I hadn't worried about it until now because the build time is so small compared to the time of a query.

@jakevdp
scikit-learn member

Just FYI - though cKDTree is faster on a few things, everything here is faster than the old Ball Tree, so it's a net win for scikit-learn.

@jakevdp
scikit-learn member

@amueller - I took a closer look at why scipy's kd-tree construction is so much faster than this code. The reason is that the scipy version keeps track of an array of maxes and mins during construction, which saves an iteration over the points in each node. That approach leads to faster construction times, but there's a catch: the min and max bounds of a subnode will generally be tighter than the bounds inherited from its parent, and scipy's version never shrinks the bounds at each node. I believe this is why our kd-tree outperforms theirs on queries: we have tighter constraints on each node. This makes a bit of a difference for uniform data, but I anticipate a much bigger improvement on structured data -- I plan to do some detailed benchmarks as soon as I'm finished with my PyCon tutorial next week.
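
In rough pure-Python terms (illustrative only, not the actual Cython), the difference is:

import numpy as np

def tight_node_bounds(data, idx_array, idx_start, idx_end):
    # what this PR does: one extra pass over the node's points,
    # but the resulting bounding box is as small as possible
    pts = data[idx_array[idx_start:idx_end]]
    return pts.min(axis=0), pts.max(axis=0)

def inherited_node_bounds(parent_lower, parent_upper, split_dim, split_val, left):
    # scipy-style: reuse the parent's box and only clip the split dimension,
    # so all other dimensions keep the (possibly loose) parent bounds
    lower, upper = parent_lower.copy(), parent_upper.copy()
    if left:
        upper[split_dim] = split_val
    else:
        lower[split_dim] = split_val
    return lower, upper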

In any case, the build time is only a few percent of the query time, even with the slower method here. For that reason, I think we should stick with our code as-is.

I did manage to speed up the build by ~25%, though, by using raw pointers rather than typed memoryviews in some places. It turns out that if a typed memoryview is a class variable, there's some overhead associated with accessing its elements.

@amueller
scikit-learn member

Thanks for the in-depth analysis Jake :) I'm a bit swamped right now (again) but I'll try to have a look and review asap!

@jaquesgrobler
scikit-learn member

Wow Jake, this is very well done. It took me a while just to glance through it all... I'll try to look at it more in depth soon. Fantastic work!

@jakevdp
scikit-learn member

I've been thinking more about this - one issue right now is that the dual tree approach to KDE is sub-optimal for larger atol, and explicitly cannot handle non-zero rtol. I first wrote it this way because I wasn't sure there was an efficient way to implement it, but I'm starting to think it can be done. I'll have to try some things out during the PyCon sprints!

@larsmans larsmans and 1 other commented on an outdated diff Mar 16, 2013
sklearn/neighbors/binary_tree.pxi
+
+    cdef _sort(self):
+        cdef DTYPE_t[:, ::1] distances = self.distances
+        cdef ITYPE_t[:, ::1] indices = self.indices
+        cdef ITYPE_t row
+        for row in range(distances.shape[0]):
+            _simultaneous_sort(&distances[row, 0],
+                               &indices[row, 0],
+                               distances.shape[1])
+
+
+#------------------------------------------------------------
+# simultaneous_sort :
+# this is a recursive quicksort implementation which sorts `distances`
+# and simultaneously performs the same swaps on `indices`.
+cdef void _simultaneous_sort(DTYPE_t* dist, ITYPE_t* idx, ITYPE_t size):
@larsmans
larsmans Mar 16, 2013

Is this really faster than reusing some off-the-shelf sort such as np.argsort or qsort? It looks like a rather naive quicksort (no random pivoting, median-of-three rule, tail-recursion elimination, back-off to insertion sort, ...).

@jakevdp
jakevdp Mar 16, 2013

I have no doubt that the numpy sort is superior to this algorithm. The problem is that here we need to simultaneously sort the indices and the distances. I benchmarked it against numpy using something along the lines of

i = np.argsort(dist)
ind = ind[i]
dist = dist[i]

and found that this cython version was consistently about twice as fast (mainly due to the fancy indexing). If you know of a better way to do this without resorting to cython, please let me know!

@larsmans
larsmans Mar 16, 2013

Well, no, I suspect that qsort will be even slower because of all the pointer indirection. But I'm still a bit worried about the naive quicksort, because there are inputs which make this take quadratic time.

@jakevdp
jakevdp Mar 16, 2013

True - that's a good point. One piece to keep in mind, though, is that this is only sorting lists of length k, where k is the number of neighbors (typically ~1-50), so it shouldn't be a huge penalty even in the rare worst-case.

Can you think of a good route that doesn't involve re-implementing a more sophisticated sort?

@larsmans
larsmans Mar 16, 2013

I'm actually a bit surprised that this is faster than argsort, because the expected number of swaps is around 2 n lg n whereas sorting indices and then using those to sort distances would take 2 n lg n + n.

But no, I can't think of a good alternative -- there's the C++ std::sort, which is practically always implemented as the extremely fast introsort algorithm, but we also have good reasons not to use C++ code if it's not needed.

Is this called at prediction time?

@jakevdp
jakevdp Mar 16, 2013

Yes, this is called at the prediction/query, though only if the sort_results keyword is True. I think the reason this is faster is because the constant in the O[n] indexing operations is larger than the constant in front of the O[n log n] swaps. I only benchmarked it for ~1-100 neighbors, which is typical for k neighbors queries.

@jakevdp
jakevdp Mar 18, 2013

I just double-checked the benchmarks I had run... the simultaneous sort is about 10 times faster than the numpy option for O[1] points, and about 3 times faster than the numpy option for O[1000] points.

I was thinking that if that difference is a negligible percentage of the query time, it might be cleaner just to use the numpy option. But when searching for a large number of neighbors, the cost of the current implementation is on the order of 30% of the query cost, so it's definitely worth worrying about!

I also realized that I haven't yet implemented the sort_results keyword I mentioned above. With these benchmarks in mind, that seems like it would be a really good thing to have -- I'll take a stab at implementing that when my PyCon/PyData talks are done!

@larsmans
larsmans Mar 18, 2013

Alright. My concern is that it may be possible to DoS a web app that calls predict on user input, but I guess we'll have to take the risk. (Median-of-three pivoting, while still O(n²), would still be nice to have as it fixes almost all non-adversarial performance problems.)

@jakevdp
jakevdp Mar 18, 2013

Median-of-three pivoting sounds like a good idea -- I'll plan to put that in.
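
For reference, here's a rough pure-Python sketch of what I have in mind (the real code works on raw Cython pointers; this just shows the pivot rule):

def _swap(dist, idx, i, j):
    dist[i], dist[j] = dist[j], dist[i]
    idx[i], idx[j] = idx[j], idx[i]

def simultaneous_sort(dist, idx, lo=0, hi=None):
    # quicksort dist[lo:hi] in place, applying identical swaps to idx
    if hi is None:
        hi = len(dist)
    if hi - lo <= 1:
        return
    mid = (lo + hi) // 2
    # median-of-three: order dist[lo] <= dist[mid] <= dist[hi - 1]
    if dist[mid] < dist[lo]:
        _swap(dist, idx, mid, lo)
    if dist[hi - 1] < dist[lo]:
        _swap(dist, idx, hi - 1, lo)
    if dist[hi - 1] < dist[mid]:
        _swap(dist, idx, hi - 1, mid)
    # use the median as the pivot: park it at the end and partition
    _swap(dist, idx, mid, hi - 1)
    pivot = dist[hi - 1]
    store = lo
    for i in range(lo, hi - 1):
        if dist[i] < pivot:
            _swap(dist, idx, i, store)
            store += 1
    _swap(dist, idx, store, hi - 1)
    simultaneous_sort(dist, idx, lo, store)
    simultaneous_sort(dist, idx, store + 1, hi)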

@larsmans larsmans and 1 other commented on an outdated diff Mar 16, 2013
sklearn/neighbors/binary_tree.pxi
+        return 0.5 / h
+    if kernel == LINEAR_KERNEL:
+        return 1.0 / h
+    elif kernel == COSINE_KERNEL:
+        return 0.25 * PI / h
+
+
+######################################################################
+# Tree Utility Routines
+cdef inline void swap(DITYPE_t* arr, ITYPE_t i1, ITYPE_t i2):
+    cdef DITYPE_t tmp = arr[i1]
+    arr[i1] = arr[i2]
+    arr[i2] = tmp
+
+#------------------------------------------------------------
+# NeighborsHeap
@larsmans
larsmans Mar 16, 2013

This is just a binary heap, right?

@jakevdp
jakevdp Mar 16, 2013

It's a custom binary heap: there's a maximum number of elements (the number of neighbors) and it throws out neighbors that are too far away. It also performs the same swaps on both the indices and distances.

@jakevdp
jakevdp Mar 16, 2013

Also, this heap stores the neighbors and indices in place, so no copying is required at the end of the tree traversal.
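
In pure-Python terms the push looks roughly like this (a sketch of the idea, not the Cython code):

import numpy as np

def heap_push(distances, indices, val, i_val):
    # distances/indices are fixed-size arrays kept as a max-heap on distance,
    # so distances[0] is always the largest (i.e. k-th best) distance kept so far
    if val >= distances[0]:
        return                      # farther than the current k-th neighbor: discard
    # overwrite the current worst neighbor, then sift the new value down
    distances[0], indices[0] = val, i_val
    i, size = 0, distances.shape[0]
    while True:
        child = 2 * i + 1
        if child >= size:
            break
        if child + 1 < size and distances[child + 1] > distances[child]:
            child += 1
        if distances[child] > distances[i]:
            distances[i], distances[child] = distances[child], distances[i]
            indices[i], indices[child] = indices[child], indices[i]
            i = child
        else:
            break

# usage: distances = np.zeros(k) + np.inf; indices = np.zeros(k, dtype=int)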

@larsmans larsmans commented on the diff Mar 16, 2013
sklearn/neighbors/binary_tree.pxi
+    ITYPE_t i1
+    ITYPE_t i2
+
+# use a dummy variable to determine the python data type
+cdef NodeHeapData_t ndummy
+cdef NodeHeapData_t[:] ndummy_view = <NodeHeapData_t[:1]> &ndummy
+NodeHeapData = np.asarray(ndummy_view).dtype
+
+
+cdef inline void swap_nodes(NodeHeapData_t* arr, ITYPE_t i1, ITYPE_t i2):
+    cdef NodeHeapData_t tmp = arr[i1]
+    arr[i1] = arr[i2]
+    arr[i2] = tmp
+
+
+cdef class NodeHeap:
@larsmans
larsmans Mar 16, 2013

Why does this look completely different from your other heap? Could you merge some of the code?

@jakevdp
jakevdp Mar 16, 2013

Unlike the other heap, this one can't throw away items and has to be able to grow as nodes are appended (it's all but impossible to predict how many nodes will be pushed into the heap during a query). It might be possible to abstract some of the heap infrastructure and share code between the two, but it wouldn't be straightforward.

@larsmans
scikit-learn member

Just out of curiosity, how long does this take to compile? It's 30kLOC in C...

@jakevdp
scikit-learn member

On my newish macbook pro, compilation takes ~8 seconds.

@jakevdp
scikit-learn member

@larsmans - I added median-of-three pivoting, as well as using a fast direct sort for partitions less than or equal to length 3.

I also added the sort_results keyword to the query command. Thanks for all the feedback on this!

@jakevdp
scikit-learn member

Dang, I just realized that my new quicksort algorithm broke something... I need to figure out how to fix it.

@jakevdp
scikit-learn member

Another small addition: the ability to pass a python function defining a user metric. This is slow due to the python call overhead, but based on an email from a user today, it seems like it would be interesting to allow the option.
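
Usage would look something like this (a sketch; the callable must be a true metric, and every distance goes through a Python call):

import numpy as np
from sklearn.neighbors import BallTree

def mydist(x, y):
    # a toy user-defined metric: maximum absolute coordinate difference
    return np.max(np.abs(x - y))

X = np.random.RandomState(0).random_sample((50, 3))
tree = BallTree(X, metric=mydist)
dist, ind = tree.query(X[:1], k=3)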

@jakevdp
scikit-learn member

Another TODO: implement the haversine formula. This is a 2D metric which gives the angle between points on the surface of a sphere, where the positions can be specified in latitude and longitude. It should be fairly easy to implement, and useful in astronomy applications - thanks to @jsundram for the suggestion.
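
For reference, the formula I have in mind is roughly this (inputs as [latitude, longitude] in radians; sketch only):

import numpy as np

def haversine(x1, x2):
    lat1, lon1 = x1
    lat2, lon2 = x2
    sin_dlat = np.sin(0.5 * (lat2 - lat1))
    sin_dlon = np.sin(0.5 * (lon2 - lon1))
    arg = sin_dlat ** 2 + np.cos(lat1) * np.cos(lat2) * sin_dlon ** 2
    # central angle in radians; multiply by the sphere's radius for a distance
    return 2 * np.arcsin(np.sqrt(arg))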

@larsmans
scikit-learn member

Also useful for GIS or anything using location information!

@jakevdp
scikit-learn member

OK, I added the Haversine distance & associated tests, and did some cleanup (including adding the ability for our functions to pass on exceptions). I think this metric will be a useful addition!

@jsundram

Looks good to me. I don't have a good sense for how (in)accurate this is for (nearly) antipodal points. Do you think it may be worth checking for that, and using the cosine-based formula instead?

@GaelVaroquaux GaelVaroquaux and 2 others commented on an outdated diff Mar 25, 2013
sklearn/neighbors/dist_metrics.pyx
+    - N    : number of dimensions
+    - NTT  : number of dims in which both values are True
+    - NTF  : number of dims in which the first value is True, second is False
+    - NFT  : number of dims in which the first value is False, second is True
+    - NFF  : number of dims in which both values are False
+    - NNEQ : number of non-equal dimensions, NNEQ = NTF + NFT
+    - NNZ  : number of nonzero dimensions, NNZ = NTF + NFT + NTT
+
+    =================  =======================  ===============================
+    identifier         class name               distance function
+    -----------------  -----------------------  -------------------------------
+    "jaccard"          JaccardDistance          NNEQ / NNZ
+    "maching"          MatchingDistance         NNEQ / N
+    "dice"             DiceDistance             NNEQ / (NTT + NNZ)
+    "kulsinski"        KulsinskiDistance        (NNEQ + N - NTT) / (NNEQ + N)
+    "rogerstanimoto"   RogerStanimotoDistance   2 * NNEQ / (N + NNEQ)
@GaelVaroquaux
GaelVaroquaux Mar 25, 2013

My gut feeling is that to follow scikit-learn's naming convention, this should be 'roger_stanimoto' (a similar remark applies below). However, maybe you are using these strings to be compatible with another library?

@larsmans
larsmans Mar 25, 2013

It's actually called the Rogers-Tanimoto distance (not Roger-Stanimoto).

@jakevdp
jakevdp Mar 25, 2013

Ha! Thanks @larsmans, I'll change that.

Gael, the reason I chose the string identifiers I did is because they match what's used within pdist and cdist in the scipy.spatial.distance module.

@GaelVaroquaux
GaelVaroquaux Mar 25, 2013
@jakevdp
jakevdp Mar 25, 2013

The other option is to allow both forms: use a scikit-learn version with underscores in our documentation, but also silently accept the no-underscore version for compatibility with scipy. I already do that for the manhattan distance: scipy calls it 'cityblock', so the code accepts that as well.
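
i.e., something as simple as (contents purely illustrative):

# map scipy-style spellings onto scikit-learn spellings at lookup time
METRIC_ALIASES = {'cityblock': 'manhattan',
                  'rogerstanimoto': 'rogers_tanimoto'}

def resolve_metric_name(name):
    return METRIC_ALIASES.get(name, name)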

@GaelVaroquaux
scikit-learn member
@jakevdp
scikit-learn member

First of all, welcome back Gael! I hope your trip went well!

As a general note about integration of this with scikit-learn: yes, I think there are some good things that can be done. I intended to first get this code out there as-is, without any class wrappers or other attempts at integrating the functionality into the sklearn estimator syntax, mainly so we could start discussions like this one. Whether these additional features should be part of this PR or part of separate PRs in the future is up for discussion; because the scope of the changes is so large, I'd lean toward the latter.

As I see it, this is what would be nice to do with this new stuff:

  • improve the neighbors classifier and regressor to make use of the available metrics
  • think more about the implementation of a general density estimation framework, encompassing KDE, GMM, nearest-neighbors-based methods, Gaussian approximations, and perhaps others. I like the idea of using the score function rather than the predict_proba function, but that's something we can discuss. Once we have a uniform interface to some density estimators, it will be trivial to implement some very general generative classification algorithms on top of them. This would be a superset of what's currently available in the naive_bayes module.
  • integrate this functionality more with the pairwise module, as well as the neighbors graph routines. The pairwise functions here were really just an afterthought that allowed easier testing against the scipy implementations. I doubt they'll be faster than what's currently in the pairwise module -- that would be worth checking -- but they're certainly more general.
  • think about where the two-point statistics fit in. In astronomy, we often compute the 2-point correlation of objects on the sky, and fit nonlinear theoretical models to the result to learn about cosmological parameters
  • do an overhaul of the documentation for all of this

This is a large project, and it's more than I can work on in the near term. I leave the country Friday, and won't really have much time to devote to this until mid-April or later. That being said, it really is something I'd like to do! Actually, come to think of it, this might be a good GSoC project: it's well-defined, well-contained, impactful, and something that could be accomplished and completed during a GSoC tenure.

Anyway, I guess the first thing to decide is whether this can be committed as-is, or whether the sklearn integration & documentation should be part of this PR as well. Also, anything else I missed in my list?

@jakevdp
scikit-learn member

An important remaining task: I realized that I took a shortcut in the KDE normalizations, and need to change them so they're correct in N dimensions. This should be pretty straightforward.

A question for anyone with an opinion: should the KDE return a normalized density (so that the integral over all space is 1) or an unnormalized density (so the integral over all space is N, the number of points)? I lean toward unnormalized, because physical density is more useful in my own KDE applications. But scipy's kernel density functions return normalized results. Any thoughts?

@agramfort
scikit-learn member
@jakevdp jakevdp referenced this pull request in astroML/astroML Apr 16, 2013
Closed

Cross-matching uses incorrect distances #12

@jakevdp
scikit-learn member

I implemented the correct normalizations for D-dimensional euclidean KDE. These were fun integrals to compute!
For simplicity, I ended up going with normalizations such that the integral over all space is N. I'm not sure an extra class param is warranted, because the user can just as easily divide the result by N if they need to.

My next task is going to be putting together some benchmarks. I've started on this, and will try to get out a blog post on the subject in the next week or so. While working through that, I may find some other things in the code that need to be tweaked.

Given the size of this PR already, I'd prefer to delegate remaining tasks to later PRs - splitting them up, I think, would make thorough review a less daunting task. Given this, here are my thoughts for work I'd like to accomplish based on this:

  1. merge this PR basically as-is (modulo any other comments on the current changes)
  2. do a PR which incorporates the new tree options into the existing nearest neighbors estimators and other places where BallTree is used.
  3. do a PR which overhauls the narrative documentation of the neighbors estimators and the Ball/KD tree.
  4. do a PR which implements a density estimation framework... this would include the sklearn-style interface to the KDE & correlation function computations, and also incorporate GMM, bayesian nearest neighbors methods, Gaussian approximation, and other relevant tools into a unified density estimation framework (perhaps a new submodule -- that can be discussed). Include narrative documentation of the new functionality.
  5. do a PR which implements a general generative classifier based on these density estimators. This is akin to naive bayes classification, but takes the "naive" out of it! If we design it well, it could probably supersede almost everything that is currently in the naive bayes module. Include narrative documentation of the new module.

That's a lot to do before the next release, especially given my travel schedule over the next couple months. Perhaps we can push it to the point where it can be finished during the Scipy US sprint? In any case, let me know if you have feedback about this!

@GaelVaroquaux
scikit-learn member
@jakevdp
scikit-learn member

Gael - that sounds good. I'll plan to incorporate the new metric options into the neighbors module and the manifold module. I've pushed a few little tweaks to the code that will allow this to be done more easily! I'm working on the incorporation into NearestNeighbors right now.

@jakevdp
scikit-learn member

OK - all the NearestNeighbors-related classes can now use any of the available metrics with algorithm in ['brute', 'ball_tree', 'kd_tree'], by setting the metric keyword. Backward compatibility is maintained by making the default metric minkowski, so that p can be specified alone. The default is p=2, and this will switch the metric to euclidean for efficiency.

Note also that I changed the API so that algorithm = 'kd_tree' refers to the new scikit-learn kd-tree, while algorithm = 'ckd_tree' refers to scipy.spatial.cKDTree.
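
In practice that looks something like this (a sketch using the keywords described above):

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).random_sample((100, 3))

# backward-compatible default: minkowski with p=2, handled internally as euclidean
nn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree').fit(X)
dist, ind = nn.kneighbors(X[:2])

# explicitly requesting another metric on the ball tree
nn_cheb = NearestNeighbors(n_neighbors=5, algorithm='ball_tree',
                           metric='chebyshev').fit(X)
dist_cheb, ind_cheb = nn_cheb.kneighbors(X[:2])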

We should talk about whether we even want to maintain access to the ckdtree. I don't think it has any advantages (at best, its benchmarks seem to match, and never outperform, those of the new KD Tree), and it has the distinct disadvantage that it dynamically allocates all its memory, and thus cannot be pickled. On the other hand, currently it will only be selected if the user specifically asks for it, so it may be alright to leave in as-is (perhaps we should put some warnings to this effect in the docs).

@GaelVaroquaux
scikit-learn member
@jakevdp
scikit-learn member

Re: cKDTree -- should we ask for people's opinions on the mailing list?

@GaelVaroquaux
scikit-learn member
@jakevdp
scikit-learn member

As I work on writing up some benchmarks for this, I'm finding little issues here and there that need to be addressed.

The last thing as far as the guts of the code (rather than the sklearn wrappers, etc.) is to redo the dual tree KDE methods. I spent some time thinking about it as I flew across the US yesterday, and I think I know how to write the code. Now all I need is to write it!

After looking at the benchmarks, I think it will be safe to remove our support for cKDTree. Its only real advantage is that build time is slightly faster, but that savings is extremely small compared to the cost of a full KNN query. Removing it will also lead to simpler code in the NearestNeighbors classes.

@jakevdp
scikit-learn member

I just published a blog post with some detailed benchmarks of the two new trees in comparison with scipy.spatial.cKDTree: http://jakevdp.github.io/blog/2013/04/29/benchmarking-nearest-neighbor-searches-in-python/
I think Gael's right - we should remove cKDTree bindings, as I haven't found any situations where it produces better results.

@clifflyon

I happen to be working on a python module to do nearest-neighbors for lat/long points using haversine distance, and came across this, which looks ideal - any idea when/if this will be available? thanks!

@lelale
@clifflyon

@lelale no i think you are just following sklearn now. (but i'm a noob, too - i just signed up b/c i wanted to know about the status of this pull request.)

@jakevdp
scikit-learn member

I've been completely swamped and haven't had the time to finish off the check-list: hopefully I can do the remaining pieces at the sprint at Scipy2013 in a week and a half. Until then, the core routines are mostly functional, so you could use them by building from this branch:

git clone git://github.com/jakevdp/scikit-learn.git
cd scikit-learn
git checkout new_ball_tree
python setup.py build_ext --inplace

and see more build instructions on http://scikit-learn.org/dev/install.html

@clifflyon

got it - thanks!

@jakevdp
scikit-learn member

I just fixed a few bugs, and added a KernelDensity estimator class.

@ogrisel, @GaelVaroquaux and I spent some time thinking about the best way to do this, but we'd still like some feedback. In particular:

  • where should KDE live? We put it with neighbors because it's an instance-based learner, but it is also closely related to GMM. Perhaps we should have a new density submodule? But then GMM wouldn't be in it, which could be confusing.
  • for the KDE interface, I chose to follow the current API of GMM: to evaluate the (log-)density at a point, you can use the eval method. I also implemented the score and sample methods, which I think should be standard for any density estimation classes we end up including (a short usage sketch follows this list).
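
A rough usage sketch (method names as described above; they may still change):

import numpy as np
from sklearn.neighbors import KernelDensity

X = np.random.RandomState(0).randn(100, 2)
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X)

log_dens = kde.eval(X[:5])   # log-density evaluated at the first five points
total = kde.score(X)         # total log-likelihood of a dataset
new_X = kde.sample(10)       # draw 10 new points from the estimated density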

My long-term goal for the density estimators would be to create a very general Bayesian generative classifier. I have a gist with the basic idea here: https://gist.github.com/jakevdp/5891921 This gist shows how GMM, KDE, and a normal approximation can be used in a GenerativeBayes estimator in order to perform classification. This application is one reason that following the current GMM conventions is very convenient.

The density stuff aside, I think this PR is ready to go pending some documentation updates, especially listing the available kernels and the generalized 2-point methods. That's what I'm planning to work on next.

@jakevdp
scikit-learn member

I added documentation of the new features (leaving out only the correlation function -- I'm still unsure what example to use without access to an astronomical dataset). I've been unable to build the documentation to check formatting (cf. #1140), so it would help if someone could check the build.

Pending any comments, I think this is ready to merge!

@ogrisel
scikit-learn member

Maybe covertype or a projection to some of its axes? However triggering the download of the covertype dataset to build the doc if the example has a plot might be an issue. The archive is 11M.

@ogrisel ogrisel and 1 other commented on an outdated diff Jul 6, 2013
examples/cluster/plot_cluster_comparison.py
@@ -36,7 +36,7 @@
# Generate datasets. We choose the size big enough to see the scalability
# of the algorithms, but not too big to avoid too long running times
-n_samples = 1500
+n_samples = 500
@ogrisel
ogrisel Jul 6, 2013

Is this related to this PR? I am fine if the example plot looks the same but runs faster though.

@jakevdp
jakevdp Jul 6, 2013

Oops, I changed that because 1500 points was crashing on my system when I tried building the docs. I'll change it back.

@ogrisel
scikit-learn member

BTW I am building the branch to run the tests, examples and build the doc.

@ogrisel ogrisel commented on an outdated diff Jul 6, 2013
sklearn/neighbors/base.py
+                             % (metric, algorithm))
+
+        # For minkowski distance, use more efficient methods where available
+        if self.metric == 'minkowski':
+            p = self.metric_kwds.pop('p', 2)
+            if p < 1:
+                raise ValueError("p must be greater than zero "
+                                 "for minkowski metric")
+            elif p == 1:
+                self.metric = 'manhattan'
+            elif p == 2:
+                self.metric = 'euclidean'
+            elif p == np.inf:
+                self.metric = 'chebyshev'
+            else:
+                self.metric_kwds['p'] = p
@ogrisel
ogrisel Jul 6, 2013

This logic should instead set a metric_ (or effective_metric_) attribute in the fit method. Otherwise it breaks the convention that constructors only store their parameters unchanged, as checked by this failing test:

======================================================================
FAIL: sklearn.tests.test_common.test_all_estimators
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/ogrisel/code/scikit-learn/sklearn/tests/test_common.py", line 110, in test_all_estimators
    assert_equal(params[arg], default)
AssertionError: 'euclidean' != 'minkowski'
    "'euclidean' != 'minkowski'" = '%s != %s' % (safe_repr('euclidean'), safe_repr('minkowski'))
    "'euclidean' != 'minkowski'" = self._formatMessage("'euclidean' != 'minkowski'", "'euclidean' != 'minkowski'")
>>  raise self.failureException("'euclidean' != 'minkowski'")
@ogrisel
scikit-learn member

BTW, could you please rebase this branch on top of the current master and fix the conflicts? That would let Travis run the tests.

@ogrisel
scikit-learn member

The digits sampling example (with the KDE class) is very nice :)

@ogrisel ogrisel commented on an outdated diff Jul 6, 2013
doc/modules/neighbors.rst
+point is the point itself, at a distance of zero.
+
+It is also possible to efficiently produce a sparse graph showing the
+connections between neighboring points:
+
+    >>> nbrs.kneighbors_graph(X).toarray()
+    array([[ 1.,  1.,  0.,  0.,  0.,  0.],
+           [ 1.,  1.,  0.,  0.,  0.,  0.],
+           [ 0.,  1.,  1.,  0.,  0.,  0.],
+           [ 0.,  0.,  0.,  1.,  1.,  0.],
+           [ 0.,  0.,  0.,  1.,  1.,  0.],
+           [ 0.,  0.,  0.,  0.,  1.,  1.]])
+
+Our dataset is structured such that points nearby in index order are nearby
+in parameter space, leading to an approximately block-diagonal matrix of
+$K$-nearest neighbors.
@ogrisel
ogrisel Jul 6, 2013

The $ sign does not seem to be interpreted for latex math rendering. You probably need to replace expressions such as:

$K$-nearest

by:

:math:`K`-nearest
@ogrisel ogrisel commented on an outdated diff Jul 6, 2013
doc/modules/neighbors.rst
@@ -413,3 +485,80 @@ the model from 0.81 to 0.82.
* :ref:`example_neighbors_plot_nearest_centroid.py`: an example of
classification using nearest centroid with different shrink thresholds.
+
+Kernel Density Estimation
+=========================
+Nearest neighbors queries form the basis of a popular algorithm for estimating
+the density of points in a parameter space: Kernel Density Estimation.
+Given a positive kernel $K(x)$ and a bandwidth $h$, the kernel density
+estimate at a point $y$ within a group of points :math:`x_i; i=1\cdots N`
+is:
+
+.. math:: \rho_K(y) = \sum_{i=1}^N K((y - x_i) / h)
@ogrisel
ogrisel Jul 6, 2013

The rendering of this math directive does not work on my box. The formulas in the sgd.rst file do work in the same build though. So I assume this should be:

.. math::
     \rho_K(y) = \sum_{i=1}^N K((y - x_i) / h)

instead.

@ogrisel ogrisel commented on an outdated diff Jul 6, 2013
doc/modules/neighbors.rst
+
+* Epanechnikov kernel (``kernel = 'epanechnikov'``)
+
+ ..math:: K(x; h) \propto 1 - \frac{x^2}{h^2}
+
+* Exponential kernel (``kernel = 'exponential'``)
+
+ ..math:: K(x; h) \propto \exp(-x/h)
+
+* Linear kernel (``kernel = 'linear'``)
+
+ ..math:: K(x; h) \propto 1 - dist/h \mathrm{if} x < h
+
+* Cosine kernel (``kernel = 'cosine'``)
+
+ ..math:: K(x; h) \propto \cos(\frac{\pi x}{2h}) \mathrm{if} x < h
@ogrisel
ogrisel Jul 6, 2013

Same remark for this block.

@ogrisel
scikit-learn member

Apart from the math stuff, the rest of the sphinx rendering looks fine on my box (including the example plot includes).

@jakevdp
scikit-learn member

Thanks for the review, @ogrisel! I'll work on fixing those pieces.

@jakevdp
scikit-learn member

The Travis build was failing due to a test which required a newer SciPy version. I've added a check and a SkipTest that should result in all tests passing.

@ogrisel ogrisel commented on an outdated diff Jul 7, 2013
doc/modules/neighbors.rst
+           [ 0.        ,  1.        ],
+           [ 0.        ,  1.41421356]])
+
+Because the query set matches the training set, the nearest neighbor of each
+point is the point itself, at a distance of zero.
+
+It is also possible to efficiently produce a sparse graph showing the
+connections between neighboring points:
+
+    >>> nbrs.kneighbors_graph(X).toarray()
+    array([[ 1.,  1.,  0.,  0.,  0.,  0.],
+           [ 1.,  1.,  0.,  0.,  0.,  0.],
+           [ 0.,  1.,  1.,  0.,  0.,  0.],
+           [ 0.,  0.,  0.,  1.,  1.,  0.],
+           [ 0.,  0.,  0.,  1.,  1.,  0.],
+           [ 0.,  0.,  0.,  0.,  1.,  1.]])
@ogrisel
ogrisel Jul 7, 2013

The indent level is wrong here.

@ogrisel ogrisel commented on an outdated diff Jul 7, 2013
doc/modules/neighbors.rst
+    >>> nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
+    >>> distances, indices = nbrs.kneighbors(X)
+    >>> indices
+    array([[0, 1],
+           [1, 0],
+           [2, 1],
+           [3, 4],
+           [4, 3],
+           [5, 4]])
+    >>> distances
+    array([[ 0.        ,  1.        ],
+           [ 0.        ,  1.        ],
+           [ 0.        ,  1.41421356],
+           [ 0.        ,  1.        ],
+           [ 0.        ,  1.        ],
+           [ 0.        ,  1.41421356]])
@ogrisel ogrisel commented on an outdated diff Jul 7, 2013
sklearn/neighbors/binary_tree.pxi
+ >>> np.random.seed(0)
+ >>> X = np.random.random((10,3)) # 10 points in 3 dimensions
+ >>> tree = {BinaryTree}(X, leaf_size=2) # doctest: +SKIP
+ >>> dist, ind = tree.query(X[0], k=3) # doctest: +SKIP
+ >>> print ind # indices of 3 closest neighbors
+ [0 3 1]
+ >>> print dist # distances to 3 closest neighbors
+ [ 0. 0.19662693 0.29473397]
+
+Pickle and Unpickle a tree. Note that the state of the tree is saved in the
+pickle operation: the tree needs not be rebuilt upon unpickling.
+
+ >>> import numpy as np
+ >>> import pickle
+ >>> np.random.seed(0)
+ >>> X = np.random.random((10,3)) # 10 points in 3 dimensions
@ogrisel ogrisel commented on an outdated diff Jul 7, 2013
sklearn/neighbors/binary_tree.pxi
+ >>> np.random.seed(0)
+ >>> X = np.random.random((10,3)) # 10 points in 3 dimensions
+ >>> tree = {BinaryTree}(X, leaf_size=2) # doctest: +SKIP
+ >>> s = pickle.dumps(tree) # doctest: +SKIP
+ >>> tree_copy = pickle.loads(s) # doctest: +SKIP
+ >>> dist, ind = tree_copy.query(X[0], k=3) # doctest: +SKIP
+ >>> print ind # indices of 3 closest neighbors
+ [0 3 1]
+ >>> print dist # distances to 3 closest neighbors
+ [ 0. 0.19662693 0.29473397]
+
+Query for neighbors within a given radius
+
+ >>> import numpy as np
+ >>> np.random.seed(0)
+ >>> X = np.random.random((10,3)) # 10 points in 3 dimensions
@ogrisel ogrisel commented on an outdated diff Jul 7, 2013
sklearn/neighbors/tests/test_neighbors.py
@@ -579,6 +577,43 @@ def test_neighbors_deprecation_arg():
assert_equal(len(w), 1)
+def test_neighbors_metrics(n_samples=20, n_features=3,
+                           n_query_pts=2, n_neighbors=5):
+    """Test computing the neighbors for various metrics"""
+    # create a symmetric matrix
+    V = rng.rand(n_features, n_features)
+    VI = np.dot(V, V.T)
+
+    metrics = {'euclidean':{},
+               'manhattan':{},
+               'minkowski':dict(p=3),
+               'chebyshev':{},
+               'seuclidean':dict(V=rng.rand(n_features)),
+               'wminkowski':dict(p=3, w=rng.rand(n_features)),
+               'mahalanobis':dict(VI=VI)}
@ogrisel
ogrisel Jul 7, 2013

Please run pep8 on this file.

@ogrisel
scikit-learn member

I ran the tests on the neighbors package and got the following test coverage report:

$ nosetests -s -v --with-coverage sklearn/neighbors/
sklearn.neighbors                               10      0   100%
sklearn.neighbors.base                         235     22    91%   44-51, 96, 105, 129-132, 138, 154, 186, 190, 194, 197, 216, 231, 323, 502
sklearn.neighbors.classification                62      0   100%
sklearn.neighbors.graph                         10      0   100%
sklearn.neighbors.kde                           62     20    68%   80, 82, 91-101, 166, 187-200
sklearn.neighbors.nearest_centroid              50      3    94%   90, 98, 156
sklearn.neighbors.regression                    32      0   100%
sklearn.neighbors.unsupervised                   7      0   100%

It seems that most of the uncovered code consists of exceptions raised by input value checks: they should really be covered by assert_raises statements. It should be quite easy to raise the coverage to 100% on all the neighbors modules.

@ogrisel
scikit-learn member

Also KDE sampling is not tested either (apart from the example).

@jakevdp
scikit-learn member

PEP8 doesn't quite work on .pxi or .pyx files, but I'll do my best 😄

I just added a short example of KDE on geospatial data using the Haversine metric. I think the possibility of doing fast spatial queries on this sort of data is going to be a very well-used part of this new code.

I'll work next on improving test coverage.

@ogrisel
scikit-learn member

For the KDE model I am wondering whether the eval method should be renamed to loglikelihood or predict_loglikelihood to be more explicit. I am also wondering whether there are other estimators in sklearn that should share the same API. Maybe it would be worth asking for opinions on the ML.

I think it should not be called predict_logproba, as predict_logproba and predict_proba are usually reserved for estimates of the probabilities of a target variable y conditional on the input variables X, whereas here it's an estimate of the distribution of the input X itself.

@ogrisel
scikit-learn member

I just added a short example of KDE on geospatial data using the Haversine metric. I think the possibility of doing fast spatial queries on this sort of data is going to be a very well-used part of this new code.

Very nice.

Also I think that plot_kde.py should be renamed to plot_digits_kde_sampling.py or something similar to make it more explicit.

@ogrisel
scikit-learn member

Also could you please include the plot of the species example in the narrative doc along with a short summary sentence to explain it there as well?

@jakevdp
scikit-learn member

Regarding the interface to KernelDensity -- I chose to follow the same conventions as are currently used in GMM so that the two estimators could be interchanged within the generative classification framework I'm working on (see https://gist.github.com/jakevdp/5891921). If we change this interface for KDE, we should probably change it for GMM as well.

The best thing, I think, would be to decide on a standard interface for density estimation, and use that for KDE, GMM, and other future routines. I'll write to the mailing list to get people's opinions about this.

@jakevdp
scikit-learn member

OK, I increased the test coverage a fair bit, and also found some corner-case errors in the process (hey, unit-testing works! 😄)

$ nosetests -s --with-coverage sklearn/neighbors/
...
sklearn.neighbors                               10      0   100%   
sklearn.neighbors.base                         227      8    96%   84, 93, 118, 142, 204, 219, 311, 490
sklearn.neighbors.classification                63      0   100%   
sklearn.neighbors.graph                         10      0   100%   
sklearn.neighbors.kde                           62      0   100%   
sklearn.neighbors.nearest_centroid              50      3    94%   90, 98, 156
sklearn.neighbors.regression                    33      0   100%   
sklearn.neighbors.unsupervised                   7      0   100%
...

The remaining un-covered lines within base.py are (for the most part) unreachable pieces of redundant validation code, but I think they should remain because they will ease future maintenance and debugging.

I also addressed the above requests regarding documentation, formatting, PEP8, etc. Again, if you could check that the rendered docs look OK, I'd be grateful!

I think all that remains is to decide on density estimator method names for KDE... no response yet on the ML thread.

@jakevdp
scikit-learn member

Changing parameters in Neighbors made one of the tutorial doc tests fail. I think the Travis build should pass now.

@ogrisel
scikit-learn member

This looks good to me. I probably won't have much time to do any more review on this week. It would be great if someone else could have a deeper look at the cython code (I barely glanced at it so far). Maybe @larsmans @pprett @glouppe @arjoly @jnothman?

About the method names, if nobody is interested in making it more explicit I think we can keep eval. At least it's consistent with the GMM model.

@jakevdp
scikit-learn member

Great - thanks for all the help @ogrisel!

@ogrisel
scikit-learn member

You're welcome, this PR is very promising.

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Jul 9, 2013
examples/neighbors/plot_digits_kde_sampling.py
+
+from sklearn.datasets import load_digits
+from sklearn.neighbors import KernelDensity
+from sklearn.decomposition import PCA
+from sklearn.grid_search import GridSearchCV
+
+# load the data
+digits = load_digits()
+data = digits.data
+
+# project the 64-dimensional data to a lower dimension
+pca = PCA(n_components=15, whiten=False)
+data = pca.fit_transform(digits.data)
+
+# use grid search cross-validation to optimize the bandwidth
+params = {'bandwidth':np.logspace(-1, 1, 20)}
@GaelVaroquaux
GaelVaroquaux Jul 9, 2013

PEP8 :) (missing whitespace after the colon)

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Jul 9, 2013
examples/neighbors/plot_species_kde.py
+
+ * `"Maximum entropy modeling of species geographic distributions"
+ <http://www.cs.princeton.edu/~schapire/papers/ecolmod.pdf>`_
+ S. J. Phillips, R. P. Anderson, R. E. Schapire - Ecological Modelling,
+ 190:231-259, 2006.
+"""
+# Author: Jake Vanderplas <jakevdp@cs.washington.edu>
+#
+# License: BSD 3 clause
+
+import numpy as np
+import matplotlib.pyplot as plt
+from sklearn.datasets import fetch_species_distributions
+from sklearn.datasets.species_distributions import construct_grids
+from sklearn.neighbors import KNeighborsClassifier, KernelDensity
+from sklearn.metrics import classification_report
@GaelVaroquaux
GaelVaroquaux Jul 9, 2013

Unused imports (classification_report and KNeighborsClassifier)

@GaelVaroquaux
scikit-learn member

I think that a new subsection of the docs should be created: 'Density estimation', and the corresponding paragraph of the Neighbors documentation moved there (with, of course, cross-linking). The reason I mention this is that in the current layout of the documentation it is easy to miss the fact that there is a KDE in scikit-learn.

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Jul 9, 2013
doc/modules/neighbors.rst
+Here is an example of using this process to
+create a new set of hand-written digits, using a Gaussian kernel learned
+on a PCA projection of the data:
+
+.. |digits_kde| image:: ../auto_examples/neighbors/images/plot_digits_kde_sampling_1.png
+ :target: ../auto_examples/neighbors/plot_digits_kde_sampling.html
+ :scale: 80
+
+.. centered:: |digits_kde|
+
+The "new" data consists of linear combinations of the input data, with weights
+probabilistically drawn given the KDE model.
+
+.. topic:: Examples:
+
+ * :ref:`example_neighbors_plot_kde.py`: an example of using Kernel Density
@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Jul 9, 2013
doc/modules/neighbors.rst
@@ -413,3 +484,100 @@ the model from 0.81 to 0.82.
* :ref:`example_neighbors_plot_nearest_centroid.py`: an example of
classification using nearest centroid with different shrink thresholds.
+
+Kernel Density Estimation
+=========================
@GaelVaroquaux
GaelVaroquaux Jul 9, 2013

Before the math and the fancy examples, could we have a very simple 2D example, with an image embedded in the docs, that gives the intuition behind KDE? I like intuitions :)

@ogrisel
ogrisel Jul 9, 2013

Actually even a 1D example: plot the samples of a mixture of 2 well spaced yet overlapping normal (or Cauchy or other) distributions as a set of dirac masses on the x axis of a plot and the true distribution + estimated gaussian kernel density on the y axis.

@ogrisel
ogrisel Jul 9, 2013

Basically what I am asking for is an example reproducing the first plot on the right in the wikipedia article:

http://en.wikipedia.org/wiki/Density_estimation

@GaelVaroquaux
GaelVaroquaux Jul 9, 2013
@jakevdp
scikit-learn member

I think that a new subsection of the docs should be created: 'Density estimation', and the corresponding paragraph of the Neighbors documentation moved there (with, of course, cross-linking). The reason I mention this is that in the current layout of the documentation it is easy to miss the fact that there is a KDE in scikit-learn.

Sounds good -- I'll work on that. Should it be a separate .rst file in the modules subdirectory?

Actually even a 1D example: plot the samples of a mixture of 2 well spaced yet overlapping normal (or Cauchy or other) distributions as a set of dirac masses on the x axis of a plot and the true distribution + estimated gaussian kernel density on the y axis.

I'll work on this as well. Hopefully I'll be able to get to it in the next couple days.

@arjoly arjoly commented on an outdated diff Jul 9, 2013
doc/modules/neighbors.rst
@@ -29,12 +29,14 @@ learning methods, since they simply "remember" all of its training data
Despite its simplicity, nearest neighbors has been successful in a
large number of classification and regression problems, including
-handwritten digits or satellite image scenes. It is often successful
-in classification situations where the decision boundary is very irregular.
+handwritten digits or satellite image scenes. Being a non-parametric method,
+It is often successful in classification situations where the decision
@arjoly arjoly commented on the diff Jul 9, 2013
doc/modules/neighbors.rst
+ [3, 4],
+ [4, 3],
+ [5, 4]])
+ >>> distances
+ array([[ 0. , 1. ],
+ [ 0. , 1. ],
+ [ 0. , 1.41421356],
+ [ 0. , 1. ],
+ [ 0. , 1. ],
+ [ 0. , 1.41421356]])
+
+Because the query set matches the training set, the nearest neighbor of each
+point is the point itself, at a distance of zero.
+
+It is also possible to efficiently produce a sparse graph showing the
+connections between neighboring points:
@arjoly
arjoly Jul 9, 2013

What is the use case of producing a neighbors graph?

@jakevdp
jakevdp Jul 9, 2013

One application I've used is in the Isomap algorithm: a shortest-path search over the graph of neighbors is used to approximate geodesic distances between points, and thereby learn the manifold geometry.

@ogrisel
ogrisel Jul 10, 2013

It would be worth adding a note to the documentation right here to explain that this is internally used by some manifold learning algorithms. For instance, a cross-reference link to: http://scikit-learn.org/stable/modules/manifold.html#isomap

@GaelVaroquaux
GaelVaroquaux Jul 11, 2013
@GaelVaroquaux
scikit-learn member

I am slowly reviewing this guy. One comment that comes to my mind is that in the Cython files, you should apply PEP8, and in particular, you should have 2 empty lines between function definitions.

@GaelVaroquaux
scikit-learn member

Also, could we have short, one-liner docstrings for the cython functions? It is less a question of documenting what they do, and more to help people read the code.

@jakevdp
scikit-learn member

I think I responded to all the comments thus far -- I moved density estimation docs to a separate file, created a 1D KDE example, fixed typos & broken links, and added some documentation to the private functions within the various cython modules. Let me know if I overlooked anything!

@GaelVaroquaux
scikit-learn member
@GaelVaroquaux
scikit-learn member
@GaelVaroquaux GaelVaroquaux commented on an outdated diff Jul 11, 2013
examples/neighbors/plot_kde_1d.py
+"""
+# Author: Jake Vanderplas <jakevdp@cs.washington.edu>
+#
+import numpy as np
+import matplotlib.pyplot as plt
+from scipy.stats import norm
+from sklearn.neighbors import KernelDensity
+
+
+#----------------------------------------------------------------------
+# Plot the progression of histograms to kernels
+np.random.seed(1)
+N = 20
+X = np.concatenate((np.random.normal(0, 1, 0.3 * N),
+                    np.random.normal(5, 1, 0.7 * N)))[:, np.newaxis]
+Xplot = np.linspace(-5, 10, 1000)[:, np.newaxis]
@GaelVaroquaux
GaelVaroquaux Jul 11, 2013

That's a complete nitpick, but I would prefer this variable to be named X_plot.

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Jul 11, 2013
examples/neighbors/plot_kde_1d.py
+
+for axi in ax.ravel():
+    axi.plot(X[:, 0], np.zeros(X.shape[0]) - 0.01, '+k')
+    axi.set_xlim(-4, 9)
+    axi.set_ylim(-0.02, 0.34)
+
+for axi in ax[:, 0]:
+    axi.set_ylabel('Normalized Density')
+
+for axi in ax[1, :]:
+    axi.set_xlabel('x')
+
+#----------------------------------------------------------------------
+# Plot all available kernels
+Xplot = np.linspace(-6, 6, 1000)[:, None]
+Xsrc = np.zeros((1, 1))
@GaelVaroquaux
GaelVaroquaux Jul 11, 2013

Same thing here: I would prefer if the variable were X_plot and X_src

@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Jul 11, 2013
sklearn/neighbors/ball_tree.pyx
+        radius = fmax(radius,
+                      bt.rdist(centroid,
+                               data + n_features * idx_array[i],
+                               n_features))
+
+    bt.node_data[i_node].radius = bt.dm._rdist_to_dist(radius)
+    bt.node_data[i_node].idx_start = idx_start
+    bt.node_data[i_node].idx_end = idx_end
+    return 0
+
+cdef inline DTYPE_t min_dist(BinaryTree bt, ITYPE_t i_node,
+                             DTYPE_t* pt) except -1:
+    cdef DTYPE_t dist_pt = bt.dist(pt, &bt.node_bounds[0, i_node, 0],
+                                   bt.data.shape[1])
+    return fmax(0, dist_pt - bt.node_data[i_node].radius)
+
@GaelVaroquaux
GaelVaroquaux Jul 11, 2013

You still don't like to put 2 empty lines between functions :)

@jakevdp
jakevdp Jul 11, 2013

Sorry, PEP8 doesn't catch that due to the cython annotations 😄

@ogrisel
scikit-learn member

I wonder if it should not be put at the end of the unsupervised learning section, mainly to avoid breaking the section numbers. What do people think?

+1

Edit: my first reply was stupid, I had read your comment too quickly.

@ogrisel
scikit-learn member

Very nice new examples. +1 for merge once the latest comments by @GaelVaroquaux are addressed.

@jakevdp jakevdp referenced this pull request Jul 11, 2013
Closed

[WIP] Dbscan balltrees #2151

1 of 1 task complete
@jakevdp
scikit-learn member

Alright -- I did some PEP8 changes to the ball_tree.pyx and kd_tree.pyx, as well as adding some documentation and comments for anyone trying to understand the code. Also fixed the nit-picks in the examples and moved the density estimation section to the end of the unsupervised learning section. Let me know if you see any other necessary tweaks!

@jakevdp
scikit-learn member

Ah! Don't merge yet! I just found a bug in the tophat sampling function.

@jakevdp
scikit-learn member

OK - I think it's good to go now!

@ogrisel
scikit-learn member

Thanks @jakevdp. As this is a big PR I would appreciate it if we could get another reviewer: ping @larsmans @mblondel @jnothman @arjoly for instance.

+1 for merging on my side.

@ogrisel
scikit-learn member

Something that is missing from this PR but could be addressed in another one is a "cross_validated" internal mode or external helper tool or example to estimate the bandwidth parameter of the kernel of the KDE model:

  • sample 1000 random pairs from the dataset and compute the min and max pairwise distances
  • use those 2 distances as the bounds of a logspace range of bandwidth values to CV
  • use the likelihood of the held-out test set under a KDE model fitted on the train fold as the CV score

Keep the best bandwidth and refit (roughly as in the sketch below).
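
import numpy as np
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import euclidean_distances
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
X = rng.randn(2000, 2)

# bound the bandwidth search using pairwise distances of a random subsample
# (the subsample stands in for the random pairs; details only illustrative)
sub = X[rng.permutation(len(X))[:1000]]
D = euclidean_distances(sub)
d_min, d_max = D[D > 0].min(), D.max()

# GridSearchCV falls back to KernelDensity.score, i.e. the held-out log-likelihood
grid = GridSearchCV(KernelDensity(),
                    {'bandwidth': np.logspace(np.log10(d_min),
                                              np.log10(d_max), 20)},
                    cv=3)
grid.fit(X)
kde = grid.best_estimator_   # refit with the best bandwidth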

@larsmans
scikit-learn member

Do you really think it's wise to merge in any more big features before the next release? I'd rather get 0.14 out, then start adding features so bleeding-edge users have time to play with this stuff.

@ogrisel
scikit-learn member

Do we have a plan to release 0.14 soon? The ongoing discussion on the scorer API makes me think that we might not want to release in a hurry. @jakevdp's PR, on the other hand, has much less of an API impact.

If you really don't want to introduce any new features for 0.14 we could decide to split this PR in 2 parts:

1- one for the new BallTree and KDTree implementations using the existing neighbors estimator API.
2- the new Kernel Density Estimation stuff.

I think 1 can me merged for 0.14 as it does not introduce any new feature (besides speed and support for additional metrics in the ball tree).

@GaelVaroquaux
scikit-learn member
@GaelVaroquaux GaelVaroquaux commented on an outdated diff Jul 13, 2013
sklearn/neighbors/ball_tree.pyx
-    cdef ITYPE_t i_mid
-    cdef ITYPE_t i
-
-    if val >= queue[i_upper]:
-        return
-    elif val <= queue[i_lower]:
-        i_mid = i_lower
+    # determine Node radius
+    radius = 0
+    for i in range(idx_start, idx_end):
+        radius = fmax(radius,
+                      bt.rdist(centroid,
+                               data + n_features * idx_array[i],
+                               n_features))
+
+    bt.node_data[i_node].radius = bt.dm._rdist_to_dist(radius)
@GaelVaroquaux
GaelVaroquaux Jul 13, 2013

I hate to be nitpicking on such an excellent PR that has already been sitting in the pipeline for a little while, but I must admit that I am having a hard time following the code (because it is technical and I am dense, not because it is poorly written).

It might help if the names were a bit more explicit. 'bt' and 'bt.dm' don't tell me what they are. Would you mind renaming them?

@GaelVaroquaux
GaelVaroquaux Jul 13, 2013

'bt' -> 'binary_tree'? Or simply 'tree'?

@jakevdp
scikit-learn member

@GaelVaroquaux - I made some changes -- hopefully the more explicit names will make the code easier to read.

@GaelVaroquaux GaelVaroquaux commented on the diff Jul 13, 2013
sklearn/neighbors/dist_metrics.pxd
+from typedefs cimport DTYPE_t, ITYPE_t, DITYPE_t
+from typedefs import DTYPE, ITYPE
+
+######################################################################
+# Inline distance functions
+#
+# We use these for the default (euclidean) case so that they can be
+# inlined. This leads to faster computation for the most common case
+cdef inline DTYPE_t euclidean_dist(DTYPE_t* x1, DTYPE_t* x2,
+                                   ITYPE_t size) except -1:
+    cdef DTYPE_t tmp, d=0
+    for j in range(size):
+        tmp = x1[j] - x2[j]
+        d += tmp * tmp
+    return sqrt(d)
+
@GaelVaroquaux
GaelVaroquaux Jul 13, 2013

You don't like leaving 2 lines between function definitions, right :). This file could probably use a few more empty lines.

Because of PEP8, my brain is really hardwired into thinking that if there are not two empty lines, it is the same function.

@GaelVaroquaux
scikit-learn member

OK, @jakevdp, I want this guy in: it has been lying around for too long, and is ready apart from the empty-lines issue. I am finishing it off and merging.

@GaelVaroquaux
scikit-learn member

Merged! (I didn't rebase, because it was creating a conflict hell).

Bravo Jake.

@jakevdp
scikit-learn member

Awesome - thanks!
Any idea why this says there are unmerged commits? Is it just that PR merge commit from you yesterday?

@GaelVaroquaux
scikit-learn member
@jakevdp
scikit-learn member

Is Jenkins different from Travis? The Travis build was doing just fine.

@jakevdp
scikit-learn member

I don't understand some of these failures... in particular

ValueError: Buffer dtype mismatch, expected 'ITYPE_t' but got Python object in 'NodeData_t.idx_start'

NodeData_t.idx_start is cdef'd as type ITYPE_t, and I'm not sure how it could ever be interpreted as a Python object. I'm going to see if I can figure it out.

@jakevdp
scikit-learn member

I think that all the problems can be traced to this:

# use a dummy variable to determine the python data type
cdef NodeData_t dummy
cdef NodeData_t[:] dummy_view = <NodeData_t[:1]> &dummy
NodeData = np.asarray(dummy_view).dtype

Recall that NodeData_t is a C structure:

cdef struct NodeData_t:
    ITYPE_t idx_start
    ITYPE_t idx_end
    int is_leaf
    DTYPE_t radius

I think that on the Jenkins machine, the resulting NodeData dtype does not match the C type denoted by NodeData_t. I've been using this sort of hack for a long time and have never seen it fail.

Anybody have better ideas about how to take a cdef'd struct and extract an equivalent numpy dtype? Last year I sent this snippet to the cython list, asking if anyone knew a better way, and there wasn't a response.
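For reference, one alternative I can imagine (not what this PR does, and the ITYPE_t / DTYPE_t mapping below is an assumption) would be to write out the equivalent structured dtype by hand and let align=True reproduce the C struct padding:

import numpy as np

# hypothetical hand-written equivalent of the NodeData_t struct;
# assumes ITYPE_t maps to np.intp and DTYPE_t maps to np.float64
NodeData = np.dtype([
    ('idx_start', np.intp),
    ('idx_end', np.intp),
    ('is_leaf', np.intc),     # plain C int
    ('radius', np.float64),
], align=True)

The downside is that the mapping has to be kept in sync with the .pxd typedefs by hand, which is exactly what the dummy-view trick avoids.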

@larsmans
scikit-learn member

Why do you need such a hack at all? Can't you just malloc a NodeData_t array?

@jakevdp
scikit-learn member

Why do you need such a hack at all? Can't you just malloc a NodeData_t array?

I'd prefer the memory management to be handled by numpy -- it seems cleaner that way. I don't use malloc anywhere in this code, and consider that a positive feature.

@ogrisel
scikit-learn member

Note that the failure is at line:

self.node_data = np.empty(0, dtype=NodeData, order='C')

This only fails on the Python 2.6 / numpy 1.3 build, so it's probably related to the numpy version.

@ogrisel
scikit-learn member

I tried to install numpy 1.3 on Python 2.7, but the numpy build fails. Maybe we should stop trying to support such an old version of numpy.

Less than 5% of scipy stack users who answered this survey earlier this year use numpy 1.3:

http://astrofrog.github.io/blog/2013/01/13/what-python-installations-are-scientists-using/

The most recent Ubuntu LTS (Precise Pangolin, 12.04) ships numpy 1.6.1 by default.

@GaelVaroquaux what is the numpy version in your institute? I will try to build with numpy 1.5.

@larsmans
scikit-learn member

Don't Cython memoryviews have some kind of magic memory allocation that we can use? I understand the fear of malloc, but this doesn't look very clean either.

@jakevdp
scikit-learn member

It's more than just a fear of malloc -- using numpy arrays makes serialization extremely straightforward as well. I'm not sure about cython's memory allocation. Given that this fails on numpy 1.3, I wouldn't be surprised if the same error occurred when we convert the cython typed memoryview to a numpy array to serialize the object via __getstate__ and __setstate__.

@jakevdp
scikit-learn member

Actually, it looks like numpy 1.4 and below do not support the buffer interface, so almost nothing in this PR would work under those numpy versions:

http://comments.gmane.org/gmane.comp.python.cython.devel/14937

I didn't realize that detail... so I guess our choice is to either scrap this PR in full, or to drop numpy 1.3-1.4 support (for reference, numpy 1.5 was released in August 2010). Another option would be to rewrite all of this using the deprecated cython numpy array syntax, but that would feel like going backward to me.
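For readers unfamiliar with the two styles, here is a minimal toy contrast (not code from this PR) between the older cython-numpy buffer syntax and the typed-memoryview syntax this PR relies on:

cimport numpy as np

def sum_old(np.ndarray[double, ndim=1] x):
    # older cython-numpy buffer syntax: takes a numpy array directly
    cdef double s = 0
    cdef Py_ssize_t i
    for i in range(x.shape[0]):
        s += x[i]
    return s

def sum_new(double[::1] x):
    # typed-memoryview syntax: relies on the buffer interface discussed above
    cdef double s = 0
    cdef Py_ssize_t i
    for i in range(x.shape[0]):
        s += x[i]
    return s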

If we deem that supporting old numpy versions is important, it's not all lost: I'd release all of this functionality as its own mini-package for the time being, and we could revisit its inclusion once the numpy requirement is upgraded to 1.5.

@ogrisel
scikit-learn member

Maybe we should ask for users opinions on the mailing list:

  • which version of numpy do you use?
  • would you agree to raise the minimal version for numpy to 1.5?
@larsmans
scikit-learn member

FYI, the most recent version of CentOS still ships NumPy 1.4. That's probably true of Red Hat Enterprise Linux and its other clones as well.

@ogrisel
scikit-learn member

That's bad news.

@ogrisel
scikit-learn member

Another option would be to rewrite all of this using the deprecated cython numpy array syntax, but that would feel like going backward to me.

I agree, but on the other hand this is for backward compat. Is there any difference in runtime performance?

@jakevdp
scikit-learn member

I've sent a message to the cython users list: hopefully that will generate some ideas.

@GaelVaroquaux
scikit-learn member
@ogrisel
scikit-learn member

But memory allocation is not the problem here, is it? It seems that the problem is that the memoryview feature of cython does not work with old versions of NumPy (1.3 and 1.4), and only works with the NumPy versions that implement the buffer interface (NumPy 1.5 and on).

So the real question is: do we bump the requirement for sklearn users up to NumPy 1.5+, or do we rewrite the ball / kd-trees to use the old, deprecated cython-numpy array syntax instead of the memoryview syntax?

@jakevdp
scikit-learn member

I think I've come up with a solution that will work without too much rewriting of the code. We can use a cython utility function that looks like this:

cimport numpy as np

def array_to_memview(np.ndarray[double, ndim=1, mode='c'] X):
    cdef double* Xdata = <double*> X.data
    cdef double[::1] Xmem = <double[:X.shape[0]]> Xdata
    return Xmem

All external-facing functions would have to be changed so as to explicitly take numpy arrays as arguments, and then perform these gymnastics to convert the input. As long as we maintain a reference to the original array for as long as the memoryview is needed, we shouldn't run into any buffer-lifetime issues.

One complication: as there is no templating, I think we'd need to explicitly define a different conversion function like the one above for each combination of data type and number of dimensions. The result is going to be code that's pretty messy and confusing to read, but it should be backward compatible down to numpy 1.3.
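As a purely illustrative sketch (the class and attribute names are hypothetical, not code from this PR), an external-facing class could hold the original numpy array next to the converted view so the buffer stays alive:

cimport numpy as np
import numpy as np

cdef class ExampleTree:
    cdef object data_arr        # original numpy array, kept to hold the buffer alive
    cdef double[::1] data_view  # memoryview actually used in the tight loops

    def __init__(self, np.ndarray[double, ndim=1, mode='c'] X):
        self.data_arr = X
        self.data_view = array_to_memview(X)  # the helper defined above

Pickling could then go through data_arr (plain numpy serialization) rather than the view, which ties in with the __getstate__ / __setstate__ point above.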

@ogrisel
scikit-learn member

I think that's worth trying. Just a cosmetic note: Xdata => X_data, and so on.

@GaelVaroquaux
scikit-learn member