[MRG] Locality Sensitive Hashing for approximate nearest neighbor search(GSoC) #3304

Closed
wants to merge 119 commits into
from

@maheshakya
Contributor

No description provided.

@daniel-vainsencher

Maheshakya, did you mention using arrays of integers as the output of do_hash? seems like the current version should produce strings, right?
Also, random projection is generally used to mean simply multiplying by a random matrix. When you add a reference to this kind of hash, they might have a more specific name for this LSH family.

Owner

Converting hashes to integers does not yield any improvement in speed or memory consumption (I have another version which does that). It only adds complexity during string-to-integer conversion and the other way around. So I thought that using integer arrays would be less useful.
But I have used numpy arrays of strings in the random projections.

It won't be a problem when using other hashing algorithms, since the do_hash function is defined to perform hashing for that particular LSH family.

Actually the link in the reference points to the Random projection section of the LSH article on Wikipedia. I need to change the description of the reference. :)

(np.array([-2, 3, 4, -5]) > 0).astype('int') seems pretty simple to me. I do not expect the ints to be faster than booleans (and I do not expect this to be a bottleneck, so speed is not a particularly important criterion here). On the other hand, using hash families that return more than two results might be useful in the future. Of course, if you want to restrict the code to binary hashes for now, that's fine, but it needs documentation in the right place.

Where are there numpy arrays of strings, and what do you mean by that?

Owner

Only a single string (a single binary hash) is returned from the do_hash function, for a single data point in a single tree. The tree (in the LSH forest implementation) stores these strings in a numpy array with dtype='|S2'. |S2 denotes a fixed-width byte string of length 2.
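As a minimal sketch of the storage described above (the hash values here are illustrative; the actual dtype width depends on the configured hash length):

```python
import numpy as np

# Binary hashes kept as fixed-width byte strings; '|S2' means 2 bytes per entry.
tree = np.array([b'01', b'10', b'00'], dtype='|S2')
print(tree.dtype.itemsize)  # 2: each stored hash occupies exactly 2 bytes
```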

@daniel-vainsencher

For elegance, please try to express your _bisect_right(a,x) in terms of searchsorted instead of reimplementing.

Owner

Done.

@jnothman jnothman and 1 other commented on an outdated diff Jun 22, 2014
sklearn/neighbors/lsh_forest.py
+ [3, 2, 1],
+ [4, 3, 2]]), array([[ 0. , 3.15525015, 9.54018168],
+ [ 0. , 3.15525015, 6.38493153],
+ [ 0. , 6.38493153, 9.54018168],
+ [ 0. , 12.92048135, 19.30541288],
+ [ 0. , 26.1457523 , 39.06623365]]))
+
+ """
+
+ def __init__(self, max_label_length=32, n_trees=10,
+ hashing_algorithm='random_projections',
+ c=50, n_neighbors=1, lower_bound=4, seed=1):
+ self.max_label_length = max_label_length
+ self.n_trees = n_trees
+ self.hashing_algorithm = hashing_algorithm
+ self.random_state = np.random.RandomState(seed)
@jnothman
jnothman Jun 22, 2014 Member

Please store all arguments to the constructor just as they are passed in. Please see other uses of check_random_state and where it is called.

@maheshakya
maheshakya Jun 22, 2014 Contributor

Done.

@daniel-vainsencher

Have you considered the performance implications of the new version of _bisect_right(a,x)? transforming "a" seems expensive... OTOH, maybe you can transform x a little bit. I like to think of hashes as binary fixed point representations (0.010110001...) of numbers in [0,1], instead of the more concrete list of {0,1}.

Owner

I found that numpy searchsorted is much slower than the previous method. So I changed it back.

The reason I thought it should be kept as binary strings is that we cannot perform slicing operations on integers (or any numerical type). Slicing is required to extract the MSBs from the binary hash. If a decimal like 0.001110... or an array of integers is used, it will make this MSB extraction more complex and more time consuming.

I have added some explanation in the doc string.

Are you sure it is searchsorted that is slower?
What your version did is to extract the MSBs of the whole index (a); if that creates a copy, it would access much more memory than the whole binary search. If the slice doesn't create a copy, it might still have inefficient access patterns.
I am saying that instead you can transform the query x so that searchsorted gives the correct reply without having to extract the MSBs from a at all (instead of searching to 0.0101 (ignoring further digits), just search to 0.01100000...). Then the use of either searchsorted or bisect_* from the library code allows you to delete some code. And if you move to the integer representation, that will also be blazingly fast and allow for more compact indices.

Owner

No, searchsorted is not what makes it slow. It's the slicing operations performed on the elements of a.

To transform x: the required MSBs of x can be extracted and the remaining part replaced by 1s. This transformation will produce the same results as before when used with searchsorted(x, side='right'). So there will be no slicing on a. What do you think about this?

I have an implementation that works with integers (I think you have seen it already):
lsh_forest_without_hash_bit_strings.py

But it didn't show much improvement in speed.

Paragraphs 1-2: sounds good. My expectation of the result is: less code overall, a significant speed improvement for small c, and smaller indices that allow more trees, for better precision. Of course, for large c, time is still dominated by the candidate review stage; that is another issue.

As to paragraphs 3-4: I'm looking at the code under that link, it seems to be incomplete and old...

  1. Uses the hash inside the loop in query (line 170), a big bottleneck you've already fixed elsewhere
  2. Uses bisect_left and a slow/strange bisect_right, instead of the well optimized np.searchsorted, so no speed improvement is expected. Besides, line 19 seems to treat x as a string/array rather than an int. If this works at all, it's a lot of luck.
  3. Lines 33 and 34 use inconsistent types; 34 looks like it searches for a string through an array of ints.

So for an informative comparison, you must combine the integer representation with all the work you'd done before on the current representation. I suggest the following approach:

  1. Start from your best tested implementation
  2. Change to the new approach (transform x instead of changing the binary sort) on the string representation
  3. Drop the custom binary search in favor of searchsorted
    (pause to test that the results haven't changed much yet, and certainly shouldn't get worse)
  4. Create a branch in which only the representation is changed (string->int).
  5. Benchmark.
Owner

@daniel-vainsencher, I used that transformation on x, so that only the query has to be transformed before the binary search happens. This indeed made a significant improvement in query speed.

I have also created another branch with entire integer representation of hashes:
lsh_forest and lshashing

For the _bisect_right operation, I had to reconvert the query integer to binary, turn the last bits to '1' (bits after the hash length) and convert it back to an integer. This makes the code a little more complex and maybe a bit less readable as well (see the _find_matching_indices function in the new branch).

But these two versions are not so different in performance with respect to query speeds. Memory consumption in the latter version is slightly better than the other one.

Owner

Yes, it's compared to transforming the index.

There is no bug in that implementation. Let me explain with an example:
Suppose you have a sorted array of strings:

gona = np.array(['000001', '001010', '001101', '001110','001111', '100001', '101000', '101010'])

These strings have a fixed length = 6. Suppose there's another string query = '001100'. Say the considered hash length is h = 4. Then we want to get the indices from gona where the first h bits of those entries match the first h bits of query. This means we want the indices of gona where the MSBs = '0011'. Say x = query[:h], which means x = '0011'.

Now we'll see how numpy searchsorted works. This function has two behaviors for parameter side = 'left' and side = 'right'. See numpy.searchsorted

  1. We can use side = 'left' directly to find the leftmost index. In the string representation '0011' is less than '001100' but it is greater than '000111'. So we can use this behavior directly to find the leftmost index where a string starts with '0011'. `numpy.searchsorted(gona, x, side='left')` will return the position where '0011' fits in the array, which is the leftmost index where a string starts with '0011'. This operation will return 2.
  2. To find the rightmost index, we cannot use side = 'right' directly as before. If you have read the documentation of searchsorted, setting side = 'right' returns the last suitable index; the returned values for 'left' and 'right' differ only when the array contains duplicates of the exact value being searched for. So passing '001100' or '0011' directly into searchsorted with side = 'right' will not do what we actually want, which is to find the rightmost index where the first 4 bits are equal to '0011'. We can accomplish this by transforming query: if we set the last two bits to '1's, '001100' becomes '001111'. Now it is guaranteed that '001111' is greater than or equal to (equal to '001111' itself) any string which starts with '0011'. Using searchsorted with side = 'right' makes sure that even a '001111' in the array (a duplicate of the transformed query) is covered. So `numpy.searchsorted(gona, query[:h] + '11', side='right')` will return 5, one past the rightmost index where a string starts with '0011'. (My custom implementation of `bisect_right` also did the same thing.)

So `numpy.arange(2, 5)` will give [2, 3, 4], the array of indices where strings start with '0011'.

This will return the correct candidates we are looking for. So it is not a bug. The same concept applies to integer representation as well.
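The worked example above can be run directly (the names gona, query and h are taken from the explanation):

```python
import numpy as np

gona = np.array(['000001', '001010', '001101', '001110',
                 '001111', '100001', '101000', '101010'])
query = '001100'
h = 4

# Leftmost index whose first h bits match query[:h].
left = np.searchsorted(gona, query[:h], side='left')
# Pad the prefix with '1's so side='right' covers every string with that prefix.
right = np.searchsorted(gona, query[:h] + '1' * (6 - h), side='right')

candidates = np.arange(left, right)
print(left, right, candidates)  # 2 5 [2 3 4]
```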

I'll send you the results of line_profiler on these two implementations. About the memory consumption: yes, I need to include the raw data (input_array) in this data structure, as it is required to calculate the actual distances of candidates. That is the reason for the lack of improvement in memory consumption: the space consumed by the raw data is the dominating factor.

Owner

It seems my last comment has not been posted here. :(
Anyway I sent you an email.

@maheshakya maheshakya Corrected usage of random_state parameter.
Replaced numpy searchsorted in _bisect_right with the
previous version.
08ab73a
@maheshakya
Contributor

@jnothman, I'm adding an insert operation to this data structure. I suppose that could help with incremental learning.

maheshakya added some commits Jun 22, 2014
@maheshakya maheshakya Added insert operation into LSH Forest.
Insert operation allows inserting new data points into the fitted set of trees.
(Can be used in incremental learning?)

Changed parameter m to n_neighbors.
c94cb5d
@maheshakya maheshakya Changed parameter m to n_neighbors. 802ed5f
@coveralls

Coverage Status

Coverage decreased (-0.06%) when pulling 802ed5f on maheshakya:lsh_forest into aaefdbd on scikit-learn:master.

@jnothman
Member

Did I say something about incremental learning?

@maheshakya
Contributor

Yes.
Check issue #3175

@jnothman
Member

Ahh =) That was merely a ping.

On 23 June 2014 09:58, maheshakya notifications@github.com wrote:

Yes.
Check issue #3175 #3175


Reply to this email directly or view it on GitHub
#3304 (comment)
.

@arjoly arjoly commented on an outdated diff Jun 24, 2014
sklearn/feature_extraction/lshashing.py
+ def __init__(self, n_dim=None, hash_size=None, random_state=1):
+ if n_dim is None or hash_size is None:
+ raise ValueError("n_dim or hash_size cannot be None.")
+
+ self.n_dim = n_dim
+ self.random_state = random_state
+ self.hash_size = hash_size
+
+ def generate_hash_function(self):
+ """Generates a hash function"""
+
+ def do_hash(self, input_point=None, hash_function=None):
+ """Performs hashing on the input_point with hash_function"""
+
+
+class RandomProjections(BaseHash):
@arjoly
arjoly Jun 24, 2014 Member

You might want to have a look at the random projection module.

@arjoly
arjoly Jun 24, 2014 Member

A random sign projection transformer could have its place in that module.

@arjoly arjoly and 1 other commented on an outdated diff Jun 24, 2014
sklearn/feature_extraction/lshashing.py
@@ -0,0 +1,81 @@
+"""
+Locality Sensitive Hashing Algorithms
+-------------------------------------
+"""
+# Author: Maheshakya Wijewardena <maheshakya.10@cse.mrt.ac.lk>
+
+import numpy as np
+from abc import ABCMeta, abstractmethod
+from ..externals.six import with_metaclass
+from ..utils import check_random_state
+
+__all__ = ["RandomProjections"]
+
+
+class BaseHash(with_metaclass(ABCMeta)):
@arjoly
arjoly Jun 24, 2014 Member

This looks like a duplicate of BaseRandomProjection.

@maheshakya
maheshakya Jun 29, 2014 Contributor

I already saw this. But I think this implementation of random projections does not cohere with the context of locality sensitive hashing algorithms. The main goal of that version is dimensionality reduction, but in LSH that is not the only use case. See Locality-sensitive hashing. Other than that, there are many other LSH families as well (which I need to implement later in the project).
To put it simply, what we want here is much simpler than BaseRandomProjection (for random projection as an LSH algorithm): the dot product of all the vectors in the data set with a fixed random vector.
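A minimal sketch of what that amounts to (the shapes and hash_size here are illustrative, not the PR's actual parameters):

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.rand(5, 8)              # 5 data points, 8 features
hyperplanes = rng.randn(4, 8)   # hash_size=4 random hyperplanes

# One dot product per hyperplane; the sign of each projection gives one hash bit.
bits = (np.dot(X, hyperplanes.T) > 0).astype(int)
print(bits.shape)  # (5, 4): a 4-bit hash per data point
```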

@arjoly arjoly and 1 other commented on an outdated diff Jun 24, 2014
sklearn/feature_extraction/lshashing.py
+ """
+ Generates hyperplanes of shape (hash_size, n_dim) from standard
+ normal distribution.
+ """
+ random_state = check_random_state(self.random_state)
+ return random_state.randn(self.hash_size, self.n_dim)
+
+ def do_hash(self, input_point=None, hash_function=None):
+ """
+ Does hashing on the data point with the provided hash_function.
+ """
+ if input_point is None or hash_function is None:
+ raise ValueError("input_point or hash_function cannot be None.")
+
+ projections = np.dot(hash_function, input_point)
+ return "".join(['1' if i > 0 else '0' for i in projections])
@arjoly
arjoly Jun 24, 2014 Member

(1 + np.sign(x)) / 2?

@maheshakya
maheshakya Jul 3, 2014 Contributor

Yes. It is useful to convert the array into binary values before converting to a string. But I found np.array(x > 0, dtype=int) faster than this.
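Both conversions agree on the resulting bits; a quick sketch with illustrative projection values:

```python
import numpy as np

projections = np.array([0.3, -1.2, 2.5, -0.1])

# Reviewer's suggestion: map the sign of each projection to {0, 1}.
a = ((1 + np.sign(projections)) / 2).astype(int)
# Contributor's variant: boolean comparison cast to int.
b = np.array(projections > 0, dtype=int)

print(a, b)  # both: [1 0 1 0]
# Joining the bits yields the binary hash string discussed above.
print("".join(map(str, b)))  # '1010'
```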

@robertlayton robertlayton and 1 other commented on an outdated diff Jul 3, 2014
sklearn/neighbors/lsh_forest.py
+ self._hash_generator = self._select_hashing_algorithm(
+ n_dim, self.max_label_length)
+
+ # Creates a g(p,x) for each tree
+ self.hash_functions_ = []
+ self._trees = []
+ self._original_indices = []
+ for i in range(self.n_trees):
+ # This is g(p,x) for a particular tree.
+ hash_function = self._hash_generator.generate_hash_function()
+ original_index, bin_hashes = self._create_tree(hash_function)
+ self._original_indices.append(original_index)
+ self._trees.append(bin_hashes)
+ self.hash_functions_.append(hash_function)
+
+ self.hash_functions_ = np.array(self.hash_functions_)
@robertlayton
robertlayton Jul 3, 2014 Member

Try this first: http://docs.scipy.org/doc/numpy/reference/generated/numpy.append.html
As a second optimisation, consider how it might be possible to compute all the trees (and so on) in one numpy operation etc, to get rid of the previous loop. Dot products are your friend!

@maheshakya
maheshakya Jul 3, 2014 Contributor

Yes, if all the hash functions are computed in advance, I think it's possible to get rid of the loop. I'll give it a try.
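One way to sketch Robert's suggestion (the shapes and names here are hypothetical, not the PR's code):

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_dim = 10, 8
n_trees, hash_size = 3, 4
X = rng.rand(n_samples, n_dim)

# Stack every tree's hyperplanes into one matrix: a single dot product
# then computes all hash bits for all trees at once.
all_planes = rng.randn(n_trees * hash_size, n_dim)
bits = (np.dot(X, all_planes.T) > 0).astype(int)
bits = bits.reshape(n_samples, n_trees, hash_size)  # per-tree hash bits
print(bits.shape)  # (10, 3, 4)
```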

maheshakya added some commits Jun 24, 2014
@maheshakya maheshakya Added a transformation to the binary query.
For _bisect_right() function, a transformed x is passed. The transformation
will replace the characters after h hash length with '1's.

Used random_projections module.
GuassianRandomProjections in random_projections module is used to perform the
hashing for Random projections LSH method.
3cc2733
@maheshakya maheshakya Used random_projections module.
GuassianRandomProjections in random_projections module is used to perform the
hashing for Random projections LSH method.
d8e521b
@coveralls

Coverage Status

Coverage decreased (-0.06%) when pulling d8e521b on maheshakya:lsh_forest into 82611e8 on scikit-learn:master.

maheshakya added some commits Jul 5, 2014
@maheshakya maheshakya A minor change in lshashing a9b49bb
@maheshakya maheshakya Gaussian Random Projection is used in the LSHForest class.
Removed lshashing from feature extraction and added that functionality to
the LSHForest class. If other hashing algorithms are to be implemented,
a separate lshashing class may be required.
eb4852d
@maheshakya maheshakya Remove Random projection from feature extraction _init 57d9412
@coveralls

Coverage Status

Coverage decreased (-0.05%) when pulling 57d9412 on maheshakya:lsh_forest into b65e4c8 on scikit-learn:master.

maheshakya added some commits Jul 14, 2014
@maheshakya maheshakya Converted to integer representation. b19ee99
@maheshakya maheshakya Updated example a7a5788
@maheshakya maheshakya Added accuracy tests for c and n_trees variation.
6a0366a
@coveralls

Coverage Status

Coverage decreased (-0.17%) when pulling 6a0366a on maheshakya:lsh_forest into b65e4c8 on scikit-learn:master.

@maheshakya maheshakya Updated cache size to type int.
dcbe656
@coveralls

Coverage Status

Coverage decreased (-0.17%) when pulling dcbe656 on maheshakya:lsh_forest into b65e4c8 on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling 9f3a575 on maheshakya:lsh_forest into 8dab222 on scikit-learn:master.

@robertlayton robertlayton and 4 others commented on an outdated diff Jul 28, 2014
sklearn/neighbors/lsh_forest.py
+-------------------------------------------------------------------------
+"""
+# Author: Maheshakya Wijewardena <maheshakya.10@cse.mrt.ac.lk>
+
+import numpy as np
+import itertools
+from ..base import BaseEstimator
+from ..utils.validation import safe_asarray
+from ..utils import check_random_state
+
+from ..random_projection import GaussianRandomProjection
+
+__all__ = ["LSHForest"]
+
+
+def _find_matching_indices(sorted_array, item, left_mask, right_mask):
@robertlayton
robertlayton Jul 28, 2014 Member

These functions would be good candidates to move to cython. The speed up will probably be quite drastic with little extra coding needed.

@maheshakya
maheshakya Jul 28, 2014 Contributor

Only numpy.searchsorted (twice) and numpy.arange (once) are called in the _find_matching_indices method, so I don't think using Cython for this function will make a significant improvement. I tried to use Cython for the string hashes we had earlier, but it did not make any improvement.
But I can try Cython on the _find_longest_prefix_match method.

@daniel-vainsencher
daniel-vainsencher Jul 29, 2014

I think the priorities now should be making the code easy to merge and maintain, and removing serious problems (like an insert that takes O(nm) instead of O(n+m)).

I would specifically avoid micro-optimizations: it takes some time to find where to optimize, to try it out at different parameter values, etc., and I'd guess they are relatively easy to add after GSoC if we want. Right?

@larsmans
larsmans Aug 1, 2014 Member

Hear hear. Get it working, get it asymptotically optimal, then bum it.

@ogrisel
ogrisel Aug 13, 2014 Member

It would still be interesting to report the output of line_profiler run on the main functions / methods involved in this estimator. I mean as a comment on this PR.

@robertlayton robertlayton and 3 others commented on an outdated diff Jul 28, 2014
sklearn/neighbors/lsh_forest.py
+ [1, 0, 2],
+ [2, 1, 3],
+ [3, 2, 4],
+ [4, 3, 5]]), array([[ 0. , 0.52344831, 1.08434102],
+ [ 0. , 0.52344831, 0.56089272],
+ [ 0. , 0.56089272, 0.60101568],
+ [ 0. , 0.60101568, 0.6440088 ],
+ [ 0. , 0.6440088 , 0.6900774 ]]))
+
+ """
+
+ def __init__(self, max_label_length=32, n_trees=10,
+ radius=1.0, c=50, n_neighbors=1,
+ lower_bound=4, radius_cutoff_ratio=.9,
+ random_state=None):
+ self.max_label_length = int(max_label_length/2*2)
@robertlayton
robertlayton Jul 28, 2014 Member

What is going on with this line? Isn't int() enough?

@maheshakya
maheshakya Jul 28, 2014 Contributor

In order to cache the hashes, the hash length should be an even number, as the cache is divided into two parts (storing the cached hash as one is very memory consuming).
Is there any other way to make sure the max_label_length used is even?

@daniel-vainsencher
daniel-vainsencher Jul 29, 2014

There is no way to understand your intent from this code. Extract the code to a meaningfully named function, or use a comment.

In terms of implementation itself, it is not as obviously correct as it could be. If max_label_length somehow starts as a float, your code will fail, where int(max_label_length/2)*2 would obviously succeed.

But I think I've mentioned that in my opinion, this data structure has no right to exist unless max_label_length is exactly 32 or 64 (and just 32 is probably sufficient for the moment); there are too many low-level inefficiencies these particular cases avoid. That said, if your parameters invite max_label_length=23, you had better be testing that it actually works...

@GaelVaroquaux
GaelVaroquaux Aug 13, 2014 Member
  •    self.max_label_length = int(max_label_length/2*2)
    

In order to cache the hashes, hash length should be an even number as cache

But I don't believe that this is what the line above is doing. You might be a mis-placed parenthesis away.

Gaël
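The distinction the reviewers are pointing at, shown under Python 3's true-division semantics (under Python 2 integer division the original expression happened to work for int inputs, which is why the float case Daniel mentions is the failure mode):

```python
x = 33  # an odd hash length

# The PR's expression: true division is exact, so *2 undoes it and the result stays odd.
print(int(x / 2 * 2))   # 33
# With the parenthesis moved, truncation happens first, so the result is always even.
print(int(x / 2) * 2)   # 32
```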

@robertlayton robertlayton and 1 other commented on an outdated diff Jul 28, 2014
sklearn/neighbors/lsh_forest.py
+ This creates a binary hash by getting the dot product of
+ input_point and hash_function then transforming the projection
+ into a binary string array based on the sign(positive/negative)
+ of the projection.
+
+ Parameters
+ ----------
+
+ input_array: array_like, shape (n_samples, n_features)
+ A matrix of dimensions (n_samples, n_features), which is being
+ hashed.
+ """
+ if input_array is None:
+ raise ValueError("input_array cannot be None.")
+
+ grp = self._generate_hash_function()
@robertlayton
robertlayton Jul 28, 2014 Member

If the function is only "do_hash", then shouldn't this line be in fit instead? i.e. I would expect a function called do_hash to be callable multiple times, during both fit and transform.

You can get around needing to generate grp here by just setting it as a class attribute.

@maheshakya
maheshakya Jul 28, 2014 Contributor

Done. Each tree needs a separate hash function, so the functionality of do_hash can be put into the _create_tree function.

@robertlayton robertlayton and 2 others commented on an outdated diff Jul 28, 2014
sklearn/neighbors/lsh_forest.py
+
+ def _create_tree(self):
+ """
+ Builds a single tree (in this case creates a sorted array of
+ binary hashes).
+ """
+ hashes, hash_function = self._do_hash(self._input_array)
+ binary_hashes = []
+ for i in range(hashes.shape[0]):
+ xx = tuple(hashes[i])
+ binary_hashes.append(self.cache[xx[:self.cache_N]] * self.k
+ + self.cache[xx[self.cache_N:]])
+
+ return np.argsort(binary_hashes), np.sort(binary_hashes), hash_function
+
+ def _compute_distances(self, query, candidates):
@robertlayton
robertlayton Jul 28, 2014 Member

Linking back to what I said about euclidean distance computation, look at where this function is called and consider if you can't wrap those up into a matrix

@maheshakya
maheshakya Jul 28, 2014 Contributor

Euclidean distances are computed for an array of candidates (which is a 1D array). In the radius_neighbors method, this has to be computed each time max_depth is decreased, so it is calculated multiple times. But it cannot be wrapped into a single matrix, because the size of the candidate array differs for each max_depth, and the radius_neighbors to total_candidates ratio has to be checked at each max_depth value to ensure that it stays within the given range.

@arjoly
arjoly Aug 5, 2014 Member

I would make this method a function.

@robertlayton robertlayton and 2 others commented on an outdated diff Jul 28, 2014
sklearn/neighbors/lsh_forest.py
+ + self.cache[xx[self.cache_N:]])
+
+ return np.argsort(binary_hashes), np.sort(binary_hashes), hash_function
+
+ def _compute_distances(self, query, candidates):
+ distances = _simple_euclidean_distance(
+ query, self._input_array[candidates])
+ return np.argsort(distances), np.sort(distances)
+
+ def _generate_masks(self):
+ """
+ Creates left and right masks for all hash lengths
+ """
+ self._left_mask, self._right_mask = [], []
+
+ for length in range(self.max_label_length+1):
@robertlayton
robertlayton Jul 28, 2014 Member

It would be well worth looking into replacing these with bitwise functions.
That said, it might be worth flagging that for future work.

@maheshakya
maheshakya Jul 28, 2014 Contributor

These masks are created to perform bitwise operations on hashed data points, so the masks themselves cannot be created with bitwise operations. In general (for a 32-bit hash) there will be only 66 masks, so the time consumed by this operation is insignificant, isn't it?

@daniel-vainsencher
daniel-vainsencher Jul 29, 2014

The bitwise + numeric operations Python provides certainly suffice (using things like a | (1 << pos)), so you do not have to use strings. Good catch that this is not a performance bottleneck! That's an important instinct (Amdahl's law is the general version). But if Robert meant that the bitwise solutions are more elegant than going through strings, I tend to agree it might be worth doing when you have spare time.

@maheshakya
maheshakya Jul 29, 2014 Contributor

As the cached hashes are generated in the fit function before _generate_masks is called, they could be used in the mask generation process too. So I used those instead of doing the string conversion, which also eliminates the little overhead we might have had here.
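A minimal sketch of the mask-based prefix match on the integer representation (a 6-bit hash and the helper names are illustrative; the PR's _find_matching_indices works on 32-bit hashes):

```python
import numpy as np

HASH_SIZE = 6  # illustrative; the PR uses 32

def make_masks(h, hash_size=HASH_SIZE):
    # left mask keeps the h MSBs; right mask fills the remaining bits with 1s
    left = int('1' * h + '0' * (hash_size - h), 2)
    right = int('0' * h + '1' * (hash_size - h), 2)
    return left, right

def find_matching_indices(sorted_hashes, query, h):
    left_mask, right_mask = make_masks(h)
    prefix = query & left_mask
    lo = np.searchsorted(sorted_hashes, prefix, side='left')
    hi = np.searchsorted(sorted_hashes, prefix | right_mask, side='right')
    return np.arange(lo, hi)

# Integer equivalents of the string example discussed earlier in the thread.
hashes = np.array([1, 10, 13, 14, 15, 33, 40, 42])  # sorted
print(find_matching_indices(hashes, 0b001100, 4))   # [2 3 4]
```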

@robertlayton robertlayton and 2 others commented on an outdated diff Jul 28, 2014
sklearn/neighbors/lsh_forest.py
+ for length in range(self.max_label_length+1):
+ left_mask = int("".join(['1' for i in range(length)])
+ + "".join(['0' for i in
+ range(self.max_label_length-length)]),
+ 2)
+ self._left_mask.append(left_mask)
+ right_mask = int("".join(['0' for i in range(length)])
+ + "".join(['1' for i in
+ range(self.max_label_length-length)]),
+ 2)
+ self._right_mask.append(right_mask)
+
+ self._left_mask = np.array(self._left_mask)
+ self._right_mask = np.array(self._right_mask)
+
+ def _get_candidates(self, query, max_depth, bin_queries, m):
@robertlayton
robertlayton Jul 28, 2014 Member

Please docstring all functions, even private ones

@maheshakya
maheshakya Jul 28, 2014 Contributor

Done.

@arjoly
arjoly Aug 5, 2014 Member

Why does it need to be a method if it called only once?

@arjoly
arjoly Aug 5, 2014 Member

What is m? To follow scikit-learn convention, we should try to use self-explaining variable name.

@robertlayton robertlayton and 1 other commented on an outdated diff Jul 28, 2014
sklearn/neighbors/tests/test_lsh_forest.py
+import numpy as np
+
+from sklearn.utils.testing import assert_array_equal
+from sklearn.utils.testing import assert_equal
+from sklearn.utils.testing import assert_raises
+from sklearn.utils.testing import assert_array_less
+from sklearn.utils.testing import assert_greater
+
+from sklearn.metrics import euclidean_distances
+from sklearn.neighbors import LSHForest
+
+
+def test_neighbors_accuracy_with_c():
+ """Accuracy increases as `c` increases."""
+ c_values = np.array([10, 50, 250])
+ samples = 1000
@robertlayton
robertlayton Jul 28, 2014 Member

For unit testing, try to keep these values as small as possible while still testing your intent. Remember that these tests will be run lots of times, so short run times are preferable.

Longer tests could possibly be in the examples/ folder

@maheshakya
maheshakya Jul 28, 2014 Contributor

I reduced the sample size. I will add some examples.

@robertlayton robertlayton commented on an outdated diff Jul 28, 2014
sklearn/neighbors/tests/test_lsh_forest.py
+
+ intersection = np.intersect1d(ranks, neighbors).shape[0]
+ ratio = intersection/float(n_points)
+ accuracies[i] = accuracies[i] + ratio
+
+ accuracies[i] = accuracies[i]/float(n_iter)
+
+ # Sorted accuracies should be equal to original accuracies
+ assert_array_equal(accuracies, np.sort(accuracies),
+ err_msg="Accuracies are not non-decreasing.")
+
+
+def test_neighbors_accuracy_with_n_trees():
+ """Accuracy increases as `n_trees` increases."""
+ n_trees = np.array([1, 10, 100])
+ samples = 1000
@robertlayton
robertlayton Jul 28, 2014 Member

My comment about unit test times applies all through this file

@robertlayton robertlayton and 1 other commented on an outdated diff Jul 28, 2014
sklearn/neighbors/tests/test_lsh_forest.py
+ dim = 50
+ n_iter = 100
+ X = np.random.rand(samples, dim)
+
+ lshf = LSHForest()
+ # Test unfitted estimator
+ assert_raises(ValueError, lshf.radius_neighbors, X[0])
+
+ lshf.fit(X)
+
+ for i in range(n_iter):
+ point = X[np.random.randint(0, samples)]
+ mean_dist = np.mean(euclidean_distances(point, X))
+ neighbors = lshf.radius_neighbors(point, radius=mean_dist)
+ # At least one neighbor should be returned.
+ assert_greater(neighbors.shape[1], 0)
@robertlayton
robertlayton Jul 28, 2014 Member

It would be good to test for correctness here -- possibly generate some static queries and see if they match as expected.

@maheshakya
maheshakya Jul 28, 2014 Contributor

All radius neighbors should be less than the given radius, so that test has been added.

@robertlayton
robertlayton Aug 11, 2014 Member

That's better, but correctness would be "I precomputed this value to equal 3.14, and the test asserts I still get that figure". This is a good way to check that your `random_state` is set correctly.

maheshakya added some commits Jul 28, 2014
@maheshakya maheshakya Removed _do_hash function.
The functionality of _do_hash is done within the _create_tree function as
a hashing function is generated for each tree and it doesn't need a separate
function.
b514130
@maheshakya maheshakya Updated radius_neighbors test.
802170e
@maheshakya maheshakya Added _convert_to_hash function.
This will be used in the queries and insertions to convert the data point
into the integer represented by the binary hash.
4c485a9
@maheshakya maheshakya Updated _generate_masks to work with cached hashes. 6e8b6a8
@maheshakya maheshakya Changed the insert function to accpet a batch of data points(a 2D arr…
…ay).

Now trees and original indices are stored in lists. Tests for insert and fit
functions have been updated.
e183c21
@daniel-vainsencher

This is what list comprehensions are for.

Owner

Done. But both methods do the same thing (iterating using a for loop).

I proposed this change because the list comprehension version is easier to understand, more obviously correct, and shorter. What did you mean?

Owner

Nothing important actually. I just meant that using a list comprehension doesn't have an extra advantage in terms of performance.

So first, it's not all about speed (I know you know ;) )

Second, you're wrong:
In [50]: b = []
In [56]: %timeit for i in xrange(100000): b.append(i+1)
1 loops, best of 3: 9.72 ms per loop
In [57]: %timeit [i+1 for i in xrange(100000)]
100 loops, best of 3: 4.91 ms per loop

Of course the advantage is pretty tiny, and visible here only because + is so cheap... the point is that assuming is dangerous.

Owner

Okay. Point understood :)

@daniel-vainsencher

"Cost is proportional to new total size, so additions should be batched."

Owner

Added to the doc string.

@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling 06ce72a on maheshakya:lsh_forest into 1b2833a on scikit-learn:master.

@robertlayton
Member

OK, some general comments:

  • Your tests should include a working example in the Neighbours framework (i.e. create a classifier and test it can be called). This checks that the API is correct.
  • You did reduce your testing size, but can you reduce even more? i.e. you are only testing for accuracy, is there any damage to having samples=5 and dim=2?
  • LSHForest.fit(self, X=None) has X=None as a default value, but this fails (as it should). Don't let X=None, make it required.
  • Please create a usage example (see the examples/ folder) based on your previous blog posts. See this example, which shows something similar to what you might want to show (although you may want to show speed improvement too). Examples can take (much) longer than tests to run, so feel free to use a larger dataset.
  • Narrative documentation! see doc/modules

At this point, I'd be happy to go to the scikit-learn devs to get further comments.

@maheshakya
Contributor
  • Did you mean to create a classifier like KNeighborsClassifier? If so, that is out of the scope of GSoC, because it is another application of LSH-ANN. The API can be checked with clustering algorithms (which are in the scope of the project) too, right?
  • Reducing the dimension below 32 causes a warning from GaussianRandomProjection, as we are using a fixed hash length of 32.

I will do the rest.

BTW, is it okay to add the application of LSH in clustering also in this PR? Then examples and documentation can be done for both as planned in the schedule.

@jnothman
Member
jnothman commented Aug 1, 2014

Did you mean that create a classifier like KNeighborsClassifier? If so, it's going to be out of the scope of GSoC because it is another application of LSH-ANN.

I assumed this was the primary use of LSH-ANN in the context of scikit-learn. LSH-ANN isn't in itself machine learning.

@jnothman jnothman commented on an outdated diff Aug 1, 2014
sklearn/neighbors/lsh_forest.py
+ """
+ Performs approximate nearest neighbor search using LSH forest.
+
+ LSH Forest: Locality Sensitive Hashing forest [1] is an alternative
+ method for vanilla approximate nearest neighbor search methods.
+ LSH forest data structure has been implemented using sorted
+ arrays and binary search. 32 bit fixed length hashes are used in
+ this implementation.
+
+ Parameters
+ ----------
+
+ n_trees: int, optional (default = 10)
+ Number of trees in the LSH Forest.
+
+ c: int, optional(default = 10)
@jnothman
jnothman Aug 1, 2014 Member

please place a space between 'optional' and '(' here and below

@jnothman jnothman commented on an outdated diff Aug 1, 2014
sklearn/neighbors/lsh_forest.py
+ Performs approximate nearest neighbor search using LSH forest.
+
+ LSH Forest: Locality Sensitive Hashing forest [1] is an alternative
+ method for vanilla approximate nearest neighbor search methods.
+ LSH forest data structure has been implemented using sorted
+ arrays and binary search. 32 bit fixed length hashes are used in
+ this implementation.
+
+ Parameters
+ ----------
+
+ n_trees: int, optional (default = 10)
+ Number of trees in the LSH Forest.
+
+ c: int, optional(default = 10)
+ Threshold value to select candidates for nearest neighbors.
@jnothman
jnothman Aug 1, 2014 Member

threshold of what value? I think perhaps "threshold" should be avoided here.

@jnothman jnothman commented on an outdated diff Aug 1, 2014
sklearn/neighbors/lsh_forest.py
+ Parameters
+ ----------
+
+ n_trees: int, optional (default = 10)
+ Number of trees in the LSH Forest.
+
+ c: int, optional(default = 10)
+ Threshold value to select candidates for nearest neighbors.
+ Number of candidates is often greater than c*n_trees(unless
+ restricted by lower_bound)
+
+ n_neighbors: int, optional(default = 1)
+ Number of neighbors to be returned from query funcitond when
+ it is not provided with the query.
+
+ lower_bound: int, optional(defualt = 4)
@jnothman
jnothman Aug 1, 2014 Member

This is a strange documentation convention, where both "optional" and "default" are included. This is a case where I find "optional" confusing: does it mean that the lower bound itself is optional?

@jnothman jnothman commented on an outdated diff Aug 1, 2014
sklearn/neighbors/lsh_forest.py
+ arrays and binary search. 32 bit fixed length hashes are used in
+ this implementation.
+
+ Parameters
+ ----------
+
+ n_trees: int, optional (default = 10)
+ Number of trees in the LSH Forest.
+
+ c: int, optional(default = 10)
+ Threshold value to select candidates for nearest neighbors.
+ Number of candidates is often greater than c*n_trees(unless
+ restricted by lower_bound)
+
+ n_neighbors: int, optional(default = 1)
+ Number of neighbors to be returned from query funcitond when
@jnothman
jnothman Aug 1, 2014 Member

"functiond" -> "function"
"with the query" -> "to :meth:k_neighbors"
Is it necessary to have this parameter both for the class and the method?

@jnothman jnothman commented on an outdated diff Aug 1, 2014
sklearn/neighbors/lsh_forest.py
+
+ c: int, optional(default = 10)
+ Threshold value to select candidates for nearest neighbors.
+ Number of candidates is often greater than c*n_trees(unless
+ restricted by lower_bound)
+
+ n_neighbors: int, optional(default = 1)
+ Number of neighbors to be returned from query funcitond when
+ it is not provided with the query.
+
+ lower_bound: int, optional(defualt = 4)
+ lowerest hash length to be searched when candidate selection is
+ performed for nearest neighbors.
+
+ radius : float, optinal(default = 1.0)
+ Range of parameter space to use by default for :meth`radius_neighbors`
@jnothman
jnothman Aug 1, 2014 Member

I don't think "Range of parameter space" is clear here, i.e. it doesn't help the user set this value.

@jnothman jnothman commented on an outdated diff Aug 1, 2014
sklearn/neighbors/lsh_forest.py
+ Number of neighbors to be returned from query funcitond when
+ it is not provided with the query.
+
+ lower_bound: int, optional(defualt = 4)
+ lowerest hash length to be searched when candidate selection is
+ performed for nearest neighbors.
+
+ radius : float, optinal(default = 1.0)
+ Range of parameter space to use by default for :meth`radius_neighbors`
+ queries.
+
+ radius_cutoff_ratio: float, optional(defualt = 0.9)
+ Cut off ratio of radius neighbors to candidates at the radius
+ neighbor search
+
+ random_state: float, optional(default = 1)
@jnothman
jnothman Aug 1, 2014 Member

"float" is not correct. Nor is "default=1". Please copy the type and description from other estimators that accept a random_state.

@jnothman jnothman commented on an outdated diff Aug 1, 2014
sklearn/neighbors/lsh_forest.py
+ """
+
+ def __init__(self, n_trees=10, radius=1.0, c=50, n_neighbors=1,
+ lower_bound=4, radius_cutoff_ratio=.9,
+ random_state=None):
+ self.n_trees = n_trees
+ self.radius = radius
+ self.random_state = random_state
+ self.c = c
+ self.n_neighbors = n_neighbors
+ self.lower_bound = lower_bound
+ self.radius_cutoff_ratio = radius_cutoff_ratio
+
+ def _generate_hash_function(self):
+ """
+ Fits a `GaussianRandomProjections` with `n_components=hash_size
@jnothman
jnothman Aug 1, 2014 Member

"Projections" -> "Projection"

@jnothman jnothman commented on an outdated diff Aug 1, 2014
sklearn/neighbors/lsh_forest.py
+ self.random_state = random_state
+ self.c = c
+ self.n_neighbors = n_neighbors
+ self.lower_bound = lower_bound
+ self.radius_cutoff_ratio = radius_cutoff_ratio
+
+ def _generate_hash_function(self):
+ """
+ Fits a `GaussianRandomProjections` with `n_components=hash_size
+ and n_features=n_dim.
+ """
+ random_state = check_random_state(self.random_state)
+ grp = GaussianRandomProjection(n_components=self.max_label_length,
+ random_state=random_state.randint(0,
+ 10))
+ X = np.zeros((2, self._n_dim), dtype=float)
@jnothman
jnothman Aug 1, 2014 Member

Please include a comment for why this needs to be 2, not 1

@jnothman jnothman commented on an outdated diff Aug 1, 2014
sklearn/neighbors/lsh_forest.py
+ random_state = check_random_state(self.random_state)
+ grp = GaussianRandomProjection(n_components=self.max_label_length,
+ random_state=random_state.randint(0,
+ 10))
+ X = np.zeros((2, self._n_dim), dtype=float)
+ grp.fit(X)
+ return grp
+
+ def _create_tree(self):
+ """
+ Builds a single tree (in this case creates a sorted array of
+ binary hashes).
+ Hashing is done on an array of data points.
+ This creates a binary hashes by getting the dot product of
+ input points and hash_function then transforming the projection
+ into a binary string array based on the sign(positive/negative)
@jnothman
jnothman Aug 1, 2014 Member

space before '('

@jnothman jnothman commented on an outdated diff Aug 1, 2014
sklearn/neighbors/lsh_forest.py
+
+ radius : float, optinal(default = 1.0)
+ Range of parameter space to use by default for :meth`radius_neighbors`
+ queries.
+
+ radius_cutoff_ratio: float, optional(defualt = 0.9)
+ Cut off ratio of radius neighbors to candidates at the radius
+ neighbor search
+
+ random_state: float, optional(default = 1)
+ A random value to initialize random number generator.
+
+ Attributes
+ ----------
+
+ `hash_functions`: list of arrays
@jnothman
jnothman Aug 1, 2014 Member

hash_functions -> hash_functions_
It's also unclear from this description how a hash function is an array.

@jnothman jnothman commented on an outdated diff Aug 1, 2014
sklearn/neighbors/lsh_forest.py
+ right_mask = np.append(np.zeros(length, dtype=int),
+ np.ones(self.max_label_length-length,
+ dtype=int))
+ xx = tuple(right_mask)
+ binary_hash_right = (self.cache[xx[:self.cache_N]] * self.k +
+ self.cache[xx[self.cache_N:]])
+ self._right_mask.append(binary_hash_right)
+
+ self._left_mask = np.array(self._left_mask)
+ self._right_mask = np.array(self._right_mask)
+
+ def _get_candidates(self, query, max_depth, bin_queries, m):
+ """
+ Performs the Synchronous ascending phase in the LSH Forest
+ paper.
+ Returns an array of candidates, their distance rancks and
@jnothman
jnothman Aug 1, 2014 Member

"rancks" -> "ranks"

@jnothman jnothman commented on an outdated diff Aug 1, 2014
sklearn/neighbors/lsh_forest.py
+
+def _simple_euclidean_distance(query, candidates):
+ """
+ Private function to calculate Euclidean distances between each
+ point in candidates and query
+ """
+ distances = np.zeros(candidates.shape[0])
+ for i in range(candidates.shape[0]):
+ distances[i] = np.linalg.norm(candidates[i] - query)
+ return distances
+
+
+class LSHForest(BaseEstimator):
+
+ """
+ Performs approximate nearest neighbor search using LSH forest.
@jnothman
jnothman Aug 1, 2014 Member

All/most of your docstrings are incorrectly formatted: on the same line as the opening """ should be a brief summary.

@jnothman jnothman and 1 other commented on an outdated diff Aug 1, 2014
sklearn/neighbors/lsh_forest.py
+ """
+ if X is None:
+ raise ValueError("X cannot be None")
+
+ self._input_array = check_array(X)
+ self._n_dim = self._input_array.shape[1]
+
+ self.max_label_length = 32
+ digits = ['0', '1']
+ # Creates a g(p,x) for each tree
+ self.hash_functions_ = []
+ self._trees = []
+ self._original_indices = []
+
+ self.cache_N = int(self.max_label_length/2)
+ hashes = [x for x in itertools.product((0, 1),
@jnothman
jnothman Aug 1, 2014 Member

x could be better named, given it is unrelated to the input X

@larsmans
larsmans Aug 1, 2014 Member

Actually [x for x in y] is more properly written list(y).

@jnothman jnothman and 1 other commented on an outdated diff Aug 1, 2014
sklearn/neighbors/lsh_forest.py
+ 10))
+ X = np.zeros((2, self._n_dim), dtype=float)
+ grp.fit(X)
+ return grp
+
+ def _create_tree(self):
+ """
+ Builds a single tree (in this case creates a sorted array of
+ binary hashes).
+ Hashing is done on an array of data points.
+ This creates a binary hashes by getting the dot product of
+ input points and hash_function then transforming the projection
+ into a binary string array based on the sign(positive/negative)
+ of the projection.
+ """
+ grp = self._generate_hash_function()
@jnothman
jnothman Aug 1, 2014 Member

I don't think we should require completely independent gaussian random projections for each tree. While the trees will not be entirely independent, surely many complementary trees (in the sense of increasing recall) can be built just by permuting the bits in one of your existing hash functions.

@daniel-vainsencher
daniel-vainsencher Aug 1, 2014

More generally, k indices of h bits each can be populated using m hash bits for m >= k*h, using h bits chosen uniformly without repetition for each, with the permutations encoded as a matrix multiply. The question is: at what point does the reuse begin to hurt results? to what extent does the reuse help performance? I don't know if there is a reasonable answer to these questions easy to implement inside the GSOC, do you? Are there papers quantifying the effects of sharing (even for LSH, ignoring the *Forest aspect)?

@jnothman
jnothman Aug 2, 2014 Member

The bit permutations would be best done not as matrix multiplies, but as bit operations in C. Even if performed as multiplies over binary matrices, these matrices don't have the full feature-space dimensionality, only a fixed width of 32, so we should expect it to be much faster than additional random projections.

I can't recall where I've seen this technique used, but I don't think it involves a quantification of the sharing effects. Some colleagues were playing around with this variant, but I'm not sure if they have useful numbers in this space. However, traditional (bucket-based, rather than forest-based) LSH, in my understanding, is often described in terms of selecting a fixed number of bits from a random projection to represent one hash. This should be equivalent to permutation, and given the number of possible permutations, should still provide the additional recall that creating multiple indexes intends.

@jnothman jnothman and 3 others commented on an outdated diff Aug 1, 2014
sklearn/neighbors/lsh_forest.py
+ self._n_dim = self._input_array.shape[1]
+
+ self.max_label_length = 32
+ digits = ['0', '1']
+ # Creates a g(p,x) for each tree
+ self.hash_functions_ = []
+ self._trees = []
+ self._original_indices = []
+
+ self.cache_N = int(self.max_label_length/2)
+ hashes = [x for x in itertools.product((0, 1),
+ repeat=self.cache_N)]
+
+ self.cache = {}
+ for item in hashes:
+ self.cache[tuple(item)] = int("".join([digits[y] for y in item]),
@jnothman
jnothman Aug 1, 2014 Member

Alternatively, could use np.packbits and get rid of the cache and tuple conversion altogether? I'm not sure which would be faster.

@daniel-vainsencher
daniel-vainsencher Aug 1, 2014

Ah, cool, I wasn't aware of this method. Thanks Joel!

@maheshakya
maheshakya Aug 1, 2014 Contributor

Actually, when it comes to converting bit arrays to integers, np.packbits is slower. uint8 values are returned by np.packbits, and for a 32-bit array, 4 uint8 numbers are returned. To get the final result, those 4 values have to be combined as test[0]*(2**24)+test[1]*(2**16)+test[2]*(2**8)+test[3], where test is the array returned from np.packbits for a 32-bit array.
Average time taken for this operation is 7.43 µs, whereas the cache-and-tuple version takes only 550 ns.

@larsmans
larsmans Aug 1, 2014 Member

Is that including the cost of computing those constants every time? Also, note that this is a sum of products of corresponding elements, i.e. a dot product.

@maheshakya
maheshakya Aug 1, 2014 Contributor

Computing the values to store in the cache happens only once. Moreover, even if we compare the computing times of these two methods, int("".join([digits[y] for y in bit_array]), 2) is still faster: 1.79 µs. test = np.packbits(bit_array); test[0]*(2**24)+test[1]*(2**16)+test[2]*(2**8)+test[3] takes 10.9 µs on average.

@jnothman
jnothman Aug 2, 2014 Member

I wouldn't be surprised if your cache method is faster, but test[0] + (2**24) ... etc shouldn't be necessary. You can use a view over the same array to convert from uint8 to uint32:

>>> np.packbits(np.array([0,0,0,0,0,0,0,0,] * 1 + [0,0,0,0,0,0,0,1] * 3)).view(dtype='>u4')
array([65793], dtype=uint32)

65793 = 2**0 + 2**8 + 2**16

Comparing this to lookup in a dict:

In [2]: a = np.array([0,0,0,0,0,0,0,0,] * 1 + [0,0,0,0,0,0,0,1] * 3)
In [3]: %timeit np.packbits(a).view(dtype='>u4')
1000000 loops, best of 3: 1.98 µs per loop
In [4]: lookup = {tuple(a): 5}
In [5]: %timeit lookup[tuple(a)]
100000 loops, best of 3: 6.26 µs per loop
@jnothman
jnothman Aug 2, 2014 Member

And with the further benefit of vectorization:

In [10]: A = np.repeat(a[None, :], 10, axis=0)
In [12]: %timeit np.packbits(A).view(dtype='>u4')
100000 loops, best of 3: 3.75 µs per loop
In [13]: %timeit [lookup[tuple(row)] for row in A]
10000 loops, best of 3: 67.5 µs per loop
@jnothman jnothman commented on an outdated diff Aug 1, 2014
sklearn/neighbors/lsh_forest.py
+ distances.
+ """
+ candidates = []
+ n_candidates = self.c * self.n_trees
+ while max_depth > self.lower_bound and (len(candidates) < n_candidates
+ or len(set(candidates)) < m):
+ for i in range(self.n_trees):
+ candidates.extend(
+ self._original_indices[i][_find_matching_indices(
+ self._trees[i],
+ bin_queries[i],
+ self._left_mask[max_depth],
+ self._right_mask[max_depth])].tolist())
+ max_depth = max_depth - 1
+ candidates = np.unique(candidates)
+ ranks, distances = self._compute_distances(query, candidates)
@jnothman
jnothman Aug 1, 2014 Member

Given that in K(A)NN classification we don't need the actual nearest data points so much as a vote over their classes, can we offer the user to save time by not computing the true distances?

@maheshakya
Contributor

Clustering is the machine learning application of LSH-ANN discussed in the project scope, so it needs to get done first. But this can be applied to classification after the project.

@jnothman
Member
jnothman commented Aug 1, 2014

Fair enough


@larsmans larsmans and 3 others commented on an outdated diff Aug 1, 2014
sklearn/neighbors/lsh_forest.py
+
+def _find_matching_indices(sorted_array, item, left_mask, right_mask):
+ """
+ Finds indices in sorted array of strings where their first
+ h elements match the items' first h elements
+ """
+ left_index = np.searchsorted(sorted_array,
+ item & left_mask)
+ right_index = np.searchsorted(sorted_array,
+ item | right_mask,
+ side='right')
+ return np.arange(left_index, right_index)
+
+
+def _find_longest_prefix_match(bit_string_array, query, hash_size,
+ left_masks, right_masks):
@larsmans
larsmans Aug 1, 2014 Member

Can't the bisect module in the stdlib do this?

@maheshakya
maheshakya Aug 5, 2014 Contributor

Yes, that can be used here. For small and medium array sizes of sorted_array, bisect_left and bisect_right are faster. But for very large arrays, numpy.searchsorted is faster.
Eg:
array size = 1000
average time for _find_matching_indices with bisect = 3.88 µs
average time for _find_matching_indices with searchsorted = 5.09 µs

array size = 10000
average time for _find_matching_indices with bisect = 7.69 µs
average time for _find_matching_indices with searchsorted = 8.56 µs

array size = 30000
average time for _find_matching_indices with bisect = 17.2 µs
average time for _find_matching_indices with searchsorted = 16.8 µs

So I think using bisect is more reasonable.
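For illustration, the prefix-range lookup being discussed can be sketched with `bisect` on integer hashes (a hypothetical `matching_range` helper, not the PR's code): it mirrors the left/right-mask approach by clearing the trailing bits to get the smallest matching value and setting them to get the largest.

```python
import bisect
import numpy as np

def matching_range(sorted_hashes, query, h, hash_size=32):
    """Indices of hashes whose first h bits equal the query's first h bits.

    The smallest matching hash has the trailing (hash_size - h) bits
    cleared, the largest has them set; bisect finds the bracketing range.
    """
    trailing = hash_size - h
    low = (query >> trailing) << trailing    # clear the trailing bits
    high = low | ((1 << trailing) - 1)       # set the trailing bits
    left = bisect.bisect_left(sorted_hashes, low)
    right = bisect.bisect_right(sorted_hashes, high)
    return np.arange(left, right)

hashes = sorted([0b0001, 0b0100, 0b0101, 0b0111, 0b1100])
print(matching_range(hashes, 0b0110, h=2, hash_size=4))  # -> [1 2 3]
```

Whether `bisect` or `np.searchsorted` wins then depends only on the array size, as the benchmark above shows; the masking logic is identical.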

@daniel-vainsencher
daniel-vainsencher Aug 5, 2014

In this case all the differences are small, so it's not worth even talking about, but keep in mind that for not-too-big databases, LSH has no benefit over exhaustive search. So you should always be thinking about large DBs.

@ogrisel
ogrisel Aug 13, 2014 Member

+1, when doing benchmarks I think we need to focus on n_samples > 100k.

@larsmans larsmans commented on an outdated diff Aug 1, 2014
sklearn/neighbors/lsh_forest.py
+ ----------
+
+ n_trees: int, optional (default = 10)
+ Number of trees in the LSH Forest.
+
+ c: int, optional(default = 10)
+ Threshold value to select candidates for nearest neighbors.
+ Number of candidates is often greater than c*n_trees(unless
+ restricted by lower_bound)
+
+ n_neighbors: int, optional(default = 1)
+ Number of neighbors to be returned from query funcitond when
+ it is not provided with the query.
+
+ lower_bound: int, optional(defualt = 4)
+ lowerest hash length to be searched when candidate selection is
@larsmans
larsmans Aug 1, 2014 Member

lowerest -> lowest

@larsmans larsmans commented on an outdated diff Aug 1, 2014
sklearn/neighbors/lsh_forest.py
+ """
+ bin_queries = []
+
+ # descend phase
+ max_depth = 0
+ for i in range(self.n_trees):
+ bin_query = self._convert_to_hash(query, i)
+ k = _find_longest_prefix_match(self._trees[i], bin_query,
+ self.max_label_length,
+ self._left_mask,
+ self._right_mask)
+ if k > max_depth:
+ max_depth = k
+ bin_queries.append(bin_query)
+
+ if not is_radius:
@larsmans
larsmans Aug 1, 2014 Member

Stylistic nitpick: when switching on a boolean, please put the positive case first, esp. when the negative case takes fewer lines of code.

@larsmans larsmans commented on an outdated diff Aug 1, 2014
sklearn/neighbors/lsh_forest.py
+ X : array_like, shape (n_samples, n_features)
+ List of n_features-dimensional data points. Each row
+ corresponds to a single query.
+
+ n_neighbors: int, opitonal (default = None)
+ Number of neighbors required. If not provided, this will
+ return the number specified at the initialization.
+
+ return_distance: boolean, optional (default = False)
+ Returns the distances of neighbors if set to True.
+ """
+ if not hasattr(self, 'hash_functions_'):
+ raise ValueError("estimator should be fitted.")
+
+ if X is None:
+ raise ValueError("X cannot be None.")
@larsmans
larsmans Aug 1, 2014 Member

Why this specific check for None, is it an expected input? Also, the exception class should be TypeError.

@larsmans larsmans commented on an outdated diff Aug 1, 2014
sklearn/neighbors/lsh_forest.py
+ X : array_like, shape (n_samples, n_features)
+ List of n_features-dimensional data points. Each row
+ corresponds to a single query.
+
+ radius : float
+ Limiting distance of neighbors to return.
+ (default is the value passed to the constructor).
+
+ return_distance: boolean, optional (default = False)
+ Returns the distances of neighbors if set to True.
+ """
+ if not hasattr(self, 'hash_functions_'):
+ raise ValueError("estimator should be fitted.")
+
+ if X is None:
+ raise ValueError("X cannot be None.")
@larsmans
larsmans Aug 1, 2014 Member

Same story.

@larsmans larsmans commented on an outdated diff Aug 1, 2014
sklearn/neighbors/lsh_forest.py
+
+ radius : float
+ Limiting distance of neighbors to return.
+ (default is the value passed to the constructor).
+
+ return_distance: boolean, optional (default = False)
+ Returns the distances of neighbors if set to True.
+ """
+ if not hasattr(self, 'hash_functions_'):
+ raise ValueError("estimator should be fitted.")
+
+ if X is None:
+ raise ValueError("X cannot be None.")
+
+ if radius is not None:
+ self.radius = radius
@larsmans
larsmans Aug 1, 2014 Member

I don't like a query method modifying an object. I also don't see why the class should remember the radius of the last radius query. It looks like this should be the other way around, if radius is None: radius = self.radius.

@maheshakya maheshakya Updated docstrings and tests.
in fit, kneighbors and radius methods, None type is not checked because
it is not an expected input.
2955fc1
@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling 2955fc1 on maheshakya:lsh_forest into 1b2833a on scikit-learn:master.

@maheshakya
Contributor

All tests were written based on euclidean distance. Modified those to cosine distance.

@maheshakya
Contributor

Is there any reason different BLAS versions produce different results for the dot product? Doctests fail because the Python 2.6 and 3.4 configurations produce different results from 2.7, for example.

@jnothman
Member

It's not an issue of BLAS, really. You've chosen a particularly bad example: it's showing negative distances!! And what it means is 0 (but for floating point error): all your samples are identical after normalization (they are geometric sequences with different bases). Why did you choose logspace?

@maheshakya
Contributor

That was used for the euclidean distance example; I forgot to change it.
Thank you for pointing that out.

@coveralls

Coverage Status

Changes Unknown when pulling 74d5ef7 on maheshakya:lsh_forest into master on scikit-learn:master.

@jnothman

This won't pass consistently without a fixed random_state. It happens to pass frequently because your data is particularly contrived. For example, it fails with random_state=0 or random_state=1

@jnothman jnothman commented on the diff Oct 21, 2014
sklearn/neighbors/approximate.py
+ hashes = (grp.transform(self._fit_X) > 0).astype(int)
+ hash_function = grp.components_
+
+ binary_hashes = np.packbits(hashes).view(dtype='>u4')
+ original_indices = np.argsort(binary_hashes)
+
+ return original_indices, binary_hashes[original_indices], hash_function
+
+ def _compute_distances(self, query, candidates):
+ """Computes the cosine distance.
+
+ Distance is from the query to points in the candidates array.
+ Returns argsort of distances in the candidates
+ array and sorted distances.
+ """
+ distances = pairwise_distances(query, self._fit_X[candidates],
@jnothman
jnothman Oct 21, 2014 Member

However, your error in the example alerted me to an unhandled case: where candidates is empty here it is incorrectly typed (array([], dtype=float64)) and this line raises:

IndexError: arrays used as indices must be of integer (or boolean) type

Please fix the type where you generate candidates.

@jnothman
jnothman Oct 21, 2014 Member

Actually, returning insufficient candidates requires a backup approach (e.g. fill slots without candidates arbitrarily).

And add a test in which you fit it with a very small dataset and manually check that for the given random_state there are no candidates returned.

@maheshakya
maheshakya Oct 21, 2014 Contributor

Isn't it more useful to raise an error with more information, such as that there are no candidates for this particular query? Filling up the neighbors by any other method is not coherent with this approximate neighbor search technique.

@jnothman
jnothman Oct 21, 2014 Member

I don't think an error is a good idea in this case, but a warning may be suitable: in the example you presented (X_train = [[5, 5, 2], [21, 5, 5], [1, 1, 1], [8, 9, 1], [6, 10, 2]]; X_test = [[9, 1, 6], [3, 1, 10], [7, 10, 3]]; n_neighbors=2), setting random_state=33 will trigger this error, not because all queries cannot be answered, but because some cannot. In such cases we need to fail gracefully.

Could you also explain to me why in the forest paradigm (rather than discrete buckets for each hash) this is occurring? Surely there are hashes closest to the input's hash for any input... Is this actually a deeper issue in that as long as n_train_samples >= n_candidates there should be sufficient output from the forest?

@maheshakya
maheshakya Oct 21, 2014 Contributor

Actually, the forest paradigm does not cause such an issue in its original form. There is a parameter min_hash_length whose default value is 4. It is used to limit the number of candidates selected, because candidates with very short hash matches do not contribute much to the nearest neighbors, and they would also overwhelm the set of selected candidates, as there are many candidates with a short hash-length match for any query.
In the original algorithm, candidates are selected until max_depth reaches 0. But here we allow that to be limited by setting min_hash_length. If min_hash_length is set to 0, candidates will be selected until the matching hash length becomes 0, hence any input will have a closest hash. So the above problem would be eliminated.

I think there should be a warning addressing this as well, right? To decrease the min_hash_length.

@jnothman
jnothman Oct 21, 2014 Member

Yes, I think a warning like that might be useful. But we still need to handle the case where the current approach returns no candidates. Either we return arbitrary indices, or we return distance inf/nan and index -1.

And perhaps we should rename min_hash_length to min_hash_match.
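The "distance inf / index -1" fallback mentioned here could look something like this (a hypothetical helper, not the PR's code):

```python
import numpy as np

def pad_neighbors(indices, distances, n_neighbors):
    """Pad a too-short candidate result with index -1 / distance inf.

    One possible way to fail gracefully when a query yields fewer
    candidates than requested: the caller still gets arrays of the
    promised length, with sentinel slots that are easy to filter out.
    """
    n_missing = n_neighbors - len(indices)
    if n_missing > 0:
        indices = np.concatenate([indices,
                                  np.full(n_missing, -1, dtype=int)])
        distances = np.concatenate([distances,
                                    np.full(n_missing, np.inf)])
    return indices[:n_neighbors], distances[:n_neighbors]

idx, dist = pad_neighbors(np.array([4, 2]), np.array([0.1, 0.3]), 4)
print(idx)   # -> [ 4  2 -1 -1]
print(dist)  # -> [0.1 0.3 inf inf]
```

The alternative debated below is to top up with random real candidates instead of sentinels, so every returned index actually exists in the database.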

@daniel-vainsencher
daniel-vainsencher Oct 22, 2014

Sorry I haven't been following too closely. You guys have been catching
important issues, it's cool! I did want to weigh in on one:

The min_hash_length parameter avoids a bad performance issue (where we
try to look at more than a quarter of the DB). In my opinion, if
min_hash_length prevents us from looking at sufficient candidates, then
we should add some random candidates and continue as usual. Then
candidates in _compute_distances is never empty.

Remember that even if min_hash_length kicked in in one index, it is
quite likely that some other index is having better luck, and having
some probability of bad performance is not a problem when using this
type of data-structure, that's just life. Hence I think an error or
warning should be avoided. BTW, there are algorithmic approaches we want
to try that deal with this whole issue more elegantly, but I think we
should test carefully after this is merged.

On a related issue: for the user to ask for more candidates than the DB
indexes is a user error which should be raised.

Daniel


@maheshakya
maheshakya Oct 22, 2014 Contributor

Oops. I just pushed without seeing this comment.
However, there is another issue with candidates. We can surely use a random set of candidates if 0 candidates were selected. But the selected number of candidates may still be less than the requested number of neighbors. Here I only used a user warning for such a situation. How can we address this better?

@daniel-vainsencher
daniel-vainsencher Oct 22, 2014

Invariant: unless the number of indexed points is smaller than the number
of points requested (which should raise an error), the number of
candidates passed to distance computation should always be at least the
requested number, so that the best k of those can be returned.

Anything else complicates the user's life.

So I would pad the candidate set out to a reasonable size (say, 5k)
with random candidates.

Daniel


@jnothman
jnothman Oct 22, 2014 Member

Why do you prefer random candidates over dud infinite-distance candidates?

@daniel-vainsencher
daniel-vainsencher Oct 22, 2014

The contract with the user is that in the end we must return k "neighbors"
that actually exist in the DB, so the best k out of 5k random candidates is
a valid, if sucky, answer. How do dud infinite-distance candidates help us?
maybe I am not understanding exactly what you mean.

Daniel Vainsencher

@jnothman
jnothman Oct 22, 2014 Member

You're right about the API in terms of using this in other components and making that interaction reliable. One way of doing this, however, is to return dummy results from each tree when there aren't enough candidates, and only after the multiple trees are combined are arbitrary fillers chosen.

@jnothman
jnothman Oct 22, 2014 Member

Your approach should suffice though. Do we need the arbitrary candidates to be distinct?

@daniel-vainsencher
daniel-vainsencher Oct 22, 2014

The k returned neighbors have to be distinct, that's the reasonable
expectation of a user of the package.

I'm sorry, I can't put in the time right now to read the current code
and think about how to do it elegantly.

Daniel


@maheshakya
maheshakya Oct 22, 2014 Contributor

Choosing a fixed 5k (or any other number) of candidates may not always be practical, because 5k can be greater than the index size. What if only k distinct candidates are chosen randomly and returned? In fact, choosing 5k candidates causes more distance computations, which is redundant in such a situation.

@daniel-vainsencher
daniel-vainsencher Oct 22, 2014

The hard requirement is that if the index is large enough, we should
always return exactly k distinct neighbors from the DB. The rest is
optional because (unless we have evidence otherwise), the data structure
actually works for non-toy datasets.

That said, just to discuss the ideas: Yes, taking min(index_size, 5 * k)
candidates is fine, of course: in the case where index_size is used,
that is falling back to brute force.

When the DB is large, if you choose k at random, their average distance
will be the average distance in the DB. If you choose at random 5k and
choose the top k out of them, your neighbors will be better than
average; that was why I proposed it. Of course, the LSHForest indices
should be much better than this anyway.

Daniel


@jnothman
jnothman Oct 23, 2014 Member

So I think this change needs to be in kneighbors: it should not touch radius_neighbors. I think we need the return value to be deterministic, so I am not sure about this random selection idea. I'd even consider repeating candidates with a warning, or selecting indices 0, 1, 2, ... as needed.

@daniel-vainsencher
daniel-vainsencher Oct 23, 2014

The data structure construction is already probabilistic. Do you think
it is important for users that for a fixed index, answers be
deterministic? there are advantages in debugging; on the other hand, I
can imagine situations where "always return the first few in index" also
causes significant surprise. But either way seems fine to me.

I think it is important that candidates do not repeat (user: let's get
k neighbors and build an order-k linear model on them. Oh, they are
linearly dependent? What, they repeat!?).

Daniel


@jnothman
jnothman Oct 23, 2014 Member

Fair enough re repetition. But I think this is a case we hope to see little of in practice, and will give a warning so that at least the attentive user (or someone debugging linear dependence, for instance) might realise there is something anomalous.

If there is randomness, I don't think random_state should be shared with fit (if it's an RNG instance, rather than a seed). Perhaps we could seed it on the query instance itself, but this sounds complicated!

@maheshakya
maheshakya Oct 23, 2014 Contributor

I think it is important to provide distinct neighbors rather than repeating the same ones, as Daniel mentioned. Filling the insufficient candidates with uniformly selected points from the index is less complicated than filling them randomly. In fact, using a random set of points does not cohere with the approximation technique we use any more than choosing them uniformly from the indices, so I think it is better to use the less complicated way.

(I've changed _get_candidates to fill up candidates uniformly from the remaining indices)

Moreover, a situation where the candidates become insufficient is unlikely in most applications, because small index sizes are not the general case for ANN.

@jnothman jnothman and 1 other commented on an outdated diff Oct 21, 2014
sklearn/neighbors/approximate.py
+ """Computes the cosine distance.
+
+ Distance is from the query to points in the candidates array.
+ Returns argsort of distances in the candidates
+ array and sorted distances.
+ """
+ distances = pairwise_distances(query, self._fit_X[candidates],
+ metric='cosine')[0]
+ distance_positions = np.argsort(distances)
+ return distance_positions, distances[distance_positions]
+
+ def _generate_masks(self):
+ """Creates left and right masks for all hash lengths."""
+ tri_size = self.max_label_length + 1
+ left_mask = np.tril(np.ones((tri_size, tri_size), dtype=int))[:, 1:]
+ right_mask = np.triu(np.ones((tri_size, tri_size), dtype=int))[:, :-1]
@jnothman
jnothman Oct 21, 2014 Member

I think right_mask is just left_mask[::-1, ::-1]

@jnothman
jnothman Oct 21, 2014 Member

And you're currently producing arrays with shape (33, 32). Is that correct?

@maheshakya
maheshakya Oct 21, 2014 Contributor

Yes, that can be used. Thanks
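The identity above can be checked directly; a minimal sketch, using the mask construction and the 32-bit hash size from the snippet under review:

```python
import numpy as np

# Masks for a 32-bit hash, built exactly as in the snippet under review.
tri_size = 32 + 1
left_mask = np.tril(np.ones((tri_size, tri_size), dtype=int))[:, 1:]
right_mask = np.triu(np.ones((tri_size, tri_size), dtype=int))[:, :-1]

print(left_mask.shape)  # (33, 32), as jnothman noted
# right_mask is indeed left_mask reversed along both axes:
print(np.array_equal(right_mask, left_mask[::-1, ::-1]))  # True
```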

@jnothman jnothman commented on the diff Oct 21, 2014
sklearn/neighbors/approximate.py
+
+from ..random_projection import GaussianRandomProjection
+
+__all__ = ["LSHForest"]
+
+
+def _find_matching_indices(sorted_array, item, left_mask, right_mask):
+ """Finds indices in sorted array of integers.
+
+ Most significant h bits in the binary representations of the
+ integers are matched with the items' most significant h bits.
+ """
+ left_index = np.searchsorted(sorted_array, item & left_mask)
+ right_index = np.searchsorted(sorted_array, item | right_mask,
+ side='right')
+ return np.arange(left_index, right_index)
@jnothman
jnothman Oct 21, 2014 Member

This function is used either to give the number of matched indices (in _find_longest_prefix_match), or to index into the original_indices array. Surely a slice is more appropriate than an arange, if not just returning a tuple.

@maheshakya
maheshakya Oct 21, 2014 Contributor

Here an array starting from left_index (inclusive) and ending with right_index (exclusive) should be returned. These are the indices of sorted_array whose items' first h hash bits match the query's.

@jnothman
jnothman Oct 21, 2014 Member

Or you can return slice(left_index, right_index) and this can also be used to slice an array, but is more efficient (it does not create a copy of the underlying data, and does not create an array to store the indices).

@jnothman
jnothman Oct 21, 2014 Member

I think I would prefer for clarity that you just return a tuple rather than a slice, though, and explicitly use ordered_indices[i][start:stop].
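The searchsorted trick being discussed can be sketched on hypothetical 8-bit hashes (the PR uses 32-bit hashes): all entries whose h most significant bits match the query's form one contiguous run of the sorted array, bounded below by the query with its low bits cleared and above by the query with its low bits set.

```python
import numpy as np

sorted_array = np.array([0b00010000, 0b10100001, 0b10110110, 0b10111010,
                         0b11000000], dtype=np.uint8)
query = np.uint8(0b10110011)
h = 4                                           # match the top 4 bits
left_mask = np.uint8((0xFF << (8 - h)) & 0xFF)  # keeps the top h bits
right_mask = np.uint8(0xFF >> h)                # fills the low 8 - h bits

left = np.searchsorted(sorted_array, query & left_mask)
right = np.searchsorted(sorted_array, query | right_mask, side='right')
print(left, right)  # 2 4: the two 0b1011xxxx entries
```

Returning the pair (left, right), or slice(left, right), then avoids materializing np.arange(left, right) at all.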

@jnothman jnothman commented on the diff Oct 21, 2014
sklearn/neighbors/approximate.py
+__all__ = ["LSHForest"]
+
+
+def _find_matching_indices(sorted_array, item, left_mask, right_mask):
+ """Finds indices in sorted array of integers.
+
+ Most significant h bits in the binary representations of the
+ integers are matched with the items' most significant h bits.
+ """
+ left_index = np.searchsorted(sorted_array, item & left_mask)
+ right_index = np.searchsorted(sorted_array, item | right_mask,
+ side='right')
+ return np.arange(left_index, right_index)
+
+
+def _find_longest_prefix_match(bit_string_array, query, hash_size,
@jnothman
jnothman Oct 21, 2014 Member

I've been trying to understand this function fully. I'm afraid there's a bit of byte order chaos going on here. bit_string_array is coming in as dtype='>u4'. query is dtype=uint32 (native byte order). This is already a problem.

To add to this, left_masks and right_masks are '>u4' but it appears in numpy 1.8.1 that when you select a single element from such an array (e.g. left_masks[mid]) it returns an element with native byte order i.e. uint32!! This sounds like it might be a bug in numpy, but I'm not certain.

@jnothman
jnothman Oct 21, 2014 Member

I'm also fairly sure this function can be written with only two searches in bit_string_array in total.

@maheshakya
maheshakya Oct 21, 2014 Contributor

query is also dtype='>u4'. The _convert_to_hash method returns the hashed query in that data type.

I think this method with binary search is faster than sequential searches on bit_string_array. If n is the index size, a sequential approach takes approximately 32*n time in the worst case, which is O(n). With this method, there are only log(32) searches over hash lengths, and each takes 2*log(n), which is O(log n).

@jnothman
jnothman Oct 21, 2014 Member

No, query is not dtype='>u4' (check it in this function!), probably due to the same quirky numpy behaviour.

And I mean there's a solution which is α2*log(len(bit_string_array)) + β32, without sequential searches on bit_string_array. The idea is that either np.searchsorted(bit_string_array, query, side='left') or np.searchsorted(bit_string_array, query, side='right') (with some edge cases) will share the maximum size prefix with query. It's just a matter of then determining the prefix size, which should be easy enough.

@jnothman
jnothman Oct 21, 2014 Member

The implicit nativization of byte order is really messy and hard to debug. I think we need to find a solution that avoids the explicit byte order in >u4. (I know, in my ignorance, I suggested >u4 in the first place.) For example, rather than viewing the packed bits as >u4, we could view it as an array of |S4, or a 2d array of u1s when necessary. (Note that this approach is extensible to a larger hash length, as long as it is divisible by 8.) Annoyingly, bitwise operations like &, |, ~ are not implemented for |S4, but I can quickly write a longest_matching_prefix(char *a, char *b) in Cython if you would like.

@jnothman
jnothman Oct 22, 2014 Member

By the way, this is noted in an issue at numpy/numpy#1866 which says "It seems (cf. arrayscalars.h) that the scalar objects do not even have byte order on the C level, they're just assumed to be native.". So you have to make sure everything remains a 1d array for the '>u4' dtype to work, or we can change to using byte arrays and make maintenance a bit easier.
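The quirk referenced above (numpy/numpy#1866) is easy to reproduce: scalar objects carry no byte order at the C level, so indexing a big-endian array returns a native-order scalar, even though the value itself is preserved.

```python
import numpy as np

a = np.array([1, 2, 3], dtype='>u4')
print(a.dtype.byteorder)       # '>': the array is big-endian
scalar = a[0]
print(int(scalar))             # 1: the value survives
print(scalar.dtype.byteorder)  # native ('=' on little-endian builds), not '>'
```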

@jnothman
jnothman Oct 22, 2014 Member

For the sake of minimising the work necessary to merge this, I have changed my mind; we can leave a rewrite of this code till later, as currently there are no operations that should be broken by this quirky behaviour, although I still think it's a bad idea to leave it in the code in the long term.

Where I found it problematic was in trying to rewrite the function to use only two lookups in bit_string_array. In doing so you have to have a function that just calculates the shared prefix length of two ints, which is of course byte-order dependent. I ran into trouble.

So I think what we need are:

  • tests of this private method that confirm it does what it is meant to, whether the arrays passed in are of dtype '>u4' or of '<u4' (not sure about the semantics for '<u4' so maybe just support '>u4').
  • a comment pointing to the numpy issue

(If you'd rather go down the string representation path, which again benefits from allowing a variable hash size and discarding the masks arrays, this is still an option, and I'm happy to submit a patch for the code, but tests are still needed.)

@daniel-vainsencher
daniel-vainsencher Oct 22, 2014

Please do not go (back!) to the string representation route.

We need the test suite you refer to anyway; given that, maintaining an int
based representation is reasonable despite the ambiguities and platform
differences. The space (and less so, time) efficiencies of using int-based
representations are significant (~4x for the index, which matters in the
not-too-high-dimension regime). A variable hash length is not practically a
benefit: 32bits is quite a lot for in-memory applications, and still very
compact. A byte array approach might be reasonable as it is compact, the
speed cost is less a concern.


Daniel Vainsencher

@jnothman
jnothman Oct 22, 2014 Member

I don't understand your issue with the space consumption of a (fixed-length) string array, i.e. the exact same array we currently have with .view(dtype='|S4').

@daniel-vainsencher
daniel-vainsencher Oct 22, 2014

If you keep the space efficiency, then I just misunderstood your
proposal, please ignore. Do consider also the time efficiency of the
binary search, it also matters (though much less).

Ok, signing off for now.


@jnothman
jnothman Oct 22, 2014 Member

Thanks Daniel. Binary search wasn't ever in question!

@jnothman
jnothman Oct 27, 2014 Member

A correction: we should use numpy.ndarray.tostring() rather than view('|S4') which will interpret the string as null delimited.

@maheshakya
maheshakya Oct 28, 2014 Contributor

For an array (like we have after random projections, e.g. a = np.array([1, 0, 1]).astype(int)), tostring() gives the raw data bytes, so each element contributes 8 bytes. For the above array, what we get is:

'\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00'

How can this be used when what we want is something like '101'?
(It could be used in string comparison, but there would be a lot of redundant comparisons.)

Besides, in my earliest commits I used string comparisons for this (see 8c396e7) and found it was much slower than the current method.

@maheshakya
maheshakya Oct 28, 2014 Contributor

Can't we proceed with a comment about the numpy byte-order behavior and fix it later, since using the string representation may result in slow performance?
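For reference, the packed-integer route under discussion can be sketched as follows (shapes are illustrative): the boolean projections for each point are packed into 4 bytes and viewed as one big-endian uint32, so sorting the integers sorts the underlying bit strings lexicographically.

```python
import numpy as np

rng = np.random.RandomState(0)
projections = (rng.randn(3, 32) > 0).astype(np.uint8)  # 3 points, 32 bits each
packed = np.packbits(projections, axis=1).view(dtype='>u4').ravel()
print(packed.shape)  # (3,): one 32-bit hash per point
```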

@jnothman jnothman commented on an outdated diff Oct 22, 2014
sklearn/neighbors/approximate.py
+ """Performs approximate nearest neighbor search using LSH forest.
+
+ LSH Forest: Locality Sensitive Hashing forest [1] is an alternative
+ method for vanilla approximate nearest neighbor search methods.
+ LSH forest data structure has been implemented using sorted
+ arrays and binary search. 32 bit fixed length hashes are used in
+ this implementation.
+
+ Parameters
+ ----------
+
+ n_estimators : int (default = 10)
+ Number of trees in the LSH Forest.
+
+ n_candidates : int (default = 10)
+ Value to restrict candidates selected from a single estimator (tree)
@jnothman
jnothman Oct 22, 2014 Member

"Value to restrict candidates selected" is not very clear. How about "Minimum number of candidates evaluated per estimator, assuming enough items meet the min_hash_match constraint."

@jnothman
jnothman Oct 22, 2014 Member

Also, note that this is unused in radius nearest neighbors.

@jnothman jnothman commented on the diff Oct 22, 2014
sklearn/neighbors/approximate.py
+
+ def _get_candidates(self, query, max_depth, bin_queries, n_neighbors):
+ """Performs the Synchronous ascending phase.
+
+ Returns an array of candidates, their distance ranks and
+ distances.
+ """
+ candidates = []
+ max_candidates = self.n_candidates * self.n_estimators
+ while max_depth > self.min_hash_match and (len(candidates)
+ < max_candidates or
+ len(set(candidates))
+ < n_neighbors):
+
+ for i in range(self.n_estimators):
+ candidates.extend(
@jnothman
jnothman Oct 22, 2014 Member

Firstly, I think max_candidates should be min_candidates: we keep getting more until it is met. Or perhaps the code is incorrect (and not sufficiently unit tested).

Secondly, if the loop is traversed more than once, it appears to me that candidates will certainly contain duplicates. Is it really the intention that len(candidates) < max_candidates applies to a candidates array that contains duplicates? Why don't we just use candidates = set() and candidates.update instead of candidates.extend? This avoids the overhead of set(candidates) in the n_neighbors criterion anyway (and the np.unique below).

@maheshakya
maheshakya Oct 22, 2014 Contributor

Yes, I guess you are correct about max_candidates. It should be min.

I think there is a reason to use the conditions len(candidates) < max_candidates or len(set(candidates)) < n_neighbors, as in the LSH Forest paper. The loop terminates only when both of these conditions fail, or when max_depth reaches min_hash_match. If only len(distinct_candidates) < n_neighbors were used, the results would depend solely on the random projections without consulting the actual similarity measure, because the total number of candidates would equal the requested number of neighbors. If only len(distinct_candidates) < max_candidates were used, there would be an unnecessarily excessive number of distinct candidates in the actual distance computation, with no reference to the number of neighbors.
The synchronous ascending algorithm is designed that way, as indicated in the paper.

@jnothman
jnothman Oct 22, 2014 Member

I should probably refer to the paper, but assuming this is right, I'd still rather gather these things in a set, while storing the total number of candidates retrieved so far in a separate variable.
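The gather-in-a-set variant suggested above can be sketched like this (the per-tree lists are hypothetical stand-ins for the real tree queries): candidates go into a set, while the raw retrieval count is tracked in a separate variable, so both stopping criteria stay cheap to check.

```python
per_tree = [[3, 7, 7, 1], [7, 2, 3], [5, 1, 9]]   # simulated tree outputs
min_candidates, n_neighbors = 6, 4

candidates = set()
n_retrieved = 0
for tree_candidates in per_tree:
    candidates.update(tree_candidates)     # deduplicates as we go
    n_retrieved += len(tree_candidates)    # total retrieved, with duplicates
    if n_retrieved >= min_candidates and len(candidates) >= n_neighbors:
        break
print(sorted(candidates))  # [1, 2, 3, 7]
```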

@jnothman jnothman commented on an outdated diff Oct 22, 2014
sklearn/neighbors/approximate.py
+ Parameters
+ ----------
+
+ n_estimators : int (default = 10)
+ Number of trees in the LSH Forest.
+
+ n_candidates : int (default = 10)
+ Value to restrict candidates selected from a single estimator (tree)
+ for nearest neighbors. Number of total candidates is often greater
+ than n_candidates*n_estimators (unless restricted by min_hash_match)
+
+ n_neighbors : int (default = 5)
+ Number of neighbors to be returned from query function when
+ it is not provided to :meth:`k_neighbors`
+
+ min_hash_match : int (default = 4)
@jnothman
jnothman Oct 22, 2014 Member

This is an important parameter, and applies universally unlike n_candidates, n_neighbors, radius and radius_cutoff_ratio each of which is only used in one of the two query methods. Please make this the second parameter to emphasise it.

The importance of this parameter, and its unfamiliarity to users of vanilla LSH, also makes me wonder whether it should be the subject of an example (although I appreciate that it may interact with other parameters in tricky ways).

@jnothman jnothman commented on an outdated diff Oct 23, 2014
sklearn/neighbors/approximate.py
+ """
+ Returns the n_number of approximated nearest neighbors
+
+ Parameters
+ ----------
+ X : array_like, shape (n_samples, n_features)
+ List of n_features-dimensional data points. Each row
+ corresponds to a single query.
+
+ n_neighbors : int, optional (default = None)
+ Number of neighbors required. If not provided, this will
+ return the number specified at the initialization.
+
+ return_distance : boolean, optional (default = False)
+ Returns the distances of neighbors if set to True.
+ """
@jnothman
jnothman Oct 23, 2014 Member

This method's return value is undocumented.

@jnothman jnothman commented on an outdated diff Oct 23, 2014
sklearn/neighbors/approximate.py
+ """
+ Returns the approximated nearest neighbors within the radius
+
+ Parameters
+ ----------
+ X : array_like, shape (n_samples, n_features)
+ List of n_features-dimensional data points. Each row
+ corresponds to a single query.
+
+ radius : float
+ Limiting distance of neighbors to return.
+ (default is the value passed to the constructor).
+
+ return_distance : boolean, optional (default = False)
+ Returns the distances of neighbors if set to True.
+ """
@jnothman
jnothman Oct 23, 2014 Member

This method's return value is undocumented.

@jnothman
jnothman Oct 23, 2014 Member

I also note that the RadiusNeighborsMixin's radius_neighbors is incorrectly documented and should probably be fixed: inds is an array of arrays (and IMO should be a list of arrays) and that structure should be described more precisely.

@jnothman jnothman commented on an outdated diff Oct 23, 2014
sklearn/neighbors/approximate.py
+ binary hashed value.
+ """
+ projections = (np.dot(self.hash_functions_[tree_n],
+ y) > 0).astype(int)
+
+ return np.packbits(projections).view(dtype='>u4')[0]
+
+ def fit(self, X):
+ """Fit the LSH forest on the data.
+
+ Parameters
+ ----------
+ X : array_like, shape (n_samples, n_features)
+ List of n_features-dimensional data points. Each row
+ corresponds to a single data point.
+ """
@jnothman
jnothman Oct 23, 2014 Member

This method's return value is undocumented, albeit trivial.

@jnothman jnothman commented on the diff Oct 23, 2014
sklearn/neighbors/approximate.py
+ n_neighbors = self.n_neighbors
+
+ X = check_array(X)
+
+ neighbors, distances = [], []
+ for i in range(X.shape[0]):
+ neighs, dists = self._query(X[i], n_neighbors)
+ neighbors.append(neighs)
+ distances.append(dists)
+
+ if return_distance:
+ return np.array(distances), np.array(neighbors)
+ else:
+ return np.array(neighbors)
+
+ def radius_neighbors(self, X, radius=None, return_distance=True):
@jnothman
jnothman Oct 23, 2014 Member

Is there a reason we are not using the KNeighborsMixin and RadiusNeighborsMixin to provide the _graph methods?

@jnothman jnothman commented on the diff Oct 23, 2014
sklearn/neighbors/approximate.py
+ if radius is None:
+ radius = self.radius
+
+ X = check_array(X)
+
+ neighbors, distances = [], []
+ for i in range(X.shape[0]):
+ neighs, dists = self._query(X[i], radius=radius,
+ is_radius=True)
+ neighbors.append(neighs)
+ distances.append(dists)
+
+ if return_distance:
+ return np.array(distances), np.array(neighbors)
+ else:
+ return np.array(neighbors)
@jnothman
jnothman Oct 23, 2014 Member

calling np.array on this return value means that if the data happens to produce an equal number of neighbors for each query item, it will return a 2d array, while otherwise it will return an array of arrays. Either we should return a list of arrays, or for consistency with RadiusNeighborsMixin, we should return an array of arrays, but in both radius_neighbors implementations we need to ensure (i.e. fix and ideally test) that we are not returning 2d arrays. The fix can use something like:

def _array_of_arrays(list_of_arrays):
    out = np.empty(len(list_of_arrays), dtype=object)
    out[:] = list_of_arrays
    return out

which will keep things safe.

@coveralls

Coverage Status

Changes Unknown when pulling b877b98 on maheshakya:lsh_forest into on scikit-learn:master.

@jnothman

But now you need at least a smoke test to make sure those graph methods run without error.

@jnothman

Perhaps this should be added to the RadiusNeighborsMixin as it really should be used there too...

@maheshakya
Contributor

Is there anything that needs to be done in this PR before merging?

@ogrisel ogrisel commented on an outdated diff Oct 29, 2014
benchmarks/bench_plot_approximate_neighbors.py
+ nbrs = NearestNeighbors(algorithm='brute', metric='cosine').fit(X)
+
+ average_time_approx = 0
+ average_time_exact = 0
+ accuracy = 0
+
+ for query in queries:
+ t0 = time()
+ approx_neighbors = lshf.kneighbors(query, n_neighbors=n_neighbors,
+ return_distance=False)
+ T = time() - t0
+ average_time_approx += T
+
+ t0 = time()
+ exact_neighbors = nbrs.kneighbors(query, n_neighbors=n_neighbors,
+ return_distance=False)
@ogrisel
ogrisel Oct 29, 2014 Member

This call is expensive and repeated many times for different values of lshf_params. Could you please refactor this benchmark to only perform the exact queries once for each query and index size and reuse the results to compute the accuracies of the LSHForest models with different lshf_params values?

@ogrisel
Member
ogrisel commented Oct 29, 2014

Please consider merging maheshakya/scikit-learn#8 into this branch. This example should further be improved by collecting standard deviations of times and accuracies so as to add error bars on the plots: there is quite a lot of variation, especially for the accuracies, and this is not reflected in the current plots.

@ogrisel

The concrete exact radius search methods should be updated to leverage this private method as well, otherwise I don't see the point in moving it into the base mixin class.

@ogrisel
Member
ogrisel commented Oct 29, 2014

Is there anything that needs to be done in this PR before merging?

Yes, as @GaelVaroquaux said earlier it is very important to integrate this method into existing estimators that could directly benefit from it.

In my opinion:

  • KNeighborsClassifier
  • RadiusNeighborsClassifier
  • KNeighborsRegressor
  • RadiusNeighborsRegressor

should all support the algorithm="forest_lsh" parameter, as well as adding the n_estimators, n_candidates and random_state hyperparameters with reasonable default values to those classes.

The fit method should properly raise an informative error message when unsupported combinations of hyperparameters are passed (e.g. a non-cosine metric with algorithm="forest_lsh" for KNeighborsClassifier).

@jnothman
Member

Is there anything that needs to be done in this PR before merging?

Yes, as @GaelVaroquaux said earlier it is very important to integrate this method into existing estimators that could directly benefit from it.

IMO, mostly because of the magnitude and time scale of this PR, that is best left to another contribution.

But I have still not given the code a full review, @maheshakya; I haven't looked at the testing. Each time I've stopped after I've discovered a critical number of issues.

@daniel-vainsencher

On Oct 29, 2014 10:56 PM, "jnothman" notifications@github.com wrote:

Is there anything that needs to be done in this PR before merging?

Yes, as @GaelVaroquaux said earlier it is very important to integrate
this method into existing estimators that could directly benefit from it.

IMO, mostly because of the magnitude and time scale of this PR, that is
best left to another contribution.

I agree. In the current mode, we have paused some algorithmic work until
the merge, which is a pity.

But I have still not given the code a full review, @maheshakya; I haven't
looked at the testing. Each time I've stopped after I've discovered a
critical number of issues.

I think it might be useful to do a complete review and distinguish between
must-fix and nice to haves. Then we do the first and some of the latter,
and merge. While I certainly appreciate the feedback and resulting
improvements the current process just feels endless.


@jnothman
Member

I think it might be useful to do a complete review and distinguish between
must-fix and nice to haves. Then we do the first and some of the latter,
and merge. While I certainly appreciate the feedback and resulting
improvements the current process just feels endless.

I am well aware of your concerns, but I've found enough issues with the correctness of the code (and it is generally a complex enough algorithm) to require a fairly high standard of unit testing.

@daniel-vainsencher

I do not see where our last two comments disagree... I fully agree that the reviews so far have been very valuable, leading to improvements in correctness and clarity. I am not calling to stop reviews and just merge. I love calls for specific unit tests; I would have done more of those as a mentor, but am frankly less experienced in this area.

I would like to see two changes in the way we are going about the reviews:

  • An explicit distinction between A. improvements to quality (yes!), B. expansions of scope (left to a future PR) and C. exploratory/subjective changes (probably left for investigation for a future PR).
  • An attempt to be predictably finite: here is a complete review of all the code with issues labelled as types A, B and C. Deal with A and we should merge (of course if we find a critical bug afterwards we won't ignore it).

I realize this PR is a non-trivial amount of code to review, and providing a full review in advance would have been risky earlier in the process: what if we just get discouraged at all the issues and quit?! But I suggest that progress so far defuses that risk, and it is time to go bulk.

Does this plan sound reasonable? if that's just not the way scikit-learn rolls, I understand.

@jnothman
Member

It would be great to go bulk, but it's not a matter of scikit-learn
rolling, but just finding sufficient contiguous available time to make the
review.


@daniel-vainsencher

Then we'll take what we can get of course :)


@ogrisel
Member
ogrisel commented Oct 30, 2014

It would be great to go bulk, but it's not a matter of scikit-learn rolling, but just finding sufficient contiguous available time to make the review.

Same here: I usually spend at most 2 hours in a row when reviewing a PR. For a PR like this that is not sufficient to review it all at once, hence the iterative feedback. I acknowledge that this can be frustrating but I don't have any solution.

I started by reviewing the doc and examples because I think it's good to see a new feature from the user's perspective first. Documentation and examples are one of the primary reasons for the success of scikit-learn: they should be treated as first class citizens.

There are still problems with the doc and examples: please re-read them after a full build (cd doc && make html):

  • the presentation of the narrative doc could still be improved: right now it is hard to follow for people who don't already know what LSH is. The ordering of the narrative doc should be:

    1- introduce what ANN is and what problem it tackles (this is already done)

    2- tell that the canonical method for ANN is called "Locality Sensitive Hashing" (LSH) but that its vanilla implementation has a hyper-parameter that is hard to tune in practice, and that scikit-learn therefore implements a variant called "LSH Forest" with more reasonable hyperparameters. Both methods internally use random hyperplanes to index the samples into buckets, and cosine similarities are actually only computed for samples that collide with the query, hence achieving sublinear scaling. Reference the mathematical formulation section for more details.

    3- quickly introduce the two main hyperparameters of "LSH Forest", n_candidates and n_estimators, present the plots that highlight the impact of n_candidates and n_estimators on accuracy and query time, and give a rule of thumb to set them: fix n_estimators to be large enough (e.g. between 10 and 50) and then vary n_candidates to trade off accuracy for query time.

    4- present the scalability profile (query time vs index size) of the LSH Forest model for fixed hyperparameter values: highlight the break-even point w.r.t. brute force scan and the fact that the accuracy should not be too sensitive w.r.t. the dataset size.

    5- give a mathematical presentation of the vanilla LSH algorithm, highlight that the choice of k (the number of binary digits for an individual hash) is strongly dependent on the dataset size and structure and is therefore hard to tune in practice, and then present the LSH Forest variant (the prefix tree and the 2-stage tree traversal used to answer the queries).

  • the hyperparameter examples need averaging over more iterations: right now the curves are non-monotonic, which is confusing. If it's too costly to run enough iterations to get monotonic or smooth curves, at least add error bars with the std dev for each point to highlight that this is highly stochastic. Error bars are good and should be added to all those plots anyway.

Then I reviewed the benchmarks to check that the scalability profile actually looks like it is supposed to (as the point of ANN is to be able to do sub-linear time queries): there are also problems with the benchmarks/bench_plot_approximate_neighbors.py script: the last refactoring that was meant to remove redundant computation to make the benchmark execution run faster actually completely broke it. Just run it and you will see what I mean. Once this is fixed we should check that the speedup increases significantly with the index size: the scalability of brute force is linear time, while LSH should be log-ish or polynomial with a power < 1 if I understand correctly. So when growing the dataset size linearly, the speedup should improve more than linearly. I am not sure this was the case with the previous iteration of the benchmark. It would be worth checking once the benchmark script is fixed.
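A minimal sketch of the suggested refactoring, using a plain NumPy brute-force cosine search as a stand-in for the NearestNeighbors baseline (the function names and data are illustrative): compute the exact result once per index size, then reuse it to score every hyperparameter setting.

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.randn(200, 16)        # indexed points
queries = rng.randn(10, 16)   # query points

def exact_knn_cosine(X, queries, n_neighbors):
    # Brute-force exact neighbors by cosine similarity (closest first).
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Qn = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    sims = Qn @ Xn.T
    return np.argsort(-sims, axis=1)[:, :n_neighbors]

# Expensive exact queries: run once...
exact = exact_knn_cosine(X, queries, n_neighbors=5)

def accuracy(approx, exact):
    # Fraction of true neighbors recovered, averaged over queries.
    return np.mean([len(np.intersect1d(a, e)) / len(e)
                    for a, e in zip(approx, exact)])

# ...then reuse `exact` when scoring each lshf_params candidate.
print(accuracy(exact, exact))  # 1.0 by construction
```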

Finally, I still think that leveraging LSH for KNN classification / regression should be done in this PR to check that we are not making any fundamental mistake in the design of this tool. However I also agree that integration into KNN classification / regression models should be done once the doc, examples, benchmarks, tests and review of the existing implementation are completed.

I read the paper yesterday and plan to review the implementation next (although I probably won't have time this week).

@ogrisel
Member
ogrisel commented Nov 3, 2014

The ordering of the samples of LSHForest.kneighbors does not match the ordering of the returned values of NearestNeighbors.kneighbors. The closest should be first IMO.
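A sketch of the expected contract, with made-up candidate results: whatever order the candidates are collected in, sorting both arrays by distance before returning puts the closest neighbor first, matching NearestNeighbors.kneighbors.

```python
import numpy as np

# Hypothetical candidate distances/indices as an LSH query might collect
# them, in arbitrary order.
distances = np.array([0.42, 0.05, 0.17])
indices = np.array([7, 2, 9])

order = np.argsort(distances)  # ascending: closest first
distances, indices = distances[order], indices[order]
print(indices)  # [2 9 7]
```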

@maheshakya
Contributor

@ogrisel I've modified the documentation. Can you have a look?

@ogrisel
Member
ogrisel commented Nov 5, 2014

@ogrisel I've modified the documentation. Can you have a look?

Thanks, I find its organization much nicer to new users. I will add some minor comments inline.

@ogrisel ogrisel commented on an outdated diff Nov 5, 2014
doc/modules/neighbors.rst
+
+Approximate Nearest Neighbors
+=============================
+
+There are many efficient exact nearest neighbor search algorithms for low
+dimensions :math:`d` (approximately 50 on modern computers). However these
+algorithms perform poorly with respect to space and query time, with an
+exponential rate in :math:`d`. For large :math:`d`, these algorithms provide
+only an insignificant improvement over the linear time algorithm
+:ref:`brute_force`. That implies that these algorithms are no better than
+comparing the query point to each point in the database in high dimensions.
+This is a well-known phenomenon called “The Curse of Dimensionality”.
+
+There are certain applications where we do not need the exact nearest neighbors
+but having a “good guess” would suffice when the data is in high dimension.
+When answers do not have to be exact, the :class:`LSHForest` implements an
@ogrisel
ogrisel Nov 5, 2014 Member

"the :class:LSHForest implements" => "the :class:LSHForest class implements"

@ogrisel ogrisel commented on the diff Nov 5, 2014
doc/modules/neighbors.rst
+ :math:`p` in the bucket with label
+ :math:`g(p) = (h_1(p), h_2(p), … h_k(p))`. Observe that if
+ each :math:`h_i` outputs one “digit”, each bucket has a k-digit label.
+
+2. Independently perform step 1 :math:`l` times to construct :math:`l`
+ separate estimators, with hash functions :math:`g_1, g_2, … g_l`.
+
+The reason to concatenate hash functions in the step 1 is to decrease the
+probability of the collision of distant points as much as possible. The
+probability drops from :math:`p_2` to :math:`p_2^k` which is negligibly
+small for large :math:`k`. The choice of :math:`k` is strongly dependent on
+the data set size and structure and is therefore hard to tune in practice.
+There is a side effect of having a large :math:`k`: it also has the potential
+of decreasing the chance of nearby points colliding. To address this
+issue, multiple estimators are constructed in step 2.
+
@ogrisel
ogrisel Nov 5, 2014 Member

Before the beginning of the paragraph that explains how LSH Forest works, I would add a couple of sentences that introduces the motivation for LSH Forest variant, for instance:

"""
The requirement to tune :math:`k` for a given dataset makes classical LSH cumbersome to use in practice. The LSH Forest variant was designed to alleviate this requirement by automatically adjusting the number of digits used to hash the samples.
"""

@ogrisel ogrisel commented on an outdated diff Nov 5, 2014
...plot_approximate_nearest_neighbors_hyperparameters.py
+colors = ['c', 'm', 'y']
+p1 = plt.Rectangle((0, 0), 0.1, 0.1, fc=colors[0])
+p2 = plt.Rectangle((0, 0), 0.1, 0.1, fc=colors[1])
+p3 = plt.Rectangle((0, 0), 0.1, 0.1, fc=colors[2])
+
+labels = ['n_estimators = ' + str(n_estimators_for_candidate_value[0]),
+ 'n_estimators = ' + str(n_estimators_for_candidate_value[1]),
+ 'n_estimators = ' + str(n_estimators_for_candidate_value[2])]
+
+plt.figure()
+plt.legend((p1, p2, p3), (labels[0], labels[1], labels[2]),
+ loc='upper left', fontsize='small')
+
+for i in range(len(n_estimators_for_candidate_value)):
+ plt.scatter(n_candidates_values, accuracies_c[i, :], c=colors[i])
+ plt.plot(n_candidates_values, accuracies_c[i, :], c=colors[i])
@ogrisel
ogrisel Nov 5, 2014 Member

I think this can be greatly simplified with:

for i, n_estimators in enumerate(n_estimators_for_candidate_value):
    label = 'n_estimators = %d' % n_estimators
    plt.plot(n_candidates_values, accuracies_c[i, :], 'o-', c=colors[i], label=label)

plt.legend(loc='upper left', fontsize='small')
@ogrisel ogrisel commented on an outdated diff Nov 5, 2014
...plot_approximate_nearest_neighbors_hyperparameters.py
+n_candidates_values = np.linspace(10, 500, 5).astype(np.int)
+n_estimators_for_candidate_value = [1, 5, 10]
+accuracies_c = np.zeros((len(n_estimators_for_candidate_value),
+ n_candidates_values.shape[0]), dtype=float)
+
+nbrs = NearestNeighbors(n_neighbors=1, algorithm='brute',
+ metric='cosine').fit(X_index)
+neighbors_exact = nbrs.kneighbors(X_query, return_distance=False)
+
+# Calculate average accuracy for each value of `n_candidates`
+for j, value in enumerate(n_estimat