[MRG] Locality Sensitive Hashing for approximate nearest neighbor search (GSoC) #3304

Closed
wants to merge 119 commits into scikit-learn:master from maheshakya:lsh_forest
@maheshakya

No description provided.

@daniel-vainsencher

Maheshakya, did you mention using arrays of integers as the output of do_hash? It seems like the current version should produce strings, right?
Also, random projection is generally used to mean simply multiplying by a random matrix. When you add a reference for this kind of hash, it might give a more specific name for this LSH family.

Converting hashes to integers does not show any increase in performance, in speed or in memory consumption (I have another version which does that). It only adds complexity in the string-to-integer conversion and the other way around. So I thought that using integer arrays would be less useful.
But I have used numpy arrays of strings in the random projections.

It won't be a problem when using other hashing algorithms, since the do_hash function is defined to perform hashing for that particular LSH family.

Actually the link in the reference points to the random projection section of the LSH article on Wikipedia. I need to change the description of the reference. :)

(np.array([-2,3,4,-5]) > 0).astype('int') seems pretty simple to me. I do not expect the ints to be faster than booleans (and I do not expect this to be a bottleneck, so speed is not a particularly important criterion here). On the other hand, using hash families that return more than two results might be useful in the future. Of course, if you want to restrict the code to binary hashes for now, that's fine, but it needs documentation in the right place.

Where are there numpy arrays of strings, and what do you mean by that?

Only a single string (a single binary hash) is returned from the do_hash function, which is for a single data point in a single tree. The tree (which is in the LSH forest implementation) stores these strings in a numpy array with dtype='|S2' (S2 denotes a fixed-width string of length 2).
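For illustration, a toy example of such a fixed-width string array (numpy under Python 2, as used at the time of this PR):

import numpy as np

hashes = np.array(['01', '10', '11'])  # fixed-width binary hash strings
print(hashes.dtype)                    # |S2: byte strings of length 2
print(hashes[0][:1])                   # slicing extracts the MSBs: '0'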

@daniel-vainsencher

For elegance, please try to express your _bisect_right(a,x) in terms of searchsorted instead of reimplementing.

sklearn/neighbors/lsh_forest.py
((140 lines not shown))
+ [3, 2, 1],
+ [4, 3, 2]]), array([[ 0. , 3.15525015, 9.54018168],
+ [ 0. , 3.15525015, 6.38493153],
+ [ 0. , 6.38493153, 9.54018168],
+ [ 0. , 12.92048135, 19.30541288],
+ [ 0. , 26.1457523 , 39.06623365]]))
+
+ """
+
+ def __init__(self, max_label_length=32, n_trees=10,
+ hashing_algorithm='random_projections',
+ c=50, n_neighbors=1, lower_bound=4, seed=1):
+ self.max_label_length = max_label_length
+ self.n_trees = n_trees
+ self.hashing_algorithm = hashing_algorithm
+ self.random_state = np.random.RandomState(seed)
@jnothman Owner

Please store all arguments to the constructor just as they are passed in. Please see other uses of check_random_state and where it is called.

Done.

@daniel-vainsencher

Have you considered the performance implications of the new version of _bisect_right(a,x)? transforming "a" seems expensive... OTOH, maybe you can transform x a little bit. I like to think of hashes as binary fixed point representations (0.010110001...) of numbers in [0,1], instead of the more concrete list of {0,1}.

I found that using numpy searchsorted is much slower than the previous method. So I changed it back.

The reason I thought it should be kept as binary strings is that we cannot perform slicing operations on integers (or any numerical type). Slicing is required to extract the MSBs from the binary hash. If a decimal like 0.001110... or an array of integers is used, the MSB extraction becomes more complex and more time consuming.

I have added some explanation in the doc string.

Are you sure it is searchsorted that is slower?
What your version did is to extract the MSBs of the whole index (a); if it creates a copy, that would access much more memory than the whole binary search. If that slice doesn't create a copy, it might still have inefficient access patterns.
I am saying that instead you can transform the query x so that searchsorted gives the correct reply without having to extract the MSBs from a at all (instead of searching to 0.0101 (ignoring further digits), just search to 0.01100000...). Then using either searchsorted or bisect_* from the library code allows you to delete some code. And if you move to the integer representation, that will also be blazingly fast and allow for more compact indices.

No, searchsorted is not what makes it slow. It's the slicing operations performed on the elements of a.

To transform x: the required MSBs of x can be extracted and the remaining part replaced by 1s. This transformation will produce the same results as before when used with searchsorted(x, side='right'), so there will be no slicing on a. What do you think about this?

I have an implementation that works with integers (I think you have seen it already):
lsh_forest_without_hash_bit_strings.py

But it didn't show much improvement in speed.

Paragraphs 1-2: sounds good. My expectation of the result is: less code overall. A significant speed improvement for small c. Smaller indices allow more trees, for better precision. Of course, for large c, time is still dominated by the candidate review stage, that is another issue.

As to paragraphs 3-4: I'm looking at the code under that link, it seems to be incomplete and old...
1. Uses the hash inside the loop in query (line 170), a big bottleneck you've already fixed elsewhere
2. Uses bisect_left and a slow/strange bisect_right, instead of the well optimized np.searchsorted, so no speed improvement is expected. Besides, line 19 seems to treat x as a string/array, rather than an int. If this works at all, it's a lot of luck.
3. Lines 33-34 use inconsistent types; 34 looks like it searches for a string in an array of ints.

So for an informative comparison, you must combine the integer representation with all the work you'd done before on the current representation. I suggest the following approach:

  1. Start from your best tested implementation
  2. Change to the new approach (transform x instead of changing the binary sort) on the string representation
  3. Drop the custom binary search in favor of searchsorted (pause to test that the results haven't changed much yet, and certainly shouldn't get worse)
  4. Create a branch in which only the representation is changed (string->int).
  5. Benchmark.

@daniel-vainsencher, I used that transformation on x, so that only the query has to be transformed before the binary search happens. This indeed made a significant improvement in query speed.

I have also created another branch with entire integer representation of hashes:
lsh_forest and lshashing

For the _bisect_right operation, I had to convert the query integer back to binary, turn the last bits (the bits after the hash length) to '1', and convert it back to an integer. This makes the code a little more complex and maybe a bit less readable as well (see the _find_matching_indices function in the new branch).
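A minimal sketch of that trick done purely with integer arithmetic, avoiding the string round trip (hypothetical helper name; assumes a fixed hash width w):

def fill_low_bits(query_int, h, w=32):
    # Set the (w - h) least significant bits to 1, keeping the h-bit prefix,
    # so a binary search with side='right' lands just past the last match.
    return query_int | ((1 << (w - h)) - 1)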

But these two versions are not very different in query speed. Memory consumption in the latter version is slightly better than in the other one.

Yes, it's compared to transforming the index.

There is no bug in that implementation. Let me explain with an example:
Suppose you have a sorted array of strings:

gona = np.array(['000001', '001010', '001101', '001110','001111', '100001', '101000', '101010'])

These strings have a fixed length of 6. Suppose there's another string query = '001100'. Say the considered hash length is h = 4. Then we want to get the indices of gona where the first h bits of those entries match the first h bits of query. This means we want the indices of gona where the MSBs are '0011'. Say x = query[:h], which means x = '0011'.

Now we'll see how numpy searchsorted works. This function has two behaviors for parameter side = 'left' and side = 'right'. See numpy.searchsorted

  1. We can use side = 'left' directly to find the leftmost index. In the string representation '0011' is less than '001100' but greater than '000111', so we can use this behavior directly to find the leftmost index where a string starts with '0011'. numpy.searchsorted(gona, x, side='left') will return the position where '0011' fits in the array, which is the leftmost index where a string starts with '0011'. This operation will return 2.

  2. To find the rightmost index, we cannot use side = 'right' directly as before. As the documentation of searchsorted explains, side = 'right' returns the last suitable index, so the values returned for 'left' and 'right' differ only when the array contains duplicates of the searched value. Passing '001100' or '0011' directly into searchsorted with side = 'right' will therefore not do what we actually want, which is to find the rightmost index where the first 4 bits equal '0011'. We can accomplish this by transforming query: if we set the last two bits to '1's, '001100' becomes '001111'. Now '001111' is guaranteed to be greater than or equal to (equal to '001111' itself) any string which starts with '0011'. Using searchsorted with side = 'right' makes sure it covers even a '001111' in the array (a duplicate of the transformed query). So numpy.searchsorted(gona, query[:h] + '11', side='right') will return 5, one past the rightmost index where a string starts with '0011'. (My custom implementation of bisect_right also did the same thing.)

So numpy.arange(2, 5) will give [2, 3, 4], the array of indices where strings start with '0011'.

This will return the correct candidates we are looking for. So it is not a bug. The same concept applies to integer representation as well.
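A runnable version of the example above (nothing assumed beyond numpy):

import numpy as np

gona = np.array(['000001', '001010', '001101', '001110',
                 '001111', '100001', '101000', '101010'])
query, h = '001100', 4

left = np.searchsorted(gona, query[:h], side='left')      # 2
transformed = query[:h] + '1' * (len(query) - h)          # '001111'
right = np.searchsorted(gona, transformed, side='right')  # 5
print(np.arange(left, right))                             # [2 3 4]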

I'll send you the results of line_profiler on these two implementations. About the memory consumption: yes, I need to include the raw data (input_array) in this data structure, as it is required to calculate the actual distances of candidates. That is the reason for not seeing improvements in memory consumption: the space consumed by the raw data is the dominating factor.

It seems my last comment has not been posted here. :(
Anyway I sent you an email.

@maheshakya maheshakya Corrected usage of random_state parameter.
Replaced numpy searchsorted  in _bisect_right with the
previous version.
08ab73a
@maheshakya

@jnothman, I'm adding an insert operation to this data structure. I suppose that could help in incremental learning.

maheshakya added some commits
@maheshakya maheshakya Added insert operation into LSH Forest.
Insert operation allows to insert new data points into the fitted set of trees.
(Can be used in incremental learning? )

Changed parameter m to n_neighbors.

Changed parameter m to n_neighbors.
c94cb5d
@maheshakya maheshakya Changed parameter m to n_neighbors. 802ed5f
@coveralls

Coverage Status

Coverage decreased (-0.06%) when pulling 802ed5f on maheshakya:lsh_forest into aaefdbd on scikit-learn:master.

@jnothman
Owner

Did I say something about incremental learning?

@maheshakya

Yes.
Check issue #3175

@jnothman
Owner
sklearn/feature_extraction/lshashing.py
((35 lines not shown))
+ def __init__(self, n_dim=None, hash_size=None, random_state=1):
+ if n_dim is None or hash_size is None:
+ raise ValueError("n_dim or hash_size cannot be None.")
+
+ self.n_dim = n_dim
+ self.random_state = random_state
+ self.hash_size = hash_size
+
+ def generate_hash_function(self):
+ """Generates a hash function"""
+
+ def do_hash(self, input_point=None, hash_function=None):
+ """Performs hashing on the input_point with hash_function"""
+
+
+class RandomProjections(BaseHash):
@arjoly Owner
arjoly added a note

You might want to have a look at the random projection module.

@arjoly Owner
arjoly added a note

A random sign projection transformer could have its place in that module.

sklearn/feature_extraction/lshashing.py
@@ -0,0 +1,81 @@
+"""
+Locality Sensitive Hashing Algorithms
+-------------------------------------
+"""
+# Author: Maheshakya Wijewardena <maheshakya.10@cse.mrt.ac.lk>
+
+import numpy as np
+from abc import ABCMeta, abstractmethod
+from ..externals.six import with_metaclass
+from ..utils import check_random_state
+
+__all__ = ["RandomProjections"]
+
+
+class BaseHash(with_metaclass(ABCMeta)):
@arjoly Owner
arjoly added a note

This looks like a duplicate of BaseRandomProjection.

I already saw this. But I think this implementation of random projections does not cohere with the context of locality sensitive hashing algorithms. The main goal of this version is dimensionality reduction, but in LSH that is not the only goal. See Locality-sensitive hashing. Other than that, there are many other LSH families as well (which I need to implement later in the project).
To put it simply, what we want here is much simpler than BaseRandomProjection (for random projection as an LSH algorithm): to get the dot product of all vectors in the data set with a fixed random vector.
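A minimal sketch of that idea, random sign projection as an LSH hash (illustrative names, not this PR's API):

import numpy as np

rng = np.random.RandomState(42)
n_dim, hash_size = 50, 32
hyperplanes = rng.randn(hash_size, n_dim)  # one fixed random vector per hash bit

def hash_point(x):
    # The sign of each dot product gives one bit of the binary hash.
    projections = np.dot(hyperplanes, x)
    return "".join('1' if p > 0 else '0' for p in projections)

print(hash_point(rng.randn(n_dim)))  # e.g. '0110...' (32 characters)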

sklearn/feature_extraction/lshashing.py
((66 lines not shown))
+ """
+ Generates hyperplanes of shape (hash_size, n_dim) from standard
+ normal distribution.
+ """
+ random_state = check_random_state(self.random_state)
+ return random_state.randn(self.hash_size, self.n_dim)
+
+ def do_hash(self, input_point=None, hash_function=None):
+ """
+ Does hashing on the data point with the provided hash_function.
+ """
+ if input_point is None or hash_function is None:
+ raise ValueError("input_point or hash_function cannot be None.")
+
+ projections = np.dot(hash_function, input_point)
+ return "".join(['1' if i > 0 else '0' for i in projections])
@arjoly Owner
arjoly added a note

(1 + np.sign(x)) / 2?

Yes. It is useful to convert the array into binary values before doing a conversion to string. But I found np.array(x > 0, dtype=int) faster than this.
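A quick check that the two conversions agree once cast to int (both map non-positive projections to 0):

import numpy as np

projections = np.array([-2.0, 3.0, 4.0, -5.0, 0.0])
a = np.array(projections > 0, dtype=int)           # array([0, 1, 1, 0, 0])
b = ((1 + np.sign(projections)) / 2).astype(int)   # same values
assert (a == b).all()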

@robertlayton robertlayton self-assigned this
@robertlayton robertlayton removed their assignment
@robertlayton robertlayton self-assigned this
sklearn/neighbors/lsh_forest.py
((218 lines not shown))
+ self._hash_generator = self._select_hashing_algorithm(
+ n_dim, self.max_label_length)
+
+ # Creates a g(p,x) for each tree
+ self.hash_functions_ = []
+ self._trees = []
+ self._original_indices = []
+ for i in range(self.n_trees):
+ # This is g(p,x) for a particular tree.
+ hash_function = self._hash_generator.generate_hash_function()
+ original_index, bin_hashes = self._create_tree(hash_function)
+ self._original_indices.append(original_index)
+ self._trees.append(bin_hashes)
+ self.hash_functions_.append(hash_function)
+
+ self.hash_functions_ = np.array(self.hash_functions_)
@robertlayton Owner

Try this first: http://docs.scipy.org/doc/numpy/reference/generated/numpy.append.html
As a second optimisation, consider how it might be possible to compute all the trees (and so on) in one numpy operation etc, to get rid of the previous loop. Dot products are your friend!

Yes, if all the hash functions are computed in advance, I think it's possible to get rid of the loop. I'll give it a try.

maheshakya added some commits
@maheshakya maheshakya Added a transformation to the binary query.
For _bisect_right() function, a transformed x is passed. The transformation
will replace the characters after h hash length with '1's.

Used random_projections module.
GuassianRandomProjections in random_projections module is used to perform the
hashing for Random projections LSH method.
3cc2733
@maheshakya maheshakya Used random_projections module.
GuassianRandomProjections in random_projections module is used to perform the
hashing for Random projections LSH method.
d8e521b
@coveralls

Coverage Status

Coverage decreased (-0.06%) when pulling d8e521b on maheshakya:lsh_forest into 82611e8 on scikit-learn:master.

maheshakya added some commits
@maheshakya maheshakya A minor change in lshashing a9b49bb
@maheshakya maheshakya Gaussian Random Projection is used in the LSHForest class.
Removed lshashinng in feature extraction and add that funtionality in
the LSHForest class. If other hashing algorithms are to be implemented,
a separate lshashing class may be required.
eb4852d
@maheshakya maheshakya Remove Random projection from feature extraction _init 57d9412
@coveralls

Coverage Status

Coverage decreased (-0.05%) when pulling 57d9412 on maheshakya:lsh_forest into b65e4c8 on scikit-learn:master.

maheshakya added some commits
@maheshakya maheshakya Converted to integer representation. b19ee99
@maheshakya maheshakya Updated example a7a5788
@maheshakya maheshakya Added accuracy tests for c and n_trees variation.
6a0366a
@coveralls

Coverage Status

Coverage decreased (-0.17%) when pulling 6a0366a on maheshakya:lsh_forest into b65e4c8 on scikit-learn:master.

@maheshakya maheshakya Updated cache size to type int.
dcbe656
@coveralls

Coverage Status

Coverage decreased (-0.17%) when pulling dcbe656 on maheshakya:lsh_forest into b65e4c8 on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling 9f3a575 on maheshakya:lsh_forest into 8dab222 on scikit-learn:master.

sklearn/neighbors/lsh_forest.py
((3 lines not shown))
+-------------------------------------------------------------------------
+"""
+# Author: Maheshakya Wijewardena <maheshakya.10@cse.mrt.ac.lk>
+
+import numpy as np
+import itertools
+from ..base import BaseEstimator
+from ..utils.validation import safe_asarray
+from ..utils import check_random_state
+
+from ..random_projection import GaussianRandomProjection
+
+__all__ = ["LSHForest"]
+
+
+def _find_matching_indices(sorted_array, item, left_mask, right_mask):
@robertlayton Owner

These functions would be good candidates to move to cython. The speed up will probably be quite drastic with little extra coding needed.

Only numpy.searchsorted (twice) and numpy.arange (once) are called in the _find_matching_indices method, so I don't think using Cython for this function will make a significant improvement. I tried Cython for the string hashes we had earlier, but it did not make any improvement.
But I can try Cython on the _find_longest_prefix_match method.

I think the priorities now should be making the code easy to merge and maintain, and removing serious problems (like an insert that takes O(nm) instead of O(n+m)).

I would specifically avoid micro optimizations: it takes some time to find where to optimize, try it out at different parameter values etc... and I'd guess they are relatively easy to add after GSOC if we want. Right?

@larsmans Owner
larsmans added a note

Hear hear. Get it working, get it asymptotically optimal, then bum it.

@ogrisel Owner
ogrisel added a note

+1

@ogrisel Owner
ogrisel added a note

It would still be interesting to report the output of line_profiler run on the main functions / method involved in this estimator. I mean as a comment to this PR.

sklearn/neighbors/lsh_forest.py
((143 lines not shown))
+ [1, 0, 2],
+ [2, 1, 3],
+ [3, 2, 4],
+ [4, 3, 5]]), array([[ 0. , 0.52344831, 1.08434102],
+ [ 0. , 0.52344831, 0.56089272],
+ [ 0. , 0.56089272, 0.60101568],
+ [ 0. , 0.60101568, 0.6440088 ],
+ [ 0. , 0.6440088 , 0.6900774 ]]))
+
+ """
+
+ def __init__(self, max_label_length=32, n_trees=10,
+ radius=1.0, c=50, n_neighbors=1,
+ lower_bound=4, radius_cutoff_ratio=.9,
+ random_state=None):
+ self.max_label_length = int(max_label_length/2*2)
@robertlayton Owner

What is going on with this line? Isn't int() enough?

In order to cache the hashes, the hash length should be an even number, as the cache is divided into two parts (storing the cached hash as one is very memory consuming).
Is there any other way to make sure the max_label_length used is even?

There is no way to understand your intent from this code. Extract the code to a meaningfully named function, or use a comment.

In terms of implementation itself, it is not as obviously correct as it could be. If max_label_length somehow starts as a float, your code will fail, where int(max_label_length/2)*2 would obviously succeed.

But I think I've mentioned that in my opinion, this data structure has no right to exist unless max_label_length is exactly 32 or 64 (and just 32 is probably sufficient for the moment); there are too many low-level inefficiencies these particular cases avoid. That said, if your parameters invite max_label_length=23, you had better be testing that it actually works...

sklearn/neighbors/lsh_forest.py
((183 lines not shown))
+ This creates a binary hash by getting the dot product of
+ input_point and hash_function then transforming the projection
+ into a binary string array based on the sign(positive/negative)
+ of the projection.
+
+ Parameters
+ ----------
+
+ input_array: array_like, shape (n_samples, n_features)
+ A matrix of dimensions (n_samples, n_features), which is being
+ hashed.
+ """
+ if input_array is None:
+ raise ValueError("input_array cannot be None.")
+
+ grp = self._generate_hash_function()
@robertlayton Owner

If the function is only "do_hash", then shouldn't this line be in fit instead? i.e. I would expect a function called do_hash to be callable multiple times, during both fit and transform.

You can get around needing to generated grp here, but just setting it as a class attribute.

Done. Each tree needs a separate hash function, so the functionality of do_hash can be put into the _create_tree function.

sklearn/neighbors/lsh_forest.py
((202 lines not shown))
+
+ def _create_tree(self):
+ """
+ Builds a single tree (in this case creates a sorted array of
+ binary hashes).
+ """
+ hashes, hash_function = self._do_hash(self._input_array)
+ binary_hashes = []
+ for i in range(hashes.shape[0]):
+ xx = tuple(hashes[i])
+ binary_hashes.append(self.cache[xx[:self.cache_N]] * self.k
+ + self.cache[xx[self.cache_N:]])
+
+ return np.argsort(binary_hashes), np.sort(binary_hashes), hash_function
+
+ def _compute_distances(self, query, candidates):
@robertlayton Owner

Linking back to what I said about euclidean distance computation, look at where this function is called and consider if you can't wrap those up into a matrix

Euclidean distances are computed for an array of candidates (which is a 1D array). In the radius_neighbors method, this has to be computed each time max_depth is decreased, so it is calculated multiple times. But it cannot be wrapped into a single matrix because the size of the candidate array differs for each max_depth, and the ratio of radius neighbors to total candidates has to be checked at each max_depth value to ensure that it is within the given range.

@arjoly Owner
arjoly added a note

I would make this method a function.

sklearn/neighbors/lsh_forest.py
((213 lines not shown))
+ + self.cache[xx[self.cache_N:]])
+
+ return np.argsort(binary_hashes), np.sort(binary_hashes), hash_function
+
+ def _compute_distances(self, query, candidates):
+ distances = _simple_euclidean_distance(
+ query, self._input_array[candidates])
+ return np.argsort(distances), np.sort(distances)
+
+ def _generate_masks(self):
+ """
+ Creates left and right masks for all hash lengths
+ """
+ self._left_mask, self._right_mask = [], []
+
+ for length in range(self.max_label_length+1):
@robertlayton Owner

It would be well worth looking into replacing these with bitwise functions.
That said, it might be worth flagging that for future work.

These masks are created to perform bitwise operations on hashed data points, so the masks themselves cannot be created with bitwise operations. In general (for a 32-bit hash) there will be only 66 masks, so the time consumed by this operation is insignificant, isn't it?

The bitwise + numeric operations Python provides certainly suffice (using things like a | (1 << pos)), so you do not have to use strings. Good catch that this is not a performance bottleneck! That's an important instinct (Amdahl's law is the general version). But if Robert meant that the bitwise solutions are more elegant than going through strings, I tend to agree it might be worth doing when you have spare time.
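For example, a hypothetical bit-arithmetic version of the mask generation, equivalent to the string-concatenation loop shown below (w is the hash width):

import numpy as np

def generate_masks(w=32):
    left_masks, right_masks = [], []
    for length in range(w + 1):
        right = (1 << (w - length)) - 1   # (w - length) low bits set to 1
        left = ((1 << w) - 1) ^ right     # the complementary high prefix of 1s
        left_masks.append(left)
        right_masks.append(right)
    return np.array(left_masks), np.array(right_masks)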

As the cached hashes are generated in the fit function before _generate_masks is called, they could be used in the mask generation process too. So I used those instead of doing the string conversion; that way the little overhead we might have had here is also eliminated.

sklearn/neighbors/lsh_forest.py
((228 lines not shown))
+ for length in range(self.max_label_length+1):
+ left_mask = int("".join(['1' for i in range(length)])
+ + "".join(['0' for i in
+ range(self.max_label_length-length)]),
+ 2)
+ self._left_mask.append(left_mask)
+ right_mask = int("".join(['0' for i in range(length)])
+ + "".join(['1' for i in
+ range(self.max_label_length-length)]),
+ 2)
+ self._right_mask.append(right_mask)
+
+ self._left_mask = np.array(self._left_mask)
+ self._right_mask = np.array(self._right_mask)
+
+ def _get_candidates(self, query, max_depth, bin_queries, m):
@robertlayton Owner

Please docstring all functions, even private ones

Done.

@arjoly Owner
arjoly added a note

Why does it need to be a method if it is called only once?

@arjoly Owner
arjoly added a note

What is m? To follow scikit-learn convention, we should try to use self-explanatory variable names.

sklearn/neighbors/tests/test_lsh_forest.py
((8 lines not shown))
+import numpy as np
+
+from sklearn.utils.testing import assert_array_equal
+from sklearn.utils.testing import assert_equal
+from sklearn.utils.testing import assert_raises
+from sklearn.utils.testing import assert_array_less
+from sklearn.utils.testing import assert_greater
+
+from sklearn.metrics import euclidean_distances
+from sklearn.neighbors import LSHForest
+
+
+def test_neighbors_accuracy_with_c():
+ """Accuracy increases as `c` increases."""
+ c_values = np.array([10, 50, 250])
+ samples = 1000
@robertlayton Owner

For unit testing, try to keep these values as small as possible while still testing your intent. Remember that these tests will be run lots of times, so short run times are preferable.

Longer tests could possibly go in the examples/ folder.

I reduced the sample size. I will add some examples.

sklearn/neighbors/tests/test_lsh_forest.py
((39 lines not shown))
+
+ intersection = np.intersect1d(ranks, neighbors).shape[0]
+ ratio = intersection/float(n_points)
+ accuracies[i] = accuracies[i] + ratio
+
+ accuracies[i] = accuracies[i]/float(n_iter)
+
+ # Sorted accuracies should be equal to original accuracies
+ assert_array_equal(accuracies, np.sort(accuracies),
+ err_msg="Accuracies are not non-decreasing.")
+
+
+def test_neighbors_accuracy_with_n_trees():
+ """Accuracy increases as `n_trees` increases."""
+ n_trees = np.array([1, 10, 100])
+ samples = 1000
@robertlayton Owner

My comment about unit test times applies all through this file

sklearn/neighbors/tests/test_lsh_forest.py
((121 lines not shown))
+ dim = 50
+ n_iter = 100
+ X = np.random.rand(samples, dim)
+
+ lshf = LSHForest()
+ # Test unfitted estimator
+ assert_raises(ValueError, lshf.radius_neighbors, X[0])
+
+ lshf.fit(X)
+
+ for i in range(n_iter):
+ point = X[np.random.randint(0, samples)]
+ mean_dist = np.mean(euclidean_distances(point, X))
+ neighbors = lshf.radius_neighbors(point, radius=mean_dist)
+ # At least one neighbor should be returned.
+ assert_greater(neighbors.shape[1], 0)
@robertlayton Owner

It would be good to test for correctness here -- possibly generate some static queries and see if they match as expected.

All radius neighbors should be within the given radius, so that test has been added.

@robertlayton Owner

That's better, but correctness would be "I precomputed this value to equal 3.14, and the test asserts I still get that figure". This is a good way to check that your `random_state` is set correctly.
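A hypothetical sketch of such a static check: fix random_state, record the output of one trusted run, and assert it never changes (the expected values here are placeholders, not real figures):

import numpy as np
from sklearn.neighbors import LSHForest  # as named in this PR

X = np.random.RandomState(0).rand(20, 5)
lshf = LSHForest(random_state=42).fit(X)
neighbors = lshf.kneighbors(X[:1], n_neighbors=3, return_distance=False)
# expected = np.array([[...]])  # hard-coded from the trusted run
# assert_array_equal(neighbors, expected)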

maheshakya added some commits
@maheshakya maheshakya Removed _do_hash function.
The functionality of _do_hash is done within the _create_tree function as
a hashing function is generated for each tree and it doesn't need a separate
function.
b514130
@maheshakya maheshakya Updated radius_neighbors test.
802170e
@maheshakya maheshakya Added _convert_to_hash function.
This will be used in the queries and insertions to convert the data point
into the integer represented by the binary hash.
4c485a9
@maheshakya maheshakya Updated _generate_masks to work with cached hashes. 6e8b6a8
@maheshakya maheshakya Changed the insert function to accept a batch of data points (a 2D array).

Now trees and original indices are stored in lists. Tests for insert and fit
functions have been updated.
e183c21
@daniel-vainsencher

This is what list comprehensions are for.

Done. But both methods do the same thing (iterating using a for loop).

I proposed this change because the list comprehension version is easier to understand, more obviously correct, and shorter. What did you mean?

Nothing important actually. I just meant that using a list comprehension doesn't have extra advantages in terms of performance.

So first, it's not all about speed (I know you know ;) )

Second, you're wrong:
In [50]: b = []
In [56]: %timeit for i in xrange(100000): b.append(i+1)
1 loops, best of 3: 9.72 ms per loop
In [57]: %timeit [i+1 for i in xrange(100000)]
100 loops, best of 3: 4.91 ms per loop

Of course the advantage is pretty tiny, and visible here only because + is so cheap... the point is that assuming is dangerous.

Okay. Point understood :)

@daniel-vainsencher

"Cost is proportional to new total size, so additions should be batched."

Added to the doc string.

@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling 06ce72a on maheshakya:lsh_forest into 1b2833a on scikit-learn:master.

@robertlayton

OK, some general comments:

  • Your tests should include a working example in the Neighbours framework (i.e. create a classifier and test it can be called). This checks that the API is correct.
  • You did reduce your testing size, but can you reduce even more? i.e. you are only testing for accuracy, is there any damage to having samples=5 and dim=2?
  • LSHForest.fit(self, X=None) has X=None as a default value, but this fails (as it should). Don't let X=None, make it required.
  • Please create a usage example (see the examples/ folder) based on your previous blog posts. See this example, which shows something similar to what you might want to show (although you may want to show speed improvement too). Examples can take (much) longer than tests to run, so feel free to use a larger dataset.
  • Narrative documentation! see doc/modules

At this point, I'd be happy to go to the scikit-learn devs to get further comments.

@maheshakya
  • Did you mean to create a classifier like KNeighborsClassifier? If so, it's going to be out of the scope of GSoC, because it is another application of LSH-ANN. The API can be checked with clustering algorithms (which are in the scope of the project) too, right?
  • Reducing the dimension below 32 causes a warning from GaussianRandomProjection, as we are using a fixed hash length of 32.

I will do the rest.

BTW, is it okay to add the application of LSH in clustering also in this PR? Then examples and documentation can be done for both as planned in the schedule.

@jnothman
Owner

Did you mean that create a classifier like KNeighborsClassifier? If so, it's going to be out of the scope of GSoC because it is another application of LSH-ANN.

I assumed this was the primary use of LSH-ANN in the context of scikit-learn. LSH-ANN isn't in itself machine learning.

sklearn/neighbors/lsh_forest.py
((72 lines not shown))
+ """
+ Performs approximate nearest neighbor search using LSH forest.
+
+ LSH Forest: Locality Sensitive Hashing forest [1] is an alternative
+ method for vanilla approximate nearest neighbor search methods.
+ LSH forest data structure has been implemented using sorted
+ arrays and binary search. 32 bit fixed length hashes are used in
+ this implementation.
+
+ Parameters
+ ----------
+
+ n_trees: int, optional (default = 10)
+ Number of trees in the LSH Forest.
+
+ c: int, optional(default = 10)
@jnothman Owner
jnothman added a note

please place a space between 'optional' and '(' here and below

sklearn/neighbors/lsh_forest.py
((73 lines not shown))
+ Performs approximate nearest neighbor search using LSH forest.
+
+ LSH Forest: Locality Sensitive Hashing forest [1] is an alternative
+ method for vanilla approximate nearest neighbor search methods.
+ LSH forest data structure has been implemented using sorted
+ arrays and binary search. 32 bit fixed length hashes are used in
+ this implementation.
+
+ Parameters
+ ----------
+
+ n_trees: int, optional (default = 10)
+ Number of trees in the LSH Forest.
+
+ c: int, optional(default = 10)
+ Threshold value to select candidates for nearest neighbors.
@jnothman Owner
jnothman added a note

threshold of what value? I think perhaps "threshold" should be avoided here.

sklearn/neighbors/lsh_forest.py
((81 lines not shown))
+ Parameters
+ ----------
+
+ n_trees: int, optional (default = 10)
+ Number of trees in the LSH Forest.
+
+ c: int, optional(default = 10)
+ Threshold value to select candidates for nearest neighbors.
+ Number of candidates is often greater than c*n_trees(unless
+ restricted by lower_bound)
+
+ n_neighbors: int, optional(default = 1)
+ Number of neighbors to be returned from query funcitond when
+ it is not provided with the query.
+
+ lower_bound: int, optional(defualt = 4)
@jnothman Owner
jnothman added a note

This is a strange documentation convention, where both "optional" and "default" are included. This is a case where I find "optional" confusing: does it mean that the lower bound itself is optional?

sklearn/neighbors/lsh_forest.py
((78 lines not shown))
+ arrays and binary search. 32 bit fixed length hashes are used in
+ this implementation.
+
+ Parameters
+ ----------
+
+ n_trees: int, optional (default = 10)
+ Number of trees in the LSH Forest.
+
+ c: int, optional(default = 10)
+ Threshold value to select candidates for nearest neighbors.
+ Number of candidates is often greater than c*n_trees(unless
+ restricted by lower_bound)
+
+ n_neighbors: int, optional(default = 1)
+ Number of neighbors to be returned from query funcitond when
@jnothman Owner
jnothman added a note

"functiond" -> "function"
"with the query" -> "to :meth:k_neighbors"
Is it necessary to have this parameter both for the class and the method?

sklearn/neighbors/lsh_forest.py
((86 lines not shown))
+
+ c: int, optional(default = 10)
+ Threshold value to select candidates for nearest neighbors.
+ Number of candidates is often greater than c*n_trees(unless
+ restricted by lower_bound)
+
+ n_neighbors: int, optional(default = 1)
+ Number of neighbors to be returned from query funcitond when
+ it is not provided with the query.
+
+ lower_bound: int, optional(defualt = 4)
+ lowerest hash length to be searched when candidate selection is
+ performed for nearest neighbors.
+
+ radius : float, optinal(default = 1.0)
+ Range of parameter space to use by default for :meth`radius_neighbors`
@jnothman Owner
jnothman added a note

I don't think "Range of parameter space" is clear here, i.e. it doesn't help the user set this value.

sklearn/neighbors/lsh_forest.py
((93 lines not shown))
+ Number of neighbors to be returned from query funcitond when
+ it is not provided with the query.
+
+ lower_bound: int, optional(defualt = 4)
+ lowerest hash length to be searched when candidate selection is
+ performed for nearest neighbors.
+
+ radius : float, optinal(default = 1.0)
+ Range of parameter space to use by default for :meth`radius_neighbors`
+ queries.
+
+ radius_cutoff_ratio: float, optional(defualt = 0.9)
+ Cut off ratio of radius neighbors to candidates at the radius
+ neighbor search
+
+ random_state: float, optional(default = 1)
@jnothman Owner
jnothman added a note

"float" is not correct. Nor is "default=1". Please copy the type and description from other estimators that accept a random_state.

sklearn/neighbors/lsh_forest.py
((150 lines not shown))
+ """
+
+ def __init__(self, n_trees=10, radius=1.0, c=50, n_neighbors=1,
+ lower_bound=4, radius_cutoff_ratio=.9,
+ random_state=None):
+ self.n_trees = n_trees
+ self.radius = radius
+ self.random_state = random_state
+ self.c = c
+ self.n_neighbors = n_neighbors
+ self.lower_bound = lower_bound
+ self.radius_cutoff_ratio = radius_cutoff_ratio
+
+ def _generate_hash_function(self):
+ """
+ Fits a `GaussianRandomProjections` with `n_components=hash_size
@jnothman Owner
jnothman added a note

"Projections" -> "Projection"

sklearn/neighbors/lsh_forest.py
((157 lines not shown))
+ self.random_state = random_state
+ self.c = c
+ self.n_neighbors = n_neighbors
+ self.lower_bound = lower_bound
+ self.radius_cutoff_ratio = radius_cutoff_ratio
+
+ def _generate_hash_function(self):
+ """
+ Fits a `GaussianRandomProjections` with `n_components=hash_size
+ and n_features=n_dim.
+ """
+ random_state = check_random_state(self.random_state)
+ grp = GaussianRandomProjection(n_components=self.max_label_length,
+ random_state=random_state.randint(0,
+ 10))
+ X = np.zeros((2, self._n_dim), dtype=float)
@jnothman Owner
jnothman added a note

Please include a comment for why this needs to be 2, not 1

sklearn/neighbors/lsh_forest.py
((168 lines not shown))
+ random_state = check_random_state(self.random_state)
+ grp = GaussianRandomProjection(n_components=self.max_label_length,
+ random_state=random_state.randint(0,
+ 10))
+ X = np.zeros((2, self._n_dim), dtype=float)
+ grp.fit(X)
+ return grp
+
+ def _create_tree(self):
+ """
+ Builds a single tree (in this case creates a sorted array of
+ binary hashes).
+ Hashing is done on an array of data points.
+ This creates a binary hashes by getting the dot product of
+ input points and hash_function then transforming the projection
+ into a binary string array based on the sign(positive/negative)
@jnothman Owner
jnothman added a note

space before '('

sklearn/neighbors/lsh_forest.py
((99 lines not shown))
+
+ radius : float, optinal(default = 1.0)
+ Range of parameter space to use by default for :meth`radius_neighbors`
+ queries.
+
+ radius_cutoff_ratio: float, optional(defualt = 0.9)
+ Cut off ratio of radius neighbors to candidates at the radius
+ neighbor search
+
+ random_state: float, optional(default = 1)
+ A random value to initialize random number generator.
+
+ Attributes
+ ----------
+
+ `hash_functions`: list of arrays
@jnothman Owner
jnothman added a note

hash_functions -> hash_functions_
It's also unclear from this description how a hash function is an array.

sklearn/neighbors/lsh_forest.py
((224 lines not shown))
+ right_mask = np.append(np.zeros(length, dtype=int),
+ np.ones(self.max_label_length-length,
+ dtype=int))
+ xx = tuple(right_mask)
+ binary_hash_right = (self.cache[xx[:self.cache_N]] * self.k +
+ self.cache[xx[self.cache_N:]])
+ self._right_mask.append(binary_hash_right)
+
+ self._left_mask = np.array(self._left_mask)
+ self._right_mask = np.array(self._right_mask)
+
+ def _get_candidates(self, query, max_depth, bin_queries, m):
+ """
+ Performs the Synchronous ascending phase in the LSH Forest
+ paper.
+ Returns an array of candidates, their distance rancks and
@jnothman Owner
jnothman added a note

"rancks" -> "ranks"

sklearn/neighbors/lsh_forest.py
((58 lines not shown))
+
+def _simple_euclidean_distance(query, candidates):
+ """
+ Private function to calculate Euclidean distances between each
+ point in candidates and query
+ """
+ distances = np.zeros(candidates.shape[0])
+ for i in range(candidates.shape[0]):
+ distances[i] = np.linalg.norm(candidates[i] - query)
+ return distances
+
+
+class LSHForest(BaseEstimator):
+
+ """
+ Performs approximate nearest neighbor search using LSH forest.
@jnothman Owner
jnothman added a note

All/most of your docstrings are incorrectly formatted: on the same line as the opening """ should be a brief summary.

sklearn/neighbors/lsh_forest.py
((313 lines not shown))
+ """
+ if X is None:
+ raise ValueError("X cannot be None")
+
+ self._input_array = check_array(X)
+ self._n_dim = self._input_array.shape[1]
+
+ self.max_label_length = 32
+ digits = ['0', '1']
+ # Creates a g(p,x) for each tree
+ self.hash_functions_ = []
+ self._trees = []
+ self._original_indices = []
+
+ self.cache_N = int(self.max_label_length/2)
+ hashes = [x for x in itertools.product((0, 1),
@jnothman Owner
jnothman added a note

x could be better named, given it is unrelated to the input X

@larsmans Owner
larsmans added a note

Actually [x for x in y] is more properly written list(y).

sklearn/neighbors/lsh_forest.py
((171 lines not shown))
+ 10))
+ X = np.zeros((2, self._n_dim), dtype=float)
+ grp.fit(X)
+ return grp
+
+ def _create_tree(self):
+ """
+ Builds a single tree (in this case creates a sorted array of
+ binary hashes).
+ Hashing is done on an array of data points.
+ This creates a binary hashes by getting the dot product of
+ input points and hash_function then transforming the projection
+ into a binary string array based on the sign(positive/negative)
+ of the projection.
+ """
+ grp = self._generate_hash_function()
@jnothman Owner
jnothman added a note

I don't think we should require completely independent gaussian random projections for each tree. While the trees will not be entirely independent, surely many complementary trees (in the sense of increasing recall) can be built just by permuting the bits in one of your existing hash functions.

More generally, k indices of h bits each can be populated using m hash bits for m >= k*h, using h bits chosen uniformly without repetition for each, with the permutations encoded as a matrix multiply. The question is: at what point does the reuse begin to hurt results? To what extent does the reuse help performance? I don't know if there is a reasonable answer to these questions that is easy to reach inside the GSoC, do you? Are there papers quantifying the effects of sharing (even for LSH, ignoring the *Forest aspect)?

@jnothman Owner
jnothman added a note

The bit permutations would be best done not as matrix multiplies, but as bit operations in C. Even if performed as multiplies over binary matrices, these matrices don't have the full feature-space dimensionality, only a fixed width of 32, so we should expect it to be much faster than additional random projections.

I can't recall where I've seen this technique used, but I don't think it involves a quantification of the sharing effects. Some colleagues were playing around with this variant, but I'm not sure if they have useful numbers in this space. However, traditional (bucket-based, rather than forest-based) LSH, in my understanding, is often described in terms of selecting a fixed number of bits from a random projection to represent one hash. This should be equivalent to permutation, and given the number of possible permutations, should still provide the additional recall that creating multiple indexes intends.
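An illustrative sketch of that sharing idea (toy sizes, hypothetical code, not this PR's): draw m projection bits once per point, then build each tree's h-bit index from bits chosen without repetition:

import numpy as np

rng = np.random.RandomState(0)
n_dim, m, k, h = 50, 64, 10, 32
planes = rng.randn(m, n_dim)                       # m shared projection directions
x = rng.randn(n_dim)
shared_bits = (np.dot(planes, x) > 0).astype(int)  # one m-bit hash for x

# Each of the k trees reuses h of the m shared bits; the selections are
# fixed at fit time, so trees differ only in which bits they index.
selections = [rng.choice(m, size=h, replace=False) for _ in range(k)]
tree_hashes = [shared_bits[sel] for sel in selections]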

sklearn/neighbors/lsh_forest.py
((318 lines not shown))
+ self._n_dim = self._input_array.shape[1]
+
+ self.max_label_length = 32
+ digits = ['0', '1']
+ # Creates a g(p,x) for each tree
+ self.hash_functions_ = []
+ self._trees = []
+ self._original_indices = []
+
+ self.cache_N = int(self.max_label_length/2)
+ hashes = [x for x in itertools.product((0, 1),
+ repeat=self.cache_N)]
+
+ self.cache = {}
+ for item in hashes:
+ self.cache[tuple(item)] = int("".join([digits[y] for y in item]),
@jnothman Owner
jnothman added a note

Alternatively, could use np.packbits and get rid of the cache and tuple conversion altogether? I'm not sure which would be faster.

Ah, cool, I wasn't aware of this method. Thanks Joel!

Actually when it comes to converting bit arrays to integers, np.packbits gets slower. The uint8 type is returned by np.packbits, and for a 32-bit array, 4 uint8 numbers are returned. To get the final result, those 4 values have to be combined as test[0]*(2**24) + test[1]*(2**16) + test[2]*(2**8) + test[3], where test is the array returned from np.packbits for a 32-long bit array.
The average time taken for this operation is 7.43 µs, whereas the cache and tuple version takes only 550 ns.

@larsmans Owner
larsmans added a note

Is that including the cost of computing those constants every time? Also, note that this is a sum of products of corresponding elements, i.e. a dot product.

Computing the values to store in the cache happens only once. Moreover, even if we compare the computing times of these two methods, int("".join([digits[y] for y in bit_array]), 2) is still faster: 1.79 µs. test = np.packbits(bit_array); test[0]*(2**24) + test[1]*(2**16) + test[2]*(2**8) + test[3] takes 10.9 µs on average.

@jnothman Owner
jnothman added a note

I wouldn't be surprised if your cache method is faster, but test[0]*(2**24) ... etc. shouldn't be necessary. You can use a view over the same array to convert from uint8 to uint32:

>>> np.packbits(np.array([0,0,0,0,0,0,0,0,] * 1 + [0,0,0,0,0,0,0,1] * 3)).view(dtype='>u4')
array([65793], dtype=uint32)

65793 = 2**0 + 2**8 + 2**16

Comparing this to lookup in a dict:

In [2]: a = np.array([0,0,0,0,0,0,0,0,] * 1 + [0,0,0,0,0,0,0,1] * 3)
In [3]: %timeit np.packbits(a).view(dtype='>u4')
1000000 loops, best of 3: 1.98 µs per loop
In [4]: lookup = {tuple(a): 5}
In [5]: %timeit lookup[tuple(a)]
100000 loops, best of 3: 6.26 µs per loop
@jnothman Owner
jnothman added a note

And with the further benefit of vectorization:

In [10]: A = np.repeat(a[None, :], 10, axis=0)
In [12]: %timeit np.packbits(A).view(dtype='>u4')
100000 loops, best of 3: 3.75 µs per loop
In [13]: %timeit [lookup[tuple(row)] for row in A]
10000 loops, best of 3: 67.5 µs per loop
sklearn/neighbors/lsh_forest.py
((240 lines not shown))
+ distances.
+ """
+ candidates = []
+ n_candidates = self.c * self.n_trees
+ while max_depth > self.lower_bound and (len(candidates) < n_candidates
+ or len(set(candidates)) < m):
+ for i in range(self.n_trees):
+ candidates.extend(
+ self._original_indices[i][_find_matching_indices(
+ self._trees[i],
+ bin_queries[i],
+ self._left_mask[max_depth],
+ self._right_mask[max_depth])].tolist())
+ max_depth = max_depth - 1
+ candidates = np.unique(candidates)
+ ranks, distances = self._compute_distances(query, candidates)
@jnothman Owner
jnothman added a note

Given that in K(A)NN classification we don't need the actual nearest data points so much as a vote over their classes, can we offer the user a way to save time by not computing the true distances?

@maheshakya

Clustering is the machine learning application of LSH-ANN discussed in the project scope, so it needs to get done first. But this can be applied to classification after the project.

@jnothman
Owner
sklearn/neighbors/lsh_forest.py
((17 lines not shown))
+
+def _find_matching_indices(sorted_array, item, left_mask, right_mask):
+ """
+ Finds indices in sorted array of strings where their first
+ h elements match the items' first h elements
+ """
+ left_index = np.searchsorted(sorted_array,
+ item & left_mask)
+ right_index = np.searchsorted(sorted_array,
+ item | right_mask,
+ side='right')
+ return np.arange(left_index, right_index)
+
+
+def _find_longest_prefix_match(bit_string_array, query, hash_size,
+ left_masks, right_masks):
@larsmans Owner
larsmans added a note

Can't the bisect module in the stdlib do this?

Yes, that can be used here. For small and medium sizes of sorted_array, bisect_left and bisect_right are faster, but for very large arrays numpy.searchsorted is faster.
E.g.:
array size = 1000
average time for _find_matching_indices with bisect = 3.88 µs
average time for _find_matching_indices with searchsorted = 5.09 µs

array size = 10000
average time for _find_matching_indices with bisect = 7.69 µs
average time for _find_matching_indices with searchsorted = 8.56 µs

array size = 30000
average time for _find_matching_indices with bisect = 17.2 µs
average time for _find_matching_indices with searchsorted = 16.8 µs

So I think using bisect is more reasonable.

In this case all the differences are small, so it's not even worth talking about, but keep in mind that for not-too-big databases, LSH has no benefit over exhaustive search. So you should always be thinking about large DBs.

@ogrisel Owner
ogrisel added a note

+1, when doing benchmarks I think we need to focus on n_samples > 100k.

sklearn/neighbors/lsh_forest.py
((82 lines not shown))
+ ----------
+
+ n_trees: int, optional (default = 10)
+ Number of trees in the LSH Forest.
+
+ c: int, optional(default = 10)
+ Threshold value to select candidates for nearest neighbors.
+ Number of candidates is often greater than c*n_trees(unless
+ restricted by lower_bound)
+
+ n_neighbors: int, optional(default = 1)
+ Number of neighbors to be returned from query funcitond when
+ it is not provided with the query.
+
+ lower_bound: int, optional(defualt = 4)
+ lowerest hash length to be searched when candidate selection is
@larsmans Owner
larsmans added a note

lowerest -> lowest

sklearn/neighbors/lsh_forest.py
((356 lines not shown))
+ """
+ bin_queries = []
+
+ # descend phase
+ max_depth = 0
+ for i in range(self.n_trees):
+ bin_query = self._convert_to_hash(query, i)
+ k = _find_longest_prefix_match(self._trees[i], bin_query,
+ self.max_label_length,
+ self._left_mask,
+ self._right_mask)
+ if k > max_depth:
+ max_depth = k
+ bin_queries.append(bin_query)
+
+ if not is_radius:
@larsmans Owner
larsmans added a note

Stylistic nitpick: when switching on a boolean, please put the positive case first, esp. when the negative case takes fewer lines of code.

sklearn/neighbors/lsh_forest.py
((388 lines not shown))
+ X : array_like, shape (n_samples, n_features)
+ List of n_features-dimensional data points. Each row
+ corresponds to a single query.
+
+ n_neighbors: int, opitonal (default = None)
+ Number of neighbors required. If not provided, this will
+ return the number specified at the initialization.
+
+ return_distance: boolean, optional (default = False)
+ Returns the distances of neighbors if set to True.
+ """
+ if not hasattr(self, 'hash_functions_'):
+ raise ValueError("estimator should be fitted.")
+
+ if X is None:
+ raise ValueError("X cannot be None.")
@larsmans Owner
larsmans added a note

Why this specific check for None, is it an expected input? Also, the exception class should be TypeError.

sklearn/neighbors/lsh_forest.py
((435 lines not shown))
+ X : array_like, shape (n_samples, n_features)
+ List of n_features-dimensional data points. Each row
+ corresponds to a single query.
+
+ radius : float
+ Limiting distance of neighbors to return.
+ (default is the value passed to the constructor).
+
+ return_distance: boolean, optional (default = False)
+ Returns the distances of neighbors if set to True.
+ """
+ if not hasattr(self, 'hash_functions_'):
+ raise ValueError("estimator should be fitted.")
+
+ if X is None:
+ raise ValueError("X cannot be None.")
@larsmans Owner
larsmans added a note

Same story.

sklearn/neighbors/lsh_forest.py
((438 lines not shown))
+
+ radius : float
+ Limiting distance of neighbors to return.
+ (default is the value passed to the constructor).
+
+ return_distance: boolean, optional (default = False)
+ Returns the distances of neighbors if set to True.
+ """
+ if not hasattr(self, 'hash_functions_'):
+ raise ValueError("estimator should be fitted.")
+
+ if X is None:
+ raise ValueError("X cannot be None.")
+
+ if radius is not None:
+ self.radius = radius
@larsmans Owner
larsmans added a note

I don't like a query method modifying an object. I also don't see why the class should remember the radius of the last radius query. It looks like this should be the other way around, if radius is None: radius = self.radius.

@maheshakya maheshakya Updated docstrings and tests.
in fit, kneighbors and radius methods, None type is not checked because
it is not an expected input.
2955fc1
@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling 2955fc1 on maheshakya:lsh_forest into 1b2833a on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling f78dcf5 on maheshakya:lsh_forest into 1b2833a on scikit-learn:master.

sklearn/neighbors/lsh_forest.py
((43 lines not shown))
+ while lo < hi:
+ mid = (lo+hi)//2
+
+ k = _find_matching_indices(bit_string_array, query,
+ left_masks[mid],
+ right_masks[mid]).shape[0]
+ if k > 0:
+ lo = mid + 1
+ res = mid
+ else:
+ hi = mid
+
+ return res
+
+
+def _simple_euclidean_distance(query, candidates):
@arjoly Owner
arjoly added a note

I think that there is a function for this in sklearn.metrics.pairwise.

I seem to recall that was slower than the code below in this case; if so, it seems that fact needs to be documented in a comment.

@arjoly Owner
arjoly added a note

Is it by a large magnitude? Is it possible to speed up the original function?
Probably this function would be better in that module.

@robertlayton Owner

The implementation in pairwise.py is optimised for matrix by matrix computation, whereas this version takes a vector as the first input (see @maheshakya's comment further on [1]). This function itself can be improved, but I missed that in my most recent pass. @maheshakya -- this function can be optimised by getting rid of the for loop

@arjoly Owner
arjoly added a note

Thanks @robertlayton for the clarification!

@ogrisel Owner
ogrisel added a note

@robertlayton's remark should be included in the docstring.

@robertlayton, I found that the version with the loop is faster when n_features is large. I can obtain the same results as the loop with np.linalg.norm(candidates - query, axis=1). This is very fast when the dimension is low, but with high dimensions it gets slower than the loop. Here are some profiling results:

n_samples = 10000 (fixed)

n_features = 100 :
without loop: 57.1 ms
with loop: 3.86 ms

n_features = 500 :
without loop: 21 ms
with loop: 70.6 ms

n_features = 1000 :
without loop: 40.5 ms
with loop: 82.3 ms

n_features = 2000 :
without loop: 79.6 ms
with loop: 104 ms

n_features = 5000 :
without loop: 200 ms
with loop: 177 ms

n_features = 10000 :
without loop: 377 ms
with loop: 302 ms

n_features = 20000 :
without loop: 752 ms
with loop: 551 ms

@jnothman Owner

I think the number of candidates (n_samples in this example) does not matter much to this speed bottleneck; n_features is what really affects the speed of this operation. When I set n_samples=100 and run the test, I get the following results for large n_features.

n_features = 10000 :
without loop: 3.37 ms
with loop: 2.78 ms

n_features = 20000 :
without loop: 8.22 ms
with loop: 5.22 ms

LSHF does reduce n_samples, but it cannot help with high dimensionality. Even though it limits the number of candidates, there is a possibility of getting around 10000 candidates if the database is very large (around 1000k samples or so).

This crossover starts to happen when n_features is around 3500. I guess it's better if I use that as a heuristic to switch between the loop and the vectorized method. @jnothman, any idea?

@jnothman Owner

That's better than both previous methods, even for large dimensions such as 50000. Thank you for the idea. :)
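A minimal sketch of the crossover heuristic discussed above (the helper name is hypothetical, the 3500-feature threshold is the empirical figure quoted, and jnothman's final suggestion may differ):

    import numpy as np

    def euclidean_to_query(query, candidates, crossover=3500):
        # the vectorized path wins below the crossover, the loop above it
        if candidates.shape[1] <= crossover:
            return np.linalg.norm(candidates - query, axis=1)
        distances = np.empty(candidates.shape[0])
        for i in range(candidates.shape[0]):
            distances[i] = np.linalg.norm(candidates[i] - query)
        return distances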

sklearn/neighbors/lsh_forest.py
((12 lines not shown))
+from ..utils import check_random_state
+
+from ..random_projection import GaussianRandomProjection
+
+__all__ = ["LSHForest"]
+
+
+def _find_matching_indices(sorted_array, item, left_mask, right_mask):
+ """Finds indices in sorted array of integers.
+
+ Most significant h bits in the binary representations of the
+ integers are matched with the items' most significant h bits.
+ """
+ left_index = bisect_left(sorted_array, item & left_mask)
+ right_index = bisect_right(sorted_array, item | right_mask)
+ return np.arange(left_index, right_index)
@arjoly Owner
arjoly added a note

This function signature doesn't look right for _find_longest_prefix_match, which expects a scalar (and probably needs only one call to bisect_left).

I am not sure I understood this comment. Do you mean that it returns an array, where _find_longest_prefix_match uses just the size of this array?
One call to bisect_left gives the location of the first match; since we care only about whether there are zero or more than zero matches, I agree that could suffice (good point). This would involve a little code duplication, and I'm not sure this optimization addresses a bottleneck, but it is worth checking.

@arjoly Owner
arjoly added a note

Sorry, I missed the .shape[0]. I still find it strange to allocate an array if it's not used.
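For illustration, a hedged sketch of the single-bisect check discussed here (the helper name is hypothetical; it only answers "is there any match?", so no index array is allocated):

    from bisect import bisect_left

    def _any_matching_index(sorted_array, item, left_mask, right_mask):
        # elements sharing the masked prefix lie in the closed range
        # [item & left_mask, item | right_mask] of the sorted array
        i = bisect_left(sorted_array, item & left_mask)
        return i < len(sorted_array) and sorted_array[i] <= (item | right_mask)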

sklearn/neighbors/lsh_forest.py
((149 lines not shown))
+ [ 0. , 0.6440088 , 0.6900774 ]]))
+
+ """
+
+ def __init__(self, n_trees=10, radius=1.0, c=50, n_neighbors=1,
+ lower_bound=4, radius_cutoff_ratio=.9,
+ random_state=None):
+ self.n_trees = n_trees
+ self.radius = radius
+ self.random_state = random_state
+ self.c = c
+ self.n_neighbors = n_neighbors
+ self.lower_bound = lower_bound
+ self.radius_cutoff_ratio = radius_cutoff_ratio
+
+ def _generate_hash_function(self):
@arjoly Owner
arjoly added a note

I would inline this methods. It's only used once.

sklearn/neighbors/lsh_forest.py
((173 lines not shown))
+ X = np.zeros((1, self._n_dim), dtype=float)
+ grp.fit(X)
+ return grp
+
+ def _create_tree(self):
+ """Builds a single tree.
+
+ Here, it creates a sorted array of binary hashes.
+ Hashing is done on an array of data points.
+ This creates a binary hashes by getting the dot product of
+ input points and hash_function then transforming the projection
+ into a binary string array based on the sign (positive/negative)
+ of the projection.
+ """
+ grp = self._generate_hash_function()
+ hashes = np.array(grp.transform(self._input_array) > 0, dtype=int)
@arjoly Owner
arjoly added a note

Why not call fit on self._input_array and then transform?

@arjoly Owner
arjoly added a note

I would have a clean RandomSignProjection transformer. If you don't like the warning for n_components higher than n_features, it could be disabled by adding a new parameter to the base random projection class.
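A sketch of the fit-then-transform flow arjoly suggests, fitting GaussianRandomProjection on the data itself (the stand-in data and parameter values are assumptions):

    import numpy as np
    from sklearn.random_projection import GaussianRandomProjection

    X = np.random.RandomState(0).rand(100, 50)     # stand-in training data
    grp = GaussianRandomProjection(n_components=32, random_state=0)
    hashes = grp.fit_transform(X) > 0              # sign-based binary hashes
    hash_function = grp.components_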

sklearn/neighbors/lsh_forest.py
((299 lines not shown))
+ binary_hash = (self.cache[xx[:self.cache_N]] * self.k +
+ self.cache[xx[self.cache_N:]])
+ return binary_hash
+
+ def fit(self, X):
+ """Fit the LSH forest on the data.
+
+ Parameters
+ ----------
+ X : array_like, shape (n_samples, n_features)
+ List of n_features-dimensional data points. Each row
+ corresponds to a single data point.
+ """
+
+ self._input_array = check_array(X)
+ self._n_dim = self._input_array.shape[1]
@arjoly Owner
arjoly added a note

This information is already in _input_array.

@arjoly Owner
arjoly added a note

You seem to mean n_features; n_dim would be self._input_array.ndim.

sklearn/neighbors/lsh_forest.py
((298 lines not shown))
+ xx = tuple(projections)
+ binary_hash = (self.cache[xx[:self.cache_N]] * self.k +
+ self.cache[xx[self.cache_N:]])
+ return binary_hash
+
+ def fit(self, X):
+ """Fit the LSH forest on the data.
+
+ Parameters
+ ----------
+ X : array_like, shape (n_samples, n_features)
+ List of n_features-dimensional data points. Each row
+ corresponds to a single data point.
+ """
+
+ self._input_array = check_array(X)
@arjoly Owner
arjoly added a note

To be consistent with the rest of the project, _input_array => X_train? Why does it need to be private?

@arjoly Owner
arjoly added a note

If it is a class similar to NearestNeighbors, it should be called _fit_X

sklearn/neighbors/lsh_forest.py
((179 lines not shown))
+
+ Here, it creates a sorted array of binary hashes.
+ Hashing is done on an array of data points.
+ This creates a binary hashes by getting the dot product of
+ input points and hash_function then transforming the projection
+ into a binary string array based on the sign (positive/negative)
+ of the projection.
+ """
+ grp = self._generate_hash_function()
+ hashes = np.array(grp.transform(self._input_array) > 0, dtype=int)
+ hash_function = grp.components_
+
+ binary_hashes = []
+ for i in range(hashes.shape[0]):
+ xx = tuple(hashes[i])
+ binary_hashes.append(self.cache[xx[:self.cache_N]] * self.k
@arjoly Owner
arjoly added a note

The cache attribute is not documented.

@arjoly Owner
arjoly added a note

The attribute k is not documented and should be k_ if it's not part of the constructor.

sklearn/neighbors/lsh_forest.py
((178 lines not shown))
+ """Builds a single tree.
+
+ Here, it creates a sorted array of binary hashes.
+ Hashing is done on an array of data points.
+ This creates a binary hashes by getting the dot product of
+ input points and hash_function then transforming the projection
+ into a binary string array based on the sign (positive/negative)
+ of the projection.
+ """
+ grp = self._generate_hash_function()
+ hashes = np.array(grp.transform(self._input_array) > 0, dtype=int)
+ hash_function = grp.components_
+
+ binary_hashes = []
+ for i in range(hashes.shape[0]):
+ xx = tuple(hashes[i])
@arjoly Owner
arjoly added a note

Why convert back to a tuple?

sklearn/neighbors/lsh_forest.py
((71 lines not shown))
+
+ """Performs approximate nearest neighbor search using LSH forest.
+
+ LSH Forest: Locality Sensitive Hashing forest [1] is an alternative
+ method for vanilla approximate nearest neighbor search methods.
+ LSH forest data structure has been implemented using sorted
+ arrays and binary search. 32 bit fixed length hashes are used in
+ this implementation.
+
+ Parameters
+ ----------
+
+ n_trees: int (default = 10)
+ Number of trees in the LSH Forest.
+
+ c: int (default = 10)
@arjoly Owner
arjoly added a note

Do you mean n_candidates? I wouldn't use c, as it is used in linear models to control regularisation.

sklearn/neighbors/lsh_forest.py
((68 lines not shown))
+
+
+class LSHForest(BaseEstimator):
+
+ """Performs approximate nearest neighbor search using LSH forest.
+
+ LSH Forest: Locality Sensitive Hashing forest [1] is an alternative
+ method for vanilla approximate nearest neighbor search methods.
+ LSH forest data structure has been implemented using sorted
+ arrays and binary search. 32 bit fixed length hashes are used in
+ this implementation.
+
+ Parameters
+ ----------
+
+ n_trees: int (default = 10)
@arjoly Owner
arjoly added a note

n_trees => n_estimators? This would be consistent with the ensemble module and the rest of the project.

sklearn/neighbors/lsh_forest.py
((80 lines not shown))
+ Parameters
+ ----------
+
+ n_trees: int (default = 10)
+ Number of trees in the LSH Forest.
+
+ c: int (default = 10)
+ Value to restrict candidates selection for nearest neighbors.
+ Number of candidates is often greater than c*n_trees(unless
+ restricted by lower_bound)
+
+ n_neighbors: int (default = 1)
+ Number of neighbors to be returned from query function when
+ it is not provided to :meth:`k_neighbors`
+
+ lower_bound: int (defualt = 4)
@arjoly Owner
arjoly added a note

Do you mean min_hash_length?

@arjoly Owner
arjoly added a note

defualt => default

sklearn/neighbors/lsh_forest.py
((94 lines not shown))
+
+ lower_bound: int (defualt = 4)
+ lowest hash length to be searched when candidate selection is
+ performed for nearest neighbors.
+
+ radius : float, optinal (default = 1.0)
+ Radius from the data point to its neighbors. This is the parameter
+ space to use by default for :meth`radius_neighbors` queries.
+
+ radius_cutoff_ratio: float, optional (defualt = 0.9)
+ Cut off ratio of radius neighbors to candidates at the radius
+ neighbor search
+
+ random_state: numpy.RandomState, optional
+ The generator used to initialize random projections.
+ Defaults to numpy.random.
@arjoly Owner
arjoly added a note
    random_state : int, RandomState instance or None, optional (default=None)
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.
sklearn/neighbors/lsh_forest.py
((182 lines not shown))
+ This creates a binary hashes by getting the dot product of
+ input points and hash_function then transforming the projection
+ into a binary string array based on the sign (positive/negative)
+ of the projection.
+ """
+ grp = self._generate_hash_function()
+ hashes = np.array(grp.transform(self._input_array) > 0, dtype=int)
+ hash_function = grp.components_
+
+ binary_hashes = []
+ for i in range(hashes.shape[0]):
+ xx = tuple(hashes[i])
+ binary_hashes.append(self.cache[xx[:self.cache_N]] * self.k
+ + self.cache[xx[self.cache_N:]])
+
+ return np.argsort(binary_hashes), np.sort(binary_hashes), hash_function
@arjoly Owner
arjoly added a note

You perform two sorts where one would do.
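A sketch of the single-argsort variant (helper name hypothetical; the permutation is reused instead of sorting a second time):

    import numpy as np

    def _sorted_hashes(binary_hashes):
        binary_hashes = np.asarray(binary_hashes)
        order = np.argsort(binary_hashes)      # one sort only
        return order, binary_hashes[order]     # argsort and sorted values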

sklearn/neighbors/lsh_forest.py
((193 lines not shown))
+ xx = tuple(hashes[i])
+ binary_hashes.append(self.cache[xx[:self.cache_N]] * self.k
+ + self.cache[xx[self.cache_N:]])
+
+ return np.argsort(binary_hashes), np.sort(binary_hashes), hash_function
+
+ def _compute_distances(self, query, candidates):
+ """Computes the Euclidean distance.
+
+ Distance is from the query to points in the candidates array.
+ Returns argsort of distances in the candidates
+ array and sorted distances.
+ """
+ distances = _simple_euclidean_distance(
+ query, self._input_array[candidates])
+ return np.argsort(distances), np.sort(distances)
@arjoly Owner
arjoly added a note

You sort things twice.

sklearn/neighbors/lsh_forest.py
((217 lines not shown))
+ dtype=int))
+ xx = tuple(left_mask)
+ binary_hash_left = (self.cache[xx[:self.cache_N]] * self.k +
+ self.cache[xx[self.cache_N:]])
+ self._left_mask.append(binary_hash_left)
+
+ right_mask = np.append(np.zeros(length, dtype=int),
+ np.ones(self.max_label_length-length,
+ dtype=int))
+ xx = tuple(right_mask)
+ binary_hash_right = (self.cache[xx[:self.cache_N]] * self.k +
+ self.cache[xx[self.cache_N:]])
+ self._right_mask.append(binary_hash_right)
+
+ self._left_mask = np.array(self._left_mask)
+ self._right_mask = np.array(self._right_mask)
@arjoly Owner
arjoly added a note

Those attributes need to be documented somewhere (even if private).

sklearn/neighbors/lsh_forest.py
((195 lines not shown))
+ + self.cache[xx[self.cache_N:]])
+
+ return np.argsort(binary_hashes), np.sort(binary_hashes), hash_function
+
+ def _compute_distances(self, query, candidates):
+ """Computes the Euclidean distance.
+
+ Distance is from the query to points in the candidates array.
+ Returns argsort of distances in the candidates
+ array and sorted distances.
+ """
+ distances = _simple_euclidean_distance(
+ query, self._input_array[candidates])
+ return np.argsort(distances), np.sort(distances)
+
+ def _generate_masks(self):
@arjoly Owner
arjoly added a note

This method is called only once. I would inline it.

sklearn/neighbors/lsh_forest.py
((204 lines not shown))
+ array and sorted distances.
+ """
+ distances = _simple_euclidean_distance(
+ query, self._input_array[candidates])
+ return np.argsort(distances), np.sort(distances)
+
+ def _generate_masks(self):
+ """Creates left and right masks for all hash lengths."""
+ self._left_mask, self._right_mask = [], []
+
+ for length in range(self.max_label_length+1):
+ left_mask = np.append(np.ones(length, dtype=int),
+ np.zeros(self.max_label_length-length,
+ dtype=int))
+ xx = tuple(left_mask)
+ binary_hash_left = (self.cache[xx[:self.cache_N]] * self.k +
@arjoly Owner
arjoly added a note

self.cache_N is public and not documented.

sklearn/neighbors/lsh_forest.py
((203 lines not shown))
+ Returns argsort of distances in the candidates
+ array and sorted distances.
+ """
+ distances = _simple_euclidean_distance(
+ query, self._input_array[candidates])
+ return np.argsort(distances), np.sort(distances)
+
+ def _generate_masks(self):
+ """Creates left and right masks for all hash lengths."""
+ self._left_mask, self._right_mask = [], []
+
+ for length in range(self.max_label_length+1):
+ left_mask = np.append(np.ones(length, dtype=int),
+ np.zeros(self.max_label_length-length,
+ dtype=int))
+ xx = tuple(left_mask)
@arjoly Owner
arjoly added a note

Why use a tuple instead of an array?

sklearn/neighbors/lsh_forest.py
((232 lines not shown))
+ self._right_mask = np.array(self._right_mask)
+
+ def _get_candidates(self, query, max_depth, bin_queries, m):
+ """Performs the Synchronous ascending phase.
+
+ Returns an array of candidates, their distance ranks and
+ distances.
+ """
+ candidates = []
+ n_candidates = self.c * self.n_trees
+ while max_depth > self.lower_bound and (len(candidates) < n_candidates
+ or len(set(candidates)) < m):
+ for i in range(self.n_trees):
+ candidates.extend(
+ self._original_indices[i][_find_matching_indices(
+ self._trees[i],
@arjoly Owner
arjoly added a note

I would make _trees public.

Why? I cannot imagine anyone using them from outside, considering their contents are meaningful only with respect to the particular random hashes used for each tree.

@arjoly Owner
arjoly added a note

Ok, thanks for the information.

@arjoly Owner
arjoly added a note

By making things public, you are forced to have good documentation, and in my opinion this eases maintenance in the long run. However, I am not an expert. :-)

@GaelVaroquaux Owner
sklearn/neighbors/lsh_forest.py
((225 lines not shown))
+ dtype=int))
+ xx = tuple(right_mask)
+ binary_hash_right = (self.cache[xx[:self.cache_N]] * self.k +
+ self.cache[xx[self.cache_N:]])
+ self._right_mask.append(binary_hash_right)
+
+ self._left_mask = np.array(self._left_mask)
+ self._right_mask = np.array(self._right_mask)
+
+ def _get_candidates(self, query, max_depth, bin_queries, m):
+ """Performs the Synchronous ascending phase.
+
+ Returns an array of candidates, their distance ranks and
+ distances.
+ """
+ candidates = []
@arjoly Owner
arjoly added a note

This could be initialized directly, as you seem to know the number of candidates.

Initialized to what? We don't know the candidates yet. Do you mean initialize a large empty list for performance? Growing a long list is amortized constant time per append, right? I'm sorry, I don't have enough background in scikit-learn to understand the motivations for some of your suggestions; more detail would be welcome. This is GSoC code; I am guessing the author Maheshakya would probably benefit from the detail too.

@arjoly Owner
arjoly added a note

I am far from an expert in LSH, but it seemed that you know the number of candidates. However, as you said, it might allocate a very big empty array for nothing. Probably best to keep the list for now.

sklearn/neighbors/lsh_forest.py
((228 lines not shown))
+ self.cache[xx[self.cache_N:]])
+ self._right_mask.append(binary_hash_right)
+
+ self._left_mask = np.array(self._left_mask)
+ self._right_mask = np.array(self._right_mask)
+
+ def _get_candidates(self, query, max_depth, bin_queries, m):
+ """Performs the Synchronous ascending phase.
+
+ Returns an array of candidates, their distance ranks and
+ distances.
+ """
+ candidates = []
+ n_candidates = self.c * self.n_trees
+ while max_depth > self.lower_bound and (len(candidates) < n_candidates
+ or len(set(candidates)) < m):
@arjoly Owner
arjoly added a note

You have

len(candidates) < n_candidates or len(set(candidates)) < m
len(candidates) < max(n_candidates, m)

Thus you can compute max(n_candidates, m) only once.
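As a fragment (names as in the hunk above), and noting the rewrite is only equivalent when candidates contains no duplicates, since the original condition also tests len(set(candidates)):

    min_candidates = max(self.c * self.n_trees, m)   # computed once
    while (max_depth > self.lower_bound
           and len(candidates) < min_candidates):
        ...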

@arjoly Owner
arjoly added a note

By max_depth, do you mean hash_length?

sklearn/neighbors/lsh_forest.py
((242 lines not shown))
+ while max_depth > self.lower_bound and (len(candidates) < n_candidates
+ or len(set(candidates)) < m):
+ for i in range(self.n_trees):
+ candidates.extend(
+ self._original_indices[i][_find_matching_indices(
+ self._trees[i],
+ bin_queries[i],
+ self._left_mask[max_depth],
+ self._right_mask[max_depth])].tolist())
+ max_depth = max_depth - 1
+ candidates = np.unique(candidates)
+ ranks, distances = self._compute_distances(query, candidates)
+
+ return candidates, ranks, distances
+
+ def _get_radius_neighbors(self, query, max_depth, bin_queries, radius):
@arjoly Owner
arjoly added a note

Is there anything that can be re-used from the radius neighbor class?

The implementation of this (LSH approximate neighbors) is completely different from NearestNeighbors, so I think re-using is not possible.

@GaelVaroquaux Owner
sklearn/neighbors/lsh_forest.py
((252 lines not shown))
+ candidates = np.unique(candidates)
+ ranks, distances = self._compute_distances(query, candidates)
+
+ return candidates, ranks, distances
+
+ def _get_radius_neighbors(self, query, max_depth, bin_queries, radius):
+ """Finds radius neighbors from the candidates obtained.
+
+ Their distances from query are smaller than radius.
+ Returns radius neighbors and distances.
+ """
+ ratio_within_radius = 1
+ threshold = 1 - self.radius_cutoff_ratio
+ total_candidates = np.array([], dtype=int)
+ total_neighbors = np.array([], dtype=int)
+ total_distances = np.array([], dtype=float)
@arjoly Owner
arjoly added a note

You seem to want to use list and not array in your code.

Later there are several operations like numpy.setdiff1d. So arrays are needed, right?
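A small illustration of the array operations in question (values are hypothetical):

    import numpy as np

    candidates = np.array([3, 5, 7, 9])
    total_candidates = np.array([5, 9])
    new = np.setdiff1d(candidates, total_candidates)      # array([3, 7])
    total_candidates = np.append(total_candidates, new)   # array([5, 9, 3, 7])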

sklearn/neighbors/lsh_forest.py
((282 lines not shown))
+ total_neighbors = np.append(total_neighbors,
+ candidates[ranks[:m]])
+ total_distances = np.append(total_distances, distances[:m])
+ ratio_within_radius = (total_neighbors.shape[0] /
+ float(total_candidates.shape[0]))
+ max_depth = max_depth - 1
+ return total_neighbors, total_distances
+
+ def _convert_to_hash(self, item, tree_n):
+ """Converts item(a date point) into an integer.
+
+ Value of the integer is the value represented by the
+ binary hashed value.
+ """
+ projections = np.array(np.dot(self.hash_functions_[tree_n],
+ item) > 0, dtype=int)
@arjoly Owner
arjoly added a note

You have a RandomSignProjection.transform here.

Do you mean to create a transform function for getting dot product and conversion to binary? Where should this function be included?

sklearn/neighbors/lsh_forest.py
((313 lines not shown))
+ self._input_array = check_array(X)
+ self._n_dim = self._input_array.shape[1]
+
+ self.max_label_length = 32
+ digits = ['0', '1']
+ # Creates a g(p,x) for each tree
+ self.hash_functions_ = []
+ self._trees = []
+ self._original_indices = []
+
+ self.cache_N = int(self.max_label_length/2)
+ hashes = list(itertools.product((0, 1), repeat=self.cache_N))
+
+ self.cache = {}
+ for item in hashes:
+ self.cache[tuple(item)] = int("".join([digits[y] for y in item]),
@arjoly Owner
arjoly added a note

Apparently, you want to use a tuple to have a hashable object. Why not compute a hash of the numpy array instead?

@arjoly Owner
arjoly added a note

Just to let you know, I think that we have efficient functions (murmur_hash?) to compute the hash of an array. If not, it's also possible to use joblib.hash.
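For reference, a minimal use of the utility arjoly mentions (murmurhash3_32 lives in sklearn.utils; hashing the raw bytes of the array is an assumption):

    import numpy as np
    from sklearn.utils import murmurhash3_32

    projections = np.array([1, 0, 1, 1], dtype=np.int32)
    key = murmurhash3_32(projections.tobytes(), positive=True)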

@jnothman Owner
jnothman added a note

Rather than looking up a tuple in a dict, this should be using packbits as I have shown. My comment was also on this line, but for some reason github has made it invisible, despite the code not having changed :(

Sorry, I forgot to push my last commit. Actually, the way you suggested is faster when using numpy arrays; I had tested with lists. Converting a list to a tuple and the dictionary lookup are fast for lists, but slow for numpy arrays. Since the signed projections are numpy arrays, your method outperforms the caching method. Thank you @jnothman :)

sklearn/neighbors/lsh_forest.py
((363 lines not shown))
+ max_depth = k
+ bin_queries.append(bin_query)
+
+ if is_radius:
+ return self._get_radius_neighbors(query, max_depth,
+ bin_queries, radius)
+
+ else:
+ candidates, ranks, distances = self._get_candidates(query,
+ max_depth,
+ bin_queries,
+ m)
+
+ return candidates[ranks[:m]], distances[:m]
+
+ def kneighbors(self, X, n_neighbors=None, return_distance=False):
@arjoly Owner
arjoly added a note

Can't we re-use code from the neighbors module?

I don't think so. This is completely different.

@arjoly Owner
arjoly added a note

Why is it completely different? You have the same interface and the code doesn't seem to be LSH-specific.

@ogrisel Owner
ogrisel added a note

The default value for return_distance in sklearn.neighbors.NearestNeighbors.kneighbors is True. We should do the same here for consistency.

@ogrisel Owner
ogrisel added a note

Also, the ordering of the returned values should be consistent: first the distances, then the indices (although this is not very intuitive, I think being consistent is more important).
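An illustrative stub of the signature and return order ogrisel asks for (all query logic elided; only the interface mirrors sklearn.neighbors.NearestNeighbors.kneighbors):

    import numpy as np

    class _Sketch:
        def kneighbors(self, X, n_neighbors=None, return_distance=True):
            n = n_neighbors or 1
            distances = np.zeros((len(X), n))        # placeholder values
            indices = np.zeros((len(X), n), dtype=int)
            if return_distance:
                return distances, indices            # distances first
            return indices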

sklearn/neighbors/lsh_forest.py
((407 lines not shown))
+ return np.array([neighbors]), np.array([distances])
+ else:
+ return np.array([neighbors])
+ else:
+ neighbors, distances = [], []
+ for i in range(X.shape[0]):
+ neighs, dists = self._query(X[i], n_neighbors)
+ neighbors.append(neighs)
+ distances.append(dists)
+
+ if return_distance:
+ return np.array(neighbors), np.array(distances)
+ else:
+ return np.array(neighbors)
+
+ def radius_neighbors(self, X, radius=None, return_distance=False):
@arjoly Owner
arjoly added a note

Can't we re-use code from the neighbors module?

Same thing.

sklearn/neighbors/lsh_forest.py
((453 lines not shown))
+ else:
+ return np.array([neighbors])
+ else:
+ neighbors, distances = [], []
+ for i in range(X.shape[0]):
+ neighs, dists = self._query(X[i], radius=radius,
+ is_radius=True)
+ neighbors.append(neighs)
+ distances.append(dists)
+
+ if return_distance:
+ return np.array(neighbors), np.array(distances)
+ else:
+ return np.array(neighbors)
+
+ def insert(self, X):
@arjoly Owner
arjoly added a note

It seems to be a partial_fit.

@arjoly Owner
arjoly added a note

Why don't we need the y?

Given we are trying to solve an Approximate Nearest Neighbor problem (where the criteria for how many points to return are decided at query time)... What is y exactly?

@arjoly Owner
arjoly added a note

In a supervised learning problem, you have the input matrix, denoted by X in scikit-learn, and the output that you are trying to predict, denoted by y. I would have thought that if you add new samples/points, you might also want to add the corresponding targets.

Is this method consistent with the sklearn.neighbors API?

@arjoly Owner
arjoly added a note

Hm, apparently this serves the same purpose as NearestNeighbors.

Could it be an algorithm of NearestNeighbors?

@GaelVaroquaux Owner
sklearn/neighbors/lsh_forest.py
((467 lines not shown))
+
+ def insert(self, X):
+ """
+ Inserts new data into the LSH Forest. Cost is proportional
+ to new total size, so additions should be batched.
+
+ Parameters
+ ----------
+ X: array_like, shape (n_samples, n_features)
+ New data point to be inserted into the LSH Forest.
+ """
+ if not hasattr(self, 'hash_functions_'):
+ raise ValueError("estimator should be fitted before"
+ " inserting.")
+
+ X = check_array(X)
@arjoly Owner
arjoly added a note

Does it work with sparse data?

Sparse matrix support is not available yet. The random projections would have to be fixed too: SparseRandomProjection will have to be used when the data is sparse.

@arjoly Owner
arjoly added a note

I have fixed the regression with sparse matrices. You can use either sparse random projection or Gaussian random projection with sparse matrices.

@ogrisel Owner
ogrisel added a note

SparseRandomProjection is not about sparse input data but about using a sparse projection matrix. The transform method of both estimators should work with sparse input data.
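For illustration, a minimal check that sparse input works at transform time (shapes and parameter values are assumptions):

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.random_projection import SparseRandomProjection

    X = csr_matrix(np.random.RandomState(0).rand(10, 100))
    proj = SparseRandomProjection(n_components=32, random_state=0).fit(X)
    projected = proj.transform(X)          # accepts sparse input
    if hasattr(projected, "toarray"):      # output may be sparse or dense
        projected = projected.toarray()
    hashes = projected > 0                 # sign-based binary hashes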

sklearn/neighbors/lsh_forest.py
((471 lines not shown))
+ to new total size, so additions should be batched.
+
+ Parameters
+ ----------
+ X: array_like, shape (n_samples, n_features)
+ New data point to be inserted into the LSH Forest.
+ """
+ if not hasattr(self, 'hash_functions_'):
+ raise ValueError("estimator should be fitted before"
+ " inserting.")
+
+ X = check_array(X)
+
+ if X.shape[1] != self._input_array.shape[1]:
+ raise ValueError("Number of features in X and"
+ " fitted array does not match.")
@arjoly Owner
arjoly added a note

I would use check_consistent_length from sklearn.utils.

sklearn/neighbors/lsh_forest.py
((275 lines not shown))
+ bin_queries[i],
+ self._left_mask[max_depth],
+ self._right_mask[max_depth])].tolist())
+ candidates = np.setdiff1d(candidates, total_candidates)
+ total_candidates = np.append(total_candidates, candidates)
+ ranks, distances = self._compute_distances(query, candidates)
+ m = np.searchsorted(distances, radius, side='right')
+ total_neighbors = np.append(total_neighbors,
+ candidates[ranks[:m]])
+ total_distances = np.append(total_distances, distances[:m])
+ ratio_within_radius = (total_neighbors.shape[0] /
+ float(total_candidates.shape[0]))
+ max_depth = max_depth - 1
+ return total_neighbors, total_distances
+
+ def _convert_to_hash(self, item, tree_n):
@arjoly Owner
arjoly added a note

item => y?
tree_n => I would pass the appropriate hash function directly.

@arjoly Owner
arjoly added a note

I have the feeling that this would be better as a function.

@robertlayton
Owner

Thanks for your review @arjoly, very thorough and many good points.

@arjoly
Owner

Happy that it helps. Sorry for the dumb questions/remarks, and thanks for your answers. I am learning new things while reading the implementation.

@coveralls


Coverage increased (+0.03%) when pulling 8dbb712 on maheshakya:lsh_forest into 1b2833a on scikit-learn:master.

sklearn/neighbors/lsh_forest.py
((270 lines not shown))
+ self._left_mask[max_depth],
+ self._right_mask[max_depth])].tolist())
+ candidates = np.setdiff1d(candidates, total_candidates)
+ total_candidates = np.append(total_candidates, candidates)
+ ranks, distances = self._compute_distances(query, candidates)
+ m = np.searchsorted(distances, radius, side='right')
+ total_neighbors = np.append(total_neighbors,
+ candidates[ranks[:m]])
+ total_distances = np.append(total_distances, distances[:m])
+ ratio_within_radius = (total_neighbors.shape[0] /
+ float(total_candidates.shape[0]))
+ max_depth = max_depth - 1
+ return total_neighbors, total_distances
+
+ def _convert_to_hash(self, item, tree_n):
+ """Converts item(a date point) into an integer.
@jnothman Owner
jnothman added a note

"date point" -> "data point"

sklearn/neighbors/lsh_forest.py
((278 lines not shown))
+ total_distances = np.append(total_distances, distances[:m])
+ ratio_within_radius = (total_neighbors.shape[0] /
+ float(total_candidates.shape[0]))
+ max_depth = max_depth - 1
+ return total_neighbors, total_distances
+
+ def _convert_to_hash(self, item, tree_n):
+ """Converts item(a date point) into an integer.
+
+ Value of the integer is the value represented by the
+ binary hashed value.
+ """
+ projections = np.array(np.dot(self.hash_functions_[tree_n],
+ item) > 0, dtype=int)
+
+ return np.packbits(projections).view(dtype='>u4')[0]
@jnothman Owner
jnothman added a note

Firstly, np.packbits(projections).view(dtype='>u4') should be pulled out as a helper function.

Secondly, repeated calls to this with [0] are going to be much slower than a vectorized operation (i.e. this packbits/view operation can return an array of 32-bit integers, given an [n_hashes, 32] binary input). Instead of passing in tree_n and looping over trees in the calling function, just get hashes for all trees at once here.

Thirdly, you say:

conversion from list to tuple and dictionary lookup is fast but when using numpy arrays, these operations are slow.

Note that list(my_array) is 10x slower than my_array.tolist(). But I still expect the above to be faster or as fast once you've properly vectorized the loops.
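A sketch of the vectorized helper jnothman describes (names are hypothetical):

    import numpy as np

    def _to_uint32(binary_projections):
        # rows of 32 bits -> 4 packed bytes -> one big-endian uint32 per row
        packed = np.packbits(binary_projections.astype(np.uint8), axis=1)
        return packed.view(dtype='>u4').ravel()

    # hash a point against the hash functions of all trees in one call
    projections = np.random.RandomState(0).randint(0, 2, size=(10, 32))
    hashes = _to_uint32(projections)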

sklearn/neighbors/lsh_forest.py
((178 lines not shown))
+ """Builds a single tree.
+
+ Here, it creates a sorted array of binary hashes.
+ Hashing is done on an array of data points.
+ This creates a binary hashes by getting the dot product of
+ input points and hash_function then transforming the projection
+ into a binary string array based on the sign (positive/negative)
+ of the projection.
+ """
+ grp = self._generate_hash_function()
+ hashes = np.array(grp.transform(self._input_array) > 0, dtype=int)
+ hash_function = grp.components_
+
+ binary_hashes = []