Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

is it bug in searching ? #45

Closed
yu-dk opened this issue Mar 31, 2015 · 4 comments
Closed

is it bug in searching ? #45

yu-dk opened this issue Mar 31, 2015 · 4 comments

Comments

@yu-dk
Copy link

yu-dk commented Mar 31, 2015

in order to implement it for my own use, I read the source code.
I guess there might be a logic bug -- unless I understand the idea in a wrong way..
I will try to explain it step by step.

during the _make_tree() step, a memory optimisation is used for nodes with n_descendants <= K, which means a sub-tree with only leave nodes can have a size as large as K.

_get_all_nns(), vector nns is used to store all potential candidates, and its size is limited as k * number_of_trees. where k is number of nearest neighbours to report.

at _get_all_nns(), the search for a tree stops when it meets a internal node with only leave nodes, and all of these leave nodes are added to vector nns.

the problem is that vector nns quickly reaches it size limitation after searching for a few trees, and the majority of the forest is not visited.

In my test example, I build 500 trees for 2000 nodes, vectors has dimension of 400, and I want to find top 5 NN. _get_all_nns() only visits 10 trees and stops, leaving the remaining 490 trees untouched.
Maybe that explains why annoy works faster than many other programs...

I am not totally convinced that it is a heuristic instead of a bug in the program...

Any comments are helpful.

@erikbern
Copy link
Collaborator

I looked at the code but I'm not sure if it's a bug... the heuristic n * roots.size() is very arbitrary though.

In your case I'm not surprised most of the trees aren't visited because you have so many trees per node. Annoy keeps a priority queue of the most promising nodes (initialized to the roots)

If n is much smaller than K I see why it doesn't visit all the trees though.

Either way the solution is probably to change the n * roots.size() to something that makes more sense. maybe max(_K, n) instead of n?

@erikbern
Copy link
Collaborator

Also you really shouldn't need 500 trees for 2000 nodes. It should be enough to have 5-10 trees

@yu-dk
Copy link
Author

yu-dk commented Mar 31, 2015

Thanks erikbern! I agree with your quick suggestions by now ^_^
ps: 500 trees is just an example, i should calculate the probability before choosing the parameters.

@erikbern
Copy link
Collaborator

See #80 – can now specify at query time now many nodes to inspect!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants