
Move features into tree for performance #3

Closed · vadixidav opened this issue Apr 20, 2019 · 4 comments

Comments

@vadixidav
Member

The current performance is unacceptable. After profiling the code, it is clear that there are issues with cache locality. Here is the closure taking up 71.85% of all the time:

// Hamming distance: fetch the stored feature via `lookup`, XOR it with
// the query feature, then count the set bits.
let lookup_distance = |index| (lookup(index) ^ feature).count_ones();

This performs a single trivial operation, so it looks innocuous, but the way hwt allows the user to store features outside of the tree is actually working against it: every distance computation has to chase an index into user-owned memory, which makes the tree much slower than linear search (except perhaps on huge datasets of more than a billion features).

The API should be changed to give back features rather than indices. The user can then use the returned features to look up their data afterwards, which should have a negligible performance impact, since only a few lookups are needed (k lookups for k-NN).

This has two negative impacts:

  • This could make the tree slightly larger.
  • This will make it harder for the user to store the data however they want.

Seeing as performance is the most important thing here, that takes precedence over these issues.
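
In terms of usage, the new API might look something like the sketch below. The `Tree` type, its `nearest` method, and the linear scan inside it are hypothetical stand-ins for illustration, not the crate's actual implementation:

use std::collections::HashMap;

// Hypothetical sketch: features live inline in the tree, and searches
// return the features themselves rather than indices.
struct Tree {
    features: Vec<u128>,
}

impl Tree {
    // Returns the k nearest features to `query` by Hamming distance.
    // (A linear scan stands in for the real tree search here.)
    fn nearest(&self, query: u128, k: usize) -> Vec<u128> {
        let mut sorted = self.features.clone();
        sorted.sort_by_key(|f| (f ^ query).count_ones());
        sorted.truncate(k);
        sorted
    }
}

fn main() {
    let tree = Tree { features: vec![0b1011, 0b0011, 0b1111] };
    // The user keeps their own feature -> payload map and performs only
    // k lookups after the search, which is negligible for k-NN.
    let mut payloads: HashMap<u128, &str> = HashMap::new();
    payloads.insert(0b1011, "image A");
    payloads.insert(0b0011, "image B");
    payloads.insert(0b1111, "image C");
    for feature in tree.nearest(0b1010, 2) {
        println!("{:04b} -> {:?}", feature, payloads.get(&feature));
    }
}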

@abonander

What about identical features that refer to distinct items?

@vadixidav
Member Author

@abonander My bad for getting back a bit late; I was visiting family for Easter. Currently, I plan to have the user disambiguate that themselves. It isn't clear what one should do when two features collide. For instance, you might just throw one of them out if you don't expect overlap (like descriptors over a set of completely unique features). In other cases the user might map one feature to multiple entries, as in photogrammetry applications where you might want geometric verification filtering via RANSAC to match a repeating pattern. There would be a problem if two features collided and you were computing the Lowe's ratio from the two features coming out of hwt, in which case you might miss the duplicate, but that could technically be checked when you get the feature back if it has duplicates (although at that point maybe you would just remove it from the tree).

Before, it was possible to do this in the tree by using the u32 identifiers, but it won't be possible anymore, since that approach really wasn't good for cache locality. Alternatively, I could add a helper wrapper over the tree that maps features back into a vector of values associated with them. For the time being, I am just trying to do everything I can to boost the performance, and I have learned that what I was doing did not appease the cache on my Intel processor. It seems obvious in retrospect, but I didn't realize that cache locality would destroy the performance like that. I am still working on further improvements to branch prediction and cache locality, after which I will make another release.
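
A sketch of what such a helper wrapper could look like; `FeatureMap` is a hypothetical name for illustration and not part of hwt:

use std::collections::HashMap;

// Hypothetical wrapper that maps each feature back to every value
// associated with it, so identical features can refer to distinct items.
struct FeatureMap<V> {
    entries: HashMap<u128, Vec<V>>,
}

impl<V> FeatureMap<V> {
    fn new() -> Self {
        FeatureMap { entries: HashMap::new() }
    }

    // Associates another value with `feature`; duplicates accumulate.
    fn insert(&mut self, feature: u128, value: V) {
        self.entries.entry(feature).or_insert_with(Vec::new).push(value);
    }

    // All values stored under exactly this feature.
    fn get(&self, feature: u128) -> &[V] {
        self.entries.get(&feature).map(Vec::as_slice).unwrap_or(&[])
    }
}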

@vadixidav
Member Author

Actually, the way it is being coded now will still allow the Lowe's ratio to be computed, because a feature can show up twice in the tree, so it can be returned twice.
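
For reference, a minimal sketch of that ratio test over the two nearest distances; the 0.8 threshold is the conventional value from Lowe's paper, assumed here rather than taken from hwt:

// Lowe's ratio test over the two nearest Hamming distances: accept the
// match only if the best distance is clearly smaller than the second best.
fn passes_lowe_ratio(best: u32, second_best: u32, ratio: f32) -> bool {
    (best as f32) < ratio * (second_best as f32)
}

fn main() {
    assert!(passes_lowe_ratio(10, 40, 0.8));  // unambiguous match: accept
    assert!(!passes_lowe_ratio(30, 32, 0.8)); // ambiguous match: reject
    // If the same feature appears twice in the tree, it comes back twice,
    // so the ratio is computed over the true two nearest matches.
}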

@vadixidav
Member Author

This has been finished (along with other optimizations). Benchmarks are on the README.md now.

Unfortunately, after adding more tests I found that the tree broke down at greater depths, and so the fixed tree only beats linear search once you reach over 1 million features. Keep in mind that the original paper was working with a linear search that took ~2.5 seconds, while I am finding that linear search on modern CPUs takes a tiny ~80 milliseconds for the same dataset size. My profiling doesn't leave much room for optimization, but SIMD (using packed_simd) is definitely an option moving forward, and it would speed up the relevant critical sections.
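
As a rough illustration of where SIMD could help, here is a sketch of a Hamming distance over a 256-bit feature using packed_simd; the [u64; 4] feature layout is an assumption, and packed_simd requires nightly Rust:

use packed_simd::u64x4;

// XOR the two features lane by lane, popcount each lane, then sum the
// four lane counts horizontally.
fn hamming_simd(a: [u64; 4], b: [u64; 4]) -> u32 {
    let va = u64x4::from_slice_unaligned(&a);
    let vb = u64x4::from_slice_unaligned(&b);
    (va ^ vb).count_ones().wrapping_sum() as u32
}

fn main() {
    let a = [0xFFFF_0000_FFFF_0000, 0, 0, 0];
    let b = [0x0000_0000_FFFF_0000, 0, 0, 0];
    assert_eq!(hamming_simd(a, b), 16);
}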

This issue has been addressed, but the performance is still lacking, so I intend to create a tracking issue for improving performance as a follow-up to this.
