
Add dot product as a distance metric. #303

Merged
merged 8 commits into spotify:master on Aug 17, 2018

Conversation

@psobot (Member) commented Jun 27, 2018

This PR:

  • adds a DotProduct metric, using the negative of the dot product between two vectors as a pseudo-distance metric.
  • makes this metric available under the string name "dot" in the standard Annoy interface
  • adds tests for this new metric, similar to the existing tests for the Euclidean metric.

You might ask: why add dot when we already have angular, which is basically just dot divided by the vectors' magnitudes? Well, for some recommender systems (including some matrix factorization collaborative filters), the dot product between a user vector and an item vector is a prediction of the affinity between that user and item. (Here's a great example from @jfkirk.)
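
To make that concrete, here's a toy sketch (not code from this PR): angular distance ignores magnitude, so a perfectly aligned low-magnitude item can outrank a popular high-magnitude one, while the dot product ranks them the other way around.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cos(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

user = [1.0, 0.0]
item_a = [1.0, 0.0]  # perfectly aligned with the user, small magnitude
item_b = [5.0, 5.0]  # less aligned, but large magnitude (e.g. a popular item)

assert cos(user, item_a) > cos(user, item_b)  # angular prefers item_a
assert dot(user, item_b) > dot(user, item_a)  # dot prefers item_b
```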

@erikbern (Collaborator)

This looks nice! One major issue is that the dot product is not a distance function – I'm concerned that the splits will be really bad, for instance. Let me follow up with some more thoughts though.

@psobot (Member Author) commented Jun 27, 2018

Thanks @erikbern! Good point - I didn't change the way that we do splits, and just assumed that the existing logic would be good enough. The tests for precision still pass, but with real-world data we might end up with pretty poor accuracy. I'll try to get some realistic data to test on and add an understandable test for splits.

@psobot (Member Author) commented Jul 24, 2018

So, if the radio silence on my end didn't make it obvious: this has been surprisingly difficult to optimize Annoy for. As you expected, the splits are really bad and the overall efficiency is very low, to the point where callers need to crank up search_k to get any reasonable results.

I'm not sure where to go from here, unfortunately. I've tried overriding DotProduct::create_split to skew the splits toward nodes with higher magnitude, but this doesn't seem to have significantly affected the lookup performance on larger random datasets. Is there a way you know of to change the tree construction to ensure it's more likely that we scan through higher-magnitude nodes first during lookup?

@erikbern (Collaborator)

@psobot take a look at #44 – it contains some older notes that may be useful.

Maybe also https://www.benfrederickson.com/approximate-nearest-neighbours-for-recommender-systems/ although it looks like it refers to the same xbox paper

@psobot (Member Author) commented Jul 27, 2018

Thanks for that link @erikbern! After reading through that Xbox paper and trying a bunch of tests on random and real-world data, I've updated this PR with a solution that's a bit more involved, but is now much more efficient than just straight dot product.

The new method adds:

  • an optional number of internal_dimensions added to the index by Annoy when using the dot distance measure (it adds exactly one internal dimension, but this is configurable in case other metrics need to use this in the future)
  • a preprocess step, called from build, that populates this extra internal dimension before any queries are made
  • logic to all of the get_nns_* methods to pad out input vectors with zeros in the internal dimensions, as per the paper
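
The reduction from the Xbox paper that these steps implement can be sketched in a few lines of Python (illustrative helper names, not Annoy's actual internals): each indexed vector gains one extra component sqrt(max_norm^2 - |x|^2), queries are padded with a zero, and maximum-inner-product search becomes plain nearest-neighbour search on the augmented vectors.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def augment_items(items):
    # Append sqrt(max_norm^2 - |x|^2) so every augmented vector has norm max_norm.
    max_norm = max(norm(x) for x in items)
    return [x + [math.sqrt(max(0.0, max_norm ** 2 - norm(x) ** 2))] for x in items]

def augment_query(q):
    return q + [0.0]  # queries are padded with a zero in the extra dimension

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

items = [[0.5, 2.0], [1.5, 0.5], [2.0, 2.0], [0.1, 0.2]]
q = [1.0, 0.5]
aug, aq = augment_items(items), augment_query(q)

# |aq - aug_x|^2 = |q|^2 + max_norm^2 - 2*dot(q, x), so Euclidean (or angular)
# ranking on the augmented vectors matches max-inner-product on the originals.
by_mips = sorted(range(len(items)), key=lambda i: -dot(q, items[i]))
by_nn = sorted(range(len(items)), key=lambda i: dist2(aq, aug[i]))
assert by_mips == by_nn
```

Since every augmented item vector ends up with the same norm, angular and Euclidean splits both behave sensibly on the augmented space, which is presumably why deferring to Angular's split logic works here.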

My tests show that on random Gaussian-distributed data, this method actually seems to outperform regular cosine distance for accuracy when fetching 10 neighbours (using Kendall tau rank correlation as a metric):

Average Kendall tau accuracy:

search_k     cosine    Xbox trick    Annoy Dot
    10        7.01%       12.28%       12.28%
    50        9.28%       18.72%       18.72%
   100       13.63%       24.68%       24.68%
   200       23.74%       35.03%       35.03%
   500       54.44%       60.78%       60.78%
  1000       85.14%       81.28%       81.28%

@erikbern (Collaborator)

Very cool!

I don't think you need the internal_dimensions thing actually. If you look at the different distance metrics, note that they redefine the Node struct. For instance the Euclidean distance metric uses that to store an extra element that denotes the offset of the plane for each split. So you could just add an extra element to the Node definition for DotProduct and use that.

Separately I suspect the preprocessing can be avoided and that you can move that logic into the create_split but that's more speculative and I have to think a bit more about it.

@psobot (Member Author) commented Jul 27, 2018

I don't think you need the internal_dimensions thing actually. If you look at the different distance metrics, note that they redefine the Node struct. For instance the Euclidean distance metric uses that to store an extra element that denotes the offset of the plane for each split. So you could just add an extra element to the Node definition for DotProduct and use that.

That's one option, although it saves a lot of custom code to just tack on an extra dimension and defer to Angular for create_split and other functions. (Adding an extra element to the Node definition would require us to override those functions to consider the extra element whenever we do math on an entire vector.) If there's a way to do that without adding too much extra complexity though, I'd be interested.

Separately I suspect the preprocessing can be avoided and that you can move that logic into the create_split but that's more speculative and I have to think a bit more about it.

I thought about doing so, but the preprocessing requires a first pass over all of the nodes to compute a global max_norm for the universe. create_split gets called on successively smaller lists of nodes, meaning we'd have to pass along this global max_norm every time we call create_split (and two_means), and if we ever see the same node twice in create_split, we'd end up re-computing the additional dimension multiple times.

@erikbern (Collaborator)

That's one option, although it saves a lot of custom code to just tack on an extra dimension and defer to Angular for create_split and other functions. (Adding an extra element to the Node definition would require us to override those functions to consider the extra element whenever we do math on an entire vector.) If there's a way to do that without adding too much extra complexity though, I'd be interested.

I see that point but I think that's possible to do by just overriding D::margin and D::distance or something similar right? The internal_dimensions is a bit hacky to me – very unlikely to generalize to any other distance metric. I'd rather keep it contained in the DotProduct class rather than making it a high level concept

@erikbern (Collaborator)

I thought about doing so, but the preprocessing requires a first pass over all of the nodes to compute a global max_norm for the universe. create_split gets called on successively smaller lists of nodes, meaning we'd have to pass along this global max_norm every time we call create_split (and two_means), and if we ever see the same node twice in create_split, we'd end up re-computing the additional dimension multiple times.

Let me think some more about that. I think it's possible to make the max norm local to each split rather than global but not sure yet.

src/annoylib.h Outdated
memcpy(v_node->v, v, sizeof(T) * (_f - D::internal_dimensions()));
memset(&v_node->v[_f - D::internal_dimensions()], 0, sizeof(T) * D::internal_dimensions());

Node* v_node = (Node *)calloc(_s, 1); // TODO: avoid
Collaborator

probably shouldn't mix calloc and malloc :)

@erikbern (Collaborator)

nice, starting to look good

src/annoylib.h Outdated
@@ -833,14 +991,15 @@ template<typename S, typename T, typename Distance, typename Random>
}

void _get_all_nns(const T* v, size_t n, size_t search_k, vector<S>* result, vector<T>* distances) {
Node* v_node = (Node *)malloc(_s); // TODO: avoid
memcpy(v_node->v, v, sizeof(T)*_f);
Node* v_node = (Node *)calloc(_s, 1); // TODO: avoid
Collaborator

please use malloc here to make it consistent

Member Author

Fixed!

The reason for changing this to calloc was to zero-out the entire Node struct first, but you're right - it's better to be consistent and explicit. I've added a zero_value method to Base to allow us to set sane defaults for any metrics that require them - in this case, after malloc'ing a new node, we need to set the dot_factor of the node to zero.

src/annoylib.h Outdated
if (search_k == (size_t)-1)
search_k = n * _roots.size(); // slightly arbitrary default value
if (search_k == (size_t)-1) {
search_k = D::default_search_k(n, _roots.size());
Collaborator

why would different distance functions have different implementations? I would rather make it consistent

Member Author

👍 Removed. This was to make the test_precision tests easier, as dot is a bit less precise than other metrics.

src/annoylib.h Outdated
@@ -859,7 +1018,7 @@ template<typename S, typename T, typename Distance, typename Random>
const S* dst = nd->children;
nns.insert(nns.end(), dst, &dst[nd->n_descendants]);
} else {
T margin = D::margin(nd, v, _f);
T margin = D::margin(nd, v_node->v, _f);
Collaborator

not that it matters but why did you change this?

Member Author

Ah - this was a holdover from the previous iteration where we used extra_dimensions. This is no longer needed. 👍

if (PyObject_Size(v) == -1) {
char buf[256];
snprintf(buf, 256, "Expected an iterable, got an object of type \"%s\"", v->ob_type->tp_name);
PyErr_SetString(PyExc_ValueError, buf);
Collaborator

nice!

if (PyObject_Size(v) != f) {
PyErr_SetString(PyExc_IndexError, "Vector has wrong length");
char buf[128];
snprintf(buf, 128, "Vector has wrong length (expected %d, got %ld)", f, PyObject_Size(v));
Collaborator

also nice

def similarity(a, b):
    # Could replace this with kendall-tau if we're comfortable
    # bringing in scipy as a test dependency.
    return float(len(set(a) & set(b))) / float(len(set(a) | set(b)))
Collaborator

this looks like jaccard similarity to me? anyway why is that needed? in all the other tests we use recall

Member Author

Switched this to use recall like the other tests.
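
For reference, the two candidate metrics can be sketched like this (hypothetical helper names; the recall definition matches what the other tests measure, i.e. the fraction of true neighbours found):

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def recall(approx, exact):
    # fraction of the true nearest neighbours that the index returned
    return len(set(approx) & set(exact)) / len(set(exact))

exact = [1, 2, 3, 4]   # ground-truth nearest neighbours
approx = [1, 2, 3, 9]  # neighbours returned by the index

assert recall(approx, exact) == 0.75  # 3 of 4 true neighbours found
assert jaccard(approx, exact) == 0.6  # 3 shared / 5 in the union
```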

src/annoylib.h Outdated
template<typename S, typename T>
static inline T distance(const Node<S, T>* x, const Node<S, T>* y, int f) {
return -dot(x->v, y->v, f);
}
Collaborator

shouldn't this just be the euclidean distance between the vectors including the extra element?

Member Author

I'm not entirely sure, tbh. Both perform fairly well, but if we return dot as the actual distance measure, then callers will get back distances that are actually dot products, which might be what callers expect.

I tried switching this to plain cosine distance, and results are pretty similar regardless of which distance metric we use:

"dot" with cosine as `distance` (incl extra element): 
Recall at 10 is 70.21%
Recall at 100 is 96.50%
Recall at 1000 is 100.00%

"dot" with dot as `distance`: 
Recall at 10 is 69.55%
Recall at 100 is 96.46%
Recall at 1000 is 100.00%

(The reason changing this one method doesn't break everything is that DotProduct::margin still takes the dot_factor into account.)

Collaborator

ok i'm very confused now, let me take a deeper look later today. i think we use the terminology "distance" for multiple separate things.

Contributor

(apologies for randomly popping up here and writing an essay)

@erikbern, I don't understand the rationale behind using the Euclidean distance here?


To try to clarify this from an API standpoint:

  • The distance function is exposed to the users as follows:
t = AnnoyIndex(2, 'dot')
t.add_item(0, [0, 1])
t.add_item(1, [1, 1])
d = t.get_distance(0, 1)

where get_distance is defined as follows:

T get_distance(S i, S j) {
  return D::normalized_distance(D::distance(_get(i), _get(j), _f));
}
  • The doc does not specify anything special about the distance: "Returns the distance between items i and j."

  • Therefore, I believe the user would expect that t.get_distance(v_1, v_2) would use the distance specified in AnnoyIndex(f, {DIST}).
    So in this case it would be dot(vectors[v_1], vectors[v_2]).


One issue is that dot(x, y) is more a similarity than a distance. I assume -dot(...) is more convenient than C - dot(...) since the dot is not normalized.

I guess in this case we could return either -dot(original_vec_i, original_vec_j) or 1 - cos(augmented_vec_i, augmented_vec_j).
(where original_vec_i is of dimension N and augmented_vec_i is of dimension N+1).

Given the API example above, returning the dot value seems to make more sense to me than returning the cos value.
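
A quick numerical check of why the choice matters (toy vectors, hypothetical helper names): the two candidates are not even rank-equivalent for item-to-item distances, because the product of the two extra components feeds into the augmented cosine.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

vecs = [[1.0, 0.0], [0.1, 0.0], [2.0, 0.0]]
max_norm = max(math.sqrt(dot(v, v)) for v in vecs)
# augmented_vec_i = original_vec_i plus an extra sqrt(max_norm^2 - |v|^2) component
aug = [v + [math.sqrt(max(0.0, max_norm ** 2 - dot(v, v)))] for v in vecs]

def neg_dot(i, j):  # candidate 1: -dot on the original vectors
    return -dot(vecs[i], vecs[j])

def cos_dist(i, j):  # candidate 2: 1 - cos on the augmented vectors
    denom = math.sqrt(dot(aug[i], aug[i])) * math.sqrt(dot(aug[j], aug[j]))
    return 1.0 - dot(aug[i], aug[j]) / denom

# Relative to item 0, -dot says item 2 is nearer, while the augmented cosine
# says item 1 is nearer: its large extra component dominates the numerator.
assert neg_dot(0, 2) < neg_dot(0, 1)
assert cos_dist(0, 1) < cos_dist(0, 2)
```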


=> @erikbern, @psobot: WDYT?

@erikbern (Collaborator) commented Aug 7, 2018

@yonromai the reason i originally brought up euclidean distance was that i thought the distance method was used internally somewhere for the splits.

if it's only for external consumption then i don't have any strong feelings, i guess positive dot product is fine. just need to document it since "distance" is a bit of a misnomer

src/annoylib.h Outdated

template<typename T>
static inline T normalized_distance(T distance) {
return distance;
Collaborator

doesn't entirely make sense to me that this returns the negative dot product. maybe should just return the positive dot product? either way it's not a "distance" in the correct meaning of the word

https://en.wikipedia.org/wiki/Metric_(mathematics)#Definition

Member Author

Good point - if we leave this as returning the dot product distance directly, we can flip the sign to return the positive dot product.

src/annoylib.h Outdated
@@ -586,7 +733,8 @@ template<typename S, typename T, typename Distance, typename Random>
n->children[1] = 0;
n->n_descendants = 1;

for (int z = 0; z < _f; z++)
memset(n->v, 0, _f * sizeof(T));
Collaborator

don't think this memset is needed
i guess the for loop could just be replaced by a memcpy though

Member Author

Whoops, this was another holdover from using extra_dimensions. I've changed this to a straight memcpy.

src/annoylib.h Outdated
@@ -663,7 +815,7 @@ template<typename S, typename T, typename Distance, typename Random>
// we have mmapped data
close(_fd);
off_t size = _n_nodes * _s;
munmap(_nodes, size);
munmap(_nodes, (size_t) size);
Collaborator

why did you change this?

if anything should probably change off_t size to size_t size but i don't remember the difference between off_t and size_t

Collaborator

might just be some win specific thing

Member Author

Changed this to try and eliminate a bunch of clang warnings, but I can undo this.

src/annoylib.h:846:12: warning: implicit conversion loses integer precision: 'off_t' (aka 'long long') to 'size_t' (aka 'unsigned long') [-Wshorten-64-to-32]

@erikbern (Collaborator)

This is approved except for my comments

@psobot (Member Author) commented Aug 1, 2018

Thanks for the review, @erikbern! And my apologies - I pushed this code at the end of the day on Friday without being completely sure it was ready for review. I'll address each comment individually.

@erikbern (Collaborator) commented Aug 7, 2018

sorry forgot about this one. will go through shortly – i promise!

@psobot (Member Author) commented Aug 14, 2018

Hey @erikbern - friendly ping on this. We've got some internal customers interested in using this new feature, so I'm planning to merge this by Friday unless you have any final objections. (Also happy to hand-deliver some cookies to your office if that'd incentivize you to push this PR over the line. 🙂)

@erikbern (Collaborator)

haha i forget that this repo is under spotify and you can merge whatever you want :)

i'll try to find some time tonight!

}
};

struct Angular : Base {
Collaborator

this is implementation inheritance, which i'm not a super big fan of. but not blocking

src/annoylib.h Outdated

template<typename S, typename T>
static inline T margin(const Node<S, T>* n, const T* y, int f) {
return dot(n->v, y, f);
Collaborator

shouldn't this include the dot_factor?

Member Author

Good catch - yes, it should! This is fixed.

src/annoylib.h Outdated
return dot(n->v, y, f);
}

template<typename S, typename T, typename Random>
Collaborator

isn't this a reimplementation of the superclass method?

Member Author

It is, but I'm not sure we can omit it, as Angular::Node and DotProduct::Node are incompatible types so calling the superclass method directly won't work.

@erikbern (Collaborator)

This looks good. I'm curious what would happen if you update the margin method as per my suggestion – would be interesting to see if it has an impact on recall!

Side note but Annoy could really use a redesign from scratch at some point...

@erikbern (Collaborator)

Also this is just tangentially related but what's the purpose of this PR? I did a fair amount of benchmarking and even though dot product logically makes sense for music recommendation, cosine always performed vastly better. It was always a bit of a mystery to me

@psobot (Member Author) commented Aug 17, 2018

I'm curious what would happen if you update the margin method as per my suggestion – would be interesting to see if it has an impact on recall!

Interestingly enough, recall went down by about 2% (at 10 items) in my tests after including dot_factor in the margin calculation. I'm not sure if that means we should leave it out entirely, but if we're trying to match the original Xbox paper, we should include dot_factor for correctness.

@erikbern (Collaborator)

@psobot so should i merge this?

@erikbern (Collaborator)

Interestingly enough, recall went down by about 2% (at 10 items) in my tests after including dot_factor in the margin calculation. I'm not sure if that means we should leave it out entirely, but if we're trying to match the original Xbox paper, we should include dot_factor for correctness.

that's kind of random btw, hopefully within the level of noise

@psobot (Member Author) commented Aug 17, 2018

@erikbern Merge away, I'm ready if you are. 👍 Thanks for all the review!

@erikbern erikbern merged commit 8fc84a8 into spotify:master Aug 17, 2018
@erikbern (Collaborator)

💥
