
Add dot product as a distance metric. #303

Merged
merged 8 commits into spotify:master on Aug 17, 2018

Conversation

@psobot (Member) commented Jun 27, 2018

This PR:

  • adds a DotProduct metric, using the negative of the dot product between two vectors as a pseudo-distance metric.
  • makes this metric available under the string name "dot" in the standard Annoy interface
  • adds tests for this new metric, similar to the existing tests for the Euclidean metric.

You might ask: why add dot when we already have angular, which is basically just dot divided by the vectors' magnitudes? Well, for some recommender systems (including some matrix factorization collaborative filters), the dot product between a user vector and an item vector is a prediction of the affinity between that user and item. (Here's a great example from @jfkirk.)
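
To make that concrete, here's a toy sketch (not code from this PR): angular distance ignores magnitude, so a perfectly aligned low-magnitude item can outrank a popular high-magnitude one, while the dot product ranks them the other way around.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cos(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

user = [1.0, 0.0]
item_a = [1.0, 0.0]  # perfectly aligned with the user, small magnitude
item_b = [5.0, 5.0]  # less aligned, but large magnitude (e.g. a popular item)

assert cos(user, item_a) > cos(user, item_b)  # angular prefers item_a
assert dot(user, item_b) > dot(user, item_a)  # dot prefers item_b
```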

@erikbern (Collaborator)

This looks nice! One major issue is that the dot product is not a distance function – I'm concerned that the splits will be really bad, for instance. Let me follow up with some more thoughts though.

@psobot (Member Author) commented Jun 27, 2018

Thanks @erikbern! Good point - I didn't change the way that we do splits, and just assumed that the existing logic would be good enough. The tests for precision still pass, but with real-world data we might end up with pretty poor accuracy. I'll try to get some realistic data to test on and add an understandable test for splits.

@psobot (Member Author) commented Jul 24, 2018

So, if the radio silence on my end didn't make it obvious: this has been surprisingly difficult to optimize Annoy for. As you expected, the splits are really bad and the overall efficiency is very low, to the point where callers need to crank up search_k to get any reasonable results.

I'm not sure where to go from here, unfortunately. I've tried overriding DotProduct::create_split to skew the splits toward nodes with higher magnitude, but this doesn't seem to have significantly affected the lookup performance on larger random datasets. Is there a way you know of to change the tree construction to ensure it's more likely that we scan through higher-magnitude nodes first during lookup?

@erikbern (Collaborator)

@psobot take a look at #44 – it contains some older notes that may be useful.

Maybe also https://www.benfrederickson.com/approximate-nearest-neighbours-for-recommender-systems/ although it looks like it refers to the same xbox paper

@psobot (Member Author) commented Jul 27, 2018

Thanks for that link @erikbern! After reading through that Xbox paper and trying a bunch of tests on random and real-world data, I've updated this PR with a solution that's a bit more involved, but is now much more efficient than just straight dot product.

The new method adds:

  • an optional number of internal_dimensions added to the index by Annoy when using the dot distance measure (it adds exactly one internal dimension, but this is configurable in case other metrics need to use this in the future)
  • a preprocess step, called from build, that populates this extra internal dimension before any queries are made
  • logic to all of the get_nns_* methods to pad out input vectors with zeros in the internal dimensions, as per the paper
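
The reduction from the Xbox paper that these steps implement can be sketched in a few lines of Python (illustrative helper names, not Annoy's actual internals): each indexed vector gains one extra component sqrt(max_norm^2 - |x|^2), queries are padded with a zero, and maximum-inner-product search becomes plain nearest-neighbour search on the augmented vectors.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def augment_items(items):
    # Append sqrt(max_norm^2 - |x|^2) so every augmented vector has norm max_norm.
    max_norm = max(norm(x) for x in items)
    return [x + [math.sqrt(max(0.0, max_norm ** 2 - norm(x) ** 2))] for x in items]

def augment_query(q):
    return q + [0.0]  # queries are padded with a zero in the extra dimension

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

items = [[0.5, 2.0], [1.5, 0.5], [2.0, 2.0], [0.1, 0.2]]
q = [1.0, 0.5]
aug, aq = augment_items(items), augment_query(q)

# |aq - aug_x|^2 = |q|^2 + max_norm^2 - 2*dot(q, x), so Euclidean (or angular)
# ranking on the augmented vectors matches max-inner-product on the originals.
by_mips = sorted(range(len(items)), key=lambda i: -dot(q, items[i]))
by_nn = sorted(range(len(items)), key=lambda i: dist2(aq, aug[i]))
assert by_mips == by_nn
```

Since every augmented item vector ends up with the same norm, angular and Euclidean splits both behave sensibly on the augmented space, which is presumably why deferring to Angular's split logic works here.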

My tests show that on random Gaussian-distributed data, this method actually seems to outperform regular cosine distance for accuracy when fetching 10 neighbours (using Kendall tau rank correlation as a metric):

Average Kendall tau accuracy:

search_k     cosine    Xbox trick    Annoy Dot
    10        7.01%       12.28%       12.28%
    50        9.28%       18.72%       18.72%
   100       13.63%       24.68%       24.68%
   200       23.74%       35.03%       35.03%
   500       54.44%       60.78%       60.78%
  1000       85.14%       81.28%       81.28%

@erikbern (Collaborator)

Very cool!

I don't think you need the internal_dimensions thing actually. If you look at the different distance metrics, note that they redefine the Node struct. For instance the Euclidean distance metric uses that to store an extra element that denotes the offset of the plane for each split. So you could just add an extra element to the Node definition for DotProduct and use that.

Separately I suspect the preprocessing can be avoided and that you can move that logic into the create_split but that's more speculative and I have to think a bit more about it.

@psobot (Member Author) commented Jul 27, 2018

I don't think you need the internal_dimensions thing actually. If you look at the different distance metrics, note that they redefine the Node struct. For instance the Euclidean distance metric uses that to store an extra element that denotes the offset of the plane for each split. So you could just add an extra element to the Node definition for DotProduct and use that.

That's one option, although it saves a lot of custom code to just tack on an extra dimension and defer to Angular for create_split and other functions. (Adding an extra element to the Node definition would require us to override those functions to consider the extra element whenever we do math on an entire vector.) If there's a way to do that without adding too much extra complexity though, I'd be interested.

Separately I suspect the preprocessing can be avoided and that you can move that logic into the create_split but that's more speculative and I have to think a bit more about it.

I thought about doing so, but the preprocessing requires a first pass over all of the nodes to compute a global max_norm for the universe. create_split gets called on successively smaller lists of nodes, meaning we'd have to pass along this global max_norm every time we call create_split (and two_means), and if we ever see the same node twice in create_split, we'd end up re-computing the additional dimension multiple times.

@erikbern (Collaborator)

That's one option, although it saves a lot of custom code to just tack on an extra dimension and defer to Angular for create_split and other functions. (Adding an extra element to the Node definition would require us to override those functions to consider the extra element whenever we do math on an entire vector.) If there's a way to do that without adding too much extra complexity though, I'd be interested.

I see that point but I think that's possible to do by just overriding D::margin and D::distance or something similar right? The internal_dimensions is a bit hacky to me – very unlikely to generalize to any other distance metric. I'd rather keep it contained in the DotProduct class rather than making it a high level concept

@erikbern (Collaborator)

I thought about doing so, but the preprocessing requires a first pass over all of the nodes to compute a global max_norm for the universe. create_split gets called on successively smaller lists of nodes, meaning we'd have to pass along this global max_norm every time we call create_split (and two_means), and if we ever see the same node twice in create_split, we'd end up re-computing the additional dimension multiple times.

Let me think some more about that. I think it's possible to make the max norm local to each split rather than global but not sure yet.

src/annoylib.h Outdated
memcpy(v_node->v, v, sizeof(T) * (_f - D::internal_dimensions()));
memset(&v_node->v[_f - D::internal_dimensions()], 0, sizeof(T) * D::internal_dimensions());

Node* v_node = (Node *)calloc(_s, 1); // TODO: avoid
Collaborator

probably shouldn't mix calloc and malloc :)

@erikbern (Collaborator)

nice, starting to look good

src/annoylib.h Outdated
@@ -833,14 +991,15 @@ template<typename S, typename T, typename Distance, typename Random>
}

void _get_all_nns(const T* v, size_t n, size_t search_k, vector<S>* result, vector<T>* distances) {
Node* v_node = (Node *)malloc(_s); // TODO: avoid
memcpy(v_node->v, v, sizeof(T)*_f);
Node* v_node = (Node *)calloc(_s, 1); // TODO: avoid
Collaborator

please use malloc here to make it consistent

Member Author

Fixed!

The reason for changing this to calloc was to zero-out the entire Node struct first, but you're right - it's better to be consistent and explicit. I've added a zero_value method to Base to allow us to set sane defaults for any metrics that require them - in this case, after malloc'ing a new node, we need to set the dot_factor of the node to zero.

src/annoylib.h Outdated
if (search_k == (size_t)-1)
search_k = n * _roots.size(); // slightly arbitrary default value
if (search_k == (size_t)-1) {
search_k = D::default_search_k(n, _roots.size());
Collaborator

why would different distance functions have different implementations? I would rather make it consistent

Member Author

👍 Removed. This was to make the test_precision tests easier, as dot is a bit less precise than other metrics.

src/annoylib.h Outdated
@@ -859,7 +1018,7 @@ template<typename S, typename T, typename Distance, typename Random>
const S* dst = nd->children;
nns.insert(nns.end(), dst, &dst[nd->n_descendants]);
} else {
T margin = D::margin(nd, v, _f);
T margin = D::margin(nd, v_node->v, _f);
Collaborator

not that it matters but why did you change this?

Member Author

Ah - this was a holdover from the previous iteration where we used extra_dimensions. This is no longer needed. 👍

if (PyObject_Size(v) == -1) {
char buf[256];
snprintf(buf, 256, "Expected an iterable, got an object of type \"%s\"", v->ob_type->tp_name);
PyErr_SetString(PyExc_ValueError, buf);
Collaborator

nice!

if (PyObject_Size(v) != f) {
PyErr_SetString(PyExc_IndexError, "Vector has wrong length");
char buf[128];
snprintf(buf, 128, "Vector has wrong length (expected %d, got %ld)", f, PyObject_Size(v));
Collaborator

also nice

def similarity(a, b):
    # Could replace this with kendall-tau if we're comfortable
    # bringing in scipy as a test dependency.
    return float(len(set(a) & set(b))) / float(len(set(a) | set(b)))
Collaborator

this looks like jaccard similarity to me? anyway why is that needed? in all the other tests we use recall

Member Author

Switched this to use recall like the other tests.
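
For reference, the two candidate metrics can be sketched like this (hypothetical helper names; the recall definition matches what the other tests measure, i.e. the fraction of true neighbours found):

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def recall(approx, exact):
    # fraction of the true nearest neighbours that the index returned
    return len(set(approx) & set(exact)) / len(set(exact))

exact = [1, 2, 3, 4]   # ground-truth nearest neighbours
approx = [1, 2, 3, 9]  # neighbours returned by the index

assert recall(approx, exact) == 0.75  # 3 of 4 true neighbours found
assert jaccard(approx, exact) == 0.6  # 3 shared / 5 in the union
```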

src/annoylib.h Outdated
template<typename S, typename T>
static inline T distance(const Node<S, T>* x, const Node<S, T>* y, int f) {
return -dot(x->v, y->v, f);
}
Collaborator

shouldn't this just be the euclidean distance between the vectors including the extra element?

Member Author

I'm not entirely sure, tbh. Both perform fairly well, but if we return dot as the actual distance measure, then callers will get back distances that are actually dot products, which might be what callers expect.

I tried switching this to plain cosine distance, and results are pretty similar regardless of which distance metric we use:

"dot" with cosine as `distance` (incl extra element): 
Recall at 10 is 70.21%
Recall at 100 is 96.50%
Recall at 1000 is 100.00%

"dot" with dot as `distance`: 
Recall at 10 is 69.55%
Recall at 100 is 96.46%
Recall at 1000 is 100.00%

(The reason changing this one method doesn't break everything is that DotProduct::margin still takes the dot_factor into account.)

Collaborator

ok i'm very confused now, let me take a deeper look later today. i think we use the terminology "distance" for multiple separate things.

Contributor

(apologies for randomly popping up here and writing an essay)

@erikbern, I don't understand the rationale behind using the Euclidean distance here?


To try to clarify this from an API standpoint:

  • The distance function is exposed to the users as follows:
t = AnnoyIndex(2, 'dot')
t.add_item(0, [0, 1])
t.add_item(1, [1, 1])
d = t.get_distance(0, 1)

where get_distance is defined as follows:

T get_distance(S i, S j) {
  return D::normalized_distance(D::distance(_get(i), _get(j), _f));
}
  • The doc does not specify anything special about the distance: "Returns the distance between items i and j."

  • Therefore, I believe the user would expect that t.get_distance(v_1, v_2) would use the distance specified in AnnoyIndex(f, {DIST}).
    So in this case it would be dot(vectors[v_1], vectors[v_2]).


One issue is that dot(x, y) is more a similarity than a distance. I assume -dot(...) is more convenient than C - dot(...) since the dot is not normalized.

I guess in this case we could return either -dot(original_vec_i, original_vec_j) or 1 - cos(augmented_vec_i, augmented_vec_j).
(where original_vec_i is of dimension N and augmented_vec_i is of dimension N+1).

Given the API example above, returning the dot value seems to make more sense to me than returning the cos value.
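
A quick numerical check of why the choice matters (toy vectors, hypothetical helper names): the two candidates are not even rank-equivalent for item-to-item distances, because the product of the two extra components feeds into the augmented cosine.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

vecs = [[1.0, 0.0], [0.1, 0.0], [2.0, 0.0]]
max_norm = max(math.sqrt(dot(v, v)) for v in vecs)
# augmented_vec_i = original_vec_i plus an extra sqrt(max_norm^2 - |v|^2) component
aug = [v + [math.sqrt(max(0.0, max_norm ** 2 - dot(v, v)))] for v in vecs]

def neg_dot(i, j):  # candidate 1: -dot on the original vectors
    return -dot(vecs[i], vecs[j])

def cos_dist(i, j):  # candidate 2: 1 - cos on the augmented vectors
    denom = math.sqrt(dot(aug[i], aug[i])) * math.sqrt(dot(aug[j], aug[j]))
    return 1.0 - dot(aug[i], aug[j]) / denom

# Relative to item 0, -dot says item 2 is nearer, while the augmented cosine
# says item 1 is nearer: its large extra component dominates the numerator.
assert neg_dot(0, 2) < neg_dot(0, 1)
assert cos_dist(0, 1) < cos_dist(0, 2)
```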


=> @erikbern, @psobot: WDYT?

@erikbern (Collaborator) commented Aug 7, 2018

@yonromai the reason i originally brought up euclidean distance was that i thought the distance method was used internally somewhere for the splits.

if it's only for external consumption then i don't have any strong feelings, i guess positive dot product is fine. just need to document it since "distance" is a bit of a misnomer

src/annoylib.h Outdated

template<typename T>
static inline T normalized_distance(T distance) {
return distance;
Collaborator

doesn't entirely make sense to me that this returns the negative dot product. maybe should just return the positive dot product? either way it's not a "distance" in the correct meaning of the word

https://en.wikipedia.org/wiki/Metric_(mathematics)#Definition

Member Author

Good point - if we leave this as returning the dot product distance directly, we can flip the sign to return the positive dot product.

src/annoylib.h Outdated
@@ -586,7 +733,8 @@ template<typename S, typename T, typename Distance, typename Random>
n->children[1] = 0;
n->n_descendants = 1;

for (int z = 0; z < _f; z++)
memset(n->v, 0, _f * sizeof(T));
Collaborator

don't think this memset is needed
i guess the for loop could just be replaced by a memcpy though

Member Author

Whoops, this was another holdover from using extra_dimensions. I've changed this to a straight memcpy.

src/annoylib.h Outdated
@@ -663,7 +815,7 @@ template<typename S, typename T, typename Distance, typename Random>
// we have mmapped data
close(_fd);
off_t size = _n_nodes * _s;
munmap(_nodes, size);
munmap(_nodes, (size_t) size);
Collaborator

why did you change this?

if anything should probably change off_t size to size_t size but i don't remember the difference between off_t and size_t

Collaborator

might just be some win specific thing

Member Author

Changed this to try and eliminate a bunch of clang warnings, but I can undo this.

src/annoylib.h:846:12: warning: implicit conversion loses integer precision: 'off_t' (aka 'long long') to 'size_t' (aka 'unsigned long') [-Wshorten-64-to-32]

@erikbern (Collaborator)

This is approved except for my comments

@psobot (Member Author) commented Aug 1, 2018

Thanks for the review, @erikbern! And my apologies - I pushed this code at the end of the day on Friday without being completely sure it was ready for review. I'll address each comment individually.

@erikbern (Collaborator) commented Aug 7, 2018

sorry forgot about this one. will go through shortly – i promise!

@psobot (Member Author) commented Aug 14, 2018

Hey @erikbern - friendly ping on this. We've got some internal customers interested in using this new feature, so I'm planning to merge this by Friday unless you have any final objections. (Also happy to hand-deliver some cookies to your office if that'd incentivize you to push this PR over the line. 🙂)

@erikbern (Collaborator)

haha i forget that this repo is under spotify and you can merge whatever you want :)

i'll try to find some time tonight!

}
};

struct Angular : Base {
Collaborator

this is implementation inheritance, which i'm not a super big fan of. but not blocking

src/annoylib.h Outdated

template<typename S, typename T>
static inline T margin(const Node<S, T>* n, const T* y, int f) {
return dot(n->v, y, f);
Collaborator

shouldn't this include the dot_factor?

Member Author

Good catch - yes, it should! This is fixed.

src/annoylib.h Outdated
return dot(n->v, y, f);
}

template<typename S, typename T, typename Random>
Collaborator

isn't this a reimplementation of the superclass method?

Member Author

It is, but I'm not sure we can omit it, as Angular::Node and DotProduct::Node are incompatible types so calling the superclass method directly won't work.

@erikbern (Collaborator)

This looks good. I'm curious what would happen if you update the margin method as per my suggestion – would be interesting to see if it has an impact on recall!

Side note but Annoy could really use a redesign from scratch at some point...

@erikbern (Collaborator)

Also this is just tangentially related but what's the purpose of this PR? I did a fair amount of benchmarking and even though dot product logically makes sense for music recommendation, cosine always performed vastly better. It was always a bit of a mystery to me

@psobot (Member Author) commented Aug 17, 2018

I'm curious what would happen if you update the margin method as per my suggestion – would be interesting to see if it has an impact on recall!

Interestingly enough, recall went down by about 2% (at 10 items) in my tests after including dot_factor in the margin calculation. I'm not sure if that means we should leave it out entirely, but if we're trying to match the original Xbox paper, we should include dot_factor for correctness.

@erikbern (Collaborator)

@psobot so should i merge this?

@erikbern (Collaborator)

Interestingly enough, recall went down by about 2% (at 10 items) in my tests after including dot_factor in the margin calculation. I'm not sure if that means we should leave it out entirely, but if we're trying to match the original Xbox paper, we should include dot_factor for correctness.

that's kind of random btw, hopefully within the level of noise

@psobot (Member Author) commented Aug 17, 2018

@erikbern Merge away, I'm ready if you are. 👍 Thanks for all the review!

@erikbern erikbern merged commit 8fc84a8 into spotify:master Aug 17, 2018
@erikbern (Collaborator)

💥
