Add dot product as a distance metric. #303
Conversation
This looks nice! One major issue is that the dot product is not a distance function – I'm concerned the splits will be really bad for instance. Let me follow up with some more thoughts though.
Thanks @erikbern! Good point - I didn't change the way that we do splits, and just assumed that the existing logic would be good enough. The tests for …
So, if not made obvious from the radio silence on my end - this has been notoriously difficult to optimize Annoy for. As you expected, the splits are really bad and the overall efficiency is very low, to the point where callers need to crank up … I'm not sure where to go from here, unfortunately. I've tried overriding …
@psobot take a look at #44 – it contains some older notes that may be useful. Maybe also https://www.benfrederickson.com/approximate-nearest-neighbours-for-recommender-systems/ although it looks like it refers to the same xbox paper
Thanks for that link @erikbern! After reading through that Xbox paper and trying a bunch of tests on random and real-world data, I've updated this PR with a solution that's a bit more involved, but is now much more efficient than just straight dot product. The new method adds: …
My tests show that on random gaussian-distributed data, this method seems to actually outperform regular cosine distance for accuracy when fetching 10 neighbours (using kendall-tau rank correlation as a metric): …
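For readers following along, here is a rough NumPy sketch (not code from this PR) of the augmented-vector reduction described in the Xbox paper: each item gets one extra coordinate so that maximum-dot-product search turns into an ordinary nearest-neighbour query. The helper names are illustrative, not Annoy APIs.

```python
import numpy as np

def augment_items(items):
    # items: (n, f) array; append sqrt(max_norm^2 - ||x||^2) as an extra coordinate
    norms = np.linalg.norm(items, axis=1)
    max_norm = norms.max()
    extra = np.sqrt(max_norm ** 2 - norms ** 2)
    return np.hstack([items, extra[:, None]])

def augment_query(q):
    # queries get a zero in the extra coordinate, so dot products are unchanged
    return np.append(q, 0.0)

items = np.random.normal(size=(1000, 40))
q = np.random.normal(size=40)

aug_items = augment_items(items)
aug_q = augment_query(q)

# Every augmented item has the same norm (max_norm), so ranking by cosine
# similarity in the augmented space matches ranking by dot product in the
# original space.
cos = (aug_items @ aug_q) / (np.linalg.norm(aug_items, axis=1) * np.linalg.norm(aug_q))
assert np.argmax(cos) == np.argmax(items @ q)
```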
Very cool! I don't think you need the … Separately, I suspect the preprocessing can be avoided and that you can move that logic into the …
That's one option, although it saves a lot of custom code to just tack on an extra dimension and defer to …
I thought about doing so, but the preprocessing requires a first pass over all of the nodes to compute a global max norm.
I see that point but I think that's possible to do by just overriding …
Let me think some more about that. I think it's possible to make the max norm local to each split rather than global, but not sure yet.
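Sketched in plain Python (illustrative names only, not Annoy's internal API), the preprocessing being discussed looks roughly like the following; the point is that each item's extra coordinate depends on a max norm taken over all items, so it needs a full pass over the stored vectors rather than being computable one item at a time at add time.

```python
import math

def compute_dot_factors(vectors):
    # first pass: global maximum norm over every stored vector
    max_norm = max(math.sqrt(sum(c * c for c in v)) for v in vectors)
    # second pass: per-item extra coordinate sqrt(max_norm^2 - ||v||^2)
    return [math.sqrt(max(max_norm ** 2 - sum(c * c for c in v), 0.0))
            for v in vectors]
```

Making the max norm local to each split, as suggested above, would mean recomputing this quantity over just the subset of items that fall under that split.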
src/annoylib.h
Outdated
memcpy(v_node->v, v, sizeof(T) * (_f - D::internal_dimensions()));
memset(&v_node->v[_f - D::internal_dimensions()], 0, sizeof(T) * D::internal_dimensions());

Node* v_node = (Node *)calloc(_s, 1); // TODO: avoid
probably shouldn't mix calloc and malloc :)
nice, starting to look good
src/annoylib.h
Outdated
@@ -833,14 +991,15 @@ template<typename S, typename T, typename Distance, typename Random>
}

void _get_all_nns(const T* v, size_t n, size_t search_k, vector<S>* result, vector<T>* distances) {
Node* v_node = (Node *)malloc(_s); // TODO: avoid
memcpy(v_node->v, v, sizeof(T)*_f);
Node* v_node = (Node *)calloc(_s, 1); // TODO: avoid |
please use malloc here to make it consistent
Fixed!
The reason for changing this to `calloc` was to zero out the entire Node struct first, but you're right - it's better to be consistent and explicit. I've added a `zero_value` method to `Base` to allow us to set sane defaults for any metrics that require them - in this case, after malloc'ing a new node, we need to set the `dot_factor` of the node to zero.
src/annoylib.h
Outdated
if (search_k == (size_t)-1)
search_k = n * _roots.size(); // slightly arbitrary default value
if (search_k == (size_t)-1) {
search_k = D::default_search_k(n, _roots.size()); |
why would different distance functions have different implementations? I would rather make it consistent
👍 Removed. This was to make the `test_precision` tests easier, as `dot` is a bit less precise than other metrics.
src/annoylib.h
Outdated
@@ -859,7 +1018,7 @@ template<typename S, typename T, typename Distance, typename Random>
const S* dst = nd->children;
nns.insert(nns.end(), dst, &dst[nd->n_descendants]);
} else {
T margin = D::margin(nd, v, _f);
T margin = D::margin(nd, v_node->v, _f); |
not that it matters but why did you change this?
Ah - this was a holdover from the previous iteration where we used `extra_dimensions`. This is no longer needed. 👍
if (PyObject_Size(v) == -1) {
char buf[256];
snprintf(buf, 256, "Expected an iterable, got an object of type \"%s\"", v->ob_type->tp_name);
PyErr_SetString(PyExc_ValueError, buf); |
nice!
if (PyObject_Size(v) != f) {
PyErr_SetString(PyExc_IndexError, "Vector has wrong length");
char buf[128];
snprintf(buf, 128, "Vector has wrong length (expected %d, got %ld)", f, PyObject_Size(v)); |
also nice
test/dot_index_test.py
Outdated
def similarity(a, b):
    # Could replace this with kendall-tau if we're comfortable
    # bringing in scipy as a test dependency.
    return float(len(set(a) & set(b))) / float(len(set(a) | set(b)))
this looks like jaccard similarity to me? anyway why is that needed? in all the other tests we use recall
Switched this to use recall like the other tests.
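For reference, here is a rough sketch (hypothetical helpers, not the actual test code) of the difference between the Jaccard-style check above and the recall-style check the other tests use:

```python
def jaccard(a, b):
    # size of the intersection over size of the union of the two result sets
    return len(set(a) & set(b)) / float(len(set(a) | set(b)))

def recall_at_k(approx, exact, k):
    # fraction of the true top-k neighbours that the approximate query returned
    return len(set(approx[:k]) & set(exact[:k])) / float(k)
```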
src/annoylib.h
Outdated
template<typename S, typename T>
static inline T distance(const Node<S, T>* x, const Node<S, T>* y, int f) {
  return -dot(x->v, y->v, f);
} |
shouldn't this just be the euclidean distance between the vectors including the extra element?
I'm not entirely sure, tbh. Both perform fairly well, but if we return `dot` as the actual distance measure, then callers will get back distances that are actually dot products, which might be what callers expect.
I tried switching this to plain cosine distance, and results are pretty similar regardless of which distance metric we use:
"dot" with cosine as `distance` (incl extra element):
Recall at 10 is 70.21%
Recall at 100 is 96.50%
Recall at 1000 is 100.00%
"dot" with dot as `distance`:
Recall at 10 is 69.55%
Recall at 100 is 96.46%
Recall at 1000 is 100.00%
(The reason changing this one method doesn't break everything is that `DotProduct::margin` still takes the `dot_factor` into account.)
ok i'm very confused now, let me take a deeper look later today. i think we use the terminology "distance" for multiple separate things.
(apologies for randomly popping up here and writing an essay)
@erikbern, I don't understand the rationale behind using the euclidean distance here?
To try to clarify this from an API standpoint:
- The distance function is exposed to the users as follows:
t = AnnoyIndex(2, 'dot')
t.add_item(0, [0, 1])
t.add_item(1, [1, 1])
d = t.get_distance(0, 1)
where `get_distance` is defined as follows:
T get_distance(S i, S j) {
  return D::normalized_distance(D::distance(_get(i), _get(j), _f));
}
- The doc does not specify anything special about the distance: "Returns the distance between items `i` and `j`."
- Therefore, I believe the user would expect that `t.get_distance(v_1, v_2)` would use the distance specified in `AnnoyIndex(f, {DIST})`. So in this case it would be `dot(vectors[v_1], vectors[v_2])`.
One issue is that `dot(x, y)` is more a similarity than a distance. I assume `-dot(...)` is more convenient than `C - dot(...)` since the dot is not normalized.
I guess in this case we could return either `-dot(original_vec_i, original_vec_j)` or `1 - cos(augmented_vec_i, augmented_vec_j)` (where `original_vec_i` is of dimension `N` and `augmented_vec_i` is of dimension `N+1`).
Given the API example above, returning the `dot` value seems to make more sense to me than returning the `cos` value.
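To make the two options above concrete, here is a small NumPy illustration (not Annoy code) using the example from this comment, with the augmentation following the max-norm construction discussed earlier in the thread:

```python
import numpy as np

v0 = np.array([0.0, 1.0])   # add_item(0, [0, 1])
v1 = np.array([1.0, 1.0])   # add_item(1, [1, 1])

# Option 1: (negative) dot product over the original N-dimensional vectors
neg_dot = -np.dot(v0, v1)                                                    # -1.0

# Option 2: cosine distance over the augmented (N+1)-dimensional vectors
max_norm = max(np.linalg.norm(v0), np.linalg.norm(v1))
aug = lambda v: np.append(v, np.sqrt(max_norm ** 2 - np.dot(v, v)))
a0, a1 = aug(v0), aug(v1)
cos_dist = 1 - np.dot(a0, a1) / (np.linalg.norm(a0) * np.linalg.norm(a1))   # 0.5
```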
@yonromai the reason i originally brought up euclidean distance was that i thought the distance method was used internally somewhere for the splits.
if it's only for external consumption then i don't have any strong feelings, i guess positive dot product is fine. just need to document it since "distance" is a bit of a misnomer
src/annoylib.h
Outdated
template<typename T>
static inline T normalized_distance(T distance) {
  return distance;
doesn't entirely make sense to me that this returns the negative dot product. maybe should just return the positive dot product? either way it's not a "distance" in the correct meaning of the word
https://en.wikipedia.org/wiki/Metric_(mathematics)#Definition
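For illustration, a quick NumPy check of the point above that the dot product (negated or not) is not a metric in the linked sense: for instance, the identity axiom d(x, x) = 0 and non-negativity both fail for the negative dot product.

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([1.0, 2.0])

# A metric requires d(x, x) == 0 and d(x, y) >= 0:
print(-np.dot(x, x))   # -25.0, not 0
print(-np.dot(x, y))   # -11.0, negative
```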
Good point - if we leave this as returning the dot product distance directly, we can flip the sign to return the positive dot product.
src/annoylib.h
Outdated
@@ -586,7 +733,8 @@ template<typename S, typename T, typename Distance, typename Random>
n->children[1] = 0;
n->n_descendants = 1;

for (int z = 0; z < _f; z++)
memset(n->v, 0, _f * sizeof(T)); |
don't think this memset is needed
i guess the for loop could just be replaced by a memcpy though
Whoops, this was another holdover from using `extra_dimensions`. I've changed this to a straight `memcpy`.
src/annoylib.h
Outdated
@@ -663,7 +815,7 @@ template<typename S, typename T, typename Distance, typename Random>
// we have mmapped data
close(_fd);
off_t size = _n_nodes * _s;
munmap(_nodes, size);
munmap(_nodes, (size_t) size); |
why did you change this?
if anything should probably change `off_t size` to `size_t size` but i don't remember the difference between `off_t` and `size_t`
might just be some win specific thing
Changed this to try and eliminate a bunch of clang warnings, but I can undo this:
src/annoylib.h:846:12: warning: implicit conversion loses integer precision: 'off_t' (aka 'long long') to 'size_t' (aka 'unsigned long') [-Wshorten-64-to-32]
This is approved except for my comments
Thanks for the review, @erikbern! And my apologies - I pushed this code at the end of the day on Friday without being completely sure it was ready for review. I'll address each comment individually.
sorry forgot about this one. will go through shortly – i promise!
Hey @erikbern - friendly ping on this. We've got some internal customers interested in using this new feature, so I'm planning to merge this by Friday unless you have any final objections. (Also happy to hand-deliver some cookies to your office if that'd incentivize you to push this PR over the line. 🙂)
haha i forget that this repo is under spotify and you can merge whatever you want :) i'll try to find some time tonight!
}
};

struct Angular : Base { |
this is implementation inheritance, which i'm not a super big fan of. but not blocking
src/annoylib.h
Outdated
template<typename S, typename T>
static inline T margin(const Node<S, T>* n, const T* y, int f) {
return dot(n->v, y, f); |
shouldn't this include the `dot_factor`?
Good catch - yes, it should! This is fixed.
src/annoylib.h
Outdated
return dot(n->v, y, f);
}
template<typename S, typename T, typename Random> |
isn't this a reimplementation of the superclass method?
It is, but I'm not sure we can omit it, as `Angular::Node` and `DotProduct::Node` are incompatible types, so calling the superclass method directly won't work.
This looks good. I'm curious what would happen if you update the … Side note, but Annoy could really use a redesign from scratch at some point...
Also this is just tangentially related but what's the purpose of this PR? I did a fair amount of benchmarking and even though dot product logically makes sense for music recommendation, cosine always performed vastly better. It was always a bit of a mystery to me
Interestingly enough, recall went down by about 2% (at 10 items) in my tests after including …
@psobot so should i merge this?
that's kind of random btw, hopefully within the level of noise
@erikbern Merge away, I'm ready if you are. 👍 Thanks for all the review!
💥
This PR:

- Adds a `DotProduct` metric, using the negative of the dot product between two vectors as a pseudo-distance metric.
- Exposes this metric as `"dot"` in the standard Annoy interface.
- … `Euclidean` metric.

You might ask - why add `dot` when we already have `angular`, which is basically just `dot` but divided by magnitude? Well, for some recommender systems (including some matrix factorization collaborative filters), the dot product between the user and item pair is a prediction of affinity between that user and item. (Here's a great example from @jfkirk.)
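For anyone landing on this thread later, end-to-end usage looks like the other metrics. A sketch with made-up user/item factor matrices standing in for the output of a matrix-factorization model:

```python
from annoy import AnnoyIndex
import numpy as np

f = 40                                        # latent-factor dimensionality
item_factors = np.random.normal(size=(10000, f))

index = AnnoyIndex(f, 'dot')                  # the metric added by this PR
for i, v in enumerate(item_factors):
    index.add_item(i, v)
index.build(10)                               # 10 trees

user_factors = np.random.normal(size=f)
# items whose predicted affinity (dot product) with this user is highest:
recommendations = index.get_nns_by_vector(user_factors, 10)
```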