
A couple of questions #1

Closed · MLnick opened this issue Aug 27, 2015 · 5 comments

MLnick commented Aug 27, 2015

  1. How scalable is the implementation? What size of input data have you tested it against?
  2. Is there a way (or potentially a way) to compute the NN based on a filter, e.g. only compute from a subset of input vectors selected by some other node attribute, ID, etc.?

For 2, the use case would be, say, to compute user and item vectors in a CF model, then compute the NN set for item-item similarity as well as user-item scores.

tdebatty (Owner) commented

Hello,

The NNDescent algorithm, which is currently the most robust, was tested with a dataset of 4 million items (spam titles).

For filtering, the cleanest way is to first filter your RDD:
http://spark.apache.org/docs/latest/programming-guide.html#transformations
http://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/api/java/JavaRDD.html
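
For example, something along these lines (a minimal sketch; it assumes the library's Node class exposes a public id field, and the "item_" id prefix is a hypothetical convention of your own):

    import org.apache.spark.api.java.JavaRDD;
    import info.debatty.java.graphs.Node;

    // Keep only the item nodes before running NNDescent.
    // The "item_" id prefix is a hypothetical naming convention.
    public static JavaRDD<Node<double[]>> keepItems(JavaRDD<Node<double[]>> nodes) {
        return nodes.filter(n -> n.id.startsWith("item_"));
    }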

In any case, don't hesitate to keep me informed...

Regards,

MLnick (Author) commented Aug 28, 2015

OK, thanks - from a quick look through the code, it appears this could potentially scale out with the number of items if the number of buckets is made large enough. It will be interesting to confirm.

What about LSHSuperBit for DoubleArray compared to NNDescent?

For #2, what I meant was rather this:

Say I have 20 million user vectors and 5 million item vectors, and I'd like to compute the NN for item-item cosine similarity. This is easy: I just pass in an RDD[Node] where each Node is an item vector.

Now say I would like to compute the user-item "similarity" (in reality I'd take the dot product, but cosine similarity could suffice, and I guess one could implement a simple dot-product "similarity" too - see the sketch below). I could pass in an RDD[Node] containing all the vectors (25 million, say). But if I don't care about the user-user similarities, it is wasteful to compute them.
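
A minimal sketch of such a dot-product "similarity" (assuming the library's SimilarityInterface exposes a single similarity(v1, v2) method):

    import info.debatty.java.graphs.SimilarityInterface;

    // Sketch: a dot-product "similarity" for raw double[] vectors.
    public class DotProductSimilarity implements SimilarityInterface<double[]> {
        public double similarity(double[] v1, double[] v2) {
            double dot = 0;
            for (int i = 0; i < v1.length; i++) {
                dot += v1[i] * v2[i];
            }
            return dot;
        }
    }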

So what I'd want is to apply some step so that, for each user vector, I compute the NN from the set of item vectors.

Hope this is clearer.

tdebatty (Owner) commented

Hello,

The drawback of LSHSuperBit (and other partitioning algorithms) is that if your data is skewed (and it often is), a lot of items (or most items) might fall into a few buckets. This would largely decrease the efficiency of the algorithm... So you might give LSHSuperBit a try, but NNDescent is definitely more robust!

For your second question, this use case is currently not possible, but I can certainly add a filter hook. The interface would probably look like:
    /**
     * Return true if the similarity between these two nodes should be
     * computed.
     */
    public boolean prefilter(Node n1, Node n2);
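
For instance, an implementation could filter on the node id (a hypothetical sketch, assuming ids are prefixed by type, e.g. "user_42" / "item_17"):

    // Only compute user-item similarities: skip user-user and
    // item-item pairs (the id prefixes are a hypothetical convention).
    public boolean prefilter(Node n1, Node n2) {
        return n1.id.startsWith("user_") != n2.id.startsWith("user_");
    }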

What do you think about this?

Regards,


MLnick (Author) commented Sep 7, 2015

OK, that makes sense.

The interface for prefilter looks good - though it looks like the filter could only use the Node's id String? In that case I'd need to hack the id a bit, to be something like "54_user" and "71_item", etc.

Or, can I define a value for the Node that is a (type, vector) pair, and define the similarity interface such that it operates on the (type, vector) pair but computes the similarity using only the vector part?

tdebatty (Owner) commented Sep 9, 2015

Hi,

I just realized that you can achieve the same effect without this new interface.

You need to:

  • use a custom class as Node.value
  • in your similarity implementation, test the Node.value: if you don't need to compute the similarity, directly return 0 (or even -1)

I posted an example on GitHub:
https://github.com/tdebatty/spark-knn-graphs/blob/master/src/main/java/info/debatty/spark/knngraphs/example/NNDescentCustomValue.java
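
In outline, it boils down to something like this (a minimal sketch along the lines of the linked example; the class and field names here are illustrative, not the ones in the linked file):

    import info.debatty.java.graphs.SimilarityInterface;
    import java.io.Serializable;

    // Illustrative value class: the raw vector plus a user/item tag.
    class TypedVector implements Serializable {
        final boolean isUser;
        final double[] vector;

        TypedVector(boolean isUser, double[] vector) {
            this.isUser = isUser;
            this.vector = vector;
        }
    }

    // Similarity that short-circuits the pairs you don't care about.
    class UserItemSimilarity implements SimilarityInterface<TypedVector> {
        public double similarity(TypedVector v1, TypedVector v2) {
            // Same type (user-user or item-item): skip the computation.
            if (v1.isUser == v2.isUser) {
                return 0;
            }
            // Mixed pair: compute the dot product on the vector part only.
            double dot = 0;
            for (int i = 0; i < v1.vector.length; i++) {
                dot += v1.vector[i] * v2.vector[i];
            }
            return dot;
        }
    }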

Don't hesitate to let me know how it works for you...

Regards,

