
A couple of questions #1

Closed · MLnick opened this issue Aug 27, 2015 · 5 comments

MLnick commented Aug 27, 2015

  1. How scalable is the implementation? What size of input data have you tested it against?
  2. Is there a way (or potentially a way) to compute the NN based on a filter, e.g. only compute from a subset of input vectors selected by some other node attribute, ID, etc.?

For 2, the use case would be, say, to compute user and item vectors in a CF model, then compute the NN set for item-item similarity as well as user-item scores.

tdebatty (Owner) commented

Hello,

The NNDescent algorithm, which is currently the most robust, was tested with a dataset of 4 million items (spam titles).

For filtering, the cleanest way is to first filter your RDD:
http://spark.apache.org/docs/latest/programming-guide.html#transformations
http://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/api/java/JavaRDD.html
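
For example, something along these lines (a minimal sketch; it assumes the library's Node class exposes a public id field, and the "item_" id prefix is a hypothetical convention of your own):

    import org.apache.spark.api.java.JavaRDD;
    import info.debatty.java.graphs.Node;

    // Keep only the item nodes before running NNDescent.
    // The "item_" id prefix is a hypothetical naming convention.
    public static JavaRDD<Node<double[]>> keepItems(JavaRDD<Node<double[]>> nodes) {
        return nodes.filter(n -> n.id.startsWith("item_"));
    }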

In any case, don't hesitate to keep me informed...

Regards,

MLnick (Author) commented Aug 28, 2015

OK, thanks - from a quick look through the code, it appears this could potentially scale out with the number of items if the number of buckets is made large enough. It will be interesting to confirm.

What about LSHSuperBit for DoubleArray compared to NNDescent?

For #2, what I meant was rather this:

Say I have 20 million user vectors and 5 million item vectors, and I'd like to compute the NN for item-item cosine similarity. This is easy: I just pass in an RDD[Node] where each Node is an item vector.

Now say I would like to compute the user-item "similarity" (in reality I'd take the dot product, but cosine similarity could suffice, and I guess one could implement a simple dot-product "similarity" too - see the sketch below). I could pass in an RDD[Node] containing all the vectors (25 million, say). But if I don't care about the user-user similarities, it is wasteful to compute them.
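
A minimal sketch of such a dot-product "similarity" (assuming the library's SimilarityInterface exposes a single similarity(v1, v2) method):

    import info.debatty.java.graphs.SimilarityInterface;

    // Sketch: a dot-product "similarity" for raw double[] vectors.
    public class DotProductSimilarity implements SimilarityInterface<double[]> {
        public double similarity(double[] v1, double[] v2) {
            double dot = 0;
            for (int i = 0; i < v1.length; i++) {
                dot += v1[i] * v2[i];
            }
            return dot;
        }
    }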

So what I'd want is to apply some step so that, for each user vector, I compute the NN from the set of item vectors.

Hope this is clearer.

tdebatty (Owner) commented

Hello,

The drawback of LSHSuperBit (and other partitioning algorithms) is that if your data is skewed (and it often is), a lot of items (or most items) might fall into a few buckets. This would largely decrease the efficiency of the algorithm... So you might give LSHSuperBit a try, but NNDescent is definitely more robust!

For your second question, this use case is currently not possible, but I can certainly add a filter hook. The interface would probably look like:
    /**
     * Return true if the similarity between these two nodes should be
     * computed.
     */
    public boolean prefilter(Node n1, Node n2);
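
For instance, an implementation could filter on the node id (a hypothetical sketch, assuming ids are prefixed by type, e.g. "user_42" / "item_17"):

    // Only compute user-item similarities: skip user-user and
    // item-item pairs (the id prefixes are a hypothetical convention).
    public boolean prefilter(Node n1, Node n2) {
        return n1.id.startsWith("user_") != n2.id.startsWith("user_");
    }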

What do you think about this?

Regards,


MLnick (Author) commented Sep 7, 2015

OK, that makes sense.

The interface for prefilter looks good - though it looks like the filter could only use the Node's id String? In that case I'd need to hack the id a bit, to be something like "54_user" and "71_item", etc.

Or, can I define a value for the Node that is a (type, vector) pair, and define the similarity interface such that it operates on the (type, vector) pair but computes the similarity using only the vector part?

tdebatty (Owner) commented Sep 9, 2015

Hi,

I just realized that you can achieve the same effect without this new interface.

You need to:

  • use a custom class as Node.value
  • in your similarity implementation, test the Node.value: if you don't need to compute the similarity, directly return 0 (or even -1)

I posted an example on GitHub:
https://github.com/tdebatty/spark-knn-graphs/blob/master/src/main/java/info/debatty/spark/knngraphs/example/NNDescentCustomValue.java
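
In outline, it boils down to something like this (a minimal sketch along the lines of the linked example; the class and field names here are illustrative, not the ones in the linked file):

    import info.debatty.java.graphs.SimilarityInterface;
    import java.io.Serializable;

    // Illustrative value class: the raw vector plus a user/item tag.
    class TypedVector implements Serializable {
        final boolean isUser;
        final double[] vector;

        TypedVector(boolean isUser, double[] vector) {
            this.isUser = isUser;
            this.vector = vector;
        }
    }

    // Similarity that short-circuits the pairs you don't care about.
    class UserItemSimilarity implements SimilarityInterface<TypedVector> {
        public double similarity(TypedVector v1, TypedVector v2) {
            // Same type (user-user or item-item): skip the computation.
            if (v1.isUser == v2.isUser) {
                return 0;
            }
            // Mixed pair: compute the dot product on the vector part only.
            double dot = 0;
            for (int i = 0; i < v1.vector.length; i++) {
                dot += v1.vector[i] * v2.vector[i];
            }
            return dot;
        }
    }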

Don't hesitate to let me know how it works for you...

Regards,

