
The k-NN algorithm is unique in the sense that it can be used for a variety of use cases: classification, regression, clustering, etc. The core of k-NN, computing distance/similarity between vectors, is well understood. The crux of the implementation is supporting all of the generic use cases mentioned above, which involves more than just maintaining the vectors used for computation: we also need to carry auxiliary information such as labels.

For most local implementations, k-NN only needs to take an array of vectors. Since the index of each vector can be tracked, results can easily be mapped back to the original data to obtain any auxiliary data as needed. In a distributed implementation, however, vectors must be shuffled across partitions during tree construction. Furthermore, a search vector may need to traverse multiple sub-trees across partitions, with results aggregated by key at the end. Neighbors of a searched point can therefore span multiple partitions, and mapping neighbors back to the original RDD can be difficult.

Challenge

It is therefore logical to carry the original data around as we construct the tree and to return the original data directly as the results of a k-NN search. This means we need to introduce a type parameter to the KNN class.

Furthermore, recognizing that we will need to compute distances often, it is best to store internal vectors as VectorWithNorm so that Euclidean distance can be computed efficiently.
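To make this concrete, here is a minimal sketch of the idea, assuming a simple case class (the names and members are illustrative, not the actual implementation). Caching the L2 norm lets Euclidean distance be computed from a single dot product, since ||a - b||^2 = ||a||^2 + ||b||^2 - 2(a · b):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

case class VectorWithNorm(vector: Vector, norm: Double) {
  // ||a - b||^2 = ||a||^2 + ||b||^2 - 2 (a . b): one dot product per pair
  def fastDistance(other: VectorWithNorm): Double = {
    val dot = vector.toArray.zip(other.vector.toArray)
      .map { case (a, b) => a * b }
      .sum
    val sqDist = norm * norm + other.norm * other.norm - 2.0 * dot
    math.sqrt(math.max(sqDist, 0.0)) // clamp tiny negatives from rounding
  }
}

object VectorWithNorm {
  // compute and cache the L2 norm once, when the vector enters the tree
  def apply(vector: Vector): VectorWithNorm =
    VectorWithNorm(vector, Vectors.norm(vector, 2.0))
}
```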

The question is how to design a runtime- and space-efficient type interface for KNN that remains intuitive and easy to use for end users.

Example Usage

  1. Generic nearest neighbor search with vector only

    e.g. most similar image: data are stored as vectors and vectors only, but we still need to produce a VectorWithNorm

  2. Generic nearest neighbor search with auxiliary data

    e.g. most similar image with random projection: data are stored as vectors, but a separate (projected) vector is used to compute distance. Furthermore, additional auxiliary data such as labels and captions can be included.

  3. Classification/Regression modeling

    This is arguably a narrower use case than the above. For example, a LabeledPoint can store the label, which can then be used for classification or regression (see the snippet after this list).
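To illustrate the third use case, assuming MLlib's LabeledPoint: its features field is the search vector, and its label drives the prediction once the neighbors are found.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// LabeledPoint carries both the search vector and the label
val point = LabeledPoint(1.0, Vectors.dense(0.2, 0.7, 0.1))

// given the labels of the k nearest neighbors, classification takes a
// majority vote while regression takes the mean
val neighborLabels = Seq(1.0, 1.0, 0.0)
val predictedClass = neighborLabels.groupBy(identity).maxBy(_._2.size)._1
val predictedValue = neighborLabels.sum / neighborLabels.size
```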

Solutions

RDD[T], T => Vector

The easy solution might be to ask for an RDD[T] and a T => Vector function. Internally, each leaf will maintain a (T, VectorWithNorm) tuple.

While this is straightforward, oftentimes the function just ends up being a quick accessor, because the vectors are likely already pre-computed when the data are fed to k-NN.
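A hypothetical sketch of what this interface could look like (class and method names here are assumptions for illustration):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

class KNN[T](k: Int, extract: T => Vector) extends Serializable {
  def fit(data: RDD[T]): Unit = {
    // each leaf keeps (T, VectorWithNorm) tuples so search results
    // can return the original T directly
    val points: RDD[(T, VectorWithNorm)] =
      data.map(t => (t, VectorWithNorm(extract(t))))
    // ... shuffle points and build the trees ...
  }
}
```

With pre-computed vectors, the extractor degenerates to an accessor such as `new KNN[(Vector, String)](5, _._1)`.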

RDD[(T, Vector)]

This is potentially a cleaner design, as it saves the extra boilerplate of a T => Vector function that often ends up being _._1. However, I'm a bit unsure about the implications for data serialization: if the Vector refers to a Vector member inside T, does it cost extra to serialize?
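A sketch of this variant, with the serialization question restated in code form (names are again assumptions):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

class KNN2[T](k: Int) extends Serializable {
  def fit(data: RDD[(T, Vector)]): Unit = {
    // vectors arrive pre-extracted, so no accessor function is needed
    val points = data.map { case (t, v) => (t, VectorWithNorm(v)) }
    // ... shuffle points and build the trees ...
  }
}

// each pair references p.features twice: once inside the LabeledPoint
// and once as the tuple's second element; whether the bytes are written
// twice depends on whether the serializer tracks shared references
def pairs(labeled: RDD[LabeledPoint]): RDD[(LabeledPoint, Vector)] =
  labeled.map(p => (p, p.features))
```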

RDD[T <: hasVector]

Having a single generic type is desirable, as it makes the code cleaner at a small cost in user friendliness: developers need to get to know this trait. The implications for serialization cost and memory footprint are also still unclear to me.
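A hypothetical sketch of the trait-based variant (the trait name comes from the heading above; its members are assumptions):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

trait hasVector extends Serializable {
  def vector: Vector
}

class KNN3[T <: hasVector](k: Int) extends Serializable {
  def fit(data: RDD[T]): Unit = {
    // the upper bound guarantees every record can expose its vector
    val points = data.map(t => (t, VectorWithNorm(t.vector)))
    // ... shuffle points and build the trees ...
  }
}

// users adapt their records once by mixing in the trait
case class ImageRow(id: Long, caption: String, vector: Vector) extends hasVector
```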