# Data Types for KNN
The k-NN algorithm is unique in that it can serve a variety of use cases: classification, regression, clustering, etc. The core of k-NN, computing distance/similarity between vectors, is well understood. The crux of the implementation is supporting all of the generic use cases above, which requires maintaining not only the vectors used for computation but also auxiliary information such as labels.
Most local implementations of k-NN only need to take an array of vectors. Since the index of each vector can be tracked, results can easily be mapped back to the original data to obtain any auxiliary information as needed. In a distributed implementation, however, vectors must be shuffled across partitions during tree construction. Furthermore, a search vector may need to traverse multiple sub-trees across partitions, with the results aggregated by key at the end. The neighbors of a searched point can therefore span partitions, and mapping them back to the original RDD can be difficult.
It is therefore logical to carry the original data along as we construct the tree and to return the original data directly as the result of a k-NN search. This means we need to introduce a type parameter to the KNN class.
Furthermore, recognizing that we will need to compute distances often, it is best to store internal vectors as `VectorWithNorm` so that Euclidean distances can be computed efficiently.
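To illustrate why, here is a minimal sketch of the idea. This is a simplified stand-in, not the actual `VectorWithNorm` class (which is internal to Spark MLlib): caching the squared norm at construction time reduces each distance computation to a single dot product.

```scala
// Simplified stand-in for VectorWithNorm, shown only to illustrate
// why caching the norm helps. Not the actual MLlib class.
case class VectorWithNorm(vector: Array[Double]) {
  // Squared L2 norm, computed once at construction.
  val norm2: Double = vector.map(x => x * x).sum

  // ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b),
  // so each distance needs only one dot product over the two arrays.
  def fastSquaredDistance(other: VectorWithNorm): Double = {
    val dot = vector.zip(other.vector).map { case (a, b) => a * b }.sum
    math.max(norm2 + other.norm2 - 2.0 * dot, 0.0)
  }
}

object VectorWithNormDemo extends App {
  val a = VectorWithNorm(Array(1.0, 2.0))
  val b = VectorWithNorm(Array(4.0, 6.0))
  println(math.sqrt(a.fastSquaredDistance(b))) // 5.0
}
```

In a tree search that performs many distance evaluations per query, this trades a one-time O(d) pass per vector at construction for a cheaper inner loop during search.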
The question is how to design a runtime- and space-efficient type interface for KNN that is also intuitive and easy to use for end users.
- Generic nearest neighbor search with vectors only: e.g. finding the most similar image. Data are stored as vectors and vectors only, but we still need to produce `VectorWithNorm`.
- Generic nearest neighbor search with auxiliary data: e.g. finding the most similar image with random projection. Data are stored as vectors, but a separate (projected) vector is used to compute distances. Additional auxiliary data such as labels and captions can also be included.
- Classification/regression modeling: this is arguably a narrower use case than the above. For example, a `LabeledPoint` can store the label, which can then be used for classification and regression.
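The three use cases above imply three different payload shapes for the element type. A quick sketch of those shapes (the `LabeledPoint` here is a self-contained stand-in for `org.apache.spark.mllib.regression.LabeledPoint`, and plain `Seq` stands in for `RDD`):

```scala
// Self-contained stand-in for MLlib's LabeledPoint, for illustration only.
case class LabeledPoint(label: Double, features: Array[Double])

object PayloadShapesDemo extends App {
  // 1. Vectors only: the element type is just the vector.
  val imagesOnly: Seq[Array[Double]] = Seq(Array(0.1, 0.2))

  // 2. Vector plus auxiliary data: the payload (here a caption and the
  //    stored vector) is carried alongside a separate projected vector
  //    that is actually used to compute distances.
  val withAux: Seq[((String, Array[Double]), Array[Double])] =
    Seq((("caption", Array(0.1, 0.2)), Array(0.3)))

  // 3. Classification/regression: the element is a LabeledPoint.
  val labeled: Seq[LabeledPoint] = Seq(LabeledPoint(1.0, Array(0.1, 0.2)))

  println(imagesOnly.size + withAux.size + labeled.size) // 3
}
```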
The easy solution might be to ask for an `RDD[T]` and a `T => Vector` function. Internally, each leaf will maintain a `(T, VectorWithNorm)` tuple. While this is straightforward, the function often ends up being a quick accessor, because the vectors are likely already precomputed by the time data are fed to k-NN.
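A sketch of what this API could look like. All names here (`KNN`, `train`, `query`) are illustrative rather than the library's actual API, `Seq[T]` stands in for `RDD[T]`, and a brute-force scan stands in for the tree so the example runs without Spark:

```scala
// Hypothetical API sketch: the caller supplies T => Array[Double], and
// each "leaf" keeps the original record alongside its vector, mirroring
// the (T, VectorWithNorm) tuple described above.
class KNN[T](extract: T => Array[Double]) {
  private var data: Seq[(T, Array[Double])] = Nil

  def train(rows: Seq[T]): this.type = {
    data = rows.map(r => (r, extract(r)))
    this
  }

  // Brute-force scan in place of a tree traversal; returns the original
  // records directly, not just indices.
  def query(v: Array[Double], k: Int): Seq[T] =
    data
      .map { case (row, u) =>
        val d = u.zip(v).map { case (a, b) => (a - b) * (a - b) }.sum
        (row, d)
      }
      .sortBy(_._2)
      .take(k)
      .map(_._1)
}

object KNNDemo extends App {
  // T is a (label, vector) pair, so the accessor is just `_._2`.
  val rows = Seq(
    ("a", Array(0.0, 0.0)),
    ("b", Array(1.0, 1.0)),
    ("c", Array(5.0, 5.0)))
  val knn = new KNN[(String, Array[Double])](_._2).train(rows)
  println(knn.query(Array(0.9, 0.9), 2).map(_._1)) // List(b, a)
}
```

Note how the accessor function is pure boilerplate here, which is exactly the friction the alternative design below tries to remove.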
A potentially cleaner implementation would save the extra boilerplate of the `T => Vector` function, which often ends up being `_._1`. However, I'm a bit unsure about the implications for data serialization: if the `Vector` refers to some `Vector` member of `T`, does it cost extra to serialize?
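For concreteness, the trait-based alternative might look like the following (`HasVector` is a hypothetical name, not from the codebase). On the serialization question: with standard Java serialization only an object's fields are written, so a trait method that merely returns an existing member adds no extra bytes; only a `def` that allocates a new vector on each call, or duplicated state, would cost more.

```scala
// Hypothetical trait (name is illustrative): the element type exposes its
// own vector, so no separate T => Vector function is needed.
trait HasVector extends Serializable {
  def vector: Array[Double]
}

// `vector` here is a constructor field, so implementing the trait adds no
// new state: serializing a LabeledImage writes the caption and the array
// once, the same as it would without the trait.
case class LabeledImage(caption: String, vector: Array[Double])
    extends HasVector

object HasVectorDemo extends App {
  val img = LabeledImage("cat", Array(0.1, 0.2))
  println(img.vector.length) // 2
}
```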
Having a single generic type is desirable, as it makes the code cleaner with only a small sacrifice in user friendliness: developers need to get to know this trait. The implications for serialization cost and memory footprint are also still unclear to me.