ML2019 ASSIGNMENT 1

## Introduction

This paper proposes two algorithms for the approximate nearest neighbor problem in high-dimensional spaces. For a data set of $n$ points in $\Re^d$, the algorithms require preprocessing cost polynomial in $n$ and $d$ while achieving a sublinear query time. The paper also gives applications to information retrieval, pattern recognition, dynamic closest-pairs, and fast clustering algorithms.

## Content

The __nearest neighbor (NN)__ problem is the following: given a set $P$ of $n$ points in a metric space defined over a set $X$ with distance function $D$, preprocess $P$ so as to efficiently answer queries for the point in $P$ closest to a query point $q \in X$.

For low-dimensional data (that is, when the number of features per sample is small), linear search is a simple and efficient algorithm for the NN problem. For large, high-dimensional data sets, however, linear search becomes very time-consuming. This difficulty is known as the "curse of dimensionality".

In the high-dimensional case, the known nearest neighbor search (NNS) algorithms fall into one of two situations: (a) the preprocessing time is low, but the query time is linear in the number of points $n$ and the dimension $d$, or (b) the query time is sublinear in $n$ and polynomial in $d$, but the preprocessing cost is exponential, on the order of $n^d$.

Researchers have taken many approaches to NNS in high dimensions, including data structures designed for NN queries such as _k-d_ trees and *R*-trees. Although some of these methods perform well in 2-3 dimensions, in high-dimensional spaces they all perform poorly in both the worst case and typical cases. For example, the first algorithm proposed for the NN problem has query time $O(2^d \log n)$ and preprocessing cost $O(n^{2^{d+1}})$. Later work tried to reduce the preprocessing or query cost of this algorithm. Unfortunately, none of these methods avoids exponential cost in both preprocessing and query time at once.

To obtain a practical algorithm for NN search in high-dimensional spaces, the authors argue that the algorithmic guarantee has to be relaxed. Roughly speaking, a query is allowed to return a point that is "close enough" rather than the closest point. This problem is called __approximate nearest neighbor (ANN)__ search. For ANN the situation is slightly better, and the cost of preprocessing or querying has gradually been reduced. Earlier work proposed algorithms for retrieving all points within distance $r$ of the query $q$ in Hamming space, but their running time is exponential for arbitrary distance $r$. Later, a scheme was proposed for binary data chosen uniformly at random. It retrieves all points within distance $r$ of $q$ with query time $O(dn^{r/d})$ and preprocessing time $O(dn^{1+r/d})$.

The paper finally arrives at three propositions:

__Proposition 1__ _For $\epsilon>0$, there is an algorithm for $\epsilon$-NNS in $\Re^d$ under the $l_p$ norm for $p \in [1,2]$ which uses $\tilde{O}(n^{1+1/(1+\epsilon)}+dn)$ preprocessing and requires $\tilde{O}(dn^{1/(1+\epsilon)})$ query time._

__Proposition 2__ _For $0<\epsilon<1$, there is an algorithm for $\epsilon$-NNS in $\Re^d$ under any $l_p$ norm which uses $\tilde{O}(n)\times O(1/\epsilon)^d$ preprocessing and requires $\tilde{O}(d)$ query time._

__Proposition 3__ _For any $\epsilon>0$, there is an algorithm for $\epsilon$-NNS in $\Re^d$ under the $l_p$ norm for $p \in [1,2]$ which uses $(nd)^{O(1)}$ preprocessing and requires $\tilde{O}(d)$ query time._
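
As a rough illustration of what Proposition 1 provides, its bounds can be instantiated for a concrete $\epsilon$. The values $\epsilon = 1$ and $n = 10^6$ below are chosen only for illustration, and the polylogarithmic factors hidden by $\tilde{O}$ are ignored:

$$
\epsilon = 1:\qquad \text{preprocessing } \tilde{O}\left(n^{3/2} + dn\right), \qquad \text{query } \tilde{O}\left(d\,n^{1/2}\right)
$$

$$
n = 10^6:\qquad d\,n^{1/2} = 10^3\,d \quad \text{versus} \quad d\,n = 10^6\,d \ \text{ for a linear scan}
$$

A larger $\epsilon$ pushes the query exponent $1/(1+\epsilon)$ toward 0, at the price of a looser approximation.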

These results are obtained through one key idea: reducing $\epsilon$-NNS to the problem of point location in equal balls (PLEB). In PLEB we are given $n$ balls of radius $r$ and, for each query $q$, must return YES if $q$ lies in one of the balls and NO otherwise. It had been observed that PLEB can be reduced to NNS with the same preprocessing and query costs. The paper proves that the reverse reduction, from NNS to PLEB, is also possible, with only a small overhead in preprocessing and query costs. This reduction relies on a special data structure called the ring-cover tree, which is built recursively from rings and covers. For any point set $P$, we can find either a ring-separator or a cover, and can therefore divide $P$ into multiple subsets. This decomposition lets a search quickly narrow down to one of the subsets, thereby increasing efficiency.

There is also a reduction from $\epsilon$-NN to $\epsilon$-PLEB; $r$-PLEB acts like a decision problem for $\epsilon$-NN.
A simple implementation can be divided into the following steps:

1) Let $R$ be the ratio between the largest and the smallest distance between two points of $P$.  
2) Define the set of radii $\{(1+\epsilon)^0,(1+\epsilon)^1,...,R\}$ and let $l$ range over it.  
3) Generate an $l$-PLEB instance (a set of balls of radius $l$) for each $l$.  
4) For a given query $q$, use binary search to find the smallest $l$ for which there exists an $i$ such that $q\in B^l_i$, and return $p_i$.
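
The steps above can be sketched in Python as follows. This is a minimal illustration and not the paper's construction: `pleb_query` is a hypothetical brute-force stand-in for a real PLEB data structure, the radii are scaled by the smallest pairwise distance, the points are assumed to be distinct, and all function names are made up for this example.

```python
import math
from itertools import combinations


def pleb_query(points, radius, q):
    """Brute-force stand-in for a PLEB structure: return the index of a point
    whose ball of the given radius contains q, or None if no ball does."""
    for i, p in enumerate(points):
        if math.dist(p, q) <= radius:
            return i
    return None


def build_radii(points, eps):
    """Radii d_min * (1 + eps)^k for k = 0, 1, ... until d_max is covered.
    Assumes at least two points, all distinct (so d_min > 0)."""
    dists = [math.dist(a, b) for a, b in combinations(points, 2)]
    d_min, d_max = min(dists), max(dists)
    radii = [d_min]
    while radii[-1] < d_max:
        radii.append(radii[-1] * (1 + eps))
    return radii


def approx_nearest(points, radii, q):
    """Binary search for the smallest radius whose PLEB instance answers YES.
    Returns the index of the reported point, or None if q lies outside
    every ball even at the largest radius."""
    lo, hi, best = 0, len(radii) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        hit = pleb_query(points, radii[mid], q)
        if hit is not None:
            best, hi = hit, mid - 1   # YES: try a smaller radius
        else:
            lo = mid + 1              # NO: need a larger radius
    return best


points = [(0, 0), (1, 0), (10, 0), (0, 10)]
radii = build_radii(points, eps=0.5)
print(approx_nearest(points, radii, (9, 2)))
```

Binary search is valid here because the YES/NO answer is monotone in the radius. Note that in this simplified version the $(1+\epsilon)$ guarantee only holds when the answer is found at a radius larger than the smallest one; a query closer than the smallest radius to several points may get any of them.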

Two techniques are used to solve the $\epsilon$-PLEB problem. The first is the __Bucketing Method__. Its steps are as follows:  
1) Impose a grid of spacing $s=\epsilon/\sqrt{d}$ (for balls of unit radius), so that each cube has diameter $\epsilon$ and every ball is covered by cubes.  
2) Use a hash table to store the cubes  
   - Key: a cube; Value: a ball that intersects it  
3) For a query $q$, compute the cube that contains $q$ and check whether it is stored in the table.
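
A minimal Python sketch of this bucketing idea is given below. It assumes balls of a common radius `r` and a cell side of `eps * r / sqrt(d)`, so that a cell's diameter is `eps * r`. The function names are hypothetical, and the cell enumeration is only practical for small `d`, since the number of cells per ball grows exponentially with the dimension.

```python
import math
from itertools import product


def build_grid_table(centers, r, eps):
    """Map every grid cell that intersects some ball to the index of one such ball."""
    d = len(centers[0])
    s = eps * r / math.sqrt(d)  # cell side, so the cell diameter is eps * r
    table = {}
    for i, c in enumerate(centers):
        # Range of cell indices spanned by the ball's bounding box.
        lo = [math.floor((x - r) / s) for x in c]
        hi = [math.floor((x + r) / s) for x in c]
        for cell in product(*(range(a, b + 1) for a, b in zip(lo, hi))):
            # Squared distance from the ball centre to the nearest point of the cell.
            gap2 = sum(
                max(k * s - x, 0.0, x - (k + 1) * s) ** 2
                for k, x in zip(cell, c)
            )
            if gap2 <= r * r:
                table.setdefault(cell, i)  # keep the first ball seen for this cell
    return table, s


def grid_query(table, s, q):
    """Return the index of a ball stored for q's cell, or None if the cell is empty."""
    cell = tuple(math.floor(x / s) for x in q)
    return table.get(cell)


centers = [(0.0, 0.0), (5.0, 5.0)]
table, s = build_grid_table(centers, r=1.0, eps=0.5)
print(grid_query(table, s, (0.2, 0.3)), grid_query(table, s, (3.0, 3.0)))
```

A query that lies inside a ball always falls in a stored cell, and any query that gets a YES answer is within `(1 + eps) * r` of the reported centre, because the cell it shares with the ball has diameter `eps * r`.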

The second technique is __Locality-Sensitive Hashing__ (LSH). The key idea is that similar items are more likely to collide: the hash functions are chosen so that the probability of mapping two data points $x$ and $y$ to the same value increases as their distance $D(x,y)$ decreases. Through such a mapping, neighboring data points can be found in a low-dimensional space, which avoids spending too much time in the high-dimensional data space.
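
To make the collision idea concrete, here is a small Python sketch of bit-sampling LSH for binary vectors under Hamming distance, where each hash function simply reads `k` randomly chosen coordinates. The class name `HammingLSH` and the parameters `k`, `num_tables` and `seed` are assumptions made for this example, not notation from the paper.

```python
import random
from collections import defaultdict


def hamming(u, v):
    """Number of coordinates in which u and v differ."""
    return sum(a != b for a, b in zip(u, v))


class HammingLSH:
    """Bit-sampling LSH: each table hashes a vector to k randomly chosen coordinates."""

    def __init__(self, dim, k, num_tables, seed=0):
        rng = random.Random(seed)
        self.projections = [rng.sample(range(dim), k) for _ in range(num_tables)]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.points = []

    @staticmethod
    def _key(coords, v):
        return tuple(v[i] for i in coords)

    def add(self, v):
        idx = len(self.points)
        self.points.append(v)
        for coords, table in zip(self.projections, self.tables):
            table[self._key(coords, v)].append(idx)

    def query(self, q):
        """Return the index of the closest point that collides with q, or None."""
        candidates = set()
        for coords, table in zip(self.projections, self.tables):
            candidates.update(table.get(self._key(coords, q), []))
        if not candidates:
            return None
        return min(candidates, key=lambda i: hamming(self.points[i], q))


index = HammingLSH(dim=8, k=3, num_tables=4, seed=1)
for v in [(0,) * 8, (1, 1, 1, 1, 0, 0, 0, 0), (0, 0, 0, 0, 1, 1, 1, 1)]:
    index.add(v)
print(index.query((1, 1, 1, 1, 0, 0, 0, 0)))
```

Two vectors that differ in few coordinates agree on a random sample of `k` coordinates with high probability, so they tend to share a bucket. Only the colliding candidates are compared exactly, which is what keeps the query cheap.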

## Innovation

NNS algorithms are very important for a variety of applications, such as data compression, pattern recognition, and information retrieval, all of which require similarity search. However, as the number of features (i.e., dimensions) of the objects increases from tens to thousands, the computation time typically grows polynomially or exponentially. Dimensionality reduction techniques such as latent semantic indexing (LSI) can reduce thousands of dimensions to hundreds.

To better address the NNS problem in high dimensions, the authors, Indyk and Motwani, propose an approximate method: find a point $p$ that is approximately closest to $q$ instead of the exact closest point. This is the __approximate nearest neighbor problem__.

Indyk and Motwani found that the $\epsilon$-NNS problem can be reduced to a new problem: __Point location in equal balls (PLEB)__. This reduction is achieved with a data structure called a ring-cover tree.

They gave two algorithms to solve the PLEB problem: the __Bucketing Method__ and __Locality-Sensitive Hashing__.

The authors' idea of relaxing the NNS problem to ANN is very creative, and the research approach of reducing one problem to another is practical and worth learning from.


## Technical quality

In general, the technical quality of the paper is very high. Through mathematical equations, proofs and pseudocode, it explains in detail the origin of the dimensionality problem in NN search and the solution to the ANN problem. This not only lets the reader understand the principles of ANN, but also allows researchers to build further work on this basis.

The paper also has some shortcomings. For example, it lacks a detailed account of its experiments, such as their procedure and full results: the authors only state experimental outcomes directly in order to support the reliability of some algorithms or techniques. Because specific information about the experiments cannot be obtained, the credibility of the paper suffers. Furthermore, for $0 < \epsilon < 1$, because the exponent depends on $1 / \epsilon$, the result obtained with the bucketing method is purely theoretical.

## Application and X-Factor

In this paper, the authors mention applications to information retrieval, pattern recognition, dynamic closest-pairs and fast clustering algorithms.

From an important consequence of Locality-Sensitive Hashing, the authors conclude that for any points $p,q\in H^d$, the dot product $p\cdot q$ can serve as their measure of approximate similarity. The dot product is a common similarity measure in information retrieval and can be used to compare two documents: when each document is represented by a vector, the dot product quantifies how similar the documents are. This is a very useful application that can be used to build duplicate-detection systems and more.
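
As a toy illustration of comparing documents by dot product, the sketch below maps each document to a binary vector over a shared vocabulary, so that the dot product counts the words two documents have in common. The sample documents and the whitespace tokenizer are assumptions made purely for this example.

```python
def to_binary_vector(text, vocab):
    """Indicator vector in {0,1}^d: 1 where the vocabulary word occurs in the text."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]


def dot(p, q):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(p, q))


docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply",
]
vocab = sorted({w for doc in docs for w in doc.lower().split()})
vectors = [to_binary_vector(doc, vocab) for doc in docs]

# The first two documents share "the", "cat" and "on"; the third shares nothing.
print(dot(vectors[0], vectors[1]), dot(vectors[0], vectors[2]))  # prints "3 0"
```

A larger dot product means more shared terms, so near-duplicate documents score high against each other.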

The dot product can also be used to solve the largest common point set problem, which has many applications in image retrieval and pattern recognition. Image retrieval can be divided into two categories according to how image content is described: text-based image retrieval (TBIR) and content-based image retrieval (CBIR). Today, as image data grows rapidly, text-based image retrieval has largely fallen out of favor. Finding the largest common point set is therefore a great help for content-based image retrieval.

Approximate closest-pair queries can be accomplished easily by adapting the PLEB algorithm. The scheme in the paper can solve the dynamic closest-pair problem in sublinear time. This is a very powerful feature, because dynamic insertions and deletions are a basic requirement in practical applications.

The LSH algorithm and its variants have been successfully applied to computational problems in a variety of fields, including network clustering, computational biology, computer vision, computational drug design and computational linguistics. In a later paper, Andoni and Indyk (2008) developed an LSH family for Euclidean distance that achieves a near-optimal separation between the collision probabilities of close and far points.

Several well-regarded open-source libraries implement ANN search.

ANNOY (Approximate Nearest Neighbors Oh Yeah) is an open-source library for finding points in space that are close to a given query point. It is considered to be one of the best ANN libraries. Recent versions of ANNOY construct multiple hierarchical 2-means trees. In each iteration, two centers are formed by clustering a subset of the samples. These two centers define a partitioning hyperplane that is equidistant from both centers. The hyperplane divides the data points into two subtrees, and the algorithm recursively indexes each subtree. A search is performed by traversing the nodes of the multiple random projection trees. ANNOY has been used in the recommendation engine of Spotify.
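
The hyperplane split described above can be illustrated with a few lines of Python. This is only a sketch of a single split step and is not code from the ANNOY library: the two centers `a` and `b` are passed in by hand rather than found by clustering, and the function name is hypothetical.

```python
def split_by_hyperplane(points, a, b):
    """Partition points by the hyperplane equidistant from centres a and b."""
    normal = [ai - bi for ai, bi in zip(a, b)]          # points from b towards a
    midpoint = [(ai + bi) / 2 for ai, bi in zip(a, b)]  # lies on the hyperplane
    offset = sum(n * m for n, m in zip(normal, midpoint))
    near_a, near_b = [], []
    for p in points:
        # Positive side: p is closer to a; negative side: p is closer to b.
        side = sum(n * x for n, x in zip(normal, p)) - offset
        (near_a if side >= 0 else near_b).append(p)
    return near_a, near_b


points = [(0, 0), (1, 0), (9, 0), (10, 1)]
left, right = split_by_hyperplane(points, a=(0, 0), b=(10, 0))
print(left, right)
```

Applying the same split recursively to each half yields one tree; building several trees with different centers gives the forest that is searched at query time.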

FLANN (Fast Library for Approximate Nearest Neighbors) is a library that automatically configures the nearest neighbor algorithm. For a target data set it can select the most appropriate algorithm among randomized kd-trees, hierarchical k-means trees and linear scan, and it also allows the desired accuracy to be specified.

Other ANN libraries include KGraph, NearPy, and LSHash.

In their 2010 paper, Indyk, Motwani and Har-Peled replace the term PLEB with the more descriptive 'approximate near neighbor'. In addition, several algorithms in that paper are simpler and more versatile. The reduction from the nearest neighbor problem to the near neighbor problem is much simpler and more efficient than before: it works for general metric spaces and can be performed using a near-linear number of approximate near neighbor queries. The new paper also presents new applications, such as the approximate minimum spanning tree.

## Presentation

The structure of the paper is clear and it is generally easy to read. For concepts such as PLEB and the Bucketing Method, or data structures such as ring-cover trees, I think the paper would be easier to understand if the authors added figures to their explanations.

## References 

P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, 1998.

E. Bernhardsson. Annoy at GitHub. https://github.com/spotify/annoy, 2015.