https://github.com/qinyu0831/UTS_ML2019_ID13023662/blob/master/A1.ipynb

## Review Report on "Approximate nearest neighbors: towards removing the curse of dimensionality"

## Introduction

The __nearest neighbor (NN)__ problem is the following: given a set $P$ of $n$ points in a metric space defined over a set $X$ with distance function $D$, preprocess $P$ so as to efficiently answer queries for the point in $P$ closest to a query point $q \in X$.  
By the late 1990s, after decades of work, solutions to the NN problem in low dimensions had become satisfactory. The authors wanted to give a better solution for the nearest neighbor problem in high dimensions, and in this paper they present two algorithms for it.

## Content

To obtain a practical algorithm for NN search in high-dimensional space, the authors argue that the algorithmic guarantee has to be relaxed. Roughly speaking, a query is allowed to return a point that is "close enough" rather than the closest point. This problem is called __approximate nearest neighbor (ANN)__. For ANN, the situation before this paper was slightly better: the cost of preprocessing and querying had gradually been reduced. Earlier researchers proposed an algorithm for retrieving all points within distance $r$ of the query $q$ in Hamming space, but its running time is exponential in the distance $r$. After that, a scheme was proposed for binary data chosen uniformly at random; it retrieves all points within distance $r$ of $q$ with query time $O(dn^{r/d})$ and preprocessing time $O(dn^{1+r/d})$. Against this background, the paper presents three main propositions.
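The "close enough" relaxation can be stated a little more formally. Using the notation of the introduction ($P$, $D$, $q$), and with $p'$ as a helper symbol introduced here only for illustration, an $\epsilon$-approximate answer is any point $p \in P$ satisfying

$$D(q,p) \le (1+\epsilon)\,\min_{p' \in P} D(q,p').$$

With $\epsilon = 0$ this is the exact NN problem again; larger $\epsilon$ gives the algorithm more freedom in what it may return.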

These results rest on one key idea: reducing $\epsilon$-NNS to point location in equal balls (PLEB). In PLEB we are given $n$ balls of radius $r$; for each query $q$, the answer is YES if $q$ lies in one of the balls and NO otherwise. The reduction from NNS to PLEB needs a special data structure called a ring-cover tree, which is based on rings and covers and is built recursively. The structure divides $P$ into multiple subsets, and this decomposition lets a search over $P$ quickly narrow down to one of the subsets, thereby increasing efficiency.
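To make the PLEB query concrete, here is a minimal brute-force sketch. It assumes Euclidean distance and points given as coordinate tuples; the function names `dist` and `pleb_query` are hypothetical and do not come from the paper. It only illustrates the YES/NO interface, not the efficient data structures the paper builds.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pleb_query(centers, r, q):
    """Brute-force PLEB oracle (illustrative only).

    Returns (True, center) if q lies in some ball B(center, r),
    otherwise (False, None).
    """
    for c in centers:
        if dist(c, q) <= r:
            return True, c
    return False, None
```

A real PLEB structure answers the same question without scanning all $n$ centers, which is what the two techniques discussed later aim for.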

There is also a simpler reduction from $\epsilon$-NN to $\epsilon$-PLEB; $r$-PLEB acts like a decision problem for $\epsilon$-NN.
It can be carried out in the following steps:

1) Let $R$ be the ratio between the largest and the smallest distance between two points of $P$.  
2) Let $l$ range over $\{(1+\epsilon)^0,(1+\epsilon)^1,...,R\}$.  
3) Generate an $l$-PLEB instance (balls of radius $l$) for each $l$.  
4) For a given query $q$, use binary search to find the smallest $l$ for which there exists an $i$ such that $q\in B^l_i$, and return $p_i$.
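The steps above can be sketched as follows. This is an illustration under stated assumptions, not the paper's construction: the PLEB oracle is a brute-force stand-in, distances are Euclidean, and instead of normalizing by the smallest interpoint distance the sketch takes two hypothetical parameters, `r_min` and `r_max`, assumed to bracket the distance from the query to its nearest neighbor. All function names are made up for this example.

```python
import math


def dist(a, b):
    """Euclidean distance between two coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def radius_levels(r_min, r_max, eps):
    """Geometric sequence r_min, r_min*(1+eps), ... capped at r_max."""
    levels, r = [], r_min
    while r < r_max:
        levels.append(r)
        r *= 1 + eps
    levels.append(r_max)
    return levels


def pleb(points, r, q):
    """Stand-in PLEB oracle: index of some point within r of q, else None."""
    for i, p in enumerate(points):
        if dist(p, q) <= r:
            return i
    return None


def approx_nn(points, levels, q):
    """Binary search for the smallest radius whose PLEB instance says YES."""
    lo, hi, best = 0, len(levels) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        i = pleb(points, levels[mid], q)
        if i is not None:
            best, hi = i, mid - 1   # YES: try a smaller radius
        else:
            lo = mid + 1            # NO: need a larger radius
    return None if best is None else points[best]
```

Binary search is valid here because a YES answer at one radius implies YES at every larger radius. If the smallest YES level is not the first one, the level just below it answered NO, so the returned point is within a factor $(1+\epsilon)$ of the true nearest distance.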

Two techniques are proposed in the paper for solving the $\epsilon$-PLEB problem. The first is the __Bucketing Method__. Its steps are as follows:  
1) Impose a grid of spacing $s=\epsilon/\sqrt{d}$ (taking the ball radius as 1), so that every ball is covered by small cubes of diameter $\epsilon$.  
2) Use a hash table to store the cubes  
   - Key: a cube; Value: a ball it intersects  
3) For a query $q$, compute the grid cell that contains $q$ and check whether it is stored in the hash table.
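A small sketch of the bucketing idea, suitable only for low dimensions because it enumerates every cell in each ball's bounding box. It assumes Euclidean balls of a common radius `r` and takes the cell side to be $\epsilon r/\sqrt{d}$, so that a cell's diameter is $\epsilon r$. The names `build_grid` and `grid_query` are hypothetical.

```python
import itertools
import math


def build_grid(centers, r, eps):
    """Map every grid cell that intersects some ball B(c, r) to one such center."""
    d = len(centers[0])
    s = eps * r / math.sqrt(d)           # cell side; cell diameter is eps * r
    table = {}
    for c in centers:
        lo = [math.floor((x - r) / s) for x in c]
        hi = [math.floor((x + r) / s) for x in c]
        ranges = [range(a, b + 1) for a, b in zip(lo, hi)]
        for cell in itertools.product(*ranges):
            # Squared distance from c to the closest point of this cell.
            gap2 = 0.0
            for x, k in zip(c, cell):
                nearest = min(max(x, k * s), (k + 1) * s)
                gap2 += (x - nearest) ** 2
            if gap2 <= r * r:
                table.setdefault(cell, c)   # keep the first ball seen
    return table, s


def grid_query(table, s, q):
    """Return a center whose ball touches q's cell, or None."""
    cell = tuple(math.floor(x / s) for x in q)
    return table.get(cell)
```

If `q` lies inside some ball, its cell intersects that ball and is therefore in the table. Conversely, any center returned is at distance at most $(1+\epsilon)r$ from `q`, since the cell touches the ball and has diameter $\epsilon r$. The number of cells per ball grows quickly with $d$ and $1/\epsilon$, which is why this method is mainly of theoretical interest in high dimensions.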

The second technique is __Locality-Sensitive Hashing__ (LSH). The core idea of this method is that "similar items are more likely to collide": the probability of mapping two data points to the same hash value increases as their distance $f(x,y)$ decreases. Through such a mapping, we can look for nearby data points in a low-dimensional representation while avoiding spending too much time in the high-dimensional space.
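As an illustration of the collision idea, the sketch below uses bit sampling on binary vectors under Hamming distance, one well-known LSH family. A single sampled coordinate agrees for $p$ and $q$ with probability $1 - D_H(p,q)/d$, so closer points collide more often. The parameters `k` (bits per hash) and `n_tables`, and all function names, are assumptions made for this example rather than values from the paper.

```python
import random


def make_hash(d, k, rng):
    """One hash function: the values of k randomly sampled coordinates."""
    idx = [rng.randrange(d) for _ in range(k)]
    return lambda p: tuple(p[i] for i in idx)


def build_index(points, k, n_tables, seed=0):
    """Build n_tables hash tables, each keyed by a different sampled-bit hash."""
    rng = random.Random(seed)
    d = len(points[0])
    tables = []
    for _ in range(n_tables):
        h = make_hash(d, k, rng)
        buckets = {}
        for j, p in enumerate(points):
            buckets.setdefault(h(p), []).append(j)
        tables.append((h, buckets))
    return tables


def hamming(p, q):
    """Number of coordinates in which p and q differ."""
    return sum(a != b for a, b in zip(p, q))


def query(points, tables, q):
    """Closest point among those colliding with q in any table, or None."""
    candidates = {j for h, buckets in tables for j in buckets.get(h(q), [])}
    if not candidates:
        return None
    best = min(candidates, key=lambda j: hamming(points[j], q))
    return points[best]
```

Only points that share a bucket with the query are compared exactly, so the query cost depends on bucket sizes rather than on $n$. Larger `k` makes buckets more selective, and more tables raise the chance that a near point collides at least once.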

## Innovation

In the low-dimensional case (that is, when the number of features in a sample is not large), linear search is a simple and efficient algorithm for the NN problem. But for large, high-dimensional data sets, linear search consumes a lot of time; this problem is known as the "curse of dimensionality".
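For reference, the linear search baseline can be sketched as below, assuming Euclidean distance and points stored as tuples (the function name is hypothetical). Each query touches all $n$ points and all $d$ coordinates, so the cost per query is on the order of $n \cdot d$.

```python
import math


def linear_scan_nn(points, q):
    """Exact nearest neighbor by brute force; returns (point, distance)."""
    best, best_d = None, math.inf
    for p in points:
        d = math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))
        if d < best_d:
            best, best_d = p, d
    return best, best_d
```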

For the high-dimensional case, the known NNS algorithms fall into one of two situations: (a) the preprocessing time is low, but the query time is linear in the number of points $n$ and the dimension $d$, or (b) the query time is sublinear in $n$ and polynomial in $d$, but the preprocessing cost is $n^d$.

Researchers have taken many approaches to NNS in high dimensions, for example data structures designed for NN search (_k-d_ trees, *R*-trees, etc.). However, although some of these methods perform well in low dimensions, in high-dimensional spaces they all perform poorly in some cases. Researchers tried to reduce the cost of preprocessing or querying for these algorithms. Unfortunately, none of these methods avoids exponential cost in preprocessing and query time at the same time.

The NNS problem is very important for a variety of applications, such as data compression, pattern recognition, and information retrieval, all of which require similarity search. But as the number of features (i.e., dimensions) of the objects increases (from tens to thousands), the computation time usually grows polynomially or exponentially. Some dimensionality reduction techniques, such as latent semantic indexing (LSI), can reduce the dimension from thousands to hundreds.

To better handle NNS in high dimensions, Indyk and Motwani (the authors) propose an approximate formulation: find a point $p$ that is approximately closest to $q$ instead of the exact closest point, which is the __approximate nearest neighbor problem__.

Indyk and Motwani found that the $\epsilon$-NNS problem can be reduced to a new problem, __point location in equal balls (PLEB)__, using a reduction based on a data structure called the ring-cover tree.

They gave two algorithms for the PLEB problem: the __Bucketing Method__ and __Locality-Sensitive Hashing__.

The authors' idea of relaxing the NNS problem to ANN is very creative, and the research method of reducing one problem to another is practical and worth learning.


## Technical quality

In general, the technical quality of the paper is very high. Through mathematical equations, proofs and pseudocode, the origin of the dimensionality problem in NN search and the solution to the ANN problem are explained in detail. This not only lets the reader understand the principles of ANN but also allows researchers to build further work on this basis. Many applications are listed in the next section.

The paper also has some shortcomings. For example, it lacks a presentation of the actual experiments (process, detailed results, etc.): the authors only state experimental results directly to support the reliability of some algorithms or techniques. Without specific information about the experiments, the reliability of the paper suffers. Furthermore, for $0 < \epsilon < 1$, because the exponent depends on $1 / \epsilon$, the bound for the bucketing method is a purely theoretical result.

## Application and X-Factor

In this paper, the authors mention applications of ANN search, such as information retrieval and pattern recognition.

From an important corollary on Locality-Sensitive Hashing, the authors conclude that for data points $p,q\in H^d$, the approximate distance measure can be expressed through the dot product $p\cdot q$. The dot product is a standard similarity measure in information retrieval and can be applied to compare two documents: when each document is represented by a vector, a dot product can be used to measure how similar the documents are. This is a very useful application that can be used to build duplication-rate checking systems and more.
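One way to see why Hamming distance and dot product are so closely linked (an illustration, not necessarily the paper's own derivation) is the identity $D_H(p,q) = |p| + |q| - 2\,p\cdot q$ for 0/1 vectors, where $|p|$ counts the ones in $p$. The sketch below checks it on small vectors; the function names are hypothetical.

```python
def dot(p, q):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(p, q))


def hamming(p, q):
    """Number of coordinates in which p and q differ."""
    return sum(a != b for a, b in zip(p, q))


def hamming_via_dot(p, q):
    """For 0/1 vectors: Hamming distance = |p| + |q| - 2 * (p . q)."""
    return sum(p) + sum(q) - 2 * dot(p, q)
```

For documents encoded as 0/1 term-presence vectors, a large dot product means many shared terms, and for vectors with the same number of ones it corresponds directly to a small Hamming distance.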

The dot product can also be used for the largest common point set problem, which has many applications in image retrieval and pattern recognition. Image retrieval can be divided into two categories (text-based or content-based) according to how the image content is described. Today, as image data grows rapidly, text-based image retrieval has largely fallen out of use, so finding the largest common point set is a great help for content-based image retrieval.

Approximate closest-pair queries can be accomplished easily by reusing the PLEB algorithm. The scheme in the paper can solve the dynamic closest-pair problem in sublinear time. This is a very powerful feature, because dynamic insertions and deletions are a basic requirement in practical applications.

The LSH algorithm and its variants are widely used today and have been applied successfully to many computational problems. In a more recent article, Andoni and Indyk (2008) developed an LSH family for Euclidean distance.

Many excellent libraries for ANN search are also available.

ANNOY (Approximate Nearest Neighbors Oh Yeah) is an open source library for finding points in space that are close to a given query point. It is considered to be one of the best ANN libraries. Recent versions of ANNOY construct multiple hierarchical 2-means trees. In each iteration, two centers are formed by clustering a subset of the samples. These two centers define a partition hyperplane that is equidistant from both of them. The data points are then divided into two subtrees by the hyperplane, and the algorithm recursively indexes each subtree. The search is performed by traversing the nodes of the multiple random projection trees. ANNOY has been used in the recommendation engine of Spotify.
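A single split step of the kind described above can be sketched as follows. This is not ANNOY's code or API: the two centers are passed in directly rather than computed by clustering, and the function name is hypothetical. It only shows how a hyperplane equidistant from two centers partitions the points.

```python
def split_by_hyperplane(points, c1, c2):
    """Partition points by the hyperplane equidistant from centers c1 and c2.

    Returns (left, right), where left holds the points closer to c1
    (ties go left) and right holds the points closer to c2.
    """
    normal = [a - b for a, b in zip(c1, c2)]      # points from c2 toward c1
    mid = [(a + b) / 2 for a, b in zip(c1, c2)]   # a point on the hyperplane
    left, right = [], []
    for p in points:
        side = sum(n * (x - m) for n, x, m in zip(normal, p, mid))
        (left if side >= 0 else right).append(p)
    return left, right
```

Applying the same step recursively to `left` and `right` yields a tree; building several such trees with different splits and searching them together is what lets a query recover neighbors that a single tree would separate.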

FLANN (Fast Library for Approximate Nearest Neighbors) provides automatic configuration of nearest neighbor algorithms. It can select the most appropriate algorithm for the target data set from randomized k-d trees, hierarchical k-means trees and linear scan, and the desired accuracy can also be specified.

In addition, ANN libraries also include KGraph, NearPy, LSHash, etc.

In their 2012 paper, Har-Peled, Indyk and Motwani replaced the term PLEB with the more descriptive term 'approximate near neighbor'. In addition, many algorithms from the original paper are made simpler and more general. The reduction from nearest neighbor to near neighbor is much simpler and more efficient than before: it applies to general metric spaces and needs only a near-linear number of ANN queries. New applications such as the approximate minimum spanning tree are also shown in the new paper.

## Presentation

The structure of the paper is clear and it is generally easy to read. For concepts such as PLEB, the Bucketing Method, or data structures like ring-cover trees, I think readers would find them easier to understand if figures accompanied the explanations. In addition, some special symbols or formulas may confuse the reader; it would be better if such mathematical expressions were accompanied by a clear explanation.

## References 

Indyk, P., & Motwani, R. 1998, May. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing (pp. 604-613). ACM.

Li, W., Zhang, Y., Sun, Y., Wang, W., Li, M., Zhang, W., & Lin, X. 2019. Approximate nearest neighbor search on high dimensional data-experiments, analyses, and improvement. IEEE Transactions on Knowledge and Data Engineering.

Har-Peled, S., Indyk, P., & Motwani, R. 2012. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of computing, 8(1), 321-350.

Muja, M., & Lowe, D. G. 2009. Fast approximate nearest neighbors with automatic algorithm configuration. VISAPP (1), 2(331-340), 2.

Arya, S., Mount, D. M., Netanyahu, N. S., Silverman, R., & Wu, A. Y. 1998. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM (JACM), 45(6), 891-923.

Arya, S., & Mount, D. M. 1993, January. Approximate Nearest Neighbor Queries in Fixed Dimensions. In SODA (Vol. 93, pp. 271-280).

Andoni, A., & Indyk, P. 2008. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM, 51(1), 117.