- Introduction
- Algorithms included
- Compile and test the configuration
- Run
- Summary of run-time complexities
In this repository we include some of the most popular algorithms to solve the 'all m-th nearest neighbours problem' (all-m-nn in short). Given a set of n points in a d dimensional space, the problem requires to find the m-th nearest neighbours for each point. We include algorithms that work both in euclidean and non-euclidean spaces. This last case is particularly suitable for the entropy estimation using the kNN method. The notation we use is the following:
- d number of dimensions
- n number of input points
- m size of the neighborhood
Reference:
Borelli R, Dovier A, Fogolari F. Data Structures and Algorithms for k-th Nearest Neighbours Conformational Entropy Estimation. Biophysica. 2022; 2(4):340-352. https://doi.org/10.3390/biophysica2040031
We include different algorithms to solve the all-m-nn problem with different properties and runtime complexities (see Summary of run-time complexities):
-
trivial/
contains trivial solutions to the problem
-
trivial/1-mnn
performs a sorting process and works just for d=1 -
trivial/d-mnn
performs$O(n^2)$ comparison and works for generic values of d -
trivial/periodic-d-mnn
performs$O(n^2)$ comparison and works in spaces with periodic boundary conditions.
-
-
kdtree/
contains solutions which construct a K-D tree
-
kdtree/d-mnn
is the basic algorithm which works in euclidean spaces. -
kdtree/periodic-d-mnn
works in spaces with periodic boundary conditions.
-
-
vptree/
contains solutions which construct a VP tree
-
vptree/d-mnn
works in euclidean spaces. -
vptree/periodic-d-mnn
works in spaces with periodic boundary conditions.
-
- gcc
- make
To build all the executables run:
./make
If you are using bash you can test that everything works by simply running:
./runTest
The syntax to run every executables is:
<executable> <yourInputFile> [-out <outputMode>] [-in <inputMode>] [-entropy] [-sort]
<outputMode>
is one of the following:
verbose
: the program prints the neighbours in a human-readable format.flat
: the program prints the neighbours in a non-human-readable format. It's useful just to compare 2 implementations and to save output space.silent
: the program doesn't print the neighbours.time
: the program prints just the time taken to compute the m-neighbours for each point.
<inputMode>
is one of the following:
verbose
: the program prints the points read from the input file.silent
: the program doesn't print the points read from the input file.
With the option -sort
the neighbours are sorted by increasing distances for each point.
With the option -entropy
the program prints the entropy in the specified space with periodic conditions. If -entropy
is specified in a executable that doesn't support this kind of spaces, an exception will be raised.
Note that by specifying -out time
, it is just considered the time taken to compute the neighborhood for each point and it is not considered the time to eventually sort the neighbours and calculate the entropy.
By default the both the <outputMode>
and the <inputMode>
are set to verbose. The neighbours are sorted and the entropy is not calculated.
Example 1:
vptree/d-mnn test/basic-tests/input1.txt -in silent -out verbose -sort
Runs the VP tree algorithm with the input file test/basic-tests/input1.txt. It doesn't print the read points, but it prints the neighbours sorted by distance.
Example 2:
trivial/periodic-d-mnn test/basic-tests/input2.txt -in silent -out silent -entropy
Runs the Naive algorithm (the one which works in spaces with periodic boundaries conditions) with the input file test/basic-tests/input2.txt. It doesn't print neither the read points and their neighbours. It just prints the calculated entropy.
<yourInputFile>
should have the following basic syntax
n=<nValue> d=<dValue> m=<mValue>
[L <l1> <l2> ... <ld>]
[U <u1> <u2> ... <ud>]
<p1_1> <p1_2> ... <p1_d>
<p2_1> <p2_2> ... <p2_d>
...
<pn_1> <pn_2> ... <pn_d>
- The first line is mandatory and defines parameters n,d, and m.
- The second and the third line are optional and should be used to specify the periodic boundary conditions if you use a periodic-d-mnn program. If you use a program for basic euclidean spaces, those lines are ignored if present. li is the lower bound for the i-th coordinate and ui is the upper bound for the i-th coordinate.
- The following lines are used to specify the points. pi_j is the j-th coordinate of the i-th point.
Some examples could be found in the test/basic-tests
directory.
The run-time complexities are summarized in the table below.
Algorithm | Expected Time | Worst Case | Constraints |
---|---|---|---|
trivial/1-mnn | d=1 | ||
trivial/d-mnn | |||
trivial/periodic-d-mnn | |||
vptree/d-mnn | |||
vptree/periodic-d-mnn | |||
kdtree/d-mnn | |||
kdtree/periodic-d-mnn |
Note that with options -sort
or -entropy
, an additional time of