Skip to content
This repository has been archived by the owner on Nov 17, 2021. It is now read-only.

Latest commit

 

History

History
296 lines (228 loc) · 13 KB

README.md

File metadata and controls

296 lines (228 loc) · 13 KB

NGT

Neighborhood Graph and Tree for Indexing High-dimensional Data

Home / Build / Command / License / Publications / About Us

Command

Name

ngt - proximity search for high dimensional data

Synopsis

  $ ngt command [option] index [data]

Note:

When the environment variable POSIXLY_CORERECT is set on some platforms such as Cygwin, you should specifiy options before the command as follows.

  $ ngt [option] command index [data]

Description

ngt provides high-speed nearest neighbor searches against a large volume of data (several million to several 10 million items of data) in high dimensional vector data space (several ten to several thousand dimensions).

command is one of:

CREATE

Constructs the specified index with the specified data.

  $ ngt create -d no_of_dimensions [-p no_of_threads] [-b no_of_batch_processes] 
      [-i index_type] [-g graph_type] [-t edge_reduction_threshold] 
      [-e search_range_coefficient] [-E no_of_edges] [-S no_of_edges_at_search_time] 
      [-o object_type] [-D distance_function] [-n no_of_registration_data] 
      index registration_data

index
Specifies the name of the index to be generated. After data registration, an index is generated consisting of multiple files each with an extension attached to this index name.

registration_data
Specifies the vector data to be registered. These data shall consist of one object (data item) per line and each dimensional element shall be delimited by a space or tab.

-d no_of_dimensions
Specifies the number of dimensions of registration data. Specification is unnecessary if each row of the registration data file consists only of dimensional elements. However, if attribute information or other types of data follow the dimensional elements, such subsequent data will be ignored based on the number of dimensions specified here.

-p no_of_threads (default = 24, recommend value = number of cores)
Specifies the number of threads to be used for parallel processing at generation time.

-b no_of_batch_processes (default = recommend value = 200)
Specifies the number of objects for batch processing, which is normally performed for a fixed number of objects. It is generally not necessary to specify this option.

-i index_type (g|t)
Selects between creation of graph only or graph and tree.

  • g: Generates only a graph index. At search time, the search start point in the graph is randomly determined.
  • t: Generates a tree index in addition to a graph index. At search time, the search start point in the graph is determined using the tree. (default/recommended)

-g graph_type
Specifies the type of graph index.

  • a: Generates ANNG. Enables high-speed registration. (default/recommended)
  • k: Generates KNNG. Registration is very slow. (for experimental use—not recommended)
  • b: Generates BKNNG. Registration is very slow, but search has highest level of performance. (for experimental use—not recommended)

-t edge_reduction_threshold (default = recommend value = 0)
Specifies the increase in number of edges that acts as a criterion for executing excess-edge reduction processing. Although excess-edge reduction processing can only be used when selecting ANNG, it involves heavy processing and is essential unnecessary unless there is a need to reduce the amount of consumed memory as much as possible. Specifying 0 here prevents the execution of excess-edge reduction processing.

-e search_range_coefficient (default = recommend value = 0.1)
When specifying ANNG or BKNNG, neighboring nodes connected to an item of registration data (node) by edges are obtained by searching and combined by edges. This option specifies the magnification coefficient of the search range at search time.

-E no_of_edges (default = recommend value = 10)
Specifies the number of initial edges of each node at graph generation time. Once an index has been generated, the number of edges will be equal to or greater than this specified number in the case of ANNG or BKNNG and equal to this specified number in the case of KNNG.

-S no_of_edges_at_search_time (default = recommend value = 40)
Specifies the number of edges at search time accompanying or following index generation. This value is used when not specifying the number of edges by the search command. It is specified to conduct searches by a number of edges less than the actual number of edges of each node in the graph. Since a large number of edges may be generated in the case of ANNG or BKNNG, limiting the number of edges can help improve search performance. Specifying 0 here indicates that the number of edges is not to be limited (that all actual edges are to be used). If 0 is specified, index generation is relatively slow, but top performance can be obtained at search time.

-o object_type
Specifies the data object type.

  • c: 1 byte integer
  • f: 4 byte floating point (default)

-D distance_function
Specifies the distance function as follows.

  • 1: L1 distance
  • 2: L2 distance (default)
  • a: angle
  • h: Hamming distance

-n no_of_registration_data
Specifies the number of data items to be registered. If not specified, all data in the specified file will be registered.

APPEND

Adds the specified data to the specified index.

  $ ngt append [-p no_of_threads] [-d no_of_dimensions] [-n no_of_registration_data]
      index registration_data

index
Specifies name of existing index (the portion excluding index file extension).

registration_data
Specifies the vector data to be registered. These data shall consist of one object (data item) per line and each dimensional element shall be delimited by a space or tab.

-p no_of_threads (default = recommend value = 24)
Specifies the number of threads to be used for parallel processing at generation time.

-d no_of_dimensions
Specifies the number of dimensions of registration data. Specification is unnecessary if each row of the registration data file consists only of dimensional elements. However, if attribute information or other types of data follow the dimensional elements, such subsequent data will be ignored based on the number of dimensions specified here.

-n no_of_registration_data
Specifies the number of data items to be registered. If not specified, all data in the specified file will be registered.

SEARCH

Searches the index using the specified query data.

  $ ngt search [-i index_type] [-e search_range_coefficient] [-n no_of_searches] 
      [-E max_no_of_edges] [-r search_radius] index query_data

index
Specifies name of existing index (the portion excluding index file extension).

query_data
Specifies the name of the file containing query data. This file shall consist of one item of query data per line and each dimensional element of that data item shall be delimited by a space or tab the same as registration data. A sequential search shall be performed when providing multiple queries.

-i index_type (g|t|s)
Selects between creation of graph only or graph and tree.

  • g: Uses only a graph index. May be specified even if a tree exists. The search start point in the graph is randomly determined.
  • t: Uses a tree index in addition to a graph index. An error occurs if no tree exists. The search start point in the graph is determined using the tree. Enables high-speed searches compared with graph only. (default/recommended)
  • s: Performs a linear search using no index.

-e search_range_coefficient (default = recommend value = 0.1)
Specifies the magnification coefficient of the search range. A larger value means greater accuracy but slower searching, while a smaller value means a drop in accuracy but faster searching. While it is desirable to adjust this value within the range of 0 - 0.3, a negative value may also be specified.

-n no_of_searches (default: 20)
Specifies the number of search results.

-E max_no_of_edges (default = value specified with the create command or 40; recommend value = 40)
Specifies the maximum number of edges to be used in the search. This option is specified when conducting a search with fewer edges than the number of edges of each node on the graph. Since a large number of edges can be generated in the case of ANNG or BKNNG, limiting the number of edges in this way tends to improve search performance. Specifying zero here indicates no limiting of number of edges (use all actual edges).

-r search_radius (default = infinite circle)
Specifies the search range in terms of the radius of a circle.

REMOVE

Removes the specified object from the index.

  $ ngt remove [-d object_id_specification_method] index object_id

index
Specifies name of existing index (the portion excluding index file extension).

object_id
Specifies the ID of the object to be removed.

-d object_id_specification_method (f|d) (default = f)
Specifies the method for specifying the ID of the object to be removed. Specifying f indicates that the following object-ID specification is to be treated as a file name. That file shall consist of one entry per line, each indicting the ID of an object to be removed. Specifying d indicates that the following object-ID specification is to be treated simply as an object-ID referring to the object to be removed.

Examples of using ngt command

Create

Construct an index for 128-dimensional, 1-byte-integer data:

  $ cd (NGT_TOP_DIR)
  $ ngt create -d 128 -o c index ./data/sift-dataset-5k.tsv
  Data loading time=0.160748 (sec) 160.748 (msec)
  # of objects=5000
  Index creation time=0.379659 (sec) 379.659 (msec)

Search

Perform a neighborhood search by three queries specified in a file:

  $ cd (NGT_TOP_DIR)
  $ ngt search -n 20 index ./data/sift-query-3.tsv
  Query No.1
  Rank	ID		Distance
  1		3031	239.332
  2		4079	240.002
  3		3164	244.504
  4		3718	246.763
  5		157		251.094
  6		2422	251.185
  7		1313	251.34
  8		379		252.446
  9		3521	260.158
  10	2594	261.132
  11	4627	262.381
  12	2159	263.471
  13	3519	264.909
  14	1764	265.136
  15	4400	266.156
  16	2717	266.914
  17	3168	269.637
  18	4236	270.673
  19	4700	272.725
  20	679		272.973
  Query Time= 0.000472 (sec), 0.472 (msec)
  Query No.2
  Rank	ID		Distance
  1		2726	291.983
  2		924		296.987
  3		3638	298.585
  4		858		300.376
  5		1453	306.805
  6		174		307.789
  7		2992	308.485
  8		2980	308.93
  9		1525	309.49
  10	244		309.816
  11	910		310.446
  12	3310	310.585
  13	2433	311.482
  14	1633	311.735
  15	3761	312.44
  16	407		313.252
  17	4546	313.876
  18	697		315.108
  19	34		315.563
  20	2189	316.193
  Query Time= 0.000478 (sec), 0.478 (msec)
  Query No.3
  Rank	ID		Distance
  1		762		194.286
  2		1046	212.695
  3		4906	215.244
  4		2905	216.539
  5		4142	219.479
  6		1879	219.616
  7		4398	223.352
  8		3842	223.468
  9		233		224.127
  10	2794	224.366
  11	2476	224.804
  12	1848	225.803
  13	3364	226.561
  14	4098	226.74
  15	3023	228.884
  16	4113	229.325
  17	1036	232.852
  18	1740	233.144
  19	2302	233.818
  20	2440	233.91
  Query Time= 0.00018 (sec), 0.18 (msec)
  Average Query Time= 0.000376667 (sec), 0.376667 (msec), (0.00113/3)

How to construct the indexes for our publications

	$ ngt create -i g -g a -S 0 -e 0.1 -E no_of_edges -d dimensionality_of_data -o data_type -D distatnce_type panng-index vector-data.dat
	$ ngt prune -e no_of_forcedly_pruned_edges -s no_of_selectively_pruned_edges panng-index

e.g.

	$ ngt create -i g -g a -S 0 -e 0.1 -E 10 -d 128 -o c -D 2 panng-index vector-data.dat
	$ ngt prune -e 60 -s 30 panng-index
	$ ngt create -i t -g a -S 0 -e 0.1 -E no_of_edges(k) -d dimensionality_of_data -o data_type -D distance_type anngt-index vector-data.dat

e.g.

	$ ngt create -i t -g a -S 0 -e 0.1 -E 16 -d 128 -o c -D 2 anngt-index vector-data.dat
	$ ngt create -i g -g a -S 0 -e 0.1 -E no_of_edges(k) -d dimensionality_of_data -o data_type -D distance_type anng-index vector-data.dat

e.g.

	$ ngt create -i g -g a -S 0 -e 0.1 -E 16 -d dimensionality_of_data -o data_type -D distance_type anng-index vector-data.dat
KNNG
	$ ngt create -i g -g k -S 0 -E no_of_edges(k) -d dimensionality_of_data -o data_type -D distance_type knng-index vector-data.dat

e.g.

	$ ngt create -i g -g k -S 0 -E 20 -d 128 -o c -D 2 knng-index vector-data.dat