# Random data - synthetic demo

This is a simple demonstration of how to use `SimilaritySearch.jl`. The API corresponds to version `0.10`, the version installed below.

In [1]:
using Pkg
Pkg.activate(".")
Pkg.add([
    PackageSpec(name="SimilaritySearch", version="0.10")
])

using SimilaritySearch

[32m[1m  Activating[22m[39m project at `~/Research/SimilaritySearchDemos/synthetic`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/Research/SimilaritySearchDemos/synthetic/Project.toml`
[32m[1m  No Changes[22m[39m to `~/Research/SimilaritySearchDemos/synthetic/Manifest.toml`


# A random dataset
Let us define an 8-dimensional dataset with $10^5$ elements. Each object is a column. The matrix must be wrapped as a database because `SimilaritySearch` is distance-agnostic and objects can have any representation. The matrix is not copied.


In [2]:
n = 100_000
M = randn(Float32, 8, n)
db = MatrixDatabase(M)

MatrixDatabase{Matrix{Float32}}(Float32[-0.7630596 0.2501615 … 0.44209227 -0.9475498; -0.42873305 1.2776356 … -0.43395448 1.1141217; … ; 1.5619737 0.09357595 … -0.4541717 0.5648302; -0.25632438 1.4236702 … -0.31651688 -0.2934293])

The database object mimics a vector of elements:

In [3]:
length(db), eltype(db), typeof(db[1])

(100000, AbstractVector{Float32}, SubArray{Float32, 1, Matrix{Float32}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true})

The `SubArray` type returned by `typeof(db[1])` means that each object is a `view` of a column, so no extra memory is allocated.
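The no-copy behaviour can be checked with plain Julia arrays. The following is a minimal sketch using only `Base`; it does not touch `MatrixDatabase`, and the small matrix `A` is an illustrative stand-in for the data. Writing through a column view changes the parent matrix, which shows that both share memory, while a regular slice is an independent copy.

```julia
# Illustrative stand-in for the data matrix (not the `M` used above).
A = zeros(Float32, 8, 5)

# A column view is a SubArray that references `A` instead of copying it.
col = view(A, :, 1)
col[1] = 42f0            # write through the view; `A[1, 1]` changes too

# A slice such as `A[:, 1]` allocates an independent copy instead.
copied = A[:, 1]
copied[2] = 7f0          # `A[2, 1]` is left untouched

(A[1, 1], A[2, 1], parent(col) === A)   # (42.0f0, 0.0f0, true)
```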

# Index construction

An index is defined as follows

In [4]:
dist = SqL2Distance()
G = SearchGraph(; db, dist, verbose=false)

SearchGraph{SqL2Distance, MatrixDatabase{Matrix{Float32}}, SimilaritySearch.AdjacencyLists.AdjacencyList{UInt32}, BeamSearch}
  dist: SqL2Distance SqL2Distance()
  db: MatrixDatabase{Matrix{Float32}}
  adj: SimilaritySearch.AdjacencyLists.AdjacencyList{UInt32}
  hints: Array{Int32}((0,)) Int32[]
  search_algo: BeamSearch
  len: Base.RefValue{Int64}
  verbose: Bool false
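As its name suggests, `SqL2Distance` stands for the squared Euclidean distance. The sketch below reproduces that formula with `Base` only; `sq_l2` is a hypothetical helper written for illustration, not a function of the package. Because the square root is monotone, ranking neighbors by the squared distance gives the same order as ranking by the Euclidean distance, while skipping the square root.

```julia
# Hypothetical helper (not part of SimilaritySearch):
# squared Euclidean distance between two vectors of equal length.
sq_l2(a, b) = sum(abs2(x - y) for (x, y) in zip(a, b))

a = Float32[0, 0, 0]
b = Float32[1, 2, 2]

d2 = sq_l2(a, b)   # 1 + 4 + 4 = 9
d  = sqrt(d2)      # Euclidean distance, 3

(d2, d)
```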


The `SearchGraph` index is built incrementally and holds a list of callbacks that are called at exponentially spaced steps. By default it uses `OptimizeParameters(kind=ParetoRecall())`, so the index tries to jointly optimize search speed and recall. It is also possible to optimize for a minimum recall with `MinRecall(0.9)`, for a construction that tries to reach a recall of 0.9 (using the same dataset as the gold standard).
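To make the recall target concrete, here is a minimal sketch of how recall can be computed from two identifier matrices, one column per query: the exact neighbors (the gold standard) and the approximate ones. `mean_recall` and the tiny matrices are hypothetical and written for illustration with `Base` only; they are not the routine the package uses internally.

```julia
# Hypothetical helper (not part of SimilaritySearch): average, over queries,
# of the fraction of gold-standard neighbors that the approximate result found.
function mean_recall(gold::AbstractMatrix{<:Integer}, approx::AbstractMatrix{<:Integer})
    size(gold) == size(approx) || throw(DimensionMismatch("matrices must have equal size"))
    k, m = size(gold)
    total = 0.0
    for j in 1:m
        found = intersect(view(gold, :, j), view(approx, :, j))
        total += length(found) / k
    end
    total / m
end

# Two queries, k = 3 neighbors each (made-up identifiers).
gold   = [1 4; 2 5; 3 6]
approx = [1 4; 2 9; 7 6]

mean_recall(gold, approx)   # each query recovers 2 of 3 neighbors
```

A value of 0.9 would mean that, on average, nine out of every ten true nearest neighbors are retrieved.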

Once the index is defined, it must be built as follows (note that construction can print a lot of information):

In [5]:
# index!(G, callbacks=SearchGraphCallbacks(hyperparameters=OptimizeParameters(kind=MinRecall(0.9))))
index!(G)

SearchGraph{SqL2Distance, MatrixDatabase{Matrix{Float32}}, SimilaritySearch.AdjacencyLists.AdjacencyList{UInt32}, BeamSearch}
  dist: SqL2Distance SqL2Distance()
  db: MatrixDatabase{Matrix{Float32}}
  adj: SimilaritySearch.AdjacencyLists.AdjacencyList{UInt32}
  hints: Array{Int32}((112,)) Int32[227, 236, 247, 259, 325, 354, 374, 401, 544, 573  …  6643, 6661, 6696, 6697, 6752, 6769, 6774, 6898, 6903, 6956]
  search_algo: BeamSearch
  len: Base.RefValue{Int64}
  verbose: Bool false


# Searching

Searching is performed with the methods `search` and `searchbatch`. They are similar: the first solves a single query and the second solves a batch of queries.

In [6]:
I, D = searchbatch(G, MatrixDatabase(rand(8, 3)), 10)
size(I), size(D)

((10, 3), (10, 3))

It returns two matrices of size $10 \times 3$ (the 10 nearest neighbors of each of the three queries). Note that our dataset is composed of `Float32` vectors while the queries are `Float64` vectors. Julia's automatic specialization allows this, but it may hurt performance (due to SIMD operations).
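The element-type mismatch is easy to check, and to avoid, with `Base` alone. A minimal sketch (the variable names are illustrative): `rand(8, 3)` produces `Float64` values, and broadcasting `Float32` yields a converted copy that matches the dataset's element type.

```julia
Q64 = rand(8, 3)        # default element type is Float64
Q32 = Float32.(Q64)     # converted copy that matches a Float32 dataset

(eltype(Q64), eltype(Q32), size(Q32))   # (Float64, Float32, (8, 3))
```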

A similar way to search is using an array of queries

In [7]:
I, D = searchbatch(G, MatrixDatabase(rand(Float32, 8, 3)), 10)
size(I), size(D)

((10, 3), (10, 3))

Note: querying directly with `rand(8, 3)` will produce unexpected results. Note: the canonical way to run `searchbatch` queries is the first one (wrapping the query set with a `MatrixDatabase`); the second form should be used only for quick scripting.
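To make the shape of the two result matrices concrete, the following sketch solves a batch of queries by exhaustive search using only `Base`. `bruteforce_knn` is a hypothetical helper written for illustration; it is not how `SearchGraph` searches, and it assumes squared Euclidean distance and one object per column. Column `j` of each output holds the `k` nearest neighbors of query `j`, sorted by increasing distance.

```julia
# Hypothetical helper (not part of SimilaritySearch): exact k nearest
# neighbors by scanning every column of `X` for every column of `Q`.
function bruteforce_knn(X::AbstractMatrix, Q::AbstractMatrix, k::Integer)
    n, m = size(X, 2), size(Q, 2)
    ids = Matrix{Int}(undef, k, m)
    dists = Matrix{Float32}(undef, k, m)
    for j in 1:m
        q = view(Q, :, j)
        # squared Euclidean distance from the query to every object
        d = [sum(abs2, view(X, :, i) .- q) for i in 1:n]
        nearest = partialsortperm(d, 1:k)   # positions of the k smallest distances
        ids[:, j] .= nearest
        dists[:, j] .= d[nearest]
    end
    ids, dists
end

# Four 2-dimensional objects on a line and two queries near its ends.
X = Float32[0 1 2 3; 0 0 0 0]
Q = Float32[0.1 2.9; 0.0 0.0]

ids, dists = bruteforce_knn(X, Q, 2)
ids   # [1 4; 2 3]
```

An exhaustive scan like this costs one distance evaluation per object and query, which is the cost an index is meant to reduce.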

## Single queries
The function `search` solves a single query; the result is specified and stored in a `KnnResult` struct.

In [8]:
res = search(G, rand(Float32, 8), KnnResult(10)).res

KnnResult(IdWeight[IdWeight(0x00002c04, 0.34546858f0), IdWeight(0x0000b7cb, 0.3788787f0), IdWeight(0x0000dbe1, 0.4203803f0), IdWeight(0x00001691, 0.47034025f0), IdWeight(0x00015493, 0.50981396f0), IdWeight(0x00005fc9, 0.5403733f0), IdWeight(0x00016b0a, 0.58243275f0), IdWeight(0x000050c6, 0.5837985f0), IdWeight(0x00002bbc, 0.5849759f0), IdWeight(0x0001500b, 0.62225896f0)], 10)

The function `search` returns the struct passed as argument (`KnnResult(10)`) and the number of distance evaluations performed to solve it.

The `res` object has several related functions; internally, it contains identifiers and distances. The identifiers are indexes into the database that give access to the retrieved nearest neighbors, each paired with its distance to the query. `KnnResult` objects can be iterated and accessed by position.
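The idea of a result set that holds identifier and distance pairs, sorted by distance and limited to `k` entries, can be sketched with `Base` alone. `TopK` and `push_bounded!` below are hypothetical names written for illustration; they are not the package's `KnnResult` implementation, and the assumption that a full set discards candidates no better than its current worst entry is ours.

```julia
# Hypothetical bounded result set (not the package's KnnResult):
# keeps at most `k` pairs `id => distance`, sorted by increasing distance.
struct TopK
    k::Int
    items::Vector{Pair{Int,Float32}}
end
TopK(k::Integer) = TopK(k, Pair{Int,Float32}[])

function push_bounded!(r::TopK, id::Integer, dist::Real)
    d = Float32(dist)
    if length(r.items) == r.k
        # full: ignore candidates that do not improve on the current worst
        d >= last(r.items).second && return r
        pop!(r.items)                      # drop the current worst
    end
    # binary search on the distance component keeps the vector sorted
    pos = searchsortedfirst(r.items, Int(id) => d; by = last)
    insert!(r.items, pos, Int(id) => d)
    r
end

r = TopK(3)
for (id, d) in [(1, 0.5), (2, 0.1), (3, 0.9), (4, 0.3), (5, 2.0)]
    push_bounded!(r, id, d)
end

first.(r.items)   # identifiers by increasing distance: [2, 4, 1]
```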

In [9]:
display("text/markdown", """

- Nearest neighbor pair: `$(first(res))`
- argmin: $(argmin(res)), minimum: $(minimum(res))
- argmax: $(argmax(res)), maximum: $(maximum(res))
- 1nn: $(first(res)), 2nn: $(res[2])), last: $(last(res))
- knns: $(IdView(res))
- dists: $(DistView(res))
    

The `KnnResult` is a priority queue that stores at most `k` pairs.
You can modify it using `push!`, `pop!` and `popfirst!`

""")

display((:popfirst! => popfirst!(res), :res => res, length => length(res)))
display((:pop! => pop!(res), :res => res, :length => length(res)))
push_item!(res, 1, 0.0)
push_item!(res, 2, 1e6)
display(:after_push! => res)
display("text/markdown", "### You can also iterate the result set and access to the indexed dataset")
for (i, p) in enumerate(res)
    println(i => (p.weight, G[p.id]))
end





- Nearest neighbor pair: `IdWeight(0x00002c04, 0.34546858f0)`
- argmin: 11268, minimum: 0.34546858
- argmax: 86027, maximum: 0.62225896
- 1nn: IdWeight(0x00002c04, 0.34546858f0), 2nn: IdWeight(0x0000b7cb, 0.3788787f0)), last: IdWeight(0x0001500b, 0.62225896f0)
- knns: IdView(KnnResult(IdWeight[IdWeight(0x00002c04, 0.34546858f0), IdWeight(0x0000b7cb, 0.3788787f0), IdWeight(0x0000dbe1, 0.4203803f0), IdWeight(0x00001691, 0.47034025f0), IdWeight(0x00015493, 0.50981396f0), IdWeight(0x00005fc9, 0.5403733f0), IdWeight(0x00016b0a, 0.58243275f0), IdWeight(0x000050c6, 0.5837985f0), IdWeight(0x00002bbc, 0.5849759f0), IdWeight(0x0001500b, 0.62225896f0)], 10))
- dists: DistView(KnnResult(IdWeight[IdWeight(0x00002c04, 0.34546858f0), IdWeight(0x0000b7cb, 0.3788787f0), IdWeight(0x0000dbe1, 0.4203803f0), IdWeight(0x00001691, 0.47034025f0), IdWeight(0x00015493, 0.50981396f0), IdWeight(0x00005fc9, 0.5403733f0), IdWeight(0x00016b0a, 0.58243275f0), IdWeight(0x000050c6, 0.5837985f0), IdWeight(0x00002bbc, 0.5849759f0), IdWeight(0x0001500b, 0.62225896f0)], 10))
    

The `KnnResult` is a priority queue that stores at most `k` pairs.
You can modify it using `push!`, `pop!` and `popfirst!`



(:popfirst! => IdWeight(0x00002c04, 0.34546858f0), :res => KnnResult(IdWeight[IdWeight(0x0000b7cb, 0.3788787f0), IdWeight(0x0000dbe1, 0.4203803f0), IdWeight(0x00001691, 0.47034025f0), IdWeight(0x00015493, 0.50981396f0), IdWeight(0x00005fc9, 0.5403733f0), IdWeight(0x00016b0a, 0.58243275f0), IdWeight(0x000050c6, 0.5837985f0), IdWeight(0x00002bbc, 0.5849759f0), IdWeight(0x0001500b, 0.62225896f0)], 10), length => 9)

(:pop! => IdWeight(0x0001500b, 0.62225896f0), :res => KnnResult(IdWeight[IdWeight(0x0000b7cb, 0.3788787f0), IdWeight(0x0000dbe1, 0.4203803f0), IdWeight(0x00001691, 0.47034025f0), IdWeight(0x00015493, 0.50981396f0), IdWeight(0x00005fc9, 0.5403733f0), IdWeight(0x00016b0a, 0.58243275f0), IdWeight(0x000050c6, 0.5837985f0), IdWeight(0x00002bbc, 0.5849759f0)], 10), :length => 8)

:after_push! => KnnResult(IdWeight[IdWeight(0x00000001, 0.0f0), IdWeight(0x0000b7cb, 0.3788787f0), IdWeight(0x0000dbe1, 0.4203803f0), IdWeight(0x00001691, 0.47034025f0), IdWeight(0x00015493, 0.50981396f0), IdWeight(0x00005fc9, 0.5403733f0), IdWeight(0x00016b0a, 0.58243275f0), IdWeight(0x000050c6, 0.5837985f0), IdWeight(0x00002bbc, 0.5849759f0), IdWeight(0x00000002, 1.0f6)], 10)

### You can also iterate the result set and access to the indexed dataset

1 => (0.0f0, Float32[-0.7630596, -0.42873305, -0.16345312, -0.30858696, -0.59181386, 1.7422782, 1.5619737, -0.25632438])
2 => (0.3788787f0, Float32[0.62754375, 0.9978854, 0.6051406, 0.5104975, 0.48214805, 0.6172233, 0.8566184, -0.0988432])
3 => (0.4203803f0, Float32[0.40087703, 0.7620401, 0.5171964, -0.012335303, 0.5355121, 0.4249623, 1.122061, 0.40317476])
4 => (0.47034025f0, Float32[0.639298, 0.8764266, 1.16921, 0.23474234, 0.62843925, 0.40918884, 0.4388563, 0.03813715])
5 => (0.50981396f0, Float32[0.36410972, 0.91254157, 0.6256755, -0.0242322, 0.19040246, 0.4812591, 0.617917, 0.6650715])
6 => (0.5403733f0, Float32[0.117136024, 0.9634689, 1.1168529, 0.6943728, 0.7118387, 0.64946336, 0.47406694, 0.3585967])
7 => (0.58243275f0, Float32[0.48408374, 0.8917867, 0.664812, 0.871713, 0.5335689, 0.55054533, 0.78279185, -0.21962304])
8 => (0.5837985f0, Float32[0.11780816, 0.39372015, 0.687422, 0.4780256, 0.68495995, -0.0010389228, 0.88268745, 0.013796319])
9 => (0.5849759f0, Float32[0.54255056