# Random data - synthetic demo

This is a very simple demonstration of how to use `SimilaritySearch.jl`. The API corresponds to version `0.8`.

In [1]:
using Pkg
Pkg.activate(".")
Pkg.add([
    PackageSpec(name="SimilaritySearch", version="0.8")
])

using SimilaritySearch

[32m[1m  Activating[22m[39m project at `~/Research/SimilaritySearchDemos/synthetic`
[32m[1m    Updating[22m[39m registry at `~/.julia/registries/General.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/Research/SimilaritySearchDemos/synthetic/Project.toml`
[32m[1m  No Changes[22m[39m to `~/Research/SimilaritySearchDemos/synthetic/Manifest.toml`


# A random dataset
Let us define an 8-dimensional dataset with $10^5$ elements, where each object is a column. The matrix must be wrapped as a database because `SimilaritySearch` is distance agnostic and objects can have any representation. The matrix is not copied.


In [2]:
n = 100_000
M = randn(Float32, 8, n)
db = MatrixDatabase(M)

MatrixDatabase{Matrix{Float32}}(Float32[0.5034658 -0.93668336 … -0.7618183 0.6744511; 0.47730342 -2.610985 … -0.49312723 0.13696608; … ; -0.8878111 2.229185 … 2.9433165 -0.97900796; 0.44475862 -0.29567906 … -0.20951769 -0.47941688])

The database object mimics a vector of elements

In [3]:
length(db), eltype(db), typeof(db[1])

(100000, AbstractVector{Float32}, SubArray{Float32, 1, Matrix{Float32}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true})

The `SubArray` reported by `typeof(db[1])` means that each object is a `view` of a column, so accessing an object does not copy the column's data.
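The sharing between a view and its parent matrix can be checked in plain Julia, without `SimilaritySearch`. The following is a minimal sketch; the names `A`, `col`, and `copied` are illustrative only.

```julia
# Plain Julia: a column view shares memory with its parent matrix.
A = Float32[1 2 3; 4 5 6]      # 2×3 matrix; each column is one object
col = view(A, :, 2)            # a SubArray, the same kind of object shown above for db[1]

col[1] = 20f0                  # writing through the view...
A[1, 2]                        # ...changes the parent matrix as well

copied = A[:, 2]               # by contrast, slicing with [:, 2] allocates a new Vector
copied[2] = -1f0               # modifying the copy leaves A untouched
```

Because the database hands out views, changing `M` after wrapping it would also change what the database returns.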

# Index construction

An index is defined as follows

In [4]:
G = SearchGraph(db=db, dist=SqL2Distance())
push!(G.callbacks, OptimizeParameters(kind=ParetoRecall(), params=SimilaritySearch.SearchParams(verbose=false)))

3-element Vector{Callback}:
 DisjointHints
  logbase: Float32 1.1f0

 NeighborhoodSize()
 OptimizeParameters
  kind: ParetoRecall ParetoRecall()
  initialpopulation: Int64 16
  params: SearchModels.SearchParams
  ksearch: Int32 10
  numqueries: Int32 64
  minrecall: Float64 0.9
  space: BeamSearchSpace


The `SearchGraph` index is built incrementally, and `G.callbacks` contains a list of callbacks that are called at exponential steps during construction. We add an `OptimizeParameters` callback of kind `ParetoRecall` so that the index tries to optimize search speed and recall jointly. It is also possible to optimize for a minimum recall with `MinRecall`.

Once the index is defined, it must be built as follows (please note that the construction can output a lot of information):
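Recall, the quantity these optimizers target, is the fraction of the true nearest neighbors that an approximate search actually returns. A minimal sketch in plain Julia follows; `recall_at_k` is a hypothetical helper written for this illustration, not a function of `SimilaritySearch`, and the identifiers are made up.

```julia
# Hypothetical helper: fraction of the exact neighbors that also appear
# in the approximate result.
recall_at_k(exact_ids, approx_ids) =
    length(intersect(exact_ids, approx_ids)) / length(exact_ids)

exact  = [3, 7, 1, 9, 4]   # made-up identifiers of the true 5 nearest neighbors
approx = [3, 7, 9, 2, 8]   # made-up identifiers returned by an approximate index

r = recall_at_k(exact, approx)   # 3 of the 5 true neighbors were found
```

A recall of 1 means the approximate result matches the exact one; lower values usually buy faster searches.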

In [5]:
index!(G)
IJulia.clear_output()

0

# Searching

Searching can be performed with the methods `search` and `searchbatch`. They are very similar: the first solves a single query and the second solves a batch of queries.

In [6]:
I, D = searchbatch(G, MatrixDatabase(rand(8, 3)), 10)
size(I), size(D)

((10, 3), (10, 3))

It returns two matrices of size $10 \times 3$: the identifiers `I` and the distances `D` of the 10 nearest neighbors of each of the three queries. Please note that our dataset is composed of `Float32` vectors while the queries here are `Float64` vectors. Julia allows this because methods are specialized automatically for the argument types, but it may hurt performance (for instance, by hindering SIMD operations).
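To make the result layout concrete, here is a sketch of an exhaustive search in plain Julia that produces matrices shaped like `I` and `D` above, using the squared Euclidean distance that the name `SqL2Distance` indicates. It is not how `SearchGraph` works internally; `exact_knn`, `sqdist`, `X`, and `Q` are names invented for this illustration, and the dataset is a small stand-in.

```julia
# Exhaustive k-nearest-neighbor search; illustrative only.
sqdist(u, v) = sum(abs2, u .- v)           # squared Euclidean distance

function exact_knn(X::AbstractMatrix, Q::AbstractMatrix, k::Integer)
    n, m = size(X, 2), size(Q, 2)
    ids = Matrix{Int32}(undef, k, m)       # neighbor identifiers, one column per query
    dists = Matrix{Float32}(undef, k, m)   # matching distances, in increasing order
    for j in 1:m
        q = view(Q, :, j)
        d = [sqdist(view(X, :, i), q) for i in 1:n]
        order = partialsortperm(d, 1:k)    # positions of the k smallest distances
        ids[:, j] = order
        dists[:, j] = d[order]
    end
    ids, dists
end

X = randn(Float32, 8, 1_000)               # small stand-in for the dataset
Q = rand(Float32, 8, 3)                    # Float32 queries match the data's element type
knn_ids, knn_dists = exact_knn(X, Q, 10)
size(knn_ids), size(knn_dists)             # ((10, 3), (10, 3))

# Mixing element types still works, but the arithmetic is promoted to Float64.
mixed = sqdist(view(X, :, 1), rand(8))
typeof(mixed)                              # Float64
```

Generating the queries with `rand(Float32, 8, 3)` keeps every distance evaluation in `Float32`, which avoids the promotion shown in the last lines.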

A similar way to search is using an array of queries

In [7]:
I, D = searchbatch(G, [rand(Float32, 8) for i in 1:3], 10)
size(I), size(D)

((10, 3), (10, 3))

Note: querying directly with `rand(8, 3)` will produce unexpected results.

Note: the canonical way to call `searchbatch` is the first one (wrapping the query set with a `MatrixDatabase`); the second form should be used only for quick scripting since it always…
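One plausible explanation for the first note, assumed here rather than taken from the package documentation, is that a bare `Matrix` behaves as a collection of scalars in Julia, not as a collection of columns. The sketch below uses only Base functions; `Qm` and `cols` are illustrative names.

```julia
Qm = rand(Float32, 8, 3)        # three 8-dimensional queries stored as columns

length(Qm)                      # 24: a Matrix iterates over its scalar entries
Qm[1] isa Float32               # true: linear indexing yields a number, not a column

cols = collect(eachcol(Qm))     # explicit column views, one per query
length(cols), length(cols[1])   # (3, 8)
```

Wrapping the matrix in a `MatrixDatabase`, as in the first form, makes the column-per-object interpretation explicit.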

## Single queries
The function `search` solves a single query; the number of neighbors is specified by, and the result is stored in, a `KnnResult` struct.

In [8]:
res, cost = search(G, rand(Float32, 8), KnnResult(10))

(res = KnnResult(Int32[67346, 21525, 26550, 19936, 58579, 49516, 40222, 90325, 7281, 82431], Float32[0.35673603, 0.4084082, 0.4260828, 0.44249764, 0.4783573, 0.53685194, 0.5409415, 0.5529058, 0.5750766, 0.5774101], 10), cost = 345)

The function `search` returns the struct passed as argument (`KnnResult(10)`) and the number of distance evaluations performed to solve it.
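The cost is easiest to read against an exhaustive scan, which evaluates one distance per database object. A short calculation with the figures from this run (they will differ between runs because the data and the query are random):

```julia
n = 100_000          # database size used in this demo
cost = 345           # distance evaluations reported by `search` above

fraction = cost / n  # share of the database actually compared against the query
speedup = n / cost   # how many times fewer evaluations than an exhaustive scan
```

Here the index touched well under 1% of the objects to answer the query.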

The `res` object has several related functions but, internally, it contains identifiers and distances. The identifiers are indices into the database that give access to the retrieved nearest neighbors, and each one is paired with its distance to the query.

In [9]:
for (i, (id, dist)) in enumerate(res)
    println(i => (dist, G[id]))
end

1 => (0.35673603f0, Float32[0.4994649, 0.2459959, 0.41559488, -0.22472794, 0.5081421, 0.75663584, 0.78263, 0.52516836])
2 => (0.4084082f0, Float32[0.23588413, 0.36198026, 0.35849693, 0.4293546, 0.4586625, 0.6400273, 0.5287682, 0.77999455])
3 => (0.4260828f0, Float32[0.25135532, 0.5490575, -0.38024586, -0.012148158, 0.40443987, 0.71936345, 0.81165045, 0.86893636])
4 => (0.44249764f0, Float32[0.49519494, 0.37087166, -0.21152371, 0.11435383, -0.258612, 0.5929354, 0.74892527, 0.6288024])
5 => (0.4783573f0, Float32[0.27711454, 0.07953253, 0.1760169, -0.40125924, 0.40620005, 0.925945, 0.8988471, 0.93427444])
6 => (0.53685194f0, Float32[0.6002004, 0.8654269, -0.07265843, -0.1685336, -0.13893478, 0.6286316, 0.660411, 1.113605])
7 => (0.5409415f0, Float32[0.24515025, 0.522316, 0.38613623, -0.25196764, 0.13193493, 1.0425844, 1.2510918, 0.9864671])
8 => (0.5529058f0, Float32[0.45296872, 0.27671918, 0.27731606, -0.56722015, 0.14737016, 0.6006524, 0.65252244, 1.0746515])
9 => (0.5750766f0, Float32[