/
approximate.Rmd
118 lines (96 loc) · 3.99 KB
/
approximate.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
---
title: "Approximate search"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Approximate search}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
use_cached_data = TRUE
can_run = require(kdtools) &&
require(ggplot2) &&
require(tidyr) &&
require(microbenchmark) &&
kdtools::has_cxx17()
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
eval = can_run,
echo = FALSE
)
```
```{r eval=!can_run}
if (has_cxx17()) {
message("Required packages missing -- code will not be evaluated")
} else {
message("kdtools needs C++17 for full functionality, code will not be evaluated")
}
```
```{r}
nsamp <- 1e5
reps <- 33
```
```{r}
cache_file = "../../inst/extdata/approx_benchmark_data"
if (!file.exists(cache_file))
cache_file = system.file("extdata/approx_benchmark_data", package = "kdtools")
if (use_cached_data && file.exists(cache_file)) {
load(cache_file)
} else {
ndim <- round(seq(2, 50, length = 10))
a <- c(0, 1, 5, 10, 50, 100)
experiment <- tidyr::expand_grid(ndim, a)
f <- function(ndim, a) {
summary(microbenchmark({
kd_nearest_neighbors(x, runif(ndim), 7, a = a)
}, times = reps, setup = {
x <- kd_sort(matrix(runif(ndim * nsamp), nr = nsamp, nc = ndim))
}), unit = "eps")$median
}
experiment$eps <- vapply(1:nrow(experiment),
function(i) do.call("f", experiment[i, ]),
numeric(1))
g <- function(ndim, a) {
res <- numeric(reps)
for (i in 1:reps) {
x <- kd_sort(matrix(runif(ndim * nsamp), nr = nsamp, nc = ndim))
key <- runif(ndim)
y1 <- kd_nn_indices(x, key, 1, distances = TRUE)
y2 <- kd_nn_indices(x, key, 1, a = a, distances = TRUE)
res[i] <- (y2$distance - y1$distance) / y1$distance
}
return(median(res))
}
experiment$delta <- vapply(1:nrow(experiment),
function(i) do.call("g", experiment[i, ]),
numeric(1))
save(experiment, file = cache_file)
}
```
The `kdtools` package implements approximate nearest neighbors. This is done in the standard way by shrinking the search radius by a factor of $1 / (1 + \alpha)$. Once the greedy descent phase is completed and candidate nearest neighbors are queued, the algorithm interrogates adjacent half-planes while backtracking if the search key is sufficiently close to the half-plane boundary. The factor $\alpha$ reduces the definition of sufficiently close. Given a sufficiently large $\alpha$, the algorithm is entirely greedy and ignores any points not in the nearest half-plane.
Approximate search is accomplished by setting the `a` argument to a value grater than zero.
```{r echo=TRUE}
key <- runif(13)
x <- kd_sort(matrix(runif(13 * 1e5), nr = 1e5, nc = 13))
cbind(kd_nn_indices(x, key, 3, distances = TRUE),
kd_nn_indices(x, key, 3, distances = TRUE, a = 10))
```
The results below show number of searches completed per second with different number of dimensions (length of the search key) and different values of $\alpha$. Number of evaluations is the median of `r toString(reps)` trials searching among `r toString(nsamp)` vectors containing uniform random deviates.
```{r fig.height=4, fig.width=6}
ggplot(experiment) +
geom_point(aes(x = ndim, y = eps,
color = as.factor(a))) +
ylab("Evaluations per second") +
xlab("Number of dimensions") +
scale_color_discrete(name = "alpha")
```
The following shows the relative error $(d_2 - d_1) / d_1$ where $d_2$ is the approximate nearest neighbor and $d_1$ is the exact nearest neighbor distance. Notice these results suggest that greedy search is optimal for 2-dimensions, at least for uniform random vectors. The results will depend on the details of the input data.
```{r fig.height=4, fig.width=6}
ggplot(experiment) +
geom_point(aes(x = ndim, y = delta,
color = as.factor(a))) +
ylab("Median proportional error") +
xlab("Number of dimensions") +
scale_color_discrete(name = "alpha")
```