Commit 7046ff4

Merge branch 'optimizations'
2 parents bb087dc + 339dbaf

25 files changed: +1788 −670 lines

README.md (+50 −6)

@@ -17,8 +17,8 @@ rustup install nightly
 ## Building all the binaries
 
 ```
-cargo install --force --path danny
-cargo install --force --path danny-utilities
+cargo install --force --path danny --locked
+cargo install --force --path danny-utilities --locked
 ```
 
 ## Prepare the datasets
@@ -33,10 +33,10 @@ DANNY_MINIONS=hostnames,separated,by,comma,that,execute,experiments # Or localho
 The following command will download and preprocess **all** datasets, if they are not already on your machine! Takes a **long** time.
 
 ```
-./run.py --list
+./datasets/prepare.py --list
 ```
 
-If you are interested in just one dataset, edit the `run.py` and edit the `DATASETS` dictionary, removing the ones you don't need.
+If you are interested in just one dataset, edit the `DATASETS` dictionary in `prepare.py`, removing the ones you don't need.
 Also, find and comment out the following loop:
 
 ```python
@@ -51,7 +51,7 @@ You can use the `sampledata` binary that was installed alongside the other utili
 An example usage is the following, for taking 5000 points from dataset `livejournal`:
 
 ```
-sampledata --measure jaccard --size 5000 $DANNY_DATA_DIR/Livejournal.bin $DANNY_DATA_DIR/Livejournal-5000.bin
+sampledata --size 5000 $DANNY_DATA_DIR/Livejournal.bin $DANNY_DATA_DIR/Livejournal-5000.bin
 ```
 
 If you are sampling from a dataset which uses the cosine distance, use `--measure cosine`.
@@ -63,11 +63,55 @@ You can define several environment variables to control the behavior of `danny`,
 Example invocation of the one round, fixed parameter LSH algorithm:
 
 ```
-danny -m jaccard --algorithm lsh -k 8 --rounds one --range 0.5 $DANNY_DATA_DIR/Livejournal-5000.bin $DANNY_DATA_DIR/Livejournal-5000.bin
+danny --algorithm local-lsh -k 8 --range 0.5 $DANNY_DATA_DIR/Livejournal-5000.bin $DANNY_DATA_DIR/Livejournal-5000.bin
 ```
 
 For a list of all available options and algorithms, please consult `danny --help`.
 
+## Running on a cluster
+
+Deploying and running on a cluster requires each machine of the cluster to have a copy of the `danny` binary available in the `$PATH`.
+The simplest way to accomplish this is to run
+
+```
+cargo install --force --path danny --locked
+```
+
+on each machine of the cluster. This will place the `danny` executable in the `~/.cargo/bin` directory of each machine,
+which should be added to `$PATH`.
+
+To run the code, you invoke `danny` on one of the machines and provide a list of all the hosts to use in a file: the executable
+will take care of spawning worker processes on all listed machines using `ssh`. Therefore it is best to have
+passwordless `ssh` configured in your cluster.
+
+The file listing hosts should contain `host:port` pairs, like the following (any port number will do):
+
+```
+host1:2001
+host2:2001
+host3:2001
+host4:2001
+host5:2001
+```
+
+Let the above file be `~/hosts.txt`. Then you can invoke `danny` as follows:
+
+```
+danny --hosts ~/hosts.txt --threads 8 --threshold 0.7 --algorithm local-lsh --recall 0.8 --k 4 $PATH_TO_DATA
+```
+
+which will run `danny` using 8 threads on each of the 5 listed hosts,
+using the `local-lsh` algorithm with `k=4` and required recall 0.8, at similarity threshold 0.7.
+There are four available algorithms:
+
+- `local-lsh`
+- `one-level-lsh`
+- `two-level-lsh`, which takes an additional parameter `--k2` for the number of hash functions to use locally
+- `cartesian`
+
+Additionally, you can specify the number of sketch bits to use with the `--sketch-bits` argument, which
+takes values in `0`, `64`, `128`, `256`, `512`.
+
 ## Hacking
 
 If you are changing the code, you can run the modified versions without reinstalling
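The cluster-deployment steps added to the README above (a hosts file of `host:port` pairs, plus `cargo install` on every machine) can be sketched as a small shell loop. This is an illustrative sketch, not part of the repository: the `echo` is a dry-run stand-in for the real `ssh` invocation, and `hosts.txt` is just an example file name; it assumes passwordless `ssh` and the repository checked out at the same path on every machine.

```shell
# Write an example hosts file in the host:port format danny expects.
cat > hosts.txt <<'EOF'
host1:2001
host2:2001
host3:2001
EOF

# For each listed host, print the install command that would be run there.
# In a real cluster, replace 'echo' with 'ssh' (and cd into the repository
# checkout on the remote machine before running cargo).
while IFS=: read -r host _port; do
  echo "$host: cargo install --force --path danny --locked"
done < hosts.txt
```

The port component is ignored here on purpose: it is only used by the `danny` worker processes themselves, not by `ssh`.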

analysis/Makefile (+5 −2)

@@ -3,15 +3,18 @@ CWD=`pwd`
 RSCRIPT=docker run -it --rm -v ${CWD}:/work -t ${DOCKER_CONTAINER} Rscript
 
 .PHONY: all
-all: imgs/dep_k.png imgs/sketches.png imgs/load.png imgs/full.png imgs/profile.png imgs/profile_glove_detail.png tex/best.tex
+all: imgs/dep_k.png imgs/sketches.png tex/best.tex tex/info.tex
 
 .PHONY: build-docker
 build-docker: Dockerfile
 	docker build -t ${DOCKER_CONTAINER} .
 
-imgs/dep_k.png: plot_k_dep.R tables.R danny-results.sqlite
+imgs/counters.png: plot_k_dep.R tables.R danny-results.sqlite
 	${RSCRIPT} plot_k_dep.R
 
+imgs/plot_subproblem_size.png: plot_subproblem_size.R tables.R danny-results.sqlite
+	${RSCRIPT} plot_subproblem_size.R
+
 imgs/sketches.png: plot_sketches.R tables.R danny-results.sqlite
 	${RSCRIPT} plot_sketches.R

analysis/latex.R (+79 −70)

@@ -11,24 +11,24 @@ latex_table_best <- function(data) {
     ungroup() %>%
     group_by(dataset, threshold) %>%
     mutate(
-      dataset = str_remove(dataset, "-sample-*"),
-      k = if_else(algorithm == "two-round-lsh",
-        str_c(k, " [k2=", k2, "]"),
-        as.character(k)
-      ),
+      # dataset = str_remove(dataset, "-sample-*"),
+      # k = if_else(algorithm == "TwoLevelLSH",
+      #   str_c(k, " [k2=", k2, "]"),
+      #   as.character(k)
+      # ),
       total_time_num = drop_units(total_time),
       total_time = total_time %>%
-        set_units("s") %>%
+        set_units("min") %>%
         drop_units() %>%
-        scales::number(accuracy = 0.01),
+        scales::number(accuracy = 0.1),
       recall = scales::number(recall, accuracy = 0.01),
       total_time = cell_spec(total_time,
-        # background = spec_color(total_time_num, direction = -1),
-        # color = "white",
-        # underline = total_time == min(total_time),
         underline = id %in% best_runs,
         format = "latex"
-      )
+      ),
+      # more compact names
+      # algorithm = str_remove(algorithm, "LSH"),
+      # dataset = if_else(dataset == "Livejournal", "LJ", as.character(dataset))
     ) %>%
     ungroup() %>%
     select(dataset, threshold, algorithm, total_time, recall, k, sketch_bits) %>%
@@ -40,14 +40,14 @@ latex_table_best <- function(data) {
     ) %>%
     kbl(
       format = "latex",
-      align = "ll rrll rrll",
+      align = c("l", "l", "r", "r", "l", "l", "r", "r", "l", "l"),
       escape = F,
       booktabs = T,
       linesep = c("", "", "", "\\addlinespace"),
       col.names = c(
         "dataset", "algorithm",
-        "total time (s)", "recall", "k", "b",
-        "total time (s)", "recall", "k", "b"
+        "time", "recall", "k", "b",
+        "time", "recall", "k", "b"
       )
     ) %>%
     add_header_above(c(" " = 2, "0.5" = 4, "0.7" = 4))
@@ -63,6 +63,8 @@ latex_table_info <- function(data) {
     filter(threshold %in% c(0.5, 0.7)) %>%
     select(dataset, threshold, n, dim, output_size) %>%
     mutate(
+      # Count in the output size the self pairs, which are not reported by the implementations
+      output_size = output_size + n,
       sample = as.integer(str_match(dataset, "sample-(\\d+)")[, 2]),
       sample = if_else(is.na(sample), "Full dataset", str_c("Sample of ", sample)),
       dataset = case_when(
@@ -71,7 +73,6 @@ latex_table_info <- function(data) {
         str_detect(dataset, "[Gg]love") ~ "Glove",
         str_detect(dataset, "Orkut") ~ "Orkut"
       ),
-      # selectivity = scales::percent(selectivity, accuracy = 0.00001),
       avg_neighbors = scales::number(output_size / n, accuracy = 0.01, big.mark = "\\\\,")
     ) %>%
     select(-output_size) %>%
@@ -94,56 +95,56 @@ latex_table_info <- function(data) {
       linesep = ""
     ) %>%
     # kable_styling() %>%
-    add_header_above(c(" " = 1, " " = 1, " " = 1, "average neighbors" = 2)) %>%
-    pack_rows("Full dataset", 1, 4) %>%
-    pack_rows("Sample of 200000 vectors", 5, 8)
+    add_header_above(c(" " = 1, " " = 1, " " = 1, "average neighbors" = 2))
 }
 
 table_data_info() %>%
   latex_table_info() %>%
   write_file("tex/info.tex")
 
 
-latex_normalized_profile <- function(data) {
-  data %>%
-    select(-ends_with("input"), -sketch, -verify, -deduplicate) %>%
-    pivot_longer(ends_with("ppf"), names_to = "component", values_to = "ppf") %>%
-    mutate(
-      component = str_remove(component, "_ppf"),
-      component = if_else(component == "dedup", "deduplicate", component),
-      component = factor(component,
-        levels = c("sketch", "verify", "deduplicate"),
-        ordered = T
-      ),
-      ppf = scales::number(ppf, big.mark = "\\\\,", scale=0.001, accuracy = 1),
-      algorithm = factor(algorithm, ordered = T, levels = c(
-        "LocalLSH",
-        "OneLevelLSH",
-        "TwoLevelLSH"
-      ))
-    ) %>%
-    ungroup() %>%
-    select(-id, -threshold) %>%
-    pivot_wider(names_from=c(algorithm, component), values_from=ppf) %>%
-    select(dataset,
-      LocalLSH_sketch, LocalLSH_verify, LocalLSH_deduplicate,
-      OneLevelLSH_sketch, OneLevelLSH_verify, OneLevelLSH_deduplicate,
-      TwoLevelLSH_sketch, TwoLevelLSH_verify, TwoLevelLSH_deduplicate
-    ) %>%
-    kbl(format = "latex", booktabs = T, escape = F,
-      col.names = c(
-        "dataset",
-        "sketching", "verify", "dedup.",
-        "sketching", "verify", "dedup.",
-        "sketching", "verify", "dedup."
-      )
-    ) %>%
-    add_header_above(c(" " = 1, "\\\\local" = 3, "\\\\onelevel" = 3, "\\\\twolevel" = 3), escape = F)
-}
-
-table_normalized_profile() %>%
-  latex_normalized_profile() %>%
-  write_file("tex/profiling.tex")
+# latex_normalized_profile <- function(data) {
+#   data %>%
+#     select(-ends_with("input"), -sketch, -verify, -deduplicate) %>%
+#     pivot_longer(ends_with("ppf"), names_to = "component", values_to = "ppf") %>%
+#     mutate(
+#       component = str_remove(component, "_ppf"),
+#       component = if_else(component == "dedup", "deduplicate", component),
+#       component = factor(component,
+#         levels = c("sketch", "verify", "deduplicate"),
+#         ordered = T
+#       ),
+#       ppf = scales::number(ppf, big.mark = "\\\\,", scale = 0.001, accuracy = 1),
+#       algorithm = factor(algorithm, ordered = T, levels = c(
+#         "LocalLSH",
+#         "OneLevelLSH",
+#         "TwoLevelLSH"
+#       ))
+#     ) %>%
+#     ungroup() %>%
+#     select(-id, -threshold) %>%
+#     pivot_wider(names_from = c(algorithm, component), values_from = ppf) %>%
+#     select(
+#       dataset,
+#       LocalLSH_sketch, LocalLSH_verify, LocalLSH_deduplicate,
+#       OneLevelLSH_sketch, OneLevelLSH_verify, OneLevelLSH_deduplicate,
+#       TwoLevelLSH_sketch, TwoLevelLSH_verify, TwoLevelLSH_deduplicate
+#     ) %>%
+#     kbl(
+#       format = "latex", booktabs = T, escape = F,
+#       col.names = c(
+#         "dataset",
+#         "sketching", "verify", "dedup.",
+#         "sketching", "verify", "dedup.",
+#         "sketching", "verify", "dedup."
+#       )
+#     ) %>%
+#     add_header_above(c(" " = 1, "\\\\local" = 3, "\\\\onelevel" = 3, "\\\\twolevel" = 3), escape = F)
+# }
+
+# table_normalized_profile() %>%
+#   latex_normalized_profile() %>%
+#   write_file("tex/profiling.tex")
 
 latex_bench <- function() {
   tbldata <- table_bench() %>%
@@ -160,13 +161,14 @@ latex_bench <- function() {
       max_verify = max(verify) %>% scales::number(accuracy = 0.1)
     ) %>%
     ungroup()
-
+
 
   tbldata %>%
     select(dataset, classification, mean_sketch, median_sketch, max_sketch, mean_dedup, median_dedup, max_dedup, mean_verify, median_verify, max_verify) %>%
-    kbl(format = "latex", escape = F, booktabs = TRUE,
+    kbl(
+      format = "latex", escape = F, booktabs = TRUE,
       col.names = c(
-        "data type", "pair type",
+        "data type", "pair type",
         "mean", "median", "max",
         "mean", "median", "max",
         "mean", "median", "max"
@@ -175,17 +177,20 @@ latex_bench <- function() {
     add_header_above(c(" " = 1, " " = 1, "sketch" = 3, "deduplication" = 3, "similarity" = 3))
 }
 
-latex_bench() %>% write_file("tex/bench.tex")
+# latex_bench() %>% write_file("tex/bench.tex")
 
 
 latex_motivation <- function(data) {
   data %>%
     filter(
+      !dry_run,
       algorithm == "OneLevelLSH",
-      sketch_bits == 0,
+      sketch_bits == 0,
       required_recall == 0.8,
-      threshold == 0.5
+      threshold == 0.7
     ) %>%
+    drop_na(Load, total_time) %>%
+    mutate(total_time = set_units(total_time, "min")) %>%
    select(dataset, k, total_time, Load) %>%
    arrange(k) %>%
    group_by(dataset) %>%
@@ -194,18 +199,22 @@ latex_motivation <- function(data) {
       total_time == min(total_time) ~ "practical",
       Load == min(Load) ~ "theoretical"
     ),
-    total_time = set_units(total_time, "s") %>% drop_units() %>% scales::number(accuracy=1)
+    total_time = drop_units(total_time) %>% scales::number(accuracy = 0.1),
+    Load = scales::number(Load, big.mark = "\\\\,")
   ) %>%
-  drop_na() %>%
-  arrange(dataset, Load) %>%
   select(dataset, kind, total_time, Load) %>%
-  kbl(format = "latex", booktabs = T, escape = F,
+  drop_na(kind) %>%
+  arrange(dataset, Load) %>%
+  kbl(
+    format = "latex", booktabs = T, escape = F,
     linesep = "",
     col.names = c(
-      "", "", "time (s)", "load"
+      "", "", "time (min)", "load"
     )
   ) %>%
   collapse_rows(columns = 1, latex_hline = "major")
 }
 
-table_search_best() %>% latex_motivation() %>% write_file("tex/motivation.tex")
+best <- table_search_best()
+latex_motivation(best) %>%
+  write_file("tex/motivation.tex")
