
Statistical independence of pseudo-random numbers #1139

Closed
wlandau opened this issue Sep 5, 2023 · 65 comments

wlandau commented Sep 5, 2023

targets has two major challenges with pseudo-random number generation:

  1. Statistical reproducibility. Different runs of the same pipeline must give the exact same results.
  2. Statistical independence. Pseudo-random numbers from different targets must be independent.

For years, targets has controlled (1) in a simplistic way: set a deterministic seed for each target, where each target seed depends on the target name and an overarching global seed:

targets/R/utils_digest.R

Lines 35 to 42 in 116f1ce

produce_seed <- function(scalar) {
  seed <- tar_options$get_seed()
  if_any(
    anyNA(seed),
    NA_integer_,
    digest::digest2int(as.character(scalar), seed = seed)
  )
}
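
As a quick usage illustration (assuming the default global seed of 0L), each target name maps deterministically to a single integer seed:

# Hypothetical illustration: the per-target seed is digest2int() of the target
# name, keyed by the overarching global seed (0L is the package default).
digest::digest2int(as.character("task_a"), seed = 0L)
digest::digest2int(as.character("task_b"), seed = 0L)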

But this approach may violate (2). Different targets have different pseudo-random number generator states, and there is nothing to prevent the PRNG sequence of one target from overlapping with that of a different target.

This problem is not unique to targets; it is a general issue in parallel computing.

Cf. wlandau/crew#113.

wlandau self-assigned this Sep 5, 2023

wlandau commented Sep 5, 2023

The comment in wlandau/crew#113 (comment) describes a solid approach to guarantee statistical independence. However, it is hard to reconcile with reproducibility in targets. I want to assign a deterministic unique stream to each target, and I don't want a given target's choice of stream to interfere with the streams of other targets. What's more, dynamic branches are defined while the pipeline is running, so I cannot know all the target names in advance (or even how many there will be by the end).


wlandau commented Sep 5, 2023

(FYI @shikokuchuo)


wlandau commented Sep 5, 2023

Each target already gets a "seed", which is just an integer from digest::digest2int(). These digest2int() values could be very large and/or very negative, but they are already deterministic and unique. I wonder if I can somehow map them deterministically to independent L'Ecuyer streams.


wlandau commented Sep 5, 2023

Better yet, what if I can map a target's name string directly to a unique deterministic number of calls to parallel::nextRNGStream()?
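
A rough sketch of that idea (purely illustrative; the modulus and names below are assumptions, not anything targets uses):

# Illustrative only: hash the target name to a non-negative offset, then advance
# the L'Ecuyer stream that many times to get the target's .Random.seed vector.
RNGkind("L'Ecuyer-CMRG")
set.seed(0L)
offset <- digest::digest2int("task_a") %% 1000L  # arbitrary bound for this sketch
state <- .Random.seed
for (i in seq_len(offset)) {
  state <- parallel::nextRNGStream(state)
}
state  # candidate RNG state for the target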


wlandau commented Sep 5, 2023

This is really tricky because L'Ecuyer RNG states are recursive, so a change in the seed of one target could invalidate a bunch of seemingly unrelated targets (causing them to rerun). So unless I can pin down independent RNG states deterministically, there is no way to do this in a way that preserves targets' existing reproducibility guarantees. On the other hand, simulations with tarchetypes::tar_map_rep() really need statistically independent random numbers for each batch of simulations.


wlandau commented Sep 5, 2023

Maybe L'Ecuyer's original paper has advice about how to manually generate seeds: https://pubsonline.informs.org/doi/pdf/10.1287/opre.47.1.159. Then maybe I could use targets' existing digest2int() number somehow instead of the default recursive behavior of nextRNGStream(). It would be manual, but it may be the only way.


wlandau commented Sep 5, 2023

Just from a glance, I wouldn't trust myself to implement any part of https://pubsonline.informs.org/doi/pdf/10.1287/opre.47.1.159, or to manually advance streams an arbitrary number of steps.


wlandau commented Sep 5, 2023

I think targets needs to commit to L'Ecuyer and give up some reproducibility in the process. Reproducibility does not mean anything if the results are incorrect to begin with.

What this will look like:

  • When you run tar_make(use_crew = TRUE), tar_make_clustermq(), or tar_make_future(), you will be able to trust the statistical properties of the RNGs.
  • However, targets will no longer be able to guarantee that different runs of the pipeline will replicate the RNG state for each target exactly. In other words, you may get slightly different results because the pseudo-random numbers will be slightly different.
  • The metadata will have the full L'Ecuyer 7-integer .Random.seed vector instead of just a single integer for set.seed(). This means you will still be able to recover the RNG state from the metadata in an interactive R session, whether with tar_workspace() or manually.
  • I will update tarchetypes accordingly (Rep-specific seeds in tar_rep(), tar_map_rep(), etc. tarchetypes#111) so that the tar_seed column of the output has the 7-integer .Random.seed vector.

Plan for the main implementation:

  1. Commit to L'Ecuyer as the RNG algorithm.
  2. At the start of the pipeline, initialize the L'Ecuyer state in the tar_runtime object. Here in utils_callr.R seems like the best place for that.
  3. Do not use produce_seed(). Instead, the default seed in target$command$seed should be NA.
  4. Here in process_target(), call parallel::nextRNGStream() to get the next L'Ecuyer RNG state. Then assign it as the target's seed in target$command$seed. Also update the state in the tar_runtime object. If tar_option_get("seed") is NA, then do nothing. (See the sketch after this list.)
  5. In build_run_expr(), if seed is not NA, set the RNG algorithm to L'Ecuyer and set the seed to .GlobalEnv[[".Random.seed"]]. No need to restore this seed on exit because (2) should take care of that. But do throw a warning if the RNGkind is not already L'Ecuyer.
  6. Store the seed in target$command$seed in the metadata.
  7. In tar_workspace(), load and set the new seed properly. For compatibility, check for a 7-integer seed vs a 1-integer seed.
  8. In tarchetypes, discard the existing seed setting behavior. Just replicate (5) and set the seed vector to the tar_seed column of various outputs, e.g. in tar_map_rep().
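
A minimal sketch of steps 2 through 5 (the function and object names below are made up and only stand in for the targets internals named above):

# Minimal sketch of steps 2-5, assuming a global seed of 0L.
RNGkind("L'Ecuyer-CMRG")
set.seed(0L)                         # step 2: initialize the L'Ecuyer state
pipeline_rng_state <- .Random.seed   # stand-in for the state kept in tar_runtime

next_target_seed <- function() {
  # step 4: advance to the next widely spaced stream and hand it to the target
  pipeline_rng_state <<- parallel::nextRNGStream(pipeline_rng_state)
  pipeline_rng_state
}

run_with_seed <- function(expr, seed) {
  # step 5: run the target's command with its own 7-integer .Random.seed vector
  RNGkind("L'Ecuyer-CMRG")
  assign(".Random.seed", seed, envir = .GlobalEnv)
  eval(expr)
}

run_with_seed(quote(runif(1)), next_target_seed())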

Other tasks:

  • Document the changes.
  • Deprecate the seed argument of tar_cue().
  • Make sure the saved metadata from old pipelines still loads.


shikokuchuo commented Sep 5, 2023

Better yet, what if I can map a target's name string directly to a unique deterministic number of calls to parallel::nextRNGStream()?

Taking this idea - does it work if you simply store a hash table of the targets (names as keys) and assign each of them to an integer value? The order doesn't matter. If new targets are added, they are simply appended to the end of the hash table, adding to the integer sequence.

Then when the pipeline actually runs, you initialise the seed with RNGkind("L'Ecuyer-CMRG"); set.seed(<user or default value>), take the size of the hash table, and pre-generate a list of all the required streams (random seeds) using parallel::nextRNGStream() in a loop. Then for each target you simply retrieve the random seed to use by looking up the hash value.
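
A rough sketch of that suggestion (the table contents and names here are illustrative only):

# Rough sketch: an append-only name -> integer table stored as metadata, plus
# L'Ecuyer streams pre-generated up to the size of that table.
stream_index <- c(task_a = 1L, task_b = 2L, task_c = 3L)  # append new targets at the end

RNGkind("L'Ecuyer-CMRG")
set.seed(0L)                                   # <user or default value>
streams <- vector("list", length(stream_index))
streams[[1L]] <- .Random.seed
for (i in seq_len(length(stream_index))[-1L]) {
  streams[[i]] <- parallel::nextRNGStream(streams[[i - 1L]])
}

# For each target, retrieve its pre-generated stream (random seed) by lookup:
streams[[stream_index[["task_b"]]]]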


wlandau commented Sep 5, 2023

Taking this idea - does it work if you simply store a hash table of the targets (names as keys) and assign each of them to an integer value? The order doesn't matter. If new targets are added, they are simply appended to the end of the hash table, adding to the integer sequence.

If each successive target added gets the next integer in the sequence, then different runs of the pipeline could give different integers to the same targets. The alphabetical order of target names, as well as the topological order in the dependency graph, are race conditions. Without a way to lock each target name to a unique deterministic stream, independently of all the other targets, it seems like things would fall back to #1139 (comment).

@shikokuchuo

Taking this idea - does it work if you simply store a hash table of the targets (names as keys) and assign each of them to an integer value? The order doesn't matter. If new targets are added, they are simply appended to the end of the hash table, adding to the integer sequence.

If each successive target added gets the next integer in the sequence, then different runs of the pipeline could give different integers to the same targets. The alphabetical order of target names, as well as the topological order in the dependency graph, are race conditions. Without a way to lock each target name to a unique deterministic stream, independently of all the other targets, it seems like things would fall back to #1139 (comment).

I'm suggesting that the hash map is stored as metadata. So assuming the target names are unique then it is a case of looking up the value, and if the key doesn't exist then creating one. That way each target name will always be associated with the same stream (as the table is immutable, but appendable).

wlandau-lilly added a commit that referenced this issue Sep 5, 2023

wlandau commented Sep 5, 2023

I see. I do already plan to store the state of the stream for each target in the metadata, but I would rather not make guarantees about being able to recover it. Users can clear some or all of the metadata at any time.

@shikokuchuo

Right. I think that's equivalent, but just storing the integer value may not only be more efficient but more reproducible, as if additional targets are added then they can be easily appended and the correct streams generated. Whereas if only the streams are stored, you would also need to separately mark the last-used stream in order to generate the next one etc. But just a suggestion based on your earlier observation - I don't profess to know all the intricacies involved.

@shikokuchuo

FYI sections 6 and 7 of the parallel package vignette (on RNG and load balancing) are a very clear and concise summary:

vignette("parallel", package = "parallel")


wlandau commented Sep 6, 2023

Thanks for the reference; I find it much clearer than the L'Ecuyer papers I skimmed yesterday.

From Section 6 ("Random-number generation"):

When an R process is started up it takes the random-number seed from the object .Random.seed in a saved workspace or constructs one from the clock time and process ID when random-number generation is first used (see the help on RNG). Thus worker processes might get the same seed because a workspace containing .Random.seed was restored or the random number generator has been used before forking: otherwise these get a non-reproducible seed (but with very high probability a different seed for each worker).

The alternative is to set separate seeds for each worker process in some reproducible way from the seed in the master process. This is generally plenty safe enough, but there have been worries that the random-number streams in the workers might somehow get into step.

From the above, it seems like targets is not in a crisis like I initially feared. In fact, it actually has advantages over what most people seem to do. The set.seed() seed of each target is the digest::digest2int() of the target name. That means seeds are guaranteed to be:

  1. Reproducible.
  2. Unique.
  3. Different for each task, not each worker.

(3) in particular cuts RNG sequences short, which makes overlap less likely.

On top of all this, yesterday I had a really tough time trying to use parallel::nextRNGStream() in targets (https://github.com/ropensci/targets/tree/1139). There are so many subtle things it breaks, chief among them reproducibility: the same code with the same set of target names no longer produces the same results.

In the short term, I think it would be much better if I write a whole chapter in https://books.ropensci.org/targets about pseudo-random number generation. The default behavior is worth explaining in depth. In addition, I can offer a workaround: set tar_option_set(seed = NA) to prevent targets from setting seeds altogether, then run targets::tar_make(use_crew = TRUE) with crew to get widely-spaced L'Ecuyer streams automatically.
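
A sketch of what that workaround could look like in _targets.R (crew_controller_local() is just one way to configure crew; adapt as needed):

# _targets.R file (sketch of the workaround)
library(targets)
library(crew)
tar_option_set(
  seed = NA,                            # targets no longer sets target-level seeds
  controller = crew_controller_local()  # crew workers get L'Ecuyer streams
)
list(
  tar_target(task_a, runif(1)),
  tar_target(task_b, runif(1))
)

# R console:
# tar_make(use_crew = TRUE)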

And I wonder if there is some way I can make it easy for targets users to test their pipeline seed configuration. I am thinking:

  1. Take the target seeds from the metadata.
  2. For each seed, run the RNG sequence with something really simple.
  3. Look for overlap.

For (2), I am wondering how best to simulate forward from the seed. (Maybe stats::runif()? Maybe something even more basic?) And I am wondering how far forward to go. 100K steps? 10M steps? An empirical approach could give assurance to individual users for specific pipelines, and it could help us understand how big a problem overlapping streams actually is in real life. For all I know, this problem may be extremely rare in practical workflows.
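
A sketch of what such a check could look like (everything here is an assumption, open to the questions above: the generator calls, the number of steps, and the stand-in seeds, which would really come from tar_meta()):

# Illustrative overlap check: pool draws from each target's seed and count
# duplicated values, in the spirit of the ?RNG example mentioned below.
simulate_stream <- function(seed, steps = 1e5) {
  set.seed(seed)
  runif(steps)
}
seeds <- c(task_a = 1L, task_b = 2L)  # stand-ins for real seeds from tar_meta()
draws <- lapply(seeds, simulate_stream)
sum(duplicated(unlist(draws)))        # counts within- and between-stream repeats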

@shikokuchuo

Agree, shouldn't be a 'crisis' situation for targets. The streams implementation in R was based on L'Ecuyer 2002 - so the 'state of the art' is still 20 years old!

Highlighting this topic in the book is a sensible approach - and it would help to disseminate the ideas - it is completely by luck that I found the (very helpful) vignette!

If you do end up developing testing, I just refer you to the end of the examples on ?RNG:

sum(duplicated(runif(1e6))) # around 110 for default generator

as a good starting point.


wlandau commented Sep 12, 2023

I wrote https://books.ropensci.org/targets/random.html, which was a long time coming anyway. Unless there is a non-sequential, non-recursive way to generate widely-spaced seeds for base R, this is all I can do about the issue of statistical independence in targets.


wlandau commented Sep 22, 2023

DavisVaughan/furrr#265 (comment) spells out the biggest reason why recursively-defined RNG streams are not feasible in targets.


shikokuchuo commented Sep 22, 2023

Your diagrams seem to make it clearer - when you add 'model', as long as you know (keep track of) what the last stream is, you can assign it the next stream - and keep going as new targets are added. I'm missing the reason for why the streams have to follow any particular order. The guarantee is just that they don't overlap.

Just thinking through a bit more, it does get messy potentially, but you could assign each target a monotonically increasing integer sequence, which would remain immutable even if targets get deleted. You'd probably want to actually store the seed for each target. But this would mean, at least theoretically, that you could reproduce with just the initial seed (and enough nextRNGStream() calls). If this metadata gets deleted, then you wouldn't be able to guarantee reproducibility.


wlandau commented Sep 22, 2023

Your diagrams seem to make it clearer - when you add 'model', as long as you know (keep track of) what the last stream is, you can assign it the next stream - and keep going as new targets are added. I'm missing the reason for why the streams have to follow any particular order. The guarantee is just that they don't overlap.

targets is essentially GNU Make for R. When you compile a project in Make, if the time stamp of a source file is earlier than that of its object file, then Make skips the target. targets does the same thing, but for data analysis. To decide whether to skip or rerun a task, targets looks at the R code, the input data, the output data, other settings, and the RNG seed. If any of these things change, the target reruns. Otherwise, the pipeline skips the target to save time. So it's really important that a target be able to stick with the RNG state it chooses initially. With recursive L'Ecuyer streams, the RNG state would have to depend on some kind of ordering. Whatever that ordering is initially, it changes when the pipeline changes, and the effects on the RNG streams would cause tasks to rerun that should not need to rerun. Make sense?

Just thinking through a bit more, it does get messy potentially, but you could assign each target a monotonically increasing integer sequence, which would remain immutable even if targets get deleted. You'd probably want to actually store the seed for each target. But this would mean, at least theoretically, that you could reproduce with just the initial seed (and enough nextRNGStream() calls). If this metadata gets deleted, then you wouldn't be able to guarantee reproducibility.

As you mentioned here and in #1139 (comment), this would rely on the metadata for reproducibility, and that metadata could be deleted. In fact, it is common for users to start over from scratch with targets::tar_destroy() and then expect reproducible results starting only with the code. So unfortunately that type of workaround is not feasible.


shikokuchuo commented Sep 22, 2023

In fact, it is common for users to start over from scratch with targets::tar_destroy() and then expect reproducible results starting only with the code.

Thanks for the explanation @wlandau, it does all make sense. What I was missing was the bit above. If targets is meant to be stateless, then I would agree that RNG streams don't work.

What I was thinking of was effectively preserving the state after the first run and each subsequent one. So you could literally repeat the experiment (with the recorded state) and therefore match the previous result. But that obviously precludes starting from scratch as per the above.


wlandau commented Sep 23, 2023

@HenrikBengtsson, I think we can continue part of our discussion from DavisVaughan/furrr#265 here. We were talking about why recursively defined L'Ecuyer streams are not feasible for targets.

Regarding DavisVaughan/furrr#265 (comment): if only it were that simple.

A target is very different from an ordinary task because we need to know whether it is up to date. If a target's RNG stream changes from one run to the next, it is not up to date, which means the target needs to rerun. DavisVaughan/furrr#265 (comment) attempts to explain how that can happen in a non-sequential DAG.

Maybe your example from DavisVaughan/furrr#265 (comment) would be easier because it uses independent tasks. Say we start with 3 tasks. Here are the seeds:

RNGkind("L'Ecuyer-CMRG")
tasks <- c("task_a", "task_b", "task_c")
first_seed <- .Random.seed
seeds <- list()
seeds[[tasks[1L]]] <- first_seed
for (index in seq(2L, length(tasks))) {
  seeds[[tasks[index]]] <- parallel::nextRNGStream(seeds[[index - 1L]])
}
str(seeds)
#> List of 3
#>  $ task_a: int [1:7] 10407 -511557822 1908736475 498441056 -1632547871 1622487086 -726144297
#>  $ task_b: int [1:7] 10407 149605160 279562417 922798074 -295944740 -2095294832 757822457
#>  $ task_c: int [1:7] 10407 972714371 1811980947 954238892 2133507052 965840020 1200680439

Now suppose we insert a new task between task_a and task_b. The RNG stream for task_a stays the same, but task_b and task_c now have different streams: the new task gets the stream task_b had last time, task_b gets the stream task_c had last time, and task_c gets a brand new stream.

tasks <- c("task_a", "task_new", "task_b", "task_c")
seeds <- list()
seeds[[tasks[1L]]] <- first_seed
for (index in seq(2L, length(tasks))) {
  seeds[[tasks[index]]] <- parallel::nextRNGStream(seeds[[index - 1L]])
}
str(seeds)
#> List of 4
#>  $ task_a  : int [1:7] 10407 -511557822 1908736475 498441056 -1632547871 1622487086 -726144297
#>  $ task_new: int [1:7] 10407 149605160 279562417 922798074 -295944740 -2095294832 757822457
#>  $ task_b  : int [1:7] 10407 972714371 1811980947 954238892 2133507052 965840020 1200680439
#>  $ task_c  : int [1:7] 10407 -1556413327 -444875998 37935421 -1864914985 -341121962 -1229046830

In ordinary parallel computing, this is not a problem. You simply run the computation and note the seed that each task ran with. But in targets, the change in RNG seeds of task_b and task_c is highly disruptive. These tasks should have nothing to do with any of the other tasks, and yet, bizarrely, they need to rerun because task_new was inserted. This is what I meant when I talked about inefficiency in DavisVaughan/furrr#265 (comment). It's not because of calling nextRNGStream(); that's actually quite fast. The inefficiency comes from rerunning task_b and task_c.


wlandau commented Sep 23, 2023

In targets, we should have a DAG with independent tasks. That means any change to task_a or task_new, or the insertion of task_new, should not invalidate task_b or task_c. This is the DAG we want:

graph LR
  task_a
  task_new
  task_b
  task_c

But if we were to use recursive L'Ecuyer streams, that would induce dependency relationships we don't want:

graph LR
  task_a --> task_new
  task_new --> task_b
  task_b --> task_c


wlandau commented Sep 23, 2023

In terms of how targets is used, here is how we would set up a pipeline with independent tasks (c.f. https://books.ropensci.org/targets/walkthrough.html). First, we create a _targets.R file, which is like a Makefile for data analysis tasks in R.

# _targets.R file:
library(targets)
list(
  tar_target(task_a, runif(1)),
  tar_target(task_b, runif(1)),
  tar_target(task_c, runif(1))
)

We inspect the graph and correctly see that the tasks are independent (there are no edges).

# R console:
library(targets)
tar_mermaid()
graph LR
  task_a
  task_b
  task_c

When we run the pipeline, the tasks run in topological order (with optional parallelism using integration with crew):

# R console
tar_make()
#> ▶ start target task_a
#> ● built target task_a [0 seconds]
#> ▶ start target task_b
#> ● built target task_b [0 seconds]
#> ▶ start target task_c
#> ● built target task_c [0 seconds]
#> ▶ end pipeline [0.476 seconds]

targets stores the result of each task and the RNG seed it used.

# R console
tar_read(task_c)
#> [1] 0.3329109
tar_meta(task_c, seed)
#> # A tibble: 1 × 2
#>   name         seed
#>   <chr>       <int>
#> 1 task_c -439103021

Next time we run the pipeline, all our tasks are up to date. That's good because nothing changed since the last run.

# R console:
tar_make()
#> ✔ skip target task_a
#> ✔ skip target task_b
#> ✔ skip target task_c
#> ✔ skip pipeline [0.593 seconds]

To insert a new task, we modify _targets.R:

# _targets.R file
library(targets)
list(
  tar_target(task_a, runif(1)),
  tar_target(task_a_new, runif(1)), 
  tar_target(task_b, runif(1)),
  tar_target(task_c, runif(1))
)

The new task task_a_new should be totally independent of the other tasks because it does not use the data from those tasks. This is the graph we see:

# R console
tar_mermaid()
graph LR
  task_a
  task_a_new
  task_b
  task_c

When we run the pipeline, only task_a_new runs because the others are up to date. That's good.

# R console:
tar_make()
#> ✔ skip target task_a
#> ▶ start target task_a_new
#> ● built target task_a_new [0 seconds]
#> ✔ skip target task_b
#> ✔ skip target task_c
#> ▶ end pipeline [0.458 seconds]

However: if targets were to assign recursive L'Ecuyer streams in topological order, the insertion of task_a_new would have changed the RNG streams of task_b and task_c. We would have had to rerun task_b and task_c. This not only causes a lot more computation for the user, but it also obscures the true dependency relationships among the targets in the pipeline, which decreases the user's high-level understanding of the pipeline as a whole.

# What *would* have happened with recursive L'Ecuyer streams for each target:
tar_make()
#> ✔ skip target task_a
#> ▶ start target task_a_new
#> ● built target task_a_new [0 seconds]
#> ▶ start target task_b
#> ● built target task_b [0 seconds]
#> ▶ start target task_c
#> ● built target task_c [0 seconds]
#> ▶ end pipeline [0.476 seconds]

wlandau reopened this Oct 12, 2023

wlandau commented Oct 12, 2023

Hmmm.... this might also slow down targets a lot.

library(digest)
library(microbenchmark)
name <- "a"
microbenchmark(
  int = digest2int(name),
  md5 = digest(name, algo = "md5"),
  sha256 = digest(name, algo = "sha256"),
  sha512 = digest(name, algo = "sha512"),
  mt = {
    set.seed(utf8ToInt(name))
    sample.int(n = .Machine$integer.max, size = 1L)
  }
)
#> Unit: nanoseconds
#>    expr   min      lq      mean  median      uq      max neval
#>     int   507   730.5   1105.23   811.5  1069.0    17599   100
#>     md5 17237 18389.0 133541.58 21435.0 24110.5 10555007   100
#>  sha256 19013 20269.5  24911.84 22746.0 24862.5   152663   100
#>  sha512 15633 17178.0  27703.10 19355.5 21550.5   406789   100
#>      mt 11186 13200.5  30920.48 13951.5 16907.0   475125   100

Created on 2023-10-12 with reprex v2.0.2

I know the nanonext hashes are a bit faster, but some users can't install NNG because their version of Ubuntu is too old, so I hesitate to make it a hard dependency.


wlandau commented Oct 12, 2023

We don't need serialization, and that makes things 2x faster.

library(digest)
library(microbenchmark)

name <- "a"
microbenchmark(
  int = digest2int(name),
  md5 = digest(name, algo = "md5", serialize = FALSE),
  sha256 = digest(name, algo = "sha256", serialize = FALSE),
  sha512 = digest(name, algo = "sha512", serialize = FALSE),
  mt = {
    set.seed(utf8ToInt(name))
    sample.int(n = .Machine$integer.max, size = 1L)
  }
)
#> Unit: nanoseconds
#>    expr   min      lq     mean  median      uq    max neval
#>     int   493   635.5   894.52   693.0   817.0  19359   100
#>     md5 10345 10700.0 13325.78 10904.5 11450.5 234411   100
#>  sha256 12306 12677.0 13282.54 13033.5 13580.5  29117   100
#>  sha512  9043  9541.5 10088.13  9954.0 10444.5  13210   100
#>      mt 11100 12095.5 12707.84 12509.0 12898.5  29362   100

name <- "longlonglonglonglonglonglonglonglonglonglonglonglonglonglonglong"
microbenchmark(
  int = digest2int(name),
  md5 = digest(name, algo = "md5", serialize = FALSE),
  sha256 = digest(name, algo = "sha256", serialize = FALSE),
  sha512 = digest(name, algo = "sha512", serialize = FALSE),
  mt = {
    set.seed(utf8ToInt(name))
    sample.int(n = .Machine$integer.max, size = 1L)
  }
)
#> Unit: nanoseconds
#>    expr   min      lq     mean  median      uq   max neval
#>     int   558   681.0   777.75   747.5   807.5  3081   100
#>     md5 10466 10792.5 11365.36 11285.0 11840.0 13044   100
#>  sha256 12660 13078.5 13749.33 13440.0 14168.5 20253   100
#>  sha512  9021  9592.5 10440.70  9834.5 10391.0 48685   100
#>      mt 11662 12613.5 13318.57 13152.0 13564.0 33029   100

Created on 2023-10-12 with reprex v2.0.2

wlandau mentioned this issue Oct 12, 2023

wlandau commented Oct 12, 2023

From profiling #1168, the performance hit should be less than 1% in most pipelines, and it is not likely to increase above 3.5% even with cue = tar_cue(file = FALSE).


wlandau commented Oct 12, 2023

On a Mac there is a bit more overhead: 2% with tar_cue(file = TRUE), 5.8% with tar_cue(file = FALSE). But still not terrible and certainly not a bottleneck. The pipeline I profiled is the one below (lots of quick small targets), and I ran proffer::pprof(tar_outdated(callr_function = NULL)) on an up-to-date pipeline.

library(targets)
library(tarchetypes)
tar_option_set(cue = tar_cue(file = FALSE))
list(
  tar_target(x, 1:5000),
  tar_target(ordinary_name, x, pattern = map(x))
)
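
For reference, the profiling call from above on that pipeline:

# R console, on an up-to-date pipeline:
proffer::pprof(tar_outdated(callr_function = NULL))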

@shikokuchuo

I know the nanonext hashes are a bit faster, but some users can't install NNG because their version of Ubuntu is too low, so I hesitate to make it a hard dependency.

How so? Ubuntu focal has NNG 1.5.2 which supports nanonext out of the box. Older versions of Ubuntu are already EOL, but in any case will have no trouble automatically building the libs from the bundled source.

You do realise that nanonext::sha1() is the same order of magnitude as your original digest2int(), so there would be no discernible difference.

But it's very good you've implemented a 'correct' solution in any case.

Also instead of hashing twice, have you considered just letting the value wrap round? The hash produces a hex value, which can be translated into an integer. If this overflows an R integer, then you treat it as if it were a C unsigned int, which cannot overflow, but simply wraps around (defined in the standard) so int.max + 1 = 0.
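
For instance, the wrap-around idea might look like this sketch (not what targets adopted; note the boundary quirk of R integers):

# Sketch of the wrap-around idea: take the first 8 hex digits of a hash and wrap
# the unsigned 32-bit value into R's signed integer range.
hex_to_int32 <- function(hash) {
  digits <- strtoi(strsplit(substr(hash, 1L, 8L), "")[[1L]], base = 16L)
  value <- sum(digits * 16 ^ (7:0))          # unsigned value in [0, 2^32 - 1]
  if (value >= 2^31) value <- value - 2^32   # wrap like a C unsigned-to-int cast
  as.integer(value)                          # caveat: -2^31 becomes NA_integer_ in R
}
hex_to_int32(digest::digest("target_name", algo = "sha256", serialize = FALSE))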


wlandau commented Oct 13, 2023

How so? Ubuntu focal has NNG 1.5.2 which supports nanonext out of the box. Older versions of Ubuntu are already EOL, but in any case will have no trouble automatically building the libs from the bundled source.

A few months ago, @MilesMcBain reported having trouble with Ubuntu 18.04: #1069. Even with the impetus to upgrade, I wonder how many users still run these older OS's.

You do realise that nanonext::sha1() is same order of magnitude as your original digest2int(), so would be no discernible difference.

Is SHA1 strong enough for this?

Also instead of hashing twice, have you considered just letting the value wrap round? The hash produces a hex value, which can be translated into an integer. If this overflows an R integer, then you treat it as if it were a C unsigned int, which cannot overflow, but simply wraps around (defined in the standard) so int.max + 1 = 0.

I thought about it, but it feels like modular arithmetic, which is essentially cutting off the most significant bits in the hash value. I don't know what cryptographic properties we lose if we just take part of the hash, so I thought it would be more rigorous to supply the entire hash to digest2int() instead. All this assumes set.seed() only takes 32-bit integers. I will investigate that part.
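
Roughly, the "hash twice" approach under discussion (global-seed handling omitted):

# Full-length cryptographic hash of the target name, then digest2int() to squash
# it into a 32-bit integer that set.seed() accepts.
digest::digest2int(
  digest::digest("target_name", algo = "sha512", serialize = FALSE)
)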


wlandau commented Oct 13, 2023

Hmm... turns out the seed in set.seed() needs to be a 32-bit integer. I will start an issue at https://github.com/HenrikBengtsson/Wishlist-for-R.


wlandau commented Oct 13, 2023

Posted HenrikBengtsson/Wishlist-for-R#152

@shikokuchuo

How so? Ubuntu focal has NNG 1.5.2 which supports nanonext out of the box. Older versions of Ubuntu are already EOL, but in any case will have no trouble automatically building the libs from the bundled source.

A few months ago, @MilesMcBain reported having trouble with Ubuntu 18.04: #1069. Even with the impetus to upgrade, I wonder how many users still run these older OS's.

I clarified on the original thread, and seems like a non-issue.

Is SHA1 strong enough for this?

It's slightly weakened for certain kinds of attacks - but an irrelevance here? Of course sha256/512 is also available if you'd rather not have to consider this.

@shikokuchuo

Also instead of hashing twice, have you considered just letting the value wrap round? The hash produces a hex value, which can be translated into an integer. If this overflows an R integer, then you treat it as if it were a C unsigned int, which cannot overflow, but simply wraps around (defined in the standard) so int.max + 1 = 0.

I thought about it, but it feels like modular arithmetic, which is essentially cutting off the most significant bits in the hash value. I don't know what cryptographic properties we lose if we just take part of the hash, so I thought it would be more rigorous to supply the entire hash to digest2int() instead. All this assumes set.seed() only takes 32-bit integers. I will investigate that part.

Yes, I think you're right it's not a solution. But I'm also not sure what effect the re-hashing would have. Have you considered an XOF, something like SHAKE256 or KangarooTwelve?


wlandau commented Oct 17, 2023

That's interesting. I don't know of a convenient trusted implementation for an XOF in R. I trust digest and have depended on it for years, but it has very traditional functionality.

@shikokuchuo

I put this together in a spare afternoon: https://shikokuchuo.net/secretbase/

This takes a quarter of the runtime of the current double-hashing implementation.

Moreover, SHAKE256 is more correct and fits precisely the use case here.

@HenrikBengtsson

... spare afternoon

Where can I buy those? :p

@shikokuchuo

@wlandau {secretbase} is now established on CRAN. This provides the solution that targets is seeking in reproducible random seeds that are safe for parallel processes.

Namely, ones that can provide truly ‘random’ random seeds to R’s RNG that satisfy the requirement per Chapter 4 of doi:10.1016/j.matcom.2016.05.00 - the section “A single RNG with a ‘random’ seed for each stream.”

Appreciate this is some time on from the discussion above, but at that time I did not have an answer to your ask (above) for a trusted and convenient XOF implementation in R.

The SHA3 algorithms only landed in Mbed TLS at a later date. As I’ve wrapped their (validated) algorithm with an interface that allows hashing directly to integer - I think this satisfies both trustworthy and convenient. Performance gains I’ve mentioned before – this should reverse the slowdowns that you observed in your previous benchmarking. Also it’s a pretty conservative 9kb package, with no dependencies at the R or C level and compiles under C99.

I’d be happy to provide a PR to get this started if you’d like.

It would be up to you of course whether you want to invalidate existing seeds or not, but I think it makes sense for new analyses to benefit from these at least.


wlandau commented Jan 23, 2024

Wow, this is fantastic! Thanks so much for secretbase! Since XOFs are trusted ways to vary the length of an output hash, I think you closed all the inferential gaps needed to justify this approach to seeds. I think R core would be interested, c.f. HenrikBengtsson/Wishlist-for-R#152. I think I might also use secretbase in crew so we can keep Mersenne twister there.

To clarify: if I set bits = 32, which SHA3 will be used? Is it e.g. possible to use a 512-bit SHA with a 32-bit output hash?

The speed gain is incredible:

microbenchmark::microbenchmark(
  secretbase = secretbase::sha3("target_name", bits = 32L, convert = NA),
  double_hash = {
    digest::digest2int(
      x = digest::digest(
        object = "target_name",
        algo = "sha512",
        serialize = FALSE,
        file = FALSE,
        seed = 0L
      )
    )
  }
)
#> Unit: nanoseconds
#>         expr  min   lq    mean median   uq   max neval cld
#>   secretbase  656  697  815.08    738  779  6396   100  a 
#>  double_hash 4879 5043 5368.13   5125 5289 22468   100   b

For hashes of output data, targets does not need cryptographic properties, and the existing XXHASH(32/64) algorithm is inherently faster than SHA3. So in that particular area, I think I will stick with the existing digest::digest(algo = "xxhash64") implementation.


wlandau commented Jan 23, 2024

Implemented in targets in a4f5694.


wlandau commented Jan 23, 2024

I wonder if this would be relevant for mirai. Maybe instead of a target name, the seed of the task could be the secretbase::sha3() of the task order and the nanonext::mclock(). This may let us use Mersenne twister again.
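
A sketch of that idea (task_index is a hypothetical counter; none of this is implemented in mirai):

# Sketch only: derive a 32-bit seed for the default Mersenne Twister generator
# from a task counter plus a monotonic clock reading.
task_index <- 1L
task_seed <- secretbase::sha3(
  paste(task_index, nanonext::mclock()),
  bits = 32L,
  convert = NA  # return the hash as an integer, as in the benchmark above
)
set.seed(task_seed)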

@HenrikBengtsson

I've followed this discussion at a distance, and I should say I'm not up to date with all the comments herein, so my comment/question might already have been discussed.

I see two types of RNG reproducibility when working with DAGs. There's the reproducibility of (1) the "final", fixed DAG, and (2) the DAG in flux.

If you consider the final DAG that exists when a pipeline has stabilized, then you can always identify a unique, linear path for walking the DAG, which means you could use parallel RNGs such as L'Ecuyer.

For the in-flux DAG, you need an ad-hoc solution, for instance, the one you've already proposed here, i.e. generate a good-enough RNG seed based on a fixed property of the node.

My proposal would be to provide the option to work with both. The first version is important to ensure statistically sound RNGs. With the second approach, there will always be the concern that the RNGs are not statistically sound.

Not sure what the default should be, but I can imagine that a user develops their pipeline using "in-flux-DAG" RNGs, but at the end wipes the cache and regenerates everything with "fixed-DAG" RNGs.


wlandau commented Jan 23, 2024

Unfortunately, a "fixed" DAG is not a feasible concept for targets. Fundamentally, a pipeline is always poised to be in flux. Subtle unexpected changes may e.g. change the number of dynamic branches, which would change the number and order of output streams if those streams were recursively generated.

I can imagine that a user develops their pipeline using "in-flux-DAG" RNGs, but at the end, wipes the cache and regenerate everything with "fixed-DAG" RNGs.

Switching to a "fixed" DAG would generate a different set of seeds and thus a different set of results, which would invalidate the current conclusion in the moment it is being finalized. And in practice, I think users would switch between fixed and flux a lot during failed or premature attempts to finalize a project, which would cause a lot of painful rework.

With the second approach, there will always be the concern that the RNGs are not statistically sound.

From https://www.sciencedirect.com/science/article/pii/S0378475416300829?via%3Dihub:

There are situations where using one or more central monitor(s) to create the streams is not acceptable, e.g., because new streams have to be created dynamically and randomly within threads as a simulation goes on, without resorting to external information, and with a guarantee that the results are exactly reproducible [14], [53]. In this setting, we want each stream to be able to create children streams by itself at any time, using a deterministic mechanism. The possibility of collisions or overlap between streams is hard to rule out in this case, but the probability that it occurs can be made extremely small, as we shall see.

targets may be a niche case, but I agree with the authors when they say GPU-powered work really needs an approach that goes beyond the limitations of recursively-generated RNG streams. And if we believe their arguments, the extremely small probability of overlap justifies the secretbase approach.

@shikokuchuo

To clarify: if I set bits = 32, which SHA3 will be used? Is it e.g. possible to use a 512-bit SHA with a 32-bit output hash?

It will be the SHAKE256 XOF which is used. The SHA-3 standard doi:10.6028/NIST.FIPS.202 defines the 4 SHA-3 hashes 224, 256, 384 and 512, and two XOFs - SHAKE128 and SHAKE256. They are all standardised instances of the same underlying Keccak algorithm. SHAKE256 will provide up to 256 bits of security (although limited by our output length).

For hashes of output data,targets does not need cryptographic properties, and the existing XXHASH(32/64) algorithm is inherently faster than SHA3. So in that particular area, I think I will stick with the existing digest::digest(algo = "xxhash64") implementation.

Yes, SHA-3 is inherently slower even vs SHA-2, and the intention is not to replace the usage there. Having said that, I think with SHA1 I can get to a timing that's within 1.5x that of xxhash64. The place it will come in useful is in big-data / memory-constrained situations, as my solution in secretbase is non-allocating for R objects which need to be serialized, whereas other approaches just serialize the object first (thereby doubling the memory occupied) and then hash it. That means there are cases where secretbase will allow hashing that would otherwise not be possible.
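
For example (an assumption about usage, not a targets change): secretbase can hash an arbitrary R object by streaming its serialization, which is the non-allocating behaviour described above.

# Hash an R object without materializing its full serialization first
# (per the description above).
secretbase::sha3(mtcars, bits = 256L)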


wlandau commented Jan 24, 2024

It will be the SHAKE256 XOF which is used. The SHA-3 standard doi:10.6028/NIST.FIPS.202 defines the 4 SHA-3 hashes 224, 256, 384 and 512, and two XOFs - SHAKE128 and SHAKE256. They are all standardised instances of the same underlying Keccak algorithm. SHAKE256 will provide up to 256 bits of security (although limited by our output length).

So it's not as if a SHA3 is computed, then the SHAKE256 XOF is applied post-hoc? The whole thing from beginning to end is just the SHAKE256 XOF and it shares the underlying algorithm of SHA3?

Having said that, I think with SHA1, I can get to a timing that's within 1.5x that of xxhash64. The place it will come in useful is in big-data / memory-constrained situations as my solution in secretbase is non-allocating for R objects which need to be serialized, whereas other approaches just serialize the object first (thereby doubling the memory occupied) and then hash it, . That means there are cases where secretbase will allow hashing that would otherwise not be possible.

Halving the memory footprint sounds really compelling! I will have to think about it more: a 50% reduction in memory usage at the price of a 50% increase in runtime...

@shikokuchuo

It will be the SHAKE256 XOF which is used. The SHA-3 standard doi:10.6028/NIST.FIPS.202 defines the 4 SHA-3 hashes 224, 256, 384 and 512, and two XOFs - SHAKE128 and SHAKE256. They are all standardised instances of the same underlying Keccak algorithm. SHAKE256 will provide up to 256 bits of security (although limited by our output length).

So it's not as if a SHA3 is computed, then the SHAKE256 XOF is applied post-hoc? The whole thing from beginning to end is just the SHAKE256 XOF and it shares the underlying algorithm of SHA3?

Yes, that's correct. SHAKE256 is one of the SHA-3 options, it's not applied on top.

Having said that, I think with SHA1 I can get to a timing that's within 1.5x that of xxhash64. The place it will come in useful is in big-data / memory-constrained situations, as my solution in secretbase is non-allocating for R objects which need to be serialized, whereas other approaches just serialize the object first (thereby doubling the memory occupied) and then hash it. That means there are cases where secretbase will allow hashing that would otherwise not be possible.

Halving the memory footprint sounds really compelling! I will have to think about it more: a 50% reduction in memory usage at the price of a 50% increase in runtime...

No free lunch in this instance! :)
