Roll up summaries of nonscalar variables #152

Open
jsocolar wants to merge 27 commits into master

Conversation

@jsocolar (Contributor) commented May 21, 2021

See #43

This is not ready for primetime, but I want the authors to be able to commit here if they want, so here it is.

Now has:

  • A dimensions column in the rolled-up output
  • Some minimal documentation
  • A dedicated parse_variable_indices function that does the work of parsing the dimensions/indices

Still needs:

  • Proper error handling
  • Tests
  • Review of parse_variable_indices.R and, if it looks good, rewriting as_draws_rvars.draws_matrix() to rely on parse_variable_indices (see also Coding matrix/vector variables in tidy form #61)
  • Proper handling when summaries are NA for some elements (e.g. Cholesky factors will have rhat NA for half their elements)

@codecov-commenter commented May 21, 2021

Codecov Report

Merging #152 (e639cd1) into master (6cf8d24) will decrease coverage by 3.31%.
The diff coverage is 0.00%.


@@            Coverage Diff             @@
##           master     #152      +/-   ##
==========================================
- Coverage   93.89%   90.58%   -3.32%     
==========================================
  Files          41       43       +2     
  Lines        2654     2751      +97     
==========================================
  Hits         2492     2492              
- Misses        162      259      +97     
Impacted Files Coverage Δ
R/parse_variable_indices.R 0.00% <0.00%> (ø)
R/rollup_summary.R 0.00% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jsocolar (Contributor, Author) commented May 24, 2021

I've been having a think about how to handle NA summaries, and I'm not sure what the best approach is. Some variables will have NAs that can be safely ignored, for example rhat for Cholesky factors and the like. But in other cases the NAs can be crucial diagnostic tools, for example if there's a problem in the model definition and a parameter that is supposed to vary is fixed, or if numerical issues cause the sampler to freeze for a single element of a vector. Since there's no good way to tell whether an NA is expected or problematic, we need to decide what to do. Some options:

  • Always display NA for the maximum or minimum value if any elements are NA. This is the current behavior.
  • Provide an argument to ignore NAs.
  • Ignore NAs by default, but if the summary contains any NAs for any of the summary functions, append an additional column giving the number of elements that are NA for at least one of the summaries (sketched after this list).
  • Something else...
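
A hypothetical sketch of the third option, assuming a per-element summaries data frame; the n_na column name and the toy values are illustrative, not code from this PR:

summaries <- data.frame(
  variable = c("x[1]", "x[2]", "x[3]"),
  rhat     = c(1.01, NA, 1.02)   # NA where an element's summary is undefined
)
data.frame(
  variable = "x",
  max_rhat = max(summaries$rhat, na.rm = TRUE),
  n_na     = sum(is.na(summaries$rhat))
)
  variable max_rhat n_na
1        x     1.02    1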

R/parse_variable_indices.R (excerpt under review):

NULL

parse_variable_indices <- function(x) {
  vars_indices <- strsplit(x, "(\\[|\\])")
Collaborator:

You use a number of regexes in a row here to parse x. Can you combine them into a single regex and use capture groups to pull out the relevant pieces? That should be faster, and will also be more flexible if at some point in the future we want to provide people with options to parse different formats of indices (like different separators or whatever).

Collaborator:

To expand on this: here's a regular expression that matches well-formed indexed variable names. Starting/ending the regex with "^" and "$" guarantees it must match the whole name, so failures imply an ill-formed name:

x <- c("a[1]", "b[", "c[1,2]", "d[]", "e]", "f")
m <- regexec("^([^][]+)(\\[([^][]+)\\])?$", x)
vars_matches <- regmatches(x, m)
vars_matches
[[1]]
[1] "a[1]" "a"    "[1]"  "1"   

[[2]]
character(0)

[[3]]
[1] "c[1,2]" "c"      "[1,2]"  "1,2"   

[[4]]
character(0)

[[5]]
character(0)

[[6]]
[1] "f" "f" ""  "" 

Then you can check for ill-formed names with something like this:

if (any(lengths(vars_matches) == 0)) {
   stop_no_call("insert informative error")
}

Variables without indices will have an empty fourth element in their corresponding entry of vars_matches, and for variables with indices the fourth element will be the index string, which can then be split (a sketch follows).
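
For instance, a hypothetical continuation of the example above, restricted to well-formed names so the check passes (the indices object here is just illustrative, not part of the PR):

x <- c("a[1]", "c[1,2]", "f")
m <- regexec("^([^][]+)(\\[([^][]+)\\])?$", x)
vars_matches <- regmatches(x, m)
indices <- lapply(vars_matches, function(match) {
  if (match[[4]] == "") character(0) else strsplit(match[[4]], ",")[[1]]
})
indices
[[1]]
[1] "1"

[[2]]
[1] "1" "2"

[[3]]
character(0)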

This regex is fairly strict, and I'm not sure how strict we want to be --- e.g., do we want names like "x]" to fail, or do we want them just to be parsed as non-indexed variables? I can see an argument for a looser version of this parsing that rarely (or maybe never) fails but instead lets weird-looking names go by and treats them as non-indexed. E.g. something like this:

x <- c("a[1]", "b[", "c[1,2]", "d[]", "e]", "f")
m <- regexec("^(.+?)(\\[(.+)\\])?$", x)
vars_matches <- regmatches(x, m)
vars_matches
[[1]]
[1] "a[1]" "a"    "[1]"  "1"   

[[2]]
[1] "b[" "b[" ""   ""  

[[3]]
[1] "c[1,2]" "c"      "[1,2]"  "1,2"   

[[4]]
[1] "d[]" "d[]" ""    ""   

[[5]]
[1] "e]" "e]" ""   ""  

[[6]]
[1] "f" "f" ""  "" 

i.e. in this example "b[", "d[]", and "e]" all get treated as weird-looking variable names.

This is also a bit more flexible in that it would be pretty easy to allow people to pass in different choices for the opening/closing bracket and update the regex accordingly.
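
For example, a rough sketch of that parameterization (the make_index_regex helper is hypothetical, not part of this PR, and assumes single-character brackets that can be backslash-escaped):

make_index_regex <- function(open = "[", close = "]") {
  esc <- function(ch) paste0("\\", ch)  # naive escaping of the bracket character
  paste0("^(.+?)(", esc(open), "(.+)", esc(close), ")?$")
}
identical(make_index_regex(), "^(.+?)(\\[(.+)\\])?$")
[1] TRUE
x <- c("a[1]", "c[1,2]", "f")
regmatches(x, regexec(make_index_regex(), x))
[[1]]
[1] "a[1]" "a"    "[1]"  "1"   

[[2]]
[1] "c[1,2]" "c"      "[1,2]"  "1,2"   

[[3]]
[1] "f" "f" ""  "" 

Passing, say, "(" and ")" instead would then accept names like "a(1)" without touching the rest of the parsing.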

# Variables with brackets are given dimensionality 1, even if they contain just one element.
dimensionality_elementwise <- sapply(vars_indices, function(x) {
  if (length(x) == 2) {
    out <- length(strsplit(x[2], ",")[[1]]) # number of commas plus one
Collaborator:

Why not generate the dimensionality_elementwise variable later, after you have parsed the variable indices? By generating it here you have to parse the indices twice; generating it later you could parse them just once.

  }
  out
})
variable_indices_info <- lapply(var_names, function(x) {
Collaborator:

You have three instances of *apply over the same structure (var_name) in a row here. Is each element of the intermediate variables (dimensionality[[i]], dimensionality_elementwise[[i]]) only used in corresponding element i of this last apply? If so I would skip creating these intermediate variables and move the relevant code into the final apply. This would also resolve my previous comment.

Contributor (Author):

I could well be missing a clever trick here, but I think this and the previous comment are not quite straightforward.

Of these three apply statements, the first applies a function over vars_indices and the second and third apply over var_names. It would be easy to bring the second inside the third. However, the second apply statement is very cheap and includes a check for inconsistent dimensionality. I thought it might be better to error quickly on inconsistent dimensionality rather than start the relatively expensive variable parsing.

Collaborator:

Ah I think I misunderstood what this was doing.

I think this still could be simplified so that the parsing of variable indices only has to be done once (which should be faster and also have lower maintenance cost, since we wouldn't have to modify that parsing in two places for future features). If you move this line:

      indices <- sapply(vars_indices[var_i], `[[`, 2)

Above its containing if/else block, then I think you could also factor out the vectorized strsplit from the line after the above one to get a list of indices, upon which you could then use lengths / etc to get elementwise ndims and check for consistency. Then index parsing only happens once / in one place.
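
A minimal, self-contained sketch of that single-pass idea (idx_strings stands in for the bracket contents pulled out of vars_indices; none of these names are the PR's actual code):

idx_strings <- c("1,1", "1,2", "2,1", "2,2")   # index strings for one variable's elements
idx_list    <- strsplit(idx_strings, ",")      # parse the indices once
ndims       <- lengths(idx_list)               # elementwise number of dimensions
if (length(unique(ndims)) > 1) {
  stop("inconsistent dimensionality across elements of the same variable")
}
ndims
[1] 2 2 2 2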

all.x = TRUE, sort = FALSE)
# need to do the sort manually after merge because when sort = TRUE, merge
# sorts factors as if they were strings, and we need factors to be sorted as factors
indices <- indices[do.call(order, as.list(indices[, -ncol(indices), drop = FALSE])),]
Collaborator:

Even though I wrote this sort originally, I'm not actually sure it's needed, because I think the all_indices grid might already be sorted in the correct order. Worth checking on this, because if so, dropping the sort will be a big time saver on variables with lots of indices.

#' $dimensionality the number of dimensions of V. Returns 0 for scalars with no brackets
#' but 1 for `y[1]` even if `y` has no other entries in `x`.
#'
#' $dimensions a vector of the actual dimensions of V, as determined by the number of unique
Collaborator:

Is dimensions currently used for anything?

Or put another way: is there a situation in which it is clear that someone would ever want both dimensions and implied_dimensions at the same time? I asked because I think it could simplify things to just calculate and return the implied dimensions as the canonical "dimensions". Possibly with an option to only calculate the non-implied dimensions if those are desired (in which case the function can skip all the expensive calculations involved in figuring out the implied dimensions).

Contributor (Author):

I think the main use case for both would be if somebody wants to check whether they match. Not sure that's a "real" use case though.

In rollup_summary I thought it might be more useful to report the dimensions rather than the implied_dimensions, since the former give a better (albeit imperfect) sense of how big the rolled-up variable actually is.

TBH, for my own purposes I don't expect dimensions and implied_dimensions to ever differ, so I'll definitely follow your lead on which quantity is more useful.

Collaborator:

> I think the main use case for both would be if somebody wants to check whether they match. Not sure that's a "real" use case though.

Yeah I'd vote for waiting to see if this ever comes up instead of designing for it in advance.

> In rollup_summary I thought it might be more useful to report the dimensions rather than the implied_dimensions, since the former give a better (albeit imperfect) sense of how big the rolled-up variable actually is.

Good point, I agree. In that case, that function doesn't need any of the stuff that comes with implied_dimensions, right? Then I would propose adding a flag to parse_variable_indices that either fills in the implied dimensions or not (with default TRUE). I'm not sure what to call it -- fill_implied_dimensions is too long. Maybe just fill, fill_dim, fill_indices, or fill_missing? Then if fill_missing is TRUE, dimensions would be reported with missing indices filled in, and otherwise not.

@mjskay (Collaborator) commented May 25, 2021

> Review of parse_variable_indices.R and, if it looks good, rewriting as_draws_rvars.draws_matrix() to rely on parse_variable_indices

Thanks @jsocolar! I added (hopefully not too many :) ) comments with some suggestions for simplification. It's very exciting to see this getting done, it's gonna be a very useful function (I am probably going to rewrite some tidybayes stuff on top of it :)).

A minor point of discussion about the format: currently you have dimensionality and dimensions elements. Normally I'm all for fully spelling things out, but in this case I might suggest naming these ndim and dim respectively. That would make people's code that uses this function slightly more self-documenting, as those names are used frequently throughout R code to correspond to the two concepts these variables represent.

jsocolar and others added 8 commits May 25, 2021 17:10 (four co-authored by Matthew Kay <matthew.kay@gmail.com>).

@jsocolar (Contributor, Author) commented:

@mjskay just mentioning that I have some time to jump back on this now. Once I remember where we're at, I'll keep it moving along.

@mjskay (Collaborator) commented Aug 14, 2021

Wonderful to hear! I think that the variable index parsing aspects of this PR may also end up being related to some of the conversation in #187 about exposing internal functions for flattening / unflattening arrays. I think that work can be done after this PR though, since it's already doing a couple of different things at once.

Labels: none yet · Projects: none yet · 3 participants