Roll up summaries of nonscalar variables #152

Open
jsocolar wants to merge 27 commits into master

Conversation

@jsocolar (Contributor) commented May 21, 2021

See #43

This is not ready for primetime, but I want the authors to be able to commit here if they want, so here it is.

Now has:

  • A dimensions column in the rolled-up output
  • Some minimal documentation
  • A dedicated parse_variable_indices function that does the work of parsing the dimensions/indices

Still needs:

  • Proper error handling
  • Tests
  • Review of parse_variable_indices.R and, if it looks good, rewriting as_draws_rvars.draws_matrix() to rely on parse_variable_indices (see also Coding matrix/vector variables in tidy form #61)
  • Proper handling when summaries are NA for some elements (e.g. Cholesky factors will have rhat NA for half their elements)

@codecov-commenter commented May 21, 2021

Codecov Report

Merging #152 (e639cd1) into master (6cf8d24) will decrease coverage by 3.31%.
The diff coverage is 0.00%.


@@            Coverage Diff             @@
##           master     #152      +/-   ##
==========================================
- Coverage   93.89%   90.58%   -3.32%     
==========================================
  Files          41       43       +2     
  Lines        2654     2751      +97     
==========================================
  Hits         2492     2492              
- Misses        162      259      +97     
Impacted Files Coverage Δ
R/parse_variable_indices.R 0.00% <0.00%> (ø)
R/rollup_summary.R 0.00% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jsocolar (Contributor, Author) commented May 24, 2021

I've been having a think about how to handle NA summaries, and I'm not sure what the best approach is. Some variables will have NAs that can be safely ignored, for example rhat for Cholesky factors and the like. But in other cases the NAs can be crucial diagnostic tools, for example if there's a problem in the model definition and a parameter that is supposed to vary is fixed, or if numerical issues cause the sampler to freeze for a single element of a vector. Since there's no good way to tell whether an NA is expected or problematic, we need to decide what to do. Some options:

  • Always display NA for the maximum or minimum value if any elements are NA. This is the current behavior.
  • Provide an argument to ignore NAs.
  • Ignore NAs by default, but if the summary contains any NAs for any of the summary functions, append an additional column giving the number of elements that are NA for at least one of the summaries (sketched after this list).
  • Something else...
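
A hypothetical sketch of the third option, assuming a per-element summaries data frame; the n_na column name and the toy values are illustrative, not code from this PR:

summaries <- data.frame(
  variable = c("x[1]", "x[2]", "x[3]"),
  rhat     = c(1.01, NA, 1.02)   # NA where an element's summary is undefined
)
data.frame(
  variable = "x",
  max_rhat = max(summaries$rhat, na.rm = TRUE),
  n_na     = sum(is.na(summaries$rhat))
)
  variable max_rhat n_na
1        x     1.02    1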

R/parse_variable_indices.R (excerpt under review):

NULL

parse_variable_indices <- function(x) {
  vars_indices <- strsplit(x, "(\\[|\\])")
Collaborator:

You use a number of regexes in a row here to parse x. Can you combine them into a single regex and use capture groups to pull out the relevant pieces? That should be faster, and will also be more flexible if at some point in the future we want to provide people with options to parse different formats of indices (like different separators or whatever).

Collaborator:

To expand on this: here's a regular expression that matches well-formed indexed variable names. Starting/ending the regex with "^" and "$" guarantees it must match the whole name, so failures imply an ill-formed name:

x <- c("a[1]", "b[", "c[1,2]", "d[]", "e]", "f")
m <- regexec("^([^][]+)(\\[([^][]+)\\])?$", x)
vars_matches <- regmatches(x, m)
vars_matches
[[1]]
[1] "a[1]" "a"    "[1]"  "1"   

[[2]]
character(0)

[[3]]
[1] "c[1,2]" "c"      "[1,2]"  "1,2"   

[[4]]
character(0)

[[5]]
character(0)

[[6]]
[1] "f" "f" ""  "" 

Then you can check for ill-formed names with something like this:

if (any(lengths(vars_matches) == 0)) {
   stop_no_call("insert informative error")
}

Variables without indices will have an empty fourth element in their corresponding entry of vars_matches, and for variables with indices the fourth element will be the index string, which can then be split (a sketch follows).
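
For instance, a hypothetical continuation of the example above, restricted to well-formed names so the check passes (the indices object here is just illustrative, not part of the PR):

x <- c("a[1]", "c[1,2]", "f")
m <- regexec("^([^][]+)(\\[([^][]+)\\])?$", x)
vars_matches <- regmatches(x, m)
indices <- lapply(vars_matches, function(match) {
  if (match[[4]] == "") character(0) else strsplit(match[[4]], ",")[[1]]
})
indices
[[1]]
[1] "1"

[[2]]
[1] "1" "2"

[[3]]
character(0)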

This regex is fairly strict, and I'm not sure how strict we want to be --- e.g., do we want names like "x]" to fail, or do we want them just to be parsed as non-indexed variables? I can see an argument for a looser version of this parsing that rarely (or maybe never) fails but instead lets weird-looking names go by and treats them as non-indexed. E.g. something like this:

x <- c("a[1]", "b[", "c[1,2]", "d[]", "e]", "f")
m <- regexec("^(.+?)(\\[(.+)\\])?$", x)
vars_matches <- regmatches(x, m)
vars_matches
[[1]]
[1] "a[1]" "a"    "[1]"  "1"   

[[2]]
[1] "b[" "b[" ""   ""  

[[3]]
[1] "c[1,2]" "c"      "[1,2]"  "1,2"   

[[4]]
[1] "d[]" "d[]" ""    ""   

[[5]]
[1] "e]" "e]" ""   ""  

[[6]]
[1] "f" "f" ""  "" 

i.e. in this example "b[", "d[]", and "e]" all get treated as weird-looking variable names.

This is also a bit more flexible in that it would be pretty easy to allow people to pass in different choices for the opening/closing bracket and update the regex accordingly.
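
For example, a rough sketch of that parameterization (the make_index_regex helper is hypothetical, not part of this PR, and assumes single-character brackets that can be backslash-escaped):

make_index_regex <- function(open = "[", close = "]") {
  esc <- function(ch) paste0("\\", ch)  # naive escaping of the bracket character
  paste0("^(.+?)(", esc(open), "(.+)", esc(close), ")?$")
}
identical(make_index_regex(), "^(.+?)(\\[(.+)\\])?$")
[1] TRUE
x <- c("a[1]", "c[1,2]", "f")
regmatches(x, regexec(make_index_regex(), x))
[[1]]
[1] "a[1]" "a"    "[1]"  "1"   

[[2]]
[1] "c[1,2]" "c"      "[1,2]"  "1,2"   

[[3]]
[1] "f" "f" ""  "" 

Passing, say, "(" and ")" instead would then accept names like "a(1)" without touching the rest of the parsing.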

# Variables with brackets are given dimensionality 1, even if they contain just one element.
dimensionality_elementwise <- sapply(vars_indices, function(x) {
  if (length(x) == 2) {
    out <- length(strsplit(x[2], ",")[[1]]) # number of commas plus one
Collaborator:

Why not generate the dimensionality_elementwise variable later, after you have parsed the variable indices? By generating it here you have to parse the indices twice; generating it later you could parse them just once.

  }
  out
})
variable_indices_info <- lapply(var_names, function(x) {
Collaborator:

You have three instances of *apply over the same structure (var_name) in a row here. Is each element of the intermediate variables (dimensionality[[i]], dimensionality_elementwise[[i]]) only used in corresponding element i of this last apply? If so I would skip creating these intermediate variables and move the relevant code into the final apply. This would also resolve my previous comment.

Contributor (Author):

I could well be missing a clever trick here, but I think this and the previous comment are not quite straightforward.

Of these three apply statements, the first applies a function over vars_indices and the second and third apply over var_names. It would be easy to bring the second inside the third. However, the second apply statement is very cheap and includes a check for inconsistent dimensionality. I thought it might be better to error quickly on inconsistent dimensionality rather than start the relatively expensive variable parsing.

Collaborator:

Ah I think I misunderstood what this was doing.

I think this still could be simplified so that the parsing of variable indices only has to be done once (which should be faster and also have lower maintenance cost, since we wouldn't have to modify that parsing in two places for future features). If you move this line:

      indices <- sapply(vars_indices[var_i], `[[`, 2)

Above its containing if/else block, then I think you could also factor out the vectorized strsplit from the line after the above one to get a list of indices, upon which you could then use lengths / etc to get elementwise ndims and check for consistency. Then index parsing only happens once / in one place.
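
A minimal, self-contained sketch of that single-pass idea (idx_strings stands in for the bracket contents pulled out of vars_indices; none of these names are the PR's actual code):

idx_strings <- c("1,1", "1,2", "2,1", "2,2")   # index strings for one variable's elements
idx_list    <- strsplit(idx_strings, ",")      # parse the indices once
ndims       <- lengths(idx_list)               # elementwise number of dimensions
if (length(unique(ndims)) > 1) {
  stop("inconsistent dimensionality across elements of the same variable")
}
ndims
[1] 2 2 2 2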

all.x = TRUE, sort = FALSE)
# need to do the sort manually after merge because when sort = TRUE, merge
# sorts factors as if they were strings, and we need factors to be sorted as factors
indices <- indices[do.call(order, as.list(indices[, -ncol(indices), drop = FALSE])),]
Collaborator:

Even though I wrote this sort originally, I'm not actually sure it's needed, because I think the all_indices grid might already be sorted in the correct order. Worth checking on this, because if so, dropping the sort will be a big time saver on variables with lots of indices.

#' $dimensionality the number of dimensions of V. Returns 0 for scalars with no brackets
#' but 1 for `y[1]` even if `y` has no other entries in `x`.
#'
#' $dimensions a vector of the actual dimensions of V, as determined by the number of unique
Collaborator:

Is dimensions currently used for anything?

Or put another way: is there a situation in which it is clear that someone would ever want both dimensions and implied_dimensions at the same time? I asked because I think it could simplify things to just calculate and return the implied dimensions as the canonical "dimensions". Possibly with an option to only calculate the non-implied dimensions if those are desired (in which case the function can skip all the expensive calculations involved in figuring out the implied dimensions).

Contributor (Author):

I think the main use case for both would be if somebody wants to check whether they match. Not sure that's a "real" use case though.

In rollup_summary I thought it might be more useful to report the dimensions rather than the implied_dimensions, since the former give a better (albeit imperfect) sense of how big the rolled-up variable actually is.

TBH, for my own purposes I don't expect dimensions and implied_dimensions to ever differ, so I'll definitely follow your lead on which quantity is more useful.

Collaborator:

> I think the main use case for both would be if somebody wants to check whether they match. Not sure that's a "real" use case though.

Yeah I'd vote for waiting to see if this ever comes up instead of designing for it in advance.

> In rollup_summary I thought it might be more useful to report the dimensions rather than the implied_dimensions, since the former give a better (albeit imperfect) sense of how big the rolled-up variable actually is.

Good point, I agree. In that case, that function doesn't need any of the stuff that comes with implied_dimensions, right? Then I would propose adding a flag to parse_variable_indices that either fills in the implied dimensions or not (with default TRUE). I'm not sure what to call it -- fill_implied_dimensions is too long. Maybe just fill, fill_dim, fill_indices, or fill_missing? Then if fill_missing is TRUE, dimensions would be reported with missing indices filled in, and otherwise not.

@mjskay (Collaborator) commented May 25, 2021

> Review of parse_variable_indices.R and, if it looks good, rewriting as_draws_rvars.draws_matrix() to rely on parse_variable_indices

Thanks @jsocolar! I added (hopefully not too many :) ) comments with some suggestions for simplification. It's very exciting to see this getting done, it's gonna be a very useful function (I am probably going to rewrite some tidybayes stuff on top of it :)).

A minor point of discussion about the format: currently you have dimensionality and dimensions elements. Normally I'm all for fully spelling things out, but in this case I might suggest naming these ndim and dim respectively. That would make people's code that uses this function slightly more self-documenting, as those names are used frequently throughout R code to correspond to the two concepts these variables represent.

jsocolar and others added 8 commits May 25, 2021 17:10 (four co-authored by Matthew Kay <matthew.kay@gmail.com>).

@jsocolar (Contributor, Author) commented:

@mjskay just mentioning that I have some time to jump back on this now. Once I remember where we're at, I'll keep it moving along.

@mjskay (Collaborator) commented Aug 14, 2021

Wonderful to hear! I think that the variable index parsing aspects of this PR may also end up being related to some of the conversation in #187 about exposing internal functions for flattening / unflattening arrays. I think that work can be done after this PR though, since it's already doing a couple of different things at once.

Labels: none yet · Projects: none yet · 3 participants