Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extract_variable_array and the with_indices parameter for variables() #342

Merged
merged 14 commits into from
Feb 1, 2024

Conversation

mjskay
Copy link
Collaborator

@mjskay mjskay commented Jan 30, 2024

Summary

This PR:

It also makes a few other related changes:

  • For types that support factor variables (draws_df, draws_list, and draws_rvars), extract_variable() and extract_variable_matrix() now return a factor / ordered if the variable requested is a factor or ordered.

  • It factors out a bunch of internal code for manipulating variable indices into R/variable-indices.R. Most important are:

    • split_variable_names() and split_indices(), which I suggest as the "canonical" functions for (1) splitting a variable into a string representing base name and indices and (2) parsing those indices if needed. The code for split_variable_names() was repeated in several places in the code base (implemented slightly differently). I tried several implementations on large vectors and the one I put there should be fast.
    • variable_names() and variable_names<-(), which are functions with a with_indices parameter that can manipulate character vectors of variable names. These use split_variable_names() under the hood, and are the basis of the implementation of with_indices in variables() and variables<-()
    • flatten_indices() and flatten_array(), which are used to turn arrays into vectors where indices are embedded into variable names (used when converting from draws_rvars to other formats that embed indices into variable names).

    Factoring these functions out means if we later add support for more specialized parsing of variable names, we only have to do that in one place. It also gets us a bit closer to implementations of index parsing we could consider exporting (e.g. for Coding matrix/vector variables in tidy form #61), which could perhaps be a parse_variable_names() function built on split_variable_names() + split_indices().

  • Moves the code for variables(), variables<-(), set_variables(), and nvariables() from R/draws-index.R into R/variables.R and gives them their own documentation pages (R/draws-index.R and the combined doc page for variables/iterations/chains/draws was getting bloated, especially with the extra args on variables functions).

  • Minor refactor of while_preserving_dims() -> copy_dims() and while_preserving_levels() -> copy_levels()

  • Minor fix for a change in how the package doc page is specified in the latest version of roxygen2.

Copyright and Licensing

By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the following licenses:

@codecov-commenter
Copy link

codecov-commenter commented Jan 30, 2024

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (0322b46) 95.14% compared to head (45b0643) 95.31%.
Report is 28 commits behind head on master.

❗ Current head 45b0643 differs from pull request most recent head fe6f23b. Consider uploading reports for the commit fe6f23b to get more accurate results

Files Patch % Lines
R/variables.R 99.06% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #342      +/-   ##
==========================================
+ Coverage   95.14%   95.31%   +0.17%     
==========================================
  Files          47       50       +3     
  Lines        3745     3840      +95     
==========================================
+ Hits         3563     3660      +97     
+ Misses        182      180       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

This is how benchmark results would change (along with a 95% confidence interval in relative change) if 5f49cb3 is merged into master:

  •   :rocket:as_draws_array: 107ms -> 104ms [-3.81%, -1.96%]
  • ❗🐌as_draws_df: 38.7ms -> 85.5ms [+110.31%, +130.9%]
  •   :ballot_box_with_check:as_draws_list: 171ms -> 172ms [-1.29%, +2.1%]
  •   :rocket:as_draws_matrix: 34.4ms -> 32.7ms [-9.38%, -0.48%]
  •   :rocket:as_draws_rvars: 149ms -> 82.5ms [-47.48%, -41.9%]
  •   :rocket:summarise_draws_100_variables: 738ms -> 726ms [-2.67%, -0.52%]
  •   :ballot_box_with_check:summarise_draws_10_variables: 79.9ms -> 82.1ms [-2.47%, +7.99%]
    Further explanation regarding interpretation and methodology can be found in the documentation.

# select matched all_variables maintaining the input variables order
variables <- all_variables[all_var_matched_ixs[order(input_ixs, all_var_matched_ixs)]]
} else {
missing_variables <- setdiff(variables, all_variables)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is not hit by test coverage (see here), but as far as I can tell it cannot be, as scalar_only is always FALSE (I assume this is a parameter that was used by some old code?).

I have not changed this function except to use the new split_variable_names() on line 263. It only appears as all new because I moved it here from R/draws-index.R.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember scalar_only. The way the variables are named suggests that I have I not written it. Was it previously used in rvars perhaps? I am fine with removing it as long as you don't see any value in keeping it for future use in rvars or similar functionality.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I didn't write check_existing_variables... Looking back through git blame lead me to this issue: #71

Which suggests that scalar_only was going to be exposed in the API and then it was decided not to expose it, pending future use cases.

There is also an interesting conversation there about argument naming that is very reminiscent of questions I've had about the naming of the with_indices argument. I'd be curious if folks have strong opinions about its name. I chose it to parallel the naming of rvar(with_chains =), though I wonder if it is ambiguous to people whether it means the resulting variables have indices or their names do).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Interesting. I don't have any strong opinions about it. @jgabry and @MansMeg you seem to have been involved in the discussion. Any thoughts on this?

Copy link

This is how benchmark results would change (along with a 95% confidence interval in relative change) if 7336811 is merged into master:

  •   :ballot_box_with_check:as_draws_array: 113ms -> 111ms [-5.39%, +1.2%]
  • ❗🐌as_draws_df: 42.9ms -> 94.1ms [+111.57%, +127.66%]
  •   :ballot_box_with_check:as_draws_list: 193ms -> 192ms [-3.26%, +2.24%]
  •   :rocket:as_draws_matrix: 35.2ms -> 33.9ms [-6.57%, -0.65%]
  •   :rocket:as_draws_rvars: 172ms -> 91.2ms [-49.56%, -44.15%]
  •   :rocket:summarise_draws_100_variables: 743ms -> 732ms [-2.38%, -0.66%]
  •   :ballot_box_with_check:summarise_draws_10_variables: 84ms -> 85.2ms [-4.48%, +7.38%]
    Further explanation regarding interpretation and methodology can be found in the documentation.

Copy link

This is how benchmark results would change (along with a 95% confidence interval in relative change) if 45b0643 is merged into master:

  •   :rocket:as_draws_array: 107ms -> 104ms [-3.25%, -2.26%]
  • ❗🐌as_draws_df: 38.3ms -> 76.5ms [+99.13%, +100.88%]
  •   :ballot_box_with_check:as_draws_list: 171ms -> 170ms [-1.16%, +0.56%]
  •   :rocket:as_draws_matrix: 33.8ms -> 32ms [-7.46%, -3.57%]
  •   :rocket:as_draws_rvars: 146ms -> 82.8ms [-43.74%, -42.73%]
  •   :rocket:summarise_draws_100_variables: 734ms -> 722ms [-1.83%, -1.32%]
  •   :ballot_box_with_check:summarise_draws_10_variables: 79.7ms -> 79.6ms [-0.5%, +0.25%]
    Further explanation regarding interpretation and methodology can be found in the documentation.

@paul-buerkner
Copy link
Collaborator

thank you! Please let me know once this PR is ready for review!

@mjskay
Copy link
Collaborator Author

mjskay commented Jan 30, 2024

Should be ready now

Copy link
Collaborator

@paul-buerkner paul-buerkner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR looks good to me. Thank you!!

@avehtari could you check if as_variable_array does what you would like to do? If you give your okay too, I think this is ready for merging.

# select matched all_variables maintaining the input variables order
variables <- all_variables[all_var_matched_ixs[order(input_ixs, all_var_matched_ixs)]]
} else {
missing_variables <- setdiff(variables, all_variables)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember scalar_only. The way the variables are named suggests that I have I not written it. Was it previously used in rvars perhaps? I am fine with removing it as long as you don't see any value in keeping it for future use in rvars or similar functionality.

@avehtari
Copy link
Collaborator

closes #340 by adding an extract_variable_array() function to extract variables with indices into arrays of iterations x chains x any remaining dimensions. @avehtari does this do what you need?

Yes, it's helpful as it works the same way for any draws object type, and I can write my example in the issue as

p1$draws() |> subset_draws(draw=1) |> extract_variable_array(variable='r_id') |> drop()

or if I want to get all draws array

p1$draws() |> merge_chains() |> extract_variable_array(variable='r_id') |> drop()

(I can also drop drop() and index the 4 dim array)

@paul-buerkner
Copy link
Collaborator

Great! @mjskay anything to change from your side? Otherwise, feel free to merge (or tell me such that I merge).

@mjskay
Copy link
Collaborator Author

mjskay commented Jan 30, 2024

Thanks for the quick review @paul-buerkner!

Maybe let's wait a day or two and see if @jgabry @MansMeg @avehtari or others have objections/suggestions for the with_indices parameter name, and if not then I can merge.

@paul-buerkner
Copy link
Collaborator

Agreed.

@avehtari
Copy link
Collaborator

I'm fine with with_indices and can't come up with better suggestions. I tested using variables(), nvariables(), and set_variables(), and having with_indices parameter is awesome!

@mjskay
Copy link
Collaborator Author

mjskay commented Feb 1, 2024

Great, thanks @avehtari! I will take that as a strong endorsement and merge :)

@mjskay mjskay merged commit c312846 into master Feb 1, 2024
10 checks passed
@mjskay mjskay deleted the indices branch February 2, 2024 06:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants