Conversation
|
Thanks for working on this! |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1165 +/- ##
=======================================
Coverage 90.85% 90.85%
=======================================
Files 14 14
Lines 5924 5924
=======================================
Hits 5382 5382
Misses 542 542 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@jgabry do you know why macos devel would be failing in the R dep setup? Doesn't seem to happen in main--could this be a cache/GHA error? https://github.com/stan-dev/cmdstanr/actions/runs/23573417706/job/68640740296 |
|
Looks like something is broken in R Core or pak which is causing macos devel to fail. |
Took a quick glance at the logs, I agree this seems likely. I’ll try to take a look at the actual PR tomorrow. |
|
Thanks! Looks like this (new) function in tools is broken, planning on submit a patch. Got fixed, so should be gtg soon |
jgabry
left a comment
There was a problem hiding this comment.
Mostly looks great, just a few comments/questions.
| before_time <- Sys.time() | ||
| mod <- expect_interactive_message(constructor_call, "Compiling Stan program...") | ||
| constructor_call <- substitute(constructor_call) | ||
| before_time <- Sys.time() - 10 |
There was a problem hiding this comment.
Why do we need to subtract 10 here when we didn't need to do that previously? If we do need to do something, why not do what you did above in expect_compilation? I'm worried that the 10 seconds could result in false positives here since in our tests we often repeatedly rebuild the same test models in quick succession. But also I might be misunderstanding this.
| expect_snapshot_output({ | ||
| fit$print(c("theta", "tau", "lp__", "lp_approx__")) | ||
| }) | ||
| expect_snapshot_error(fit$print(variable = "unknown", max_rows = 20)) |
There was a problem hiding this comment.
Is there a reason why you migrated vb, mle, and laplace like this (snapshotting full print output) but didn't do that for gq and mcmc? I'm a bit worried that snapshotting the full print output like you do here will be brittle since it hardcodes the exact floating point output, which can be sensitive to OS/toolchain/CmdStan. I guess it's passing now on different OSes, but still.
Before you change anything, I'm curious if there was a reason for the different approach here compared to the MCMC tests.
There was a problem hiding this comment.
I decided to ask codex about this concern and I had it run these locally and it found a few things:
- fit-mle on the PR worktree passed cleanly.
- fit-vb on the PR worktree failed, while the same fit-vb test on current master passed on this machine. So there is at least one real local regression signal in the PR, not just a theoretical concern.
- fit-laplace executed successfully, but testthat tried to add fresh snapshots in the temp PR worktree instead of cleanly matching the checked-in ones, which reinforces the concern that these new full-output snapshots are over-specified / need closer validation
| "Download of CmdStan failed with error: cannot open URL 'https://github.com/stan-dev/cmdstan/releases/download/v2.35.5/cmdstan-2.35.5.tar.gz'\nPlease check if the supplied version number is valid." | ||
| ) | ||
| expect_error( | ||
| expect_snapshot_error(install_cmdstan(version = "2.35.5", wsl = os_is_wsl())) |
There was a problem hiding this comment.
(applies generally, not just to this particular test or error messages, just putting it here)
I have mixed feelings about using expect_snapshot_error (and to some extent, but a lesser extent, expect_snapshot more broadly). It's cleaner but less transparent. I have to check separate files to see what the expected output is that we're checking for. It’s also not explicitly enforcing behavior we want and instead asking just “did it change?”, which is also very useful but not exactly the same thing. It more or less amounts to the same thing if we’ve verified and trust everything before creating the snapshots, but it’s not actually the same thing.
What's your take on this tradeoff?
Migrated to latest testthat edition.
Closes #1155.