Separate backend for tidy prediction #41
That would be the most tidy thing to do but my first thought is that it should return a vector unless the prediction contains multiple columns (e.g. class probs, multivariate Y). Let me ponder this a bit.
I agree on this. It is something that can happen after prediction if people really don't want those data. As a corollary, we should always return the same length/nrow as the input even when the prediction produces multiple columns. That may not be automatically 100% tidy, but I think that it is good practice.
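To make that contract concrete, here is a minimal sketch using a hypothetical `tidy_predict()` wrapper (the function and the `.fitted` column name are assumptions for illustration, not an agreed API):

``` r
# Hypothetical wrapper: predictions always come back as a data frame
# with exactly nrow(new_data) rows, even for single-column predictions.
tidy_predict <- function(object, new_data) {
  p <- predict(object, newdata = new_data)
  out <- as.data.frame(p)
  if (ncol(out) == 1) names(out) <- ".fitted"
  stopifnot(nrow(out) == nrow(new_data))  # the row-count guarantee
  out
}

fit <- lm(mpg ~ wt, data = mtcars)
res <- tidy_predict(fit, mtcars[1:4, ])
dim(res)  # four rows in, four rows out
</imports>
```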
👍 We should think about the names though. Let's say that we choose
Yes, but not by default (if that's what you mean) since 1) most models don't implement them and 2) they might be computationally taxing and, if you don't want them, why would you take the time to generate them. We should have a formal syntax for getting them though; I just disagree with the defaultedness (sp?)
Maybe, but I don't want to automatically return the original data along with the prediction.
The major argument for always having tibble predictions is type safety. I think it's the right move, especially since
This is something I'm running into with joineRML:

``` r
library(joineRML)
#> Loading required package: nlme
#> Loading required package: survival

data(heart.valve)
hvd <- heart.valve[!is.na(heart.valve$log.grad) &
                     !is.na(heart.valve$log.lvmi) &
                     heart.valve$num <= 50, ]

fit <- mjoint(
  formLongFixed = list(
    "grad" = log.grad ~ time + sex + hs,
    "lvmi" = log.lvmi ~ time + sex
  ),
  formLongRandom = list(
    "grad" = ~ 1 | num,
    "lvmi" = ~ time | num
  ),
  formSurv = Surv(fuyrs, status) ~ age,
  data = hvd,
  inits = list("gamma" = c(0.11, 1.51, 0.80)),
  timeVar = "time"
)
#> Running multivariate LMM EM algorithm to establish initial parameters...
#> Finished multivariate LMM EM algorithm...
#> EM algorithm has converged!
#> Calculating post model fit statistics...

au <- broom::augment(fit)
colnames(au)
#> [1] "num" "sex" "age" "time"
#> [5] "fuyrs" "status" "grad" "log.grad"
#> [9] "lvmi" "log.lvmi" "ef" "bsa"
#> [13] "lvh" "prenyha" "redo" "size"
#> [17] "con.cabg" "creat" "dm" "acei"
#> [21] "lv" "emergenc" "hc" "sten.reg.mix"
#> [25] "hs" ".fitted_grad_0" ".fitted_lvmi_0" ".fitted_grad_1"
#> [29] ".fitted_lvmi_1" ".resid_grad_0" ".resid_lvmi_0" ".resid_grad_1"
#> [33] ".resid_lvmi_1"
```

Created on 2018-08-07 by the reprex package

Here it would make sense to have an

Dave pretty strongly believes that it is better for users to
I have some brainstorming about this at tidymodels/broom#452. My impression from working with broom is that we should use the most specific name possible. In some sense having a
I agree, uncertainty should not be reported by default. What I'm playing in
And I really do think that we want to export the tests for whatever specification we come up with.
Yes, absolutely.
Other thoughts:
This should also all end up in
Another decision (related to #37): should uncertainty just be reported as a
Also, just to throw out edge cases:
I'd say a data frame-like structure is the way to go as it's trivial to add columns. A difficult decision in prediction was whether to always include class probabilities and predicted class, or just one or the other. For my purposes, I decided to always do both but in practice it means calling most
Some models don't generate class probabilities. I keep them separate as a default but provide an API to get both at once (with a single call to get the probabilities).
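A toy sketch of that API shape (the function and column names here are made up for illustration: separate accessors by default, plus one call that binds both):

``` r
# Made-up helpers: class and probability predictions are separate by
# default, with a single call that returns both at once.
probs <- data.frame(setosa     = c(0.92, 0.10),
                    versicolor = c(0.06, 0.85),
                    virginica  = c(0.02, 0.05))

predict_prob  <- function(probs) probs
predict_class <- function(probs) names(probs)[max.col(as.matrix(probs))]
predict_both  <- function(probs) cbind(.pred_class = predict_class(probs), probs)

predict_both(probs)  # one class column plus one probability column per class
```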
I should also add that I'd be really happy to see broom and prediction integrated into one package if we can find a way to make an integrated version serve both our purposes. The key functionality for me is actually for the margins package, which requires the following:
I imagine it would be beneficial to the community to have one really good package doing this rather than various packages serving slightly different purposes.
If the scope is really big (and that sounds like the case), I don't think that one package is the answer. That's how
What you're saying is good; let's agree on some standards and develop from there so we don't obstruct each other (and we can import each other without reformatting output).
I agree that a standalone prediction package is a good idea. Re: Bayesian predictions:
If you think it will work for you with modifications, let me know what's missing and I'll try to get it all implemented.
How does this sound for a prediction standard (pinging @hadley, @artichaud1, and @DavisVaughan for more input): Any code to produce predictions should follow these conventions:
Named?
What about "link"?
Very good!
How about
Depending on the state of
Any reason to want to format like
Should this include the confidence level?

``` r
library(forecast)
fit <- auto.arima(WWWusage)
forecast(fit, h = 20, level = 95)
#>     Point Forecast    Lo 95    Hi 95
#> 101       218.8805 212.6840 225.0770
#> 102       218.1524 203.3133 232.9915
#> 103       217.6789 194.1786 241.1792
#> 104       217.3709 185.6508 249.0910
#> ...
```
Not sure what you mean.
Maybe... there might be a lot of verbose globals that get made so it would have to be something like
That's good for standardization but bad for consuming those values. So if I go to make a

Let me think about that.
No. Then you're encoding data into the name and you can't easily predict what the name will be. I'd rather set an attribute for the level (so that it is there) but not add it to the name or the value.
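A minimal sketch of that suggestion (the attribute name "conf.level" is made up here): fixed column names, with the level stored as an attribute.

``` r
# Keep column names stable and record the confidence level as an
# attribute instead of encoding it into the names.
pred <- data.frame(
  .fitted = c(218.9, 218.2),
  .lower  = c(212.7, 203.3),
  .upper  = c(225.1, 233.0)
)
attr(pred, "conf.level") <- 0.95  # hypothetical attribute name

names(pred)               # always .fitted / .lower / .upper
attr(pred, "conf.level")  # recoverable without parsing column names
```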
Agree it's messy but I think it's sensible to keep class predictions as a separate thing. For example, for margins, I use predictions for numerical derivatives and want to always be able to count on those being numeric.
Naming: I'm not a huge fan of
I think that ideally as much information as possible should be contained in the tibble itself. Users are much more familiar working with tibbles than attributes. For example, for intervals, I'd strongly prefer a column
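For example (the column names here, including `.level`, are assumptions for illustration), intervals at multiple levels stack naturally when the level lives in an ordinary column:

``` r
# The confidence level as a regular column rather than an attribute;
# filtering recovers any single level with plain data-frame tools.
pred <- data.frame(
  .fitted = rep(c(218.9, 218.2), 2),
  .lower  = c(212.7, 203.3, 210.8, 200.1),
  .upper  = c(225.1, 233.0, 227.0, 236.2),
  .level  = rep(c(0.95, 0.99), each = 2)
)
subset(pred, .level == 0.95)  # two rows: the 95% intervals only
```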
No indication that the columns are probabilities in the name?
Yes.
Not sold on this. For a high level interface, we should not force users to interact with posterior samples. We should provide a default summary of those samples and only optionally return them.
I think we need to think very carefully if we want to do this. My take is that so far this entire thread has been defining a

``` r
predict.many_fits <- function(...) {
  best_fit <- extract_best_fit(many_fits)
  predict(best_fit)
}
```

To get predictions for different hyperparameter values I'd much prefer a workflow like:

``` r
map(many_fits, predict)  # many_fits basically a list of fitted objects
```

But I think we should be really careful here.
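That list-of-fits workflow runs as-is with base `lapply` (equivalent to `purrr::map`); a small self-contained sketch:

``` r
# A plain list of fitted objects, one per setting, predicted over in bulk.
many_fits <- list(
  wt_only = lm(mpg ~ wt,      data = mtcars),
  wt_hp   = lm(mpg ~ wt + hp, data = mtcars)
)
preds <- lapply(many_fits, predict, newdata = mtcars[1:3, ])
lengths(preds)  # three predictions per fit, names preserved
```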
Strongly second this, and came here to recommend it. This is an issue because models often drop rows with missing data silently, which leads to all sorts of nasty surprises and after-the-fact attempts to determine precisely which rows got dropped.
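Base R's `predict.lm` already illustrates the contrast via its `na.action` argument; a quick sketch of silent dropping versus row preservation:

``` r
fit <- lm(mpg ~ wt, data = mtcars)
new_data <- mtcars[1:5, ]
new_data$wt[2] <- NA  # one missing predictor value

p_omit <- predict(fit, new_data, na.action = na.omit)  # row silently dropped
p_pass <- predict(fit, new_data, na.action = na.pass)  # row kept, prediction is NA

length(p_omit)  # 4
length(p_pass)  # 5, same as the input, with NA in position 2
```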
How about
I don't want to do that. It's replicating data that has a single value into a column that is potentially large.
That's one example of many where the prediction isn't a scalar. Even so, I've had applications where I've needed the full posterior distribution for each sample.
I'm only talking about doing it in the cases where the
Not sure if this is still up for debate, and although I favored wide format in the past, more often than not I ended up going to long format for analysis, because it allowed me to standardize my workflow: filter, group_by, summarise -> send to ggplot. In the same vein as @alexpghayes' point, I feel that it removes some of the weight in the decision making, i.e. it makes it easier to change ideas whenever, with a lower probability (or severity) of breaking stuff.
This one pops up on my dashboard every day, so I'll join the party ;) Regarding class probabilities, since we're using tibble anyway, why not just have the probabilities in a list-col? I think it's good to avoid wide tables by default. This would also allow
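A base-R sketch of that list-column shape (the `.pred_class` / `.prob` names are made up for illustration):

``` r
# Class predictions plus a list-column of per-class probabilities:
# the table stays narrow no matter how many classes there are.
pred <- data.frame(.pred_class = c("setosa", "versicolor"))
pred$.prob <- list(
  c(setosa = 0.92, versicolor = 0.06, virginica = 0.02),
  c(setosa = 0.10, versicolor = 0.85, virginica = 0.05)
)

pred$.prob[[1]][["setosa"]]  # individual probabilities are still reachable
```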
spark may not, but that will be illegal in the modeling new world order 😃
They are probably going to end up wide in a lot of cases anyway. In classification, people probably want the class prediction as well as the probabilities, so that's already > 1 column even if we make the probabilities a list.
These are definitely subjective points but imo
True but I don't want extreme cases to drive the bus
Unless you are going to merge these results with the original data (which is pretty common). Plus, you already have wide data anyway with classes and probabilities. If you stack the probabilities then you have a class column that is replicated. Why duplicate data?
Oh I didn't mean we should stack probabilities (although upon re-reading it does sound like it!). I'm thinking about appending
I think that

Similarly, add
So for

And you could call

The separate prediction functions per submodel might be a nice way to modularize any nuances that might come up for the prediction methods of a specific model type, and would be pretty extensible.
Close. There is not
It will call
Prediction norms have been fairly standardized for us now. Thanks to all for your thoughts! 🙌
I've been reworking the augment() methods and it's rapidly becoming clear that dealing with idiosyncratic predict() methods is going to slow down progress immensely. In the end broom and parsnip are both going to want to wrap a bajillion predict methods, and we should report predictions in the same way for consistency's sake. I think we should move this functionality to a separate package. Potentially we could use the prediction package by Thomas Leeper, but we should decide on the behavior we want first.

If we define a new generic / series of generics, we can then test these behaviors in modeltests and allow other modelling package developers to guarantee that their predict() methods are sane and consistent.

What I want from a predict method (e.g. predict.lm(..., na.action = na.pass)): I want all of these to be guaranteed, and for methods that cannot meet these guarantees, I want an informative error rather than a partially correct output.