
next set of models #35

Closed
18 of 20 tasks
topepo opened this issue Jul 30, 2018 · 25 comments
Labels
feature a feature request or enhancement

Comments

@topepo
Member

topepo commented Jul 30, 2018

  • knn via kknn package
  • decision_tree via rpart, C5.0, spark (others?)
  • SVM models: linear, RBF, polynomial as separate functions (kernlab)
  • multinomial regression via glmnet and spark
  • mars via earth package
  • null model wrapper as well as fit and pred functions
  • naive Bayes (klaR, spark ?)
  • cubist
  • discriminant analysis (of various types)
  • PLS (sparse, DA)
  • FDA models with different basis functions (MARS, poly)
  • bagged trees (helped by potential new rpart version and side package)
  • bagged MARS (based on side package)
  • Poisson regression (perhaps including ZIP models; otherwise a clone of linear_reg)
  • ARIMA and other time series models
  • generalized additive models
  • multilevel model extension engines for linear, logistic, multinomial, and Poisson regression (in the multilevelmod package)
  • more models for censored data (in the censored package)

👆Already in parsnip or adjacent package
👇working on or thinking about

  • ordinal regression
  • rotation forests
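
Several of the checked items above now follow parsnip's standard specification pattern. As a quick illustration (an editor's sketch, not part of the original checklist), here are three of the listed models expressed in the current interface:

```r
library(parsnip)

# knn via the kknn engine
knn_spec <- nearest_neighbor(neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# decision trees via rpart
tree_spec <- decision_tree(tree_depth = 10, min_n = 5) %>%
  set_engine("rpart") %>%
  set_mode("classification")

# RBF SVM as its own function, via kernlab
svm_spec <- svm_rbf(cost = 1, rbf_sigma = 0.01) %>%
  set_engine("kernlab") %>%
  set_mode("classification")
```

Each specification is then trained with `fit(spec, outcome ~ ., data = ...)`, which is the separation of model definition from estimation that the thread below keeps returning to.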
@billdenney

billdenney commented Feb 25, 2019

I think that this is the right place to ask:

What about:

  • nonlinear models (nlm, gnls, nls)?
  • linear mixed effect models (lme, lmer)? I think that this is sufficiently separate from linear_regression().
  • nonlinear mixed effects models (nlme, nlmer)?

As a lower priority, what about more complex cases that are combinations of other model types:

  • ordinal mixed-effect nonlinear regression models? (This isn't just making something up-- it's a project that I'm currently working on in Stan.)
  • multiple-outcome models that mix modeling types: last year I was working on a nonlinear mixed-effects regression combined with a nonlinear time to event model.

@alexpghayes

So far parsnip mainly supports fit()ing and predict()ing, but doesn't really provide any infrastructure for inference or inspecting model parameters. I suspect parsnip will eventually support multi-level models, but probably at the same fit()/predict() level. Perhaps this will cover your use case, but typically I think people want more inference from these models. lme* and brms, etc, etc, provide a lot of the infrastructure to make this inference easier for users.

Perhaps Max thinks about this differently than I do, but since those models typically don't have many (or any) hyperparameters and don't get cross-validated all that often, I think of them as living in a slightly different universe that isn't as central to parsnip as predictive modeling.

@eduardszoecs

what about mgcv::gam / mgcv::bam? I don't think they fit into the existing linear_reg()?

If I'd like to contribute, what would be the best starting point?

@topepo
Member Author

topepo commented Feb 26, 2019

@billdenney

  • nonlinear models (nlm, gnls, nls)?
  • nonlinear mixed effects models (nlme, nlmer)?

To some extent, I'd like to organize these models on a more mechanistic basis. For example, if you were doing PK modeling, we could have one_compartment, two_compartment, and so on. The engine can control the type of estimator (NLS, Bayes, etc).

This could end up generating a ton of different models, but I think this is better than a general nls wrapper since parsnip wouldn't be able to bring much to the table on that.

That's my current thinking.

  • linear mixed effect models (lme, lmer)? I think that this is sufficiently separate from linear_regression().

Similarly, having model functions for specific types of experiments/analysis.

Suppose you have a simple 1-level repeated measures design (e.g. replicate data points per drug or longitudinal data for a patient over time). I can think of a few different approaches to this analysis: standard mixed effects (e.g. lme4), a Bayesian generalized linear model, generalized estimating equations, and so on. The model structures for these are not exactly the same, but they are pretty similar. This type of design should be a modeling function.

These models are so general that I don't think a generic wrapper would do much good.
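
For concreteness, here is a sketch of the three approaches named above applied to the same 1-level repeated measures design, using the existing packages directly (the data frame `dat` and the columns `response`, `time`, and `subject` are hypothetical placeholders):

```r
library(lme4)
library(gee)

# standard mixed effects: random intercept per subject
fit_lmm <- lmer(response ~ time + (1 | subject), data = dat)

# generalized estimating equations on the same design
fit_gee <- gee(response ~ time, id = subject, data = dat,
               corstr = "exchangeable")

# Bayesian analogue via rstanarm
fit_bayes <- rstanarm::stan_glmer(response ~ time + (1 | subject), data = dat)
```

The fixed-effect structure is identical across the three; only the treatment of the within-subject correlation differs, which is what makes a single design-oriented modeling function with swappable engines plausible.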

@topepo
Member Author

topepo commented Feb 26, 2019

@Edild

what about mgcv::gam / mgcv::bam?

These are more difficult in some ways. How do we parameterize the smoothness functions since they can be different for each term?

We could assume a common df for each term but that is pretty restricting. This would enable the xy interface but, otherwise, a GAM formula would have to be used.
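
To illustrate why a common df is restricting: in mgcv's own formula interface, each smooth term can carry its own basis dimension and basis type, which is exactly what an xy interface would flatten away (variable names here are hypothetical):

```r
library(mgcv)

# each s() term gets its own basis dimension (k) and basis type (bs)
fit <- gam(y ~ s(x1, k = 10) + s(x2, k = 25) + s(x3, bs = "cr"),
           data = dat, method = "REML")
```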

I don't think they fit into the existing linear_reg()?

Definitely not. We'd have to have a different model specification function.

If I'd like to contribute, what would be the best startpoint?

That's hard to say. The parsnip user interface probably isn't going to change much. Under the hood, we need to reorganize some things and create better tools to enable users to write their own parsnip models. We'll start focusing on that soon.

If you want, focus on this vignette and try implementing a model. Give us some feedback on where the pain was or if the documentation is unclear.

Some of the modeling functions above are pretty vanilla (e.g. Poisson regression, ordinal regression, discriminant analysis).

@billdenney

@topepo, the way I had thought about the fit of parsnip within the need for one_compartment() or similar is that I may want to first fit my one_compartment() model equation with an nls model to get some good initial estimates, then switch over to some form of nonlinear mixed effects by switching out the engine and choosing where the random effects reside. The value lies in the consolidated interface (the mess of interfaces between nlme, lme4, rstanarm, and others is as big as other problems that seem to be a fit for parsnip).

As an example, I'd like to be able to do something like the following (I know that this isn't the current parsnip interface):

nonlinear_reg(mode="regression") %>%
  model_data(my_data) %>%
  set_engine("nlme") %>%
  equation(response="y", predictor="e0 + emax*x/(ec50+x)") %>%
  random_var(e0="subjectid") %>%
  var_start(e0=10, emax=50, ec50=30) %>%
  fit()

That way, we would be able to build up "equation" libraries that map to one_compartment, two_compartment, and similar. Also, long-term, I'd like to be able to take the fit from one model and use it as an input to another.

With this, parsnip as a standardized interface to mixed effects models would be very useful as a base for libraries of models that implement domain-specific model equations (like one_compartment). It could also be the basis for model composition:

nonlinear_reg(mode="regression") %>%
  model_data(my_data) %>%
  set_engine("nlme") %>%
  equation(response="y", predictor="e0 + emax*x/(ec50+x)") %>%
  model_substitute(e0="e0_base + (age-50)*e0_age") %>%
  random_var(e0_base="subjectid") %>%
  var_start(e0=10, emax=50, ec50=30) %>%
  fit()

which could yield an equation like:

y = (e0_base + (age-50)*e0_age) + emax*x/(ec50+x)

(I've done some initial playing around with formula substitution in https://github.com/billdenney/formulops, but this was more for me to play with than intended to release.)

@topepo
Member Author

topepo commented Feb 27, 2019

That would introduce 5 or 6 new functions just to fit one type of model. I'd like to avoid that.

I think that the formula could be supplied (even if it has covariates) to fit. We could design tools to help substitute linear predictors in place of a model parameter. We would definitely avoid using character string manipulations; rlang tools for expressions can be used instead. Also, model_data wouldn't be used; the specification is usually independent of the data (the data are passed to fit).

We are currently figuring out how to build random effect-related model terms.
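
As a sketch of the expression-based approach mentioned above (no string manipulation), a model parameter can be swapped for a linear predictor by substituting into the expression tree; all names here are illustrative:

```r
library(rlang)

# base model expression with a parameter e0
base_eq <- expr(e0 + emax * x / (ec50 + x))

# linear predictor to substitute in place of e0
e0_sub <- expr(e0_base + (age - 50) * e0_age)

# AST-level substitution via base R's substitute()
new_eq <- do.call(substitute, list(base_eq, list(e0 = e0_sub)))
```

Because the substitution happens on the parsed expression rather than on deparsed text, operator precedence and term structure are preserved automatically.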

@billdenney

I understand not wanting 5-6 new functions unless they generalize to many model types.

And, for figuring out how to build random effect-related model terms, that seems like it could address most of the needs. With a generalized way to define random effect-related model terms, I could imagine something like the following:

model_fixed_effect <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(
    Sepal.Length~Sepal.Width,
    data=iris
  )

model_mixed_effect <-
  linear_reg() %>%
  set_engine("lme4") %>%
  fit(
    Sepal.Length~Sepal.Width+(1|Species),
    data=iris
  )

And, ideally, I could use the same equation form with either the "lm" or "lme4" engine, but the "lm" engine version would give a warning that the random effect-parts are ignored:

model_fixed_effect <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(
    Sepal.Length~Sepal.Width+(1|Species),
    data=iris
  )
  # Gives warning like:  random effect on `1` is ignored with "lm" engine.
model_mixed_effect <-
  linear_reg() %>%
  set_engine("lme4") %>%
  fit(
    Sepal.Length~Sepal.Width+(1|Species),
    data=iris
  )

With this general idea, the only added function required would be nonlinear_reg() which seems like a good fit overall for parsnip.

As an aside, I like the lme4 method of defining the terms, but I would hope that the terminology made for parsnip allowed nested, crossed, and partially-nested/crossed specification of random effects (which is usually possible if crossed is possible).
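
For reference, lme4's formula syntax already distinguishes the groupings mentioned here; a minimal sketch with hypothetical data and column names:

```r
library(lme4)

# nested random effects: classes within schools
fit_nested <- lmer(score ~ 1 + (1 | school / class), data = dat)

# crossed random effects: subjects and items vary independently
fit_crossed <- lmer(score ~ 1 + (1 | subject) + (1 | item), data = dat)
```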

@juliasilge juliasilge added the feature a feature request or enhancement label Apr 3, 2020
@tiernanmartin

I see poisson regression included in @topepo's original list — are you still planning to provide support for it in the future?

Are other count regression models like those in the list below being considered as well?

Thanks!

Count Regression Models

  • Negative Binomial Regression
  • Zero-Inflated Count Models (Poisson, Negative Binomial)
  • Zero-Truncated Count Models (Poisson, Negative Binomial)
  • Hurdle Models
  • Random-effects Count Models
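
For context, most of these already have established engines outside parsnip; a sketch of what the underlying fits look like today (`dat`, `count`, and `x` are placeholders):

```r
library(MASS)  # glm.nb()
library(pscl)  # zeroinfl(), hurdle()

# negative binomial regression
fit_nb <- glm.nb(count ~ x, data = dat)

# zero-inflated negative binomial (count model | zero model)
fit_zinb <- zeroinfl(count ~ x | x, data = dat, dist = "negbin")

# hurdle (two-part) model
fit_hurdle <- hurdle(count ~ x, data = dat, dist = "negbin")
```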

@topepo
Member Author

topepo commented Apr 7, 2020

@tiernanmartin Well timed: poissonreg. I'll add hierarchical models via another package in the next few weeks.

@billdenney Getting back to some of your questions... I'll open up a package that adds engines for (g)lmer models via lme4 and rstanarm. It works so far, but I have to sort through some other API issues. I'll try to remember to ping this thread when I've opened it up.

@nlubock

nlubock commented May 18, 2020

Thanks for the Poisson support! Echoing @tiernanmartin - any plans for the Negative Binomial in the future?

@SangeetM

SangeetM commented Aug 9, 2020

Hello, is parsnip considering a tidymodels implementation of Bayesian Additive Regression Trees (BART) sometime down the line?

@topepo
Member Author

topepo commented Aug 9, 2020

@SangeetM That's a good idea, especially since there is now a version that doesn't use rJava.

Feel free to put a PR into parsnip if you want to add a model. There are a lot of tuning parameters, so we'd also need to add some dials objects (which is easy).

The thing to avoid is needing to make a wrapper function to fit the model. This means that we'd have to add it as a formal dependency to parsnip and we really want to avoid that. If that's the case, adding it to a parsnip-adjacent package (like discrim or rules) would be the solution.
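
For anyone picking this up: parsnip's registration functions let an adjacent package declare a new model without parsnip taking on a dependency. A rough sketch (the model name, engine, and wiring here are illustrative, not a finished design):

```r
library(parsnip)

# declare the model and one engine from a parsnip-adjacent package
set_new_model("bart")
set_model_mode("bart", mode = "regression")
set_model_engine("bart", mode = "regression", eng = "dbarts")
set_dependency("bart", eng = "dbarts", pkg = "dbarts")

# fit and prediction wiring then follows via set_fit(), set_encoding(),
# and set_pred(); each tuning parameter gets a matching dials object
```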

@SangeetM

@topepo thanks for the suggestion. Earlier I was looking into bartMachine, which was present in the caret family. Following your reply, I've started looking into BART, which uses C++ in the backend and looks interesting. I will look into how I can add it to tidymodels while avoiding the caveats you mentioned.

@mdancho84
Contributor

@topepo We've just seen our first request for modeltime to add GAMs. Chances are it should probably be a broad package, but we could get something started for regression. business-science/modeltime#71

@topepo
Member Author

topepo commented Mar 26, 2021

@mdancho84

I'm all for that. We should probably have the main model definition live in parsnip and keep the engines in an adjacent package (I think at least).

We could assume a common df for each term but that is pretty restricting. This would enable the xy interface but, otherwise, a GAM formula would have to be used.

Any ideas for this? Assuming the same smoothing term across predictors is probably better than nothing but not very satisfying. Davis and I have discussed the possibility of having tune() entries in the model formula but that does not seem feasible.

@mdancho84
Contributor

mdancho84 commented Mar 26, 2021

@topepo We've started very early strategizing a new modelgam R package. We have not started development (other than proof of concept), but this will change shortly as we need a parsnip interface to interact with modeltime.

Time Series Results

GAMs are showing very strong promise for time series in the testing we have performed.

  • We did a basic parsnip test - The gams worked.
  • We tested GAMs on several time series; they performed very well, provided the user controls for interactions and uses an appropriate link function.


library(tidymodels)  # training()/testing() from rsample, %>% pipe
library(mgcv)        # gam()

model_fit_gam_2 <- gam(
  formula = value ~ s(date_month, k = 12) + s(date_num) + s(lag_24) +
    s(date_num, date_month),
  family = Gamma(link = "log"),
  data = training(splits)
)

preds_train_2 <- predict(model_fit_gam_2, newdata = training(splits),
                         type = "response") %>% as.numeric()
preds_test_2 <- predict(model_fit_gam_2, newdata = testing(splits),
                        type = "response") %>% as.numeric()

Implementation and Tuning

We will get back to you on the 2nd point. We need to think about this.

@mdancho84
Contributor

mdancho84 commented Mar 26, 2021

@topepo Here's what I'm thinking regarding the progression of a package.

  • Start with a basic modelgam package
  • Include a gam_mod() function that is a general parsnip interface to GAMs
  • Include engine = "gam" for MVP algorithm
  • Phase 1 - Tackle the minimum viable product - Formula Parsnip / Workflow interface
  • Phase 2 - Recipe steps that implement gam smoothing - step_gam_smooth()

I've added more details here.

@Tazinho

Tazinho commented Dec 19, 2021

@topepo Do you have any plans for wrappers to create a multinomial classification? I mean something where you can supply any binary classification algorithm and then specify something like "one vs. one" or "one vs. all". Maybe I am overlooking something, but so far I can only see multinomial support for a handful of specific algorithms. It's not something I currently need, but I have had many use cases in the past where I wanted to benchmark several algorithms, couldn't find a consistent interface in the R world, and so just used a specific algorithm (e.g. XGBoost) through its own interface directly.
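
Until such a wrapper exists, a minimal one-vs-all scheme can be rolled by hand. This sketch uses plain glm(); the function and column names are illustrative:

```r
# fit one binary model per class; predict by the highest fitted probability
one_vs_all <- function(data, class_col, predictors) {
  classes <- levels(factor(data[[class_col]]))
  lapply(setNames(classes, classes), function(cl) {
    d <- data
    d$.is_class <- as.integer(d[[class_col]] == cl)
    glm(reformulate(predictors, response = ".is_class"),
        data = d, family = binomial())
  })
}

fits <- one_vs_all(iris, "Species", c("Sepal.Length", "Sepal.Width"))
```

A new observation would be scored against each fitted model and assigned to the class with the largest predicted probability.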

@topepo
Member Author

topepo commented Dec 20, 2021

We might put those in the probably package, but it is not high on the priority list. For tidymodels, those wrappers will be most useful once we have post-processing of model outputs.

@desaiha

desaiha commented Dec 26, 2021

Is Negative binomial support added in the parsnip package?

@topepo
Member Author

topepo commented Dec 27, 2021

Not yet but feel free to make a PR for the poissonreg package.

@AmeliaMN

Just ran across this issue while looking for a way to do ordinal logistic regression in parsnip. Looks like it hasn't been implemented yet, just wanted to voice my support!

@topepo
Member Author

topepo commented Feb 21, 2022

@AmeliaMN I did start a prototype repo for this a while back and you can probably give me some feedback.

The main issue that I had was about how to organize the functions. In parsnip we try to have the main model functions describe the structural aspects of the model (e.g. linear_reg(), rand_forest(), etc). For the ordinal models based on generalized linear models (e.g. cumulative logits, adjacent categories, etc.), my thinking was to have:

  • ordinal_cumulative(link = "logit", odds = "proportional")
  • ordinal_adjacent(link = "logit", odds = "proportional")

and so on. My thinking is that people would probably want to look at the parallel assumption (assuming they have the right design for that) and tuning over the odds argument would be helpful.

How would you like to see these types of models organized?

The other main issue is the high degree of heterogeneity and other issues with some of the packages. glmnetcr and ordinalNet are interesting but not easy to productionize given how they do the regularization.

I'd also try to add ordinalForest and party models for trees.

Finally, there are some good brms models. I've mostly stayed away from those since the compiler requirement seems problematic for a lot of users. Also, I think that the model would need to be compiled in every resample :groan:
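
As a baseline for the cumulative-logit case, MASS::polr already fits the proportional-odds model; a short sketch using MASS's built-in housing data:

```r
library(MASS)

# proportional-odds (cumulative logit) ordinal regression
fit <- polr(Sat ~ Infl + Type + Cont, weights = Freq,
            data = housing, method = "logistic")
summary(fit)
```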

@topepo
Member Author

topepo commented Mar 15, 2023

Closing this issue. For new model requests, please start new issues. Thanks for all of the feedback!

@topepo topepo closed this as completed Mar 15, 2023
@tidymodels tidymodels locked as resolved and limited conversation to collaborators Mar 15, 2023