Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A more pipeable fit() interface #33

Closed
DavisVaughan opened this issue Jul 28, 2018 · 12 comments
Closed

A more pipeable fit() interface #33

DavisVaughan opened this issue Jul 28, 2018 · 12 comments

Comments

@DavisVaughan
Copy link
Member

I gave a little thought to the fit() interface problem and this is what I came up with. I don't really like the interface arg name but thats just a naming thing.

# helper if required
xy <- function(x, y) {
  list(x = x, y = y)
}

# notice how engine would come before the _optional_ data param for pipeability
# all required params are now moved to the front
# engine could come before interface if you want to keep interface+data together
fit <- function(model_spec, interface, engine, data, control, ...) {
  #switch based on interface being a formula VS list
}

linear_reg() %>%
  fit(y ~ x1 + x2, "lm", fit_data)

linear_reg() %>%
  fit(xy(fit_data[,c("x1", "x2")], fit_data[,c("y")]), "lm")

# slightly simpler
xy_defn <- xy(
  x = fit_data[,c("x1", "x2")], 
  y = fit_data[,c("y")]
)

linear_reg() %>%
  fit(xy_defn, "lm")
@DavisVaughan
Copy link
Member Author

I think I would call interface the definition or representation of the model instead

@topepo
Copy link
Member

topepo commented Jul 29, 2018

That's funny, the two alternatives that Hadley and I discussed were:

fit(model, data, formula)
fit_xy(model, x, y)

# and 

fit(model, data, recipe/formula)
fit(model, data_xy(x, y))

I think that I'm leaning towards adding fit_xy.

One issue is that parsnip builds an expression for the model fit function. Because of that, we avoid evaluating the model arguments (i.e. x, data, etc) so that, when embedded into an expression, it doesn't embed the data set in the expression (or resulting call object in the model). It would be doable with xy() or data_xy but much cleaner to use an alternate fit function.

In any case, I'll probably make two branches with each and see what is best (or least worse).

@kevinykuo
Copy link
Collaborator

In the case of fit(model, data, formula) or fit(model, formula, data) it seems like the information encoded in formula is lost when the model is fit. Someone else looking at model wouldn't be able to know what needs to be done to the input data in order to reproduce the fitted model. Similarly, inspecting the fitted model doesn't tell you what's needed for scoring. Also, it may be harder to tune formula as a hyperparameter with this API, since here it's neither an "algorithm hyperparameter" nor a "data transformation hyperparameter."

An alternative design would be to require specification of input and output columns in the model object, and require the transformations needed to be encoded in the data processing, so the user would call fit(model, data) to fit the model, where data is a data.frameesque thing. Or perhaps we always require x and y, so the interface becomes fit(model, list(x, y)), and the data preprocessing would return a list by convention.

This is related to #19 -- I think this design decision needs to be clearly justified and articulated.

@topepo
Copy link
Member

topepo commented Jul 31, 2018

it seems like the information encoded in formula is lost when the model is fit.

fit retains this information. If the underlying model interface is also a formula, the information is in that model object, otherwise, fit saves it so that the predict method will work:

> library(tidymodels)
> library(parsnip)
> 
> rf_fit <- rand_forest(mode = "regression") %>%
+   fit(engine = "randomForest", mpg ~ ., data = mtcars)
> names(rf_fit)
[1] "lvl"     "spec"    "fit"     "preproc"
> names(rf_fit$preproc)
[1] "terms"       "xlevels"     "offset_expr" "options"  

inspecting the fitted model doesn't tell you what's needed for scoring. Also, it may be harder to tune formula as a hyperparameter with this API, since here it's neither an "algorithm hyperparameter" nor a "data transformation hyperparameter."

I agree that it is neither of these things. We generally don't tune the formula per se.

An alternative design would be to require specification of input and output columns in the model object, and require the transformations needed to be encoded in the data processing

maybe we could call that a recipe :-)

parsnip is one part of a bigger whole that is intended to accomplish a simple goal: give a uniform interface to the models and standardize parameter/argument names. The parts that you are talking about are higher level apis that will absorb/utlize parsnip to just get the model part.

@DavisVaughan
Copy link
Member Author

Re) An alternative design would be to require specification of input and output columns in the model object, and require the transformations needed to be encoded in the data processing

I had a thought while looking at this. Are recipes going to be support as an input to fit() rather than the data + formula case? For example should I be able to do both of these? Do they mean the same thing conceptually?

fit(model_spec, log(Sales_Price) ~ Latitude + Longitude, data = ames)

recip <- recipe(Sales_Price ~ Latitude + Longitude, data = ames) %>%
   step_log(Sales_Price, base = 10)

fit(model_spec, recip)

@topepo
Copy link
Member

topepo commented Jul 31, 2018

Originally, the three interfaces for fit also included on for recipes.

I decoupled them because, if you were doing model tuning, you would want to train the recipe once and then apply the results to many models (or submodels when tuning) without remaking the recipe. So, once you have a trained recipe, the x/y interface is fine:

recip <- recipe(Sales_Price ~ Latitude + Longitude, data = ames) %>%
   step_log(Sales_Price, base = 10) %>%
   prep(training = ames, retain = TRUE)

rf_fit <- 
   rf_model_spec %>% 
   fit(x = juice(recip, all_predictors()), y = juice(recip, all_outcomes())

glm_fit <- 
   glm_model_spec %>% 
   fit(x = juice(recip, all_predictors()), y = juice(recip, all_outcomes())

# etc. 

That was the reason for:

parsnip is one part of a bigger whole that is intended to accomplish a simple goal: give a uniform interface to the models and standardize parameter/argument names. The parts that you are talking about are higher level apis that will absorb/utilize parsnip to just get the model part.

You can still get to a fit object from a recipe (vas above), but that will be made really easy using a more general data structure (=pipeline or workflow or protocol or whatever we call it). It is in my nature to try to have a single call to do many things (a la caret) but I want these tools to be more modular and to have another structure that provides the higher level api.

# something like

my_analysis <- workflow() %>% 
   add_recipe(recip) %>%
   add_model(rf_model_spec) %>% 
   # pipe in other elements
   fit(training = training_set)

@DavisVaughan
Copy link
Member Author

😲 very cool! Definitely a balance of wanting to make that fit(juice(), juice()) line simpler, and keeping things general enough to be used with/without recipes.

I'm like 80% convinced that retain = TRUE should be the default, especially if it get's used often for this situation.

@topepo
Copy link
Member

topepo commented Jul 31, 2018

I'm like 80% convinced that retain = TRUE should be the default, especially if it get's used often for this situation.

Yeah. That'll change for the next version. I'm tried of typing it.

@topepo
Copy link
Member

topepo commented Aug 1, 2018

There is a fit_xy branch that is working if anyone wants to comment.

@lorenzwalthert
Copy link

Yeah. That'll change for the next version. I'm tried of typing it.

+1 for this one.

@alexpghayes
Copy link

I am a fan of fit_xy(). It's informative and clear, but also irritating enough that people hopefully won't actually use it that much. My hope would be that this minor irritation would work to normalize a high-level data specification over matrix/vector interfaces.

@topepo topepo mentioned this issue Aug 7, 2018
@topepo topepo closed this as completed Aug 7, 2018
topepo added a commit that referenced this issue Aug 3, 2020
```
Found the following (possibly) invalid URLs:
  URL: #33 (moved to #33)
    From: NEWS.md
    Status: 200
    Message: OK
```
@github-actions
Copy link

github-actions bot commented Mar 9, 2021

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 9, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants