I think I can finally translate the thoughts from the modeling abstraction essay (a separate doc that grew out of #19) into `parsnip` terms. Some concepts to start:
- A model is a family of probability distributions or functions. That is, a model is a set.
- An estimator is a way to calculate the parameters of a model from a dataset. Note that hyperparameters are most often properties of estimators.
- The resulting estimates are a fit (I think @topepo often refers to this as a sub-model). This is an element of the model.
- There are often multiple algorithms and implementations of the same estimator. In this case, using `parsnip` terminology, each implementation is a different engine.
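
To make the estimator/engine distinction concrete, here is a minimal base R sketch (not `parsnip` code). It assumes the linear model `hp ~ wt + disp` on `mtcars`, chosen only for illustration, and computes the same OLS estimator with two different implementations. The names `fit_lm` and `beta_normal` are made up for this example.

```r
# Same model (linear regression), same estimator (OLS), two "engines".
X <- cbind(`(Intercept)` = 1, as.matrix(mtcars[, c("wt", "disp")]))
y <- mtcars$hp

# Engine 1: stats::lm(), which solves the least squares problem via QR.
fit_lm <- lm(hp ~ wt + disp, data = mtcars)

# Engine 2: solve the normal equations (X'X) b = X'y directly.
beta_normal <- drop(solve(crossprod(X), crossprod(X, y)))

# Both engines should land on the same element of the model (the same fit),
# up to numerical error.
all.equal(unname(coef(fit_lm)), unname(beta_normal))
```

The two code paths differ in algorithm and numerical behavior, but they target the same estimator, which is why it seems natural to treat them as engines rather than as different estimators.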
Estimators are typically implicit
- `lm` specifies the OLS estimator for the linear model
- `glmnet` specifies the elastic net estimator for the linear model
Estimator selection should be explicit
Something along the lines of:

    ols_hc1_fit <- linear_reg() %>%
      linear_estimator(coefs = "ols", coef_covariance = "HC1") %>%
      fit_xy(
        x = ...,
        y = ...,
        engine = "lm_robust"
      )
Perhaps the `linear_reg()` isn't necessary here, but it does feel the most explicit / low-level to me. In particular, I think it's important to explicitly select an estimator, rather than letting it be implicit in `engine`. Not all estimators are created equal.
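
As a rough illustration of what a choice like `coefs = "ols"` with `coef_covariance = "HC1"` would pin down, the base R sketch below computes OLS coefficients and an HC1 (heteroskedasticity-consistent) covariance estimate by hand. It does not use the hypothetical `linear_estimator()` above or any particular robust-regression package. The formula `hp ~ .` on `mtcars` and the variable names are assumptions for the example.

```r
# Coefficient estimator: OLS.
fit <- lm(hp ~ ., data = mtcars)

X <- model.matrix(fit)   # design matrix, n x k
e <- residuals(fit)      # OLS residuals, length n
n <- nrow(X)
k <- ncol(X)

# Covariance estimator: HC1 sandwich,
#   n / (n - k) * (X'X)^{-1} [ sum_i e_i^2 x_i x_i' ] (X'X)^{-1}
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)   # row i of X is scaled by e_i
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

# Classical and HC1 standard errors side by side.
cbind(
  classical = sqrt(diag(vcov(fit))),
  hc1       = sqrt(diag(vcov_hc1))
)
```

The coefficients are identical under either covariance choice; only the uncertainty estimates change, which is part of why the estimator specification deserves to be explicit rather than buried in the engine.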
Different estimators should have informative subclasses
Currently the `parsnip` behavior is to always produce a `model_fit` object:

    ols <- linear_reg() %>%
      fit(hp ~ ., data = mtcars, engine = "lm")
    class(ols)
    #> [1] "model_fit"
I'm strongly of the opinion that `ols` should have subclasses that indicate:
- the `model_fit` was estimated using ordinary least squares
- the `model_fit` object contains a single fit/submodel, as opposed to a set of fits/submodels
Without this differentiation I don't think it's possible to meaningfully define methods on `ols` for inference. Consider the following methods, all for the linear model:
- `plot_lasso_path()` only makes sense for a set of fits from the LASSO estimator
- `coef_standard_errors()` makes sense for a fit from the OLS estimator but not the LASSO estimator
- `interpret_coefficients()` should have different behavior for an OLS fit and a GEE fit
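
A minimal S3 sketch of how such subclasses could drive method dispatch is below. Everything in it is hypothetical: the subclass names (`ols_fit`, `lasso_fit`, `single_fit`, `fit_set`), the helper `add_fit_subclass()`, and the stand-in objects are invented for illustration and are not part of `parsnip`. The only assumption borrowed from `parsnip` is that the underlying fitted object lives in the `$fit` element.

```r
# Hypothetical helper: prepend estimator and shape subclasses to a fit object.
add_fit_subclass <- function(fit, estimator, shape) {
  class(fit) <- c(paste0(estimator, "_fit"), shape, class(fit))
  fit
}

# Generic for coefficient standard errors.
coef_standard_errors <- function(x, ...) UseMethod("coef_standard_errors")

# Defined for OLS fits: classical standard errors from the wrapped lm object.
coef_standard_errors.ols_fit <- function(x, ...) {
  sqrt(diag(vcov(x$fit)))
}

# Anything without a dedicated method fails loudly instead of guessing.
coef_standard_errors.default <- function(x, ...) {
  stop(
    "coef_standard_errors() is not defined for class '", class(x)[1], "'",
    call. = FALSE
  )
}

# Stand-in for a parsnip model_fit wrapping an lm object.
ols <- structure(list(fit = lm(hp ~ ., data = mtcars)), class = "model_fit")
ols <- add_fit_subclass(ols, estimator = "ols", shape = "single_fit")
class(ols)
coef_standard_errors(ols)

# Stand-in for a set of LASSO fits: no standard-error method applies.
lasso_path <- structure(list(fit = NULL), class = "model_fit")
lasso_path <- add_fit_subclass(lasso_path, estimator = "lasso", shape = "fit_set")
res <- try(coef_standard_errors(lasso_path), silent = TRUE)
inherits(res, "try-error")
```

With only `model_fit` as the class, both calls would dispatch to the same method, and that method would have to inspect the object to decide whether standard errors are meaningful.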