Differentiating between models, estimators and engines #54

@alexpghayes

I think I can finally translate the thoughts from the modeling abstraction essay (a separate doc that grew out of #19) into parsnip terms. Some concepts to start:

  • A model is a family of probability distributions or functions. That is, a model is a set.
  • An estimator is a way to calculate the parameters of a model from a dataset. Note that hyperparameters are most often properties of estimators.
  • The resulting estimates are a fit (I think @topepo often refers to this as a sub-model). A fit is an element of the model.
  • There are often multiple algorithms and implementations of the same estimator. In this case, using parsnip terminology, each implementation is a different engine.
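
To make the last bullet concrete, here is a minimal base R sketch (not parsnip code) of one estimator computed by two different implementations. The model is the linear model, the estimator is OLS, and the two "engines" are `lm.fit()` and a hand-rolled normal-equations solve. The choice of predictors is an assumption made purely for illustration.

```r
# Model: the linear model y = X %*% beta + noise, a set indexed by beta.
# Estimator: ordinary least squares. Engines: two ways of computing it.
X <- cbind(intercept = 1, wt = mtcars$wt, disp = mtcars$disp)
y <- mtcars$hp

# Engine A: base R's QR-based least squares routine.
beta_qr <- lm.fit(X, y)$coefficients

# Engine B: solve the normal equations directly (less numerically stable).
beta_ne <- drop(solve(crossprod(X), crossprod(X, y)))

# Same estimator, so both engines should land on the same fit
# (the same element of the model), up to floating point error.
all.equal(beta_qr, beta_ne)
```

The point is that swapping the engine should not change which element of the model you end up with, only how it gets computed.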

Estimators are typically implicit

  • lm specifies the OLS estimator for the linear model
  • glmnet specifies the elastic net estimator for the linear model
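
As a rough illustration of why the estimator matters, the sketch below fits the same linear model with two different estimators in base R. To avoid a dependency on glmnet it uses a closed-form ridge estimator as a stand-in for the elastic net (ridge is the special case with no L1 penalty). `ridge_estimate()` and the value `lambda = 10` are hypothetical and exist only for this example.

```r
# Hypothetical helper: closed-form ridge estimator for the linear model.
# Assumes column 1 of X is the intercept, which is left unpenalized.
ridge_estimate <- function(X, y, lambda) {
  penalty <- diag(ncol(X))
  penalty[1, 1] <- 0
  drop(solve(crossprod(X) + lambda * penalty, crossprod(X, y)))
}

# Standardize predictors so the penalty treats them on the same scale.
X <- cbind(intercept = 1, scale(mtcars[, c("wt", "disp")]))
y <- mtcars$hp

# Same model, two estimators, two different fits.
beta_ols   <- lm.fit(X, y)$coefficients
beta_ridge <- ridge_estimate(X, y, lambda = 10)

rbind(ols = beta_ols, ridge = beta_ridge)
```

Both results are elements of the same model, but they are different elements, and nothing in the returned numbers records which estimator produced them.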

Estimator selection should be explicit

Something along the lines of

ols_hc1_fit <- linear_reg() %>% 
  linear_estimator(coefs = "ols", coef_covariance = "HC1") %>% 
  fit_xy(
    x = ...,
    y = ...,
    engine = "lm_robust"
  )

Perhaps the linear_reg() isn't necessary here, but it does feel the most explicit / low-level to me. In particular, I think it's important to explicitly select an estimator, rather than letting it be implicit in engine. Not all estimators are created equal.

Different estimators should have informative subclasses

Currently the parsnip behavior is to always produce a model_fit object:

ols <- linear_reg() %>% 
  fit(hp ~ ., data = mtcars, engine = "lm")

class(ols)
#> [1] "model_fit"

I'm strongly of the opinion that ols should have subclasses that indicate:

  • the model_fit was estimated using ordinary least squares
  • the model_fit object contains a single fit/submodel, as opposed to a set of fits/submodels

Without this differentiation I don't think it's possible to meaningfully define methods on ols for inference. Consider the following methods, all for the linear model:

  • plot_lasso_path() only makes sense for a set of fits from the LASSO estimator
  • coef_standard_errors() makes sense for a fit from the OLS estimator but not the LASSO estimator
  • interpret_coefficients() should have different behavior for an OLS fit and a GEE fit
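
Here is a rough S3 sketch of what that could look like, in base R only. Everything in it is hypothetical: `new_model_fit()`, the class names `ols_fit`, `lasso_fit`, `single_fit` and `set_fit`, and this implementation of `coef_standard_errors()` are illustrations, not existing parsnip API.

```r
# Hypothetical constructor: tag a fit with its estimator and with whether it
# holds a single fit or a set of fits.
new_model_fit <- function(fit, estimator, multiplicity = c("single", "set")) {
  multiplicity <- match.arg(multiplicity)
  structure(
    list(fit = fit),
    class = c(
      paste0(estimator, "_fit"),
      paste0(multiplicity, "_fit"),
      "model_fit"
    )
  )
}

coef_standard_errors <- function(object, ...) {
  UseMethod("coef_standard_errors")
}

# Classical standard errors are well defined for a single OLS fit.
coef_standard_errors.ols_fit <- function(object, ...) {
  sqrt(diag(vcov(object$fit)))
}

# Every other estimator fails loudly rather than returning something misleading.
coef_standard_errors.default <- function(object, ...) {
  stop(
    "coef_standard_errors() is not defined for class: ", class(object)[1],
    call. = FALSE
  )
}

ols <- new_model_fit(lm(hp ~ ., data = mtcars), estimator = "ols")
class(ols)
coef_standard_errors(ols)

# A placeholder standing in for a set of LASSO fits along a penalty path.
lasso_path <- new_model_fit(list(), estimator = "lasso", multiplicity = "set")
tryCatch(coef_standard_errors(lasso_path), error = function(e) conditionMessage(e))
```

With subclasses like these, plot_lasso_path() could be a method for objects that are both lasso and set fits, and interpret_coefficients() could dispatch on the estimator subclass.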
