Coffee-addled rant here, but bear with me. I think it'll be really valuable, as the tidymodels universe takes off, to have a clear and well-documented definition of what a model is.
In classic statistics land, if you have some data `x` that live in a space `X`, a model is a distribution `P(X)` indexed by parameters `theta`. In linear regression with three features, `theta` lives in `R^3`. Then a *fit* often refers to `P(X)` where we've picked a particular `theta` to work with, and there's an isomorphism between `R^3` and all possible fits.
(Aside: calling a particular `theta` a fit isn't great language, because *fit* should be a verb referring to model fitting, not a noun referring to the object returned by the fitting process.)
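To make that concrete, here's a minimal sketch in base R (the data are made up for illustration): the *model* is the family of Gaussian linear regressions in three features, and a *fit* is the particular `theta` that least squares picks out.

```r
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
d$y <- 1 + 2 * d$x1 - d$x2 + 0.5 * d$x3 + rnorm(100)

# Fitting selects one point in the parameter space
fit <- lm(y ~ x1 + x2 + x3, data = d)

# The particular theta we picked (R^4 here, once the intercept is counted)
coef(fit)
```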
To me, a key question is how we express this idea in code. For example, if we write out a linear model:
`y = theta_0 + theta_1 * x_1 + ... + theta_p * x_p + epsilon`

where the `epsilon` are IID Gaussian, then the following are all the same model (in the sense that they all have the same parameter space):
- OLS
- LASSO
- Ridge regression
- Any other penalized regression technique
- OLS estimated with Horseshoe priors
- Etc.
Sure, for the penalized regression methods you have to estimate the penalization parameter, but this is a hyperparameter, which I think we can broadly think of as a parameter that we have to estimate but whose particular value we don't really care about. So these methods all share the same parameter space, but have different hyperparameter spaces. Another way to express this same idea: what differentiates MCP from LASSO from OLS, etc., is not that they are different models, but rather that they are different techniques for estimating the same model.
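To make that concrete, here's a sketch (using `mtcars` and an arbitrary `lambda`, and assuming the glmnet package) showing that OLS, LASSO, and ridge all return a `theta` in the same space; only the estimation technique differs:

```r
library(glmnet)

x <- as.matrix(mtcars[, c("wt", "hp", "disp")])
y <- mtcars$mpg

theta_ols   <- coef(lm(y ~ x))                         # analytic least squares
theta_lasso <- coef(glmnet(x, y, alpha = 1), s = 0.5)  # L1 penalty, lambda = 0.5
theta_ridge <- coef(glmnet(x, y, alpha = 0), s = 0.5)  # L2 penalty, lambda = 0.5

# All three are points in the same parameter space
cbind(ols = theta_ols, lasso = as.numeric(theta_lasso), ridge = as.numeric(theta_ridge))
```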
(Aside: one interesting question is whether or not hierarchical models belong on the list above. I think it depends on whether or not you care about the group-level parameters, in which case you are now in a new parameter space. OLS with HC errors is another interesting case to think about. Here the model is still the linear model, but now we're more explicitly declaring that we want to estimate the covariance matrix, and also that we are going to use, say, HC1 to do so. I'd still call this a linear model, but only if the original definition of the linear model specified the covariance as an estimand.)
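For the HC case, here's a sketch of what that looks like in practice (assuming the sandwich and lmtest packages): the point estimates are plain OLS, but the covariance matrix is explicitly estimated with HC1.

```r
library(sandwich)
library(lmtest)

fit <- lm(mpg ~ wt + hp, data = mtcars)

# Same theta as OLS, but the covariance estimand is handled explicitly
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))
```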
If I'm going to actually implement things in code, I want to work with an object that specifies the estimation method, which likely is closely tied to a hyperparameter space.
I think that a parsnip model specification shouldn't work with the classical stats sense of a model as we've defined it above, but rather should encapsulate all the things you need to do to get parameters back. Parsnip is already doing a lot of this, but I think there's a lot of value in being very clear about what a parsnip object should specify. In my mind this includes, at a minimum (see the sketch after this list):
- The estimand, or parameters you want to estimate, mostly implicit in the model you select
- The estimating procedure (e.g. LARS, or the analytic solution to OLS), often implicit in the package you call
- Any hyperparameter spaces (ex: `lambda in R+` for LASSO)
- Procedures for picking hyperparameters (ex: random search over bootstrap samples, picking the smallest `lambda` within 1 SE of the minimum RMSE)
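Parsnip's current API already sketches most of this; here's roughly what such a specification looks like today (hedged, since the interface may evolve):

```r
library(parsnip)
library(tune)

# The estimand (linear regression coefficients), the estimating procedure
# (glmnet), and the hyperparameter to tune (penalty) are all declared
# before any data are involved.
spec <- linear_reg(penalty = tune(), mixture = 1) %>%  # mixture = 1 is LASSO
  set_engine("glmnet")

spec
```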
For now I think it makes sense to call this a *model specification*, but I think it's critically important to distinguish between the model and the model plus all this other stuff. Similarly, after the model fitting process, when you have many different *fits* (one for each hyperparameter combination, say), there are tasks that involve working with all the fits together (you might be curious which LASSO variable entered the model first), and tasks that involve working with just one fit (e.g. looking at the LASSO coefficients themselves).
I strongly believe that a good interface very clearly differentiates between a group of fits and a single fit, and provides type-safe methods for working with each of these.
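In current tidymodels terms, one way to sketch that distinction (this is today's API, not a proposal):

```r
library(tidymodels)

folds <- bootstraps(mtcars, times = 10)

wf <- workflow() %>%
  add_formula(mpg ~ .) %>%
  add_model(linear_reg(penalty = tune(), mixture = 1) %>% set_engine("glmnet"))

# A *group* of fits: one per candidate penalty per resample
many_fits <- tune_grid(wf, resamples = folds, grid = 20)

# A *single* fit: pick the winning hyperparameters, then fit once
best    <- select_best(many_fits, metric = "rmse")
one_fit <- fit(finalize_workflow(wf, best), data = mtcars)
```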
## Related issue: canonical modelling examples
A related issue is to find canonical modelling examples that are sufficient to develop our intuition about what the code objects should look like. OLS is too simple because it doesn't need a lot of the machinery that other models need. I think that a good starting place is to have one canonical example where we can employ the submodel trick (penalized regression seems like a good place to start), and one where we can't (maybe SVMs here?). Another way to think about this: we should have one canonical example where there is exploitable structure in the hyperparameter space, and one canonical example where there isn't.
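For reference, glmnet's regularization path is one concrete instance of the submodel trick (a sketch, reusing `mtcars`): a single fit computes the whole path, so every `lambda` defines a submodel you can predict from without refitting.

```r
library(glmnet)

x <- as.matrix(mtcars[, c("wt", "hp", "disp")])
y <- mtcars$mpg

# One fit computes the entire regularization path
path <- glmnet(x, y, alpha = 1)

# Predictions at several penalties come from that single object
predict(path, newx = x, s = c(0.1, 0.5, 1.0))
```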