What is a model? #19

opened this issue Jun 15, 2018 · 1 comment


alexpghayes commented Jun 15, 2018

Coffee-addled rant here, but bear with me. I think it'll be really valuable as the `tidymodels` universe takes off to have a clear and well documented definition of what a model is.

In classic statistics land, if you have some data `x` that live in a space `X`, a model is a family of distributions `P(X)` indexed by parameters `theta`. In linear regression with three features, `theta` lives in `R^3`. Then a `fit` often refers to `P(X)` where we've picked a particular `theta` to work with, and there's an isomorphism between `R^3` and the set of all possible fits.

(Aside: calling a particular `theta` a fit isn't great language, because `fit` should be a verb referring to the model-fitting process, not a noun referring to the object that process returns.)

To me, a key question is how we express this idea in code. For example, if we write out a linear model:

`y = theta_0 + theta_1 * x_1 + ... + theta_p * x_p + epsilon`

where the `epsilon` are IID Gaussian, then the following are all the same model (in the sense that they all share the same parameter space):

• OLS
• LASSO
• Ridge regression
• Any other penalized regression technique
• The linear model estimated with horseshoe priors
• Etc
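To make the claim concrete, here's a small sketch in R: one parameter space, several estimators. The penalized calls use real `glmnet` signatures but are commented out so the snippet stays dependency-free.

```r
# Same linear model -- one parameter space for theta -- estimated different ways.
fit_ols   <- lm(mpg ~ wt + hp, data = mtcars)   # OLS: analytic solution
theta_hat <- coef(fit_ols)                      # one point in the parameter space

# The penalized estimators target the same theta, with an extra hyperparameter:
# x <- model.matrix(mpg ~ wt + hp, mtcars)[, -1]; y <- mtcars$mpg
# fit_lasso <- glmnet::glmnet(x, y, alpha = 1)  # LASSO
# fit_ridge <- glmnet::glmnet(x, y, alpha = 0)  # ridge

length(theta_hat)  # → 3 (intercept + two slopes)
```

Every one of these returns (or could return) a point in the same `R^3`; only the route to that point differs.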

Sure, for the penalized regression methods you have to estimate the penalization parameter, but this is a hyperparameter, which I think we can broadly think of as a parameter that we have to estimate but whose particular value we don't really care about. So these methods all share the same parameter space but have different hyperparameter spaces. Another way to express the same idea: what differentiates MCP from LASSO from OLS is not that they are different models, but rather that they are different techniques for estimating the same model.

(Aside: one interesting question is whether hierarchical models belong on the list above. I think it depends on whether you care about the group-level parameters, in which case you are now in a new parameter space. OLS with heteroskedasticity-consistent (HC) errors is another interesting case to think about. There the model is still the linear model, but now we're more explicitly declaring that we want to estimate the covariance matrix, and that we are going to use, say, HC1 to do so. I'd still call this a linear model, but only if the original definition of the linear model specified the covariance as an estimand.)

If I'm going to actually implement things in code, I want to work with an object that specifies the estimation method, which is likely closely tied to a hyperparameter space.

I think that a parsnip model specification shouldn't work with the classical-stats sense of a model as we've defined it above, but rather should encapsulate everything you need to do to get parameters back. Parsnip is already doing a lot of this, but I think there's a lot of value in being very clear about what a parsnip object should specify. In my mind this includes, at a minimum:

• The estimand, or parameters you want to estimate, mostly implicit in the model you select
• The estimating procedure (e.g. LARS, or the analytic solution to OLS). Often implicit in the package you call.
• Any hyperparameter spaces (ex: `lambda in R+` for LASSO)
• Procedures for picking hyperparameters (ex: random search over bootstrap samples picking the smallest `lambda` within 1 SE of the minimum RMSE)
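A minimal sketch of what such an object might hold, as a plain R list (all names here are hypothetical, not parsnip's actual design):

```r
# Hypothetical model-specification object: the model plus everything needed
# to turn data into parameter estimates. Field names are illustrative only.
lasso_spec <- list(
  estimand    = "beta",                                   # parameters of interest
  procedure   = "coordinate descent (glmnet)",            # estimating procedure
  hyperparams = list(lambda = c(lower = 0, upper = Inf)), # hyperparameter space
  tuning      = "random search over bootstraps, one-SE rule" # picking lambda
)
str(lasso_spec)
```

The point is only that the first field is "the model" in the classical sense, and the other three are the "other stuff" that a usable specification has to carry.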

For now I think it makes sense to call this a `model specification`, but I think it's critically important to distinguish between the model and the model plus all this other stuff. Similarly, after the model-fitting process, when you have many different `fits` (one for each hyperparameter combination, say), there are tasks that involve working with all the `fits` together (you might be curious which variable entered the LASSO model first), and tasks that involve working with just one `fit` (e.g. looking at the LASSO coefficients themselves).

I strongly believe that a good interface very clearly differentiates between a group of `fits` and a single `fit`, and provides type-safe methods for working with each.
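One way to sketch that type distinction in R (again, names are hypothetical, and `lm()` stands in for a penalized fit so the snippet runs without extra packages):

```r
# A collection of fits (one per lambda) gets its own class, so that
# group-level and single-fit operations can't be confused.
lambdas <- c(0.01, 0.1, 1)
fits <- lapply(lambdas, function(lambda) {
  list(lambda = lambda, fit = lm(mpg ~ wt + hp, data = mtcars))
})
class(fits) <- "fit_collection"

# Only meaningful for the whole collection:
lambda_grid <- function(fc) vapply(unclass(fc), `[[`, numeric(1), "lambda")

# Only meaningful for a single fit:
coefs_of <- function(one_fit) coef(one_fit$fit)

lambda_grid(fits)    # the group-level view
coefs_of(fits[[2]])  # the single-fit view
```

With real S3 methods on `fit_collection`, calling a single-fit function on the group (or vice versa) would fail loudly instead of silently doing something odd.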

Related issue: canonical modelling examples

A related issue is to find canonical modelling examples that are sufficient to develop our intuition about what the code objects should look like. OLS is too simple because it doesn't need a lot of the machinery that other models need. I think that a good starting place is to have one canonical example where we can employ the submodel trick (penalized regression seems like a good place to start), and one where we can't (maybe SVMs here?). Another way to think about this: we should have one canonical example where there is exploitable structure in the hyperparameter space, and one canonical example where there isn't.
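`glmnet` is the obvious example of the submodel trick (it fits the whole `lambda` path in one pass), but the idea can be shown without any packages: for ridge regression, one SVD of `X` is the expensive step, and every submodel (every `lambda`) reuses it. A dependency-free sketch:

```r
# Submodel trick, sketched with ridge regression: decompose X once (the
# expensive step); coefficients for ANY lambda are then cheap, because
# every submodel shares the same SVD.
x <- scale(as.matrix(mtcars[, c("wt", "hp")]))
y <- mtcars$mpg - mean(mtcars$mpg)
s <- svd(x)                                    # one expensive decomposition

ridge_coefs <- function(lambda) {
  # beta(lambda) = V diag(d / (d^2 + lambda)) U' y, reusing the shared SVD
  d <- s$d
  s$v %*% (d / (d^2 + lambda) * crossprod(s$u, y))
}

betas <- sapply(c(0.01, 1, 100), ridge_coefs)  # all submodels, no refitting
```

An SVM with a cost parameter has no analogous shared structure: each hyperparameter value means refitting from scratch, which is exactly why it makes a good second canonical example.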


topepo commented Jun 15, 2018

I'll take a bit of time to digest this in detail. It's a great point, and @lionel- and I have had similar discussions. My current headspace on this:

• The "model" is conditional on knowing the general functional form, distribution specifications, estimation procedure, etc. I would say that the functional form only needs to be specified at the `X'beta` level; what those columns are and how they are encoded are immaterial to the "model".
• The "fitted model" has all quantities identified and estimated, whereas the general "model" can have unknowns for parameters and so on.
• The "modeling process" includes feature engineering, model tuning, calibration, and the other important data-driven activities that can be required to get a good "model".

`parsnip` is about the "model" as I've defined it above, mostly for pragmatic reasons. I wouldn't include the encoding of `X` as part of the model (mostly since I don't want to repeat it for different models that use the same encodings). Anyway, I'll comment in more detail later about some specific things above.
