Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Replace patsy with formulaic #6858

Open
bashtage opened this issue Jul 6, 2020 · 33 comments
Open

ENH: Replace patsy with formulaic #6858

bashtage opened this issue Jul 6, 2020 · 33 comments

Comments

@bashtage
Copy link
Member

bashtage commented Jul 6, 2020

Formulaic is a rewrite of patsy that has been designed for more extensibility. It already has features that patsy doesn't such as sparse array support. Should consider moving to this when it is a bit more mature.

https://github.com/matthewwardrop/formulaic

@bashtage bashtage changed the title Consider moving from patsy to formulaic ENH: Replace patsy with formulaic Jul 6, 2020
@ChadFulton
Copy link
Member

I don't use these features much, but I think it's great that there's a new project that might push formulas forward a bit more.

@josef-pkt
Copy link
Member

It would be great to have an alternative to patsy especially one that avoids eval and where formula information can be pickled.

@rluedde
Copy link

rluedde commented Jul 7, 2020

@josef-pkt This Stack Overflow Post describes pickling as

the process whereby a Python object hierarchy is converted into a byte stream

What is the purpose of pickling a formula in the case of statsmodels? It seems similar to how custom Javscript objects can simply be represented in JSON. Is that a fair comparison?

@josef-pkt
Copy link
Member

Pickling is the simplest way to store python objects in a file or to transmit python objects between threads or processes.

The main use case for statsmodels users is to store a model so it can later be used for prediction. This works with arrays and dataframes but not when patsy formulas are used. As a workaround we are rebuilding the design matrix, exog, from the formula.
e.g. #2469

Prediction, when formulas are used, needs to go through patsy to create the design matrix for the prediction exog from data in the original format.

@rluedde
Copy link

rluedde commented Jul 7, 2020

Why can't you simply write a formula out as a string? Is a formula in this sense something along the lines of y ~ x1 + x2? Or does a formula encompass all weights/coefficients/betas etc after fitting/training?

@josef-pkt What did you mean by this:

We "fixed" the formula pickling by rebuilding from the model (exog) from the original dataframe.

Why are Wilkinson formulas preferred? Is the alternative to use features as *args and formulas are more explicit?

@josef-pkt
Copy link
Member

It's not the formula itself that cannot be pickled, It's all the information associate with the formula information about how the design matrix was built, and how new design matrices with the same structure can be built.

Why are Wilkinson formulas preferred? Is the alternative to use features as *args and formulas are more explicit?

I never really looked at the details. The main point is that formulas are a short hand notation for defining a model. All other options require the user to specify a lot more explicitly.

Statsmodels uses a more limited set of formula features compared to R, e.g. our Mixed Models don't put the random effects specification inside the formula as R does, and we don't use multi-part formulas.

@rluedde
Copy link

rluedde commented Jul 7, 2020

So are patsy/formulaic libraries that translate formulas into instructions for building design matrices?

Statsmodels uses a more limited set of formula features compared to R, e.g. our Mixed Models don't put the random effects specification inside the formula as R does, and we don't use multi-part formulas.

Is this a choice made by statsmodels or is it due to limitations of patsy?

@josef-pkt
Copy link
Member

josef-pkt commented Jul 7, 2020

So are patsy/formulaic libraries that translate formulas into instructions for building design matrices?

Yes, and build the design matrix, and provide information about the design matrix that we can use for example for hypothesis tests.

Is this a choice made by statsmodels or is it due to limitations of patsy?

Mostly our choice. Beyond the basic formulas, formulas start to become cryptic. Being more explicit in those cases, makes it easier to understand if you are not a formula expert.
(There are many cases where I don't know what an R formula actually means and does.)

Partially it's also that we have more control over parts that don't go through patsy. Especially because we don't have a maintainer anymore that also contributes to patsy (or whose work on patsy is getting merged).

@rluedde
Copy link

rluedde commented Jul 7, 2020

Gotcha.

@josef-pkt @bashtage You both have been very helpful to me, so thank you guys.

@bashtage
Copy link
Member Author

bashtage commented Jul 7, 2020

Is this a choice made by statsmodels or is it due to limitations of patsy?

In linearmodels I use some custom specifications that go beyond Patsy. I manually parse these. It would be nice to ultimately have a more integrated approach. For example, in instrumental variable models, which are estimated using 2 regressions, I use a formula like

y ~ 1 + x1 + x2 + x3 + [x4 + x5 ~ z1 + z2 + z3]

where the [ ] formula indicates that endogenous variables and instruments and the remaining variables are exogenous.

In panel models I use keywords like

y ~ x1 + EntityEffects

where EntityEffects tells me that the model should include an entity (fixed) effects.

@rluedde
Copy link

rluedde commented Jul 8, 2020

@bashtage Why is linearmodels seperate from statsmodels? Is linearmodels more specifically for finance?

@bashtage
Copy link
Member Author

bashtage commented Jul 8, 2020

statsmodels base structure is too rigid to reliably accommodate data structures that do not resemble a single series.

@rluedde
Copy link

rluedde commented Jul 8, 2020

What is a data structure that doesn't resemble a single series in this case? Grouped data? Time series? Multiple time series?

@bashtage
Copy link
Member Author

bashtage commented Jul 8, 2020

Both panel and multiple series such as what one uses in a SUR.

My other massive pain point is that way statsmodels uses Wrapper classes. The GIF below shows many key presses it is to get into a results method or property to debug:

sm3

@josef-pkt
Copy link
Member

I never do this, I always use post-mortem debugger, mainly with pytest

@josef-pkt
Copy link
Member

josef-pkt commented Sep 2, 2020

I think we should add an experimental from_formulaic methods to gain more experience and give it more exposure, initially without guarantees for backwards compatibility or that it works beyond model creation and predict, and that not for all models.
For that, we would not need to have complete handle_formula support where it is not generic/inherited.

A prime candidate where it would go beyond patsy, would be sparse support.

Based on the popularity by users, it would be important to support it in anova_lm.

@bashtage
Copy link
Member Author

bashtage commented Sep 2, 2020

It would be easier to add engine="patsy" to from_formula and then allow "formulaic" as an option.

@ClaasRostock
Copy link

Has adding support for formulaic been considered, recently?

@josef-pkt
Copy link
Member

considered yes, but no work to support it has been done yet. It's still on the wishlist.

@ClaasRostock
Copy link

ClaasRostock commented Jan 3, 2023 via email

@josef-pkt
Copy link
Member

I'm looking at formulaic again, but it's difficult to see what is supported and how to get the information for post-estimation.

@bashtage
I briefly looked at linearmodels. I only saw the usage of model_matrix.
How to you do the formula transformation for out of sample predict exog?

One of the main open problems we have is to get the formula information to construct margins (get_margeff) for interaction effects (discrete-continuous, continuous-continuous) and multicolumn transformation (like polynomials).

The two current post-estimation usages of patsy's formula info (beside predict) is anova_lm (which I'm not very familiar with) and wald tests, those based on terms and the pairwise tests for categorical (wald_test_terms and t_test_pairwise).
I didn't find much trying to inspect the formula information available with formulaic.

@karrmagadgeteer2
Copy link

Adding here just in case. patsy issue 191 Please ignore if not relevant. Relates to statsmodels running on tests with Python 11.

@matthewwardrop
Copy link

matthewwardrop commented Apr 22, 2023

Hi @josef-pkt !

I'm the author of Formulaic, and am just now wrapping up the final bits before v1.0.0 in Formulaic. What was mostly missing hitherto was documentation, which is now mostly complete. I'd be more than happy to help port statsmodels over to formulaic, including adding any missing features; and can shoulder most of the load if that is desired.

Re: re-use of the spec used for model matrices, please refer to the docs here: https://matthewwardrop.github.io/formulaic/guides/model_specs/#reusing-model-specifications .

There's also a brief migration guide here: https://matthewwardrop.github.io/formulaic/migration/ . The ordering differences mentioned here are about to be mitigated and/or completely disappear.

Also, thanks for posting the patsy issue @karrmagadgeteer2 . I'll take a look and get it patched. [Update: This is already fixed in 0.5.3, which I released a while back; please upgrade!]

@deanm0000
Copy link

According to patsy's github https://github.com/pydata/patsy, it is no longer under active development. They are recommending people use formulaic.

@bashtage
Copy link
Member Author

I think for v0.15.0 we should have formulaic in place. Whether it is mandatory or we provide a period with both patsy and formulaic needs consideration.

@josef-pkt josef-pkt added this to the 0.15 milestone Apr 28, 2023
@josef-pkt
Copy link
Member

josef-pkt commented Apr 28, 2023

The new documentation of formulaic looks very helpful.

pull requests would be very welcome

AFAIR, the main current usage of formula are

  • from_formula and statsmodels.base.data class should have very similar code between patsy and formulaic and should be relatively straight forward to add
  • predict requires recreating a new exog based on formula info
  • t_test, wald_test require building linear constraint matrices (does not use formula information, and so works as long as patsy is installed)
  • looping through terms used in anova_lm and wald_test_terms to build hypotheses tests for terms
  • t_test_pairwise needs information for a one-factor term
  • missing handling in from_formula and predict includes several workarounds for patsy's automatic dropping of rows with missing/nan values.
  • dropping intercept: we have two models, OrderedModel and ConditionalLogit/ConditionalPoisson, that cannot have an explicit or implicit intercept. We drop an explicit intercept and, AFAIR, update the formula info (design_info)

I think we will not be able to drop patsy anytime soon, both for backwards compatibility and for missing features in formulaic like splines.


for the future:
And then there is a wishlist if it eventually works out with formulaic, issues or features not supported by patsy or not fully used by statsmodels.

  • metainformation about stateful transforms (e.g. splines) (*)
  • interaction terms:
    • we don't have currently any post estimation for interacted terms, eg. marginal/partial effects or pairwise t_test
    • empty cells in interaction terms (will likely be needed to be handled by statsmodels)
  • updating formula if we need to drop a term, e.g. dropping intercept, empty cell column,
  • bonus: dropping a term would be nice for some use cases (but I don't remember where I wanted this). Currently, we don't provide or update formula info if we have to change the model matrix internally.
  • a list of column names in the original dataframe of variables actually used in creating the design matrix (exog). (I guess, this might not be available in all cases if user uses functions in the formula that are evaled.)

The derivative of the formula transform from original data to design/model matrix would be very useful, but difficult to get.
My latest attempt was to use the transform as black box and use numerical derivatives #5387 (comment)

(*) e.g. #7609
We currently do not have any code to automatically create penalization for terms specified in a formula. It's impossible or difficult to get the relevant information from the formula/design info.
GAM implements penalization for a term that is defined outside of the formula. Additionally, our spline code, initially copied from patsy, includes additional methods to compute the penalization terms.

@matthewwardrop
Copy link

matthewwardrop commented Apr 28, 2023

Thanks for your reply, and for calling out the various interesting pieces. I'll set about putting up a PR that adds optional support for formulaic sometime between now and shortly after formulaic 1.0.0 is released. I can't guarantee any timelines (I have four kids, and life is busy at work too), but I will do it.

I've left a few notes about various things inline below. When we get closer / during the PR I may have other questions/comments, but we can deal with that then.

* `t_test`, `wald_test` require building linear constraint matrices (does not use formula information, and so works as long as patsy is installed)

Formulaic also implements linear constraint matrices, though this has so far evaded the documentation. This was initially requested by linearmodels.

* missing handling in `from_formula` and `predict` includes several workarounds for patsy's automatic dropping of rows with missing/nan values.

This is customizable in formulaic, you can optionally drop nan rows (the default), raise exceptions, or ignore and propagate them. Will have to see if those modes are sufficient.

* dropping intercept: we have two models, OrderedModel and ???, that cannot have an explicit or implicit intercept. We drop an explicit intercept and, AFAIR, update the formula info (design_info)

Appending a +0 or -1 to any incoming formula should guarantee this. Will investigate why/if this is the case here.

I think we will not be able to drop patsy anytime soon, both for backwards compatibility and for missing features in formulaic like splines.

Is the set of required functionality from patsy unit tested? Formulaic is highly compatible with patsy, and only misses from patsy the natural and cyclic cubic spline implementations, and tensor smoothing. Orthogonal polynomial bases (not implemented by patsy) and basis splines are supported. None of the missing transforms are challenging to add if they are blockers. But I agree that some amount of transition period would like be required.

* updating formula if we need to drop a term, e.g. dropping intercept, empty cell column,

This is technically straightforward if a bit of a chore, but we could definitely add some helper methods to simplify things.

* bonus: dropping a term would be nice for some use cases (but I don't remember where I wanted this). Currently, we don't provide or update formula info if we have to change the model matrix internally.

This should be easier in Formulaic since the formula is a first class entity that is designed to be mutated. But... we shall see if it caters to your use-cases.

* a list of column names in the original dataframe of variables actually used in creating the design matrix (exog).  (I guess, this might not be available in all cases if user uses functions in the formula that are `eval`ed.)

This is about to be added. Turned out to be pretty straightforward.

The derivative of the formula transform from original data to design/model matrix would be very useful, but difficult to get. My latest attempt was to use the transform as black box and use numerical derivatives #5387 (comment)

Interesting. I also added to formulaic symbolic differentiation of model matrices using sympy. I also intended to add finite differencing support... but since then I have become less convinced of the approach and utility (internally at work we marginalise over the delta of model matrices explicitly, which works fine if being slightly less computationally efficient than computing the categorical delta during factor evaluation). I'd be curious to see if either or both of these would be helpful to statsmodels.

(*) e.g. #7609 We currently do not have any code to automatically create penalization for terms specified in a formula. It's impossible or difficult to get the relevant information from the formula/design info. GAM implements penalization for a term that is defined outside of the formula. Additionally, our spline code, initially copied from patsy, includes additional methods to compute the penalization terms.

Definitely curious about this one. I have in mind to add support for transforms that extended terms and/or formulae with metadata that can be used during modelling... but I will dig into this use-cases a bit more before over-promising! It's definitely a possibility!

@josef-pkt
Copy link
Member

comment to just one item

  • dropping intercept: we have two models, OrderedModel and ???, that cannot have an explicit or implicit intercept. We drop an explicit intercept and, AFAIR, update the formula info (design_info)

Appending a +0 or -1 to any incoming formula should guarantee this. Will investigate why/if this is the case here.

The problem for this in patsy is that if there are categorical variables, then -1 adds the implicit intercept to the categorical encoding, e.g. does not drop any reference level.
This is even worse for us because we can remove an explicit constant but not an implicit constant.

@matthewwardrop
Copy link

The problem for this in patsy is that if there are categorical variables, then -1 adds the implicit intercept to the categorical encoding, e.g. does not drop any reference level.
This is even worse for us because we can remove an explicit constant but not an implicit constant.

Ah... yes, Formulaic will do that too by default. Will see what can be done when we get to that chunk of code :).

@josef-pkt
Copy link
Member

one more issue that I had forgotten

  • pickle support for models: because patsy's design_info cannot be pickled, we have to rebuild it from the original formula and data during unpickling of a model. The tricky part was getting the evalenv level.

I guess it will be a lot of work to fully switch to formulaic.
However, we can take time for the transition, as long as users have the option to use patsy instead of formulaic.
I will be glad once we have the core support for formulaic, so we can gain some experience with it.

And then we can start using features that formulaic has but patsy does not, like supporting sparse design matrices.

@MartinStancsicsQC
Copy link

The problem for this in patsy is that if there are categorical variables, then -1 adds the implicit intercept to the categorical encoding, e.g. does not drop any reference level.
This is even worse for us because we can remove an explicit constant but not an implicit constant.

Ah... yes, Formulaic will do that too by default. Will see what can be done when we get to that chunk of code :).

Just putting it out here in case it is useful: we run into a similar issue in a slightly different context. Even when an intercept is included in the model, we did not want to create a dense all-ones column for it, as the downstream library handles intercepts internally. Our solution was to create an option that adds an empty intercept term. Then full-rankness resolution works as if the intercept was there, but the model matrix does not contain one.

@mgorny
Copy link
Contributor

mgorny commented Nov 30, 2023

At this point, Patsy seems to be broken with Python 3.12.

@matthewwardrop
Copy link

Hi all! Circling back around to this now that Formulaic 1.0.0 has been published. Hopefully I can put up an initial PR soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants