ENH: Replace patsy with formulaic #6858

bashtage · 2020-07-06T17:36:46Z

Formulaic is a rewrite of patsy that has been designed for more extensibility. It already has features that patsy doesn't such as sparse array support. Should consider moving to this when it is a bit more mature.

https://github.com/matthewwardrop/formulaic

ChadFulton · 2020-07-07T00:55:59Z

I don't use these features much, but I think it's great that there's a new project that might push formulas forward a bit more.

josef-pkt · 2020-07-07T02:20:25Z

It would be great to have an alternative to patsy especially one that avoids eval and where formula information can be pickled.

rluedde · 2020-07-07T13:31:23Z

@josef-pkt This Stack Overflow Post describes pickling as

the process whereby a Python object hierarchy is converted into a byte stream

What is the purpose of pickling a formula in the case of statsmodels? It seems similar to how custom Javscript objects can simply be represented in JSON. Is that a fair comparison?

josef-pkt · 2020-07-07T13:47:05Z

Pickling is the simplest way to store python objects in a file or to transmit python objects between threads or processes.

The main use case for statsmodels users is to store a model so it can later be used for prediction. This works with arrays and dataframes but not when patsy formulas are used. As a workaround we are rebuilding the design matrix, exog, from the formula.
e.g. #2469

Prediction, when formulas are used, needs to go through patsy to create the design matrix for the prediction exog from data in the original format.

rluedde · 2020-07-07T14:13:21Z

Why can't you simply write a formula out as a string? Is a formula in this sense something along the lines of y ~ x1 + x2? Or does a formula encompass all weights/coefficients/betas etc after fitting/training?

@josef-pkt What did you mean by this:

We "fixed" the formula pickling by rebuilding from the model (exog) from the original dataframe.

Why are Wilkinson formulas preferred? Is the alternative to use features as *args and formulas are more explicit?

josef-pkt · 2020-07-07T14:26:25Z

It's not the formula itself that cannot be pickled, It's all the information associate with the formula information about how the design matrix was built, and how new design matrices with the same structure can be built.

Why are Wilkinson formulas preferred? Is the alternative to use features as *args and formulas are more explicit?

I never really looked at the details. The main point is that formulas are a short hand notation for defining a model. All other options require the user to specify a lot more explicitly.

Statsmodels uses a more limited set of formula features compared to R, e.g. our Mixed Models don't put the random effects specification inside the formula as R does, and we don't use multi-part formulas.

rluedde · 2020-07-07T14:38:11Z

So are patsy/formulaic libraries that translate formulas into instructions for building design matrices?

Statsmodels uses a more limited set of formula features compared to R, e.g. our Mixed Models don't put the random effects specification inside the formula as R does, and we don't use multi-part formulas.

Is this a choice made by statsmodels or is it due to limitations of patsy?

josef-pkt · 2020-07-07T14:48:15Z

So are patsy/formulaic libraries that translate formulas into instructions for building design matrices?

Yes, and build the design matrix, and provide information about the design matrix that we can use for example for hypothesis tests.

Is this a choice made by statsmodels or is it due to limitations of patsy?

Mostly our choice. Beyond the basic formulas, formulas start to become cryptic. Being more explicit in those cases, makes it easier to understand if you are not a formula expert.
(There are many cases where I don't know what an R formula actually means and does.)

Partially it's also that we have more control over parts that don't go through patsy. Especially because we don't have a maintainer anymore that also contributes to patsy (or whose work on patsy is getting merged).

rluedde · 2020-07-07T14:55:17Z

Gotcha.

@josef-pkt @bashtage You both have been very helpful to me, so thank you guys.

bashtage · 2020-07-07T15:00:36Z

Is this a choice made by statsmodels or is it due to limitations of patsy?

In linearmodels I use some custom specifications that go beyond Patsy. I manually parse these. It would be nice to ultimately have a more integrated approach. For example, in instrumental variable models, which are estimated using 2 regressions, I use a formula like

y ~ 1 + x1 + x2 + x3 + [x4 + x5 ~ z1 + z2 + z3]

where the [ ] formula indicates that endogenous variables and instruments and the remaining variables are exogenous.

In panel models I use keywords like

y ~ x1 + EntityEffects

where EntityEffects tells me that the model should include an entity (fixed) effects.

rluedde · 2020-07-08T13:28:16Z

@bashtage Why is linearmodels seperate from statsmodels? Is linearmodels more specifically for finance?

bashtage · 2020-07-08T13:35:10Z

statsmodels base structure is too rigid to reliably accommodate data structures that do not resemble a single series.

rluedde · 2020-07-08T13:40:03Z

What is a data structure that doesn't resemble a single series in this case? Grouped data? Time series? Multiple time series?

bashtage · 2020-07-08T15:32:49Z

Both panel and multiple series such as what one uses in a SUR.

My other massive pain point is that way statsmodels uses Wrapper classes. The GIF below shows many key presses it is to get into a results method or property to debug:

josef-pkt · 2020-07-08T15:50:31Z

I never do this, I always use post-mortem debugger, mainly with pytest

josef-pkt · 2020-09-02T14:57:34Z

I think we should add an experimental from_formulaic methods to gain more experience and give it more exposure, initially without guarantees for backwards compatibility or that it works beyond model creation and predict, and that not for all models.
For that, we would not need to have complete handle_formula support where it is not generic/inherited.

A prime candidate where it would go beyond patsy, would be sparse support.

Based on the popularity by users, it would be important to support it in anova_lm.

bashtage · 2020-09-02T15:17:43Z

It would be easier to add engine="patsy" to from_formula and then allow "formulaic" as an option.

ClaasRostock · 2023-01-01T20:37:30Z

Has adding support for formulaic been considered, recently?

josef-pkt · 2023-01-03T02:15:54Z

considered yes, but no work to support it has been done yet. It's still on the wishlist.

ClaasRostock · 2023-01-03T08:24:40Z

Thank you, Josef, for the information! From: Josef Perktold ***@***.***> Sent: Dienstag, 3. Januar 2023 03:16 To: statsmodels/statsmodels ***@***.***> Cc: Rostock, Claas ***@***.***>; Comment ***@***.***> Subject: Re: [statsmodels/statsmodels] ENH: Replace patsy with formulaic (#6858) considered yes, but no work to support it has been done yet. It's still on the wishlist. - Reply to this email directly, view it on GitHub<https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fstatsmodels%2Fstatsmodels%2Fissues%2F6858%23issuecomment-1369329236&data=05%7C01%7Cclaas.rostock%40dnv.com%7C5ec71d8248564b25fb8408daed307885%7Cadf10e2bb6e941d6be2fc12bb566019c%7C0%7C0%7C638083089681722573%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=bIxlgSa0uIfX0jV1Jw2u2DwtriQAhaB8%2B8IfgQuebBk%3D&reserved=0>, or unsubscribe<https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FALT6QOC4ASZN5HLQXDAHOZTWQODWLANCNFSM4OR2INJQ&data=05%7C01%7Cclaas.rostock%40dnv.com%7C5ec71d8248564b25fb8408daed307885%7Cadf10e2bb6e941d6be2fc12bb566019c%7C0%7C0%7C638083089681722573%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=ksQOaOwwRsQEB55qYwQbkchfsrI5sNchVn%2FRr6Kfu8s%3D&reserved=0>. You are receiving this because you commented.Message ID: ***@***.******@***.***>> DNV GL SE, Registered Office Hamburg No. HRB 115442 * Chairman of the Supervisory Board: Remi Eriksen * Executive Board: Knut ?rbeck-Nilssen * Torsten Schramm * Bj?rn-Olaf Borth The latest edition of the General Terms and Conditions of DNV GL SE is applicable. ************************************************************************************** This e-mail and any attachments thereto may contain confidential information and/or information protected by intellectual property rights for the exclusive attention of the intended addressees named above. If you have received this transmission in error, please immediately notify the sender by return e-mail and delete this message and its attachments. Unauthorized use, copying or further full or partial distribution of this e-mail or its contents is prohibited. **************************************************************************************

josef-pkt · 2023-03-08T15:13:21Z

I'm looking at formulaic again, but it's difficult to see what is supported and how to get the information for post-estimation.

@bashtage
I briefly looked at linearmodels. I only saw the usage of model_matrix.
How to you do the formula transformation for out of sample predict exog?

One of the main open problems we have is to get the formula information to construct margins (get_margeff) for interaction effects (discrete-continuous, continuous-continuous) and multicolumn transformation (like polynomials).

The two current post-estimation usages of patsy's formula info (beside predict) is anova_lm (which I'm not very familiar with) and wald tests, those based on terms and the pairwise tests for categorical (wald_test_terms and t_test_pairwise).
I didn't find much trying to inspect the formula information available with formulaic.

karrmagadgeteer2 · 2023-04-07T21:49:27Z

Adding here just in case. patsy issue 191 Please ignore if not relevant. Relates to statsmodels running on tests with Python 11.

matthewwardrop · 2023-04-22T05:41:16Z

Hi @josef-pkt !

I'm the author of Formulaic, and am just now wrapping up the final bits before v1.0.0 in Formulaic. What was mostly missing hitherto was documentation, which is now mostly complete. I'd be more than happy to help port statsmodels over to formulaic, including adding any missing features; and can shoulder most of the load if that is desired.

Re: re-use of the spec used for model matrices, please refer to the docs here: https://matthewwardrop.github.io/formulaic/guides/model_specs/#reusing-model-specifications .

There's also a brief migration guide here: https://matthewwardrop.github.io/formulaic/migration/ . The ordering differences mentioned here are about to be mitigated and/or completely disappear.

Also, thanks for posting the patsy issue @karrmagadgeteer2 . I'll take a look and get it patched. [Update: This is already fixed in 0.5.3, which I released a while back; please upgrade!]

deanm0000 · 2023-04-28T11:19:20Z

According to patsy's github https://github.com/pydata/patsy, it is no longer under active development. They are recommending people use formulaic.

bashtage · 2023-04-28T11:20:47Z

I think for v0.15.0 we should have formulaic in place. Whether it is mandatory or we provide a period with both patsy and formulaic needs consideration.

josef-pkt · 2023-04-28T14:56:22Z

The new documentation of formulaic looks very helpful.

pull requests would be very welcome

AFAIR, the main current usage of formula are

from_formula and statsmodels.base.data class should have very similar code between patsy and formulaic and should be relatively straight forward to add
predict requires recreating a new exog based on formula info
t_test, wald_test require building linear constraint matrices (does not use formula information, and so works as long as patsy is installed)
looping through terms used in anova_lm and wald_test_terms to build hypotheses tests for terms
t_test_pairwise needs information for a one-factor term
missing handling in from_formula and predict includes several workarounds for patsy's automatic dropping of rows with missing/nan values.
dropping intercept: we have two models, OrderedModel and ConditionalLogit/ConditionalPoisson, that cannot have an explicit or implicit intercept. We drop an explicit intercept and, AFAIR, update the formula info (design_info)

I think we will not be able to drop patsy anytime soon, both for backwards compatibility and for missing features in formulaic like splines.

for the future:
And then there is a wishlist if it eventually works out with formulaic, issues or features not supported by patsy or not fully used by statsmodels.

metainformation about stateful transforms (e.g. splines) (*)
interaction terms:
- we don't have currently any post estimation for interacted terms, eg. marginal/partial effects or pairwise t_test
- empty cells in interaction terms (will likely be needed to be handled by statsmodels)
updating formula if we need to drop a term, e.g. dropping intercept, empty cell column,
bonus: dropping a term would be nice for some use cases (but I don't remember where I wanted this). Currently, we don't provide or update formula info if we have to change the model matrix internally.
a list of column names in the original dataframe of variables actually used in creating the design matrix (exog). (I guess, this might not be available in all cases if user uses functions in the formula that are evaled.)

The derivative of the formula transform from original data to design/model matrix would be very useful, but difficult to get.
My latest attempt was to use the transform as black box and use numerical derivatives #5387 (comment)

(*) e.g. #7609
We currently do not have any code to automatically create penalization for terms specified in a formula. It's impossible or difficult to get the relevant information from the formula/design info.
GAM implements penalization for a term that is defined outside of the formula. Additionally, our spline code, initially copied from patsy, includes additional methods to compute the penalization terms.

matthewwardrop · 2023-04-28T22:30:42Z

Thanks for your reply, and for calling out the various interesting pieces. I'll set about putting up a PR that adds optional support for formulaic sometime between now and shortly after formulaic 1.0.0 is released. I can't guarantee any timelines (I have four kids, and life is busy at work too), but I will do it.

I've left a few notes about various things inline below. When we get closer / during the PR I may have other questions/comments, but we can deal with that then.

* `t_test`, `wald_test` require building linear constraint matrices (does not use formula information, and so works as long as patsy is installed)

Formulaic also implements linear constraint matrices, though this has so far evaded the documentation. This was initially requested by linearmodels.

* missing handling in `from_formula` and `predict` includes several workarounds for patsy's automatic dropping of rows with missing/nan values.

This is customizable in formulaic, you can optionally drop nan rows (the default), raise exceptions, or ignore and propagate them. Will have to see if those modes are sufficient.

* dropping intercept: we have two models, OrderedModel and ???, that cannot have an explicit or implicit intercept. We drop an explicit intercept and, AFAIR, update the formula info (design_info)

Appending a +0 or -1 to any incoming formula should guarantee this. Will investigate why/if this is the case here.

I think we will not be able to drop patsy anytime soon, both for backwards compatibility and for missing features in formulaic like splines.

Is the set of required functionality from patsy unit tested? Formulaic is highly compatible with patsy, and only misses from patsy the natural and cyclic cubic spline implementations, and tensor smoothing. Orthogonal polynomial bases (not implemented by patsy) and basis splines are supported. None of the missing transforms are challenging to add if they are blockers. But I agree that some amount of transition period would like be required.

* updating formula if we need to drop a term, e.g. dropping intercept, empty cell column,

This is technically straightforward if a bit of a chore, but we could definitely add some helper methods to simplify things.

* bonus: dropping a term would be nice for some use cases (but I don't remember where I wanted this). Currently, we don't provide or update formula info if we have to change the model matrix internally.

This should be easier in Formulaic since the formula is a first class entity that is designed to be mutated. But... we shall see if it caters to your use-cases.

* a list of column names in the original dataframe of variables actually used in creating the design matrix (exog).  (I guess, this might not be available in all cases if user uses functions in the formula that are `eval`ed.)

This is about to be added. Turned out to be pretty straightforward.

The derivative of the formula transform from original data to design/model matrix would be very useful, but difficult to get. My latest attempt was to use the transform as black box and use numerical derivatives #5387 (comment)

Interesting. I also added to formulaic symbolic differentiation of model matrices using sympy. I also intended to add finite differencing support... but since then I have become less convinced of the approach and utility (internally at work we marginalise over the delta of model matrices explicitly, which works fine if being slightly less computationally efficient than computing the categorical delta during factor evaluation). I'd be curious to see if either or both of these would be helpful to statsmodels.

(*) e.g. #7609 We currently do not have any code to automatically create penalization for terms specified in a formula. It's impossible or difficult to get the relevant information from the formula/design info. GAM implements penalization for a term that is defined outside of the formula. Additionally, our spline code, initially copied from patsy, includes additional methods to compute the penalization terms.

Definitely curious about this one. I have in mind to add support for transforms that extended terms and/or formulae with metadata that can be used during modelling... but I will dig into this use-cases a bit more before over-promising! It's definitely a possibility!

josef-pkt · 2023-04-28T23:03:14Z

comment to just one item

dropping intercept: we have two models, OrderedModel and ???, that cannot have an explicit or implicit intercept. We drop an explicit intercept and, AFAIR, update the formula info (design_info)

Appending a +0 or -1 to any incoming formula should guarantee this. Will investigate why/if this is the case here.

The problem for this in patsy is that if there are categorical variables, then -1 adds the implicit intercept to the categorical encoding, e.g. does not drop any reference level.
This is even worse for us because we can remove an explicit constant but not an implicit constant.

matthewwardrop · 2023-04-29T02:05:09Z

The problem for this in patsy is that if there are categorical variables, then -1 adds the implicit intercept to the categorical encoding, e.g. does not drop any reference level.
This is even worse for us because we can remove an explicit constant but not an implicit constant.

Ah... yes, Formulaic will do that too by default. Will see what can be done when we get to that chunk of code :).

josef-pkt · 2023-04-29T12:11:46Z

one more issue that I had forgotten

pickle support for models: because patsy's design_info cannot be pickled, we have to rebuild it from the original formula and data during unpickling of a model. The tricky part was getting the evalenv level.

I guess it will be a lot of work to fully switch to formulaic.
However, we can take time for the transition, as long as users have the option to use patsy instead of formulaic.
I will be glad once we have the core support for formulaic, so we can gain some experience with it.

And then we can start using features that formulaic has but patsy does not, like supporting sparse design matrices.

MartinStancsicsQC · 2023-08-07T07:47:18Z

The problem for this in patsy is that if there are categorical variables, then -1 adds the implicit intercept to the categorical encoding, e.g. does not drop any reference level.
This is even worse for us because we can remove an explicit constant but not an implicit constant.

Ah... yes, Formulaic will do that too by default. Will see what can be done when we get to that chunk of code :).

Just putting it out here in case it is useful: we run into a similar issue in a slightly different context. Even when an intercept is included in the model, we did not want to create a dense all-ones column for it, as the downstream library handles intercepts internally. Our solution was to create an option that adds an empty intercept term. Then full-rankness resolution works as if the intercept was there, but the model matrix does not contain one.

mgorny · 2023-11-30T14:18:41Z

At this point, Patsy seems to be broken with Python 3.12.

matthewwardrop · 2023-12-25T06:08:17Z

Hi all! Circling back around to this now that Formulaic 1.0.0 has been published. Hopefully I can put up an initial PR soon.

bashtage changed the title ~~Consider moving from patsy to formulaic~~ ENH: Replace patsy with formulaic Jul 6, 2020

bashtage added comp-formula type-enh labels Jul 6, 2020

bashtage mentioned this issue Jul 7, 2020

What are the limitations from treatying formulaic as a drop-in replacement for patsy? matthewwardrop/formulaic#2

Closed

josef-pkt mentioned this issue May 21, 2021

handle_formula_data should sometimes return matrix #7427

Open

josef-pkt mentioned this issue Jun 9, 2021

remove_data= option does not appear to remove data fully #7494

Open

josef-pkt mentioned this issue Aug 2, 2021

GLMGam with patsy's built-in B-Spline Functionality #7609

Open

josef-pkt mentioned this issue Oct 9, 2021

ENH: multipart formulas for multipart models #7789

Open

lorentzenchr mentioned this issue Jan 16, 2022

Consider migrating project to some community umbrella matthewwardrop/formulaic#51

Open

josef-pkt added this to the 0.15 milestone Apr 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Replace patsy with formulaic #6858

ENH: Replace patsy with formulaic #6858

bashtage commented Jul 6, 2020

ChadFulton commented Jul 7, 2020

josef-pkt commented Jul 7, 2020

rluedde commented Jul 7, 2020

josef-pkt commented Jul 7, 2020

rluedde commented Jul 7, 2020 •

edited

Loading

josef-pkt commented Jul 7, 2020

rluedde commented Jul 7, 2020

josef-pkt commented Jul 7, 2020 •

edited

Loading

rluedde commented Jul 7, 2020

bashtage commented Jul 7, 2020

rluedde commented Jul 8, 2020

bashtage commented Jul 8, 2020

rluedde commented Jul 8, 2020

bashtage commented Jul 8, 2020

josef-pkt commented Jul 8, 2020

josef-pkt commented Sep 2, 2020 •

edited

Loading

bashtage commented Sep 2, 2020

ClaasRostock commented Jan 1, 2023

josef-pkt commented Jan 3, 2023

ClaasRostock commented Jan 3, 2023 via email

josef-pkt commented Mar 8, 2023

karrmagadgeteer2 commented Apr 7, 2023

matthewwardrop commented Apr 22, 2023 •

edited

Loading

deanm0000 commented Apr 28, 2023

bashtage commented Apr 28, 2023

josef-pkt commented Apr 28, 2023 •

edited

Loading

matthewwardrop commented Apr 28, 2023 •

edited

Loading

josef-pkt commented Apr 28, 2023

matthewwardrop commented Apr 29, 2023

josef-pkt commented Apr 29, 2023

MartinStancsicsQC commented Aug 7, 2023

mgorny commented Nov 30, 2023

matthewwardrop commented Dec 25, 2023

ENH: Replace patsy with formulaic #6858

ENH: Replace patsy with formulaic #6858

Comments

bashtage commented Jul 6, 2020

ChadFulton commented Jul 7, 2020

josef-pkt commented Jul 7, 2020

rluedde commented Jul 7, 2020

josef-pkt commented Jul 7, 2020

rluedde commented Jul 7, 2020 • edited Loading

josef-pkt commented Jul 7, 2020

rluedde commented Jul 7, 2020

josef-pkt commented Jul 7, 2020 • edited Loading

rluedde commented Jul 7, 2020

bashtage commented Jul 7, 2020

rluedde commented Jul 8, 2020

bashtage commented Jul 8, 2020

rluedde commented Jul 8, 2020

bashtage commented Jul 8, 2020

josef-pkt commented Jul 8, 2020

josef-pkt commented Sep 2, 2020 • edited Loading

bashtage commented Sep 2, 2020

ClaasRostock commented Jan 1, 2023

josef-pkt commented Jan 3, 2023

ClaasRostock commented Jan 3, 2023 via email

josef-pkt commented Mar 8, 2023

karrmagadgeteer2 commented Apr 7, 2023

matthewwardrop commented Apr 22, 2023 • edited Loading

deanm0000 commented Apr 28, 2023

bashtage commented Apr 28, 2023

josef-pkt commented Apr 28, 2023 • edited Loading

matthewwardrop commented Apr 28, 2023 • edited Loading

josef-pkt commented Apr 28, 2023

matthewwardrop commented Apr 29, 2023

josef-pkt commented Apr 29, 2023

MartinStancsicsQC commented Aug 7, 2023

mgorny commented Nov 30, 2023

matthewwardrop commented Dec 25, 2023

rluedde commented Jul 7, 2020 •

edited

Loading

josef-pkt commented Jul 7, 2020 •

edited

Loading

josef-pkt commented Sep 2, 2020 •

edited

Loading

matthewwardrop commented Apr 22, 2023 •

edited

Loading

josef-pkt commented Apr 28, 2023 •

edited

Loading

matthewwardrop commented Apr 28, 2023 •

edited

Loading