Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Wilkinson formula interface and scikit-learn style estimators #103

Closed
wants to merge 60 commits into from

Conversation

tdhoffman
Copy link

This pull request supercedes #102. It adds the following features:

  • spatial and nonspatial Wilkinson formulas (spreg/formula.py)
  • a submodule for spreg called spreg.sklearn that includes scikit-learn interfaces to Error, Lag, DurbinError, and DurbinLag models, as well as Anselin LM tests in the scikit-learn metrics style. This submodule has its own formula interface independent of spreg.from_formula that properly dispatches to the sklearn models.
  • two notebooks that demonstrate the usage and value of these features (spreg/notebooks/formula_example.ipynb and spreg/notebooks/sklearn_example.ipynb)
  • unit tests for all added code (spreg/tests for formulas, spreg/sklearn/tests for the
  • documentation and doctests for all added code

Pursuant to this comment on #101, everything in this PR is fully backwards compatible. These are add-ons to the package which sit on top of existing code and provide alternative ways for users to interface with the core functionality of spreg. To ensure the changes are not breaking, the spreg.sklearn submodule must be directly loaded by users (e.g. via import spreg.sklearn) as it will not be automatically imported otherwise.

Currently, the formula interface function dispatches to all of the spatial regression models. While it does not support regime regression or seemingly unrelated regression, it provides a solid base of code to which these features can be added.

More information about the design process for these features can be found in my GSoC 2022 Progress Journal, and of course I'm more than happy to answer any questions and make any changes!

@TaylorOshan
Copy link
Contributor

@jGaboardi just curious, was there any off-line discussion about this before closing? IIRC, this was a pretty good effort towards adding the formula capabilities that we have talked about for a while.

@lanselin
Copy link
Member

I'm not a fan of the Wilkinson formula. For any but the simplest models it creates all kinds of issues, e.g., how to introduce WX variables without having to explicitly list them, etc. For example, compare to the in my opinion overly complicated interface in spatialreg in R to accommodate all the situations related to spatial Durbin, endogenous variables, instruments, etc. Btw, instruments are never part of an actual formula but added on by means of these (again, in my opinion) very awkward | | options. While not ideal, I very much like the most recent setup (1.4) which allows for specification of a range of models such as SLX, spatial Durbin, SLX error and even GNS, using GMM_Error and slx_lags and add_wy arguments. The WX and Wy variables never need to be listed explicitly but are computed (and variable names created) under the hood. The fewer variable names need to be passed, the fewer opportunities for typos. My golden rule in life :-)

I think our efforts are better spent at enhancing functionality in supporting different models and estimation methods. I was never a fan of this in the first place, not sure why it even became a GOSC project ...

@jGaboardi
Copy link
Member

@TaylorOshan I did not close this... Perhaps I screwed something up with the pysal:master -> pysal:main switch. All open PRs were supposed to be automatically updated, but it looks like something funky happened...

@jGaboardi
Copy link
Member

@TaylorOshan I very much apologize for the screw up. I have never seen this happen with master->main switch. Good news is that @tdhoffman's branch still has all the work. Shall we create a new PR from that?

@lanselin
Copy link
Member

I say no. Neither Pedro nor I like this approach. As long as I am bdfl for spreg, I nix it. The sklearn interface is built for prediction, spreg is mostly about inference, so it actually turns out to be very awkward.

@lanselin
Copy link
Member

Just to make clear, I would like it to be closed. We are done with that.

@pedrovma
Copy link
Member

The many additional combinations of specifications introduced by v1.4 make this outdated. As @lanselin pointed out, these formulas can't address them all easily, especially when combined with regimes and/or seemingly unrelated regressions. I believe the suggestion made by @martinfleis in #127 is a more interesting approach to simplify the function calls given the changes introduced by v1.4. Regardless, adding other functionalities we have already mapped, such as additional tests and the computation of direct/indirect effects, seems to me to be more of a priority now than enhancing the arguments' structure.

@TaylorOshan
Copy link
Contributor

Understood @lanselin. Thanks for the additional feedback @pedrovma. My intention was not to advocate for this in spreg. There were several conversations during the GSOC period about formulas more generally and in the context of other modules and I was curious if this would be worth preserving as a proof-of-concept in case it is useful elsewhere in the future. Thanks to @jGaboardi for pointing out that the original branch is in tact and takes care of that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants