Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP ENH: optimize knot location, Segmented Regression #2677

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

josef-pkt
Copy link
Member

see #2634

Use case: This is a complementary method to penalized splines, where we want to carefully place a few knots instead of penalizing a large number of knots. The main advantage is in cases where we have clear breaks like in piecewise linear regression. This uses power splines and assumes that the estimated function is continuous but not smooth.

This is still largely "research code". I'm trying to figure out how to make this computationally efficient and robust in the optimization.

to the optimization: I guess we can have many local optima. My initial idea was to optimize a knot at a time, restricting it to be within the interval given by the neighboring knots. (article on knot selection in splines, ref?). Because of a mistake in the scipy documentation for brent I actually use the lower and the center point of the bracket which actually works better. Using fminbound to force the interval gives worse results.

One consequence/problem with brent is that it doesn't preserve the increasing order or knots.
It works well in the examples, but it doesn't "feel" robust.

Another option is to increase the number of knots sequentially. This also works well in the example, but we still have the unsorted knot problem.

TODO:

  • compatibility fixes: I'm working currently only with python 3.4 and numpy 1.9. Final version needs to have compatibility fixes.
  • no predict yet: We need to be able to create the spline/segment basis for an exog in prediction.
    Similar to stateful transform in patsy.
  • pandas integration, xnames, ...: currently extra data information is not transferred to created models
  • no model.__class__.__init__ keywords yet, should be relatively easy to add
  • Other exog: currently uses or assumes the full model: projecting on other exog would be computationally more efficient. (p-inverse of one block in partitioned matrix, instead of full p-inverse)
  • selecting number of knots: sequential addition of knots also provides, ssr, aic, ... for choosing the number of knots. A proper hypothesis test would have to take the optimization of knot location into account (multiple testing, sup test)

@josef-pkt
Copy link
Member Author

extra failure np.percentile doesn't seem to be vectorized in numpy 1.7

I get now the failure in from_model also in my original example script. It looks like a local minimum that shows up because of the default starting knots. The original example used different starting knots and went to the better local minimum.
I get the same bad local minimum if I use fminbound instead of brent. Either we need to use a global optimizer, or we use the sequential addition of knots, and add a drop method.

edit
I still have the problem that knots don't remain sorted, but I use the bracket for brent or bounds for fminbound as if they were sorted.
This is still a possible candidate for messing up convergence.

@josef-pkt
Copy link
Member Author

update to previous:
Using correct brackets for brent or fminbound (using searchsorted) does not make a difference to the local minimum. Adding an additional knot moves it to the optimal 4-knot solution.
Adding a knot checks adding it to every segment. (but adding knots also still assumes partially that knots/bounds are sorted)

edit
Everything seems to work fine after fixing some mistakes in using searchsorted bounds for brackets.
from_model also converges to good local minimum.

@josef-pkt
Copy link
Member Author

I just saw that R has a segmented package https://cran.r-project.org/web/packages/segmented/index.html
Several answers to "segmented" questions in stats.stackexchange refer to it.

@josef-pkt
Copy link
Member Author

AttributeError: 'list' object has no attribute 'copy' in python 2, including 2.7

I haven't decided yet whether to keep bounds as list inside the loops or switch everywhere to ndarrays

@josef-pkt
Copy link
Member Author

about R segmented: seems to have a similar API structure as I used
http://cran.utstat.utoronto.ca/doc/Rnews/Rnews_2008-1.pdf#page=20
the optimization there uses a gap variable (jump size) which should go to zero at convergence if connected linear segments are correctly specified. Non convergence might imply misspecification.

The package uses Davies test for the p-value for the Null of zero slope increment, but seems to use standard Wald values for the rest, e.g. confidence intervals (assuming break exists). (Nuisance parameter not identified under alternative.)
They nan the pvalue in the summary results for the slope increments.

standard error and confidence interval of knot location are based on Delta method with the gap parameters.


class Segmented(object):
"""class to search for a variable transformation in regression

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing Parameters section

@josef-pkt
Copy link
Member Author

Based on the docstring it's not very clear how to use this. rework API or document better.

https://github.com/josef-pkt/misc/blob/segmented/notebooks/ex_segmented_regression_sm.ipynb
the main limitation seems to be the missing predict, i.e. creating new splines given the estimated knots. (based on summary and todos at the end of the notebook)

@josef-pkt josef-pkt added this to the 0.9 milestone Jun 19, 2016
@josef-pkt
Copy link
Member Author

josef-pkt commented Jun 19, 2016

also currently only supports OLS (implicitly, no extra args and minimize ssr)

It might still be useful as is, because it's also possible to construct a regular model with possibly patsy linear splines with the knot locations found by the Segmented class.

aic, bic, and lr test might also be valid for choosing the number of knots, with the caveat that wald inference and standard errors don't take knot search into account.

(I guess LR test will not be a standard case, "parameter not identified under the alternative" ? I don't remember right now.)

@josef-pkt
Copy link
Member Author

rebased (merge conflict in compat/numpy easily resolved)

from what I remember this was ready to merge, just waiting for feedback

@josef-pkt
Copy link
Member Author

fails on python 2:

  File "/home/travis/miniconda2/envs/statsmodels-test/lib/python2.7/site-packages/statsmodels-0.8.0-py2.7-linux-x86_64.egg/statsmodels/base/_segmented.py", line 317, in add_knot
    bounds_all.append(bounds.copy())
AttributeError: 'list' object has no attribute 'copy'

@coveralls
Copy link

Coverage Status

Coverage increased (+0.03%) to 90.464% when pulling 017bf76 on josef-pkt:segmented_knots into 90e2af8 on statsmodels:master.

@josef-pkt
Copy link
Member Author

rebased, no merge conflicts

@coveralls
Copy link

Coverage Status

Coverage increased (+0.02%) to 81.909% when pulling 5cccb33 on josef-pkt:segmented_knots into 5b8f20d on statsmodels:master.

@josef-pkt
Copy link
Member Author

josef-pkt commented Aug 14, 2018

test failure python 2.7 incompatibility in add_knot

>           bounds_all.append(bounds.copy())
E           AttributeError: 'list' object has no attribute 'copy'

same comment already twice above

@josef-pkt josef-pkt removed this from the 0.9 milestone Feb 16, 2022
@josef-pkt josef-pkt added the superseded Used for PRs with content that is superseded by a newer PR label Feb 16, 2022
@josef-pkt
Copy link
Member Author

rebased version in #8124

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp-base comp-regression superseded Used for PRs with content that is superseded by a newer PR type-enh
Projects
josef-pulls
85percent
Development

Successfully merging this pull request may close these issues.

None yet

2 participants