Pull request to fix bandwidth bug in issue 597 #1629

Merged
merged 1 commit into statsmodels:master from Padarn:bug597 on May 23, 2014


4 participants

@Padarn
Contributor
Padarn commented Apr 23, 2014

I have added a simple flag to check whether a user bandwidth is given, and if so just return it instead of computing a new one.

This is in relation to issue #597

@josef-pkt
Member

I'm not really a big fan of having the explicitly-given check now in both methods https://github.com/Padarn/statsmodels/blob/bug597/statsmodels/nonparametric/_kernel_base.py#L129, but it has the advantage that we don't need to change it in 4 __init__ methods.

Maybe we should add another private method that selects which of (1) the user-given array, (2) _compute_bw, or (3) _compute_efficient should be used; see the sketch below.

You could use your gist example as a unit test case.
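For illustration, such a dispatcher might look roughly like this (a minimal sketch only; the method name, the `self.efficient` attribute, and the dispatch order are assumptions, not the actual statsmodels code):

```python
import numpy as np

def _select_bw(self, bw):
    # Hypothetical single place on the base class that decides how the
    # bandwidth is obtained:
    # (1) an explicitly given bandwidth is returned as-is,
    # (2) otherwise use the efficient (subset-based) estimate,
    # (3) or the standard full-sample computation.
    if not isinstance(bw, str):
        self._bw_method = "user-specified"
        return np.asarray(bw)
    elif self.efficient:
        return self._compute_efficient(bw)
    else:
        return self._compute_bw(bw)
```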

@Padarn
Contributor
Padarn commented Apr 24, 2014

I agree, I don't think it is a very nice solution, but it is a temporary fix. It would be nice to add a new private function, but I think _compute_efficient should then be changed so that the selected plug-in method is actually used (at the moment it is hard-coded as 'normal_reference', I believe).

@jseabold jseabold closed this Apr 29, 2014
@jseabold jseabold reopened this Apr 29, 2014
@jseabold
Member

Ping Travis-CI to see what's going on with the stalled test.

@Padarn
Contributor
Padarn commented Apr 30, 2014

I am in the process of reorganising this a bit, but I'm confused by some of what I'm seeing.

In kernel_reg, if you select cv_ls it uses the cv_loo function, rather than the cv_ls in _kernel_base.py.

Do either of you know what is going on here? If not, would you be opposed to me cleaning it up so it is more consistent between the two classes?

@josef-pkt
Member

I would be careful about reorganizing with insufficient unit test coverage and without being sure what the code does and is supposed to do.

You can also compare with the description of the np package http://cran.r-project.org/web/packages/np/index.html, which I think is the reference for large parts of this code.

From a brief look, KernelReg.cv_loo is the least-squares criterion (based on leave-one-out) specific to regression, while GenericKDE._cv_ls uses the imse defined by the subclasses that do density estimation.

KernelReg subclasses GenericKDE, but not all methods for density estimation will be appropriate for estimating the regression function.
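For reference, the leave-one-out least-squares criterion for regression is, schematically (a toy sketch using a local-constant Nadaraya-Watson fit with a Gaussian kernel; this is not the statsmodels implementation):

```python
import numpy as np

def loo_least_squares_cv(x, y, bw):
    # CV(h) = (1/n) * sum_i (y_i - ghat_{-i}(x_i))**2, where ghat_{-i} is the
    # kernel regression estimate at x_i computed with observation i left out.
    n = len(y)
    resid_sq = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        u = (x[mask] - x[i]) / bw
        w = np.exp(-0.5 * u**2)                   # kernel weights, obs i excluded
        resid_sq[i] = (y[i] - np.dot(w, y[mask]) / w.sum()) ** 2
    return resid_sq.mean()
```

Minimizing a criterion of this form over the bandwidth is what the regression-specific cv_loo does, whereas the density classes minimize an integrated squared error criterion instead.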

@Padarn
Contributor
Padarn commented Apr 30, 2014

Okay, fair point. I'll make sure I back any changes up with unit tests if I do change something.

Right, I missed that imse obviously doesn't make sense for regression, although I think cv_ls is still using a leave-one-out approach to calculate it. I wouldn't be suggesting any real functional change, just a reorganisation so the classes are a little more consistent with each other.

@josef-pkt
Member

@Padarn It would be very good if you can work on this.

In my experience it is best to start from the user examples and the unit tests and look at specific problems.
Some internal code reorganization and cleanup will be necessary, but without test coverage this might break other untested parts, given that we only partially understand what the code is supposed to be doing.

Also try to make smaller, "digestible" pull requests so the PR reviewer is not overwhelmed by the changes.
For example, this PR fixes a specific bug and can be merged; some unit tests for it would be helpful.

@Padarn
Contributor
Padarn commented Apr 30, 2014

Okay, thanks Josef, I'll follow that plan. I'll add unit tests to this tomorrow and then work on identifying small chunks that can be changed.

@josef-pkt
Member

@Padarn are you working specifically on kernel regression?

Having looked at it again, I have some ideas about #602: how to add a results class and reorganize some methods.
But it's not a small undertaking; it affects the fit of 4 classes with insufficient test coverage.
To get started, only predict (which is currently done by fit) needs to have full test coverage.

One problem might be finding a test case for each model; I don't remember whether the sandbox classes have any examples.

@Padarn
Contributor
Padarn commented May 1, 2014

No, not specifically kernel regression. I was hoping to make the backend of the bandwidth selection consistent across the various kernel implementations. I'm certainly open to suggestions though. I'll finish this PR and then have a closer look at #602.

@Padarn
Contributor
Padarn commented May 1, 2014

I've added a unit test for each of KDEMultivariate, KernelReg and KernelCensoredReg which just makes sure that if the bw is given explicitly and the efficient option is chosen, it is still set as the bandwidth. These tests repeat each other a bit, but at least they cover all of the options.
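A minimal sketch of what one of those tests can look like (the data and bandwidth value here are made up, and the import paths reflect my reading of the module layout; the actual tests in the PR may be organized differently):

```python
import numpy as np
from statsmodels.nonparametric.kernel_density import KDEMultivariate
from statsmodels.nonparametric._kernel_base import EstimatorSettings

def test_user_bw_kept_with_efficient():
    # If an explicit bandwidth is passed and the efficient option is chosen,
    # the estimator should keep the user-specified value rather than
    # re-estimating it on subsets (the bug fixed in this PR).
    np.random.seed(12345)
    x = np.random.normal(size=200)
    bw_user = [0.23]
    dens = KDEMultivariate(data=[x], var_type='c', bw=bw_user,
                           defaults=EstimatorSettings(efficient=True))
    np.testing.assert_allclose(dens.bw, bw_user)
```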

@coveralls

Coverage Status

Coverage remained the same when pulling 7b36475 on Padarn:bug597 into a6a927c on statsmodels:master.

@josef-pkt josef-pkt and 1 other commented on an outdated diff May 1, 2014
statsmodels/nonparametric/_kernel_base.py
References
----------
See p.9 in socserv.mcmaster.ca/racine/np_faq.pdf
"""
+
+ if isinstance(bw, (np.ndarray, list)):
+     self._bw_method = "user-specified"
+     return bw
@josef-pkt
josef-pkt May 1, 2014 Member

Can bw be a scalar?
Then reversing the check to test for isinstance of string, as in the other calculate_bw, would be better.

@Padarn
Padarn May 1, 2014 Contributor

You're right, I'll change this.
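A sketch of the reversed check (so scalars as well as lists/arrays count as user-specified; this is only the shape of the suggested change, not the final code, and it glosses over Python 2 string types, which the code at the time also had to accept):

```python
import numpy as np

def _compute_efficient(self, bw):
    # Treat anything that is not a string method name as an explicitly
    # given bandwidth and return it unchanged (scalar, list, or array).
    if not isinstance(bw, str):
        self._bw_method = "user-specified"
        return np.asarray(bw)
    # ... otherwise continue with the efficient (subset-based) estimation
```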

@josef-pkt
Member

Looks good, Travis CI is all green.
Reorganizing or streamlining the unit tests can wait; it's still fine.

@coveralls

Coverage Status

Coverage remained the same when pulling a9b74c6 on Padarn:bug597 into a6a927c on statsmodels:master.

@josef-pkt
Member

http://cran.r-project.org/web/packages/np/vignettes/np.pdf sections 8 and 9 have examples and show how to run them in R.
They should be close to ours, but I don't know whether the options are the same or which ones they are.

It might be better just to grab their example as a test case.

@Padarn
Contributor
Padarn commented May 1, 2014

I had seen those, but they take data directly from the MASS package. I couldn't immediately see whether it would be fine to use it.

@josef-pkt
Member

wage2 looks public domain, originally from the NLS (National Longitudinal Survey).

WAGE2.RAW
Source: M. Blackburn and D. Neumark (1992), “Unobserved Ability, Efficiency Wages, and
Interindustry Wage Differentials,” Quarterly Journal of Economics 107, 1421-1436.
Professor Neumark kindly provided the data, of which I used a subset.
Used in Text: pages 65-66, 106, 112, 164-165, 212, 214, 252-253, 297-299, 320, 491, 505, 518, 521-522, 648
http://www.jstor.org/stable/2118394?seq=4

@josef-pkt
Member

Wrong wage data (that's male only).

wage1 is used on p. 222 of the 2003 edition, which I have by chance.
The Current Population Survey is also public domain.

WAGE1.RAW
Source: These are data from the 1976 Current Population Survey, collected by Henry Farber
when he and I were colleagues at MIT in 1988.
Used in Text: pages 7, 38, 76-77, 93, 123-124, 180, 190-192, 214, 222-223, 226-228, 232, 235,
254, 260-261, 311, 648

@Padarn
Contributor
Padarn commented May 1, 2014

I'm also using birthwt (for the single index model). I can't find anything conclusive, but the MASS documentation suggests it should be fine: http://cran.r-project.org/web/packages/MASS/MASS.pdf

@Padarn
Contributor
Padarn commented May 1, 2014

There are a few issues with the R functions having different parameter settings, etc. Is it okay to submit a partially completed PR so comments can be made directly on what already exists?

@josef-pkt
Member

Yes, open a WIP PR.

@josef-pkt
Member

If you run into problems, write some regression tests first; that is, just use the current numbers for fit/predict as the desired results, as in the sketch below.

I wanted to say use the bandwidth and other parameters from R as starting values, but that doesn't work for SemiLinear because there is no way to fix the bandwidth or give starting values in SemiLinear._est_b_bw.
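One way such a regression test could be set up (a sketch only, not part of this PR; the data are synthetic and the pinned numbers would have to be copied from a run of the current code):

```python
import numpy as np
from statsmodels.nonparametric.kernel_regression import KernelReg

def test_kernelreg_fit_regression_numbers():
    # Pin the current fit() output so later refactoring can be checked
    # against it, independent of whether the numbers are "right".
    np.random.seed(12345)
    x = np.random.uniform(-2, 2, size=100)
    y = x**2 + 0.3 * np.random.normal(size=100)
    model = KernelReg(endog=[y], exog=[x], var_type='c', bw='cv_ls')
    mean, mfx = model.fit()
    assert mean.shape == (100,)
    # To pin values: run once, record e.g. mean[:5], and assert against it:
    # expected = np.array([...])
    # np.testing.assert_allclose(mean[:5], expected, rtol=1e-8)
```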

@josef-pkt
Member

I just realized these comments are on the finished PR, not on #602

@josef-pkt josef-pkt and 1 other commented on an outdated diff May 1, 2014
statsmodels/nonparametric/_kernel_base.py
@@ -169,10 +169,27 @@ def _compute_efficient(self, bw):
dividing ``nobs`` into as many ``n_sub`` blocks as needed (if
`randomize` is False).
+    Parameters
+    ----------
+    bw : array_like or str
+        If array_like: user-specified bandwidth.
+        If a string, this parameter is ignored.
+
+    Notes
+    -----
+    Currently if the bw parameter is given as a string it is ignored.
@josef-pkt
josef-pkt May 1, 2014 Member

I think this is not correct; a string bw is used for the subsets, and can be whatever option the underlying class allows.
AFAICS, we use the chosen bw method in each subset in _compute_subset and then use the mean or median for the scale estimates, which is then used in the returned bw.

@Padarn
Padarn May 1, 2014 Contributor

Right, it isn't ignored. Perhaps the note should be `Selected bw method is used to estimate the scale parameter for the plug-in bandwidth.` I think it is still true that if you specify a cross-validation bandwidth you get a plug-in estimator rather than the cross-validation estimator?

@josef-pkt
josef-pkt May 1, 2014 Member

It goes back and forth a few times between the _kernel_base and KernelReg methods, but ends up in _compute_reg_bw,
which then minimizes either cv_ls=self.cv_loo or aic=self.aic_hurvich.
Quite a detour.

So bw='cv_ls' means we run the leave-one-out loop cv_loo. If efficient is true, we run the LOO loop for each subset.

@Padarn
Padarn May 1, 2014 Contributor

To me it looks like it depends on return_only_bw, which is False by default. If it is False, the bandwidths from _compute_reg_bw don't look to be used; instead just the estimate of the dispersion is used to calculate a bandwidth in lines 221-223 of _kernel_base.py?

@Padarn
Padarn May 1, 2014 Contributor

I just tested this, and it doesn't look like the estimated bandwidths from _compute_reg_bw are actually used by default (you could get them by specifying return_only_bw=True in the estimator settings).

I might just be totally mixed up here, but I'm not sure _compute_efficient is really doing the right thing.

@josef-pkt
josef-pkt May 1, 2014 Member

Difficult to get through this.
You're right, bw is not used directly if return_only_bw=True.
However, in _kernel_base line 95 we have sample_scale_sub = sub_model.bw / fct;
if return_only_bw=False then we use sample_scale_sub, which is bw corrected for a scale estimate,
AFAICS.

@josef-pkt
josef-pkt May 2, 2014 Member

Maybe a way to check the subset treatment when efficient is True is to use repeated subsets of the original data (np.tile), or to try n_sub = 1 (in the defaults). But I'm not sure it doesn't break if there is no variation across subsets, or if there is only one.

@Padarn
Padarn May 2, 2014 Contributor

Ah yes, true. So the cross-validation bandwidths are used to estimate the coefficient for each plug-in bandwidth, and then the median of these is used for the full sample.

Perhaps the best thing to do for now is to leave out the comment (at the base of this discussion) and perhaps come back to this after we have more test coverage.
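A stripped-down sketch of that mechanism as it reads from _kernel_base (the function name, the Scott-type n**(-1/(d+4)) factor, and the random subsetting are simplifications and assumptions, not the exact statsmodels code):

```python
import numpy as np

def efficient_bw_sketch(data, n_sub, n_blocks, compute_bw, use_median=True):
    # Estimate the bandwidth on each subset with the chosen method (e.g. cv_ls),
    # divide out the subset sample-size factor to get a scale estimate,
    # take the median (or mean) scale across subsets, and rescale it
    # to the full sample size to obtain the returned bandwidth.
    nobs, dim = data.shape
    fct_sub = n_sub ** (-1.0 / (dim + 4))
    scales = []
    for _ in range(n_blocks):
        idx = np.random.choice(nobs, size=n_sub, replace=False)
        bw_sub = compute_bw(data[idx])          # e.g. a cross-validation estimate
        scales.append(bw_sub / fct_sub)         # sample_scale_sub = sub_model.bw / fct
    scale = np.median(scales, axis=0) if use_median else np.mean(scales, axis=0)
    return scale * nobs ** (-1.0 / (dim + 4))   # bandwidth for the full sample
```

On this reading, the cross-validation step only feeds the scale (the plug-in coefficient), and the final bandwidth comes from rescaling the median scale to the full sample.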

@josef-pkt
josef-pkt May 2, 2014 Member

> Perhaps the best thing to do for now is to leave out the comment (at the base of this discussion) and perhaps come back to this after we have more test coverage.

Yes, from the plots it looks like it "works", but it will take some time, unit tests and examples to figure out how this code works and where we might be able to simplify the structure.

@Padarn
Padarn May 2, 2014 Contributor

Okay, that is what I will do then.

@Padarn
Contributor
Padarn commented May 1, 2014

I've opened a WIP pull request for the new unit tests at #1652

@Padarn
Contributor
Padarn commented May 2, 2014

This PR is ready to merge from my point of view when you are happy with it.

@josef-pkt
Member

I'm trying to catch up with this (I got distracted halfway through "bugweek").

I think this would look better when it is squashed into one commit, since it has several changes that were reverted.

@Padarn Do you want to try to do an interactive rebase and squash, or should I?

@Padarn
Contributor
Padarn commented May 22, 2014

I'll give it a go; it's good to do these things properly.

Just to be clear: I would do a local rebase, squash, and then a force push?

@josef-pkt
Member

Yes, you can squash during an interactive rebase. To be safe and be able to try several times, it's better to back up the branch or work on a copy (a new branch from this one).
(I needed to do this a lot before I figured out how to do it without too many mistakes.)

@Padarn Padarn Added flag to stop _compute_efficient from calculating bandwidth when user given bandwidth is present

adding unit tests
5ec15dd
@Padarn
Contributor
Padarn commented May 23, 2014

Okay, I think I've done that right. The commit message is a bit long, but I think everything is as it should be.

@josef-pkt
Member

Yes, looks good: one clean commit, and the network tree also looks good.

I'll merge in an hour or two.

@josef-pkt josef-pkt merged commit 3adaa1a into statsmodels:master May 23, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
@coveralls

Coverage Status

Coverage increased (+0.09%) when pulling 5ec15dd on Padarn:bug597 into d27b085 on statsmodels:master.
