Silverman enhancement - Issue #1243 #1271

Merged
merged 12 commits into statsmodels:master from Padarn:silverman_enhancement on Dec 30, 2013


@Padarn
Contributor
Padarn commented Dec 27, 2013

This PR adds the kernel-dependent constants for the Silverman bandwidth, as discussed in Issue #1243.

I'm not sure about the style of the functions, which are only implemented for order-2 kernels right now.

I've added some tests, which check against the values given in the notes (only given to 2 decimal places). But really I just intended these as a placeholder for more useful bandwidth tests.

@coveralls

Coverage Status

Coverage remained the same when pulling e860186 on Padarn:silverman_enhancement into 8372093 on statsmodels:master.

@josef-pkt josef-pkt commented on an outdated diff Dec 27, 2013
statsmodels/nonparametric/bandwidths.py
+ Returns C * A * n ** (-1/5.) where ::
+
+ A = min(std(x, ddof=1), IQR/1.349)
+ IQR = np.subtract.reduce(np.percentile(x, [75,25]))
+ C = constant from Hansen (2009)
+
+ References
+ ----------
+
+ Silverman, B.W. (1986) `Density Estimation.`
+ Hansen, B.E. (2009) `Lecture Notes on Nonparametrics.`
+ """
+ C = kernel.silverman_constant
+ A = _select_sigma(x)
+ n = len(x)
+ return C * A * n ** -.2
@josef-pkt
josef-pkt Dec 27, 2013 Member

n**(-0.2) would be clearer, but the only rule is no spaces around power **

@josef-pkt josef-pkt commented on an outdated diff Dec 27, 2013
statsmodels/nonparametric/bandwidths.py
@@ -51,7 +53,7 @@ def bw_scott(x):
n = len(x)
return 1.059 * A * n ** -.2
-def bw_silverman(x):
+def bw_silverman(x, kernel):
@josef-pkt
josef-pkt Dec 27, 2013 Member

add the Gaussian kernel as the default?

@josef-pkt josef-pkt commented on an outdated diff Dec 27, 2013
statsmodels/nonparametric/bandwidths.py
@@ -104,18 +150,18 @@ def select_bandwidth(x, bw, kernel):
are supported
kernel : not used yet
- Returns
+ returns
@josef-pkt
josef-pkt Dec 27, 2013 Member

Capital-R "Returns" is the docstring standard.

@josef-pkt josef-pkt and 1 other commented on an outdated diff Dec 27, 2013
statsmodels/sandbox/nonparametric/kernels.py
@@ -354,6 +357,42 @@ def kernel_var(self):
self.domain[1])[0]
return self._kernel_var
+ def moments(self, n):
+
+ if not n == 2:
+ msg = "Only second moment currently implemented"
+ raise NotImplementedError(msg)
@josef-pkt
josef-pkt Dec 27, 2013 Member

for n=1, I think we have mean=0 for all current kernels because they are symmetric.

@Padarn
Padarn Dec 27, 2013 Contributor

I was a bit worried there might be a use for a non-zero-mean kernel at some point. I guess it seems fairly unlikely.
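
As a quick numerical illustration of the point above, a minimal sketch using plain scipy integration with a standard Epanechnikov kernel (not the CustomKernel machinery from this PR):

```python
from scipy.integrate import quad

# Standard second-order Epanechnikov kernel on [-1, 1].
epa = lambda u: 0.75 * (1 - u**2)

m1 = quad(lambda u: u * epa(u), -1, 1)[0]     # ~0: first moment vanishes by symmetry
m2 = quad(lambda u: u**2 * epa(u), -1, 1)[0]  # 0.2: the second moment used for the constant
print(m1, m2)
```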

@josef-pkt josef-pkt and 1 other commented on an outdated diff Dec 27, 2013
statsmodels/sandbox/nonparametric/kernels.py
+ R = kernel roughness (square of L^2 norm)
+
+ Note: L2Norm property returns square of norm.
+ """
+ nu = self._order
+
+ if not nu == 2:
+ msg = "Only implemented for second order kernels"
+ raise NotImplementedError(msg)
+
+ if self._silverman_constant is None:
+ C = np.pi**(.5) * factorial(nu)**3 * self.L2Norm
+ C = C/(2*nu * factorial(2*nu) * self.moments(nu)**2)
+ C = 2*C**(1.0/(2*nu+1))
+ self._silverman_constant = C
+
@josef-pkt
josef-pkt Dec 27, 2013 Member

since they are constants that never change, we could hardcode the explicit numbers in the individual kernels based on this calculation
(a small saving if we create many kernel instances; not sure it's relevant)

@Padarn
Padarn Dec 27, 2013 Contributor

Yeah, I see that. The only reason I didn't hardcode them was that I wanted some sort of nominal test against the values given in the reference. My preference is to leave it this way, but I'm happy to hardcode them if you think it's worth it.

@josef-pkt josef-pkt commented on an outdated diff Dec 27, 2013
statsmodels/sandbox/nonparametric/kernels.py
+ C = 2((pi^(1/2)*(nu!)^3 R(k))/(2nu(2nu)!kap_nu(k)^2))^(1/(2nu+1))
+ nu = kernel order
+ kap_nu = nu'th moment of kernel
+ R = kernel roughness (square of L^2 norm)
+
+ Note: L2Norm property returns square of norm.
+ """
+ nu = self._order
+
+ if not nu == 2:
+ msg = "Only implemented for second order kernels"
+ raise NotImplementedError(msg)
+
+ if self._silverman_constant is None:
+ C = np.pi**(.5) * factorial(nu)**3 * self.L2Norm
+ C = C/(2*nu * factorial(2*nu) * self.moments(nu)**2)
@josef-pkt
josef-pkt Dec 27, 2013 Member

this can be an in-place division, C /= ...
spaces around *
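
As a sanity check on the formula quoted in the diff above, here is a small self-contained sketch that recomputes the constant for the second-order Gaussian kernel by numerical integration (it deliberately avoids the statsmodels kernel classes, so all names here are local):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import factorial

# Gaussian kernel, its roughness R(k) = int k(u)^2 du, and second moment kappa_2.
k = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
nu = 2
R = quad(lambda u: k(u)**2, -np.inf, np.inf)[0]
kap_nu = quad(lambda u: u**nu * k(u), -np.inf, np.inf)[0]

# C = 2 * ((pi^(1/2) * (nu!)^3 * R) / (2*nu * (2*nu)! * kappa_nu^2))^(1/(2*nu + 1))
C = np.pi**0.5 * factorial(nu)**3 * R
C /= 2 * nu * factorial(2 * nu) * kap_nu**2
C = 2 * C**(1.0 / (2 * nu + 1))
print(C)  # ~1.0592, consistent with the 1.059 constant used in bw_scott
```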

@josef-pkt
Member

looks good, Thanks Padarn

AFAICS, I think we don't need both bw_silverman and bw_silverman_kernel.
I would just add the kernel option to bw_silverman with gaussian as default, so we get the same number as before when we don't specify a kernel.

I'm curious whether it improves some of the default plots for kernels that have a Silverman constant larger than or different from the normal one. Cosine?

@Padarn
Contributor
Padarn commented Dec 27, 2013

Actually, I think if you make gaussian the default, the constant works out to be different from what you had. It was set at 0.9, and it is now 1.06 for the gaussian. I'm not sure why it was 0.9. Do you have the Silverman reference? Hopefully changing the default doesn't break any tests.

Good idea about seeing if it improves the plots - maybe I'll add an example comparing the two.

@josef-pkt
Member

I don't know what's going on: we have 1.06 in Scott's rule, but 0.9 in Silverman. From Bruce Hansen's notes I thought Silverman should be 1.06.

http://stat.ethz.ch/R-manual/R-patched/library/stats/html/bandwidth.html has the same, but Wikipedia has Silverman at 1.06: http://en.wikipedia.org/wiki/Kernel_density_estimation#Practical_estimation_of_the_bandwidth

It might take some time before I can look into this. (I'll be back from skiing in 5 days.)
In this case it's better to keep bw_silverman_kernel separate until we figure out which is which.

@Padarn
Contributor
Padarn commented Dec 27, 2013

Sure. I'll add a few examples and clean up the formatting problems that you pointed out and then the pull request can wait.

@josef-pkt
Member

Haerdle also refers to Silverman's rule of thumb with 1.06
http://sfb649.wiwi.hu-berlin.de/fedc_homepage/xplore/ebooks/html/spm/spmhtmlnode15.html#rulathumb

I don't think I have access to Silverman's or Scott's book

@josef-pkt
Member

OK, this looks like an explanation:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.156.5202&rep=rep1&type=pdf (section 3.1).
Silverman derived 1.06 but "recommended" 0.9.
In SAS terminology: "simple normal reference" versus "Silverman's rule of thumb".

SAS also has "oversmoothed" at 1.144
It doesn't say where Scott's rule comes in.

@Padarn
Contributor
Padarn commented Dec 27, 2013

Right, so I suppose 0.9 is his recommendation because 1.06 is based on an asymptotic argument. Perhaps the 0.9 should be an extra option in that case?

@josef-pkt
Member

I assume the 0.9 recommendation just lowers the value for the gaussian kernel, and we don't know how to lower the value for the other kernels in the same way.

What I think now is that we should really add a new method as you have it, but with a different name ("normal reference"?), which is already an option or mentioned in a docstring somewhere IIRC.

And leave Scott's and Silverman's rules just as specific functions that have numbers with "name recognition"?

@Padarn
Contributor
Padarn commented Dec 27, 2013

You're right. 'normal_reference' is used in the Multivariate kernel.

Maybe it's worth putting all the bandwidth stuff together? The 'compute bandwidth' code in GenericKDE could easily be moved elsewhere.

For now should I just add 'normal_reference' as an option for the 1D kernels?

@coveralls

Coverage Status

Coverage remained the same when pulling 96c9b7f on Padarn:silverman_enhancement into 8372093 on statsmodels:master.

@josef-pkt
Member

For now should I just add 'normal_reference' as an option for the 1D kernels?

Yes, KDEUnivariate should have access to the new kernel-dependent Silverman bandwidth as normal_reference.

(I can look at it again tonight.)

@Padarn
Contributor
Padarn commented Dec 28, 2013

Yep, that is the option available now. To get this to work properly with KDEUnivariate, I had to add a line to the 'fit' routines so that 'kernel' is a kernel object and not a string. This is a minor change; actually, it occurs to me that this could go into the bandwidth function.

@coveralls

Coverage Status

Coverage remained the same when pulling edf3792 on Padarn:silverman_enhancement into 8372093 on statsmodels:master.

@josef-pkt josef-pkt and 1 other commented on an outdated diff Dec 28, 2013
statsmodels/examples/ex_kde_normalreference.py
+kde = npar.KDEUnivariate(x)
+
+
+kernel_names = ['Gaussian', 'Epanechnikov', 'Biweight',
+ 'Triangular', 'Triweight', 'Cosine'
+ ]
+
+kernel_switch = ['gau', 'epa', 'tri', 'biw',
+ 'triw', 'cos'
+ ]
+
+fig = plt.figure()
+for ii, kn in enumerate(kernel_switch):
+
+ ax = fig.add_subplot(2, 3, ii + 1) # without uniform
+ ax.hist(x2, bins=40, normed=True, alpha=0.25)
@josef-pkt
josef-pkt Dec 28, 2013 Member

In the plot you have x2 from the large sample, but kde from the small sample

plots look good.

@Padarn
Padarn Dec 28, 2013 Contributor

Yeah, actually this was on purpose. I thought it would be good to have a faithful representation of the true distribution, and I couldn't immediately see a good way to get the true pdf.

I can get rid of x2 altogether if you prefer.

@josef-pkt
josef-pkt Dec 28, 2013 Member

It looked misleading to me because from the histogram it's not visible where the two bumps in the left distribution come from. With the x histogram it is clearer that the oversmoothing matches the two groups of sample points.

When I run your example with the 4000 points (which is slow) it shows that the corrected rules of thumb still oversmooth, as expected for mixtures.

I sent a plot to the mailing list that has only 0.05 weight on the lower, left mixture to be closer to the normal reference case.

You could get the true pdf from
pdf(x_grid) = 0.25 * stats.norm.pdf(x_grid, loc=loc1, scale=scale1) + 0.75 * stats.norm.pdf(x_grid, loc=loc2, scale=scale2)
using from scipy import stats.

I thought we added the pdf to the mixture in statsmodels.distributions, maybe not.

(Otherwise, I'm too exhausted today to think about or review details.)
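
A minimal sketch of that suggestion; the mixture weights, locations, and scales below are placeholders rather than the values used in the example script:

```python
import numpy as np
from scipy import stats

# Placeholder mixture parameters; the example script defines its own.
loc1, scale1 = -2.0, 0.5
loc2, scale2 = 1.0, 1.0

x_grid = np.linspace(-5, 5, 400)
true_pdf = (0.25 * stats.norm.pdf(x_grid, loc=loc1, scale=scale1)
            + 0.75 * stats.norm.pdf(x_grid, loc=loc2, scale=scale2))
# ax.plot(x_grid, true_pdf) would overlay the true density on each panel.
```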

@Padarn
Padarn Dec 28, 2013 Contributor

Oops, good point. I'll change this so that it has the true pdf and the reference histogram for x (not x2).

@Padarn
Padarn Dec 28, 2013 Contributor

I've changed the mixture distribution a little to showcase the normal_reference bandwidth better; I added a note in the heading saying that I've chosen it with this in mind. It also now shows the true pdf.

@josef-pkt
Member

To get this to work properly with KDEUnivariate, I had to add a line to the 'fit' routines so that 'kernel' is the kernel and not a string.

Did you change this? I don't see any changes with fit.
Your current way looks pretty backwards compatible to me. I think it looks very good overall.
(I might nitpick on some details when I go over it again.)

@Padarn
Contributor
Padarn commented Dec 28, 2013

Did you change this? I don't see any changes with fit.

Yes, the change is in kdensity. See lines 348, 349, and 355. But yes, I think it is backwards compatible too.

@coveralls

Coverage Status

Coverage remained the same when pulling 7050daa on Padarn:silverman_enhancement into 8372093 on statsmodels:master.

@josef-pkt josef-pkt and 1 other commented on an outdated diff Dec 28, 2013
statsmodels/nonparametric/kde.py
@@ -345,11 +345,14 @@ def kdensity(X, kernel="gau", bw="scott", weights=None, gridsize=None,
weights = weights[clip_x.squeeze()]
q = weights.sum()
+ # Added so that kernel is a kernel object not a string
+ kern = kernel_switch[kernel]()
@josef-pkt
josef-pkt Dec 28, 2013 Member

The same is on line 365, which now looks redundant since it's already here?

@Padarn
Padarn Dec 28, 2013 Contributor

On line 365 it is

kernel_switch[kernel](h=bw)

So that the kernel is given its calculated bandwidth. But this could be replaced by

kern.setH(bw)
@josef-pkt
josef-pkt Dec 28, 2013 Member

OK, I got confused; if setH works, then that's better.
Otherwise it would also be possible to get the kernel class just once, kernel_class = kernel_switch[kernel], or something like that.

@Padarn
Padarn Dec 28, 2013 Contributor

Okay, now that I look at this I'm a bit concerned there is an error in kernels.py. The kernel's domain does not seem to be affected by the bandwidth (no matter how you set it); this can't be right.

@Padarn
Padarn Dec 28, 2013 Contributor

Hmm, actually maybe it is right. I'll just use seth then.

@josef-pkt
Member

OK, I see how it works; looks fine to me.
The kernel argument in select_bandwidth(x, bw, kernel) was unused before, AFAICS.

@josef-pkt
Member

One more interface change:
In KDEUnivariate.fit and in kdensity we could replace the default bw="scott" with the new "normal_reference". It should be backwards compatible for the gaussian kernel since the constant is the same, if I read this correctly. Then users will get the "correct" bw for other kernels automatically; even though that is not backwards compatible, we are fixing the rules of thumb.

@Padarn
Contributor
Padarn commented Dec 28, 2013

Sounds good. I'll add something to the docstring for fit too.
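
For illustration, a sketch of how the fit interface would then look from the user's side; the kernel string 'epa' and the 'normal_reference' option follow this PR, but the exact defaults may differ in the merged code:

```python
import numpy as np
import statsmodels.api as sm

np.random.seed(12345)
x = np.random.randn(200)

kde = sm.nonparametric.KDEUnivariate(x)
# The non-FFT path is needed for non-gaussian kernels; the bandwidth constant
# now comes from the kernel instead of the fixed gaussian 1.059 / 0.9 values.
kde.fit(kernel='epa', bw='normal_reference', fft=False)
print(kde.bw)
```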

@Padarn
Contributor
Padarn commented Dec 28, 2013

This commit is going to fail the Travis CI build: some of the newish tests you wrote for the kernels will now produce results different from those stored in the .csv files, because the default bandwidth of some kernels changes.

Unfortunately I don't have Stata to generate new values. Since nothing else has changed, could we assume that the currently generated values are correct and use them as replacements?

@josef-pkt
Member

Most tests can be kept by specifying an explicit bandwidth in
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/nonparametric/tests/test_kde.py#L141
In Stata I used the explicit bandwidth (from my Stata .do file):
0.51796217708858117, which is the silverman value from statsmodels.
In the other set of tests I explicitly used bw="silverman", and those tests don't fail.

For the kernel unit test we don't really need the new default bandwidth.

I don't know why
FAIL: statsmodels.nonparametric.tests.test_kernel_density.TestKDEUnivariate.test_weighted_pdf_non_fft
fails; maybe a different rounding of some constants?
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/nonparametric/tests/test_kernel_density.py#L95 might also need an explicit bandwidth.
However, I don't see why the other test, test_pdf_non_fft, doesn't fail; it looks to be using the same pattern as test_weighted_pdf_non_fft.

@Padarn
Contributor
Padarn commented Dec 29, 2013

Ah, oops. There was an error turning up for the Biweight, but it was just because the default bw had been changed.

None of those tests fail once I've changed the default to 'scott', which is rounded slightly differently. So as long as some tests are added for the bandwidth calculations, everything seems fine.

@Padarn Padarn closed this Dec 29, 2013
@Padarn Padarn reopened this Dec 29, 2013
@coveralls

Coverage Status

Coverage remained the same when pulling 2a97860 on Padarn:silverman_enhancement into 8372093 on statsmodels:master.

@josef-pkt
Member

Yes, scott looks like what was used for the tests.

test_kernel_density.TestKDEUnivariate.test_weighted_pdf_non_fft
I have this test passing at roughly rtol=1e-15, much higher precision than the atol=1e-6 that is currently used.

For testing the bandwidth selection with the new non-gaussian normal_reference, I think we might not find any other package that has it, so we just add tests that compare with a direct calculation, or just a regression test. The constants you added to the kernels already have their own test.
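
A rough sketch of such a direct-calculation test, assuming the names from this PR (bw_normal_reference in bandwidths.py and the sandbox kernel classes); the rounded constant and tolerance are illustrative:

```python
import numpy as np
from numpy.testing import assert_allclose
from statsmodels.nonparametric import bandwidths
from statsmodels.sandbox.nonparametric import kernels


def test_bw_normal_reference_gaussian():
    np.random.seed(12345)
    x = np.random.randn(100)
    # Direct calculation: C * A * n**(-0.2) with the Gaussian constant ~1.0592
    # and A = min(std, IQR/1.349), mirroring _select_sigma.
    iqr = (np.percentile(x, 75) - np.percentile(x, 25)) / 1.349
    A = min(x.std(ddof=1), iqr)
    expected = 1.0592 * A * len(x) ** (-0.2)
    bw = bandwidths.bw_normal_reference(x, kernels.Gaussian())
    assert_allclose(bw, expected, rtol=1e-3)
```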

@josef-pkt
Member

Otherwise, this PR looks pretty much finished to me. Do you still see any open questions or problems?

@josef-pkt josef-pkt commented on an outdated diff Dec 29, 2013
statsmodels/nonparametric/bandwidths.py
@@ -49,9 +53,9 @@ def bw_scott(x):
"""
A = _select_sigma(x)
n = len(x)
- return 1.059 * A * n ** -.2
+ return 1.059 * A * n **(-0.2)
@josef-pkt
josef-pkt Dec 29, 2013 Member

extra space between n and **

@Padarn
Contributor
Padarn commented Dec 29, 2013

No, I think I'm done. I'll just add a regression test and fix any more typos (as I see you're uncovering them).

I was also thinking of adding the Sheather & Jones bandwidth, but I think that would be best in a separate PR.

@Padarn
Contributor
Padarn commented Dec 29, 2013

Actually, since everything other than the bandwidth calculation is the same as in other tests, perhaps just a regression test that calculates the various bandwidths for a random sample?

@josef-pkt josef-pkt commented on the diff Dec 29, 2013
statsmodels/nonparametric/bandwidths.py
+ x : array-like
+ Array for which to get the bandwidth
+ kernel : CustomKernel object
+ Used to calcualate the constant for the plug-in bandwidth.
+
+ Returns
+ -------
+ bw : float
+ The estimate of the bandwidth
+
+ Notes
+ -----
+ Returns C * A * n ** (-1/5.) where ::
+
+ A = min(std(x, ddof=1), IQR/1.349)
+ IQR = np.subtract.reduce(np.percentile(x, [75,25]))
@josef-pkt
josef-pkt Dec 29, 2013 Member

subtract.reduce in a docstring sounds strange; it takes some time and experience to understand. How about
q3 - q1, where q1 and q3 are the first and third quartiles?

@josef-pkt josef-pkt commented on an outdated diff Dec 29, 2013
statsmodels/nonparametric/bandwidths.py
@@ -79,7 +85,44 @@ def bw_silverman(x):
"""
A = _select_sigma(x)
n = len(x)
- return .9 * A * n ** -.2
+ return .9 * A * n **(-0.2)
+
+
+def bw_normal_reference(x, kernel=kernels.Gaussian):
+ """
+ Silverman's Rule of Thumb with constant calculated using specific kernel.
+ Currently only second order kernels are supported.
@josef-pkt
josef-pkt Dec 29, 2013 Member

Looks a bit redundant since we only have second-order kernels.

Move it to the Notes section so it's less prominent.

@josef-pkt josef-pkt and 1 other commented on an outdated diff Dec 29, 2013
statsmodels/nonparametric/bandwidths.py
@@ -79,7 +85,44 @@ def bw_silverman(x):
"""
A = _select_sigma(x)
n = len(x)
- return .9 * A * n ** -.2
+ return .9 * A * n **(-0.2)
+
+
+def bw_normal_reference(x, kernel=kernels.Gaussian):
+ """
+ Silverman's Rule of Thumb with constant calculated using specific kernel.
@josef-pkt
josef-pkt Dec 29, 2013 Member

I think we want to avoid using "Rule of Thumb" to avoid confusion with the 0.9 constant rule.
I'm not sure how to phrase it; maybe:
Silverman's normal reference bandwidth with kernel-specific constant

@Padarn
Padarn Dec 29, 2013 Contributor

How about 'Plug-in bandwidth with kernel-specific constant based on normal reference'? I'm not sure it's right to call it 'Silverman's' anymore.

@josef-pkt
josef-pkt Dec 29, 2013 Member

yes, sounds good

@josef-pkt josef-pkt commented on the diff Dec 29, 2013
statsmodels/nonparametric/bandwidths.py
+
+ Parameters
+ ----------
+ x : array-like
+ Array for which to get the bandwidth
+ kernel : CustomKernel object
+ Used to calcualate the constant for the plug-in bandwidth.
+
+ Returns
+ -------
+ bw : float
+ The estimate of the bandwidth
+
+ Notes
+ -----
+ Returns C * A * n ** (-1/5.) where ::
@josef-pkt
josef-pkt Dec 29, 2013 Member

I would add a sentence here first, saying that
the bandwidth is optimal (minimizes the expected integrated squared error?) if the true distribution is normal, and that this is an appropriate bandwidth for single-peaked distributions that are similar to the normal distribution,
or something along those lines.

@josef-pkt josef-pkt and 1 other commented on an outdated diff Dec 29, 2013
statsmodels/nonparametric/kde.py
@@ -107,6 +107,9 @@ def fit(self, kernel="gau", bw="scott", fft=True, weights=None,
`min(std(X),IQR/1.34)`
- "silverman" - .9 * A * nobs ** (-1/5.), where A is
`min(std(X),IQR/1.34)`
+ - "normal_reference" - C * A * nobs ** (-1/5.), where C is
+ calculated from the kernel. Same as "scott" for gaussian.
@josef-pkt
josef-pkt Dec 29, 2013 Member

Except that scott uses a constant that is rounded to 2 decimals.
We should mention this somewhere because, as in the test failure, we don't get the same result beyond a precision of 4 to 6 decimals.

@Padarn
Padarn Dec 29, 2013 Contributor

I'm adding a note in bandwidths.py saying that they are equivalent to 2 decimal places.

@josef-pkt
Member

Yes, other bandwidth options, like the ones R has, can go into a new PR.

Clean up a bit, and I'll just hit the merge button (tomorrow night).
Then you can start again from master with this in it.

@josef-pkt
Member

Just another "fun" idea:
Silverman's rule of thumb reduces the bandwidth by a factor of 0.9 / 1.06 = 0.8490566037735849.
Maybe the same factor can also be used with the other kernels. If the reduction is based on the normal reference, then it might be that the same reduction also applies to the other kernels.
When I played a bit with your example, it looked like the "silverman" bandwidth gets pretty close to the true mixture pdf. It's worth a try to see whether the reduction gets close to the mixture pdf for the other kernels too, at least in some bimodal cases.
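
A toy illustration of that idea (purely exploratory, not part of the PR); it assumes the bw_normal_reference function and sandbox kernel classes added here:

```python
import numpy as np
from statsmodels.nonparametric import bandwidths
from statsmodels.sandbox.nonparametric import kernels

np.random.seed(12345)
x = np.random.randn(500)

factor = 0.9 / 1.06  # Silverman's reduction, ~0.849
bw_epa = bandwidths.bw_normal_reference(x, kernels.Epanechnikov())
bw_epa_reduced = factor * bw_epa  # candidate "rule of thumb" for the Epanechnikov kernel
print(bw_epa, bw_epa_reduced)
```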

@Padarn
Contributor
Padarn commented Dec 29, 2013

The optimal bandwidth given the mixture could actually be found, since we know the underlying distribution. I might have a play with this.

@Padarn
Contributor
Padarn commented Dec 29, 2013

I added a test which calculates the various bandwidths for a Gaussian sample. I would rather set this up a bit differently, but I think it's better than nothing. I'll hopefully have time to work on this tomorrow.

@coveralls

Coverage Status

Coverage remained the same when pulling f6a5a1f on Padarn:silverman_enhancement into 8372093 on statsmodels:master.

@josef-pkt
Member

@Padarn So, do you still want to make changes before the merge? Add a comment when it's ready.

@Padarn
Contributor
Padarn commented Dec 30, 2013

No, I think I'll leave it for the next PR. I'll have a better idea of how to organize the tests once I've figured out what else to add.

Cheers

@josef-pkt josef-pkt merged commit 946021a into statsmodels:master Dec 30, 2013

@josef-pkt
Member

Merged, thanks Padarn, very nice enhancement.
