Kde weights 1103 823 #1240

Merged
Merged 14 commits into statsmodels:master from josef-pkt:kde_weights_1103_823 on Dec 18, 2013


@josef-pkt
Member

Rebased and corrected version of @Padarn's PR #1103:

- needed to normalize the weights to sum to 1 for the test to pass with evaluate
- fixed the unit tests to test evaluate

I'm still looking into #1239 to see whether it should be added to this PR.

@coveralls

Coverage Status

Coverage remained the same when pulling 57deb65 on josef-pkt:kde_weights_1103_823 into 5cf4cad on statsmodels:master.

@josef-pkt josef-pkt commented on the diff Dec 16, 2013
statsmodels/nonparametric/tests/test_kernel_density.py
@@ -58,6 +58,62 @@ def setUp(self):
0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,
0, 0, 0, 0]
+ self.weights = np.random.random(nobs)
+
+
+class TestKDEUnivariate(MyTest):
@josef-pkt
josef-pkt Dec 16, 2013 Member

test_kde has the other KDEUnivariate tests
I think we should move it there. (A small problem: we don't have the random seed for the weights unless we also generate all the other random variables.)
I have saved the random values to a csv file for use with Stata.
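A minimal sketch of the reproducibility point raised above: fixing the seed makes the randomly drawn weights deterministic, so they can be saved once (e.g. to a CSV for cross-checking with Stata) and regenerated later. The seed and `nobs` below are illustrative values, not taken from the PR.

```python
import numpy as np

# Illustrative values; the actual test uses its own nobs and data.
nobs = 60
rng = np.random.RandomState(12345)   # fixed seed -> reproducible draws
weights = rng.random_sample(nobs)

# The same seed reproduces exactly the same weights later.
rng2 = np.random.RandomState(12345)
assert np.allclose(weights, rng2.random_sample(nobs))
```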

@josef-pkt josef-pkt commented on the diff Dec 16, 2013
statsmodels/nonparametric/tests/test_kernel_density.py
@@ -58,6 +58,62 @@ def setUp(self):
0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,
0, 0, 0, 0]
+ self.weights = np.random.random(nobs)
+
+
+class TestKDEUnivariate(MyTest):
+
+ def test_pdf_non_fft(self):
+
+ kde = nparam.KDEUnivariate(self.noise)
+ kde.fit(fft=False)
+
+
+ grid = kde.support
+ testx = [grid[10*i] for i in range(6)]
@josef-pkt
josef-pkt Dec 16, 2013 Member

Does testx need to be a list? `grid[10 * np.arange(6)]` would create an array directly.
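The suggestion can be sketched as follows; fancy indexing picks the same six grid points but returns an ndarray instead of a list (`grid` here is an illustrative stand-in for `kde.support`):

```python
import numpy as np

# Stand-in for kde.support, just for illustration.
grid = np.linspace(0.0, 1.0, 64)

testx_list = [grid[10 * i] for i in range(6)]  # list of scalars
testx_arr = grid[10 * np.arange(6)]            # ndarray, same values

assert np.allclose(testx_list, testx_arr)
```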

@josef-pkt josef-pkt commented on an outdated diff Dec 16, 2013
statsmodels/sandbox/nonparametric/kernels.py
@@ -185,7 +191,10 @@ def density(self, xs, x):
xs = xs[:,None]
if len(xs)>0:
h = self.h
- w = 1/h * np.mean(self((xs-x)/h), axis=0)
+ if self.weights is not None:
+ w = 1/h * np.sum(self((xs-x)/h).T * self.weights, axis=1)
@josef-pkt
josef-pkt Dec 16, 2013 Member

What are the possible shapes for xs? .T should work for 0-d, 1-d and 2-d.
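As a sanity check of the two branches in the diff, here is a sketch with a standard normal kernel and `xs` shaped `(n, 1)` (as after `xs[:, None]` in the method). With uniform weights summing to 1, the weighted-sum branch reduces to the unweighted mean; all values are illustrative.

```python
import numpy as np

def gauss(u):
    # standard normal kernel, stand-in for self(...)
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

rng = np.random.RandomState(0)
xs = rng.randn(50)[:, None]   # (n, 1), as in the density() method
x, h = 0.1, 0.5

# unweighted branch
w_unweighted = 1 / h * np.mean(gauss((xs - x) / h), axis=0)

# weighted branch with uniform weights that sum to 1
weights = np.ones(50) / 50
w_weighted = 1 / h * np.sum(gauss((xs - x) / h).T * weights, axis=1)

assert np.allclose(w_unweighted, w_weighted)
```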

@josef-pkt
josef-pkt Dec 17, 2013 Member

See the comment above: weights needs to be filtered to correspond to the xs selected by inDomain.

@josef-pkt
Member

Needs the fix from #1233 before adding tests for uniform.

@josef-pkt josef-pkt commented on the diff Dec 17, 2013
statsmodels/nonparametric/tests/test_kde.py
def test_density(self):
npt.assert_almost_equal(self.res1.density, self.res_density,
self.decimal_density)
+ def t_est_evaluate(self):
+ # disable test
+ # fails for Epan, Triangular and Biweight, only Gaussian is correct
+ # added it as test method to TestKDEGauss below
+ # inDomain is not vectorized
+ #kde_vals = self.res1.evaluate(self.res1.support)
@josef-pkt
josef-pkt Dec 17, 2013 Member

Looks like there is a bug somewhere in the non-Gaussian kernels, or I'm not using it correctly.

@coveralls

Coverage Status

Coverage remained the same when pulling 3179b37 on josef-pkt:kde_weights_1103_823 into 5cf4cad on statsmodels:master.

@josef-pkt josef-pkt commented on the diff Dec 17, 2013
statsmodels/sandbox/nonparametric/kernels.py
n = len(xs)
#xs = self.inDomain( xs, xs, x )[0]
if len(xs)>0: ## Need to do product of marginal distributions
#w = np.sum([self(self._Hrootinv * (xx-x).T ) for xx in xs])/n
#vectorized doesn't work:
- w = np.mean(self((xs-x) * self._Hrootinv )) #transposed
+ if self.weights is not None:
+ w = np.mean(self((xs-x) * self._Hrootinv).T * self.weights)/sum(self.weights)
@josef-pkt
josef-pkt Dec 17, 2013 Member

I think this is wrong for bounded-support kernels: xs will have only the points in the neighborhood (inDomain), so weights needs to be filtered/selected in the same way.

In the kde case, I think we can just use weights instead of ys (the second xs) in self.inDomain(xs, weights, x).
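The fix being described can be sketched with a boolean mask standing in for inDomain: whatever selection is applied to xs must be applied to weights as well, so each surviving point keeps its own weight. The mask and values below are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)
xs = rng.randn(100)
weights = rng.random_sample(100)
x, h = 0.0, 0.5

# Stand-in for inDomain: keep points inside the kernel support.
mask = np.abs(xs - x) <= h
xs_in = xs[mask]
weights_in = weights[mask]   # filtered the same way as xs

assert xs_in.shape == weights_in.shape
```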

@josef-pkt
josef-pkt Dec 17, 2013 Member

The comment above should apply below in CustomKernel; this here is NdKernel.
This version of using weights has a normalization by sum(weights).

@josef-pkt
Member

AFAICS, I fixed the bugs with bounded-support kernels in the last commit, 0d574e2.
We need to divide by the total number of observations, not take the mean of the kernel over the neighborhood points.

Also, inDomain is now used to filter the weights. This looks correct in an example, but we still need unit tests for weights and non-Gaussian kernels.
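The normalization bug described above can be sketched as follows: averaging over only the in-domain neighborhood inflates the density, because the divisor should be the total number of observations. The Epanechnikov kernel and values here are illustrative.

```python
import numpy as np

def epan(u):
    # Epanechnikov kernel, zero outside [-1, 1]
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

rng = np.random.RandomState(0)
xs = rng.randn(200)
nobs = len(xs)
x, h = 0.0, 0.3

u = (xs - x) / h
in_dom = np.abs(u) <= 1              # neighborhood selected by inDomain

wrong = np.mean(epan(u[in_dom])) / h          # mean over neighborhood only
right = np.sum(epan(u[in_dom])) / (nobs * h)  # divide by total nobs

assert right < wrong   # the neighborhood mean overstates the density
```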

@coveralls

Coverage Status

Coverage remained the same when pulling each of 0d574e2, 758ccc9, 1aa494f, 24e4f91, a6fcae0, 2e8d4d2 and 134b590 on josef-pkt:kde_weights_1103_823 into 5cf4cad on statsmodels:master.

@Padarn
Padarn commented on 0d574e2 Dec 18, 2013

I wonder if it's worth just having the bounded-support kernels behave the same as the Gaussian kernel for inDomain, with zero weight outside the support - it would simplify things a little bit?

Not sure how the performance would be affected. Currently there is no FFT for weights so it might be important.
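The alternative suggested here can be sketched as follows: since a bounded kernel already returns 0 outside its support, evaluating on the full, unfiltered arrays gives the same answer as inDomain-style filtering, at the cost of touching all nobs points per evaluation. The kernel and values are illustrative.

```python
import numpy as np

def epan(u):
    # Epanechnikov kernel, already zero outside [-1, 1]
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

rng = np.random.RandomState(0)
xs = rng.randn(500)
weights = rng.random_sample(500)
weights /= weights.sum()
x, h = 0.2, 0.4

u = (xs - x) / h
full = np.sum(epan(u) * weights) / h   # no filtering; kernel zeroes outsiders

mask = np.abs(u) <= 1                  # inDomain-style filtering
filtered = np.sum(epan(u[mask]) * weights[mask]) / h

assert np.isclose(full, filtered)
```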

Owner

I thought about it; it would make several things easier. However, I think this way can be made faster than always calculating with the full array (#1241).
We would also have to create a new kernel/shape array for each point, because the nonzero positions move.

I might not understand how this is working, but will the idea in #1241 work with the NdKernel? Not sure if the searchsorted approach can be used there.
Hmm yes, good point about the new kernel/shape array for each point.

Owner

No, np.searchsorted only works in 1-D.
The equivalent for NdKernel would be what scikit-learn uses: a KD-tree or another method for fast neighbor search.
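A minimal sketch of the 1-D fast path discussed for #1241: with the sample sorted, `np.searchsorted` locates the slice of points inside `[x - h, x + h]` in O(log n) instead of scanning the full array. (For NdKernel a tree-based neighbor search would be needed instead.) Values are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)
xs = np.sort(rng.randn(1000))   # searchsorted requires sorted data
x, h = 0.3, 0.25

lo = np.searchsorted(xs, x - h, side="left")
hi = np.searchsorted(xs, x + h, side="right")
neighborhood = xs[lo:hi]        # all points within the kernel support

assert np.all(np.abs(neighborhood - x) <= h)
```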

@Padarn

Does self.L2Norm work for all kernels? I seem to remember it being specifically implemented in some (the Gaussian for example).

Nevermind - just checked and looks like it does.

Owner

Yes, it has a generic implementation, but all current kernels define it as a hardcoded number.

@Padarn
Contributor
Padarn commented Dec 18, 2013

All looks pretty good to me.

Might be nice to have the new variance estimate across the support grid as an attribute of the kernel class?

@josef-pkt
Member

I think we should add it as a method to KDEUnivariate directly. But since you have the open PR for refactoring KDEUnivariate, I didn't want to create additional merge conflicts, and the method should match your refactored version.
One advantage of density_var is that it is cheap: it doesn't require any loops, given that we already have the density estimate; the same holds for confint.
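A hedged sketch of what such a method could look like, assuming the standard asymptotic approximation Var[f_hat(x)] ≈ f(x) · L2Norm(K) / (nobs · bw); the names `density_var`, `density_confint` and `kernel_l2norm` are illustrative, not the final API.

```python
import numpy as np

def density_var(density, nobs, bw, kernel_l2norm):
    # vectorized over the whole support grid -- no loops, hence "cheap"
    return density * kernel_l2norm / (nobs * bw)

def density_confint(density, nobs, bw, kernel_l2norm, crit=1.96):
    # pointwise normal-approximation confidence band (crit = z_{0.975})
    se = np.sqrt(density_var(density, nobs, bw, kernel_l2norm))
    return density - crit * se, density + crit * se

# L2 norm of the Gaussian kernel is 1 / (2 * sqrt(pi))
gauss_l2 = 1.0 / (2.0 * np.sqrt(np.pi))
dens = np.array([0.1, 0.3, 0.4])        # illustrative density values
lo, hi = density_confint(dens, nobs=200, bw=0.3, kernel_l2norm=gauss_l2)
assert np.all(lo < dens) and np.all(dens < hi)
```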

@Padarn
Contributor
Padarn commented Dec 18, 2013

That makes sense to me. The KDEUnivariate class could take advantage of density_var when it is working out the gridded values of density anyway.

@josef-pkt josef-pkt merged commit c0a62a0 into statsmodels:master Dec 18, 2013

1 check passed

default The Travis CI build passed
@josef-pkt josef-pkt deleted the josef-pkt:kde_weights_1103_823 branch Dec 18, 2013