Nonparametric density #408

Closed
wants to merge 79 commits into from

4 participants

@gpanterov

Nonparametric density estimators. Conditional and unconditional multivariate mixed-data estimation is available, with plug-in (normal reference) and data-driven bandwidth selection methods (cross-validation least squares and cross-validation maximum likelihood).

@josef-pkt

I think it would pay off to write tests for various input shapes for this function.

For example, my guess is that the calls to kernel_func don't work if there is more than one variable of the same type, or do they? They might if the kernel functions are vectorized.

The kernel functions are vectorized. I actually tested it with multiple variables. For example, to obtain P(c1, c3 | c2):

dens_c = CKDE(tydat=[c1, c3], txdat=[c2], dep_type='cc', indep_type='c',
              bwmethod='normal_reference')
dens_c.pdf()

nice, now I see that the bandwidth h is also vectorized.

@rgommers

The way you've written the signature of this function, both arguments should be specified. So this note must be incorrect. Your TODO below (combining into one parameter) makes sense. I assume bw will then be a user-specified function (it doesn't actually say that in the description of bw) with a standard signature.

Collaborator

Naming a method get_x implies that x is an already-existing attribute/number. Perhaps better to name it find_bw or compute_bw or similar.

@rgommers

Could you indicate that this is Scott's rule of thumb? Silverman's is almost as popular I think.

Also, the method name is not so clear I think - could be named such that it's clear that this is a bandwidth estimate.

@rgommers

The methods normal_reference, cv_ml and cv_ls are all private, right? They should only be called through get_bw. So start the names with an underscore.

@rgommers

It would be good to explain in a few sentences what "conditional kde" actually is and give a reference. Conditional estimation is a lot less common than unconditional; unconditional is normally even left off ("kernel density estimation" refers to your UKDE class).
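
(For reference: as the CKDE docstring further down states, the conditional estimator is the ratio of two unconditional product-kernel estimates, f(y | x) = f(x, y) / f(x).)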

@rgommers

For keywords which can be left out, use None as the default value. False implies that this is a boolean variable. The check for whether input was given is then written as if eydat is not None.
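
A minimal sketch of the suggested pattern (the method and argument names here are just illustrative):

def pdf(self, eydat=None):
    # Fall back to the training data when no evaluation data is supplied.
    if eydat is None:
        eydat = self.tydat
    # ... evaluate the density at eydat ...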

@rgommers

This for-loop doesn't do anything. It only creates var, which isn't used below.

column_stack (one of my favorites) could be used; however, concatenate and column_stack both copy the data, AFAICS.

In general it might be better to work with views, and require users to concatenate themselves.
For example, in the LOO loop, tdat is already an array (view), if there is no concatenate call then the class would have a view instead of making a copy of the data. In most models we try to use views on the original data, exog and endog, although some calculations might create copies anyway.
(We never checked systematically whether we save any memory or are faster this way.)
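
A small standalone illustration of the view-vs-copy point (names are just for illustration):

import numpy as np

tdat = np.arange(12.0).reshape(6, 2)                   # (nobs, k_vars)
sliced = tdat[1:, :]                                   # basic slicing returns a view
stacked = np.column_stack((tdat[:, 0], tdat[:, 1]))    # copies the data

print(sliced.base is tdat)    # True: shares memory with the original
print(stacked.base is tdat)   # False: a new array was allocated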

@rgommers

Same here, these for-loops don't do anything.

@josef-pkt

index could be 1d (row selector), then reshape is not necessary
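
A standalone sketch of the 1-D row-selector idea (illustrative names, not the PR's code):

import numpy as np

tdat = np.random.randn(5, 2)                 # (nobs, k_vars)
i = 2
index = np.arange(tdat.shape[0]) != i        # 1-D boolean row selector
X_not_i = tdat[index]                        # (nobs - 1, k_vars); no reshape needed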

@josef-pkt

PEP8 doesn't put spaces around = in keyword arguments. Minor issue, but useful to get used to.

Sorry, I guess I wasn't clear before; below there are many spaces around = in keyword arguments.

@josef-pkt

is it if bw is not None or if not bw is None?

@josef-pkt

should this be compute_bw(self, bw=None, bwmethod=None)? It should be, if they are optional, even if one of the two is required.

I think, if it's possible, then there should be a recommended default, Scott's or Silverman's: normal_reference?

@rgommers

Should be

if edat is None:
    edat = self.tdat
@rgommers
Collaborator

When you're doing PEP8 fixes, make your life easier by running http://pypi.python.org/pypi/pep8 over the file(s). It will warn you when things are non-standard.

@rgommers

Should be if not isinstance(bw, basestring).

The else clause below should probably also check that the input is a callable, like so hasattr(bw, '__call__').
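
Put together, the dispatch could look roughly like this (a sketch only; bw_func and _normal_reference are names used elsewhere in this PR, and the callable branch is hypothetical):

import numpy as np

def _compute_bw(self, bw):
    # Sketch: bw may be None, an array of bandwidths, a string naming a
    # built-in rule, or (hypothetically) a user-supplied callable.
    if bw is None:
        return self._normal_reference()
    if not isinstance(bw, basestring):
        if hasattr(bw, '__call__'):
            return bw(self)        # hypothetical user-defined selector
        return np.asarray(bw)      # user-specified bandwidth values
    return self.bw_func[bw]()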

@rgommers

This took me a bit of puzzling. IMSE doesn't actually calculate the integral, so the name is a bit deceptive. I guess you don't need to explicitly calculate it if you're only using it from optimize.fmin.

Did you also plan to provide other metrics, like ISE or IAE?

@rgommers
Collaborator

I think the purpose of the convolution kernels and how to use them needs some explanation. So far they're only used in UKDE.IMSE as far as I can tell.

@rgommers

I'd reserve LaTeX for formulas that are at least somewhat complicated. This would be better written as plain text.

@rgommers

Probably better to describe what's different from GPKE. Anything besides the summing? I thought this was vectorized too, can't that be reused?

@rgommers

Not actually implemented yet (commented out below). Wouldn't it be easier to return the sorted tdat or edat than ix?

I left it inactive because it gets confusing when you have more than one variable. What should you sort by in the multivariate case?

Collaborator

Perhaps all of them, first on axis 0, then 1, etc.?

@rgommers
Collaborator

If you don't need code anymore, better to delete it.

@rgommers
Collaborator

This fix really needs a test for edat usage.

@rgommers

The old version (array_like) is actually the correct one.

@rgommers
Collaborator

I think you meant these as examples, right? Nothing is actually tested. Matplotlib is only an optional dependency of statsmodels, so you should only use it within a function or with a conditional import (i.e. within a try/except).

@josef-pkt

doesn't work for me

import statsmodels.nonparametric.nonparametric2 as nparam ?

@josef-pkt
Owner

leaving plot.show() in a test hangs the test when I run it
also matplotlib import needs to be protected (try except)

Owner

pareto graph is just a line at 1e100
laplace and powerlaw ? seem to have problems close to the boundaries

one print statement is left somewhere when running the tests

tests run without failures, but are slow, 244 seconds, we need to figure out a way to shorten this or mark some as slow before merging

(These are things that need to be fixed before merging, but can be ok, or not our business, in a development branch)

Owner

some of the test cases would make nice example scripts

Owner

I think the class names need to be made longer and more descriptive. Only the most famous models are allowed to have acronyms. KDE is ok, but UKDE doesn't tell me anything.

@josef-pkt

special.ndtr(x) has the cdf for standard normal, used by scipy.stats.distributions.norm

@josef-pkt
Owner

nice, I like having the cdf available, Azzalini (IIRC) mentioned that the rate (as function of n) for the bandwidth for cdf should be smaller (?) than for the pdf. Did you see anything about bw choice for cdf?

Collaborator

Fast, and seems to work well. At least in 1-D, converges nicely to the empirical CDF.

Collaborator

@josef-pkt that's what I thought too, bandwidth isn't the same as for pdf.

Collaborator

Can you factor out all the code related to asarray, K, N and reshaping? It's more than 10 lines that are duplicated in every single kernel function. Should be something like

def _get_shape_and_type(Xi, x, kind='c'):
    ...
    return Xi, x, K, N
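
One possible shape of such a helper (a sketch only, following the suggested signature, not the PR's actual code):

import numpy as np

def _get_shape_and_type(Xi, x, kind='c'):
    # Coerce Xi to a 2-D (N, K) array and return the sizes that every
    # kernel function currently recomputes for itself.
    Xi = np.asarray(Xi, dtype=float if kind == 'c' else int)
    if Xi.ndim == 0:        # a single point (number)
        Xi = Xi.reshape(1, 1)
    elif Xi.ndim == 1:      # one variable with many observations
        Xi = Xi.reshape(-1, 1)
    N, K = Xi.shape
    return Xi, np.asarray(x), K, N
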
Collaborator

The UKDE, CKDE interface now doesn't allow specifying the kernels to use. The Epanechnikov kernel isn't used at all. Are you planning to expand that interface? In any case, what the default kernels are should be documented.

Collaborator

I have the feeling that the for-loop in GPKE can still be optimized, it's very expensive now. You can see this easily by profiling in IPython. Use for example %prun dens_scott.cdf() after having run the below script. 33000 function calls for a 1-D example with 1000 points.

import numpy as np
from scipy import stats

from statsmodels.sandbox.distributions.mixture_rvs import mixture_rvs
from statsmodels.nonparametric import UKDE
from statsmodels.tools.tools import ECDF

import matplotlib.pyplot as plt


np.random.seed(12345)
obs_dist = mixture_rvs([.25,.75], size=1000, dist=[stats.norm, stats.norm],
                kwargs = (dict(loc=-1,scale=.5),dict(loc=1,scale=.5)))

dens_scott = UKDE(tdat=[obs_dist], var_type='c', bw='normal_reference')
est_scott = dens_scott.pdf()

xx = np.linspace(-4, 4, 100)
est_scott_cdf = dens_scott.cdf(xx)
ecdf = ECDF(obs_dist)

# Plot the cdf
fig = plt.figure()
plt.plot(xx, est_scott_cdf, 'b.-')
plt.plot(ecdf.x, ecdf.y, 'r.-')

plt.show()
Collaborator

The for-loop could be over the number of variables instead of the data points, right?

Owner

It looks that way to me too. @gpanterov is there any reason this couldn't be vectorized over the number of observations?
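
For the continuous/Gaussian case, a standalone sketch of what evaluating over all observations at once could look like (illustrative names, not the PR's gpke; memory grows as n_eval * nobs * k):

import numpy as np

def gaussian_pdf_vectorized(bw, tdat, edat):
    # Product Gaussian KDE evaluated at every row of edat with one
    # broadcasted expression instead of a Python loop over points.
    # tdat is (nobs, k), edat is (n_eval, k), bw has length k.
    bw = np.asarray(bw, dtype=float)
    z = (edat[:, np.newaxis, :] - tdat[np.newaxis, :, :]) / bw
    kvals = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kvals.prod(axis=2).sum(axis=1) / (tdat.shape[0] * bw.prod())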

@rgommers

Indentation isn't consistent here.

@rgommers

Need a blank line here.

@rgommers
Collaborator

Please add import matplotlib.pyplot as plt.

Collaborator

Pareto is still broken.

Collaborator

The Weibull plot shows an interesting issue - finite support isn't handled in the UKDE, CKDE classes. Close to 0 this goes wrong.

matplotlib.pyplot - added.
Broken Pareto: strange, it works for me. Maybe it depends on the seed. With np.random.seed(123456) I get the following plot:
http://statsmodels-np.blogspot.com/2012/07/kde-estimate-of-pareto-rv.html
The density estimate isn't very good, but this could be due to the relatively small sample size. Will add the seed.

Collaborator

With that seed I get the same result as you. The result is still quite far off though, again due to not dealing with the finite support.

@jseabold

Probably want to avoid underscores in class names unless it's to mark the class as private. CamelCase is almost always good enough.

Owner

You also want to stick explicitly with new-style classes. Ie., all of your classes should inherit from object

class GenericKDE(object):
@jseabold

Not a huge deal, but you want a class method here or this gets called for every test method. Ie.,

@classmethod 
def setUp(cls):
   ...
@jseabold

Could you post scripts somewhere to compare the output for the full datasets? I'd like to compare the performance.

Sure. I can do that. But it becomes quite slow if you include the entire data set and if you use the data-driven bandwidth estimates (especially cross-validation least squares)

Owner

Yeah, don't put it in the tests, but if you could put it as an example script somewhere I can play with that would be helpful.

@gpanterov removed the Epanechnikov kernel. It could be added again at a later time if we decide to give the user an option to specify kernels. However, most references claim the choice of kernel is not important. (02b3781)
@jseabold

This could be a property since it doesn't need any arguments. I'm not sure about caching it yet since I don't know all the moving parts.

@rgommers
Collaborator

My version of MPL (1.0.1) doesn't have ax.zaxis. What version are you on? Can you leave out those 2 lines, or replace them with something that works for multiple versions?

Also, you could add a second plot of the same density with imshow(Z). While the 3-D version is fancier, I find the 2-D one much easier to interpret.

Other than that, looks good.

@rgommers

You don't need A, B and the for-loop. These 8 lines can be replaced by

ix = np.random.uniform(size=N) > 0.5
V = np.random.multivariate_normal(mu1, cov1, size=N)
V[ix, :] = np.random.multivariate_normal(mu2, cov2, size=N)[ix, :]
@rgommers

This for-loop can also be removed. Three lines above can be replaced by

edat = np.column_stack([X.ravel(), Y.ravel()])
Z = dens.pdf(edat).reshape(X.shape)

It would be good to document in the pdf method that the required shape for edat is (num_points, K), with K the number of dimensions (available as the K attribute).

@rgommers
Collaborator

You forgot to remove the Reg class.

statsmodels/examples/ex_multivar_UKDE.py
@@ -0,0 +1,55 @@
+#import nonparametric2 as nparam
+import statsmodels.nonparametric as nparam
+import scipy.stats as stats
+import numpy as np
+import matplotlib.pyplot as plt
+from mpl_toolkits.mplot3d import axes3d
+from matplotlib import cm
+from matplotlib.ticker import LinearLocator, FormatStrFormatter
@rgommers Collaborator

This line isn't needed, nor is line 1 (commented-out import).

statsmodels/nonparametric/nonparametric2.py
((240 lines not shown))
+ \sum_{j=1,j\neq i}K_{h}(X_{i},X_{j})
+
+ where :math:`K_{h}` represents the
+ Generalized product kernel estimator:
+
+ .. math:: K_{h}(X_{i},X_{j})=
+ \prod_{s=1}^{q}h_{s}^{-1}k\left(\frac{X_{is}-X_{js}}{h_{s}}\right)
+ """
+
+ LOO = tools.LeaveOneOut(self.tdat)
+ i = 0
+ L = 0
+ for X_j in LOO:
+ f_i = tools.gpke(bw, tdat=-X_j, edat=-self.tdat[i, :],
+ var_type=self.var_type)
+ i += 1
@rgommers Collaborator

This i=0; i+=1 is a bit un-Pythonic. You can get i from the for-loop like so: for i, X_j in enumerate(LOO):.

statsmodels/nonparametric/nonparametric2.py
((21 lines not shown))
+"""
+
+import numpy as np
+from scipy import integrate, stats
+import np_tools as tools
+import scipy.optimize as opt
+import KernelFunctions as kf
+
+__all__ = ['UKDE', 'CKDE']
+
+
+class GenericKDE (object):
+ """
+ Generic KDE class with methods shared by both UKDE and CKDE
+ """
+ def compute_bw(self, bw):
@rgommers Collaborator

This isn't a public method, so should start with an underscore.

@rgommers
Collaborator

Please mark the current slow tests with @dec.slow from numpy.testing, as we discussed several times. You can add a test with smaller input as non-slow.
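
For reference, marking a test as slow would look like this (the test name is hypothetical):

from numpy.testing import dec

@dec.slow
def test_ckde_cv_ls_full_data():
    # Hypothetical test: runs only when slow tests are requested,
    # e.g. via nonparametric.test('full').
    pass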

@rgommers
Collaborator

I've fixed a lot of small issues in gpanterov#4. Fixing was much quicker than noting all of them here.

@rgommers
Collaborator

Other things to still address related to current code:

  • gpke() still takes kernel function keywords, but there aren't any alternatives. Remove?
  • imse() documented Return value (CV) is incorrect, it doesn't return a function.
  • AitchisonAitken() returns a (N, K) array, while WangRyzin() is documented as returning a float. This cannot be correct.
  • Gaussian() and convolution/cdf kernels don't have docstrings yet.
  • function names in kernels.py should be lower case (PEP8)

EDIT: (16 Aug) these points are taken care of now.

@rgommers
Collaborator

More points:

  • tests not marked with @dec.slow still take 60 seconds to run, this needs to be improved by running tests with smaller input data. Marking more with @dec.slow is not an option, because it will just mean most tests won't be run by default.
  • UKDE/CKDE names aren't too informative. Josef asked before for some better names.
  • An example with ordered/unordered data really needs to be added. The examples folder now has two examples, one with 1-D continuous input and one with 2-D continuous input.

I've reviewed all code now, that's all I saw.

@rgommers
Collaborator

One more thing: Josef asked you about what it would take to handle distributions with finite support correctly. See plots for Pareto and Weibull distributions in ex_univar_kde.py for an example of where this would help. Have you thought about that?

@jseabold jseabold commented on the diff
statsmodels/nonparametric/np_tools.py
@@ -0,0 +1,139 @@
+import numpy as np
+
+from . import kernels
@jseabold Owner

Can you change this to an absolute import?

@rgommers Collaborator

I thought you wanted relative ones? Or was that in the past?

@jseabold Owner

I think we settled on relative in api, absolute everywhere else. I usually debug in the source directory.

@jseabold jseabold commented on the diff
statsmodels/nonparametric/nonparametric2.py
((231 lines not shown))
+ The leave-one-out kernel estimator of :math:`f_{-i}` is:
+
+ .. math:: f_{-i}(X_{i})=\frac{1}{(n-1)h}
+ \sum_{j=1,j\neq i}K_{h}(X_{i},X_{j})
+
+ where :math:`K_{h}` represents the generalized product kernel
+ estimator:
+
+ .. math:: K_{h}(X_{i},X_{j}) =
+ \prod_{s=1}^{q}h_{s}^{-1}k\left(\frac{X_{is}-X_{js}}{h_{s}}\right)
+ """
+ LOO = tools.LeaveOneOut(self.tdat)
+ L = 0
+ for i, X_j in enumerate(LOO):
+ f_i = tools.gpke(bw, tdat=-X_j, edat=-self.tdat[i, :],
+ var_type=self.var_type)
@jseabold Owner

A technical question: this is the density of all the X_j, evaluated at the left-out point X_i, correct? Is this right? I thought you would just evaluate the density f_hat = f(X_j) over some grid support x and use sum(log(f_hat)), as in Racine and Li section 2.3.4.

"f_hat(−i) (x) is the leave-one-out kernel estimator of f (Xi ) that uses all points except Xi to construct the density estimate."

Is there another reference or more details for this? I haven't been through this in a while.

A technical question. This is the density of all the X_j, evaluated at the left out point X_i correct? Is this right?

Yep, this is correct. X_j represents the entire data set without the left-out point i.

@jseabold Owner

But then you're evaluating it at the left out point. Is this the right thing to do? My impression was that you get the whole density of X_j, something like

density = KDE(X_j)
density.fit(bw=bw)

Then you evaluate the log-likelihood of this whole density.

np.sum(np.log(density.density))

That's one f_i. You want to sum all the f_i, that's the whole leave-one-out likelihood. Am I misunderstanding?

@jseabold Owner

To explain a bit more (for my own edification): what it looks like you have now is a sum of each leave-one-out density estimate (not normalized by 1/(n-1)), evaluated at each left-out point. This doesn't seem right to me.

Yes, this is correct. The normalization is done at the end; since it is a sum, it doesn't matter.
I believe it is exactly the same as section 2.3.4 in Racine's primer. See the equation at the top of p. 16.

@jseabold Owner

I don't see anything that indicates that the x in this equation is X_i. Why not just write X_i? I always assumed that x was just some grid that covers the support of f_hat. This is how I interpreted its use elsewhere, e.g., equation 2.2. Does what I'm saying make sense? This is why I assumed that the FFT estimator would be of great benefit: we have to evaluate n full densities, i.e., the likelihood of the entire density over its entire support. I'll have a look at the original papers and see if it sheds any light on my confusion.

@jseabold Owner

Hmm, ok, looking elsewhere I guess I misunderstood the notation here. I guess it makes sense intuitively, but I'll need to work with this a bit more.

@jseabold Owner

And it does explicitly state f(X_i). D'oh. Sorry for the noise.
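
For readers following along, a standalone sketch of the leave-one-out ML objective being discussed, for continuous data and a Gaussian product kernel only (not the PR's code):

import numpy as np

def loo_loglikelihood(tdat, bw):
    # sum_i log f_{-i}(X_i), where f_{-i} is the product-kernel density
    # estimated from all observations except i, evaluated at X_i.
    tdat = np.asarray(tdat, dtype=float)
    nobs = tdat.shape[0]
    bw = np.asarray(bw, dtype=float)
    L = 0.0
    for i in range(nobs):
        X_not_i = np.delete(tdat, i, axis=0)
        z = (tdat[i] - X_not_i) / bw                       # (nobs - 1, k_vars)
        kvals = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
        f_i = kvals.prod(axis=1).sum() / ((nobs - 1) * bw.prod())
        L += np.log(f_i)
    return L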

@jseabold jseabold commented on the diff
statsmodels/nonparametric/nonparametric2.py
((226 lines not shown))
+ func: function
+ For the log likelihood should be numpy.log
+
+ Notes
+ -----
+ The leave-one-out kernel estimator of :math:`f_{-i}` is:
+
+ .. math:: f_{-i}(X_{i})=\frac{1}{(n-1)h}
+ \sum_{j=1,j\neq i}K_{h}(X_{i},X_{j})
+
+ where :math:`K_{h}` represents the generalized product kernel
+ estimator:
+
+ .. math:: K_{h}(X_{i},X_{j}) =
+ \prod_{s=1}^{q}h_{s}^{-1}k\left(\frac{X_{is}-X_{js}}{h_{s}}\right)
+ """
@jseabold Owner

Interesting parallelization results. I don't know if you want to play around with this that much, but you don't start seeing gains from using all cores until nobs > 1500 or so, and even then it's modest (i.e., barely worth it). I guess the overhead of joblib is still costlier than the computations, but I would've expected it to be a bit faster earlier. There may be other hotspots that this benefits from; you can use this:

        var_type = self.var_type
        tdat = self.tdat
        LOO = tools.LeaveOneOut(tdat)
        from statsmodels.tools.parallel import parallel_func
        parallel, p_func, n_jobs = parallel_func(tools.gpke, n_jobs=-1,
                                                 verbose=0)
        L = sum(map(func, parallel(p_func(bw, tdat=-X_j, edat=-tdat[i, :],
                            var_type=var_type) for i, X_j in enumerate(LOO))))

Alternatively we're going to have look elsewhere for speed gains.

@jseabold Owner

Forgot the

        return -L
@josef-pkt Owner

joblib needs to work in batches, especially on Windows, see mailing list "playing with joblib" 10/10/2011 Alexandre's comments early on

@rgommers Collaborator

I think it's better to look at the algorithms again in some more detail (i.e. with a profiler) instead of at joblib. Joblib also isn't going to yield more than a factor of a couple (2 at most for me); we're looking for more than that still.

@jseabold Owner

2-8x speed-up is a pretty decent speed-up for things that take 1-2 minutes. This is embarrassingly parallel, so we should be able to take advantage here.

I still think we want to use binning + FFT for the default Gaussian kernel. This is going to yield a 70-300x speed-up for each evaluation according to the literature and my experience with doing univariate. Newer multipole-type methods will yield 2-3x beyond this, but it looks more complicated than I have time for. I'm going to see if I can't look through a copy of Silverman or Wand and Jones tomorrow for the details.

@rgommers Collaborator

Note that George already implemented batching/blocking, with large speedups.

About FFT, there then needs to be a solution for evaluating at certain points or non-equidistant grids.

@jseabold Owner

Ah, good. I'll have to catch up (currently homeless with limited internet). Looking into the changes needed for FFT.

@rgommers Collaborator

Note that that's not in this PR, but it's in the nonparametric-all branch.

@rgommers Collaborator

Reading up some more about FFT methods, I found http://books.google.nl/books?hl=en&lr=&id=AyJ9xrrwDnIC&oi=fnd&pg=PA203&dq=density+estimation+multipole&ots=K_IBmhGVTY&sig=yrzJOhpPe4d2-2i_xCbeFpdE6hE#v=onepage&q=density%20estimation%20multipole&f=false which gives the following scalings:

O(N^D log(N^D))    # FFT
O(D N^2)           # direct summation (like in this PR)

It also gives a reference to an empirical study (Wand, "Fast Computation of Multivariate Kernel Estimators", 1994) showing that for D=3 the speed-up of FFT over direct summation is at most 5 for 10k samples. This scaling is the reason that FFT methods are mostly restricted to 1-D.

@josef-pkt Owner

I don't have much of an idea about the FFT in the multivariate KDE case, but O(N^D log(N^D)) might not be correct for product kernels. My guess would be that the terms are N*D instead of N**D for product kernels.

I think we should leave the work on speed improvements for later, so we can get the current work merged and so the discussion doesn't get lost with a git rebase.

@rgommers Collaborator

That's a good point, we don't want to rebase this PR. That should happen on a new PR, otherwise many of these comments will be lost.

@rgommers Collaborator

I'm finding lots of references to implementations of the Fast Gauss Transform, but nothing that's directly reusable. That would be the way to go.

@rgommers Collaborator

FigTree seems to be the best available implementation, but it's LGPL.

@jseabold Owner

Thanks for the references. I don't think these are directly comparable though. The benchmark for the Wand study is not direct summation. It's a multivariate simple binning algorithm mentioned in Scott written in Fortran AFAIK. The reference in Scott for this is "On Frequency Polygons and Average Shifted Histograms in Higher Dimensions" Hjort 1986.

My impression for the product kernels is the same as Josef's, though I'm not as familiar with this paper. I'm not proposing multivariate gridding, and I could be wrong - these are just impressions right now. Just the way we have it now, we are essentially doing univariate kernel density estimation, which should be easy to speed up, since we'd only do the binning once and then evaluate at the different points with different bandwidths, which should give the speed-ups I mentioned. I just need to figure out how to make this possible.

@rgommers Collaborator

Thanks. I didn't read the Wand paper, just the summary of it in the book I linked, so I may have misunderstood the reference case.

@rgommers
Collaborator

Open points from mailing list discussion:

  • renaming: UKDE --> KDE should be done. CKDE may be renamed to KDEConditional or ConditionalKDE (Josef's preference).
  • Skipper has a preference for a do-nothing __init__() and fit() to do bw estimation. Related to #429 (don't see exactly how though).
  • as mentioned above too, the issue of densities with finite support. Need to at least note as an open issue.
@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/kernels.py
((26 lines not shown))
+ if Xi.ndim > 1:
+ K = np.shape(Xi)[1]
+ N = np.shape(Xi)[0]
+ elif Xi.ndim == 1: # One variable with many observations
+ K = 1
+ N = np.shape(Xi)[0]
+ else: # ndim ==0 so Xi is a single point (number)
+ K = 1
+ N = 1
+
+ assert N >= K # Need more observations than variables
+ Xi = Xi.reshape([N, K])
+ return h, Xi, x, N, K
+
+
+def aitchison_aitken(h, Xi, x, num_levels=False):
@josef-pkt Owner

(not sure if we need to change this)

I would prefer the reversed order of the arguments, (x, Xi, h).
I like h last (in case we get default arguments and h as a keyword).
The order x, Xi is mainly so it reads as a function of x given Xi and h (not sure about this, since the standard kernel notation is K(Xi, x)?).

With x and Xi it's not obvious to me from the names which is which (I often use xi to mean x_i, subscript i, the i-th observation; what does i stand for in the training set?).
Maybe data instead of Xi would be more informative.

(We need kernel functions for other parts, but I don't know or recall the details.)

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/kernels.py
((23 lines not shown))
+ h = np.asarray(h, dtype=float)
+ Xi = np.asarray(Xi)
+ # More than one variable with more than one observations
+ if Xi.ndim > 1:
+ K = np.shape(Xi)[1]
+ N = np.shape(Xi)[0]
+ elif Xi.ndim == 1: # One variable with many observations
+ K = 1
+ N = np.shape(Xi)[0]
+ else: # ndim ==0 so Xi is a single point (number)
+ K = 1
+ N = 1
+
+ assert N >= K # Need more observations than variables
+ Xi = Xi.reshape([N, K])
+ return h, Xi, x, N, K
@josef-pkt Owner

It's not clear to me what the actual shape restrictions of the kernels are.
The docstrings in the individual kernel functions are ambiguous. Are the kernels for univariate data, for a single multivariate observation, or really for many multivariate observations?

@josef-pkt Owner

I don't see any unit tests for the kernel functions themselves.

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/kernels.py
((123 lines not shown))
+ discrete distributions", Biometrika, vol. 68, pp. 301-309, 1981.
+ """
+ h, Xi, x, N, K = _get_shape_and_transform(h, Xi, x)
+ Xi = np.abs(np.asarray(Xi, dtype=int))
+ x = np.abs(np.asarray(x, dtype=int))
+ if K == 0:
+ return Xi
+
+ kernel_value = (0.5 * (1 - h) * (h ** abs(Xi - x)))
+ kernel_value = kernel_value.reshape([N, K])
+ inDom = (Xi == x) * (1 - h)
+ kernel_value[Xi == x] = inDom[Xi == x]
+ return kernel_value
+
+
+def gaussian(h, Xi, x):
@josef-pkt Owner

In general, we need more continuous kernels than just gaussian, especially ones with bounded/compact support

@rgommers Collaborator

Handling bounded support would be very useful. Having other continuous kernels besides gaussian that aren't fundamentally different (like Epanechnikov) would be at the bottom of my list, it's pretty much unimportant for the result of estimation.

@josef-pkt Owner

Thinking about extendability: How could a gamma kernel be included, for endog that is continuous but strictly positive?

Epanechnikov (?) or others have the advantage that they only need to be evaluated at points in the neighborhood, while gaussian requires all points.
They might also be better with multimodal models (but I don't have any evidence).

@rgommers Collaborator

There were kernel selection keywords in an earlier version, but we took them out because there were no other kernels. One issue is that you need explicit integral and convolution forms like gaussian_cdf and gaussian_convolution, to not make things very slow. Not sure if there are analytical expressions for those for the gamma kernel.

@josef-pkt Owner

I don't know about the convolution version, but if I understand correctly, the cdf can be obtained from

class gamma_gen(rv_continuous):

    def _cdf(self, x, a):
        return special.gammainc(a, x)
@josef-pkt Owner

Similar for a beta kernel (distributions with a lower and upper bound).

I don't think we need to get to this now, but it might come down the road, either in these classes or separately.

@rgommers Collaborator

If we have the different kernel forms, then it's simple to add. But I agree that that's for later.

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/kernels.py
((163 lines not shown))
+ return kernel_value
+
+
+def gaussian_convolution(h, Xi, x):
+ """ Calculates the Gaussian Convolution Kernel """
+ h, Xi, x, N, K = _get_shape_and_transform(h, Xi, x)
+ if K == 0:
+ return Xi
+
+ z = (Xi - x) / h
+ kernel_value = (1. / np.sqrt(4 * np.pi)) * np.exp(- z ** 2 / 4.)
+ kernel_value = kernel_value.reshape([N, K])
+ return kernel_value
+
+
+def wang_ryzin_convolution(h, Xi, Xj):
@josef-pkt Owner

why Xj and not x? What's the difference between convolution and the plain kernel? (I didn't read the reference for this.)

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/kernels.py
((218 lines not shown))
+ for x in Dom_x[i]:
+ Sigma_x += aitchison_aitken(h[i], Xi[:, i], int(x),
+ num_levels=len(Dom_x[i])) * \
+ aitchison_aitken(h[i], Xj[i], int(x), num_levels=len(Dom_x[i]))
+
+ Ordered[:, i] = Sigma_x[:, 0]
+
+ return Ordered
+
+
+def gaussian_cdf(h, Xi, x):
+ h, Xi, x, N, K = _get_shape_and_transform(h, Xi, x)
+ if K == 0:
+ return Xi
+
+ cdf = 0.5 * h * (1 + erf((x - Xi) / (h * np.sqrt(2))))
@josef-pkt Owner

should be replaced by a norm_cdf for clarity, but not now, since this is cheaper than calling scipy.stats

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/kernels.py
((233 lines not shown))
+ cdf = 0.5 * h * (1 + erf((x - Xi) / (h * np.sqrt(2))))
+ cdf = cdf.reshape([N, K])
+ return cdf
+
+
+def aitchison_aitken_cdf(h, Xi, x_u):
+ Xi = np.abs(np.asarray(Xi, dtype=int))
+ if Xi.ndim > 1:
+ K = np.shape(Xi)[1]
+ N = np.shape(Xi)[0]
+ elif Xi.ndim == 1:
+ K = 1
+ N = np.shape(Xi)[0]
+ else: # ndim ==0 so Xi is a single point (number)
+ K = 1
+ N = 1
@josef-pkt Owner

shape handling is outsourced ?

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/kernels.py
((254 lines not shown))
+ Xi = Xi.reshape([N, K])
+ Dom_x = [np.unique(Xi[:, i]) for i in range(K)]
+ Ordered = np.empty([N, K])
+ for i in range(K):
+ Sigma_x = 0
+ for x in Dom_x[i]:
+ if x <= x_u:
+ Sigma_x += aitchison_aitken(h[i], Xi[:, i], int(x),
+ num_levels=len(Dom_x[i]))
+
+ Ordered[:, i] = Sigma_x[:, 0]
+
+ return Ordered
+
+
+def wang_ryzin_cdf(h, Xi, x_u):
@josef-pkt Owner

Docstrings: I guess Skipper's style of docstring concatenation or templating would pay off in this module and not be too messy.

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/kernels.py
((265 lines not shown))
+
+ return Ordered
+
+
+def wang_ryzin_cdf(h, Xi, x_u):
+ Xi = np.abs(np.asarray(Xi, dtype=int))
+ h = np.asarray(h, dtype=float)
+ if Xi.ndim > 1:
+ K = np.shape(Xi)[1]
+ N = np.shape(Xi)[0]
+ elif Xi.ndim == 1:
+ K = 1
+ N = np.shape(Xi)[0]
+ else: # ndim ==0 so Xi is a single point (number)
+ K = 1
+ N = 1
@josef-pkt Owner

outsourcing?

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/kernels.py
((278 lines not shown))
+ else: # ndim ==0 so Xi is a single point (number)
+ K = 1
+ N = 1
+
+ if K == 0:
+ return Xi
+
+ Xi = Xi.reshape([N, K])
+ h = h.reshape((K, ))
+ Dom_x = [np.unique(Xi[:, i]) for i in range(K)]
+ Ordered = np.empty([N, K])
+ for i in range(K):
+ Sigma_x = 0
+ for x in Dom_x[i]:
+ if x <= x_u:
+ Sigma_x += wang_ryzin(h[i], Xi[:, i], int(x))
@josef-pkt Owner

Can the kernel be vectorized over x? See the question above about the shapes of the arguments to the kernel functions.

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/nonparametric2.py
((69 lines not shown))
+
+ def _normal_reference(self):
+ """
+ Returns Scott's normal reference rule of thumb bandwidth parameter.
+
+ Notes
+ -----
+ See p.13 in [2] for an example and discussion. The formula for the
+ bandwidth is
+
+ .. math:: h = 1.06n^{-1/(4+q)}
+
+ where :math:`n` is the number of observations and :math:`q` is the
+ number of variables.
+ """
+ c = 1.06
@josef-pkt Owner

is c always 1.06 or should this be an option ?

Yes, for the gaussian kernel the scaling factor c is always 1.06

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/nonparametric2.py
((45 lines not shown))
+ - cv_ml: cross validation maximum likelihood
+ - normal_reference: normal reference rule of thumb
+ - cv_ls: cross validation least squares
+
+ Notes
+ -----
+ The default values for bw is 'normal_reference'.
+ """
+
+ self.bw_func = dict(normal_reference=self._normal_reference,
+ cv_ml=self._cv_ml, cv_ls=self._cv_ls)
+ if bw is None:
+ bwfunc = self.bw_func['normal_reference']
+ return bwfunc()
+
+ if not isinstance(bw, basestring):
@josef-pkt Owner

Note for later: basestring might cause compatibility problems with Python 3.x, but I don't know offhand what the compatible way is.

@josef-pkt Owner

I would change the conditional to:

if ...
elif <callable>  # new option
else ...

(In the last case asarray will raise an exception for anything else, which sounds OK and doesn't need a try/except.)

@rgommers Collaborator

Why does basestring cause an issue for py3k? I'm pretty sure this is the standard way to check if a variable is a string, and 2to3 should handle it fine.

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/nonparametric2.py
((229 lines not shown))
+ Notes
+ -----
+ The leave-one-out kernel estimator of :math:`f_{-i}` is:
+
+ .. math:: f_{-i}(X_{i})=\frac{1}{(n-1)h}
+ \sum_{j=1,j\neq i}K_{h}(X_{i},X_{j})
+
+ where :math:`K_{h}` represents the generalized product kernel
+ estimator:
+
+ .. math:: K_{h}(X_{i},X_{j}) =
+ \prod_{s=1}^{q}h_{s}^{-1}k\left(\frac{X_{is}-X_{js}}{h_{s}}\right)
+ """
+ LOO = tools.LeaveOneOut(self.tdat)
+ L = 0
+ for i, X_j in enumerate(LOO):
@josef-pkt Owner

I would prefer x_noti instead of X_j

@josef-pkt Owner

X_j is all observations with j != i? Just checking that I understand correctly.

@rgommers Collaborator

correct

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/np_tools.py
((120 lines not shown))
+ k\left(\frac{X_{iq}-x_{q}}{h_{q}}\right)
+ """
+ iscontinuous, isordered, isunordered = _get_type_pos(var_type)
+ K = len(var_type)
+ N = np.shape(tdat)[0]
+ # must remain 1-D for indexing to work
+ bw = np.reshape(np.asarray(bw), (K,))
+ Kval = np.concatenate((
+ kernel_func[ckertype](bw[iscontinuous],
+ tdat[:, iscontinuous], edat[:, iscontinuous]),
+ kernel_func[okertype](bw[isordered], tdat[:, isordered],
+ edat[:, isordered]),
+ kernel_func[ukertype](bw[isunordered], tdat[:, isunordered],
+ edat[:, isunordered])), axis=1)
+
+ dens = np.prod(Kval, axis=1) * 1. / (np.prod(bw[iscontinuous]))
@josef-pkt Owner

Improved numerical precision and maybe efficiency: define this as a log, so we only need the sum and we don't have to take the log in the cv_ml loop.

But I don't really understand this part and the connection with cv_ml yet.

@josef-pkt Owner

I think not; the product is just over the multivariate dimensions, not over the observations.

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/kernels.py
((67 lines not shown))
+ Here :math:`c` is the number of levels plus one of the RV.
+
+ References
+ ----------
+ .. [1] J. Aitchison and C.G.G. Aitken, "Multivariate binary discrimination
+ by the kernel method", Biometrika, vol. 63, pp. 413-420, 1976.
+ .. [2] Racine, Jeff. "Nonparametric Econometrics: A Primer," Foundation
+ and Trends in Econometrics: Vol 3: No 1, pp1-88., 2008.
+ """
+ h, Xi, x, N, K = _get_shape_and_transform(h, Xi, x)
+ Xi = np.abs(np.asarray(Xi, dtype=int))
+ x = np.abs(np.asarray(x, dtype=int))
+ if K == 0:
+ return Xi
+
+ c = np.asarray([len(np.unique(Xi[:, i])) for i in range(K)], dtype=int)
@josef-pkt Owner

Put this in the else branch of if num_levels; it's an expensive call that is not always needed.

@rgommers Collaborator

Done in my pr-408-comments branch.

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/kernels.py
((199 lines not shown))
+
+ Ordered[:, i] = Sigma_x[:, 0]
+
+ return Ordered
+
+
+def aitchison_aitken_convolution(h, Xi, Xj):
+ h, Xi, x, N, K = _get_shape_and_transform(h, Xi)
+ Xi = np.abs(np.asarray(Xi, dtype=int))
+ Xj = np.abs(np.asarray(Xj, dtype=int))
+ if K == 0:
+ return Xi
+
+ Xi = Xi.reshape([N, K])
+ h = h.reshape((K, ))
+ Dom_x = [np.unique(Xi[:, i]) for i in range(K)]
@josef-pkt Owner

This might also be expensive if it has to be calculated very often.

Add it as an argument and store it in the caller?

When we do LOO, do we need to adjust Dom_x each time, or is it better to keep it constant? The kernel, aitchison_aitken, only uses num_levels, which, it could be argued, should stay unchanged.
I'm not sure what happens with the loop for x in Dom_x[i] if x_{i} is missing.

@rgommers Collaborator

This line takes only a few percent of the total time; the for-loop under it is the expensive part. Reviewing/optimizing that now.

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/kernels.py
((80 lines not shown))
+ return Xi
+
+ c = np.asarray([len(np.unique(Xi[:, i])) for i in range(K)], dtype=int)
+ if num_levels:
+ c = num_levels
+
+ kernel_value = np.tile(h / (c - 1), (N, 1))
+ inDom = (Xi == x) * (1 - h)
+ kernel_value[Xi == x] = inDom[Xi == x]
+ kernel_value = kernel_value.reshape([N, K])
+ return kernel_value
+
+
+def wang_ryzin(h, Xi, x):
+ """
+ The Wang-Ryzin kernel, used for ordered discrete random variables.
@josef-pkt Owner

This sounds like a misnomer to me. It doesn't just assume an ordering, it actually assumes discrete variables with a "uniform scale", fully metric: it uses the absolute distance as the measure.

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/nonparametric2.py
((31 lines not shown))
+class _GenericKDE (object):
+ """
+ Generic KDE class with methods shared by both UKDE and CKDE
+ """
+ def _compute_bw(self, bw):
+ """
+ Computes the bandwidth of the data.
+
+ Parameters
+ ----------
+ bw: array_like or str
+ If array_like: user-specified bandwidth.
+ If a string, should be one of:
+
+ - cv_ml: cross validation maximum likelihood
+ - normal_reference: normal reference rule of thumb
@josef-pkt Owner

Should there be normal_scott and normal_silverman instead?

@rgommers Collaborator

That would be good, no reason not to supply both.

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/nonparametric2.py
((126 lines not shown))
+ Returns the value of the bandwidth that maximizes the integrated mean
+ square error between the estimated and actual distribution. The
+ integrated mean square error (IMSE) is given by:
+
+ .. math:: \int\left[\hat{f}(x)-f(x)\right]^{2}dx
+
+ This is the general formula for the IMSE. The IMSE differs for
+ conditional (CKDE) and unconditional (UKDE) kernel density estimation.
+ """
+ h0 = self._normal_reference()
+ bw = optimize.fmin(self.imse, x0=h0, maxiter=1e3, maxfun=1e3, disp=0)
+ return np.abs(bw)
+
+ def loo_likelihood(self):
+ raise NotImplementedError
+
@josef-pkt Owner

add def imse with NotImplementedError ?

@rgommers Collaborator

_GenericKDE is now also a base class for the regression class, which doesn't have loo_likelihood or imse. So remove loo_likelihood here instead?

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/nonparametric2.py
((134 lines not shown))
+ """
+ h0 = self._normal_reference()
+ bw = optimize.fmin(self.imse, x0=h0, maxiter=1e3, maxfun=1e3, disp=0)
+ return np.abs(bw)
+
+ def loo_likelihood(self):
+ raise NotImplementedError
+
+
+class UKDE(_GenericKDE):
+ """
+ Unconditional Kernel Density Estimator
+
+ Parameters
+ ----------
+ tdat: list of ndarrays or 2-D ndarray
@josef-pkt Owner

I think I would just call it data, or endog :)

I don't like the t and e abbreviations for train and evaluation much. It took me a long time to figure out what they're supposed to stand for.

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/nonparametric2.py
((186 lines not shown))
+ >>> N = 300
+ >>> np.random.seed(1234) # Seed random generator
+ >>> c1 = np.random.normal(size=(N,1))
+ >>> c2 = np.random.normal(2, 1, size=(N,1))
+
+ Estimate a bivariate distribution and display the bandwidth found:
+
+ >>> dens_u = UKDE(tdat=[c1,c2], var_type='cc', bw='normal_reference')
+ >>> dens_u.bw
+ array([ 0.39967419, 0.38423292])
+ """
+ def __init__(self, tdat, var_type, bw=None):
+ self.var_type = var_type
+ self.K = len(self.var_type)
+ self.tdat = tools.adjust_shape(tdat, self.K)
+ self.all_vars = self.tdat
@josef-pkt Owner

I prefer postfixing qualifiers:

endog_all, or just endog or data

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/nonparametric2.py
((187 lines not shown))
+ >>> np.random.seed(1234) # Seed random generator
+ >>> c1 = np.random.normal(size=(N,1))
+ >>> c2 = np.random.normal(2, 1, size=(N,1))
+
+ Estimate a bivariate distribution and display the bandwidth found:
+
+ >>> dens_u = UKDE(tdat=[c1,c2], var_type='cc', bw='normal_reference')
+ >>> dens_u.bw
+ array([ 0.39967419, 0.38423292])
+ """
+ def __init__(self, tdat, var_type, bw=None):
+ self.var_type = var_type
+ self.K = len(self.var_type)
+ self.tdat = tools.adjust_shape(tdat, self.K)
+ self.all_vars = self.tdat
+ self.N, self.K = np.shape(self.tdat)
@josef-pkt Owner

standard terminology nobs, k_vars

@rgommers Collaborator

+1, that's clearer.

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/nonparametric2.py
((227 lines not shown))
+ For the log likelihood should be numpy.log
+
+ Notes
+ -----
+ The leave-one-out kernel estimator of :math:`f_{-i}` is:
+
+ .. math:: f_{-i}(X_{i})=\frac{1}{(n-1)h}
+ \sum_{j=1,j\neq i}K_{h}(X_{i},X_{j})
+
+ where :math:`K_{h}` represents the generalized product kernel
+ estimator:
+
+ .. math:: K_{h}(X_{i},X_{j}) =
+ \prod_{s=1}^{q}h_{s}^{-1}k\left(\frac{X_{is}-X_{js}}{h_{s}}\right)
+ """
+ LOO = tools.LeaveOneOut(self.tdat)
@josef-pkt Owner

Put a limit on the LOO loop for large data sets (large nobs); subsampling?

@rgommers Collaborator

Agreed, blocking should become the default above a certain sample size (O(500)?).

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/nonparametric2.py
((267 lines not shown))
+ The probability density is given by the generalized product kernel
+ estimator:
+
+ .. math:: K_{h}(X_{i},X_{j}) =
+ \prod_{s=1}^{q}h_{s}^{-1}k\left(\frac{X_{is}-X_{js}}{h_{s}}\right)
+ """
+ if edat is None:
+ edat = self.tdat
+ else:
+ edat = tools.adjust_shape(edat, self.K)
+
+ pdf_est = []
+ N_edat = np.shape(edat)[0]
+ for i in xrange(N_edat):
+ pdf_est.append(tools.gpke(self.bw, tdat=self.tdat, edat=edat[i, :],
+ var_type=self.var_type) / self.N)
@josef-pkt Owner

Vectorize gpke, working in batches so as not to blow up memory consumption?

@rgommers Collaborator

Batching is implemented in another branch (have a look at https://github.com/gpanterov/statsmodels/blob/nonparametric-all/statsmodels/nonparametric/nonparametric2.py for an overview of all the work).

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/nonparametric2.py
((313 lines not shown))
+ where G() is the product kernel CDF estimator for the continuous
+ and L() for the discrete variables.
+ """
+ if edat is None:
+ edat = self.tdat
+ else:
+ edat = tools.adjust_shape(edat, self.K)
+
+ N_edat = np.shape(edat)[0]
+ cdf_est = []
+ for i in xrange(N_edat):
+ cdf_est.append(tools.gpke(self.bw, tdat=self.tdat,
+ edat=edat[i, :], var_type=self.var_type,
+ ckertype="gaussian_cdf",
+ ukertype="aitchisonaitken_cdf",
+ okertype='wangryzin_cdf') / self.N)
@josef-pkt Owner

Why does cdf specify the kertype but pdf doesn't?

I think ckertype will need more options, maybe later.

@josef-pkt Owner

Add kertype as an attribute of the instance, in __init__?

@josef-pkt Owner

OK, I see now: _cdf or _convolution below, different kernels.

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/nonparametric2.py
((377 lines not shown))
+ Conditional Kernel Density Estimator.
+
+ Calculates ``P(X_1,X_2,...X_n | Y_1,Y_2...Y_m) =
+ P(X_1, X_2,...X_n, Y_1, Y_2,..., Y_m)/P(Y_1, Y_2,..., Y_m)``.
+ The conditional density is by definition the ratio of the two unconditional
+ densities, see [1]_.
+
+ Parameters
+ ----------
+ tydat: list of ndarrays or 2-D ndarray
+ The training data for the dependent variables, used to determine
+ the bandwidth(s). If a 2-D array, should be of shape
+ (num_observations, num_variables). If a list, each list element is a
+ separate observation.
+ txdat: list of ndarrays or 2-D ndarray
+ The training data for the independent variable; same shape as `tydat`.
@josef-pkt Owner

standard is endog, exog until we change it

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/nonparametric2.py
((358 lines not shown))
+
+ Where :math:`\bar{K}_{h}` is the multivariate product convolution
+ kernel (consult [3] for mixed data types).
+ """
+ F = 0
+ for i in range(self.N):
+ k_bar_sum = tools.gpke(bw, tdat=-self.tdat, edat=-self.tdat[i, :],
+ var_type=self.var_type,
+ ckertype='gauss_convolution',
+ okertype='wangryzin_convolution',
+ ukertype='aitchisonaitken_convolution')
+ F += k_bar_sum
+ # there is a + because loo_likelihood returns the negative
+ return (F / (self.N ** 2) + self.loo_likelihood(bw) *\
+ 2 / ((self.N) * (self.N - 1)))
+
@josef-pkt Owner

some plot methods would be nice

@rgommers Collaborator

Is there something special you think a plot method should do? If it's just plot(x), which plots estimate(x) vs. x on a linear scale then I don't think it adds much.

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/nonparametric2.py
((431 lines not shown))
+ >>> c1 = np.random.normal(size=(N,1))
+ >>> c2 = np.random.normal(2,1,size=(N,1))
+
+ >>> dens_c = CKDE(tydat=[c1], txdat=[c2], dep_type='c',
+ ... indep_type='c', bwmethod='normal_reference')
+
+ >>> print "The bandwidth is: ", dens_c.bw
+ """
+ def __init__(self, tydat, txdat, dep_type, indep_type, bw=None):
+ self.dep_type = dep_type
+ self.indep_type = indep_type
+ self.K_dep = len(self.dep_type)
+ self.K_indep = len(self.indep_type)
+ self.tydat = tools.adjust_shape(tydat, self.K_dep)
+ self.txdat = tools.adjust_shape(txdat, self.K_indep)
+ self.N, self.K_dep = np.shape(self.tydat)
@josef-pkt Owner

nobs, k not capitalised

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/nonparametric2.py
((432 lines not shown))
+ >>> c2 = np.random.normal(2,1,size=(N,1))
+
+ >>> dens_c = CKDE(tydat=[c1], txdat=[c2], dep_type='c',
+ ... indep_type='c', bwmethod='normal_reference')
+
+ >>> print "The bandwidth is: ", dens_c.bw
+ """
+ def __init__(self, tydat, txdat, dep_type, indep_type, bw=None):
+ self.dep_type = dep_type
+ self.indep_type = indep_type
+ self.K_dep = len(self.dep_type)
+ self.K_indep = len(self.indep_type)
+ self.tydat = tools.adjust_shape(tydat, self.K_dep)
+ self.txdat = tools.adjust_shape(txdat, self.K_indep)
+ self.N, self.K_dep = np.shape(self.tydat)
+ self.all_vars = np.concatenate((self.tydat, self.txdat), axis=1)
@josef-pkt Owner

column_stack would require less thinking

(I didn't see that tydat is 2d even if univariate.)

@rgommers Collaborator

Replaced concatenate with column_stack and row_stack everywhere in https://github.com/rgommers/statsmodels/tree/pr-408-comments.
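
For 2-D inputs the two are interchangeable, e.g.:

import numpy as np

a = np.random.randn(300, 1)
b = np.random.randn(300, 2)
print(np.allclose(np.column_stack((a, b)),
                  np.concatenate((a, b), axis=1)))   # True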

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/nonparametric2.py
((589 lines not shown))
+ else:
+ exdat = tools.adjust_shape(exdat, self.K_indep)
+
+ N_edat = np.shape(exdat)[0]
+ cdf_est = np.empty(N_edat)
+ for i in xrange(N_edat):
+ mu_x = tools.gpke(self.bw[self.K_dep::], tdat=self.txdat,
+ edat=exdat[i, :], var_type=self.indep_type) / self.N
+ mu_x = np.squeeze(mu_x)
+ G_y = tools.gpke(self.bw[0:self.K_dep], tdat=self.tydat,
+ edat=eydat[i, :], var_type=self.dep_type,
+ ckertype="gaussian_cdf",
+ ukertype="aitchisonaitken_cdf",
+ okertype='wangryzin_cdf', tosum=False)
+
+ W_x = tools.gpke(self.bw[self.K_dep::], tdat=self.txdat,
@josef-pkt Owner

The :: requires thinking; missing comma?

@rgommers Collaborator

No comma, just removing the second :. Addressed in https://github.com/rgommers/statsmodels/tree/pr-408-comments for all code.

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/nonparametric2.py
((583 lines not shown))
+ if eydat is None:
+ eydat = self.tydat
+ else:
+ eydat = tools.adjust_shape(eydat, self.K_dep)
+ if exdat is None:
+ exdat = self.txdat
+ else:
+ exdat = tools.adjust_shape(exdat, self.K_indep)
+
+ N_edat = np.shape(exdat)[0]
+ cdf_est = np.empty(N_edat)
+ for i in xrange(N_edat):
+ mu_x = tools.gpke(self.bw[self.K_dep::], tdat=self.txdat,
+ edat=exdat[i, :], var_type=self.indep_type) / self.N
+ mu_x = np.squeeze(mu_x)
+ G_y = tools.gpke(self.bw[0:self.K_dep], tdat=self.tydat,
@josef-pkt Owner

cdf_y or cdf_endog instead of G_y.

Should this be a separate method, or otherwise available? Do we want to store these in the same class, or should a user create their own marginal and joint distributions? (Application: mutual information; there is an example in the sandbox using scipy's gaussian_kde.)

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/nonparametric2.py
((639 lines not shown))
+
+ .. math:: G_{-l}(X_{l}) = n^{-2}\sum_{i\neq l}\sum_{j\neq l}
+ K_{X_{i},X_{l}} K_{X_{j},X_{l}}K_{Y_{i},Y_{j}}^{(2)}
+
+ where :math:`K_{X_{i},X_{l}}` is the multivariate product kernel and
+ :math:`\mu_{-l}(X_{l})` is the leave-one-out estimator of the pdf.
+
+ :math:`K_{Y_{i},Y_{j}}^{(2)}` is the convolution kernel.
+
+ The value of the function is minimized by the ``_cv_ls`` method of the
+ `_GenericKDE` class to return the bw estimates that minimize the
+ distance between the estimated and "true" probability density.
+ """
+ zLOO = tools.LeaveOneOut(self.all_vars)
+ CV = 0
+ for l, Z in enumerate(zLOO):
@josef-pkt Owner

l (a lowercase L) is not a good variable name.

In txdat[l, :], is that a 1, an l, or an I?

@josef-pkt josef-pkt commented on the diff
statsmodels/nonparametric/nonparametric2.py
((100 lines not shown))
+
+ The leave-one-out kernel estimator of :math:`f_{-i}` is:
+
+ .. math:: f_{-i}(X_{i})=\frac{1}{(n-1)h}
+ \sum_{j=1,j\neq i}K_{h}(X_{i},X_{j})
+
+ where :math:`K_{h}` represents the Generalized product kernel
+ estimator:
+
+ .. math:: K_{h}(X_{i},X_{j})=\prod_{s=1}^
+ {q}h_{s}^{-1}k\left(\frac{X_{is}-X_{js}}{h_{s}}\right)
+ """
+ # the initial value for the optimization is the normal_reference
+ h0 = self._normal_reference()
+ bw = optimize.fmin(self.loo_likelihood, x0=h0, args=(np.log, ),
+ maxiter=1e3, maxfun=1e3, disp=0)
@josef-pkt Owner

One possible speedup: increase xtol=0.0001.
As far as I understand, the exact bandwidth might not be very important, so we might not need it to converge to high precision. The important part is convergence in function value. (A guess: this might save time and calculations if the objective function is relatively flat at the optimum.)

What I don't know is what the scale is: do we need absolute or relative tolerance in x?

Extra question (maybe for the future): are any of the other optimizers potentially better/faster? Make it a choice?

@rgommers Collaborator

xtol=1e-4 is the default, so I guess you mean 1e-3. Bandwidths should be in the range 0-1, but they can be small for a large sample size. Therefore I think a relative tolerance (xtol) should be used.

fmin_bfgs is indeed faster in most cases, so making this configurable would help.
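
A rough sketch of what making the optimizer configurable could look like (the optimizer keyword is hypothetical, not part of this PR):

import numpy as np
from scipy import optimize

def _cv_ml(self, optimizer='fmin', xtol=1e-4):
    # Sketch: choose between Nelder-Mead (fmin) and BFGS for the
    # cross-validation ML bandwidth search.
    h0 = self._normal_reference()
    if optimizer == 'fmin':
        bw = optimize.fmin(self.loo_likelihood, x0=h0, args=(np.log,),
                           xtol=xtol, maxiter=1e3, maxfun=1e3, disp=0)
    else:
        bw = optimize.fmin_bfgs(self.loo_likelihood, x0=h0, args=(np.log,),
                                disp=0)
    return np.abs(bw)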

@rgommers
Collaborator

About api.py vs. __init__.py, I had a look at other modules and it's pretty much inconsistent. Because completely emptying the current __init__.py is not backwards compatible, I think it shouldn't be done in this PR. Adding UKDE, CKDE in a new api.py and leaving the current KDE in __init__.py also doesn't make sense.

Also, I've checked the import time and it is only 3 ms, plus the time for importing numpy + scipy.optimize.

@rgommers rgommers referenced this pull request
Closed

Nonparametric all #434

@rgommers
Collaborator

Summary of all the variable naming comments of Josef:

tdat --> data
txdat --> exog
tydat --> endog
all_vars --> endog_all / endog / data
X_j --> x_noti
G_y --> cdf_y / cdf_endog
N --> nobs
K_dep --> k_dep
no small L

Most of this is indeed standard throughout statsmodels. See http://statsmodels.sourceforge.net/devel/gettingstarted.html#design-matrices-endog-exog for endog/exog and http://statsmodels.sourceforge.net/devel/dev/naming_conventions.html for nobs/k.

I'd propose x_not_i for readability. The all_vars suggestion I'm not sure about, because:

all_vars = np.column_stack((self.tydat, self.txdat))

Other than those two I think these are good suggestions. @gpanterov what do you think?

@jseabold
Owner

I would even propose exog, endog -> X, Y. I'd like to revisit this package-wide before the pydata talk so I don't get the same suggestion from however many people again to change this.

@rgommers
Collaborator

That would be even better; has to happen at some point anyway.

@rgommers rgommers commented on the diff
statsmodels/nonparametric/kernels.py
((112 lines not shown))
+ -----
+ See p. 19 in [1]_ for details. The value of the kernel L if
+ :math:`X_{i}=x` is :math:`1-\lambda`, otherwise it is
+ :math:`\frac{1-\lambda}{2}\lambda^{|X_{i}-x|}`.
+
+ References
+ ----------
+ .. [1] Racine, Jeff. "Nonparametric Econometrics: A Primer," Foundation
+ and Trends in Econometrics: Vol 3: No 1, pp1-88., 2008.
+ http://dx.doi.org/10.1561/0800000009
+ .. [2] M.-C. Wang and J. van Ryzin, "A class of smooth estimators for
+ discrete distributions", Biometrika, vol. 68, pp. 301-309, 1981.
+ """
+ h, Xi, x, N, K = _get_shape_and_transform(h, Xi, x)
+ Xi = np.abs(np.asarray(Xi, dtype=int))
+ x = np.abs(np.asarray(x, dtype=int))
@rgommers Collaborator

Looking at the definition in the docstring and ref [1], the abs calls in the two lines above are incorrect. It's abs(Xi-x), not abs(abs(Xi) - abs(x)). Removing them in my speedup branch.
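
A minimal stand-alone sketch of the kernel value as the docstring states it, with abs taken of the difference only (the function name and signature here are hypothetical):

    import numpy as np

    def wang_ryzin_value(lam, Xi, x):
        # 1 - lam where Xi == x, otherwise 0.5 * (1 - lam) * lam ** |Xi - x|
        Xi = np.asarray(Xi, dtype=int)
        dist = np.abs(Xi - x)   # abs of the difference, not of each argument
        return np.where(dist == 0, 1.0 - lam, 0.5 * (1.0 - lam) * lam ** dist)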

@rgommers rgommers commented on the diff
statsmodels/nonparametric/kernels.py
((114 lines not shown))
+ :math:`X_{i}=x` is :math:`1-\lambda`, otherwise it is
+ :math:`\frac{1-\lambda}{2}\lambda^{|X_{i}-x|}`.
+
+ References
+ ----------
+ .. [1] Racine, Jeff. "Nonparametric Econometrics: A Primer," Foundation
+ and Trends in Econometrics: Vol 3: No 1, pp1-88., 2008.
+ http://dx.doi.org/10.1561/0800000009
+ .. [2] M.-C. Wang and J. van Ryzin, "A class of smooth estimators for
+ discrete distributions", Biometrika, vol. 68, pp. 301-309, 1981.
+ """
+ h, Xi, x, N, K = _get_shape_and_transform(h, Xi, x)
+ Xi = np.abs(np.asarray(Xi, dtype=int))
+ x = np.abs(np.asarray(x, dtype=int))
+ if K == 0:
+ return Xi
@rgommers Collaborator

This check is unnecessary, the _get_shape_and_transform call guarantees K >= 1. Removing it.

@josef-pkt
Owner

I still don't really like one letter names in the classes. Users can use the formula interface if they don't want to see endog/exog.

5c987e0#diff-1

@rgommers
Collaborator

Josef, given the number of people in favor of X/Y (both on-list and in Skipper's tutorials) and virtually no one except you liking endog/exog, I really hope you'll reconsider your opinion on this.

That issue is orthogonal to this PR though. So I propose to not discuss it further here, and if the status quo is still endog/exog at merge time of this PR, we should use it in this code.

@rgommers
Collaborator

@gpanterov: can you comment on the renaming proposals above? I'd like to get those out of the way, then we're pretty much done with this PR.

@rgommers
Collaborator

Time of nonparametric.test() is down to 16 seconds, and nonparametric.test('full') to 80 seconds on my machine. So this is getting more or less acceptable for merging. There's more that can be shaved off still, but it's OK-ish now I think.

@gpanterov

On the naming conventions:
tydat, txdat, eydat and exdat were borrowed from R's np package.
I like the idea of changing them so that they are more in line with statsmodels naming conventions. I like the brevity of X and Y, and if this is more in line with the rest of statsmodels I propose we go ahead with it. However, we should also consider renaming the arguments of KDE.cdf() and ConditionalKDE.cdf() and pdf(), which are currently exdat and eydat (e standing for evaluation, as opposed to training data, t). I propose eX and eY?

@josef-pkt
Owner

eX, eY is still uninformative; I never managed to guess what the e stands for.

our terminology in general is fit and predict

shorthand names can sometimes be useful inside a function, but for the outside I prefer descriptive names.

we don't have exdat yet in other models, just exog in predict. The full name would be :)

exog_predict

@rgommers
Collaborator

exog_predict is pretty uninformative too imho. Note that the already existing kdensity and kdensity_fft do use X already:

def kdensity(X, kernel="gau", bw="scott", weights=None, gridsize=None,
             adjust=1, clip=(-np.inf,np.inf), cut=3, retgrid=True):
    """
    Rosenblatt-Parzen univariate kernel density estimator

    Parameters
    ----------
    X : array-like
        The variable for which the density estimate is desired.

I see a bunch of predict methods, but no parameters with _predict appended.

Using fit() for kicking off bandwidth estimation is a good point though.

@josef-pkt
Owner

No, we don't have _predict postfix, because we never needed it.
standard signature is just predict(exog) since none of the other models needs to keep the original exog from the fit around at the same time. (predict signature will get more complicated with formulas)

Kernel methods are currently the only ones where we don't just have "parameters", and where we need the full original sample for "prediction".

For density estimation there is no real endog and exog, just the data, so I don't really care much. Similar, I usually don't use the endog/exog terminology in statistical tests.
But kernel regression is similar to the other model in that it estimates a relationship between endog and exog under some assumptions on the process that generated the data.

@rgommers
Collaborator

OK, all the renames done in my speedup-nonparametric branch. I included exog_predict and even exog_nonparametric_predict, because I realised the most confusing part of exog_predict is not _predict. So why not, at least it's consistent.

Adding fit presents a slight problem, since the regression classes already have a fit method (for doing the actual regression and marginal effects). If fit() were used for computing the bandwidth (I still don't see a good use case for why that's needed), what would we call the current fit()?

@josef-pkt
Owner

Ralf, thanks for working through this
Can you create a pull request from your branch, so we can review your version?

The way it sounds, I think we should merge it soon.

@gpanterov

Can we indicate somewhere that the CensoredReg, SemiLinear and SingleIndexModel classes are experimental? Or should we exclude them from the PR? They are a unique feature of statsmodels, i.e. they are not present in any other package to my knowledge. But because of this, I was not able to cross-check the output they give.

@josef-pkt
Owner

I would prefer to merge everything instead of splitting up the pull request again.

If there are some tests showing that they work, then I would just add a comment to the docstrings and leave them in, even if they are not verified against other packages.
If they are unfinished or might have problems, then we could also just move them temporarily to the sandbox.

(I haven't looked at the code for those yet.)

@rgommers
Collaborator

Test coverage should be increased first though if we want to merge everything at once:

  • TestFForm and SingleIndexModel are completely untested
  • SemiLinear and CensoredReg have only one test.

I propose to rename TestFForm to FFormTest by the way. No class name should start with Test, because nose will try to collect it as a test case.

George, do you have time to work on those tests in the near future? If not, I propose to leave those things out for now.

@rgommers
Collaborator

What do we do about the current KDE class, by the way? Rename it to KDE_1d?

@jseabold
Owner

Is there a name clash? Is something called KDE in the PR? If so, I think it should be renamed given that KDE is in use already. If not, I'm ok to change KDE to KDE_1d (or whatever), but we'd have to go through a deprecation period for the current KDE.

@rgommers
Collaborator

The main class is called KDE (renamed from UKDE). Josef's suggestion from the mailing list:

I would spell it out ConditionalKDE,  or KDEConditional,
I think shortening UKDE to KDE is fine, if there is no problem with
several KDE, eg. rename the other one to KDEUnivariate or UnivariateKDE.
@jseabold
Owner

I think at least initially we would have to have KDEMultivariate or MultivariateKDE; then we can deprecate KDE and move to this if it's what we really want to do, unless there are plans to separate by namespace (that still would be a bit confusing IMO).

@rgommers
Collaborator

Hmm, not sure what would be best. Basically, the current KDE is a specialization for 1-D continuous data. It's faster and has support for data on a finite domain. The new KDE can handle mixed continuous/ordered/unordered data and has cross-validated bandwidth selection. We should be able to add finite-domain support to it (longer term, non-trivial). So it looks to me like KDE should become KDE_1d or UnivariateKDE. However, given the current state that's indeed not very practical. The alternative is to rename the new KDE and explain how the classes are related in the two docstrings.

Splitting the two between two namespaces doesn't make sense.

@josef-pkt
Owner

I don't think we want to separate by namespace, too confusing, and there are no natural namespaces within kde.

One possibility would be to keep all the KDE with qualifiers.

The KDE in this PR uses product kernels, so we could also call it KDEProduct to distinguish it from scipy's gaussian_kde, which I think doesn't impose the product structure. KDEMultivariate is also fine (I'm not sure users care about "Product").

(other multivariate: nearest neighbor, rbf kernels ?)
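
For context, a minimal sketch of the product-kernel idea for continuous variables only (names and the Gaussian kernel choice are just for illustration; the actual code also handles ordered and unordered variables):

    import numpy as np

    def product_kernel_density(data, x, h):
        # data: (nobs, k_vars) sample, x: (k_vars,) point, h: one bandwidth per variable
        u = (data - x) / h
        k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)  # univariate Gaussian kernel
        return np.mean(np.prod(k / h, axis=1))            # product across variables

    rng = np.random.RandomState(0)
    sample = rng.randn(500, 2)
    print(product_kernel_density(sample, np.zeros(2), np.array([0.4, 0.4])))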

@rgommers
Collaborator

Users indeed won't care much. Multivariate sounds better than Product.

@jseabold
Owner

+1

@josef-pkt
Owner

Unless the untested parts will get tests soon, I prefer moving them to the sandbox.
That requires less maintenance, avoids bitrot, and we don't have to struggle with git merges.
The only disadvantage is that adding GitHub comments is easier in a PR than for code in the sandbox.

And the code can still be used even if it doesn't have a full test suite, and we can add tests incrementally.

(Example: we got a bug report for survival2, which is in the sandbox, but the sandbox code is outdated and no one has worked on the newer version in its branch in 7 months.
It would have been easier to merge and add test coverage piece by piece.)

@josef-pkt
Owner

Ralf, can you make a pull request from your branch?

This branch doesn't contain the extra classes SingleIndexModel, Censored, ...

Also, I would suggest splitting the module nonparametric2.py into at least two or three parts (density, regression, and maybe others?) and renaming nonparametric2 to "kernel_density.py" or something like that.

I only had a brief look at the extras. One of the main parts that might need expansion (later?) is to check what should be returned by fit; there are currently no results classes.

@rgommers
Collaborator

Yes, will get the last renames done and send a PR this week.

@gpanterov I'll follow Josef's suggestion on moving things that aren't tested to the sandbox; they can always be moved back as soon as we have tests.

@gpanterov

Please go ahead and move them to the sandbox Ralf. I was thinking along the same lines. I will have time to write some tests for them in the next few weeks. But I can't do it immediately ...

@rgommers
Collaborator

New PR opened, so closing this one.

@rgommers rgommers closed this
Commits on May 26, 2012
  1. @gpanterov: Multivar KDEs
  2. @gpanterov: Minor Change
Commits on Jun 9, 2012
  1. @gpanterov
  2. @gpanterov: Fixed bw/bwmethod
Commits on Jun 10, 2012
  1. @gpanterov
  2. @gpanterov
  3. @gpanterov
  4. @gpanterov
Commits on Jun 18, 2012
  1. @rgommers
  2. @rgommers: TST: nonparametric: convert UKDE example into a doctest.
     Doctests can be run with:

         from statsmodels import nonparametric
         nonparametric.test(doctests=True)

     Or "nosetests --with-doctest" in the nonparametric dir.
     Note that some other doctests seem to be failing at the moment.
Commits on Jun 22, 2012
  1. @gpanterov
  2. @gpanterov
  3. @gpanterov
Commits on Jun 25, 2012
  1. @rgommers: ENH: add __repr__ method to UKDE.
     CKDE method still to do. Also clean up some docstrings and code.
Commits on Jun 27, 2012
  1. @gpanterov: pep8 on KernelFunctions.py
  2. @gpanterov
Commits on Jun 29, 2012
  1. @gpanterov: edat fix
  2. @gpanterov: fixed edat
  3. @gpanterov: removed imse_slow and gpke2
Commits on Jun 30, 2012
  1. @gpanterov
  2. @gpanterov: some more changes
  3. @gpanterov: changes..
Commits on Jul 1, 2012
  1. @gpanterov: resolved conflicts
  2. @gpanterov
  3. @gpanterov
  4. @gpanterov: some minor fixes + clean up
  5. @gpanterov: fixed test failures
  6. @gpanterov: ...
  7. @gpanterov
Commits on Jul 4, 2012
  1. @gpanterov
Commits on Jul 7, 2012
  1. @gpanterov
  2. @gpanterov
Commits on Jul 8, 2012
  1. @gpanterov
Commits on Jul 10, 2012
  1. @gpanterov
  2. @gpanterov: added the derivative of the Gaussian kernel to be used for the
     calculation of the marginal effects in the regression
Commits on Jul 11, 2012
  1. @gpanterov
  2. @gpanterov
  3. @gpanterov: cleaned GPKE and PKE
  4. @gpanterov
  5. @gpanterov
  6. @gpanterov
  7. @gpanterov
  8. @gpanterov
Commits on Jul 12, 2012
  1. @gpanterov
  2. @gpanterov: removed the Epanechnikov Kernel. Could be added at a later time
     again if we decide to give the user an option to specify kernels. However
     most refs claim kernel not important
  3. @gpanterov
  4. @gpanterov
  5. @gpanterov
  6. @gpanterov
  7. @gpanterov
  8. @gpanterov
Commits on Jul 13, 2012
  1. @gpanterov
  2. @gpanterov: added repr method to Reg
  3. @gpanterov: small fix to adjust_shape
  4. @gpanterov
Commits on Jul 15, 2012
  1. @rgommers: BUG: nonparametric: fix three test failures due to incorrect usage
     of `fill`.
     Note also the added FIXME's. Some bugs are still left; points also to
     incomplete test coverage.
Commits on Jul 17, 2012
  1. @gpanterov: examples + kernel fix
Commits on Jul 22, 2012
  1. @gpanterov
  2. @gpanterov: fixed adjust_shape
  3. @gpanterov: some new fixes
Commits on Jul 23, 2012
  1. @gpanterov: removed pke
  2. @gpanterov
  3. @gpanterov
  4. @gpanterov: fixed np_tools
  5. @gpanterov: fixed pep8 issues
Commits on Jul 25, 2012
  1. @gpanterov
Commits on Jul 30, 2012
  1. @gpanterov
Commits on Aug 4, 2012
  1. @rgommers
  2. @rgommers
  3. @rgommers
  4. @rgommers: MAINT: rename KernelFunctions.py --> kernels.py
     Also remove some more unused imports.
Commits on Aug 13, 2012
  1. @gpanterov: Merge pull request #4 from rgommers/nonparametric-density
     Fixes / typos / small improvements for PR-408.
  2. @gpanterov
  3. @gpanterov: Merge branch 'nonparametric-density' of
     github.com:gpanterov/statsmodels into nonparametric-density
  4. @gpanterov
Commits on Aug 14, 2012
  1. @gpanterov
  2. @gpanterov: lower case names for kernels
Commits on Aug 19, 2012
  1. @gpanterov
  2. @gpanterov