# Nonparametric density #408

Closed
wants to merge 79 commits into from
+1,535 −0

### 4 participants

Nonparametric density estimators. Conditional and unconditional multivariate mixed-data estimation, with plug-in (normal-reference) and data-driven bandwidth selection methods (least-squares cross-validation and maximum-likelihood cross-validation).

added some commits May 26, 2012
gpanterov Multivar KDEs dd102b2
gpanterov Minor Change 56a25b3
statsmodels member
commented on statsmodels/nonparametric/nonparametric2.py in dd102b2 May 26, 2012

I think it would pay off to write tests for various shapes of this function.

For example, my guess is that the calls to kernel_func don't work if there is more than one variable of the same type — or do they? They might if the kernel functions are vectorized.

The kernel functions are vectorized. I actually tested this with multiple variables. For example, to obtain P(c1, c3 | c2):

dens_c = CKDE(tydat=[c1, c3], txdat=[c2], dep_type='cc',
              indep_type='c', bwmethod='normal_reference')
dens_c.pdf()

statsmodels member

nice, now I see that the bandwidth h is also vectorized.

statsmodels member
commented on statsmodels/nonparametric/nonparametric2.py in dd102b2 May 27, 2012

The way you've written the signature of this function, both arguments should be specified. So this note must be incorrect. Your TODO below (combining into one parameter) makes sense. I assume bw will then be a user-specified function (it doesn't actually say that in the description of bw) with a standard signature.

statsmodels member

Naming a method get_x implies that x is an already-existing attribute/number. Perhaps better to name it find_bw or compute_bw or similar.

statsmodels member
commented on statsmodels/nonparametric/nonparametric2.py in dd102b2 May 27, 2012

Could you indicate that this is Scott's rule of thumb? Silverman's is almost as popular I think.

Also, the method name is not so clear I think - could be named such that it's clear that this is a bandwidth estimate.
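For reference, both rules of thumb are easy to state. A minimal sketch of the normal-reference bandwidths for an (n, d) data array (illustrative only; `scott_bw` and `silverman_bw` are made-up names, not this PR's API):

```python
import numpy as np

def scott_bw(x):
    # Scott's rule of thumb: h_s = sigma_s * n**(-1/(d+4)) per dimension s
    n, d = x.shape
    return x.std(axis=0, ddof=1) * n ** (-1.0 / (d + 4))

def silverman_bw(x):
    # Silverman's rule: Scott's bandwidth times (4/(d+2))**(1/(d+4))
    n, d = x.shape
    c = (4.0 / (d + 2)) ** (1.0 / (d + 4))
    return c * x.std(axis=0, ddof=1) * n ** (-1.0 / (d + 4))

rng = np.random.RandomState(0)
x = rng.normal(size=(500, 1))
print(scott_bw(x), silverman_bw(x))
```

For d = 1 Silverman's factor is (4/3)^(1/5) ≈ 1.06, which is where the familiar 1.06 constant in normal-reference rules comes from.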

statsmodels member
commented on statsmodels/nonparametric/nonparametric2.py in dd102b2 May 27, 2012

The methods normal_reference, cv_ml and cv_ls are all private, right? They should only be called through get_bw. So start the names with an underscore.

statsmodels member
commented on statsmodels/nonparametric/nonparametric2.py in dd102b2 May 27, 2012

It would be good to explain in a few sentences what "conditional kde" actually is and give a reference. Conditional estimation is a lot less common than unconditional; unconditional is normally even left off ("kernel density estimation" refers to your UKDE class).
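For readers unfamiliar with the term: the conditional density is estimated as the ratio of a joint and a marginal KDE, f(y|x) = f(y, x)/f(x) (see e.g. Li & Racine, "Nonparametric Econometrics"). A self-contained 1-D sketch with Gaussian product kernels (illustrative only; `gauss_kde` and `cond_pdf` are hypothetical names, not this PR's API):

```python
import numpy as np

def gauss_kde(data, points, bw):
    # product Gaussian KDE; data: (n, d), points: (m, d), bw: (d,)
    u = (points[:, None, :] - data[None, :, :]) / bw       # (m, n, d)
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return (k.prod(axis=2) / np.prod(bw)).mean(axis=1)     # (m,)

def cond_pdf(y, x, y_eval, x_eval, bw_y, bw_x):
    # conditional density as the ratio f(y, x) / f(x)
    joint = gauss_kde(np.column_stack([y, x]),
                      np.column_stack([y_eval, x_eval]),
                      np.array([bw_y, bw_x]))
    marg = gauss_kde(x[:, None], x_eval[:, None], np.array([bw_x]))
    return joint / marg

rng = np.random.RandomState(42)
x = rng.normal(size=2000)
y = 2 * x + rng.normal(size=2000)          # y | x ~ N(2x, 1)
grid = np.linspace(-3, 3, 61)
fy_given_x0 = cond_pdf(y, x, grid, np.zeros_like(grid), 0.3, 0.3)
# conditioning on x = 0, the estimated density should peak near y = 0
print(grid[np.argmax(fy_given_x0)])
```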

statsmodels member
commented on statsmodels/nonparametric/nonparametric2.py in dd102b2 May 27, 2012

For keywords which can be left out, use None as the default parameter. False implies that this is a boolean variable. The check for whether the input was given is then written as if eydat is not None.

statsmodels member
commented on statsmodels/nonparametric/nonparametric2.py in dd102b2 May 27, 2012

This for-loop doesn't do anything. It only creates var, which isn't used below.

statsmodels member

column_stack (one of my favorites) could be used; however, concatenate and column_stack copy the data, AFAICS.

In general it might be better to work with views, and require users to concatenate themselves.
For example, in the LOO loop, tdat is already an array (view); if there is no concatenate call then the class would have a view instead of making a copy of the data. In most models we try to use views on the original data, exog and endog, although some calculations might create copies anyway.
(We never checked systematically whether we save any memory or are faster this way.)
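A quick way to check the copy-vs-view distinction being described here, using np.shares_memory (editorial snippet, not from the PR):

```python
import numpy as np

a = np.arange(12.0).reshape(4, 3)

col = a[:, 1]                                   # basic slicing returns a view
stacked = np.column_stack([a[:, 0], a[:, 1]])   # column_stack copies the data

print(np.shares_memory(a, col))      # view: shares memory with a
print(np.shares_memory(a, stacked))  # copy: does not
```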

statsmodels member
commented on statsmodels/nonparametric/nonparametric2.py in dd102b2 May 27, 2012

Same here, these for-loops don't do anything.

statsmodels member
commented on statsmodels/nonparametric/nonparametric2.py in dd102b2 May 27, 2012

index could be 1-D (a row selector); then the reshape is not necessary.
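To illustrate the point (editorial sketch, not the PR's code): with a 1-D integer or boolean row selector, fancy indexing already returns a 2-D array, so no reshape is needed:

```python
import numpy as np

X = np.arange(12.0).reshape(6, 2)

ix = np.array([0, 2, 5])   # 1-D integer row selector
rows = X[ix]               # shape (3, 2) directly, no reshape needed

mask = X[:, 0] > 4         # 1-D boolean row selector
sub = X[mask]              # also stays 2-D

print(rows.shape, sub.shape)
```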

added some commits Jun 9, 2012
gpanterov Implemented Suggestions by Ralf and Josef 8610a4a
gpanterov Fixed bw/bwmethod ab2439c
statsmodels member
commented on statsmodels/nonparametric/nonparametric2.py in 8610a4a Jun 10, 2012

pep8 doesn't put spaces around = in keyword arguments. Minor issue, but useful to get used to.

statsmodels member

sorry, I guess I wasn't clear before: below there are many spaces around = in keyword arguments

statsmodels member
commented on statsmodels/nonparametric/nonparametric2.py in 8610a4a Jun 10, 2012

This should be if bw is not None (or if not bw is None).

statsmodels member
commented on statsmodels/nonparametric/nonparametric2.py in 8610a4a Jun 10, 2012

Should this be compute_bw(self, bw=None, bwmethod=None)? It should be, if they are optional, even if one of the two is required.

I think, if it's possible, then there should be a recommended default, Scott's or Silverman - normal_reference?

statsmodels member
commented on statsmodels/nonparametric/nonparametric2.py in 8610a4a Jun 10, 2012

is None

added some commits Jun 9, 2012
gpanterov Introduced default for bandwidth selection (Josef) fb558e9
gpanterov Removed convolution and cumulative density 0702486
gpanterov least squares cross validation + changes in API - added tools module 19b7548
gpanterov Finished with cv_ls for continuous and ordered variables c771f52
statsmodels member
commented on statsmodels/nonparametric/nonparametric2.py in 8610a4a Jun 13, 2012

Should be

if edat is None:
    edat = self.tdat

statsmodels member
commented on 8610a4a Jun 13, 2012

When you're doing PEP8 fixes, make your life easier by running http://pypi.python.org/pypi/pep8 over the file(s). It will warn you when things are non-standard.

statsmodels member
commented on statsmodels/nonparametric/nonparametric2.py in ab2439c Jun 13, 2012

Should be if not isinstance(bw, basestring).

The else clause below should probably also check that the input is callable, like so: hasattr(bw, '__call__').
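A sketch of the dispatch being suggested (hypothetical `compute_bw` helper, not the PR's code; `basestring` was the Python 2 spelling, `str` on Python 3):

```python
import numpy as np

def compute_bw(data, bw):
    # Dispatch on the type of `bw`: selection-method name, user callable,
    # or an explicit array of bandwidths.
    if isinstance(bw, str):              # basestring on Python 2
        if bw == 'normal_reference':
            # normal-reference rule of thumb (1.06 factor, per dimension)
            n, d = data.shape
            return 1.06 * data.std(axis=0, ddof=1) * n ** (-1.0 / (d + 4))
        raise ValueError("unknown bandwidth method %r" % bw)
    elif callable(bw):                   # equivalent to hasattr(bw, '__call__')
        return np.asarray(bw(data))
    else:                                # user-specified bandwidth values
        return np.asarray(bw, dtype=float)

rng = np.random.RandomState(1)
data = rng.normal(size=(200, 2))
print(compute_bw(data, 'normal_reference'))
print(compute_bw(data, lambda x: x.std(axis=0)))
print(compute_bw(data, [0.5, 0.5]))
```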

statsmodels member
commented on statsmodels/nonparametric/nonparametric2.py in 19b7548 Jun 17, 2012

This took me a bit of puzzling. IMSE doesn't actually calculate the integral, so the name is a bit deceptive. I guess you don't need to explicitly calculate it if you're only using it from optimize.fmin.

Did you also plan to provide other metrics, like ISE or IAE?
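For context, the least-squares CV criterion behind IMSE is CV(h) = int fhat^2 - (2/n) sum_i fhat_{-i}(X_i); with a Gaussian kernel the squared-density integral has a closed form through the convolution kernel (a Gaussian with variance 2h^2), so no numerical integration is needed. A 1-D sketch (illustrative, not the PR's implementation):

```python
import numpy as np
from scipy.optimize import fmin

def norm_pdf(u, s):
    return np.exp(-0.5 * (u / s) ** 2) / (s * np.sqrt(2 * np.pi))

def lscv(h, x):
    # CV(h) = int fhat^2 - (2/n) * sum_i fhat_{-i}(x_i); the squared-density
    # integral uses the Gaussian convolution kernel (std dev h*sqrt(2)).
    n = len(x)
    diff = x[:, None] - x[None, :]
    int_fhat_sq = norm_pdf(diff, h * np.sqrt(2)).sum() / n ** 2
    kij = norm_pdf(diff, h)
    loo = (kij.sum(axis=1) - norm_pdf(0.0, h)) / (n - 1)  # leave-one-out fhat
    return int_fhat_sq - 2.0 * loo.mean()

rng = np.random.RandomState(0)
x = rng.normal(size=300)
# optimize over log(h) so the bandwidth stays positive during the search
h_opt = np.exp(fmin(lambda lh: lscv(np.exp(lh[0]), x), np.log(0.3),
                    disp=False)[0])
print(h_opt)
```

ISE and IAE would follow the same pattern with a different loss on fhat.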

statsmodels member
commented on c771f52 Jun 17, 2012

I think the purpose of the convolution kernels and how to use them needs some explanation. So far they're only used in UKDE.IMSE as far as I can tell.

and others added some commits Jun 18, 2012
rgommers TST: convert test script to actual unit tests. Add UKDE, CKDE to __init__.py bc3f30e
rgommers TST: nonparametric: convert UKDE example into a doctest. Doctests can be run with "from statsmodels import nonparametric; nonparametric.test(doctests=True)" or "nosetests --with-doctest" in the nonparametric dir. Note that some other doctests seem to be failing at the moment. 7efea19
gpanterov Added lst sq cv for conditional density (slow) 0163e3b
gpanterov Vectorized parts of IMSE in CKDE. Significantly improved performance 5ec2c48
gpanterov Added various tests. Most performe well but there are still issues with speed 21a19a9
rgommers ENH: add __repr__ method to UKDE. CKDE method still to do. Also clean up some docstrings and code. 08eaa00
gpanterov pep8 on KernelFunctions.py 7985008
gpanterov pep8 on nonparametric2.py and np_tools.py b4815c9
gpanterov edat fix 2ccbae0
gpanterov fixed edat f175261
gpanterov removed imse_slow and gpke2 2942de4
gpanterov added graphical tests for several distributions 1182a8c
gpanterov some more changes eb198fa
gpanterov changes.. 22ea70d
gpanterov resolved conflicts e56f4db
gpanterov expanded doc strings for Generic KDE and UKDE 4eed5e0
gpanterov fixed doc strings and included latex formulas 7eeb5e3
gpanterov some minor fixes + clean up 3bad1da
gpanterov fixed test failures 51a175f
gpanterov ... b182b9a
gpanterov added latex formulas for kernels in KernelFunctions.py 877b63d
statsmodels member
commented on statsmodels/nonparametric/KernelFunctions.py in 877b63d Jul 2, 2012

I'd reserve LaTeX for formulas that are at least somewhat complicated. This would be better written as plain text.

statsmodels member
commented on statsmodels/nonparametric/np_tools.py in 877b63d Jul 2, 2012

Probably better to describe what's different from GPKE. Anything besides the summing? I thought this was vectorized too, can't that be reused?

statsmodels member
commented on statsmodels/nonparametric/nonparametric2.py in 51a175f Jul 2, 2012

Not actually implemented yet (commented out below). Wouldn't it be easier to return the sorted tdat or edat than ix?

I left it inactive because it gets confusing when you have more than one variable. What should you sort by in the multivariate case?

statsmodels member

Perhaps all of them, first on axis 0, then 1, etc.?
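That multi-key sort can be done with np.lexsort; a small editorial sketch (not the PR's code):

```python
import numpy as np

# Sort rows of a 2-D array: first on column 0, ties broken by column 1, etc.
X = np.array([[2, 1],
              [1, 3],
              [2, 0],
              [1, 2]])

# np.lexsort treats the LAST key as the primary one, so pass the columns
# in reverse order to make column 0 the primary sort key.
ix = np.lexsort(X.T[::-1])
print(X[ix])
```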

statsmodels member
commented on 3bad1da Jul 2, 2012

If you don't need code anymore, better to delete it.

statsmodels member
commented on f175261 Jul 2, 2012

This fix really needs a test for edat usage.

statsmodels member
commented on statsmodels/nonparametric/nonparametric2.py in 4eed5e0 Jul 2, 2012

The old version (array_like) is actually the correct one.

statsmodels member
commented on 1182a8c Jul 2, 2012

I think you meant these as examples, right? Nothing is actually tested. Matplotlib is only an optional dependency of statsmodels, so you should only use it within a function or with a conditional import (i.e. within a try-except).
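A minimal sketch of the conditional-import pattern being asked for (illustrative; `plot_density` is a made-up name):

```python
# Keep matplotlib optional: import inside try/except and degrade gracefully
# when it isn't installed.
try:
    import matplotlib.pyplot as plt
    have_matplotlib = True
except ImportError:
    have_matplotlib = False

def plot_density(x, fx):
    if not have_matplotlib:
        return  # or raise ImportError with a helpful message
    plt.plot(x, fx)
    plt.show()

print(isinstance(have_matplotlib, bool))
```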

 gpanterov added nonparametric regression + tests for ordered case 7c7ae88
statsmodels member
commented on statsmodels/nonparametric/nonparametric2.py in 7c7ae88 Jul 4, 2012
statsmodels member

...

statsmodels member
commented on statsmodels/nonparametric/tests/test_nonparametric2.py in 7c7ae88 Jul 6, 2012

doesn't work for me

import statsmodels.nonparametric.nonparametric2 as nparam ?

statsmodels member
commented on 7c7ae88 Jul 6, 2012

Leaving plt.show() in a test hangs the test run for me.
Also the matplotlib import needs to be protected (try/except).

statsmodels member

pareto graph is just a line at 1e100
laplace and powerlaw(?) seem to have problems close to the boundaries

one print statement is left somewhere when running the tests

tests run without failures, but are slow: 244 seconds. We need to figure out a way to shorten this or mark some as slow before merging.

(These are things that need to be fixed before merging, but can be OK, or not our business, in a development branch.)

statsmodels member

some of the test cases would make nice example scripts

statsmodels member

I think the class names need to be made longer and more descriptive. Only the most famous models are allowed to have acronyms. KDE is ok, but UKDE doesn't tell me anything.

gpanterov added fast unconditional multivariate cdf and tested with continuous and mixed data 125c66a
statsmodels member
commented on statsmodels/nonparametric/KernelFunctions.py in 125c66a Jul 7, 2012

special.ndtr(x) has the cdf for standard normal, used by scipy.stats.distributions.norm
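For illustration, scipy.special.ndtr gives the standard normal CDF directly, and a Gaussian CDF-kernel estimate of F(x) is just the average of ndtr((x - X_i)/h) (editorial sketch, not the PR's kernel code):

```python
import numpy as np
from scipy.special import ndtr

def kde_cdf(x, data, h):
    # Gaussian CDF-kernel estimate of F(x): average of ndtr((x - X_i) / h)
    return ndtr((x - data) / h).mean()

rng = np.random.RandomState(0)
data = rng.normal(size=5000)
print(ndtr(0.0))            # standard normal CDF at 0, i.e. 0.5
print(kde_cdf(0.0, data, 0.2))
```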

statsmodels member
commented on 125c66a Jul 7, 2012

nice, I like having the cdf available. Azzalini (IIRC) mentioned that the rate (as a function of n) for the bandwidth for the cdf should be smaller (?) than for the pdf. Did you see anything about bw choice for the cdf?

statsmodels member

Fast, and seems to work well. At least in 1-D, converges nicely to the empirical CDF.

statsmodels member

@josef-pkt that's what I thought too, bandwidth isn't the same as for pdf.

statsmodels member

Can you factor out all the code related to asarray, K, N and reshaping? It's more than 10 lines that are duplicated in every single kernel function. Should be something like

def _get_shape_and_type(Xi, x, kind='c'):
    ...
    return Xi, x, K, N
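One possible shape of such a helper, following the reviewer's suggested signature (a hypothetical sketch, not the PR's eventual code):

```python
import numpy as np

def _get_shape_and_type(Xi, x, kind='c'):
    # Hypothetical helper factoring out the asarray/K/N/reshape boilerplate.
    # `kind` would select continuous/ordered/unordered handling; unused here.
    Xi = np.asarray(Xi, dtype=float)
    if Xi.ndim == 0:        # a single scalar observation
        Xi = Xi.reshape(1, 1)
    elif Xi.ndim == 1:      # one variable, many observations
        Xi = Xi[:, None]
    N, K = Xi.shape
    x = np.asarray(x, dtype=float).reshape(K)
    return Xi, x, K, N

Xi, x, K, N = _get_shape_and_type([1.0, 2.0, 3.0], 0.5)
print(Xi.shape, x.shape, K, N)
```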

statsmodels member

The UKDE, CKDE interface now doesn't allow specifying the kernels to use. The Epanechnikov kernel isn't used at all. Are you planning to expand that interface? In any case, what the default kernels are should be documented.

statsmodels member

I have the feeling that the for-loop in GPKE can still be optimized, it's very expensive now. You can see this easily by profiling in IPython. Use for example %prun dens_scott.cdf() after having run the below script. 33000 function calls for a 1-D example with 1000 points.

import numpy as np
from scipy import stats

from statsmodels.sandbox.distributions.mixture_rvs import mixture_rvs
from statsmodels.nonparametric import UKDE
from statsmodels.tools.tools import ECDF

import matplotlib.pyplot as plt

np.random.seed(12345)
obs_dist = mixture_rvs([.25, .75], size=1000, dist=[stats.norm, stats.norm],
                       kwargs=(dict(loc=-1, scale=.5), dict(loc=1, scale=.5)))

dens_scott = UKDE(tdat=[obs_dist], var_type='c', bw='normal_reference')
est_scott = dens_scott.pdf()

xx = np.linspace(-4, 4, 100)
est_scott_cdf = dens_scott.cdf(xx)
ecdf = ECDF(obs_dist)

# Plot the estimated cdf against the empirical cdf
fig = plt.figure()
plt.plot(xx, est_scott_cdf, 'b.-')
plt.plot(ecdf.x, ecdf.y, 'r.-')

plt.show()

statsmodels member

The for-loop could be over the number of variables instead of the data points, right?

statsmodels member

It looks that way to me too. @gpanterov is there any reason this couldn't be vectorized over the number of observations?
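To illustrate what that would look like (editorial sketch; the real gpke signature differs): the product kernel only needs a Python loop over the K variables, with the N observations handled by vectorized array operations:

```python
import numpy as np

def gpke_loop_over_vars(bw, tdat, edat):
    # Product Gaussian kernel evaluated with a loop over the K variables only;
    # the N observations are handled by vectorized numpy operations.
    # tdat: (N, K) training data, edat: (K,) evaluation point, bw: (K,).
    N, K = tdat.shape
    prod = np.ones(N)
    for s in range(K):                     # K iterations, not N
        u = (edat[s] - tdat[:, s]) / bw[s]
        prod *= np.exp(-0.5 * u ** 2) / (bw[s] * np.sqrt(2 * np.pi))
    return prod.sum() / N                  # the KDE value at edat

rng = np.random.RandomState(3)
tdat = rng.normal(size=(1000, 2))
print(gpke_loop_over_vars(np.array([0.3, 0.3]), tdat, np.zeros(2)))
```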

statsmodels member
commented on statsmodels/nonparametric/np_tools.py in 125c66a Jul 7, 2012

Indentation not equal here.

statsmodels member
commented on statsmodels/nonparametric/tests/test_nonparametric2.py in 125c66a Jul 7, 2012

Need a blank line here.

added some commits Jul 7, 2012
gpanterov added conditional multivariate cdf; tests for continuous and mixed; code needs cleaning e29e026
gpanterov nonparametric regression with margina fx, R2, signiciance 04895ff
gpanterov Fixed npreg CV.LS; added tests for continuous and mixed for mean, R2, bandwidth 46e076d
gpanterov added the derivative of the Gaussian kernel to be used for the calculation of the marginal effects in the regression 292c0a3
gpanterov implemented some suggestions by Josef and Ralf b66e698
gpanterov created only one GPKE function for all classes 3a29a93
gpanterov cleaned GPKE and PKE 398acf1
gpanterov moved distribution plots from test file to examples file (yet to be committed 8b14a1a
gpanterov added Reg class to __init__.py fc84351
gpanterov added univariate kde example for 6 common distributions. merged with main nonparametric branch 6a4fa5a
statsmodels member
commented on 6a4fa5a Jul 11, 2012

Please add import matplotlib.pyplot as plt.

statsmodels member

Pareto is still broken.

statsmodels member

The Weibull plot shows an interesting issue - finite support isn't handled in the UKDE, CKDE classes. Close to 0 this goes wrong.

Broken Pareto -- strange. It works for me. Maybe it depends on the seed. With np.random.seed(123456) I get the following plot:
http://statsmodels-np.blogspot.com/2012/07/kde-estimate-of-pareto-rv.html
The density estimate isn't very good, but this could be due to the relatively small sample size. Will add the seed.

statsmodels member

With that seed I get the same result as you. The result is still quite far off though, again due to not dealing with support.

added some commits Jul 11, 2012
gpanterov added import plt to ex_univar_kde.py 9242513
gpanterov added seed 123456 to ex_univar_kde.py c7fe86c
statsmodels member
commented on statsmodels/nonparametric/KernelFunctions.py in 7985008 Jul 11, 2012

pep 8 also prescribes lower case for function names:

http://www.python.org/dev/peps/pep-0008/#function-names

statsmodels member
commented on statsmodels/nonparametric/nonparametric2.py in dd102b2 Jul 12, 2012

Probably want to avoid underscores in class names unless it's to mark the class as private. CamelCase is almost always good enough.

statsmodels member

You also want to stick explicitly with new-style classes, i.e. all of your classes should inherit from object:

class GenericKDE(object):
statsmodels member
commented on statsmodels/nonparametric/tests/test_nonparametric2.py in 21a19a9 Jul 12, 2012

Not a huge deal, but you want a class method here or this gets called for every test method. I.e.,

@classmethod
def setUp(cls):
    ...
 gpanterov added docstrings for Reg class. added repr method for CKDE 9ed520b
statsmodels member
commented on statsmodels/nonparametric/tests/test_nonparametric2.py in 21a19a9 Jul 12, 2012

Could you post scripts somewhere to compare the output for the full datasets? I'd like to compare the performance.

Sure, I can do that. But it becomes quite slow if you include the entire data set and use the data-driven bandwidth estimates (especially least-squares cross-validation).

statsmodels member

Yeah, don't put it in the tests, but if you could put it as an example script somewhere I can play with that would be helpful.

gpanterov removed the Epanechnikov Kernel. Could be added at a later time again if we decide to give the user an option to specify kernels. However most refs claim kernel not important 02b3781
statsmodels member
commented on statsmodels/nonparametric/nonparametric2.py in 46e076d Jul 12, 2012

This could be a property since it doesn't need any arguments. I'm not sure about caching it yet since I don't know all the moving parts.

added some commits Jul 11, 2012
gpanterov made nonparametric2.py pep 8 compliant 7e35ec4
gpanterov made np_tools.py pep 8 compliant ee2cd70
gpanterov made KernelFunctions.py pep 8 compliant b828625
gpanterov made test_nonparametric2.py pep 8 compliant 616d956
gpanterov naming conventions suggested by Skipper be6eb11
gpanterov added local linear estimator to Reg. modified Reg api. added tests 85d0773
gpanterov fixed a minor issue with CKDE.cdf, modified adjust_shape function 46a9a8f
gpanterov added repr method to Reg 008ef14
gpanterov small fix to adjust_shape 2a72d5f
gpanterov added an example with multivariate bimodal distribution plot (cv_ml) 285297e
statsmodels member
commented on 285297e Jul 14, 2012

My version of MPL (1.0.1) doesn't have ax.zaxis. What version are you on? Can you leave out those 2 lines, or replace them with something that works for multiple versions?

Also, you could add a second plot of the same density with imshow(Z). While the 3-D version is fancier, I find the 2-D one much easier to interpret.

Other than that, looks good.

statsmodels member
commented on statsmodels/examples/ex_multivar_UKDE.py in 285297e Jul 14, 2012

You don't need A, B and the for-loop. These 8 lines can be replaced by

ix = np.random.uniform(size=N) > 0.5
V = np.random.multivariate_normal(mu1, cov1, size=N)
V[ix, :] = np.random.multivariate_normal(mu2, cov2, size=N)[ix, :]

statsmodels member
commented on statsmodels/examples/ex_multivar_UKDE.py in 285297e Jul 14, 2012

This for-loop can also be removed. Three lines above can be replaced by

edat = np.column_stack([X.ravel(), Y.ravel()])
Z = dens.pdf(edat).reshape(X.shape)


It would be good to document in the pdf method that the required shape for edat is (num_points, K), with K the number of dimensions (available as the K attribute).

and others added some commits Jul 15, 2012
rgommers BUG: nonparametric: fix three test failures due to incorrect usage of fill. Note also the added FIXME's. Some bugs are still left; points also to incomplete test coverage. a957da0
gpanterov examples + kernel fix 6fb2c1f
gpanterov optimizing gpke + kernels for speed 919aa97
gpanterov fixed adjust_shape 00fa9f0
gpanterov some new fixes 1383dba
gpanterov removed pke bc9463d
gpanterov cleaned up KernelFunctions.py 2888c1d
gpanterov preparing the density estimators for pull request (without the regression) ea9b587
gpanterov fixed np_tools 656f96a
gpanterov fixed pep8 issues 0c105cf
statsmodels member

You forgot to remove the Reg class.

statsmodels/examples/ex_multivar_UKDE.py
@@ -0,0 +1,55 @@
+#import nonparametric2 as nparam
+import statsmodels.nonparametric as nparam
+import scipy.stats as stats
+import numpy as np
+import matplotlib.pyplot as plt
+from mpl_toolkits.mplot3d import axes3d
+from matplotlib import cm
+from matplotlib.ticker import LinearLocator, FormatStrFormatter

statsmodels member rgommers added a note Jul 24, 2012: This line isn't needed, nor is line 1 (the commented-out import).
 gpanterov removed reg class from nonparametric2.py 09d761a
statsmodels/nonparametric/nonparametric2.py
+            \sum_{j=1,j\neq i}K_{h}(X_{i},X_{j})
+
+        where :math:`K_{h}` represents the
+        Generalized product kernel estimator:
+
+        .. math:: K_{h}(X_{i},X_{j})=
+            \prod_{s=1}^{q}h_{s}^{-1}k\left(\frac{X_{is}-X_{js}}{h_{s}}\right)
+        """
+
+        LOO = tools.LeaveOneOut(self.tdat)
+        i = 0
+        L = 0
+        for X_j in LOO:
+            f_i = tools.gpke(bw, tdat=-X_j, edat=-self.tdat[i, :],
+                             var_type=self.var_type)
+            i += 1

statsmodels member rgommers added a note Jul 25, 2012: This i = 0; i += 1 is a bit un-Pythonic. You can get i from the for-loop like so: for i, X_j in enumerate(LOO):.
statsmodels/nonparametric/nonparametric2.py
 +""" + +import numpy as np +from scipy import integrate, stats +import np_tools as tools +import scipy.optimize as opt +import KernelFunctions as kf + +__all__ = ['UKDE', 'CKDE'] + + +class GenericKDE (object): + """ + Generic KDE class with methods shared by both UKDE and CKDE + """ + def compute_bw(self, bw):
 statsmodels member rgommers added a note Jul 25, 2012 This isn't a public method, so should start with an underscore. to join this conversation on GitHub. Already have an account? Sign in to comment
statsmodels member

Please mark the current slow tests with @dec.slow from numpy.testing, as we discussed several times. You can add a test with smaller input as non-slow.

and others added some commits Jul 30, 2012
gpanterov implemented suggestions by Ralf c67e189
rgommers MAINT: fix many small typos and inconsistencies in UKDE/CKDE classes. 7faa745
rgommers MAINT: clean up nonparametric/np_tools.py and remove unused imports. 25818f2
rgommers MAINT: clean up nonparametric/KernelFunctions.py c9daa4e
rgommers MAINT: rename KernelFunctions.py --> kernels.py. Also remove some more unused imports. a6be4fe
statsmodels member
commented Aug 4, 2012

I've fixed a lot of small issues in gpanterov#4. Fixing was much quicker than noting all of them here.

statsmodels member
commented Aug 4, 2012

Other things to still address related to current code:

• gpke() still takes kernel function keywords, but there aren't any alternatives. Remove?
• imse() documented Return value (CV) is incorrect, it doesn't return a function.
• AitchisonAitken() returns a (N, K) array, while WangRyzin() is documented as returning a float. This cannot be correct.
• Gaussian() and convolution/cdf kernels don't have docstrings yet.
• function names in kernels.py should be lower case (PEP8)

EDIT: (16 Aug) these points are taken care of now.

statsmodels member
commented Aug 4, 2012

More points:

• tests not marked with @dec.slow still take 60 seconds to run, this needs to be improved by running tests with smaller input data. Marking more with @dec.slow is not an option, because it will just mean most tests won't be run by default.
• UKDE/CKDE names aren't too informative. Josef asked before for some better names.
• An example with ordered/unordered data really needs to be added. The examples folder now has two examples, one with 1-D continuous input and one with 2-D continuous input.

I've reviewed all code now, that's all I saw.

statsmodels member
commented Aug 4, 2012

One more thing: Josef asked you about what it would take to handle distributions with finite support correctly. See plots for Pareto and Weibull distributions in ex_univar_kde.py for an example of where this would help. Have you thought about that?

added some commits Aug 13, 2012
gpanterov Merge pull request #4 from rgommers/nonparametric-density: Fixes / typos / small improvements for PR-408. 7e732d8
gpanterov fixed minor issue with example in CKDE class b705644
gpanterov Merge branch 'nonparametric-density' of github.com:gpanterov/statsmodels into nonparametric-density aef14c6
gpanterov fixed issues with documentation of kernels 1fc6777
gpanterov lower case names of kernels -- pep8 compliant 7be7228
gpanterov lower case names for kernels 70ce4a4
commented on the diff Aug 14, 2012
statsmodels/nonparametric/np_tools.py
@@ -0,0 +1,139 @@
+import numpy as np
+
+from . import kernels

statsmodels member jseabold added a note Aug 14, 2012: Can you change this to an absolute import?

statsmodels member rgommers added a note Aug 14, 2012: I thought you wanted relative ones? Or was that in the past?

statsmodels member jseabold added a note Aug 14, 2012: I think we settled on relative in api, absolute everywhere else. I usually debug in the source directory.
commented on the diff Aug 14, 2012
statsmodels/nonparametric/nonparametric2.py
+        The leave-one-out kernel estimator of :math:`f_{-i}` is:
+
+        .. math:: f_{-i}(X_{i})=\frac{1}{(n-1)h}
+            \sum_{j=1,j\neq i}K_{h}(X_{i},X_{j})
+
+        where :math:`K_{h}` represents the generalized product kernel
+        estimator:
+
+        .. math:: K_{h}(X_{i},X_{j}) =
+            \prod_{s=1}^{q}h_{s}^{-1}k\left(\frac{X_{is}-X_{js}}{h_{s}}\right)
+        """
+        LOO = tools.LeaveOneOut(self.tdat)
+        L = 0
+        for i, X_j in enumerate(LOO):
+            f_i = tools.gpke(bw, tdat=-X_j, edat=-self.tdat[i, :],
+                             var_type=self.var_type)
commented on the diff Aug 15, 2012
statsmodels/nonparametric/nonparametric2.py
+        func: function
+            For the log likelihood should be numpy.log
+
+        Notes
+        -----
+        The leave-one-out kernel estimator of :math:`f_{-i}` is:
+
+        .. math:: f_{-i}(X_{i})=\frac{1}{(n-1)h}
+            \sum_{j=1,j\neq i}K_{h}(X_{i},X_{j})
+
+        where :math:`K_{h}` represents the generalized product kernel
+        estimator:
+
+        .. math:: K_{h}(X_{i},X_{j}) =
+            \prod_{s=1}^{q}h_{s}^{-1}k\left(\frac{X_{is}-X_{js}}{h_{s}}\right)
+        """

statsmodels member jseabold added a note Aug 15, 2012: Interesting parallelization results. I don't know if you want to play around with this that much, but you don't start seeing gains for using all cores until nobs > 1500 or so, and even then it's modest (i.e., barely worth it). I guess the overhead of joblib is still costlier than the computations, but I would've expected it to be a bit faster earlier. There may be other hotspots that this benefits from; you can use this

    var_type = self.var_type
    tdat = self.tdat
    LOO = tools.LeaveOneOut(tdat)
    from statsmodels.tools.parallel import parallel_func
    parallel, p_func, n_jobs = parallel_func(tools.gpke, n_jobs=-1, verbose=0)
    L = sum(map(func, parallel(p_func(bw, tdat=-X_j, edat=-tdat[i, :],
                                      var_type=var_type)
                               for i, X_j in enumerate(LOO))))

Alternatively we're going to have to look elsewhere for speed gains.

statsmodels member jseabold added a note Aug 15, 2012: Forgot the

    return -L

statsmodels member josef-pkt added a note Aug 15, 2012: joblib needs to work in batches, especially on Windows; see the mailing list thread "playing with joblib", 10/10/2011, Alexandre's comments early on.

statsmodels member rgommers added a note Aug 16, 2012: I think it's better to look at the algorithms again in some more detail (i.e. with a profiler) instead of at joblib. Joblib also isn't going to yield more than a factor of a couple (2 at most for me); we're looking for more than that still.

statsmodels member jseabold added a note Aug 19, 2012: 2-8x is a pretty decent speed-up for things that take 1-2 minutes. This is embarrassingly parallel, so we should be able to take advantage here. I still think we want to use binning + FFT for the default Gaussian kernel. This is going to yield a 70-300x speed-up for each evaluation according to the literature and my experience with doing univariate. Newer multipole-type methods will yield 2-3x beyond this, but it looks more complicated than I have time for. I'm going to see if I can't look through a copy of Silverman or Wand and Jones tomorrow for the details.

statsmodels member rgommers added a note Aug 19, 2012: Note that George already implemented batching/blocking, with large speedups. About FFT, there then needs to be a solution for evaluating at certain points or on non-equidistant grids.

statsmodels member jseabold added a note Aug 19, 2012: Ah, good. I'll have to catch up (currently homeless with limited internet). Looking into the changes needed for FFT.

statsmodels member rgommers added a note Aug 19, 2012: Note that that's not in this PR, but it's in the nonparametric-all branch.

statsmodels member rgommers added a note Aug 19, 2012: Reading up some more about FFT methods, I found http://books.google.nl/books?hl=en&lr=&id=AyJ9xrrwDnIC&oi=fnd&pg=PA203&dq=density+estimation+multipole&ots=K_IBmhGVTY&sig=yrzJOhpPe4d2-2i_xCbeFpdE6hE#v=onepage&q=density%20estimation%20multipole&f=false which gives the following scalings:

    O(N^D log(N^D))  # FFT
    O(D N^2)         # direct summation (like in this PR)

It also gives a reference to an empirical study (Wand, "Fast Computation of Multivariate Kernel Estimators", 1994) showing that for D=3 the speed-up of FFT over direct summation is at most 5 for 10k samples. This scaling is the reason that FFT methods are mostly restricted to 1-D.

statsmodels member josef-pkt added a note Aug 19, 2012: I have not much idea about the FFT in the multivariate KDE case, but O(N^D log(N^D)) might not be correct for product kernels; my guess would be that the terms are N*D instead of N**D for product kernels. I think we should leave the work on speed improvements for later, so we can get the current work merged, and so this discussion doesn't get lost with a git rebase.

statsmodels member rgommers added a note Aug 19, 2012: That's a good point, we don't want to rebase this PR. That should happen in a new PR, otherwise many of these comments will be lost.

statsmodels member rgommers added a note Aug 19, 2012: I'm finding lots of references to implementations of the Fast Gauss Transform, but nothing that's directly reusable. That would be the way to go.

statsmodels member rgommers added a note Aug 19, 2012: FigTree seems to be the best available implementation, but it's LGPL.

statsmodels member jseabold added a note Aug 19, 2012: Thanks for the references. I don't think these are directly comparable though. The benchmark for the Wand study is not direct summation; it's a multivariate simple binning algorithm mentioned in Scott, written in Fortran AFAIK. The reference in Scott for this is "On Frequency Polygons and Average Shifted Histograms in Higher Dimensions", Hjort 1986. My impression for the product kernels is the same as Josef's, though I'm not as familiar with this paper. I'm not proposing multivariate gridding, and I could be wrong; these are just impressions right now. Just the way we have it now, we are essentially doing univariate kernel density estimation, which should be easy to speed up, since we'd only do the binning once and then evaluate at the different points with different bandwidths, which should give the speed-ups I mentioned. I just need to figure out how to make this possible.

statsmodels member rgommers added a note Aug 19, 2012: Thanks. I didn't read the Wand paper, just the summary of it in the book I linked, so I may have misunderstood the reference case.
statsmodels member

Open points from mailing list discussion:

• renaming: UKDE --> KDE should be done. CKDE may be renamed to KDEConditional or ConditionalKDE (preference Josef).
• Skipper has a preference for a do-nothing __init__() and fit() to do bw estimation. Related to #429 (don't see exactly how though).
• as mentioned above too, the issue of densities with finite support. Need to at least note as an open issue.
commented on the diff Aug 17, 2012
statsmodels/nonparametric/kernels.py
+    if Xi.ndim > 1:
+        K = np.shape(Xi)[1]
+        N = np.shape(Xi)[0]
+    elif Xi.ndim == 1:  # One variable with many observations
+        K = 1
+        N = np.shape(Xi)[0]
+    else:  # ndim == 0, so Xi is a single point (number)
+        K = 1
+        N = 1
+
+    assert N >= K  # Need more observations than variables
+    Xi = Xi.reshape([N, K])
+    return h, Xi, x, N, K
+
+
+def aitchison_aitken(h, Xi, x, num_levels=False):

statsmodels member josef-pkt added a note Aug 17, 2012: (Not sure if we need to change this.) I would prefer the reversed order of the arguments, (x, Xi, h). I like h last (in case we get default arguments and h as a keyword), and the order x, Xi mainly to read it as a function of x given Xi and h (not sure about this, since standard kernel notation is K(Xi, x)?). With x and Xi it's not obvious to me from the names which is which (I often use xi meaning x_i subscript, the i-th observation; what does i stand for in the training set?). Maybe data instead of Xi would be more informative. (We need kernel functions for other parts, but I don't know or recall the details.)
commented on the diff Aug 17, 2012
statsmodels/nonparametric/kernels.py
+    h = np.asarray(h, dtype=float)
+    Xi = np.asarray(Xi)
+    # More than one variable with more than one observation
+    if Xi.ndim > 1:
+        K = np.shape(Xi)[1]
+        N = np.shape(Xi)[0]
+    elif Xi.ndim == 1:  # One variable with many observations
+        K = 1
+        N = np.shape(Xi)[0]
+    else:  # ndim == 0 so Xi is a single point (number)
+        K = 1
+        N = 1
+
+    assert N >= K  # Need more observations than variables
+    Xi = Xi.reshape([N, K])
+    return h, Xi, x, N, K

statsmodels member josef-pkt added a note Aug 17, 2012
It's not clear to me what the actual shape restrictions of the kernels are. The docstrings of the individual kernel functions are ambiguous: are the kernels for univariate, single multivariate, or really many multivariate observations?

statsmodels member josef-pkt added a note Aug 17, 2012
I don't see any unit tests for the kernel functions themselves.
commented on the diff Aug 17, 2012
statsmodels/nonparametric/kernels.py
 + discrete distributions", Biometrika, vol. 68, pp. 301-309, 1981. + """ + h, Xi, x, N, K = _get_shape_and_transform(h, Xi, x) + Xi = np.abs(np.asarray(Xi, dtype=int)) + x = np.abs(np.asarray(x, dtype=int)) + if K == 0: + return Xi + + kernel_value = (0.5 * (1 - h) * (h ** abs(Xi - x))) + kernel_value = kernel_value.reshape([N, K]) + inDom = (Xi == x) * (1 - h) + kernel_value[Xi == x] = inDom[Xi == x] + return kernel_value + + +def gaussian(h, Xi, x):
 statsmodels member josef-pkt added a note Aug 17, 2012 In general, we need more continuous kernels than just gaussian, especially ones with bounded/compact support statsmodels member rgommers added a note Aug 19, 2012 Handling bounded support would be very useful. Having other continuous kernels besides gaussian that aren't fundamentally different (like Epanechnikov) would be at the bottom of my list, it's pretty much unimportant for the result of estimation. statsmodels member josef-pkt added a note Aug 19, 2012 Thinking about extendability: How could a gamma kernel be included, for endog that is continuous but strictly positive? Epanechnikov (?) or others have the advantage that they only need to be evaluated at points in the neighborhood, while gaussian requires all points. They might also be better with multimodal models (but I don't have any evidence). statsmodels member rgommers added a note Aug 19, 2012 There were kernel selection keywords in an earlier version, but we took them out because there were no other kernels. One issue is that you need explicit integral and convolution forms like gaussian_cdf and gaussian_convolution, to not make things very slow. Not sure if there are analytical expressions for those for the gamma kernel. statsmodels member josef-pkt added a note Aug 19, 2012 I don't know about the convolution version, but if I understand correctly, the cdf can be obtained from class gamma_gen(rv_continuous): def _cdf(self, x, a): return special.gammainc(a, x)  statsmodels member josef-pkt added a note Aug 19, 2012 similar for beta kernel (distributions with lower and upper bound) I don't think we need to get this now, but it might be down the road either in these classes or separate. statsmodels member rgommers added a note Aug 19, 2012 If we have the different kernel forms, then it's simple to add. But I agree that that's for later. to join this conversation on GitHub. Already have an account? Sign in to comment
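The gamma-kernel CDF suggestion above can be sketched without scipy: `scipy.special.gammainc(a, x)` is the regularized lower incomplete gamma function, which for integer shape `a` has the closed form used below. The function name here is hypothetical, for illustration only.

```python
import math

def gamma_kernel_cdf_int_shape(a, x):
    """CDF of a Gamma(a, scale=1) kernel for integer shape a.

    Equals scipy.special.gammainc(a, x); the closed form
    P(a, x) = 1 - exp(-x) * sum_{k=0}^{a-1} x**k / k!
    holds only for integer a (illustration, not the general case).
    """
    return 1.0 - math.exp(-x) * sum(x ** k / math.factorial(k)
                                    for k in range(a))
```

For `a = 1` this reduces to the exponential CDF `1 - exp(-x)`, which gives a quick sanity check.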
commented on the diff Aug 17, 2012
statsmodels/nonparametric/kernels.py
+    return kernel_value
+
+
+def gaussian_convolution(h, Xi, x):
+    """ Calculates the Gaussian Convolution Kernel """
+    h, Xi, x, N, K = _get_shape_and_transform(h, Xi, x)
+    if K == 0:
+        return Xi
+
+    z = (Xi - x) / h
+    kernel_value = (1. / np.sqrt(4 * np.pi)) * np.exp(- z ** 2 / 4.)
+    kernel_value = kernel_value.reshape([N, K])
+    return kernel_value
+
+
+def wang_ryzin_convolution(h, Xi, Xj):

statsmodels member josef-pkt added a note Aug 17, 2012
Why Xj and not x? What's the difference between the convolution and the plain kernel? (I didn't read the reference for this.)
commented on the diff Aug 17, 2012
statsmodels/nonparametric/kernels.py
+        for x in Dom_x[i]:
+            Sigma_x += aitchison_aitken(h[i], Xi[:, i], int(x),
+                                        num_levels=len(Dom_x[i])) * \
+                aitchison_aitken(h[i], Xj[i], int(x), num_levels=len(Dom_x[i]))
+
+        Ordered[:, i] = Sigma_x[:, 0]
+
+    return Ordered
+
+
+def gaussian_cdf(h, Xi, x):
+    h, Xi, x, N, K = _get_shape_and_transform(h, Xi, x)
+    if K == 0:
+        return Xi
+
+    cdf = 0.5 * h * (1 + erf((x - Xi) / (h * np.sqrt(2))))

statsmodels member josef-pkt added a note Aug 17, 2012
This should be replaced by a norm_cdf for clarity, but not now, since this is cheaper than calling scipy.stats.
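The norm_cdf helper suggested above is a one-liner on top of `math.erf`, using the same identity the gaussian_cdf kernel already relies on. A minimal sketch (the helper name follows the comment, it does not exist in the PR):

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function, no scipy needed:
    Phi(z) = 0.5 * (1 + erf(z / sqrt(2))).
    This is the scalar core of the expression in gaussian_cdf above."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

This keeps the speed advantage over calling `scipy.stats.norm.cdf` in a tight loop while making the intent explicit.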
commented on the diff Aug 17, 2012
statsmodels/nonparametric/kernels.py
+    cdf = 0.5 * h * (1 + erf((x - Xi) / (h * np.sqrt(2))))
+    cdf = cdf.reshape([N, K])
+    return cdf
+
+
+def aitchison_aitken_cdf(h, Xi, x_u):
+    Xi = np.abs(np.asarray(Xi, dtype=int))
+    if Xi.ndim > 1:
+        K = np.shape(Xi)[1]
+        N = np.shape(Xi)[0]
+    elif Xi.ndim == 1:
+        K = 1
+        N = np.shape(Xi)[0]
+    else:  # ndim == 0 so Xi is a single point (number)
+        K = 1
+        N = 1

statsmodels member josef-pkt added a note Aug 17, 2012
Shape handling is outsourced?
commented on the diff Aug 17, 2012
statsmodels/nonparametric/kernels.py
+    Xi = Xi.reshape([N, K])
+    Dom_x = [np.unique(Xi[:, i]) for i in range(K)]
+    Ordered = np.empty([N, K])
+    for i in range(K):
+        Sigma_x = 0
+        for x in Dom_x[i]:
+            if x <= x_u:
+                Sigma_x += aitchison_aitken(h[i], Xi[:, i], int(x),
+                                            num_levels=len(Dom_x[i]))
+
+        Ordered[:, i] = Sigma_x[:, 0]
+
+    return Ordered
+
+
+def wang_ryzin_cdf(h, Xi, x_u):

statsmodels member josef-pkt added a note Aug 17, 2012
Docstrings: I guess Skipper's type of docstring concatenation or templating would pay off in this module and not be too messy.
commented on the diff Aug 17, 2012
statsmodels/nonparametric/kernels.py
+
+    return Ordered
+
+
+def wang_ryzin_cdf(h, Xi, x_u):
+    Xi = np.abs(np.asarray(Xi, dtype=int))
+    h = np.asarray(h, dtype=float)
+    if Xi.ndim > 1:
+        K = np.shape(Xi)[1]
+        N = np.shape(Xi)[0]
+    elif Xi.ndim == 1:
+        K = 1
+        N = np.shape(Xi)[0]
+    else:  # ndim == 0 so Xi is a single point (number)
+        K = 1
+        N = 1

statsmodels member josef-pkt added a note Aug 17, 2012
Outsourcing?
commented on the diff Aug 17, 2012
statsmodels/nonparametric/kernels.py
+    else:  # ndim == 0 so Xi is a single point (number)
+        K = 1
+        N = 1
+
+    if K == 0:
+        return Xi
+
+    Xi = Xi.reshape([N, K])
+    h = h.reshape((K, ))
+    Dom_x = [np.unique(Xi[:, i]) for i in range(K)]
+    Ordered = np.empty([N, K])
+    for i in range(K):
+        Sigma_x = 0
+        for x in Dom_x[i]:
+            if x <= x_u:
+                Sigma_x += wang_ryzin(h[i], Xi[:, i], int(x))

statsmodels member josef-pkt added a note Aug 17, 2012
Can the kernel be vectorized for x? See the question above about the shapes of arguments to kernel functions.
commented on the diff Aug 17, 2012
statsmodels/nonparametric/nonparametric2.py
+
+    def _normal_reference(self):
+        """
+        Returns Scott's normal reference rule of thumb bandwidth parameter.
+
+        Notes
+        -----
+        See p. 13 in [2] for an example and discussion. The formula for
+        the bandwidth is
+
+        .. math:: h = 1.06 n^{-1/(4+q)}
+
+        where :math:`n` is the number of observations and :math:`q` is the
+        number of variables.
+        """
+        c = 1.06

statsmodels member josef-pkt added a note Aug 17, 2012
Is c always 1.06, or should this be an option?

gpanterov added a note Aug 19, 2012
Yes, for the gaussian kernel the scaling factor c is always 1.06.
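Scott's rule as stated in the docstring is a one-line computation. A sketch following that formula exactly (in the actual estimator the factor is additionally scaled per variable, typically by its standard deviation; the helper name here is hypothetical):

```python
def normal_reference_bw(nobs, q, c=1.06):
    """Scott's normal reference rule of thumb, h = c * n**(-1/(4+q)),
    with c = 1.06 for the gaussian kernel as noted in the discussion.
    Per-variable scaling (e.g. by the std. dev.) is omitted here."""
    return c * nobs ** (-1.0 / (4 + q))
```

For the bivariate example in the docstring (nobs=300, q=2) this gives roughly 0.41, in the same ballpark as the `dens_u.bw` values shown, which also fold in each variable's scale.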
commented on the diff Aug 17, 2012
statsmodels/nonparametric/nonparametric2.py
+            - cv_ml: cross validation maximum likelihood
+            - normal_reference: normal reference rule of thumb
+            - cv_ls: cross validation least squares
+
+        Notes
+        -----
+        The default value for bw is 'normal_reference'.
+        """
+
+        self.bw_func = dict(normal_reference=self._normal_reference,
+                            cv_ml=self._cv_ml, cv_ls=self._cv_ls)
+        if bw is None:
+            bwfunc = self.bw_func['normal_reference']
+            return bwfunc()
+
+        if not isinstance(bw, basestring):

statsmodels member josef-pkt added a note Aug 17, 2012
Note for later: basestring might cause compatibility problems with Python 3.x, but I don't know offhand what the compatible way is.

statsmodels member josef-pkt added a note Aug 17, 2012
I would change the conditional to if ... / elif callable(bw): (new option) / else. (In the last case asarray will raise an exception for anything else, which sounds OK and doesn't need try/except.)

statsmodels member rgommers added a note Aug 19, 2012
Why does basestring cause an issue for py3k? I'm pretty sure this is the standard way to check if a variable is a string, and 2to3 should handle it fine.
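For context on the basestring exchange: the name does not exist on Python 3 (2to3 rewrites it to `str`, which is why the original line is fine under a 2to3 workflow), but a runtime shim is the usual alternative. A sketch (the helper name is hypothetical):

```python
import sys

# basestring exists only on Python 2; guard the reference so the same
# source runs on both major versions without 2to3.
string_types = (str,) if sys.version_info[0] >= 3 else (basestring,)  # noqa: F821

def bw_is_user_array(bw):
    """True if bw is not one of the string method names
    ('cv_ml', 'cv_ls', 'normal_reference')."""
    return not isinstance(bw, string_types)
```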
commented on the diff Aug 17, 2012
statsmodels/nonparametric/nonparametric2.py
+        Notes
+        -----
+        The leave-one-out kernel estimator of :math:`f_{-i}` is:
+
+        .. math:: f_{-i}(X_{i}) = \frac{1}{(n-1)h}
+            \sum_{j=1,j\neq i} K_{h}(X_{i},X_{j})
+
+        where :math:`K_{h}` represents the generalized product kernel
+        estimator:
+
+        .. math:: K_{h}(X_{i},X_{j}) =
+            \prod_{s=1}^{q} h_{s}^{-1} k\left(\frac{X_{is}-X_{js}}{h_{s}}\right)
+        """
+        LOO = tools.LeaveOneOut(self.tdat)
+        L = 0
+        for i, X_j in enumerate(LOO):

statsmodels member josef-pkt added a note Aug 17, 2012
I would prefer x_noti instead of X_j.

statsmodels member josef-pkt added a note Aug 17, 2012
X_j is all observations with j != i? (To see if I understand correctly.)

statsmodels member rgommers added a note Oct 18, 2012
Correct.
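To make the confirmed reading concrete: X_j in the loop is the data with the i-th observation removed. A hypothetical sketch of what a `LeaveOneOut` iterator like the one in `tools` could look like (the real helper lives in np_tools; this is an illustration, not its actual code):

```python
import numpy as np

class LeaveOneOut(object):
    """Yield the (nobs-1, K) array with observation i removed,
    for i = 0 .. nobs-1 (sketch of the helper used in loo_likelihood)."""
    def __init__(self, X):
        self.X = np.asarray(X)

    def __iter__(self):
        nobs = self.X.shape[0]
        for i in range(nobs):
            keep = np.arange(nobs) != i   # boolean mask dropping row i
            yield self.X[keep]

data = np.arange(12.0).reshape(4, 3)
folds = list(LeaveOneOut(data))
```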
commented on the diff Aug 17, 2012
statsmodels/nonparametric/np_tools.py
+            k\left(\frac{X_{iq}-x_{q}}{h_{q}}\right)
+    """
+    iscontinuous, isordered, isunordered = _get_type_pos(var_type)
+    K = len(var_type)
+    N = np.shape(tdat)[0]
+    # must remain 1-D for indexing to work
+    bw = np.reshape(np.asarray(bw), (K,))
+    Kval = np.concatenate((
+        kernel_func[ckertype](bw[iscontinuous],
+                              tdat[:, iscontinuous], edat[:, iscontinuous]),
+        kernel_func[okertype](bw[isordered], tdat[:, isordered],
+                              edat[:, isordered]),
+        kernel_func[ukertype](bw[isunordered], tdat[:, isunordered],
+                              edat[:, isunordered])), axis=1)
+
+    dens = np.prod(Kval, axis=1) * 1. / (np.prod(bw[iscontinuous]))

statsmodels member josef-pkt added a note Aug 17, 2012
Improved numerical precision and maybe efficiency: define this as a log so we only need the sum, and we don't have to take the log in the cv_ml loop. But I don't really understand this part and the connection with cv_ml yet.

statsmodels member josef-pkt added a note Aug 17, 2012
I think not; the product is just over the multivariate dimension, not over observations.
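The log-form idea above is just the identity log(prod K_s / prod h_s) = sum log K_s - sum log h_s, which avoids underflow when many small per-variable kernel values are multiplied. A small sketch showing the equivalence (the arrays are synthetic stand-ins for Kval and bw):

```python
import numpy as np

rng = np.random.RandomState(0)
Kval = rng.uniform(0.05, 1.0, size=(50, 3))  # kernel values, one column per variable
bw = np.array([0.3, 0.5, 0.7])

# Current form: product over the multivariate dimension, then divide.
dens = Kval.prod(axis=1) / bw.prod()

# Log form: sums only, safer for many variables with tiny kernel values,
# and already in the form cv_ml needs.
log_dens = np.log(Kval).sum(axis=1) - np.log(bw).sum()
```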
commented on the diff Aug 17, 2012
statsmodels/nonparametric/kernels.py
+    Here :math:`c` is the number of levels plus one of the RV.
+
+    References
+    ----------
+    .. [1] J. Aitchison and C.G.G. Aitken, "Multivariate binary discrimination
+       by the kernel method", Biometrika, vol. 63, pp. 413-420, 1976.
+    .. [2] Racine, Jeff. "Nonparametric Econometrics: A Primer," Foundations
+       and Trends in Econometrics: Vol. 3: No. 1, pp. 1-88, 2008.
+    """
+    h, Xi, x, N, K = _get_shape_and_transform(h, Xi, x)
+    Xi = np.abs(np.asarray(Xi, dtype=int))
+    x = np.abs(np.asarray(x, dtype=int))
+    if K == 0:
+        return Xi
+
+    c = np.asarray([len(np.unique(Xi[:, i])) for i in range(K)], dtype=int)

statsmodels member josef-pkt added a note Aug 17, 2012
Put this in the else branch of if num_levels; it's an expensive call that is not always used.

statsmodels member rgommers added a note Oct 13, 2012
Done in the pr-408-comments branch.
commented on the diff Aug 17, 2012
statsmodels/nonparametric/kernels.py
+
+        Ordered[:, i] = Sigma_x[:, 0]
+
+    return Ordered
+
+
+def aitchison_aitken_convolution(h, Xi, Xj):
+    h, Xi, x, N, K = _get_shape_and_transform(h, Xi)
+    Xi = np.abs(np.asarray(Xi, dtype=int))
+    Xj = np.abs(np.asarray(Xj, dtype=int))
+    if K == 0:
+        return Xi
+
+    Xi = Xi.reshape([N, K])
+    h = h.reshape((K, ))
+    Dom_x = [np.unique(Xi[:, i]) for i in range(K)]

statsmodels member josef-pkt added a note Aug 17, 2012
This might also be expensive if it has to be calculated very often. Add it as an argument and store it in the caller? When we do LOO, do we need to adjust Dom_x each time, or is it better to keep it constant? The kernel, aitchison_aitken, only uses num_levels, which, it could be argued, should stay unchanged. I'm not sure what happens with the loop for x in Dom_x[i] if x_{i} is missing.

statsmodels member rgommers added a note Oct 13, 2012
This line takes only a few percent of the total time; the for-loop under it is the expensive part. Reviewing/optimizing that now.
commented on the diff Aug 17, 2012
statsmodels/nonparametric/kernels.py
+        return Xi
+
+    c = np.asarray([len(np.unique(Xi[:, i])) for i in range(K)], dtype=int)
+    if num_levels:
+        c = num_levels
+
+    kernel_value = np.tile(h / (c - 1), (N, 1))
+    inDom = (Xi == x) * (1 - h)
+    kernel_value[Xi == x] = inDom[Xi == x]
+    kernel_value = kernel_value.reshape([N, K])
+    return kernel_value
+
+
+def wang_ryzin(h, Xi, x):
+    """
+    The Wang-Ryzin kernel, used for ordered discrete random variables.

statsmodels member josef-pkt added a note Aug 17, 2012
This sounds like a misnomer to me. It doesn't just assume an ordering; it actually assumes discrete variables with a "uniform scale", fully metric: it uses the absolute distance as the measure.
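Josef's point can be seen directly from the kernel's definition: the weight depends on the metric distance |Xi - x|, not merely on the ordering. A one-variable sketch of the Wang-Ryzin formula from the docstrings (K = 1 - h at Xi == x, and 0.5*(1-h)*h**|Xi-x| otherwise; the function name is for illustration):

```python
import numpy as np

def wang_ryzin_1d(h, Xi, x):
    """Wang-Ryzin kernel for a single ordered discrete variable,
    scalar bandwidth h in (0, 1). Depends on |Xi - x|, i.e. the
    kernel treats the levels as equally spaced (metric), not just
    ordered -- Josef's point above."""
    Xi = np.asarray(Xi, dtype=int)
    out = 0.5 * (1 - h) * h ** np.abs(Xi - x)
    out[Xi == x] = 1 - h
    return out

k = wang_ryzin_1d(0.5, np.array([0, 1, 2, 3]), 1)
```

Note that levels 0 and 2 get the same weight (same distance from x = 1), while level 3 gets less: distance matters, not just order.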
commented on the diff Aug 17, 2012
statsmodels/nonparametric/nonparametric2.py
+class _GenericKDE(object):
+    """
+    Generic KDE class with methods shared by both UKDE and CKDE
+    """
+    def _compute_bw(self, bw):
+        """
+        Computes the bandwidth of the data.
+
+        Parameters
+        ----------
+        bw: array_like or str
+            If array_like: user-specified bandwidth.
+            If a string, should be one of:
+
+                - cv_ml: cross validation maximum likelihood
+                - normal_reference: normal reference rule of thumb

statsmodels member josef-pkt added a note Aug 17, 2012
Should there be normal_scott and normal_silverman instead?

statsmodels member rgommers added a note Aug 19, 2012
That would be good; no reason not to supply both.
commented on the diff Aug 17, 2012
statsmodels/nonparametric/nonparametric2.py
+        Returns the value of the bandwidth that minimizes the integrated mean
+        square error between the estimated and actual distribution. The
+        integrated mean square error (IMSE) is given by:
+
+        .. math:: \int \left[\hat{f}(x) - f(x)\right]^{2} dx
+
+        This is the general formula for the IMSE. The IMSE differs for
+        conditional (CKDE) and unconditional (UKDE) kernel density estimation.
+        """
+        h0 = self._normal_reference()
+        bw = optimize.fmin(self.imse, x0=h0, maxiter=1e3, maxfun=1e3, disp=0)
+        return np.abs(bw)
+
+    def loo_likelihood(self):
+        raise NotImplementedError
+

statsmodels member josef-pkt added a note Aug 17, 2012
Add def imse with NotImplementedError?

statsmodels member rgommers added a note Oct 13, 2012
_GenericKDE is now also a base class for the regression class, which doesn't have loo_likelihood or imse. So remove loo_likelihood here instead?
commented on the diff Aug 17, 2012
statsmodels/nonparametric/nonparametric2.py
 + """ + h0 = self._normal_reference() + bw = optimize.fmin(self.imse, x0=h0, maxiter=1e3, maxfun=1e3, disp=0) + return np.abs(bw) + + def loo_likelihood(self): + raise NotImplementedError + + +class UKDE(_GenericKDE): + """ + Unconditional Kernel Density Estimator + + Parameters + ---------- + tdat: list of ndarrays or 2-D ndarray
 statsmodels member josef-pkt added a note Aug 17, 2012 I think I would just call it data, or endog :) I don't like the t and e abbreviation for train and evaluation much. It took me a long time to figure out what it's supposed stand for. to join this conversation on GitHub. Already have an account? Sign in to comment
commented on the diff Aug 17, 2012
statsmodels/nonparametric/nonparametric2.py
+    >>> N = 300
+    >>> np.random.seed(1234)  # Seed random generator
+    >>> c1 = np.random.normal(size=(N,1))
+    >>> c2 = np.random.normal(2, 1, size=(N,1))
+
+    Estimate a bivariate distribution and display the bandwidth found:
+
+    >>> dens_u = UKDE(tdat=[c1,c2], var_type='cc', bw='normal_reference')
+    >>> dens_u.bw
+    array([ 0.39967419,  0.38423292])
+    """
+    def __init__(self, tdat, var_type, bw=None):
+        self.var_type = var_type
+        self.K = len(self.var_type)
+        self.tdat = tools.adjust_shape(tdat, self.K)
+        self.all_vars = self.tdat

statsmodels member josef-pkt added a note Aug 17, 2012
I prefer postfixing qualifiers: endog_all, or just endog or data.
commented on the diff Aug 17, 2012
statsmodels/nonparametric/nonparametric2.py
+    >>> np.random.seed(1234)  # Seed random generator
+    >>> c1 = np.random.normal(size=(N,1))
+    >>> c2 = np.random.normal(2, 1, size=(N,1))
+
+    Estimate a bivariate distribution and display the bandwidth found:
+
+    >>> dens_u = UKDE(tdat=[c1,c2], var_type='cc', bw='normal_reference')
+    >>> dens_u.bw
+    array([ 0.39967419,  0.38423292])
+    """
+    def __init__(self, tdat, var_type, bw=None):
+        self.var_type = var_type
+        self.K = len(self.var_type)
+        self.tdat = tools.adjust_shape(tdat, self.K)
+        self.all_vars = self.tdat
+        self.N, self.K = np.shape(self.tdat)

statsmodels member josef-pkt added a note Aug 17, 2012
Standard terminology: nobs, k_vars.

statsmodels member rgommers added a note Aug 19, 2012
+1, that's clearer.
commented on the diff Aug 17, 2012
statsmodels/nonparametric/nonparametric2.py
+            For the log likelihood should be numpy.log
+
+        Notes
+        -----
+        The leave-one-out kernel estimator of :math:`f_{-i}` is:
+
+        .. math:: f_{-i}(X_{i}) = \frac{1}{(n-1)h}
+            \sum_{j=1,j\neq i} K_{h}(X_{i},X_{j})
+
+        where :math:`K_{h}` represents the generalized product kernel
+        estimator:
+
+        .. math:: K_{h}(X_{i},X_{j}) =
+            \prod_{s=1}^{q} h_{s}^{-1} k\left(\frac{X_{is}-X_{js}}{h_{s}}\right)
+        """
+        LOO = tools.LeaveOneOut(self.tdat)

statsmodels member josef-pkt added a note Aug 17, 2012
Put a limit on the LOO loop for large data sets (large nobs): subsampling?

statsmodels member rgommers added a note Aug 19, 2012
Agreed, blocking should become the default above a certain sample size (O(500)?).
commented on the diff Aug 17, 2012
statsmodels/nonparametric/nonparametric2.py
+        The probability density is given by the generalized product kernel
+        estimator:
+
+        .. math:: K_{h}(X_{i},X_{j}) =
+            \prod_{s=1}^{q} h_{s}^{-1} k\left(\frac{X_{is}-X_{js}}{h_{s}}\right)
+        """
+        if edat is None:
+            edat = self.tdat
+        else:
+            edat = tools.adjust_shape(edat, self.K)
+
+        pdf_est = []
+        N_edat = np.shape(edat)[0]
+        for i in xrange(N_edat):
+            pdf_est.append(tools.gpke(self.bw, tdat=self.tdat, edat=edat[i, :],
+                                      var_type=self.var_type) / self.N)

statsmodels member josef-pkt added a note Aug 17, 2012
Vectorize gpke, and work in batches to not blow up memory consumption?

statsmodels member rgommers added a note Aug 19, 2012
Batching is implemented in another branch (have a look at https://github.com/gpanterov/statsmodels/blob/nonparametric-all/statsmodels/nonparametric/nonparametric2.py for an overview of all the work).
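The batching idea above amounts to evaluating the density over the evaluation points in fixed-size chunks, so a vectorized kernel evaluation never materializes the full (N_edat, nobs) array at once. A hypothetical sketch; `estimate_batch` stands in for a vectorized `tools.gpke` (that signature is an assumption, not the PR's API):

```python
import numpy as np

def batched_eval(edat, estimate_batch, batch_size=128):
    """Evaluate a vectorized estimator over rows of edat in chunks,
    bounding peak memory at batch_size * nobs kernel evaluations."""
    out = np.empty(edat.shape[0])
    for start in range(0, edat.shape[0], batch_size):
        chunk = edat[start:start + batch_size]
        out[start:start + batch_size] = estimate_batch(chunk)
    return out

# Toy stand-in estimator: density proportional to x**2 on the grid.
pts = np.linspace(0.0, 1.0, 300).reshape(-1, 1)
res = batched_eval(pts, lambda chunk: chunk[:, 0] ** 2, batch_size=100)
```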
commented on the diff Aug 17, 2012
statsmodels/nonparametric/nonparametric2.py
+        where G() is the product kernel CDF estimator for the continuous
+        variables and L() for the discrete variables.
+        """
+        if edat is None:
+            edat = self.tdat
+        else:
+            edat = tools.adjust_shape(edat, self.K)
+
+        N_edat = np.shape(edat)[0]
+        cdf_est = []
+        for i in xrange(N_edat):
+            cdf_est.append(tools.gpke(self.bw, tdat=self.tdat,
+                                      edat=edat[i, :], var_type=self.var_type,
+                                      ckertype="gaussian_cdf",
+                                      ukertype="aitchisonaitken_cdf",
+                                      okertype='wangryzin_cdf') / self.N)

statsmodels member josef-pkt added a note Aug 17, 2012
Why does cdf specify the kertype but pdf doesn't? I think ckertype will need more options; maybe later.

statsmodels member josef-pkt added a note Aug 17, 2012
Add kertype as an attribute of the instance, in __init__?

statsmodels member josef-pkt added a note Aug 17, 2012
OK, I see now: _cdf or _convolution below, different kernels.
commented on the diff Aug 17, 2012
statsmodels/nonparametric/nonparametric2.py
+    Conditional Kernel Density Estimator.
+
+    Calculates P(X_1, X_2, ..., X_n | Y_1, Y_2, ..., Y_m) =
+    P(X_1, X_2, ..., X_n, Y_1, Y_2, ..., Y_m) / P(Y_1, Y_2, ..., Y_m).
+    The conditional density is by definition the ratio of the two
+    unconditional densities, see [1]_.
+
+    Parameters
+    ----------
+    tydat: list of ndarrays or 2-D ndarray
+        The training data for the dependent variables, used to determine
+        the bandwidth(s). If a 2-D array, should be of shape
+        (num_observations, num_variables). If a list, each list element is a
+        separate observation.
+    txdat: list of ndarrays or 2-D ndarray
+        The training data for the independent variable; same shape as tydat.

statsmodels member josef-pkt added a note Aug 17, 2012
The standard is endog, exog, until we change it.
commented on the diff Aug 17, 2012
statsmodels/nonparametric/nonparametric2.py
+
+        where :math:`\bar{K}_{h}` is the multivariate product convolution
+        kernel (consult [3] for mixed data types).
+        """
+        F = 0
+        for i in range(self.N):
+            k_bar_sum = tools.gpke(bw, tdat=-self.tdat, edat=-self.tdat[i, :],
+                                   var_type=self.var_type,
+                                   ckertype='gauss_convolution',
+                                   okertype='wangryzin_convolution',
+                                   ukertype='aitchisonaitken_convolution')
+            F += k_bar_sum
+        # there is a + because loo_likelihood returns the negative
+        return (F / (self.N ** 2) + self.loo_likelihood(bw) *
+                2 / ((self.N) * (self.N - 1)))
+

statsmodels member josef-pkt added a note Aug 17, 2012
Some plot methods would be nice.

statsmodels member rgommers added a note Aug 19, 2012
Is there something special you think a plot method should do? If it's just plot(x), which plots estimate(x) vs. x on a linear scale, then I don't think it adds much.
commented on the diff Aug 17, 2012
statsmodels/nonparametric/nonparametric2.py
+    >>> c1 = np.random.normal(size=(N,1))
+    >>> c2 = np.random.normal(2, 1, size=(N,1))
+
+    >>> dens_c = CKDE(tydat=[c1], txdat=[c2], dep_type='c',
+    ...               indep_type='c', bwmethod='normal_reference')
+
+    >>> print "The bandwidth is: ", dens_c.bw
+    """
+    def __init__(self, tydat, txdat, dep_type, indep_type, bw=None):
+        self.dep_type = dep_type
+        self.indep_type = indep_type
+        self.K_dep = len(self.dep_type)
+        self.K_indep = len(self.indep_type)
+        self.tydat = tools.adjust_shape(tydat, self.K_dep)
+        self.txdat = tools.adjust_shape(txdat, self.K_indep)
+        self.N, self.K_dep = np.shape(self.tydat)

statsmodels member josef-pkt added a note Aug 17, 2012
nobs, k: not capitalised.
commented on the diff Aug 17, 2012
statsmodels/nonparametric/nonparametric2.py
+    >>> c2 = np.random.normal(2, 1, size=(N,1))
+
+    >>> dens_c = CKDE(tydat=[c1], txdat=[c2], dep_type='c',
+    ...               indep_type='c', bwmethod='normal_reference')
+
+    >>> print "The bandwidth is: ", dens_c.bw
+    """
+    def __init__(self, tydat, txdat, dep_type, indep_type, bw=None):
+        self.dep_type = dep_type
+        self.indep_type = indep_type
+        self.K_dep = len(self.dep_type)
+        self.K_indep = len(self.indep_type)
+        self.tydat = tools.adjust_shape(tydat, self.K_dep)
+        self.txdat = tools.adjust_shape(txdat, self.K_indep)
+        self.N, self.K_dep = np.shape(self.tydat)
+        self.all_vars = np.concatenate((self.tydat, self.txdat), axis=1)

statsmodels member josef-pkt added a note Aug 17, 2012
column_stack would require less thinking. (I didn't see that tydat is 2-D even if univariate.)

statsmodels member rgommers added a note Oct 13, 2012
Replaced concatenate with column_stack and row_stack everywhere in https://github.com/rgommers/statsmodels/tree/pr-408-comments.
commented on the diff Aug 17, 2012
statsmodels/nonparametric/nonparametric2.py
+        else:
+            exdat = tools.adjust_shape(exdat, self.K_indep)
+
+        N_edat = np.shape(exdat)[0]
+        cdf_est = np.empty(N_edat)
+        for i in xrange(N_edat):
+            mu_x = tools.gpke(self.bw[self.K_dep::], tdat=self.txdat,
+                              edat=exdat[i, :], var_type=self.indep_type) / self.N
+            mu_x = np.squeeze(mu_x)
+            G_y = tools.gpke(self.bw[0:self.K_dep], tdat=self.tydat,
+                             edat=eydat[i, :], var_type=self.dep_type,
+                             ckertype="gaussian_cdf",
+                             ukertype="aitchisonaitken_cdf",
+                             okertype='wangryzin_cdf', tosum=False)
+
+            W_x = tools.gpke(self.bw[self.K_dep::], tdat=self.txdat,

statsmodels member josef-pkt added a note Aug 17, 2012
:: requires thinking; missing comma?

statsmodels member rgommers added a note Oct 13, 2012
No comma, just removing the second :. Addressed in https://github.com/rgommers/statsmodels/tree/pr-408-comments for all code.
commented on the diff Aug 17, 2012
statsmodels/nonparametric/nonparametric2.py
+        if eydat is None:
+            eydat = self.tydat
+        else:
+            eydat = tools.adjust_shape(eydat, self.K_dep)
+        if exdat is None:
+            exdat = self.txdat
+        else:
+            exdat = tools.adjust_shape(exdat, self.K_indep)
+
+        N_edat = np.shape(exdat)[0]
+        cdf_est = np.empty(N_edat)
+        for i in xrange(N_edat):
+            mu_x = tools.gpke(self.bw[self.K_dep::], tdat=self.txdat,
+                              edat=exdat[i, :], var_type=self.indep_type) / self.N
+            mu_x = np.squeeze(mu_x)
+            G_y = tools.gpke(self.bw[0:self.K_dep], tdat=self.tydat,

statsmodels member josef-pkt added a note Aug 17, 2012
cdf_y or cdf_endog instead of G_y. Should this be a separate method, or otherwise available? Do we want to store them in the same class, or should a user create their own marginal and joint distributions? (Application: mutual information; there is an example in the sandbox with scipy's gaussian_kde.)
commented on the diff Aug 17, 2012
statsmodels/nonparametric/nonparametric2.py
+
+        .. math:: G_{-l}(X_{l}) = n^{-2} \sum_{i\neq l}\sum_{j\neq l}
+            K_{X_{i},X_{l}} K_{X_{j},X_{l}} K_{Y_{i},Y_{j}}^{(2)}
+
+        where :math:`K_{X_{i},X_{l}}` is the multivariate product kernel and
+        :math:`\mu_{-l}(X_{l})` is the leave-one-out estimator of the pdf.
+
+        :math:`K_{Y_{i},Y_{j}}^{(2)}` is the convolution kernel.
+
+        The value of the function is minimized by the _cv_ls method of the
+        _GenericKDE class to return the bw estimates that minimize the
+        distance between the estimated and "true" probability density.
+        """
+        zLOO = tools.LeaveOneOut(self.all_vars)
+        CV = 0
+        for l, Z in enumerate(zLOO):

statsmodels member josef-pkt added a note Aug 17, 2012
l (a lowercase L) is not a good variable name; in txdat[l, :], is that a 1, an l, or an I?
commented on the diff Aug 19, 2012
statsmodels/nonparametric/nonparametric2.py
+
+        The leave-one-out kernel estimator of :math:`f_{-i}` is:
+
+        .. math:: f_{-i}(X_{i}) = \frac{1}{(n-1)h}
+            \sum_{j=1,j\neq i} K_{h}(X_{i},X_{j})
+
+        where :math:`K_{h}` represents the generalized product kernel
+        estimator:
+
+        .. math:: K_{h}(X_{i},X_{j}) = \prod_{s=1}^{q}
+            h_{s}^{-1} k\left(\frac{X_{is}-X_{js}}{h_{s}}\right)
+        """
+        # the initial value for the optimization is the normal_reference
+        h0 = self._normal_reference()
+        bw = optimize.fmin(self.loo_likelihood, x0=h0, args=(np.log, ),
+                           maxiter=1e3, maxfun=1e3, disp=0)

statsmodels member josef-pkt added a note Aug 19, 2012
One possible speedup: increase xtol from 0.0001. As far as I understand, the exact bandwidth might not be very important, so we might not need it to converge to high precision; the important part is convergence in function value. (A guess: this might save time and calculations if the objective function is relatively flat at the optimum.) What I don't know is what the scale is: do we need absolute or relative tolerance in x? Extra question (maybe for the future): are any of the other optimizers potentially better/faster? Make it into a choice?

statsmodels member rgommers added a note Oct 13, 2012
xtol=1e-4 is the default, so I guess you mean 1e-3. Bandwidth should be in the range 0-1, but it can be small for large sample sizes; therefore I think relative tolerance (xtol) should be used. fmin_bfgs is indeed faster in most cases, so making this configurable would help.

statsmodels member rgommers added a note Oct 13, 2012
Changed tolerance in https://github.com/rgommers/statsmodels/tree/speedup-nonparametric
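The tolerance discussion above can be illustrated by passing a looser xtol/ftol to `scipy.optimize.fmin`. The quadratic objective below is a toy stand-in for loo_likelihood (the real objective is far more expensive, which is why fewer simplex iterations matter):

```python
import numpy as np
from scipy import optimize

def objective(h):
    # Toy stand-in for loo_likelihood: minimum at h = 0.4.
    return float((h[0] - 0.4) ** 2)

# Looser tolerances stop the Nelder-Mead simplex earlier; for an
# expensive cross-validation objective that saves many evaluations.
bw = optimize.fmin(objective, x0=np.array([1.0]), xtol=1e-3, ftol=1e-3,
                   maxiter=1000, maxfun=1000, disp=0)
```

Note that fmin's xtol/ftol are absolute tolerances, which is part of Josef's open question about whether an absolute or relative criterion is appropriate when bandwidths shrink with sample size.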
statsmodels member

About api.py vs. __init__.py, I had a look at other modules and it's pretty much inconsistent. Because completely emptying the current __init__.py is not backwards compatible, I think it shouldn't be done in this PR. Adding UKDE, CKDE in a new api.py and leaving the current KDE in __init__.py also doesn't make sense.

Also, I've checked the import time and it is only 3 ms, plus the time for importing numpy + scipy.optimize.

added some commits Aug 19, 2012
 gpanterov changed UKDE to KDE and CKDE to ConditionalKDE e8519e6 gpanterov fixed a small issue with the pdf_mixeddata_cv_ml test b9a8a51
referenced this pull request Aug 20, 2012
Closed

### Nonparametric all #434

statsmodels member

Summary of all the variable naming comments of Josef:

tdat --> data
txdat --> exog
tydat --> endog
all_vars --> endog_all / endog / data
X_j --> x_noti
G_y --> cdf_y / cdf_endog
N --> nobs
K_dep --> k_dep
no small L


Most of this is indeed standard throughout statsmodels. See http://statsmodels.sourceforge.net/devel/gettingstarted.html#design-matrices-endog-exog for endog/exog and http://statsmodels.sourceforge.net/devel/dev/naming_conventions.html for nobs/k.

I'd propose x_not_i for readability. The all_vars suggestion I'm not sure about, because:

    all_vars = np.column_stack((self.tydat, self.txdat))


Other than those two I think these are good suggestions. @gpanterov what do you think?

statsmodels member

I would even propose exog, endog -> X, Y. I'd like to revisit this package-wide before the pydata talk so I don't get the same suggestion from however many people again to change this.

statsmodels member

That would be even better; has to happen at some point anyway.

commented on the diff Oct 14, 2012
statsmodels/nonparametric/kernels.py
+ -----
+ See p. 19 in [1]_ for details. The value of the kernel L if
+ :math:`X_{i}=x` is :math:`1-\lambda`, otherwise it is
+ :math:`\frac{1-\lambda}{2}\lambda^{|X_{i}-x|}`.
+
+ References
+ ----------
+ .. [1] Racine, Jeff. "Nonparametric Econometrics: A Primer," Foundations
+    and Trends in Econometrics: Vol 3: No 1, pp 1-88, 2008.
+    http://dx.doi.org/10.1561/0800000009
+ .. [2] M.-C. Wang and J. van Ryzin, "A class of smooth estimators for
+    discrete distributions", Biometrika, vol. 68, pp. 301-309, 1981.
+ """
+ h, Xi, x, N, K = _get_shape_and_transform(h, Xi, x)
+ Xi = np.abs(np.asarray(Xi, dtype=int))
+ x = np.abs(np.asarray(x, dtype=int))

statsmodels member
rgommers added a note Oct 14, 2012

Looking at the definition in the docstring and ref [1], the abs calls in the two lines above are incorrect. It's abs(Xi - x), not abs(abs(Xi) - abs(x)). Removing them in my speedup branch.
commented on the diff Oct 14, 2012
statsmodels/nonparametric/kernels.py
+ :math:`X_{i}=x` is :math:`1-\lambda`, otherwise it is
+ :math:`\frac{1-\lambda}{2}\lambda^{|X_{i}-x|}`.
+
+ References
+ ----------
+ .. [1] Racine, Jeff. "Nonparametric Econometrics: A Primer," Foundations
+    and Trends in Econometrics: Vol 3: No 1, pp 1-88, 2008.
+    http://dx.doi.org/10.1561/0800000009
+ .. [2] M.-C. Wang and J. van Ryzin, "A class of smooth estimators for
+    discrete distributions", Biometrika, vol. 68, pp. 301-309, 1981.
+ """
+ h, Xi, x, N, K = _get_shape_and_transform(h, Xi, x)
+ Xi = np.abs(np.asarray(Xi, dtype=int))
+ x = np.abs(np.asarray(x, dtype=int))
+ if K == 0:
+     return Xi

statsmodels member
rgommers added a note Oct 14, 2012

This check is unnecessary: the _get_shape_and_transform call guarantees K >= 1. Removing it.
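For reference, the Wang–van Ryzin kernel from the docstring above can be sketched directly from its formula (a minimal illustration, not the PR's implementation; note the distance is abs(Xi - x), which is the point of the review comment):

```python
import numpy as np

def wang_ryzin(lam, Xi, x):
    """Wang-van Ryzin kernel for ordered discrete data (sketch).

    Returns 1 - lam where Xi == x, and
    0.5 * (1 - lam) * lam ** abs(Xi - x) otherwise --
    note abs(Xi - x), not abs(abs(Xi) - abs(x)).
    """
    Xi = np.asarray(Xi, dtype=int)
    dist = np.abs(Xi - x)                    # integer distance to x
    kernel = 0.5 * (1 - lam) * lam ** dist   # off-diagonal weights
    kernel[dist == 0] = 1 - lam              # mass at Xi == x
    return kernel

print(wang_ryzin(0.3, [1, 2, 3], 2))
```

With lam = 0.3 the weight at Xi == x is 0.7 and decays geometrically in the integer distance on either side, which is what makes the kernel suitable for ordered discrete data.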
statsmodels member

I still don't really like one letter names in the classes. Users can use the formula interface if they don't want to see endog/exog.

5c987e0#diff-1

statsmodels member

Josef, given the number of people in favor of X/Y (both on-list and in Skipper's tutorials) and virtually no one except you liking endog/exog, I really hope you'll reconsider your opinion on this.

That issue is orthogonal to this PR though. So I propose to not discuss it further here, and if the status quo is still endog/exog at merge time of this PR, we should use it in this code.

statsmodels member

@gpanterov: can you comment on the renaming proposals above? I'd like to get those out of the way, then we're pretty much done with this PR.

statsmodels member

Time of nonparametric.test() is down to 16 seconds, nonparametric.test('full') to 80 seconds on my machine. So this is getting more or less acceptable for merging. There's more that can be shaved off still, but it's OK-ish now, I think.

On the naming conventions:
tydat, txdat, eydat and exdat were borrowed from R's np package.
I like the idea of changing them so that they are more in line with statsmodels naming conventions. I like the brevity of X and Y, and if this is more in line with the rest of statsmodels I propose that we go ahead with it. However, we should also consider renaming the arguments of KDE.cdf() and ConditionalKDE.cdf() and pdf(), which are currently exdat and eydat (e standing for evaluating, as opposed to training data (t)). I propose eX and eY?

statsmodels member

eX, eY is still uninformative, I never managed to guess what the e stands for.

our terminology in general is fit and predict

shorthand names can sometimes be useful inside a function, but for the outside I prefer descriptive names.

we don't have exdat yet in other models, just exog in predict. The full name would be :)

exog_predict

statsmodels member

exog_predict is pretty uninformative too imho. Note that the already existing kdensity and kdensity_fft do use X already:

def kdensity(X, kernel="gau", bw="scott", weights=None, gridsize=None,
"""
Rosenblatt-Parzen univariate kernel density estimator

Parameters
----------
X : array-like
The variable for which the density estimate is desired.


I see a bunch of predict methods, but no parameters with _predict appended.

Using fit() for kicking off bandwidth estimation is a good point though.

statsmodels member

No, we don't have _predict postfix, because we never needed it.
standard signature is just predict(exog) since none of the other models needs to keep the original exog from the fit around at the same time. (predict signature will get more complicated with formulas)

Kernel methods are currently the only ones where we don't just have "parameters", and where we need the full original sample for "prediction".

For density estimation there is no real endog and exog, just the data, so I don't really care much. Similar, I usually don't use the endog/exog terminology in statistical tests.
But kernel regression is similar to the other model in that it estimates a relationship between endog and exog under some assumptions on the process that generated the data.

statsmodels member

OK, all the renames done in my speedup-nonparametric branch. I included exog_predict and even exog_nonparametric_predict, because I realised the most confusing part of exog_predict is not _predict. So why not, at least it's consistent.

Adding fit presents a slight problem, since the regression classes already have a fit method (for doing the actual regression and marginal effects). If fit() would be used for computing the bandwidth (still don't see a good use-case for why that's needed), what to call the current fit()?

statsmodels member

Ralf, thanks for working through this
Can you create a pull request from your branch, so we can review your version?

The way it sounds like, I think we should merge it soon.

Can we indicate somewhere that the CensoredReg, SemiLinear and SingleIndexModel classes are experimental? Or should we exclude them from the PR? They are a unique feature of statsmodels, i.e. they are not present in any other package to my knowledge. But because of this, I was not able to cross-check the output that they give.

statsmodels member

I would prefer to merge everything instead of splitting up the pull request again.

If there are some tests that they work, then I would just add a comment to the docstrings and leave them in, even if they are not verified against other packages.
If they are unfinished or might have problems, then we could also just move them temporarily to the sandbox.

(I haven't looked at the code for those yet.)

statsmodels member

Test coverage should be increased first though if we want to merge everything at once:

• TestFForm and SingleIndexModel are completely untested
• SemiLinear and CensoredReg have only one test.

I propose to rename TestFForm to FFormTest by the way. No name should start with Test.

George, do you have time to work on those tests in the near future? If not, I propose to leave those things out for now.

statsmodels member

What do we do about the current KDE class, by the way? Rename it to KDE_1d?

statsmodels member

Is there a name clash? Is something called KDE in the PR? If so, I think it should be renamed given that KDE is in use already. If not, I'm ok to change KDE to KDE_1d (or whatever), but we'd have to go through a deprecation period for the current KDE.

statsmodels member

The main class is called KDE (renamed from UKDE). Josef's suggestion from the mailing list:

I would spell it out ConditionalKDE,  or KDEConditional,
I think shortening UKDE to KDE is fine, if there is no problem with
several KDE, eg. rename the other one to KDEUnivariate or UnivariateKDE.

statsmodels member

I think at least initially we would have to have KDEMultivariate or MultivariateKDE, then we can deprecate KDE and move to this if it's what we really want to do unless there are some plans to separate by namespace (still would be a bit confusing IMO).

statsmodels member

Hmm, not sure what would be best. Basically, the current KDE is a specialization for 1-D continuous data. It's faster and has support for data on a finite domain. The new KDE can handle mixed continuous/ordered/unordered data and has cross-validated bandwidth selection. We should be able to add finite domain support to it (longer term, non-trivial). So it looks to me like KDE should become KDE_1d or UnivariateKDE. However, given the current state that's indeed not very practical. Alternative is to rename the new KDE and explain how the classes are related in the two docstrings.

Splitting the two between two namespaces doesn't make sense.

statsmodels member

I don't think we want to separate by namespace, too confusing and there are no natural namespaces within kde.

One possibility would be to keep all the KDE with qualifiers.

The KDE in this PR is using product kernels, so we could also call it KDEProduct to distinguish it from scipy's gaussian_kde, which I think doesn't impose the product structure. KDEMultivariate is also fine (I'm not sure users care about "Product").

(other multivariate: nearest neighbor, rbf kernels ?)
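The product-kernel structure mentioned above can be illustrated with a minimal Gaussian product-kernel density evaluation (a sketch with assumed names, not the PR's code):

```python
import numpy as np

def product_kde(data, x, bw):
    """Evaluate a multivariate KDE with a Gaussian product kernel at
    point x (sketch).  The multivariate kernel is the product of
    univariate kernels, one per variable, each with its own bandwidth
    -- as opposed to a full multivariate kernel with a general
    covariance matrix."""
    data = np.asarray(data, dtype=float)   # shape (nobs, k_vars)
    bw = np.asarray(bw, dtype=float)       # one bandwidth per variable
    u = (data - np.asarray(x)) / bw        # standardized distances
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # univariate Gaussians
    return np.mean(np.prod(k / bw, axis=1))        # product over variables

rng = np.random.default_rng(0)
sample = rng.standard_normal((500, 2))
val = product_kde(sample, x=[0.0, 0.0], bw=[0.4, 0.4])
print(val)
```

For a 2-D standard normal sample the estimate at the origin should come out near the true density 1/(2π) ≈ 0.16, minus the usual smoothing bias at the mode.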

statsmodels member

Users indeed won't care much. Multivariate sounds better than Product.

statsmodels member

+1

statsmodels member

unless the untested parts will get tests soon, I prefer moving them to the sandbox.
requires less maintenance, avoids bitrot, and we don't have to struggle with git merges.

And it can be used even if it doesn't have a full test suite, and we can add tests incrementally.

(example: We got a bug report for survival2, which is in the sandbox, but the code in the sandbox is outdated and no one has worked on the new version in the branch in 7 months.
It would have been easier to merge and add test coverage piece by piece.)

statsmodels member

Ralf, can you make a pull request from your branch?

This branch doesn't contain the extra classes SingleIndexModel, Censored, ...

Also I would suggest splitting the module nonparametric2.py into at least two or three parts, density, regression and others? and rename nonparametric2 to "kernel_density.py" or something like that.

I only had a brief look at the extras. One of the main parts that might need expansion (later?) is to check what should be returned by fit, there are currently no results classes.

statsmodels member

Yes, will get the last renames done and send a PR this week.

@gpanterov I'll follow Josef's suggestion on moving things that aren't tested to the sandbox; they can always be moved back as soon as we have tests.

Please go ahead and move them to the sandbox, Ralf. I was thinking along the same lines. I will have time to write some tests for them in the next few weeks, but I can't do it immediately ...

referenced this pull request Nov 3, 2012
Closed

### Nonparametric kernel density estimation and regression. #562

statsmodels member
commented Nov 3, 2012

New PR opened, so closing this one.

closed this Nov 3, 2012