
Nonparametric density #408

Closed
wants to merge 79 commits

4 participants

gpanterov Ralf Gommers Skipper Seabold Josef Perktold
gpanterov

Nonparametric density estimators. Conditional and unconditional multivariate mixed-data estimation is available, with plug-in (normal-reference) and data-driven bandwidth selection methods (cross-validation least squares and cross-validation maximum likelihood).

Josef Perktold

I think it would pay off to write tests for the various input shapes of this function.

For example, my guess is that the calls to kernel_func don't work if there is more than one variable of the same type, or do they? They might, if the kernel functions are vectorized.

The kernel functions are vectorized. I actually tested it with multiple variables. For example, to obtain P(c1, c3 | c2):

dens_c = CKDE(tydat=[c1, c3], txdat=[c2], dep_type='cc', indep_type='c',
              bwmethod='normal_reference')
dens_c.pdf()

Owner

nice, now I see that the bandwidth h is also vectorized.

Ralf Gommers

The way you've written the signature of this function, both arguments should be specified. So this note must be incorrect. Your TODO below (combining them into one parameter) makes sense. I assume bw will then be a user-specified function (the description of bw doesn't actually say that) with a standard signature.

Collaborator

Naming a method get_x implies that x is an already-existing attribute/number. Perhaps better to name it find_bw or compute_bw or similar.

Ralf Gommers

Could you indicate that this is Scott's rule of thumb? Silverman's is almost as popular I think.

Also, the method name is not so clear, I think. It could be named such that it's clear that this is a bandwidth estimate.

Ralf Gommers

The methods normal_reference, cv_ml and cv_ls are all private, right? They should only be called through get_bw. So start the names with an underscore.

Ralf Gommers

It would be good to explain in a few sentences what "conditional kde" actually is and give a reference. Conditional estimation is a lot less common than unconditional; unconditional is normally even left off ("kernel density estimation" refers to your UKDE class).

Ralf Gommers

For keywords which can be left out, use None as default parameter. False implies that this is a boolean variable. The check for input given or not is then written as if eydat is not None.
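For concreteness, the pattern looks like this (a sketch using this PR's names):

def pdf(self, eydat=None):
    if eydat is None:
        eydat = self.tydat  # default: evaluate at the training points
    ...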

Ralf Gommers

This for-loop doesn't do anything. It only creates var, which isn't used below.

Owner

column_stack (one of my favorites) could be used; however, concatenate and column_stack both copy the data, AFAICS.

In general it might be better to work with views, and require users to concatenate themselves.
For example, in the LOO loop, tdat is already an array (view); if there is no concatenate call, then the class would have a view instead of a copy of the data. In most models we try to use views on the original data, exog and endog, although some calculations might create copies anyway.
(We never checked systematically whether we save any memory or are faster this way.)
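To illustrate the copy-vs-view point with plain numpy (an illustrative snippet, not PR code):

import numpy as np

a = np.arange(6.0).reshape(3, 2)
col = a[:, 0]                            # slicing returns a view
print(np.may_share_memory(a, col))       # True: no data copied

b = np.column_stack([a[:, 0], a[:, 1]])  # column_stack copies
print(np.may_share_memory(a, b))         # False: new memory allocated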

Ralf Gommers

Same here, these for-loops don't do anything.

Josef Perktold

index could be 1d (row selector), then reshape is not necessary

Josef Perktold

PEP 8 doesn't put spaces around = in keyword arguments. Minor issue, but useful to get used to.

Owner

Sorry, I guess I wasn't clear before: below there are many spaces around = in keyword arguments.

Josef Perktold

Should this be if bw is not None or if not bw is None?

Josef Perktold

Should this be compute_bw(self, bw=None, bwmethod=None)? It should be, if they are optional, even if one of the two is required.

I think, if it's possible, there should be a recommended default, Scott's or Silverman's: normal_reference?

Ralf Gommers

Should be

if edat is None:
    edat = self.tdat
Ralf Gommers
Collaborator

When you're doing PEP8 fixes, make your life easier by running http://pypi.python.org/pypi/pep8 over the file(s). It will warn you when things are non-standard.

Ralf Gommers

Should be if not isinstance(bw, basestring).

The else clause below should probably also check that the input is callable, like so: hasattr(bw, '__call__').

Ralf Gommers

This took me a bit of puzzling. IMSE doesn't actually calculate the integral, so the name is a bit deceptive. I guess you don't need to calculate it explicitly if you're only using it from optimize.fmin.

Did you also plan to provide other metrics, like ISE or IAE?

Ralf Gommers
Collaborator

I think the purpose of the convolution kernels and how to use them needs some explanation. So far they're only used in UKDE.IMSE as far as I can tell.

and others added some commits June 18, 2012
Ralf Gommers

I'd reserve LaTeX for formulas that are at least somewhat complicated. This would be better written as plain text.

Ralf Gommers

Probably better to describe what's different from GPKE. Anything besides the summing? I thought this was vectorized too, can't that be reused?

Ralf Gommers

Not actually implemented yet (commented out below). Wouldn't it be easier to return the sorted tdat or edat than ix?

I left it inactive because it gets confusing when you have more than one variable. What should you sort by in the multivariate case?

Collaborator

Perhaps all of them, first on axis 0, then 1, etc.?

Ralf Gommers
Collaborator

If you don't need code anymore, better to delete it.

Ralf Gommers
Collaborator

This fix really needs a test for edat usage.

Ralf Gommers

The old version (array_like) is actually the correct one.

Ralf Gommers
Collaborator

I think you meant these as examples, right? Nothing is actually tested. Matplotlib is only an optional dependency of statsmodels, so you should only use it within a function or with a conditional import (i.e. within a try/except).

Josef Perktold

doesn't work for me

import statsmodels.nonparametric.nonparametric2 as nparam ?

Josef Perktold
Owner

leaving plot.show() in a test hangs the test run when I run it
also the matplotlib import needs to be protected (try/except)

Owner

The Pareto graph is just a line at 1e100.
Laplace and powerlaw (?) seem to have problems close to the boundaries.

One print statement is left somewhere when running the tests.

The tests run without failures, but are slow: 244 seconds. We need to figure out a way to shorten this or mark some tests as slow before merging.

(These are things that need to be fixed before merging, but can be OK, or not our business, in a development branch.)

Owner

some of the test cases would make nice example scripts

Owner

I think the class names need to be made longer and more descriptive. Only the most famous models are allowed to have acronyms. KDE is ok, but UKDE doesn't tell me anything.

Josef Perktold

special.ndtr(x) has the cdf for standard normal, used by scipy.stats.distributions.norm
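A quick check of that equivalence (illustrative only):

import numpy as np
from scipy import special, stats

x = np.linspace(-3, 3, 7)
# special.ndtr is the standard normal CDF; stats.norm.cdf wraps it
print(np.allclose(special.ndtr(x), stats.norm.cdf(x)))  # True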

Josef Perktold
Owner

Nice, I like having the cdf available. Azzalini (IIRC) mentioned that the rate (as a function of n) for the cdf bandwidth should be smaller (?) than for the pdf. Did you see anything about bw choice for the cdf?

Collaborator

Fast, and seems to work well. At least in 1-D it converges nicely to the empirical CDF.

Collaborator

@josef-pkt that's what I thought too, bandwidth isn't the same as for pdf.

Collaborator

Can you factor out all the code related to asarray, K, N and reshaping? It's more than 10 lines that are duplicated in every single kernel function. It should be something like

def _get_shape_and_type(Xi, x, kind='c'):
    ...
    return Xi, x, K, N
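Filled out along the lines of the shape handling currently duplicated in each kernel, it might look like this (a sketch, not a drop-in replacement):

import numpy as np

def _get_shape_and_type(Xi, x, kind='c'):
    Xi = np.asarray(Xi)
    if Xi.ndim > 1:          # several variables, many observations
        N, K = Xi.shape
    elif Xi.ndim == 1:       # one variable, many observations
        N, K = Xi.shape[0], 1
    else:                    # a single point (scalar)
        N, K = 1, 1
    assert N >= K            # need more observations than variables
    return Xi.reshape([N, K]), x, K, N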
Collaborator

The UKDE, CKDE interface now doesn't allow specifying the kernels to use. The Epanechnikov kernel isn't used at all. Are you planning to expand that interface? In any case, what the default kernels are should be documented.

Collaborator

I have the feeling that the for-loop in GPKE can still be optimized; it's very expensive now. You can see this easily by profiling in IPython. Use for example %prun dens_scott.cdf() after having run the script below. 33000 function calls for a 1-D example with 1000 points.

import numpy as np
from scipy import stats

from statsmodels.sandbox.distributions.mixture_rvs import mixture_rvs
from statsmodels.nonparametric import UKDE
from statsmodels.tools.tools import ECDF

import matplotlib.pyplot as plt


np.random.seed(12345)
obs_dist = mixture_rvs([.25, .75], size=1000, dist=[stats.norm, stats.norm],
                       kwargs=(dict(loc=-1, scale=.5), dict(loc=1, scale=.5)))

dens_scott = UKDE(tdat=[obs_dist], var_type='c', bw='normal_reference')
est_scott = dens_scott.pdf()

xx = np.linspace(-4, 4, 100)
est_scott_cdf = dens_scott.cdf(xx)
ecdf = ECDF(obs_dist)

# Plot the cdf
fig = plt.figure()
plt.plot(xx, est_scott_cdf, 'b.-')
plt.plot(ecdf.x, ecdf.y, 'r.-')

plt.show()
Collaborator

The for-loop could be over the number of variables instead of the data points, right?

Owner

It looks that way to me too. @gpanterov is there any reason this couldn't be vectorized over the number of observations?

Ralf Gommers

Indentation isn't consistent here.

Ralf Gommers
Collaborator

Please add import matplotlib.pyplot as plt.

Collaborator

Pareto is still broken.

Collaborator

The Weibull plot shows an interesting issue - finite support isn't handled in the UKDE, CKDE classes. Close to 0 this goes wrong.

matplotlib.pyplot import - added.
Pareto broken -- strange; it works for me. Maybe it depends on the seed. With np.random.seed(123456) I get the following plot:
http://statsmodels-np.blogspot.com/2012/07/kde-estimate-of-pareto-rv.html
The density estimate isn't very good, but this could be due to the relatively small sample size. Will add the seed.

Collaborator

With that seed I get the same result as you. The result is quite far off again though, again due to not dealing with support.

Skipper Seabold

Probably want to avoid underscores in class names unless it's to mark the class as private. CamelCase is almost always good enough.

Owner

You also want to stick explicitly with new-style classes, i.e., all of your classes should inherit from object:

class GenericKDE(object):
Skipper Seabold

Not a huge deal, but you want a class-level setup method here or this gets called for every test method. I.e.,

@classmethod
def setUpClass(cls):
    ...
Skipper Seabold

Could you post scripts somewhere to compare the output for the full datasets? I'd like to compare the performance.

Sure, I can do that. But it becomes quite slow if you include the entire data set and use the data-driven bandwidth estimates (especially cross-validation least squares).

Owner

Yeah, don't put it in the tests, but if you could put it as an example script somewhere I can play with that would be helpful.

gpanterov removed the Epanechnikov kernel. Could be added again later if we decide to give the user an option to specify kernels. However, most references claim the choice of kernel is not important.
02b3781
Skipper Seabold

This could be a property since it doesn't need any arguments. I'm not sure about caching it yet since I don't know all the moving parts.

Ralf Gommers
Collaborator

My version of MPL (1.0.1) doesn't have ax.zaxis. What version are you on? Can you leave out those 2 lines, or replace them with something that works for multiple versions?

Also, you could add a second plot of the same density with imshow(Z). While the 3-D version is fancier, I find the 2-D one much easier to interpret.

Other than that, looks good.

Ralf Gommers

You don't need A, B and the for-loop. These 8 lines can be replaced by

ix = np.random.uniform(size=N) > 0.5
V = np.random.multivariate_normal(mu1, cov1, size=N)
V[ix, :] = np.random.multivariate_normal(mu2, cov2, size=N)[ix, :]
Ralf Gommers

This for-loop can also be removed. Three lines above can be replaced by

edat = np.column_stack([X.ravel(), Y.ravel()])
Z = dens.pdf(edat).reshape(X.shape)

It would be good to document in the pdf method that the required shape for edat is (num_points, K), with K the number of dimensions (available as the K attribute).

Ralf Gommers
Collaborator

You forgot to remove the Reg class.

statsmodels/examples/ex_multivar_UKDE.py
@@ -0,0 +1,55 @@
 1  +#import nonparametric2 as nparam
 2  +import statsmodels.nonparametric as nparam
 3  +import scipy.stats as stats
 4  +import numpy as np
 5  +import matplotlib.pyplot as plt
 6  +from mpl_toolkits.mplot3d import axes3d
 7  +from matplotlib import cm
 8  +from matplotlib.ticker import LinearLocator, FormatStrFormatter
Ralf Gommers Collaborator
rgommers added a note July 24, 2012

This line isn't needed, nor is line 1 (commented-out import).

statsmodels/nonparametric/nonparametric2.py
((240 lines not shown))
+        \sum_{j=1,j\neq i}K_{h}(X_{i},X_{j})
+
+        where :math:`K_{h}` represents the
+        Generalized product kernel estimator:
+
+        .. math:: K_{h}(X_{i},X_{j})=
+        \prod_{s=1}^{q}h_{s}^{-1}k\left(\frac{X_{is}-X_{js}}{h_{s}}\right)
+        """
+
+        LOO = tools.LeaveOneOut(self.tdat)
+        i = 0
+        L = 0
+        for X_j in LOO:
+            f_i = tools.gpke(bw, tdat=-X_j, edat=-self.tdat[i, :],
+                             var_type=self.var_type)
+            i += 1
Ralf Gommers Collaborator
rgommers added a note July 25, 2012

This i=0; i+=1 is a bit un-Pythonic. You can get i from the for-loop like so: for i, X_j in enumerate(LOO):.

statsmodels/nonparametric/nonparametric2.py
((21 lines not shown))
+"""
+
+import numpy as np
+from scipy import integrate, stats
+import np_tools as tools
+import scipy.optimize as opt
+import KernelFunctions as kf
+
+__all__ = ['UKDE', 'CKDE']
+
+
+class GenericKDE (object):
+    """
+    Generic KDE class with methods shared by both UKDE and CKDE
+    """
+    def compute_bw(self, bw):
Ralf Gommers Collaborator
rgommers added a note July 25, 2012

This isn't a public method, so should start with an underscore.

Ralf Gommers
Collaborator

Please mark the current slow tests with @dec.slow from numpy.testing, as we discussed several times. You can add a test with smaller input as non-slow.

Ralf Gommers
Collaborator

I've fixed a lot of small issues in gpanterov#4. Fixing was much quicker than noting all of them here.

Ralf Gommers
Collaborator

Other things to still address related to current code:

  • gpke() still takes kernel function keywords, but there aren't any alternatives. Remove?
  • imse() documented return value (CV) is incorrect; it doesn't return a function.
  • AitchisonAitken() returns an (N, K) array, while WangRyzin() is documented as returning a float. This cannot be correct.
  • Gaussian() and convolution/cdf kernels don't have docstrings yet.
  • function names in kernels.py should be lowercase (PEP 8)

EDIT: (16 Aug) these points are taken care of now.

Ralf Gommers
Collaborator

More points:

  • tests not marked with @dec.slow still take 60 seconds to run; this needs to be improved by running tests with smaller input data. Marking more with @dec.slow is not an option, because it would just mean most tests won't be run by default.
  • UKDE/CKDE names aren't too informative. Josef asked before for better names.
  • An example with ordered/unordered data really needs to be added. The examples folder now has two examples, one with 1-D continuous input and one with 2-D continuous input.

I've reviewed all the code now; that's all I saw.

Ralf Gommers
Collaborator

One more thing: Josef asked you about what it would take to handle distributions with finite support correctly. See plots for Pareto and Weibull distributions in ex_univar_kde.py for an example of where this would help. Have you thought about that?

Skipper Seabold jseabold commented on the diff August 14, 2012
statsmodels/nonparametric/np_tools.py
@@ -0,0 +1,139 @@
+import numpy as np
+
+from . import kernels
Skipper Seabold Owner
jseabold added a note August 14, 2012

Can you change this to an absolute import?

Ralf Gommers Collaborator
rgommers added a note August 14, 2012

I thought you wanted relative ones? Or was that in the past?

Skipper Seabold Owner
jseabold added a note August 14, 2012

I think we settled on relative in api, absolute everywhere else. I usually debug in the source directory.

Skipper Seabold jseabold commented on the diff August 14, 2012
statsmodels/nonparametric/nonparametric2.py
((231 lines not shown))
+        The leave-one-out kernel estimator of :math:`f_{-i}` is:
+
+        .. math:: f_{-i}(X_{i})=\frac{1}{(n-1)h}
+                    \sum_{j=1,j\neq i}K_{h}(X_{i},X_{j})
+
+        where :math:`K_{h}` represents the generalized product kernel
+        estimator:
+
+        .. math:: K_{h}(X_{i},X_{j}) =
+            \prod_{s=1}^{q}h_{s}^{-1}k\left(\frac{X_{is}-X_{js}}{h_{s}}\right)
+        """
+        LOO = tools.LeaveOneOut(self.tdat)
+        L = 0
+        for i, X_j in enumerate(LOO):
+            f_i = tools.gpke(bw, tdat=-X_j, edat=-self.tdat[i, :],
+                             var_type=self.var_type)
Skipper Seabold Owner
jseabold added a note August 14, 2012

A technical question. This is the density of all the X_j, evaluated at the left-out point X_i, correct? Is this right? I thought you would just evaluate the density f_hat = f(X_j) over some grid support x and use sum(log(f_hat)) as in Racine and Li section 2.3.4.

"f_hat_{-i}(x) is the leave-one-out kernel estimator of f(X_i) that uses all points except X_i to construct the density estimate."

Is there another reference or more details for this? I haven't been through this in a while.

gpanterov
gpanterov added a note August 14, 2012

A technical question. This is the density of all the X_j, evaluated at the left-out point X_i, correct? Is this right?

Yep, this is correct. X_j represents the entire data set without the left-out point i.

Skipper Seabold Owner
jseabold added a note August 14, 2012

But then you're evaluating it at the left-out point. Is this the right thing to do? My impression was that you get the whole density of X_j, something like

density = KDE(X_j)
density.fit(bw=bw)

Then you evaluate the log-likelihood of this whole density.

np.sum(np.log(density.density))

That's one f_i. You want to sum all the f_i, that's the whole leave-one-out likelihood. Am I misunderstanding?

Skipper Seabold Owner
jseabold added a note August 14, 2012

To explain a bit more (for my own edification): what it looks like you have now is a sum of each leave-one-out density estimate (not normalized by 1/(n-1)) evaluated at each left-out point. This doesn't seem right to me.

gpanterov
gpanterov added a note August 14, 2012

Yes, this is correct. The normalization is done at the end; since it is a sum, it doesn't matter.
I believe it is exactly the same as section 2.3.4 in Racine's primer. See the equation at the top of p. 16.

Skipper Seabold Owner
jseabold added a note August 14, 2012

I don't see anything that indicates that the x in this equation is X_i. Why not just write X_i? I always assumed that x was just some grid that covers the support of f_hat. This is how I interpreted its use elsewhere, e.g., equation 2.2. Does what I'm saying make sense? This is why I assumed that the FFT estimator would be of great benefit, because we have to evaluate n full densities, i.e., the likelihood of its entire density over its entire support. I'll have a look at the original papers and see if they shed any light on my confusion.

Skipper Seabold Owner
jseabold added a note August 14, 2012

Hmm, ok, looking elsewhere I guess I misunderstood the notation here. I guess it makes sense intuitively, but I'll need to work with this a bit more.

Skipper Seabold Owner
jseabold added a note August 14, 2012

And it does explicitly state f(X_i). D'oh. Sorry for the noise.
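For the record, a minimal 1-D sketch of the leave-one-out log-likelihood being discussed (Gaussian kernel only; the PR's gpke version handles the general mixed-data case):

import numpy as np

def loo_loglikelihood(x, h):
    n = len(x)
    total = 0.0
    for i in range(n):
        x_noti = np.delete(x, i)            # all observations except X_i
        u = (x[i] - x_noti) / h
        # f_{-i}(X_i) = 1/((n-1)h) * sum_j k((X_i - X_j)/h)
        f_i = np.exp(-0.5 * u ** 2).sum() / (np.sqrt(2 * np.pi) * h * (n - 1))
        total += np.log(f_i)
    return total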

Skipper Seabold jseabold commented on the diff August 14, 2012
statsmodels/nonparametric/nonparametric2.py
((226 lines not shown))
+        func: function
+            For the log likelihood should be numpy.log
+
+        Notes
+        -----
+        The leave-one-out kernel estimator of :math:`f_{-i}` is:
+
+        .. math:: f_{-i}(X_{i})=\frac{1}{(n-1)h}
+                    \sum_{j=1,j\neq i}K_{h}(X_{i},X_{j})
+
+        where :math:`K_{h}` represents the generalized product kernel
+        estimator:
+
+        .. math:: K_{h}(X_{i},X_{j}) =
+            \prod_{s=1}^{q}h_{s}^{-1}k\left(\frac{X_{is}-X_{js}}{h_{s}}\right)
+        """
Skipper Seabold Owner
jseabold added a note August 14, 2012

Interesting parallelization results. I don't know if you want to play around with this that much, but you don't start seeing gains from using all cores until nobs > 1500 or so, and even then it's modest (i.e., barely worth it). I guess the overhead of joblib is still costlier than the computations, but I would've expected it to be a bit faster earlier. There may be other hotspots that this benefits from; you can use this:

        var_type = self.var_type
        tdat = self.tdat
        LOO = tools.LeaveOneOut(tdat)
        from statsmodels.tools.parallel import parallel_func
        parallel, p_func, n_jobs = parallel_func(tools.gpke, n_jobs=-1,
                                                 verbose=0)
        L = sum(map(func, parallel(p_func(bw, tdat=-X_j, edat=-tdat[i, :],
                            var_type=var_type) for i, X_j in enumerate(LOO))))

Alternatively, we're going to have to look elsewhere for speed gains.

Skipper Seabold Owner
jseabold added a note August 14, 2012

Forgot the

        return -L
Josef Perktold Owner
josef-pkt added a note August 14, 2012

joblib needs to work in batches, especially on Windows; see the mailing list thread "playing with joblib" (10/10/2011), Alexandre's comments early on.

Ralf Gommers Collaborator
rgommers added a note August 16, 2012

I think it's better to look at the algorithms again in some more detail (i.e. with a profiler) instead of at joblib. Joblib also isn't going to yield more than a factor of a couple (2 at most for me); we're looking for more than that still.

Skipper Seabold Owner
jseabold added a note August 19, 2012

2-8x speed-up is a pretty decent speed-up for things that take 1-2 minutes. This is embarrassingly parallel, so we should be able to take advantage here.

I still think we want to use binning + FFT for the default Gaussian kernel. This is going to yield a 70-300x speed-up for each evaluation according to the literature and my experience with doing univariate. Newer multipole-type methods will yield 2-3x beyond this, but it looks more complicated than I have time for. I'm going to see if I can't look through a copy of Silverman or Wand and Jones tomorrow for the details.

Ralf Gommers Collaborator
rgommers added a note August 19, 2012

Note that George already implemented batching/blocking, with large speedups.

About FFT, there then needs to be a solution for evaluating at certain points or on non-equidistant grids.

Skipper Seabold Owner
jseabold added a note August 19, 2012

Ah, good. I'll have to catch up (currently homeless with limited internet). Looking into the changes needed for FFT.

Ralf Gommers Collaborator
rgommers added a note August 19, 2012

Note that that's not in this PR, but it's in the nonparametric-all branch.

Ralf Gommers Collaborator
rgommers added a note August 19, 2012

Reading up some more about FFT methods, I found http://books.google.nl/books?hl=en&lr=&id=AyJ9xrrwDnIC&oi=fnd&pg=PA203&dq=density+estimation+multipole&ots=K_IBmhGVTY&sig=yrzJOhpPe4d2-2i_xCbeFpdE6hE#v=onepage&q=density%20estimation%20multipole&f=false which gives the following scalings:

O(N^D log(n^D))    # FFT
O(D N^2)     # direct summation (like in this PR)

It also gives a reference to an empirical study (Wand, "Fast Computation of Multivariate Kernel Estimators", 1994) showing that for D=3 the speed-up of FFT over direct summation is at most 5 for 10k samples. This scaling is the reason that FFT methods are mostly restricted to 1-D.

Josef Perktold Owner
josef-pkt added a note August 19, 2012

I don't have much idea about the FFT in the multivariate KDE case, but O(N^D log(N^D)) might not be correct for product kernels. My guess would be that the terms are N*D instead of N**D for product kernels.

I think we should leave the work on speed improvements for later, so we can get the current work merged and the discussion doesn't get lost with a git rebase.

Ralf Gommers Collaborator
rgommers added a note August 19, 2012

That's a good point, we don't want to rebase this PR. That should happen on a new PR, otherwise many of these comments will be lost.

Ralf Gommers Collaborator
rgommers added a note August 19, 2012

I'm finding lots of references to implementations of the Fast Gauss Transform, but nothing that's directly reusable. That would be the way to go.

Ralf Gommers Collaborator
rgommers added a note August 19, 2012

FigTree seems to be the best available implementation, but it's LGPL.

Skipper Seabold Owner
jseabold added a note August 19, 2012

Thanks for the references. I don't think these are directly comparable though. The benchmark for the Wand study is not direct summation; it's a simple multivariate binning algorithm mentioned in Scott, written in Fortran AFAIK. The reference in Scott for this is "On Frequency Polygons and Average Shifted Histograms in Higher Dimensions", Hjort 1986.

My impression for the product kernels is the same as Josef's, though I'm not as familiar with this paper. I'm not proposing multivariate gridding, and I could be wrong - these are just impressions right now. Just the way we have it now, we are essentially doing univariate kernel density estimation, which should be easy to speed up, since we'd only do the binning once and then evaluate at the different points with different bandwidths, which should give the speed-ups I mentioned. I just need to figure out how to make this possible.

Ralf Gommers Collaborator
rgommers added a note August 19, 2012

Thanks. I didn't read the Wand paper, just the summary of it in the book I linked, so I may have misunderstood the reference case.

Ralf Gommers
Collaborator

Open points from mailing list discussion:

  • renaming: UKDE --> KDE should be done. CKDE may be renamed to KDEConditional or ConditionalKDE (Josef's preference).
  • Skipper has a preference for a do-nothing __init__() and fit() to do bw estimation. Related to #429 (don't see exactly how though).
  • as mentioned above too, the issue of densities with finite support. Need to at least note as an open issue.
Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/kernels.py
((26 lines not shown))
+    if Xi.ndim > 1:
+        K = np.shape(Xi)[1]
+        N = np.shape(Xi)[0]
+    elif Xi.ndim == 1:  # One variable with many observations
+        K = 1
+        N = np.shape(Xi)[0]
+    else:  # ndim ==0 so Xi is a single point (number)
+        K = 1
+        N = 1
+
+    assert N >= K  # Need more observations than variables
+    Xi = Xi.reshape([N, K])
+    return h, Xi, x, N, K
+
+
+def aitchison_aitken(h, Xi, x, num_levels=False):
Josef Perktold Owner
josef-pkt added a note August 17, 2012

(not sure if we need to change this)

I would prefer the reversed order of the arguments: (x, Xi, h).
I like h last (in case we get default arguments and h as a keyword).
The order x, Xi is mainly to read it as a function of x given Xi and h (not sure about this, since standard kernel notation is K(Xi, x)?).

With x and Xi it's not obvious to me from the names which is which (I often use xi to mean x_i, the i-th observation; what does the i stand for in the training set?).
Maybe data instead of Xi would be more informative.

(We need kernel functions for other parts, but I don't know or recall the details.)

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/kernels.py
((23 lines not shown))
+    h = np.asarray(h, dtype=float)
+    Xi = np.asarray(Xi)
+    # More than one variable with more than one observations
+    if Xi.ndim > 1:
+        K = np.shape(Xi)[1]
+        N = np.shape(Xi)[0]
+    elif Xi.ndim == 1:  # One variable with many observations
+        K = 1
+        N = np.shape(Xi)[0]
+    else:  # ndim ==0 so Xi is a single point (number)
+        K = 1
+        N = 1
+
+    assert N >= K  # Need more observations than variables
+    Xi = Xi.reshape([N, K])
+    return h, Xi, x, N, K
Josef Perktold Owner
josef-pkt added a note August 17, 2012

It's not clear to me what the actual shape restrictions of the kernels are.
The docstrings in the individual kernel functions are ambiguous. Are the kernels for univariate, a single multivariate, or really many multivariate observations?

Josef Perktold Owner
josef-pkt added a note August 17, 2012

I don't see any unit tests for the kernel functions themselves.
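A direct test could be as simple as the following sketch (hypothetical: it assumes the gaussian(h, Xi, x) signature from the diff and that the kernel value at h=1 is a standard normal density):

import numpy as np
from statsmodels.nonparametric import kernels

def test_gaussian_kernel_integrates_to_one():
    grid = np.linspace(-6, 6, 1201)
    vals = kernels.gaussian(1.0, grid, 0.0)  # h=1, evaluated around x=0
    area = np.trapz(vals.ravel(), grid)      # should integrate to ~1
    np.testing.assert_allclose(area, 1.0, rtol=1e-4)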

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/kernels.py
((123 lines not shown))
+           discrete distributions", Biometrika, vol. 68, pp. 301-309, 1981.
+    """
+    h, Xi, x, N, K = _get_shape_and_transform(h, Xi, x)
+    Xi = np.abs(np.asarray(Xi, dtype=int))
+    x = np.abs(np.asarray(x, dtype=int))
+    if K == 0:
+        return Xi
+
+    kernel_value = (0.5 * (1 - h) * (h ** abs(Xi - x)))
+    kernel_value = kernel_value.reshape([N, K])
+    inDom = (Xi == x) * (1 - h)
+    kernel_value[Xi == x] = inDom[Xi == x]
+    return kernel_value
+
+
+def gaussian(h, Xi, x):
Josef Perktold Owner
josef-pkt added a note August 17, 2012

In general, we need more continuous kernels than just gaussian, especially ones with bounded/compact support

Ralf Gommers Collaborator
rgommers added a note August 19, 2012

Handling bounded support would be very useful. Having other continuous kernels besides gaussian that aren't fundamentally different (like Epanechnikov) would be at the bottom of my list, it's pretty much unimportant for the result of estimation.

Josef Perktold Owner
josef-pkt added a note August 19, 2012

Thinking about extensibility: how could a gamma kernel be included, for endog that is continuous but strictly positive?

Epanechnikov (?) or others have the advantage that they only need to be evaluated at points in the neighborhood, while the gaussian requires all points.
They might also be better with multimodal densities (but I don't have any evidence).

Ralf Gommers Collaborator
rgommers added a note August 19, 2012

There were kernel selection keywords in an earlier version, but we took them out because there were no other kernels. One issue is that you need explicit integral and convolution forms like gaussian_cdf and gaussian_convolution, to not make things very slow. Not sure if there are analytical expressions for those for the gamma kernel.

Josef Perktold Owner
josef-pkt added a note August 19, 2012

I don't know about the convolution version, but if I understand correctly, the cdf can be obtained from

class gamma_gen(rv_continuous):
    def _cdf(self, x, a):
        return special.gammainc(a, x)
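A quick sanity check of that identity (illustrative):

import numpy as np
from scipy import special, stats

x = np.linspace(0.1, 5, 5)
a = 2.0
# special.gammainc is the regularized lower incomplete gamma function,
# which equals the gamma distribution's cdf
print(np.allclose(special.gammainc(a, x), stats.gamma.cdf(x, a)))  # True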
Josef Perktold Owner
josef-pkt added a note August 19, 2012

similar for beta kernel (distributions with lower and upper bound)

I don't think we need to get to this now, but it might be worth doing down the road, either in these classes or separately.

Ralf Gommers Collaborator
rgommers added a note August 19, 2012

If we have the different kernel forms, then it's simple to add. But I agree that that's for later.

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/kernels.py
((163 lines not shown))
+    return kernel_value
+
+
+def gaussian_convolution(h, Xi, x):
+    """ Calculates the Gaussian Convolution Kernel """
+    h, Xi, x, N, K = _get_shape_and_transform(h, Xi, x)
+    if K == 0:
+        return Xi
+
+    z = (Xi - x) / h
+    kernel_value = (1. / np.sqrt(4 * np.pi)) * np.exp(- z ** 2 / 4.)
+    kernel_value = kernel_value.reshape([N, K])
+    return kernel_value
+
+
+def wang_ryzin_convolution(h, Xi, Xj):
Josef Perktold Owner
josef-pkt added a note August 17, 2012

why Xj and not x? What's the difference between convolution and the plain kernel? (I didn't read the reference for this.)
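(For background: the convolution kernel is the kernel convolved with itself,

.. math:: \bar{K}(v) = \int K(u)\,K(v-u)\,du = \frac{1}{\sqrt{4\pi}}e^{-v^{2}/4}

for the standard Gaussian, which is what gaussian_convolution above computes. It shows up in the least-squares CV criterion, where the integral of the squared density estimate reduces to a double sum over convolution kernels; that matches the earlier observation that these kernels are only used in UKDE.IMSE.)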

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/kernels.py
((218 lines not shown))
+        for x in Dom_x[i]:
+            Sigma_x += aitchison_aitken(h[i], Xi[:, i], int(x),
+                                       num_levels=len(Dom_x[i])) * \
+            aitchison_aitken(h[i], Xj[i], int(x), num_levels=len(Dom_x[i]))
+
+        Ordered[:, i] = Sigma_x[:, 0]
+
+    return Ordered
+
+
+def gaussian_cdf(h, Xi, x):
+    h, Xi, x, N, K = _get_shape_and_transform(h, Xi, x)
+    if K == 0:
+        return Xi
+
+    cdf = 0.5 * h * (1 + erf((x - Xi) / (h * np.sqrt(2))))
Josef Perktold Owner
josef-pkt added a note August 17, 2012

should be replaced by a norm_cdf for clarity, but not now, since this is cheaper than calling scipy.stats
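(For reference, the identity in use is

.. math:: \Phi(z) = \tfrac{1}{2}\left[1 + \operatorname{erf}\left(z/\sqrt{2}\right)\right]

so the expression above is h * Phi((x - Xi)/h); the extra factor h presumably pre-compensates for the 1/h normalization that gpke applies afterwards.)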

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/kernels.py
((233 lines not shown))
+    cdf = 0.5 * h * (1 + erf((x - Xi) / (h * np.sqrt(2))))
+    cdf = cdf.reshape([N, K])
+    return cdf
+
+
+def aitchison_aitken_cdf(h, Xi, x_u):
+    Xi = np.abs(np.asarray(Xi, dtype=int))
+    if Xi.ndim > 1:
+        K = np.shape(Xi)[1]
+        N = np.shape(Xi)[0]
+    elif Xi.ndim == 1:
+        K = 1
+        N = np.shape(Xi)[0]
+    else:  # ndim ==0 so Xi is a single point (number)
+        K = 1
+        N = 1
Josef Perktold Owner
josef-pkt added a note August 17, 2012

shape handling is outsourced ?

Josef Perktold Owner
josef-pkt added a note August 17, 2012

outsourcing?

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/kernels.py
((254 lines not shown))
+    Xi = Xi.reshape([N, K])
+    Dom_x = [np.unique(Xi[:, i]) for i in range(K)]
+    Ordered = np.empty([N, K])
+    for i in range(K):
+        Sigma_x = 0
+        for x in Dom_x[i]:
+            if x <= x_u:
+                Sigma_x += aitchison_aitken(h[i], Xi[:, i], int(x),
+                                           num_levels=len(Dom_x[i]))
+
+        Ordered[:, i] = Sigma_x[:, 0]
+
+    return Ordered
+
+
+def wang_ryzin_cdf(h, Xi, x_u):
Josef Perktold Owner
josef-pkt added a note August 17, 2012

docstrings, I guess Skipper's type of docstring concatenation or templating would pay off in this module and not be too messy.

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/kernels.py
((278 lines not shown))
+    else:  # ndim ==0 so Xi is a single point (number)
+        K = 1
+        N = 1
+
+    if K == 0:
+        return Xi
+
+    Xi = Xi.reshape([N, K])
+    h = h.reshape((K, ))
+    Dom_x = [np.unique(Xi[:, i]) for i in range(K)]
+    Ordered = np.empty([N, K])
+    for i in range(K):
+        Sigma_x = 0
+        for x in Dom_x[i]:
+            if x <= x_u:
+                Sigma_x += wang_ryzin(h[i], Xi[:, i], int(x))
Josef Perktold Owner
josef-pkt added a note August 17, 2012

Can the kernel be vectorized for x? See the question above about the shapes of arguments to the kernel functions.

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/nonparametric2.py
((69 lines not shown))
+
+    def _normal_reference(self):
+        """
+        Returns Scott's normal reference rule of thumb bandwidth parameter.
+
+        Notes
+        -----
+        See p.13 in [2] for an example and discussion.  The formula for the
+        bandwidth is
+
+        .. math:: h = 1.06n^{-1/(4+q)}
+
+        where :math:`n` is the number of observations and :math:`q` is the
+        number of variables.
+        """
+        c = 1.06
Josef Perktold Owner
josef-pkt added a note August 17, 2012

Is c always 1.06, or should this be an option?

gpanterov
gpanterov added a note August 19, 2012

Yes, for the gaussian kernel the scaling factor c is always 1.06.
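As a function, the rule would be roughly (a sketch; the per-variable standard deviation scaling is an assumption here, since the diff only shows the constant c):

import numpy as np

def normal_reference_bw(data):
    # data: 2-D array of shape (n, q)
    data = np.asarray(data)
    n, q = data.shape
    # h_s = 1.06 * sigma_s * n**(-1/(4+q)) for each variable s
    return 1.06 * data.std(axis=0) * n ** (-1. / (4 + q))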

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/nonparametric2.py
((45 lines not shown))
+                - cv_ml: cross validation maximum likelihood
+                - normal_reference: normal reference rule of thumb
+                - cv_ls: cross validation least squares
+
+        Notes
+        -----
+        The default values for bw is 'normal_reference'.
+        """
+
+        self.bw_func = dict(normal_reference=self._normal_reference,
+                            cv_ml=self._cv_ml, cv_ls=self._cv_ls)
+        if bw is None:
+            bwfunc = self.bw_func['normal_reference']
+            return bwfunc()
+
+        if not isinstance(bw, basestring):
Josef Perktold Owner
josef-pkt added a note August 17, 2012

Note for later: basestring might cause compatibility problems with Python 3.x, but I don't know offhand what the compatible way is.

Josef Perktold Owner
josef-pkt added a note August 17, 2012

I would change the conditional:
if <string>
...
elif <callable>  # new option
...
else

(In the last case asarray will raise an exception for anything else, which sounds OK and doesn't need a try/except.)
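Spelled out with this PR's names (a sketch; the signature of the user-supplied callable is an assumption):

if bw is None:
    bw = 'normal_reference'
if isinstance(bw, basestring):
    return self.bw_func[bw]()         # named rule: normal_reference, cv_ml, cv_ls
elif hasattr(bw, '__call__'):         # new option: user-supplied function
    return np.asarray(bw(self.tdat))  # (assumed signature)
else:
    # user-specified bandwidth values; asarray raises for anything else
    return np.asarray(bw, dtype=float)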

Ralf Gommers Collaborator
rgommers added a note August 19, 2012

Why does basestring cause an issue for py3k? I'm pretty sure this is the standard way to check if a variable is a string, and 2to3 should handle it fine.

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/nonparametric2.py
((229 lines not shown))
+        Notes
+        -----
+        The leave-one-out kernel estimator of :math:`f_{-i}` is:
+
+        .. math:: f_{-i}(X_{i})=\frac{1}{(n-1)h}
+                    \sum_{j=1,j\neq i}K_{h}(X_{i},X_{j})
+
+        where :math:`K_{h}` represents the generalized product kernel
+        estimator:
+
+        .. math:: K_{h}(X_{i},X_{j}) =
+            \prod_{s=1}^{q}h_{s}^{-1}k\left(\frac{X_{is}-X_{js}}{h_{s}}\right)
+        """
+        LOO = tools.LeaveOneOut(self.tdat)
+        L = 0
+        for i, X_j in enumerate(LOO):
Josef Perktold Owner
josef-pkt added a note August 17, 2012

I would prefer x_noti instead of X_j

Josef Perktold Owner
josef-pkt added a note August 17, 2012

X_j is all observations with j != i? Just checking that I understand correctly.

Ralf Gommers Collaborator
rgommers added a note October 18, 2012

correct

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/np_tools.py
((120 lines not shown))
+                k\left(\frac{X_{iq}-x_{q}}{h_{q}}\right)
+    """
+    iscontinuous, isordered, isunordered = _get_type_pos(var_type)
+    K = len(var_type)
+    N = np.shape(tdat)[0]
+    # must remain 1-D for indexing to work
+    bw = np.reshape(np.asarray(bw), (K,))
+    Kval = np.concatenate((
+        kernel_func[ckertype](bw[iscontinuous],
+                            tdat[:, iscontinuous], edat[:, iscontinuous]),
+        kernel_func[okertype](bw[isordered], tdat[:, isordered],
+                            edat[:, isordered]),
+        kernel_func[ukertype](bw[isunordered], tdat[:, isunordered],
+                            edat[:, isunordered])), axis=1)
+
+    dens = np.prod(Kval, axis=1) * 1. / (np.prod(bw[iscontinuous]))
Josef Perktold Owner
josef-pkt added a note August 17, 2012

Improved numerical precision and maybe efficiency: define this as a log, so we only need the sum and we don't have to take the log in the cv_ml loop.

But I don't really understand this part and the connection with cv_ml yet.

Josef Perktold Owner
josef-pkt added a note August 17, 2012

I think not; the product is just over the multivariate dimensions, not the observations.

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/kernels.py
((67 lines not shown))
+    Here :math:`c` is the number of levels plus one of the RV.
+
+    References
+    ----------
+    .. [1] J. Aitchison and C.G.G. Aitken, "Multivariate binary discrimination
+           by the kernel method", Biometrika, vol. 63, pp. 413-420, 1976.
+    .. [2] Racine, Jeff. "Nonparametric Econometrics: A Primer," Foundation
+           and Trends in Econometrics: Vol 3: No 1, pp1-88., 2008.
+    """
+    h, Xi, x, N, K = _get_shape_and_transform(h, Xi, x)
+    Xi = np.abs(np.asarray(Xi, dtype=int))
+    x = np.abs(np.asarray(x, dtype=int))
+    if K == 0:
+        return Xi
+
+    c = np.asarray([len(np.unique(Xi[:, i])) for i in range(K)], dtype=int)
Josef Perktold Owner
josef-pkt added a note August 17, 2012

Put this in the else branch of if num_levels; it's an expensive call that is not always used.

Ralf Gommers Collaborator
rgommers added a note October 13, 2012

done in my pr-408-comments branch.

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/kernels.py
((199 lines not shown))
+
+        Ordered[:, i] = Sigma_x[:, 0]
+
+    return Ordered
+
+
+def aitchison_aitken_convolution(h, Xi, Xj):
+    h, Xi, x, N, K = _get_shape_and_transform(h, Xi)
+    Xi = np.abs(np.asarray(Xi, dtype=int))
+    Xj = np.abs(np.asarray(Xj, dtype=int))
+    if K == 0:
+        return Xi
+
+    Xi = Xi.reshape([N, K])
+    h = h.reshape((K, ))
+    Dom_x = [np.unique(Xi[:, i]) for i in range(K)]
Josef Perktold Owner
josef-pkt added a note August 17, 2012

This might also be expensive if it has to be calculated very often.

Add it as an argument and store it in the caller?

When we do LOO, do we need to adjust Dom_x each time, or is it better to keep it constant? The kernel, aitchison_aitken, only uses num_levels, which, one could argue, should stay unchanged.
I'm not sure what happens with the loop for x in Dom_x[i] if x_{i} is missing.

Ralf Gommers Collaborator
rgommers added a note October 13, 2012

This line takes only a few percent of the total time; the for-loop under it is the expensive part. Reviewing/optimizing that now.

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/kernels.py
((80 lines not shown))
+        return Xi
+
+    c = np.asarray([len(np.unique(Xi[:, i])) for i in range(K)], dtype=int)
+    if num_levels:
+        c = num_levels
+
+    kernel_value = np.tile(h / (c - 1), (N, 1))
+    inDom = (Xi == x) * (1 - h)
+    kernel_value[Xi == x] = inDom[Xi == x]
+    kernel_value = kernel_value.reshape([N, K])
+    return kernel_value
+
+
+def wang_ryzin(h, Xi, x):
+    """
+    The Wang-Ryzin kernel, used for ordered discrete random variables.
Josef Perktold Owner
josef-pkt added a note August 17, 2012

This sounds like a misnomer to me. It doesn't just assume an ordering; it actually assumes discrete variables with a "uniform scale", fully metric: it uses absolute distance as the measure.

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/nonparametric2.py
((31 lines not shown))
+class _GenericKDE (object):
+    """
+    Generic KDE class with methods shared by both UKDE and CKDE
+    """
+    def _compute_bw(self, bw):
+        """
+        Computes the bandwidth of the data.
+
+        Parameters
+        ----------
+        bw: array_like or str
+            If array_like: user-specified bandwidth.
+            If a string, should be one of:
+
+                - cv_ml: cross validation maximum likelihood
+                - normal_reference: normal reference rule of thumb
Josef Perktold Owner
josef-pkt added a note August 17, 2012

should there be normal_scott and normal_silverman instead?

Ralf Gommers Collaborator
rgommers added a note August 19, 2012

That would be good, no reason not to supply both.

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/nonparametric2.py
((126 lines not shown))
+        Returns the value of the bandwidth that maximizes the integrated mean
+        square error between the estimated and actual distribution.  The
+        integrated mean square error (IMSE) is given by:
+
+        .. math:: \int\left[\hat{f}(x)-f(x)\right]^{2}dx
+
+        This is the general formula for the IMSE.  The IMSE differs for
+        conditional (CKDE) and unconditional (UKDE) kernel density estimation.
+        """
+        h0 = self._normal_reference()
+        bw = optimize.fmin(self.imse, x0=h0, maxiter=1e3, maxfun=1e3, disp=0)
+        return np.abs(bw)
+
+    def loo_likelihood(self):
+        raise NotImplementedError
+
Josef Perktold Owner
josef-pkt added a note August 17, 2012

add def imse with NotImplementedError ?

Ralf Gommers Collaborator
rgommers added a note October 13, 2012

_GenericKDE is now also a base class for the regression class, which doesn't have loo_likelihood or imse. So remove loo_likelihood here instead?

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/nonparametric2.py
((134 lines not shown))
+        """
+        h0 = self._normal_reference()
+        bw = optimize.fmin(self.imse, x0=h0, maxiter=1e3, maxfun=1e3, disp=0)
+        return np.abs(bw)
+
+    def loo_likelihood(self):
+        raise NotImplementedError
+
+
+class UKDE(_GenericKDE):
+    """
+    Unconditional Kernel Density Estimator
+
+    Parameters
+    ----------
+    tdat: list of ndarrays or 2-D ndarray
Josef Perktold Owner
josef-pkt added a note August 17, 2012

I think I would just call it data, or endog :)

I don't like the t and e abbreviations for train and evaluation much. It took me a long time to figure out what they're supposed to stand for.

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/nonparametric2.py
((186 lines not shown))
+    >>> N = 300
+    >>> np.random.seed(1234)  # Seed random generator
+    >>> c1 = np.random.normal(size=(N,1))
+    >>> c2 = np.random.normal(2, 1, size=(N,1))
+
+    Estimate a bivariate distribution and display the bandwidth found:
+
+    >>> dens_u = UKDE(tdat=[c1,c2], var_type='cc', bw='normal_reference')
+    >>> dens_u.bw
+    array([ 0.39967419,  0.38423292])
+    """
+    def __init__(self, tdat, var_type, bw=None):
+        self.var_type = var_type
+        self.K = len(self.var_type)
+        self.tdat = tools.adjust_shape(tdat, self.K)
+        self.all_vars = self.tdat
Josef Perktold Owner
josef-pkt added a note August 17, 2012

I prefer postfixing qualifiers

endog_all or just endog or data

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/nonparametric2.py
((187 lines not shown))
+    >>> np.random.seed(1234)  # Seed random generator
+    >>> c1 = np.random.normal(size=(N,1))
+    >>> c2 = np.random.normal(2, 1, size=(N,1))
+
+    Estimate a bivariate distribution and display the bandwidth found:
+
+    >>> dens_u = UKDE(tdat=[c1,c2], var_type='cc', bw='normal_reference')
+    >>> dens_u.bw
+    array([ 0.39967419,  0.38423292])
+    """
+    def __init__(self, tdat, var_type, bw=None):
+        self.var_type = var_type
+        self.K = len(self.var_type)
+        self.tdat = tools.adjust_shape(tdat, self.K)
+        self.all_vars = self.tdat
+        self.N, self.K = np.shape(self.tdat)
Josef Perktold Owner
josef-pkt added a note August 17, 2012

standard terminology nobs, k_vars

Ralf Gommers Collaborator
rgommers added a note August 19, 2012

+1, that's clearer.

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/nonparametric2.py
((227 lines not shown))
+            For the log likelihood should be numpy.log
+
+        Notes
+        -----
+        The leave-one-out kernel estimator of :math:`f_{-i}` is:
+
+        .. math:: f_{-i}(X_{i})=\frac{1}{(n-1)h}
+                    \sum_{j=1,j\neq i}K_{h}(X_{i},X_{j})
+
+        where :math:`K_{h}` represents the generalized product kernel
+        estimator:
+
+        .. math:: K_{h}(X_{i},X_{j}) =
+            \prod_{s=1}^{q}h_{s}^{-1}k\left(\frac{X_{is}-X_{js}}{h_{s}}\right)
+        """
+        LOO = tools.LeaveOneOut(self.tdat)
Josef Perktold Owner
josef-pkt added a note August 17, 2012

put a limit on LOO loop for large data sets, large nobs, subsampling ?

Ralf Gommers Collaborator
rgommers added a note August 19, 2012

Agreed, blocking should become the default above a certain sample size (O(500)?).

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/nonparametric2.py
((267 lines not shown))
+        The probability density is given by the generalized product kernel
+        estimator:
+
+        .. math:: K_{h}(X_{i},X_{j}) =
+            \prod_{s=1}^{q}h_{s}^{-1}k\left(\frac{X_{is}-X_{js}}{h_{s}}\right)
+        """
+        if edat is None:
+            edat = self.tdat
+        else:
+            edat = tools.adjust_shape(edat, self.K)
+
+        pdf_est = []
+        N_edat = np.shape(edat)[0]
+        for i in xrange(N_edat):
+            pdf_est.append(tools.gpke(self.bw, tdat=self.tdat, edat=edat[i, :],
+                          var_type=self.var_type) / self.N)
Josef Perktold Owner
josef-pkt added a note August 17, 2012

vectorize gpke, and work in batches to not blow up memory consumption?

Ralf Gommers Collaborator
rgommers added a note August 19, 2012

Batching is implemented in another branch (have a look at https://github.com/gpanterov/statsmodels/blob/nonparametric-all/statsmodels/nonparametric/nonparametric2.py for an overview of all the work).

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/nonparametric2.py
((313 lines not shown))
+        where G() is the product kernel CDF estimator for the continuous
+        and L() for the discrete variables.
+        """
+        if edat is None:
+            edat = self.tdat
+        else:
+            edat = tools.adjust_shape(edat, self.K)
+
+        N_edat = np.shape(edat)[0]
+        cdf_est = []
+        for i in xrange(N_edat):
+            cdf_est.append(tools.gpke(self.bw, tdat=self.tdat,
+                             edat=edat[i, :], var_type=self.var_type,
+                             ckertype="gaussian_cdf",
+                             ukertype="aitchisonaitken_cdf",
+                             okertype='wangryzin_cdf') / self.N)
Josef Perktold Owner
josef-pkt added a note August 17, 2012

why does cdf specify the kertype but pdf doesn't?

I think ckertype will need more options, maybe later

Josef Perktold Owner
josef-pkt added a note August 17, 2012

add kertype as attribute to instance. in __init__ ?

Josef Perktold Owner
josef-pkt added a note August 17, 2012

ok, I see now, _cdf or _convolution below, different kernels

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/nonparametric2.py
((377 lines not shown))
+    Conditional Kernel Density Estimator.
+
+    Calculates ``P(X_1,X_2,...X_n | Y_1,Y_2...Y_m) =
+    P(X_1, X_2,...X_n, Y_1, Y_2,..., Y_m)/P(Y_1, Y_2,..., Y_m)``.
+    The conditional density is by definition the ratio of the two unconditional
+    densities, see [1]_.
+
+    Parameters
+    ----------
+    tydat: list of ndarrays or 2-D ndarray
+        The training data for the dependent variables, used to determine
+        the bandwidth(s).  If a 2-D array, should be of shape
+        (num_observations, num_variables).  If a list, each list element is a
+        separate observation.
+    txdat: list of ndarrays or 2-D ndarray
+        The training data for the independent variable; same shape as `tydat`.
Josef Perktold Owner
josef-pkt added a note August 17, 2012

standard is endog, exog until we change it

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/nonparametric2.py
((358 lines not shown))
+
+        Where :math:`\bar{K}_{h}` is the multivariate product convolution
+        kernel (consult [3] for mixed data types).
+        """
+        F = 0
+        for i in range(self.N):
+            k_bar_sum = tools.gpke(bw, tdat=-self.tdat, edat=-self.tdat[i, :],
+                                   var_type=self.var_type,
+                                   ckertype='gauss_convolution',
+                                   okertype='wangryzin_convolution',
+                                   ukertype='aitchisonaitken_convolution')
+            F += k_bar_sum
+        # there is a + because loo_likelihood returns the negative
+        return (F / (self.N ** 2) + self.loo_likelihood(bw) *\
+                2 / ((self.N) * (self.N - 1)))
+
Josef Perktold Owner
josef-pkt added a note August 17, 2012

some plot methods would be nice

Ralf Gommers Collaborator
rgommers added a note August 19, 2012

Is there something special you think a plot method should do? If it's just plot(x), which plots estimate(x) vs. x on a linear scale then I don't think it adds much.
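For the record, such a bare-bones plot method could look like this (illustrative; univariate continuous case only, assuming the pdf(edat=...) signature shown in the diffs above):

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_density(dens, lo=-3.0, hi=3.0, num=200):
        # plots estimate(x) vs. x on a linear scale
        grid = np.linspace(lo, hi, num)
        plt.plot(grid, dens.pdf(edat=grid[:, None]))
        plt.xlabel('x')
        plt.ylabel('density estimate')
        plt.show()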

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/nonparametric2.py
((431 lines not shown))
+    >>> c1 = np.random.normal(size=(N,1))
+    >>> c2 = np.random.normal(2,1,size=(N,1))
+
+    >>> dens_c = CKDE(tydat=[c1], txdat=[c2], dep_type='c',
+    ...               indep_type='c', bwmethod='normal_reference')
+
+    >>> print "The bandwidth is: ", dens_c.bw
+    """
+    def __init__(self, tydat, txdat, dep_type, indep_type, bw=None):
+        self.dep_type = dep_type
+        self.indep_type = indep_type
+        self.K_dep = len(self.dep_type)
+        self.K_indep = len(self.indep_type)
+        self.tydat = tools.adjust_shape(tydat, self.K_dep)
+        self.txdat = tools.adjust_shape(txdat, self.K_indep)
+        self.N, self.K_dep = np.shape(self.tydat)
Josef Perktold Owner
josef-pkt added a note August 17, 2012

nobs, k not capitalised

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/nonparametric2.py
((432 lines not shown))
+    >>> c2 = np.random.normal(2,1,size=(N,1))
+
+    >>> dens_c = CKDE(tydat=[c1], txdat=[c2], dep_type='c',
+    ...               indep_type='c', bwmethod='normal_reference')
+
+    >>> print "The bandwidth is: ", dens_c.bw
+    """
+    def __init__(self, tydat, txdat, dep_type, indep_type, bw=None):
+        self.dep_type = dep_type
+        self.indep_type = indep_type
+        self.K_dep = len(self.dep_type)
+        self.K_indep = len(self.indep_type)
+        self.tydat = tools.adjust_shape(tydat, self.K_dep)
+        self.txdat = tools.adjust_shape(txdat, self.K_indep)
+        self.N, self.K_dep = np.shape(self.tydat)
+        self.all_vars = np.concatenate((self.tydat, self.txdat), axis=1)
Josef Perktold Owner
josef-pkt added a note August 17, 2012

column_stack would require less thinking

(I didn't see that tydat is 2d even if univariate.)

Ralf Gommers Collaborator
rgommers added a note October 13, 2012

Replaced concatenate with column_stack and row_stack everywhere in https://github.com/rgommers/statsmodels/tree/pr-408-comments.
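For 2-D inputs with matching row counts the two are equivalent; column_stack just leaves less room for axis mistakes:

    import numpy as np

    a = np.arange(6).reshape(3, 2)
    b = np.arange(3).reshape(3, 1)
    assert (np.column_stack((a, b)) == np.concatenate((a, b), axis=1)).all()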

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/nonparametric2.py
((589 lines not shown))
+        else:
+            exdat = tools.adjust_shape(exdat, self.K_indep)
+
+        N_edat = np.shape(exdat)[0]
+        cdf_est = np.empty(N_edat)
+        for i in xrange(N_edat):
+            mu_x = tools.gpke(self.bw[self.K_dep::], tdat=self.txdat,
+                              edat=exdat[i, :], var_type=self.indep_type) / self.N
+            mu_x = np.squeeze(mu_x)
+            G_y = tools.gpke(self.bw[0:self.K_dep], tdat=self.tydat,
+                             edat=eydat[i, :], var_type=self.dep_type,
+                             ckertype="gaussian_cdf",
+                             ukertype="aitchisonaitken_cdf",
+                             okertype='wangryzin_cdf', tosum=False)
+
+            W_x = tools.gpke(self.bw[self.K_dep::], tdat=self.txdat,
Josef Perktold Owner
josef-pkt added a note August 17, 2012

:: requires thinking, missing comma?

Ralf Gommers Collaborator
rgommers added a note October 13, 2012

No comma, just removing the second :. Addressed in https://github.com/rgommers/statsmodels/tree/pr-408-comments for all code.
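A quick demonstration that the trailing step in the slice is redundant:

    import numpy as np

    bw = np.array([0.1, 0.2, 0.3, 0.4])
    k = 2
    assert (bw[k::] == bw[k:]).all()  # both give array([0.3, 0.4])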

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/nonparametric2.py
((583 lines not shown))
+        if eydat is None:
+            eydat = self.tydat
+        else:
+            eydat = tools.adjust_shape(eydat, self.K_dep)
+        if exdat is None:
+            exdat = self.txdat
+        else:
+            exdat = tools.adjust_shape(exdat, self.K_indep)
+
+        N_edat = np.shape(exdat)[0]
+        cdf_est = np.empty(N_edat)
+        for i in xrange(N_edat):
+            mu_x = tools.gpke(self.bw[self.K_dep::], tdat=self.txdat,
+                              edat=exdat[i, :], var_type=self.indep_type) / self.N
+            mu_x = np.squeeze(mu_x)
+            G_y = tools.gpke(self.bw[0:self.K_dep], tdat=self.tydat,
Josef Perktold Owner
josef-pkt added a note August 17, 2012

cdf_y cdf_endog instead of G_y

should this be a separate method, or available? do we want to store them in the same class, or should a user create his own marginal and joint distributions? (application: mutual information; example in sandbox with scipy's gaussian_kde)
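The mutual-information application mentioned above can be sketched with scipy's gaussian_kde (continuous data only): MI is approximated by averaging log(f_xy / (f_x * f_y)) over the sample:

    import numpy as np
    from scipy.stats import gaussian_kde

    np.random.seed(12345)
    x = np.random.normal(size=500)
    y = 0.7 * x + np.random.normal(size=500)

    f_xy = gaussian_kde(np.vstack([x, y]))       # joint density
    f_x, f_y = gaussian_kde(x), gaussian_kde(y)  # marginals

    mi = np.mean(np.log(f_xy(np.vstack([x, y])) / (f_x(x) * f_y(y))))
    print(mi)  # rough estimate of the mutual information in nats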

Josef Perktold josef-pkt commented on the diff August 17, 2012
statsmodels/nonparametric/nonparametric2.py
((639 lines not shown))
+
+        .. math:: G_{-l}(X_{l}) = n^{-2}\sum_{i\neq l}\sum_{j\neq l}
+                        K_{X_{i},X_{l}} K_{X_{j},X_{l}}K_{Y_{i},Y_{j}}^{(2)}
+
+        where :math:`K_{X_{i},X_{l}}` is the multivariate product kernel and
+        :math:`\mu_{-l}(X_{l})` is the leave-one-out estimator of the pdf.
+
+        :math:`K_{Y_{i},Y_{j}}^{(2)}` is the convolution kernel.
+
+        The value of the function is minimized by the ``_cv_ls`` method of the
+        `_GenericKDE` class to return the bw estimates that minimize the
+        distance between the estimated and "true" probability density.
+        """
+        zLOO = tools.LeaveOneOut(self.all_vars)
+        CV = 0
+        for l, Z in enumerate(zLOO):
Josef Perktold Owner
josef-pkt added a note August 17, 2012

l (small L) is not a good variable name

txdat[l, :] is this a 1 or an l or an I

Josef Perktold josef-pkt commented on the diff August 19, 2012
statsmodels/nonparametric/nonparametric2.py
((100 lines not shown))
+
+        The leave-one-out kernel estimator of :math:`f_{-i}` is:
+
+        .. math:: f_{-i}(X_{i})=\frac{1}{(n-1)h}
+                        \sum_{j=1,j\neq i}K_{h}(X_{i},X_{j})
+
+        where :math:`K_{h}` represents the Generalized product kernel
+        estimator:
+
+        .. math:: K_{h}(X_{i},X_{j})=\prod_{s=1}^
+                        {q}h_{s}^{-1}k\left(\frac{X_{is}-X_{js}}{h_{s}}\right)
+        """
+        # the initial value for the optimization is the normal_reference
+        h0 = self._normal_reference()
+        bw = optimize.fmin(self.loo_likelihood, x0=h0, args=(np.log, ),
+                           maxiter=1e3, maxfun=1e3, disp=0)
Josef Perktold Owner
josef-pkt added a note August 19, 2012

one possible speedup: increase xtol from 0.0001
As far as I understand, the exact bandwidth might not be very important, so we might not need it to converge to a high precision. The important part is convergence in function value. (A guess: this might save time and calculations if the objective function is relatively flat at the optimum.)

What I don't know is what the scale is: do we need absolute or relative tolerance in x?

extra question (maybe for the future): are any of the other optimizers potentially better/faster? make it into a choice?

Ralf Gommers Collaborator
rgommers added a note October 13, 2012

xtol=1e-4 is the default, so I guess you mean 1e-3. The bandwidth should be in the range 0-1, but it can be small for large sample sizes. Therefore I think relative tolerance (xtol) should be used.

fmin_bfgs is indeed faster in most cases, so making this configurable would help.
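A sketch of what a configurable optimizer could look like (names are illustrative, not the PR's API):

    from scipy import optimize

    def compute_bw_cv_ml(loo_likelihood, h0, optimizer='fmin', xtol=1e-3):
        if optimizer == 'fmin':
            # x-tolerance loosened from the 1e-4 default, per the
            # discussion above
            return optimize.fmin(loo_likelihood, x0=h0, xtol=xtol,
                                 maxiter=1000, maxfun=1000, disp=0)
        elif optimizer == 'bfgs':
            return optimize.fmin_bfgs(loo_likelihood, x0=h0, disp=0)
        raise ValueError("unknown optimizer %r" % optimizer)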

Ralf Gommers
Collaborator

About api.py vs. __init__.py, I had a look at other modules and it's pretty much inconsistent. Because completely emptying the current __init__.py is not backwards compatible, I think it shouldn't be done in this PR. Adding UKDE, CKDE in a new api.py and leaving the current KDE in __init__.py also doesn't make sense.

Also, I've checked the import time and it is only 3 ms, plus the time for importing numpy + scipy.optimize.

Ralf Gommers rgommers referenced this pull request August 20, 2012
Closed

Nonparametric all #434

Ralf Gommers
Collaborator

Summary of all the variable naming comments of Josef:

tdat --> data
txdat --> exog
tydat --> endog
all_vars --> endog_all / endog / data
X_j --> x_noti
G_y --> cdf_y / cdf_endog
N --> nobs
K_dep --> k_dep
no small L

Most of this is indeed standard throughout statsmodels. See http://statsmodels.sourceforge.net/devel/gettingstarted.html#design-matrices-endog-exog for endog/exog and http://statsmodels.sourceforge.net/devel/dev/naming_conventions.html for nobs/k.

I'd propose x_not_i for readability. The all_vars suggestion I'm not sure about, because::

all_vars = np.column_stack((self.tydat, self.txdat))

Other than those two I think these are good suggestions. @gpanterov what do you think?

Skipper Seabold
Owner

I would even propose exog, endog -> X, Y. I'd like to revisit this package-wide before the pydata talk, so I don't get the same suggestion to change this from however many people again.

Ralf Gommers
Collaborator

That would be even better; has to happen at some point anyway.

Ralf Gommers rgommers commented on the diff October 14, 2012
statsmodels/nonparametric/kernels.py
((112 lines not shown))
+    -----
+    See p. 19 in [1]_ for details.  The value of the kernel L if
+    :math:`X_{i}=x` is :math:`1-\lambda`, otherwise it is
+    :math:`\frac{1-\lambda}{2}\lambda^{|X_{i}-x|}`.
+
+    References
+    ----------
+    .. [1] Racine, Jeff. "Nonparametric Econometrics: A Primer," Foundation
+           and Trends in Econometrics: Vol 3: No 1, pp1-88., 2008.
+           http://dx.doi.org/10.1561/0800000009
+    .. [2] M.-C. Wang and J. van Ryzin, "A class of smooth estimators for
+           discrete distributions", Biometrika, vol. 68, pp. 301-309, 1981.
+    """
+    h, Xi, x, N, K = _get_shape_and_transform(h, Xi, x)
+    Xi = np.abs(np.asarray(Xi, dtype=int))
+    x = np.abs(np.asarray(x, dtype=int))
Ralf Gommers Collaborator
rgommers added a note October 14, 2012

Looking at the definition in the docstring and ref [1], the abs calls in the two lines above are incorrect. It's abs(Xi-x), not abs(abs(Xi) - abs(x)). Removing them in my speedup branch.
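A small demonstration of the difference (the two forms only agree for nonnegative data):

    import numpy as np

    Xi = np.array([-2, -1, 0, 1])
    x = -1
    # incorrect: abs() of each side loses the sign information
    wrong = np.abs(np.abs(Xi) - np.abs(x))  # -> array([1, 0, 1, 0])
    # correct: the kernel definition uses |X_i - x|
    right = np.abs(Xi - x)                  # -> array([1, 0, 1, 2])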

Ralf Gommers rgommers commented on the diff October 14, 2012
statsmodels/nonparametric/kernels.py
((114 lines not shown))
+    :math:`X_{i}=x` is :math:`1-\lambda`, otherwise it is
+    :math:`\frac{1-\lambda}{2}\lambda^{|X_{i}-x|}`.
+
+    References
+    ----------
+    .. [1] Racine, Jeff. "Nonparametric Econometrics: A Primer," Foundation
+           and Trends in Econometrics: Vol 3: No 1, pp1-88., 2008.
+           http://dx.doi.org/10.1561/0800000009
+    .. [2] M.-C. Wang and J. van Ryzin, "A class of smooth estimators for
+           discrete distributions", Biometrika, vol. 68, pp. 301-309, 1981.
+    """
+    h, Xi, x, N, K = _get_shape_and_transform(h, Xi, x)
+    Xi = np.abs(np.asarray(Xi, dtype=int))
+    x = np.abs(np.asarray(x, dtype=int))
+    if K == 0:
+        return Xi
Ralf Gommers Collaborator
rgommers added a note October 14, 2012

This check is unnecessary, the _get_shape_and_transform call guarantees K >= 1. Removing it.

Josef Perktold
Owner

I still don't really like one letter names in the classes. Users can use the formula interface if they don't want to see endog/exog.

5c987e0#diff-1

Ralf Gommers
Collaborator

Josef, given the number of people in favor of X/Y (both on-list and in Skipper's tutorials) and virtually no one except you liking endog/exog, I really hope you'll reconsider your opinion on this.

That issue is orthogonal to this PR though. So I propose to not discuss it further here, and if the status quo is still endog/exog at merge time of this PR, we should use it in this code.

Ralf Gommers
Collaborator

@gpanterov: can you comment on the renaming proposals above? I'd like to get those out of the way, then we're pretty much done with this PR.

Ralf Gommers
Collaborator

Time of nonparametric.test() is down to 16 seconds, nonparametric.test('full') to 80 seconds on my machine. So this is getting more or less acceptable for merging. There's more that can be shaved off still, but it's OK-ish for now.

gpanterov

On the naming conventions:
tydat, txdat, eydat and exdat were borrowed from R's np package.
I like the idea of changing them so that they are more in line with statsmodels naming conventions. I like the brevity of X and Y, and if this is more in line with the rest of statsmodels I propose that we go ahead with it. However, we should also consider renaming the arguments of KDE.cdf() and ConditionalKDE.cdf() and pdf(), which are currently exdat and eydat (e standing for evaluation, as opposed to training data (t)). I propose eX and eY?

Josef Perktold
Owner

eX, eY is still uninformative, I never managed to guess what the e stands for.

our terminology in general is fit and predict

shorthand names can sometimes be useful inside a function, but for the outside I prefer descriptive names.

we don't have exdat yet in other models, just exog in predict. The full name would be :)

exog_predict

Ralf Gommers
Collaborator

exog_predict is pretty uninformative too imho. Note that the existing kdensity and kdensity_fft already use X:

def kdensity(X, kernel="gau", bw="scott", weights=None, gridsize=None,
             adjust=1, clip=(-np.inf,np.inf), cut=3, retgrid=True):
    """
    Rosenblatt-Parzen univariate kernel density estimator

    Parameters
    ----------
    X : array-like
        The variable for which the density estimate is desired.

I see a bunch of predict methods, but no parameters with _predict appended.

Using fit() for kicking off bandwidth estimation is a good point though.

Josef Perktold
Owner

No, we don't have _predict postfix, because we never needed it.
standard signature is just predict(exog) since none of the other models needs to keep the original exog from the fit around at the same time. (predict signature will get more complicated with formulas)

Kernel methods are currently the only ones where we don't just have "parameters", and where we need the full original sample for "prediction".

For density estimation there is no real endog and exog, just the data, so I don't really care much. Similarly, I usually don't use the endog/exog terminology in statistical tests.
But kernel regression is similar to the other models in that it estimates a relationship between endog and exog under some assumptions on the process that generated the data.
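A minimal sketch of that point (names illustrative, not statsmodels API): a parametric model can predict from its fitted parameters alone, while a kernel regression's predict needs the full training sample:

    import numpy as np

    class KernelRegressionSketch:
        def __init__(self, endog, exog, bw):
            # the whole training sample is kept, not just parameters
            self.endog, self.exog, self.bw = endog, exog, bw

        def predict(self, exog_predict):
            # Nadaraya-Watson: weighted average of *all* training responses
            out = np.empty(len(exog_predict))
            for i, x0 in enumerate(exog_predict):
                w = np.exp(-0.5 * ((self.exog - x0) / self.bw) ** 2)
                out[i] = np.sum(w * self.endog) / np.sum(w)
            return out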

Ralf Gommers
Collaborator

OK, all the renames done in my speedup-nonparametric branch. I included exog_predict and even exog_nonparametric_predict, because I realised the most confusing part of exog_predict is not _predict. So why not, at least it's consistent.

Adding fit presents a slight problem, since the regression classes already have a fit method (for doing the actual regression and marginal effects). If fit() were used for computing the bandwidth (I still don't see a good use-case for why that's needed), what would we call the current fit()?

Josef Perktold
Owner

Ralf, thanks for working through this
Can you create a pull request from your branch, so we can review your version?

From the way it sounds, I think we should merge it soon.

gpanterov

Can we indicate somewhere that the CensoredReg, SemiLinear and SingleIndexModel classes are experimental? Or should we exclude them from the PR? They are a unique feature of statsmodels, i.e. they are not present in any other package to my knowledge. But because of this, I was not able to cross-check the output that they give.

Josef Perktold
Owner

I would prefer to merge everything instead of splitting up the pull request again.

If there are some tests showing that they work, then I would just add a comment to the docstrings and leave them in, even if they are not verified against other packages.
If they are unfinished or might have problems, then we could also just move them temporarily to the sandbox.

(I haven't looked at the code for those yet.)

Ralf Gommers
Collaborator

Test coverage should be increased first though if we want to merge everything at once:

  • TestFForm and SingleIndexModel are completely untested
  • SemiLinear and CensoredReg have only one test.

I propose to rename TestFForm to FFormTest by the way. No name should start with Test.

George, do you have time to work on those tests in the near future? If not, I propose to leave those things out for now.

Ralf Gommers
Collaborator

What do we do about the current KDE class, by the way? Rename it to KDE_1d?

Skipper Seabold
Owner

Is there a name clash? Is something called KDE in the PR? If so, I think it should be renamed given that KDE is in use already. If not, I'm ok to change KDE to KDE_1d (or whatever), but we'd have to go through a deprecation period for the current KDE.

Ralf Gommers
Collaborator

The main class is called KDE (renamed from UKDE). Josef's suggestion from the mailing list:

I would spell it out ConditionalKDE,  or KDEConditional,
I think shortening UKDE to KDE is fine, if there is no problem with
several KDE, eg. rename the other one to KDEUnivariate or UnivariateKDE.
Skipper Seabold
Owner

I think at least initially we would have to have KDEMultivariate or MultivariateKDE; then we can deprecate KDE and move to this if it's what we really want to do, unless there are some plans to separate by namespace (still would be a bit confusing IMO).

Ralf Gommers
Collaborator

Hmm, not sure what would be best. Basically, the current KDE is a specialization for 1-D continuous data. It's faster and has support for data on a finite domain. The new KDE can handle mixed continuous/ordered/unordered data and has cross-validated bandwidth selection. We should be able to add finite domain support to it (longer term, non-trivial). So it looks to me like KDE should become KDE_1d or UnivariateKDE. However, given the current state that's indeed not very practical. Alternative is to rename the new KDE and explain how the classes are related in the two docstrings.

Splitting the two between two namespaces doesn't make sense.

Josef Perktold
Owner

I don't think we want to separate by namespace, too confusing and there are no natural namespaces within kde.

One possibility would be to keep all the KDE with qualifiers.

The KDE in this PR is using product kernels, so we could also call it KDEProduct to distinguish it from scipy's gaussian_kde, which I think doesn't impose the product structure. KDEMultivariate is also fine (I'm not sure users care about "Product")

(other multivariate: nearest neighbor, rbf kernels ?)

Ralf Gommers
Collaborator

Users indeed won't care much. Multivariate sounds better than Product.

Skipper Seabold
Owner

+1

Josef Perktold
Owner

Unless the untested parts get tests soon, I prefer moving them to the sandbox.
That requires less maintenance, avoids bitrot, and we don't have to struggle with git merges.
The only disadvantage is that it's easier to add GitHub comments to a PR.

And it can be used even if it doesn't have a full test suite, and we can add tests incrementally.

(Example: we got a bug report for survival2, which is in the sandbox, but the code in the sandbox is outdated and no one has worked on the new version in the branch for 7 months.
It would have been easier to merge and add test coverage piece by piece.)

Josef Perktold
Owner

Ralf, can you make a pull request from your branch?

This branch doesn't contain the extra classes SingleIndexModel, Censored, ...

Also, I would suggest splitting the module nonparametric2.py into at least two or three parts (density, regression, others?) and renaming nonparametric2 to "kernel_density.py" or something like that.

I only had a brief look at the extras. One of the main parts that might need expansion (later?) is to check what should be returned by fit; there are currently no results classes.

Ralf Gommers
Collaborator

Yes, will get the last renames done and send a PR this week.

@gpanterov I'll follow Josef's suggestion on moving things that aren't tested to the sandbox; they can always be moved back as soon as we have tests.

gpanterov

Please go ahead and move them to the sandbox, Ralf. I was thinking along the same lines. I will have time to write some tests for them in the next few weeks, but I can't do it immediately ...

Ralf Gommers
Collaborator

New PR opened, so closing this one.

Ralf Gommers rgommers closed this November 03, 2012

Showing 79 unique commits by 2 authors.

May 26, 2012
gpanterov Multivar KDEs dd102b2
gpanterov Minor Change 56a25b3
Jun 09, 2012
gpanterov Implemented Suggestions by Ralf and Josef 8610a4a
gpanterov Fixed bw/bwmethod ab2439c
gpanterov Introduced default for bandwidth selection (Josef) fb558e9
gpanterov Removed convolution and cumulative density 0702486
Jun 10, 2012
gpanterov least squares cross validation + changes in API - added tools module 19b7548
gpanterov Finished with cv_ls for continuous and ordered variables c771f52
Jun 18, 2012
Ralf Gommers TST: convert test script to actual unit tests. Add UKDE, CKDE to __init__.py bc3f30e
Ralf Gommers TST: nonparametric: convert UKDE example into a doctest.
Doctests can be run with:

    from statsmodels import nonparametric
    nonparametric.test(doctests=True)

Or "nosetests --with-doctest" in the nonparametric dir.

Note that some other doctests seem to be failing at the moment.
7efea19
Jun 22, 2012
gpanterov Added lst sq cv for conditional density (slow) 0163e3b
gpanterov Vectorized parts of IMSE in CKDE. Significantly improved performance 5ec2c48
gpanterov Added various tests. Most perform well but there are still issues with speed 21a19a9
Jun 25, 2012
Ralf Gommers ENH: add __repr__ method to UKDE.
CKDE method still to do.  Also clean up some docstrings and code.
08eaa00
Jun 27, 2012
gpanterov pep8 on KernelFunctions.py 7985008
gpanterov pep8 on nonparametric2.py and np_tools.py b4815c9
Jun 29, 2012
gpanterov edat fix 2ccbae0
gpanterov fixed edat f175261
gpanterov removed imse_slow and gpke2 2942de4
Jun 30, 2012
gpanterov added graphical tests for several distributions 1182a8c
gpanterov some more changes eb198fa
gpanterov changes.. 22ea70d
gpanterov resolved conflicts e56f4db
Jul 01, 2012
gpanterov expanded doc strings for Generic KDE and UKDE 4eed5e0
gpanterov fixed doc strings and included latex formulas 7eeb5e3
gpanterov some minor fixes + clean up 3bad1da
gpanterov fixed test failures 51a175f
gpanterov ... b182b9a
gpanterov added latex formulas for kernels in KernelFunctions.py 877b63d
Jul 04, 2012
gpanterov added nonparametric regression + tests for ordered case 7c7ae88
Jul 06, 2012
gpanterov added fast unconditional multivariate cdf and tested with continuous and mixed data 125c66a
Jul 07, 2012
gpanterov added conditional multivariate cdf; tests for continuous and mixed; code needs cleaning e29e026
Jul 08, 2012
gpanterov nonparametric regression with marginal fx, R2, significance 04895ff
Jul 09, 2012
gpanterov Fixed npreg CV.LS; added tests for continuous and mixed for mean, R2, bandwidth 46e076d
gpanterov added the derivative of the Gaussian kernel to be used for the calculation of the marginal effects in the regression 292c0a3
Jul 10, 2012
gpanterov implemented some suggestions by Josef and Ralf b66e698
gpanterov created only one GPKE function for all classes 3a29a93
gpanterov cleaned GPKE and PKE 398acf1
gpanterov moved distribution plots from test file to examples file (yet to be committed) 8b14a1a
Jul 11, 2012
gpanterov added Reg class to __init__.py fc84351
gpanterov added univariate kde example for 6 common distributions. merged with main nonparametric branch 6a4fa5a
gpanterov added import plt to ex_univar_kde.py 9242513
gpanterov added seed 123456 to ex_univar_kde.py c7fe86c
gpanterov added docstrings for Reg class. added repr method for CKDE 9ed520b
gpanterov removed the Epanechnikov Kernel. Could be added again at a later time if we decide to give the user an option to specify kernels. However, most refs claim the kernel choice is not important 02b3781
gpanterov made nonparametric2.py pep 8 compliant 7e35ec4
gpanterov made np_tools.py pep 8 compliant ee2cd70
gpanterov made KernelFunctions.py pep 8 compliant b828625
Jul 12, 2012
gpanterov made test_nonparametric2.py pep 8 compliant 616d956
gpanterov naming conventions suggested by Skipper be6eb11
gpanterov added local linear estimator to Reg. modified Reg api. added tests 85d0773
gpanterov fixed a minor issue with CKDE.cdf, modified adjust_shape function 46a9a8f
gpanterov added repr method to Reg 008ef14
gpanterov small fix to adjust_shape 2a72d5f
gpanterov added an example with multivariate bimodal distribution plot (cv_ml) 285297e
Jul 15, 2012
Ralf Gommers BUG: nonparametric: fix three test failures due to incorrect usage of `fill`.

Note also the added FIXME's.  Some bugs are still left; points also to
incomplete test coverage.
a957da0
Jul 17, 2012
gpanterov examples + kernel fix 6fb2c1f
Jul 22, 2012
gpanterov optimizing gpke + kernels for speed 919aa97
gpanterov fixed adjust_shape 00fa9f0
gpanterov some new fixes 1383dba
Jul 23, 2012
gpanterov removed pke bc9463d
gpanterov cleaned up KernelFunctions.py 2888c1d
gpanterov preparing the density estimators for pull request (without the regression) ea9b587
gpanterov fixed np_tools 656f96a
gpanterov