Nonparametric all #434

Closed
wants to merge 114 commits into from

4 participants

@gpanterov

Final pull request for the coding during GSoC. This branch includes some improvements to the density estimation methods in nonparametric-density such as efficient bandwidth estimation by breaking the sample into smaller blocks. It also includes new features such as nonparametric regression, nonparametric censored regression, significance tests for nonparametric regression variables (for both continuous and discrete variables); first draft of nonparametric test for functional form. The branch includes two semiparametric models: Semiparametric Partially Linear Model and Semiparametric Single Index Model.

@josef-pkt

I think it would pay off to write tests for various shapes of this function.

For example, my guess is that the calls to kernel_func don't work if there is more than one variable of the same type, or do they? They might if the kernel functions are vectorized.

The kernel functions are vectorized. I actually tested it with multiple variables. For example to obtain P(c1,c3 | c2):

dens_c=CKDE(tydat=[c1,c3],txdat=[c2], dep_type='cc',indep_type='c',bwmethod='normal_reference')
dens_c.pdf()

nice, now I see that the bandwidth h is also vectorized.

@rgommers

The way you've written the signature of this function, both arguments should be specified. So this note must be incorrect. Your TODO below (combining into one parameter) makes sense. I assume bw will then be a user-specified function (it doesn't actually say that in the description of bw) with a standard signature.

Collaborator

Naming a method get_x implies that x is an already-existing attribute/number. Perhaps better to name it find_bw or compute_bw or similar.

@rgommers

Could you indicate that this is Scott's rule of thumb? Silverman's is almost as popular I think.

Also, the method name is not so clear I think - could be named such that it's clear that this is a bandwidth estimate.
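
To make the Scott/Silverman distinction concrete, here is a minimal sketch of the two 1-D rules of thumb as they are commonly quoted. The function names and the robust scale helper are hypothetical, and the constants (1.059, 0.9, 1.349) are the textbook values rather than anything confirmed from this branch.

```python
import numpy as np


def robust_scale(x):
    """Smaller of the sample std and the normalized interquartile range."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    return min(x.std(ddof=1), (q75 - q25) / 1.349)


def bw_scott_1d(x):
    # Normal-reference (Scott-style) rule: 1.059 * scale * n^(-1/5)
    return 1.059 * robust_scale(x) * len(x) ** (-0.2)


def bw_silverman_1d(x):
    # Silverman's rule: 0.9 * scale * n^(-1/5)
    return 0.9 * robust_scale(x) * len(x) ** (-0.2)
```

With the same scale estimate the two rules differ only by a constant factor, so naming the rule in the docstring is enough to tell users which one they get.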

@rgommers

The methods normal_reference, cv_ml and cv_ls are all private, right? They should only be called through get_bw. So start the names with an underscore.

@rgommers

It would be good to explain in a few sentences what "conditional kde" actually is and give a reference. Conditional estimation is a lot less common than unconditional; unconditional is normally even left off ("kernel density estimation" refers to your UKDE class).

@rgommers

For keywords which can be left out, use None as default parameter. False implies that this is a boolean variable. The check for input given or not is then written as if eydat is not None.
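
A small sketch of the None-default pattern, assuming a hypothetical helper that picks the evaluation points (the name resolve_eval_points is not from the branch). Testing identity with None also avoids the ambiguous truth value that a NumPy array would raise under a plain `if not edat` check.

```python
import numpy as np


def resolve_eval_points(tdat, edat=None):
    """Return edat if given, otherwise fall back to the training data."""
    if edat is None:
        return np.asarray(tdat, dtype=float)
    return np.asarray(edat, dtype=float)
```

Usage: `resolve_eval_points(tdat)` evaluates at the training data, while `resolve_eval_points(tdat, edat=np.array([0.0]))` honors the explicit argument even though its only element is falsy.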

@rgommers

This for-loop doesn't do anything. It only creates var, which isn't used below.

column_stack (one of my favorites) could be used, however,

concatenate and column_stack copy the data, AFAICS

In general it might be better to work with views, and require users to concatenate themselves.
For example, in the LOO loop, tdat is already an array (view), if there is no concatenate call then the class would have a view instead of making a copy of the data. In most models we try to use views on the original data, exog and endog, although some calculations might create copies anyway.
(We never checked systematically whether we save any memory or are faster this way.)
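
The copy-versus-view point can be checked directly with np.shares_memory. This is only an illustration with made-up arrays, not a measurement of the classes in this PR.

```python
import numpy as np

x = np.arange(6.0)
y = np.arange(6.0, 12.0)

# column_stack allocates a new (6, 2) array, so the inputs are copied.
stacked = np.column_stack([x, y])
copied = not np.shares_memory(stacked, x)

# Basic slicing returns a view on the same buffer, so nothing is copied.
data = np.arange(12.0).reshape(6, 2)
tail = data[1:]
is_view = np.shares_memory(tail, data)
```

Note that dropping a row from the middle of an array (as a leave-one-out loop usually does) needs a boolean mask or np.delete, and both of those copy.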

@rgommers

Same here, for loops don't do anything.

@josef-pkt

index could be 1d (row selector), then reshape is not necessary

@josef-pkt

pep8 doesn't have space for = in keyword arguments. minor issue but useful to get used to

sorry, I guess I wasn't clear before, below are many spaces in keyword =

@josef-pkt

Use is here: if bw is not None, or if not bw is None.

@josef-pkt

should this be compute_bw(self, bw=None, bwmethod=None) ? should be, if they are optional, even if one of the two is required

I think, if it's possible, then there should be a recommended default, Scott's or Silverman - normal_reference?

@rgommers

Should be

if edat is None:
    edat = self.tdat
@rgommers
Collaborator

When you're doing PEP8 fixes, make your life easier by running http://pypi.python.org/pypi/pep8 over the file(s). It will warn you when things are non-standard.

@rgommers

Should be if not isinstance(bw, basestring).

The else clause below should probably also check that the input is a callable, like so hasattr(bw, '__call__').
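
Pulling the bandwidth comments together, a single bw argument could be dispatched roughly as follows. This is a sketch under assumptions: the class name, the default method, and the callable signature (it receives the data) are all hypothetical, and the rule of thumb is the usual 1.06 * std * nobs^(-1/(4+K)) form. On Python 2 the string check would use basestring; on Python 3 it is str.

```python
import numpy as np


class BandwidthSketch(object):
    """Hypothetical estimator used only to illustrate bandwidth dispatch."""

    def __init__(self, tdat):
        tdat = np.asarray(tdat, dtype=float)
        # Store the data as (nobs, K), treating 1-D input as one variable.
        self.tdat = tdat.reshape(len(tdat), -1)

    def _normal_reference(self):
        # Private helper: rule-of-thumb bandwidth per variable.
        nobs, K = self.tdat.shape
        return 1.06 * self.tdat.std(axis=0) * nobs ** (-1.0 / (4 + K))

    def compute_bw(self, bw=None):
        if bw is None:
            bw = 'normal_reference'      # assumed recommended default
        if isinstance(bw, str):          # basestring on Python 2
            methods = {'normal_reference': self._normal_reference}
            if bw not in methods:
                raise ValueError("unknown bandwidth method: %r" % bw)
            return methods[bw]()
        if callable(bw):
            # User-supplied function of the data (assumed signature).
            return np.asarray(bw(self.tdat), dtype=float)
        # Otherwise treat bw as explicit bandwidth values.
        return np.asarray(bw, dtype=float)
```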

@rgommers

This took me a bit of puzzling. IMSE doesn't actually calculate the integral, so the name is a bit deceptive. I guess you don't need to explicitly calculate it if you're only using it from optimize.fmin.

Did you also plan to provide other metrics, like ISE or IAE?

@rgommers
Collaborator

I think the purpose of the convolution kernels and how to use them needs some explanation. So far they're only used in UKDE.IMSE as far as I can tell.

rgommers and others added some commits
@rgommers

I'd reserve LaTeX for formulas that are at least somewhat complicated. This would be better written as plain text.

@rgommers

Probably better to describe what's different from GPKE. Anything besides the summing? I thought this was vectorized too, can't that be reused?

@rgommers

Not actually implemented yet (commented out below). Wouldn't it be easier to return the sorted tdat or edat than ix?

I left it inactive because it gets confusing when you have more than one variable. What should you sort by in the multivariate case?

Collaborator

Perhaps all of them, first on axis 0, then 1, etc.?
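
One reading of that suggestion is a lexicographic sort over the columns, which np.lexsort can do in one call. The helper name below is hypothetical, and it assumes the data is laid out as (nobs, K) with the first column as the primary key.

```python
import numpy as np


def sort_rows(data):
    """Sort rows by column 0, breaking ties with column 1, and so on."""
    data = np.asarray(data)
    # lexsort treats the LAST key as primary, so reverse the column order.
    ix = np.lexsort(data.T[::-1])
    return data[ix], ix


rows = np.array([[2, 1],
                 [1, 2],
                 [1, 1]])
sorted_rows, ix = sort_rows(rows)
```

Returning both the sorted data and ix keeps the option of reordering the matching density values with the same index.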

@rgommers
Collaborator

If you don't need code anymore, better to delete it.

@rgommers
Collaborator

This fix really needs a test for edat usage.

@rgommers

The old version (array_like) is actually the correct one.

@rgommers
Collaborator

I think you meant these as examples, right? Nothing is actually tested. Matplotlib is only an optional dependency of statsmodels, so you should only use it within a function or with a conditional import (i.e. within a try/except).

@josef-pkt

doesn't work for me

import statsmodels.nonparametric.nonparametric2 as nparam ?

@josef-pkt
Owner

leaving plot.show() in a test hangs the test when I run it
also matplotlib import needs to be protected (try except)
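
A minimal sketch of the guarded import, with a hypothetical plotting helper. The point is that importing the module never fails without matplotlib and that plt.show() only runs when explicitly requested, so a test run cannot hang on an open window.

```python
try:
    import matplotlib.pyplot as plt
    have_matplotlib = True
except ImportError:
    have_matplotlib = False


def plot_density(x, density, show=False):
    """Plot a density curve; return None when matplotlib is unavailable."""
    if not have_matplotlib:
        return None
    fig, ax = plt.subplots()
    ax.plot(x, density)
    if show:
        plt.show()   # blocks in interactive use, so keep it opt-in
    return fig
```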

Owner

pareto graph is just a line at 1e100
laplace and powerlaw ? seem to have problems close to the boundaries

one print statement is left somewhere when running the tests

tests run without failures, but are slow, 244 seconds, we need to figure out a way to shorten this or mark some as slow before merging

(These are things that need to be fixed before merging, but can be ok, or not our business, in a development branch)

Owner

some of the test cases would make nice example scripts

Owner

I think the class names need to be made longer and more descriptive. Only the most famous models are allowed to have acronyms. KDE is ok, but UKDE doesn't tell me anything.

@josef-pkt

special.ndtr(x) has the cdf for standard normal, used by scipy.stats.distributions.norm
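
A quick check that the two agree, for anyone who wants to swap a hand-written normal CDF for the scipy.special function:

```python
import numpy as np
from scipy import special, stats

x = np.linspace(-3, 3, 7)
# ndtr is the standard normal CDF evaluated elementwise.
same = np.allclose(special.ndtr(x), stats.norm.cdf(x))
```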

@josef-pkt
Owner

nice, I like having the cdf available, Azzalini (IIRC) mentioned that the rate (as function of n) for the bandwidth for cdf should be smaller (?) than for the pdf. Did you see anything about bw choice for cdf?

Collaborator

Fast, and seems to work well. At least in 1-D, converges nicely to the empirical CDF.

Collaborator

@josef-pkt that's what I thought too, bandwidth isn't the same as for pdf.
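
For reference, the 1-D Gaussian-kernel CDF estimate is just the average of normal CDFs centered at the observations, which makes the comparison against the empirical CDF easy to reproduce. The function names are hypothetical and the bandwidth is passed in explicitly, since the right rate for CDF estimation is exactly the open question here.

```python
import numpy as np
from scipy import special


def kde_cdf(data, points, bw):
    """Gaussian-kernel CDF estimate at each point (1-D data)."""
    data = np.asarray(data, dtype=float)
    points = np.asarray(points, dtype=float)
    # (n_points, n_obs) matrix of standardized distances.
    z = (points[:, None] - data[None, :]) / bw
    return special.ndtr(z).mean(axis=1)


def ecdf(data, points):
    """Empirical CDF: fraction of observations <= each point."""
    data = np.sort(np.asarray(data, dtype=float))
    return np.searchsorted(data, points, side='right') / float(len(data))
```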

Collaborator

Can you factor out all the code related to asarray, K, N and reshaping? It's more than 10 lines that are duplicated in every single kernel function. Should be something like

def _get_shape_and_type(Xi, x, kind='c'):
    ...
    return Xi, x, K, N
Collaborator

The UKDE, CKDE interface now doesn't allow specifying the kernels to use. The Epanechnikov kernel isn't used at all. Are you planning to expand that interface? In any case, what the default kernels are should be documented.

Collaborator

I have the feeling that the for-loop in GPKE can still be optimized, it's very expensive now. You can see this easily by profiling in IPython. Use for example %prun dens_scott.cdf() after having run the below script. 33000 function calls for a 1-D example with 1000 points.

import numpy as np
from scipy import stats

from statsmodels.sandbox.distributions.mixture_rvs import mixture_rvs
from statsmodels.nonparametric import UKDE
from statsmodels.tools.tools import ECDF

import matplotlib.pyplot as plt


np.random.seed(12345)
obs_dist = mixture_rvs([.25,.75], size=1000, dist=[stats.norm, stats.norm],
                kwargs = (dict(loc=-1,scale=.5),dict(loc=1,scale=.5)))

dens_scott = UKDE(tdat=[obs_dist], var_type='c', bw='normal_reference')
est_scott = dens_scott.pdf()

xx = np.linspace(-4, 4, 100)
est_scott_cdf = dens_scott.cdf(xx)
ecdf = ECDF(obs_dist)

# Plot the cdf
fig = plt.figure()
plt.plot(xx, est_scott_cdf, 'b.-')
plt.plot(ecdf.x, ecdf.y, 'r.-')

plt.show()
Collaborator

The for-loop could be over the number of variables instead of the data points, right?

Owner

It looks that way to me too. @gpanterov is there any reason this couldn't be vectorized over the number of observations?
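
A sketch of what looping over variables instead of data points could look like for an all-continuous Gaussian product kernel. This is not the GPKE code from the branch: the function name and the (nobs, K) / (npoints, K) layout are assumptions, and discrete kernels are left out.

```python
import numpy as np


def gaussian_product_pdf(tdat, edat, bw):
    """Product-kernel density at each row of edat.

    tdat : (nobs, K) training data
    edat : (npoints, K) evaluation points
    bw   : (K,) bandwidths, one per variable
    """
    tdat = np.asarray(tdat, dtype=float)
    edat = np.asarray(edat, dtype=float)
    bw = np.asarray(bw, dtype=float)
    dens = np.ones((edat.shape[0], tdat.shape[0]))
    for k in range(tdat.shape[1]):       # K iterations, not nobs
        z = (edat[:, k][:, None] - tdat[:, k][None, :]) / bw[k]
        dens *= np.exp(-0.5 * z ** 2) / (np.sqrt(2 * np.pi) * bw[k])
    # Average the kernel products over the observations.
    return dens.mean(axis=1)
```

The trade-off is memory: the intermediate array is npoints by nobs, so very large samples may still need to be processed in blocks.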

@rgommers

Indentation not equal here.

@rgommers

Need a blank line here.

@rgommers
Collaborator

Please add import matplotlib.pyplot as plt.

Collaborator

Pareto is still broken.

Collaborator

The Weibull plot shows an interesting issue - finite support isn't handled in the UKDE, CKDE classes. Close to 0 this goes wrong.

matplotlib.pyplot - added
Pareto broken -- strange. It works for me. Maybe it depends on the seed. With np.random.seed(123456) I get the following plot:
http://statsmodels-np.blogspot.com/2012/07/kde-estimate-of-pareto-rv.html
The density estimate isn't very good but this could be due to the relatively small sample size. Will add seed.

Collaborator

With that seed I get the same result as you. The result is quite far off again though, again due to not dealing with support.

@jseabold

Probably want to avoid underscores in class names unless it's to mark the class as private. CamelCase is almost always good enough.

Owner

You also want to stick explicitly with new-style classes. Ie., all of your classes should inherit from object

class GenericKDE(object):
@jseabold

Not a huge deal, but you want a class method here or this gets called for every test method. Ie.,

@classmethod 
def setUp(cls):
   ...
@jseabold

Could you post scripts somewhere to compare the output for the full datasets? I'd like to compare the performance.

Sure. I can do that. But it becomes quite slow if you include the entire data set and if you use the data-driven bandwidth estimates (especially cross-validation least squares)

Owner

Yeah, don't put it in the tests, but if you could put it as an example script somewhere I can play with that would be helpful.

@gpanterov gpanterov removed the Epanechnikov Kernel. Could be added at a later time again…
… if we decide to give the user an option to specify kernels. However most refs claim kernel not important
02b3781
@jseabold

This could be a property since it doesn't need any arguments. I'm not sure about caching it yet since I don't know all the moving parts.
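
A minimal illustration of the property idea, using a hypothetical class and attribute (nobs) rather than the actual method under review. No caching is done, so the value always reflects the current data.

```python
import numpy as np


class DensitySketch(object):
    """Hypothetical container; only the property pattern matters here."""

    def __init__(self, tdat):
        self.tdat = np.asarray(tdat, dtype=float)

    @property
    def nobs(self):
        # Computed on access, so callers write est.nobs, not est.nobs().
        return self.tdat.shape[0]
```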

@rgommers
Collaborator

My version of MPL (1.0.1) doesn't have ax.zaxis. What version are you on? Can you leave out those 2 lines, or replace them with something that works for multiple versions?

Also, you could add a second plot of the same density with imshow(Z). While the 3-D version is fancier, I find the 2-D one much easier to interpret.

Other than that, looks good.

@rgommers

You don't need A, B and the for-loop. These 8 lines can be replaced by

ix = np.random.uniform(size=N) > 0.5
V = np.random.multivariate_normal(mu1, cov1, size=N)
V[ix, :] = np.random.multivariate_normal(mu2, cov2, size=N)[ix, :]
@rgommers

This for-loop can also be removed. Three lines above can be replaced by

edat = np.column_stack([X.ravel(), Y.ravel()])
Z = dens.pdf(edat).reshape(X.shape)

It would be good to document in the pdf method that the required shape for edat is (num_points, K), with K the number of dimensions (available as the K attribute).
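
The shape convention is easy to see with a stand-in density. Below, standin_pdf is a hypothetical replacement for dens.pdf (a standard bivariate normal), used only so the grid round trip can be run without the estimator.

```python
import numpy as np


def standin_pdf(edat):
    """Standard bivariate normal density for edat of shape (num_points, 2)."""
    edat = np.asarray(edat, dtype=float)
    return np.exp(-0.5 * (edat ** 2).sum(axis=1)) / (2 * np.pi)


x = np.linspace(-2, 2, 5)
y = np.linspace(-1, 1, 3)
X, Y = np.meshgrid(x, y)                          # both have shape (3, 5)
edat = np.column_stack([X.ravel(), Y.ravel()])    # shape (15, 2)
Z = standin_pdf(edat).reshape(X.shape)            # back to the grid shape
```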

@rgommers

This is OK for some testing, but should be replaced by usage of StringIO before it can be merged. Writing actual files to disk shouldn't be done in tests.
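
A sketch of the in-memory approach, assuming the code under test accepts an open file-like object. The function write_report is hypothetical and stands in for whatever currently writes to disk.

```python
import io


def write_report(fh, bw):
    """Write one line per bandwidth value to an open text file handle."""
    for i, h in enumerate(bw):
        fh.write(u"bw[%d] = %.4f\n" % (i, h))


# In a test, pass a StringIO buffer instead of opening a real file.
buf = io.StringIO()
write_report(buf, [0.25, 1.5])
text = buf.getvalue()
```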

@rgommers
Collaborator

This looks good from a quick browse. I'd call SetDefaults something more informative, perhaps EstimatorSettings?

@josef-pkt

__init__.py should be empty except for test.
those imports should move into api.py

@rgommers
Collaborator

Noticing now that I don't really understand the API here. The description for censor_var is

censor_var: Float
    Value at which the dependent variable is censored

Now you're saying censor_var=0. What does that even mean?

Collaborator

Also, C3 isn't used.

You are right. It is a bit confusing. It should be censor_val. I.e. the value at which the dependent variable is censored. For example if you only have income data up to $100,000 then your dependent variable is censored at 100K

Collaborator

Renaming would be good then.

Also, please add a clear explanation of what left-censored means. Looking at the test (and not remembering left/right), I would assume 0 means only positive values are present. Your example for 100K clearly means the opposite....
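
To pin down the terminology with made-up numbers: under left-censoring, values at or below the threshold are recorded as the threshold; under right-censoring, values at or above it are. The 100K income cap is therefore a right-censoring example. The variable names here are illustrative only.

```python
import numpy as np

latent = np.array([-1.5, -0.2, 0.0, 0.7, 2.3])

# Left-censored at 0: anything below 0 is observed as 0.
censor_val = 0.0
left_censored = np.maximum(latent, censor_val)

# Right-censored at 1: anything above 1 is observed as 1.
upper_val = 1.0
right_censored = np.minimum(latent, upper_val)
```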

Collaborator

Can you also fix these, right now the nonparametric-all branch doesn't import due to syntax errors:

def __init__(self, tydat, txdat, bw, var_type fform, estimator):  # missing comma

def __init__(self, tydat, txdat, var_type, reg_type, bw='cv_ls',   # censor_var needs to be a kw
            censor_var, defaults=SetDefaults()):
Collaborator

And a bunch more too.

@rgommers
Collaborator

Don't forget your commit messages. They may stick around for a couple of decades:)

@rgommers
Collaborator

Before we get again lots of comments on this PR, could you please rebase it onto latest statsmodels master? All (almost all) the comments that are here now are also present in #408.

@rgommers
Collaborator

Closing, superseded by #562.

@rgommers rgommers closed this
Commits on May 26, 2012
  1. @gpanterov

    Multivar KDEs

    gpanterov authored
  2. @gpanterov

    Minor Change

    gpanterov authored
Commits on Jun 9, 2012
  1. @gpanterov
  2. @gpanterov

    Fixed bw/bwmethod

    gpanterov authored
Commits on Jun 10, 2012
  1. @gpanterov
  2. @gpanterov
  3. @gpanterov
  4. @gpanterov
Commits on Jun 18, 2012
  1. @rgommers
  2. @rgommers

    TST: nonparametric: convert UKDE example into a doctest.

    rgommers authored
    Doctests can be run with:
    
        from statsmodels import nonparametric
        nonparametric.test(doctests=True)
    
    Or "nosetests --with-doctest" in the nonparametric dir.
    
    Note that some other doctests seem to be failing at the moment.
Commits on Jun 22, 2012
  1. @gpanterov
  2. @gpanterov
  3. @gpanterov
Commits on Jun 25, 2012
  1. @rgommers

    ENH: add __repr__ method to UKDE.

    rgommers authored
    CKDE method still to do.  Also clean up some docstrings and code.
Commits on Jun 27, 2012
  1. @gpanterov

    pep8 on KernelFunctions.py

    gpanterov authored
  2. @gpanterov
Commits on Jun 29, 2012
  1. @gpanterov

    edat fix

    gpanterov authored
  2. @gpanterov

    fixed edat

    gpanterov authored
  3. @gpanterov

    removed imse_slow and gpke2

    gpanterov authored
Commits on Jun 30, 2012
  1. @gpanterov
  2. @gpanterov

    some more changes

    gpanterov authored
  3. @gpanterov

    changes..

    gpanterov authored
Commits on Jul 1, 2012
  1. @gpanterov

    resolved conflicts

    gpanterov authored
  2. @gpanterov
  3. @gpanterov
  4. @gpanterov

    some minor fixes + clean up

    gpanterov authored
  5. @gpanterov

    fixed test failures

    gpanterov authored
  6. @gpanterov

    ...

    gpanterov authored
  7. @gpanterov
Commits on Jul 4, 2012
  1. @gpanterov
Commits on Jul 7, 2012
  1. @gpanterov
  2. @gpanterov
Commits on Jul 8, 2012
  1. @gpanterov
Commits on Jul 10, 2012
  1. @gpanterov
  2. @gpanterov

    added the derivative of the Gaussian kernel to be used for the calcul…

    gpanterov authored
    …ation of the marginal effects in the regression
Commits on Jul 11, 2012
  1. @gpanterov
  2. @gpanterov
  3. @gpanterov

    cleaned GPKE and PKE

    gpanterov authored
  4. @gpanterov
  5. @gpanterov
  6. @gpanterov
  7. @gpanterov
  8. @gpanterov
Commits on Jul 12, 2012
  1. @gpanterov
  2. @gpanterov

    removed the Epanechnikov Kernel. Could be added at a later time again…

    gpanterov authored
    … if we decide to give the user an option to specify kernels. However most refs claim kernel not important
  3. @gpanterov
  4. @gpanterov
  5. @gpanterov
  6. @gpanterov
  7. @gpanterov
  8. @gpanterov
Commits on Jul 13, 2012
  1. @gpanterov
  2. @gpanterov

    added repr method to Reg

    gpanterov authored
  3. @gpanterov

    small fix to adjust_shape

    gpanterov authored
  4. @gpanterov
Commits on Jul 15, 2012
  1. @rgommers

    BUG: nonparametric: fix three test failures due to incorrect usage of…

    rgommers authored
    … `fill`.
    
    Note also the added FIXME's.  Some bugs are still left; points also to
    incomplete test coverage.
Commits on Jul 17, 2012
  1. @gpanterov

    examples + kernel fix

    gpanterov authored
Commits on Jul 22, 2012
  1. @gpanterov
  2. @gpanterov

    fixed adjust_shape

    gpanterov authored
  3. @gpanterov

    some new fixes

    gpanterov authored
Commits on Jul 23, 2012
  1. @gpanterov

    removed pke

    gpanterov authored
  2. @gpanterov
  3. @gpanterov
  4. @gpanterov

    fixed np_tools

    gpanterov authored
  5. @gpanterov

    fixed pep8 issues

    gpanterov authored
Commits on Jul 25, 2012
  1. @gpanterov
Commits on Jul 26, 2012
  1. @gpanterov
Commits on Jul 27, 2012
  1. @gpanterov
  2. @gpanterov

    censored regression

    gpanterov authored
  3. @gpanterov
  4. @gpanterov
Commits on Jul 30, 2012
  1. @gpanterov

    ..

    gpanterov authored
  2. @gpanterov
Commits on Jul 31, 2012
  1. @gpanterov

    added efficient estimation (blocking) of bandiwdth by breaking up lar…

    gpanterov authored
    …ge samples (still some test failures with Reg)
Commits on Aug 1, 2012
  1. @gpanterov

    ..

    gpanterov authored
Commits on Aug 2, 2012
  1. @gpanterov

    first attempt at sig test

    gpanterov authored
Commits on Aug 4, 2012
  1. @rgommers
  2. @rgommers
  3. @rgommers
  4. @rgommers

    MAINT: rename KernelFunctions.py --> kernels.py

    rgommers authored
    Also remove some more unused imports.
Commits on Aug 6, 2012
  1. @gpanterov
  2. @gpanterov
  3. @gpanterov
  4. @gpanterov
  5. @gpanterov

    ..

    gpanterov authored
  6. @gpanterov
Commits on Aug 13, 2012
  1. @gpanterov
  2. @gpanterov
  3. @gpanterov
  4. @gpanterov

    Merge pull request #4 from rgommers/nonparametric-density

    gpanterov authored
    Fixes / typos / small improvements for PR-408.
  5. @gpanterov
  6. @gpanterov

    Merge branch 'nonparametric-density' of github.com:gpanterov/statsmod…

    gpanterov authored
    …els into nonparametric-density
  7. @gpanterov
  8. @gpanterov
  9. @gpanterov
  10. @gpanterov

    ..

    gpanterov authored
  11. @gpanterov
  12. @gpanterov
  13. @gpanterov

    changed __init__.py

    gpanterov authored
Commits on Aug 14, 2012
  1. @gpanterov
  2. @gpanterov

    lower case names for kernels

    gpanterov authored
Commits on Aug 15, 2012
  1. @gpanterov
  2. @gpanterov
Commits on Aug 16, 2012
  1. @gpanterov

    ...

    gpanterov authored
  2. @gpanterov

    ..

    gpanterov authored
Commits on Aug 17, 2012
  1. @gpanterov

    added significance tests for discrete and continuous variables, fixed…

    gpanterov authored
    … some bugs with the AIC bandwidth selection criterion, added the Semiparametric Single Index Model
  2. @gpanterov
  3. @gpanterov
Commits on Aug 19, 2012
  1. @gpanterov
  2. @gpanterov
  3. @gpanterov
  4. @gpanterov
Commits on Aug 20, 2012
  1. @gpanterov
  2. @gpanterov