Skip to content
This repository

Nonparametric all #434

Closed
wants to merge 114 commits into from

4 participants

gpanterov Ralf Gommers Josef Perktold Skipper Seabold
gpanterov

Final pull request for the coding during GSoC. This branch includes some improvements to the density estimation methods in nonparametric-density such as efficient bandwidth estimation by breaking the sample into smaller blocks. It also includes new features such as nonparametric regression, nonparametric censored regression, significance tests for nonparametric regression variables (for both continuous and discrete variables); first draft of nonparametric test for functional form. The branch includes two semiparametric models: Semiparametric Partially Linear Model and Semiparametric Single Index Model.

Josef Perktold

I think it would pay off to write test for various shapes of this function.

For example, my guess is that the calls to kernel_func don't work if there are more than one variable of the same type, or does it? It might if kernelfunc are vectorized.

The kernel functions are vectorized. I actually tested it with multiple variables. For example to obtain P(c1,c3 | c2):

dens_c=CKDE(tydat=[c1,c3],txdat=[c2], dep_type='cc',indep_type='c',bwmethod='normal_reference')
dens_c.pdf()

Owner

nice, now I see that the bandwidth h is also vectorized.

Ralf Gommers

The way you've written the signature of this function, both arguments should be specified. So this note must be incorrect. Your TODO below (combining into one parameter) makes sense. I assume bw will then be a user-specified function (it doesn't actually say that in the description of bw) with a standard signature.

Collaborator

Naming a method get_x implies that x is an already-existing attribute/number. Perhaps better to name it find_bw or compute_bw or similar.

Ralf Gommers

Could you indicate that this is Scott's rule of thumb? Silverman's is almost as popular I think.

Also, the method name is not so clear I think - could be named such that it's clear that this is a bandwidth estimate.

Ralf Gommers

The methods normal_reference, cv_ml and cv_ls are all private, right? They should only be called through get_bw. So start the names with an underscore.

Ralf Gommers

It would be good to explain in a few sentences what "conditional kde" actually is and give a reference. Conditional estimation is a lot less common than unconditional; unconditional is normally even left off ("kernel density estimation" refers to your UKDE class).

Ralf Gommers

For keywords which can be left out, use None as default parameter. False implies that this is a boolean variable. The check for input given or not is then written as if eydat is not None.

Ralf Gommers

This for-loop doesn't do anything. It only creates var, which isn't used below.

Owner

column_stack (one of my favorites) could be used, however,

concatenate and column_stack copy the data, AFAICS

In general it might be better to work with views, and require users to concatenate themselves.
For example, in the LOO loop, tdat is already an array (view), if there is no concatenate call then the class would have a view instead of making a copy of the data. In most models we try to use views on the original data, exog and endog, although some calculations might create copies anyway.
(We never checked systematically whether we save any memory or are faster this way.)

Ralf Gommers

Same here, for loops don't do anything.

Josef Perktold

index could be 1d (row selector), then reshape is not necessary

Josef Perktold

pep8 doesn't have space for = in keyword arguments. minor issue but useful to get used to

Owner

sorry, I guess I wasn't clear before, below are many spaces in keyword =

Josef Perktold

is if bw is not None or if not bw is None

Josef Perktold

should this be compute_bw(self, bw=None, bwmethod=None) ? should be, if they are optional, even if one of the two is required

I think, if it's possible, then there should be a recommended default, Scott's or Silverman - normal_reference?

Ralf Gommers

Should be

if edat is None:
    edat = self.tdat
Ralf Gommers
Collaborator

When you're doing PEP8 fixes, make your life easier by running http://pypi.python.org/pypi/pep8 over the file(s). It will warn you when things are non-standard.

Ralf Gommers

Should be if not isinstance(bw, basestring).

The else clause below should probably also check that the input is a callable, like so hasattr(bw, '__call__').

Ralf Gommers

This took me a bit of puzzling. IMSE doesn't actually calculate the integral, so the name is a bit deceptive. I guess you don't need to explicitly calculate it if you're only using it from optimize.fmin.

Did you also plan to provide other metrics, like ISE or IAE?

Ralf Gommers
Collaborator

I think the purpose of the convolution kernels and how to use them needs some explanation. So far they're only used in UKDE.IMSE as far as I can tell.

rgommers and others added some commits
Ralf Gommers

I'd reserve LaTeX for formulas that are at least somewhat complicated. This would be better written as plain text.

Ralf Gommers

Probably better to describe what's different from GPKE. Anything besides the summing? I thought this was vectorized too, can't that be reused?

Ralf Gommers

Not actually implemented yet (commented out below). Wouldn't it be easier to return the sorted tdat or edat than ix?

I left it inactive because it get confusing when you have more than one variable. What should you sort by when you in the multivariate case?

Collaborator

Perhaps all of them, first on axis 0, then 1, etc.?

Ralf Gommers
Collaborator

If you don't need code anymore, better to delete it.

Ralf Gommers
Collaborator

This fix really needs a test for edat usage.

Ralf Gommers

The old version (array_like) is actually the correct one.

Ralf Gommers
Collaborator

I think you meant these as examples right? Nothing is actually tested. Matplotlib is only an optional dependency of statsmodels, so you should only use it within a function or with a conditional import (i.e. within a try-catch).

Josef Perktold

doesn't work for me

import statsmodels.nonparametric.nonparametric2 as nparam ?

Josef Perktold
Owner

leaving plot.show() in a test hangs the test when I run it
also matplotlib import needs to be protected (try except)

Owner

pareto graph is just a line at 1e100
laplace and powerlaw ? seem to have problems close to the boundaries

one print statement is left somewhere when running the tests

tests run without failures, but are slow, 244 seconds, we need to figure out a way to shorten this or mark some as slow before merging

(These are things that need to be fixed before merging, but can be ok, or not our business, in a development branch)

Owner

some of the test cases would make nice example scripts

Owner

I think the class names need to be made longer and more descriptive. Only the most famous models are allowed to have acronyms. KDE is ok, but UKDE doesn't tell me anything.

Josef Perktold

special.ndtr(x) has the cdf for standard normal, used by scipy.stats.distributions.norm

Josef Perktold
Owner

nice, I like having the cdf available, Azzalini (IIRC) mentioned that the rate (as function of n) for the bandwidth for cdf should be smaller (?) than for the pdf. Did you see anything about bw choice for cdf?

Collaborator

Fast, and seems to work well. At least in 1-D, converges nicely to the empirical CDF.

Collaborator

@josef-pkt that's what I thought too, bandwidth isn't the same as for pdf.

Collaborator

Can you factor out all the code related to asarray, K, N and reshaping? It's more than 10 lines that are duplicated in every single kernel function. Should be something like

def _get_shape_and_type(Xi, x, kind='c'):
    ...
    return Xi, x, K, N
Collaborator

The UKDE, CKDE interface now doesn't allow specifying the kernels to use. The Epannechnikov kernel isn't used at all. Are you planning to expand that interface? In any case, what the default kernels are should be documented.

Collaborator

I have the feeling that the for-loop in GPKE can still be optimized, it's very expensive now. You can see this easily by profiling in IPython. Use for example %prun dens_scott.cdf() after having run the below script. 33000 function calls for a 1-D example with 1000 points.

import numpy as np

from statsmodels.sandbox.distributions.mixture_rvs import mixture_rvs
from statsmodels.nonparametric import UKDE
from statsmodels.tools.tools import ECDF

import matplotlib.pyplot as plt


np.random.seed(12345)
obs_dist = mixture_rvs([.25,.75], size=1000, dist=[stats.norm, stats.norm],
                kwargs = (dict(loc=-1,scale=.5),dict(loc=1,scale=.5)))

dens_scott = UKDE(tdat=[obs_dist], var_type='c', bw='normal_reference')
est_scott = dens_scott.pdf()

xx = np.linspace(-4, 4, 100)
est_scott_cdf = dens_scott.cdf(xx)
ecdf = ECDF(obs_dist)

# Plot the cdf
fig = plt.figure()
plt.plot(xx, est_scott_cdf, 'b.-')
plt.plot(ecdf.x, ecdf.y, 'r.-')

plt.show()
Collaborator

The for-loop could be over the number of variables instead of the data points, right?

Owner

It looks that way to me too. @gpanterov is there any reason this couldn't be vectorized over the number of observations?

Ralf Gommers

Indentation not equal here.

Ralf Gommers
Collaborator

Please add import matplotlib.pyplot as plt.

Collaborator

Pareto is still broken.

Collaborator

The Weibull plot shows an interesting issue - finite support isn't handled in the UKDE, CKDE classes. Close to 0 this goes wrong.

matplotlib.pyplot - added
Broke Pareto -- strange. It works for me. Maybe it depends on the seed. With np.seed(123456) I get the following plot :
http://statsmodels-np.blogspot.com/2012/07/kde-estimate-of-pareto-rv.html
The density estimate isn't very good but this could be due to the relatively small sample size. Will add seed.

Collaborator

With that seed I get the same result as you. The result is quite far off again though, again due to not dealing with support.

Skipper Seabold

Probably want to avoid underscores in class names unless it's to mark the class as private. CamelCase is almost always good enough.

Owner

You also want to stick explicitly with new-style classes. Ie., all of your classes should inherit from object

class GenericKDE(object):
Skipper Seabold

Not a huge deal, but you want a class method here or this gets called for every test method. Ie.,

@classmethod 
def setUp(cls):
   ...
Skipper Seabold

Could you post scripts somewhere to compare the output for the full datasets. I'd like to compare the performance.

Sure. I can do that. But it becomes quite slow if you include the entire data set and if you use the data-driven bandwidth estimates (especially cross-validation least squares)

Owner

Yeah, don't put it in the tests, but if you could put it as an example script somewhere I can play with that would be helpful.

gpanterov gpanterov removed the Epanechnikov Kernel. Could be added at a later time again…
… if we decide to give the user an option to specify kernels. However most refs claim kernel not important
02b3781
Skipper Seabold

This could be a property since it doesn't need any arguments. I'm not sure about caching it yet since I don't know all the moving parts.

Ralf Gommers
Collaborator

My version of MPL (1.0.1) doesn't have ax.zaxis. What version are you on? Can you leave out those 2 lines, or replace them with something that works for multiple versions?

Also, you could add a second plot of the same density with imshow(Z). While the 3-D version is fancier, I find the 2-D one much easier to interpret.

Other than that, looks good.

Ralf Gommers

You don't need A, B and the for-loop. These 8 lines can be replaced by

ix = np.random.uniform(size=N) > 0.5
V = np.random.multivariate_normal(mu1, cov1, size=N)
V[ix, :] = np.random.multivariate_normal(mu2, cov2, size=N)[ix, :]
Ralf Gommers

This for-loop can also be removed. Three lines above can be replaced by

edat = np.column_stack([X.ravel(), Y.ravel()])
Z = dens.pdf(edat).reshape(X.shape)

It would be good to document in the pdf method that the required shape for edat is (num_points, K), with K the number of dimensions (available as the K attribute.

Ralf Gommers

This is OK for some testing, but should be replaced by usage of StringIO before it can be merged. Writing actual files to disk shouldn't be done in tests.

Ralf Gommers
Collaborator

This looks good from a quick browse. I'd call SetDefaults something more informative, perhaps EstimatorSettings?

Josef Perktold

__init__.py should be empty except for test.
those imports should move into api.py

Ralf Gommers
Collaborator

Noticing now that I don't really understand the API here. The description for censor_var is

censor_var: Float
    Value at which the dependent variable is censored

Now you're saying censor_var=0. What does that even mean?

Collaborator

Also, C3 isn't used.

You are right. It is a bit confusing. It should be censor_val. I.e. the value at which the dependent variable is censored. For example if you only have income data up to $100,000 then your dependent variable is censored at 100K

Collaborator

Renaming would be good then.

Also, please add a clear explanation of what left-censored means. Looking at the test (and not remembering left/right), I would assume 0 means only positive values are present. Your example for 100K clearly means the opposite....

Collaborator

Can you also fix these, right now the nonparametric-all branch doesn't import due to syntax errors:

def __init__(self, tydat, txdat, bw, var_type fform, estimator):  # missing comma

def __init__(self, tydat, txdat, var_type, reg_type, bw='cv_ls',   # censor_var needs to be a kw
            censor_var, defaults=SetDefaults()):
Collaborator

And a bunch more too.

Ralf Gommers
Collaborator

Don't forget your commit messages. They may stick around for a couple of decades:)

Ralf Gommers
Collaborator

Before we get again lots of comments on this PR, could you please rebase it onto latest statsmodels master? All (almost all) the comments that are here now are also present in #408.

Ralf Gommers
Collaborator

Closing, superseded by #562.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Showing 114 unique commits by 2 authors.

May 26, 2012
gpanterov gpanterov Multivar KDEs dd102b2
gpanterov gpanterov Minor Change 56a25b3
Jun 09, 2012
gpanterov gpanterov Implemented Suggestions by Ralf and Josef 8610a4a
gpanterov gpanterov Fixed bw/bwmethod ab2439c
Jun 10, 2012
gpanterov gpanterov Introduced default for bandwidth selection (Josef) fb558e9
gpanterov gpanterov Removed convolution and cumulative density 0702486
gpanterov gpanterov least squares cross validation + changes in API - added tools module 19b7548
gpanterov gpanterov Finished with cv_ls for continuous and ordered variables c771f52
Jun 18, 2012
Ralf Gommers rgommers TST: convert test script to actual unit tests. Add UKDE, CKDE to __in…
…it__.py
bc3f30e
Ralf Gommers rgommers TST: nonparametric: convert UKDE example into a doctest.
Doctests can be run with:

    from statsmodels import nonparametric
    nonparametric.test(doctests=True)

Or "nosetests --with-doctest" in the nonparametric dir.

Note that some other doctests seem to be failing at the moment.
7efea19
Jun 22, 2012
gpanterov gpanterov Added lst sq cv for conditional density (slow) 0163e3b
gpanterov gpanterov Vectorized parts of IMSE in CKDE. Significantly improved performance 5ec2c48
gpanterov gpanterov Added various tests. Most performe well but there are still issues wi…
…th speed
21a19a9
Jun 25, 2012
Ralf Gommers rgommers ENH: add __repr__ method to UKDE.
CKDE method still to do.  Also clean up some docstrings and code.
08eaa00
Jun 27, 2012
gpanterov gpanterov pep8 on KernelFunctions.py 7985008
gpanterov gpanterov pep8 on nonparametric2.py and np_tools.py b4815c9
Jun 29, 2012
gpanterov gpanterov edat fix 2ccbae0
gpanterov gpanterov fixed edat f175261
gpanterov gpanterov removed imse_slow and gpke2 2942de4
Jun 30, 2012
gpanterov gpanterov added graphical tests for several distributions 1182a8c
gpanterov gpanterov some more changes eb198fa
gpanterov gpanterov changes.. 22ea70d
Jul 01, 2012
gpanterov gpanterov resolved conflicts e56f4db
gpanterov gpanterov expanded doc strings for Generic KDE and UKDE 4eed5e0
gpanterov gpanterov fixed doc strings and included latex formulas 7eeb5e3
gpanterov gpanterov some minor fixes + clean up 3bad1da
gpanterov gpanterov fixed test failures 51a175f
gpanterov gpanterov ... b182b9a
gpanterov gpanterov added latex formulas for kernels in KernelFunctions.py 877b63d
Jul 04, 2012
gpanterov gpanterov added nonparametric regression + tests for ordered case 7c7ae88
Jul 07, 2012
gpanterov gpanterov added fast unconditional multivariate cdf and tested with continuous …
…and mixed data
125c66a
gpanterov gpanterov added conditional multivariate cdf; tests for continuous and mixed; c…
…ode needs cleaning
e29e026
Jul 08, 2012
gpanterov gpanterov nonparametric regression with margina fx, R2, signiciance 04895ff
Jul 10, 2012
gpanterov gpanterov Fixed npreg CV.LS; added tests for continuous and mixed for mean, R2,…
… bandwidth
46e076d
gpanterov gpanterov added the derivative of the Gaussian kernel to be used for the calcul…
…ation of the marginal effects in the regression
292c0a3
Jul 11, 2012
gpanterov gpanterov implemented some suggestions by Josef and Ralf b66e698
gpanterov gpanterov created only one GPKE function for all classes 3a29a93
gpanterov gpanterov cleaned GPKE and PKE 398acf1
gpanterov gpanterov moved distribution plots from test file to examples file (yet to be c…
…ommitted
8b14a1a
gpanterov gpanterov added Reg class to __init__.py fc84351
gpanterov gpanterov added univariate kde example for 6 common distributions. merged with …
…main nonparametric branch
6a4fa5a
gpanterov gpanterov added import plt to ex_univar_kde.py 9242513
gpanterov gpanterov added seed 123456 to ex_univar_kde.py c7fe86c
Jul 12, 2012
gpanterov gpanterov added docstrings for Reg class. added repr method for CKDE 9ed520b
gpanterov gpanterov removed the Epanechnikov Kernel. Could be added at a later time again…
… if we decide to give the user an option to specify kernels. However most refs claim kernel not important
02b3781
gpanterov gpanterov made nonparametric2.py pep 8 compliant 7e35ec4
gpanterov gpanterov made np_tools.py pep 8 compliant ee2cd70
gpanterov gpanterov made KernelFunctions.py pep 8 compliant b828625
gpanterov gpanterov made test_nonparametric2.py pep 8 compliant 616d956
gpanterov gpanterov naming conventions suggested by Skipper be6eb11
gpanterov gpanterov added local linear estimator to Reg. modified Reg api. added tests 85d0773
Jul 13, 2012
gpanterov gpanterov fixed a minor issue with CKDE.cdf, modified adjust_shape function 46a9a8f
gpanterov gpanterov added repr method to Reg 008ef14
gpanterov gpanterov small fix to adjust_shape 2a72d5f
gpanterov gpanterov added an example with multivariate bimodal distribution plot (cv_ml) 285297e
Jul 15, 2012
Ralf Gommers rgommers BUG: nonparametric: fix three test failures due to incorrect usage of…
… `fill`.

Note also the added FIXME's.  Some bugs are still left; points also to
incomplete test coverage.
a957da0
Jul 17, 2012
gpanterov gpanterov examples + kernel fix 6fb2c1f
Jul 22, 2012
gpanterov gpanterov optimizing gpke + kernels for speed 919aa97
gpanterov gpanterov fixed adjust_shape 00fa9f0
gpanterov gpanterov some new fixes 1383dba
Jul 23, 2012
gpanterov gpanterov removed pke bc9463d
gpanterov gpanterov cleaned up KernelFunctions.py 2888c1d
gpanterov gpanterov preparing the density estimators for pull request (without the regres…
…sion)
ea9b587
gpanterov gpanterov fixed np_tools 656f96a
gpanterov gpanterov fixed pep8 issues 0c105cf
Jul 25, 2012
gpanterov gpanterov removed reg class from nonparametric2.py 09d761a
Jul 26, 2012
gpanterov gpanterov added local linear with marginal effects for mixed data 9f32c2f
Jul 27, 2012
gpanterov gpanterov added marginal effects for the local linar estimator with mixed data ea750b4
gpanterov gpanterov censored regression 7cfbc39
gpanterov gpanterov marginal effects for local linear estimator + tests bd2229a
gpanterov gpanterov added right-censored regression to Reg class d9636fd
Jul 30, 2012
gpanterov gpanterov .. bf61c16
gpanterov gpanterov implemented suggestions by Ralf c67e189
Jul 31, 2012
gpanterov gpanterov added efficient estimation (blocking) of bandiwdth by breaking up lar…
…ge samples (still some test failures with Reg)
da9455c
Aug 01, 2012
gpanterov gpanterov .. ab3e2e2
Aug 02, 2012
gpanterov gpanterov first attempt at sig test 2c5b17a
Aug 04, 2012
Ralf Gommers rgommers MAINT: fix many small typos and inconsistencies in UKDE/CKDE classes. 7faa745
Ralf Gommers rgommers MAINT: clean up nonparametric/np_tools.py and remove unused imports. 25818f2
Ralf Gommers rgommers MAINT: clean up nonparametric/KernelFunctions.py c9daa4e
Ralf Gommers rgommers MAINT: rename KernelFunctions.py --> kernels.py
Also remove some more unused imports.
a6be4fe
Aug 06, 2012
gpanterov gpanterov reworked api for bw blocking estimation f749c52
gpanterov gpanterov changed mean to median in only_bw option c922e3b
gpanterov gpanterov moved censored reg in seperate class 7d32bde
gpanterov gpanterov merged with nonparametric-reg-block d227b58
gpanterov gpanterov .. 955c0b4
gpanterov gpanterov finished mergin with nonparametric-reg-block 4774029
Aug 13, 2012
gpanterov gpanterov added docstrings for the efficient bw estimator a0b50dd
gpanterov gpanterov fixed docstrings for censored class 7166b1b
gpanterov gpanterov added docstrings (still fixing some issues with writing the tests) 9f9b970
gpanterov gpanterov Merge pull request #4 from rgommers/nonparametric-density
Fixes / typos / small improvements for PR-408.
7e732d8
gpanterov gpanterov fixed minor issue with example in CKDE class b705644
gpanterov gpanterov Merge branch 'nonparametric-density' of github.com:gpanterov/statsmod…
…els into nonparametric-density
aef14c6
gpanterov gpanterov fixed issues with documentation of kernels 1fc6777
gpanterov gpanterov merged nonparametric-reg branch 457ae87
gpanterov gpanterov merged nonparametric-reg-block branch for efficient bw estimation 7b57346
gpanterov gpanterov .. 12f749a
gpanterov gpanterov merged nonparametric-censored branch for censored regression 24eef8d
gpanterov gpanterov merged nonparametric-reg-sigtests for significance tests for nonparam…
…etric regression
5c8e4e6
gpanterov gpanterov changed __init__.py 7e7fa9b
Aug 14, 2012
gpanterov gpanterov lower case names of kernels -- pep8 compliant 7be7228
gpanterov gpanterov lower case names for kernels 70ce4a4
Aug 15, 2012
gpanterov gpanterov added a test for censored reg 3e654d2
gpanterov gpanterov merged with nonparametric-density (lower case for kernels names for p…
…ep8)
e0c3734
Aug 16, 2012
gpanterov gpanterov ... 6e7ddfb
gpanterov gpanterov .. 5c019f9
Aug 17, 2012
gpanterov gpanterov added significance tests for discrete and continuous variables, fixed…
… some bugs with the AIC bandwidth selection criterion, added the Semiparametric Single Index Model
3aa6068
gpanterov gpanterov added the semiparametric partially linear model 33300cf
gpanterov gpanterov docstrings for semiparametric models and significance tests a73e815
Aug 19, 2012
gpanterov gpanterov added the loc linear to the two semiparametric models 2bed8bd
gpanterov gpanterov added tests for the linear component of the semi parametric partially…
… linear model
0bc5ffc
gpanterov gpanterov changed names of UKDE and CKDE in nonparametric-all 3d1cd22
gpanterov gpanterov added clarification comments to the efficient bandwidth estimation part 1be4201
Aug 20, 2012
gpanterov gpanterov fixed issues with the tests for functional form in TestFForm() a759c2b
gpanterov gpanterov added an explanation of a left-censored variable to CensoredReg class 1cf8861
Something went wrong with that request. Please try again.