Nonparametric all #434

Closed
wants to merge 114 commits into
from

Conversation

Projects
None yet
4 participants
Contributor

gpanterov commented Aug 20, 2012

Final pull request for the coding during GSoC. This branch includes some improvements to the density estimation methods in nonparametric-density such as efficient bandwidth estimation by breaking the sample into smaller blocks. It also includes new features such as nonparametric regression, nonparametric censored regression, significance tests for nonparametric regression variables (for both continuous and discrete variables); first draft of nonparametric test for functional form. The branch includes two semiparametric models: Semiparametric Partially Linear Model and Semiparametric Single Index Model.

I think it would pay off to write test for various shapes of this function.

For example, my guess is that the calls to kernel_func don't work if there are more than one variable of the same type, or does it? It might if kernelfunc are vectorized.

Owner

gpanterov replied May 26, 2012

The kernel functions are vectorized. I actually tested it with multiple variables. For example to obtain P(c1,c3 | c2):

dens_c=CKDE(tydat=[c1,c3],txdat=[c2], dep_type='cc',indep_type='c',bwmethod='normal_reference')
dens_c.pdf()

nice, now I see that the bandwidth h is also vectorized.

The way you've written the signature of this function, both arguments should be specified. So this note must be incorrect. Your TODO below (combining into one parameter) makes sense. I assume bw will then be a user-specified function (it doesn't actually say that in the description of bw) with a standard signature.

Collaborator

rgommers replied May 27, 2012

Naming a method get_x implies that x is an already-existing attribute/number. Perhaps better to name it find_bw or compute_bw or similar.

Could you indicate that this is Scott's rule of thumb? Silverman's is almost as popular I think.

Also, the method name is not so clear I think - could be named such that it's clear that this is a bandwidth estimate.

The methods normal_reference, cv_ml and cv_ls are all private, right? They should only be called through get_bw. So start the names with an underscore.

It would be good to explain in a few sentences what "conditional kde" actually is and give a reference. Conditional estimation is a lot less common than unconditional; unconditional is normally even left off ("kernel density estimation" refers to your UKDE class).

For keywords which can be left out, use None as default parameter. False implies that this is a boolean variable. The check for input given or not is then written as if eydat is not None.

This for-loop doesn't do anything. It only creates var, which isn't used below.

column_stack (one of my favorites) could be used, however,

concatenate and column_stack copy the data, AFAICS

In general it might be better to work with views, and require users to concatenate themselves.
For example, in the LOO loop, tdat is already an array (view), if there is no concatenate call then the class would have a view instead of making a copy of the data. In most models we try to use views on the original data, exog and endog, although some calculations might create copies anyway.
(We never checked systematically whether we save any memory or are faster this way.)

Same here, for loops don't do anything.

index could be 1d (row selector), then reshape is not necessary

pep8 doesn't have space for = in keyword arguments. minor issue but useful to get used to

sorry, I guess I wasn't clear before, below are many spaces in keyword =

is if bw is not None or if not bw is None

should this be compute_bw(self, bw=None, bwmethod=None) ? should be, if they are optional, even if one of the two is required

I think, if it's possible, then there should be a recommended default, Scott's or Silverman - normal_reference?

Should be

if edat is None:
    edat = self.tdat
Collaborator

rgommers commented on 8610a4a Jun 13, 2012

When you're doing PEP8 fixes, make your life easier by running http://pypi.python.org/pypi/pep8 over the file(s). It will warn you when things are non-standard.

Should be if not isinstance(bw, basestring).

The else clause below should probably also check that the input is a callable, like so hasattr(bw, '__call__').

This took me a bit of puzzling. IMSE doesn't actually calculate the integral, so the name is a bit deceptive. I guess you don't need to explicitly calculate it if you're only using it from optimize.fmin.

Did you also plan to provide other metrics, like ISE or IAE?

Collaborator

rgommers commented on c771f52 Jun 17, 2012

I think the purpose of the convolution kernels and how to use them needs some explanation. So far they're only used in UKDE.IMSE as far as I can tell.

I'd reserve LaTeX for formulas that are at least somewhat complicated. This would be better written as plain text.

Probably better to describe what's different from GPKE. Anything besides the summing? I thought this was vectorized too, can't that be reused?

Not actually implemented yet (commented out below). Wouldn't it be easier to return the sorted tdat or edat than ix?

Owner

gpanterov replied Jul 2, 2012

I left it inactive because it get confusing when you have more than one variable. What should you sort by when you in the multivariate case?

Collaborator

rgommers replied Jul 2, 2012

Perhaps all of them, first on axis 0, then 1, etc.?

Collaborator

rgommers commented on 3bad1da Jul 2, 2012

If you don't need code anymore, better to delete it.

Collaborator

rgommers commented on f175261 Jul 2, 2012

This fix really needs a test for edat usage.

The old version (array_like) is actually the correct one.

Collaborator

rgommers commented on 1182a8c Jul 2, 2012

I think you meant these as examples right? Nothing is actually tested. Matplotlib is only an optional dependency of statsmodels, so you should only use it within a function or with a conditional import (i.e. within a try-catch).

doesn't work for me

import statsmodels.nonparametric.nonparametric2 as nparam ?

leaving plot.show() in a test hangs the test when I run it
also matplotlib import needs to be protected (try except)

pareto graph is just a line at 1e100
laplace and powerlaw ? seem to have problems close to the boundaries

one print statement is left somewhere when running the tests

tests run without failures, but are slow, 244 seconds, we need to figure out a way to shorten this or mark some as slow before merging

(These are things that need to be fixed before merging, but can be ok, or not our business, in a development branch)

some of the test cases would make nice example scripts

I think the class names need to be made longer and more descriptive. Only the most famous models are allowed to have acronyms. KDE is ok, but UKDE doesn't tell me anything.

special.ndtr(x) has the cdf for standard normal, used by scipy.stats.distributions.norm

This is OK for some testing, but should be replaced by usage of StringIO before it can be merged. Writing actual files to disk shouldn't be done in tests.

Collaborator

rgommers commented on f749c52 Aug 6, 2012

This looks good from a quick browse. I'd call SetDefaults something more informative, perhaps EstimatorSettings?

__init__.py should be empty except for test.
those imports should move into api.py

Collaborator

rgommers commented on 3e654d2 Aug 15, 2012

Noticing now that I don't really understand the API here. The description for censor_var is

censor_var: Float
    Value at which the dependent variable is censored

Now you're saying censor_var=0. What does that even mean?

Collaborator

rgommers replied Aug 15, 2012

Also, C3 isn't used.

Owner

gpanterov replied Aug 15, 2012

You are right. It is a bit confusing. It should be censor_val. I.e. the value at which the dependent variable is censored. For example if you only have income data up to $100,000 then your dependent variable is censored at 100K

Collaborator

rgommers replied Aug 15, 2012

Renaming would be good then.

Also, please add a clear explanation of what left-censored means. Looking at the test (and not remembering left/right), I would assume 0 means only positive values are present. Your example for 100K clearly means the opposite....

Collaborator

rgommers replied Aug 15, 2012

Can you also fix these, right now the nonparametric-all branch doesn't import due to syntax errors:

def __init__(self, tydat, txdat, bw, var_type fform, estimator):  # missing comma

def __init__(self, tydat, txdat, var_type, reg_type, bw='cv_ls',   # censor_var needs to be a kw
            censor_var, defaults=SetDefaults()):
Collaborator

rgommers replied Aug 15, 2012

And a bunch more too.

Collaborator

rgommers commented on 6e7ddfb Aug 16, 2012

Don't forget your commit messages. They may stick around for a couple of decades:)

Member

rgommers commented Aug 20, 2012

Before we get again lots of comments on this PR, could you please rebase it onto latest statsmodels master? All (almost all) the comments that are here now are also present in #408.

Member

rgommers commented Nov 14, 2012

Closing, superseded by #562.

@rgommers rgommers closed this Nov 14, 2012

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment