Final pull request for the GSoC coding period. This branch improves the density estimation methods in nonparametric-density, for example with efficient bandwidth estimation that breaks the sample into smaller blocks. It also adds new features: nonparametric regression, nonparametric censored regression, significance tests for nonparametric regression variables (both continuous and discrete), and a first draft of a nonparametric test for functional form. The branch also includes two semiparametric models: the Semiparametric Partially Linear Model and the Semiparametric Single Index Model.
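For readers unfamiliar with the blocking idea in the description, here is a minimal sketch of how it might work. It is an illustration only and not the code in this branch: the names rule_of_thumb_bw and blocked_bw, the shuffling, the n**(-1/5) rescaling and the use of the median are all assumptions made for the example.

```python
import random
import statistics


def rule_of_thumb_bw(sample):
    """Normal-reference bandwidth for one continuous variable (assumed form)."""
    n = len(sample)
    return 1.06 * statistics.stdev(sample) * n ** (-1.0 / 5.0)


def blocked_bw(sample, block_size, bw_func=rule_of_thumb_bw, seed=0):
    """Hypothetical helper: median of per-block bandwidths, rescaled to full n."""
    rng = random.Random(seed)
    shuffled = list(sample)
    rng.shuffle(shuffled)  # so blocks do not follow the original ordering
    n = len(shuffled)
    # Only full blocks are used; a trailing partial block is dropped.
    blocks = [shuffled[i:i + block_size]
              for i in range(0, n - block_size + 1, block_size)]
    # A bandwidth of order m**(-1/5) on a block of size m is rescaled to
    # order n**(-1/5) for the full sample of size n.
    scale = (block_size / n) ** (1.0 / 5.0)
    return statistics.median([bw_func(block) * scale for block in blocks])
```

The payoff is larger when bw_func is an expensive cross-validation routine: each block is cheap to optimize, and the median keeps one unlucky block from dominating the result.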
I think it would pay off to write tests for various shapes of this function.
For example, my guess is that the calls to kernel_func don't work if there is more than one variable of the same type, or do they? They might if the kernel functions are vectorized.
The kernel functions are vectorized. I actually tested it with multiple variables. For example to obtain P(c1,c3 | c2):
nice, now I see that the bandwidth h is also vectorized.
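To illustrate what "vectorized" means in this exchange, here is a minimal sketch of a product-kernel density estimate for continuous variables only. The function names and signatures are hypothetical and are not the ones in this branch; the point is that the kernel is applied to all observations and all variables at once, with one bandwidth per column handled by broadcasting.

```python
import numpy as np


def gaussian_kernel(h, Xi, x):
    """Gaussian kernel K((Xi - x) / h), evaluated elementwise.

    h has shape (q,), Xi has shape (n, q), x has shape (q,).
    """
    return np.exp(-((Xi - x) ** 2) / (2.0 * h ** 2)) / np.sqrt(2.0 * np.pi)


def product_kernel_density(bw, data, point):
    """Hypothetical helper: density estimate at a single point."""
    bw = np.asarray(bw, dtype=float)
    data = np.asarray(data, dtype=float)
    k = gaussian_kernel(bw, data, np.asarray(point, dtype=float))  # (n, q)
    # Multiply kernels across variables, then average over observations.
    return k.prod(axis=1).sum() / (data.shape[0] * bw.prod())
```

Because nothing loops over columns, several variables of the same type pose no problem in this sketch.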
The way you've written the signature of this function, both arguments should be specified. So this note must be incorrect. Your TODO below (combining into one parameter) makes sense. I assume bw will then be a user-specified function (it doesn't actually say that in the description of bw) with a standard signature.
Naming a method get_x implies that x is an already-existing attribute/number. Perhaps better to name it find_bw or compute_bw or similar.
Could you indicate that this is Scott's rule of thumb? Silverman's is almost as popular I think.
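For reference, the two rules of thumb mentioned here are often quoted in the following univariate forms. This is a sketch under that assumption; the constants and the use of the interquartile range vary somewhat between references, and the function names are made up for the example.

```python
import statistics


def scott_bw(sample):
    """Rule of thumb often attributed to Scott: 1.06 * sigma * n**(-1/5)."""
    n = len(sample)
    return 1.06 * statistics.stdev(sample) * n ** (-0.2)


def silverman_bw(sample):
    """Rule of thumb often attributed to Silverman.

    Uses 0.9 * min(sigma, IQR / 1.34) * n**(-1/5), which is less
    sensitive to heavy tails than the sigma-only rule.
    """
    n = len(sample)
    q1, _, q3 = statistics.quantiles(sample, n=4)
    spread = min(statistics.stdev(sample), (q3 - q1) / 1.34)
    return 0.9 * spread * n ** (-0.2)
```

Both shrink at the same n**(-1/5) rate, so a docstring that names the rule and its constant is enough to tell them apart.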
Also, the method name is not so clear I think - could be named such that it's clear that this is a bandwidth estimate.
The methods normal_reference, cv_ml and cv_ls are all private, right? They should only be called through get_bw. So start the names with an underscore.
It would be good to explain in a few sentences what "conditional kde" actually is and give a reference. Conditional estimation is a lot less common than unconditional; unconditional is normally even left off ("kernel density estimation" refers to your UKDE class).
For keywords which can be left out, use None as default parameter. False implies that this is a boolean variable. The check for input given or not is then written as if eydat is not None.
if eydat is not None
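A small sketch of the None-default pattern suggested here. The function name and its arguments are hypothetical; only the `is None` check is the point.

```python
def evaluation_points(tdat, eydat=None):
    """Return the points to evaluate at, defaulting to the training data."""
    if eydat is None:
        # Nothing was passed, so fall back to the training data.
        return tdat
    return eydat
```

With a False default and a truthiness check, an empty list would be mistaken for "not given", and a NumPy array with more than one element would raise an error because its truth value is ambiguous. The identity check against None avoids both problems.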
This for-loop doesn't do anything. It only creates var, which isn't used below.
column_stack (one of my favorites) could be used, however,
concatenate and column_stack copy the data, AFAICS
In general it might be better to work with views, and require users to concatenate themselves.
For example, in the LOO loop, tdat is already an array (view), if there is no concatenate call then the class would have a view instead of making a copy of the data. In most models we try to use views on the original data, exog and endog, although some calculations might create copies anyway.
(We never checked systematically whether we save any memory or are faster this way.)
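The copy-versus-view distinction can be checked directly with np.shares_memory. The arrays below are made-up stand-ins for endog and exog.

```python
import numpy as np

exog = np.arange(12.0).reshape(6, 2)
endog = np.arange(6.0)

# column_stack builds a new array, so the data are copied.
stacked = np.column_stack((endog, exog))

# Basic slicing returns a view on the original buffer.
view = exog[1:4]

print(np.shares_memory(stacked, exog))  # False
print(np.shares_memory(view, exog))     # True
```

Note that selecting rows with a boolean mask or an integer index array, which is a common way to write a leave-one-out loop, also makes a copy; only basic slicing gives a view.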
Same here, for loops don't do anything.
index could be 1d (row selector), then reshape is not necessary
Implemented Suggestions by Ralf and Josef
pep8 doesn't have space for = in keyword arguments. minor issue but useful to get used to
sorry, I guess I wasn't clear before, below are many spaces in keyword =
is if bw is not None or if not bw is None
if bw is not None
if not bw is None
should this be compute_bw(self, bw=None, bwmethod=None) ? should be, if they are optional, even if one of the two is required
compute_bw(self, bw=None, bwmethod=None)
I think, if it's possible, then there should be a recommended default, Scott's or Silverman - normal_reference?
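One way the optional arguments and a recommended default could fit together is sketched below. The helper name and the tuple it returns are invented for the example, and treating "both given" as an error is an assumption rather than something decided in this thread.

```python
DEFAULT_BWMETHOD = "normal_reference"  # assumed default, per the suggestion above


def resolve_bw_options(bw=None, bwmethod=None):
    """Hypothetical helper: decide where the bandwidth comes from."""
    if bw is not None and bwmethod is not None:
        raise ValueError("specify either bw or bwmethod, not both")
    if bw is not None:
        return ("user", bw)  # user-supplied bandwidth values
    if bwmethod is None:
        bwmethod = DEFAULT_BWMETHOD  # neither given: use the default rule
    return ("method", bwmethod)
```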
Introduced default for bandwidth selection (Josef)
Removed convolution and cumulative density
least squares cross validation + changes in API - added tools module
Finished with cv_ls for continuous and ordered variables
if edat is None:
edat = self.tdat
When you're doing PEP8 fixes, make your life easier by running http://pypi.python.org/pypi/pep8 over the file(s). It will warn you when things are non-standard.
Should be if not isinstance(bw, basestring).
if not isinstance(bw, basestring)
The else clause below should probably also check that the input is a callable, like so hasattr(bw, '__call__').
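The two checks could be combined roughly as follows. This is a sketch with a hypothetical function name. It uses str because basestring exists only in Python 2, and the built-in callable() in place of hasattr(bw, '__call__').

```python
def classify_bw(bw):
    """Hypothetical helper: report how a bw argument would be interpreted."""
    if isinstance(bw, str):
        return "method name"  # e.g. the name of a selection rule
    if callable(bw):
        return "callable"     # user-supplied bandwidth function
    return "values"           # assume array-like bandwidth values
```

Checking for strings first matters, because a later branch that iterates over bw would otherwise treat a method name as a sequence of characters.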
This took me a bit of puzzling. IMSE doesn't actually calculate the integral, so the name is a bit deceptive. I guess you don't need to explicitly calculate it if you're only using it from optimize.fmin.
Did you also plan to provide other metrics, like ISE or IAE?
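For context, ISE is the integral of the squared difference between the estimated and true density, and IAE is the integral of the absolute difference. Both need the true density, so they are mainly useful in simulations. A minimal numerical sketch on a grid, with hypothetical function names and a plain trapezoidal rule:

```python
def trapezoid(ys, xs):
    """Trapezoidal rule for values ys on a grid xs."""
    return sum(0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))


def ise(f_hat, f, xs):
    """Integrated squared error between f_hat and f over the grid."""
    return trapezoid([(f_hat(x) - f(x)) ** 2 for x in xs], xs)


def iae(f_hat, f, xs):
    """Integrated absolute error between f_hat and f over the grid."""
    return trapezoid([abs(f_hat(x) - f(x)) for x in xs], xs)
```

Averaging ise over many simulated samples gives a Monte Carlo estimate of the mean integrated squared error.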
I think the purpose of the convolution kernels and how to use them needs some explanation. So far they're only used in UKDE.IMSE as far as I can tell.
TST: convert test script to actual unit tests. Add UKDE, CKDE to __in…
TST: nonparametric: convert UKDE example into a doctest.
Doctests can be run with:
from statsmodels import nonparametric
Or "nosetests --with-doctest" in the nonparametric dir.
Note that some other doctests seem to be failing at the moment.
Added lst sq cv for conditional density (slow)
Vectorized parts of IMSE in CKDE. Significantly improved performance
Added various tests. Most perform well but there are still issues wi…
ENH: add __repr__ method to UKDE.
CKDE method still to do. Also clean up some docstrings and code.
pep8 on KernelFunctions.py
pep8 on nonparametric2.py and np_tools.py
removed imse_slow and gpke2
added graphical tests for several distributions
some more changes
expanded doc strings for Generic KDE and UKDE
fixed doc strings and included latex formulas
some minor fixes + clean up
fixed test failures
added latex formulas for kernels in KernelFunctions.py
I'd reserve LaTeX for formulas that are at least somewhat complicated. This would be better written as plain text.
Probably better to describe what's different from GPKE. Anything besides the summing? I thought this was vectorized too, can't that be reused?
Not actually implemented yet (commented out below). Wouldn't it be easier to return the sorted tdat or edat than ix?
I left it inactive because it gets confusing when you have more than one variable. What should you sort by in the multivariate case?
Perhaps all of them, first on axis 0, then 1, etc.?
If you don't need code anymore, better to delete it.
This fix really needs a test for edat usage.
The old version (array_like) is actually the correct one.
I think you meant these as examples, right? Nothing is actually tested. Matplotlib is only an optional dependency of statsmodels, so you should only use it within a function or with a conditional import (i.e. within a try/except).
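A sketch of the conditional import, assuming the plots live in an example or test script. The helper name is hypothetical. Selecting the non-interactive Agg backend and writing to an in-memory buffer means nothing opens a window or blocks, which a show() call would do.

```python
import io

try:
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend; suitable for scripts and tests
    import matplotlib.pyplot as plt
    HAVE_MATPLOTLIB = True
except ImportError:
    HAVE_MATPLOTLIB = False


def plot_density_png(grid, density):
    """Hypothetical helper: return PNG bytes, or None without matplotlib."""
    if not HAVE_MATPLOTLIB:
        return None
    fig, ax = plt.subplots()
    ax.plot(grid, density)
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)  # release the figure so repeated calls do not accumulate
    return buf.getvalue()
```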
added nonparametric regression + tests for ordered case
doesn't work for me
import statsmodels.nonparametric.nonparametric2 as nparam ?
leaving plot.show() in a test hangs the test when I run it
also matplotlib import needs to be protected (try except)
pareto graph is just a line at 1e100
laplace and powerlaw ? seem to have problems close to the boundaries
one print statement is left somewhere when running the tests
tests run without failures, but are slow, 244 seconds, we need to figure out a way to shorten this or mark some as slow before merging
(These are things that need to be fixed before merging, but can be ok, or not our business, in a development branch)
some of the test cases would make nice example scripts
I think the class names need to be made longer and more descriptive. Only the most famous models are allowed to have acronyms. KDE is ok, but UKDE doesn't tell me anything.
added fast unconditional multivariate cdf and tested with continuous and mixed data
special.ndtr(x) has the cdf for standard normal, used by scipy.stats.distributions.norm
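For reference, the quantity computed by special.ndtr is the standard normal CDF, which can be written in terms of the error function. The standard-library sketch below only shows what is being computed; the function name is made up, and code that already depends on SciPy would call ndtr directly, since it works elementwise on arrays.

```python
import math


def std_normal_cdf(x):
    """Standard normal CDF: 0.5 * (1 + erf(x / sqrt(2)))."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```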
preparing the density estimators for pull request (without the regres…
fixed pep8 issues
removed reg class from nonparametric2.py
added local linear with marginal effects for mixed data
This is OK for some testing, but should be replaced by usage of StringIO before it can be merged. Writing actual files to disk shouldn't be done in tests.
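A minimal sketch of the in-memory approach. The function write_rows is a hypothetical stand-in for whatever code writes the output; io.StringIO is the Python 3 home of StringIO.

```python
import csv
import io


def write_rows(fileobj, rows):
    """Write rows as CSV to any file-like object."""
    csv.writer(fileobj).writerows(rows)


def test_write_rows_in_memory():
    buf = io.StringIO()  # behaves like an open text file, but lives in memory
    write_rows(buf, [["x", "y"], [1, 2]])
    buf.seek(0)          # rewind before reading the contents back
    assert list(csv.reader(buf)) == [["x", "y"], ["1", "2"]]
```

The code under test only needs to accept a file-like object instead of a file name for this to work.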
added marginal effects for the local linear estimator with mixed data
marginal effects for local linear estimator + tests
added right-censored regression to Reg class
implemented suggestions by Ralf
added efficient estimation (blocking) of bandwidth by breaking up large samples (still some test failures with Reg)
first attempt at sig test
MAINT: fix many small typos and inconsistencies in UKDE/CKDE classes.
MAINT: clean up nonparametric/np_tools.py and remove unused imports.
MAINT: clean up nonparametric/KernelFunctions.py
MAINT: rename KernelFunctions.py --> kernels.py
Also remove some more unused imports.
reworked api for bw blocking estimation
changed mean to median in only_bw option
moved censored reg into a separate class
merged with nonparametric-reg-block
finished merging with nonparametric-reg-block
This looks good from a quick browse. I'd call SetDefaults something more informative, perhaps EstimatorSettings?
added docstrings for the efficient bw estimator
fixed docstrings for censored class
added docstrings (still fixing some issues with writing the tests)
Merge pull request #4 from rgommers/nonparametric-density
Fixes / typos / small improvements for PR-408.
fixed minor issue with example in CKDE class
Merge branch 'nonparametric-density' of github.com:gpanterov/statsmodels into nonparametric-density
fixed issues with documentation of kernels
merged nonparametric-reg branch
merged nonparametric-reg-block branch for efficient bw estimation
merged nonparametric-censored branch for censored regression
merged nonparametric-reg-sigtests for significance tests for nonparam…
__init__.py should be empty except for test.
those imports should move into api.py
lower case names of kernels -- pep8 compliant
lower case names for kernels
added a test for censored reg
merged with nonparametric-density (lower case for kernels names for p…
Noticing now that I don't really understand the API here. The description for censor_var is
Value at which the dependent variable is censored
Now you're saying censor_var=0. What does that even mean?
Also, C3 isn't used.
You are right. It is a bit confusing. It should be censor_val, i.e. the value at which the dependent variable is censored. For example, if you only have income data up to $100,000, then your dependent variable is censored at 100K.
Renaming would be good then.
Also, please add a clear explanation of what left-censored means. Looking at the test (and not remembering left/right), I would assume 0 means only positive values are present. Your example for 100K clearly means the opposite....
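To make the two directions concrete, here is a sketch using the usual definitions: with right-censoring at c, values above c are recorded as c; with left-censoring at c, values below c are recorded as c. The function and its side argument are invented for the example and are not part of the CensoredReg API.

```python
def censor(values, censor_val, side):
    """Hypothetical helper: apply left- or right-censoring at censor_val."""
    if side == "right":
        # Values above the threshold are only known to be at least censor_val.
        return [min(v, censor_val) for v in values]
    if side == "left":
        # Values below the threshold are only known to be at most censor_val.
        return [max(v, censor_val) for v in values]
    raise ValueError("side must be 'left' or 'right'")
```

Under these definitions the income example is right-censored at 100K, while censor_val=0 with left-censoring means values below zero are recorded as 0.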
Can you also fix these, right now the nonparametric-all branch doesn't import due to syntax errors:
def __init__(self, tydat, txdat, bw, var_type fform, estimator): # missing comma
def __init__(self, tydat, txdat, var_type, reg_type, bw='cv_ls', # censor_var needs to be a kw
And a bunch more too.
Don't forget your commit messages. They may stick around for a couple of decades:)
added significance tests for discrete and continuous variables, fixed some bugs with the AIC bandwidth selection criterion, added the Semiparametric Single Index Model
added the semiparametric partially linear model
docstrings for semiparametric models and significance tests
added the loc linear to the two semiparametric models
added tests for the linear component of the semi parametric partially linear model
changed names of UKDE and CKDE in nonparametric-all
added clarification comments to the efficient bandwidth estimation part
fixed issues with the tests for functional form in TestFForm()
added an explanation of a left-censored variable to CensoredReg class
Before we again get lots of comments on this PR, could you please rebase it onto the latest statsmodels master? Almost all the comments that are here now are also present in #408.
Closing, superseded by #562.