El desc #326

Closed
wants to merge 81 commits

4 participants

@j-grana6

This should complete all empirical likelihood descriptive statistics, including tests.

@jseabold
statsmodels member

This file can be empty if you want.

@jseabold
statsmodels member

Could you make the docstrings indented at the same level? It makes the code easier to read.

@josef-pkt
statsmodels member

just a general comment

I think it's safer to get into the habit of using floating-point literals, e.g. ((self.nobs ** 2) - 1.) / (self.nobs - 3.), just to reduce the possibility of integer division errors, even if nobs is defined as float. (This avoids having to check whether nobs is a float, and is robust to it being overwritten as an integer.)
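A minimal sketch (not from the PR) of the Python 2 pitfall being guarded against; the values are only illustrative:

# On Python 2, int / int truncates silently; float literals avoid this.
nobs = 10                                 # suppose nobs gets overwritten as an int
print (nobs ** 2 - 1) / (nobs - 3)        # -> 14      (silently truncated)
print ((nobs ** 2) - 1.) / (nobs - 3.)    # -> 14.142857...  (intended)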

@josef-pkt josef-pkt and 1 other commented on an outdated diff Jun 21, 2012
statsmodels/el/el.py
+ self.nobs = float(endog.shape[0])
+ self.weights = np.ones(self.nobs) / float(self.nobs)
+ if endog.ndim == 1:
+ self.endog = self.endog.reshape(self.nobs, 1)
+ # For now, self. weights should always be a vector of 1's and a
+ # variable "new weights" should be created everytime weights are
+ # changed.
+
+
+class OptFuncts(ElModel):
+ """
+
+
+ A class that holds functions that are optimized/solved. Not
+ intended for the end user.
+ """
@josef-pkt
statsmodels member
josef-pkt added a line comment Jun 21, 2012

even functions and classes that are not intended for the end user should get full docstrings, at least Parameters and Returns

@josef-pkt
statsmodels member
josef-pkt added a line comment Jun 21, 2012

given the not really informative names j, y, star, ... (without the book at hand), a basic summary of the names used in the class would be very helpful. It could be added to the docstring as notes on what's going on.

@j-grana6
j-grana6 added a line comment Jun 21, 2012

I tried my best to explain J, y and star in each of the functions without going too much into the theory. The optimization is done in a roundabout way since we solve for the Lagrange multipliers and then compute the weights as a function of a Lagrange multiplier. That might be why the names of the variables and the intuition are not immediately clear. I did, however, try to replicate Owen's notation as closely as possible.
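For orientation, the relationship being described shows up concretely later in the diff, where the weights are recovered from the Lagrange multiplier eta_star rather than optimized directly (quoted from the diff further down):

new_weights = (1. / nobs) * \
    1. / (1. + eta_star * (endog - self.mu0))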

@josef-pkt josef-pkt and 2 others commented on an outdated diff Jun 21, 2012
statsmodels/el/el.py
+ super(OptFuncts, self).__init__(endog)
+
+ def get_j_y(self, eta1):
+ """
+ Calculates J and y via the log*' and log*''.
+
+ Maximizing log* is done via sequential regression of
+ y on J.
+
+ See Owen pg. 63
+
+ """
+
+ data = np.copy(self.est_vect.T)
+ data_star_prime = np.copy((1 + np.dot(eta1, data)))
+ data_star_doub_prime = np.copy((1 + np.dot(eta1, data)))
@josef-pkt
statsmodels member
josef-pkt added a line comment Jun 21, 2012

my guess is the copy is redundant here; it will copy twice: 1 + ... already creates a new temporary array
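A minimal sketch (toy values) of why the explicit copy is redundant:

import numpy as np
a = np.arange(3.)
b = 1 + a      # the arithmetic already allocates a fresh array
b[0] = 99.
print a        # [ 0.  1.  2.] -- a is untouched, so no np.copy is needed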

@j-grana6
j-grana6 added a line comment Jun 21, 2012

Deleted

@jseabold
statsmodels member
jseabold added a line comment Jul 25, 2012

Is the code that's on github the most recent? I still see these and just made a similar comment.

@j-grana6
j-grana6 added a line comment Jul 25, 2012
@josef-pkt josef-pkt commented on an outdated diff Jun 21, 2012
statsmodels/el/el.py
+ data_star_prime = data_star_prime.reshape(self.nobs, 1)
+ data_star_doub_prime = data_star_doub_prime.reshape(self.nobs, 1)
+ J = ((- 1 * data_star_doub_prime) ** .5) * self.est_vect
+ y = data_star_prime / ((- 1 * data_star_doub_prime) ** .5)
+ return J, y
+
+ def modif_newton(self, x0):
+ """
+ Modified Newton's method for maximizing the log* equation.
+
+ See Owen pg. 64
+
+ """
+ params = x0.reshape(1, self.est_vect.shape[1])
+ diff = 1
+ while diff > 10 ** (-10):
@josef-pkt
statsmodels member
josef-pkt added a line comment Jun 21, 2012

needs a maxiter to avoid possible endless loops

Do we need a Newton maximizer here, or can we reuse some other ones? It would be more economical to invest in it only once.
For example, LikelihoodModel's newton has problems when the step size is too large.
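A minimal sketch of the kind of maxiter guard being suggested (scalar case; the function name and defaults are hypothetical):

def newton_with_maxiter(f, fprime, x0, tol=1e-10, maxiter=100):
    # bounded Newton iteration: raise instead of looping forever
    x = x0
    for _ in range(maxiter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError('no convergence in %d iterations' % maxiter)

# root of x**2 - 2:
print newton_with_maxiter(lambda x: x ** 2 - 2., lambda x: 2. * x, 1.)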

@josef-pkt josef-pkt commented on an outdated diff Jun 21, 2012
statsmodels/el/el.py
+import numpy as np
+from scipy import optimize
+from scipy.stats import chi2, skew, kurtosis
+from matplotlib import pyplot as plt
+import itertools
+
+
+class ElModel(object):
+ """
+
+
+ Initializes data for empirical likelihood. Not intended for end user
+ """
+
+ def __init__(self, endog):
+ self.endog = endog
@josef-pkt
statsmodels member
josef-pkt added a line comment Jun 21, 2012

are there general shape requirements for endog? Does it need some basic conversion code: asarray, atleast_2d?

@josef-pkt josef-pkt and 1 other commented on an outdated diff Jun 21, 2012
statsmodels/el/el.py
+ gamma_star_u = optimize.brentq(self.find_gamma, \
+ max(self.endog) + epsilon, gamma_high)
+ weights_low = ((self.endog - gamma_star_l) ** -1) / \
+ np.sum((self.endog - gamma_star_l) ** -1)
+ weights_high = ((self.endog - gamma_star_u) ** -1) / \
+ np.sum((self.endog - gamma_star_u) ** -1)
+ mu_low = np.sum(weights_low * self.endog)
+ mu_high = np.sum(weights_high * self.endog)
+ return mu_low, mu_high
+
+ if method == 'bisect':
+ self.r0 = chi2.ppf(sig, 1)
+ self.mu_high = self.endog.mean()
+ mu_hmax = max(self.endog)
+ while abs(self.hy_test_mean(self.mu_high)[1]
+ - self.r0) > tol:
@josef-pkt
statsmodels member
josef-pkt added a line comment Jun 21, 2012

does this need a safety maxiter, or checks that the problem is well defined?

@j-grana6
j-grana6 added a line comment Jun 21, 2012

I don't believe so. The LLR for the mean is monotonic between the mean and the maximum value of the data, so the algorithm should converge in all cases. Nevertheless, I was thinking about deleting this altogether since the gamma method works fine, and even if it didn't, Brent's method is much faster anyway.

@josef-pkt josef-pkt commented on an outdated diff Jun 21, 2012
statsmodels/el/el.py
+ -------
+ random_numbers = np.random.standard_normal(100)
+ el_analysis = el.DescStat(random_numbers)
+ # Initialize El
+ el_analysis.ci_var()
+ >>>f(a) and f(b) must have different signs
+ el_analysis.ci_var(.5, 2)
+ # Searches for confidence limits where the lower limit > .5
+ # and the upper limit <2.
+
+ Troubleshooting Tips
+ --------------------
+
+ If the function returns the error f(a) and f(b) must have
+ different signs, consider lowering lower_bound and raising
+ upper_bound.
@josef-pkt
statsmodels member
josef-pkt added a line comment Jun 21, 2012

scipy just got an improvement in a similar case, where the bounds are automatically shifted if they don't have different signs

@josef-pkt josef-pkt and 1 other commented on an outdated diff Jun 21, 2012
statsmodels/el/el.py
+ p_val = 1 - chi2.cdf(-2 * llr, mu_array.shape[1])
+ if print_weights:
+ return p_val, -2 * llr, self.new_weights.T
+ else:
+ return p_val, -2 * llr
+
+ def mv_mean_contour(self, mu1_l, mu1_u, mu2_l, mu2_u, step1, step2,
+ levs=[.2, .1, .05, .01, .001], plot_dta=False):
+ """
+
+ Creates confidence region plot for the mean of bivariate data
+
+ Parameters
+ ----------
+
+ m1_l: Minimum value of the mean for variable 1
@josef-pkt
statsmodels member
josef-pkt added a line comment Jun 21, 2012

check compliance with numpy doc format

m1_l: float
    Minimum value of the mean for variable 1
@j-grana6
j-grana6 added a line comment Jun 21, 2012

I changed to this format for all functions

@josef-pkt josef-pkt and 1 other commented on an outdated diff Jun 21, 2012
statsmodels/el/el.py
+ """
+
+ Returns the p_value and -2 * log_likelihood for the hypothesized
+ kurtosis.
+
+ Parameters
+ ----------
+ kurt0: kurtosis value to be tested
+
+ Optional
+ --------
+
+ mu_min, mu_max, var_min, var_max: Minimum and maximum values
+ of the nuisance parameters to be optimized over. If None,
+ the function computes the 95% confidence interval for
+ the mean and variance and uses the resulting values.
@josef-pkt
statsmodels member
josef-pkt added a line comment Jun 21, 2012

statistics question: does this work for any distribution? (with some restrictions like finite moments)

I was once looking for statistical tests for skew and kurtosis, but found essentially only normally distributed examples, not other distributions.

@j-grana6
j-grana6 added a line comment Jun 21, 2012

Yes, it does. In fact, in the introduction to Owen's Empirical Likelihood, he plots the confidence region for the skewness and kurtosis of the number of segments on an earthworm, which he shows is clearly not normally distributed.

@josef-pkt josef-pkt and 1 other commented on an outdated diff Jun 21, 2012
statsmodels/el/el.py
+ bounds=[(mu1_lb, mu1_ub),
+ (var1_lb, var1_ub),
+ (mu2_lb, mu2_ub),
+ (var2_lb, var2_ub)])[1]
+ p_val = 1 - chi2.cdf(llr, 1)
+ if print_weights:
+ return p_val, llr, self.new_weights.T
+ return p_val, llr
+
+ def ci_corr(self, sig=.05, upper_bound=None, lower_bound=None,
+ mu1_min=None,
+ mu1_max=None, mu2_min=None, mu2_max=None,
+ var1_min=None, var1_max=None,
+ var2_min=None, var2_max=None):
+ """
+
@josef-pkt
statsmodels member
josef-pkt added a line comment Jun 21, 2012

endog in the class docstring said 1d, I guess this is an exception and requires 2d

@j-grana6
j-grana6 added a line comment Jun 21, 2012

See comment above. Again, I think putting the required shape of endog in each of the functions is the best bet.

@josef-pkt josef-pkt and 1 other commented on an outdated diff Jun 21, 2012
statsmodels/el/el.py
+ ul = min(.999, point_est + \
+ 2.5 * ((1. - point_est ** 2.) / \
+ (self.nobs - 2.)) ** .5)
+
+ if lower_bound is not None:
+ ll = lower_bound
+ else:
+ ll = max(- .999, point_est - \
+ 2.5 * (((1. - point_est ** 2.) / \
+ (self.nobs - 2.)) ** .5))
+
+ if mu1_min is not None:
+ self.mu1_lb = mu1_min
+ else:
+ self.mu1_lb = self.endog[:, 0].mean() - ((1.96 * \
+ (self.endog[:, 0].var()) / self.nobs)) ** .5
@josef-pkt
statsmodels member
josef-pkt added a line comment Jun 21, 2012

Using sqrt would be more readable, I think. And, I guess, computationally more efficient, which might not matter in this case.

@j-grana6
j-grana6 added a line comment Jun 21, 2012

Done

@josef-pkt
statsmodels member

el is not a good directory or module name: too short and uninformative.
emplike for the directory?
And the full empirical_likelihood for the module? Maybe not; it's too generic since we are already in the emplike directory. Maybe just descriptive.py, if there are other empirical likelihood models coming.

j-grana6 added some commits Jun 21, 2012
@j-grana6
@rgommers rgommers commented on an outdated diff Jun 21, 2012
statsmodels/el/el.py
+
+ Also, it contains the creating of self.est_vect (short for estimating
+ equations vector). That then gets read by the log-star equations.
+
+ Parameter
+ --------
+ nuisance_mu: float
+ Value of a nuisance mean parameter.
+
+ Returns
+ -------
+
+ Log likelihood of a pre-specified variance holding the nuisance
+ parameter constant.
+
+ Not intended for end user.
@rgommers
statsmodels member
rgommers added a line comment Jun 21, 2012

Each time you write "not intended for end user" you should start the method name with an underscore. That's the standard way to indicate that a function is private, so it's clearer and you don't have to write this line in the docs.

@rgommers rgommers commented on an outdated diff Jun 21, 2012
statsmodels/el/el.py
+
+ sig_data = ((self.endog - nuisance_mu) ** 2 \
+ - self.sig2_0)
+ mu_data = (self.endog - nuisance_mu)
+ self.est_vect = np.concatenate((mu_data, sig_data), axis=1)
+ eta_star = self.modif_newton(np.array([1 / self.nobs,
+ 1 / self.nobs]))
+
+ denom = 1 + np.dot(eta_star, self.est_vect.T)
+ self.new_weights = 1 / self.nobs * 1 / denom
+ llr = np.sum(np.log(self.nobs * self.new_weights))
+ if pval: # Used for contour plotting
+ return 1 - chi2.cdf(-2 * llr, 1)
+ return -2 * llr
+
+ def ci_limits_var(self, var_test):
@rgommers
statsmodels member
rgommers added a line comment Jun 21, 2012

2 spaces after def (1 too many)

@rgommers rgommers and 1 other commented on an outdated diff Jun 21, 2012
statsmodels/el/el.py
+ variable.
+
+ gamma_low: lower bound for gamma when finding lower limit.
+ If function returns f(a) and f(b) must have different signs,
+ consider lowering gamma_low. default | gamma_low =-(10**10)
+
+ gamma_high: upper bound for gamma when finding upper limit.
+ If function returns f(a) and f(b) must have different signs,
+ consider raising gamma_high. default |gamma_high=10**10
+
+ epsilon: When using 'nested-brent', amount to decrease (increase)
+ from the maximum (minimum) of the data when
+ starting the search. This is to protect against the
+ likelihood ratio being zero at the maximum (minimum)
+ value of the data. If data is very small in absolute value
+ (<10 ** -6) consider shrinking epsilon
@rgommers
statsmodels member
rgommers added a line comment Jun 21, 2012

When you use symbols like * or | in a docstring, you should surround them with double backticks so they're rendered by Sphinx as-is. Otherwise it will treat those symbols as special characters (** is bold, I think) and you'll get rendering errors.

@j-grana6
j-grana6 added a line comment Jun 21, 2012
@rgommers
statsmodels member
rgommers added a line comment Jun 21, 2012

It's not too difficult, and small issues like the above can always be corrected later (they only influence the doc build, not using statsmodels). A good primer on reStructuredText markup is http://sphinx.pocoo.org/rest.html, and docstrings are written according to https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt.

@j-grana6
j-grana6 added a line comment Jun 21, 2012
@rgommers rgommers commented on an outdated diff Jun 21, 2012
statsmodels/el/el.py
+ Not intended for end user.
+
+ """
+ return self.hy_test_corr(corr0, nuis0=None, mu1_min=self.mu1_lb,
+ mu1_max=self.mu1_ub, mu2_min=self.mu2_lb,
+ mu2_max=self.mu2_ub,
+ var1_min=self.var1_lb, var1_max=self.var1_ub,
+ var2_min=self.var2_lb, var2_max=self.var2_ub,
+ print_weights=0)[1] - self.r0
+
+
+class DescStat(OptFuncts):
+ """
+
+
+ A class for confidence intervals and hypothesis tests invovling mean,
@rgommers
statsmodels member
rgommers added a line comment Jun 21, 2012

Typo: invovling

@rgommers rgommers and 2 others commented on an outdated diff Jun 21, 2012
statsmodels/el/el.py
+
+ Difference between log-likelihood of corr0 and a pre-specified
+ value.
+
+ Not intended for end user.
+
+ """
+ return self.hy_test_corr(corr0, nuis0=None, mu1_min=self.mu1_lb,
+ mu1_max=self.mu1_ub, mu2_min=self.mu2_lb,
+ mu2_max=self.mu2_ub,
+ var1_min=self.var1_lb, var1_max=self.var1_ub,
+ var2_min=self.var2_lb, var2_max=self.var2_ub,
+ print_weights=0)[1] - self.r0
+
+
+class DescStat(OptFuncts):
@rgommers
statsmodels member
rgommers added a line comment Jun 21, 2012

This is the main class, right? At least it's the only one that's not noted as "not intended for the end user". Then here I'd expect a full description of what empirical likelihood estimation does and how to use it.

@j-grana6
j-grana6 added a line comment Jun 21, 2012
@jseabold
statsmodels member
jseabold added a line comment Jul 25, 2012

A usage / brief tutorial-style example would be greatly helpful, i.e., a standalone file. I don't have Owen's book because I can't afford it right now ;), and I'm not really sure what I'm supposed to do with this. Is the intention of this class just to be a univariate / multivariate location and scale estimator based on optimizing the empirical likelihood? If that's the case, I might suggest some changes. Otherwise I'm going to keep posting some comments on just the code.

Are there any papers that cover this outside of the book? I don't need all the details.

@j-grana6
j-grana6 added a line comment Jul 25, 2012
@jseabold
statsmodels member
jseabold added a line comment Jul 25, 2012

Just like our other examples. Something to kind of explain what's going on and show how (and why) you'd use it. Doesn't have to be anything crazy. Something like the cancer example in Owen (1991) would work.

@j-grana6
j-grana6 added a line comment Jul 25, 2012
@rgommers
statsmodels member

Looks good overall from a quick browse around.

@josef-pkt
statsmodels member

To Ralf's and my request for more documentation:

I think the main user access will be through the methods, so the docstrings on the individual methods for mean, var, skew, ... should be the most explicit, as they currently are.

The docstrings for the classes don't have to repeat everything, but should outline the main structure and how to use it, and use a ``See Also`` pointing to the more explicit docstrings for the details.

That's my impression not having run an example yet.

@j-grana6 j-grana6 DOC: Fixed documentation to comply with numpy documentation. Addressed issues with Sphinx. Also changed the name of the directory and filenames for empirical likelihood
9659c52
@josef-pkt
statsmodels member

My guess is that you just disconnected all the history of the file.
File renames or moving of code should always be committed as a separate changeset, not mixed with additional edits.

@j-grana6
@josef-pkt
statsmodels member

I just mean that when you rename a file, you don't make any other changes and commit it immediately.

I don't have time right now (still bug hunting), but what I usually do is check whether gitk or git gui's file browser and blame still show the history for individual lines.

@rgommers
statsmodels member

Josef, nothing was renamed, Justin added everything as a duplicate.

Justin, now you have everything twice. I guess now the easiest way is to delete el.py, but that indeed means that you lose comments etc. because git(hub) can't track them anymore. A better way to rename would have been

$ git mv el.py emplike.py
$ # make minimal changes to make things work (replace ``import el`` with ``import emplike``)
$ git commit
$ # now make other commits
$ git commit

Git is smart enough to follow code across files, but if you duplicate it then history (and inline github comments) get lost.

@josef-pkt
statsmodels member

(I haven't gotten used to the new Show Diff Stats button yet, to see what files are in the diff.)

Since the old files still exist, it's still easy to do a "proper" renaming:

delete the new emplike directory and files (make a backup if there are other changes that are not in "el")
commit (same as reverting the copy)

rename directory el -> emplike (I use Windows Explorer; Ralf, does "git mv" create information for git about a rename?)
commit
rename el.py -> descriptive.py
rename test_el -> test_descriptive.py
any other rename?
commit

fix up the imports and other code to correspond to the new directory and module names
commit

In contrast to Ralf, I usually do the fix-up after a move in a separate commit, and use several commits for a rename/move to be on the safe side.

@rgommers
statsmodels member

"git mv" is actually just a convenience function: http://stackoverflow.com/questions/1094269/whats-the-purpose-of-git-mv. I thought it did help if you mix renames and code changes, but I was wrong.

@jseabold
statsmodels member

I'm going to try to ping @jseabold and see if it enables notifications for me. I missed this whole discussion.

@josef-pkt

pep8: no spaces in class names

@j-grana6

I am debating creating a class for the multivariate functions only. This will make it easier to write the documentation and easier to expand on in the future. Any objections to this?

statsmodels member

Yes, I think it makes sense to have two classes, one for univariate and one for multivariate. You shouldn't have methods in a class that only work some of the time.

@jseabold jseabold commented on an outdated diff Jul 25, 2012
statsmodels/emplike/descriptive.py
+
+TODO: Write functions for estimationg equations instead
+ of generating the data every time a hypothesis
+ test is called.
+
+
+"""
+import numpy as np
+from scipy import optimize
+from scipy.stats import chi2, skew, kurtosis
+from matplotlib import pyplot as plt
+from statsmodels.base.model import _fit_mle_newton
+import itertools
+
+
+class ElModel(object):
@jseabold
statsmodels member
jseabold added a line comment Jul 25, 2012

I think ELModel might be better here. Lower-case L's can be hard to read.

@jseabold jseabold commented on an outdated diff Jul 25, 2012
statsmodels/emplike/descriptive.py
+
+ Initializes data for empirical likelihood. Not intended for end user
+ """
+
+ def __init__(self, endog):
+ self.endog = endog
+ self.nobs = float(endog.shape[0])
+ self.weights = np.ones(self.nobs) / float(self.nobs)
+ if endog.ndim == 1:
+ self.endog = self.endog.reshape(self.nobs, 1)
+ # For now, self. weights should always be a vector of 1's and a
+ # variable "new weights" should be created everytime weights are
+ # changed.
+
+
+class OptFuncts(ElModel):
@jseabold
statsmodels member
jseabold added a line comment Jul 25, 2012

Not crucial, but it's customary to put a _ in front of classes that aren't meant to be public. Sometimes I don't do this because it's not really cut-and-dried between public/private.

@jseabold jseabold commented on an outdated diff Jul 25, 2012
statsmodels/emplike/descriptive.py
+
+ Returns
+ -------
+ J: n x m matrix
+ J'J is the hessian for optimizing
+
+ y: n x 1 array
+ -J'y is the gradient for maximizing
+
+
+ See Owen pg. 63
+
+ """
+ data = np.copy(self.est_vect.T)
+ data_star_prime = (1 + np.dot(eta1, data))
+ data_star_doub_prime = np.copy((1 + np.dot(eta1, data)))
@jseabold
statsmodels member
jseabold added a line comment Jul 25, 2012

Don't need a copy here. [Edit: I don't think you need either copy?]

@jseabold jseabold commented on an outdated diff Jul 25, 2012
statsmodels/emplike/test/__init__.py
@@ -0,0 +1 @@
@jseabold
statsmodels member
jseabold added a line comment Jul 25, 2012

Can you rename the directory to tests/ for consistency? You'll need to use git mv and commit the move so as not to lose history.

@jseabold jseabold commented on an outdated diff Jul 25, 2012
statsmodels/emplike/test/__init__.py
@@ -0,0 +1 @@
+#init file
@jseabold
statsmodels member
jseabold added a line comment Jul 25, 2012

FYI, these can be empty, but it's not that important.

@jseabold jseabold commented on an outdated diff Jul 25, 2012
statsmodels/emplike/descriptive.py
+ test is called.
+
+
+"""
+import numpy as np
+from scipy import optimize
+from scipy.stats import chi2, skew, kurtosis
+from matplotlib import pyplot as plt
+from statsmodels.base.model import _fit_mle_newton
+import itertools
+
+
+class ElModel(object):
+ """
+
+
@jseabold
statsmodels member
jseabold added a line comment Jul 25, 2012

Can you clean up all the whitespace at the beginning of the docstrings? The docs should start either on the line with """ or on the next. I don't know if Sphinx/numpydoc will strip it, but I doubt it.

@jseabold
statsmodels member
jseabold added a line comment Jul 25, 2012

Also there's whitespace at the end of docstrings.

@jseabold
statsmodels member

It looks like the comments are now disconnected from the code, so I added some duplicate comments, though they are still valid in the current code.

@jseabold jseabold commented on an outdated diff Jul 25, 2012
statsmodels/emplike/descriptive.py
+ sig2_data = ((self.endog[:, 1] - nuis_params[2]) ** 2) -\
+ nuis_params[3]
+ correl_data = ((self.endog[:, 0] - nuis_params[0]) * \
+ (self.endog[:, 1] - nuis_params[2])) - \
+ (self.corr0 * (nuis_params[1] ** .5) \
+ * (nuis_params[3] ** .5))
+ self.est_vect = np.concatenate((self.mu1_data, sig1_data,
+ mu2_data,
+ sig2_data, correl_data),
+ axis=1).reshape(5, self.nobs)
+ self.est_vect = self.est_vect.T
+ eta_star = self._modif_newton(np.array([1 / self.nobs,
+ 1 / self.nobs,
+ 1 / self.nobs,
+ 1 / self.nobs,
+ 1 / self.nobs]))
@jseabold
statsmodels member
jseabold added a line comment Jul 25, 2012

If you're going to be using attributes a lot in a method, it's probably best to go ahead and assign them to a variable. I think it's ever so slightly more performant, but really I just think it looks better. So

endog = self.endog
nobs = self.nobs

Though some people might disagree that it's best practice, we try to stick to this.

@jseabold jseabold and 1 other commented on an outdated diff Jul 25, 2012
statsmodels/emplike/descriptive.py
+ Value of a nuisance mean parameter.
+
+ Returns
+ -------
+
+ llr: float:
+ Log likelihood of a pre-specified variance holding the nuisance
+ parameter constant.
+
+
+ """
+
+ sig_data = ((self.endog - nuisance_mu) ** 2 \
+ - self.sig2_0)
+ mu_data = (self.endog - nuisance_mu)
+ self.est_vect = np.concatenate((mu_data, sig_data), axis=1)
@jseabold
statsmodels member
jseabold added a line comment Jul 25, 2012

This is updated during the optimization, correct? I think it should be passed around then, rather than attached and read. It's a bit confusing to follow where things are getting changed and what's a set attribute.

@j-grana6
j-grana6 added a line comment Jul 25, 2012
@jseabold
statsmodels member
jseabold added a line comment Jul 25, 2012

I'm not really sure because I don't follow the code well enough yet. If it's a derived estimate then it should just be changed in the objective function and then attached after fitting. I just don't like that there's an attribute set in a method that's changed by other methods. When is it what? It should only be attached when it's set. If it's something that the optimizer is changing then pass it in. If it's derived, attach it after it's done deriving.

Does that make sense?

@j-grana6
j-grana6 added a line comment Jul 25, 2012
@jseabold jseabold commented on an outdated diff Jul 25, 2012
statsmodels/emplike/descriptive.py
+ data_star_doub_prime = np.copy((1 + np.dot(eta1, data)))
+ for elem in range(int(self.nobs)):
+ if data_star_prime[0, elem] <= 1 / self.nobs:
+ data_star_prime[0, elem] = 2 * self.nobs - \
+ (self.nobs) ** 2 * data_star_prime[0, elem]
+ else:
+ data_star_prime[0, elem] = 1. / data_star_prime[0, elem]
+ if data_star_doub_prime[0, elem] <= 1 / self.nobs:
+ data_star_doub_prime[0, elem] = - self.nobs ** 2
+ else:
+ data_star_doub_prime[0, elem] = \
+ - (data_star_doub_prime[0, elem]) ** -2
+ data_star_prime = data_star_prime.reshape(self.nobs, 1)
+ data_star_doub_prime = data_star_doub_prime.reshape(self.nobs, 1)
+ J = ((- 1 * data_star_doub_prime) ** .5) * self.est_vect
+ y = data_star_prime / ((- 1 * data_star_doub_prime) ** .5)
@jseabold
statsmodels member
jseabold added a line comment Jul 25, 2012

Things like this can be vectorized to avoid these loops. E.g., with abbreviated names, but you'll probably want to keep yours.

nobs = self.nobs
dsp = (1 + np.dot(eta1, data)).T
dspp = dsp.copy()
idx = dsp <= 1. / nobs
not_idx = ~idx
dsp[idx] = 2. * nobs - nobs ** 2 * dsp[idx]
dsp[not_idx] = 1. / dsp[not_idx]
dspp[idx] = -nobs ** 2
dspp[not_idx] = -dspp[not_idx] ** -2
root_prime = np.sqrt(-1 * dspp)
JJ = root_prime * self.est_vect
yy = dsp / root_prime

Two questions.

Why are you using row vectors and then transposing? Does it make sense just to have everything as column vectors so you're not having to transpose or reshape all the time?

I assume you're going to want to use these elsewhere. Should log_star, log_star_prime, and log_star_prime_prime just be functions instead of methods?

@jseabold
statsmodels member
jseabold added a line comment Jul 25, 2012

E.g., my general log_star function that I think I sent you

def log_star(z, eps):
    log_idx = z > eps
    loggedarr = np.zeros_like(z)
    z_ok = z[log_idx]
    z_star = z[~log_idx]
    loggedarr[log_idx] = np.log(z_ok)
    loggedarr[~log_idx] = np.log(eps) - 1.5 + 2*z_star/eps - \
            z_star**2/(2*eps**2)
    return loggedarr
@jseabold jseabold commented on an outdated diff Jul 25, 2012
statsmodels/emplike/descriptive.py
+ for elem in range(int(self.nobs)):
+ if data_star_prime[0, elem] <= 1 / self.nobs:
+ data_star_prime[0, elem] = 2 * self.nobs - \
+ (self.nobs) ** 2 * data_star_prime[0, elem]
+ else:
+ data_star_prime[0, elem] = 1. / data_star_prime[0, elem]
+ if data_star_doub_prime[0, elem] <= 1 / self.nobs:
+ data_star_doub_prime[0, elem] = - self.nobs ** 2
+ else:
+ data_star_doub_prime[0, elem] = \
+ - (data_star_doub_prime[0, elem]) ** -2
+ data_star_prime = data_star_prime.reshape(self.nobs, 1)
+ data_star_doub_prime = data_star_doub_prime.reshape(self.nobs, 1)
+ J = ((- 1 * data_star_doub_prime) ** .5) * self.est_vect
+ y = data_star_prime / ((- 1 * data_star_doub_prime) ** .5)
+ return np.mat(J), np.mat(y)
@jseabold
statsmodels member
jseabold added a line comment Jul 25, 2012

Is there a reason to return matrices instead of arrays? We don't really use matrices anywhere. This could be very confusing.

@jseabold jseabold and 1 other commented on an outdated diff Jul 25, 2012
statsmodels/emplike/descriptive.py
+ else:
+ self.var_l = var_lims[0]
+ if var_max is not None:
+ self.var_u = var_max
+ else:
+ self.var_u = var_lims[1]
+ if mu_min is not None:
+ self.mu_l = mu_min
+ else:
+ self.mu_l = mu_lims[0]
+ if mu_max is not None:
+ self.mu_u = mu_max
+ else:
+ self.mu_u = mu_lims[1]
+ self.r0 = chi2.ppf(1 - sig, 1)
+ print 'Finding the lower bound for skewness'
@jseabold
statsmodels member
jseabold added a line comment Jul 25, 2012

Need to get rid of the print statements.

@j-grana6
j-grana6 added a line comment Jul 25, 2012
@jseabold
statsmodels member
jseabold added a line comment Jul 25, 2012

Yeah. Print statements are not the way to go I don't think. Once we're done with some cleanup and I have my bearings a bit more I'll see if we can't figure out what's going on.

@jseabold
statsmodels member
jseabold added a line comment Jul 25, 2012

If you can post to the ML (or here) a full working example that hangs I'll have a look.

@j-grana6
j-grana6 added a line comment Jul 25, 2012
@jseabold
statsmodels member

I'm getting several errors trying to run the tests for test_descriptive. Some are related to my numpy version and will need to be fixed; others look to be bugs.

@jseabold jseabold commented on an outdated diff Jul 25, 2012
statsmodels/emplike/descriptive.py
+ """
+
+ self.mu1_data = (self.endog[:, 0] - nuis_params[0])
+ sig1_data = ((self.endog[:, 0] - nuis_params[0]) ** 2) - \
+ nuis_params[1]
+ mu2_data = self.endog[:, 1] - nuis_params[2]
+ sig2_data = ((self.endog[:, 1] - nuis_params[2]) ** 2) -\
+ nuis_params[3]
+ correl_data = ((self.endog[:, 0] - nuis_params[0]) * \
+ (self.endog[:, 1] - nuis_params[2])) - \
+ (self.corr0 * (nuis_params[1] ** .5) \
+ * (nuis_params[3] ** .5))
+ self.est_vect = np.concatenate((self.mu1_data, sig1_data,
+ mu2_data,
+ sig2_data, correl_data),
+ axis=1).reshape(5, self.nobs)
@jseabold
statsmodels member
jseabold added a line comment Jul 25, 2012

One of the test failures is coming from here. I think you just want vstack instead of concatenate + reshape. (That is if you still want to keep row vectors, which I suspect you might not.)

@j-grana6
@jseabold
statsmodels member

Then I don't think we're testing the same file / branch. Are you sure what's on github is recent? For example

line 863 in descriptive.py has

mu_array = mu_array.reshape(1, self.endog[1])

Surely this is a bug?

@j-grana6
@j-grana6
@jseabold
statsmodels member

Ok, then I'm a bit worried about the test coverage. There's no way that something like

some_array.reshape(1, array([ 16.08324, 59.50397]))

should lead to correct behavior, unless I'm missing something.

@j-grana6
@j-grana6
@jseabold
statsmodels member

Not as far as I'm concerned :). Must be a bug in numpy that's been tightened up in the recent code.

@jseabold
statsmodels member

By the way, there are two ways to accomplish what you're doing that might be a bit easier / more efficient.

x.reshape(1, -1)

-1 will infer the value for the other shape. Might be faster than looking up self.nobs.

Or, I usually do

x[None, :]

This adds an extra leading dimension. Or

X[:, None]

Adds an extra trailing dimension. But do whatever you prefer, then I'll have a look at the shaping stuff broadly. I doubt we want to do all of this reshaping in the first place if we can avoid it.

@j-grana6
@jseabold
statsmodels member

No, I wouldn't. There are a few bugs and warts in recent numpy. Things that work on the bleeding edge should work there as well. I'm going to look into the copy, but I suspect something else is going on other than the copy.

@j-grana6
@jseabold
statsmodels member

Just the one I mentioned about vstack vs concatenate. That breaks for me.

np.concatenate(([1,2,3], [1,2,3]), axis=1)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-9-aef62b18b305> in <module>()
----> 1 np.concatenate(([1,2,3], [1,2,3]), axis=1)

IndexError: axis 1 out of bounds [0, 1)

I suspect you want

np.vstack(([1,2,3], [1,2,3]))

Which lets you get rid of the reshape.

@j-grana6
@jseabold
statsmodels member

Yeah, that's odd, because neither of those lists has an axis 1 to concatenate on.

@jseabold
statsmodels member

Ok, the tests pass now, but I'm concerned about speed. Can you pull out and vectorize the log_star stuff? Then I'll see if there's anything else that can be done.

@j-grana6
@j-grana6
@j-grana6
@jseabold
statsmodels member

Tests went from 120s to 20s. Behold the power of vectorization. I'm going to see if there are any other obvious sticking points.

@jseabold
statsmodels member

Thanks, this is very helpful.

@jseabold jseabold commented on an outdated diff Jul 26, 2012
statsmodels/emplike/descriptive.py
+ endog = self.endog
+ nobs = self.nobs
+ # if trans_data is None:
+ # self.data = self.endog
+ # else:
+ # self.data = trans_data
+ eta_min = (1. - (1. / nobs)) / (self.mu0 - max(endog))
+ eta_max = (1. - (1. / nobs)) / (self.mu0 - min(endog))
+ eta_star = optimize.brentq(self._find_eta, eta_min, eta_max)
+ new_weights = (1. / nobs) * \
+ 1. / (1. + eta_star * (endog - self.mu0))
+ llr = -2 * np.sum(np.log(nobs * new_weights))
+ if print_weights:
+ return 1 - chi2.cdf(llr, 1), llr, new_weights
+ else:
+ return 1 - chi2.cdf(llr, 1), llr
@jseabold
statsmodels member
jseabold added a line comment Jul 26, 2012

You can use the survival function instead of 1 - chi2.cdf. So chi2.sf(llr, 1). Sometimes it will be more accurate.

It's also a bit of an unwritten rule to return test statistic, p-value rather than the other way around.
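A minimal sketch of the accuracy difference (the llr value is only illustrative):

from scipy.stats import chi2

llr = 80.
print 1 - chi2.cdf(llr, 1)  # 0.0 -- the subtraction underflows to zero
print chi2.sf(llr, 1)       # ~3.8e-19 -- the survival function keeps the tail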

@jseabold jseabold commented on an outdated diff Jul 26, 2012
statsmodels/emplike/descriptive.py
+
+ if var_max is not None:
+ var_ub = var_max
+ else:
+ var_ub = var_ci[1]
+
+ llr = optimize.fmin_l_bfgs_b(self._opt_skew, start_nuisance,
+ approx_grad=1,
+ bounds=[(mu_lb, mu_ub),
+ (var_lb, var_ub)])[1]
+ p_val = 1 - chi2.cdf(llr, 1)
+ if print_weights:
+ return p_val, llr, self.new_weights.T
+ return p_val, llr
+
+ def hy_test_kurt(self, kurt0, nuis0=None, mu_min=None,
@jseabold
statsmodels member
jseabold added a line comment Jul 26, 2012

Very nit-picky comment, but maybe the hy_ isn't needed. Just test_var, test_kurt, etc. I think hypothesis is implied.

@jseabold jseabold commented on an outdated diff Aug 1, 2012
statsmodels/emplike/descriptive.py
+Last Updated: 25 July 2012
+
+General References:
+------------------
+
+Owen, A. (2001). "Empirical Likelihood." Chapman and Hall
+
+
+TODO: Write functions for estimationg equations instead
+ of generating the data every time a hypothesis
+ test is called.
+"""
+import numpy as np
+from scipy import optimize
+from scipy.stats import chi2, skew, kurtosis
+from matplotlib import pyplot as plt
@jseabold
statsmodels member
jseabold added a line comment Aug 1, 2012

This should be wrapped in a try/except block since matplotlib is a soft dependency. You can also delay this import to the method that uses it.
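A minimal sketch of the deferred-import pattern being suggested (the function name is hypothetical):

def plot_p_values(x, p_vals):
    # deferred import: the module still imports when matplotlib is absent;
    # only actually calling a plotting method requires it
    from matplotlib import pyplot as plt
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.plot(x, p_vals)
    return fig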

@jseabold jseabold commented on an outdated diff Aug 1, 2012
statsmodels/emplike/descriptive.py
+ if upper_bound is not None:
+ ul = upper_bound
+ else:
+ ul = ((self.nobs - 1) * endog.var()) / \
+ (chi2.ppf(.0001, self.nobs - 1))
+ if lower_bound is not None:
+ ll = lower_bound
+ else:
+ ll = ((self.nobs - 1) * endog.var()) / \
+ (chi2.ppf(.9999, self.nobs - 1))
+ self.r0 = chi2.ppf(1 - sig, 1)
+ ll = optimize.brentq(self._ci_limits_var, ll, endog.var())
+ ul = optimize.brentq(self._ci_limits_var, endog.var(), ul)
+ return ll, ul
+
+ def var_p_plot(self, lower, upper, step, sig=.05):
@jseabold
statsmodels member
jseabold added a line comment Aug 1, 2012

This should take an existing axis as a keyword and return the figure. See this function for how we work with matplotlib by convention

https://github.com/statsmodels/statsmodels/blob/master/statsmodels/graphics/factorplots.py#L6

The preferred way to work interactively with matplotlib is something like this

import matplotlib.pyplot as plt
fig = plt.figure() # only use of plt ever except maybe show
ax = fig.add_subplot(111)
ax.plot(...)
@jseabold jseabold commented on an outdated diff Aug 1, 2012
statsmodels/emplike/descriptive.py
+ Will draw a horizontal line at 1- sig. Default= .05
+
+ This function can be helpful when trying to determine limits
+ in the ci_var function.
+ """
+ sig = 1 - sig
+ p_vals = []
+ for test in np.arange(lower, upper, step):
+ p_vals.append(self.test_var(test)[0])
+ p_vals = np.asarray(p_vals)
+ plt.plot(np.arange(lower, upper, step), p_vals)
+ plt.plot(np.arange(lower, upper, step), (1 - sig) * \
+ np.ones(len(p_vals)))
+ return 'Type plt.show to see the figure'
+
+ def mean_var_contour(self, mu_l, mu_h, var_l, var_h, mu_step,
@jseabold
statsmodels member
jseabold added a line comment Aug 1, 2012

Ditto the above. You probably want methods that do plotting to be preceded by a plot_, so plot_contour here might work? plot_mean_var_contour is a handful to type.

@jseabold jseabold commented on an outdated diff Aug 1, 2012
statsmodels/emplike/descriptive.py
+
+ def test_skew(self, skew0, nuis0=None, mu_min=None,
+ mu_max=None, var_min=None, var_max=None,
+ print_weights=False):
+ """
+ Returns -2 ``*`` log_likelihood and p_value for the hypothesized
+ skewness.
+
+ Parameters
+ ----------
+ skew0: float
+ Skewness value to be tested
+
+ Optional
+ --------
+
@jseabold
statsmodels member
jseabold added a line comment Aug 1, 2012

You can get rid of these blank lines

@jseabold
statsmodels member

What about just having a DescStat class that checks for univariate or multivariate and then returns the appropriate class? I still need to dig further to see why exactly univariate isn't just a special case of multivariate, but I'm doing some profiling now. Some method calls take 10 seconds, which seems really unnecessary.
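A minimal sketch of that dispatch idea (the class names DescStatUV and DescStatMV are hypothetical stand-ins here):

import numpy as np

class DescStatUV(object):       # hypothetical univariate class
    def __init__(self, endog):
        self.endog = endog

class DescStatMV(object):       # hypothetical multivariate class
    def __init__(self, endog):
        self.endog = endog

def DescStat(endog):
    # factory: dispatch on the shape of endog
    endog = np.asarray(endog)
    if endog.ndim == 1 or endog.shape[1] == 1:
        return DescStatUV(endog)
    return DescStatMV(endog)

print DescStat(np.zeros(10))        # -> DescStatUV instance
print DescStat(np.zeros((10, 2)))   # -> DescStatMV instance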

@jseabold jseabold commented on an outdated diff Aug 1, 2012
statsmodels/emplike/descriptive.py
+
+ Returns
+ -------
+ J: n x m matrix
+ J'J is the hessian for optimizing
+
+ y: n x 1 array
+ -J'y is the gradient for maximizing
+
+ See Owen pg. 63
+ """
+
+ nobs = self.nobs
+ data = est_vect.T
+ data_star_prime = (1. + np.dot(eta1, data))
+ data_star_doub_prime = np.copy(1. + np.dot(eta1, data))
@jseabold
statsmodels member
jseabold added a line comment Aug 1, 2012

Still think we can get rid of this copy. It's taking time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@jseabold
statsmodels member

Is multivariate only intended for the bivariate case?

@j-grana6
@jseabold
statsmodels member

Hmm, then maybe these should be functions? I don't like a class that has methods that only work some of the time.

@jseabold
statsmodels member

It would also save a lot of this slicing business if you restrict endog to be nobs x 2, so you're not doing things like endog[:, 0].var(), endog[:, 1].var(). This is inefficient; var and mean should just be called once.
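For example, a single pass over axis 0 gives both columns at once (toy data):

import numpy as np

endog = np.random.standard_normal((50, 2))
mu1, mu2 = endog.mean(0)     # both column means in one call
var1, var2 = endog.var(0)    # both column variances in one call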

@jseabold jseabold commented on an outdated diff Aug 1, 2012
statsmodels/emplike/descriptive.py
+ If True, function also returns the weights that
+ maximize the likelihood ratio. Default = False.
+
+ Returns
+ --------
+
+ test_results: tuple
+ The log-likelihood ratio and p_value of `kurt0`
+ """
+ self.kurt0 = kurt0
+ if nuis0 is not None:
+ start_nuisance = nuis0
+ else:
+ start_nuisance = np.array([self.endog.mean(),
+ self.endog.var()])
+ if mu_min is not None:
@jseabold
statsmodels member
jseabold added a line comment Aug 1, 2012

It would make the code shorter and easier to read if you didn't rename these (I'm also thinking they should just be one argument for mean, not mu1 and mu2), but something like

if mu_min is None:
    mu_min = self.co_mean()[0] # why slice here?

Then you're not having multiple names for the same thing.

@jseabold jseabold and 1 other commented on an outdated diff Aug 1, 2012
statsmodels/emplike/descriptive.py
+ endog = self.endog
+ if endog.shape[1] != 2:
+ raise Exception('Correlation matrix not yet implemented')
+
+ self.corr0 = corr0
+
+ if nuis0 is not None:
+ start_nuisance = nuis0
+ else:
+ start_nuisance = np.array([endog[:, 0].mean(),
+ endog[:, 0].var(),
+ endog[:, 1].mean(),
+ endog[:, 1].var()])
+
+ if mu1_min is not None:
+ mu1_lb = mu1_min
@jseabold
statsmodels member
jseabold added a line comment Aug 1, 2012

The way I read this is that it's part of a nested optimization for ci_corr. Is there any way to avoid going through all these if statements each time it's called, since we don't really need to? Right now just calling ci_corr takes close to 10 seconds and includes 1.6 million (!) function calls. Anything we can do to avoid unnecessary calculations is going to greatly speed things up. Even if ci_corr doesn't call test_corr and repeats some code, since the latter might (?) be called just by a user, with the outer optimization step.

@j-grana6
j-grana6 added a line comment Aug 1, 2012
@jseabold
statsmodels member
jseabold added a line comment Aug 1, 2012

Sure. One suggestion is to make test_corr a function _test_corr where the arguments are not optional. Then test_corr the method can call this function with sensible defaults. And ci_corr can call _test_corr giving default values. This goes for all the other methods that follow the same pattern.
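A minimal sketch of that pattern with a toy objective (the names and the objective itself are illustrative only):

import numpy as np
from scipy import optimize

def _test_mean(mean0, data, bounds):
    # private worker: every argument is required, so a nested
    # optimization never re-derives defaults on each call
    obj = lambda m: np.sum((data - m) ** 2) + (m - mean0) ** 2
    return optimize.fminbound(obj, bounds[0], bounds[1])

def test_mean(data, mean0, bounds=None):
    # public wrapper: fills in sensible defaults exactly once
    if bounds is None:
        bounds = (data.min(), data.max())
    return _test_mean(mean0, data, bounds)

data = np.array([1., 2., 3.])
print test_mean(data, mean0=2.)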

@j-grana6

It's probably not very important, but should this function be in the _OptFuncts class with the rest of the functions that are optimized over to get confidence intervals?

@josef-pkt josef-pkt and 1 other commented on an outdated diff Aug 17, 2012
statsmodels/emplike/descriptive.py
+ test is called.
+"""
+import numpy as np
+from scipy import optimize
+from scipy.stats import chi2, skew, kurtosis
+from statsmodels.base.model import _fit_mle_newton
+import itertools
+from statsmodels.graphics import utils
+
+
+def _test_corr(corr0, self, nuis0, mu1_lb, mu1_ub, mu2_lb, mu2_ub, var1_lb,
+ var1_ub, var2_lb, var2_ub, endog, nobs, x0, weights0, r0):
+ """
+ Parameters
+ ---------
+
@josef-pkt
statsmodels member
josef-pkt added a line comment Aug 17, 2012

There are several headers that have the wrong number of underlines. They will cause a reST error during the doc build.

Recommended: run make html to build the docs and check for errors.

@j-grana6
j-grana6 added a line comment Aug 17, 2012
@josef-pkt
statsmodels member
josef-pkt added a line comment Aug 17, 2012

Yes, open a shell in the docs directory and do make html, which does the Sphinx build of the docs. Depending on the setup, there should be none or only a few warnings.

@josef-pkt josef-pkt commented on an outdated diff Aug 17, 2012
statsmodels/emplike/descriptive.py
+import numpy as np
+from scipy import optimize
+from scipy.stats import chi2, skew, kurtosis
+from statsmodels.base.model import _fit_mle_newton
+import itertools
+from statsmodels.graphics import utils
+
+
+def _test_corr(corr0, self, nuis0, mu1_lb, mu1_ub, mu2_lb, mu2_ub, var1_lb,
+ var1_ub, var2_lb, var2_ub, endog, nobs, x0, weights0, r0):
+ """
+ Parameters
+ ---------
+
+ corr0: float
+ Hypothesized vaalue for the correlation.
@josef-pkt
statsmodels member
josef-pkt added a line comment Aug 17, 2012

vaaaaaalue: quite a few typos and spelling mistakes in the docstrings

@josef-pkt josef-pkt commented on an outdated diff Aug 17, 2012
statsmodels/emplike/descriptive.py
+ Returns
+ -------
+
+ diff: float
+ Difference between log-likelihood of corr0 and a pre-specified
+ value, r0
+ """
+ bounds = [(mu1_lb, mu1_ub), (var1_lb, var1_ub), (mu2_lb, mu2_ub),
+ (var2_lb, var2_ub)]
+ args = (corr0, endog, nobs, x0, weights0)
+ llr = optimize.fmin_l_bfgs_b(self._opt_correl, nuis0, approx_grad=1,
+ bounds=bounds, args=args)[1]
+ return llr - r0
+
+
+def _log_star(eta1, est_vect, wts, nobs):
@josef-pkt
statsmodels member
josef-pkt added a line comment Aug 17, 2012

When I'm reading the code, there are many names that are not understandable without the book.
For example, a brief summary of the terminology/names in the module docstring would be helpful (developer-oriented, to help understand the code).

Also, the docstring is missing its one-line description.

@josef-pkt josef-pkt commented on an outdated diff Aug 17, 2012
statsmodels/emplike/descriptive.py
+ llr = optimize.fmin_l_bfgs_b(self._opt_correl, nuis0, approx_grad=1,
+ bounds=bounds, args=args)[1]
+ return llr - r0
+
+
+def _log_star(eta1, est_vect, wts, nobs):
+ """
+ Parameters
+ ---------
+ eta1: float
+ Lagrangian multiplier
+
+ est_vect: nxk array
+ Estimating equations vector
+
+ wts: nx1 array
@josef-pkt
statsmodels member
josef-pkt added a line comment Aug 17, 2012

use the full name, weights

@josef-pkt josef-pkt commented on an outdated diff Aug 17, 2012
statsmodels/emplike/descriptive.py
+ bounds = [(mu1_lb, mu1_ub), (var1_lb, var1_ub), (mu2_lb, mu2_ub),
+ (var2_lb, var2_ub)]
+ args = (corr0, endog, nobs, x0, weights0)
+ llr = optimize.fmin_l_bfgs_b(self._opt_correl, nuis0, approx_grad=1,
+ bounds=bounds, args=args)[1]
+ return llr - r0
+
+
+def _log_star(eta1, est_vect, wts, nobs):
+ """
+ Parameters
+ ---------
+ eta1: float
+ Lagrangian multiplier
+
+ est_vect: nxk array
@josef-pkt
statsmodels member
josef-pkt added a line comment Aug 17, 2012

vect is not really informative as part of a name; most of our things are vectors (i.e. arrays)

edit:
unless there is an ambiguity with estimating equations as functions. I still don't find the name est_vect very attractive

@josef-pkt
statsmodels member
josef-pkt added a line comment Aug 17, 2012

docstring standard

est_vect: ndarray, (n,k)

ndarray if you explicitly require numpy arrays and not other iterables.

I prefer the numpy shape notation, (n,k), but I don't know if this is the numpy doc standard

@josef-pkt josef-pkt commented on an outdated diff Aug 17, 2012
statsmodels/emplike/descriptive.py
+ ----
+
+ This function is really only a placeholder for the _fit_mle_Newton.
+ The function value is not used in optimization and the optimal value
+ is disregarded when computng the log likelihood ratio.
+ """
+ data_star = np.log(wts) + (np.sum(wts) + np.dot(est_vect, eta1))
+ idx = data_star < 1. / nobs
+ not_idx = ~idx
+ nx = nobs * data_star[idx]
+ data_star[idx] = np.log(1. / nobs) - 1.5 + nx * (2. - nx / 2)
+ data_star[not_idx] = np.log(data_star[not_idx])
+ return data_star
+
+
+def DescStat(endog):
@josef-pkt
statsmodels member
josef-pkt added a line comment Aug 17, 2012

pep-8: function names are lower case, although I'm not sure in this case

@josef-pkt josef-pkt commented on an outdated diff Aug 17, 2012
statsmodels/emplike/descriptive.py
+
+ def __init__(self, endog):
+ super(_OptFuncts, self).__init__(endog)
+
+ def _hess(self, eta1, est_vect, wts, nobs):
+ """
+ Calculates the hessian of a weighted empirical likelihood
+ provlem.
+
+ Parameters
+ ----------
+ eta1: 1xm array.
+
+ Value of lamba used to write the
+ empirical likelihood probabilities in terms of the lagrangian
+ multiplier.
@josef-pkt
statsmodels member
josef-pkt added a line comment Aug 17, 2012

Indent?

@josef-pkt
statsmodels member
josef-pkt added a line comment Aug 17, 2012

or is eta1 the value of lambda ? (both sound Greek to me)

@josef-pkt josef-pkt commented on an outdated diff Aug 17, 2012
statsmodels/emplike/descriptive.py
+ A class that holds functions that are optimized/solved. Not
+ intended for the end user.
+ """
+
+ def __init__(self, endog):
+ super(_OptFuncts, self).__init__(endog)
+
+ def _hess(self, eta1, est_vect, wts, nobs):
+ """
+ Calculates the hessian of a weighted empirical likelihood
+ provlem.
+
+ Parameters
+ ----------
+ eta1: 1xm array.
+
@josef-pkt
statsmodels member
josef-pkt added a line comment Aug 17, 2012

Lagrange multiplier ?

@josef-pkt
statsmodels member

What's the relationship between the different emplike branches?

My impression is that some of the things I commented might be already cleaned up.
IVregression branch uses a different descriptive :

from descriptive2 import _OptFuncts

@j-grana6
@josef-pkt
statsmodels member

However, reviewing descriptive then doesn't make sense if it's not the latest version.
Are all changes from this branch included in another branch?

@j-grana6
@josef-pkt
statsmodels member

Ok, I understand. I will look more at the descriptive branch later today.

@josef-pkt josef-pkt commented on an outdated diff Aug 17, 2012
statsmodels/emplike/descriptive.py
+ pairs = itertools.product(x, y)
+ z = []
+ for i in pairs:
+ z.append(self.mv_test_mean(np.asarray(i))[0])
+ X, Y = np.meshgrid(x, y)
+ z = np.asarray(z)
+ z = z.reshape(X.shape[1], Y.shape[0])
+ ax.contour(x, y, z.T, levels=levs)
+ if plot_dta:
+ ax.plot(self.endog[:, 0], self.endog[:, 1], 'bo')
+ return fig
+
+ def test_corr(self, corr0, nuis0=None, mu1_lb=None,
+ mu1_ub=None, mu2_lb=None, mu2_ub=None,
+ var1_lb=None, var1_ub=None,
+ var2_lb=None, var2_ub=None, return_weights=0):
@josef-pkt
statsmodels member
josef-pkt added a line comment Aug 17, 2012

that's a long list of kwds, and we cannot use **my_bounds if return_weights is set

maybe combine the bounds into a dict; see also the comment below on boilerplate

@josef-pkt josef-pkt commented on an outdated diff Aug 17, 2012
statsmodels/emplike/descriptive.py
+ ll = kurtosis(endog) - \
+ (2.5 * (2. * ((6. * nobs * (nobs - 1.)) / \
+ ((nobs - 2.) * (nobs + 1.) * \
+ (nobs + 3.))) ** .5) * \
+ (((nobs ** 2.) - 1.) / ((nobs - 3.) *\
+ (nobs + 5.))) ** .5)
+ # Need to calculate variance and mu limits here to avoid
+ # recalculating at every iteration in the maximization.
+ if (var_max is None) or (var_min is None):
+ var_lims = self.ci_var(sig=sig)
+ if (mu_max is None) or (mu_min is None):
+ mu_lims = self.ci_mean(sig=sig)
+ if var_min is not None:
+ self.var_l = var_min
+ else:
+ self.var_l = var_lims[0]
@josef-pkt
statsmodels member
josef-pkt added a line comment Aug 17, 2012

if they are attached here as temporary variables to make them available in the iterations, then I would not leave them around as attributes for the user.

for example, I used something like self._temp as an attribute to hold the values during iteration.
if they are for the end user to know what was used in the program, it would need an indication that it was used for ci_kurt, and not for any other ones.

using dicts with dict updating sounds easier here too, and lazy/cached attributes for the defaults?

@josef-pkt josef-pkt commented on an outdated diff Aug 17, 2012
statsmodels/emplike/descriptive.py
+ var_vect = list(np.arange(var_l, var_h, var_step))
+ z = []
+ for sig0 in var_vect:
+ self.sig2_0 = sig0
+ for mu0 in mu_vect:
+ z.append(self._opt_var(mu0, pval=True))
+ z = np.asarray(z).reshape(len(var_vect), len(mu_vect))
+ ax.contour(mu_vect, var_vect, z, levels=levs)
+ return fig
+
+ ## TODO: Use gradient and Hessian to optimize over nuisance params
+ ## TODO: Use non-nested optimization to optimize over nuisance
+ ## parameters. See Owen pgs 234- 241
+
+ def test_skew(self, skew0, nuis0=None, mu_min=None,
+ mu_max=None, var_min=None, var_max=None,
@josef-pkt
statsmodels member
josef-pkt added a line comment Aug 17, 2012

what's the difference between mu and a mean?
If mu always refers to the mean (expected value), then I would call it mean, so I don't have to guess if it's not something else.

@josef-pkt josef-pkt commented on an outdated diff Aug 17, 2012
statsmodels/emplike/descriptive.py
+ var_step,
+ levs=[.2, .1, .05, .01, .001]):
+ """
+ Returns a plot of the confidence region for a univariate
+ mean and variance.
+
+ Parameters
+ ----------
+
+ mu_l: float
+ Lowest value of the mean to plot
+
+ mu_h: float
+ Highest value of the mean to plot
+
+ var_l: float
@josef-pkt
statsmodels member
josef-pkt added a line comment Aug 17, 2012

not so nice, add some letters

what I like most of the time is to use low and upp, because they both have 3 letters and we don't use them in any context other than limits or bounds

@josef-pkt josef-pkt commented on an outdated diff Aug 17, 2012
statsmodels/emplike/descriptive.py
+
+ Returns
+ --------
+
+ test_results: tuple
+ The log-likelihood ratio and p_value of the joint hypothesis test.
+ """
+ self.kurt0 = kurt0
+ self.skew0 = skew0
+ if nuis0 is not None:
+ start_nuisance = nuis0
+ else:
+ start_nuisance = np.array([self.endog.mean(),
+ self.endog.var()])
+ if mu_min is not None:
+ mu_lb = mu_min
@josef-pkt
statsmodels member
josef-pkt added a line comment Aug 17, 2012

still redundant renaming, settle on one name

@josef-pkt josef-pkt commented on the diff Aug 17, 2012
statsmodels/emplike/descriptive.py
+ else:
+ ul = kurtosis(endog) + \
+ (2.5 * (2. * ((6. * nobs * (nobs - 1.)) / \
+ ((nobs - 2.) * (nobs + 1.) * \
+ (nobs + 3.))) ** .5) * \
+ (((nobs ** 2.) - 1.) / ((nobs - 3.) *\
+ (nobs + 5.))) ** .5)
+ if lower_bound is not None:
+ ll = lower_bound
+ else:
+ ll = kurtosis(endog) - \
+ (2.5 * (2. * ((6. * nobs * (nobs - 1.)) / \
+ ((nobs - 2.) * (nobs + 1.) * \
+ (nobs + 3.))) ** .5) * \
+ (((nobs ** 2.) - 1.) / ((nobs - 3.) *\
+ (nobs + 5.))) ** .5)
@josef-pkt
statsmodels member
josef-pkt added a line comment Aug 17, 2012

I don't like this code repetition much. One possibility is to use the change in the if structure as in another method, or use a default cache (more on that separately).

Also, this will be calculated repeatedly if both bounds are None.
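A minimal sketch of computing the half-width only once (the helper name is hypothetical; the formula is taken from the diff above):

import numpy as np
from scipy.stats import kurtosis

def _kurt_bounds(endog, lower_bound=None, upper_bound=None, width=2.5):
    # evaluate the kurtosis standard-error expression a single time
    nobs = float(len(endog))
    se = 2. * np.sqrt(6. * nobs * (nobs - 1.) /
                      ((nobs - 2.) * (nobs + 1.) * (nobs + 3.)))
    se *= np.sqrt((nobs ** 2 - 1.) / ((nobs - 3.) * (nobs + 5.)))
    kurt = kurtosis(endog)
    ll = lower_bound if lower_bound is not None else kurt - width * se
    ul = upper_bound if upper_bound is not None else kurt + width * se
    return ll, ul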

@josef-pkt josef-pkt commented on an outdated diff Aug 17, 2012
statsmodels/emplike/descriptive.py
+ If function returns f(a) and f(b) must have different signs, consider
+ expanding lower and upper bound.
+ """
+ endog = self.endog
+ nobs = self.nobs
+ if upper_bound is not None:
+ ul = upper_bound
+ else:
+ ul = kurtosis(endog) + \
+ (2.5 * (2. * ((6. * nobs * (nobs - 1.)) / \
+ ((nobs - 2.) * (nobs + 1.) * \
+ (nobs + 3.))) ** .5) * \
+ (((nobs ** 2.) - 1.) / ((nobs - 3.) *\
+ (nobs + 5.))) ** .5)
+ if lower_bound is not None:
+ ll = lower_bound
@josef-pkt
statsmodels member
josef-pkt added a line comment Aug 17, 2012

ll reads like 11; if short names, then ub and lb are better, but I like longer names.

Also, my impression is that names and abbreviations change from method to method.

@josef-pkt
statsmodels member

I haven't tried to work my way through the calculations yet.

The main thing I would change is to try to get rid of some of the boilerplate in the bounds handling. What I think should work is attaching defaults to the instance, and working with attached dictionaries for the options that are actually used in the iterations.

mean, variance, skew and kurtosis could be cached attributes of the instance directly.
Then I would add a cache for the bounds for mean, var, skew and kurtosis, which can then be used in any of the methods.
I guess bounds for skew and kurtosis should be lazy. (I'm not sure what the easiest way to do that is: a lazy dict, an instance, an extra class with cached attributes.)

Then I would just try to do a dict update in the method. The easiest would be if the bounds were just given in a single dict, as options. Otherwise, it would still require going through all arguments explicitly.

I think we could save the optimization options, including bounds, in a method-specific attribute, e.g. self.kurt_options = dict(....). They are small, so we wouldn't have to worry about memory, and they would be overwritten by a second call to the same method with different arguments.


Given what looks like good test coverage, I think this can be merged soon, with more improvements made later.
The main thing that should be settled is the method signatures.
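A minimal sketch of the dict-update idea (the method and option names are hypothetical):

class Demo(object):
    def test_kurt(self, kurt0, options=None):
        # start from the defaults, let the caller override any of them
        opts = dict(sig=.05, mu_bounds=None, var_bounds=None)
        if options is not None:
            opts.update(options)
        # remember what was actually used, per method
        self.kurt_options = opts
        return opts

d = Demo()
print d.test_kurt(0., options={'sig': .01})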

@j-grana6
@josef-pkt
statsmodels member

One simpler case of options in a dict that I just used:

change the default for the options in a call, but let the users overwrite them
https://github.com/josef-pkt/statsmodels/blob/4157ca005d35092f66115b2e9b48eb1062a2f7b0/statsmodels/robust/least_trimmed_squares.py#L417

Your case is a bit more complicated because you need to precalculate values, and you don't need all the bounds at the same time.

@j-grana6
@josef-pkt
statsmodels member

I don't know the optimization code and don't know if it applies.
In general, an advantage of the bounds might also be that it's more robust in "strange" cases, cases with numerical problems.

@j-grana6
@j-grana6 j-grana6 and 1 other commented on an outdated diff Aug 19, 2012
statsmodels/emplike/descriptive.py
+ """
+ Parameters
+ ----------
+ nuis_params: 1darray
+ array containing two nuisance means and two nuisance variances
+
+ Returns
+ -------
+ llr: float
+ The log-likelihood of the correlation coefficient holding nuisance
+ parameters constant
+ """
+ mu1_data, mu2_data = (endog - nuis_params[::2]).T
+ sig1_data = mu1_data ** 2 - nuis_params[1]
+ sig2_data = mu2_data ** 2 - nuis_params[3]
+ correl_data = correl_data = ((mu1_data * mu2_data) - corr0 *
@j-grana6
j-grana6 added a line comment Aug 19, 2012

Is there a reason for the double assignment here? Can't quite figure this out.

@jseabold
statsmodels member
jseabold added a line comment Aug 19, 2012

What do you mean by double assignment? I just vectorized it, so you compute both at the same time and then reuse mu in the sigma calculation. Before, this was computed 4 times instead of once. The [::2] notation just does this

>>> x = np.arange(1,11)
>>> x
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
>>> x[::2]
array([1, 3, 5, 7, 9])

This is because of the (somewhat odd) order of nuis_params, but I didn't want to deal with changing this.

@j-grana6
j-grana6 added a line comment Aug 19, 2012
@jseabold
statsmodels member
jseabold added a line comment Aug 19, 2012

Ha, right. Typo I guess.

@josef-pkt josef-pkt referenced this pull request Sep 16, 2012
Merged

Emplike desc reg rebase #460

@josef-pkt
statsmodels member

This pull request has been rebased and merged into master in #460.

This can be closed, if there is no open discussion left.

@jseabold jseabold closed this Nov 8, 2012