
El desc #326

Closed
wants to merge 81 commits

4 participants

j-grana6 Josef Perktold Ralf Gommers Skipper Seabold
j-grana6

This should complete all empirical likelihood descriptive statistics, including tests.

Skipper Seabold

This file can be empty if you want.

Skipper Seabold

Could you make the docstrings indented at the same level? It makes the code easier to read.

Josef Perktold

just a general comment

I think it's safer to get into the habit of using floating point numbers, e.g. ((self.nobs ** 2) - 1.) / (self.nobs - 3.), just to reduce the possibility of integer division errors, even if nobs is defined as float. (It avoids having to check whether nobs is a float, and is robust to nobs being overwritten as an integer.)
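
For instance, a minimal sketch of the pitfall (assuming Python 2 integer-division semantics; nobs = 10 is a made-up value):

nobs = 10                              # suppose nobs got overwritten as an int
print (nobs ** 2 - 1) / (nobs - 3)     # 14 -- silently truncated integer division
print (nobs ** 2 - 1.) / (nobs - 3.)   # 14.142857... -- the intended float result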

statsmodels/el/el.py
((36 lines not shown))
  36 +        self.nobs = float(endog.shape[0])
  37 +        self.weights = np.ones(self.nobs) / float(self.nobs)
  38 +        if endog.ndim == 1:
  39 +            self.endog = self.endog.reshape(self.nobs, 1)
  40 +        # For now, self. weights should always be a vector of 1's and a
  41 +        # variable "new weights" should be created everytime weights are
  42 +        # changed.
  43 +
  44 +
  45 +class OptFuncts(ElModel):
  46 +    """
  47 +
  48 +
  49 +    A class that holds functions that are optimized/solved. Not
  50 +    intended for the end user.
  51 +    """
Josef Perktold Owner
josef-pkt added a note

Even functions and classes that are not intended for the end user should get full docstrings, at least Parameters and Returns.

Josef Perktold Owner
josef-pkt added a note

Given the not really informative names (j, y, star, ...), at least without the book at hand, a basic summary of the names used in the class would be very helpful. It could be added to the docstring as notes on what's going on.

j-grana6
j-grana6 added a note

I tried my best to explain J, y and star in each of the functions without going too much into the theory. The optimization is done in a roundabout way, since we solve for the Lagrangian multipliers and then compute the weights as a function of a Lagrangian multiplier. That might be why the variable names/intuition are not immediately clear. I did, however, try to replicate Owen's notation as closely as possible.

statsmodels/el/el.py
((54 lines not shown))
  54 +        super(OptFuncts, self).__init__(endog)
  55 +
  56 +    def get_j_y(self, eta1):
  57 +        """
  58 +        Calculates J and y via the log*' and log*''.
  59 +
  60 +        Maximizing log* is done via sequential regression of
  61 +        y on J.
  62 +
  63 +        See Owen pg. 63
  64 +
  65 +        """
  66 +
  67 +        data = np.copy(self.est_vect.T)
  68 +        data_star_prime = np.copy((1 + np.dot(eta1, data)))
  69 +        data_star_doub_prime = np.copy((1 + np.dot(eta1, data)))
Josef Perktold Owner
josef-pkt added a note

My guess is that the copy is redundant here; it will copy twice, since 1 + ... already creates a new temporary array.

j-grana6
j-grana6 added a note

Deleted

Skipper Seabold Owner
jseabold added a note

Is the code that's on github the most recent? I still see these and just made a similar comment.

j-grana6
j-grana6 added a note
statsmodels/el/el.py
((81 lines not shown))
  81 +        data_star_prime = data_star_prime.reshape(self.nobs, 1)
  82 +        data_star_doub_prime = data_star_doub_prime.reshape(self.nobs, 1)
  83 +        J = ((- 1 * data_star_doub_prime) ** .5) * self.est_vect
  84 +        y = data_star_prime / ((- 1 * data_star_doub_prime) ** .5)
  85 +        return J, y
  86 +
  87 +    def modif_newton(self, x0):
  88 +        """
  89 +        Modified Newton's method for maximizing the log* equation.
  90 +
  91 +        See Owen pg. 64
  92 +
  93 +        """
  94 +        params = x0.reshape(1, self.est_vect.shape[1])
  95 +        diff = 1
  96 +        while diff > 10 ** (-10):
Josef Perktold Owner
josef-pkt added a note

This needs a maxiter to avoid possible endless loops.

Do we need a Newton maximizer here, or can we reuse some other ones? It would be more economical to invest in it only once.
For example, LikelihoodModel's newton has problems when the step size is too large.
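
A minimal sketch of the suggested guard (maxiter, the tolerance, and the error message are illustrative choices, not part of the PR):

maxiter = 100
diff = 1.
itr = 0
while diff > 1e-10 and itr < maxiter:
    itr += 1
    # ... one Newton step here, updating params and diff ...
if diff > 1e-10:
    raise RuntimeError('modif_newton did not converge in %d iterations' % maxiter)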

statsmodels/el/el.py
((20 lines not shown))
  20 +import numpy as np
  21 +from scipy import optimize
  22 +from scipy.stats import chi2, skew, kurtosis
  23 +from matplotlib import pyplot as plt
  24 +import itertools
  25 +
  26 +
  27 +class ElModel(object):
  28 +    """
  29 +
  30 +
  31 +    Initializes data for empirical likelihood. Not intended for end user
  32 +    """
  33 +
  34 +    def __init__(self, endog):
  35 +        self.endog = endog
Josef Perktold Owner
josef-pkt added a note

Are there general shape requirements for endog? Does it need some basic conversion code: asarray, atleast_2d?
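
For example, a sketch of the kind of conversion meant here (np.asarray and np.atleast_2d are the standard numpy helpers; treating columns as variables is an assumption about the intended layout):

import numpy as np

def _prepare_endog(endog):
    # coerce to a float ndarray and guarantee a 2d, column-oriented shape
    endog = np.asarray(endog, dtype=float)
    if endog.ndim == 1:
        endog = endog[:, None]   # nobs x 1
    return endog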

statsmodels/el/el.py
((391 lines not shown))
  391 +            gamma_star_u = optimize.brentq(self.find_gamma, \
  392 +                max(self.endog) + epsilon, gamma_high)
  393 +            weights_low = ((self.endog - gamma_star_l) ** -1) / \
  394 +                np.sum((self.endog - gamma_star_l) ** -1)
  395 +            weights_high = ((self.endog - gamma_star_u) ** -1) / \
  396 +                np.sum((self.endog - gamma_star_u) ** -1)
  397 +            mu_low = np.sum(weights_low * self.endog)
  398 +            mu_high = np.sum(weights_high * self.endog)
  399 +            return mu_low, mu_high
  400 +
  401 +        if method == 'bisect':
  402 +            self.r0 = chi2.ppf(sig, 1)
  403 +            self.mu_high = self.endog.mean()
  404 +            mu_hmax = max(self.endog)
  405 +            while abs(self.hy_test_mean(self.mu_high)[1]
  406 +                      - self.r0) > tol:
Josef Perktold Owner
josef-pkt added a note

Does this need a safety maxiter, and checks whether the problem is well defined?

j-grana6
j-grana6 added a note

I don't believe so. The LLR for the mean is monotonic between the mean and the maximum value of the data, so the algorithm should converge in all cases. Nevertheless, I was thinking about deleting this as a whole, since the gamma method works fine and, even if it doesn't, Brent's is much faster too.

statsmodels/el/el.py
((487 lines not shown))
  487 +        -------
  488 +        random_numbers = np.random.standard_normal(100)
  489 +        el_analysis = el.DescStat(random_numbers)
  490 +        # Initialize El
  491 +        el_analysis.ci_var()
  492 +        >>>f(a) and f(b) must have different signs
  493 +        el_analysis.ci_var(.5, 2)
  494 +        # Searches for confidence limits where the lower limit > .5
  495 +        # and the upper limit <2.
  496 +
  497 +        Troubleshooting Tips
  498 +        --------------------
  499 +
  500 +        If the function returns the error f(a) and f(b) must have
  501 +        different signs, consider lowering lower_bound and raising
  502 +        upper_bound.
Josef Perktold Owner
josef-pkt added a note

scipy just got an improvement in a similar case, where the bounds are automatically shifted if they don't have different signs

statsmodels/el/el.py
((585 lines not shown))
  585 +        p_val = 1 - chi2.cdf(-2 * llr, mu_array.shape[1])
  586 +        if print_weights:
  587 +            return p_val, -2 * llr, self.new_weights.T
  588 +        else:
  589 +            return p_val, -2 * llr
  590 +
  591 +    def mv_mean_contour(self, mu1_l, mu1_u, mu2_l, mu2_u, step1, step2,
  592 +                        levs=[.2, .1, .05, .01, .001], plot_dta=False):
  593 +        """
  594 +
  595 +        Creates confidence region plot for the mean of bivariate data
  596 +
  597 +        Parameters
  598 +        ----------
  599 +
  600 +        m1_l: Minimum value of the mean for variable 1
Josef Perktold Owner
josef-pkt added a note

check compliance with numpy doc format

m1_l: float
    Minimum value of the mean for variable 1
j-grana6
j-grana6 added a note

I changed to this format for all functions

statsmodels/el/el.py
((775 lines not shown))
  775 + """
  776 +
  777 + Returns the p_value and -2 * log_likelihood for the hypothesized
  778 + kurtosis.
  779 +
  780 + Parameters
  781 + ----------
  782 + kurt0: kurtosis value to be tested
  783 +
  784 + Optional
  785 + --------
  786 +
  787 + mu_min, mu_max, var_min, var_max: Minimum and maximum values
  788 + of the nuisance parameters to be optimized over. If None,
  789 + the function computes the 95% confidence interval for
  790 + the mean and variance and uses the resulting values.
Josef Perktold Owner
josef-pkt added a note

Statistics question: does this work for any distribution (with some restrictions like finite moments)?

I was once looking for statistical tests for skew and kurtosis, but found essentially only normally distributed examples, not ones for other distributions.

j-grana6
j-grana6 added a note

Yes, it does. In fact, in the introduction to Owen's Empirical Likelihood, he plots the confidence region for skewness and kurtosis of the number of segments of an earthworm, which he shows is clearly not normally distributed.

statsmodels/el/el.py
((1,105 lines not shown))
  1105 +                                     bounds=[(mu1_lb, mu1_ub),
  1106 +                                             (var1_lb, var1_ub),
  1107 +                                             (mu2_lb, mu2_ub),
  1108 +                                             (var2_lb, var2_ub)])[1]
  1109 +        p_val = 1 - chi2.cdf(llr, 1)
  1110 +        if print_weights:
  1111 +            return p_val, llr, self.new_weights.T
  1112 +        return p_val, llr
  1113 +
  1114 +    def ci_corr(self, sig=.05, upper_bound=None, lower_bound=None,
  1115 +                mu1_min=None,
  1116 +                mu1_max=None, mu2_min=None, mu2_max=None,
  1117 +                var1_min=None, var1_max=None,
  1118 +                var2_min=None, var2_max=None):
  1119 +        """
  1120 +
Josef Perktold Owner
josef-pkt added a note

endog in the class docstring said 1d, I guess this is an exception and requires 2d

j-grana6
j-grana6 added a note

See comment above. Again, I think putting the required shape of endog in each of the functions is the best bet.

statsmodels/el/el.py
((1,144 lines not shown))
  1144 +            ul = min(.999, point_est + \
  1145 +                     2.5 * ((1. - point_est ** 2.) / \
  1146 +                     (self.nobs - 2.)) ** .5)
  1147 +
  1148 +        if lower_bound is not None:
  1149 +            ll = lower_bound
  1150 +        else:
  1151 +            ll = max(- .999, point_est - \
  1152 +                     2.5 * (((1. - point_est ** 2.) / \
  1153 +                     (self.nobs - 2.)) ** .5))
  1154 +
  1155 +        if mu1_min is not None:
  1156 +            self.mu1_lb = mu1_min
  1157 +        else:
  1158 +            self.mu1_lb = self.endog[:, 0].mean() - ((1.96 * \
  1159 +                (self.endog[:, 0].var()) / self.nobs)) ** .5
Josef Perktold Owner
josef-pkt added a note

Using sqrt would be more readable, I think. And, I guess, computationally more efficient, which might not matter in this case.

j-grana6
j-grana6 added a note

Done

Josef Perktold
Owner

el is not a good directory or module name: too short and uninformative.
emplike for the directory?
And the full empirical_likelihood for the module? Maybe not, since we are already in the emplike directory; maybe just descriptive.py, if there are other empirical likelihood models coming.

j-grana6
statsmodels/el/el.py
((223 lines not shown))
  223 +
  224 +        Also, it contains the creating of self.est_vect (short for estimating
  225 +        equations vector). That then gets read by the log-star equations.
  226 +
  227 +        Parameter
  228 +        --------
  229 +        nuisance_mu: float
  230 +            Value of a nuisance mean parameter.
  231 +
  232 +        Returns
  233 +        -------
  234 +
  235 +        Log likelihood of a pre-specified variance holding the nuisance
  236 +        parameter constant.
  237 +
  238 +        Not intended for end user.
Ralf Gommers Collaborator
rgommers added a note

Each time you write "not intended for end user" you should start the method name with an underscore. That's the standard way to indicate that a function is private, so it's clearer and you don't have to write this line in the docs.
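
E.g., a minimal illustration of the convention (the names mirror classes and methods that appear later in this PR):

class _OptFuncts(ElModel):
    """Optimization helpers; the leading underscore marks the class private."""

    def _ci_limits_var(self, var_test):   # same underscore rule for helper methods
        pass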

statsmodels/el/el.py
((241 lines not shown))
  241 +
  242 +        sig_data = ((self.endog - nuisance_mu) ** 2 \
  243 +                    - self.sig2_0)
  244 +        mu_data = (self.endog - nuisance_mu)
  245 +        self.est_vect = np.concatenate((mu_data, sig_data), axis=1)
  246 +        eta_star = self.modif_newton(np.array([1 / self.nobs,
  247 +                                               1 / self.nobs]))
  248 +
  249 +        denom = 1 + np.dot(eta_star, self.est_vect.T)
  250 +        self.new_weights = 1 / self.nobs * 1 / denom
  251 +        llr = np.sum(np.log(self.nobs * self.new_weights))
  252 +        if pval:  # Used for contour plotting
  253 +            return 1 - chi2.cdf(-2 * llr, 1)
  254 +        return -2 * llr
  255 +
  256 +    def  ci_limits_var(self, var_test):
Ralf Gommers Collaborator
rgommers added a note

2 spaces after def (1 too many)

statsmodels/el/el.py
((540 lines not shown))
  540 +            variable.
  541 +
  542 +        gamma_low: lower bound for gamma when finding lower limit.
  543 +            If function returns f(a) and f(b) must have different signs,
  544 +            consider lowering gamma_low. default | gamma_low =-(10**10)
  545 +
  546 +        gamma_high: upper bound for gamma when finding upper limit.
  547 +            If function returns f(a) and f(b) must have different signs,
  548 +            consider raising gamma_high. default |gamma_high=10**10
  549 +
  550 +        epsilon: When using 'nested-brent', amount to decrease (increase)
  551 +            from the maximum (minimum) of the data when
  552 +            starting the search. This is to protect against the
  553 +            likelihood ratio being zero at the maximum (minimum)
  554 +            value of the data. If data is very small in absolute value
  555 +            (<10 ** -6) consider shrinking epsilon
Ralf Gommers Collaborator
rgommers added a note

When you use symbols like * or | in a docstring, you should surround them with double backticks so they're rendered by Sphinx as-is. Otherwise it will treat those symbols as special characters (** is bold, I think) and you'll get rendering errors.
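
E.g., the quoting Ralf means, on a made-up docstring fragment in numpydoc format:

gamma_low : float
    Lower bound for gamma when finding the lower limit.
    Default is ``gamma_low = -(10 ** 10)``; without the double
    backticks, Sphinx would try to parse ``**`` and ``|`` as markup.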

j-grana6
j-grana6 added a note
Ralf Gommers Collaborator
rgommers added a note

It's not too difficult, and small issues like the above can always be corrected later (they only influence doc build, not using statsmodels). A good primer on reStructuredText markup is http://sphinx.pocoo.org/rest.html, and docstrings are written according to https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt.

j-grana6
j-grana6 added a note
statsmodels/el/el.py
((450 lines not shown))
  450 +        Not intended for end user.
  451 +
  452 +        """
  453 +        return self.hy_test_corr(corr0, nuis0=None, mu1_min=self.mu1_lb,
  454 +                                 mu1_max=self.mu1_ub, mu2_min=self.mu2_lb,
  455 +                                 mu2_max=self.mu2_ub,
  456 +                                 var1_min=self.var1_lb, var1_max=self.var1_ub,
  457 +                                 var2_min=self.var2_lb, var2_max=self.var2_ub,
  458 +                                 print_weights=0)[1] - self.r0
  459 +
  460 +
  461 +class DescStat(OptFuncts):
  462 +    """
  463 +
  464 +
  465 +    A class for confidence intervals and hypothesis tests invovling mean,
Ralf Gommers Collaborator
rgommers added a note

Typo: invovling

statsmodels/el/el.py
((446 lines not shown))
  446 +
  447 +        Difference between log-likelihood of corr0 and a pre-specified
  448 +        value.
  449 +
  450 +        Not intended for end user.
  451 +
  452 +        """
  453 +        return self.hy_test_corr(corr0, nuis0=None, mu1_min=self.mu1_lb,
  454 +                                 mu1_max=self.mu1_ub, mu2_min=self.mu2_lb,
  455 +                                 mu2_max=self.mu2_ub,
  456 +                                 var1_min=self.var1_lb, var1_max=self.var1_ub,
  457 +                                 var2_min=self.var2_lb, var2_max=self.var2_ub,
  458 +                                 print_weights=0)[1] - self.r0
  459 +
  460 +
  461 +class DescStat(OptFuncts):
Ralf Gommers Collaborator
rgommers added a note

This is the main class right? At least it's the only one that's not noted as "not intended for the end user". Then here I'd expect a full description of what empirical likelihood estimation does and how to use it.

j-grana6
j-grana6 added a note
Skipper Seabold Owner
jseabold added a note

A usage / brief tutorial-style example would be greatly helpful, i.e., a standalone file. I don't have Owen's book because I can't afford it right now ;), and I'm not really sure what I'm supposed to do with this. Is the intention of this class just to be a univariate / multivariate location and scale estimator based on optimizing the empirical likelihood? If that's the case, I might suggest some changes. Otherwise I'm going to keep posting some comments on just the code.

Are there any papers that cover this outside of the book? I don't need all the details.

j-grana6
j-grana6 added a note
Skipper Seabold Owner
jseabold added a note

Just like our other examples. Something to kind of explain what's going on and show how (and why) you'd use it. Doesn't have to be anything crazy. Something like the cancer example in Owen (1991) would work.

j-grana6
j-grana6 added a note
Ralf Gommers
Collaborator

Looks good overall from a quick browse around.

Josef Perktold
Owner

To Ralf's and my request for more documentation:

I think the main user access will be through the methods, and the docstrings on the individual methods, for mean, var, skew,... should be the most explicit, as it currently is.

The docstrings for the classes don't have to repeat everything, but should outline the main structure and how to use it, and use a See Also pointing to the more explicit docstrings for the details.

That's my impression not having run an example yet.

j-grana6 DOC: Fixed Documentation to comply with numpy documentation. Addressed issues with Sphinx. Also changed the name of the directory and filenames for empirical likelihood
9659c52
Josef Perktold
Owner

My guess is that you just disconnected all the history of the file.
File renames or moving of code should always be committed as a separate changeset, not mixed with additional edits.

j-grana6
Josef Perktold
Owner

I just mean that when you rename a file, you don't make any other changes and commit it immediately.

I don't have time right now (still bug hunting), but what I usually do is to check whether gitk or git gui's browse files and blame still show the history for individual lines.

Ralf Gommers
Collaborator

Josef, nothing was renamed, Justin added everything as a duplicate.

Justin, now you have everything twice. I guess now the easiest way is to delete el.py, but that indeed means that you lose comments etc. because git(hub) can't track them anymore. A better way to rename would have been

$ git mv el.py emplike.py
$ # make minimal changes to make things work (replace ``import el`` with ``import emplike``)
$ git commit
$ # now make other commits
$ git commit

Git is smart enough to follow code across files, but if you duplicate it then history (and inline github comments) get lost.

Josef Perktold
Owner

(I haven't gotten used to the new Show Diff Stats button yet, to see what files are in the diff.)

Since the old files still exist, it's still easy to do a "proper" renaming:

delete the new emplike directory and files (make a backup if there are other changes that are not in "el")
commit (same as reverting the copy)

rename directory el -> emplike (I use Windows explorer; Ralf, does "git mv" create information for git about a rename?)
commit
rename el.py -> descriptive.py
rename test_el -> test_descriptive.py
any other rename ?
commit

fix up the imports and other code to correspond to new directory and module names
commit

in contrast to Ralf, I usually do the fix up after a move in a separate commit, and use several commits for rename/move to be on the safe side.

Ralf Gommers
Collaborator

"git mv" is actually just a convenience function: http://stackoverflow.com/questions/1094269/whats-the-purpose-of-git-mv. I thought it did help if you mix renames and code changes, but I was wrong.

Skipper Seabold
Owner

I'm going to try to ping @jseabold and see if it enables notifications for me. I missed this whole discussion.

Josef Perktold

PEP 8: no spaces in class names.

j-grana6

I am debating creating a class for the multivariate functions only. This will make it easier to write the documentation and easier to expand on in the future. Any objections to this?

Owner

Yes, I think it makes sense to have two classes, one for univariate and one for multivariate. You shouldn't have methods in a class that only work some of the time.

statsmodels/emplike/descriptive.py
((12 lines not shown))
  12 +
  13 +TODO: Write functions for estimationg equations instead
  14 +      of generating the data every time a hypothesis
  15 +      test is called.
  16 +
  17 +
  18 +"""
  19 +import numpy as np
  20 +from scipy import optimize
  21 +from scipy.stats import chi2, skew, kurtosis
  22 +from matplotlib import pyplot as plt
  23 +from statsmodels.base.model import _fit_mle_newton
  24 +import itertools
  25 +
  26 +
  27 +class ElModel(object):
Skipper Seabold Owner
jseabold added a note

I think ELModel might be better here. Lowercase L's can be hard to read.

statsmodels/emplike/descriptive.py
((30 lines not shown))
  30 +
  31 +    Initializes data for empirical likelihood. Not intended for end user
  32 +    """
  33 +
  34 +    def __init__(self, endog):
  35 +        self.endog = endog
  36 +        self.nobs = float(endog.shape[0])
  37 +        self.weights = np.ones(self.nobs) / float(self.nobs)
  38 +        if endog.ndim == 1:
  39 +            self.endog = self.endog.reshape(self.nobs, 1)
  40 +        # For now, self. weights should always be a vector of 1's and a
  41 +        # variable "new weights" should be created everytime weights are
  42 +        # changed.
  43 +
  44 +
  45 +class OptFuncts(ElModel):
Skipper Seabold Owner
jseabold added a note

Not crucial, but it's customary to put a _ in front of classes that aren't meant to be public. Sometimes I don't do this because it's not really cut and dry between public/private.

statsmodels/emplike/descriptive.py
((71 lines not shown))
  71 +
  72 +        Returns
  73 +        -------
  74 +        J: n x m matrix
  75 +            J'J is the hessian for optimizing
  76 +
  77 +        y: n x 1 array
  78 +            -J'y is the gradient for maximizing
  79 +
  80 +
  81 +        See Owen pg. 63
  82 +
  83 +        """
  84 +        data = np.copy(self.est_vect.T)
  85 +        data_star_prime = (1 + np.dot(eta1, data))
  86 +        data_star_doub_prime = np.copy((1 + np.dot(eta1, data)))
Skipper Seabold Owner
jseabold added a note

Don't need a copy here. [Edit: I don't think you need either copy?]

statsmodels/emplike/test/__init__.py
... ... @@ -0,0 +1 @@
Skipper Seabold Owner
jseabold added a note

Can you rename the directory to tests/ for consistency? You'll need to use git mv and commit the move so as not to lose history.

statsmodels/emplike/test/__init__.py
... ... @@ -0,0 +1 @@
  1 +#init file
Skipper Seabold Owner
jseabold added a note

FYI, these can be empty, but it's not that important.

statsmodels/emplike/descriptive.py
((15 lines not shown))
  15 +      test is called.
  16 +
  17 +
  18 +"""
  19 +import numpy as np
  20 +from scipy import optimize
  21 +from scipy.stats import chi2, skew, kurtosis
  22 +from matplotlib import pyplot as plt
  23 +from statsmodels.base.model import _fit_mle_newton
  24 +import itertools
  25 +
  26 +
  27 +class ElModel(object):
  28 +    """
  29 +
  30 +
Skipper Seabold Owner
jseabold added a note

Can you cleanup all the whitespace at the beginning of the docstrings? The docs should start on either the line with """ or the next. I don't know if Sphinx/numpydoc will strip it, but I doubt it.

Skipper Seabold Owner
jseabold added a note

Also there's whitespace at the end of docstrings.

Skipper Seabold
Owner

It looks like the comments are now disconnected from the code, so I added some duplicate comments, though they are still valid in the current code.

statsmodels/emplike/descriptive.py
((459 lines not shown))
  459 +        sig2_data = ((self.endog[:, 1] - nuis_params[2]) ** 2) -\
  460 +            nuis_params[3]
  461 +        correl_data = ((self.endog[:, 0] - nuis_params[0]) * \
  462 +                       (self.endog[:, 1] - nuis_params[2])) - \
  463 +                      (self.corr0 * (nuis_params[1] ** .5) \
  464 +                       * (nuis_params[3] ** .5))
  465 +        self.est_vect = np.concatenate((self.mu1_data, sig1_data,
  466 +                                        mu2_data,
  467 +                                        sig2_data, correl_data),
  468 +                                       axis=1).reshape(5, self.nobs)
  469 +        self.est_vect = self.est_vect.T
  470 +        eta_star = self._modif_newton(np.array([1 / self.nobs,
  471 +                                                1 / self.nobs,
  472 +                                                1 / self.nobs,
  473 +                                                1 / self.nobs,
  474 +                                                1 / self.nobs]))
Skipper Seabold Owner
jseabold added a note

If you're going to be using attributes a lot in a method, it's probably best to go ahead and assign them to a variable. I think it's ever so slightly more performant, but really I just think it looks better. So

endog = self.endog
nobs = self.nobs

Though some people might disagree that it's best practice, we try to stick to this.

statsmodels/emplike/descriptive.py
((237 lines not shown))
  237 +            Value of a nuisance mean parameter.
  238 +
  239 +        Returns
  240 +        -------
  241 +
  242 +        llr: float:
  243 +            Log likelihood of a pre-specified variance holding the nuisance
  244 +            parameter constant.
  245 +
  246 +
  247 +        """
  248 +
  249 +        sig_data = ((self.endog - nuisance_mu) ** 2 \
  250 +                    - self.sig2_0)
  251 +        mu_data = (self.endog - nuisance_mu)
  252 +        self.est_vect = np.concatenate((mu_data, sig_data), axis=1)
Skipper Seabold Owner
jseabold added a note

This is updated during the optimization, correct? I think it should be passed around, then, rather than attached and read. It's a bit confusing to follow where things are getting changed and what's a set attribute.

j-grana6
j-grana6 added a note
Skipper Seabold Owner
jseabold added a note

I'm not really sure because I don't follow the code well enough yet. If it's a derived estimate then it should just be changed in the objective function and then attached after fitting. I just don't like that there's an attribute set in a method that's changed by other methods. When is it what? It should only be attached when it's set. If it's something that the optimizer is changing then pass it in. If it's derived, attach it after it's done deriving.

Does that make sense?

j-grana6
j-grana6 added a note
statsmodels/emplike/descriptive.py
((86 lines not shown))
  86 +        data_star_doub_prime = np.copy((1 + np.dot(eta1, data)))
  87 +        for elem in range(int(self.nobs)):
  88 +            if data_star_prime[0, elem] <= 1 / self.nobs:
  89 +                data_star_prime[0, elem] = 2 * self.nobs - \
  90 +                    (self.nobs) ** 2 * data_star_prime[0, elem]
  91 +            else:
  92 +                data_star_prime[0, elem] = 1. / data_star_prime[0, elem]
  93 +            if data_star_doub_prime[0, elem] <= 1 / self.nobs:
  94 +                data_star_doub_prime[0, elem] = - self.nobs ** 2
  95 +            else:
  96 +                data_star_doub_prime[0, elem] = \
  97 +                    - (data_star_doub_prime[0, elem]) ** -2
  98 +        data_star_prime = data_star_prime.reshape(self.nobs, 1)
  99 +        data_star_doub_prime = data_star_doub_prime.reshape(self.nobs, 1)
  100 +        J = ((- 1 * data_star_doub_prime) ** .5) * self.est_vect
  101 +        y = data_star_prime / ((- 1 * data_star_doub_prime) ** .5)
Skipper Seabold Owner
jseabold added a note

Things like this can be vectorized to avoid these loops. E.g., with abbreviated names, but you'll probably want to keep yours.

nobs = self.nobs
dsp = (1 + np.dot(eta1, data)).T
dspp = dsp.copy()
idx = dsp <= 1. / nobs
not_idx = ~idx
dsp[idx] = 2. * nobs - nobs ** 2 * dsp[idx]
dsp[not_idx] = 1. / dsp[not_idx]
dspp[idx] = -nobs ** 2
dspp[not_idx] = -dspp[not_idx] ** -2
root_prime = np.sqrt(-1 * dspp)
JJ = root_prime * self.est_vect
yy = dsp / root_prime

Two questions.

Why are you using row vectors and then transposing? Does it make sense just to have everything as column vectors, so you're not having to transpose or reshape all the time?

I assume you're going to want to use these elsewhere. Should log_star, log_star_prime, and log_star_prime_prime just be functions instead of methods?

Skipper Seabold Owner
jseabold added a note

E.g., my general log_star function that I think I sent you

def log_star(z, eps):
    log_idx = z > eps
    loggedarr = np.zeros_like(z)
    z_ok = z[log_idx]
    z_star = z[~log_idx]
    loggedarr[log_idx] = np.log(z_ok)
    loggedarr[~log_idx] = np.log(eps) - 1.5 + 2*z_star/eps - \
            z_star**2/(2*eps**2)
    return loggedarr
statsmodels/emplike/descriptive.py
((87 lines not shown))
  87 +        for elem in range(int(self.nobs)):
  88 +            if data_star_prime[0, elem] <= 1 / self.nobs:
  89 +                data_star_prime[0, elem] = 2 * self.nobs - \
  90 +                    (self.nobs) ** 2 * data_star_prime[0, elem]
  91 +            else:
  92 +                data_star_prime[0, elem] = 1. / data_star_prime[0, elem]
  93 +            if data_star_doub_prime[0, elem] <= 1 / self.nobs:
  94 +                data_star_doub_prime[0, elem] = - self.nobs ** 2
  95 +            else:
  96 +                data_star_doub_prime[0, elem] = \
  97 +                    - (data_star_doub_prime[0, elem]) ** -2
  98 +        data_star_prime = data_star_prime.reshape(self.nobs, 1)
  99 +        data_star_doub_prime = data_star_doub_prime.reshape(self.nobs, 1)
  100 +        J = ((- 1 * data_star_doub_prime) ** .5) * self.est_vect
  101 +        y = data_star_prime / ((- 1 * data_star_doub_prime) ** .5)
  102 +        return np.mat(J), np.mat(y)
Skipper Seabold Owner
jseabold added a note

Is there a reason to return matrices instead of arrays? We don't really use matrices anywhere. This could be very confusing.

statsmodels/emplike/descriptive.py
((1,302 lines not shown))
  1302 +        else:
  1303 +            self.var_l = var_lims[0]
  1304 +        if var_max is not None:
  1305 +            self.var_u = var_max
  1306 +        else:
  1307 +            self.var_u = var_lims[1]
  1308 +        if mu_min is not None:
  1309 +            self.mu_l = mu_min
  1310 +        else:
  1311 +            self.mu_l = mu_lims[0]
  1312 +        if mu_max is not None:
  1313 +            self.mu_u = mu_max
  1314 +        else:
  1315 +            self.mu_u = mu_lims[1]
  1316 +        self.r0 = chi2.ppf(1 - sig, 1)
  1317 +        print 'Finding the lower bound for skewness'
Skipper Seabold Owner
jseabold added a note

Need to get rid of the print statements.

j-grana6
j-grana6 added a note
Skipper Seabold Owner
jseabold added a note

Yeah. Print statements are not the way to go I don't think. Once we're done with some cleanup and I have my bearings a bit more I'll see if we can't figure out what's going on.

Skipper Seabold Owner
jseabold added a note

If you can post to the ML (or here) a full working example that hangs I'll have a look.

j-grana6
j-grana6 added a note
Skipper Seabold
Owner

I'm getting several errors trying to run the tests for test_descriptive. Some are related to my numpy version and will need to be fixed; others look to be bugs.

statsmodels/emplike/descriptive.py
((453 lines not shown))
  453 + """
  454 +
  455 + self.mu1_data = (self.endog[:, 0] - nuis_params[0])
  456 + sig1_data = ((self.endog[:, 0] - nuis_params[0]) ** 2) - \
  457 + nuis_params[1]
  458 + mu2_data = self.endog[:, 1] - nuis_params[2]
  459 + sig2_data = ((self.endog[:, 1] - nuis_params[2]) ** 2) -\
  460 + nuis_params[3]
  461 + correl_data = ((self.endog[:, 0] - nuis_params[0]) * \
  462 + (self.endog[:, 1] - nuis_params[2])) - \
  463 + (self.corr0 * (nuis_params[1] ** .5) \
  464 + * (nuis_params[3] ** .5))
  465 + self.est_vect = np.concatenate((self.mu1_data, sig1_data,
  466 + mu2_data,
  467 + sig2_data, correl_data),
  468 + axis=1).reshape(5, self.nobs)
Skipper Seabold Owner
jseabold added a note

One of the test failures is coming from here. I think you just want vstack instead of concatenate + reshape. (That is if you still want to keep row vectors, which I suspect you might not.)

j-grana6
Skipper Seabold
Owner

Then I don't think we're testing the same file / branch. Are you sure what's on github is recent? For example

line 863 in descriptive.py has

mu_array = mu_array.reshape(1, self.endog[1])

Surely this is a bug?

j-grana6
Skipper Seabold
Owner

Ok then I'm a bit worried about the test coverage. There's no way that something like

some_array.reshape(1, array([ 16.08324, 59.50397]))

Should lead to correct behavior unless I'm missing something.

j-grana6
Skipper Seabold
Owner

Not as far as I'm concerned :). Must be a bug in numpy that's been tightened up in the recent code.

Skipper Seabold
Owner

By the way, there are two ways to accomplish what you're doing that might be a bit easier / more efficient.

x.reshape(1, -1)

-1 will infer the value for the other shape. Might be faster than looking up self.nobs.

Or, I usually do

x[None, :]

This adds an extra leading dimension. Or

X[:, None]

Adds an extra trailing dimension. But do whatever you prefer, and then I'll have a look at the shaping stuff broadly. I doubt we want to do all of this reshaping in the first place if we can avoid it.
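
A quick illustration of the three idioms (plain numpy, nothing PR-specific):

import numpy as np
x = np.arange(3)          # shape (3,)
x.reshape(1, -1).shape    # (1, 3); -1 infers the remaining dimension
x[None, :].shape          # (1, 3); adds a leading axis
x[:, None].shape          # (3, 1); adds a trailing axis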

j-grana6
Skipper Seabold
Owner

No I wouldn't. There are a few bugs and warts in recent numpy. Things that work on the bleeding edge should work there as well. I'm going to look into the copy, but I suspect something else is going on other than the copy.

j-grana6
Skipper Seabold
Owner

Just the one I mentioned about vstack vs concatenate. That breaks for me.

np.concatenate(([1,2,3], [1,2,3]), axis=1)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-9-aef62b18b305> in <module>()
----> 1 np.concatenate(([1,2,3], [1,2,3]), axis=1)

IndexError: axis 1 out of bounds [0, 1)

I suspect you want

np.vstack(([1,2,3], [1,2,3]))

Which lets you get rid of the reshape.

j-grana6
Skipper Seabold
Owner

Yeah, that's odd, because neither of those lists has an axis 1 to concatenate on.

Skipper Seabold
Owner

Ok the tests pass now, but I'm concerned about speed. Can you pull out and vectorize the log_star stuff then I'll see if there's anything else that can be done.

j-grana6
Skipper Seabold
Owner

Tests went from 120s to 20s. Behold the power of vectorization. I'm going to see if there's any other obvious sticking points.

Skipper Seabold
Owner

Thanks, this is very helpful.

statsmodels/emplike/descriptive.py
((519 lines not shown))
  519 +        endog = self.endog
  520 +        nobs = self.nobs
  521 +        # if trans_data is None:
  522 +        #     self.data = self.endog
  523 +        # else:
  524 +        #     self.data = trans_data
  525 +        eta_min = (1. - (1. / nobs)) / (self.mu0 - max(endog))
  526 +        eta_max = (1. - (1. / nobs)) / (self.mu0 - min(endog))
  527 +        eta_star = optimize.brentq(self._find_eta, eta_min, eta_max)
  528 +        new_weights = (1. / nobs) * \
  529 +            1. / (1. + eta_star * (endog - self.mu0))
  530 +        llr = -2 * np.sum(np.log(nobs * new_weights))
  531 +        if print_weights:
  532 +            return 1 - chi2.cdf(llr, 1), llr, new_weights
  533 +        else:
  534 +            return 1 - chi2.cdf(llr, 1), llr
Skipper Seabold Owner
jseabold added a note

You can use the survival function instead of 1 - chi2.cdf. So chi2.sf(llr, 1). Sometimes it will be more accurate.

It's also a bit of an unwritten rule to return test statistic, p-value rather than the other way around.
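
A small demonstration of the accuracy point (llr = 200 is an arbitrary far-tail value):

from scipy.stats import chi2
llr = 200.
1 - chi2.cdf(llr, 1)   # 0.0 -- the subtraction loses the tail entirely
chi2.sf(llr, 1)        # ~2.1e-45, computed directly in the tail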

statsmodels/emplike/descriptive.py
((897 lines not shown))
  897 +
  898 +        if var_max is not None:
  899 +            var_ub = var_max
  900 +        else:
  901 +            var_ub = var_ci[1]
  902 +
  903 +        llr = optimize.fmin_l_bfgs_b(self._opt_skew, start_nuisance,
  904 +                                     approx_grad=1,
  905 +                                     bounds=[(mu_lb, mu_ub),
  906 +                                             (var_lb, var_ub)])[1]
  907 +        p_val = 1 - chi2.cdf(llr, 1)
  908 +        if print_weights:
  909 +            return p_val, llr, self.new_weights.T
  910 +        return p_val, llr
  911 +
  912 +    def hy_test_kurt(self, kurt0, nuis0=None, mu_min=None,
Skipper Seabold Owner
jseabold added a note

Very nit-picky comment, but maybe the hy_ isn't needed. Just test_var, test_kurt, etc. I think hypothesis is implied.

statsmodels/emplike/descriptive.py
((5 lines not shown))
  5 +Last Updated: 25 July 2012
  6 +
  7 +General References:
  8 +------------------
  9 +
  10 +Owen, A. (2001). "Empirical Likelihood." Chapman and Hall
  11 +
  12 +
  13 +TODO: Write functions for estimationg equations instead
  14 +      of generating the data every time a hypothesis
  15 +      test is called.
  16 +"""
  17 +import numpy as np
  18 +from scipy import optimize
  19 +from scipy.stats import chi2, skew, kurtosis
  20 +from matplotlib import pyplot as plt
Skipper Seabold Owner
jseabold added a note

This should be wrapped in a try/except block since matplotlib is a soft dependency. You can also delay this import to the method that uses it.
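
E.g., one way to do the delayed import (a sketch; only the try/except pattern matters, the message is made up):

def var_p_plot(self, lower, upper, step, sig=.05):
    # delay the import so matplotlib stays a soft dependency
    try:
        from matplotlib import pyplot as plt
    except ImportError:
        raise ImportError('matplotlib is needed for var_p_plot')
    # ... plotting code as before ...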

statsmodels/emplike/descriptive.py
((740 lines not shown))
  740 +        if upper_bound is not None:
  741 +            ul = upper_bound
  742 +        else:
  743 +            ul = ((self.nobs - 1) * endog.var()) / \
  744 +                (chi2.ppf(.0001, self.nobs - 1))
  745 +        if lower_bound is not None:
  746 +            ll = lower_bound
  747 +        else:
  748 +            ll = ((self.nobs - 1) * endog.var()) / \
  749 +                (chi2.ppf(.9999, self.nobs - 1))
  750 +        self.r0 = chi2.ppf(1 - sig, 1)
  751 +        ll = optimize.brentq(self._ci_limits_var, ll, endog.var())
  752 +        ul = optimize.brentq(self._ci_limits_var, endog.var(), ul)
  753 +        return ll, ul
  754 +
  755 +    def var_p_plot(self, lower, upper, step, sig=.05):
Skipper Seabold Owner
jseabold added a note

This should take an existing axis as a keyword and return the figure. See this function for how we work with matplotlib by convention

https://github.com/statsmodels/statsmodels/blob/master/statsmodels/graphics/factorplots.py#L6

The preferred way to work interactively with matplotlib is something like this

import matplotlib.pyplot as plt
fig = plt.figure() # only use of plt ever except maybe show
ax = fig.add_subplot(111)
ax.plot(...)
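
statsmodels.graphics.utils.create_mpl_ax wraps exactly this pattern; a sketch of a method using it (the body is schematic):

from statsmodels.graphics import utils

def var_p_plot(self, lower, upper, step, sig=.05, ax=None):
    fig, ax = utils.create_mpl_ax(ax)   # makes a new figure only if ax is None
    # ... compute p_vals as before, then ax.plot(...) instead of plt.plot(...) ...
    return fig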
statsmodels/emplike/descriptive.py
((773 lines not shown))
  773 +            Will draw a horizontal line at 1- sig. Default= .05
  774 +
  775 +        This function can be helpful when trying to determine limits
  776 +        in the ci_var function.
  777 +        """
  778 +        sig = 1 - sig
  779 +        p_vals = []
  780 +        for test in np.arange(lower, upper, step):
  781 +            p_vals.append(self.test_var(test)[0])
  782 +        p_vals = np.asarray(p_vals)
  783 +        plt.plot(np.arange(lower, upper, step), p_vals)
  784 +        plt.plot(np.arange(lower, upper, step), (1 - sig) * \
  785 +                 np.ones(len(p_vals)))
  786 +        return 'Type plt.show to see the figure'
  787 +
  788 +    def mean_var_contour(self, mu_l, mu_h, var_l, var_h, mu_step,
Skipper Seabold Owner
jseabold added a note

Ditto the above. You probably want methods that do plotting to be prefixed with plot_, so plot_contour here might work? plot_mean_var_contour is a handful to type.

statsmodels/emplike/descriptive.py
((838 lines not shown))
  838 +
  839 +    def test_skew(self, skew0, nuis0=None, mu_min=None,
  840 +                  mu_max=None, var_min=None, var_max=None,
  841 +                  print_weights=False):
  842 +        """
  843 +        Returns -2 ``*`` log_likelihood and p_value for the hypothesized
  844 +        skewness.
  845 +
  846 +        Parameters
  847 +        ----------
  848 +        skew0: float
  849 +            Skewness value to be tested
  850 +
  851 +        Optional
  852 +        --------
  853 +
Skipper Seabold Owner
jseabold added a note

You can get rid of these blank lines

Skipper Seabold
Owner

What about just having a DescStat class that checks for univariate or multivariate and then returns the appropriate class? I still need to dig further to see why exactly univariate isn't just a special case of multivariate, but I'm doing some profiling now. Some method calls take 10 seconds, which seems really unnecessary.
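
Something along these lines, as a sketch (DescStatUV and DescStatMV are assumed names for the two classes):

import numpy as np

def DescStat(endog):
    # dispatch on the shape of the data
    endog = np.asarray(endog)
    if endog.ndim == 1 or endog.shape[1] == 1:
        return DescStatUV(endog)   # univariate methods only
    return DescStatMV(endog)       # multivariate methods only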

statsmodels/emplike/descriptive.py
((65 lines not shown))
  65 +
  66 +        Returns
  67 +        -------
  68 +        J: n x m matrix
  69 +            J'J is the hessian for optimizing
  70 +
  71 +        y: n x 1 array
  72 +            -J'y is the gradient for maximizing
  73 +
  74 +        See Owen pg. 63
  75 +        """
  76 +
  77 +        nobs = self.nobs
  78 +        data = est_vect.T
  79 +        data_star_prime = (1. + np.dot(eta1, data))
  80 +        data_star_doub_prime = np.copy(1. + np.dot(eta1, data))
Skipper Seabold Owner
jseabold added a note

Still think we can get rid of this copy. It's taking time.

Skipper Seabold
Owner

Is multivariate only intended for the bivariate case?

j-grana6
Skipper Seabold
Owner

Hmm, then maybe these should be functions? I don't like a class that has methods that only work sometime.

Skipper Seabold
Owner

It would also save a lot of this slicing business if you restrict endog to be nobs x 2, so you're not doing things like endog[:, 0].var(), endog[:, 1].var(). This is inefficient; var and mean should just be called once.

statsmodels/emplike/descriptive.py
((930 lines not shown))
  930 +            If True, function also returns the weights that
  931 +            maximize the likelihood ratio. Default = False.
  932 +
  933 +        Returns
  934 +        --------
  935 +
  936 +        test_results: tuple
  937 +            The log-likelihood ratio and p_value of `kurt0`
  938 +        """
  939 +        self.kurt0 = kurt0
  940 +        if nuis0 is not None:
  941 +            start_nuisance = nuis0
  942 +        else:
  943 +            start_nuisance = np.array([self.endog.mean(),
  944 +                                       self.endog.var()])
  945 +        if mu_min is not None:
Skipper Seabold Owner
jseabold added a note

It would make the code shorter and easier to read if you didn't rename these (I'm also thinking they should just be one argument for the mean, not mu1 and mu2), but something like

if mu_min is None:
    mu_min = self.ci_mean()[0] # why slice here?

Then you're not having multiple names for the same thing.

statsmodels/emplike/descriptive.py
((1,402 lines not shown))
  1402 +        endog = self.endog
  1403 +        if endog.shape[1] != 2:
  1404 +            raise Exception('Correlation matrix not yet implemented')
  1405 +
  1406 +        self.corr0 = corr0
  1407 +
  1408 +        if nuis0 is not None:
  1409 +            start_nuisance = nuis0
  1410 +        else:
  1411 +            start_nuisance = np.array([endog[:, 0].mean(),
  1412 +                                       endog[:, 0].var(),
  1413 +                                       endog[:, 1].mean(),
  1414 +                                       endog[:, 1].var()])
  1415 +
  1416 +        if mu1_min is not None:
  1417 +            mu1_lb = mu1_min
Skipper Seabold Owner
jseabold added a note

The way I read this, it's part of a nested optimization for ci_corr. Is there any way to avoid going through all these if statements each time it's called, since we don't really need to? Right now, just calling ci_corr takes close to 10 seconds and involves 1.6 million (!) function calls. Anything we can do to avoid unnecessary calculations is going to greatly speed things up. Even if ci_corr doesn't call test_corr and repeats some code, since the latter might be (?) called just by a user with the outer optimization step.

j-grana6
j-grana6 added a note
Skipper Seabold Owner
jseabold added a note

Sure. One suggestion is to make test_corr a function _test_corr where the arguments are not optional. Then test_corr the method can call this function with sensible defaults. And ci_corr can call _test_corr giving default values. This goes for all the other methods that follow the same pattern.

j-grana6

It's probably not very important, but should this function be in the _OptFuncts class with the rest of the functions that are optimized over to get confidence intervals?

statsmodels/emplike/descriptive.py
((15 lines not shown))
  15 +      test is called.
  16 +"""
  17 +import numpy as np
  18 +from scipy import optimize
  19 +from scipy.stats import chi2, skew, kurtosis
  20 +from statsmodels.base.model import _fit_mle_newton
  21 +import itertools
  22 +from statsmodels.graphics import utils
  23 +
  24 +
  25 +def _test_corr(corr0, self, nuis0, mu1_lb, mu1_ub, mu2_lb, mu2_ub, var1_lb,
  26 +               var1_ub, var2_lb, var2_ub, endog, nobs, x0, weights0, r0):
  27 +    """
  28 +    Parameters
  29 +    ---------
  30 +
Josef Perktold Owner
josef-pkt added a note

There are several headers that have the wrong number of underline characters. They will cause a reST error during the doc build.

Recommended: run make html to build the docs and check for errors.

j-grana6
j-grana6 added a note
Josef Perktold Owner
josef-pkt added a note

Yes, open a shell in the docs directory and do make html, which runs the Sphinx build of the docs. Depending on the setup, there should be none or only a few warnings.

statsmodels/emplike/descriptive.py
((17 lines not shown))
  17 +import numpy as np
  18 +from scipy import optimize
  19 +from scipy.stats import chi2, skew, kurtosis
  20 +from statsmodels.base.model import _fit_mle_newton
  21 +import itertools
  22 +from statsmodels.graphics import utils
  23 +
  24 +
  25 +def _test_corr(corr0, self, nuis0, mu1_lb, mu1_ub, mu2_lb, mu2_ub, var1_lb,
  26 +               var1_ub, var2_lb, var2_ub, endog, nobs, x0, weights0, r0):
  27 +    """
  28 +    Parameters
  29 +    ---------
  30 +
  31 +    corr0: float
  32 +        Hypothesized vaalue for the correlation.
Josef Perktold Owner
josef-pkt added a note

vaaaaaalue, quite a few typos with spelling mistakes in docstrings

statsmodels/emplike/descriptive.py
((34 lines not shown))
  34 +    Returns
  35 +    -------
  36 +
  37 +    diff: float
  38 +        Difference between log-likelihood of corr0 and a pre-specified
  39 +        value, r0
  40 +    """
  41 +    bounds = [(mu1_lb, mu1_ub), (var1_lb, var1_ub), (mu2_lb, mu2_ub),
  42 +              (var2_lb, var2_ub)]
  43 +    args = (corr0, endog, nobs, x0, weights0)
  44 +    llr = optimize.fmin_l_bfgs_b(self._opt_correl, nuis0, approx_grad=1,
  45 +                                 bounds=bounds, args=args)[1]
  46 +    return llr - r0
  47 +
  48 +
  49 +def _log_star(eta1, est_vect, wts, nobs):
Josef Perktold Owner
josef-pkt added a note

When I'm reading the code, there are many names that are not understandable without the book.
For example, a brief summary of the terminology/names in the module docstring would be helpful. (developer oriented to help understand the code)

missing description line

statsmodels/emplike/descriptive.py
((44 lines not shown))
  44 +    llr = optimize.fmin_l_bfgs_b(self._opt_correl, nuis0, approx_grad=1,
  45 +                                 bounds=bounds, args=args)[1]
  46 +    return llr - r0
  47 +
  48 +
  49 +def _log_star(eta1, est_vect, wts, nobs):
  50 +    """
  51 +    Parameters
  52 +    ---------
  53 +    eta1: float
  54 +        Lagrangian multiplier
  55 +
  56 +    est_vect: nxk array
  57 +        Estimating equations vector
  58 +
  59 +    wts: nx1 array
Josef Perktold Owner
josef-pkt added a note

full name weights

statsmodels/emplike/descriptive.py
((41 lines not shown))
  41 +    bounds = [(mu1_lb, mu1_ub), (var1_lb, var1_ub), (mu2_lb, mu2_ub),
  42 +              (var2_lb, var2_ub)]
  43 +    args = (corr0, endog, nobs, x0, weights0)
  44 +    llr = optimize.fmin_l_bfgs_b(self._opt_correl, nuis0, approx_grad=1,
  45 +                                 bounds=bounds, args=args)[1]
  46 +    return llr - r0
  47 +
  48 +
  49 +def _log_star(eta1, est_vect, wts, nobs):
  50 +    """
  51 +    Parameters
  52 +    ---------
  53 +    eta1: float
  54 +        Lagrangian multiplier
  55 +
  56 +    est_vect: nxk array
Josef Perktold Owner
josef-pkt added a note

vect is not really informative as part of a name; most of our things are vectors (i.e. arrays).

edit:
Unless there is an ambiguity with estimating equations as a function. I still don't find the name est_vect very attractive.

Josef Perktold Owner
josef-pkt added a note

docstring standard

est_vect: ndarray, (n,k)

ndarray if you explicitly require numpy arrays and not other iterables.

I prefer numpy shape notation, (n,k) but I don't know if this is numpy doc standard

statsmodels/emplike/descriptive.py
((69 lines not shown))
  69 +    ----
  70 +
  71 +    This function is really only a placeholder for the _fit_mle_Newton.
  72 +    The function value is not used in optimization and the optimal value
  73 +    is disregarded when computng the log likelihood ratio.
  74 +    """
  75 +    data_star = np.log(wts) + (np.sum(wts) + np.dot(est_vect, eta1))
  76 +    idx = data_star < 1. / nobs
  77 +    not_idx = ~idx
  78 +    nx = nobs * data_star[idx]
  79 +    data_star[idx] = np.log(1. / nobs) - 1.5 + nx * (2. - nx / 2)
  80 +    data_star[not_idx] = np.log(data_star[not_idx])
  81 +    return data_star
  82 +
  83 +
  84 +def DescStat(endog):
Josef Perktold Owner
josef-pkt added a note

PEP 8: function names are lowercase, although I'm not sure in this case.

statsmodels/emplike/descriptive.py
((98 lines not shown))
  98 +
  99 +    def __init__(self, endog):
  100 +        super(_OptFuncts, self).__init__(endog)
  101 +
  102 +    def _hess(self, eta1, est_vect, wts, nobs):
  103 +        """
  104 +        Calculates the hessian of a weighted empirical likelihood
  105 +        provlem.
  106 +
  107 +        Parameters
  108 +        ----------
  109 +        eta1: 1xm array.
  110 +
  111 +            Value of lamba used to write the
  112 +            empirical likelihood probabilities in terms of the lagrangian
  113 +            multiplier.
Josef Perktold Owner
josef-pkt added a note

Indent?

Josef Perktold Owner
josef-pkt added a note

or is eta1 the value of lambda ? (both sound Greek to me)

statsmodels/emplike/descriptive.py
((95 lines not shown))
  95 +    A class that holds functions that are optimized/solved. Not
  96 +    intended for the end user.
  97 +    """
  98 +
  99 +    def __init__(self, endog):
  100 +        super(_OptFuncts, self).__init__(endog)
  101 +
  102 +    def _hess(self, eta1, est_vect, wts, nobs):
  103 +        """
  104 +        Calculates the hessian of a weighted empirical likelihood
  105 +        provlem.
  106 +
  107 +        Parameters
  108 +        ----------
  109 +        eta1: 1xm array.
  110 +
Josef Perktold Owner
josef-pkt added a note

Lagrange multiplier ?

Josef Perktold
Owner

What's the relationship between the different emplike branches?

My impression is that some of the things I commented on might already be cleaned up.
The IVregression branch uses a different descriptive:

from descriptive2 import _OptFuncts

j-grana6
Josef Perktold
Owner

However, reviewing descriptive then doesn't make sense if it's not the latest version.
Are all changes from this branch included in another branch?

j-grana6
Josef Perktold
Owner

Ok, I understand. I will look more at the descriptive branch later today.

statsmodels/emplike/descriptive.py
((1,351 lines not shown))
  1351 +        pairs = itertools.product(x, y)
  1352 +        z = []
  1353 +        for i in pairs:
  1354 +            z.append(self.mv_test_mean(np.asarray(i))[0])
  1355 +        X, Y = np.meshgrid(x, y)
  1356 +        z = np.asarray(z)
  1357 +        z = z.reshape(X.shape[1], Y.shape[0])
  1358 +        ax.contour(x, y, z.T, levels=levs)
  1359 +        if plot_dta:
  1360 +            ax.plot(self.endog[:, 0], self.endog[:, 1], 'bo')
  1361 +        return fig
  1362 +
  1363 +    def test_corr(self, corr0, nuis0=None, mu1_lb=None,
  1364 +                  mu1_ub=None, mu2_lb=None, mu2_ub=None,
  1365 +                  var1_lb=None, var1_ub=None,
  1366 +                  var2_lb=None, var2_ub=None, return_weights=0):
Josef Perktold Owner
josef-pkt added a note

that's a long list of kwds, and we cannot use **my_bounds if return_weights is set

maybe combine the bounds into a dict; see also the comment below on boilerplate
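
E.g., a hypothetical signature along those lines (all names illustrative):

def test_corr(self, corr0, nuis0=None, bounds=None, return_weights=False):
    # default bounds for the four nuisance parameters, overridable per call
    _bounds = dict(mu1=(None, None), var1=(None, None),
                   mu2=(None, None), var2=(None, None))
    if bounds is not None:
        _bounds.update(bounds)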

statsmodels/emplike/descriptive.py
((1,176 lines not shown))
  1176 +            ll = kurtosis(endog) - \
  1177 +                (2.5 * (2. * ((6. * nobs * (nobs - 1.)) / \
  1178 +                ((nobs - 2.) * (nobs + 1.) * \
  1179 +                (nobs + 3.))) ** .5) * \
  1180 +                (((nobs ** 2.) - 1.) / ((nobs - 3.) *\
  1181 +                (nobs + 5.))) ** .5)
  1182 +        # Need to calculate variance and mu limits here to avoid
  1183 +        # recalculating at every iteration in the maximization.
  1184 +        if (var_max is None) or (var_min is None):
  1185 +            var_lims = self.ci_var(sig=sig)
  1186 +        if (mu_max is None) or (mu_min is None):
  1187 +            mu_lims = self.ci_mean(sig=sig)
  1188 +        if var_min is not None:
  1189 +            self.var_l = var_min
  1190 +        else:
  1191 +            self.var_l = var_lims[0]
Josef Perktold Owner
josef-pkt added a note

if they are attached here as temporary variables to make them available in the iterations, then I would not leave them around as attributes for the user.

for example, I have used something like self._temp as an attribute to hold the values during iteration.
if they are for the end user, to know what was used in the program, it would need an indication that it was used for ci_kurt and not for any other method.

using dicts with dict updating sounds easier here, and lazy/cached attributes for the defaults?

statsmodels/emplike/descriptive.py
((813 lines not shown))
  813 +        var_vect = list(np.arange(var_l, var_h, var_step))
  814 +        z = []
  815 +        for sig0 in var_vect:
  816 +            self.sig2_0 = sig0
  817 +            for mu0 in mu_vect:
  818 +                z.append(self._opt_var(mu0, pval=True))
  819 +        z = np.asarray(z).reshape(len(var_vect), len(mu_vect))
  820 +        ax.contour(mu_vect, var_vect, z, levels=levs)
  821 +        return fig
  822 +
  823 +    ## TODO: Use gradient and Hessian to optimize over nuisance params
  824 +    ## TODO: Use non-nested optimization to optimize over nuisance
  825 +    ##       parameters. See Owen pgs 234- 241
  826 +
  827 +    def test_skew(self, skew0, nuis0=None, mu_min=None,
  828 +                  mu_max=None, var_min=None, var_max=None,
Josef Perktold Owner
josef-pkt added a note

What's the difference between mu and a mean?
If mu always refers to the mean (expected value), then I would call it mean, so I don't have to guess whether it's something else.

statsmodels/emplike/descriptive.py
((775 lines not shown))
  775 +                          var_step,
  776 +                          levs=[.2, .1, .05, .01, .001]):
  777 +        """
  778 +        Returns a plot of the confidence region for a univariate
  779 +        mean and variance.
  780 +
  781 +        Parameters
  782 +        ----------
  783 +
  784 +        mu_l: float
  785 +            Lowest value of the mean to plot
  786 +
  787 +        mu_h: float
  788 +            Highest value of the mean to plot
  789 +
  790 +        var_l: float
Josef Perktold Owner
josef-pkt added a note

not so nice, add some letters

what I like most of the time is to use low and upp, because they both have 3 letters and we don't use them in any context other than limits or bounds

statsmodels/emplike/descriptive.py
((989 lines not shown))
  989 +
  990 +        Returns
  991 +        --------
  992 +
  993 +        test_results: tuple
  994 +            The log-likelihood ratio and p_value of the joint hypothesis test.
  995 +        """
  996 +        self.kurt0 = kurt0
  997 +        self.skew0 = skew0
  998 +        if nuis0 is not None:
  999 +            start_nuisance = nuis0
  1000 +        else:
  1001 +            start_nuisance = np.array([self.endog.mean(),
  1002 +                                       self.endog.var()])
  1003 +        if mu_min is not None:
  1004 +            mu_lb = mu_min
Josef Perktold Owner
josef-pkt added a note

still redundant renaming, settle on one name

Josef Perktold josef-pkt commented on the diff
statsmodels/emplike/descriptive.py
((1,166 lines not shown))
  1166 +        else:
  1167 +            ul = kurtosis(endog) + \
  1168 +                (2.5 * (2. * ((6. * nobs * (nobs - 1.)) / \
  1169 +                ((nobs - 2.) * (nobs + 1.) * \
  1170 +                (nobs + 3.))) ** .5) * \
  1171 +                (((nobs ** 2.) - 1.) / ((nobs - 3.) *\
  1172 +                (nobs + 5.))) ** .5)
  1173 +        if lower_bound is not None:
  1174 +            ll = lower_bound
  1175 +        else:
  1176 +            ll = kurtosis(endog) - \
  1177 +                (2.5 * (2. * ((6. * nobs * (nobs - 1.)) / \
  1178 +                ((nobs - 2.) * (nobs + 1.) * \
  1179 +                (nobs + 3.))) ** .5) * \
  1180 +                (((nobs ** 2.) - 1.) / ((nobs - 3.) *\
  1181 +                (nobs + 5.))) ** .5)
Josef Perktold Owner
josef-pkt added a note

I don't like this code repetition much. One possibility is to use the change in the if structure as in the other method, or to use a default cache (more on that separately).

also this will be calculated repeatedly if both bounds are None

statsmodels/emplike/descriptive.py
((1,159 lines not shown))
  1159 +        If function returns f(a) and f(b) must have different signs, consider
  1160 +        expanding lower and upper bound.
  1161 +        """
  1162 +        endog = self.endog
  1163 +        nobs = self.nobs
  1164 +        if upper_bound is not None:
  1165 +            ul = upper_bound
  1166 +        else:
  1167 +            ul = kurtosis(endog) + \
  1168 +                (2.5 * (2. * ((6. * nobs * (nobs - 1.)) / \
  1169 +                ((nobs - 2.) * (nobs + 1.) * \
  1170 +                (nobs + 3.))) ** .5) * \
  1171 +                (((nobs ** 2.) - 1.) / ((nobs - 3.) *\
  1172 +                (nobs + 5.))) ** .5)
  1173 +        if lower_bound is not None:
  1174 +            ll = lower_bound
Josef Perktold Owner
josef-pkt added a note

ll reads like 11; if you use short names, then ub and lb are better, but I like longer names.

Also my impression is that names and abbreviations change from method to method.

Josef Perktold
Owner

I haven't tried to work my way through the calculations yet.

The main thing I would change is to try to get rid of some of the boilerplate in the bounds handling. What I think should work is attaching defaults to the instance, and working with attached dictionaries for the options that are actually used in the iterations.

mean, variance, skew and kurtosis could be cached attributes of the instance directly.
Then I would add a cache for the bounds for mean, var, skew and kurtosis, that can then be used in any of the methods.
I guess bounds for skew and kurtosis should be lazy. (I'm not sure what the easiest way to do that is: a lazy dict, an instance, an extra class with cached attributes.)

Then I would just try to do a dict update in the method. The easiest would be if the bounds were just given in a single dict, as options. Otherwise, it would still require going through all arguments explicitly.

I think we could save the optimization options, including bounds, in a method-specific attribute, e.g. self.kurt_options = dict(....). They are small, so we wouldn't have to worry about memory, and they would be overwritten by a second call to the same method with different arguments.


Given what looks like good test coverage, I think this can be merged soon, with more improvements made later.
The main thing that should be settled is the method signatures.
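
For the lazy/cached defaults, statsmodels already has cache_readonly in statsmodels.tools.decorators; a sketch of caching one set of bounds (ci_kurt and the attribute name are assumptions, not code in this PR):

from statsmodels.tools.decorators import cache_readonly

class DescStat(_OptFuncts):

    @cache_readonly
    def kurt_bounds(self):
        # computed on first access only, then cached on the instance
        return self.ci_kurt()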

j-grana6
Josef Perktold
Owner

One simpler case for options in a dict that I just used

change the default for the options in a call, but let the users overwrite them
https://github.com/josef-pkt/statsmodels/blob/4157ca005d35092f66115b2e9b48eb1062a2f7b0/statsmodels/robust/least_trimmed_squares.py#L417

your case is a bit more complicated because you need to precalculate values, and you don't need all bounds at the same time.

j-grana6
Josef Perktold
Owner

I don't know the optimization code and don't know if it applies.
In general, an advantage of the bounds might also be that it's more robust in "strange" cases, cases with numerical problems.

j-grana6
statsmodels/emplike/descriptive.py
((451 lines not shown))
  451 + """
  452 + Parameters
  453 + ----------
  454 + nuis_params: 1darray
  455 + array containing two nuisance means and two nuisance variances
  456 +
  457 + Returns
  458 + -------
  459 + llr: float
  460 + The log-likelihood of the correlation coefficient holding nuisance
  461 + parameters constant
  462 + """