REF: refactor and speedup of mixed LME #1940

Merged
merged 10 commits into statsmodels:master from kshedden:lme-docstring on Sep 18, 2014


3 participants

@kshedden
Contributor

This PR has turned into a refactoring to improve the structure of the code and to speed it up.

See the comments below for details, and the mailing list thread that triggered it:
https://groups.google.com/d/msg/pystatsmodels/KXF3CxqYZcI/ZV5m_1ENHTkJ

The original description was:
Fix one docstring.
Fix a bug that was preventing the history from being returned.

kshedden added some commits Aug 29, 2014
@kshedden kshedden Fix bug preventing history from being returned f4e7e68
@kshedden kshedden Fixed typo
cdae1c7
@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling cdae1c7 on kshedden:lme-docstring into 9ce1605 on statsmodels:master.

@kshedden kshedden Fix printing of random effects names in summary
dd4beb8
@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling dd4beb8 on kshedden:lme-docstring into 9ce1605 on statsmodels:master.

kshedden added some commits Aug 30, 2014
@kshedden kshedden Minor optimization for no_smw case
c7b6e26
@kshedden kshedden Optimize a bit by precomputing ex_re^2
f027147
@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling f027147 on kshedden:lme-docstring into 9ce1605 on statsmodels:master.

@josef-pkt
Member

an observation on the timing:

The largest amount of time is spent in hessian, the complex derivative hessian is very slow.

>>> t2 = time.time();hess0 = res_m.model.hessian_full(res_m.params);t3 = time.time(); t3 - t2
0.03299999237060547
>>> t2 = time.time();hess0 = res_m.model.hessian(res_m.params);t3 = time.time(); t3 - t2
1.2950000762939453

As far as I can see, you don't use it at all, and it might be called automatically in super().fit() (I haven't verified this).

Can hessian be replaced by hessian_full?

I renamed hessian so it is not found. I get the warning that hessian is not available, but bse and everything else is, since you use hessian_full in fit.
Calculation time drops to about a third.

@josef-pkt
Member

@kshedden

about the hessian and loglike_full versus loglike_sqrt parameterization:

As far as I understand MixedLM, in the super().fit() call you use the full parameterization. If this is correct, then define hessian to just call hessian_full.
In general, hessian should have the same dispatch as loglike depending on self.use_sqrt.

You can separate which parameterization to use from the one you use to estimate, as you are doing for em, as long as you set self.use_sqrt to the one that is reported before the final calculations.

But we need a keyword in super().fit to turn off the automatic hessian calculation, which is wrong for the final parameterization (if a different parameterization is used during optimization); see #1943, which I just opened.

@josef-pkt
Member

correction: super().fit with "bfgs" also uses self.use_sqrt, so the super hessian is useless in this case.

@josef-pkt
Member

The only other interesting change I made to MixedLM fit during my experimentation is to allow users to override method="bfgs"; at least "lbfgs" should also work well.

"newton" and some others require that hessian is also defined for self.use_sqrt=True, which could just use the numerical derivative if necessary, and its use could be discouraged in the docstring in that case.

@kshedden
Contributor

Thanks for looking into this, I've been looking at it too.

One thing I've turned up is that I'm pretty sure there is a bug in the
calculation of the hessian for cov_re (lower right corner of the hessian).
The rest of the hessian agrees very well with the numerical hessian, but
this part is way off, even in simple models. The calculation is quite
complex. I've gone over it several times and can't see the issue yet.
It's also possible that I am misinterpreting a formula from Bates' paper.

Since we don't report standard errors for variance parameters (similar to
R), this doesn't trip any of the tests.

If the Hessian is wrong, this would explain why the optimization methods
that use the Hessian don't work. We might see a bit of a performance gain
if I can get this fixed.

I also have code to calculate the analytic hessian for the square root
parameterization of the covariance, eliminating the need for the numerical
derivative. But this depends on the other hessian being correct, so I need
to figure out what's going on there first.


@josef-pkt
Member

I thought initially the difference in the hessian in the last value is because of the sqrt transform.

with use_sqrt=False, and changing hessian to

        if self.use_sqrt:
            from statsmodels.tools.numdiff import approx_hess_cs
            hmat = approx_hess_cs(params, self.loglike).real
        else:
            hmat = self.hessian_full(params)
        return hmat

newton converges to almost the same values as bfgs except for the last params

>>> res_m.params - res_m2.params
array([  9.96870698e-07,  -2.92361474e-07,  -2.21537702e-08,
         2.14196420e-08,   9.34700644e-08,   6.99729626e-08,
         5.62599506e-08,   2.33877762e-08,   4.91153204e-08,
        -7.94121182e-08,  -3.65191339e+05])

but it's quite a bit slower than bfgs

One possibility that I used in some draft versions for nonlinear robust M-estimators is to use the analytical gradient and hessian for the mean part of the model, and numerical derivatives for the scale estimator.
In this case you already have the score. If that is correct, then you would just need first numerical derivatives of it with respect to the covariance parameters. (BTW: cs derivatives are more accurate but slower than regular forward differences.)

Follow-up question:
Is the hessian (asymptotically) block-diagonal between the fixed effects and the random effects or the covariance of the random effects?
If not, then a mistake in the covariance of the random-effects parameters would also affect the fixed-effects parameters or their covariance. (We most likely need to distinguish the re params from the re cov here.)
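The hybrid Josef sketches here — analytic score, numerical second derivatives only for the covariance block — can be shown in a few lines. This is a generic sketch, not code from this PR; the helper name is made up, and the toy loglike is quadratic so the forward difference is exact:

```python
import numpy as np

def hess_block_from_score(score, params, cov_idx, eps=1e-6):
    """Forward-difference an analytic score function only with respect
    to the covariance parameters (indices in cov_idx); the mean-part
    derivatives stay analytic."""
    base = score(params)
    cols = []
    for j in cov_idx:
        pp = np.array(params, dtype=float)
        pp[j] += eps
        cols.append((score(pp) - base) / eps)
    return np.column_stack(cols)

# toy check: loglike -0.5 p'Ap has score -A p and hessian -A
A = np.array([[2.0, 0.3], [0.3, 1.0]])
toy_score = lambda p: -A @ p
block = hess_block_from_score(toy_score, np.zeros(2), cov_idx=[1])
# block recovers the second column of -A: [-0.3, -1.0]
```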

@josef-pkt
Member

correction to last:

newton might just get stuck in a different local approximate optimum. It converges to essentially the same as bfgs, if we start close to it.

(res_m is with 'bfgs', res_m3 is with 'newton')

>>> t2 = time.time(); res_m3 = m.fit(start_params=res_m.params* 0.9, method='newton', reml=False, niter_sd=1); t3 = time.time(); t3 - t2
{'method': 'newton'}
{'retall': False, 'disp': False, 'method': 'newton'}
0.38000011444091797
>>> res_m3.params - res_m.params
array([ -9.96809593e-07,   2.92347037e-07,   2.21472940e-08,
        -2.14196313e-08,  -9.34700641e-08,  -6.99729521e-08,
        -5.62599507e-08,  -2.33877656e-08,  -4.91153203e-08,
         7.94083804e-08,   3.78990097e-07])

score at optimum with newton is smaller than with bfgs

>>> res_m.model.score(res_m.params)
array([ -7.25765191e-06,   3.89320872e-05,   1.08448599e-06,
         1.31684694e-05,  -3.06005474e-05,  -1.63265804e-05,
        -7.99624064e-06,   1.19728722e-05,  -3.65603849e-06,
        -4.00316610e-05,   9.75708543e-06])
>>> res_m.model.score(res_m3.params)
array([ -4.51965484e-14,  -1.87343206e-14,  -1.21773084e-14,
         0.00000000e+00,  -8.19626525e-15,   4.19180423e-14,
         9.60133929e-15,   1.14747713e-14,  -6.58628457e-14,
        -4.66484582e-13,   7.10542736e-15])
@josef-pkt
Member

another issue

the complex step hessian is wrong, I guess

>>> t2 = time.time();hess = nd.approx_hess(res_m.params, res_m.model.loglike);t3 = time.time(); t3 - t2
2.1429998874664307
>>> hess0 = m.hessian_full(res_m.params)
>>> np.max(np.abs(hess - hess0))
0.0029088464943924919
>>> t2 = time.time();hess_cs = nd.approx_hess_cs(res_m.params, res_m.model.loglike);t3 = time.time(); t3 - t2
1.2810001373291016
>>> np.max(np.abs(hess_cs - hess0))
30.035771510610612


>>> np.allclose(hess, hess0, rtol=1e-4, atol=1e-4)
True

So to me your hessian_full looks correct. But in the calculation of the loglike there is something that messes up complex values.
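As an aside (not code from this PR): the failure mode described here is easy to reproduce. Complex-step differentiation is only exact if the function propagates complex values faithfully; any `.real`, `abs()`, or dtype cast inside it silently corrupts the result. A minimal sketch:

```python
import numpy as np

def complex_step_deriv(f, x, h=1e-20):
    """Complex-step first derivative: machine-precision accurate,
    but only if f propagates complex perturbations faithfully."""
    return f(x + 1j * h).imag / h

def f_clean(x):
    return np.exp(x) * np.sin(x)

def f_broken(x):
    # discarding the imaginary part anywhere inside the function
    # (here via np.real) drops part of the derivative information
    return np.exp(np.real(x)) * np.sin(x)

x0 = 0.7
exact = np.exp(x0) * (np.sin(x0) + np.cos(x0))
err_clean = abs(complex_step_deriv(f_clean, x0) - exact)   # ~machine eps
err_broken = abs(complex_step_deriv(f_broken, x0) - exact)  # O(1) error
```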

@kshedden
Contributor

Thanks, that's great. I will continue to try to get the analytic hessian
working for the square-root transform, but it's not critical now.


@josef-pkt
Member

based on master, without the changes in this PR
at 0.87 seconds for the fit, with 0.77 in bfgs, I don't see much that could reduce the time without restructuring the code or adding special cases.
most of the time is taken up by
_smw_solve total 0.518s for 9576 calls to it, and
_smw_logdet with 0.165s and 2520 calls to it

I tried to use qmat = scipy.linalg.solve(np.atleast_2d(qmat), u, sym_pos=True, check_finite=False)
but in this case it's slower; qmat and u are just scalars, and scipy.linalg functions have more overhead. Maybe it's faster in larger problems.
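For reference, a solver like `_smw_solve` exploits the Sherman-Morrison-Woodbury identity so that only a small q x q system (q = number of random effects) is factored per group, instead of an n x n one. A generic sketch of that idea — not the PR's implementation, names are illustrative:

```python
import numpy as np

def smw_solve(s, Z, B, rhs):
    """Solve (s*I + Z B Z') x = rhs via Sherman-Morrison-Woodbury:
    only the small q x q matrix qmat = B^{-1} + Z'Z/s is factored."""
    qmat = np.linalg.inv(B) + Z.T @ Z / s
    u = Z.T @ rhs
    return rhs / s - Z @ np.linalg.solve(qmat, u) / s**2

# check against the direct n x n solve
rng = np.random.default_rng(0)
n, q = 8, 2
Z = rng.standard_normal((n, q))
B = np.eye(q)
rhs = rng.standard_normal(n)
s = 1.5
x = smw_solve(s, Z, B, rhs)
direct = np.linalg.solve(s * np.eye(n) + Z @ B @ Z.T, rhs)
# x and direct agree to machine precision
```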

@kshedden
Contributor

Based on comparing just a few test cases, we seem to be doing around 2
times more iterations than R lme4. It could be that our convergence
criteria are stricter, since I know at least one example where R stops too
early and we get a much higher likelihood. It could also be that R's
convergence criteria are smarter and allow it to stop earlier when
appropriate. Or it could be that their optimizers are smarter at choosing
search directions.

I would like to refactor that transformation code that converts gradients
and hessians between the two parameterizations. That would make it easier
to extend the model in the future. In particular, it would be good to
extend the model to allow non-constant variances. But we don't need to
wait on that for 0.6.

Also, there is a more or less complete rewrite needed to accommodate
crossed random effects. This would be based on a different profiled
likelihood in which the mean structure is completely profiled out. This
may be faster but is probably at least 6 months out.


@josef-pkt
Member

Another possible optimization is to improve the start_params.

I think it could start at least with OLS params_fe (post-fix), and maybe initialize the random effects from the fixed effects model with group dummies, or from residual group means if there is a constant.
(Essentially a simple, old-fashioned variance component model to initialize MLE or REML; that's essentially what the panel model in the panel PR provides.)
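A rough sketch of that initialization idea (hypothetical helper, not code from this PR): fixed effects from plain OLS, and a random-intercept variance from the spread of the group-mean residuals.

```python
import numpy as np

def start_params_ols(endog, exog, groups):
    """Hypothetical starting values: OLS for the fixed effects,
    variance of group-mean residuals for a random intercept."""
    beta, *_ = np.linalg.lstsq(exog, endog, rcond=None)
    resid = endog - exog @ beta
    group_means = np.array([resid[groups == g].mean()
                            for g in np.unique(groups)])
    return beta, group_means.var()

# toy random-intercept data: 20 groups of 10 observations
rng = np.random.default_rng(1)
n = 200
exog = np.column_stack([np.ones(n), rng.standard_normal(n)])
groups = np.repeat(np.arange(20), 10)
re = rng.standard_normal(20)[groups]          # group random intercepts
endog = exog @ np.array([1.0, 2.0]) + re + 0.1 * rng.standard_normal(n)
beta0, var0 = start_params_ols(endog, exog, groups)
```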

@josef-pkt
Member

another piece: the effect of start_params. I changed the source to allow niter_sd=0.
Starting at the optimal params:
res = m.fit(start_params=res_params, reml=False, method='bfgs', niter_sd=0)

fit takes 0.065s, of which
0.027s is 1 iteration in bfgs
0.023s is the hessian_full evaluation.

@josef-pkt
Member

I just tried out starting with fe_params = OLS(self.endog, self.exog).fit().params
fit goes down to 0.23 seconds with 7 calls to loglike and to score with niter_sd=0
and to 0.159 seconds with niter_sd=1
that's around 20 or 25 times faster compared to original numbers on the mailing list without the changes in this PR.

@kshedden kshedden Refactor parameter handling, improve starting values
a37bf16
@kshedden
Contributor
kshedden commented Sep 1, 2014

Several updates here (sorry, I should have put these in separate commits):

  • Precompute cross-product matrices used by smw_solve and smw_logdet. Modest speed improvement.
  • Place parameter handling into a separate class that deals with most issues related to conversions among transformations. Will make model extension much easier in the future.
  • Use OLS to get starting values if none are provided. Major speed improvement.
  • Remove steepest descent (no longer needed if OLS starting values are used, also it was causing some problems in some models, easier to remove than to fix).
  • Analytic calculation of the hessian under either model parameterization.
  • Improved numerical tests of score and Hessian.

Remaining issue: the analytic and numeric Hessians do not agree well away from the MLE. This was always true, not a new issue, and I can't find the cause yet.

Passes all tests locally, waiting on Travis.

Based on very limited profiling, this is now around 7x slower than GEE-exchangeable, compared to 65x slower before, so around 10 times faster than before.
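The parameter-handling class itself is not shown in this thread; a minimal sketch of the pack/unpack idea (illustrative names, not the actual MixedLMParams API) is:

```python
import numpy as np

class ParamsPacker:
    """Minimal sketch of a parameter-handling class: one flat vector
    holds the fixed-effects coefficients followed by the lower triangle
    of a covariance square root L, with cov_re = L @ L.T."""

    def __init__(self, k_fe, k_re):
        self.k_fe, self.k_re = k_fe, k_re

    def pack(self, fe_params, cov_sqrt):
        # flatten: fixed effects first, then the lower triangle of L
        return np.concatenate([fe_params,
                               cov_sqrt[np.tril_indices(self.k_re)]])

    def unpack(self, vec):
        # invert pack(): rebuild L and form the covariance L L'
        fe_params = vec[:self.k_fe]
        L = np.zeros((self.k_re, self.k_re))
        L[np.tril_indices(self.k_re)] = vec[self.k_fe:]
        return fe_params, L @ L.T

p = ParamsPacker(k_fe=2, k_re=2)
L = np.array([[1.0, 0.0], [0.5, 2.0]])
vec = p.pack(np.array([3.0, -1.0]), L)
fe, cov = p.unpack(vec)   # round-trips: cov == L @ L.T
```

Keeping the conversions in one place is what makes it cheap to add new parameterizations (or non-constant variances) later.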

@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling a37bf16 on kshedden:lme-docstring into 9ce1605 on statsmodels:master.

@josef-pkt
Member

One quick question based on your summary:

I had to revert OLS start_params because the unit tests failed with it; I had opened #1947.
Did you have any problems, or was this fixed in the general refactoring?

@kshedden
Contributor
kshedden commented Sep 1, 2014

There was an odd thing happening in which the first steepest descent step
moved the variance parameter estimate exactly to 0 in a few cases, which is
a non-differentiable point under the square-root transform of the
covariance. Then the newton-type optimizers are stuck at that point. I'm
not sure why this wasn't happening before. It's no longer an issue if we
don't use steepest descent.

I also dialed back the rtol parameter a bit in one place, line 212. I
don't think that would explain the majority of errors that you saw.

Other than this, all the same tests I had before run clean. I would still
like to run some of my notebooks through the code in the new branch as a
further check.
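The stall at an exact zero described above follows directly from the chain rule (a tiny illustration, not code from the PR):

```python
# Under the square-root transform the variance is v = s**2, so by the
# chain rule dL/ds = dL/dv * 2*s.  At s = 0 this is exactly zero no
# matter how large dL/dv is, so a gradient-based optimizer sees a
# stationary point and stops there.
dL_dv = 3.0                        # some nonzero gradient in the variance scale
grads = [dL_dv * 2 * s for s in (0.5, 0.1, 0.0)]
# grads[-1] is exactly 0.0 even though dL/dv is not
```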


@josef-pkt
Member

two questions:

Would it be useful to keep steepest descent around for cases where the starting values are not good? I don't know when we might run into convergence problems with bfgs and EM.

Is there a test case with unbalanced panel?
based on a quick look, all the random datasets have balanced panels.

I only did a rough browsing through this PR (and might not go over details before merging).
It looks good as far as I can see, and extracting the parameter transformation code into a separate unit/class looks like a good idea.

I guess the only thing I have additionally is to add skip_hessian to super().fit, I prepared a branch but will merge it after this.

This PR is only one commit behind master, so I can just hit the merge button, if you think it's ready.

@kshedden
Contributor
kshedden commented Sep 1, 2014

I'm adding the steepest ascent in now, then you can merge.


kshedden added some commits Sep 1, 2014
@kshedden kshedden Restored steepest ascent, with some modifications 91ad249
@kshedden kshedden Restored steepest ascent, with some improvements
a844982
@kshedden kshedden Add todo about unequal group sizes
27b72e7
@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling 27b72e7 on kshedden:lme-docstring into 9ce1605 on statsmodels:master.


@josef-pkt josef-pkt added a commit to josef-pkt/statsmodels that referenced this pull request Sep 2, 2014
@josef-pkt josef-pkt REF: MixedLM performance, mostly obsolete/superseded by #1940 b62e2d4
@josef-pkt josef-pkt referenced this pull request Sep 2, 2014
Merged

REF/DOC: Misc #1952

@josef-pkt
Member

@kshedden One more question about the different optimization algorithms:
I'm reading parts of a SAS article on influence and outlier diagnostics in Mixed models.
A cheaper alternative for full re-estimation with leave-one-(group)-out, is to do just one updating step to get a one-step approximation to the influence statistics.

Do you know if there is a recommendation with respect to which one-step update: one step of steepest descent, one step of bfgs, one step of newton, and so on?

If we have maxiter=1 available in fit for one full updating step, then it doesn't look too difficult or very time consuming to get the influence and outlier diagnostics.
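Assuming "one step" means one full Newton update from the full-data estimate, the idea looks roughly like this (generic sketch, names made up; on a quadratic toy objective one Newton step lands on the exact optimum):

```python
import numpy as np

def one_step_update(params, score, hessian):
    """One full Newton step from the full-data estimate, the kind of
    update used for one-step approximations to case-deletion
    influence statistics."""
    return params - np.linalg.solve(hessian(params), score(params))

# toy loglike -0.5 p'Ap + b'p: score b - A p, hessian -A; one Newton
# step from anywhere reaches the exact optimum A^{-1} b
A = np.array([[2.0, 0.3], [0.3, 1.0]])
b = np.array([1.0, -1.0])
score = lambda p: b - A @ p
hessian = lambda p: -A
opt = one_step_update(np.zeros(2), score, hessian)
```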

@josef-pkt
Member

to clarify the question: do all the different optimizers do one full updating step with maxiter=1?

@kshedden
Contributor
kshedden commented Sep 2, 2014

I'm just guessing, but I would expect that any theory justifying a one step
update for influence diagnostics would be based on a step in the direction
of the gradient. The idea, I think, is that you are already at the MLE for
the complete data, and you want to take one step to increase the likelihood
function for a dataset with an omitted case. Since there is no history of
previous iterations for the case-deleted likelihood, there isn't a clear
way to define a conjugate search direction other than the gradient.

I've often wondered whether the different optimizers always start by
searching in the direction of the gradient, and whether they do a complete
line search. I assume they do (for both questions), but we would have to
look at the code to know for sure.


@kshedden kshedden Use MixedLMParams to specify free/fixed structure
c51e4ee
@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling c51e4ee on kshedden:lme-docstring into 9ce1605 on statsmodels:master.

@josef-pkt josef-pkt changed the title from LME docstring to REF: refactor and speedup of mixed LME Sep 18, 2014
@josef-pkt josef-pkt added this to the 0.6 milestone Sep 18, 2014
@josef-pkt
Member

I'm merging this as is. (I'm a little out of this topic again and won't be able to review it in detail quickly.)

@josef-pkt josef-pkt merged commit e6f2f70 into statsmodels:master Sep 18, 2014

2 checks passed

continuous-integration/appveyor AppVeyor build succeeded
continuous-integration/travis-ci The Travis CI build passed
@josef-pkt
Member

I changed the description to better reflect what is in this PR.

@kshedden kshedden deleted the kshedden:lme-docstring branch Sep 27, 2014